Articulated Model Detection in Range Images
Jim Rodgers
Stanford University
June 14, 2006
Abstract
We present a method for detecting the placement of complex articulated models in 3D
partial view range images. Specifically, we detect the placement of the parts of an articulated
model in a scene. This is a difficult task because the object can be substantially occluded
by clutter or portions may not be observed. Our approach uses detectors to provide initial
location hypotheses for parts. We expand upon these hypotheses, using a mechanism for
dealing with missing parts. We then create a probabilistic model to determine the most likely
placement of parts based on individual and pairwise potentials. We devise efficient scoring
mechanisms to evaluate placements and provide these potentials. During a post-processing
phase we further refine our results, providing the final pose of the model in the scene. In
this thesis, I present our algorithm and the results of successfully running it on two data
sets of objects with 15 or more parts.
Acknowledgements
Professor Daphne Koller, my advisor, has been instrumental to this project, always providing
support and new ideas. She has challenged me to think in new ways, given me an
understanding of the research process, and taught me a great deal. Enormous credit also
goes to Dragomir Anguelov, who has been a mentor for this project, and under whose
guidance I was first exposed to it through the CURIS summer research program in the
Computer Science department during the summer of 2004. He was always there to answer
questions, discuss various approaches, and offer numerous suggestions. I
am also indebted to Jimmy Pang, who worked with me on this project during the summer
of 2004 and was instrumental in helping me get started in research. James Davis from UC
Santa Cruz and Stefano Corazza and Lars Mundermann from the Stanford Biomechanics
Lab were crucial in acquiring data sets. Finally, sincere appreciation goes to colleagues and
friends in the Computer Science department and elsewhere for listening to me discuss my
project and providing suggestions and support.
Contents
1 Introduction
1.1 Background and related work
1.2 Applications
1.3 Overview
2 Probabilistic Model and Algorithmic Overview
3 Phase 1: Detectors
3.1 Detector background
3.2 Spin images
3.3 Clustering, pruning and alignment
3.4 Domain enrichment
4 Phase 2A: Scoring
4.1 Individual part scores
4.1.1 Area explanation/occlusion score
4.1.2 Edge match score
4.1.3 Final part score
4.2 Inter-part scores
4.2.1 Joint spacing score
4.2.2 Angle score
4.2.3 Part intersection score
4.2.4 Final inter-part score
5 Phase 2B: Inference
5.1 Probabilistic network
5.2 Inference in network
5.3 Adding edges to network
5.4 Largest connected component
6 Phase 2C: Post-processing and Further Inference
6.1 Articulated ICP
6.2 Missing part hypothesis generation
6.3 Repeating inference
7 Experimental Results
8 Conclusions and Future Directions
Chapter 1
Introduction
Consider a complex scene, such as a person standing behind an object. The person’s body is
partly hidden by the object. There are other objects which surround the person. When we
observe this scene, we can only see it from one side at a time, inevitably leaving important
information hidden from view. The question we address is: given a 3D model of the person
(or one of the objects), how can one locate and determine the pose of the object in the scene
when the object is only partly visible?
In this thesis, we present an algorithm for reconstructing the 3D pose of an object and
demonstrate it on a 15-part human-shaped puppet and a 16-part human model. Our al-
gorithm takes as input two meshes, which are collections of 3D points connected together
to form triangles. The first is a complete 3D model of the object we wish to detect. The
model is an articulated model, meaning it is broken up into (almost) rigid parts that move
as a single unit. These parts are collections of points which move together around a joint
connecting two parts. Figure 1.1(a) shows the articulated model of the puppet, where the
(a) Articulated model
(b) Partial view
Figure 1.1: (a) An articulated object model consists of rigid parts, whose motion is constrained by joints. The goal of our algorithm is to determine the pose of the model in (b) a partial view of the scene.
rigid parts of the articulated model are the 15 parts of the puppet, that is, the head, upper
torso, arms, legs, etc. Our algorithm also takes as input a partial view of a scene including
the object as in Figure 1.1(b). The partial view is a scan of the scene from a particular view
point, which, unlike the model, is not complete and only includes the points visible on one
side. It also includes other items observed in the scene around the object. Our goal is to
take a partial view of the scene including the object in an unknown pose and determine the
placement and configuration of the model object in the scene.
This process is challenging because of the inherent limitations of the partial view. First,
the partial view only shows one side of the object, meaning that, depending on the positioning
of the object, some of its parts may be substantially occluded. Further, we must deal with
occlusion resulting from clutter in the scene, such as surfaces in front of the object. Finally,
there is the problem of missing data, which arises in regions where there is insufficient
information from the scanner to determine whether or not a surface was present.
Our specific goal is to determine the location of each part in the scene, given the model
of the object and the partial view of the scene. We accomplish this using a two-phase
approach. In the first phase, detectors provide initial hypotheses for the locations of parts.
In the second, inference in a probabilistic network is used to find the most likely assignment
of locations to parts, giving us our output: the placement of parts in the scene.
1.1 Background and related work
Pose detection is an active research field being investigated in many directions. Some ap-
proaches don’t construct a part-based model as we do, but instead connect the appearance
of an object directly with its pose (see, for example, [16], [10], and [15]). A major downside,
though, is that thousands or even millions of training examples are needed to provide enough
information to learn the relationship. Our approach avoids that problem in part through
the use of a shape-based model.
A majority of the work in pose detection has dealt with detecting 2D models in 2D data.
Work focusing on this includes [5], [7], and [17]. These approaches share some important
similarities with ours. Initial locations for parts can be provided by detectors for low-level
features in the images. They then often use a tree-shaped model over parts where adjacent
parts are connected and share constraints. However, 2D approaches suffer from a major
limitation: they depend on a particular view of an object. In 2D, a person looks completely
different from the front and the side. However, with 3D data, the difference is a question of
looking at different sides of the same 3D model of a person and doesn’t require constructing
separate models.
The pose-detection problem is also being explored in 3D, the area we focus on with our
approach. Working in three dimensions offers advantages in that it provides more information, but it also brings a new challenge: the added dimension results in a significant increase
in the search space. Large search problems (for example, the enumeration of all pairs of
adjacent part configurations done by [5]) that were once feasible in 2D are no longer possible
in 3D.
3D scanning and modeling relies on taking a scene, determining the location of points, and
then connecting the points together using triangles to form a mesh. This can be accomplished
by means of various techniques, such as laser range finders that reflect a laser beam off a
surface and measure return time to compute the distance to the surface, or by the use of
stereo vision cameras to determine the distance to points.
Articulated models like those we use are ubiquitous in this field, employed for tracking,
animation and shape detection. Another of their important features is that they can be
recovered automatically, meaning manual creation of a model is not required. With the
correlated correspondence algorithm [1], an articulated model can be recovered from scans
of an object in various poses. This allows us to produce a model broken into articulated
parts which we can then match against scenes showing it in newly observed poses.
Much of the 3D pose detection work has involved tracking from video (for example, [13]
and [14]), where a series of images over time provides additional information. In our approach
we work with a single image of a scene, consisting of 3D range data.
An important challenge that we address in our approach is dealing with clutter and
occlusion. This is a difficult problem to solve. For instance, the 3D registration we use with
the correlated correspondence algorithm [1] to build the object model would not function in
the presence of other cluttering and occluding objects. We present a mechanism for locating
and dealing with hidden and missing parts.
1.2 Applications
The range of applications for object pose detection is very broad. There are many situations
where it would be useful for a computer to automatically analyze an image of a scene and
identify the location and pose of objects. For example, to create a robot that
would interact with people and objects in the human world, the robot would need to be able
to identify objects (including people) in the world, and be able to determine the placement
of these objects in the world even when they are partially hidden from view. Imagine a
situation where a robot must be able to locate a person who happens to be sitting in a chair.
The chair is partially hidden from view by the person, and the person is partially hidden
by the chair, making it difficult to recognize. A system that can detect the pose of objects
even when occluded is necessary in order to automatically analyze the scene and determine
the placement of the person and chair observed in the scene. Other applications range from
automatic processing of surveillance video to pose detection for advanced
computer interfaces. Gavrila [6] provides a survey of more possible applications specific to
the domain of human detection.
1.3 Overview
In the following chapters, we present our algorithm and the results of testing it on two data
sets. Chapter 2 lays out the probabilistic model we use and provides a brief overview of
our algorithm. Chapters 3 through 6 cover the process in detail. In Chapter 7, we provide
images showing our results on the data sets and discuss them. We conclude in Chapter 8
with a summary of our work and discussion of future research possibilities.
Chapter 2
Probabilistic Model and Algorithmic
Overview
In this chapter, we provide an overview of the probabilistic model based on the object. This
is at the center of our approach to determining the placement of the parts of an articulated
model in a scene. For each part in the object we define a variable. The domain of each
variable consists of the possible locations for the part in the scene. We refer to each pos-
sible location as a hypothesis. For example, the variable for part a can take on the values
h^(a)_1, ..., h^(a)_k, each of which corresponds to a location for part a in the scene.
We construct a Markov random field over the parts in the object. A Markov random
field (also referred to as an MRF) is a mechanism for representing the joint probability
distribution over a set of variables, which in this case are the placements of each part. The
likelihood of a part’s placement depends on two factors: how well a particular location for a
part fits the scene, and how well a particular location for a part fits with the other parts in
the model. An MRF provides a compact way to represent the likelihood of placement of an
individual part and the pairwise likelihoods for the placement of pairs of parts. Specifically,
an MRF can be viewed as a graph where variables (in this case each of the parts) each have
their own distribution and are represented as nodes in the graph. Pairs of variables that are
jointly dependent are connected by edges in the graph, and the joint probabilities over two
variables are calculated for each edge.
Recall that in our probabilistic model, each part is represented by a variable. Potentials
model the likelihood of each hypothesis for the placement in the scene. Constraints on
placements of parts relative to each other result in pairwise potentials along edges which
are added to the MRF. Inference is used to determine the maximum likelihood placement of
parts in the scene.
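To make the structure concrete, the following is a toy two-part MRF with hand-picked potentials; the part names, hypothesis counts, and potential values are illustrative assumptions, and the brute-force MAP search merely stands in for the real inference procedure described in Chapter 5.

```python
import itertools
import numpy as np

# A toy MRF over two parts, each with two placement hypotheses.
# node_pot[p][h]: how well hypothesis h for part p fits the scene.
# edge_pot[t][h]: how compatible torso hypothesis t is with head hypothesis h.
node_pot = {"torso": np.array([0.9, 0.2]), "head": np.array([0.5, 0.6])}
edge_pot = np.array([[0.9, 0.1],   # torso hyp 0 pairs well with head hyp 0
                     [0.2, 0.8]])  # torso hyp 1 pairs well with head hyp 1

def map_assignment():
    """Brute-force MAP assignment: the joint score is the product of node
    potentials and the pairwise edge potential. Enumeration is feasible
    only for toy models; real models use message-passing inference."""
    best, best_score = None, -1.0
    for t, h in itertools.product(range(2), range(2)):
        score = node_pot["torso"][t] * node_pot["head"][h] * edge_pot[t][h]
        if score > best_score:
            best, best_score = (t, h), score
    return best, best_score
```

Here the strong individual fit of torso hypothesis 0 and its compatibility with head hypothesis 0 dominate, even though head hypothesis 1 scores slightly better on its own.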
Our algorithm has two main phases. The first phase (discussed in Chapter 3) involves
the use of detectors to provide initial hypotheses for the location of each of the individual
object parts. The detector phase matches the surface surrounding individual points on one
of the model parts to the local surface around a point in the partial view. The strongest
matches are used to provide the possible locations for parts for the next phase.
The second phase (discussed in Chapters 4 through 6) is where most of our work occurs.
It involves creating the probabilistic object model, which requires several components. The
first of these is scoring, which we present in Chapter 4. We define an energy function that
prefers placements of parts that are consistent with the observed scene and consistent with
our expectations of the object’s structure. We present a model for efficiently computing
a score for an individual part and scores for two adjacent parts. The second component,
covered in Chapter 5, involves inference in a probabilistic network. We create the network
described above and use the score to define potentials in the network. We then run inference
to determine the best joint assignment of location hypotheses to parts, thereby providing the
reconstructed articulated object model. Finally, in post-processing, described in Chapter 6,
we perform a fine-tuning alignment of the model with the scene, attempt to fill in missing
parts, and run further inference. The result is our final reconstructed object model.
Chapter 3
Phase 1: Detectors
We begin our process by finding possible locations of parts, which are used to initialize the
domains of the nodes in our probabilistic model outlined in the previous chapter. In this
chapter, we describe the low-level detectors that we use to identify where a particular part
of the articulated model may be located in the scene (similar to the approach of Ioffe and
Forsyth [7]).
3.1 Detector background
The goal of our approach is to find a placement of each part in the scene. A placement
hypothesis can be defined in terms of a transformation that rotates a part from its location
in the model and then translates it linearly to a placement in the scene. This can be
represented as a transformation matrix, and the problem can be conceived of as searching
for the best transformation matrix to match a part to a location in the scene.
Rotation of the part can be defined in terms of rotation about each of 3 different axes,
and translation can be defined by a shift in each of 3 different dimensions. Therefore, a
transformation consists of six different parameters. We are working with placements in a
continuous space, so we are essentially dealing with search in a six-dimensional continuous
space.
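As a concrete sketch, the six parameters can be assembled into a single 4x4 homogeneous transformation matrix; the rotation-axis order used here is an illustrative convention, not one the thesis specifies.

```python
import numpy as np

def rigid_transform(rx, ry, rz, tx, ty, tz):
    """Build a 4x4 homogeneous transform from three rotation angles
    (radians, about the x, y, and z axes) and a 3D translation.
    The composition order Rz*Ry*Rx is an illustrative choice."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx   # rotate the part about the model origin...
    T[:3, 3] = [tx, ty, tz]    # ...then translate it into the scene
    return T

# 90 degrees about z, then a shift of +5 in x, applied to a homogeneous point
T = rigid_transform(0, 0, np.pi / 2, 5, 0, 0)
p = np.array([1.0, 0.0, 0.0, 1.0])
moved = T @ p                  # approximately [5, 1, 0, 1]
```

Searching over the six scalar parameters of this matrix is exactly the six-dimensional continuous search the detectors help discretize.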
Exhaustive search of this space would not be feasible. The detectors help solve this
problem by providing initial discrete locations where parts may be located. We run the
detectors for each of the 15 parts to come up with hypotheses for where each of the parts
may be located in the scene.
3.2 Spin images
Our detectors are based on spin images [8], a mechanism for describing the local surface
around points on the mesh. Spin images operate by taking a specific point on the mesh
and placing a cylinder vertically along the normal to the surface at that point. The points
falling inside the cylinder are projected onto the radial axis and vertical axis of the cylinder,
and these coordinates are used to place the points into bins. The result is an n-dimensional
vector describing the frequency of points in each of the n bins. We compute spin images
for points on both the model and scene mesh, resulting in vectors that describe the surface
surrounding the points on both meshes.
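A minimal sketch of the descriptor computation follows; the cylinder size and bin count are illustrative assumptions, not the resolution actually used in the thesis.

```python
import numpy as np

def spin_image(p, n, points, radius=1.0, bins=8):
    """Spin image descriptor for the oriented point (p, n): each neighbor
    is reduced to cylindrical coordinates (alpha = radial distance from
    the normal axis, beta = signed height along the normal) and the pairs
    are binned into a 2D histogram, flattened to an n-dimensional vector."""
    d = points - p
    beta = d @ n                                  # height along the normal
    alpha = np.sqrt(np.maximum(np.einsum('ij,ij->i', d, d) - beta**2, 0.0))
    # keep only points that fall inside the cylinder
    keep = (alpha <= radius) & (np.abs(beta) <= radius)
    hist, _, _ = np.histogram2d(alpha[keep], beta[keep], bins=bins,
                                range=[[0, radius], [-radius, radius]])
    return hist.ravel()                           # descriptor vector
```

The same routine run at points on the model mesh and the scene mesh yields comparable vectors describing the local surface at each point.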
Our goal is to find points on the scene mesh where the surrounding surface is similar
to points on the model mesh. Therefore, we take points in the scene mesh and, using a
nearest neighbor search structure, find the closest matches (based on L2 distance of the
n-dimensional descriptive vector) to points in the model. Using these matches, we are able
to create candidate correspondences between model and scene points.
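The matching step can be sketched with a brute-force L2 search; in practice a nearest neighbor search structure (such as a KD-tree) replaces the pairwise distance computation.

```python
import numpy as np

def match_descriptors(scene_desc, model_desc, k=1):
    """For each scene spin-image vector, return the indices and L2
    distances of the k closest model vectors (brute force sketch)."""
    diff = scene_desc[:, None, :] - model_desc[None, :, :]
    d2 = (diff ** 2).sum(-1)                      # scene-by-model distances
    idx = np.argsort(d2, axis=1)[:, :k]
    return idx, np.sqrt(np.take_along_axis(d2, idx, axis=1))
```

Each (scene point, matched model point) pair becomes a candidate correspondence for the clustering stage that follows.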
3.3 Clustering, pruning and alignment
Now that we have candidate correspondences between the object model and the scene, we
wish to find placement hypotheses. For each model part, we collect the highest scoring
correspondences (based on L2 distance described above) between points on the part and
points in the scene. We cluster these using a method similar to the approach of Johnson and
Hebert [9].
Specifically, we sort all correspondences from smallest distance to largest distance. We
begin with the first correspondence, which we denote as c. For all other correspondences
i for which the distance between model points and the distance between scene points is
within a distance threshold d, we calculate a correspondence score with c. For this score,
we separately compute, for the model and the scene, the vector from the spin-image origin
to where the point for correspondence i falls in the spin image for correspondence c. For
a good match, they should be close to the same value, indicating that both the model and
scene points of i are in a similar orientation relative to those of c. We compute the difference
between these directional vectors, and if it is less than a threshold t, correspondence i is
added to the cluster for correspondence c. We repeat this process, with any correspondences
that have already been placed in a cluster removed from the list, until no new clusters of a
minimum size s can be formed.
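A simplified sketch of the greedy clustering loop follows. The consistency test here compares only the model-side and scene-side point separations, a simplification of the spin-image-coordinate comparison described above, and the thresholds are illustrative assumptions.

```python
import numpy as np

def cluster_correspondences(corrs, model_pts, scene_pts,
                            d_thresh=0.2, min_size=2):
    """Greedy geometric-consistency clustering. corrs is a list of
    (model_index, scene_index, descriptor_distance) tuples; each seed
    absorbs correspondences whose model-side and scene-side separations
    from the seed roughly agree."""
    corrs = sorted(corrs, key=lambda c: c[2])   # best descriptor match first
    clusters, used = [], set()
    for s, seed in enumerate(corrs):
        if s in used:
            continue
        cluster = [s]
        for i, c in enumerate(corrs):
            if i == s or i in used:
                continue
            dm = np.linalg.norm(model_pts[seed[0]] - model_pts[c[0]])
            ds = np.linalg.norm(scene_pts[seed[1]] - scene_pts[c[1]])
            if abs(dm - ds) < d_thresh:         # separations agree: consistent
                cluster.append(i)
        if len(cluster) >= min_size:
            used.update(cluster)
            clusters.append([corrs[i] for i in cluster])
    return clusters
```

A rigidly moved copy of a part produces correspondences that are all mutually consistent, so they collapse into a single cluster.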
Once we have established clusters, we sort them based on a cluster score equal to the
average distance between the spin-image descriptive vectors for each of the correspondences
in the cluster. For the top m clusters, we now compute a placement hypothesis for the
part. This is done using the iterative closest point (ICP) algorithm [2], which solves for a
transformation matrix to be applied to the part that minimizes the distance between each
of model-scene correspondences.
If there were many spin image matches between the same region of the model and scene,
then after generating hypotheses from each cluster, we will have several hypotheses that are
very similar. However, we want to limit the number of hypotheses to reduce the search space
while simultaneously forcing hypotheses to be spread out in order to explore the complete
scene. Therefore, we prune similar hypotheses by determining how close to the same location
different hypotheses place the part. With the hypotheses sorted by their cluster score, we
compute the difference between the top hypothesis T and all other hypotheses S:
d(S, T) = Σ_{j ∈ joint points} ‖T(j) − S(j)‖
That is, the score is the sum of the difference in placement of each joint point j when
transformed by hypotheses S and T. A lower score indicates that the two hypotheses place
the part in nearly the same location, since the joints remain in similar places. To eliminate similar
hypotheses, we discard hypotheses S when d(S, T ) is less than a minimum distance dmin.
We repeat this process with hypothesis T now equal to the next highest remaining hypothesis
Figure 3.1: Top scoring placement hypotheses for puppet upper torso resulting from the detector phase.
on the list and continue until we have compared all pairs of hypotheses to ensure a minimum
separation distance.
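The pruning loop can be sketched directly from the definition of d(S, T); hypotheses are assumed sorted best first, and d_min is an illustrative threshold.

```python
import numpy as np

def prune_hypotheses(hyps, joints, d_min=1.0):
    """Keep hypotheses (4x4 transforms, sorted best first) whose joint
    points end up at least d_min (summed over joints) away from every
    hypothesis already kept; nearer ones are discarded as duplicates."""
    def apply(T, pts):
        return pts @ T[:3, :3].T + T[:3, 3]
    kept = []
    for S in hyps:
        sj = apply(S, joints)
        if all(np.linalg.norm(apply(T, joints) - sj, axis=1).sum() >= d_min
               for T in kept):
            kept.append(S)
    return kept
```

A duplicate of an already-kept hypothesis moves no joint at all, so it is dropped, while a well-separated placement survives.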
Once we have the remaining hypotheses, we improve their alignment with the scene
surface using an ICP alignment method in which points on the part transformed by the initial
hypothesis are matched to nearby points in the scene. We then solve for the transformation
to minimize the distance between the pairs of points, thereby bringing the part into closer
alignment with the surface. The process is repeated, each time moving the part into a slightly
better alignment. After running ICP on all the hypotheses for a part a, the result is a set of
hypotheses h^(a)_1, ..., h^(a)_k for that part, similar to those shown in Figure 3.1.
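The refinement loop can be sketched as a minimal ICP using the standard SVD (Kabsch) solution for the rigid transform; real implementations add correspondence rejection and convergence checks, and the brute-force nearest neighbor step would use a search structure.

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Least-squares rotation and translation aligning src to dst
    (Kabsch algorithm via SVD of the cross-covariance matrix)."""
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cd - R @ cs

def icp(part_pts, scene_pts, iters=10):
    """Alternate nearest-neighbor matching and rigid alignment, each
    iteration moving the part slightly closer to the scene surface."""
    pts = part_pts.copy()
    for _ in range(iters):
        nn = scene_pts[((pts[:, None] - scene_pts[None]) ** 2)
                       .sum(-1).argmin(1)]      # closest scene point each
        R, t = best_rigid_transform(pts, nn)
        pts = pts @ R.T + t
    return pts
```

When the scene is simply a shifted copy of the part and the points are well separated, the very first iteration recovers the translation exactly.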
3.4 Domain enrichment
At this point, we have placement hypotheses for most or all parts, but a major problem
remains. We may be missing good hypotheses for some parts. For example, in the partial
view shown in Figure 1.1(b), the lower torso is almost completely obscured, making it im-
possible to obtain a correct hypothesis using spin image surface matching. We therefore add
hypotheses to the domains of parts in two ways.
First, we generate hypotheses for a part from the hypotheses of an adjacent part in the model.
We consider each of the joints around a part a transformed by a hypothesis. In the region
around the joint, we identify the top spin image match between the scene surface and the
surface of a part b which is connected to a by the joint. We use this spin image match
to generate a hypothesis for the placement of b. Specifically, we use ICP to determine
a hypothesis transformation matrix for b that minimizes the distance between 3 pairs of
points: (1) the joint in the already placed part a and the joint in part b, (2) the spin image
point in the scene and point on the part, and (3) a point along the normal to the spin image
match in the scene and a point an equal distance along the normal to the spin image match
in the model. This provides a hypothesis for b consistent with the joint in a and at least one
patch of the surface around the joint. In effect, we have used a hypothesis for a to generate
a hypothesis for its neighbor b. This approach allows us to enrich the domains of parts in
instances where spin images may not provide a cluster of matches needed for a hypothesis
in the initial detector phase.
The second mechanism we use fills in missing parts between two parts. We take hy-
potheses for two parts attached to different joints of a third part and use them to generate a
hypothesis placing the third part between them. We generate hypotheses for a part c based
on two distinct neighboring parts a and b, with hypotheses i and j respectively. Specifi-
cally, for each distinct pair (i, j) we generate a new hypothesis. We use ICP to enforce the
constraints that the joint shared between c and a must match the location for part a under
hypothesis i and, similarly, the joint shared between c and b must match the location for
part b under hypothesis j. Since ICP with only two points is under-constrained, we add
weaker placement constraints between points around the joint on the border between the
neighboring parts. That is, we add constraints that points on c that border points on a in
the model must be close to those points on a in the scene. We do the same for part b. Using
the ICP algorithm, we solve for a transformation that gives a hypothesis k for c that respects
the neighboring joints and the other neighboring points to the greatest extent possible. As
an example, this approach allows us to add a hypothesis for the occluded lower torso in the
puppet scene based on our hypotheses for the upper torso and one of the upper legs, which
are neighbors to the lower torso. It is worth noting that many of the hypotheses produced
with this method may be poor since there is no guarantee that i and j are consistent with
each other. For example, i and j may place a and b very far apart, which would result in a
poor hypothesis for c. We can address this by pruning newly generated hypotheses where the
joint spacing is too great or too small to be consistent with a part c actually being between
the locations determined by i and j.
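This joint-spacing pruning can be sketched as a simple interval test; the tolerance value is an illustrative assumption.

```python
import numpy as np

def plausible_between(joint_a, joint_b, rest_length, tol=0.3):
    """Cheap consistency filter for the fill-in mechanism: the two anchor
    joints (from hypotheses i and j) must be spaced roughly as far apart
    as part c's rest-pose joint separation, within a relative tolerance."""
    gap = np.linalg.norm(joint_a - joint_b)
    return (1 - tol) * rest_length <= gap <= (1 + tol) * rest_length
```

Hypothesis pairs whose joints are far too close or too distant are rejected before the expensive ICP fit is attempted.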
After these two enrichment steps, we now have the hypotheses h^(p)_1, ..., h^(p)_k for each part p necessary for use in the next phase.
Chapter 4
Phase 2A: Scoring
Now that we have a process for identifying possible locations of parts (described in Chap-
ter 3), we need a mechanism to determine whether a possible placement is plausible. Scoring
allows us to quantitatively assess the quality of a location hypothesis and favor those that
are better in our probabilistic model.
There are two different factors that must be taken into account when scoring a hypothesis.
The first has to deal with individual parts and reflects how well the hypothesis matches the
observed scene, or in other words, how well the part surface fits the partial view surface. For
example, the puppet arm should ideally be placed flush with the partial view scene surface,
not sticking out into space in front of it.
The other factor has to do with how well two parts fit relative to each other and results
in an inter-part score. Because we are constructing a skeleton, each part cannot be placed
individually; rather, the placement of parts in relation to one another must be considered.
The second part of the score takes this into account by enforcing requirements that adjacent
parts be placed in compatible locations. For example, the head of the puppet must be placed
close to the torso, rather than separated by a significant distance.
4.1 Individual part scores
The individual part score is divided into two components: an area explanation/occlusion
score and an edge match score.
4.1.1 Area explanation/occlusion score
The area explanation/occlusion score does two things. First, it favors placements of parts
that align with the surface of the partial view, thereby explaining area. Second, it penal-
izes parts that protrude in front of the observed surface, thereby occluding portions of the
observed scene.
Figure 4.1(a) shows an example of a good scoring placement of an upper leg. (This
hypothesis places the leg so close to the partial view surface that the grey of the partial
view mixes with the green of the part.) This configuration
scores well because the part surface matches up with the partial view surface, meaning it
explains the area well. On the other hand, Figure 4.1(b) shows an example of a leg placement
that does not score well under our scoring function. There are a few points where the leg
matches with the surface of the arm, but for the most part it is sticking out into space and
occluding the rest of the partial view.
(a) Good scoring placement (b) Bad scoring placement
Figure 4.1: Area explanation/occlusion score prefers placements of parts that are well aligned with the surface as opposed to those that occlude surface.
Scoring details
For scoring individual parts, then, we wish to determine how well a candidate placement
matches the observed surface. To do this, we discretize the surface and then consider the
path from the observation point through points on the surface of the part. For each path
through a part surface point, we determine where along that path the partial view surface
lies. This is useful, because, for our model, we need to determine the consistency of the
partial view surface with the part surface. For partial view points close to the surface,
we model the likelihood as a Gaussian distribution with mean 0 and a standard deviation
σa. This yields a maximum likelihood when the observed partial view surface is exactly
aligned with the part. The likelihood decreases the farther away the partial view surface
is from the part. For a partial view surface that lies relatively far in front of the part, we
assign it a moderate uniform likelihood α because there is a reasonably high likelihood of
the part being behind some partial view surface that occludes it. However, for a partial view
surface observed a significant distance behind the candidate part, we assign it a uniform low
likelihood β because it is unlikely that surface would be observed behind the part since the
part surface would have occluded whatever is behind it.
Finally, we have to deal with points of the part that fall in regions where there is no
partial view surface in the data. Due to the limitations of the scanner, some surfaces in
the partial view may not be registered. Regions where there is no data must, therefore, be
treated as unknown areas; we cannot assume the presence or absence of surface. Thus, we
assign a moderate uniform likelihood γ to part points that fall within that region because
there is a reasonably high likelihood of observing no partial view surface even when a part
is placed there.
In summary, we assign scores for individual points as follows, where x is the distance
between partial view and part surface, α and γ are moderately high uniform likelihoods, and
β is a low uniform likelihood:
likelihood_p =
  max(α, exp(−x² / (2σ_a²)))   for points where the partial view surface is in front of the part
  max(β, exp(−x² / (2σ_a²)))   for points where the partial view surface is behind the part
  γ                            for points where no partial view surface is observed
The total score of a part is then the product over all points of the per-point likelihood
weighted by the area of the point:1

area-score = ∏_p likelihood_p^(area_p)
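The per-point likelihood and the area-weighted product can be sketched as follows; the parameter values are illustrative assumptions, not the thesis's tuned constants.

```python
import numpy as np

def point_likelihood(x, observed, alpha=0.3, beta=0.01, gamma=0.3, sigma=0.05):
    """Per-point likelihood for the area explanation/occlusion score.
    x is the signed distance from the part surface to the partial-view
    surface along the view ray (positive = view surface in front of the
    part); observed is False where the scanner recorded no surface."""
    if not observed:
        return gamma                      # no data: unknown region
    g = np.exp(-x**2 / (2 * sigma**2))    # Gaussian surface-match term
    floor = alpha if x > 0 else beta      # occluder in front vs. behind
    return max(floor, g)

def area_score(points):
    """Product of per-point likelihoods, with the area each discretized
    point represents applied as an exponent (a per-point weight)."""
    score = 1.0
    for x, observed, area in points:
        score *= point_likelihood(x, observed) ** area
    return score
```

A surface exactly on the part contributes likelihood 1; a surface far in front falls back to the moderate floor α, while one far behind falls to the low floor β.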
Scoring Implementation
Scoring is implemented using OpenGL. This allows for very efficient computation of the area
explanation/occlusion scores because OpenGL is well-suited to rapid rendering of 3D points.
The partial view is initially rendered in OpenGL as seen from the viewpoint. Then for each
part we wish to score, we render it separately. Since we are dealing with graphics rendering,
we can easily discretize it based on pixels and compare Z-buffer values between the part and
partial view to determine where the two surfaces lie in relation to each other along the path
from the partial view to the part. Using this approach, we are able to leverage hardware
accelerated 3D rendering in the graphics card to score several hundred part hypotheses per
second.
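The Z-buffer comparison can be sketched as follows; here plain lists of per-pixel depths stand in for the rendered buffers, with None marking pixels where no surface was rendered (an assumption for illustration):

```python
def classify_pixels(scene_depth, part_depth, eps=1e-3):
    """Compare partial-view depth against part depth at each pixel.
    Returns (relation, |depth difference|) pairs mirroring the three
    cases of the area score."""
    result = []
    for zs, zp in zip(scene_depth, part_depth):
        if zp is None:
            continue                      # part not rendered at this pixel
        if zs is None:
            result.append(('none', 0.0))  # no partial-view surface observed
        elif zs <= zp + eps:
            result.append(('front', abs(zs - zp)))
        else:
            result.append(('behind', abs(zs - zp)))
    return result
```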
4.1.2 Edge match score
When the partial view scene is viewed from the observation point, discontinuities in surface
depth from edges appear. The edge match score reflects the idea that these edges should
be aligned with the edges of parts. A configuration where the edges of a part match the
observed scene edges has a higher likelihood, because it is more likely that the edges result
from the part's placement there.

¹Weighting by the area of the point is necessary because the discretization is based on pixels. Pixels on surfaces closer to the viewpoint represent less surface than pixels on surfaces farther from the viewpoint. Since the score is based on the area of the part, we want it to be weighted appropriately based on actual part size and not its location relative to the viewpoint.
Scoring details
We discretize the surface into points (based on pixels, as with the area score) and form
a matrix of distances to each of the points. We then apply the well-known Canny edge
detection algorithm [3] to the matrix to identify major discontinuities that form edges. We
do the same thing for an individual part we wish to score, identifying its edges.
For each point along the edge of the part, we wish to model the likelihood that the
observed edge results from the part’s placement at that location. We match each part edge
point with the nearest edge point in the partial view, and then model the likelihood as a
Gaussian distribution with mean 0 and standard deviation σ_e. Thus, an edge that exactly
aligns with the part edge has a very high likelihood of having been generated by the part, and
the likelihood decreases the farther away it gets. Because of the problem of unknown data
in the partial view, we cannot definitively conclude from the absence of an edge that there
should not actually have been one, so we assign a uniform probability α at distances far
from the part edge, representing the idea that a part edge could actually generate a reading
of no edge.
For an individual point along the edge:

likelihood_p = max(α, exp(−x² / 2σ_e²))

where x is the distance between the partial view edge and the part edge.
To obtain the score for the entire part, we multiply the likelihoods of all points, each
weighted by the segment edge length:²

edge-match-score = ∏_p likelihood_p^(length_p)
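A brute-force sketch of this computation (the actual system obtains both edge sets from Canny on rendered Z-buffers; the constants here are illustrative placeholders):

```python
import math

SIGMA_E = 2.0    # illustrative standard deviation
ALPHA_E = 0.05   # illustrative uniform floor for unmatched edges

def edge_match_score(part_edge, view_edge, lengths):
    """part_edge, view_edge: lists of (x, y) edge-pixel coordinates;
    lengths: per-point edge-segment lengths for the part edge points.
    Each part edge point is matched to its nearest view edge point."""
    log_score = 0.0
    for (px, py), seg_len in zip(part_edge, lengths):
        # brute-force nearest neighbour (a k-d tree would scale better)
        d = min(math.hypot(px - qx, py - qy) for (qx, qy) in view_edge)
        lik = max(ALPHA_E, math.exp(-d * d / (2 * SIGMA_E ** 2)))
        log_score += seg_len * math.log(lik)
    return math.exp(log_score)
```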
Scoring implementation
OpenGL is again used to allow fast computation of the edge score. Using OpenGL, we
render the partial view surface to a buffer, and use the Z-buffer values to compute edges
with the Canny edge detection algorithm. The previously performed discretization into pixels is used
as the basis for breaking up the edges into points. The same process is used for obtaining
edges of each of the part hypotheses being scored.
4.1.3 Final part score
The final likelihood score of the part is obtained by multiplying together the two constituent
scores:
part-score = area-score ∗ edge-match-score
²As with the area score, discretization is based on pixels. Since a pixel farther away represents an edge segment of greater length, we need to weight by the length of that segment so that scores are comparable between placements of parts independent of their distance from the viewpoint.
4.2 Inter-part scores
The inter-part score enforces compatible placements of parts relative to each other, and has
three components. The first two apply to pairs of adjacent parts, while the third applies to
any pair of parts.
4.2.1 Joint spacing score
Two parts can fit the surface and match the edges of the partial view very well, but the
placement of the two parts may not be compatible. For example, the upper left arm may
fit well off to the right side of the torso, or even in the place of one of the legs. The joint
spacing score allows us to penalize placements such as those and to favor placements of
adjacent parts that are close together.
Every two adjacent parts share a single point between them, defined as a joint. In an
ideal placement, this shared point would be in the same location for each part. However, in
a poor placement, the point that should be shared is in one location in the scene in one part
and a significant distance away in the other part.
For two parts a and b, having hypotheses i and j respectively, and sharing a joint (a, b),
we define v^b_{a,i} to be the location in 3D space of the joint point for part a under
hypothesis i, and v^a_{b,j} to be the location of the joint point for part b under hypothesis j.
We can then compute the distance between them, which we denote x_{(a,i)(b,j)}:

x_{(a,i)(b,j)} = ‖v^b_{a,i} − v^a_{b,j}‖
The joint spacing score is then modeled as a Gaussian distribution with mean 0 and
standard deviation σ_j:

joint-spacing-score = exp(−x_{(a,i)(b,j)}² / 2σ_j²)
A distance of 0 would receive the maximum score, and the score decreases as the distance
increases.
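This score reduces to a few lines; σ_j here is an illustrative value:

```python
import math

SIGMA_J = 1.0  # illustrative standard deviation

def joint_spacing_score(joint_a, joint_b):
    """joint_a, joint_b: 3D locations of the shared joint point implied by
    the two hypotheses; returns the zero-mean Gaussian score of their distance."""
    x = math.dist(joint_a, joint_b)
    return math.exp(-x * x / (2 * SIGMA_J ** 2))
```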
4.2.2 Angle score
Even if two parts have a shared joint that is placed consistently, they can be twisted at
large angles that are actually impossible to achieve with the model. The angle score favors
placements where the angle between two parts is closer to the angle in the model. We
introduce a prior belief over the angle between adjacent parts, with a preference for smaller
angle differences from the model. This helps to eliminate configurations where the object
could be twisted at angles consistent with the data but not with the restrictions of the joint.
For the purposes of the angle score, a part placement can be described by a 3D vector
pointing from the centroid of the part to the joint. For a joint (a, b) between parts a and b,
we compute these vectors: w^b_{a,i} for part a with hypothesis i and w^a_{b,j} for part b with
hypothesis j. Using the dot product, we obtain a scalar value for the angle between them,
denoted θ_{(a,i)(b,j)}. For the model we also compute the vectors w^b_{a,∗} for part a with
its placement ∗ in the model and w^a_{b,∗} for part b with placement ∗. We can then compute
the scalar angle θ_{(a,∗)(b,∗)} for the joint (a, b) in the model. The angle score is based on
the difference d = |θ_{(a,i)(b,j)} − θ_{(a,∗)(b,∗)}|. Because we wish to allow a reasonable
amount of movement, we only penalize angle differences above a certain threshold t. For
angle differences smaller than the threshold, the score is 1. For larger angle differences, we
score using a Gaussian distribution with mean 0 and standard deviation σ_l:

angle-score = exp(−(d − t)² / 2σ_l²)
A difference less than or equal to t receives the maximum score of 1. The score decreases
as the difference between the angle in the model and the angle of the candidate placements
increases.
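A sketch of the angle computation and the thresholded score; t and σ_l are illustrative values:

```python
import math

SIGMA_L = 0.3        # illustrative standard deviation (radians)
THRESHOLD_T = 0.5    # illustrative free-movement threshold (radians)

def angle_between(u, v):
    """Scalar angle between two 3D vectors via the dot product."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def angle_score(theta_hyp, theta_model):
    """Score 1 within the threshold t; Gaussian fall-off beyond it."""
    d = abs(theta_hyp - theta_model)
    if d <= THRESHOLD_T:
        return 1.0
    return math.exp(-(d - THRESHOLD_T) ** 2 / (2 * SIGMA_L ** 2))
```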
4.2.3 Part intersection score
The other requirement we wish to enforce between two parts is that they do not overlap in
space. For example, the elbow joint could be well-placed between two part hypotheses, but
the lower arm could be folded back on top of the upper arm. This overlap problem could also
exist with two non-adjacent parts, where, for example, the lower arm is placed intersecting
the torso. Therefore, we use the volume intersection score to favor placements of two parts
where they do not overlap with each other and to penalize part placements that do.
The part intersection score is based on the amount of overlapping volume between two
parts. Parts are discretized using a 3D grid. We then determine which elements in the grid
are inside the part surface. For each of these points inside a part i we consider whether the
point lies inside another part j, and sum the total number of intersecting points, which we
denote as x.
We then derive a score by modeling the intersection as a Gaussian distribution with mean
0 and standard deviation σ_v:
part-intersection-score = exp(−x² / 2σ_v²)
As the amount of volume intersection increases, the score for the part placements decreases.
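With parts discretized into sets of occupied grid cells, this score could be sketched as follows (σ_v is an illustrative value):

```python
import math

SIGMA_V = 10.0  # illustrative standard deviation (in grid cells)

def part_intersection_score(cells_i, cells_j):
    """cells_i, cells_j: sets of occupied 3D grid cells for two parts;
    x counts the cells that fall inside both parts."""
    x = len(set(cells_i) & set(cells_j))
    return math.exp(-x * x / (2 * SIGMA_V ** 2))
```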
4.2.4 Final inter-part score
The inter-part likelihood score for adjacent parts is obtained by multiplying the three component scores together:
inter-part-score = joint-spacing-score ∗ angle-score ∗ part-intersection-score
Since the first two components apply only to adjacent parts, the score for non-adjacent parts
consists of only the third one:
inter-part-score = part-intersection-score
Chapter 5
Phase 2B: Inference
Thus far, we have described a process for obtaining a series of location hypotheses for each
of the parts (Chapter 3) and a mechanism for scoring individual part hypotheses and pairs
of hypotheses for different parts (Chapter 4). We now require a mechanism for assigning a
location hypothesis to each part. We do this via inference in a probabilistic network, which
we discuss in this chapter.
5.1 Probabilistic network
As outlined in Chapter 2, we define a probabilistic network with a variable representing each
part’s placement and with edges connecting nodes (parts) that are adjacent in the structure
of the skeleton.
The domains for each of the nodes are the different part location hypotheses for that
part. A part a is defined to have possible hypotheses h^(a)_1, . . . , h^(a)_k, where each
hypothesis corresponds to a location in the image. We choose the potential to represent the likelihood of
the part being located there. For each hypothesis i for part a we define a singleton potential
using the individual part scores defined in Section 4.1:
φ(h^(a)_i) = part-score(h^(a)_i) = area-score(h^(a)_i) ∗ edge-match-score(h^(a)_i)
Next, we add edges between all pairs of adjacent parts in the articulated model skeleton.
The pairwise potential for these edges models the joint likelihood of two parts a and b taking
on particular values in their domains. This potential is modeled on the inter-part scores
described in Section 4.2, and it is used to define a preference for two adjacent parts being
placed close together, for two adjacent parts being at an angle relative to each other that
is similar to the angle in the model, and for two adjacent parts not overlapping. Specifically,
we create a potential ψ(h^(a)_i, h^(b)_j) representing the likelihood that part a takes on
hypothesis i and part b takes on hypothesis j. We define the potential as:

ψ(h^(a)_i, h^(b)_j) = joint-spacing-score(h^(a)_i, h^(b)_j) ∗ angle-score(h^(a)_i, h^(b)_j) ∗ part-intersection-score(h^(a)_i, h^(b)_j)
5.2 Inference in network
With the probabilistic network defined, we run a form of inference known as loopy belief
propagation¹ [12, 11] to find the maximum likelihood assignment of part hypotheses to
locations. At this point, the most likely location hypothesis for each part can be used as the
location that is output for each part.
The relative strengths of the potentials, which define the trade-off between individual
part placement relative to the partial view and placement of parts relative to each other,
can be controlled by adjusting the values of the standard deviations σ_a, σ_e, σ_j, σ_l, and σ_v.
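A compact sketch of max-product belief propagation on such a pairwise network (a generic illustration, not the thesis implementation; potentials are given as plain dictionaries, and message normalization, which real implementations use to avoid underflow, is omitted):

```python
import itertools

def loopy_max_product(domains, phi, psi, edges, iters=20):
    """domains: {part: [hypotheses]}; phi: {part: {hyp: score}};
    psi: {(a, b): {(hyp_a, hyp_b): score}}, one entry per undirected edge.
    Returns an (approximate) most likely hypothesis for each part."""
    nbrs = {p: [] for p in domains}
    for a, b in edges:
        nbrs[a].append(b)
        nbrs[b].append(a)

    def pair(a, b, ha, hb):
        # look up the pairwise potential regardless of edge orientation
        return psi[(a, b)][(ha, hb)] if (a, b) in psi else psi[(b, a)][(hb, ha)]

    # messages msg[(src, dst)][hypothesis of dst], initialised uniformly
    msg = {(a, b): {h: 1.0 for h in domains[b]}
           for a, b in itertools.chain(edges, [(y, x) for x, y in edges])}
    for _ in range(iters):
        new = {}
        for src, dst in msg:
            new[(src, dst)] = {}
            for hd in domains[dst]:
                best = 0.0
                for hs in domains[src]:
                    v = phi[src][hs] * pair(src, dst, hs, hd)
                    for other in nbrs[src]:
                        if other != dst:
                            v *= msg[(other, src)][hs]
                    best = max(best, v)
                new[(src, dst)][hd] = best
        msg = new

    # read off the max-marginal belief for each part
    assign = {}
    for p in domains:
        belief = {h: phi[p][h] for h in domains[p]}
        for other in nbrs[p]:
            for h in domains[p]:
                belief[h] *= msg[(other, p)][h]
        assign[p] = max(belief, key=belief.get)
    return assign
```

On a tree-structured network this converges to the exact maximum likelihood assignment; with loops it remains an approximation.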
5.3 Adding edges to network
After running inference on one of the puppet scenes, we obtain the results shown in
Figure 5.1. One of the first problems we note is that parts are overlapping. This is because
our probabilistic network only has edges enforcing part non-intersection between adjacent
parts. In reality, no two parts should intersect, regardless of whether they are adjacent.
The solution is to add edges between non-adjacent parts that enforce the non-intersection
constraints. We define the potential for these edges in terms of only the part-intersection
score since there is no shared fixed joint with which to calculate a joint spacing score:
ψ(h^(a)_i, h^(b)_j) = part-intersection-score(h^(a)_i, h^(b)_j)
¹Our model, as defined thus far, is actually a tree structure and has no loops. Therefore, an inference algorithm that can handle loops is unnecessary. However, in Section 5.3 we describe how edges are added that lead to loops in the graph and for which loopy belief propagation is necessary.
Figure 5.1: Result of inference in the initial model. Here, non-adjacent parts overlap significantly because edges are not added to constrain their placement.
This potential models the likelihood that part a takes on hypothesis i and part b takes on
hypothesis j.
However, it is not feasible to add edges between all non-adjacent parts because that would
result in a fully-connected graph, which would make inference computationally infeasible.
Further, it is unlikely that most non-adjacent parts would intersect, even without the non-
intersection constraint. For example, the foot would most likely not be placed in the same
location as the head because the surfaces would not match well. Therefore, we run inference
in the network and examine the highest scoring placements for each part. We take all pairs,
and if any intersect more than a certain threshold percentage, we add an edge representing
the non-intersection constraint between the two parts. We then re-run inference in the
network, and repeat this process until no new edges are added. As a result of adding edges
between non-adjacent parts, the network is no longer a tree and loops exist in it. It is for
this reason that we chose loopy belief propagation as our inference mechanism, since it is
well-suited to networks with loops.

Figure 5.2: Result of inference in the model with additional edges. After first running inference, the legs overlapped. However, the legs no longer overlap, because we have added an additional edge to the network to enforce non-intersection of those parts.
Employing this mechanism, we are able to improve on the results shown in Figure 5.1,
resolving the problem of the intersecting legs, which produces the results in Figure 5.2.
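The iterative edge-addition loop can be sketched as follows; `run_inference` and `intersection_fraction` are hypothetical stand-ins for the belief propagation run and the volume-overlap check, and the threshold value is illustrative:

```python
import itertools

def add_intersection_edges(parts, adjacent, run_inference,
                           intersection_fraction, threshold=0.05):
    """Repeatedly run inference, then add a non-intersection edge for any
    pair of parts whose best placements overlap more than the threshold,
    until no new edges appear."""
    edges = set(adjacent)
    while True:
        placement = run_inference(edges)          # best hypothesis per part
        added = False
        for a, b in itertools.combinations(parts, 2):
            if (a, b) in edges or (b, a) in edges:
                continue
            if intersection_fraction(placement[a], placement[b]) > threshold:
                edges.add((a, b))
                added = True
        if not added:
            return edges, placement
```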
5.4 Largest connected component
There is still one major limitation of this probabilistic model: it requires that every part of
the model be placed somewhere in the scene. However, in scenes with heavy occlusion this
is not necessarily practical or even desirable. Consider a scene with heavy occlusion, such
as that shown in Figure 5.2. Here, some of the limbs of the puppet are completely hidden
from view. Because none of their surface is visible, there is no data from which to obtain
placement hypotheses consistent with the rest of the puppet parts. Even the mechanisms for
suggesting part placements based on neighbors (described in Section 3.4) cannot provide good
hypotheses with so much missing information. Thus, we will not have any good placement
candidates in the domains of those parts, and the only hypotheses we have for those parts
would match other surfaces, thereby placing the hidden parts in locations inconsistent with
the other parts. Therefore, we relax the constraint that all parts must be placed in the scene,
choosing instead to obtain the largest good-scoring connected component.
In practice, we allow parts to be missing by including a "part-missing" hypothesis in the
domain of each part. In our model we designate a root part; we allow any part or parts to
be missing, with the additional constraint that if a part on a path from the root is missing,
all parts farther down that path must also be missing. Since we have a tree structure in our
puppet and human skeletons, the result is the largest connected component that encompasses
the root part. In the case of the puppet and human, the upper torso serves as a good root
part.²
²In practice, the upper torso is a good choice of root part because if it is present in the scene, it is generally easy to match because of its large area and distinctiveness. If it is not present and we are unable to suggest a location for it based on either of the arms, the head, or the lower torso, then most of the object is probably not visible in the scene, and the likelihood of successfully obtaining any of it is low.
Selecting the upper torso as the root part for the puppet and human does not diminish the generality of our algorithm. For objects that don't have a clear choice for a good root part, multiple root parts could be considered using one of two different mechanisms. The first, and simplest, approach is to build the probabilistic network with each part in turn designated the root and re-run inference. Each time we would obtain an assignment of parts to locations, and the assignment with the highest score would be output as the result. Alternatively, we could avoid building the probabilistic model and running inference multiple times by constructing a richer model. We could add additional variables (and corresponding pairwise constraints) that would encode both which part is designated as the root part and the presence and absence of parts.
The score associated with this part-missing hypothesis is neutral, so that it does not
score better than a part that explains area or edge, but does score better than a part that
occludes surface it should not. We define the individual part potential for the part-missing
hypothesis with a uniform neutral score δ as:
φ(h^(a)_null) = δ
The pairwise scores between a missing part and its present neighbor are also neutral:
they do not impose joint spacing or part-intersection penalties. We modify the pairwise
potential to enforce the requirement that a part farther down the tree from a missing part
is not present. The pairwise potential is as follows for a part b below a part a in the tree,
where i and j are real placement hypotheses and null is the part-missing hypothesis, and
where ζ is a uniform neutral score:
ψ(h^(a)_i, h^(b)_j) = joint-spacing-score(h^(a)_i, h^(b)_j) ∗ angle-score(h^(a)_i, h^(b)_j) ∗ part-intersection-score(h^(a)_i, h^(b)_j)

ψ(h^(a)_null, h^(b)_j) = 0

ψ(h^(a)_null, h^(b)_null) = 1

ψ(h^(a)_i, h^(b)_null) = ζ
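The four cases of this potential can be written as a small dispatch function (a sketch; ζ is an illustrative neutral value):

```python
NULL = 'part-missing'
ZETA = 0.5   # illustrative neutral score

def pairwise_potential(hyp_a, hyp_b, inter_part_score):
    """Pairwise potential for part b directly below part a in the tree;
    inter_part_score(i, j) combines the joint-spacing, angle, and
    part-intersection scores for two real placements."""
    if hyp_a == NULL and hyp_b == NULL:
        return 1.0           # both missing: consistent
    if hyp_a == NULL:
        return 0.0           # parent missing, child placed: forbidden
    if hyp_b == NULL:
        return ZETA          # parent placed, child missing: neutral
    return inter_part_score(hyp_a, hyp_b)
```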
We then run inference in the enhanced network to obtain our result.
This creates a model in which a part will always be added to the network as long as its
addition (or the addition of it and some of the parts beneath it in the tree) has a positive
effect on the likelihood of the resulting configuration. That is, a part is added when the area
and edge it matches outweighs the penalty from non-matching area, joint-spacing mismatch,
angle mismatch, and part-intersection. This is a desirable outcome because we do not force
ourselves into inconsistent part placement assignments by requiring the presence of parts
which are not visible in the scene and do not enable us to place other parts. By limiting
ourselves to the search for the largest connected component, we obtain the results shown in
Figure 5.3.
Now that we have our initial results, we proceed to a post-processing phase described in
the next chapter.
Figure 5.3: Result of inference in the model where the constraint that all parts must be present is relaxed and we search for the largest connected component. The arm parts, for which we are lacking good placement hypotheses, are no longer present, and we are not forced into an inconsistent configuration where the arms are placed incorrectly relative to other parts.
Chapter 6
Phase 2C: Post-processing and
Further Inference
At this point in the process, we have run inference in our probabilistic network as described
in Chapter 5 to obtain the most likely placement of parts. However, these placements can
be further refined, additional hypotheses can be generated, and the inference process can be
repeated to yield improved results. We describe these post-processing mechanisms in this
chapter.
6.1 Articulated ICP
Once we have the resulting best hypothesis for each part from inference, we can further
fine-tune them using a process called Articulated ICP. The goal is to better align the parts
with the partial view surface and with the neighboring part joints.
We take each of the parts included in our results and associate points on the parts
with nearby points on the partial view. We also associate the joint points in each part
with the corresponding joint points in the neighboring parts. We then use ICP to solve
for transformations for each part that move these associated pairs of points closer together.
After applying the transformations, the process is repeated for a fixed number of iterations.
Unlike the ICP we employed earlier, where only one part was moved at a time, our
articulated ICP approach here allows us to move all the parts at once and relative to each
other. The result is an improved alignment, where the parts are drawn together or pushed
apart at the joints so that they align well and are adjusted to match the surface of the partial
view.
6.2 Missing part hypothesis generation
At this point, we have an articulated model containing the largest connected component.
However, some parts may still be missing if we did not have good hypotheses when we
initially ran inference. In this phase, we generate new hypotheses for parts not included in
initial results but adjacent to a part that was included. In essence, we are attempting to
grow outward from our starting connected component.
First, for any neighbors of missing parts, we use the new hypothesis that resulted from
articulated ICP to suggest a hypothesis for this missing part based on the spin image, as we
did in the initial domain enrichment phase prior to inference (described in Section 3.4).
Additionally, for any neighbors of missing parts, we take the top K (in this case, 3)
hypotheses from inference plus the hypothesis from articulated ICP on that part, and we
use these K + 1 hypotheses to generate new hypotheses for the missing neighboring parts
using a uniform sampling approach. For each of these hypotheses, we examine the sphere
around the joint shared with the missing part and consider the placement of the missing
part pointing outward from the joint at fixed intervals of angles in that sphere. For each
of these new hypotheses, we run ICP to align it with any surface in the vicinity. We then
calculate the individual part score (Section 4.1) for each new hypothesis and based on that
keep the top M (in this case, 3). In total, this yields (K + 1) ∗M new hypotheses for the
missing part. This uniform sampling allows us to identify possible part placements that may
have been missed because, for various reasons, the spin images were not good enough to
obtain a good match. This process is not feasible at earlier stages when we have many more
hypotheses because it requires running ICP on too many samples and generates too many
new hypotheses. However, it works well when we have a limited number of good hypotheses
for the present parts.
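The fixed-interval sampling of directions on the sphere could be sketched as follows (an illustrative parameterization by polar and azimuthal angle; the thesis does not specify the exact scheme):

```python
import math

def sphere_directions(step_deg=30):
    """Unit direction vectors at fixed angular intervals, used to point a
    missing part outward from the joint it shares with a placed neighbour."""
    dirs = []
    for t in range(0, 181, step_deg):          # polar angle theta
        theta = math.radians(t)
        # a single sample suffices at each pole
        phis = [0] if t in (0, 180) else range(0, 360, step_deg)
        for p in phis:                         # azimuth phi
            phi = math.radians(p)
            dirs.append((math.sin(theta) * math.cos(phi),
                         math.sin(theta) * math.sin(phi),
                         math.cos(theta)))
    return dirs
```

Each sampled direction would then seed an ICP alignment of the missing part against nearby surface.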
6.3 Repeating inference
The final stage in our process is to repeat inference. We set up the probabilistic model as
before. For all parts, we include the top K hypotheses from inference in their domains. For
the parts present in our previous result, we also add the hypothesis from articulated ICP to
the domain. For the missing parts that are neighbors to these parts, we add the hypotheses
generated as described above to their domains.
We run inference as before and then apply articulated ICP to get our new result.
If more parts are now present in the result, there may now be new missing parts that are
adjacent to the present parts. We repeat the process of missing part hypothesis generation,
inference, and articulated ICP until no new parts are added.
When we reach the point that no new parts are added, we return the skeleton resulting
from articulated ICP as our final result. We have now completed the process, and have a
placement in the scene for each of the parts that are in the largest connected component we
can find.
Chapter 7
Experimental Results
We present results of running our algorithm (described in the preceding chapters) on two
data sets: scans of the model puppet described before and a series of scans on a human.
We tested our algorithm on several scenes involving the 15-part model puppet viewed in
different poses and from various directions. The scenes have various cluttering and occluding
objects. The scans were generated from a temporal stereo scanner [4]. The scans are made
using two cameras to capture stereo information. Light is shone in changing bands on the
object and captured by the two cameras. Using the observations of the light bands from
the two viewpoints, the 3D surface mesh can be extrapolated from a series of images. One
of the major challenges with this data, however, is the presence of large shadows. A surface point
can only be located when it is visible to both cameras and reached by the light source. As
a result of this, any objects farther forward in the scene essentially cast three shadows (one
blocking the light, one blocking the left camera, and one blocking the right camera). This
is the reason the ring in Figure 1.1(b) hides a larger portion of the lower torso than the
ring’s own surface would otherwise account for. This is also the reason for missing areas of
the background. An important consequence of this is that there are large regions where we
can neither assume the presence nor absence of surface (because it is possible that there is
surface there which the object is hiding in one of its shadows). As a result, the detection
problem is more challenging.
A sample of the puppet scenes and the resulting puppet embeddings are shown in Figure
7.1. We correctly identified the torso and head core of the puppet in almost all cases, even
in situations with substantial occlusion. In most cases, we were also able to place the limbs
correctly. This was possible even in scenes with substantial occlusion or parts not visible.
For example, in the scene of the puppet kicking the ball (Figure 7.1(a)), its own arm occludes
much of it from view. In the scene with the puppet holding two sticks (Figure 7.1(g)), much
of the puppet is not visible, but the limbs were placed in the generally correct location by
our algorithm.
Even in situations where limbs were not placed correctly, they were placed in a configuration
consistent with the scene data. For example, in the scene of the puppet holding the
ring (Figure 7.1(d)), the leg was turned up and placed along the surface of the ring. This is
consistent with our model, in which we wish to place our object in such a way that it explains
the observed data. In the scene in which we obtained the worst results, i.e. the scene with
the puppet holding the ring around its waist (Figure 7.1(e)), the puppet was placed entirely
consistent with the observed data, although it was twisted in an unusual manner. Missing
parts were generated to fill in around other hypotheses and complete the puppet skeleton.
(a) Puppet kicking ball (b) Puppet kneeling next to cup (c) Puppet with a smaller puppet on its shoulders
(d) Puppet holding a ring (e) Puppet with a ring around it (f) Puppet stepping on an object
(g) Puppet with sticks (h) Puppet with a wire snaking around
Figure 7.1: Sample of scenes and the resulting puppet embeddings
This, in fact, shows the power of our hypothesis generation mechanism and the difficulty of
distinguishing an object from similarly shaped nearby surfaces.
We also tested our algorithm on a human data set, matching a 16-part articulated model
to a series of partial view scans of a human, produced using a Cyberware WRX scanner.
This scanner functions by measuring the time it takes for a laser beam to be reflected back by
the surface. Using this, it is possible to compute the distance to the surface and construct
a 3D mesh. An object (in this case a human along with various occluding objects) is placed in
the scanner and scanned from four sides. A single one of these four scans is a partial view.
Human data introduces a new challenge in addition to occlusion: individual parts are now
deformable, since muscle and tissue are non-rigid.
Figure 7.2 shows four human scenes and the corresponding pose recovered for each one.
In the human scenes, we found the correct placement of the head and torso, and were
generally able to find the arms and legs. In the pose with the person’s arms crossed in front
(Figure 7.2(d)), we were able to correctly reconstruct the entire skeleton, despite the fact
that most of the torso is missing due to occlusion from the arms and missing data from
the scan due to shadows. In the scenes with the person partially occluded by the chair
(Figure 7.2(b)) and the board (Figure 7.2(c)), we were able to recover the correct placement
of most of the limbs. In the cases where the limbs were not placed in the correct location,
they were consistent with the observed data, either placed behind an occluding object or in
the unknown area surrounding the person (recall that we cannot make assumptions from the
absence of surface and must treat those regions as unknown).
48
(a) Person standing (b) Person with chair
(c) Person holding board (d) Person with arms crossed in front
Figure 7.2: Sample of 4 scenes and the resulting human poses. For each pair, the image on the left is the original scene and the image on the right is the recovered human pose.
49
Our algorithm functioned without changes for both data sets. Both sets of results were
generated with the same parameters, with the exception of the threshold for the joint angle
prior, which differed for humans and puppets because the puppet joints have a greater
freedom of movement than human joints.
Chapter 8
Conclusions and Future Directions
In conclusion, we have developed an algorithm for determining the pose of articulated objects
in 3D range data. Our algorithm begins with detectors that provide initial hypotheses. It
then uses efficient scoring mechanisms that we have defined to compute scores for individual
parts and pairs of parts. Our approach uses these scores to define a probabilistic model
with which we can estimate the most likely pose given the observed data. In the end, we
demonstrated successful application to puppet and human data sets.
This work addresses a challenging problem: that of detecting occluded objects in range
images and determining their pose. It presents a novel solution, based on a well-formulated
probabilistic framework. There is a broad range of applications of object detection in
scenes, with potential applications ranging from automated monitoring and analysis to a
robotic system that interacts with the real world.
Future work will involve testing with other articulated model data sets, including possible
non-human-shaped objects. In addition, we would like to explore learning features of the
space of poses. One major obstacle to obtaining better results is that we have no specific prior
knowledge of the space of poses for an object. By learning this information and incorporating
it, we will be able to distinguish between various poses that are all consistent with the data
but not necessarily consistent with physical limitations of the object. This would allow us
to better recover the actual pose of the object and improve the results.
Bibliography
[1] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, H.-C. Pang, and J. Davis. The correlated correspondence algorithm for unsupervised surface registration. In Proc. NIPS, 2004.
[2] P. Besl and N. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992.
[3] J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Anal. Mach. Intell., 8(6):679–698, 1986.
[4] J. Davis, R. Ramamoorthi, and S. Rusinkiewicz. Spacetime stereo: A unifying framework for depth from triangulation. In CVPR, 2003.
[5] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. In Proc.
CVPR, pages 66–73, 2000.
[6] D.M. Gavrila. The visual analysis of human movement: A survey. Computer Vision
and Image Understanding, 73(1):82–98, 1999.
[7] S. Ioffe and D. Forsyth. Probabilistic methods for finding people. Int. Journal of
Computer Vision, 43(1):45–68, 2001.
[8] A. Johnson. Spin-Images: A Representation for 3-D Surface Matching. PhD thesis,
Carnegie Mellon University, 1997.
[9] A. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered 3D scenes. IEEE Trans. Pattern Anal. Mach. Intell., 21:433–449, 1999.
[10] G. Mori and J. Malik. Estimating human body configurations using shape context
matching. Proc. ECCV, 3:666–680, 2002.
[11] K. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In Proc. Fifteenth Conference on Uncertainty in Artificial Intelligence (UAI '99), pages 467–475, 1999.
[12] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Francisco, 1988.
[13] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Tracking loose-limbed people. Proc. CVPR, 1:421–428, 2004.
[14] E. Sudderth, M. Mandel, W. Freeman, and A. Willsky. Distributed occlusion reasoning
for tracking with nonparametric belief propagation. In Proc. NIPS, 2004.
[15] J. Sullivan and S. Carlsson. Recognizing and tracking human action. Proc. ECCV,
1:629–644, 2002.
[16] K. Toyama and A. Blake. Probabilistic tracking with exemplars in a metric space. Proc.
ICCV, 48:9–19, 2001.
[17] S. Yu, R. Gross, and J. Shi. Concurrent object recognition and segmentation with graph
partitioning. In Proc. NIPS, 2002.