Source: ai.stanford.edu/~koller/Papers/Rodgers:05.pdf

Articulated Model Detection in Range Images

Jim Rodgers

Stanford University

[email protected]

June 14, 2006


Abstract

We present a method for detecting the placement of complex articulated models in 3D partial-view range images. Specifically, we detect the placement of the parts of an articulated model in a scene. This is a difficult task because the object can be substantially occluded by clutter, or portions of it may not be observed at all. Our approach uses detectors to provide initial location hypotheses for parts. We expand upon these hypotheses using a mechanism for dealing with missing parts. We then create a probabilistic model to determine the most likely placement of parts based on individual and pairwise potentials. We devise efficient scoring mechanisms to evaluate placements and provide these potentials. During a post-processing phase we further refine our results, producing the final pose of the model in the scene. In this thesis, I present our algorithm and the results of successfully running it on two data sets of objects with 15 or more parts.


Acknowledgements

Professor Daphne Koller, my advisor, has been instrumental to this project, always providing support and new ideas. She has challenged me to think in new ways, given me an understanding of the research process, and taught me a great deal. Enormous credit also goes to Dragomir Anguelov, who has been a mentor for this project, and under whose guidance I was first exposed to it as a summer research project through the CURIS program in the Computer Science department during the summer of 2004. He was always there to answer questions, discuss various approaches, and offer numerous suggestions. I am also indebted to Jimmy Pang, who worked with me on this project during the summer of 2004 and was instrumental in helping me get started in research. James Davis from UC Santa Cruz and Stefano Corazza and Lars Mundermann from the Stanford Biomechanics Lab were crucial in acquiring data sets. Finally, sincere appreciation goes to colleagues and friends in the Computer Science department and elsewhere for listening to me discuss my project and for providing suggestions and support.


Contents

1 Introduction
  1.1 Background and related work
  1.2 Applications
  1.3 Overview
2 Probabilistic Model and Algorithmic Overview
3 Phase 1: Detectors
  3.1 Detector background
  3.2 Spin images
  3.3 Clustering, pruning and alignment
  3.4 Domain enrichment
4 Phase 2A: Scoring
  4.1 Individual part scores
    4.1.1 Area explanation/occlusion score
    4.1.2 Edge match score
    4.1.3 Final part score
  4.2 Inter-part scores
    4.2.1 Joint spacing score
    4.2.2 Angle score
    4.2.3 Part intersection score
    4.2.4 Final inter-part score
5 Phase 2B: Inference
  5.1 Probabilistic network
  5.2 Inference in network
  5.3 Adding edges to network
  5.4 Largest connected component
6 Phase 2C: Post-processing and Further Inference
  6.1 Articulated ICP
  6.2 Missing part hypothesis generation
  6.3 Repeating inference
7 Experimental Results
8 Conclusions and Future Directions

Chapter 1

Introduction

Consider a complex scene, such as a person standing behind an object. The person's body is partly hidden by the object, and other objects surround the person. When we observe this scene, we can see it from only one side at a time, inevitably leaving important information hidden from view. The question we address is: given a 3D model of the person (or one of the objects), how can one locate and determine the pose of that object in the scene when it is only partly visible?

In this thesis, we present an algorithm for reconstructing the 3D pose of an object and demonstrate it on a 15-part human-shaped puppet and a 16-part human model. Our algorithm takes as input two meshes, which are collections of 3D points connected together to form triangles. The first is a complete 3D model of the object we wish to detect. The model is an articulated model, meaning it is broken up into (almost) rigid parts that move as a single unit. These parts are collections of points which move together around a joint connecting two parts. Figure 1.1(a) shows the articulated model of the puppet, where the rigid parts of the articulated model are the 15 parts of the puppet, that is, the head, upper torso, arms, legs, etc. Our algorithm also takes as input a partial view of a scene including the object, as in Figure 1.1(b). The partial view is a scan of the scene from a particular viewpoint, which, unlike the model, is not complete and includes only the points visible on one side. It also includes other items observed in the scene around the object. Our goal is to take a partial view of the scene including the object in an unknown pose and determine the placement and configuration of the model object in the scene.

Figure 1.1: (a) An articulated object model consists of rigid parts, whose motion is constrained by joints. The goal of our algorithm is to determine the pose of the model in (b) a partial view of the scene.

This process is challenging because of the inherent limitations of the partial view. First, the partial view shows only one side of the object, meaning that, depending on the positioning of the object, some of its parts may be substantially occluded. Further, we must deal with occlusion resulting from clutter in the scene, such as surfaces in front of the object. Finally, there is the problem of missing data, which arises in regions where there is insufficient information from the scanner to determine whether a surface was present.

Our specific goal is to determine the location of each part in the scene, given the model of the object and the partial view of the scene. We accomplish this using a two-phase approach. In the first phase, detectors provide initial hypotheses for the locations of parts. In the second, inference in a probabilistic network is used to find the most likely assignment of locations to parts, giving us our output: the placement of parts in the scene.

1.1 Background and related work

Pose detection is an active research field being investigated in many directions. Some approaches do not construct a part-based model as we do, but instead connect the appearance of an object directly with its pose (see, for example, [16], [10], and [15]). A major downside, though, is that thousands or even millions of training examples are needed to provide enough information to learn this relationship. Our approach avoids that problem in part through the use of a shape-based model.

A majority of the work in pose detection has dealt with detecting 2D models in 2D data, including [5], [7], and [17]. These approaches share some important similarities with ours: initial locations for parts can be provided by detectors for low-level features in the images, and a tree-shaped model over parts is often used in which adjacent parts are connected and share constraints. However, 2D approaches suffer from a major limitation: they depend on a particular view of an object. In 2D, a person looks completely different from the front and from the side. With 3D data, however, the difference is simply a question of looking at different sides of the same 3D model of a person, and does not require constructing separate models.

The pose-detection problem is also being explored in 3D, the area we focus on with our approach. Working in three dimensions offers the advantage of more information, but also brings a new challenge: the added dimension results in a significant increase in the search space. Large search problems (for example, the enumeration of all pairs of adjacent part configurations done by [5]) that were once feasible in 2D are no longer possible in 3D.

3D scanning and modeling relies on taking a scene, determining the location of points, and then connecting the points together using triangles to form a mesh. This can be accomplished by various techniques, such as laser range finders, which reflect a laser beam off a surface and measure return time to compute the distance to the surface, or stereo vision cameras used to determine the distance to points.

Articulated models like those we use are ubiquitous in this field, employed for tracking, animation and shape detection. Another of their important features is that they can be recovered automatically, meaning manual creation of a model is not required. With the correlated correspondence algorithm [1], an articulated model can be recovered from scans of an object in various poses. This allows us to produce a model broken into articulated parts, which we can then match against scenes showing the object in newly observed poses.

Much of the 3D pose detection work has involved tracking from video (for example, [13] and [14]), where a series of images over time provides additional information. In our approach we work with a single image of a scene, consisting of 3D range data.

An important challenge that we address in our approach is dealing with clutter and occlusion. This is a difficult problem to solve. For instance, the 3D registration we use with the correlated correspondence algorithm [1] to build the object model would not function in the presence of other cluttering and occluding objects. We present a mechanism for locating and dealing with hidden and missing parts.

1.2 Applications

The range of applications for object pose detection is very broad. In many situations it would be useful for a computer to automatically analyze an image of a scene and identify the location and pose of objects. For example, a robot that interacts with people and objects in the human world must be able to identify objects (including people) and determine their placement even when they are partially hidden from view. Imagine a situation where a robot must locate a person who happens to be sitting in a chair. The chair is partially hidden from view by the person, and the person is partially hidden by the chair, making each difficult to recognize. A system that can detect the pose of objects even when they are occluded is necessary to automatically analyze such a scene and determine the placement of the person and chair observed in it. Other applications range from automatic processing of surveillance video to pose detection for advanced computer interfaces. Gavrila [6] provides a survey of further applications specific to the domain of human detection.

1.3 Overview

In the following chapters, we present our algorithm and the results of testing it on two data sets. Chapter 2 lays out the probabilistic model we use and provides a brief overview of our algorithm. Chapters 3 through 6 cover the process in detail. In Chapter 7, we provide images showing our results on the data sets and discuss them. We conclude in Chapter 8 with a summary of our work and a discussion of future research possibilities.


Chapter 2

Probabilistic Model and Algorithmic Overview

In this chapter, we provide an overview of the probabilistic model based on the object. This model is at the center of our approach to determining the placement of the parts of an articulated model in a scene. For each part in the object we define a variable. The domain of each variable consists of the possible locations for the part in the scene. We refer to each possible location as a hypothesis. For example, the variable for part a can take on the values h^(a)_1, ..., h^(a)_k, each of which corresponds to a location for part a in the scene.

We construct a Markov random field over the parts in the object. A Markov random field (also referred to as an MRF) is a mechanism for representing the joint probability distribution over a set of variables, which in this case are the placements of each part. The likelihood of a part's placement depends on two factors: how well a particular location for a part fits the scene, and how well it fits with the other parts in the model. An MRF provides a compact way to represent the likelihood of the placement of an individual part and the pairwise likelihoods for the placement of pairs of parts. Specifically, an MRF can be viewed as a graph in which each variable (in this case each part) has its own distribution and is represented as a node. Pairs of variables that are jointly dependent are connected by edges in the graph, and the joint probabilities over two variables are calculated for each edge.

Recall that in our probabilistic model, each part is represented by a variable. Potentials model the likelihood of each hypothesis for the placement in the scene. Constraints on placements of parts relative to each other result in pairwise potentials along the edges added to the MRF. Inference is used to determine the maximum likelihood placement of parts in the scene.
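To make this concrete, here is a toy sketch of the structure just described: unary potentials over each part's hypotheses, pairwise potentials on an edge, and a brute-force search for the most likely joint assignment. The part names, domain sizes, and potential values are invented for illustration only; the thesis's actual potentials come from the scores of Chapter 4, and real inference replaces the brute-force enumeration.

```python
import itertools

# Hypothetical domains: each part has a small set of placement hypotheses,
# identified here by index. Real domains come from the detector phase.
domains = {"head": [0, 1], "torso": [0, 1, 2]}

# Unary potentials: how well each hypothesis fits the observed scene.
# (Illustrative numbers only.)
unary = {
    "head": [0.9, 0.2],
    "torso": [0.5, 0.8, 0.1],
}

# Pairwise potentials on MRF edges: compatibility of adjacent parts'
# placements (e.g. joint spacing), keyed by (head_hyp, torso_hyp).
pairwise = {("head", "torso"): {(h, t): 1.0 if h == t else 0.3
                                for h in domains["head"]
                                for t in domains["torso"]}}

def joint_score(assignment):
    """Product of unary and pairwise potentials for a full assignment."""
    score = 1.0
    for part, hyp in assignment.items():
        score *= unary[part][hyp]
    for (a, b), table in pairwise.items():
        score *= table[(assignment[a], assignment[b])]
    return score

# Brute-force MAP: enumerate all joint assignments. Feasible only for
# tiny domains; the full model uses proper MRF inference instead.
parts = list(domains)
best = max((dict(zip(parts, combo))
            for combo in itertools.product(*(domains[p] for p in parts))),
           key=joint_score)
```

With these numbers the best assignment places both parts at hypothesis 0, since the high unary scores and the matching-placement bonus reinforce each other.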

Our algorithm has two main phases. The first phase (discussed in Chapter 3) involves the use of detectors to provide initial hypotheses for the location of each of the individual object parts. The detector phase matches the surface surrounding individual points on one of the model parts to the local surface around a point in the partial view. The strongest matches are used to provide the possible locations for parts for the next phase.

The second phase (discussed in Chapters 4 through 6) is where most of our work occurs. It involves creating the probabilistic object model, which requires several components. The first of these is scoring, which we present in Chapter 4. We define an energy function that prefers placements of parts that are consistent with the observed scene and with our expectations of the object's structure. We present a model for efficiently computing a score for an individual part and scores for two adjacent parts. The second component, covered in Chapter 5, involves inference in a probabilistic network. We create the network described above and use the scores to define potentials in the network. We then run inference to determine the best joint assignment of location hypotheses to parts, thereby providing the reconstructed articulated object model. Finally, in post-processing, described in Chapter 6, we perform a fine-tuning alignment of the model with the scene, attempt to fill in missing parts, and run further inference. The result is our final reconstructed object model.


Chapter 3

Phase 1: Detectors

We begin our process by finding possible locations of parts, which are used to initialize the domains of the nodes in the probabilistic model outlined in the previous chapter. In this chapter, we describe the low-level detectors that we use to identify where a particular part of the articulated model may be located in the scene (similar to the approach of Ioffe and Forsyth [7]).

3.1 Detector background

The goal of our approach is to find a placement of each part in the scene. A placement hypothesis can be defined in terms of a transformation that rotates a part from its location in the model and then translates it to a placement in the scene. This can be represented as a transformation matrix, and the problem can be conceived of as searching for the best transformation matrix to match a part to a location in the scene.


Rotation of the part can be defined in terms of rotation about each of three axes, and translation can be defined by a shift in each of three dimensions. A transformation therefore consists of six parameters. Because we are working with placements in a continuous space, we are essentially dealing with search in a six-dimensional continuous space.
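As a sketch of this parameterization (standard rigid-body geometry, not code from the thesis), the six parameters can be assembled into a single 4x4 homogeneous transformation matrix:

```python
import numpy as np

def rigid_transform(rx, ry, rz, tx, ty, tz):
    """Build a 4x4 homogeneous transform from three rotation angles
    (radians, about the x, y, z axes) and a translation."""
    cx, sx = np.cos(rx), np.sin(rx)
    cy, sy = np.cos(ry), np.sin(ry)
    cz, sz = np.cos(rz), np.sin(rz)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    T = np.eye(4)
    T[:3, :3] = Rz @ Ry @ Rx      # compose the three rotations
    T[:3, 3] = [tx, ty, tz]       # then translate
    return T

# Applying the transform to a point in homogeneous coordinates:
p = np.array([1.0, 0.0, 0.0, 1.0])
moved = rigid_transform(0, 0, np.pi / 2, 0, 0, 0) @ p
# rotating (1, 0, 0) by 90 degrees about z gives (0, 1, 0)
```

Searching over the best such matrix means searching over these six continuous parameters, which is why exhaustive search is hopeless and detectors are needed.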

Exhaustive search of this space would not be feasible. The detectors help solve this problem by providing initial discrete locations where parts may be located. We run the detectors for each of the 15 parts to come up with hypotheses for where each part may be located in the scene.

3.2 Spin images

Our detectors are based on spin images [8], a mechanism for describing the local surface around points on a mesh. Spin images operate by taking a specific point on the mesh and placing a cylinder along the normal to the surface at that point. The points falling inside the cylinder are projected onto the radial axis and vertical axis of the cylinder, and these coordinates are used to place the points into bins. The result is an n-dimensional vector describing the frequency of points in each of the n bins. We compute spin images for points on both the model and scene meshes, resulting in vectors that describe the surface surrounding the points on both meshes.
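A minimal sketch of this binning (simplified from the spin-image formulation of [8]; the cylinder size and bin count here are illustrative choices, not the thesis's parameters):

```python
import numpy as np

def spin_image(points, origin, normal, radius=1.0, height=1.0, bins=8):
    """Histogram mesh points in cylindrical coordinates around a surface
    point: alpha = radial distance from the normal axis, beta = signed
    height along the normal. Returns a flattened bins*bins descriptor."""
    n = normal / np.linalg.norm(normal)
    rel = points - origin
    beta = rel @ n                                           # along the normal
    alpha = np.linalg.norm(rel - np.outer(beta, n), axis=1)  # radial distance
    # keep only the points falling inside the cylinder
    keep = (alpha < radius) & (np.abs(beta) < height)
    hist, _, _ = np.histogram2d(alpha[keep], beta[keep],
                                bins=bins,
                                range=[[0, radius], [-height, height]])
    return hist.ravel()
```

Descriptors computed this way for model and scene points can then be compared directly by L2 distance, as the next section describes.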

Our goal is to find points on the scene mesh where the surrounding surface is similar to that around points on the model mesh. Therefore, we take points in the scene mesh and, using a nearest-neighbor search structure, find the closest matches (based on the L2 distance between the n-dimensional descriptive vectors) to points in the model. Using these matches, we create candidate correspondences between model and scene points.
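The matching step can be sketched as follows. The descriptor data here is random placeholder, and for clarity the sketch uses a brute-force distance computation; a KD-tree or similar nearest-neighbor structure, as the text describes, would replace it at scale.

```python
import numpy as np

rng = np.random.default_rng(0)
model_desc = rng.random((200, 64))   # one spin-image vector per model point
scene_desc = rng.random((500, 64))   # one per scene point

# Squared L2 distance from every scene descriptor to every model descriptor.
d2 = ((scene_desc[:, None, :] - model_desc[None, :, :]) ** 2).sum(axis=2)
model_idx = d2.argmin(axis=1)        # best model match per scene point
best_d = np.sqrt(d2.min(axis=1))

# Candidate correspondences (scene index, model index, distance),
# best (smallest distance) first.
order = np.argsort(best_d)
candidates = [(int(i), int(model_idx[i]), float(best_d[i])) for i in order]
```

The top of this sorted list supplies the highest-scoring correspondences that the clustering step below operates on.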

3.3 Clustering, pruning and alignment

Now that we have candidate correspondences between the object model and the scene, we wish to find placement hypotheses. For each model part, we collect the highest-scoring correspondences (based on the L2 distance described above) between points on the part and points in the scene. We cluster these using a method similar to the approach of Johnson and Hebert [9].

Specifically, we sort all correspondences from smallest distance to largest. We begin with the first correspondence, which we denote c. For every other correspondence i for which the distance between the model points and the distance between the scene points are within a distance threshold d, we calculate a correspondence score with c. For this score, we separately compute, for the model and for the scene, the vector from the spin-image origin to where the point for correspondence i falls in the spin image for correspondence c. For a good match, the two vectors should be close to the same value, indicating that the model and scene points of i are in a similar orientation relative to those of c. We compute the difference between these directional vectors, and if it is less than a threshold t, correspondence i is added to the cluster for correspondence c. We repeat this process, removing from the list any correspondences that have already been placed in a cluster, until no new clusters of a minimum size s can be formed.

Once we have established clusters, we sort them by a cluster score equal to the average distance between the spin-image descriptive vectors for the correspondences in the cluster. For each of the top m clusters, we then compute a placement hypothesis for the part. This is done using the iterative closest point (ICP) algorithm [2], which solves for a transformation matrix, applied to the part, that minimizes the distance between each of the model-scene correspondences.

If there were many spin image matches between the same region of the model and scene, then after generating hypotheses from each cluster we will have several hypotheses that are very similar. However, we want to limit the number of hypotheses to reduce the search space, while simultaneously forcing hypotheses to be spread out in order to explore the complete scene. Therefore, we prune similar hypotheses by determining how close to the same location different hypotheses place the part. With the hypotheses sorted by their cluster score, we compute the difference between the top hypothesis T and every other hypothesis S:

    d(S, T) = Σ_{j ∈ joint points} ||T(j) − S(j)||

That is, the score is the sum of the differences in placement of each joint point j when transformed by hypotheses S and T. A lower score indicates that the two hypotheses do not move the part much relative to each other, since the joints remain in similar locations. To eliminate similar hypotheses, we discard a hypothesis S when d(S, T) is less than a minimum distance d_min. We repeat this process with T set to the next highest remaining hypothesis on the list, and continue until we have compared all pairs of hypotheses to ensure a minimum separation distance.

Figure 3.1: Top-scoring placement hypotheses for the puppet upper torso resulting from the detector phase.
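The pruning loop can be sketched as follows. Here each hypothesis is represented as a 4x4 transform acting on homogeneous joint points, and the threshold value in the usage is illustrative only.

```python
import numpy as np

def hypothesis_distance(S, T, joint_points):
    """d(S, T): sum over joint points of the displacement between the
    two hypotheses' placements of each joint."""
    return sum(np.linalg.norm(T @ j - S @ j) for j in joint_points)

def prune(hypotheses, joint_points, d_min):
    """Greedy pruning: keep the best-scoring hypothesis, drop every other
    hypothesis that places the joints within d_min of it, then repeat
    with the next surviving hypothesis. `hypotheses` is sorted best-first;
    each element is a 4x4 transform."""
    kept = []
    remaining = list(hypotheses)
    while remaining:
        T = remaining.pop(0)
        kept.append(T)
        remaining = [S for S in remaining
                     if hypothesis_distance(S, T, joint_points) >= d_min]
    return kept
```

For example, two hypotheses differing by a centimeter-scale shift collapse to one, while a hypothesis placing the part elsewhere in the scene survives.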

Once we have the remaining hypotheses, we improve their alignment with the scene surface using an ICP alignment method in which points on the part, transformed by the initial hypothesis, are matched to nearby points in the scene. We then solve for the transformation that minimizes the distance between the pairs of points, thereby bringing the part into closer alignment with the surface. The process is repeated, each time moving the part into a slightly better alignment. After running ICP on all the hypotheses for a part a, the result is a set of hypotheses h^(a)_1, ..., h^(a)_k for that part, similar to those shown in Figure 3.1.
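A minimal point-to-point ICP sketch of this refinement loop. The rigid-fit step is the standard Kabsch/SVD least-squares solution; a production version of [2] would add distance thresholds, outlier rejection, and convergence tests.

```python
import numpy as np

def best_rigid_fit(src, dst):
    """Least-squares rotation R and translation t aligning paired points
    src -> dst (Kabsch algorithm via SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # avoid reflections
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp(part_points, scene_points, iters=10):
    """Repeatedly match each part point to its nearest scene point and
    re-solve for the rigid transform, nudging the part into alignment."""
    pts = part_points.copy()
    for _ in range(iters):
        # nearest scene point for every part point (brute force)
        d2 = ((pts[:, None, :] - scene_points[None, :, :]) ** 2).sum(axis=2)
        matches = scene_points[d2.argmin(axis=1)]
        R, t = best_rigid_fit(pts, matches)
        pts = pts @ R.T + t
    return pts
```

Each iteration's solve step cannot increase the matched-pair error, which is why the part drifts toward the surface rather than away from it.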


3.4 Domain enrichment

At this point, we have placement hypotheses for most or all parts, but a major problem remains: we may be missing good hypotheses for some parts. For example, in the partial view shown in Figure 1.1(b), the lower torso is almost completely obscured, making it impossible to obtain a correct hypothesis using spin image surface matching. We therefore add hypotheses to the domains of parts in two ways.

First, we generate hypotheses for parts adjacent to one of the other parts in the model. We consider each of the joints around a part a transformed by a hypothesis. In the region around the joint, we identify the top spin image match between the scene surface and the surface of a part b connected to a by the joint. We use this spin image match to generate a hypothesis for the placement of b. Specifically, we use ICP to determine a hypothesis transformation matrix for b that minimizes the distance between three pairs of points: (1) the joint in the already placed part a and the joint in part b, (2) the spin image point in the scene and the corresponding point on the part, and (3) a point along the normal to the spin image match in the scene and a point an equal distance along the normal to the spin image match in the model. This provides a hypothesis for b consistent with the joint in a and with at least one patch of the surface around the joint. In effect, we have used a hypothesis for a to generate a hypothesis for its neighbor b. This approach allows us to enrich the domains of parts in instances where spin images may not provide the cluster of matches needed for a hypothesis in the initial detector phase.

The second mechanism fills in missing parts between two parts. We take hypotheses for two parts attached to different joints of a third part and use them to generate a hypothesis placing the third part between them. We generate hypotheses for a part c based on two distinct neighboring parts a and b, with hypotheses i and j respectively. Specifically, for each distinct pair (i, j) we generate a new hypothesis. We use ICP to enforce the constraints that the joint shared between c and a must match the location for part a under hypothesis i and, similarly, that the joint shared between c and b must match the location for part b under hypothesis j. Since ICP with only two points is under-constrained, we add weaker placement constraints between points around the joint on the border between the neighboring parts. That is, we add constraints that points on c that border points on a in the model must be close to those points on a in the scene; we do the same for part b. Using the ICP algorithm, we solve for a transformation that gives a hypothesis k for c that respects the neighboring joints and the other neighboring points to the greatest extent possible. As an example, this approach allows us to add a hypothesis for the occluded lower torso in the puppet scene based on our hypotheses for the upper torso and one of the upper legs, which are neighbors of the lower torso. It is worth noting that many of the hypotheses produced with this method may be poor, since there is no guarantee that i and j are consistent with each other. For example, i and j may place a and b very far apart, which would result in a poor hypothesis for c. We address this by pruning newly generated hypotheses where the joint spacing is too great or too small to be consistent with a part c actually lying between the locations determined by i and j.

After these two enrichment steps, we have the hypotheses h^(p)_1, ..., h^(p)_k for each part p necessary for use in the next phase.


Chapter 4

Phase 2A: Scoring

Now that we have a process for identifying possible locations of parts (described in Chapter 3), we need a mechanism to determine whether a possible placement is plausible. Scoring allows us to quantitatively assess the quality of a location hypothesis and to favor the better ones in our probabilistic model.

Two different factors must be taken into account when scoring a hypothesis. The first deals with individual parts and reflects how well the hypothesis matches the observed scene; in other words, how well the part surface fits the partial-view surface. For example, the puppet arm should ideally be placed flush with the partial-view scene surface, not sticking out into space in front of it.

The other factor concerns how well two parts fit relative to each other and yields an inter-part score. Because we are constructing a skeleton, parts cannot be placed independently; rather, the placement of parts in relation to one another must be considered. The second component of the score takes this into account by enforcing requirements that adjacent


parts be placed in compatible locations. For example, the head of the puppet must be placed

close to the torso, rather than separated by a significant distance.

4.1 Individual part scores

The individual part score is further divided into two parts: an area explanation/occlusion

score and an edge match score.

4.1.1 Area explanation/occlusion score

The area explanation/occlusion score does two things. First, it favors placements of parts

that align with the surface of the partial view, thereby explaining area. Second, it penal-

izes parts that protrude in front of the observed surface, thereby occluding portions of the

observed scene.

Figure 4.1(a) shows an example of a good-scoring placement of an upper leg. (This hypothesis places the leg so that its surface is intertwined with the partial view surface, and the grey of the partial view mixes with the green of the part.) This configuration

scores well because the part surface matches up with the partial view surface, meaning it

explains the area well. On the other hand, Figure 4.1(b) shows an example of a leg placement

that does not score well under our scoring function. There are a few points where the leg

matches with the surface of the arm, but for the most part it is sticking out into space and

occluding the rest of the partial view.


(a) Good scoring placement (b) Bad scoring placement

Figure 4.1: The area explanation/occlusion score prefers placements of parts that are well aligned with the surface over those that occlude the surface.

Scoring details

For scoring individual parts, then, we wish to determine how well a candidate placement

matches the observed surface. To do this, we discretize the surface and then consider the

path from the observation point through points on the surface of the part. For each path

through a part surface point, we determine where along that path the partial view surface

lies. This is useful, because, for our model, we need to determine the consistency of the

partial view surface with the part surface. For partial view points close to the surface,

we model the likelihood as a Gaussian distribution with mean 0 and a standard deviation

σa. This yields a maximum likelihood when the observed partial view surface is exactly

aligned with the part. The likelihood decreases the farther away the partial view surface

is from the part. For a partial view surface that lies relatively far in front of the part, we


assign it a moderate uniform likelihood α because there is a reasonably high likelihood of

the part being behind some partial view surface that occludes it. However, for a partial view

surface observed a significant distance behind the candidate part, we assign it a uniform low

likelihood β because it is unlikely that surface would be observed behind the part since the

part surface would have occluded whatever is behind it.

Finally, we have to deal with points of the part that fall in regions where there is no

partial view surface in the data. Due to the limitations of the scanner, some surfaces in

the partial view may not be registered. Regions where there is no data must, therefore, be

treated as unknown areas; we cannot assume the presence or absence of surface. Thus, we

assign a moderate uniform likelihood γ to part points that fall within that region because

there is a reasonably high likelihood of observing no partial view surface even when a part

is placed there.

In summary, we assign scores for individual points as follows, where x is the distance

between partial view and part surface, α and γ are moderately high uniform likelihoods, and

β is a low uniform likelihood:

$$\text{likelihood}_p = \begin{cases} \max\!\left(\alpha,\ \exp\!\left(\frac{-x^2}{2\sigma_a^2}\right)\right) & \text{partial view surface in front of the part} \\[4pt] \max\!\left(\beta,\ \exp\!\left(\frac{-x^2}{2\sigma_a^2}\right)\right) & \text{partial view surface behind the part} \\[4pt] \gamma & \text{no partial view surface observed} \end{cases}$$

The total score of a part is then the product over all points of the likelihood for each point, weighted by the area of the point:1

$$\text{area-score} = \prod_p \text{likelihood}_p^{\,\text{area}_p}$$
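The per-point likelihood and the resulting area score can be sketched in Python. This is an illustrative CPU sketch rather than the thesis's GPU implementation, and the constant values are assumptions, since the thesis does not report them:

```python
import math

# Assumed constants; the thesis does not report the actual values used.
ALPHA = 0.5    # moderate likelihood: observed surface far in front (part occluded)
BETA = 0.01    # low likelihood: observed surface far behind the part
GAMMA = 0.5    # moderate likelihood: no partial view surface observed
SIGMA_A = 1.0  # standard deviation of the surface alignment Gaussian

def point_likelihood(x, surface_observed, in_front):
    """Likelihood of one discretized part point, given the distance x between
    the partial view surface and the part surface along the viewing ray."""
    if not surface_observed:
        return GAMMA
    gaussian = math.exp(-x * x / (2 * SIGMA_A ** 2))
    floor = ALPHA if in_front else BETA
    return max(floor, gaussian)

def area_score(points):
    """Product over points of likelihood^area, computed in log space to avoid
    numerical underflow. `points` holds (x, observed, in_front, area) tuples."""
    log_score = sum(area * math.log(point_likelihood(x, obs, front))
                    for x, obs, front, area in points)
    return math.exp(log_score)
```

The log-space accumulation is a numerical convenience of this sketch; the score itself is the weighted product defined above.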

Scoring Implementation

Scoring is implemented using OpenGL. This allows for very efficient computation of the area

explanation/occlusion scores because OpenGL is well-suited to rapid rendering of 3D points.

The partial view is initially rendered in OpenGL as seen from the viewpoint. Then for each

part we wish to score, we render it separately. Since we are dealing with graphics rendering,

we can easily discretize it based on pixels and compare Z-buffer values between the part and

partial view to determine where the two surfaces lie in relation to each other along the path

from the partial view to the part. Using this approach, we are able to leverage hardware

accelerated 3D rendering in the graphics card to score several hundred part hypotheses per

second.
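The Z-buffer comparison can be illustrated with NumPy arrays standing in for the two depth buffers read back from OpenGL. The use of np.inf to mark pixels where nothing was rendered is an assumption of this sketch:

```python
import numpy as np

def classify_pixels(scene_depth, part_depth):
    """Compare per-pixel depths of the rendered partial view and a rendered
    part hypothesis. Both arrays mimic Z-buffers, with np.inf marking pixels
    where nothing was rendered. Returns the signed distance scene - part
    where both surfaces exist (positive: scene surface lies behind the part),
    plus a mask of part pixels with no observed scene surface (the gamma case)."""
    part_mask = np.isfinite(part_depth)
    scene_mask = np.isfinite(scene_depth)
    both = part_mask & scene_mask
    signed = np.full(scene_depth.shape, np.nan)
    signed[both] = scene_depth[both] - part_depth[both]
    no_data = part_mask & ~scene_mask
    return signed, no_data
```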

4.1.2 Edge match score

When the partial view scene is viewed from the observation point, discontinuities in surface

depth from edges appear. The edge match score reflects the idea that these edges should

be aligned with the edges of parts. A configuration where the edges of a part match the

1Weighting by the area of the point is necessary because the discretization is based on pixels. Pixels on surfaces closer to the viewpoint represent less surface than pixels on surfaces farther from the viewpoint. Since the score is based on the area of the part, we want it to be weighted appropriately based on actual part size and not its location relative to the viewpoint.


observed scene edges has a higher likelihood because it is more likely that the edges result from the part's placement there.

Scoring details

We discretize the surface into points (based on pixels, as with the area score) and form a matrix of distances to each of the points. We then apply the well-known Canny edge detection algorithm [3] to the matrix to identify major discontinuities that form edges. We do the same for an individual part we wish to score, identifying its edges.

For each point along the edge of the part, we wish to model the likelihood that the

observed edge results from the part’s placement at that location. We match each part edge

point with the nearest edge point in the partial view, and then model the likelihood as a

Gaussian distribution with mean 0 and standard deviation σe. Thus, an edge that exactly

aligns with the part edge has a very high likelihood of having been generated by the part, and

the likelihood decreases the farther away it gets. Because of the problem of unknown data

in the partial view, we cannot definitively conclude that the absence of an edge means there

should not actually have been an edge, so we assign a uniform probability γ at distances far

away from the part edge, representing the idea that a part edge could actually generate a

reading of no edge.

For an individual point along the edge:

$$\text{likelihood}_p = \max\!\left(\gamma,\ \exp\!\left(\frac{-x^2}{2\sigma_e^2}\right)\right)$$


where x is the distance between partial view edge and part edge.

To obtain the score for the entire part, we multiply the likelihood of each point, weighted

by the segment edge length:2

$$\text{edge-match-score} = \prod_p \text{likelihood}_p^{\,\text{length}_p}$$
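The edge likelihood and score might be sketched as follows, with assumed constants and a brute-force nearest-neighbour search for clarity (the thesis does not specify the search structure):

```python
import math

SIGMA_E = 2.0  # assumed std. dev. of the edge alignment Gaussian (pixels)
GAMMA_E = 0.1  # assumed uniform floor for part edges far from any scene edge

def edge_match_score(part_edge_pts, view_edge_pts, seg_lengths):
    """Product over part edge points of max(gamma, Gaussian(distance to the
    nearest partial view edge point)), each weighted by the edge segment
    length that point represents. Points are 2D pixel coordinates."""
    log_score = 0.0
    for (px, py), length in zip(part_edge_pts, seg_lengths):
        d2 = min((px - vx) ** 2 + (py - vy) ** 2 for vx, vy in view_edge_pts)
        lik = max(GAMMA_E, math.exp(-d2 / (2 * SIGMA_E ** 2)))
        log_score += length * math.log(lik)
    return math.exp(log_score)
```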

Scoring implementation

OpenGL is again used to allow fast computation of the edge score. Using OpenGL, we

render the partial view surface to a buffer, and use the Z-buffer values to compute edges

with the Canny edge detection algorithm. The previously performed discretization into pixels is used

as the basis for breaking up the edges into points. The same process is used for obtaining

edges of each of the part hypotheses being scored.

4.1.3 Final part score

The final likelihood score of the part is obtained by multiplying together the two constituent

scores:

$$\text{part-score} = \text{area-score} \times \text{edge-match-score}$$

2As with the area score, discretization is based on pixels. Since a pixel farther away represents an edge segment of greater length, we need to weight by the length of that segment so that scores are comparable between placements of parts independent of their distance from the viewpoint.


4.2 Inter-part scores

The inter-part score enforces compatible placements of parts relative to each other, and has

three components. The first two apply to pairs of adjacent parts, while the third applies to

any pair of parts.

4.2.1 Joint spacing score

Two parts can fit the surface and match the edges of the partial view very well, but the

placement of the two parts may not be compatible. For example, the upper left arm may

fit well off to the right side of the torso, or even in the place of one of the legs. The joint

spacing score allows us to penalize placements such as those and to favor placements of

adjacent parts that are close together.

Every two adjacent parts share a single point between them, defined as a joint. In an

ideal placement, this shared point would be in the same location for each part. However, in

a poor placement, the two parts put the point that should be shared at locations in the scene a significant distance apart.

For two parts a and b, having hypotheses i and j respectively, and sharing a joint (a, b), we define $\vec{v}^{\,b}_{a,i}$ to be the 3D location of the joint point for part a under hypothesis i, and $\vec{v}^{\,a}_{b,j}$ to be the location of the joint point for part b under hypothesis j. We can then compute the distance between them, which we denote $x_{(a,i)(b,j)}$:

$$x_{(a,i)(b,j)} = \left\| \vec{v}^{\,b}_{a,i} - \vec{v}^{\,a}_{b,j} \right\|$$


The joint match score is then modeled as a Gaussian distribution with mean 0 and

standard deviation σj :

$$\text{joint-spacing-score} = \exp\!\left(\frac{-x^2_{(a,i)(b,j)}}{2\sigma_j^2}\right)$$

A distance of 0 would receive the maximum score, and the score decreases as the distance

increases.
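A direct sketch of this score (the value of σj is an assumption):

```python
import math

SIGMA_J = 0.5  # assumed joint-spacing standard deviation, in model units

def joint_spacing_score(v_a, v_b):
    """Gaussian penalty on the distance between the two parts' predicted
    3D locations of their shared joint (points given as (x, y, z) tuples)."""
    x2 = sum((p - q) ** 2 for p, q in zip(v_a, v_b))
    return math.exp(-x2 / (2 * SIGMA_J ** 2))
```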

4.2.2 Angle score

Even if two parts have a shared joint that is placed consistently, they can be twisted at

large angles that are actually impossible to achieve with the model. The angle score favors

placements where the angle between two parts is closer to the angle in the model. We

introduce a prior belief over the angle between adjacent parts, with a preference for smaller

angle differences from the model. This helps to eliminate configurations where the object

could be twisted at angles consistent with the data but not with the restrictions of the joint.

For the purposes of the angle score, a part placement can be described with a 3D vector pointing from the centroid of the part to the joint. For a joint (a, b) between parts a and b, we compute these vectors $\vec{w}^{\,b}_{a,i}$ for part a with hypothesis i and $\vec{w}^{\,a}_{b,j}$ for part b with hypothesis j. Using the dot product operation, we can obtain a scalar value for the angle between them, represented by $\theta_{(a,i)(b,j)}$. For the model we also compute the vectors $\vec{w}^{\,b}_{a,*}$ for part a with its placement $*$ in the model and $\vec{w}^{\,a}_{b,*}$ for part b with placement $*$. We can then compute the scalar angle $\theta_{(a,*)(b,*)}$ for the joint (a, b) in the model. The angle score is based on the difference $d = |\theta_{(a,i)(b,j)} - \theta_{(a,*)(b,*)}|$. Because we wish to allow a reasonable


amount of movement, we only penalize angle differences above a certain threshold t. For

angle differences smaller than the threshold, the score is 1. For larger angle differences, we

score using a Gaussian distribution with mean 0 and standard deviation σl:

$$\text{angle-score} = \exp\!\left(\frac{-(d - t)^2}{2\sigma_l^2}\right)$$

A difference less than or equal to t receives the maximum score of 1. The score decreases

as the difference between the angle in the model and the angle of the candidate placements

increases.
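A sketch of the angle score; the threshold t and σl values below are assumptions:

```python
import math

T = math.radians(30)        # assumed slack threshold
SIGMA_L = math.radians(20)  # assumed standard deviation

def angle_between(u, v):
    """Angle between two 3D vectors via the dot product, clamped for safety."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def angle_score(w_a, w_b, w_a_model, w_b_model):
    """Penalize only the portion of the angle difference above threshold T."""
    d = abs(angle_between(w_a, w_b) - angle_between(w_a_model, w_b_model))
    if d <= T:
        return 1.0
    return math.exp(-(d - T) ** 2 / (2 * SIGMA_L ** 2))
```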

4.2.3 Part intersection score

The other requirement we wish to enforce between two parts is that they do not overlap in

space. For example, the elbow joint could be well-placed between two part hypotheses, but

the lower arm could be folded back on top of the upper arm. This overlap problem could also

exist with two non-adjacent parts, where, for example, the lower arm is placed intersecting

the torso. Therefore, we use the volume intersection score to favor placements of two parts

where they do not overlap with each other and to penalize part placements that do.

The part intersection score is based on the amount of overlapping volume between two

parts. Parts are discretized using a 3D grid. We then determine which elements in the grid

are inside the part surface. For each of these points inside a part i we consider whether the

point lies inside another part j, and sum the total number of intersecting points, which we

denote as x.


We then derive a score by modeling the intersection as a Gaussian distribution with mean

0 and standard deviation σv:

$$\text{part-intersection-score} = \exp\!\left(\frac{-x^2}{2\sigma_v^2}\right)$$

As the amount of volume intersection increases, the score for the part placements decreases.
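A sketch using voxel index sets for the grid discretization (the σv value is assumed):

```python
import math

SIGMA_V = 10.0  # assumed standard deviation, in voxels

def part_intersection_score(occ_a, occ_b):
    """occ_a, occ_b: sets of occupied voxel indices (tuples) for two part
    hypotheses on a shared 3D grid; x counts voxels occupied by both parts."""
    x = len(occ_a & occ_b)
    return math.exp(-x * x / (2 * SIGMA_V ** 2))
```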

4.2.4 Final inter-part score

The inter-part likelihood score for adjacent parts is obtained by multiplying the three com-

ponent scores together:

$$\text{inter-part-score} = \text{joint-spacing-score} \times \text{angle-score} \times \text{part-intersection-score}$$

Since the first two components apply only to adjacent parts, the score for non-adjacent parts

consists of only the third one:

$$\text{inter-part-score} = \text{part-intersection-score}$$


Chapter 5

Phase 2B: Inference

Thus far, we have described a process for obtaining a series of location hypotheses for each

of the parts (Chapter 3) and a mechanism for scoring individual part hypotheses and pairs

of hypotheses for different parts (Chapter 4). We now require a mechanism for assigning a

location hypothesis to each part. We do this via inference in a probabilistic network, which

we discuss in this chapter.

5.1 Probabilistic network

As outlined in Chapter 2, we define a probabilistic network with a variable representing each

part’s placement and with edges connecting nodes (parts) that are adjacent in the structure

of the skeleton.

The domains for each of the nodes are the different part location hypotheses for that

part. A part a is defined to have possible hypotheses $h^{(a)}_1, \ldots, h^{(a)}_k$, where each hypothesis

corresponds to a location in the image. We choose the potential to represent the likelihood of

the part being located there. For each hypothesis i for part a we define a singleton potential

using the individual part scores defined in Section 4.1:

$$\phi(h^{(a)}_i) = \text{part-score}(h^{(a)}_i) = \text{area-score}(h^{(a)}_i) \times \text{edge-match-score}(h^{(a)}_i)$$

Next, we add edges between all pairs of adjacent parts in the articulated model skeleton.

The pairwise potential for these edges models the joint likelihood of two parts a and b taking

on particular values in their domains. This potential is modeled on the inter-part scores

described in Section 4.2, and it is used to define a preference for two adjacent parts being placed close together, for two adjacent parts being at an angle relative to each other that is similar to the angle in the model, and for two adjacent parts not overlapping. Specifically,

we create a potential $\psi(h^{(a)}_i, h^{(b)}_j)$ representing the likelihood that part a takes on hypothesis i and part b takes on hypothesis j. We define the potential as:

$$\psi(h^{(a)}_i, h^{(b)}_j) = \text{joint-spacing-score}(h^{(a)}_i, h^{(b)}_j) \times \text{angle-score}(h^{(a)}_i, h^{(b)}_j) \times \text{part-intersection-score}(h^{(a)}_i, h^{(b)}_j)$$


5.2 Inference in network

With the probabilistic network defined, we run a form of inference known as loopy belief

propagation1 [12, 11] to find the maximum likelihood assignment of part hypotheses to

locations. At this point, the most likely location hypothesis for each part can be used as the

location that is output for each part.

The relative strengths of the potentials, which define the trade-off between individual

part placement relative to the partial view and placement of parts relative to each other,

can be controlled by adjusting the values of the standard deviations σa, σe, σj , σl, and σv.
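As an illustration, a minimal max-product loopy belief propagation routine over such singleton and pairwise potentials might look as follows. This is a generic textbook sketch with a flooding message schedule and a fixed iteration count, not the thesis implementation:

```python
def loopy_max_product(domains, singles, pairs, iters=50):
    """Max-product loopy belief propagation on a pairwise model.

    domains: {part: list of hypothesis ids}
    singles: {part: {h: singleton potential}}
    pairs:   {(a, b): {(ha, hb): pairwise potential}}, each pair stored once.
    Returns an (approximate) MAP hypothesis for every part."""
    edges = []
    for a, b in pairs:
        edges += [(a, b), (b, a)]

    def pair_pot(a, b, ha, hb):
        if (a, b) in pairs:
            return pairs[(a, b)][(ha, hb)]
        return pairs[(b, a)][(hb, ha)]

    # msg[(src, dst)][h] is the message from src about dst's hypothesis h
    msg = {(s, d): {h: 1.0 for h in domains[d]} for (s, d) in edges}
    for _ in range(iters):
        new_msg = {}
        for (src, dst) in edges:
            out = {}
            for hd in domains[dst]:
                best = 0.0
                for hs in domains[src]:
                    val = singles[src][hs] * pair_pot(src, dst, hs, hd)
                    for (o, s) in edges:  # messages into src, except from dst
                        if s == src and o != dst:
                            val *= msg[(o, s)][hs]
                    best = max(best, val)
                out[hd] = best
            norm = sum(out.values()) or 1.0  # normalize to avoid underflow
            new_msg[(src, dst)] = {h: v / norm for h, v in out.items()}
        msg = new_msg

    assignment = {}
    for part in domains:
        def belief(h):
            b = singles[part][h]
            for (o, s) in edges:
                if s == part:
                    b *= msg[(o, s)][h]
            return b
        assignment[part] = max(domains[part], key=belief)
    return assignment
```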

5.3 Adding edges to network

After running inference on one of the puppet scenes, we obtain the results shown in Fig-

ure 5.1. One of the first problems we note is that parts are overlapping. This is because

our probabilistic network only has edges enforcing part non-intersection between adjacent

parts. In reality, no two parts should intersect, regardless of whether they are adjacent.

The solution is to add edges between non-adjacent parts that enforce the non-intersection

constraints. We define the potential for these edges in terms of only the part-intersection

score since there is no shared fixed joint with which to calculate a joint spacing score:

$$\psi(h^{(a)}_i, h^{(b)}_j) = \text{part-intersection-score}(h^{(a)}_i, h^{(b)}_j)$$

1Our model, as defined thus far, is actually a tree structure and has no loops. Therefore, an inference algorithm that can handle loops is unnecessary. However, in Section 5.3 we describe how edges are added that lead to loops in the graph and for which loopy belief propagation is necessary.


Figure 5.1: Result of inference in the initial model. Here, non-adjacent parts overlap significantly because edges are not added to constrain their placement.

This potential models the likelihood that part a takes on hypothesis i and part b takes on

hypothesis j.

However, it is not feasible to add edges between all non-adjacent parts because that would

result in a fully-connected graph, which would make inference computationally infeasible.

Further, it is unlikely that most non-adjacent parts would intersect, even without the non-

intersection constraint. For example, the foot would most likely not be placed in the same

location as the head because the surfaces would not match well. Therefore, we run inference

in the network and examine the highest scoring placements for each part. We take all pairs,

and if any intersect more than a certain threshold percentage, we add an edge representing

the non-intersection constraint between the two parts. We then re-run inference in the

network, and repeat this process until no new edges are added. As a result of adding edges


Figure 5.2: Result of inference in the model with additional edges. After first running inference, the legs overlapped. They no longer do, because we have added an additional edge to the network to enforce non-intersection of those parts.

between non-adjacent parts, the network is no longer a tree and loops exist in it. It is for

this reason that we chose loopy belief propagation as our inference mechanism since it is

well-suited to networks with loops.

Employing this mechanism, we are able to improve on the results shown in Figure 5.1,

resolving the problem of the intersecting legs, which produces the results in Figure 5.2.
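The iterative edge-adding procedure can be sketched as a loop. Every callable here (`run_inference`, `top_placements`, `overlap_fraction`, `add_edge`) is a hypothetical interface standing in for the components described above, and the overlap threshold and round cap are assumed values:

```python
def detect_with_intersection_edges(run_inference, top_placements,
                                   overlap_fraction, add_edge,
                                   threshold=0.05, max_rounds=10):
    """Run inference, connect every pair of placed parts whose volumes
    overlap by more than `threshold`, and repeat until no new edge is
    needed. `max_rounds` is a safety cap added in this sketch; the thesis
    simply iterates until convergence."""
    result = None
    for _ in range(max_rounds):
        result = run_inference()
        added = False
        placements = top_placements(result)
        parts = list(placements)
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                a, b = parts[i], parts[j]
                if overlap_fraction(placements[a], placements[b]) > threshold:
                    if add_edge(a, b):  # returns False if the edge exists
                        added = True
        if not added:
            return result
    return result
```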

5.4 Largest connected component

There is still one major limitation of this probabilistic model: it requires that every part of

the model be placed somewhere in the scene. However, in scenes with heavy occlusion this

is not necessarily practical or even desirable. Consider a scene with heavy occlusion, such


as that shown in Figure 5.2. Here, some of the limbs of the puppet are completely hidden

from view. Because none of their surface is visible, there is no data from which to obtain

placement hypotheses consistent with the rest of the puppet parts. Even the mechanisms for

suggesting part placements based on neighbors (described in Section 3.4) cannot provide good

hypotheses with so much missing information. Thus, we will not have any good placement

candidates in the domains of those parts, and the only hypotheses we have for those parts

would match other surfaces, thereby placing the hidden parts in locations inconsistent with

the other parts. Therefore, we relax the constraint that all parts must be placed in the scene,

choosing instead to obtain the largest good-scoring connected component.

In practice, we allow parts to be missing by including a "part-missing" hypothesis in the

domain for each part. In our model we designate a root part; we allow any part or parts to

be missing, with the additional constraint that if a part on a path from the root is missing,

all parts further down that path must also be missing. Since we have a tree structure in our

puppet and human skeletons, the largest connected component that encompasses the root

part results. In the case of the puppet and human, the upper torso serves as a good root

part.2

2In practice, the upper torso is a good choice of root part because if it is present in the scene, it is generally easy to match because of its large area and distinctiveness. If it is not present and we are unable to suggest a location for it based on either of the arms, the head, or the lower torso, then most of the object is probably not visible in the scene, and the likelihood of successfully obtaining any of it is low.

Selecting the upper torso as the root part for the puppet and human does not diminish the generality of our algorithm. For objects that don't have a clear choice for a good root part, multiple root parts could be considered using one of two different mechanisms. The first and simplest approach is to build the probabilistic network with each part in turn designated the root and re-run inference. Each time we would obtain an assignment of parts to locations, and the assignment with the highest score would be output as the result. Alternatively, we could avoid building the probabilistic model and running inference multiple times by constructing a richer model: we could add additional variables (and corresponding pairwise constraints) that would encode both which part is designated as the root part and the presence and absence of parts.


The score associated with this part-missing hypothesis is neutral, so that it does not

score better than a part that explains area or edge, but does score better than a part that

occludes surface which it should not. We define the individual part potential for the part-

missing hypothesis with a uniform neutral score δ as:

$$\phi(h^{(a)}_{\text{null}}) = \delta$$

The pairwise scores between a missing part and its present neighbor are also neutral:

they do not impose joint spacing or part-intersection penalties. We modify the pairwise

potential to enforce the requirement that a part farther down the tree from a missing part

is not present. The pairwise potential is as follows for a part b below a part a in the tree,

where i and j are real placement hypotheses and null is the part-missing hypothesis, and

where ζ is a uniform neutral score:

$$\psi(h^{(a)}_i, h^{(b)}_j) = \text{joint-spacing-score}(h^{(a)}_i, h^{(b)}_j) \times \text{angle-score}(h^{(a)}_i, h^{(b)}_j) \times \text{part-intersection-score}(h^{(a)}_i, h^{(b)}_j)$$

$$\psi(h^{(a)}_{\text{null}}, h^{(b)}_j) = 0$$

$$\psi(h^{(a)}_{\text{null}}, h^{(b)}_{\text{null}}) = 1$$

$$\psi(h^{(a)}_i, h^{(b)}_{\text{null}}) = \zeta$$
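A sketch of this extended pairwise potential for a parent part and its child, with the value of ζ (`ZETA`) assumed since the thesis leaves it unspecified:

```python
NULL = "part-missing"
ZETA = 0.3  # assumed uniform neutral score

def pairwise_potential(h_parent, h_child, real_potential):
    """Pairwise potential between a part a and its child b in the skeleton
    tree, extended with the part-missing hypothesis. `real_potential` is the
    ordinary product of joint-spacing, angle, and intersection scores."""
    if h_parent == NULL and h_child == NULL:
        return 1.0
    if h_parent == NULL:
        return 0.0   # a present part may not hang below a missing one
    if h_child == NULL:
        return ZETA  # neutral: the child is simply absent
    return real_potential(h_parent, h_child)
```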

We could then run inference in the enhanced network to obtain our result.


This creates a model in which a part will always be added to the network as long as its

addition (or the addition of it and some of the parts beneath it in the tree) has a positive

effect on the likelihood of the resulting configuration. That is, a part is added when the area

and edge it matches outweighs the penalty from non-matching area, joint-spacing mismatch,

angle mismatch, and part-intersection. This is a desirable outcome because we do not force

ourselves into inconsistent part placement assignments by requiring the presence of parts

which are not visible in the scene and do not enable us to place other parts. By limiting

ourselves to the search for the largest connected component, we obtain the results shown in

Figure 5.3.

Now that we have our initial results, we proceed to a post-processing phase described in

the next chapter.


Figure 5.3: Result of inference in the model where the constraint that all parts must be present is relaxed and we search for the largest connected component. The arm parts, for which we lack good placement hypotheses, are no longer present, and we are not forced into an inconsistent configuration where the arms are placed incorrectly relative to other parts.


Chapter 6

Phase 2C: Post-processing and

Further Inference

At this point in the process, we have run inference in our probabilistic network as described

in Chapter 5 to obtain the most likely placement of parts. However, these placements can

be further refined, additional hypotheses can be generated, and the inference process can be

repeated to yield improved results. We describe these post-processing mechanisms in this

chapter.

6.1 Articulated ICP

Once we have the resulting best hypothesis for each part from inference, we can further

fine-tune them using a process called Articulated ICP. The goal is to better align the parts

with the partial view surface and with the neighboring part joints.


We take each of the parts included in our results and associate points on the parts with nearby points on the partial view. We also associate the joint points in each part

with the corresponding joint points in the neighboring parts. We then use ICP to solve

for transformations for each part that move these associated pairs of points closer together.

After applying the transformations, the process is repeated for a fixed number of iterations.

Unlike the ICP we employed earlier, where only one part was moved at a time, our

articulated ICP approach here allows us to move all the parts at once and relative to each

other. The result is an improved alignment, where the parts are drawn together or pushed

apart at the joints so that they align well and are adjusted to match the surface of the partial

view.
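The inner step of each ICP iteration, solving for a rigid transformation from the associated point pairs, can be illustrated with the standard Kabsch/SVD solution. This is the classical single-part update; the thesis's articulated variant additionally couples all parts through their shared joint points:

```python
import numpy as np

def best_rigid_transform(src, dst):
    """Kabsch/Procrustes: least-squares rotation R and translation t mapping
    the Nx3 point set src onto dst, i.e. dst ~ src @ R.T + t."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ cs
    return R, t
```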

6.2 Missing part hypothesis generation

At this point, we have an articulated model containing the largest connected component.

However, some parts may still be missing if we did not have good hypotheses when we

initially ran inference. In this phase, we generate new hypotheses for parts not included in

initial results but adjacent to a part that was included. In essence, we are attempting to

grow outward from our starting connected component.

First, for any neighbors of missing parts, we use the new hypothesis that resulted from

articulated ICP to suggest a hypothesis for this missing part based on the spin image, as we

did in the initial domain enrichment phase prior to inference (described in Section 3.4).

Additionally, for any neighbors of missing parts, we take the top K (in this case, 3)


hypotheses from inference plus the hypothesis from articulated ICP on that part, and we

use these K + 1 hypotheses to generate new hypotheses for the missing neighboring parts

using a uniform sampling approach. For each of these hypotheses, we examine the sphere

around the joint shared with the missing part and consider the placement of the missing

part pointing outward from the joint at fixed intervals of angles in that sphere. For each

of these new hypotheses, we run ICP to align it with any surface in the vicinity. We then

calculate the individual part score (Section 4.1) for each new hypothesis and based on that

keep the top M (in this case, 3). In total, this yields (K + 1) ∗M new hypotheses for the

missing part. This uniform sampling allows us to identify possible part placements that may

have been missed because, for various reasons, the spin images were not good enough to

obtain a good match. This process is not feasible at earlier stages when we have many more

hypotheses because it requires running ICP on too many samples and generates too many

new hypotheses. However, it works well when we have a limited number of good hypotheses

for the present parts.
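The sampling procedure above can be sketched as follows. This is a minimal illustration, not the thesis implementation: `refine_icp` and `part_score` are hypothetical stand-ins for the ICP alignment and the individual part score of Section 4.1, and the angular resolution is an assumed parameter.

```python
import numpy as np

def sphere_directions(n_theta=8, n_phi=4):
    """Unit direction vectors at fixed angular intervals over the sphere."""
    dirs = []
    for i in range(n_phi):
        phi = np.pi * (i + 0.5) / n_phi            # polar angle
        for j in range(n_theta):
            theta = 2.0 * np.pi * j / n_theta      # azimuth
            dirs.append((np.sin(phi) * np.cos(theta),
                         np.sin(phi) * np.sin(theta),
                         np.cos(phi)))
    return np.array(dirs)

def missing_part_hypotheses(neighbor_poses, joint_local, part_length,
                            refine_icp, part_score, top_m=3):
    """For each of the K+1 neighbor hypotheses, place the missing part
    pointing outward from the shared joint in each sampled direction,
    refine with ICP, score, and keep the top_m candidates per neighbor,
    yielding (K+1) * top_m hypotheses in total."""
    hypotheses = []
    for pose in neighbor_poses:                    # 4x4 rigid transforms
        joint_world = pose[:3, :3] @ joint_local + pose[:3, 3]
        candidates = []
        for d in sphere_directions():
            tip = joint_world + part_length * d    # tentative part endpoint
            cand = refine_icp(joint_world, tip)    # align to nearby surface
            candidates.append((part_score(cand), cand))
        candidates.sort(key=lambda c: c[0], reverse=True)
        hypotheses.extend(c for _, c in candidates[:top_m])
    return hypotheses
```

With K + 1 = 4 neighbor hypotheses and top_m = 3, this produces the 12 hypotheses per missing part described above.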

6.3 Repeating inference

The final stage in our process is to repeat inference. We set up the probabilistic model as

before. For all parts, we include the top K hypotheses from inference in their domains. For

the parts present in our previous result, we also add the hypothesis from articulated ICP to

the domain. For the missing parts that are neighbors to these parts, we add the hypotheses

generated as described above to their domains.


We run inference as before and then apply articulated ICP to get our new result.

If more parts are present in the result, there may now be new missing parts that are

adjacent to the present parts. We repeat the process of missing part hypothesis generation,

inference, and articulated ICP until no new parts are added.

When we reach the point that no new parts are added, we return the skeleton resulting

from articulated ICP as our final result. We have now completed the process, and have a

placement in the scene for each of the parts that are in the largest connected component we

can find.
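The iterate-until-convergence process above can be sketched as follows; this is only a structural sketch, in which `generate_hypotheses`, `run_inference`, and `articulated_icp` are hypothetical stand-ins for the corresponding stages, and the part and adjacency representations are assumptions.

```python
def grow_model(initial_parts, all_parts, adjacency,
               generate_hypotheses, run_inference, articulated_icp):
    """Repeat missing-part hypothesis generation, inference, and
    articulated ICP until no new parts are added to the result."""
    placed = dict(initial_parts)               # part name -> placement
    while True:
        # Missing parts that are adjacent to a part already placed.
        missing = [p for p in all_parts
                   if p not in placed and any(n in placed for n in adjacency[p])]
        if not missing:
            break                              # nothing left to grow into
        new_hyps = {p: generate_hypotheses(p, placed) for p in missing}
        result = articulated_icp(run_inference(placed, new_hyps))
        if set(result) == set(placed):
            placed = result                    # no new parts: converged
            break
        placed = result
    return placed
```

Each pass can only add parts adjacent to the current component, so the loop grows the skeleton outward and terminates once an iteration adds nothing new.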


Chapter 7

Experimental Results

We present results of running our algorithm (described in the preceding chapters) on two

data sets: scans of the model puppet described before and a series of scans of a human.

We tested our algorithm on several scenes involving the 15-part model puppet viewed in

different poses and from various directions. The scenes have various cluttering and occluding

objects. The scans were generated with a temporal stereo scanner [4], which uses two cameras to capture stereo information. Light is projected in changing bands onto the

object and captured by the two cameras. Using the observations of the light bands from

the two viewpoints, the 3D surface mesh can be extrapolated from a series of images. One

of the major challenges with this data, however, is the presence of large shadows. A surface point can only be located when it is visible to both cameras and reached by the light source. As

a result of this, any objects farther forward in the scene essentially cast three shadows (one

blocking the light, one blocking the left camera, and one blocking the right camera). This

is the reason the ring in Figure 1.1(b) hides a larger portion of the lower torso than the


ring’s own surface would otherwise account for. This is also the reason for missing areas of

the background. An important consequence of this is that there are large regions where we

can neither assume the presence nor absence of surface (because it is possible that there is

surface there which the object is hiding in one of its shadows). As a result, the detection

problem is more challenging.
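As a rough illustration of the triangulation principle such stereo scanners rely on (a simplified rectified-pair model, not the spacetime stereo algorithm of [4] itself): once a surface point has been matched across the two camera views, its 3D position follows from the disparity between the two image locations.

```python
import numpy as np

def triangulate_rectified(u, v, disparity, f, cx, cy, baseline):
    """Back-project a matched pixel into 3D in the left-camera frame,
    assuming a rectified stereo pair with focal length f (pixels),
    principal point (cx, cy), and camera baseline (metres):
        Z = f * baseline / disparity
        X = (u - cx) * Z / f,  Y = (v - cy) * Z / f
    """
    Z = f * baseline / disparity
    return np.array([(u - cx) * Z / f, (v - cy) * Z / f, Z])
```

A point at the image centre with a 10-pixel disparity, f = 500 px, and a 10 cm baseline, for instance, lies 5 m from the cameras. A point the light bands or either camera cannot see produces no match at all, which is exactly the shadowing problem described above.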

A sample of the puppet scenes and the resulting puppet embeddings are shown in Figure

7.1. We correctly identified the torso and head core of the puppet in almost all cases, even

in situations with substantial occlusion. In most cases, we were also able to place the limbs

correctly. This was possible even in scenes with substantial occlusion or parts not visible.

For example, in the scene of the puppet kicking the ball (Figure 7.1(a)), its own arm occludes

much of it from view. In the scene with the puppet holding two sticks (Figure 7.1(g)), much

of the puppet is not visible, but the limbs were still placed in generally correct locations by our algorithm.

Even in situations where limbs were not placed correctly, they were placed in a configuration consistent with the scene data. For example, in the scene of the puppet holding the

ring (Figure 7.1(d)), the leg was turned up and placed along the surface of the ring. This is

consistent with our model, in which we wish to place our object in such a way that it explains

the observed data. In the scene in which we obtained the worst results, i.e., the scene with the puppet holding the ring around its waist (Figure 7.1(e)), the puppet was placed in a manner entirely consistent with the observed data, although it was twisted in an unusual manner. Missing

parts were generated to fill in around other hypotheses and complete the puppet skeleton.


(a) Puppet kicking ball (b) Puppet kneeling next to cup (c) Puppet with a smaller puppet on its shoulders

(d) Puppet holding a ring (e) Puppet with a ring around it (f) Puppet stepping on an object

(g) Puppet with sticks (h) Puppet with a wire snaking around

Figure 7.1: Sample of scenes and the resulting puppet embeddings


This, in fact, shows the power of our hypothesis generation mechanism and the difficulty of

distinguishing an object from similarly shaped nearby surfaces.

We also tested our algorithm on a human data set, matching a 16-part articulated model

to a series of partial view scans of a human, produced using a Cyberware WRX scanner.

This scanner functions by measuring the time it takes for a laser beam to be reflected back by the surface. Using this, it is possible to compute the distance to the surface and construct a 3D mesh. The objects (in this case a human and various occluding objects) are placed in

the scanner and scanned from four sides. A single one of these four scans is a partial view.

Human data introduces a new challenge in addition to occlusion: individual parts are now deformable, since muscle and tissue are non-rigid.

Figure 7.2 shows four human scenes and the corresponding pose recovered for each one.

In the human scenes, we found the correct placement of the head and torso, and were

generally able to find the arms and legs. In the pose with the person’s arms crossed in front

(Figure 7.2(d)), we were able to correctly reconstruct the entire skeleton, despite the fact

that most of the torso is missing due to occlusion from the arms and missing data from

the scan due to shadows. In the scenes with the person partially occluded by the chair

(Figure 7.2(b)) and the board (Figure 7.2(c)), we were able to recover the correct placement

of most of the limbs. In the cases where the limbs were not placed in the correct location,

they were consistent with the observed data, either placed behind an occluding object or in

the unknown area surrounding the person (recall that we cannot make assumptions from the

absence of surface and must treat those regions as unknown).


(a) Person standing (b) Person with chair

(c) Person holding board (d) Person with arms crossed in front

Figure 7.2: Sample of 4 scenes and the resulting human poses. For each pair, the image onthe left is the original scene and the image on the right is the recovered human pose.


Our algorithm functioned without changes for both data sets. Both sets of results were

generated with the same parameters, with the exception of the threshold for the joint angle

prior, which differed for humans and puppets because the puppet joints have a greater

freedom of movement than human joints.


Chapter 8

Conclusions and Future Directions

We have developed an algorithm for determining the pose of articulated objects

in 3D range data. Our algorithm begins with detectors that provide initial hypotheses. It

then uses efficient scoring mechanisms that we have defined to compute scores for individual

parts and pairs of parts. Our approach uses these scores to define a probabilistic model

with which we can estimate the most likely pose given the observed data. In the end, we

demonstrated successful application to puppet and human data sets.

This work addresses a challenging problem: that of detecting occluded objects in range

images and determining their pose. It presents a novel solution, based on a well-formulated

probabilistic framework. Object detection in scenes has a broad range of potential applications, from automated monitoring and analysis to robotic systems that interact with the real world.

Future work will involve testing with other articulated model data sets, including objects that are not human-shaped. In addition, we would like to explore learning features of the


space of poses. One major obstacle to obtaining better results is that we have no specific prior

knowledge of the space of poses for an object. By learning this information and incorporating

it, we will be able to distinguish between various poses that are all consistent with the data

but not necessarily consistent with physical limitations of the object. This would allow us

to better recover the actual pose of the object and improve the results.


Bibliography

[1] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, H.-C. Pang, and J. Davis. The cor-

related correspondence algorithm for unsupervised surface registration. In Proc. NIPS,

2004.

[2] P. Besl and N. McKay. A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(2):239–256, 1992.

[3] J. Canny. A computational approach to edge detection. IEEE Trans. Pattern Anal.

Mach. Intell., 8(6):679–698, 1986.

[4] J. Davis, R. Ramamoorthi, and S. Rusinkiewicz. Spacetime stereo: A unifying frame-

work for depth from triangulation. In CVPR, 2003.

[5] P. Felzenszwalb and D. Huttenlocher. Efficient matching of pictorial structures. In Proc.

CVPR, pages 66–73, 2000.

[6] D.M. Gavrila. The visual analysis of human movement: A survey. Computer Vision

and Image Understanding, 73(1):82–98, 1999.


[7] S. Ioffe and D. Forsyth. Probabilistic methods for finding people. Int. Journal of

Computer Vision, 43(1):45–68, 2001.

[8] A. Johnson. Spin-Images: A Representation for 3-D Surface Matching. PhD thesis,

Carnegie Mellon University, 1997.

[9] A. Johnson and M. Hebert. Using spin images for efficient object recognition in cluttered

3-D scenes. IEEE Trans. Pattern Anal. Mach. Intell., 21:433–449, 1999.

[10] G. Mori and J. Malik. Estimating human body configurations using shape context

matching. Proc. ECCV, 3:666–680, 2002.

[11] K. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate infer-

ence: An empirical study. In Proc. Fifteenth Conference on Uncertainty in Artificial

Intelligence (UAI ’99), pages 467–475, 1999.

[12] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, San Fran-

cisco, 1988.

[13] L. Sigal, S. Bhatia, S. Roth, M. J. Black, and M. Isard. Tracking loose-limbed

people. Proc. CVPR, 1:421–428, 2004.

[14] E. Sudderth, M. Mandel, W. Freeman, and A. Willsky. Distributed occlusion reasoning

for tracking with nonparametric belief propagation. In Proc. NIPS, 2004.

[15] J. Sullivan and S. Carlsson. Recognizing and tracking human action. Proc. ECCV,

1:629–644, 2002.


[16] K. Toyama and A. Blake. Probabilistic tracking with exemplars in a metric space. Proc.

ICCV, 48:9–19, 2001.

[17] S. Yu, R. Gross, and J. Shi. Concurrent object recognition and segmentation with graph

partitioning. In Proc. NIPS, 2002.
