
Learning Generative Models of Scene Features∗

Robert Sim and Gregory Dudek
{simra,dudek}@cim.mcgill.ca
Centre for Intelligent Machines, McGill University
3480 University St., Montreal, Canada H3A 2A7

Keywords

Robot pose estimation, appearance-based modeling, probabilistic localization, generative modeling, feature extraction, feature representation, visual attention.

Abstract

We present a method for learning a set of generative models which are suitable for representing selected image-domain features of a scene as a function of changes in the camera viewpoint. Such models are important for robotic tasks, such as probabilistic position estimation (i.e. localization), as well as visualization. Our approach entails the automatic selection of the features, as well as the synthesis of models of their visual behavior. The model we propose is capable of generating maximum-likelihood views, as well as a measure of the likelihood of a particular view from a particular camera position. Training the models involves regularizing observations of the features from known camera locations. The uncertainty of the model is evaluated using cross validation, which allows for a priori evaluation of features and their attributes. The features themselves are initially selected as salient points by a measure of visual attention, and are tracked across multiple views. While the motivation for this work is for robot localization, the results have implications for image interpolation, image-based scene reconstruction and object recognition. This paper presents a formulation of the problem and illustrative experimental results.

∗Portions of this paper appeared in R. Sim and G. Dudek, “Learning Generative Models of Scene Features”, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Hawaii, 2001.

1 Introduction

This paper describes a technique for learning a set of image-domain models of selected features in a scene, and then using them for camera position estimation. The models capture not only projective geometry, but also appearance variation due to perspective and illumination phenomena arising from changes in viewpoint. We also measure our confidence in each model so as to deliver likelihood estimates of future observations. Our goal is to employ these models for a variety of visualization and robotics tasks. In this paper we consider specifically the task of robot localization.

Robot pose estimation, or localization, is an important prerequisite for autonomy. A naive approach to localization is to use odometers or accelerometers to measure the displacements of the robot. This approach is subject to errors due to external factors beyond the robot's control, such as wheel slippage, or collisions. More importantly, dead reckoning errors increase without bound unless the robot employs sensor feedback in order to recalibrate its position estimate.

The vast majority of sensor-driven localization methods rely on range data, typically derived from a laser range-finder or sonar [12, 26, 31, 2]. One issue with most range-finding sensors is that they require the emission of energy into the environment, either as light or sound. Furthermore, the data that is collected from such a sensor is often noisy and highly generic in nature, making the problem of selecting features and disambiguating them expensive. The typical solution to disambiguation is to employ Markov localization [7], whereby the robot travels through the environment collecting data until there is sufficient information to disambiguate between similar positions. However, it is often the case that a single camera image from the robot's current position will be sufficient to accurately localize, given the richness of the sensor content and the variety of the visual world. One goal of our work is to enable localization using an uncalibrated monocular vision system. We avoid stereo-based vision systems and structure-from-motion approaches due to the expense of calibration, their dependence on a specific imaging geometry and their reliance on geometric, rather than visual, properties of the environment.

An important feature of our work is the development of a generative framework for feature modeling. Generative models make it possible to predict image feature behavior as a function of robot pose. This affords several advantages, not the least of which are the ability to visualize the data for diagnostic and demonstration purposes, and their easy applicability to a variety of existing localization frameworks, including a Bayesian approach (such as Markov localization), and derived approximations, such as a Kalman filter.

Our work is among the first to employ generic image-domain feature models that do not rely on assumptions concerning feature and/or camera geometry. We explicitly employ image features, rather than global image properties, because they provide robustness to limited illumination variation, partial occlusion due to scene dynamics and possibly even small changes in camera parameters. Furthermore, the computational complexity of inference is reduced by using only subregions of an image, a feature that evolutionary biology has exploited with remarkable success.

An important aspect of feature modeling is the selection and evaluation of the features themselves. Our approach to this problem is to employ a model of visual saliency to initially select candidate features, and track them across an ensemble of training images. Given these tracked feature observations, a set of feature models are constructed and subsequently evaluated and filtered using cross-validation. The result is a set of feature models that have been determined to be reliable for tracking and localization.

The feature models examined in this paper are generative in nature. The benefit of employing a generative model lies in its wide applicability to the problem of inference. For example, many robotic tasks include the problem of evaluating the likelihood of an observation z of the environment given some piece of relevant information q, such as the location of the camera, or a particular object model hypothesis. The likelihood function p(z|q) is useful for the inference of the maximum likelihood location or model q* using Bayes' Rule:

p(q|z) = p(z|q) p(q) / p(z)    (1)

where

q* = arg max_q p(q|z)    (2)

which in this context relates an observation to the pose from which it was most likely observed.

As an illustrative example, Figures 1 a) and c) depict images from a laboratory environment from two known poses q0 = 0 and q1 = 1. Given the image in Figure 1 b), taken from an unknown pose q which lies somewhere on the line connecting q0 and q1, the task of localization is to find a q* which maximizes the likelihood of the image according to Equation 1.

Figure 1: Laboratory scene: a) known pose q = 0, b) q unknown, c) known pose q = 1.

Rather than computing the likelihood of the entire image, which is a computationally complex problem, a set of models of local image features can be used to compute the likelihood of observations of these features from a particular pose. This is accomplished for any given feature f by computing the maximum likelihood observation z* given the pose of the camera q, and employing an associated model uncertainty to compute the likelihood function p(z|q) based on ||z − z*|| over some metric space. The resulting set of distributions (one for each feature) can be combined in a robust manner and a distribution over pose space computed according to Equation 1.

Our approach operates by automatically selecting potentially useful features {fi} from a set of training images of the scene taken from a variety of camera poses (i.e. samples of the configuration space of the sensor). The features are selected from each image at each position on the basis of the output of a visual attention operator and are tracked over the training images. This results in a set of observations for each feature, as they are observed from different positions. For a given feature f, the reconstruction task then becomes one of learning the imaging function Ff(·), parameterized by camera pose, that gives rise to the imaged observation z* of f:

z* = Ff(q)    (3)

Clearly, the imaging function is also dependent on scene geometry, lighting conditions and camera parameters, which are difficult and costly to recover [25]. Traditional approaches to the problem of inferring Ff(·) have either focused on recovering properties of the feature under strict surface or illumination constraints (c.f. [1]), or developed implicit appearance-based representations (for example, principal components analysis) derived from the entire image, which often ignore the effects of geometry, and hence lead to blurred interpolations between views. Our work addresses the problems inherent in appearance-based representations by implicitly capturing feature geometry, as well as appearance. That is, both the appearance and geometric attributes of the feature are captured in a single regularization framework. We accomplish this by representing geometry in the space of affine transformations of the image in the neighborhood of the feature. The best-fit transformation parameters are clearly dependent on the camera position, and can be applied as a precursor to developing an appearance-based representation, which is better suited to representing variation due to radiosity and illumination conditions. The resulting models recover the viewpoint-dependent behavior of the features without explicit parameterizations of the feature geometry or illumination. Furthermore, the application of an attention operator allows one to focus on the local behaviors of features, which themselves may be easier to model than global properties, while providing robustness to errors arising from scene dynamics and sensor occlusion.

In the next section we consider prior work on the problem of vision-based robot localization.

2 Prior Work

Our work is motivated by a need to address the task of probabilistic robot mapping, localization and navigation using a vision sensor. Prior work on this task has been successful using sonar and other range-sensing modalities [12, 26, 31]. Recent work by Nayar et al. and Pourraz and Crowley has examined an appearance-based model of the environment and performs localization by interpolation in the manifold of principal components (PCA) [19, 16]. In other work, Dellaert et al. have demonstrated the feasibility of employing an optical sensor in the Markov framework [4]. In that work the model of the environment is reduced to an overhead planar mosaic, and the sensor model is reduced to a single intensity measurement derived from the center of the image at each camera location. While these approaches demonstrate the utility of appearance-based modeling, they can suffer due to the dependency of the result on global sensor information and any assumptions they make concerning the structure of the environment. Furthermore, it is not clear that a standard PCA-based representation can scale easily to larger environments.

Recent work by Se et al. [22], by Lowe [14], by Jugessur and Dudek [5] and by Schmid [21] in the problem domains of object recognition and robot localization demonstrates that object descriptions are captured well by local pseudo-invariants. In these works, an attention-like mechanism is employed to extract a set of local object features, and the features are matched against previously learned features for each object class or robot pose. The benefits of local representations include robustness to partial occlusion and sensor noise. An important aspect of these works is the task of recognizing pseudo-invariants under changes in viewing conditions. In particular, the attention operators developed are robust to changes in scale and planar rotation. For the localization problem, it is not only important to be able to recognize pseudo-invariants, but to be able to parameterize the effects that changes in pose have on the feature. While our current work considers only translation invariance, these prior works indicate the feasibility of readily including other parameterizations.

Our prior work has demonstrated the utility and potential accuracy of feature-based models for localization [23]. In that work, features were modeled using linear subspace analysis and pose estimates determined by projection into the space spanned by the feature observations. However, this framework presupposed a one-to-one mapping between feature observations and pose. A voting mechanism was employed to disambiguate between poses with similar observations, but a description of the resulting probability distribution over the pose space was not produced. This paper addresses these issues by reconsidering the problem in the context of producing generative models of feature behavior, and considering the full posterior distribution over the pose space.


Several authors have addressed the problem of appearance-based modeling [10, 3, 13, 15, 29, 1], and in particular the problem of inferring appearance from novel viewpoints. Several of these approaches assume knowledge of the geometric structure of the environment, while others operate only over very limited distances, with very simple image variations or very specific object models (such as faces). In this paper we employ a technique that can function with very few assumptions with respect to the scene structure and which can recover image structure even over large variations in viewpoint.

In the subsequent sections we provide an overview of the feature learning framework, discuss the feature model, feature detection and feature tracking, and finally consider the application of the model to the tasks of scene reconstruction and robot localization. We subsequently present experimental results to validate our approach.

3 The Learning Framework

In this section we present our approach to collecting and extracting observations of scene features. This process is necessary in order to a) instantiate models in the first place, and b) consider a wide variety of potential features.

3.1 Overview

The learning approach operates as follows (Figure 2):

1. The robot explores the environment, collecting images from a sampling of positions. It is assumed that a mechanism is available for accurate pose estimation during the exploratory stage (such as assistance from a second observing robot [20], or the utilization of an expectation-maximization approach to map building [26]).

2. A subset of images is selected that spans the explored pose space, and candidate features are extracted from them using a model of saliency.

3. For each extracted feature, a generative feature model is initialized.

4. The generative model is applied in conjunction with the saliency measure to locate a match to each feature in each of the collected images (as described below). As new observations (matches) are found, the generative model is updated.


Figure 2: Learning framework: An ensemble of images is collected (top rectangles) sampling views of the scene. Candidate features fi are extracted and matched, and subsequently modeled using a generative model Fi(·). Refer to text for further details.

5. When the matching is complete, a confidence measure is computed for each feature model, and the models are stored for future use.

Note that while we have presented our approach as a batch computation over the training images, it is sequential in nature and the matching and model updating can be performed in conjunction with the collection of new training images. Our method is in fact an anytime algorithm and the map can be used for localization before it is completed.

In the following sections we will discuss the details of how features are detected and tracked in order to collect training observations, and subsequently how the features are modeled from the training data.

3.2 Feature detection

Potential features are initially extracted from a subset of the training images using a model of visual saliency. In this work we employ edge density as our attention operator.


The details of the density operator are as follows. We define the measure of local edge density as the operator Ψ(x), where x = [u v]^T is an image location. The magnitude image computed from the Canny edge detector is convolved with a Gaussian kernel and local maxima of the convolution are selected as salient features. Define X = {∀x ∈ I} as the set of points in the image I, and the initial set of features, M0 = {arg max_{x∈X} Ψ(x)}, that is, the point in the image where the saliency function Ψ is maximal. Then define the set of candidate points at the ith iteration to be

Ui = {x ∈ X : ∀mj ∈ Mi, ||x − mj||_2 > σ}    (4)

where σ is the standard deviation of the Gaussian mask used to define Ψ, and the set of features at the ith iteration to be

Mi = Mi−1 ∪ {arg max_{x∈Ui} Ψ(x)}    (5)

Iteration halts when max_{x∈Ui} Ψ(x) falls below a threshold which is defined as t = µD + kσD, representing a user-defined k standard deviations from the mean density. In this paper we use σ = 8 for the Gaussian convolution and k = 1.0 for the saliency threshold.
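To make the selection loop concrete, the following is a minimal sketch of the procedure, assuming a precomputed Canny edge magnitude image; the function name, the use of scipy's gaussian_filter, and the array handling are our own illustrative choices, not the authors' implementation.

    # Minimal sketch of the saliency-based selection of Equations 4 and 5.
    # 'edge_magnitude' is assumed to be a precomputed Canny magnitude image;
    # names and library choices are illustrative, not the authors' code.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def select_salient_points(edge_magnitude, sigma=8.0, k=1.0):
        # Psi: local edge density = Gaussian-smoothed edge magnitude.
        psi = gaussian_filter(edge_magnitude.astype(float), sigma)
        threshold = psi.mean() + k * psi.std()       # t = mu_D + k * sigma_D
        candidates = np.ones(psi.shape, dtype=bool)  # admissible points U_i
        features = []                                # selected features M_i
        rows, cols = np.ogrid[:psi.shape[0], :psi.shape[1]]
        while True:
            masked = np.where(candidates, psi, -np.inf)
            idx = np.unravel_index(np.argmax(masked), psi.shape)
            if not np.isfinite(masked[idx]) or masked[idx] < threshold:
                break
            features.append(idx)
            # Equation 4: exclude points within distance sigma of the feature.
            dist2 = (rows - idx[0]) ** 2 + (cols - idx[1]) ** 2
            candidates &= dist2 > sigma ** 2
        return features, psi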

Figure 3: Detected features in an image. The original image, and the convolved edge map or density function. The extracted features are marked by squares.

Figure 3 depicts the selected features from an image as superimposed squares over the original, and the convolved edge map. Our experience with this operator suggests that it is reliable for candidate feature selection. However, in some circumstances a more sophisticated operator, such as a corner detector or other measure of saliency, may be required. These circumstances include images where maxima in edge density are not well-localized, such as the presence of long prominent edges where edge density is roughly the same along the length of the edge.

3.3 Feature matching

Once an initial set of features has been extracted, the next phase involves matching the detected features over the entire training image set. The training images are sorted in order of distance from the centroid of the training poses, and each training image is searched in sequence for each feature. The camera pose of any given training image is known, and therefore we compute the generative model of the feature (described below) to predict the intensity image lf of the feature for the training image being searched. We define the best match to lf in the image to be the image sub-window l* centered at position t* = (x*, y*) that has maximal correlation ρ with the predicted image lf:

ρ = cos θ = (l(x,y) · lf) / (||l(x,y)|| ||lf||)    (6)

Rather than search the entire image for an optimal match to lf, we stochastically sample image locations by weighting them according to the saliency operator Ψ(x), and perform gradient ascent in the neighborhood of each sampled point. Sampling repeats until 50% of the total image saliency Σ_x Ψ(x) is considered. This approach enables matching in neighborhoods beyond the set of local maxima in Ψ(·), while avoiding the cost of computing ρ over the entire image.

When the sub-window maximizing Equation 6 is determined, the corresponding intensity neighborhood and position [i* t*] is added to the feature model for f. When every training image has been considered, we have a set of matched features, each of which is comprised of a set of observations from different camera poses. Figure 4 depicts one such set, where each observation is laid out at a grid intersection of an overhead view of the pose space corresponding to the location from which it was obtained; grid intersections where there is no observation correspond to locations in the pose space where the feature was not found in the corresponding training image. Note that the generative nature of the matching mechanism allows the appearance of the feature to evolve significantly over the pose space.
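The following sketch illustrates one plausible implementation of this saliency-weighted matching, scoring candidate windows with the normalized correlation of Equation 6 until half of the total saliency mass has been sampled. The gradient-ascent refinement around each sample is omitted, and all function and variable names are assumptions for illustration.

    # Illustrative sketch of saliency-weighted matching (Equation 6).
    # 'image' is a grayscale array, 'predicted' the patch lf generated by the
    # feature model, 'psi' the saliency map; all names are assumptions.
    import numpy as np

    def correlation(window, patch):
        # rho = cos(theta) between the unrolled window and the predicted patch.
        a, b = window.ravel(), patch.ravel()
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    def match_feature(image, predicted, psi, coverage=0.5, seed=0):
        rng = np.random.default_rng(seed)
        h, w = predicted.shape
        probs = psi.ravel() / psi.sum()
        best_rho, best_pos, examined = -1.0, None, 0.0
        while examined < coverage * psi.sum():
            flat = rng.choice(psi.size, p=probs)     # sample by saliency weight
            r, c = np.unravel_index(flat, psi.shape)
            examined += psi[r, c]
            # Clip so the candidate window lies inside the image.
            r0 = int(min(max(r - h // 2, 0), image.shape[0] - h))
            c0 = int(min(max(c - w // 2, 0), image.shape[1] - w))
            rho = correlation(image[r0:r0 + h, c0:c0 + w], predicted)
            if rho > best_rho:
                best_rho, best_pos = rho, (c0 + w // 2, r0 + h // 2)
        return best_pos, best_rho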


Figure 4: a) A set of observations of an extracted scene feature. The grid represents an overhead view of the pose space of the camera, and feature observations are placed at the grid intersection corresponding to the pose where they were observed. Note that the observations capture variation in feature appearance. The lower-left thumbnail is highlighted in the scene shown in b), below.


C = ℜ^n         The n-dimensional pose-space of the robot.
Z = ℜ^m         The m-dimensional observation-space of each feature.
f               A feature.
q ∈ C           A robot pose.
z ∈ Z           An observation.
k               The number of training observations of a given feature.
t ∈ ℜ^2         The transformation component of an observation z.
l ∈ ℜ^(m−2)     The appearance component of an observation z.
Z ∈ ℜ^(k×m)     The row-wise set of k observations of a feature.
G ∈ ℜ^(k×k)     The radial-basis design matrix for a given feature.
W ∈ ℜ^(k×m)     The learned weight matrix relating G and Z (Z = (G + λI)W).
wi              The i-th row of W.
wij             The (i, j)-th element of W.
R               The cross-validation covariance associated with a feature model.

Table 1: Notation and definitions for the generative feature model

3.4 The generative feature model

We are interested in learning a pose-dependent model of a scene feature, given a set of observations of the feature from known camera positions. The model will be capable of producing maximum-likelihood virtual observations (predictions) of the feature from previously unvisited poses. It will also be capable of estimating the likelihood of a new observation, given the (hypothetical) pose from which it might be observed. For reference, a summary of the notation employed in this section is provided in Table 1.

We will represent an observation z ∈ ℜ^m of a feature f by the vector

z = [ t ; l ]

where t ∈ ℜ^2 is a vector representing the parameters that specify the affine transformation of the image sub-window of f within the image, and l ∈ ℜ^(m−2) is a vector corresponding to the local intensity image of f, unrolled in raster-scan order. In this paper, we consider only the translation of the feature in the image plane as the space of possible transformations; one can also consider rotation and scaling, but we defer this issue to future work. The observation z is a vector-valued function of the pose of the camera q. We seek to learn an approximation F(·) of this function, as expressed in Equation 3. In this work, we define the space of robot poses as q = [x y]^T and treat orientation recovery as a separate problem.

The approach we take to learning F(·) is to model each element of z as a linear combination of radial basis functions (RBFs), each of which is centered at a particular robot pose determined by the set of training poses.

Formally, given a set of observations from known poses (zi, qi), a predicted observation z from pose q is expressed as

z = F(q) = Σ_i wi G(q, qi)    (7)

where G(·, ·) is a scalar exponential function centered at the locus qi of observation i,

G(q, qi) = exp(−||q − qi||^2 / (2σ^2))    (8)

and the wi's are vectors of weights that are learned from the training observations. The RBF width σ is set by hand.

The computation of the weight vectors wi is well understood in the context of regularization and interpolation theory and is described elsewhere [27, 18, 9]. In brief, the optimal weights wij are the solution to the linear least squares problem

(G + λI) W = Z

where the elements Gi,j of the design matrix G correspond to Equation 8 evaluated at observation pose i and RBF center j, the matrix W corresponds to the matrix of unknown training weights, and the rows of matrix Z correspond to the training observations. When λ is 0 and G^(-1) exists, the computed weights result in a network whereby Equation 7 interpolates the observations exactly. However, the presence of noise and outliers, and the complexity of the underlying function being modeled, can result in an interpolation which is highly unstable. The solution can be stabilized by adding a diagonal matrix of regularization parameters λI to the design matrix G. In our work, these regularization parameters and the RBF width σ are set by hand at the outset. While ridge regression can be employed to compute the optimal regularization parameters, we do not find that this is necessary for the kinds of measurements we are interpolating.

If the design matrix employs every observation pose as a center for an RBF, the computational cost of computing the weights for n observations is that of an O(n^3) singular value decomposition of an n by n matrix, followed by an O(n) back-substitution for each element of the feature vector zi.
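As a concrete illustration of Equations 7 and 8 and the regularized least-squares fit, the sketch below builds the design matrix, solves for the weights, and predicts an observation for a novel pose. It is a minimal sketch under stated assumptions (numpy's least-squares solver in place of an explicit SVD; hypothetical function names), not the authors' code.

    # Minimal sketch of the RBF feature model (Equations 7 and 8).
    # Q: k x n matrix of training poses; Z: k x m matrix of observations,
    # each row holding [t, l] for one training image. Names are illustrative.
    import numpy as np

    def design_matrix(Q, centers, width):
        # G[i, j] = exp(-||q_i - q_j||^2 / (2 sigma^2))   (Equation 8)
        d2 = ((Q[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * width ** 2))

    def fit_feature_model(Q, Z, width=25.0, lam=1e-3):
        # Solve the regularized system (G + lambda I) W = Z for W.
        G = design_matrix(Q, Q, width)
        W, *_ = np.linalg.lstsq(G + lam * np.eye(len(Q)), Z, rcond=None)
        return {"centers": np.asarray(Q, float), "W": W, "width": width}

    def predict_observation(model, q):
        # z* = F(q) = sum_i w_i G(q, q_i)   (Equation 7)
        g = design_matrix(np.atleast_2d(q), model["centers"], model["width"])
        return (g @ model["W"]).ravel()

A predicted observation would then be split back into its transformation and intensity components for matching (Section 3.3) or reconstruction (Section 4.1).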

Figure 5 depicts three generated instances of the same feature from different poses. The predicted feature image l is plotted at the predicted image location t. Note the variation in both appearance and position of the feature in the image.

Figure 5: A single feature as generated from three different camera positions.

3.5 Visibility

As the robot or other observer moves through the environment, features will move in and out of view due to both camera geometry and occlusion. Therefore it is valuable to explicitly model feature visibility; that is, whether or not a particular feature is visible from a particular location in pose-space. This information aids the task of localization and is important for the problem of reconstructing the scene. We employ the same regularization framework to learn a visibility likelihood function p(visible(f)|q), training the function with the binary-valued observability of each feature from each visited pose in the training set.¹ This information is also useful for informing the question of where to collect new training examples.

¹The computed model could produce likelihood values less than zero or greater than one; we clamp these outputs when they occur.

Page 15: Learning Generative Models of Scene Features - CiteSeer

3.6 Model uncertainty and evaluation

Given an observation zi of feature fi, we can compute the likelihood that it came from pose q by computing a maximum likelihood observation z* using the generative model and comparing the actual and predicted observations using some distance metric ||z − z*||. It is not clear, however, how a metric in the space of observations should be defined (recall that an observation is a combination of pixel intensities and transformation parameters). Nor is it clear that the observation space is smooth and/or continuous. Furthermore, how does the likelihood behave as a function of the metric? In order to address these issues, we evaluate the computed models using leave-one-out cross-validation, and model the likelihood function p(z|q) as a Gaussian with a covariance R defined as the cross-validation covariance [17, 30, 11].

Cross validation operates by constructing the model with one data point excluded, predicting that data point using the construction and measuring the difference between the actual point and the prediction. By iterating over several (ideally all) of the training data, and computing the covariance of the resulting error measures, we can build up a measure of how well the model fits the data and, more importantly, how well we might expect it to predict new observations.

Given the very high dimensionality of the observation space, the covariance R, when computed over a Euclidean metric on the observation space, is highly likely to be rank-deficient. This poses problems for numerical stability in the presence of noisy observations. To overcome this problem, we reduce the dimensionality of the error space by computing the disparity between an observation and its prediction as a vector composed of the Euclidean error in the image component l combined with the vector error in the transformation component t. Specifically, if z = (l, t) is an observation and z* = (l*, t*) is the prediction of that observation from the cross-validation model (the feature model computed with z omitted), then the error vector ze is defined as

ze(z, z*) = [ ||l − l*|| ; t − t* ]    (9)

Given this definition of ze, the cross-validation covariance R is defined as

R = (1/k) Σ_{j=1}^{k} ze ze^T    (10)

Given R, the observation likelihood function is then expressed as

p(z|q) = c exp(−0.5 ze^T R^(-1) ze)    (11)

where c = ((2π)^M det R)^(−1/2), ze is the transformed z − z*, M is the dimensionality of the transformed observation space and exp(x) = e^x.
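A leave-one-out evaluation along the lines of Equations 9 to 11 might be sketched as follows; the fit and predict callables stand in for a feature model such as the RBF sketch above, and the observation layout (two transformation values followed by the unrolled intensities) is an assumption for illustration.

    # Sketch of leave-one-out evaluation (Equations 9 to 11). 'fit' and
    # 'predict' are callables playing the role of the feature model; each
    # observation row is assumed to be [t (2 values), l (m-2 intensities)].
    import numpy as np

    def error_vector(z, z_pred):
        # z_e = [ ||l - l*|| , t - t* ]   (Equation 9)
        return np.array([np.linalg.norm(z[2:] - z_pred[2:]),
                         z[0] - z_pred[0], z[1] - z_pred[1]])

    def cross_validation_covariance(Q, Z, fit, predict):
        errors = []
        for j in range(len(Q)):
            keep = np.arange(len(Q)) != j          # leave observation j out
            model = fit(Q[keep], Z[keep])
            errors.append(error_vector(Z[j], predict(model, Q[j])))
        E = np.array(errors)
        return (E.T @ E) / len(Q)                  # R, Equation 10

    def observation_likelihood(z, z_pred, R):
        # p(z|q) = c exp(-0.5 z_e^T R^-1 z_e)      (Equation 11)
        ze = error_vector(z, z_pred)
        c = 1.0 / np.sqrt((2 * np.pi) ** len(ze) * np.linalg.det(R))
        return c * np.exp(-0.5 * ze @ np.linalg.solve(R, ze))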

The covariance R is not only useful as a model parameter, but is also a useful measure of model fit. Trained features whose model covariance has a large determinant can be eliminated from the set of features on the basis that the feature is not modeled well and will not be useful for feature reconstruction or camera localization.

The cross-validation error associated with a particular learned feature represents a measure of the reliability of the model and the feature it has tracked. We are interested in examining the reliability of the particular feature representation we have chosen; that is, the attributes that we aim to model generatively. By cross-validating over individual potential attributes, such as the affine transformation t, the intensity image l, or other attributes such as the edge distribution E(·) of l, one can determine which feature attributes in particular are well-modeled.

Table 2 summarizes the quality of the three attributes mentioned here for three different scenes (Figure 6) and the results are depicted graphically in Figure 7. The first two scenes were imaged using a camera mounted on the end-effector of a robot arm, and the third was imaged using a camera mounted on a mobile robot. The number of training images for each scene is recorded in the table. For any given feature, the image position t, intensity distribution l and edge distribution E(l) of the features are each used to generate a separate model and cross-validation error. For each attribute a ∈ {t, l, E(l)}, the cross-validation error e(a) is defined as

e(a) = Σ_{j=1}^{k} ||ze||^2    (12)

where ze is defined appropriately as above, for each attribute. Tabulated are the mean cross-validation errors of these properties over all observed features. The smaller the value, the more reliable the attribute can be considered to be. It is important to note that the units for defining the error in each attribute differ (pixels^2 for the position attribute, and gray-level intensities^2 for appearance and edge distribution, respectively). As such, it is somewhat difficult to compare these measures. In the figure, the error measures are grouped by scene.


Figure 6: Images from scenes I, II and III used for evaluating feature attributes.

It is interesting to note that the affine transformation of the feature in the image is in general the most accurately modeled, whereas the edge distribution is poorly modeled. The order-of-magnitude difference between the affine transform error and the intensity distribution error is due in part to the significant difference in the dimensionality of the attributes. We can conclude that for the purposes of inference, in most circumstances the affine transform will be the most useful attribute.


Attribute                 Scene I    Scene II   Scene III
Training images           256        121        121
Affine Transform          17.1       143        33.7
Intensity Distribution    2105       5230       2531
Edge Distribution         17268      21960      13283

Table 2: A priori mean cross-validation error by attribute for the scenes in Figure 6. Refer to the text for details.

3.7 Scene evaluation

In addition to measuring feature quality, it is also possible to evaluate the ability of the model to represent the environment as a function of pose. We do this by computing a quality estimate for the subset of features observable from a given position. At each training pose q, we can compute a measure of reliability

rq = Σ_{fi ∈ Γ} 1 / |Rfi|    (13)

where Γ is the set of tracked features which are observed from pose q, and |Rfi| is the determinant of the feature uncertainty covariance. Note that for poses other than the training poses, a similar measure can be computed by weighting the terms of rq by their visibility likelihood, p(visible|q), since the determinant of the feature covariance is an indication of how weak the pose constraint for a given feature may be. Clearly, larger values of r should lead to more reliable pose estimates. Figure 9 plots r as a function of pose for Scene III, depicted in Figure 8. In this plot, the orientation of the camera is fixed to face in the negative y direction while the robot moves over a 2m by 2m pose space. Note that the reliability is particularly low for small values of y. This is due to the fact that images in that region of the pose space change dramatically under small changes in pose, leading to difficulty in tracking the features.
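A direct transcription of Equation 13 could look like the sketch below; the dictionary field names and the optional visibility weighting for unvisited poses are our assumptions about how the models might be stored. For a training pose, the feature list would be restricted to the set Γ actually observed from q.

    # Sketch of the pose reliability measure of Equation 13. Each feature
    # model is assumed to expose its cross-validation covariance R; the
    # optional visibility callable weights features at unvisited poses.
    import numpy as np

    def pose_reliability(q, feature_models, visibility=None):
        r = 0.0
        for f in feature_models:
            det_R = np.linalg.det(f["R"])
            if det_R <= 0.0:
                continue                     # degenerate covariance, skip
            weight = 1.0 if visibility is None else visibility(f, q)
            r += weight / det_R              # 1 / |R_f|, weighted by visibility
        return r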

4 Applications

The real benefit of constructing a generative model is realized in the ability to predict observations from an arbitrary pose. This ability enables a wide variety of tasks, including partial scene reconstruction from previously unvisited poses, and robot navigation and localization. In this section we elaborate on these potential applications.

Figure 7: Cross-validation error by attribute (position, appearance, edge distribution) for each scene.

4.1 Scene Reconstruction

Figure 8: The scene evaluated for a priori training reliability.

Given a set of trained features and a particular pose q, one can generate a maximum likelihood reconstruction of the scene features. Given the generated observations, the full image is reconstructed by positioning each feature according to its predicted transformation and mapping the generated intensity image in the image neighborhood. We model a large image neighborhood in order to predict as much of the image as possible. Where features overlap, the pixel intensity I(x, y) is determined by selecting the pixel intensity corresponding to the feature that maximizes the weighting function

v = (p(visible(f)) / cf) exp(−Δp^2 / (2σ^2))

where cf is the total cross-validation error for the feature, Δp is the Euclidean distance between the pixel (x, y) and the predicted position of the feature, and σ is a parameter describing the region of influence of each feature in the image. This winner-takes-all strategy selects a feature for whose pixel prediction we are most confident.
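The compositing rule can be written down directly, as in the following sketch, which keeps, at each overlapping pixel, the prediction with the largest weight v. The patch layout, field names and the assumption that patches fall inside the image bounds are ours.

    # Sketch of winner-takes-all compositing for scene reconstruction.
    # Each prediction is assumed to carry the generated patch, its predicted
    # top-left corner (inside the image), p(visible(f)) and the total
    # cross-validation error c_f; all field names are illustrative.
    import numpy as np

    def reconstruct(predictions, image_shape, sigma=20.0):
        image = np.zeros(image_shape)
        best_v = np.full(image_shape, -np.inf)   # best weight seen per pixel
        for p in predictions:
            patch, (r0, c0) = p["patch"], p["corner"]
            h, w = patch.shape
            rows, cols = np.ogrid[:h, :w]
            # Squared distance from each patch pixel to the patch centre.
            dp2 = (rows - h / 2.0) ** 2 + (cols - w / 2.0) ** 2
            v = (p["visibility"] / p["cv_error"]) * np.exp(-dp2 / (2 * sigma ** 2))
            region = (slice(r0, r0 + h), slice(c0, c0 + w))
            better = v > best_v[region]
            image[region][better] = patch[better]
            best_v[region][better] = v[better]
        return image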

For example, Figure 3 a) shows a training image from a laboratory scene for which training images have been collected at 25cm intervals over a 6.0m by 3.0m pose space; Figure 10 depicts the reconstruction of the same scene from a nearby pose. Note that the reconstruction cannot predict pixels for which there is no feature model, and as such, the lower edge of the image is left unshaded. It may also be possible to interpolate limited amounts of the unshaded regions using Markovian reconstruction methods [6, 8].

4.2 Localization

Figure 9: A priori training reliability r as a function of pose for the scene depicted in Figure 8. The camera faces in the negative y direction.

Given a set of feature models, the task of robot localization can be performed by applying Bayes' Rule, as per Equation 1. When the camera is at an unknown position, an observation is obtained and optimal matches to the learned features are detected in the image, Z = {zf}. Each feature observation zf then contributes a probability density function p(zf|q), which is defined as the product of the distribution due to the maximum likelihood prediction of the model (Equation 11) and the feature visibility likelihood p(visible(f)|q). In the absence of informative priors, the pose q* that maximizes the joint likelihood of the observations is considered to be the maximum likelihood position of the robot, as illustrated by Equation 1. Numerically, the joint likelihood can be difficult to compute, as it requires summing over all permutations of successful and unsuccessful feature matches. Instead, we approximate the joint posterior using a mixture model of the individual feature-derived distributions:

p(Z|q) ≈ (1/n) Σ_{zf ∈ Z} p(zf|q)    (14)

This model takes an extreme outlier approach whereby it is assumed that the probability of an incorrect feature match is high. A complete description of the probability density function should take into account the likelihood of the match between each detected feature and all possible generated observations. Our empirical experience indicates that the mixture model provides resistance to outlier matches without the need for computing a full joint posterior, as we demonstrate in the next section.

Figure 10: A reconstruction of the laboratory scene depicted in Figure 3, as predicted from a nearby camera pose.
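Evaluated over a grid of candidate poses, Equation 14 together with the visibility term yields an unnormalized posterior surface such as the one shown in Figure 12. The sketch below assumes per-feature callables for the observation likelihood (Equation 11) and the visibility likelihood; the interface is hypothetical.

    # Sketch of the mixture approximation of Equation 14 over candidate poses.
    # 'matches' is an iterable of (feature, z_f) pairs; 'likelihood' and
    # 'visibility' are assumed callables for p(z_f|q) (Equation 11) and
    # p(visible(f)|q). The interface is hypothetical.
    import numpy as np

    def pose_likelihood(q, matches, likelihood, visibility):
        terms = [likelihood(f, z, q) * visibility(f, q) for f, z in matches]
        return sum(terms) / max(len(terms), 1)           # Equation 14

    def likelihood_surface(grid_x, grid_y, matches, likelihood, visibility):
        surface = np.zeros((len(grid_y), len(grid_x)))
        for i, y in enumerate(grid_y):
            for j, x in enumerate(grid_x):
                q = np.array([x, y])
                surface[i, j] = pose_likelihood(q, matches, likelihood, visibility)
        return surface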

5 Experimental Results

In this experiment, we evaluate the performance of the learning framework and feature models on the task of robot localization. The laboratory scene depicted in Figure 3 was explored by taking 291 training images at uniform intervals of approximately 25cm over a 3.0m by 6.0m pose space. A second observing robot was deployed to estimate the ground-truth position of the exploring robot to an accuracy of approximately 4cm, as described in [20]. The observer employed a laser range-finder to accurately determine the camera position from the range and orientation of a three-plane target mounted on the exploring robot (Figure 11). For the purposes of this experiment, the robot attempted to take training images at the same global orientation. However, uncertainty in the robot's odometry, as well as the observing robot's estimate, led to some variation in this orientation from pose to pose.

Figure 11: Robots employed for data collection. The three-plane target mounted on the exploring robot is sensed by the stationary robot, allowing for the computation of pose estimates for the explorer. The pose estimates are employed as an approximation to ground truth, both for training and evaluating the vision-based localizer.

A set of initial features was extracted from a small subset of the training images, and more than 117 feature models were trained. Those models with high uncertainty, or with too few observations, were removed, resulting in 80 reliable feature models.

To validate the learned models, an additional set of 93 images were collected from random poses, constrained to lie anywhere within the 3.0m by 6.0m training space. These test images were used to compute maximum-likelihood (ML) estimates of the camera's position, and the ML estimates were compared against the ground truth estimates provided by the observing robot. The estimates themselves were computed by exhaustive search over a multi-resolution discretization of the training space, selecting the hypothesized q that maximized Equation 14. In particular, the training space was discretized into a 40 by 40 grid covering the entire training space and Equation 14 was evaluated at each position in the grid. Subsequently, at the maximal grid location a new 10 by 10 grid was instantiated over a neighborhood spanning 7 by 7 grid positions in the larger grid and Equation 14 was evaluated over the new grid. This process recursed to a pre-determined resolution and the maximal grid pose at the highest resolution was returned. Note that in a production environment, a more efficient estimator, such as Monte Carlo sampling, could be deployed.
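The coarse-to-fine search can be sketched as follows; the grid sizes follow the text (a 40 by 40 grid, then 10 by 10 grids over a 7 by 7 cell neighborhood), while the stopping resolution, the square-pose-space simplification and the function names are our assumptions.

    # Sketch of the coarse-to-fine maximum-likelihood search used in the
    # experiments; 'score' evaluates Equation 14 at a pose (x, y). The grid
    # sizes follow the text; the stopping resolution is an assumption.
    import numpy as np

    def grid_argmax(score, x_lo, x_hi, y_lo, y_hi, nx, ny):
        xs, ys = np.linspace(x_lo, x_hi, nx), np.linspace(y_lo, y_hi, ny)
        values = np.array([[score(x, y) for x in xs] for y in ys])
        iy, ix = np.unravel_index(np.argmax(values), values.shape)
        return xs[ix], ys[iy]

    def ml_pose(score, bounds, min_cell=0.01):
        x_lo, x_hi, y_lo, y_hi = bounds
        best = grid_argmax(score, x_lo, x_hi, y_lo, y_hi, 40, 40)
        cell = (x_hi - x_lo) / 40.0          # assumes a roughly square space
        while cell > min_cell:
            # Refine: a 10 x 10 grid over a 7 x 7 cell neighbourhood.
            half = 3.5 * cell
            x, y = best
            best = grid_argmax(score, x - half, x + half, y - half, y + half, 10, 10)
            cell = 2.0 * half / 10.0
        return best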

In practical settings, one is not always interested in the ML pose estimate, but rather the entire probability distribution over the pose space, which can provide more information about alternative hypotheses in environments which exhibit significant self-similarity. Figure 12 depicts the probability density function, modulo a normalizing constant, resulting from evaluating Equation 14 for a single test image over a uniform grid of poses. The figure clearly indicates a region where the pose is more probable, as well as a second, less probable region. The second region may be due to a mis-classified feature (a failure in the matching stage), or some self-similarity in a trained feature.

Given that each ML estimate has a particular likelihood, it is possible to reject pose estimates that do not meet a particular confidence threshold. In this way, four of the 93 estimates in the test set were rejected. Interestingly, the majority of these estimates were associated with images that were obtained when the robot was very close to the wall it was facing, where it was difficult to reliably track features at the selected training sample density. This behavior coincides with that predicted by our a priori evaluation of a similar scene, as exhibited in Figure 9.

Figure 12: Likelihood function of pose of the robot over the 3.0m by 6.0m pose space. Note that the distribution is not unimodal, possibly due to a mis-recognized feature, or model self-similarity at different poses.

Figure 13 plots the location of the unrejected ML estimates for the test images ('x') against the ground truth camera position ('o') by joining the two points with a line segment. The length of each line segment corresponds to the magnitude of the error between the corresponding pose estimate and ground truth. The mean absolute error is 17cm (7.7cm in the x direction vs 15cm in the y direction). The larger error in the y direction corresponds to the fact that the camera was pointed parallel to the positive y axis, and changes in observations due to forward motion are not as pronounced as changes due to side-to-side motion. The smallest absolute error was 0.49cm, which is insignificant compared to the ground truth error, and the largest error was 76cm. Note that most of the larger errors occur for large values of y. This is due to the fact that the camera was closest to the wall it was facing at these positions, and as has been mentioned, tracking scene features over 25cm pose intervals became a difficult task.

6 Discussion and Conclusions

We have presented a method for learning generative models of visual features that are useful for robot pose estimation and scene reconstruction. The method operates by matching image features over a set of training images, and learning a generating function, parameterized by the pose of the camera, which can produce maximum likelihood feature observations. We train a radial basis function network to model each feature. The system also models the uncertainty of the generated features, allowing for Bayesian inference of camera pose.

Figure 13: Localization results: the set of maximum-likelihood pose estimates ('x') plotted against their ground truth estimates ('o').

A comparative study of running times and robustness for this algorithm versus other standard localization methods would be valuable. While it is beyond the scope of this paper to include such an undertaking, we refer the reader to recent work in which we conducted a study comparing a method similar to the one presented here with principal components analysis [24]. In that work, we compared running times of the algorithms and demonstrated robustness to a variety of adverse conditions, such as partial image occlusion. An interesting result from that study indicated that while there are arguments to be made for exploiting visual attention to reduce computational complexity (e.g. [28]), the additional overhead involved in feature extraction, tracking and modeling can yield running times comparable to PCA-based representations.

There remain several outstanding questions for future work. First, it is not clear how many training images are sufficient to cover the pose spaces studied. Empirically, we have found the limiting factor to be the problem of reliable feature tracking; decreasing the sampling density inevitably results in lost features. One approach to addressing this question would be to compute an on-line estimate of how far the robot can travel before collecting a new training image. Such a measure might be based on the number of successfully tracked features, or perhaps a measure of optical flow. Second, the problem of recovering orientation can be addressed. While the framework does not preclude learning feature models in a three, or higher, degree-of-freedom pose space, the number of training images required can be prohibitive. One proposed solution is to learn feature models at a small number of selected orientations, and localize the robot by taking images at several orientations (ideally, a sufficient number to enable generation of a panorama), and locating the image that maximizes the confidence of the ML pose estimate. This approach is similar to an active vision approach whereby a robot must travel until it locates landmarks that disambiguate its position.

The experimental results in this paper demonstrate the utility of using learned feature models for pose estimation, as well as other tasks, such as scene reconstruction. Our experiments have demonstrated the stability and smoothness of the resulting posterior distribution over camera pose, and we were able to detect most outliers by thresholding the likelihood of the ML estimates. However, important issues are raised in this work with respect to the density of training samples. In order to capture aspects of the scene that change significantly, one must sample at higher densities. One possible solution is to select the robot's viewing direction before sensing in order to take in more stable parts of the environment (for example, point the camera at the farthest wall). Our future work is addressing some of the issues raised here, as well as expanding the approach to much larger environments.

References

[1] P. Belhumeur and D. Kriegman. What is the set of images of an object under all possible lighting conditions? In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 270–277, San Francisco, CA, 1996. IEEE Press.

[2] H. Choset and K. Nagatani. Topological simultaneous localization and mapping (SLAM): Toward exact localization without explicit localization. IEEE Transactions on Robotics and Automation, 17(2):125–137, April 2001.

[3] T. Cootes and C. Taylor. Modelling object appearance using the greylevel surface. In Proceedings of the British Machine Vision Conference, 1994.

[4] F. Dellaert, W. Burgard, D. Fox, and S. Thrun. Using the CONDENSATION algorithm for robust, vision-based mobile robot localization. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2588–2593, Ft. Collins, CO, June 1999. IEEE Press.

[5] G. Dudek and D. Jugessur. Robust place recognition using local appearance based methods. In International Conference on Robotics and Automation, pages 466–474, San Francisco, April 2000. IEEE Press.

[6] A. A. Efros and T. K. Leung. Texture synthesis by non-parametric sampling. IEEE International Conference on Computer Vision, pages 1033–1038, September 1999.

[7] D. Fox, W. Burgard, and S. Thrun. Active Markov localization for mobile robots. Robotics and Autonomous Systems (RAS), 25:195–207, 1998.

[8] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6):721–741, November 1984.

[9] S. Haykin. Neural Networks. MacMillan College Publishing Company, New York, NY, 1994.

[10] B. K. P. Horn and M. J. Brooks, editors. Shape from Shading. MIT Press, Cambridge, Mass, 1989.

[11] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 1995 International Joint Conference on Artificial Intelligence (IJCAI), pages 1137–1145, Montreal, August 1995.

[12] J. J. Leonard and H. F. Durrant-Whyte. Simultaneous map building and localization for an autonomous mobile robot. In Proceedings of the IEEE Int. Workshop on Intelligent Robots and Systems, pages 1442–1447, Osaka, Japan, November 1991.

[13] M. Lhuillier and L. Quan. Image interpolation by joint view triangulation. In Proceedings of IEEE Computer Vision and Pattern Recognition (CVPR), pages 139–145, Fort Collins, CO, 1999.

[14] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision, pages 1150–1157, Corfu, Greece, September 1999. IEEE Press.

[15] C. Nastar, B. Moghaddam, and A. Pentland. Generalized image matching: Statistical learning of physically-based deformations. In Proceedings of the Fourth European Conference on Computer Vision (ECCV'96), Cambridge, UK, April 1996.

[16] S. K. Nayar, H. Murase, and S. A. Nene. Learning, positioning, and tracking visual appearance. In Proc. IEEE Conf. on Robotics and Automation, pages 3237–3246, San Diego, CA, May 1994.

[17] R. R. Picard and R. D. Cook. Cross-validation of regression models. Journal of the American Statistical Association, 79(387):575–583, 1984.

[18] T. Poggio and S. Edelman. A network that learns to recognize 3D objects. Nature, 343:263–266, January 1990.

[19] F. Pourraz and J. L. Crowley. Continuity properties of the appearance manifold for mobile robot position estimation. In Proceedings of the IEEE Computer Society Conference on Pattern Recognition Workshop on Perception for Mobile Agents, Ft. Collins, CO, June 1999. IEEE Press.

[20] I. M. Rekleitis, G. Dudek, and E. Milios. Multi-robot collaboration for robust exploration. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 3164–3169, San Francisco, CA, April 2000.

[21] C. Schmid. A structured probabilistic model for recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 485–490, Ft. Collins, CO, June 1999.

[22] S. Se, D. Lowe, and J. Little. Global localization using distinctive visual features. In Proceedings of the IEEE/RSJ Conference on Intelligent Robots and Systems (IROS), pages 226–231, Lausanne, Switzerland, October 2002.

[23] R. Sim and G. Dudek. Learning environmental features for pose estimation. Image and Vision Computing, Elsevier Press, 19(11):733–739, 2001.

[24] R. Sim and G. Dudek. Comparing image-based localization methods. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), page 6, Acapulco, Mexico, August 2003. Morgan Kaufmann.

[25] G. P. Stein and A. Shashua. Model-based brightness constraints: On direct estimation of structure and motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(9):992–1015, September 2000.

[26] S. Thrun, D. Fox, and W. Burgard. A probabilistic approach to concurrent mapping and localization for mobile robots. Machine Learning, 31:29–53, 1998. Also appeared in Autonomous Robots 5, 253–271.

[27] A. N. Tikhonov and V. Y. Arsenin. Solutions of Ill-Posed Problems. V. H. Winston & Sons, John Wiley & Sons, Washington D.C., 1977. Translation editor Fritz John.

[28] John K. Tsotsos. Analysing vision at the complexity level. Behavioral and Brain Sciences, 13(3):423–496, 1990.

[29] Thomas Vetter and Tomaso Poggio. Linear object classes and image synthesis from a single example image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):733–742, 1997.

[30] G. Wahba. Convergence rates of 'thin plate' smoothing splines when the data are noisy. Smoothing Techniques for Curve Estimation, pages 233–245, 1979.

[31] B. Yamauchi, A. Schultz, and W. Adams. Mobile robot exploration and map building with continuous localization. In Proceedings of the IEEE International Conference on Robotics and Automation, pages 3715–3720, Leuven, Belgium, May 1998. IEEE Press.