
Multiple Kernel Gaussian Process Classification for Generic 3D Object Recognition

Erik Rodner, Doaa Hegazy, and Joachim Denzler

Chair for Computer Vision, Friedrich Schiller University of Jena, Germany (www.inf-cv.uni-jena.de)

Email: [email protected]

Abstract

We present an approach to generic object recognition with range information obtained using a Time-of-Flight camera and colour images from a visual sensor. Multiple sensor information is fused with Bayesian kernel combination using Gaussian processes (GP) and hyperparameter optimisation. We study the suitability of approximate GP classification methods for such tasks and present and evaluate different image kernel functions for range and colour images. Experiments show that our approach significantly outperforms previous work on a challenging dataset, boosting the recognition rate from 78% to 88%.

Keywords: Gaussian Processes, Time-of-Flight Camera, Kernel Combination, Bag-of-Features

1 Introduction

In the last decade, research in machine learning and computer vision has mainly concentrated on image categorisation using a single visual sensor or photos from the web [1]. Despite the great success of those methods, the importance of depth information for reliable generic object recognition has mostly been ignored.

In the following work we present an approach to generic object recognition using the combined information of a visual sensor and range information obtained from a Time-of-Flight (ToF) camera [2]. A ToF camera offers real-time depth images obtained by modulating an outgoing beam with a carrier signal and measuring the phase shift of that carrier when received back at the ToF sensor. Incorporating the advantages of this new camera technology into computer vision systems is a current research topic, and the technology has been successfully applied to 3d reconstruction tasks [3] and marker-less human body motion capture [4].

We use a ToF camera together with a CCD camera to solve generic object recognition problems. To combine the information of our two sensors, we utilise Gaussian process classification [5], which allows efficient hyperparameter estimation and integration of multiple sensor information using kernel combination in a Bayesian framework. In contrast to previous work [6], which used GP regression to approximate the underlying discrete classification problem, we also study the Laplace approximation (LA), which directly tackles the discrete nature of the categorisation problem. Hyperparameter estimation is done by extending multi-task techniques for GP regression [7, 6] to LA. Additionally, we show how to compute image-based kernel functions using range images by applying the framework of Spatial Pyramid Matching Kernels [8] (SPMK) to different kinds of local range features. Figure 1 presents an overview of the proposed approach and the main steps involved.

The main contributions of this paper are as follows:

1. We present multiple kernel learning with Gaussian process classification for sensor data fusion (previous papers are limited to visual sensors only [6]).

2. Different noise models of GP classification are studied for image categorisation and sensor data fusion (previous papers are limited to Gaussian noise and GP regression [6]).

3. Our method yields a significant improvement in terms of recognition performance compared to the current state of the art of object recognition with Time-of-Flight range images [9].

The remainder of the paper is organised as follows. Section 2 gives an overview of related work in the area of image categorisation with range images and applications of GP classification. We briefly review classification and regression with Gaussian processes in Sect. 3, followed by our choice of local range features (Sect. 4) and their use for Pyramid Matching Kernels (Sect. 5).



Figure 1: An overview of the proposed approach: colour and range images are captured using a CCD camera and a Time-of-Flight camera (PMD Vision Technologies 19k); local features (OpponentSIFT for colour images, local range features for range images) are extracted from the data of both sensors and "kernelized" using Spatial Pyramid Matching, yielding different kernel functions K_i that are fused by the weighted kernel combination K = Σ_i α_i K_i for the Gaussian process classifier.

Experiments in Sect. 6 study the different parts of our approach and evaluate them on a publicly available dataset. A summary of our findings and a discussion of future research directions conclude the paper.

2 Related Work

One of the earlier works on local features computed on range images is the work of Hetzel et al. [10]. They present different feature types and similarity measures for histograms which can be used for nearest neighbour classification. Toldo et al. [11] propose a preliminary segmentation of complete 3d models and a description of the resulting parts utilising the Bag-of-Features idea. Classification is done with multiple equally weighted histogram intersection kernels and support vector machines. The suitability of SIFT features for image matching is studied by Zhang et al. [12], who propose to compute SIFT features on normal texture and shape index images [10]. In contrast, Lo et al. [13] present an extension to SIFT features especially suitable for range images. Special feature types for Time-of-Flight cameras are studied by Haker et al. [14]. A key idea of their work is the non-equidistant Fourier transformation to represent range images.

Our method for feature computation is mainly based on Spatial Pyramid Matching as presented by Lazebnik et al. [8] for standard image categorisation and utilised by Li et al. [15] for range images. Similar to our previous work [9, 16], we concentrate on the combination of colour or texture information obtained from a standard CCD camera and range information from a Time-of-Flight camera. A beneficial combination of these two information sources with kernel-based methods requires an efficient method for hyperparameter optimisation, which is available in Bayesian frameworks such as Gaussian process classification [5]. As shown by Kapoor et al. [6], regression with Gaussian process priors can be successfully integrated in an image categorisation setting and can handle multiple kernels.

3 Multi-Kernel Classification with Gaussian Processes

In the following section, we briefly review Gaussian process regression and binary classification. We also explain multi-class classification with GP priors and efficient kernel combination by hyperparameter estimation.

3.1 Bayesian Estimation with Gaussian Process Priors

We first give some initial motivation before describing the use of GP priors in more mathematical terms. Machine learning can often be described as estimating the relation between inputs x and outputs or labels y. This relation is often described using a regression function f and a noise term ε, e.g. y = f(x) + ε. The idea of Bayesian estimation is that instead of searching for a specific function f, one can regard f as a latent random variable which can be marginalised out. Gaussian processes allow us to model the prior distribution of regression functions f by defining a suitable covariance function that controls the smoothness of the function samples.

Let X be the space of all possible input data (feature vectors or images). Given n training examples x_i ∈ X_T ⊂ X and corresponding binary labels y_i ∈ {−1, 1} (multi-class classification is explained in subsequent sections), we would like to predict the label y_* of an unseen example x_* ∈ X. The two main assumptions of Gaussian processes for regression or classification are:

Page 3: Multiple Kernel Gaussian Process Classi cation for Generic ...MKG.pdf · Erik Rodner, Doaa Hegazy, and Joachim Denzler Chair for Computer Vision Friedrich Schiller University of Jena,

1. There is an underlying latent function f: X → R, so that the labels y_i are conditionally independent of the input x_i given f(x_i). Thus, the labels are distributed according to the so-called noise model p(y_i | f(x_i)).

2. The function f is a sample of a Gaussian process (GP) prior and is itself a random variable: f | X ∼ GP(0, K(X, X)) with zero mean and covariance or kernel function K: X × X → R.

The Gaussian process prior enables us to model the covariance of outputs f(x_*) as a function of the inputs x_*. With K being a kernel function describing the similarity of two inputs, the common smoothness assumption that similar inputs should lead to similar function values (and similar labels) can be modelled. Incorporating our assumptions into a Bayesian framework and marginalising over the latent function values, we have the following equations for the inference of the label y_* of an unseen example x_*:

p(y_* \mid x_*, y, X_T) = \int_{\mathbb{R}} p(y_* \mid f_*) \, p(f_* \mid x_*, y, X_T) \, df_*   (1)

where we marginalise over the latent function value f_* = f(x_*) corresponding to x_*. The conditional distribution of f_* is also available by marginalising over all latent function values f = (f(x_i))_{i=1}^{n} of the training set X_T:

p(f_* \mid x_*, y, X_T) = \int_{\mathbb{R}^n} p(f_* \mid x_*, f) \, p(f \mid y, X_T) \, df   (2)

Finally, the incorporation of the noise model and our assumption of independent training examples leads to:

p(f \mid y, X_T) = \frac{p(f \mid X_T)}{p(y \mid X_T)} \prod_i p(y_i \mid f_i)   (3)

The distribution p(f | X_T) is an n-dimensional normal distribution with zero mean and covariance matrix K = (K(x_i, x_j))_{i,j}, which is often called the kernel matrix. The noise model can take various forms depending on the nature of the labels y_i, which leads to the distinction between Gaussian process regression and classification.

3.2 Regression with Gaussian Process Priors

Gaussian process regression uses a Gaussian noise model

p(y_i \mid f_i) = \mathcal{N}(y_i \mid f_i, \sigma^2)   (4)

with variance σ². The advantage of this model is that we do not have to use approximate inference methods to handle the marginalisations involved in equations (1) and (2). A given binary classification problem is solved as a regression problem which regards the y_i as real-valued function values instead of discrete labels. The GP regression model assumptions lead to analytical solutions of the integrals and allow us to predict the label y_* directly. Let k_* be the vector of kernel values (k_*)_i = K(x_i, x_*) corresponding to a test example x_*. The GP model for regression leads to the following equation for the prediction y_* [5]:

y_*(x_*) = k_*^T (K + \sigma^2 I)^{-1} y   (5)

with I denoting the identity matrix.
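To make the prediction step concrete, the following sketch implements equation (5) with NumPy. The RBF kernel, the noise variance, and the toy data are illustrative assumptions; the paper uses image-based kernels (Sect. 5) instead.

```python
# Minimal sketch of GP regression prediction, equation (5):
# y_* = k_*^T (K + sigma^2 I)^{-1} y.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential kernel matrix between rows of A and B (illustrative choice)."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq_dists / length_scale**2)

def gp_regression_predict(X_train, y, X_test, sigma2=0.1):
    """Predict latent scores for X_test via equation (5)."""
    K = rbf_kernel(X_train, X_train)                   # kernel matrix K
    k_star = rbf_kernel(X_train, X_test)               # k_* for each test example
    alpha = np.linalg.solve(K + sigma2 * np.eye(len(y)), y)
    return k_star.T @ alpha                            # k_*^T (K + sigma^2 I)^{-1} y

# Toy usage: binary labels in {-1, +1}; the sign of the prediction is the class.
X_train = np.random.randn(20, 5)
y = np.sign(X_train[:, 0])
X_test = np.random.randn(3, 5)
print(gp_regression_predict(X_train, y, X_test))
```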

3.3 Classification with GP Priors and Laplace Approximation

To incorporate knowledge about the discrete nature of the labels, we can use noise models such as the cumulative Gaussian

p(y_i \mid f_i) = \frac{1}{2}\left(\operatorname{erf}\left(\frac{\ell \, y_i f_i}{\sqrt{2}}\right) + 1\right)   (6)

with length-scale parameter ℓ. Exact inference with this model is intractable and approximation methods are required. One common efficient approximation method is the Laplace approximation [5]. The key idea to handle the integral in equation (2) is to approximate p(f | X_T, y) with a Gaussian distribution. This can be done by finding the mode f̂ with nonlinear optimisation techniques and approximating the covariance matrix by utilising the Hessian at f̂. An important advantage of the cumulative Gaussian noise model is the availability of analytical solutions for the marginalisation involved in equation (1). For implementation details we refer the reader to the excellent textbook of Rasmussen and Williams [5].
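As a rough illustration of the Laplace approximation described above, the sketch below finds the mode of the log posterior under the cumulative Gaussian noise model of equation (6) with a generic optimiser and approximates the posterior covariance from the Hessian at the mode. The kernel matrix, the length-scale, and the finite-difference Hessian of the noise term are simplifying assumptions; a full implementation would follow Algorithm 3.1 of Rasmussen and Williams [5].

```python
# Minimal sketch of the Laplace approximation of p(f | X_T, y).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def laplace_mode(K, y, ell=1.0):
    """Return the mode f_hat and an approximate posterior covariance."""
    K_inv = np.linalg.inv(K + 1e-8 * np.eye(len(y)))   # jitter for numerical stability

    def neg_log_posterior(f):
        # -sum_i log Phi(ell * y_i * f_i) + 0.5 * f^T K^{-1} f  (up to constants);
        # Phi(ell * y * f) equals the erf form of equation (6).
        return -np.sum(norm.logcdf(ell * y * f)) + 0.5 * f @ K_inv @ f

    res = minimize(neg_log_posterior, np.zeros(len(y)), method="L-BFGS-B")
    f_hat = res.x

    # Hessian at the mode is W + K^{-1}; W is approximated here by
    # finite differences of the per-example noise term (a simplification).
    eps = 1e-5
    W = np.zeros(len(y))
    for i in range(len(y)):
        nll = lambda fi: -norm.logcdf(ell * y[i] * fi)
        W[i] = (nll(f_hat[i] + eps) - 2 * nll(f_hat[i]) + nll(f_hat[i] - eps)) / eps**2
    cov = np.linalg.inv(np.diag(W) + K_inv)
    return f_hat, cov
```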

3.4 Kernel Combination for Classification with Multiple Sensors

Let us assume m kernel functions K_i(·, ·) are given. In the multi-sensor setting, these kernel functions can be computed from different data modalities. To use the information of all kernels, they can be combined into a single kernel function K, e.g. in a linear fashion with exponential weights α ∈ R^m:

K_\alpha(x, x') = \sum_{i=1}^{m} \exp(\alpha_i) \, K_i(x, x')   (7)

For binary classification this relates to an equivalent representation of the latent function f as a linear sum f = \sum_{i=1}^{m} \exp(\alpha_i) f^i of independent latent functions f^i sampled from GP priors GP(0, K_i). Another combination technique we experimented with is to plug the linear combination


into an exponential kernel [17], which leads to a similar classification performance. The parameters α_i are hyperparameters of the kernel function and can be optimised using GP model selection techniques, such as likelihood optimisation. The negative log-likelihood of all training examples for the binary GP regression model is given by [5]:

-\log p(y \mid X_T, \alpha) = \frac{1}{2} y^T \left(K_\alpha + \sigma^2 I\right)^{-1} y + \frac{1}{2} \log \left|K_\alpha + \sigma^2 I\right| + \frac{n}{2} \log 2\pi   (8)

with K_α being the parameterised kernel matrix according to equation (7). Further details regarding the likelihood of the Laplace approximation and its gradients can be found in [5]. Maximisation of the likelihood can be done with iterative nonlinear optimisation methods. To reduce over-fitting, [18] propose to choose the parameters according to the minimum leave-one-out error among the parameter values visited in each iteration of the likelihood optimisation. However, we did not observe a performance gain using this technique.
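A minimal sketch of the kernel combination of equation (7) and the negative log-likelihood of equation (8) is shown below, with the weights optimised by a generic gradient-free method. The precomputed kernel matrices, the noise variance, and the choice of optimiser are assumptions for illustration; the paper uses iterative nonlinear (gradient-based) likelihood optimisation.

```python
# Minimal sketch of weighted kernel combination (7) and its
# negative log marginal likelihood (8) for GP regression.
import numpy as np
from scipy.optimize import minimize

def combined_kernel(K_list, alpha):
    """K_alpha = sum_i exp(alpha_i) * K_i, equation (7)."""
    return sum(np.exp(a) * K for a, K in zip(alpha, K_list))

def neg_log_likelihood(alpha, K_list, y, sigma2=0.1):
    """Negative log marginal likelihood of binary GP regression, equation (8)."""
    n = len(y)
    C = combined_kernel(K_list, alpha) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(C)                           # C = L L^T
    beta = np.linalg.solve(L.T, np.linalg.solve(L, y))  # C^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * y @ beta + 0.5 * log_det + 0.5 * n * np.log(2 * np.pi)

def optimise_weights(K_list, y):
    """Likelihood optimisation of the exponential kernel weights alpha."""
    res = minimize(neg_log_likelihood, np.zeros(len(K_list)),
                   args=(K_list, y), method="Nelder-Mead")
    return res.x
```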

3.5 Multi-Class Classification and Model Selection

So far, we have presented GP classification with binary labels. However, our image categorisation application requires solving a multi-class classification problem. Rasmussen and Williams [5] present an extension of the Laplace approximation for multi-class classification tasks with y ∈ {1, . . . , M} = Ω. This method requires MCMC techniques and is thus computationally demanding.

We follow Kapoor et al. [6] by utilising the one-vs-all technique. For each class i ∈ Ω a binary classifier is trained which uses all images of class i as positive examples and the remaining images as the negative training set. Classification is done by returning the class with the highest probability estimated by the corresponding binary classifier. The one-vs-all approach also allows efficient model selection by joint optimisation of the hyperparameters of all involved binary problems [7]. The objective function is simply the sum of the binary negative log-likelihoods as given in equation (8) for GP regression. An equivalent idea can be applied to the Laplace approximation.
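The following sketch illustrates the one-vs-all strategy on top of the GP regression predictor of equation (5). The precomputed kernel matrices K and k_* and the noise variance are assumptions for illustration.

```python
# Minimal sketch of one-vs-all multi-class classification with GP regression.
import numpy as np

def one_vs_all_predict(K, k_star, labels, sigma2=0.1):
    """Return the predicted class for each test example.

    K      : (n_train, n_train) kernel matrix over the training images
    k_star : (n_train, n_test) kernel values between training and test images
    labels : (n_train,) array of class labels
    """
    classes = np.unique(labels)
    A = K + sigma2 * np.eye(len(labels))
    scores = []
    for c in classes:
        y_bin = np.where(labels == c, 1.0, -1.0)        # class c vs. the rest
        alpha = np.linalg.solve(A, y_bin)
        scores.append(k_star.T @ alpha)                 # latent score per test example
    return classes[np.argmax(np.stack(scores, axis=1), axis=1)]
```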

4 Local Features for ToF Range Images and Colour Images

In the following we present local features which can be calculated using range images from a Time-of-Flight camera. The extraction process is performed in two steps: (1) preprocessing and sampling of regions and (2) the calculation of local feature descriptors. Figure 2 gives an overview of the proposed processing pipeline and guides through the following two sections.

4.1 Preprocessing and Sampling

The range data obtained with the ToF camera (in our case a Photonic Mixer Device) suffer from severe statistical noise. In order to filter this noise and smooth the range data, a median filter of size 5×5 is applied. One idea to select appropriate regions for local features is to use an interest point detector [16]. However, due to the low resolution of the intensity image provided by the ToF camera, this often leads to uninformative interest points and a small number of local features. Therefore, we compute local features on a predefined grid as suggested by [9].

Dense sampling has also been shown to be beneficial for image categorisation tasks using colour images [17]. Thus, the same sampling strategy is applied to the colour images of the visual sensor, and we extract local descriptors on a predefined grid with a horizontal and vertical spacing of 10 pixels.
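A minimal sketch of this preprocessing and dense sampling step is shown below; the 5×5 median filter and the 10-pixel grid spacing follow the text, while the patch size is an illustrative assumption.

```python
# Minimal sketch: median-filter a ToF range image and sample patches on a regular grid.
import numpy as np
from scipy.ndimage import median_filter

def dense_patches(range_image, spacing=10, patch_size=16):
    """Smooth the range image with a 5x5 median filter and extract grid patches."""
    smoothed = median_filter(range_image, size=5)
    half = patch_size // 2
    patches = []
    for y in range(half, smoothed.shape[0] - half, spacing):
        for x in range(half, smoothed.shape[1] - half, spacing):
            patches.append(smoothed[y - half:y + half, x - half:x + half])
    return patches

# Toy usage with a 160x120 image, the resolution of the PMD Vision 19k.
patches = dense_patches(np.random.rand(120, 160))
print(len(patches), patches[0].shape)
```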

4.2 Local Range Features

Range images have the advantage of providing direct information about the shape of objects. Therefore, it is beneficial to give preference to features that capture different shape aspects. Local shape descriptors are preferable as they provide some robustness to clutter and occlusion. Hetzel et al. [10] introduce and use three shape-specific local feature histograms (pixel depth, surface normals and curvature) for the task of free-form specific 3d object recognition. The main advantages of these features are their robustness to viewpoint changes and the discriminative information they contain for specific 3d object recognition [10] and generic object recognition [16].

Pixel Depth   A local histogram of depth values is the simplest available feature. All values are naturally invariant with respect to image plane translations and rotations and provide a rough representation of the object shape. As pointed out by [10], depth histograms can be misleading in scenarios with a large amount of background clutter and the presence of multiple objects. In this paper, a histogram of 64 bins of pixel distances is calculated and used.

Surface Normals   A histogram of surface normals, represented as a pair of angles (φ, θ) in spherical coordinates, provides a first-order statistic of the local object shape. Surface normals can be easily derived by approximating the gradient with finite differences. We use a two-dimensional histogram with 8 × 8 bins.

Figure 2: Local features for range images are computed on a grid without using interest point detectors. All local features are clustered with a Random Forest and used to compute Bag-of-Features (BoF) histograms for each cell. Only one level of the pyramid is shown in this figure.

Curvature   The curvature can be represented by using the shape index introduced in [19] and utilised in [10]:

S(p) = \frac{1}{2} - \frac{1}{\pi} \arctan\left(\frac{k_{\max}(p) + k_{\min}(p)}{k_{\max}(p) - k_{\min}(p)}\right)   (9)

where k_max(p) and k_min(p) denote the principal curvatures computed at point p. The principal curvatures are proportional to the eigenvalues of the structure matrix [20]. Therefore, the second derivatives have to be computed in order to calculate the shape index. The histogram computed over all S values consists of 64 bins.
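The sketch below computes the three local range descriptors for a single depth patch. The bin counts follow the text (64, 8 × 8, 64); approximating the principal curvatures by the eigenvalues of the per-pixel Hessian is a simplification of the structure-matrix formulation referenced above.

```python
# Minimal sketch of the three local range descriptors: depth histogram,
# surface-normal angle histogram, and shape-index histogram (equation (9)).
import numpy as np

def depth_histogram(patch, bins=64):
    hist, _ = np.histogram(patch, bins=bins)
    return hist / max(hist.sum(), 1)

def normal_histogram(patch, bins=8):
    dzdy, dzdx = np.gradient(patch)                        # finite differences
    normals = np.dstack((-dzdx, -dzdy, np.ones_like(patch)))
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    phi = np.arctan2(normals[..., 1], normals[..., 0])     # azimuth angle
    theta = np.arccos(np.clip(normals[..., 2], -1, 1))     # inclination angle
    hist, _, _ = np.histogram2d(phi.ravel(), theta.ravel(), bins=bins)
    return hist.ravel() / max(hist.sum(), 1)

def shape_index_histogram(patch, bins=64):
    dzdy, dzdx = np.gradient(patch)
    dyy, _ = np.gradient(dzdy)
    dxy, dxx = np.gradient(dzdx)
    # Eigenvalues of the 2x2 Hessian as curvature estimates k_max >= k_min
    # (a simplified stand-in for the structure-matrix eigenvalues).
    tr, det = dxx + dyy, dxx * dyy - dxy * dxy
    disc = np.sqrt(np.maximum(tr**2 / 4 - det, 0))
    k_max, k_min = tr / 2 + disc, tr / 2 - disc
    denom = np.where(k_max - k_min == 0, 1e-12, k_max - k_min)
    S = 0.5 - np.arctan2(k_max + k_min, denom) / np.pi     # equation (9), in [0, 1]
    hist, _ = np.histogram(S, bins=bins, range=(0, 1))
    return hist / max(hist.sum(), 1)
```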

4.3 Local Colour Features

Many papers have been published comparing different local descriptors of colour images for image categorisation tasks. The most recent work is the comparison of local colour feature descriptors by van de Sande et al. [17]. We use OpponentSIFT features (computed with the software provided by Koen van de Sande at http://www.colordescriptors.com), which were among the most discriminative descriptors in the evaluation of [17].

5 Image-Based Kernel Functions

In the following section we show how to compute image-based kernel functions using the local feature types presented in the last section.

One of the state-of-the-art feature extraction approaches for image categorisation is the Bag-of-Features (BoF) idea. A quantisation of local features, which is often called a codebook, is computed at training time. We use the supervised method of Moosmann et al. [21] for clustering, which provides discriminative codebook elements. For each image a histogram is calculated which counts the number of matching local features for each codebook entry. On the one hand, the intuition behind this idea is that the clustering corresponds to an automatic decomposition of objects into corresponding parts. Thus, the histogram counts the occurrence of several latent object parts in an image, which results in a highly discriminative descriptor. On the other hand, current local features are often unable to describe highly complex object parts. In these cases the global descriptor still contains sophisticated statistics about the image content.
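A minimal sketch of the BoF histogram computation is given below. For simplicity, a k-means codebook stands in for the supervised clustering forests of Moosmann et al. [21]; the codebook size is an illustrative assumption.

```python
# Minimal sketch of Bag-of-Features histograms from local descriptors.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, n_words=200):
    """Quantise the training descriptors into a visual codebook (k-means stand-in)."""
    return KMeans(n_clusters=n_words, n_init=4, random_state=0).fit(all_descriptors)

def bof_histogram(codebook, descriptors):
    """Count how many local features of one image match each codebook entry."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)                  # L1-normalised histogram
```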

A standard way to apply the BoF idea to kernel-based classifiers is to use the calculated histogram as a feature vector and apply a traditional kernel function such as the radial basis function kernel. In contrast, we define the kernel function directly on images. The Spatial Pyramid Matching Kernel as proposed by Lazebnik et al. [8] extends the BoF idea and divides the image recursively into cells (e.g. 2 × 2 cells). In each cell the BoF histogram is calculated, and the kernel value is computed using a weighted combination of histogram intersection kernels corresponding to each cell. Each of the four local feature types presented in Sect. 4 yields a different kernel, which can be combined with GP model selection as explained in Sect. 3.4.
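The following sketch computes a two-level Spatial Pyramid Matching Kernel value from per-cell BoF histograms, using histogram intersection and the level weights of [8] (1/2 for level 0 and 1/2 for level 1 in this two-level case). The pyramid depth and the assumption that feature positions and codebook assignments are given per image are illustrative simplifications.

```python
# Minimal sketch of the Spatial Pyramid Matching Kernel (SPMK).
import numpy as np

def histogram_intersection(h1, h2):
    return np.minimum(h1, h2).sum()

def spm_histograms(positions, words, image_shape, n_words, levels=2):
    """One BoF histogram per pyramid cell, for levels 0 .. levels-1."""
    H, W = image_shape
    hists = []
    for level in range(levels):
        cells = 2 ** level
        for cy in range(cells):
            for cx in range(cells):
                in_cell = ((positions[:, 0] * cells // H == cy) &
                           (positions[:, 1] * cells // W == cx))
                hists.append(np.bincount(words[in_cell], minlength=n_words))
    return hists

def spmk(hists_a, hists_b, levels=2):
    """Weighted sum of per-cell histogram intersections (level weights of [8])."""
    value, idx = 0.0, 0
    for level in range(levels):
        weight = 1.0 / 2 ** (levels - 1) if level == 0 else 1.0 / 2 ** (levels - level)
        for _ in range(4 ** level):
            value += weight * histogram_intersection(hists_a[idx], hists_b[idx])
            idx += 1
    return value
```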

6 Experiments

In the following we empirically support the following hypotheses:

1. Our method significantly outperforms our previous approach [9] (Sect. 6.3).

2. GP regression outperforms the Laplace approximation (Sect. 6.3).

3. Generic object recognition benefits from range information (Sect. 6.2).

4. Combining only range image kernels by GP likelihood optimisation leads to better results than fixed equal kernel weights. However, weight optimisation is not beneficial when all kernels and few training examples are used (Sect. 6.2).

5. Local range features computed using surface normals lead to the best recognition performance (Sect. 6.1).


Figure 3: Some examples of image pairs included in the dataset of [16], with images obtained from a ToF camera and a visual sensor. From top to bottom: cars, animals, cups, fruits and toys. The dataset includes high intraclass variability (animals, fruits, toys) and small interclass distances (animals and toys).

We used the object category dataset presented in [16] in all evaluation experiments. Note that this database is the only available dataset for generic object recognition that provides both Time-of-Flight camera images and images from a CCD camera.

The database of [16] consists of a large set of 2d/3d images obtained from a CCD camera and a Time-of-Flight camera [2], a PMD Vision 19k with a resolution of 160 × 120 pixels. Examples can be found in Figure 3. The images belong to five generic object categories (cars, toys, cups, fruits and animals). Each category consists of seven object instances with 32 image pairs each. Images can contain multiple instances of the same class, large viewpoint and orientation variations, partial occlusion (e.g. by other objects), truncation (e.g. by the image boundaries) as well as background clutter.

In contrast to our previous work [16], which uses a predefined split into a training set of 100 images and a test set of 60 images for each category, we evaluate our approach using a varying number g of object instances for training (32g image pairs) and the remaining images of the dataset for testing. As a performance measure we use the mean of the average recognition rate obtained from 50 evaluations with a random selection of training instances.

6.1 Evaluation of Range Feature Types

First of all, we compare the performance of all local range features presented in Sect. 4.2. Classification is done with GP regression and without additional hyperparameters. The results are shown in Figure 4 (a) for different numbers of object instances used for training. The performance ranking of the respective methods is clearly visible, and local features calculated using surface normals result in the best average recognition rate.

6.2 Evaluation of GP Classification and Kernel Combination

Let us have a look at the performance of GP regression compared to approximate GP classification with the Laplace approximation (LA). Figure 4 (c) shows the performance of both methods using the image kernel function of the CCD camera. GP regression significantly outperforms LA, which is a surprising result given the theoretical suitability of LA for classification problems. We also tested LA with hyperparameter optimisation of the length-scale ℓ included in the noise model defined by equation (6). This additional hyperparameter optimisation leads to a small performance gain, but is still inferior to GP regression.

Figure 4 (b) shows that by combining multiple range kernels, the categorisation performance increases compared to single range kernel functions. The best method is GP regression with weights optimised by likelihood optimisation as presented in Sect. 3.4. As also observed in the previous experiment, LA does not lead to a performance gain, even with hyperparameter optimisation.

The results for combined image kernels of both sensors are shown in Figure 4 (d). Combining range data from the ToF sensor and images of the visual sensor leads to a superior categorisation performance (76.3% using one instance) compared to the best results using a single sensor (71% with one instance). The performance gain due to sensor combination is most prevalent for few training examples or object instances. An interesting fact is that in this case we do not benefit from weight optimisation for kernel combination. This is likely due to over-fitting in the presence of the highly discriminative colour kernel. Therefore, one should prefer equal weights in those scenarios.

6.3 Comparison with Previous Work

We also compared our method with our previous approach [16] and its extension using dense sampling [9]. Note that these works are the only ones providing methods for ToF-based categorisation. In this experiment the same number of training examples is used to allow a direct comparison of the recognition rates. The results are shown in Table 1.


Figure 4: Average recognition rate as a function of the number of image pairs per category used for training (32, 64, 96, 128); 32 image pairs correspond to one object instance or type. (a) Comparison of range feature types (surface normals, depth, and shape index, each with GP regression); (b) GP methods with multiple range features (GP regression and Laplace approximation, each with equal and optimised kernel weights); (c) GP methods with colour features (GP regression, Laplace approximation, and Laplace approximation with length-scale optimisation); (d) combined features from the CCD camera and the ToF camera (GP regression and Laplace approximation, each with equal and optimised kernel weights).

Table 1: Comparison of our GP based approach to previous work with an equal number of training examples. (s) denotes range features computed on interest points only.

Ref.    Method          Features                 Recog. Rate
[16]    Boosting        range features (s)       39.8%
[9]     Boosting        range features           62.8%
Ours    GP regression   range features           79.2%
[16]    Boosting        combined features (s)    64.2%
[9]     Boosting        combined features        78.4%
Ours    GP regression   combined features        88.1%

Our approach utilising GP regression and hyperparameter optimisation significantly outperforms their approaches, both for single sensor data from the ToF camera and for the combined information of both sensors. Even using range features only, we achieve a 79.2% average recognition rate, which is superior to the overall classification system of [9] with a recognition rate of 78.4%. The confusion matrix in Figure 5 highlights the still existing classification difficulties.

7 Conclusions and Further Work

We present an approach to generic object recognition using combined information obtained from a CCD camera and a Time-of-Flight camera.

Figure 5: Results of our GP approach represented as a confusion matrix over the classes cars, animals, cups, fruits and toys. Only values above 3% are displayed and highlight difficult cases.

Our method is based on Gaussian process (GP) classification and kernel functions computed using different types of local features. This framework allows us to study various aspects of image categorisation with GPs and leads to interesting results, such as the superiority of GP regression compared to the Laplace approximation. We also observe that Bayesian kernel combination does not always lead to better results compared to equally weighted kernels, especially for few training examples and with a single highly discriminative kernel.

An interesting direction for future research would be to incorporate the variable sensor uncertainty of Time-of-Flight sensors directly into the problem of kernel combination. This uncertainty is directly available for each pixel of those cameras and might give hints about whether to trust the range-based kernel in situations where range estimation is not reliable (such as black surfaces).

8 Acknowledgments

The authors would like to thank Michael Kemmler for valuable discussions about Gaussian processes and his comments on the paper.

References

[1] J. Zhang, M. Marszalek, S. Lazebnik, and C. Schmid, "Local features and kernels for classification of texture and object categories: A comprehensive study," IJCV, vol. 73, no. 2, pp. 213–238, 2007.

[2] R. Lange, "3d time-of-flight distance measurement with custom solid-state image sensors in cmos/ccd-technology," Ph.D. dissertation, University of Siegen, 2000.

[3] Y. Cui, S. Schuon, C. Derek, S. Thrun, and C. Theobalt, "3d shape scanning with a time-of-flight camera," in CVPR, 2010.

[4] V. Ganapathi, C. Plagemann, D. Koller, and S. Thrun, "Real time motion capture using a single time-of-flight camera," in CVPR, 2010.

[5] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.

[6] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell, "Gaussian processes for object categorization," IJCV, vol. 88, no. 2, pp. 169–188, 2010.

[7] N. D. Lawrence, J. C. Platt, and M. I. Jordan, "Extensions of the informative vector machine," in Deterministic and Statistical Methods in Machine Learning, 2004, pp. 56–87.

[8] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in CVPR, 2006, pp. 2169–2178.

[9] D. Hegazy, "Boosting for generic 2d/3d object recognition," Ph.D. dissertation, Friedrich Schiller University of Jena, 2009.

[10] G. Hetzel, B. Leibe, P. Levi, and B. Schiele, "3d object recognition from range images using local feature histograms," in CVPR, vol. 2, 2001, pp. 394–399.

[11] R. Toldo, U. Castellani, and A. Fusiello, "A bag of words approach for 3d object categorization," in MIRAGE: 4th Intl Conference on Computer Vision/Computer Graphics Collaboration Techniques, 2009, pp. 116–127.

[12] G. Zhang and Y. Wang, "Faceprint: Fusion of local features for 3d face recognition," in ICB: Intl Conf on Advances in Biometrics, 2009, pp. 394–403.

[13] T.-W. R. Lo and J. P. Siebert, "Local feature extraction and matching on range images: 2.5d sift," Computer Vision and Image Understanding, vol. 113, no. 12, pp. 1235–1250, 2009.

[14] M. Haker, M. Böhme, T. Martinetz, and E. Barth, "Scale-invariant range features for time-of-flight camera applications," in CVPR Workshop on Time-of-Flight-based Computer Vision (TOF-CV), 2008, pp. 1–6.

[15] X. Li and I. Guskov, "3d object recognition from range images using pyramid matching," in ICCV, 2007, pp. 1–6.

[16] D. Hegazy and J. Denzler, "Combining appearance and range based information for multi-class generic object recognition," in CIARP, 2009, pp. 741–748.

[17] K. van de Sande, T. Gevers, and C. Snoek, "Evaluating color descriptors for object and scene recognition," PAMI, vol. 32, no. 9, pp. 1582–1596, 2010.

[18] Y. A. Qi, T. P. Minka, R. W. Picard, and Z. Ghahramani, "Predictive automatic relevance determination by expectation propagation," in ICML, 2004, pp. 85–92.

[19] J. J. Koenderink and A. J. van Doorn, "Surface shape and curvature scales," Image and Vision Computing, vol. 10, no. 8, pp. 557–564, October 1992.

[20] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," IJCV, vol. 60, no. 2, pp. 91–110, 2004.

[21] F. Moosmann, B. Triggs, and F. Jurie, "Fast discriminative visual codebooks using randomized clustering forests," in NIPS, 2006, pp. 985–992.