Source: aiweb.techfak.uni-bielefeld.de/files/accv-10-3dic.pdf (2010-11-18)

Indoor Scene Classification using combined 3D and Gist Features

Agnes Swadzba and Sven Wachsmuth

Applied Informatics, Bielefeld University, Germany
{aswadzba, swachsmu}@techfak.uni-bielefeld.de

Abstract. Scene categorization is an important mechanism for providing high-level context which can guide methods for a more detailed analysis of scenes. State-of-the-art techniques like Torralba's Gist features show a good performance on categorizing outdoor scenes but have problems in categorizing indoor scenes. In contrast to object-based approaches, we propose a 3D feature vector capturing general properties of the spatial layout of indoor scenes like shape and size of extracted planar patches and their orientation to each other. This idea is supported by psychological experiments which give evidence for the special role of 3D geometry in categorizing indoor scenes. In order to study the influence of the 3D geometry we introduce in this paper a novel 3D indoor database and a method for defining 3D features on planar surfaces extracted in 3D data. Additionally, we propose a voting technique to fuse 3D features and 2D Gist features and show in our experiments a significant contribution of the 3D features to the indoor scene categorization task.

1 Introduction

Fig. 1: Example room from the 3D database, corresponding 3D data for defining 3D features, and computed 2D Gist image.

An important ability for an agent (both human and robot) is to build an instantaneous concept of the surrounding environment. Such holistic concepts activate top-down knowledge that guides the visual analysis in further scene interpretation tasks. For this purpose, Oliva et al. [1] introduced the so-called Gist features that represent the spatial envelope of a scene given as a 2D image. Based on this general idea a variety of scene recognition approaches have been proposed that work well for outdoor scenes – like "buildings", "street", "mountains", etc. – but frequently break down for indoor scenes, e.g., "kitchen", "living room", etc. Recently, there are some approaches, e.g., [2], that try to combine global Gist information with local object information to achieve a better performance on different indoor categories. Although they achieve a significant improvement, the training of such methods relies on a previous hand-labeling of relevant regions of interest. Psychological evidence even suggests that objects do not contribute to the low-level visual process responsible for determining the room category. In early experiments, Brewer and Treyens


provided hints that with the perception of room schemata, objects (e.g., books) were memorized to be in the experimental room (e.g., an office) though they were not really present. Henderson et al. [3, 4] even found through their fMRI studies brain areas that are sensitive to 3D geometric structures enabling a distinction of indoor scenes from other complex stimuli but do not show any activation on close-up views of scene-relevant objects, e.g., a kitchen oven. This emphasizes the contribution of the scene geometry to the ability of categorizing indoor scenes. Currently, most approaches to scene recognition are based on color and texture statistics of 2D images, neglecting the 3D geometric structure of scenes. Therefore, our goal is to capture the 3D spatial structure of scene categories (see Fig. 1 for an impression of the used data) in a holistically defined feature vector. More precisely, as man-made environments mainly consist of planar surfaces, we define appropriate statistics on these spatial structures capturing their shape and size and their orientations to each other. To analyze their performance on the indoor scene categorization problem a new dataset is introduced consisting of six different room categories and 28 individual rooms that have been recorded with a Time-of-Flight (ToF) sensor in order to provide 3D data. The classification results are contrasted to results achieved with 2D Gist features. Further, possible combination schemes are discussed. In the following, the paper discusses related work, describes the 3D indoor database and the computation of our 3D scene descriptor and combination schemes with 2D features. Finally, evaluation results are discussed.

2 Related Work

As navigation and localization were among the first tasks exhaustively investigated in robotics research, some interesting approaches to the problem of recognizing known rooms and categorizing unknown rooms can be found there. Known rooms can be recognized by invariant features in 360° laser scans [5], by comparing current 2D features to saved views [6], and by recognizing specific objects and their configurations [7]. Burgard's group has examined a wide range of different approaches to the more general place categorization problem, ranging from the detection of concepts like "room", "hall", and "corridor" by simple geometric features defined on a 2D laser scan [8] to subconcepts like "kitchen" [9] using ontologies encoding the relationship between objects and the subconcepts [10]. Alternatively, occurrence statistics of objects in place categories [11, 12] and interdependencies among objects and locations [13] have been tried.

Parallel to the described approaches, Torralba [14] has developed a low-level global image representation by wavelet image decomposition that provides relevant information for place categorization. These so-called Gist features perform well on a lot of outdoor categories like "building", "street", "forest", etc. but have problems categorizing indoor scenes reliably. Therefore, they have extended their approach by local discriminative information [2]. They learn prototypes from hand-labeled images, which define the mapping between images and scene labels. Similar to those prototypes, Li and Perona represent an image by a


Fig. 2: Photos of the 28 different rooms scanned in 3D by a Swissranger SR3100 (bath.1–3, bed.1–4, eat.1–6, kit.1–7, liv.1–5, off.1–3). The images were taken for visualization at the position of the 3D camera. The database contains 3 bathrooms, 4 bedrooms, 6 eating places, 7 kitchens, 5 living rooms, and 3 offices.

collection of local regions denoted as codewords which they obtain by unsupervised learning [15]. Similar approaches based on computing visual or semantic typicality vocabularies and measuring per image the frequency of these words or concepts are introduced by Posner [16], Bosch [17], and Vogel [18]. A local-features-based technique used in this paper for comparison is proposed by Lazebnik et al. [19]. They partition an image into increasingly fine sub-regions, the so-called spatial pyramid, and represent these regions through SIFT descriptors. In [6] the application of such local features for place categorization on a mobile robot in a real environment is tested with moderate success.

So far, to the best of our knowledge, 3D information has only been used for detecting objects [20–22] or classifying each 3D point into given classes such as "vegetation", "facade", ... [23], "chair", "table", ... [24], or "plane", "sphere", "cylinder", ... [25]. The introduced features are of a local nature since each 3D point is represented by a feature vector computed from the point's neighborhood. They are not applicable here as we aim for classifying a set of points as a whole.

3 The 3D Indoor Database

In order to systematically study room categorization based on the 3D geometry of rooms, a sufficiently robust method for extracting 3D data is necessary. Even though recent papers show an impressive performance in extracting the 3D room frame [26, 27] or depth-ordered planes from a single 2D image [28], they are at the moment not precise enough in computing all spatial structures given by, e.g., the furniture, on the desired level of detail. Therefore, existing 2D indoor databases cannot be utilized. Recording rooms in 3D could be done in principle by stereo camera systems, laser range finders, or Time-of-Flight (ToF) cameras. As room structures are often homogeneous in color and less textured, stereo


cameras will not provide enough data. Laser range finders are unhandy and not fast enough to instantaneously acquire dense 3D point clouds from a number of rooms. Thus, we use ToF cameras to acquire appropriate data, but our approach is not limited to this kind of data. The Swissranger SR3100 from Mesa Imaging1 is a near-infrared sensor delivering in real-time a dense depth map of 176×144 pixels resolution [29]. The depth is estimated per pixel by measuring the time difference between the signal sent and received. Besides the 3D point cloud (→ Fig. 3(b)), the camera delivers a 2D amplitude image (→ Fig. 3(a)) which looks similar to a normal gray-scale image but encodes for each pixel the amount of reflected infrared light. The noise ratio of the sensor depends on the reflection properties of the materials contained in the scene. Points measured from surfaces reflecting infrared light in a diffuse manner have a deviation of 1 to 3 cm.

For acquiring many differently arranged rooms per category, we recorded data in a regular IKEA home-center2. Its exhibition is ideal for our purpose because it is assembled from 3D boxes representing single rooms, each furnished differently. We placed the 3D camera at an arbitrary position on the open side and scanned the room for 20 to 30 seconds while the camera was continuously moved by roughly 40° left/right and 10° up/down. This simulates a robot moving its head to perceive the entire scene. We acquired data for 3 bathrooms, 4 bedrooms, 6 eating places, 7 kitchens, 5 living rooms, and 3 offices3. Fig. 2 shows photos of the scanned rooms taken at the position of the 3D camera.

4 Learning a Holistic Scene Representation

This section presents the computation of our novel 3D scene descriptor and the 2D Gist descriptor from a Swissranger frame F. Further, details are given on the learning of room models from these feature types and on the combination of classification results from both features and several consecutive frames to achieve a robust scene categorization on data of a so far unseen room.

Scene Features From 3D Data. As man-made environments mostly consist of flat surfaces, a simplified representation of the scene is generated by extracting bounded planar surfaces from the 3D point cloud of a frame. Noisy data is smoothed by applying median filters to the depth map of each frame. Artifacts at jumping edges (a common problem with actively sensing sensors) are removed by declaring points as invalid if they lie on edges in the depth image. A normal vector ni is computed for each valid 3D point pi via Principal Component Analysis (PCA) of the points in its N5×5 neighborhood in the Swissranger image plane. The decomposition of the 3D point cloud into connected planar regions is done via region growing as proposed in [30] based on the conormality measurement defined in [31]. Iteratively, points are selected randomly as the seed of a region and extended with points of their N3×3 neighborhood if the points are

1 http://www.mesa-imaging.ch
2 http://www.ikea.com/us/en/
3 http://aiweb.techfak.uni-bielefeld.de/user/aswadzba


Fig. 3: (a): Swissranger amplitude image (liv.5, frame 162); (b): corresponding 3D point cloud with extracted planar surfaces (highlighted by color), an angle between two planes, a box enclosing a plane with its edges parallel to the two largest variance directions, and a plane transformed so that the vectors indicating the two largest variance directions in the data are parallel to the coordinate axes; (c): the responses of the Gabor filters applied to the amplitude image are averaged into one image visualizing the information encoded by the Gist feature vector.

valid and conormal. Two points p1 and p2 are conormal when their normals n1 and n2 satisfy

α = ∠(n1, n2) = { arccos(n1 · n2)       if arccos(n1 · n2) ≤ π/2
                  π − arccos(n1 · n2)   else },       with α < θα.   (1)

Here, we have chosen θα = π/18. The resulting regions are refined by some runs

of the RANSAC algorithm [32]. Fig. 3(b) displays for a frame F the resulting set of planes {Pj | nj · p − dj = 0}j=1...m where nj is computed through PCA of the points pi assigned to Pj and dj by solving the plane equation for the centroid of these plane points.
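The per-point normal estimation and the conormality test that drive the region growing can be sketched as follows. This is a minimal numpy sketch with names of our choosing; taking |n1 · n2| before the arccos is an equivalent shortcut for the case distinction in Eq. 1:

```python
import numpy as np

def estimate_normal(points):
    """Normal of a small 3D neighborhood: the eigenvector of the point
    scatter matrix with the smallest eigenvalue (PCA, as in the paper)."""
    pts = np.asarray(points, dtype=float)
    centered = pts - pts.mean(axis=0)
    # eigh returns eigenvalues of a symmetric matrix in ascending order
    _, vecs = np.linalg.eigh(centered.T @ centered)
    return vecs[:, 0]  # direction of smallest variance

def conormal(n1, n2, theta=np.pi / 18):
    """Conormality test of Eq. 1: the acute angle between the two
    normals must stay below theta (= pi/18 in the paper)."""
    c = np.clip(abs(np.dot(n1, n2)), 0.0, 1.0)  # |dot| folds angle into [0, pi/2]
    alpha = np.arccos(c)
    return alpha < theta
```

During region growing, a seed point's normal would be compared against its N3×3 neighbors with `conormal` before they are merged into the plane.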

In the following, the computation of a 3D feature vector x3D based on the extracted planar patches in F is described. This vector encodes the 3D geometry of the perceived scene independent of colors and textures, the view point, and the camera orientation. For the m planes {Pj}j=1,...,m in frame F shape and size characteristics are computed, and for all plane pairs (Pk, Pl), k, l ∈ (1...m), k ≠ l, angle and size ratio characteristics. For each plane characteristic, cs, cA, c÷, and c∠, the resulting values are binned into a separate histogram which is normalized to length 1 by dividing the histogram by the number of planes or plane pairs, respectively. Then, the resulting four vectors are concatenated into one feature vector encoding general spatial properties of the scene like the orientation of the patches to each other, their shapes, and their size characteristics:

x3D = 〈 x∠, xs, xA, x÷ 〉, with (2)

cs = min(a, b) / max(a, b)  →  xs = (1/m) · Hs(csj), j = 1...m, (3)

cA = a · b  →  xA = (1/m) · HA(cAj), j = 1...m, (4)

c÷kl = min(cAk, cAl) / max(cAk, cAl)  →  x÷ = (2 / (m(m−1))) · H÷(c÷kl), k, l = 1...m, (5)

c∠kl = ∠(nk, nl)  →  x∠ = (2 / (m(m−1))) · H∠(c∠kl), k, l = 1...m. (6)

Shape characteristic cs. Each planar patch P is spanned by the points assigned to this patch during plane extraction. PCA of these points provides three vectors 〈 a, b, n 〉 with a indicating the direction of the largest variance in the data


and b the direction orthogonal to a with the second largest variance. n gives the normal vector of the planar patch. Fig. 3(b) shows a planar patch P transformed so that a and b are parallel to the coordinate axes. A minimum bounding box is computed with its orientation parallel to a and b including all plane points. The height of the box is measured along a and the width along b. As can be seen in Eq. 3 the feature vector xs is a normalized histogram Hs over csj with h = 5 bins in the interval ]0, 1[ and a bin width of ∆h = 0.2. It encodes the distribution of shapes which range from elongated (cs near 0) to squarish (cs near 1).
Size characteristic cA. Considering the minimum bounding box of P the covered area can be easily computed. The normalized histogram HA (→ Eq. 4) consists of h = 6 bins with the boundaries [0; (25cm)²[, [(25cm)²; (50cm)²[, [(50cm)²; (100cm)²[, [(100cm)²; (200cm)²[, [(200cm)²; (300cm)²[, and [(300cm)²; ∞[.
Size ratio characteristic c÷. For a plane pair (Pk, Pl) a size ratio near 0 denotes that the two planes differ significantly in their size, whereas the planes cover nearly the same area if the value is near 1. The histogram H÷ is designed like histogram Hs (h = 5, ∆h = 0.2). Assuming m(m−1)/2 pairs of planes, the feature vector is computed using Eq. 5.
Angle characteristic c∠. Using Eq. 1, the acute angle between two planes (Pk, Pl) can be computed from the planes' normals. Eq. 6 gives the corresponding histogram H∠ (h = 9, ∆h = π/18, [0, π/2]).
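Putting Eqs. 2–6 together, the 25-dimensional descriptor (9 + 5 + 6 + 5 bins) can be sketched like this. This is a numpy sketch assuming each patch has already been reduced to its bounding-box extents a, b (in metres) and its normal; all names are ours:

```python
import numpy as np
from itertools import combinations

def scene_descriptor_3d(patches):
    """x3D = < x_angle, x_s, x_A, x_div > from a list of planar patches,
    each given as (extent_a, extent_b, unit_normal)."""
    m = len(patches)
    areas = [a * b for a, b, _ in patches]
    # shape characteristic (Eq. 3): min/max side ratio, 5 bins on ]0,1[
    cs = [min(a, b) / max(a, b) for a, b, _ in patches]
    xs = np.histogram(cs, bins=5, range=(0, 1))[0] / m
    # size characteristic (Eq. 4): 6 bins with quadratic boundaries in m^2
    edges = [0, 0.25**2, 0.5**2, 1.0**2, 2.0**2, 3.0**2, np.inf]
    xA = np.histogram(areas, bins=edges)[0] / m
    # pairwise characteristics (Eqs. 5, 6) over unordered plane pairs
    pairs = list(combinations(range(m), 2))
    cdiv = [min(areas[k], areas[l]) / max(areas[k], areas[l]) for k, l in pairs]
    cang = [np.arccos(np.clip(abs(np.dot(patches[k][2], patches[l][2])), 0, 1))
            for k, l in pairs]
    xdiv = np.histogram(cdiv, bins=5, range=(0, 1))[0] / len(pairs)
    xang = np.histogram(cang, bins=9, range=(0, np.pi / 2))[0] / len(pairs)
    return np.concatenate([xang, xs, xA, xdiv])  # 9 + 5 + 6 + 5 = 25 dims
```

Normalizing by the number of unordered pairs is the same as the 2/(m(m−1)) factor over all ordered pairs in Eqs. 5 and 6.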

Scene Features From 2D Data. Since the ToF camera delivers, in addition to the 3D point cloud, an amplitude image which can be treated like a normal grayscale image, we are able to compute Torralba's Gist features4. While the feature vector x3D captures information about the arrangement of planar patches in the scene, xGist encodes additional global scene information. These Gist features are texture features computed from a wavelet image decomposition [14]. Each image location is represented by the output of filters tuned to different orientations and scales. Here, we used 8 orientations and 4 scales applied to a 144×144 clipping of the 176×144 amplitude image bilinearly resized to 256×256. The resulting representation is downsampled to 4×4 pixels. Thus, the dimensionality of xGist is 8 × 4 × 16 = 512. Fig. 3(c) shows the average of the filter responses on the amplitude image. Since the depth values of the Swissranger are organized in a regular 2D grid, this so-called depth image can also be used to compute in the same way a so-called Depth-Gist feature vector, xDGist. It is an alternative approach to compute features from 3D data.
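For illustration, a strongly simplified Gist-like descriptor with the stated dimensionality (8 orientations × 4 scales × 4×4 grid = 512) might look as follows. This is not Torralba's released code but a rough frequency-domain stand-in; the filter shape and all parameters are our assumptions:

```python
import numpy as np

def gist_descriptor(img, n_orient=8, n_scales=4, grid=4):
    """Simplified Gist: energy of oriented band-pass (Gabor-like) filters,
    built directly in the frequency domain, block-averaged on a grid."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    fy, fx = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing="ij")
    rad = np.sqrt(fx**2 + fy**2) + 1e-12   # radial frequency (avoid log(0))
    ang = np.arctan2(fy, fx)
    F = np.fft.fft2(img)
    feats = []
    for s in range(n_scales):
        f0 = 0.25 / (2**s)                 # centre frequency per scale (assumed)
        for o in range(n_orient):
            th = o * np.pi / n_orient
            # angular distance folded to [-pi/2, pi/2]: filters are symmetric
            dth = np.angle(np.exp(1j * 2 * (ang - th))) / 2
            transfer = np.exp(-((np.log(rad / f0))**2) / 0.5 - (dth**2) / 0.3)
            resp = np.abs(np.fft.ifft2(F * transfer))
            # average the response map onto a grid x grid layout
            blocks = resp.reshape(grid, h // grid, grid, w // grid)
            feats.append(blocks.mean(axis=(1, 3)).ravel())
    return np.concatenate(feats)          # n_scales * n_orient * grid^2 dims
```

Applied to a depth image instead of the amplitude image, the same routine would yield a Depth-Gist-style vector.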

Training Room Models and Combining Single Classifications. In the learning stage of our system, we decided to learn a discriminant model for each class which can be used to compute the distance of a feature vector to each class boundary. As we use Support Vector Machines (SVM), the class boundaries are defined by the learned support vectors. The learning of a class model is realized by a one-against-all learning technique where the in-class feature vectors form the positive samples. The same amount of negative examples is sampled uniformly from all other classes. Formally written, for our set of six classes

4 http://people.csail.mit.edu/torralba/code/spatialenvelope


Ω = {bath., bed., eat., kit., liv., off.} we learn a set of discriminant functions G(x) which map the n-dimensional feature vector x to a |Ω|-dimensional distance vector d with di denoting the distance to the class boundary of ωi [33]:

G(x) : x → d = 〈 d1, . . . , d6 〉 = 〈 di = gi(x) | gi : ℝn → ℝ 〉i=1...6, (7)

E(d) : d → e = 〈 e1, . . . , e6 〉 = 〈 0, . . . , ej = 1, . . . , 0 〉 with dj = max{di}i=1,...,6. (8)
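A sketch of Eqs. 7 and 8 for linear discriminants (the paper uses kernel SVMs; the linear form and all names here are our simplification):

```python
import numpy as np

def make_discriminants(weight_rows, biases):
    """G(x), Eq. 7: signed distances to each one-vs-all class boundary.
    For a linear SVM, g_i(x) = w_i . x + b_i; a kernel SVM would supply
    the same signature through its decision function."""
    W = np.asarray(weight_rows, dtype=float)
    b = np.asarray(biases, dtype=float)
    return lambda x: W @ np.asarray(x, dtype=float) + b

def decide(d):
    """E(d), Eq. 8: one-hot vector with 1 at the largest distance."""
    e = np.zeros(len(d))
    e[int(np.argmax(d))] = 1.0
    return e
```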

The classification is then realized by our decision function E(.) which maps the distance vector d to a binary vector e where the j-th component is equal to 1 if dj = max{. . . , di, . . .} and all other components are set to 0. In our scenario a sequence of frames will be recorded from an unknown room while tilting and panning the camera. A classification result for a single frame might not be very reliable as the view field of the camera is limited so that only a small part of the room can be seen per frame. By considering the classification results of a set of consecutive frames {Ft}t∈∆t, the decision becomes more reliable. In the following, we are going to present two different fusion schemes, V1×E and V2×E.
Voting Scheme V2×E. It relies on the idea of performing a classification decision on the frame level, summing the decisions within the window ∆t weighted by their reliability, and making a second decision on the resulting sum:

e∆t = E(d∆t), d∆t = Σt∈∆t A(dt) · E(dt), dt = G(xt), (9)

A(d) = l(dj1) − l(dj2), l(x) = 1 / (1 + e−x), (10)

dj1 = max{di}, dj2 = max({di} \ {dj1}). (11)

Eq. 11 determines the reliability of a decision by the difference between the value of the best and the value of the second-best class. The bigger the difference, the more reliable the decision. For a plausible comparison of different classification results the distances to the class boundaries have to be normalized. For this purpose we have chosen the strictly monotonic logistic function l(.) which maps the distances to values in ]0, 1[, resulting in the weighting function A(.) (→ Eq. 10).
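Voting scheme V2×E (Eqs. 9–11) can be sketched as follows (numpy; function names are ours):

```python
import numpy as np

def logistic(x):
    """l(x) of Eq. 10: squashes distances into ]0, 1[."""
    return 1.0 / (1.0 + np.exp(-x))

def reliability(d):
    """A(d), Eqs. 10/11: gap between best and second-best class support
    after logistic normalization."""
    top2 = np.sort(d)[-2:]                 # [second best, best]
    return logistic(top2[1]) - logistic(top2[0])

def vote_v2e(distance_vectors):
    """V2xE, Eq. 9: per-frame one-hot decisions, weighted by their
    reliability, summed over the window; final decision by argmax."""
    acc = np.zeros(len(distance_vectors[0]))
    for d in distance_vectors:
        e = np.zeros_like(acc)
        e[int(np.argmax(d))] = 1.0         # frame-level decision E(d_t)
        acc += reliability(d) * e          # weighted vote
    final = np.zeros_like(acc)
    final[int(np.argmax(acc))] = 1.0       # window-level decision
    return final
```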

Voting Scheme V1×E. Making a winner-takes-all class decision based on one frame might be vulnerable to noise. Therefore, an alternative approach is to skip the decision on the frame level. Instead, it passes the support for all classes to the final classification. The benefit of this scheme arises in cases where some frames give more or less equal support for two or more classes. Intuitively, making a hard decision towards one of these classes involves a high risk of misclassification. The idea is to lower this risk by keeping the support for all classes and counting on frames in ∆t with significant support for one class to provide a reliable class decision. Formulating this mathematically, normalized distance vectors dt are collected over time, resulting in an accumulated distance vector d∆t on which the decision function E(.) is applied:

e∆t = E(d∆t), d∆t = Σt∈∆t L(dt), L(d) = 〈 . . . , l(di), . . . 〉. (12)
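Correspondingly, V1×E (Eq. 12) accumulates logistically normalized support for every class without per-frame decisions; a sketch under the same assumptions as above:

```python
import numpy as np

def vote_v1e(distance_vectors):
    """V1xE, Eq. 12: accumulate L(d_t), the element-wise logistic of each
    frame's distance vector, over the window; decide once at the end."""
    acc = np.zeros(len(distance_vectors[0]))
    for d in distance_vectors:
        acc += 1.0 / (1.0 + np.exp(-np.asarray(d, dtype=float)))
    e = np.zeros_like(acc)
    e[int(np.argmax(acc))] = 1.0
    return e
```

In the test below the first frame narrowly favors class 0, but the consistent support for class 1 in the remaining frames wins the window, which is exactly the behavior the scheme is designed for.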

Combining Different Feature Types. The voting and weighting technique based on different sums also allows for a straightforward combination of different feature types with differing dimensions. In our case, a set of discriminant functions


G3D(.) is learned based on 3D features and another set GGist(.) based on Gist features. For a frame window {Ft}t∈∆t and a set of feature vectors {x3Dt, xGistt}t∈∆t the corresponding distance vectors are denoted by d3Dt = G3D(x3Dt) and dGistt = GGist(xGistt). The combined decision is formulated as follows using the described voting schemes:

eComb∆t = E(d3D∆t + dGist∆t), d3D∆t = Σt∈∆t A(d3Dt) · E(d3Dt), dGist∆t = Σt∈∆t A(dGistt) · E(dGistt), (13)

eComb∆t = E(d3D∆t + dGist∆t), d3D∆t = Σt∈∆t L(d3Dt), dGist∆t = Σt∈∆t L(dGistt). (14)
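The feature-type fusion then reduces to adding the two window accumulators before the final decision; a sketch of the Eq. 14 (V1×E) variant, with names of our choosing:

```python
import numpy as np

def fuse_decision(d3d_window, dgist_window):
    """Eq. 14: accumulate logistic support per feature type over the
    window, add the two accumulators, and decide once by argmax."""
    l = lambda d: 1.0 / (1.0 + np.exp(-np.asarray(d, dtype=float)))
    acc = sum(l(d) for d in d3d_window) + sum(l(d) for d in dgist_window)
    e = np.zeros_like(acc)
    e[int(np.argmax(acc))] = 1.0
    return e
```

The Eq. 13 variant would be obtained by replacing the logistic accumulation with the reliability-weighted per-frame decisions of V2×E.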

Rejection. Depending on the selected window size, the speed of the camera drive, and the current frame rate of the camera, the actually acquired frames might only show uninformative scene views or may be disturbed by persons moving in front of the camera. The resulting classifications are unreliable and can be excluded by using the modified decision function E′(.) (→ Eq. 15). Classifications are rejected (e = 0) if the distances to the class boundary of the best class and the second-best class do not differ significantly. During evaluation θrej = 0.05 is used, leading to rejection of not more than 20% of the test frames. We apply E′(.) only in the final classification stage to the vector d∆t.

E′(d) = 〈 0, . . . , ej1, . . . , 0 〉, where ej1 = { 1 : (dj1 − dj2) / dj1 > θrej
                                                    0 : else }. (15)
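Eq. 15 can be sketched as follows (assuming, as when applied to the accumulated window vector d∆t, that the entries are positive support values so the relative margin is well defined):

```python
import numpy as np

def decide_with_rejection(d, theta_rej=0.05):
    """E'(d), Eq. 15: one-hot decision unless the best class exceeds the
    runner-up by less than a relative margin theta_rej, in which case an
    all-zero vector (rejection) is returned."""
    d = np.asarray(d, dtype=float)
    e = np.zeros_like(d)
    top2 = np.sort(d)[-2:]                       # [second best, best]
    if (top2[1] - top2[0]) / top2[1] > theta_rej:
        e[int(np.argmax(d))] = 1.0
    return e
```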

5 Evaluation of Room Type Categorization

This section describes the evaluation of the performance in determining the room type for a set of frames of an unknown room, using our proposed combination of spatial 3D and Gist features (x3D, xGist), Eq. 13 and Eq. 14. Its performance is compared to classification results using pure x3D or pure xGist. Additionally, the Depth-Gist feature vector xDGist and its combination with the standard Gist feature vector, (xDGist, xGist), is tested as an alternative combination of features from 2D and 3D data. We also apply a scene classification scheme to our dataset that is based on the spatial pyramid matching proposed by Lazebnik et al. [19]. The required dictionary is built on the amplitude images of our training sets with a dictionary size of 100, yielding 8500-dimensional feature vectors5, xSP. The evaluation utilizes the 3D indoor database of rooms presented in Sec. 3. All 2D descriptors are computed on the amplitude image to achieve a fair comparison with the 3D descriptor as the same amount of scene detail is considered. Further, to avoid a bias arising from the chosen test sequences, all presented classification results are averaged over 10 training runs. Within each run one room per class is chosen randomly, extracting all frames of that room as the test sequence while the frames of the remaining rooms form the training set. For a better comparison of the different features, SVM models with an RBF kernel are trained where the kernel's parameters are optimized through a 10-fold cross-validation scheme, choosing the model with the smallest possible set of support vectors (Occam's razor).
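The per-run split described above, one randomly held-out room per class, might be sketched as follows (names are ours):

```python
import random

def leave_one_room_out(rooms_by_class, seed=0):
    """One evaluation run: pick a random room per class as the test set;
    all frames of the remaining rooms would form the training set."""
    rng = random.Random(seed)
    train, test = [], []
    for cls, rooms in rooms_by_class.items():
        held_out = rng.choice(rooms)
        for room in rooms:
            (test if room == held_out else train).append((cls, room))
    return train, test
```

Repeating this with 10 different seeds and averaging the resulting classification rates corresponds to the 10 training runs reported below.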

5 http://www.cs.unc.edu/~lazebnik/


[Fig. 4: four panels plotting classification rate (y-axis, 0.4 to 1.0) against window size (x-axis, 1 to 300 frames): (a) voting scheme V1×E, rejection θrej = 0.05; (b) V1×E, no rejection; (c) V2×E, rejection θrej = 0.05; (d) V2×E, no rejection.]

Fig. 4: This figure shows the development of the classification results when enlarging the window ∆t from 1 to 300 frames (→ x-axis). Two voting schemes, V1×E and V2×E, introduced in Eq. 14 and Eq. 13, are evaluated. The tested feature vectors and combinations are: green → (x3D, xGist), orange → (xDGist, xGist), blue → xGist, cyan → xDGist, red → x3D, and black → xSP. σ denotes the mean standard deviation of the classification curve. r denotes the mean rejection (rej.) rate. (best viewed in color)

Here, we utilized the SVMlight implementation of the support vector machine algorithm [34, 35]. Additionally, the models learnt on the IKEA database are tested for their applicability on Swissranger frames acquired in two real flats.

An important parameter of our system is the number of consecutive frames in ∆t used for determining the current class label through voting. As our sequences are recorded with a camera standing at one or two positions and being moved continuously, simulating a robot's head looking around, the possible values could range from one frame to all frames of the sequence. Performing the room category decision on one frame is expected to be very fast but vulnerable to noise. The more frames are considered, the more stable the classification result should become. There is a trade-off between getting a room type label quickly and reliably. The curves in Fig. 4 show the development of the classification rates as the window size ∆t is enlarged from 1 to 300 frames. As assumed, the performance of all features increases as the window gets bigger. Taking a deeper look at all subfigures of Fig. 4 it can be seen that our proposed combination (x3D, xGist) of spatial 3D and Gist features clearly outperforms the other features under all voting and rejection conditions. E.g., the classification rates in Fig. 4(a) (mean: 0.84, max: 0.94) are higher than the corresponding values of the second-best curve (which are 0.71 and 0.79 using combined Depth-Gist and Gist features). The error is reduced by 45% and 71%, respectively. This reduction is especially remarkable because it points out that features need to be carefully defined on 3D data in order to capture informative spatial structures


in a sufficiently generalized way. Here, we are able to show that this can be done with lower-dimensional features.

Fig. 5: This figure shows the confusion matrices of all examined features. Here, the window size for voting over consecutive frames is set to ∆t = 60. The average classification rates for individual classes are listed along the diagonal; all other entries show misclassifications.

Figure 4 also shows the influence of the different voting and rejection schemes on the categorization performance. Skipping the winner-takes-all decision on the frame level, Eq. 14, influences the pure Gist-based approaches only marginally while classification rates using x3D are significantly increased from (mean: 0.55, max: 0.60) to (mean: 0.65, max: 0.80). As shown in Fig. 4(a) the spatial 3D features perform comparably to Gist features, even though their dimensionality of 25 is 20 times smaller than the 512 dimensions of the Gist vectors. Lazebnik's spatial pyramid features, xSP, show the worst performance even though 8500-dimensional feature vectors are defined. The high mean standard deviation value of 0.19 arises from the fact that they work well for one or two classes while they do not work at all for other classes (→ Fig. 5).

Rejecting frames when no clear decision can be taken is a reasonable step, as some frames will not show meaningful views of a scene. As expected, the classification results are improved (compare 4(a) with 4(b) and 4(c) with 4(d)). The influence of rejection is higher on voting scheme V1×E than on V2×E: taking decisions on the frame level leads to clearer decisions on the window level, with the drawback of being more vulnerable to noise. In Fig. 5, confusion matrices of the curves in Fig. 4(a) at window size ∆t = 60 are given. Taking into account the frame rate of the camera, there is an initial delay of 2 to 6 seconds before the system delivers class labels or reaches its full categorization performance. The confusion matrices of x3D and xGist give an impression of their complementary nature; therefore, fusing both features leads to improved indoor scene classification capabilities.
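One plausible way to realize such a rejection step is to discard a frame whenever the margin between the two best class scores is too small. The margin threshold below is a made-up illustration, not a value from the paper.

```python
def classify_with_rejection(scores, margin=0.1):
    """Return the best class label, or None (reject) when the gap
    between the two highest class scores is below `margin`.
    The threshold 0.1 is an arbitrary illustrative choice."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_label, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score - second_score < margin:
        return None  # no clear decision: frame is rejected
    return best_label
```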

To test the generalizability of the room models trained on the entire IKEA database, we additionally acquired sequences from rooms (bedroom, eating place, kitchen, and living room) of two real flats, F1 and F2. Per room, 300 frames were acquired with the Swissranger positioned at the door opening. The room models are applied using 3D and Gist features combined with voting scheme V1×E over


Fig. 6: This figure shows the classification results of the test sequences (bed.F1, eat.F1, kit.F1, liv.F1, bed.F2, eat.F2, kit.F2, and liv.F2) acquired in two real flats, F1 and F2. The red line refers to the ground truth and the black circles mark the classification results. The shown results are achieved by using 3D features and Gist features in combination, the voting scheme V1×E, and a voting window of size ∆t = 60. (best viewed in color)

a voting window ∆t = 60. Fig. 6 shows pictures of the rooms and the distribution of the resulting class labels using the learnt room models. The continuous red line denotes the ground truth labeling and the black circles mark the classification results. In general, a quite good overall categorization performance of 0.65 and 0.84 is achieved. In particular, the sequences kit.F1, kit.F2, liv.F1, liv.F2, and eat.F2 are correctly recognized over nearly the whole sequence. Only some frames in the middle of kit.F2 are rejected or misclassified because the camera was directed towards the kitchen window, which led to very noisy data due to the disturbing effect of sunlight on the ToF measurement principle. This effect is also responsible for the misclassification of the complete eat.F1 sequence. Suppressing this effect is subject to a technical solution, since a suppression of background illumination for ToF sensors already exists [36]. Also, an atypical absence or arrangement of furniture, like only a bed in a room or the bed placed in the corner, leads to misclassifications and rejections, as happened for bed.F1 and bed.F2. This problem can be tackled by introducing such rooms into the training of the room models.

Finally, we can state that our lightweight 25-dimensional feature vector, x3D, encodes enough meaningful information about the structural properties of rooms to achieve a remarkably good performance in determining the room type for a given sequence of frames. The small dimensionality of our 3D features recommends them for use on robots with limited computing resources and restricted capabilities to acquire thousands of training images. Complementary information from Gist features xGist, computed on the corresponding amplitude images, further improves the categorization.


6 Conclusion

In this paper we have shown that feature vectors capturing the 3D geometry of a scene can significantly contribute to solving the problem of indoor scene categorization. We introduce a 3D indoor database and propose a method for defining features on extracted planar surfaces that encode the 3D geometry of the scene. Additionally, we propose voting techniques which allow fusing single classification results from several consecutive frames and from different feature types, e.g., 3D features and 2D Gist features. In our experiments, we have been able to show that Gist features and 3D features complement one another when combined through voting. The 3D cue leads to better labeling results earlier.

As the fusion of classification results from consecutive frames provides a more stable scene categorization result, we plan to investigate the categorization performance on registered 3D point clouds. Such a registered cloud holds a bigger part of the scene, which should lead to a more stable classification ability compared to the current situation where the scene classification has to be done on a partial scene view. Also, we would like to introduce class hierarchies where the room types like kitchen are located on the top level, and subparts of the scenes, like "a wall with bookshelves" or "sideboard-like furniture for placing things on", on the lower levels.

Acknowledgment: This work is funded by the Collaborative Research CenterCRC 673 “Alignment in Communication”.

References

1. Oliva, A., Torralba, A.: Modeling the shape of the scene: A holistic representation of the spatial envelope. IJCV 42 (2001) 145–175
2. Quattoni, A., Torralba, A.: Recognizing indoor scenes. In: CVPR. (2009)
3. Henderson, J.M., Larson, C.L., Zhu, D.C.: Full scenes produce more activation than close-up scenes and scene-diagnostic objects in parahippocampal and retrosplenial cortex: An fMRI study. Brain and Cognition 66 (2008) 40–49
4. Henderson, J.M., Larson, C.L., Zhu, D.C.: Cortical activation to indoor versus outdoor scenes: An fMRI study. Experimental Brain Research 179 (2007) 75–84
5. Topp, E., Christensen, H.: Detecting structural ambiguities and transitions during a guided tour. In: ICRA. (2008) 2564–2570
6. Ullah, M.M., Pronobis, A., Caputo, B., Luo, J., Jensfelt, P., Christensen, H.I.: Towards robust place recognition for robot localization. In: ICRA. (2008)
7. Ranganathan, A., Dellaert, F.: Semantic modeling of places using objects. In: Robotics: Science and Systems. (2007)
8. Mozos, O.M., Stachniss, C., Burgard, W.: Supervised learning of places from range data using AdaBoost. In: ICRA. (2005) 1730–1735
9. Zender, H., Mozos, O., Jensfelt, P., Kruijff, G.J.M., Burgard, W.: Conceptual spatial representations for indoor mobile robots. RAS 56 (2008) 493–502
10. Galindo, C., Saffiotti, A., Coradeschi, S., Buschka, P., Fernandez-Madrigal, J.A., Gonzalez, J.: Multi-hierarchical semantic maps for mobile robotics. In: IROS. (2005) 3492–3497
11. Vasudevan, S., Gachter, S., Nguyen, V., Siegwart, R.: Cognitive maps for mobile robots – an object based approach. RAS 55 (2007) 359–371
12. Kim, S., Kweon, I.S.: Scene interpretation: Unified modeling of visual context by particle-based belief propagation in hierarchical graphical model. In: ACCV. (2006) 963–972
13. Pirri, F.: Indoor environment classification and perceptual matching. In: Intl. Conf. on Knowledge Representation. (2004) 30–41
14. Torralba, A., Murphy, K.P., Freeman, W.T., Rubin, M.A.: Context-based vision system for place and object recognition. In: ICCV. (2003)
15. Fei-Fei, L., Perona, P.: A Bayesian hierarchical model for learning natural scene categories. In: CVPR. (2005)
16. Posner, I., Schroter, D., Newman, P.M.: Using scene similarity for place labelling. In: Intl. Symp. on Experimental Robotics. (2006) 85–98
17. Bosch, A., Zisserman, A., Munoz, X.: Scene classification using a hybrid generative/discriminative approach. PAMI 30 (2008) 712–727
18. Vogel, J., Schiele, B.: A semantic typicality measure for natural scene categorization. In: LNCS: Pattern Recognition – DAGM Symposium. (2004) 195–203
19. Lazebnik, S., Schmid, C., Ponce, J.: Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR. (2006)
20. Huber, D.F., Kapuria, A., Donamukkala, R., Hebert, M.: Parts-based 3D object classification. In: CVPR. (2004) 82–89
21. Csakany, P., Wallace, A.M.: Representation and classification of 3D objects. Systems, Man, and Cybernetics 33 (2003) 638–647
22. Johnson, A.E., Hebert, M.: Using spin images for efficient object recognition in cluttered 3D scenes. PAMI 21 (1999) 433–449
23. Munoz, D., Bagnell, J.A., Vandapel, N., Hebert, M.: Contextual classification with functional max-margin Markov networks. In: CVPR. (2009)
24. Triebel, R., Schmidt, R., Mozos, O.M., Burgard, W.: Instance-based AMN classification for improved object recognition in 2D and 3D laser range data. In: IJCAI. (2007) 2225–2230
25. Rusu, R.B., Marton, Z.C., Blodow, N., Beetz, M.: Learning informative point classes for the acquisition of object model maps. In: Intl. Conf. on Control, Automation, Robotics and Vision, Hanoi, Vietnam (2008)
26. Lee, D., Hebert, M., Kanade, T.: Geometric reasoning for single image structure recovery. In: CVPR. (2009)
27. Hedau, V., Hoiem, D., Forsyth, D.: Recovering the spatial layout of cluttered rooms. In: ICCV. (2009)
28. Yu, S.X., Zhang, H., Malik, J.: Inferring spatial layout from a single image via depth-ordered grouping. In: CVPR Workshops. (2008)
29. Weingarten, J., Gruener, G., Siegwart, R.: A state-of-the-art 3D sensor for robot navigation. In: IROS. Volume 3. (2004) 2155–2160
30. Hois, J., Wunstel, M., Bateman, J., Rofer, T.: Dialog-based 3D-image recognition using a domain ontology. In: Spatial Cognition. Volume 4287. (2008) 107–126
31. Stamos, I., Allen, P.K.: Geometry and texture recovery of scenes of large scale. CVIU 88 (2002) 94–118
32. Fischler, M.A., Bolles, R.C.: Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (1981) 381–395
33. Kuncheva, L.I.: Combining Pattern Classifiers. John Wiley & Sons (2004)
34. Vapnik, V.N.: The Nature of Statistical Learning Theory. (1995)
35. Joachims, T.: Learning to Classify Text Using Support Vector Machines. PhD thesis, Cornell University (2002)
36. Moller, T., Kraft, H., Frey, J., Albrecht, M., Lange, R.: Robust 3D measurement with PMD sensors. In: 1st Range Imaging Research Day at ETH. (2005)