CENTER FOR MACHINE PERCEPTION
CZECH TECHNICAL UNIVERSITY

RESEARCH REPORT
ISSN 1213-2365

Indexing Images for Visual Memory by Using DNN Descriptors – Preliminary Experiments

Erik Derner and Tomáš Svoboda
[email protected], [email protected]

CTU–CMP–2014–25
January 19, 2015

Available at ftp://cmp.felk.cvut.cz/pub/cmp/articles/svoboda/Derner-TR-2014-25.pdf

The work was supported by the EC project FP7-ICT-609763 TRADR and by the CTU project SGS13/142/OHK3/2T/13. Any opinions expressed in this paper do not necessarily reflect the views of the European Community. The Community is not liable for any use that may be made of the information contained herein.

Research Reports of CMP, Czech Technical University in Prague, No. 25, 2014

Published by
Center for Machine Perception, Department of Cybernetics
Faculty of Electrical Engineering, Czech Technical University
Technická 2, 166 27 Prague 6, Czech Republic
fax +420 2 2435 7385, phone +420 2 2435 7637, www: http://cmp.felk.cvut.cz



Indexing Images for Visual Memory by Using DNN Descriptors – Preliminary Experiments

Erik Derner and Tomáš Svoboda

January 19, 2015

Abstract

Visual memory in mobile robotics is important for making the localization of a robot robust in situations when GPS or similar localization methods are not available. Unlike many conventional approaches using local features, we use a holistic method that employs deep neural networks (DNNs) to calculate a global descriptor of the whole image.

We consider a scenario in which a robot equipped with an omnidirectional camera calculates and stores DNN descriptors of images together with the positions as it moves through the environment. When its position is unknown, the algorithm estimates it for a given omnidirectional image by matching it with the most similar database image.

We compared our approach with a recently tested GIST-based approach on the same dataset and found that the DNN-based approach yields better results. The experiments also show that the DNN-based algorithm is quite robust to partial occlusion, rotation, and changes in lighting conditions.


Contents

1 Introduction
  1.1 Deep neural networks
  1.2 Caffe DNN toolbox

2 Algorithm
  2.1 Descriptor
  2.2 Similarity measure
  2.3 Outline of the algorithm

3 Experiments
  3.1 Comparison with the GIST descriptor
  3.2 Finding closest images in a single sequence
  3.3 Invariance to transformations
    3.3.1 Rotation
    3.3.2 Lighting conditions
  3.4 Distinguishing images from different sequences

4 Conclusion


1 Introduction

Visual odometry and localization of mobile robots are important topics in recent research. GPS data may be unavailable for various reasons (an absent or broken GPS sensor, a GPS-denied environment, etc.), and in such cases, images from a robot equipped with an omnidirectional camera can help to localize the robot.

In our task, a database of images is built as the robot records images together with positions during the course of its mission. Therefore, finding images similar to a given image corresponds to finding the physically closest previously visited places, which is our objective when localizing the robot.

We suggest using a holistic approach. Image content is represented by a global image descriptor, which is calculated using deep neural networks. Similarity between images is computed as a difference between the related descriptors.

1.1 Deep neural networks

Deep neural networks (DNNs), an innovative approach in the field of machine learning, have recently shown promising results in various fields. DNNs can be defined as a class of artificial neural networks distinguished from shallow artificial neural networks by extra layers, which compose features from the lower layers. This allows for high-level abstraction and the learning of complex models, a typical trait of deep learning methods [1].

Conventional methods for visual memory often require some preprocessing of the images, such as applying Gabor filters [2]. In contrast, DNNs are capable of computing the descriptor directly from the image, without any preprocessing. The realization of the preprocessing stage, i.e., the selection of suitable filters, is incorporated in the network [3].

1.2 Caffe DNN toolbox

Caffe, which stands for Convolutional Architecture for Fast Feature Embedding, is a C++/CUDA framework for deep learning with Matlab and Python interfaces. Although the toolbox is designed to work on any domain, such as speech, haptics, or neuroscience, the main focus of its authors is on visual recognition. More technical details about the architecture can be found in [3].

From the variety of software packages within this project that are freely available under the BSD license, we use a framework containing a Matlab bridge designed for the retrieval of similar images from a large image database. Further details are discussed in Section 2.


2 Algorithm

In this section, we explain how the database of images is created in our task and how we estimate the location of the robot given a query image. First, we define the DNN descriptor and the similarity measure.

2.1 Descriptor

We use the response of the last hidden DNN layer as the descriptor of an image. The descriptor is a 4096-dimensional vector of float values.

The descriptor is calculated by the function matcaffe_demo_img() from the Caffe toolbox. As explained in Section 1.1, the original image is used as the input to the DNN without any preprocessing. This function is the only one we use from the toolbox; all other work is performed by Matlab functions and scripts that we created for this task.

The computation of the descriptor takes less than 100 ms¹, which makes it well suited for real-time applications.

2.2 Similarity measure

We define the similarity measure as the square of the Euclidean distance between the 4096-dimensional descriptors of the database image and the query image. When referring to database images, we mean the elements of our dataset, which are pairs (image, location). Note that it is important not to confuse the distance between descriptors with the physical (real-world) distance; both will be used frequently throughout this report. The distance between descriptors is the function we minimize when searching for the most similar image, while the physical distance from the most similar image found to the query image is rather a measure of the performance of the algorithm.
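The similarity measure can be written down in a few lines. The following sketch uses plain Python and a toy 3-dimensional example; real descriptors have 4096 components, and the function name `descriptor_distance` is ours:

```python
def descriptor_distance(a, b):
    """Squared Euclidean distance between two image descriptors.

    Smaller values mean more similar images; taking the square root
    is unnecessary because it does not change the ordering of matches.
    """
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy 3-dimensional example (real descriptors are 4096-dimensional):
print(descriptor_distance([1.0, 0.0, 2.0], [0.0, 0.0, 0.0]))  # 5.0
```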

2.3 Outline of the algorithm

As briefly mentioned in Section 1, we consider the scenario in which a robot records omnidirectional images and positions as it moves through the environment. More precisely, it stores the 4096-dimensional descriptors of the images together with the closest recorded positions. The images and the positions are paired based on their timestamps. Imprecisions may occur here, as the delay between recording the image and the corresponding position may be quite long. Additionally, the position is not known directly from the sensors at every moment, so it may have to be estimated.

When the robot captures an omnidirectional image and is unable to determine its location, we calculate the similarity measure between the descriptor of the query image and the descriptors of all database images to find the image with the minimal distance. The position of this image is reported as an estimate of the current position of the robot.

¹Tested on a machine with a quad-core Intel i7-3770 CPU @ 3.4 GHz and 16 GB RAM.

Figure 1: Illustration of the distance measure used for the algorithm evaluation. The illustration shows four points representing images in a sequence. The red reference point R represents the query image and the green point C is the database image nearest to the reference point R (in terms of physical distance). If the algorithm selects image A as the most similar one to R, the distance measure will be d = dA − dC. Analogously, if image B is selected, d = dB − dC. Finally, if the closest point C is selected as the most similar one, the distance measure is d = dC − dC = 0.
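The retrieval step of Section 2.3 amounts to a linear scan over the database. Below is a minimal sketch, assuming the database is a plain list of (descriptor, position) pairs; the names `localize` and `sq_dist` are ours:

```python
def localize(query_descriptor, database):
    """Return the stored position of the database entry whose descriptor
    minimizes the squared Euclidean distance to the query descriptor.

    `database` is a list of (descriptor, position) pairs, matching the
    (image, location) pairs described in Section 2.2.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    best = min(database, key=lambda entry: sq_dist(query_descriptor, entry[0]))
    return best[1]

# Toy 2-dimensional descriptors paired with (x, y) positions:
db = [([0.0, 0.0], (10.0, 5.0)),
      ([1.0, 1.0], (12.0, 6.0))]
print(localize([0.9, 1.1], db))  # (12.0, 6.0)
```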

3 Experiments

We performed experiments on datasets consisting of sequences of omnidirectional images with assigned locations, captured by the NIFTi robot at the Center for Machine Perception, Czech Technical University in Prague. We want to evaluate the reliability of our method and compare it with a method using GIST descriptors [2].

Before describing the individual experiments, let us introduce the distance measure, which is used as a quality measure of the algorithm in all experiments. Given a query image, the distance measure between the query image and an image from the database is calculated as the Euclidean distance between the 2-dimensional vectors representing their positions, from which we subtract the distance from the query image to the physically closest database image. In other words, we take the distance to the physically closest image in the dataset as the shortest achievable one and assign 0 as the value of the distance measure in that case, as no other match in the dataset can be closer. The distance to any image is then decreased by that shortest distance. The closest images are calculated in advance and serve only for the evaluation of the algorithm. An example of the distance measure is shown in Figure 1.
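The distance measure just described can be expressed directly. This short sketch (function names are ours) returns 0 when the algorithm picks the physically closest image:

```python
import math

def evaluation_distance(query_pos, matched_pos, closest_pos):
    """Distance measure of Section 3 / Figure 1: the physical distance
    from the query image to the matched database image, minus the
    distance from the query image to the physically closest one."""
    def euclid(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])

    return euclid(query_pos, matched_pos) - euclid(query_pos, closest_pos)

# The closest image itself scores 0; a farther match scores its extra distance.
print(evaluation_distance((0, 0), (3, 4), (3, 4)))  # 0.0
print(evaluation_distance((0, 0), (6, 8), (3, 4)))  # 5.0
```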


Figure 2: Trajectory of the robot in the ETH dataset, plotted in meters. The longer (blue) sequence consists of 4 603 images, the shorter (red) sequence contains 2 456 images.

3.1 Comparison with the GIST descriptor

One of the main objectives of this work is to compare the proposed DNN-based method with a method using GIST as the descriptor [2].

In brief, in [2] the GIST descriptor is calculated on rows instead of tiles as in the original paper [4]. It is calculated only on the central part of the image, i.e., the top and bottom parts of the image are discarded, because the GIST descriptor is unstable near the margins of the image. As a possible improvement, it is suggested to stack two copies of the image horizontally side by side, so that no information is lost due to the need to discard the response of the GIST descriptor near the margins.

Our experiment is run on the same ETH dataset as the experiments in [2]. There are two sequences in the dataset, a shorter one and a longer one; see Figure 2. The sequences partially overlap, which is desirable for testing the ability of the algorithm to find similar (closest) images even though the images were not taken from exactly the same place and under exactly the same conditions. The longer sequence serves as the visual memory (database), while the images in the shorter sequence are used as query images. The distance to the physically closest image, as described in the introduction of Section 3, is measured for each point from the shorter sequence to the closest point in the longer sequence.

Figure 3: Comparison of the performance of the DNN-based and GIST-based algorithms on the ETH dataset, showing recall as a function of nbest for dmax ranging from 0 m to 3.5 m. nbest is the number of the most similar images returned for each query image, dmax is the maximal tolerance in the distance measure. DNN with crop means that the part containing the robot's body and flippers was removed, while GIST with repeat means that two copies of the image were stacked side by side.

Figure 4: Removing the irrelevant part of the image, i.e., the part containing the robot's body and flippers. For the setting DNN with crop, the dark part of the image was cut off and only the bright part was used for computing the descriptor.

Figure 3 shows that the DNN descriptor significantly outperforms the GIST descriptor for values of nbest < 100. The performance at low values of nbest is more important, as it is much easier for the closest image to appear in larger sets of the most similar images. Moreover, the presence of the closest image in such a large set does not help much to determine the position of the query image in any further processing of the retrieved set of the most similar images.

The version of the GIST descriptor using the modification with the two images stacked side by side performs substantially better than the version using only the single image, so the preference for this modification stated in [2] is justified.

Selecting only the relevant part of the image in our DNN-based approach, i.e., removing the part containing the robot's body and flippers, also improves the performance, but the difference is significant only for dmax = 0. For a larger tolerance of the distance measure, the effect of cropping is rather negligible.

3.2 Finding closest images in a single sequence

In this experiment, we use the FEL dataset; see Figure 5. As there is only a single sequence in the dataset, we perform the experiment by leaving out one of the images, which serves as the query image, while the rest form the visual memory. This is repeated for all images in the sequence.

The performance of the algorithm is summarized in Figure 6. It can be seen that for a reasonable tolerance in the distance measure, the closest image appears in a set of very few most similar images.
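The leave-one-out protocol and the recall plotted in Figure 6 can be sketched as follows. The hit criterion (at least one of the nbest retrieved images lies within dmax of the shortest achievable physical distance) is our reading of the evaluation described above, and all names in the sketch are ours:

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def recall(entries, n_best, d_max):
    """Leave-one-out recall: each image in turn is the query and the
    rest form the visual memory. A query counts as a hit when at least
    one of its n_best most descriptor-similar images lies within d_max
    of the shortest physically achievable distance.

    `entries` is a list of (descriptor, (x, y)) pairs.
    """
    hits = 0
    for i, (q_desc, q_pos) in enumerate(entries):
        rest = [e for j, e in enumerate(entries) if j != i]

        def phys(e):
            # Physical distance from the query to a database entry.
            return math.hypot(q_pos[0] - e[1][0], q_pos[1] - e[1][1])

        closest = min(phys(e) for e in rest)
        ranked = sorted(rest, key=lambda e: sq_dist(q_desc, e[0]))
        if any(phys(e) - closest <= d_max for e in ranked[:n_best]):
            hits += 1
    return hits / len(entries)

# Toy sequence where descriptor distance tracks physical distance:
route = [([0.0], (0.0, 0.0)), ([1.0], (1.0, 0.0)), ([5.0], (5.0, 0.0))]
print(recall(route, n_best=1, d_max=0.0))  # 1.0
```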


Figure 5: Trajectory of the robot in the FEL dataset, plotted in meters. The sequence contains 114 images, taken at the positions marked by blue points in the plot.

Figure 6: Performance of the algorithm when finding the closest images in a single sequence, showing recall as a function of nbest for dmax ranging from 0 m to 2 m. nbest is the number of the most similar images returned for each query image, dmax is the maximal tolerance in the distance measure.


Figure 7: Performance of the algorithm on the FEL dataset under different rotations (no rotation, 45°, 90°, 180°, 270°, and random rotation), showing recall as a function of nbest. nbest is the number of the most similar images returned for each query image.

3.3 Invariance to transformations

The experiment described in Section 3.1 shows that the algorithm is able to match omnidirectional images under slight rotation, partial occlusion, and changes in lighting conditions. However, in the following experiments, we focus on measuring the effect of artificially introduced rotation and illumination changes on the performance of the algorithm.

The experiments are performed in a similar manner as in Section 3.2, i.e., leaving out one of the images in the FEL sequence. The query image is transformed by a rotation or by a simulated change in the lighting conditions, while the rest of the sequence remains unchanged.

3.3.1 Rotation

Figure 7 shows the performance of the algorithm under rotations (pan) by various angles. While the rotation by 45° has a rather negligible effect, rotations by larger angles yield worse results; still, the algorithm can be said to be quite robust to rotations. In all tested rotation scenarios, the closest image is present in the set of the 4 most similar images for more than 75 % of the tested images.
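For reference, if the omnidirectional images are unwrapped into 360° panoramas, a pan rotation of the robot corresponds to a circular shift of the image columns. The following is a sketch of how such rotated query images could be generated; the actual image format, rotation procedure, and shift direction used in the experiment are assumptions here:

```python
def rotate_panorama(image, angle_deg):
    """Circularly shift the columns of an unwrapped 360-degree panorama
    to simulate a pan rotation of the robot by `angle_deg` degrees.

    `image` is a list of rows; each row is a list of pixel values.
    """
    width = len(image[0])
    # Map the angle to a whole number of columns (the shift direction
    # is an arbitrary convention in this sketch).
    shift = round(width * (angle_deg % 360) / 360)
    return [row[shift:] + row[:shift] for row in image]

img = [[0, 1, 2, 3]]              # one row, four columns = 90° per column
print(rotate_panorama(img, 90))   # [[1, 2, 3, 0]]
print(rotate_panorama(img, 360))  # [[0, 1, 2, 3]]
```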


Figure 8: Examples of the query images used in the experiment testing the robustness of the algorithm to changes in the lighting conditions: a – original image, b – darker image, c – brighter image.

3.3.2 Lighting conditions

Modeling real-world changes in lighting conditions is a quite difficult task. In this experiment, we use a simplified way of modeling illumination changes through exposure adjustment in the photo editing program Shotwell². We generated query images that are darker and brighter than the originals, simulating dusk and bright sun. Examples of the query images used in this experiment are shown in Figure 8.

The results in Figure 9 show that such changes in lighting conditions do not have any considerable effect on the performance of the algorithm.

3.4 Distinguishing images from different sequences

The previous experiments were performed on image sequences captured at a single particular location. In this experiment, we would like to test whether the algorithm keeps finding images from the location they belong to when the visual memory also contains image sequences captured at other locations.

The dataset used in this experiment is composed of three sequences, each containing 50 images. The first two sequences are from the ETH and FEL datasets; the third one is from the Kamaishi dataset³. Figure 10 shows examples of images in the datasets.

As in the previous experiments, we leave out one image, which serves as the query image, and the rest of the dataset is the visual memory. It can be seen in Figure 11 that the distinguishing power of the algorithm is very strong: when we consider up to 30 most similar images, there is no image from a different sequence in the set of the most similar images for any image in any of the three sequences. Even for larger sets of the most similar images, the number of images from a different sequence is negligible.

²http://wiki.gnome.org/Apps/Shotwell
³http://www.vision.is.tohoku.ac.jp/us/download/


Figure 9: Performance of the algorithm on the FEL dataset under different lighting conditions (original, darker, brighter), showing recall as a function of nbest. nbest is the number of the most similar images returned for each query image.

Figure 10: Examples of images in the datasets used in the experiments: a – ETH dataset, b – Kamaishi dataset, c – FEL dataset.


Figure 11: Average number na of images that do not belong to the same sequence as the query image, as a function of nbest. nbest is the number of the most similar images returned for each query image; na is the number of images in the set of the most similar images that do not belong to the same sequence as the query image, summed over all query images and divided by the number of images in the dataset, i.e., 150.
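The metric na from Figure 11 can be sketched as a leave-one-out computation over (descriptor, sequence) pairs; the function name and data layout below are ours:

```python
def avg_cross_sequence(entries, n_best):
    """Average number of images among the n_best most descriptor-similar
    ones that belong to a different sequence than the query image (the
    metric plotted in Figure 11), computed leave-one-out.

    `entries` is a list of (descriptor, sequence_id) pairs.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    total = 0
    for i, (q_desc, q_seq) in enumerate(entries):
        rest = [e for j, e in enumerate(entries) if j != i]
        ranked = sorted(rest, key=lambda e: sq_dist(q_desc, e[0]))
        # Count retrieved images whose sequence differs from the query's.
        total += sum(1 for _, seq in ranked[:n_best] if seq != q_seq)
    return total / len(entries)

# Two well-separated toy sequences "A" and "B":
memory = [([0.0], "A"), ([0.2], "A"), ([9.0], "B"), ([9.2], "B")]
print(avg_cross_sequence(memory, n_best=1))  # 0.0
```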


4 Conclusion

We tested an approach to visual memory using deep neural networks that is capable of localizing a robot in a GPS-denied environment by finding the most similar images in terms of the minimal distance between DNN descriptors. Compared with an approach using GIST descriptors [2], our approach yields better performance for a reasonably small size of the set of the most similar images. The experiments have also shown that the DNN-based approach appears robust to image transformations such as rotations and changes in illumination, although real-world experiments would be appropriate to support the outcome of these experiments based on artificially generated images.

Further work can be done on processing the set of the most similar images. It would also be necessary to store the database images in a more efficient data structure than a linear list; otherwise, finding the most similar images would become too slow for larger image databases when using an image retrieval method requiring linear time.

References

[1] Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

[2] Otakar Jasek and Tomáš Svoboda. Visual-based memory using GIST descriptor. Research Report CTU–CMP–2014–10, Center for Machine Perception, K13133 FEE Czech Technical University, Prague, Czech Republic, October 2014.

[3] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[4] Ana Cristina Murillo, Gautam Singh, Jana Kosecka, and Jose Jesus Guerrero. Localization in urban environments using a panoramic GIST descriptor. IEEE Transactions on Robotics, 29(1):146–160, February 2013.
