Robust Object Recognition Through Symbiotic Deep Learning In Mobile Robots
João Miguel Vieira Cartucho
Thesis to obtain the Master of Science Degree in
Aerospace Engineering
Supervisors: Prof. Manuela Maria Veloso
Prof. Rodrigo Martins de Matos Ventura
Examination Committee
Chairperson: Prof. José Fernando Alves da Silva
Supervisor: Prof. Rodrigo Martins de Matos Ventura
Member of the Committee: Prof. Alexandre José Malheiro Bernardino
June 2018
Acknowledgements
I would like to thank:
• My thesis advisor: Professor Rodrigo Ventura, of the Institute for Systems and Robotics at Instituto
Superior Técnico. Professor Ventura has continuously supported and steered this research work in
the right direction while also giving me an enthusiastic encouragement. From Skype calls around
the world, to knocking on the Professor’s office door without notice, his willingness to give his time
so generously has been very much appreciated.
• My supervisor at Carnegie Mellon University: Professor Manuela Veloso, of the Machine Learning
Department, School of Computer Science. Professor Veloso is extremely knowledgeable and has
given priceless and constructive suggestions during the planning and development of this research
work. Since the beginning, I felt inspired with her unique approach of symbiotic autonomy and
self-explanatory artificial intelligence, culminating in robots that know how to address their own
limitations.
• Other researchers from ISR and CORAL Lab. Interacting and discussing ideas with other students
has been one of the best ways of getting valuable advice. Special thanks to Oscar Lima and Rute
Luz from ISR and Robin Schmucker and Kevin Zhang from CORAL Lab.
• My family and friends, for their participation in the research experiments, their advice, thesis
revisions, encouragement, and support throughout this work. I would like to offer my special thanks
to my mom, sisters, cousin Johnny Cartucho, and my friend Luís Rosmaninho.
Thank you for all the encouragement!
This work was supported by the FCT project [UID/EEA/50009/2013] and partially funded with grant
6204/BMOB/17, from CMU Portugal.
Abstract
Despite the recent success of state-of-the-art deep learning algorithms in object detection, we observed
that, when deployed as-is on a mobile service robot, they fail to recognize many objects in real human
environments. In this work, we introduce a learning algorithm in which robots address this flaw by
asking humans for help, an approach known as symbiotic autonomy. In particular, we bootstrap from
YOLOv2, a state-of-the-art deep neural network, and train a new neural network, which we call
HUMAN, using only the collected data. Using an RGB camera and an on-board tablet, the robot proactively
seeks human input to assist it in labeling surrounding objects. Pepper and CoBot, located at CMU, and
Monarch Mbot, located at ISR-Lisbon, were the service robots that we used to validate the proposed
approach. We conducted a study in a realistic domestic environment over the course of 20 days with 6
research participants. To improve object detection, we used the two neural nets, YOLOv2 + HUMAN, in
parallel. Following this methodology, the robot was able to detect twice the number of objects compared
to the initial YOLOv2 neural net, and achieved a higher mAP (mean Average Precision) score. Using
the learning algorithm, the robot also collected data, by asking humans, about where each object was
located and to whom it belonged. This enabled us to explore a future use case where robots can search
for a specific person’s object. We view the contribution of this work to be relevant for service robots in
general, in addition to CoBot, Pepper, and Mbot.
Keywords: Cognitive Human-Robot Interaction; Deep Learning in Robotics and Automation; Service
Robots; Social Robots
Resumo
Apesar dos recentes progressos nos algoritmos estado-da-arte para detecção de objetos, estes, quando
implementados diretamente num robô de serviço, falham no reconhecimento de muitos dos objetos presentes
nos ambientes humanos reais. Esta tese introduz um algoritmo de aprendizagem através do qual os robôs
endereçam esta falha pedindo ajuda humana, numa abordagem denominada por “autonomia simbiótica”. Em
particular, partimos do YOLOv2, uma rede neuronal do estado-da-arte, e criámos uma nova rede neuronal
– HUMAN – com a informação recolhida pelo robô através da assistência humana. Com uma câmara RGB e
um tablet no robô, este procura proactivamente por auxílio humano para classificar os objetos em seu
redor. Para validar esta abordagem utilizámos três robôs de serviço, CoBot e Pepper, localizados na
CMU, e Monarch Mbot, no ISR-Lisboa, e realizámos um estudo num ambiente doméstico real com 6
participantes ao longo de 20 dias. Verificámos um melhoramento na detecção de objetos quando as duas
redes neuronais (YOLOv2 e HUMAN) são usadas em paralelo. No final da experiência, o robô foi capaz de
detectar o dobro dos objetos e ainda revelou uma melhor mAP (“mean Average Precision”). Através da
informação recolhida com as perguntas feitas aos humanos, mostrámos ainda um possível caso prático em
que o robô procura um objeto para uma pessoa em específico. Este trabalho contribui para robôs de
serviço em geral, para além do CoBot, Pepper, e Mbot.
Palavras-Chave: Interação Cognitiva Humano-Robô; “Deep Learning” em Robótica e Automação;
Robôs de Serviço; Robôs Sociais
Contents
Acknowledgements i
Abstract iii
Resumo v
List of Tables xi
List of Figures xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Open Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Framework and State-of-the-art Overview 6
2.1 Brief History of Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 An Overview on YOLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Functioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Advantages of YOLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.5 YOLOv2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 mAP - mean Average Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Intersection Over Union . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.4 Calculating the AP - Average Precision . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 YOLOv2 in a Real-World Scenario 21
3.1 PASCAL VOC Dataset Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 MBot Test in a Domestic Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 CoBot Test in a University Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.3 Performance of YOLOv2 Trained on PASCAL VOC . . . . . . . . . . . . . . . . . . 22
3.2 COCO Dataset Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Learning Algorithm 28
4.1 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.1 Part I - Capturing Images and Predicting the Objects . . . . . . . . . . . . . . . . . 29
4.1.2 Part II - Interacting with Humans . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.3 Part III - Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 YOLOv2 + HUMAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1 Problem and Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 Removing Duplicate Detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Results 42
5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1 Domestic Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.2 Looking for a Specific Person's Object . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Domestic Scenario Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.1 Correctness of Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.2 Evaluation of the Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 Further Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Looking for a Specific Person's Object . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.2 Sharing Knowledge Between Robots . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 Conclusion 55
6.1 Summary of Thesis Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Bibliography 56
A Training Set Information 63
B Resulting Predictions Information 67
List of Tables
3.1 PASCAL VOC2012 [36] test detection results. YOLOv2 performs on the same level as
other state-of-the-art detectors like Faster R-CNN [29] with ResNet [53] and SSD512 [54]
while still running 2 to 10 times faster [9]. Table adapted from [9]. . . . . . . . . . . . . . . 23
3.2 MBot Domestic Scenario PASCAL VOC2012 [36] test detection results on YOLOv2. . . . 23
3.3 CoBot University Scenario PASCAL VOC2012 [36] test detection results on YOLOv2. . . 24
5.1 Predictions - Week 1 - Image 0 to 50. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Predictions - Week 2 - Image 50 to 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Predictions - Week 3 - Image 100 to 150. . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Average Precision using an external ground-truth composed of 100 images in different
poses. Values are in percentage (%) and the largest one per row marked in bold. . . . . 49
List of Figures
1.1 CoBot - Collaborative Robot. Designed for servicing multi-floor buildings [8]. . . . . . . . . 2
1.2 Mobile robots used for the evaluation of the proposed method. Photos: SoftBank/Aldebaran
and IDMind Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Viola and Jones algorithm for face detection. Image from OpenCV website. . . . . . . . . 7
2.2 E.g. SIFT object detection with partial occlusion. Image from OpenCV website. . . . . . . 7
2.3 E.g., the HOG detector cues mainly on silhouette contours. Image from [17]. . . . . . . . . 8
2.4 R-CNN Object Detection System Overview. Image from [26]. . . . . . . . . . . . . . . . . 9
2.5 Example image split into an S×S grid and one of its cells (marked in red). Image from
YOLO website. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Resulting predictions from all the grid cells (the higher the confidence the thicker the box).
Image from YOLO website. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.7 Class probability map. Image from YOLO website. . . . . . . . . . . . . . . . . . . . . . . 11
2.8 Combining bounding boxes with class probabilities. Image from YOLO website. . . . . . . 11
2.9 Final predictions after applying a threshold and non-maximum suppression. Image from
YOLO website. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.10 The architecture of YOLO. Image from [31] . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.11 Example training YOLO to detect a “dog”. Image from YOLO website. . . . . . . . . . . . 14
2.12 YOLO generalization results on Picasso and People-Art datasets. Image from [31] . . . . 15
2.13 Anchor Boxes vs. Dimension Clusters. Image from YOLO website. . . . . . . . . . . . . . 16
2.14 Accuracy and speed on VOC 2007. Image from YOLO website. . . . . . . . . . . . . . . . 17
2.15 Example images captured by MBot from different perspectives and under different lighting conditions. 18
3.1 Example images captured by MBot from different perspectives and under different lighting conditions. 22
3.2 CoBots with on-board computation, interaction interfaces, omni-directional motion, carry-
ing baskets, and depth sensors (Kinect and LIDAR). . . . . . . . . . . . . . . . . . . . . . 23
3.3 Example images captured by CoBot for our YOLOv2 evaluation experiments. . . . . . . . 23
3.4 MBot, information about the 100 captured images PASCAL VOC test. . . . . . . . . . . . 25
3.5 CoBot, information about the 100 captured images PASCAL VOC test. . . . . . . . . . . . 25
3.6 MBot example images of YOLOv2 true and false predictions. . . . . . . . . . . . . . . . . 25
3.7 CoBot example images of YOLOv2 true and false predictions. . . . . . . . . . . . . . . . . 26
3.8 MBot, information about the 100 captured images COCO test. . . . . . . . . . . . . . . . 26
3.9 MBot, information about the 100 captured images COCO test. . . . . . . . . . . . . . . . 27
4.1 E.g. Outcomes from q1 and q2. Cat image credit to Flickr user William McCamment. . . 32
4.2 Some classes that can be added to the robot using the labels from the OpenImages dataset [55]. 33
4.3 Example of the human-robot interface when labeling an object. . . . . . . . . . . . . . . . 35
4.4 Mbot (top) and Pepper (bottom) learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.5 An example case where the robot gets repeated labels for the same object. . . . . . . . . 37
4.6 Example of YOLOv2+HUMAN using predictions from both the neural nets. . . . . . . . . . 39
4.7 Duplicate detection example. YOLOv2+HUMAN uses the prediction with higher confidence. 40
5.1 ISRoboNet@Home Testbed – Domestic scenario where we ran the experiments. . . . . . 43
5.2 Points of reference comprising the surrounding area. . . . . . . . . . . . . . . . . . . . . . 44
5.3 Example images captured by the robot for the ground-truth. . . . . . . . . . . . . . . . . . 47
5.4 Information about the ground-truth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5 Pepper looking for Joao’s backpack demo. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.6 Pepper detecting objects using MBot’s knowledge. . . . . . . . . . . . . . . . . . . . . . . 54
A.1 Information about HUMAN50 training data. . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.2 Information about HUMAN100 training data . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.3 Information about HUMAN150 training data. . . . . . . . . . . . . . . . . . . . . . . . . . . 66
B.1 YOLOv2 predictions with the ground-truth pictures. . . . . . . . . . . . . . . . . . . . . . . 68
B.2 HUMAN50 predictions with the ground-truth pictures. . . . . . . . . . . . . . . . . . . . . . 68
B.3 HUMAN100 predictions with the ground-truth pictures. . . . . . . . . . . . . . . . . . . . . 69
B.4 HUMAN150 predictions for the ground-truth pictures. . . . . . . . . . . . . . . . . . . . . . 70
B.5 YOLOv2+HUMAN150 predictions for the ground-truth pictures. . . . . . . . . . . . . . . . 71
Chapter 1
Introduction
The goal of this Chapter is to provide the reader with an overview of the work developed in this thesis.
Firstly, we introduce the problem we are addressing, along with the main motivation (Section 1.1), and
the objectives of this work (Section 1.2). Then, we describe the contributions of this work (Section 1.3).
Finally, we present the structure of this thesis (Section 1.4).
1.1 Motivation
Human-robot symbiotic learning is an increasingly active area of research [1, 2, 3, 4]. Anthropomorphic
robots are being increasingly deployed in real-world scenarios, such as homes, offices, and hospitals [5,
6, 7]. However, exposure to real human environments raises multiple challenges often overlooked in
controlled laboratory experiments. For instance, robots equipped with state-of-the-art neural nets trained
for object detection still fail to provide accurate descriptions for the majority of the objects surrounding
them outside controlled environments. We conducted a preliminary experiment with CoBot (shown in
Figure 1.1), designed for servicing multi-floor buildings [8], located at Carnegie Mellon University, USA,
and Monarch MBot (shown in Figure 1.2b), located at Instituto Superior Técnico, Portugal, to evaluate the
performance of the state-of-the-art object detector – YOLOv2 – in real-world scenarios. Unfortunately,
in both these robots, YOLOv2 fell short when compared to the expected performance (further described
in Chapter 3).
Figure 1.1: CoBot - Collaborative Robot. Designed for servicing multi-floor buildings [8].
1.2 Objectives
This thesis tackles the aforementioned problem using a symbiotic interaction approach, in which the
robot seeks human assistance in order to improve its object detection skills, which is the primary aim of
this work.
This is achieved by deploying a learning algorithm that empowers the robot to ask humans for help. Over
time, we can measure how this human input increases the robot's effectiveness. The learning process
is bootstrapped by an external state-of-the-art neural net — YOLOv2 — for real-time object detection [9].
The robot, using its RGB camera, in conjunction with its on-board tablet, explores its environment whilst
labeling objects. The robot then confirms these labels by interacting with humans, asking them to
respond to simple Yes/No questions and/or requesting that they adjust a selection rectangle positioned
around an object in the tablet.
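A minimal sketch of this interaction loop, in Python, could look as follows. All names here (capture_image, ask_yes_no, ask_label, ask_adjust_box, and the 0.5 threshold) are hypothetical illustrations, not the actual interfaces of the robots; the real algorithm is detailed in Chapter 4.

```python
def learning_loop(robot, detector, training_set):
    """Hypothetical sketch of the symbiotic learning loop: detect objects,
    ask a nearby human to confirm or correct uncertain labels via the
    on-board tablet, and store the confirmed labels for retraining."""
    image = robot.capture_image()
    for box, label, confidence in detector.predict(image):
        if confidence < 0.5:  # uncertain prediction: ask a human for help
            if not robot.ask_yes_no(f"Is this a {label}?", image, box):
                label = robot.ask_label(image, box)      # human provides the label
                box = robot.ask_adjust_box(image, box)   # human fixes the rectangle
            training_set.append((image, box, label))     # later used to train HUMAN
    return training_set
```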
With the learning algorithm, the robot also collects information about where objects were seen on the
map of the scenario and, for personal objects, to whom they belong. This information equips the robot
to perform other tasks, such as actively seeking a personal object on request. This was one of the use
case experiments made possible once we realized how widely the data collected by the robot could be
applied. Another application of this data would be to improve the robot's object detection skills by
sharing information between different robots.
The social robots used to test our approach were Pepper (shown in Figure 1.2a), a service robot devel-
oped by Softbank/Aldebaran Robotics and specifically designed for social interaction [10], and Monarch
(a) Pepper the robot (b) Mbot the robot
Figure 1.2: Mobile robots used for the evaluation of the proposed method. Photos: SoftBank/Aldebaran and IDMind Robotics
Mbot (shown in Figure 1.2b), designed for edutainment activities in a pediatric hospital [11]. In addition,
the two robotic platforms were located in separate working environments: Pepper was located at CMU,
USA, and Mbot at ISRobotNet@Home1, Test Bed of ISR-Lisbon, Portugal. The experiments were conducted
in a realistic domestic scenario. By having many research participants interact with and modify the
test environment, the unpredictability of object placement and arrangement was ensured.
At the end of our experiment, the robot was able to detect twice the number of objects compared to the
initial YOLOv2, with an improved mean Average Precision (mAP) score.
1.3 Contributions
Robots are being increasingly deployed in real-world scenarios. Homes, offices, and hospitals [5, 6, 7]
are merely a small example of the places where we will be able to find these robots in the future,
performing an ever-growing number of tasks in many different applications. With time, these tasks will
progressively become more complex and demanding, requiring robots to have an improved perception
of their surroundings.
This work contributes to service robots in general by empowering them with the ability to improve their
perception, in particular, by providing them with the ability to learn new objects and adapt to their sur-
rounding environment.
In practice, it is difficult to predict all the objects that a robot will be exposed to in real-world scenarios.
Scenarios are dynamic and constantly undergoing alterations as objects get displaced, modified, and
replaced by new ones. Using the learning algorithm introduced in this work, the robots were equipped
to address this: they asked for help when needed while improving their own perception of the human world.
1More info can be found at http://isrobonet athome.isr.tecnico.ulisboa.pt/
The approach of symbiotic autonomy uses interaction with humans as a means to overcome obstacles
(e.g., when the robot is unable to detect an object), rather than relying exclusively on the native abilities
of the robot. In turn, robots improve their object detection skills and provide more value to humans.
Hence this symbiotic relationship has the potential to form a positive feedback loop, where humans help
robots improve, which in turn helps robots serve humans better.
Summarized contributions of this work:
• We tested the state-of-the-art object detector (YOLOv2) using CoBot in the corridors of a university
and MBot in a domestic scenario, and showed that it fails to detect many of the objects in real
human environments;
• We introduced a learning algorithm that empowers the robot with the ability to detect new objects
and collect more information about them (to whom they belong, and where they are located) via
human-robot interaction;
• We introduced an approach of using two neural nets combined (YOLOv2 + HUMAN) and verified
that the YOLOv2 + HUMAN approach is able to detect more objects with increased accuracy;
• We showed that using the collected data, a robot can locate a specific person's object in the
environment;
• We showed that a robot can detect objects using the knowledge shared from another robot.
1.3.1 Publications
The learning algorithm introduced in this work, along with the main results, was published in a paper
accepted at IROS 2018 [12]. Additionally, a few other experiments conducted during the development of this work
will be presented at AAMAS 2018 [13].
1.3.2 Open Source
Recently, we released part of the code developed in this work in two GitHub repositories2:
• OpenLabeling: An open-source image labeling tool which supports the format required by YOLO
for training (released on 04/03/2018 and currently with 224 stars and 45 forks3).
• mAP: mean Average Precision - Code to evaluate the performance of your neural net for object
detection (released on 09/04/2018 and currently with 156 stars and 57 forks).
1.4 Thesis Structure
This thesis is structured as follows: Chapter 2 reviews the state-of-the-art in object detection and pro-
vides some background. Chapter 3 tests and evaluates the performance of YOLOv2 in real human
environments. Chapter 4 describes the learning algorithm and how the robot will use input from hu-
mans to create a neural net – HUMAN – adapted to the local real-world scenario and presents the
approach of using the two neural nets in parallel – YOLOv2 + HUMAN – to improve the robot’s capa-
bility of detecting objects. Chapter 5 describes the experimental setup where we conducted our study
and presents and discusses the results. Lastly, Chapter 6 presents the conclusions, summarizes the overall
findings, and discusses possible next steps for future work.
2These repositories can be found at https://github.com/Cartucho
3GitHub stars (show appreciation to the repository maintainer for their work) and forks (personal copies of the repository for
further development)
Chapter 2
Framework and State-of-the-art
Overview
This chapter provides a general framework for the task of object detection. The goal of object detection
is to enclose objects in an image within rectangles (usually called bounding boxes) and say what those
objects are. This chapter also puts the main studies about robots detecting objects into perspective.
Firstly, we briefly review the history of the object detection task (Section 2.1). Secondly, we overview the
YOLO neural net (Section 2.2). Then, we describe the mean Average Precision (mAP) metric (Section
2.3). Lastly, we describe the related work (Section 2.4).
2.1 Brief History of Object Detection
In 2001 Paul Viola and Michael Jones [14] presented the first remarkable face detection algorithm (the
Viola-Jones algorithm). Object detection had been around since the 1960s, but this was the first time
it really worked and ran in real-time, mainly due to its simple and efficient design. Like other algorithms
of the time, it relied on hand-crafted features: simple rectangular (Haar-like) filters whose responses
were fed into a cascade of boosted classifiers trained on a dataset of faces, capturing, for example, the
typical contrast between the regions of the eyes, the mouth, and the rest of the face. Unfortunately,
since the features were hand-crafted, other configurations, such as a person with an eyepatch or a
slightly tilted face, would often not be detected.
In 2004 Lowe [15] presented a very successful algorithm for feature matching called SIFT (scale
invariant feature transform). The basic idea was to transform the image data into scale invariant coor-
Figure 2.1: Viola and Jones algorithm for face detection. Image from OpenCV website.
Figure 2.2: E.g. SIFT object detection with partial occlusion. Image from OpenCV website.
dinates. Using these coordinates, the goal was to extract distinctive features invariant to image scale
and rotation, meaning that the object detections are robust to changes in the viewpoint from which the
images are captured. This algorithm also uses local features (each feature captures a distinctive part of
the object), making it robust to occlusion and clutter as well (see Figure 2.2). Other alternatives to
SIFT, such as SURF, have similar performance while being much faster [16]. SIFT is particularly
useful for identification tasks since it captures the features of a specific object. Unfortunately, this comes
at the price of poor generalization (e.g., SIFT may be very good at identifying a particular “shoe”, but
if other shoes in the same picture do not look like that one, they may not be detected).
In 2005 another efficient technique came out called HOG (Histograms of Oriented Gradients) [17]. This
method was similar to the SIFT descriptor but differed in that it decomposed an image’s pixels into
Figure 2.3: E.g., the HOG detector cues mainly on silhouette contours. Image from [17].
oriented gradients. With these gradients, full images are converted into simple representations that
capture the essence of what a human looks like in a picture (see Figure 2.3). Given a new picture, the
detector can then say: “Yes, this is a human” or “No, this is not a human”. This kind of basic feature
map looks very similar to what convolutional nets learn themselves, but in this case, what a human
figure looks like when converted to oriented gradients had to be hand-coded.
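To make the idea concrete, the orientation histogram of a single image cell can be computed roughly as follows. This is a toy simplification written for illustration; only the choice of 9 unsigned-orientation bins follows the HOG paper, while everything else (central differences, no block normalization) is our shorthand.

```python
import math

def cell_histogram(cell, n_bins=9):
    """Toy HOG-style orientation histogram for one cell.

    `cell` is a 2D list of grayscale intensities. Gradients are taken
    with central differences and binned by unsigned orientation
    (0-180 degrees), weighted by gradient magnitude.
    """
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / 180.0 * n_bins) % n_bins] += magnitude
    return hist
```

A full HOG descriptor concatenates such histograms over a dense grid of cells and normalizes them across overlapping blocks.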
Later, in 2012, the era of deep learning began. In this year, Krizhevsky et al. won the ImageNet [18] competition
(a yearly competition on visual detection tasks) using a convolutional network that outperformed
everybody else [19]. Convolutional neural networks had been around since the 1980s [20, 21], but this
time they really worked, due to the overall improvement in computational power (with the development of
modern GPUs) and to the increased amount of training data.
One way to perform object detection is to use classifiers like VGG-Net [22] or Inception [23, 24] (huge
convolutional neural nets trained on big datasets). By sliding these classifiers over a number of squares
of an image, we get a set of classifications, of which we keep only the ones the classifier is most certain
about. We can then draw a bounding box around the classified objects in the image. However, this
is a brute-force and computationally expensive approach. Methods like deformable parts models
(DPM) use this sliding-window approach, where a classifier is run at evenly spaced locations over the
entire image and high classification scores correspond to detections [25].
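The sliding-window scheme just described can be sketched as follows; `classify` stands in for any patch classifier (e.g., a CNN), and the single-scale loop is our simplification, since real systems also sweep over multiple scales.

```python
def sliding_window_detect(image, window, stride, classify, threshold):
    """Run `classify(patch) -> (label, score)` at evenly spaced locations.

    `image` is a 2D list of pixel values; every window whose score
    clears the threshold is kept as a detection.
    """
    h, w = len(image), len(image[0])
    win_h, win_w = window
    detections = []
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            patch = [row[x:x + win_w] for row in image[y:y + win_h]]
            label, score = classify(patch)
            if score >= threshold:
                detections.append((x, y, win_w, win_h, label, score))
    return detections
```

The total cost is the number of window positions multiplied by the cost of one classifier evaluation, which is what makes this approach so expensive.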
In 2014 a better approach was presented by Girshick et al., called R-CNN: Regions with CNN features
[26]. The idea behind R-CNN was to first run a process called selective search [27], which creates a set
of bounding boxes that could correspond to objects, before feeding anything to a convolutional network.
At a high level, selective search groups together adjacent pixels by texture, color, or intensity to identify
objects. As illustrated in Figure 2.4, given an input image, instead of using a sliding window, R-CNN
first extracted bounding boxes (region proposals); then it ran the images inside these bounding boxes
through a pre-trained CNN (e.g., AlexNet [19]) to compute the features for each bounding box. Finally,
it would use a support vector machine to classify the object inside each box. After
Figure 2.4: R-CNN Object Detection System Overview. Image from [26].
classification, the bounding boxes would be refined to output tighter coordinates to the objects and
eliminate duplicate detections.
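The pipeline just described can be summarized in a short sketch, where each argument is a pluggable stand-in for the corresponding component in the paper (selective search, a pre-trained CNN, and per-class SVMs); the function names here are ours, for illustration only.

```python
def rcnn_detect(image, propose_regions, cnn_features, svm_classify):
    """Schematic R-CNN pipeline: propose regions, featurize each crop,
    classify the features."""
    detections = []
    for box in propose_regions(image):       # ~2000 proposals per image
        features = cnn_features(image, box)  # CNN on the cropped region
        label, score = svm_classify(features)
        detections.append((box, label, score))
    # The paper then refines box coordinates (bounding-box regression)
    # and removes duplicate detections (non-maximum suppression).
    return detections
```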
This proved to be an effective approach for object detection. R-CNN evolved into Fast RCNN [28], Faster
RCNN [29] and more recently Mask RCNN [30]. However, they all first generate potential bounding
boxes, then run a classifier and finally do some post-processing. These complex pipelines are slow and
hard to optimize because each individual component must be trained separately [31].
Moreover, all of these methods look at each image thousands or hundreds of thousands of times
to perform detection, repeatedly evaluating the classifiers on different parts of the image and at different
scales. YOLO took a completely different approach and outperformed these previous methods.
2.2 An Overview on YOLO
You Only Look Once (YOLO) is one of the newest and most popular techniques for object detection.
This neural net was introduced in 2015 [31] (CVPR 2016, OpenCV People's Choice Award), and its
accuracy and speed were recently improved substantially in YOLOv2 [9] (CVPR 2017, Best Paper
Honorable Mention).
In this Section we describe how this object detector works, how it is structured, and what its main
advantages are.
2.2.1 Functioning
Given an image, YOLO starts by splitting that image into an S×S grid (see Figure 2.5). In this grid,
each of the cells (e.g., in Figure 2.5 one of the cells is marked in red) is responsible for predicting B
10 Chapter 2. Framework and State-of-the-art Overview
Figure 2.5: Example image split into an S×S grid with one of its cells marked in red. Image from the YOLO website.
bounding boxes and B confidence scores (one for each box). Each individual box confidence score tells
us how certain YOLO is that the predicted bounding box actually encloses some object and also how
well it thinks the predicted bounding box is adjusted to that object.
Figure 2.6: Resulting predictions from all the grid cells (the higher the confidence the thicker the box). Image from the YOLO website.
In Figure 2.6 we can visualize the predictions from all the cells together. Essentially, we get many
bounding boxes ranked by their confidence value (the higher the confidence, the thicker the box). At
this point we know where the objects are in the image, but we still do not know what they are.
Then, each grid cell (see Figure 2.7) also predicts C conditional class probabilities. It only predicts one
set of class probabilities per grid cell, regardless of the number of boxes (B) associated with that cell.

Figure 2.7: Class probability map. Image from YOLO website.

Figure 2.8: Combining bounding boxes with class probabilities. Image from YOLO website.

It is important to notice that, since this is a conditional probability, when a grid cell predicts "dog" it is
not saying that there is a "dog" in that cell; it is saying that if there is an object in that cell, then that
object is a "dog".
The previously calculated confidence score for the bounding box and the conditional class prediction
are combined into one final score that tells us the probability that each bounding box contains a specific
type of object, as we can see in Figure 2.8.
However, most of these predictions have a very low confidence score; consequently, we only keep the
boxes whose final score is higher than a specific threshold. Additionally, duplicate boxes predicting
the same object are removed using non-maximum suppression, resulting in the final detections for one
image (see Figure 2.9).
Figure 2.9: Final predictions after applying a threshold and non-maximum suppression. Image from the YOLO website.
To summarize, each cell in the S×S grid predicts:
• B bounding boxes and associated box confidences;
• C conditional class probabilities (one per class).
Each of the B bounding boxes is coupled with 5 predictions (x, y, w, h, and box confidence). The x and y
coordinates are offsets of the center of the box relative to the bounds of the grid cell. The w and h are
the width and the height of the predicted bounding box. All four coordinate values are normalized to fall
between 0 and 1: x and y relative to the cell size, and w and h relative to the whole image. Finally, the
box confidence was already explained above (how certain YOLO is that the predicted bounding box actually
encloses some object and how well adjusted that box is).
This parameterization fixes the output size for detection: S×S×(B×5 + C). For example, for PASCAL
VOC, YOLO used a 7 by 7 grid (S=7), 2 bounding box proposals per cell (B=2), and there are 20
classes (C=20). This results in 7×7×30 = 1470 outputs.
To summarize, YOLO trains a neural network to predict this output tensor, so that in one pass it computes
all the detections for an image. The main reason YOLO is so good is that these separate predictions are
all made at the same time, looking only once at the image, which is why it is called YOLO (You Only
Look Once). Since it predicts all of these detections simultaneously, YOLO also implicitly incorporates
global context in the detection process, so it can learn which objects tend to occur together, the relative
size and location of objects, and other assorted things.
Figure 2.10: The architecture of YOLO. Image from [31]
2.2.2 Network Architecture
The architecture is inspired by the GoogLeNet model for image classification [58]. As shown in Figure 2.10,
the full network is essentially composed of alternating Convolution and MaxPooling layers.
Using it is very simple: we give it an input image, which goes through the convolutional network in a
single pass and comes out the other end as an S×S×(B×5 + C) tensor describing the predicted bounding
boxes for the grid cells. All we need to do then is compute the final scores for the bounding boxes,
remove the boxes scoring less than a pre-determined threshold, and remove the duplicate bounding
boxes (non-maximum suppression).
Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detec-
tion performance [31].
2.2.3 Training
ImageNet [18] is currently the biggest dataset for Computer Vision tasks. It includes millions of images
capturing 1000 different object classes in their natural scenes. When we train YOLO to detect our own
custom object classes we usually have a small training set. Conveniently, YOLO was pre-trained on the
ImageNet dataset. During this pre-training, YOLO learned to convert the objects in millions of images into
a feature map (convolutional weights). Therefore, when YOLO trains, it adapts these convolutional weights
to detect our custom object classes in the training set images. Specifically, it applies a technique called
Transfer Learning (also known as Domain Adaptation). This allows us to get good performance even
with a small training set.
(a) Increased confidence during training. (b) Decreased confidence during training.
Figure 2.11: Example training YOLO to detect a “dog”. Image from YOLO website.
Since YOLO predicts all the existing instances of objects from a single full image, it also has to train on
full images. First, given a training image coupled with the ground-truth labels, YOLO matches these
labels with the appropriate grid cells. This is done by calculating the center of each ground-truth bounding
box; whichever grid cell that center falls in becomes the one responsible for predicting that detection.
Then YOLO adjusts that cell's class predictions (e.g. in Figure 2.11 we want it to predict "dog") and its
bounding box proposals. The bounding box overlapping the most with the ground-truth label has
its confidence score increased (e.g. the one with the green arrow in Figure 2.11) and its coordinates
adjusted, while the remaining boxes simply get their confidence scores decreased (e.g. the one with
the red arrow in Figure 2.11) since they do not properly overlap the object.
Some remaining cells do not have any ground-truth detections assigned to them. In this case, the
confidence of all their predicted bounding boxes is decreased, since they are not responsible for predicting
any object. One important thing to note is that, in this case, YOLO does not adjust the class probabilities
or coordinates of the proposed bounding boxes, since there are no ground-truth objects assigned to
those cells.
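The cell-assignment step described above can be sketched as follows (assuming ground-truth box centers given in normalized [0, 1] image coordinates; the function name is ours):

```python
def responsible_cell(center_x, center_y, S):
    """Return the (row, col) of the grid cell whose region contains the
    ground-truth box center; that cell is responsible for the detection."""
    col = min(int(center_x * S), S - 1)  # clamp so a center at exactly 1.0
    row = min(int(center_y * S), S - 1)  # still falls inside the grid
    return row, col
```

For S = 7, a box centered at (0.5, 0.5) is assigned to cell (3, 3), the middle of the grid.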
Overall, the training of this network is pretty straightforward and follows many of the standards of the
Computer Vision community: YOLO pre-trains on ImageNet, uses stochastic gradient descent, and uses
data augmentation. For the data augmentation, YOLO randomly scales and translates by up to 20% of
the original image size, and also randomly adjusts the exposure and saturation of the image by up to a
factor of 1.5 and the hue by up to a factor of 0.1 in the HSV color space [31].
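A minimal sketch of sampling these augmentation parameters (the function name and the convention of making the exposure/saturation factor equally likely to increase or decrease are our assumptions, not YOLO's actual code):

```python
import random

def sample_augmentation(rng=random):
    """Sample one set of the augmentation parameters described in the text."""
    def rand_scale(s):
        # a factor in [1/s, s], equally likely to brighten or darken
        f = rng.uniform(1.0, s)
        return f if rng.random() < 0.5 else 1.0 / f
    return {
        "scale": rng.uniform(0.8, 1.2),         # up to 20% of image size
        "translate_x": rng.uniform(-0.2, 0.2),  # fraction of image width
        "translate_y": rng.uniform(-0.2, 0.2),  # fraction of image height
        "exposure": rand_scale(1.5),            # factor of up to 1.5
        "saturation": rand_scale(1.5),          # factor of up to 1.5
        "hue": rng.uniform(-0.1, 0.1),          # HSV hue shift up to 0.1
    }
```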
Figure 2.12: YOLO generalization results on Picasso and People-Art datasets. Image from [31]
2.2.4 Advantages of YOLO
One of the main benefits of using YOLO is that it is considerably fast, allowing real-time processing for
service robots and many other applications.
Another great advantage is that YOLO reasons globally about the image when making predictions. Un-
like sliding window and region proposal-based techniques, YOLO sees the entire image during training
and test time so it implicitly encodes contextual information about classes as well as their appearance.
Fast R-CNN, a top detection method [28], mistakes background patches in an image for objects because
it cannot see the larger context [31]. YOLO makes less than half the number of background errors
compared to Fast R-CNN. It also predicts all bounding boxes across all classes for an image
simultaneously.
Finally, YOLO generalizes rather well to new domains (see Figure 2.12). After being trained on natural
images, it has been run on artwork and still managed to detect object classes like "human" and "cat",
even with abstract representations of people like in the painting The Scream by Edvard Munch.
2.2.5 YOLOv2
From the initial version to YOLOv2, some incremental improvements were applied, enhancing the accuracy
significantly while also making it faster. For example, YOLOv2 added batch normalization [32] in the
convolutional layers, leading to increased detection accuracy. In this Sub-section, we review some of the
most significant improvements that were made:

Figure 2.13: Anchor Boxes vs. Dimension Clusters. Image from the YOLO website.
Dimension Clusters
Other systems like Faster R-CNN and SSD use pre-defined anchor boxes consisting of 3 different
scales and 3 different aspect ratios (see Figure 2.13). YOLOv2 also uses pre-defined boxes; however,
instead of using anchor boxes with hand-picked aspect ratios, it creates a new set of boxes (dimension
clusters) tuned to the real objects that are actually in the training data. The dimension clusters
capture more variability in the training data with fewer boxes, resulting in faster and more accurate
detection.
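A minimal sketch of how such dimension clusters can be obtained, assuming k-means with d = 1 - IOU as the distance (boxes compared by width and height only, as if centered at the same point); the naive initialization with the first k boxes is an illustrative simplification:

```python
def wh_iou(wh1, wh2):
    """IOU of two boxes compared by width/height only (same center)."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    return inter / (wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter)

def dimension_clusters(box_whs, k, iterations=100):
    """k-means on box (w, h) pairs using d = 1 - IOU as the distance,
    so the clusters favour shapes that overlap well, not raw sizes."""
    centroids = list(box_whs[:k])  # naive init: first k boxes
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for wh in box_whs:
            # assign to the centroid with the highest IOU (lowest distance)
            best = max(range(k), key=lambda i: wh_iou(wh, centroids[i]))
            groups[best].append(wh)
        centroids = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g))
            if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids
```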
Multi-scale Training
YOLOv2 also includes a multi-scale training regime. Generally, detectors are trained at a single input
resolution (e.g. 448×448), and at test time the input image is resized to that specific size. In the
original version of YOLO, the full detection pipeline was trained at just a single scale. With YOLOv2 they
came up with the idea of resizing the network randomly throughout the training process to many different
scales, after making the network fully convolutional (they removed the fully connected layers at the end of
the original architecture). During training, YOLOv2 resizes the images in multiples of 32, from 288×288,
320×320 and so on up to 608×608. This technique boosts the network's performance both at a single
scale (e.g. if we only want to detect objects on a 448×448 image) and at other scales as well. By doing
this, YOLOv2 can essentially be resized at test time to numerous different sizes. Accordingly, without
changing the previously trained weights, we can run detection at different scales, getting a smooth trade-
off between speed and accuracy (see Figure 2.14). So, for example, if we perform detection at
544×544 we have a model that is very accurate and runs a little slower, whereas if we resize down to
Figure 2.14: Accuracy and speed on VOC 2007. Image from YOLO website.
288×288 we have a model that is a lot faster but less accurate. This multi-scale training can be thought
of as a sort of data augmentation. In object detection we try to do as much data augmentation as possible,
and training at different scales means that the detector learns to predict objects well at different scales,
leading to a significant performance improvement.
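The set of training resolutions described above, and the random choice among them, can be sketched as (function names are ours):

```python
import random

def multiscale_resolutions(low=288, high=608, step=32):
    """All square input sizes used during multi-scale training:
    288, 320, ..., 608 (multiples of 32, the network's total stride)."""
    return list(range(low, high + 1, step))

def pick_training_resolution(rng=random):
    """Every few batches the network is switched to a randomly chosen size."""
    return rng.choice(multiscale_resolutions())
```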
2.3 mAP - mean Average Precision
In this Section we provide some background on the meaning and how the mean Average Precision
(mAP) value is calculated.
2.3.1 Definition
The mean Average Precision (mAP) is the standard single-number metric used to compare the state-
of-the-art for object detection. Let us decompose this acronym into two parts, "m" and "AP". The "m" in
mAP stands for "mean" (the arithmetic mean): the sum of all the values (in this case the AP values)
divided by the number of items. The "AP" (Average Precision) is an average of precision values. This is
further explained in the following Sub-sections.
2.3.2 Precision and Recall
When calculating the mAP, two quantities are measured repeatedly: the precision (the percentage of
items classified as positive that actually are positive) and the recall (the percentage of positives that are
classified as positive) [33].
(a) Example IOU = 78.7% ≥ 50% (b) Example IOU = 31.0% < 50%

Figure 2.15: Example predicted bounding boxes with an IOU above (a) and below (b) the 50% matching threshold.
precision = tp / (tp + fp),    (2.1)

recall = tp / (tp + fn),    (2.2)
where tp stands for true positive, fp for false positive and fn for false negative. So, in this context, precision
is the fraction of all the detected objects (tp + fp) that match the ground-truth (tp), and recall is
the fraction of all the ground-truth objects (tp + fn) that are successfully detected (tp).
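As a minimal sketch, Equations 2.1 and 2.2 in code:

```python
def precision(tp, fp):
    """Fraction of detections that actually match a ground-truth object."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of ground-truth objects that were successfully detected."""
    return tp / (tp + fn)

# e.g. 8 correct detections, 2 spurious ones and 4 missed objects:
print(precision(8, 2))  # 0.8
```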
2.3.3 Intersection Over Union
In the object detection task, the algorithm is expected to localize the object in the image. To evaluate
whether the predicted bounding boxes are well adjusted to the objects, these algorithms measure the
IOU (Intersection Over Union):
IOU = Area(Bp ∩ Bg) / Area(Bp ∪ Bg),    (2.3)
where Bp ∩ Bg and Bp ∪ Bg denote, respectively, the intersection and the union of the predicted and
ground-truth bounding boxes.
Two bounding boxes match if their IOU ≥ 50%. This threshold was set in the PASCAL VOC competition
[34]; humans tend to be slightly more lenient than the 50% IOU criterion [35].
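For axis-aligned bounding boxes, Equation 2.3 and the matching criterion can be sketched as follows (the corner-coordinate `(x1, y1, x2, y2)` box format is an assumption for illustration):

```python
def iou(bp, bg):
    """Intersection over union (Equation 2.3) of a predicted box bp and a
    ground-truth box bg, both given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(bp[0], bg[0]), max(bp[1], bg[1])
    ix2, iy2 = min(bp[2], bg[2]), min(bp[3], bg[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_p = (bp[2] - bp[0]) * (bp[3] - bp[1])
    area_g = (bg[2] - bg[0]) * (bg[3] - bg[1])
    return intersection / (area_p + area_g - intersection)

def boxes_match(bp, bg):
    """PASCAL VOC matching criterion: the boxes match if IOU >= 50%."""
    return iou(bp, bg) >= 0.5
```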
For example, in Figure 2.15 (a), given an instance of the object class "pottedplant", we can see the
ground-truth bounding box (in blue) and the one predicted by the algorithm (in green), with an IOU ≥
50%. In (b), given an instance of the object class "chair", we can see the ground-truth bounding box (in
blue) and the one predicted by the algorithm (in red), with an IOU < 50%.
2.3.4 Calculating the AP - Average Precision
The neural net's predictions were judged by the precision/recall (PR) curve. The quantitative measure
used was the average precision (AP) with the VOC metric [34, 36, 37].

First, each detection was mapped to a ground-truth object instance. There is a match if the class labels
are the same and the IOU (Intersection Over Union), given by Equation 2.3, is larger than or equal to
50%. In the case of multiple detections of the same object, only one is counted as a true detection and
the repeated ones are counted as false detections [34].

The AP was then computed as follows [36]:

• First, we computed a version of the measured precision/recall curve with precision monotonically
decreasing, by setting the precision for recall r to the maximum precision obtained at any recall
r′ > r.

• Then, we computed the AP as the area under this curve by numerical integration. No approximation
was involved since the curve is piecewise constant.
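The two computation steps above can be sketched as follows (a minimal sketch; `recalls` and `precisions` are the measured points of the PR curve):

```python
def average_precision(recalls, precisions):
    """VOC-style AP: set the precision at each recall r to the maximum
    precision at any recall r' >= r (monotone envelope), then integrate
    the resulting piecewise-constant curve exactly."""
    pts = sorted(zip(recalls, precisions))
    r = [0.0] + [pt[0] for pt in pts]
    p = [0.0] + [pt[1] for pt in pts]
    for i in range(len(p) - 2, -1, -1):  # right-to-left max sweep
        p[i] = max(p[i], p[i + 1])
    # area under the step curve: each recall segment carries the
    # precision of the point at its right end
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))
```

For instance, a PR curve with the points (recall 0.5, precision 1.0) and (recall 1.0, precision 0.5) yields an AP of 0.75.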
2.4 Related Work
The complexity of indoor environments grows with a large number of factors, increasing the difficulty
for robots to complete tasks successfully. The particular task of learning and recognizing useful
representations of places (such as a multi-floor building) and manipulating objects has been a subject of
active research, namely in symbiotic interaction with humans [8].
In this work, we aim to detect objects coupled with their bounding-box. A few works have explored a
human-based active learning method specifically for training object class detectors. Some of them focus
on human verification of bounding boxes [35] or rely on a large amount of data corrected by annotators
[38]. Others explore how people teach or are influenced by the robot [39].
It is worth mentioning that symbiotic autonomy has been actively pursued in the past [40, 41]. Some
interesting approaches focus on improving the robot's perception with remote human assistance [42].
Research groups have also worked on developing autonomous service robots such as the CoBots [2],
the PR2 [43] and many others.
Considering spatial information analysis and mapping, approaches using ultrasonic imaging with neural
networks [44], 3D convolutional neural networks with RGB-D data [45], and a combination of the RANSAC
(an outlier detection method) and Mean Shift (a cluster analysis method) algorithms [46] have been
developed over several decades, providing the foundations for the present work.
Other works using scene understanding and image recognition show a strong affinity with this thesis,
such as: saliency mapping combined with neural nets for scene understanding [47]; full pose estimation
of relevant objects relying on algorithmic processing and comparison against a dataset of images [46];
a feature-matching technique implemented with a Hopfield neural network [48]; and data augmentation
[49].
When it comes to showing the robot an object located in the real world, previous work has investigated
alternative ways of intuitively and unambiguously selecting objects, such as using a green laser pointer [50].
In our approach, we took advantage of the robot's tablet, purposely built in for interacting with humans.
Finally, there has been work done in the area of getting robots to navigate in a realistic setting and
detecting objects in order to place them on a map [51]. In our case, using input from human interaction
allows the robot to generate this information. This enables the robot to store where each object was
seen and how many times it was seen at each location.
Chapter 3
YOLOv2 in a Real-World Scenario
You Only Look Once (YOLO) is currently one of the most popular state-of-the-art neural nets for real-time
object detection. This chapter evaluates its performance when deployed on a robot in a real-world
scenario and exposes its need for improvement. Firstly, YOLOv2's detections are evaluated when it is
trained using the PASCAL VOC 2012 dataset [36] (Section 3.1), and then using the COCO dataset [52]
(Section 3.2). Lastly, we discuss these evaluations (Section 3.3).
3.1 PASCAL VOC Dataset Test
In this Section we test YOLOv2, trained with the PASCAL VOC dataset, with MBot in a domestic scenario
(Sub-section 3.1.1) and with CoBot in a university scenario (Sub-section 3.1.2). In each scenario the
robot collected a total of 100 images, capturing a sub-group of the 20 object classes1 of the PASCAL
VOC 2012 dataset [36]. Both robots' images were captured at the same resolution of 640×4802. At
the end of this Section, we compare the results with the ones presented in the literature (Sub-section
3.1.3).
3.1.1 MBot Test in a Domestic Scenario
First, MBot captured a total of 100 images for this test in a domestic scenario. In these images, using
YOLOv2, MBot detected a total of 7 out of the 20 classes present in the PASCAL VOC dataset: "bottle",
"chair", "person", "plant", "sofa", "table" and "tv". These images were captured
1The 20 PASCAL VOC class labels can be found at http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
2The MBot's images were captured using an ASUS Xtion and the CoBot's ones with a Kinect 2
(a) Domestic scenario during day time (b) Domestic scenario during night time
Figure 3.1: Example images captured by MBot, showing different perspectives and lighting conditions.
on different days (implicitly varying the lighting conditions) and from different poses (position + orientation)
relative to a fixed world frame of the scenario (examples in Figure 3.1), in the ISRobotNet@Home3 Test
Bed of ISR-Lisbon, Portugal. In Sub-section 3.1.3 we evaluate YOLOv2's performance when detecting
objects in these images.
3.1.2 CoBot Test in a University Scenario
Similarly, we deployed YOLOv2 on the mobile indoor service robot CoBot (collaborative robot; see
Figure 3.2). CoBot also captured 100 images for testing, but this time in a university scenario. In them,
YOLOv2 detected 4 of the 20 classes present in the PASCAL VOC dataset: "chair", "plant", "table"
and "tv". To collect these images (examples in Figure 3.3) the robot navigated on one of the floors of
the Machine Learning Department, at Carnegie Mellon University, in Pittsburgh, PA, USA. It navigated
continuously, for a total of half an hour, randomly choosing each next destination in that building. Despite
being a real university scenario, the class "person" was excluded from the captured images due to
privacy concerns.
3.1.3 Performance of YOLOv2 Trained on PASCAL VOC
For each of the captured images (by MBot and CoBot), we created a corresponding ground-truth file,
correctly labeling all the instances of the PASCAL VOC object classes, and a YOLOv2 predictions
file (see the plots in Figures 3.4 and 3.5). Then, we confronted the neural net's predictions with the
ground-truth to calculate the AP (Average Precision) for each class (see Tables 3.2 and 3.3) and the
resulting mAP (mean Average Precision); this metric is explained in Chapter 2, Section 2.3.
3More info can be found at http://isrobonet athome.isr.tecnico.ulisboa.pt/
Figure 3.2: CoBots with on-board computation, interaction interfaces, omni-directional motion, carrying baskets, and depth sensors (Kinect and LIDAR).
(a) Corridor (b) Room
Figure 3.3: Example images captured by CoBot for our YOLOv2 evaluation experiments.
Table 3.1: PASCAL VOC2012 [36] test detection results. YOLOv2 performs on the same level as other state-of-the-art detectors like Faster R-CNN [29] with ResNet [53] and SSD512 [54] while still running 2 to 10 times faster [9]. Table adapted from [9].
The values presented for each class are the AP (Average Precision) in percentage (%).
Method data mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
Fast R-CNN [28] VOC12 68.4 82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2
Faster R-CNN [29] VOC12 70.4 84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5
YOLO [31] VOC12 57.9 77.0 67.2 57.7 38.3 22.7 68.3 55.9 81.4 36.2 60.8 48.5 77.2 72.3 71.3 63.5 28.9 52.2 54.8 73.9 50.8
SSD300 [54] VOC12 72.4 85.6 80.1 70.5 57.6 46.2 79.4 76.1 89.2 53.0 77.0 60.8 87.0 83.1 82.3 79.4 45.9 75.9 69.5 81.9 67.5
SSD512 [54] VOC12 74.9 82.3 75.8 59.0 52.6 81.7 81.5 90.0 55.4 79.0 59.8 88.4 84.3 84.7 84.8 83.3 50.2 78.0 66.3 86.3 72.0
ResNet [53] VOC12 73.8 86.5 81.6 77.2 58.0 51.0 78.6 76.6 93.2 48.6 80.4 59.0 92.1 85.3 84.8 80.7 48.1 77.3 66.5 84.7 65.6
YOLOv2 544 [9] VOC12 73.4 86.3 82.0 74.8 59.2 51.8 79.8 76.5 90.6 52.1 78.2 58.5 89.3 82.5 83.4 81.3 49.1 77.2 62.4 83.8 68.7
Table 3.2: MBot Domestic Scenario PASCAL VOC2012 [36] test detection results on YOLOv2.
The values presented for each class are the AP (Average Precision) in percentage (%).
Method data mAP bottle chair table person plant sofa tv
YOLOv2 VOC12 (PASCAL VOC) [9] 60.6 51.8 52.1 58.5 81.3 49.1 62.4 68.7
MBot Domestic (100 images) 39.0 6.9 55.9 21.0 63.9 50.7 53.2 21.4
Table 3.3: CoBot University Scenario PASCAL VOC2012 [36] test detection results on YOLOv2.
The values presented for each class are the AP (Average Precision) in percentage (%).
Method data mAP chair table plant tv
YOLOv2 VOC12 (PASCAL VOC) [9] 57.1 52.1 58.5 49.1 68.7
CoBot University (100 images) 11.6 27.0 1.4 18.1 0
Table 3.1 illustrates the comparable performance of YOLOv2 versus other state-of-the-art detectors.
Using some of the values of this table we then created Tables 3.2 and 3.3. Table 3.2 allows us to compare
the performance of YOLOv2 when using PASCAL VOC's test images (1st row of the table) versus when
using MBot's 100 images from the domestic scenario test (2nd row). Given these results, we first notice
the comparable performance of the classes "chair" and "plant" between the two datasets. Furthermore,
the classes "person" and "sofa" have AP values dropping less than 20 percentage points, and finally,
the classes "bottle", "table" and "tv" drop over 35 percentage points from PASCAL VOC to MBot's
100-image data. Overall, the mAP (mean Average Precision) dropped approximately 20 percentage
points, from 60.6% (using PASCAL VOC data) to 39.0% (using MBot's 100 images). The plots in Figure
3.4 provide more details about MBot's domestic scenario test, illustrating, for example, that in the 100
images YOLOv2 predicted 14 instances of "bottle", 10 of which were assigned as false predictions and
4 as true ones. Example images of these predictions can be found in Figure 3.6.
Similarly, Table 3.3 allows us to compare the performance of YOLOv2 when using PASCAL VOC's test
images (1st row of the table) versus when using CoBot's 100 images from the university scenario test
(2nd row). In this scenario, YOLOv2 performed poorly compared to the previous test. The AP of the
classes "chair" and "plant" dropped significantly (over 25 percentage points) and the AP of the classes
"table" and "tv" dropped practically to 0% (Figure 3.7 illustrates one of the detections of a "tv"). Overall,
the mAP dropped from 57.1% to 11.6%. Most of CoBot's images were captured while it was moving
around the building, resulting in more blurred images compared to MBot's. The plots in Figure 3.5
provide more information about the performance of YOLOv2 on the images from CoBot's test. We can
see from that plot that the classes "train", "sofa" and "bird" are also included in YOLOv2's predicted
objects, although there were no instances of these classes in the images and therefore in the ground-truth.
For example, note that the class "train" was detected 11 times in those images of the university corridors.
Example images of these predictions can be found in Figure 3.7.
3.2 COCO Dataset Test
YOLOv2 was also trained separately on the COCO dataset [52], obtaining 44.0 mAP [9] on the same
VOC metric (see more details about this metric in Chapter 2, Sub-section 2.3.4). The COCO dataset
(a) Information about objects in the ground-truth (b) Information about objects predicted by YOLOv2
Figure 3.4: MBot, information about the 100 captured images PASCAL VOC test.
(a) Information about objects in the ground-truth (b) Information about objects predicted by YOLOv2
Figure 3.5: CoBot, information about the 100 captured images PASCAL VOC test.
(a) Correct prediction of a “bottle” (in green) (b) False prediction of a “person” (in red)
Figure 3.6: MBot example images of YOLOv2 true and false predictions.
(a) Correct prediction of a “chair” (in green) (b) False prediction of a “tv” (in red)
Figure 3.7: CoBot example images of YOLOv2 true and false predictions.
includes 80 object classes4, extending the 20 classes of PASCAL VOC by 60. Both the PASCAL VOC and
COCO datasets gather training images of objects in a vast variety of day-to-day scenes in their natural
context. Since the Average Precision results of YOLOv2 for each of the individual 80 classes were not
published, in this Section we evaluate how many of these classes are detected in the images previously
collected by MBot and CoBot, and whether these detections are true or false.
Figures 3.8 and 3.9 show the results obtained with YOLOv2 trained on the COCO dataset for the same
images as before. In MBot's domestic scenario a total of 23 classes were detected, and in CoBot's
university scenario a total of 10 classes. Not all of these predictions were correct: looking at the class
"refrigerator", for example, we can see that all its predictions were false (37 times on MBot and 4 times
on CoBot), since there was not any refrigerator in the test scenarios.
(a) Information about objects in the ground-truth (b) Information about objects predicted by YOLOv2
Figure 3.8: MBot, information about the 100 captured images COCO test.
4The 80 COCO class labels can be found at http://cocodataset.org
(a) Information about objects in the ground-truth (b) Information about objects predicted by YOLOv2
Figure 3.9: CoBot, information about the 100 captured images COCO test.
3.3 Discussion
To summarize, in this Chapter we tested and evaluated YOLOv2 when deployed on two robots: in a
domestic scenario and in the corridors of a university. We evaluated YOLOv2 trained on PASCAL
VOC and verified that the mean Average Precision (mAP) dropped from 60.6% to 39.0% in the case of
MBot in the domestic scenario (Table 3.2), and from 57.1% to 11.6% in the case of CoBot in the corridors
of the university (Table 3.3). We further evaluated the performance of YOLOv2 trained on the COCO
dataset and verified that some of the detected classes are not even present in these scenarios; for
example, the class "refrigerator" was falsely detected in both test scenarios (Figures 3.8 and 3.9).
On the other hand, it is important to note that YOLOv2 trained with these datasets, PASCAL VOC and
COCO, is limited to detecting 20 and 80 object classes, respectively. In real human scenarios the robot
will be confronted with an unpredictable number of classes that may even vary over time. For example,
new classes of objects are constantly introduced in our homes as technology evolves and new machines
are created.
Chapter 4
Learning Algorithm
This Chapter describes the learning algorithm and how the robot uses input from humans to create
a neural net, HUMAN, adapted to the local real-world scenario, and presents the approach of using
the two neural nets in parallel (YOLOv2 + HUMAN) to improve the robot's capability of recognizing
objects.
4.1 Learning Algorithm
In this Section, we introduce the learning algorithm. The goal of this algorithm is to create a neural net,
called HUMAN, that is able to detect objects within the scenario the robot is deployed in. The idea
is to train this neural net with images labeled by the humans present in that scenario.
The algorithm consists of three parts: Firstly (Section 4.1.1), the robot captures images and predicts
what and where the objects are located in the images. Secondly (Section 4.1.2), the robot asks the
humans questions to confirm its previous predictions and collect additional information. Lastly (Section
4.1.3), the robot re-trains its detector using the collected information to improve its future predictions.
In Algorithm 1 you can find the pseudocode of this learning algorithm. The functions starting with an
upper case letter ("CaptureAndPredict", "InteractWithHumans" and "Training") are the ones we describe
in this Chapter with their respective pseudocodes. We assume the robot applies the learning algorithm
at a set of pre-defined points in the map of the scenario (further explained in Chapter 5, Section 5.1.1).
Algorithm 1 Learning Algorithm
1: Input: target_position  // position (point in the map) from where the robot will capture images
2: Output: HUMAN  // resulting HUMAN neural net
3: procedure LearningAlgorithm(target_position)
4:     // Part I - Capturing Images and Predicting the Objects
5:     new_images_and_predictions = CaptureAndPredict(target_position)
6:     // Part II - Interacting with Humans
7:     new_labeled_images = InteractWithHumans(new_images_and_predictions)
8:     // Part III - Training
9:     save_data(new_labeled_images)
10:    all_labeled_images = load_all_data()
11:    if meets_training_conditions(all_labeled_images) then
12:        HUMAN = Training(all_labeled_images)
13:        return HUMAN
14:    end if
15: end procedure
4.1.1 Part I - Capturing Images and Predicting the Objects
The robot should be able to detect the surrounding objects independently of its current pose (location
+ orientation) relative to a fixed world frame. To address this, the robot defines a set of sparse reference
points in the map, from where it will capture the images. The robot also needs to be able to detect the
objects under different lighting conditions (related to the various hours of the day). To address this, we
defined that the robot goes to one reference point in the map per day, at a random time of that day.
Lastly, the robot should be able to detect the objects independently of its orientation relative to the fixed
world frame, so we defined that it captures 8 images while rotating about its axis (one per each 45° it
rotates) at the same reference point in the map. Consequently, in total the robot captures 8 pictures per
day, and these are the images it uses for learning to detect objects.
The robot starts by navigating to a position (pre-defined point in the map of the scenario) that it has
not visited before. In this location, it captures different images while rotating around its axis. Before
requesting help from a human, the robot predicts what the objects are and where they are located in
each of the images. From each object’s prediction, we retrieve information about the identified class,
its location in the image and finally a confidence level from 0 to 100%. In Algorithm 2 we can find the
pseudocode for this part (Part I) of the learning algorithm.
To make these predictions the robot uses a state-of-the-art detector trained on a set of classes. In this work we used YOLOv2 (explained in Chapter 2), trained on COCO and able to detect up to 80 classes of objects1. These 80 classes apply to a vast number of scenarios. For example, a domestic scenario usually includes classes like "person", "tv" or "sofa", while a garden would include classes like "bench", "bird" or "potted plant".
1The 80 COCO class labels can be found at http://cocodataset.org
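Each prediction can be seen as a triple of class label, confidence, and bounding box. A minimal sketch of such a record (the names and values below are our own illustration, not the thesis code):

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One detected object: a class label, a confidence level, and a box."""
    class_name: str
    confidence: float   # confidence level, 0 to 100 %
    bbox: tuple         # (x_min, y_min, x_max, y_max) in pixels

# A hypothetical prediction list for one captured image
prediction_list = [
    Prediction("chair", 87.5, (40, 120, 210, 330)),
    Prediction("tvmonitor", 62.0, (300, 80, 520, 240)),
]
```

These records are what Part II later shows to a human for verification, one at a time.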
Algorithm 2 Learning Algorithm - Part I - Capturing Images and Predicting the Objects
 1: Input: target_position // location (point in the map) from where the robot captures the images;
 2: Output: new_images_and_predictions // images captured by the robot and respective predictions;
 3:
 4: procedure CAPTUREANDPREDICT(target_position)
 5:     new_images_and_predictions = start_data_structure()
 6:     YOLOv2, HUMAN = load_detectors()
 7:     navigate_to(target_position)
 8:     // The images are captured while the robot rotates about its axis (explained in Chapter 5.1)
 9:     image_list, target_orientation_list = capture_images_while_rotating()
10:     for each image, target_orientation in image_list, target_orientation_list do
11:         image_id = save_image(image, target_position, target_orientation)
12:         prediction_list = predict_objects_in_image(YOLOv2, image)
13:         if HUMAN ≠ None then
14:             // The HUMAN neural net is also used to predict the objects in each image, and these predictions are combined with the ones from YOLOv2 (explained in Part III)
15:             HUMAN_prediction_list = predict_objects_in_image(HUMAN, image)
16:             prediction_list.append(HUMAN_prediction_list)
17:         end if
18:         new_images_and_predictions[image_id] = prediction_list
19:     end for
20:     Return new_images_and_predictions
21: end procedure
By the end of this part of the learning algorithm, the robot ends up with a set of images and the corre-
sponding predictions of the objects in those images.
4.1.2 Part II - Interacting with Humans
To confirm whether the previous predictions are correct, the robot now asks the humans around it for help (Part II). To interact with a human, the robot uses its tablet as an interface, described in Sub-section 4.1.2.
When interacting with humans, the robot asks questions for:
• Labeling the objects from predictions (Sub-section 4.1.2)
• Identifying to whom the objects belong (Sub-section 4.1.2)
Algorithm 3 presents the pseudocode for interacting with humans. The functions starting with an uppercase letter ("LabelObjects" and "ToWhomBelongs") are described in this Section.
Labeling the Objects from Predictions
As previously stated, the robot starts by capturing the images and generating predictions for what and
where the objects are in the image. To correctly label these objects, the robot will first confirm the
Algorithm 3 Learning Algorithm - Part II - Interacting with Humans
 1: Input: new_images_and_predictions // images captured by the robot and respective predictions;
 2: Output: new_labeled_images // images labeled with the human input;
 3:
 4: procedure INTERACTWITHHUMANS(new_images_and_predictions)
 5:     new_labeled_images = start_data_structure()
 6:     for each image_id, prediction_list in load_all(new_images_and_predictions) do
 7:         image = load_image(image_id)
 8:         wait_for_person_being_detected() // different images can be labeled by different people
 9:         // ask the person to label objects from predictions
10:         verified_prediction_list = LabelObjects(image, prediction_list)
11:         // ask the person to whom an object belongs
12:         verified_prediction_list = ToWhomBelongs(image, verified_prediction_list)
13:         new_labeled_images[image_id] = verified_prediction_list
14:     end for
15:     Return new_labeled_images
16: end procedure
previous predictions by asking a human present in the scenario for help. To confirm its predictions, the robot asks two Yes/No questions. For example, given a prediction of the object "cat", the robot asks:
• q1: Is this a cat, or part of it, in the rectangle?
• (if Yes to q1) - q2a: Is the rectangle properly adjusted to the cat?
• (if No to q1) - q2b: Ok, this is not a cat. Is the rectangle properly adjusted to an object?
The robot instructs the human by explaining that the properly adjusted rectangle refers to a rectangle
specifying the extent of the object visible in the image. It also explains that objects seen in a mirror or a
TV monitor should also be labeled.
These first two questions (q1 and q2) yield 4 possible outcomes (see Figure 4.1). If the human answers No to q2a, the robot asks for an adjustment of the bounding box. If the human answers Yes to q2b, the robot asks for a relabeling. Algorithm 4 presents this pseudocode.
After confirming and correcting the predictions, the robot then asks:
• q3: Are all the objects in this image labeled?
• (if Yes to q3): The image is ready for Part III - Training.
• (if No to q3): The robot asks the human to adjust the rectangle to an object and label it. Then it repeats the question (q3).
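The decision implied by q1 and q2 can be sketched as a small function mapping the two Yes/No answers to one of the four outcomes of Figure 4.1 (the function name and the outcome strings are our own illustration):

```python
def decide_action(answer_q1: str, answer_q2: str) -> str:
    """Map the two Yes/No answers to one of the four possible outcomes."""
    if answer_q1 == "Yes":
        # q2a: Is the rectangle properly adjusted to the <class>?
        return "keep_prediction" if answer_q2 == "Yes" else "adjust_bounding_box"
    # q1 was No -> q2b: Is the rectangle properly adjusted to an object?
    return "relabel_object" if answer_q2 == "Yes" else "discard_prediction"
```

For example, `decide_action("Yes", "No")` corresponds to asking the human to relocate the bounding box.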
When labeling, a list of the previously labeled object classes is shown. If the person does not find the object's class in this list, a new class can be added from a predefined dictionary of classes. This
Figure 4.1: Example outcomes from q1 and q2. Cat image credit to Flickr user William McCamment.
dictionary constrains the lexical redundancy within a language; for example, "waste bin" and "trash can" are both categorized under the single class "wastecontainer".
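Such a dictionary can be sketched as a simple synonym-to-canonical-class mapping (the entries and helper below are illustrative; the thesis draws the canonical classes from the OpenImages list):

```python
# Illustrative synonym dictionary: several everyday names map to one class.
CLASS_DICTIONARY = {
    "waste bin": "wastecontainer",
    "trash can": "wastecontainer",
    "garbage can": "wastecontainer",
    "tv": "tvmonitor",
    "television": "tvmonitor",
}


def canonical_class(user_label: str) -> str:
    """Return the canonical class for a user-typed label (identity if unknown)."""
    return CLASS_DICTIONARY.get(user_label.lower(), user_label.lower())
```

This way, "Trash Can" and "waste bin" both resolve to "wastecontainer" before training.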
When adding a new class, the robot shows the person a zoomable diagram (Figure 4.2), organized by categories from the OpenImages dataset [55]. This diagram includes 600 classes in an organized structure. The robot only accepts new classes from this list.
Algorithm 4 presents the pseudocode for labeling the objects in the images collected by the robot in Part I. As previously explained in this Section, the robot first asks a human for help and then verifies each of the predictions generated by its object detectors.
Figure 4.2: Some classes that can be added to the robot, using the labels from the OpenImages dataset [55].
To Whom an Object Belongs
We tagged the OpenImages classes as personal or not personal (e.g. classes like "mobile phone" and "backpack" are tagged as personal, and others like "oven" and "table" as not personal). This tagging is used to ask the second type of question.
After labeling all the objects in the image, the robot asks to whom each object belongs (for objects tagged with a personal class). Algorithm 5 presents this pseudocode. This way, the robot also stores information that will be useful when searching for an object belonging to a specific person. The robot instructs the human to answer "communal" when a personal object is shared.
Human-Robot Interface
To interact with the human, the robot uses its tablet as an interface and reads each question out loud. It would also have been possible to use voice recognition to capture part of the input from the user (for example, the Yes or No answers). However, voice commands could potentially generate more errors and would make it harder for us to collect all the needed input. For example, to show an object in the image to the robot, the human can simply drag a "rectangle" on the tablet (see Figure 4.3). Figure 4.4 shows examples of MBot and Pepper interacting with a human.
Algorithm 4 Learning Algorithm - Part II - Interacting with Humans - Labeling Objects from Predictions
 1: Input: image // image captured by the robot;
 2:        prediction_list // list of predictions corresponding to that input image;
 3: Output: verified_prediction_list // human-verified list of objects corresponding to that input image;
 4:
 5: procedure LABELOBJECTS(image, prediction_list)
 6:     verified_prediction_list = [ ]
 7:     for each prediction in prediction_list do
 8:         class_name, confidence, bounding_box = get_prediction_info(prediction)
 9:         // e.g.: Is this a cat, or part of it, in the rectangle?
10:         q1_question = "Is this a/an " + class_name + ", or part of it, in the rectangle?"
11:         answer_to_q1 = ask_human_yes_or_no(q1_question, image, bounding_box)
12:         if answer_to_q1 is Yes then
13:             // e.g.: Is the rectangle properly adjusted to the cat?
14:             q2a_question = "Is the rectangle properly adjusted to the " + class_name + "?"
15:             answer_to_q2a = ask_human_yes_or_no(q2a_question, image, bounding_box)
16:             if answer_to_q2a is Yes then
17:                 verified_prediction_list.append(prediction)
18:             else if answer_to_q2a is No then
19:                 adjusted_prediction = ask_human_to_relocate_bounding_box(class_name, image, bounding_box)
20:                 verified_prediction_list.append(adjusted_prediction)
21:             end if
22:         else if answer_to_q1 is No then
23:             // e.g.: Ok, this is not a cat. Is the rectangle properly adjusted to an object?
24:             q2b_question = "Ok, this is not a " + class_name + ". Is the rectangle properly adjusted to an object?"
25:             answer_to_q2b = ask_human_yes_or_no(q2b_question, image, bounding_box)
26:             if answer_to_q2b is Yes then
27:                 adjusted_prediction = ask_human_to_relabel_object(image, bounding_box)
28:                 verified_prediction_list.append(adjusted_prediction)
29:             else if answer_to_q2b is No then
30:                 continue
31:             end if
32:         end if
33:     end for
34:     are_all_objects_labeled = False
35:     do
36:         q3_question = "Are all the objects in this image labeled?"
37:         answer_to_q3 = ask_human_yes_or_no(q3_question, image, verified_prediction_list)
38:         if answer_to_q3 is Yes then
39:             are_all_objects_labeled = True
40:         else if answer_to_q3 is No then
41:             new_object = ask_human_to_label_new_object(image, verified_prediction_list)
42:             verified_prediction_list.append(new_object)
43:         end if
44:     while are_all_objects_labeled ≠ True
45:     Return verified_prediction_list
46: end procedure
4.1.3 Part III - Training
The interaction with humans results in a set of labeled images. Every time the robot collects a multiple of 50 images, it trains a neural net (HUMAN50, HUMAN100, HUMAN150, and so on) using all the human-labeled images. From these images, 70% are used for training and 30% for testing. The value
Algorithm 5 Learning Algorithm - Part II - Interacting with Humans - To Whom an Object Belongs
 1: Input: image // image captured by the robot;
 2:        verified_prediction_list // human-verified list of objects corresponding to that input image;
 3: Output: verified_prediction_list // final list of objects corresponding to the input image;
 4:
 5: procedure TOWHOMBELONGS(image, verified_prediction_list)
 6:     for each verified_prediction in verified_prediction_list do
 7:         class_name, bounding_box = load_info(verified_prediction)
 8:         if is_personal_object_class(class_name) then
 9:             question = "Whose " + class_name + " is this?" // e.g. Whose phone is this?
10:             answer = ask_question(question, image, bounding_box)
11:             save_information(verified_prediction, answer)
12:         end if
13:     end for
14:     Return verified_prediction_list
15: end procedure
(a) Showing Where the object is. (b) Showing What the object is.
Figure 4.3: Example of the human-robot interface when labeling an object.
of 50 images corresponds, in practice, to training the neural net once per week, since in Chapter 5.1 we define that the robot goes to one target position (point in the map) per day and captures a total of 8 images (at different orientations).
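The 70/30 split performed at each training stage can be sketched as a shuffled partition of the labeled image ids (the exact split procedure is not detailed in the thesis; this is one reasonable implementation):

```python
import random


def split_train_test(image_ids, test_fraction=0.3, seed=0):
    """Shuffle the labeled images and split them into train and test sets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]   # (train, test)


# With 50 collected images, this yields 35 training and 15 test images.
train, test = split_train_test(range(50))
```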
The HUMAN neural net uses the same structure and algorithm as YOLOv2, but is re-trained using only the images labeled by humans. As previously described in Chapter 2, YOLOv2 trains its detector on top of convolutional weights pre-trained on ImageNet.
We train the HUMAN neural net using only the images labeled by humans because YOLOv2 uses the full images when training (see Chapter 2, Sub-section 2.2.3). If we wanted to add a new class, for example "lamp", to YOLOv2 trained on the COCO dataset, we would need to label all the lamps in thousands of images of that dataset. Otherwise, when training with the thousands of images from the COCO dataset, the neural net would always treat all the
Figure 4.4: Mbot (top) and Pepper (bottom) learning.
"lamps" in those pictures as false detections, and therefore it would not be able to learn to recognize this object.
After training, the HUMAN neural net is also used in Part I - Capturing Images and Predicting the Objects. Using the predictions from both neural nets, the robot is able to predict more of the objects present in an image. Note that it is also possible for one object to be detected twice (e.g. first by YOLOv2 and then by HUMAN). In this case we may end up with a duplicate label for the same object. This, however, is not a problem, as we explain in the next Section.
Filtering Labels for Training
Before training the neural net, we remove repeated labels over the same area of the image.
For example, looking at Figure 4.5, if a person answers Yes to q1 (correct label) and No to q2a (bounding box poorly adjusted) for two different predictions pertaining to the same object, then we get two repeated labels. To avoid this problem, if an IOU (Intersection Over Union) larger than 50% is detected between two labels of the same class in an image (e.g. the class "bed" in Figure 4.5), we say we have a repeated label. In this case, we only use one of them (the one with the larger area) for training. Algorithm 6 presents the pseudocode for removing the repeated labels and then training the detector.
The IOU is a measure of how closely two bounding boxes match. To calculate the IOU, we divide the intersection area by the union (total area) of the bounding boxes (previously explained in Chapter 2,
Figure 4.5: An example case where the robot gets repeated labels for the same object.
Sub-section 2.3.3). The PASCAL VOC competition defined that when the IOU is larger than or equal to 50%, there is a match between the ground-truth and the label [34]. This IOU threshold is currently the standard in the object detection literature, so it was also the one adopted during the development of this work.
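The IOU computation described above follows directly from its definition; a minimal sketch, with boxes given as (x_min, y_min, x_max, y_max) tuples (a convention of ours, not the thesis code):

```python
def calculate_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For instance, two identical boxes give an IOU of 1.0, while two 10×10 boxes shifted by half their width overlap on 50 of 150 total units, giving an IOU of 1/3, below the 50% matching threshold.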
Algorithm 6 Learning Algorithm - Part III - Training
 1: Input: all_labeled_images // all images coupled with corresponding object labels;
 2: Output: HUMAN // resulting HUMAN neural net;
 3:
 4: procedure TRAINING(all_labeled_images)
 5:     // filter multiple objects with the same class label
 6:     for each image_id, verified_prediction_list in load_info(all_labeled_images) do
 7:         already_tested_combinations = [ ]
 8:         found_repeated_label = True
 9:         while found_repeated_label do
10:             found_repeated_label = False
11:             for each label_1, label_2 in list(combinations(verified_prediction_list, 2)) do
12:                 if (label_1, label_2) not in already_tested_combinations then
13:                     already_tested_combinations.append((label_1, label_2))
14:                     class_name_1, bbox_1 = load_label_info(label_1)
15:                     class_name_2, bbox_2 = load_label_info(label_2)
16:                     if class_name_1 == class_name_2 and calculate_IOU(bbox_1, bbox_2) ≥ 0.5 then
17:                         // repeated label
18:                         if get_area(bbox_1) < get_area(bbox_2) then
19:                             verified_prediction_list.remove(label_1)
20:                         else
21:                             verified_prediction_list.remove(label_2)
22:                         end if
23:                         found_repeated_label = True
24:                         break
25:                     end if
26:                 end if
27:             end for
28:         end while
29:         all_labeled_images[image_id] = verified_prediction_list
30:     end for
31:     test_percentage = 0.3 // 30% of the images are used for testing, as previously stated;
32:     train_set = set_up_train((1 − test_percentage), all_labeled_images)
33:     test_set = set_up_test(test_percentage, all_labeled_images, train_set)
34:     HUMAN = train_neural_net(train_set, test_set)
35:     Return HUMAN
36: end procedure
4.2 YOLOv2 + HUMAN
In this Section, we explain the challenges of using the two neural nets in isolation (Section 4.2.1) and how the combined system deals with duplicate detections (Section 4.2.2).
4.2.1 Problem and Solution
The problem of using only the HUMAN neural network is that we would lose the current capabilities of YOLOv2 (trained with thousands of pictures in a global effort). For example, suppose that, after training the HUMAN neural network, a new class is introduced to the scenario where the robot operates. If this class was present in YOLOv2's training set, YOLOv2 would most likely be able to detect it, but the HUMAN neural net would not. To solve this, we propose using a combination of the two neural networks – YOLOv2 + HUMAN. In a real human scenario, robots will need to interact with an unpredictable set of classes. When combined, the two neural nets cooperate, resulting in more robust object detection. We can visualize an example of this cooperation in Figure 4.6.
Figure 4.6: Example of YOLOv2+HUMAN using predictions from both the neural nets.
This way, the robot is able to:
• detect the same objects as the default YOLOv2 (trained with thousands of images in a global effort);
• detect new objects from the scenario where the robot is inserted, using the HUMAN neural net (trained with a small number of images of the same objects to be detected).
As explained in Chapter 4, the HUMAN neural net is trained with the images collected by the robot and labeled by the humans that interacted with it. The robot asks the humans to label all the objects in the captured images so it can always learn new classes. Although the HUMAN neural net is trained with a small number of images, those images contain the same objects that will be detected later on in that same scenario. This is why the neural net is able to learn these objects even with a small number of training samples (validated in Chapter 5).
4.2.2 Removing Duplicate Detections
The HUMAN neural net learns to detect object classes from the images labeled by the humans interacting with the robot. Since the humans decide which classes to teach the robot, there may be cases where the HUMAN neural net learns to detect a class that YOLOv2 was already able to detect. As a consequence, running the two neural nets in parallel may lead to repeated detections of the same object.
When the two neural nets detect the same object (with the same class label) in the same area of the picture (IOU ≥ 50%), we use the prediction with the higher confidence (see the example in Figure 4.7). By default, the YOLOv2 algorithm assigns a confidence from 0.0 to 1.0 to each prediction [9].
Figure 4.7: Duplicate detection example. YOLOv2+HUMAN uses the prediction with higher confidence.
In fact, this is a form of non-maximum suppression, but instead of removing duplicate detections from a single neural net, it removes duplicates from the predictions of two neural nets. The pseudocode can be found in Algorithm 7.
In Chapter 5 we validate this method by evaluating whether using the two neural nets in parallel with this technique leads to better results than using them separately.
Algorithm 7 YOLOv2 + HUMAN - Remove Duplicates
 1: Input: YOLOv2_predictions // list of predictions from the YOLOv2 neural net;
 2:        HUMAN_predictions // list of predictions from the HUMAN neural net;
 3: Output: merged_predictions // list of predictions merging the input from the two neural nets;
 4:
 5: procedure REMOVEDUPLICATES(YOLOv2_predictions, HUMAN_predictions)
 6:     tested_combinations = [ ]
 7:     collision_found = True
 8:     while collision_found do
 9:         collision_found = False
10:         // apply cartesian product to get combinations between the two lists
11:         combinations_list = list(product(YOLOv2_predictions, HUMAN_predictions))
12:         for each (pred_YOLOv2, pred_HUMAN) in combinations_list do
13:             if (pred_YOLOv2, pred_HUMAN) not in tested_combinations then
14:                 tested_combinations.append((pred_YOLOv2, pred_HUMAN))
15:                 class_Y, confidence_Y, bbox_Y = load_prediction_info(pred_YOLOv2)
16:                 class_H, confidence_H, bbox_H = load_prediction_info(pred_HUMAN)
17:                 if class_Y == class_H and calculate_IOU(bbox_Y, bbox_H) ≥ 0.5 then
18:                     if confidence_Y > confidence_H then
19:                         HUMAN_predictions.remove(pred_HUMAN)
20:                     else
21:                         YOLOv2_predictions.remove(pred_YOLOv2)
22:                     end if
23:                     collision_found = True
24:                     break
25:                 end if
26:             end if
27:         end for
28:     end while
29:     merged_predictions = YOLOv2_predictions + HUMAN_predictions
30:     Return merged_predictions
31: end procedure
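The merging step above can be sketched in Python as follows. Predictions are represented here as (class_name, confidence, bbox) tuples, and a single pass over the cartesian product replaces Algorithm 7's restart-and-skip loop, since each pair is considered exactly once either way (the helper names are ours, not the thesis code):

```python
from itertools import product


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def remove_duplicates(yolo_preds, human_preds):
    """Merge two prediction lists, keeping the higher-confidence duplicate.

    Each prediction is a (class_name, confidence, bbox) tuple."""
    yolo, human = list(yolo_preds), list(human_preds)
    for p_y, p_h in product(list(yolo), list(human)):
        if p_y in yolo and p_h in human:  # skip pairs already removed
            if p_y[0] == p_h[0] and iou(p_y[2], p_h[2]) >= 0.5:
                if p_y[1] > p_h[1]:
                    human.remove(p_h)
                else:
                    yolo.remove(p_y)
    return yolo + human
```

For example, a "chair" detected by both nets on nearly the same box survives only once, with the higher-confidence prediction kept.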
Chapter 5
Results
In this Chapter, we evaluate the results of the experiments conducted during the development of this work. Firstly, we analyze the results obtained in the domestic scenario experiment using the learning algorithm (Section 5.2); then, we analyze the results from additional experiments (Section 5.3).
5.1 Experimental Setup
In this Section, we report the experimental setup. Firstly, we describe the domestic environment experiment (Section 5.1.1); then, the experiment on searching for a specific person's object (Section 5.1.2).
5.1.1 Domestic Environment
Using our current computational resources, running the two neural nets as separate processes requires an external computer with a GPU. We ran the computations and trained the neural nets on this external computer1. Training a neural net took 6 hours on average, for each of the 50, 100, and 150 images collected using the learning algorithm (Chapter 4). You can find more details about the neural net in Chapter 2. While YOLOv2 is able to recognize objects in real time, our bottleneck was the robot's WiFi connection: we were able to detect objects at 5 frames per second.
We used a realistic domestic scenario – the ISRoboNet@Home Testbed2 (Figure 5.1) – composed of one bedroom, one living room, one dining room, and a kitchen. The experiment was conducted over the course of 20 days, taking a total of 6 hours. Six research participants acted as the human input for the robot,
1 GPU: NVIDIA GeForce GTX 1080 Ti; CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
2 More info can be found at http://isrobonet_athome.isr.tecnico.ulisboa.pt/
Figure 5.1: ISRoboNet@Home Testbed – Domestic scenario where we ran the experiments.
answering questions about objects, in order to evaluate whether the robot could train the HUMAN neural net when confined to a small-scale environment. We also tested the applicability of running the two neural nets in parallel – YOLOv2 + HUMAN – and evaluated the correctness of the robot's predictions.
Established Points of Reference
The robot should be able to recognize the surrounding objects independently of its current pose (location + orientation) relative to a fixed world frame. There are infinitely many poses within a confined environment, but the robot cannot feasibly ask an infinite number of questions to fulfill its objective. To solve this, it starts by defining a set of sparse points: when the distance between the robot and every existing reference point is greater than a fixed distance, it creates a new reference point. We can visualize this as a robot with an imaginary circle around it that creates a new reference point whenever there are no other points inside the circle (Figure 5.2).
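This rule can be sketched as: add the robot's current position as a new reference point whenever no existing point lies within a fixed radius of it (the radius value below is illustrative; the thesis does not state the exact distance):

```python
import math


def update_reference_points(robot_xy, points, radius=1.5):
    """Add the robot's position as a new reference point if no existing
    point lies within `radius` metres of it."""
    if all(math.dist(robot_xy, p) > radius for p in points):
        points.append(robot_xy)
    return points


# As the robot moves, only sufficiently distant poses become reference points.
points = []
for pose in [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0)]:
    update_reference_points(pose, points)
```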
Using the Points of Reference
The robot should also be able to recognize the objects under different lighting conditions (corresponding to various hours of the day) and from different positions, angles, and perspectives. We defined that each day the robot navigates to one reference point, at a random time, and captures a total of 8 images at different orientations relative to a fixed world frame. This way it captures different exposures, viewpoints, and the inherent unpredictability of everyday objects being misplaced.
Figure 5.2: Points of reference comprising the surrounding area.
5.1.2 Looking for a Specific Person's Object
When the robot asks a human a question about an object associated with a certain Point of Reference, it gathers the spatial information of where that object was detected.
If the robot registers, for example, a "backpack" as seen 1 time in the living room and 10 times in the bedroom, then, when asked to search for that object, it searches the rooms in decreasing order of the number of occurrences of that object. This somewhat emulates the thought process of humans when they are looking for a misplaced object: they look in the most common places where they see or use said object.
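The resulting search order can be sketched with a counter of past sightings per room (the room names and counts are the example from the text; the helper is our own illustration):

```python
from collections import Counter

# Where "backpack" was previously detected, per room
sightings = Counter({"bedroom": 10, "living room": 1})


def search_order(counts: Counter):
    """Rooms sorted by decreasing number of past detections of the object."""
    return [room for room, _ in counts.most_common()]

print(search_order(sightings))  # ['bedroom', 'living room']
```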
Furthermore, when a user asks where his or her object is, the robot starts by detecting the objects using the neural nets (YOLOv2 + HUMAN150). When the searched personal object class is detected, e.g. "backpack", the robot compares the part of the image inside the bounding box with all the previous instances of backpacks it has recorded, and associates it with the one sharing the largest number of features (using OpenCV's SIFT implementation3).
5.2 Domestic Scenario Results
In this Section, we evaluate and compare YOLOv2 with the HUMAN neural network at its different stages (after collecting 50, 100, and 150 images). As previously explained, the HUMAN neural
3The code can be found at https://opencv.org/
net was trained using the introduced learning algorithm (Chapter 4) during the previously described
experiment (Section 5.1).
The goal of this Section is to show that the HUMAN neural net performs comparably to YOLOv2 when restricted to a set of objects, and that when the two are used in parallel – YOLOv2 + HUMAN – we can further increase this performance.
5.2.1 Correctness of Predictions
As previously explained, both YOLOv2 and the HUMAN neural net were used to predict the objects when the robot ran the learning algorithm (Chapter 4) during the 20-day experiment. These predictions were then shown to a human, who answered two Yes/No questions. The answers to these two questions determine whether the object in each prediction was correctly labeled and whether the bounding box was well adjusted to that object (as previously illustrated in Figure 4.1, in Chapter 4).
During the 20-day experiment, the robot trained 3 stages of the HUMAN neural net: HUMAN50, HUMAN100, and HUMAN150. The robot took approximately 50 images per week. In the first week (fewer than 50 images), the robot used only YOLOv2 to generate the predictions. In the second week (from 50 to 100 images), the robot used YOLOv2 and HUMAN50. Finally, during the last week (from 100 to 150 images), it used YOLOv2 and HUMAN100.
Tables 5.1, 5.2, and 5.3 show the results of the Yes/No questions. For both neural networks, over 50% of all predictions were true positives (Label ✓ and Bounding Box ✓), while fewer than 15% were fully incorrect (Label ✗ and Bounding Box ✗). The remaining predictions ended up with the human relocating the bounding box or relabeling the object, as previously described in Chapter 4, which allowed the participants to quickly fix them. We also observed that, although the HUMAN neural net generated a smaller number of predictions, those predictions performed comparably to YOLOv2's, even when the net was trained with only 50 images.
Table 5.1: Predictions - Week 1 - Images 0 to 50.

YOLOv2 - total number of predictions: 190

(values in %)   Bounding Box ✓   Bounding Box ✗
Label ✓              60.0             13.7
Label ✗              14.2             12.1
Table 5.2: Predictions - Week 2 - Images 50 to 100.

YOLOv2 - total number of predictions: 162

(values in %)   Bounding Box ✓   Bounding Box ✗
Label ✓              61.7             16.7
Label ✗               9.3             12.3

HUMAN50 - total number of predictions: 122

(values in %)   Bounding Box ✓   Bounding Box ✗
Label ✓              54.1             33.6
Label ✗               4.1              8.2
Table 5.3: Predictions - Week 3 - Images 100 to 150.

YOLOv2 - total number of predictions: 210

(values in %)   Bounding Box ✓   Bounding Box ✗
Label ✓              58.0             14.3
Label ✗              16.7             11.0

HUMAN100 - total number of predictions: 107

(values in %)   Bounding Box ✓   Bounding Box ✗
Label ✓              52.3             24.3
Label ✗              13.1             10.3
During the experiment, we identified two primary categories of human labeling error: (a) unlabeled objects, which happens most frequently when objects are difficult to label (e.g. small objects) or when people could not find the object's class in the list (e.g. "fire extinguisher" was not included in the OpenImages labels); (b) wrong labels (e.g. the human clicks Yes when they should have clicked No).
Despite these human errors, we observed (as explained in the next Section) that the mean Average Precision (Table 5.4) and the percentage of correct predictions (Appendix B) increased over the three stages of the HUMAN neural net. We believe that, since human mistakes were infrequent, they did not significantly affect the training of the neural networks.
5.2.2 Evaluation of the Neural Networks
To evaluate the proposed system, we compared the neural nets using an external ground-truth composed of 100 pictures. These were captured by the robot in different places of the experiment scenario and on different days, without the constraint of being at a Point of Reference (the locations from which the robot learned, as explained in Chapter 5.1). They also included different lighting conditions and
(a) Night time (with the robot not moving) (b) Day time (with the robot moving)
Figure 5.3: Example images captured by the robot for the ground-truth.
10 blurred pictures (due to the robot moving). Figure 5.3 shows some of these pictures, and Figure 5.4 provides more detail about the number of instances of each object class captured in the ground-truth. Note that, since the pictures were captured "naturally" around the scenario, some classes are more prevalent than others. For example, the class "chair" is the most frequent one, since there were a total of 6 chairs spread around the experiment scenario. On the other hand, classes like "shirt" and "bicyclehelmet" were only present in the scenario a couple of times.
Given this ground-truth, we can compare the results in Table 5.4. This table shows the resulting Average Precision (AP) for each object class (one per row) and for each neural net (one per column). The last two rows of the table show the total number of classes detected by each neural net and the corresponding mean Average Precision (mAP).
Note that in Table 5.4 some classes are not present for some neural nets (marked with a hyphen "-"). For example, the class "doll" only appeared a couple of times in the pictures during the last week of the experiment, so only the last stage of the HUMAN neural net (HUMAN150) could possibly detect it. If a class is not present in the training set, it cannot be detected, hence the "-". The last column shows the results when using the two neural nets in parallel – YOLOv2 + HUMAN150.
Appendix A contains more information about the training sets of the HUMAN neural net at its
different stages. Its plots (Figures A.1, A.2 and A.3) show that the class “bicyclehelmet”, for
example, was the least frequent class in the training set (it appeared only once). So, unless that picture of
the “bicyclehelmet” was very similar to the one in the ground-truth, it would be very hard to detect (in
practice, it was not detected: bicyclehelmet AP = 0).
In Table 5.4 we can see that YOLOv2 could only possibly detect a total of 20 of the classes present
in the ground-truth, while HUMAN50 could detect 36, HUMAN100 40, and HUMAN150 41.
Figure 5.4: Information about the ground-truth.
Table 5.4: Average Precision using an external ground-truth composed of 100 images in different poses. Values are in percentage (%); the largest value in each row is marked with * (shown in bold in the original).

values in %          YOLOv2    H50     H100    H150    YOLOv2+H150
apple                 25.0*      0     25.0*    12.5     25.0*
backpack               6.3     39.9    79.1*    53.1     57.6
banana                20.0       0     20.0     20.0     40.0*
bed                   66.7     16.7    48.8     41.7     89.6*
bicyclehelmet           -        0       0        0        0
book                  13.2     23.7    24.0     26.2     28.8*
bookcase                -      22.2    30.6     33.3*    33.3*
bottle                31.3       -      6.7      6.7     39.6*
bowl                  48.1       0     15.8     28.6     54.7*
cabinetry               -      36.4    44.8     55.2*    55.2*
candle                  -        -       0      16.7*    16.7*
chair                 55.5     27.0    39.0     45.0     56.4*
coffeetable             -      24.8    35.1     57.1*    57.1*
countertop              -      46.5    76.2     85.7*    85.7*
cup                   25.5      1.0    32.8     32.3     43.1*
diningtable           40.6     35.2    51.1     46.6     52.1*
doll                    -        -       -        0        0
door                   0       58.1    70.6*    61.8     61.8
doorhandle              -        0       0      16.3*    16.3*
envelope                -        -       0       3.9*     3.9*
glasses                0       56.9*   31.3     53.1     53.1
heater                  -      28.6    40.8*    35.7     35.7
lamp                    -        0       0      18.3*    18.3*
nightstand              -      62.5*   50.0     50.0     50.0
orange                20.0       0     20.0     40.0*    40.0*
person                50.0*      0     37.5     25.0     47.5
pictureframe            -      35.8*   35.0     34.6     34.6
pillow                  -      25.9    22.9     36.2*    36.2*
pottedplant           70.7*    41.9    45.9     66.7     63.6
remote                65.1*      0       0        0      65.1*
shelf                   -        0       0      41.7*    41.7*
shirt                   -        0     50.0*    50.0*    50.0*
sink                  15.8       -      6.7     13.3     20.8*
sofa                  88.0      7.7    38.5     61.3     92.3*
tap                     -        0      5.6     36.9*    36.9*
tincan                  -       9.5    34.2*    26.6     26.6
tvmonitor             38.2     36.4    55.6     74.0*    67.8
vase                  20.8     20.0     6.7     11.1     24.0*
wardrobe                -      12.5      0      50.0*    50.0*
wastecontainer          -      90.9    98.6     99.2*    99.2*
windowblind             -      35.3    35.3     47.1*    47.1*
# total of classes    20       36      40       41       41
mAP                   35.0     22.1    30.3     36.9     44.3*
HUMAN50 already included 36 of the 41 classes that were labeled (88%). Given that the objects
present in the scenario remained the same throughout the experiment, by using the Points
of Reference at different orientations (previously explained in Section 5.1.1) the robot efficiently captured
most of the objects in the first 50 pictures.
When using the two neural nets in parallel (YOLOv2 + HUMAN), we obtain the union of their detections. When
a class is not embedded in YOLOv2, YOLOv2 + HUMAN outputs the same as the HUMAN neural net alone.
For example, “countertop”, which is not present in YOLOv2, can only be detected by the HUMAN neural
net (leading to an identical Average Precision score for YOLOv2 + HUMAN150 and HUMAN150).
The same applies the other way around.
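The parallel combination amounts to a union of the two detection lists. A minimal sketch (the detection tuple format and the dummy detector are illustrative assumptions, not the thesis implementation; duplicate boxes of the same class are later resolved by the non-maximum suppression step of Chapter 4.2):

```python
class DummyNet:
    """Stand-in for a detector; returns fixed (class, confidence, bbox) tuples."""
    def __init__(self, detections):
        self.detections = detections

    def detect(self, image):
        return list(self.detections)

def detect_parallel(image, yolo, human):
    """Run both nets on the same image and take the union of their detections."""
    return yolo.detect(image) + human.detect(image)

# YOLOv2 knows "sofa"; only the HUMAN net knows "countertop".
yolo = DummyNet([("sofa", 0.88, (10, 10, 200, 120))])
human = DummyNet([("countertop", 0.75, (0, 130, 320, 240))])
print(detect_parallel(None, yolo, human))
```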
Another possible scenario is when a class is present in both neural nets. In this case, we also
need to filter duplicate detections of the same object class in a picture, if they exist, via non-maximum
suppression (when two bounding boxes of the same class conflict, we keep the one with the higher
confidence score, as previously explained in Chapter 4.2). We can evaluate the cooperation between
the neural nets by taking a closer look at Table 5.4. Out of the 41 class scores, there were only two
classes where YOLOv2 had a higher score than YOLOv2 + HUMAN150 (“person” and “pottedplant”)
and only one class where HUMAN150 had a higher score than YOLOv2 + HUMAN150 (“tvmonitor”).
In these cases, the difference in score was always smaller than 8%. In the other 38 classes, using
the neural nets in parallel led to a score higher than or equal to the maximum of the other two columns
(YOLOv2 and HUMAN150 used in isolation), with improvements of up to 23% (see the class “bed”,
for example). We also marked the highest score for each class: out of the 41 classes, 31 of them (75%)
scored highest when using the two neural nets in parallel.
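The duplicate filtering described above can be sketched as a greedy, confidence-ordered non-maximum suppression (the IoU threshold of 0.5 is an assumed value, not taken from the thesis):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the higher-confidence box when two boxes of the same class overlap.
    detections: list of (class_name, confidence, (x1, y1, x2, y2))."""
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep this box only if it does not heavily overlap an already-kept
        # box of the same class.
        if all(det[0] != k[0] or iou(det[2], k[2]) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```

When both nets report the same sofa, only the higher-confidence box survives, while boxes of other classes are untouched.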
Appendix B contains information about all the predictions generated by these neural nets on the
ground-truth pictures. In Figure B.1, for instance, we can see that YOLOv2 also detected objects
that were not present in the ground-truth: although there was not a single “refrigerator” in
the experiment scenario, YOLOv2 detected a total of 39 instances of refrigerators in those ground-truth
pictures. All of these were therefore false predictions.
Finally, we compared the most relevant score: the mAP (mean Average Precision). In this experiment,
we verified the improvement of the HUMAN neural net through its increasing mAP, from 22.1% to 30.3%
and finally 36.9%, an average increase of 7.5% between stages. In practice, this is what we would
expect: the more pictures the robot has to train with, the better the results should be.
In the table, we can also note that for some classes (e.g., “backpack”) the scores obtained with the
HUMAN neural nets surpassed those of YOLOv2. In fact, looking at the final mAP score of HUMAN150,
we can see that this neural net already achieved a score higher than the default YOLOv2 while being able to
detect twice the number of classes. Even more impressive was the final score obtained with the neural
nets in parallel, a total of 44.3%, almost 10% higher than the YOLOv2 score.
Although the HUMAN150 score already surpassed that of YOLOv2, there is still a big advantage in
using the two neural networks in parallel (YOLOv2 + HUMAN150). For example, imagine we introduce
a new object class to the scenario: a “cat”. Since this class was never present in the HUMAN training
data, that neural net will not be able to detect it. YOLOv2, however, trained with thousands of images in
a global effort, will most likely detect the “cat”. Since YOLOv2 + HUMAN is a union of the detections, it will
also be able to detect that “cat”, with the same confidence as YOLOv2 used in isolation.
5.2.3 Discussion
To summarize, we started by verifying that both YOLOv2 and HUMAN are able to predict the locations
(bounding boxes) and class labels of the objects in the images. During the 20-day experiment, both
neural networks were used to detect objects in the images collected by the robot, and more than
50% of these predictions (for both neural networks) corresponded to true positives (Label ✓ and
Bounding Box ✓), as described in Subsection 5.2.1. Then, in Subsection 5.2.2, we verified the con-
tinuous improvement in accuracy of the robot’s object detection skills. The HUMAN neural network
(trained using only the input from humans in close proximity) increased its mean Average Precision
(mAP) score from 22.1% with 50 training images, to 30.3% with 100 images, and finally to 36.9% with
150 images. The last stage of the HUMAN neural net obtained a mAP score 1.9% higher than YOLOv2,
which scored 35.0%. Finally, we verified that by using the two neural networks in parallel (YOLOv2 + HU-
MAN) we achieved a score of 44.3% (almost 10% higher than the scores of either neural network used
in isolation). Also note that YOLOv2 + HUMAN was able to detect twice the number of object classes
when compared to YOLOv2.
5.3 Further Experiments
In this Section, we present two further experiments, conducted once we realized the wide range of
applications that can be derived from the data collected by the robot when using the introduced
learning algorithm. The purpose of this Section is to show use cases for this work. We view these
experiments as good starting points for possible future work.
5.3.1 Looking for a Specific Person's Object
In Figure 5.5, we can see the results of our next experiment, where we simulated the misplacement of
a backpack and remotely requested the robot to search for it (using the Telegram bot API4). As the figure
suggests, the robot searched the surrounding environment and, after approximately 2 minutes, had a
positive match on the subject’s backpack. Notably, there were two more backpacks present in the
scene, and Pepper was able to identify the desired one.
The goal of this simple experiment was to present a possible use case built on the data collected
by the robot. Providing the robot with the ability to look for a specific person's object would be of
great value, particularly for helping the elderly and disabled (e.g., a robot helping a visually impaired
person look for objects).
4 More info can be found at https://core.telegram.org/
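One way such a request could be answered (a purely illustrative sketch; the record structure and function are our own assumptions, not the implementation behind Figure 5.5) is to store, for each human-taught object, who taught it, and match incoming requests against that record:

```python
# Hypothetical records of human-taught objects: which person labeled each
# instance, and where the robot last saw it.
taught_objects = [
    {"class": "backpack", "owner": "Joao", "location": "living room"},
    {"class": "backpack", "owner": "Rute", "location": "office"},
    {"class": "cup", "owner": "Joao", "location": "kitchen"},
]

def find_persons_object(object_class, owner, records):
    """Serve a request such as "find Joao's backpack": return only the
    records matching both the class and the person who taught it."""
    return [r for r in records
            if r["class"] == object_class and r["owner"] == owner]

print(find_persons_object("backpack", "Joao", taught_objects))
```

With such a record, two other backpacks in the scene would not match the query, mirroring the experiment above.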
Figure 5.5: Pepper looking for Joao’s backpack demo.
5.3.2 Sharing Knowledge Between Robots
The final experiment we conducted was to evaluate what would happen if we used the HUMAN neural
net of the MBot (located in Lisbon, Portugal) on Pepper (located in Pittsburgh, PA, USA). Some object
classes (e.g., fruits and kitchenware) are commonly found in many places in the world, so it makes
sense that one robot could benefit from the other’s knowledge. We therefore ran a simple test in which
we showed Pepper an “apple”, an “orange” and a “banana” on a “countertop”. Using the MBot’s
HUMAN150 neural net, Pepper detected two “oranges”, one “banana” and the “countertop” (see Figure
5.6).
Despite the robots being a world apart, we can already see the potential of sharing knowledge between them.
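In practice, since the HUMAN neural net is an ordinary trained model, sharing it between robots reduces to copying its model files and class list. A sketch under that assumption (a Darknet-style .names file with one class label per line; the helper names are ours):

```python
from pathlib import Path

def load_class_names(names_file):
    """Read a Darknet-style .names file: one class label per line."""
    return [line.strip() for line in Path(names_file).read_text().splitlines()
            if line.strip()]

def shared_classes(robot_a_names, robot_b_names):
    """Classes that both robots' environments have in common, i.e. knowledge
    robot B can reuse directly from robot A's model."""
    return sorted(set(robot_a_names) & set(robot_b_names))
```

Pepper could then load the MBot's copied .cfg/.weights pair together with the class list returned by load_class_names and run it unchanged.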
This simple test prompted us to think about new possibilities for research work that we plan to pursue
in the future.
Figure 5.6: Pepper detecting objects using MBot’s knowledge.
Chapter 6
Conclusion
In this Chapter, we present the conclusions of this thesis. First, we summarize its achievements (Sec-
tion 6.1); then, we briefly discuss potential future work (Section 6.2).
6.1 Summary of Thesis Achievements
We tested and evaluated the state-of-the-art object detection algorithm, YOLOv2, on two robots,
CoBot and MBot, and verified that the results fell short of expectations given the published results of
its evaluation. These tests were conducted in two different scenarios, the corridors of a university
building and a domestic scenario, and we verified that YOLOv2, when deployed on these robots, fails to
recognize many of the objects in real human environments. This thesis presents an approach to address
the object detection limitations of service robots. In particular, we bootstrap YOLOv2, a state-of-the-art
object detection neural network, with human teaching provided in close proximity. The robot
trains a neural net with the collected data and then uses the two neural networks in parallel: YOLOv2 + HUMAN,
where HUMAN is the neural net created using only the data collected by the robot after interacting with
humans. By using the two neural networks, the robot is equipped with the ability to adapt to a
new environment without losing its previous knowledge. We implemented our learning algorithm on two
different robots, Pepper and MBot, and conducted experiments to test the object detection performance
of these neural networks in a domestic scenario. We verified the continuous improvement in accuracy
of the robot’s object detection skills as it interacts more with humans, and that using the two neural
networks in parallel is advantageous. We also showed that a robot can look for a specific person's
object. Finally, we conducted a simple experiment in which Pepper, located in the USA, detected objects
using the knowledge of the MBot, located in Portugal.
6.2 Future Work
Potential future work includes further enhancing the object detection capabilities. Since the robot keeps
track of all the objects labeled by the humans, it also knows which ones it has seen less frequently. The
robot could use this information to ask for more images of a specific object; for example, it could ask
the human to show it different perspectives of that object. We believe that this new capability could
improve its detection skills, since the robot would be able to include more variability in its training set.
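Selecting which object to ask about could be as simple as ranking classes by how rarely they have been labeled. A sketch under that assumption (the function name and tie-breaking are ours, not part of the thesis):

```python
from collections import Counter

def classes_to_request(labeled_objects, k=3):
    """Pick the k least-frequently labeled classes for the robot to ask about,
    e.g. "Could you please show me a remote?"."""
    counts = Counter(labeled_objects)
    return [cls for cls, _ in sorted(counts.items(), key=lambda c: c[1])[:k]]

# Toy label history: "chair" dominates, "remote" and "doll" are rare.
labels = ["chair"] * 6 + ["cup"] * 3 + ["remote"] + ["doll"] + ["sofa"] * 2
print(classes_to_request(labels, k=2))
```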
Another potential path of future work could focus on sharing information between robots located
in different places or countries. We conducted a simple final experiment of sharing the MBot’s knowledge
with Pepper and showed that Pepper successfully detected most of the objects in an image. We can only
imagine what would happen if hundreds of labeled images were shared between multiple robots (e.g.,
between Pepper and MBot).
Appendix A
Training Set Information
In this Appendix, we provide more information about the training set in the three stages of the HUMAN
neural net (HUMAN50, HUMAN100, and HUMAN150). Note that at each stage HUMAN only
trained with 70% of the pictures; for example, out of the first 50, 35 were used for training. In
the following plots (Figures A.1, A.2 and A.3) we find the number of instances per object class in each
training set. Some of the less frequent classes, like “refrigerator” and “handbag”, were never present in the
experiment scenario; they result from human input mistakes. Other object classes were only labeled by a
single research participant (for example, the class “doll”). Another thing to take into account is that
some objects show up more frequently than others in the pictures. For example, the class “chair” was
the dominant one, since there were a total of 6 chairs distributed around the experiment scenario. It is
also important to note that this class (“chair”) was starting to become imbalanced relative to the other
classes. In practice, class imbalance usually only matters when the ratios are more like N:1, where N is
100 or 1000 or more. The ideal way to avoid it would be to collect more data; for example, the robot
could ask something like:
“Could you please show me a remote?”
The robot would then capture more pictures while the human shows it the object. This is one of the
next logical steps we are currently considering, hopefully further improving the results.
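The 70/30 split used at each stage can be sketched as follows (whether the thesis shuffles the pictures before splitting is our assumption):

```python
import random

def train_validation_split(pictures, train_fraction=0.7, seed=0):
    """Shuffle the robot's pictures and split them into training and
    validation sets."""
    shuffled = list(pictures)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Out of the first 50 pictures, 35 go to training and 15 to validation.
train, val = train_validation_split(range(50))
```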
Figure A.1: Information about HUMAN50 training data.
Figure A.2: Information about HUMAN100 training data.
Figure A.3: Information about HUMAN150 training data.
Appendix B
Resulting Predictions Information
In this Appendix, we present the neural nets’ resulting predictions for the ground-truth pictures (Figures
B.1, B.2, B.3, B.4 and B.5). In HUMAN50, 68% of the predictions were correct; in HUMAN100, 76%;
and in HUMAN150, 78%. On the other hand, only 57% of YOLOv2’s predictions were correct. In
Figure B.1 we can note that the second most detected class by YOLOv2 was “refrigerator”; since there
were no refrigerators in the scene, this resulted in a total of 39 false predictions.
Figure B.1: YOLOv2 predictions with the ground-truth pictures.
Figure B.2: HUMAN50 predictions with the ground-truth pictures.
Figure B.3: HUMAN100 predictions with the ground-truth pictures.
Figure B.4: HUMAN150 predictions for the ground-truth pictures.
Figure B.5: YOLOv2+HUMAN150 predictions for the ground-truth pictures.