Robust Object Recognition Through Symbiotic Deep Learning In Mobile Robots
João Miguel Vieira Cartucho
Thesis to obtain the Master of Science Degree in
Aerospace Engineering
Supervisors: Prof. Manuela Maria Veloso
Prof. Rodrigo Martins de Matos Ventura
Examination Committee
Chairperson: Prof. José Fernando Alves da Silva
Supervisor: Prof. Rodrigo Martins de Matos Ventura
Member of the Committee: Prof. Alexandre José Malheiro Bernardino
June 2018
Acknowledgements
I would like to thank:
• My thesis advisor: Professor Rodrigo Ventura, of the Institute for Systems and Robotics at Instituto
Superior Técnico. Professor Ventura has continuously supported and steered this research work in
the right direction while also giving me an enthusiastic encouragement. From Skype calls around
the world, to knocking on the Professor’s office door without notice, his willingness to give his time
so generously has been very much appreciated.
• My supervisor at Carnegie Mellon University: Professor Manuela Veloso, of the Machine Learning
Department, School of Computer Science. Professor Veloso is extremely knowledgeable and has
given priceless and constructive suggestions during the planning and development of this research
work. Since the beginning, I felt inspired with her unique approach of symbiotic autonomy and
self-explanatory artificial intelligence, culminating in robots that know how to address their own
limitations.
• Other researchers from ISR and CORAL Lab. Interacting and discussing ideas with other students
has been one of the best ways of getting valuable advice. Special thanks to Oscar Lima and Rute
Luz from ISR and Robin Schmucker and Kevin Zhang from CORAL Lab.
• My family and friends, for their participation in the research experiments, their advice, thesis
revisions, encouragement, and support throughout this work. I would like to offer my special thanks
to my mom, sisters, cousin Johnny Cartucho, and my friend Luís Rosmaninho.
Thank you for all the encouragement!
This work was supported by the FCT project [UID/EEA/50009/2013] and partially funded with grant
6204/BMOB/17, from CMU Portugal.
Abstract
Despite the recent success of state-of-the-art deep learning algorithms in object detection, we observed
that, when deployed as-is on a mobile service robot, they fail to recognize many objects in real human
environments. In this work, we introduce a learning algorithm in which robots address this flaw by
asking humans for help, an approach known as symbiotic autonomy. In particular, we bootstrap from
YOLOv2, a state-of-the-art deep neural network, and train a new neural network, which we call
HUMAN, using only the collected data. Using an RGB camera and an on-board tablet, the robot proactively
seeks human input to assist it in labeling surrounding objects. Pepper and CoBot, located at CMU, and
Monarch Mbot, located at ISR-Lisbon, were the service robots that we used to validate the proposed
approach. We conducted a study in a realistic domestic environment over the course of 20 days with 6
research participants. To improve object detection, we used the two neural nets, YOLOv2 + HUMAN, in
parallel. Following this methodology, the robot was able to detect twice the number of objects compared
to the initial YOLOv2 neural net, and achieved a higher mAP (mean Average Precision) score. Using
the learning algorithm, the robot also collected data, by asking humans, about where each object was
located and to whom it belonged. This enabled us to explore a future use case where robots can search
for a specific person’s object. We view the contribution of this work to be relevant for service robots in
general, in addition to CoBot, Pepper, and Mbot.
Keywords: Cognitive Human-Robot Interaction; Deep Learning in Robotics and Automation; Service
Robots; Social Robots
Resumo
Apesar dos recentes progressos nos algoritmos estado-da-arte para detecção de objetos, estes, quando
implementados diretamente num robô de serviço, falham no reconhecimento de muitos dos objetos presentes
nos ambientes humanos reais. Esta tese introduz um algoritmo de aprendizagem através do qual os robôs
endereçam esta falha pedindo ajuda humana, numa abordagem denominada por “autonomia simbiótica”. Em
particular, partimos do YOLOv2, uma rede neuronal do estado-da-arte, e criámos uma nova rede neuronal
– HUMAN – com a informação recolhida pelo robô através da assistência humana. Com uma câmara RGB e
um tablet no robô, este procura proactivamente por auxílio humano para classificar os objetos em seu
redor. Para validar esta abordagem utilizámos três robôs de serviço, CoBot e Pepper, localizados na
CMU, e Monarch Mbot, no ISR-Lisboa, e realizámos um estudo num ambiente doméstico real com 6
participantes ao longo de 20 dias. Verificámos um melhoramento na detecção de objetos quando as duas
redes neuronais (YOLOv2 e HUMAN) são usadas em paralelo. No final da experiência, o robô foi capaz de
detectar o dobro dos objetos e ainda revelou uma melhor mAP (“mean Average Precision”). Através da
informação recolhida com as perguntas feitas aos humanos, mostrámos ainda um possível caso prático em
que o robô procura um objeto para uma pessoa em específico. Este trabalho contribui para robôs de
serviço em geral, para além do CoBot, Pepper, e Mbot.
Palavras-Chave: Interação Cognitiva Humano-Robô; “Deep Learning” em Robótica e Automação;
Robôs de Serviço; Robôs Sociais
Contents
Acknowledgements i
Abstract iii
Resumo v
List of Tables xi
List of Figures xv
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Publications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.2 Open Source . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Thesis Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 Framework and State-of-the-art Overview 6
2.1 Brief History of Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 An Overview on YOLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.1 Functioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.2 Network Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.3 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.4 Advantages of YOLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2.5 YOLOv2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 mAP - mean Average Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.2 Precision and Recall . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3.3 Intersection Over Union . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.4 Calculating the AP - Average Precision . . . . . . . . . . . . . . . . . . . . . . . . 19
2.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 YOLOv2 in a Real-World Scenario 21
3.1 PASCAL VOC Dataset Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.1 MBot Test in a Domestic Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.1.2 CoBot Test in a University Scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.3 Performance of YOLOv2 Trained on PASCAL VOC . . . . . . . . . . . . . . . . . . 22
3.2 COCO Dataset Test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4 Learning Algorithm 28
4.1 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.1.1 Part I - Capturing Images and Predicting the Objects . . . . . . . . . . . . . . . . . 29
4.1.2 Part II - Interacting with Humans . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1.3 Part III - Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 YOLOv2 + HUMAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.1 Problem and Solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.2.2 Removing Duplicate Detections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 Results 42
5.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1 Domestic Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.2 Looking for a Specific Person's Object . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Domestic Scenario Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
5.2.1 Correctness of Predictions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2.2 Evaluation of the Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 Further Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3.1 Looking for a Specific Person's Object . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3.2 Sharing Knowledge Between Robots . . . . . . . . . . . . . . . . . . . . . . . . . . 53
6 Conclusion 55
6.1 Summary of Thesis Achievements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
Bibliography 56
A Training Set Information 63
B Resulting Predictions Information 67
List of Tables
3.1 PASCAL VOC2012 [36] test detection results. YOLOv2 performs on the same level as
other state-of-the-art detectors like Faster R-CNN [29] with ResNet [53] and SSD512 [54]
while still running 2 to 10 times faster [9]. Table adapted from [9]. . . . . . . . . . . . . . . 23
3.2 MBot Domestic Scenario PASCAL VOC2012 [36] test detection results on YOLOv2. . . . 23
3.3 CoBot University Scenario PASCAL VOC2012 [36] test detection results on YOLOv2. . . 24
5.1 Predictions - Week 1 - Image 0 to 50. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Predictions - Week 2 - Image 50 to 100. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.3 Predictions - Week 3 - Image 100 to 150. . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.4 Average Precision using an external ground-truth composed of 100 images in different
poses. Values are in percentage (%) and the largest one per row marked in bold. . . . . 49
List of Figures
1.1 CoBot - Collaborative Robot. Designed for servicing multi-floor buildings [8]. . . . . . . . . 2
1.2 Mobile robots used for the evaluation of the proposed method. Photos: SoftBank/Aldebaran
and IDMind Robotics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Viola and Jones algorithm for face detection. Image from OpenCV website. . . . . . . . . 7
2.2 E.g. SIFT object detection with partial occlusion. Image from OpenCV website. . . . . . . 7
2.3 E.g., the HOG detector cues mainly on silhouette contours. Image from [17]. . . . . . . . . 8
2.4 R-CNN Object Detection System Overview. Image from [26]. . . . . . . . . . . . . . . . . 9
2.5 Example image split into an S×S grid and one of its cells (marked in red). Image from
YOLO website. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Resulting predictions from all the grid cells (the higher the confidence the thicker the box).
Image from YOLO website. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.7 Class probability map. Image from YOLO website. . . . . . . . . . . . . . . . . . . . . . . 11
2.8 Combining bounding boxes with class probabilities. Image from YOLO website. . . . . . . 11
2.9 Final predictions after applying a threshold and non-maximum suppression. Image from
YOLO website. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.10 The architecture of YOLO. Image from [31] . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.11 Example training YOLO to detect a “dog”. Image from YOLO website. . . . . . . . . . . . 14
2.12 YOLO generalization results on Picasso and People-Art datasets. Image from [31] . . . . 15
2.13 Anchor Boxes vs. Dimension Clusters. Image from YOLO website. . . . . . . . . . . . . . 16
2.14 Accuracy and speed on VOC 2007. Image from YOLO website. . . . . . . . . . . . . . . . 17
2.15 Example images captured by MBot from different perspectives and under different lighting conditions. 18
3.1 Example images captured by MBot from different perspectives and under different lighting conditions. 22
3.2 CoBots with on-board computation, interaction interfaces, omni-directional motion, carry-
ing baskets, and depth sensors (Kinect and LIDAR). . . . . . . . . . . . . . . . . . . . . . 23
3.3 Example images captured by CoBot for our YOLOv2 evaluation experiments. . . . . . . . 23
3.4 MBot, information about the 100 captured images PASCAL VOC test. . . . . . . . . . . . 25
3.5 CoBot, information about the 100 captured images PASCAL VOC test. . . . . . . . . . . . 25
3.6 MBot example images of YOLOv2 true and false predictions. . . . . . . . . . . . . . . . . 25
3.7 CoBot example images of YOLOv2 true and false predictions. . . . . . . . . . . . . . . . . 26
3.8 MBot, information about the 100 captured images COCO test. . . . . . . . . . . . . . . . 26
3.9 MBot, information about the 100 captured images COCO test. . . . . . . . . . . . . . . . 27
4.1 E.g. Outcomes from q1 and q2. Cat image credit to Flickr user William McCamment. . . 32
4.2 Some classes that can be added to the robot using the labels from the OpenImages dataset [55]. 33
4.3 Example of the human-robot interface when labeling an object. . . . . . . . . . . . . . . . 35
4.4 Mbot (top) and Pepper (bottom) learning. . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.5 An example case where the robot gets repeated labels for the same object. . . . . . . . . 37
4.6 Example of YOLOv2+HUMAN using predictions from both the neural nets. . . . . . . . . . 39
4.7 Duplicate detection example. YOLOv2+HUMAN uses the prediction with higher confidence. 40
5.1 ISRoboNet@Home Testbed – Domestic scenario where we ran the experiments. . . . . . 43
5.2 Points of reference comprising the surrounding area. . . . . . . . . . . . . . . . . . . . . . 44
5.3 Example images captured by the robot for the ground-truth. . . . . . . . . . . . . . . . . . 47
5.4 Information about the ground-truth. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.5 Pepper looking for Joao’s backpack demo. . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.6 Pepper detecting objects using MBot’s knowledge. . . . . . . . . . . . . . . . . . . . . . . 54
A.1 Information about HUMAN50 training data. . . . . . . . . . . . . . . . . . . . . . . . . . . 64
A.2 Information about HUMAN100 training data . . . . . . . . . . . . . . . . . . . . . . . . . . 65
A.3 Information about HUMAN150 training data. . . . . . . . . . . . . . . . . . . . . . . . . . . 66
B.1 YOLOv2 predictions with the ground-truth pictures. . . . . . . . . . . . . . . . . . . . . . . 68
B.2 HUMAN50 predictions with the ground-truth pictures. . . . . . . . . . . . . . . . . . . . . . 68
B.3 HUMAN100 predictions with the ground-truth pictures. . . . . . . . . . . . . . . . . . . . . 69
B.4 HUMAN150 predictions for the ground-truth pictures. . . . . . . . . . . . . . . . . . . . . . 70
B.5 YOLOv2+HUMAN150 predictions for the ground-truth pictures. . . . . . . . . . . . . . . . 71
Chapter 1
Introduction
The goal of this Chapter is to provide the reader with an overview of the work developed in this thesis.
Firstly, we introduce the problem we are addressing, along with the main motivation (Section 1.1), and
the objectives of this work (Section 1.2). Then, we describe the contributions of this work (Section 1.3).
Finally, we present the structure of this thesis (Section 1.4).
1.1 Motivation
Human-robot symbiotic learning is an increasingly active area of research [1, 2, 3, 4]. Anthropomorphic
robots are being increasingly deployed in real-world scenarios, such as homes, offices, and hospitals [5,
6, 7]. However, exposure to real human environments raises multiple challenges often overlooked in
controlled laboratory experiments. For instance, robots equipped with state-of-the-art neural nets trained
for object detection still fail to provide accurate descriptions for the majority of the objects surrounding
them outside controlled environments. We conducted a preliminary experiment with CoBot (shown in
Figure 1.1), designed for servicing multi-floor buildings [8], located at Carnegie Mellon University, USA,
and Monarch MBot (shown in Figure 1.2b), located at Instituto Superior Técnico, Portugal, to evaluate the
performance of the state-of-the-art object detector – YOLOv2 – in real-world scenarios. Unfortunately,
in both these robots, YOLOv2 fell short when compared to the expected performance (further described
in Chapter 3).
Figure 1.1: CoBot - Collaborative Robot. Designed for servicing multi-floor buildings [8].
1.2 Objectives
This thesis tackles the aforementioned problem using a symbiotic interaction approach, in which the
robot seeks human assistance in order to improve its object detection skills, which is the primary aim of
this work.
This is achieved by deploying a learning algorithm that empowers the robot to ask humans for help. Over
time, we can measure how this human input increases the robot's effectiveness. The learning process
is bootstrapped by an external state-of-the-art neural net — YOLOv2 — for real-time object detection [9].
The robot, using its RGB camera, in conjunction with its on-board tablet, explores its environment whilst
labeling objects. The robot then confirms these labels by interacting with humans, asking them to
respond to simple Yes/No questions and/or requesting that they adjust a selection rectangle positioned
around an object in the tablet.
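A minimal sketch of this interaction loop, in Python, could look as follows. All names here (capture_image, ask_yes_no, ask_label, ask_adjust_box, and the 0.5 threshold) are hypothetical illustrations, not the actual interfaces of the robots; the real algorithm is detailed in Chapter 4.

```python
def learning_loop(robot, detector, training_set):
    """Hypothetical sketch of the symbiotic learning loop: detect objects,
    ask a nearby human to confirm or correct uncertain labels via the
    on-board tablet, and store the confirmed labels for retraining."""
    image = robot.capture_image()
    for box, label, confidence in detector.predict(image):
        if confidence < 0.5:  # uncertain prediction: ask a human for help
            if not robot.ask_yes_no(f"Is this a {label}?", image, box):
                label = robot.ask_label(image, box)      # human provides the label
                box = robot.ask_adjust_box(image, box)   # human fixes the rectangle
            training_set.append((image, box, label))     # later used to train HUMAN
    return training_set
```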
With the learning algorithm, the robot also collects information about where objects were seen on the
map of the scenario and, for personal objects, to whom they belong. This information equips the robot
to perform other tasks, such as actively seeking a personal object on request. This was one of the use
case experiments made possible once we realized how widely the data collected by the robot could be
applied. Another application of this data would be to improve the robot's object detection skills by
sharing information between different robots.
The social robots used to test our approach were Pepper (shown in Figure 1.2a), a service robot devel-
oped by Softbank/Aldebaran Robotics and specifically designed for social interaction [10], and Monarch
(a) Pepper the robot (b) Mbot the robot
Figure 1.2: Mobile robots used for the evaluation of the proposed method. Photos: SoftBank/Aldebaran and IDMind Robotics
Mbot (shown in Figure 1.2b), designed for edutainment activities in a pediatric hospital [11]. In addition,
the two robotic platforms were located in separate working environments: Pepper was located at CMU,
USA, and Mbot at ISRobotNet@Home1, Test Bed of ISR-Lisbon, Portugal. The experiments were conducted
in a realistic domestic scenario. By having many research participants interact with and modify the
test environment, the unpredictability of object placement and arrangement was ensured.
At the end of our experiment, the robot was able to detect twice the number of objects compared to the
initial YOLOv2, with an improved mean Average Precision (mAP) score.
1.3 Contributions
Robots are being increasingly deployed in real-world scenarios. Homes, offices, and hospitals [5, 6, 7]
are merely a small example of the places where we will be able to find these robots in the future,
performing an ever-growing number of tasks in many different applications. With time, these tasks will
progressively become more complex and demanding, requiring robots to have an improved perception
of their surroundings.
This work contributes to service robots in general by empowering them with the ability to improve their
perception, in particular, by providing them with the ability to learn new objects and adapt to their sur-
rounding environment.
In practice, it is difficult to predict all the objects that a robot will be exposed to in real-world scenarios.
Scenarios are dynamic and constantly undergoing alterations as objects get displaced, modified, and
replaced by new ones. Using the learning algorithm introduced in this work, the robots were equipped
to address this: they asked for help when needed while improving their own perception of the human world.
1More info can be found at http://isrobonet athome.isr.tecnico.ulisboa.pt/
The approach of symbiotic autonomy uses interaction with humans as a means to overcome obstacles
(e.g., when the robot is unable to detect an object), rather than relying exclusively on the native abilities
of the robot. In turn, robots improve their object detection skills and provide more value to humans.
Hence this symbiotic relationship has the potential to form a positive feedback loop, where humans help
robots improve, which in turn helps robots serve humans better.
Summarized contributions of this work:
• We tested the state-of-the-art object detector (YOLOv2) using CoBot in the corridors of a university
and MBot in a domestic scenario, and showed that it fails to detect many of the objects in real
human environments;
• We introduced a learning algorithm that empowers the robot with the ability to detect new objects
and collect more information about them (to whom they belong, and where they are located) via
human-robot interaction;
• We introduced an approach of using two neural nets combined (YOLOv2 + HUMAN) and verified
that the YOLOv2 + HUMAN approach is able to detect more objects with increased accuracy;
• We showed that using the collected data, a robot can locate a specific person's object in the
environment;
• We showed that a robot can detect objects using the knowledge shared from another robot.
1.3.1 Publications
The learning algorithm introduced in this work, along with the main results, was published in a paper
accepted at IROS 2018 [12]. Additionally, a few other experiments conducted during the development of this work
will be presented at AAMAS 2018 [13].
1.3.2 Open Source
Recently, we released part of the code developed in this work in two GitHub repositories2:
• OpenLabeling: An open-source image labeling tool which supports the format required by YOLO
for training (released on 04/03/2018 and currently with 224 stars and 45 forks3).
• mAP: mean Average Precision - Code to evaluate the performance of your neural net for object
detection (released on 09/04/2018 and currently with 156 stars and 57 forks).
1.4 Thesis Structure
This thesis is structured as follows: Chapter 2 reviews the state-of-the-art in object detection and pro-
vides some background. Chapter 3 tests and evaluates the performance of YOLOv2 in real human
environments. Chapter 4 describes the learning algorithm and how the robot will use input from hu-
mans to create a neural net – HUMAN – adapted to the local real-world scenario and presents the
approach of using the two neural nets in parallel – YOLOv2 + HUMAN – to improve the robot’s capa-
bility of detecting objects. Chapter 5 describes the experimental setup where we conducted our study
and presents and discusses the results. Lastly, Chapter 6 presents the conclusions, summarizes the overall
findings, and discusses possible next steps for future work.
2These repositories can be found at https://github.com/Cartucho
3GitHub stars (show appreciation to the repository maintainer for their work) and forks (personal copies of the repository for
further development)
Chapter 2
Framework and State-of-the-art
Overview
This chapter provides a general framework for the task of object detection. The goal of object detection
is to enclose objects in an image within rectangles (usually called bounding boxes) and say what those
objects are. This chapter also puts the main studies about robots detecting objects into perspective.
Firstly, we briefly review the history of the object detection task (Section 2.1). Secondly, we overview the
YOLO neural net (Section 2.2). Then, we describe the mean Average Precision (mAP) metric (Section
2.3). Lastly, we describe the related work (Section 2.4).
2.1 Brief History of Object Detection
In 2001 Paul Viola and Michael Jones [14] presented the first remarkable face detection algorithm (the
Viola-Jones algorithm). Object detection had been around since the 1960s, but this was the first time
it really worked and ran in real-time, mainly due to its simple and efficient design. Like other algorithms
of the time, it relied on hand-crafted features: simple rectangular (Haar-like) filters whose responses
were fed into a cascade of boosted classifiers trained on a dataset of faces, capturing, for example, the
typical contrast between the regions of the eyes, the mouth, and the rest of the face. Unfortunately,
since the features were hand-crafted, other configurations, such as a person with an eyepatch or a
slightly tilted face, would often not be detected.
In 2004 Lowe [15] presented a very successful algorithm for feature matching called SIFT (scale
invariant feature transform). The basic idea was to transform the image data into scale invariant coor-
Figure 2.1: Viola and Jones algorithm for face detection. Image from OpenCV website.
Figure 2.2: E.g. SIFT object detection with partial occlusion. Image from OpenCV website.
dinates. Using these coordinates, the goal was to extract distinctive features invariant to image scale
and rotation, meaning that the object detections are robust to changes in the viewpoint from which the
images are captured. This algorithm also uses local features (each feature captures a distinctive part of
the object), making it robust to occlusion and clutter as well (see Figure 2.2). Other alternatives to
SIFT, such as SURF, have similar performance while being much faster [16]. SIFT is particularly
useful for identification tasks since it captures the features of a specific object. Unfortunately, this comes
at the price of poor generalization (e.g., SIFT may be very good at identifying a particular “shoe”, but
if other shoes in the same picture do not look like that one, they may not be detected).
In 2005 another efficient technique came out called HOG (Histograms of Oriented Gradients) [17]. This
method was similar to the SIFT descriptor but differed in that it decomposed an image’s pixels into
Figure 2.3: E.g., the HOG detector cues mainly on silhouette contours. Image from [17].
oriented gradients. With these gradients, full images are converted into simple representations that
capture the essence of what a human looks like in a picture (see Figure 2.3). Given a new picture, the
detector can then say: “Yes, this is a human” or “No, this is not a human”. This kind of basic feature
map looks very similar to what convolutional nets learn themselves, but in this case, what a human
figure looks like when converted to oriented gradients had to be hand-coded.
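To make the idea concrete, the orientation histogram of a single image cell can be computed roughly as follows. This is a toy simplification written for illustration; only the choice of 9 unsigned-orientation bins follows the HOG paper, while everything else (central differences, no block normalization) is our shorthand.

```python
import math

def cell_histogram(cell, n_bins=9):
    """Toy HOG-style orientation histogram for one cell.

    `cell` is a 2D list of grayscale intensities. Gradients are taken
    with central differences and binned by unsigned orientation
    (0-180 degrees), weighted by gradient magnitude.
    """
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle / 180.0 * n_bins) % n_bins] += magnitude
    return hist
```

A full HOG descriptor concatenates such histograms over a dense grid of cells and normalizes them across overlapping blocks.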
Later, in 2012, the era of deep learning began. In this year, Krizhevsky et al. won the ImageNet [18] competition
(a yearly competition on visual detection tasks) using a convolutional network that outperformed
everybody else [19]. Convolutional neural networks had been around since the 1980s [20, 21], but this
time they really worked, due to the overall improvement in computational power (with the development of
modern GPUs) and to the increased amount of training data.
One way to perform object detection is to use classifiers like VGG-Net [22] or Inception [23, 24] (huge
convolutional neural nets trained on big datasets). By sliding these classifiers over a number of squares
of an image, we get a set of classifications, of which we keep only the ones the classifier is most certain
about. We can then draw a bounding box around the classified objects in the image. However, this
is a brute-force and computationally expensive approach. Methods like deformable parts models
(DPM) use this sliding-window approach, where a classifier is run at evenly spaced locations over the
entire image and high classification scores correspond to detections [25].
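The sliding-window scheme just described can be sketched as follows; `classify` stands in for any patch classifier (e.g., a CNN), and the single-scale loop is our simplification, since real systems also sweep over multiple scales.

```python
def sliding_window_detect(image, window, stride, classify, threshold):
    """Run `classify(patch) -> (label, score)` at evenly spaced locations.

    `image` is a 2D list of pixel values; every window whose score
    clears the threshold is kept as a detection.
    """
    h, w = len(image), len(image[0])
    win_h, win_w = window
    detections = []
    for y in range(0, h - win_h + 1, stride):
        for x in range(0, w - win_w + 1, stride):
            patch = [row[x:x + win_w] for row in image[y:y + win_h]]
            label, score = classify(patch)
            if score >= threshold:
                detections.append((x, y, win_w, win_h, label, score))
    return detections
```

The total cost is the number of window positions multiplied by the cost of one classifier evaluation, which is what makes this approach so expensive.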
In 2014 a better approach was presented by Girshick et al., called R-CNN: Regions with CNN features
[26]. The idea behind R-CNN was to first run a process called selective search [27], which creates a set
of bounding boxes that could correspond to objects, before feeding anything to a convolutional network.
At a high level, selective search groups together adjacent pixels by texture, color, or intensity to identify
objects. As illustrated in Figure 2.4, given an input image, instead of using a sliding window, R-CNN
first extracted bounding boxes (region proposals); then it ran the images inside these bounding boxes
through a pre-trained CNN (e.g., AlexNet [19]) to compute the features for each bounding box. Finally,
it would use a support vector machine to classify the object inside each box. After
Figure 2.4: R-CNN Object Detection System Overview. Image from [26].
classification, the bounding boxes would be refined to output tighter coordinates to the objects and
eliminate duplicate detections.
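The pipeline just described can be summarized in a short sketch, where each argument is a pluggable stand-in for the corresponding component in the paper (selective search, a pre-trained CNN, and per-class SVMs); the function names here are ours, for illustration only.

```python
def rcnn_detect(image, propose_regions, cnn_features, svm_classify):
    """Schematic R-CNN pipeline: propose regions, featurize each crop,
    classify the features."""
    detections = []
    for box in propose_regions(image):       # ~2000 proposals per image
        features = cnn_features(image, box)  # CNN on the cropped region
        label, score = svm_classify(features)
        detections.append((box, label, score))
    # The paper then refines box coordinates (bounding-box regression)
    # and removes duplicate detections (non-maximum suppression).
    return detections
```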
This proved to be an effective approach for object detection. R-CNN evolved into Fast RCNN [28], Faster
RCNN [29] and more recently Mask RCNN [30]. However, they all first generate potential bounding
boxes, then run a classifier and finally do some post-processing. These complex pipelines are slow and
hard to optimize because each individual component must be trained separately [31].
Moreover, all of these methods look at each image thousands or hundreds of thousands of times
to perform detection, repeatedly evaluating the classifiers on different parts of the image and at different
scales. YOLO took a completely different approach and outperformed these previous methods.
2.2 An Overview on YOLO
You Only Look Once (YOLO) is one of the newest and most popular techniques for object detection.
This neural net was introduced in 2015 [31] (CVPR 2016, OpenCV People's Choice Award), and its
accuracy and speed were recently improved substantially in YOLOv2 [9] (CVPR 2017, Best Paper
Honorable Mention).
In this Section we describe how this object detector works, how it is structured, and what its main
advantages are.
2.2.1 Functioning
Given an image, YOLO starts by splitting that image into an S×S grid (see Figure 2.5). In this grid,
each of the cells (e.g., in Figure 2.5 one of the cells is marked in red) is responsible for predicting B
10 Chapter 2. Framework and State-of-the-art Overview
Figure 2.5: Example image split into an S×S grid with one of its cells marked in red. Image from the YOLO website.
bounding boxes and B confidence scores (one for each box). Each individual box confidence score tells
us how certain YOLO is that the predicted bounding box actually encloses some object and also how
well it thinks the predicted bounding box is adjusted to that object.
Figure 2.6: Resulting predictions from all the grid cells (the higher the confidence the thicker the box). Image from the YOLO website.
In Figure 2.6 we can visualize the predictions from all the cells together. Essentially, we get many
bounding boxes ranked by their confidence value (the higher the confidence, the thicker the box). At
this point we know where the objects are in the image, but we still do not know what they are.
Then, each grid cell (see Figure 2.7) also predicts C conditional class probabilities. It only predicts one
set of class probabilities per grid cell, regardless of the number of boxes (B) associated with that cell.

Figure 2.7: Class probability map. Image from YOLO website.

Figure 2.8: Combining bounding boxes with class probabilities. Image from YOLO website.

It is important to notice that, since this is a conditional probability, when a grid cell predicts "dog" it is
not saying that there is a "dog" in that cell; it is saying that if there is an object in that cell, then that
object is a "dog".
The previously calculated confidence score for the bounding box and the conditional class prediction
are combined into one final score that tells us the probability that each bounding box contains a specific
type of object, as we can see in Figure 2.8.
However, most of these predictions have a very low confidence score; consequently, we only keep the
boxes whose final score is higher than a specific threshold. Additionally, duplicate boxes predicting
the same object are removed using non-maximum suppression, resulting in the final detections for one
image (see Figure 2.9).
Figure 2.9: Final predictions after applying a threshold and non-maximum suppression. Image from the YOLO website.
To summarize, each cell in the S×S grid predicts:
• B bounding boxes and associated box confidences;
• C conditional class probabilities (one per class).
Each of the B bounding boxes is coupled with 5 predictions (x, y, w, h, and box confidence). The x and y
coordinates are offsets of the center of the box relative to the bounds of the grid cell. The w and h are
the width and the height of the predicted bounding box. All four coordinate values are normalized to fall
between 0 and 1: x and y relative to the cell size, and w and h relative to the whole image. Finally, the
box confidence was already explained above (how certain YOLO is that the predicted bounding box actually
encloses some object and how well adjusted that box is).
This parameterization fixes the output size for detection: S×S×(B×5 + C). For example, for PASCAL
VOC, YOLO used a 7 by 7 grid (S=7), 2 bounding box proposals per cell (B=2), and there are 20
classes (C=20). This results in 7×7×30 = 1470 outputs.
To summarize, YOLO trains a neural network to predict this output tensor, so that in one pass it computes
all the detections for an image. The main reason YOLO is so good is that these separate predictions are
all made at the same time, looking only once at the image, which is why it is called YOLO (You Only
Look Once). Since it predicts all of these detections simultaneously, YOLO also implicitly incorporates
global context in the detection process, so it can learn which objects tend to occur together, the relative
size and location of objects, and other assorted things.
Figure 2.10: The architecture of YOLO. Image from [31]
2.2.2 Network Architecture
The architecture is inspired by the GoogLeNet model for image classification [58]. As shown in Figure 2.10,
the full network is essentially composed of alternating Convolution and MaxPooling layers.
Using it is very simple: we give it an input image, which goes through the convolutional network in a
single pass and comes out the other end as an S×S×(B×5 + C) tensor describing the predicted bounding
boxes for the grid cells. All we need to do then is compute the final scores for the bounding boxes,
remove the boxes scoring less than a pre-determined threshold, and remove the duplicate bounding
boxes (non-maximum suppression).
Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detec-
tion performance [31].
2.2.3 Training
ImageNet [18] is currently the biggest dataset for Computer Vision tasks. It includes millions of images
capturing 1000 different object classes in their natural scenes. When we train YOLO to detect our own
custom object classes we usually have a small training set. Conveniently, YOLO was pre-trained on the
ImageNet dataset. During this pre-training, YOLO learned to convert the objects in millions of images into
a feature map (convolutional weights). Therefore, when YOLO trains, it adapts these convolutional weights
to detect our custom object classes in the training set images. Specifically, it applies a technique called
Transfer Learning (also known as Domain Adaptation). This allows us to get good performance even
with a small training set.
(a) Increased confidence during training. (b) Decreased confidence during training.
Figure 2.11: Example training YOLO to detect a “dog”. Image from YOLO website.
Since YOLO predicts all the existing instances of objects from a single full image, it also has to train on
full images. First, given a training image coupled with the ground-truth labels, YOLO matches these
labels with the appropriate grid cells. This is done by calculating the center of each ground-truth bounding
box; whichever grid cell that center falls in becomes the one responsible for predicting that detection.
Then YOLO adjusts that cell's class predictions (e.g. in Figure 2.11 we want it to predict "dog") and its
bounding box proposals. The bounding box overlapping the most with the ground-truth label has
its confidence score increased (e.g. the one with the green arrow in Figure 2.11) and its coordinates
adjusted, while the remaining boxes simply get their confidence scores decreased (e.g. the one with
the red arrow in Figure 2.11) since they do not properly overlap the object.
Some remaining cells do not have any ground-truth detections assigned to them. In this case, the
confidence of all their predicted bounding boxes is decreased, since they are not responsible for predicting
any object. One important thing to note is that, in this case, YOLO does not adjust the class probabilities
or coordinates of the proposed bounding boxes, since there are no ground-truth objects assigned to
those cells.
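The cell-assignment step described above can be sketched as follows (assuming ground-truth box centers given in normalized [0, 1] image coordinates; the function name is ours):

```python
def responsible_cell(center_x, center_y, S):
    """Return the (row, col) of the grid cell whose region contains the
    ground-truth box center; that cell is responsible for the detection."""
    col = min(int(center_x * S), S - 1)  # clamp so a center at exactly 1.0
    row = min(int(center_y * S), S - 1)  # still falls inside the grid
    return row, col
```

For S = 7, a box centered at (0.5, 0.5) is assigned to cell (3, 3), the middle of the grid.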
Overall, the training of this network is pretty straightforward and follows many of the standards of the
Computer Vision community: YOLO pre-trains on ImageNet, uses stochastic gradient descent, and uses
data augmentation. For the data augmentation, YOLO randomly scales and translates by up to 20% of
the original image size, and also randomly adjusts the exposure and saturation of the image by up to a
factor of 1.5 and the hue by up to a factor of 0.1 in the HSV color space [31].
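A minimal sketch of sampling these augmentation parameters (the function name and the convention of making the exposure/saturation factor equally likely to increase or decrease are our assumptions, not YOLO's actual code):

```python
import random

def sample_augmentation(rng=random):
    """Sample one set of the augmentation parameters described in the text."""
    def rand_scale(s):
        # a factor in [1/s, s], equally likely to brighten or darken
        f = rng.uniform(1.0, s)
        return f if rng.random() < 0.5 else 1.0 / f
    return {
        "scale": rng.uniform(0.8, 1.2),         # up to 20% of image size
        "translate_x": rng.uniform(-0.2, 0.2),  # fraction of image width
        "translate_y": rng.uniform(-0.2, 0.2),  # fraction of image height
        "exposure": rand_scale(1.5),            # factor of up to 1.5
        "saturation": rand_scale(1.5),          # factor of up to 1.5
        "hue": rng.uniform(-0.1, 0.1),          # HSV hue shift up to 0.1
    }
```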
Figure 2.12: YOLO generalization results on Picasso and People-Art datasets. Image from [31]
2.2.4 Advantages of YOLO
One of the main benefits of using YOLO is that it is considerably fast, allowing real-time processing for
service robots and many other applications.
Another great advantage is that YOLO reasons globally about the image when making predictions. Un-
like sliding window and region proposal-based techniques, YOLO sees the entire image during training
and test time so it implicitly encodes contextual information about classes as well as their appearance.
Fast R-CNN, a top detection method [28], mistakes background patches in an image for objects because
it cannot see the larger context [31]. YOLO makes less than half the number of background errors
compared to Fast R-CNN. It also predicts all bounding boxes across all classes for an image
simultaneously.
Finally, YOLO generalizes rather well to new domains (see Figure 2.12). After being trained on natural
images, it has been run on artwork and still managed to detect object classes like "human" and "cat",
even with abstract representations of people like in the painting The Scream by Edvard Munch.
2.2.5 YOLOv2
From the initial version to YOLOv2, some incremental improvements were applied, enhancing the accuracy
significantly while also making it faster. For example, YOLOv2 added batch normalization [32] in the
convolutional layers, leading to increased detection accuracy. In this Sub-section, we review some of the
most significant improvements that were made:

Figure 2.13: Anchor Boxes vs. Dimension Clusters. Image from the YOLO website.
Dimension Clusters
Other systems like Faster R-CNN and SSD use pre-defined anchor boxes consisting of 3 different
scales and 3 different aspect ratios (see Figure 2.13). YOLOv2 also uses pre-defined boxes; however,
instead of using anchor boxes with hand-picked aspect ratios, it creates a new set of boxes (dimension
clusters) tuned to the real objects that are actually in the training data. The dimension clusters
capture more variability in the training data with fewer boxes, resulting in faster and more accurate
detection.
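A minimal sketch of how such dimension clusters can be obtained, assuming k-means with d = 1 - IOU as the distance (boxes compared by width and height only, as if centered at the same point); the naive initialization with the first k boxes is an illustrative simplification:

```python
def wh_iou(wh1, wh2):
    """IOU of two boxes compared by width/height only (same center)."""
    inter = min(wh1[0], wh2[0]) * min(wh1[1], wh2[1])
    return inter / (wh1[0] * wh1[1] + wh2[0] * wh2[1] - inter)

def dimension_clusters(box_whs, k, iterations=100):
    """k-means on box (w, h) pairs using d = 1 - IOU as the distance,
    so the clusters favour shapes that overlap well, not raw sizes."""
    centroids = list(box_whs[:k])  # naive init: first k boxes
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for wh in box_whs:
            # assign to the centroid with the highest IOU (lowest distance)
            best = max(range(k), key=lambda i: wh_iou(wh, centroids[i]))
            groups[best].append(wh)
        centroids = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g))
            if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids
```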
Multi-scale Training
YOLOv2 also includes a multi-scale training regime. Generally, detectors are trained at a single input
resolution (e.g. 448×448), and at test time the input image is resized to that specific size. In the
original version of YOLO, the full detection pipeline was trained at just a single scale. With YOLOv2 they
came up with the idea of resizing the network randomly throughout the training process to many different
scales, after making the network fully convolutional (they removed the fully connected layers at the end of
the original architecture). During training, YOLOv2 resizes the images in multiples of 32, from 288×288,
320×320 and so on up to 608×608. This technique boosts the network's performance both at a single
scale (e.g. if we only want to detect objects on a 448×448 image) and at other scales as well. By doing
this, YOLOv2 can essentially be resized at test time to numerous different sizes. Accordingly, without
changing the previously trained weights, we can run detection at different scales, getting a smooth trade-
off between speed and accuracy (see Figure 2.14). So, for example, if we perform detection at
544×544 we have a model that is very accurate and runs a little slower, whereas if we resize down to
Figure 2.14: Accuracy and speed on VOC 2007. Image from YOLO website.
288×288 we have a model that is a lot faster but less accurate. This multi-scale training can be thought
of as a sort of data augmentation. In object detection we try to do as much data augmentation as possible,
and training at different scales means that the detector learns to predict objects well at different scales,
leading to a significant performance improvement.
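The set of training resolutions described above, and the random choice among them, can be sketched as (function names are ours):

```python
import random

def multiscale_resolutions(low=288, high=608, step=32):
    """All square input sizes used during multi-scale training:
    288, 320, ..., 608 (multiples of 32, the network's total stride)."""
    return list(range(low, high + 1, step))

def pick_training_resolution(rng=random):
    """Every few batches the network is switched to a randomly chosen size."""
    return rng.choice(multiscale_resolutions())
```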
2.3 mAP - mean Average Precision
In this Section we provide some background on the meaning and how the mean Average Precision
(mAP) value is calculated.
2.3.1 Definition
The mean Average Precision (mAP) is the standard single-number metric used to compare the state-
of-the-art for object detection. Let us decompose this acronym into two parts, "m" and "AP". The "m" in
mAP stands for "mean" (the arithmetic mean): the sum of all the values (in this case the AP values)
divided by the number of items. The "AP" (Average Precision) is an average of precision values. This is
further explained in the following Sub-sections.
2.3.2 Precision and Recall
When calculating the mAP, two quantities are measured repeatedly: the precision (the percentage of
items classified as positive that actually are positive) and the recall (the percentage of positives that are
classified as positive) [33].
(a) Example IOU = 78.7% ≥ 50% (b) Example IOU = 31.0% < 50%

Figure 2.15: Example predicted bounding boxes with an IOU above (a) and below (b) the 50% matching threshold.
precision = tp / (tp + fp),    (2.1)

recall = tp / (tp + fn),    (2.2)
where tp stands for true positive, fp for false positive and fn for false negative. So, in this context, precision
is the fraction of all the detected objects (tp + fp) that match the ground-truth (tp), and recall is
the fraction of all the ground-truth objects (tp + fn) that are successfully detected (tp).
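As a minimal sketch, Equations 2.1 and 2.2 in code:

```python
def precision(tp, fp):
    """Fraction of detections that actually match a ground-truth object."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Fraction of ground-truth objects that were successfully detected."""
    return tp / (tp + fn)

# e.g. 8 correct detections, 2 spurious ones and 4 missed objects:
print(precision(8, 2))  # 0.8
```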
2.3.3 Intersection Over Union
In the object detection task, the algorithm is expected to localize the object in the image. To evaluate
whether the predicted bounding boxes are well adjusted to the objects, these algorithms measure the
IOU (Intersection Over Union):
IOU = Area(Bp ∩ Bg) / Area(Bp ∪ Bg),    (2.3)
where Bp ∩ Bg and Bp ∪ Bg denote, respectively, the intersection and the union of the predicted and
ground-truth bounding boxes.
Two bounding boxes match if their IOU ≥ 50%. This threshold was set in the PASCAL VOC competition
[34]; humans tend to be slightly more lenient than the 50% IOU criterion [35].
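For axis-aligned bounding boxes, Equation 2.3 and the matching criterion can be sketched as follows (the corner-coordinate `(x1, y1, x2, y2)` box format is an assumption for illustration):

```python
def iou(bp, bg):
    """Intersection over union (Equation 2.3) of a predicted box bp and a
    ground-truth box bg, both given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(bp[0], bg[0]), max(bp[1], bg[1])
    ix2, iy2 = min(bp[2], bg[2]), min(bp[3], bg[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_p = (bp[2] - bp[0]) * (bp[3] - bp[1])
    area_g = (bg[2] - bg[0]) * (bg[3] - bg[1])
    return intersection / (area_p + area_g - intersection)

def boxes_match(bp, bg):
    """PASCAL VOC matching criterion: the boxes match if IOU >= 50%."""
    return iou(bp, bg) >= 0.5
```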
For example, in Figure 2.15 (a), given an instance of the object class "pottedplant", we can see the
ground-truth bounding box (in blue) and the one predicted by the algorithm (in green), with an IOU ≥
50%. In (b), given an instance of the object class "chair", we can see the ground-truth bounding box (in
blue) and the one predicted by the algorithm (in red), with an IOU < 50%.
2.3.4 Calculating the AP - Average Precision
The neural net's predictions were judged by the precision/recall (PR) curve. The quantitative measure
used was the average precision (AP) with the VOC metric [34, 36, 37].

First, each detection was mapped to a ground-truth object instance. There is a match if the class labels
are the same and the IOU (Intersection Over Union), given by Equation 2.3, is larger than or equal to
50%. In the case of multiple detections of the same object, only one is counted as a true detection and
the repeated ones are counted as false detections [34].

The AP was then computed as follows [36]:

• First, we computed a version of the measured precision/recall curve with precision monotonically
decreasing, by setting the precision for recall r to the maximum precision obtained at any recall
r′ > r.

• Then, we computed the AP as the area under this curve by numerical integration. No approximation
was involved since the curve is piecewise constant.
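The two computation steps above can be sketched as follows (a minimal sketch; `recalls` and `precisions` are the measured points of the PR curve):

```python
def average_precision(recalls, precisions):
    """VOC-style AP: set the precision at each recall r to the maximum
    precision at any recall r' >= r (monotone envelope), then integrate
    the resulting piecewise-constant curve exactly."""
    pts = sorted(zip(recalls, precisions))
    r = [0.0] + [pt[0] for pt in pts]
    p = [0.0] + [pt[1] for pt in pts]
    for i in range(len(p) - 2, -1, -1):  # right-to-left max sweep
        p[i] = max(p[i], p[i + 1])
    # area under the step curve: each recall segment carries the
    # precision of the point at its right end
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))
```

For instance, a PR curve with the points (recall 0.5, precision 1.0) and (recall 1.0, precision 0.5) yields an AP of 0.75.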
2.4 Related Work
The complexity of indoor environments grows with a large number of factors, increasing the difficulty
for robots to complete tasks successfully. The particular task of learning and recognizing useful
representations of places (such as a multi-floor building) and manipulating objects has been a subject of
active research, namely in symbiotic interaction with humans [8].
In this work, we aim to detect objects coupled with their bounding-box. A few works have explored a
human-based active learning method specifically for training object class detectors. Some of them focus
on human verification of bounding boxes [35] or rely on a large amount of data corrected by annotators
[38]. Others explore how people teach or are influenced by the robot [39].
It is worth mentioning that symbiotic autonomy has been actively pursued in the past [40, 41]. Some
interesting approaches focus on improving the robot's perception with remote human assistance [42].
Research groups have also worked on developing autonomous service robots such as the CoBots [2],
the PR2 [43] and many others.
Considering spatial information analysis and mapping, approaches using ultrasonic imaging with neural
networks [44], 3D convolutional neural networks with RGB-D data [45], and a combination of the RANSAC
(an outlier detection method) and Mean Shift (a cluster analysis method) algorithms [46] have been
developed over several decades, providing the foundations for the present work.
Other works using scene understanding and image recognition show a strong affinity with this thesis,
such as: saliency mapping combined with neural nets for scene understanding [47]; full pose estimation
of relevant objects relying on algorithmic processing and comparison against a dataset of images [46];
a feature-matching technique implemented with a Hopfield neural network [48]; and data augmentation
[49].
When it comes to showing the robot an object located in the real world, previous work has investigated
alternative ways of intuitively and unambiguously selecting objects, such as using a green laser pointer [50].
In our approach, we took advantage of the robot's tablet, purposely built in for interacting with humans.
Finally, there has been work done in the area of getting robots to navigate in a realistic setting and
detecting objects in order to place them on a map [51]. In our case, using input from human interaction
allows the robot to generate this information. This enables the robot to store where each object was
seen and how many times it was seen at each location.
Chapter 3
YOLOv2 in a Real-World Scenario
You Only Look Once (YOLO) is currently one of the most popular state-of-the-art neural nets for real-time
object detection. This chapter evaluates its performance when deployed on a robot in a real-world
scenario and exposes its need for improvement. Firstly, YOLOv2's detections are evaluated when it is
trained using the PASCAL VOC 2012 dataset [36] (Section 3.1), and then using the COCO dataset [52]
(Section 3.2). Lastly, we discuss these evaluations (Section 3.3).
3.1 PASCAL VOC Dataset Test
In this Section we test YOLOv2, trained with the PASCAL VOC dataset, with MBot in a domestic scenario
(Sub-section 3.1.1) and with CoBot in a university scenario (Sub-section 3.1.2). In each scenario the
robot collected a total of 100 images, capturing a sub-group of the 20 object classes1 of the PASCAL
VOC 2012 dataset [36]. Both robots' images were captured at the same resolution of 640×4802. At
the end of this Section, we compare the results with the ones presented in the literature (Sub-section
3.1.3).
3.1.1 MBot Test in a Domestic Scenario
First, MBot captured a total of 100 images for this test in a domestic scenario. In these images, using
YOLOv2, MBot detected a total of 7 out of the 20 classes present in the PASCAL VOC dataset: "bottle",
"chair", "person", "plant", "sofa", "table" and "tv". These images were captured
1The 20 PASCAL VOC class labels can be found at http://host.robots.ox.ac.uk/pascal/VOC/voc2012/
2The MBot's images were captured using an ASUS Xtion and the CoBot's ones with a Kinect 2
(a) Domestic scenario during day time (b) Domestic scenario during night time
Figure 3.1: Example images captured by MBot, showing different perspectives and lighting conditions.
on different days (implicitly varying the lighting conditions) and from different poses (position + orientation)
relative to a fixed world frame of the scenario (examples in Figure 3.1), in the ISRobotNet@Home3 Test
Bed of ISR-Lisbon, Portugal. In Sub-section 3.1.3 we evaluate YOLOv2's performance when detecting
objects in these images.
3.1.2 CoBot Test in a University Scenario
Similarly, we deployed YOLOv2 on the mobile indoor service robot CoBot (collaborative robot; see
Figure 3.2). CoBot also captured 100 images for testing, but this time in a university scenario. In them,
YOLOv2 detected 4 of the 20 classes present in the PASCAL VOC dataset: "chair", "plant", "table"
and "tv". To collect these images (examples in Figure 3.3) the robot navigated on one of the floors of
the Machine Learning Department, at Carnegie Mellon University, in Pittsburgh, PA, USA. It navigated
continuously, for a total of half an hour, randomly choosing each next destination in that building. Despite
being a real university scenario, the class "person" was excluded from the captured images due to
privacy concerns.
3.1.3 Performance of YOLOv2 Trained on PASCAL VOC
For each of the captured images (by MBot and CoBot), we created a corresponding ground-truth file,
correctly labeling all the instances of the PASCAL VOC object classes, and a YOLOv2 predictions
file (see the plots in Figures 3.4 and 3.5). Then, we confronted the neural net's predictions with the
ground-truth to calculate the AP (Average Precision) for each class (see Tables 3.2 and 3.3) and the
resulting mAP (mean Average Precision); this metric is explained in Chapter 2, Section 2.3.
3More info can be found at http://isrobonet athome.isr.tecnico.ulisboa.pt/
Figure 3.2: CoBots with on-board computation, interaction interfaces, omni-directional motion, carrying baskets, and depth sensors (Kinect and LIDAR).
(a) Corridor (b) Room
Figure 3.3: Example images captured by CoBot for our YOLOv2 evaluation experiments.
Table 3.1: PASCAL VOC2012 [36] test detection results. YOLOv2 performs on the same level as other state-of-the-art detectors like Faster R-CNN [29] with ResNet [53] and SSD512 [54] while still running 2 to 10 times faster [9]. Table adapted from [9].
The values presented for each class are the AP (Average Precision) in percentage (%).
Method data mAP aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
Fast R-CNN [28] VOC12 68.4 82.3 78.4 70.8 52.3 38.7 77.8 71.6 89.3 44.2 73.0 55.0 87.5 80.5 80.8 72.0 35.1 68.3 65.7 80.4 64.2
Faster R-CNN [29] VOC12 70.4 84.9 79.8 74.3 53.9 49.8 77.5 75.9 88.5 45.6 77.1 55.3 86.9 81.7 80.9 79.6 40.1 72.6 60.9 81.2 61.5
YOLO [31] VOC12 57.9 77.0 67.2 57.7 38.3 22.7 68.3 55.9 81.4 36.2 60.8 48.5 77.2 72.3 71.3 63.5 28.9 52.2 54.8 73.9 50.8
SSD300 [54] VOC12 72.4 85.6 80.1 70.5 57.6 46.2 79.4 76.1 89.2 53.0 77.0 60.8 87.0 83.1 82.3 79.4 45.9 75.9 69.5 81.9 67.5
SSD512 [54] VOC12 74.9 82.3 75.8 59.0 52.6 81.7 81.5 90.0 55.4 79.0 59.8 88.4 84.3 84.7 84.8 83.3 50.2 78.0 66.3 86.3 72.0
ResNet [53] VOC12 73.8 86.5 81.6 77.2 58.0 51.0 78.6 76.6 93.2 48.6 80.4 59.0 92.1 85.3 84.8 80.7 48.1 77.3 66.5 84.7 65.6
YOLOv2 544 [9] VOC12 73.4 86.3 82.0 74.8 59.2 51.8 79.8 76.5 90.6 52.1 78.2 58.5 89.3 82.5 83.4 81.3 49.1 77.2 62.4 83.8 68.7
Table 3.2: MBot Domestic Scenario PASCAL VOC2012 [36] test detection results on YOLOv2.
The values presented for each class are the AP (Average Precision) in percentage (%).
Method data mAP bottle chair table person plant sofa tv
YOLOv2 VOC12 (PASCAL VOC) [9] 60.6 51.8 52.1 58.5 81.3 49.1 62.4 68.7
MBot Domestic (100 images) 39.0 6.9 55.9 21.0 63.9 50.7 53.2 21.4
Table 3.3: CoBot University Scenario PASCAL VOC2012 [36] test detection results on YOLOv2.
The values presented for each class are the AP (Average Precision) in percentage (%).
Method data mAP chair table plant tv
YOLOv2 VOC12 (PASCAL VOC) [9] 57.1 52.1 58.5 49.1 68.7
CoBot University (100 images) 11.6 27.0 1.4 18.1 0
Table 3.1 illustrates the comparable performance of YOLOv2 versus other state-of-the-art detectors.
Using some of the values of this table we then created Tables 3.2 and 3.3. Table 3.2 allows us to compare
the performance of YOLOv2 when using PASCAL VOC's test images (1st row of the table) versus when
using MBot's 100 images from the domestic scenario test (2nd row). Given these results, we first notice
the comparable performance of the classes "chair" and "plant" between the two datasets. Furthermore,
the classes "person" and "sofa" have AP values dropping less than 20 percentage points, and finally,
the classes "bottle", "table" and "tv" drop over 35 percentage points from PASCAL VOC to MBot's
100-image data. Overall, the mAP (mean Average Precision) dropped approximately 20 percentage
points, from 60.6% (using PASCAL VOC data) to 39.0% (using MBot's 100 images). The plots in Figure
3.4 provide more details about MBot's domestic scenario test, illustrating, for example, that in the 100
images YOLOv2 predicted 14 instances of "bottle", 10 of which were assigned as false predictions and
4 as true ones. Example images of these predictions can be found in Figure 3.6.
Similarly, Table 3.3 allows us to compare the performance of YOLOv2 when using PASCAL VOC's test
images (1st row of the table) versus when using CoBot's 100 images from the university scenario test
(2nd row). In this scenario, YOLOv2 performed poorly compared to the previous test. The AP of the
classes "chair" and "plant" dropped significantly (over 25 percentage points) and the AP of the classes
"table" and "tv" dropped practically to 0% (Figure 3.7 illustrates one of the detections of a "tv"). Overall,
the mAP dropped from 57.1% to 11.6%. Most of CoBot's images were captured while it was moving
around the building, resulting in more blurred images compared to MBot's. The plots in Figure 3.5
provide more information about the performance of YOLOv2 on the images from CoBot's test. We can
see from that plot that the classes "train", "sofa" and "bird" are also included in YOLOv2's predicted
objects, although there were no instances of these classes in the images and therefore in the ground-truth.
For example, note that the class "train" was detected 11 times in those images of the university corridors.
Example images of these predictions can be found in Figure 3.7.
3.2 COCO Dataset Test
YOLOv2 was also trained separately on the COCO dataset [52], obtaining 44.0 mAP [9] on the same
VOC metric (see more details about this metric in Chapter 2, Sub-section 2.3.4). The COCO dataset
(a) Information about objects in the ground-truth (b) Information about objects predicted by YOLOv2
Figure 3.4: MBot, information about the 100 captured images PASCAL VOC test.
(a) Information about objects in the ground-truth (b) Information about objects predicted by YOLOv2
Figure 3.5: CoBot, information about the 100 captured images PASCAL VOC test.
(a) Correct prediction of a “bottle” (in green) (b) False prediction of a “person” (in red)
Figure 3.6: MBot example images of YOLOv2 true and false predictions.
(a) Correct prediction of a “chair” (in green) (b) False prediction of a “tv” (in red)
Figure 3.7: CoBot example images of YOLOv2 true and false predictions.
includes 80 object classes4, extending the 20 classes of PASCAL VOC by 60. Both the PASCAL VOC and
COCO datasets gather training images of objects in a vast variety of day-to-day scenes in their natural
context. Since the Average Precision results of YOLOv2 for each of the individual 80 classes were not
published, in this Section we evaluate how many of these classes are detected in the images previously
collected by MBot and CoBot, and whether these detections are true or false.
Figures 3.8 and 3.9 show the results obtained with YOLOv2 trained on the COCO dataset for the same
images as before. In MBot's domestic scenario a total of 23 classes were detected, and in CoBot's
university scenario a total of 10 classes. Not all of these predictions were correct: looking at the class
"refrigerator", for example, we can see that all its predictions were false (37 times on MBot and 4 times
on CoBot), since there was not any refrigerator in the test scenarios.
(a) Information about objects in the ground-truth (b) Information about objects predicted by YOLOv2
Figure 3.8: MBot, information about the 100 captured images COCO test.
4The 80 COCO class labels can be found at http://cocodataset.org
(a) Information about objects in the ground-truth (b) Information about objects predicted by YOLOv2
Figure 3.9: CoBot, information about the 100 captured images COCO test.
3.3 Discussion
To summarize, in this Chapter we tested and evaluated YOLOv2 when deployed on two robots: in a
domestic scenario and in the corridors of a university. We evaluated YOLOv2 trained on PASCAL
VOC and verified that the mean Average Precision (mAP) dropped from 60.6% to 39.0% in the case of
MBot in the domestic scenario (Table 3.2), and from 57.1% to 11.6% in the case of CoBot in the corridors
of the university (Table 3.3). We further evaluated the performance of YOLOv2 trained on the COCO
dataset and verified that some of the detected classes are not even present in these scenarios; for
example, the class "refrigerator" was falsely detected in both test scenarios (Figures 3.8 and 3.9).
On the other hand, it is important to note that YOLOv2 trained with these datasets, PASCAL VOC and
COCO, is limited to detecting 20 and 80 object classes, respectively. In real human scenarios the robot
will be confronted with an unpredictable number of classes that may even vary over time. For example,
new classes of objects are constantly introduced in our homes as technology evolves and new machines
are created.
Chapter 4
Learning Algorithm
This Chapter describes the learning algorithm and how the robot uses input from humans to create
a neural net, HUMAN, adapted to the local real-world scenario, and presents the approach of using
the two neural nets in parallel (YOLOv2 + HUMAN) to improve the robot's capability of recognizing
objects.
4.1 Learning Algorithm
In this Section, we introduce the learning algorithm. The goal of this algorithm is to create a neural net,
called HUMAN, that is able to detect objects within the scenario the robot is deployed in. The idea
is to train this neural net with images labeled by the humans present in that scenario.
The algorithm consists of three parts: Firstly (Section 4.1.1), the robot captures images and predicts
what and where the objects are located in the images. Secondly (Section 4.1.2), the robot asks the
humans questions to confirm its previous predictions and collect additional information. Lastly (Section
4.1.3), the robot re-trains its detector using the collected information to improve its future predictions.
In Algorithm 1 you can find the pseudocode of this learning algorithm. The functions starting with an
upper case letter ("CaptureAndPredict", "InteractWithHumans" and "Training") are the ones we describe
in this Chapter with their respective pseudocodes. We assume the robot applies the learning algorithm
at a set of pre-defined points in the map of the scenario (further explained in Chapter 5, Section 5.1.1).
Algorithm 1 Learning Algorithm
1: Input: target_position  // position (point in the map) from where the robot will capture images
2: Output: HUMAN  // resulting HUMAN neural net
3: procedure LearningAlgorithm(target_position)
4:     // Part I - Capturing Images and Predicting the Objects
5:     new_images_and_predictions = CaptureAndPredict(target_position)
6:     // Part II - Interacting with Humans
7:     new_labeled_images = InteractWithHumans(new_images_and_predictions)
8:     // Part III - Training
9:     save_data(new_labeled_images)
10:    all_labeled_images = load_all_data()
11:    if meets_training_conditions(all_labeled_images) then
12:        HUMAN = Training(all_labeled_images)
13:        return HUMAN
14:    end if
15: end procedure
4.1.1 Part I - Capturing Images and Predicting the Objects
The robot should be able to detect the surrounding objects independently of its current pose (location
+ orientation) relative to a fixed world frame. To address this, the robot defines a set of sparse reference
points in the map, from where it will capture the images. The robot also needs to be able to detect the
objects under different lighting conditions (related to the various hours of the day). To address this, we
defined that the robot goes to one reference point in the map per day, at a random time of that day.
Lastly, the robot should be able to detect the objects independently of its orientation relative to the fixed
world frame, so we defined that it captures 8 images while rotating about its axis (one per each 45° it
rotates) at the same reference point in the map. Consequently, in total the robot captures 8 pictures per
day, and these are the images it uses for learning to detect objects.
The robot starts by navigating to a position (pre-defined point in the map of the scenario) that it has
not visited before. In this location, it captures different images while rotating around its axis. Before
requesting help from a human, the robot predicts what the objects are and where they are located in
each of the images. From each object’s prediction, we retrieve information about the identified class,
its location in the image and finally a confidence level from 0 to 100%. In Algorithm 2 we can find the
pseudocode for this part (Part I) of the learning algorithm.
To make these predictions the robot uses a state-of-the-art detector trained on a set of classes. In this work we used YOLOv2 (explained in Chapter 2), trained on COCO and able to detect up to 80 classes of objects1. These 80 classes apply to a vast number of scenarios. For example, a domestic scenario usually includes classes like "person", "tv" or "sofa", while a garden would include classes like "bench", "bird" or "potted plant".
1The 80 COCO class labels can be found at http://cocodataset.org
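Each prediction can be seen as a triple of class label, confidence, and bounding box. A minimal sketch of such a record (the names and values below are our own illustration, not the thesis code):

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One detected object: a class label, a confidence level, and a box."""
    class_name: str
    confidence: float   # confidence level, 0 to 100 %
    bbox: tuple         # (x_min, y_min, x_max, y_max) in pixels

# A hypothetical prediction list for one captured image
prediction_list = [
    Prediction("chair", 87.5, (40, 120, 210, 330)),
    Prediction("tvmonitor", 62.0, (300, 80, 520, 240)),
]
```

These records are what Part II later shows to a human for verification, one at a time.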
Algorithm 2 Learning Algorithm - Part I - Capturing Images and Predicting the Objects
 1: Input: target_position // location (point in the map) from where the robot captures the images;
 2: Output: new_images_and_predictions // images captured by the robot and respective predictions;
 3:
 4: procedure CAPTUREANDPREDICT(target_position)
 5:     new_images_and_predictions = start_data_structure()
 6:     YOLOv2, HUMAN = load_detectors()
 7:     navigate_to(target_position)
 8:     // The images are captured while the robot rotates about its axis (explained in Chapter 5.1)
 9:     image_list, target_orientation_list = capture_images_while_rotating()
10:     for each image, target_orientation in image_list, target_orientation_list do
11:         image_id = save_image(image, target_position, target_orientation)
12:         prediction_list = predict_objects_in_image(YOLOv2, image)
13:         if HUMAN ≠ None then
14:             // The HUMAN neural net is also used to predict the objects in each image, and these predictions are combined with the ones from YOLOv2 (explained in Part III)
15:             HUMAN_prediction_list = predict_objects_in_image(HUMAN, image)
16:             prediction_list.append(HUMAN_prediction_list)
17:         end if
18:         new_images_and_predictions[image_id] = prediction_list
19:     end for
20:     Return new_images_and_predictions
21: end procedure
By the end of this part of the learning algorithm, the robot ends up with a set of images and the corre-
sponding predictions of the objects in those images.
4.1.2 Part II - Interacting with Humans
To confirm whether the previous predictions are correct, the robot now asks the humans around it for help (Part II). To interact with a human, the robot uses its tablet as an interface, described in Sub-section 4.1.2.
When interacting with humans, the robot asks questions for:
• Labeling the objects from predictions (Sub-section 4.1.2)
• Identifying to whom the objects belong (Sub-section 4.1.2)
Algorithm 3 presents the pseudocode for interacting with humans. The functions starting with an uppercase letter ("LabelObjects" and "ToWhomBelongs") are described in this Section.
Labeling the Objects from Predictions
As previously stated, the robot starts by capturing the images and generating predictions for what and
where the objects are in the image. To correctly label these objects, the robot will first confirm the
Algorithm 3 Learning Algorithm - Part II - Interacting with Humans
 1: Input: new_images_and_predictions // images captured by the robot and respective predictions;
 2: Output: new_labeled_images // images labeled with the human input;
 3:
 4: procedure INTERACTWITHHUMANS(new_images_and_predictions)
 5:     new_labeled_images = start_data_structure()
 6:     for each image_id, prediction_list in load_all(new_images_and_predictions) do
 7:         image = load_image(image_id)
 8:         wait_for_person_being_detected() // different images can be labeled by different people
 9:         // ask the person to label objects from predictions
10:         verified_prediction_list = LabelObjects(image, prediction_list)
11:         // ask the person to whom an object belongs
12:         verified_prediction_list = ToWhomBelongs(image, verified_prediction_list)
13:         new_labeled_images[image_id] = verified_prediction_list
14:     end for
15:     Return new_labeled_images
16: end procedure
previous predictions by asking a human present in the scenario for help. To confirm its predictions, the robot asks two Yes/No questions. For example, given a prediction of the object "cat", the robot asks:
• q1: Is this a cat, or part of it, in the rectangle?
• (if Yes to q1) - q2a: Is the rectangle properly adjusted to the cat?
• (if No to q1) - q2b: Ok, this is not a cat. Is the rectangle properly adjusted to an object?
The robot instructs the human by explaining that the properly adjusted rectangle refers to a rectangle
specifying the extent of the object visible in the image. It also explains that objects seen in a mirror or a
TV monitor should also be labeled.
These first two questions (q1 and q2) yield 4 possible outcomes (see Figure 4.1). If the human answers No to q2a, the robot asks for an adjustment of the bounding box. If the human answers Yes to q2b, the robot asks for a relabeling. Algorithm 4 presents this pseudocode.
After confirming and correcting the predictions, the robot then asks:
• q3: Are all the objects in this image labeled?
• (if Yes to q3): The image is ready for Part III - Training.
• (if No to q3): The robot asks the human to adjust the rectangle to an object and label it. Then it repeats the question (q3).
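The decision implied by q1 and q2 can be sketched as a small function mapping the two Yes/No answers to one of the four outcomes of Figure 4.1 (the function name and the outcome strings are our own illustration):

```python
def decide_action(answer_q1: str, answer_q2: str) -> str:
    """Map the two Yes/No answers to one of the four possible outcomes."""
    if answer_q1 == "Yes":
        # q2a: Is the rectangle properly adjusted to the <class>?
        return "keep_prediction" if answer_q2 == "Yes" else "adjust_bounding_box"
    # q1 was No -> q2b: Is the rectangle properly adjusted to an object?
    return "relabel_object" if answer_q2 == "Yes" else "discard_prediction"
```

For example, `decide_action("Yes", "No")` corresponds to asking the human to relocate the bounding box.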
When labeling, a list of the previously labeled object classes is shown. If the person does not find the object's class in this list, a new class can be added from a predefined dictionary of classes. This
Figure 4.1: Example outcomes from q1 and q2. Cat image credit to Flickr user William McCamment.
dictionary constrains the lexical redundancy within a language; for example, "waste bin" and "trash can" are both categorized under the single class "wastecontainer".
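Such a dictionary can be sketched as a simple synonym-to-canonical-class mapping (the entries and helper below are illustrative; the thesis draws the canonical classes from the OpenImages list):

```python
# Illustrative synonym dictionary: several everyday names map to one class.
CLASS_DICTIONARY = {
    "waste bin": "wastecontainer",
    "trash can": "wastecontainer",
    "garbage can": "wastecontainer",
    "tv": "tvmonitor",
    "television": "tvmonitor",
}


def canonical_class(user_label: str) -> str:
    """Return the canonical class for a user-typed label (identity if unknown)."""
    return CLASS_DICTIONARY.get(user_label.lower(), user_label.lower())
```

This way, "Trash Can" and "waste bin" both resolve to "wastecontainer" before training.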
When adding a new class, the robot shows the person a zoomable diagram (Figure 4.2), organized by categories from the OpenImages dataset [55]. This diagram includes 600 classes in an organized structure. The robot only accepts new classes from this list.
Algorithm 4 presents the pseudocode for labeling the objects in the images collected by the robot in Part I. As previously explained in this Section, the robot first asks a human for help and then verifies each of the predictions generated by its object detectors.
Figure 4.2: Some classes that can be added to the robot, using the labels from the OpenImages dataset [55].
To Whom an Object Belongs
We tagged the OpenImages classes as personal or not personal (e.g. classes like "mobile phone" and "backpack" are tagged as personal, and others like "oven" and "table" as not personal). This tagging is used to ask the second type of question.
After labeling all the objects in the image, the robot asks to whom each object belongs (for objects tagged with a personal class). Algorithm 5 presents this pseudocode. This way, the robot also stores information that will be useful when searching for an object belonging to a specific person. The robot instructs the human to answer "communal" when a personal object is shared.
Human-Robot Interface
To interact with the human, the robot uses its tablet as an interface and reads each question out loud. It would also have been possible to use voice recognition to capture part of the input from the user (for example, the Yes or No answers). However, voice commands could potentially generate more errors and would make it harder for us to collect all the needed input. For example, to show an object in the image to the robot, the human can simply drag a "rectangle" on the tablet (see Figure 4.3). Figure 4.4 shows examples of MBot and Pepper interacting with a human.
Algorithm 4 Learning Algorithm - Part II - Interacting with Humans - Labeling Objects from Predictions
 1: Input: image // image captured by the robot;
 2:        prediction_list // list of predictions corresponding to that input image;
 3: Output: verified_prediction_list // human-verified list of objects corresponding to that input image;
 4:
 5: procedure LABELOBJECTS(image, prediction_list)
 6:     verified_prediction_list = [ ]
 7:     for each prediction in prediction_list do
 8:         class_name, confidence, bounding_box = get_prediction_info(prediction)
 9:         // e.g.: Is this a cat, or part of it, in the rectangle?
10:         q1_question = "Is this a/an " + class_name + ", or part of it, in the rectangle?"
11:         answer_to_q1 = ask_human_yes_or_no(q1_question, image, bounding_box)
12:         if answer_to_q1 is Yes then
13:             // e.g.: Is the rectangle properly adjusted to the cat?
14:             q2a_question = "Is the rectangle properly adjusted to the " + class_name + "?"
15:             answer_to_q2a = ask_human_yes_or_no(q2a_question, image, bounding_box)
16:             if answer_to_q2a is Yes then
17:                 verified_prediction_list.append(prediction)
18:             else if answer_to_q2a is No then
19:                 adjusted_prediction = ask_human_to_relocate_bounding_box(class_name, image, bounding_box)
20:                 verified_prediction_list.append(adjusted_prediction)
21:             end if
22:         else if answer_to_q1 is No then
23:             // e.g.: Ok, this is not a cat. Is the rectangle properly adjusted to an object?
24:             q2b_question = "Ok, this is not a " + class_name + ". Is the rectangle properly adjusted to an object?"
25:             answer_to_q2b = ask_human_yes_or_no(q2b_question, image, bounding_box)
26:             if answer_to_q2b is Yes then
27:                 adjusted_prediction = ask_human_to_relabel_object(image, bounding_box)
28:                 verified_prediction_list.append(adjusted_prediction)
29:             else if answer_to_q2b is No then
30:                 continue
31:             end if
32:         end if
33:     end for
34:     are_all_objects_labeled = False
35:     do
36:         q3_question = "Are all the objects in this image labeled?"
37:         answer_to_q3 = ask_human_yes_or_no(q3_question, image, verified_prediction_list)
38:         if answer_to_q3 is Yes then
39:             are_all_objects_labeled = True
40:         else if answer_to_q3 is No then
41:             new_object = ask_human_to_label_new_object(image, verified_prediction_list)
42:             verified_prediction_list.append(new_object)
43:         end if
44:     while are_all_objects_labeled ≠ True
45:     Return verified_prediction_list
46: end procedure
4.1.3 Part III - Training
The interaction with humans results in a set of labeled images. Every time the robot collects a multiple of 50 images, it trains a neural net (HUMAN50, HUMAN100, HUMAN150, and so on) using all the human-labeled images. From these images, 70% are used for training and 30% for testing. The value
Algorithm 5 Learning Algorithm - Part II - Interacting with Humans - To Whom an Object Belongs
 1: Input: image // image captured by the robot;
 2:        verified_prediction_list // human-verified list of objects corresponding to that input image;
 3: Output: verified_prediction_list // final list of objects corresponding to the input image;
 4:
 5: procedure TOWHOMBELONGS(image, verified_prediction_list)
 6:     for each verified_prediction in verified_prediction_list do
 7:         class_name, bounding_box = load_info(verified_prediction)
 8:         if is_personal_object_class(class_name) then
 9:             question = "Whose " + class_name + " is this?" // e.g. Whose phone is this?
10:             answer = ask_question(question, image, bounding_box)
11:             save_information(verified_prediction, answer)
12:         end if
13:     end for
14:     Return verified_prediction_list
15: end procedure
(a) Showing Where the object is. (b) Showing What the object is.
Figure 4.3: Example of the human-robot interface when labeling an object.
of 50 images corresponds, in practice, to training the neural net once per week, since in Chapter 5.1 we define that the robot goes to one target position (point in the map) per day and captures a total of 8 images (at different orientations).
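The 70/30 split performed at each training stage can be sketched as a shuffled partition of the labeled image ids (the exact split procedure is not detailed in the thesis; this is one reasonable implementation):

```python
import random


def split_train_test(image_ids, test_fraction=0.3, seed=0):
    """Shuffle the labeled images and split them into train and test sets."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_test = int(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]   # (train, test)


# With 50 collected images, this yields 35 training and 15 test images.
train, test = split_train_test(range(50))
```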
The HUMAN neural net uses the same structure and algorithm as YOLOv2, but is re-trained using only the images labeled by humans. As previously described in Chapter 2, YOLOv2 trains its detector on top of convolutional weights pre-trained on ImageNet.
We train the HUMAN neural net using only the images labeled by humans because YOLOv2 uses the full images when training (see Chapter 2, Sub-section 2.2.3). If we wanted to add a new class, for example "lamp", to YOLOv2 trained on the COCO dataset, we would need to label all the lamps in thousands of images of that dataset. Otherwise, when training with the thousands of images from the COCO dataset, the neural net would always treat all the
Figure 4.4: Mbot (top) and Pepper (bottom) learning.
"lamps" in those pictures as false detections, and therefore it would not be able to learn to recognize this object.
After training, the HUMAN neural net is also used in Part I - Capturing Images and Predicting the Objects. Using the predictions from both neural nets, the robot is able to predict more of the objects present in an image. Note that it is also possible for one object to be detected twice (e.g. first by YOLOv2 and then by HUMAN). In this case we may end up with a duplicate label for the same object. This, however, is not a problem, as we explain in the next Section.
Filtering Labels for Training
Before training the neural net, we remove repeated labels over the same area of the image.
For example, looking at Figure 4.5, if a person answers Yes to q1 (correct label) and No to q2a (bounding box poorly adjusted) for two different predictions pertaining to the same object, then we get two repeated labels. To avoid this problem, if an IOU (Intersection Over Union) larger than 50% is detected between two labels of the same class in an image (e.g. the class "bed" in Figure 4.5), we say we have a repeated label. In this case, we only use one of them (the one with the larger area) for training. Algorithm 6 presents the pseudocode for removing the repeated labels and then training the detector.
The IOU is a measure of how closely two bounding boxes match. To calculate the IOU, we divide the intersection area by the union (total area) of the bounding boxes (previously explained in Chapter 2,
Figure 4.5: An example case where the robot gets repeated labels for the same object.
Sub-section 2.3.3). The PASCAL VOC competition defined that when the IOU is larger than or equal to 50%, there is a match between the ground-truth and the label [34]. This IOU threshold is currently the standard in the object detection literature, so it was also the one adopted during the development of this work.
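The IOU computation described above follows directly from its definition; a minimal sketch, with boxes given as (x_min, y_min, x_max, y_max) tuples (a convention of ours, not the thesis code):

```python
def calculate_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For instance, two identical boxes give an IOU of 1.0, while two 10×10 boxes shifted by half their width overlap on 50 of 150 total units, giving an IOU of 1/3, below the 50% matching threshold.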
Algorithm 6 Learning Algorithm - Part III - Training
 1: Input: all_labeled_images // all images coupled with corresponding object labels;
 2: Output: HUMAN // resulting HUMAN neural net;
 3:
 4: procedure TRAINING(all_labeled_images)
 5:     // filter multiple objects with the same class label
 6:     for each image_id, verified_prediction_list in load_info(all_labeled_images) do
 7:         already_tested_combinations = [ ]
 8:         found_repeated_label = True
 9:         while found_repeated_label do
10:             found_repeated_label = False
11:             for each label_1, label_2 in list(combinations(verified_prediction_list, 2)) do
12:                 if (label_1, label_2) not in already_tested_combinations then
13:                     already_tested_combinations.append((label_1, label_2))
14:                     class_name_1, bbox_1 = load_label_info(label_1)
15:                     class_name_2, bbox_2 = load_label_info(label_2)
16:                     if class_name_1 == class_name_2 and calculate_IOU(bbox_1, bbox_2) ≥ 0.5 then
17:                         // repeated label
18:                         if get_area(bbox_1) < get_area(bbox_2) then
19:                             verified_prediction_list.remove(label_1)
20:                         else
21:                             verified_prediction_list.remove(label_2)
22:                         end if
23:                         found_repeated_label = True
24:                         break
25:                     end if
26:                 end if
27:             end for
28:         end while
29:         all_labeled_images[image_id] = verified_prediction_list
30:     end for
31:     test_percentage = 0.3 // 30% of the images are used for testing, as previously stated;
32:     train_set = set_up_train((1 − test_percentage), all_labeled_images)
33:     test_set = set_up_test(test_percentage, all_labeled_images, train_set)
34:     HUMAN = train_neural_net(train_set, test_set)
35:     Return HUMAN
36: end procedure
4.2 YOLOv2 + HUMAN
In this Section, we explain the challenges of using the two neural nets in isolation (Section 4.2.1) and how the combined system deals with duplicate detections (Section 4.2.2).
4.2.1 Problem and Solution
The problem of using only the HUMAN neural network is that we would lose the current capabilities of YOLOv2 (trained with thousands of pictures in a global effort). For example, suppose that, after training the HUMAN neural network, a new class is introduced to the scenario where the robot operates. If this class was present in YOLOv2's training set, YOLOv2 would most likely be able to detect it, but the HUMAN neural net would not. To solve this, we propose using a combination of the two neural networks – YOLOv2 + HUMAN. In a real human scenario, robots will need to interact with an unpredictable set of classes. When combined, the two neural nets cooperate, resulting in more robust object detection. We can visualize an example of this cooperation in Figure 4.6.
Figure 4.6: Example of YOLOv2+HUMAN using predictions from both the neural nets.
This way, the robot is able to:
• detect the same objects as the default YOLOv2 (trained with thousands of images in a global effort);
• detect new objects from the scenario where the robot is inserted, using the HUMAN neural net (trained with a small number of images of the same objects to be detected).
As explained in Chapter 4, the HUMAN neural net is trained with the images collected by the robot and labeled by the humans that interacted with it. The robot asks the humans to label all the objects in the captured images so it can always learn new classes. Although the HUMAN neural net is trained with a small number of images, those images contain the same objects that will be detected later on in that same scenario. This is why the neural net is able to learn these objects even with a small number of training samples (validated in Chapter 5).
4.2.2 Removing Duplicate Detections
The HUMAN neural net learns to detect object classes from the images labeled by the humans interacting with the robot. Since the humans decide which classes to teach the robot, there may be cases where the HUMAN neural net learns to detect a class that YOLOv2 was already able to detect. As a consequence, running the two neural nets in parallel may lead to repeated detections of the same object.
When the two neural nets detect the same object (with the same class label) in the same area of the picture (IOU ≥ 50%), we use the prediction with the higher confidence (see the example in Figure 4.7). By default, the YOLOv2 algorithm assigns a confidence from 0.0 to 1.0 to each prediction [9].
Figure 4.7: Duplicate detection example. YOLOv2+HUMAN uses the prediction with higher confidence.
In fact, this is a form of non-maximum suppression, but instead of removing duplicate detections from a single neural net, it removes duplicates from the predictions of two neural nets. The pseudocode can be found in Algorithm 7.
In Chapter 5 we validate this method by evaluating whether using the two neural nets in parallel with this technique leads to better results than using them separately.
Algorithm 7 YOLOv2 + HUMAN - Remove Duplicates
 1: Input: YOLOv2_predictions // list of predictions from the YOLOv2 neural net;
 2:        HUMAN_predictions // list of predictions from the HUMAN neural net;
 3: Output: merged_predictions // list of predictions merging the input from the two neural nets;
 4:
 5: procedure REMOVEDUPLICATES(YOLOv2_predictions, HUMAN_predictions)
 6:     tested_combinations = [ ]
 7:     collision_found = True
 8:     while collision_found do
 9:         collision_found = False
10:         // apply cartesian product to get combinations between the two lists
11:         combinations_list = list(product(YOLOv2_predictions, HUMAN_predictions))
12:         for each (pred_YOLOv2, pred_HUMAN) in combinations_list do
13:             if (pred_YOLOv2, pred_HUMAN) not in tested_combinations then
14:                 tested_combinations.append((pred_YOLOv2, pred_HUMAN))
15:                 class_Y, confidence_Y, bbox_Y = load_prediction_info(pred_YOLOv2)
16:                 class_H, confidence_H, bbox_H = load_prediction_info(pred_HUMAN)
17:                 if class_Y == class_H and calculate_IOU(bbox_Y, bbox_H) ≥ 0.5 then
18:                     if confidence_Y > confidence_H then
19:                         HUMAN_predictions.remove(pred_HUMAN)
20:                     else
21:                         YOLOv2_predictions.remove(pred_YOLOv2)
22:                     end if
23:                     collision_found = True
24:                     break
25:                 end if
26:             end if
27:         end for
28:     end while
29:     merged_predictions = YOLOv2_predictions + HUMAN_predictions
30:     Return merged_predictions
31: end procedure
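The merging step above can be sketched in Python as follows. Predictions are represented here as (class_name, confidence, bbox) tuples, and a single pass over the cartesian product replaces Algorithm 7's restart-and-skip loop, since each pair is considered exactly once either way (the helper names are ours, not the thesis code):

```python
from itertools import product


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def remove_duplicates(yolo_preds, human_preds):
    """Merge two prediction lists, keeping the higher-confidence duplicate.

    Each prediction is a (class_name, confidence, bbox) tuple."""
    yolo, human = list(yolo_preds), list(human_preds)
    for p_y, p_h in product(list(yolo), list(human)):
        if p_y in yolo and p_h in human:  # skip pairs already removed
            if p_y[0] == p_h[0] and iou(p_y[2], p_h[2]) >= 0.5:
                if p_y[1] > p_h[1]:
                    human.remove(p_h)
                else:
                    yolo.remove(p_y)
    return yolo + human
```

For example, a "chair" detected by both nets on nearly the same box survives only once, with the higher-confidence prediction kept.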
Chapter 5
Results
In this Chapter, we evaluate the results of the experiments conducted during the development of this work. Firstly, we analyze the results obtained in the domestic scenario experiment using the learning algorithm (Section 5.2); then, we analyze the results from additional experiments (Section 5.3).
5.1 Experimental Setup
In this Section, we report the experimental setup. Firstly, we describe the domestic environment experiment (Section 5.1.1); then, the experiment on searching for a specific person's object (Section 5.1.2).
5.1.1 Domestic Environment
Using our current computational resources, running the two neural nets as separate processes requires an external computer with a GPU. We ran the computations and trained the neural nets on this external computer1. Training a neural net took 6 hours on average, for each of the 50, 100, and 150 images collected using the learning algorithm (Chapter 4). You can find more details about the neural net in Chapter 2. While YOLOv2 is able to recognize objects in real time, our bottleneck was the robot's WiFi connection: we were able to detect objects at 5 frames per second.
We used a realistic domestic scenario – the ISRoboNet@Home Testbed2 (Figure 5.1) – composed of one bedroom, one living room, one dining room, and a kitchen. The experiment was conducted over the course of 20 days, taking a total of 6 hours. Six research participants acted as the human input for the robot,
1 GPU: NVIDIA GeForce GTX 1080 Ti; CPU: Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
2 More info can be found at http://isrobonet_athome.isr.tecnico.ulisboa.pt/
Figure 5.1: ISRoboNet@Home Testbed – Domestic scenario where we ran the experiments.
answering questions about objects, in order to evaluate whether the robot could train the HUMAN neural net when confined to a small-scale environment. We also tested the applicability of running the two neural nets in parallel – YOLOv2 + HUMAN – and evaluated the correctness of the robot's predictions.
Established Points of Reference
The robot should be able to recognize the surrounding objects independently of its current pose (location + orientation) relative to a fixed world frame. There are infinitely many poses within a confined environment, but the robot cannot feasibly ask an infinite number of questions to fulfill its objective. To solve this, it starts by defining a set of sparse points: when the distance between the robot and every existing reference point is greater than a fixed distance, it creates a new reference point. We can visualize this as a robot with an imaginary circle around it that creates a new reference point whenever there are no other points inside the circle (Figure 5.2).
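This rule can be sketched as: add the robot's current position as a new reference point whenever no existing point lies within a fixed radius of it (the radius value below is illustrative; the thesis does not state the exact distance):

```python
import math


def update_reference_points(robot_xy, points, radius=1.5):
    """Add the robot's position as a new reference point if no existing
    point lies within `radius` metres of it."""
    if all(math.dist(robot_xy, p) > radius for p in points):
        points.append(robot_xy)
    return points


# As the robot moves, only sufficiently distant poses become reference points.
points = []
for pose in [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0)]:
    update_reference_points(pose, points)
```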
Using the Points of Reference
The robot should also be able to recognize the objects under different lighting conditions (corresponding to various hours of the day) and from different positions, angles, and perspectives. We defined that each day the robot navigates to one reference point, at a random time, and captures a total of 8 images at different orientations relative to a fixed world frame. This way it captures different exposures, viewpoints, and the inherent unpredictability of everyday objects being misplaced.
Figure 5.2: Points of reference comprising the surrounding area.
5.1.2 Looking for a Specific Person's Object
When the robot asks a human a question about an object associated with a certain Point of Reference, it gathers the spatial information of where that object was detected.
If the robot registers, for example, a "backpack" as seen 1 time in the living room and 10 times in the bedroom, then, when asked to search for that object, it searches the rooms in decreasing order of the number of occurrences of that object. This somewhat emulates the thought process of humans when they are looking for a misplaced object: they look in the most common places where they see or use said object.
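The resulting search order can be sketched with a counter of past sightings per room (the room names and counts are the example from the text; the helper is our own illustration):

```python
from collections import Counter

# Where "backpack" was previously detected, per room
sightings = Counter({"bedroom": 10, "living room": 1})


def search_order(counts: Counter):
    """Rooms sorted by decreasing number of past detections of the object."""
    return [room for room, _ in counts.most_common()]

print(search_order(sightings))  # ['bedroom', 'living room']
```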
Furthermore, when a user asks where his or her object is, the robot starts by detecting the objects using the neural nets (YOLOv2 + HUMAN150). When the searched personal object class is detected, e.g. "backpack", the robot compares the part of the image inside the bounding box with all the previous instances of backpacks it has recorded, and associates it with the one sharing the largest number of features (using OpenCV's SIFT implementation3).
5.2 Domestic Scenario Results
In this Section, we evaluate and compare YOLOv2 with the HUMAN neural network at its different stages (after collecting 50, 100, and 150 images). As previously explained, the HUMAN neural
3The code can be found at https://opencv.org/
net was trained using the introduced learning algorithm (Chapter 4) during the previously described
experiment (Section 5.1).
The goal of this Section is to show that the HUMAN neural net performs comparably to YOLOv2 when restricted to a set of objects, and that when the two are used in parallel – YOLOv2 + HUMAN – we can further increase this performance.
5.2.1 Correctness of Predictions
As previously explained, both YOLOv2 and the HUMAN neural net were used to predict the objects when the robot ran the learning algorithm (Chapter 4) during the 20-day experiment. These predictions were then shown to a human, who answered two Yes/No questions. The answers to these two questions determine whether the object in each prediction was correctly labeled and whether the bounding box was well adjusted to that object (as previously illustrated in Figure 4.1, in Chapter 4).
During the 20-day experiment, the robot trained 3 stages of the HUMAN neural net: HUMAN50, HUMAN100, and HUMAN150. The robot took approximately 50 images per week. In the first week (fewer than 50 images), the robot used only YOLOv2 to generate the predictions. In the second week (from 50 to 100 images), the robot used YOLOv2 and HUMAN50. Finally, during the last week (from 100 to 150 images), it used YOLOv2 and HUMAN100.
Tables 5.1, 5.2, and 5.3 show the results of the Yes/No questions. For both neural networks, over 50% of all predictions were true positives (Label ✓ and Bounding Box ✓), while fewer than 15% were fully incorrect (Label ✗ and Bounding Box ✗). The remaining predictions ended up with the human relocating the bounding box or relabeling the object, as previously described in Chapter 4, which allowed the participants to quickly fix them. We also observed that, although the HUMAN neural net generated a smaller number of predictions, those predictions performed comparably to YOLOv2's, even when the net was trained with only 50 images.
Table 5.1: Predictions - Week 1 - Images 0 to 50.

YOLOv2 - total number of predictions: 190

(values in %)   Bounding Box ✓   Bounding Box ✗
Label ✓              60.0             13.7
Label ✗              14.2             12.1
Table 5.2: Predictions - Week 2 - Images 50 to 100.

YOLOv2 - total number of predictions: 162

(values in %)   Bounding Box ✓   Bounding Box ✗
Label ✓              61.7             16.7
Label ✗               9.3             12.3

HUMAN50 - total number of predictions: 122

(values in %)   Bounding Box ✓   Bounding Box ✗
Label ✓              54.1             33.6
Label ✗               4.1              8.2
Table 5.3: Predictions - Week 3 - Images 100 to 150.

YOLOv2 - total number of predictions: 210

(values in %)   Bounding Box ✓   Bounding Box ✗
Label ✓              58.0             14.3
Label ✗              16.7             11.0

HUMAN100 - total number of predictions: 107

(values in %)   Bounding Box ✓   Bounding Box ✗
Label ✓              52.3             24.3
Label ✗              13.1             10.3
During the experiment, we identified two primary categories of human labeling error: (a) unlabeled objects, which happens most frequently when objects are difficult to label (e.g. small objects) or when people could not find the object's class in the list (e.g. "fire extinguisher" was not included in the OpenImages labels); (b) wrong labels (e.g. the human clicks Yes when they should have clicked No).
Despite these human errors, we observed (as explained in the next Section) that the mean Average Precision (Table 5.4) and the percentage of correct predictions (Appendix B) increased over the three stages of the HUMAN neural net. We believe that, since human mistakes were infrequent, they did not significantly affect the training of the neural networks.
5.2.2 Evaluation of the Neural Networks
To evaluate the proposed system, we compared the neural nets using an external ground-truth composed of 100 pictures. These were captured by the robot in different places of the experiment scenario and on different days, without the constraint of being at a Point of Reference (the locations from which the robot learned, as explained in Chapter 5.1). They also included different lighting conditions and
(a) Night time (with the robot not moving) (b) Day time (with the robot moving)
Figure 5.3: Example images captured by the robot for the ground-truth.
10 blurred pictures (due to the robot moving). Figure 5.3 shows some of these pictures, and Figure 5.4 provides more detail about the number of instances of each object class captured in the ground-truth. Note that, since the pictures were captured "naturally" around the scenario, some classes are more prevalent than others. For example, the class "chair" is the most frequent one, since there were a total of 6 chairs spread around the experiment scenario. On the other hand, classes like "shirt" and "bicyclehelmet" were only present in the scenario a couple of times.
Given this ground-truth, we can compare the results in Table 5.4. This table shows the resulting Average Precision (AP) for each object class (one per row) and for each neural net (one per column). The last two rows of the table show the total number of classes detected by each neural net and the corresponding mean Average Precision (mAP).
Note that in Table 5.4 some classes are not present for some neural nets (marked with a hyphen "-"). For example, the class "doll" only appeared a couple of times in the pictures during the last week of the experiment, so only the last stage of the HUMAN neural net (HUMAN150) could possibly detect it. If a class is not present in the training set, it cannot be detected, hence the "-". The last column shows the results when using the two neural nets in parallel – YOLOv2 + HUMAN150.
Appendix A contains more information about the training sets of the HUMAN neural net at its
different stages. Its plots (Figures A.1, A.2 and A.3) show that the class “bicyclehelmet”, for
example, was the least frequent class in the training set (it appeared only once). So, unless that picture of
the “bicyclehelmet” was very similar to the one in the ground-truth, it would be very hard to detect (in
practice, it was not detected: bicyclehelmet AP = 0).
In Table 5.4 we can see that YOLOv2 could only possibly detect a total of 20 of the classes present
in the ground-truth, while HUMAN50 could detect 36, HUMAN100 40, and HUMAN150 41.
Figure 5.4: Information about the ground-truth.
Table 5.4: Average Precision using an external ground-truth composed of 100 images in different poses. Values are in percentage (%); the largest value in each row is marked with * (shown in bold in the original).

values in %          YOLOv2    H50     H100    H150    YOLOv2+H150
apple                 25.0*      0     25.0*    12.5     25.0*
backpack               6.3     39.9    79.1*    53.1     57.6
banana                20.0       0     20.0     20.0     40.0*
bed                   66.7     16.7    48.8     41.7     89.6*
bicyclehelmet           -        0       0        0        0
book                  13.2     23.7    24.0     26.2     28.8*
bookcase                -      22.2    30.6     33.3*    33.3*
bottle                31.3       -      6.7      6.7     39.6*
bowl                  48.1       0     15.8     28.6     54.7*
cabinetry               -      36.4    44.8     55.2*    55.2*
candle                  -        -       0      16.7*    16.7*
chair                 55.5     27.0    39.0     45.0     56.4*
coffeetable             -      24.8    35.1     57.1*    57.1*
countertop              -      46.5    76.2     85.7*    85.7*
cup                   25.5      1.0    32.8     32.3     43.1*
diningtable           40.6     35.2    51.1     46.6     52.1*
doll                    -        -       -        0        0
door                   0       58.1    70.6*    61.8     61.8
doorhandle              -        0       0      16.3*    16.3*
envelope                -        -       0       3.9*     3.9*
glasses                0       56.9*   31.3     53.1     53.1
heater                  -      28.6    40.8*    35.7     35.7
lamp                    -        0       0      18.3*    18.3*
nightstand              -      62.5*   50.0     50.0     50.0
orange                20.0       0     20.0     40.0*    40.0*
person                50.0*      0     37.5     25.0     47.5
pictureframe            -      35.8*   35.0     34.6     34.6
pillow                  -      25.9    22.9     36.2*    36.2*
pottedplant           70.7*    41.9    45.9     66.7     63.6
remote                65.1*      0       0        0      65.1*
shelf                   -        0       0      41.7*    41.7*
shirt                   -        0     50.0*    50.0*    50.0*
sink                  15.8       -      6.7     13.3     20.8*
sofa                  88.0      7.7    38.5     61.3     92.3*
tap                     -        0      5.6     36.9*    36.9*
tincan                  -       9.5    34.2*    26.6     26.6
tvmonitor             38.2     36.4    55.6     74.0*    67.8
vase                  20.8     20.0     6.7     11.1     24.0*
wardrobe                -      12.5      0      50.0*    50.0*
wastecontainer          -      90.9    98.6     99.2*    99.2*
windowblind             -      35.3    35.3     47.1*    47.1*
# total of classes    20       36      40       41       41
mAP                   35.0     22.1    30.3     36.9     44.3*
HUMAN50 already included 36 of the 41 classes that were labeled (88%). Given that the objects
present in the scenario remained the same throughout the experiment, by using the Points
of Reference at different orientations (previously explained in Section 5.1.1) the robot efficiently captured
most of the objects in the first 50 pictures.
When using the two neural nets in parallel (YOLOv2 + HUMAN), we obtain the union of their detections. When
a class is not embedded in YOLOv2, YOLOv2 + HUMAN outputs the same as the HUMAN neural net alone.
For example, “countertop”, which is not present in YOLOv2, can only be detected by the HUMAN neural
net (leading to an identical Average Precision score for YOLOv2 + HUMAN150 and HUMAN150).
The same applies the other way around.
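The parallel combination amounts to a union of the two detection lists. A minimal sketch (the detection tuple format and the dummy detector are illustrative assumptions, not the thesis implementation; duplicate boxes of the same class are later resolved by the non-maximum suppression step of Chapter 4.2):

```python
class DummyNet:
    """Stand-in for a detector; returns fixed (class, confidence, bbox) tuples."""
    def __init__(self, detections):
        self.detections = detections

    def detect(self, image):
        return list(self.detections)

def detect_parallel(image, yolo, human):
    """Run both nets on the same image and take the union of their detections."""
    return yolo.detect(image) + human.detect(image)

# YOLOv2 knows "sofa"; only the HUMAN net knows "countertop".
yolo = DummyNet([("sofa", 0.88, (10, 10, 200, 120))])
human = DummyNet([("countertop", 0.75, (0, 130, 320, 240))])
print(detect_parallel(None, yolo, human))
```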
Another possible scenario is when a class is present in both neural nets. In this case, we also
need to filter duplicate detections of the same object class in a picture, if they exist, via non-maximum
suppression (when two bounding boxes of the same class conflict, we keep the one with the higher
confidence score, as previously explained in Chapter 4.2). We can evaluate the cooperation between
the neural nets by taking a closer look at Table 5.4. Out of the 41 class scores, there were only two
classes where YOLOv2 had a higher score than YOLOv2 + HUMAN150 (“person” and “pottedplant”)
and only one class where HUMAN150 had a higher score than YOLOv2 + HUMAN150 (“tvmonitor”).
In these cases, the difference in score was always smaller than 8%. In the other 38 classes, using
the neural nets in parallel led to a score higher than or equal to the maximum of the other two columns
(YOLOv2 and HUMAN150 used in isolation), with improvements of up to 23% (see the class “bed”,
for example). We also marked the highest score for each class: out of the 41 classes, 31 of them (75%)
scored highest when using the two neural nets in parallel.
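The duplicate filtering described above can be sketched as a greedy, confidence-ordered non-maximum suppression (the IoU threshold of 0.5 is an assumed value, not taken from the thesis):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(detections, iou_threshold=0.5):
    """Keep the higher-confidence box when two boxes of the same class overlap.
    detections: list of (class_name, confidence, (x1, y1, x2, y2))."""
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep this box only if it does not heavily overlap an already-kept
        # box of the same class.
        if all(det[0] != k[0] or iou(det[2], k[2]) < iou_threshold for k in kept):
            kept.append(det)
    return kept
```

When both nets report the same sofa, only the higher-confidence box survives, while boxes of other classes are untouched.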
Appendix B contains information about all the predictions generated by these neural nets on the
ground-truth pictures. In Figure B.1, for instance, we can see that YOLOv2 also detected objects
that were not present in the ground-truth: although there was not a single “refrigerator” in
the experiment scenario, YOLOv2 detected a total of 39 instances of refrigerators in those ground-truth
pictures. All of these were therefore false predictions.
Finally, we compared the most relevant score: the mAP (mean Average Precision). In this experiment,
we verified the improvement of the HUMAN neural net through its increasing mAP, from 22.1% to 30.3%
and finally 36.9%, an average increase of 7.5% between stages. In practice, this is what we would
expect: the more pictures the robot has to train with, the better the results should be.
In the table, we can also note that for some classes (e.g., “backpack”) the scores obtained with the
HUMAN neural nets surpassed those of YOLOv2. In fact, looking at the final mAP score of HUMAN150,
we can see that this neural net already achieved a score higher than the default YOLOv2 while being able to
detect twice the number of classes. Even more impressive was the final score obtained with the neural
nets in parallel, a total of 44.3%, almost 10% higher than the YOLOv2 score.
Although the HUMAN150 score already surpassed that of YOLOv2, there is still a big advantage in
using the two neural networks in parallel (YOLOv2 + HUMAN150). For example, imagine we introduce
a new object class to the scenario: a “cat”. Since this class was never present in the HUMAN training
data, that neural net will not be able to detect it. YOLOv2, however, trained with thousands of images in
a global effort, will most likely detect the “cat”. Since YOLOv2 + HUMAN is a union of the detections, it will
also be able to detect that “cat”, with the same confidence as YOLOv2 used in isolation.
5.2.3 Discussion
To summarize, we started by verifying that both YOLOv2 and HUMAN are able to predict the locations
(bounding boxes) and class labels of the objects in the images. During the 20-day experiment, both
neural networks were used to detect objects in the images collected by the robot, and more than
50% of these predictions (for both neural networks) corresponded to true positives (Label ✓ and
Bounding Box ✓), as described in Subsection 5.2.1. Then, in Subsection 5.2.2, we verified the con-
tinuous improvement in accuracy of the robot’s object detection skills. The HUMAN neural network
(trained using only the input from humans in close proximity) increased its mean Average Precision
(mAP) score from 22.1% with 50 training images, to 30.3% with 100 images, and finally to 36.9% with
150 images. The last stage of the HUMAN neural net obtained a mAP score 1.9% higher than YOLOv2,
which scored 35.0%. Finally, we verified that by using the two neural networks in parallel (YOLOv2 + HU-
MAN) we achieved a score of 44.3% (almost 10% higher than the scores of either neural network used
in isolation). Also note that YOLOv2 + HUMAN was able to detect twice the number of object classes
when compared to YOLOv2.
5.3 Further Experiments
In this Section, we present two further experiments, conducted once we realized the wide range of
applications that can be derived from the data collected by the robot when using the introduced
learning algorithm. The purpose of this Section is to show use cases for this work. We view these
experiments as good starting points for possible future work.
5.3.1 Looking for a Specific Person's Object
In Figure 5.5, we can see the results of our next experiment, where we simulated the misplacement of
a backpack and remotely requested the robot to search for it (using the Telegram bot API4). As the figure
suggests, the robot searched the surrounding environment and, after approximately 2 minutes, had a
positive match on the subject’s backpack. Notably, there were two more backpacks present in the
scene, and Pepper was able to identify the desired one.
The goal of this simple experiment was to present a possible use case built on the data collected
by the robot. Providing the robot with the ability to look for a specific person's object would be of
great value, particularly for helping the elderly and disabled (e.g., a robot helping a visually impaired
person look for objects).
4 More info can be found at https://core.telegram.org/
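One way such a request could be answered (a purely illustrative sketch; the record structure and function are our own assumptions, not the implementation behind Figure 5.5) is to store, for each human-taught object, who taught it, and match incoming requests against that record:

```python
# Hypothetical records of human-taught objects: which person labeled each
# instance, and where the robot last saw it.
taught_objects = [
    {"class": "backpack", "owner": "Joao", "location": "living room"},
    {"class": "backpack", "owner": "Rute", "location": "office"},
    {"class": "cup", "owner": "Joao", "location": "kitchen"},
]

def find_persons_object(object_class, owner, records):
    """Serve a request such as "find Joao's backpack": return only the
    records matching both the class and the person who taught it."""
    return [r for r in records
            if r["class"] == object_class and r["owner"] == owner]

print(find_persons_object("backpack", "Joao", taught_objects))
```

With such a record, two other backpacks in the scene would not match the query, mirroring the experiment above.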
Figure 5.5: Pepper looking for Joao’s backpack demo.
5.3.2 Sharing Knowledge Between Robots
The final experiment we conducted was to evaluate what would happen if we used the HUMAN neural
net of the MBot (located in Lisbon, Portugal) on Pepper (located in Pittsburgh, PA, USA). Some object
classes (e.g., fruits and kitchenware) are commonly found in many places in the world, so it makes
sense that one robot could benefit from the other’s knowledge. We therefore ran a simple test in which
we showed Pepper an “apple”, an “orange” and a “banana” on a “countertop”. Using the MBot’s
HUMAN150 neural net, Pepper detected two “oranges”, one “banana” and the “countertop” (see Figure
5.6).
Despite the robots being a world apart, we can already see the potential of sharing knowledge between them.
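In practice, since the HUMAN neural net is an ordinary trained model, sharing it between robots reduces to copying its model files and class list. A sketch under that assumption (a Darknet-style .names file with one class label per line; the helper names are ours):

```python
from pathlib import Path

def load_class_names(names_file):
    """Read a Darknet-style .names file: one class label per line."""
    return [line.strip() for line in Path(names_file).read_text().splitlines()
            if line.strip()]

def shared_classes(robot_a_names, robot_b_names):
    """Classes that both robots' environments have in common, i.e. knowledge
    robot B can reuse directly from robot A's model."""
    return sorted(set(robot_a_names) & set(robot_b_names))
```

Pepper could then load the MBot's copied .cfg/.weights pair together with the class list returned by load_class_names and run it unchanged.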
This simple test prompted us to think about new possibilities for research work that we plan to pursue
in the future.
Figure 5.6: Pepper detecting objects using MBot’s knowledge.
Chapter 6
Conclusion
In this Chapter, we present the conclusions of this thesis. First, we summarize its achievements (Sec-
tion 6.1); then, we briefly discuss potential future work (Section 6.2).
6.1 Summary of Thesis Achievements
We tested and evaluated the state-of-the-art object detection algorithm, YOLOv2, on two robots,
CoBot and MBot, and verified that the results fell short of expectations given the published results of
its evaluation. These tests were conducted in two different scenarios, the corridors of a university
building and a domestic scenario, and we verified that YOLOv2, when deployed on these robots, fails to
recognize many of the objects in real human environments. This thesis presents an approach to address
the object detection limitations of service robots. In particular, we bootstrap YOLOv2, a state-of-the-art
object detection neural network, with human teaching provided in close proximity. The robot
trains a neural net with the collected data and then uses the two neural networks in parallel: YOLOv2 + HUMAN,
where HUMAN is the neural net created using only the data collected by the robot after interacting with
humans. By using the two neural networks, the robot is equipped with the ability to adapt to a
new environment without losing its previous knowledge. We implemented our learning algorithm on two
different robots, Pepper and MBot, and conducted experiments to test the object detection performance
of these neural networks in a domestic scenario. We verified the continuous improvement in accuracy
of the robot’s object detection skills as it interacts more with humans, and that using the two neural
networks in parallel is advantageous. We also showed that a robot can look for a specific person's
object. Finally, we conducted a simple experiment in which Pepper, located in the USA, detected objects
using the knowledge of the MBot, located in Portugal.
6.2 Future Work
Potential future work includes further enhancing the object detection capabilities. Since the robot keeps
track of all the objects labeled by the humans, it also knows which ones it has seen less frequently. The
robot could use this information to ask for more images of a specific object; for example, it could ask
the human to show it different perspectives of that object. We believe that this new capability could
improve its detection skills, since the robot would be able to include more variability in its training set.
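Selecting which object to ask about could be as simple as ranking classes by how rarely they have been labeled. A sketch under that assumption (the function name and tie-breaking are ours, not part of the thesis):

```python
from collections import Counter

def classes_to_request(labeled_objects, k=3):
    """Pick the k least-frequently labeled classes for the robot to ask about,
    e.g. "Could you please show me a remote?"."""
    counts = Counter(labeled_objects)
    return [cls for cls, _ in sorted(counts.items(), key=lambda c: c[1])[:k]]

# Toy label history: "chair" dominates, "remote" and "doll" are rare.
labels = ["chair"] * 6 + ["cup"] * 3 + ["remote"] + ["doll"] + ["sofa"] * 2
print(classes_to_request(labels, k=2))
```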
Another potential path of future work could focus on sharing information between robots located
in different places or countries. We conducted a simple final experiment of sharing the MBot’s knowledge
with Pepper and showed that Pepper successfully detected most of the objects in an image. We can only
imagine what would happen if hundreds of labeled images were shared between multiple robots (e.g.,
between Pepper and MBot).
Appendix A
Training Set Information
In this Appendix, we provide more information about the training set in the three stages of the HUMAN
neural net (HUMAN50, HUMAN100, and HUMAN150). Note that at each stage HUMAN only
trained with 70% of the pictures; for example, out of the first 50, 35 were used for training. In
the following plots (Figures A.1, A.2 and A.3) we find the number of instances per object class in each
training set. Some of the less frequent classes, like “refrigerator” and “handbag”, were never present in the
experiment scenario; they result from human input mistakes. Other object classes were only labeled by a
single research participant (for example, the class “doll”). Another thing to take into account is that
some objects show up more frequently than others in the pictures. For example, the class “chair” was
the dominant one, since there were a total of 6 chairs distributed around the experiment scenario. It is
also important to note that this class (“chair”) was starting to become imbalanced relative to the other
classes. In practice, class imbalance usually only matters when the ratios are more like N:1, where N is
100 or 1000 or more. The ideal way to avoid it would be to collect more data; for example, the robot
could ask something like:
“Could you please show me a remote?”
The robot would then capture more pictures while the human shows it the object. This is one of the
next logical steps we are currently considering, hopefully further improving the results.
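The 70/30 split used at each stage can be sketched as follows (whether the thesis shuffles the pictures before splitting is our assumption):

```python
import random

def train_validation_split(pictures, train_fraction=0.7, seed=0):
    """Shuffle the robot's pictures and split them into training and
    validation sets."""
    shuffled = list(pictures)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Out of the first 50 pictures, 35 go to training and 15 to validation.
train, val = train_validation_split(range(50))
```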
Figure A.1: Information about HUMAN50 training data.
Figure A.2: Information about HUMAN100 training data.
Figure A.3: Information about HUMAN150 training data.
Appendix B
Resulting Predictions Information
In this Appendix, we present the neural nets’ resulting predictions for the ground-truth pictures (Figures
B.1, B.2, B.3, B.4 and B.5). In HUMAN50, 68% of the predictions were correct; in HUMAN100, 76%;
and in HUMAN150, 78%. On the other hand, only 57% of YOLOv2’s predictions were correct. In
Figure B.1 we can note that the second most detected class by YOLOv2 was “refrigerator”; since there
were no refrigerators in the scene, this resulted in a total of 39 false predictions.
Figure B.1: YOLOv2 predictions with the ground-truth pictures.
Figure B.2: HUMAN50 predictions with the ground-truth pictures.
Figure B.3: HUMAN100 predictions with the ground-truth pictures.
Figure B.4: HUMAN150 predictions for the ground-truth pictures.
Figure B.5: YOLOv2+HUMAN150 predictions for the ground-truth pictures.