Visual SLAM and Human Recognition System of a Legged Robot


Visual SLAM and Human Recognition System of a Legged Robot

by

Jose Carlos Perez Ipiales

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Engineering in Mechatronics

Examination Committee: Prof. Manukid Parnichkun (Chairperson)
Dr. Mongkol Ekpanyapong
Dr. Pisut Koomsap

Nationality: Ecuadorian

Previous Degree: Bachelor of Science in Mechanical Engineering

The University of the Armed Forces-ESPE, Ecuador

Scholarship Donor: Secretariat of Higher Education, Science, Technology and Innovation (SENESCYT), Ecuador

Asian Institute of Technology

School of Engineering and Technology

Thailand

December 2019


Acknowledgements

I want to express my deepest gratefulness to the hearts that gave me purpose in this world,

to my mother, father and brother.

Besides this, I would like to express my sincere gratitude to the government of Ecuador and

to the Secretariat of Higher Education, Science, Technology and Innovation for entirely

funding my studies at AIT.


Abstract

The purpose of this project was to design a 3D-printed, four-legged indoor robot that incorporates a Visual SLAM system and is able to cope with the intrinsic movement of the depth camera caused by the locomotion and its associated perturbations. At the same time, the robot runs a tiny YOLOv2 deep neural network that can recognize persons in real time with an acceptable level of accuracy.

The robot uses ROS (Robot Operating System) and is remotely connected to a PC that performs the computational processes required by the Visual SLAM using RTabMap, while YOLO and the high-level control are managed by a Jetson Nano board. The Jetson Nano has a serial connection with an Arduino Due, whose function is to perform the low-level control necessary for the motion.


Table of Contents

Title Page
Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
1 Introduction
1.1 Background
1.2 Statement of Problem
1.3 Objectives
1.4 Scope
2 Literature Review
2.1 Perception
2.1.1 Cameras
2.2 Simultaneous Localization and Mapping
2.2.1 Visual SLAM
2.3 Robot Locomotion
2.3.1 Stability
2.3.2 Legged Robot
2.4 Human Detection System
2.4.1 Tiny Yolo V2
3 Methodology
3.1 Development of the Physical Parts of the Robot
3.2 Components
3.3 Circuit Required
3.4 Program Structure
4 Results and Discussion
4.1 Visual SLAM Experiment
4.2 Human Recognition Experiment
5 Conclusion and Recommendations
5.1 Conclusion
5.2 Recommendations
References
Appendices
Appendix A
Appendix B


List of Figures

Figure Title
2.1 Levels of Influence of SR
2.2 Service Robots (SR)
2.3 See-Think-Act Cycle of an Autonomous Mobile Robot
2.4 Microsoft Azure RGB-D Camera
2.5 Feature-Based ORB-SLAM Reconstruction of a Bedroom
2.6 Example of Semi-Dense 3D Semantic Mapping
2.7 Locomotion Mechanisms Found in Nature
2.8 Gabrielli-von Karman Diagram
2.9 YOLO: Detecting a Variety of Object Classes in Real Time
2.10 Tiny Yolo v2 Architecture
3.1 4-Legged Robot Developed
3.2 Walking Pattern Gait
3.3 Perspective of the Movement of Legs
3.4 Leg and Frame Models of the Developed Legged Robot
3.5 PCB and Controller Supports
3.6 Example of Connections Made with Threaded Heat Inserts
3.7 Components Overview
3.8 Servo RDS3135 MG Assembly
3.9 Nvidia Jetson Nano
3.10 Circuit Diagram
3.11 ROS Architecture of the System
3.12 Block Diagram of the Structure of the Rtab-Map Node
3.13 Graph Representation of Locations in the Memory
3.14 Block Diagram of the Structure of the RGBD Odometry Node
3.15 ROS Arduino Communication Node
4.1 Initialization of the Robot
4.2 Path and Map of the Visual SLAM Experiment
4.3 Real Image of the Place of Experiment
4.4 X-Axis Comparison of Results and Ground Truth
4.5 Y-Axis Comparison of Results and Ground Truth
4.6 Yaw Angle Comparison of Results and Ground Truth
4.7 Error Associated with the Measurements in the X and Y Axes
4.8 Error Associated with the Measurement of the Heading Angle
4.9 Example of False Positive with the Object Detection Algorithm
4.10 Graphical Representation of Results of Tiny-Yolo-v2
Example of RANSAC Fitting Model Estimation Using a Line
Pinhole Camera Model Graphical Interpretation
Reprojection Error


List of Tables

Table Title
3.1 Properties of 3135 MG and 3115 MG Servos
4.1 Results Obtained in the Visual SLAM Experiment
4.2 Errors Associated with the Experiment for X, Y and Yaw Angle
4.3 Results Obtained with Tiny-Yolo-v2


Chapter 1

Introduction

1.1. Background

The service sector appears to be at an inflection point regarding productivity gains and service industrialization, similar to the industrial revolution that started in the eighteenth century. At present, rapidly improving technology that becomes better, smarter, smaller, and cheaper is set to transform almost all service sectors (Wirtz, 2018).

Robots are one of the technologies that will drive this transformation in our daily lives and in industry, and they will fulfill a growing number of roles ranging from factory automation, medical care, entertainment, and cleaning to personal assistance. The development of service robots has been divided into two sectors: (a) non-manufacturing productive sectors such as agriculture, the mining industry, or medicine; and (b) the personal service sector, including personal assistance, cleaning, education, entertainment, etc. (García-Solera, 2018)

Service robots are technical devices ready to execute an operation or specific task with semi- or fully autonomous behavior. At the same time, they have to deal with different types of dynamic and static objects in their environment, such as humans, while performing their duty in real time.

Service robots will play an important role in the days to come at various levels: the micro level (i.e. the individual customer experience), the meso level (e.g. the market for a particular service) and the macro level (with societal implications).

Figure 2.1: Levels of Influence of SR (Wirtz, 2018)

Figure 2.2: Service Robots (SR)


1.2. Statement of the Problem

In order to understand the world and to navigate it, service robots must have the ability to create a map of their environment and to locate themselves in it. Depending on the SLAM technique and the devices used, the perceived world can be represented as a 2D or a 3D map.

For many tasks, a 2D map is enough for the robot to perform navigation; a 3D counterpart, on the other hand, provides a more accurate representation of the environment to interact with.

The service robot this work intends to build uses legged locomotion, which gives it the capability of adapting its movement to non-ideal conditions such as objects that obstruct the path or uneven floors. It also performs Visual SLAM, creating a 2D and/or 3D map while having to deal with the intrinsically complex body movement that legged robots exhibit during locomotion. This is especially true in the 4-legged case, where fewer contact points with the floor are available to support the structure and the center of gravity, compared with robots with 6, 8 or more legs.

At the same time, the proposed robot incorporates an object recognition algorithm that can help to localize persons in environments with different lighting conditions.

1.3. Objectives

The objective of this work is to develop a 4-legged robot which is able to perform Visual SLAM and recognize humans in an indoor environment under different lighting conditions.

1.4. Scope

• Development of a 4-legged robot of about 320 × 320 × 280 mm and of its control algorithm.
• Capability of walking on a terrain with a height variation of 2 cm.
• Implementation of a Visual SLAM system.
• Human and object recognition using a suitable algorithm.


Chapter 2

Literature Review

An autonomous robot must have the ability to map its surroundings and to locate itself in them because, after performing this "understanding", any task will imply a displacement in the world. The previous statement is complemented by 3 questions that the robot implicitly has to ask itself before deciding how to act in, or interact with, the environment in order to navigate it successfully and perform the desired task: Where am I? Where am I going? How do I get there?

Figure 2.3: See-Think-Act Cycle of an Autonomous Mobile Robot (Tang Dan, 2018)

Figure 2.3 describes the sense-think-act paradigm, or cycle, that has to be repeated by autonomous robots in order to navigate the world. It consists of:

1) Perceiving and analyzing the data obtained from sensors, that is, perception;

2) Map building and self-localization within the map: SLAM; and

3) Path planning, where kinematics and motion control take the responsibility of sending commands to the actuators in order to move the robot and perform displacements.


2.1 Perception

Perception is the interface between the robot and the environment that surrounds it. It is therefore a task of major importance that determines the algorithm subsequently used for Simultaneous Localization and Mapping and, for the same reason, ultimately shapes the way the system is able to capture, "understand", define and interpret the 2D or 3D map of the world.

Perception begins when the sensors incorporated in the robot capture or acquire raw data, whose nature depends on the devices used. For instance, in the case of a camera the raw data are the images or video obtained, and in the case of a sonar it is a sound signal.

The second level of perception is the clustering of the raw data into meaningful features and then into objects (Siegwart R., 2011).

2.1.1 Cameras

A camera is a device conceived for recording visual images in the form of photographs, film, or video signals.

Nowadays, different digital technologies (besides the common monocular cameras) help us to obtain visual-spatial information from the world and allow robots to work on different tasks such as object tracking and recognition, image segmentation, Visual SLAM, etc.

Among the alternatives, and particularly those capable of depth sensing, we have stereo cameras, Time-of-Flight (ToF) cameras, and RGB-D cameras.

Figure 2.4: Microsoft Azure RGB-D Camera


2.2 Simultaneous Localization and Mapping

Simultaneous Localization and Mapping, also known by the acronym SLAM, is a prerequisite for many robotic applications. The techniques used jointly estimate a map of an unknown environment and the robot pose within that map, using only the data streams of the robot's on-board sensors (Bescos, 2018). This process allows the system to localize itself in the map and in the environment without accumulating drift.

The SLAM problem has been addressed from different perspectives, but most approaches and datasets assume a static environment. This means that, unfortunately, these systems only work well with small fractions of dynamic content, which can compromise their applicability in real-world environments; nonetheless, for many tasks this is good enough.

2.2.1 Visual SLAM

Recently, a higher degree of effort and attention has been put into what is called visual SLAM, in which the main sensor is a monocular, stereo or depth camera; the advantages of this technique are maximized with the last two.

This type of SLAM can be classified into 3 categories:

1) Feature-based methods, where the algorithm, as its name suggests, maps the features that can normally be used for recognition. Figure 2.5 shows an example of a bedroom reconstruction obtained with this kind of approach.

Figure 2.5: Feature-Based ORB-SLAM Reconstruction of a Bedroom. Left: map consists of 180 keyframes and 8258 points. Right: map contains 102 keyframes and 8269 points (Mur-Artal, 2014)

2) Direct methods, which try to estimate complete dense reconstructions by directly minimizing the reprojection error (the geometric error corresponding to the image distance between a projected point and a measured one) and the total variation regularization


(an image denoising method), or, alternatively, estimate semi-dense maps focusing their attention on semantic areas.

Some direct methods make use of deep learning to recognize static and also dynamic objects, thus altering the perception of the space during the process. An example of this is presented in figure 2.6, where deep learning helps to execute the semantic segmentation of the space and, subsequently, the 3D reconstruction.

Figure 2.6: Example of Semi-Dense 3D Semantic Mapping (Li, 2016)

3) Appearance-based methods, where the algorithms use information obtained from vision sensors to localize the robot and map the environment using a process called loop closure, which determines whether the robot has seen a location before or not. When the robot travels to new areas of its surroundings, the map is extended and the number of previously seen images that each new image must be compared with also grows. The loop closure process therefore takes longer, because its complexity increases linearly with the number of comparisons the robot has to make, unless a convenient memory management scheme is used.

2.3 Robot Locomotion

Another key aspect we have to take into account when building a robot is the locomotion strategy, because it defines the way the system moves through the world and, in consequence, it also defines the actuators, their control algorithms and the sensors used for perception.

Figure 2.7 shows some examples of locomotion mechanisms found in nature; the focus of this work is, particularly, legged and wheeled locomotion.


Figure 2.7: Locomotion Mechanisms Found in Nature (Böttcher, 2006)

Legged robot locomotion mechanisms are often inspired by biological systems, which are very successful in moving through a wide range of harsh, uneven and complex environments. In this context, human bipedal walking can be approximated as a rolling polygon "similar" to a wheel (Böttcher, 2006), of course only as an approximation, because in this case there is no actively powered wheel.

2.3.1 Stability

Stability is a key issue in locomotion, because a failure here could mean a catastrophic accident for the robot or its definitive failure to reach the goal position.

We can differentiate 2 types of stability, static and dynamic. The first refers to the situation in which there is no motion and the center of mass lies completely within the support polygon (a non-zero area formed between the legs or the wheels that are in contact with the floor).

Most robots that are able to walk in a statically stable way have six legs because, for instance, a statically stable 4-legged robot can lift only 1 leg at a time while the other 3 create the support (3 is the minimum number of contact points needed to create a support polygon); walking thus becomes problematic and slow.


2.3.2 Legged Robot

A legged robot is well suited to work in a variety of complex environments due to its inherent capability of climbing steps and crossing gaps, which can be as large as its stride, and of overcoming ground irregularities or obstacles that a common wheeled robot would find impossible.

Each leg must have a minimum of 2 DOF (degrees of freedom): 1 for swinging and 1 for lifting, and each DOF requires a joint and a motor; this means that, in consequence, 6-legged robots need a minimum of 12 motors, which directly affects the power consumption of the system.

The addition of DOF generally increases the maneuverability of the robot, the range of terrain it can handle and its ability to travel with a variety of gaits; on the other hand, the disadvantages are increased power consumption, weight and, presumably, cost due to the number of motors used (Böttcher, 2006).

If the robot has more than 2 legs, there is the problem of coordinating them in order to perform the locomotion. Traditionally, to drive the motors that move the legs, kinematics and inverse kinematics transformations are used to solve the equations relating the position and movement of each part involved in this task.

The number of possible gaits that a robot can use depends on the number of legs it has. A gait is a periodic sequence of lift and release events for each leg. If a robot has k legs, the number of possible events N is, according to:

N = (2k - 1)!

In the case of a bipedal walking machine the number of possible events is:

N = (2k - 1)! = (2·2 - 1)! = 3! = 6

In this case, the possible events are:

1. Lift the left leg
2. Release the left leg
3. Lift the right leg
4. Release the right leg
5. Lift both legs together
6. Release both legs together

It is important to mention that for 6-legged robots there are 39,916,800 possible events. For this reason the control is more complex for 6 legs than for 2 but, at the same time, we have to take into account other aspects of the problem such as, for instance, stability.
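As a quick sanity check of this formula, a few lines of Python (illustrative only) reproduce the counts mentioned in the text:

```python
from math import factorial

def gait_events(k):
    """Number of possible lift/release event sequences for a robot with k legs,
    following N = (2k - 1)! as used above."""
    return factorial(2 * k - 1)

for legs in (2, 4, 6):
    print(legs, "legs ->", gait_events(legs), "possible events")
# 2 legs -> 6, 4 legs -> 5040, 6 legs -> 39916800
```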

There are some performance indices that can be used to evaluate legged robots of different configurations (Kajita, 2018). Here we are going to discuss 2 of them, but in the literature more can be found, such as the Froude number.

• Specific Resistance, defined by the equation:

ε = E / (M · g · d)


Where E is the total energy consumption for a travel of distance d, M is the mass of the

vehicle, and g is gravity.

This equation indicates, in principle, how much energy is transformed into actual displacement, which makes it, in a sense, a measure of how "smooth" the locomotion of the robot is.

In figure 2.8 we can see the calculated specific resistance for different kinds of robots.

• Duty Factor β, which can be used to distinguish between walks (β ≥ 0.5) and runs (β < 0.5). It is defined as the fraction of the gait cycle during which a leg stays on the ground:

β = (support time of a leg) / (duration of the full gait cycle)

Figure 2.8: Gabrielli-von Karman Diagram (Kajita, 2018)
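As a rough illustration of these two indices, a minimal Python sketch with made-up example values (not measurements taken from this robot):

```python
def specific_resistance(E, M, d, g=9.81):
    """Specific resistance epsilon = E / (M * g * d):
    energy spent per unit of weight and per unit of travelled distance."""
    return E / (M * g * d)

def duty_factor(t_support, t_cycle):
    """Duty factor beta: fraction of the gait cycle a leg spends on the ground.
    beta >= 0.5 corresponds to walking, beta < 0.5 to running."""
    return t_support / t_cycle

# Hypothetical example values, for illustration only
print(specific_resistance(E=120.0, M=2.5, d=10.0))  # E in J, M in kg, d in m -> dimensionless
print(duty_factor(t_support=0.8, t_cycle=1.2))      # ~0.67 -> a walk
```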

2.4 Human Detection System

Human detection is a branch of the more general object detection problem, which tackles the issue of identifying the presence of predefined classes of objects in an image; in this case, human beings. To perform this task we have to identify the presence of the objects and then locate the rectangular boundaries that surround them.

Early approaches to human detection include techniques like the Haar Cascade, widely used for face detection, proposed by Viola and Jones, which uses an AdaBoost machine learning classifier (Viola & Jones, 2001); or others like the Histogram of Oriented Gradients (HOG),


which is a pedestrian detection algorithm developed by Dalal and Triggs that uses a Support Vector Machine (SVM) for classifying the objects.

These two approaches are not very good at detecting humans in various poses unless multiple models are used, one per pose (Vidanapathirana, 2018). Another problem observed with these approaches is that quite often a person is detected in one frame and not in the following one, or vice versa; in other words, detections are susceptible to flickering. Modern approaches to human detection fall into what is called deep convolutional neural networks, which provide more accurate results at the cost of more computation.

Figure 2.9: YOLO. Detecting a Variety of Object Classes in Real-Time (Redmon, 2016)

Some of the current state-of-the-art object detection algorithms, which are consequently able to detect humans too, and that are being widely implemented are:

• R-CNN approaches, which use region proposal methods combined with CNNs to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene (Liu, 2016). Examples of this type of algorithm are Fast R-CNN, Faster R-CNN and Mask R-CNN, which are improvements on the originally proposed R-CNN model.

• Single Shot Multibox Detector (SSD), which discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location (a feature map is a representation of the dominant features of the image at different scales) (Liu, 2016).

• You Only Look Once (YOLO) v3, which can be 3 times faster than SSD on a 320x320 image and, on a Pascal Titan X, can process images at 30 FPS (Redmon, 2016). The most salient feature of YOLO v3 is that it makes detections at three different scales, which can be very useful in real-time situations. YOLO can be up to 1000x faster than R-CNN and as much as 100x faster than Fast R-CNN.


• There are other, lighter and smaller versions of the YOLO algorithm whose architecture makes them more suitable for real-time applications on boards with limited computational power; one example is the so-called tiny YOLOv2, which is briefly described here. For comparison, the full YOLOv2 model uses about three times as many layers as this optimized version.

2.4.1 Tiny Yolo V2

The architecture of this feedforward neural network, presented in figure 2.10, consists of 9 convolutional layers, each with a leaky rectified linear unit as activation function and a batch normalization operation, interspersed with 6 max-pooling layers and a region layer. Tiny-Yolo-v2 takes a 416 x 416 input image and predicts 20 output classes on the VOC dataset (80 output classes on the COCO dataset). The network uses convolutions with a 3x3 kernel (only the last convolution uses a 1x1 kernel), and max-pooling is performed with a 2x2 kernel and a stride of 2.

Figure 2.10: Tiny Yolo v2 Architecture (Yap, 2018)

It has been shown (Cong & Xiao, 2014) that the convolutional layers occupy over 90% of the feed-forward computation time, which is why this reduced architecture performs faster than other deep learning-based algorithms, although this also results in lower detection accuracy.
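To make the layer stack concrete, the sketch below walks a 416 x 416 input through a convolution/max-pooling sequence of the kind described above and prints the resulting grid size. The filter counts follow the commonly distributed tiny-yolo-voc configuration and are an assumption to be checked against the exact .cfg file used:

```python
# Minimal sketch: spatial size of the tiny-YOLO-v2 feature maps (assumed tiny-yolo-voc layout).
layers = [
    ("conv 3x3", 16), ("maxpool 2x2/2", None),
    ("conv 3x3", 32), ("maxpool 2x2/2", None),
    ("conv 3x3", 64), ("maxpool 2x2/2", None),
    ("conv 3x3", 128), ("maxpool 2x2/2", None),
    ("conv 3x3", 256), ("maxpool 2x2/2", None),
    ("conv 3x3", 512), ("maxpool 2x2/1", None),  # last max-pool keeps the resolution (stride 1)
    ("conv 3x3", 1024), ("conv 3x3", 1024),
    ("conv 1x1", 5 * (5 + 20)),                  # 5 anchors x (4 box coords + objectness + 20 VOC classes)
]

size = 416
for name, filters in layers:
    if name.startswith("maxpool") and name.endswith("/2"):
        size //= 2                               # a stride-2 pooling halves the grid
    print(f"{name:<14} -> {size}x{size}" + (f" x {filters}" if filters else ""))
# The final grid is 13 x 13 x 125 for the 20-class VOC configuration.
```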


Chapter 3

Methodology

The robot developed, with 8 degrees of freedom (DOF), can be observed in figure 3.1. The mechanical design gives each leg 2 DOF, which means that 2 servos perform its movements: 1 RDS3115 MG and 1 RDS3135 MG model (both described briefly later in this chapter), which rotate in perpendicular planes, horizontal and vertical respectively.

Figure 3.1: 4-Legged Robot Developed

A dynamic walking gait pattern was used. It consists in rotating backwards, in the horizontal plane, the 2 legs that are supported on the ground (opposite legs with respect to the origin, taking the center of the body as such) in order to gain the impulse that moves the robot in the desired direction, whilst the other 2 legs are in the air and moving forward; the pattern is then repeated with the pairs of legs alternated, as can be seen in figure 3.2:


Figure 3.2: Walking Pattern Gait

Each leg changes its position as shown in figure 3.3, describing a circular arc that can be analyzed to calculate the movement of the robot. For a better understanding of figure 3.3, the orange dashed line represents the position of the robot's leg when it is standing and the red lines are the limits of its movement.

Figure 3.3: Perspective of the Movement of Legs

The distance traveled by each step at ideal conditions can be defined by the following

equation:

Displacement = 2 × (horizontal projection diameter of the leg) × sin(angle)

Displacement = 2 × 15.2 cm × sin(15°) ≈ 7.8 cm

The diameter used in this equation is the one described by the projection of the length of the

leg on the horizontal plane.
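The step length above can be reproduced with a short Python sketch of the same arithmetic, assuming the 15.2 cm projected diameter and the 15° swing used in the text:

```python
import math

def step_displacement(projected_diameter_cm, swing_angle_deg):
    """Ideal distance covered per step: 2 * D_projected * sin(angle)."""
    return 2.0 * projected_diameter_cm * math.sin(math.radians(swing_angle_deg))

print(step_displacement(15.2, 15.0))  # ~7.87 cm per step (reported as 7.8 cm above)
```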

3.1. Development of the Physical Parts of the Robot

This project starts with the development of the 3D models corresponding to the structure of the robot: the 4 legs, the frame and the supports for the circuits and controllers. The design was made using SolidWorks and the parts were created using a 3D printer.

Figure 3.4: a) Leg, and b) Frame Models of the Developed Legged Robot

The legs and the frame, both printed in PLA, can be seen in figure 3.4, and the 2 supports required for the rest of the structure, printed in ABS, can be seen in figure 3.5.

Figure 3.5: a) PCB and b) Controller Supports


Threaded heat inserts were used for the connections between the main 3D-printed parts and the components attached directly to them (see figure 3.6).

The threaded inserts have proved to be well suited to the task: they provide good resistance to both torque and pull-out once the plastic solidifies around them, and they also make it possible to frequently assemble and disassemble the unit for service or repair without affecting the integrity of the material.

Figure 3.6: Example of Connections Made with Threaded Heat Inserts

3.2 Components

For simplicity, the components used to build the robot are described in 2 groups: first the ones used for the low-level control and, secondly, the components used for the high-level command management, including the visual system.

Figure 3.7 gives an overview of the system, its main components (sensors, actuators, controllers and power supply) and the communication methods used between them.


Figure 3.7: Components Overview

- Low-Level Control Components

• Actuators:

For the movement of the 4 legs, each with 2 degrees of freedom, a total of 8 servos were used:

• 4 RDS 3115MG servos
• 4 RDS 3135MG servos

The properties of these 2 types of motor are listed in table 3.1. It has to be mentioned that, for each leg, the servo with the higher capacity (the RDS 3135 model) is the one holding the robot's weight, since it performs the vertical rotation, whilst the one with the lower capacity is in charge of the horizontal rotation, as explained in the gait pattern description in this chapter.

An image of the servos used can be seen in figure 3.8, which includes the assembly of the metallic parts that connect them to the legs and the frame of the robot.


Figure 3.8: Servo RDS3135 MG Assembly

Table 3.1: Properties of 3135 MG and 3115 MG Servos

Model                 RDS 3135 MG               RDS 3115 MG
Operational Voltage   6 - 8.4 V                 4.8 - 7.2 V
                      at 7.4 V     at 8.4 V     at 6 V       at 7.2 V
Operating Speed       0.13 s/60°   0.11 s/60°   0.16 s/60°   0.14 s/60°
Stall Current         3.2 A        3.5 A        1.3 A        1.5 A
Stall Torque          31 kg·cm     35 kg·cm     13.5 kg·cm   15 kg·cm
Weight                64 g                      60 g

• Sensors:

MPU 9255

The MPU-9255 is a multi-chip module which houses a 3-axis gyroscope, a 3-axis accelerometer and a 3-axis magnetometer.

• Gyroscope Features

The triple-axis MEMS gyroscope in the MPU-9255 has the following features:

✓ Digital-output X, Y and Z angular rate sensors with a user-programmable scale range of ±250, ±500, ±1000, and ±2000 °/s and an integrated 16-bit ADC.
✓ Current demanded: 3.2 mA

• Accelerometer Features

The triple-axis MEMS accelerometer in the MPU-9255 includes a wide range of features:

✓ Digital-output triple-axis accelerometer with a scale range of ±2g, ±4g, ±8g and ±16g and integrated 16-bit ADCs
✓ Current demanded: 450 µA
✓ User-programmable interrupts

• Microcontroller:

Arduino Due

The Arduino Due is a microcontroller board based on the Atmel SAM3X8E ARM Cortex-M3 CPU. This 32-bit ARM core board provides 54 digital input/output pins (of which 12 can be used as PWM outputs for the servo motors used in this project), 12 analog inputs, 4 UARTs (hardware serial ports), an 84 MHz clock and 2 DACs (digital-to-analog converters).

This microcontroller is connected directly to the Jetson Nano, from which it receives its power supply as well as the commands it will execute.

- Visual System and High-Level Control Components

The components used for the visual SLAM, the human recognition system and the high-level control are the following:

1 Vision Sensor

Microsoft Kinect depth RGB camera, whose depth sensor consists of an infrared laser projector combined with a monochrome CMOS sensor. This inexpensive vision sensor is also equipped with an RGB camera and an integrated microphone array, which could give this system further capabilities in future research.

2 Nvidia Jetson Nano (figure 3.9)

Nvidia Jetson Nano is a low-cost embedded system-on-module (SoM) option created mainly

for AI visual systems which possesses an integrated 128-core Maxwell GPU, a quad-core

ARM A57 64-bit CPU and a 4GB LPDDR4 memory.

Figure 3.9: Nvidia Jetson Nano


This device runs the L4T (Linux for Tegra) operating system, on which the CUDA parallel computing platform and programming model was installed in order to accelerate, on the GPU, the graphics processing needed by the visual systems.

It has to be mentioned that the Jetson Nano required a 32 GB SD card and an external Wi-Fi module in order to connect it to another PC for remote communication.

An 8 GB swap file was created on the SD card, allowing the system to use this space as additional virtual memory and, in consequence, improving performance.

3.3. Circuit Required

For the control of the locomotion and stability of the robot, the circuit shown in figure 3.10 was implemented. It includes a LiPo battery as power supply for the servo motors, each of which has a 470 uF capacitor connected in parallel in order to smooth the current drawn, thus providing a more reliable response.

It has to be pointed out that the system uses a "star" connection, which means that the ground of the motors is also connected to the Arduino ground pin. A voltage regulator is used to maintain a constant voltage of 6.8 V; this value was chosen so as not to push the RDS3115 servos to their limit, even though it leaves the RDS3135 model with less torque, which is still enough to hold the robot and to perform the locomotion.

Another aspect that has to be mentioned is that 2 different PCBs were prepared in order to divide the current that each of them has to handle, consequently protecting all the components.

Figure 3.10: Circuit Diagram


3.4. Program Structure

The low-level control implemented on the Arduino Due board is in charge of 4 different activities:

• Stabilization of the robot using the roll and pitch angles obtained by processing the information provided by the MPU 9255
• Movement towards the front
• Rotation by a certain angle
• Reading the commands sent by ROS and acting accordingly

On the other hand, the high-level control uses the ROS Melodic platform, available both for the L4T operating system running on the Jetson Nano and for Ubuntu 18.04, which runs on a remotely connected computer.

For the implementation in ROS, 5 different nodes were created in order to implement the tasks that the robot performs:

• RTab-Map and RTabMap Viz nodes
• RGBD_Odometry node
• Object recognition YOLO darknet node
• ROS_Arduino communication node

Besides this, the Openni nodelet provides ROS access to the camera, and another nodelet called data_throttle is created with the purpose of synchronizing and managing the relevant topics published by Openni: the RGB images and the depth information, which are subsequently sent over TCP/IP to the remote client PC, where the other nodes related to the Visual SLAM system run.

Figure 3.11 shows the architecture and communication process of the groups, nodes, nodelets and topics running in the system.


Figure 3.11: ROS Architecture of the System


• Rtab-Map Node

The online output of the node, shown on the remote PC screen, is the local graph with the latest data added to the map, which is ultimately composed of nodes and links.

This node has 2 different types of memory: working memory (WM) and long-term memory (LTM). This division of the memory permits long-term operation at a large scale thanks to the dynamic management approach used to detect loop closures: it keeps the most recent and frequently observed places (whose weight is evaluated as high) in the working memory and transfers the oldest ones to the long-term memory, where they are stored and can be retrieved when convenient. The structure of this node can be visualized in figure 3.12.

An important aspect of the options chosen in RTabMap was to constrain the loop closure detection and graph optimization to 3 degrees of freedom, which is enough for a robot whose movement is constrained to the floor.

Figure 3.12: Block Diagram of the Structure of the Rtab-Map Node (Labbé, 2019)

Rtab-Map receives the data from the odometry node and the throttled depth and RGB images on the remote computer, where the Short-Term Memory (STM) creates a node (as part of the map) containing the odometry and the sensor data. The STM size was increased to its maximum because of the help it provides in finding loop closures during the continuous change of the images received while the legged robot is moving.

Figure 3.13: Graph Representation of Locations in the Memory (Labbé, 2014)


To determine which nodes to transfer to LTM, a weighting mechanism identifies locations that are more important than others. To do so, when creating a new node, the STM initializes the node's weight to 0 and compares it visually with the last node in the graph. If they are similar (i.e. the percentage of corresponding quantized SURF features is over the similarity threshold "Mem/RehearsalSimilarity", which was slightly decreased in this work), the weight of the new node is increased by one plus the weight of the last node.
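A simplified Python sketch of this rehearsal/weighting idea follows; it is not RTAB-Map's actual implementation, and the similarity value stands in for the fraction of corresponding quantized features:

```python
def rehearsal(new_node, last_node, similarity, threshold=0.20):
    """Keep-or-reinforce step of the weighting mechanism described above.

    new_node/last_node are dicts holding at least a 'weight' entry; 'similarity'
    stands in for the fraction of matching quantized features between the two
    locations, compared against a Mem/RehearsalSimilarity-like threshold."""
    if similarity >= threshold:
        # The new location "rehearses" the previous one: it takes that node's
        # weight plus one, marking the place as frequently observed.
        new_node["weight"] = 1 + last_node["weight"]
        last_node["weight"] = 0   # assumption: the older node is demoted/merged
    else:
        new_node["weight"] = 0
    return new_node

node_a = {"id": 41, "weight": 2}
node_b = {"id": 42, "weight": 0}
print(rehearsal(node_b, node_a, similarity=0.35))  # -> weight 3
```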

The loop closure detection relies on a Bayesian filter (Labbé, 2014), described in Appendix A, which is used to keep track of the loop closure hypotheses by estimating the probability that the current location matches one of the already visited locations stored in the working memory, or a location estimated to be close enough to it (Angeli, 2008).

• RGBD Odometry Node

For the purpose of visual odometry, this node can work with 2 different approaches, called Frame-To-Map (F2M) and Frame-To-Frame (F2F).

The difference between these 2 options is that F2M compares the features extracted from the current frame with a local map of features created from past frames, whereas F2F, the approach that gave better results in this work, makes the comparison against the last keyframe only.

The architecture of this node can be visualized in figure 3.14.

Figure 3.14: Block Diagram of the Structure of the RGBD Odometry Node (Labbé, 2019)


A 3 DOF odometry was selected for the same reasons previously mentioned for the restriction of the Rtab-Map node to 3 DOF map optimization.

The process carried out by this node is described by the following steps:

• A maximum number of GFTT (Good Features to Track) features is set manually. GFTT was selected because it contributes to obtaining features in a uniform way for images with different sizes and light intensities.
• Feature matching, which uses different approaches depending on whether the algorithm is set to F2M or F2F. For F2M, a nearest neighbor distance ratio (NNDR) evaluation using BRIEF descriptors is done; for F2F, the approach used in the visual SLAM implemented here, optical flow is computed directly on the GFTT features, providing faster feature correspondences against the last key frame.
• Motion prediction, which is used to predict where the features of the key frame (F2F) or of the feature map (F2M) should appear in the new frame, based on the previous motion and assuming a constant velocity.
• Motion estimation, made with the Perspective-n-Point (PnP) RANSAC algorithm provided by OpenCV.
• Local bundle adjustment, which refines the scene (the 3D point cloud) and the camera poses by minimizing the reprojection errors.
• Pose update
• Key frame and feature map update

The previously mentioned Perspective-n-Point RANSAC algorithm used by OpenCV in RTabMap relies on the combination of 2 algorithms:

1. The RANSAC algorithm, used to filter out the noise (outlier correspondences)
2. Levenberg-Marquardt optimization, used to solve the pinhole camera model equations of the Perspective-n-Point problem

This is further explained in the Appendix B.
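The F2F pipeline just described can be illustrated with a short OpenCV sketch in Python. This is a simplified stand-in for the RTAB-Map implementation: the intrinsics matrix K, the depth image units and the feature/RANSAC parameters are assumptions that would have to be adapted to the actual Kinect calibration:

```python
import cv2
import numpy as np

def f2f_step(prev_gray, prev_depth, curr_gray, K, max_features=500):
    """One simplified Frame-To-Frame step: GFTT on the key frame, optical-flow
    matching to the current frame, then PnP inside a RANSAC loop."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    # 1) Good Features To Track, detected on the (previous) key frame
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_features,
                                  qualityLevel=0.01, minDistance=7)
    # 2) Optical flow gives the 2D correspondences in the current frame
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(prev_gray, curr_gray, pts, None)

    obj_pts, img_pts = [], []
    for p, q, ok in zip(pts.reshape(-1, 2), next_pts.reshape(-1, 2), status.ravel()):
        if not ok:
            continue
        z = float(prev_depth[int(p[1]), int(p[0])])  # depth in metres at the feature (assumption)
        if z <= 0.0:
            continue
        # Back-project the key-frame feature to 3D with the pinhole camera model
        obj_pts.append([(p[0] - cx) * z / fx, (p[1] - cy) * z / fy, z])
        img_pts.append(q)

    if len(obj_pts) < 6:
        return False, None, None, None

    # 3) Motion estimation: Perspective-n-Point solved inside RANSAC,
    #    which also filters out the noisy correspondences (outliers)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(np.array(obj_pts, dtype=np.float32),
                                                 np.array(img_pts, dtype=np.float32),
                                                 K, None)
    return ok, rvec, tvec, inliers
```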

• Object Recognition YOLO_ROS Node

This node was implemented as a package and launched in ROS Melodic running on the Jetson Nano. It has to be mentioned that the computations required by the different models tested used the integrated GPU of the device thanks to the CUDA software.

YOLOv3 was tested initially but, unfortunately, it was too computationally expensive for the Jetson Nano, which is already processing the images required by the RtabMap and visual odometry nodes; YOLOv3 had delays and was only able to run at around 1.2 FPS.

For this reason, the YOLO version used in this implementation is the so-called Tiny-Yolo-v2, which is able to process the incoming images at a higher speed (approximately 10.2 frames per second). The weights used in this project were pretrained on the VOC dataset, and the algorithm was configured to recognize only 1 class: persons.

The incoming images used by this algorithm come directly from Openni node as the RGB

raw data.
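On the robot, the detection runs inside the YOLO darknet ROS package; purely as a standalone illustration of using the same tiny-YOLO-v2 (VOC) weights on a single RGB frame, one could use OpenCV's dnn module as sketched below (the file names and the 0.5 confidence threshold are assumptions):

```python
import cv2
import numpy as np

# Hypothetical file names for the tiny-YOLO-v2 (VOC) model
net = cv2.dnn.readNetFromDarknet("tiny-yolo-voc.cfg", "tiny-yolo-voc.weights")
PERSON_CLASS = 14  # index of 'person' in the VOC class ordering

def detect_people(bgr_image, conf_threshold=0.5):
    """Run tiny-YOLO-v2 on one RGB frame and keep only the 'person' detections."""
    h, w = bgr_image.shape[:2]
    blob = cv2.dnn.blobFromImage(bgr_image, 1.0 / 255.0, (416, 416), swapRB=True, crop=False)
    net.setInput(blob)
    detections = net.forward()  # one row per box: [cx, cy, bw, bh, objectness, 20 class scores]
    people = []
    for det in detections:
        scores = det[5:]
        if np.argmax(scores) == PERSON_CLASS and scores[PERSON_CLASS] > conf_threshold:
            cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
            people.append((int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)))
    return people
```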


• ROS_Arduino Connection Node

A client-service node was created in order to send commands to the robot from the remote PC, connected using the TCP/IP protocol.

A ROS node called ROS_Arduino, containing the service, is launched on the robot and is constantly available to the user. On the other hand, a client is launched on the remote PC whenever we want to send information about the movements the robot should make. Both the client and the service scripts were written in the Python programming language.

After the service receives the commands from the client, it encodes this information into an array and sends it to the Arduino board, which is connected to the Jetson Nano through a serial connection. Once this process is finished, the Arduino reads the information received and executes the commands, moving the robot in the desired way.

Figure 3.15 shows a diagram of how the communication is carried out when this service-client node is called by the user.

Figure 3.15: ROS Arduino Communication Node
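A minimal sketch of the service side in Python (rospy plus pyserial) is shown below; the service name, the MoveCommand srv type, the packet format and the serial port are hypothetical placeholders rather than the exact code used in this thesis:

```python
#!/usr/bin/env python
import rospy
import serial
from my_robot_msgs.srv import MoveCommand, MoveCommandResponse  # hypothetical srv type

# Serial link to the Arduino Due (port and baud rate are assumptions)
arduino = serial.Serial("/dev/ttyACM0", 115200, timeout=1.0)

def handle_move(req):
    """Encode the requested movement as a simple text packet and forward it to the Arduino."""
    # e.g. req.command: 'F' forward, 'R' rotate; req.value: steps or degrees (assumed fields)
    packet = "{},{}\n".format(req.command, req.value)
    arduino.write(packet.encode("ascii"))
    return MoveCommandResponse(success=True)

if __name__ == "__main__":
    rospy.init_node("ros_arduino")
    rospy.Service("move_robot", MoveCommand, handle_move)
    rospy.loginfo("ROS_Arduino service ready, waiting for movement commands.")
    rospy.spin()
```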


Chapter 4

Results and Discussion

In order to test the system, 2 different experiments were carried out with the purpose of measuring the error of the visual SLAM, when it estimates the relative position of the robot, and of the object recognition algorithm, while it processes the images and indicates the presence of humans in them.

4.1. Visual SLAM Experiment

For the Visual SLAM, the estimated relative position was compared with the real position, which had previously been measured and marked on the floor.

The parameters analyzed are x, y and the yaw angle, which are enough to define the pose of a robot constrained to move in the XY plane. The average frame rate of the depth and RGB images was 2.9 Hz.

It has to be pointed out that, to carry out the experiment, the robot was initialized in a place whose map had not been previously obtained. The system then starts scanning the world with the RGB and depth information available, as can be seen in figure 4.1 where the features are being detected, and the visual odometry is set to the position (0, 0) with a yaw angle of 0, described by position 0 in table 4.1.

Figure 4.1: Initialization of the Robot

Page 33: Visual SLAM and Human Recognition System of a Legged Robot

27

Table 4.1: Results Obtained in the Visual SLAM Experiment

Position  x_real (m)  y_real (m)  yaw_real (deg)  x (m)     y (m)     yaw (deg)
0         0           0           0               -0.0014   0.00113   0.032
1         -0.1        0           0               -0.12     -0.00296  -0.743
2         0.5         0           0               0.56      0.0236    -2.47
3         1.1         0           0               1.09      0.0708    -2.3
4         1.7         0           0               1.67      0.0809    -6.78
5         2.3         0           0               2.25      0.0625    -7.78
6         2.9         -0.6        0               2.83      -0.51     -0.0142
7         2.3         -1.2        -90             2.55      -1.29     -93
8         2.3         -1.2        -120            2.4       -1.12     -112
9         2.3         -2.4        -90             2.28      -2.18     -99
10        2.6         -3.6        -120            2.4       -3.39     -114

Table 4.1 presents the positioning results obtained while the robot was following a continuous path, which can also be observed as a colored line in figure 4.2.


Since the previous figure shows the map created by the robot in this environment, the actual photo of the place (see figure 4.3) can be used to further analyze some particularities found in them.

Figure 4.3: Real Image of the Place Used for the Visual SLAM Experiment

Table 4.2 shows the errors associated with the measurements taken in the experiment.

Table 4.2: Errors Associated with the Experiment for X, Y and Yaw angle

Position  X error (cm)  Y error (cm)  Angle error (degrees)
0         0.14          -0.113        -0.032
1         2             0.296         0.743
2         -6            -2.36         2.47
3         1             -7.08         2.3
4         3             -8.09         6.78
5         5             -6.25         7.78
6         7             -9            0.0142
7         -25           9             3
8         -10           -8            -8
9         2             -22           9
10        20            -21           -6


The following figures (4.4, 4.5 and 4.6) present the comparison of the actual trajectory points with the estimated ones for the two axes and for the heading angle.

Figure 4.4: X-Axis Comparison of Results and Ground Truth

Figure 4.5: Y-Axis Comparison of Results and Ground Truth


Figure 4.6: Yaw Angle Comparison of Results and Ground Truth

Figure 4.7: Error Associated with the Measurements in the X and Y Axes

Figures 4.7 and 4.8 show the error of the visual system for the X and Y axis measurements and for the yaw angle, respectively.

Figure 4.8: Error Associated with the Measurement of the Heading Angle


The Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) are now calculated for each of the 3 quantities studied, using the following equations:

RMSE = sqrt( (1/N) * Σ_{t=1..N} (ŷ_t − y_t)² )

where N is the number of observations, ŷ_t is the real value of the variable and y_t is the value measured for that variable.

MAE = (1/N) * Σ_{t=1..N} |ŷ_t − y_t|

where all the variables are the same as previously defined for the RMSE.
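These two metrics can be computed with a few lines of Python (a generic sketch; the figures reported below come from the recorded experiment data, not from this snippet):

```python
import math

def rmse(ground_truth, estimates):
    """Root Mean Square Error between paired ground-truth and estimated values."""
    n = len(ground_truth)
    return math.sqrt(sum((g - e) ** 2 for g, e in zip(ground_truth, estimates)) / n)

def mae(ground_truth, estimates):
    """Mean Absolute Error between paired ground-truth and estimated values."""
    n = len(ground_truth)
    return sum(abs(g - e) for g, e in zip(ground_truth, estimates)) / n
```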

The calculated RMS errors are:

RMSE_x axis = 8.135 cm
RMSE_y axis = 9.493 cm
RMSE_yaw angle = 5.295 degrees

And the calculated MAE values are:

MAE_x axis = 6.012 cm
MAE_y axis = 7.562 cm
MAE_yaw angle = 4.192 degrees

• Discussion

The system is able to localize itself with an acceptable amount of error, although the errors increase when several consecutive rotations towards different orientations are made; this happened in a place where the depth was mostly beyond the 4-meter range limit of the Kinect sensor.

The errors in both axes show a similar behavior: they are almost mirror images of each other with respect to the horizontal zero-error line (see figure 4.7), which shows a close relationship between these 2 values and suggests that an error in one axis will likely lead to an error in the other.

The positioning error increases when the 90-degree turn is made, indicating that, when creating a completely new part of the map with no previously seen features, the system may require more observations of the environment.


The mean absolute error (MAE) shows that the X and Y axes have around 6 and 7 cm of localization error respectively, although these values were influenced by a few measurements with much higher error, especially in the X axis. This is corroborated by the RMSE, which is consequently higher than the MAE because of these outliers (the sensitivity to large errors is implicit in its quadratic nature). For the heading or yaw angle, the root mean square error and the mean absolute error are closer, indicating less variation between the errors of the estimates with respect to the ground truth.

In the center of figure 4.3 there is an object covered with plastic that was not scanned (see figure 4.2), probably due to the reflective properties of the material.

The blue chair on the right of the created map shows that, when the algorithm sees an object, it determines its depth based on the perspective it has, which creates some additional noise or errors in some regions of the newly created map.

Zones that are not directly observed and which contain varied light conditions are more susceptible to noise in the map.

4.2. Object Recognition Experiment

This experiment, which tested the tiny YOLO v2 algorithm, was carried out while the robot was walking, in order to increase the noise in the images and to see how well the object recognition performs when it is subject to the intrinsic movements involved in the dynamic walking process.

The data collected consisted of 605 frames recorded in 59.23 seconds, which gives an average rate of 10.21 Hz. This should be enough for real-time operation. It has to be pointed out that with these settings the system does not present any kind of lag while running.

The data collected is divided into 3 different types:

1. True positives: frames in which humans were successfully detected when they were present.
2. False positives: frames in which the algorithm detected humans that were not present in the images.
3. False negatives: frames in which YOLO failed to detect the humans present in the images.

The table 4.3 shows us the results of the test.

Table 4.3: Results Obtained with Tiny-Yolo-v2

Number of Frames  True Positives  False Positives  False Negatives
605               218             25               44


Figure 4.9: Example of False Positive of the Object Detection Algorithm

Figure 4.10: Graphical Representation of Results of Tiny-Yolo-v2

In order to interpret the results obtained, the precision and the recall are calculated:

Precision = TruePositives / (TruePositives + FalsePositives) = 218 / (218 + 25) = 0.897

Recall = TruePositives / (TruePositives + FalseNegatives) = 218 / (218 + 44) = 0.832
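The same numbers can be reproduced with a short Python check using the counts of Table 4.3:

```python
tp, fp, fn = 218, 25, 44    # counts from Table 4.3

precision = tp / (tp + fp)  # 218 / 243
recall = tp / (tp + fn)     # 218 / 262
print(round(precision, 3), round(recall, 3))  # 0.897 0.832
```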

• Discussion

The calculated precision tells us that 89.7% of the predictions made are correct, which implies that only 10.3% of the detections are false positives; the recall, on the other hand, indicates that 83.2% of the humans present within the range of the camera are positively recognized.

Figure 4.9 shows one example of a false positive detected during the operation of the robot, but this type of error is less relevant than the false negatives because, according to the results, the latter occur almost twice as often. The false negatives mostly occur when the human figure is not completely observed by the camera, so that the robot can only see a part of it, or when it is partially occluded by an object. Sudden movement of the camera also sometimes causes undesired false positives.

This algorithm has shown itself to be robust against flickering and against varying light conditions.

This algorithm has shown to be robust against flickering and light conditions that may vary.


Chapter 5

Conclusion and Recommendations

5.1. Conclusion

This thesis project covers the implementation of a visual SLAM system for a 4-legged robot, able to walk on terrains with obstacles and height variations of up to 25 mm, whose mechanical and electronic systems were designed and built from scratch. It also integrates a real-time object recognition algorithm, the deep neural network known as tiny-yolo-v2, used to identify humans by analyzing the RGB information obtained through a depth camera (Microsoft Kinect 360).

The main structure of the legged robot was 3D printed and has shown enough strength to be functional. On the actuation side, a major challenge was to generate and tune a dynamic gait pattern that reacts fast enough to keep the camera stable enough for the analysis of the visual data required by the demanded tasks. The division of tasks and ROS nodes that this robot demanded was successfully implemented: tiny-yolo-v2 runs fully on the NVIDIA board, which is also in charge of the connection with the Arduino Due and of the throttling and transport of the RGB and depth information to the remote client PC, which transforms this data into the 3D map and the pose estimation using the RTabMap and RGBD Odometry nodes.

The visual SLAM system has a mean absolute error (MAE) of 6 and 7.5 cm in the x and y axes respectively, and a MAE of 4.2 degrees, on the localization task; nonetheless, it could be improved considerably (obtaining better localization results and a map with higher resolution) if the processing power of the associated device were increased. The object recognition deep neural network used showed a precision of 89.7% and a recall of 83% in the task of human recognition.

5.2. Recommendations

Since the system currently relies solely on the camera information, the addition of other sensors would help to obtain a more accurate localization and map, especially when long displacements and/or dynamic environments are taken into account. However, this combination would be part of a longer project relying on the accuracy of a sensor fusion estimation, which would have to be further analyzed.


References

Wirtz, J. (2018). Brave new world: service robots in the frontline. Journal of Service Management.

García-Solera, Á. et al. (2018). Inclusion of service robots in the daily lives of frail older users: A step-by-step definition procedure on users' requirements. Archives of Gerontology and Geriatrics, Elsevier, pp. 191-196.

Tang, D. et al. (2018). Development Process of a Smart UAV for Autonomous Target Detection. The 16th LACCEI International Multi-Conference for Engineering, Education, and Technology: "Innovation in Education and Inclusion". New York.

Siegwart, R. et al. (2011). Introduction to Autonomous Mobile Robots. Cambridge, MA: MIT Press.

Bescos, B. et al. (2018). DynaSLAM: Tracking, Mapping and Inpainting in Dynamic Scenes. IEEE Robotics and Automation Letters (Volume 3, Issue 4).

Mur-Artal, R. (2014). ORB-SLAM: Tracking and Mapping Recognizable Features. Workshop on Multi-view Geometry in Robotics (MVIGRO). Zaragoza.

Li, X. et al. (2016). Semi-Dense 3D Semantic Mapping from Monocular SLAM.

Böttcher, S. (2006). Seminar in Principles of Robot Locomotion. Retrieved from Southern Illinois University: http://www2.cs.siu.edu/~hexmoor/classes/CS404-S09/RobotLocomotion.pdf

Kajita, S. & Espiau, B. (2018). Legged Robots Tutorial. Retrieved from Politecnico di Milano: http://home.deib.polimi.it/gini/robot/docs/legged.pdf

Vidanapathirana, M. (2018). Real-time Human Detection in Computer Vision. Retrieved from Medium: https://medium.com/@madhawavidanapathirana/https-medium-com-madhawavidanapathirana-real-time-human-detection-in-computer-vision-part-1-2acb851f4e55

Viola, P. & Jones, M. (2001). Rapid Object Detection using a Boosted Cascade of Simple Features. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Kauai, USA: IEEE.

Liu, W. et al. (2016). SSD: Single Shot MultiBox Detector. European Conference on Computer Vision, 21-37.

Redmon, J. et al. (2016). You Only Look Once: Unified, Real-Time Object Detection. Conference on Computer Vision and Pattern Recognition (CVPR). Las Vegas: IEEE.

Yap, J. et al. (2018). Fixed Point Implementation of Tiny-Yolo-v2 using OpenCL on FPGA. International Journal of Advanced Computer Science and Applications (IJACSA) (Volume 9, Issue 10).

Cong, J. & Xiao, B. (2014). Minimizing computation in convolutional neural networks. International Conference on Artificial Neural Networks, pp. 281-290.

Labbé, M. et al. (2019). RTAB-Map as an open-source lidar and visual simultaneous localization and mapping library for large-scale and long-term online operation. Journal of Field Robotics (Volume 36, Issue 2).

Labbé, M. et al. (2014). Online Global Loop Closure Detection for Large-Scale Multi-Session Graph-Based SLAM. 2014 IEEE/RSJ International Conference on Intelligent Robots and Systems. Chicago, USA: IEEE.

Angeli, A., Doncieux, S., Meyer, J.-A. & Filliat, D. (2008). Real-Time Visual Loop-Closure Detection. Proceedings of the IEEE International Conference on Robotics and Automation, 1842-1847.

Bradski, G. (2008). Camera Calibration and 3D Reconstruction. The OpenCV Library Software Tools. Retrieved November 11, 2019, from: https://docs.opencv.org/2.4/modules/calib3d/doc/camera_calibration_and_3d_reconstruction.html


Appendices

Appendix A

Bayesian Filter Used for Loop Closure Detection by the RTabMap Algorithm

The filter estimates the full posterior probability p(S_t | L^t) for all i = -1, ..., t_n, where t_n is the time index associated with the newest location in the Working Memory (WM), and where:

S_t is a random variable that represents the states of all loop closure hypotheses at time t,

η is a normalization term, and

L^t = L_{-1}, ..., L_t is the sequence of locations contained in WM, which changes over time due to retrievals from the Long-Term Memory.
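The expression of the filter itself was embedded as an image in the original document and did not survive extraction; as a reconstruction following the RTabMap loop closure papers (Labbé et al., 2014), the posterior is computed with the standard recursive Bayes update:

    p(S_t \mid L^t) = \eta \, p(L_t \mid S_t) \sum_{i=-1}^{t_n} p(S_t \mid S_{t-1} = i) \, p(S_{t-1} = i \mid L^{t-1})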

For this calculation, the observation model p(L_t | S_t) is evaluated using a likelihood function which is also used to determine the weight update of each node, where N_pair is the number of matched features between the compared locations and N_zt, N_zc are the numbers of features contained in each of them. The difference between the score s and the standard deviation σ is then normalized by the mean µ of all non-null scores.
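The corresponding expression below is likewise a reconstruction rather than a reproduction of the original figure; in the RTabMap formulation the likelihood of hypothesis j, with score s_j between the current location and location j, takes the form:

    p(L_t \mid S_t = j) =
    \begin{cases}
      \dfrac{s_j - \sigma}{\mu}, & \text{if } s_j \ge \mu + \sigma \\[4pt]
      1, & \text{otherwise}
    \end{cases}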

The transition model is used to predict the distribution of S_t given each state of the distribution S_{t-1}, in accordance with the robot's motion between t and t-1. Combined with p(S_{t-1} = i | L^{t-1}) (the recursive part of the filter), this constitutes the belief of the next loop closure. After the normalization step, the loop closure hypothesis with the highest score is accepted if that score is higher than a previously defined threshold.


Appendix B

RANSAC Algorithm, PnP Problem and Levenberg-Marquardt Algorithm

RANSAC Algorithm

This algorithm is a learning technique whose purpose is to estimate the parameters of a model by randomly sampling the observed data. It fundamentally consists of 2 steps:

1. A sample subset of the data is randomly chosen, and a fitting model and its parameters are estimated from it.

2. RANSAC checks all the elements of the data to determine whether they are consistent with the model of the previous step, using a threshold that tolerates a certain amount of error attributable to noise. The elements which do not fit the model are called outliers, whereas the others are inliers.

These 2 steps are repeated until an estimated model has a sufficient number of inliers.

Example of RANSAC fitting model estimation using a line

A typical example of this algorithm uses a line as the fitting model to separate the inliers from the outliers present in the data, as in the sketch below.
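To make the two steps concrete, the following is a minimal sketch (not code from this thesis) of RANSAC fitting a 2D line y = a*x + b with NumPy; the iteration count, tolerance, and stopping criterion are illustrative values:

    import numpy as np

    def ransac_line(points, n_iters=200, inlier_tol=1.0, min_inliers=50, rng=None):
        """Fit a 2D line y = a*x + b to `points` (N x 2 array) with RANSAC."""
        rng = np.random.default_rng() if rng is None else rng
        best_model, best_inliers = None, np.zeros(len(points), dtype=bool)
        for _ in range(n_iters):
            # Step 1: random minimal sample (2 points define a line) and model fit.
            i, j = rng.choice(len(points), size=2, replace=False)
            (x1, y1), (x2, y2) = points[i], points[j]
            if np.isclose(x1, x2):            # skip degenerate (vertical) samples
                continue
            a = (y2 - y1) / (x2 - x1)
            b = y1 - a * x1
            # Step 2: consensus check -- which points agree with the candidate model?
            residuals = np.abs(points[:, 1] - (a * points[:, 0] + b))
            inliers = residuals < inlier_tol
            if inliers.sum() > best_inliers.sum():
                best_model, best_inliers = (a, b), inliers
            if best_inliers.sum() >= min_inliers:   # enough consensus, stop early
                break
        return best_model, best_inliers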

Perspective-n-Point Problem

In the 1980s, Fischler proposed the Perspective-n-Point (PnP) problem whose goal is to

estimate the position and orientation of a calibrated camera from known 3D-2D point

correspondences.

Page 46: Visual SLAM and Human Recognition System of a Legged Robot

40

The Pinhole Camera Model is the most popular model used to solve this problem, and it is also the one used by OpenCV. Figure 37 and the projection equation (reconstructed after the list below) describe this mathematical model, where:

• (X, Y, Z) are the coordinates of a 3D point in the world coordinate space,

• (u, v) are the coordinates of the projection point in pixels,

• A is the camera matrix, or matrix of intrinsic parameters,

• (c_x, c_y) is the principal point, usually at the image center, and

• f_x, f_y are the focal lengths expressed in pixel units.
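The equation itself was an image in the original document; in the OpenCV documentation on which this appendix is based, the pinhole projection is written compactly and in expanded matrix form as:

    s \, m' = A \, [R \mid t] \, M'

    s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} =
    \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
    \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}
    \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}

where [R | t] is the joint rotation-translation (extrinsic) matrix and s is a scale factor.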

Pinhole Camera Model Graphical Interpretation

In order to solve this mathematical formulation, this node uses, through OpenCV, the Levenberg-Marquardt algorithm (LMA), and by this procedure finds the approximate solution for the angles and position of the camera (the parameters β used by LMA) that minimizes the reprojection error. The reprojection error is the Euclidean distance in pixels (on the image plane) between the point q perceived by the camera and the point ~q computed for the same feature from the estimated camera pose obtained by LMA.
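As an illustration of how this pose estimation is typically invoked (not the exact call used in this project), OpenCV exposes it through cv2.solvePnP; with the iterative flag the solution is refined with Levenberg-Marquardt. The point coordinates and intrinsics below are placeholder values:

    import cv2
    import numpy as np

    # Known 3D points (world frame) and their observed 2D projections (pixels).
    object_points = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 0],
                              [0, 0, 1], [1, 0, 1]], dtype=np.float64)
    image_points = np.array([[320, 240], [400, 238], [322, 170], [401, 168],
                             [318, 300], [398, 299]], dtype=np.float64)
    camera_matrix = np.array([[525.0, 0.0, 319.5],
                              [0.0, 525.0, 239.5],
                              [0.0,   0.0,   1.0]])   # example intrinsics
    dist_coeffs = np.zeros(5)                         # assume no lens distortion

    # SOLVEPNP_ITERATIVE minimizes the reprojection error with Levenberg-Marquardt.
    ok, rvec, tvec = cv2.solvePnP(object_points, image_points,
                                  camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    print(ok, rvec.ravel(), tvec.ravel())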

Levenberg-Marquardt Algorithm (LMA)

This is an iterative procedure whose purpose is to find the parameters β of the model curve f(x, β) so that the sum of squared errors S(β) is minimized. The minimization proceeds by replacing the parameter vector β with a new estimate β + δ; the function f(x_i, β + δ) is approximated by its first-order expansion around β, where J_i is the gradient of f with respect to β. Substituting this approximation into S(β + δ), writing it in vector notation, and then taking the derivative with respect to δ and setting the result to zero yields the normal equations for the update step.


Levenberg's contribution was to add a damping factor λ, adjusted at each iteration, which determines the amount of change δ applied to the parameters β. The equations of this derivation are reconstructed in the sketch below.
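These equations were embedded as images in the original document; the following reconstruction is the standard Levenberg-Marquardt derivation matching the description above, with y_i the measured values, J the Jacobian of f with respect to β, and I the identity matrix:

    S(\beta) = \sum_{i=1}^{m} \big[ y_i - f(x_i, \beta) \big]^2

    f(x_i, \beta + \delta) \approx f(x_i, \beta) + J_i \, \delta ,
    \qquad J_i = \frac{\partial f(x_i, \beta)}{\partial \beta}

    S(\beta + \delta) \approx \sum_{i=1}^{m} \big[ y_i - f(x_i, \beta) - J_i \, \delta \big]^2
    \approx \lVert \mathbf{y} - \mathbf{f}(\beta) \rVert^2
      - 2 \big[ \mathbf{y} - \mathbf{f}(\beta) \big]^{\mathsf T} J \delta
      + \delta^{\mathsf T} J^{\mathsf T} J \, \delta

    \big( J^{\mathsf T} J \big) \, \delta = J^{\mathsf T} \big[ \mathbf{y} - \mathbf{f}(\beta) \big]
    \qquad \text{(Gauss-Newton step)}

    \big( J^{\mathsf T} J + \lambda I \big) \, \delta = J^{\mathsf T} \big[ \mathbf{y} - \mathbf{f}(\beta) \big]
    \qquad \text{(Levenberg's damped step)}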