arXiv:1603.08047v1 [cs.RO] 25 Mar 2016

Persistent self-supervised learning principle: from stereo to monocular vision for obstacle avoidance

Kevin van Hecke, Guido de Croon, Laurens van der Maaten, Daniel Hennes, and Dario Izzo

(K.G. van Hecke and G.C.H.E. de Croon are with the MAV-lab, Control and Simulation, Faculty of Aerospace Engineering, Delft University of Technology, the Netherlands. L.J.P. van der Maaten is with the Faculty of Computer Science, Delft University of Technology, the Netherlands. D. Hennes and D. Izzo are with the Advanced Concepts Team of the European Space Agency, ESTEC, Noordwijk, the Netherlands. E-mail: [email protected], [email protected].)

Abstract—Self-Supervised Learning (SSL) is a reliable learning mechanism in which a robot uses an original, trusted sensor cue for training to recognize an additional, complementary sensor cue. We study for the first time in SSL how a robot's learning behavior should be organized, so that the robot can keep performing its task in the case that the original cue becomes unavailable. We study this persistent form of SSL in the context of a flying robot that has to avoid obstacles based on distance estimates from the visual cue of stereo vision. Over time it will learn to also estimate distances based on monocular appearance cues. A strategy is introduced that has the robot switch from stereo-vision-based flight to monocular flight, with stereo vision purely used as 'training wheels' to avoid imminent collisions. This strategy is shown to be an effective approach to the 'feedback-induced data bias' problem as also experienced in learning from demonstration. Both simulations and real-world experiments with a stereo-vision-equipped AR drone 2.0 show the feasibility of this approach, with the robot successfully using monocular vision to avoid obstacles in a 5 × 5 m room. The experiments show the potential of persistent SSL as a robust learning approach to enhance the capabilities of robots. Moreover, the abundant training data coming from the robot's own sensors allows for gathering the large data sets necessary for deep learning approaches.

Keywords—Persistent self-supervised learning, stereo vision, monocular depth estimation, robotics.

I. INTRODUCTION

It is generally acknowledged that robots operating in the real world benefit greatly from learning mechanisms. Learning allows robots to adapt to environments or circumstances not specifically foreseen at design time. However, the outcome of learning and its influence on the learning robot's behavior can by definition not be predicted completely. This is a major reason for the delay in introducing successful learning methods such as Reinforcement Learning (RL) in the real world. For instance, with RL it is a major challenge to ensure an exploratory behavior that is safe for both the robot and its environment [1].

Learning from demonstration (LfD) can in this respect be regarded as more reliable. However, in the case of a mobile robot, LfD faces a 'feedback-induced data bias' problem [2], [3]. If the robot executes its trained policy on real sensory inputs, its actions will be slightly different from the expert's. As a result, the trajectory of the robot will be different from when the human expert was in control, leading to a test data distribution that is different from the training distribution. This difference worsens the performance of the learned policy, further increasing the discrepancy between the data distributions.
The solution proposed in [2], [3] is to have the human expert provide novel training data for the sensory inputs experienced by the robot when being in control itself. This leads to an iterative process that requires quite a time investment from the human expert.

There is a relatively new learning mechanism for robots that combines reliability with the advantage of not needing any human supervision. Self-Supervised Learning (SSL) does not learn a control policy as LfD and RL do, but rather focuses on improving the sensory inputs used in control. Specifically, in SSL the robot uses the outputs of an original, trusted sensor cue to learn to recognize an additional, complementary sensor cue. The reliability comes from the fact that the robot has access to the trusted cue during the entire learning process, ensuring a baseline performance of the system.

Until now, the purpose of SSL has mostly been the exploitation of the complementarity between the sensor cues. To illustrate, perhaps the most well-known example is the use of SSL on Stanley, the car that won the DARPA Grand Challenge [4]. Stanley used a laser scanner to detect the road ahead. The range of the laser scanner was rather limited, which placed a considerable restriction on the robot's driving speed. SSL was used in order to extend the road detection beyond the range of the laser scanner. In particular, the available laser-scanner-based road classifications were used to train a color model of drivable terrain in camera images. This color model was then applied to image regions not covered by the laser scanner. These image regions, higher up in the image, allowed the road to be detected further away. The use of SSL permitted Stanley to speed up considerably and was an important factor in winning the competition.

A characterizing feature of many SSL studies [4]–[8] is that the learning and SSL estimation are restricted to one moment in time, i.e., to one image in the case of a video stream. This implies that the supervisory signal always has to be active as new images arrive from the stream. Only a few, very recent, studies have the learned function last longer over time [9], [10]. The idea in these studies is to give the robot the capability to sometimes act based solely on the complementary cue. For instance, in [9] the sense of touch is used to teach a vision process how to recognize traversable paths through vegetation, with the goal to gradually reduce time-intensive haptic interaction. Hence, the learning of recognizing the complementary cue will have to persist in time. However, the consequences of this persistent form of SSL on the robot's behavior when using the complementary cue have not been addressed in the above-mentioned studies.


In this article, we perform an in-depth study of the behavioral aspects of persistent SSL in the context of a scenario in which the robot should keep performing its task even when the supervisory cue becomes completely unavailable. Importantly, when the robot relies only on the complementary cue, it will encounter the feedback-induced data bias problem known from LfD. We suggest a novel decision strategy to handle this problem in persistent SSL.

Specifically, we study a flying robot with a stereo vision system that has to avoid obstacles. The robot uses SSL to learn a mapping from monocular appearance cues to the stereo-based distance estimates. We have chosen this context because it is relevant for any stereo-based robot that needs to be robust to a permanent failure of one of its cameras. In computer vision, monocular distance estimation is also studied. There, the main challenges are the gathering of sufficient data (e.g., for deep neural networks) and the generalization of the learned distance estimation to an unforeseen operation environment. Both of these challenges are addressed to some extent by SSL, as learning data is abundant and the robot learns in the environment in which it operates. We regard SSL as an important supplement to machine learning for robots. Therefore, we end the study with a discussion on the position of (persistent) SSL in the broader context of robot and machine learning, comparing it, among others, with reinforcement learning, learning from demonstration, and supervised learning.

The remainder of the article is set up as follows. First, in Section II, we discuss related work. Then, in Section III, we more formally introduce persistent SSL and explain our implementation for the stereo-to-mono learning. We analyze the similarity of the specific SSL case studied in this article with Learning from Demonstration approaches. Subsequently, in Section IV, we perform offline vision experiments in order to assess the performance of various parameter settings. Thereafter, in Section V, we compare various learning strategies. The best learning strategy is implemented for experiments with a Parrot AR drone 2, the results and analysis of which are described in Section VI. The broader implications of the findings on persistent SSL are discussed in Section VII, and conclusions are drawn in Section VIII.

II. RELATED WORK

We study persistent SSL in the context of a stereo-vision-equipped robot that has to learn to navigate with a single camera. In this section, we discuss the state-of-the-art in the areas most relevant to this study: monocular navigation and self-supervised learning.

A. Monocular navigation

The large majority of monocular navigation approaches focuses on motion cues. The optical flow of world points allows for the detection of obstacles [11] or even the extraction of structure from motion, as in monocular Simultaneous Localization And Mapping (SLAM) [12]. The main issue of using monocular motion cues for navigation is that optical flow only conveys information on the ratio between distance and the camera's velocity.

Additional information is necessary for retrieving distance. This information is typically provided by additional sensors [13], but can also be retrieved by performing specific optical flow maneuvers [14], [15].

In contrast, it is well known that the appearance of objects in a single still image does contain distance information. Successfully estimating distances in a single image allows robots to evaluate distances without having to move. In addition, many appearance extraction and evaluation methods are computationally efficient. Both these factors can aid the robot in making quick navigation decisions. Below we give an overview of work in the area of monocular appearance-based distance estimation and navigation.

1) Appearance-based navigation without distance estimation: There are some appearance-based navigation methods that do not involve an explicit distance estimate. For instance, in [16] an appearance variation cue is used to detect the proximity to obstacles, which is shown to be successful at complementing optical flow based time-to-contact estimates. A threshold is set that makes the flying robot turn if the variation drops too much, which will lead to turns at different distances.

An alternative approach is to directly learn the mapping from visual inputs to control actions. In order to fly a quadrotor through a forest, Ross et al. [3] use a variant of Learning from Demonstration (LfD) [17] to acquire training data on avoiding trees in a forest. First, a human pilot controls the drone, creating a training data set of sensory inputs and desired actions. Subsequently, a control policy is trained to mimic the pilot's commands as well as possible. This control policy is then implemented on the drone.

A major problem of this approach is the feedback-induced data bias. A robot has a feedback loop of actions and sensory inputs, so its control policy determines the distribution of world states that it encounters (with corresponding sensory inputs and optimal actions). Small deviations between the trained controller and the human may bring the robot into unknown states, in which it has received no training. Its control policy may generalize badly to such situations. The solution proposed in [3] is a transitional model, called DAgger [2], in which actions from the expert are mixed with actions from the trained controller. In the real-world experiments in [3], several iterations have been performed in which the robot flies with the trained controller and the captured videos are labeled offline by a human. This approach requires skilled pilots and significant human effort.

2) Offline monocular distance learning: Humans are able to see distances in a single still image, and there is a growing body of work in computer vision utilizing machine learning to do the same. Interest in single image depth estimation was sparked by work from Hoiem et al. [18] and Saxena et al. [19], [20]. Hoiem's Automatic Photo Pop-up tries to group parts of the image into segments that can be popped up at different depths. Saxena's Make-3D uses a Markov Random Field (MRF) approach to classify a depth per image patch on different scales. These studies focus on creating a dense depth map with a machine learning computer vision approach. Both methods use supervised learning on a large training dataset. Some work was done on adopting variants of Saxena's MRF work for driving rovers and even for MAVs.


Lenz et al. [21] propose a solution based on an MRF to detect obstacles on board an MAV, but it does not infer how far away the objects are. Instead, it is trained offline to recognize three different obstacle class types. Any different object could hence lead to navigation problems.

Recently, again focusing on creating a dense depth map from a single image, Eigen et al. [22] propose a multi-scale deep neural network approach, trained on the KITTI dataset, making it more resilient to practical robot data. Training deep neural networks requires a large data set, which is often obtained by deforming training data or by artificially generating training data. Michels et al. [23] use artificial data to learn to recognize obstacles on a rover, but in order to generalize well, this requires the use of a very realistic simulator. In addition, the same work reports significant improvement if the artificial data is augmented with labeled real-world data.

Other groups acquire training data for supervised learning by having another, separate robot or system acquire data. This data is then processed and learned offline, after which the learned algorithm is deployed on the target robot. Dey et al. [24] use an RC car with a stereo vision setup to acquire data from an environment, apply machine learning on this data offline, and navigate a similar but unseen environment with an MAV based on the trained algorithm. Creating and operating a secondary system designed to acquire training data, however, is no free lunch. Moreover, it introduces inconvenient biases in the training data, because an RC car will not behave in a similar way, in terms of dynamics, camera viewpoint, and the path chosen through the environment.

None of the above methods have the robot gather the data and learn while in operation.

B. Self-supervised learning

The idea of self-supervised learning has been around since the late 1990s [25], but its successful application to terrain classification on the autonomously driving car Stanley [26] demonstrated its first major practical use. A similar approach was taken by Hadsell et al. [7], but using a stereo vision system instead of a LIDAR system, and complex convolutional filters instead of simple and fixed color based features. These approaches largely forgo the need for manually labeled data, as they are designed to work in unseen environments.

In the studies on self-supervised learning for terrain classification, the ground truth is assumed to be always available. In fact, the learned function only lasts for a single image, implying that the trained terrain appearance models are directly applicable to very similar data.

Two very recent studies do have the learned function last longer over time, both with the idea of having the robot take some decisions based on the complementary sensor cue alone. Beleia et al. [9] study a rover with a haptic antenna sensor. In their application of terrain mapping, they try to map monocular cues to obstacles, based on earlier events of encountering similar situations, which resulted in either a hard obstacle, a traversable obstacle, or a clear path. The monocular information is used in a path planning task, requiring a cost function for either exploring unknown potential obstacles or driving through the terrain on the currently available information.

Since checking whether a potential obstacle is traversable is costly (the rover needs to travel there in order for the antenna to provide ground truth on that), the robot learns to classify the terrain ahead with vision. On each sample, an analysis is performed to determine whether the vision-based classifier is sufficiently confident: it either decides the terrain is traversable, not traversable, or is unsure. In the unsure case, the sample is sensed using the antenna. Gradually, this will be necessary less often, and the robot thus learns to navigate using its Kinect sensor alone. In Ho et al. [10], a flying robot first uses optical flow to select a landing site that is flat and free of obstacles. In order for this to work, the robot has to move sufficiently with respect to the objects on the landing site. While flying, the robot uses SSL to learn a regression function that maps an (appearance-based) texton distribution to the values coming from the optical flow process. The learned function extends the capabilities of the robot, as after learning it is also able to select landing sites without moving (from hover). The article shows that if the robot is uncertain about its appearance-based estimates, it can switch back to the original optical flow based cue.

In this article, we focus on the behavioral aspects of persistent SSL. We study how to best set up the learning process, so that the robot will be able to keep performing its task when the original sensor cue becomes completely unavailable.

III. METHODOLOGY OVERVIEW

This section starts by formally and generally posing the persistent SSL learning mechanism (Subsection III-A). Next, in Subsection III-B, we show how this method is used for our specific proof of concept case, monocular depth estimation in flying robots, providing subsequent implementation details in the following Subsections III-C to III-F. Lastly, in Subsection III-G, we identify a behavioral issue when applying persistent SSL and formally define the learning problem for our proof of concept. For a comparison between persistent SSL and other machine learning methods, we refer the reader to the discussion.

A. Persistent SSL principle

The persistent SSL principle is schematically depicted in Figure 1. In persistent SSL, an original, pre-wired sensory cue provides supervised outputs to a learning process that takes a different, complementary sensory cue as input. The goal is to be able to replace the pre-wired cue if necessary. When considering the system as a whole, learning with persistent SSL can be considered as unsupervised: it requires no manual labeling or pre-training before deployment in the field. Internally, it uses a supervised learning method that in fact needs ground truth labels to learn. This ground truth is, however, assumed to be provided online and autonomously, without human or outside interference.

In the schematic, the input variable x represents the sensory inputs available on board. The variables x_g and x_f are possibly overlapping subsets of these sensory inputs. In particular, the function g(x_g) extracts a trusted ground truth sensory cue from the sensory inputs x_g.


Fig. 1. The persistent self-supervised learning principle.

In classical systems, g(x_g) provides the required functionality on its own:

g : x_g → y,  x_g ⊆ x    (1)

The function f(x_f) is learned with a supervised learning algorithm in order to approximate g(x_g) based on x_f:

f : x_f → y,  f ∈ F,  x_f ⊆ x    (2)

f = argmin_{f∈F} E[l(f(x_f), g(x_g))]    (3)

where l(f(x_f), g(x_g)) is a suitable loss function that is to be minimized. The system can either choose, or be forced, to switch θ, so that λ is either set to g(x_g) or to f(x_f) for use in control. Future work may also include fusing the two, but in this article we focus on using the complementary cue in a stand-alone fashion. It must be noted that, while both x_g ⊆ x and x_f ⊆ x, in general it may be that x_f does not contain all necessary information to predict y. In addition, even if x_g = x_f, it is possible that F does not contain a function f that perfectly models g. The information in x_f and the function space F may not allow for a perfect estimate of g(x_g). On the other hand, there may be an f(x_f) that handles certain situations better than g(x_g) (think of landing site selection from hover, as in [10]). In any case, fundamental differences between g(x_g) and f(x_f) are to be expected, which may significantly influence the behavior when switching θ. Handling these differences is of central interest in this article.
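To make Eqs. 1–3 concrete, the following Python sketch shows one minimal realization of the persistent SSL loop; all names are hypothetical, and the kNN regressor mirrors the choice made later in this article. The trusted cue g(x_g) labels the complementary features x_f online, and the switch θ selects which cue is output as λ for control:

    import numpy as np
    from sklearn.neighbors import KNeighborsRegressor

    class PersistentSSL:
        # Minimal sketch of Eqs. 1-3 (hypothetical names).
        def __init__(self):
            self.f = KNeighborsRegressor(n_neighbors=5)  # learned estimator f(x_f)
            self.X, self.y = [], []                      # aggregated training pairs
            self.use_trusted = True                      # the switch theta

        def step(self, x_f, g_value):
            # g_value = g(x_g), e.g. the average stereo disparity (trusted cue).
            if g_value is not None:
                self.X.append(x_f)                       # self-supervised sample
                self.y.append(g_value)
                if len(self.y) >= 5:                     # kNN needs at least k samples
                    self.f.fit(np.array(self.X), np.array(self.y))
            if self.use_trusted and g_value is not None:
                return g_value                                # lambda = g(x_g)
            return float(self.f.predict(np.array([x_f]))[0])  # lambda = f(x_f)

Refitting on every sample is wasteful but keeps the sketch short; on board, the training set is simply aggregated and the non-parametric estimator queries it directly.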

B. Stereo-to-mono proof of concept

Figure 2 presents a block diagram of the proposed proof of concept system, in order to visualize how the persistent SSL method is employed in our application: estimating monocular depth in a flying robot. Input is provided by a stereo vision camera, with either the left or the right camera image routed to the monocular estimator. We use a Visual Bag of Words (VBoW) method for this estimator (see Subsection III-D). The ground truth for persistent SSL in this context is provided by the output of a stereo vision algorithm. In this case, the average value of the disparity map is used, both for training the monocular estimator and as an input to the switch θ. Based on the switch, the system delivers either the monocular or the stereo vision average disparity to the behavior controller.

C. Stereo vision processing

The stereo camera delivers a synchronized, gray-scale stereo-pair image per time sample. A stereo vision algorithm first computes a disparity map, but this is often a far from perfect process. Especially in the context of an MAV's size, weight, and computational constraints, errors caused by imperfect stereo calibration, resolution limits, etc., can cause large pixel errors in the results. Moreover, learning to estimate a dense disparity map, even when this is based on a high quality and consistent data set, is already very challenging. Since we use the stereo result as ground truth for learning, we minimize the error by averaging the disparity map to a single scalar. A single scalar is much easier to learn than a full depth map, and has been demonstrated to provide elementary obstacle avoidance capability [27], [23], [28]. The disparity λ relates to the depth d of the input image:

d ∝ 1/λ    (4)

Using averaged disparity instead of averaged depth fits the obstacle avoidance application better, because small but close-by objects are emphasized due to the non-linear relation of Eq. 4. However, linear learning methods may have difficulty mapping this relation. In our final design we thus choose to learn the disparity with a non-parametric approach, which is resilient to nonlinearities.
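As an illustration, a minimal sketch of this ground-truth reduction, using OpenCV's block matcher as a stand-in for the on-board stereo algorithm (the matcher parameters are assumptions):

    import cv2
    import numpy as np

    def average_disparity(left_gray, right_gray):
        # Reduce a (noisy) disparity map to the single scalar used as SSL label.
        matcher = cv2.StereoBM_create(numDisparities=32, blockSize=15)
        disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0
        valid = disparity[disparity > 0]          # drop unmatched pixels
        return float(valid.mean()) if valid.size else 0.0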

D. Monocular disparity estimation

The monocular disparity estimator forms a function from the image's pixel values to the average disparity in the image. Since the main goal of the article is to study SSL on board a drone in real-time, the efficiency of both the learning and the execution of this function was at a premium. Hence, we converged on a computationally extremely efficient Visual Bag of Words (VBoW) approach for the robotic experiments. We have also explored a deep neural network approach, but the hardware and time available for learning did not allow for having the deep neural learning on board the drone at this stage.

The VBoW method uses small image patches of w × h pixels, as successfully introduced in [29] for a complex texture classification problem. First, a dictionary is created by clustering the image patches with Kohonen clustering (as in [28]). The cluster centroids are called 'textons'. After formation of the dictionary, when an image is received, m patches are extracted from the W × H pixel image. Each patch is compared to the dictionary in order to form a texton occurrence histogram for the image. The histogram is normalized to sum to 1. Then, each normalized histogram is supplemented with its Shannon entropy, resulting in a feature vector of size n + 1. The idea behind adding the entropy is that the variation of textures in an image decreases when the camera gets closer to obstacles [28]. To illustrate the change in texton histograms when approaching an obstacle, a time series of texton histograms can be seen in Figure 3. Please note how the entropy of the distribution indeed decreases over time, and that especially the fourth bin is much higher when close to the poster on the wall.


Fig. 2. System overview (blocks: stereo processing, monocular estimator, PSSL, filter, switch, behavior routines, robot controller; signals: average disparity, average disparity estimate, safety override, control signal). Please see the text for details.

A machine learning algorithm will have to learn to notice such relationships itself for the robot's environment, by learning a mapping from the feature vector to a disparity. We have investigated different function representations and learning methods to this end (see Section IV).
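A sketch of the VBoW feature extraction described above; the texton dictionary is assumed to have been learned beforehand with Kohonen clustering (not shown), and the sampling parameters follow Table I:

    import numpy as np

    def texton_features(image, textons, n_samples=500, patch=5):
        # Texton histogram + Shannon entropy feature for one grayscale image.
        # textons: (n, patch*patch) array, the pre-learned dictionary.
        H, W = image.shape
        hist = np.zeros(len(textons))
        rng = np.random.default_rng()
        for _ in range(n_samples):
            y = rng.integers(0, H - patch)
            x = rng.integers(0, W - patch)
            p = image[y:y + patch, x:x + patch].astype(np.float32).ravel()
            hist[np.argmin(((textons - p) ** 2).sum(axis=1))] += 1  # nearest texton
        hist /= hist.sum()                                           # normalize to 1
        nz = hist[hist > 0]
        entropy = -np.sum(nz * np.log2(nz))                          # Shannon entropy
        return np.append(hist, entropy)                              # size n + 1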

Fig. 3. Approaching a poster on the wall. Left: monocular input. Middle: overlaid textons, annotated with the color used in the histogram. Right: texton distribution histogram, with the corresponding texton shown beneath it.

E. Behavior

The proposed system uses a straightforward behavior heuristic to explore, navigate, and persistently learn a room. The heuristic is depicted as a Finite State Machine (FSM) in Figure 4. In state 0, the robot flies in the direction of the camera's principal axis. When an obstacle is detected (λ > t), the robot stops and goes to state 1, in which it randomly chooses a new direction for the principal axis. It immediately passes to state 2, in which the robot rotates towards the new direction, reducing the error e between the principal axis' current and desired direction. If in the new direction obstacles are far enough away (λ ≤ t), the robot starts flying forward again (state 0). Else, the robot continues to turn in the same direction as before (clockwise or counter-clockwise) until λ ≤ t. When this is the case, it starts flying straight again (state 0). The FSM detects obstacles by means of a threshold t applied to the average disparity λ. Choosing this rather straightforward behavior heuristic enables autonomous exploration based on only one scalar obtained from a distance sensor.
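A compact sketch of this heuristic; the state numbering follows Figure 4, while the command strings and threshold names are illustrative assumptions:

    import random

    def fsm_step(state, disp, err, t_disp, t_err):
        # One tick of the exploration FSM; returns (new_state, command).
        if state == 0:                          # straight ahead
            return (1, "stop") if disp > t_disp else (0, "forward")
        if state == 1:                          # pick a new random direction
            return 2, ("set_heading", random.uniform(-3.14, 3.14))
        # state == 2: rotate until aligned; keep turning while the path is blocked
        if err > t_err or disp > t_disp:
            return 2, "turn"                    # same turn direction as before
        return 0, "forward"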

Fig. 4. The behavior heuristic FSM (state 0: straight ahead; state 1: pick random direction; state 2: rotate). λ is the average disparity, e is the attitude error (meaning the difference between the newly picked direction and the current attitude), and t_λ, t_e are the respective thresholds.


F. Performance

The average disparity λ, coming either from stereo vision or from the monocular distance estimation function f(x_f), is thresholded for determining whether to turn. This leads to a binary classification problem, where all samples for which λ > t are considered as 'positive' (c = 1) and all samples for which λ ≤ t are considered as 'negative' (c = 0). Hence, the quality of f(x_f) can be characterized by a ROC curve. The ground truth for the ROC curve is determined by the stereo vision. This means that a True Positive Ratio (TPR) of 1 and a False Positive Ratio (FPR) of 0 lead to the same obstacle detection performance as with the stereo vision system. Generally, of course, this performance will not be reached, and the robot has to determine what threshold to set for a sufficiently high TPR and low FPR.

This leads to the question how to determine what a 'sufficient' TPR / FPR is. We evaluate this matter in the context of the robot's obstacle avoidance task. In particular, we first look at the probability of a collision with a given TPR, and then at the probability of a spurious turn with a given FPR.

In order to model the probability of a collision, consider a constant velocity approach with input samples (images) ⟨x_1, x_2, ..., x_n⟩ of n samples long, ending at an obstacle. An obstacle must be detected, by at least one TP or FP, a minimum of u samples before actual impact in order to prevent a collision. Since the range of samples ⟨x_{n−u+1}, x_{n−u+2}, ..., x_n⟩ does not matter for the outcome, we redefine the approach range to be ⟨x_1, x_2, ..., x_{n−u}⟩. Consider that for each sample x_i it holds that:

1 = p(TP|x_i) + p(FP|x_i) + p(TN|x_i) + p(FN|x_i)    (5)

since p(FN|x_i) = p(TP|x_i) = 0 if x_i is a negative, and p(TN|x_i) = p(FP|x_i) = 0 if x_i is a positive. Let us first assume independent, identically distributed (i.i.d.) data. Then the probability of a collision, p_c, can be written as:

p_c = ∏_{i=1}^{n−u} (p(FN|x_i) + p(TN|x_i)) = ∏_{i=1}^{q} p(TN|x_i) · ∏_{i=q+1}^{n−u} p(FN|x_i)    (6)

where q is a time step separating two phases in the approach. In the first phase, all x_i are negative, so that any false positive will lead to a turn, preventing the collision. Only if all negative samples are correctly classified as negatives (true negatives) will the robot enter the second phase, in which all x_i are positive. Then, only a complete sequence of false negatives will lead to a collision, since any true positive will lead to a timely turn.

We can use Eq. 6 to choose an acceptable TPR = 1 − FNR. Assuming a constant velocity and frame rate, it gives us the probability of a collision. For instance, let us assume that the robot flies forward at 0.50 m/s with a frame rate of 30 Hz, that it has a minimal required detection distance of 1.0 m, and that positives are defined to be closer than 1.5 m. This leads to s = 30 samples that all have to be classified as negatives for a collision to occur.

In the case of i.i.d. data, if the TPR = 0.95, the probability of a collision is p_c = (1 − TPR)^s = 0.05^30 ≈ 9.31 × 10^-40, an extremely small value. With this analysis, even a TPR = 0.30 leads to an acceptable p_c ≈ 2.25 × 10^-5.

The analysis of the effect of false positives is straightforward, as it can be expressed in the number of spurious turns per second or, equivalently, if assuming a constant velocity, per meter travelled. With the same scenario as above, an FPR = 0.05 will on average lead to 3 spurious turns per travelled meter, which is unacceptably high. An FPR = 0.0017 will approximately lead to 1 spurious turn per 10 meters.

The above analysis seems to indicate that quite many false negatives are acceptable, while there can only be very few false positives. However, there are two complicating factors. The first factor is that Eq. 6 only holds when X can be assumed identically and independently distributed (i.i.d.), which is unlikely, due to the nature of the consecutive samples of an approach towards an obstacle. Some reasoning about the nature of the dependence is, however, possible. Assuming a strong correlation between consecutive samples results in a higher probability of x_i being classified the same as x_{i+1}. In other words, if a sample in the range ⟨x_1, ..., x_m⟩ is an FN, the chance that more samples are FNs increases. Hence, the expected dependencies negatively impact the performance of the system, making Eq. 6 a best-case scenario.

The system can be more realistically modelled as a Markov process, as depicted in Figure 5. From this it can be seen that the system can be split into a reducible Markov process with an absorbing avoid state, and a chain of states that leads to the absorbing collision state. The values of the transition matrix Ω can be determined from the data gathered during operation. This would allow the robot to better predict the consequences of a chosen TPR and FPR.

As an illustration of the effects of sample dependence, let us suppose a model in which each classification has a probability of being identical to the previous classification, p(I(c_{i−1}, c_i)). If not identical, the sample is classified independently. This dependency model allows us to calculate the transition Ω_45 in Figure 5. Given a previous negative classification, the transition probability to another negative classification is Ω_45 = p(I(c_{i−1}, c_i)) + (1 − p(I(c_{i−1}, c_i)))(1 − TPR). If p(I(c_{i−1}, c_i)) = 0.8 and TPR = 0.95 as above, Ω_45 = 0.81. The probability of a collision in such a model is p_c = Ω_45^{s−1} ≈ 1.8 × 10^-3, no longer an inconceivably small number.
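These back-of-the-envelope numbers are easy to reproduce; a small sketch under the same assumptions (30 Hz, 0.50 m/s, s = 30 positive samples):

    # Collision probabilities under the i.i.d. and correlated-sample models.
    s, tpr, fpr = 30, 0.95, 0.05

    p_iid = (1 - tpr) ** s                           # Eq. 6, i.i.d. case: ~9.3e-40

    p_ident = 0.8                                    # p(I(c_{i-1}, c_i))
    omega_45 = p_ident + (1 - p_ident) * (1 - tpr)   # = 0.81
    p_markov = omega_45 ** (s - 1)                   # on the order of 1e-3

    turns_per_meter = fpr * 30 / 0.50                # spurious turns per meter: 3.0

    print(p_iid, p_markov, turns_per_meter)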

This leads us to the second complicating factor, which is specific to our SSL setup. Since the robot operates on the basis of the ground truth, it should theoretically hardly ever encounter positive samples. Namely, the robot should turn when it detects a positive sample. This implies that the uncertainty on the estimated TPR is rather high, while the FPR can be estimated better. A potential solution to this problem is to purposefully have the mono-estimation robot turn earlier than the stereo-vision-based one.


Fig. 5. Markov model of the probability of a collision. States: 0 (avoid), 1 (obstacle, FP), 2 (no obstacle, TP), 3 (no obstacle, TN), 4, 5, ... (obstacle, chains of FNs), and a final crash state; the Ω_ij denote the transition probabilities.

G. Similarity with Learning from Demonstration

The core of SSL is a supervised algorithm that learns the function f(x_f) on the basis of supervised outputs g(x_g). Normally, supervised learning assumes that the training data is drawn from the same data probability distribution D as the test data. However, in persistent SSL this assumption generally does not hold. The problem is that, by using control based on f, the robot follows a control policy π_f ≠ π_g, and hence will induce a different state distribution, D_{π_f} ≠ D_{π_g}. On these different states, no supervised outputs have been observed yet, which typically implies an increasing difference between f and g.

A similar problem of inducing a different state distribution is well-known in the area of learning from demonstration [2], [3]. Actually, we will show that under some mild assumptions, the persistent SSL problem studied in this paper is equivalent to a learning from demonstration problem. Hence, we can draw on solutions from learning from demonstration, such as DAgger [2].

The goal of learning from demonstration is to find a policy π that minimizes a loss function l under its induced distribution of states, from [2]:

π = argmin_{π∈Π} E_{s∼D_π}[l(s, π)]    (7)

where an optimal teacher policy π* is available to provide training data for specific states s. At first sight, SSL is quite different, as it focuses only on the state information that serves as input to the policy. Instead of optimizing a policy, the supervised learning in persistent SSL can be defined as finding the function f′ that best matches the trusted function g′ under the distribution of states induced by the use of the thresholded version of f for control:

argmin_{f∈F} E_{x∼D_{π_f}}[l(f(x), g(x))]    (8)

meaning that we perform regression of f on states that are induced by the control policy π_f, which uses the thresholded version of f. To see the similarity to Eq. 7, first realize that the stereo-based policy is in this case the teacher policy: π_g = π*. For this analysis, we simplify the strategy to flying straight when far enough away from an obstacle, and turning otherwise:

π_g(s) : p(straight | s = 0) = 1,  p(turn | s = 1) = 1    (9)

where s is the state, with s = 1 when g(x) > t_g and s = 0 otherwise. Please note that π_g is a deterministic policy, which is assumed to be optimal. When we learn a function f, it generally will not give exactly the same outputs as g. Using s = f > t_f will result in the following stochastic policy:

π_f(s) : p(straight | s = 0) = TNR,  p(turn | s = 0) = FPR,  p(turn | s = 1) = TPR,  p(straight | s = 1) = FNR    (10)

a stochastic policy which by definition is optimal, π_f = π_g, if FPR = FNR = 0. In addition, then D_{π_f} = D_{π_g}. Thus, if we make the assumption that minimizing l(f(x), g(x)) also minimizes FPR and FNR, capturing any preference for one or the other in the cost function for the behavior l(s, π), then minimizing f in Eq. 8 is equivalent to minimizing the loss in Eq. 7.

The interest of the above-mentioned similarity lies in the use of proven techniques from the field of learning from demonstration for training the persistent SSL system. For instance, in DAgger, during the first learning iteration only the expert policy is learned. After this iteration, a mix of the learned and expert policy is used: π_i = π* with probability β_i, and π_i = π̂_i with probability (1 − β_i), where β_i is reduced over the iterations.¹

8

The policy at iteration i depends on all previously learned data, which is aggregated over time. In [2] it is proven that this strategy gives a favorable, limited no-regret loss, which significantly outperforms the traditional learning from demonstration strategy of just learning from the data distribution of the expert policy. In Section V we will show that the main lessons from [2] also apply to persistent SSL.
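A sketch contrasting this probabilistic DAgger mixing with the 'training wheels' override scheme evaluated later in this article; the action names, decay schedule, and threshold are illustrative assumptions:

    import random

    def dagger_action(i, beta0, expert_action, learned_action):
        # DAgger-style mixing: follow the expert with probability beta_i.
        beta_i = beta0 ** i                # one common decaying schedule
        return expert_action if random.random() < beta_i else learned_action

    def training_wheels_action(stereo_disparity, learned_action, t_override):
        # 'Training wheels': always act on the learned policy, but let the
        # trusted stereo cue override when an obstacle is dangerously close.
        if stereo_disparity > t_override:  # imminent collision according to stereo
            return "turn"                  # safety override
        return learned_action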

IV. OFFLINE VISION EXPERIMENTS

In this section, we perform offline vision experiments. The goal of these experiments is to determine how good the proposed VBoW method is at estimating monocular depth, and to determine the best parameter settings.

To measure the performance, we use two main metrics: the mean square error (MSE) and the area under the curve (AUC) of a ROC curve. MSE is an easy metric that can be directly used as a loss function, but in practice many situations exist in which a low MSE is achieved while the performance is still inadequate as a basis for reliable MAV behavioral control. The AUC captures the trade-off between TPR and FPR, and hence is a good indication of how good the performance is in terms of obstacle detection.
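Both metrics can be computed directly from logged estimates, with the stereo output providing the labels; a sketch using scikit-learn (it assumes both classes occur in the evaluated set):

    import numpy as np
    from sklearn.metrics import mean_squared_error, roc_auc_score

    def evaluate(est_disp, true_disp, threshold):
        # MSE on the raw disparities; AUC on the induced obstacle detections.
        mse = mean_squared_error(true_disp, est_disp)
        labels = (np.asarray(true_disp) > threshold).astype(int)  # stereo = ground truth
        auc = roc_auc_score(labels, est_disp)   # estimates act as detection scores
        return mse, auc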

Fig. 6. Example from dataset 1 (left, 128 × 96 pixels) and dataset 2 (right, 640 × 480 pixels).

We use two data sets in the experiments. The first dataset is a video made on a drone during an autonomous flight, using the onboard 128 × 96 pixel stereo camera. The second dataset is a video made by manually walking with a higher quality 640 × 480 pixel stereo camera through an office cubicle, in a similar fashion as the robot should move in the later online experiments. Datasets 1 and 2 used in this section have been made publicly available for download.²,³ An example image from each dataset is shown in Figure 6.

Our implementation of the VBoW method has six main parameters, ranging from the number of intensity and gradient textons to the number of samples used to smooth the estimated disparity over time. An exhaustive search of parameters being out of reach, we have performed various investigations of parameter changes along a single dimension.

¹In [2] this mixture was written as π_i = β_i π* + (1 − β_i) π̂_i, hinting at a mixed control. However, it is described as a policy that queries the expert to choose controls a fraction of the time while collecting the next dataset. For this reason, and because it makes more sense given our binary control strategy, we mention this policy as a probabilistic mixture in this article.

²Dataset 1 can be downloaded and viewed from http://1drv.ms/1NOhIOh
³Dataset 2 can be downloaded from http://1drv.ms/1NOhLtA

Table I presents a list of the final tuned parameter values. Please note that these parameter values have not only been optimized for performance. Whenever performance differences were marginal, we have chosen the parameter values that saved on computational effort. This choice was guided by our goal to perform the learning on board a computationally limited drone. Below we show a few of the results when varying a single parameter, deviating from the settings in Table I in the corresponding dimension.

Fig. 7. Used texton dictionaries. Left: for camera 1; right: for camera 2.

In this work, two types of textons are combined to form a single texton histogram: normal intensity textons, as obtained with Kohonen clustering as in [28], and gradient textons, obtained similarly, but based upon the gradient of the images. Gradient textures have been shown in [30] to be an important depth cue. An example dictionary of each is depicted in Figure 7. Gradient textons are shown with a color range (from blue = low to red = high). The intensity textons in Figure 7 are based on grayscale intensity pixel values.

Figure 8 shows the results for different numbers of textons ∈ {4, 8, 12, 16, 20}, always consisting of half pixel intensity and half gradient textons. From the results we can see that the performance saturates around 20 textons. Hence, we selected this combination of 10 intensity and 10 gradient textons for the experiments.

The VBoW method involves choosing a regression algorithm. In order to determine the best learning algorithm, we have tested four regression algorithms, limiting the choice mainly based on the feasibility of implementing the regression algorithm on board a constrained embedded system. We have tested two non-parametric (kNN and Gaussian Process regression) and two parametric (linear and shallow neural network regression) algorithms. Figure 9 presents the learning curves for a comparison of these regressors. Clearly, in most cases the kNN regression comes out best. A naïve implementation of kNN suffers from a larger training set in terms of CPU usage at test time, but after implementation on the drone this did not become a bottleneck.

The final offline results on the two datasets are quite satisfactory. They can be viewed online.⁴,⁵ After a training set of roughly 4000 samples, the kNN approximates the stereo-vision-based disparities in the test set rather well. Given a desired TPR of 0.96, the learner has an FPR of 0.47. This should be sufficient for usage of the estimated disparities in control.

V. SIMULATION EXPERIMENTS

In Section III we argued that a persistent form of SSL is similar to learning from demonstration. The relevance of this similarity lies in the behavioral schemes used for learning.

⁴A video of VBoW visualizations on dataset 1 is available here: http://1drv.ms/1KdRtC1
⁵A video of VBoW visualizations on dataset 2 is available here: http://1drv.ms/1KdRxlg


Fig. 8. MSE and AUC for different numbers of textons ∈ {4, 8, 12, 16, 20}, as a function of training set size [frames], for dataset 1 (left) and dataset 2 (right). Dashed / solid lines refer to results on the train / test set.

TABLE I. PARAMETER SETTINGS

    Parameter                    Value
    Number of intensity textons  10
    Number of gradient textons   10
    Patch size                   5x5
    Subsampling samples          500
    kNN                          k=5
    Smooth size                  4

In this section, we compare three learning schemes in simulation.

A. Setup

We simulate a 'flying' drone with a stereo vision camera in SmartUAV [31], an in-house developed simulator that allows for 3D rendering and simulation of the sensors and algorithms used on board the real drone. Figure 10 shows the simulated 'office room'. The room has a size of 10 × 10 meter, and the drone has an average forward speed of 0.5 m/s. All the vision and learning algorithms are exactly the same as the ones that run on board the drone in the real experiments.

Three learning schemes are designed as follows. They all start with an initial learning period in which the drone is controlled purely by means of stereo vision. In the first learning scheme, the drone will continue to fly based on stereo vision for the remainder of the learning time.

Fig. 10. SmartUAV simulation environment.

After learning, the drone immediately switches to monocular vision. For this reason, the first scheme is referred to as 'cold turkey'. In the second learning scheme, the drone will perform a stochastic policy, selecting the stereo-vision-based actions with a probability β_i and monocular-based actions with a probability (1 − β_i), as was proposed in the original DAgger article [2]. In the experiments, β_i = 0.25. Finally, in the third learning scheme, the drone will perform monocular-based actions, with stereo vision only used to override these actions when the drone gets too close to an obstacle.


Fig. 9. VBoW regression algorithm learning curves (kNN (k=5), linear, GP, neural net): MSE and AUC as a function of training set size [frames], for dataset 1 (left) and dataset 2 (right). Dashed / solid lines refer to results on the train / test set.

Therefore, we refer to this scheme as 'training wheels'.

After learning, the drone will use its monocular disparity estimates for control. The stereo vision remains active, only for overriding the control if the drone gets too close to a wall. During this testing period, we register the number of turns and the number of overrides. The number of overrides is a measure of the number of potential collisions. The number of turns is compared to the number of turns performed when using stereo vision, to evaluate the number of spurious turns. The initial learning period is 1 minute, the remaining learning period is 4 minutes, and the test time is 5 minutes. These times have been selected to allow a full experiment on a single battery of the real drone (see Section VI).

B. Results

Table II contains the results of 30 experiments with the three learning schemes and a purely stereo-vision-controlled drone. The first observation is that 'cold turkey' gives the worst results. This result was to be expected on the basis of the similarity between persistent SSL and learning from demonstration: the learned monocular distance estimates do not generalize well to the test distribution when the monocular vision is in control.

TABLE II. TEST RESULTS FOR THE THREE LEARNING SCHEMES. THE AVERAGE AND STANDARD DEVIATION ARE GIVEN FOR THE NUMBER OF OVERRIDES AND TURNS DURING THE TESTING PERIOD. A LOWER NUMBER OF OVERRIDES IS BETTER. THE BEST RESULTS ARE SHOWN IN BOLD.

    Method              Overrides       Turns
    Pure stereo         N/A             45.6 (σ = 3.0)
    1. Cold turkey      25.1 (σ = 8.2)  42.8 (σ = 3.7)
    2. DAgger           10.7 (σ = 5.3)  41.4 (σ = 3.2)
    3. Training wheels  4.3 (σ = 2.6)   40.4 (σ = 2.6)

The originally proposed DAgger scheme performs better, while the third learning scheme, termed 'training wheels', seems most effective. The third scheme has the lowest number of overrides of all learning schemes, with a similar total number of turns as a pure stereo vision run. The intuition behind this method being best is that it allows the drone to best learn from samples when the drone is beyond the normal stereo vision turning threshold. The original DAgger scheme has a larger probability to turn earlier, exploring these samples to a lesser extent. Double-sided statistical bootstrap tests [32] indicate that all differences between the learning methods are significant, with p < 0.05.

The differences between the learning schemes are well illustrated by the positions the drone visits in the room during the test phase. Figure 11 contains 'heat maps' that show the drone positions during turning (top row) and during straight flight (bottom row).


Fig. 11. Simulation heat maps, from left to right: cold turkey, DAgger, training wheels, stereo only. Top images are turn locations, lower images are the approaches.

Fig. 12. The multicopter used.

The position distribution has been obtained by binning the positions during the test phase of all 30 runs. The results for each scheme are shown per column in Figure 11. On the right is the pure stereo vision scheme, which shows a clear border around the straight flight trajectories. It can be observed that this border is best approximated by the 'training wheels' scheme (second from the right).

VI. ROBOTIC EXPERIMENTS

The simulation experiments showed that the 'training wheels' setup resulted in the fewest stereo vision overrides when switching to monocular disparity estimation and control. In this section, we test this online learning setup with a flying robot.

The experiment is set up in the same manner as the simulation. The robot, a Parrot AR drone 2, first explores the room with the help of stereo vision. After 1 minute of learning, the drone switches to using the monocular disparity estimates, with stereo vision running in the background for performing potential safety overrides. In this phase, the drone still continues to learn. After learning for 4 to 5 minutes, the drone stops learning and enters the test phase. Again, also for the real robot, the main performance measure consists of the number of safety overrides performed by the stereo vision during the testing phase.

The AR drone 2 is not standard equipped with a stereo vision system. Therefore, an in-house-developed 4 gram stereo vision system is used [33], which sends the raw images over USB to the ARDrone2. The grayscale stereo camera has a resolution of 128 × 96 px and is limited to 10 fps. The ARDrone2 comes with a 1 GHz ARM Cortex-A8 processor and 128 MB RAM, and normally runs the Parrot firmware as an autopilot.

For the experiments, we replaced this firmware with the open source Paparazzi autopilot software [34], [35]. This allowed us to implement all vision and learning algorithms on board the drone. The length of each test depends on the battery, which due to wear has considerable variation, in the range of 8-15 minutes.

The tests are performed in an artificial room that has been constructed within a motion tracking arena. This allows us to track the trajectory of the drone and facilitates post-experiment analysis. The room is approximately 5 × 5 m, as delimited by plywood walls. In order to ensure that the stereo vision algorithm gives reliable results, we added texture in the form of duct tape to the walls. In five tests we had a textured carpet hanging over one of the walls (Figure 13, left, referred to as 'room 1'); in the other five tests it was on the floor (Figure 13, right, referred to as 'room 2').

Fig. 13. The two test flight rooms.

1) Results: Table III shows the summarized results obtained from the monocular test flights. Two main observations can be made from this table. First, the average number of stereo overrides during the test phase is 3, which is very close to the number of overrides in simulation. The monocular behavior also has a similar heat map to simulation. Figure 14 shows a heat map of the drone's position during the approaches and the avoidance maneuvers (the turns). Again, the stereo based flight performs better, in the sense that the drone explores the room much more thoroughly and the turns happen consistently just before an obstacle is detected. On the other hand, especially in room 2, the monocular performance is quite good, in the sense that the system is able to explore most of the room.

Second, the selected TPR and FPR are on average 0.47 and 0.11. The TPR is rather low compared to the offline tests. However, this number is heavily influenced by the monocular-estimator-based behavior. Due to the goal of the robot avoiding obstacles slightly before the stereo ground truth recognizes them as positives, positives should hardly occur at all. Only in cases of FNs, where the estimator is slower or wrong, will positives be registered by the ground truth. Similarly, the FPR is also lower in the context of the estimate-based behavior.

ROC curves of the 10 flights are shown in Figure 16. A comparison based on the numbers between the first 5 flights (room 1) and the last 5 flights (room 2) does not show any significant differences, leading to the suggestion that the system is able to learn both rooms equally well. However, when comparing the heat maps of the two situations of the monocular system in Figure 14 and Figure 15, it seems that the system shows slightly different behavior. The monocular system appears to explore room 2 better, getting closer to copying the behavior of the stereo-based system.


TABLE III: TEST FLIGHT SUMMARY. Flights 1-5 were performed in room 1, flights 6-10 in room 2.

Description                  1     2     3     4     5     6     7     8     9     10    Avg
Stereo flight time (m:ss)    6:48  7:53  2:13  3:30  4:45  4:39  4:56  5:12  4:58  5:01  4:59
Mono flight time (m:ss)      3:44  8:17  6:45  7:25  4:54  10:07 4:46  9:51  5:23  5:12  6:39
Mean Square Error            0.7   1.96  1.12  0.95  0.83  0.95  0.87  1.32  1.16  1.06  1.09
False Positive Rate          0.16  0.18  0.13  0.11  0.11  0.08  0.13  0.08  0.10  0.08  0.11
True Positive Rate           0.90  0.44  0.57  0.38  0.38  0.40  0.35  0.35  0.60  0.39  0.47
Stereo approaches            29    31    8     14    19    22    22    19    20    21    20.5
Mono approaches              10    21    20    25    14    33    15    28    18    15    19.9
Auto-overrides               0     6     2     2     1     5     2     7     3     2     3
Overrides ratio              0     0.72  0.30  0.27  0.20  0.49  0.42  0.71  0.56  0.38  0.41


Fig. 14. Room 1 (plain texture) position heat map. Top row: the binned position during the avoidance turns; bottom row: during the obstacle approaches. Left column: during stereo ground truth based operation; right column: during learned monocular operation.

The experimental setup, with the room in the motion tracking arena, allows for a more in-depth analysis of the performance of both stereo and monocular vision. Figure 17 shows the spatial view of the flight trajectory of test 10 (see footnote 6). The flight is segmented into approaches and turns, which are numbered accordingly in these figures. The color scale in Figure 17A is created by calculating the theoretically visible closest wall, based on the tracking system's measured heading and position of the drone, the known position of the walls, and the FOV angle of the camera. It is clearly visible that the stereo ground truth in Figure 17B does not capture this theoretical disparity perfectly. Especially in the middle of the room the disparity remains high compared to the theoretical ground truth, due to noise in the stereo disparity map. The results of the monocular estimator in Figure 17C show a further decrease in quality compared to the stereo ground truth.

6 Onboard video data of flight 10 can be viewed at http://1drv.ms/1KC81PN, an external video at https://youtu.be/lC_vj-QNy1I, and a VBoW visualization video at http://1drv.ms/1fzvicH
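The Figure 17A reference signal can be reproduced with simple ray casting. The sketch below (Python; the function name, ray count, and FOV value are illustrative assumptions, not the values used in our analysis) intersects rays spanning the camera's horizontal FOV with the four walls of a square room and returns the distance of the theoretically visible closest wall; by Eq. 4, the corresponding disparity is proportional to its inverse.

```python
import numpy as np

def closest_visible_wall(x, y, heading, fov=np.radians(60), size=5.0, n_rays=31):
    """Distance from pose (x, y, heading) to the nearest wall point visible
    within the horizontal FOV, for a square room [0, size] x [0, size]."""
    dists = []
    for a in np.linspace(heading - fov / 2, heading + fov / 2, n_rays):
        dx, dy = np.cos(a), np.sin(a)
        ts = []                                # ray parameters of wall hits
        if dx > 0: ts.append((size - x) / dx)  # east wall
        if dx < 0: ts.append(-x / dx)          # west wall
        if dy > 0: ts.append((size - y) / dy)  # north wall
        if dy < 0: ts.append(-y / dy)          # south wall
        dists.append(min(t for t in ts if t > 0))
    return min(dists)

# By Eq. 4, disparity ~ 1 / depth, so the theoretical disparity color scale
# follows 1.0 / closest_visible_wall(x, y, heading).
```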

Fig. 15. Room 2 (carpet, natural texture) position heat map.


Fig. 16. ROC curves of the 10 test flights. Dashed / solid lines refer to results on room 1 / room 2.


Fig. 17. Flight path of test 10 in room 2. Monocular flight starts from approach 23. The meaning of the color of the flight path differs per image. A: the approximated disparity based on the external tracking system; B: the measured stereo average disparity; C: the monocular estimated disparity; D: the error between B & C, with dark blue meaning zero error; E: error during FPs; F: error during FNs. E and F only show the monocular part of the flight.

VII. DISCUSSION

We start the discussion with an interpretation of the results from the simulation and real-world experiments, after which we proceed by discussing persistent SSL in general and provide a comparison to other machine learning techniques.

A. Interpretation of the results

Using persistent SSL, we were able to autonomously navigate our multicopter on the basis of a stereo vision camera, while training a monocular estimator on board and online. Although the monocular estimator allows the drone to continue flying and avoiding obstacles, the performance during the approximately ten-minute flights is not perfect. During monocular flight a fairly limited number of (autonomous) stereo overrides was needed, while at the same time the robot was not exploring the room as fully as when using stereo.

Several improvements can be suggested. First, we can simply have the drone learn for a longer time, accumulating training data over multiple flights. In an extended offline test, our VBoW method shows saturation at around 6000 samples. Using additional features and more advanced learning methods

may result in improved performance if training set sizes increase.

During our tests in different environments, it proved unnecessary to tune the VBoW learning algorithm parameters to a new environment, as similar performance was obtained. The results learned on the robot itself may or may not generalize to different environments; however, this is of less concern, as the robot can detect a new environment and then decide to continue the learning process if the original cue is still available.

In order to detect an inadequacy of the learned regression function, the robot can occasionally check the estimation error against the stereo ground truth. In fact, our system already does so autonomously using its safety override. Methods for checking the performance without using the ground truth, e.g. by employing a learner that gives an estimate of its uncertainty, are left for future work.

B. Deep learning

At the time of our robotic experiments, implementing state-of-the-art deep learning methods on board a flying drone was deemed infeasible due to hardware restrictions. However, in Appendix A we investigate using downsized networks in order


to show that the persistent SSL method can work with deep learning as well. One of the major advantages of persistent SSL is the unprecedented amount of available training data. This amount of data will be more useful to more complex learning methods, such as deep learning methods, than to less complex but computationally efficient methods, such as the VBoW method used in our experiments. Today, with the availability of strongly improved hardware such as the NVidia Jetson TX1, close-to state-of-the-art models can be trained and run on board a drone, which may significantly improve the learning results.

C. Persistent SSL in relation to other machine learning techniques

In order to place persistent SSL in the general framework of machine learning, we compare it with several techniques. An overview of this comparison is presented in Figure 18.

[Figure 18 places unsupervised learning, supervised learning, self-supervised learning, persistent self-supervised learning, learning from demonstration, and reinforcement learning in a plane spanned by two axes, the importance of supervision and the importance of behavior, ranging from classical machine learning to classical control policy learning.]

Fig. 18. Lay of the machine learning land.

Un-/semi-/supervised learning: Unsupervised learning does not require labeled data, semi-supervised learning requires only an initial set of labeled data [36], and supervised learning requires all the data to be labeled. Internally, persistent SSL uses a standard supervised learning scheme, which greatly facilitates and speeds up learning. The typical downside of supervised learning, acquiring the labels, does not apply to SSL, since the robot continuously provides these labels itself. A major difference between the typical use of supervised learning and its use in persistent SSL is that the data for learning is generally assumed to be i.i.d. However, persistent SSL controls a behavioral component, which in turn affects the dataset obtained both during training and at testing time. Operation based on the ground truth induces a certain behavior that differs significantly from the behavior induced by a trained estimator, even more so for an undertrained estimator.

SSL: The persistent form of SSL is set apart in the figure from 'normal' SSL, because the persistence property introduces a much more significant behavioral component to the learning. While normal SSL expects the trusted cue to remain available, persistent SSL assumes that the robot may sometimes act in

the absence of the trusted cue. This introduces the feedback-induced data bias problem, which, as we have seen, requires specific behavioral strategies for best learning the robot's task.

Learning from demonstration: Learning from Demonstration (LfD) is a close relative of persistent SSL. Consider, for instance, teleoperation: an LfD scheme in which a (human or robot) teacher remotely operates a robot in order for it to learn demonstrated actions in its environment [17]. This can be compared to persistent SSL if we consider the teacher to be the ground truth function g(x_g) in the persistent SSL scheme. In most cases described in the literature, the teacher shows actions from a control policy, taken on the basis of a state, instead of just the results of a sensory cue (i.e., the state). However, LfD does contain exceptions, in which the learner only records the states during demonstration, e.g. when drawing a map through a 2D representation of the world in case of a path planning mission in an outdoor robot [37]. Like in persistent SSL, test time decisions taken in LfD schemes influence future observations, which may or may not be contained in known demonstrated territory. However, one key difference between LfD and persistent SSL arguably sets them apart. All LfD theory known to the authors implicitly assumes that the teacher is never the same entity as the learner. It may be that all relevant sensors are on the learner, and even that the learner's body is used to execute teacher commands (like in teleoperation), but the teacher's intelligence is always an external entity.

Reinforcement learning: Lastly, we compare persistent SSL with Reinforcement Learning (RL), which is a distinctively different technique [38]. In RL, a policy is learned using a reward function. Due to the evaluative feedback provided in RL, defining a good reward function is one of the fundamental difficulties of RL, known as reward shaping [39], [38]. Since persistent SSL uses supervised feedback, reward shaping is less of an issue in persistent SSL, only requiring the choice of a loss function between g(x_g) and f(x_f). Secondly, the initial exploration phase of RL often involves a lot of trial and error, making it a dangerous time in which a physical system may crash and be damaged. Although this particular problem is often solved by better initialization, e.g. by using LfD or by using policy search instead of value function based approaches, persistent SSL does not require an untrained initialization phase at all, as a reliable ground truth function guarantees a certain minimal correct behavior.

Persistent SSL differs from other learning techniques in the sense that no complete training data set is needed to train the algorithm beforehand. Instead, it requires a ground truth g(x_g), which must be available online, in real-time, while training f(x_f), but which can be switched off when f(x_f) is learned to satisfaction. This implies that learning needs to be persistent and that the switch θ must be included in the model. Note that in cases where the environment of the robot may change, measures can be put in place to detect the output uncertainty of f(x_f). If the uncertainty goes up, the robot can switch back to using the ground truth function, and learning can then be activated again. Developing such measures is, however, left for future work.
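As a minimal sketch of such a measure (Python; the window size and error threshold are illustrative assumptions, not values from our experiments), the robot could keep a running estimate of the monocular error whenever the ground truth happens to be available, and flip the switch θ back to g(x_g), reactivating learning, when the error grows:

```python
from collections import deque

class TrustMonitor:
    """Decides whether the switch (theta) may select the learned cue f(x_f)."""
    def __init__(self, window=100, max_mse=1.5):
        self.sq_errors = deque(maxlen=window)
        self.max_mse = max_mse

    def report(self, estimate, ground_truth):
        # called on every frame for which the trusted cue is still available
        self.sq_errors.append((estimate - ground_truth) ** 2)

    def mono_trustworthy(self):
        if len(self.sq_errors) < self.sq_errors.maxlen:
            return False  # not enough evidence yet: stay on the ground truth
        return sum(self.sq_errors) / len(self.sq_errors) <= self.max_mse
```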


D. Feedback-induced data bias

The robot induces how its environment is perceived, meaning that it influences the acquired training samples based on its behavior. The problems arising from this feedback-induced data bias are known from other machine learning disciplines, such as RL and LfD [38]. In particular, Ross et al. have proposed DAgger [2] to solve a similar problem in the LfD domain, by iteratively aggregating the dataset with induced training samples and the expert's reaction to them. However, in the case of LfD, obtaining the induced training samples requires a careful and often additional setup, while in persistent SSL this functionality is inherently available. Secondly, the performance of the LfD expert (i.e., in many cases a human) is not easy to control, often reacting too late or too early. The control policy of the persistent SSL ground truth override system can, on the other hand, be very deterministic. In the case of a DAgger application with drones flying through a forest [3], it proved infeasible to reliably sample the expert in an online fashion. Acquired videos had to be processed offline by the expert, hence the need for (offline ↔ online) iterations. Moreover, an additional online human safety override interface was still necessary to prevent damage to the drone while learning. Thirdly, due to the cost of (and need for) iterative demonstration sessions, the emphasis of DAgger is on converging fast, needing as few expert sessions as possible. In persistent SSL there are no costs for using the teacher signals coming from the original sensor cue. With persistent SSL we can thus directly focus on effectively using the available amount of training samples, instead of minimizing the number of iterations as in DAgger.

Another reason why persistent SSL handles the induced training sample issue better than other state-of-the-art robot learning methods is that in persistent SSL part of the learning problem itself can easily be separated from the behavior and tested, i.e., in a traditional supervised learning setting. In our proof of concept, this has allowed us to test the learning algorithms and thoroughly investigate their limits before deployment, which helped us to identify when the behavioral influence was to blame for bad results.

VIII. CONCLUSION

We have investigated a novel Self-Supervised Learning scheme in which the supervisory signal is switched off after an initial learning period. In particular, we have studied an instance of such persistent SSL for the task of obstacle avoidance, in which the robot uses trusted stereo vision distance estimates in order to learn appearance-based monocular distance estimation. We have shown that this particular setup is very similar to learning from demonstration. This similarity has been corroborated by experiments in simulation, which showed that the worst learning strategy is to make a hard switch from stereo vision flight to mono vision flight. It is best to have the robot fly based on mono vision, using stereo vision only as 'training wheels' to take over when the robot would otherwise collide with an obstacle. The real-world robot experiments show the feasibility of the approach, giving acceptable results already with just 4-5 minutes of learning.

The findings also indicate interesting future avenues of investigation. First, and perhaps most importantly, in the 4-5 minutes of the real-world experiments the robot already experiences roughly 7000-9000 supervised learning samples. It is clear that longer learning times can lead to very large supervised data sets, which are suitable for deep learning approaches. Such approaches likely allow the learning to extend to much larger, more varied environments. In addition, they could allow the learning to improve the resolution of the disparity estimates, from a single value to a full image-size disparity map. Second, in the current experiments the robot stayed in a single environment. We mentioned that a different environment can make the learned mapping invalid, and that this can be detected by means of the ground truth. Another avenue, as studied in [10], is to use a machine learning method with an associated uncertainty value. For instance, one could use a learning method such as a Gaussian Process. This can help with a further integration of the behavior with learning, for instance by tuning the forward velocity based on the certainty. Together, these avenues could allow persistent SSL to reach its full potential, significantly enhancing the robustness of robots operating in real-world environments.

APPENDIX A: DEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep convolutional neural network (CNN) on board a drone, to test our proposed learning scheme. We scaled the CNN architecture such that implementing this network on the AR drone 2 hardware remained feasible. Due to the CPU restrictions of our target system, we chose to use a relatively small CNN, inspired by the Cifar10 CNN example used in Caffe. Our layers are as follows. Input: a 128×128 pixel image (input images are scaled). First hidden layer: 5×5 convolution, max pooling to 3×3, 32 kernels, contrast normalization. Second hidden layer: 5×5 convolution, average pooling to 3×3, 32 kernels, contrast normalization. Third hidden layer: 5×5 convolution, average pooling to 3×3, 128 kernels. Fourth and fifth layers: fully connected. Lastly, a Euclidean loss output layer to one single output. Furthermore, we used a learning rate of 1e-6, a momentum of 0.9, and 10000 iterations. The networks are trained using the open source Caffe framework on a dedicated compute server with an NVidia GTX660. In Figure 19, the learning curves of our best CNN under these constraints are shown versus the best of our VBoW-trained algorithms. The figure shows that the CNN is able to learn the problem of monocular depth estimation to a comparable fashion as our VBoW method. For the expected amount of training data during a single flight, the VBoW method delivers better performance for dataset 1 and slightly worse for dataset 2.
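For concreteness, the sketch below reconstructs the described architecture in PyTorch (the original was specified in Caffe). The pooling strides, the placement of the ReLU non-linearities, and the width of the first fully connected layer are our own assumptions, as they are not specified above.

```python
import torch
import torch.nn as nn

class MonoDisparityCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5, padding=2),    # 1st hidden layer: 5x5 conv, 32 kernels
            nn.MaxPool2d(3, stride=2),         # max pooling with a 3x3 window
            nn.ReLU(),
            nn.LocalResponseNorm(5),           # contrast normalization
            nn.Conv2d(32, 32, 5, padding=2),   # 2nd hidden layer
            nn.AvgPool2d(3, stride=2),         # average pooling
            nn.ReLU(),
            nn.LocalResponseNorm(5),
            nn.Conv2d(32, 128, 5, padding=2),  # 3rd hidden layer: 128 kernels
            nn.AvgPool2d(3, stride=2),
            nn.ReLU(),
        )
        self.regressor = nn.Sequential(        # 4th and 5th layers: fully connected
            nn.Flatten(),
            nn.Linear(128 * 15 * 15, 64),      # 64 hidden units is an assumption
            nn.ReLU(),
            nn.Linear(64, 1),                  # one single disparity output
        )

    def forward(self, x):                      # x: (batch, 1, 128, 128)
        return self.regressor(self.features(x))

model = MonoDisparityCNN()
criterion = nn.MSELoss()                       # the Euclidean loss named above
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
```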

REFERENCES

[1] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.

[2] S. Ross, G. J. Gordon, and J. A. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in 14th International Conference on Artificial Intelligence and Statistics, vol. 15, 2011.



Fig. 19. Learning curves (MSE and area under the ROC curve versus training set size in frames) on two datasets for CNNs versus VBoW. Dashed / solid lines refer to results on the train / test set.

[3] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," in 2013 IEEE International Conference on Robotics and Automation, May 2013, pp. 1765–1772.

[4] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al., "Stanley: The robot that won the DARPA Grand Challenge," in The 2005 DARPA Grand Challenge. Springer, 2007, pp. 1–43.

[5] D. Lieb, A. Lookingbill, and S. Thrun, "Adaptive road following using self-supervised learning and reverse optical flow," in Robotics: Science and Systems, 2005, pp. 273–280.

[6] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, "Reverse optical flow for self-supervised adaptive autonomous robot navigation," International Journal of Computer Vision, vol. 74, no. 3, pp. 287–302, 2007.

[7] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009.

[8] U. A. Muller, L. D. Jackel, Y. LeCun, and B. Flepp, "Real-time adaptive off-road vehicle navigation and terrain classification," in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2013, pp. 87410A–87410A.

[9] J. Baleia, P. Santana, and J. Barata, "On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances," Journal of Intelligent & Robotic Systems, pp. 1–20, 2015.

[10] H. Ho, C. De Wagter, B. Remes, and G. de Croon, "Optical flow for self-supervised learning of obstacle appearance," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3098–3104.

[11] T. Mori and S. Scherer, "First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles," in IEEE International Conference on Robotics and Automation, 2013, pp. 1750–1757.

[12] J. Engel, J. Sturm, and D. Cremers, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[13] ——, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[14] F. van Breugel, K. Morgansen, and M. H. Dickinson, "Monocular distance estimation from optic flow during active landing maneuvers," Bioinspiration & Biomimetics, vol. 9, no. 2, p. 025002, 2014.


[15] G. C. de Croon, "Monocular distance estimation with optical flow maneuvers and efference copies: a stability-based strategy," Bioinspiration & Biomimetics, vol. 11, no. 1, p. 016004, 2016.

[16] G. de Croon, M. Groen, C. De Wagter, B. Remes, R. Ruijsink, and B. van Oudheusden, "Design, aerodynamics and autonomy of the DelFly," Bioinspiration & Biomimetics, vol. 7, no. 2, p. 025003, 2012.

[17] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.

[18] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.

[19] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D depth reconstruction from a single still image," International Journal of Computer Vision, vol. 76, no. 1, pp. 53–69, Aug. 2007.

[20] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, May 2009.

[21] I. Lenz, M. Gemici, and A. Saxena, "Low-power parallel algorithms for single image based obstacle avoidance in aerial robots," in IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 772–779.

[22] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems, 2014, pp. 1–9.

[23] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM Press, 2005, pp. 593–600.

[24] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight."

[25] K. Yamauchi, M. Oota, and N. Ishii, "A self-supervised learning system for pattern recognition by sensory integration," Neural Networks, vol. 12, no. 10, pp. 1347–1358, 1999.

[26] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L. E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, "Stanley: The robot that won the DARPA Grand Challenge," Springer Tracts in Advanced Robotics, vol. 36, pp. 1–43, 2007.

[27] C. De Wagter, S. Tijmons, B. Remes, and G. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in IEEE International Conference on Robotics & Automation (ICRA), 2014, pp. 4982–4987.

[28] G. C. H. E. De Croon, E. De Weerdt, C. De Wagter, B. D. W. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.

[29] M. Varma and A. Zisserman, "Texture classification: are filter banks necessary?" in 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003.

[30] B. Wu, T. L. Ooi, and Z. J. He, "Perceiving distance accurately by a directional process of integrating ground information," Nature, vol. 428, no. 6978, pp. 73–77, 2004.

[31] C. De Wagter and M. Amelink, "Holiday50av technical paper," 2007.

[32] P. R. Cohen, "Empirical methods for artificial intelligence," IEEE Intelligent Systems, no. 6, p. 88, 1996.

[33] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 4982–4987.

[34] B. Remes, D. Hensen, F. Van Tienen, C. De Wagter, E. Van der Horst, and G. De Croon, "Paparazzi: how to make a swarm of Parrot AR drones fly autonomously based on GPS," in IMAV 2013: Proceedings of the International Micro Air Vehicle Conference and Flight Competition, Toulouse, France, 17–20 September 2013, 2013.

[35] P. Brisset, A. Drouin, M. Gorraz, P.-S. Huard, and J. Tyler, "The Paparazzi solution," 2014.

[36] X. Zhu, "Semi-supervised learning literature survey," Sciences-New York, pp. 1–59, 2007.

[37] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, "Boosting structured prediction for imitation learning," Advances in Neural Information Processing Systems (NIPS), p. 54, 2006.

[38] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, pp. 1238–1274, 2013.

[39] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.



addressed in the above-mentioned studies.

In this article we perform an in-depth study of the behavioral aspects of persistent SSL, in the context of a scenario in which the robot should keep performing its task even when the supervisory cue becomes completely unavailable. Importantly, when the robot relies only on the complementary cue, it will encounter the feedback-induced data bias problem known from LfD. We suggest a novel decision strategy to handle this problem in persistent SSL.

Specifically, we study a flying robot with a stereo vision system that has to avoid obstacles. The robot uses SSL to learn a mapping from monocular appearance cues to the stereo-based distance estimates. We have chosen this context because it is relevant for any stereo-based robot that needs to be robust to a permanent failure of one of its cameras. In computer vision, monocular distance estimation is also studied. There, the main challenges are the gathering of sufficient data (e.g., for deep neural networks) and the generalization of the learned distance estimation to an unforeseen operation environment. Both of these challenges are addressed to some extent by SSL, as learning data is abundant and the robot learns in the environment in which it operates. We regard SSL as an important supplement to machine learning for robots. Therefore, we end the study with a discussion on the position of (persistent) SSL in the broader context of robot and machine learning, comparing it among others with reinforcement learning, learning from demonstration, and supervised learning.

The remainder of the article is set up as follows. First, in Section II, we discuss related work. Then, in Section III, we more formally introduce persistent SSL and explain our implementation for the stereo-to-mono learning. We analyze the similarity of the specific SSL case studied in this article with Learning from Demonstration approaches. Subsequently, in Section IV, we perform offline vision experiments in order to assess the performance of various parameter settings. Thereafter, in Section V, we compare various learning strategies. The best learning strategy is implemented for experiments with a Parrot AR drone 2, the results and analysis of which are described in Section VI. The broader implications of the findings on persistent SSL are discussed in Section VII, and conclusions are drawn in Section VIII.

II. RELATED WORK

We study persistent SSL in the context of a stereo-vision-equipped robot that has to learn to navigate with a single camera. In this section we discuss the state of the art in the areas most relevant to this study: monocular navigation and self-supervised learning.

A. Monocular navigation

The large majority of monocular navigation approaches focuses on motion cues. The optical flow of world points allows for the detection of obstacles [11], or even the extraction of structure from motion, as in monocular Simultaneous Localization And Mapping (SLAM) [12]. The main issue with using monocular motion cues for navigation is that optical flow only conveys information on the ratio between distance and

the camera's velocity. Additional information is necessary for retrieving distance. This information is typically provided by additional sensors [13], but can also be retrieved by performing specific optical flow maneuvers [14], [15].

In contrast, it is well known that the appearance of objects in a single still image does contain distance information. Successfully estimating distances in a single image allows robots to evaluate distances without having to move. In addition, many appearance extraction and evaluation methods are computationally efficient. Both these factors can aid the robot in making quick navigation decisions. Below we give an overview of work in the area of monocular appearance-based distance estimation and navigation.

1) Appearance-based navigation without distance estimation: There are some appearance-based navigation methods that do not involve an explicit distance estimate. For instance, in [16] an appearance variation cue is used to detect the proximity to obstacles, which is shown to be successful at complementing optical flow based time-to-contact estimates. A threshold is set that makes the flying robot turn if the variation drops too much, which will lead to turns at different distances.

An alternative approach is to directly learn the mapping from visual inputs to control actions. In order to fly a quadrotor through a forest, Ross et al. [3] use a variant of Learning from Demonstration (LfD) [17] to acquire training data on avoiding trees in a forest. First, a human pilot controls the drone, creating a training data set of sensory inputs and desired actions. Subsequently, a control policy is trained to mimic the pilot's commands as well as possible. This control policy is then implemented on the drone.

A major problem of this approach is the feedback-induced data bias. A robot has a feedback loop of actions and sensory inputs, so its control policy determines the distribution of world states that it encounters (with corresponding sensory inputs and optimal actions). Small deviations between the trained controller and the human may bring the robot into unknown states, for which it has received no training. Its control policy may generalize badly to such situations. The solution proposed in [3] is a transitional model, called DAgger [2], in which actions from the expert are mixed with actions from the trained controller. In the real-world experiments in [3], several iterations have been performed in which the robot flies with the trained controller and the captured videos are labeled offline by a human. This approach requires skilled pilots and significant human effort.

2) Offline monocular distance learning: Humans are able to see distances in a single still image, and there is a growing body of work in computer vision utilizing machine learning to do the same. Interest in single image depth estimation was sparked by work from Hoiem et al. [18] and Saxena et al. [19], [20]. Hoiem's Automatic Photo Pop-up tries to group parts of the image into segments that can be popped up at different depths. Saxena's Make3D uses a Markov Random Field (MRF) approach to classify a depth per image patch on different scales. These studies focus on creating a dense depth map with a machine learning computer vision approach. Both methods use supervised learning on a large training dataset.

Some work was done on adopting variants of Saxena's MRF


work for driving rovers and even for MAVs. Lenz et al. [21] propose a solution based on an MRF to detect obstacles on board an MAV, but it does not infer how far away the objects are. Instead, it is trained offline to recognize three different obstacle class types. Any different object could hence lead to navigation problems.

Recently, again focusing on creating a dense depth map from a single image, Eigen et al. [22] proposed a multi-scale deep neural network approach, trained on the KITTI dataset, making it more resilient to practical robot data. Training deep neural networks requires a large data set, which is often obtained by deforming training data or by artificially generating training data. Michels et al. [23] use artificial data to learn to recognize obstacles on a rover, but in order to generalize well it requires the use of a very realistic simulator. In addition, the same work reports significant improvement if the artificial data is augmented with labeled real-world data.

Other groups acquire training data for supervised learning by having another, separate robot or system acquire data. This data is then processed and learned offline, after which the learned algorithm is deployed on the target robot. Dey et al. [24] use an RC car with a stereo vision setup to acquire data from an environment, apply machine learning on this data offline, and navigate a similar but unseen environment with an MAV based on the trained algorithm. Creating and operating a secondary system designed to acquire training data, however, is no free lunch. Moreover, it introduces inconvenient biases in the training data, because an RC car will not behave in a similar way, both in terms of dynamics, camera viewpoint, and the path chosen through the environment.

None of the above methods have the robot gather the data and learn while in operation.

B. Self-supervised learning

The idea of self-supervised learning has been around since the late 1990s [25], but its successful application to terrain classification on the autonomously driving car Stanley [26] demonstrated its first major practical use. A similar approach was taken by Hadsell et al. [7], but now using a stereo vision system instead of a LIDAR system, and complex convolutional filters instead of simple and fixed color based features. These approaches largely forgo the need for manually labeled data, as they are designed to work in unseen environments.

In the studies on self-supervised learning for terrain classification, the ground truth is assumed to be always available. In fact, the learned function only lasts for a single image, implying that the trained terrain appearance models are directly applicable to very similar data.

Two very recent studies do have the learned function last longer over time, both with the idea of having the robot take some decisions based on the complementary sensor cue alone. Baleia et al. [9] study a rover with a haptic antenna sensor. In their application of terrain mapping, they try to map monocular cues to obstacles based on earlier events of encountering similar situations, which resulted in either a hard obstacle, a traversable obstacle, or a clear path. The monocular information is used in a path planning task, requiring a cost

function for either exploring unknown potential obstacles or driving through the terrain on the currently available information. Since checking whether a potential obstacle is traversable is costly (the rover needs to travel there in order for the antenna to provide ground truth on it), the robot learns to classify the terrain ahead with vision. On each sample, an analysis is performed to determine whether the vision-based classifier is sufficiently confident: it either decides that the terrain is traversable, that it is not traversable, or that it is unsure. In the unsure case, the sample is sensed using the antenna. Gradually, this will become necessary less often, the robot thus learning to navigate using its Kinect sensor alone. In Ho et al. [10], a flying robot first uses optical flow to select a landing site that is flat and free of obstacles. In order for this to work, the robot has to move sufficiently with respect to the objects on the landing site. While flying, the robot uses SSL to learn a regression function that maps an (appearance-based) texton distribution to the values coming from the optical flow process. The learned function extends the capabilities of the robot, as after learning it is also able to select landing sites without moving (from hover). The article shows that, if the robot is uncertain about its appearance-based estimates, it can switch back to the original optical flow based cue.

In this article we focus on the behavioral aspects of persistent SSL. We study how to best set up the learning process, so that the robot will be able to keep performing its task when the original sensor cue becomes completely unavailable.

III. METHODOLOGY OVERVIEW

This section starts by formally and generally posing the persistent SSL learning mechanism (Subsection III-A). Next, in Subsection III-B, we show how this method is used for our specific proof of concept case: monocular depth estimation in flying robots, with subsequent implementation details in the following Subsections III-C to III-F. Lastly, in Subsection III-G, we identify a behavioral issue when applying persistent SSL and formally define the learning problem for our proof of concept. For a comparison between persistent SSL and other machine learning methods, we refer the reader to the discussion.

A. Persistent SSL principle

The persistent SSL principle is schematically depicted in Figure 1. In persistent SSL, an original, pre-wired sensory cue provides supervised outputs to a learning process that takes a different, complementary sensory cue as input. The goal is to be able to replace the pre-wired cue if necessary. When considering the system as a whole, learning with persistent SSL can be considered unsupervised: it requires no manual labeling or pre-training before deployment in the field. Internally, it uses a supervised learning method that does in fact need ground truth labels to learn. This ground truth is, however, assumed to be provided online and autonomously, without human or outside interference.

In the schematic, the input variable x represents the sensory inputs available on board. The variables x_g and x_f are possibly overlapping subsets of these sensory inputs. In particular, the function g(x_g) extracts a trusted ground truth sensory cue from


Fig. 1. The persistent self-supervised learning principle.

the sensory inputs x_g. In classical systems, g(x_g) provides the required functionality on its own:

$$g: x_g \rightarrow y, \qquad x_g \subseteq x \tag{1}$$

The function f(x_f) is learned with a supervised learning algorithm in order to approximate g(x_g) based on x_f:

$$f: x_f \rightarrow y, \qquad f \in \mathcal{F}, \qquad x_f \subseteq x \tag{2}$$

$$f = \operatorname*{argmin}_{f \in \mathcal{F}} \; E\left[l(f(x_f), g(x_g))\right] \tag{3}$$

where l(f(x_f), g(x_g)) is a suitable loss function that is to be minimized. The system can either choose, or be forced, to switch θ, so that λ is either set to g(x_g) or to f(x_f) for use in control. Future work may also include fusing the two, but in this article we focus on using the complementary cue in a stand-alone fashion. It must be noted that, while both x_g ⊆ x and x_f ⊆ x, in general it may be that x_f does not contain all necessary information to predict y. In addition, even if x_g = x_f, it is possible that F does not contain a function f that perfectly models g. The information in x_f and the function space F may not allow for a perfect estimate of g(x_g). On the other hand, there may be an f(x_f) that handles certain situations better than g(x_g) (think of landing site selection from hover, as in [10]). In any case, fundamental differences between g(x_g) and f(x_f) are to be expected, which may significantly influence the behavior when switching θ. Handling these differences is of central interest in this article.

B. Stereo-to-mono proof of concept

Figure 2 presents a block diagram of the proposed proof of concept system, in order to visualize how the persistent SSL method is employed in our application: estimating monocular depth in a flying robot. Input is provided by a stereo vision camera, with either the left or the right camera image routed to the monocular estimator. We use a Visual Bag of Words (VBoW) method for this estimator (see Subsection III-D). The ground truth for persistent SSL in this context is provided by the output of a stereo vision algorithm. In this case, the average value of the disparity map is used, both for training the monocular estimator and as an input to the switch θ. Based on the switch, the system delivers either the monocular or the stereo vision average disparity to the behavior controller.
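A minimal sketch of this wiring is given below (Python). The toy nearest-neighbour regressor stands in for the actual VBoW estimator, and all names and the override rule are illustrative rather than our exact on-board implementation:

```python
import numpy as np

class KNNDisparityEstimator:
    """Toy stand-in for the on-board VBoW + kNN monocular estimator f(x_f)."""
    def __init__(self, k=5):
        self.k, self.X, self.y = k, [], []

    def update(self, features, target):
        self.X.append(np.asarray(features))
        self.y.append(target)

    def predict(self, features):
        if not self.X:
            return 0.0
        d = np.linalg.norm(np.array(self.X) - np.asarray(features), axis=1)
        nearest = np.argsort(d)[:self.k]
        return float(np.mean(np.array(self.y)[nearest]))

def control_step(disparity_map, mono_features, estimator, use_mono, t_disp):
    """One frame: learn from stereo, control from stereo or mono (switch theta)."""
    disparity_gt = float(np.mean(disparity_map))   # trusted cue g(x_g)
    estimator.update(mono_features, disparity_gt)  # persistent SSL: learn every frame
    if use_mono:
        estimate = estimator.predict(mono_features)
        # stereo keeps running in the background purely as a safety override
        return disparity_gt if disparity_gt > t_disp else estimate
    return disparity_gt                            # stereo-based operation
```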

C. Stereo vision processing

The stereo camera delivers a synchronized gray-scale stereo-pair image per time sample. A stereo vision algorithm first computes a disparity map, but often this is a far from perfect process. Especially in the context of an MAV's size, weight, and computational constraints, errors caused by imperfect stereo calibration, resolution limits, etc. can cause large pixel errors in the results. Moreover, learning to estimate a dense disparity map, even when this is based on a high quality and consistent data set, is already very challenging. Since we use the stereo result as ground truth for learning, we minimize the error by averaging the disparity map to a single scalar. A single scalar is much easier to learn than a full depth map, and has been demonstrated to provide elementary obstacle avoidance capability [27], [23], [28].

The disparity λ relates to the depth d of the input image as:

$$d \propto \frac{1}{\lambda} \tag{4}$$

Using the averaged disparity instead of the averaged depth fits the obstacle avoidance application better, because small but close-by objects are emphasized, due to the non-linear relation of Eq. 4. However, linear learning methods may have difficulty mapping this relation. In our final design, we thus choose to learn the disparity with a non-parametric approach, which is resilient to nonlinearities.

D. Monocular disparity estimation

The monocular disparity estimator forms a function from the image's pixel values to the average disparity in the image. Since the main goal of this article is to study SSL on board a drone in real-time, we have put the efficiency of both the learning and the execution of this function at a premium. Hence, we converged on a computationally extremely efficient Visual Bag of Words (VBoW) approach for the robotic experiments. We have also explored a deep neural network approach, but the hardware and time available for learning did not allow for having the deep neural learning on board the drone at this stage.

The VBoW method uses small image patches of w×h pixels, as successfully introduced in [29] for a complex texture classification problem. First, a dictionary is created by clustering the image patches with Kohonen clustering (as in [28]). The n cluster centroids are called 'textons'. After formation of the dictionary, when an image is received, m patches are extracted from the W×H pixel image. Each patch is compared to the dictionary in order to form a texton occurrence histogram for the image. The histogram is normalized to sum to 1. Then, each normalized histogram is supplemented with its Shannon entropy, resulting in a feature vector of size n + 1. The idea behind adding the entropy is that the variation of textures in an image decreases when the camera gets closer to obstacles [28]. To illustrate the change in texton histograms when approaching an obstacle, a time series of texton histograms can be seen in Figure 3. Please note how the entropy of the distribution indeed decreases over time, and how especially the fourth bin is much higher when close to the poster on the wall.


[Figure 2 block diagram components: stereo processing, monocular estimator, PSSL, filter, switch (average disparity / average disparity estimate), safety override, behavior routines, robot controller, control signal.]

Fig. 2. System overview. Please see the text for details.

A machine learning algorithm will have to learn to notice such relationships itself for the robot's environment, by learning a mapping from the feature vector to a disparity. We have investigated different function representations and learning methods to this end (see Section IV).

Fig. 3. Approaching a poster on the wall. Left: monocular input. Middle: overlaid textons, annotated with the color used in the histogram. Right: texton distribution histogram, with the corresponding texton shown beneath it.
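The sketch below illustrates the feature extraction step (Python), assuming a pre-computed texton dictionary; the Kohonen clustering that builds the dictionary, and the gradient-texton channel discussed in Section IV, are omitted, and the patch count is illustrative. The resulting feature vector of size n + 1 is what the regression methods of Section IV map to a disparity.

```python
import numpy as np

def texton_feature_vector(image, textons, m=100, rng=None):
    """Normalized texton histogram plus its Shannon entropy.
    image: 2-D grayscale array (H x W); textons: (n, h, w) dictionary."""
    rng = rng or np.random.default_rng()
    n, h, w = textons.shape
    H, W = image.shape
    counts = np.zeros(n)
    for _ in range(m):                       # extract m random patches
        yy = rng.integers(0, H - h + 1)
        xx = rng.integers(0, W - w + 1)
        patch = image[yy:yy + h, xx:xx + w].astype(float)
        dists = np.sum((textons - patch) ** 2, axis=(1, 2))
        counts[np.argmin(dists)] += 1        # nearest texton gets the vote
    hist = counts / counts.sum()             # normalize to sum to 1
    p = hist[hist > 0]
    entropy = -np.sum(p * np.log2(p))        # drops when close to obstacles
    return np.append(hist, entropy)          # feature vector of size n + 1
```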

E. Behavior

The proposed system uses a straightforward behavior heuristic to explore, navigate, and persistently learn a room. The heuristic is depicted as a Finite State Machine (FSM) in Figure 4. In state 0, the robot flies in the direction of the camera's principal axis. When an obstacle is detected (λ > t_λ), the robot stops and goes to state 1, in which it randomly chooses a new direction for the principal axis. It immediately passes to state 2, in which the robot rotates towards the new direction, reducing the error e between the principal axis' current and desired direction. If, in the new direction, obstacles are far enough away (λ ≤ t_λ), the robot starts flying forward again (state 0). Else, the robot continues to turn in the same direction as before (clockwise or counterclockwise) until λ ≤ t_λ. When this is the case, it starts flying straight again (state 0). The FSM detects obstacles by means of a threshold t_λ applied to the average disparity λ. Choosing this rather straightforward behavior heuristic enables autonomous exploration based on only one scalar obtained from a distance sensor. A code sketch of this heuristic follows Figure 4.


Fig. 4. The behavior heuristic FSM. λ is the average disparity, e is the attitude error (meaning the difference between the newly picked direction and the current attitude), and t_λ, t_e are the respective thresholds.
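In code, one rendering of this heuristic could look as follows (Python; the angle handling and the threshold values are illustrative simplifications of the on-board controller):

```python
import math
import random

def fsm_step(state, desired, heading, disparity, t_disp=1.5, t_err=0.1):
    """One update of the FSM of Figure 4.
    States: 0 fly straight ahead, 1 pick a random direction, 2 rotate."""
    # wrapped attitude error e between current and desired principal axis
    e = abs(math.atan2(math.sin(desired - heading), math.cos(desired - heading)))
    if state == 0 and disparity > t_disp:
        state = 1                        # obstacle detected: stop and re-plan
    if state == 1:
        desired = random.uniform(-math.pi, math.pi)  # new random direction
        state = 2                        # immediately start rotating
    elif state == 2:
        if e <= t_err and disparity <= t_disp:
            state = 0                    # aligned and clear: fly straight again
        # otherwise keep rotating in the same direction until the way is clear
    return state, desired
```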


F. Performance

The average disparity λ, coming either from stereo vision or from the monocular distance estimation function f(x_f), is thresholded for determining whether to turn. This leads to a binary classification problem, where all samples for which λ > t are considered 'positive' (c = 1) and all samples for which λ ≤ t are considered 'negative' (c = 0). Hence, the quality of f(x_f) can be characterized by a ROC curve. The ground truth for the ROC curve is determined by the stereo vision. This means that a True Positive Ratio (TPR) of 1 and a False Positive Ratio (FPR) of 0 lead to the same obstacle detection performance as with the stereo vision system. Generally, of course, this performance will not be reached, and the robot has to determine what threshold to set for a sufficiently high TPR and low FPR.

This leads to the question of how to determine what a 'sufficient' TPR / FPR is. We evaluate this matter in the context of the robot's obstacle avoidance task. In particular, we first look at the probability of a collision for a given TPR, and then at the probability of a spurious turn for a given FPR.

In order to model the probability of a collision, consider a constant velocity approach with input samples (images) ⟨x_1, x_2, ..., x_n⟩ of n samples long, ending at an obstacle. An obstacle must be detected by at least one TP or FP a minimum of u samples before actual impact, in order to prevent a collision. Since the range of samples ⟨x_(n−u+1), x_(n−u+2), ..., x_n⟩ does not matter for the outcome, we redefine the approach range to be ⟨x_1, x_2, ..., x_(n−u)⟩. Consider that for each sample x_i it holds that:

$$1 = p(TP|x_i) + p(FP|x_i) + p(TN|x_i) + p(FN|x_i) \tag{5}$$

since p(FN|x_i) = p(TP|x_i) = 0 if x_i is a negative, and p(TN|x_i) = p(FP|x_i) = 0 if x_i is a positive. Let us first assume independent, identically distributed (i.i.d.) data. Then the probability of a collision p_c can be written as:

$$p_c = \prod_{i=1}^{n-u} \left( p(FN|x_i) + p(TN|x_i) \right) = \prod_{i=1}^{q} p(TN|x_i) \prod_{i=q+1}^{n-u} p(FN|x_i) \tag{6}$$

where q is a time step separating two phases in the approach. In the first phase all x_i are negative, so that any false positive will lead to a turn, preventing the collision. Only if all negative samples are correctly classified as negatives (true negatives) will the robot enter the second phase, in which all x_i are positive. Then, only a complete sequence of false negatives will lead to a collision, since any true positive will lead to a timely turn.

We can use Eq. 6 to choose an acceptable TPR = 1 − FNR. Assuming a constant velocity and frame rate, it gives us the probability of a collision. For instance, let us assume that the robot flies forward at 0.50 m/s with a frame rate of 30 Hz, that it has a minimal required detection distance of 1.0 m, and that positives are defined to be closer than 1.5 m. This leads to s = 30 samples

that all have to be classified as negatives. In the case of i.i.d. data, if the TPR = 0.95, the probability of a collision is p_c = (1 − TPR)^s = 0.05^30 ≈ 9.31 · 10^−40, an extremely small value. With this analysis, even a TPR = 0.30 leads to an acceptable p_c ≈ 2.25 · 10^−5.

The analysis of the effect of false positives is straightforward, as it can be expressed in the number of spurious turns per second, or equivalently, if assuming a constant velocity, per meter travelled. With the same scenario as above, an FPR = 0.05 will on average lead to 3 spurious turns per travelled meter, which is unacceptably high. An FPR = 0.0017 will approximately lead to 1 spurious turn per 10 meters.
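These numbers are straightforward to verify; the snippet below (Python) reproduces them under the stated assumptions (0.50 m/s, 30 Hz, positives defined below 1.5 m, minimal detection distance 1.0 m, i.i.d. samples):

```python
v, fps = 0.50, 30.0                 # forward speed (m/s) and frame rate (Hz)
s = round((1.5 - 1.0) / v * fps)    # 30 samples that must all be negatives

for tpr in (0.95, 0.30):            # p_c = (1 - TPR)^s under the i.i.d. assumption
    print(f"TPR = {tpr:.2f}: p_c = {(1.0 - tpr) ** s:.3g}")
# TPR = 0.95: p_c = 9.31e-40
# TPR = 0.30: p_c = 2.25e-05

for fpr in (0.05, 0.0017):          # expected spurious turns per meter: FPR * fps / v
    print(f"FPR = {fpr}: {fpr * fps / v:.2f} spurious turns per meter")
# FPR = 0.05: 3.00 turns per meter; FPR = 0.0017: about 1 turn per 10 meters
```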

The above analysis seems to indicate that quite many false negatives are acceptable, while there can be only very few false positives. However, there are two complicating factors. The first factor is that Eq. 6 only holds when X can be assumed identically and independently distributed (i.i.d.), which is unlikely due to the nature of the consecutive samples of an approach towards an obstacle. Some reasoning about the nature of the dependence is, however, possible. Assuming a strong correlation between consecutive samples results in a higher probability of x_i being classified the same as x_(i+1). In other words, if a sample in the range ⟨x_1, ..., x_m⟩ is an FN, the chance that more samples are FNs increases. Hence, the expected dependencies negatively impact the performance of the system, making Eq. 6 a best case scenario.

The system can be more realistically modelled as a Markov process, as depicted in Figure 5. From this it can be seen that the system can be split into a reducible Markov process with an absorbing avoid state, and a chain of states that leads to the absorbing collision state. The values of the transition matrix Ω can be determined from the data gathered during operation. This would allow the robot to better predict the consequences of a chosen TPR and FPR.

As an illustration of the effects of sample dependence, let us suppose a model in which each classification has a probability of being identical to the previous classification, p(I(c_(i−1), c_i)). If not identical, the sample is classified independently. This dependency model allows us to calculate the transition Ω_45 in Figure 5. Given a previous negative classification, the transition probability to another negative classification is Ω_45 = p(I(c_(i−1), c_i)) + (1 − p(I(c_(i−1), c_i)))(1 − TPR). If p(I(c_(i−1), c_i)) = 0.8 and TPR = 0.95 as above, Ω_45 = 0.81. The probability of a collision in such a model is p_c = Ω_45^(s−1) = 1.8 · 10^−3, no longer an inconceivably small number.

This leads us to the second complicating factor, which is specific to our SSL setup. Since the robot operates on the basis of the ground truth, it should theoretically hardly ever encounter positive samples. Namely, the robot should turn when it detects a positive sample. This implies that the uncertainty on the estimated TPR is rather high, while the FPR can be estimated better. A potential solution to this problem is to purposefully have the mono-estimation robot turn earlier than the stereo vision based one.


[Figure 5 depicts the states 'avoid' (absorbing), 'obstacle, FP', 'no obstacle, TP', 'no obstacle, TN', 'obstacle, FN', a chain of consecutive-FN states, and an absorbing 'crash' state, connected by transition probabilities Ω_ij.]

Fig. 5. Markov model of the probability of a collision.

G. Similarity with Learning from Demonstration

The core of SSL is a supervised algorithm that learns the function f(x_f) on the basis of supervised outputs g(x_g). Normally, supervised learning assumes that the training data is drawn from the same data probability distribution D as the test data. However, in persistent SSL this assumption generally does not hold. The problem is that by using control based on f, the robot follows a control policy π_f ≠ π_g, and hence will induce a different state distribution, D_π_f ≠ D_π_g. On these different states no supervised outputs have been observed yet, which typically implies an increasing difference between f and g.

A similar problem of inducing a different state distribution is well known in the area of learning from demonstration [2], [3]. Actually, we will show that under some mild assumptions the persistent SSL problem studied in this paper is equivalent to a learning from demonstration problem. Hence, we can draw on solutions from learning from demonstration, such as DAgger [2].

The goal of learning from demonstration is to find a policy π that minimizes a loss function l under its induced distribution of states, from [2]:

$$\pi = \operatorname*{argmin}_{\pi \in \Pi} \; E_{s \sim D_\pi}\left[l(s, \pi)\right] \tag{7}$$

where an optimal teacher policy π* is available to provide training data for specific states s.

At first sight, SSL is quite different, as it focuses only on the state information that serves as input to the policy. Instead of optimizing a policy, the supervised learning in persistent SSL can be defined as finding the function f′ that best matches the trusted function g′ under the distribution of states induced by the use of the thresholded version of f for control:

$$\operatorname*{argmin}_{f \in \mathcal{F}} \; E_{x \sim D_{\pi_f}}\left[l(f(x), g(x))\right] \tag{8}$$

meaning that we perform regression of f on states that are induced by the control policy π_f, which uses the thresholded version of f.

To see the similarity to Eq. 7, first realize that the stereo-based policy is in this case the teacher policy: π_g = π*. For this analysis we simplify the strategy to flying straight when far enough away from an obstacle, and turning otherwise:

$$\pi_g(s): \quad p(\text{straight}\,|\,s=0) = 1, \qquad p(\text{turn}\,|\,s=1) = 1 \tag{9}$$

where s is the state, with s = 1 when g(x) > t_g and s = 0 otherwise. Please note that π_g is a deterministic policy, which is assumed to be optimal.

When we learn a function f, it generally will not give exactly the same outputs as g. Using s = [f > t_f] will result in the following stochastic policy:

$$\pi_f(s): \quad p(\text{straight}\,|\,s=0) = TNR, \quad p(\text{turn}\,|\,s=0) = FPR, \quad p(\text{turn}\,|\,s=1) = TPR, \quad p(\text{straight}\,|\,s=1) = FNR \tag{10}$$

a stochastic policy which by definition is optimal, π_f = π_g, if FPR = FNR = 0. In addition, then D_π_f = D_π_g. Thus, if we make the assumption that minimizing l(f(x), g(x)) also minimizes FPR and FNR (capturing any preference for one or the other in the cost function for the behavior, l(s, π)), then minimizing f in Eq. 8 is equivalent to minimizing the loss in Eq. 7.

The interest of the above-mentioned similarity lies in the use of proven techniques from the field of learning from demonstration for training the persistent SSL system. For instance, in DAgger, during the first learning iteration only the expert policy is learned. After this iteration, a mix of the learned and expert policy is used: π_i = π* with p = β_i,


and π_i = π̂_i with p = (1 − β_i), where β_i is reduced over the iterations (see footnote 1). The policy at iteration i depends on all previously learned data, which is aggregated over time. In [2] it is proven that this strategy gives a favorable limited no-regret loss, which significantly outperforms the traditional learning from demonstration strategy of just learning from the data distribution of the expert policy. In Section V we will show that the main lessons from [2] also apply to persistent SSL.
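As a sketch, this probabilistic mixture can be expressed in a few lines (Python; the policy representations and the decay schedule are illustrative):

```python
import random

def mixed_action(state, expert_policy, learned_policy, beta_i):
    """DAgger-style mixture: query the expert with probability beta_i."""
    if random.random() < beta_i:
        return expert_policy(state)   # e.g. the stereo-based ground truth policy
    return learned_policy(state)      # e.g. the trained monocular policy

# beta_i is reduced over the iterations, e.g. beta_i = p ** i for some p < 1.
```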

IV. OFFLINE VISION EXPERIMENTS

In this section we perform offline vision experiments. The goal of these experiments is to determine how good the proposed VBoW method is at estimating monocular depth, and to determine the best parameter settings.

To measure the performance we use two main metrics: the mean square error (MSE) and the area under the curve (AUC) of a ROC curve. MSE is an easy metric that can be directly used as a loss function, but in practice many situations exist in which a low MSE can be achieved while the performance is still inadequate as a basis for reliable MAV behavioral control. The AUC captures the trade-off between TPR and FPR, and hence is a good indication of how good the performance is in terms of obstacle detection.

Fig. 6. Example from dataset 1 (left, 128×96 pixels) and dataset 2 (right, 640×480).

We use two datasets in the experiments. The first dataset is a video made on a drone during an autonomous flight, using the onboard 128×96 pixel stereo camera. The second dataset is a video made by manually walking with a higher quality 640×480 pixel stereo camera through an office cubicle, in a similar fashion as the robot should move in the later online experiments. Datasets 1 and 2 used in this section have been made publicly available for download.²³ An example image from each dataset is shown in Figure 6.

Our implementation of the VBoW method has six main parameters, ranging from the number of intensity and gradient textons to the number of samples used to smooth the estimated disparity over time. An exhaustive search of parameters being out of reach, we have performed various investigations of parameter changes along a single dimension.

¹In [2] this mixture was written as $\pi_i = \beta_i \pi^* + (1 - \beta_i)\hat{\pi}_i$, hinting at mixed control. However, it is described as a policy that queries the expert to choose controls a fraction of the time while collecting the next dataset. For this reason, and because it makes more sense given our binary control strategy, we treat this policy as a probabilistic mixture in this article.

²Dataset 1 can be downloaded and viewed from http://1drv.ms/1NOhIOh
³Dataset 2 can be downloaded from http://1drv.ms/1NOhLtA

Table I presents a list of the final tuned parameter values. Please note that these parameter values have not only been optimized for performance: whenever performance differences were marginal, we have chosen the parameter values that saved on computational effort. This choice was guided by our goal to perform the learning on board of a computationally limited drone. Below, we show a few of the results when varying a single parameter, deviating from the settings in Table I in the corresponding dimension.

Fig. 7. The texton dictionaries used: left for camera 1, right for camera 2.

In this work, two types of textons are combined to form a single texton histogram: normal intensity textons, obtained with Kohonen clustering as in [28], and gradient textons, obtained similarly but based on the gradient of the images. Gradient textures have been shown in [30] to be an important depth cue. An example dictionary of each is depicted in Figure 7. Gradient textons are shown with a color range (from blue = low to red = high). The intensity textons in Figure 7 are based on grayscale intensity pixel values.

Figure 8 shows the results for different numbers of textons ∈ {4, 8, 12, 16, 20}, always consisting of half pixel intensity and half gradient textons. From the results we can see that the performance saturates around 20 textons. Hence, we selected this combination of 10 intensity and 10 gradient textons for the experiments.

The VBoW method involves choosing a regression algorithm. In order to determine the best learning algorithm, we have tested four regression algorithms, limiting the choice mainly based on the feasibility of implementing the regression algorithm on board a constrained embedded system. We have tested two non-parametric (kNN and Gaussian process regression) and two parametric (linear and shallow neural network regression) algorithms. Figure 9 presents the learning curves for a comparison of these regressors. Clearly, in most cases the kNN regression comes out best. A naïve implementation of kNN suffers from a larger training set in terms of CPU usage during test time, but after implementation on the drone this did not become a bottleneck.

The final offline results on the two datasets are quite satisfactory; they can be viewed online.⁴⁵ After a training set of roughly 4000 samples, the kNN approximates the stereo vision based disparities in the test set rather well. Given a desired TPR of 0.96, the learner has an FPR of 0.47. This should be sufficient for usage of the estimated disparities in control.
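The comparison of regressors can be reproduced in outline with scikit-learn stand-ins, as in the sketch below (our illustration; the paper's own regressors were custom on-board implementations, and the hidden layer width of the neural network is an assumption):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor

# Illustrative stand-ins for the four tested regression algorithms.
REGRESSORS = {
    "kNN (k=5)": KNeighborsRegressor(n_neighbors=5),
    "linear": LinearRegression(),
    "gp": GaussianProcessRegressor(),
    "neural net": MLPRegressor(hidden_layer_sizes=(24,), max_iter=2000),
}

def learning_curve(model, X, y, sizes, X_test, y_test):
    """Train on growing prefixes of the chronological training data and
    report the test MSE, mimicking the learning curves of Figure 9."""
    results = []
    for n in sizes:
        model.fit(X[:n], y[:n])
        err = float(np.mean((model.predict(X_test) - y_test) ** 2))
        results.append((n, err))
    return results
```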

V. SIMULATION EXPERIMENTS

In Section III we argued that a persistent form of SSL is similar to learning from demonstration. The relevance of this

⁴A video of VBoW visualizations on dataset 1 is available here: http://1drv.ms/1KdRtC1
⁵A video of VBoW visualizations on dataset 2 is available here: http://1drv.ms/1KdRxlg


Fig. 8. MSE and AUC as a function of training set size [frames] for different numbers of textons ∈ {4, 8, 12, 16, 20}, on dataset 1 (left) and dataset 2 (right). Dashed/solid lines refer to results on the train/test set.

TABLE I. PARAMETER SETTINGS

Parameter                    Value
Number of intensity textons  10
Number of gradient textons   10
Patch size                   5×5
Subsampling samples          500
kNN                          k = 5
Smooth size                  4

similarity lies in the behavioral schemes used for learning. In this section, we compare three learning schemes in simulation.

A. Setup

We simulate a 'flying' drone with a stereo vision camera in SmartUAV [31], an in-house developed simulator that allows for 3D rendering and simulation of the sensors and algorithms used on board the real drone. Figure 10 shows the simulated 'office room'. The room has a size of 10×10 m, and the drone flies at an average forward speed of 0.5 m/s. All the vision and learning algorithms are exactly the same as the ones that run on board of the drone in the real experiments.

Three learning schemes are designed as follows. They all start with an initial learning period in which the drone is controlled purely by means of stereo vision. In the first learning scheme, the drone continues to fly based on stereo vision for

Fig. 10. SmartUAV simulation environment.

the remainder of the learning time; after learning, the drone immediately switches to monocular vision. For this reason, the first scheme is referred to as 'cold turkey'. In the second learning scheme, the drone performs a stochastic policy, selecting the stereo vision based actions with probability β_i and the monocular based actions with probability (1 − β_i), as was proposed in the original DAgger article [2]. In the experiments, β_i = 0.25. Finally, in the third learning scheme, the drone performs monocular based actions, with stereo vision only used to override these actions when the drone gets too close to an obstacle. Therefore, we refer to this scheme as 'training wheels'.


Fig. 9. Learning curves (MSE and AUC vs. training set size [frames]) of the VBoW regression algorithms (kNN with k = 5, linear, GP, neural net) on dataset 1 (left) and dataset 2 (right). Dashed/solid lines refer to results on the train/test set.

After learning, the drone uses its monocular disparity estimates for control. The stereo vision remains active, but only for overriding the control if the drone gets too close to a wall. During this testing period, we register the number of turns and the number of overrides. The number of overrides is a measure of the number of potential collisions. The number of turns is compared to the number of turns performed when using stereo vision, to evaluate the number of spurious turns. The initial learning period is 1 minute, the remaining learning period is 4 minutes, and the test time is 5 minutes. These times have been selected to allow a full experiment on a single battery of the real drone (see Section VI).
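The action-selection logic of the three schemes can be summarized in a sketch like the following (an illustrative reconstruction, not the flight code; the names and the override threshold t_override are ours):

```python
import random

def select_action(scheme, stereo_action, mono_action, stereo_disparity,
                  t_override, beta=0.25, learning=True):
    """Action selection for the three learning schemes of Section V.
    'cold turkey': stereo during learning, mono afterwards.
    'dagger': stereo with probability beta during learning.
    'training wheels': mono, with stereo overriding near obstacles."""
    if scheme == "cold turkey":
        return stereo_action if learning else mono_action
    if scheme == "dagger":
        if learning and random.random() < beta:
            return stereo_action
        return mono_action
    if scheme == "training wheels":
        # Stereo overrides only imminent collisions (also in the test phase).
        if stereo_disparity > t_override:
            return stereo_action
        return mono_action
    raise ValueError(f"unknown scheme: {scheme}")
```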

B. Results

Table II contains the results of 30 experiments with the three learning schemes and a purely stereo-vision-controlled drone. The first observation is that 'cold turkey' gives the worst results. This result was to be expected on the basis of the similarity between persistent SSL and learning from demonstration: the learned monocular distance estimates do not generalize well to the test distribution when the monocular vision is

TABLE II. TEST RESULTS FOR THE THREE LEARNING SCHEMES. THE AVERAGE AND STANDARD DEVIATION ARE GIVEN FOR THE NUMBER OF OVERRIDES AND TURNS DURING THE TESTING PERIOD. A LOWER NUMBER OF OVERRIDES IS BETTER. THE BEST RESULTS ARE SHOWN IN BOLD.

Method              Overrides       Turns
Pure stereo         N/A             45.6 (σ = 3.0)
1. Cold turkey      25.1 (σ = 8.2)  42.8 (σ = 3.7)
2. DAgger           10.7 (σ = 5.3)  41.4 (σ = 3.2)
3. Training wheels  4.3 (σ = 2.6)   40.4 (σ = 2.6)

in control. The originally proposed DAgger scheme performs better, while the third learning scheme, termed 'training wheels', seems most effective. The third scheme has the lowest number of overrides of all learning schemes, with a similar total number of turns as a pure stereo vision run. The intuition behind this method being best is that it allows the drone to best learn from samples when the drone is beyond the normal stereo vision turning threshold. The original DAgger scheme has a larger probability to turn earlier, exploring these samples to a lesser extent. Double-sided statistical bootstrap tests [32] indicate that all differences between the learning methods are significant, with p < 0.05.

The differences between the learning schemes are well illustrated by the positions the drone visits in the room during the test phase. Figure 11 contains 'heat maps' that show the drone


Fig. 11. Simulation heat maps, from left to right: cold turkey, DAgger, training wheels, stereo only. Top images are turn locations, lower images are the approaches.

Fig. 12. The multicopter used in the experiments.

positions during turning (top row) and during straight flight (bottom row). The position distribution has been obtained by binning the positions during the test phase of all 30 runs. The results for each scheme are shown per column in Figure 11. Rightmost is the pure stereo vision scheme, which shows a clear border around the straight flight trajectories. It can be observed that this border is best approximated by the 'training wheels' scheme (second from the right).

VI. ROBOTIC EXPERIMENTS

The simulation experiments showed that the 'training wheels' setup resulted in the fewest stereo vision overrides when switching to monocular disparity estimation and control. In this section, we test this online learning setup with a flying robot.

The experiment is set up in the same manner as the simulation. The robot, a Parrot AR drone 2, first explores the room with the help of stereo vision. After 1 minute of learning, the drone switches to using the monocular disparity estimates, with stereo vision running in the background for performing potential safety overrides. In this phase, the drone still continues to learn. After learning for 4 to 5 minutes, the drone stops learning and enters the test phase. Again, also for the real robot, the main performance measure consists of the number of safety overrides performed by the stereo vision during the testing phase.

The AR drone 2 is not standard equipped with a stereo vision system. Therefore, an in-house-developed 4 gram stereo vision system is used [33], which sends the raw images over USB to the AR drone 2. The grayscale stereo camera has a resolution of 128×96 px and is limited to 10 fps. The AR drone 2 comes with a 1 GHz ARM Cortex-A8 processor and 128 MB RAM,

and normally runs the Parrot firmware as an autopilot. For the experiments, we replaced this firmware with the open source Paparazzi autopilot software [34], [35]. This allowed us to implement all vision and learning algorithms on board the drone. The length of each test depends on the battery, which due to wear has considerable variation, in the range of 8-15 minutes.

The tests are performed in an artificial room that has been constructed within a motion tracking arena. This allows us to track the trajectory of the drone and facilitates post-experiment analysis. The room is approximately 5×5 m, as delimited by plywood walls. In order to ensure that the stereo vision algorithm gives reliable results, we added texture in the form of duct tape to the walls. In five tests we had a textured carpet hanging over one of the walls (Figure 13, left, referred to as 'room 1'); in the other five tests it was on the floor (Figure 13, right, referred to as 'room 2').

Fig. 13. The two test flight rooms.

1) Results: Table III shows the summarized results obtained from the monocular test flights. Two main observations can be made from this table. First, the average number of stereo overrides during the test phase is 3, which is very close to the number of overrides in simulation. The monocular behavior also has a similar heat map to the simulation. Figure 14 shows a heat map of the drone's position during the approaches and the avoidance maneuvers (the turns). Again, the stereo based flight performs better in the sense that the drone explores the room much more thoroughly and the turns happen consistently just before an obstacle is detected. On the other hand, especially in room 2, the monocular performance is quite good in the sense that the system is able to explore most of the room.

Second, the selected TPR and FPR are on average 0.47 and 0.11. The TPR is rather low compared to the offline tests. However, this number is heavily influenced by the behavior based on the monocular estimator. Due to the goal of the robot avoiding obstacles slightly before the stereo ground truth recognizes them as positives, positives should hardly occur at all. Only in cases of FNs, where the estimator is slower or wrong, will positives be registered by the ground truth. Similarly, the FPR is also lower in the context of the estimate based behavior.

ROC curves of the 10 flights are shown in Figure 16. A comparison based on the numbers between the first 5 flights (room 1) and the last 5 flights (room 2) does not show any significant differences, suggesting that the system is able to learn both rooms equally well. However, when comparing the heat maps of the two situations for the monocular system in Figure 14 and Figure 15, it seems that the system shows slightly different behavior. The monocular system appears to explore room 2 better, getting closer to copying the behavior of the stereo based system.


TABLE III. TEST FLIGHT SUMMARY

                           Room 1                             Room 2
Description                1     2     3     4     5     6     7     8     9     10    Avg
Stereo flight time (m:ss)  6:48  7:53  2:13  3:30  4:45  4:39  4:56  5:12  4:58  5:01  4:59
Mono flight time (m:ss)    3:44  8:17  6:45  7:25  4:54  10:07 4:46  9:51  5:23  5:12  6:39
Mean square error          0.7   1.96  1.12  0.95  0.83  0.95  0.87  1.32  1.16  1.06  1.09
False positive rate        0.16  0.18  0.13  0.11  0.11  0.08  0.13  0.08  0.1   0.08  0.11
True positive rate         0.9   0.44  0.57  0.38  0.38  0.4   0.35  0.35  0.6   0.39  0.47
Stereo approaches          29    31    8     14    19    22    22    19    20    21    20.5
Mono approaches            10    21    20    25    14    33    15    28    18    15    19.9
Auto-overrides             0     6     2     2     1     5     2     7     3     2     3
Overrides ratio            0     0.72  0.3   0.27  0.2   0.49  0.42  0.71  0.56  0.38  0.41


Fig. 14. Room 1 (plain texture) position heat map. Top row: binned positions during the avoidance turns; bottom row: during the obstacle approaches. Left column: during stereo ground truth based operation; right column: during learned monocular operation.

The experimental setup, with the room in the motion tracking arena, allows for a more in-depth analysis of the performance of both stereo and monocular vision. Figure 17 shows the spatial view of the flight trajectory of test 10.⁶ The flight is segmented into approaches and turns, which are numbered accordingly in these figures. The color scale in Figure 17A is created by calculating the theoretically visible closest wall, based on the tracking system's measured heading and position of the drone, the known position of the walls, and the FOV angle of the camera. It is clearly visible that the stereo ground truth in Figure 17B does not capture this theoretical disparity perfectly. Especially in the middle of the room, the disparity remains high compared to the theoretical ground truth, due to noise in the stereo disparity map. The results of the monocular estimator in Figure 17C show another decrease in quality compared to the stereo ground truth.

⁶Onboard video data of flight 10 can be viewed at http://1drv.ms/1KC81PN, an external video at https://youtu.be/lC_vj-QNy1I, and a VBoW visualization video at http://1drv.ms/1fzvicH

Fig. 15. Room 2 (carpet, natural texture) position heat map.

Fig. 16. ROC curves of the 10 test flights. Dashed/solid lines refer to results on room 1/2.


Fig. 17. Flight path of test 10 in room 2. Monocular flight starts from approach 23. The meaning of the color of the flight path differs per image: A, the approximated disparity based on the external tracking system; B, the measured stereo average disparity; C, the monocular estimated disparity; D, the error between B and C, with dark blue meaning zero error; E, the error during FPs; F, the error during FNs. E and F only show the monocular part of the flight.

VII. DISCUSSION

We start the discussion with an interpretation of the results from the simulation and real-world experiments, after which we proceed by discussing persistent SSL in general and provide a comparison to other machine learning techniques.

A. Interpretation of the results

Using persistent SSL, we were able to autonomously navigate our multicopter on the basis of a stereo vision camera, while training a monocular estimator on board and online. Although the monocular estimator allows the drone to continue flying and avoiding obstacles, the performance during the approximately ten-minute flights is not perfect. During monocular flight, a fairly limited number of (autonomous) stereo overrides was needed, while at the same time the robot was not exploring the room as fully as when using stereo.

Several improvements can be suggested. First, we can simply have the drone learn for a longer time, accumulating training data over multiple flights. In an extended offline test, our VBoW method shows saturation at around 6000 samples. Using additional features and more advanced learning methods

may result in improved performance as training set sizes increase.

During our tests in different environments, it proved unnecessary to tune the VBoW learning algorithm parameters to a new environment, as similar performance was obtained. The learned results on the robot itself may or may not generalize to different environments; however, this is of less concern, as the robot can detect a new environment and then decide to continue the learning process if the original cue is still available.

In order to detect an inadequacy of the learned regression function, the robot can occasionally check the estimation error against the stereo ground truth. In fact, our system already does so autonomously, using its safety override. Methods for checking the performance without using the ground truth, e.g. by employing a learner that gives an estimate of uncertainty, are left for future work.

B. Deep learning

At the time of our robotic experiments, implementing state-of-the-art deep learning methods on board a flying drone was deemed infeasible due to hardware restrictions. However, in Appendix A we investigate using downsized networks in order


to show that the persistent SSL method can work with deep learning as well. One of the major advantages of persistent SSL is the unprecedented amount of available training data. This amount of data will be more useful to more complex learning methods, such as deep learning methods, than to less complex but computationally efficient methods, such as the VBoW method used in our experiments. Today, with the availability of strongly improved hardware such as the NVidia Jetson TX1, close-to state-of-the-art models can be trained and run on board a drone, which may significantly improve the learning results.

C. Persistent SSL in relation to other machine learning techniques

In order to place persistent SSL in the general framework of machine learning, we compare it with several techniques. An overview of this comparison is presented in Figure 18.

Fig. 18. Lay of the machine learning land: supervised learning, self-supervised learning, persistent self-supervised learning, unsupervised learning, learning from demonstration, and reinforcement learning, positioned by importance of supervision (classical machine learning) versus importance of behavior (classical control policy learning).

Un-/semi-/supervised learning. Unsupervised learning does not require labeled data, semi-supervised learning requires only an initial set of labeled data [36], and supervised learning requires all the data to be labeled. Internally, persistent SSL uses a standard supervised learning scheme, which greatly facilitates and speeds up learning. The typical downside of supervised learning, acquiring the labels, does not apply to SSL, since the robot continuously provides these labels itself. A major difference between the typical use of supervised learning and its use in persistent SSL is that the data for learning is generally assumed to be i.i.d. However, persistent SSL controls a behavioral component, which in turn affects the dataset obtained both at training and at test time. Operation based on the ground truth induces a certain behavior that differs significantly from the behavior induced by a trained estimator, even more so for an undertrained estimator.

SSL. The persistent form of SSL is set apart in the figure from 'normal' SSL, because the persistence property introduces a much more significant behavioral component to the learning. While normal SSL expects the trusted cue to remain available, persistent SSL assumes that the robot may sometimes act in

the absence of the trusted cue. This introduces the feedback-induced data bias problem, which, as we have seen, requires specific behavior strategies for best learning the robot's task.

Learning from demonstration. Learning from Demonstration (LfD) is a close relative to persistent SSL. Consider for instance teleoperation, an LfD scheme in which a (human or robot) teacher remotely operates a robot in order for it to learn demonstrated actions in its environment [17]. This can be compared to persistent SSL if we consider the teacher to be the ground truth function g(x_g) in the persistent SSL scheme. In most cases described in the literature, the teacher demonstrates actions from a control policy taken on the basis of a state, instead of just the results from a sensory cue (i.e., the state). However, LfD does contain exceptions in which the learner only records the states during demonstration, e.g. when drawing a map through a 2D representation of the world in the case of a path planning mission with an outdoor robot [37]. As in persistent SSL, test time decisions taken in LfD schemes influence future observations, which may or may not be contained in known, demonstrated territory. However, one key difference between LfD and persistent SSL arguably sets them apart: all LfD theory known to the authors implicitly assumes that the teacher is never the same entity as the learner. It may be that all relevant sensors are on the learner, and even that the learner's body is used to execute teacher commands (as in teleoperation), but the teacher's intelligence is always an external entity.

Reinforcement learning. Lastly, we compare persistent SSL with Reinforcement Learning (RL), which is a distinctively different technique [38]. In RL, a policy is learned using a reward function. Due to the evaluative feedback provided in RL, defining a good reward function is one of the fundamental difficulties of RL, known as reward shaping [39], [38]. Since persistent SSL uses supervised feedback, reward shaping is less of an issue in persistent SSL, only requiring the choice of a loss function between g(x_g) and f(x_f). Secondly, the initial exploration phase of RL often involves a lot of trial and error, making it a dangerous time in which a physical system may crash and be damaged. Although this particular problem is often solved by better initialization, e.g. by using LfD, or by using policy search instead of value function based approaches, persistent SSL does not require an untrained initialization phase at all, as a reliable ground truth function guarantees a certain minimal correct behavior.

Persistent SSL differs from other learning techniques in the sense that no complete training data set is needed to train the algorithm beforehand. Instead, it requires a ground truth g(x_g), which must be available online in real time while training f(x_f), but can be switched off when f(x_f) is learned to satisfaction. This implies that learning needs to be persistent and that the switch θ must be included in the model. Note that in cases where the environment of the robot may change, measures can be put in place to detect the output uncertainty of f(x_f). If the uncertainty goes up, the robot can switch back to using the ground truth function, and learning can then be activated again. Developing such measures is, however, left for future work.


D. Feedback-induced data bias

The robot influences how its environment is perceived, meaning that it influences the acquired training samples through its behavior. The problems arising from this feedback-induced data bias are known from other machine learning disciplines, such as RL and LfD [38]. In particular, Ross et al. have proposed DAgger [2] to solve a similar problem in the LfD domain, which iteratively aggregates the dataset with induced training samples and the expert's reactions to them. However, in the case of LfD, obtaining the induced training samples requires a careful and often additional setup, while in persistent SSL this functionality is inherently available. Secondly, the performance of the LfD expert (i.e., in many cases a human) is not easy to control, often reacting too late or too early. The control policy of the persistent SSL ground truth override system can, on the other hand, be very deterministic. In the case of a DAgger application with drones flying through a forest [3], it proved infeasible to reliably sample the expert in an online fashion: acquired videos had to be processed offline by the expert, hence the need for (offline ↔ online) iterations. Moreover, an additional online human safety override interface was still necessary to prevent damage to the drone while learning. Thirdly, due to the cost of (and need for) iterative demonstration sessions, the emphasis of DAgger is on converging fast, needing as few expert sessions as possible. In persistent SSL there are no costs for using the teacher signals coming from the original sensor cue. With persistent SSL we can directly focus on effectively using the available amount of training samples, instead of minimizing the number of iterations as in DAgger.

Another reason why persistent SSL handles the induced training sample issue better than other state-of-the-art robot learning methods is that in persistent SSL part of the learning problem itself can easily be separated from the behavior and tested, i.e. in a traditional supervised learning setting. In our proof of concept, this has allowed us to test the learning algorithms and thoroughly investigate their limits before deployment, which helped us to identify when the behavioral influence was to blame for bad results.

VIII. CONCLUSION

We have investigated a novel self-supervised learning scheme in which the supervisory signal is switched off after an initial learning period. In particular, we have studied an instance of such persistent SSL for the task of obstacle avoidance, in which the robot uses trusted stereo vision distance estimates in order to learn appearance-based monocular distance estimation. We have shown that this particular setup is very similar to learning from demonstration. This similarity has been corroborated by experiments in simulation, which showed that the worst learning strategy is to make a hard switch from stereo vision flight to mono vision flight. It is best to have the robot fly based on mono vision, using stereo vision only as 'training wheels' to take over when the robot would otherwise collide with an obstacle. The real-world robot experiments show the feasibility of the approach, giving acceptable results already with just 4-5 minutes of learning.

The findings also indicate interesting future avenues of investigation. First, and perhaps most importantly, in the 4-5 minutes of the real-world experiments the robot already experiences roughly 7000-9000 supervised learning samples. It is clear that longer learning times can lead to very large supervised data sets, which are suitable for deep learning approaches. Such approaches likely allow the learning to extend to much larger, more varied environments. In addition, they could allow the learning to improve the resolution of the disparity estimates, from a single value to a full image-size disparity map. Second, in the current experiments the robot stayed in a single environment. We mentioned that a different environment can make the learned mapping invalid, and that this can be detected by means of the ground truth. Another avenue, as studied in [10], is to use a machine learning method with an associated uncertainty value; for instance, one could use a learning method such as a Gaussian process. This can help with a further integration of the behavior with learning, for instance by tuning the forward velocity based on the certainty. These avenues together could allow persistent SSL to reach its full potential, significantly enhancing the robustness of robots operating in real-world environments.

APPENDIX A
DEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep convolutional neural network (CNN) on board a drone to test our proposed learning scheme. We scaled the CNN architecture such that implementing this network on the AR drone 2 hardware remained feasible. Due to the CPU restrictions of our target system, we chose to use a relatively small CNN, inspired by the Cifar10 CNN example used in Caffe. Our layers are as follows. Input: a 128×128 pixel image (input images are scaled). First hidden layer: 5×5 convolution, max pooling to 3×3, 32 kernels, contrast normalization. Second hidden layer: 5×5 convolution, average pooling to 3×3, 32 kernels, contrast normalization. Third hidden layer: 5×5 convolution, average pooling to 3×3, 128 kernels. Fourth and fifth layers: fully connected. Lastly, a Euclidean loss output layer to one single output. Furthermore, we used a learning rate of 1e-6, 0.9 momentum, and 10000 iterations. The networks are trained using the open source Caffe framework on a dedicated compute server with an NVidia GTX660. In Figure 19, the learning curves of our best CNN under these constraints versus the best of our VBoW-trained algorithms are shown. The figure shows that the CNN is able to learn the problem of monocular depth estimation to a comparable fashion as our VBoW method. For the expected amount of training data during a single flight, the VBoW method delivers better performance for dataset 1 and slightly worse for dataset 2.
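For reference, the described architecture can be sketched as follows in PyTorch (the original was implemented in Caffe; the pooling strides and the fully connected layer widths are our assumptions, as the text does not specify them):

```python
import torch
import torch.nn as nn

class MonoDisparityCNN(nn.Module):
    """Sketch of the small Cifar10-style CNN described in Appendix A."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),    # 1st hidden layer
            nn.MaxPool2d(3, stride=2, ceil_mode=True),     # max pooling, 3x3
            nn.LocalResponseNorm(5),                       # contrast norm
            nn.Conv2d(32, 32, kernel_size=5, padding=2),   # 2nd hidden layer
            nn.AvgPool2d(3, stride=2, ceil_mode=True),     # average pooling
            nn.LocalResponseNorm(5),
            nn.Conv2d(32, 128, kernel_size=5, padding=2),  # 3rd hidden layer
            nn.AvgPool2d(3, stride=2, ceil_mode=True),
        )
        self.regressor = nn.Sequential(                    # 4th/5th layer: FC
            nn.Flatten(),
            nn.Linear(128 * 16 * 16, 64),                  # width assumed
            nn.ReLU(),
            nn.Linear(64, 1),                              # single output
        )

    def forward(self, x):  # x: (N, 1, 128, 128) scaled input image
        return self.regressor(self.features(x))

model = MonoDisparityCNN()
loss_fn = nn.MSELoss()  # Euclidean loss to one single output
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
```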

REFERENCES

[1] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.

[2] S. Ross, G. J. Gordon, and J. A. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in 14th International Conference on Artificial Intelligence and Statistics, vol. 15, 2011.


Fig. 19. Learning curves (MSE and area under ROC curve vs. training set size [frames]) on the two datasets for CNNs versus VBoW. Dashed/solid lines refer to results on the train/test set.

[3] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," 2013 IEEE International Conference on Robotics and Automation, pp. 1765–1772, May 2013.

[4] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al., "Stanley: The robot that won the DARPA Grand Challenge," in The 2005 DARPA Grand Challenge. Springer, 2007, pp. 1–43.

[5] D. Lieb, A. Lookingbill, and S. Thrun, "Adaptive road following using self-supervised learning and reverse optical flow," in Robotics: Science and Systems, 2005, pp. 273–280.

[6] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, "Reverse optical flow for self-supervised adaptive autonomous robot navigation," International Journal of Computer Vision, vol. 74, no. 3, pp. 287–302, 2007.

[7] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009.

[8] U. A. Muller, L. D. Jackel, Y. LeCun, and B. Flepp, "Real-time adaptive off-road vehicle navigation and terrain classification," in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2013, pp. 87410A–87410A.

[9] J. Baleia, P. Santana, and J. Barata, "On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances," Journal of Intelligent & Robotic Systems, pp. 1–20, 2015.

[10] H. Ho, C. De Wagter, B. Remes, and G. de Croon, "Optical flow for self-supervised learning of obstacle appearance," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3098–3104.

[11] T. Mori and S. Scherer, "First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles," Proceedings - IEEE International Conference on Robotics and Automation, pp. 1750–1757, 2013.

[12] J. Engel, J. Sturm, and D. Cremers, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[13] ——, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[14] F. van Breugel, K. Morgansen, and M. H. Dickinson, "Monocular distance estimation from optic flow during active landing maneuvers," Bioinspiration & Biomimetics, vol. 9, no. 2, p. 025002, 2014.


[15] G. C. de Croon, "Monocular distance estimation with optical flow maneuvers and efference copies: a stability-based strategy," Bioinspiration & Biomimetics, vol. 11, no. 1, p. 016004, 2016.

[16] G. de Croon, M. Groen, C. De Wagter, B. Remes, R. Ruijsink, and B. van Oudheusden, "Design, aerodynamics and autonomy of the DelFly," Bioinspiration & Biomimetics, vol. 7, no. 2, p. 025003, 2012.

[17] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.

[18] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.

[19] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D depth reconstruction from a single still image," International Journal of Computer Vision, vol. 76, no. 1, pp. 53–69, Aug. 2007.

[20] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, May 2009.

[21] I. Lenz, M. Gemici, and A. Saxena, "Low-power parallel algorithms for single image based obstacle avoidance in aerial robots," IEEE International Conference on Intelligent Robots and Systems, pp. 772–779, 2012.

[22] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, pp. 1–9, 2014.

[23] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM Press, 2005, pp. 593–600.

[24] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight."

[25] K. Yamauchi, M. Oota, and N. Ishii, "A self-supervised learning system for pattern recognition by sensory integration," Neural Networks, vol. 12, no. 10, pp. 1347–1358, 1999.

[26] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L. E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, "Stanley: The robot that won the DARPA Grand Challenge," Springer Tracts in Advanced Robotics, vol. 36, pp. 1–43, 2007.

[27] C. De Wagter, S. Tijmons, B. Remes, and G. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in IEEE International Conference on Robotics & Automation (ICRA), 2014, pp. 4982–4987.

[28] G. C. H. E. De Croon, E. De Weerdt, C. De Wagter, B. D. W. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.

[29] M. Varma and A. Zisserman, "Texture classification: are filter banks necessary?" in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003.

[30] B. Wu, T. L. Ooi, and Z. J. He, "Perceiving distance accurately by a directional process of integrating ground information," Nature, vol. 428, no. 6978, pp. 73–77, 2004.

[31] C. De Wagter and M. Amelink, "Holiday50av technical paper," 2007.

[32] P. R. Cohen, "Empirical methods for artificial intelligence," IEEE Intelligent Systems, no. 6, p. 88, 1996.

[33] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 4982–4987.

[34] B. Remes, D. Hensen, F. Van Tienen, C. De Wagter, E. Van der Horst, and G. De Croon, "Paparazzi: how to make a swarm of Parrot AR drones fly autonomously based on GPS," in IMAV 2013: Proceedings of the International Micro Air Vehicle Conference and Flight Competition, Toulouse, France, 17-20 September 2013, 2013.

[35] P. Brisset, A. Drouin, M. Gorraz, P.-S. Huard, and J. Tyler, "The Paparazzi solution," 2014.

[36] X. Zhu, "Semi-supervised learning literature survey," Sciences-New York, pp. 1–59, 2007.

[37] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, "Boosting structured prediction for imitation learning," Advances in Neural Information Processing Systems (NIPS), p. 54, 2006.

[38] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, pp. 1238–1274, 2013.

[39] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.



work for driving rovers, and even for MAVs. Lenz et al. [21] propose a solution based on an MRF to detect obstacles on board an MAV, but it does not infer how far away the objects are. Instead, it is trained offline to recognize three different obstacle class types; any different object could hence lead to navigation problems.

Recently, again focusing on creating a dense depth map from a single image, Eigen et al. [22] proposed a multi-scale deep neural network approach, trained on the KITTI dataset, making it more resilient to practical robot data. Training deep neural networks requires a large data set, which is often obtained by deforming training data or by artificially generating training data. Michels et al. [23] use artificial data to learn to recognize obstacles on a rover, but in order to generalize well it requires the use of a very realistic simulator. In addition, the same work reports significant improvement if the artificial data is augmented with labeled real-world data.

Other groups acquire training data for supervised learning by having another, separate robot or system acquire the data. This data is then processed and learned offline, after which the learned algorithm is deployed on the target robot. Dey et al. [24] use an RC car with a stereo vision setup to acquire data from an environment, apply machine learning on this data offline, and navigate a similar but unseen environment with an MAV based on the trained algorithm. Creating and operating a secondary system designed to acquire training data, however, is no free lunch. Moreover, it introduces inconvenient biases in the training data, because an RC car will not behave in a similar way, in terms of dynamics, camera viewpoint, and the path chosen through the environment.

None of the above methods have the robot gather the data and learn while in operation.

B. Self-supervised learning

The idea of self-supervised learning has been around since the late 1990s [25], but its successful application to terrain classification on the autonomously driving car Stanley [26] demonstrated its first major practical use. A similar approach was taken by Hadsell et al. [7], but using a stereo vision system instead of a LIDAR system, and complex convolutional filters instead of simple and fixed color based features. These approaches largely forgo the need for manually labeled data, as they are designed to work in unseen environments.

In the studies on self-supervised learning for terrain classification, the ground truth is assumed to be always available. In fact, the learned function only lasts for a single image, implying that the trained terrain appearance models are directly applicable to very similar data.

Two very recent studies do have the learned function last longer over time, both with the idea of having the robot take some decisions based on the complementary sensor cue alone. Baleia et al. [9] study a rover with a haptic antenna sensor. In their application of terrain mapping, they try to map monocular cues to obstacles, based on earlier events of encountering similar situations, which resulted in either a hard obstacle, a traversable obstacle, or a clear path. The monocular information is used in a path planning task, requiring a cost

function for either exploring unknown potential obstacles or traversing the terrain based on the currently available information. Since checking whether a potential obstacle is traversable is costly (the rover needs to travel there in order for the antenna to provide ground truth on it), the robot learns to classify the terrain ahead with vision. For each sample, an analysis is performed to determine whether the vision-based classifier is sufficiently confident: it either decides the terrain is traversable, not traversable, or is unsure. In the unsure case, the sample is sensed using the antenna. Gradually this becomes necessary less often, the robot thus learning to navigate using its Kinect sensor alone. In Ho et al. [10], a flying robot first uses optical flow to select a landing site that is flat and free of obstacles. In order for this to work, the robot has to move sufficiently with respect to the objects on the landing site. While flying, the robot uses SSL to learn a regression function that maps an (appearance-based) texton distribution to the values coming from the optical flow process. The learned function extends the capabilities of the robot, as after learning it is also able to select landing sites without moving (from hover). The article shows that if the robot is uncertain about its appearance-based estimates, it can switch back to the original optical flow based cue.

In this article, we focus on the behavioral aspects of persistent SSL. We study how to best set up the learning process, so that the robot will be able to keep performing its task when the original sensor cue becomes completely unavailable.

III. METHODOLOGY OVERVIEW

This section starts by formally and generally posing the persistent SSL learning mechanism (Subsection III-A). Next, in Subsection III-B, we show how this method is used in our specific proof of concept case, monocular depth estimation for flying robots, providing subsequent implementation details in Subsections III-C to III-F. Lastly, in Subsection III-G, we identify a behavioral issue in applying persistent SSL and formally define the learning problem for our proof of concept. For a comparison between persistent SSL and other machine learning methods, we refer the reader to the discussion.

A. Persistent SSL principle

The persistent SSL principle is schematically depicted in Figure 1. In persistent SSL, an original, pre-wired sensory cue provides supervised outputs to a learning process that takes a different, complementary sensory cue as input. The goal is to be able to replace the pre-wired cue if necessary. When considering the system as a whole, learning with persistent SSL can be considered unsupervised: it requires no manual labeling or pre-training before deployment in the field. Internally, it uses a supervised learning method that does in fact need ground truth labels to learn. This ground truth is, however, assumed to be provided online and autonomously, without human or outside interference.

In the schematic, the input variable x represents the sensory inputs available on board. The variables x_g and x_f are possibly overlapping subsets of these sensory inputs. In particular, the function g(x_g) extracts a trusted ground truth sensory cue from


Fig. 1. The persistent self-supervised learning principle.

the sensory inputs x_g. In classical systems, g(x_g) provides the required functionality on its own:

$$g: x_g \to y, \quad x_g \subseteq x \tag{1}$$

The function f(x_f) is learned with a supervised learning algorithm in order to approximate g(x_g) based on x_f:

$$f: x_f \to y, \quad f \in F, \quad x_f \subseteq x \tag{2}$$

$$f^* = \arg\min_{f \in F} \mathbb{E}\left[l(f(x_f), g(x_g))\right] \tag{3}$$

where l(f(x_f), g(x_g)) is a suitable loss function that is to be minimized. The system can either choose, or be forced, to flip the switch θ, so that λ is either set to g(x_g) or to f(x_f) for use in control. Future work may also include fusing the two, but in this article we focus on using the complementary cue in a stand-alone fashion. It must be noted that while both x_g ⊆ x and x_f ⊆ x, in general it may be that x_f does not contain all necessary information to predict y. In addition, even if x_g = x_f, it is possible that F does not contain a function f that perfectly models g. The information in x_f and the function space F may not allow for a perfect estimate of g(x_g). On the other hand, there may be an f(x_f) that handles certain situations better than g(x_g) (think of landing site selection from hover, as in [10]). In any case, fundamental differences between g(x_g) and f(x_f) are to be expected, which may significantly influence the behavior when switching θ. Handling these differences is of central interest in this article.
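A minimal sketch of one time step of this scheme, in our own illustrative Python (g, f, and the update rule are placeholders for the trusted cue, the learned estimator, and its supervised training step):

```python
def persistent_ssl_step(x_g, x_f, g, f, train, use_trusted_cue):
    """One control step of the persistent SSL principle of Figure 1.
    The trusted cue g(x_g) labels its own training data for f; the
    switch theta decides which estimate lambda is used for control."""
    y_true = g(x_g)         # trusted cue, e.g. mean stereo disparity
    train(x_f, y_true)      # supervised update of f from the self-labels
    return y_true if use_trusted_cue else f(x_f)  # the switch theta
```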

B. Stereo-to-mono proof of concept

Figure 2 presents a block diagram of the proposed proof of concept system, in order to visualize how the persistent SSL method is employed in our application: estimating monocular depth on a flying robot. Input is provided by a stereo vision camera, with either the left or the right camera image routed to the monocular estimator. We use a Visual Bag of Words (VBoW) method for this estimator (see Subsection III-D). The ground truth for persistent SSL in this context is provided by the output of a stereo vision algorithm. In this case, the average value of the disparity map is used, both for training the monocular estimator and as an input to the switch θ. Based on the switch, the system delivers either the monocular or the stereo vision average disparity to the behavior controller.

C. Stereo vision processing

The stereo camera delivers a synchronized gray-scale stereo-pair image per time sample. A stereo vision algorithm first computes a disparity map, but often this is a far from perfect process. Especially in the context of an MAV's size, weight, and computational constraints, errors caused by imperfect stereo calibration, resolution limits, etc., can cause large pixel errors in the results. Moreover, learning to estimate a dense disparity map, even when based on a high quality and consistent data set, is already very challenging. Since we use the stereo result as ground truth for learning, we minimize the error by averaging the disparity map to a single scalar. A single scalar is much easier to learn than a full depth map, and has been demonstrated to provide elementary obstacle avoidance capability [27], [23], [28].

The disparity λ relates to the depth d of the input image:

$$d \propto \frac{1}{\lambda} \tag{4}$$

Using averaged disparity instead of averaged depth fits the obstacle avoidance application better, because small but close-by objects are emphasized due to the non-linear relation of Eq. 4. However, linear learning methods may have difficulty mapping this relation. In our final design, we thus choose to learn the disparity with a non-parametric approach, which is resilient to nonlinearities.

D. Monocular disparity estimation

The monocular disparity estimator forms a function from the image's pixel values to the average disparity in the image. Since the main goal of this article is to study SSL on board a drone in real time, the efficiency of both the learning and the execution of this function is at a premium. Hence, we converged on a computationally extremely efficient Visual Bag of Words (VBoW) approach for the robotic experiments. We have also explored a deep neural network approach, but the hardware and time available for learning did not allow for having the deep neural learning on board the drone at this stage.

The VBoW method uses small image patches of w×h pixels, as successfully introduced in [29] for a complex texture classification problem. First, a dictionary is created by clustering the image patches with Kohonen clustering (as in [28]). The cluster centroids are called 'textons'. After formation of the dictionary, when an image is received, m patches are extracted from the W×H pixel image. Each patch is compared to the dictionary in order to form a texton occurrence histogram for the image. The histogram is normalized to sum to 1. Then, each normalized histogram is supplemented with its Shannon entropy, resulting in a feature vector of size n + 1. The idea behind adding the entropy is that the variation of textures in an image decreases when the camera gets closer to obstacles [28]. To illustrate the change in texton histograms when approaching an obstacle, a time series of texton histograms can be seen in Figure 3. Please note how the entropy of the distribution indeed decreases over time, and that especially the fourth bin is much higher when close to the poster on the wall.
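A minimal sketch of the texton-histogram feature extraction described above (our own illustration; the on-board implementation additionally combines intensity and gradient textons, see Section IV):

```python
import numpy as np

def vbow_features(img, textons, n_patches=500, patch=5, rng=None):
    """Sample patches, assign each to the nearest texton (sum of squared
    differences), normalize the histogram, and append its Shannon
    entropy, yielding a feature vector of size n + 1."""
    rng = rng or np.random.default_rng()
    h_img, w_img = img.shape
    hist = np.zeros(len(textons))
    for _ in range(n_patches):
        y = rng.integers(0, h_img - patch)
        x = rng.integers(0, w_img - patch)
        p = img[y:y + patch, x:x + patch].astype(float).ravel()
        dists = [np.sum((p - t.astype(float).ravel()) ** 2) for t in textons]
        hist[int(np.argmin(dists))] += 1
    hist /= hist.sum()                          # normalize to sum to 1
    nz = hist[hist > 0]
    entropy = float(-(nz * np.log2(nz)).sum())  # Shannon entropy
    return np.append(hist, entropy)
```

The resulting feature vectors are what the regression algorithm (kNN in the final design) maps to the average disparity.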


Fig. 2. System overview (blocks: stereo processing, monocular estimator, switch (PSSL), filter, behavior routines, robot controller; signals: average disparity, average disparity estimate, safety override, control signal). Please see the text for details.

A machine learning algorithm will have to learn to notice such relationships itself for the robot's environment, by learning a mapping from the feature vector to a disparity. We have investigated different function representations and learning methods to this end (see Section IV).

Fig. 3. Approaching a poster on the wall. Left: monocular input. Middle: overlaid textons, annotated with the color used in the histogram. Right: texton distribution histogram, with the corresponding texton shown beneath it.

E. Behavior

The proposed system uses a straightforward behavior heuristic to explore, navigate, and persistently learn a room. The heuristic is depicted as a Finite State Machine (FSM) in Figure 4. In state 0, the robot flies in the direction of the camera's principal axis. When an obstacle is detected (λ > t), the robot stops and goes to state 1, in which it randomly chooses a new direction for the principal axis. It immediately passes to state 2, in which the robot rotates towards the new direction, reducing the error e between the principal axis' current and desired direction. If in the new direction obstacles are far enough away (λ ≤ t), the robot starts flying forward again (state 0). Else, the robot continues to turn in the same direction as before (clockwise or counter-clockwise) until λ ≤ t. When this is the case, it starts flying straight again (state 0). The FSM detects obstacles by means of a threshold t applied to the average disparity λ. Choosing this rather straightforward behavior heuristic enables autonomous exploration based on only one scalar obtained from a distance sensor. A sketch of this state update is given below the figure.

Fig. 4. The behavior heuristic FSM, with states 0 (straight ahead), 1 (pick random direction), and 2 (rotate). λ is the average disparity, e is the attitude error (meaning the difference between the newly picked direction and the current attitude), and t_λ, t_e the respective thresholds.
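The following sketch (our illustration, not the flight code) captures the state update of Figure 4; continuing to rotate in the same direction is left implicit in the controller:

```python
def fsm_step(state, disparity, attitude_error, t_disp, t_err, pick_direction):
    """One update of the behavior FSM.
    States: 0 = straight ahead, 1 = pick random direction, 2 = rotate."""
    if state == 0:
        return 1 if disparity > t_disp else 0   # obstacle detected?
    if state == 1:
        pick_direction()                        # choose new principal axis
        return 2
    if state == 2:
        if attitude_error > t_err:
            return 2                            # keep rotating toward it
        return 0 if disparity <= t_disp else 2  # fly straight if clear
    return state
```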


F. Performance

The average disparity λ, coming either from stereo vision or from the monocular distance estimation function f(x_f), is thresholded to determine whether to turn. This leads to a binary classification problem, where all samples for which λ > t are considered 'positive' (c = 1) and all samples for which λ ≤ t are considered 'negative' (c = 0). Hence, the quality of f(x_f) can be characterized by a ROC curve. The ground truth for the ROC curve is determined by the stereo vision. This means that a True Positive Ratio (TPR) of 1 and a False Positive Ratio (FPR) of 0 lead to the same obstacle detection performance as with the stereo vision system. Generally, of course, this performance will not be reached, and the robot has to determine what threshold to set for a sufficiently high TPR and low FPR.

This leads to the question how to determine what a 'sufficient' TPR / FPR is. We evaluate this matter in the context of the robot's obstacle avoidance task. In particular, we first look at the probability of a collision with a given TPR, and then at the probability of a spurious turn with a given FPR.

In order to model the probability of a collision, consider a constant velocity approach with input samples (images) $\langle x_1, x_2, \ldots, x_n \rangle$ of $n$ samples long, ending at an obstacle. A minimum of $u$ samples before actual impact, an obstacle must be detected by at least one TP or FP in order to prevent a collision. Since the range of samples $\langle x_{n-u+1}, x_{n-u+2}, \ldots, x_n \rangle$ does not matter for the outcome, we redefine the approach range to be $\langle x_1, x_2, \ldots, x_{n-u} \rangle$. Consider that for each sample $x_i$ it holds that:

$$1 = p(TP \mid x_i) + p(FP \mid x_i) + p(TN \mid x_i) + p(FN \mid x_i) \tag{5}$$

since $p(FN \mid x_i) = p(TP \mid x_i) = 0$ if $x_i$ is a negative, and $p(TN \mid x_i) = p(FP \mid x_i) = 0$ if $x_i$ is a positive. Let us first assume independent, identically distributed (i.i.d.) data. Then the probability of a collision $p_c$ can be written as:

$$p_c = \prod_{i=1}^{n-u} \left( p(FN \mid x_i) + p(TN \mid x_i) \right) = \prod_{i=1}^{q} p(TN \mid x_i) \prod_{i=q+1}^{n-u} p(FN \mid x_i) \tag{6}$$

where $q$ is a time step separating two phases in the approach. In the first phase, all $x_i$ are negative, so that any false positive will lead to a turn, preventing the collision. Only if all negative samples are correctly classified as negatives (true negatives) will the robot enter the second phase, in which all $x_i$ are positive. Then, only a complete sequence of false negatives will lead to a collision, since any true positive will lead to a timely turn.

We can use Eq. 6 to choose an acceptable $TPR = 1 - FNR$. Assuming a constant velocity and frame rate, it gives us the probability of a collision. For instance, let us assume that the robot flies forward at 0.50 m/s with a frame rate of 30 Hz, has a minimal required detection distance of 1.0 m, and positives are defined to be closer than 1.5 m. This leads to $s = 30$ samples

that all have to be classified as negatives. In the case of i.i.d. data, if the TPR = 0.95, the probability of a collision is p_c = (1 − TPR)^s = 0.05^30 ≈ 9.31 · 10^−40, an extremely small value. With this analysis, even a TPR = 0.30 leads to an acceptable p_c ≈ 2.25 · 10^−5.

The analysis of the effect of false positives is straightforward, as it can be expressed in the number of spurious turns per second or, equivalently, if assuming a constant velocity, per meter travelled. With the same scenario as above, an FPR = 0.05 will on average lead to 3 spurious turns per travelled meter, which is unacceptably high. An FPR = 0.0017 will approximately lead to 1 spurious turn per 10 meters.

The above analysis seems to indicate that quite many false negatives are acceptable, while there can only be very few false positives. However, there are two complicating factors. The first factor is that Eq. 6 only holds when X can be assumed identically and independently distributed (i.i.d.), which is unlikely due to the nature of the consecutive samples of an approach towards an obstacle. Some reasoning about the nature of the dependence is, however, possible. Assuming a strong correlation between consecutive samples results in a higher probability of x_i being classified the same as x_(i+1). In other words, if a sample in the range ⟨x_1, ..., x_m⟩ is an FN, the chance that more samples are FNs increases. Hence, the expected dependencies negatively impact the performance of the system, making Eq. 6 a best-case scenario.

The system can be more realistically modelled as a Markov process, as depicted in Figure 5. From this, it can be seen that the system can be split into a reducible Markov process with an absorbing avoid state, and a chain of states that leads to the absorbing collision state. The values of the transition matrix Ω can be determined from the data gathered during operation. This would allow the robot to better predict the consequences of a chosen TPR and FPR.

As an illustration of the effects of sample dependence, let us suppose a model in which each classification has a probability of being identical to the previous classification, p(I(c_{i−1}, c_i)). If not identical, the sample is classified independently. This dependency model allows us to calculate the transition Ω_45 in Figure 5. Given a previous negative classification, the transition probability to another negative classification is Ω_45 = p(I(c_{i−1}, c_i)) + (1 − p(I(c_{i−1}, c_i)))(1 − TPR). If p(I(c_{i−1}, c_i)) = 0.8 and TPR = 0.95 as above, Ω_45 = 0.81. The probability of a collision in such a model is p_c = Ω_45^(s−1) ≈ 1.8 · 10^−3, no longer an inconceivably small number.

This leads us to the second complicating factor, which is specific to our SSL setup. Since the robot operates on the basis of the ground truth, it should theoretically hardly ever encounter positive samples. Namely, the robot should turn when it detects a positive sample. This implies that the uncertainty on the estimated TPR is rather high, while the FPR can be estimated better. A potential solution to this problem is to purposefully have the mono-estimation robot turn earlier than the stereo vision based one.


[Figure 5 diagram: Markov chain with an absorbing 'avoid' state 0 (reached from TP and FP states), 'no obstacle' states with TN classifications, and a chain of 'obstacle' FN states (FN, 2×FN, ...) leading to the absorbing 'crash' state; transition probabilities Ω_ij]

Fig. 5. Markov model of the probability of a collision.

G. Similarity with Learning from Demonstration

The core of SSL is a supervised algorithm that learns the function f(x_f) on the basis of supervised outputs g(x_g). Normally, supervised learning assumes that the training data is drawn from the same data probability distribution D as the test data. However, in persistent SSL this assumption generally does not hold. The problem is that by using control based on f, the robot follows a control policy π_f ≠ π_g, and hence will induce a different state distribution, D_{π_f} ≠ D_{π_g}. On these different states, no supervised outputs have been observed yet, which typically implies an increasing difference between f and g.

A similar problem of inducing a different state distribution is well-known in the area of learning from demonstration [2], [3]. Actually, we will show that under some mild assumptions, the persistent SSL problem studied in this paper is equivalent to a learning from demonstration problem. Hence, we can draw on solutions in learning from demonstration, such as DAgger [2].

The goal of learning from demonstration is to find a policy π that minimizes a loss function l under its induced distribution of states, from [2]:

\pi = \operatorname{argmin}_{\pi \in \Pi} \; E_{s \sim D_\pi}[l(s, \pi)]    (7)

where an optimal teacher policy π* is available to provide training data for specific states s.

At first sight, SSL is quite different, as it focuses only on the state information that serves as input to the policy. Instead of optimizing a policy, the supervised learning in persistent SSL can be defined as finding the function f that best matches the trusted function g under the distribution of states induced by the use of the thresholded version of f for control:

\operatorname{argmin}_{f \in F} \; E_{x \sim D_{\pi_f}}[l(f(x), g(x))]    (8)

meaning that we perform regression of f on states that are induced by the control policy π_f, which uses the thresholded version of f. To see the similarity to Eq. 7, first realize that the stereo-based policy is in this case the teacher policy: π_g = π*. For this analysis, we simplify the strategy to flying straight when far enough away from an obstacle, and turning otherwise:

\pi_g(s): \; p(\text{straight}\,|\,s=0) = 1, \quad p(\text{turn}\,|\,s=1) = 1    (9)

where s is the state, with s = 1 when g(x) > t_g and s = 0 otherwise. Please note that π_g is a deterministic policy, which is assumed to be optimal. When we learn a function f, it generally will not give exactly the same outputs as g. Using s = [f(x) > t_f] will result in the following stochastic policy:

\pi_f(s): \; p(\text{straight}\,|\,s=0) = \text{TNR}, \quad p(\text{turn}\,|\,s=0) = \text{FPR}, \quad p(\text{turn}\,|\,s=1) = \text{TPR}, \quad p(\text{straight}\,|\,s=1) = \text{FNR}    (10)

a stochastic policy, which by definition is optimal, π_f = π_g, if FPR = FNR = 0. In addition, then D_{π_f} = D_{π_g}. Thus, if we make the assumption that minimizing l(f(x), g(x)) also minimizes FPR and FNR, capturing any preference for one or the other in the cost function for the behavior l(s, π), then minimizing f in Eq. 8 is equivalent to minimizing the loss in Eq. 7.

The interest of the above-mentioned similarity lies in the use of proven techniques from the field of learning from demonstration for training the persistent SSL system. For instance, in DAgger, during the first learning iteration only the expert policy is learned. After this iteration, a mix of the learned and expert policy is used: π_i = π* with p = β_i,


and π_i = π̂_i with p = (1 − β_i), where β_i is reduced over the iterations¹. The policy at iteration i depends on all previously learned data, which is aggregated over time. In [2] it is proven that this strategy gives a favorable limited no-regret loss, which significantly outperforms the traditional learning from demonstration strategy of just learning from the data distribution of the expert policy. In Section V we will show that the main lessons from [2] also apply to persistent SSL.

IV. OFFLINE VISION EXPERIMENTS

In this section, we perform offline vision experiments. The goal of these experiments is to determine how good the proposed VBoW method is at estimating monocular depth, and to determine the best parameter settings.

To measure the performance, we use two main metrics: the mean square error (MSE) and the area under the curve (AUC) of a ROC curve (a minimal sketch of both is given below). MSE is an easy metric that can be directly used as a loss function, but in practice many situations exist in which a low MSE can be achieved while the performance is still inadequate as a basis for reliable MAV behavioral control. The AUC captures the trade-off between TPR and FPR, and hence is a good indication of how good the performance is in terms of obstacle detection.

Fig. 6. Example from dataset 1 (left, 128×96 pixels) and dataset 2 (right, 640×480).

We use two datasets in the experiments. The first dataset is a video made on a drone during an autonomous flight, using the onboard 128×96 pixel stereo camera. The second dataset is a video made by manually walking with a higher quality 640×480 pixel stereo camera through an office cubicle, in a similar fashion as the robot should move in the later online experiments. The datasets 1 and 2 used in this section are made publicly available for download²³. An example image from each dataset is shown in Figure 6.

Our implementation of the VBoW method has six main parameters, ranging from the number of intensity and gradient textons to the number of samples used to smooth the estimated disparity over time. An exhaustive search of parameters being out of reach, we have performed various investigations of parameter changes along a single dimension.

¹In [2] this mixture was written as π_i = β_i π* + (1 − β_i) π̂_i, hinting at mixed control. However, it is described as a policy that queries the expert to choose controls a fraction of the time while collecting the next dataset. For this reason, and because it makes more sense given our binary control strategy, we treat this policy as a probabilistic mixture in this article.

²Dataset 1 can be downloaded and viewed from http://1drv.ms/1NOhIOh
³Dataset 2 can be downloaded from http://1drv.ms/1NOhLtA

Table I presents a list of the final tuned parameter values. Please note that these parameter values have not only been optimized for performance. Whenever performance differences were marginal, we have chosen the parameter values that saved on computational effort. This choice was guided by our goal to perform the learning on board a computationally limited drone. Below, we show a few of the results when varying a single parameter, deviating from the settings in Table I in the corresponding dimension.

Fig. 7. Used texton dictionaries. Left for camera 1, right for camera 2.

In this work, two types of textons are combined to form a single texton histogram: normal intensity textons, as obtained with Kohonen clustering as in [28], and gradient textons, obtained similarly but based upon the gradient of the images. Gradient textures have been shown in [30] to be an important depth cue. An example dictionary of each is depicted in Figure 7. Gradient textons are shown with a color range (from blue = low to red = high). The intensity textons in Figure 7 are based on grayscale intensity pixel values.

Figure 8 shows the results for different numbers of textons ∈ {4, 8, 12, 16, 20}, always consisting of half pixel intensity and half gradient textons. From the results we can see that the performance saturates around 20 textons. Hence, we selected this combination of 10 intensity and 10 gradient textons for the experiments.

The VBoW method involves choosing a regression algorithm. In order to determine the best learning algorithm, we have tested four regression algorithms, limiting the choice mainly based on the feasibility of implementing the regression algorithm on board a constrained embedded system. We have tested two non-parametric (kNN and Gaussian process regression) and two parametric (linear and shallow neural network regression) algorithms. Figure 9 presents the learning curves for a comparison of these regressors. Clearly, in most cases the kNN regression comes out best. A naive implementation of kNN suffers from having a larger training set in terms of CPU usage during test time, but after implementation on the drone this did not become a bottleneck.

The final offline results on the two datasets are quite satisfactory. They can be viewed online⁴⁵. After a training set of roughly 4000 samples, the kNN approximates the stereo vision based disparities in the test set rather well. Given a desired TPR of 0.96, the learner has an FPR of 0.47. This should be sufficient for usage of the estimated disparities in control. A sketch of the VBoW pipeline is given below.

V. SIMULATION EXPERIMENTS

In Section III, we argued that a persistent form of SSL is similar to learning from demonstration. The relevance of this

⁴A video of VBoW visualizations on dataset 1 is available here: http://1drv.ms/1KdRtC1
⁵A video of VBoW visualizations on dataset 2 is available here: http://1drv.ms/1KdRxlg


[Figure 8 plots: kNN regression learning curves for texton counts ∈ {4, 8, 12, 16, 20}: MSE and AUC versus training set size [frames], for datasets 1 and 2]

Fig. 8. MSE and AUC vs. number of textons. Dashed / solid lines refer to results on train / test set.

TABLE I. PARAMETER SETTINGS

Parameter                     Value
Number of intensity textons   10
Number of gradient textons    10
Patch size                    5×5
Subsampling samples           500
kNN                           k=5
Smooth size                   4

similarity lies in the behavioral schemes used for learning. In this section, we compare three learning schemes in simulation.

A. Setup

We simulate a 'flying' drone with stereo vision camera in SmartUAV [31], an in-house developed simulator that allows for 3D rendering and simulation of the sensors and algorithms used on board the real drone. Figure 10 shows the simulated 'office room'. The room has a size of 10 × 10 meter, and the drone has an average forward speed of 0.5 m/s. All the vision and learning algorithms are exactly the same as the ones that run on board of the drone in the real experiments.

Three learning schemes are designed as follows. They all start with an initial learning period in which the drone is controlled purely by means of stereo vision. In the first learning scheme, the drone will continue to fly based on stereo vision for

Fig. 10. SmartUAV simulation environment.

the remainder of the learning time. After learning, the drone immediately switches to monocular vision. For this reason, the first scheme is referred to as 'cold turkey'. In the second learning scheme, the drone will perform a stochastic policy, selecting the stereo vision based actions with a probability β_i and monocular based actions with a probability (1 − β_i), as was proposed in the original DAgger article [2]. In the experiments, β_i = 0.25. Finally, in the third learning scheme, the drone will perform monocular based actions, with stereo vision only


[Figure 9 plots: learning curves (MSE and AUC versus training set size [frames]) for kNN (k=5), linear, GP, and neural net regression, on datasets 1 and 2]

Fig. 9. VBoW regression algorithms learning curves. Dashed / solid lines refer to results on train / test set.

used to override these actions when the drone gets too close to an obstacle. Therefore, we refer to this scheme as 'training wheels'. A code sketch of the three schemes' action selection follows this paragraph.

After learning, the drone will use its monocular disparity estimates for control. The stereo vision remains active, only for overriding the control if the drone gets too close to a wall. During this testing period, we register the number of turns and the number of overrides. The number of overrides is a measure of the number of potential collisions. The number of turns is compared to the number of turns performed when using stereo vision, to evaluate the number of spurious turns. The initial learning period is 1 minute, the remaining learning period is 4 minutes, and the test time is 5 minutes. These times have been selected to allow a full experiment on a single battery of the real drone (see Section VI).

B. Results

Table II contains the results of 30 experiments with the three learning schemes and a purely stereo-vision-controlled drone. The first observation is that 'cold turkey' gives the worst results. This result was to be expected on the basis of the similarity between persistent SSL and learning from demonstration: the learned monocular distance estimates do not generalize well to the test distribution when the monocular vision is

TABLE II. TEST RESULTS FOR THE THREE LEARNING SCHEMES. THE AVERAGE AND STANDARD DEVIATION ARE GIVEN FOR THE NUMBER OF OVERRIDES AND TURNS DURING THE TESTING PERIOD. A LOWER NUMBER OF OVERRIDES IS BETTER. IN THE TABLE, THE BEST RESULTS ARE SHOWN IN BOLD.

Method               Overrides        Turns
Pure stereo          N/A              45.6 (σ = 3.0)
1. Cold turkey       25.1 (σ = 8.2)   42.8 (σ = 3.7)
2. DAgger            10.7 (σ = 5.3)   41.4 (σ = 3.2)
3. Training wheels   4.3 (σ = 2.6)    40.4 (σ = 2.6)

in control. The originally proposed DAgger scheme performs better, while the third learning scheme, termed 'training wheels', seems most effective. The third scheme has the lowest number of overrides of all learning schemes, with a similar total number of turns as a pure stereo vision run. The intuition behind this method being best is that it allows the drone to best learn from samples when the drone is beyond the normal stereo vision turning threshold. The original DAgger scheme has a larger probability to turn earlier, exploring these samples to a lesser extent. Double-sided statistical bootstrap tests [32] indicate that all differences between the learning methods are significant, with p < 0.05.

The differences between the learning schemes are well illustrated by the positions the drone visits in the room during the test phase. Figure 11 contains 'heat maps' that show the drone


Fig. 11. Simulation heatmaps, from left to right: cold turkey, DAgger, training wheels, stereo only. Top images are turn locations, lower images are the approaches.

Fig. 12. The used multicopter.

positions during turning (top row) and during straight flight (bottom row). The position distribution has been obtained by binning the positions during the test phase of all 30 runs. The results for each scheme are shown per column in Figure 11. Right is the pure stereo vision scheme, which shows a clear border around the straight flight trajectories. It can be observed that this border is best approximated by the 'training wheels' scheme (second from the right).

VI. ROBOTIC EXPERIMENTS

The simulation experiments showed that the 'training wheels' setup resulted in the fewest stereo vision overrides when switching to monocular disparity estimation and control. In this section, we test this online learning setup with a flying robot.

The experiment is set up in the same manner as the simulation. The robot, a Parrot AR drone 2, first explores the room with the help of stereo vision. After 1 minute of learning, the drone switches to using the monocular disparity estimates, with stereo vision running in the background for performing potential safety overrides. In this phase the drone still continues to learn. After learning 4 to 5 minutes, the drone stops learning and enters the test phase. Again, also for the real robot, the main performance measure consists of the number of safety overrides performed by the stereo vision during the testing phase.

The AR drone 2 is, as standard, not equipped with a stereo vision system. Therefore, an in-house-developed 4 gram stereo vision system is used [33], which sends the raw images over USB to the ARDrone2. The grayscale stereo camera has a resolution of 128×96 px and is limited to 10 fps. The ARDrone2 comes with a 1 GHz ARM Cortex A8 processor and 128 MB RAM,

and normally runs the Parrot firmware as an autopilot. For the experiments, we replace this firmware with the open source Paparazzi autopilot software [34], [35]. This allowed us to implement all vision and learning algorithms on board the drone. The length of each test is dependent on the battery, which due to wear has considerable variation, in the range of 8-15 minutes.

The tests are performed in an artificial room that has been constructed within a motion tracking arena. This allows us to track the trajectory of the drone and facilitates post-experiment analysis. The room is approximately 5 × 5 m, as delimited by plywood walls. In order to ensure that the stereo vision algorithm gave reliable results, we added texture in the form of duct tape to the walls. In five tests we had a textured carpet hanging over one of the walls (Figure 13 left, referred to as 'room 1'); in the other five tests it was on the floor (Figure 13 right, referred to as 'room 2').

Fig. 13. Two test flight rooms.

1) Results: Table III shows the summarized results obtained from the monocular test flights. Two main observations can be made from this table. First, the average number of stereo overrides during the test phase is 3, which is very close to the number of overrides in simulation. The monocular behavior also has a similar heat map to simulation. Figure 14 shows a heat map of the drone's position during the approaches and the avoidance maneuvers (the turns). Again, the stereo based flight performs better in the sense that the drone explores the room much more thoroughly, and the turns happen consistently just before an obstacle is detected. On the other hand, especially in room 2, the monocular performance is quite good in the sense that the system is able to explore most of the room.

Second, the selected TPR and FPR are on average 0.47 and 0.11. The TPR is rather low compared to the offline tests. However, this number is heavily influenced by the monocular estimator based behavior. Due to the goal of the robot avoiding obstacles slightly before the stereo ground truth recognizes them as positives, positives should hardly occur at all. Only in cases of FNs, where the estimator is slower or wrong, will positives be registered by the ground truth. Similarly, the FPR is also lower in the context of the estimate based behavior.

ROC curves of the 10 flights are shown in Figure 16. A comparison based on the numbers between the first 5 flights (room 1) and the last 5 flights (room 2) does not show any significant differences, leading to the suggestion that the system is able to learn both rooms equally well. However, when comparing the heat maps of the two situations in the monocular system in Figure 14 and Figure 15, it seems that the system shows slightly different behavior. The monocular system appears to explore room 2 better, getting closer to


TABLE III. TEST FLIGHT SUMMARY

                       |            Room 1             |            Room 2             |
Description            | 1     2     3     4     5     | 6     7     8     9     10    | Avg
Stereo flight time m:ss| 6:48  7:53  2:13  3:30  4:45  | 4:39  4:56  5:12  4:58  5:01  | 4:59
Mono flight time m:ss  | 3:44  8:17  6:45  7:25  4:54  | 10:07 4:46  9:51  5:23  5:12  | 6:39
Mean Square Error      | 0.7   1.96  1.12  0.95  0.83  | 0.95  0.87  1.32  1.16  1.06  | 1.09
False Positive Rate    | 0.16  0.18  0.13  0.11  0.11  | 0.08  0.13  0.08  0.1   0.08  | 0.11
True Positive Rate     | 0.9   0.44  0.57  0.38  0.38  | 0.4   0.35  0.35  0.6   0.39  | 0.47
Stereo approaches      | 29    31    8     14    19    | 22    22    19    20    21    | 20.5
Mono approaches        | 10    21    20    25    14    | 33    15    28    18    15    | 19.9
Auto-overrides         | 0     6     2     2     1     | 5     2     7     3     2     | 3
Overrides ratio        | 0     0.72  0.3   0.27  0.2   | 0.49  0.42  0.71  0.56  0.38  | 0.41

copying the behavior of the stereo based system.

Fig. 14. Room 1 (plain texture) position heat map. Top row is the binned position during the avoidance turns, bottom row during the obstacle approaches; left column during stereo ground truth based operation, right column during learned monocular operation.

The experimental setup with the room in the motion tracking arena allows for a more in-depth analysis of the performance of both stereo and monocular vision. Figure 17 shows the spatial view of the flight trajectory of test 10⁶. The flight is segmented into approaches and turns, which are numbered accordingly in these figures. The color scale in Figure 17A is created by calculating the theoretically visible closest wall, based on the tracking system's measured heading and position of the drone, the known position of the walls, and the FOV angle of the camera. It is clearly visible that the stereo ground truth in Figure 17B does not capture this theoretical disparity perfectly. Especially in the middle of the room, the disparity remains high compared to the theoretical ground truth, due to noise in the stereo disparity map. The results of the monocular estimator in Figure 17C show another decrease in quality compared to the stereo ground truth.

⁶Onboard video data of flight 10 can be viewed at http://1drv.ms/1KC81PN, an external video at https://youtu.be/lC_vj-QNy1I, and a VBoW visualization video at http://1drv.ms/1fzvicH

Fig. 15. Room 2 (carpet, natural texture) position heat map.

[Figure 16 plot: ROC curves (TPR versus FPR) of flights 1-10]

Fig. 16. ROC curves of the 10 test flights. Dashed / solid lines refer to results on room 1 / 2.


Fig. 17. Flight path of test 10 in room 2. Monocular flight starts from approach 23. The meaning of the color of the flight path differs per image: A, the approximated disparity based on the external tracking system; B, the measured stereo average disparity; C, the monocular estimated disparity; D, the error between B & C, with dark blue meaning zero error; E, error during FP; F, error during FN. E and F only show the monocular part of the flight.

VII. DISCUSSION

We start the discussion with an interpretation of the results from the simulation and real-world experiments, after which we proceed by discussing persistent SSL in general and provide a comparison to other machine learning techniques.

A. Interpretation of the results

Using persistent SSL, we were able to autonomously navigate our multicopter on the basis of a stereo vision camera, while training a monocular estimator on board and online. Although the monocular estimator allows the drone to continue flying and avoiding obstacles, the performance during the approximately ten-minute flights is not perfect. During monocular flight, a fairly limited number of (autonomous) stereo overrides was needed, while at the same time the robot was not fully exploring the room as when using stereo.

Several improvements can be suggested. First, we can simply have the drone learn for a longer time, accumulating training data over multiple flights. In an extended offline test, our VBoW method shows saturation at around 6000 samples. Using additional features and more advanced learning methods

may result in improved performance if training set sizes increase.

During our tests in different environments, it proved unnecessary to tune the VBoW learning algorithm parameters to a new environment, as similar performance was obtained. The learned results on the robot itself may or may not generalize to different environments; however, this is of less concern, as the robot can detect a new environment and then decide to continue the learning process if the original cue is still available.

In order to detect an inadequacy of the learned regression function, the robot can occasionally check the estimation error against the stereo ground truth. In fact, our system already does so autonomously, using its safety override. Methods for checking the performance without using the ground truth, e.g. by employing a learner that gives an estimate of uncertainty, are left for future work.

B. Deep learning

At the time of our robotic experiments, implementing state-of-the-art deep learning methods on board a flying drone was deemed infeasible due to hardware restrictions. However, in Appendix A we investigate using downsized networks in order


to show that the persistent SSL method can work with deep learning as well. One of the major advantages of persistent SSL is the unprecedented amount of available training data. This amount of data will be more useful to more complex learning methods, such as deep learning methods, than to less complex but computationally efficient methods, such as the VBoW method used in our experiments. Today, with the availability of strongly improved hardware such as the NVidia Jetson TX1, close-to state-of-the-art models can be trained and run on board a drone, which may significantly improve the learning results.

C. Persistent SSL in relation to other machine learning techniques

In order to place persistent SSL in the general framework of machine learning, we compare it with several techniques. An overview of this comparison is presented in Figure 18.

[Figure 18 diagram: machine learning techniques arranged along the axes 'importance of supervision' and 'importance of behavior': unsupervised and supervised learning (classical machine learning); SSL and persistent SSL; learning from demonstration and reinforcement learning (classical control policy learning)]

Fig. 18. Lay of the machine learning land.

Un-/semi-/supervised learning. Unsupervised learning does not require labeled data, semi-supervised learning requires only an initial set of labeled data [36], and supervised learning requires all the data to be labeled. Internally, persistent SSL uses a standard supervised learning scheme, which greatly facilitates and speeds up learning. The typical downside of supervised learning, acquiring the labels, does not apply to SSL, since the robot continuously provides these labels itself. A major difference between the typical use of supervised learning and its use in persistent SSL is that the data for learning is generally assumed to be i.i.d. However, persistent SSL controls a behavioral component, which in turn affects the dataset obtained both during training and during test time. Operation based on ground truth induces a certain behavior that differs significantly from the behavior induced by a trained estimator, even more so for an undertrained estimator.

SSL. The persistent form of SSL is set apart in the figure from 'normal' SSL, because the persistence property introduces a much more significant behavioral component to the learning. While normal SSL expects the trusted cue to remain available, persistent SSL assumes that the robot may sometimes act in

the absence of the trusted cue. This introduces the feedback-induced data bias problem, which, as we have seen, requires specific behavior strategies for best learning the robot's task.

Learning from demonstration. Learning from Demonstration (LfD) is a close relative to persistent SSL. Consider for instance teleoperation, an LfD scheme in which a (human or robot) teacher remotely operates a robot in order for it to learn demonstrated actions in its environment [17]. This can be compared to persistent SSL if we consider the teacher to be the ground truth function g(x_g) in the persistent SSL scheme. In most cases described in the literature, the teacher shows actions from a control policy, taken on the basis of a state, instead of just the results from a sensory cue (i.e. the state). However, LfD does contain exceptions, in which the learner only records the states during demonstration, e.g. when drawing a map through a 2D representation of the world in the case of a path planning mission in an outdoor robot [37]. Like persistent SSL, test time decisions taken in LfD schemes influence future observations, which may or may not be contained in known, demonstrated territory. However, one key difference between LfD and persistent SSL arguably sets them apart: all LfD theory known to the authors implicitly assumes that the teacher is never the same entity as the learner. It may be that all relevant sensors are on the learner, and even that the learner's body is used to execute teacher commands (like in teleoperation), but the teacher's intelligence is always an external entity.

Reinforcement learning. Lastly, we compare persistent SSL with Reinforcement Learning (RL), which is a distinctively different technique [38]. In RL, a policy is learned using a reward function. Due to the evaluative feedback provided in RL, defining a good reward function is one of the fundamental difficulties of RL, known as reward shaping [39], [38]. Since persistent SSL uses supervised feedback, reward shaping is less of an issue in persistent SSL, only requiring a choice of a loss function between g(x_g) and f(x_f). Secondly, the initial exploration phase of RL often involves a lot of trial-and-error, making it a dangerous time in which a physical system may crash and be damaged. Although this particular problem is often solved by better initialization, e.g. by using LfD or by using policy search instead of value function based approaches, persistent SSL does not require an untrained initialization phase at all, as a reliable ground truth function guarantees a certain minimal correct behavior.

Persistent SSL differs from other learning techniques in the sense that no complete training data set is needed to train the algorithm beforehand. Instead, it requires a ground truth g(x_g), which must be available online in real-time while training f(x_f), but can be switched off when f(x_f) is learned to satisfaction. This implies that learning needs to be persistent, and that the switch θ must be included in the model. Note that in cases where the environment of the robot may change, measures can be put in place to detect the output uncertainty of f(x_f). If the uncertainty goes up, the robot can switch back to using the ground truth function, and learning can then be activated again. Developing such measures is, however, left for future work.


D. Feedback-induced data bias

The robot induces how its environment is perceived, meaning it influences the acquired training samples based on its behavior. The problems arising from this feedback-induced data bias are known from other machine learning disciplines, such as RL and LfD [38]. In particular, Ross et al. have proposed DAgger [2] to solve a similar problem in the LfD domain, which iteratively aggregates the dataset with induced training samples and the expert's reaction to them. However, in the case of LfD, obtaining the induced training samples requires a careful and often additional setup, while in persistent SSL this functionality is inherently available. Secondly, the performance of the LfD expert (i.e. in many cases a human) is not easy to control, often reacting too late or too early. The control policy of the persistent SSL ground truth override system can, on the other hand, be very deterministic. In the case of a DAgger application with drones flying through a forest [3], it proved infeasible to reliably sample the expert in an online fashion. Acquired videos had to be processed offline by the expert, hence the need for (offline ↔ online) iterations. Moreover, an additional online human safety override interface was still necessary to prevent damage to the drone while learning. Thirdly, due to the cost of (and need for) iterative demonstration sessions, the emphasis of DAgger is on converging fast, needing as few expert sessions as possible. In persistent SSL there are no costs for using the teacher signals coming from the original sensor cue. With persistent SSL, we can directly focus on effectively using the available amount of training samples, instead of minimizing the number of iterations as in DAgger.

Another reason why persistent SSL handles the induced training sample issue better than other state-of-the-art robot learning methods is that in persistent SSL, part of the learning problem itself can easily be separated from the behavior and tested, i.e. in a traditional supervised learning setting. In our proof of concept, this has allowed us to test the learning algorithms and thoroughly investigate their limits before deployment, which helped us to identify when the behavioral influence was to blame for bad results.

VIII. CONCLUSION

We have investigated a novel Self-Supervised Learning scheme, in which the supervisory signal is switched off after an initial learning period. In particular, we have studied an instance of such persistent SSL for the task of obstacle avoidance, in which the robot uses trusted stereo vision distance estimates in order to learn appearance-based monocular distance estimation. We have shown that this particular setup is very similar to learning from demonstration. This similarity has been corroborated by experiments in simulation, which showed that the worst learning strategy is to make a hard switch from stereo vision flight to mono vision flight. It is best to have the robot fly based on mono vision, using stereo vision only as 'training wheels' to take over when the robot would otherwise collide with an obstacle. The real-world robot experiments show the feasibility of the approach, giving acceptable results already with just 4-5 minutes of learning.

The findings also indicate interesting future avenues of investigation. First, and perhaps most importantly, in the 4-5 minutes of the real-world experiments, the robot already experiences roughly 7000-9000 supervised learning samples. It is clear that longer learning times can lead to very large supervised data sets, which are suitable for deep learning approaches. Such approaches likely allow the learning to extend to much larger, more varied environments. In addition, they could allow the learning to improve the resolution of disparity estimates from a single value to a full image-size disparity map. Second, in the current experiments the robot stayed in a single environment. We mentioned that a different environment can make the learned mapping invalid, and that this can be detected by means of the ground truth. Another avenue, as studied in [10], is to use a machine learning method with an associated uncertainty value. For instance, one could use a learning method such as a Gaussian process. This can help with a further integration of the behavior with learning, for instance by tuning the forward velocity based on the certainty. These avenues together could allow persistent SSL to reach its full potential, significantly enhancing the robustness of robots operating in real-world environments.

APPENDIX A: DEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep convolutional neural network (CNN) on board a drone to test our proposed learning scheme. We scaled the CNN architecture so that implementing this network on the AR drone 2 hardware remained feasible. Due to the CPU restrictions of our target system, we chose to use a relatively small CNN, inspired by the Cifar10 CNN example used in Caffe. Our layers are as follows. Input: 128×128 pixel image (input images are scaled). First hidden layer: 5×5 convolution, max pooling to 3×3, 32 kernels, contrast normalization. Second hidden layer: 5×5 convolution, average pooling to 3×3, 32 kernels, contrast normalization. Third hidden layer: 5×5 convolution, average pooling to 3×3, 128 kernels. Fourth and fifth layer: fully connected. Lastly, a Euclidean loss output layer to one single output. Furthermore, we used a learning rate of 1e-6, 0.9 momentum, and 10000 iterations. The networks are trained using the open source Caffe framework on a dedicated compute server with an NVidia GTX660. In Figure 19, the learning curves of our best CNN under these constraints versus the best of our VBoW trained algorithms are shown. The figure shows that the CNN is able to learn the problem of monocular depth estimation to a comparable fashion as our VBoW method. For the expected amount of training data during a single flight, the VBoW method delivers better performance for dataset 1 and slightly worse for dataset 2. A sketch of the described architecture is given below.

REFERENCES

[1] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437-1480, 2015.
[2] S. Ross, G. J. Gordon, and J. A. Bagnell, "A reduction of imitation learning and structured prediction," 14th International Conference on Artificial Intelligence and Statistics, vol. 15, 2011.


[Figure 19 plots: MSE and area under ROC curve versus training set size (frames) for VBoW and CNN, train and test, on datasets 1 and 2]

Fig. 19. Learning curves on two datasets for CNNs versus VBoW. Dashed / solid lines refer to results on train / test set.

[3] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," 2013 IEEE International Conference on Robotics and Automation, pp. 1765-1772, May 2013.
[4] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al., "Stanley: The robot that won the DARPA Grand Challenge," in The 2005 DARPA Grand Challenge. Springer, 2007, pp. 1-43.
[5] D. Lieb, A. Lookingbill, and S. Thrun, "Adaptive road following using self-supervised learning and reverse optical flow," in Robotics: Science and Systems, 2005, pp. 273-280.
[6] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, "Reverse optical flow for self-supervised adaptive autonomous robot navigation," International Journal of Computer Vision, vol. 74, no. 3, pp. 287-302, 2007.
[7] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120-144, 2009.
[8] U. A. Muller, L. D. Jackel, Y. LeCun, and B. Flepp, "Real-time adaptive off-road vehicle navigation and terrain classification," in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2013, pp. 87410A-87410A.
[9] J. Baleia, P. Santana, and J. Barata, "On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances," Journal of Intelligent & Robotic Systems, pp. 1-20, 2015.
[10] H. Ho, C. De Wagter, B. Remes, and G. de Croon, "Optical flow for self-supervised learning of obstacle appearance," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3098-3104.
[11] T. Mori and S. Scherer, "First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles," Proceedings - IEEE International Conference on Robotics and Automation, pp. 1750-1757, 2013.
[12] J. Engel, J. Sturm, and D. Cremers, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646-1656, 2014.
[13] ——, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646-1656, 2014.
[14] F. van Breugel, K. Morgansen, and M. H. Dickinson, "Monocular distance estimation from optic flow during active landing maneuvers," Bioinspiration & Biomimetics, vol. 9, no. 2, p. 025002, 2014.
[15] G. C. de Croon, "Monocular distance estimation with optical flow maneuvers and efference copies: a stability-based strategy," Bioinspiration & Biomimetics, vol. 11, no. 1, p. 016004, 2016.
[16] G. de Croon, M. Groen, C. De Wagter, B. Remes, R. Ruijsink, and B. van Oudheusden, "Design, aerodynamics and autonomy of the DelFly," Bioinspiration & Biomimetics, vol. 7, no. 2, p. 025003, 2012.
[17] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469-483, 2009.
[18] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.
[19] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D depth reconstruction from a single still image," International Journal of Computer Vision, vol. 76, no. 1, pp. 53-69, Aug. 2007.
[20] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824-840, May 2009.
[21] I. Lenz, M. Gemici, and A. Saxena, "Low-power parallel algorithms for single image based obstacle avoidance in aerial robots," IEEE International Conference on Intelligent Robots and Systems, pp. 772-779, 2012.
[22] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, pp. 1-9, 2014.
[23] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM Press, 2005, pp. 593-600.
[24] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight."
[25] K. Yamauchi, M. Oota, and N. Ishii, "A self-supervised learning system for pattern recognition by sensory integration," Neural Networks, vol. 12, no. 10, pp. 1347-1358, 1999.
[26] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L. E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, "Stanley: The robot that won the DARPA Grand Challenge," Springer Tracts in Advanced Robotics, vol. 36, pp. 1-43, 2007.
[27] C. De Wagter, S. Tijmons, B. Remes, and G. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in IEEE International Conference on Robotics & Automation (ICRA), 2014, pp. 4982-4987.
[28] G. C. H. E. De Croon, E. De Weerdt, C. De Wagter, B. D. W. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529-534, 2012.
[29] M. Varma and A. Zisserman, "Texture classification: Are filter banks necessary?" in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003.
[30] B. Wu, T. L. Ooi, and Z. J. He, "Perceiving distance accurately by a directional process of integrating ground information," Nature, vol. 428, no. 6978, pp. 73-77, 2004.
[31] C. De Wagter and M. Amelink, "Holiday50av technical paper," 2007.
[32] P. R. Cohen, "Empirical methods for artificial intelligence," IEEE Intelligent Systems, no. 6, p. 88, 1996.
[33] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 4982-4987.
[34] B. Remes, D. Hensen, F. Van Tienen, C. De Wagter, E. Van der Horst, and G. De Croon, "Paparazzi: how to make a swarm of Parrot AR drones fly autonomously based on GPS," in IMAV 2013: Proceedings of the International Micro Air Vehicle Conference and Flight Competition, Toulouse, France, 17-20 September 2013, 2013.
[35] P. Brisset, A. Drouin, M. Gorraz, P.-S. Huard, and J. Tyler, "The Paparazzi solution," 2014.
[36] X. Zhu, "Semi-supervised learning literature survey," Sciences-New York, pp. 1-59, 2007.
[37] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, "Boosting structured prediction for imitation learning," Advances in Neural Information Processing Systems (NIPS), p. 54, 2006.
[38] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, pp. 1238-1274, 2013.
[39] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445-451, 2015.


4

Fig 1 The persistent self-supervised learning principle

the sensory inputs xg In classical systems g(xg) provides therequired functionality on its own

g xg rarr y xg sube x (1)

The function f(xf ) is learned with a supervised learningalgorithm in order to approximate g(xg) based on xf

f xf rarr y f isin F xf sube x (2)

f = argminfisinF

E [l(f(xf ) g(xg))] (3)

where l(f(xf ) g(xg)) is a suitable loss function that is tobe minimized The system can either choose or be forcedto switch θ so that λ is either set to g(xg) or f(xf ) foruse in control Future work may also include fusing the twobut in this article we focus on using the complementary cuein a stand-alone fashion It must be noted that while bothxg sube x and xf sube x in general it may be that xf doesnot contain all necessary information to predict y In additioneven if xg = xf it is possible that F does not contain afunction f that perfectly models g The information in xf andthe function space F may not allow for a perfect estimateof g(xg) On the other hand there may be an f(xf ) thathandles certain situations better than g(xg) (think of landingsite selection from hover as in [10]) In any case fundamentaldifferences between g(xg) and f(xf ) are to be expected whichmay significantly influence the behavior when switching θHandling these differences is of central interest in this article

B Stereo-to-mono proof of concept

Figure 2 presents a block diagram of the proposed proof ofconcept system in order to visualize how the persistent SSLmethod is employed in our application estimating monoculardepth in a flying robot Input is provided by a stereo visioncamera with either the left or right camera image routed to themonocular estimator We use a Visual Bag of Words (VBoW)method for this estimator (see Subsection III-D) The groundtruth for persistent SSL in this context is provided by the outputof a stereo vision algorithm In this case the average valueof the disparity map is used both for training the monocularestimator and as an input to the switch θ Based on the switchthe system either delivers the monocular or the stereo visionaverage disparity to the behavior controller

C Stereo vision processingThe stereo camera delivers a synchronized gray-scale stereo-pair image per time sample A stereo vision algorithm firstcomputes a disparity map but often this is a far from perfectprocess Especially in the context of an MAVrsquos size weight andcomputational constraints errors caused by imperfect stereocalibration resolution limits etc can cause large pixel errorsin the results Moreover learning to estimate a dense disparitymap even when this is based on a high quality and consistentdata set is already very challenging Since we use the stereoresult as ground truth for learning we minimize the errorby averaging the disparity map to a single scalar A singlescalar is much easier to learn than a full depth map and hasbeen demonstrated to provide elementary obstacle avoidancecapability [27] [23] [28]The disparity λ relates to the depth d of the input image

d prop 1

λ(4)

Using averaged disparity instead of averaged depth fits theobstacle avoidance application better because small but closeby objects are emphasized due the non-linear relation of Eq 4However linear learning methods may have difficulty mappingthis relation In our final design we thus choose to learn thedisparity with a non-parametric approach which is resilient tononlinearities

D Monocular disparity estimationThe monocular disparity estimator forms a function from theimagersquos pixel values to the average disparity in the imageSince the main goal of the article is to study SSL on board adrone in real-time we have put efficiency of both the learningand execution of this function are at a prime Hence weconverged to a computationally extremely efficient Visual Bagof Words (VBoW) approach for the robotic experiments Wehave also explored a deep neural network approach but thehardware and time available for learning did not allow forhaving the deep neural learning on board the drone at thisstageThe VBoW method uses small image patches of wtimesh pixelsas successfully introduced in [29] for a complex texture clas-sification problem First a dictionary is created by clusteringthe image patches with Kohonen clustering (as in [28]) Then cluster centroids are called lsquotextonsrsquo After formation of thedictionary when an image is received m patches are extractedfrom the W timesH pixel image Each patch is compared to thedictionary in order to form a texton occurrence histogram forthe image The histogram is normalized to sum to 1 Theneach normalized histogram is supplemented with its Shannonentropy resulting in a feature vector of size n + 1 The ideabehind adding the entropy is that the variation of textures in animage decreases when the camera gets closer to obstacles [28]To illustrate the change in texton histograms when approachingan obstacle a time series of texton histograms can be seenin Figure 3 Please note how the entropy of the distributionindeed decreases over time and that especially the fourthbin is much higher when close to the poster on the wall A

5

Monocularestimator

Stereo processing

Switch

Behavior routines

Robot controller

Average disparity

Average disparity estimate

Safety overide

Control signal

Filter

PSSL

Fig 2 System overview Please see the text for details

machine learning algorithm will have to learn to notice suchrelationships itself for the robotrsquos environment by learninga mapping from the feature vector to a disparity We haveinvestigated different function representations and learningmethods to this end (see Section IV)

Fig 3 Approaching a poster on the wall Left monocular input Middleoverlaid textons annotated with the color used in the histogram Right textondistribution histogram with the corresponding texton shown beneath it

E Behavior

The proposed system uses a straightforward behavior heuristicto explore navigate and persistently learn a room The heuristicis depicted as a Finite State Machine (FSM) in Figure 4 Instate 0 the robot flies in the direction of the camerarsquos principalaxis When an obstacle is detected (λ gt t) the robot stops andgoes to state 1 in which it randomly chooses a new directionfor the principal axis It immediately passes to state 2 in whichthe robot rotates towards the new direction reducing the errore between the principal axisrsquo current and desired direction Ifin the new direction obstacles are far enough away (λ le t)the robot starts flying forward again (state 0) Else the robotcontinues to turn in the same direction as before (clockwiseor counter clockwise) until λ le t When this is the case itstarts flying straight again (state 0) The FSM detects obstaclesby means of a threshold t applied to the average disparity λChoosing this rather straightforward behavior heuristic enablesautonomous exploration based on only one scalar obtainedfrom a distance sensor

0 Straight ahead

1 Pick random direction

120582 le t120582

2 Rotate

120582 le t120582 amp e le te

e gt te

120582 gt t120582 amp e le te

120582 gt t120582

Fig 4 The behavior heuristic FSM λ is average disparity e is the attitudeerror (meaning the difference between the newly picked direction and thecurrent attitude) tn the respective thresholds

6

F PerformanceThe average disparity λ coming either from stereo visionor from the monocular distance estimation function f(xf )is thresholded for determining whether to turn This leads toa binary classification problem where all samples for whichλ gt t are considered as lsquopositiversquo (c = 1) and all samplesfor which λ le t are considered as lsquonegativersquo (c = 0)Hence the quality of f(xf ) can be characterized by a ROCcurve The ground truth for the ROC curve is determinedby the stereo vision This means that a True Positive Ratio(TPR) of 1 and False Positive Ratio (FPR) of 0 lead to thesame obstacle detection performance as with the stereo visionsystem Generally of course this performance will not bereached and the robot has to determine what threshold to setfor a sufficiently high TPR and low FPRThis leads to the question how to determine what a lsquosufficientrsquoTPR FPR is We evaluate this matter in the context of therobotrsquos obstacle avoidance task In particular we first look atthe probability of a collision with a given TPR and then at theprobability of a spurious turn with a given FPRIn order to model the probability of a collision con-sider a constant velocity approach with input samples (im-ages) 〈x1 x2 xn〉 of n samples long ending at anobstacle A minimum of u samples before actual im-pact an obstacle must be detected by at least one TPor a FP in order to prevent a collision Since the rangeof samples 〈x(nminusu+1) x(nminusu+2) xn〉 does not matterfor the outcome we redefine the approach range to be〈x1 x2 x(nminusu)〉 Consider that for each sample xi holds

$$1 = p(TP|x_i) + p(FP|x_i) + p(TN|x_i) + p(FN|x_i) \qquad (5)$$

since p(FN|x_i) = p(TP|x_i) = 0 if x_i is a negative, and p(TN|x_i) = p(FP|x_i) = 0 if x_i is a positive. Let us first assume independent, identically distributed (iid) data. Then the probability of a collision p_c can be written as:

$$p_c = \prod_{i=1}^{n-u} \big( p(FN|x_i) + p(TN|x_i) \big) = \prod_{i=1}^{q} p(TN|x_i) \prod_{i=q+1}^{n-u} p(FN|x_i) \qquad (6)$$

where q is a time step separating two phases in the approach. In the first phase, all x_i are negative, so that any false positive will lead to a turn, preventing the collision. Only if all negative samples are correctly classified as negatives (true negatives) will the robot enter the second phase, in which all x_i are positive. Then, only a complete sequence of false negatives will lead to a collision, since any true positive will lead to a timely turn.

We can use Eq. 6 to choose an acceptable TPR = 1 − FNR. Assuming a constant velocity and frame rate, it gives us the probability of a collision. For instance, let us assume that the robot flies forward at 0.50 m/s with a frame rate of 30 Hz, that it has a minimal required detection distance of 1.0 m, and that positives are defined to be closer than 1.5 m. This leads to s = 30 samples

that all have to be classified as negatives. In the case of iid data, if TPR = 0.95, the probability of a collision is p_c = (1 − TPR)^s = 0.05^30 ≈ 9.31 · 10^−40, an extremely small value. With this analysis, even a TPR = 0.30 leads to an acceptable p_c ≈ 2.25 · 10^−5.

The analysis of the effect of false positives is straightforward, as it can be expressed in the number of spurious turns per second, or equivalently, if assuming a constant velocity, per meter travelled. With the same scenario as above, an FPR = 0.05 will on average lead to 3 spurious turns per travelled meter, which is unacceptably high. An FPR = 0.0017 will approximately lead to 1 spurious turn per 10 meters.
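These numbers follow directly from the scenario parameters. A minimal calculation sketch, assuming the stated 0.50 m/s forward speed and 30 Hz frame rate (variable names are ours):

```python
# Collision / spurious-turn estimates under the iid assumption (Eq. 6).
velocity = 0.50          # m/s
frame_rate = 30          # Hz
detect_dist = 1.0        # m, minimal required detection distance
positive_dist = 1.5      # m, samples closer than this are positives

# Number of positive samples that must not ALL be misclassified:
s = int((positive_dist - detect_dist) / velocity * frame_rate)   # = 30

for tpr in (0.95, 0.30):
    p_collision = (1.0 - tpr) ** s
    print(f"TPR={tpr:.2f}: p_c = {p_collision:.3g}")     # 9.31e-40, 2.25e-05

for fpr in (0.05, 0.0017):
    turns_per_meter = fpr * frame_rate / velocity
    print(f"FPR={fpr}: {turns_per_meter:.2f} spurious turns per meter")
```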

The above analysis seems to indicate that quite many false negatives are acceptable, while there can only be very few false positives. However, there are two complicating factors. The first factor is that Eq. 6 only holds when X can be assumed identically and independently distributed (iid), which is unlikely due to the nature of the consecutive samples of an approach towards an obstacle. Some reasoning about the nature of the dependence is, however, possible. Assuming a strong correlation between consecutive samples results in a higher probability of x_i being classified the same as x_(i+1). In other words, if a sample in the range ⟨x_1, ..., x_m⟩ is an FN, the chance that more samples are FNs increases. Hence, the expected dependencies negatively impact the performance of the system, making Eq. 6 a best-case scenario.

The system can be more realistically modelled as a Markov process, as depicted in Figure 5. From this, it can be seen that the system can be split in a reducible Markov process with an absorbing avoid state, and a chain of states that leads to the absorbing collision state. The values of the transition matrix Ω can be determined from the data gathered during operation. This would allow the robot to better predict the consequences of a chosen TPR and FPR.

As an illustration of the effects of sample dependence, let us suppose a model in which each classification has a probability of being identical to the previous classification, p(I(c_{i−1}, c_i)). If not identical, the sample is classified independently. This dependency model allows us to calculate the transition Ω_45 in Figure 5. Given a previous negative classification, the transition probability to another negative classification is Ω_45 = p(I(c_{i−1}, c_i)) + (1 − p(I(c_{i−1}, c_i)))(1 − TPR). If p(I(c_{i−1}, c_i)) = 0.8 and TPR = 0.95 as above, Ω_45 = 0.81. The probability of a collision in such a model is p_c = Ω_45^(s−1) = 1.8 · 10^−3, no longer an inconceivably small number.
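A small numeric sketch of this dependency model (values from the text; whether the exponent is s or s − 1 depends on how the first sample of the chain is counted, which shifts the result slightly):

```python
p_ident = 0.8            # p(I(c_{i-1}, c_i)): chance of repeating the label
tpr = 0.95
s = 30                   # positive samples in the approach, as above

# Transition probability negative -> negative (Omega_45 in Figure 5):
omega_45 = p_ident + (1.0 - p_ident) * (1.0 - tpr)     # = 0.81

# Chain of missed detections; compare with the iid value of ~9.3e-40.
print(omega_45 ** (s - 1))    # ~2.2e-3
print(omega_45 ** s)          # ~1.8e-3, the value reported in the text
```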

This leads us to the second complicating factor, which is specific to our SSL setup. Since the robot operates on the basis of the ground truth, it should theoretically hardly ever encounter positive samples. Namely, the robot should turn when it detects a positive sample. This implies that the uncertainty on the estimated TPR is rather high, while the FPR can be estimated better. A potential solution to this problem is to purposefully have the mono-estimation robot turn earlier than the stereo vision based one.



Fig. 5. Markov model of the probability of a collision.

G. Similarity with Learning from Demonstration

The core of SSL is a supervised algorithm that learns the function f(x_f) on the basis of supervised outputs g(x_g). Normally, supervised learning assumes that the training data is drawn from the same data probability distribution D as the test data. However, in persistent SSL this assumption generally does not hold. The problem is that by using control based on f, the robot follows a control policy π_f ≠ π_g, and hence will induce a different state distribution, D_{π_f} ≠ D_{π_g}. On these different states, no supervised outputs have been observed yet, which typically implies an increasing difference between f and g.

A similar problem of inducing a different state distribution is well-known in the area of learning from demonstration [2], [3]. Actually, we will show that under some mild assumptions, the persistent SSL problem studied in this paper is equivalent to a learning from demonstration problem. Hence, we can draw on solutions in learning from demonstration, such as DAgger [2].

The goal of learning from demonstration is to find a policy π that minimizes a loss function l under its induced distribution of states, from [2]:

$$\pi = \operatorname*{argmin}_{\pi \in \Pi} \; \mathbb{E}_{s \sim D_\pi} \left[ l(s, \pi) \right] \qquad (7)$$

where an optimal teacher policy π* is available to provide training data for specific states s.

At first sight, SSL is quite different, as it focuses only on the state information that serves as input to the policy. Instead of optimizing a policy, the supervised learning in persistent SSL can be defined as finding the function f that best matches the trusted function g under the distribution of states induced by the use of the thresholded version of f for control:

$$\operatorname*{argmin}_{f \in F} \; \mathbb{E}_{x \sim D_{\pi_f}} \left[ l(f(x), g(x)) \right] \qquad (8)$$

meaning that we perform regression of f on states that are induced by the control policy π_f, which uses the thresholded version of f.

To see the similarity to Eq. 7, first realize that the stereo-based policy is in this case the teacher policy: π_g = π*. For this analysis, we simplify the strategy to flying straight when far enough away from an obstacle, and turning otherwise:

$$\pi_g(s): \quad p(\text{straight} \mid s = 0) = 1, \quad p(\text{turn} \mid s = 1) = 1 \qquad (9)$$

where s is the state, with s = 1 when g(x) > t_g and s = 0 otherwise. Please note that π_g is a deterministic policy, which is assumed to be optimal.

When we learn a function f, it generally will not give exactly the same outputs as g. Using s = (f > t_f) will result in the following stochastic policy:

$$\pi_f(s): \quad p(\text{straight} \mid s = 0) = TNR, \quad p(\text{turn} \mid s = 0) = FPR, \quad p(\text{turn} \mid s = 1) = TPR, \quad p(\text{straight} \mid s = 1) = FNR \qquad (10)$$

a stochastic policy, which by definition is optimal (π_f = π_g) if FPR = FNR = 0. In addition, then D_{π_f} = D_{π_g}. Thus, if we make the assumption that minimizing l(f(x), g(x)) also minimizes FPR and FNR, capturing any preference for one or the other in the cost function for the behavior l(s, π), then minimizing f in Eq. 8 is equivalent to minimizing the loss in Eq. 7.

The interest of the above-mentioned similarity lies in the use of proven techniques from the field of learning from demonstration for training the persistent SSL system. For instance, in DAgger, during the first learning iteration only the expert policy is learned. After this iteration, a mix of the learned and expert policy is used: π_i = π* with p = β_i,


and π_i = π̂_i with p = (1 − β_i), where β_i is reduced over the iterations.¹ The policy at iteration i depends on all previously learned data, which is aggregated over time. In [2] it is proven that this strategy gives a favorable limited no-regret loss, which significantly outperforms the traditional learning from demonstration strategy of just learning from the data distribution of the expert policy. In Section V we will show that the main lessons from [2] also apply to persistent SSL.
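A sketch of this probabilistic mixture; expert_policy and learned_policy are hypothetical callables mapping a state to an action (in our setup the expert is the stereo-based controller, and the simulation experiments in Section V use a constant β_i = 0.25):

```python
import random

def mixed_policy_action(state, expert_policy, learned_policy, beta_i):
    # With probability beta_i, take the expert (stereo-based) action;
    # otherwise act on the learned (monocular) policy.
    if random.random() < beta_i:
        return expert_policy(state)    # query the trusted stereo policy
    return learned_policy(state)       # act on the monocular estimate
```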

IV. OFFLINE VISION EXPERIMENTS

In this section, we perform offline vision experiments. The goal of these experiments is to determine how good the proposed VBoW method is at estimating monocular depth, and to determine the best parameter settings.

To measure the performance, we use two main metrics: the mean square error (MSE) and the area under the curve (AUC) of a ROC curve. MSE is an easy metric that can be directly used as a loss function, but in practice many situations exist in which a low MSE can be achieved while still inadequate performance is reached for the basis of reliable MAV behavioral control. The AUC captures the trade-off between TPR and FPR, and hence is a good indication of how good the performance is in terms of obstacle detection.

Fig. 6. Example from dataset 1 (left, 128x96 pixels) and dataset 2 (right, 640x480 pixels).

We use two data sets in the experiments. The first dataset is a video made on a drone during an autonomous flight, using the onboard 128 × 96 pixels stereo camera. The second dataset is a video made by manually walking with a higher quality 640 × 480 pixel stereo camera through an office cubicle, in a similar fashion as the robot should move in the later online experiments. Datasets 1 and 2 used in this section have been made publicly available for download.²³ An example image from each dataset is shown in Figure 6.

Our implementation of the VBoW method has six main parameters, ranging from the number of intensity and gradient textons to the number of samples used to smooth the estimated disparity over time. An exhaustive search of parameters being out of reach, we have performed various investigations of parameter changes along a single dimension.

¹In [2] this mixture was written as π_i = β_i π* + (1 − β_i) π̂_i, hinting at a mixed control. However, it is described as a policy that queries the expert to choose controls a fraction of the time while collecting the next dataset. For this reason, and because it makes more sense given our binary control strategy, we mention this policy as a probabilistic mixture in this article.

²Dataset 1 can be downloaded and viewed from http://1drv.ms/1NOhIOh. ³Dataset 2 can be downloaded from http://1drv.ms/1NOhLtA.

Table I presents a list of the final tuned parameter values. Please note that these parameter values have not only been optimized for performance. Whenever performance differences were marginal, we have chosen the parameter values that saved on computational effort. This choice was guided by our goal to perform the learning on board of a computationally limited drone. Below, we show a few of the results when varying a single parameter, deviating from the settings in Table I in the corresponding dimension.

Fig. 7. Used texton dictionaries. Left: for camera 1; right: for camera 2.

In this work, two types of textons are combined to form a single texton histogram: normal intensity textons, as obtained with Kohonen clustering as in [28], and gradient textons, obtained similarly but based upon the gradient of the images. Gradient textures have been shown in [30] to be an important depth cue. An example dictionary of each is depicted in Figure 7. Gradient textons are shown with a color range (from blue = low to red = high). The intensity textons in Figure 7 are based on grayscale intensity pixel values.

Figure 8 shows the results for different numbers of textons ∈ {4, 8, 12, 16, 20}, always consisting of half pixel intensity and half gradient textons. From the results, we can see that the performance saturates around 20 textons. Hence, we selected this combination of 10 intensity and 10 gradient textons for the experiments.

The VBoW method involves choosing a regression algorithm. In order to determine the best learning algorithm, we have tested four regression algorithms, limiting the choice mainly based on feasibility for implementing the regression algorithm on board a constrained embedded system. We have tested two non-parametric (kNN and Gaussian Process regression) and two parametric (linear and shallow neural network regression) algorithms. Figure 9 presents the learning curves for a comparison of these regressors. Clearly, in most cases the kNN regression comes out best. A naïve implementation of kNN suffers from having a larger training set in terms of CPU usage during test time, but after implementation on the drone this did not become a bottleneck.

The final offline results on the two datasets are quite satisfactory. They can be viewed online.⁴⁵ After a training set of roughly 4000 samples, the kNN approximates the stereo vision based disparities in the test set rather well. Given a desired TPR of 0.96, the learner has an FPR of 0.47. This should be sufficient for usage of the estimated disparities in control.
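As an illustration of this regression step, a minimal sketch using scikit-learn's KNeighborsRegressor as a stand-in for the on-board kNN; the random arrays are placeholders for the real texton-distribution histograms and stereo average disparities:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
n_train, n_textons = 4000, 20                # 10 intensity + 10 gradient textons
X_train = rng.random((n_train, n_textons))   # texton histograms (features)
y_train = rng.random(n_train)                # stereo-based disparities (labels)

knn = KNeighborsRegressor(n_neighbors=5)     # k = 5, as in Table I
knn.fit(X_train, y_train)

x_new = rng.random((1, n_textons))           # histogram of a new mono image
disparity_estimate = knn.predict(x_new)[0]   # monocular disparity estimate
```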

V. SIMULATION EXPERIMENTS

In Section III, we argued that a persistent form of SSL is similar to learning from demonstration. The relevance of this

⁴A video of VBoW visualizations on dataset 1 is available at http://1drv.ms/1KdRtC1.

⁵A video of VBoW visualizations on dataset 2 is available at http://1drv.ms/1KdRxlg.



Fig. 8. MSE and AUC vs. number of textons ∈ {4, 8, 12, 16, 20} for kNN regression on datasets 1 and 2 (x-axis: training set size [frames]). Dashed / solid lines refer to results on the train / test set.

TABLE I. PARAMETER SETTINGS

Parameter                   | Value
----------------------------|------
Number of intensity textons | 10
Number of gradient textons  | 10
Patch size                  | 5x5
Subsampling samples         | 500
kNN                         | k=5
Smooth size                 | 4

similarity lies in the behavioral schemes used for learning. In this section, we compare three learning schemes in simulation.

A. Setup

We simulate a 'flying' drone with a stereo vision camera in SmartUAV [31], an in-house developed simulator that allows for 3D rendering and simulation of the sensors and algorithms used on board the real drone. Figure 10 shows the simulated 'office room'. The room has a size of 10 × 10 meter, and the drone an average forward speed of 0.5 m/s. All the vision and learning algorithms are exactly the same as the ones that run on board of the drone in the real experiments.

Three learning schemes are designed as follows. They all start with an initial learning period in which the drone is controlled purely by means of stereo vision. In the first learning scheme, the drone will continue to fly based on stereo vision for

Fig. 10. SmartUAV simulation environment.

the remainder of the learning time. After learning, the drone immediately switches to monocular vision. For this reason, the first scheme is referred to as 'cold turkey'. In the second learning scheme, the drone will perform a stochastic policy, selecting the stereo vision based actions with a probability β_i, and monocular based actions with a probability (1 − β_i), as was proposed in the original DAgger article [2]. In the experiments, β_i = 0.25. Finally, in the third learning scheme, the drone will perform monocular based actions, with stereo vision only



Fig. 9. Learning curves of the VBoW regression algorithms (kNN with k=5, linear, Gaussian process, and neural net regression) on datasets 1 and 2 (x-axis: training set size [frames]). Dashed / solid lines refer to results on the train / test set.

used to override these actions when the drone gets too close to an obstacle. Therefore, we refer to this scheme as 'training wheels'.

After learning, the drone uses its monocular disparity estimates for control. The stereo vision remains active, but only for overriding the control if the drone gets too close to a wall. During this testing period, we register the number of turns and the number of overrides. The number of overrides is a measure of the number of potential collisions. The number of turns is compared to the number of turns performed when using stereo vision, to evaluate the number of spurious turns. The initial learning period is 1 minute, the remaining learning period is 4 minutes, and the test time is 5 minutes. These times have been selected to allow a full experiment on a single battery of the real drone (see Section VI).
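The action selection of the three schemes during the learning phase can be summarized as follows; stereo_action and mono_action are hypothetical policy callables, and override_threshold mimics the stereo safety check:

```python
import random

def choose_action(scheme, state, stereo_action, mono_action,
                  stereo_disparity, override_threshold, beta=0.25):
    if scheme == "cold_turkey":
        return stereo_action(state)        # stereo-only during learning
    if scheme == "dagger":                 # probabilistic mixture, as in [2]
        if random.random() < beta:
            return stereo_action(state)
        return mono_action(state)
    if scheme == "training_wheels":        # mono control, stereo override
        if stereo_disparity > override_threshold:
            return stereo_action(state)    # too close: stereo takes over
        return mono_action(state)
    raise ValueError(f"unknown scheme: {scheme}")
```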

B. Results

Table II contains the results of 30 experiments with the three learning schemes and a purely stereo-vision-controlled drone. The first observation is that 'cold turkey' gives the worst results. This result was to be expected on the basis of the similarity between persistent SSL and learning from demonstration: the learned monocular distance estimates do not generalize well to the test distribution when the monocular vision is

TABLE II. TEST RESULTS FOR THE THREE LEARNING SCHEMES. THE AVERAGE AND STANDARD DEVIATION ARE GIVEN FOR THE NUMBER OF OVERRIDES AND TURNS DURING THE TESTING PERIOD. A LOWER NUMBER OF OVERRIDES IS BETTER.

Method             | Overrides      | Turns
-------------------|----------------|----------------
Pure stereo        | N/A            | 45.6 (σ = 3.0)
1. Cold turkey     | 25.1 (σ = 8.2) | 42.8 (σ = 3.7)
2. DAgger          | 10.7 (σ = 5.3) | 41.4 (σ = 3.2)
3. Training wheels | 4.3 (σ = 2.6)  | 40.4 (σ = 2.6)

in control. The originally proposed DAgger scheme performs better, while the third learning scheme, termed 'training wheels', seems most effective. The third scheme has the lowest number of overrides of all learning schemes, with a similar total number of turns as a pure stereo vision run. The intuition behind this method being best is that it allows the drone to best learn from samples when the drone is beyond the normal stereo vision turning threshold. The original DAgger scheme has a larger probability to turn earlier, exploring these samples to a lesser extent. Double-sided statistical bootstrap tests [32] indicate that all differences between the learning methods are significant, with p < 0.05.

The differences between the learning schemes are well illustrated by the positions the drone visits in the room during the test phase. Figure 11 contains 'heat maps' that show the drone


Fig. 11. Simulation heat maps, from left to right: cold turkey, DAgger, training wheels, stereo only. Top images are turn locations, lower images are the approaches.

Fig. 12. The used multicopter.

positions during turning (top row) and during straight flight (bottom row). The position distribution has been obtained by binning the positions during the test phase of all 30 runs. The results for each scheme are shown per column in Figure 11. Right is the pure stereo vision scheme, which shows a clear border around the straight flight trajectories. It can be observed that this border is best approximated by the 'training wheels' scheme (second from the right).

VI. ROBOTIC EXPERIMENTS

The simulation experiments showed that the 'training wheels' setup resulted in the fewest stereo vision overrides when switching to monocular disparity estimation and control. In this section, we test this online learning setup with a flying robot.

The experiment is set up in the same manner as the simulation. The robot, a Parrot AR drone 2, first explores the room with the help of stereo vision. After 1 minute of learning, the drone switches to using the monocular disparity estimates, with stereo vision running in the background for performing potential safety overrides. In this phase, the drone still continues to learn. After learning 4 to 5 minutes, the drone stops learning and enters the test phase. Again, also for the real robot, the main performance measure consists of the number of safety overrides performed by the stereo vision during the testing phase.

The AR drone 2 is standard not equipped with a stereo vision system. Therefore, an in-house-developed 4 gram stereo vision system is used [33], which sends the raw images over USB to the ARDrone2. The grayscale stereo camera has a resolution of 128 × 96 px and is limited to 10 fps. The ARDrone2 comes with a 1 GHz ARM Cortex A8 processor and 128 MB RAM,

and normally runs the Parrot firmware as an autopilot. For the experiments, we replace this firmware with the open source Paparazzi autopilot software [34], [35]. This allowed us to implement all vision and learning algorithms on board the drone. The length of each test is dependent on the battery, which due to wear has considerable variation, in the range of 8-15 minutes.

The tests are performed in an artificial room that has been constructed within a motion tracking arena. This allows us to track the trajectory of the drone and facilitates post-experiment analysis. The room is approximately 5 × 5 m, as delimited by plywood walls. In order to ensure that the stereo vision algorithm gave reliable results, we added texture in the form of duct tape to the walls. In five tests, we had a textured carpet hanging over one of the walls (Figure 13 left, referred to as 'room 1'); in the other five tests it was on the floor (Figure 13 right, referred to as 'room 2').

Fig. 13. The two test flight rooms.

1) Results: Table III shows the summarized results obtained from the monocular test flights. Two main observations can be made from this table. First, the average number of stereo overrides during the test phase is 3, which is very close to the number of overrides in simulation. The monocular behavior also has a similar heat map to simulation. Figure 14 shows a heat map of the drone's position during the approaches and the avoidance maneuvers (the turns). Again, the stereo based flight performs better, in the sense that the drone explores the room much more thoroughly and the turns happen consistently just before an obstacle is detected. On the other hand, especially in room 2, the monocular performance is quite good, in the sense that the system is able to explore most of the room.

Second, the selected TPR and FPR are on average 0.47 and 0.11. The TPR is rather low compared to the offline tests. However, this number is heavily influenced by the monocular estimator based behavior. Due to the goal of the robot avoiding obstacles slightly before the stereo ground truth recognizes them as positives, positives should hardly occur at all. Only in cases of FNs, where the estimator is slower or wrong, will positives be registered by the ground truth. Similarly, the FPR is also lower in the context of the estimate based behavior.

ROC curves of the 10 flights are shown in Figure 16. A comparison based on the numbers between the first 5 flights (room 1) and the last 5 flights (room 2) does not show any significant differences, leading to the suggestion that the system is able to learn both rooms equally well. However, when comparing the heat maps of the two situations in the monocular system in Figure 14 and Figure 15, it seems that the system shows slightly different behavior. The monocular system appears to explore room 2 better, getting closer to


TABLE III. TEST FLIGHT SUMMARY (FLIGHTS 1-5 IN ROOM 1, FLIGHTS 6-10 IN ROOM 2)

Description               | 1    | 2    | 3    | 4    | 5    | 6     | 7    | 8    | 9    | 10   | Avg
--------------------------|------|------|------|------|------|-------|------|------|------|------|-----
Stereo flight time [m:ss] | 6:48 | 7:53 | 2:13 | 3:30 | 4:45 | 4:39  | 4:56 | 5:12 | 4:58 | 5:01 | 4:59
Mono flight time [m:ss]   | 3:44 | 8:17 | 6:45 | 7:25 | 4:54 | 10:07 | 4:46 | 9:51 | 5:23 | 5:12 | 6:39
Mean Square Error         | 0.7  | 1.96 | 1.12 | 0.95 | 0.83 | 0.95  | 0.87 | 1.32 | 1.16 | 1.06 | 1.09
False Positive Rate       | 0.16 | 0.18 | 0.13 | 0.11 | 0.11 | 0.08  | 0.13 | 0.08 | 0.1  | 0.08 | 0.11
True Positive Rate        | 0.9  | 0.44 | 0.57 | 0.38 | 0.38 | 0.4   | 0.35 | 0.35 | 0.6  | 0.39 | 0.47
Stereo approaches         | 29   | 31   | 8    | 14   | 19   | 22    | 22   | 19   | 20   | 21   | 20.5
Mono approaches           | 10   | 21   | 20   | 25   | 14   | 33    | 15   | 28   | 18   | 15   | 19.9
Auto-overrides            | 0    | 6    | 2    | 2    | 1    | 5     | 2    | 7    | 3    | 2    | 3
Overrides ratio           | 0    | 0.72 | 0.3  | 0.27 | 0.2  | 0.49  | 0.42 | 0.71 | 0.56 | 0.38 | 0.41

copying the behavior of the stereo based system.

Fig. 14. Room 1 (plain texture) position heat map. Top row is the binned position during the avoidance turns, bottom row during the obstacle approaches; left column during stereo ground truth based operation, right column during learned monocular operation.

The experimental setup with the room in the motion tracking arena allows for a more in-depth analysis of the performance of both stereo and monocular vision. Figure 17 shows the spatial view of the flight trajectory of test 10.⁶ The flight is segmented into approaches and turns, which are numbered accordingly in these figures. The color scale in Figure 17A is created by calculating the theoretically visible closest wall, based on the tracking system's measured heading and position of the drone, the known position of the walls, and the FOV angle of the camera. It is clearly visible that the stereo ground truth in Figure 17B does not capture this theoretical disparity perfectly. Especially in the middle of the room, the disparity remains high compared to the theoretical ground truth, due to noise in the stereo disparity map. The results of the monocular estimator in Figure 17C show another decrease in quality compared to the stereo ground truth.

⁶Onboard video data of flight 10 can be viewed at http://1drv.ms/1KC81PN, an external video at https://youtu.be/lC_vj-QNy1I, and a VBoW visualization video at http://1drv.ms/1fzvicH.

Fig. 15. Room 2 (carpet, natural texture) position heat map.


Fig. 16. ROC curves of the 10 test flights. Dashed / solid lines refer to results on room 1 / 2.


Fig. 17. Flight path of test 10 in room 2. Monocular flight starts from approach 23. The meaning of the color of the flight path differs per image: A: the approximated disparity based on the external tracking system; B: the measured stereo average disparity; C: the monocular estimated disparity; D: the error between B & C, with dark blue meaning zero error; E: error during FP; F: error during FN. E and F only show the monocular part of the flight.

VII. DISCUSSION

We start the discussion with an interpretation of the results from the simulation and real-world experiments, after which we proceed by discussing persistent SSL in general and provide a comparison to other machine learning techniques.

A. Interpretation of the results

Using persistent SSL, we were able to autonomously navigate our multicopter on the basis of a stereo vision camera, while training a monocular estimator on board and online. Although the monocular estimator allows the drone to continue flying and avoiding obstacles, the performance during the approximately ten-minute flights is not perfect. During monocular flight, a fairly limited number of (autonomous) stereo overrides was needed, while at the same time the robot was not fully exploring the room like when using stereo.

Several improvements can be suggested. First, we can simply have the drone learn for a longer time, accumulating training data over multiple flights. In an extended offline test, our VBoW method shows saturation at around 6000 samples. Using additional features and more advanced learning methods

may result in improved performance if training set sizes increase.

During our tests in different environments, it proved unnecessary to tune the VBoW learning algorithm parameters to a new environment, as similar performance was obtained. The learned results on the robot itself may or may not generalize to different environments; however, this is of less concern, as the robot can detect a new environment and then decide to continue the learning process if the original cue is still available.

In order to detect an inadequacy of the learned regression function, the robot can occasionally check the estimation error against the stereo ground truth. In fact, our system already does so autonomously, using its safety override. Methods for checking the performance without using the ground truth, e.g. by employing a learner that gives an estimate of uncertainty, are left for future work.

B. Deep learning

At the time of our robotic experiments, implementing state-of-the-art deep learning methods on board a flying drone was deemed infeasible due to hardware restrictions. However, in Appendix A we investigate using downsized networks, in order


to show that the persistent SSL method can work with deep learning as well. One of the major advantages of persistent SSL is the unprecedented amount of available training data. This amount of data will be more useful to more complex learning methods, such as deep learning methods, than to less complex but computationally efficient methods, such as the VBoW method used in our experiments. Today, with the availability of strongly improved hardware, such as the NVidia Jetson TX1, close-to state-of-the-art models can be trained and run on board a drone, which may significantly improve the learning results.

C. Persistent SSL in relation to other machine learning techniques

In order to place persistent SSL in the general framework of machine learning, we compare it with several techniques. An overview of this comparison is presented in Figure 18.

[Figure 18 is a diagram placing Supervised Learning, Unsupervised Learning, Self-Supervised Learning, Persistent Self-Supervised Learning, Learning from Demonstration, and Reinforcement Learning along two axes: the importance of supervision (classical machine learning) and the importance of behavior (classical control policy learning).]

Fig. 18. Lay of the machine learning land.

Un-/semi-/supervised learning: Unsupervised learning does not require labeled data, semi-supervised learning requires only an initial set of labeled data [36], and supervised learning requires all the data to be labeled. Internally, persistent SSL uses a standard supervised learning scheme, which greatly facilitates and speeds up learning. The typical downside of supervised learning, acquiring the labels, does not apply to SSL, since the robot continuously provides these labels itself. A major difference between the typical use of supervised learning and its use in persistent SSL is that the data for learning is generally assumed to be iid. However, persistent SSL controls a behavioral component, which in turn affects both the dataset obtained during training as well as during test time. Operation based on ground truth induces a certain behavior that differs significantly from behavior induced from a trained estimator, even more so for an undertrained estimator.

SSL: The persistent form of SSL is set apart in the figure from 'normal' SSL, because the persistence property introduces a much more significant behavioral component to the learning. While normal SSL expects the trusted cue to remain available, persistent SSL assumes that the robot may sometimes act in

the absence of the trusted cue. This introduces the feedback-induced data bias problem, which, as we have seen, requires specific behavior strategies for best learning the robot's task.

Learning from demonstration: Learning from Demonstration (LfD) is a close relative to persistent SSL. Consider for instance teleoperation, an LfD scheme in which a (human or robot) teacher remotely operates a robot in order for it to learn demonstrated actions in its environment [17]. This can be compared to persistent SSL if we consider the teacher to be the ground truth function g(x_g) in the persistent SSL scheme. In most cases described in literature, the teacher shows actions from a control policy, taken on the basis of a state, instead of just the results from a sensory cue (i.e. the state). However, LfD does contain exceptions in which the learner only records the states during demonstration, e.g. when drawing a map through a 2D representation of the world in case of a path planning mission in an outdoor robot [37]. Like persistent SSL, test time decisions taken in LfD schemes influence future observations, which may or may not be contained in known, demonstrated territory. However, one key difference between LfD and persistent SSL arguably sets them apart. All LfD theory known to the authors implicitly assumes that the teacher is never the same entity as the learner. It may be that all relevant sensors are on the learner, and even that the learner's body is used to execute teacher commands (like in teleoperation), but the teacher's intelligence is always an external entity.

Reinforcement learning: Lastly, we compare persistent SSL with Reinforcement Learning (RL), which is a distinctively different technique [38]. In RL, a policy is learned using a reward function. Due to the evaluative feedback provided in RL, defining a good reward function is one of the fundamental difficulties of RL, known as reward shaping [39], [38]. Since persistent SSL uses supervised feedback, reward shaping is less of an issue in persistent SSL, only requiring a choice of a loss function between g(x_g) and f(x_f). Secondly, the initial exploration phase of RL often involves a lot of trial-and-error, making it a dangerous time in which a physical system may crash and be damaged. Although this particular problem is often solved by better initialization, e.g. by using LfD or by using policy search instead of value function based approaches, persistent SSL does not require an untrained initialization phase at all, as a reliable ground truth function guarantees a certain minimal correct behavior.

Persistent SSL differs from other learning techniques in the sense that no complete training data set is needed to train the algorithm beforehand. Instead, it requires a ground truth g(x_g), which must be available online in real-time while training f(x_f), but can be switched off when f(x_f) is learned to satisfaction. This implies that learning needs to be persistent, and that the switch θ must be included in the model. Note that in cases where the environment of the robot may change, measures can be put in place to detect the output uncertainty of f(x_f). If the uncertainty goes up, the robot can switch back to using the ground truth function, and learning can then be activated again. Developing such measures is, however, left for future work.


D. Feedback-induced data bias

The robot induces how its environment is perceived, meaning it influences the acquired training samples based on its behavior. The problems arising from this feedback-induced data bias are known from other machine learning disciplines, such as RL and LfD [38]. In particular, Ross et al. have proposed DAgger [2] to solve a similar problem in the LfD domain, which iteratively aggregates the dataset with induced training samples and the expert's reaction to them. However, in the case of LfD, obtaining the induced training samples requires a careful and often additional setup, while in persistent SSL this functionality is inherently available. Secondly, the performance of the LfD expert (i.e., in many cases a human) is not easy to control, often reacting too late or too early. The control policy of the persistent SSL ground truth override system can, on the other hand, be very deterministic. In the case of a DAgger application with drones flying through a forest [3], it proved infeasible to reliably sample the expert in an online fashion. Acquired videos had to be processed offline by the expert, hence the need for (offline ↔ online) iterations. Moreover, an additional online human safety override interface was still necessary to prevent damage to the drone while learning. Thirdly, due to the cost of (and need for) iterative demonstration sessions, the emphasis of DAgger is on converging fast, needing as few expert sessions as possible. In persistent SSL, there are no costs for using the teacher signals coming from the original sensor cue. With persistent SSL, we can directly focus on effectively using the available amount of training samples, instead of minimizing the number of iterations like in DAgger.

Another reason why persistent SSL handles the induced training sample issue better than other state-of-the-art robot learning methods is that in persistent SSL part of the learning problem itself can be easily separated and tested from the behavior, i.e. in a traditional supervised learning setting. In our proof of concept, this has allowed us to test the learning algorithms and thoroughly investigate their limits before deployment, which helped us to identify when the behavioral influence was to blame for bad results.

VIII. CONCLUSION

We have investigated a novel Self-Supervised Learning scheme, in which the supervisory signal is switched off after an initial learning period. In particular, we have studied an instance of such persistent SSL for the task of obstacle avoidance, in which the robot uses trusted stereo vision distance estimates in order to learn appearance-based monocular distance estimation. We have shown that this particular setup is very similar to learning from demonstration. This similarity has been corroborated by experiments in simulation, which showed that the worst learning strategy is to make a hard switch from stereo vision flight to mono vision flight. It is best to have the robot fly based on mono vision, using stereo vision only as 'training wheels' to take over when the robot would otherwise collide with an obstacle. The real-world robot experiments show the feasibility of the approach, giving acceptable results already with just 4-5 minutes of learning.

The findings also indicate interesting future venues of investigation. First, and perhaps most importantly, in the 4-5 minutes of the real-world experiments, the robot already experiences roughly 7000 - 9000 supervised learning samples. It is clear that longer learning times can lead to very large supervised data sets, which are suitable for deep learning approaches. Such approaches likely allow the learning to extend to much larger, more varied environments. In addition, they could allow the learning to improve the resolution of disparity estimates from a single value to a full image size disparity map.

Second, in the current experiments, the robot stayed in a single environment. We mentioned that a different environment can make the learned mapping invalid, and that this can be detected by means of the ground truth. Another venue, as studied in [10], is to use a machine learning method with an associated uncertainty value. For instance, one could use a learning method such as a Gaussian Process. This can help with a further integration of the behavior with learning, for instance by tuning the forward velocity based on the certainty. These venues together could allow persistent SSL to reach its full potential, significantly enhancing the robustness of robots operating in real-world environments.

APPENDIX A
DEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep convolutional neural network (CNN) on board a drone, to test our proposed learning scheme. We scaled the CNN architecture so that implementing this network on the ARDrone 2 hardware remained feasible. Due to CPU restrictions of our target system, we chose to use a relatively small CNN, inspired by the Cifar10 CNN example used in Caffe. Our layers are as follows. Input: 128x128 pixels image (input images are scaled). First hidden layer: 5x5 convolution, max pooling to 3x3, 32 kernels, contrast normalization. Second hidden layer: 5x5 convolution, average pooling to 3x3, 32 kernels, contrast normalization. Third hidden layer: 5x5 convolution, average pooling to 3x3, 128 kernels. Fourth and fifth layer: fully connected. Lastly, a Euclidean loss output layer to one single output. Furthermore, we used a learning rate of 1e-6, 0.9 momentum, and 10000 iterations. The networks are trained using the open source Caffe framework, on a dedicated compute server with an NVidia GTX660. In Figure 19, the learning curves of our best CNN under these constraints versus the best of our VBoW trained algorithms are shown. The figure shows that the CNN is able to learn the problem of monocular depth estimation to a comparable fashion as our VBoW method. For the expected amount of training data during a single flight, the VBoW method delivers better performance for dataset 1, and slightly worse for dataset 2.
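For reference, a rough re-sketch of this architecture in PyTorch (the original was a Caffe prototxt; the grayscale input, pooling strides, padding, and fully connected layer sizes are assumptions on our part, reading "pooling to 3x3" as 3x3 pooling with stride 2):

```python
import torch
import torch.nn as nn

class MonoDisparityCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),   # 5x5 conv, 32 kernels
            nn.MaxPool2d(3, stride=2),                    # 128 -> 63
            nn.LocalResponseNorm(5),                      # contrast normalization
            nn.Conv2d(32, 32, kernel_size=5, padding=2),
            nn.AvgPool2d(3, stride=2),                    # 63 -> 31
            nn.LocalResponseNorm(5),
            nn.Conv2d(32, 128, kernel_size=5, padding=2),
            nn.AvgPool2d(3, stride=2),                    # 31 -> 15
        )
        self.regressor = nn.Sequential(                   # two fully connected layers
            nn.Flatten(),
            nn.Linear(128 * 15 * 15, 64),
            nn.Linear(64, 1),                             # single disparity output
        )

    def forward(self, x):
        return self.regressor(self.features(x))

model = MonoDisparityCNN()
out = model(torch.zeros(1, 1, 128, 128))   # Euclidean (MSE) loss would follow
```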

REFERENCES

[1] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437-1480, 2015.
[2] S. Ross, G. J. Gordon, and J. A. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," in 14th International Conference on Artificial Intelligence and Statistics, vol. 15, 2011.



Fig. 19. Learning curves on two datasets for CNNs versus VBoW (MSE and area under ROC curve vs. training set size [frames]). Dashed / solid lines refer to results on the train / test set.

[3] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," in 2013 IEEE International Conference on Robotics and Automation, May 2013, pp. 1765-1772.
[4] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al., "Stanley: The robot that won the DARPA Grand Challenge," in The 2005 DARPA Grand Challenge. Springer, 2007, pp. 1-43.
[5] D. Lieb, A. Lookingbill, and S. Thrun, "Adaptive road following using self-supervised learning and reverse optical flow," in Robotics: Science and Systems, 2005, pp. 273-280.
[6] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, "Reverse optical flow for self-supervised adaptive autonomous robot navigation," International Journal of Computer Vision, vol. 74, no. 3, pp. 287-302, 2007.
[7] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120-144, 2009.
[8] U. A. Muller, L. D. Jackel, Y. LeCun, and B. Flepp, "Real-time adaptive off-road vehicle navigation and terrain classification," in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2013, pp. 87410A-87410A.
[9] J. Baleia, P. Santana, and J. Barata, "On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances," Journal of Intelligent & Robotic Systems, pp. 1-20, 2015.
[10] H. Ho, C. De Wagter, B. Remes, and G. de Croon, "Optical flow for self-supervised learning of obstacle appearance," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3098-3104.
[11] T. Mori and S. Scherer, "First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles," in Proceedings - IEEE International Conference on Robotics and Automation, 2013, pp. 1750-1757.
[12] J. Engel, J. Sturm, and D. Cremers, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646-1656, 2014.
[13] ——, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646-1656, 2014.
[14] F. van Breugel, K. Morgansen, and M. H. Dickinson, "Monocular distance estimation from optic flow during active landing maneuvers," Bioinspiration & Biomimetics, vol. 9, no. 2, p. 025002, 2014.
[15] G. C. de Croon, "Monocular distance estimation with optical flow maneuvers and efference copies: a stability-based strategy," Bioinspiration & Biomimetics, vol. 11, no. 1, p. 016004, 2016.
[16] G. de Croon, M. Groen, C. De Wagter, B. Remes, R. Ruijsink, and B. van Oudheusden, "Design, aerodynamics and autonomy of the DelFly," Bioinspiration & Biomimetics, vol. 7, no. 2, p. 025003, 2012.
[17] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469-483, 2009.
[18] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.
[19] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D depth reconstruction from a single still image," International Journal of Computer Vision, vol. 76, no. 1, pp. 53-69, Aug. 2007.
[20] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: Learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824-840, May 2009.
[21] I. Lenz, M. Gemici, and A. Saxena, "Low-power parallel algorithms for single image based obstacle avoidance in aerial robots," in IEEE International Conference on Intelligent Robots and Systems, 2012, pp. 772-779.
[22] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems, 2014, pp. 1-9.
[23] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM Press, 2005, pp. 593-600.
[24] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight."
[25] K. Yamauchi, M. Oota, and N. Ishii, "A self-supervised learning system for pattern recognition by sensory integration," Neural Networks, vol. 12, no. 10, pp. 1347-1358, 1999.
[26] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L. E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, "Stanley: The robot that won the DARPA Grand Challenge," Springer Tracts in Advanced Robotics, vol. 36, pp. 1-43, 2007.
[27] C. De Wagter, S. Tijmons, B. Remes, and G. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in IEEE International Conference on Robotics & Automation (ICRA), 2014, pp. 4982-4987.
[28] G. C. H. E. De Croon, E. De Weerdt, C. De Wagter, B. D. W. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529-534, 2012.
[29] M. Varma and A. Zisserman, "Texture classification: Are filter banks necessary?" in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003.
[30] B. Wu, T. L. Ooi, and Z. J. He, "Perceiving distance accurately by a directional process of integrating ground information," Nature, vol. 428, no. 6978, pp. 73-77, 2004.
[31] C. De Wagter and M. Amelink, "Holiday50av technical paper," 2007.
[32] P. R. Cohen, "Empirical methods for artificial intelligence," IEEE Intelligent Systems, no. 6, p. 88, 1996.
[33] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 4982-4987.
[34] B. Remes, D. Hensen, F. Van Tienen, C. De Wagter, E. Van der Horst, and G. De Croon, "Paparazzi: how to make a swarm of Parrot AR drones fly autonomously based on GPS," in IMAV 2013: Proceedings of the International Micro Air Vehicle Conference and Flight Competition, Toulouse, France, 17-20 September 2013.
[35] P. Brisset, A. Drouin, M. Gorraz, P.-S. Huard, and J. Tyler, "The Paparazzi solution," 2014.
[36] X. Zhu, "Semi-supervised learning literature survey," pp. 1-59, 2007.
[37] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, "Boosting structured prediction for imitation learning," Advances in Neural Information Processing Systems (NIPS), p. 54, 2006.
[38] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, pp. 1238-1274, 2013.
[39] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445-451, 2015.

  • I Introduction
  • II Related work
    • II-A Monocular navigation
      • II-A1 Appearance-based navigation without distance estimation
      • II-A2 Offline monocular distance learning
        • II-B Self-supervised learning
          • III Methodology overview
            • III-A Persistent SSL principle
            • III-B Stereo-to-mono proof of concept
            • III-C Stereo vision processing
            • III-D Monocular disparity estimation
            • III-E Behavior
            • III-F Performance
            • III-G Similarity with Learning from Demonstration
              • IV Offline Vision Experiments
              • V Simulation Experiments
                • V-A Setup
                • V-B Results
                  • VI Robotic Experiments
                    • VI-1 Results
                      • VII Discussion
                        • VII-A Interpretation of the results
                        • VII-B Deep learning
                        • VII-C Persistent SSL in relation to other machine learning techniques
                        • VII-D Feedback-induced data bias
                          • VIII Conclusion
                          • Appendix A Deep neural network results
                          • References

5

Monocularestimator

Stereo processing

Switch

Behavior routines

Robot controller

Average disparity

Average disparity estimate

Safety overide

Control signal

Filter

PSSL

Fig 2 System overview Please see the text for details

machine learning algorithm will have to learn to notice suchrelationships itself for the robotrsquos environment by learninga mapping from the feature vector to a disparity We haveinvestigated different function representations and learningmethods to this end (see Section IV)

Fig 3 Approaching a poster on the wall Left monocular input Middleoverlaid textons annotated with the color used in the histogram Right textondistribution histogram with the corresponding texton shown beneath it

E Behavior

The proposed system uses a straightforward behavior heuristicto explore navigate and persistently learn a room The heuristicis depicted as a Finite State Machine (FSM) in Figure 4 Instate 0 the robot flies in the direction of the camerarsquos principalaxis When an obstacle is detected (λ gt t) the robot stops andgoes to state 1 in which it randomly chooses a new directionfor the principal axis It immediately passes to state 2 in whichthe robot rotates towards the new direction reducing the errore between the principal axisrsquo current and desired direction Ifin the new direction obstacles are far enough away (λ le t)the robot starts flying forward again (state 0) Else the robotcontinues to turn in the same direction as before (clockwiseor counter clockwise) until λ le t When this is the case itstarts flying straight again (state 0) The FSM detects obstaclesby means of a threshold t applied to the average disparity λChoosing this rather straightforward behavior heuristic enablesautonomous exploration based on only one scalar obtainedfrom a distance sensor

0 Straight ahead

1 Pick random direction

120582 le t120582

2 Rotate

120582 le t120582 amp e le te

e gt te

120582 gt t120582 amp e le te

120582 gt t120582

Fig 4 The behavior heuristic FSM λ is average disparity e is the attitudeerror (meaning the difference between the newly picked direction and thecurrent attitude) tn the respective thresholds

6

F PerformanceThe average disparity λ coming either from stereo visionor from the monocular distance estimation function f(xf )is thresholded for determining whether to turn This leads toa binary classification problem where all samples for whichλ gt t are considered as lsquopositiversquo (c = 1) and all samplesfor which λ le t are considered as lsquonegativersquo (c = 0)Hence the quality of f(xf ) can be characterized by a ROCcurve The ground truth for the ROC curve is determinedby the stereo vision This means that a True Positive Ratio(TPR) of 1 and False Positive Ratio (FPR) of 0 lead to thesame obstacle detection performance as with the stereo visionsystem Generally of course this performance will not bereached and the robot has to determine what threshold to setfor a sufficiently high TPR and low FPRThis leads to the question how to determine what a lsquosufficientrsquoTPR FPR is We evaluate this matter in the context of therobotrsquos obstacle avoidance task In particular we first look atthe probability of a collision with a given TPR and then at theprobability of a spurious turn with a given FPRIn order to model the probability of a collision con-sider a constant velocity approach with input samples (im-ages) 〈x1 x2 xn〉 of n samples long ending at anobstacle A minimum of u samples before actual im-pact an obstacle must be detected by at least one TPor a FP in order to prevent a collision Since the rangeof samples 〈x(nminusu+1) x(nminusu+2) xn〉 does not matterfor the outcome we redefine the approach range to be〈x1 x2 x(nminusu)〉 Consider that for each sample xi holds

1 = p(TP |xi) + p(FP |xi) + p(TN |xi) + p(FN |xi) (5)

since p(FN |xi) = p(TP |xi) = 0 if xi is a negative andp(TN |xi) = p(FP |xi) = 0 if xi is a positive Let us firstassume independent identically distributed (iid) data Thenthe probability of a collision pc can be written as

pc =

nminusuprodi=1

(p(FN |xi) + p(TN |xi))

=

qprodi=1

p(TN |xi)nminusuprodi=q+1

p(FN |xi)(6)

where q is a time step separating two phases in the approachIn the first phase all xi are negative so that any false positivewill lead to a turn preventing the collision Only if all negativesamples are correctly classified as negatives (true negatives)will the robot enter the second phase in which all xi arepositive Then only a complete sequence of false negativeswill lead to a collision since any true positive will lead to atimely turnWe can use Eq 6 to choose an acceptable TPR = 1minusFNRAssuming a constant velocity and frame rate it gives us theprobability of a collision For instance let us assume that therobot flies forward at 050ms with a frame rate of 30Hz it hasa minimal required detection distance of 10m and positives aredefined to be closer than 15m This leads to s = 30 samples

that all have to be classified as negatives In the case of iiddata if the TPR = 095 the probability of a collision ispc = (1minusTPR)s = 00530 asymp 931 10minus40 an extremely smallvalue With this analysis even a TPR = 030 leads to anacceptable pc asymp 225 10minus5

The analysis of the effect of false positives is straightforwardas it can be expressed in the number of spurious turns persecond or equivalently if assuming a constant velocity permeter travelled With the same scenario as above an FPR =005 will on average lead to 3 spurious turns per travelledmeter which is unacceptably high An FPR = 00017 willapproximately lead to 1 spurious turn per 10 meters

The above analysis seems to indicate that quite many falsenegatives are acceptable while there can only be very fewfalse positives However there are two complicating factorsThe first factor is that Eq 6 only holds when X can be as-sumed identically and independently distributed (iid) whichis unlikely due to the nature of the consecutive samples ofan approach towards an obstacle Some reasoning about thenature of the dependence is however possible Assuming astrong correlation between consecutive samples results in ahigher probability of xi being classified the same as x(i+1)In other words if a sample in the range 〈x1xm〉 is an FNthe chance that more samples are FNs increases Hence theexpected dependencies negatively impact performance of thesystem making Eq 6 a best case scenario

The system can be more realistically modelled as a Markovprocess as depicted in Figure 5 From this can be seen thatthe system can be split in a reducible Markov process with anabsorbing avoid state and a chain of states that leads to theabsorbing collision state The values of the transition matrixΩ can be determined from the data gathered during operationThis would allow the robot to better predict the consequencesof a chosen TPR and FPR

As an illustration of the effects of sample dependence, let us suppose a model in which each classification has a probability of being identical to the previous classification, p(I(c_{i−1}, c_i)). If not identical, the sample is classified independently. This dependency model allows us to calculate the transition Ω_45 in Figure 5. Given a previous negative classification, the transition probability to another negative classification is Ω_45 = p(I(c_{i−1}, c_i)) + (1 − p(I(c_{i−1}, c_i)))(1 − TPR). If p(I(c_{i−1}, c_i)) = 0.8 and TPR = 0.95 as above, Ω_45 = 0.81. The probability of a collision in such a model is p_c = Ω_45^{s−1} ≈ 1.8 · 10^−3, no longer an inconceivably small number.
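The effect of such first-order dependence is easy to compute; the following sketch implements the identical-classification model above (all names are ours):

```python
def dependent_collision_probability(tpr: float, p_identical: float, s: int) -> float:
    # First-order dependency model: a classification repeats the previous one with
    # probability p_identical, and is drawn independently otherwise.
    omega_45 = p_identical + (1.0 - p_identical) * (1.0 - tpr)  # negative -> negative
    return omega_45 ** (s - 1)

# With p(I) = 0.8 and TPR = 0.95: omega_45 = 0.81, and the collision
# probability is on the order of 1e-3 instead of 1e-39.
print(dependent_collision_probability(0.95, 0.8, 30))
```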

This leads us to the second complicating factor, which is specific to our SSL setup. Since the robot operates on the basis of the ground truth, it should theoretically hardly ever encounter positive samples. Namely, the robot should turn when it detects a positive sample. This implies that the uncertainty on the estimated TPR is rather high, while the FPR can be estimated better. A potential solution to this problem is to purposefully have the mono-estimation robot turn earlier than the stereo vision based one.


Fig. 5. Markov model of the probability of a collision. States: 0 'Avoid' (absorbing), 1 'Obstacle / FP', 2 'No obstacle / TP', 3 'No obstacle / TN', 4 'Obstacle / FN', 5 'Obstacle / 2×FN', ..., up to the absorbing 'Crash' state, with transition probabilities Ω.

G. Similarity with Learning from Demonstration

The core of SSL is a supervised algorithm that learns the function f(x_f) on the basis of supervised outputs g(x_g). Normally, supervised learning assumes that the training data is drawn from the same data probability distribution D as the test data. However, in persistent SSL this assumption generally does not hold. The problem is that by using control based on f, the robot follows a control policy π_f ≠ π_g, and hence will induce a different state distribution, D_{π_f} ≠ D_{π_g}. On these different states, no supervised outputs have been observed yet, which typically implies an increasing difference between f and g.

A similar problem of inducing a different state distribution is well-known in the area of learning from demonstration [2], [3]. Actually, we will show that under some mild assumptions, the persistent SSL problem studied in this paper is equivalent to a learning from demonstration problem. Hence, we can draw on solutions in learning from demonstration, such as DAgger [2].

The goal of learning from demonstration is to find a policy π that minimizes a loss function l under its induced distribution of states, from [2]:

\pi = \arg\min_{\pi \in \Pi} \mathbb{E}_{s \sim D_\pi} \left[ l(s, \pi) \right] \qquad (7)

where an optimal teacher policy π* is available to provide training data for specific states s.

At first sight, SSL is quite different, as it focuses only on the state information that serves as input to the policy. Instead of optimizing a policy, the supervised learning in persistent SSL can be defined as finding the function f′ that best matches the trusted function g′ under the distribution of states induced by the use of the thresholded version of f for control:

\arg\min_{f \in F} \mathbb{E}_{x \sim D_{\pi_f}} \left[ l(f(x), g(x)) \right] \qquad (8)

meaning that we perform regression of f on states that are induced by the control policy π_f, which uses the thresholded version of f.

To see the similarity to Eq. 7, first realize that the stereo-based policy is in this case the teacher policy: π_g = π*. For this analysis, we simplify the strategy to flying straight when far enough away from an obstacle, and turning otherwise:

\pi_g(s): \quad p(\text{straight} \mid s = 0) = 1, \quad p(\text{turn} \mid s = 1) = 1 \qquad (9)

where s is the state, with s = 1 when g(x) > t_g and s = 0 otherwise. Please note that π_g is a deterministic policy, which is assumed to be optimal. When we learn a function f, it generally will not give exactly the same outputs as g. Using s = f > t_f will result in the following stochastic policy:

\pi_f(s): \quad p(\text{straight} \mid s = 0) = \text{TNR}, \quad p(\text{turn} \mid s = 0) = \text{FPR}, \quad p(\text{turn} \mid s = 1) = \text{TPR}, \quad p(\text{straight} \mid s = 1) = \text{FNR} \qquad (10)

a stochastic policy which by definition is optimal, π_f = π_g, if FPR = FNR = 0. In addition, then D_{π_f} = D_{π_g}. Thus, if we make the assumption that minimizing l(f(x), g(x)) also minimizes FPR and FNR, capturing any preference for one or the other in the cost function for the behavior l(s, π), then minimizing f in Eq. 8 is equivalent to minimizing the loss in Eq. 7.

The interest of the above-mentioned similarity lies in the use of proven techniques from the field of learning from demonstration for training the persistent SSL system. For instance, in DAgger, during the first learning iteration only the expert policy is learned. After this iteration, a mix of the learned and expert policy is used: π_i = π* with p = β_i,


and π_i = π̂_i with p = (1 − β_i), where β_i is reduced over the iterations¹. The policy at iteration i depends on all previously learned data, which is aggregated over time. In [2] it is proven that this strategy gives a favorable limited, no-regret loss, which significantly outperforms the traditional learning from demonstration strategy of just learning from the data distribution of the expert policy. In Section V we will show that the main lessons from [2] also apply to persistent SSL.
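As an illustration, the probabilistic mixture can be sketched in a few lines of Python; pi_expert, pi_learned, and beta_i are our own placeholder names for the stereo-based expert policy, the learned monocular policy, and the mixing probability at iteration i:

```python
import random

def mixed_policy_step(state, pi_expert, pi_learned, beta_i: float):
    # The expert (trusted cue) is always queried, so every step yields a label
    # that can be aggregated into the training set, as in DAgger.
    expert_action = pi_expert(state)
    if random.random() < beta_i:
        executed = expert_action        # act on the expert policy
    else:
        executed = pi_learned(state)    # act on the learned policy
    return executed, expert_action     # (action to execute, supervised label)
```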

IV. OFFLINE VISION EXPERIMENTS

In this section, we perform offline vision experiments. The goal of these experiments is to determine how good the proposed VBoW method is at estimating monocular depth, and to determine the best parameter settings.

To measure the performance, we use two main metrics: the mean square error (MSE) and the area under the curve (AUC) of a ROC curve. MSE is an easy metric that can be directly used as a loss function, but in practice many situations exist in which a low MSE can be achieved while performance is still inadequate as a basis for reliable MAV behavioral control. The AUC captures the trade-off between TPR and FPR, and hence is a good indication of how good the performance is in terms of obstacle detection.
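Both metrics can be computed directly from per-frame disparities; a minimal sketch, assuming arrays d_stereo (ground truth), d_mono (estimates), and a turn threshold t (all variable names are ours):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

def evaluate_disparity_estimates(d_stereo: np.ndarray, d_mono: np.ndarray, t: float):
    mse = mean_squared_error(d_stereo, d_mono)
    labels = (d_stereo > t).astype(int)  # stereo vision defines the positives
    auc = roc_auc_score(labels, d_mono)  # monocular disparity as detection score
    return mse, auc
```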

Fig. 6. Example from dataset 1 (left, 128×96 pixels) and dataset 2 (right, 640×480).

We use two data sets in the experiments. The first dataset is a video made on a drone during an autonomous flight, using the onboard 128 × 96 pixel stereo camera. The second dataset is a video made by manually walking with a higher quality 640 × 480 pixel stereo camera through an office cubicle, in a similar fashion as the robot should move in the later online experiments. Datasets 1 and 2 used in this section are made publicly available for download²,³. An example image from each dataset is shown in Figure 6.

Our implementation of the VBoW method has six main parameters, ranging from the number of intensity and gradient textons to the number of samples used to smooth the estimated disparity over time. An exhaustive search of parameters being out of reach, we have performed various investigations of parameter changes along a single dimension.

¹In [2] this mixture was written as π_i = β_i π* + (1 − β_i) π̂_i, hinting at mixed control. However, it is described as a policy that queries the expert to choose controls a fraction of the time while collecting the next dataset. For this reason, and because it makes more sense given our binary control strategy, we mention this policy as a probabilistic mixture in this article.

²Dataset 1 can be downloaded and viewed from http://1drv.ms/1NOhIOh
³Dataset 2 can be downloaded from http://1drv.ms/1NOhLtA

Table I presents a list of the final tuned parameter values. Please note that these parameter values have not only been optimized for performance: whenever performance differences were marginal, we have chosen the parameter values that saved on computational effort. This choice was guided by our goal to perform the learning on board of a computationally limited drone. Below, we show a few of the results when varying a single parameter, deviating from the settings in Table I in the corresponding dimension.

Fig. 7. Used texton dictionaries. Left: for camera 1; right: for camera 2.

In this work, two types of textons are combined to form a single texton histogram: normal intensity textons, as obtained with Kohonen clustering as in [28], and gradient textons, obtained similarly but based upon the gradient of the images. Gradient textures have been shown in [30] to be an important depth cue. An example dictionary of each is depicted in Figure 7. Gradient textons are shown with a color range (from blue = low to red = high). The intensity textons in Figure 7 are based on grayscale intensity pixel values.

Figure 8 shows the results for different numbers of textons ∈ {4, 8, 12, 16, 20}, always consisting of half pixel intensity and half gradient textons. From the results, we can see that the performance saturates around 20 textons. Hence, we selected this combination of 10 intensity and 10 gradient textons for the experiments.

The VBoW method involves choosing a regression algorithm. In order to determine the best learning algorithm, we have tested four regression algorithms, limiting the choice mainly based on the feasibility of implementing the regression algorithm on board a constrained embedded system. We have tested two non-parametric (kNN and Gaussian Process regression) and two parametric (linear and shallow neural network regression) algorithms. Figure 9 presents the learning curves for a comparison of these regressors. Clearly, in most cases the kNN regression comes out best. A naïve implementation of kNN suffers from having a larger training set in terms of CPU usage during test time, but after implementation on the drone this did not become a bottleneck.

The final offline results on the two datasets are quite satisfactory. They can be viewed online⁴,⁵. After a training set of roughly 4000 samples, the kNN approximates the stereo vision based disparities in the test set rather well. Given a desired TPR of 0.96, the learner has an FPR of 0.47. This should be sufficient for usage of the estimated disparities in control.
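A minimal sketch of this regression step, assuming the texton histograms have already been extracted per frame (the file names and variables are ours; k = 5 and the 10 + 10 texton split follow Table I):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Load precomputed features and SSL labels (placeholder file names).
X_train = np.load("texton_histograms_train.npy")   # (n_frames, 20): 10 intensity + 10 gradient textons
y_train = np.load("stereo_disparities_train.npy")  # (n_frames,): average stereo disparity labels

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(X_train, y_train)

X_test = np.load("texton_histograms_test.npy")
d_mono = knn.predict(X_test)                       # monocular disparity estimates
```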

V. SIMULATION EXPERIMENTS

In Section III, we argued that a persistent form of SSL is similar to learning from demonstration. The relevance of this similarity lies in the behavioral schemes used for learning. In this section, we compare three learning schemes in simulation.

⁴A video of VBoW visualizations on dataset 1 is available here: http://1drv.ms/1KdRtC1
⁵A video of VBoW visualizations on dataset 2 is available here: http://1drv.ms/1KdRxlg


Fig. 8. MSE and AUC vs. number of textons ∈ {4, 8, 12, 16, 20} for kNN regression, on dataset 1 (left) and dataset 2 (right), as a function of training set size [frames]. Dashed / solid lines refer to results on the train / test set.

TABLE I. PARAMETER SETTINGS

Parameter                     Value
Number of intensity textons   10
Number of gradient textons    10
Patch size                    5×5
Subsampling samples           500
kNN                           k = 5
Smooth size                   4


A. Setup

We simulate a 'flying' drone with a stereo vision camera in SmartUAV [31], an in-house developed simulator that allows for 3D rendering and simulation of the sensors and algorithms used on board the real drone. Figure 10 shows the simulated 'office room'. The room has a size of 10 × 10 meter, and the drone has an average forward speed of 0.5 m/s. All the vision and learning algorithms are exactly the same as the ones that run on board of the drone in the real experiments.

Three learning schemes are designed as follows. They all start with an initial learning period in which the drone is controlled purely by means of stereo vision. In the first learning scheme, the drone will continue to fly based on stereo vision for

Fig. 10. SmartUAV simulation environment.

the remainder of the learning time. After learning, the drone immediately switches to monocular vision. For this reason, the first scheme is referred to as 'cold turkey'. In the second learning scheme, the drone will perform a stochastic policy, selecting the stereo vision based actions with a probability β_i and monocular based actions with a probability (1 − β_i), as was proposed in the original DAgger article [2]. In the experiments, β_i = 0.25. Finally, in the third learning scheme, the drone will perform monocular based actions, with stereo vision only


Fig. 9. VBoW regression algorithm learning curves (kNN (k=5), linear, GP, and neural net) on dataset 1 (left) and dataset 2 (right): MSE and AUC as a function of training set size [frames]. Dashed / solid lines refer to results on the train / test set.

used to override these actions when the drone gets too close to an obstacle. Therefore, we refer to this scheme as 'training wheels'.

After learning, the drone will use its monocular disparity estimates for control. The stereo vision remains active, only for overriding the control if the drone gets too close to a wall. During this testing period, we register the number of turns and the number of overrides. The number of overrides is a measure of the number of potential collisions. The number of turns is compared to the number of turns performed when using stereo vision, to evaluate the number of spurious turns. The initial learning period is 1 minute, the remaining learning period is 4 minutes, and the test time is 5 minutes. These times have been selected to allow a full experiment on a single battery of the real drone (see Section VI).
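A sketch of the 'training wheels' control step, with assumed threshold values and placeholder names; since disparity grows as obstacles get closer, the stereo override threshold is set at a larger disparity than the monocular turn threshold:

```python
T_MONO = 1.0      # assumed monocular turn threshold (disparity)
T_OVERRIDE = 1.5  # assumed stereo override threshold: only when already very close

def training_wheels_step(frame, stereo_disparity, estimator, learning: bool) -> str:
    if learning:
        # Persistent SSL: every frame pairs the image with its stereo label.
        estimator.update(frame, stereo_disparity)
    if stereo_disparity > T_OVERRIDE:
        return "turn"  # stereo override: imminent collision
    if estimator.predict(frame) > T_MONO:
        return "turn"  # monocular-based avoidance
    return "straight"
```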

B. Results

Table II contains the results of 30 experiments with the three learning schemes and a purely stereo-vision-controlled drone. The first observation is that 'cold turkey' gives the worst results. This result was to be expected on the basis of the similarity between persistent SSL and learning from demonstration: the learned monocular distance estimates do not generalize well to the test distribution when the monocular vision is

TABLE II. TEST RESULTS FOR THE THREE LEARNING SCHEMES. THE AVERAGE AND STANDARD DEVIATION ARE GIVEN FOR THE NUMBER OF OVERRIDES AND TURNS DURING THE TESTING PERIOD. A LOWER NUMBER OF OVERRIDES IS BETTER. THE BEST RESULTS ARE SHOWN IN BOLD.

Method               Overrides         Turns
Pure stereo          N/A               45.6 (σ = 3.0)
1. Cold turkey       25.1 (σ = 8.2)    42.8 (σ = 3.7)
2. DAgger            10.7 (σ = 5.3)    41.4 (σ = 3.2)
3. Training wheels   4.3 (σ = 2.6)     40.4 (σ = 2.6)

in control. The originally proposed DAgger scheme performs better, while the third learning scheme, termed 'training wheels', seems most effective. The third scheme has the lowest number of overrides of all learning schemes, with a similar total number of turns as a pure stereo vision run. The intuition behind this method being best is that it allows the drone to best learn from samples when the drone is beyond the normal stereo vision turning threshold. The original DAgger scheme has a larger probability to turn earlier, exploring these samples to a lesser extent. Double-sided statistical bootstrap tests [32] indicate that all differences between the learning methods are significant, with p < 0.05.

The differences between the learning schemes are well illustrated by the positions the drone visits in the room during the test phase. Figure 11 contains 'heat maps' that show the drone


Fig. 11. Simulation heatmaps, from left to right: cold turkey, DAgger, training wheels, stereo only. Top images are turn locations, lower images are the approaches.

Fig. 12. The used multicopter.

positions during turning (top row) and during straight flight (bottom row). The position distribution has been obtained by binning the positions during the test phase of all 30 runs. The results for each scheme are shown per column in Figure 11. Right is the pure stereo vision scheme, which shows a clear border around the straight flight trajectories. It can be observed that this border is best approximated by the 'training wheels' scheme (second from the right).

VI. ROBOTIC EXPERIMENTS

The simulation experiments showed that the 'training wheels' setup resulted in the fewest stereo vision overrides when switching to monocular disparity estimation and control. In this section, we test this online learning setup with a flying robot.

The experiment is set up in the same manner as the simulation. The robot, a Parrot AR drone 2, first explores the room with the help of stereo vision. After 1 minute of learning, the drone switches to using the monocular disparity estimates, with stereo vision running in the background for performing potential safety overrides. In this phase, the drone still continues to learn. After learning 4 to 5 minutes, the drone stops learning and enters the test phase. Again, also for the real robot, the main performance measure consists of the number of safety overrides performed by the stereo vision during the testing phase.

The AR drone 2 is standard not equipped with a stereo vision system. Therefore, an in-house-developed 4-gram stereo vision system is used [33], which sends the raw images over USB to the ARDrone2. The grayscale stereo camera has a resolution of 128 × 96 px and is limited to 10 fps. The ARDrone2 comes with a 1 GHz ARM Cortex A8 processor and 128 MB RAM,

and normally runs the Parrot firmware as an autopilot. For the experiments, we replace this firmware with the open source Paparazzi autopilot software [34], [35]. This allowed us to implement all vision and learning algorithms on board the drone. The length of each test is dependent on the battery, which due to wear has considerable variation, in the range of 8–15 minutes.

The tests are performed in an artificial room that has been constructed within a motion tracking arena. This allows us to track the trajectory of the drone and facilitates post-experiment analysis. The room is approximately 5 × 5 m, as delimited by plywood walls. In order to ensure that the stereo vision algorithm gave reliable results, we added texture in the form of duct tape to the walls. In five tests we had a textured carpet hanging over one of the walls (Figure 13, left; referred to as 'room 1'); in the other five tests it was on the floor (Figure 13, right; referred to as 'room 2').

Fig. 13. Two test flight rooms.

1) Results: Table III shows the summarized results obtained from the monocular test flights. Two main observations can be made from this table. First, the average number of stereo overrides during the test phase is 3, which is very close to the number of overrides in simulation. The monocular behavior also has a similar heat map to simulation. Figure 14 shows a heat map of the drone's position during the approaches and the avoidance maneuvers (the turns). Again, the stereo based flight performs better in the sense that the drone explores the room much more thoroughly, and the turns happen consistently just before an obstacle is detected. On the other hand, especially in room 2, the monocular performance is quite good in the sense that the system is able to explore most of the room.

Second, the selected TPR and FPR are on average 0.47 and 0.11. The TPR is rather low compared to the offline tests. However, this number is heavily influenced by the monocular estimator based behavior. Due to the goal of the robot avoiding obstacles slightly before the stereo ground truth recognizes them as positives, positives should hardly occur at all. Only in cases of FNs, where the estimator is slower or wrong, will positives be registered by the ground truth. Similarly, the FPR is also lower in the context of the estimate based behavior.

ROC curves of the 10 flights are shown in Figure 16. A comparison based on the numbers between the first 5 flights (room 1) and the last 5 flights (room 2) does not show any significant differences, leading to the suggestion that the system is able to learn both rooms equally well. However, when comparing the heat maps of the two situations of the monocular system in Figure 14 and Figure 15, it seems that the system shows slightly different behavior. The monocular system appears to explore room 2 better, getting closer to copying the behavior of the stereo-based system.


TABLE III. TEST FLIGHT SUMMARY

                            |            Room 1             |             Room 2            |
Description                 | 1     2     3     4     5     | 6     7     8     9     10    | Avg
Stereo flight time [m:ss]   | 6:48  7:53  2:13  3:30  4:45  | 4:39  4:56  5:12  4:58  5:01  | 4:59
Mono flight time [m:ss]     | 3:44  8:17  6:45  7:25  4:54  | 10:07 4:46  9:51  5:23  5:12  | 6:39
Mean Square Error           | 0.7   1.96  1.12  0.95  0.83  | 0.95  0.87  1.32  1.16  1.06  | 1.09
False Positive Rate         | 0.16  0.18  0.13  0.11  0.11  | 0.08  0.13  0.08  0.1   0.08  | 0.11
True Positive Rate          | 0.9   0.44  0.57  0.38  0.38  | 0.4   0.35  0.35  0.6   0.39  | 0.47
Stereo approaches           | 29    31    8     14    19    | 22    22    19    20    21    | 20.5
Mono approaches             | 10    21    20    25    14    | 33    15    28    18    15    | 19.9
Auto-overrides              | 0     6     2     2     1     | 5     2     7     3     2     | 3
Overrides ratio             | 0     0.72  0.3   0.27  0.2   | 0.49  0.42  0.71  0.56  0.38  | 0.41


Fig. 14. Room 1 (plain texture) position heat map. Top row is the binned position during the avoidance turns, bottom row during the obstacle approaches; left column during stereo ground truth based operation, right column during learned monocular operation.

The experimental setup, with the room in the motion tracking arena, allows for a more in-depth analysis of the performance of both stereo and monocular vision. Figure 17 shows the spatial view of the flight trajectory of test 10⁶. The flight is segmented into approaches and turns, which are numbered accordingly in these figures. The color scale in Figure 17A is created by calculating the theoretically visible closest wall, based on the tracking system's measured heading and position of the drone, the known position of the walls, and the FOV angle of the camera. It is clearly visible that the stereo ground truth in Figure 17B does not capture this theoretical disparity perfectly. Especially in the middle of the room, the disparity remains high compared to the theoretical ground truth, due to noise in the stereo disparity map. The results of the monocular estimator in Figure 17C show another decrease in quality compared to the stereo ground truth.

⁶Onboard video data of flight 10 can be viewed at http://1drv.ms/1KC81PN, an external video at https://youtu.be/lC_vj-QNy1I, and a VBoW visualization video at http://1drv.ms/1fzvicH

Fig. 15. Room 2 (carpet, natural texture) position heat map.

Fig. 16. ROC curves of the 10 test flights. Dashed / solid lines refer to results on room 1 / 2.


Fig. 17. Flight path of test 10 in room 2. Monocular flight starts from approach 23. The meaning of the color of the flight path differs per image: A: the approximated disparity based on the external tracking system; B: the measured stereo average disparity; C: the monocular estimated disparity; D: the error between B & C, with dark blue meaning zero error; E: error during FPs; F: error during FNs. E and F only show the monocular part of the flight.

VII. DISCUSSION

We start the discussion with an interpretation of the results from the simulation and real-world experiments, after which we proceed by discussing persistent SSL in general and provide a comparison to other machine learning techniques.

A. Interpretation of the results

Using persistent SSL, we were able to autonomously navigate our multicopter on the basis of a stereo vision camera, while training a monocular estimator on board and online. Although the monocular estimator allows the drone to continue flying and avoiding obstacles, the performance during the approximately ten-minute flights is not perfect. During monocular flight, a fairly limited number of (autonomous) stereo overrides was needed, while at the same time the robot was not fully exploring the room like when using stereo.

Several improvements can be suggested. First, we can simply have the drone learn for a longer time, accumulating training data over multiple flights. In an extended offline test, our VBoW method shows saturation at around 6000 samples. Using additional features and more advanced learning methods

may result in improved performance if training set sizes increase.

During our tests in different environments, it proved unnecessary to tune the VBoW learning algorithm parameters to a new environment, as similar performance was obtained. The learned results on the robot itself may or may not generalize to different environments; however, this is of less concern, as the robot can detect a new environment and then decide to continue the learning process if the original cue is still available. In order to detect an inadequacy of the learned regression function, the robot can occasionally check the estimation error against the stereo ground truth. In fact, our system already does so autonomously, using its safety override. Methods for checking the performance without using the ground truth, e.g. by employing a learner that gives an estimate of uncertainty, are left for future work.
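Such a check could look as follows; a sketch with an assumed window size and error threshold (all names and values are ours):

```python
from collections import deque

recent_sq_errors = deque(maxlen=100)  # sliding window of squared errors
MSE_THRESHOLD = 1.5                   # assumed acceptable error level

def estimator_still_adequate(d_mono: float, d_stereo: float) -> bool:
    # Compare the monocular estimate against the stereo ground truth; if the
    # recent MSE grows too large, fall back to stereo and resume learning.
    recent_sq_errors.append((d_mono - d_stereo) ** 2)
    mse = sum(recent_sq_errors) / len(recent_sq_errors)
    return mse < MSE_THRESHOLD
```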

B. Deep learning

At the time of our robotic experiments, implementing state-of-the-art deep learning methods on board a flying drone was deemed infeasible due to hardware restrictions. However, in Appendix A we investigate using downsized networks in order


to show that the persistent SSL method can work with deep learning as well. One of the major advantages of persistent SSL is the unprecedented amount of available training data. This amount of data will be more useful to more complex learning methods, such as deep learning methods, than to less complex but computationally efficient methods, such as the VBoW method used in our experiments. Today, with the availability of strongly improved hardware such as the NVidia Jetson TX1, close-to state-of-the-art models can be trained and run on board a drone, which may significantly improve the learning results.

C. Persistent SSL in relation to other machine learning techniques

In order to place persistent SSL in the general framework of machine learning, we compare it with several techniques. An overview of this comparison is presented in Figure 18.

Fig. 18. Lay of the machine learning land. The figure places unsupervised learning, supervised learning, self-supervised learning, persistent self-supervised learning, learning from demonstration, and reinforcement learning along two axes: the importance of supervision (classical machine learning) and the importance of behavior (classical control policy learning).

Un-/semi-/supervised learning: Unsupervised learning does not require labeled data, semi-supervised learning requires only an initial set of labeled data [36], and supervised learning requires all the data to be labeled. Internally, persistent SSL uses a standard supervised learning scheme, which greatly facilitates and speeds up learning. The typical downside of supervised learning, acquiring the labels, does not apply to SSL, since the robot continuously provides these labels itself. A major difference between the typical use of supervised learning and its use in persistent SSL is that the data for learning is generally assumed to be i.i.d. However, persistent SSL controls a behavioral component, which in turn affects the dataset obtained both during training and during test time. Operation based on ground truth induces a certain behavior that differs significantly from the behavior induced by a trained estimator, even more so for an undertrained estimator.

SSL: The persistent form of SSL is set apart in the figure from 'normal' SSL, because the persistence property introduces a much more significant behavioral component to the learning. While normal SSL expects the trusted cue to remain available, persistent SSL assumes that the robot may sometimes act in

the absence of the trusted cue. This introduces the feedback-induced data bias problem, which, as we have seen, requires specific behavior strategies for best learning the robot's task.

Learning from demonstration: Learning from Demonstration (LfD) is a close relative to persistent SSL. Consider, for instance, teleoperation, an LfD scheme in which a (human or robot) teacher remotely operates a robot in order for it to learn demonstrated actions in its environment [17]. This can be compared to persistent SSL if we consider the teacher to be the ground truth function g(x_g) in the persistent SSL scheme. In most cases described in the literature, the teacher shows actions from a control policy, taken on the basis of a state, instead of just the results from a sensory cue (i.e., the state). However, LfD does contain exceptions in which the learner only records the states during demonstration, e.g., when drawing a map through a 2D representation of the world in case of a path planning mission in an outdoor robot [37]. Like in persistent SSL, test time decisions taken in LfD schemes influence future observations, which may or may not be contained in known, demonstrated territory. However, one key difference between LfD and persistent SSL arguably sets them apart: all LfD theory known to the authors implicitly assumes that the teacher is never the same entity as the learner. It may be that all relevant sensors are on the learner, and even that the learner's body is used to execute teacher commands (like in teleoperation), but the teacher's intelligence is always an external entity.

Reinforcement learning: Lastly, we compare persistent SSL with Reinforcement Learning (RL), which is a distinctively different technique [38]. In RL, a policy is learned using a reward function. Due to the evaluative feedback provided in RL, defining a good reward function is one of the fundamental difficulties of RL, known as reward shaping [39], [38]. Since persistent SSL uses supervised feedback, reward shaping is less of an issue in persistent SSL, only requiring the choice of a loss function between g(x_g) and f(x_f). Secondly, the initial exploration phase of RL often involves a lot of trial-and-error, making it a dangerous time in which a physical system may crash and be damaged. Although this particular problem is often solved by better initialization, e.g. by using LfD or by using policy search instead of value function based approaches, persistent SSL does not require an untrained initialization phase at all, as a reliable ground truth function guarantees a certain minimal correct behavior.

Persistent SSL differs from other learning techniques in the sense that no complete training data set is needed to train the algorithm beforehand. Instead, it requires a ground truth g(x_g), which must be available online in real-time while training f(x_f), but can be switched off when f(x_f) is learned to satisfaction. This implies that learning needs to be persistent, and that the switch θ must be included in the model. Note that in cases where the environment of the robot may change, measures can be put in place to detect the output uncertainty of f(x_f). If the uncertainty goes up, the robot can switch back to using the ground truth function, and learning can then be activated again. Developing such measures is, however, left for future work.


D. Feedback-induced data bias

The robot induces how its environment is perceived, meaning that it influences the acquired training samples based on its behavior. The problems arising from this feedback-induced data bias are known from other machine learning disciplines, such as RL and LfD [38]. In particular, Ross et al. have proposed DAgger [2] to solve a similar problem in the LfD domain, which iteratively aggregates the dataset with induced training samples and the expert's reaction to them. However, in the case of LfD, obtaining the induced training samples requires a careful and often additional setup, while in persistent SSL this functionality is inherently available. Secondly, the performance of the LfD expert (i.e., in many cases a human) is not easy to control, often reacting too late or too early. The control policy of the persistent SSL ground truth override system can, on the other hand, be very deterministic. In the case of a DAgger application with drones flying through a forest [3], it proved infeasible to reliably sample the expert in an online fashion. Acquired videos had to be processed offline by the expert, hence the need for (offline ↔ online) iterations. Moreover, an additional online human safety override interface was still necessary to prevent damage to the drone while learning. Thirdly, due to the cost of (and need for) iterative demonstration sessions, the emphasis of DAgger is on converging fast, needing as few expert sessions as possible. In persistent SSL, there are no costs for using the teacher signals coming from the original sensor cue. With persistent SSL, we can directly focus on effectively using the available amount of training samples, instead of minimizing the number of iterations as in DAgger.

Another reason why persistent SSL handles the induced training sample issue better than other state-of-the-art robot learning methods is that in persistent SSL, part of the learning problem itself can be easily separated and tested apart from the behavior, i.e., in a traditional supervised learning setting. In our proof of concept, this has allowed us to test the learning algorithms and thoroughly investigate their limits before deployment, which helped us to identify when the behavioral influence was to blame for bad results.

VIII. CONCLUSION

We have investigated a novel Self-Supervised Learning scheme in which the supervisory signal is switched off after an initial learning period. In particular, we have studied an instance of such persistent SSL for the task of obstacle avoidance, in which the robot uses trusted stereo vision distance estimates in order to learn appearance-based monocular distance estimation. We have shown that this particular setup is very similar to learning from demonstration. This similarity has been corroborated by experiments in simulation, which showed that the worst learning strategy is to make a hard switch from stereo vision flight to mono vision flight. It is best to have the robot fly based on mono vision, using stereo vision only as 'training wheels' to take over when the robot would otherwise collide with an obstacle. The real-world robot experiments show the feasibility of the approach, giving acceptable results already with just 4-5 minutes of learning.

The findings also indicate interesting future avenues of investigation. First, and perhaps most importantly, in the 4-5 minutes of the real-world experiments, the robot already experiences roughly 7000-9000 supervised learning samples. It is clear that longer learning times can lead to very large supervised data sets, which are suitable for deep learning approaches. Such approaches likely allow the learning to extend to much larger, more varied environments. In addition, they could allow the learning to improve the resolution of disparity estimates from a single value to a full image size disparity map. Second, in the current experiments, the robot stayed in a single environment. We mentioned that a different environment can make the learned mapping invalid, and that this can be detected by means of the ground truth. Another avenue, as studied in [10], is to use a machine learning method with an associated uncertainty value. For instance, one could use a learning method such as a Gaussian Process. This can help with a further integration of the behavior with learning, for instance by tuning the forward velocity based on the certainty. These avenues together could allow persistent SSL to reach its full potential, significantly enhancing the robustness of robots operating in real-world environments.

APPENDIX A
DEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep convolutional neural network (CNN) on board a drone to test our proposed learning scheme. We scaled the CNN architecture so that implementing this network on the ARDrone 2 hardware remained feasible. Due to CPU restrictions of our target system, we chose to use a relatively small CNN, inspired by the Cifar10 CNN example used in Caffe. Our layers are as follows. Input: 128×128 pixel image (input images are scaled). First hidden layer: 5×5 convolution, max pooling to 3×3, 32 kernels, contrast normalization. Second hidden layer: 5×5 convolution, average pooling to 3×3, 32 kernels, contrast normalization. Third hidden layer: 5×5 convolution, average pooling to 3×3, 128 kernels. Fourth and fifth layer: fully connected. Lastly, a euclidean loss output layer to one single output. Furthermore, we used a learning rate of 1e-6, 0.9 momentum, and 10000 iterations. The networks are trained using the open source Caffe framework on a dedicated compute server with an NVidia GTX660. In Figure 19, the learning curves of our best CNN under these constraints versus the best of our VBoW-trained algorithms are shown. The figure shows that the CNN is able to learn the problem of monocular depth estimation to a comparable fashion as our VBoW method. For the expected amount of training data during a single flight, the VBoW method delivers better performance for dataset 1 and slightly worse for dataset 2.

Fig. 19. Learning curves on two datasets for CNNs versus VBoW (MSE and area under ROC curve as a function of training set size [frames]). Dashed / solid lines refer to results on the train / test set.
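For reference, a sketch of the described architecture transcribed to PyTorch (the original used Caffe); the single-channel input, strides, padding, fully connected width, and ReLU activations are our assumptions, the latter following the Caffe Cifar10 example the network was inspired by:

```python
import torch
import torch.nn as nn

class MonoDisparityCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),    # first hidden layer
            nn.MaxPool2d(kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            nn.LocalResponseNorm(5),                       # contrast normalization
            nn.Conv2d(32, 32, kernel_size=5, padding=2),   # second hidden layer
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=3, stride=2),
            nn.LocalResponseNorm(5),
            nn.Conv2d(32, 128, kernel_size=5, padding=2),  # third hidden layer
            nn.ReLU(inplace=True),
            nn.AvgPool2d(kernel_size=3, stride=2),
        )
        self.regressor = nn.Sequential(                    # fourth and fifth layer
            nn.Flatten(),
            nn.Linear(128 * 15 * 15, 64),
            nn.Linear(64, 1),                              # single disparity output
        )

    def forward(self, x):                                  # x: (batch, 1, 128, 128)
        return self.regressor(self.features(x))

model = MonoDisparityCNN()
criterion = nn.MSELoss()  # euclidean loss against the stereo disparity label
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
```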

REFERENCES

[1] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.

[2] S. Ross, G. J. Gordon, and J. A. Bagnell, "A Reduction of Imitation Learning and Structured Prediction," 14th International Conference on Artificial Intelligence and Statistics, vol. 15, 2011.



[3] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," 2013 IEEE International Conference on Robotics and Automation, pp. 1765–1772, May 2013.

[4] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al., "Stanley: The robot that won the DARPA Grand Challenge," in The 2005 DARPA Grand Challenge. Springer, 2007, pp. 1–43.

[5] D. Lieb, A. Lookingbill, and S. Thrun, "Adaptive road following using self-supervised learning and reverse optical flow," in Robotics: Science and Systems, 2005, pp. 273–280.

[6] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, "Reverse optical flow for self-supervised adaptive autonomous robot navigation," International Journal of Computer Vision, vol. 74, no. 3, pp. 287–302, 2007.

[7] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009.

[8] U. A. Muller, L. D. Jackel, Y. LeCun, and B. Flepp, "Real-time adaptive off-road vehicle navigation and terrain classification," in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2013, pp. 87410A–87410A.

[9] J. Baleia, P. Santana, and J. Barata, "On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances," Journal of Intelligent & Robotic Systems, pp. 1–20, 2015.

[10] H. Ho, C. De Wagter, B. Remes, and G. de Croon, "Optical flow for self-supervised learning of obstacle appearance," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3098–3104.

[11] T. Mori and S. Scherer, "First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles," Proceedings - IEEE International Conference on Robotics and Automation, pp. 1750–1757, 2013.

[12] J. Engel, J. Sturm, and D. Cremers, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[13] ——, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[14] F. van Breugel, K. Morgansen, and M. H. Dickinson, "Monocular distance estimation from optic flow during active landing maneuvers," Bioinspiration & Biomimetics, vol. 9, no. 2, p. 025002, 2014.


[15] G. C. de Croon, "Monocular distance estimation with optical flow maneuvers and efference copies: a stability-based strategy," Bioinspiration & Biomimetics, vol. 11, no. 1, p. 016004, 2016.

[16] G. de Croon, M. Groen, C. De Wagter, B. Remes, R. Ruijsink, and B. van Oudheusden, "Design, aerodynamics and autonomy of the DelFly," Bioinspiration & Biomimetics, vol. 7, no. 2, p. 025003, 2012.

[17] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.

[18] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.

[19] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D Depth Reconstruction from a Single Still Image," International Journal of Computer Vision, vol. 76, no. 1, pp. 53–69, Aug. 2007.

[20] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, May 2009.

[21] I. Lenz, M. Gemici, and A. Saxena, "Low-power parallel algorithms for single image based obstacle avoidance in aerial robots," IEEE International Conference on Intelligent Robots and Systems, pp. 772–779, 2012.

[22] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, pp. 1–9, 2014.

[23] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning, vol. 3, no. Suppl 1. ACM Press, 2005, pp. 593–600.

[24] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and Learning for Deliberative Monocular Cluttered Flight."

[25] K. Yamauchi, M. Oota, and N. Ishii, "A self-supervised learning system for pattern recognition by sensory integration," Neural Networks, vol. 12, no. 10, pp. 1347–1358, 1999.

[26] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L. E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, "Stanley: The robot that won the DARPA Grand Challenge," Springer Tracts in Advanced Robotics, vol. 36, pp. 1–43, 2007.

[27] C. De Wagter, S. Tijmons, B. Remes, and G. de Croon, "Autonomous Flight of a 20-gram Flapping Wing MAV with a 4-gram Onboard Stereo Vision System," in IEEE International Conference on Robotics & Automation (ICRA), 2014, pp. 4982–4987.

[28] G. C. H. E. De Croon, E. De Weerdt, C. De Wagter, B. D. W. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.

[29] M. Varma and A. Zisserman, "Texture Classification: Are Filter Banks Necessary?" Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003.

[30] B. Wu, T. L. Ooi, and Z. J. He, "Perceiving distance accurately by a directional process of integrating ground information," Nature, vol. 428, no. 6978, pp. 73–77, 2004.

[31] C. De Wagter and M. Amelink, "Holiday50av technical paper," 2007.

[32] P. R. Cohen, "Empirical methods for artificial intelligence," IEEE Intelligent Systems, no. 6, p. 88, 1996.

[33] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 4982–4987.

[34] B. Remes, D. Hensen, F. Van Tienen, C. De Wagter, E. Van der Horst, and G. De Croon, "Paparazzi: how to make a swarm of Parrot AR drones fly autonomously based on GPS," in IMAV 2013: Proceedings of the International Micro Air Vehicle Conference and Flight Competition, Toulouse, France, 17–20 September 2013, 2013.

[35] P. Brisset, A. Drouin, M. Gorraz, P.-S. Huard, and J. Tyler, "The Paparazzi Solution," 2014.

[36] X. Zhu, "Semi-Supervised Learning Literature Survey," Sciences-New York, pp. 1–59, 2007.

[37] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, "Boosting Structured Prediction for Imitation Learning," Advances in Neural Information Processing Systems (NIPS), p. 54, 2006.

[38] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, pp. 1238–1274, 2013.

[39] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.

  • I Introduction
  • II Related work
    • II-A Monocular navigation
      • II-A1 Appearance-based navigation without distance estimation
      • II-A2 Offline monocular distance learning
        • II-B Self-supervised learning
          • III Methodology overview
            • III-A Persistent SSL principle
            • III-B Stereo-to-mono proof of concept
            • III-C Stereo vision processing
            • III-D Monocular disparity estimation
            • III-E Behavior
            • III-F Performance
            • III-G Similarity with Learning from Demonstration
              • IV Offline Vision Experiments
              • V Simulation Experiments
                • V-A Setup
                • V-B Results
                  • VI Robotic Experiments
                    • VI-1 Results
                      • VII Discussion
                        • VII-A Interpretation of the results
                        • VII-B Deep learning
                        • VII-C Persistent SSL in relation to other machine learning techniques
                        • VII-D Feedback-induced data bias
                          • VIII Conclusion
                          • Appendix A Deep neural network results
                          • References

6

F PerformanceThe average disparity λ coming either from stereo visionor from the monocular distance estimation function f(xf )is thresholded for determining whether to turn This leads toa binary classification problem where all samples for whichλ gt t are considered as lsquopositiversquo (c = 1) and all samplesfor which λ le t are considered as lsquonegativersquo (c = 0)Hence the quality of f(xf ) can be characterized by a ROCcurve The ground truth for the ROC curve is determinedby the stereo vision This means that a True Positive Ratio(TPR) of 1 and False Positive Ratio (FPR) of 0 lead to thesame obstacle detection performance as with the stereo visionsystem Generally of course this performance will not bereached and the robot has to determine what threshold to setfor a sufficiently high TPR and low FPRThis leads to the question how to determine what a lsquosufficientrsquoTPR FPR is We evaluate this matter in the context of therobotrsquos obstacle avoidance task In particular we first look atthe probability of a collision with a given TPR and then at theprobability of a spurious turn with a given FPRIn order to model the probability of a collision con-sider a constant velocity approach with input samples (im-ages) 〈x1 x2 xn〉 of n samples long ending at anobstacle A minimum of u samples before actual im-pact an obstacle must be detected by at least one TPor a FP in order to prevent a collision Since the rangeof samples 〈x(nminusu+1) x(nminusu+2) xn〉 does not matterfor the outcome we redefine the approach range to be〈x1 x2 x(nminusu)〉 Consider that for each sample xi holds

1 = p(TP |xi) + p(FP |xi) + p(TN |xi) + p(FN |xi) (5)

since p(FN |xi) = p(TP |xi) = 0 if xi is a negative andp(TN |xi) = p(FP |xi) = 0 if xi is a positive Let us firstassume independent identically distributed (iid) data Thenthe probability of a collision pc can be written as

pc =

nminusuprodi=1

(p(FN |xi) + p(TN |xi))

=

qprodi=1

p(TN |xi)nminusuprodi=q+1

p(FN |xi)(6)

where q is a time step separating two phases in the approachIn the first phase all xi are negative so that any false positivewill lead to a turn preventing the collision Only if all negativesamples are correctly classified as negatives (true negatives)will the robot enter the second phase in which all xi arepositive Then only a complete sequence of false negativeswill lead to a collision since any true positive will lead to atimely turnWe can use Eq 6 to choose an acceptable TPR = 1minusFNRAssuming a constant velocity and frame rate it gives us theprobability of a collision For instance let us assume that therobot flies forward at 050ms with a frame rate of 30Hz it hasa minimal required detection distance of 10m and positives aredefined to be closer than 15m This leads to s = 30 samples

that all have to be classified as negatives In the case of iiddata if the TPR = 095 the probability of a collision ispc = (1minusTPR)s = 00530 asymp 931 10minus40 an extremely smallvalue With this analysis even a TPR = 030 leads to anacceptable pc asymp 225 10minus5

The analysis of the effect of false positives is straightforwardas it can be expressed in the number of spurious turns persecond or equivalently if assuming a constant velocity permeter travelled With the same scenario as above an FPR =005 will on average lead to 3 spurious turns per travelledmeter which is unacceptably high An FPR = 00017 willapproximately lead to 1 spurious turn per 10 meters

The above analysis seems to indicate that quite many falsenegatives are acceptable while there can only be very fewfalse positives However there are two complicating factorsThe first factor is that Eq 6 only holds when X can be as-sumed identically and independently distributed (iid) whichis unlikely due to the nature of the consecutive samples ofan approach towards an obstacle Some reasoning about thenature of the dependence is however possible Assuming astrong correlation between consecutive samples results in ahigher probability of xi being classified the same as x(i+1)In other words if a sample in the range 〈x1xm〉 is an FNthe chance that more samples are FNs increases Hence theexpected dependencies negatively impact performance of thesystem making Eq 6 a best case scenario

The system can be more realistically modelled as a Markovprocess as depicted in Figure 5 From this can be seen thatthe system can be split in a reducible Markov process with anabsorbing avoid state and a chain of states that leads to theabsorbing collision state The values of the transition matrixΩ can be determined from the data gathered during operationThis would allow the robot to better predict the consequencesof a chosen TPR and FPR

As an illustration of the effects of sample dependence let ussuppose a model in which each classification has a probabilityof being identical to the previous classification p(I(ciminus1 ci))If not identical the sample is classified independently Thisdependency model allows us to calculate the transition Ω45

in Figure 5 Given a previous negative classification thetransition probability to another negative classification isΩ45 = p(I(ciminus1 ci)) + (1 minus p(I(ciminus1 ci)))(1 minus TPR) Ifp(I(ciminus1 ci)) = 08 and TPR = 095 as above Ω45 = 081The probability of a collision in such a model is pc = Ω

(sminus1)45 =

18 10minus3 no longer an inconceivably small number

This leads us to the second complicating factor which is spe-cific to our SSL setup Since the robot operates on the basis ofthe ground-truth it should theoretically hardly ever encounterpositive samples Namely the robot should turn when it detectsa positive sample This implies that the uncertainty on theestimated TPR is rather high while the FPR can be estimatedbetter A potential solution to this problem is to purposefullyhave the mono-estimation robot turn earlier than the stereovision based one

7

2No obstacle

TP

1Obstacle

FP

4Obstacle

FN

3No obstacle

TN

Ω10

Ω32Ω31 Ω42

Ω34

Ω33

Ω45

4 +j

FNhellip

5Obstacle

2xFN

Ω20

6Crash

Ω(5)(6)

Ω(hellip)(2)

Ω(5+n-u)(6+n-u)

Ω52

0Avoid

Fig 5 Markov model of the probability of a collision

G Similarity with Learning from Demonstration

The core of SSL is a supervised algorithm that learns thefunction f(xf ) on the basis of supervised outputs g(xg)Normally supervised learning assumes that the training datais drawn from the same data probability distribution D as thetest data However in persistent SSL this assumption generallydoes not hold The problem is that by using control based onf the robot follows a control policy πf 6= πg and hence willinduce a different state distribution Dπf 6= Dπg On thesedifferent states no supervised outputs have been observed yetwhich typically implies an increasing difference between f andgA similar problem of inducing a different state distribution iswell-known in the area of learning from demonstration [2] [3]Actually we will show that under some mild assumptions thepersistent SSL problem studied in this paper is equivalent toan learning from demonstration problem Hence we can drawon solutions in learning from demonstration such as DAgger[2]The goal of learning from demonstration is to find a policy πthat minizes a loss function l under its induced distribution ofstates from [2]

π = argminπisinΠ

EssimDπ [l(s π)] (7)

where an optimal teacher policy πlowast is available to providetraining data for specific states sAt first sight SSL is quite different as it focuses only on thestate information that serves as input to the policy Instead ofoptimizing a policy the supervised learning in persistent SSLcan be defined as finding the function f prime that best matches thetrusted function gprime under the distribution of states induced bythe use of the thresholded version f for control

argminfisinF

ExsimDπf [l(f(x) g(x))] (8)

meaning that we perform regression of f on states that areinduced by the control policy πf which uses the thresholdedversion of f To see the similarity to Eq 7 first realize that the stereo-based policy is in this case the teacher policy πg = πlowast Forthis analysis we simplify the strategy to flying straight whenfar enough away from an obstacle and turning otherwise

πg(s) p(straight|s = 0) = 1

p(turn|s = 1) = 1(9)

where s is the state with s = 1 when g(x) gt tg and s = 0otherwise Please note that πg is a deterministic policy whichis assumed to be optimalWhen we learn a function f it generally will not give exactlythe same outputs as g Using s = f gt tf will result in thefollowing stochastic policy

πf (s) p(straight|s = 0) = TNR

p(turn|s = 0) = FPR

p(turn|s = 1) = TPR

p(straight|s = 1) = FNR

(10)

a stochastic policy which by definition is optimal πf = πg if FPR = FNR = 0 In addition then Dπf = Dπg Thusif we make the assumption that minimizing l(f(x) g(x)) alsominimizes FPR and FNR capturing any preference for oneor the other in the cost function for the behavior l(s π) thenminimizing f in Eq 8 is equivalent to minimizing the loss inEq 7The interest of the above-mentioned similarity lies in theuse of proven techniques from the field of learning fromdemonstration for training the persistent SSL system Forinstance in DAgger during the first learning iteration onlythe expert policy is learned After this iteration a mix of thelearned and expert policy is used πi = πlowast with p = βi

8

and πi = πi with p = (1 minus βi) where βi is reducedover the iterations1 The policy at iteration i depends on allprevious learned data which is aggregated over time In [2] itis proven that this strategy gives a favorable limited no-regretloss which significantly outperforms the traditional learningfrom demonstration strategy of just learning from the datadistribution of the expert policy In Section V we will showthat the main lessons from [2] also apply to persistent SSL

IV OFFLINE VISION EXPERIMENTS

In this section we perform offline vision experiments Thegoal of these experiments is to determine how good theproposed VBoW method is at estimating monocular depth andto determine the best parameter settingsTo measure the performance we use two main metrics themean square error (MSE) and the area under the curve (AUC)of a ROC curve MSE is an easy metric that can be directlyused as a loss function but in practice many situations existin which a low MSE can be achieved but still inadequate per-formance is reached for the basis of reliable MAV behavioralcontrol The AUC captures the trade-off between TPR and FPRand hence is a good indication of how good the performanceis in terms of obstacle detection

Fig. 6. Example from Dataset 1 (left, 128×96 pixels) and Dataset 2 (right, 640×480 pixels).

We use two datasets in the experiments. The first dataset is a video made on a drone during an autonomous flight, using the onboard 128 × 96 pixel stereo camera. The second dataset is a video made by manually walking with a higher quality 640 × 480 pixel stereo camera through an office cubicle, in a similar fashion as the robot should move in the later online experiments. Datasets 1 and 2 used in this section are made publicly available for download²,³. An example image from each dataset is shown in Figure 6.

Our implementation of the VBoW method has six main parameters, ranging from the number of intensity and gradient textons to the number of samples used to smooth the estimated disparity over time. An exhaustive search of parameters being out of reach, we have performed various investigations of parameter changes along a single dimension.

¹ In [2] this mixture was written as π_i = β_i π* + (1 − β_i) π̂_i, hinting at mixed control. However, it is described as a policy that queries the expert to choose controls a fraction of the time while collecting the next dataset. For this reason, and because it makes more sense given our binary control strategy, we treat this policy as a probabilistic mixture in this article.

² Dataset 1 can be downloaded and viewed from http://1drv.ms/1NOhIOh
³ Dataset 2 can be downloaded from http://1drv.ms/1NOhLtA

Table I presents a list of the final tuned parameter values. Please note that these parameter values have not only been optimized for performance. Whenever performance differences were marginal, we have chosen the parameter values that saved on computational effort. This choice was guided by our goal to perform the learning on board a computationally limited drone. Below we show a few of the results when varying a single parameter, deviating from the settings in Table I in the corresponding dimension.

Fig. 7. The texton dictionaries used. Left: for camera 1; right: for camera 2.

In this work, two types of textons are combined to form a single texton histogram: normal intensity textons, obtained with Kohonen clustering as in [28], and gradient textons, obtained similarly but based upon the gradient of the images. Gradient textures have been shown in [30] to be an important depth cue. An example dictionary of each is depicted in Figure 7. Gradient textons are shown with a color range (from blue = low to red = high). The intensity textons in Figure 7 are based on grayscale intensity pixel values.

Figure 8 shows the results for different numbers of textons ∈ {4, 8, 12, 16, 20}, always consisting of half pixel intensity and half gradient textons. From the results we can see that the performance saturates around 20 textons. Hence, we selected this combination of 10 intensity and 10 gradient textons for the experiments.

The VBoW method involves choosing a regression algorithm. In order to determine the best learning algorithm we have tested four regression algorithms, limiting the choice mainly based on the feasibility of implementing the regression algorithm on board a constrained embedded system. We have tested two non-parametric (kNN and Gaussian process regression) and two parametric (linear and shallow neural network regression) algorithms. Figure 9 presents the learning curves for a comparison of these regressors. Clearly, in most cases the kNN regression comes out best. A naïve implementation of kNN suffers from a larger training set in terms of CPU usage at test time, but after implementation on the drone this did not become a bottleneck.

The final offline results on the two datasets are quite satisfactory. They can be viewed online⁴,⁵. After a training set of roughly 4000 samples, the kNN approximates the stereo vision based disparities in the test set rather well. Given a desired TPR of 0.96, the learner has an FPR of 0.47. This should be sufficient for using the estimated disparities in control.
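The resulting VBoW pipeline can be summarized in a short sketch: sample image patches, assign each to its nearest texton, and regress the normalized texton histogram to a disparity with kNN. The sketch below uses random stand-in data and, for brevity, omits the Kohonen dictionary learning and the gradient textons:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def texton_histogram(img, dictionary, patch=5, n_samples=500, rng=None):
    """Assign randomly sampled patches to their nearest texton and return
    the normalized texton histogram of the image."""
    if rng is None:
        rng = np.random.default_rng(0)
    h, w = img.shape
    hist = np.zeros(len(dictionary))
    for _ in range(n_samples):
        y = rng.integers(0, h - patch)
        x = rng.integers(0, w - patch)
        p = img[y:y + patch, x:x + patch].ravel()
        hist[np.argmin(np.linalg.norm(dictionary - p, axis=1))] += 1
    return hist / n_samples

# Hypothetical stand-ins: a 10-word dictionary of 5x5 intensity textons,
# 100 images, and their stereo-based average disparities as labels.
rng = np.random.default_rng(1)
dictionary = rng.uniform(0, 255, size=(10, 25))
images = rng.uniform(0, 255, size=(100, 96, 128))
disparities = rng.uniform(0, 3, size=100)

X = np.array([texton_histogram(im, dictionary) for im in images])
knn = KNeighborsRegressor(n_neighbors=5).fit(X, disparities)
print(knn.predict(X[:3]))
```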

V. SIMULATION EXPERIMENTS

In Section III we argued that a persistent form of SSL is similar to learning from demonstration. The relevance of this

⁴ A video of VBoW visualizations on dataset 1 is available here: http://1drv.ms/1KdRtC1
⁵ A video of VBoW visualizations on dataset 2 is available here: http://1drv.ms/1KdRxlg

Fig. 8. MSE and AUC for different numbers of textons ∈ {4, 8, 12, 16, 20}, as a function of training set size [frames], on Dataset 1 (kNN regression learning curves) and Dataset 2. Dashed/solid lines refer to results on train/test set.

TABLE I. PARAMETER SETTINGS

Parameter                     Value
Number of intensity textons   10
Number of gradient textons    10
Patch size                    5x5
Subsampling samples           500
kNN                           k = 5
Smooth size                   4

similarity lies in the behavioral schemes used for learning. In this section we compare three learning schemes in simulation.

A. Setup

We simulate a 'flying' drone with a stereo vision camera in SmartUAV [31], an in-house developed simulator that allows for 3D rendering and simulation of the sensors and algorithms used on board the real drone. Figure 10 shows the simulated 'office room'. The room has a size of 10 × 10 meter, and the drone has an average forward speed of 0.5 m/s. All the vision and learning algorithms are exactly the same as the ones that run on board the drone in the real experiments.

Fig. 10. SmartUAV simulation environment.

Three learning schemes are designed as follows. They all start with an initial learning period in which the drone is controlled purely by means of stereo vision. In the first learning scheme, the drone continues to fly based on stereo vision for the remainder of the learning time. After learning, the drone immediately switches to monocular vision. For this reason, the first scheme is referred to as 'cold turkey'. In the second learning scheme, the drone performs a stochastic policy, selecting the stereo vision based actions with a probability β_i and the monocular based actions with a probability (1 − β_i), as was proposed in the original DAgger article [2]. In the experiments, β_i = 0.25. Finally, in the third learning scheme, the drone performs monocular based actions, with stereo vision only used to override these actions when the drone gets too close to an obstacle. Therefore, we refer to this scheme as 'training wheels'.

Fig. 9. Learning curves of the VBoW regression algorithms (kNN with k = 5, linear, GP, and neural net) on Dataset 1 and Dataset 2: MSE and AUC versus training set size [frames]. Dashed/solid lines refer to results on train/test set.

After learning, the drone uses its monocular disparity estimates for control. The stereo vision remains active only for overriding the control if the drone gets too close to a wall. During this testing period, we register the number of turns and the number of overrides. The number of overrides is a measure of the number of potential collisions. The number of turns is compared to the number of turns performed when using stereo vision, to evaluate the number of spurious turns. The initial learning period is 1 minute, the remaining learning period is 4 minutes, and the test time is 5 minutes. These times have been selected to allow a full experiment on a single battery of the real drone (see Section VI).
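The 'training wheels' logic just described can be sketched as follows; this is a simplified illustration in which the sensor callbacks and threshold values are hypothetical, and disparity is taken to grow as obstacles get closer:

```python
import random

def training_wheels_step(stereo_disparity, mono_disparity, record_sample,
                         t_f=1.5, t_safe=2.0):
    """One control step of the 'training wheels' scheme (illustrative only).
    Disparity grows as obstacles get closer: t_f triggers the normal
    monocular turn, and t_safe > t_f triggers the stereo safety override."""
    d_stereo = stereo_disparity()    # trusted cue g(x_g)
    d_mono = mono_disparity()        # learned cue f(x_f)
    record_sample(d_stereo)          # persistent SSL: keep collecting labels
    if d_stereo > t_safe:            # imminent collision: stereo overrides
        return "turn (override)"
    return "turn" if d_mono > t_f else "straight"

# Toy usage with stand-in sensor readings.
samples = []
action = training_wheels_step(lambda: random.uniform(0, 3),
                              lambda: random.uniform(0, 3),
                              samples.append)
print(action, samples)
```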

B. Results

Table II contains the results of 30 experiments with the three learning schemes and a purely stereo-vision-controlled drone. The first observation is that 'cold turkey' gives the worst results. This result was to be expected on the basis of the similarity between persistent SSL and learning from demonstration: the learned monocular distance estimates do not generalize well to the test distribution when the monocular vision is in control.

TABLE II. TEST RESULTS FOR THE THREE LEARNING SCHEMES. The average and standard deviation are given for the number of overrides and turns during the testing period. A lower number of overrides is better. The best results are shown in bold.

Method               Overrides        Turns
Pure stereo          N/A              45.6 (σ = 3.0)
1. Cold turkey       25.1 (σ = 8.2)   42.8 (σ = 3.7)
2. DAgger            10.7 (σ = 5.3)   41.4 (σ = 3.2)
3. Training wheels   4.3 (σ = 2.6)    40.4 (σ = 2.6)

The originally proposed DAgger scheme performs better, while the third learning scheme, termed 'training wheels', seems most effective. The third scheme has the lowest number of overrides of all learning schemes, with a total number of turns similar to that of a pure stereo vision run. The intuition behind this method being best is that it allows the drone to best learn from samples when the drone is beyond the normal stereo vision turning threshold. The original DAgger scheme has a larger probability to turn earlier, exploring these samples to a lesser extent. Double-sided statistical bootstrap tests [32] indicate that all differences between the learning methods are significant, with p < 0.05.

The differences between the learning schemes are well illustrated by the positions the drone visits in the room during the test phase. Figure 11 contains 'heat maps' that show the drone positions during turning (top row) and during straight flight (bottom row).


Fig. 11. Simulation heat maps, from left to right: cold turkey, DAgger, training wheels, stereo only. Top images are turn locations; lower images are the approaches.

Fig. 12. The multicopter used in the experiments.

The position distribution has been obtained by binning the positions during the test phase of all 30 runs. The results for each scheme are shown per column in Figure 11. On the right is the pure stereo vision scheme, which shows a clear border around the straight flight trajectories. It can be observed that this border is best approximated by the 'training wheels' scheme (second from the right).
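The significance test used above can be illustrated with a minimal sketch of one common variant of a two-sided bootstrap test on per-run override counts; the counts below are made-up stand-ins, not the experimental data:

```python
import numpy as np

def bootstrap_p_value(a, b, n_boot=10_000, seed=0):
    """Two-sided bootstrap test for a difference in mean override counts.
    Under the null hypothesis both schemes share one distribution, so we
    resample both groups from the pooled data and count how often the
    resampled mean difference is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_boot):
        x = rng.choice(pooled, size=len(a), replace=True)
        y = rng.choice(pooled, size=len(b), replace=True)
        hits += abs(x.mean() - y.mean()) >= observed
    return hits / n_boot

# Made-up per-run override counts for two schemes (30 runs in the paper).
dagger = np.array([10, 12, 7, 15, 9, 11, 14, 6, 13, 8])
wheels = np.array([4, 6, 3, 5, 2, 6, 7, 1, 4, 5])
print(bootstrap_p_value(dagger, wheels))  # small p -> significant difference
```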

VI. ROBOTIC EXPERIMENTS

The simulation experiments showed that the 'training wheels' setup resulted in the fewest stereo vision overrides when switching to monocular disparity estimation and control. In this section we test this online learning setup with a flying robot.

The experiment is set up in the same manner as the simulation. The robot, a Parrot AR drone 2, first explores the room with the help of stereo vision. After 1 minute of learning, the drone switches to using the monocular disparity estimates, with stereo vision running in the background for performing potential safety overrides. In this phase the drone still continues to learn. After learning for 4 to 5 minutes, the drone stops learning and enters the test phase. Again, also for the real robot, the main performance measure consists of the number of safety overrides performed by the stereo vision during the testing phase.

The AR drone 2 is not equipped with a stereo vision system as standard. Therefore, an in-house-developed 4 gram stereo vision system is used [33], which sends the raw images over USB to the ARDrone2. The grayscale stereo camera has a resolution of 128 × 96 px and is limited to 10 fps. The ARDrone2 comes with a 1 GHz ARM Cortex-A8 processor and 128 MB RAM,

and normally runs the Parrot firmware as an autopilot. For the experiments we replaced this firmware with the open source Paparazzi autopilot software [34], [35]. This allowed us to implement all vision and learning algorithms on board the drone. The length of each test depends on the battery, which due to wear has considerable variation, in the range of 8-15 minutes.

The tests are performed in an artificial room that has been constructed within a motion tracking arena. This allows us to track the trajectory of the drone and facilitates post-experiment analysis. The room is approximately 5 × 5 m, as delimited by plywood walls. In order to ensure that the stereo vision algorithm gave reliable results, we added texture in the form of duct tape to the walls. In five tests we had a textured carpet hanging over one of the walls (Figure 13, left; referred to as 'room 1'); in the other five tests it was on the floor (Figure 13, right; referred to as 'room 2').

Fig. 13. The two test flight rooms.

1) Results: Table III shows the summarized results obtained from the monocular test flights. Two main observations can be made from this table.

First, the average number of stereo overrides during the test phase is 3, which is very close to the number of overrides in simulation. The monocular behavior also has a heat map similar to that in simulation. Figure 14 shows a heat map of the drone's position during the approaches and the avoidance maneuvers (the turns). Again, the stereo based flight performs better, in the sense that the drone explores the room much more thoroughly and the turns happen consistently just before an obstacle is detected. On the other hand, especially in room 2, the monocular performance is quite good, in the sense that the system is able to explore most of the room.

Second, the selected TPR and FPR are on average 0.47 and 0.11. The TPR is rather low compared to the offline tests. However, this number is heavily influenced by the behavior based on the monocular estimator. Because the goal of the robot is to avoid obstacles slightly before the stereo ground truth recognizes them as positives, positives should hardly occur at all. Only in cases of FNs, where the estimator is slower or wrong, will positives be registered by the ground truth. Similarly, the FPR is also lower in the context of the estimate based behavior.

ROC curves of the 10 flights are shown in Figure 16. A comparison based on the numbers between the first 5 flights (room 1) and the last 5 flights (room 2) does not show any significant differences, suggesting that the system is able to learn both rooms equally well. However, when comparing the heat maps of the two situations for the monocular system in Figure 14 and Figure 15, it seems that the system shows slightly different behavior. The monocular system appears to explore room 2 better, getting closer to copying the behavior of the stereo based system.


TABLE III. TEST FLIGHT SUMMARY

                                        Room 1                          Room 2
Description                1     2     3     4     5     6     7     8     9     10     Avg
Stereo flight time [m:ss]  6:48  7:53  2:13  3:30  4:45  4:39  4:56  5:12  4:58  5:01   4:59
Mono flight time [m:ss]    3:44  8:17  6:45  7:25  4:54  10:07 4:46  9:51  5:23  5:12   6:39
Mean Square Error          0.7   1.96  1.12  0.95  0.83  0.95  0.87  1.32  1.16  1.06   1.09
False Positive Rate        0.16  0.18  0.13  0.11  0.11  0.08  0.13  0.08  0.1   0.08   0.11
True Positive Rate         0.9   0.44  0.57  0.38  0.38  0.4   0.35  0.35  0.6   0.39   0.47
Stereo approaches          29    31    8     14    19    22    22    19    20    21     20.5
Mono approaches            10    21    20    25    14    33    15    28    18    15     19.9
Auto-overrides             0     6     2     2     1     5     2     7     3     2      3
Overrides ratio [1/min]    0     0.72  0.3   0.27  0.2   0.49  0.42  0.71  0.56  0.38   0.41


Fig. 14. Room 1 (plane texture) position heat map. Top row: binned position during the avoidance turns; bottom row: during the obstacle approaches. Left column: during stereo ground truth based operation; right column: during learned monocular operation.

The experimental setup with the room in the motion tracking arena allows for a more in-depth analysis of the performance of both stereo and monocular vision. Figure 17 shows the spatial view of the flight trajectory of test 10⁶. The flight is segmented into approaches and turns, which are numbered accordingly in these figures. The color scale in Figure 17A is created by calculating the theoretically visible closest wall, based on the tracking system's measured heading and position of the drone, the known position of the walls, and the FOV angle of the camera. It is clearly visible that the stereo ground truth in Figure 17B does not capture this theoretical disparity perfectly. Especially in the middle of the room, the disparity remains high compared to the theoretical ground truth, due to noise in the stereo disparity map. The results of the monocular estimator in Figure 17C show another decrease in quality compared to the stereo ground truth.

⁶ Onboard video data of flight 10 can be viewed at http://1drv.ms/1KC81PN, an external video at https://youtu.be/lC_vj-QNy1I, and a VBoW visualization video at http://1drv.ms/1fzvicH

Fig. 15. Room 2 (carpet, natural texture) position heat map.

Fig. 16. ROC curves of the 10 test flights. Dashed/solid lines refer to results on room 1/room 2.


Fig. 17. Flight path of test 10 in room 2. Monocular flight starts from approach 23. The meaning of the color of the flight path differs per image: A) the approximated disparity based on the external tracking system; B) the measured stereo average disparity; C) the monocular estimated disparity; D) the error between B & C, with dark blue meaning zero error; E) error during FP; F) error during FN. E and F only show the monocular part of the flight.

VII. DISCUSSION

We start the discussion with an interpretation of the results from the simulation and real-world experiments, after which we discuss persistent SSL in general and provide a comparison to other machine learning techniques.

A. Interpretation of the results

Using persistent SSL, we were able to autonomously navigate our multicopter on the basis of a stereo vision camera, while training a monocular estimator on board and online. Although the monocular estimator allows the drone to continue flying and avoiding obstacles, the performance during the approximately ten-minute flights is not perfect. During monocular flight, a fairly limited number of (autonomous) stereo overrides was needed, while at the same time the robot was not exploring the room as fully as when using stereo.

Several improvements can be suggested. First, we can simply have the drone learn for a longer time, accumulating training data over multiple flights. In an extended offline test, our VBoW method shows saturation at around 6000 samples. Using additional features and more advanced learning methods may result in improved performance as training set sizes increase.

During our tests in different environments, it proved unnecessary to tune the VBoW learning algorithm parameters to a new environment, as similar performance was obtained. The learned results on the robot itself may or may not generalize to different environments; however, this is of less concern, as the robot can detect a new environment and then decide to continue the learning process if the original cue is still available.

In order to detect an inadequacy of the learned regression function, the robot can occasionally check the estimation error against the stereo ground truth. In fact, our system already does so autonomously, using its safety override. Methods for checking the performance without using the ground truth, e.g. by employing a learner that gives an estimate of uncertainty, are left for future work.

B. Deep learning

At the time of our robotic experiments, implementing state-of-the-art deep learning methods on board a flying drone was deemed infeasible due to hardware restrictions. However, in Appendix A we investigate using downsized networks in order to show that the persistent SSL method can work with deep learning as well. One of the major advantages of persistent SSL is the unprecedented amount of available training data. This amount of data will be more useful to more complex learning methods, such as deep learning methods, than to less complex but computationally efficient methods such as the VBoW method used in our experiments. Today, with the availability of strongly improved hardware such as the NVidia Jetson TX1, close-to state-of-the-art models can be trained and run on board a drone, which may significantly improve the learning results.

C. Persistent SSL in relation to other machine learning techniques

In order to place persistent SSL in the general framework of machine learning, we compare it with several techniques. An overview of this comparison is presented in Figure 18.

Fig. 18. Lay of the machine learning land. The figure arranges supervised learning, unsupervised learning, self-supervised learning, persistent self-supervised learning, learning from demonstration, and reinforcement learning along two axes, the importance of supervision and the importance of behavior, ranging from classical machine learning to classical control policy learning.

Un-/semi-/supervised learning. Unsupervised learning does not require labeled data, semi-supervised learning requires only an initial set of labeled data [36], and supervised learning requires all the data to be labeled. Internally, persistent SSL uses a standard supervised learning scheme, which greatly facilitates and speeds up learning. The typical downside of supervised learning, acquiring the labels, does not apply to SSL, since the robot continuously provides these labels itself. A major difference between the typical use of supervised learning and its use in persistent SSL is that the data for learning is generally assumed to be i.i.d. However, persistent SSL controls a behavioral component, which in turn affects the dataset obtained both at training and at test time. Operation based on ground truth induces a certain behavior that differs significantly from behavior induced by a trained estimator, even more so for an undertrained estimator.

SSL. The persistent form of SSL is set apart in the figure from 'normal' SSL, because the persistence property introduces a much more significant behavioral component to the learning. While normal SSL expects the trusted cue to remain available, persistent SSL assumes that the robot may sometimes act in

the absence of the trusted cue. This introduces the feedback-induced data bias problem, which, as we have seen, requires specific behavior strategies for best learning the robot's task.

Learning from demonstration. Learning from Demonstration (LfD) is a close relative of persistent SSL. Consider for instance teleoperation, an LfD scheme in which a (human or robot) teacher remotely operates a robot in order for it to learn demonstrated actions in its environment [17]. This can be compared to persistent SSL if we consider the teacher to be the ground truth function g(x_g) in the persistent SSL scheme. In most cases described in the literature, the teacher demonstrates actions from a control policy taken on the basis of a state, instead of just the results from a sensory cue (i.e., the state). However, LfD does contain exceptions in which the learner only records the states during demonstration, e.g., when drawing a map through a 2D representation of the world in case of a path planning mission in an outdoor robot [37]. Like persistent SSL, test time decisions taken in LfD schemes influence future observations, which may or may not be contained in known demonstrated territory. However, one key difference between LfD and persistent SSL arguably sets them apart: all LfD theory known to the authors implicitly assumes that the teacher is never the same entity as the learner. It may be that all relevant sensors are on the learner, and even that the learner's body is used to execute teacher commands (like in teleoperation), but the teacher's intelligence is always an external entity.

Reinforcement learning. Lastly, we compare persistent SSL with Reinforcement Learning (RL), which is a distinctively different technique [38]. In RL, a policy is learned using a reward function. Due to the evaluative feedback provided in RL, defining a good reward function is one of the fundamental difficulties of RL, known as reward shaping [39], [38]. Since persistent SSL uses supervised feedback, reward shaping is less of an issue in persistent SSL, only requiring the choice of a loss function between g(x_g) and f(x_f). Secondly, the initial exploration phase of RL often involves a lot of trial and error, making it a dangerous time in which a physical system may crash and be damaged. Although this particular problem is often solved by better initialization, e.g., by using LfD or by using policy search instead of value function based approaches, persistent SSL does not require an untrained initialization phase at all, as a reliable ground truth function guarantees a certain minimal correct behavior.

Persistent SSL differs from other learning techniques in the sense that no complete training dataset is needed to train the algorithm beforehand. Instead, it requires a ground truth g(x_g), which must be available online in real-time while training f(x_f), but can be switched off when f(x_f) is learned to satisfaction. This implies that the learning needs to be persistent and that the switch θ must be included in the model. Note that in cases where the environment of the robot may change, measures can be put in place to detect the output uncertainty of f(x_f). If the uncertainty goes up, the robot can switch back to using the ground truth function, and learning can then be activated again. Developing such measures is, however, left for future work.
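As an illustration of such a measure, a minimal sketch of an uncertainty-gated switch θ, using a Gaussian process regressor from scikit-learn; the features, the training data, and the threshold sigma_max are hypothetical stand-ins:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Hypothetical stand-ins: texton-histogram features X with stereo
# disparities y as the trusted supervised targets.
rng = np.random.default_rng(0)
X = rng.random((200, 10))
y = 3.0 * rng.random(200)

gp = GaussianProcessRegressor().fit(X, y)

def choose_cue(x, sigma_max=0.5):
    """The switch theta: trust the monocular estimate while the GP is
    confident; fall back to the stereo cue (and resume learning) otherwise."""
    mean, std = gp.predict(x.reshape(1, -1), return_std=True)
    return ("mono", float(mean[0])) if std[0] < sigma_max else ("stereo", None)

print(choose_cue(rng.random(10)))
```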


D. Feedback-induced data bias

The robot induces how its environment is perceived, meaning that it influences the acquired training samples based on its behavior. The problems arising from this feedback-induced data bias are known from other machine learning disciplines, such as RL and LfD [38]. In particular, Ross et al. have proposed DAgger [2] to solve a similar problem in the LfD domain, which iteratively aggregates the dataset with induced training samples and the expert's reaction to them. However, in the case of LfD, obtaining the induced training samples requires a careful and often additional setup, while in persistent SSL this functionality is inherently available. Secondly, the performance of the LfD expert (i.e., in many cases a human) is not easy to control, often reacting too late or too early. The control policy of the persistent SSL ground truth override system can, on the other hand, be very deterministic. In the case of a DAgger application with drones flying through a forest [3], it proved infeasible to reliably sample the expert in an online fashion. Acquired videos had to be processed offline by the expert, hence the need for (offline ↔ online) iterations. Moreover, an additional online human safety override interface was still necessary to prevent damage to the drone while learning. Thirdly, due to the cost of (and need for) iterative demonstration sessions, the emphasis of DAgger is on converging fast, needing as few expert sessions as possible. In persistent SSL there are no costs for using the teacher signals coming from the original sensor cue. With persistent SSL we can thus focus directly on effectively using the available amount of training samples, instead of minimizing the number of iterations as in DAgger.

Another reason why persistent SSL handles the induced training sample issue better than other state-of-the-art robot learning methods is that in persistent SSL part of the learning problem itself can easily be separated from the behavior and tested, i.e., in a traditional supervised learning setting. In our proof of concept, this has allowed us to test the learning algorithms and thoroughly investigate their limits before deployment, which helped us to identify when the behavioral influence was to blame for bad results.

VIII. CONCLUSION

We have investigated a novel Self-Supervised Learning scheme in which the supervisory signal is switched off after an initial learning period. In particular, we have studied an instance of such persistent SSL for the task of obstacle avoidance, in which the robot uses trusted stereo vision distance estimates in order to learn appearance-based monocular distance estimation. We have shown that this particular setup is very similar to learning from demonstration. This similarity has been corroborated by experiments in simulation, which showed that the worst learning strategy is to make a hard switch from stereo vision flight to mono vision flight. It is best to have the robot fly based on mono vision, using stereo vision only as 'training wheels' to take over when the robot would otherwise collide with an obstacle. The real-world robot experiments show the feasibility of the approach, giving acceptable results already with just 4-5 minutes of learning.

The findings also indicate interesting future avenues of investigation. First, and perhaps most importantly, in the 4-5 minutes of the real-world experiments the robot already experiences roughly 7000-9000 supervised learning samples. It is clear that longer learning times can lead to very large supervised datasets, which are suitable for deep learning approaches. Such approaches will likely allow the learning to extend to much larger, more varied environments. In addition, they could allow the learning to improve the resolution of the disparity estimates from a single value to a full image-size disparity map. Second, in the current experiments the robot stayed in a single environment. We mentioned that a different environment can make the learned mapping invalid, and that this can be detected by means of the ground truth. Another avenue, as studied in [10], is to use a machine learning method with an associated uncertainty value. For instance, one could use a learning method such as a Gaussian process. This can help with a further integration of the behavior with learning, for instance by tuning the forward velocity based on the certainty. Together, these avenues could allow persistent SSL to reach its full potential, significantly enhancing the robustness of robots operating in real-world environments.

APPENDIX A
DEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep convolutional neural network (CNN) on board a drone to test our proposed learning scheme. We scaled the CNN architecture such that implementing this network on the AR drone 2 hardware remained feasible. Due to the CPU restrictions of our target system, we chose to use a relatively small CNN, inspired by the Cifar10 CNN example used in Caffe. Our layers are as follows. Input: 128x128 pixel image (input images are scaled). First hidden layer: 5x5 convolution, max pooling to 3x3, 32 kernels, contrast normalization. Second hidden layer: 5x5 convolution, average pooling to 3x3, 32 kernels, contrast normalization. Third hidden layer: 5x5 convolution, average pooling to 3x3, 128 kernels. Fourth and fifth layers: fully connected. Lastly, a Euclidean loss output layer to one single output. Furthermore, we used a learning rate of 1e-6 and 0.9 momentum, and ran 10000 iterations. The networks are trained using the open source Caffe framework on a dedicated compute server with an NVidia GTX660.

In Figure 19, the learning curves of our best CNN under these constraints versus the best of our VBoW-trained algorithms are shown. The figure shows that the CNN is able to learn the problem of monocular depth estimation to a degree comparable with our VBoW method. For the expected amount of training data during a single flight, the VBoW method delivers better performance on dataset 1 and slightly worse performance on dataset 2.

Fig. 19. Learning curves on the two datasets for CNN versus VBoW: MSE and area under ROC curve versus training set size [frames]. Dashed/solid lines refer to results on train/test set.
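For reference, this architecture can be approximated in a modern framework. Below is a minimal PyTorch sketch; the single input channel, padding, pooling strides, hidden layer width, and activation placement are our assumptions, as the original was specified in Caffe:

```python
import torch
import torch.nn as nn

# The small Cifar10-style CNN described above, sketched in PyTorch. The
# single input channel, padding, pooling strides, hidden width (64), and
# ReLU placement are assumptions; the original was specified in Caffe.
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, padding=2),    # first hidden layer
    nn.MaxPool2d(3, stride=2),
    nn.LocalResponseNorm(5),                       # contrast normalization
    nn.Conv2d(32, 32, kernel_size=5, padding=2),   # second hidden layer
    nn.AvgPool2d(3, stride=2),
    nn.LocalResponseNorm(5),
    nn.Conv2d(32, 128, kernel_size=5, padding=2),  # third hidden layer
    nn.AvgPool2d(3, stride=2),
    nn.Flatten(),
    nn.LazyLinear(64),                             # fourth layer, fully connected
    nn.ReLU(),
    nn.Linear(64, 1),                              # fifth layer: one disparity output
)

x = torch.randn(1, 1, 128, 128)  # one scaled grayscale input image
print(model(x).shape)            # torch.Size([1, 1])

# Euclidean (MSE) loss and SGD with the stated learning rate and momentum.
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
```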

REFERENCES

[1] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.

[2] S. Ross, G. J. Gordon, and J. A. Bagnell, "A reduction of imitation learning and structured prediction," in 14th International Conference on Artificial Intelligence and Statistics, vol. 15, 2011.


[3] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," 2013 IEEE International Conference on Robotics and Automation, pp. 1765–1772, May 2013.

[4] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., "Stanley: The robot that won the DARPA Grand Challenge," in The 2005 DARPA Grand Challenge. Springer, 2007, pp. 1–43.

[5] D. Lieb, A. Lookingbill, and S. Thrun, "Adaptive road following using self-supervised learning and reverse optical flow," in Robotics: Science and Systems, 2005, pp. 273–280.

[6] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, "Reverse optical flow for self-supervised adaptive autonomous robot navigation," International Journal of Computer Vision, vol. 74, no. 3, pp. 287–302, 2007.

[7] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009.

[8] U. A. Muller, L. D. Jackel, Y. LeCun, and B. Flepp, "Real-time adaptive off-road vehicle navigation and terrain classification," in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2013, pp. 87410A–87410A.

[9] J. Baleia, P. Santana, and J. Barata, "On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances," Journal of Intelligent & Robotic Systems, pp. 1–20, 2015.

[10] H. Ho, C. De Wagter, B. Remes, and G. de Croon, "Optical flow for self-supervised learning of obstacle appearance," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3098–3104.

[11] T. Mori and S. Scherer, "First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles," Proceedings - IEEE International Conference on Robotics and Automation, pp. 1750–1757, 2013.

[12] J. Engel, J. Sturm, and D. Cremers, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[13] ——, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[14] F. van Breugel, K. Morgansen, and M. H. Dickinson, "Monocular distance estimation from optic flow during active landing maneuvers," Bioinspiration & Biomimetics, vol. 9, no. 2, p. 025002, 2014.

[15] G. C. de Croon, "Monocular distance estimation with optical flow maneuvers and efference copies: a stability-based strategy," Bioinspiration & Biomimetics, vol. 11, no. 1, p. 016004, 2016.

[16] G. de Croon, M. Groen, C. De Wagter, B. Remes, R. Ruijsink, and B. van Oudheusden, "Design, aerodynamics and autonomy of the DelFly," Bioinspiration & Biomimetics, vol. 7, no. 2, p. 025003, 2012.

[17] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.

[18] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.

[19] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D depth reconstruction from a single still image," International Journal of Computer Vision, vol. 76, no. 1, pp. 53–69, Aug. 2007.

[20] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, May 2009.

[21] I. Lenz, M. Gemici, and A. Saxena, "Low-power parallel algorithms for single image based obstacle avoidance in aerial robots," IEEE International Conference on Intelligent Robots and Systems, pp. 772–779, 2012.

[22] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, pp. 1–9, 2014.

[23] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM Press, 2005, pp. 593–600.

[24] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight."

[25] K. Yamauchi, M. Oota, and N. Ishii, "A self-supervised learning system for pattern recognition by sensory integration," Neural Networks, vol. 12, no. 10, pp. 1347–1358, 1999.

[26] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L. E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, "Stanley: The robot that won the DARPA Grand Challenge," Springer Tracts in Advanced Robotics, vol. 36, pp. 1–43, 2007.

[27] C. De Wagter, S. Tijmons, B. Remes, and G. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in IEEE International Conference on Robotics & Automation (ICRA), 2014, pp. 4982–4987.

[28] G. C. H. E. De Croon, E. De Weerdt, C. De Wagter, B. D. W. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.

[29] M. Varma and A. Zisserman, "Texture classification: Are filter banks necessary?" in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003.

[30] B. Wu, T. L. Ooi, and Z. J. He, "Perceiving distance accurately by a directional process of integrating ground information," Nature, vol. 428, no. 6978, pp. 73–77, 2004.

[31] C. De Wagter and M. Amelink, "Holiday50av technical paper," 2007.

[32] P. R. Cohen, "Empirical methods for artificial intelligence," IEEE Intelligent Systems, no. 6, p. 88, 1996.

[33] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 4982–4987.

[34] B. Remes, D. Hensen, F. Van Tienen, C. De Wagter, E. Van der Horst, and G. De Croon, "Paparazzi: how to make a swarm of Parrot AR drones fly autonomously based on GPS," in IMAV 2013: Proceedings of the International Micro Air Vehicle Conference and Flight Competition, Toulouse, France, 17–20 September 2013, 2013.

[35] P. Brisset, A. Drouin, M. Gorraz, P.-S. Huard, and J. Tyler, "The Paparazzi solution," 2014.

[36] X. Zhu, "Semi-supervised learning literature survey," Sciences-New York, pp. 1–59, 2007.

[37] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, "Boosting structured prediction for imitation learning," Advances in Neural Information Processing Systems (NIPS), p. 54, 2006.

[38] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, pp. 1238–1274, 2013.

[39] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.


The experimental setup with the room in the motion trackingarena allows for a more in-depth analysis of the performanceof both stereo and monocular vision Figure 17 shows thespatial view of the flight trajectory of test 10 6 The flightis segmented into approaches and turns which are numberedaccordingly in these figures The color scale in Figure 17Ais created by calculating the theoretically visible closest wallbased on the tracking the systems measured heading andposition of the drone the known position of the walls and theFOV angle of the camera It is clearly visible that the stereoground truth in Figure 17B does not capture this theoreticaldisparity perfectly Especially in the middle of the room thedisparity remains high compared to the theoretical ground truthdue to noise in the stereo disparity map The results of themonocular estimator in Figure 17C shows another decrease inquality compared to the stereo ground truth

6Onboard video data of the flight 10 can be viewed at http1drvms1KC81PN an external video at httpsyoutubelC vj-QNy1I and a VBoWvisualization video at http1drvms1fzvicH

Fig 15 Room 2 (carpet natural texture) position heat map

0 01 02 03 04 05 06 07 08 09 10

01

02

03

04

05

06

07

08

09

1

12345678910

Fig 16 ROC curves of the 10 test flights Dashed solid lines refer to resultson room 1 2

13

Fig 17 Flight path of test 10 in room 2 Monocular flight starts form approach 23 The meaning of the color of the flightpath differs per image A theapproximated disparity based on the external tracking system B the measured stereo average disparity C the monocular estimated disparity D the errorbetween BampC with dark blue meaning zero error E error during FP F error during FN E and F only show the monocular part of the flight

VII DISCUSSION

We start the discussion with an interpretation of the resultsfrom the simulation and real-world experiments after whichwe proceed by discussing persistent SSL in general andprovide a comparison to other machine learning techniques

A Interpretation of the results

Using persistent SSL we were able to autonomously navigateour multicopter on the basis of a stereo vision camera whiletraining a monocular estimator on board and online Althoughthe monocular estimator allows the drone to continue flyingand avoiding obstacles the performance during the approx-imately ten-minute flights is not perfect During monocularflight a fairly limited amount of (autonomous) stereo overrideswas needed while at the same time the robot was not fullyexploring the room like when using stereoSeveral improvements can be suggested First we can simplyhave the drone learn for a longer time accumulating trainingdata over multiple flights In an extended offline test ourVBoW method shows saturation at around 6000 samplesUsing additional features and more advanced learning methods

may result in improved performance if training set sizesincreaseDuring our tests in different environments it proved unnec-essary to tune the VBoW learning algorithm parameters toa new environment as similar performance was obtained Thelearned results on the robot itself may or may not generalize todifferent environments however this is of less concern as therobot can detect a new environment and then decide to continuethe learning process if the original cue is still availableIn order to detect an inadequacy of the learned regressionfunction the robot can occasionally check the estimation erroragainst the stereo ground truth In fact our system alreadydoes so autonomously using its safety override Methods onchecking the performance without using the ground truth egby employing a learner that gives an estimate of uncertaintyare left for future work

B Deep learningAt the time of our robotic experiments implementing state-of-the-art deep learning methods on-board a flying drone wasdeemed infeasible due to hardware restrictions However inAppendix A we investigate using downsized networks in order

14

to show that the persistent SSL method can work with deeplearning as well One of the major advantages of persistentSSL is the unprecedented amount of available training dataThis amount of data will be more useful to more complexlearning methods such as deep learning methods than to lesscomplex but computationally efficient methods such as theVBoW method used in our experiments Today with theavailability of strongly improved hardware such as the NVidiaJetson TX1 close-to state-of-the-art models can be trained andrun on-board a drone which may significantly improve thelearning results

C Persistent SSL in relation to other machine learning tech-niques

In order to place persistent SSL in the general framework ofmachine learning we compare it with several techniques Anoverview of this comparison is presented in figure 18

Importance of supervision

Persistent Self-Supervised

Learning

Unsupervised Learning

Learning from Demonstration

ReinforcementLearning

Self-Supervised Learning

SupervisedLearning

Imp

ort

ance

of

be

hav

ior

Classical machine learning

Classical control policy learning

Fig 18 Lay of the machine learning land

Un-semi-super-vised learning Unsupervised learning doesnot require labeled data semi-supervised learning requires onlyan initial set of labeled data [36] and supervised learningrequires all the data to be labeled Internally persistent SSLuses a standard supervised learning scheme which greatlyfacilitates and speeds up learning The typical downside ofsupervised learning - acquiring the labels - does not apply toSSL since the robot continuously provides these labels itselfA major difference between the typical use of supervisedlearning and its use in persistent SSL is that the data forlearning is generally assumed to be iid However persistentSSL controls a behavioral component which in turn affectsboth the dataset obtained during training as well as duringtesting time Operation based on ground truth induces a certainbehavior that differs significantly from behavior induced froma trained estimator even more so for an undertrained estimatorSSL The persistent form of SSL is set apart in the figure fromlsquonormalrsquo SSL because the persistence property introduces amuch more significant behavioral component to the learningWhile normal SSL expects the trusted cue to remain availablepersistent SSL assumes that the robot may sometimes act in

the absence of the trusted cue This introduces the feedback-induced data bias problem which as we have seen requiresspecific behavior strategies for best learning the robotrsquos taskLearning from demonstration Learning from Demonstration(LfD) or Learning from Demonstration (LfD) is a closerelative to persistent SSL Consider for instance teleoperationan LfD scheme in which a (human or robot) teacher remotelyoperates a robot in order for it to learn demonstrated actionsin its environment [17] This can be compared to persistentSSL if we consider the teacher to be the ground truth functiong(xg) in the persistent SSL scheme In most cases described inliterature the teacher shows actions from a control policy takenon the basis of a state instead of just the results from a sensorycue (ie the state) However LfD does contain exceptions inwhich the learner only records the states during demonstrationeg when drawing a map through a 2D representation of theworld in case of a path planning mission in an outdoor robot[37] Like persistent SSL test time decisions taken in LfDschemes influence future observations which may or may notbe contained in known demonstrated territory However onekey difference between LfD and persistent SSL arguably setsthem apart All LfD theory known to the authors implicitlyassumes the teacher is never the same entity as the learner Itmay be that all relevant sensors are on the learner and eventhat the learners body is used to execute teacher commands(like in teleoperation) but the teachers intelligence is alwaysan external entityReinforcement learning Lastly we compare persistent SSLwith Reinforcement Learning (RL) which is a distinctivelydifferent technique [38] In RL a policy is learned using areward function Due to the evaluative feedback provided inRL defining a good reward function is one of fundamentaldifficulties of RL known as reward shaping [39] [38] Sincepersistent SSL uses supervised feedback reward shaping isless of an issue in persistent SSL only requiring a choiceof a loss function between g(xg) and f(xf ) Secondly theinitial exploration phase of RL often infers a lot of trial-and-error making it a dangerous time in which a physicalsystem may crash and be damaged Although this particularproblem is often solved by better initialization eg by usingfor instance LfD or using policy search instead of valuefunction based approaches persistent SSL does not require anuntrained initialization phase at all as a reliable ground truthfunction guarantees a certain minimal correct behaviorPersistent SSL differs from other learning techniques in thesense that no complete training data set is needed to train thealgorithm beforehand Instead it requires a ground truth g(xg)which must be available online in real-time while trainingf(xf ) but can be switched off when f(xf ) is learned tosatisfaction This implies that learning needs to be persistentand that the switch θ must be included in the model Notethat in cases where the environment of the robot may changemeasures can be put in place to detect the output uncertaintyof f(xf ) If the uncertainty goes up the robot can switchback to using the ground truth function and learning can thenbe activated again Developing such measures is however leftfor future work

15

D Feedback-induced data bias

The robot induces how its environment is perceived meaning itinfluences the acquired training samples based on its behaviorThe problems arising from this feedback-induced data biasare known from other machine learning disciplines such asRL and LfD [38] In particular Ross et al have proposedDAgger [2] to solve a similar problem in the LfD domainwhich iteratively aggregates the dataset with induced trainingsamples and the experts reaction to it However in the caseof LfD obtaining the induced training samples requires acareful and often additional setup while in persistent SSL thisfunctionality is inherently available Secondly the performanceof the LfD expert (ie in many cases a human) is not easy tocontrol often reacting too late or too early The control policyof the persistent SSL ground truth override system can on theother hand be very deterministic In the case of a DAggerapplication with drones flying through a forest [3] it provedinfeasible to reliably sample the expert in an online fashionAcquired videos had to be processed offline by the experthence the need for (offline harr online) iterations Moreover anadditional online human safety override interface was still nec-essary to prevent damage to the drone while learning Thirdlydue to the cost of (and need for) iterative demonstrationsessions the emphasis of DAgger is on converging fast withneeding as little expert sessions as possible In persistent SSLthere are no costs for using the teacher signals coming from theoriginal sensor cue With persistent SSL we can directly focuson effectively using the available amount of training samplesinstead of minimizing the number of iterations like in DAggerAnother reason why persistent SSL handles the induced train-ing sample issue better than other state of the art robot learningmethods is that in persistent SSL part of the learning problemitself can be easily separated and tested from the behaviorie in a traditional supervised learning setting In our proofof concept this has allowed us to test the learning algorithmsand thoroughly investigate its limits before deployment whichhelped us to identify when the behavioral influence was toblame for bad results

VIII CONCLUSION

We have investigated a novel Self-Supervised Learningscheme in which the supervisory signal is switched off after aninitial learning period In particular we have studied an instanceof such persistent SSL for the task of obstacle avoidancein which the robot uses trusted stereo vision distance esti-mates in order to learn appearance-based monocular distanceestimation We have shown that this particular setup is verysimilar to learning from demonstration This similarity hasbeen corroborated by experiments in simulation which showedthat the worst learning strategy is to make a hard switch fromstereo vision flight to mono vision flight It is best to have therobot fly based on mono vision and using stereo vision only aslsquotraining wheelsrsquo to take over when the robot would otherwisecollide with an obstacle The real-world robot experimentsshow the feasibility of the approach giving acceptable resultsalready with just 4-5 minutes of learning

The findings also indicate interesting future venues of investi-gation First and perhaps most importantly in the 4-5 minutesof the real-world experiments the robot already experiencesroughly 7000 - 9000 supervised learning samples It is clearthat longer learning times can lead to very large superviseddata sets which are suitable for deep learning approachesSuch approaches likely allow the learning to extend to muchlarger more varied environments In addition they could allowthe learning to improve the resolution of disparity estimatesfrom a single value to a full image size disparity mapSecond in the current experiments the robot stayed in a singleenvironment We mentioned that a different environment canmake the learned mapping invalid and that this can be detectedby means of the ground truth Another venue as studied in[10] is to use a machine learning method with an associateduncertainty value For instance one could use a learningmethod such as a Gaussian Process This can help with afurther integration of the behavior with learning for instanceby tuning the forward velocity based on the certainty Thesevenues together could allow for persistent SSL to reach itsfull potential significantly enhancing the robustness of robotsoperating in real-world environments

APPENDIX ADEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep con-volutional neural network (CNN) on-board a drone to test ourproposed learning scheme We scaled the CNN architecture sothat implementing this network on the ARDrone 2 hardwareremained feasible Due to CPU restrictions of our target sys-tem we choose to use a relatively small CNN inspired by theCifar10 CNN example used in Caffe Our layers are as followsInput 128x128 pixels image (input images are scaled) Firsthidden layer 5x5 convolution max pooling to 3x3 32 kernelscontrast normalization Second hidden layer 5x5 convolutionaverage pooling to 3x3 32 kernels contrast normalizationThird hidden layer 5x5 convolution average pooling to 3x3128 kernels Fourth and fifth layer fully connected Lastly aneuclidean loss output layer to one single output Furthermorewe used a learning rate of 1e-6 09 momentum and used10000 iterations The networks are trained using the opensource Caffe framework on a dedicated compute server withan NVidia GTX660 In Figure 19 the learning curves of ourbest CNN under these constraints versus the best of our VBoWtrained algorithms are shown The figure shows that the CNNis able to learn the problem of monocular depth estimation toa comparable fasion as our VBoW method For the expectedamount of training data during a single flight the VBoWmethod delivers better performance for dataset 1 and slightlyworse for dataset 2


and π_i = π̂_i with p = (1 − β_i), where β_i is reduced over the iterations¹. The policy at iteration i depends on all previously learned data, which is aggregated over time. In [2] it is proven that this strategy gives a favorable limited no-regret loss, which significantly outperforms the traditional learning from demonstration strategy of learning only from the data distribution of the expert policy. In Section V we will show that the main lessons from [2] also apply to persistent SSL.
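For concreteness, a minimal sketch of this probabilistic mixture, assuming hypothetical expert_action/learned_action callables (the exact control interface is not given in the text):

```python
import random

def dagger_policy_action(expert_action, learned_action, beta_i):
    """Probabilistic mixture policy: query the expert (the trusted cue)
    with probability beta_i, otherwise act on the learned policy.
    beta_i is reduced over the learning iterations."""
    if random.random() < beta_i:
        return expert_action()
    return learned_action()
```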

IV. OFFLINE VISION EXPERIMENTS

In this section we perform offline vision experiments. The goal of these experiments is to determine how good the proposed VBoW method is at estimating monocular depth, and to determine the best parameter settings.
To measure the performance we use two main metrics: the mean square error (MSE) and the area under the curve (AUC) of a ROC curve. MSE is an easy metric that can be used directly as a loss function, but in practice there are many situations in which a low MSE is achieved while the performance is still inadequate as a basis for reliable MAV behavioral control. The AUC captures the trade-off between TPR and FPR, and hence is a good indication of the performance in terms of obstacle detection.
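As an illustration, both metrics can be computed with standard tooling. A minimal sketch, assuming arrays of stereo ground-truth disparities and monocular predictions; the obstacle threshold here is picked purely for the example:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(disp_true, disp_pred, obstacle_threshold=1.5):
    """MSE on the raw disparity regression, plus ROC AUC on the derived
    binary obstacle labels (disparity above threshold = obstacle).
    Assumes both classes occur in disp_true; the threshold is illustrative."""
    mse = float(np.mean((disp_true - disp_pred) ** 2))
    labels = disp_true > obstacle_threshold  # binarized ground truth
    auc = roc_auc_score(labels, disp_pred)   # prediction used as a score
    return mse, auc
```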

Fig. 6. Example from dataset 1 (left, 128×96 pixels) and dataset 2 (right, 640×480 pixels).

We use two datasets in the experiments. The first dataset is a video made on a drone during an autonomous flight, using the onboard 128×96 pixel stereo camera. The second dataset is a video made by manually walking with a higher-quality 640×480 pixel stereo camera through an office cubicle, in a similar fashion as the robot should move in the later online experiments. Datasets 1 and 2 as used in this section have been made publicly available for download²³. An example image from each dataset is shown in Figure 6.
Our implementation of the VBoW method has six main parameters, ranging from the number of intensity and gradient textons to the number of samples used to smooth the estimated disparity over time. An exhaustive search of parameters being out of reach, we have performed various investigations of parameter changes along a single dimension.

¹ In [2] this mixture was written as π_i = β_i π* + (1 − β_i) π̂, hinting at mixed control. However, it is described as a policy that queries the expert to choose controls a fraction of the time while collecting the next dataset. For this reason, and because it makes more sense given our binary control strategy, we refer to this policy as a probabilistic mixture in this article.
² Dataset 1 can be downloaded and viewed from http://1drv.ms/1NOhIOh
³ Dataset 2 can be downloaded from http://1drv.ms/1NOhLtA

Table I presents a list of the final tuned parameter values. Please note that these parameter values have not been optimized for performance alone: whenever performance differences were marginal, we have chosen the parameter values that saved on computational effort. This choice was guided by our goal to perform the learning on board a computationally limited drone. Below we show a few of the results obtained when varying a single parameter, deviating from the settings in Table I in the corresponding dimension.

Fig. 7. The texton dictionaries used. Left: for camera 1; right: for camera 2.

In this work two types of textons are combined to form a single texton histogram: normal intensity textons, as obtained with Kohonen clustering as in [28], and gradient textons, obtained similarly but based upon the gradient of the images. Gradient textures have been shown in [30] to be an important depth cue. An example dictionary of each is depicted in Figure 7. Gradient textons are shown with a color range (from blue = low to red = high); the intensity textons in Figure 7 are based on grayscale intensity pixel values.
Figure 8 shows the results for different numbers of textons ∈ {4, 8, 12, 16, 20}, always consisting of half pixel intensity and half gradient textons. From the results we can see that the performance saturates around 20 textons. Hence we selected this combination of 10 intensity and 10 gradient textons for the experiments.
The VBoW method involves choosing a regression algorithm. In order to determine the best learning algorithm we have tested four regression algorithms, limiting the choice mainly on the basis of the feasibility of implementing the regression algorithm on board a constrained embedded system. We have tested two non-parametric (kNN and Gaussian process regression) and two parametric (linear and shallow neural network regression) algorithms. Figure 9 presents the learning curves for a comparison of these regressors. Clearly, in most cases the kNN regression comes out best. A naïve implementation of kNN suffers from a larger training set in terms of CPU usage at test time, but after implementation on the drone this did not become a bottleneck.
The final offline results on the two datasets are quite satisfactory; they can be viewed online⁴⁵. After a training set of roughly 4000 samples, the kNN regressor approximates the stereo vision based disparities in the test set rather well. Given a desired TPR of 0.96, the learner has an FPR of 0.47. This should be sufficient for usage of the estimated disparities in control.
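To make the pipeline concrete, a minimal sketch of the VBoW feature extraction and kNN regression, using intensity textons only for brevity. Patch size, subsampling count and k follow Table I; the helper names and the texton dictionary format (flattened 5×5 patches) are our own assumptions:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def texton_histogram(img, textons, patch_size=5, n_samples=500, rng=None):
    """VBoW feature: sample patches, assign each to its nearest texton,
    and return the normalized texton-occurrence histogram."""
    rng = rng or np.random.default_rng()
    h, w = img.shape
    hist = np.zeros(len(textons))
    for _ in range(n_samples):
        y = rng.integers(0, h - patch_size)
        x = rng.integers(0, w - patch_size)
        patch = img[y:y + patch_size, x:x + patch_size].ravel()
        hist[np.argmin([np.linalg.norm(patch - t) for t in textons])] += 1
    return hist / n_samples

# Regression from texton histograms to the stereo 'ground truth' disparity:
knn = KNeighborsRegressor(n_neighbors=5)
# knn.fit(histograms_train, disparities_train)
# disparity_estimate = knn.predict(histograms_test)
```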

V. SIMULATION EXPERIMENTS

In Section III we argued that a persistent form of SSL is similar to learning from demonstration.

⁴ A video of VBoW visualizations on dataset 1 is available at http://1drv.ms/1KdRtC1
⁵ A video of VBoW visualizations on dataset 2 is available at http://1drv.ms/1KdRxlg

Fig. 8. kNN regression learning curves for different numbers of textons ∈ {4, 8, 12, 16, 20}: MSE and area under the ROC curve versus training set size (frames), for dataset 1 and dataset 2. Dashed/solid lines refer to results on the train/test set.

TABLE I. PARAMETER SETTINGS

Parameter                    Value
Number of intensity textons  10
Number of gradient textons   10
Patch size                   5×5
Subsampling samples          500
kNN                          k = 5
Smooth size                  4

The relevance of this similarity lies in the behavioral schemes used for learning. In this section we compare three learning schemes in simulation.

A. Setup
We simulate a 'flying' drone with a stereo vision camera in SmartUAV [31], an in-house developed simulator that allows for 3D rendering and simulation of the sensors and algorithms used on board the real drone. Figure 10 shows the simulated 'office room'. The room has a size of 10 × 10 m, and the drone has an average forward speed of 0.5 m/s. All the vision and learning algorithms are exactly the same as the ones that run on board the drone in the real experiments.
Three learning schemes are designed as follows. They all start with an initial learning period in which the drone is controlled purely by means of stereo vision.

Fig. 10. SmartUAV simulation environment.

In the first learning scheme, the drone continues to fly based on stereo vision for the remainder of the learning time; after learning, it immediately switches to monocular vision. For this reason the first scheme is referred to as 'cold turkey'. In the second learning scheme, the drone performs a stochastic policy, selecting the stereo vision based actions with probability β_i and the monocular based actions with probability (1 − β_i), as was proposed in the original DAgger article [2]. In the experiments, β_i = 0.25. Finally, in the third learning scheme, the drone performs monocular based actions, with stereo vision only

Fig. 9. Learning curves of the VBoW regression algorithms (kNN with k = 5, linear, Gaussian process, and neural network regression): MSE and area under the ROC curve versus training set size (frames), for dataset 1 and dataset 2. Dashed/solid lines refer to results on the train/test set.

used to override these actions when the drone gets too close to an obstacle. Therefore we refer to this scheme as 'training wheels'.
After learning, the drone uses its monocular disparity estimates for control. The stereo vision remains active, but only for overriding the control when the drone gets too close to a wall. During this testing period we register the number of turns and the number of overrides. The number of overrides is a measure of the number of potential collisions. The number of turns is compared to the number of turns performed when using stereo vision, in order to evaluate the number of spurious turns. The initial learning period is 1 minute, the remaining learning period is 4 minutes, and the test time is 5 minutes. These times have been selected to allow a full experiment on a single battery of the real drone (see Section VI).
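A minimal sketch of the 'training wheels' action selection, with illustrative names and threshold (the override trigger follows the stereo turning threshold described above, but the exact interface is ours):

```python
def training_wheels_action(mono_action, avoid_action, stereo_disparity,
                           override_threshold):
    """'Training wheels' scheme: act on the monocular policy, with the
    trusted stereo cue overriding only when a collision is imminent."""
    if stereo_disparity > override_threshold:  # obstacle too close
        return avoid_action                    # stereo-triggered avoidance turn
    return mono_action                         # learned monocular control
```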

B. Results

Table II contains the results of 30 experiments with the three learning schemes and a purely stereo-vision-controlled drone. The first observation is that 'cold turkey' gives the worst results. This result was to be expected on the basis of the similarity between persistent SSL and learning from demonstration: the learned monocular distance estimates do not generalize well to the test distribution when the monocular vision is in control.

TABLE II. TEST RESULTS FOR THE THREE LEARNING SCHEMES. THE AVERAGE AND STANDARD DEVIATION ARE GIVEN FOR THE NUMBER OF OVERRIDES AND TURNS DURING THE TESTING PERIOD; A LOWER NUMBER OF OVERRIDES IS BETTER. THE BEST RESULTS ARE SHOWN IN BOLD.

Method               Overrides        Turns
Pure stereo          N/A              45.6 (σ = 3.0)
1. Cold turkey       25.1 (σ = 8.2)   42.8 (σ = 3.7)
2. DAgger            10.7 (σ = 5.3)   41.4 (σ = 3.2)
3. Training wheels    4.3 (σ = 2.6)   40.4 (σ = 2.6)

The originally proposed DAgger scheme performs better, while the third learning scheme, termed 'training wheels', seems most effective. The third scheme has the lowest number of overrides of all learning schemes, with a total number of turns similar to that of a pure stereo vision run. The intuition behind this method being best is that it allows the drone to best learn from samples in which the drone is beyond the normal stereo vision turning threshold; the original DAgger scheme has a larger probability to turn earlier, exploring these samples to a lesser extent. Double-sided statistical bootstrap tests [32] indicate that all differences between the learning methods are significant, with p < 0.05.
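A sketch of such a double-sided bootstrap test on, e.g., the override counts of two schemes (our own minimal formulation, not necessarily the exact procedure of [32]):

```python
import numpy as np

def bootstrap_pvalue(a, b, n_boot=10_000, rng=None):
    """Two-sided bootstrap test for a difference in means: resample both
    groups from the pooled data (the null hypothesis of no difference)
    and count how often the resampled gap reaches the observed one."""
    rng = rng or np.random.default_rng()
    observed = abs(np.mean(a) - np.mean(b))
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_boot):
        ra = rng.choice(pooled, size=len(a), replace=True)
        rb = rng.choice(pooled, size=len(b), replace=True)
        count += abs(ra.mean() - rb.mean()) >= observed
    return count / n_boot
```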

Fig. 11. Simulation heat maps, from left to right: cold turkey, DAgger, training wheels, stereo only. The top images show turn locations, the lower images the approaches.

Fig. 12. The multicopter used.

The differences between the learning schemes are well illustrated by the positions the drone visits in the room during the test phase. Figure 11 contains 'heat maps' that show the drone positions during turning (top row) and during straight flight (bottom row). The position distribution has been obtained by binning the positions during the test phase of all 30 runs; the results for each scheme are shown per column in Figure 11. On the right is the pure stereo vision scheme, which shows a clear border around the straight flight trajectories. It can be observed that this border is best approximated by the 'training wheels' scheme (second from the right).
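Such heat maps amount to a 2D histogram of the tracked positions; a small sketch (the bin count and room size are parameters we choose for the example):

```python
import numpy as np

def position_heatmap(xs, ys, room_size=10.0, bins=25):
    """Bin drone (x, y) positions into a normalized 2D occupancy grid,
    one grid per scheme and per flight mode (turning / straight)."""
    grid, _, _ = np.histogram2d(xs, ys, bins=bins,
                                range=[[0.0, room_size], [0.0, room_size]])
    return grid / max(grid.sum(), 1.0)  # normalize to a spatial distribution
```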

VI. ROBOTIC EXPERIMENTS

The simulation experiments showed that the 'training wheels' setup resulted in the fewest stereo vision overrides when switching to monocular disparity estimation and control. In this section we test this online learning setup with a flying robot.
The experiment is set up in the same manner as the simulation. The robot, a Parrot AR drone 2, first explores the room with the help of stereo vision. After 1 minute of learning, the drone switches to using the monocular disparity estimates, with stereo vision running in the background for performing potential safety overrides. In this phase the drone still continues to learn. After learning for 4 to 5 minutes, the drone stops learning and enters the test phase. Again, also for the real robot, the main performance measure consists of the number of safety overrides performed by the stereo vision during the testing phase.
The AR drone 2 is not equipped with a stereo vision system as standard. Therefore, an in-house-developed 4-gram stereo vision system is used [33], which sends the raw images over USB to the AR drone 2. The grayscale stereo camera has a resolution of 128×96 px and is limited to 10 fps. The AR drone 2 comes with a 1 GHz ARM Cortex-A8 processor and 128 MB RAM,

and normally runs the Parrot firmware as an autopilot. For the experiments we replace this firmware with the open source Paparazzi autopilot software [34], [35]. This allowed us to implement all vision and learning algorithms on board the drone. The length of each test depends on the battery, which due to wear shows considerable variation, in the range of 8-15 minutes.
The tests are performed in an artificial room that has been constructed within a motion tracking arena. This allows us to track the trajectory of the drone and facilitates post-experiment analysis. The room is approximately 5 × 5 m, as delimited by plywood walls. In order to ensure that the stereo vision algorithm gave reliable results, we added texture in the form of duct tape to the walls. In five tests we had a textured carpet hanging over one of the walls (Figure 13, left; referred to as 'room 1'); in the other five tests it was on the floor (Figure 13, right; referred to as 'room 2').

Fig. 13. The two test flight rooms.

1) Results: Table III shows the summarized results obtained from the monocular test flights. Two main observations can be made from this table.
First, the average number of stereo overrides during the test phase is 3, which is very close to the number of overrides in simulation. The monocular behavior also has a heat map similar to simulation. Figure 14 shows a heat map of the drone's position during the approaches and the avoidance maneuvers (the turns). Again, the stereo based flight performs better in the sense that the drone explores the room much more thoroughly and the turns happen consistently just before an obstacle is detected. On the other hand, especially in room 2, the monocular performance is quite good in the sense that the system is able to explore most of the room.
Second, the selected TPR and FPR are on average 0.47 and 0.11. The TPR is rather low compared to the offline tests. However, this number is heavily influenced by the behavior based on the monocular estimator. Since the goal of the robot is to avoid obstacles slightly before the stereo ground truth recognizes them as positives, positives should hardly occur at all: only in cases of FNs, where the estimator is slower or wrong, will positives be registered by the ground truth. Similarly, the FPR is also lower in the context of the estimate based behavior.
ROC curves of the 10 flights are shown in Figure 16. A comparison based on the numbers between the first 5 flights (room 1) and the last 5 flights (room 2) does not show any significant differences, suggesting that the system is able to learn both rooms equally well. However, when comparing the heat maps of the two situations for the monocular system in Figure 14 and Figure 15, it seems that the system shows slightly different behavior. The monocular system appears to explore room 2 better, getting closer to copying the behavior of the stereo based system.

TABLE III. TEST FLIGHT SUMMARY

                              Room 1                            Room 2
Description                1     2     3     4     5      6      7     8     9     10    Avg
Stereo flight time (m:ss) 6:48  7:53  2:13  3:30  4:45   4:39   4:56  5:12  4:58  5:01  4:59
Mono flight time (m:ss)   3:44  8:17  6:45  7:25  4:54   10:07  4:46  9:51  5:23  5:12  6:39
Mean square error         0.70  1.96  1.12  0.95  0.83   0.95   0.87  1.32  1.16  1.06  1.09
False positive rate       0.16  0.18  0.13  0.11  0.11   0.08   0.13  0.08  0.10  0.08  0.11
True positive rate        0.90  0.44  0.57  0.38  0.38   0.40   0.35  0.35  0.60  0.39  0.47
Stereo approaches           29    31     8    14    19     22     22    19    20    21  20.5
Mono approaches             10    21    20    25    14     33     15    28    18    15  19.9
Auto-overrides               0     6     2     2     1      5      2     7     3     2     3
Overrides ratio           0.00  0.72  0.30  0.27  0.20   0.49   0.42  0.71  0.56  0.38  0.41

Fig. 14. Room 1 (plain texture) position heat map. Top row: binned position during the avoidance turns; bottom row: during the obstacle approaches. Left column: during stereo ground truth based operation; right column: during learned monocular operation.

The experimental setup, with the room inside the motion tracking arena, allows for a more in-depth analysis of the performance of both stereo and monocular vision. Figure 17 shows the spatial view of the flight trajectory of test 10⁶. The flight is segmented into approaches and turns, which are numbered accordingly in these figures. The color scale in Figure 17A is created by calculating the theoretically visible closest wall, based on the heading and position of the drone as measured by the tracking system, the known position of the walls, and the FOV angle of the camera. It is clearly visible that the stereo ground truth in Figure 17B does not capture this theoretical disparity perfectly. Especially in the middle of the room the disparity remains high compared to the theoretical ground truth, due to noise in the stereo disparity map. The results of the monocular estimator in Figure 17C show a further decrease in quality compared to the stereo ground truth.
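The 'theoretically visible closest wall' can be approximated by casting rays from the tracked pose to the known walls; a sketch under our own formulation (square room, heading and FOV in radians, function names ours; the theoretical disparity is then proportional to 1/distance):

```python
import numpy as np

def closest_visible_wall(pos, heading, fov, room=5.0, n_rays=32):
    """Distance to the nearest wall point of a square room seen within
    the camera FOV, computed by sampling rays from the tracked pose."""
    best = np.inf
    for a in heading + np.linspace(-fov / 2, fov / 2, n_rays):
        dx, dy = np.cos(a), np.sin(a)
        ts = []
        for axis, d in ((0, dx), (1, dy)):
            for wall in (0.0, room):        # wall planes x/y = 0 and room
                if abs(d) > 1e-9:
                    t = (wall - pos[axis]) / d
                    if t > 0:
                        ts.append(t)
        best = min(best, min(ts))           # first wall hit along this ray
    return best
```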

⁶ Onboard video data of flight 10 can be viewed at http://1drv.ms/1KC81PN, an external video at https://youtu.be/lC_vj-QNy1I, and a VBoW visualization video at http://1drv.ms/1fzvicH

Fig. 15. Room 2 (carpet, natural texture) position heat map.

Fig. 16. ROC curves of the 10 test flights. Dashed lines refer to results in room 1, solid lines to results in room 2.

Fig. 17. Flight path of test 10 in room 2. Monocular flight starts from approach 23. The meaning of the color of the flight path differs per image: A, the approximated disparity based on the external tracking system; B, the measured average stereo disparity; C, the monocular estimated disparity; D, the error between B and C, with dark blue meaning zero error; E, the error during FPs; F, the error during FNs. E and F only show the monocular part of the flight.

VII. DISCUSSION

We start the discussion with an interpretation of the results from the simulation and real-world experiments, after which we proceed by discussing persistent SSL in general and providing a comparison to other machine learning techniques.

A. Interpretation of the results

Using persistent SSL, we were able to autonomously navigate our multicopter on the basis of a stereo vision camera while training a monocular estimator on board and online. Although the monocular estimator allows the drone to continue flying and avoiding obstacles, the performance during the approximately ten-minute flights is not perfect. During monocular flight a fairly limited number of (autonomous) stereo overrides was needed, while at the same time the robot did not explore the room as fully as when using stereo.
Several improvements can be suggested. First, we can simply have the drone learn for a longer time, accumulating training data over multiple flights. In an extended offline test, our VBoW method shows saturation at around 6000 samples. Using additional features and more advanced learning methods may result in improved performance as training set sizes increase.

During our tests in different environments, it proved unnecessary to tune the VBoW learning algorithm parameters to a new environment, as similar performance was obtained. The learned results on the robot itself may or may not generalize to different environments; however, this is of less concern, as the robot can detect a new environment and then decide to continue the learning process if the original cue is still available. In order to detect an inadequacy of the learned regression function, the robot can occasionally check the estimation error against the stereo ground truth; in fact, our system already does so autonomously by means of its safety override. Methods for checking the performance without using the ground truth, e.g. by employing a learner that gives an estimate of its uncertainty, are left for future work.

B. Deep learning
At the time of our robotic experiments, implementing state-of-the-art deep learning methods on board a flying drone was deemed infeasible due to hardware restrictions. However, in Appendix A we investigate using downsized networks in order to show that the persistent SSL method can work with deep learning as well.

One of the major advantages of persistent SSL is the unprecedented amount of available training data. This amount of data will be more useful to more complex learning methods, such as deep learning, than to less complex but computationally efficient methods such as the VBoW method used in our experiments. Today, with the availability of strongly improved hardware such as the NVidia Jetson TX1, close-to state-of-the-art models can be trained and run on board a drone, which may significantly improve the learning results.

C. Persistent SSL in relation to other machine learning techniques

In order to place persistent SSL in the general framework of machine learning, we compare it with several techniques. An overview of this comparison is presented in Figure 18.

Fig. 18. Lay of the machine learning land. Persistent self-supervised learning is positioned relative to unsupervised learning, semi-supervised learning, supervised learning, self-supervised learning, learning from demonstration, and reinforcement learning, along an axis of importance of supervision (classical machine learning) and an axis of importance of behavior (classical control policy learning).

Un-/semi-/supervised learning: Unsupervised learning does not require labeled data, semi-supervised learning requires only an initial set of labeled data [36], and supervised learning requires all the data to be labeled. Internally, persistent SSL uses a standard supervised learning scheme, which greatly facilitates and speeds up learning. The typical downside of supervised learning, acquiring the labels, does not apply to SSL, since the robot continuously provides these labels itself. A major difference between the typical use of supervised learning and its use in persistent SSL is that the data for learning is generally assumed to be i.i.d., whereas persistent SSL controls a behavioral component, which in turn affects the dataset obtained both at training and at testing time. Operation based on the ground truth induces a behavior that differs significantly from the behavior induced by a trained estimator, even more so for an undertrained estimator.
SSL: The persistent form of SSL is set apart in the figure from 'normal' SSL, because the persistence property introduces a much more significant behavioral component to the learning. While normal SSL expects the trusted cue to remain available, persistent SSL assumes that the robot may sometimes act in the absence of the trusted cue. This introduces the feedback-induced data bias problem, which, as we have seen, requires specific behavioral strategies for best learning the robot's task.

Learning from demonstration: Learning from Demonstration (LfD) is a close relative of persistent SSL. Consider for instance teleoperation, an LfD scheme in which a (human or robot) teacher remotely operates a robot in order for it to learn demonstrated actions in its environment [17]. This can be compared to persistent SSL if we consider the teacher to be the ground truth function g(x_g) in the persistent SSL scheme. In most cases described in the literature, the teacher shows actions from a control policy, taken on the basis of a state, instead of just the results from a sensory cue (i.e. the state). However, LfD does contain exceptions in which the learner only records the states during demonstration, e.g. when drawing a map through a 2D representation of the world in the case of a path planning mission for an outdoor robot [37]. As in persistent SSL, test time decisions taken in LfD schemes influence future observations, which may or may not lie within known, demonstrated territory. However, one key difference between LfD and persistent SSL arguably sets them apart: all LfD theory known to the authors implicitly assumes that the teacher is never the same entity as the learner. It may be that all relevant sensors are on the learner, and even that the learner's body is used to execute teacher commands (as in teleoperation), but the teacher's intelligence is always an external entity.
Reinforcement learning: Lastly, we compare persistent SSL with Reinforcement Learning (RL), which is a distinctively different technique [38]. In RL, a policy is learned using a reward function. Due to the evaluative feedback provided in RL, defining a good reward function is one of the fundamental difficulties of RL, known as reward shaping [39], [38]. Since persistent SSL uses supervised feedback, reward shaping is less of an issue in persistent SSL, requiring only the choice of a loss function between g(x_g) and f(x_f). Secondly, the initial exploration phase of RL often involves a lot of trial-and-error, making it a dangerous time in which a physical system may crash and be damaged. Although this particular problem is often addressed by better initialization, e.g. by using LfD, or by using policy search instead of value-function based approaches, persistent SSL does not require an untrained initialization phase at all, as a reliable ground truth function guarantees a certain minimal correct behavior.
Persistent SSL differs from other learning techniques in the sense that no complete training data set is needed to train the algorithm beforehand. Instead, it requires a ground truth g(x_g), which must be available online, in real time, while training f(x_f), but which can be switched off when f(x_f) is learned to satisfaction. This implies that learning needs to be persistent and that the switch θ must be included in the model. Note that in cases where the environment of the robot may change, measures can be put in place to detect the output uncertainty of f(x_f). If the uncertainty goes up, the robot can switch back to using the ground truth function, and learning can then be activated again. Developing such measures is, however, left for future work.
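A schematic sketch of this switching logic, in our own formulation (f.learn and f.predict are hypothetical interfaces; an uncertainty estimate could come from, e.g., a Gaussian process learner):

```python
def persistent_ssl_step(x_g, x_f, f, g, theta, uncertainty_max=0.5):
    """One control step of the persistent SSL scheme with switch theta.
    While theta is on, the trusted cue g(x_g) both controls and supervises
    the training of f; when off, f(x_f) is in control, and the robot can
    re-enable supervision when its uncertainty grows too large.
    Returns the disparity used for control and the (possibly updated) theta."""
    if theta:
        target = g(x_g)              # trusted ground truth cue
        f.learn(x_f, target)         # supervised update, labels come for free
        return target, theta
    estimate, uncertainty = f.predict(x_f)
    if uncertainty > uncertainty_max:
        theta = True                 # fall back to the ground truth function
    return estimate, theta
```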

D. Feedback-induced data bias

The robot induces how its environment is perceived, meaning that it influences the acquired training samples through its behavior. The problems arising from this feedback-induced data bias are known from other machine learning disciplines, such as RL and LfD [38]. In particular, Ross et al. have proposed DAgger [2] to solve a similar problem in the LfD domain, by iteratively aggregating the dataset with induced training samples and the expert's reaction to them. However, in the case of LfD, obtaining the induced training samples requires a careful and often additional setup, while in persistent SSL this functionality is inherently available. Secondly, the performance of the LfD expert (i.e. in many cases a human) is not easy to control, often reacting too late or too early; the control policy of the persistent SSL ground truth override system can, on the other hand, be very deterministic. In the case of a DAgger application with drones flying through a forest [3], it proved infeasible to reliably sample the expert in an online fashion: acquired videos had to be processed offline by the expert, hence the need for (offline ↔ online) iterations. Moreover, an additional online human safety override interface was still necessary to prevent damage to the drone while learning. Thirdly, due to the cost of (and need for) iterative demonstration sessions, the emphasis of DAgger is on converging fast, with as few expert sessions as possible. In persistent SSL there are no costs for using the teacher signals coming from the original sensor cue, so we can directly focus on effectively using the available amount of training samples, instead of minimizing the number of iterations as in DAgger.
Another reason why persistent SSL handles the induced training sample issue better than other state-of-the-art robot learning methods is that in persistent SSL part of the learning problem itself can easily be separated from the behavior and tested, i.e. in a traditional supervised learning setting. In our proof of concept, this has allowed us to test the learning algorithms and thoroughly investigate their limits before deployment, which helped us to identify when the behavioral influence was to blame for bad results.

VIII. CONCLUSION

We have investigated a novel Self-Supervised Learning scheme in which the supervisory signal is switched off after an initial learning period. In particular, we have studied an instance of such persistent SSL for the task of obstacle avoidance, in which the robot uses trusted stereo vision distance estimates in order to learn appearance-based monocular distance estimation. We have shown that this particular setup is very similar to learning from demonstration. This similarity has been corroborated by experiments in simulation, which showed that the worst learning strategy is to make a hard switch from stereo vision flight to mono vision flight. It is best to have the robot fly based on mono vision, using stereo vision only as 'training wheels' to take over when the robot would otherwise collide with an obstacle. The real-world robot experiments show the feasibility of the approach, giving acceptable results already with just 4-5 minutes of learning.

The findings also indicate interesting future avenues of investigation. First, and perhaps most importantly, in the 4-5 minutes of the real-world experiments the robot already gathers roughly 7000-9000 supervised learning samples. It is clear that longer learning times can lead to very large supervised data sets, which are suitable for deep learning approaches. Such approaches will likely allow the learning to extend to much larger, more varied environments. In addition, they could allow the learning to improve the resolution of the disparity estimates, from a single value to a full image-size disparity map. Second, in the current experiments the robot stayed in a single environment. We mentioned that a different environment can make the learned mapping invalid, and that this can be detected by means of the ground truth. Another avenue, as studied in [10], is to use a machine learning method with an associated uncertainty value, for instance a Gaussian Process. This can help with a further integration of the behavior with learning, for instance by tuning the forward velocity based on the certainty. These avenues together could allow persistent SSL to reach its full potential, significantly enhancing the robustness of robots operating in real-world environments.
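As a sketch of this last idea, the snippet below pairs a Gaussian Process regressor with a simple certainty-based speed rule. It assumes scikit-learn's GaussianProcessRegressor; the feature dimensionality, thresholds, and velocity rule are invented for illustration and are not part of the presented system.

    # Illustrative sketch: uncertainty-gated behavior with a Gaussian Process.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    gp = GaussianProcessRegressor()
    X_train = np.random.rand(100, 8)   # stand-in monocular features x_f
    y_train = np.random.rand(100)      # stereo disparities g(x_g) as labels
    gp.fit(X_train, y_train)

    x = np.random.rand(1, 8)
    d_est, d_std = gp.predict(x, return_std=True)

    MAX_SPEED, STD_LIMIT = 0.5, 0.3    # hypothetical values
    if d_std[0] > STD_LIMIT:
        # Too uncertain: fall back to the stereo ground truth, resume learning.
        speed = 0.0
    else:
        # More certain estimates permit a higher forward velocity.
        speed = MAX_SPEED * (1.0 - d_std[0] / STD_LIMIT)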

APPENDIX A
DEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep convolutional neural network (CNN) on board a drone to test our proposed learning scheme. We scaled the CNN architecture so that implementing this network on the ARDrone 2 hardware remained feasible. Due to the CPU restrictions of our target system, we chose to use a relatively small CNN, inspired by the Cifar10 CNN example used in Caffe. Our layers are as follows. Input: 128x128 pixel image (input images are scaled). First hidden layer: 5x5 convolution, max pooling to 3x3, 32 kernels, contrast normalization. Second hidden layer: 5x5 convolution, average pooling to 3x3, 32 kernels, contrast normalization. Third hidden layer: 5x5 convolution, average pooling to 3x3, 128 kernels. Fourth and fifth layers: fully connected. Lastly, a Euclidean loss output layer to one single output. Furthermore, we used a learning rate of 1e-6, 0.9 momentum, and 10,000 iterations. The networks are trained using the open source Caffe framework on a dedicated compute server with an NVidia GTX660. In Figure 19 the learning curves of our best CNN under these constraints versus the best of our VBoW-trained algorithms are shown. The figure shows that the CNN is able to learn the problem of monocular depth estimation to a comparable fashion as our VBoW method. For the expected amount of training data during a single flight, the VBoW method delivers better performance for dataset 1 and slightly worse for dataset 2.
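For reference, the architecture described above can be transcribed roughly as follows. We use PyTorch here purely for readability (the original was implemented in Caffe); the input channel count, pooling strides, padding, activation placement, and fully connected width are assumptions not specified in the text.

    # Hedged sketch of the small CNN described above (not the exact Caffe model).
    import torch
    import torch.nn as nn

    class MonoDisparityCNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 32, kernel_size=5, padding=2),    # 1st: 5x5 conv, 32 kernels (grayscale input assumed)
                nn.ReLU(),                                     # activation (assumed)
                nn.MaxPool2d(kernel_size=3, stride=2),         # 'max pooling to 3x3'
                nn.LocalResponseNorm(5),                       # contrast normalization
                nn.Conv2d(32, 32, kernel_size=5, padding=2),   # 2nd: 5x5 conv, 32 kernels
                nn.ReLU(),
                nn.AvgPool2d(kernel_size=3, stride=2),         # 'average pooling to 3x3'
                nn.LocalResponseNorm(5),
                nn.Conv2d(32, 128, kernel_size=5, padding=2),  # 3rd: 5x5 conv, 128 kernels
                nn.ReLU(),
                nn.AvgPool2d(kernel_size=3, stride=2),
            )
            self.regressor = nn.Sequential(                    # 4th and 5th layers: fully connected
                nn.Flatten(),
                nn.Linear(128 * 15 * 15, 64),                  # 15x15 maps follow from the assumed strides; width 64 assumed
                nn.Linear(64, 1),                              # single disparity output
            )

        def forward(self, x):                                  # x: (N, 1, 128, 128)
            return self.regressor(self.features(x))

    model = MonoDisparityCNN()
    criterion = nn.MSELoss()                                   # Euclidean loss
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)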

REFERENCES

[1] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.

[2] S. Ross, G. J. Gordon, and J. A. Bagnell, "A reduction of imitation learning and structured prediction," in 14th International Conference on Artificial Intelligence and Statistics, vol. 15, 2011.

Fig. 19. Learning curves on two datasets for CNNs versus VBoW. Dashed and solid lines refer to results on the train and test set, respectively.

[3] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," in 2013 IEEE International Conference on Robotics and Automation, May 2013, pp. 1765–1772.

[4] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, et al., "Stanley: The robot that won the DARPA Grand Challenge," in The 2005 DARPA Grand Challenge. Springer, 2007, pp. 1–43.

[5] D. Lieb, A. Lookingbill, and S. Thrun, "Adaptive road following using self-supervised learning and reverse optical flow," in Robotics: Science and Systems, 2005, pp. 273–280.

[6] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, "Reverse optical flow for self-supervised adaptive autonomous robot navigation," International Journal of Computer Vision, vol. 74, no. 3, pp. 287–302, 2007.

[7] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009.

[8] U. A. Muller, L. D. Jackel, Y. LeCun, and B. Flepp, "Real-time adaptive off-road vehicle navigation and terrain classification," in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2013, pp. 87410A–87410A.

[9] J. Baleia, P. Santana, and J. Barata, "On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances," Journal of Intelligent & Robotic Systems, pp. 1–20, 2015.

[10] H. Ho, C. De Wagter, B. Remes, and G. de Croon, "Optical flow for self-supervised learning of obstacle appearance," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3098–3104.

[11] T. Mori and S. Scherer, "First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles," in Proceedings - IEEE International Conference on Robotics and Automation, 2013, pp. 1750–1757.

[12] J. Engel, J. Sturm, and D. Cremers, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[13] ——, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[14] F. van Breugel, K. Morgansen, and M. H. Dickinson, "Monocular distance estimation from optic flow during active landing maneuvers," Bioinspiration & Biomimetics, vol. 9, no. 2, p. 025002, 2014.

[15] G. C. de Croon, "Monocular distance estimation with optical flow maneuvers and efference copies: a stability-based strategy," Bioinspiration & Biomimetics, vol. 11, no. 1, p. 016004, 2016.

[16] G. de Croon, M. Groen, C. De Wagter, B. Remes, R. Ruijsink, and B. van Oudheusden, "Design, aerodynamics and autonomy of the DelFly," Bioinspiration & Biomimetics, vol. 7, no. 2, p. 025003, 2012.

[17] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.

[18] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.

[19] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D depth reconstruction from a single still image," International Journal of Computer Vision, vol. 76, no. 1, pp. 53–69, Aug. 2007.

[20] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, May 2009.

[21] I. Lenz, M. Gemici, and A. Saxena, "Low-power parallel algorithms for single image based obstacle avoidance in aerial robots," in IEEE International Conference on Intelligent Robots and Systems, 2012, pp. 772–779.

[22] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," in Advances in Neural Information Processing Systems, 2014, pp. 1–9.

[23] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning. ACM Press, 2005, pp. 593–600.

[24] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and learning for deliberative monocular cluttered flight."

[25] K. Yamauchi, M. Oota, and N. Ishii, "A self-supervised learning system for pattern recognition by sensory integration," Neural Networks, vol. 12, no. 10, pp. 1347–1358, 1999.

[26] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L. E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, "Stanley: The robot that won the DARPA Grand Challenge," Springer Tracts in Advanced Robotics, vol. 36, pp. 1–43, 2007.

[27] C. De Wagter, S. Tijmons, B. Remes, and G. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in IEEE International Conference on Robotics & Automation (ICRA), 2014, pp. 4982–4987.

[28] G. C. H. E. De Croon, E. De Weerdt, C. De Wagter, B. D. W. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.

[29] M. Varma and A. Zisserman, "Texture classification: Are filter banks necessary?" in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003.

[30] B. Wu, T. L. Ooi, and Z. J. He, "Perceiving distance accurately by a directional process of integrating ground information," Nature, vol. 428, no. 6978, pp. 73–77, 2004.

[31] C. De Wagter and M. Amelink, "Holiday50av technical paper," 2007.

[32] P. R. Cohen, "Empirical methods for artificial intelligence," IEEE Intelligent Systems, no. 6, p. 88, 1996.

[33] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 4982–4987.

[34] B. Remes, D. Hensen, F. Van Tienen, C. De Wagter, E. Van der Horst, and G. De Croon, "Paparazzi: how to make a swarm of Parrot AR Drones fly autonomously based on GPS," in IMAV 2013: Proceedings of the International Micro Air Vehicle Conference and Flight Competition, Toulouse, France, 17–20 September 2013, 2013.

[35] P. Brisset, A. Drouin, M. Gorraz, P.-S. Huard, and J. Tyler, "The Paparazzi solution," 2014.

[36] X. Zhu, "Semi-supervised learning literature survey," Sciences-New York, pp. 1–59, 2007.

[37] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, "Boosting structured prediction for imitation learning," in Advances in Neural Information Processing Systems (NIPS), 2006, p. 54.

[38] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, pp. 1238–1274, 2013.

[39] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015.



the absence of the trusted cue This introduces the feedback-induced data bias problem which as we have seen requiresspecific behavior strategies for best learning the robotrsquos taskLearning from demonstration Learning from Demonstration(LfD) or Learning from Demonstration (LfD) is a closerelative to persistent SSL Consider for instance teleoperationan LfD scheme in which a (human or robot) teacher remotelyoperates a robot in order for it to learn demonstrated actionsin its environment [17] This can be compared to persistentSSL if we consider the teacher to be the ground truth functiong(xg) in the persistent SSL scheme In most cases described inliterature the teacher shows actions from a control policy takenon the basis of a state instead of just the results from a sensorycue (ie the state) However LfD does contain exceptions inwhich the learner only records the states during demonstrationeg when drawing a map through a 2D representation of theworld in case of a path planning mission in an outdoor robot[37] Like persistent SSL test time decisions taken in LfDschemes influence future observations which may or may notbe contained in known demonstrated territory However onekey difference between LfD and persistent SSL arguably setsthem apart All LfD theory known to the authors implicitlyassumes the teacher is never the same entity as the learner Itmay be that all relevant sensors are on the learner and eventhat the learners body is used to execute teacher commands(like in teleoperation) but the teachers intelligence is alwaysan external entityReinforcement learning Lastly we compare persistent SSLwith Reinforcement Learning (RL) which is a distinctivelydifferent technique [38] In RL a policy is learned using areward function Due to the evaluative feedback provided inRL defining a good reward function is one of fundamentaldifficulties of RL known as reward shaping [39] [38] Sincepersistent SSL uses supervised feedback reward shaping isless of an issue in persistent SSL only requiring a choiceof a loss function between g(xg) and f(xf ) Secondly theinitial exploration phase of RL often infers a lot of trial-and-error making it a dangerous time in which a physicalsystem may crash and be damaged Although this particularproblem is often solved by better initialization eg by usingfor instance LfD or using policy search instead of valuefunction based approaches persistent SSL does not require anuntrained initialization phase at all as a reliable ground truthfunction guarantees a certain minimal correct behaviorPersistent SSL differs from other learning techniques in thesense that no complete training data set is needed to train thealgorithm beforehand Instead it requires a ground truth g(xg)which must be available online in real-time while trainingf(xf ) but can be switched off when f(xf ) is learned tosatisfaction This implies that learning needs to be persistentand that the switch θ must be included in the model Notethat in cases where the environment of the robot may changemeasures can be put in place to detect the output uncertaintyof f(xf ) If the uncertainty goes up the robot can switchback to using the ground truth function and learning can thenbe activated again Developing such measures is however leftfor future work

15

D Feedback-induced data bias

The robot induces how its environment is perceived meaning itinfluences the acquired training samples based on its behaviorThe problems arising from this feedback-induced data biasare known from other machine learning disciplines such asRL and LfD [38] In particular Ross et al have proposedDAgger [2] to solve a similar problem in the LfD domainwhich iteratively aggregates the dataset with induced trainingsamples and the experts reaction to it However in the caseof LfD obtaining the induced training samples requires acareful and often additional setup while in persistent SSL thisfunctionality is inherently available Secondly the performanceof the LfD expert (ie in many cases a human) is not easy tocontrol often reacting too late or too early The control policyof the persistent SSL ground truth override system can on theother hand be very deterministic In the case of a DAggerapplication with drones flying through a forest [3] it provedinfeasible to reliably sample the expert in an online fashionAcquired videos had to be processed offline by the experthence the need for (offline harr online) iterations Moreover anadditional online human safety override interface was still nec-essary to prevent damage to the drone while learning Thirdlydue to the cost of (and need for) iterative demonstrationsessions the emphasis of DAgger is on converging fast withneeding as little expert sessions as possible In persistent SSLthere are no costs for using the teacher signals coming from theoriginal sensor cue With persistent SSL we can directly focuson effectively using the available amount of training samplesinstead of minimizing the number of iterations like in DAggerAnother reason why persistent SSL handles the induced train-ing sample issue better than other state of the art robot learningmethods is that in persistent SSL part of the learning problemitself can be easily separated and tested from the behaviorie in a traditional supervised learning setting In our proofof concept this has allowed us to test the learning algorithmsand thoroughly investigate its limits before deployment whichhelped us to identify when the behavioral influence was toblame for bad results

VIII CONCLUSION

We have investigated a novel Self-Supervised Learningscheme in which the supervisory signal is switched off after aninitial learning period In particular we have studied an instanceof such persistent SSL for the task of obstacle avoidancein which the robot uses trusted stereo vision distance esti-mates in order to learn appearance-based monocular distanceestimation We have shown that this particular setup is verysimilar to learning from demonstration This similarity hasbeen corroborated by experiments in simulation which showedthat the worst learning strategy is to make a hard switch fromstereo vision flight to mono vision flight It is best to have therobot fly based on mono vision and using stereo vision only aslsquotraining wheelsrsquo to take over when the robot would otherwisecollide with an obstacle The real-world robot experimentsshow the feasibility of the approach giving acceptable resultsalready with just 4-5 minutes of learning

The findings also indicate interesting future venues of investi-gation First and perhaps most importantly in the 4-5 minutesof the real-world experiments the robot already experiencesroughly 7000 - 9000 supervised learning samples It is clearthat longer learning times can lead to very large superviseddata sets which are suitable for deep learning approachesSuch approaches likely allow the learning to extend to muchlarger more varied environments In addition they could allowthe learning to improve the resolution of disparity estimatesfrom a single value to a full image size disparity mapSecond in the current experiments the robot stayed in a singleenvironment We mentioned that a different environment canmake the learned mapping invalid and that this can be detectedby means of the ground truth Another venue as studied in[10] is to use a machine learning method with an associateduncertainty value For instance one could use a learningmethod such as a Gaussian Process This can help with afurther integration of the behavior with learning for instanceby tuning the forward velocity based on the certainty Thesevenues together could allow for persistent SSL to reach itsfull potential significantly enhancing the robustness of robotsoperating in real-world environments

APPENDIX ADEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep con-volutional neural network (CNN) on-board a drone to test ourproposed learning scheme We scaled the CNN architecture sothat implementing this network on the ARDrone 2 hardwareremained feasible Due to CPU restrictions of our target sys-tem we choose to use a relatively small CNN inspired by theCifar10 CNN example used in Caffe Our layers are as followsInput 128x128 pixels image (input images are scaled) Firsthidden layer 5x5 convolution max pooling to 3x3 32 kernelscontrast normalization Second hidden layer 5x5 convolutionaverage pooling to 3x3 32 kernels contrast normalizationThird hidden layer 5x5 convolution average pooling to 3x3128 kernels Fourth and fifth layer fully connected Lastly aneuclidean loss output layer to one single output Furthermorewe used a learning rate of 1e-6 09 momentum and used10000 iterations The networks are trained using the opensource Caffe framework on a dedicated compute server withan NVidia GTX660 In Figure 19 the learning curves of ourbest CNN under these constraints versus the best of our VBoWtrained algorithms are shown The figure shows that the CNNis able to learn the problem of monocular depth estimation toa comparable fasion as our VBoW method For the expectedamount of training data during a single flight the VBoWmethod delivers better performance for dataset 1 and slightlyworse for dataset 2

REFERENCES

[1] J Garcıa and F Fernandez ldquoA comprehensive survey on safe rein-forcement learningrdquo Journal of Machine Learning Research vol 16pp 1437ndash1480 2015

[2] S Ross G J Gordon and J A Bagnell ldquoA Reduction of ImitationLearning and Structured Predictionrdquo 14th International Conference onArtificial Intelligence and Statistics vol 15 2011

16

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 60000

05

1

15

2

25

3

MS

E

Dataset 1

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000

04

05

06

07

08

09

1

Trainnig set size (frames)

Are

a u

nder

RO

C c

urv

e

500 1000 1500 2000 250010

20

30

40

50

60

70

80Dataset 2

VBOW test

VBOW train

CNN test

CNN train

500 1000 1500 2000 2500093

094

095

096

097

098

099

1

Trainnig set size (frames)

Fig 19 Learning curves on two datasets for CNNs versus VBoW Dashed solid lines refer to results on traintest set

[3] S Ross N Melik-Barkhudarov K S Shankar A Wendel D DeyJ A Bagnell and M Hebert ldquoLearning monocular reactive UAVcontrol in cluttered natural environmentsrdquo 2013 IEEE InternationalConference on Robotics and Automation pp 1765ndash1772 May 2013[Online] Available httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=6630809

[4] S Thrun M Montemerlo H Dahlkamp D Stavens A Aron J DiebelP Fong J Gale M Halpenny G Hoffmann et al ldquoStanley Therobot that won the darpa grand challengerdquo in The 2005 DARPA GrandChallenge Springer 2007 pp 1ndash43

[5] D Lieb A Lookingbill and S Thrun ldquoAdaptive road following usingself-supervised learning and reverse optical flowrdquo in Robotics Scienceand Systems 2005 pp 273ndash280

[6] A Lookingbill J Rogers D Lieb J Curry and S Thrun ldquoReverseoptical flow for self-supervised adaptive autonomous robot navigationrdquoInternational Journal of Computer Vision vol 74 no 3 pp 287ndash3022007

[7] R Hadsell P Sermanet J Ben A Erkan M Scoffier K KavukcuogluU Muller and Y LeCun ldquoLearning long-range vision for autonomousoff-road drivingrdquo Journal of Field Robotics vol 26 no 2 pp 120ndash1442009 [Online] Available httpdoiwileycom101002rob20276httponlinelibrarywileycomdoi101002rob20276abstract

[8] U A Muller L D Jackel Y LeCun and B Flepp ldquoReal-time adaptiveoff-road vehicle navigation and terrain classificationrdquo in SPIE Defense

Security and Sensing International Society for Optics and Photonics2013 pp 87 410Andash87 410A

[9] J Baleia P Santana and J Barata ldquoOn exploiting haptic cues forself-supervised learning of depth-based robot navigation affordancesrdquoJournal of Intelligent amp Robotic Systems pp 1ndash20 2015

[10] H Ho C De Wagter B Remes and G de Croon ldquoOptical flow for self-supervised learning of obstacle appearancerdquo in Intelligent Robots andSystems (IROS) 2015 IEEERSJ International Conference on IEEE2015 pp 3098ndash3104

[11] T Mori and S Scherer ldquoFirst results in detecting and avoiding frontalobstacles from a monocular camera for micro unmanned aerial vehi-clesrdquo Proceedings - IEEE International Conference on Robotics andAutomation pp 1750ndash1757 2013

[12] J Engel J Sturm and D Cremers ldquoScale-aware navigation of a low-cost quadrocopter with a monocular camerardquo Robotics and AutonomousSystems vol 62 no 11 pp 1646ndash1656 2014 [Online] AvailablehttpwwwsciencedirectcomsciencearticlepiiS0921889014000566

[13] mdashmdash ldquoScale-aware navigation of a low-cost quadrocopter with amonocular camerardquo Robotics and Autonomous Systems vol 62 no 11pp 1646ndash1656 2014

[14] F van Breugel K Morgansen and M H Dickinson ldquoMonoculardistance estimation from optic flow during active landing maneuversrdquoBioinspiration amp biomimetics vol 9 no 2 p 025002 2014

17

[15] G C de Croon ldquoMonocular distance estimation with optical flow ma-neuvers and efference copies a stability-based strategyrdquo Bioinspirationamp biomimetics vol 11 no 1 p 016004 2016

[16] G de Croon M Groen C De Wagter B Remes R Ruijsink andB van Oudheusden ldquoDesign aerodynamics and autonomy of theDelFlyrdquo Bioinspiration amp Biomimetics vol 7 no 2 p 025003 2012

[17] B D Argall S Chernova M Veloso and B Browning ldquoA surveyof robot learning from demonstrationrdquo Robotics and AutonomousSystems vol 57 no 5 pp 469ndash483 2009 [Online] Availablehttpdxdoiorg101016jrobot200810024

[18] D Hoiem A a Efros and M Hebert ldquoAutomatic photo pop-uprdquo ACMTransactions on Graphics vol 24 no 3 p 577 2005

[19] A Saxena S H Chung and A Y Ng ldquo3-D Depth Reconstructionfrom a Single Still Imagerdquo International Journal of ComputerVision vol 76 no 1 pp 53ndash69 Aug 2007 [Online] Availablehttplinkspringercom101007s11263-007-0071-y

[20] A Saxena M Sun and A Y Ng ldquoMake3D learning 3D scenestructure from a single still imagerdquo IEEE transactions on patternanalysis and machine intelligence vol 31 no 5 pp 824ndash40 May 2009[Online] Available httpwwwncbinlmnihgovpubmed19299858

[21] I Lenz M Gemici and A Saxena ldquoLow-power parallel algorithmsfor single image based obstacle avoidance in aerial robotsrdquo IEEEInternational Conference on Intelligent Robots and Systems pp772ndash779 2012 [Online] Available httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=6386146

[22] D Eigen C Puhrsch and R Fergus ldquoDepth map prediction from asingle image using a multi-scale deep networkrdquo Advances in NeuralInformation pp 1ndash9 2014 [Online] Available httppapersnipsccpaper5539-tree-structured-gaussian-process-approximations

[23] J Michels A Saxena and A Y Ng ldquoHigh speed obstacle avoidanceusing monocular vision and reinforcement learningrdquo in Proceedingsof the 22nd International Conference on Machine Learningvol 3 no Suppl 1 ACM Press 2005 pp 593ndash600 [Online]Available httpportalacmorgcitationcfmdoid=11023511102426$delimiterrdquo026E30F$nhttpdlacmorgcitationcfmid=1102426httpportalacmorgcitationcfmid=1102426

[24] D Dey K S Shankar S Zeng R Mehta M T Agcayazi C EriksenS Daftry M Hebert and J A Bagnell ldquoVision and Learning forDeliberative Monocular Cluttered Flightrdquo

[25] K Yamauchi M Oota and N Ishii ldquoA self-supervised learningsystem for pattern recognition by sensory integrationrdquo Neural Networksvol 12 no 10 pp 1347ndash1358 1999

[26] S Thrun M Montemerlo H Dahlkamp D Stavens A Aron J DiebelP Fong J Gale M Halpenny G Hoffmann K Lau C OakleyM Palatucci V Pratt P Stang S Strohband C Dupont L E Jen-drossek C Koelen C Markey C Rummel J van Niekerk E JensenP Alessandrini G Bradski B Davies S Ettinger A Kaehler A Ne-fian and P Mahoney ldquoStanley The robot that won the DARPA GrandChallengerdquo Springer Tracts in Advanced Robotics vol 36 pp 1ndash432007

[27] C De Wagter S Tijmons B Remes and G de Croon ldquoAutonomousFlight of a 20-gram Flapping Wing MAV with a 4-gram OnboardStereo Vision Systemrdquo in IEEE International Conference on Roboticsamp Automation (ICRA) no Section IV 2014 pp 4982ndash4987

[28] G C H E De Croon E De Weerdt C De Wagter B D W Remesand R Ruijsink ldquoThe appearance variation cue for obstacle avoidancerdquoIEEE Transactions on Robotics vol 28 no 2 pp 529ndash534 2012

[29] M Varma and A Zisserman ldquoTexture Classification Are Filter BanksNecessary 1 Introduction 2 A review of the VZ classifierrdquo ComputerVision and Pattern Recognition 2003 Proceedings 2003 IEEE ComputerSociety Conference on 2003

[30] B Wu T L Ooi and Z J He ldquoPerceiving distance accurately bya directional process of integrating ground informationrdquo Nature vol428 no 6978 pp 73ndash77 2004

[31] C De Wagter and M Amelink ldquoHoliday50av technical paperrdquo 2007

[32] P R Cohen ldquoEmpirical methods for artificial intelligencerdquo IEEEIntelligent Systems no 6 p 88 1996

[33] C De Wagter S Tijmons B D Remes and G C de CroonldquoAutonomous flight of a 20-gram flapping wing mav with a 4-gramonboard stereo vision systemrdquo in Robotics and Automation (ICRA)2014 IEEE International Conference on IEEE 2014 pp 4982ndash4987

[34] B Remes D Hensen F Van Tienen C De Wagter E Van der Horstand G De Croon ldquoPaparazzi how to make a swarm of parrot ar dronesfly autonomously based on gpsrdquo in IMAV 2013 Proceedings of theInternational Micro Air Vehicle Conference and Flight CompetitionToulouse France 17-20 September 2013 2013

[35] P Brisset A Drouin M Gorraz P-s Huard P Brisset A DrouinM Gorraz P-s Huard J T T Pa P Brisset A Drouin M GorrazP-s Huard and J Tyler ldquoThe Paparazzi Solutionrdquo 2014

[36] X Zhu ldquoSemi-Supervised Learning Literature Surveyrdquo Sciences-NewYork pp 1ndash59 2007 [Online] Available httpciteseerxistpsueduviewdocdownloaddoi=10111462352amprep=rep1amptype=pdf

[37] N Ratliff D Bradley J A Bagnell and J Chestnutt ldquoBoostingStructured Prediction for Imitation Learning for Imitation LearningrdquoAdvances in Neural Information Processing Systems (NIPS) p 542006

[38] J Kober J a Bagnell and J Peters ldquoReinforcement learningin robotics A surveyrdquo The International Journal of RoboticsResearch vol 32 pp 1238ndash1274 2013 [Online] Availablehttpijrsagepubcomcgidoi1011770278364913495721

[39] M L Littman ldquoReinforcement learning improves behaviour fromevaluative feedbackrdquo Nature vol 521 no 7553 pp 445ndash4512015 [Online] Available httpwwwnaturecomdoifinder101038nature14540

  • I Introduction
  • II Related work
    • II-A Monocular navigation
      • II-A1 Appearance-based navigation without distance estimation
      • II-A2 Offline monocular distance learning
        • II-B Self-supervised learning
          • III Methodology overview
            • III-A Persistent SSL principle
            • III-B Stereo-to-mono proof of concept
            • III-C Stereo vision processing
            • III-D Monocular disparity estimation
            • III-E Behavior
            • III-F Performance
            • III-G Similarity with Learning from Demonstration
              • IV Offline Vision Experiments
              • V Simulation Experiments
                • V-A Setup
                • V-B Results
                  • VI Robotic Experiments
                    • VI-1 Results
                      • VII Discussion
                        • VII-A Interpretation of the results
                        • VII-B Deep learning
                        • VII-C Persistent SSL in relation to other machine learning techniques
                        • VII-D Feedback-induced data bias
                          • VIII Conclusion
                          • Appendix A Deep neural network results
                          • References

10

1000 2000 3000 4000 5000 60000

05

1

15

2

25

MS

E

Regression learning curves minus Dataset 1

kNN (k=5)lineargpneural net

1000 2000 3000 4000 5000 600002

03

04

05

06

07

08

09

1

AU

C

Training set size [frames]

500 1000 1500 2000 25000

20

40

60

80

100

120

MS

E

Dataset 2

500 1000 1500 2000 2500

065

07

075

08

085

09

095

1

AU

C

Training set size [frames]

Fig 9 VBoW regression algorithms learning curves Dashed solid lines refer to results on traintest set

used to override these actions when the drone gets too closeto an obstacle Therefore we refer to this scheme as lsquotrainingwheelsrsquoAfter learning the drone will use its monocular disparityestimates for control The stereo vision remains active onlyfor overriding the control if the drone gets too close to a wallDuring this testing period we register the number of turns andthe number of overrides The number of overrides is a measureof the number of potential collisions The number of turns iscompared to the number of turns performed when using stereovision to evaluate the number of spurious turns The initiallearning period is 1 minute the remaining learning period is4 minutes and the test time is 5 minutes These times havebeen selected to allow a full experiment on a single battery ofthe real drone (see Section VI)

B Results

Table II contains the results of 30 experiments with the threelearning schemes and a purely stereo-vision-controlled droneThe first observation is that lsquocold turkeyrsquo gives the worstresults This result was to be expected on the basis of the simi-larity between persistent SSL and learning from demonstrationthe learned monocular distance estimates do not generalizewell to the test distribution when the monocular vision is

TABLE II TEST RESULTS FOR THE THREE LEARNING SCHEMES THEAVERAGE AND STANDARD DEVIATION ARE GIVEN FOR THE NUMBER OF

OVERRIDES AND TURNS DURING THE TESTING PERIOD A LOWER NUMBEROF OVERRIDES IS BETTER IN THE TABLE THE BEST RESULTS ARE SHOWN

IN BOLD

Method Overrides TurnsPure stereo NA 456 (σ = 30 )1 Cold turkey 251 (σ = 82 ) 428 (σ = 37 )2 DAgger 107 (σ = 53 ) 414 (σ = 32 )3 Training wheels 43 (σ = 26 ) 404 (σ = 26 )

in control The originally proposed DAgger scheme performsbetter while the third learning scheme termed lsquotraining wheelsrsquoseems most effective The third scheme has the lowest numberof overrides of all learning schemes with a similar totalnumber of turns as a pure stereo vision run The intuitionbehind this method being best is that it allows the drone tobest learn from samples when the drone is beyond the normalstereo vision turning threshold The original DAgger schemehas a larger probability to turn earlier exploring these samplesto a lesser extent Double-sided statistical bootstrap tests [32]indicate that all differences between the learning methods aresignificant with p lt 005The differences between the learning schemes are well illus-trated by the positions the drone visits in the room during thetest phase Figure 11 contains lsquoheat mapsrsquo that show the drone

11

Fig 11 Simulation heatmaps from left to right cold turkey DAgger trainingwheels stereo only Top images are turn locations lower images are theapproaches

Fig 12 The used multicopter

positions during turning (top row) and during straight flight(bottom row) The position distribution has been obtained bybinning the positions during the test phase of all 30 runs Theresults for each scheme are shown per column in Figure 11Right is the pure stereo vision scheme which shows a clearborder around the straight flight trajectories It can be observedthat this border is best approximated by the lsquotraining wheelsrsquoscheme (second from the right)

VI ROBOTIC EXPERIMENTS

The simulation experiments showed that the lsquotraining wheelsrsquosetup resulted in the fewest stereo vision overrides whenswitching to monocular disparity estimation and control Inthis section we test this online learning setup with a flyingrobotThe experiment is set up in the same manner as the simulationThe robot a Parrot AR drone 2 first explores the room withthe help of stereo vision After 1 minute of learning the droneswitches to using the monocular disparity estimates with stereovision running in the background for performing potentialsafety overrides In this phase the drone still continues tolearn After learning 4 to 5 minutes the drone stops learningand enters the test phase Again also for the real robot themain performance measure consists of the number of safetyoverrides performed by the stereo vision during the testingphaseThe AR drone 2 is standard not equipped with a stereo visionsystem Therefore an in-house-developed 4 gram stereo visionsystem is used [33] which sends the raw images over USB tothe ARDrone2 The grayscale stereo camera has a resolutionof 128times96 px and is limited to 10 fps The ARDrone2 comeswith a 1GHz ARM cortex A8 processor and 128MB RAM

and normally runs the Parrot firmware as an autopilot Forthe experiments we replace this firmware with the the opensource Paparazzi autopilot software [34] [35] This allowedus to implement all vision and learning algorithms on boardthe drone The length of each test is dependent on the batterywhich due to wear has considerable variation in the range of8-15 minutesThe tests are performed in an artificial room that has beenconstructed within a motion tracking arena This allows us totrack the trajectory of the drone and facilitates post-experimentanalysis The room is approximately 5 times 5 m as delimitedby plywood walls In order to ensure that the stereo visionalgorithm gave reliable results we added texture in the formof duct-tape to the walls In five tests we had a textured carpethanging over one of the walls (Figure 13 left referred to aslsquoroom 1rsquo) in the other five tests it was on the floor (Figure 13right referred to as lsquoroom 2rsquo)

Fig. 13. Two test flight rooms.

1) Results: Table III shows the summarized results obtained from the monocular test flights. Two main observations can be made from this table. First, the average number of stereo overrides during the test phase is 3, which is very close to the number of overrides in simulation. The monocular behavior also has a similar heat map to simulation. Figure 14 shows a heat map of the drone’s position during the approaches and the avoidance maneuvers (the turns). Again, the stereo based flight performs better, in the sense that the drone explores the room much more thoroughly and the turns happen consistently just before an obstacle is detected. On the other hand, especially in room 2, the monocular performance is quite good, in the sense that the system is able to explore most of the room.
Second, the selected TPR and FPR are on average 0.47 and 0.11. The TPR is rather low compared to the offline tests. However, this number is heavily influenced by the behavior induced by the monocular estimator. Since the goal of the robot is to avoid obstacles slightly before the stereo ground truth recognizes them as positives, positives should hardly occur at all. Only in cases of FNs, where the estimator is slower or wrong, will positives be registered by the ground truth. Similarly, the FPR is also lower in the context of the estimate-based behavior.
ROC curves of the 10 flights are shown in Figure 16. A comparison based on the numbers between the first 5 flights (room 1) and the last 5 flights (room 2) does not show any significant differences, suggesting that the system is able to learn both rooms equally well. However, when comparing the heat maps of the two situations for the monocular system in Figure 14 and Figure 15, it seems that the system shows slightly different behavior. The monocular system appears to explore room 2 better, getting closer to


TABLE III. TEST FLIGHT SUMMARY

                                        Room 1                              Room 2
Description                1     2     3     4     5      6     7     8     9     10     Avg
Stereo flight time (m:ss)  6:48  7:53  2:13  3:30  4:45   4:39  4:56  5:12  4:58  5:01   4:59
Mono flight time (m:ss)    3:44  8:17  6:45  7:25  4:54   10:07 4:46  9:51  5:23  5:12   6:39
Mean Square Error          0.70  1.96  1.12  0.95  0.83   0.95  0.87  1.32  1.16  1.06   1.09
False Positive Rate        0.16  0.18  0.13  0.11  0.11   0.08  0.13  0.08  0.10  0.08   0.11
True Positive Rate         0.90  0.44  0.57  0.38  0.38   0.40  0.35  0.35  0.60  0.39   0.47
Stereo approaches          29    31    8     14    19     22    22    19    20    21     20.5
Mono approaches            10    21    20    25    14     33    15    28    18    15     19.9
Auto-overrides             0     6     2     2     1      5     2     7     3     2      3
Overrides ratio            0     0.72  0.30  0.27  0.20   0.49  0.42  0.71  0.56  0.38   0.41

copying the behavior of the stereo-based system.
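For clarity, the MSE, TPR, and FPR figures in Table III can be computed from the logged per-frame monocular estimates and stereo ground truth roughly as follows. This is a minimal sketch under our own assumptions: a single obstacle-detection threshold applied to both signals, and hypothetical variable names.

```python
import numpy as np

def flight_metrics(mono_est, stereo_gt, threshold):
    """MSE, TPR and FPR of monocular disparity estimates vs. stereo ground truth.

    mono_est, stereo_gt: per-frame average disparities; threshold: the disparity
    above which a frame counts as an obstacle detection (a 'positive').
    """
    mono_est, stereo_gt = np.asarray(mono_est), np.asarray(stereo_gt)
    mse = np.mean((mono_est - stereo_gt) ** 2)
    pred, truth = mono_est > threshold, stereo_gt > threshold
    tpr = np.sum(pred & truth) / max(np.sum(truth), 1)    # true positive rate
    fpr = np.sum(pred & ~truth) / max(np.sum(~truth), 1)  # false positive rate
    return mse, tpr, fpr

# toy example with three frames
print(flight_metrics([0.8, 1.6, 1.2], [1.0, 1.5, 1.4], threshold=1.3))
```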

Fig. 14. Room 1 (plain texture) position heat map. Top row is the binned position during the avoidance turns, bottom row during the obstacle approaches; left column during stereo ground truth based operation, right column during learned monocular operation.

The experimental setup with the room in the motion tracking arena allows for a more in-depth analysis of the performance of both stereo and monocular vision. Figure 17 shows the spatial view of the flight trajectory of test 10 (see footnote 6). The flight is segmented into approaches and turns, which are numbered accordingly in these figures. The color scale in Figure 17A is created by calculating the theoretically visible closest wall, based on the tracking system’s measured heading and position of the drone, the known position of the walls, and the FOV angle of the camera. It is clearly visible that the stereo ground truth in Figure 17B does not capture this theoretical disparity perfectly. Especially in the middle of the room, the disparity remains high compared to the theoretical ground truth, due to noise in the stereo disparity map. The results of the monocular estimator in Figure 17C show another decrease in quality compared to the stereo ground truth.

6. Onboard video data of flight 10 can be viewed at http://1drv.ms/1KC81PN, an external video at https://youtu.be/lC_vj-QNy1I, and a VBoW visualization video at http://1drv.ms/1fzvicH.
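The ‘theoretically visible closest wall’ computation behind Figure 17A can be sketched as follows: given the drone’s tracked pose, the known square room, and the camera’s field of view, find the nearest wall intersection among rays within the FOV and convert that distance to disparity via the pinhole stereo relation. This is a hedged reconstruction: the 5 × 5 m room is from the paper, but the FOV, focal length, baseline, and ray sampling are placeholder values of our own.

```python
import numpy as np

ROOM = 5.0                     # room is approximately 5 x 5 m; walls at 0 and 5
FOV = np.radians(60)           # assumed horizontal FOV of the stereo camera
F_PX, BASELINE = 140.0, 0.06   # assumed focal length (px) and baseline (m)

def dist_to_wall(x, y, angle):
    """Distance from (x, y) along heading `angle` to the first wall crossing."""
    dx, dy = np.cos(angle), np.sin(angle)
    ts = []
    for wall, d, p in ((0.0, dx, x), (ROOM, dx, x), (0.0, dy, y), (ROOM, dy, y)):
        if abs(d) > 1e-9:
            t = (wall - p) / d
            if t > 0:
                ts.append(t)
    return min(ts)  # inside a convex room, the first crossing is a real wall

def theoretical_disparity(x, y, heading, n_rays=21):
    """Disparity of the closest wall visible within the camera's FOV."""
    angles = heading + np.linspace(-FOV / 2, FOV / 2, n_rays)
    d = min(dist_to_wall(x, y, a) for a in angles)
    return F_PX * BASELINE / d  # pinhole stereo: disparity = f * B / depth

print(theoretical_disparity(2.5, 2.5, 0.0))  # drone at room center, facing +x
```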

Fig. 15. Room 2 (carpet, natural texture) position heat map.


Fig. 16. ROC curves of the 10 test flights. Dashed / solid lines refer to results on room 1 / 2.


Fig. 17. Flight path of test 10 in room 2. Monocular flight starts from approach 23. The meaning of the color of the flight path differs per image: A, the approximated disparity based on the external tracking system; B, the measured stereo average disparity; C, the monocular estimated disparity; D, the error between B & C, with dark blue meaning zero error; E, error during FPs; F, error during FNs. E and F only show the monocular part of the flight.

VII. DISCUSSION

We start the discussion with an interpretation of the results from the simulation and real-world experiments, after which we proceed by discussing persistent SSL in general and provide a comparison to other machine learning techniques.

A. Interpretation of the results

Using persistent SSL, we were able to autonomously navigate our multicopter on the basis of a stereo vision camera while training a monocular estimator on board and online. Although the monocular estimator allows the drone to continue flying and avoiding obstacles, the performance during the approximately ten-minute flights is not perfect. During monocular flight, a fairly limited number of (autonomous) stereo overrides was needed, while at the same time the robot was not exploring the room as fully as when using stereo.
Several improvements can be suggested. First, we can simply have the drone learn for a longer time, accumulating training data over multiple flights. In an extended offline test, our VBoW method shows saturation at around 6000 samples. Using additional features and more advanced learning methods

may result in improved performance as training set sizes increase.
During our tests in different environments, it proved unnecessary to tune the VBoW learning algorithm parameters to a new environment, as similar performance was obtained. The learned results on the robot itself may or may not generalize to different environments; however, this is of less concern, as the robot can detect a new environment and then decide to continue the learning process if the original cue is still available.
In order to detect an inadequacy of the learned regression function, the robot can occasionally check the estimation error against the stereo ground truth. In fact, our system already does so autonomously by means of its safety override. Methods for checking the performance without using the ground truth, e.g. by employing a learner that gives an estimate of uncertainty, are left for future work.

B. Deep learning

At the time of our robotic experiments, implementing state-of-the-art deep learning methods on board a flying drone was deemed infeasible due to hardware restrictions. However, in Appendix A we investigate using downsized networks in order


to show that the persistent SSL method can work with deep learning as well. One of the major advantages of persistent SSL is the unprecedented amount of available training data. This amount of data will be more useful to more complex learning methods, such as deep learning methods, than to less complex but computationally efficient methods, such as the VBoW method used in our experiments. Today, with the availability of strongly improved hardware such as the NVidia Jetson TX1, close-to state-of-the-art models can be trained and run on board a drone, which may significantly improve the learning results.

C. Persistent SSL in relation to other machine learning techniques

In order to place persistent SSL in the general framework of machine learning, we compare it with several techniques. An overview of this comparison is presented in Figure 18.

Fig. 18. Lay of the machine learning land: learning techniques arranged along axes of ‘importance of supervision’ and ‘importance of behavior’, ranging from classical machine learning (unsupervised, semi-supervised, supervised, and self-supervised learning) to classical control policy learning (persistent self-supervised learning, learning from demonstration, and reinforcement learning).

Un-/semi-/supervised learning: Unsupervised learning does not require labeled data, semi-supervised learning requires only an initial set of labeled data [36], and supervised learning requires all the data to be labeled. Internally, persistent SSL uses a standard supervised learning scheme, which greatly facilitates and speeds up learning. The typical downside of supervised learning - acquiring the labels - does not apply to SSL, since the robot continuously provides these labels itself. A major difference between the typical use of supervised learning and its use in persistent SSL is that the data for learning is generally assumed to be i.i.d. However, persistent SSL controls a behavioral component, which in turn affects the dataset obtained during training as well as at test time. Operation based on the ground truth induces a behavior that differs significantly from the behavior induced by a trained estimator, even more so for an undertrained estimator.
SSL: The persistent form of SSL is set apart in the figure from ‘normal’ SSL because the persistence property introduces a much more significant behavioral component to the learning. While normal SSL expects the trusted cue to remain available, persistent SSL assumes that the robot may sometimes act in

the absence of the trusted cue. This introduces the feedback-induced data bias problem, which, as we have seen, requires specific behavior strategies for best learning the robot’s task.
Learning from demonstration: Learning from Demonstration (LfD) is a close relative of persistent SSL. Consider for instance teleoperation, an LfD scheme in which a (human or robot) teacher remotely operates a robot in order for it to learn demonstrated actions in its environment [17]. This can be compared to persistent SSL if we consider the teacher to be the ground truth function g(xg) in the persistent SSL scheme. In most cases described in the literature, the teacher shows actions from a control policy taken on the basis of a state, instead of just the results from a sensory cue (i.e., the state). However, LfD does contain exceptions in which the learner only records the states during demonstration, e.g. when drawing a map through a 2D representation of the world in case of a path planning mission for an outdoor robot [37]. Like in persistent SSL, test-time decisions taken in LfD schemes influence future observations, which may or may not be contained in known, demonstrated territory. However, one key difference between LfD and persistent SSL arguably sets them apart: all LfD theory known to the authors implicitly assumes that the teacher is never the same entity as the learner. It may be that all relevant sensors are on the learner, and even that the learner’s body is used to execute teacher commands (as in teleoperation), but the teacher’s intelligence is always an external entity.
Reinforcement learning: Lastly, we compare persistent SSL with Reinforcement Learning (RL), which is a distinctively different technique [38]. In RL, a policy is learned using a reward function. Due to the evaluative feedback provided in RL, defining a good reward function is one of the fundamental difficulties of RL, known as reward shaping [39], [38]. Since persistent SSL uses supervised feedback, reward shaping is less of an issue in persistent SSL, only requiring the choice of a loss function between g(xg) and f(xf). Secondly, the initial exploration phase of RL often involves a lot of trial and error, making it a dangerous time in which a physical system may crash and be damaged. Although this particular problem is often mitigated by better initialization, e.g. by using LfD, or by using policy search instead of value-function-based approaches, persistent SSL does not require an untrained initialization phase at all, as a reliable ground truth function guarantees a certain minimal correct behavior.
Persistent SSL differs from other learning techniques in the sense that no complete training data set is needed to train the algorithm beforehand. Instead, it requires a ground truth g(xg), which must be available online in real time while training f(xf), but which can be switched off when f(xf) is learned to satisfaction. This implies that learning needs to be persistent and that the switch θ must be included in the model. Note that in cases where the environment of the robot may change, measures can be put in place to detect the output uncertainty of f(xf). If the uncertainty goes up, the robot can switch back to using the ground truth function, and learning can then be activated again. Developing such measures is, however, left for future work.
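The role of the switch θ can be made concrete with a small sketch: the robot acts on g(xg) while θ = 1, acts on f(xf) when θ = 0, and flips θ back when a monitored error signal rises. This is our own illustrative rendering, not the paper’s implementation; the error window size and threshold are placeholder values.

```python
from collections import deque

class PersistentSSLSwitch:
    """Toy model of the switch theta between trusted cue g and learned cue f."""
    def __init__(self, window=50, max_error=0.5):
        self.theta = 1                      # start on the trusted ground truth
        self.errors = deque(maxlen=window)  # recent |f - g| errors
        self.max_error = max_error

    def output(self, g_val, f_val):
        # Whenever g is available, log the error of f against it.
        if g_val is not None:
            self.errors.append(abs(f_val - g_val))
            mean_err = sum(self.errors) / len(self.errors)
            # Switch to the learned cue once it tracks g well enough,
            # and fall back to g if the error creeps up again.
            self.theta = 0 if mean_err < self.max_error else 1
        return g_val if (self.theta == 1 and g_val is not None) else f_val

sw = PersistentSSLSwitch(window=3, max_error=0.2)
for g, f in [(1.0, 1.5), (1.0, 1.1), (1.0, 1.05), (1.0, 0.95)]:
    print(sw.output(g, f), sw.theta)
```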


D. Feedback-induced data bias

The robot induces how its environment is perceived, meaning that it influences the acquired training samples based on its behavior. The problems arising from this feedback-induced data bias are known from other machine learning disciplines, such as RL and LfD [38]. In particular, Ross et al. have proposed DAgger [2] to solve a similar problem in the LfD domain, which iteratively aggregates the dataset with induced training samples and the expert’s reactions to them. However, in the case of LfD, obtaining the induced training samples requires a careful and often additional setup, while in persistent SSL this functionality is inherently available. Secondly, the performance of the LfD expert (i.e., in many cases a human) is not easy to control, often reacting too late or too early. The control policy of the persistent SSL ground truth override system can, on the other hand, be very deterministic. In the case of a DAgger application with drones flying through a forest [3], it proved infeasible to reliably sample the expert in an online fashion. Acquired videos had to be processed offline by the expert, hence the need for (offline ↔ online) iterations. Moreover, an additional online human safety override interface was still necessary to prevent damage to the drone while learning. Thirdly, due to the cost of (and need for) iterative demonstration sessions, the emphasis of DAgger is on converging fast, needing as few expert sessions as possible. In persistent SSL there are no costs for using the teacher signals coming from the original sensor cue. With persistent SSL we can directly focus on effectively using the available amount of training samples, instead of minimizing the number of iterations as in DAgger.
Another reason why persistent SSL handles the induced training sample issue better than other state-of-the-art robot learning methods is that in persistent SSL, part of the learning problem itself can easily be separated from the behavior and tested, i.e. in a traditional supervised learning setting. In our proof of concept, this has allowed us to test the learning algorithms and thoroughly investigate their limits before deployment, which helped us to identify when the behavioral influence was to blame for bad results.
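For contrast, the DAgger loop that the paragraph above refers to can be sketched as follows. This is a minimal, generic rendering of the algorithm in [2] with toy stand-ins for the expert, learner, and rollout; in persistent SSL the ‘expert query’ step is free, because the trusted sensor cue labels every frame automatically.

```python
import random

def expert(state):
    return 1 if state > 0.5 else 0           # expert's action label

def train(dataset):
    # toy 'training': threshold at the smallest positively-labeled state
    pos = [s for s, a in dataset if a == 1]
    thr = min(pos) if pos else 0.5
    return lambda s: 1 if s >= thr else 0

def rollout(policy, n=20):
    # toy rollout: the states the acting policy visits
    return [random.random() for _ in range(n)]

def dagger(n_iters=5):
    dataset = [(s, expert(s)) for s in rollout(policy=None)]  # expert demo
    policy = train(dataset)
    for _ in range(n_iters):
        states = rollout(policy)                      # learner in control
        dataset += [(s, expert(s)) for s in states]   # expert relabels them
        policy = train(dataset)                       # retrain on aggregate set
    return policy

print(dagger()(0.7))
```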

VIII. CONCLUSION

We have investigated a novel Self-Supervised Learning scheme in which the supervisory signal is switched off after an initial learning period. In particular, we have studied an instance of such persistent SSL for the task of obstacle avoidance, in which the robot uses trusted stereo vision distance estimates in order to learn appearance-based monocular distance estimation. We have shown that this particular setup is very similar to learning from demonstration. This similarity has been corroborated by experiments in simulation, which showed that the worst learning strategy is to make a hard switch from stereo vision flight to mono vision flight. It is best to have the robot fly based on mono vision, using stereo vision only as ‘training wheels’ to take over when the robot would otherwise collide with an obstacle. The real-world robot experiments show the feasibility of the approach, giving acceptable results already with just 4-5 minutes of learning.

The findings also indicate interesting avenues for future investigation. First, and perhaps most importantly, in the 4-5 minutes of the real-world experiments the robot already gathers roughly 7000 - 9000 supervised learning samples. It is clear that longer learning times can lead to very large supervised data sets, which are suitable for deep learning approaches. Such approaches likely allow the learning to extend to much larger, more varied environments. In addition, they could allow the learning to improve the resolution of the disparity estimates from a single value to a full image-sized disparity map.
Second, in the current experiments the robot stayed in a single environment. We mentioned that a different environment can make the learned mapping invalid, and that this can be detected by means of the ground truth. Another avenue, as studied in [10], is to use a machine learning method with an associated uncertainty value. For instance, one could use a learning method such as a Gaussian Process. This can help with a further integration of the behavior with learning, for instance by tuning the forward velocity based on the certainty. Together, these avenues could allow persistent SSL to reach its full potential, significantly enhancing the robustness of robots operating in real-world environments.

APPENDIX A
DEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep convolutional neural network (CNN) on board a drone to test our proposed learning scheme. We scaled the CNN architecture so that implementing this network on the ARDrone 2 hardware remained feasible. Due to the CPU restrictions of our target system, we chose to use a relatively small CNN, inspired by the Cifar10 CNN example used in Caffe. Our layers are as follows. Input: 128x128 pixel image (input images are scaled). First hidden layer: 5x5 convolution, max pooling to 3x3, 32 kernels, contrast normalization. Second hidden layer: 5x5 convolution, average pooling to 3x3, 32 kernels, contrast normalization. Third hidden layer: 5x5 convolution, average pooling to 3x3, 128 kernels. Fourth and fifth layer: fully connected. Lastly, a Euclidean loss output layer to one single output. Furthermore, we used a learning rate of 1e-6, 0.9 momentum, and 10000 iterations. The networks were trained using the open source Caffe framework on a dedicated compute server with an NVidia GTX660. In Figure 19, the learning curves of our best CNN under these constraints versus the best of our VBoW-trained algorithms are shown. The figure shows that the CNN is able to learn the problem of monocular depth estimation in a comparable fashion to our VBoW method. For the expected amount of training data during a single flight, the VBoW method delivers better performance on dataset 1 and slightly worse on dataset 2.
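The described architecture can be transcribed roughly as follows. The original was a Caffe model; this PyTorch sketch is our own approximation, in which the pooling stride, convolution padding, local-response-normalization parameters, grayscale input channel, and hidden fully-connected width are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class MonoDisparityCNN(nn.Module):
    """Approximate transcription of the paper's small Caffe CNN (see caveats above)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2),  # assumed grayscale input
            nn.MaxPool2d(kernel_size=3, stride=2),       # 'max pooling to 3x3'
            nn.LocalResponseNorm(size=3),                # contrast normalization
            nn.Conv2d(32, 32, kernel_size=5, padding=2),
            nn.AvgPool2d(kernel_size=3, stride=2),       # 'average pooling to 3x3'
            nn.LocalResponseNorm(size=3),
            nn.Conv2d(32, 128, kernel_size=5, padding=2),
            nn.AvgPool2d(kernel_size=3, stride=2),       # -> 128 x 15 x 15
        )
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 15 * 15, 64),  # fourth, fully connected layer; width assumed
            nn.ReLU(),
            nn.Linear(64, 1),              # fifth layer: single disparity output
        )

    def forward(self, x):
        return self.regressor(self.features(x))

model = MonoDisparityCNN()
opt = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)  # paper's settings
loss_fn = nn.MSELoss()  # Euclidean loss to one single output
x, y = torch.randn(4, 1, 128, 128), torch.randn(4, 1)
loss = loss_fn(model(x), y)
loss.backward(); opt.step()
print(float(loss))
```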

REFERENCES

[1] J. García and F. Fernández, “A comprehensive survey on safe reinforcement learning,” Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.

[2] S. Ross, G. J. Gordon, and J. A. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” 14th International Conference on Artificial Intelligence and Statistics, vol. 15, 2011.


Fig. 19. Learning curves on two datasets for CNNs versus VBoW: MSE and area under the ROC curve as a function of training set size (frames). Dashed / solid lines refer to results on the train / test set.

[3] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, “Learning monocular reactive UAV control in cluttered natural environments,” 2013 IEEE International Conference on Robotics and Automation, pp. 1765–1772, May 2013.

[4] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., “Stanley: The robot that won the DARPA Grand Challenge,” in The 2005 DARPA Grand Challenge. Springer, 2007, pp. 1–43.

[5] D. Lieb, A. Lookingbill, and S. Thrun, “Adaptive road following using self-supervised learning and reverse optical flow,” in Robotics: Science and Systems, 2005, pp. 273–280.

[6] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, “Reverse optical flow for self-supervised adaptive autonomous robot navigation,” International Journal of Computer Vision, vol. 74, no. 3, pp. 287–302, 2007.

[7] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, “Learning long-range vision for autonomous off-road driving,” Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009.

[8] U. A. Muller, L. D. Jackel, Y. LeCun, and B. Flepp, “Real-time adaptive off-road vehicle navigation and terrain classification,” in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2013, pp. 87410A–87410A.

[9] J. Baleia, P. Santana, and J. Barata, “On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances,” Journal of Intelligent & Robotic Systems, pp. 1–20, 2015.

[10] H. Ho, C. De Wagter, B. Remes, and G. de Croon, “Optical flow for self-supervised learning of obstacle appearance,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3098–3104.

[11] T. Mori and S. Scherer, “First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles,” Proceedings - IEEE International Conference on Robotics and Automation, pp. 1750–1757, 2013.

[12] J. Engel, J. Sturm, and D. Cremers, “Scale-aware navigation of a low-cost quadrocopter with a monocular camera,” Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[13] ——, “Scale-aware navigation of a low-cost quadrocopter with a monocular camera,” Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[14] F. van Breugel, K. Morgansen, and M. H. Dickinson, “Monocular distance estimation from optic flow during active landing maneuvers,” Bioinspiration & biomimetics, vol. 9, no. 2, p. 025002, 2014.


[15] G. C. de Croon, “Monocular distance estimation with optical flow maneuvers and efference copies: a stability-based strategy,” Bioinspiration & biomimetics, vol. 11, no. 1, p. 016004, 2016.

[16] G. de Croon, M. Groen, C. De Wagter, B. Remes, R. Ruijsink, and B. van Oudheusden, “Design, aerodynamics and autonomy of the DelFly,” Bioinspiration & Biomimetics, vol. 7, no. 2, p. 025003, 2012.

[17] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009.

[18] D. Hoiem, A. A. Efros, and M. Hebert, “Automatic photo pop-up,” ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.

[19] A. Saxena, S. H. Chung, and A. Y. Ng, “3-D depth reconstruction from a single still image,” International Journal of Computer Vision, vol. 76, no. 1, pp. 53–69, Aug. 2007.

[20] A. Saxena, M. Sun, and A. Y. Ng, “Make3D: learning 3D scene structure from a single still image,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, May 2009.

[21] I. Lenz, M. Gemici, and A. Saxena, “Low-power parallel algorithms for single image based obstacle avoidance in aerial robots,” IEEE International Conference on Intelligent Robots and Systems, pp. 772–779, 2012.

[22] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” Advances in Neural Information Processing Systems, pp. 1–9, 2014.

[23] J. Michels, A. Saxena, and A. Y. Ng, “High speed obstacle avoidance using monocular vision and reinforcement learning,” in Proceedings of the 22nd International Conference on Machine Learning. ACM Press, 2005, pp. 593–600.

[24] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, “Vision and learning for deliberative monocular cluttered flight.”

[25] K. Yamauchi, M. Oota, and N. Ishii, “A self-supervised learning system for pattern recognition by sensory integration,” Neural Networks, vol. 12, no. 10, pp. 1347–1358, 1999.

[26] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L. E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, “Stanley: The robot that won the DARPA Grand Challenge,” Springer Tracts in Advanced Robotics, vol. 36, pp. 1–43, 2007.

[27] C. De Wagter, S. Tijmons, B. Remes, and G. de Croon, “Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system,” in IEEE International Conference on Robotics & Automation (ICRA), 2014, pp. 4982–4987.

[28] G. C. H. E. De Croon, E. De Weerdt, C. De Wagter, B. D. W. Remes, and R. Ruijsink, “The appearance variation cue for obstacle avoidance,” IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.

[29] M. Varma and A. Zisserman, “Texture classification: Are filter banks necessary?” Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003.

[30] B. Wu, T. L. Ooi, and Z. J. He, “Perceiving distance accurately by a directional process of integrating ground information,” Nature, vol. 428, no. 6978, pp. 73–77, 2004.

[31] C. De Wagter and M. Amelink, “Holiday50av technical paper,” 2007.

[32] P. R. Cohen, “Empirical methods for artificial intelligence,” IEEE Intelligent Systems, no. 6, p. 88, 1996.

[33] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, “Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system,” in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 4982–4987.

[34] B. Remes, D. Hensen, F. Van Tienen, C. De Wagter, E. Van der Horst, and G. De Croon, “Paparazzi: how to make a swarm of Parrot AR drones fly autonomously based on GPS,” in IMAV 2013: Proceedings of the International Micro Air Vehicle Conference and Flight Competition, Toulouse, France, 17-20 September 2013, 2013.

[35] P. Brisset, A. Drouin, M. Gorraz, P.-S. Huard, and J. Tyler, “The Paparazzi solution,” 2014.

[36] X. Zhu, “Semi-supervised learning literature survey,” Sciences-New York, pp. 1–59, 2007.

[37] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, “Boosting structured prediction for imitation learning,” Advances in Neural Information Processing Systems (NIPS), p. 54, 2006.

[38] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, pp. 1238–1274, 2013.

[39] M. L. Littman, “Reinforcement learning improves behaviour from evaluative feedback,” Nature, vol. 521, no. 7553, pp. 445–451, 2015.


11

Fig 11 Simulation heatmaps from left to right cold turkey DAgger trainingwheels stereo only Top images are turn locations lower images are theapproaches

Fig 12 The used multicopter

positions during turning (top row) and during straight flight(bottom row) The position distribution has been obtained bybinning the positions during the test phase of all 30 runs Theresults for each scheme are shown per column in Figure 11Right is the pure stereo vision scheme which shows a clearborder around the straight flight trajectories It can be observedthat this border is best approximated by the lsquotraining wheelsrsquoscheme (second from the right)

VI ROBOTIC EXPERIMENTS

The simulation experiments showed that the lsquotraining wheelsrsquosetup resulted in the fewest stereo vision overrides whenswitching to monocular disparity estimation and control Inthis section we test this online learning setup with a flyingrobotThe experiment is set up in the same manner as the simulationThe robot a Parrot AR drone 2 first explores the room withthe help of stereo vision After 1 minute of learning the droneswitches to using the monocular disparity estimates with stereovision running in the background for performing potentialsafety overrides In this phase the drone still continues tolearn After learning 4 to 5 minutes the drone stops learningand enters the test phase Again also for the real robot themain performance measure consists of the number of safetyoverrides performed by the stereo vision during the testingphaseThe AR drone 2 is standard not equipped with a stereo visionsystem Therefore an in-house-developed 4 gram stereo visionsystem is used [33] which sends the raw images over USB tothe ARDrone2 The grayscale stereo camera has a resolutionof 128times96 px and is limited to 10 fps The ARDrone2 comeswith a 1GHz ARM cortex A8 processor and 128MB RAM

and normally runs the Parrot firmware as an autopilot Forthe experiments we replace this firmware with the the opensource Paparazzi autopilot software [34] [35] This allowedus to implement all vision and learning algorithms on boardthe drone The length of each test is dependent on the batterywhich due to wear has considerable variation in the range of8-15 minutesThe tests are performed in an artificial room that has beenconstructed within a motion tracking arena This allows us totrack the trajectory of the drone and facilitates post-experimentanalysis The room is approximately 5 times 5 m as delimitedby plywood walls In order to ensure that the stereo visionalgorithm gave reliable results we added texture in the formof duct-tape to the walls In five tests we had a textured carpethanging over one of the walls (Figure 13 left referred to aslsquoroom 1rsquo) in the other five tests it was on the floor (Figure 13right referred to as lsquoroom 2rsquo)

Fig 13 Two test flight rooms

1) Results Table III shows the summarized results obtainedfrom the monocular test flights Two main observations canbe made from this table First the average number of stereooverrides during the test phase is 3 which is very close to thenumber of overrides in simulation The monocular behavioralso has a similar heat map to simulation Figure 14 shows aheat map of the dronersquos position during the approaches and theavoidance maneuvers (the turns) Again the stereo based flightperforms better in the sense that the drone explores the roommuch more thoroughly and the turns happen consistently justbefore an obstacle is detected On the other hand especially inroom 2 the monocular performance is quite good in the sensethat the system is able to explore most of the roomSecond the selected TPR and FPR are on average 047 and011 The TPR is rather low compared to the offline testsHowever this number is heavily influenced by the monocularestimator based behavior Due to the goal of the robot avoidingobstacles slightly before the stereo ground truth recognizesthem as positives positives should hardly occur at al Only incases of FNs where the estimator is slower or wrong positiveswill be registered by the ground truth Similarly the FPR isalso lower in the context of the estimate based behaviorROC curves of the 10 flights are shown in Figure 16 Acomparison based on the numbers between the first 5 flights(room 1) and the last 5 flights (room 2) does not showany significant differences leading to the suggestion that thesystem is able to learn both rooms equally well Howeverwhen comparing the heat maps of the two situations in themonocular system in Figure 14 and Figure 15 it seems thatthe system shows slightly different behavior The monocularsystem appears to explore room 2 better getting closer to

12

TABLE III TEST FLIGHT SUMMARY

Room 1 Room 2Description 1 2 3 4 5 6 7 8 9 10 Avg

Stereo flight time mss 648 753 213 330 445 439 456 512 458 501 459Mono flight time mss 344 817 645 725 454 1007 446 951 523 512 639

Mean Square Error 07 196 112 095 083 095 087 132 116 106 109False Positive Rate 016 018 013 011 011 008 013 008 01 008 011True Positive Rate 09 044 057 038 038 04 035 035 06 039 047Stereo approaches 29 31 8 14 19 22 22 19 20 21 205Mono approaches 10 21 20 25 14 33 15 28 18 15 199

Auto-overrides 0 6 2 2 1 5 2 7 3 2 3Overrides ratio 0 072 03 027 02 049 042 071 056 038 041

copying the behavior of the stereo based system

Fig 14 Room 1 (plane texture) position heat map Top row is thebinned position during the avoidance turns bottom row during the obstacleapproaches left column during stereo ground truth based operation rightcolumn during learned monocular operation

The experimental setup with the room in the motion trackingarena allows for a more in-depth analysis of the performanceof both stereo and monocular vision Figure 17 shows thespatial view of the flight trajectory of test 10 6 The flightis segmented into approaches and turns which are numberedaccordingly in these figures The color scale in Figure 17Ais created by calculating the theoretically visible closest wallbased on the tracking the systems measured heading andposition of the drone the known position of the walls and theFOV angle of the camera It is clearly visible that the stereoground truth in Figure 17B does not capture this theoreticaldisparity perfectly Especially in the middle of the room thedisparity remains high compared to the theoretical ground truthdue to noise in the stereo disparity map The results of themonocular estimator in Figure 17C shows another decrease inquality compared to the stereo ground truth

6Onboard video data of the flight 10 can be viewed at http1drvms1KC81PN an external video at httpsyoutubelC vj-QNy1I and a VBoWvisualization video at http1drvms1fzvicH

Fig 15 Room 2 (carpet natural texture) position heat map

0 01 02 03 04 05 06 07 08 09 10

01

02

03

04

05

06

07

08

09

1

12345678910

Fig 16 ROC curves of the 10 test flights Dashed solid lines refer to resultson room 1 2

13

Fig 17 Flight path of test 10 in room 2 Monocular flight starts form approach 23 The meaning of the color of the flightpath differs per image A theapproximated disparity based on the external tracking system B the measured stereo average disparity C the monocular estimated disparity D the errorbetween BampC with dark blue meaning zero error E error during FP F error during FN E and F only show the monocular part of the flight

VII DISCUSSION

We start the discussion with an interpretation of the resultsfrom the simulation and real-world experiments after whichwe proceed by discussing persistent SSL in general andprovide a comparison to other machine learning techniques

A Interpretation of the results

Using persistent SSL we were able to autonomously navigateour multicopter on the basis of a stereo vision camera whiletraining a monocular estimator on board and online Althoughthe monocular estimator allows the drone to continue flyingand avoiding obstacles the performance during the approx-imately ten-minute flights is not perfect During monocularflight a fairly limited amount of (autonomous) stereo overrideswas needed while at the same time the robot was not fullyexploring the room like when using stereoSeveral improvements can be suggested First we can simplyhave the drone learn for a longer time accumulating trainingdata over multiple flights In an extended offline test ourVBoW method shows saturation at around 6000 samplesUsing additional features and more advanced learning methods

may result in improved performance if training set sizesincreaseDuring our tests in different environments it proved unnec-essary to tune the VBoW learning algorithm parameters toa new environment as similar performance was obtained Thelearned results on the robot itself may or may not generalize todifferent environments however this is of less concern as therobot can detect a new environment and then decide to continuethe learning process if the original cue is still availableIn order to detect an inadequacy of the learned regressionfunction the robot can occasionally check the estimation erroragainst the stereo ground truth In fact our system alreadydoes so autonomously using its safety override Methods onchecking the performance without using the ground truth egby employing a learner that gives an estimate of uncertaintyare left for future work

B Deep learningAt the time of our robotic experiments implementing state-of-the-art deep learning methods on-board a flying drone wasdeemed infeasible due to hardware restrictions However inAppendix A we investigate using downsized networks in order

14

to show that the persistent SSL method can work with deeplearning as well One of the major advantages of persistentSSL is the unprecedented amount of available training dataThis amount of data will be more useful to more complexlearning methods such as deep learning methods than to lesscomplex but computationally efficient methods such as theVBoW method used in our experiments Today with theavailability of strongly improved hardware such as the NVidiaJetson TX1 close-to state-of-the-art models can be trained andrun on-board a drone which may significantly improve thelearning results

C Persistent SSL in relation to other machine learning tech-niques

In order to place persistent SSL in the general framework ofmachine learning we compare it with several techniques Anoverview of this comparison is presented in figure 18

Importance of supervision

Persistent Self-Supervised

Learning

Unsupervised Learning

Learning from Demonstration

ReinforcementLearning

Self-Supervised Learning

SupervisedLearning

Imp

ort

ance

of

be

hav

ior

Classical machine learning

Classical control policy learning

Fig 18 Lay of the machine learning land

Un-semi-super-vised learning Unsupervised learning doesnot require labeled data semi-supervised learning requires onlyan initial set of labeled data [36] and supervised learningrequires all the data to be labeled Internally persistent SSLuses a standard supervised learning scheme which greatlyfacilitates and speeds up learning The typical downside ofsupervised learning - acquiring the labels - does not apply toSSL since the robot continuously provides these labels itselfA major difference between the typical use of supervisedlearning and its use in persistent SSL is that the data forlearning is generally assumed to be iid However persistentSSL controls a behavioral component which in turn affectsboth the dataset obtained during training as well as duringtesting time Operation based on ground truth induces a certainbehavior that differs significantly from behavior induced froma trained estimator even more so for an undertrained estimatorSSL The persistent form of SSL is set apart in the figure fromlsquonormalrsquo SSL because the persistence property introduces amuch more significant behavioral component to the learningWhile normal SSL expects the trusted cue to remain availablepersistent SSL assumes that the robot may sometimes act in

the absence of the trusted cue This introduces the feedback-induced data bias problem which as we have seen requiresspecific behavior strategies for best learning the robotrsquos taskLearning from demonstration Learning from Demonstration(LfD) or Learning from Demonstration (LfD) is a closerelative to persistent SSL Consider for instance teleoperationan LfD scheme in which a (human or robot) teacher remotelyoperates a robot in order for it to learn demonstrated actionsin its environment [17] This can be compared to persistentSSL if we consider the teacher to be the ground truth functiong(xg) in the persistent SSL scheme In most cases described inliterature the teacher shows actions from a control policy takenon the basis of a state instead of just the results from a sensorycue (ie the state) However LfD does contain exceptions inwhich the learner only records the states during demonstrationeg when drawing a map through a 2D representation of theworld in case of a path planning mission in an outdoor robot[37] Like persistent SSL test time decisions taken in LfDschemes influence future observations which may or may notbe contained in known demonstrated territory However onekey difference between LfD and persistent SSL arguably setsthem apart All LfD theory known to the authors implicitlyassumes the teacher is never the same entity as the learner Itmay be that all relevant sensors are on the learner and eventhat the learners body is used to execute teacher commands(like in teleoperation) but the teachers intelligence is alwaysan external entityReinforcement learning Lastly we compare persistent SSLwith Reinforcement Learning (RL) which is a distinctivelydifferent technique [38] In RL a policy is learned using areward function Due to the evaluative feedback provided inRL defining a good reward function is one of fundamentaldifficulties of RL known as reward shaping [39] [38] Sincepersistent SSL uses supervised feedback reward shaping isless of an issue in persistent SSL only requiring a choiceof a loss function between g(xg) and f(xf ) Secondly theinitial exploration phase of RL often infers a lot of trial-and-error making it a dangerous time in which a physicalsystem may crash and be damaged Although this particularproblem is often solved by better initialization eg by usingfor instance LfD or using policy search instead of valuefunction based approaches persistent SSL does not require anuntrained initialization phase at all as a reliable ground truthfunction guarantees a certain minimal correct behaviorPersistent SSL differs from other learning techniques in thesense that no complete training data set is needed to train thealgorithm beforehand Instead it requires a ground truth g(xg)which must be available online in real-time while trainingf(xf ) but can be switched off when f(xf ) is learned tosatisfaction This implies that learning needs to be persistentand that the switch θ must be included in the model Notethat in cases where the environment of the robot may changemeasures can be put in place to detect the output uncertaintyof f(xf ) If the uncertainty goes up the robot can switchback to using the ground truth function and learning can thenbe activated again Developing such measures is however leftfor future work

15

D Feedback-induced data bias

The robot induces how its environment is perceived meaning itinfluences the acquired training samples based on its behaviorThe problems arising from this feedback-induced data biasare known from other machine learning disciplines such asRL and LfD [38] In particular Ross et al have proposedDAgger [2] to solve a similar problem in the LfD domainwhich iteratively aggregates the dataset with induced trainingsamples and the experts reaction to it However in the caseof LfD obtaining the induced training samples requires acareful and often additional setup while in persistent SSL thisfunctionality is inherently available Secondly the performanceof the LfD expert (ie in many cases a human) is not easy tocontrol often reacting too late or too early The control policyof the persistent SSL ground truth override system can on theother hand be very deterministic In the case of a DAggerapplication with drones flying through a forest [3] it provedinfeasible to reliably sample the expert in an online fashionAcquired videos had to be processed offline by the experthence the need for (offline harr online) iterations Moreover anadditional online human safety override interface was still nec-essary to prevent damage to the drone while learning Thirdlydue to the cost of (and need for) iterative demonstrationsessions the emphasis of DAgger is on converging fast withneeding as little expert sessions as possible In persistent SSLthere are no costs for using the teacher signals coming from theoriginal sensor cue With persistent SSL we can directly focuson effectively using the available amount of training samplesinstead of minimizing the number of iterations like in DAggerAnother reason why persistent SSL handles the induced train-ing sample issue better than other state of the art robot learningmethods is that in persistent SSL part of the learning problemitself can be easily separated and tested from the behaviorie in a traditional supervised learning setting In our proofof concept this has allowed us to test the learning algorithmsand thoroughly investigate its limits before deployment whichhelped us to identify when the behavioral influence was toblame for bad results

VIII CONCLUSION

We have investigated a novel Self-Supervised Learningscheme in which the supervisory signal is switched off after aninitial learning period In particular we have studied an instanceof such persistent SSL for the task of obstacle avoidancein which the robot uses trusted stereo vision distance esti-mates in order to learn appearance-based monocular distanceestimation We have shown that this particular setup is verysimilar to learning from demonstration This similarity hasbeen corroborated by experiments in simulation which showedthat the worst learning strategy is to make a hard switch fromstereo vision flight to mono vision flight It is best to have therobot fly based on mono vision and using stereo vision only aslsquotraining wheelsrsquo to take over when the robot would otherwisecollide with an obstacle The real-world robot experimentsshow the feasibility of the approach giving acceptable resultsalready with just 4-5 minutes of learning

The findings also indicate interesting future venues of investi-gation First and perhaps most importantly in the 4-5 minutesof the real-world experiments the robot already experiencesroughly 7000 - 9000 supervised learning samples It is clearthat longer learning times can lead to very large superviseddata sets which are suitable for deep learning approachesSuch approaches likely allow the learning to extend to muchlarger more varied environments In addition they could allowthe learning to improve the resolution of disparity estimatesfrom a single value to a full image size disparity mapSecond in the current experiments the robot stayed in a singleenvironment We mentioned that a different environment canmake the learned mapping invalid and that this can be detectedby means of the ground truth Another venue as studied in[10] is to use a machine learning method with an associateduncertainty value For instance one could use a learningmethod such as a Gaussian Process This can help with afurther integration of the behavior with learning for instanceby tuning the forward velocity based on the certainty Thesevenues together could allow for persistent SSL to reach itsfull potential significantly enhancing the robustness of robotsoperating in real-world environments

APPENDIX ADEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep con-volutional neural network (CNN) on-board a drone to test ourproposed learning scheme We scaled the CNN architecture sothat implementing this network on the ARDrone 2 hardwareremained feasible Due to CPU restrictions of our target sys-tem we choose to use a relatively small CNN inspired by theCifar10 CNN example used in Caffe Our layers are as followsInput 128x128 pixels image (input images are scaled) Firsthidden layer 5x5 convolution max pooling to 3x3 32 kernelscontrast normalization Second hidden layer 5x5 convolutionaverage pooling to 3x3 32 kernels contrast normalizationThird hidden layer 5x5 convolution average pooling to 3x3128 kernels Fourth and fifth layer fully connected Lastly aneuclidean loss output layer to one single output Furthermorewe used a learning rate of 1e-6 09 momentum and used10000 iterations The networks are trained using the opensource Caffe framework on a dedicated compute server withan NVidia GTX660 In Figure 19 the learning curves of ourbest CNN under these constraints versus the best of our VBoWtrained algorithms are shown The figure shows that the CNNis able to learn the problem of monocular depth estimation toa comparable fasion as our VBoW method For the expectedamount of training data during a single flight the VBoWmethod delivers better performance for dataset 1 and slightlyworse for dataset 2

REFERENCES

[1] J Garcıa and F Fernandez ldquoA comprehensive survey on safe rein-forcement learningrdquo Journal of Machine Learning Research vol 16pp 1437ndash1480 2015

[2] S Ross G J Gordon and J A Bagnell ldquoA Reduction of ImitationLearning and Structured Predictionrdquo 14th International Conference onArtificial Intelligence and Statistics vol 15 2011

16

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 60000

05

1

15

2

25

3

MS

E

Dataset 1

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000

04

05

06

07

08

09

1

Trainnig set size (frames)

Are

a u

nder

RO

C c

urv

e

500 1000 1500 2000 250010

20

30

40

50

60

70

80Dataset 2

VBOW test

VBOW train

CNN test

CNN train

500 1000 1500 2000 2500093

094

095

096

097

098

099

1

Trainnig set size (frames)

Fig 19 Learning curves on two datasets for CNNs versus VBoW Dashed solid lines refer to results on traintest set

[3] S Ross N Melik-Barkhudarov K S Shankar A Wendel D DeyJ A Bagnell and M Hebert ldquoLearning monocular reactive UAVcontrol in cluttered natural environmentsrdquo 2013 IEEE InternationalConference on Robotics and Automation pp 1765ndash1772 May 2013[Online] Available httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=6630809

[4] S Thrun M Montemerlo H Dahlkamp D Stavens A Aron J DiebelP Fong J Gale M Halpenny G Hoffmann et al ldquoStanley Therobot that won the darpa grand challengerdquo in The 2005 DARPA GrandChallenge Springer 2007 pp 1ndash43

[5] D Lieb A Lookingbill and S Thrun ldquoAdaptive road following usingself-supervised learning and reverse optical flowrdquo in Robotics Scienceand Systems 2005 pp 273ndash280

[6] A Lookingbill J Rogers D Lieb J Curry and S Thrun ldquoReverseoptical flow for self-supervised adaptive autonomous robot navigationrdquoInternational Journal of Computer Vision vol 74 no 3 pp 287ndash3022007

[7] R Hadsell P Sermanet J Ben A Erkan M Scoffier K KavukcuogluU Muller and Y LeCun ldquoLearning long-range vision for autonomousoff-road drivingrdquo Journal of Field Robotics vol 26 no 2 pp 120ndash1442009 [Online] Available httpdoiwileycom101002rob20276httponlinelibrarywileycomdoi101002rob20276abstract

[8] U A Muller L D Jackel Y LeCun and B Flepp ldquoReal-time adaptiveoff-road vehicle navigation and terrain classificationrdquo in SPIE Defense

Security and Sensing International Society for Optics and Photonics2013 pp 87 410Andash87 410A

[9] J Baleia P Santana and J Barata ldquoOn exploiting haptic cues forself-supervised learning of depth-based robot navigation affordancesrdquoJournal of Intelligent amp Robotic Systems pp 1ndash20 2015

[10] H Ho C De Wagter B Remes and G de Croon ldquoOptical flow for self-supervised learning of obstacle appearancerdquo in Intelligent Robots andSystems (IROS) 2015 IEEERSJ International Conference on IEEE2015 pp 3098ndash3104

[11] T Mori and S Scherer ldquoFirst results in detecting and avoiding frontalobstacles from a monocular camera for micro unmanned aerial vehi-clesrdquo Proceedings - IEEE International Conference on Robotics andAutomation pp 1750ndash1757 2013

[12] J Engel J Sturm and D Cremers ldquoScale-aware navigation of a low-cost quadrocopter with a monocular camerardquo Robotics and AutonomousSystems vol 62 no 11 pp 1646ndash1656 2014 [Online] AvailablehttpwwwsciencedirectcomsciencearticlepiiS0921889014000566

[13] mdashmdash ldquoScale-aware navigation of a low-cost quadrocopter with amonocular camerardquo Robotics and Autonomous Systems vol 62 no 11pp 1646ndash1656 2014


TABLE III
TEST FLIGHT SUMMARY

                               Room 1                        Room 2
Flight                  1     2     3     4     5     6     7     8     9    10    Avg
Stereo flight time   6:48  7:53  2:13  3:30  4:45  4:39  4:56  5:12  4:58  5:01   4:59
Mono flight time     3:44  8:17  6:45  7:25  4:54 10:07  4:46  9:51  5:23  5:12   6:39
Mean Square Error    0.70  1.96  1.12  0.95  0.83  0.95  0.87  1.32  1.16  1.06   1.09
False Positive Rate  0.16  0.18  0.13  0.11  0.11  0.08  0.13  0.08  0.10  0.08   0.11
True Positive Rate   0.90  0.44  0.57  0.38  0.38  0.40  0.35  0.35  0.60  0.39   0.47
Stereo approaches      29    31     8    14    19    22    22    19    20    21   20.5
Mono approaches        10    21    20    25    14    33    15    28    18    15   19.9
Auto-overrides          0     6     2     2     1     5     2     7     3     2    3.0
Overrides ratio      0.00  0.72  0.30  0.27  0.20  0.49  0.42  0.71  0.56  0.38   0.41

(Flight times in m:ss.)

… copying the behavior of the stereo-based system.

Fig. 14. Room 1 (plain texture) position heat map. Top row: binned position during the avoidance turns; bottom row: during the obstacle approaches. Left column: during stereo ground truth based operation; right column: during learned monocular operation.

The experimental setup with the room in the motion tracking arena allows for a more in-depth analysis of the performance of both stereo and monocular vision. Figure 17 shows the spatial view of the flight trajectory of test 10 (see the video footnote below). The flight is segmented into approaches and turns, which are numbered accordingly in these figures. The color scale in Figure 17A is created by calculating the theoretically visible closest wall, based on the tracking system's measured heading and position of the drone, the known position of the walls, and the FOV angle of the camera. It is clearly visible that the stereo ground truth in Figure 17B does not capture this theoretical disparity perfectly. Especially in the middle of the room, the disparity remains high compared to the theoretical ground truth, due to noise in the stereo disparity map. The results of the monocular estimator in Figure 17C show a further decrease in quality compared to the stereo ground truth.

Footnote: Onboard video data of flight 10 can be viewed at http://1drv.ms/1KC81PN, an external video at https://youtu.be/lC_vj-QNy1I, and a VBoW visualization video at http://1drv.ms/1fzvicH.
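The theoretical disparity used for Figure 17A can be reproduced with a simple 2D ray cast against the known room geometry. The following is a minimal sketch of that computation, not the paper's exact implementation: it assumes a room described as a list of wall segments, and the FOV value and ray count are illustrative.

```python
import numpy as np

def ray_segment_distance(o, d, seg):
    """Distance along ray (origin o, unit direction d) to 2D segment seg,
    or None if the ray misses the segment."""
    p, q = (np.asarray(end, dtype=float) for end in seg)
    v = q - p
    denom = d[0] * v[1] - d[1] * v[0]
    if abs(denom) < 1e-9:                      # ray parallel to the wall
        return None
    w = p - o
    t = (w[0] * v[1] - w[1] * v[0]) / denom    # distance along the ray
    s = (w[0] * d[1] - w[1] * d[0]) / denom    # position along the segment
    return t if t > 0 and 0.0 <= s <= 1.0 else None

def closest_visible_wall(pos, heading, walls, fov=np.radians(60), n_rays=32):
    """Distance to the theoretically visible closest wall, given the tracked
    drone position and heading, the known wall segments, and the camera FOV."""
    o = np.asarray(pos, dtype=float)
    dists = []
    for ang in np.linspace(heading - fov / 2, heading + fov / 2, n_rays):
        d = np.array([np.cos(ang), np.sin(ang)])
        hits = [ray_segment_distance(o, d, wall) for wall in walls]
        hits = [h for h in hits if h is not None]
        if hits:
            dists.append(min(hits))
    return min(dists) if dists else None

# The theoretical disparity then follows from the stereo geometry as
# disparity = baseline * focal_length_px / distance.
```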

Fig. 15. Room 2 (carpet, natural texture) position heat map.


Fig. 16. ROC curves of the 10 test flights. Dashed and solid lines refer to results in room 1 and room 2, respectively.
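For reference, the TPR and FPR values in Table III and the ROC curves above follow from a standard sweep over the detection threshold applied to the monocular disparity estimate. A minimal sketch, assuming ground-truth positives are frames where the trusted stereo disparity exceeds an obstacle threshold (the names and threshold are illustrative, not the paper's exact event definition):

```python
import numpy as np

def roc_points(d_mono, d_stereo, obstacle_disp=15.0, n_thresholds=50):
    """ROC sweep for monocular obstacle detection. Ground-truth positives
    are frames whose trusted stereo disparity exceeds obstacle_disp."""
    y = d_stereo > obstacle_disp
    fpr, tpr = [], []
    for thr in np.linspace(d_mono.min(), d_mono.max(), n_thresholds):
        pred = d_mono > thr
        tp = np.sum(pred & y)
        fp = np.sum(pred & ~y)
        fn = np.sum(~pred & y)
        tn = np.sum(~pred & ~y)
        tpr.append(tp / max(tp + fn, 1))   # true positive rate
        fpr.append(fp / max(fp + tn, 1))   # false positive rate
    return np.asarray(fpr), np.asarray(tpr)
```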


Fig. 17. Flight path of test 10 in room 2. Monocular flight starts from approach 23. The meaning of the color of the flight path differs per image. A: the approximated disparity based on the external tracking system; B: the measured stereo average disparity; C: the monocular estimated disparity; D: the error between B and C, with dark blue meaning zero error; E: error during false positives; F: error during false negatives. E and F only show the monocular part of the flight.

VII. DISCUSSION

We start the discussion with an interpretation of the results from the simulation and real-world experiments, after which we proceed by discussing persistent SSL in general and provide a comparison to other machine learning techniques.

A. Interpretation of the results

Using persistent SSL, we were able to autonomously navigate our multicopter on the basis of a stereo vision camera while training a monocular estimator on board and online. Although the monocular estimator allows the drone to continue flying and avoiding obstacles, the performance during the approximately ten-minute flights is not perfect. During monocular flight, a fairly limited number of (autonomous) stereo overrides was needed, while at the same time the robot was not fully exploring the room as it did when using stereo.

Several improvements can be suggested. First, we can simply have the drone learn for a longer time, accumulating training data over multiple flights. In an extended offline test, our VBoW method shows saturation at around 6000 samples. Using additional features and more advanced learning methods may result in improved performance if training set sizes increase.

During our tests in different environments, it proved unnecessary to tune the VBoW learning algorithm parameters to a new environment, as similar performance was obtained. The learned results on the robot itself may or may not generalize to different environments; however, this is of less concern, as the robot can detect a new environment and then decide to continue the learning process if the original cue is still available.

In order to detect an inadequacy of the learned regression function, the robot can occasionally check the estimation error against the stereo ground truth. In fact, our system already does so autonomously using its safety override. Methods for checking the performance without using the ground truth, e.g., by employing a learner that gives an estimate of uncertainty, are left for future work.

B. Deep learning

At the time of our robotic experiments, implementing state-of-the-art deep learning methods on board a flying drone was deemed infeasible due to hardware restrictions. However, in Appendix A we investigate using downsized networks in order to show that the persistent SSL method can work with deep learning as well. One of the major advantages of persistent SSL is the unprecedented amount of available training data. This amount of data will be more useful to more complex learning methods, such as deep learning methods, than to less complex but computationally efficient methods such as the VBoW method used in our experiments. Today, with the availability of strongly improved hardware such as the NVidia Jetson TX1, close-to state-of-the-art models can be trained and run on board a drone, which may significantly improve the learning results.

C. Persistent SSL in relation to other machine learning techniques

In order to place persistent SSL in the general framework of machine learning, we compare it with several techniques. An overview of this comparison is presented in Figure 18.

Fig. 18. Lay of the machine learning land. (The figure arranges Unsupervised Learning, Supervised Learning, Self-Supervised Learning, Persistent Self-Supervised Learning, Learning from Demonstration, and Reinforcement Learning along two axes, the importance of supervision and the importance of behavior, spanning from classical machine learning to classical control policy learning.)

Un-/semi-/supervised learning. Unsupervised learning does not require labeled data, semi-supervised learning requires only an initial set of labeled data [36], and supervised learning requires all the data to be labeled. Internally, persistent SSL uses a standard supervised learning scheme, which greatly facilitates and speeds up learning. The typical downside of supervised learning, acquiring the labels, does not apply to SSL, since the robot continuously provides these labels itself. A major difference between the typical use of supervised learning and its use in persistent SSL is that the data for learning is generally assumed to be i.i.d. However, persistent SSL controls a behavioral component, which affects the dataset obtained both at training time and at test time. Operation based on the ground truth induces a certain behavior that differs significantly from the behavior induced by a trained estimator, and even more so by an undertrained estimator.

SSL. The persistent form of SSL is set apart in the figure from 'normal' SSL, because the persistence property introduces a much more significant behavioral component to the learning. While normal SSL expects the trusted cue to remain available, persistent SSL assumes that the robot may sometimes act in the absence of the trusted cue. This introduces the feedback-induced data bias problem, which, as we have seen, requires specific behavior strategies for best learning the robot's task.

Learning from demonstration. Learning from Demonstration (LfD) is a close relative to persistent SSL. Consider, for instance, teleoperation: an LfD scheme in which a (human or robot) teacher remotely operates a robot in order for it to learn demonstrated actions in its environment [17]. This can be compared to persistent SSL if we consider the teacher to be the ground truth function g(xg) in the persistent SSL scheme. In most cases described in the literature, the teacher demonstrates actions from a control policy taken on the basis of a state, instead of just the results from a sensory cue (i.e., the state). However, LfD does contain exceptions in which the learner only records the states during demonstration, e.g., when drawing a map through a 2D representation of the world in the case of a path planning mission of an outdoor robot [37]. As in persistent SSL, test time decisions taken in LfD schemes influence future observations, which may or may not be contained in known, demonstrated territory. However, one key difference between LfD and persistent SSL arguably sets them apart: all LfD theory known to the authors implicitly assumes that the teacher is never the same entity as the learner. It may be that all relevant sensors are on the learner, and even that the learner's body is used to execute teacher commands (as in teleoperation), but the teacher's intelligence is always an external entity.

Reinforcement learning. Lastly, we compare persistent SSL with Reinforcement Learning (RL), which is a distinctively different technique [38]. In RL, a policy is learned using a reward function. Due to the evaluative feedback provided in RL, defining a good reward function is one of the fundamental difficulties of RL, known as reward shaping [38], [39]. Since persistent SSL uses supervised feedback, reward shaping is less of an issue in persistent SSL, only requiring the choice of a loss function between g(xg) and f(xf). Secondly, the initial exploration phase of RL often involves a lot of trial and error, making it a dangerous time in which a physical system may crash and be damaged. Although this particular problem is often alleviated by better initialization, e.g., by using LfD, or by using policy search instead of value-function-based approaches, persistent SSL does not require an untrained initialization phase at all, as a reliable ground truth function guarantees a certain minimal correct behavior.

Persistent SSL differs from the other learning techniques in the sense that no complete training data set is needed to train the algorithm beforehand. Instead, it requires a ground truth g(xg), which must be available online and in real time while training f(xf), but which can be switched off when f(xf) is learned to satisfaction. This implies that learning needs to be persistent, and that the switch θ must be included in the model. Note that in cases where the environment of the robot may change, measures can be put in place to detect the output uncertainty of f(xf). If the uncertainty goes up, the robot can switch back to using the ground truth function, and learning can then be activated again. Developing such measures is, however, left for future work. A sketch of this switching logic is given below.
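To make the role of the switch θ concrete, the following sketch shows one control cycle of such a scheme. All robot.* and estimator.* calls are hypothetical placeholders, and the collision threshold is illustrative; it is a sketch of the principle, not the paper's flight code.

```python
def control_step(robot, estimator, theta_mono, collision_disp=15.0):
    """One control cycle of persistent SSL with the switch theta.

    theta_mono=False: fly on the trusted stereo cue g(xg) while learning.
    theta_mono=True:  fly on the learned monocular estimate f(xf), with
                      stereo (when available) acting only as a safety net.
    """
    frame = robot.monocular_frame()
    disparity = estimator.predict(frame)            # f(xf)

    if robot.stereo_available():
        d_stereo = robot.stereo_disparity()         # trusted cue g(xg)
        estimator.update(frame, d_stereo)           # online supervised update
        # 'training wheels': during monocular flight, stereo overrides the
        # monocular estimate whenever a collision appears imminent
        if not theta_mono or d_stereo > collision_disp:
            disparity = d_stereo

    if disparity > collision_disp:
        robot.turn_away()
    else:
        robot.fly_forward()
```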


D. Feedback-induced data bias

The robot induces how its environment is perceived, meaning that it influences the acquired training samples based on its behavior. The problems arising from this feedback-induced data bias are known from other machine learning disciplines, such as RL and LfD [38]. In particular, Ross et al. have proposed DAgger [2] to solve a similar problem in the LfD domain, which iteratively aggregates the dataset with induced training samples and the expert's reaction to them (a schematic of this loop is sketched below for contrast). However, in the case of LfD, obtaining the induced training samples requires a careful and often additional setup, while in persistent SSL this functionality is inherently available. Secondly, the performance of the LfD expert (i.e., in many cases a human) is not easy to control, often reacting too late or too early. The control policy of the persistent SSL ground truth override system can, on the other hand, be very deterministic. In the case of a DAgger application with drones flying through a forest [3], it proved infeasible to reliably sample the expert in an online fashion. Acquired videos had to be processed offline by the expert, hence the need for (offline ↔ online) iterations. Moreover, an additional online human safety override interface was still necessary to prevent damage to the drone while learning. Thirdly, due to the cost of (and need for) iterative demonstration sessions, the emphasis of DAgger is on converging fast with as few expert sessions as possible. In persistent SSL, there are no costs for using the teacher signals coming from the original sensor cue. With persistent SSL we can therefore directly focus on effectively using the available amount of training samples, instead of on minimizing the number of iterations as in DAgger. Another reason why persistent SSL handles the induced training sample issue better than other state-of-the-art robot learning methods is that in persistent SSL part of the learning problem itself can easily be separated from the behavior and tested, i.e., in a traditional supervised learning setting. In our proof of concept, this has allowed us to test the learning algorithms and thoroughly investigate their limits before deployment, which helped us to identify when the behavioral influence was to blame for bad results.
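The DAgger loop referred to above can be summarized in a few lines; expert, train_policy, and rollout_states are hypothetical stand-ins. The contrast with persistent SSL is that there the 'expert' query is simply the on-board stereo cue, so labeled pairs arrive at every frame and no separate demonstration sessions are needed.

```python
def dagger(expert, train_policy, rollout_states, n_iters=5):
    """Schematic DAgger [2]: iteratively aggregate the dataset with states
    induced by the learner's own policy, each labeled by the expert."""
    dataset = [(s, expert(s)) for s in rollout_states(expert)]  # initial demos
    policy = train_policy(dataset)
    for _ in range(n_iters):
        induced = rollout_states(policy)              # learner-induced states
        dataset += [(s, expert(s)) for s in induced]  # costly expert queries
        policy = train_policy(dataset)                # retrain on the aggregate
    return policy
```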

VIII. CONCLUSION

We have investigated a novel Self-Supervised Learning scheme in which the supervisory signal is switched off after an initial learning period. In particular, we have studied an instance of such persistent SSL for the task of obstacle avoidance, in which the robot uses trusted stereo vision distance estimates in order to learn appearance-based monocular distance estimation. We have shown that this particular setup is very similar to learning from demonstration. This similarity has been corroborated by experiments in simulation, which showed that the worst learning strategy is to make a hard switch from stereo vision flight to mono vision flight. It is best to have the robot fly based on mono vision, using stereo vision only as 'training wheels' to take over when the robot would otherwise collide with an obstacle. The real-world robot experiments show the feasibility of the approach, giving acceptable results already with just 4-5 minutes of learning.

The findings also indicate interesting future avenues of investigation. First, and perhaps most importantly, in the 4-5 minutes of the real-world experiments the robot already gathers roughly 7,000-9,000 supervised learning samples. It is clear that longer learning times can lead to very large supervised data sets, which are suitable for deep learning approaches. Such approaches likely allow the learning to extend to much larger, more varied environments. In addition, they could allow the learning to improve the resolution of disparity estimates from a single value to a full image-size disparity map. Second, in the current experiments the robot stayed in a single environment. We mentioned that a different environment can make the learned mapping invalid, and that this can be detected by means of the ground truth. Another avenue, as studied in [10], is to use a machine learning method with an associated uncertainty value. For instance, one could use a learning method such as a Gaussian Process. This can help with a further integration of the behavior with learning, for instance by tuning the forward velocity based on the certainty. These avenues together could allow persistent SSL to reach its full potential, significantly enhancing the robustness of robots operating in real-world environments.

APPENDIX A
DEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep convolutional neural network (CNN) on board a drone to test our proposed learning scheme. We scaled the CNN architecture so that implementing this network on the AR Drone 2 hardware remained feasible. Due to CPU restrictions of our target system, we chose to use a relatively small CNN, inspired by the Cifar10 CNN example used in Caffe. Our layers are as follows. Input: a 128x128 pixel image (input images are scaled). First hidden layer: 5x5 convolution, max pooling to 3x3, 32 kernels, contrast normalization. Second hidden layer: 5x5 convolution, average pooling to 3x3, 32 kernels, contrast normalization. Third hidden layer: 5x5 convolution, average pooling to 3x3, 128 kernels. Fourth and fifth layers: fully connected. Lastly, a Euclidean loss output layer to one single output. Furthermore, we used a learning rate of 1e-6, a momentum of 0.9, and 10,000 iterations. The networks were trained using the open source Caffe framework on a dedicated compute server with an NVidia GTX660. In Figure 19, the learning curves of our best CNN under these constraints versus the best of our VBoW-trained algorithms are shown. The figure shows that the CNN is able to learn the problem of monocular depth estimation to a comparable fashion as our VBoW method. For the expected amount of training data during a single flight, the VBoW method delivers better performance for dataset 1 and slightly worse for dataset 2.

Fig. 19. Learning curves on two datasets for CNNs versus VBoW. Dashed and solid lines refer to results on the train and test set, respectively.
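To illustrate the scale of this network, the following sketch re-expresses the described architecture in PyTorch rather than the original Caffe prototxt. The input channel count, padding, LRN size, and fully connected width are assumptions not specified in the text; only the kernel sizes, kernel counts, pooling types, loss, and optimizer settings come from the description above.

```python
import torch
import torch.nn as nn

class MonoDisparityCNN(nn.Module):
    """Downsized Cifar10-style CNN regressing a single disparity value."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, padding=2),    # 1st hidden layer
            nn.MaxPool2d(kernel_size=3, stride=2),         # max pooling, 3x3
            nn.LocalResponseNorm(5),                       # contrast normalization
            nn.Conv2d(32, 32, kernel_size=5, padding=2),   # 2nd hidden layer
            nn.AvgPool2d(kernel_size=3, stride=2),         # average pooling, 3x3
            nn.LocalResponseNorm(5),
            nn.Conv2d(32, 128, kernel_size=5, padding=2),  # 3rd hidden layer
            nn.AvgPool2d(kernel_size=3, stride=2),
        )
        self.regressor = nn.Sequential(                    # 4th and 5th layers
            nn.Flatten(),
            nn.Linear(128 * 15 * 15, 64),   # 15x15 after three pools; 64 is assumed
            nn.ReLU(),
            nn.Linear(64, 1),               # single disparity output
        )

    def forward(self, x):                   # x: (N, 3, 128, 128) scaled images
        return self.regressor(self.features(x))

model = MonoDisparityCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
loss_fn = nn.MSELoss()                      # Euclidean loss on the single output
```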

REFERENCES

[1] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.

[2] S. Ross, G. J. Gordon, and J. A. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," 14th International Conference on Artificial Intelligence and Statistics, vol. 15, 2011.


[3] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," 2013 IEEE International Conference on Robotics and Automation, pp. 1765–1772, May 2013. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6630809

[4] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., "Stanley: The robot that won the DARPA Grand Challenge," in The 2005 DARPA Grand Challenge. Springer, 2007, pp. 1–43.

[5] D. Lieb, A. Lookingbill, and S. Thrun, "Adaptive road following using self-supervised learning and reverse optical flow," in Robotics: Science and Systems, 2005, pp. 273–280.

[6] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, "Reverse optical flow for self-supervised adaptive autonomous robot navigation," International Journal of Computer Vision, vol. 74, no. 3, pp. 287–302, 2007.

[7] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009. [Online]. Available: http://doi.wiley.com/10.1002/rob.20276

[8] U. A. Muller, L. D. Jackel, Y. LeCun, and B. Flepp, "Real-time adaptive off-road vehicle navigation and terrain classification," in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2013, pp. 87410A–87410A.

[9] J. Baleia, P. Santana, and J. Barata, "On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances," Journal of Intelligent & Robotic Systems, pp. 1–20, 2015.

[10] H. Ho, C. De Wagter, B. Remes, and G. de Croon, "Optical flow for self-supervised learning of obstacle appearance," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3098–3104.

[11] T. Mori and S. Scherer, "First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles," Proceedings - IEEE International Conference on Robotics and Automation, pp. 1750–1757, 2013.

[12] J. Engel, J. Sturm, and D. Cremers, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0921889014000566

[13] J. Engel, J. Sturm, and D. Cremers, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[14] F. van Breugel, K. Morgansen, and M. H. Dickinson, "Monocular distance estimation from optic flow during active landing maneuvers," Bioinspiration & Biomimetics, vol. 9, no. 2, p. 025002, 2014.

[15] G. C. de Croon, "Monocular distance estimation with optical flow maneuvers and efference copies: a stability-based strategy," Bioinspiration & Biomimetics, vol. 11, no. 1, p. 016004, 2016.

[16] G. de Croon, M. Groen, C. De Wagter, B. Remes, R. Ruijsink, and B. van Oudheusden, "Design, aerodynamics and autonomy of the DelFly," Bioinspiration & Biomimetics, vol. 7, no. 2, p. 025003, 2012.

[17] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009. [Online]. Available: http://dx.doi.org/10.1016/j.robot.2008.10.024

[18] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.

[19] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D depth reconstruction from a single still image," International Journal of Computer Vision, vol. 76, no. 1, pp. 53–69, Aug. 2007. [Online]. Available: http://link.springer.com/10.1007/s11263-007-0071-y

[20] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, May 2009. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/19299858

[21] I. Lenz, M. Gemici, and A. Saxena, "Low-power parallel algorithms for single image based obstacle avoidance in aerial robots," IEEE International Conference on Intelligent Robots and Systems, pp. 772–779, 2012. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6386146

[22] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, pp. 1–9, 2014.

[23] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning, vol. 3, no. Suppl 1. ACM Press, 2005, pp. 593–600. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1102351.1102426

[24] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and Learning for Deliberative Monocular Cluttered Flight."

[25] K. Yamauchi, M. Oota, and N. Ishii, "A self-supervised learning system for pattern recognition by sensory integration," Neural Networks, vol. 12, no. 10, pp. 1347–1358, 1999.

[26] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L. E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, "Stanley: The robot that won the DARPA Grand Challenge," Springer Tracts in Advanced Robotics, vol. 36, pp. 1–43, 2007.

[27] C. De Wagter, S. Tijmons, B. Remes, and G. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in IEEE International Conference on Robotics & Automation (ICRA), 2014, pp. 4982–4987.

[28] G. C. H. E. De Croon, E. De Weerdt, C. De Wagter, B. D. W. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.

[29] M. Varma and A. Zisserman, "Texture classification: Are filter banks necessary?" in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003.

[30] B. Wu, T. L. Ooi, and Z. J. He, "Perceiving distance accurately by a directional process of integrating ground information," Nature, vol. 428, no. 6978, pp. 73–77, 2004.

[31] C. De Wagter and M. Amelink, "Holiday50av technical paper," 2007.

[32] P. R. Cohen, "Empirical methods for artificial intelligence," IEEE Intelligent Systems, no. 6, p. 88, 1996.

[33] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 4982–4987.

[34] B. Remes, D. Hensen, F. Van Tienen, C. De Wagter, E. Van der Horst, and G. De Croon, "Paparazzi: how to make a swarm of Parrot AR drones fly autonomously based on GPS," in IMAV 2013: Proceedings of the International Micro Air Vehicle Conference and Flight Competition, Toulouse, France, 17–20 September 2013, 2013.

[35] P. Brisset, A. Drouin, M. Gorraz, P.-S. Huard, and J. Tyler, "The Paparazzi Solution," 2014.

[36] X. Zhu, "Semi-supervised learning literature survey," Sciences-New York, pp. 1–59, 2007. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.2352&rep=rep1&type=pdf

[37] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, "Boosting structured prediction for imitation learning," Advances in Neural Information Processing Systems (NIPS), p. 54, 2006.

[38] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, pp. 1238–1274, 2013. [Online]. Available: http://ijr.sagepub.com/cgi/doi/10.1177/0278364913495721

[39] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015. [Online]. Available: http://www.nature.com/doifinder/10.1038/nature14540

  • I Introduction
  • II Related work
    • II-A Monocular navigation
      • II-A1 Appearance-based navigation without distance estimation
      • II-A2 Offline monocular distance learning
        • II-B Self-supervised learning
          • III Methodology overview
            • III-A Persistent SSL principle
            • III-B Stereo-to-mono proof of concept
            • III-C Stereo vision processing
            • III-D Monocular disparity estimation
            • III-E Behavior
            • III-F Performance
            • III-G Similarity with Learning from Demonstration
              • IV Offline Vision Experiments
              • V Simulation Experiments
                • V-A Setup
                • V-B Results
                  • VI Robotic Experiments
                    • VI-1 Results
                      • VII Discussion
                        • VII-A Interpretation of the results
                        • VII-B Deep learning
                        • VII-C Persistent SSL in relation to other machine learning techniques
                        • VII-D Feedback-induced data bias
                          • VIII Conclusion
                          • Appendix A Deep neural network results
                          • References

13

Fig 17 Flight path of test 10 in room 2 Monocular flight starts form approach 23 The meaning of the color of the flightpath differs per image A theapproximated disparity based on the external tracking system B the measured stereo average disparity C the monocular estimated disparity D the errorbetween BampC with dark blue meaning zero error E error during FP F error during FN E and F only show the monocular part of the flight

VII DISCUSSION

We start the discussion with an interpretation of the resultsfrom the simulation and real-world experiments after whichwe proceed by discussing persistent SSL in general andprovide a comparison to other machine learning techniques

A Interpretation of the results

Using persistent SSL we were able to autonomously navigateour multicopter on the basis of a stereo vision camera whiletraining a monocular estimator on board and online Althoughthe monocular estimator allows the drone to continue flyingand avoiding obstacles the performance during the approx-imately ten-minute flights is not perfect During monocularflight a fairly limited amount of (autonomous) stereo overrideswas needed while at the same time the robot was not fullyexploring the room like when using stereoSeveral improvements can be suggested First we can simplyhave the drone learn for a longer time accumulating trainingdata over multiple flights In an extended offline test ourVBoW method shows saturation at around 6000 samplesUsing additional features and more advanced learning methods

may result in improved performance if training set sizesincreaseDuring our tests in different environments it proved unnec-essary to tune the VBoW learning algorithm parameters toa new environment as similar performance was obtained Thelearned results on the robot itself may or may not generalize todifferent environments however this is of less concern as therobot can detect a new environment and then decide to continuethe learning process if the original cue is still availableIn order to detect an inadequacy of the learned regressionfunction the robot can occasionally check the estimation erroragainst the stereo ground truth In fact our system alreadydoes so autonomously using its safety override Methods onchecking the performance without using the ground truth egby employing a learner that gives an estimate of uncertaintyare left for future work

B Deep learningAt the time of our robotic experiments implementing state-of-the-art deep learning methods on-board a flying drone wasdeemed infeasible due to hardware restrictions However inAppendix A we investigate using downsized networks in order

14

to show that the persistent SSL method can work with deeplearning as well One of the major advantages of persistentSSL is the unprecedented amount of available training dataThis amount of data will be more useful to more complexlearning methods such as deep learning methods than to lesscomplex but computationally efficient methods such as theVBoW method used in our experiments Today with theavailability of strongly improved hardware such as the NVidiaJetson TX1 close-to state-of-the-art models can be trained andrun on-board a drone which may significantly improve thelearning results

C Persistent SSL in relation to other machine learning tech-niques

In order to place persistent SSL in the general framework ofmachine learning we compare it with several techniques Anoverview of this comparison is presented in figure 18

Importance of supervision

Persistent Self-Supervised

Learning

Unsupervised Learning

Learning from Demonstration

ReinforcementLearning

Self-Supervised Learning

SupervisedLearning

Imp

ort

ance

of

be

hav

ior

Classical machine learning

Classical control policy learning

Fig 18 Lay of the machine learning land

Un-semi-super-vised learning Unsupervised learning doesnot require labeled data semi-supervised learning requires onlyan initial set of labeled data [36] and supervised learningrequires all the data to be labeled Internally persistent SSLuses a standard supervised learning scheme which greatlyfacilitates and speeds up learning The typical downside ofsupervised learning - acquiring the labels - does not apply toSSL since the robot continuously provides these labels itselfA major difference between the typical use of supervisedlearning and its use in persistent SSL is that the data forlearning is generally assumed to be iid However persistentSSL controls a behavioral component which in turn affectsboth the dataset obtained during training as well as duringtesting time Operation based on ground truth induces a certainbehavior that differs significantly from behavior induced froma trained estimator even more so for an undertrained estimatorSSL The persistent form of SSL is set apart in the figure fromlsquonormalrsquo SSL because the persistence property introduces amuch more significant behavioral component to the learningWhile normal SSL expects the trusted cue to remain availablepersistent SSL assumes that the robot may sometimes act in

the absence of the trusted cue This introduces the feedback-induced data bias problem which as we have seen requiresspecific behavior strategies for best learning the robotrsquos taskLearning from demonstration Learning from Demonstration(LfD) or Learning from Demonstration (LfD) is a closerelative to persistent SSL Consider for instance teleoperationan LfD scheme in which a (human or robot) teacher remotelyoperates a robot in order for it to learn demonstrated actionsin its environment [17] This can be compared to persistentSSL if we consider the teacher to be the ground truth functiong(xg) in the persistent SSL scheme In most cases described inliterature the teacher shows actions from a control policy takenon the basis of a state instead of just the results from a sensorycue (ie the state) However LfD does contain exceptions inwhich the learner only records the states during demonstrationeg when drawing a map through a 2D representation of theworld in case of a path planning mission in an outdoor robot[37] Like persistent SSL test time decisions taken in LfDschemes influence future observations which may or may notbe contained in known demonstrated territory However onekey difference between LfD and persistent SSL arguably setsthem apart All LfD theory known to the authors implicitlyassumes the teacher is never the same entity as the learner Itmay be that all relevant sensors are on the learner and eventhat the learners body is used to execute teacher commands(like in teleoperation) but the teachers intelligence is alwaysan external entityReinforcement learning Lastly we compare persistent SSLwith Reinforcement Learning (RL) which is a distinctivelydifferent technique [38] In RL a policy is learned using areward function Due to the evaluative feedback provided inRL defining a good reward function is one of fundamentaldifficulties of RL known as reward shaping [39] [38] Sincepersistent SSL uses supervised feedback reward shaping isless of an issue in persistent SSL only requiring a choiceof a loss function between g(xg) and f(xf ) Secondly theinitial exploration phase of RL often infers a lot of trial-and-error making it a dangerous time in which a physicalsystem may crash and be damaged Although this particularproblem is often solved by better initialization eg by usingfor instance LfD or using policy search instead of valuefunction based approaches persistent SSL does not require anuntrained initialization phase at all as a reliable ground truthfunction guarantees a certain minimal correct behaviorPersistent SSL differs from other learning techniques in thesense that no complete training data set is needed to train thealgorithm beforehand Instead it requires a ground truth g(xg)which must be available online in real-time while trainingf(xf ) but can be switched off when f(xf ) is learned tosatisfaction This implies that learning needs to be persistentand that the switch θ must be included in the model Notethat in cases where the environment of the robot may changemeasures can be put in place to detect the output uncertaintyof f(xf ) If the uncertainty goes up the robot can switchback to using the ground truth function and learning can thenbe activated again Developing such measures is however leftfor future work

15

D Feedback-induced data bias

The robot induces how its environment is perceived meaning itinfluences the acquired training samples based on its behaviorThe problems arising from this feedback-induced data biasare known from other machine learning disciplines such asRL and LfD [38] In particular Ross et al have proposedDAgger [2] to solve a similar problem in the LfD domainwhich iteratively aggregates the dataset with induced trainingsamples and the experts reaction to it However in the caseof LfD obtaining the induced training samples requires acareful and often additional setup while in persistent SSL thisfunctionality is inherently available Secondly the performanceof the LfD expert (ie in many cases a human) is not easy tocontrol often reacting too late or too early The control policyof the persistent SSL ground truth override system can on theother hand be very deterministic In the case of a DAggerapplication with drones flying through a forest [3] it provedinfeasible to reliably sample the expert in an online fashionAcquired videos had to be processed offline by the experthence the need for (offline harr online) iterations Moreover anadditional online human safety override interface was still nec-essary to prevent damage to the drone while learning Thirdlydue to the cost of (and need for) iterative demonstrationsessions the emphasis of DAgger is on converging fast withneeding as little expert sessions as possible In persistent SSLthere are no costs for using the teacher signals coming from theoriginal sensor cue With persistent SSL we can directly focuson effectively using the available amount of training samplesinstead of minimizing the number of iterations like in DAggerAnother reason why persistent SSL handles the induced train-ing sample issue better than other state of the art robot learningmethods is that in persistent SSL part of the learning problemitself can be easily separated and tested from the behaviorie in a traditional supervised learning setting In our proofof concept this has allowed us to test the learning algorithmsand thoroughly investigate its limits before deployment whichhelped us to identify when the behavioral influence was toblame for bad results

VIII CONCLUSION

We have investigated a novel Self-Supervised Learningscheme in which the supervisory signal is switched off after aninitial learning period In particular we have studied an instanceof such persistent SSL for the task of obstacle avoidancein which the robot uses trusted stereo vision distance esti-mates in order to learn appearance-based monocular distanceestimation We have shown that this particular setup is verysimilar to learning from demonstration This similarity hasbeen corroborated by experiments in simulation which showedthat the worst learning strategy is to make a hard switch fromstereo vision flight to mono vision flight It is best to have therobot fly based on mono vision and using stereo vision only aslsquotraining wheelsrsquo to take over when the robot would otherwisecollide with an obstacle The real-world robot experimentsshow the feasibility of the approach giving acceptable resultsalready with just 4-5 minutes of learning

The findings also indicate interesting future venues of investi-gation First and perhaps most importantly in the 4-5 minutesof the real-world experiments the robot already experiencesroughly 7000 - 9000 supervised learning samples It is clearthat longer learning times can lead to very large superviseddata sets which are suitable for deep learning approachesSuch approaches likely allow the learning to extend to muchlarger more varied environments In addition they could allowthe learning to improve the resolution of disparity estimatesfrom a single value to a full image size disparity mapSecond in the current experiments the robot stayed in a singleenvironment We mentioned that a different environment canmake the learned mapping invalid and that this can be detectedby means of the ground truth Another venue as studied in[10] is to use a machine learning method with an associateduncertainty value For instance one could use a learningmethod such as a Gaussian Process This can help with afurther integration of the behavior with learning for instanceby tuning the forward velocity based on the certainty Thesevenues together could allow for persistent SSL to reach itsfull potential significantly enhancing the robustness of robotsoperating in real-world environments

APPENDIX ADEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep con-volutional neural network (CNN) on-board a drone to test ourproposed learning scheme We scaled the CNN architecture sothat implementing this network on the ARDrone 2 hardwareremained feasible Due to CPU restrictions of our target sys-tem we choose to use a relatively small CNN inspired by theCifar10 CNN example used in Caffe Our layers are as followsInput 128x128 pixels image (input images are scaled) Firsthidden layer 5x5 convolution max pooling to 3x3 32 kernelscontrast normalization Second hidden layer 5x5 convolutionaverage pooling to 3x3 32 kernels contrast normalizationThird hidden layer 5x5 convolution average pooling to 3x3128 kernels Fourth and fifth layer fully connected Lastly aneuclidean loss output layer to one single output Furthermorewe used a learning rate of 1e-6 09 momentum and used10000 iterations The networks are trained using the opensource Caffe framework on a dedicated compute server withan NVidia GTX660 In Figure 19 the learning curves of ourbest CNN under these constraints versus the best of our VBoWtrained algorithms are shown The figure shows that the CNNis able to learn the problem of monocular depth estimation toa comparable fasion as our VBoW method For the expectedamount of training data during a single flight the VBoWmethod delivers better performance for dataset 1 and slightlyworse for dataset 2

REFERENCES

[1] J Garcıa and F Fernandez ldquoA comprehensive survey on safe rein-forcement learningrdquo Journal of Machine Learning Research vol 16pp 1437ndash1480 2015

[2] S Ross G J Gordon and J A Bagnell ldquoA Reduction of ImitationLearning and Structured Predictionrdquo 14th International Conference onArtificial Intelligence and Statistics vol 15 2011

16

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 60000

05

1

15

2

25

3

MS

E

Dataset 1

500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000

04

05

06

07

08

09

1

Trainnig set size (frames)

Are

a u

nder

RO

C c

urv

e

500 1000 1500 2000 250010

20

30

40

50

60

70

80Dataset 2

VBOW test

VBOW train

CNN test

CNN train

500 1000 1500 2000 2500093

094

095

096

097

098

099

1

Trainnig set size (frames)

Fig 19 Learning curves on two datasets for CNNs versus VBoW Dashed solid lines refer to results on traintest set

[3] S Ross N Melik-Barkhudarov K S Shankar A Wendel D DeyJ A Bagnell and M Hebert ldquoLearning monocular reactive UAVcontrol in cluttered natural environmentsrdquo 2013 IEEE InternationalConference on Robotics and Automation pp 1765ndash1772 May 2013[Online] Available httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=6630809

[4] S Thrun M Montemerlo H Dahlkamp D Stavens A Aron J DiebelP Fong J Gale M Halpenny G Hoffmann et al ldquoStanley Therobot that won the darpa grand challengerdquo in The 2005 DARPA GrandChallenge Springer 2007 pp 1ndash43

[5] D Lieb A Lookingbill and S Thrun ldquoAdaptive road following usingself-supervised learning and reverse optical flowrdquo in Robotics Scienceand Systems 2005 pp 273ndash280

[6] A Lookingbill J Rogers D Lieb J Curry and S Thrun ldquoReverseoptical flow for self-supervised adaptive autonomous robot navigationrdquoInternational Journal of Computer Vision vol 74 no 3 pp 287ndash3022007

[7] R Hadsell P Sermanet J Ben A Erkan M Scoffier K KavukcuogluU Muller and Y LeCun ldquoLearning long-range vision for autonomousoff-road drivingrdquo Journal of Field Robotics vol 26 no 2 pp 120ndash1442009 [Online] Available httpdoiwileycom101002rob20276httponlinelibrarywileycomdoi101002rob20276abstract

[8] U A Muller L D Jackel Y LeCun and B Flepp ldquoReal-time adaptiveoff-road vehicle navigation and terrain classificationrdquo in SPIE Defense

Security and Sensing International Society for Optics and Photonics2013 pp 87 410Andash87 410A

[9] J Baleia P Santana and J Barata ldquoOn exploiting haptic cues forself-supervised learning of depth-based robot navigation affordancesrdquoJournal of Intelligent amp Robotic Systems pp 1ndash20 2015

[10] H Ho C De Wagter B Remes and G de Croon ldquoOptical flow for self-supervised learning of obstacle appearancerdquo in Intelligent Robots andSystems (IROS) 2015 IEEERSJ International Conference on IEEE2015 pp 3098ndash3104

[11] T Mori and S Scherer ldquoFirst results in detecting and avoiding frontalobstacles from a monocular camera for micro unmanned aerial vehi-clesrdquo Proceedings - IEEE International Conference on Robotics andAutomation pp 1750ndash1757 2013

[12] J Engel J Sturm and D Cremers ldquoScale-aware navigation of a low-cost quadrocopter with a monocular camerardquo Robotics and AutonomousSystems vol 62 no 11 pp 1646ndash1656 2014 [Online] AvailablehttpwwwsciencedirectcomsciencearticlepiiS0921889014000566

[13] mdashmdash ldquoScale-aware navigation of a low-cost quadrocopter with amonocular camerardquo Robotics and Autonomous Systems vol 62 no 11pp 1646ndash1656 2014

[14] F van Breugel K Morgansen and M H Dickinson ldquoMonoculardistance estimation from optic flow during active landing maneuversrdquoBioinspiration amp biomimetics vol 9 no 2 p 025002 2014

17

[15] G C de Croon ldquoMonocular distance estimation with optical flow ma-neuvers and efference copies a stability-based strategyrdquo Bioinspirationamp biomimetics vol 11 no 1 p 016004 2016

[16] G de Croon M Groen C De Wagter B Remes R Ruijsink andB van Oudheusden ldquoDesign aerodynamics and autonomy of theDelFlyrdquo Bioinspiration amp Biomimetics vol 7 no 2 p 025003 2012

[17] B D Argall S Chernova M Veloso and B Browning ldquoA surveyof robot learning from demonstrationrdquo Robotics and AutonomousSystems vol 57 no 5 pp 469ndash483 2009 [Online] Availablehttpdxdoiorg101016jrobot200810024

[18] D Hoiem A a Efros and M Hebert ldquoAutomatic photo pop-uprdquo ACMTransactions on Graphics vol 24 no 3 p 577 2005

[19] A Saxena S H Chung and A Y Ng ldquo3-D Depth Reconstructionfrom a Single Still Imagerdquo International Journal of ComputerVision vol 76 no 1 pp 53ndash69 Aug 2007 [Online] Availablehttplinkspringercom101007s11263-007-0071-y

[20] A Saxena M Sun and A Y Ng ldquoMake3D learning 3D scenestructure from a single still imagerdquo IEEE transactions on patternanalysis and machine intelligence vol 31 no 5 pp 824ndash40 May 2009[Online] Available httpwwwncbinlmnihgovpubmed19299858

[21] I Lenz M Gemici and A Saxena ldquoLow-power parallel algorithmsfor single image based obstacle avoidance in aerial robotsrdquo IEEEInternational Conference on Intelligent Robots and Systems pp772ndash779 2012 [Online] Available httpieeexploreieeeorglpdocsepic03wrapperhtmarnumber=6386146

[22] D Eigen C Puhrsch and R Fergus ldquoDepth map prediction from asingle image using a multi-scale deep networkrdquo Advances in NeuralInformation pp 1ndash9 2014 [Online] Available httppapersnipsccpaper5539-tree-structured-gaussian-process-approximations

[23] J Michels A Saxena and A Y Ng ldquoHigh speed obstacle avoidanceusing monocular vision and reinforcement learningrdquo in Proceedingsof the 22nd International Conference on Machine Learningvol 3 no Suppl 1 ACM Press 2005 pp 593ndash600 [Online]Available httpportalacmorgcitationcfmdoid=11023511102426$delimiterrdquo026E30F$nhttpdlacmorgcitationcfmid=1102426httpportalacmorgcitationcfmid=1102426

[24] D Dey K S Shankar S Zeng R Mehta M T Agcayazi C EriksenS Daftry M Hebert and J A Bagnell ldquoVision and Learning forDeliberative Monocular Cluttered Flightrdquo

[25] K Yamauchi M Oota and N Ishii ldquoA self-supervised learningsystem for pattern recognition by sensory integrationrdquo Neural Networksvol 12 no 10 pp 1347ndash1358 1999

[26] S Thrun M Montemerlo H Dahlkamp D Stavens A Aron J DiebelP Fong J Gale M Halpenny G Hoffmann K Lau C OakleyM Palatucci V Pratt P Stang S Strohband C Dupont L E Jen-drossek C Koelen C Markey C Rummel J van Niekerk E JensenP Alessandrini G Bradski B Davies S Ettinger A Kaehler A Ne-fian and P Mahoney ldquoStanley The robot that won the DARPA GrandChallengerdquo Springer Tracts in Advanced Robotics vol 36 pp 1ndash432007

[27] C De Wagter S Tijmons B Remes and G de Croon ldquoAutonomousFlight of a 20-gram Flapping Wing MAV with a 4-gram OnboardStereo Vision Systemrdquo in IEEE International Conference on Roboticsamp Automation (ICRA) no Section IV 2014 pp 4982ndash4987

[28] G C H E De Croon E De Weerdt C De Wagter B D W Remesand R Ruijsink ldquoThe appearance variation cue for obstacle avoidancerdquoIEEE Transactions on Robotics vol 28 no 2 pp 529ndash534 2012

[29] M Varma and A Zisserman ldquoTexture Classification Are Filter BanksNecessary 1 Introduction 2 A review of the VZ classifierrdquo ComputerVision and Pattern Recognition 2003 Proceedings 2003 IEEE ComputerSociety Conference on 2003

[30] B Wu T L Ooi and Z J He ldquoPerceiving distance accurately bya directional process of integrating ground informationrdquo Nature vol428 no 6978 pp 73ndash77 2004

[31] C De Wagter and M Amelink ldquoHoliday50av technical paperrdquo 2007

[32] P R Cohen ldquoEmpirical methods for artificial intelligencerdquo IEEEIntelligent Systems no 6 p 88 1996

[33] C De Wagter S Tijmons B D Remes and G C de CroonldquoAutonomous flight of a 20-gram flapping wing mav with a 4-gramonboard stereo vision systemrdquo in Robotics and Automation (ICRA)2014 IEEE International Conference on IEEE 2014 pp 4982ndash4987

[34] B Remes D Hensen F Van Tienen C De Wagter E Van der Horstand G De Croon ldquoPaparazzi how to make a swarm of parrot ar dronesfly autonomously based on gpsrdquo in IMAV 2013 Proceedings of theInternational Micro Air Vehicle Conference and Flight CompetitionToulouse France 17-20 September 2013 2013

[35] P Brisset A Drouin M Gorraz P-s Huard P Brisset A DrouinM Gorraz P-s Huard J T T Pa P Brisset A Drouin M GorrazP-s Huard and J Tyler ldquoThe Paparazzi Solutionrdquo 2014

[36] X Zhu ldquoSemi-Supervised Learning Literature Surveyrdquo Sciences-NewYork pp 1ndash59 2007 [Online] Available httpciteseerxistpsueduviewdocdownloaddoi=10111462352amprep=rep1amptype=pdf

[37] N Ratliff D Bradley J A Bagnell and J Chestnutt ldquoBoostingStructured Prediction for Imitation Learning for Imitation LearningrdquoAdvances in Neural Information Processing Systems (NIPS) p 542006

[38] J Kober J a Bagnell and J Peters ldquoReinforcement learningin robotics A surveyrdquo The International Journal of RoboticsResearch vol 32 pp 1238ndash1274 2013 [Online] Availablehttpijrsagepubcomcgidoi1011770278364913495721

[39] M L Littman ldquoReinforcement learning improves behaviour fromevaluative feedbackrdquo Nature vol 521 no 7553 pp 445ndash4512015 [Online] Available httpwwwnaturecomdoifinder101038nature14540

  • I Introduction
  • II Related work
    • II-A Monocular navigation
      • II-A1 Appearance-based navigation without distance estimation
      • II-A2 Offline monocular distance learning
        • II-B Self-supervised learning
          • III Methodology overview
            • III-A Persistent SSL principle
            • III-B Stereo-to-mono proof of concept
            • III-C Stereo vision processing
            • III-D Monocular disparity estimation
            • III-E Behavior
            • III-F Performance
            • III-G Similarity with Learning from Demonstration
              • IV Offline Vision Experiments
              • V Simulation Experiments
                • V-A Setup
                • V-B Results
                  • VI Robotic Experiments
                    • VI-1 Results
                      • VII Discussion
                        • VII-A Interpretation of the results
                        • VII-B Deep learning
                        • VII-C Persistent SSL in relation to other machine learning techniques
                        • VII-D Feedback-induced data bias
                          • VIII Conclusion
                          • Appendix A Deep neural network results
                          • References

14

to show that the persistent SSL method can work with deeplearning as well One of the major advantages of persistentSSL is the unprecedented amount of available training dataThis amount of data will be more useful to more complexlearning methods such as deep learning methods than to lesscomplex but computationally efficient methods such as theVBoW method used in our experiments Today with theavailability of strongly improved hardware such as the NVidiaJetson TX1 close-to state-of-the-art models can be trained andrun on-board a drone which may significantly improve thelearning results

C Persistent SSL in relation to other machine learning tech-niques

In order to place persistent SSL in the general framework ofmachine learning we compare it with several techniques Anoverview of this comparison is presented in figure 18

Importance of supervision

Persistent Self-Supervised

Learning

Unsupervised Learning

Learning from Demonstration

ReinforcementLearning

Self-Supervised Learning

SupervisedLearning

Imp

ort

ance

of

be

hav

ior

Classical machine learning

Classical control policy learning

Fig 18 Lay of the machine learning land

Un-/semi-/supervised learning: Unsupervised learning does not require labeled data, semi-supervised learning requires only an initial set of labeled data [36], and supervised learning requires all the data to be labeled. Internally, persistent SSL uses a standard supervised learning scheme, which greatly facilitates and speeds up learning. The typical downside of supervised learning - acquiring the labels - does not apply to SSL, since the robot continuously provides these labels itself. A major difference between the typical use of supervised learning and its use in persistent SSL is that the data for learning is generally assumed to be i.i.d. However, persistent SSL controls a behavioral component, which in turn affects the dataset obtained both during training and at test time. Operation based on ground truth induces a certain behavior that differs significantly from behavior induced by a trained estimator, even more so for an undertrained estimator.

SSL: The persistent form of SSL is set apart in the figure from 'normal' SSL, because the persistence property introduces a much more significant behavioral component to the learning. While normal SSL expects the trusted cue to remain available, persistent SSL assumes that the robot may sometimes act in the absence of the trusted cue. This introduces the feedback-induced data bias problem, which, as we have seen, requires specific behavior strategies for best learning the robot's task.

Learning from demonstration: Learning from Demonstration (LfD) is a close relative to persistent SSL. Consider, for instance, teleoperation, an LfD scheme in which a (human or robot) teacher remotely operates a robot in order for it to learn demonstrated actions in its environment [17]. This can be compared to persistent SSL if we consider the teacher to be the ground truth function g(x_g) in the persistent SSL scheme. In most cases described in the literature, the teacher demonstrates actions from a control policy taken on the basis of a state, instead of just the results from a sensory cue (i.e., the state). However, LfD does contain exceptions in which the learner only records the states during demonstration, e.g., when drawing a map through a 2D representation of the world in case of a path planning mission in an outdoor robot [37]. As in persistent SSL, test-time decisions taken in LfD schemes influence future observations, which may or may not be contained in known, demonstrated territory. However, one key difference between LfD and persistent SSL arguably sets them apart: all LfD theory known to the authors implicitly assumes that the teacher is never the same entity as the learner. It may be that all relevant sensors are on the learner, and even that the learner's body is used to execute teacher commands (like in teleoperation), but the teacher's intelligence is always an external entity.

Reinforcement learning: Lastly, we compare persistent SSL with Reinforcement Learning (RL), which is a distinctively different technique [38]. In RL, a policy is learned using a reward function. Due to the evaluative feedback provided in RL, defining a good reward function is one of the fundamental difficulties of RL, known as reward shaping [39], [38]. Since persistent SSL uses supervised feedback, reward shaping is less of an issue in persistent SSL, only requiring the choice of a loss function between g(x_g) and f(x_f). Secondly, the initial exploration phase of RL often involves a lot of trial and error, making it a dangerous time in which a physical system may crash and be damaged. Although this particular problem is often solved by better initialization, e.g., by using LfD or by using policy search instead of value-function-based approaches, persistent SSL does not require an untrained initialization phase at all, as a reliable ground truth function guarantees a certain minimal correct behavior.

Persistent SSL differs from the other learning techniques in the sense that no complete training data set is needed to train the algorithm beforehand. Instead, it requires a ground truth g(x_g), which must be available online in real time while training f(x_f), but which can be switched off when f(x_f) has been learned to satisfaction. This implies that learning needs to be persistent and that the switch θ must be included in the model. Note that in cases where the environment of the robot may change, measures can be put in place to detect the output uncertainty of f(x_f). If the uncertainty goes up, the robot can switch back to using the ground truth function, and learning can then be activated again. Developing such measures is, however, left for future work. A minimal sketch of this switching loop is given below.
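To make the role of the switch θ concrete, the following sketch shows one way such a loop could be organized. The toy Estimator and the uncertainty threshold are illustrative assumptions, not the implementation used in the paper:

```python
class Estimator:
    """Toy running-average stand-in for the learned function f(x_f);
    the paper's system uses VBoW features, but any regressor fits here."""
    def __init__(self):
        self.mean, self.n = 0.0, 0

    def update(self, x_f, label):
        self.n += 1
        self.mean += (label - self.mean) / self.n  # incremental mean

    def predict(self, x_f):
        uncertainty = 1.0 / (1 + self.n)           # crude confidence proxy
        return self.mean, uncertainty


def control_step(estimator, x_f, g_value, theta, unc_threshold=0.05):
    """One iteration of the persistent SSL loop with switch theta.

    g_value is the trusted stereo estimate g(x_g); x_f is the monocular
    cue. While theta == 'trusted', the robot flies on g and learns f;
    afterwards it flies on f, falling back to g if uncertainty rises."""
    if theta == "trusted":
        estimator.update(x_f, g_value)   # learn while flying on the trusted cue
        return g_value, theta
    d_hat, unc = estimator.predict(x_f)  # act on the learned cue f(x_f)
    if unc > unc_threshold:              # environment may have changed:
        theta = "trusted"                # re-engage the ground truth and relearn
    return d_hat, theta


# Illustrative use:
est = Estimator()
d, theta = control_step(est, x_f=0.0, g_value=1.5, theta="trusted")
```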


D. Feedback-induced data bias

The robot influences how its environment is perceived, meaning that it influences the acquired training samples through its behavior. The problems arising from this feedback-induced data bias are known from other machine learning disciplines, such as RL and LfD [38]. In particular, Ross et al. have proposed DAgger [2] to solve a similar problem in the LfD domain, which iteratively aggregates the dataset with induced training samples and the expert's reaction to them. However, in the case of LfD, obtaining the induced training samples requires a careful and often additional setup, while in persistent SSL this functionality is inherently available. Secondly, the performance of the LfD expert (i.e., in many cases a human) is not easy to control, often reacting too late or too early. The control policy of the persistent SSL ground truth override system can, on the other hand, be very deterministic. In the case of a DAgger application with drones flying through a forest [3], it proved infeasible to reliably sample the expert in an online fashion. Acquired videos had to be processed offline by the expert, hence the need for (offline ↔ online) iterations. Moreover, an additional online human safety override interface was still necessary to prevent damage to the drone while learning. Thirdly, due to the cost of (and need for) iterative demonstration sessions, the emphasis of DAgger is on converging fast with as few expert sessions as possible. In persistent SSL there are no costs for using the teacher signals coming from the original sensor cue. With persistent SSL we can therefore directly focus on effectively using the available amount of training samples, instead of minimizing the number of iterations as in DAgger; the sketch below contrasts the two loops. Another reason why persistent SSL handles the induced training sample issue better than other state-of-the-art robot learning methods is that in persistent SSL part of the learning problem itself can easily be separated from the behavior and tested, i.e., in a traditional supervised learning setting. In our proof of concept, this has allowed us to test the learning algorithms and thoroughly investigate their limits before deployment, which helped us to identify when the behavioral influence was to blame for bad results.
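The contrast can be made explicit with a toy sketch; fit, rollout, and the labeling functions below are synthetic stand-ins, not the paper's components:

```python
import random

def fit(data):
    """Toy 'policy': predict the mean label (stands in for a supervised fit)."""
    mean = sum(label for _, label in data) / len(data)
    return lambda s: mean

def dagger_style(expert, rollout, iterations=5):
    # DAgger [2]: repeatedly roll out, then have the expert label the
    # induced states, which costs one expert session per iteration.
    data, policy = [], (lambda s: 0.0)
    for _ in range(iterations):
        data += [(s, expert(s)) for s in rollout(policy)]
        policy = fit(data)
    return policy

def persistent_ssl_style(stereo, rollout):
    # Persistent SSL: the trusted cue labels every induced sample for free,
    # during the same flight, so no iterative expert sessions are needed.
    data, policy = [], (lambda s: 0.0)
    for s in rollout(policy):
        data.append((s, stereo(s)))
        policy = fit(data)          # (or an incremental update)
    return policy

# Illustrative use with synthetic stand-ins:
rollout = lambda policy: [random.random() for _ in range(100)]
expert = stereo = lambda s: 2.0 * s        # 'true' distance labels
p1 = dagger_style(expert, rollout)
p2 = persistent_ssl_style(stereo, rollout)
```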

VIII. CONCLUSION

We have investigated a novel Self-Supervised Learning scheme in which the supervisory signal is switched off after an initial learning period. In particular, we have studied an instance of such persistent SSL for the task of obstacle avoidance, in which the robot uses trusted stereo vision distance estimates in order to learn appearance-based monocular distance estimation. We have shown that this particular setup is very similar to learning from demonstration. This similarity has been corroborated by experiments in simulation, which showed that the worst learning strategy is to make a hard switch from stereo vision flight to mono vision flight. It is best to have the robot fly based on mono vision, using stereo vision only as 'training wheels' that take over when the robot would otherwise collide with an obstacle. The real-world robot experiments show the feasibility of the approach, giving acceptable results already with just 4-5 minutes of learning.

The findings also indicate interesting future avenues of investigation. First, and perhaps most importantly, in the 4-5 minutes of the real-world experiments the robot already gathers roughly 7000-9000 supervised learning samples. It is clear that longer learning times can lead to very large supervised data sets, which are suitable for deep learning approaches. Such approaches will likely allow the learning to extend to much larger, more varied environments. In addition, they could allow the learning to improve the resolution of the disparity estimates, from a single value to a full image-size disparity map. Second, in the current experiments the robot stayed in a single environment. We mentioned that a different environment can make the learned mapping invalid, and that this can be detected by means of the ground truth. Another avenue, as studied in [10], is to use a machine learning method with an associated uncertainty value, for instance a Gaussian Process. This can help with a further integration of the behavior with the learning, for instance by tuning the forward velocity based on the certainty; a small sketch of this idea follows below. These avenues together could allow persistent SSL to reach its full potential, significantly enhancing the robustness of robots operating in real-world environments.
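As a rough illustration of that last idea, the sketch below uses a Gaussian Process regressor whose predictive standard deviation both scales the forward velocity and triggers the fall-back to stereo. The feature vectors, threshold, and velocity law are all assumptions for illustration, not part of the paper's system:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Stand-in training data: monocular feature vectors labeled with the
# stereo disparities collected during the 'training wheels' phase.
rng = np.random.default_rng(0)
X_train = rng.random((200, 8))
y_train = rng.random(200)

gp = GaussianProcessRegressor().fit(X_train, y_train)

x_now = rng.random((1, 8))                        # current monocular features
disparity, std = gp.predict(x_now, return_std=True)

V_MAX = 1.0                                       # assumed max velocity (m/s)
forward_velocity = V_MAX / (1.0 + 5.0 * std[0])   # slow down when uncertain
use_stereo_again = std[0] > 0.3                   # assumed re-engage threshold
```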

APPENDIX A
DEEP NEURAL NETWORK RESULTS

We investigated the possibility of implementing a deep convolutional neural network (CNN) on board a drone to test our proposed learning scheme. We scaled the CNN architecture so that implementing this network on the AR Drone 2 hardware remained feasible. Due to the CPU restrictions of our target system, we chose to use a relatively small CNN, inspired by the Cifar10 CNN example used in Caffe. Our layers are as follows. Input: 128x128 pixel image (input images are scaled). First hidden layer: 5x5 convolution, max pooling to 3x3, 32 kernels, contrast normalization. Second hidden layer: 5x5 convolution, average pooling to 3x3, 32 kernels, contrast normalization. Third hidden layer: 5x5 convolution, average pooling to 3x3, 128 kernels. Fourth and fifth layer: fully connected. Lastly, a euclidean loss output layer to one single output. Furthermore, we used a learning rate of 1e-6, 0.9 momentum, and 10000 iterations. The networks were trained using the open source Caffe framework on a dedicated compute server with an NVidia GTX660. In Figure 19, the learning curves of our best CNN under these constraints versus the best of our VBoW-trained algorithms are shown. The figure shows that the CNN is able to learn the problem of monocular depth estimation in a fashion comparable to our VBoW method. For the expected amount of training data during a single flight, the VBoW method delivers better performance on dataset 1 and slightly worse on dataset 2.
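For concreteness, the layer list above can be transcribed as follows. This is a sketch in PyTorch (the experiments themselves used Caffe); the strides, padding, input channel count, and the width of the first fully connected layer are assumptions, since the text specifies only kernel sizes, pooling sizes, channel counts, and the single euclidean-loss output:

```python
import torch
import torch.nn as nn

class MonoDisparityCNN(nn.Module):
    """Sketch of the small Cifar10-style CNN described in Appendix A."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 5, padding=2),    # 1st: 5x5 conv, 32 kernels
            nn.MaxPool2d(3, stride=2),         #      max pooling to 3x3
            nn.LocalResponseNorm(5),           #      contrast normalization
            nn.Conv2d(32, 32, 5, padding=2),   # 2nd: 5x5 conv, 32 kernels
            nn.AvgPool2d(3, stride=2),         #      average pooling to 3x3
            nn.LocalResponseNorm(5),
            nn.Conv2d(32, 128, 5, padding=2),  # 3rd: 5x5 conv, 128 kernels
            nn.AvgPool2d(3, stride=2),
        )
        self.head = nn.Sequential(             # 4th and 5th: fully connected
            nn.Flatten(),
            nn.Linear(128 * 15 * 15, 64),      # 15x15 follows from the assumed
            nn.ReLU(),                         # strides; width 64 is assumed
            nn.Linear(64, 1),                  # single disparity output
        )

    def forward(self, x):                      # x: (N, 3, 128, 128) images
        return self.head(self.features(x))

model = MonoDisparityCNN()
criterion = nn.MSELoss()                       # euclidean loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6, momentum=0.9)
# Training then runs for 10000 iterations, as in the paper.
```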

REFERENCES

[1] J. García and F. Fernández, "A comprehensive survey on safe reinforcement learning," Journal of Machine Learning Research, vol. 16, pp. 1437–1480, 2015.

[2] S. Ross, G. J. Gordon, and J. A. Bagnell, "A Reduction of Imitation Learning and Structured Prediction," 14th International Conference on Artificial Intelligence and Statistics, vol. 15, 2011.

[Figure 19: two panels of learning curves versus training set size (frames); Dataset 1 reports MSE, Dataset 2 reports area under the ROC curve; curves shown for VBoW test, VBoW train, CNN test, and CNN train.]

Fig. 19. Learning curves on two datasets for CNNs versus VBoW. Dashed/solid lines refer to results on the train/test set.

[3] S. Ross, N. Melik-Barkhudarov, K. S. Shankar, A. Wendel, D. Dey, J. A. Bagnell, and M. Hebert, "Learning monocular reactive UAV control in cluttered natural environments," 2013 IEEE International Conference on Robotics and Automation, pp. 1765–1772, May 2013. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6630809

[4] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann et al., "Stanley: The robot that won the DARPA Grand Challenge," in The 2005 DARPA Grand Challenge. Springer, 2007, pp. 1–43.

[5] D. Lieb, A. Lookingbill, and S. Thrun, "Adaptive road following using self-supervised learning and reverse optical flow," in Robotics: Science and Systems, 2005, pp. 273–280.

[6] A. Lookingbill, J. Rogers, D. Lieb, J. Curry, and S. Thrun, "Reverse optical flow for self-supervised adaptive autonomous robot navigation," International Journal of Computer Vision, vol. 74, no. 3, pp. 287–302, 2007.

[7] R. Hadsell, P. Sermanet, J. Ben, A. Erkan, M. Scoffier, K. Kavukcuoglu, U. Muller, and Y. LeCun, "Learning long-range vision for autonomous off-road driving," Journal of Field Robotics, vol. 26, no. 2, pp. 120–144, 2009. [Online]. Available: http://doi.wiley.com/10.1002/rob.20276

[8] U. A. Muller, L. D. Jackel, Y. LeCun, and B. Flepp, "Real-time adaptive off-road vehicle navigation and terrain classification," in SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2013, pp. 87410A–87410A.

[9] J. Baleia, P. Santana, and J. Barata, "On exploiting haptic cues for self-supervised learning of depth-based robot navigation affordances," Journal of Intelligent & Robotic Systems, pp. 1–20, 2015.

[10] H. Ho, C. De Wagter, B. Remes, and G. de Croon, "Optical flow for self-supervised learning of obstacle appearance," in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on. IEEE, 2015, pp. 3098–3104.

[11] T. Mori and S. Scherer, "First results in detecting and avoiding frontal obstacles from a monocular camera for micro unmanned aerial vehicles," Proceedings - IEEE International Conference on Robotics and Automation, pp. 1750–1757, 2013.

[12] J. Engel, J. Sturm, and D. Cremers, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0921889014000566

[13] ——, "Scale-aware navigation of a low-cost quadrocopter with a monocular camera," Robotics and Autonomous Systems, vol. 62, no. 11, pp. 1646–1656, 2014.

[14] F. van Breugel, K. Morgansen, and M. H. Dickinson, "Monocular distance estimation from optic flow during active landing maneuvers," Bioinspiration & Biomimetics, vol. 9, no. 2, p. 025002, 2014.

[15] G. C. de Croon, "Monocular distance estimation with optical flow maneuvers and efference copies: a stability-based strategy," Bioinspiration & Biomimetics, vol. 11, no. 1, p. 016004, 2016.

[16] G. de Croon, M. Groen, C. De Wagter, B. Remes, R. Ruijsink, and B. van Oudheusden, "Design, aerodynamics and autonomy of the DelFly," Bioinspiration & Biomimetics, vol. 7, no. 2, p. 025003, 2012.

[17] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, vol. 57, no. 5, pp. 469–483, 2009. [Online]. Available: http://dx.doi.org/10.1016/j.robot.2008.10.024

[18] D. Hoiem, A. A. Efros, and M. Hebert, "Automatic photo pop-up," ACM Transactions on Graphics, vol. 24, no. 3, p. 577, 2005.

[19] A. Saxena, S. H. Chung, and A. Y. Ng, "3-D Depth Reconstruction from a Single Still Image," International Journal of Computer Vision, vol. 76, no. 1, pp. 53–69, Aug. 2007. [Online]. Available: http://link.springer.com/10.1007/s11263-007-0071-y

[20] A. Saxena, M. Sun, and A. Y. Ng, "Make3D: learning 3D scene structure from a single still image," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 824–840, May 2009. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/19299858

[21] I. Lenz, M. Gemici, and A. Saxena, "Low-power parallel algorithms for single image based obstacle avoidance in aerial robots," IEEE International Conference on Intelligent Robots and Systems, pp. 772–779, 2012. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6386146

[22] D. Eigen, C. Puhrsch, and R. Fergus, "Depth map prediction from a single image using a multi-scale deep network," Advances in Neural Information Processing Systems, pp. 1–9, 2014. [Online]. Available: http://papers.nips.cc/paper/5539-tree-structured-gaussian-process-approximations

[23] J. Michels, A. Saxena, and A. Y. Ng, "High speed obstacle avoidance using monocular vision and reinforcement learning," in Proceedings of the 22nd International Conference on Machine Learning, vol. 3, no. Suppl 1. ACM Press, 2005, pp. 593–600. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1102351.1102426

[24] D. Dey, K. S. Shankar, S. Zeng, R. Mehta, M. T. Agcayazi, C. Eriksen, S. Daftry, M. Hebert, and J. A. Bagnell, "Vision and Learning for Deliberative Monocular Cluttered Flight."

[25] K. Yamauchi, M. Oota, and N. Ishii, "A self-supervised learning system for pattern recognition by sensory integration," Neural Networks, vol. 12, no. 10, pp. 1347–1358, 1999.

[26] S. Thrun, M. Montemerlo, H. Dahlkamp, D. Stavens, A. Aron, J. Diebel, P. Fong, J. Gale, M. Halpenny, G. Hoffmann, K. Lau, C. Oakley, M. Palatucci, V. Pratt, P. Stang, S. Strohband, C. Dupont, L. E. Jendrossek, C. Koelen, C. Markey, C. Rummel, J. van Niekerk, E. Jensen, P. Alessandrini, G. Bradski, B. Davies, S. Ettinger, A. Kaehler, A. Nefian, and P. Mahoney, "Stanley: The robot that won the DARPA Grand Challenge," Springer Tracts in Advanced Robotics, vol. 36, pp. 1–43, 2007.

[27] C. De Wagter, S. Tijmons, B. Remes, and G. de Croon, "Autonomous Flight of a 20-gram Flapping Wing MAV with a 4-gram Onboard Stereo Vision System," in IEEE International Conference on Robotics & Automation (ICRA), no. Section IV, 2014, pp. 4982–4987.

[28] G. C. H. E. De Croon, E. De Weerdt, C. De Wagter, B. D. W. Remes, and R. Ruijsink, "The appearance variation cue for obstacle avoidance," IEEE Transactions on Robotics, vol. 28, no. 2, pp. 529–534, 2012.

[29] M. Varma and A. Zisserman, "Texture classification: are filter banks necessary?" in Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer Society Conference on, 2003.

[30] B. Wu, T. L. Ooi, and Z. J. He, "Perceiving distance accurately by a directional process of integrating ground information," Nature, vol. 428, no. 6978, pp. 73–77, 2004.

[31] C. De Wagter and M. Amelink, "Holiday50av technical paper," 2007.

[32] P. R. Cohen, "Empirical methods for artificial intelligence," IEEE Intelligent Systems, no. 6, p. 88, 1996.

[33] C. De Wagter, S. Tijmons, B. D. Remes, and G. C. de Croon, "Autonomous flight of a 20-gram flapping wing MAV with a 4-gram onboard stereo vision system," in Robotics and Automation (ICRA), 2014 IEEE International Conference on. IEEE, 2014, pp. 4982–4987.

[34] B. Remes, D. Hensen, F. Van Tienen, C. De Wagter, E. Van der Horst, and G. De Croon, "Paparazzi: how to make a swarm of Parrot AR Drones fly autonomously based on GPS," in IMAV 2013: Proceedings of the International Micro Air Vehicle Conference and Flight Competition, Toulouse, France, 17–20 September 2013, 2013.

[35] P. Brisset, A. Drouin, M. Gorraz, P.-S. Huard, and J. Tyler, "The Paparazzi Solution," 2014.

[36] X. Zhu, "Semi-Supervised Learning Literature Survey," Sciences-New York, pp. 1–59, 2007. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.146.2352&rep=rep1&type=pdf

[37] N. Ratliff, D. Bradley, J. A. Bagnell, and J. Chestnutt, "Boosting Structured Prediction for Imitation Learning," Advances in Neural Information Processing Systems (NIPS), p. 54, 2006.

[38] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, pp. 1238–1274, 2013. [Online]. Available: http://ijr.sagepub.com/cgi/doi/10.1177/0278364913495721

[39] M. L. Littman, "Reinforcement learning improves behaviour from evaluative feedback," Nature, vol. 521, no. 7553, pp. 445–451, 2015. [Online]. Available: http://www.nature.com/doifinder/10.1038/nature14540
