UnrealROX: An eXtremely Photorealistic Virtual Reality Environment for Robotics Simulations and Synthetic Data Generation

Pablo Martinez-Gonzalez, Sergiu Oprea, Alberto Garcia-Garcia, Alvaro Jover-Alvarez, Sergio Orts-Escolano, and Jose Garcia-Rodriguez

Abstract—Data-driven algorithms have surpassed traditional techniques in almost every aspect of robotic vision problems. Such algorithms need vast amounts of quality data to work properly after their training process. Gathering and annotating that sheer amount of data in the real world is a time-consuming and error-prone task, and those problems limit scale and quality. Synthetic data generation has become increasingly popular since it is faster to generate and automatic to annotate. However, most of the current datasets and environments lack realism, interactions, and details from the real world. UnrealROX is an environment built over Unreal Engine 4 which aims to reduce the reality gap by leveraging hyperrealistic indoor scenes that are explored by robot agents which also interact with objects in a visually realistic manner in that simulated world. Photorealistic scenes and robots are rendered by Unreal Engine into a virtual reality headset which captures gaze so that a human operator can move the robot and use controllers for the robotic hands; scene information is dumped on a per-frame basis so that it can be reproduced offline to generate raw data and ground truth annotations. This virtual reality environment enables robotic vision researchers to generate realistic and visually plausible data with full ground truth for a wide variety of problems such as class and instance semantic segmentation, object detection, depth estimation, visual grasping, and navigation.

Index Terms—Robotics, Synthetic Data, Grasping, Virtual Reality, Simulation.


1 INTRODUCTION

VISION-BASED robotics tasks have made a huge leap forward mainly due to the development of machine learning techniques (e.g., deep architectures [8] such as Convolutional Neural Networks or Recurrent Neural Networks), which are continuously raising the performance bar for various problems such as semantic segmentation [11] [6], depth estimation [4] [24], and visual grasping [9] [10], among others. Those data-driven methods need vast amounts of annotated samples to achieve those exceptional results. Gathering that sheer quantity of images with ground truth is a tedious, expensive, and sometimes nearly impossible task in the real world. On the contrary, synthetic environments streamline the data generation process and are usually able to automatically provide annotations for various tasks. Because of that, simulated environments are becoming increasingly popular and widely used to train those models.

Learning on virtual or simulated worlds allows faster, low-cost, and more scalable data collection. However, synthetic environments face a huge obstacle to be actually useful despite their inherent advantages: models trained in that simulated domain must also be able to perform properly on real-world test scenarios, which often feature numerous discrepancies between them and their synthetic counterparts. That set of differences is widely known as the reality gap. In most cases, this gap is big enough that transferring knowledge from one domain to another is an extremely difficult task, either because renderers are not able to produce images like real-world sensors (due to the implicit noise or the richness of the scene) or because the physical behavior of the scene elements and sensors is not as accurate as it should be.

• P. Martinez-Gonzalez, S. Oprea, A. Garcia-Garcia, A. Jover-Alvarez, S. Orts-Escolano, and J. Garcia-Rodriguez are with the 3D Perception Lab (http://labs.iuii.ua.es/3dperceptionlab) at the University of Alicante. E-mail: [email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

In order to address this reality gap, two methods have been proven to be effective: extreme realism and domain randomization. On the one hand, extreme realism refers to the process of making the simulation as similar as possible to the real-world environment in which the robot will be deployed [14] [5]. That can be achieved through a combination of various techniques, e.g., photorealistic rendering (which implies realistic geometry, textures, and lighting, and also simulating camera-specific noise, distortion, and other parameters) and accurate physics (complex collisions with high-fidelity calculations). On the other hand, domain randomization is a kind of domain adaptation technique that aims to expose the model to a huge range of simulated environments at training time instead of just a single synthetic one [1] [22] [21]. By doing that, and if the variability is enough, the model will be able to identify the real world as just another variation, thus being able to generalize [23].


In this work, we propose an extremely photorealistic virtual reality environment for generating synthetic data for various robotic vision tasks. In such an environment, a human operator can be embodied, in virtual reality, as a robot agent inside a scene to freely navigate and interact with objects as if it were a real-world robot. Our environment is built on top of Unreal Engine 4 (UE4) to take advantage of its advanced Virtual Reality (VR), rendering, and physics capabilities. Our system provides the following features: (1) a visually plausible grasping system for robot manipulation which is modular enough to be applied to various finger configurations, (2) routines for controlling robotic hands and bodies with commercial VR setups such as Oculus Rift and HTC Vive Pro, (3) a sequence recorder component to store all the information about the scene, robot, and cameras while the human operator is embodied as a robot, (4) a sequence playback component to reproduce the previously recorded sequence offline to generate raw data such as RGB, depth, normals, or instance segmentation images, (5) a multi-camera component to ease the camera placement process and enable the user to attach them to specific robot joints and configure their parameters (resolution, noise model, field of view), and (6) open-source code, assets, and tutorials for all those components and other subsystems that tie them together.

This paper is organized as follows. Section 2 analyzes already existing environments for synthetic data generation and puts our proposal in context. Next, Section 3 describes our proposal and provides in-depth details for each one of its components. After that, we briefly discuss application scenarios for our environment in Section 4. Finally, in Section 5 we draw conclusions about this work, and in Section 6 we go over the current limitations of our work and propose future work to improve it.

2 RELATED WORKS

Synthetic environments have been used for a long time to benchmark vision and robotic algorithms [3]. Recently, their importance has been highlighted for training and evaluating machine learning models for robotic vision problems [2] [19] [13]. Due to the increasing need for samples to train such data-driven architectures, there exists an increasing number of synthetic datasets, environments, and simulation platforms to generate data for indoor robotic tasks and evaluate those learned models. In this section, we briefly review the most relevant ones according to the scope of our proposal. We describe both the most important features and the main flaws of the following works: CHALET, HoME, AI2-THOR, and MINOS. In addition, we also describe three other related tools, namely UnrealCV, Gazebo, and NVIDIA's Isaac Sim, which are not strictly similar but relevant enough to be mentioned. At last, we put our proposal in context taking into account all the analyzed strong points and weaknesses.

Cornell House Agent Learning Environment (CHALET) [25] is a 3D house simulator for manipulation and navigation learning. It is built upon Unity 3D so it supports physics and interactions with objects and the scene itself thanks to its built-in physics engine. CHALET features three modes of operation: standalone (navigate with keyboard and mouse input), replay (reproduce the trajectory generated on standalone mode), and client (use the framework's API to control the agent and obtain information). On the other hand, CHALET presents various weak points such as its lack of realism, the absence of a robot's body or mesh, and the limitation in the number of cameras.

Household Multimodal Environment (HoME) [2] is a multimodal household environment for AI learning from visual, auditive, and physical information within realistic synthetic environments sourced from SUNCG. HoME provides RGB, depth, and semantic maps based on 3D renderings produced by Panda3D, acoustic renderings based on EVERT, language descriptions of objects, and physics simulations based on Bullet. It also provides a Python framework compatible with OpenAI Gym. However, HoME is not anywhere close to photorealism, there is no physical representation of the robot itself, and interactions are discrete.

AI2-THOR (The House of inteRactions) [7] is a framework for visual AI research which consists of near-photorealistic synthetic 3D indoor scenes in which agents can navigate and change the state of actionable objects. It is built over Unity so it also integrates a physics engine which enables modeling complex physical interactions. The framework also provides a Python interface to communicate with the engine through HTTP commands to control the agent and obtain visual information and annotations. Some of the weaknesses of this environment are the lack of a 3D robot model and hands, only a first-person view camera, and the discrete nature of its actions with binary states.

Multimodal Indoor Simulator (MINOS) [20] is a simulator for navigation in complex indoor environments. An agent, represented by a cylinder proxy geometry, is able to navigate (in a discrete or continuous way) on scenes sourced from existing synthetic and reconstructed datasets of indoor scenes such as SUNCG and Matterport, respectively. Such an agent can obtain information from multimodal sensory inputs: RGB, depth, surface normals, contact forces, semantic segmentation, and various egocentric measurements such as velocity and acceleration. The simulator provides both Python and web client APIs to control the agent and set the parameters of the scene. However, this simulator lacks some features such as a fully 3D robot model instead of a geometry proxy, photorealism, configurable cameras and points of view, and interactions with the scene.

2.1 Other Tools and Environments

Although not strictly related, we would like to highlight a few tools from which we drew inspiration to shape our proposal: UnrealCV, Gazebo, and NVIDIA's Isaac Sim.

On the one hand, UnrealCV [17] [18] is a project that extends UE4 to create virtual worlds and ease communication with computer vision applications. UnrealCV consists of two parts: server and client. The server is a plugin that runs embedded into a UE4 game. It uses sockets to listen to high-level UnrealCV commands issued by a client and communicates with UE4 through its C++ API to provide advanced functionality for each command, e.g., rendering per-instance segmentation masks. The client is a Python API which communicates with the server using a plain-text protocol. It just sends those available commands to the server and waits for a response. A detailed list of commands can be consulted in the official documentation of the plugin. We took the main concept and design behind UnrealCV and implemented the whole pipeline inside UE4 itself to be more efficient and customizable. Another framework which helped us design our environment was Gazebo¹, a well-known robot simulator that enables accurate and efficient simulation of robots in indoor and outdoor environments. It integrates a robust physics engine (Bullet, ODE, Simbody, and DART), advanced 3D graphics (using OGRE), and sensor and noise modelling. On the other hand, NVIDIA's Isaac Sim is a yet-to-be-released virtual simulator for robotics that lets developers train and test their robot software using highly realistic virtual simulation environments. However, its software development kit was still in early access at the time this work was carried out.

2.2 Our Proposal in Context

After analyzing the strong points and weaknesses of the most popular indoor robotic environments, we aimed to combine the strengths of all of them while addressing their weaknesses and introducing new features. In this regard, our work focuses on simulating a wide range of common indoor robot actions, both in terms of poses and object interactions, by leveraging a human operator to generate plausible trajectories and grasps in virtual reality. To the best of our knowledge, this is the first extremely photorealistic environment for robotic vision in which interactions and movements can be realistically simulated in virtual reality. Furthermore, we make possible the generation of raw data (RGB-D/3D/Stereo) and ground truth (2D/3D class and instance segmentation, 6D poses, and 2D/3D bounding boxes) for many vision problems. Although UnrealCV is fairly similar to our work, since both aim to connect Unreal Engine and computer vision/robotics, we took radically different design decisions: while its architecture is a Python client/server, ours is contained entirely inside UE4 in C++. That architecture allows us to place objects, cameras, and skeletons, and generate images in a more efficient way than other frameworks. Finally, the whole pipeline and tools are released as open-source software with extensive documentation².

3 SYSTEM

The rendering engine we chose to generate photorealistic RGB images and immerse the agent in VR is Unreal Engine 4 (UE4). The reasons for this choice are the following ones: (1) it is arguably one of the best game engines able to produce extremely realistic renderings; (2) beyond gaming, it has become widely adopted by Virtual Reality developers and indoor/architectural visualization experts, so a whole lot of tools, examples, documentation, and assets are available; (3) due to its impact across various communities, many hardware solutions offer plugins for UE4 that make them work out-of-the-box; and (4) Epic Games provides the full C++ source code and updates to it, so the full suite can be used and easily modified for free. Arguably, the most attractive feature of UE4 that made us take that decision is its capability to render photorealistic scenes like the one shown in Figure 1. Some UE4 features that enable this realism are: physically-based materials, pre-calculated bounce light via Lightmass, stationary lights, post-processing, and reflections.

1. http://gazebosim.org/
2. https://github.com/3dperceptionlab/unrealrox

Fig. 1: Snapshots of the daylight and night room setup for the Realistic Rendering demo released by Epic Games to showcase the realistic rendering capabilities of UE4.

It is also important to remark that we do have strict real-time constraints for rendering since we need to immerse a human agent in virtual reality, i.e., we require extremely realistic and complex scenes rendered at very high framerates (usually more than 80 FPS). By design, UE4 is engineered for virtual reality, so it provides a specific rendering solution for it named Forward Renderer. That renderer is able to generate images that meet our quality standards at 90 FPS thanks to high-quality lighting features, Multi-Sample Anti-Aliasing (MSAA), and instanced stereo rendering.

The whole system is built over UE4, taking advantage of various existing features, extending certain ones to suit our specific needs, and implementing others from scratch to devise a more efficient and cleaner project that abides by software design principles. A general overview of our proposal is shown in Figure 2. In this section we describe each one of the subsystems that our proposal is composed of: robotic pawns, controller, HUD, grasping, multi-camera, recording, and playback.

Fig. 2: System diagram showing the various subsystems and their abstract relationships: Robotic Pawn, Controller, HUD, Multi-camera, Grasping, Recording, and Playback.

3.1 Robotic Pawns

One of the most important parts of the system is the representation of the robots in the virtual environment. Robots are represented by the mesh that models them, the control and movement logic, the animations that it triggers, and the grasping system (explained later in its corresponding section). To encapsulate all this, we have created a base class that contains all the common behavior that any robot would have in our system, which can then be extended by child classes that implement specific things such as the mesh or the configuration of the fingers for the grasping system. Using that encapsulation, we introduced two sample robots in our environment: UE4's mannequin and Aldebaran's Pepper (see Figure 3).

Fig. 3: Pepper and Mannequin integrated with colliders and constraints.

In UE4 there is a hierarchy of predefined classes ready to work together that should be used properly in order to take advantage of the facilities offered by the engine. For example, any element that we want to place in a scene must extend the Actor class, and at the same time, an Actor that is supposed to receive inputs from the user must extend the Pawn class (and optionally can have a Controller class to abstract input events, as we will see in the next section). This means that our base class representing the common behavior of robots must extend the Pawn class.

The meshes that model characters with joints, like our robots, are called SkeletalMesh in UE4. In addition to the mesh that defines their geometry, they incorporate a skeleton that defines how that geometry will be deformed according to the relative position and rotation of its bones. A SkeletalMesh is added as a component to our class (actually, an instance of SkeletalMeshComponent).

There are two types of inputs to which our robots must react: those that come from pressing buttons or axes, and those that come from moving the VR motion controllers. The latter are managed by a UE4 component that must be added to our Pawn class and that will modify its position according to the real-world motion controller movement. We will be able to access the position of these components from the animation class, associate it with the hand bones of the robot SkeletalMesh, and move the whole arm by inverse kinematics.
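To make this structure more concrete, the following minimal C++ sketch shows how such a base robot pawn could look. The class and member names are ours for illustration, not the actual UnrealROX classes, and the motion controller components require the engine's HeadMountedDisplay module:

// RobotPawnSketch.h -- illustrative base robot pawn, assuming UE 4.18+.
// Names are ours, not taken verbatim from the UnrealROX sources.
#pragma once

#include "CoreMinimal.h"
#include "GameFramework/Pawn.h"
#include "Components/SkeletalMeshComponent.h"
#include "MotionControllerComponent.h"
#include "RobotPawnSketch.generated.h"

UCLASS()
class ARobotPawnSketch : public APawn
{
    GENERATED_BODY()

public:
    ARobotPawnSketch()
    {
        // SkeletalMeshComponent holding the robot geometry and its skeleton.
        Mesh = CreateDefaultSubobject<USkeletalMeshComponent>(TEXT("RobotMesh"));
        RootComponent = Mesh;

        // One motion controller component per hand; UE4 updates their
        // transforms from the tracked VR controllers every frame. The tracked
        // hand is selected on each component in the editor.
        LeftController = CreateDefaultSubobject<UMotionControllerComponent>(TEXT("LeftController"));
        LeftController->SetupAttachment(RootComponent);
        RightController = CreateDefaultSubobject<UMotionControllerComponent>(TEXT("RightController"));
        RightController->SetupAttachment(RootComponent);
    }

    // Called by the controller subsystem; child classes (Pepper, Mannequin)
    // override these to account for their specific skeletons and fingers.
    virtual void MoveForward(float Value) {}
    virtual void GraspRightHand(float Value) {}

protected:
    UPROPERTY(VisibleAnywhere) USkeletalMeshComponent* Mesh;
    UPROPERTY(VisibleAnywhere) UMotionControllerComponent* LeftController;
    UPROPERTY(VisibleAnywhere) UMotionControllerComponent* RightController;
};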

The animation class is created from the SkeletalMesh, so it is separate from the Pawn class, although the former has to access information from the latter. Specifically, our animation class handles the hand-closing animation for the grasping system and, in the case of robots with legs, it also takes control of the displacement speed to execute the walking animation at different speeds. Finally, the animation class is also used by the playback system (described below) to recover the SkeletalMesh pose for a single frame, since it is where the position and rotation of each joint of the SkeletalMesh is accessible for modification.

3.2 Controller Subsystem

We would like our system to seamlessly support a wide range of Virtual Reality setups to reach a potentially higher number of users. In this regard, it is important to decouple the controller system from the rest of the environment so that we can use any device (such as the Oculus Rift and the HTC Vive Pro shown in Figure 4) without excessive effort. To that end, it is common to have a class that handles all the inputs from the user (in an event-driven way) and then distributes the execution to other classes depending on that input. UE4 itself provides the base class for this purpose, namely PlayerController. Many of these user inputs are focused on controlling the movement and behavior of a character in the scene, usually represented in Unreal Engine 4 as a Pawn class. This means that the PlayerController class is closely related to the Pawn one. Decoupling input management from functionality is useful as it allows us to switch among different controllers for the same Pawn (different control types, for example), or to use the same controller for several ones (if they have the same behavior for inputs).

Our controller system extends the base class PlayerController and handles all kinds of user inputs, both from the keyboard and from VR controllers (we have tested our system with the Oculus Rift and the HTC Vive). This is configured in the UE4 editor, more specifically in the Input Project Preferences panel, where several keys, buttons, or axes can be associated with an event name, which is later bound to a handler function in the custom PlayerController class. The controller calls the movement and grasping functionalities from the pawn, as well as global system functions such as toggling the recording system, restarting the scene, and resetting the VR headset position. It also controls the Head-Up Display (HUD) system for showing input debugging feedback.
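As a rough sketch of that binding pattern (the action and axis names below are placeholders that would be defined under Project Settings > Input, not the actual UnrealROX event names):

// RobotControllerSketch.h -- illustrative PlayerController subclass.
#pragma once

#include "CoreMinimal.h"
#include "GameFramework/PlayerController.h"
#include "GameFramework/Pawn.h"
#include "Components/InputComponent.h"
#include "RobotControllerSketch.generated.h"

UCLASS()
class ARobotControllerSketch : public APlayerController
{
    GENERATED_BODY()

protected:
    virtual void SetupInputComponent() override
    {
        Super::SetupInputComponent();

        // Discrete button events are bound to handler functions...
        InputComponent->BindAction("ToggleRecording", IE_Pressed, this, &ARobotControllerSketch::OnToggleRecording);
        InputComponent->BindAction("ResetVR", IE_Pressed, this, &ARobotControllerSketch::OnResetVR);

        // ...while continuous inputs (thumbsticks, triggers) are bound as axes.
        InputComponent->BindAxis("MoveForward", this, &ARobotControllerSketch::OnMoveForward);
    }

    void OnToggleRecording() { /* forward to the recording subsystem */ }
    void OnResetVR()         { /* recenter the VR headset */ }

    void OnMoveForward(float Value)
    {
        // Forward the movement request to the possessed robot pawn.
        if (APawn* RobotPawn = GetPawn())
        {
            RobotPawn->AddMovementInput(RobotPawn->GetActorForwardVector(), Value);
        }
    }
};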


Fig. 4: Seamlessly supported VR headsets thanks to the decoupled controller subsystem: HTC Vive Pro and Oculus Rift.

3.3 HUD Subsystem

It is convenient for any system to feature a debug system that provides feedback about the application state to the user. In UnrealROX, we offer an interface to show information at various levels to the user if requested. This information is presented in a HUD which can be turned on or off at the user's will. It can even be completely decoupled from the system as a whole for maximum performance. The main information modalities provided by the HUD are the following ones:

• Recording: A line of text with the recording state is always shown in the HUD in order to let the user know if their movements through the scene are being recorded.

• States: Notifies the user with a message on the screen about relevant buttons pressed, the joints in contact with an object, the profiling being activated, etc. The number of seconds these messages last on screen can be established independently. Most of them are printed for 5 seconds.

• Error: Prints a red message indicating an error that lasts on screen for 30 seconds (or until another error occurs). An example of this would be trying to record without the tracker in the scene (as seen in Figure 5).

• Scene Capture: Allows us to establish a debugging point of view so that we can see our robot from a different point of view than the first-person camera.

Fig. 5: Information and error messages shown in the HUD.

We have implemented this functionality by extending the HUD class that UE4 provides, and we also made it fully decoupled from the rest of the system in a simple way by implementing an interface³. Classes that inherit from the HUD class have a canvas and a debug canvas on which primitive shapes can be drawn. It provides some simple methods for rendering text, textures, rectangles, and materials, which can also be accessed from blueprints. An example of texture drawing in practice in our project is the Scene Capture, which consists in drawing in the viewport a texture captured from an arbitrary camera (as shown in Figure 6). This is useful for the user to see if the animations are being played correctly in a Virtual Reality environment.
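A stripped-down sketch of such a HUD class is shown below. The member names and message handling are simplified assumptions; the real implementation also manages the 5- and 30-second display times described above:

// RobotHUDSketch.h -- minimal sketch of the debug HUD described above.
#pragma once

#include "CoreMinimal.h"
#include "GameFramework/HUD.h"
#include "RobotHUDSketch.generated.h"

UCLASS()
class ARobotHUDSketch : public AHUD
{
    GENERATED_BODY()

public:
    // Other subsystems push their feedback through these simple setters.
    void SetRecordingState(bool bRecording) { bIsRecording = bRecording; }
    void ShowError(const FString& Message)  { LastError = Message; }

    virtual void DrawHUD() override
    {
        Super::DrawHUD();

        // Recording state is always visible in a corner of the viewport.
        DrawText(bIsRecording ? TEXT("RECORDING") : TEXT("NOT RECORDING"),
                 bIsRecording ? FLinearColor::Green : FLinearColor::White,
                 20.0f, 20.0f);

        // Errors are printed in red until they are cleared or replaced.
        if (!LastError.IsEmpty())
        {
            DrawText(LastError, FLinearColor::Red, 20.0f, 40.0f);
        }
    }

private:
    bool bIsRecording = false;
    FString LastError;
};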

Fig. 6: Scene Capture drawn in the viewport.

3.4 Grasping Subsystem

The grasping subsystem is considered one of the core components of UnrealROX. We have focused on providing realistic grasping, both in the way the robot grasps an object and in the movements it makes. When grasping an object we need to simulate a real robot's behaviour, thus smooth and plausible movements are needed. The grasping action is fully controlled by the user through the controllers, naturally limited to the degrees of freedom of the human body. In this way, we achieve a good representation of a humanoid robot interacting in a realistic home environment, also known as an assistive robot, which is the current trend in the field of social robotics.

Current approaches for grasping in VR environments are animation-driven and based on predefined movements [15] [12] [16]. This restricts the system to only a few predefined object geometries, hindering the user's interaction with the environment and also resulting in unrealistic grasping. In contrast with these approaches, the main idea of our grasping subsystem consists in manipulating and interacting with different objects, regardless of their geometry and pose. In this way, the user can freely decide which object to interact with, without restrictions. The robot can manipulate an object with each hand, and change an object from one hand to the other. It can also manipulate two different objects at the same time, drop them freely, or throw them around the scene.

3. https://docs.unrealengine.com/en-US/Programming/UnrealArchitecture/Reference/Interfaces


At the implementation level, this subsystem makes use of UE4's trigger volumes placed on each one of the finger phalanges, as we can see in Figure 7. These triggers act as sensors that determine whether we are manipulating an object in order to grasp it. With the controllers we are able to close the robot's hands, limiting each finger individually according to the triggers. We also implement logic to determine when to grasp or release an object based on the triggers' state. Finger positions change smoothly in order to replicate a real robot hand's behaviour and to avoid passing through an object.
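The following sketch illustrates that trigger-based sensing on top of the pawn sketched earlier. The socket names and radius are assumptions that depend on the robot skeleton, and the overlap handlers would need to be declared as UFUNCTION() in the header for the dynamic binding to work:

// Illustrative per-phalanx trigger spheres (extends the earlier pawn sketch).
#include "Components/SphereComponent.h"

void ARobotPawnSketch::CreateFingerTriggers()
{
    // Socket names are placeholders; each robot defines its own phalanx bones.
    static const FName PhalanxSockets[] = {
        TEXT("thumb_03_r"), TEXT("index_03_r"), TEXT("middle_03_r")
    };

    for (const FName& Socket : PhalanxSockets)
    {
        USphereComponent* Trigger = NewObject<USphereComponent>(this);
        Trigger->SetSphereRadius(1.0f); // small sensor around the phalanx
        Trigger->SetCollisionProfileName(TEXT("OverlapAll"));
        Trigger->SetGenerateOverlapEvents(true);
        Trigger->AttachToComponent(Mesh,
            FAttachmentTransformRules::SnapToTargetNotIncludingScale, Socket);
        Trigger->RegisterComponent();

        // Overlap events feed the grasp/release logic.
        Trigger->OnComponentBeginOverlap.AddDynamic(this, &ARobotPawnSketch::OnFingerOverlapBegin);
        Trigger->OnComponentEndOverlap.AddDynamic(this, &ARobotPawnSketch::OnFingerOverlapEnd);
    }
}

void ARobotPawnSketch::OnFingerOverlapBegin(UPrimitiveComponent* OverlappedComp,
    AActor* OtherActor, UPrimitiveComponent* OtherComp, int32 OtherBodyIndex,
    bool bFromSweep, const FHitResult& SweepResult)
{
    // Stop closing this finger; if enough phalanges are touching OtherActor,
    // consider the object grasped and attach it to the hand.
}

void ARobotPawnSketch::OnFingerOverlapEnd(UPrimitiveComponent* OverlappedComp,
    AActor* OtherActor, UPrimitiveComponent* OtherComp, int32 OtherBodyIndex)
{
    // When no phalanges remain in contact, release the object.
}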

An example sequence of grasping two objects with our custom system is shown in Figure 8.

Fig. 7: Sphere trigger volumes placed on finger phalanges of both hands, represented in yellow.

Fig. 8: Sequence of 6 frames (F0, F35, F70, F75, F80, F85) representing a grasping action with the right hand while holding a fruit with the left hand. Frames are ordered left-right and top-down.

3.5 Multi-camera Subsystem

Most robots on the market (such as Pepper or Baxter) integrate multiple cameras in different parts of their bodies. In addition, external cameras are usually added to the system to provide data from different points of view; e.g., ambient assisted living environments tend to feature various camera feeds for different rooms to provide the robot with information that it is not able to perceive directly. In UnrealROX, we want to simulate the ability to add multiple cameras in a synthetic environment with the goal in mind of having the same or a greater amount of data than we would have in a real environment. For instance, in order to train a data-driven grasping algorithm it would be necessary to generate synthetic images from a certain point of view: the wrist of the robot. To simulate this situation in our synthetic scenario, we give the user the ability to place cameras attached to sockets in the robot's body, e.g., the wrist itself or the end-effector (eye-in-hand). Furthermore, we also provide the functionality to add static cameras over the scene.

To implement this subsystem, we make use of UE4's CameraActor as the camera class and the Pawn class as the entity to which we attach them. By default, UE4 does not allow us to precisely attach components in the editor, so it is necessary to define a socket-camera relationship in the Pawn class. This is due to the fact that it has direct access to the skeleton to which we will be attaching specific cameras.

The objective of the CameraActor class is to render the scene from a specific point of view. This actor can be placed and rotated at the user's discretion in the viewport, which makes it ideal for recording any type of scene from any desired point of view. The CameraActor is represented in UE4 by a 3D camera model and, like any other actor, it can be moved and rotated in the viewport. Apart from handling attached and static cameras, UnrealROX exposes the most demanded camera settings through its interface (projection mode, Field of View (FoV), color grading, tone mapping, lens, and various rendering effects), as well as providing additional features such as creating stereo-vision setups.

To implement the camera attachment functionality we make extensive use of the AttachToActor function provided by UE4, which is in charge of parenting one actor with another following some attachment rules. We can specify the socket to which we want to attach the object. This means that when the selected socket changes its transform, the attached object will change it too, according to the specified AttachmentRules. These rules dictate how this newly attached actor will behave when the socket it is linked to moves or rotates. The AttachmentRules can be defined separately for location, rotation, and scale. This leads us to define an implicit relationship between the CameraActor and the socket it is attached to. For that, the Pawn class implements an array of USTRUCTs so that each relationship has two parameters: the camera itself and the socket name. These properties are accompanied by the EditAnywhere meta, which makes it possible to edit the properties not only on the Class Default Object (CDO) but also on its instances. The user is in charge of filling the array specified in Listing 3. To make the attachment process easier, we provide a friendly user interface inside the editor (see Figure 9).

Pepper: https://www.softbankrobotics.com/emea/en/robots/pepper
Baxter: https://www.rethinkrobotics.com/baxter/

Fig. 9: In-engine representation of the array of structs. In (a) we can see the representation of the array struct in the instance of the object, while in (b) we see its visual representation in the engine.
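A minimal sketch of that camera-to-socket relationship, again with illustrative names rather than the exact UnrealROX types, could look as follows:

// Illustrative camera/socket pair; the real struct in UnrealROX may differ.
#include "Camera/CameraActor.h"

USTRUCT()
struct FCameraSocketPair
{
    GENERATED_BODY()

    // Camera placed in the level that should follow a robot socket.
    UPROPERTY(EditAnywhere)
    ACameraActor* Camera = nullptr;

    // Name of the skeletal mesh socket to attach it to (e.g. a wrist socket).
    UPROPERTY(EditAnywhere)
    FName SocketName;
};

// In the pawn header (sketched earlier), an editable array of such pairs:
//   UPROPERTY(EditAnywhere, Category = "Cameras")
//   TArray<FCameraSocketPair> AttachedCameras;

void ARobotPawnSketch::AttachConfiguredCameras()
{
    for (const FCameraSocketPair& Pair : AttachedCameras)
    {
        if (Pair.Camera)
        {
            // Snap the camera onto the requested socket; the attachment rules
            // decide how location, rotation, and scale follow the socket.
            Pair.Camera->AttachToActor(this,
                FAttachmentTransformRules::SnapToTargetNotIncludingScale,
                Pair.SocketName);
        }
    }
}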

3.6 Recording Subsystem

UnrealROX decouples the recording and data generation processes so that we can achieve high framerates when gathering data in Virtual Reality (VR) without decreasing performance due to extra processing tasks such as changing rendering modes, cameras, and writing images to disk. In this regard, the recording subsystem only acts while the agent is embodied as the robot in the virtual environment. When enabled, this mode gathers and dumps, on a per-frame basis, all the information that will be needed to replay and reconstruct the whole sequence, its data, and its ground truth. That information will later be used as input for the playback system to reproduce the sequence and generate all the requested data.

In order to implement such behavior we created a new UE4 Actor, namely ROXTracker, which overrides the Tick function. This new invisible actor is included in the scene we want to record and executes its tick code for each rendered frame. That tick function loops over all cameras, objects, and robots (skeletons) in the scene and writes all the needed information to a text file in an asynchronous way. For each frame, the actor dumps the following information: recorded frame number, timestamp in milliseconds since the start of the game, the position and rotation of each camera, the position, rotation, and bounding box minimum and maximum world coordinates of each object, and the position and rotation of each joint of the robot's skeleton.
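The pattern is roughly the one sketched below. The actual ROXTracker records more fields (cameras and skeleton joints) and writes asynchronously, whereas this simplified version appends synchronously and uses field names of our own choosing:

// ROXTrackerSketch.h -- minimal sketch of a per-frame tracker actor.
#include "CoreMinimal.h"
#include "GameFramework/Actor.h"
#include "Misc/FileHelper.h"
#include "HAL/FileManager.h"
#include "ROXTrackerSketch.generated.h"

UCLASS()
class AROXTrackerSketch : public AActor
{
    GENERATED_BODY()

public:
    AROXTrackerSketch()
    {
        PrimaryActorTick.bCanEverTick = true; // run Tick every rendered frame
    }

    virtual void Tick(float DeltaSeconds) override
    {
        Super::Tick(DeltaSeconds);
        if (!bRecording) return;

        // Frame number and timestamp in milliseconds since the game started.
        FString Line = FString::Printf(TEXT("frame %d t %.3f\n"),
            FrameCounter++, GetWorld()->GetTimeSeconds() * 1000.0f);

        // One entry per tracked object: position, rotation and bounding box.
        for (const AActor* Obj : TrackedObjects)
        {
            const FVector Loc = Obj->GetActorLocation();
            const FRotator Rot = Obj->GetActorRotation();
            const FBox Box = Obj->GetComponentsBoundingBox();
            Line += FString::Printf(TEXT("%s %s %s min %s max %s\n"),
                *Obj->GetName(), *Loc.ToString(), *Rot.ToString(),
                *Box.Min.ToString(), *Box.Max.ToString());
        }

        // Append the frame to the raw text file (synchronous here for brevity).
        FFileHelper::SaveStringToFile(Line, *OutputFile,
            FFileHelper::EEncodingOptions::AutoDetect,
            &IFileManager::Get(), FILEWRITE_Append);
    }

    UPROPERTY(EditAnywhere) TArray<AActor*> TrackedObjects;
    UPROPERTY(EditAnywhere) FString OutputFile = TEXT("raw_sequence.txt");
    bool bRecording = false;
    int32 FrameCounter = 0;
};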

Fig. 10: ROXTracker custom interface showing the robot mannequin, multiple cameras, and various parameters to configure which pawns and cameras are tracked and other sequence details.

The information is dumped in raw text format for efficiency. After the sequence is fully recorded, the raw text file is processed and converted into a more structured and readable JSON file so that it can be easily interpreted by the playback system.
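As an illustration of that conversion step, UE4's Json module can be used along the following lines. The field names here are ours, not UnrealROX's actual schema, and the project would need the "Json" module as a dependency:

// Hedged sketch of turning one parsed raw-text frame into a JSON document.
#include "CoreMinimal.h"
#include "Dom/JsonObject.h"
#include "Serialization/JsonWriter.h"
#include "Serialization/JsonSerializer.h"
#include "Misc/FileHelper.h"

bool ConvertRawFrameToJson(int32 FrameNumber, float TimestampMs,
                           const FVector& CamLocation, const FRotator& CamRotation,
                           const FString& OutPath)
{
    TSharedPtr<FJsonObject> Frame = MakeShareable(new FJsonObject());
    Frame->SetNumberField(TEXT("frame"), FrameNumber);
    Frame->SetNumberField(TEXT("timestamp_ms"), TimestampMs);

    TSharedPtr<FJsonObject> Camera = MakeShareable(new FJsonObject());
    Camera->SetStringField(TEXT("location"), CamLocation.ToString());
    Camera->SetStringField(TEXT("rotation"), CamRotation.ToString());
    Frame->SetObjectField(TEXT("camera"), Camera);

    // Serialize the object to a JSON string and write it to disk.
    FString Output;
    TSharedRef<TJsonWriter<>> Writer = TJsonWriterFactory<>::Create(&Output);
    FJsonSerializer::Serialize(Frame.ToSharedRef(), Writer);
    return FFileHelper::SaveStringToFile(Output, *OutPath);
}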

3.7 Playback Subsystem

Once the scene has been recorded, we can use the custom user interface in UE4 to provide the data needed for the playback mode: the sequence description file in JSON format and an output directory. Other parameters such as frame skipping (to skip a certain amount of frames at the beginning of the sequence) and dropping (keep only a certain amount of frames) can also be customized (see Figure 10).

This mode disables any physics simulation and interactions (since object and skeleton poses will be hard-coded by the sequence description itself) and then interprets the sequence file to generate all the raw data from it: RGB images, depth maps, instance segmentation masks, and normals. For each frame, the playback mode moves every object and every robot joint to the previously recorded position and sets their rotation. Once everything is positioned, it loops through each camera. For each one of them, the aforementioned rendering modes (RGB, depth, instance, and normals) are switched and the corresponding images are generated, as shown in Figure 11.
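Conceptually, the per-frame repositioning boils down to something like the following sketch, where the structures holding the parsed JSON are invented for illustration:

// Illustrative playback step; the real frame/pose structures differ.
#include "CoreMinimal.h"
#include "GameFramework/Actor.h"

struct FRecordedPose
{
    AActor* Actor = nullptr;
    FVector Location;
    FRotator Rotation;
};

struct FRecordedFrame
{
    TArray<FRecordedPose> ObjectPoses;
};

void ReplayFrame(const FRecordedFrame& Frame)
{
    for (const FRecordedPose& Pose : Frame.ObjectPoses)
    {
        if (Pose.Actor)
        {
            // Physics is disabled during playback, so poses are hard-set
            // exactly as they were recorded.
            Pose.Actor->SetActorLocationAndRotation(Pose.Location, Pose.Rotation);
        }
    }
    // After every object and joint is placed, the playback actor iterates
    // over the cameras and renders RGB, depth, mask, and normal images.
}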

4 APPLICATIONS

Fig. 11: Rendering modes cycled by the playback mode: (a) RGB, (b) depth, (c) mask, and (d) normals.

The UnrealROX environment has multiple potential application scenarios to generate data for various robotic vision tasks. Traditional algorithms for solving such tasks can take advantage of the data, but the main purpose of this environment is to provide the ability to generate large-scale datasets. Having the possibility of generating vast amounts of high-quality annotated data, data-driven algorithms such as deep learning models can especially benefit from it to increase their performance, in terms of accuracy, and improve their generalization capabilities in situations unseen during training. The set of tasks and problems that can be addressed using such data ranges from low- to high-level ones, covering the whole spectrum of indoor robotics. Some of the most relevant low-level tasks include:

• Stereo Depth Estimation: One of the typical ways of obtaining 3D information for robotics is using a pair of displaced cameras to obtain two different views of the same scene at the same time frame. By comparing both images, a disparity map can be obtained whose values are inversely proportional to the scene depth (the standard relation is sketched right after this list). Our multi-camera system allows the placement of stereo pairs at configurable baselines so that the environment is able to generate pairs of RGB images, and the corresponding depth, from calibrated cameras.

• Monocular Depth Estimation: Another trending way of obtaining 3D information consists of using machine learning methods to infer depth from a single RGB image instead of a stereo pair. From a practical standpoint, it is especially interesting since it requires far less hardware and avoids the need for calibration strategies. Our multi-camera system generates by default depth information for each RGB frame (see Figure 12).

Fig. 12: Sample RGB sequence for monocular depth estimation.

• Object Detection and Pose Estimation: Being able not only to identify which objects are in a given scene frame but also to estimate their pose and bounding box is of utmost importance for an indoor robot. Our environment is able to produce 2D and 3D bounding boxes for each frame as ground truth. Furthermore, for each frame of a sequence, the full 6D pose of the objects is annotated too (see Figure 13).

Fig. 13: Sample bounding box annotations for the RGB sequence shown in Figure 12.

• Instance/Class Segmentation: For certain applications, detecting a bounding box for each object is not enough, so we need to be able to pinpoint the exact boundaries of the objects. Semantic segmentation of frames provides per-pixel labels that indicate to which instance or class a particular pixel belongs. Our environment generates 2D (per-pixel) and 3D (per-point) labels for instance and class segmentation (see Figure 14).

Fig. 14: Sample instance segmentation sequence for the RGB images shown in Figure 12.

• Normal Estimation: Estimating the normals of a given surface is an important previous step for many other tasks. For instance, certain algorithms require normal information in a point cloud to extract grasping points for a robot. UnrealROX provides per-pixel normal information.
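For reference regarding the Stereo Depth Estimation item above, and assuming a rectified pinhole stereo pair with focal length f (in pixels), baseline B, and disparity d (a standard relation, not something specific to UnrealROX), depth follows

Z = \frac{f \cdot B}{d}

so larger disparities correspond to closer surfaces, which is the inverse proportionality mentioned in that item.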

That low-level data enables other higher-level tasks that either make use of the output of those systems, take the low-level data directly as input, or even both:

• Hand Pose Estimation: Estimating the 6D pose of each joint of the hands provides useful information for various higher-level tasks such as gesture detection, grasping, or collaboration with other robots. We provide per-frame 6D pose annotations for each joint of the robot's hands.

• Visual Grasping and Dexterous Manipulation: Grasping objects and manipulating them with one or both hands is a high-level task which can be solved using information from various sources (RGB images, depth maps, segmentation masks, normal maps, and joint estimates, to name a few). In our case, we provide sequences in which the robot interacts with objects to displace, grab, and manipulate them so that grasping algorithms can take advantage of such sequences recorded from various points of view (see Figure 15).

Fig. 15: Sample hands interaction sequence and its associated data (from top to bottom: RGB, depth, instance masks, and normals).

• Robot Pose Estimation: As well as providing the 6D pose of hand joints, our environment also provides such information for all the joints of a robot on a per-frame basis. This allows training and testing body pose estimation algorithms, which can be extremely useful in indoor environments to analyze behaviors and even collaborate with other robots too. To that end, we equipped our multi-camera system with the capability of adding room cameras that capture full bodies, typical of assisted indoor living (see Figure 16).

Fig. 16: Sample data from an external point of view in the room with the corresponding images (RGB, depth, and instance masks).

• Obstacle Avoidance and Navigation: By leveraging various types of low-level information such as RGB images, depth maps, bounding boxes, and semantic segmentation, robots can learn to avoid obstacles (by detecting objects and estimating their distance) and even navigate in indoor environments (by building a map to localize themselves in the indoor scene while avoiding objects and walls and being able to reason semantically to move intelligently).

As we can observe, UnrealROX is able to generate data for a significantly wide range of robotic vision applications. Most of them orbit around indoor robotics, although some of them might as well be applied to outdoor situations. In general, their purpose can be grouped into the more general application of Ambient Assisted Living (AAL), due to the inherent goal of achieving a robotic system able to operate intelligently and autonomously in an indoor scenario to provide support for various social tasks such as in-house rehabilitation, elder care, or even disabled assistance.

5 CONCLUSION

This paper presented a virtual reality system in which a human operator is embodied as a robotic agent, using VR setups such as the Oculus Rift or HTC Vive Pro, for generating automatically annotated synthetic data for various robotic vision tasks. This environment leverages photorealism to bridge the reality gap so that models trained on its simulated data can be transferred to a real-world domain while still generalizing properly. The whole project, with all the aforementioned components (recording/playback, multi-camera, HUD, controller, and robotic pawns), is freely available⁴ with an open-source license and detailed documentation so that any researcher can use it to generate custom data or even extend it to suit their particular needs. That data generation process was engineered and designed with efficiency and ease of use in mind, and it outperforms other existing solutions such as UnrealCV at object, robot, and camera repositioning, and image generation.

The outcome of this work demonstrates the potential of using VR for simulating robotic interactions and generating synthetic data that facilitates training data-driven methods for various applications such as semantic segmentation, depth estimation, or object recognition.

6 LIMITATIONS AND FUTURE WORKS

Currently, the environment still has certain limitations that must be addressed in order to make it applicable to a wider range of robotic vision tasks. One of them is the simulation of non-rigid objects and the deformations produced when grasping such kinds of objects. We have limited ourselves to manipulating non-deformable objects in order not to affect realism, since this is a different approach with non-haptic manipulation, and deformations need to be modelled at the object level. We are currently investigating the mechanisms that UE4 offers to model those transformations. Another important shortcoming is the absence of tactile information when grasping objects. We plan to include simulated tactile sensors to provide force data when fingers collide with objects and grasp them, instead of providing only visual information. Furthermore, although not strictly a limitation, we are working on making the system able to process Unified Robot Description Format (URDF) files to automatically import robots, including their constraints, kinematics, and colliders, into the environment instead of doing that manually for each robot model.

ACKNOWLEDGMENTS

This work has been funded by the Spanish Government TIN2016-76515-R grant for the COMBAHO project, supported with Feder funds. This work has also been supported by three Spanish national grants for PhD studies (FPU15/04516, FPU17/00166, and ACIF/2018/197), by the University of Alicante project GRE16-19, and by the Valencian Government project GV/2018/022. Experiments were made possible by a generous hardware donation from NVIDIA.

4. https://github.com/3dperceptionlab/unrealrox

REFERENCES

[1] Konstantinos Bousmalis, Alex Irpan, Paul Wohlhart, Yunfei Bai, Matthew Kelcey, Mrinal Kalakrishnan, Laura Downs, Julian Ibarz, Peter Pastor, Kurt Konolige, et al. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. arXiv preprint arXiv:1709.07857, 2017.

[2] Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, and Aaron Courville. HoME: A household multimodal environment. arXiv preprint arXiv:1711.11017, 2017.

[3] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In A. Fitzgibbon et al., editors, European Conf. on Computer Vision (ECCV), Part IV, LNCS 7577, pages 611-625. Springer-Verlag, October 2012.

[4] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Advances in Neural Information Processing Systems, pages 2366-2374, 2014.

[5] Adrien Gaidon, Qiao Wang, Yohann Cabon, and Eleonora Vig. Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4340-4349, 2016.

[6] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[7] Eric Kolve, Roozbeh Mottaghi, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An interactive 3D environment for visual AI. arXiv preprint arXiv:1712.05474, 2017.

[8] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436, 2015.

[9] Ian Lenz, Honglak Lee, and Ashutosh Saxena. Deep learning for detecting robotic grasps. The International Journal of Robotics Research, 34(4-5):705-724, 2015.

[10] Sergey Levine, Peter Pastor, Alex Krizhevsky, Julian Ibarz, and Deirdre Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421-436, 2018.

[11] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431-3440, 2015.

[12] Tom Looman. VR template.

[13] Jeffrey Mahler, Jacky Liang, Sherdil Niyaz, Michael Laskey, Richard Doan, Xinyu Liu, Juan Aparicio Ojea, and Ken Goldberg. Dex-Net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. 2017.

[14] John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J. Davison. SceneNet RGB-D: 5M photorealistic images of synthetic indoor trajectories with ground truth. arXiv preprint arXiv:1612.05079, 2016.

[15] Oculus. Distance grab sample now available in Oculus Unity sample framework.

[16] Oculus. Oculus First Contact.

[17] Weichao Qiu and Alan Yuille. UnrealCV: Connecting computer vision to Unreal Engine. In European Conference on Computer Vision, pages 909-916. Springer, 2016.

[18] Weichao Qiu, Fangwei Zhong, Yi Zhang, Siyuan Qiao, Zihao Xiao, Tae Soo Kim, and Yizhou Wang. UnrealCV: Virtual worlds for computer vision. In Proceedings of the 2017 ACM on Multimedia Conference, pages 1221-1224. ACM, 2017.

[19] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The SYNTHIA dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3234-3243, 2016.

[20] Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas Funkhouser, and Vladlen Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931, 2017.

[21] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In Intelligent Robots and Systems (IROS), 2017 IEEE/RSJ International Conference on, pages 23-30. IEEE, 2017.

[22] Joshua Tobin, Wojciech Zaremba, and Pieter Abbeel. Domain randomization and generative models for robotic grasping. arXiv preprint arXiv:1710.06425, 2017.

[23] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. arXiv preprint arXiv:1804.06516, 2018.

[24] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 5, page 6, 2017.

[25] Claudia Yan, Dipendra Misra, Andrew Bennett, Aaron Walsman, Yonatan Bisk, and Yoav Artzi. CHALET: Cornell house agent learning environment. arXiv preprint arXiv:1801.07357, 2018.