Project No: 644312
D7.3 Evaluation of RAPID platforms
December 31, 2017
Abstract:
This deliverable presents the evaluation results obtained by the performance analysis of the RAPID offloading framework.
The evaluation is conducted in three parts: first, using simple benchmarks that explore various characteristics
of the framework; second, using RAPID’s three pilot applications, namely the 3D Hand Tracking application, the Android
antivirus and the BioSurveillance application, which have been modified appropriately so that they can take advantage of
the benefits provided by the RAPID offloading framework (the outcomes of these modifications are presented); and finally,
through the results obtained by the analysis of RAPID’s cloud and services.
Document Manager
Dimitris Deyannis FORTH
Document Id N°: rapid_D7.3 Version: 1.0 Date: 14/02/2018
Filename: rapid_D7.3_v1.0.docx
Confidentiality
This document contains proprietary material of certain RAPID contractors, and may not be reproduced, copied,
or disclosed without appropriate permission. The commercial use of any information contained in this
document may require a license from the proprietor of that information.
The RAPID Consortium consists of the following partners:
Participant no.  Participant organisation name                  Short name  Country
1                Foundation of Research and Technology Hellas   FORTH       Greece
2                Sapienza University of Rome                    UROME       Italy
3                Atos Spain S.A.                                ATOS        Spain
4                Queen's University Belfast                     QUB         United Kingdom
5                Herta Security S.L.                            HERTA       Spain
6                SingularLogic S.A.                             SILO        Greece
7                University of Naples "Parthenope"              UNP         Italy
The information in this document is provided “as is” and no guarantee or warranty is given that the information
is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.
Revision history
Version  Author                      Notes                                                                     Date
0.1      Iakovos Mavroidis (FORTH)   Initial ToC.                                                              27/11/2017
0.1.1    Terpsi Velivassaki (SILO)   Updates on ToC.                                                           28/11/2017
0.2      Dimitris Deyannis (FORTH)   Input on various sections.                                                13/12/2017
0.3      Dimitris Deyannis (FORTH)   Input on Antivirus Environment.                                           14/12/2017
0.3.1    Terpsi Velivassaki (SILO)   Input on RAPID service Environment.                                       08/01/2018
0.3.2    Dimitris Deyannis (FORTH)   Input on Hand Tracking (just merging).                                    09/01/2018
0.3.3    Elena Garrido (ATOS)        Input on section 4.1, adding subsection SLAM in OTC.                      15/01/2018
0.3.4    Dimitris Deyannis (FORTH)   Text fixes.                                                               15/01/2018
0.3.5    Carles Fernández (HERTA)    Input on BioSurveillance sections.                                        17/01/2018
0.3.6    Carles Fernández (HERTA)    Added BioSurveillance evaluation results.                                 23/01/2018
0.4      Dimitris Deyannis (FORTH)   Integrating all input.                                                    25/01/2018
0.5      Dimitris Deyannis (FORTH)   More input on Antivirus.                                                  26/01/2018
0.5.1    Elena Garrido (ATOS)        New input about SLAM in the OTC platform.                                 26/01/2018
0.5.2    Carles Fernández (HERTA)    Added more evaluation results for 3.3.                                    26/01/2018
0.5.3    Sokol Kosta (UROME)         Input on section 2 about simple evaluation experiments on the OTC
                                     platform. Moving section 5 to 2.                                          29/01/2018
0.5.4    Elena Garrido (ATOS)        New input about SLAM in OTC.                                              30/01/2018
0.5.5    Cheol-Ho Hong (QUB)         Added more input to SLAM in OTC.                                          30/01/2018
0.5.6    Terpsi Velivassaki (SILO)   Updates on RAPID service Environment.                                     30/01/2018
0.5.7    Sokol Kosta (UROME)         More input on section 2 about simple evaluation experiments on the
                                     OTC platform.                                                             29/01/2018
0.6      Dimitris Deyannis (FORTH)   Integrating all input.                                                    01/02/2018
0.6.1    Sokol Kosta (UROME)         Fixed Dimitris’ comments. More input on section 2 about Linux
                                     evaluation experiments on the OTC platform.                               01/02/2018
0.6.2    Dimitris Deyannis (FORTH)   Integrating GVirtuS input (section 2).                                    01/02/2018
0.6.3    Dimitris Deyannis (FORTH)   Input for missing sections.                                               02/02/2018
0.6.4    Carles Fernández (HERTA)    Addressed comments and suggestions. Added input on BioSurveillance
                                     on the cloud (latencies).                                                 02/02/2018
0.6.5    Elena Garrido (ATOS)        Addressed QoS comments.                                                   06/02/2018
0.6.6    Dimitris Deyannis (FORTH)   Fixes throughout the text.                                                06/02/2018
0.6.7    Dimitris Deyannis (FORTH)   Finalising input.                                                         08/02/2018
0.6.8    Carles Fernández (HERTA)    Review of the deliverable.                                                09/02/2018
0.6.9    Sokol Kosta (UROME)         Handled Carles’ comments.                                                 09/02/2018
0.7.0    Dimitris Deyannis (FORTH)   Merging review input.                                                     12/02/2018
0.7.1    Dimitris Deyannis (FORTH)   Minor fixes.                                                              13/02/2018
0.7.2    Dimitris Deyannis (FORTH)   Corrections.                                                              14/02/2018
1.0      Dimitris Deyannis (FORTH)   Final Version.                                                            14/02/2018
Contents
1. Introduction
   1.1. Glossary of Acronyms
2. Evaluation of RAPID Service/Cloud
   2.1. Environment
      2.1.1. SLA in OTC
   2.2. Evaluation using simple benchmarks
      2.2.1. Android Experiments
      2.2.2. Linux Experiments
3. Evaluation of RAPID Applications
   3.1. 3D Hand Tracking
      3.1.1. Environment
      3.1.2. Performance Results
      3.1.3. Conclusions for the Hand Tracker use-case
   3.2. Antivirus
      3.2.1. Environment
      3.2.2. Performance Results
      3.2.3. Antivirus in the OTC RAPID cloud
      3.2.4. Conclusions for the Antivirus use-case
   3.3. BioSurveillance
      3.3.1. Environment
      3.3.2. Performance Results
      3.3.3. Batch offloading
      3.3.4. BioSurveillance in the OTC RAPID cloud
      3.3.5. Conclusions for the BioSurveillance use-case
4. Conclusions and Future Performance Optimizations
References
List of Figures
Figure 1: Deployment of the RAPID framework in OTC.
Figure 2: The VMs used for the RAPID evaluation as shown in the OTC dashboard.
Figure 3: Screenshot of the Android phone after performing some experiments.
Figure 4: 4-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 5: 5-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 6: 6-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 7: 7-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 8: 8-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 9: 8-Queens puzzle, Local vs. Parallel Remote execution performed on Huawei Android 7.0 phone and two Android-x86 4.4 VMs on OTC.
Figure 10: Screenshot of the Android VM log when executing CPU native (C/C++) Android code before the shared library was loaded.
Figure 11: Screenshot of the Android VM log when executing CPU native (C/C++) Android code after the shared library was loaded.
Figure 12: Matrix multiplication varying the problem size. Left: the GPU offloading is performed using a local dedicated machine. Right: the GPU offloading leverages the GPU-Bridger OTC virtual machine.
Figure 13: Linux RAPID demo client after executing the N-Queens puzzle with 4, 5, 6, 7, and 8 queens.
Figure 14: The CUDA-enabled IDW Algorithm executed on different flavours: (a) CPU, (b) on-board Titan X, (c) Titan X using GPU-Bridger instead of regular CUDA, (d) Tesla M60 on OTC using RAPID offloading.
Figure 15: Hand Tracker testing environment topology.
Figure 16: High-level view of the Hand Tracker I/O times in ideal conditions.
Figure 17: What happens when frame processing is delayed (A) vs what RAPID could facilitate (B).
Figure 18: Daisy-chaining to two machines could in principle double observed framerates, but in fact the tracking quality would be the same while consuming double resources.
Figure 19: Sustainable frame rate for various offloading configurations. The RGBD camera acquisition framerate is 30 fps.
Figure 20: Delay between local and remote execution.
Figure 21: Antivirus testing environment topology.
Figure 22: Sustainable throughput achieved by the Java and CUDA implementation of the virus-scanning engine, executed on the tablet’s CPU and GPU respectively.
Figure 23: Sustainable throughput achieved by the virus-scanning engine in different offloading configurations using the RAPID-enabled cloudlet.
Figure 24: End-to-end sustainable throughput achieved by the antivirus Android application in different offloading configurations using the RAPID-enabled cloudlet.
Figure 25: Sustainable throughput achieved by the antivirus engine in different offloading configurations using OTC.
Figure 26: End-to-end sustainable throughput achieved by the antivirus Android application in different offloading configurations using OTC.
Figure 27: Typical pipeline of a face recognition application.
Figure 28: Evolution of architectures for this use-case. (a) Original implementation on TK1. (b) Extended TK1 version with manual offloading through ZeroMQ (requires developing code on target accelerator). (c) Final version on TX1, with transparent offloading by RAPID (no code required on the accelerator).
Figure 29: Sustained frames per second (FPS) of the local and distributed BioSurveillance applications, with manual and automatic (RAPID) offloading, depending on the number of faces simultaneously analysed.
Figure 30: Automatic remote offloading with RAPID allows us to override hardware limitations. In this case, we can overcome memory limitations and use larger subject databases simply by changing the GPU card of the accelerator.
Figure 31: Performance improvements after the implementation of batch offloading operations for RAPID. In this case, with a batch=10 the frame-rate improves even compared to the stand-alone application running locally on TX1.
Figure 32: Performance of the automatic remote offloading application with regard to the chosen batch-size. The optimal batch-size in this case is somewhere between batch=10 and batch=100.
Figure 33: Performance comparison of BioSurveillance running stand-alone on TX1 against remote offloading to RAPID’s cloud, for different batch-sizes, database sizes and faces.
Figure 34: Latencies for each process of the face recognition pipeline, when analyzing 5 faces simultaneously on the RAPID cloud.
List of Tables
Table 1: The characteristics of the VMs listed in Figure 2.
Table 2: List of available flavours in OTC.
Table 3: List of CUDA functions offloaded in a template matching operation, with their average computation latency. By applying batch-10 offloading, we can reduce the effective latency per template by 15-20%.
Executive Summary
In this deliverable, we present a thorough evaluation of the RAPID offloading framework. This
evaluation covers the performance analysis of the framework and its components, as obtained by a series
of micro-benchmarks aiming to explore its characteristics. Moreover, we present the outcomes of the
evaluation of RAPID’s three pilot applications. These three applications, namely Hand Tracker, Mobile
Antivirus and BioSurveillance, benefit from the features provided by the offloading framework, both in
ease of offloading deployment and in increased performance. Since RAPID’s pilot applications
have characteristics found in multiple families of applications, such as real-time constraints, heavy I/O,
high throughput, potentially large memory consumption, or strict privacy requirements, they present
solid case studies for the evaluation of RAPID.
1. Introduction
In order to evaluate the performance aspects of the RAPID framework and infrastructure in detail, we
conduct a variety of micro-benchmarks. The purpose of these micro-benchmarks is to test the
infrastructure as a whole, as well as to explore the characteristics of each individual component.
Moreover, we evaluate the performance benefit provided to the three use-cases by RAPID, as well as the
ease of deployment of offloadable tasks. Each application utilises the offloading capabilities of RAPID
in a different fashion. The Hand Tracker application relies on the RAPID Acceleration Server for native
GPGPU code offloading while the BioSurveillance system achieves CUDA code offloading using the
GPU Bridger. The mobile antivirus application utilizes the Acceleration Server for CPU code
offloading, while performing GPGPU code offloading using the GPU Bridger or a combination of the
aforementioned components. Moreover, each pilot is deployed on a different platform and is able to
utilise the entire RAPID infrastructure, including components such as the SLAM and the DS. The Hand
Tracker application is developed on Linux-based laptop and desktop hosts, the BioSurveillance system
targets low-power Tegra devices, and the antivirus use case is developed as an Android APK able to execute on a plethora of
mobile devices. The successful offloading performed by these use-cases, presented in this deliverable,
indicates that RAPID’s solid implementation enables CPU and GPGPU code offloading regardless of
the hardware and software platform.
The remainder of this deliverable is structured as follows. Section 2 presents the evaluation results of
the RAPID public cloud acceleration service obtained through a set of simple micro-benchmarks.
Section 3 provides a thorough analysis of RAPID’s pilot applications in terms of performance and ease
of deployment. Finally, in Section 4, we discuss further performance optimizations that could be applied
to the framework and its infrastructure, as well as the types of applications that can benefit from
RAPID.
1.1. Glossary of Acronyms
Acronym Definition
APK Android Application Package
API Application Programming Interface
AS Acceleration Server
CO Confidential
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
D Deliverable
DFE Dispatch/Fetch Engine
DMP Data Management Plan
DoA Description of the Action
DS Directory Server
DSE Design Space Explorer
DT Deutsche Telekom
EC European Commission
EU European Union
FPS Frames Per Second
GA Grant Agreement
GPGPU General-Purpose computing on Graphics Processing Units
GPU Graphics Processing Unit
I/O Input / Output
JAR Java ARchive
JVM Java Virtual Machine
OS Operating System
OTA Over The Air
OTC Open Telekom Cloud
PU Public
QoS Quality of Service
REST REpresentational State Transfer
RGB Red Green Blue
RGBD Red Green Blue & Depth
RM Registration Manager
SLA Service Level Agreement
SLAM Service Level Agreement Manager
SVN Subversion
URI Uniform Resource Identifier
VM Virtual Machine
VMM Virtual Machine Manager
WP Work Package
2. Evaluation of RAPID Service/Cloud
2.1. Environment
The evaluation of the RAPID framework against the three use case applications has been conducted
using the public RAPID service, deployed on the Open Telekom Cloud (OTC) provided by Deutsche
Telekom (DT). As presented in the RAPID deliverable D7.2 [1], OTC is a European public cloud
offering based on OpenStack [2], which provides Virtual Machine (VM) instance options with NVIDIA
M60 [3] Graphics Processing Units (GPUs), while effectively covering all RAPID requirements.
The RAPID framework has been deployed on OTC, following the deployment diagram presented in
Figure 1.
Figure 1: Deployment of the RAPID framework in OTC.
As shown in the figure, the following VMs are used:
- One “Infrastructure VM”, hosting the DS, VMM and SLAM components
- One “GPU Bridger VM”, hosting the GPU-Bridger backend
- AV “Antivirus VMs”, where AV is the number of VMs hosting AS components for the Antivirus application, i.e. DSE, RM, DFE and the GPU-Bridger frontend
- KH “Kinect Hand-tracking VMs”, where KH is the number of VMs hosting AS components for the Kinect Hand-tracking application, i.e. DSE, RM and DFE
- BS “BioSurveillance VMs”, where BS is the number of VMs hosting the GPU-Bridger frontend component, which communicates with the GPU-Bridger backend component of the RAPID framework
It has to be noted that the VMs used for the use-case applications are automatically created by the RAPID
platform as Acceleration Server VMs, or even as helper VMs in case of task forwarding or parallelization.
Thus, the exact type and number of such VMs varies over time, based on the needs of the served
applications.
For evaluation purposes, several VM instances have been used, varying in number and
characteristics, according to the application under test. An indicative illustration of such a list is depicted
in Figure 2, as listed in the OTC dashboard.
Figure 2: The VMs used for the RAPID evaluation as shown in the OTC dashboard.
The specifications of the VMs listed in Figure 2 are the following:
Table 1: The characteristics of the VMs listed in Figure 2.
Instance Name                    Family         Type        vCPUs  RAM    Disk   GPU
RAPID-AndroidNormalVM-userId-4   Computing I    c1.medium   1      1 GB   10 GB  -
RAPID-AndroidHelperVM-2          Computing I    c1.medium   1      1 GB   10 GB  -
RAPID-AndroidHelperVM-1          Computing I    c1.medium   1      1 GB   10 GB  -
Infrastructure_VM                Computing I    c2.large    2      4 GB   12 GB  -
android44                        Computing I    c1.medium   1      1 GB   10 GB  -
GPU_Bridger                      GPU-optimized  g2.2xlarge  8      64 GB  40 GB  NVIDIA M60 x 1
2.1.1. SLA in OTC
In D7.1 we presented the final architecture and deployment of the RAPID system. Some minor
functional changes had to be made in the VMM component in order for it to work correctly in the OTC
instance. Below we list the changes we had to make in order to have QoS running in the new cloud
environment:
- The original concept of the SLAM was to duplicate a specific resource of a VM when a violation
occurred; for example, when a machine was running out of memory, we resized the VM, doubling its
memory until reaching the maximum available. Within OTC, in order to reduce the cost of the
experiment, there is a limitation on the number of machine configurations we could define. One of the
parameters, the disk size, is fixed (up to 32 TB) and cannot be changed within OTC. The other two
parameters we could change are the number of vCPUs and the size of the RAM. The OTC VM flavours
that are closest in characteristics to the original RAPID design, and can thus be easily used, are listed
in Table 2. Although this change is necessary for the SLA functionality as realized in the RAPID
project, it is confined to the VMM component, which has been modified to take into account the
different types of flavours available. No change has been made within the SLAM component: for this
part of the functionality, the SLAM works exactly as before, in the same way as in the RAPID Private
Cloud, and the change is completely transparent. When the SLAM detects a violation, it asks the VMM
to update the machine by duplicating its resources; the VMM then requests an update to the best
matching flavour, the underlying OpenStack finds that flavour, and the VM is updated (a sketch of this
matching logic is given after Table 2).
Table 2: List of available flavours in OTC
Flavour Name  CPU cores  RAM (MB)  Disk
c1.medium     1          1024      up to 32 TB
c2.medium     1          2048      up to 32 TB
c1.large      2          2048      up to 32 TB
c2.large      2          4096      up to 32 TB
c1.xlarge     4          4096      up to 32 TB
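The flavour-matching step can be illustrated with the following minimal Java sketch. This is not the actual VMM code; all class and method names are hypothetical, and only the vCPU and RAM dimensions are modelled, since the disk size is fixed in OTC.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Minimal sketch (not the actual VMM code) of how a "duplicate the violated
// resource" request can be mapped onto the discrete OTC flavours of Table 2.
public class FlavourMatcher {

    static class Flavour {
        final String name; final int vcpus; final int ramMb;
        Flavour(String name, int vcpus, int ramMb) {
            this.name = name; this.vcpus = vcpus; this.ramMb = ramMb;
        }
    }

    // The OTC flavours from Table 2 (disk is fixed, so it is not modelled).
    static final List<Flavour> FLAVOURS = Arrays.asList(
        new Flavour("c1.medium", 1, 1024),
        new Flavour("c2.medium", 1, 2048),
        new Flavour("c1.large",  2, 2048),
        new Flavour("c2.large",  2, 4096),
        new Flavour("c1.xlarge", 4, 4096));

    // On an SLA violation, double the violated resource and pick the
    // smallest flavour that satisfies both requirements, if any exists.
    static Optional<Flavour> bestMatch(Flavour current, boolean ramViolation) {
        int wantedVcpus = ramViolation ? current.vcpus : current.vcpus * 2;
        int wantedRam   = ramViolation ? current.ramMb * 2 : current.ramMb;
        return FLAVOURS.stream()
            .filter(f -> f.vcpus >= wantedVcpus && f.ramMb >= wantedRam)
            .min(Comparator.comparingInt((Flavour f) -> f.vcpus)
                           .thenComparingInt(f -> f.ramMb));
    }

    public static void main(String[] args) {
        // Example: a c1.medium VM running out of memory is resized to c2.medium.
        bestMatch(FLAVOURS.get(0), true)
            .ifPresent(f -> System.out.println("Resize to: " + f.name));
    }
}

Illustrative sketch of the flavour-matching logic applied on an SLA violation.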
- Installing the RAPID components in OTC implied code changes in the VMM in other ways too. The
RAPID Private Cloud is based on OpenStack, and therefore data is retrieved using the telemetry
Application Programming Interface (API). However, despite the fact that OTC is also based on
OpenStack, it provides its own monitoring API. To retrieve the monitoring data, such as the amount of
CPU or memory being used, we had to re-implement the way the metrics are retrieved from the system
in order to adapt to OTC, since OTC’s monitoring API is based on REpresentational State Transfer
(REST). The REST call must include an auth-token issued by OTC itself. Below we present an example
of retrieving the CPU information. The Uniform Resource Identifier (URI) format is "GET /v1.0/{project
id}/metric-data". The descriptions of the other important parameters are as follows: metric_name
specifies the metric name, e.g. “cpu_util” or “mem_util” in our case. The dim.0 parameter specifies
the instance ID of the monitored VM. The from parameter denotes the start time of the query, formatted
as a UNIX timestamp in milliseconds, and the to parameter indicates the end time of the query. The
period parameter specifies the monitoring interval in seconds. Finally, the filter parameter indicates the
data aggregation mode and can be average, variance, min, or max.
curl -X GET \
'https://ces.eu-de.otc.t-systems.com/V1.0/fdb52efe56ed44f79c7538fb6bbf3209/metric-data?namespace=SYS.ECS&metric_name=cpu_util&dim.0=instance_id,ed1231e8-11ab-4976-9986-427681887ab2&from=1516762106162&to=1516793106162&period=1200&filter=average' \
-H 'X-Auth-Token:
MIIFBAYJKoZIhvcNAQcCoIIE9TCCBPECAQExDTALBglghkgBZQMEAgEwggLSBgkqhkiG9w0BBwGgggLDBIICv3sidG9r
ZW4iOnsiZXhwaXJlc19hdCI6IjIwMTgtMDEtMjVUMTA6MzA6MjMuMzY2MDAwWiIsIm1ldGhvZHMiOlsicGFzc3dvcmQi
XSwiY2F0YWxvZyI6W10sInJvbGVzIjpbeyJuYW1lIjoidGVfYWRtaW4iLCJpZCI6IjY5OWJkNjJjZGEzMDRkMmNhZDAz
ZmQyZmIxOTBiOGNmIn0seyJuYW1lIjoib3BfZ2F0ZWRfY2NlX3N3aXRjaCIsImlkIjoiMCJ9XSwicHJvamVjdCI6eyJk
b21haW4iOnsieGRvbWFpbl90eXBlIjoiVFNJIiwibmFtZSI6Ik9UQy1FVS1ERS0wMDAwMDAwMDAwMTAwMDAyNTE4OSIs
ImlkIjoiOGUzMTZlNTdiNzM0NGFmNmI2ZmIzZmYwYzIzZWI3ZmMiLCJ4ZG9tYWluX2lkIjoiMDAwMDAwMDAwMDEwMDAw
MjUxODkifSwibmFtZSI6ImV1LWRlIiwiaWQiOiJmZGI1MmVmZTU2ZWQ0NGY3OWM3NTM4ZmI2YmJmMzIwOSJ9LCJpc3N1
ZWRfYXQiOiIyMDE4LTAxLTI0VDEwOjMwOjIzLjM2NjAwMFoiLCJ1c2VyIjp7ImRvbWFpbiI6eyJ4ZG9tYWluX3R5cGUi
OiJUU0kiLCJuYW1lIjoiT1RDLUVVLURFLTAwMDAwMDAwMDAxMDAwMDI1MTg5IiwiaWQiOiI4ZTMxNmU1N2I3MzQ0YWY2
YjZmYjNmZjBjMjNlYjdmYyIsInhkb21haW5faWQiOiIwMDAwMDAwMDAwMTAwMDAyNTE4OSJ9LCJuYW1lIjoiMTQ5NjAx
MDMgT1RDLUVVLURFLTAwMDAwMDAwMDAxMDAwMDI1MTg5IiwiaWQiOiIwYzU4OGVlZTI1NGY0ZmNmYjU5Zjg4NWZhZjE1
ZGQxOSJ9fX0xggIFMIICAQIBATBcMFcxCzAJBgNVBAYTAlVTMQ4wDAYDVQQIDAVVbnNldDEOMAwGA1UEBwwFVW5zZXQx
DjAMBgNVBAoMBVVuc2V0MRgwFgYDVQQDDA93d3cuZXhhbXBsZS5jb20CAQEwCwYJYIZIAWUDBAIBMA0GCSqGSIb3DQEB
AQUABIIBgDWxRNNuDBudhpV3C9kqhxDi7h4hIygrNWW3t4uqwjqDV6HGfEMets4+cJ+tbf9Tvcdnkf02qK06BunUMLHt
oKRp4cCwCHi3RHpQ0wzMvPkhMFimlhZCCKXeQn0k90ZaZtO8qrk10kficFEzfCaZTcv6+IZEzU8uh5ufoHnoRhWZ2fAW
9fPwhigSTPvZyZZmHWStW6aWuHbSL0VkQyRV9vURB65vfciPwGmfJCVQYjVWH7HyamO5Ds4rTNvp2MCrNbfEWUQ1Wihl
YnGVnPoHmsTjok2thVMyEsvBD0G2pJU9JdzyU3rrKS2KZ50WGSq1ufTfY5iXBQ1lS3rwFDBLRaF9lcJEFOYa9pDQKk4B
f7LjYgP+9ESwfthijDnS--2DsBUDDQNbtq0Qq7Rf-kqLKG2yW+ihBO2rdtd2xcMzU9XrhHVIXf5-
BT1owItp6EgNPpDbHtLxzjxFfnSPnIhB7+VSWTF1Vj5UsZoBDf56BSbuyr7Il2eYySiiN4D7Pozojg=='
Example curl command making the RESTful call to retrieve the cpu_util monitoring variable.
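The same query can also be issued programmatically. The following minimal Java sketch mirrors the curl example above; it is illustrative only (not code from the RAPID VMM), and the project ID, instance ID and token are placeholders that must be obtained from OTC.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

// Illustrative sketch of querying the OTC monitoring API for cpu_util.
public class OtcMetricsClient {
    public static void main(String[] args) throws IOException {
        String projectId  = "<project-id>";   // placeholder
        String instanceId = "<instance-id>";  // placeholder
        String authToken  = "<auth-token>";   // placeholder

        long to   = System.currentTimeMillis();
        long from = to - 60 * 60 * 1000;      // last hour, in milliseconds

        String uri = "https://ces.eu-de.otc.t-systems.com/V1.0/" + projectId
                + "/metric-data?namespace=SYS.ECS&metric_name=cpu_util"
                + "&dim.0=instance_id," + instanceId
                + "&from=" + from + "&to=" + to
                + "&period=1200&filter=average";

        HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("X-Auth-Token", authToken);

        // The response is a JSON document containing the aggregated datapoints.
        try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8")) {
            s.useDelimiter("\\A");
            System.out.println(s.hasNext() ? s.next() : "");
        }
    }
}

Illustrative Java equivalent of the curl call above.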
The retrieval of the memory utilization monitoring information is very similar to the retrieval
of the CPU utilization information, but a change had to be applied to the VM whose memory is to be
monitored: in OTC, a package called uvp-monitor must be installed inside the virtual machine.
- Another difference in OTC that affects RAPID’s QoS solution is the fact that OTC provides
updates of the metrics only every 4 minutes. This needs to be taken into account when resizing a
virtual machine. For example, doubling the memory of a VM and then reading the memory
monitoring value from that VM before the memory metrics were updated would cause OTC to
falsely generate a violation error and request a new resize. This issue directly affected the SLAM
component, and was solved by requesting the metric information with a sampling rate of 4
minutes.
After implementing all these changes, we were able to successfully run the QoS RAPID solution in
OTC. Nevertheless, it is considerably more limited than the RAPID Private Cloud solution. For instance,
another issue affecting the QoS is that OTC does not allow issuing a resize from an Android VM, but
only from Linux machines.
To sum up, several small changes had to be made in the RAPID components in order to test the QoS
solution. The RAPID solution within OTC is slower than in the RAPID Private Cloud, mainly because
OTC requires more time to create new VMs and has a longer metric update period. Unfortunately,
these characteristics are out of our control.
2.2. Evaluation using simple benchmarks
In this section, we describe the deployment of the RAPID infrastructure on the OTC public cloud. We
perform some simple evaluation and validation tests in order to confirm that the integration of our
platform within a commercial public cloud works as well as the previous deployment in the RAPID
Private Cloud¹. We run the RAPID demo application in both Android and Linux VMs, as described in
RAPID deliverable D4.2 [4], where we tested the offloading features on the RAPID Private Cloud.
First we test the registration process, which involves several components, i.e. the Directory Server (DS),
the Service Level Agreement Manager (SLAM), and the Virtual Machine Manager (VMM). The
registration is performed correctly and the VM is allocated on the client device. However, compared to
the RAPID Private Cloud, we notice that the creation time of a VM on the OTC is around four times
longer, reaching up to 2 minutes (see RAPID D4.3 [5] for more information about the registration
process in the RAPID Private Cloud).
Then, we perform experiments of Central Processing Unit (CPU) task offloading, CPU task parallel
execution, CPU native code offloading, and General-Purpose computing on Graphics Processing Units
(GPGPU) CUDA code offloading.
2.2.1. Android Experiments
The Android device used for the experiments is a Huawei P9 Lite smartphone [6], equipped with an
Octa-core (4x2.0 GHz Cortex-A53 & 4x1.7 GHz Cortex-A53) CPU and 3 GB of RAM, running Android
7.0. The phone was physically located in Copenhagen, Denmark. In Figure 3 we show the RAPID demo
application running on the Android device after having performed some experiments. The Android VM
is based on Android-x86 4.4 and is configured with 1 OTC vCPU and 1 GB of RAM. The network
connection between the phone and the cloud was a normal household commercial Wi-Fi channel with
the following characteristics, as reported by the RAPID Network Profiler:
- Latency (RTT): around 80 ms
- Upload rate: around 3 Mb/s
- Download rate: around 8 Mb/s
¹ The term “RAPID Private Cloud” refers to the RAPID acceleration service deployed on a private OpenStack
installation, on SILO premises, within Task 6.2 “Cloud Infrastructure Software”.
Figure 3: Screenshot of the Android phone after performing some experiments.
Figure 4 to Figure 8 display the results of the CPU offloading tests, i.e. the N-Queens puzzle, on the
Android Operating System (OS) and when offloaded on the VM provided by the OTC. As we can see
in the figures, offloading is beneficial when the problem becomes computationally complex enough,
which in this case happens when the number of queens is equal to or greater than 6. Furthermore, we test
the RAPID parallelization support, by running the N-Queens puzzle with 8 queens, using multiple VMs.
The test was completed successfully, and the results in Figure 9 show that parallelizing the execution
with 2 VMs improves the execution time even further.
Figure 4: 4-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 5: 5-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 6: 6-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 7: 7-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 8: 8-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 9: 8-Queens puzzle, Local vs. Parallel Remote execution performed on Huawei Android 7.0 phone and two Android-x86 4.4 VMs on OTC.
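To give a feeling for how such a CPU task is structured for offloading, the following Java sketch shows an N-Queens solver written as a RAPID task. The Remoteable base class and the DFE entry point follow the pattern described in D4.2 for the demo application, but the exact names and signatures used here are assumptions, not the actual RAPID API.

import java.lang.reflect.Method;

// Illustrative sketch of an offloadable CPU task; Remoteable and DFE are
// assumed framework classes, and their signatures here are hypothetical.
public class NQueensTask extends Remoteable {

    // Counts the solutions of the N-Queens puzzle with a classic bitmask
    // backtracking solver. This is the method RAPID may offload.
    public int solve(int n) {
        return count(n, 0, 0L, 0L, 0L);
    }

    private int count(int n, int row, long cols, long d1, long d2) {
        if (row == n) return 1;
        int total = 0;
        for (int col = 0; col < n; col++) {
            long c = 1L << col;                  // column already attacked?
            long a = 1L << (row + col);          // "/" diagonal attacked?
            long b = 1L << (row - col + n);      // "\" diagonal attacked?
            if ((cols & c) == 0 && (d1 & a) == 0 && (d2 & b) == 0) {
                total += count(n, row + 1, cols | c, d1 | a, d2 | b);
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        DFE dfe = new DFE();   // registers with the DS/SLAM and acquires a VM
        NQueensTask task = new NQueensTask();
        Method m = NQueensTask.class.getMethod("solve", int.class);
        // The DFE profiles local and remote execution and picks the faster one.
        int solutions = (int) dfe.execute(m, new Object[]{8}, task);
        System.out.println("8-Queens solutions: " + solutions);
    }
}

Illustrative structure of an offloadable N-Queens task (hypothetical API names).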
We also perform a test of native code offloading, showing that it is possible to offload native C/C++
code embedded in Android methods to the Android VM on the OTC. The output of this experiment can
be seen in the Android screenshot in Figure 3, where we notice that the local execution is much faster
than the offloaded execution. This is expected, given that the native method implemented in the RAPID
demo application is very simple, and its purpose is just to test that the offloading works correctly. In
Figure 10, we show the log of the AS running on the Android VM when receiving the native code for
execution. From the log, we can see that the first time the AS tries to run the native method, the
execution fails, since the method cannot be found in the currently loaded libraries. The AS then loads
the shared libraries that were embedded with the application, finds the implementation of the native
method, and performs the execution. When the same method is offloaded again to the VM, the library
is already loaded, so the execution is performed immediately, without wasting time on the library
loading process, as shown in the log in Figure 11.
Figure 10: Screenshot of the Android VM log when executing CPU native (C/C++) Android code before the shared library was loaded.
Figure 11: Screenshot of the Android VM log when executing CPU native (C/C++) Android code after the shared library was loaded.
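This retry-after-loading behaviour can be summarized by the following sketch. It is illustrative only and not the actual AS code; the class and method names are hypothetical, while the exception handling reflects how Java reflection reports a missing native implementation.

import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Illustrative sketch of the lazy library loading described above.
public class NativeMethodRunner {

    public Object run(Object task, Method method, Object[] args,
                      Iterable<String> embeddedLibraries) throws Exception {
        try {
            // First attempt: the native symbol may already be resolvable.
            return method.invoke(task, args);
        } catch (InvocationTargetException e) {
            if (!(e.getCause() instanceof UnsatisfiedLinkError)) throw e;
            // The method was not found in the currently loaded libraries:
            // load the shared libraries shipped with the application and retry.
            for (String libPath : embeddedLibraries) {
                System.load(libPath);
            }
            return method.invoke(task, args);
        }
    }
}

Illustrative sketch of lazy loading of application-embedded shared libraries on the AS.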
Figure 12: Matrix multiplication varying the problem size. Left: the GPU offloading is performed using a local dedicated machine. Right: the GPU offloading leverages the GPU-Bridger OTC virtual machine.
Finally, we perform a test that offloads GPGPU CUDA code, which proves that GPU code offloading
is i) feasible and ii) convenient under the right circumstances. Choosing a performance testing suite
accepted by the community is not possible, due to the lack of such a suite, given that GPU code
offloading for Android devices is a novel approach. As the evaluation tool, we chose one of
NVIDIA's CUDA SDK 9.0 samples, included with the standard CUDA Toolkit: Matrix
Multiplication. The choice is motivated by its clarity in illustrating various CUDA programming
principles, which makes it easy to clearly present the modifications needed to make it work with
RAPID Android GPU code offloading.
We implement a regular Android application embedding the RAPID GPU-Bridger for the Java/Android
framework, with the x86_64-compiled CUDA kernel as an application resource. We perform the tests
using an ASUS ZenFone 2 ZE551ML [7] equipped with 4 GB of RAM, connected to the academic Wi-Fi
roaming infrastructure Eduroam. We consider this a very common use case, characterized by a good but
not dedicated Wi-Fi connection and a mid-range, not latest-generation, mobile handset.
The test consists of matrix multiplication with an increasing problem size:
Figure 12 displays the results of the same test suite performed in two RAPID GPU Acceleration Server
configurations:
- In Figure 12 (left), the virtual GPU is hosted by a local dedicated server equipped with two NVIDIA Titan X [8] CUDA-enabled devices.
- In Figure 12 (right), the virtual GPU is hosted by the GPU-Bridger virtual machine instance on the OTC, providing one NVIDIA Tesla M60 [3].
Both the local and the OTC machines use exactly the same GPU-Bridger backend software, the same
NVIDIA drivers and the same NVIDIA CUDA Toolkit. As the figures show, from a performance point
of view, offloading the GPU code to a remote virtual machine equipped with a high-end NVIDIA CUDA-enabled
GPU device, as on the OTC, is more beneficial than offloading to a local server with a
lower-end device (green curves).
2.2.2. Linux Experiments
We perform the same experiments described in the previous section using a Linux device, a
Lenovo ThinkPad T460p laptop [9] equipped with a quad-core Intel Core i5-6440HQ (6th Gen) CPU
@ 2.6 GHz and 4 GB of RAM, running Ubuntu 16.04 LTS. The VM on the OTC platform runs Ubuntu
16.04 and is equipped with 1 vCPU and 1 GB of RAM.
Figure 13 presents the output of the N-Queens puzzle with 4 to 8 queens and of the C/C++ native code
offloading. As the results show, offloading is successfully performed in both experiments, even though
it was not beneficial in any of the cases. Indeed, this outcome was to be expected, given that the Linux
laptop is more powerful than the Linux VM running on the OTC platform. However, the purpose of
these experiments was to demonstrate that RAPID can run correctly on the OTC commercial public
cloud.
Figure 13: Linux RAPID demo client after executing the N-Queens puzzle with 4, 5, 6, 7, and 8 queens and the C/C++ native method, which only prints “Hello World”.
In order to perform a test of GPU code offloading in Linux, we used CUDA-enabled IDW interpolation
software [10], which is not specifically designed for testing but for real use, as it belongs to a software
suite developed for bathymetry interpolation. IDW is a deterministic method for spatial interpolation,
based on the principle that nearby points have similar values.
We considered a fixed number of 500,000 query locations (points where the value is unknown) and a
varying number of known values: 100, 1,000, 10,000 and 100,000.
The CUDA-enabled GPU device has been provisioned as follows:
- CPU: no CUDA-enabled GPU; only the CPU version of the algorithm is used
- TITAN X: an NVIDIA Titan X [8] physically connected to the machine used for testing. In this scenario we use the regular CUDA libraries, with no involvement of any RAPID GPU offloading component
- TITAN X GPU-Bridger: the same device as in the previous scenario, but accessed through the RAPID GPU-Bridger
- TESLA M60 OTC: the GPU code is offloaded to a remote virtual machine on the Open Telekom Cloud using the RAPID GPU-Bridger
Figure 14: The CUDA-enabled IDW Algorithm executed on different flavours: (a) CPU, (b) on-board Titan X, (c) Titan X using GPU-Bridger instead of regular CUDA, (d) Tesla M60 on OTC using RAPID offloading.
Figure 14 shows the results of the performed experiments. As expected, the CPU underperforms when
compared to any GPU flavour. The comparison between the TITAN X and TITAN X GPU-Bridger
cases is useful to demonstrate the minimal footprint of the GPU offloading framework developed in RAPID.
The crossing between the TESLA M60 GPU-Bridger (OTC) line and the local TITAN X lines (regular and
virtualized) is interesting because it marks a break-even point dividing the problem sizes into two sets: for
fewer than 10,000 known values, GPU offloading is not convenient; for more than 10,000 known values,
offloading is the best solution. As a final remark on this experiment: we never recompiled or changed the
source code of the CUDA-enabled IDW interpolation algorithm. The GPU offloading is thus
completely transparent.
3. Evaluation of RAPID Applications
3.1. 3D Hand Tracking
Thanks to RAPID’s native C/C++ code offloading support, porting a pure C++ application to the RAPID
platform is quite straightforward; the Hand Tracking application, a native Linux application developed
in C++, demonstrates this. The only steps required to make the native Hand Tracker remoteable are
writing a Java JNI wrapper for the top-level calls of the hand tracker application, declaring them
remoteable using the RAPID API, and choosing the offloading server by initializing the DFE
accordingly. All other details are abstracted from the developer. All the implementation details
regarding the port of the 3D Hand Tracker to RAPID are thoroughly documented in D2.1 [11], and the
RAPID implementation remains almost one-to-one with the original code. As stated in D2.2 [12]
(section 4.3.2), the RAPID GPU Bridge component cannot be used, since the Hand Tracker relies on
direct OpenGL/CUDA interoperation: the RAPID GPU Bridge provides pure CUDA virtualization, so
the OpenGL deferred rendering, which is a mandatory requirement, is unavailable. The OpenGL/CUDA
interoperation requirement could theoretically be overcome by decoupling the OpenGL geometry-to-depth
rendering calls, which are done via shaders, from the CUDA comparison of the renderings against
the camera observation data. However, this decoupling would require downloading the rendered data
from GPU RAM to system RAM and re-uploading it to the GPU, introducing a large PCI-bus bottleneck
that would deteriorate performance. The RAPID code-base for the Hand Tracker is available for public
use at RAPID’s GitHub webpage [13]. The code-base can serve as a very helpful reference for
developers who may wish to port a similar application to RAPID, since it serves as an example of code
layout; moreover, it provides templates for wrapping C/C++ primitives as Java objects and for the
overarching organization of a maven workspace [14]. A minimal sketch of such a wrapper is shown below.
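The following Java sketch illustrates the shape of such a JNI wrapper. It is not taken from the Hand Tracker repository; the class, method and library names are hypothetical, and Remoteable and DFE stand for the RAPID API described above, with assumed signatures.

import java.lang.reflect.Method;

// Hypothetical JNI wrapper around a top-level C++ tracking call.
public class HandTrackerWrapper extends Remoteable {

    static {
        // Loads the self-contained native runtime (e.g. libhandtracker.so).
        System.loadLibrary("handtracker");
    }

    // Top-level native call: given the previous hand pose (x) and the new
    // RGBD frame (t+1), return the estimated pose (x+1). Implemented in C++.
    public native float[] trackFrame(float[] previousPose,
                                     byte[] rgbFrame, short[] depthFrame);

    // Declares the call remoteable and lets the DFE decide where to run it.
    public float[] track(DFE dfe, float[] pose, byte[] rgb, short[] depth)
            throws Exception {
        Method m = HandTrackerWrapper.class.getMethod("trackFrame",
                float[].class, byte[].class, short[].class);
        return (float[]) dfe.execute(m, new Object[]{pose, rgb, depth}, this);
    }
}

Illustrative Java JNI wrapper for a native top-level call (hypothetical names).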
3.1.1. Environment
The software and hardware environment required by the application, as defined by the specifications in
D2.1 [11] section 5, is a 64-bit PC running Ubuntu 14.04.5 LTS and an NVIDIA GPU compatible with
CUDA version 6.5 or higher. The Hand Tracker internally relies on multiple sub-libraries for various
functionalities. All the project’s dependencies are included inside the GitHub repository in binary form.
Some of the most important dependencies are Boost [15], which offers platform independent system
libraries, OpenCV [16], which is a state-of-the-art open source Computer Vision toolkit, and OpenNI
[17], which provides the drivers to our RGBD cameras, as well as various smaller threading libraries.
The maven build system [14] chosen by RAPID automatically packages the dependencies in the
generated portable JAR archive. For this reason, the Hand Tracker application is very portable, despite
only having a Java-based top-level wrapper, since the rest of the binary runtime is self-contained. The
application’s proper functionality has also been tested on Linux host machines running different
versions of Ubuntu (16.04 and 16.10) and CUDA (8.0 and 9.0).
In order to study the behaviour of our RAPID-enabled Hand Tracker application, we used a two-tier
testing environment: a high-end desktop and a low-end laptop. The high-end desktop features a GeForce
GTX 970 [18] GPU and an Intel Core i7-950 [19] processor, while the laptop has an outdated GeForce
670M [20] GPU and an Intel Core i5-4210U [21]. It is worth noting that the laptop GPU is incompatible
with recent CUDA versions (9.0+), so with updated software it is not capable of performing GPGPU
tasks. The two connectivity options examined are a fast Gigabit Ethernet connection and a slower 802.11
Wi-Fi channel. The ideal environment for our application
would be a portable laptop connecting over the wireless channel to a fast offloading server. In this
way, we would be able to perform Hand Tracking while conserving the laptop’s limited battery power
and CPU resources and benefiting from its portability. However, we are also aiming for real-time
performance, so the delay we can tolerate when performing Hand Tracking is extremely low. In order
to achieve a 30 fps tracking loop, which allows us to process each frame received from the RGBD
device, all the processing needs to be executed within 33 milliseconds. Unfortunately, Wi-Fi
connections are very prone to radio interference and typically introduce latency ranging from 10 to 60
milliseconds, depending on the number of connected clients and network saturation. Moreover, the
available bandwidth of a Wi-Fi connection is substantially lower than that of a Gigabit Ethernet
connection. These factors make it architecturally impossible for a Wi-Fi connection to accommodate
our needs.
Figure 15: Hand Tracker testing environment topology.
In order to analyse the sustained performance of our RAPID-enabled Hand Tracking application in every
configuration, we performed our evaluation using both wireless and wired connections (the latter with
0.1 milliseconds of latency). The topology of the experimental setup is shown in Figure 15. Connecting
the Hand Tracker to the RAPID Public Cloud deployment would result in even higher latency, since,
depending on the network quality, the observed latency can range from 50 ms to 150 ms when
establishing TCP/IP connections with the remote host. The unique latency requirements of the Hand
Tracker make it suited only for the low-latency connections that a private cloud can offer. Due to this
requirement, as well as the OpenGL/CUDA interoperability requirement, this application was only
tested in the Private Cloud.
3.1.2. Performance Results
Before assessing the end-to-end performance of the application, we must first study the processing loop
of the Hand Tracker algorithm and understand the way it should ideally execute in order to benefit from
the remote execution.
Figure 16: High-level view of the Hand Tracker I/O times in ideal conditions.
The application is a processor of frames generated by a camera at a framerate of 30 frames per second.
The Hand Tracker acts as a black-box optimizer: it receives a prior hand configuration (x) along with
the next RGB and Depth frame pair (t+1) observing a hand, and responds with a good estimation (x+1)
of the position of the hand for the given frame. By repeating the procedure over the series of received
frames, we acquire one estimation per received RGBD frame, thus fully tracking the observed hand. In
order to achieve this for every frame received from the device, we need to perform all processing steps
in less than 33 milliseconds, as seen in Figure 16. Otherwise, when the next frame arrives after any
delay, we are not able to continue the process, since we do not have the (x+1) value that is required for
computing state (x+2).
This is depicted in segment A of Figure 17, where we observe that for a slower, 150 ms processing loop
time, we must skip processing two consecutive frames for each frame we process, since we do not have
enough processing time for them. This is not only bad for the user experience (as there is observable
delay), but also bad for the quality of the tracking: during the time lost due to the dropped frames, the
hand moves further away from the last tracked position, so the hand tracker has to sample a much wider
area, which makes the problem much more difficult, and errors also tend to accumulate.
Figure 17: What happens when frame processing is delayed (A) vs what RAPID could facilitate (B).
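The serial dependency described above can be summarized in a short pseudo-Java sketch (illustrative only; the camera and tracker interfaces are hypothetical):

// Illustrative sketch of the Hand Tracker's serial processing loop.
// Each estimation depends on the previous one, so frame t+1 cannot be
// processed, locally or remotely, before frame t has produced its result.
HandPose x = tracker.initialPose();
while (camera.hasFrames()) {
    RgbdFrame frame = camera.nextFrame(); // a new frame every 33 ms (30 fps)
    // Black-box optimizer: prior pose + new observation -> new pose.
    // If this call (plus any network round-trip when offloaded) exceeds
    // 33 ms, subsequent frames must be dropped and tracking quality degrades.
    x = tracker.estimate(x, frame);
}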
The RAPID platform, on the other hand, is not constrained by the resources of a single machine and
could allow us to scale up, as seen in Figure 17 (B). Unfortunately, the nature of the optimization
framework used inside the Hand Tracker is not suited to this kind of parallel processing: each frame
must first be processed and produce output before the next one can be processed. Thus, any potential
benefit gained by assigning each incoming frame to a separate computing resource is negated by the fact
that we always have to wait for each step to be completed before handling the next step.
One of the architectural changes attempted during the course of the project, in order to adapt to the
capabilities provided by RAPID, was daisy-chaining multiple separate optimization “threads”. This
could potentially improve tracking results by working on the input frame stream in parallel: each frame
would use a different computing resource provided by RAPID and be initialized with the closest
previous solution x and the latest received frame. Unfortunately, as seen in Figure 18, this generates two
completely separate optimization threads. Each thread would only use its own results, as they would
always be the most recent, which would virtually increase framerates by doubling resource consumption
while offering no real tracking quality improvement.
Figure 18: Daisy-chaining to two machines could in principle double observed framerates, but in fact the tracking
quality would be the same while consuming double resources.
To assess the system’s performance, we compare the evaluation results obtained by executing the Hand
Tracker using various configuration settings. In order to have comparable results, we pre-recorded a
Hand Tracking experiment scene depicting various challenging hand movements, and we added
configuration parameters to the executables in order to be able to replicate and script the same
experiment across different setups and obtain directly comparable graphs.
With this evaluation, we aim to identify the overhead introduced by the network connections and the
RAPID framework, as well as the impact of calling the application’s native code through a Java Virtual
Machine (JVM). Moreover, we want to quantify the potential speed gain achieved by utilizing remote
code execution. Of course, all these are indirectly affected by the serial nature of the Hand Tracker
which, as stated at the start of this section, has to wait for each frame to be computed before processing
the next. If the Hand Tracker application could perform tracking without relying on previous solutions
in a serial fashion, performance would improve substantially, since we would be able to offload all of
the incoming frames and receive the results, achieving a perfect 30 FPS with just a minor and constant
delay.
We begin our evaluation by measuring the sustainable performance of the Hand Tracker in its vanilla
non-Java, non-RAPID implementation, when executed on the high-end desktop and the low-end laptop
respectively. The results of this analysis form the baseline of our evaluation and are displayed as the
dashed lines in Figure 19. The high-end hardware available in the desktop computer allows the
application to achieve real-time processing at 30 frames per second, matching the rate at which the
RGBD camera acquires new Depth and RGB frames. The laptop achieves a maximum average rate of
12 FPS, which is much slower due to its low-end hardware. Ideally, we want to take advantage of any
extra processing power provided by the desktop host in order to improve performance. Moreover, it is
worth keeping in mind that an important contribution of RAPID is enabling CUDA applications (such
as the Hand Tracker) to run on devices lacking CUDA-enabled graphics cards; this alone is a benefit of
using RAPID for this use case.
We proceed with the evaluation by measuring the performance of the RAPID-enabled implementation
of the Hand Tracker, when executed on the desktop and laptop host respectively, without utilizing code
offloading. The results of this set of experiments are portrayed in Figure 19, marked as “RAPID
Desktop/Laptop Localhost”. This analysis enables us to identify the overhead introduced by wrapping
the native code of the Hand Tracker inside a Java container using JNI. The results obtained at this step
reveal the impact of data serialization, synchronization and JVM overheads. We observe that in this
configuration, the application’s performance is reduced by 50% when executed on the high-end desktop
host. When executing on the low-end laptop host, where the potential GPU speedup is lower and the
overall execution slower, the overhead introduced by the Java switch is much less evident, at around
10%.
Figure 19: Sustainable frame rate for various offloading configurations.
The RGBD camera acquisition framerate is 30 fps.
Finally, we measure the performance characteristics of the RAPID-enabled Hand Tracker, utilizing code offloading via the Gigabit Ethernet and the Wi-Fi connection. The purpose of this study is to evaluate the performance gain obtained by executing the application's logic on the high-end desktop host. Unfortunately, since the desktop host performance, when running through its Java container, falls to roughly 15 FPS in the localhost scenario while the laptop localhost performance ranges from 10-12 FPS, we are left with only a small per-frame margin to gain at best. Considering the context data and network overhead that must be transmitted between the machines, this proves to be too much. RAPID automatically measures this and falls back to local execution. Thus, we obtain a performance similar to the localhost Java run, minus a small overhead incurred while the two machines negotiate over the slower or faster connection. As already mentioned in Section 3.1.1 of this document, we did not expect the wireless connection to be fast enough to help us, but thanks to RAPID's automatic QoS sensing, the slow network does not negatively impact execution times. Figure 20 is very revealing with regard to the importance of the network overhead: it clearly captures the delay caused by the network medium compared to the pure execution time on the remote machine. As Amdahl's law [22] suggests, regardless of the computing power available, we would still be unable to improve the achieved frame rates past the most critical bottleneck, which in our case is the network. Thus, applications like the Hand Tracker, where we are forced to wait for each frame to be processed before submitting the next, prove to be ill-suited to parallelization, since network latency ends up directly affecting the processing performance. An example of a problem better suited to parallelization would be a person detector (not tracker), where there would be no inter-frame dependencies. In that case, all newly acquired frames could be submitted in parallel to the computing resources, the network delay would not accumulate, and RAPID could provide a substantial improvement.
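As a contrast to the serial tracker, the following sketch shows how such a dependency-free detector could submit frames concurrently; DetectorStub and detectFrame() are hypothetical stand-ins for a RAPID-offloaded method:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelFrameSubmitter {
        // Hypothetical remote interface standing in for an offloaded call.
        public interface DetectorStub {
            int[] detectFrame(byte[] frame);
        }

        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        // With no inter-frame dependencies, every new frame is submitted
        // immediately: the network adds a constant delay per frame, but the
        // delays do not accumulate, so the sustained rate can match the
        // 30 FPS camera acquisition rate.
        public Future<int[]> submit(byte[] frame, DetectorStub remote) {
            return pool.submit(() -> remote.detectFrame(frame));
        }
    }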
Figure 20: Delay between local and remote execution.
3.1.3. Conclusions for the Hand Tracker use-case
Porting the Hand Tracker to the RAPID framework provided important insights that can be summarized
in the following list of remarks.
1. RAPID provides a versatile framework that can facilitate the easy upgrade of a code base that
was initially not built with distributed systems in mind to a fully distributed version, with very
little programming effort.
2. The range and type of applications that can benefit from RAPID is very large. As seen in the
Hand Tracker use case, even native applications that are written in C/C++ can be easily ported
to RAPID using a JVM Wrapper.
3. The Hand Tracker initially only targeted high-end devices that featured a fast GPGPU. With
RAPID, lower-end devices (even without graphics cards) are no longer excluded from the target
group for this application.
4. Although RAPID can transparently deliver an enormous pool of computing resources to any device/application combination, these computing resources may still ultimately be limited by network quality, bandwidth and latency, depending on the nature of the application. Even relatively fast 802.11n Wi-Fi connections can become a bottleneck in I/O-heavy applications.
5. The Hand Tracker, with its serial frame-processing dependency, its real-time requirements and its latency-sensitive visual feedback, described in detail in Section 2.4.2, is a use case that tests RAPID in an extremely demanding and unfavourable scenario. Despite this, RAPID manages to accommodate it, and will be able to accommodate it even better with future improvements in network technology.
3.2. Antivirus
This section presents the evaluation results of the GrAVity mobile antivirus for Android, which has been appropriately modified to use the RAPID offloading framework. The original workstation version of the antivirus was ported to the Android platform, in the form of an Android Package Kit (APK), at the early stages of this project. In its original configuration, the system took advantage of modern NVIDIA [23] GPUs in order to offload the computationally intensive task of scanning the file system for the presence of malicious code.
The Android version of this application was initially developed for the NVIDIA Shield K1 tablet [23], equipped with a mobile Kepler [24] GPU. In this configuration, the system was able to achieve increased performance by offloading the virus-scanning operations to the device's GPU instead of using the CPU. However, since the number of devices on the market equipped with CUDA-capable GPUs is limited [25], we proceeded to develop a version of its virus-scanning engine implemented entirely in Java. In this configuration, the application can be used by the vast majority of Android mobile devices available on the market. Furthermore, in order to provide the benefits of fast GPGPU execution to all mobile devices, we modified the GrAVity antivirus to offload the computationally intensive tasks using the RAPID framework.
In its final development stage, the application offers a wide variety of execution methods on each mobile device, as provided by the deployment of the RAPID framework and its infrastructure, and is available for mobile devices running Android version 4.4 or higher. The RAPID frontend, embedded in the application, can optimize the execution and schedule it locally on the CPU of the device, or on the GPU if present, or opt to offload the scanning process. The offloading can be performed either by offloading the Java version of the scanning engine to a remote virtual machine or by scheduling the CUDA version to execute on a remote, highly capable GPU. The GPGPU code offloading can be performed either by using the RAPID GPU-Bridger or by using a combination of the Acceleration Server (AS) and the GPU-Bridger. Using the combination of the two components, the virus-scanning task is offloaded to an Android VM, which then contacts the GPU-Bridger backend and forwards the task for execution on the remote GPU. The choice of offloading method is based on a wide variety of factors, such as energy consumption, throughput, latency and availability of resources.
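A simplified sketch of such a decision follows, with hypothetical thresholds, names and logic; RAPID's actual scheduling is more elaborate:

    // Hypothetical offloading decision mirroring the factors named above:
    // availability of resources, latency and energy consumption.
    public class OffloadPolicy {
        public enum Target { LOCAL_CPU, LOCAL_GPU, REMOTE_JAVA_AS, REMOTE_CUDA_BRIDGER }

        public Target choose(boolean hasLocalGpu, boolean remoteGpuAvailable,
                             double networkRttMs, double batteryPercent) {
            boolean networkOk = networkRttMs < 50;      // threshold assumed
            if (remoteGpuAvailable && networkOk) {
                return Target.REMOTE_CUDA_BRIDGER;      // fastest engine
            }
            if (networkOk && batteryPercent < 20) {
                return Target.REMOTE_JAVA_AS;           // save energy
            }
            return hasLocalGpu ? Target.LOCAL_GPU : Target.LOCAL_CPU;
        }
    }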
In the following sub-sections, we present the performance analysis of the application when executed in various environments. First, we present the evaluation results of the local execution using the CPU and the GPU available on the NVIDIA Shield K1 tablet. Then, we provide the analysis of the execution when the virus-scanning operation is offloaded to a private cloud infrastructure (cloudlet) using the RAPID framework. Finally, we discuss the outcomes of the same case study using the RAPID public cloud infrastructure for the offloading process. In each scenario, we demonstrate the results obtained by the execution of both the CPU and GPGPU implementations of the application core.
3.2.1. Environment
We evaluate the operation of our RAPID-enabled mobile antivirus using three different environment setups. We begin by analysing its performance characteristics when executing the application on the device it is installed on, namely the NVIDIA Shield K1 tablet. Then, we proceed with the offloading operations using a cloudlet and, finally, RAPID's public cloud infrastructure.
3.2.1.1. NVIDIA Shield K1 Tablet
The mobile device used for the evaluation of the local execution of the GrAVity antivirus is the NVIDIA Shield K1 Android tablet. It is powered by the NVIDIA Tegra® K1 [26] processor, which features a 192-core NVIDIA Kepler™ GPU and a quad-core Cortex-A15 [27] CPU clocked at 2.2 GHz. The presence of both a mobile CPU and GPU makes this platform an ideal evaluation environment for our application, since we are able to analyse the performance characteristics of both the CPU and GPU implementations of its virus-scanning core. The system is also equipped with 16 GB of internal storage and 2 GB of RAM, and runs Android version 7.0 Nougat, as updated by NVIDIA's latest Over-The-Air (OTA) update 5.0 [28], released on February 9, 2017.
3.2.1.2. Cloudlet Infrastructure
The cloudlet infrastructure under test is composed of two host machines interconnected using a Gigabit Ethernet switch. The network is also equipped with a wireless access point. Each host machine is equipped with an Intel® Core™ i7-6700 [29] CPU running at 3.40 GHz and 16 GB of DDR4 RAM operating at 2400 MHz. The cloudlet hosts are also equipped with an NVIDIA GeForce GTX 980 GPU [30] providing 4 GB of available GDDR5 memory. Both hosts execute instances of Android v6.0 virtual machines with the RAPID Acceleration Server installed, as well as instances of the GPU-Bridger backend. In this configuration, the cloudlet is capable of performing both Java and CUDA code offloading, and we are able to evaluate the CPU and GPU implementations of our application's virus-scanning engine. The NVIDIA Shield K1 is connected to one of the two hosts via a Wi-Fi channel, as seen in Figure 21.
Figure 21: Antivirus testing environment topology.
3.2.2. Performance Results
The evaluation of our system is divided into three execution models: (1) local (on-device) execution, (2) cloudlet offloading and (3) RAPID public cloud offloading. We measure the sustained throughput achieved by the antivirus's Java-based and CUDA-based signature-matching engines for each execution model. For the purpose of this analysis, we generate 100 automata containing signatures of malicious code snippets in binary and regular-expression format. Some of these signatures are obtained from the ClamAV [31] database, while others are hand-crafted, based on snippets of known malicious code and e-mail filters. The purpose of the custom signatures is to stress the matching engine using complex
regular expressions developed for this purpose. Moreover, we generate a set of 16,000 files, each one matching one of the signatures found in the automata set. We chose to perform the evaluation without the presence of executable malicious code or infected APKs for safety reasons. In each execution, we scan the entire file set against all the precompiled automata, measuring both the end-to-end sustainable throughput and the throughput achieved only by the virus-scanning engine. In this way, we are able to evaluate the performance gain provided to the matching engine by the utilization of RAPID offloading, as well as the performance observed by the user. In this analysis, we exclude the overhead introduced by reading the files into memory, and the end-to-end results depict the throughput of the network I/O and virus-scanning process. We choose to exclude the file-system I/O since it is highly related to the type of storage (internal or external memory card, as well as file-system type) and is measured to be the same in each experiment, regardless of the execution model.
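A minimal sketch of the measurement loop follows, assuming a hypothetical Scanner interface; files are pre-loaded into memory so that file-system I/O is excluded, as described above:

    import java.util.List;

    public class ThroughputBenchmark {
        // Hypothetical interface standing in for the virus-scanning engine.
        public interface Scanner {
            void scan(byte[] data);
        }

        // Returns the sustained scanning throughput in Mbps. For the
        // engine-only figure, only the scan() call is timed; for the
        // end-to-end figure, the network transfer would be included too.
        public static double measureMbps(List<byte[]> preloadedFiles, Scanner engine) {
            long totalBytes = 0;
            long start = System.nanoTime();
            for (byte[] file : preloadedFiles) {
                engine.scan(file);
                totalBytes += file.length;
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            return (totalBytes * 8 / 1e6) / seconds;  // megabits per second
        }
    }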
In order to draw the baseline for our evaluation, we begin by measuring the performance of the on-device execution, using the NVIDIA Shield K1 tablet. The experiment is performed twice, first using the Java implementation of the antivirus's scanning engine and then using the CUDA-based version, each time scanning the entire file set against all 100 automata. The results of this analysis are displayed in Figure 22. As we can see, the Java-based engine is able to achieve 2.9 Mbps of processing throughput, while the CUDA-based version achieves 26.9 Mbps. The CUDA-based implementation yields higher results due to the utilization of the GPU's highly parallel architecture. However, these throughput figures can only be achieved on a very limited number of mobile devices, namely those powered by NVIDIA GPUs.
Figure 22: Sustainable throughput achieved by the Java and CUDA implementation of the virus-scanning engine,
executed on the tablet’s CPU and GPU respectively.
We proceed with the evaluation, this time executing the antivirus configured to offload the file processing to the cloudlet, using RAPID. The offloading is performed in three different configurations. Firstly, we offload the Java-based engine using the RAPID Acceleration Server. Secondly, we offload the CUDA-based implementation using the RAPID GPU-Bridger, and finally we perform CUDA offloading using both the Acceleration Server and the GPU-Bridger. For each setup, we measure and report the end-to-end sustainable throughput as well as the throughput achieved only by the virus-scanning engine. The outcome of this experiment is depicted in Figure 23, with the bars indicating the sustainable throughput in each offloading configuration, the solid line indicating the throughput achieved by the tablet's CPU and the dashed line representing the throughput of the Kepler GPU found on the device. As we can see
in the figure, the Java-based engine is able to achieve 29.2 Mbps of scanning throughput, outperforming the on-device CPU by a factor of 10. Moreover, the GPU offloading achieves 161 Mbps, being 5.5 times faster than the Java offloading and 55.5 times faster than the Java-based implementation when executed on the NVIDIA tablet. These results indicate that RAPID offloading greatly benefits the execution of our system and resolves the execution bottleneck imposed by the low-end hardware found on mobile devices such as tablets and smartphones.
Figure 23: Sustainable throughput achieved by the virus-scanning engine in different offloading configurations using
the RAPID-enabled cloudlet.
While code offloading is proven to be beneficial for our system, it introduces a network I/O bottleneck, since the entire file set, including the automata, has to be transferred to the remote host. This is observed in the end-to-end performance results obtained by the experiment described above, displayed in Figure 24. As we can see, both the Java and CUDA code offloading achieve a maximum throughput of 24.5 Mbps due to the low bandwidth of the Wi-Fi connection. These results indicate that RAPID offloading still benefits the performance of our system, increasing its throughput by 8.4 times compared to the on-device CPU execution. However, in all cases, the performance gain achieved by the remote execution of the scanning engine is overshadowed by the low bandwidth of the network channel and can potentially increase with the availability of high-speed Wi-Fi channels.
Figure 24: End-to-end sustainable throughput achieved by the antivirus Android application in different offloading configurations using the RAPID-enabled cloudlet.
3.2.3. Antivirus in the OTC RAPID cloud
In the final part of our evaluation, we conduct the same experiments performed using the cloudlet, this time offloading the virus-scanning task to RAPID's public cloud, namely OTC. Firstly, we measure only the file-processing throughput, without taking into account the network I/O overhead. The result of this analysis is displayed in Figure 25. We notice that the Java-based engine, when offloaded to the remote Android VM using the AS, is able to perform virus scanning at 21.4 Mbps, achieving 7.3 times better throughput compared to the tablet's CPU. The CUDA-based implementation, when offloaded using the RAPID GPU-Bridger or a combination of the AS with the GPU-Bridger, yields 223 Mbps of file-processing throughput, outperforming the tablet's CPU by 76.8 times while also being 8.7 times faster than the integrated GPU. Moreover, we can see that the high-end GPU available on OTC is able to outperform the low-end GPU provided by our cloudlet by 1.3 times. These results indicate that RAPID-enabled public clouds providing access to high-end hardware can greatly benefit the execution time of our antivirus application.
Figure 25: Sustainable throughput achieved by the antivirus engine in different offloading configurations using OTC.
We conclude the evaluation of our mobile antivirus by measuring the end-to-end sustainable throughput achieved using OTC for task offloading. In this case, the network overhead introduced by the communication with the remote VMs is even higher than the bottleneck observed in the cloudlet infrastructure. As we can see in Figure 26, the end-to-end throughput is limited to 5.4 Mbps, due to the limitations imposed by the poor network communication. However, even considering this overhead, offloading the virus-scanning task to OTC improves the application's performance by 86%.
Figure 26: End-to-end sustainable throughput achieved by the antivirus Android application in different offloading
configurations using OTC.
3.2.4. Conclusions for the Antivirus use-case
Based on the results obtained during the evaluation of our RAPID-enabled mobile antivirus, we can draw the following conclusions.
1. RAPID enables the fast deployment of distributed systems able to offload both CPU and GPGPU code to remote hosts equipped with high-end hardware, with minimum effort.
2. Using RAPID, mobile devices equipped with low-end CPUs and lacking GPU support are able to utilise computationally demanding software developed for desktop and server hosts.
3. Using RAPID for GPGPU and Java code offloading accelerates the execution of complex code, improving the processing throughput by several times compared to on-device execution.
4. The end-to-end performance limitations observed during our evaluation are imposed by the poor network capabilities of mobile devices. The introduced network I/O overhead is not a result of RAPID's design and implementation but rather a hardware and technology limitation.
5. The deployment of fast wireless channels can greatly improve the observed performance achieved by RAPID offloading. We expect this benefit to be even greater on Android IoT devices equipped with Gigabit Ethernet connections.
3.3. BioSurveillance
The BioSurveillance use-case consists of a commercial face recognition application for video surveillance, ported to an NVIDIA Tegra platform [26] [32]. The algorithms work in real-time, with multiple faces simultaneously and under unconstrained conditions. Although the Tegra family achieves great computational performance for a low-power platform, its capabilities are often not enough, given the challenging requirements of video face recognition. Concretely, hard bottlenecks may appear due to the input video resolution, the number of faces concurrently analysed, and the size of the gallery database. Hence, offloading part of the computations from this device becomes critical for security, especially with crowded environments, large galleries, and 4K streams.
Figure 27. Typical pipeline of a face recognition application.
As described in D2.3 [33] and observed in Figure 27, offloading makes sense at three different stages of the face recognition pipeline: after video decoding, after face detection, or after template extraction. Offloading at early stages either requires an overwhelming amount of bandwidth (especially after video decoding) or implies a compromise on privacy aspects (anywhere before template extraction), so we decided to offload the template matching operation. This stage becomes especially critical for large databases or large template sizes (common for algorithms based on local visual features) [34].
In the presented pipeline, video decoding, face detection and template extraction are GPU-accelerated, whereas the remaining stages are performed on the CPU. The template matching stage is executed asynchronously, in order to hide latencies and fully utilize all the hardware of the board (CPU and GPU simultaneously).
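The asynchronous matching stage could be organised as in the following sketch; Gallery and match() are hypothetical names, not the product's actual API:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class AsyncTemplateMatcher {
        // Hypothetical gallery interface; match() may run locally or be
        // offloaded through the GPU-Bridger.
        public interface Gallery {
            float[] match(float[] template);
        }

        private final ExecutorService matcherThread = Executors.newSingleThreadExecutor();

        // Matching is dispatched asynchronously so the CPU stages can start
        // processing the next frame while the GPU match of the current
        // templates is still in flight, hiding the matching latency.
        public Future<float[]> matchAsync(float[] template, Gallery gallery) {
            return matcherThread.submit(() -> gallery.match(template));
        }
    }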
Given the sensitive nature of facial snapshots and databases, it is not feasible to replicate a private database of subjects among devices within a cloud. Moreover, real face recognition deployments require the gallery of enrolled subjects to be secured and centralized in a single location. Therefore, to mitigate such privacy and security concerns, this use-case focuses on the RAPID GPU-Bridger CUDA offloading component, instead of employing the complete RAPID infrastructure.
3.3.1. Environment
Figure 28. Evolution of architectures for this use-case. (a) Original implementation on TK1. (b) Extended TK1 version
with manual offloading through ZeroMQ (requires developing code on target accelerator). (c) Final version on TX1,
with transparent offloading by RAPID (no code required on the accelerator).
A fully functional Tegra K1 [26] prototype in C++/CUDA had been implemented specifically for RAPID during the first year of the project, scaled down from the original commercial product, as depicted in Figure 28a. After that, a client-server version was manually implemented to offload the template matching component to an accelerator device featuring a discrete GPU, as shown in Figure 28b. This is what developers without access to RAPID components would normally do. The concrete implementation was carried out using ZeroMQ, an open-source library for inter-process communication [35].
For the final version, we were forced to make a series of important core modifications. First, the Tegra K1 processor was discontinued by NVIDIA and replaced by the Tegra X1 system-on-chip [32], which features Cortex-A57 cores instead of A15, 256 CUDA cores instead of 192, and, most critical to us, only supports CUDA versions 7.0 and above. Thus, the prototype code had to be updated accordingly, to preserve future commercial viability. Fortunately, the latest version of the RAPID GPU-Bridger supports newer CUDA versions (certified up to 9.0), so the change of platform at the frontend was transparently taken care of by the GPU-Bridger. A final technical issue appeared due to the use of the ZeroMQ library by the BioSurveillance application. The inter-process communication of ZeroMQ collided with the internal sockets used by the GPU-Bridger, which caused the latter to stop working properly, an issue that took several weeks to detect and correct. To solve the problem, the client-server communication of the prototype had to be re-implemented using named pipes, thus restoring the offloading functionality. Since this issue was found, the RAPID GPU-Bridger developers have taken steps to minimize future compatibility risks with user applications using similar communication libraries.
The BioSurveillance use-case has been evaluated in four different environments:
1. Application running natively on Tegra X1, without any kind of offloading
2. Manually coded ZeroMQ client-server version, offloading from TX1 to a Linux x86 server
3. Automatic offloading using RAPID, locally on the Tegra X1
4. Automatic offloading using RAPID, from Tegra X1 to a Linux x86 server
Although the third case may seem impractical, it is useful for understanding the fixed overhead introduced simply by using RAPID: a manual, well-designed remote offloading (case 2) will only communicate the data of interest, i.e. templates, while transparent offloading with RAPID (case 4)
requires sending CUDA header functions and auxiliary data so that the whole operation can be carried out at the remote GPU. Hence, local RAPID offloading (case 3) helps us estimate how much performance is lost due to this extra communication overhead. The remote offloading is carried out towards two different discrete NVIDIA GPU cards: a GTX 760 [36] and a more powerful Titan Xp [37].
3.3.2. Performance Results
The evaluations in this section consider two main factors: latency and scalability. Given that the offloaded part is restricted to the template matching module, we evaluate scalability in terms of the number of faces simultaneously present in a frame and the size of the database.
Figure 29. Sustained frames per second (FPS) of the local and distributed BioSurveillance applications, with manual
and automatic (RAPID) offloading, depending on the number of faces simultaneously analysed.
Figure 29 provides a comparison of the sustained frame-rate of the application depending on the number of faces continuously present in front of the camera. The left plot corresponds to standard factory settings, whereas for the right one we raised the CPU and GPU clock frequencies to achieve maximum performance. For each scenario, we compare the four cases described previously. Remote offloading is carried out to a Linux x64 server, equipped with an NVIDIA Titan Xp GPU [37], within the same local network (average network ping: 2 ms). Each represented frame-rate is computed from the median of FPS samples measured over a one-minute window. For this baseline comparison, the database contained only 100 subjects.
As displayed in the figure, for zero faces the behaviour is identical in all cases, given that no template is extracted and no template matching is required. As the number of analysed faces increases, the manual remote offloading sometimes behaves similarly to, or even slightly better than, the stand-alone case, as it frees more resources on the low-power device than it consumes in network latency. Local offloading using RAPID behaves approximately like the stand-alone application, except for a tiny overhead due to the transfer of CUDA headers and auxiliary data. Remote RAPID offloading adds a noticeable but relatively small penalty to the sustained performance of the application. It is worth noting that after a certain number of faces, the difference between local and remote execution tends to level out, given that, whereas the local execution struggles more and more to allocate resources, the price paid for remote execution remains constant over time.
Figure 30. Automatic remote offloading with RAPID allows us to override hardware limitations. In this case, we can
overcome memory limitations and use larger subject databases simply by changing the GPU card of the accelerator.
Figure 30 is of great importance in order to understand one of the primary contributions of automatic offloading with RAPID. Although we pay a considerable price in terms of latency when using RAPID, it allows us to transparently override hardware limitations, by simply changing the IP address of the remote accelerator. In this example, the TX1 platform can only handle up to 10K database templates during local execution. However, by offloading to a remote GTX 760 GPU [36], we manage to handle larger databases, even though the memory of the accelerator card is lower than that of the Tegra X1 (2 GB for the GTX 760 vs 4 GB for the TX1). This is explained by the fact that the TX1 consumes a large amount of memory for face detection and template extraction, leaving little for database matching, whereas the GTX 760 is devoted exclusively to matching, resulting in 4+2 GB available for the complete pipeline. Likewise, RAPID allows us to deal with even larger databases by simply using a different accelerator. With a Titan Xp [37] we are able to handle more than 100K elements in the database, as required by certain border control projects, which until now needed powerful servers to be placed next to the camera sensors and databases to be replicated.
3.3.3. Batch offloading
We can characterize the individual latency paid by every single offloading of the template matching module. The latency that we empirically measure at the matching server (whose CUDA function calls are offloaded by the GPU-Bridger component within RAPID) is given by the following two terms:

\hat{L} = \left[ \#\text{faces} \times \frac{\#\text{CUDA calls}}{\text{face}} \times \text{Network latency} \right] + \left[ \text{Computation latency} \right]
We notice that we pay a fixed price in terms of network latency for setting up the CUDA calls required to match a single template against the database, and a variable price depending on the number of faces that have been found in each frame (each of which results in a corresponding template). Separately, the computational
latency depends on factors such as the number of templates enrolled in the database, but also on the particular hardware (GPU) where the matching is actually carried out.
It is evident that the price we pay in network latency is far larger than the computational one. A question that arises is whether there is any mechanism to hide these overwhelming network latencies. Towards this end, we propose to batch a number of templates into a single package of computation, thus effectively reducing the network latency per template by a factor of the batch size. As seen in Figure 31, this small additional development effectively increases the frame-rate performance, not only for the RAPID remote offloading case (where it improves frame-rate performance by 10-20%), but even for the stand-alone application, yielding a consistent 5% improvement independently of the size of the database.
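A sketch of the batching logic follows, under the assumption of a hypothetical matchBatch() remote call; the actual integration into the matching server differs in detail:

    import java.util.ArrayList;
    import java.util.List;

    public class BatchDispatcher {
        // Hypothetical remote matcher standing in for the offloaded CUDA path.
        public interface Matcher {
            float[][] matchBatch(float[][] templates);
        }

        private final List<float[]> pending = new ArrayList<>();
        private final int batchSize;

        public BatchDispatcher(int batchSize) {
            this.batchSize = batchSize;
        }

        // Templates are accumulated and sent as a single package, so the
        // fixed network cost of setting up the CUDA calls is paid once per
        // batch instead of once per template.
        public float[][] offer(float[] template, Matcher remote) {
            pending.add(template);
            if (pending.size() < batchSize) {
                return null;  // keep accumulating
            }
            float[][] batch = pending.toArray(new float[0][]);
            pending.clear();
            return remote.matchBatch(batch);
        }
    }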
Figure 31. Performance improvements after the implementation of batch offloading operations for RAPID. In this case, with batch=10, the frame-rate improves even compared to the stand-alone application running locally on the TX1.
Looking at it in more detail, we can estimate the latency of each CUDA call taken by RAPID's GPU-Bridger backend at the remote machine. For this example, we evaluate a database of 100 subjects and a single face in front of the camera. The list of CUDA calls either executed or offloaded by the matching server for each template can be seen in Table 3. The table shows two different scenarios: the original remote offloading without batching (i.e. batch=1, or one template sent at a time), and a batch of 10 templates. The average latency per template is indeed reduced by the proposed approach, which saves us between 15% and 20% of the effective latency incurred by the application.
Table 3. List of CUDA functions offloaded in a template matching operation, with their average computation latency. By applying batch-10 offloading, we can reduce the effective latency per template by 15-20%. All times in ms.

CUDA function            Batch = 1 (no batch)    Batch = 10, total    Batch = 10, per template
cudaMemset               2                       2                    0
cudaMemcpyAsync          135                     1186                 119
cudaMemset               2                       2                    0
cudaLaunch               1                       1                    0
cudaMemcpyAsync          2                       2                    0
cudaStreamSynchronize    1                       1                    0
Total elapsed time       143                     1194                 119 (-17%)
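The per-template saving follows directly from amortizing the dominant cudaMemcpyAsync transfer over the batch:

\frac{1194\ \text{ms}}{10\ \text{templates}} \approx 119\ \text{ms per template}, \qquad 1 - \frac{119}{143} \approx 17\%

which matches the reduction reported in the last row of the table.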
Finally, in Figure 32 we present an evaluation of the resulting frame-rate depending on the batch-size parameter. This experiment is carried out with a database of 10K subjects and remote offloading to the Titan Xp, progressively increasing the number of templates packaged in a computation batch. Adding a batch operation always seems to improve over the standard setting (batch=1), and the optimum in this case appears in the range between batch=10 and batch=100. We observed that large batches beyond 200 resulted in noisier frame-rate measurements, which required us to take many more samples to reach a consistent result.
Figure 32. Performance of the automatic remote offloading application with regard to the chosen batch-size. The
optimal batch-size in this case is somewhere between batch=10 and batch=100.
3.3.4. BioSurveillance in the OTC RAPID cloud
Finally, we have evaluated the performance of the BioSurveillance application when offloading the template matching operation to RAPID's public cloud. Due to the privacy constraints related to personal data access in face recognition applications, BioSurveillance would never be able to host sensitive information such as facial databases in a public cloud, so this is more of an academic exercise to understand the limitations of operating with servers at very distant locations. The average network ping of the RAPID OTC cloud was measured to be 59 ms at the time the tests were conducted.
Evaluating the application on the RAPID cloud is as simple as modifying the IP address of the accelerator in the configuration file. We compare the stand-alone application against the remote RAPID offloading setup, with databases ranging from 100 to 15K subjects, and for two different scenarios: non-crowded environments (a single face in front of the camera) and crowded environments (5 simultaneous faces). In order to mitigate the effects of the high latencies, the proposed batch offloading operation is performed, with batch sizes ranging from 1 (no batch, each template matching sent individually) up to 500. Larger batch sizes have not been tested, as the consequent delay in retrieving the results becomes too long for critical applications. A batch of 500 ensures that alarms can be raised in less than a second.
Figure 33. Performance comparison of BioSurveillance running stand-alone on the TX1 against remote offloading to RAPID's cloud, for different batch sizes, database sizes and numbers of faces.
Figure 33 shows the obtained results. As we anticipated, the stand-alone application cannot handle more than 10K database subjects without running out of memory, a no-go limitation for most commercial face recognition projects. We also observe that standard RAPID offloading (without batch packaging) cannot achieve the performance level of the stand-alone application, due to the very high latency affecting the communication with the cloud. Nevertheless, when increasing the batch size of the template matching operation, we found that remote offloading even improves on the overall performance of the stand-alone application for batch sizes over 150. Although large batch sizes delay the
reception of matching scores (as we have to wait until a batch package of computation is filled), this is traded off by an improved overall frame-rate of the application, thus avoiding dropping frames that could be potentially critical in terms of security. Moreover, we observe that the computational performance of the remote offloading becomes less dependent on the database size, which again tends to level off for large batch sizes.
Figure 34. Latencies for each process of the face recognition pipeline, when analyzing 5 faces simultaneously on the
RAPID cloud.
The counterpart of these performance results is presented in terms of latency in Figure 34. In this case, we have collected the frame-averaged latency for each stage of the face recognition pipeline: (1) video decoding, (2) face detection, (3) template extraction and (4) template matching (the only RAPID-offloaded operation). Face detection and template matching are carried out on the GPU, whereas the rest runs on the CPU. We have evaluated two database sizes, 100 and 10K subjects, with and without batch offloading, for batch sizes equal to 1, 50, and 500. As expected, the only process affected by the database size is the matching. It is also noticeable that analysing 5 simultaneous faces considerably increases the cost of template extraction (for a single face, the template extraction latency is practically 1/5 of the one shown here). An increase in database size has a strong impact on the latency of the matching, although not as much as the price paid for connecting to a distant server. Nonetheless, batch offloading drastically reduces the latency of template matching to a value that is negligible compared to the other pipeline operations, yielding the frame-rates presented in Figure 33 and making it possible to use remote servers in private clouds in real commercial applications.
3.3.5. Conclusions for the BioSurveillance use-case
The thorough evaluation procedures carried out yield a series of interesting conclusions for the BioSurveillance use-case:
1. RAPID is extremely useful for transparently offloading workloads to remote GPU devices, achieving almost the same performance as dedicated, manually coded distributed applications, while providing a series of remarkable advantages:
a. Rapidly prototyping distributed applications without actually having to develop code (code maintenance).
b. Running GPU code on machines without actual GPU support (code compatibility).
c. Transparently overriding device hardware limitations, such as the memory constraints in our case, which are critical for large database deployments. Hardware limitations are solved simply by using a more powerful accelerator, benefitting again in terms of code maintenance.
2. We propose a batch offloading approach that greatly mitigates the latency penalties of the remote RAPID offloading case, improving even over the stand-alone case when offloading both to private networks and to public clouds.
4. Conclusions and Future Performance Optimizations
In this section we discuss and propose future optimizations applicable to the RAPID framework and infrastructure, as well as optimizations at the use-case level. Regarding the QoS, several aspects can be optimized in future work. First, it would probably be faster to resize resources if the system were changed to work with containers instead of virtual machines. Secondly, the frequency at which the information is monitored by the host system currently bounds the rate at which the SLAM can take decisions about the VMs. Thus, increasing the frequency of the monitoring updates will make the SLAM run and take decisions faster. Finally, QoS could be further improved by also considering GPU-related aspects, and moving tasks from worse to better GPU devices accordingly.
The Hand Tracker, in its current algorithmic implementation [38] and due to the serial nature of operations between frames, is architecturally bound to always suffer from network-related latency issues when run in a distributed manner. These issues could be partly mitigated by advances in networking technology, but they could be completely overcome by switching the optimization pipeline from a generative regression-based optimizer to a discriminative classifier that does not have inter-frame dependencies. Hand pose estimation algorithms based on recent advances in Deep Convolutional Neural Networks, such as [39], would be ideal for acceleration using RAPID, and even hybrid classifier/regression solutions [40] could be much more suitable. Unfortunately, these state-of-the-art methods have only recently become available to the scientific community. The ability of Neural Network-based 2D Hand Trackers to work directly on RGB colour frames without any depth information would also remove the need for a special USB RGBD sensor, thus enabling the application to work on any kind of device with a camera, including smartphones, which would greatly benefit from the resources provided by RAPID.
The virus-scanning engine used by the mobile antivirus is highly demanding in computational resources. As we anticipated, offloading the processing-intensive task of pattern matching to more capable high-end resources, using RAPID, proved to be extremely beneficial for our system. However, the offloading process introduces a heavy network I/O bottleneck. This limitation is not introduced by RAPID and is observed in all offloading frameworks. We anticipate that when new network technologies, such as 5G networks, are implemented and widely deployed in the near future, our application will benefit from RAPID offloading even more. In the meantime, we plan to explore the usage of fast compression algorithms in order to mitigate the network bottleneck and exploit RAPID's capabilities to the maximum extent. We expect that the deployment of real-time compression algorithms, such as LZ4 [41], on the critical network I/O path of our application will allow us to increase the effective amount of data transmitted to the remote host under the current network capabilities. This will directly improve the end-to-end throughput achieved by the application and help us exploit the performance gain provided by RAPID.
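As a sketch of how LZ4 could be inserted into the I/O path, the following uses the open-source lz4-java library (net.jpountz API); the surrounding integration point is hypothetical:

    import net.jpountz.lz4.LZ4Compressor;
    import net.jpountz.lz4.LZ4Factory;
    import net.jpountz.lz4.LZ4FastDecompressor;

    public class Lz4Wire {
        private static final LZ4Factory FACTORY = LZ4Factory.fastestInstance();

        // Compress a scan buffer just before it enters the network I/O path.
        public static byte[] pack(byte[] data) {
            LZ4Compressor compressor = FACTORY.fastCompressor();
            return compressor.compress(data);
        }

        // Decompress on the remote side. LZ4 needs the original length,
        // which would be transmitted alongside the compressed payload.
        public static byte[] unpack(byte[] compressed, int originalLength) {
            LZ4FastDecompressor decompressor = FACTORY.fastDecompressor();
            return decompressor.decompress(compressed, originalLength);
        }
    }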
Regarding the BioSurveillance use-case, the fact that the template matching module is not the actual bottleneck of the application severely limits the maximum performance improvement achievable by RAPID, according to Amdahl's law [22]. On the other hand, we face challenging restraints in terms of data privacy. Biometric applications must most often meet strong privacy-by-design requirements that do not allow private data (e.g. video frames, facial snapshots, and databases) to be handled anywhere but on the device where it is processed or securely stored. Hence, a compromise between the aforementioned limitations and requirements would be to accelerate on the GPU part of the template extraction stage, which is currently done on the CPU, and to remotely offload only part of the template extraction GPU operations. Concretely, offloading some parts of template extraction, in which
the processed data is no longer identifiable or linked to personal data, would yield remarkable performance improvements, as the CUDA cores would be much more stressed (having to deal with almost all the stages of the pipeline), making remote CUDA offloading much more effective. This development would maintain all the current benefits of RAPID offloading (code maintenance, code compatibility and dissociation from hardware limitations), while considerably improving performance and still complying with privacy-by-design constraints.
References
[1] RAPID, “D7.2: First RAPID-based public service,” H2020-644312 RAPID Deliverable Report,
2017.
[2] “OpenStack,” [Online]. Available: www.openstack.org. [Accessed February 2018].
[3] “Tesla M60 GPU Accelerator,” [Online]. Available: http://www.nvidia.com/object/tesla-
m60.html. [Accessed February 2018].
[4] RAPID, “D4.2: Development of Dispatch/Fetch Engine,” H2020-644312 RAPID Deliverable
Report, 2016.
[5] RAPID, “D4.3: Development of Registration Process,” H2020-644312 RAPID Deliverable
Report, 2017.
[6] Huawei, “P9 Lite Smartphone,” [Online]. Available:
https://consumer.huawei.com/en/phones/p9-lite/. [Accessed February 2018].
[7] “ASUS ZenFone 2,” [Online]. Available:
https://www.asus.com/gr/Phone/ZenFone_2_ZE551ML/. [Accessed February 2018].
[8] “NVIDIA TITAN X Graphics Card for VR Gaming,” [Online]. Available:
https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal/. [Accessed February
2018].
[9] Lenovo, “ThinkPad T460p Enterprise Laptop,” Lenovo, [Online]. Available:
https://www3.lenovo.com/us/en/laptops/thinkpad/thinkpad-t-series/ThinkPad-
T460p/p/22TP2TT460P#tab-techspec. [Accessed February 2018].
[10] G. Mei, “Evaluating the power of GPU acceleration for IDW interpolation algorithm,” The
Scientific World Journal, 2014.
[11] RAPID, “D2.1: Application analysis and system requirements,” H2020-644312 RAPID
Deliverable Report, 2015.
[12] RAPID, “D2.2 Kinect Hand Tracking ported on RAPID,” H2020-644312 RAPID Deliverable
Report, 2016.
[13] “HandTrackerRAPID,” [Online]. Available:
https://github.com/RapidProjectH2020/HandTrackerRAPID. [Accessed January 2018].
[14] “Apache Maven Project,” [Online]. Available: https://maven.apache.org/. [Accessed February
2018].
[15] “Boost,” [Online]. Available: http://www.boost.org/. [Accessed February 2018].
[16] “OpenCV,” [Online]. Available: https://opencv.org/. [Accessed February 2018].
[17] “OpenNI,” [Online]. Available: https://github.com/OpenNI. [Accessed February 2018].
[18] “GeForce GTX 970,” [Online]. Available: https://www.geforce.com/hardware/desktop-
gpus/geforce-gtx-970. [Accessed February 2018].
[19] “Intel® Core™ i7-950 Processor,” [Online]. Available:
https://ark.intel.com/products/37150/Intel-Core-i7-950-Processor-8M-Cache-3_06-GHz-4_80-
GTs-Intel-QPI. [Accessed February 2018].
[20] “GeForce GTX 670M,” [Online]. Available: https://www.geforce.com/hardware/notebook-
gpus/geforce-gtx-670m. [Accessed February 2018].
[21] “Intel® Core™ i5-4210U Processor,” [Online]. Available:
https://ark.intel.com/products/81016/Intel-Core-i5-4210U-Processor-3M-Cache-up-to-2_70-
GHz. [Accessed February 2018].
[22] M. Hill and M. Marty, “Amdahl's Law in the Multicore Era,” Computer, vol. 41, no. 7, pp. 33-
38, 2008.
[23] “NVIDIA Shield K1 tablet,” [Online]. Available: https://www2.nvidia.com/en-us/shield/tablet.
[Accessed December 2017].
[24] “Kepler Architecture,” [Online]. Available: http://www.nvidia.com/object/nvidia-kepler.html.
[Accessed December 2017].
[25] “Tegra Mobile Devices,” [Online]. Available: http://www.nvidia.com/object/tegra-phones-
tablets.html. [Accessed December 2017].
[26] “NVIDIA Tegra® K1 processor,” [Online]. Available: http://www.nvidia.com/object/tegra-k1-
processor.html. [Accessed December 2017].
[27] “ARM Cortex-A15 CPU,” [Online]. Available:
https://developer.arm.com/products/processors/cortex-a/cortex-a15. [Accessed December 2017].
[28] “Official SHIELD Tablet K1 Software Upgrade 5.0,” [Online]. Available:
https://forums.geforce.com/default/topic/992729/shield-tablet/official-shield-tablet-k1-software-
upgrade-5-0-feedback-thread-released-02-09-17-/. [Accessed December 2017].
[29] “Intel® Core™ i7-6700 Processor,” [Online]. Available:
https://ark.intel.com/products/88196/Intel-Core-i7-6700-Processor-8M-Cache-up-to-4_00-GHz.
[Accessed December 2017].
[30] “NVIDIA GeForce GTX 980 GPU Specifications,” [Online]. Available:
https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980/specifications. [Accessed
December 2017].
[31] “ClamAV,” [Online]. Available: https://www.clamav.net/. [Accessed January 2018].
[32] “NVIDIA Tegra X1 Processor,” [Online]. Available: http://www.nvidia.com/object/tegra-x1-
processor.html. [Accessed January 2018].
[33] RAPID, “D2.3: BioSurveillance ported on RAPID,” H2020-644312 RAPID Deliverable Report,
2016.
[34] A. Nech and I. Kemelmacher-Shlizerman, “Level playing field for million scale face
recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[35] “ZeroMQ,” [Online]. Available: http://zeromq.org/. [Accessed January 2018].
[36] “NVIDIA GeForce GTX 760 GPU Specifications,” [Online]. Available:
https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-760/specifications. [Accessed
January 2018].
[37] “NVIDIA Titan Xp GPU Specifications,” [Online]. Available: https://www.nvidia.com/en-
us/titan/titan-xp/. [Accessed January 2018].
[38] I. Oikonomidis, N. Kyriazis and A. A. Argyros, “Efficient model-based 3D tracking of hand
articulations using Kinect,” in British Machine Vision Conference (BMVC 2011), BMVA, 2011.
[39] P. Panteleris, I. Oikonomidis and A. A. Argyros, “Using a single RGB frame for real time 3D hand pose estimation in the wild,” in IEEE Winter Conference on Applications of Computer Vision (WACV 2018), IEEE, 2018 (to appear). Also available on arXiv.
[40] A. Qammaz, D. Michel and A. A. Argyros, “A Hybrid Method for 3D Pose Estimation of
Personalized Human Body Models,” in IEEE Winter Conference on Applications of Computer
Vision (WACV 2018) (to appear), IEEE, 2018.
[41] “LZ4 Compression Algorithm,” [Online]. Available:
https://en.wikipedia.org/wiki/LZ4_(compression_algorithm). [Accessed February 2018].