Project No: 644312
D7.3 Evaluation of RAPID platforms
December 31, 2017
Abstract:
This deliverable presents the evaluation results obtained by the performance analysis of the RAPID offloading framework.
The evaluation is conducted in three parts: first, using simple benchmarks that explore various characteristics
of the framework; second, using RAPID’s three pilot applications, namely the 3D Hand Tracking application, the Android
antivirus and the BioSurveillance application, which have been modified appropriately so that they can take advantage of
the benefits provided by the RAPID offloading framework (the outcomes of these modifications are presented); and finally,
through the results obtained by the analysis of RAPID’s cloud and services.
Document Manager
Dimitris Deyannis FORTH
Document Id N°: rapid_D7.3 Version: 1.0 Date: 14/02/2018
Filename: rapid_D7.3_v1.0.docx
Confidentiality
This document contains proprietary material of certain RAPID contractors, and may not be reproduced, copied,
or disclosed without appropriate permission. The commercial use of any information contained in this
document may require a license from the proprietor of that information.
The RAPID Consortium consists of the following partners:
Participant no.  Participant organisation name                  Short name  Country
1                Foundation of Research and Technology Hellas   FORTH       Greece
2                Sapienza University of Rome                    UROME       Italy
3                Atos Spain S.A.                                ATOS        Spain
4                Queen's University Belfast                     QUB         United Kingdom
5                Herta Security S.L.                            HERTA       Spain
6                SingularLogic S.A.                             SILO        Greece
7                University of Naples "Parthenope"              UNP         Italy
The information in this document is provided “as is” and no guarantee or warranty is given that the information
is fit for any particular purpose. The user thereof uses the information at its sole risk and liability.
Revision history
Version  Author                      Notes                                                                     Date
0.1      Iakovos Mavroidis (FORTH)   Initial ToC.                                                              27/11/2017
0.1.1    Terpsi Velivassaki (SILO)   Updates on ToC.                                                           28/11/2017
0.2      Dimitris Deyannis (FORTH)   Input on various sections.                                                13/12/2017
0.3      Dimitris Deyannis (FORTH)   Input on Antivirus Environment.                                           14/12/2017
0.3.1    Terpsi Velivassaki (SILO)   Input on RAPID service Environment.                                       08/01/2018
0.3.2    Dimitris Deyannis (FORTH)   Input on Hand Tracking (just merging).                                    09/01/2018
0.3.3    Elena Garrido (ATOS)        Input on section 4.1, adding subsection SLAM in OTC.                      15/01/2018
0.3.4    Dimitris Deyannis (FORTH)   Text fixes.                                                               15/01/2018
0.3.5    Carles Fernández (HERTA)    Input on BioSurveillance sections.                                        17/01/2018
0.3.6    Carles Fernández (HERTA)    Added BioSurveillance evaluation results.                                 23/01/2018
0.4      Dimitris Deyannis (FORTH)   Integrating all input.                                                    25/01/2018
0.5      Dimitris Deyannis (FORTH)   More input on Antivirus.                                                  26/01/2018
0.5.1    Elena Garrido (ATOS)        New input about SLAM in the OTC platform.                                 26/01/2018
0.5.2    Carles Fernández (HERTA)    Added more evaluation results for 3.3.                                    26/01/2018
0.5.3    Sokol Kosta (UROME)         Input on section 2 about simple evaluation experiments on the OTC
                                     platform. Moving section 5 to 2.                                          29/01/2018
0.5.4    Elena Garrido (ATOS)        New input about SLAM in OTC.                                              30/01/2018
0.5.5    Cheol-Ho Hong (QUB)         Added more input to SLAM in OTC.                                          30/01/2018
0.5.6    Terpsi Velivassaki (SILO)   Updates on RAPID service Environment.                                     30/01/2018
0.5.7    Sokol Kosta (UROME)         More input on section 2 about simple evaluation experiments on the
                                     OTC platform.                                                             29/01/2018
0.6      Dimitris Deyannis (FORTH)   Integrating all input.                                                    01/02/2018
0.6.1    Sokol Kosta (UROME)         Fixed Dimitris’ comments. More input on section 2 about Linux
                                     evaluation experiments on the OTC platform.                               01/02/2018
0.6.2    Dimitris Deyannis (FORTH)   Integrating GVirtuS input (section 2).                                    01/02/2018
0.6.3    Dimitris Deyannis (FORTH)   Input for missing sections.                                               02/02/2018
0.6.4    Carles Fernández (HERTA)    Addressed comments and suggestions. Added input on BioSurveillance
                                     on the cloud (latencies).                                                 02/02/2018
0.6.5    Elena Garrido (ATOS)        Addressed QoS comments.                                                   06/02/2018
0.6.6    Dimitris Deyannis (FORTH)   Fixes throughout the text.                                                06/02/2018
0.6.7    Dimitris Deyannis (FORTH)   Finalising input.                                                         08/02/2018
0.6.8    Carles Fernández (HERTA)    Review of the deliverable.                                                09/02/2018
0.6.9    Sokol Kosta (UROME)         Handled Carles’ comments.                                                 09/02/2018
0.7.0    Dimitris Deyannis (FORTH)   Merging review input.                                                     12/02/2018
0.7.1    Dimitris Deyannis (FORTH)   Minor fixes.                                                              13/02/2018
0.7.2    Dimitris Deyannis (FORTH)   Corrections.                                                              14/02/2018
1.0      Dimitris Deyannis (FORTH)   Final Version.                                                            14/02/2018
Contents
1. Introduction
   1.1. Glossary of Acronyms
2. Evaluation of RAPID Service/Cloud
   2.1. Environment
      2.1.1. SLA in OTC
   2.2. Evaluation using simple benchmarks
      2.2.1. Android Experiments
      2.2.2. Linux Experiments
3. Evaluation of RAPID Applications
   3.1. 3D Hand Tracking
      3.1.1. Environment
      3.1.2. Performance Results
      3.1.3. Conclusions for the Hand Tracker use-case
   3.2. Antivirus
      3.2.1. Environment
      3.2.2. Performance Results
      3.2.3. Antivirus in the OTC RAPID cloud
      3.2.4. Conclusions for the Antivirus use-case
   3.3. BioSurveillance
      3.3.1. Environment
      3.3.2. Performance Results
      3.3.3. Batch offloading
      3.3.4. BioSurveillance in the OTC RAPID cloud
      3.3.5. Conclusions for the BioSurveillance use-case
4. Conclusions and Future Performance Optimizations
References
List of Figures
Figure 1: Deployment of the RAPID framework in OTC.
Figure 2: The VMs used for the RAPID evaluation as shown in the OTC dashboard.
Figure 3: Screenshot of the Android phone after performing some experiments.
Figure 4: 4-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 5: 5-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 6: 6-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 7: 7-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 8: 8-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 9: 8-Queens puzzle, Local vs. Parallel Remote execution performed on Huawei Android 7.0 phone and two Android-x86 4.4 VMs on OTC.
Figure 10: Screenshot of the Android VM log when executing CPU native (C/C++) Android code before the shared library was loaded.
Figure 11: Screenshot of the Android VM log when executing CPU native (C/C++) Android code after the shared library was loaded.
Figure 12: Matrix multiplication varying the problem size. Left: the GPU offloading is performed using a local dedicated machine. Right: the GPU offloading leverages the GPU-Bridger OTC virtual machine.
Figure 13: Linux RAPID demo client after executing the N-Queens puzzle with 4, 5, 6, 7, and 8 queens.
Figure 14: The CUDA-enabled IDW Algorithm executed on different flavours: (a) CPU, (b) on-board Titan X, (c) Titan X using GPU-Bridger instead of regular CUDA, (d) Tesla M60 on OTC using RAPID offloading.
Figure 15: Hand Tracker testing environment topology.
Figure 16: High-level view of the Hand Tracker I/O times in ideal conditions.
Figure 17: What happens when frame processing is delayed (A) vs what RAPID could facilitate (B).
Figure 18: Daisy-chaining to two machines could in principle double observed framerates, but in fact the tracking quality would be the same while consuming double resources.
Figure 19: Sustainable frame rate for various offloading configurations. The RGBD camera acquisition framerate is 30 fps.
Figure 20: Delay between local and remote execution.
Figure 21: Antivirus testing environment topology.
Figure 22: Sustainable throughput achieved by the Java and CUDA implementation of the virus-scanning engine, executed on the tablet’s CPU and GPU respectively.
Figure 23: Sustainable throughput achieved by the virus-scanning engine in different offloading configurations using the RAPID-enabled cloudlet.
Figure 24: End-to-end sustainable throughput achieved by the antivirus Android application in different offloading configurations using the RAPID-enabled cloudlet.
Figure 25: Sustainable throughput achieved by the antivirus engine in different offloading configurations using OTC.
Figure 26: End-to-end sustainable throughput achieved by the antivirus Android application in different offloading configurations using OTC.
Figure 27: Typical pipeline of a face recognition application.
Figure 28: Evolution of architectures for this use-case. (a) Original implementation on TK1. (b) Extended TK1 version with manual offloading through ZeroMQ (requires developing code on target accelerator). (c) Final version on TX1, with transparent offloading by RAPID (no code required on the accelerator).
Figure 29: Sustained frames per second (FPS) of the local and distributed BioSurveillance applications, with manual and automatic (RAPID) offloading, depending on the number of faces simultaneously analysed.
Figure 30: Automatic remote offloading with RAPID allows us to override hardware limitations. In this case, we can overcome memory limitations and use larger subject databases simply by changing the GPU card of the accelerator.
Figure 31: Performance improvements after the implementation of batch offloading operations for RAPID. In this case, with a batch=10 the frame-rate improves even compared to the stand-alone application running locally on TX1.
Figure 32: Performance of the automatic remote offloading application with regard to the chosen batch-size. The optimal batch-size in this case is somewhere between batch=10 and batch=100.
Figure 33: Performance comparison of BioSurveillance running stand-alone on TX1 against remote offloading to RAPID’s cloud, for different batch-sizes, database sizes and faces.
Figure 34: Latencies for each process of the face recognition pipeline, when analyzing 5 faces simultaneously on the RAPID cloud.
List of Tables
Table 1: The characteristics of the VMs listed in Figure 2.
Table 2: List of available flavours in OTC.
Table 3: List of CUDA functions offloaded in a template matching operation, with their average computation latency. By applying batch-10 offloading, we can reduce the effective latency per template by 15-20%.
Executive Summary
In this deliverable, we present a thorough evaluation of the RAPID offloading framework. This
evaluation covers the performance analysis of the framework and its components, as obtained by a series
of micro-benchmarks aiming to explore its characteristics. Moreover, we present the outcomes of the
evaluation of RAPID’s three pilot applications. These three applications, namely Hand Tracker, Mobile
Antivirus and BioSurveillance, benefit from the features provided by the offloading framework, both in
ease of offloading deployment and in increased performance. Since RAPID’s pilot applications
have characteristics found in multiple families of applications, such as real-time constraints, heavy I/O,
high throughput, potentially large memory consumption, or strict privacy requirements, they present
solid case studies for the evaluation of RAPID.
1. Introduction
In order to evaluate the performance aspects of the RAPID framework and infrastructure in detail, we
conduct a variety of micro-benchmarks. The purpose of these micro-benchmarks is to test the
infrastructure as a whole, as well as to explore the characteristics of each individual component.
Moreover, we evaluate the performance benefit provided to the three use-cases by RAPID, as well as the
ease of deployment of offloadable tasks. Each application utilises the offloading capabilities of RAPID
in a different fashion. The Hand Tracker application relies on the RAPID Acceleration Server for native
GPGPU code offloading while the BioSurveillance system achieves CUDA code offloading using the
GPU Bridger. The mobile antivirus application utilizes the Acceleration Server for CPU code
offloading, while performing GPGPU code offloading using the GPU Bridger or a combination of the
aforementioned components. Moreover, each pilot is deployed on a different platform and is able to
utilise the entire RAPID infrastructure, including components such as the SLAM and the DS. The Hand
Tracker application is developed on Linux-based laptop and desktop hosts, the BioSurveillance system
targets low-power Tegra devices, and the antivirus use case is developed as an Android APK able to execute on a plethora of
mobile devices. The successful offloading performed by these use-cases, presented in this deliverable,
indicates that RAPID’s solid implementation enables CPU and GPGPU code offloading regardless of
the hardware and software platform.
The remainder of this deliverable is structured as follows. Section 2 presents the evaluation results of
the RAPID public cloud acceleration service obtained through a set of simple micro-benchmarks.
Section 3 provides a thorough analysis of RAPID’s pilot applications in terms of performance and ease
of deployment. Finally, in Section 4, we discuss further performance optimizations that could be applied
to the framework and its infrastructure, as well as the types of applications that can benefit from
RAPID.
1.1. Glossary of Acronyms
Acronym Definition
APK Android Application Package
API Application Programming Interface
AS Acceleration Server
CO Confidential
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
D Deliverable
DFE Dispatch/Fetch Engine
DMP Data Management Plan
DoA Description of the Action
DS Directory Server
DSE Design Space Explorer
DT Deutsche Telekom
EC European Commission
EU European Union
FPS Frames Per Second
GA Grant Agreement
GPGPU General-Purpose computing on Graphics Processing Units
GPU Graphics Processing Unit
I/O Input / Output
JAR Java ARchive
JVM Java Virtual Machine
OS Operating System
OTA Over The Air
OTC Open Telekom Cloud
PU Public
QoS Quality of Service
REST REpresentational State Transfer
RGB Red Green Blue
RGBD Red Green Blue & Depth
RM Registration Manager
SLA Service Level Agreement
SLAM Service Level Agreement Manager
SVN Subversion
URI Uniform Resource Identifier
VM Virtual Machine
VMM Virtual Machine Manager
WP Work Package
2. Evaluation of RAPID Service/Cloud
2.1. Environment
The evaluation of the RAPID framework against the three use case applications has been conducted
using the public RAPID service, deployed on the Open Telekom Cloud (OTC) provided by Deutsche
Telekom (DT). As presented in the RAPID deliverable D7.2 [1], OTC is a European public cloud
offering based on OpenStack [2], which provides Virtual Machine (VM) instance options with NVIDIA
M60 [3] Graphics Processing Units (GPUs), while effectively covering all RAPID requirements.
The RAPID framework has been deployed on OTC, following the deployment diagram presented in
Figure 1.
Figure 1: Deployment of the RAPID framework in OTC.
As shown in the figure, the following VMs are used:
- One “Infrastructure VM”, hosting the DS, VMM and SLAM components
- One “GPU Bridger VM”, hosting the GPU-Bridger backend
- AV “Antivirus VMs”, where AV is the number of VMs hosting AS components for the Antivirus application, i.e. DSE, RM, DFE and the GPU-Bridger frontend
- KH “Kinect Hand-tracking VMs”, where KH is the number of VMs hosting AS components for the Kinect Hand-tracking application, i.e. DSE, RM and DFE
- BS “BioSurveillance VMs”, where BS is the number of VMs hosting the GPU-Bridger frontend component, which communicates with the GPU-Bridger backend component of the RAPID framework
It has to be noted that the VMs used for the use-case applications are automatically created by the RAPID
platform as Acceleration Server VMs, or even as helper VMs in case of task forwarding or parallelization.
Thus, the exact type and number of such VMs varies over time, based on the needs of the served
applications.
For evaluation purposes, several VM instances have been used, varying in number and
characteristics, according to the application under test. An indicative illustration of such a list is depicted
in Figure 2, as listed in the OTC dashboard.
Figure 2: The VMs used for the RAPID evaluation as shown in the OTC dashboard.
The specifications of the VMs listed in Figure 2 are the following:
Table 1: The characteristics of the VMs listed in Figure 2.
Instance Name                    Family         Type        vCPUs  RAM    Disk   GPU
RAPID-AndroidNormalVM-userId-4   Computing I    c1.medium   1      1 GB   10 GB  -
RAPID-AndroidHelperVM-2          Computing I    c1.medium   1      1 GB   10 GB  -
RAPID-AndroidHelperVM-1          Computing I    c1.medium   1      1 GB   10 GB  -
Infrastructure_VM                Computing I    c2.large    2      4 GB   12 GB  -
android44                        Computing I    c1.medium   1      1 GB   10 GB  -
GPU_Bridger                      GPU-optimized  g2.2xlarge  8      64 GB  40 GB  NVIDIA M60 x 1
2.1.1. SLA in OTC
In D7.1 we presented the final architecture and deployment of the RAPID system. Some minor
functional changes had to be made in the VMM component in order for it to work correctly in the OTC
instance. Below we list the changes we had to make in order to have QoS running in the new cloud
environment:
- The original concept of the SLAM was to duplicate a specific resource of a VM when a violation
occurred; for example, when a machine was running out of memory, we resized the VM, doubling its
memory until reaching the maximum available. Within OTC, in order to reduce the cost of the
experiment, there is a limitation on the number of machine configurations we could define. One of the
parameters, the disk size, is fixed (up to 32 TB) and cannot be changed within OTC. The other two
parameters we could change are the number of vCPUs and the size of the RAM. The OTC VM flavours
that are closest in characteristics to the original RAPID design, and can thus be easily used, are listed
in Table 2. Although this change is necessary for the SLA functionality as realized in the RAPID
project, it is confined to the VMM component, which has been modified to take into account the
different types of flavours available. No change has been made within the SLAM component: for this
part of the functionality, the SLAM works exactly as before, in the same way as in the RAPID Private
Cloud, and the change is completely transparent. When the SLAM detects a violation, it asks the VMM
to update the machine by duplicating its resources; the VMM then requests an update to the best
matching flavour, the underlying OpenStack finds that flavour, and the VM is updated (a sketch of this
matching logic is given after Table 2).
Table 2: List of available flavours in OTC
Flavour Name  CPU cores  RAM (MB)  Disk
c1.medium     1          1024      up to 32 TB
c2.medium     1          2048      up to 32 TB
c1.large      2          2048      up to 32 TB
c2.large      2          4096      up to 32 TB
c1.xlarge     4          4096      up to 32 TB
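The flavour-matching step can be illustrated with the following minimal Java sketch. This is not the actual VMM code; all class and method names are hypothetical, and only the vCPU and RAM dimensions are modelled, since the disk size is fixed in OTC.

import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Minimal sketch (not the actual VMM code) of how a "duplicate the violated
// resource" request can be mapped onto the discrete OTC flavours of Table 2.
public class FlavourMatcher {

    static class Flavour {
        final String name; final int vcpus; final int ramMb;
        Flavour(String name, int vcpus, int ramMb) {
            this.name = name; this.vcpus = vcpus; this.ramMb = ramMb;
        }
    }

    // The OTC flavours from Table 2 (disk is fixed, so it is not modelled).
    static final List<Flavour> FLAVOURS = Arrays.asList(
        new Flavour("c1.medium", 1, 1024),
        new Flavour("c2.medium", 1, 2048),
        new Flavour("c1.large",  2, 2048),
        new Flavour("c2.large",  2, 4096),
        new Flavour("c1.xlarge", 4, 4096));

    // On an SLA violation, double the violated resource and pick the
    // smallest flavour that satisfies both requirements, if any exists.
    static Optional<Flavour> bestMatch(Flavour current, boolean ramViolation) {
        int wantedVcpus = ramViolation ? current.vcpus : current.vcpus * 2;
        int wantedRam   = ramViolation ? current.ramMb * 2 : current.ramMb;
        return FLAVOURS.stream()
            .filter(f -> f.vcpus >= wantedVcpus && f.ramMb >= wantedRam)
            .min(Comparator.comparingInt((Flavour f) -> f.vcpus)
                           .thenComparingInt(f -> f.ramMb));
    }

    public static void main(String[] args) {
        // Example: a c1.medium VM running out of memory is resized to c2.medium.
        bestMatch(FLAVOURS.get(0), true)
            .ifPresent(f -> System.out.println("Resize to: " + f.name));
    }
}

Illustrative sketch of the flavour-matching logic applied on an SLA violation.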
- Installing the RAPID components in OTC implied code changes in the VMM in other ways too. The
RAPID Private Cloud is based on OpenStack, and therefore data is retrieved using the telemetry
Application Programming Interface (API). However, despite the fact that OTC is also based on
OpenStack, it provides its own monitoring API. To retrieve the monitoring data, such as the amount of
CPU or memory being used, we had to re-implement the way the metrics are retrieved from the system
in order to adapt to OTC, since OTC’s monitoring API is based on REpresentational State Transfer
(REST). The REST call must include an auth-token issued by OTC itself. Below we present an example
of retrieving the CPU information. The Uniform Resource Identifier (URI) format is "GET /v1.0/{project
id}/metric-data". The descriptions of the other important parameters are as follows: metric_name
specifies the metric name, e.g. “cpu_util” or “mem_util” in our case. The dim.0 parameter specifies
the instance ID of the monitored VM. The from parameter denotes the start time of the query, formatted
as a UNIX timestamp in milliseconds, and the to parameter indicates the end time of the query. The
period parameter specifies the monitoring interval in seconds. Finally, the filter parameter indicates the
data aggregation mode and can be average, variance, min, or max.
curl -X GET \
'https://ces.eu-de.otc.t-systems.com/V1.0/fdb52efe56ed44f79c7538fb6bbf3209/metric-data?namespace=SYS.ECS&metric_name=cpu_util&dim.0=instance_id,ed1231e8-11ab-4976-9986-427681887ab2&from=1516762106162&to=1516793106162&period=1200&filter=average' \
-H 'X-Auth-Token:
MIIFBAYJKoZIhvcNAQcCoIIE9TCCBPECAQExDTALBglghkgBZQMEAgEwggLSBgkqhkiG9w0BBwGgggLDBIICv3sidG9r
ZW4iOnsiZXhwaXJlc19hdCI6IjIwMTgtMDEtMjVUMTA6MzA6MjMuMzY2MDAwWiIsIm1ldGhvZHMiOlsicGFzc3dvcmQi
XSwiY2F0YWxvZyI6W10sInJvbGVzIjpbeyJuYW1lIjoidGVfYWRtaW4iLCJpZCI6IjY5OWJkNjJjZGEzMDRkMmNhZDAz
ZmQyZmIxOTBiOGNmIn0seyJuYW1lIjoib3BfZ2F0ZWRfY2NlX3N3aXRjaCIsImlkIjoiMCJ9XSwicHJvamVjdCI6eyJk
b21haW4iOnsieGRvbWFpbl90eXBlIjoiVFNJIiwibmFtZSI6Ik9UQy1FVS1ERS0wMDAwMDAwMDAwMTAwMDAyNTE4OSIs
ImlkIjoiOGUzMTZlNTdiNzM0NGFmNmI2ZmIzZmYwYzIzZWI3ZmMiLCJ4ZG9tYWluX2lkIjoiMDAwMDAwMDAwMDEwMDAw
MjUxODkifSwibmFtZSI6ImV1LWRlIiwiaWQiOiJmZGI1MmVmZTU2ZWQ0NGY3OWM3NTM4ZmI2YmJmMzIwOSJ9LCJpc3N1
ZWRfYXQiOiIyMDE4LTAxLTI0VDEwOjMwOjIzLjM2NjAwMFoiLCJ1c2VyIjp7ImRvbWFpbiI6eyJ4ZG9tYWluX3R5cGUi
OiJUU0kiLCJuYW1lIjoiT1RDLUVVLURFLTAwMDAwMDAwMDAxMDAwMDI1MTg5IiwiaWQiOiI4ZTMxNmU1N2I3MzQ0YWY2
YjZmYjNmZjBjMjNlYjdmYyIsInhkb21haW5faWQiOiIwMDAwMDAwMDAwMTAwMDAyNTE4OSJ9LCJuYW1lIjoiMTQ5NjAx
MDMgT1RDLUVVLURFLTAwMDAwMDAwMDAxMDAwMDI1MTg5IiwiaWQiOiIwYzU4OGVlZTI1NGY0ZmNmYjU5Zjg4NWZhZjE1
ZGQxOSJ9fX0xggIFMIICAQIBATBcMFcxCzAJBgNVBAYTAlVTMQ4wDAYDVQQIDAVVbnNldDEOMAwGA1UEBwwFVW5zZXQx
DjAMBgNVBAoMBVVuc2V0MRgwFgYDVQQDDA93d3cuZXhhbXBsZS5jb20CAQEwCwYJYIZIAWUDBAIBMA0GCSqGSIb3DQEB
AQUABIIBgDWxRNNuDBudhpV3C9kqhxDi7h4hIygrNWW3t4uqwjqDV6HGfEMets4+cJ+tbf9Tvcdnkf02qK06BunUMLHt
oKRp4cCwCHi3RHpQ0wzMvPkhMFimlhZCCKXeQn0k90ZaZtO8qrk10kficFEzfCaZTcv6+IZEzU8uh5ufoHnoRhWZ2fAW
9fPwhigSTPvZyZZmHWStW6aWuHbSL0VkQyRV9vURB65vfciPwGmfJCVQYjVWH7HyamO5Ds4rTNvp2MCrNbfEWUQ1Wihl
YnGVnPoHmsTjok2thVMyEsvBD0G2pJU9JdzyU3rrKS2KZ50WGSq1ufTfY5iXBQ1lS3rwFDBLRaF9lcJEFOYa9pDQKk4B
f7LjYgP+9ESwfthijDnS--2DsBUDDQNbtq0Qq7Rf-kqLKG2yW+ihBO2rdtd2xcMzU9XrhHVIXf5-
BT1owItp6EgNPpDbHtLxzjxFfnSPnIhB7+VSWTF1Vj5UsZoBDf56BSbuyr7Il2eYySiiN4D7Pozojg=='
Example curl command making the RESTful call to retrieve the cpu_util monitoring variable.
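The same query can also be issued programmatically. The following minimal Java sketch mirrors the curl example above; it is illustrative only (not code from the RAPID VMM), and the project ID, instance ID and token are placeholders that must be obtained from OTC.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Scanner;

// Illustrative sketch of querying the OTC monitoring API for cpu_util.
public class OtcMetricsClient {
    public static void main(String[] args) throws IOException {
        String projectId  = "<project-id>";   // placeholder
        String instanceId = "<instance-id>";  // placeholder
        String authToken  = "<auth-token>";   // placeholder

        long to   = System.currentTimeMillis();
        long from = to - 60 * 60 * 1000;      // last hour, in milliseconds

        String uri = "https://ces.eu-de.otc.t-systems.com/V1.0/" + projectId
                + "/metric-data?namespace=SYS.ECS&metric_name=cpu_util"
                + "&dim.0=instance_id," + instanceId
                + "&from=" + from + "&to=" + to
                + "&period=1200&filter=average";

        HttpURLConnection conn = (HttpURLConnection) new URL(uri).openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("X-Auth-Token", authToken);

        // The response is a JSON document containing the aggregated datapoints.
        try (Scanner s = new Scanner(conn.getInputStream(), "UTF-8")) {
            s.useDelimiter("\\A");
            System.out.println(s.hasNext() ? s.next() : "");
        }
    }
}

Illustrative Java equivalent of the curl call above.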
The retrieval of the memory utilization monitoring information is very similar to the retrieval
of the CPU utilization information, but a change had to be applied to the VM whose memory is to be
monitored: in OTC, a package called uvp-monitor must be installed inside the virtual machine.
- Another difference in OTC that affects RAPID’s QoS solution is the fact that OTC provides
updates of the metrics only every 4 minutes. This needs to be taken into account when resizing a
virtual machine. For example, doubling the memory of a VM and then reading the memory
monitoring value from that VM before the memory metrics were updated would cause OTC to
falsely generate a violation error and request a new resize. This issue directly affected the SLAM
component, and was solved by requesting the metric information with a sampling rate of 4
minutes.
After implementing all these changes, we were able to successfully run the QoS RAPID solution in
OTC. Nevertheless, it is considerably more limited than the RAPID Private Cloud solution. For instance,
another issue affecting the QoS is that OTC does not allow issuing a resize from an Android VM, but
only from Linux machines.
To sum up, several small changes had to be made in the RAPID components in order to test the QoS
solution. The RAPID solution within OTC is slower than in the RAPID Private Cloud, mainly because
OTC requires more time to create new VMs and has a longer metric update period. Unfortunately,
these characteristics are out of our control.
2.2. Evaluation using simple benchmarks
In this section, we describe the deployment of the RAPID infrastructure on the OTC public cloud. We
perform some simple evaluation and validation tests in order to confirm that the integration of our
platform within a commercial public cloud works as well as the previous deployment in the RAPID
Private Cloud¹. We run the RAPID demo application in both Android and Linux VMs, as described in
RAPID deliverable D4.2 [4], where we tested the offloading features on the RAPID Private Cloud.
First we test the registration process, which involves several components, i.e. the Directory Server (DS),
the Service Level Agreement Manager (SLAM), and the Virtual Machine Manager (VMM). The
registration is performed correctly and the VM is allocated on the client device. However, compared to
the RAPID Private Cloud, we notice that the creation time of a VM on the OTC is around four times
longer, reaching up to 2 minutes (see RAPID D4.3 [5] for more information about the registration
process in the RAPID Private Cloud).
Then, we perform experiments of Central Processing Unit (CPU) task offloading, CPU task parallel
execution, CPU native code offloading, and General-Purpose computing on Graphics Processing Units
(GPGPU) CUDA code offloading.
2.2.1. Android Experiments
The Android device used for the experiments is a Huawei P9 Lite smartphone [6], equipped with an
Octa-core (4x2.0 GHz Cortex-A53 & 4x1.7 GHz Cortex-A53) CPU and 3 GB of RAM, running Android
7.0. The phone was physically located in Copenhagen, Denmark. In Figure 3 we show the RAPID demo
application running on the Android device after having performed some experiments. The Android VM
is based on Android-x86 4.4 and is configured with 1 OTC vCPU and 1 GB of RAM. The network
connection between the phone and the cloud was a normal household commercial Wi-Fi channel with
the following characteristics, as reported by the RAPID Network Profiler:
- Latency (RTT): around 80 ms
- Upload rate: around 3 Mb/s
- Download rate: around 8 Mb/s
¹ The term “RAPID Private Cloud” refers to the RAPID acceleration service deployed on a private OpenStack
installation, on SILO premises, within Task 6.2 “Cloud Infrastructure Software”.
Figure 3: Screenshot of the Android phone after performing some experiments.
Figure 4 to Figure 8 display the results of the CPU offloading tests, i.e. the N-Queens puzzle, on the
Android Operating System (OS) and when offloaded on the VM provided by the OTC. As we can see
in the figures, offloading is beneficial when the problem becomes computationally complex enough,
which in this case happens when the number of queens is equal to or greater than 6. Furthermore, we test
the RAPID parallelization support, by running the N-Queens puzzle with 8 queens, using multiple VMs.
The test was completed successfully, and the results in Figure 9 show that parallelizing the execution
with 2 VMs improves the execution time even further.
Figure 4: 4-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 5: 5-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 6: 6-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 7: 7-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 8: 8-Queens puzzle, Local vs. Remote execution performed on Huawei Android 7.0 phone and Android-x86 4.4 VM on OTC.
Figure 9: 8-Queens puzzle, Local vs. Parallel Remote execution performed on Huawei Android 7.0 phone and two Android-x86 4.4 VMs on OTC.
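To give a feeling for how such a CPU task is structured for offloading, the following Java sketch shows an N-Queens solver written as a RAPID task. The Remoteable base class and the DFE entry point follow the pattern described in D4.2 for the demo application, but the exact names and signatures used here are assumptions, not the actual RAPID API.

import java.lang.reflect.Method;

// Illustrative sketch of an offloadable CPU task; Remoteable and DFE are
// assumed framework classes, and their signatures here are hypothetical.
public class NQueensTask extends Remoteable {

    // Counts the solutions of the N-Queens puzzle with a classic bitmask
    // backtracking solver. This is the method RAPID may offload.
    public int solve(int n) {
        return count(n, 0, 0L, 0L, 0L);
    }

    private int count(int n, int row, long cols, long d1, long d2) {
        if (row == n) return 1;
        int total = 0;
        for (int col = 0; col < n; col++) {
            long c = 1L << col;                  // column already attacked?
            long a = 1L << (row + col);          // "/" diagonal attacked?
            long b = 1L << (row - col + n);      // "\" diagonal attacked?
            if ((cols & c) == 0 && (d1 & a) == 0 && (d2 & b) == 0) {
                total += count(n, row + 1, cols | c, d1 | a, d2 | b);
            }
        }
        return total;
    }

    public static void main(String[] args) throws Exception {
        DFE dfe = new DFE();   // registers with the DS/SLAM and acquires a VM
        NQueensTask task = new NQueensTask();
        Method m = NQueensTask.class.getMethod("solve", int.class);
        // The DFE profiles local and remote execution and picks the faster one.
        int solutions = (int) dfe.execute(m, new Object[]{8}, task);
        System.out.println("8-Queens solutions: " + solutions);
    }
}

Illustrative structure of an offloadable N-Queens task (hypothetical API names).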
We also perform a test of native code offloading, showing that it is possible to offload native C/C++
code embedded in Android methods to the Android VM on the OTC. The output of this experiment can
be seen in the Android screenshot in Figure 3, where we notice that the local execution is much faster
than the offloaded execution. This is expected, given that the native method implemented in the RAPID
demo application is very simple, and its purpose is just to test that the offloading works correctly. In
Figure 10, we show the log of the AS running on the Android VM when receiving the native code for
execution. From the log, we can see that the first time the AS tries to run the native method, the
execution fails, since the method cannot be found in the currently loaded libraries. The AS then loads
the shared libraries that were embedded with the application, finds the implementation of the native
method, and performs the execution. When the same method is offloaded again to the VM, the library
is already loaded, so the execution is performed immediately, without wasting time on the library
loading process, as shown in the log in Figure 11.
Figure 10: Screenshot of the Android VM log when executing CPU native (C/C++) Android code before the shared library was loaded.
Figure 11: Screenshot of the Android VM log when executing CPU native (C/C++) Android code after the shared library was loaded.
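This retry-after-loading behaviour can be summarized by the following sketch. It is illustrative only and not the actual AS code; the class and method names are hypothetical, while the exception handling reflects how Java reflection reports a missing native implementation.

import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Illustrative sketch of the lazy library loading described above.
public class NativeMethodRunner {

    public Object run(Object task, Method method, Object[] args,
                      Iterable<String> embeddedLibraries) throws Exception {
        try {
            // First attempt: the native symbol may already be resolvable.
            return method.invoke(task, args);
        } catch (InvocationTargetException e) {
            if (!(e.getCause() instanceof UnsatisfiedLinkError)) throw e;
            // The method was not found in the currently loaded libraries:
            // load the shared libraries shipped with the application and retry.
            for (String libPath : embeddedLibraries) {
                System.load(libPath);
            }
            return method.invoke(task, args);
        }
    }
}

Illustrative sketch of lazy loading of application-embedded shared libraries on the AS.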
Figure 12: Matrix multiplication varying the problem size. Left: the GPU offloading is performed using a local dedicated machine. Right: the GPU offloading leverages the GPU-Bridger OTC virtual machine.
Finally, we perform a test that offloads GPGPU CUDA code, which proves that GPU code offloading
is i) feasible and ii) convenient under the right circumstances. Choosing a performance testing suite
accepted by the community is not possible, due to the lack of such a suite, given that GPU code
offloading for Android devices is a novel approach. As the evaluation tool, we chose one of
NVIDIA's CUDA SDK 9.0 samples, included with the standard CUDA Toolkit: Matrix
Multiplication. The choice is motivated by its clarity in illustrating various CUDA programming
principles, which makes it easy to clearly present the modifications needed to make it work with
RAPID Android GPU code offloading.
We implement a regular Android application embedding the RAPID GPU-Bridger for the Java/Android
framework, with the x86_64-compiled CUDA kernel as an application resource. We perform the tests
using an ASUS ZenFone 2 ZE551ML [7] equipped with 4 GB of RAM, connected to the academic Wi-Fi
roaming infrastructure Eduroam. We consider this a very common use case, characterized by a good but
not dedicated Wi-Fi connection and a mid-range, not latest-generation, mobile handset.
The test consists of matrix multiplication with an increasing problem size:
Figure 12 displays the results of the same test suite performed in two RAPID GPU Acceleration Server
configurations:
- In Figure 12 (left), the virtual GPU is hosted by a local dedicated server equipped with two NVIDIA Titan X [8] CUDA-enabled devices.
- In Figure 12 (right), the virtual GPU is hosted by the GPU-Bridger virtual machine instance on the OTC, providing one NVIDIA Tesla M60 [3].
Both the local and the OTC machines use exactly the same GPU-Bridger backend software, the same
NVIDIA drivers and the same NVIDIA CUDA Toolkit. As the figures show, from a performance point
of view, offloading the GPU code to a remote virtual machine equipped with a high-end NVIDIA CUDA-enabled
GPU device, as on the OTC, is more beneficial than offloading to a local server with a
lower-end device (green curves).
2.2.2. Linux Experiments
We perform the same experiments described in the previous section using a Linux device, a
Lenovo ThinkPad T460p laptop [9] equipped with a quad-core Intel Core i5-6440HQ (6th Gen) CPU
@ 2.6 GHz and 4 GB of RAM, running Ubuntu 16.04 LTS. The VM on the OTC platform runs Ubuntu
16.04 and is equipped with 1 vCPU and 1 GB of RAM.
Figure 13 presents the output of the N-Queens puzzle with 4 to 8 queens and of the C/C++ native code
offloading. As the results show, offloading is successfully performed in both experiments, even though
it was not beneficial in any of the cases. Indeed, this outcome was to be expected, given that the Linux
laptop is more powerful than the Linux VM running on the OTC platform. However, the purpose of
these experiments was to demonstrate that RAPID can run correctly on the OTC commercial public
cloud.
Figure 13: Linux RAPID demo client after executing the N-Queens puzzle with 4, 5, 6, 7, and 8 queens and the C/C++ native method, which only prints “Hello World”.
In order to perform a test of GPU code offloading in Linux, we used CUDA-enabled IDW interpolation
software [10], which is not specifically designed for testing but for real use, as it belongs to a software
suite developed for bathymetry interpolation. IDW is a deterministic method for spatial interpolation,
based on the principle that nearby points have similar values.
We considered a fixed number of 500,000 query locations (points where the value is unknown) and a
varying number of known values: 100, 1,000, 10,000 and 100,000.
The CUDA-enabled GPU device has been provisioned as follows:
- CPU: no CUDA-enabled GPU; only the CPU version of the algorithm is used
- TITAN X: an NVIDIA Titan X [8] physically connected to the machine used for testing. In this scenario we use the regular CUDA libraries, with no involvement of any RAPID GPU offloading component
- TITAN X GPU-Bridger: the same device as in the previous scenario, but accessed through the RAPID GPU-Bridger
- TESLA M60 OTC: the GPU code is offloaded to a remote virtual machine on the Open Telekom Cloud using the RAPID GPU-Bridger
Figure 14: The CUDA-enabled IDW Algorithm executed on different flavours: (a) CPU, (b) on-board Titan X, (c) Titan X using GPU-Bridger instead of regular CUDA, (d) Tesla M60 on OTC using RAPID offloading.
Figure 14 shows the results of the performed experiments. As expected, the CPU underperforms when
compared to any GPU flavour. The comparison between the TITAN X and TITAN X GPU-Bridger
cases is useful to demonstrate the minimal footprint of the GPU offloading framework developed in RAPID.
The crossing between the TESLA M60 GPU-Bridger (OTC) line and the local TITAN X lines (regular and
virtualized) is interesting because it marks a break-even point dividing the problem sizes into two sets: for
fewer than 10,000 known values, GPU offloading is not convenient; for more than 10,000 known values,
offloading is the best solution. As a final remark on this experiment: we never recompiled or changed the
source code of the CUDA-enabled IDW interpolation algorithm. The GPU offloading is thus
completely transparent.
3. Evaluation of RAPID Applications
3.1. 3D Hand Tracking
Thanks to RAPID’s native C/C++ code offloading support, porting a pure C++ application to the RAPID
platform is quite straightforward; the Hand Tracking application, a native Linux application developed
in C++, demonstrates this. The only steps required to make the native Hand Tracker remoteable are
writing a Java JNI wrapper for the top-level calls of the hand tracker application, declaring them
remoteable using the RAPID API, and choosing the offloading server by initializing the DFE
accordingly. All other details are abstracted from the developer. All the implementation details
regarding the port of the 3D Hand Tracker to RAPID are thoroughly documented in D2.1 [11], and the
RAPID implementation remains almost one-to-one with the original code. As stated in D2.2 [12]
(section 4.3.2), the RAPID GPU Bridge component cannot be used, since the Hand Tracker relies on
direct OpenGL/CUDA interoperation: the RAPID GPU Bridge provides pure CUDA virtualization, so
the OpenGL deferred rendering, which is a mandatory requirement, is unavailable. The OpenGL/CUDA
interoperation requirement could theoretically be overcome by decoupling the OpenGL geometry-to-depth
rendering calls, which are done via shaders, from the CUDA comparison of the renderings against
the camera observation data. However, this decoupling would require downloading the rendered data
from GPU RAM to system RAM and re-uploading it to the GPU, introducing a large PCI-bus bottleneck
that would deteriorate performance. The RAPID code-base for the Hand Tracker is available for public
use at RAPID’s GitHub webpage [13]. The code-base can serve as a very helpful reference for
developers who may wish to port a similar application to RAPID, since it serves as an example of code
layout; moreover, it provides templates for wrapping C/C++ primitives as Java objects and for the
overarching organization of a maven workspace [14]. A minimal sketch of such a wrapper is shown below.
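The following Java sketch illustrates the shape of such a JNI wrapper. It is not taken from the Hand Tracker repository; the class, method and library names are hypothetical, and Remoteable and DFE stand for the RAPID API described above, with assumed signatures.

import java.lang.reflect.Method;

// Hypothetical JNI wrapper around a top-level C++ tracking call.
public class HandTrackerWrapper extends Remoteable {

    static {
        // Loads the self-contained native runtime (e.g. libhandtracker.so).
        System.loadLibrary("handtracker");
    }

    // Top-level native call: given the previous hand pose (x) and the new
    // RGBD frame (t+1), return the estimated pose (x+1). Implemented in C++.
    public native float[] trackFrame(float[] previousPose,
                                     byte[] rgbFrame, short[] depthFrame);

    // Declares the call remoteable and lets the DFE decide where to run it.
    public float[] track(DFE dfe, float[] pose, byte[] rgb, short[] depth)
            throws Exception {
        Method m = HandTrackerWrapper.class.getMethod("trackFrame",
                float[].class, byte[].class, short[].class);
        return (float[]) dfe.execute(m, new Object[]{pose, rgb, depth}, this);
    }
}

Illustrative Java JNI wrapper for a native top-level call (hypothetical names).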
3.1.1. Environment
The software and hardware environment required by the application, as defined by the specifications in
D2.1 [11] section 5, is a 64-bit PC running Ubuntu 14.04.5 LTS and an NVIDIA GPU compatible with
CUDA version 6.5 or higher. The Hand Tracker internally relies on multiple sub-libraries for various
functionalities. All the project’s dependencies are included inside the GitHub repository in binary form.
Some of the most important dependencies are Boost [15], which offers platform independent system
libraries, OpenCV [16], which is a state-of-the-art open source Computer Vision toolkit, and OpenNI
[17], which provides the drivers to our RGBD cameras, as well as various smaller threading libraries.
The maven build system [14] chosen by RAPID automatically packages the dependencies in the
generated portable JAR archive. For this reason, the Hand Tracker application is very portable, despite
only having a Java-based top-level wrapper, since the rest of the binary runtime is self-contained. The
application’s proper functionality has also been tested on Linux host machines running different
versions of Ubuntu (16.04 and 16.10) and CUDA (8.0 and 9.0).
In order to study the behaviour of our RAPID-enabled Hand Tracker application, we used a two-tier
testing environment: a high-end desktop and a low-end laptop. The high-end desktop features a GeForce
GTX 970 [18] GPU and an Intel Core i7-950 [19] processor, while the laptop has an outdated GeForce
670M [20] GPU and an Intel Core i5-4210U [21]. It is worth noting that the laptop GPU is incompatible
with recent CUDA versions (9.0+), so with updated software it is not capable of performing GPGPU
tasks. The two connectivity options examined are a fast Gigabit Ethernet connection and a slower 802.11
Wi-Fi channel. The ideal environment for our application
would be a portable laptop connecting over the wireless channel to a fast offloading server. In this
way, we would be able to perform Hand Tracking while conserving the laptop’s limited battery power
and CPU resources and benefiting from its portability. However, we are also aiming for real-time
performance, so the delay we can tolerate when performing Hand Tracking is extremely low. In order
to achieve a 30 fps tracking loop, which allows us to process each frame received from the RGBD
device, all the processing needs to be executed within 33 milliseconds. Unfortunately, Wi-Fi
connections are very prone to radio interference and typically introduce latency ranging from 10 to 60
milliseconds, depending on the number of connected clients and network saturation. Moreover, the
available bandwidth of a Wi-Fi connection is substantially lower than that of a Gigabit Ethernet
connection. These factors make it architecturally impossible for a Wi-Fi connection to accommodate
our needs.
Figure 15: Hand Tracker testing environment topology.
In order to analyse the sustained performance of our RAPID-enabled Hand Tracking application in every
configuration, we performed our evaluation using both wireless and wired connections (the latter with
0.1 milliseconds of latency). The topology of the experimental setup is shown in Figure 15. Connecting
the Hand Tracker to the RAPID Public Cloud deployment would result in even higher latency, since,
depending on the network quality, the observed latency can range from 50 ms to 150 ms when
establishing TCP/IP connections with the remote host. The unique latency requirements of the Hand
Tracker make it suited only for the low-latency connections that a private cloud can offer. Due to this
requirement, as well as the OpenGL/CUDA interoperability requirement, this application was only
tested in the Private Cloud.
3.1.2. Performance Results
Before assessing the end-to-end performance of the application, we must first study the processing loop
of the Hand Tracker algorithm and understand the way it should ideally execute in order to benefit from
the remote execution.
Figure 16: High-level view of the Hand Tracker I/O times in ideal conditions.
The application is a processor of frames generated by a camera at a framerate of 30 frames per second.
The Hand Tracker acts as a black-box optimizer: it receives a prior hand configuration (x) along with
the next RGB and Depth frame pair (t+1) observing a hand, and responds with a good estimation (x+1)
of the position of the hand for the given frame. By repeating the procedure over the series of received
frames, we acquire one estimation per received RGBD frame, thus fully tracking the observed hand. In
order to achieve this for every frame received from the device, we need to perform all processing steps
in less than 33 milliseconds, as seen in Figure 16. Otherwise, when the next frame arrives after any
delay, we are not able to continue the process, since we do not have the (x+1) value that is required for
computing state (x+2).
This is depicted in segment A of Figure 17, where we observe that for a slower, 150 ms processing loop
time, we must skip processing two consecutive frames for each frame we process, since we do not have
enough processing time for them. This is not only bad for the user experience (as there is observable
delay), but also bad for the quality of the tracking: during the time lost due to the dropped frames, the
hand moves further away from the last tracked position, so the hand tracker has to sample a much wider
area, which makes the problem much more difficult, and errors also tend to accumulate.
Figure 17: What happens when frame processing is delayed (A) vs what RAPID could facilitate (B).
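The serial dependency described above can be summarized in a short pseudo-Java sketch (illustrative only; the camera and tracker interfaces are hypothetical):

// Illustrative sketch of the Hand Tracker's serial processing loop.
// Each estimation depends on the previous one, so frame t+1 cannot be
// processed, locally or remotely, before frame t has produced its result.
HandPose x = tracker.initialPose();
while (camera.hasFrames()) {
    RgbdFrame frame = camera.nextFrame(); // a new frame every 33 ms (30 fps)
    // Black-box optimizer: prior pose + new observation -> new pose.
    // If this call (plus any network round-trip when offloaded) exceeds
    // 33 ms, subsequent frames must be dropped and tracking quality degrades.
    x = tracker.estimate(x, frame);
}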
The RAPID platform, on the other hand, is not constrained by the resources of a single machine and
could allow us to scale up, as seen in Figure 17 (B). Unfortunately, the nature of the optimization
framework used inside the Hand Tracker is not suited to this kind of parallel processing: each frame
must first be processed and produce output before the next one can be processed. Thus, any potential
benefit gained by assigning each incoming frame to a separate computing resource is negated by the fact
that we always have to wait for each step to be completed before handling the next step.
One of the architectural changes attempted during the course of the project, in order to adapt to the
capabilities provided by RAPID, was daisy-chaining multiple separate optimization “threads”. This
could potentially improve tracking results by working on the input frame stream in parallel: each frame
would use a different computing resource provided by RAPID and be initialized with the closest
previous solution x and the latest received frame. Unfortunately, as seen in Figure 18, this generates two
completely separate optimization threads. Each thread would only use its own results, as they would
always be the most recent, which would virtually increase framerates by doubling resource consumption
while offering no real tracking quality improvement.
Figure 18: Daisy-chaining to two machines could in principle double observed framerates, but in fact the tracking
quality would be the same while consuming double resources.
To assess the system’s performance, we compare the evaluation results obtained by executing the Hand
Tracker using various configuration settings. In order to have comparable results, we pre-recorded a
Hand Tracking experiment scene depicting various challenging hand movements, and we added
configuration parameters to the executables in order to be able to replicate and script the same
experiment across different setups and obtain directly comparable graphs.
With this evaluation, we aim to identify the overhead introduced by the network connections and the
RAPID framework, as well as the impact of calling the application’s native code through a Java Virtual
Machine (JVM). Moreover, we want to quantify the potential speed gain achieved by utilizing remote
code execution. Of course, all these are indirectly affected by the serial nature of the Hand Tracker
which, as stated at the start of this section, has to wait for each frame to be computed before processing
the next. If the Hand Tracker application could perform tracking without relying on previous solutions
in a serial fashion, performance would improve substantially, since we would be able to offload all of
the incoming frames and receive the results, achieving a perfect 30 FPS with just a minor and constant
delay.
We begin our evaluation by measuring the sustainable performance of the Hand Tracker in its vanilla
non-Java, non-RAPID implementation, when executed on the high-end desktop and the low-end laptop
respectively. The results of this analysis form the baseline of our evaluation and are displayed as the
dashed lines in Figure 19. The high-end hardware available in the desktop computer allows the
application to achieve real-time processing at 30 frames per second, matching the rate at which the
RGBD camera acquires new Depth and RGB frames. The laptop achieves a maximum average rate of
12 FPS, which is much slower due to its low-end hardware. Ideally, we want to take advantage of any
extra processing power provided by the desktop host in order to improve performance. Moreover, it is
worth keeping in mind that an important contribution of RAPID is enabling CUDA applications (such
as the Hand Tracker) to run on devices lacking CUDA-enabled graphics cards; this alone is a benefit of
using RAPID for this use case.
We proceed with the evaluation by measuring the performance of the RAPID-enabled implementation
of the Hand Tracker, when executed on the desktop and laptop host respectively, without utilizing code
offloading. The results of this set of experiments are portrayed in Figure 19, marked as “RAPID
Desktop/Laptop Localhost”. This analysis enables us to identify the overhead introduced by wrapping
the native code of the Hand Tracker inside a Java container using JNI. The results obtained at this step
reveal the impact of data serialization, synchronization and JVM overheads. We observe that in this
configuration, the application’s performance is reduced by 50% when executed on the high-end desktop
host. When executing on the low-end laptop host, where the potential GPU speedup is lower and the
overall execution slower, the overhead introduced by the Java switch is much less evident, at around
10%.
Figure 19: Sustainable frame rate for various offloading configurations.
The RGBD camera acquisition framerate is 30 fps.
Finally, we measure the performance characteristics of the RAPID-enabled Hand Tracker, utilizing code offloading via the Gigabit Ethernet and the Wi-Fi connection. The purpose of this study is to evaluate the performance gain obtained by executing the application's logic on the high-end desktop host. Unfortunately, since the desktop host performance, when running through its Java container, falls to roughly 15 FPS in the localhost scenario while the laptop localhost performance ranges from 10-12 FPS, we are left with only a small per-frame margin to gain at best. Considering the context data and network overhead that must be transmitted between the machines, this proves to be too much. RAPID automatically measures this and falls back to local execution. Thus, we obtain a performance similar to the localhost Java run, minus a small overhead incurred while the two machines negotiate over the slower or faster connection. As already mentioned in Section 3.1.1 of this document, we did not expect the wireless connection to be fast enough to help us, but thanks to RAPID's automatic QoS sensing, the slow network does not negatively impact execution times. Figure 20 is very revealing with regard to the importance of the network overhead: it clearly captures the delay caused by the network medium compared to the pure execution time on the remote machine. As Amdahl's law [22] suggests, regardless of the computing power available, we would still be unable to improve the achieved frame rates past the most critical bottleneck, which in our case is the network. Thus, applications like the Hand Tracker, where we are forced to wait for each frame to be processed before submitting the next, prove to be ill-suited to parallelization, since network latency ends up directly affecting the processing performance. An example of a problem better suited to parallelization would be a person detector (not tracker), where there would be no inter-frame dependencies. In that case, all newly acquired frames could be submitted in parallel to the computing resources, the network delay would not accumulate, and RAPID could provide a substantial improvement.
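As a contrast to the serial tracker, the following sketch shows how such a dependency-free detector could submit frames concurrently; DetectorStub and detectFrame() are hypothetical stand-ins for a RAPID-offloaded method:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class ParallelFrameSubmitter {
        // Hypothetical remote interface standing in for an offloaded call.
        public interface DetectorStub {
            int[] detectFrame(byte[] frame);
        }

        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        // With no inter-frame dependencies, every new frame is submitted
        // immediately: the network adds a constant delay per frame, but the
        // delays do not accumulate, so the sustained rate can match the
        // 30 FPS camera acquisition rate.
        public Future<int[]> submit(byte[] frame, DetectorStub remote) {
            return pool.submit(() -> remote.detectFrame(frame));
        }
    }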
Figure 20: Delay between local and remote execution.
3.1.3. Conclusions for the Hand Tracker use-case
Porting the Hand Tracker to the RAPID framework provided important insights that can be summarized
in the following list of remarks.
1. RAPID provides a versatile framework that can facilitate the easy upgrade of a code base that
was initially not built with distributed systems in mind to a fully distributed version, with very
little programming effort.
2. The range and type of applications that can benefit from RAPID is very large. As seen in the
Hand Tracker use case, even native applications that are written in C/C++ can be easily ported
to RAPID using a JVM Wrapper.
3. The Hand Tracker initially only targeted high-end devices that featured a fast GPGPU. With
RAPID, lower-end devices (even without graphics cards) are no longer excluded from the target
group for this application.
4. Although RAPID can transparently deliver an enormous pool of computing resources to any device/application combination, these computing resources may still ultimately be limited by network quality, bandwidth and latency, depending on the nature of the application. Even relatively fast 802.11n Wi-Fi connections can become a bottleneck in I/O-heavy applications.
5. The Hand Tracker, with its serial frame-processing dependency, its real-time requirements and its latency-sensitive visual feedback, described in detail in Section 2.4.2, is a use case that tests RAPID in an extremely demanding and unfavourable scenario. Despite this, RAPID manages to accommodate it, and will be able to accommodate it even better with future improvements in network technology.
3.2. Antivirus
This section presents the evaluation results of the GrAVity mobile antivirus for Android, which has been appropriately modified to use the RAPID offloading framework. The original workstation version of the antivirus was ported to the Android platform, in the form of an Android Package Kit (APK), at the early stages of this project. In its original configuration, the system took advantage of modern NVIDIA [23] GPUs in order to offload the computationally intensive task of scanning the file system for the presence of malicious code.
The Android version of this application was initially developed for the NVIDIA Shield K1 tablet [23], equipped with a mobile Kepler [24] GPU. In this configuration, the system was able to achieve increased performance by offloading the virus-scanning operations to the device's GPU instead of using the CPU. However, since the number of devices on the market equipped with CUDA-capable GPUs is limited [25], we proceeded to develop a version of its virus-scanning engine implemented entirely in Java. In this configuration, the application can be used by the vast majority of Android mobile devices available on the market. Furthermore, in order to provide the benefits of fast GPGPU execution to all mobile devices, we modified the GrAVity antivirus to offload the computationally intensive tasks using the RAPID framework.
In its final development stage, the application offers a wide variety of execution methods on each mobile device, as provided by the deployment of the RAPID framework and its infrastructure, and is available for mobile devices running Android version 4.4 or higher. The RAPID frontend, embedded in the application, can optimize the execution and schedule it locally on the CPU of the device, or on the GPU if present, or opt to offload the scanning process. The offloading can be performed either by offloading the Java version of the scanning engine to a remote virtual machine or by scheduling the CUDA version to execute on a remote, highly capable GPU. The GPGPU code offloading can be performed either by using the RAPID GPU-Bridger or by using a combination of the Acceleration Server (AS) and the GPU-Bridger. Using the combination of the two components, the virus-scanning task is offloaded to an Android VM, which then contacts the GPU-Bridger backend and forwards the task for execution on the remote GPU. The choice of offloading method is based on a wide variety of factors, such as energy consumption, throughput, latency and availability of resources.
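A simplified sketch of such a decision follows, with hypothetical thresholds, names and logic; RAPID's actual scheduling is more elaborate:

    // Hypothetical offloading decision mirroring the factors named above:
    // availability of resources, latency and energy consumption.
    public class OffloadPolicy {
        public enum Target { LOCAL_CPU, LOCAL_GPU, REMOTE_JAVA_AS, REMOTE_CUDA_BRIDGER }

        public Target choose(boolean hasLocalGpu, boolean remoteGpuAvailable,
                             double networkRttMs, double batteryPercent) {
            boolean networkOk = networkRttMs < 50;      // threshold assumed
            if (remoteGpuAvailable && networkOk) {
                return Target.REMOTE_CUDA_BRIDGER;      // fastest engine
            }
            if (networkOk && batteryPercent < 20) {
                return Target.REMOTE_JAVA_AS;           // save energy
            }
            return hasLocalGpu ? Target.LOCAL_GPU : Target.LOCAL_CPU;
        }
    }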
In the following sub-sections, we present the performance analysis of the application when executed in various environments. First, we present the evaluation results of the local execution using the CPU and the GPU available on the NVIDIA Shield K1 tablet. Then, we provide the analysis of the execution when the virus-scanning operation is offloaded to a private cloud infrastructure (cloudlet) using the RAPID framework. Finally, we discuss the outcomes of the same case study using the RAPID public cloud infrastructure for the offloading process. In each scenario, we demonstrate the results obtained by the execution of both the CPU and GPGPU implementations of the application core.
3.2.1. Environment
We evaluate the operation of our RAPID-enabled mobile antivirus using three different environment setups. We begin by analysing its performance characteristics when executing the application on the device it is installed on, namely the NVIDIA Shield K1 tablet. Then, we proceed with the offloading operations using a cloudlet and, finally, RAPID's public cloud infrastructure.
3.2.1.1. NVIDIA Shield K1 Tablet
The mobile device used for the evaluation of the local execution of the GrAVity antivirus is the NVIDIA Shield K1 Android tablet. It is powered by the NVIDIA Tegra® K1 [26] processor, which features a 192-core NVIDIA Kepler™ GPU and a quad-core Cortex-A15 [27] CPU clocked at 2.2 GHz. The presence of both a mobile CPU and GPU makes this platform an ideal evaluation environment for our application, since we are able to analyse the performance characteristics of both the CPU and GPU implementations of its virus-scanning core. The system is also equipped with 16 GB of internal storage and 2 GB of RAM, and runs Android version 7.0 Nougat, as updated by NVIDIA's latest Over-The-Air (OTA) update 5.0 [28], released on February 9, 2017.
3.2.1.2. Cloudlet Infrastructure
The cloudlet infrastructure under test is composed of two host machines interconnected using a Gigabit Ethernet switch. The network is also equipped with a wireless access point. Each host machine is equipped with an Intel® Core™ i7-6700 [29] CPU running at 3.40 GHz and 16 GB of DDR4 RAM operating at 2400 MHz. The cloudlet hosts are also equipped with an NVIDIA GeForce GTX 980 GPU [30] providing 4 GB of available GDDR5 memory. Both hosts execute instances of Android v6.0 virtual machines with the RAPID Acceleration Server installed, as well as instances of the GPU-Bridger backend. In this configuration, the cloudlet is capable of performing both Java and CUDA code offloading, and we are able to evaluate the CPU and GPU implementations of our application's virus-scanning engine. The NVIDIA Shield K1 is connected to one of the two hosts via a Wi-Fi channel, as seen in Figure 21.
Figure 21: Antivirus testing environment topology.
3.2.2. Performance Results
The evaluation of our system is divided into three execution models: (1) local (on-device) execution, (2) cloudlet offloading and (3) RAPID public cloud offloading. We measure the sustained throughput achieved by the antivirus's Java-based and CUDA-based signature-matching engines for each execution model. For the purpose of this analysis, we generate 100 automata containing signatures of malicious code snippets in binary and regular-expression format. Some of these signatures are obtained from the ClamAV [31] database, while others are hand-crafted, based on snippets of known malicious code and e-mail filters. The purpose of the custom signatures is to stress the matching engine using complex
regular expressions developed for this purpose. Moreover, we generate a set of 16,000 files, each one matching one of the signatures found in the automata set. We chose to perform the evaluation without the presence of executable malicious code or infected APKs for safety reasons. In each execution, we scan the entire file set against all the precompiled automata, measuring both the end-to-end sustainable throughput and the throughput achieved only by the virus-scanning engine. In this way, we are able to evaluate the performance gain provided to the matching engine by the utilization of RAPID offloading, as well as the performance observed by the user. In this analysis, we exclude the overhead introduced by reading the files into memory, and the end-to-end results depict the throughput of the network I/O and virus-scanning process. We choose to exclude the file-system I/O since it is highly related to the type of storage (internal or external memory card, as well as file-system type) and is measured to be the same in each experiment, regardless of the execution model.
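A minimal sketch of the measurement loop follows, assuming a hypothetical Scanner interface; files are pre-loaded into memory so that file-system I/O is excluded, as described above:

    import java.util.List;

    public class ThroughputBenchmark {
        // Hypothetical interface standing in for the virus-scanning engine.
        public interface Scanner {
            void scan(byte[] data);
        }

        // Returns the sustained scanning throughput in Mbps. For the
        // engine-only figure, only the scan() call is timed; for the
        // end-to-end figure, the network transfer would be included too.
        public static double measureMbps(List<byte[]> preloadedFiles, Scanner engine) {
            long totalBytes = 0;
            long start = System.nanoTime();
            for (byte[] file : preloadedFiles) {
                engine.scan(file);
                totalBytes += file.length;
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            return (totalBytes * 8 / 1e6) / seconds;  // megabits per second
        }
    }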
In order to draw the baseline for our evaluation, we begin by measuring the performance of the on-device execution, using the NVIDIA Shield K1 tablet. The experiment is performed twice, first using the Java implementation of the antivirus's scanning engine and then using the CUDA-based version, each time scanning the entire file set against all 100 automata. The results of this analysis are displayed in Figure 22. As we can see, the Java-based engine is able to achieve 2.9 Mbps of processing throughput, while the CUDA-based version achieves 26.9 Mbps. The CUDA-based implementation yields higher results due to the utilization of the GPU's highly parallel architecture. However, these throughput figures can only be achieved on a very limited number of mobile devices, namely those powered by NVIDIA GPUs.
Figure 22: Sustainable throughput achieved by the Java and CUDA implementation of the virus-scanning engine,
executed on the tablet’s CPU and GPU respectively.
We proceed with the evaluation, this time executing the antivirus configured to offload the file processing to the cloudlet, using RAPID. The offloading is performed in three different configurations. Firstly, we offload the Java-based engine using the RAPID Acceleration Server. Secondly, we offload the CUDA-based implementation using the RAPID GPU-Bridger, and finally we perform CUDA offloading using both the Acceleration Server and the GPU-Bridger. For each setup, we measure and report the end-to-end sustainable throughput as well as the throughput achieved only by the virus-scanning engine. The outcome of this experiment is depicted in Figure 23, with the bars indicating the sustainable throughput in each offloading configuration, the solid line indicating the throughput achieved by the tablet's CPU and the dashed line representing the throughput of the Kepler GPU found on the device. As we can see
in the figure, the Java-based engine is able to achieve 29.2 Mbps of scanning throughput, outperforming the on-device CPU by a factor of 10. Moreover, the GPU offloading achieves 161 Mbps, being 5.5 times faster than the Java offloading and 55.5 times faster than the Java-based implementation when executed on the NVIDIA tablet. These results indicate that RAPID offloading greatly benefits the execution of our system and resolves the execution bottleneck imposed by the low-end hardware found on mobile devices such as tablets and smartphones.
Figure 23: Sustainable throughput achieved by the virus-scanning engine in different offloading configurations using
the RAPID-enabled cloudlet.
While code offloading is proven to be beneficial for our system, it introduces a network I/O bottleneck, since the entire file set, including the automata, has to be transferred to the remote host. This is observed in the end-to-end performance results obtained by the experiment described above, displayed in Figure 24. As we can see, both the Java and CUDA code offloading achieve a maximum throughput of 24.5 Mbps due to the low bandwidth of the Wi-Fi connection. These results indicate that RAPID offloading still benefits the performance of our system, increasing its throughput by 8.4 times compared to the on-device CPU execution. However, in all cases, the performance gain achieved by the remote execution of the scanning engine is overshadowed by the low bandwidth of the network channel and can potentially increase with the availability of high-speed Wi-Fi channels.
Figure 24: End-to-end sustainable throughput achieved by the antivirus Android application in different offloading configurations using the RAPID-enabled cloudlet.
3.2.3. Antivirus in the OTC RAPID cloud
In the final part of our evaluation, we conduct the same experiments performed using the cloudlet, this time offloading the virus-scanning task to RAPID's public cloud, namely OTC. Firstly, we measure only the file-processing throughput, without taking into account the network I/O overhead. The result of this analysis is displayed in Figure 25. We notice that the Java-based engine, when offloaded to the remote Android VM using the AS, is able to perform virus scanning at 21.4 Mbps, achieving 7.3 times better throughput compared to the tablet's CPU. The CUDA-based implementation, when offloaded using the RAPID GPU-Bridger or a combination of the AS with the GPU-Bridger, yields 223 Mbps of file-processing throughput, outperforming the tablet's CPU by 76.8 times while also being 8.7 times faster than the integrated GPU. Moreover, we can see that the high-end GPU available on OTC is able to outperform the low-end GPU provided by our cloudlet by 1.3 times. These results indicate that RAPID-enabled public clouds providing access to high-end hardware can greatly benefit the execution time of our antivirus application.
Figure 25: Sustainable throughput achieved by the antivirus engine in different offloading configurations using OTC.
We conclude the evaluation of our mobile antivirus by measuring the end-to-end sustainable throughput achieved using OTC for task offloading. In this case, the network overhead introduced by the communication with the remote VMs is even higher than the bottleneck observed in the cloudlet infrastructure. As we can see in Figure 26, the end-to-end throughput is limited to 5.4 Mbps, due to the limitations imposed by the poor network communication. However, even considering this overhead, offloading the virus-scanning task to OTC improves the application's performance by 86%.
Figure 26: End-to-end sustainable throughput achieved by the antivirus Android application in different offloading
configurations using OTC.
3.2.4. Conclusions for the Antivirus use-case
Based on the results obtained during the evaluation of our RAPID-enabled mobile antivirus, we can draw the following conclusions.
1. RAPID enables the fast deployment of distributed systems able to offload both CPU and GPGPU code to remote hosts equipped with high-end hardware, with minimum effort.
2. Using RAPID, mobile devices equipped with low-end CPUs and lacking GPU support are able to utilise computationally demanding software developed for desktop and server hosts.
3. Using RAPID for GPGPU and Java code offloading accelerates the execution of complex code, improving the processing throughput by several times compared to on-device execution.
4. The end-to-end performance limitations observed during our evaluation are imposed by the poor network capabilities of mobile devices. The introduced network I/O overhead is not a result of RAPID's design and implementation but rather a hardware and technology limitation.
5. The deployment of fast wireless channels can greatly improve the observed performance achieved by RAPID offloading. We expect this benefit to be even greater on Android IoT devices equipped with Gigabit Ethernet connections.
3.3. BioSurveillance
The BioSurveillance use-case consists of a commercial face recognition application for video surveillance, ported to an NVIDIA Tegra platform [26] [32]. The algorithms work in real-time, with multiple faces simultaneously and under unconstrained conditions. Although the Tegra family achieves great computational performance for a low-power platform, its capabilities are often not enough, given the challenging requirements of video face recognition. Concretely, hard bottlenecks may appear due to the input video resolution, the number of faces concurrently analysed, and the size of the gallery database. Hence, offloading part of the computations from this device becomes critical for security, especially with crowded environments, large galleries, and 4K streams.
Figure 27. Typical pipeline of a face recognition application.
As described in D2.3 [33] and observed in Figure 27, offloading makes sense at three different stages of the face recognition pipeline: after video decoding, after face detection, or after template extraction. Offloading at early stages either requires an overwhelming amount of bandwidth (especially after video decoding) or implies a compromise on privacy aspects (anywhere before template extraction), so we decided to offload the template matching operation. This stage becomes especially critical for large databases or large template sizes (common for algorithms based on local visual features) [34].
In the presented pipeline, video decoding, face detection and template extraction are GPU-accelerated, whereas the remaining stages are performed on the CPU. The template matching stage is executed asynchronously, in order to hide latencies and fully utilize all the hardware of the board (CPU and GPU simultaneously).
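The asynchronous matching stage could be organised as in the following sketch; Gallery and match() are hypothetical names, not the product's actual API:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class AsyncTemplateMatcher {
        // Hypothetical gallery interface; match() may run locally or be
        // offloaded through the GPU-Bridger.
        public interface Gallery {
            float[] match(float[] template);
        }

        private final ExecutorService matcherThread = Executors.newSingleThreadExecutor();

        // Matching is dispatched asynchronously so the CPU stages can start
        // processing the next frame while the GPU match of the current
        // templates is still in flight, hiding the matching latency.
        public Future<float[]> matchAsync(float[] template, Gallery gallery) {
            return matcherThread.submit(() -> gallery.match(template));
        }
    }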
Given the sensitive nature of facial snapshots and databases, it is not feasible to replicate a private database of subjects among devices within a cloud. Moreover, real face recognition deployments require the gallery of enrolled subjects to be secured and centralized in a single location. Therefore, to mitigate such privacy and security concerns, this use-case focuses on the RAPID GPU-Bridger CUDA offloading component, instead of employing the complete RAPID infrastructure.
3.3.1. Environment
Figure 28. Evolution of architectures for this use-case. (a) Original implementation on TK1. (b) Extended TK1 version
with manual offloading through ZeroMQ (requires developing code on target accelerator). (c) Final version on TX1,
with transparent offloading by RAPID (no code required on the accelerator).
A fully functional Tegra K1 [26] prototype in C++/CUDA had been implemented specifically for RAPID during the first year of the project, scaled down from the original commercial product, as depicted in Figure 28a. After that, a client-server version was manually implemented to offload the template matching component to an accelerator device featuring a discrete GPU, as shown in Figure 28b. This is what developers without access to RAPID components would normally do. The concrete implementation was carried out using ZeroMQ, an open-source library for inter-process communication [35].
For the final version, we were forced to make a series of important core modifications. First, the Tegra K1 processor was discontinued by NVIDIA and replaced by the Tegra X1 system-on-chip [32], which features Cortex-A57 cores instead of A15, 256 CUDA cores instead of 192, and, most critical to us, only supports CUDA versions 7.0 and above. Thus, the prototype code had to be updated accordingly, to preserve future commercial viability. Fortunately, the latest version of the RAPID GPU-Bridger supports newer CUDA versions (certified up to 9.0), so the change of platform at the frontend was transparently taken care of by the GPU-Bridger. A final technical issue appeared due to the use of the ZeroMQ library by the BioSurveillance application. The inter-process communication of ZeroMQ collided with the internal sockets used by the GPU-Bridger, which caused the latter to stop working properly, an issue that took several weeks to detect and correct. To solve the problem, the client-server communication of the prototype had to be re-implemented using named pipes, thus restoring the offloading functionality. Since this issue was found, the RAPID GPU-Bridger developers have taken steps to minimize future compatibility risks with user applications using similar communication libraries.
The BioSurveillance use-case has been evaluated in four different environments:
1. Application running natively on Tegra X1, without any kind of offloading
2. Manually coded ZeroMQ client-server version, offloading from TX1 to a Linux x86 server
3. Automatic offloading using RAPID, locally on the Tegra X1
4. Automatic offloading using RAPID, from Tegra X1 to a Linux x86 server
Although the third case may seem impractical, it is useful for understanding the fixed overhead introduced simply by using RAPID: a manual, well-designed remote offloading (case 2) will only communicate the data of interest, i.e. templates, while transparent offloading with RAPID (case 4)
requires sending CUDA header functions and auxiliary data so that the whole operation can be carried out at the remote GPU. Hence, local RAPID offloading (case 3) helps us estimate how much performance is lost due to this extra communication overhead. The remote offloading is carried out towards two different discrete NVIDIA GPU cards: a GTX 760 [36] and a more powerful Titan Xp [37].
3.3.2. Performance Results
The evaluations in this section consider two main factors: latency and scalability. Given that the offloaded part is restricted to the template matching module, we evaluate scalability in terms of the number of faces simultaneously present in a frame and the size of the database.
Figure 29. Sustained frames per second (FPS) of the local and distributed BioSurveillance applications, with manual
and automatic (RAPID) offloading, depending on the number of faces simultaneously analysed.
Figure 29 provides a comparison of the sustained frame-rate of the application depending on the number of faces continuously present in front of the camera. The left plot corresponds to standard factory settings, whereas for the right one we raised the CPU and GPU clock frequencies to achieve maximum performance. For each scenario, we compare the four cases described previously. Remote offloading is carried out to a Linux x64 server, equipped with an NVIDIA Titan Xp GPU [37], within the same local network (average network ping: 2 ms). Each represented frame-rate is computed from the median of FPS samples measured over a one-minute window. For this baseline comparison, the database contained only 100 subjects.
As displayed in the figure, for zero faces the behaviour is identical in all cases, given that no template is extracted and no template matching is required. As the number of analysed faces increases, the manual remote offloading sometimes behaves similarly to, or even slightly better than, the stand-alone case, as it frees more resources on the low-power device than it consumes in network latency. Local offloading using RAPID behaves approximately like the stand-alone application, except for a tiny overhead due to the transfer of CUDA headers and auxiliary data. Remote RAPID offloading adds a noticeable but relatively small penalty to the sustained performance of the application. It is worth noting that after a certain number of faces, the difference between local and remote execution tends to level out, given that, whereas the local execution struggles more and more to allocate resources, the price paid for remote execution remains constant over time.
Figure 30. Automatic remote offloading with RAPID allows us to override hardware limitations. In this case, we can
overcome memory limitations and use larger subject databases simply by changing the GPU card of the accelerator.
Figure 30 is of great importance in order to understand one of the primary contributions of automatic offloading with RAPID. Although we pay a considerable price in terms of latency when using RAPID, it allows us to transparently override hardware limitations, by simply changing the IP address of the remote accelerator. In this example, the TX1 platform can only handle up to 10K database templates during local execution. However, by offloading to a remote GTX 760 GPU [36], we manage to handle larger databases, even though the memory of the accelerator card is lower than that of the Tegra X1 (2 GB for the GTX 760 vs 4 GB for the TX1). This is explained by the fact that the TX1 consumes a large amount of memory for face detection and template extraction, leaving little for database matching, whereas the GTX 760 is devoted exclusively to matching, resulting in 4+2 GB available for the complete pipeline. Likewise, RAPID allows us to deal with even larger databases by simply using a different accelerator. With a Titan Xp [37] we are able to handle more than 100K elements in the database, as required by certain border control projects, which until now needed powerful servers to be placed next to the camera sensors and databases to be replicated.
3.3.3. Batch offloading
We can characterize the individual latency paid by every single offloading of the template matching module. The latency that we empirically measure at the matching server (whose CUDA function calls are offloaded by the GPU-Bridger component within RAPID) is given by the following two terms:

\hat{L} = \left[ \#\text{faces} \times \frac{\#\text{CUDA calls}}{\text{face}} \times \text{Network latency} \right] + \left[ \text{Computation latency} \right]
We notice that we pay a fixed price in terms of network latency for setting up the CUDA calls required to match a single template against the database, and a variable price depending on the number of faces that have been found in each frame (each of which results in a corresponding template). Separately, the computational
latency depends on factors such as the number of templates enrolled in the database, but also on the particular hardware (GPU) where the matching is actually carried out.
It is evident that the price we pay in network latency is far larger than the computational one. A question that arises is whether there is any mechanism to hide these overwhelming network latencies. Towards this end, we propose to batch a number of templates into a single package of computation, thus effectively reducing the network latency per template by a factor of the batch size. As seen in Figure 31, this small additional development effectively increases the frame-rate performance, not only for the RAPID remote offloading case (where it improves frame-rate performance by 10-20%), but even for the stand-alone application, yielding a consistent 5% improvement independently of the size of the database.
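A sketch of the batching logic follows, under the assumption of a hypothetical matchBatch() remote call; the actual integration into the matching server differs in detail:

    import java.util.ArrayList;
    import java.util.List;

    public class BatchDispatcher {
        // Hypothetical remote matcher standing in for the offloaded CUDA path.
        public interface Matcher {
            float[][] matchBatch(float[][] templates);
        }

        private final List<float[]> pending = new ArrayList<>();
        private final int batchSize;

        public BatchDispatcher(int batchSize) {
            this.batchSize = batchSize;
        }

        // Templates are accumulated and sent as a single package, so the
        // fixed network cost of setting up the CUDA calls is paid once per
        // batch instead of once per template.
        public float[][] offer(float[] template, Matcher remote) {
            pending.add(template);
            if (pending.size() < batchSize) {
                return null;  // keep accumulating
            }
            float[][] batch = pending.toArray(new float[0][]);
            pending.clear();
            return remote.matchBatch(batch);
        }
    }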
Figure 31. Performance improvements after the implementation of batch offloading operations for RAPID. In this case, with batch=10, the frame-rate improves even compared to the stand-alone application running locally on the TX1.
Looking at it in more detail, we can estimate the latency of each CUDA call taken by RAPID's GPU-Bridger backend at the remote machine. For this example, we evaluate a database of 100 subjects and a single face in front of the camera. The list of CUDA calls either executed or offloaded by the matching server for each template can be seen in Table 3. The table shows two different scenarios: the original remote offloading without batching (i.e. batch=1, or one template sent at a time), and a batch of 10 templates. The average latency per template is indeed reduced by the proposed approach, which saves us between 15% and 20% of the effective latency incurred by the application.
Table 3. List of CUDA functions offloaded in a template matching operation, with their average computation latency. By applying batch-10 offloading, we can reduce the effective latency per template by 15-20%. All times in ms.

CUDA function            Batch = 1 (no batch)    Batch = 10, total    Batch = 10, per template
cudaMemset               2                       2                    0
cudaMemcpyAsync          135                     1186                 119
cudaMemset               2                       2                    0
cudaLaunch               1                       1                    0
cudaMemcpyAsync          2                       2                    0
cudaStreamSynchronize    1                       1                    0
Total elapsed time       143                     1194                 119 (-17%)
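The per-template saving follows directly from amortizing the dominant cudaMemcpyAsync transfer over the batch:

\frac{1194\ \text{ms}}{10\ \text{templates}} \approx 119\ \text{ms per template}, \qquad 1 - \frac{119}{143} \approx 17\%

which matches the reduction reported in the last row of the table.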
Finally, in Figure 32 we present an evaluation of the resulting frame-rate depending on the batch-size parameter. This experiment is carried out with a database of 10K subjects and remote offloading to the Titan Xp, progressively increasing the number of templates packaged in a computation batch. Adding a batch operation always seems to improve over the standard setting (batch=1), and the optimum in this case appears in the range between batch=10 and batch=100. We observed that large batches beyond 200 resulted in noisier frame-rate measurements, which required us to take many more samples to reach a consistent result.
Figure 32. Performance of the automatic remote offloading application with regard to the chosen batch-size. The
optimal batch-size in this case is somewhere between batch=10 and batch=100.
3.3.4. BioSurveillance in the OTC RAPID cloud
Finally, we have evaluated the performance of the BioSurveillance application when offloading the template matching operation to RAPID's public cloud. Due to the privacy constraints related to personal data access in face recognition applications, BioSurveillance would never be able to host sensitive information such as facial databases in a public cloud, so this is more of an academic exercise to understand the limitations of operating with servers at very distant locations. The average network ping of the RAPID OTC cloud was measured to be 59 ms at the time the tests were conducted.
Evaluating the application on the RAPID cloud is as simple as modifying the IP address of the accelerator in the configuration file. We compare the stand-alone application against the remote RAPID offloading setup, with databases ranging from 100 to 15K subjects, and for two different scenarios: non-crowded environments (a single face in front of the camera) and crowded environments (5 simultaneous faces). In order to mitigate the effects of the high latencies, the proposed batch offloading operation is performed, with batch sizes ranging from 1 (no batch, each template matching sent individually) up to 500. Larger batch sizes have not been tested, as the consequent delay in retrieving the results becomes too long for critical applications. A batch of 500 ensures that alarms can be raised in less than a second.
Figure 33. Performance comparison of BioSurveillance running stand-alone on the TX1 against remote offloading to RAPID's cloud, for different batch sizes, database sizes and numbers of faces.
Figure 33 shows the obtained results. As we anticipated, the stand-alone application cannot handle more than 10K database subjects without running out of memory, a no-go limitation for most commercial face recognition projects. We also observe that standard RAPID offloading (without batch packaging) cannot achieve the performance level of the stand-alone application, due to the very high latency affecting the communication with the cloud. Nevertheless, when increasing the batch size of the template matching operation, we found that remote offloading even improves on the overall performance of the stand-alone application for batch sizes over 150. Although large batch sizes delay the
reception of matching scores (as we have to wait until a batch package of computation is filled), this is traded off by an improved overall frame-rate of the application, thus avoiding dropping frames that could be potentially critical in terms of security. Moreover, we observe that the computational performance of the remote offloading becomes less dependent on the database size, which again tends to level off for large batch sizes.
Figure 34. Latencies for each process of the face recognition pipeline, when analyzing 5 faces simultaneously on the
RAPID cloud.
The counterpart of these performance results is presented in terms of latency in Figure 34. In this case, we have collected the frame-averaged latency for each stage of the face recognition pipeline: (1) video decoding, (2) face detection, (3) template extraction and (4) template matching (the only RAPID-offloaded operation). Face detection and template matching are carried out on the GPU, whereas the rest runs on the CPU. We have evaluated two database sizes, 100 and 10K subjects, with and without batch offloading, for batch sizes equal to 1, 50, and 500. As expected, the only process affected by the database size is the matching. It is also noticeable that analysing 5 simultaneous faces considerably increases the cost of template extraction (for a single face, the template extraction latency is practically 1/5 of the one shown here). An increase in database size has a strong impact on the latency of the matching, although not as much as the price paid for connecting to a distant server. Nonetheless, batch offloading drastically reduces the latency of template matching to a value that is negligible compared to the other pipeline operations, yielding the frame-rates presented in Figure 33 and making it possible to use remote servers in private clouds in real commercial applications.
3.3.5. Conclusions for the BioSurveillance use-case
The thorough evaluation procedures carried out yield a series of interesting conclusions for the BioSurveillance use-case:
1. RAPID is extremely useful for transparently offloading workloads to remote GPU devices, achieving almost the same performance as dedicated, manually coded distributed applications, while providing a series of remarkable advantages:
a. Rapidly prototyping distributed applications without actually having to develop code (code maintenance).
b. Running GPU code on machines without actual GPU support (code compatibility).
c. Transparently overriding device hardware limitations, such as the memory constraints in our case, which are critical for large database deployments. Hardware limitations are solved simply by using a more powerful accelerator, benefitting again in terms of code maintenance.
2. We propose a batch offloading approach that greatly mitigates the latency penalties of the remote RAPID offloading case, improving even over the stand-alone case when offloading both to private networks and to public clouds.
4. Conclusions and Future Performance Optimizations
In this section we discuss and propose future optimizations applicable to the RAPID framework and infrastructure, as well as optimizations at the use-case level. Regarding the QoS, several aspects can be optimized in future work. First, it would probably be faster to resize resources if the system were changed to work with containers instead of virtual machines. Secondly, the frequency at which the information is monitored by the host system currently bounds the rate at which the SLAM can take decisions about the VMs. Thus, increasing the frequency of the monitoring updates will make the SLAM run and take decisions faster. Finally, QoS could be further improved by also considering GPU-related aspects, and moving tasks from worse to better GPU devices accordingly.
The Hand Tracker, in its current algorithmic implementation [38] and due to the serial nature of operations between frames, is architecturally bound to always suffer from network-related latency issues when run in a distributed manner. These issues could be partly mitigated by advances in networking technology, but they could be completely overcome by switching the optimization pipeline from a generative regression-based optimizer to a discriminative classifier that does not have inter-frame dependencies. Hand pose estimation algorithms based on recent advances in Deep Convolutional Neural Networks, such as [39], would be ideal for acceleration using RAPID, and even hybrid classifier/regression solutions [40] could be much more suitable. Unfortunately, these state-of-the-art methods have only recently become available to the scientific community. The ability of Neural Network-based 2D Hand Trackers to work directly on RGB colour frames without any depth information would also remove the need for a special USB RGBD sensor, thus enabling the application to work on any kind of device with a camera, including smartphones, which would greatly benefit from the resources provided by RAPID.
The virus-scanning engine used by the mobile antivirus is highly demanding in computational resources. As we anticipated, offloading the processing-intensive task of pattern matching to more capable high-end resources, using RAPID, proved to be extremely beneficial for our system. However, the offloading process introduces a heavy network I/O bottleneck. This limitation is not introduced by RAPID and is observed in all offloading frameworks. We anticipate that when new network technologies, such as 5G networks, are implemented and widely deployed in the near future, our application will benefit from RAPID offloading even more. In the meantime, we plan to explore the usage of fast compression algorithms in order to mitigate the network bottleneck and exploit RAPID's capabilities to the maximum extent. We expect that the deployment of real-time compression algorithms, such as LZ4 [41], on the critical network I/O path of our application will allow us to increase the effective amount of data transmitted to the remote host under the current network capabilities. This will directly improve the end-to-end throughput achieved by the application and help us exploit the performance gain provided by RAPID.
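As a sketch of how LZ4 could be inserted into the I/O path, the following uses the open-source lz4-java library (net.jpountz API); the surrounding integration point is hypothetical:

    import net.jpountz.lz4.LZ4Compressor;
    import net.jpountz.lz4.LZ4Factory;
    import net.jpountz.lz4.LZ4FastDecompressor;

    public class Lz4Wire {
        private static final LZ4Factory FACTORY = LZ4Factory.fastestInstance();

        // Compress a scan buffer just before it enters the network I/O path.
        public static byte[] pack(byte[] data) {
            LZ4Compressor compressor = FACTORY.fastCompressor();
            return compressor.compress(data);
        }

        // Decompress on the remote side. LZ4 needs the original length,
        // which would be transmitted alongside the compressed payload.
        public static byte[] unpack(byte[] compressed, int originalLength) {
            LZ4FastDecompressor decompressor = FACTORY.fastDecompressor();
            return decompressor.decompress(compressed, originalLength);
        }
    }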
Regarding the BioSurveillance use-case, the fact that the template matching module is not the actual bottleneck of the application severely limits the maximum performance improvement achievable by RAPID, according to Amdahl's law [22]. On the other hand, we face challenging restraints in terms of data privacy. Biometric applications must most often meet strong privacy-by-design requirements that do not allow private data (e.g. video frames, facial snapshots, and databases) to be handled anywhere but on the device where it is processed or securely stored. Hence, a compromise between the aforementioned limitations and requirements would be to accelerate on the GPU part of the template extraction stage, which is currently done on the CPU, and to remotely offload only part of the template extraction GPU operations. Concretely, offloading some parts of template extraction, in which
the processed data is no longer identifiable or linked to personal data, would yield remarkable performance improvements, as the CUDA cores would be much more stressed (having to deal with almost all the stages of the pipeline), making remote CUDA offloading much more effective. This development would maintain all the current benefits of RAPID offloading (code maintenance, code compatibility and dissociation from hardware limitations), while considerably improving performance and still complying with privacy-by-design constraints.
References
[1] RAPID, “D7.2: First RAPID-based public service,” H2020-644312 RAPID Deliverable Report,
2017.
[2] “OpenStack,” [Online]. Available: www.openstack.org. [Accessed February 2018].
[3] “Tesla M60 GPU Accelerator,” [Online]. Available: http://www.nvidia.com/object/tesla-
m60.html. [Accessed February 2018].
[4] RAPID, “D4.2: Development of Dispatch/Fetch Engine,” H2020-644312 RAPID Deliverable
Report, 2016.
[5] RAPID, “D4.3: Development of Registration Process,” H2020-644312 RAPID Deliverable
Report, 2017.
[6] Huawei, “P9 Lite Smartphone,” [Online]. Available:
https://consumer.huawei.com/en/phones/p9-lite/. [Accessed February 2018].
[7] “ASUS ZenFone 2,” [Online]. Available:
https://www.asus.com/gr/Phone/ZenFone_2_ZE551ML/. [Accessed February 2018].
[8] “NVIDIA TITAN X Graphics Card for VR Gaming,” [Online]. Available:
https://www.nvidia.com/en-us/geforce/products/10series/titan-x-pascal/. [Accessed February
2018].
[9] Lenovo, “ThinkPad T460p Enterprise Laptop,” Lenovo, [Online]. Available:
https://www3.lenovo.com/us/en/laptops/thinkpad/thinkpad-t-series/ThinkPad-
T460p/p/22TP2TT460P#tab-techspec. [Accessed February 2018].
[10] G. Mei, “Evaluating the power of GPU acceleration for IDW interpolation algorithm,” The
Scientific World Journal, 2014.
[11] RAPID, “D2.1: Application analysis and system requirements,” H2020-644312 RAPID
Deliverable Report, 2015.
[12] RAPID, “D2.2 Kinect Hand Tracking ported on RAPID,” H2020-644312 RAPID Deliverable
Report, 2016.
[13] “HandTrackerRAPID,” [Online]. Available:
https://github.com/RapidProjectH2020/HandTrackerRAPID. [Accessed January 2018].
[14] “Apache Maven Project,” [Online]. Available: https://maven.apache.org/. [Accessed February
2018].
[15] “Boost,” [Online]. Available: http://www.boost.org/. [Accessed February 2018].
[16] “OpenCV,” [Online]. Available: https://opencv.org/. [Accessed February 2018].
[17] “OpenNI,” [Online]. Available: https://github.com/OpenNI. [Accessed February 2018].
[18] “GeForce GTX 970,” [Online]. Available: https://www.geforce.com/hardware/desktop-
gpus/geforce-gtx-970. [Accessed February 2018].
[19] “Intel® Core™ i7-950 Processor,” [Online]. Available:
https://ark.intel.com/products/37150/Intel-Core-i7-950-Processor-8M-Cache-3_06-GHz-4_80-
GTs-Intel-QPI. [Accessed February 2018].
[20] “GeForce GTX 670M,” [Online]. Available: https://www.geforce.com/hardware/notebook-
gpus/geforce-gtx-670m. [Accessed February 2018].
[21] “Intel® Core™ i5-4210U Processor,” [Online]. Available:
https://ark.intel.com/products/81016/Intel-Core-i5-4210U-Processor-3M-Cache-up-to-2_70-
GHz. [Accessed February 2018].
[22] M. Hill and M. Marty, “Amdahl's Law in the Multicore Era,” Computer, vol. 41, no. 7, pp. 33-
38, 2008.
[23] “NVIDIA Shield K1 tablet,” [Online]. Available: https://www2.nvidia.com/en-us/shield/tablet.
[Accessed December 2017].
[24] “Kepler Architecture,” [Online]. Available: http://www.nvidia.com/object/nvidia-kepler.html.
[Accessed December 2017].
[25] “Tegra Mobile Devices,” [Online]. Available: http://www.nvidia.com/object/tegra-phones-
tablets.html. [Accessed December 2017].
[26] “NVIDIA Tegra® K1 processor,” [Online]. Available: http://www.nvidia.com/object/tegra-k1-
processor.html. [Accessed December 2017].
[27] “ARM Cortex-A15 CPU,” [Online]. Available:
https://developer.arm.com/products/processors/cortex-a/cortex-a15. [Accessed December 2017].
[28] “Official SHIELD Tablet K1 Software Upgrade 5.0,” [Online]. Available:
https://forums.geforce.com/default/topic/992729/shield-tablet/official-shield-tablet-k1-software-
upgrade-5-0-feedback-thread-released-02-09-17-/. [Accessed December 2017].
[29] “Intel® Core™ i7-6700 Processor,” [Online]. Available:
https://ark.intel.com/products/88196/Intel-Core-i7-6700-Processor-8M-Cache-up-to-4_00-GHz.
[Accessed December 2017].
[30] “NVIDIA GeForce GTX 980 GPU Specifications,” [Online]. Available:
https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-980/specifications. [Accessed
December 2017].
[31] “ClamAV,” [Online]. Available: https://www.clamav.net/. [Accessed January 2018].
[32] “NVIDIA Tegra X1 Processor,” [Online]. Available: http://www.nvidia.com/object/tegra-x1-
processor.html. [Accessed January 2018].
[33] RAPID, “D2.3: BioSurveillance ported on RAPID,” H2020-644312 RAPID Deliverable Report,
2016.
[34] A. Nech and I. Kemelmacher-Shlizerman, “Level playing field for million scale face
recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[35] “ZeroMQ,” [Online]. Available: http://zeromq.org/. [Accessed January 2018].
[36] “NVIDIA GeForce GTX 760 GPU Specifications,” [Online]. Available:
https://www.geforce.com/hardware/desktop-gpus/geforce-gtx-760/specifications. [Accessed
January 2018].
[37] “NVIDIA Titan Xp GPU Specifications,” [Online]. Available: https://www.nvidia.com/en-
us/titan/titan-xp/. [Accessed January 2018].
[38] I. Oikonomidis, N. Kyriazis and A. A. Argyros, “Efficient model-based 3D tracking of hand
articulations using Kinect,” in British Machine Vision Conference (BMVC 2011), BMVA, 2011.
[39] P. Panteleris, I. Oikonomidis and A. A. Argyros, “Using a single RGB frame for real time 3D hand pose estimation in the wild,” in IEEE Winter Conference on Applications of Computer Vision (WACV 2018), IEEE, 2018 (to appear). Also available on arXiv.
[40] A. Qammaz, D. Michel and A. A. Argyros, “A Hybrid Method for 3D Pose Estimation of
Personalized Human Body Models,” in IEEE Winter Conference on Applications of Computer
Vision (WACV 2018) (to appear), IEEE, 2018.
[41] “LZ4 Compression Algorithm,” [Online]. Available:
https://en.wikipedia.org/wiki/LZ4_(compression_algorithm). [Accessed February 2018].