"Making Computer Vision Software Run Fast on Your Embedded Platform," a Presentation from LUXOFT

Copyright © 2016 LUXOFT

Alexey Rybakov, LUXOFT

May 3, 2016

Making Computer Vision Software Run Fast on Your Embedded Platform: Art and Science of Optimization


Global Software Engineering:

• Low-Power GPU Software

• Custom Vision Software

Why LUXOFT is Giving This Talk

10,000+ Luxoft software engineers


• Obstruction Removal for Drones

• CAFFE on ARM Mali

• OpenCV on ImgTec PowerVR

• HDR Encoding on GPU-based

• Low-power Motion Stabilization

• GPU-optimized 4K VP9 video codec

• See demos at our booth

Our Optimization Projects Covered in This Talk

[Figure: project panels labeled Drone Vision, Fast OpenCV, HDR on GPU, Caffe on GPU, Stabilization, Fast 4K Codecs]


• Qualifying question: Who Develops Computer Vision Software?

• Typical situations in embedded SW development:
  • Great new algorithm → Implement
  • Implementation platform: Desktop-class or Embedded*
  • Decision making: Delayed or Real-time*
  • Performance: Low FPS or High FPS*

Poll

* Context of this presentation


• Need: reliable, real-time, on-device decision-making from visual data, implemented on a constrained HW platform (with exotic architecture)

• What to do

1. Map CV pipeline onto HW platform

2. Rethink system requirements

3. Rework algorithm logic

4. Use GPU, DSP and other aid (properly!)

5. Code optimization

6. Know your platform inside out

Embedded Vision: Challenges and Opportunities


Map CV Pipeline onto HW Platform

1.


Embedded Vision: Pipeline and Hardware


Evaluate your platform:

• Hardware features and accelerators, slow/fast memory, power management?

• Support from run-time: OS, drivers, OpenCL, CUDA, other frameworks?

• Toolchain: Compiler, debugger, profiler, [access to] documentation, optimization guides?

• Available CV frameworks: OpenCV, IPP, fastCV, other?

Benchmark your embedded platform vs. reference:

• Run simple tests: data copy, access, vectorization, memory use, energy management

• Test if CV-framework functions are optimized (coverage is often low)

This will give you a measured optimization goal.

Study and Test HW Platform
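One of the simple tests mentioned above, a data-copy benchmark, can be sketched as follows. This is a minimal illustration in Python rather than the team's actual test suite; the function name `copy_bandwidth_gbps`, the 8 MB buffer size, and the repeat count are assumptions chosen for the example. On a real target the same measurement would normally be written in C or OpenCL and repeated for each memory type.

```python
import time

def copy_bandwidth_gbps(num_bytes=8 * 1024 * 1024, repeats=5):
    """Return the best observed copy bandwidth in GB/s for a buffer of num_bytes."""
    src = bytearray(num_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst = bytes(src)  # one full read of src and one full write of dst
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    assert len(dst) == num_bytes
    # Guard against a zero reading on platforms with a coarse timer.
    return num_bytes / max(best, 1e-9) / 1e9

if __name__ == "__main__":
    print(f"copy bandwidth: {copy_bandwidth_gbps():.2f} GB/s")
```

Running the same test on the reference machine and on the embedded board gives a ratio that can serve as a first, rough optimization target.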


Mapping to Platform: Histogram Example

[Figure: two mappings of the pipeline Camera → Histogram → Histogram equalization → Apply LUT, shown as Option A vs. Option B. Blocks are marked as GPU processing, CPU processing, or memory transfers. Labels in the figure: Histogram* 2 ms; Histogram 4.2 ms; 16.2 ms; 2 MB data transfer (HD frame); 1 KB data transfer (three times).]

* Histogram collection on CPU is more than 2 times faster than on GPU.
** Histogram equalization is a single-threaded, iterative histogram processing step, so a GPU implementation is not reasonable.

Memory transfers: HOST → GPU = 1.33 GB/s, GPU → HOST = 0.11 GB/s

SoC: Intel Merrifield platform. Device: Dell Venue 3840
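The weight of the memory transfers in this example can be checked with simple arithmetic. The sketch below uses the two bandwidth figures and the two transfer sizes from the slide; the helper name `transfer_ms` and the choice of binary megabytes are assumptions made for illustration.

```python
def transfer_ms(num_bytes, bandwidth_gb_per_s):
    """Time in milliseconds to move num_bytes at the given bandwidth (1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e3

HOST_TO_GPU = 1.33  # GB/s, from the slide
GPU_TO_HOST = 0.11  # GB/s, from the slide

frame = 2 * 1024 * 1024  # "2 MB data transfer (HD frame)", assuming binary megabytes
table = 1024             # "1 KB data transfer"

for name, size in (("HD frame", frame), ("1 KB table", table)):
    print(f"{name}: host->GPU {transfer_ms(size, HOST_TO_GPU):.3f} ms, "
          f"GPU->host {transfer_ms(size, GPU_TO_HOST):.3f} ms")
```

Under these assumptions, reading a full frame back from the GPU takes roughly 19 ms, which is longer than one frame period at 60 FPS, while a 1 KB histogram or LUT moves in about 0.01 ms. That is why a mapping that only moves small tables between CPU and GPU can win even if an individual stage is slower.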


Rethink System Requirements!

2.


• Important concept: “Good enough”

• How does your use case differ from classic/desktop requirements?

Art of “controlled worse”

• What decision latency do you need?

• What resolution/precision?

• Do you need the entire frame or only a region?

Optimize System Requirements


• Universal implementation* → Our drone implementation
• Any motion → Linear motion
• Any obstacles → Opaque obstacles
• Have only image data → Use sensor fusion (gyro)
• More than 100X faster!

Rethink Requirements:

Obstruction Removal, Drone Edition

Camera Output

*MIT CSAIL and Google Research, SIGGRAPH 2015


Rework Algorithm Logic

3.


• Desktop → Embedded
• High-res → Downsampling / pyramid
• Color → Monochrome or luminance
• Entire frame → Regions of interest (ROI) only
  • ROI cascading example: HOG to DNN
• Every frame → 1/N + approximation
  • Inter-frame cascading: detection to tracking
• Image only → Sensor fusion
  • Example: gyro + vision for motion estimation
• CPU → Parallelize for GPU

Algorithm Optimization Opportunities
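Three of the data-reduction steps above (color to luminance, one pyramid level of downsampling, and cropping to a region of interest) can be sketched in a few lines. This is a plain-Python illustration, not code from the talk; the function names and the integer BT.601-style luma weights are assumptions, and a production version would use a vectorized or GPU implementation.

```python
def to_luminance(rgb):
    """Convert rows of (r, g, b) tuples to 8-bit luma with integer weights (77+150+29 = 256)."""
    return [[(77 * r + 150 * g + 29 * b) >> 8 for (r, g, b) in row] for row in rgb]

def downsample2(img):
    """Halve width and height by averaging each 2x2 block (one pyramid level)."""
    h, w = len(img) // 2 * 2, len(img[0]) // 2 * 2
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def crop(img, x0, y0, width, height):
    """Keep only the region of interest."""
    return [row[x0:x0 + width] for row in img[y0:y0 + height]]

if __name__ == "__main__":
    rgb = [[(200, 100, 50)] * 8 for _ in range(8)]   # tiny synthetic 8x8 color frame
    small = downsample2(to_luminance(rgb))           # 4x4 luminance image
    roi = crop(small, 1, 1, 2, 2)                    # 2x2 region of interest
    values_in = 8 * 8 * 3
    values_out = len(roi) * len(roi[0])
    print(f"{values_in} input values reduced to {values_out}")
```

Each step cuts the data the later stages must touch: luminance by 3x, one pyramid level by 4x, and the ROI by whatever fraction of the frame it covers.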



Optimized Video Stabilization Algorithm

• Motion Vector Field only for 3x3 grid (pyramid downsampling)

• Only shift and rotation

• Inter-frame border reconstruction

• 1000x+ performance

• Real-time 4K UHD on mobile
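To illustrate how little data "only shift and rotation" on a 3x3 grid requires, the sketch below fits a rotation angle and a shift to nine motion vectors with a closed-form least-squares fit. This is a generic 2D rigid fit chosen for illustration, not necessarily the method used in the Luxoft stabilizer; the function name, the grid coordinates, and the synthetic test motion are all assumptions.

```python
import math

def fit_shift_and_rotation(points, vectors):
    """Least-squares rotation angle (radians) and shift mapping points to points + vectors."""
    n = len(points)
    moved = [(px + vx, py + vy) for (px, py), (vx, vy) in zip(points, vectors)]
    pcx = sum(p[0] for p in points) / n
    pcy = sum(p[1] for p in points) / n
    qcx = sum(q[0] for q in moved) / n
    qcy = sum(q[1] for q in moved) / n
    dot = cross = 0.0
    for (px, py), (qx, qy) in zip(points, moved):
        ax, ay = px - pcx, py - pcy      # source point relative to its centroid
        bx, by = qx - qcx, qy - qcy      # moved point relative to its centroid
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # Shift t such that moved ~= R * point + t.
    tx = qcx - (c * pcx - s * pcy)
    ty = qcy - (s * pcx + c * pcy)
    return theta, (tx, ty)

if __name__ == "__main__":
    # Centers of a 3x3 grid over a 3840x2160 frame (hypothetical layout).
    grid = [(x, y) for y in (360, 1080, 1800) for x in (640, 1920, 3200)]
    true_theta, true_shift = 0.01, (5.0, -3.0)
    c, s = math.cos(true_theta), math.sin(true_theta)
    vectors = [(c * x - s * y + true_shift[0] - x, s * x + c * y + true_shift[1] - y)
               for x, y in grid]
    print(fit_shift_and_rotation(grid, vectors))
```

The fit touches nine vectors per frame regardless of resolution, which is the kind of reduction that makes real-time 4K processing plausible on a mobile device.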


Use GPU and Other Aid (Properly)

4.


• Good news: computer vision is very parallelizable

• Bad news: coordination between CPU and GPU (and other compute devices) is the tricky part

• GPU: What to do (beyond algorithm-to-platform mapping and reworked logic)

• A few simple rules: memory types, datatypes, workgroup size, memory alignment

• Master the art of kernel synchronization: load your cores

• Use GPU pre-optimized libraries, like OpenCV on some platforms

• Master OpenCL

• Also explore available ISP or DSP benefits.

Use GPU. Properly


1. Memory Hierarchy

2. Task Synchronization

• Example of both: Large Matrix Transpose

GPU, Two Key Concepts


Original. All FPS measured on Galaxy S7:
• Run existing DNN framework: CAFFE
• = 0.7 FPS (EIGEN OpenCL library)

CPU optimization (not a through road):
• Optimized version for Android: DNN-optimized OpenBLAS with OpenMP and NEON: +2 FPS

GPU optimizations:
• Better OpenCL implementation on the ViennaCL library: +0.5 FPS
• Found bottleneck: SGEMM functions
• Rewrite SGEMM (workgroup size, vectorization, etc.): +4.5 FPS

Final optimized performance: 5-6 FPS

ARM Mali Accelerated CAFFE

[Chart: frames per second by implementation]
Open Source CPU, 1 thread: 0.7 FPS
Open Source GPU, OpenCL (ViennaCL): 1.2 FPS
Open Source CPU, multithreaded, NEON: 2.5 FPS
LUXOFT: 5.4 FPS
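The SGEMM rewrite itself is not shown in the slides. As a rough idea of one ingredient such rewrites commonly use, the sketch below computes a matrix product block by block so that each small tile of the inputs is reused before moving on. It is a plain-Python illustration under assumed names (`matmul_blocked`, `block`), not the Luxoft kernel; on a GPU the tile size would be matched to the workgroup size and local memory.

```python
def matmul_naive(a, b):
    """Reference C = A x B with the textbook triple loop."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_blocked(a, b, block=4):
    """C = A x B computed tile by tile so each tile of A and B is reused while it is hot."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + block, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += a_ik * row_b[j]
    return c

if __name__ == "__main__":
    a = [[i + j for j in range(7)] for i in range(5)]
    b = [[i * j + 1 for j in range(3)] for i in range(7)]
    print(matmul_blocked(a, b) == matmul_naive(a, b))
```

Python will not show the speedup; the point is the loop structure, which is what changes memory traffic on real hardware.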


ARM Mali Accelerated CAFFE: Benchmarks

[Figure: benchmark plots. Colors: FPS, CPU load, battery charge. Lines: CPU, optimized GPU.]


VP9 Video Decoder Optimization for GPU

[Figure: the decoder pipeline drawn twice, from input frame to output frame, with stages marked as GPU processing or CPU processing. Stages: Parsing & Entropy Decode, Motion Compensation, Intra Prediction, Inverse Quant, Inverse Transform, Reconstruction, Loop filtering.]

• Original CPU algorithm
  • CPU: Superblock-level parallelism
• Reworked and optimized GPU algorithm
  • GPU: Frame-level parallelism
  • Uses more memory

Optimization result: 2x-5x FPS depending on bitrate.

Platforms: AMD, Intel, NVidia SoCs


Code Optimization

5.


• Two enemies

1. Computation

2. Data transfers

• Waste of time = waste of energy

Controversial example

ARM compiler does it automatically

Some others don’t

Two Enemies: Code and Data

• Don’t calculate
  - Use table/lookup functions
  - Use polynomial approximations
• Use classic techniques
  - Like loop unrolling
  - Converting to native data types
• Don’t move data
  - Use local and cache memory
  - Partition/group DRAM access
• Benchmark everything
  - Compiler computation options
  - Memory transfers
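The "don't calculate, use a table" advice can be illustrated with an 8-bit gamma curve: the power function is evaluated 256 times to build a table, and every pixel after that costs one lookup. The gamma curve and the function names are assumptions picked for the example; the same pattern applies to any per-pixel function of an 8-bit input.

```python
def build_gamma_lut(gamma):
    """Precompute the 8-bit gamma curve once: 256 power evaluations in total."""
    return bytes(round(255 * (v / 255) ** gamma) for v in range(256))

def apply_lut(pixels, lut):
    """Map every 8-bit pixel through the 256-entry table (bytes.translate does the lookups)."""
    return pixels.translate(lut)

if __name__ == "__main__":
    lut = build_gamma_lut(0.5)
    frame = bytes(range(256)) * 4   # small synthetic 8-bit image
    out = apply_lut(frame, lut)
    print(out[0], out[64], out[255])
```

Whether the table beats direct computation depends on the platform and on where the table lives in memory, so this is exactly the kind of choice the "benchmark everything" bullet refers to.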


OpenCV local contrast for HD camera adjustment in real time

• Existing OpenCV histogram implementations don't fit into the 1080p frame processing budget (need 16 ms/frame for the entire algorithm chain to obtain 60 FPS)

Optimization Results

Things to do

• Experiment

• Benchmark

• Choose the best method

OpenCV on ImgTec PowerVR GPU: Histogram Example

Histogram Gathering Method                      Time, ms
OpenCV histogram (CPU)                          7.5
OpenCV histogram (GPU)                          4.4
Luxoft-PowerVR (atomic_add to global memory)    0.69
Luxoft-PowerVR (atomic_add to local memory)     7.51
Luxoft-PowerVR (increment at local memory)      3.28
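The methods in the table differ mainly in where the bin counters live and how updates are combined. The sketch below shows the structure of the "local" variants in sequential Python: each tile fills a private 256-bin histogram, and the private histograms are merged once per tile. It is an assumed illustration of the pattern only (the names and the tile size are made up) and says nothing about which variant is fastest on a given GPU; the measurements above show that has to be tested.

```python
def histogram(pixels):
    """256-bin histogram of 8-bit pixels."""
    bins = [0] * 256
    for p in pixels:
        bins[p] += 1
    return bins

def histogram_by_tiles(pixels, tile_size):
    """Build one private histogram per tile, then merge it into the global one."""
    total = [0] * 256
    for start in range(0, len(pixels), tile_size):
        local = histogram(pixels[start:start + tile_size])  # private bins, no contention
        for i, count in enumerate(local):                   # one merge per tile
            total[i] += count
    return total

if __name__ == "__main__":
    frame = bytes((i * 7) % 256 for i in range(10000))
    print(histogram_by_tiles(frame, 1024) == histogram(frame))
```

On a GPU each tile would correspond to a workgroup, the private bins to local memory, and the merge to a small number of writes to global memory.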


• Example: “memory tiling”

Tiled memory layout may give a 2x-3x performance gain for vision algorithms: 1 DRAM read vs. 4 DRAM reads in matrix transpose

Example: Fighting Data Transfers

• Reference you need to obtain or produce (will vary by CPU/GPU of your choice)
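The access pattern behind the matrix transpose example can be sketched as follows: the matrix is processed in small square tiles, so both the reads and the writes stay within a compact block of memory instead of striding across whole rows. The function name and the tile size of 4 are assumptions for illustration; the right tile size depends on the cache line or DRAM burst size of the target.

```python
def transpose_tiled(src, rows, cols, tile=4):
    """Transpose a row-major rows x cols matrix stored in a flat list, tile by tile."""
    dst = [0] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            # Everything touched inside this double loop lies in one tile of src
            # and one tile of dst.
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    dst[c * rows + r] = src[r * cols + c]
    return dst

if __name__ == "__main__":
    src = list(range(5 * 7))
    dst = transpose_tiled(src, 5, 7)
    print(transpose_tiled(dst, 7, 5) == src)
```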


Know Your Platform Inside Out

6.


• Things to do

• Study documentation and optimization guides for your exact HW

• Again, test/benchmark a feature before you critically rely on it

• What works for you

• Modern GPUs and DSPs may implement the entire algorithm in 1 instruction

• What works against you

• Don’t assume everything will work as documented

• “Fast” memory … may be slow (like early versions of Snapdragon)
• Great technology … but no documentation and no code examples (like iOS Metal for compute)

Platform Specifics


• Motion vector field (MVF) upsampling is a common task for CV
• OpenCL image samplers can bilinearly interpolate any data stored in an image, not only pixels
• How-to for the AMD OpenCL implementation:
  • AMD has a QSAD function, the fastest way to compute SAD for blocks
  • Keep the MVF in Image2D
  • Use a sampler with CLK_FILTER_LINEAR (bilinear filtering for 2D images)

Platform Example: AMD GPU for Frame Interpolation

[Figure: Basic vs. Optimized]
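What the image sampler does for the motion vector field can be written out in software. The sketch below bilinearly upsamples a small grid of (dx, dy) vectors; it is a CPU stand-in for illustration, with assumed function names and a corner-aligned sampling convention that differs from the pixel-center convention of OpenCL samplers. On the GPU the interpolation comes essentially for free from the texture hardware.

```python
def sample_bilinear(field, x, y):
    """Bilinearly interpolate a 2D grid of (dx, dy) vectors at fractional position (x, y)."""
    h, w = len(field), len(field[0])
    x = min(max(x, 0.0), w - 1.0)   # clamp to the grid, like a clamp-to-edge sampler
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0

    def lerp(a, b, t):
        return tuple(p + (q - p) * t for p, q in zip(a, b))

    top = lerp(field[y0][x0], field[y0][x1], fx)
    bottom = lerp(field[y1][x0], field[y1][x1], fx)
    return lerp(top, bottom, fy)

def upsample(field, factor):
    """Upsample the vector field by an integer factor (corner-aligned).

    Vector magnitudes are left unscaled; multiply them by factor if they are
    expressed in low-resolution pixels.
    """
    h, w = len(field), len(field[0])
    out_h, out_w = (h - 1) * factor + 1, (w - 1) * factor + 1
    return [[sample_bilinear(field, x / factor, y / factor) for x in range(out_w)]
            for y in range(out_h)]

if __name__ == "__main__":
    mvf = [[(0.0, 0.0), (2.0, 0.0)],
           [(0.0, 4.0), (2.0, 4.0)]]
    for row in upsample(mvf, 2):
        print(row)
```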


iOS Metal Compute Findings:

• No code examples for compute, weak documentation = black box
• Only 64 GitHub repos, no serious projects
• Xcode profiler does not work with Metal Compute; workaround: manual timer-based profiling
• Vector types are not fully supported by the compiler; test everything, then work around it with a combined approach using scalars and vectors

Encountered while working on GPU-optimized JPEG-HDR encoding on iPhone.

We still achieved about 3x-4x faster JPEG encode on iPhone; it just took a lot of extra work.

Platform Example: Apple iOS Metal for GPU Compute
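Manual timer-based profiling of the kind mentioned above can be as simple as wrapping each stage in a timer and collecting the samples. The sketch below shows the pattern in Python with assumed names (`timed`, `report`); it is not the team's tooling. When timing GPU work, the timed region has to include waiting for the job to finish, otherwise only the submission cost is measured.

```python
import time
from contextlib import contextmanager

timings = {}

@contextmanager
def timed(label):
    """Record the wall-clock duration of the enclosed block under the given label."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(label, []).append(time.perf_counter() - start)

def report():
    """Return {label: (sample count, best ms, mean ms)} for all recorded labels."""
    return {label: (len(s), min(s) * 1e3, sum(s) / len(s) * 1e3)
            for label, s in timings.items()}

if __name__ == "__main__":
    for _ in range(10):
        with timed("stage A"):
            sum(i * i for i in range(10000))   # stand-in for one pipeline stage
    for label, (count, best, mean) in report().items():
        print(f"{label}: {count} runs, best {best:.3f} ms, mean {mean:.3f} ms")
```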


Lessons Learned and Resources

!


1. Learn, test, profile, and benchmark every component of your system, including the compiler. Don’t assume.
2. Don’t port 1:1. Rework requirements and algorithm logic too.
3. GPU and other non-CPU compute architectures may give fantastic results.
4. Use parallelization and computer vision frameworks like OpenCL or OpenCV. Rewrite critical parts there as needed.
5. Modern HW platforms implement popular algorithms in one function call. Study platform-specific optimization guides.
6. Sometimes things won’t work as documented. This is normal.
7. Optimization is a mix of art and science. Think outside the box.

Lessons Learned


• Embedded Vision Alliance: http://www.embedded-vision.com/

• Platform optimization guides and blog posts from:
  • Altera (now Intel), AMD, ARM, Imagination Technologies, NVidia, Qualcomm, TI

• Luxoft Computer Vision team: [email protected]

Resources



Thank you!

LUXOFT Presentation R&D Team:

Aleksandr Bobrovnik

Aleksandr Volkov

Alexey Rybakov

Anton Veselov

Artem Galin

Dmitriy Marenkov

Dmitry Ivanov

Ekaterina Popova

Ihor Starepravo

Ildar Valiev

Marat Gilmutdinov

Nikolay Nemcev

Oleksandr Murovanyi

Sergey Fedorov

Valery Bobrov

Viktor Pasoshnikov


See demos at our booth, and email me too.

Alexey Rybakov
Senior Director, Embedded
LUXOFT, Menlo Park, CA
[email protected]
