"Making Computer Vision Software Run Fast on Your Embedded Platform," a Presentation from LUXOFT
Transcript of "Making Computer Vision Software Run Fast on Your Embedded Platform," a Presentation from LUXOFT
Copyright © 2016 LUXOFT 1
Alexey Rybakov, LUXOFT
May 3, 2016
Making Computer Vision Software Run Fast on Your Embedded Platform: Art and Science of Optimization
Why LUXOFT is Giving This Talk
Global Software Engineering:
• Low-Power GPU Software
• Custom Vision Software
10,000+ Luxoft software engineers
Our Optimization Projects Covered in This Talk
• Obstruction Removal for Drones
• CAFFE on ARM Mali
• OpenCV on ImgTec PowerVR
• HDR Encoding on GPU-based
• Low-power Motion Stabilization
• GPU-optimized 4K VP9 video codec
• See demos at our booth
[Image captions: Drone Vision, Fast OpenCV, HDR on GPU, Caffe on GPU, Stabilization, Fast 4K Codecs]
Poll
• Qualifying question: Who develops computer vision software?
• Typical situations in embedded SW development:
  • Great new algorithm → Implement
  • Implementation platform: Desktop-class vs. Embedded*
  • Decision making: Delayed vs. Real-time*
  • Performance: Low FPS vs. High FPS*
* Context of this presentation
Embedded Vision: Challenges and Opportunities
• Need: reliable, real-time, on-device decision-making from visual data, implemented on a constrained HW platform (with an exotic architecture)
• What to do:
  1. Map CV pipeline onto HW platform
  2. Rethink system requirements
  3. Rework algorithm logic
  4. Use GPU, DSP, and other aids (properly!)
  5. Code optimization
  6. Know your platform inside out
1. Map CV Pipeline onto HW Platform
Embedded Vision: Pipeline and Hardware
Study and Test HW Platform
Evaluate your platform:
• Hardware features and accelerators, slow/fast memory, power management?
• Run-time support: OS, drivers, OpenCL, CUDA, other frameworks?
• Toolchain: compiler, debugger, profiler, [access to] documentation, optimization guides?
• Available CV frameworks: OpenCV, IPP, FastCV, others?
Benchmark your embedded platform vs. a reference:
• Run simple tests: data copy, access, vectorization, memory use, energy management
• Test whether CV-framework functions are optimized (coverage is often low)
…This will give you a measured optimization goal
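As an illustration of the "run simple tests" advice, here is a minimal sketch of a data-copy micro-benchmark. It is not from the talk: the helper names `best_time` and `copy_bandwidth_gb_s` and the 8 MB buffer size are assumptions chosen for the example, and on a real target the same idea would be repeated for each memory type and each host/accelerator transfer path.

```python
import time

def best_time(fn, repeats=5):
    """Return the fastest of `repeats` wall-clock timings of fn(), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

def copy_bandwidth_gb_s(n_bytes=8 * 1024 * 1024, repeats=5):
    """Estimate host memory-copy throughput in GB/s for a buffer of n_bytes."""
    src = bytearray(n_bytes)
    seconds = best_time(lambda: bytes(src), repeats)  # bytes(src) forces a full copy
    return n_bytes / max(seconds, 1e-9) / 1e9         # guard against a zero reading

if __name__ == "__main__":
    print(f"host copy: {copy_bandwidth_gb_s():.2f} GB/s")
```

Taking the best of several runs reduces noise from scheduling and frequency scaling. Running the same script on the desktop reference and on the embedded board gives a first measured ratio between the two.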
Mapping to Platform: Histogram Example
Option A vs. Option B
[Diagram: two mappings of the pipeline Camera → Histo → Histo equalization → Apply LUT onto CPU and GPU. Legend: GPU processing, CPU processing, memory transfers.]
• Timings shown: Histo* 2 ms; Histo 4.2 ms; 16.2 ms
• Data transfers shown: one 2 MB transfer (HD frame) and three 1 KB transfers
• Memory transfer rates: HOST → GPU = 1.33 GB/s; GPU → HOST = 0.11 GB/s
• SoC: Intel Merrifield platform; device: Dell Venue 3840
* Histogram collection on CPU is more than 2 times faster than on GPU.
** Histogram equalization is single-threaded, iterative histogram processing, so a GPU implementation is not reasonable.
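To make the three pipeline stages concrete, here is a minimal pure-Python sketch of histogram collection, equalization, and LUT application. It is an illustration only, not the implementation measured on the slide. The function names are hypothetical, and the size constants assume an 8-bit 1920x1080 frame and 32-bit bin counters, which matches the roughly 2 MB and 1 KB transfers in the diagram.

```python
FRAME_BYTES = 1920 * 1080   # assumed 8-bit HD frame: about 2 MB
COUNTER_BYTES = 4           # assumed 32-bit bin counters: 256 bins = 1 KB

def collect_histogram(pixels, bins=256):
    """Stage 1: count occurrences of each intensity (touches the full frame)."""
    hist = [0] * bins
    for p in pixels:
        hist[p] += 1
    return hist

def equalization_lut(hist):
    """Stage 2: build an equalizing lookup table from the histogram alone."""
    levels = len(hist)
    total = sum(hist)
    cdf_min = next((c for c in hist if c > 0), 0)  # CDF at the darkest used level
    denom = total - cdf_min
    if denom <= 0:                                 # empty or single-level image
        return list(range(levels))                 # identity mapping
    lut, cdf = [], 0
    for count in hist:
        cdf += count
        lut.append(max(0, round((cdf - cdf_min) * (levels - 1) / denom)))
    return lut

def apply_lut(pixels, lut):
    """Stage 3: remap every pixel through the table (touches the full frame)."""
    return [lut[p] for p in pixels]
```

Only stages 1 and 3 read the whole frame. Stage 2 works on the small histogram and is inherently sequential because of the running sum, which is why moving it between devices costs little but parallelizing it gains little.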
2. Rethink System Requirements!
Optimize System Requirements
• Important concept: “Good enough”
• How does your use case differ from classic/desktop requirements?
Art of “controlled worse”
• What decision latency do you need?
• What resolution/precision?
• Do you need the whole frame or a region?
Rethink Requirements: Obstruction Removal, Drone Edition
• Universal implementation* → Our drone implementation
• Any motion → Linear motion
• Any obstacles → Opaque obstacles
• Have only image data → Use sensor fusion (gyro)
• More than 100X faster!
[Image captions: Camera, Output]
*MIT CSAIL and Google Research, SIGGRAPH 2015
3. Rework Algorithm Logic
Algorithm Optimization Opportunities
• Desktop → Embedded
• High-res → Downsampling / pyramid
• Color → Monochrome or luminance
• Entire frame → Regions of interest only
  • ROI cascading example: HOG to DNN
• Every frame → 1/N + approximation
  • Inter-frame cascading: detection to tracking
• Image only → Sensor fusion
  • Example: gyro + vision for motion estimation
• CPU → Parallelize for GPU
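Two of the substitutions above, color to luminance and high-res to a downsampled pyramid, can be sketched in a few lines. This is an illustrative sketch rather than code from the talk: the function names are hypothetical, the luma weights are the common integer approximation of BT.601, and the pyramid uses a plain 2x2 box average where a production pipeline might use a better filter.

```python
def to_luma(rgb_rows):
    """Convert rows of 8-bit (r, g, b) tuples to 8-bit luma with integer math."""
    # 77/256, 150/256, 29/256 approximate the BT.601 weights 0.299, 0.587, 0.114.
    return [[(77 * r + 150 * g + 29 * b) >> 8 for (r, g, b) in row]
            for row in rgb_rows]

def downsample_2x(gray_rows):
    """Average each 2x2 block (rounded); an odd trailing row/column is dropped."""
    h = len(gray_rows) // 2 * 2
    w = len(gray_rows[0]) // 2 * 2
    return [[(gray_rows[y][x] + gray_rows[y][x + 1] +
              gray_rows[y + 1][x] + gray_rows[y + 1][x + 1] + 2) >> 2
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def build_pyramid(gray_rows, levels):
    """Return up to `levels` images, each half the size of the previous one."""
    pyramid = [gray_rows]
    while (len(pyramid) < levels and len(pyramid[-1]) >= 2
           and len(pyramid[-1][0]) >= 2):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid
```

Each pyramid level has a quarter of the pixels of the level below it, so running a coarse search on a higher level and refining only where needed is one way to apply the "controlled worse" idea.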
Optimized Video Stabilization Algorithm
• Motion vector field only for a 3x3 grid (pyramid downsampling)
• Only shift and rotation
• Inter-frame border reconstruction
• 1000x+ performance
• Real-time 4K UHD on mobile
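Restricting the motion model to shift and rotation means a handful of motion vectors is enough to describe the whole frame. The sketch below shows one standard way to fit such a model, a closed-form 2D least-squares rigid fit, to a 3x3 grid of vectors. The talk does not say how Luxoft estimates the parameters, so the method and the name `fit_shift_rotation` are assumptions made for illustration.

```python
import math

def fit_shift_rotation(points, vectors):
    """Least-squares fit of a rotation about the grid centroid plus a shift.

    points  -- grid positions (x, y) where motion was measured
    vectors -- measured motion (dx, dy) at each position
    Returns ((shift_x, shift_y), angle_in_radians).
    """
    n = len(points)
    moved = [(p[0] + v[0], p[1] + v[1]) for p, v in zip(points, vectors)]
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    mx = sum(q[0] for q in moved) / n
    my = sum(q[1] for q in moved) / n
    s_dot = s_cross = 0.0
    for (px, py), (qx, qy) in zip(points, moved):
        ax, ay = px - cx, py - cy          # position relative to old centroid
        bx, by = qx - mx, qy - my          # position relative to new centroid
        s_dot += ax * bx + ay * by
        s_cross += ax * by - ay * bx
    angle = math.atan2(s_cross, s_dot)     # best-fit rotation angle
    return (mx - cx, my - cy), angle       # shift is the centroid displacement
```

With only two scalar sums and two averages over nine points, the fit itself is negligible next to the cost of computing the motion vectors, which is where the reduced 3x3 grid saves the time.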
4. Use GPU and Other Aid (Properly)
Use GPU. Properly
• Good news: computer vision is very parallelizable
• Bad news: coordination between the CPU and GPU (and other compute devices) is the tricky part
• GPU: what to do (beyond algorithm-to-platform mapping and reworked logic)
  • Follow a few simple rules: memory types, data types, workgroup size, memory alignment
  • Master the art of kernel synchronization: load your cores
  • Use GPU pre-optimized libraries, like OpenCV on some platforms
  • Master OpenCL
• Also explore available ISP or DSP benefits.
GPU, Two Key Concepts
1. Memory Hierarchy
2. Task Synchronization
• Example of both: Large Matrix Transpose
ARM Mali Accelerated CAFFE
Original. All FPS measured on Galaxy S7:
• Run existing DNN framework: CAFFE
• = 0.7 FPS (EIGEN OpenCL library)
CPU optimization (not a through road):
• Optimized version for Android: DNN-optimized OpenBLAS, OpenMP and NEON: +2 FPS
GPU optimizations:
• Better OpenCL implementation on ViennaCL library: +0.5 FPS
• Found bottleneck: SGEMM functions
• Rewrite SGEMM (workgroup size, vectorization, etc.): +4.5 FPS
Final optimized performance: 5-6 FPS
[Chart values]
• Open Source CPU, 1 thread: 0.7 FPS
• Open Source GPU, OpenCL (ViennaCL): 1.2 FPS
• Open Source CPU, multithreaded, NEON: 2.5 FPS
• LUXOFT: 5.4 FPS
ARM Mali Accelerated CAFFE: Benchmarks
[Chart legend: colors = FPS, CPU load, battery charge; lines = CPU, optimized GPU]
VP9 Video Decoder Optimization for GPU
Original CPU Algorithm
• Stages (input frame to output frame): Parsing & Entropy Decode, Motion Compensation, Intra Prediction, Inverse Quant, Inverse Transform, Reconstruction, Loop filtering
• CPU: Superblock-level parallelism
Reworked and Optimized GPU Algorithm
• Stages (input frame to output frame): Parsing & Entropy Decode, Motion Compensation, Intra Prediction, Inverse Quant, Inverse Transform, Reconstruction, Loop filtering
• GPU: Frame-level parallelism
• Uses more memory
[Diagram legend: GPU processing, CPU processing]
Optimization result: 2x-5x FPS depending on bitrate.
Platforms: AMD, Intel, NVidia SoCs
5. Code Optimization
Two Enemies: Code and Data
• Two enemies:
  1. Computation
  2. Data transfers
• Waste of time = waste of energy
• Don’t calculate
  - Use table/lookup functions
  - Use polynomial approximations
• Use classic techniques
  - Like loop unrolling
  - Converting to native data types
• Don’t move data
  - Use local and cache memory
  - Partition/group DRAM access
• Benchmark everything
  - Compiler computation options
  - Memory transfers
Controversial example: the ARM compiler does it automatically; some others don’t.
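The "don’t calculate" advice can be illustrated with two small sketches: a precomputed lookup table that replaces a per-pixel power function, and a short polynomial that replaces a library sine call. Neither example is taken from the talk. The gamma value of 2.2, the names `GAMMA_LUT`, `apply_gamma`, and `fast_sin`, and the choice of a fifth-order Taylor polynomial are assumptions made for illustration.

```python
import math

GAMMA = 2.2  # assumed display gamma, for illustration only

# Built once: 256 entries replace one pow() call per pixel.
GAMMA_LUT = [round(255 * (v / 255) ** (1 / GAMMA)) for v in range(256)]

def apply_gamma(pixels):
    """Gamma-correct 8-bit pixels with a table lookup instead of pow()."""
    return bytes(GAMMA_LUT[p] for p in pixels)

def fast_sin(x):
    """Approximate sin(x) for x in [-pi/2, pi/2] with a 5th-order polynomial.

    Horner form of x - x^3/6 + x^5/120; the worst-case error on this
    interval is below 0.005, reached at the ends of the range.
    """
    x2 = x * x
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 / 120.0))
```

Whether either trick pays off depends on the platform: a table competes with the arithmetic units for cache, and some compilers or hardware already provide fast approximations, which is another reason to benchmark both variants.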
OpenCV on ImgTec PowerVR GPU: Histogram Example
OpenCV local contrast for HD camera adjustment in real time
• Existing OpenCV histogram implementations don’t fit into the 1080p frame processing budget (need 16 ms/frame for the entire algorithm chain to obtain 60 FPS)
Things to do
• Experiment
• Benchmark
• Choose the best method
Optimization Results
Histogram gathering method | Time, ms
OpenCV histogram (CPU) | 7.5
OpenCV histogram (GPU) | 4.4
Luxoft-PowerVR (atomic_add to global memory) | 0.69
Luxoft-PowerVR (atomic_add to local memory) | 7.51
Luxoft-PowerVR (increment at local memory) | 3.28
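The "local memory" rows presumably refer to the usual pattern of keeping a private histogram per workgroup and merging it into the global one afterwards. The sketch below is a sequential Python simulation of that structure only. It is not the Luxoft kernel, the name `histogram_with_local_bins` and the group size of 4096 pixels are hypothetical, and it says nothing about which variant is faster on a given GPU, which the table shows has to be measured.

```python
def histogram_with_local_bins(pixels, group_size=4096, bins=256):
    """Histogram built from per-group private histograms merged at the end.

    Each slice of `group_size` pixels stands in for one GPU workgroup.
    """
    global_hist = [0] * bins
    for start in range(0, len(pixels), group_size):
        local_hist = [0] * bins                  # stands in for on-chip local memory
        for p in pixels[start:start + group_size]:
            local_hist[p] += 1                   # on a GPU: update in local memory
        for b, count in enumerate(local_hist):
            if count:
                global_hist[b] += count          # on a GPU: one global atomic add per bin
    return global_hist
```

The result is identical to a direct histogram; only the memory traffic pattern differs, trading many contended global updates for a local pass plus a short merge.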
Example: Fighting Data Transfers
• Example: “memory tiling”
  Tiled memory layout may give a 2x-3x performance gain for vision algorithms: 1 DRAM read vs. 4 DRAM reads in matrix transpose
• Reference you need to obtain or produce (will vary by CPU/GPU of your choice)
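The idea behind tiling can be sketched with a transpose that walks the matrix in small square blocks, so that both the reads and the writes of one block stay within a small memory region. This is an illustration of the access pattern only: the name `transpose_tiled` and the tile size of 4 are assumptions, and in pure Python there is no speed benefit. The gain quoted on the slide comes from how DRAM bursts and caches behave on real hardware.

```python
def transpose_tiled(matrix, tile=4):
    """Transpose a list-of-rows matrix by visiting it tile by tile."""
    rows, cols = len(matrix), len(matrix[0])
    out = [[0] * rows for _ in range(cols)]
    for r0 in range(0, rows, tile):              # top edge of the current tile
        for c0 in range(0, cols, tile):          # left edge of the current tile
            for r in range(r0, min(r0 + tile, rows)):
                for c in range(c0, min(c0 + tile, cols)):
                    out[c][r] = matrix[r][c]     # all traffic stays inside one tile pair
    return out
```

On a GPU the same structure usually appears as a workgroup loading one tile into local memory with row-wise reads and writing it back transposed, also row-wise.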
6. Know Your Platform Inside Out
Platform Specifics
• Things to do
  • Study documentation and optimization guides for your exact HW
  • Again, test/benchmark a feature before you critically rely on it
• What works for you
  • Modern GPUs and DSPs may implement an entire algorithm in 1 instruction
• What works against you
  • Don’t assume everything will work as documented
  • “Fast” memory …may be slow (like early versions of Snapdragon)
  • Great technology …but no documentation and no code examples (like iOS Metal for compute)
Platform Example: AMD GPU for Frame Interpolation
• Motion vector field (MVF) upsampling, a common task for CV
• OpenCL supports bilinear interpolation of everything
• How to, AMD OpenCL implementation
  • AMD has a QSAD function: the fastest way to compute SAD for blocks
  • Keep the MVF in a 2D image (image2d_t)
  • Use a sampler with CLK_FILTER_LINEAR (bilinear filtering for 2D images)
[Image captions: Basic, Optimized]
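For readers unfamiliar with what a bilinear-filtering sampler does, the sketch below spells out the same interpolation in plain Python for a small motion vector field. It is an illustration, not the AMD or Luxoft code. The names `sample_bilinear` and `upsample_mvf` are hypothetical, the mapping is corner-aligned with edge clamping (hardware samplers use a texel-centre convention that differs by half a texel), and the vectors are not rescaled for the new resolution.

```python
def sample_bilinear(field, x, y):
    """Interpolate a (dx, dy) vector at fractional grid position (x, y)."""
    h, w = len(field), len(field[0])
    x = min(max(x, 0.0), w - 1.0)                # clamp to the grid edges
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0

    def lerp(a, b, t):
        return a + (b - a) * t

    top = [lerp(field[y0][x0][k], field[y0][x1][k], fx) for k in range(2)]
    bottom = [lerp(field[y1][x0][k], field[y1][x1][k], fx) for k in range(2)]
    return tuple(lerp(top[k], bottom[k], fy) for k in range(2))

def upsample_mvf(field, out_w, out_h):
    """Resample a coarse motion vector field to out_w x out_h entries."""
    h, w = len(field), len(field[0])
    sx = (w - 1) / (out_w - 1) if out_w > 1 else 0.0
    sy = (h - 1) / (out_h - 1) if out_h > 1 else 0.0
    return [[sample_bilinear(field, ox * sx, oy * sy) for ox in range(out_w)]
            for oy in range(out_h)]
```

On the GPU the four reads and three interpolations per output vector are done by the texture unit as part of a single image read, which is why keeping the field in an image object rather than a plain buffer can pay off.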
Platform Example: Apple iOS Metal for GPU Compute
Encountered while working on GPU-optimized JPEG-HDR encoding on iPhone
iOS Metal Compute Findings:
• No code examples for compute, weak documentation = black box
• Only 64 GitHub repos, no serious projects
• Xcode profiler does not work with Metal Compute → use a workaround: manual timer-based profiling
• Vector types are not actually fully supported by the compiler → test everything, then use a workaround: a combined approach with scalars and vectors
We still achieved about 3x-4x faster JPEG encode on iPhone … it just took a lot of extra work
Lessons Learned and Resources
Lessons Learned
1. Learn, test, profile, and benchmark every component of your system, including the compiler. Don’t assume.
2. Don’t port 1:1. Rework requirements and algorithm logic too.
3. GPU and other non-CPU compute architectures may give fantastic results.
4. Use parallelization and computer vision frameworks like OpenCL or OpenCV. Rewrite critical parts there as needed.
5. Modern HW platforms implement popular algorithms in one function call. Study platform-specific optimization guides.
6. Sometimes things won’t work as documented. This is normal.
7. Optimization is a mix of art and science. Think outside the box.
Resources
• Embedded Vision Alliance: http://www.embedded-vision.com/
• Platform optimization guides and blog posts from:
  • Altera (now Intel), AMD, ARM, Imagination Technologies, NVidia, Qualcomm, TI
• Luxoft Computer Vision team: [email protected]
Thank you!
LUXOFT Presentation R&D Team:
Aleksandr Bobrovnik
Aleksandr Volkov
Alexey Rybakov
Anton Veselov
Artem Galin
Dmitriy Marenkov
Dmitry Ivanov
Ekaterina Popova
Ihor Starepravo
Ildar Valiev
Marat Gilmutdinov
Nikolay Nemcev
Oleksandr Murovanyi
Sergey Fedorov
Valery Bobrov
Viktor Pasoshnikov
See demos at our booth. And email me too.
Alexey Rybakov
Senior Director, Embedded
LUXOFT, Menlo Park, CA