Exploring Computation-Communication Tradeoffs in Camera Systems

Amrita Mazumdar, Thierry Moreau, Sung Kim, Meghan Cowan,
Armin Alaghi, Luis Ceze, Mark Oskin, Visvesh Sathe

IISWC 2017

Camera applications are a prominent workload with tight constraints. Examples span video surveillance cameras, 3D-360 virtual reality camera rigs, energy-harvesting cameras, and augmented reality glasses; across them, the constraints include large data sizes, real-time processing, light weight, and low power.

Hardware implementations compound the camera system design space. For a camera system such as the hypothetical DogChat™ application, constraints (power, time, size, bandwidth) must be balanced against implementation choices (ASIC, FPGA, DSP, CPU, GPU).

We can represent camera applications as camera processing pipelines to clarify design space exploration. A pipeline is the sequence of blocks that follows the sensor (sensor → block 1 → block 2 → block 3 → block 4), where each block is a function in the application. For DogChat™, the pipeline is sensor → image processing → face detection → feature tracking → image rendering, and the later stages can be offloaded to the cloud.

Developers can trade off between computation and communication costs by choosing where the pipeline splits between in-camera processing and cloud offload: blocks before the split run in the camera, and the remaining blocks run in the cloud on the intermediate data the camera sends up.
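To make that trade-off concrete, here is a minimal sketch in Python that scores each candidate split point by the compute the camera must perform and the data it must upload. The block names follow the DogChat™ example; every number is an illustrative placeholder, not a measurement from the talk.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    compute_cost: float  # abstract in-camera compute cost per frame (placeholder units)
    output_bytes: int    # size of the data this block emits per frame (placeholder)

# Hypothetical DogChat(TM) pipeline; all numbers are placeholders.
PIPELINE = [
    Block("image processing", compute_cost=2.0, output_bytes=2_000_000),
    Block("face detection",   compute_cost=5.0, output_bytes=50_000),
    Block("feature tracking", compute_cost=3.0, output_bytes=5_000),
    Block("image rendering",  compute_cost=8.0, output_bytes=500_000),
]
SENSOR_BYTES = 6_000_000  # raw frame leaving the sensor (placeholder)

def split_costs(pipeline, split):
    """Run blocks [0, split) in-camera and offload the rest.

    Returns (in-camera compute cost, bytes uploaded per frame)."""
    compute = sum(b.compute_cost for b in pipeline[:split])
    upload = pipeline[split - 1].output_bytes if split > 0 else SENSOR_BYTES
    return compute, upload

for split in range(len(PIPELINE) + 1):
    compute, upload = split_costs(PIPELINE, split)
    last = PIPELINE[split - 1].name if split > 0 else "(nothing in camera)"
    print(f"split after {last:<22} compute={compute:5.1f}  upload={upload:>9,d} B")
```

The split that minimizes upload is not the one that minimizes in-camera compute, which is exactly the tension the rest of the talk explores.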

Optional and required blocks in camera pipelines introduce more tradeoffs. In the DogChat™ pipeline, the sensor, image processing, face detection, feature tracking, and image rendering blocks are required, while blocks such as edge detection, motion detection, and motion tracking are optional additions.

Custom hardware platforms explode the camera system design space: every required or optional block can be mapped to an ASIC, FPGA, DSP, CPU, or GPU. In-camera processing pipelines can help us evaluate these tradeoffs!
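As a rough illustration of how quickly that space grows, the sketch below (hypothetical block and platform lists, matching the DogChat™ example) enumerates every subset of optional blocks combined with every platform assignment.

```python
from itertools import product

REQUIRED = ["image processing", "face detection", "feature tracking", "image rendering"]
OPTIONAL = ["edge detection", "motion detection", "motion tracking"]
PLATFORMS = ["ASIC", "FPGA", "DSP", "CPU", "GPU"]

count = 0
# Every on/off choice for the optional blocks...
for mask in product([False, True], repeat=len(OPTIONAL)):
    blocks = REQUIRED + [b for b, on in zip(OPTIONAL, mask) if on]
    # ...times every platform assignment for the chosen blocks.
    count += len(PLATFORMS) ** len(blocks)

print(f"{count:,} pipeline/platform configurations")  # 135,000 for this toy example
```

Even this toy example yields 135,000 configurations before considering offload split points, which is why a systematic way to evaluate the tradeoffs matters.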

Challenges for modern camera systems: two case studies.

Low-power: face authentication for energy-harvesting cameras with ASIC design (pipeline: motion detection → face detection → neural network).

Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (pipeline: prep → align → depth → stitch).

Face authentication with energy-harvesting cameras: the WISPCam energy-harvesting camera is powered by RF, captures 1 frame per second, and has a budget of roughly 1 mW of processing per frame. The application answers a question like "Is this Armin?" ✅ for each captured frame.

Face authentication with energy-harvesting cameras, baseline pipeline: the sensor feeds a neural network running on the on-chip CPU, and the other application functions run in the cloud.

CPU-based face authentication neural networks can exceed the WISPCam power budget, so the neural network is moved to ASIC hardware (with the other application functions still in the cloud). Adding optional blocks, motion detection and face detection implemented as on-chip circuits in front of the neural network, can further reduce power consumption.
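A minimal sketch of that prefiltering idea, assuming simple frame differencing for motion detection; the threshold and the downstream callables are placeholders, not the ASIC designs from the talk.

```python
import numpy as np

MOTION_THRESHOLD = 12.0  # mean absolute pixel difference; placeholder value

def motion_detected(prev_frame: np.ndarray, frame: np.ndarray) -> bool:
    """Cheap frame-differencing prefilter: only wake the expensive stages
    when the scene has changed enough to possibly contain a new face."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return float(diff.mean()) > MOTION_THRESHOLD

def authenticate(prev_frame, frame, detect_face, run_nn):
    """detect_face and run_nn stand in for the face-detection and NN blocks."""
    if not motion_detected(prev_frame, frame):
        return None            # skip the frame: no further compute, nothing to transfer
    face = detect_face(frame)  # second prefilter: only faces reach the NN
    if face is None:
        return None
    return run_nn(face)        # authentication score for the cropped face
```

Each prefilter cuts how often the stages after it (and the radio) are active, which is the intuition behind the power savings in the evaluation.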

Exploring design tradeoffs in ASIC accelerators: we evaluated the impact of NN topology and hardware parameters on energy and accuracy, and selected a 400-8-1 network topology with 8-bit datapaths as the best energy/accuracy point.
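For reference, a minimal NumPy sketch of what a 400-8-1 topology with 8-bit datapaths computes. Treating the 400 inputs as a flattened face crop and using a simple scale-based quantization are assumptions for illustration; the hardware's exact arithmetic differs.

```python
import numpy as np

rng = np.random.default_rng(0)

# 400 inputs -> 8 hidden units -> 1 output; weights stored as signed 8-bit.
W1 = rng.integers(-127, 128, size=(8, 400), dtype=np.int8)
W2 = rng.integers(-127, 128, size=(1, 8), dtype=np.int8)
SCALE1, SCALE2 = 1 / 1024.0, 1 / 1024.0  # placeholder dequantization scales

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def authenticate(face_crop_u8: np.ndarray) -> float:
    """face_crop_u8: flattened 400-pixel (e.g. 20x20) 8-bit face crop."""
    x = face_crop_u8.astype(np.int32)                # widen before the MACs
    h = sigmoid(SCALE1 * (W1.astype(np.int32) @ x))  # 8-bit operands, wide accumulator
    h_q = (h * 127).astype(np.int32)                 # re-quantize the hidden activations
    y = sigmoid(SCALE2 * (W2.astype(np.int32) @ h_q))
    return float(y[0])  # score for "is this the enrolled person?"

print(f"match score: {authenticate(rng.integers(0, 256, size=400, dtype=np.uint8)):.3f}")
```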

Two accelerators were designed for the in-camera blocks (many more details in the paper):

Neural network accelerator (SNNAP): a DMA master and bus scheduler feed a processing unit containing SRAM, control logic, and four processing elements (PE0-PE3). Each PE performs 8-bit multiplies and wider additions into accumulation FIFOs, and a shared sigmoid unit produces the 8-bit outputs.

Streaming face detection accelerator: pixel rows stream into an integral-image accumulator (Viola-Jones style); a window buffer feeds feature units that sum rectangular regions from four corner values (a, b, c, d) with 'yes'/'no' weights, and stage/threshold units in the classifier decide whether a window contains a face. Classifier and other algorithm parameters were explored to optimize energy.
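The key trick in the streaming face detector is the integral image, which lets any rectangular feature be evaluated with four lookups regardless of its size. A minimal software sketch of that computation (the example feature and threshold are placeholders, not the synthesized classifier):

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """ii[y, x] = sum of img[:y, :x]; zero-padded so rectangle sums need no bounds checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    # Four corner lookups (the a, b, c, d values in the diagram).
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, y, x, h, w, threshold):
    """Left-half minus right-half intensity, compared against a stage threshold
    to pick the weak classifier's 'yes' or 'no' weight."""
    left = rect_sum(ii, y, x, h, w // 2)
    right = rect_sum(ii, y, x + w // 2, h, w // 2)
    return (left - right) > threshold

window = np.random.default_rng(1).integers(0, 256, size=(24, 24))
print(two_rect_feature(integral_image(window), y=4, x=4, h=12, w=12, threshold=0))
```

In hardware the integral image is built row by row as pixels stream in, so only the rows covered by the detection window need to be buffered.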

We synthesized the ASIC accelerators with Synopsys tools, constructed a simulator to evaluate power consumption on real-world video input, and computed the power for computation and for transferring the resulting data for each pipeline configuration.
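A minimal sketch of that accounting (all numbers below are placeholders, not the synthesized results): each configuration's power is the compute power of its in-camera blocks plus the power to transmit whatever data leaves the camera, with the prefilters reducing how often the downstream blocks and the radio are active.

```python
# Placeholder per-block compute power (uW), output size (bytes), and pass rate.
BLOCKS = {
    "motion":      {"power_uw": 20.0,  "out_bytes": 25_000, "pass_rate": 0.10},
    "face detect": {"power_uw": 80.0,  "out_bytes": 400,    "pass_rate": 0.05},
    "NN":          {"power_uw": 150.0, "out_bytes": 1,      "pass_rate": 1.0},
}
SENSOR_BYTES = 25_000   # raw frame size (placeholder)
TX_UJ_PER_BYTE = 0.5    # radio energy per byte (placeholder)
FPS = 1.0               # WISPCam captures about 1 frame per second

def pipeline_power(config):
    """config: ordered list of in-camera blocks; the last block's output is transmitted."""
    active = 1.0                              # fraction of frames reaching the next block
    compute_uw, out_bytes = 0.0, SENSOR_BYTES
    for name in config:
        block = BLOCKS[name]
        compute_uw += active * block["power_uw"]
        out_bytes = block["out_bytes"]
        active *= block["pass_rate"]          # prefilters gate downstream activity
    transfer_uw = active * out_bytes * TX_UJ_PER_BYTE * FPS  # uJ/s == uW
    return compute_uw + transfer_uw, compute_uw, transfer_uw

for config in ([], ["motion"], ["NN"], ["motion", "face detect", "NN"]):
    total, comp, xfer = pipeline_power(config)
    name = " + ".join(["sensor"] + config)
    print(f"{name:<35} total={total:9.1f} uW  (compute {comp:6.1f}, transfer {xfer:9.1f})")
```

Even with toy numbers, the shape of the result matches the table below: raw-frame transfer dominates the no-prefilter configurations, and the prefilters shrink both the compute and the transfer terms.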

Evaluation: which pipeline achieves the lowest overall power?

platform configuration              | compute share | transfer share
sensor                              | <1%           | >99%
sensor + motion                     | <1%           | >99%
sensor + face detect                | 10%           | 90%
sensor + NN                         | 16%           | 84%
sensor + motion + face detect       | >99%          | <1%
sensor + motion + NN                | >99%          | <1%
sensor + face detect + NN           | >99%          | <1%
sensor + motion + face detect + NN  | >99%          | <1%

[Bar chart: total power per configuration on a log scale from 1 µW to 1,000,000 µW; the plotted values are 132, 160, 374, 419, 3,731, 11,340, 257,236, and 782,090 µW.]

Takeaways: prefilters reduce overall power; prefilters combined with the NN use less power than the NN alone; the chart highlights the most power-efficient configuration overall and the most power-efficient configuration that keeps the NN on chip.

In-camera processing for face authentication: in isolation, even well-designed hardware can show sub-optimal performance. Optional blocks (here, motion detection and face detection in front of the neural network) can improve the overall cost beyond the original design, if they balance compute and communication.

The second challenge is low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration, using the pipeline prep → align → depth → stitch.

Producing real-time VR video from a camera rig: 16 GoPro cameras capture 4K video at 30 fps, producing 3.6 GB/s of raw video, and the goal is 3D-360 stereo video output (1.8 GB/s) at 30 fps. Cloud processing prevents real-time video.

The VR pipeline is usually offloaded to the cloud to perform its heavy computation: sensor stream → prep → image align → depth from flow → image stitch → to viewer. Depth from flow dominates the processing time (roughly 5% prep, 20% align, 70% depth from flow, 5% stitch), so it must be accelerated to achieve high performance.
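As a point of reference for what the "depth from flow" block computes, here is a minimal software stand-in using OpenCV's Farneback optical flow between two adjacent cameras on the rig; the baseline and focal-length constants are placeholders, and this is an illustrative sketch, not the accelerated implementation from the talk.

```python
import cv2
import numpy as np

BASELINE_M = 0.06  # spacing between adjacent cameras on the rig (placeholder)
FOCAL_PX = 1200.0  # focal length in pixels (placeholder)

def depth_from_flow(left_gray: np.ndarray, right_gray: np.ndarray) -> np.ndarray:
    """Estimate per-pixel depth from the flow between two adjacent 8-bit grayscale views."""
    # Positional args: flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(left_gray, right_gray, None,
                                        0.5, 3, 21, 3, 5, 1.1, 0)
    disparity = np.maximum(np.abs(flow[..., 0]), 1e-3)  # horizontal flow, clamped
    return (BASELINE_M * FOCAL_PX) / disparity          # depth in meters, roughly
```

Dense flow over 4K frames for every camera pair is regular, data-parallel work, which is why it is the block targeted for FPGA acceleration.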

Offloading before the costly step doesn't avoid compute-communication tradeoffs: the image alignment step produces significant intermediate data (per-frame sizes in the hundreds of MB), so offloading early in the pipeline still transfers about 2x the final output size. [Chart: video frame size (MB) after each pipeline stage, axis from 0 to 600 MB.]

Evaluation: which pipeline achieves the highest frame rate? We designed a simple parallel accelerator for a Xilinx Zynq SoC, simulated for a Virtex UltraScale+ part, evaluated it against CPU and GPU implementations written in Halide, and assumed a 2 GB/s network link for communication (implementation details in the paper).
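A configuration's effective frame rate is limited by the slower of its compute rate and the rate at which the network link can drain the data produced by the last in-camera stage. A minimal sketch of that accounting, using the 2 GB/s link from the slide and per-stage output sizes back-derived from the transfer rates in the table below (so they are estimates, not measurements):

```python
LINK_BYTES_PER_S = 2e9  # assumed 2 GB/s network link

# Approximate per-frame output size after each stage, inferred from the
# transfer rates in the table below (link bandwidth / transfer fps).
STAGE_OUT_BYTES = {
    "sensor": 127e6,
    "prep":   127e6,
    "align":  506e6,  # alignment blows up the intermediate data
    "depth":  380e6,
    "stitch":  63e6,  # final 3D-360 stereo output
}

def effective_fps(compute_fps: float, last_in_camera_stage: str) -> float:
    """Throughput = min(compute rate, link bandwidth / data emitted per frame)."""
    transfer_fps = LINK_BYTES_PER_S / STAGE_OUT_BYTES[last_in_camera_stage]
    return min(compute_fps, transfer_fps)

print(effective_fps(compute_fps=174.0, last_in_camera_stage="stitch"))  # FPGA, full pipeline
print(effective_fps(compute_fps=11.2,  last_in_camera_stage="depth"))   # GPU, offload after depth
```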

Compute and transfer rates per configuration; the effective rate is the minimum of the two:

pipeline configuration                        | compute (fps) | transfer (fps) | effective (fps)
sensor                                        | 100           | 15.8           | 15.8
sensor + prep                                 | 100           | 15.8           | 15.8
sensor + prep + align                         | 100           | 3.95           | 3.95
sensor + prep + align + depth (CPU)           | 0.09          | 5.27           | 0.09
sensor + prep + align + depth (GPU)           | 11.2          | 5.27           | 5.27
sensor + prep + align + depth (FPGA)          | 174           | 5.27           | 5.27
sensor + prep + align + depth (CPU) + stitch  | 0.09          | 31.6           | 0.09
sensor + prep + align + depth (GPU) + stitch  | 11.2          | 31.6           | 11.2
sensor + prep + align + depth (FPGA) + stitch | 174           | 31.6           | 31.6

Takeaways: the CPU results are the slowest; the data size after depth is still too big for offloading to reach real time; the full pipeline with the FPGA is the only configuration that achieves a real-time (30 fps) frame rate.

In-camera processing for real-time VR video: considering computation and communication together highlights benefits that are not seen when they are considered separately. For VR video, in-camera processing pipelines (prep → align → depth → stitch) enable applications that could not be achieved via cloud offload at all.

In-camera processing pipelines help characterize camera systems. In-camera pipelines let us evaluate computation-communication trade-offs; hardware-software co-design balances constraints and optimizes designs; and the best performance comes from considering bottlenecks in the context of the full system.

Thank you!