Exploring Computation-Communication Tradeoffs in Camera Systems

Amrita Mazumdar, Thierry Moreau, Sung Kim, Meghan Cowan,
Armin Alaghi, Luis Ceze, Mark Oskin, Visvesh Sathe

IISWC 2017

Camera applications are a prominent workload with tight constraints. Examples span video surveillance cameras, 3D-360 virtual reality camera rigs, energy-harvesting cameras, and augmented reality glasses; across them, the constraints include large data sizes, real-time processing, light weight, and low power.

Hardware implementations compound the camera system design space. For a camera system such as the hypothetical DogChat™ application, constraints (power, time, size, bandwidth) must be balanced against implementation choices (ASIC, FPGA, DSP, CPU, GPU).

We can represent camera applications as camera processing pipelines to clarify design space exploration. A pipeline is the sequence of blocks that follows the sensor (sensor → block 1 → block 2 → block 3 → block 4), where each block is a function in the application. For DogChat™, the pipeline is sensor → image processing → face detection → feature tracking → image rendering, and the later stages can be offloaded to the cloud.

Developers can trade off between computation and communication costs by choosing where the pipeline splits between in-camera processing and cloud offload: blocks before the split run in the camera, and the remaining blocks run in the cloud on the intermediate data the camera sends up.
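To make that trade-off concrete, here is a minimal sketch in Python that scores each candidate split point by the compute the camera must perform and the data it must upload. The block names follow the DogChat™ example; every number is an illustrative placeholder, not a measurement from the talk.

```python
from dataclasses import dataclass

@dataclass
class Block:
    name: str
    compute_cost: float  # abstract in-camera compute cost per frame (placeholder units)
    output_bytes: int    # size of the data this block emits per frame (placeholder)

# Hypothetical DogChat(TM) pipeline; all numbers are placeholders.
PIPELINE = [
    Block("image processing", compute_cost=2.0, output_bytes=2_000_000),
    Block("face detection",   compute_cost=5.0, output_bytes=50_000),
    Block("feature tracking", compute_cost=3.0, output_bytes=5_000),
    Block("image rendering",  compute_cost=8.0, output_bytes=500_000),
]
SENSOR_BYTES = 6_000_000  # raw frame leaving the sensor (placeholder)

def split_costs(pipeline, split):
    """Run blocks [0, split) in-camera and offload the rest.

    Returns (in-camera compute cost, bytes uploaded per frame)."""
    compute = sum(b.compute_cost for b in pipeline[:split])
    upload = pipeline[split - 1].output_bytes if split > 0 else SENSOR_BYTES
    return compute, upload

for split in range(len(PIPELINE) + 1):
    compute, upload = split_costs(PIPELINE, split)
    last = PIPELINE[split - 1].name if split > 0 else "(nothing in camera)"
    print(f"split after {last:<22} compute={compute:5.1f}  upload={upload:>9,d} B")
```

The split that minimizes upload is not the one that minimizes in-camera compute, which is exactly the tension the rest of the talk explores.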

Optional and required blocks in camera pipelines introduce more tradeoffs. In the DogChat™ pipeline, the sensor, image processing, face detection, feature tracking, and image rendering blocks are required, while blocks such as edge detection, motion detection, and motion tracking are optional additions.

Custom hardware platforms explode the camera system design space: every required or optional block can be mapped to an ASIC, FPGA, DSP, CPU, or GPU. In-camera processing pipelines can help us evaluate these tradeoffs!
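As a rough illustration of how quickly that space grows, the sketch below (hypothetical block and platform lists, matching the DogChat™ example) enumerates every subset of optional blocks combined with every platform assignment.

```python
from itertools import product

REQUIRED = ["image processing", "face detection", "feature tracking", "image rendering"]
OPTIONAL = ["edge detection", "motion detection", "motion tracking"]
PLATFORMS = ["ASIC", "FPGA", "DSP", "CPU", "GPU"]

count = 0
# Every on/off choice for the optional blocks...
for mask in product([False, True], repeat=len(OPTIONAL)):
    blocks = REQUIRED + [b for b, on in zip(OPTIONAL, mask) if on]
    # ...times every platform assignment for the chosen blocks.
    count += len(PLATFORMS) ** len(blocks)

print(f"{count:,} pipeline/platform configurations")  # 135,000 for this toy example
```

Even this toy example yields 135,000 configurations before considering offload split points, which is why a systematic way to evaluate the tradeoffs matters.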

Challenges for modern camera systems: two case studies.

Low-power: face authentication for energy-harvesting cameras with ASIC design (pipeline: motion detection → face detection → neural network).

Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration (pipeline: prep → align → depth → stitch).

Face authentication with energy-harvesting cameras: the WISPCam energy-harvesting camera is powered by RF, captures 1 frame per second, and has a budget of roughly 1 mW of processing per frame. The application answers a question like "Is this Armin?" ✅ for each captured frame.

Face authentication with energy-harvesting cameras, baseline pipeline: the sensor feeds a neural network running on the on-chip CPU, and the other application functions run in the cloud.

CPU-based face authentication neural networks can exceed the WISPCam power budget, so the neural network is moved to ASIC hardware (with the other application functions still in the cloud). Adding optional blocks, motion detection and face detection implemented as on-chip circuits in front of the neural network, can further reduce power consumption.
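A minimal sketch of that prefiltering idea, assuming simple frame differencing for motion detection; the threshold and the downstream callables are placeholders, not the ASIC designs from the talk.

```python
import numpy as np

MOTION_THRESHOLD = 12.0  # mean absolute pixel difference; placeholder value

def motion_detected(prev_frame: np.ndarray, frame: np.ndarray) -> bool:
    """Cheap frame-differencing prefilter: only wake the expensive stages
    when the scene has changed enough to possibly contain a new face."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return float(diff.mean()) > MOTION_THRESHOLD

def authenticate(prev_frame, frame, detect_face, run_nn):
    """detect_face and run_nn stand in for the face-detection and NN blocks."""
    if not motion_detected(prev_frame, frame):
        return None            # skip the frame: no further compute, nothing to transfer
    face = detect_face(frame)  # second prefilter: only faces reach the NN
    if face is None:
        return None
    return run_nn(face)        # authentication score for the cropped face
```

Each prefilter cuts how often the stages after it (and the radio) are active, which is the intuition behind the power savings in the evaluation.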

Exploring design tradeoffs in ASIC accelerators: we evaluated the impact of NN topology and hardware parameters on energy and accuracy, and selected a 400-8-1 network topology with 8-bit datapaths as the best energy/accuracy point.
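For reference, a minimal NumPy sketch of what a 400-8-1 topology with 8-bit datapaths computes. Treating the 400 inputs as a flattened face crop and using a simple scale-based quantization are assumptions for illustration; the hardware's exact arithmetic differs.

```python
import numpy as np

rng = np.random.default_rng(0)

# 400 inputs -> 8 hidden units -> 1 output; weights stored as signed 8-bit.
W1 = rng.integers(-127, 128, size=(8, 400), dtype=np.int8)
W2 = rng.integers(-127, 128, size=(1, 8), dtype=np.int8)
SCALE1, SCALE2 = 1 / 1024.0, 1 / 1024.0  # placeholder dequantization scales

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def authenticate(face_crop_u8: np.ndarray) -> float:
    """face_crop_u8: flattened 400-pixel (e.g. 20x20) 8-bit face crop."""
    x = face_crop_u8.astype(np.int32)                # widen before the MACs
    h = sigmoid(SCALE1 * (W1.astype(np.int32) @ x))  # 8-bit operands, wide accumulator
    h_q = (h * 127).astype(np.int32)                 # re-quantize the hidden activations
    y = sigmoid(SCALE2 * (W2.astype(np.int32) @ h_q))
    return float(y[0])  # score for "is this the enrolled person?"

print(f"match score: {authenticate(rng.integers(0, 256, size=400, dtype=np.uint8)):.3f}")
```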

Two accelerators were designed for the in-camera blocks (many more details in the paper):

Neural network accelerator (SNNAP): a DMA master and bus scheduler feed a processing unit containing SRAM, control logic, and four processing elements (PE0-PE3). Each PE performs 8-bit multiplies and wider additions into accumulation FIFOs, and a shared sigmoid unit produces the 8-bit outputs.

Streaming face detection accelerator: pixel rows stream into an integral-image accumulator (Viola-Jones style); a window buffer feeds feature units that sum rectangular regions from four corner values (a, b, c, d) with 'yes'/'no' weights, and stage/threshold units in the classifier decide whether a window contains a face. Classifier and other algorithm parameters were explored to optimize energy.
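The key trick in the streaming face detector is the integral image, which lets any rectangular feature be evaluated with four lookups regardless of its size. A minimal software sketch of that computation (the example feature and threshold are placeholders, not the synthesized classifier):

```python
import numpy as np

def integral_image(img: np.ndarray) -> np.ndarray:
    """ii[y, x] = sum of img[:y, :x]; zero-padded so rectangle sums need no bounds checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return ii

def rect_sum(ii, y, x, h, w):
    # Four corner lookups (the a, b, c, d values in the diagram).
    return ii[y + h, x + w] - ii[y, x + w] - ii[y + h, x] + ii[y, x]

def two_rect_feature(ii, y, x, h, w, threshold):
    """Left-half minus right-half intensity, compared against a stage threshold
    to pick the weak classifier's 'yes' or 'no' weight."""
    left = rect_sum(ii, y, x, h, w // 2)
    right = rect_sum(ii, y, x + w // 2, h, w // 2)
    return (left - right) > threshold

window = np.random.default_rng(1).integers(0, 256, size=(24, 24))
print(two_rect_feature(integral_image(window), y=4, x=4, h=12, w=12, threshold=0))
```

In hardware the integral image is built row by row as pixels stream in, so only the rows covered by the detection window need to be buffered.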

We synthesized the ASIC accelerators with Synopsys tools, constructed a simulator to evaluate power consumption on real-world video input, and computed the power for computation and for transferring the resulting data for each pipeline configuration.
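A minimal sketch of that accounting (all numbers below are placeholders, not the synthesized results): each configuration's power is the compute power of its in-camera blocks plus the power to transmit whatever data leaves the camera, with the prefilters reducing how often the downstream blocks and the radio are active.

```python
# Placeholder per-block compute power (uW), output size (bytes), and pass rate.
BLOCKS = {
    "motion":      {"power_uw": 20.0,  "out_bytes": 25_000, "pass_rate": 0.10},
    "face detect": {"power_uw": 80.0,  "out_bytes": 400,    "pass_rate": 0.05},
    "NN":          {"power_uw": 150.0, "out_bytes": 1,      "pass_rate": 1.0},
}
SENSOR_BYTES = 25_000   # raw frame size (placeholder)
TX_UJ_PER_BYTE = 0.5    # radio energy per byte (placeholder)
FPS = 1.0               # WISPCam captures about 1 frame per second

def pipeline_power(config):
    """config: ordered list of in-camera blocks; the last block's output is transmitted."""
    active = 1.0                              # fraction of frames reaching the next block
    compute_uw, out_bytes = 0.0, SENSOR_BYTES
    for name in config:
        block = BLOCKS[name]
        compute_uw += active * block["power_uw"]
        out_bytes = block["out_bytes"]
        active *= block["pass_rate"]          # prefilters gate downstream activity
    transfer_uw = active * out_bytes * TX_UJ_PER_BYTE * FPS  # uJ/s == uW
    return compute_uw + transfer_uw, compute_uw, transfer_uw

for config in ([], ["motion"], ["NN"], ["motion", "face detect", "NN"]):
    total, comp, xfer = pipeline_power(config)
    name = " + ".join(["sensor"] + config)
    print(f"{name:<35} total={total:9.1f} uW  (compute {comp:6.1f}, transfer {xfer:9.1f})")
```

Even with toy numbers, the shape of the result matches the table below: raw-frame transfer dominates the no-prefilter configurations, and the prefilters shrink both the compute and the transfer terms.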

Evaluation: which pipeline achieves the lowest overall power?

platform configuration              | compute share | transfer share
sensor                              | <1%           | >99%
sensor + motion                     | <1%           | >99%
sensor + face detect                | 10%           | 90%
sensor + NN                         | 16%           | 84%
sensor + motion + face detect       | >99%          | <1%
sensor + motion + NN                | >99%          | <1%
sensor + face detect + NN           | >99%          | <1%
sensor + motion + face detect + NN  | >99%          | <1%

[Bar chart: total power per configuration on a log scale from 1 µW to 1,000,000 µW; the plotted values are 132, 160, 374, 419, 3,731, 11,340, 257,236, and 782,090 µW.]

Takeaways: prefilters reduce overall power; prefilters combined with the NN use less power than the NN alone; the chart highlights the most power-efficient configuration overall and the most power-efficient configuration that keeps the NN on chip.

In-camera processing for face authentication: in isolation, even well-designed hardware can show sub-optimal performance. Optional blocks (here, motion detection and face detection in front of the neural network) can improve the overall cost beyond the original design, if they balance compute and communication.

The second challenge is low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration, using the pipeline prep → align → depth → stitch.

Producing real-time VR video from a camera rig: 16 GoPro cameras capture 4K video at 30 fps, producing 3.6 GB/s of raw video, and the goal is 3D-360 stereo video output (1.8 GB/s) at 30 fps. Cloud processing prevents real-time video.

The VR pipeline is usually offloaded to the cloud to perform its heavy computation: sensor stream → prep → image align → depth from flow → image stitch → to viewer. Depth from flow dominates the processing time (roughly 5% prep, 20% align, 70% depth from flow, 5% stitch), so it must be accelerated to achieve high performance.
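As a point of reference for what the "depth from flow" block computes, here is a minimal software stand-in using OpenCV's Farneback optical flow between two adjacent cameras on the rig; the baseline and focal-length constants are placeholders, and this is an illustrative sketch, not the accelerated implementation from the talk.

```python
import cv2
import numpy as np

BASELINE_M = 0.06  # spacing between adjacent cameras on the rig (placeholder)
FOCAL_PX = 1200.0  # focal length in pixels (placeholder)

def depth_from_flow(left_gray: np.ndarray, right_gray: np.ndarray) -> np.ndarray:
    """Estimate per-pixel depth from the flow between two adjacent 8-bit grayscale views."""
    # Positional args: flow, pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(left_gray, right_gray, None,
                                        0.5, 3, 21, 3, 5, 1.1, 0)
    disparity = np.maximum(np.abs(flow[..., 0]), 1e-3)  # horizontal flow, clamped
    return (BASELINE_M * FOCAL_PX) / disparity          # depth in meters, roughly
```

Dense flow over 4K frames for every camera pair is regular, data-parallel work, which is why it is the block targeted for FPGA acceleration.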

Offloading before the costly step doesn't avoid compute-communication tradeoffs: the image alignment step produces significant intermediate data (per-frame sizes in the hundreds of MB), so offloading early in the pipeline still transfers about 2x the final output size. [Chart: video frame size (MB) after each pipeline stage, axis from 0 to 600 MB.]

Evaluation: which pipeline achieves the highest frame rate? We designed a simple parallel accelerator for a Xilinx Zynq SoC, simulated for a Virtex UltraScale+ part, evaluated it against CPU and GPU implementations written in Halide, and assumed a 2 GB/s network link for communication (implementation details in the paper).
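A configuration's effective frame rate is limited by the slower of its compute rate and the rate at which the network link can drain the data produced by the last in-camera stage. A minimal sketch of that accounting, using the 2 GB/s link from the slide and per-stage output sizes back-derived from the transfer rates in the table below (so they are estimates, not measurements):

```python
LINK_BYTES_PER_S = 2e9  # assumed 2 GB/s network link

# Approximate per-frame output size after each stage, inferred from the
# transfer rates in the table below (link bandwidth / transfer fps).
STAGE_OUT_BYTES = {
    "sensor": 127e6,
    "prep":   127e6,
    "align":  506e6,  # alignment blows up the intermediate data
    "depth":  380e6,
    "stitch":  63e6,  # final 3D-360 stereo output
}

def effective_fps(compute_fps: float, last_in_camera_stage: str) -> float:
    """Throughput = min(compute rate, link bandwidth / data emitted per frame)."""
    transfer_fps = LINK_BYTES_PER_S / STAGE_OUT_BYTES[last_in_camera_stage]
    return min(compute_fps, transfer_fps)

print(effective_fps(compute_fps=174.0, last_in_camera_stage="stitch"))  # FPGA, full pipeline
print(effective_fps(compute_fps=11.2,  last_in_camera_stage="depth"))   # GPU, offload after depth
```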

Compute and transfer rates per configuration; the effective rate is the minimum of the two:

pipeline configuration                        | compute (fps) | transfer (fps) | effective (fps)
sensor                                        | 100           | 15.8           | 15.8
sensor + prep                                 | 100           | 15.8           | 15.8
sensor + prep + align                         | 100           | 3.95           | 3.95
sensor + prep + align + depth (CPU)           | 0.09          | 5.27           | 0.09
sensor + prep + align + depth (GPU)           | 11.2          | 5.27           | 5.27
sensor + prep + align + depth (FPGA)          | 174           | 5.27           | 5.27
sensor + prep + align + depth (CPU) + stitch  | 0.09          | 31.6           | 0.09
sensor + prep + align + depth (GPU) + stitch  | 11.2          | 31.6           | 11.2
sensor + prep + align + depth (FPGA) + stitch | 174           | 31.6           | 31.6

Takeaways: the CPU results are the slowest; the data size after depth is still too big for offloading to reach real time; the full pipeline with the FPGA is the only configuration that achieves a real-time (30 fps) frame rate.

In-camera processing for real-time VR video: considering computation and communication together highlights benefits that are not seen when they are considered separately. For VR video, in-camera processing pipelines (prep → align → depth → stitch) enable applications that could not be achieved via cloud offload at all.

In-camera processing pipelines help characterize camera systems. In-camera pipelines let us evaluate computation-communication trade-offs; hardware-software co-design balances constraints and optimizes designs; and the best performance comes from considering bottlenecks in the context of the full system.

Thank you!