Page 1

Exploring Computation-Communication Tradeoffs in Camera Systems

Amrita Mazumdar, Thierry Moreau, Sung Kim, Meghan Cowan, Armin Alaghi, Luis Ceze, Mark Oskin, Visvesh Sathe

IISWC 2017

Page 2

Camera applications are a prominent workload with tight constraints

Cameras shown: video surveillance cameras, 3D-360 virtual reality camera rig, energy harvesting camera, augmented reality glasses

Constraint labels attached to the cameras: large data size (x2), light weight (x2), real-time processing (x3), low-power (x2)

Page 3

Hardware implementations compound the camera system design space

Camera system: DogChat™
Constraints: power, time, size, bandwidth
Implementations: ASIC, FPGA, DSP, CPU, GPU

Page 4

We can represent camera applications as camera processing pipelines to clarify design space exploration

Pipeline: sensor → block 1 → block 2 → block 3 → block 4

The blocks are the functions in the application

Page 5

We can represent camera applications as camera processing pipelines to clarify design space exploration

DogChat™ pipeline: sensor → image processing → face detection → feature tracking → image rendering

Page 6

Developers can trade off between computation and communication costs

DogChat™ pipeline: sensor → image processing → face detection → feature tracking → image rendering

Placement: offloaded to cloud

Page 7

Developers can trade off between computation and communication costs

DogChat™ pipeline: sensor → image processing → face detection → feature tracking → image rendering

Pipeline split into: in-camera processing | offloaded to cloud
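
The tradeoff on this slide can be sketched as a small cost model. This is a minimal illustration, not the authors' tooling: the `Block` class, the `split_costs` function, the cost units, and every number below are hypothetical. For each possible split point it adds the compute cost of the blocks kept in the camera to the cost of sending whatever data crosses to the cloud.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    compute_cost: float  # hypothetical cost of running this block in-camera
    output_size: float   # hypothetical size of the data this block emits


def split_costs(sensor_output_size, blocks, cost_per_unit_sent):
    """Return (blocks_in_camera, compute, communication, total) per split point."""
    results = []
    for k in range(len(blocks) + 1):
        in_camera = blocks[:k]
        compute = sum(b.compute_cost for b in in_camera)
        # Data crossing the link is the output of the last in-camera block,
        # or the raw sensor output when nothing runs in the camera.
        sent = in_camera[-1].output_size if in_camera else sensor_output_size
        communication = sent * cost_per_unit_sent
        results.append((k, compute, communication, compute + communication))
    return results


# Hypothetical numbers, chosen only to show the shape of the tradeoff.
pipeline = [
    Block("image processing", 2.0, 100.0),
    Block("face detection", 5.0, 10.0),
    Block("feature tracking", 4.0, 8.0),
    Block("image rendering", 6.0, 50.0),
]
costs = split_costs(100.0, pipeline, 0.5)
best = min(costs, key=lambda r: r[3])
print(f"cheapest split keeps {best[0]} block(s) in-camera, total cost {best[3]}")
```

With these made-up numbers the cheapest split sits in the middle of the pipeline, after the block that shrinks the data the most.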

Page 8

Optional and required blocks in camera pipelines introduce more tradeoffs

Required: sensor → image processing → face detection → feature tracking → image rendering

Optional: edge detection, motion detection, motion tracking

Page 9

Custom hardware platforms explode the camera system design space

Required: sensor → image processing → face detection → feature tracking → image rendering

Optional: edge detection, motion detection, motion tracking

Platform labels on the blocks: ASIC, FPGA, DSP, CPU, GPU (FPGA and DSP each appear on more than one block)

Page 10

Custom hardware platforms explode the camera system design space

In-camera processing pipelines can help us evaluate these tradeoffs!

Page 11

Challenges for modern camera systems

Low-power: face authentication for energy-harvesting cameras with ASIC design
Pipeline: motion detection → face detection → neural network

Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration
Pipeline: prep → align → depth → stitch

Page 13

Face authentication with energy harvesting cameras

WISP Cam energy-harvesting camera

powered by RF

1 frame / second

~1 mW processing / frame

Page 14

Face authentication with energy harvesting cameras

Is this Armin? ✅

Page 15

CPU-based face authentication neural networks can exceed WISPcam power budgets

Pipeline: sensor → neural network (on-chip CPU) → other application functions (cloud)

Page 16

CPU-based face authentication neural networks can exceed WISPcam power budgets

Pipeline: sensor → neural network (ASIC hardware) → other application functions (cloud)

Optional blocks (on-chip circuit): motion detection, face detection

Adding optional blocks can reduce power consumption for a neural network

Page 17

Exploring design tradeoffs in ASIC accelerators

Neural network

Evaluated NN topology and hardware impact on energy and accuracy

Selected a 400-8-1 network topology and used 8-bit datapaths for optimal energy/accuracy point

[Block diagram: SNNAP accelerator with DMA master, bus scheduler, and a processing unit (PU) containing control, SRAM, processing elements PE0 to PE3 (multiply and add datapaths fed by d_in, weight, and offset, with an accumulator FIFO), and a sigmoid unit (SIG) with a sigmoid FIFO producing d_out; signal widths of 8, 16, and 26 bits are marked]

Face detection

Streaming face detection accelerator

Explored classifier and other algorithm parameters to optimize energy

[Block diagram: VJ accelerator with an integral image accumulator (pixels in, input row, previous row, integral row output), a window buffer, and a classifier unit made of feature units (weighted sums of integral-image corner values a, b, c, d), a threshold unit (selects a 'yes' weight or a 'no' weight), and a stage unit]

Many more details in paper!
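
To make the "400-8-1 topology with 8-bit datapaths" concrete, here is a minimal sketch of a forward pass through such a network with values quantized to signed 8-bit integers. It is an illustration under stated assumptions, not the SNNAP design: the function names, the fixed scale of 64, the omission of bias/offset terms, and the random weights are all hypothetical.

```python
import math
import random


def quantize_8bit(values, scale):
    """Round to signed 8-bit integers in [-128, 127] using a fixed scale (assumed)."""
    return [max(-128, min(127, round(v * scale))) for v in values]


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def layer(inputs_q, weights_q, in_scale, w_scale):
    """Fully connected layer: integer multiply-accumulate, then sigmoid."""
    outputs = []
    for row in weights_q:
        acc = sum(w * x for w, x in zip(row, inputs_q))  # wide integer accumulator
        outputs.append(sigmoid(acc / (in_scale * w_scale)))
    return outputs


def forward_400_8_1(pixels, w_hidden, w_out, scale=64):
    """400 inputs -> 8 hidden neurons -> 1 output, with 8-bit inputs and weights."""
    x_q = quantize_8bit(pixels, scale)
    hidden = layer(x_q, [quantize_8bit(r, scale) for r in w_hidden], scale, scale)
    h_q = quantize_8bit(hidden, scale)
    out = layer(h_q, [quantize_8bit(r, scale) for r in w_out], scale, scale)
    return out[0]


# Hypothetical usage: a 20x20 window flattened to 400 values in [0, 1].
random.seed(0)
window = [random.random() for _ in range(400)]
w_hidden = [[random.uniform(-1, 1) for _ in range(400)] for _ in range(8)]
w_out = [[random.uniform(-1, 1) for _ in range(8)]]
score = forward_400_8_1(window, w_hidden, w_out)
```

The small hidden layer keeps the multiply-accumulate count low (400 x 8 + 8 products per frame), which is one reason a narrow topology can be attractive under a tight power budget.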

Page 18

Evaluation

Which pipeline achieves the lowest overall power?

Synthesized ASIC accelerators in Synopsys

Constructed simulator to evaluate power consumption on real-world video input

Computed power for computation and transfer of resulting data for each pipeline configuration

Page 19

Which pipeline achieves the lowest power consumption?

platform configuration                  compute   transfer
sensor                                  <1%       >99%
sensor → motion                         <1%       >99%
sensor → face detect                    10%       90%
sensor → NN                             16%       84%
sensor → motion → face detect           >99%      <1%
sensor → motion → NN                    >99%      <1%
sensor → face detect → NN               >99%      <1%
sensor → motion → face detect → NN      >99%      <1%

(compute and transfer are ratios)

[Bar chart: power (µW) per configuration, log scale from 1 to 1,000,000; bar labels: 160; 419; 257,236; 132; 782,090; 374; 3,731; 11,340]

Page 20

Which pipeline achieves the lowest power consumption?

Prefilters reduce overall power

Page 21

Which pipeline achieves the lowest power consumption?

Compared with just using NN, prefilters with NN use less power

Page 22

Which pipeline achieves the lowest power consumption?

Annotations mark two configurations: the most power-efficient, and the most power-efficient with on-chip NN

Page 23

In-camera processing for face authentication

In isolation, even well-designed hardware can show sub-optimal performance

Optional blocks can improve the overall cost, if they balance compute and communication better than the original design

Pipeline: motion detection → face detection → neural network
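
One plausible way to see why cheap optional blocks can lower the overall cost is to treat each block as a filter that only lets some frames through, so later, more expensive blocks and the radio run less often. The sketch below is an assumption about the mechanism, not the authors' simulator: `expected_power`, the per-stage powers, the pass rates, and the transfer power are all hypothetical.

```python
def expected_power(stages, transfer_power):
    """Average power of a gated pipeline.

    stages: list of (name, active_power, pass_rate) in pipeline order.
    Each stage runs only on frames that passed every earlier stage, and
    data is transferred only for frames that pass the last stage.
    """
    reach = 1.0    # fraction of frames that reach the current stage
    compute = 0.0
    for _name, active_power, pass_rate in stages:
        compute += reach * active_power
        reach *= pass_rate
    transfer = reach * transfer_power
    return compute, transfer, compute + transfer


# Hypothetical numbers in µW, for illustration only.
nn_only = expected_power([("neural network", 400.0, 0.05)], 1000.0)
with_prefilters = expected_power(
    [
        ("motion detection", 10.0, 0.1),
        ("face detection", 50.0, 0.2),
        ("neural network", 400.0, 0.5),
    ],
    1000.0,
)
print(f"NN only: {nn_only[2]:.1f}, with prefilters: {with_prefilters[2]:.1f}")
```

In this toy setting the prefilters add compute of their own but cut the average cost of the neural network and of the transfer by more than they add.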

Page 24

Challenges for modern camera systems

Low-power: face authentication for energy-harvesting cameras with ASIC design
Pipeline: motion detection → face detection → neural network

Low latency: real-time virtual reality for multi-camera rigs with FPGA acceleration
Pipeline: prep → align → depth → stitch

Page 26

Producing real-time VR video from a camera rig

Input: 16 GoPro cameras, 4K at 30 fps, 3.6 GB/s raw video

Goal: 30 fps 3D-360 stereo video, 1.8 GB/s output

Page 27

Producing real-time VR video from a camera rig

Cloud processing prevents real-time video
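
A back-of-the-envelope check shows why shipping raw video to the cloud cannot keep up. The data rates and the 30 fps target come from the deck; the 2 GB/s link is the value assumed in the deck's evaluation, and treating 1 GB as 1000 MB is an assumption of this sketch, so the results need not match the deck's own transfer figures exactly.

```python
RAW_RATE_GB_S = 3.6     # raw video from the 16-camera rig (from the deck)
OUTPUT_RATE_GB_S = 1.8  # 3D-360 stereo output (from the deck)
TARGET_FPS = 30         # real-time goal (from the deck)
LINK_GB_S = 2.0         # network link assumed in the deck's evaluation
MB_PER_GB = 1000        # assumption: decimal units


def per_frame_mb(rate_gb_s, fps):
    """Data per frame when a stream of rate_gb_s runs at fps frames per second."""
    return rate_gb_s * MB_PER_GB / fps


def max_transfer_fps(frame_mb, link_gb_s):
    """Highest frame rate the link can sustain for frames of frame_mb megabytes."""
    return link_gb_s * MB_PER_GB / frame_mb


raw_frame_mb = per_frame_mb(RAW_RATE_GB_S, TARGET_FPS)
out_frame_mb = per_frame_mb(OUTPUT_RATE_GB_S, TARGET_FPS)
raw_fps = max_transfer_fps(raw_frame_mb, LINK_GB_S)
out_fps = max_transfer_fps(out_frame_mb, LINK_GB_S)

print(f"raw: {raw_frame_mb:.0f} MB/frame, link sustains {raw_fps:.1f} fps")
print(f"output: {out_frame_mb:.0f} MB/frame, link sustains {out_fps:.1f} fps")
```

Under these assumptions the link carries the raw stream at roughly half the target rate, while the finished output fits within it.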

Page 28

VR pipeline is usually offloaded to perform heavy computation

Pipeline (offloaded to cloud): sensor stream → prep → image align → depth from flow → image stitch → to viewer

Processing time: prep 5%, image align 20%, depth from flow 70%, image stitch 5%

Need to accelerate "depth from flow" to achieve high performance

Page 29

Offloading before the costly step doesn't avoid compute-communication tradeoffs

Pipeline: sensor stream → prep → image align → depth from flow → image stitch → to viewer

[Bar chart: video frame size (MB) at each pipeline stage; axis from 0 to 600]

Image alignment step produces significant intermediate data

Offloading early on is still 2x final output size

Page 30

Evaluation

Which pipeline achieves the highest frame rate?

Designed a simple parallel accelerator for Xilinx Zynq SoC, simulated for Virtex UltraScale+

Evaluated against CPU and GPU implementations in Halide

Assumed 2 GB/s network link for communication

Implementation details in paper

Page 31

Which pipeline achieves the highest frame rate?

pipeline configuration                           compute   transfer   effective
sensor                                           100       15.8       15.8
sensor → prep                                    100       15.8       15.8
sensor → prep → align                            100       3.95       4.0
sensor → prep → align → depth (CPU)              0.09      5.27       0.09
sensor → prep → align → depth (GPU)              11.2      5.27       5.3
sensor → prep → align → depth (FPGA)             174       5.27       5.3
sensor → prep → align → depth (CPU) → stitch     0.09      31.6       0.09
sensor → prep → align → depth (GPU) → stitch     11.2      31.6       11.2
sensor → prep → align → depth (FPGA) → stitch    174       31.6       31.6

(all values in FPS; effective FPS is plotted on an axis from 0 to 35)
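
The effective frame rates shown for this slide are consistent with taking, for each configuration, the smaller of its compute and transfer rates, since the slower of the two limits the pipeline. The sketch below encodes that reading; the compute and transfer values are the ones on the slide, while `effective_fps`, `CONFIGS`, and the 30 fps real-time threshold check are names and choices made for this illustration.

```python
TARGET_FPS = 30  # real-time goal stated earlier in the deck

# (configuration, compute FPS, transfer FPS), as listed on the slide
CONFIGS = [
    ("sensor", 100.0, 15.8),
    ("sensor, prep", 100.0, 15.8),
    ("sensor, prep, align", 100.0, 3.95),
    ("sensor, prep, align, depth (CPU)", 0.09, 5.27),
    ("sensor, prep, align, depth (GPU)", 11.2, 5.27),
    ("sensor, prep, align, depth (FPGA)", 174.0, 5.27),
    ("sensor, prep, align, depth (CPU), stitch", 0.09, 31.6),
    ("sensor, prep, align, depth (GPU), stitch", 11.2, 31.6),
    ("sensor, prep, align, depth (FPGA), stitch", 174.0, 31.6),
]


def effective_fps(compute_fps, transfer_fps):
    """The slower of computation and communication bounds the frame rate."""
    return min(compute_fps, transfer_fps)


results = [(name, effective_fps(c, t)) for name, c, t in CONFIGS]
best_name, best_fps = max(results, key=lambda r: r[1])
real_time = [name for name, fps in results if fps >= TARGET_FPS]

for name, fps in results:
    print(f"{name:45s} {fps:6.2f}")
```

Read this way, a fast accelerator only helps when the data leaving the camera is also small enough for the link, which is why both columns have to be considered together.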

Page 32

Which pipeline achieves the highest frame rate?

CPU results are slowest

Page 33

Which pipeline achieves the highest frame rate?

Data size is too big after depth for offloading

Page 34

Which pipeline achieves the highest frame rate?

The full pipeline with FPGA is the only one that achieves a real-time frame rate

Page 35

In-camera processing for real-time VR video

Computation and communication together highlight benefits not seen when considered separately

For VR video, in-camera processing pipelines enable applications that could not even be achieved via cloud offload

Pipeline: prep → align → depth → stitch

Page 36

In-camera processing pipelines help characterize camera systems

In-camera pipelines evaluate computation-communication trade-offs

Use hardware-software co-design to balance constraints and optimize designs

Achieve optimal performance by considering bottlenecks in context of full system

Thank you!