Post on 16-Dec-2015

Sonic Millip3De: Massively Parallel 3D Stacked Accelerator for 3D Ultrasound

Richard Sampson* Ming Yang† Siyuan Wei† Chaitali Chakrabarti† Thomas F. Wenisch*

*University of Michigan †Arizona State University

Portable Medical Imaging Devices

• Medical imaging moving towards portability
  – MEDICS (X-Ray CT) [Dasika ‘10]
  – Handheld 2D Ultrasound [Fuller ‘09]
• Not just a matter of convenience
  – Improved patient health [Gunnarsson ‘00, Weinreb ‘08]
  – Access in developing countries
• Why ultrasound?
  – Low transmit power [Nelson ‘10]
  – No dangers or side-effects

Handheld 3D Ultrasound

• 3D has numerous benefits over 2D
  – Easier to interpret images
  – Greater volumetric accuracy
• … as well as many challenges
  – 12k transducers, 10M image points
    • 10-20x beyond state of the art
  – High raw data bandwidth (6 Tb/s)
    • Major bottleneck in state of the art
  – Tight handheld power budget (5 W)

Why a Custom Accelerator?

• Software algorithms load/store intensive
  – von Neumann designs inefficient
• Large system would require over 700 DSPs
  – General purpose CPUs even less efficient

Architecture          Energy/Scanline (1 fps)   Single-Core Time/Scanline
Intel Core i7-2670    25.08 J                   4.46 s
ARM Cortex-A8         33.04 J                   132.18 s
TI C6678 DSP          2.84 J                    2.27 s

Contributions

• Iterative delay calculation algorithm
  – Reduces storage by over 400x
  – Enables streaming data flow
• Sonic Millip3De design
  – Leverages 3D die stacking technology
  – Transform-select-reduce accelerator framework
• Power and image analysis of Sonic Millip3De
  – Negligible change in image quality
  – Able to meet 5 W power budget by 11 nm node

Outline

• Introduction
• Ultrasound background
• Algorithm design
• System design
  – Sonic Millip3De
  – Select Sub-Unit
• Results and analysis
• Conclusions

Ultrasound: Transmit and Receive

[Figure (animation): Transmit Transducer, Receive Transducer, Focal Points, Image Space, Receive Raw Channel Data, 𝜏]

Each transducer stores array of raw receive data

Ultrasound: Image Reconstruction

Image reconstructed from data based on round trip delay

Images from each transducer combined to produce full frame
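
The two reconstruction steps above can be sketched as a delay-and-sum loop. This is a minimal illustration, not the authors' implementation: the names `reconstruct_frame`, `channel_data`, and `delay_index` are hypothetical, and the delay indices are assumed to be available already.

```python
def reconstruct_frame(channel_data, delay_index):
    """Delay-and-sum over all receive transducers.

    channel_data[t][k] : k-th raw sample stored by transducer t
    delay_index[t][p]  : sample index of focal point p as seen by transducer t
    Returns one summed value per focal point.
    """
    num_points = len(delay_index[0])
    frame = [0.0] * num_points
    for samples, delays in zip(channel_data, delay_index):
        # Per-transducer image: pick the sample at each focal point's delay index.
        for p, k in enumerate(delays):
            if 0 <= k < len(samples):
                frame[p] += samples[k]  # combine images across transducers
    return frame


# Toy data (made up): 2 transducers, 4 samples each, 3 focal points.
channel_data = [[0, 1, 2, 3], [10, 20, 30, 40]]
delay_index = [[0, 2, 3], [1, 1, 3]]
frame = reconstruct_frame(channel_data, delay_index)
```

Each focal point reads exactly one sample per transducer, which is why the software form of the algorithm is dominated by loads and stores.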

Delay Index Calculation

• Iterate through all image points for each transducer and calculate delay index

• Often done with lookup tables (LUTs) instead
• 50 GB LUT required for target 3D system
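
A direct delay index calculation might look like the sketch below. It assumes straight-line propagation at a constant speed of sound and 3D coordinates in metres; the function name and the geometry are hypothetical, while the 1,540 m/s and 160 MHz defaults are taken from the System Parameters slide.

```python
import math

SPEED_OF_SOUND = 1540.0  # m/s in tissue (System Parameters slide)
FS = 160e6               # Hz, interpolated sampling frequency (System Parameters slide)


def delay_index(transmit, focal_point, receiver, c=SPEED_OF_SOUND, fs=FS):
    """Round-trip delay (transmit -> focal point -> receiver) as a sample index."""
    path = math.dist(transmit, focal_point) + math.dist(focal_point, receiver)
    tau = path / c              # round-trip time in seconds
    return round(tau * fs)      # nearest sample in the receive buffer


# Hypothetical geometry: transmit and receive at the origin, focal point 3 cm deep.
idx = delay_index((0.0, 0.0, 0.0), (0.0, 0.0, 0.03), (0.0, 0.0, 0.0))
```

Two square roots per (transducer, focal point) pair are what the LUT avoids, at the cost of storing every result.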

[Figure labels: 𝜏, 𝑃]

Challenges of Handheld 3D Ultrasound

• Delay index LUT requires too much storage
  – New iterative algorithm reduces necessary constant storage by 400x
• Peak raw data bandwidth (6 Tb/s) infeasible
  – Sub-aperture multiplexing reduces peak data rate, but requires more transmits
• Handheld power budget very tight (5 W)
  – 3D stacked, highly parallel data streaming design reconstructs images efficiently

Iterative Delay Index Calculation

• Deltas between adjacent focal points on a scanline form smooth curve

• Fit piecewise quadratic approx. to delta function

• Two sections sufficient for negligible error
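
The slides do not say how the piecewise quadratic is fitted. As a stand-in, the sketch below passes a quadratic through the first, middle, and last delta of each section; a least-squares fit would be the more usual choice. All names are hypothetical, and the delta values are synthetic, not ultrasound data.

```python
from fractions import Fraction


def fit_quadratic(points):
    """Coefficients (A, B, C) of A*n^2 + B*n + C through three (n, value) points."""
    (n0, y0), (n1, y1), (n2, y2) = [(Fraction(n), Fraction(y)) for n, y in points]
    d0 = (n0 - n1) * (n0 - n2)
    d1 = (n1 - n0) * (n1 - n2)
    d2 = (n2 - n0) * (n2 - n1)
    a = y0 / d0 + y1 / d1 + y2 / d2
    b = -(y0 * (n1 + n2) / d0 + y1 * (n0 + n2) / d1 + y2 * (n0 + n1) / d2)
    c = y0 * n1 * n2 / d0 + y1 * n0 * n2 / d1 + y2 * n0 * n1 / d2
    return a, b, c


def fit_two_sections(deltas, split):
    """One quadratic per section; n restarts at 0 in each (needs >= 3 deltas each)."""
    fits = []
    for section in (deltas[:split], deltas[split:]):
        mid, last = len(section) // 2, len(section) - 1
        fits.append(fit_quadratic([(0, section[0]), (mid, section[mid]), (last, section[last])]))
    return fits


def max_error(deltas, split, fits):
    """Largest absolute gap between the fitted curves and the deltas."""
    worst = Fraction(0)
    for section, (a, b, c) in zip((deltas[:split], deltas[split:]), fits):
        for n, d in enumerate(section):
            worst = max(worst, abs(a * n * n + b * n + c - d))
    return worst


# Synthetic deltas: n^2 + 2n + 3 in section 1, 2n^2 - n + 1 in section 2.
deltas = [3, 6, 11, 18, 27, 1, 2, 7, 16, 29]
fits = fit_two_sections(deltas, split=5)
error = max_error(deltas, 5, fits)
```

On real delta curves the error would be non-zero; `max_error` is the quantity to compare against the sample spacing when judging whether two sections suffice.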

[Figure: delta curve divided into Section 1 and Section 2]

Sub-aperture Multiplexing

• Peak raw data bandwidth (6 Tb/s) infeasible
• Solution: sub-aperture multiplexing
  – Transmit multiple times from same location
  – Receive with subset of transducers (sub-aperture)
  – Sum images together
• Prior work: reduce data rate
• Our design: also reduces HW and power requirements
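
The transmit, receive, and sum steps can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and one set of recorded data stands in for the repeated transmits, which assumes every transmit produces the same echoes.

```python
def partial_image(channel_data, delay_index, transducers):
    """Image contribution from one sub-aperture (a subset of transducer ids)."""
    num_points = len(delay_index[0])
    image = [0] * num_points
    for t in transducers:
        for p in range(num_points):
            image[p] += channel_data[t][delay_index[t][p]]
    return image


def multiplexed_image(channel_data, delay_index, sub_apertures):
    """One transmit per sub-aperture; the partial images are summed."""
    total = [0] * len(delay_index[0])
    for transducers in sub_apertures:
        # In hardware each pass would be a separate transmit/receive event.
        part = partial_image(channel_data, delay_index, transducers)
        total = [a + b for a, b in zip(total, part)]
    return total


# Toy data (made up): 4 transducers, 3 samples each, 2 focal points.
channel_data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
delay_index = [[0, 1], [1, 2], [2, 0], [0, 0]]
full = partial_image(channel_data, delay_index, [0, 1, 2, 3])
summed = multiplexed_image(channel_data, delay_index, [[0, 1], [2, 3]])
```

Because the reconstruction is a plain sum over transducers, the summed partial images match the full-aperture image, while only one sub-aperture's channels need to be captured and processed per transmit.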

System Design

Sonic Millip3De comprises 1,024 parallel pipelines

System Design: Transducers

Interchangeable CMOS transducer layer; can use older process

System Design: ADC/Storage

Separate storage layer to reduce wire lengths

System Design: Transform-Select-Reduce

Accelerator units in fast, low power process

Select Sub-Unit Design

Selects sample closest to each focal point using our algorithm

All delays for a scanline estimated using 9 constants

Adders calculate next iteration of quadratic approximation

$A(n+1)^2 + B(n+1) + C = (An^2 + Bn + C) + 2An + (A + B)$
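
The identity above means each new value of the quadratic needs only additions. A minimal software sketch of that recurrence follows; the function name is hypothetical and integer coefficients are assumed for simplicity.

```python
def quadratic_by_adders(a, b, c, count):
    """Yield A*n^2 + B*n + C for n = 0..count-1 using only additions."""
    value = c        # f(0) = C
    step = a + b     # f(1) - f(0) = 2A*0 + (A + B)
    two_a = a + a    # constant second difference
    for _ in range(count):
        yield value
        value += step    # f(n+1) = f(n) + 2An + (A + B)
        step += two_a    # advance 2An + (A + B) to the next n


values = list(quadratic_by_adders(3, -2, 5, 6))
direct = [3 * n * n - 2 * n + 5 for n in range(6)]
```

Two running additions per focal point replace the multiplications of a direct evaluation.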

Decrementor selects sample for next image focal point

Section decrementor indicates when to change constants
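
A behavioural sketch of the select sub-unit, under stated assumptions: deltas are integers of at least 1, and each section is described by a hypothetical tuple `(num_points, A, B, C)` plus a first sample index. The slides do not list the 9 constants, so this parameterization is illustrative rather than the actual hardware format.

```python
def select_samples(samples, first_index, sections):
    """Stream samples once and keep the one nearest each focal point.

    sections: list of (num_points, a, b, c); within a section the gap to the
    next focal point is delta(n) = a*n^2 + b*n + c, updated with adders only.
    """
    selected = []
    countdown = first_index                # sample decrementor
    section = 0
    points_left, a, b, c = sections[0]     # section decrementor + constants
    delta, step, two_a = c, a + b, a + a
    for sample in samples:
        if countdown > 0:
            countdown -= 1                 # skip samples between focal points
            continue
        selected.append(sample)            # decrementor hit zero: select
        points_left -= 1
        if points_left == 0:               # section done: change constants
            section += 1
            if section == len(sections):
                break
            points_left, a, b, c = sections[section]
            delta, step, two_a = c, a + b, a + a
        countdown = delta - 1              # samples to skip before next select
        delta += step                      # next delta via forward differences
        step += two_a
    return selected


# Samples equal their own index, so the output shows which indices were chosen.
picked = select_samples(list(range(100)), 5, [(3, 0, 1, 2), (2, 1, 0, 3)])
```

The unit touches each incoming sample once and never computes an absolute delay, which matches the streaming data flow the deck describes.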

System Parameters

Parameter                               Value
Sub-apertures                           12
Transmit Sources                        16
Transmits per Frame                     192
Transducers per Sub-aperture            1,024
Total Transducers                       12,288
Storage per Transducer                  4,096 x 12 bits
Focal Points per Scanline               4,096
Image Depth                             6 cm
Image Angular Width                     π/4
Sampling Frequency                      40 MHz
Interpolation Factor                    4x
Interpolated Sampling Frequency (fs)    160 MHz
Speed of Sound (tissue)                 1,540 m/s
Target Frame Rate                       1 fps
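
Several of these parameters follow from the others, as the arithmetic below shows. The peak data rate line is an assumption about how the 6 Tb/s figure arises (every transducer sampled at 40 MHz with 12-bit samples); the slides do not spell out that derivation.

```python
SUB_APERTURES = 12
TRANSMIT_SOURCES = 16
TRANSDUCERS_PER_SUB_APERTURE = 1024
SAMPLING_FREQUENCY_HZ = 40_000_000
INTERPOLATION_FACTOR = 4
BITS_PER_SAMPLE = 12  # from "Storage per Transducer: 4,096 x 12 bits"

transmits_per_frame = SUB_APERTURES * TRANSMIT_SOURCES
total_transducers = SUB_APERTURES * TRANSDUCERS_PER_SUB_APERTURE
interpolated_fs_hz = SAMPLING_FREQUENCY_HZ * INTERPOLATION_FACTOR

# Assumed derivation: raw bits per second if all transducers receive at once,
# versus only one sub-aperture receiving per transmit.
peak_rate_all = total_transducers * SAMPLING_FREQUENCY_HZ * BITS_PER_SAMPLE
peak_rate_sub_aperture = TRANSDUCERS_PER_SUB_APERTURE * SAMPLING_FREQUENCY_HZ * BITS_PER_SAMPLE

print(f"transmits per frame: {transmits_per_frame}")
print(f"total transducers:   {total_transducers}")
print(f"interpolated fs:     {interpolated_fs_hz / 1e6:.0f} MHz")
print(f"peak rate, all:      {peak_rate_all / 1e12:.2f} Tb/s")
print(f"peak rate, sub-ap.:  {peak_rate_sub_aperture / 1e12:.2f} Tb/s")
```

Under that assumption the full-aperture rate comes to about 5.9 Tb/s, in line with the quoted 6 Tb/s, and receiving with one sub-aperture at a time cuts the peak by the number of sub-apertures.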

Image Quality Comparison

[Figure: simulated images for Ideal, Our Design (12 bit), and 11 bit]

Our design has negligible difference from ideal system

Bits   Ideal   14      13      12      11      10
CNR    2.972   2.942   2.960   2.942   2.536   2.233

Simulations using Field II [Jensen ‘92, ‘95]

Power Analysis and Scaling

[Figure: Power (W), axis 0 to 20, versus Technology Node (45, 32, 22, 16, 11); legend: DRAM, Memory Interface, Network Wires, Accelerator, SRAM, ADC, Transducers]

Can meet 5 W by 11 nm node

Conclusions

• 3D die stacked Sonic Millip3De design is able to meet 5W power budget by 11nm

• Algorithm/HW co-design enables order-of-magnitude gains
  – Power and output quality goals often in conflict
  – Need guidance from domain experts to balance

• Architects have much to offer for application-specific system designs

Questions?

Special thanks to:

Brian Fowlkes
Oliver Kripfgans
Ron Dreslinski