Nasibeh Teimouri Hamed Tabkhi Gunar Schirner Summer 2014 Quantitative Evaluation of MPSoC with Many...

Nasibeh Teimouri, Hamed Tabkhi, Gunar Schirner

Summer 2014

Quantitative Evaluation of MPSoC with Many Accelerators: Challenges and Potential Solutions

Outline

• Heterogeneous MPSoCs
  – Specialization is a growing trend
• Accelerator-rich MPSoC architecture
  – MPSoCs with many accelerators
• Previous work
• Quantitative exploration of current accelerator-rich MPSoC
  – Huge memory demand
  – Immense communication traffic
  – Overwhelming synchronization
• The proposed accelerator-centric architecture template
  – Implementation
  – Evaluation





Heterogeneous MPSoCs

[Figure: block diagram of a heterogeneous MPSoC with a GPU (Alg5[0] to Alg5[63]), four 0.8 GHz cores (each with OS/drivers, interrupts INT0 to INT31, I-L1 and D-L1 caches, and an L2), function-level processors (Alg1, Alg4), DMA engines, SRAM, SDRAM controllers with SDRAM, an IP component, a transducer, and a bridge to low-performance peripherals; M and S mark master and slave ports]

• Heterogeneous MPSoCs
  – Integrated solutions for a group of evolving markets
    • ILP (e.g. CPU, DSP, or even GPU)
      + Flexibility
      - Power dissipation
    • Custom-HW accelerators (ACCs) for compute-intensive kernels
      + Power efficiency
      - Cost
      - Inflexibility

What is the trend?

Specialization as an MPSoC trend

• Increasing demand for high-performance, low-power computing
  – Market examples:
    • Embedded vision
    • Software Defined Radio (SDR)
    • Cyber-Physical Systems (CPS)
  – Tens of billions of operations per second
  – Less than a few watts of power
• Trend: domain-specific specialization
  – Proliferating number of ACCs in systems: ACC-rich MPSoC

[Figure: ACC-rich MPSoC with ILP 0 and its cache, ACC 0 to ACC N (each with a DMA), an ACC shared/LLC memory, and a memory interface]





Principles of current accelerator-rich MPSoC

• ILP + HW-ACC composition
  – HW-ACC
    • Executes compute-intensive kernels/apps
  – ILP
    • Executes the remaining applications
    • Orchestrates HW-ACCs and coordinates data movement
  – On-chip scratchpad memory (SPM)
    • Keeps data exchanged between ILP and ACCs on-chip
    • Avoids costly off-chip memory accesses

Event sequence for one job (actors: ILP, Input, DMA, ACC1, Output):

1. Input Done
2. DMA Start
3. DMA Done
4. DMA Start
5. DMA Done
6. ACC1 Start
7. ACC1 Done
8. DMA Start
9. DMA Done
10. DMA Start
11. DMA Done
12. Output Start
13. Output Done
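The 13-step sequence can be generated programmatically, which makes it easier to see how the event count seen by the ILP might grow with more accelerators. The sketch below is a hypothetical model, not the authors' simulator: `job_events` and `interrupts_per_job` are illustrative names, and the generalization beyond one accelerator (two DMA transfers per hop, every "Done" event raising one interrupt to the ILP) is an assumption extrapolated from the single-ACC sequence.

```python
def job_events(num_accs: int) -> list[str]:
    """Event trace for one job passing through `num_accs` accelerators in series."""
    if num_accs < 1:
        raise ValueError("need at least one accelerator")
    events = ["Input Done"]
    for acc in range(1, num_accs + 1):
        # ASSUMPTION: each hop is two DMA transfers over the fabric
        # (producer SPM -> shared memory, shared memory -> consumer SPM).
        events += ["DMA Start", "DMA Done", "DMA Start", "DMA Done"]
        events += [f"ACC{acc} Start", f"ACC{acc} Done"]
    # Final hop: last ACC's output SPM -> shared memory -> output interface SPM.
    events += ["DMA Start", "DMA Done", "DMA Start", "DMA Done"]
    events += ["Output Start", "Output Done"]
    return events


def interrupts_per_job(num_accs: int) -> int:
    """ASSUMPTION: every 'Done' event is one interrupt handled by the ILP."""
    return sum(1 for event in job_events(num_accs) if event.endswith("Done"))


if __name__ == "__main__":
    for step, event in enumerate(job_events(1), start=1):
        print(f"{step}. {event}")
    print("interrupts per job with 1 ACC:", interrupts_per_job(1))
```

With one accelerator the trace has the same 13 steps as the sequence above; under the stated assumptions each extra accelerator adds six events, three of which are interrupts.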

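[Figure: single-accelerator MPSoC with the ILP (with cache), on-chip memory, ACC1, and the input and output interfaces attached to the system fabric; ACC1 and the interfaces have SPMs and DMAs; data streams in and out through the interfaces; interrupt and control lines connect to the ILP]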

MPSoC with many accelerators

• Scratchpad memory (SPM)
  – Two per accelerator, one per I/O interface
  – To hold the input job
• Control and interrupt lines
  – ACC configuration
• Centralized vs. dedicated DMA
  – Stream data transfer

[Figure: MPSoC with ACC1 to ACCn, each with two SPMs and a DMA; input and output interfaces, each with an SPM and a DMA; ILP with cache and on-chip memory on the system fabric; interrupt and control lines to the ILP]

Challenges with increasing number of interrupts

1- Memory requirement
  - Two SPMs per ACC
  - One SPM per interface
  - Shared memory to hold the data handed over between accelerators

2- High volume of traffic over the system fabric
  - No point-to-point connections between ACCs
  - DMA data transfers are required

3- ILP synchronization
  - Among accelerators, I/O interfaces, and DMA transfers

We NEED to evaluate this architecture quantitatively!





Previous work on composing ACCs

• Composing bigger applications out of many accelerators, as in Accelerator-Rich CMPs [1] and CHARM [2]
  – Imposes considerable traffic and considerable on-chip buffering for accelerator data exchange
  – Loads the ILP with orchestrating the system composed of accelerators

1. J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman. Architecture support for accelerator-rich CMPs. In Proceedings of the 49th Annual Design Automation Conference (DAC '12), pages 843–849, 2012.

2. M. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks. The accelerator store framework for high-performance, low-power accelerator-based systems. Computer Architecture Letters, 9(2):53–56, Feb. 2010.





Quantitative exploration of accelerator-rich MPSoC: why and how

• Applicability of quantitative exploration
  – Quantifies the potential challenges
  – Exposes the ACC-rich bottlenecks as the # of ACCs increases
  – Helps system architects size the system knobs properly (SPM sizes, # of ACCs, communication BW)
  – Motivates our proposed architecture-template solution
• Approaches to quantitative exploration
  1- First-order, calculation-based analysis
  2- Simulation-based analysis of ACC-rich MPSoC

Exploration overview

• Assumptions
  – One HD-resolution frame as input
    • Divided into smaller jobs
  – Memory is on-chip
    • Avoid off-chip memory for now
• Exploration steps
  – Memory requirement as the # of ACCs increases
  – Sizing the SPMs to satisfy the memory budget limitation
  – Interrupt rate load on the ILP

Memory size analysis (calculation based)

[Plot: memory size (KB) vs. # accelerators, one curve per job size (KB): 1, 2, 4, 8, 16, 32, 64, 128]

• Memory size = SPMs + shared memory
  – Each SPM holds one job
  – Job size determines the minimum size of the SPMs and the shared memory
  – Shared memory holds all jobs exchanged among ACCs
• Job size is chosen with respect to the memory budget
• More ACCs require more memory
• A bigger job needs more memory
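As a rough illustration of "memory size = SPMs + shared memory", the sketch below counts two SPMs per ACC plus one per I/O interface, as stated earlier in the deck. How many jobs the shared memory holds is not given in the slides; the sketch assumes one job per hop between pipeline stages (`num_accs + 1`), which is a guess. The function name and parameters are hypothetical.

```python
def memory_size_kb(num_accs: int, job_kb: float, num_io: int = 2) -> float:
    """First-order on-chip memory estimate in KB for a linear ACC pipeline."""
    # From the slides: two SPMs per ACC, one SPM per I/O interface,
    # and each SPM holds exactly one job.
    spm_count = 2 * num_accs + num_io
    # ASSUMPTION: shared memory holds one job per hop between stages.
    shared_jobs = num_accs + 1
    return (spm_count + shared_jobs) * job_kb


if __name__ == "__main__":
    for n in (1, 10, 40):
        print(f"{n:>2} ACCs, 128 KB jobs: {memory_size_kb(n, 128):.0f} KB")
```

Under these assumptions the requirement grows linearly in both the number of ACCs and the job size, matching the two qualitative observations above.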

[Figure: an input frame streaming through the MPSoC to an output frame; the Input and Output interfaces, ACC1, and ACC2 (each with SPMs) share the system fabric with the shared memory; annotation: limiting memory budget]

Job sizing (calculation based)

• The smaller the memory, the smaller the job
• The more accelerators, the smaller the job
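Inverting the memory relation gives a simple job-sizing rule: pick the largest job that still fits the budget. The sketch below assumes `3 * (num_accs + 1)` job-sized buffers (two SPMs per ACC, one per I/O interface, plus an assumed `num_accs + 1` jobs in shared memory) and assumes job sizes are restricted to powers of two. Both are assumptions, and the function name is hypothetical.

```python
def job_size_kb(memory_kb: int, num_accs: int) -> int:
    """Largest power-of-two job size (KB) that fits in `memory_kb`."""
    # ASSUMPTION: 2 SPMs per ACC + 2 I/O SPMs + (num_accs + 1) shared-memory jobs.
    buffers = 3 * (num_accs + 1)
    limit = memory_kb // buffers  # largest whole-KB job that fits
    if limit < 1:
        raise ValueError("memory budget too small for a 1 KB job")
    # Round down to a power of two (ASSUMPTION: jobs are power-of-two sized).
    return 1 << (limit.bit_length() - 1)


if __name__ == "__main__":
    for n in (1, 4, 9, 18, 34, 60):
        print(f"{n:>2} ACCs, 0.5 MB budget: {job_size_kb(512, n)} KB job")
```

For a 0.5 MB budget this gives 64 KB for one ACC and 2 KB for 60 ACCs, which agrees with the job-size table in the simulation results later in the deck, although that agreement does not confirm the authors used this exact model.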

[Plot: job size (KB) vs. # accelerators, one curve per memory size (MB): 0.5, 1, 4, 16]

• A smaller job size issues more interrupts to the ILP
  – The ILP is responsible for synchronizing the ACCs' transactions
  – Count the number of interrupts
  – Measure the ILP's load in responding to interrupts

Simulation platform

• Using the SpecC SLDL to develop a simulation model
  – Scalable # of ACCs
    » Different/same data rate
  – ILPs
  – DMAs
  – Memories (SPM, shared memory)
    » On-chip and off-chip memory
• Generating the ACC-rich simulation model
  – BFM of the AMBA AHB communication fabric
  – ARM 9 (ISA v6) for ILP execution
    • Priority based
  – Dedicated interrupt line
  – Centralized DMAs

[Figure: simulation model after SCE refinement, with a Stimulus and a Monitor around the MPSoC; the input and output I/O interfaces, ACC0, ACC1, memory, DMA, and ILP sit on the system fabric, with first and last queues]

# of interrupts when scaling # of ACCs (simulation based)

• # of interrupts vs. the number of accelerators
  – For different sizes of on-chip memory

[Plot: # interrupts vs. # accelerators, one curve per memory size (MB): 0.5, 1, 4, 8]

• More interrupts to the ILP with a smaller job size
  – Significant utilization, or even saturation, of the ILP just from driving the accelerators

Job size by memory size and # of ACCs:

#ACC   0.5 MB   1 MB   4 MB   8 MB
1      64K      128K   512K   1M
4      32K      64K    256K   512K
9      16K      32K    128K   256K
18     8K       16K    64K    128K
34     4K       8K     32K    64K
60     2K       4K     16K    32K

Smaller memory / more ACCs -> smaller job
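To see why smaller jobs load the ILP, the number of interrupts per frame can be estimated as (jobs per frame) × (interrupts per job). The sketch below takes the job sizes for a 0.5 MB memory from the table; the frame size (1920 × 1080 at one byte per pixel) and the `3 * num_accs + 4` interrupts per job (one per completed input, DMA transfer, ACC run, and output, with two DMA transfers per hop) are assumptions, not values measured in the simulation.

```python
import math

# Job sizes in KB for a 0.5 MB on-chip memory, copied from the table.
JOB_KB_HALF_MB = {1: 64, 4: 32, 9: 16, 18: 8, 34: 4, 60: 2}

# ASSUMPTION: one HD frame at 1 byte per pixel; the slides do not give the size.
FRAME_BYTES = 1920 * 1080


def interrupts_per_frame(num_accs: int, job_kb: int,
                         frame_bytes: int = FRAME_BYTES) -> int:
    """Estimated interrupts the ILP handles while processing one frame."""
    jobs = math.ceil(frame_bytes / (job_kb * 1024))
    # ASSUMPTION: 1 input + 2*(num_accs + 1) DMA + num_accs ACC + 1 output.
    per_job = 3 * num_accs + 4
    return jobs * per_job


if __name__ == "__main__":
    for n, job_kb in JOB_KB_HALF_MB.items():
        print(f"{n:>2} ACCs, {job_kb:>2} KB jobs: "
              f"{interrupts_per_frame(n, job_kb)} interrupts")
```

Because the job count rises as jobs shrink while the per-job interrupt count rises with the number of ACCs, the estimate grows faster than linearly in the number of ACCs.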

Communication overhead analysis (calculation based)

• Communication overhead = data exchanged through the system fabric
• More ACCs mean heavier traffic on the system fabric
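A first-order traffic estimate follows from counting DMA transfers: without point-to-point links, every hop between stages crosses the system fabric. The sketch assumes two DMA transfers per hop (producer SPM to shared memory, then shared memory to consumer SPM) and `num_accs + 1` hops for a linear pipeline; the frame size is a hypothetical parameter and the function name is illustrative.

```python
def fabric_traffic_mb(num_accs: int, frame_mb: float,
                      transfers_per_hop: int = 2) -> float:
    """Data moved over the system fabric (MB) to process one frame."""
    # Hops: input -> ACC1 -> ... -> ACCn -> output.
    hops = num_accs + 1
    # ASSUMPTION: the whole frame crosses the fabric on every DMA transfer.
    return hops * transfers_per_hop * frame_mb


if __name__ == "__main__":
    for n in (1, 16, 60):
        print(f"{n:>2} ACCs: {fabric_traffic_mb(n, frame_mb=2.0):.0f} MB")
```

Under these assumptions the traffic is independent of the job size and grows linearly with the number of ACCs.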

[Plot: communication traffic (MB) vs. # accelerators]

Exploration summary

• Problems with the current accelerator-rich architecture
  – On-chip memory requirements
  – ILP synchronization load
  – Heavy communication traffic on the system fabric
• Demand for an improved ACC-centric design
  – Tackling the challenges of the current ACC-rich architecture





The goals of the proposed ACC-centric architecture

• The proposed solution
  – An autonomous accelerator chain
    • Relieves the ILP's synchronization load
  – Point-to-point connections between accelerators
    • No need for a large SPM per accelerator
    • No frequent DMA data transfers
    • No heavy traffic on the system fabric

Simulation platform

• Modifying the developed SpecC model to support an autonomous chain of accelerators
  – Gateways to manage the chain
• Creating another ACC-rich simulation model
  – BFM of the AMBA AHB communication fabric
  – ARM 9 (ISA v6) for ILP execution
  – Dedicated interrupt line from the gateways to the ILP
  – Centralized DMA

[Figure: simulation model after SCE refinement, with a Stimulus (input data) and a Monitor (output data); the InGateway, the chain ACC0 to ACCn, and the OutGateway connect to memory and the ILP over the system fabric]

[Figure: proposed architecture with the chain Input Gateway, ACC1, ACC2, Output Gateway (SPMs at the gateways); the ILP, memory, and the input and output interfaces (with SPMs and DMAs) sit on the system fabric; data streams in and out through the interfaces; interrupt and control lines run between the gateways and the ILP]

The proposed accelerator-centric architecture template

• Gateways controlled by the ILP manage the whole chain of accelerators
  – SPM to receive/send data from/to memory
  – Control lines from the ILP to the gateways for configuration
  – Interrupt lines from the gateways to the ILP
  – Point-to-point connections in the chain, with a small buffer in between
  – The chain works independently of the ILP
• Point-to-point accelerator connections
  – Low memory requirement
  – Few DMA data transfers
• Autonomous ACC chain
  – Light ILP synchronization load, no matter how many accelerators

[Figure: data flow through the proposed architecture, with steps 1 to 5 marked along the path memory, input gateway SPM, ACC1, ACC2, output gateway SPM, memory]

1. DMA brings data to the input gateway's SPM
2. The input gateway receives the data and starts passing it through the chain
3. The chain works on the data
4. The output gateway gathers the data in its SPM
5. DMA brings the data to memory
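The five steps can be sketched as a small model in which only the DMA and the gateways ever notify the ILP, so the number of ILP events does not depend on the chain length. This is an illustrative model, not the SpecC implementation: `run_chain` is a hypothetical name, each accelerator is stood in for by a plain function, and the assumption that each DMA transfer and the output gateway raise exactly one interrupt per job is not stated in the slides.

```python
from typing import Callable

Stage = Callable[[bytes], bytes]  # stand-in for one accelerator in the chain


def run_chain(job: bytes, stages: list[Stage]) -> tuple[bytes, list[str]]:
    """Push one job through the chain; return the result and the ILP events."""
    ilp_events = []
    # Step 1: DMA brings the job from memory into the input gateway's SPM.
    ilp_events.append("DMA done: memory -> input gateway SPM")
    # Step 2: the input gateway starts passing the data through the chain.
    data = job
    # Step 3: accelerators hand data to each other point to point;
    # the ILP is not involved here (ASSUMPTION: no per-ACC interrupts).
    for stage in stages:
        data = stage(data)
    # Step 4: the output gateway gathers the result in its SPM.
    ilp_events.append("Output gateway done: job gathered in SPM")
    # Step 5: DMA brings the result back to memory.
    ilp_events.append("DMA done: output gateway SPM -> memory")
    return data, ilp_events


def invert(data: bytes) -> bytes:
    """Toy accelerator: bitwise inversion of every byte."""
    return bytes(b ^ 0xFF for b in data)


if __name__ == "__main__":
    for n in (2, 60):
        _, events = run_chain(b"\x01\x02\x03", [invert] * n)
        print(f"{n:>2} ACCs -> {len(events)} ILP events per job")
```

In this model a chain of 60 accelerators produces as many ILP events per job as a chain of two, which is the property the autonomous chain is meant to provide.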


Evaluation

With more ACCs:

• # of interrupts
  – Current arch: exponential growth in interrupts
  – Proposed arch: the same number of interrupts
  [Plot: # interrupts vs. # accelerators, current arch vs. proposed arch]
• Data communication traffic
  – Current arch: heavier traffic
  – Proposed arch: almost the same data traffic
  [Plot: data communication traffic (MB) vs. # accelerators, current arch vs. proposed arch]
• Job size (4 MB memory)
  – Current arch: smaller job
  – Proposed arch: almost the same job
  [Plot: job size (KB) vs. # accelerators, current architecture vs. proposed architecture]
• Memory requirement (256 KB job)
  – Current arch: linear growth in memory requirement
  – Proposed arch: almost constant memory requirement
  [Plot: memory (MB) vs. # accelerators, current architecture vs. proposed architecture]

Summary

• Specialization as a growing trend in CMPs
  – Accelerator-rich architectures
• Exploration of the challenges in the current accelerator-rich architecture
  – Memory requirement
  – Communication overhead
  – Synchronization load
• The proposed accelerator-centric architecture template
  – Autonomous accelerator chain
    • No large memory requirement
    • No heavy communication traffic
    • No critical amount of required synchronization

Questions?

Again, thanks to Professor Schirner for all his support.

Thanks to Hamed for what I've been learning from him.

Thank you, all ESL members, for your attendance!
