Page 1:

Quantitative Evaluation of MPSoC with Many Accelerators; Challenges and Potential Solutions

Nasibeh Teimouri, Hamed Tabkhi, Gunar Schirner

Summer 2014

Page 2:

Outline

• Heterogeneous MPSoCs
  – Specialization is a growing trend
• Accelerator-rich MPSoC architecture
  – MPSoCs with many accelerators
• Previous work
• Quantitative exploration of current accelerator-rich MPSoCs
  – Huge memory demand
  – Immense communication traffic
  – Overwhelming synchronization
• The proposed accelerator-centric architecture template
  – Implementation
  – Evaluation

Page 4:

Heterogeneous MPSoCs

[Figure: block diagram of a heterogeneous MPSoC. Four 0.8 GHz cores (Core 1 to Core 4), each with I-L1/D-L1 and L2 caches, interrupt lines INT0 to INT31, and OS/driver software running Alg2 and Alg3; a GPU running Alg5[0] to Alg5[63]; two function-level processors running Alg1 and Alg4; DMA engines; SRAM; SDRAM controllers with SDRAM; a bridge to low-performance peripherals; a transducer; and an IP component.]

• Heterogeneous MPSoCs
  – Integrated solutions for a group of evolving markets
  – ILPs (e.g. CPU, DSP, or even GPU)
    + Flexibility
    - Power dissipation
  – Custom-HW accelerators (ACCs) for compute-intensive kernels
    + Power efficiency
    - Cost
    - Inflexibility

What is the trend?

Page 5:

Specialization as an MPSoC trend

• Increasing demand for high-performance, low-power computing
  – Market examples:
    • Embedded vision
    • Software-Defined Radio (SDR)
    • Cyber-Physical Systems (CPS)
  – Tens of billions of operations per second
  – Less than a few watts of power
• Trend: domain-specific specialization
  – Proliferating number of ACCs in systems: ACC-rich MPSoC

[Figure: ACC-rich MPSoC with ILP 0 (with cache), ACC 0 to ACC N (each with a DMA), an ACC shared/LLC memory, and a memory interface.]

Page 7:

Principles of current accelerator-rich MPSoCs

• ILP + HW-ACC composition
  – HW-ACC
    • Executes compute-intensive kernels/applications
  – ILP
    • Executes the remaining applications
    • Orchestrates the HW-ACCs and coordinates data movement
  – On-chip scratchpad memory (SPM)
    • Keeps data exchanged between ILP and ACCs on-chip
    • Avoids costly off-chip memory accesses

Sequence of events for one job (participants: ILP, Input, DMA, ACC1, Output):

1. Input done
2. DMA start
3. DMA done
4. DMA start
5. DMA done
6. ACC1 start
7. ACC1 done
8. DMA start
9. DMA done
10. DMA start
11. DMA done
12. Output start
13. Output done
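The thirteen events above cover one job and one accelerator. The following is a minimal sketch of how that count might grow with the number of accelerators. It assumes (this is not stated on the slide) that each additional accelerator adds one DMA transfer into its input SPM, one start/done pair of its own, and one DMA transfer out of its output SPM. The function names are hypothetical.

```python
def dma_transfers_per_job(n_acc: int) -> int:
    """DMA transfers needed to move one job through n_acc accelerators.

    Assumption: two transfers serve the I/O interfaces (input SPM -> memory,
    memory -> output SPM) and two serve each accelerator
    (memory -> ACC input SPM, ACC output SPM -> memory).
    """
    return 2 + 2 * n_acc


def ilp_events_per_job(n_acc: int) -> dict:
    """Count the start commands and done interrupts the ILP handles per job."""
    transfers = dma_transfers_per_job(n_acc)
    # "Done" interrupts: input, every DMA transfer, every ACC, output.
    done = 1 + transfers + n_acc + 1
    # "Start" commands: every DMA transfer, every ACC, output.
    start = transfers + n_acc + 1
    return {"start": start, "done": done, "total": start + done}


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        print(n, ilp_events_per_job(n))
```

With one accelerator the sketch gives 6 starts and 7 dones, matching the 13 events listed above; under the stated assumption each extra accelerator adds six more events per job.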

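[Figure: ACC1 with two SPMs, input and output interfaces (stream in, stream out) with one SPM each, DMAs, on-chip memory, and the ILP with its cache, all on the system fabric; interrupt and control lines connect the ILP to the other blocks.]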

Page 8:

MPSoC with many accelerators

• Scratchpad memory (SPM)
  – Two per accelerator, one per I/O interface
  – Holds the input job
• Control and interrupt lines
  – ACC configuration
• Centralized vs. dedicated DMA
  – Stream data transfer

[Figure: ACC1, ACC2, ..., ACCn with two SPMs each, input and output interfaces (stream in, stream out) with one SPM each, DMAs, on-chip memory, and the ILP with its cache, all on the system fabric; interrupt and control lines connect the ILP to the other blocks.]

Page 9:

Challenges with an increasing number of interrupts

1. Memory requirement
  – Two SPMs per ACC
  – One SPM per interface
  – Shared memory to hold the data handed over between accelerators
2. High volume of traffic over the system fabric
  – No point-to-point connections between ACCs
  – DMA data transfers are required
3. ILP synchronization
  – Among accelerators, I/O interfaces, and DMA transfers

We need to evaluate this architecture quantitatively!

[Figure: ACC-rich MPSoC block diagram with ACC1 to ACCn, SPMs, input and output interfaces, on-chip memory, and ILP on the system fabric.]

Page 11:

Previous work on composing ACCs

• Composing bigger applications out of many accelerators, as in Accelerator-Rich CMPs [1] and CHARM [2]
  – Imposes considerable traffic and requires considerable on-chip buffers for accelerator data exchange
  – Loads the ILP with orchestrating the system composed of accelerators

1. J. Cong, M. A. Ghodrat, M. Gill, B. Grigorian, and G. Reinman. Architecture support for accelerator-rich CMPs. In Proceedings of the 49th Annual Design Automation Conference, DAC ’12, pages 843–849, 2012.

2. M. Lyons, M. Hempstead, G.-Y. Wei, and D. Brooks. The accelerator store framework for high-performance, low-power accelerator-based systems. Computer Architecture Letters, 9(2):53–56, Feb. 2010.

Page 13:

Quantitative exploration of accelerator-rich MPSoC: why and how

• Applicability of quantitative exploration
  – Quantifies the potential challenges
  – Exposes the ACC-rich bottlenecks as the number of ACCs increases
  – Helps system architects size the system knobs properly (SPM sizes, number of ACCs, communication bandwidth)
  – Motivates our proposed architecture-template solution
• Approaches to quantitative exploration
  1. First-order mathematical analysis
  2. Simulation-based analysis of the ACC-rich MPSoC

Page 14:

Exploration overview

• Assumptions
  – One HD-resolution frame as input
    • Divided into smaller jobs
  – Memory on chip
    • Avoid off-chip memory for now
• Exploration steps
  – Memory requirement as the number of ACCs increases
  – Sizing the SPMs to satisfy the memory budget
  – Interrupt load on the ILP

Page 15:

Memory size analysis (calculation based)

[Chart: memory size (KB, 0 to 18,000) vs. number of accelerators (0 to 40), one curve per job size: 1, 2, 4, 8, 16, 32, 64, and 128 KB.]

• Memory size = SPMs + shared memory
• Each SPM holds one job
• Job size determines the minimum size of the SPMs and the shared memory
• Shared memory holds all jobs exchanged among ACCs
• More ACCs require more memory
• Bigger jobs need more memory
• The job size must be chosen with respect to the limited memory budget
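The relation "memory size = SPMs + shared memory" can be written as a small first-order model. This is only a sketch: the SPM count (two per ACC, one per I/O interface, one job each) comes from the slides, while the number of shared-memory buffers (one per hand-over between neighbouring stages, so `n_acc + 1`) is an assumption made here for illustration. The function name `memory_kb` is hypothetical.

```python
def memory_kb(n_acc: int, job_kb: float) -> float:
    """First-order on-chip memory estimate in KB.

    From the slides: two SPMs per accelerator, one per I/O interface,
    each holding one job.
    Assumption (not from the slides): the shared memory keeps one job
    buffer per hand-over between neighbouring stages (n_acc + 1 buffers).
    """
    spm_count = 2 * n_acc + 2       # 2 per ACC + input + output interface
    shared_buffers = n_acc + 1      # assumed: one per hand-over
    return (spm_count + shared_buffers) * job_kb


if __name__ == "__main__":
    for job in (1, 16, 128):
        print(job, [memory_kb(n, job) for n in (1, 10, 20, 40)])
```

Under these assumptions memory grows linearly in both the number of accelerators and the job size; for example, 40 accelerators with 128 KB jobs come to 15,744 KB, which lies within the 0 to 18,000 KB range of the chart.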

[Figure: MPSoC turning an input frame into an output frame: ACC1 and ACC2 with two SPMs each, input and output interfaces with one SPM each, and the shared memory, all on the system fabric.]

Page 16:

Job sizing (calculation based)

• The smaller the memory, the smaller the job size
• The more accelerators, the smaller the job size
• Smaller jobs issue more interrupts to the ILP
  – The ILP is responsible for synchronizing ACC transactions
  – Count the number of interrupts
  – Measure the ILP load of responding to interrupts

[Chart: job size (KB, 0 to 4,500) vs. number of accelerators (0 to 60), one curve per memory size: 0.5, 1, 4, and 16 MB.]

[Figure: ACC-rich MPSoC block diagram with ACC1 to ACCn, SPMs, input and output interfaces, ACC shared/LLC memory, and ILP on the system fabric.]
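The two job-sizing rules can be sketched by inverting a first-order memory model. The buffer count used here (two SPMs per ACC, one per I/O interface, plus an assumed `n_acc + 1` shared-memory buffers) and the rounding of the job size down to a power of two are assumptions for illustration, not something the slides state. Both function names are hypothetical.

```python
def largest_power_of_two_at_most(x: float) -> int:
    """Largest power of two that does not exceed x (x must be >= 1)."""
    if x < 1:
        raise ValueError("budget too small for a 1 KB job")
    return 1 << (int(x).bit_length() - 1)


def job_size_kb(n_acc: int, memory_kb: int) -> int:
    """Largest power-of-two job (in KB) whose buffers fit in memory_kb.

    Assumptions: 2 SPMs per ACC, 1 per I/O interface, and n_acc + 1 shared
    buffers, each holding one job (3 * n_acc + 3 job-sized buffers in total).
    """
    buffers = 3 * n_acc + 3
    return largest_power_of_two_at_most(memory_kb / buffers)


if __name__ == "__main__":
    for mem_kb in (512, 1024, 4096, 8192):
        print(mem_kb, [job_size_kb(n, mem_kb) for n in (1, 4, 9, 18, 34, 60)])
```

With a 4 MB budget this sketch yields 512 KB jobs for one accelerator and 16 KB jobs for 60 accelerators, which is in line with the job sizes tabulated alongside the interrupt results, though other buffer counts could round to the same values.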

Page 17:

Simulation platform

• Using the SpecC SLDL to develop a simulation model
  – Scalable number of ACCs
    » Different/same data rate
  – ILPs
  – DMAs
  – Memories (SPM, shared memory)
    » On-chip and off-chip memory
• Generating the ACC-rich simulation model
  – BFM AMBA-AHB communication fabric
  – ARM 9 (ISA v6) for ILP execution
    • Priority based
  – Dedicated interrupt line
  – Centralized DMAs

[Figure: MPSoC turning an input frame into an output frame: ACC1 and ACC2 with two SPMs each, input and output interfaces with one SPM each, and the shared memory, all on the system fabric.]

[Figure: SCE refinement of the model (stimulus, first queue, ACC0, ACC1, last queue, monitor) onto a platform with input and output I/O interfaces, ACC0, ACC1, memory, DMA, and ILP on the system fabric.]

Page 18:

Number of interrupts when scaling the number of ACCs (simulation based)

• Number of interrupts vs. number of accelerators
  – For different sizes of on-chip memory
• More interrupts to the ILP with smaller job sizes
  – Significant utilization, or even oversaturation, of the ILP just from driving the accelerators

[Chart: # interrupts (0 to 200,000) vs. # accelerators (0 to 60), one curve per memory size: 0.5, 1, 4, and 8 MB.]

Job size by number of ACCs and memory size (smaller memory or more ACCs -> smaller job):

         Memory size (MB)
#ACC     0.5     1       4       8
1        64K     128K    512K    1M
4        32K     64K     256K    512K
9        16K     32K     128K    256K
18       8K      16K     64K     128K
34       4K      8K      32K     64K
60       2K      4K      16K     32K
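A rough sketch of why smaller jobs mean more interrupts: the frame is cut into more jobs, and every job triggers its own round of completion interrupts. The frame size (1920 x 1080 at one byte per pixel) and the per-job interrupt count (one per input, output, DMA transfer, and accelerator) are assumptions for illustration; the slides report simulated counts and do not give this formula. The names are hypothetical.

```python
import math

# Assumed frame: 1080p at 1 byte per pixel, expressed in KB.
HD_FRAME_KB = 1920 * 1080 / 1024


def interrupts_per_frame(n_acc: int, job_kb: float,
                         frame_kb: float = HD_FRAME_KB) -> int:
    """Completion interrupts the ILP receives while one frame flows through.

    Assumed per-job interrupts: input done and output done (2), one per DMA
    transfer (2 * n_acc + 2), and one per accelerator (n_acc).
    """
    jobs = math.ceil(frame_kb / job_kb)
    per_job = 2 + (2 * n_acc + 2) + n_acc
    return jobs * per_job


if __name__ == "__main__":
    # Job sizes taken from the 4 MB column of the table above.
    for n_acc, job_kb in ((1, 512), (4, 256), (9, 128), (18, 64), (34, 32), (60, 16)):
        print(n_acc, job_kb, interrupts_per_frame(n_acc, job_kb))
```

In this model the interrupt count rises on two fronts as accelerators are added: the per-job count grows with the number of accelerators, and the number of jobs grows because each job has to shrink.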

Page 19:

Communication overhead analysis (calculation based)

• Communication overhead = data exchanged through the system fabric
• More ACCs, heavier traffic on the system fabric
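The communication overhead can be approximated with a one-line model. This sketch assumes that every byte of the frame crosses the system fabric once per DMA transfer, that there are two transfers for the I/O interfaces plus two per accelerator, and that accelerators do not change the data volume. None of these details are given on the slide, and the function name is hypothetical.

```python
def fabric_traffic_mb(n_acc: int, frame_mb: float) -> float:
    """Data volume crossing the system fabric for one frame, in MB.

    Assumption: the whole frame is moved once per DMA transfer, with two
    transfers for the I/O interfaces and two per accelerator.
    """
    transfers = 2 * n_acc + 2
    return transfers * frame_mb


if __name__ == "__main__":
    for n in (1, 8, 32, 64):
        print(n, fabric_traffic_mb(n, frame_mb=2.0))
```

Under these assumptions traffic grows linearly with the number of accelerators, independent of the job size, because each accelerator stage forces the full frame through the fabric twice.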

[Figure: MPSoC turning an input frame into an output frame: ACC1 and ACC2 with two SPMs each, input and output interfaces with one SPM each, and the shared memory, all on the system fabric.]

[Chart: communication traffic (MB, 0 to 300) vs. number of accelerators (0 to 70).]

Page 20:

Exploration summary

• Problems with the current accelerator-rich architecture
  – On-chip memory requirements
  – ILP synchronization load
  – Heavy communication traffic on the system fabric
• Demand for an improved ACC-centric design
  – Tackling the challenges of the current ACC-rich architecture

Page 22:

The goals of the proposed ACC-centric architecture

• The proposed solution
  – An autonomous accelerator chain
    • Relieves the ILP’s synchronization load
  – Point-to-point connections between accelerators
    • No need for large SPMs per accelerator
    • No frequent DMA data transfers
    • No heavy traffic on the system fabric

Page 23:

Simulation platform

• Modifying the developed SpecC model to support an autonomous chain of accelerators
  – Gateways to manage the chain
• Creating another ACC-rich simulation model
  – BFM AMBA-AHB communication fabric
  – ARM 9 (ISA v6) for ILP execution
  – Dedicated interrupt line from the gateways to the ILP
  – Centralized DMA

[Figure: SCE refinement of the model (stimulus, input data, InGateway, chain of ACC0 to ACCn, OutGateway, output data, monitor) onto a platform with memory and ILP on the system fabric.]

[Figure: proposed architecture: input and output interfaces with SPMs (stream in, stream out), input and output gateways with SPMs, ACC1 and ACC2 chained between the gateways, DMAs, memory, and ILP on the system fabric, with interrupt and control lines.]

Page 24:

The proposed accelerator-centric architecture template

• Gateways, controlled by the ILP, manage the whole chain of accelerators
  – SPM to receive/send data from/to memory
  – Control lines from the ILP to the gateways for configuration
  – Interrupt lines from the gateways to the ILP
  – Point-to-point connections in the chain with a small buffer in between
  – The chain works independently of the ILP
• Point-to-point accelerator connections
  – Low memory requirement
  – Few DMA data transfers
• Autonomous ACC chain
  – Light ILP synchronization load no matter how many accelerators

[Figure: proposed architecture: input and output interfaces with SPMs (stream in, stream out), input and output gateways with SPMs, ACC1 and ACC2 chained between the gateways, DMAs, memory, and ILP on the system fabric, with interrupt and control lines. The numbers 1 to 5 in the figure mark the steps below.]

1. DMA brings data to the input gateway’s SPM
2. Input gateway receives the data and starts passing it through the chain
3. The chain works on the data
4. Output gateway gathers the data in its SPM
5. DMA brings the data to memory
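The five steps can be mimicked with a toy software model to show why the ILP load does not depend on the chain length. This is a sketch only: accelerators are plain Python functions, point-to-point links are direct function calls, and it is assumed that each gateway raises exactly one interrupt per job. The name `run_chain` and the interrupt count are illustrative assumptions, not the authors' SpecC model.

```python
from typing import Callable, Iterable, List, Tuple

Job = List[int]


def run_chain(jobs: Iterable[Job],
              chain: List[Callable[[Job], Job]]) -> Tuple[List[Job], int]:
    """Stream jobs through an accelerator chain and count ILP interrupts.

    Toy model: only the two gateways interrupt the ILP (assumed once each
    per job); accelerators hand data to their neighbour directly.
    """
    interrupts = 0
    results: List[Job] = []
    for job in jobs:
        interrupts += 1          # steps 1-2: job arrived in input gateway SPM
        data = job
        for acc in chain:        # step 3: point-to-point, no ILP, no DMA
            data = acc(data)
        results.append(data)     # step 4: output gateway gathers the data
        interrupts += 1          # step 5: output gateway requests DMA to memory
    return results, interrupts


if __name__ == "__main__":
    increment = lambda d: [x + 1 for x in d]
    frames = [[1, 2], [3, 4]]
    for length in (1, 10, 60):
        out, irq = run_chain(frames, [increment] * length)
        print(length, out, irq)
```

In this model the interrupt count depends only on the number of jobs, so adding accelerators to the chain leaves the ILP's synchronization load unchanged.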

Page 25:

Evaluation

With more ACCs:

• Number of interrupts
  – Current architecture: exponential growth in interrupts
  – Proposed architecture: the same number of interrupts
  [Chart: # interrupts (0 to 12,000) vs. # accelerators (0 to 70), current vs. proposed architecture.]
• Data communication traffic
  – Current architecture: heavier traffic
  – Proposed architecture: almost the same data traffic
  [Chart: data communication traffic (MB, 0 to 300) vs. # accelerators (0 to 70), current vs. proposed architecture.]
• Job size (4 MB memory)
  – Current architecture: smaller jobs
  – Proposed architecture: almost the same job size
  [Chart: job size (KB, 0 to 700) vs. # accelerators (0 to 70), current vs. proposed architecture.]
• Memory requirement (256 KB job)
  – Current architecture: linear growth in memory requirement
  – Proposed architecture: almost constant memory requirement
  [Chart: memory (MB, 0 to 50) vs. # accelerators (0 to 70), current vs. proposed architecture.]

Page 26:

Summary

• Specialization as a growing trend in CMPs
  – Accelerator-rich architectures
• Exploration of the challenges in the current accelerator-rich architecture
  – Memory requirement
  – Communication overhead
  – Synchronization load
• The proposed accelerator-centric architecture template
  – Autonomous accelerator chain
    • No large memory requirement
    • No heavy communication traffic
    • No critical amount of required synchronization

Page 27:

Questions?

Again, thanks to Professor Schirner for all his support.

Thanks to Hamed for what I’ve been learning from him.

Thank you, all ESL members, for your attendance!