Page 1: Orchestrating the Execution of Stream Programs on Multicore Platforms

Manjunath Kudlur, Scott Mahlke
Advanced Computer Architecture Lab.
University of Michigan, Electrical Engineering and Computer Science

Page 2: Cores are the New Gates

[Figure: number of cores per chip (log scale, 1 to 512) versus year (1975 to 2010). Legend: Unicore, Homogeneous Multicore, Heterogeneous Multicore. Chips plotted: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium2, Power4, PA8800, Opteron, CoreDuo, Power6, Xbox 360, BCM 1480, Opteron 4P, Xeon, Niagara, Cell, RAW, RAZA XLR, Cavium, CISCO CSR1, Larrabee, PicoChip, AMBRIC, AMD Fusion, NVIDIA G80, Core, Core2Duo, Core2Quad. Caption: "Cores are the new gates" (Shekhar Borkar, Intel). Courtesy: Gordon’06.]

[Figure: programming models for these platforms: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, Stream Programming. Courtesy: Gordon’06.]

Page 3: Stream Programming

• Programming style
  – Embedded domain
    • Audio/video (H.264), wireless (WCDMA)
  – Mainstream
    • Continuous query processing (IBM SystemS), search (Google Sawzall)
• Stream
  – Collection of data records
• Kernels/Filters
  – Functions applied to streams
  – Input/Output are streams
  – Coarse grain dataflow
  – Amenable to aggressive compiler optimizations [ASPLOS’02, ’06, PLDI ’03]

Page 4: Compiling Stream Programs

[Figure: a stream program on the left and a multicore system on the right (Core 1 to Core 4, each with its own Mem), joined by a "?" for the mapping between them.]

• Heavy lifting
• Equal work distribution
• Communication
• Synchronization

Page 5: Stream Graph Modulo Scheduling (SGMS)

• Coarse grain software pipelining
  – Equal work distribution
  – Communication/computation overlap
• Target: Cell processor
  – Cores with disjoint address spaces
  – Explicit copy to access remote data
    • DMA engine independent of PEs
• Filters = operations, cores = function units

[Figure: Cell processor block diagram. SPE0 through SPE7 each contain an SPU, a 256 KB local store (LS) and an MFC (DMA); the SPEs, the PPE (PowerPC) and DRAM are connected by the EIB.]

Page 6: Streamroller Software Overview

[Figure: Streamroller tool flow. Inputs: StreamIt, SPEX (C++ with stream extensions), stylized C. These feed a Stream IR, which Streamroller processes through profiling, analysis, scheduling and code generation. Targets: custom application engine, SODA (low power multicore processor for SDR), multicore (Cell, Core2Quad, Niagara). One path is labeled "Focus of this talk".]

Page 7: Preliminaries

• Synchronous Data Flow (SDF) [Lee ’87]
• StreamIt [Thies ’02]

    int->int filter FIR(int N, int wgts[N]) {
      work pop 1 push 1 {
        int i, sum = 0;
        for (i = 0; i < N; i++)
          sum += peek(i) * wgts[i];
        push(sum);
        pop();
      }
    }

  – Push and pop items from input/output FIFOs
  – Stateless: the filter as written above, with wgts passed in as a parameter
  – Stateful: wgts kept as a filter field (int wgts[N];) and updated inside work (wgts = adapt(wgts);)

Page 8: SGMS Overview

[Figure: the stream graph running on a single PE (PE0) takes time T1 per iteration. Spread over PE0 to PE3 as a software pipeline with a prologue and an epilogue, and with DMA transfers between PEs, each steady-state iteration takes time T4, where T1/T4 ≈ 4.]

Page 9: SGMS Phases

• Fission + processor assignment: load balance
• Stage assignment: causality, DMA overlap
• Code generation

Page 10: Processor Assignment

• Assign filters to processors
  – Goal: equal work distribution
• Graph partitioning?
• Bin packing?

[Figure: original stream program, A → (B, C) → D, with work A = 5, B = 40, C = 10, D = 5. On two PEs, B runs alone on one PE and A, C, D run on the other. Speedup = 60/40 = 1.5.]

[Figure: modified stream program, with B fissed into B1 and B2 between a splitter S and a joiner J. One PE runs A, S, B1, D and the other runs B2, C, J. Speedup = 60/32 ≈ 1.9.]

Page 11: Filter Fission Choices

[Figure: a stream graph with candidate filter fissions mapped onto PE0, PE1, PE2 and PE3. Speedup ~ 4?]

Page 12: Integrated Fission + PE Assign

• Exact solution based on Integer Linear Programming (ILP)
  – Split/join overhead factored in
• Objective function
  – Minimize the maximal load on any PE
• Result
  – Number of times to “split” each filter
  – Filter → processor mapping

Page 13: SGMS Phases

• Fission + processor assignment: load balance
• Stage assignment: causality, DMA overlap
• Code generation

Page 14: Forming the Software Pipeline

• To achieve speedup
  – All chunks should execute concurrently
  – Communication should be overlapped
• Processor assignment alone is insufficient information

[Figure: a small stream graph (A, B, C) mapped onto PE0 and PE1, with timelines showing iterations A1, A2, A3, B1 and the A→B transfer. One candidate schedule is marked X; the final schedule overlaps A(i+2) with B(i), with the A→B transfer in between.]

Page 15: Stage Assignment

• Producer i and consumer j on the same PE: Sj ≥ Si
  – Preserve causality (producer-consumer dependence)
• Producer i and consumer j on different PEs, with a DMA in between: SDMA > Si and Sj = SDMA + 1
  – Communication-computation overlap
• Data flow traversal of the stream graph
  – Assign stages using above two rules

Page 16: Stage Assignment Example

[Figure: the fissed stream graph (A, S, B1, B2, C, J, D) mapped onto PE 0 and PE 1, with stages assigned as follows.]

Stage 0: A, S, B1
Stage 1: DMA, DMA, DMA
Stage 2: C, B2, J
Stage 3: DMA, DMA
Stage 4: D

A, S, B1 and D run on one PE; C, B2 and J run on the other.

Page 17: SGMS Phases

• Fission + processor assignment: load balance
• Stage assignment: causality, DMA overlap
• Code generation

Page 18: Code Generation for Cell

• Target the Synergistic Processing Elements (SPEs)
  – PS3: up to 6 SPEs
  – QS20: up to 16 SPEs
• One thread / SPE
• Challenge
  – Making a collection of independent threads implement a software pipeline
  – Adapt kernel-only code schema of a modulo schedule

Page 19: Complete Example

    void spe1_work()
    {
      int i;
      char stage[5] = {0};
      stage[0] = 1;
      for (i = 0; i < MAX; i++) {
        if (stage[0]) { A(); S(); B1(); }
        if (stage[1]) { }
        if (stage[2]) { JtoD(); CtoD(); }
        if (stage[3]) { }
        if (stage[4]) { D(); }
        barrier();
      }
    }

[Figure: the staged stream graph (A, S, B1; DMA; C, B2, J; DMA; D) next to a timeline with columns SPE1, DMA1, SPE2, DMA2. SPE1 runs A, S, B1 and later D; DMA1 carries JtoD and CtoD; SPE2 runs B2, J, C; DMA2 carries B1toJ, StoB2, AtoC. Successive rows show the pipeline filling over time until all stages are active in every iteration.]

Page 20: Experiments

• StreamIt benchmarks
  – Signal processing: DCT, FFT, FMRadio, radar
  – MPEG2 decoder subset
  – Encryption: DES
  – Parallel sort: Bitonic
  – Range of 30 to 90 filters/benchmark
• Platform
  – QS20 blade server: 2 Cell chips, 16 SPEs
• Software
  – StreamIt to C, gcc 4.1
  – IBM Cell SDK 2.1

Page 21: Evaluation on QS20

[Figure: bar chart of relative speedup (axis 0 to 18) with 2P, 4P, 8P and 16P for bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder, radar and the geomean.]

• Split/join overhead reduces benefit from fission
• Barrier synchronization (1 per iteration of the stream graph)

Page 22: SGMS (ILP) vs. Greedy

[Figure: bar chart of relative speedup (axis 0 to 9) for bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder and radar. Series: ILP Partitioning, Greedy Partitioning (MIT method, ASPLOS’06), Exposed DMA.]

• Solver time < 30 seconds for 16 processors

Page 23: Conclusions

• Streamroller
  – Efficient mapping of stream programs to multicore
  – Coarse grain software pipelining
• Performance summary
  – 14.7x speedup on 16 cores
  – Up to 35% better than greedy solution (11% on average)
• Scheduling framework
  – Tradeoff memory space vs. load balance
    • Memory constrained (embedded) systems
    • Cache based system


Page 26: Naïve Unfolding

[Figure: a six-filter stream graph (A, B, C, D, E, F) replicated on PE1, PE2, PE3 and PE4, shown for two cases. In a completely stateless stream program the copies carry only stream data dependences. In a stream program with stateful nodes, state data dependences link the copies and require DMA transfers between PEs.]

Page 27: SGMS (ILP) vs. Greedy

[Figure: bar chart of relative speedup (axis 0 to 9) for bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder and radar. Series: ILP partitioning, Greedy partitioning (MIT method, ASPLOS’06).]

• Solver time < 30 seconds for 16 processors
• Highly dependent on graph structure
• DMA overlapped in both ILP and Greedy
• Speedup drops with exposed DMA

Page 28: Unfolding vs. SGMS

[Figure: bar chart of relative speedup (axis 0 to 18) per benchmark. Series: Unfolding, SGMS.]

Page 29: Pros and Cons of Unfolding

• Pros
  – Simple
  – Good speedups for mostly stateless programs
• Cons
  – All input data need to be available for iterations to begin
    • Long latency for one iteration
    • Not suitable for real time scheduling
  – Speedup affected by filters with large state (Sync + DMA)

[Figure: stream graph A → (B, C) → D unfolded across PE 1, PE 2 and PE 3. Each PE runs A, B, C, D for its own iterations, with Sync + DMA steps between the PEs.]

Page 30: SGMS Assumptions

• Data independent filter work functions
  – Static schedule
• High bandwidth interconnect
  – EIB 96 Bytes/cycle
  – Low overhead DMA, low observed latency
• Cores can run MIMD efficiently
  – Not (directly) suitable for GPUs

Page 31: Speedup Upperbounds

[Figure: bar chart of relative speedup (axis 0 to 50) with 2P, 4P, 8P, 16P, 32P and 64P for bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder and radar.]

• 15% work in one stateless filter
• Max speedup = 100/15 = 6.7

Page 32: Summary

• 14.7x speedup on 16 cores
  – Preprocessing steps (fission) necessary
  – Hiding communication important
• Extensions
  – Scaling to more cores
  – Constrained memory system
• Other targets
  – SODA, FPGA, GPU

Page 33: Stream Programming

• Algorithmic style suitable for
  – Audio/video processing
  – Signal processing
  – Networking (packet processing, encryption/decryption)
  – Wireless domain, software defined radio
• Characteristics
  – Independent filters
    • Segregated address space
    • Independent threads of control
  – Explicit communication
  – Coarse grain dataflow
  – Amenable to aggressive compiler optimizations [ASPLOS’02, ’06, PLDI ’03, ’08]

[Figure: MPEG decoder stream graph with the filters Bitstream parser, Zigzag unorder, Inverse Quant AC, Inverse Quant DC, Saturate, 1D-DCT (row), 1D-DCT (column), Bounded saturate and Motion decode.]

Page 34: Effect of Exposed DMA

[Figure: bar chart of relative speedup (axis 0 to 18) for bitonic, channel, dct, des, fft, filterbank, fmradio, tde, mpeg2, vocoder and radar. Series: SGMS, SGMS with exposed DMA.]

Page 35: Scheduling via Unfolding Evaluation

• IBM SDK 2.1, pthreads, gcc 4.1.1, QS20 Blade server
• StreamIt benchmarks characteristics

Benchmark    Total filters   Stateful   Peeking   State size (bytes)
bitonic      40              1          0         4
channel      55              1          34        252
dct          40              1          0         4
des          55              0          0         4
fft          17              1          0         4
filterbank   85              1          32        508
fmradio      43              1          14        508
tde          55              1          0         4
mpeg2        39              1          0         4
vocoder      54              11         6         112
radar        57              44         0         1032

Page 36: Absolute Performance

Benchmark    GFLOPS with 16 SPEs
bitonic      N/A
channel      18.72949
dct          17.2353
des          N/A
fft          1.65881
filterbank   18.41478
fmradio      17.05445
tde          6.486601
mpeg2        N/A
vocoder      5.488236
radar        30.29937

Page 37:

Original actor (node 0_0):

$\sum_{i=1}^{P} a_{1,0,0,i} - b_{1,0} = 0$

Fissed 2x (nodes 1_0, 1_1, 1_2, 1_3):

$\sum_{i=1}^{P} \sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} \le M b_{1,1}$

$\sum_{i=1}^{P} \sum_{j=0}^{3} a_{1,1,j,i} - b_{1,1} - 2 \ge -M + M b_{1,1}$

Fissed 3x (nodes 2_0, 2_1, 2_2, 2_3, 2_4):

$\sum_{i=1}^{P} \sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} \le M b_{1,2}$

$\sum_{i=1}^{P} \sum_{j=0}^{4} a_{1,2,j,i} - b_{1,2} - 3 \ge -M + M b_{1,2}$

$b_{1,0} + b_{1,1} + b_{1,2} = 1$

Page 38: SPE Code Template

    void spe_work()
    {
      int i, j;
      char stage[N] = {0};     /* bit mask to control active stages */
      stage[0] = 1;            /* activate stage 0 */

      /* go through all input items */
      for (i = 0; i < max_iter+N-1; i++) {
        /* bit mask controls what gets executed */
        if (stage[N-1]) {
        }
        if (stage[N-2]) {
        }
        ...
        if (stage[0]) {
          Start_DMA();         /* start DMA operation */
          FilterA_work();      /* call filter work function */
        }
        /* left shift bit mask (activate more stages) */
        for (j = N-1; j >= 1; j--)
          stage[j] = stage[j-1];
        /* stop feeding stage 0 once the last input item has entered */
        if (i == max_iter-1)
          stage[0] = 0;
        wait_for_dma();        /* poll for completion of all outstanding DMAs */
        barrier();             /* barrier synchronization */
      }
    }