MIT OpenCourseWare http://ocw.mit.edu
6.189 Multicore Programming Primer, January (IAP) 2007
Please use the following citation format:
Saman Amarasinghe, 6.189 Multicore Programming Primer, January (IAP) 2007. (Massachusetts Institute of Technology: MIT OpenCourseWare). http://ocw.mit.edu (accessed MM DD, YYYY). License: Creative Commons Attribution-Noncommercial-Share Alike.
Note: Please use the actual date you accessed this material in your citation.
For more information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms
6.189 IAP 2007
Lecture 12
StreamIt Parallelizing Compiler
Prof. Saman Amarasinghe, MIT. 1 6.189 IAP 2007 MIT
Common Machine Language
● Represent common properties of architectures – necessary for performance
● Abstract away differences in architectures – necessary for portability
● Cannot be too complex – must keep in mind the typical programmer
● C and Fortran were the common machine languages for uniprocessors
  – Imperative languages are not the correct abstraction for parallel architectures
● What is the correct abstraction for parallel multicore machines?
Common Machine Language for Multicores
● Current offerings:
  – OpenMP
  – MPI
  – High Performance Fortran
● Explicit parallel constructs grafted onto an imperative language
● Language features obscured:
  – Composability
  – Malleability
  – Debugging
● Huge additional burden on programmer:
  – Introducing parallelism
  – Correctness of parallelism
  – Optimizing parallelism
Explicit Parallelism
● Programmer controls details of parallelism!
● Granularity decisions:
  – if too small, lots of synchronization and thread creation
  – if too large, bad locality
● Load balancing decisions
  – Create balanced parallel sections (not data-parallel)
● Locality decisions
  – Sharing and communication structure
● Synchronization decisions
  – barriers, atomicity, critical sections, order, flushing
● For mass adoption, we need a better paradigm:
  – Where the parallelism is natural
  – Exposes the necessary information to the compiler
Unburden the Programmer
● Move these decisions to the compiler!
  – Granularity
  – Load Balancing
  – Locality
  – Synchronization
● Hard to do in traditional languages
  – Can a novel language help?
Properties of Stream Programs
● Regular and repeating computation
● Synchronous Data Flow
● Independent actors with explicit communication
● Data items have short lifetimes

Benefits:
● Naturally parallel
● Expose dependencies to compiler
● Enable powerful transformations

[Figure: FM radio stream graph: AtoD → FMDemod → Duplicate splitter feeding three band branches (LPF1/HPF1, LPF2/HPF2, LPF3/HPF3) → RoundRobin joiner → Adder → Speaker]
Outline
● Why We Need New Languages
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism
Steady-State Schedule

● All data pop/push rates are constant
● Can find a steady-state schedule
  – # of items in the buffers is the same before and after executing the schedule
  – There exists a unique minimum steady-state schedule
● Example pipeline: A (push=2) → B (pop=3, push=1) → C (pop=2)
● Building the schedule one firing at a time:
  { } → { A } → { A, A } → { A, A, B } → { A, A, B, A } → { A, A, B, A, B } → { A, A, B, A, B, C }
● Schedule = { A, A, B, A, B, C }
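The construction above can be sketched as a greedy simulation: first solve the balance equations for the minimal repetition vector (2·a = 3·b and 1·b = 2·c give a=3, b=2, c=1), then repeatedly fire any filter whose input buffer holds enough items. This is a minimal sketch of the idea, not the StreamIt scheduler itself; the filter names and rates follow the A/B/C example.

```python
from fractions import Fraction
from math import lcm

def repetition_vector(chain, rates):
    """Minimal integer firings per filter so buffers return to their
    starting level. Balance: reps[up] * push_up == reps[down] * pop_down."""
    reps = {chain[0]: Fraction(1)}
    for up, down in zip(chain, chain[1:]):
        push = rates[up][1]
        pop = rates[down][0]
        reps[down] = reps[up] * push / pop
    scale = lcm(*(r.denominator for r in reps.values()))
    return {f: int(r * scale) for f, r in reps.items()}

def greedy_schedule(chain, rates):
    """Fire the furthest-downstream ready filter until every filter has
    reached its repetition count; returns the firing order."""
    reps = repetition_vector(chain, rates)
    fired = {f: 0 for f in chain}
    buf = {f: 0 for f in chain[1:]}       # items queued on each input
    sched = []
    while any(fired[f] < reps[f] for f in chain):
        for f in reversed(chain):         # prefer downstream filters
            i = chain.index(f)
            pop, push = rates[f]
            if (i == 0 or buf[f] >= pop) and fired[f] < reps[f]:
                if i > 0:
                    buf[f] -= pop                 # consume inputs
                if i + 1 < len(chain):
                    buf[chain[i + 1]] += push     # produce outputs
                fired[f] += 1
                sched.append(f)
                break
    return sched

rates = {"A": (0, 2), "B": (3, 1), "C": (2, 0)}   # (pop, push) per filter
print(greedy_schedule(["A", "B", "C"], rates))    # ['A', 'A', 'B', 'A', 'B', 'C']
```

The output matches the schedule built step by step on the slides, and all buffers are empty again at the end, which is exactly the steady-state property.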
Initialization Schedule

● When peek > pop, the buffer cannot be empty after firing a filter
● Buffers are not empty at the beginning/end of the steady-state schedule
● Need to fill the buffers before starting the steady-state execution
● Example filter: peek=4, pop=3, push=1
Outline
● Why We Need New Languages
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism
Types of Parallelism

Task Parallelism
  – Parallelism explicit in algorithm
  – Between filters without producer/consumer relationship

[Figure: stream graph with a Scatter/Gather pair; the independent branches between them are labeled "Task"]
Types of Parallelism

Task Parallelism
  – Parallelism explicit in algorithm
  – Between filters without producer/consumer relationship

Data Parallelism
  – Between iterations of a stateless filter
  – Place within scatter/gather pair (fission)
  – Can't parallelize filters with state

Pipeline Parallelism
  – Between producers and consumers
  – Stateful filters can be parallelized

[Figure: the same graph annotated with Task (across branches), Data (across fissed copies between Scatter and Gather), and Pipeline (along producer-to-consumer chains)]
Types of Parallelism
Traditionally:

Task Parallelism
  – Thread (fork/join) parallelism

Data Parallelism
  – Data parallel loop (forall)

Pipeline Parallelism
  – Usually exploited in hardware

[Figure: the same annotated stream graph]
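The traditional realizations can be sketched in ordinary threaded code: task parallelism as a fork/join over two unrelated functions, and data parallelism as a forall over iterations of a stateless function. Pipeline parallelism has no equally direct library idiom, which is part of why it is usually left to hardware. The filter names below are hypothetical stand-ins, not StreamIt filters:

```python
from concurrent.futures import ThreadPoolExecutor

def band_pass(x):    # hypothetical stateless filter
    return x * 2

def band_stop(x):    # hypothetical stateless filter
    return x + 1

data = list(range(8))

with ThreadPoolExecutor() as pool:
    # Task parallelism: fork two independent filters, then join.
    f1 = pool.submit(lambda: [band_pass(x) for x in data])
    f2 = pool.submit(lambda: [band_stop(x) for x in data])
    task_results = (f1.result(), f2.result())

    # Data parallelism: a forall over iterations of one stateless filter;
    # map preserves input order, playing the role of scatter/gather.
    forall_results = list(pool.map(band_pass, data))

print(task_results[0])   # [0, 2, 4, 6, 8, 10, 12, 14]
print(forall_results)    # same list: band_pass is stateless
```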
Outline
● Why We Need New Languages
● Static Schedule
● Three Types of Parallelism
● Exploiting Parallelism
Baseline 1: Task Parallelism

[Figure: FilterBank stream graph: a Splitter feeds two parallel pipelines (BandPass → Compress → Process → Expand → BandStop), which a Joiner merges into an Adder]

● Inherent task parallelism between the two processing pipelines
● Task Parallel Model:
  – Only parallelize explicit task parallelism
  – Fork/join parallelism
● Executing this on a 2-core machine gives ~2x speedup over a single core
● What about 4, 16, 1024, … cores?
Raw Microprocessor: 16 in-order, single-issue cores with D$ and I$; 16 memory banks, each bank with DMA
Evaluation: Task Parallelism
[Chart: throughput normalized to single-core StreamIt (cycle-accurate simulator), y-axis 0–19, for BitonicSort, ChannelVocoder, DCT, DES, FFT, Filterbank, FMRadio, Serpent, TDE, MPEG2Decoder, Vocoder, Radar, and the geometric mean]

Parallelism: not matched to target! Synchronization: not matched to target!
Image by MIT OpenCourseWare.
Baseline 2: Fine-Grained Data Parallelism

[Figure: the FilterBank graph after fine-grained fission: each BandPass, Compress, Process, Expand, BandStop, and the Adder is replicated and wrapped in its own Splitter/Joiner pair]

● Each of the filters in the example is stateless
● Fine-Grained Data Parallel Model:
  – Fiss each stateless filter N ways (N is the number of cores)
  – Remove scatter/gather if possible
● We can introduce data parallelism
  – Example: 4 cores
● Each fission group occupies the entire machine
Evaluation: Fine-Grained Data Parallelism
[Chart: throughput normalized to single-core StreamIt for the same benchmark suite; series: Task, Fine-Grained Data]

Good parallelism! Too much synchronization!
Image by MIT OpenCourseWare.
Baseline 3: Hardware Pipeline Parallelism

[Figure: the FilterBank stream graph before fusion: Splitter, two pipelines (BandPass → Compress → Process → Expand → BandStop), Joiner, Adder]

● The BandPass and BandStop filters contain all the work
● Hardware Pipelining:
  – Use a greedy algorithm to fuse adjacent filters
  – Want # filters <= # cores
● Example: 8 cores
Baseline 3: Hardware Pipeline Parallelism

[Figure: the fused graph: on each branch, BandPass, a fused (Compress + Process + Expand) filter, and BandStop, for eight filters total]

● The BandPass and BandStop filters contain all the work
● Hardware Pipelining:
  – Use a greedy algorithm to fuse adjacent filters
  – Want # filters <= # cores
● Example: 8 cores
● Resultant stream graph is mapped to hardware
  – One filter per core
● What about 4, 16, 1024, … cores?
  – Performance depends on fusing to a load-balanced stream graph
Evaluation: Hardware Pipeline Parallelism
[Chart: throughput normalized to single-core StreamIt for the same benchmark suite; series: Task, Fine-Grained Data, Hardware Pipelining]

Parallelism: not matched to target! Synchronization: not matched to target!
Image by MIT OpenCourseWare.
The StreamIt Compiler
[Phases: Coarsen Granularity → Data Parallelize → Software Pipeline]

1. Coarsen: fuse stateless sections of the graph
2. Data Parallelize: parallelize stateless filters
3. Software Pipeline: parallelize stateful filters

● Compiling to a 16-core architecture gives an 11.2x mean throughput speedup over a single core
Phase 1: Coarsen the Stream Graph

[Figure: the FilterBank graph before coarsening, with the peeking filters (BandPass, BandStop, Adder) marked "Peek" among the stateless Compress, Process, and Expand filters]

● Before data parallelism is exploited
● Fuse stateless pipelines as much as possible without introducing state
  – Don't fuse stateless with stateful
  – Don't fuse a peeking filter with anything upstream
Phase 1: Coarsen the Stream Graph

[Figure: the coarsened graph: each branch is a fused (BandPass Compress Process Expand) filter followed by BandStop, then the Joiner and Adder]

● Before data parallelism is exploited
● Fuse stateless pipelines as much as possible without introducing state
  – Don't fuse stateless with stateful
  – Don't fuse a peeking filter with anything upstream
● Benefits:
  – Reduces global communication and synchronization
  – Exposes inter-node optimization opportunities
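Coarsening can be pictured as function composition: two adjacent stateless filters collapse into one filter whose work function runs both, so the buffer between them becomes a local value and the cross-filter communication disappears. A hedged sketch with hypothetical Compress/Process work functions:

```python
def compress(items):
    """Hypothetical stateless filter: pop 2, push 1 (pairwise sum)."""
    return [items[i] + items[i + 1] for i in range(0, len(items), 2)]

def process(items):
    """Hypothetical stateless filter: pop 1, push 1."""
    return [x * 10 for x in items]

def fuse(f, g):
    """Fuse two stateless filters: the intermediate buffer between
    f and g becomes a local value inside one work function."""
    def fused(items):
        return g(f(items))
    return fused

compress_process = fuse(compress, process)
print(compress_process([1, 2, 3, 4]))   # [30, 70]
```

Because neither work function keeps state, the fused filter is still stateless, which is what lets Phase 2 data-parallelize it afterwards.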
Phase 2: Data Parallelize

[Figure: the coarsened graph with the Adder fissed 4 ways inside a Splitter/Joiner pair]

● Data parallelize for 4 cores
● Adder: fiss 4 ways, to occupy the entire chip
Phase 2: Data Parallelize

[Figure: the graph with the Adder fissed 4 ways and each fused (BandPass Compress Process Expand) filter fissed 2 ways inside Splitter/Joiner pairs]

● Data parallelize for 4 cores
● Task parallelism! Each fused filter does equal work
● Fiss each fused filter 2 times to occupy the entire chip
Phase 2: Data Parallelize

[Figure: the fully data-parallelized graph: Adder fissed 4 ways, fused filters fissed 2 ways, and each BandStop fissed 2 ways, all within Splitter/Joiner pairs]

● Data parallelize for 4 cores
  – Task parallelism: each filter does equal work, so fiss each filter 2 times to occupy the entire chip
● Task-conscious data parallelization
  – Preserve task parallelism
● Benefits:
  – Reduces global communication and synchronization
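Fission itself can be sketched as a round-robin scatter of a stateless filter's iterations across n copies, with a round-robin gather that restores output order. The filter below is a hypothetical stand-in; because it is stateless, every copy computes the same function and the result is independent of n:

```python
from concurrent.futures import ThreadPoolExecutor

def band_stop(x):
    """Hypothetical stateless filter body."""
    return 3 * x + 1

def fiss(filter_fn, items, n):
    """Run n copies of a stateless filter under a scatter/gather pair."""
    # Scatter: round-robin the iterations across the n copies.
    chunks = [items[i::n] for i in range(n)]
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda c: [filter_fn(x) for x in c], chunks))
    # Gather: the round-robin joiner re-interleaves outputs in order.
    # Iteration p went to copy p % n as that copy's (p // n)-th item.
    return [results[p % n][p // n] for p in range(len(items))]

items = list(range(8))
print(fiss(band_stop, items, 4))   # [1, 4, 7, 10, 13, 16, 19, 22]
```

The scatter/gather bookkeeping is exactly what a fissed filter's Splitter/Joiner pair does in the stream graph.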
Evaluation: Coarse-Grained Data Parallelism
[Chart: throughput normalized to single-core StreamIt for the same benchmark suite; series: Task, Fine-Grained Data, Hardware Pipelining, Coarse-Grained Task + Data]

Good parallelism! Low synchronization!
Image by MIT OpenCourseWare.
Simplified Vocoder

[Figure: Simplified Vocoder stream graph with per-filter work estimates: a Splitter feeds two AdaptDFT filters (work 6 each); RectPolar (work 20) and PolarRect (work 20) are data parallel; between them, two branches of Diff, Unwrap, Amplify, Accum (work 1 each) are data parallel but have too little work]

● Target a 4-core machine
Data Parallelize

[Figure: the Vocoder graph with RectPolar and PolarRect each fissed 3 ways inside Splitter/Joiner pairs; the low-work Diff/Unwrap/Amplify/Accum filters are left unfissed]

● Target a 4-core machine
Data + Task Parallel Execution

[Figure: schedule of the data- and task-parallelized Vocoder on 4 cores, time on the vertical axis; cores sit idle during the low-work serial section]

● Target a 4-core machine
We Can Do Better!

[Figure: the same 4-core schedule with its critical path of 16 time units highlighted; the serialized low-work section leaves cores idle]

● Target a 4-core machine
Phase 3: Coarse-Grained Software Pipelining

[Figure: a prologue fills the buffers; in the new steady state the four RectPolar copies (from different iterations) execute concurrently, and the greedy partitioner packs all filters onto the 4 cores in 16 time units]

● The new steady state is free of dependencies
● Schedule the new steady state using a greedy partitioning
● Target a 4-core machine
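Coarse-grained software pipelining can be sketched on a two-stage chain: a prologue fires the upstream filter once, after which each steady-state iteration runs the downstream filter on data produced in the previous iteration. The two firings within an iteration then have no dependence on each other, so they could be placed on different cores. The stage functions here are hypothetical stand-ins:

```python
def stage_a(x):          # hypothetical upstream filter
    return x + 100

def stage_b(y):          # hypothetical downstream (possibly stateful) filter
    return y * 2

inputs = list(range(6))
outputs = []

# Prologue: fire A once to fill the inter-stage buffer.
buf = stage_a(inputs[0])

# New steady state: within one iteration, B consumes the value that A
# produced in the *previous* iteration, so stage_a(inputs[n]) and
# stage_b(buf) are independent and could run concurrently.
for n in range(1, len(inputs)):
    a_out = stage_a(inputs[n])
    outputs.append(stage_b(buf))
    buf = a_out

# Epilogue: drain the last buffered value.
outputs.append(stage_b(buf))

print(outputs)   # [200, 202, 204, 206, 208, 210]
```

Stretching this across a whole steady-state schedule is what lets StreamIt parallelize even stateful filters such as the Vocoder's accumulators.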
Evaluation: Coarse-Grained Task + Data + Software Pipelining
[Chart: throughput normalized to single-core StreamIt for the same benchmark suite; series: Task, Fine-Grained Data, Hardware Pipelining, Coarse-Grained Task + Data, Coarse-Grained Task + Data + Software Pipeline]

Best parallelism! Lowest synchronization!
Image by MIT OpenCourseWare.
Summary
● Streaming model naturally exposes task, data, and pipeline parallelism
● This parallelism must be exploited at the correct granularity and combined correctly
Scheme                                           Parallelism            Synchronization
Task                                             Application Dependent  Application Dependent
Fine-Grained Data                                Good                   High
Hardware Pipelining                              Application Dependent  Application Dependent
Coarse-Grained Task + Data                       Good                   Low
Coarse-Grained Task + Data + Software Pipeline   Best                   Lowest

● Robust speedups across a varied benchmark suite