PACT2013Slides.pdf
8/13/2019 PACT2013Slides.pdf
1/33
Exposing ILP in Custom Hardware with a Dataflow Compiler IR
Ali Mustafa Zaidi
Supervisor: Dr. David Greaves
University of Cambridge Computer Laboratory
-
The Dark Silicon Problem
2.1GHz @90nm (80W): 18%
5.2GHz @45nm (80W): 7%
7.3GHz @32nm (80W): 3%
Amdahl's Law + Utilization Wall = Dark Silicon
45nm → 8nm (32x resources)
CPU: 3.5x, GPU: 2.4x (Cnsr.)
CPU: 7.9x, GPU: 2.7x (ITRS)
Esmaeilzadeh et al., "Dark Silicon and the End of Multicore Scaling", IEEE Micro 2012.
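As a rough illustration of the Amdahl's Law term on this slide, the standard formula speedup = 1 / ((1 - f) + f / n) can be evaluated directly. This is only a sketch: the function name and the example parallel fractions are assumptions chosen for illustration, not figures from the cited study.

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Amdahl's Law: speedup when a fraction of the work runs on n_units in parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)


# 32x resources (the 45nm -> 8nm figure on the slide) only help the parallel part.
for f in (0.5, 0.9, 0.99):  # assumed example fractions, not data from the slides
    print(f, round(amdahl_speedup(f, 32), 2))
```

Even before any power limit is considered, the serial fraction caps the benefit of extra resources, which is why the slide pairs Amdahl's Law with the utilization wall.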
-
The Dark Silicon Problem
Can we achieve Superscalar Performance, w/o Superscalar
-
Solution: Spatial Architectures?
Custom Hardware, FPGAs, CGRAs, MPPAs, etc.
Advantages
Scalable, decentralized architectures, with short, p2p wiring.
High Computational Density
10-1000x Energy and Performance efficiency.
Issues
Poor Programmability: often requiring low-level hardware knowledge
Limited Amenability: poor performance on sequential, irregular, or complex control-flow code.
Examples
Conservation Cores: Performance ≈ in-order MIPS24KE core
Phoenix CASH Hardware: Performance 30% less than 4-way
-
Key Reasons for High Performance of Complex, Out-of-order execution scheduling
Custom hardware has very limited speculation
Single flow of control
If-conversion & hyperblock formation for forward branches.
No acceleration of backwards branches!
[Figure: Control-Data Flow Graph of a loop. Nodes: Start, i = 0, i < 100 (T/F), A[i] > 0 (T/F), foo(), bar(), i++, End]
McFarlin et al., "Discerning the dominant out-of-order performance advantage: is it speculation or dynamism?", ASPLOS 2013
Solution: Spatial Architectures?
-
Our Solution
Instead of
CDFG + Compile-time Execution Scheduling
We Employ
VSFG + Dataflow Execution Model
[Figure: Control-Data Flow Graph of a loop. Nodes: Start, i = 0, i < 100 (T/F), A[i] > 0 (T/F), foo(), bar(), i++, End]
Solution: Spatial Architectures
-
Flow with the VSFG
Value State Flow Graph
Infinite DAG
Loops represented as Tail Recursion
Branches represented via if-conversion: Enables Aggressive Speculation!
No single 'Flow of Control'
Instead, control implemented via 'Boolean Predicate Expressions'.
Logic minimization can simplify expressions, facilitating Control Dependence Analysis!
[Figure: VSFG of the loop. Labels: i = 0, STATE_IN, STATE_OUT, A[i], > 0, foo(), bar(), i++, < 100, inPred, "Next iteration of 'for' loop"]
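The two representation choices on this slide (loops as tail recursion, branches via if-conversion) can be sketched in ordinary Python. This is a minimal illustration, not the compiler's actual transformation: foo and bar are given hypothetical, side-effect-free bodies that take the array element as an argument, and the accumulated tuple merely stands in for the state carried between STATE_IN and STATE_OUT.

```python
def foo(a):
    return a * 2          # hypothetical, side-effect-free body


def bar(a):
    return -a             # hypothetical, side-effect-free body


def loop_with_control_flow(A):
    """Reference version: one flow of control, one branch taken per iteration."""
    state = []
    for i in range(len(A)):
        if A[i] > 0:
            state.append(foo(A[i]))
        else:
            state.append(bar(A[i]))
    return state


def loop_as_tail_recursion(A, i=0, state_in=()):
    """Each call is one loop iteration; the tail call is the next iteration."""
    in_pred = i < len(A)              # predicate guarding this iteration
    if not in_pred:
        return list(state_in)         # state leaving the loop
    t = foo(A[i])                     # if-conversion: both sides are computed...
    f = bar(A[i])
    chosen = t if A[i] > 0 else f     # ...and the predicate selects one result
    return loop_as_tail_recursion(A, i + 1, state_in + (chosen,))


print(loop_as_tail_recursion([3, -1, 0, 5]))
```

Python does not eliminate tail calls, so this form is only practical for short loops; the point is the shape of the graph, where no backwards branch remains.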
-
Flow with the VSFG
Value State Flow Graph
Hierarchical Dataflow Graph
Subgraphs may be 'predicated', or executed speculatively (via 'if-conversion').
'Flattening' loop tail-call subgraphs → loop unrolling/pipelining.
Multiple loops in a loop-nest may be unrolled independently to expose ILP
[Figure: VSFG of the loop. Labels: i = 0, STATE_IN, STATE_OUT, A[i], > 0, foo(), bar(), i++, < 100, inPred, "Next iteration of 'for' loop"]
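The claim that flattening a loop's tail-call subgraph amounts to unrolling can be illustrated with a small sketch. The loop body here (a running sum) and all function names are assumptions for illustration only; the second function is the first with its tail call inlined once.

```python
def body(acc, x):
    return acc + x                    # hypothetical loop body


def loop(xs, i, acc):
    """Loop as tail recursion: one iteration per invocation."""
    if i >= len(xs):
        return acc
    return loop(xs, i + 1, body(acc, xs[i]))


def loop_unrolled_once(xs, i, acc):
    """Same loop with the tail call flattened once: two iterations per invocation."""
    if i >= len(xs):
        return acc
    acc = body(acc, xs[i])
    if i + 1 >= len(xs):              # exit test of the inlined copy
        return acc
    acc = body(acc, xs[i + 1])
    return loop_unrolled_once(xs, i + 2, acc)


print(loop([1, 2, 3, 4, 5], 0, 0), loop_unrolled_once([1, 2, 3, 4, 5], 0, 0))
```

Both versions compute the same result; the flattened one simply exposes two copies of the body in a single subgraph, which is where the extra ILP would come from.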
-
High Level Synthesis Case Study
Any High Level Language → LLVM → VSFG → Low-Level IR → Bluespec SystemVerilog → ASIC / FPGA

LLVM IR:
    %1 = mul i32 %x, %y
    %2 = srem i32 %1, %z
    %3 = icmp slt i32 %2, %1

Bluespec:
    FIFOF#(int) x <- mkFIFOF1;
    FIFOF#(int) y <- mkFIFOF1;
    FIFOF#(int) z <- mkFIFOF1;
    FIFOF#(int) srem_1 <- mkFIFOF1;
    FIFOF#(int) icmp_1 <- mkFIFOF1;
    FIFOF#(int) icmp_2 <- mkFIFOF1;
    FIFOF#(int) out_3 <- mkFIFOF1;

    rule mul_inst;
        let val1 = x.first; x.deq;
        let val2 = y.first; y.deq;
        let rslt = val1 * val2;
        srem_1.enq(rslt);
        icmp_1.enq(rslt);
    endrule

    rule srem_inst;
        let val1 = srem_1.first; srem_1.deq;
        let val2 = z.first; z.deq;
        let rslt = val1 % val2;
        icmp_2.enq(rslt);
    endrule
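To make the dataflow firing behaviour of this listing concrete, the following Python sketch simulates the three instructions as rules over one-element FIFOs. It is a simplified model under stated assumptions: the Fifo1 class, the run function and the icmp_inst rule are hypothetical (the slide's listing stops after srem_inst), integers are unbounded rather than 32-bit, and rules are simply retried until none can fire.

```python
from collections import deque


class Fifo1:
    """One-element FIFO, loosely modelling the mkFIFOF1 queues on the slide."""

    def __init__(self):
        self._q = deque()

    def not_empty(self):
        return len(self._q) == 1

    def not_full(self):
        return len(self._q) == 0

    def enq(self, value):
        assert self.not_full(), "enq on a full FIFO"
        self._q.append(value)

    def deq(self):
        return self._q.popleft()


def srem(a, b):
    """Signed remainder whose sign follows the dividend, as LLVM's srem does."""
    r = abs(a) % abs(b)
    return r if a >= 0 else -r


def run(x_val, y_val, z_val):
    names = ["x", "y", "z", "srem_1", "icmp_1", "icmp_2", "out_3"]
    f = {name: Fifo1() for name in names}
    f["x"].enq(x_val)
    f["y"].enq(y_val)
    f["z"].enq(z_val)

    def mul_inst():
        ready = (f["x"].not_empty() and f["y"].not_empty()
                 and f["srem_1"].not_full() and f["icmp_1"].not_full())
        if ready:
            rslt = f["x"].deq() * f["y"].deq()
            f["srem_1"].enq(rslt)
            f["icmp_1"].enq(rslt)
        return ready

    def srem_inst():
        ready = (f["srem_1"].not_empty() and f["z"].not_empty()
                 and f["icmp_2"].not_full())
        if ready:
            rslt = srem(f["srem_1"].deq(), f["z"].deq())
            f["icmp_2"].enq(rslt)
        return ready

    def icmp_inst():
        # Assumed rule: not shown on the slide, inferred from the LLVM IR.
        ready = (f["icmp_2"].not_empty() and f["icmp_1"].not_empty()
                 and f["out_3"].not_full())
        if ready:
            lhs = f["icmp_2"].deq()   # %2
            rhs = f["icmp_1"].deq()   # %1
            f["out_3"].enq(lhs < rhs)
        return ready

    # Rules are listed in "wrong" order on purpose: data availability,
    # not textual order, decides when each one fires.
    rules = [icmp_inst, srem_inst, mul_inst]
    fired = True
    while fired:
        fired = any([rule() for rule in rules])
    return f["out_3"].deq()


print(run(6, 7, 5))
```

Each rule fires only when all of its input queues hold a token and its output queues have space, which is the static dataflow discipline the toolchain assumes.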
-
LegUp LLVM 2.9
-
Performance (Cycle Counts)
Normalised to LegUp
Compared to Nios II/f & Intel Nehalem Core i7 (SniperSim)
Matrix Transpose (x1k cycles)
adpcm (x1k cycles)
dfsin (x1k cycles)
Neural Net Simulator (x1M cycles)
-
Frequency & Delay
[Chart: Frequency in MHz (Higher is Better), y-axis 0 to 450. Series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3. Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa]
[Chart: Normalised Delay (Lower is Better), y-axis 0 to 1.4. Same series and benchmarks]
Nios II/f @250MHz
-
Power and Speculation
[Chart: misspeculated activity (bits) vs useful activity (bits), y-axis 0 to 6. Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips]
-
Normalized Energy
[Chart: Normalized Energy, log-scale y-axis 0.1 to 100. Series: LegUp, VSFG_0, VSFG_1, VSFG_3, Nios. Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips, GEOMEAN]
-
Energy Cost Comparison:
vs Nios II/f: 0.25x (GEOMEAN)
4x (GEOMEAN)
-
Limitations on Performance
35% better performance than statically scheduled CFG, without any optimizations:
Improvements due to dynamic scheduling, MFC & CDA.
Unrolling helps, but speed-up saturates quickly.
Further Improvements possible:
Balance between predication & speculation, to improve speed-up without unrolling (thus reducing area and energy costs)
State-edge is on critical path: limits both unrolling & MFC.
Last remnant of 'sequential' nature of program.
Frequency Scaling limited by Memory Interconnect
Partition memory & pipeline memory access tree
-
Thank You
Implicit Parallelism & State edge Partitioning
-
[Figure: axis "Increasing Programmer / Compiler Effort". Labels include: Alias Analysis, Specul. Loads, edge Partitioning, SpMT / TLS, Dynamic]
-
Performance (Cycle Counts)
Cycle counts normalized to LegUp results
VSFG implemented with all loops unrolled 0, 1, and 3 times
Full Speculation: all subgraphs (except loops) triggered without predicates
[Chart: Cycle Counts with Full Speculation, y-axis 0 to 1.6. Series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3. Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa]
-
Performance (Cycle Counts)
Predication: only one block will execute
Speculation: both blocks execute, but only one result is chosen
[Chart: Cycle Counts with Full Speculation, y-axis 0 to 1.6. Series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3. Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa]
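The difference between the two modes named on this slide can be shown with a small sketch that records which blocks actually run. The function names and the "executed" bookkeeping are assumptions for illustration; they are not part of the toolchain.

```python
def predicated(pred, then_block, else_block):
    """Predication: the predicate is known first, so only one block executes."""
    executed = []
    if pred:
        executed.append("then")
        result = then_block()
    else:
        executed.append("else")
        result = else_block()
    return result, executed


def speculated(pred, then_block, else_block):
    """Speculation: both blocks execute; the predicate only picks a result."""
    executed = ["then", "else"]
    then_result = then_block()
    else_result = else_block()
    return (then_result if pred else else_result), executed


print(predicated(True, lambda: 1, lambda: 2))
print(speculated(True, lambda: 1, lambda: 2))
```

Speculation removes the predicate from the critical path of both blocks at the cost of work whose result is discarded, which is the trade-off behind the misspeculated-activity measurements.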
-
Performance (Cycle Counts)
[Chart: Cycle Counts with Full Speculation, y-axis 0 to 1.6. Series: LegUp (CFG), VSFG_0, VSFG_1, VSFG_3. Benchmarks: epic, adpcm, dfadd, dfdiv, dfmul, dfsin, mips**, small_bimpa]
[Chart: Cycle Counts with Predicated Subgraphs, y-axis 0 to 1.6. Same series and benchmarks]
-
Performance (Cycle Counts)
[Charts: absolute cycle counts for Core i7, Nios 2f, LegUp, VSFG_0, VSFG_1, VSFG_3. One panel each for small_bimpa, dfsin, epic, adpcm]
-
Performance (Cycle Counts)
[Charts: absolute cycle counts for Core i7, Nios 2f, LegUp, VSFG_0, VSFG_1, VSFG_3. One panel each for dfadd, dfdiv, dfmul, mips**]
-
Understanding
If-conversion & hyperblock formation for forward branches.
No acceleration of backwards branches!
[Figure: Control-Data Flow Graph of a loop. Nodes: Start, i = 0, i < 100 (T/F), A[i] > 0 (T/F), foo(), bar(), i++, End]
Formalizing & Evaluating the VSFG
-
Formalizing & Evaluating the VSFG
Any High Level Language → LLVM → VSFG → Low-Level IR → Bluespec SystemVerilog → ASIC / FPGA
Plotkin-style operational semantics developed for VSFG
Assuming Static Dataflow execution model
Low-Level IR developed to facilitate conversion to Bluespec
Based on Hierarchical Coloured Petri-nets
High-Level Synthesis Toolchain implemented
Hardware
-
Hardware
Bluespec Code
Petri Net based Low-Level Dataflow