University of Michigan, Electrical Engineering and Computer Science
Exploring the Design Space of LUT-based Transparent Accelerators
Sami Yehia*, Nathan Clark▪, Scott Mahlke▪, and Krisztian Flautner*
*ARM Ltd.
▪Advanced Computer Architecture Lab, University of Michigan
CASES 2005, September 24-27
Embedded Products Convergence
Application demands keep increasing the need for performance.
Embedded systems win through customization: more performance, lower power, etc.
Traditional ISA customization and hardware specialization cannot keep up with the growing number of functions.
One way forward: Transparent Instruction Set Customization
[Figure: concept smart phone of 2008, combining 3.5G (HSDPA), WiMax, NFC / RFID, stereo headset over Bluetooth/UWB, biometrics, GPS, TV out, PC / Mac connectivity, memory card, DMB (Digital Mobile Broadcast) and a 20 GB HD]
Transparent Instruction Set Customization
[Figure: an instruction sequence I1 to I5 can be sped up either by a higher frequency or by collapsing instructions (customization)]
An alternative path to performance
No (or only minor) ISA change
Baseline CPU unchanged
Hardware generates control
Eases software burden
Forward compatible
Architecture Framework
[Figure: the compiler augments the application's instruction stream to mark subgraphs (…BRL… in the instruction stream); the standard pipeline is extended with control generation and a Subgraph Execution Unit that takes inputs and produces outputs]
Pipeline Interface
LUT-based accelerator
Addition/Subtraction
inst1: EOR r6,r1,r2
inst2: AND r7,r4,r5
inst3: ORR r12,r6,r7
[Figure: dataflow graph in which EOR (inputs r1, r2) and AND (inputs r4, r5) feed ORR, which produces r12]
r5 r4 r2 r1   r12 = (a^b) | (c&d)
0  0  0  0    0
0  0  0  1    1
0  0  1  0    1
0  0  1  1    0
0  1  0  0    0
0  1  0  1    1
0  1  1  0    1
0  1  1  1    0
1  0  0  0    0
1  0  0  1    1
1  0  1  0    1
1  0  1  1    0
1  1  0  0    1
1  1  0  1    1
1  1  1  0    1
1  1  1  1    1
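As a rough illustration of the truth table above, the sketch below derives the 16-entry LUT from the three instructions and then applies it independently at each bit position of 32-bit register values. The helper names (`subgraph`, `build_lut`, `apply_lut`) are hypothetical and do not come from the slides; the row order (r5 r4 r2 r1, with r1 as the least significant index bit) is taken from the table.

```python
def subgraph(r1, r2, r4, r5):
    """Software model of the three collapsed instructions."""
    r6 = r1 ^ r2      # inst1: EOR r6,r1,r2
    r7 = r4 & r5      # inst2: AND r7,r4,r5
    return r6 | r7    # inst3: ORR r12,r6,r7


def build_lut(func, n_inputs):
    """Tabulate a bitwise function: entry `row` holds func applied to the bits of `row`."""
    lut = []
    for row in range(1 << n_inputs):
        bits = [(row >> k) & 1 for k in range(n_inputs)]  # bits[0] is the least significant
        lut.append(func(*bits) & 1)
    return lut


def apply_lut(lut, regs, width=32):
    """Evaluate the same LUT independently at every bit position."""
    out = 0
    for i in range(width):
        row = 0
        for k, reg in enumerate(regs):
            row |= ((reg >> i) & 1) << k
        out |= lut[row] << i
    return out


LUT = build_lut(subgraph, 4)                       # inputs ordered r1, r2, r4, r5
lut_bits = "".join(str(b) for b in reversed(LUT))  # row 15 first, row 0 last
print(lut_bits)
```

Printed with row 15 first, the contents read 1111011001100110, the same string that labels the LUT on this slide. Because no bit position depends on its neighbours, one small LUT replicated 32 times covers any purely logical subgraph of these four inputs.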
inst1: ADD r6,r1,r2
$r6_i = r1_i \oplus r2_i \oplus Cin_{i-1}$
$Cin_i = r1_i \cdot r2_i \mid Cin_{i-1} \cdot (r1_i \oplus r2_i)$
A Carry Generator that is also programmable
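[Figure: LUT-based unit in which a 16-entry LUT with contents 1111011001100110 is applied across the 32-bit inputs r1, r2, r4 and r5 to produce the 32-bit output r12]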
Programmable Carry Functional Unit (PCFU)
[Figure: PCFU datapath. The 32-bit inputs in1 to in4 feed the g1/p1 LUTs (labeled 16/16) and the g2/p2 LUTs (labeled 32/32); each g/p pair drives a Carry Generator that produces the 32-bit carry words cin1 and cin2; OutLUT and OutLUT2 (labeled 64) combine the inputs with cin1 and cin2 to produce the 32-bit results Out and Out2]
[Figure: Carry Generator drawn as a five-level (L1 to L5) parallel-prefix network over bit positions 0 to 31. Input node i holds $(g_i, p_i)$, and each "o" node combines two (generate, propagate) pairs]
$(G, P) \circ (G', P') = (G \mid P \cdot G',\ P \cdot P')$, with $(G, P)$ the more significant group
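A minimal sketch of how programmable g/p LUTs, a carry generator and an output LUT could fit together, configured here for a single ADD. It is not the authors' implementation: the names (`G_LUT`, `P_LUT`, `OUT_LUT`, `bitwise_lut`, `carry_generator`, `pcfu`) are hypothetical, only two inputs are modelled, and the doubling-distance (Kogge-Stone style) arrangement of the five levels is an assumption, since the slide only shows that there are five levels over 32 bits.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

# Truth tables indexed by in1 | (in2 << 1); these contents configure an ADD.
G_LUT = [0, 0, 0, 1]   # generate  = in1 AND in2
P_LUT = [0, 1, 1, 0]   # propagate = in1 XOR in2
# Output LUT indexed by in1 | (in2 << 1) | (cin << 2): in1 XOR in2 XOR cin.
OUT_LUT = [0, 1, 1, 0, 1, 0, 0, 1]


def bitwise_lut(lut, words):
    """Apply one LUT at every bit position of the given words."""
    out = 0
    for i in range(WIDTH):
        row = 0
        for k, w in enumerate(words):
            row |= ((w >> i) & 1) << k
        out |= lut[row] << i
    return out


def carry_generator(g, p):
    """Five prefix levels (distances 1, 2, 4, 8, 16) computed on whole words.

    At each level, bit i combines its (G, P) pair with the pair d positions
    below it, (G', P'): G becomes G | P.G' and P becomes P.P'.
    """
    G, P = g, p
    d = 1
    while d < WIDTH:
        G = (G | (P & (G << d))) & MASK   # uses P from the previous level
        P = (P & (P << d)) & MASK
        d <<= 1
    return (G << 1) & MASK   # carry into bit i is the carry out of bit i-1


def pcfu(in1, in2):
    g = bitwise_lut(G_LUT, [in1, in2])
    p = bitwise_lut(P_LUT, [in1, in2])
    cin = carry_generator(g, p)
    return bitwise_lut(OUT_LUT, [in1, in2, cin])


print(hex(pcfu(0xFFFF, 0x0001)))
```

Loading different contents into the g, p and output LUTs changes what the unit computes while the carry network itself stays fixed, which is one way to read "a Carry Generator that is also programmable".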
Configuration generation
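Subgraph:
AND r3, r1, r2
ADD r4, r1, r2
XOR r5, r3, r4
Meta Register file (one truth-table row per register, columns ordered by the LUT inputs in1, in2, Cin):
in1   0 1 0 1 0 1 0 1
in2   0 0 1 1 0 0 1 1
Cin   0 0 0 0 1 1 1 1
m-r3  0 0 0 1 0 0 0 1   Out = A AND B (inputs r1, r2)
m-r4  0 1 1 0 1 0 0 1   Out = A XOR B XOR cin, with g = A.B (0 0 0 1 0 0 0 1) and p = A XOR B (0 1 1 0 0 1 1 0)
m-r5  0 1 1 1 1 0 0 0   Out = A XOR B (inputs r3, r4)
Meta Function Unit: LUT(r3) = LUT(r1) AND LUT(r2)
[Figure: the resulting configuration is loaded into the g LUT (1 0 0 0 1 0 0 0), the p LUT (0 1 1 0 0 1 1 0) and the OutLUT (0 0 0 1 1 1 1 0), shown last entry first for g and OutLUT; in1 and in2 feed the g/p LUTs and the OutLUT, the Carry Generator turns g1 and p1 into cin1, and the OutLUT produces Out]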
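The meta register file idea can be sketched as follows: each register carries a truth-table column instead of a value, and every instruction of the subgraph is applied to those columns. This is only an illustration under assumptions; the names (`column`, `m_and`, `m_xor`, `meta`) are hypothetical, and the carry-in is modelled as a third LUT input as on the slide.

```python
ROWS = 8  # truth-table rows for the three LUT inputs in1, in2, cin


def column(k):
    """Truth-table column of the k-th LUT input (k = 0 for in1)."""
    return [(row >> k) & 1 for row in range(ROWS)]


def m_and(a, b):
    """Meta AND: combine two truth-table columns entry by entry."""
    return [x & y for x, y in zip(a, b)]


def m_xor(a, b):
    """Meta XOR: combine two truth-table columns entry by entry."""
    return [x ^ y for x, y in zip(a, b)]


# Meta registers of the subgraph inputs start as the plain input columns.
meta = {"r1": column(0), "r2": column(1)}
cin = column(2)

# AND r3, r1, r2  ->  LUT(r3) = LUT(r1) AND LUT(r2)
meta["r3"] = m_and(meta["r1"], meta["r2"])

# ADD r4, r1, r2  ->  g and p LUT contents, plus a sum that depends on cin
g_lut = m_and(meta["r1"], meta["r2"])
p_lut = m_xor(meta["r1"], meta["r2"])
meta["r4"] = m_xor(p_lut, cin)

# XOR r5, r3, r4  ->  contents of the output LUT
meta["r5"] = m_xor(meta["r3"], meta["r4"])

for name in ("r3", "r4", "r5"):
    print("m-" + name, " ".join(map(str, meta[name])))
```

The three printed rows are 0 0 0 1 0 0 0 1, 0 1 1 0 1 0 0 1 and 0 1 1 1 1 0 0 0, which agree with the m-r3, m-r4 and m-r5 contents on the slide.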
Design Space
Number of Inputs
Number of Outputs
Number of Addition/Subtractions
Shift support: at inputs, at outputs
[Figure: PCFU variants across the design space: the 4-input unit (in1 to in4) with g1/p1 LUTs, g2/p2 LUTs, Carry Generators (cin1, cin2) and OutLUT; a 6-input variant (in1 to in6); a second output (OutLUT2, Out2); and shift support through a Shifter at the inputs or an OutShifter at the output]
Evaluation
Ported Trimaran compiler to the ARM ISA
Subgraph identification engine
Synthesized with Synopsys standard cell library at 0.13 µm
SimpleScalar configured as an ARM926EJ-S
5-stage pipeline, 250 MHz
1-cycle 16 KB I/D caches
Single issue
Baseline: 1-cycle subgraph execution latency
Speedup – Baseline PCFU
4-input, 2-output PCFU design
[Figure: bar chart of speedup (y-axis from 1 to 3) per benchmark. SPEC: 164.gzip, 181.mcf, 197.parser, 256.bzip2, 300.twolf. MediaBench: epic, g721decode, g721encode, gsmdecode, gsmencode, mpeg2dec, mpeg2enc, pegwitdec, pegwitenc, rawcaudio, rawdaudio. Encryption: blowfish, md5, rc4, rijndael, sha]
Number of inputs/outputs
[Figure: latency (ns, axis from 0 to 7) of PCFU configurations 2In/1Out, 2In/2Out, 3In/1Out, 3In/2Out, 4In/1Out, 4In/2Out, 4In/3Out, 5In/1Out, 5In/2Out, 5In/3Out, 6In/1Out, 6In/2Out and 6In/3Out, with the ARM926EJ-S as reference]
[Figure: speedup (axis from 1 to 1.8) as a function of the number of inputs (2 to 8) and outputs (1 to 4)]
Area is proportional
Number of addition/subtractions
[Figure: latency (ns, axis from 0 to 7) by number of supported additions/subtractions. Data labels: 0 Add/Sub 1.90%, 1 Add/Sub 4.36%, 2 Add/Sub 7.70%, 3 Add/Sub 16.40%, ARM926EJ-S 100.00%]
[Figure: speedup (axis from 1 to 1.9) versus number of ADDs supported (0 to 5), for SPEC Avg., MediaBench Avg. and Encryption Avg.]
Collapsing Emulation
[Figure: latency (ns, axis from 0 to 8). Data labels: 2 Add/Sub 6.10%, (1-1) Add/Sub 5.30%, 3 Add/Sub 12.40%, (2-1) Add/Sub 9.60%, (1-1-1) Add/Sub 8.10%, ARM926EJ-S 100.00%]
[Figure: block diagrams of units with inputs in1 to in4 and output Out, labeled PCFU_1ADD (4-in), PCFU_2ADD and PCFU_1ADD, in which chained PCFU_1ADD units stand in for multi-adder PCFUs]
Shift support
[Figure: latency (ns, axis from 0 to 7). Data labels: No Shifts 7.70%, 2 at Inputs 9.60%, 1, 2, 16 at Inputs 10.10%, Any at Inputs 10.10%, Any at Outputs 9.10%, ARM926EJ-S 100.00%]
[Figure: speedup (axis from 1.5 to 1.85) versus shift values supported (None; 2; 2,16; 1,2,16; Any), with series Head, Tail and Anywhere]
Design points
[Figure: two scatter plots of speedup (axis from 1 to 1.6) versus area (0.0 to 0.7 mm^2), one for a 1 Cycle PCFU and one for a 2 Cycle PCFU. Each point is labeled with its inputs (I), outputs (O), adders (A) and shift support]
Design points plotted in both charts:
4I, 2O, 0A, None
2I, 2O, 2A, None
4I, 2O, 1A, None
4I, 2O, 2A, *Out
4I, 2O, 2A, *In
4I, 2O, 2A, 1,2,16In
4I, 3O, 2A, None
4I, 1O, 2A, 2In
3I, 2O, 2A, None
4I, 1O, 2A, None
4I, 2O, 2A, None
5I, 2O, 2A, None
5I, 3O, 2A, None
4I, 2O, 3A, None
6I, 2O, 2A, None
Points called out separately: 4I, 2O, 2A, None; 4I, 3O, 2A, None; 5I, 3O, 2A, None
Conclusions
Transparent Instruction Set Customization needs:
Extracting computations from the program
An efficient substrate to map subgraphs onto
PCFU: LUT-based accelerators
Flexible, configurable accelerators
Efficient configuration
You can get up to 66% speedup with a 6-input / 3-output / 2-adder PCFU
... but you get 62% with an 8 times smaller, ~40% faster PCFU
Q & A
Backups
PCFU Design Space
[Figure: PCFU building blocks (N Inputs, Shifters, Carry Generators, Output1/Output2) and example configurations: 0 Add, 1 o/p PCFU (logic only); 1 Add, 1 o/p PCFU; 2 Add, 1 o/p PCFU; 2 Add, 2 o/ps PCFU, shifts at i/ps; 3 Add, 2 o/p PCFU, shifts at i/ps, shifts at o/ps]
Design                    Latency (ns)  Area (cells)  Speedup
CCA (Michigan)            4.32          278748        1.8
CCA (R&D)                 7.07          606345        1.8
PCFU (2 AS/4IN/2OUT)      4.2           171305        1.62
PCFU Logic only           0.59          26007         1.18
PCFU 1 ADD                2.15          63603         1.33
PCFU 2 ADD                3.79          134637        1.62
PCFU 3 ADD                5.82          274939        1.63
PCFU 2 IN                 3.03          52437         1.49
PCFU 3 IN                 3.24          68846         1.56
PCFU 4 IN                 3.79          134637        1.62
PCFU 5 IN                 5.25          214885        1.63
PCFU 6 IN                 5.47          465630        1.63
PCFU (1 OUT)              3.79          134637        1.45
PCFU (2 OUT)              4.2           171305        1.62
PCFU (3 OUT)              4.57          230189        1.63
PCFU (Shift at inputs)    5.02          170529        1.75
PCFU (Shift at outputs)   4.45          158009        1.74
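One way to read a table like this is to keep only the designs that no other design beats on both area and speedup. The sketch below does that for the rows above; it is an illustration rather than the authors' methodology, the `dominates` helper is hypothetical, and the rows vary different parameters, so the result is indicative only.

```python
# (design, latency in ns, area in cells, speedup), copied from the table above.
designs = [
    ("CCA (Michigan)", 4.32, 278748, 1.80),
    ("CCA (R&D)", 7.07, 606345, 1.80),
    ("PCFU (2 AS/4IN/2OUT)", 4.20, 171305, 1.62),
    ("PCFU Logic only", 0.59, 26007, 1.18),
    ("PCFU 1 ADD", 2.15, 63603, 1.33),
    ("PCFU 2 ADD", 3.79, 134637, 1.62),
    ("PCFU 3 ADD", 5.82, 274939, 1.63),
    ("PCFU 2 IN", 3.03, 52437, 1.49),
    ("PCFU 3 IN", 3.24, 68846, 1.56),
    ("PCFU 4 IN", 3.79, 134637, 1.62),
    ("PCFU 5 IN", 5.25, 214885, 1.63),
    ("PCFU 6 IN", 5.47, 465630, 1.63),
    ("PCFU (1 OUT)", 3.79, 134637, 1.45),
    ("PCFU (2 OUT)", 4.20, 171305, 1.62),
    ("PCFU (3 OUT)", 4.57, 230189, 1.63),
    ("PCFU (Shift at inputs)", 5.02, 170529, 1.75),
    ("PCFU (Shift at outputs)", 4.45, 158009, 1.74),
]


def dominates(a, b):
    """True if a is no larger and no less effective than b, and strictly better in one."""
    _, _, area_a, speedup_a = a
    _, _, area_b, speedup_b = b
    no_worse = area_a <= area_b and speedup_a >= speedup_b
    strictly_better = area_a < area_b or speedup_a > speedup_b
    return no_worse and strictly_better


# Keep the designs that no other row dominates, ordered by increasing area.
front = [d for d in designs if not any(dominates(other, d) for other in designs)]
front.sort(key=lambda d: d[2])

for name, latency, area, speedup in front:
    print(f"{name:26s} {area:7d} cells  speedup {speedup:.2f}  latency {latency:.2f} ns")
```

Rows with identical area and speedup (for example PCFU 2 ADD and PCFU 4 IN) do not dominate each other, so both are kept.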
LUT-based accelerator
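ADD r4,r1,r2
XOR r5,r3,r4
[Figure: dataflow graph in which r1 and r2 feed the adder (+) producing r4, and r4 and r3 feed XOR producing r5]
$r5_i = r3_i \oplus (r1_i \oplus r2_i \oplus cin_{i-1})$
$cin_i = (r1_i \cdot r2_i) \mid (r1_i \oplus r2_i) \cdot cin_{i-1}$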
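[Figure: ripple-carry arrangement of per-bit LUTs. At each bit position 0 to 31, an fr5 LUT takes r1, r2, r3 and the incoming carry at that bit and produces the result bit, and a cin1 LUT produces the carry passed on to the next bit position]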
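Cin_{i-1} r3_i r2_i r1_i   r5_i
0         0    0    0      0
0         0    0    1      1
0         0    1    0      1
0         0    1    1      0
0         1    0    0      1
0         1    0    1      0
0         1    1    0      0
0         1    1    1      1
1         0    0    0      1
1         0    0    1      0
1         0    1    0      0
1         0    1    1      1
1         1    0    0      0
1         1    0    1      1
1         1    1    0      1
1         1    1    1      0
LUT contents (last row first): 0110100110010110
[Figure: a 16-entry LUT applied across the 32-bit r1, r2, r3 and carry words to produce the 32-bit r5]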
Closer to an FPGA
Bit-level functions are too complex
The proposed ripple-carry scheme is too slow
May require a carry-propagation network, which is also very complex
Hard to configure while keeping a reasonable latency in a GPP
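The ripple-carry LUT scheme on this backup slide can be sketched in a few lines, which also shows where its latency comes from. This is a software model under assumptions, not the evaluated hardware: the names (`fr5`, `carry_out`, `FR5_LUT`, `CIN_LUT`, `add_xor`) are hypothetical, and the LUT index packs r1 as the least significant bit, followed by r2, r3 and the carry.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def fr5(r1, r2, r3, cin):
    """Per-bit result: r5 = r3 XOR (r1 XOR r2 XOR incoming carry)."""
    return r3 ^ (r1 ^ r2 ^ cin)


def carry_out(r1, r2, cin):
    """Per-bit carry: (r1 AND r2) OR ((r1 XOR r2) AND incoming carry)."""
    return (r1 & r2) | ((r1 ^ r2) & cin)


# Truth tables for the result bit (4 inputs) and the carry bit (3 inputs).
FR5_LUT = [fr5(row & 1, (row >> 1) & 1, (row >> 2) & 1, (row >> 3) & 1)
           for row in range(16)]
CIN_LUT = [carry_out(row & 1, (row >> 1) & 1, (row >> 2) & 1)
           for row in range(8)]


def add_xor(r1, r2, r3):
    """Model ADD r4,r1,r2 followed by XOR r5,r3,r4 with rippling per-bit LUTs."""
    out, cin = 0, 0
    for i in range(WIDTH):
        a, b, c = (r1 >> i) & 1, (r2 >> i) & 1, (r3 >> i) & 1
        out |= FR5_LUT[a | (b << 1) | (c << 2) | (cin << 3)] << i
        cin = CIN_LUT[a | (b << 1) | (cin << 2)]   # needed by the next bit
    return out


print(hex(add_xor(0x0000FFFF, 0x00000001, 0xFF000000)))
```

Each iteration needs the carry produced by the previous one, so in hardware the 32 carry LUTs form a serial chain, which is consistent with the slide's remark that the ripple-carry scheme is too slow.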