A Combined Analytical and Simulation-Based Model for ...koji.inoue/paper/2009/ASP-DAC2009Far... ·...
Transcript of A Combined Analytical and Simulation-Based Model for ...koji.inoue/paper/2009/ASP-DAC2009Far... ·...
A Combined Analytical and Simulation-Based
Model for Perform
ance Evaluation of a
Reconfigurable Instruction Set Processor
FarhadMehdipour, H
. N
oori, B. Javadi,
H. H
onda, K. In
oue a
nd K
. J. M
ura
kam
i
Department of Inform
ation Science and Electrical Engineering,
KYUSHU UNIVERSITY, Fukuoka, JAPAN
{farh
ad@
c.c
sce.k
yushu-u
.ac.jp}
Outlin
e
•Reconfigura
ble
Instructions S
et Pro
cessors
•A C
om
bin
ed A
naly
tical and S
imula
tion-B
ased M
odel (C
AnSO
)
•Model Extraction a
nd C
alib
ration
•Basic
Model D
efinitio
ns
•Speedup F
orm
ula
tions
2/XXXII
•Sim
plif
ication a
nd C
alib
ration
•Experim
ents
•Experim
enta
l Setu
p
•Model Valid
ation
•Desig
n S
pace E
xplo
ration U
sin
g C
AnSO
•Effects
of M
odific
ations
•Conclu
sio
ns a
nd F
utu
re W
ork
Desig
nin
g E
mbedded S
yste
ms
�Em
bedded M
icro
pro
cessors
�Applic
ation-S
pecific
Inte
gra
ted C
ircuits (ASIC
s)
�Applic
ation-S
pecific
Instruction s
et Pro
cessors
(ASIP
s)
3/XXXII
Applic
ation-S
pecific
Instruction s
et Pro
cessors
(ASIP
s)
�Exte
nsib
le P
rocessors
Exte
nsib
le P
rocessors
�M
echanis
m
�Accele
ration b
y u
sin
g C
FU
�a h
ard
ware
is a
ugm
ente
d to the b
ase p
rocessor
�Execute
s h
ot portio
ns o
f applic
ations
4/XXXII
Exte
nsib
le P
rocessors
�Base p
rocessor (B
P)'s fix
ed instruction s
et + C
usto
m Instructions
�G
oals
�Im
pro
vin
g the p
erform
ance a
nd e
nerg
y e
ffic
iency
�M
ain
tain
ing c
om
patibility
and fle
xib
ility
5/XXXII
CPU
Instruction D
ispatc
her
Regis
ter File
+&
xLD
/ST
CFU
1C
FU
2LD/ST: Load / Store
CFU: Custom Functional Unit
Custo
m Instructions
�In
struction s
et custo
miz
ation �
�hard
ware
/softw
are
partitio
nin
g
(Identify
ing c
ritical segm
ents
in a
pplic
ations)
�C
usto
m Instructions (C
Is) are
�extracte
d fro
m c
ritical segm
ents
of an a
pplic
ation a
nd
�execute
d o
n a
Custo
m F
unctional U
nit (C
FU
)
6/XXXII
A CI can be represented as a DFG
Critical segm
ents
:
Most frequently e
xecute
d (H
ot)
portio
ns o
f th
e a
pplic
ations
Exte
nsib
le P
rocessors
�D
raw
backs:
�Lack o
f flexib
ility
�Long tim
e a
nd c
ost of desig
nin
g a
nd v
erify
ing
�M
any issues a
ssocia
ted
with d
esig
nin
g a
new
pro
cessor from
scra
tch:
•lo
nger tim
e-to-m
ark
et and
7/XXXII
•lo
nger tim
e-to-m
ark
et and
•sig
nific
ant N
RE (N
on-R
ecurrin
g E
ngin
eering) costs
�Solu
tion
�U
sin
g a
Reconfigura
ble
Functional U
nit (R
FU
)
inste
ad o
f fixed a
rchitectu
re C
FU
Reconfigura
ble
Pro
cessors
Microprocessor
Reconfigurable
Logic
8/XXXII
Reconfigurable
Processor
Pro
cessor couplin
g
Coprocessor
Processor
RFU
9/XXXII
Memory
Attached
Processor
Bridge
Loose
Coupling
(PRISM-I)
Reconfigura
ble
Instruction S
et Pro
cessors
(R
ISPs)
�Addin
g a
nd g
enera
ting c
usto
m instructions a
fter fa
brication
�U
sin
g a
reconfigura
ble
FU
(RFU
) in
ste
ad o
f custo
m F
U
10/XXXII
CPU
Instruction D
ispatc
her
Regis
ter File
+&
xLD
/ST
CFU
1C
FU
2R
FU
Config
Mem
CFU: Custom Functional Unit
RFU: Reconfigurable Functional Unit
Baseline Processor
RAC...
...
How
a R
ISP W
ork
s
400680
subiu $25,$25,1
400688
lbu
$13,0($7)
400690
lbu
$2,0($4)
400698
sll
$2,$2,0x18
4006a0
sra $14,$2,0x18
4006a8
addiu
$4,$4,1
4006b0
srl
$8,$2,0x1c
4006b8
sll
$2,$8,0x2
RISP
11/XXXII
ALU Regis
ter File
...
Configuration
Memory
GPP: General Purpose Processor
RAC=RFU: Reconfigurable Accelerator
4006b8
sll
$2,$8,0x2
4006c0
addu
$2,$2,$25
4006c8
lw$2,0($2)
4006d0
xori
$13,$13,1
4006d8
addu
$10,$10,$2
400680
subiu $25,$25,1
400698
sll
$2,$2,0x18
4006a0
sra $14,$2,0x18
400688
lbu
$13,0($7)
4006e0
bgez
$10,4006f0
. . .
A Hot Basic Block
RIS
P B
enefits
and D
raw
backs
Benefits
�Specia
lized d
ata
path
�Share
d h
ard
ware
�H
igher Speedup
12/XXXII
�Less p
ow
er consum
ption
Dra
wbacks
�M
ore
are
a
�D
ifficult to u
se
Perform
ance E
valu
ation o
f a R
ISP
�Perform
ance e
valu
ation o
f a R
ISP c
halle
nges
�desig
nin
g o
f a R
ISP a
rchitectu
re
�optim
izin
g a
n e
xis
ting a
rch. fo
r an o
bje
ctive function
�For a d
esig
ner
13/XXXII
�For a d
esig
ner
�obta
inin
g o
ptim
um
syste
m c
onfigura
tion is d
esirable
�a p
erform
ance a
naly
sis
in term
s o
f th
e p
erform
ance m
etric
s (speedup,
are
a a
nd s
o o
n) is
required
�Perform
ance e
valu
ation m
odels
�Structu
ral m
odels
: in
clu
des e
mpiric
al stu
die
s b
ased o
n m
easure
ments
and s
imula
tions o
f th
e targ
et syste
m
�Analy
tical m
odels
: in
corp
ora
tes a
syste
m (usually
sim
plif
ied) structu
re
to o
bta
in m
ath
em
atically
solv
able
models
Fra
ction o
f D
ynam
ic Instructions in A
pplic
ations
2030405060708090
100
%
BP Portion
RAC Portion
14/XXXII
0102030
adpc
m(d
ec)
blow
fish(
enc)
blow
fish(
dec)
crc
cjpe
g
djpe
ddi
jkst
ra
lam
epa
tricia
qsor
trij
ndae
l(enc
)rij
ndae
l(dec
)
sha
susa
n Ave
rage
the R
AC
is responsib
le for executing a
lmost 30%
of dynam
ic
instructions o
f applic
ations in a
vera
ge
Model Extraction a
nd U
tiliz
ation
15/XXXII
Genera
l Tem
pla
te o
f a R
ISP
16/XXXII
Basic
Model D
efinitio
ns
�B
ase P
rocessor
�an in-o
rder genera
l five-s
tage R
ISC
pro
cessor
�R
AC
�a c
oars
e-g
rain
ed tig
htly-c
ouple
d reconfigura
ble
hard
ware
�C
Is a
re indexed for direct accessin
g o
f th
e c
onfigura
tion b
it-s
tream
The c
onte
nt of all
regis
ters
are
sent to
the R
AC
(S
hare
d R
F)
17/XXXII
�The c
onte
nt of all
regis
ters
are
sent to
the R
AC
(S
hare
d R
F)
�C
ontrolli
ng c
onfigura
tions
�H
ard
ware
-based: sta
rtin
g a
ddre
ss o
f C
I and index to the c
onfig. M
em
. is
sto
red in a
CA
M
for quic
k retrie
val
�S
oftw
are
-based: sta
rtin
g a
ddre
ss o
f a C
I is
repla
ced w
ith a
specia
l in
struction
�M
em
ory
accesses
�C
ontrol in
structions
Sin
gle
and C
ontinuous E
xecutions
Continuous
Execution
18/XXXII
Sin
gle
Execution
Speedup F
orm
ula
tion
tcc
n
n i
ii BP
RAC
f
CI ∑ =
Ο×
=1
τ
RAC
ftcc
n
n i
ii BP
ccn
BP
f
CI
−=
=
Ο×
−
=
∑1
1
τ
Late
ncy o
f
execution o
f CIi
instructions o
n
the B
P
Fra
ction o
f
instructions
executing o
n B
P
Tota
l no. of
executions
of CIi
Execution
19/XXXII
()
τθ
ψτ
,
1
+
=
Ο×
−
=
∑CI
n i
ii BP
tcc
n
tcc
nos
()
()
()
()
()
()
∑∑
∑=
∈∈
×−
++
++
×=
CI
ii
n iC
j
RAC
ijOVH
RAC
Sj
OVH
RAC
ij
1
1)
,(
τθ
ττ
ττ
θτ
θψ
frequency o
f jth
occurrence o
f CIi
Overa
ll
Speedup
Execution tim
e o
n
the R
AC
Late
ncy o
f R
AC
and
the o
verh
ead
reconfigura
tion tim
e
Execution
tim
e o
n the
BP
The E
ffect of C
I Length
�Larg
e C
Is
�In
clu
din
g m
ore
instructions than the n
o. of availa
ble
resourc
es in the
RAC
�Tem
pora
l Partitio
nin
g
�D
ivid
ing larg
er C
Is to a
num
ber of sm
alle
r C
Is
20/XXXII
�D
ivid
ing larg
er C
Is to a
num
ber of sm
alle
r C
Is
iL
kk
iL
km
mp
m=
′×
Ο=
′∉
∈,
()(
)(
)(
)L
kL
kL
km
Lk
Lk
∉=
∉′∈′
=∈′
=∈′
θθ
θθ
,,
1,...,
1,...,
1,...,
1,1
,...,
1
iS
LiS
im
LiS
=∉′
′=
∈′},
,...
1{i
CL
iC
Li
C=
∉∅
=∈
',
'
}},
,...,
1{{
FU
kCI
nl
nk
kL
>∈
=}
,{
=
∈=
FUk
kk
n
lp
Lk
pP
Sid
e-E
ffects
�Control Instructions
�th
e rate
of m
iss-p
redic
ted b
ranches m
ight be reduced �
hig
her
speedup
�Instruction Cache Misses
�no n
eed for fe
tchin
g instructions b
elo
ngin
g to the C
Is
�access a
nd m
iss rate
s to instruction c
ache a
re reduced
21/XXXII
�access a
nd m
iss rate
s to instruction c
ache a
re reduced
�BP fra
ction reduces�
speedup incre
ases
),
(
1},
{
τθ
ψτ
δ′
′+
=
Ο×
+
=
×−
=
∑∑
DI
n i
ii BP
ib
x
xmp
xmtcc
n
tcc
nos
variation in
branch/cache miss-
predictions/misses
no. of penalty cycles for
branch miss-
predictions/cache misses
RF’s
Input/O
utp
ut Ports
�R
egis
ter file
is s
hare
d b
etw
een B
P a
nd R
AC
�A
dditio
nal clo
ck c
ycle
s for re
adin
g/w
riting fro
m/to the R
F
22/XXXII
),0
max(
),0
max(
∇
∇−
∇+
∆
∆−
∆+
=′
reg
reg
i CI
reg
reg
i CI
OVH
OVH
ττ
no. of C
Ii’s
inputs
no. of R
F’s
write
ports
no. of R
F’s
read p
orts
The A
ssum
ed R
AC
Arc
hitectu
re
... ...
23/XXXII
FU
FU
FU
FU
......
...
. . .
. . .
. . .
. . .
RAC
’s D
ela
y
�All
FU
s in the R
AC
im
ple
ment sim
ilar opera
tions
�Each m
ux receiv
es
�all
outp
uts
of th
e F
Us in u
pper ro
ws a
nd
Outp
uts
fro
m its
adja
cent FU
s a
t th
e s
am
e row
24/XXXII
�O
utp
uts
fro
m its
adja
cent FU
s a
t th
e s
am
e row
{}w
k
h i
k iMUX
h i
FU
w hRAC
,...,
1,0
,
1 11
∈+
=∑
∑− =
=
ττ
τ
()
()
()
()
()
()
∑∑
∑=
∈∈
×−
++
++
×=
CI
ii
n iC
j
RAC
ijOVH
RAC
Sj
OVH
RAC
ij
1
1)
,(
τθ
ττ
ττ
θτ
θψ
Sim
plif
ication a
nd C
alib
ration
�C
ontrol in
structions a
re n
ot supported
�R
eduction in instruction c
ache a
ccesses a
s w
ell
as c
ache m
isses
�avera
ge reduction in a
ccess to i-c
ache is a
lmost 17%
�avera
ge i-c
ache m
iss rate
is a
lmost 3%
.
Avera
ge i-C
ache
Accesses: 17%
25/XXXII
0510
1520
2530
35
adpc
m(d
ec)
blow
fish(
enc)
blow
fish(
dec)
crc
cjpe
gdj
ped di
jkst
ralam
e patriciaqs
ort
rijnd
ael(e
nc)
rijnd
ael(d
ec)
sha
susa
n Aver
age
%
Reduction in i-Cache accesses
Reduction in i-Cache misses
Accesses: 17%
Mis
ses: 3
%
Sim
plif
ication a
nd C
alib
ration
102030405060708090
100
%Single Executions
Single Executions after Partitioning CIs
Fra
ction o
f S
ingle
& C
ontinuous
Executions
26/XXXII
+
∗ ′′
+
∑∗ =
∗
Ο×
∗
−∗
×∗
−∗
∗
=
OVH
w hRAC
DI
n ii
i BP
imp
imtcc
n
tcc
nos
ττ
θψ
τδ
,
1
010
adpcm(dec)
blowfish(enc)
blowfish(dec)
crc
cjpeg
djped
dijkstra
lame
patricia
qsort
rijndael(enc)
rijndael(dec)
sha
susan A
verage
() ∑
=
∗∗
+′
××
Ο=
+′
′
*
1
,
CI
n i
RAC
w hOVH
iOVH
w hRAC
ττ
ατ
τθ
ψ
the ratio o
f sin
gle
to c
ontinuous
execution�
43%
Experim
enta
l Setu
p
�Fourteen a
pplic
ations o
f M
ibench
�auto
motive, security
, consum
er, n
etw
ork
, te
lecom
munic
ation
�C
Is (D
FG
s) are
extracte
d fro
m a
pplic
ations
27/XXXII
�Sim
ple
scala
r’s c
ycle
-accura
te s
imula
tor is
exte
nded to s
imula
te a
re
configura
ble
instruction s
et pro
cessor
�M
odel Esta
blis
hm
ent
�sim
ula
ting a
ll applic
ations
�colle
cting required info
rmation
�m
odel sim
plif
ication a
nd c
alib
ration
~ 4
hours
to c
om
ple
tion o
n a
PC
: D
ual C
ore
, In
tel
6600@
2400M
hz, 2G
B R
AM
Model Valid
ation
160
180
200
220
240
Speedup x 100
Cycle-accurate simulation
CAnSO
Uncalibrated CAnSO
Avera
ge
variation= 2
%
28/XXXII
100
120
140
160
adpc
m(d
ec)
blow
fish(
enc)
blow
fish(
dec)
crc
cjpe
gdj
ped di
jkst
ra
lam
e patriciaqs
ort
rijnd
ael(e
nc)
rijnd
ael(d
ec)
sha
susa
n Ave
rage
Speedup x
Avera
ge
variation= 2
2%
Desig
n S
pace E
xplo
ration U
sin
g C
AnSO
�The d
esig
n o
f a R
AC
inclu
din
g d
iffe
rent com
ponents
enta
ils a
multitude o
f desig
n p
ara
mete
rs
�E
xam
inin
g 1
00 d
esig
n p
oin
ts u
sin
g 1
4 a
pplic
ations:
�S
imula
tion: 17 d
ays
29/XXXII
�S
imula
tion: 17 d
ays
�C
AnS
O: 4 h
ours
�U
sin
g C
AnS
O, re
-sim
ula
tion is n
ot needed a
fter
esta
blis
hin
g the m
odel
Usin
g C
AnSO
for D
esig
n S
pace E
xplo
ration o
f th
e R
AC
12
34
56
71
4
760
70
80
90
100
110
120
130
Speedup x 100
Heigth
Width
blowfish(enc)
12
34
56
71
35
760
70
80
90
100
110
120
130
Speedup x 100
Heigth
Width
crc
12
34
56
71
234567
60
70
80
90
100
110
120
130
Speedup x 100
Heigth
Width
dikjstra
Increasing the width of RAC increases speedup
Width> 6: no more speedup is achievable
30/XXXII
12
34
56
71
3
5
7
80
85
90
95
100
105
110
Speedup x 100
Heigth
Width
qsort
12
34
56
1
3
5
7
70
80
90
100
110
120
Speedup x 100
Heigth
Width
rijndael(enc)
12
34
56
71
3
5
7
60
80
100
120
140
160
180
Speedup x 100
Heigth
Width
susan
Width> 6: no more speedup is achievable
the small heights � ���
very low speedup
Height> 5: RAC’s longer critical path delay�
speedup declines
Effect of M
odific
ations
180
Simulation (8-read/4-write port RF)
CAnSO (8-read/4-write port RF)
Simulation (4-read/2-write port RF)
Applying modification to the design� ���
-Sm
all
tim
e is required for re
peating the s
imula
tion
-Each ite
ration o
f th
e C
AnSO
takes less than a
min
ute
31/XXXII
100
120
140
160
adpc
m(d
ec)
blow
fish(
enc)
blow
fish(
dec)
crc
cjpe
gdj
ped
dijkst
ra
lam
epa
tricia
qsor
t
rijnd
ael(e
nc)
rijnd
ael(d
ec)
sha
susa
n Ave
rage
Speedup x 100
Simulation (4-read/2-write port RF)
CAnSO (4-read/2-write port RF)
Simulation (2-read/1-write port RF)
CAnSO (2-read/1-write port RF)
CO
NC
LU
SIO
N
�Reconfigurable instruction set processors
�A combined analytical and simulation-based model (CAnSO)
�Suitable for exploring a large design space for the accelerator
�Sufficient flexibility in a rapid evaluation of modified target architectures
32/XXXII
�Substantially reduce the design or optimization time while preserving a
reasonable accuracy
�Proves less than 2% variation in evaluation results
�Uncalibrated CAnSO depicts 22% difference in average
�Future work:
�Expanding CAnSO to support control instructions
�Considering more complicated RAC architectures