Post on 03-Jan-2016
Generating and Executing Multi-Exit Custom Instructions for an Adaptive Extensible Processor
Hamid Noori†, Farhad Mehdipour‡, Kazuaki Murakami†, Koji Inoue†, and Maziar Goudarzi†
† Kyushu University, Japan; ‡ AmirKabir University of Technology, Iran
DATE’07@Nice, FranceKyushu University
Outline
  Introduction
  Overview of ADEXOR Architecture
  Generating Multi-Exit Custom Instructions (MECIs)
  Proposing an Architecture for the CRFU
  Experimental Results
  Conclusions and Future Work
Introduction
Motivation
  Increase in manufacturing and NRE costs
  Increase in design and verification costs due to greater complexity
  Shorter time-to-market
  More flexibility required due to evolving standards, changing user requirements, support for multiple applications, etc.
Our proposed approach
  Generating custom instructions after chip fabrication
  Proposing multi-exit custom instructions
  Proposing a reconfigurable functional unit with conditional execution, using a quantitative approach
General Overview of ADEXOR Architecture
[Figure: ADEXOR block diagram. Labels: DEC/EXE pipeline registers; FU1, FU2, FU3, FU4 and CRFU; register file Reg0 to Reg31; EXE/MEM pipeline registers; configuration memory; two counters, each triggered by mtc0; input from decode stage.]
General Overview of ADEXOR Architecture
Phases
  Design phase
  Configuration phase
  Normal phase
[Figure: the three phases. Design phase: testbench applications, generating the CRFU, synthesis, verification, layout, etc., chip fabrication of ADEXOR. Configuration phase: target application, tool chain, new object code, configuration bits. Normal phase: ADEXOR with base processor, CRFU and configuration memory.]
Generating Multi-Exit Custom Instructions
Multi-Exit Custom Instructions
  Include hot directions of branch instructions
  Have a single entry but multiple exits
Generating Multi-Exit Custom Instructions
  Finding the largest sequence of instructions in the CFG
  Checking anti-dependences and flow dependences, and moving executable instructions to the head and tail(s) of MECIs
  Rewriting the object code where instructions are moved
[Figure: three steps shown on an example HIS. (1) Detecting subgraphs in a HIS; (2) overwriting the entry node with an mtc1 instruction; (3) moving instructions after checking data dependencies (binary rewriting). Node labels: branch, S1, S2, LS, exit 1, exit 2.]
Generating Multi-Exit Custom Instructions
Tool Chain
[Figure: tool chain flow. Boxes: instruction set simulator (SimpleScalar); profiler; detecting start addresses of HBBs; reading HBBs from object code; linking HBBs using profiling information and generating HISs; generating CDFG for HISs; generating multi-exit custom instructions; integrated framework (partitioning and mapping MECIs); updating CDFG and binary rewriting.]
CRFU Architecture: A Quantitative Approach
  22 programs from MiBench were chosen
  The SimpleScalar toolset was used for simulation
  The CRFU is a matrix of FUs; the design parameters are:
    Number of inputs
    Number of outputs
    Number of FUs
    Connections
    Location of inputs and outputs
  Some definitions (frequency and weight are both considered in the measurements)
    CI execution frequency
    Weight (equal to the number of executed instructions)
    Average = Σ (Freq × Weight), summed over all CIs
    Rejection rate: percentage of MECIs that could not be mapped on the CRFU
    Mapping rate: percentage of MECIs that could be mapped on the CRFU
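The definitions above can be illustrated with a minimal sketch. It assumes that each custom instruction contributes Freq × Weight to the totals and that the mapping rate is the mappable share of that total; the class name, field names and sample numbers are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass

@dataclass
class CustomInstruction:
    freq: int       # execution frequency taken from profiling (hypothetical data)
    weight: int     # number of instructions the custom instruction covers
    mappable: bool  # True if it fits on the candidate CRFU

def _total(cis):
    # Sum of Freq * Weight over the given custom instructions.
    return sum(ci.freq * ci.weight for ci in cis)

def mapping_rate(cis):
    """Weighted percentage of custom instructions that can be mapped."""
    total = _total(cis)
    if total == 0:
        return 0.0
    return 100.0 * _total([ci for ci in cis if ci.mappable]) / total

def rejection_rate(cis):
    """Weighted percentage of custom instructions that cannot be mapped."""
    total = _total(cis)
    if total == 0:
        return 0.0
    return 100.0 * _total([ci for ci in cis if not ci.mappable]) / total

# Made-up sample: three custom instructions, one of which does not fit.
sample = [
    CustomInstruction(freq=1000, weight=8, mappable=True),
    CustomInstruction(freq=200, weight=5, mappable=True),
    CustomInstruction(freq=100, weight=10, mappable=False),
]
print(mapping_rate(sample), rejection_rate(sample))
```

Weighting by Freq × Weight means a rarely executed custom instruction that fails to map costs little, while a hot one dominates the rate.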
Inputs/Outputs
[Figure: mapping rate (%, 0 to 100) versus number of inputs/outputs (1 to 12); series: Inputs, Outputs.]
Functional Units
[Figure: mapping rate (%, 0 to 100) versus number of FUs.]
Width/Depth
[Figure: mapping rate (%, 0 to 100) versus width and depth (1 to 12); series: width without constraints, depth without constraints, width with constraints, depth with constraints.]
CRFU Architecture
[Figure: CRFU as rows of FUs (Row1 to Row5) between the CRFU input ports and the CRFU output ports. FU detail: adder/subtractor, AND, OR, XOR, barrel shifter, configuration bits.]
  Connections from the input ports to the inputs of the rows
  Outputs of the 1st row to the inputs of the 3rd, 4th and 5th rows
  Outputs of the 2nd row to the inputs of the 4th and 5th rows
Supporting Conditional Execution
[Figure: conditional execution hardware. Labels: FU1, FU2, FU3, FU4; selector mux; data selection mux; branch result from FU1; branch result from FU2; configuration bits.]
Synthesis result
  Synopsys tools
  Hitachi 0.18 μm
  Area: 2.1 mm²
  Configuration bits: 615 bits
  Delay:

  Depth of DFG of MECI    Delay (ns)
  1                       2.2
  2                       4.2
  3                       6.1
  4                       7.9
  5                       9.8
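Since the CRFU is multi-cycle, the delay table can be turned into a cycle count for a given clock. The sketch below assumes that the latency is the delay divided by the clock period, rounded up, and it ignores register setup and any other overhead; that model and the function name are assumptions, not something stated on the slides.

```python
import math

# Delay (ns) by DFG depth, copied from the synthesis table.
DELAY_NS = {1: 2.2, 2: 4.2, 3: 6.1, 4: 7.9, 5: 9.8}

def crfu_cycles(depth: int, clock_mhz: float) -> int:
    """Cycles for a MECI of the given DFG depth (assumed model: ceil(delay / period))."""
    period_ns = 1000.0 / clock_mhz
    return math.ceil(DELAY_NS[depth] / period_ns)

# Cycle counts for depths 1 to 5 at a few clock frequencies.
for mhz in (200, 300, 400):
    print(mhz, [crfu_cycles(d, mhz) for d in sorted(DELAY_NS)])
```

Under this model a faster base clock raises the number of cycles a deep MECI occupies, which is one reason the speedup depends on clock frequency.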
Experiment setup
  22 applications from MiBench
  SimpleScalar configuration:

  Issue                                 4-way
  L1 I-cache                            32K, 2-way, 1-cycle latency
  L1 D-cache                            32K, 4-way, 1-cycle latency
  Unified L2                            1M, 6-cycle latency
  Execution units                       4 integer, 4 floating point
  RUU size & fetch queue size           64
  Branch predictor                      bimodal
  Branch prediction table size          2048
  Extra branch misprediction latency    3
Speedup CIs & MECIs
[Figure: speedup (axis from 1 to 3); series: CIs, MECIs.]
Effect of clock frequency on speedup
[Figure: speedup (axis from 1 to 3.5); series: 200 MHz, 250 MHz, 300 MHz, 350 MHz, 400 MHz.]
Conclusion
Our experimental results show that extending custom instructions over multiple HBBs increases the average speedup by 46% compared with custom instructions limited to a single HBB. This comes at the cost of 83% more hardware and 20% more configuration bits. Connections of different lengths help support larger custom instructions with the available number of FUs.
Future work
  Energy evaluation of ADEXOR
  Exploring the design space of the CRFU architecture (studying the effect of the number of inputs, outputs and FUs on speedup, area and power)
Introduction (1/2)
Efficiency and flexibility in embedded system design
  Both are critical
  They are conflicting design goals
  Custom hardwired features for more efficiency (performance and energy)
  Programmable features for more flexibility
Design challenges for future embedded systems
  More efficiency (performance and energy) required
  Higher manufacturing and NRE costs of new nanometer-scale technologies
  Higher cost and risk in development
  Shorter time-to-market
Introduction (2/2)
Custom instructions
  Effective technique for improving efficiency
  Custom functional units are required (manufacturing, NRE and design cost)
Our proposed approach
  Generating custom instructions after chip fabrication
  Proposing multi-exit custom instructions
  Replacing custom functional units with a reconfigurable functional unit with conditional execution
Generating Multi-Exit Custom Instructions
Motivating example: adpcm loop
[Figure: control flow graph of the adpcm loop; nodes labeled B1 to B12, S1 to S10 and J1 to J3.]
Generating Multi-Exit Custom Instructions
Generating Hot Instruction Sequence
  1. MECIs should not cross loop boundaries
  2. Sort loops from innermost to outermost
  3. Read HBBs and link them to generate a Hot Instruction Sequence (HIS)
  4. Sort the remaining HBBs in ascending order of start address

Function MAKE_HIS (objfile, HIS, start_addr)
  if (HBB with start_addr is not included in previous MECIs)
    then read_add_HBB2HIS (objfile, HBB(start_addr), HIS)
    else return;
  switch last_instruction(HBB)
    case (indirect jump, return or call):
      return;
    case (direct jump):
      MAKE_HIS(objfile, HIS, target address of jump);
    case (branch):
      if (it is hot backward) then return;
      elsif (not-taken direction is hot)
        then MAKE_HIS(objfile, HIS, target address of not-taken direction)
        else return;
      if (taken direction is hot)
        then MAKE_HIS(objfile, HIS, target address of taken direction)
        else return;
    default:
      return;
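A runnable sketch of the MAKE_HIS recursion may help. It is an interpretation, not the authors' implementation: the HBB record, its field names and the sample addresses are hypothetical; object-file reading is replaced by a dictionary lookup; both hot directions of a branch are followed, in line with the rule that both corresponding HBBs are added when both directions are hot; and a guard against revisiting an HBB already in the sequence is added so the recursion always terminates.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class HBB:
    # Kind of the last instruction: "branch", "direct_jump",
    # "indirect_jump", "return", "call" or "other" (hypothetical labels).
    last_kind: str
    taken_target: Optional[int] = None
    not_taken_target: Optional[int] = None
    taken_hot: bool = False
    not_taken_hot: bool = False
    hot_backward: bool = False

def make_his(hbbs, his, start_addr, used):
    """Append start addresses of linked hot basic blocks to `his`."""
    hbb = hbbs.get(start_addr)
    # Stop if the target is not a known HBB, already belongs to a previous
    # MECI, or is already part of this sequence (added guard).
    if hbb is None or start_addr in used or start_addr in his:
        return
    his.append(start_addr)
    if hbb.last_kind == "direct_jump":
        make_his(hbbs, his, hbb.taken_target, used)
    elif hbb.last_kind == "branch" and not hbb.hot_backward:
        if hbb.not_taken_hot:
            make_his(hbbs, his, hbb.not_taken_target, used)
        if hbb.taken_hot:
            make_his(hbbs, his, hbb.taken_target, used)
    # Indirect jumps, returns, calls and anything else end the sequence.

# Hypothetical loop body made of three hot basic blocks.
hbbs = {
    0x100: HBB("branch", taken_target=0x140, not_taken_target=0x110,
               taken_hot=True, not_taken_hot=True),
    0x110: HBB("direct_jump", taken_target=0x140),
    0x140: HBB("branch", taken_target=0x100, not_taken_target=0x150,
               taken_hot=True, hot_backward=True),
}
his = []
make_his(hbbs, his, 0x100, used=set())
print([hex(a) for a in his])
```

The hot backward branch at the end of the loop stops the sequence, which keeps the resulting HIS inside the loop boundary.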
Generating Multi-Exit Custom Instructions
  MECIs include fixed-point instructions except multiply, divide and load
  At most one store and five branches
  A MECI can have at most four exit points. An exit point is one of:
    a branch with only one hot direction
    an indirect jump or a return
    a call
    a hot backward branch
    an instruction whose next instruction is non-executable
  If both directions of a branch are hot, both corresponding HBBs are added
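The limits above lend themselves to a simple validity check. The following is a minimal sketch under stated assumptions: instructions are represented by hypothetical class labels such as "alu", "store" and "branch", floating-point instructions are treated as non-executable because MECIs are restricted to fixed-point instructions, and the exit count is supplied by the caller rather than derived from the control flow.

```python
from collections import Counter

# Instruction class labels are hypothetical, chosen for this sketch only.
NON_EXECUTABLE = {"multiply", "divide", "load", "floating_point"}
MAX_STORES = 1
MAX_BRANCHES = 5
MAX_EXITS = 4

def meci_violations(instr_classes, exit_count):
    """Return the list of constraints a candidate MECI violates."""
    counts = Counter(instr_classes)
    problems = []
    bad = sorted(c for c in counts if c in NON_EXECUTABLE)
    if bad:
        problems.append("non-executable: " + ", ".join(bad))
    if counts["store"] > MAX_STORES:
        problems.append("more than one store")
    if counts["branch"] > MAX_BRANCHES:
        problems.append("more than five branches")
    if exit_count > MAX_EXITS:
        problems.append("more than four exit points")
    return problems

ok = meci_violations(["alu", "shift", "branch", "alu", "store"], exit_count=2)
bad = meci_violations(["alu", "load", "store", "store"], exit_count=5)
print(ok)
print(bad)
```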
Executing MECIs on the CRFU
[Figure: ADEXOR block diagram. Labels: DEC/EXE pipeline registers; FU1, FU2, FU3, FU4 and CRFU; register file Reg0 to Reg31; EXE/MEM pipeline registers; configuration memory; two counters, each triggered by mtc0; input from decode stage.]
Proposing an Architecture for the CRFU
Design Methodology
[Figure: design methodology flowchart in four phases. Labels: application; object code; generating HIS; custom instructions (CIs); first phase: mapping (MapToolP1) and analyzing CIs to determine the CRFU preliminary architecture; second phase: mapping process (MapToolP2) on the CRFU primary architecture, with the decision "Mapping is successful?" (yes: mappable CIs, MapCIP2; no: rejected CIs); third phase: integrated temporal partitioning and mapping (IntegFrameP3), mappable CIs (MapCIP3); fourth phase: modifying the mapping process (MapToolP4) and IntegFrameP4, fixing the final CRFU architecture.]
Example custom instruction shown in the figure, with its DFG and RFU map:
  1: SLL  R2, R17, 0x1
  2: ADDU R2, R2, R17
  3: SLL  R2, R2, 0x3
  4: ADDU R2, R2, R17
  5: SLL  R2, R2, 0x4
  6: ADDU R2, R2, R20
  7: SLL  R3, R18, 0x2
  8: ADDU R3, R3, R2
Mapping Tool
[Figure: a custom instruction, its data flow graph and its RFU map. DFG nodes: SUBU, ADDU, SRA, SLT, BNE, with operands R0, R2, R3, R8, R10, 0x3 and 400488.]
A custom instruction:
  1: SUBU R3, R0, R3
  2: ADDU R10, R0, R0
  3: SRA  R8, R10, 0x3
  4: SLT  R2, R3, R8
  5: BNE  R0, 400488, R2
Integrated Framework (1/2)
Integrated Framework
  Performs an integrated temporal partitioning and mapping process
  Takes rejected CIs as input
  Partitions them into appropriate mappable CIs
  Adds nodes to the current partition while architectural constraints are satisfied
  The ASAP level of a node represents its position in the execution order implied by its dependencies
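As an illustration of ASAP levels, the sketch below computes them for the five-instruction custom instruction from the Mapping Tool slide. The dependency lists are my reading of the register names in that listing, and numbering levels from 0 is an arbitrary choice; neither is stated on the slides.

```python
def asap_levels(preds):
    """Map each node to its ASAP level: 0 for nodes without predecessors,
    otherwise one more than the highest level among its predecessors."""
    levels = {}

    def level(node):
        if node not in levels:
            levels[node] = 1 + max((level(p) for p in preds[node]), default=-1)
        return levels[node]

    for node in preds:
        level(node)
    return levels

# 1: SUBU R3, R0, R3     no producer inside the custom instruction
# 2: ADDU R10, R0, R0    no producer inside the custom instruction
# 3: SRA  R8, R10, 0x3   reads R10 written by 2
# 4: SLT  R2, R3, R8     reads R3 written by 1 and R8 written by 3
# 5: BNE  ... R2         reads R2 written by 4
preds = {1: [], 2: [], 3: [2], 4: [1, 3], 5: [4]}
levels = asap_levels(preds)
print(levels)
```

With this reading the DFG has four levels, so the instruction needs a CRFU that is at least four rows deep or a temporal partitioning step.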
[Figure: integrated framework flowchart. Labels: rejected CIs; RFU architectural constraints; temporal partitioning (HTTP or VTTP); temporal partitions (initial partitions); mapping on RFU; decision "Mapping is successful?"; NO: incremental temporal partitioning (HTTP or VTTP), incremental temporal partitions; YES: final configurations; similarity detection; spatial merging; reduced final configurations.]
Integrated Framework (2/2)
Incremental HTTP
  The node with the highest ASAP level is selected and moved to the subsequent partition
  Node selection and moving order in the example: 15, 13, 11, 9, 14, 12, 10, 8, 3 and 7
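The selection rule can be sketched as a loop that keeps moving the node with the highest ASAP level out of the current partition until the partition satisfies the architectural constraints. This is only an illustration: the `fits` predicate stands in for the real mapping check, the tie-break on the larger node id is an assumption, and the sample graph is made up, so the sketch does not reproduce the moving order of the slide's example.

```python
def incremental_http(nodes, asap, fits):
    """Split `nodes` into (current partition, nodes moved to the next one).

    nodes: node ids in the partition that failed to map
    asap:  dict from node id to ASAP level
    fits:  caller-supplied predicate, True if a node list is mappable
    """
    current = list(nodes)
    moved = []
    while current and not fits(current):
        # Highest ASAP level first; ties broken by larger id (assumption).
        victim = max(current, key=lambda n: (asap[n], n))
        current.remove(victim)
        moved.append(victim)
    return current, moved

# Made-up example: six nodes, and a stand-in constraint of four nodes per partition.
asap = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3}
current, moved = incremental_http(range(6), asap, fits=lambda p: len(p) <= 4)
print(current, moved)
```

Because a node's successors always have a higher ASAP level than the node itself, removing the highest-level nodes first never leaves a successor behind in the earlier partition.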
[Figure: data flow graph of the input CI (nodes 0 to 15), its 1st and 2nd partitions, and the corresponding RFU maps.]
Supporting Conditional Execution
[Figure: example with control dependencies and data dependencies among subu, slt, sra and bne nodes, plus an immediate (Imm) input.]
Execution time of configuration phase

  Application     Exec. time (seconds)
  adpcm           225
  bitcounts       331
  blowfish        94
  basicmath       34
  cjpeg           75
  crc             132
  dijkstra        101
  djpeg           9
  fft             36
  gsm             461
  lame            526
  patricia        84
  qsort           233
  rijndael        68
  sha             29
  stringsearch    3
  susan           122
  Average         150.8
Effect of connections on mapping rate
When the connections of length greater than one are removed, 24.2% of MECIs cannot be mapped.
General Overview of ADEXOR Architecture
Main components
  Base processor
    4-issue in-order RISC processor
  Reconfigurable functional unit (CRFU)
    Coarse grain
    Based on a matrix of functional units (FUs)
    Multi-cycle
    Parallel with the other functional units of the base processor
    Reads from and writes to the register file
    Functions and connections are controlled by configuration bits
  Configuration memory
    Keeps the configuration data of the CRFU for multi-exit custom instructions (MECIs)
  Counters
    Control the read/write ports of the register file and select between the CRFU and the processor functional units
Speedup CIs & MECIs
RFU used for CIs (limited to one HBB):
  The number of inputs, outputs and FUs is the same as in the CRFU
  It has simpler connections and FUs and does not support conditional execution
  Area: 1.15 mm²
  Delay for a CI with a critical length of five: 7.66 ns
  Each CI configuration needs 512 bits
The average number of instructions is 6.39 in CIs (one HBB) and 7.85 in MECIs.
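The hardware and configuration overheads quoted in the conclusion can be checked against the synthesis figures. The short sketch below assumes those percentages compare the CRFU (2.1 mm², 615 configuration bits) with the RFU used for single-HBB CIs (1.15 mm², 512 bits); the helper name is hypothetical.

```python
def percent_increase(new, base):
    """Relative increase of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base

area_overhead = percent_increase(2.1, 1.15)   # CRFU area vs. RFU area, mm^2
bits_overhead = percent_increase(615, 512)    # configuration bits per instruction
print(round(area_overhead), round(bits_overhead))
```

Both values round to the 83% and 20% figures reported in the conclusion.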
MECIs vs. CI
[Figure: speedup enhancement (%, axis from 0 to 180) of MECIs versus CIs.]
Distribution of functions
  Function    row1  row2  row3  row4  row5
  AND         2     3     1     1     0
  OR          1     2     1     1     1
  XOR         2     1     1     1     0
  NOR         1     0     0     0     0
  Add/sub     5     4     3     2     1
  Shift       4     3     2     2     1
  Compare     2     2     2     2     1