Application Specific Instruction Generation for Configurable Processor Architectures
VLSI CAD LabComputer Science Department, UCLA
Led by Jason Cong
Yiping Fan, Guoling Han, Zhiru Zhang
Supported by NSF
Outline
Motivation
Related Works
Problem Statement
Proposed Solutions
Experimental Results
Conclusions
Motivation
Flexibility is required to satisfy different requirements and to avoid potential design errors.
Application Specific Instruction-set Processors (ASIPs) provide a solution to the tradeoff between efficiency and flexibility:
A general purpose processor + specific hardware resources
Base instruction set + customized instructions
The specific hardware resources implement the customized instructions, either runtime reconfigurable or pre-synthesized.
ASIPs have gained popularity recently: IFX Carmel 20xx, ARM, Tensilica Xtensa, STM Lx, ARC Cores.
Application Specific Instruction-set Processor
Program with basic instruction set I:
t1 = a * b;
t2 = b * 0xf0;
t3 = c * 0x12;
t4 = t1 + t2;
t5 = t2 + t3;
t6 = t5 + t4;
[Custom logic DFG: n1 = a*b, n2 = b*0xf0, n3 = c*0x12, n4 = n1+n2, n5 = n2+n3, n6 = n4+n5]
Execution time: 9 clock cycles
(*: 2 clock cycles, +: 1 clock cycle)
Application Specific Instruction-set Processor (cont’d)
[DFG as above, now covered by the two extended instructions]
Program with extended instructions:
t1 = extop1(a, b, 0xf0);
t2 = extop2(b, c, 0xf0, 0x12);
t3 = t1 + t2;
Execution time: 5 clock cycles; Speedup: 1.8
(extop: 2 clock cycles, +: 1 clock cycle)
Extended instruction set: I ∪ {extop1, extop2}
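The cycle counts above can be checked with a minimal sketch, assuming sequential (non-pipelined) issue so that total time is simply the sum of instruction latencies:

```python
# Cycle counts for the basic and extended programs, assuming sequential
# (non-pipelined) execution: total time = sum of instruction latencies.
LAT = {'*': 2, '+': 1, 'extop1': 2, 'extop2': 2}

basic    = ['*', '*', '*', '+', '+', '+']   # t1..t6 with instruction set I
extended = ['extop1', 'extop2', '+']        # t1..t3 with extended instructions

def cycles(prog):
    return sum(LAT[op] for op in prog)

# cycles(basic) -> 9, cycles(extended) -> 5, speedup 9/5 = 1.8
```

This matches the slide's figures: 9 cycles for the basic program, 5 for the extended one, a 1.8x speedup.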
Related Works
[Kastner et al, TODAES'02]: template generation + covering. Limitation: the minimum number of templates may not lead to maximum speedup; architecture constraints are ignored.
[Atasu et al, DAC'03]: branch and bound. Limitation: high complexity; instruction reuse is not considered.
[Peymandoust et al, ICASAP'03]: instruction selection + instruction mapping. Limitation: minimizes the extended instruction number.
Preliminaries
Control data flow graph (CDFG):
Basic blocks (BBKs): each BBK is a DAG, denoted by G(V, E)
Control edges
Cone: a subgraph consisting of node v and its predecessors such that any path connecting a node in the cone and v lies entirely in the cone
K-feasible cone: a cone with at most K inputs
Pattern: a single-output DAG, either a trivial pattern or a nontrivial pattern, associated with execution time, number of I/Os, and area
Trivial pattern: execution time; I/O: 2-in 1-out
[DFG: n1 = a*b, n2 = b*0xf0, n3 = c*0x12, n4 = n1+n2, n5 = n2+n3, n6 = n4+n5]
Nontrivial pattern: SW execution time; HW execution time; I/O: 2-in 1-out; area: 2; inputs {a, b, 0xf0}
Problem Statement
Given: G(V, E), the basic instruction set I, and pattern constraints:
I. Number of inputs: |PI(pi)| ≤ Nin, ∀i
II. Number of outputs: |PO(pi)| = 1, ∀i
III. Total area: Σ_{1≤i≤N} area(pi) ≤ A
Objective: generate a pattern library P and map G to the extended instruction set I ∪ P, so that the total execution time is minimized.
Problem Decomposition
Sub-problem 1. Pattern Enumeration: generate all of the patterns S satisfying constraints (I) and (II) from G(V, E).
Sub-problem 2. Instruction Set Selection: select a subset P of S to maximize the potential speedup while satisfying the area constraint.
Sub-problem 3. Application Mapping: map G(V, E) to I ∪ P so that the total execution time of G is minimized.
Proposed ASIP Compilation Flow
C source → SUIF / CDFG generator → CDFG → Pattern Generation / Pattern Selection (under ASIP constraints) → Pattern library → Application Mapping → Mapped CDFG → Instruction Implementation / ASIP synthesis → Implementation
1. Pattern Enumeration
All possible application-specific instruction patterns should be enumerated; each pattern is a k-feasible cone. Cut enumeration is used to enumerate all the k-feasible cones [Cong et al, FPGA'99]: in topological order, merge the cuts of the fan-ins and discard those cuts that are not k-feasible.
1. Pattern Enumeration (cont'd)
[DFG: n1 = a*b, n2 = b*0xf0, n3 = c*0x12, n4 = n1+n2, n5 = n2+n3, n6 = n4+n5]
3-feasible cones:
n1: {a, b}
n2: {b, 0xf0}
n3: {c, 0x12}
n4: {n1, n2}, {n1, b, 0xf0}, {n2, a, b}, {a, b, 0xf0}
n5: {n2, n3}, {n2, c, 0x12}, {n3, b, 0xf0}, {b, 0xf0, c, 0x12}
n6: {n4, n5}, {n4, n2, n3}, {n5, n1, n2}
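The enumeration above can be sketched in a few lines; a minimal version, assuming the DAG is given as a fan-in map and constants are modeled as ordinary leaf nodes (the standard formulation also keeps the trivial singleton cut {v} for each node, which the slide omits):

```python
from itertools import product

def topo_order(dag):
    """Return the nodes of dag (node -> list of fan-ins) in topological order."""
    seen, order = set(), []
    def visit(n):
        if n in seen:
            return
        seen.add(n)
        for p in dag[n]:
            visit(p)
        order.append(n)
    for n in dag:
        visit(n)
    return order

def enumerate_cuts(dag, k):
    """k-feasible cut enumeration: merge fan-in cuts in topological order,
    discarding any merged cut with more than k leaves."""
    cuts = {}
    for node in topo_order(dag):
        preds = dag[node]
        if not preds:                        # primary input / constant leaf
            cuts[node] = {frozenset([node])}
            continue
        merged = {frozenset([node])}         # trivial cut {node}
        for choice in product(*(cuts[p] for p in preds)):
            c = frozenset().union(*choice)
            if len(c) <= k:
                merged.add(c)
        cuts[node] = merged
    return cuts

# Running example; 0xf0 and 0x12 are modeled as leaf nodes k0 and k1.
dag = {'a': [], 'b': [], 'k0': [], 'c': [], 'k1': [],
       'n1': ['a', 'b'], 'n2': ['b', 'k0'], 'n3': ['c', 'k1'],
       'n4': ['n1', 'n2'], 'n5': ['n2', 'n3'], 'n6': ['n4', 'n5']}
cuts = enumerate_cuts(dag, 3)
# cuts['n6'] contains {n4, n5}, {n4, n2, n3}, {n5, n1, n2}, as on the slide.
```

The reverse topological recursion guarantees every fan-in's cut set exists before it is merged, so each node is processed exactly once.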
2. Pattern Selection (1)
Resource cost and execution time can be obtained using a high-level estimation tool. The extended instructions should satisfy the area constraint.
Using all the enumerated patterns, optimal code can be generated, but mapping becomes unaffordable; instead, heuristically select a set of patterns.
2. Pattern Selection (2)
Basic idea: simultaneously consider speedup, occurrence frequency, and area.
Speedup:
Tsw(p) = |V(p)|
Thw(p) = length of the critical path of scheduled p
Speedup(p) = Tsw(p) / Thw(p)
Occurrence:
Some pattern instances may be isomorphic; detected by a graph isomorphism test [Nauty package]. For small subgraphs, the isomorphism test is very fast.
Gain(p) = Speedup(p) × Occurrence(p)
[DFG: n1 = a*b, n2 = b*0xf0, n3 = c*0x12, n4 = n1+n2, n5 = n2+n3, n6 = n4+n5]
Pattern *+ (a multiply feeding an add):
Tsw = 3
Thw = 2
Speedup = 1.5
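The ranking metric is a one-liner; a trivial sketch, with Tsw/Thw taken from the estimation step and the occurrence count from the isomorphism grouping (the occurrence value below is a hypothetical number for illustration):

```python
def speedup(tsw, thw):
    """Speedup(p) = Tsw(p) / Thw(p)."""
    return tsw / thw

def gain(tsw, thw, occurrence):
    """Gain(p) = Speedup(p) * Occurrence(p)."""
    return speedup(tsw, thw) * occurrence

# Pattern *+ above: Tsw = 3, Thw = 2 -> Speedup = 1.5.
# With a hypothetical occurrence of 4, Gain = 1.5 * 4 = 6.0.
```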
2. Pattern Selection (3)
Selection under the area constraint can be formulated as a 0-1 knapsack problem: given n items (patterns) and weight limit W (area constraint A), where the i-th item (pattern) has value (gain) vi and weight (area) wi, select a subset of items to maximize the total value while the total weight does not exceed W.
This is optimally solvable by a dynamic programming algorithm.
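The dynamic program can be sketched as follows; the gain/area values in the example are hypothetical, and integer areas are assumed so the area budget can index the DP table:

```python
def select_patterns(patterns, A):
    """0-1 knapsack by dynamic programming.
    patterns: list of (gain, area) pairs; A: integer area budget.
    Returns (best total gain, list of chosen pattern indices)."""
    # best[w] = (best gain achievable within area w, chosen indices)
    best = [(0.0, []) for _ in range(A + 1)]
    for i, (gain, area) in enumerate(patterns):
        # Reverse scan over budgets so each pattern is used at most once.
        for w in range(A, area - 1, -1):
            cand = best[w - area][0] + gain
            if cand > best[w][0]:
                best[w] = (cand, best[w - area][1] + [i])
    return best[A]

# Hypothetical instance: three patterns (gain, area), area budget 3.
# Picking patterns 0 and 1 fills the budget for a total gain of 4.5.
```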
3. Application Mapping (1)
Application mapping covers each node in G(V, E) with the extended instruction set to minimize the execution time. The execution time of a mapped DAG is defined as the sum of the execution times of the patterns covering the DAG:
T = Σ_{p nontrivial} Thw(p) + Σ_{p trivial} Tsw(p)
3. Application Mapping (2)
Theorem: the application mapping problem is equivalent to the minimum-area technology mapping problem.
Execution time ↔ area: total area = sum of the areas of each component; total execution time = sum of the execution times of each pattern.
Minimum-area mapping is NP-hard → application mapping is NP-hard. Many minimum-area technology mapping algorithms exist:
[Keutzer, DAC'87]: tree decomposition + dynamic programming
[Rudell] [Liao, ICCAD'95]: min-cost binate covering. Given a Boolean function f with variable set X and a cost function mapping X to a nonnegative integer, find an assignment for each variable so that the value of f is 1 and the sum of costs is minimized.
Binate Covering (1)
[DFG: n1 = a*b, n2 = b*0xf0, n3 = c*0x12, n4 = n1+n2, n5 = n2+n3, n6 = n4+n5]
Pattern | Function | Cost | Covers
p0  | +       | 1 | n6
p1  | +       | 1 | n5
p2  | +       | 1 | n4
p3  | *       | 2 | n3
p4  | *       | 2 | n2
p5  | *       | 2 | n1
p6  | *+      | 2 | n1, n4
p7  | *+      | 2 | n2, n4
p8  | *+      | 2 | n2, n5
p9  | *+      | 2 | n3, n5
p10 | (*)+(*) | 2 | n1, n2, n4
p11 | (*)+(*) | 2 | n2, n3, n5
Binate Covering (2)
[DFG and pattern table as above]
Covering clause: p0
The fan-ins of the sink node need to be covered by some pattern.
Binate Covering (3)
[DFG and pattern table as above]
The nodes that generate inputs to a selected pattern pi must be covered by some other pattern.
Covering clause: p2 + p6 + p7 + p10
Binate Covering (4)
[DFG and pattern table as above]
p2 → p4 and p2 → p5, i.e., (¬p2 + p4) and (¬p2 + p5)
Binate Covering (4, cont'd)
[DFG and pattern table as above]
(¬p6 + p4)
(¬p7 + p5)
Binate Covering (5)
[DFG and pattern table as above]
f = p0 (p2+p6+p7+p10)(¬p2+p4)(¬p2+p5)(¬p6+p4)(¬p7+p5)(p1+p8+p9+p11)(¬p1+p3)(¬p1+p4)(¬p8+p3)(¬p9+p4)
Min-cost cover: p0, p10, p11 with cost 1+2+2 = 5
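For an instance this small, the binate covering formula above can be checked by exhaustive search (a real mapper would use branch-and-bound or a dedicated covering solver); the clause list below transcribes f literally, with each literal as a (variable index, polarity) pair:

```python
from itertools import product

cost = [1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2]   # costs of p0..p11

# Each clause is a list of literals; polarity True = positive, False = negated.
clauses = [
    [(0, True)],                                       # p0
    [(2, True), (6, True), (7, True), (10, True)],     # p2+p6+p7+p10
    [(2, False), (4, True)], [(2, False), (5, True)],  # ~p2+p4, ~p2+p5
    [(6, False), (4, True)], [(7, False), (5, True)],  # ~p6+p4, ~p7+p5
    [(1, True), (8, True), (9, True), (11, True)],     # p1+p8+p9+p11
    [(1, False), (3, True)], [(1, False), (4, True)],  # ~p1+p3, ~p1+p4
    [(8, False), (3, True)], [(9, False), (4, True)],  # ~p8+p3, ~p9+p4
]

def min_cost_cover():
    """Exhaustively try all 2^12 assignments; return (min cost, chosen vars)."""
    best, best_sel = None, None
    for bits in product([0, 1], repeat=len(cost)):
        if all(any(bits[i] == pol for i, pol in cl) for cl in clauses):
            c = sum(cost[i] for i in range(len(cost)) if bits[i])
            if best is None or c < best:
                best = c
                best_sel = [i for i in range(len(cost)) if bits[i]]
    return best, best_sel
```

Running this confirms the slide's answer: the unique minimum-cost cover is {p0, p10, p11} with cost 5.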
Experimental Results (1)
A commercial reconfigurable system, Nios from Altera, is used to implement the ASIPs: 5 extended instruction formats, with up to 2048 instructions for each format.
Some DSP applications are taken as benchmarks. Altera's Quartus II 3.0 is used to aid the synthesis and physical design of the extended instructions.
Experimental Results (2)
[Bar chart: pattern size (1-10) vs. occurrence (0-140) for benchmarks fft_br, iir, fir, pr, dir, mcm]
Pattern size vs. number of pattern instances (2-input patterns)
Experimental Results (3)
Speedup under different input size constraints
[Bar chart: speedup (0-8) for fft_br, iir, fir, pr, dir, mcm under 4-input, 3-input, and 2-input constraints]
Speedup = Tbasic / Textended
Gap from the ideal speedup due to:
• pipeline hazards
• memory impact
Experimental Results (4)
Speedup and resource overhead on Nios implementations

Benchmark | Ext. Instr. # | Speedup (Estimation) | Speedup (Nios) | LE | LE % | Memory | Memory % | DSP Block
fft_br  | 9 | 3.28 | 2.65 | 408 | 6.06% | 65,536 | 9.79% | 16
iir     | 7 | 3.18 | 3.73 | 255 | 3.79% | 4,736  | 0.71% | 40
fir     | 2 | 2.40 | 2.14 | 51  | 0.76% | 1,024  | 0.15% | 8
pr      | 2 | 1.57 | 1.75 | 71  | 1.05% | 0      | 0.00% | 14
dir     | 2 | 3.28 | 3.02 | 54  | 0.80% | 0      | 0.00% | 16
mcm     | 4 | 4.75 | 3.22 | 186 | 2.76% | 0      | 0.00% | 56
Average | - | 3.08 | 2.75 | -   | 2.54% | -      | 1.77% | -
Conclusions
We propose a set of algorithms for ASIP compilation:
The actual performance metric is used as the optimization objective.
The instruction mapping problem is reduced to an area-minimization logic covering problem; operation duplication is considered implicitly.
Experiments show encouraging speedup.
Thank You