Automatic Processor Specialisation using
Ad-hoc Functional Units
[email protected], [email protected], [email protected]
EPFL – I&C – LAP
© Ienne 2002Automatic Processor Specialisation2
Classic Options for Systems-on-Chip
Design Gap!
Processor Specialisation: Get the Best of Both Options
Embedded!
VLIW Processor Specialisation
Two complementary specialisation strategies:
  Parametric Architecture
  Ad-hoc Functional Units (AFUs)
Automatically Collapsing Clusters of Instructions into New Ones
If the ad-hoc functional unit completes the job faster → GAIN
One ad-hoc complex operation instead of a long sequence of standard ones
General Goal
Automatically achieve
processor specialisation
through high-level
application code analysis
Outline
Introduction
Motivational example
Goals
Opportunities for specialisation
Challenges, further opportunities,…
Elementary Motivational Example
An Important Kernel…
/* init */
a <<= 8;
/* loop */
for (i = 0; i < 8; i++) {
    if (a & 0x8000) {
        a = (a << 1) + b;
    } else {
        a <<= 1;
    }
}
return a & 0xffff;
Shift-and-add unsigned 8 x 8-bit multiplication
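The kernel above can be wrapped into a complete function and checked against the built-in multiply. This is a minimal sketch: the function names and the use of `uint32_t` for the working register are assumptions made for illustration, not part of the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add unsigned 8 x 8-bit multiplication, following the slide.
 * The multiplier starts in bits 15..8 and is consumed MSB first, while
 * the partial product grows upwards from bit 0. */
uint16_t mul8x8_shift_add(uint8_t a8, uint8_t b8)
{
    uint32_t a = a8;
    uint32_t b = b8;
    int i;

    a <<= 8;                        /* init */
    for (i = 0; i < 8; i++) {       /* loop */
        if (a & 0x8000) {
            a = (a << 1) + b;
        } else {
            a <<= 1;
        }
    }
    return (uint16_t)(a & 0xffff);
}

/* Exhaustive comparison with the native multiply; returns 1 if all
 * 65536 operand pairs agree, 0 otherwise. */
int mul8x8_shift_add_is_correct(void)
{
    unsigned a, b;

    for (a = 0; a < 256; a++)
        for (b = 0; b < 256; b++)
            if (mul8x8_shift_add((uint8_t)a, (uint8_t)b) != a * b)
                return 0;
    return 1;
}
```

Calling `mul8x8_shift_add(13, 11)` gives 143, and the exhaustive check passes because the partial product never grows into the bits still holding unconsumed multiplier bits.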
Software Predication
/* init */
a <<= 8;
/* loop */
for (i = 0; i < 8; i++) {
    p1 = - ((a & 0x8000) >> 15);   /* predicate mask */
    a = (a << 1) + (b & p1);       /* shift, then predicated add */
}
return a & 0xffff;
Predicate mask (0 or -1 = 0xffffffff)
Shift; predicated add
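A detail worth checking when the predicated kernel is written in C: `+` binds tighter than `&`, so the add operand must be masked inside its own parentheses. The sketch below contrasts the intended form with what an unparenthesised `(a << 1) + b & p1` would compute; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Branch-free kernel: the mask p1 is all ones when bit 15 of a is set
 * and zero otherwise, so b is added only when the predicate holds. */
uint16_t mul8x8_predicated(uint8_t a8, uint8_t b8)
{
    uint32_t a = (uint32_t)a8 << 8;
    uint32_t b = b8;
    int i;

    for (i = 0; i < 8; i++) {
        uint32_t p1 = 0u - ((a & 0x8000u) >> 15);  /* 0 or 0xffffffff */
        a = (a << 1) + (b & p1);
    }
    return (uint16_t)(a & 0xffffu);
}

/* What C computes without the inner parentheses: the mask is applied
 * to the whole sum, which clears a as soon as one predicate is false. */
uint16_t mul8x8_mask_applied_to_sum(uint8_t a8, uint8_t b8)
{
    uint32_t a = (uint32_t)a8 << 8;
    uint32_t b = b8;
    int i;

    for (i = 0; i < 8; i++) {
        uint32_t p1 = 0u - ((a & 0x8000u) >> 15);
        a = ((a << 1) + b) & p1;
    }
    return (uint16_t)(a & 0xffffu);
}
```

With operands 13 and 11 the first function returns 143, while the second returns 0 because the first false predicate wipes the register.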
Loop Kernel DAG
[Figure: DAG of one loop iteration. Nodes: a & 0x8000, >> 15, negation, & with b, a << 1, and +, producing the new a.]
In SW: ~6 cycles
In HW: AND gates, only wiring, ALU: 1-2 cycles!
Ad-hoc Unit To Accelerate Shift-and-Add Multiplication Loop
Register File
ALU   LD/ST   MSTEP
if (Rn[31] == 1) then Rn ← (Rn << 1) + Rm
else Rn ← (Rn << 1)
1 ad-hoc instruction added
Loop kernel reduced to 15-30%
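The semantics of the MSTEP unit can be modelled in a few lines of C. This is a sketch under stated assumptions: registers are 32 bits wide, and, since the unit tests bit 31 rather than bit 15, the 8-bit multiplier is placed in bits 31..24 before the eight steps. That placement and the function names are illustrative choices, not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* One multiplication step:
 * if (Rn[31] == 1) Rn <- (Rn << 1) + Rm, else Rn <- (Rn << 1). */
uint32_t mstep(uint32_t rn, uint32_t rm)
{
    uint32_t shifted = rn << 1;

    return (rn >> 31) ? shifted + rm : shifted;
}

/* 8 x 8-bit multiplication using eight MSTEP operations.
 * Assumption: the multiplier is pre-shifted into bits 31..24 so that
 * each step consumes one multiplier bit from the top of the register. */
uint32_t mul8x8_with_mstep(uint8_t a, uint8_t b)
{
    uint32_t rn = (uint32_t)a << 24;
    int i;

    for (i = 0; i < 8; i++)
        rn = mstep(rn, b);
    return rn;  /* all multiplier bits have been shifted out */
}
```

Each loop iteration now costs a single instruction instead of the mask, shift and add sequence of the software version.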
Loop Unrolling
/* init */
a <<= 8;
/* no loop anymore */
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
return a & 0xffff;
Full DAG
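[Figure: full DAG of the unrolled kernel, starting from a << 8 and b: eight chained copies of the loop-kernel DAG (& 0x8000, >> 15, -, &, << 1, +). An Arithmetic Optimiser turns the chain of + nodes into an &-network on a and b, followed by column compression and a final +.]
In SW: ~50 cycles
In HW: ~3-4 cycles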
Ad-hoc Unit To Accelerate Multiplication?! Yeah, a MUL…
Register File
ALU   LD/ST   MUL
Rn ← (Rn & 0x0000.ffff) x (Rm & 0x0000.ffff)
1 ad-hoc instruction added
Function reduced by a factor of 10-15
Classic “Ad-hoc” Customisation…
Altera Nios:
Can we do more of this, really ad-hoc?!
Mainstream SoC/FPGA Processors and Specialisation?
All the recent embedded processors offer some sort of specialisation:
  Arbitrary functional units or tightly coupled coprocessors (IFX Carmel 20xx, ARM, Tensilica Xtensa, Altera Nios, etc.)
  Parametric resources (STM Lx, ARC Cores, Tensilica Xtensa, Altera Nios, etc.)
But all assume an onerous manual study and design!
Summary of Gain Potentials in Ad-hoc FUs
Exploit data parallelism in hardware
Exploit constants for logic simplification
Some operations reduce to wires in hardware
Exploit arithmetic properties for efficient chaining of arithmetic operations (e.g., carry save)
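As an illustration of the last point, carry-save addition lets a chain of additions pay for only one carry-propagating adder. The bit-level C model below is a sketch with hypothetical names; it shows the arithmetic identity, not how an AFU would actually be synthesised.

```c
#include <assert.h>
#include <stdint.h>

/* Redundant (sum, carry) representation of a value: value = sum + carry. */
typedef struct {
    uint32_t sum;
    uint32_t carry;
} csa_pair;

/* 3:2 compressor: reduces three operands to two without propagating
 * carries. Each bit position is an independent full adder. */
csa_pair csa_3to2(uint32_t x, uint32_t y, uint32_t z)
{
    csa_pair r;

    r.sum = x ^ y ^ z;
    r.carry = ((x & y) | (x & z) | (y & z)) << 1;
    return r;
}

/* Adds four operands (modulo 2^32) with two compressor levels and a
 * single carry-propagating addition at the very end. */
uint32_t add4_carry_save(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    csa_pair s1 = csa_3to2(a, b, c);
    csa_pair s2 = csa_3to2(s1.sum, s1.carry, d);

    return s2.sum + s2.carry;
}
```

In hardware each compressor level has the delay of one full adder regardless of word width, which is why chained additions collapse so well inside an AFU.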
Goals
How much scope is there for AFU specialisation in typical multimedia code?
Are classic ILP techniques or other optimisations (e.g., arithmetic) important to increase the speedup? To what extent?
What are the microarchitectural needs for exploiting the potential well?
  Memory ports in the AFUs?
  Number of inputs from the register file? Are two enough?
  Number of outputs to the register file? Is one enough?
Related Work in Reconfigurable Computing
Most of the work is in reconfigurable computing; typically, experiments are linked to a given microarchitecture:
  CHIMAERA [Ye et al., 2000] has the richest measurements, but only for 1-output AFUs and no AFU-memory interface
  Similarly, PRISC [Razdan et al., 1994] and ConCISe [Kastrup et al., 1999] use clustering approaches for 2-input, 1-output AFUs
  GARP [Hauser et al., 1997] concentrates on the mapping of control flow (hyperblocks in loops) in a loosely coupled architecture (coprocessor)
First, investigate where the potentials are → then fix the microarchitecture
Related Work in AFU Identification
Other authors concentrate on identification methods (“what is the best function for an AFU?”), often with some microarchitectural assumptions:
  MaxMISOs [Alippi et al., 1999] are 1-output candidates of maximal size
  [Jacome et al., 2000] introduce vertical and horizontal aggregation as heuristic methods to cluster operations (no comparisons with other techniques)
  [Arnold et al., 2001] use library pattern-matching techniques with a dynamic pattern library (instruction clusters), but with very limited cluster complexity (3 instructions) in the experiments
  ASIP synthesis: different problem (minimal covering)
First, investigate where the potentials are → then develop appropriate identification algorithms
Methodology
Concentrate on Data Flow
  Easier to capture automatically (no architecturally visible state in the AFUs)
  Constant latency (variable latency would hardly fit into a statically scheduled environment, e.g., VLIW)
Measurements on Basic Blocks
  Represent the upper limit of the potential advantages
  Upper limit is reachable if microarchitectural constraints are satisfied (e.g., number of inputs and outputs)
Experimental Flow
Software Execution: Approximate RISC Model
One clock cycle assumed for most SUIF nodes, representing the usage of the execution stage
  Exceptions: e.g., type casts (zero), divisions (N)
  All forwarding paths assumed to exist
  No data/instruction cache, or perfect hit rates, assumed
  Jumps add a fixed amount to the cycle count of each basic block
[Figure: pipeline diagram of five instructions flowing through the IF, ID, EX and WB stages; one instruction occupies the execution stage for several cycles (EX1, EX2, EX3).]
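The software cost model described on this slide amounts to a simple per-node sum. The sketch below assumes three node classes and uses placeholder values for the division latency N and for the fixed jump cost, neither of which is given in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Node classes distinguished by the approximate RISC model. */
typedef enum { NODE_SIMPLE, NODE_TYPE_CAST, NODE_DIVISION } node_kind;

/* Hypothetical parameters: the slides leave N and the jump cost open. */
enum { DIVISION_CYCLES = 32, JUMP_CYCLES = 2 };

/* Cycles spent in the execution stage by one node. */
unsigned node_cycles(node_kind kind)
{
    switch (kind) {
    case NODE_TYPE_CAST: return 0;                /* casts are free */
    case NODE_DIVISION:  return DIVISION_CYCLES;  /* N cycles */
    default:             return 1;                /* most nodes */
    }
}

/* Software cycle count of a basic block: the sum of its node latencies
 * plus a fixed amount if the block ends with a jump. */
unsigned basic_block_sw_cycles(const node_kind *nodes, size_t count,
                               int ends_with_jump)
{
    unsigned total = ends_with_jump ? JUMP_CYCLES : 0;
    size_t i;

    assert(nodes != NULL || count == 0);
    for (i = 0; i < count; i++)
        total += node_cycles(nodes[i]);
    return total;
}
```

With these placeholder values, a block made of two simple nodes, one cast and one division that ends with a jump costs 1 + 0 + 1 + 32 + 2 = 36 cycles.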
Hardware Execution: Synthesis-based Model
Operator                             Precision                      Relative delay (hw)
Multiply-accumulator                 32 bits x 32 bits + 64 bits    1.00
Adder                                4 bits + 4 bits                0.11
Adder                                8 bits + 8 bits                0.12
Adder                                16 bits + 16 bits              0.20
Adder                                24 bits + 24 bits              0.24
Adder                                32 bits + 32 bits              0.25
Divider                              4 bits / 4 bits                0.38
Divider                              8 bits / 8 bits                1.22
Divider                              16 bits / 16 bits              3.68
Divider                              24 bits / 24 bits              6.33
Divider                              32 bits / 32 bits              9.61
Divider (by power of two)            any / any                      0.00
Barrel shifter                       8 bits                         0.08
Barrel shifter                       16 bits                        0.11
Barrel shifter                       32 bits                        0.16
Barrel shifter (by constant amount)  any                            0.00
Bitwise multiplexer                  any                            0.02
Technology: CMOS 0.18 µm
Tools: Synopsys Design Compiler + DesignWare
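The relative delays in the table can be used to estimate how many cycles a chain of operators needs inside an AFU. The sketch below makes one assumption beyond the slide: that the processor cycle time equals the delay of the 32-bit multiply-accumulator (relative delay 1.00). Delays are stored in hundredths to keep the arithmetic in integers; the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Relative delays from the table, in hundredths of the MAC delay. */
enum {
    DELAY_ADD32       = 25,  /* adder, 32 bits + 32 bits */
    DELAY_SHIFT32     = 16,  /* barrel shifter, 32 bits */
    DELAY_SHIFT_CONST = 0,   /* shift by constant amount: wiring only */
    DELAY_BITWISE_MUX = 2    /* bitwise multiplexer */
};

/* Delay of a chain of dependent operators: the sum of their delays. */
unsigned chain_delay(const unsigned *delays, size_t count)
{
    unsigned total = 0;
    size_t i;

    assert(delays != NULL || count == 0);
    for (i = 0; i < count; i++)
        total += delays[i];
    return total;
}

/* Whole cycles needed for a delay, rounding up.
 * Assumption: one cycle lasts as long as the MAC (100 hundredths). */
unsigned delay_to_cycles(unsigned delay_hundredths)
{
    return (delay_hundredths + 99) / 100;
}
```

Under this assumption three chained 32-bit adders, a constant shift and a multiplexer (0.77 in total) still fit in one cycle, whereas five chained adders (1.25) need two.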
Partitioning of DFG
Mix of Hardware and Software
AFU memory bandwidth issue
On-AFU (Hardware) and Off-AFU (Software) instructions
DFG partitioned in HW and SW layers
High Cost!
Low Performance?
Example of Layering Hybrid DFGs
Hardware and software layers
Metrics and Measurements
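Topological basic block information: inputs, outputs, etc.
Saved cycles → speedup
All operations in the AFU:
$$\sum_{i \in \mathit{all\_ops}} \mathit{Lat}_{SW}(i) - \mathit{CP}_{HW}$$
Hybrid hardware and software layers:
$$\sum_{i \in \mathit{all\_ops}} \mathit{Lat}_{SW}(i) - \sum_{i \in \mathit{all\_AFU\_layers}} \mathit{CP}_{HW_i} - \sum_{i \in \mathit{all\_SW\_ops}} \mathit{Lat}_{SW}(i)$$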
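The saved-cycles metric can be sketched in C as follows. This reading is an assumption: saved cycles are taken to be the software latency of all operations minus the time needed with the AFU, which is either a single hardware critical path or, for a layered graph, the critical paths of the AFU layers plus the latencies of the operations left in software. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of an array of cycle counts. */
static long sum_cycles(const unsigned *v, size_t n)
{
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += (long)v[i];
    return total;
}

/* Whole basic block in one AFU: software latency of all operations
 * minus the hardware critical path, both in cycles. */
long saved_cycles_hw(const unsigned *lat_sw, size_t n_ops, unsigned cp_hw)
{
    assert(lat_sw != NULL || n_ops == 0);
    return sum_cycles(lat_sw, n_ops) - (long)cp_hw;
}

/* Layered hybrid graph: subtract the critical path of every AFU layer
 * and the latency of every operation that stays in software. */
long saved_cycles_hybrid(const unsigned *lat_sw_all, size_t n_ops,
                         const unsigned *cp_hw_layer, size_t n_layers,
                         const unsigned *lat_sw_left, size_t n_left)
{
    return sum_cycles(lat_sw_all, n_ops)
         - sum_cycles(cp_hw_layer, n_layers)
         - sum_cycles(lat_sw_left, n_left);
}
```

For six single-cycle operations, a pure hardware critical path of 2 cycles saves 4 cycles; with two AFU layers of 1 cycle each and one operation left in software, 3 cycles are saved.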
Basic Blocks Characteristics: Examples
Benchmark     BB #  Weight   In  Out  Ld   St   Tsw   Thw    nhw  Thyb
adpcmdecode   5     22.84%   3   2    1    0    9     2.07   2    3
              22    17.77%   4   3    1    1    7     1.25   1    3
              9     12.69%   2   3    0    0    5     0.33   1    1
              4     7.61%    1   3    1    0    6     1.00   2    3
mpeg2decode   4     37.44%   5   2    2    0    13    2.49   2    4
              10    34.56%   4   2    2    0    12    2.49   2    4
pegwit        1     31.47%   2   0    296  36   811   3.65   3    335
              25    9.06%    5   2    0    0    7     0.83   1    1
              28    6.47%    2   0    2    1    5     2.29   2    5
              9     6.45%    2   1    2    0    5     2.82   3    5
              13    6.45%    2   0    2    1    5     2.29   2    5
              10    5.16%    4   2    0    0    4     0.83   1    1
Column group labels on the slide: Benchmark, Topology, Parallel Memory Access, Sequential Memory Access
Callouts:
  Execution concentrated in few BBs
  Few Ld/St… well separated
  Small delays
  High RF pressure
Basic Blocks Characteristics
Moderate hardware resources for AFUs:
  Often, half of the execution time is concentrated in no more than 2-3 basic blocks
Pressure on the register file higher than classically supported
Limited importance of memory ports
  Except some dramatic cases…
Small delay of typical basic blocks
Potential Basic Speedup: Examples
                           Cycle Savings
Benchmark     BB #  ILP    SeqMem   NoConst  Basic    PlusBitwidth  PlusArith
adpcmdecode   5     1.80   12.71%   12.71%   12.71%   14.82%        14.82%
              22    3.50   8.47%    10.59%   10.59%   10.59%        10.59%
              9     1.67   8.47%    8.47%    8.47%    8.47%         8.47%
              4     1.20   3.18%    4.24%    5.29%    5.29%         5.29%
mpeg2decode   4     2.17   24.18%   24.18%   26.87%   -             -
              10    2.40   21.50%   21.50%   24.18%   -             -
pegwit        1     81.10  16.72%   28.31%   28.34%   -             -
              25    1.75   7.03%    5.86%    7.03%    -             -
              28    1.25   0.00%    1.17%    2.34%    -             -
              9     1.00   0.00%    0.00%    2.33%    -             -
              13    1.25   0.00%    1.17%    2.33%    -             -
              10    1.33   3.50%    3.50%    3.50%    -             -
Callouts:
  Good speedup with few BBs
  Not critical…
  BBs too simple to bring advantage
Inputs and Outputs of Basic Blocks
[Figure, two panels: speedup per number of inputs (annotation: >60%); speedup per number of outputs (annotation: ~50%).]
Potential Basic Speedup
Limited available parallelism
  Top-ranking basic blocks: 10 to 50% cycle savings
Hardwired constants not a key advantage
  Small price for a reduction in design risk
Sequentialisation penalty not dramatic
  AFU memory ports not essential
Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  Basic blocks are too simple, ceiling effects,…
Effects of ILP Techniques: Examples
Benchmark            Basic           Plus Predication  Plus Unrolling  Plus Bitwidth Analysis  Plus Arithmetic Opt.
adpcmdecode (par.)   49.45% (2.0x)   88.11% (8.4x)     88.11% (8.4x)   88.11% (8.4x)           88.11% (8.4x)
adpcmdecode (seq.)   45.21% (1.8x)   77.51% (4.5x)     77.51% (4.5x)   81.75% (5.5x)           81.75% (5.5x)
mpeg2decode (par.)   68.23% (3.1x)   68.23% (3.1x)     86.91% (7.6x)   -                       87.92% (8.3x)
mpeg2decode (seq.)   60.09% (2.5x)   60.09% (2.5x)     73.73% (3.8x)   -                       74.40% (3.9x)
pegwit (par.)        63.33% (2.7x)   67.31% (3.1x)     67.31% (3.1x)   -                       -
pegwit (seq.)        38.99% (1.6x)   42.96% (1.7x)     42.96% (1.7x)   -                       -
Legend: each cell gives the total cycle savings and, in parentheses, the total speedup.
[Slide annotation: each cell also carries the number of basic blocks needed to reach 30% speedup, mostly between 1 and 4 and at most 10.]
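The two figures in each cell of the table are related by speedup = 1 / (1 - savings), which can be checked with a small C sketch (function names are illustrative):

```c
#include <assert.h>

/* Converts a fraction of saved cycles (0 <= s < 1) into a speedup. */
double speedup_from_savings(double saved_fraction)
{
    assert(saved_fraction >= 0.0 && saved_fraction < 1.0);
    return 1.0 / (1.0 - saved_fraction);
}

/* Inverse conversion: fraction of cycles saved for a given speedup. */
double savings_from_speedup(double speedup)
{
    assert(speedup >= 1.0);
    return 1.0 - 1.0 / speedup;
}
```

For example, the 49.45% savings of adpcmdecode (par.) correspond to a speedup of about 1.98, reported as 2.0x, and 88.11% corresponds to about 8.41, reported as 8.4x.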
Effects of ILP Techniques
Major improvements:
  Cumulative speedups between 1.7x and 6.3x
Register file pressure not significantly modified
Hardware complexity and Thw increased
  Area is typically below 2-3x that of a 32-bit multiplier, almost never >10x
Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  Baseline advantage already very large
Arithmetic Optimisation Impact
mpeg2decode basic block #7            Tsw   Thw
Without arithmetic transformations    55    5     25,344,000   30.6%
With arithmetic transformations       55    3     26,357,760   31.4%
[Figure, two panels: w/o optimisation; with optimisation.]
Conclusions
DFG-level specialisation opens potential speedups (2–3x) at low cost (hardware and toolset) and low risk
Larger number of AFU write ports (2-3) needed
Hardcoding of constants not essential
AFU memory interfaces also not essential
ILP techniques help, as expected
Sophisticated and detailed techniques (bitwidth analysis, arithmetic optimisations) sometimes masked by other effects
Ongoing Work
Measure advantages through a complete toolchain (notably, compiler):
  DSP microarchitecture:
    Validate simple model
    Find out bottlenecks and impose real DSP constraints (e.g., non-orthogonality)
  VLIW microarchitecture:
    Go beyond simple software execution model
Develop novel speedup-driven identification algorithms
How to get more AFU specialisation potential
  Dynamic identification and configuration of AFUs
Typical Identification Algorithms
Bottom-up greedy approaches to cluster instructions
Topologically driven rather than speedup driven
E.g., MaxMISO identification [Alippi et al., 1999]:
[Figure: dataflow graph of + and * nodes partitioned into three MaxMISOs, labelled MS1, MS2 and MS3.]
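A bottom-up greedy clustering of this kind can be sketched in C. The code below is written in the spirit of maximal single-output clusters and is not the algorithm of the cited paper: it assumes the dataflow graph is given as an adjacency matrix with nodes numbered in topological order, and all type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 32

/* Dataflow graph; assumption: nodes are numbered in topological order,
 * so every edge goes from a lower to a higher index. */
typedef struct {
    int n;                                     /* number of nodes */
    unsigned char edge[MAX_NODES][MAX_NODES];  /* edge[u][v]: u feeds v */
} dfg;

/* Greedy bottom-up clustering into single-output subgraphs. Starting
 * from the last unassigned node (the cluster output), a producer joins
 * the cluster only if every one of its consumers is already inside.
 * Writes a cluster id per node and returns the number of clusters. */
int cluster_single_output(const dfg *g, int *cluster)
{
    int n_clusters = 0;
    int root, u, v;

    assert(g->n >= 0 && g->n <= MAX_NODES);
    for (u = 0; u < g->n; u++)
        cluster[u] = -1;

    for (root = g->n - 1; root >= 0; root--) {
        if (cluster[root] != -1)
            continue;
        cluster[root] = n_clusters;
        for (u = root - 1; u >= 0; u--) {
            int n_succ = 0, all_inside = 1;

            if (cluster[u] != -1)
                continue;
            for (v = u + 1; v < g->n; v++) {
                if (!g->edge[u][v])
                    continue;
                n_succ++;
                if (cluster[v] != n_clusters)
                    all_inside = 0;
            }
            if (n_succ > 0 && all_inside)
                cluster[u] = n_clusters;
        }
        n_clusters++;
    }
    return n_clusters;
}
```

The decision is purely topological: a node with fan-out to two different clusters ends up alone, whatever speedup it could contribute, which is the limitation the slide points at.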
Speedup-driven Identification
Prune out an optimal set of low-speedup nodes to achieve the required input/output count
[Figure: dataflow graph with inputs i0 to i7 and outputs o0 and o1; nodes are annotated with values such as 0.1, 0.5, 1, 2, 3 and k.]
SIMD-like and unconnected graphs
Open Issues and Perspectives
Power consumption advantages?
  Power down because:
    Fewer instruction fetches and decodes
    Fewer register reads and writebacks
  Power possibly up because:
    Reduced correlation of signals in the AFU
    Low efficiency of the implementation (in case of eFPGAs)
More opportunities to increase speedup?
  Detect and implement LUTs (e.g., in quantisers) as discrete CAMs
  Detect runtime constant values
Dynamic Specialisation?
Java Bytecode
JiT + Specialisation
ARM + RFU
Dynamic compilation and optimisation together with hardware specialisation
  DAISY, Crusoe, JiT, etc.
  Specialisation may profit from runtime information
  Identification in runtime conditions
  Dynamic reconfigurability challenge
Conclusions
Processor customisation opportunities are here: soft cores, FPGA processors, etc.
Very specific field of hardware/software codesign with a very large potential
  Do not give up versatility
  Get most of the performance of custom hardware
Needs automation, to complement compilers and synthesizers (some work exists but limited in scope)