Automatic Processor Specialisation using Ad-hoc Functional Units...

Automatic Processor Specialisation using Ad-hoc Functional Units

[email protected], [email protected], [email protected]

EPFL – I&C – LAP

© Ienne 2002, Automatic Processor Specialisation

Classic Options for Systems-on-Chip

Design Gap!


Processor Specialisation: Get the Best of Both Options

Embedded!


VLIW Processor Specialisation

Two complementary specialisation strategies:

Parametric Architecture

Ad-hoc Functional Units (AFUs)


Automatically Collapsing Clusters of Instructions into New Ones

One ad-hoc complex operation instead of a long sequence of standard ones

If the ad-hoc functional unit completes the job faster → GAIN


General Goal

Automatically achieve

processor specialisation

through high-level

application code analysis


Outline

Introduction
Motivational example
Goals
Opportunities for specialisation
Challenges, further opportunities,…


Elementary Motivational Example: An Important Kernel…

/* init */
a <<= 8;
/* loop */
for (i = 0; i < 8; i++) {
    if (a & 0x8000) {
        a = (a << 1) + b;
    } else {
        a <<= 1;
    }
}
return a & 0xffff;

Shift-and-add unsigned 8 x 8-bit multiplication


Software Predication

/* init */
a <<= 8;
/* loop */
for (i = 0; i < 8; i++) {
    p1 = - ((a & 0x8000) >> 15);
    a = (a << 1) + (b & p1);
}
return a & 0xffff;

Predicate mask (0 or –1 = 0xffffffff)

Shift; Predicated Add
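The equivalence of the branching and the predicated kernels can be checked exhaustively, since there are only 65,536 operand pairs. The sketch below is one way to do that; the function names (`mul8_branch`, `mul8_pred`, `mul8_agree_all`) and the use of `uint32_t` are assumptions made for illustration and do not come from the slides. Note that `b & p1` needs its own parentheses, because `+` binds tighter than `&` in C.

```c
#include <assert.h>
#include <stdint.h>

/* Branching shift-and-add kernel: unsigned 8 x 8-bit multiplication. */
uint32_t mul8_branch(uint32_t a, uint32_t b)
{
    assert(a <= 0xff && b <= 0xff);   /* operands must fit in 8 bits */
    a <<= 8;                          /* multiplier moves to bits 15..8 */
    for (int i = 0; i < 8; i++) {
        if (a & 0x8000)
            a = (a << 1) + b;
        else
            a <<= 1;
    }
    return a & 0xffff;
}

/* Predicated kernel: the branch is replaced by an all-zeros/all-ones mask. */
uint32_t mul8_pred(uint32_t a, uint32_t b)
{
    assert(a <= 0xff && b <= 0xff);
    a <<= 8;
    for (int i = 0; i < 8; i++) {
        /* p1 is 0 or 0xffffffff; "0u - x" is unsigned negation */
        uint32_t p1 = 0u - ((a & 0x8000u) >> 15);
        a = (a << 1) + (b & p1);
    }
    return a & 0xffff;
}

/* Returns 1 if both kernels equal a * b for every pair of 8-bit operands. */
int mul8_agree_all(void)
{
    for (uint32_t a = 0; a < 256; a++)
        for (uint32_t b = 0; b < 256; b++)
            if (mul8_branch(a, b) != a * b || mul8_pred(a, b) != a * b)
                return 0;
    return 1;
}
```

Calling `mul8_agree_all()` from a small test driver is enough to confirm that predication does not change the result.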


Loop Kernel DAG

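[Figure: DAG of the loop kernel. One path computes (a & 0x8000) >> 15, negates it and ANDs the result with b; the other computes a << 1; both feed the final + that produces the new a. In SW: ~6 cycles. In HW: the masking needs only AND gates, the shifts are only wiring, and the addition uses an ALU: 1-2 cycles!]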


Ad-hoc Unit To Accelerate Shift-and-Add Multiplication Loop

[Figure: register file connected to three functional units: ALU, LD/ST, MSTEP]

MSTEP:
if (Rn[31] == 1) then Rn ← (Rn << 1) + Rm
else Rn ← (Rn << 1)

1 ad-hoc instruction added

Loop kernel reduced to 15-30%
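A software model of the MSTEP semantics makes the instruction easy to experiment with. The sketch below follows the register-transfer description on the slide (test of bit 31, shift, conditional add); the function names and the choice of placing the multiplier in the top byte of `Rn` are assumptions made here so that bit 31 is the bit being tested.

```c
#include <assert.h>
#include <stdint.h>

/* Model of one MSTEP:
 *   if (Rn[31] == 1) Rn <- (Rn << 1) + Rm  else  Rn <- (Rn << 1)        */
uint32_t mstep(uint32_t rn, uint32_t rm)
{
    return (rn & 0x80000000u) ? (rn << 1) + rm : (rn << 1);
}

/* 8 x 8-bit unsigned multiplication as eight MSTEPs.
 * Assumption: the multiplier is pre-shifted into bits 31..24 of Rn. */
uint32_t mul8_mstep(uint8_t a, uint8_t b)
{
    uint32_t rn = (uint32_t)a << 24;
    for (int i = 0; i < 8; i++)
        rn = mstep(rn, b);
    /* all multiplier bits have been shifted out; only the product is left */
    assert(rn <= 0xffffu);
    return rn;
}
```

With this placement the software loop body shrinks to a single operation per iteration, which is the effect the slide attributes to the added instruction.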


Loop Unrolling

/* init */
a <<= 8;
/* no loop anymore */
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
p1 = - ((a & 0x8000) >> 15); a = (a << 1) + (b & p1);
return a & 0xffff;


Full DAG

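[Figure: full DAG of the unrolled function, with inputs a and b, the initial << 8, and eight repetitions of the loop kernel (& 0x8000, >> 15, -, &, << 1, +). In SW: ~50 cycles. An arithmetic optimiser turns the chain of eight additions into an &-network followed by column compression and one final addition. In HW: ~3-4 cycles.]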


Ad-hoc Unit To Accelerate Multiplication?! Yeah, a MUL…

[Figure: register file connected to three functional units: ALU, LD/ST, MUL]

Rn ← (Rn & 0x0000.ffff) x (Rm & 0x0000.ffff)

1 ad-hoc instruction added

Function reduced by a factor 10-15


Classic “Ad-hoc” Customisation…

Altera Nios:

Can we do more of this, really ad-hoc?!


Mainstream SoC/FPGA Processors and Specialisation?

All the recent embedded processors offer some sort of specialisation:

Arbitrary functional units or tightly coupled coprocessors (IFX Carmel 20xx, ARM, Tensilica Xtensa, Altera Nios, etc.)

Parametric resources (STM Lx, ARC Cores, Tensilica Xtensa, Altera Nios, etc.)

But all assume an onerous manual study and design!


Summary of Gain Potentials in Ad-hoc FUs

Exploit data parallelism in hardware

Exploit constants for logic simplification

Some operations reduce to wires in hardware

Exploit arithmetic properties for efficient chaining of arithmetic operations (e.g., carry save)
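The carry-save point can be illustrated with a bit-level model. In the sketch below (names such as `csa` and `add4_carry_save` are illustrative, not from the slides), a 3:2 compressor reduces three operands to a sum word and a carry word using only bitwise logic, so a chain of additions needs a single carry-propagating adder at the very end.

```c
#include <assert.h>
#include <stdint.h>

/* Redundant (carry-save) representation: value = sum + carry (mod 2^32). */
typedef struct {
    uint32_t sum;
    uint32_t carry;
} csa_t;

/* 3:2 compressor: three operands in, two out, no carry propagation. */
csa_t csa(uint32_t x, uint32_t y, uint32_t z)
{
    csa_t r;
    r.sum = x ^ y ^ z;                               /* bitwise sum    */
    r.carry = ((x & y) | (x & z) | (y & z)) << 1;    /* majority, x2   */
    return r;
}

/* Adds four operands with two compressor levels and one real adder. */
uint32_t add4_carry_save(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    csa_t t = csa(a, b, c);
    t = csa(t.sum, t.carry, d);
    uint32_t result = t.sum + t.carry;   /* the only carry-propagate add */
    assert(result == a + b + c + d);     /* debug check of the invariant */
    return result;
}
```

In hardware each compressor level costs roughly one full-adder delay regardless of word width, which is why chained additions inside an AFU can be much faster than the same additions issued one by one on an ALU.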


Goals

How much scope for AFU specialisation in typical multimedia code?

Are classic ILP techniques or other optimisations (e.g., arithmetic) important to increase the speedup? To what extent?

What are the microarchitectural needs for exploiting the potential well?
  Memory ports in the AFUs?
  Number of inputs from the register file? Are two enough?
  Number of outputs to the register file? Is one enough?


Related Work in Reconfigurable Computing

Most of the work is in reconfigurable computing; typically, experiments are linked to a given microarchitecture:
  CHIMAERA [Ye et al., 2000] has the richest measurements, but only for 1-output AFUs and no AFU-memory interface
  Similarly, PRISC [Razdan et al., 1994] and ConCISe [Kastrup et al., 1999] use clustering approaches for 2-input, 1-output AFUs
  GARP [Hauser et al., 1997] concentrates on the mapping of control flow (hyperblocks in loops) in a loosely coupled architecture (coprocessor)

First, investigate where the potentials are → then fix the microarchitecture


Related Work in AFU Identification

Other authors concentrate on identification methods (“what is the best function for an AFU?”), often with some microarchitectural assumptions:
  MaxMISOs [Alippi et al., 1999] are 1-output candidates of maximal size
  [Jacome et al., 2000] introduce vertical and horizontal aggregation as heuristic methods to cluster operations (no comparisons with other techniques)
  [Arnold et al., 2001] use library pattern-matching techniques with a dynamic pattern library (instruction clusters), but with very limited cluster complexity (3 instructions) in the experiments
  ASIP synthesis: different problem (minimal covering)

First, investigate where the potentials are → then develop appropriate identification algorithms


Methodology

Concentrate on Data Flow
  Easier to capture automatically (no architecturally visible state in the AFUs)
  Constant latency (variable latency would hardly fit into a statically scheduled environment, e.g., VLIW)

Measurements on Basic Blocks
  Represent the upper limit of the potential advantages
  Upper limit is reachable if microarchitectural constraints are satisfied (e.g., no. of inputs and outputs)


Experimental Flow


Software Execution: Approximate RISC Model

One clock cycle assumed for most SUIF nodes, representing the usage of the execution stage
  Exceptions: e.g., type casts (zero), divisions (N)
  All forwarding paths assumed to exist
  No data/instruction cache, or perfect hit rates assumed
  Jumps accounted for with a fixed amount added to the cycle count of each basic block

[Figure: pipeline diagram of five instructions (1 to 5) flowing through the IF, ID, EX and WB stages; one instruction has a multi-cycle execution stage (EX1, EX2, EX3)]
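The software cost model amounts to a per-node lookup plus a fixed jump charge. The following is a minimal sketch of that accounting; the `node_kind` enumeration and the two parameters `div_cycles` (the N of the slide) and `jump_cycles` are hypothetical names, and no actual values are given in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of intermediate-representation nodes. */
typedef enum { NODE_ALU, NODE_CAST, NODE_DIV } node_kind;

/* Estimated software cycles of one basic block:
 *   1 cycle for most nodes, 0 for type casts, div_cycles for divisions,
 *   plus a fixed jump_cycles charged once per basic block.             */
unsigned bb_sw_cycles(const node_kind *nodes, size_t n,
                      unsigned div_cycles, unsigned jump_cycles)
{
    assert(nodes != NULL || n == 0);
    unsigned cycles = jump_cycles;
    for (size_t i = 0; i < n; i++) {
        switch (nodes[i]) {
        case NODE_CAST: break;                      /* free             */
        case NODE_DIV:  cycles += div_cycles; break;
        default:        cycles += 1; break;         /* one EX cycle     */
        }
    }
    return cycles;
}
```

Because forwarding and perfect caches are assumed, no stall terms appear: the estimate is simply a sum over the nodes of the block.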


Hardware Execution: Synthesis-based Model

Operator                              Precision                      Relative Delay
Multiply-Accumulator                  32 bits x 32 bits + 64 bits    1.00
Adder                                 4 bits + 4 bits                0.11
Divider                               4 bits / 4 bits                0.38
Divider (by power of two)             any / any                      0.00
Barrel shifter                        8 bits                         0.08
Barrel shifter (by constant amount)   any                            0.00
Bitwise multiplexer                   any                            0.02

CMOS 0.18 µm; Synopsys Design Compiler + DesignWare


Partitioning of DFG: Mix of Hardware and Software

AFU memory bandwidth issue
  On-AFU (Hardware) and Off-AFU (Software) instructions
  DFG partitioned in HW and SW layers

High cost! Low performance?


Example of Layering Hybrid DFGs

Hardware and software layers


Metrics and Measurements

Topological basic block information: Inputs, outputs, etc.

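Saved cycles → speedup

Whole basic block in hardware:

$$\sum_{i \in all\_ops} Lat_{SW}(i) - CP_{HW}$$

Basic block partitioned into hardware and software layers:

$$\sum_{i \in all\_ops} Lat_{SW}(i) - \left( \sum_{i \in all\_AFU\_layers} CP_{HW}(i) + \sum_{i \in all\_SW\_ops} Lat_{SW}(i) \right)$$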
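Read as "software latency of the whole block minus the latency of its specialised version", the saved-cycles metric is straightforward to compute. The sketch below assumes that reading; the function names, and the convention that hardware critical paths are already expressed in whole clock cycles, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of latencies (in cycles) over n entries. */
static unsigned sum_latencies(const unsigned *lat, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += lat[i];
    return total;
}

/* Whole basic block in one AFU:
 *   saved = sum of Lat_SW(i) over all ops - CP_HW                        */
int saved_cycles_hw(const unsigned *lat_sw, size_t n_ops, unsigned cp_hw)
{
    assert(lat_sw != NULL || n_ops == 0);
    return (int)sum_latencies(lat_sw, n_ops) - (int)cp_hw;
}

/* Block split into AFU layers and remaining software ops:
 *   saved = sum of Lat_SW(i) over all ops
 *         - (sum of CP_HW over AFU layers + sum of Lat_SW over SW ops)   */
int saved_cycles_hybrid(const unsigned *lat_sw, size_t n_ops,
                        const unsigned *cp_layers, size_t n_layers,
                        const unsigned *lat_sw_left, size_t n_left)
{
    assert(lat_sw != NULL || n_ops == 0);
    unsigned hybrid = sum_latencies(cp_layers, n_layers)
                    + sum_latencies(lat_sw_left, n_left);
    return (int)sum_latencies(lat_sw, n_ops) - (int)hybrid;
}
```

For instance, with illustrative numbers, a block of nine single-cycle operations that becomes two single-cycle AFU layers separated by one software operation would save 9 - 3 = 6 cycles.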


Basic Blocks Characteristics: Examples

                              Topology           Parallel Memory Access   Sequential Memory Access
Benchmark     BB #  Weight    In  Out  Ld   St   Tsw   Thw    nhw         Thyb
adpcmdecode   5     22.84%    3   2    1    0    9     2.07   2           3
              22    17.77%    4   3    1    1    7     1.25   1           3
              9     12.69%    2   3    0    0    5     0.33   1           1
              4     7.61%     1   3    1    0    6     1.00   2           3
mpeg2decode   4     37.44%    5   2    2    0    13    2.49   2           4
              10    34.56%    4   2    2    0    12    2.49   2           4
pegwit        1     31.47%    2   0    296  36   811   3.65   3           335
              25    9.06%     5   2    0    0    7     0.83   1           1
              28    6.47%     2   0    2    1    5     2.29   2           5
              9     6.45%     2   1    2    0    5     2.82   3           5
              13    6.45%     2   0    2    1    5     2.29   2           5
              10    5.16%     4   2    0    0    4     0.83   1           1

Callouts: Execution concentrated in few BBs. High RF pressure. Few Ld/St… …well separated. Small delays.


Basic Blocks Characteristics

Moderate hardware resources for AFUs:
  Often, half of the execution time is concentrated in no more than 2-3 basic blocks

Pressure on the register file higher than classically supported

Limited importance of memory ports
  Except some dramatic cases…

Small delay of typical basic blocks


Potential Basic Speedup: Examples

                            Cycle Savings
Benchmark     BB #  ILP     SeqMem   NoConst  Basic    PlusBitwidth  PlusArith
adpcmdecode   5     1.80    12.71%   12.71%   12.71%   14.82%        14.82%
              22    3.50    8.47%    10.59%   10.59%   10.59%        10.59%
              9     1.67    8.47%    8.47%    8.47%    8.47%         8.47%
              4     1.20    3.18%    4.24%    5.29%    5.29%         5.29%
mpeg2decode   4     2.17    24.18%   24.18%   26.87%   -             -
              10    2.40    21.50%   21.50%   24.18%   -             -
pegwit        1     81.10   16.72%   28.31%   28.34%   -             -
              25    1.75    7.03%    5.86%    7.03%    -             -
              28    1.25    0.00%    1.17%    2.34%    -             -
              9     1.00    0.00%    0.00%    2.33%    -             -
              13    1.25    0.00%    1.17%    2.33%    -             -
              10    1.33    3.50%    3.50%    3.50%    -             -

Callouts: Good speedup with few BBs. Not critical… BBs too simple to bring advantage.


Inputs and Outputs of Basic Blocks

[Figures: speedup per number of inputs; speedup per number of outputs. Callouts: >60%, ~50%]


Potential Basic Speedup

Limited available parallelism
  Top-ranking basic blocks: 10 to 50% cycle savings

Hardwired constants not a key advantage
  Small price for a reduction in design risk

Sequentialisation penalty not dramatic
  AFU memory ports not essential

Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  Basic blocks are too simple, ceiling effects,…


Effects of ILP Techniques: Examples

Benchmark             Basic     PlusPredication  PlusUnrolling  PlusBitwidthAnalysis  PlusArithmetic Opt.
adpcmdecode (par.)    49.45%    88.11%           88.11%         88.11%                88.11%
                      (2.0 x)   (8.4 x)          (8.4 x)        (8.4 x)               (8.4 x)
adpcmdecode (seq.)    45.21%    77.51%           77.51%         81.75%                81.75%
                      (1.8 x)   (4.5 x)          (4.5 x)        (5.5 x)               (5.5 x)
mpeg2decode (par.)    68.23%    68.23%           86.91%         -                     87.92%
                      (3.1 x)   (3.1 x)          (7.6 x)        -                     (8.3 x)
mpeg2decode (seq.)    60.09%    60.09%           73.73%         -                     74.40%
                      (2.5 x)   (2.5 x)          (3.8 x)        -                     (3.9 x)
pegwit (par.)         63.33%    67.31%           67.31%         -                     -
                      (2.7 x)   (3.1 x)          (3.1 x)        -                     -
pegwit (seq.)         38.99%    42.96%           42.96%         -                     -
                      (1.6 x)   (1.7 x)          (1.7 x)        -                     -

Legend: each cell gives the total cycle savings and, in parentheses, the total speedup. On the slide, each cell is also annotated with the number of basic blocks needed to reach 30% speedup.
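The two figures in each cell appear to be related in the usual way: if a fraction s of the cycles is saved, the speedup is 1 / (1 - s). The helper below is a sketch of that conversion (the function name is illustrative); for example, 88.11% savings gives about 8.4x and 49.45% gives about 2.0x, in line with the reported factors, and most other rows agree to within rounding.

```c
#include <assert.h>

/* Converts a fraction of saved cycles (0 <= savings < 1) into a speedup:
 *   speedup = T_before / T_after = 1 / (1 - savings)                   */
double speedup_from_savings(double savings)
{
    assert(savings >= 0.0 && savings < 1.0);
    return 1.0 / (1.0 - savings);
}
```

The relation is strongly non-linear near 100%: the last few points of cycle savings account for most of the speedup, which is why small differences in savings between columns can show up as large differences in the factor.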


Effects of ILP Techniques

Major improvements:
  Cumulative speedups between 1.7x and 6.3x

Register file pressure not significantly modified

Hardware complexity and Thw increased
  Area is typically below 2-3x that of a 32-bit multiplier, almost never >10x

Accurate bitwidth analysis and arithmetic optimisations bring limited or no advantage
  Baseline advantage already very large


Arithmetic Optimisation Impact

mpeg2decode basic block #7 Tsw Thw

bb

Without arithmetic transformations 55 5 25,344,000 30.6%

With arithmetic transformations 55 3 26,357,760 31.4%

[Figures: without optimisation; with optimisation]


Conclusions

DFG-level specialisation opens potential speedups (2–3x) at low cost (hardware and toolset) and low risk
  Larger number of AFU write ports (2-3) needed
  Hardcoding of constants not essential
  AFU memory interfaces also not essential
  ILP techniques help, as expected
  Sophisticated and detailed techniques (bitwidth analysis, arithmetic optimisations) sometimes masked by other effects


Ongoing Work

Measure advantages through a complete toolchain (notably, compiler):
  DSP microarchitecture:
    Validate simple model
    Find out bottlenecks and impose real DSP constraints (e.g., non-orthogonality)
  VLIW microarchitecture:
    Go beyond simple software execution model

Develop novel speedup-driven identification algorithms

How to get more AFU specialisation potential
  Dynamic identification and configuration of AFUs



Typical Identification Algorithms

Bottom-up greedy approaches to cluster instructions

Topologically-driven rather than speedup-driven

E.g., MaxMISO identification [Alippi et al., 1999]:

[Figure: DFG of * and + nodes partitioned into three MaxMISOs, labelled MS1, MS2 and MS3]
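To make the idea of a maximal multiple-input single-output subgraph concrete, here is a small sketch of one common way to partition a DAG into MaxMISOs. It is not taken from the cited paper: the adjacency-matrix representation, the requirement that nodes be numbered in topological order, and the name `max_misos` are all assumptions of this example. A node joins the subgraph of its consumers only when every consumer already belongs to the same subgraph; otherwise its value would be a second output, so it starts a new one.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_NODES 32

/* succ[u][v] is true when node v consumes the result of node u.
 * Assumption: nodes are numbered in topological order (edges go from a
 * lower to a higher index). On return, miso[v] holds the root (output
 * node) of the MaxMISO containing v. Returns the number of MaxMISOs.   */
int max_misos(int n, bool succ[MAX_NODES][MAX_NODES], int miso[MAX_NODES])
{
    assert(n >= 0 && n <= MAX_NODES);
    int count = 0;
    for (int v = n - 1; v >= 0; v--) {      /* consumers before producers */
        int root = -1;
        bool same_root = true;
        for (int w = v + 1; w < n; w++) {
            if (!succ[v][w])
                continue;
            if (root == -1)
                root = miso[w];              /* first consumer seen       */
            else if (miso[w] != root)
                same_root = false;           /* fan-out to another MISO   */
        }
        if (root != -1 && same_root) {
            miso[v] = root;                  /* absorbed by its consumers */
        } else {
            miso[v] = v;                     /* v is itself an output     */
            count++;
        }
    }
    return count;
}
```

The partition depends only on the graph topology, not on how much time each cluster would save, which is the limitation the slide points out.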


Speedup-driven Identification

Prune-out optimal set of low-speedup nodes to achieve the required input/output count

[Figure: example DFG with inputs i0 to i7, outputs o0 and o1, a node k, and per-node speedup estimates ranging from 0.1 to 3]

SIMD-like and unconnected graphs


Open Issues and Perspectives

Power consumption advantages?
  Power down because:
    Fewer instruction fetches and decodes
    Fewer register reads and writebacks
  Power possibly up because:
    Reduced correlation of signals in the AFU
    Low efficiency of the implementation (in case of eFPGAs)

More opportunities to increase speedup?
  Detect and implement LUTs (e.g., in quantisers) as discrete CAMs
  Detect runtime constant values


Dynamic Specialisation?

[Figure: Java Bytecode; JiT + Specialisation; ARM + RFU]

Dynamic compilation and optimisation together with hardware specialisation
  DAISY, Crusoe, JiT, etc.
  Specialisation may profit from runtime information
  Identification in runtime conditions
  Dynamic reconfigurability challenge


Conclusions

Processor customisation opportunities are here: soft cores, FPGA processors, etc.

Very specific field of hardware/software codesign with a very large potential
  Do not give up versatility
  Get most of the performance of custom hardware

Needs automation, to complement compilers and synthesizers (some work exists but limited in scope)
