2431 SocD 08 Optimization Hw es08 - TUT · dsf full custom ASIC ... E.g. floating point DCT 200...

Erno Salminen - Oct. 2007

TKT-2431 SoC Design

Lec 8 – Optimization

Erno Salminen

Department of Computer Systems, Tampere University of Technology

Fall 2008


Copyright notice

Part of the slides adapted from a slide set by Alberto Sangiovanni-Vincentelli, course EE249 at University of California, Berkeley
http://www-cad.eecs.berkeley.edu/~polis/class/lectures.shtml

Part of the figures from: J. Heikkinen, J. Sertamo, T. Rautiainen and J. Takala, "Design of Transport Triggered Architecture Processor for Discrete Cosine Transform", in Proc. 15th Ann. IEEE Int. ASIC/SOC Conf., Rochester, NY, U.S.A., Sept. 25-28 2002, pp. 87-91


At first

Make sure that simple things work before even trying more complex ones


Outline

Determine bottlenecks - Amdahl’s law
Methods
  Architectural choices
  Algorithm modifications, assembly coding
  Custom processors
  HW accelerators


Foreword

”Premature optimization is the root of all evil”
  Donald Knuth [quoting Hoare]

Sutter, Alexandrescu
  1st rule: Don’t optimize
  2nd rule (for experts only): Don’t do it yet. Measure twice, optimize once.

Focus on making code as clear and readable as possible
  Optimizations make design and code more complex
  Optimize only when a performance bottleneck has been proven (and identified)


System bottlenecks (1)

[H. Meyr, Application Specific Instruction-Set Processors for Wireless Communications, Tampere SoC, Nov. 2004]

[Berkeley Design Technology Inc., Alternatives to DSPs: What and Why?, Tampere SoC, Nov. 2003]

Determine what’s taking time
  Or area, power, memory


System bottlenecks (2)

Concentrate optimization on bottlenecks
  No use optimizing a part that takes only a small fraction, say 3%, of the execution time

Trivial Matlab example
  Removed one unnecessary #include from m files
  12x speedup
  Locating the bottleneck took a few hours
  Fixing the bottleneck took 1 minute

System may be refined into smaller blocks to locate the bottlenecks in logic area or propagation delay
  Otherwise, it is difficult to determine the relation between HDL source line and schematic


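Amdahl’s Law

$$\mathrm{ExTime}_{new} = \mathrm{ExTime}_{old} \times \left[ (1 - \mathrm{Fraction}_{enhanced}) + \frac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}} \right]$$

$$\mathrm{Speedup}_{overall} = \frac{\mathrm{ExTime}_{old}}{\mathrm{ExTime}_{new}} = \frac{1}{(1 - \mathrm{Fraction}_{enhanced}) + \dfrac{\mathrm{Fraction}_{enhanced}}{\mathrm{Speedup}_{enhanced}}}$$

(Figure: execution time before and after the enhancement.)

[H. Corporaal, course material Adv. Computer architectures, Univ. Delft, 2001]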

Note: very important!


Amdahl’s Law Example

Floating point instructions improved to run 2X; but only 10% of actual instructions are FP

$$\mathrm{ExTime}_{new} = \mathrm{ExTime}_{old} \times (0.9 + 0.1/2) = 0.95 \times \mathrm{ExTime}_{old}$$

$$\mathrm{Speedup}_{overall} = \frac{1}{0.95} = 1.053$$

$$\text{Max. } \mathrm{Speedup}_{overall} = \frac{1}{1 - \mathrm{Fraction}_{enhanced}}$$
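The example above can be reproduced with a few lines of Python. This is only a sketch; the function names `amdahl_speedup` and `amdahl_limit` are made up for illustration and do not come from the slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the execution time is accelerated."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be within [0, 1]")
    if speedup_enhanced <= 0.0:
        raise ValueError("speedup_enhanced must be positive")
    # New execution time relative to the old one (old time normalized to 1.0)
    new_time = (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced
    return 1.0 / new_time


def amdahl_limit(fraction_enhanced: float) -> float:
    """Upper bound on the speedup when the enhanced part takes no time at all."""
    if fraction_enhanced >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - fraction_enhanced)


if __name__ == "__main__":
    # Slide example: FP instructions run 2x faster, FP is 10% of the work
    print(f"speedup: {amdahl_speedup(0.10, 2.0):.3f}")
    print(f"upper bound: {amdahl_limit(0.10):.3f}")
```

Even an infinitely fast FP unit could not push the overall speedup past the bound returned by `amdahl_limit`, which is why the bottleneck has to be identified first.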


Architectural choices


Architectural choices

(Figure: log flexibility versus log efficiency (increasing speed, decreasing power and area). Labeled architectures: general purpose microprocessor (micro-processor, data+instr mem); SW programmable DSP (micro-processor, MAC, addr gen, data+instr mem); hardware reconfigurable processor (micro-processor, co-procs, data+instr mem); FPGA; direct mapped HW (std. cell ASIC, full custom ASIC). Dream solution: exists only in marketing material...)

Heinrich Meyr, Future Wireless Communication Systems…, VTC, 2005. (Figure data by T. Noll, RWTH Aachen)
http://www.ieeevtc.org/vtc2005spring/presentations/2020_presentations/HMeyr.pdf

(Figure labels: General-purpose CPU; DSP; FPGA, ASIP; std-cell ASIC; full custom ASIC)


Architectural choices (2)

Area efficiency [Mpixels/s/mm2] and energy efficiency [Mpixels/s/W] of comparable MPEG-4 encoder implementations (the bigger, the better). Values include RAM.

(Figure: implementations plotted by area and energy efficiency; the "dream solution" is marked.)

[O. Silven and K. Jyrkkä, Observations on Power-Efficiency Trends in Mobile Communication Devices, EURASIP Journal on Embedded Systems, Vol 2007, Article ID 56976, 10 pages, 2007.]


ASIC versus PLD/FPGA Design Starts

(Figure: two bar charts for 2001-2004. ASIC design starts, y-axis 0-6000; PLD/FPGA design starts, y-axis 0-600000. Source: Gartner Group)

“ASIC design starts will decline 12.3 percent to 4,345 this year following the precipitous 36 percent drop in design starts in 2001”
(B. Lewis, Gartner Dataquest, 10/28/02)

PLD/FPGAs are becoming more and more the driving force in microelectronics technology, CAD tools and System-on-Chip design.


Algorithmic modifications, assembly language


Algorithm manipulation

Accelerated function should give results identical to the original
  Additional conversion functions may destroy all speedup
Do not perform over-accurate calculation
  Single/double prec. floating-point vs. fixed point
  SW emulation of floating point operations is s-l-o-w
    E.g. floating point DCT 200 kcycles, fixed point 15 kcycles
  HW FPUs are big: ~5.7 mm2 @0.35 um [Brunelli, TreSoc04], ~120 kgates (compare to RISC core ~50 kgates)
  Fixed point is less accurate
Word width optimization
  Especially on HW
  On CPU, smallest is not necessarily fastest
    Using type char may require additional shift/AND/OR instructions
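The trade-off between fixed point and floating point can be illustrated with a small sketch. The Q15 format (16-bit word, 15 fractional bits) and the helper names `to_fixed`, `fixed_mul` and `to_float` are assumptions chosen for this example; the slides do not prescribe a format.

```python
Q = 15              # fractional bits (Q15 is an assumption of this sketch)
SCALE = 1 << Q      # 32768


def to_fixed(x: float) -> int:
    """Quantize x to a signed 16-bit Q15 integer, saturating at the range limits."""
    value = int(round(x * SCALE))
    return max(-SCALE, min(SCALE - 1, value))


def fixed_mul(a: int, b: int) -> int:
    """Multiply two Q15 values using integer operations only; result is Q15."""
    # The product has 30 fractional bits; add half an LSB and shift back to 15
    return (a * b + (1 << (Q - 1))) >> Q


def to_float(a: int) -> float:
    """Convert a Q15 integer back to a float for comparison."""
    return a / SCALE


if __name__ == "__main__":
    x, y = 0.3, 0.7
    approx = to_float(fixed_mul(to_fixed(x), to_fixed(y)))
    print(f"float: {x * y:.6f}  fixed: {approx:.6f}  error: {abs(x * y - approx):.2e}")
```

The multiplication needs only an integer multiply, an add and a shift, which is why it is cheap on a CPU without an FPU, but the result carries a small quantization error.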


Example: Sorting

Simplest algorithms have O(n^2) execution time
More complex O(n log n)
  Require recursion, advanced data structures, and multiple arrays
    Recursion may lead to stack overflow
    Multiple arrays require big memory

(Figure: comparison of bubble, selection, insertion, shell, heap, merge and quick sort.)
Fig: http://linux.wku.edu/~lamonml/algor/sort/sort.html
P.S. Avoid light-colored lines (e.g. yellow) in plots; use markers.
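The difference between the two complexity classes can be seen by counting comparisons. The sketch below uses insertion sort as the O(n^2) representative and a recursive merge sort as the O(n log n) one; both implementations are written for this example and are not taken from the linked page.

```python
import random


def insertion_sort(values):
    """In-place style O(n^2) sort; returns (sorted list, number of comparisons)."""
    data = list(values)
    comparisons = 0
    for i in range(1, len(data)):
        key = data[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if data[j] <= key:
                break
            data[j + 1] = data[j]   # shift larger element one slot right
            j -= 1
        data[j + 1] = key
    return data, comparisons


def merge_sort(values):
    """Recursive O(n log n) sort that needs extra arrays for merging."""
    data = list(values)
    if len(data) <= 1:
        return data, 0
    mid = len(data) // 2
    left, c_left = merge_sort(data[:mid])
    right, c_right = merge_sort(data[mid:])
    merged, comparisons = [], c_left + c_right
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


if __name__ == "__main__":
    rng = random.Random(1)
    items = [rng.randint(0, 10_000) for _ in range(500)]
    print("insertion:", insertion_sort(items)[1], "merge:", merge_sort(items)[1])
```

Note how the merge sort pays for its lower comparison count with recursion and temporary lists, which are exactly the costs the slide warns about on memory-constrained targets.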


Algorithm: Sacrificing quality

[Ramchan Woo, Tampere Soc, Nov. 2004]

Decrease data width


Assembly coding (1)

Try assembly only if everything else fails
  Keep also the high-level language (HLL) version to allow portability and reuse
Sometimes required with special instructions
  Such as interrupt handling, MMX, processor mode (user/supervisor)
Speedup with RISC processors not that great
  Usually only one execution unit
  (Few) instructions, simple addressing
  Decent compilers available


Assembly coding (2)

DSPs most likely benefit from assembly
  Tight loops
  Complex micro-architecture is difficult for compiler
“Latest Compilers fall short of hand-optimized performance substantially even for DSP Kernels”
  [Naji S. Ghazal et al., Retargetable Estimation for DSP, Architecture Selection, Tampere Soc, Nov. 1999]


Optimization impact

RISC = estimated number of required basic ”RISC” operations
fm = fitting coefficient = measured_cycles / estim_RISC_ops
N.O. = no optimization
H.O. = hand optimized

[O. Lehtoranta, PhD Thesis, TUT 2006]


Assembly example: vector copy, B[] = A[]

First version
  start_copy:
    ld   r1, [r2]      // r2 is src addr, A[i]
    st   [r3], r1      // r3 is dst addr, B[i]
    inc  r2
    inc  r3
    dec  r4            // r4 is data amount, one data copied
    cmp  r4, 0         // is enough copied?
    bneq start_copy    // loop back if needed

  Load causes a pipeline stall if the next instruction depends on the loaded value

Second version
    ld   r1, [r2]
    inc  r2
    st   [r3], r1
    and so on ...

  Increment does not depend on r1, so the stall is avoided
  Load could also be performed just before the branch
    Load delay then happens during the pipeline stall
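The effect of reordering can be checked with a toy hazard counter. This is a simplified model under stated assumptions: an in-order pipeline where an instruction reading a register within `load_delay` slots after the load that writes it must wait; the tuple encoding `(opcode, destination, sources)` and the function name are invented for this sketch, and only one loop iteration is examined.

```python
def count_load_use_stalls(program, load_delay=1):
    """Count stall cycles caused by instructions that read a just-loaded register."""
    stalls = 0
    for idx, (opcode, dest, _sources) in enumerate(program):
        if opcode != "ld":
            continue
        for distance in range(1, load_delay + 1):
            nxt = idx + distance
            if nxt < len(program) and dest in program[nxt][2]:
                # The consumer must wait until the loaded value is available
                stalls += load_delay - distance + 1
                break
    return stalls


# One loop iteration of each version from the slide
FIRST_VERSION = [
    ("ld", "r1", ("r2",)),
    ("st", None, ("r3", "r1")),   # uses r1 right after the load
    ("inc", "r2", ("r2",)),
    ("inc", "r3", ("r3",)),
    ("dec", "r4", ("r4",)),
    ("cmp", None, ("r4",)),
    ("bneq", None, ()),
]
SECOND_VERSION = [
    ("ld", "r1", ("r2",)),
    ("inc", "r2", ("r2",)),       # independent of r1, fills the load delay
    ("st", None, ("r3", "r1")),
    ("inc", "r3", ("r3",)),
    ("dec", "r4", ("r4",)),
    ("cmp", None, ("r4",)),
    ("bneq", None, ()),
]

if __name__ == "__main__":
    print("first:", count_load_use_stalls(FIRST_VERSION))
    print("second:", count_load_use_stalls(SECOND_VERSION))
```

A real pipeline has more hazard types (branches, structural conflicts), so the counts only illustrate the load-use case discussed on the slide.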


Assembly example: delayed branch

Addr  Instruction
a1    i1: MR=MR+MX0*MY0 (SS);
a2    i2: IF COND JUMP aa1;
a3    i3
a4    i4
a5    i5
a6    i6
a7    i7
...   ...
aa1   ii1

Fig 2. ’Normal’ branch: four-cycle stall
Fig 3. Delayed branch: two-cycle stall
  Two instr. (i3 + i4) following the branch are also executed

[http://www.analog.com/UploadedFiles/Application_Notes/587795865ee_123.pdf]


Custom processors (ASIPs)


Custom processors

Allow using C/C++ compilation
ASIP = Application Specific Instruction set Processor
Extend CPU with application (domain) specific instructions
  MAC, sum with clipping, DCT etc.
  Extension tightly coupled with CPU pipeline
Optimize internal communication within CPU
Remove unnecessary instructions
Otherwise configure CPU (num of registers, data width...)


Custom processor performance (1)

Tensilica Xtensa
Kernel speed-up 6x – 100x
  Depends heavily on application
Base CPU ~20 000 gates
  HW overhead 20% - 150%

[Monica Lam, Compiler Technology for Configurable Processors, Tampere SoC, Nov. 2001.]

Disclaimer: heavy marketing content


Custom processor performance (2)

[Yasmin Oz et al., Galois Field Instruction Set Accelerator in the StarCore SC140 DSP, Tampere SoC, Nov. 2001.]

(Figure: Reed-Solomon decoding cycle count, runtime on SC140 and GFISA. Speedup = t(SC140) / t(GFISA): 22.1, 14.5, 6.3, 1.0)

SC140 = original Star Core DSP
GFISA = special instructions for Galois field operations added
  HW overhead ~10%
Special ISA does not help every algorithm!


Custom processor performance (3)

Beneficial also for energy
  Note: E = P * t

(Figure annotations: 6.1x speedup, 8.0x speedup)

[H. Meyr, Application Specific Instruction-Set Processors for Wireless Communications, TreSoC 2004]


Transport Triggered Architecture (TTA)

Application-specific processor
  More flexible than HW
  Still allows programmability
  Almost the same performance as ASIC
MOVE design framework allows (semi)automatic exploration
  Number of execution units
  Connections between units
  Many trade-offs between area and performance
  Many proposed custom CPUs use manual exploration
Resembles VLIW
  Everything scheduled at compile-time
Designer gives C code and restrictions to exploration tool
  Tools generate synthesizable VHDL


TTA (2)

C compiler automatically configured to new micro-architecture
  Distinguishing factor compared to many CPUs
One instruction: move, e.g. ”add r2, r2, r3” becomes:
    move reg[2] -> ALU.op1
    move reg[3] -> ALU.trig
    move ALU.result -> reg_file[2]
TTA allows more freedom in code scheduling than traditional CPUs
  But suffers from larger code size
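The three moves listed above can be traced with a toy interpreter. This is only a sketch of the transport-triggered idea: the class `AddUnit`, the port names and the `(source, destination)` move encoding are assumptions made for the example and are not part of the MOVE framework.

```python
class AddUnit:
    """Adder function unit: writing the trigger port starts the operation."""

    def __init__(self):
        self.op1 = 0
        self.result = 0

    def write(self, port, value):
        if port == "op1":
            self.op1 = value                 # plain operand, no side effect
        elif port == "trig":
            self.result = self.op1 + value   # the transport triggers the add
        else:
            raise KeyError(f"unknown port {port}")


def run_moves(moves, regs):
    """Execute a list of (source, destination) data transports."""
    alu = AddUnit()
    for src, dst in moves:
        value = alu.result if src == "ALU.result" else regs[src]
        if dst.startswith("ALU."):
            alu.write(dst.split(".", 1)[1], value)
        else:
            regs[dst] = value
    return regs


if __name__ == "__main__":
    program = [("r2", "ALU.op1"), ("r3", "ALU.trig"), ("ALU.result", "r2")]
    print(run_moves(program, {"r2": 5, "r3": 7}))
```

Because the program consists only of transports, the compiler is free to reorder or overlap them, at the cost of needing several moves where a conventional CPU needs one instruction.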


TTA (3)

Better area and performance than general purpose RISC
Special function unit (SFU)
  added manually
  increases area
  decreases ex.time
For certain algorithms, the same cycle counts as ASIC may be achieved
  ASIC has higher clock frequency
Currently, developed also at TUT
  Interested students may do project work on TTA


Area vs. runtime trade-off

TTA’s cycle count smaller than RISC, close to ASIC
TTA’s area between ASIC and RISC
ASIC has highest frequency

(Figures: RC4 exploration, memory excluded) [Hämäläinen, Euromicro DSD, 2005]


HW accelerators


HW accelerators (1)

Favor: highest performance, smallest area and power
Against: longest design time, narrow application domain
Do not require code memory like programmable processors (CPU, ASIP, DSP)
Example: 8x8 DCT

# | Type             | um   | Cycles | Area                              | Speedup (in cycle count) | Freq [MHz] | Max perf [blocks/s] | Perf/area [blocks/s / gates]
A | RISC (ARM9)      | 0.18 | 2660   | 190 kilogates + mem               | 1.0                      | 160        | 60 M                | 0.32
B | ASIP (TTA+SFU)   | 0.13 | 538    | 56 kilogates + 34 kilogates mem   | 4.9                      | 250        | 464 M               | 5.16
C | HW (by student)  | 0.18 | 250    | 44 kilogates                      | 10.6                     | 182        | 728 M               | 16.55
D | HW (by PhD)      | 0.11 | 94     | 39 kilogates + control logic      | 29.3                     | 253        | 2691 M              | 69.01

D: [J. Nikara, Application-Specific Parallel Structures for Discrete Cosine Transform and Variable Length Decoding, PhD thesis, TUT, June 2004]
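The last two columns of the table follow from the others: max perf = frequency / cycles per block, and perf/area = max perf / gate count. The sketch below recomputes them. Two things are assumptions: row D's fused value "9439" is read as 94 cycles and 39 kilogates, and row B's area is taken as 56 + 34 kilogates. The function names are invented for the example.

```python
# (frequency [MHz], cycles per 8x8 block, area [kilogates]) per table row
ROWS = {
    "A RISC (ARM9)": (160, 2660, 190),
    "B ASIP (TTA+SFU)": (250, 538, 56 + 34),   # logic + memory (assumption)
    "C HW (by student)": (182, 250, 44),
    "D HW (by PhD)": (253, 94, 39),            # "9439" read as 94 and 39 (assumption)
}


def max_perf_blocks_per_s(freq_mhz: float, cycles: int) -> float:
    """Blocks per second when one block takes `cycles` clock cycles."""
    return freq_mhz * 1e6 / cycles


def perf_per_gate(freq_mhz: float, cycles: int, area_kgates: float) -> float:
    """Throughput normalized by gate count [blocks/s per gate]."""
    return max_perf_blocks_per_s(freq_mhz, cycles) / (area_kgates * 1e3)


if __name__ == "__main__":
    for name, (freq, cycles, area) in ROWS.items():
        perf = max_perf_blocks_per_s(freq, cycles)
        print(f"{name:18s} {perf / 1e3:8.0f} kblocks/s  {perf_per_gate(freq, cycles, area):6.2f}")
```

Under this reading the perf/area column is reproduced for all four rows, and the computed throughputs come out in thousands of blocks per second; how the "M" in the max perf column should be interpreted is an inference from that arithmetic and is not stated in the source.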


HW accelerators (2)

Regular, data-flow type functions most suitable for HW
Communication between CPU and HW critical
  Delay, mutual exclusion, pipelining

(Figure: execution timelines for three cases: CPU only; CPU + HW v.1; CPU + HW v.2, where CPU and HW operate as a pipeline. Annotations: "4x speedup"; "CPU communication overhead reduces the overall speedup".)

HW accelerator (5)

(Figure: CPU 1 and CPU 2, each with I+D mem, and accelerators accel 1, accel 2 and accel 3, attached through network IFs to an on-chip network. Accelerators are either local, private acc. or remote, shared acc.)


HW accelerators (3)

Orig SW:
  for i=0:N loop
    load r1, [r2]
    add, sub, mul, cmp, beq, other processing
    st r1, [r3]
  end loop

  Measured SW ex.time includes loading input values and storing the results

SW + HW, straightforward polling
  start_hw()
  while (hw_ready==0) {}
  for i=0:N loop
    load r1, [r2]
  end loop

  Even if HW does processing much faster, data transfers from CPU to HW must be taken into account
  polling = busy wait

SW + HW, pipelined
  start_hw()
  other_function_x();
  while (hw_ready==0) {}
  for i=0:N loop
    load r1, [r2]
  end loop

  Function X executed in parallel with HW. Less time wasted in polling (but still polling)
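The benefit of pipelining SW and HW can be estimated with a simple time model. All names and the example numbers below are hypothetical; the model assumes that starting the HW, the HW processing, reading the results back and function X each take a fixed number of cycles.

```python
def total_time_polling(t_start, t_hw, t_read, t_other):
    """CPU starts HW, busy-waits for it, reads results, then runs function X."""
    return t_start + t_hw + t_read + t_other


def total_time_pipelined(t_start, t_hw, t_read, t_other):
    """CPU runs function X while HW works, then polls only for the remainder."""
    return t_start + max(t_hw, t_other) + t_read


def cycles_spent_polling(t_hw, t_other):
    """Busy-wait cycles left in the pipelined case (zero if X outlasts the HW)."""
    return max(0, t_hw - t_other)


if __name__ == "__main__":
    args = dict(t_start=10, t_hw=100, t_read=20, t_other=60)
    print("polling:  ", total_time_polling(**args))
    print("pipelined:", total_time_pipelined(**args))
    print("still polling for:", cycles_spent_polling(100, 60))
```

The model also shows why the transfer terms matter: if `t_start` and `t_read` dominate, a faster accelerator changes the total very little.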


HW accelerators (4)

Polling vs. interrupts
  Interrupts allow more efficient parallel execution
CPU controlled transfers vs. DMA
  CPU transfers all the data, time O(n), 7 cycles/word
    start_copy:
      ld   r1, [r2]      // r2 is src addr
      st   [r3], r1      // r3 is dst addr
      inc  r2
      inc  r3
      dec  r4            // r4 is data amount, one data copied
      cmp  r4, 0         // is enough copied?
      bneq start_copy    // loop back if needed
  CPU just inits DMA controller, time O(1), DMA 1 cycle/word
    start_dma:
      st #DMA_SRC_ADDR, r1
      st #DMA_DST_ADDR, r2
      st #DMA_AMOUNT, r4
      do_other_stuff()
      ...
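The two transfer schemes can be compared with a small cost model. The per-word figures (7 cycles for the CPU loop, 1 cycle for DMA) come from the slide; the DMA setup cost of 3 cycles (one per register store) and the function names are assumptions for this sketch.

```python
CPU_CYCLES_PER_WORD = 7   # from the slide: seven instructions per copied word
DMA_CYCLES_PER_WORD = 1   # from the slide
DMA_SETUP_CYCLES = 3      # assumption: three stores to DMA registers, 1 cycle each


def cpu_copy_busy_cycles(words: int) -> int:
    """Cycles the CPU is occupied when it copies every word itself: O(n)."""
    return CPU_CYCLES_PER_WORD * words


def dma_cpu_busy_cycles(words: int) -> int:
    """Cycles the CPU is occupied when it only programs the DMA controller: O(1)."""
    return DMA_SETUP_CYCLES


def dma_transfer_cycles(words: int) -> int:
    """Time until the DMA transfer completes, including setup."""
    return DMA_SETUP_CYCLES + DMA_CYCLES_PER_WORD * words


if __name__ == "__main__":
    for n in (1, 16, 1024):
        print(n, cpu_copy_busy_cycles(n), dma_cpu_busy_cycles(n), dma_transfer_cycles(n))
```

With these numbers DMA finishes sooner for any transfer size, and the CPU is free for other work after the setup. A real system would also have bus arbitration and completion signalling costs that this sketch ignores.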


HW optimization (1)

Reuse benefits from configurability and many parameters
  Run-time configurability is often costly
    Good for simulation-based testing
  Convert input signals into generics for synthesis
Turn unwanted features off to save area and power
  Perhaps increases the max freq also
  if enable_g = ’1’ then <code>;

Example: config memory inside bus wrapper
  2 generics
    1. we = write enable
    2. re = read enable
  optimize according to application

(Figure: configuration memory area [gates], y-axis 0-1400, versus memory type: we=0, re=0; we=0, re=1; we=1, re=0; we=1, re=1; labels rom and ram. Series: No slots, 1 slot, 2 slots.)


HW optimization (2)

Try to design HW so that propagation delay is not (linearly) dependent on data width
  Scalable solution
  Bad example: if data < 55 then data <= data+1;
  Better: if data /= 55 then data <= data+1;
Turn on boundary optimization
  Logic in different entities optimized together
  E.g. inverters can be removed
  Restricted value set in output

(Figure: block A drives block B through a 4b signal. If the output of block A uses < 16 of all possible values, the logic in block B can be optimized. Note: combinatorial outputs not recommended.)



Minimize the data width of signals
Remove unnecessary flip-flops (4-6 eq. gates each)
  i.e. those with constant output
  DC: set compile_seqmap_propagate_constants true
    Optimizes also the logic after the flip-flop
  By default, synthesis does NOT remove any registers
  All signals that are assigned in sequential process (clk, rst_n) produce a flip-flop

(Figure: flip-flops with constant output, always 1 and always 0, and the propagated constant.)



(Figure: real logic feeding a register through an unnecessary mux that selects a ”debug value”.)

Do not ’reset’ registers when value is not needed
  e.g. if valid_in = ’0’ then data_r <= (others => ’0’);
  Unnecessary input MUX
  Good for visualization in simulation though
    Easy to see when these are valid
  if dbg_enable_g = ’1’ then reg <= dbg_value;
  Validity determined according to signal empty


HW optim: Aim at ”fast enough”

Do not overoptimize HW, if performance limit is known
  100 frames/sec encoder is not better than 25 fps enc, if camera restricts the frame rate anyway
Minimizing critical path causes large area
  Requires larger drive strength for gates
  They also have higher leakage currents
Minimizing cycle count needs many parallel sub-blocks (e.g. ALUs)
Consider the integration overheads also

(Figure: area versus speed, where speed is a: [1/cycles] or b: [MHz].)


”Fast enough”: Real data

[J. Wei, C. Rowen, Implementing low-power configurable processors - practical options and tradeoffs, Proc. 42nd Design Automation Conference, 13-17 June 2005, pp. 706-711]


C/C++ based HW design

”Do not need HW designers anymore as SW designer can do everything”
  Not exactly true...
SystemC
  Good for simulation
  Many problems with synthesis currently
  HW oriented SystemC cannot be compiled as SW anymore
Catapult C by Mentor
  Promising approach
  Pure C:
    No timing
    Interfaces defined in synthesis tool
      Best idea in Catapult C
    Same description can be compiled and synthesized
  Block level design – not for large systems
    Not practical in large scale currently


C/C++ based HW design (2)

[Ramani, Haggard, Southeastern Symposium on System Theory, 2001]


Conclusion

Remember Amdahl’s law – concentrate on the appropriate parts of the system
ASIPs provide great improvements and still allow programmability
Communication between components has great impact on performance
  Use interrupts and DMA controllers
  Pipeline SW and HW
