Implementation of DSP IC
Lecture 5: Datapath Design for Processors
Computer/Processor?
Building Hardware that Computes
Computing Devices Then…
EDSAC, University of Cambridge, UK, 1949
Computing Systems Today…
• Computers & microprocessors in everything
  – Vast infrastructure behind them

[Figure: robots and other embedded computing devices]
Overview of Computer Systems

[Figure: devices connected through Gigabit Ethernet to clusters, massive clusters, and the cloud]
Computer is a State-Controlled Machine: Implementation as Combinational Logic + Latch

[Figure: "Mealy Machine" / "Moore Machine" state diagram with states Alpha/00, Beta/01, Delta/10 and input/output edge labels (0/0, 1/0, 1/1, 0/1, 0/0, 1/1); hardware view: combinational logic feeding a latch]

Transition table:
  Input | State_old | State_new | Div
    0   |    00     |    00     |  0
    0   |    01     |    10     |  0
    0   |    10     |    01     |  1
    1   |    00     |    01     |  0
    1   |    01     |    00     |  1
    1   |    10     |    10     |  1

Finite state machine
Computer is a Microprogrammed Controller
• State machine in which part of the state is a "micro-PC"
  – Explicit circuitry for incrementing or changing the PC
• Includes a ROM with "microinstructions"
  – Control logic implements at least branches and jumps

[Figure: ROM (instructions) addressed by the micro-PC; a MUX selects between PC+1 and the branch target; the instruction and branch fields drive the control outputs]

Example microprogram:
  0: forw 35 xxx
  1: b_no_obstacles 000
  2: back 10 xxx
  3: rotate 90 xxx
  4: goto 001
Instruction Execution Cycle
• Instruction Fetch – obtain the instruction from program storage
• Instruction Decode – determine the required actions and instruction size
• Operand Fetch – locate and obtain operand data
• Execute – compute the result value or status
• Result Store – deposit results in storage for later use
• Next Instruction – determine the successor instruction

[Figure: processor (registers, functional units) connected to memory (program, data) — the von Neumann bottleneck]
"Bell's Law" – a new computer class per decade

[Figure: log(people per computer) vs. year, from number crunching and data storage through productivity and interactive computing to streaming information to/from the physical world]

• Enabled by technological opportunities
• Smaller, more numerous, and more intimately connected
• Brings in a new kind of application
• Used in many ways not previously imagined
Uniprocessor Performance

[Figure: performance (vs. VAX-11/780) on a log scale from 1 to 10000, 1978–2006, with growth segments of 25%/year, 52%/year, and ??%/year]

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
Key techniques: pipelining, data locality, parallel processing
Driving Technology: Moore's Law
• "Cramming More Components onto Integrated Circuits"
  – Gordon Moore, Electronics, 1965
• The number of transistors on a cost-effective integrated circuit doubles every 18 months
CISC (Complex Instruction Set Computer)
RISC (Reduced Instruction Set Computer)
Uniprocessor Performance
• Early 1970s: mainframes and minicomputers
  – 25%~30% performance growth per year
• Late 1970s: microprocessors
  – 35% performance growth per year
• Early 1980s: Reduced Instruction Set Computer (RISC) architectures
  – 2 critical performance techniques:
    • ILP (initially through pipelining and later through multiple instruction issue)
    • Cache
  – 50% performance growth per year
• 1998~2000: relative performance growth
  – By technology alone: 1.35× per year
  – By technology + architecture, overall: 1.58× per year
Note: 1.58 ≈ 1.35 × (1 + 15%); the architecture improvement factor is about 15%
Alternative Datapaths
• CISC processor
  – Hard to pipeline
  – Some hardware units are idle most of the time
• Single-issue RISC
• Multi-issue VLIW processor
• Cascaded composite functional units for ASIP
Single-issue RISC
• Low hardware utilization
  – Only ~1/3 of the datapath is busy at a time
• Operations per cycle = 1
• High performance?
Data Dependence and Parallelism
• If 2 instructions are parallel:
  – they can be executed simultaneously in a pipeline without causing any stalls (except for structural hazards)
  – their execution order can be swapped
• If 2 instructions are dependent:
  – they must be executed in order, or only partially overlapped
• Exploiting parallelism among instructions is equivalent to determining the dependences among them
Exploit Instruction-Level Parallelism
• Two main approaches:
  – Hardware-based dynamic approaches
    • Hardware locates the parallelism at run time
    • Used in server and desktop processors (not used as extensively in PMP processors)
    • Superscalar processors: Pentium 4, IBM Power, AMD Opteron
  – Compiler-based static approaches
    • Software finds parallelism at compile time
    • Used in DSP processors (not as successful outside of scientific applications)
    • VLIW processors: Itanium 2, ITRI PAC
Compiler Techniques for Exposing ILP
• Pipeline scheduling
  – Separate a dependent instruction from the source instruction by the pipeline latency (or instruction latency) of the source instruction
• Example:
  for (i = 999; i >= 0; i = i - 1)
      x[i] = x[i] + s;
Compiled loop:
Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)
      DADDUI  R1,R1,#-8
      BNE     R1,R2,Loop

Data Dependence
Loop: L.D     F0, 0(R1)   ;F0 = array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer
BNE R1, R2, Loop ;branch R1!=R2
• The arrows show the order that must be preserved for correct execution.
• If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped.
Step 1: Insert Pipeline Stalls
Loop: L.D     F0,0(R1)
      stall
      ADD.D   F4,F0,F2
      stall
      stall
      S.D     F4,0(R1)
      DADDUI  R1,R1,#-8
      stall               ;assume integer load latency is 1
      BNE     R1,R2,Loop

9 clock cycles per iteration
Step 2: Re-Scheduling
Scheduled code:
Loop: L.D     F0,0(R1)
      DADDUI  R1,R1,#-8
      ADD.D   F4,F0,F2
      stall
      stall
      S.D     F4,8(R1)
      BNE     R1,R2,Loop

7 clock cycles per iteration
Step 3: Loop Unrolling
• Unroll by a factor of 4 (assume the number of elements is divisible by 4)
• Eliminate unnecessary instructions

Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)     ;drop DADDUI & BNE
      L.D     F6,-8(R1)
      ADD.D   F8,F6,F2
      S.D     F8,-8(R1)    ;drop DADDUI & BNE
      L.D     F10,-16(R1)
      ADD.D   F12,F10,F2
      S.D     F12,-16(R1)  ;drop DADDUI & BNE
      L.D     F14,-24(R1)
      ADD.D   F16,F14,F2
      S.D     F16,-24(R1)
      DADDUI  R1,R1,#-32
      BNE     R1,R2,Loop

Note: number of live registers vs. the original loop
Step 4: Re-Schedule the Unrolled Loop
• Pipeline-schedule the unrolled loop:

Loop: L.D     F0,0(R1)
      L.D     F6,-8(R1)
      L.D     F10,-16(R1)
      L.D     F14,-24(R1)
      ADD.D   F4,F0,F2
      ADD.D   F8,F6,F2
      ADD.D   F12,F10,F2
      ADD.D   F16,F14,F2
      S.D     F4,0(R1)
      S.D     F8,-8(R1)
      DADDUI  R1,R1,#-32
      S.D     F12,16(R1)
      S.D     F16,8(R1)
      BNE     R1,R2,Loop

14 clock cycles per 4 iterations, or 3.5 clock cycles per iteration
ILP and Data Dependences
• HW/SW must preserve program order: the order in which instructions would execute if run sequentially, as determined by the original source program
  – Dependences are a property of programs
• The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline
• Importance of data dependences:
  1) indicates the possibility of a hazard
  2) determines the order in which results must be calculated
  3) sets an upper bound on how much parallelism can possibly be exploited
• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
Unrolled Loop Detail
• We do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
  – the 1st executes (n mod k) times and has a body that is the original loop
  – the 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
3 Limits to Loop Unrolling
1. Decrease in the amount of overhead amortized with each extra unrolling
   • Amdahl's Law
2. Growth in code size
   • For larger loops, the concern is that it increases the instruction-cache miss rate
3. Register pressure (a compiler limitation): potential shortfall in registers created by aggressive unrolling and scheduling
   • If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage
• Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction
Basic VLIW (Very Long Instruction Word)
• A VLIW uses multiple, independent functional units
• A VLIW packages multiple independent operations into one very long instruction
  – The burden of choosing and packaging independent operations falls on the compiler
  – The hardware that makes issue decisions in a superscalar is unnecessary
• VLIW depends on enough parallelism to keep the FUs busy
  – Loop unrolling and then code scheduling
  – The compiler may need to do both local and global scheduling
• Here we consider a VLIW processor whose instructions contain 5 operations: 1 integer (or branch), 2 FP, and 2 memory references
  – Depends on the available FUs and the frequency of each operation
Recall: Unrolled Loop that Minimizes Stalls for Scalar
1  Loop: L.D     F0,0(R1)
2        L.D     F6,-8(R1)
3        L.D     F10,-16(R1)
4        L.D     F14,-24(R1)
5        ADD.D   F4,F0,F2
6        ADD.D   F8,F6,F2
7        ADD.D   F12,F10,F2
8        ADD.D   F16,F14,F2
9        S.D     0(R1),F4
10       S.D     -8(R1),F8
11       S.D     -16(R1),F12
12       DSUBUI  R1,R1,#32
13       BNEZ    R1,LOOP
14       S.D     8(R1),F16    ;8-32 = -24

14 clock cycles, or 3.5 per iteration
Latencies: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles
Loop Unrolling in VLIW
• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.29 clocks per iteration
• 23 ops in 9 clocks: an average of 2.5 ops per clock, 50% efficiency
• Note: VLIW needs more registers
Multi-issue VLIW Processor
• High utilization
• Operations per cycle = n for an n-way VLIW
• Poor code density
• Large chip area due to register files with ~3N R/W ports
  – For N FUs, RF area and delay increase as N^3 and N^(3/2)
• ILP is limited
ASIP: Application-Specific Instruction-Set Processor
• Complicated, composite datapath
• High hardware utilization (high OPC)
• Lower RF area due to limited R/W ports
• Suitable for specific DSP applications
• The cascading order is what matters

RF port requirements:
  Port number | Scalar | Composite FUs | VLIW
  3 FUs       | 2R/1W  | 3R/1W         | 5R/3W
  4 FUs       | 2R/1W  | 4R/1W         | 7R/4W
Floating-Point (FP) Arithmetic
• IEEE 754 single-precision FP numbers
• Pros
  – Very wide dynamic range with exponential scale
  – Automatic radix-point tracking in the "exponent" with hardware
  – Full precision of the mantissa with fractional operations
• Cons
  – Complex hardware due to dynamic normalization and alignment (i.e. shift operations)
  – Unnecessary dynamic range for most DSP applications

IEEE Standard for Binary Floating-Point Arithmetic, IEEE Standard 754, 1985
Integer Arithmetic
• Simple hardware
  – Low power, small area, fast execution time, ...
• Programmers must take care of the data ranges of intermediate variables
  – Prevent overflow
  – Maintain enough precision
• Frequent scaling & normalization help utilize the finite bits more efficiently
  – Tradeoff between quality (precision) & speed (for explicit exponent tracking & data rounding)
Proposed Static Floating-Point (SFP) Arithmetic
• An effective compromise between FP and integer arithmetic
  – Fractional operations with automatic rounding, as in FP arithmetic
  – Static exponent tracking based on worst-case range estimation, as in integer arithmetic
  – Static normalization and alignment depending on the exponent
• Better precision through efficient data-bit utilization
  – No exponent attached to the data
  – Automatic rounding with fractional operations
  – More frequent scaling and normalization operations
Design Flow
• Computation kernels represented as an SDFG (synchronous dataflow graph)
• Range estimation of each variable (i.e. an edge of the SDFG) by PEV (peak estimation vector) analysis
• Shift insertion according to the PEV analysis
PEV Analysis
• A PEV (peak estimation vector) records the worst-case dynamic range together with its exponent
• Each edge in a floating-point SDFG can be associated with a PEV [M r]
  – "M" is the maximum magnitude (cf. worst-case mantissa)
  – "r" is the position of the radix point (cf. the corresponding exponent)
• PEV calculation rules
  – "r" should be identical before adding or subtracting
  – [M1 r1] × [M2 r2] = [M1 × M2, r1 + r2]
  – M and r are related: M is divided (multiplied) by 2 when r is decreased (increased) by 1
  – M should be kept in the range 0.5~1 for maximum bit utilization
Example 1
• Align the radix point (r) before summation
• Keep the maximum magnitude (M) between 0.5 & 1 to prevent overflow & maximize precision

[Figure: dataflow example with PEV annotations — e.g. [0.5 -1] + [1 -1] = [1.5 -1], normalized to [0.75 -2]; [0.8 0] × [0.6 -1] = [0.48 -1], normalized to [0.96 0]; input [1 0]]
Example 2
• Calculation rules:
  – Align the radix point (r) before summation
  – Keep the maximum magnitude (M) between 0.5 & 1 to prevent overflow & maximize precision
• Example: 4-point DCT

[Figure: 4-point DCT dataflow graph — inputs I0~I3 (PEV [1 0]) pass through butterfly add/subtract stages whose PEVs are normalized (e.g. [2 0] → [1 -1], [2 -1] → [1 -2]) and multiplications by the cosine constants C1 = 0.9238, C2 = 0.3827, C3 = 0.7071, with "Normalize" and "Align" shifts annotated along the edges (PEVs such as [0.92 -1], [0.38 -1], [0.74 0], [0.71 -2], [1.31 -1], [0.65 -2]), producing outputs O0~O3]
FP to SFP Conversion
• Insert shift operations according to the PEV analysis
• 4-point DCT example

[Figure: the same 4-point DCT graph (inputs I0~I3, constants C1 = 0.9238, C2 = 0.3827, C3 = 0.7071, outputs O0~O3) with the static shift operations inserted]
Proposed Static Floating-Point (SFP) Arithmetic
• For N-bit data samples:
  – (N+1)-bit adder/subtractor with input aligners & an output normalizer (in our embodiment, all are 1-bit right shifters)
  – N-bit fractional multiplier with a 1-bit output normalizer (left shifter)
  – N-bit barrel shifter with arithmetic right shifts

[Figure: datapath with sign-bit, radix-point, and fraction fields; >>1 input aligners and a >>1 output normalizer around the (N+1)-bit adder, a <<1 normalizer after the N-bit fractional multiplier, and a >>N barrel shifter]
Static FP Arithmetic: Concluding Remarks
• An effective compromise between FP and integer units
  – Fractional operations with automatic rounding, as in an FPU
  – Shrunk (1-bit) aligners & normalizers compared with an FPU
  – Static exponent (radix-point) tracking, as in integer units
  – Almost FP quality (precision) with hardware nearly as simple (and thus as fast, small, & low-power) as integer units
• SFP arithmetic utilizes the bits more efficiently for precision
  – No exponent (more free bits for precision)
  – Automatic rounding with fractional multiplications & more frequent normalization than integer units to reduce leading sign bits
Example: Linear-Phase FIR
• A linear-phase FIR filter has an approximately constant frequency-response magnitude and linear phase (constant group delay) in the pass-band
• An N-tap direct-form filter needs N multipliers and N-1 adders
• Exploiting substructure sharing reduces the area to:
  – (N+1)/2 multipliers and N-1 adders, if N is odd
  – N/2 multipliers and N-1 adders, if N is even
Cascaded Datapath for the LPFIR Filter
• 16-bit full-precision Add-Multiply-Accumulate (A-M-Acc) datapath
• 16-bit post-truncation (or direct-truncation) A-M-Acc datapath
Final Project
• Please design a cascaded, SFP datapath for the LPFIR filter