Implementation of DSP IC
Lecture 5: Datapath Design for Processors
Computer/Processor?
Building Hardware that Computes
Computing Devices Then…
EDSAC, University of Cambridge, UK, 1949
Computing Systems Today…
• Computers & microprocessors in everything
  – Vast infrastructure behind them

[Figure: robots and other embedded computing devices]
Overview of Computer Systems

[Figure: devices connected through Gigabit Ethernet to clusters, massive clusters, and the cloud]
Computer is a State-Controlled Machine: Implementation as Combinational Logic + Latch

[Figure: "Mealy Machine" / "Moore Machine" state diagram with states Alpha/00, Beta/01, Delta/10 and input/output edge labels (0/0, 1/0, 1/1, 0/1, 0/0, 1/1); hardware view: combinational logic feeding a latch]

Transition table:
  Input | State_old | State_new | Div
    0   |    00     |    00     |  0
    0   |    01     |    10     |  0
    0   |    10     |    01     |  1
    1   |    00     |    01     |  0
    1   |    01     |    00     |  1
    1   |    10     |    10     |  1

Finite state machine
Computer is a Microprogrammed Controller
• State machine in which part of the state is a "micro-PC"
  – Explicit circuitry for incrementing or changing the PC
• Includes a ROM with "microinstructions"
  – Control logic implements at least branches and jumps

[Figure: ROM (instructions) addressed by the micro-PC; a MUX selects between PC+1 and the branch target; the instruction and branch fields drive the control outputs]

Example microprogram:
  0: forw 35 xxx
  1: b_no_obstacles 000
  2: back 10 xxx
  3: rotate 90 xxx
  4: goto 001
Instruction Execution Cycle
• Instruction Fetch – obtain the instruction from program storage
• Instruction Decode – determine the required actions and instruction size
• Operand Fetch – locate and obtain operand data
• Execute – compute the result value or status
• Result Store – deposit results in storage for later use
• Next Instruction – determine the successor instruction

[Figure: processor (registers, functional units) connected to memory (program, data) — the von Neumann bottleneck]
"Bell's Law" – a new computer class per decade

[Figure: log(people per computer) vs. year, from number crunching and data storage through productivity and interactive computing to streaming information to/from the physical world]

• Enabled by technological opportunities
• Smaller, more numerous, and more intimately connected
• Brings in a new kind of application
• Used in many ways not previously imagined
Uniprocessor Performance

[Figure: performance (vs. VAX-11/780) on a log scale from 1 to 10000, 1978–2006, with growth segments of 25%/year, 52%/year, and ??%/year]

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, October 2006
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
Key techniques: pipelining, data locality, parallel processing
Driving Technology: Moore's Law
• "Cramming More Components onto Integrated Circuits"
  – Gordon Moore, Electronics, 1965
• The number of transistors on a cost-effective integrated circuit doubles every 18 months
CISC (Complex Instruction Set Computer)
RISC (Reduced Instruction Set Computer)
Uniprocessor Performance
• Early 1970s: mainframes and minicomputers
  – 25%~30% performance growth per year
• Late 1970s: microprocessors
  – 35% performance growth per year
• Early 1980s: Reduced Instruction Set Computer (RISC) architectures
  – 2 critical performance techniques:
    • ILP (initially through pipelining and later through multiple instruction issue)
    • Cache
  – 50% performance growth per year
• 1998~2000: relative performance growth
  – By technology alone: 1.35× per year
  – By technology + architecture, overall: 1.58× per year
Note: 1.58 ≈ 1.35 × (1 + 15%); the architecture improvement factor is about 15%
Alternative Datapaths
• CISC processor
  – Hard to pipeline
  – Some hardware units are idle most of the time
• Single-issue RISC
• Multi-issue VLIW processor
• Cascaded composite functional units for ASIP
Single-issue RISC
• Low hardware utilization
  – Only ~1/3 of the datapath is busy at a time
• Operations per cycle = 1
• High performance?
Data Dependence and Parallelism
• If 2 instructions are parallel:
  – they can be executed simultaneously in a pipeline without causing any stalls (except for structural hazards)
  – their execution order can be swapped
• If 2 instructions are dependent:
  – they must be executed in order, or only partially overlapped
• Exploiting parallelism among instructions is equivalent to determining the dependences among them
Exploit Instruction-Level Parallelism
• Two main approaches:
  – Hardware-based dynamic approaches
    • Hardware locates the parallelism at run time
    • Used in server and desktop processors (not used as extensively in PMP processors)
    • Superscalar processors: Pentium 4, IBM Power, AMD Opteron
  – Compiler-based static approaches
    • Software finds parallelism at compile time
    • Used in DSP processors (not as successful outside of scientific applications)
    • VLIW processors: Itanium 2, ITRI PAC
Compiler Techniques for Exposing ILP
• Pipeline scheduling
  – Separate a dependent instruction from the source instruction by the pipeline latency (or instruction latency) of the source instruction
• Example:
  for (i = 999; i >= 0; i = i - 1)
      x[i] = x[i] + s;
Compiled loop:
Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)
      DADDUI  R1,R1,#-8
      BNE     R1,R2,Loop

Data Dependence
Loop: L.D     F0, 0(R1)   ;F0 = array element
ADD.D F4, F0, F2 ;add scalar in F2
S.D F4, 0(R1) ;store result
DADDUI R1, R1, #-8 ;decrement pointer
BNE R1, R2, Loop ;branch R1!=R2
• The arrows show the order that must be preserved for correct execution.
• If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped.
Step 1: Insert Pipeline Stalls
Loop: L.D     F0,0(R1)
      stall
      ADD.D   F4,F0,F2
      stall
      stall
      S.D     F4,0(R1)
      DADDUI  R1,R1,#-8
      stall               ;assume integer load latency is 1
      BNE     R1,R2,Loop

9 clock cycles per iteration
Step 2: Re-Scheduling
Scheduled code:
Loop: L.D     F0,0(R1)
      DADDUI  R1,R1,#-8
      ADD.D   F4,F0,F2
      stall
      stall
      S.D     F4,8(R1)
      BNE     R1,R2,Loop

7 clock cycles per iteration
Step 3: Loop Unrolling
• Unroll by a factor of 4 (assume the number of elements is divisible by 4)
• Eliminate unnecessary instructions

Loop: L.D     F0,0(R1)
      ADD.D   F4,F0,F2
      S.D     F4,0(R1)     ;drop DADDUI & BNE
      L.D     F6,-8(R1)
      ADD.D   F8,F6,F2
      S.D     F8,-8(R1)    ;drop DADDUI & BNE
      L.D     F10,-16(R1)
      ADD.D   F12,F10,F2
      S.D     F12,-16(R1)  ;drop DADDUI & BNE
      L.D     F14,-24(R1)
      ADD.D   F16,F14,F2
      S.D     F16,-24(R1)
      DADDUI  R1,R1,#-32
      BNE     R1,R2,Loop

Note: number of live registers vs. the original loop
Step 4: Re-Schedule the Unrolled Loop
• Pipeline-schedule the unrolled loop:

Loop: L.D     F0,0(R1)
      L.D     F6,-8(R1)
      L.D     F10,-16(R1)
      L.D     F14,-24(R1)
      ADD.D   F4,F0,F2
      ADD.D   F8,F6,F2
      ADD.D   F12,F10,F2
      ADD.D   F16,F14,F2
      S.D     F4,0(R1)
      S.D     F8,-8(R1)
      DADDUI  R1,R1,#-32
      S.D     F12,16(R1)
      S.D     F16,8(R1)
      BNE     R1,R2,Loop

14 clock cycles per 4 iterations, or 3.5 clock cycles per iteration
ILP and Data Dependences
• HW/SW must preserve program order: the order in which instructions would execute if run sequentially, as determined by the original source program
  – Dependences are a property of programs
• The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline
• Importance of data dependences:
  1) indicates the possibility of a hazard
  2) determines the order in which results must be calculated
  3) sets an upper bound on how much parallelism can possibly be exploited
• HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
Unrolled Loop Detail
• We do not usually know the upper bound of the loop
• Suppose it is n, and we would like to unroll the loop to make k copies of the body
• Instead of a single unrolled loop, we generate a pair of consecutive loops:
  – the 1st executes (n mod k) times and has a body that is the original loop
  – the 2nd is the unrolled body surrounded by an outer loop that iterates (n/k) times
• For large values of n, most of the execution time will be spent in the unrolled loop
3 Limits to Loop Unrolling
1. Decrease in the amount of overhead amortized with each extra unrolling
   • Amdahl's Law
2. Growth in code size
   • For larger loops, the concern is that it increases the instruction-cache miss rate
3. Register pressure (a compiler limitation): potential shortfall in registers created by aggressive unrolling and scheduling
   • If it is not possible to allocate all live values to registers, the code may lose some or all of the advantage
• Loop unrolling reduces the impact of branches on the pipeline; another way is branch prediction
Basic VLIW (Very Long Instruction Word)
• A VLIW uses multiple, independent functional units
• A VLIW packages multiple independent operations into one very long instruction
  – The burden of choosing and packaging independent operations falls on the compiler
  – The hardware that makes issue decisions in a superscalar is unnecessary
• VLIW depends on enough parallelism to keep the FUs busy
  – Loop unrolling and then code scheduling
  – The compiler may need to do both local and global scheduling
• Here we consider a VLIW processor whose instructions contain 5 operations: 1 integer (or branch), 2 FP, and 2 memory references
  – Depends on the available FUs and the frequency of each operation
Recall: Unrolled Loop that Minimizes Stalls for Scalar
1  Loop: L.D     F0,0(R1)
2        L.D     F6,-8(R1)
3        L.D     F10,-16(R1)
4        L.D     F14,-24(R1)
5        ADD.D   F4,F0,F2
6        ADD.D   F8,F6,F2
7        ADD.D   F12,F10,F2
8        ADD.D   F16,F14,F2
9        S.D     0(R1),F4
10       S.D     -8(R1),F8
11       S.D     -16(R1),F12
12       DSUBUI  R1,R1,#32
13       BNEZ    R1,LOOP
14       S.D     8(R1),F16    ;8-32 = -24

14 clock cycles, or 3.5 per iteration
Latencies: L.D to ADD.D: 1 cycle; ADD.D to S.D: 2 cycles
Loop Unrolling in VLIW
• Unrolled 7 times to avoid delays
• 7 results in 9 clocks, or 1.29 clocks per iteration
• 23 ops in 9 clocks: an average of 2.5 ops per clock, 50% efficiency
• Note: VLIW needs more registers
Multi-issue VLIW Processor
• High utilization
• Operations per cycle = n for an n-way VLIW
• Poor code density
• Large chip area due to register files with ~3N R/W ports
  – For N FUs, RF area and delay increase as N^3 and N^(3/2)
• ILP is limited
ASIP: Application-Specific Instruction-Set Processor
• Complicated, composite datapath
• High hardware utilization (high OPC)
• Lower RF area due to limited R/W ports
• Suitable for specific DSP applications
• The cascading order is what matters

RF port requirements:
  Port number | Scalar | Composite FUs | VLIW
  3 FUs       | 2R/1W  | 3R/1W         | 5R/3W
  4 FUs       | 2R/1W  | 4R/1W         | 7R/4W
Floating-Point (FP) Arithmetic
• IEEE 754 single-precision FP numbers
• Pros
  – Very wide dynamic range with exponential scale
  – Automatic radix-point tracking in the "exponent" with hardware
  – Full precision of the mantissa with fractional operations
• Cons
  – Complex hardware due to dynamic normalization and alignment (i.e. shift operations)
  – Unnecessary dynamic range for most DSP applications

IEEE Standard for Binary Floating-Point Arithmetic, IEEE Standard 754, 1985
Integer Arithmetic
• Simple hardware
  – Low power, small area, fast execution time, ...
• Programmers must take care of the data ranges of intermediate variables
  – Prevent overflow
  – Maintain enough precision
• Frequent scaling & normalization help utilize the finite bits more efficiently
  – Tradeoff between quality (precision) & speed (for explicit exponent tracking & data rounding)
Proposed Static Floating-Point (SFP) Arithmetic
• An effective compromise between FP and integer arithmetic
  – Fractional operations with automatic rounding, as in FP arithmetic
  – Static exponent tracking based on worst-case range estimation, as in integer arithmetic
  – Static normalization and alignment depending on the exponent
• Better precision through efficient data-bit utilization
  – No exponent attached to the data
  – Automatic rounding with fractional operations
  – More frequent scaling and normalization operations
Design Flow
• Computation kernels represented as an SDFG (synchronous dataflow graph)
• Range estimation of each variable (i.e. an edge of the SDFG) by PEV (peak estimation vector) analysis
• Shift insertion according to the PEV analysis
PEV Analysis
• A PEV (peak estimation vector) records the worst-case dynamic range together with its exponent
• Each edge in a floating-point SDFG can be associated with a PEV [M r]
  – "M" is the maximum magnitude (cf. worst-case mantissa)
  – "r" is the position of the radix point (cf. the corresponding exponent)
• PEV calculation rules
  – "r" should be identical before adding or subtracting
  – [M1 r1] × [M2 r2] = [M1 × M2, r1 + r2]
  – M and r are related: M is divided (multiplied) by 2 when r is decreased (increased) by 1
  – M should be kept in the range 0.5~1 for maximum bit utilization
Example 1
• Align the radix point (r) before summation
• Keep the maximum magnitude (M) between 0.5 & 1 to prevent overflow & maximize precision

[Figure: dataflow example with PEV annotations — e.g. [0.5 -1] + [1 -1] = [1.5 -1], normalized to [0.75 -2]; [0.8 0] × [0.6 -1] = [0.48 -1], normalized to [0.96 0]; input [1 0]]
Example 2
• Calculation rules:
  – Align the radix point (r) before summation
  – Keep the maximum magnitude (M) between 0.5 & 1 to prevent overflow & maximize precision
• Example: 4-point DCT

[Figure: 4-point DCT dataflow graph — inputs I0~I3 (PEV [1 0]) pass through butterfly add/subtract stages whose PEVs are normalized (e.g. [2 0] → [1 -1], [2 -1] → [1 -2]) and multiplications by the cosine constants C1 = 0.9238, C2 = 0.3827, C3 = 0.7071, with "Normalize" and "Align" shifts annotated along the edges (PEVs such as [0.92 -1], [0.38 -1], [0.74 0], [0.71 -2], [1.31 -1], [0.65 -2]), producing outputs O0~O3]
FP to SFP Conversion
• Insert shift operations according to the PEV analysis
• 4-point DCT example

[Figure: the same 4-point DCT graph (inputs I0~I3, constants C1 = 0.9238, C2 = 0.3827, C3 = 0.7071, outputs O0~O3) with the static shift operations inserted]
Proposed Static Floating-Point (SFP) Arithmetic
• For N-bit data samples:
  – (N+1)-bit adder/subtractor with input aligners & an output normalizer (in our embodiment, all are 1-bit right shifters)
  – N-bit fractional multiplier with a 1-bit output normalizer (left shifter)
  – N-bit barrel shifter with arithmetic right shifts

[Figure: datapath with sign-bit, radix-point, and fraction fields; >>1 input aligners and a >>1 output normalizer around the (N+1)-bit adder, a <<1 normalizer after the N-bit fractional multiplier, and a >>N barrel shifter]
Static FP Arithmetic: Concluding Remarks
• An effective compromise between FP and integer units
  – Fractional operations with automatic rounding, as in an FPU
  – Shrunk (1-bit) aligners & normalizers compared with an FPU
  – Static exponent (radix-point) tracking, as in integer units
  – Almost FP quality (precision) with hardware nearly as simple (and thus as fast, small, & low-power) as integer units
• SFP arithmetic utilizes the bits more efficiently for precision
  – No exponent (more free bits for precision)
  – Automatic rounding with fractional multiplications & more frequent normalization than integer units to reduce leading sign bits
Example: Linear-Phase FIR
• A linear-phase FIR filter has an approximately constant frequency-response magnitude and linear phase (constant group delay) in the pass-band
• An N-tap direct-form filter needs N multipliers and N-1 adders
• Exploiting substructure sharing reduces the area to:
  – (N+1)/2 multipliers and N-1 adders, if N is odd
  – N/2 multipliers and N-1 adders, if N is even
Cascaded Datapath for the LPFIR Filter
• 16-bit full-precision Add-Multiply-Accumulate (A-M-Acc) datapath
• 16-bit post-truncation (or direct-truncation) A-M-Acc datapath
Final Project
• Please design a cascaded, SFP datapath for the LPFIR filter