February 13, 2009 http://csg.csail.mit.edu/6.375 L05-1
Combinational Circuits and Simple Synchronous Pipelines
Arvind
Computer Science & Artificial Intelligence Lab
Massachusetts Institute of Technology

Bluespec: Two-Level Compilation

Bluespec (Objects, Types, Higher-order functions)
  Level 1 compilation [Lennart Augustsson @ Sandburst, 2000-2002]
    • Type checking
    • Massive partial evaluation and static elaboration
Rules and Actions (Term Rewriting System); now we call this Guarded Atomic Actions
  Level 2 synthesis [James Hoe & Arvind @ MIT, 1997-2000]
    • Rule conflict analysis
    • Rule scheduling
Object code (Verilog/C)

Static Elaboration

At compile time:
  • Inline function calls and unroll loops
  • Instantiate modules with specific parameters
  • Resolve polymorphism/overloading, perform most data structure operations

[Figure: Software toolflow: source, compile, .exe, run w/params (run1, run1, ...). Hardware toolflow: source, elaborate w/params, design1 / design2 / design3, compile, run (run1.1, run2.1, run3.1, ...).]

Combinational IFFT

[Figure: inputs in0 ... in63 pass through three stages, each made of 16 Bfly4 nodes (x16) followed by a Permute block, to produce out0 ... out63. Inset: one Bfly4 node with inputs t0 ... t3, built from multipliers (*), adders and subtractors (+, -) and a multiply-by-j block.]

All numbers are complex and represented as two sixteen-bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

4-way Butterfly Node

  function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k);

BSV has a very strong notion of types
  • Every expression has a type. Either it is declared by the user or automatically deduced by the compiler
  • The compiler verifies that the type declarations are compatible

[Figure: Bfly4 node with inputs t0 ... t3 and twiddle factors k0 ... k3: four multipliers (*), two layers of adders and subtractors (+, -) and a multiply-by-i block.]

BSV code: 4-way Butterfly

  function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k);
    Vector#(4,Complex) m, y, z;

    m[0] = k[0] * t[0];
    m[1] = k[1] * t[1];
    m[2] = k[2] * t[2];
    m[3] = k[3] * t[3];

    y[0] = m[0] + m[2];
    y[1] = m[0] - m[2];
    y[2] = m[1] + m[3];
    y[3] = i*(m[1] - m[3]);

    z[0] = y[0] + y[2];
    z[1] = y[1] + y[3];
    z[2] = y[0] - y[2];
    z[3] = y[1] - y[3];

    return(z);
  endfunction

Polymorphic code: works on any type of numbers for which *, + and - have been defined

[Figure: the Bfly4 node annotated with the intermediate vectors m, y and z.]

Note: Vector does not mean storage
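
A rough behavioral model can help when checking the butterfly dataflow by hand. The Python sketch below mirrors the m, y, z steps of the BSV function using Python's built-in complex numbers. It is not fixed-point and not BSV, and the `i` parameter (defaulting to the imaginary unit `1j`) is an assumption about what `i` denotes on the slide.

```python
def bfly4(t, k, i=1j):
    """4-way butterfly following the slide's m, y, z steps."""
    # Multiply each input by its twiddle factor
    m = [k[n] * t[n] for n in range(4)]
    # First layer of adds and subtracts; the last term is rotated by i
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], i * (m[1] - m[3])]
    # Second layer of adds and subtracts
    z = [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]
    return z

# Works for any values that support *, + and -, e.g. ints mixed with complex
print(bfly4([1, 2, 3, 4], [1, 1, 1, 1]))
```

Like the BSV version, the lists here are just names for wires in the dataflow; nothing in the model implies storage.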

Combinational IFFT

[Figure: the three-stage network (in0 ... in63, 16 Bfly4 nodes and a Permute block per stage, out0 ... out63). One column of Bfly4 nodes plus its Permute block is the stage_f function; repeat it three times.]

BSV Code: Combinational IFFT

  function Vector#(64, Complex) ifft (Vector#(64, Complex) in_data);
    //Declare vectors
    Vector#(4,Vector#(64, Complex)) stage_data;

    stage_data[0] = in_data;
    for (Integer stage = 0; stage < 3; stage = stage + 1)
      stage_data[stage+1] = stage_f(stage,stage_data[stage]);
    return(stage_data[3]);
  endfunction

The for loop is unfolded and stage_f is inlined during static elaboration
Note: no notion of loops or procedures during execution

BSV Code: Combinational IFFT - Unfolded

  function Vector#(64, Complex) ifft (Vector#(64, Complex) in_data);
    //Declare vectors
    Vector#(4,Vector#(64, Complex)) stage_data;

    stage_data[0] = in_data;
    for (Integer stage = 0; stage < 3; stage = stage + 1)
      stage_data[stage+1] = stage_f(stage,stage_data[stage]);
    return(stage_data[3]);
  endfunction

Stage_f can be inlined now; it could have been inlined before loop unfolding also.
Does the order matter?

After unfolding, the for loop is replaced by:

    stage_data[1] = stage_f(0,stage_data[0]);
    stage_data[2] = stage_f(1,stage_data[1]);
    stage_data[3] = stage_f(2,stage_data[2]);
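
As a rough illustration of what unfolding means, the Python sketch below runs the loop once, ahead of time, and emits one statement per iteration. The helper name `unfold_ifft_loop` is hypothetical; it only models the textual effect of static elaboration and is not how the Bluespec compiler is implemented.

```python
def unfold_ifft_loop(num_stages=3):
    """Return the statements left after the for loop has been unfolded."""
    statements = []
    # This loop exists only at "elaboration time"; none of it survives in the output
    for stage in range(num_stages):
        statements.append(
            f"stage_data[{stage + 1}] = stage_f({stage},stage_data[{stage}]);"
        )
    return statements

for line in unfold_ifft_loop():
    print(line)
```

The loop variable has disappeared from the emitted statements; each call to stage_f now has a constant stage argument.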

Bluespec Code for stage_f

  function Vector#(64, Complex) stage_f (Bit#(2) stage, Vector#(64, Complex) stage_in);
    Vector#(64, Complex) stage_temp, stage_out;
    for (Integer i = 0; i < 16; i = i + 1)
    begin
      Integer idx = i * 4;
      let twid = getTwiddle(stage, fromInteger(i));
      let y = bfly4(twid, stage_in[idx:idx+3]);
      stage_temp[idx]   = y[0];
      stage_temp[idx+1] = y[1];
      stage_temp[idx+2] = y[2];
      stage_temp[idx+3] = y[3];
    end
    //Permutation
    for (Integer i = 0; i < 64; i = i + 1)
      stage_out[i] = stage_temp[permute[i]];
    return(stage_out);
  endfunction

twid's are mathematically derivable constants
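
The structure of stage_f (16 butterflies over groups of four values, then a permutation) can be modeled in Python as follows. This is a sketch only: `get_twiddle` and `permute` are passed in as parameters because their real definitions are not given here, and the unit twiddles and identity permutation in the usage lines are placeholders, so the result is not an actual IFFT stage.

```python
def bfly4(t, k, i=1j):
    # Same m, y, z dataflow as the BSV bfly4 function
    m = [k[n] * t[n] for n in range(4)]
    y = [m[0] + m[2], m[0] - m[2], m[1] + m[3], i * (m[1] - m[3])]
    return [y[0] + y[2], y[1] + y[3], y[0] - y[2], y[1] - y[3]]

def stage_f(stage, stage_in, get_twiddle, permute):
    """One IFFT stage: 16 butterflies followed by a permutation."""
    stage_temp = []
    for i in range(16):
        idx = i * 4
        twid = get_twiddle(stage, i)  # four twiddle factors (hypothetical helper)
        stage_temp.extend(bfly4(twid, stage_in[idx:idx + 4]))
    # Permutation: output i takes the value at position permute[i]
    return [stage_temp[permute[i]] for i in range(64)]

def unit_twiddle(stage, i):
    return [1, 1, 1, 1]  # placeholder, not real twiddle factors

identity = list(range(64))  # placeholder permutation
out = stage_f(0, [1] * 64, unit_twiddle, identity)
print(out[:4])
```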

Suppose we want to reuse some part of the circuit ...

[Figure: the combinational IFFT network: in0 ... in63, three stages of 16 Bfly4 nodes each followed by a Permute block, out0 ... out63.]

Reuse the same circuit three times to reduce area
But why?

Architectural Exploration: Area-Performance tradeoff in 802.11a Transmitter

802.11a Transmitter Overview

[Figure: headers and data enter the chain Controller, Scrambler, Encoder, Interleaver, Mapper, IFFT, Cyclic Extend. Labels: 24 uncoded bits; one OFDM symbol (64 complex numbers).]

IFFT transforms 64 (frequency domain) complex numbers into 64 (time domain) complex numbers; it accounts for 85% of the area
Must produce one OFDM symbol every 4 μsec
Depending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol

Preliminary results [MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind

Design Block     Lines of Code (BSV)   Relative Area
Controller        49                     0%
Scrambler         40                     0%
Conv. Encoder    113                     0%
Interleaver       76                     1%
Mapper           112                    11%
IFFT              95                    85%
Cyc. Extender     23                     3%

Complex arithmetic libraries constitute another 200 lines of code

Combinational IFFT

[Figure: in0 ... in63, three stages of 16 Bfly4 nodes each followed by a Permute block, out0 ... out63.]

Reuse the same circuit three times to reduce area

Design Alternatives: Reuse a block over multiple cycles

[Figure: block diagrams built from f and g, without and with reuse of a block.]

we expect:
  Throughput to decrease (less parallelism)
  Area to decrease (reusing a block)

The clock needs to run faster for the same throughput → hyper-linear increase in energy

Circular pipeline: Reusing the Pipeline Stage

[Figure: in0 ... in63 enter a single stage of Bfly4 nodes and a Permute block whose result is fed back to its input; a Stage Counter controls the passes; out0 ... out63 leave after the last pass.]

Superfolded circular pipeline: Just one Bfly-4 node!

[Figure: in0 ... in63 and out0 ... out63 around a single Bfly4 node and a Permute block. Labels: 64, 2-way Muxes; 4, 16-way Muxes; 4, 16-way DeMuxes; Index: 0 to 15; Index == 15?; Stage 0 to 2.]

Pipelining a block

[Figure: three organizations between inQ and outQ: Combinational (C) with f1, f2, f3 back to back; Pipeline (P) with f1, f2, f3 in separate stages; Folded Pipeline (FP) with a single block f.]

Clock? Area? Throughput?
  Clock:      C < P ≈ FP
  Area:       FP < C < P
  Throughput: FP < C < P

Synchronous pipeline

[Figure: x enters inQ, then f1, sReg1, f2, sReg2, f3, outQ.]

  rule sync_pipeline (True);
    inQ.deq();
    sReg1 <= f1(inQ.first());
    sReg2 <= f2(sReg1);
    outQ.enq(f3(sReg2));
  endrule

This is real IFFT code; just replace f1, f2 and f3 with stage_f code

This rule can fire only if
  - inQ has an element
  - outQ has space

Atomicity: Either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated
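
The firing condition and the all-or-nothing update can be mimicked in a small Python model. This is a sketch under assumptions: queues are `collections.deque` objects, registers live in a dict, and `out_capacity` is a made-up bound standing in for "outQ has space". Register reads see the values from before the rule fires, as in the BSV rule.

```python
from collections import deque

def sync_pipeline_step(state, in_q, out_q, f1, f2, f3, out_capacity=4):
    """Fire the rule once if its guard holds; return True if it fired."""
    # Implicit guards: inQ has an element and outQ has space
    if not in_q or len(out_q) >= out_capacity:
        return False  # nothing is updated at all
    s1, s2 = state["sReg1"], state["sReg2"]  # old register values
    out_q.append(f3(s2))
    state["sReg2"] = f2(s1)
    state["sReg1"] = f1(in_q.popleft())
    return True

inc = lambda x: x + 1
state = {"sReg1": 0, "sReg2": 0}
in_q, out_q = deque([10, 20, 30]), deque()
fired = [sync_pipeline_step(state, in_q, out_q, inc, inc, inc) for _ in range(4)]
print(fired, list(out_q), state)
```

Once in_q is empty the rule stops firing, so the values derived from 20 and 30 stay stuck in sReg2 and sReg1.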

Stage functions f1, f2 and f3

  function f1(x); return (stage_f(1,x)); endfunction
  function f2(x); return (stage_f(2,x)); endfunction
  function f3(x); return (stage_f(3,x)); endfunction

The stage_f function was given earlier

Problem: What about pipeline bubbles?

[Figure: x enters inQ, then f1, sReg1, f2, sReg2, f3, outQ, with a red and a green token in flight.]

  rule sync_pipeline (True);
    inQ.deq();
    sReg1 <= f1(inQ.first());
    sReg2 <= f2(sReg1);
    outQ.enq(f3(sReg2));
  endrule

Red and Green tokens must move even if there is nothing in the inQ!
Also if there is no token in sReg2 then nothing should be enqueued in the outQ
Modify the rule to deal with these conditions: valid bits or the Maybe type

The Maybe type data in the pipeline

  typedef union tagged {
    void Invalid;
    data_T Valid;
  } Maybe#(type data_T);

[Figure: a register holding a valid/invalid tag next to the data.]

Registers contain Maybe type values

  rule sync_pipeline (True);
    if (inQ.notEmpty())
      begin
        sReg1 <= tagged Valid (f1(inQ.first()));
        inQ.deq();
      end
    else
      sReg1 <= tagged Invalid;
    case (sReg1) matches
      tagged Valid .sx1: sReg2 <= tagged Valid (f2(sx1));
      tagged Invalid:    sReg2 <= tagged Invalid;
    endcase
    case (sReg2) matches
      tagged Valid .sx2: outQ.enq(f3(sx2));
    endcase
  endrule

sx1 will get bound to the appropriate part of sReg1
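
A Python model of the Maybe-based rule shows the bubbles draining. This is a sketch, not BSV: `None` stands in for Invalid, any other value for Valid, and out_q is assumed to be unbounded so the guard on outQ.enq is ignored.

```python
from collections import deque

def maybe_pipeline_step(state, in_q, out_q, f1, f2, f3):
    """One firing of the rule; None models Invalid."""
    s1, s2 = state["sReg1"], state["sReg2"]  # values before the rule fires
    if s2 is not None:
        out_q.append(f3(s2))                 # enqueue only real tokens
    state["sReg2"] = f2(s1) if s1 is not None else None
    state["sReg1"] = f1(in_q.popleft()) if in_q else None

inc = lambda x: x + 1
state = {"sReg1": None, "sReg2": None}
in_q, out_q = deque([10, 20]), deque()
for _ in range(4):
    maybe_pipeline_step(state, in_q, out_q, inc, inc, inc)
print(list(out_q), state)
```

Two inputs need four firings in total: two to fill the registers and two more, with an empty in_q, to push the remaining tokens out.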

Folded pipeline

[Figure: x enters inQ; a single block f with register sReg and a stage counter; result leaves through outQ.]

  rule folded_pipeline (True);
    if (stage==0)
      begin sxIn = inQ.first(); inQ.deq(); end
    else
      sxIn = sReg;
    sxOut = f(stage,sxIn);
    if (stage==n-1) outQ.enq(sxOut);
    else sReg <= sxOut;
    stage <= (stage==n-1)? 0 : stage+1;
  endrule

notice stage is a dynamic parameter now!
no for-loop
The same code will work for superfolded pipelines by changing n and stage function f
Need type declarations for sxIn and sxOut
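
The folded rule can also be traced with a short Python model. This is a sketch under assumptions: state is a dict holding `sReg` and `stage`, out_q is unbounded, and the stage function used in the usage lines is an arbitrary stand-in chosen so that each pass is visible in the result.

```python
from collections import deque

def folded_pipeline_step(state, in_q, out_q, f, n):
    """One firing of the folded rule; return True if it fired."""
    stage = state["stage"]
    if stage == 0:
        if not in_q:
            return False          # implicit guard of inQ.first / inQ.deq
        sx_in = in_q.popleft()
    else:
        sx_in = state["sReg"]
    sx_out = f(stage, sx_in)      # stage is a run-time value here
    if stage == n - 1:
        out_q.append(sx_out)
    else:
        state["sReg"] = sx_out
    state["stage"] = 0 if stage == n - 1 else stage + 1
    return True

f = lambda stage, x: x * 10 + stage   # stand-in for the real stage function
state = {"sReg": 0, "stage": 0}
in_q, out_q = deque([1]), deque()
fired = [folded_pipeline_step(state, in_q, out_q, f, 3) for _ in range(4)]
print(fired, list(out_q), state["stage"])
```

One input takes n firings to reach out_q, which is why throughput drops as the design is folded further.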

802.11a Transmitter Synthesis results (Only the IFFT block is changing)

IFFT Design                 Area (mm^2)   Throughput Latency (CLKs/sym)   Min. Freq Required
Pipelined                   5.25           4                              1.0 MHz
Combinational               4.91           4                              1.0 MHz
Folded (16 Bfly-4s)         3.97           4                              1.0 MHz
Super-Folded (8 Bfly-4s)    3.69           6                              1.5 MHz
SF (4 Bfly-4s)              2.45          12                              3.0 MHz
SF (2 Bfly-4s)              1.84          24                              6.0 MHz
SF (1 Bfly4)                1.52          48                              12 MHz

TSMC .18 micron; numbers reported are before place and route.
The same source code
All these designs were done in less than 24 hours!
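
The minimum-frequency column follows from the requirement of one OFDM symbol every 4 μs: the clock must supply the listed cycles per symbol within that period. A small Python check of that arithmetic (the function name is illustrative):

```python
SYMBOL_PERIOD_US = 4.0  # one OFDM symbol every 4 microseconds

def min_freq_mhz(clocks_per_symbol):
    """Cycles per microsecond, which is the same number as MHz."""
    return clocks_per_symbol / SYMBOL_PERIOD_US

designs = [
    ("Pipelined", 4), ("Combinational", 4), ("Folded (16 Bfly-4s)", 4),
    ("Super-Folded (8 Bfly-4s)", 6), ("SF (4 Bfly-4s)", 12),
    ("SF (2 Bfly-4s)", 24), ("SF (1 Bfly4)", 48),
]
for name, clks in designs:
    print(f"{name}: {min_freq_mhz(clks)} MHz")
```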

Why are the areas so similar?

Folding should have given a 3x improvement in IFFT area
BUT a constant twiddle allows low-level optimization on a Bfly-4 block: a 2.5x area reduction!

Language notes

  • Pattern matching syntax
  • Vector syntax
  • Implicit conditions
  • Static vs dynamic expression

Pattern-matching: A convenient way to extract datastructure components

  typedef union tagged {
    void Invalid;
    t Valid;
  } Maybe#(type t);

  case (m) matches
    tagged Invalid : return 0;
    tagged Valid .x : return x;
  endcase

x will get bound to the appropriate part of m

  if (m matches tagged Valid .x &&& (x > 10))

The &&& is a conjunction, and allows pattern-variables to come into scope from left to right

Syntax: Vector of Registers

Register
  suppose x and y are both of type Reg. Then x <= y means x._write(y._read())

Vector of Int
  x[i] means sel(x,i)
  x[i] = y[j] means x = update(x,i, sel(y,j))

Vector of Registers
  x[i] <= y[j] does not work. The parser thinks it means (sel(x,i)._read)._write(sel(y,j)._read), which will not type check
  (x[i]) <= y[j] parses as sel(x,i)._write(sel(y,j)._read), and works correctly

Don't ask me why

Making guards explicit

  rule recirculate (True);
    if (p) fifo.enq(8);
    r <= 7;
  endrule

With the implicit guard of fifo.enq made explicit (enqG is the guard, enqB the body):

  rule recirculate ((p && fifo.enqG) || !p);
    if (p) fifo.enqB(8);
    r <= 7;
  endrule

Effectively, all implicit conditions (guards) are lifted and conjoined to the rule guard
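
The lifted guard can be exercised with a toy Python model. This is a sketch: the `Fifo` class with `enq_g` (guard) and `enq_b` (body) methods is hypothetical and only mirrors the enqG / enqB naming of the slide.

```python
class Fifo:
    """Toy bounded FIFO with the guard and the body of enq separated."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def enq_g(self):               # guard: is there space?
        return len(self.items) < self.capacity

    def enq_b(self, x):            # body: enqueue, assuming the guard held
        self.items.append(x)

def recirculate(p, fifo, regs):
    """Fire the rule if the lifted guard holds; return True if it fired."""
    guard = (p and fifo.enq_g()) or (not p)
    if not guard:
        return False               # neither the enq nor the write to r happens
    if p:
        fifo.enq_b(8)
    regs["r"] = 7
    return True

fifo, regs = Fifo(capacity=1), {"r": 0}
print(recirculate(True, fifo, regs), fifo.items, regs)
```

When p is true and the FIFO is full, the whole rule is blocked, including the write to r; when p is false, the FIFO's state is irrelevant.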

Implicit guards (conditions)

Rule
  rule <name> (<guard>); <action>; endrule

where
  <action> ::= r <= <exp>
             | m.g(<exp>)
             | if (<exp>) <action> endif
             | <action> ; <action>

make implicit guards explicit: m.g(<exp>) is
  m.gB(<exp>) when m.gG

Guards vs If's

A guard on one action of a parallel group of actions affects every action within the group
  (a1 when p1); (a2 when p2)
  ==> (a1; a2) when (p1 && p2)

A condition of a Conditional action only affects the actions within the scope of the conditional action
  (if (p1) a1); a2
  p1 has no effect on a2 ...

Mixing ifs and whens
  (if (p) (a1 when q)) ; a2
  ==> ((if (p) a1); a2) when ((p && q) | !p)
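
The lifted guard for the mixed case can be checked against its intended meaning with a truth table. The Python sketch below assumes the reading that a guard only matters when its action would actually execute; the function names are illustrative.

```python
from itertools import product

def fires_parallel_whens(p1, p2):
    # (a1 when p1); (a2 when p2): the whole group needs both guards
    return p1 and p2

def fires_if_then_when(p, q):
    # (if (p) (a1 when q)); a2: q matters only when the branch is taken
    return q if p else True

def lifted_guard(p, q):
    # Guard from the slide: (p && q) | !p
    return (p and q) or (not p)

for p, q in product([False, True], repeat=2):
    print(p, q, fires_if_then_when(p, q), lifted_guard(p, q))
```

The two columns agree on all four rows: the group is blocked only when p is true and q is false.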

Static vs dynamic expressions

Expressions that can be evaluated at compile time will be evaluated at compile time
  3+4 → 7

Some expressions do not have run-time representations and must be evaluated away at compile time; an error will occur if the compile-time evaluation does not succeed
  Integers, reals, loops, lists, functions, …