Combinational Circuits and Simple Synchronous Pipelines Arvind

33
February 13, 2009 http:// csg.csail.mit.edu/6.375 L05-1 Combinational Circuits and Simple Synchronous Pipelines Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology

description

Combinational Circuits and Simple Synchronous Pipelines Arvind Computer Science & Artificial Intelligence Lab Massachusetts Institute of Technology. Now we call this Guarded Atomic Actions. Bluespec: Two-Level Compilation. Bluespec (Objects, Types, Higher-order functions). - PowerPoint PPT Presentation

Transcript of Combinational Circuits and Simple Synchronous Pipelines Arvind

Page 1: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2009 http://csg.csail.mit.edu/6.375 L05-1

Combinational Circuits and Simple Synchronous Pipelines

Arvind Computer Science & Artificial Intelligence LabMassachusetts Institute of Technology

Page 2: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-2http://csg.csail.mit.edu/6.375

Object code(Verilog/C)

Bluespec: Two-Level Compilation

Rules and Actions(Term Rewriting System)

• Rule conflict analysis• Rule schedulingJames Hoe & Arvind@MIT 1997-2000

Bluespec(Objects, Types,

Higher-order functions)

Level 1 compilation• Type checking• Massive partial evaluation and static elaboration

Level 2 synthesis

Lennart Augustsson@Sandburst 2000-2002

Now we call this Guarded Atomic Actions

Page 3: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-3http://csg.csail.mit.edu/6.375

Static Elaboration

.exe

compiledesign2 design3design1

elaborate w/params

run1run1run2.1…

run1run1run3.1…

run1run1run1.1…

run w/params

run w/params

run1run1…

run

At compile time Inline function calls and unroll loops Instantiate modules with specific parameters Resolve polymorphism/overloading, perform most

data structure operations

sourceSoftwareToolflow: source

HardwareToolflow:

Page 4: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-4http://csg.csail.mit.edu/6.375

Combinational IFFTin0

in1

in2

in63

in3in4

Bfly4

Bfly4

Bfly4

x16

Bfly4

Bfly4

Bfly4

Bfly4

Bfly4

Bfly4

out0

out1

out2

out63

out3out4

Permute

Permute

Permute

All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

*

*

*

*

+

-

-

+

+

-

-

+

*jt2

t0

t3

t1

Page 5: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-5http://csg.csail.mit.edu/6.375

4-way Butterfly Node

function Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k);

BSV has a very strong notion of types Every expression has a type. Either it is declared by

the user or automatically deduced by the compiler The compiler verifies that the type declarations are

compatible

*

*

**

+

-

-+

+

-

-+

*i

t0

t1

t2

t3

k0

k1

k2

k3

Page 6: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-6http://csg.csail.mit.edu/6.375

BSV code: 4-way Butterflyfunction Vector#(4,Complex) bfly4 (Vector#(4,Complex) t, Vector#(4,Complex) k);

Vector#(4,Complex) m, y, z;

m[0] = k[0] * t[0]; m[1] = k[1] * t[1]; m[2] = k[2] * t[2]; m[3] = k[3] * t[3];

y[0] = m[0] + m[2]; y[1] = m[0] – m[2]; y[2] = m[1] + m[3]; y[3] = i*(m[1] – m[3]);

z[0] = y[0] + y[2]; z[1] = y[1] + y[3]; z[2] = y[0] – y[2]; z[3] = y[1] – y[3];

return(z);endfunction

Polymorphic code: works on any type of numbers for which *, + and - have been defined

*

*

**

+

-

-+

+

-

-+

*i

m y z

Note: Vector does not mean storage

Page 7: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-7http://csg.csail.mit.edu/6.375

Combinational IFFTin0

in1

in2

in63

in3in4

Bfly4

Bfly4

Bfly4

x16

Bfly4

Bfly4

Bfly4

Bfly4

Bfly4

Bfly4

out0

out1

out2

out63

out3out4

Permute

Permute

Permute

stage_f function

repeat it three times

Page 8: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-8http://csg.csail.mit.edu/6.375

BSV Code: Combinational IFFTfunction Vector#(64, Complex) ifft

(Vector#(64, Complex) in_data);//Declare vectors Vector#(4,Vector#(64, Complex)) stage_data; stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]);return(stage_data[3]);

The for loop is unfolded and stage_f is inlined during static elaboration

Note: no notion of loops or procedures during execution

Page 9: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-9http://csg.csail.mit.edu/6.375

BSV Code: Combinational IFFT- Unfoldedfunction Vector#(64, Complex) ifft

(Vector#(64, Complex) in_data);//Declare vectors Vector#(4,Vector#(64, Complex)) stage_data;

stage_data[0] = in_data; for (Integer stage = 0; stage < 3; stage = stage + 1) stage_data[stage+1] = stage_f(stage,stage_data[stage]); return(stage_data[3]);

Stage_f can be inlined now; it could have been inlined before loop unfolding also.

Does the order matter?

stage_data[1] = stage_f(0,stage_data[0]);stage_data[2] = stage_f(1,stage_data[1]);stage_data[3] = stage_f(2,stage_data[2]);

Page 10: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-10http://csg.csail.mit.edu/6.375

Bluespec Code for stage_ffunction Vector#(64, Complex) stage_f (Bit#(2) stage, Vector#(64, Complex) stage_in); begin for (Integer i = 0; i < 16; i = i + 1) begin Integer idx = i * 4; let twid = getTwiddle(stage, fromInteger(i)); let y = bfly4(twid, stage_in[idx:idx+3]); stage_temp[idx] = y[0]; stage_temp[idx+1] = y[1]; stage_temp[idx+2] = y[2]; stage_temp[idx+3] = y[3]; end //Permutation for (Integer i = 0; i < 64; i = i + 1) stage_out[i] = stage_temp[permute[i]]; endreturn(stage_out); twid’s are

mathematically derivable constants

Page 11: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-11http://csg.csail.mit.edu/6.375

Suppose we want to reuse some part of the circuit ...

in0

in1

in2

in63

in3in4

Bfly4

Bfly4

Bfly4

x16

Bfly4

Bfly4

Bfly4

Bfly4

Bfly4

Bfly4

out0

out1

out2

out63

out3out4

Permute

Permute

Permute

Reuse the same circuit three times to reduce area

But why?

Page 12: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2009 http://csg.csail.mit.edu/6.375 L05-12

Architectural Exploration:Area-Performance tradeoff in 802.11a Transmitter

Page 13: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-13http://csg.csail.mit.edu/6.375

802.11a Transmitter Overview

Controller Scrambler Encoder

Interleaver Mapper

IFFT CyclicExtend

headers

data

IFFT Transforms 64 (frequency domain) complex numbers into 64 (time domain)

complex numbers accounts for 85% area

24 Uncoded

bits

One OFDM symbol (64 Complex Numbers)

Must produce one OFDM symbol every 4 secDepending upon the transmission rate, consumes 1, 2 or 4 tokens to produce one OFDM symbol

Page 14: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-14http://csg.csail.mit.edu/6.375

Preliminary results[MEMOCODE 2006] Dave, Gerding, Pellauer, Arvind

Design Lines of RelativeBlock Code (BSV) AreaController 49 0%Scrambler 40 0%Conv. Encoder 113 0%Interleaver 76 1%Mapper 112 11%IFFT 95 85%Cyc. Extender 23 3%Complex arithmetic libraries constitute another 200 lines of code

Page 15: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-15http://csg.csail.mit.edu/6.375

Combinational IFFTin0

in1

in2

in63

in3in4

Bfly4

Bfly4

Bfly4

x16

Bfly4

Bfly4

Bfly4

Bfly4

Bfly4

Bfly4

out0

out1

out2

out63

out3out4

Permute

Permute

Permute

Reuse the same circuit three times to reduce area

Page 16: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-16http://csg.csail.mit.edu/6.375

f g

Design AlternativesReuse a block over multiple cycles

we expect:Throughput toArea to

ff g

decrease – less parallelism

The clock needs to run faster for the same throughput hyper-linear increase in energy

decrease – reusing a block

Page 17: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-17http://csg.csail.mit.edu/6.375

Circular pipeline: Reusing the Pipeline Stage

in0

in1

in2

in63

in3in4

out0

out1

out2

out63

out3out4

Bfly4

Bfly4

Permute

Stage Counter

Page 18: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-18http://csg.csail.mit.edu/6.375

Superfolded circular pipeline: Just one Bfly-4 node!in0

in1

in2

in63

in3in4

out0

out1

out2

out63

out3out4

Bfly4

Permute

Index == 15?

Index: 0 to 15

64, 2-way M

uxes

4, 16-way M

uxes

4, 16-way DeM

uxes

Stage 0 to 2

Page 19: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-19http://csg.csail.mit.edu/6.375

Pipelining a block

inQ outQf2f1 f3

CombinationalC

inQ outQf2f1 f3 PipelineP

inQ outQf Folded

PipelineFP

Clock? Area? Throughput?Clock: C < P FP Area: FP < C < P Throughput: FP < C < P

Page 20: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-20http://csg.csail.mit.edu/6.375

Synchronous pipeline

xsReg1inQ

f1 f2 f3

sReg2 outQ

rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2));endrule

This is real IFFT code; just replace f1, f2 and f3 with stage_f code

This rule can fire only if

Atomicity: Either all or none of the state elements inQ, outQ, sReg1 and sReg2 will be updated

- inQ has an element- outQ has space

Page 21: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-21http://csg.csail.mit.edu/6.375

Stage functions f1, f2 and f3function f1(x); return (stage_f(1,x)); endfunction

function f2(x); return (stage_f(2,x)); endfunction

function f3(x); return (stage_f(3,x)); endfunction

The stage_f function was given earlier

Page 22: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-22http://csg.csail.mit.edu/6.375

Problem: What about pipeline bubbles?

xsReg1inQ

f1 f2 f3

sReg2 outQ

rule sync-pipeline (True); inQ.deq(); sReg1 <= f1(inQ.first()); sReg2 <= f2(sReg1); outQ.enq(f3(sReg2));endrule

Red and Green tokens must move even if there is nothing in the inQ!

Modify the rule to deal with these conditions

Also if there is no token in sReg2 then nothing should be enqueued in the outQ

Valid bits orthe Maybe type

Page 23: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-23http://csg.csail.mit.edu/6.375

The Maybe type data in the pipelinetypedef union tagged {

void Invalid;data_T Valid;

} Maybe#(type data_T);

datavalid/invalid

Registers contain Maybe type values

rule sync-pipeline (True); if (inQ.notEmpty()) begin sReg1 <= Valid f1(inQ.first()); inQ.deq(); end else sReg1 <= Invalid; case (sReg1) matches tagged Valid .sx1: sReg2 <= Valid f2(sx1); tagged Invalid: sReg2 <= Invalid; case (sReg2) matches tagged Valid .sx2: outQ.enq(f3(sx2));endrule

sx1 will get bound to the appropriate part of sReg1

Page 24: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-24http://csg.csail.mit.edu/6.375

Folded pipeline

rule folded-pipeline (True); if (stage==0) begin sxIn= inQ.first(); inQ.deq(); end else sxIn= sReg; sxOut = f(stage,sxIn); if (stage==n-1) outQ.enq(sxOut); else sReg <= sxOut; stage <= (stage==n-1)? 0 : stage+1;endrule

x

sReginQ

f

outQstage

The same code will work for superfolded pipelines by changing n and stage function f

notice stage is a dynamic parameter now!

no for-loop

Need type declarations for sxIn and sxOut

Page 25: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-25http://csg.csail.mit.edu/6.375

802.11a Transmitter Synthesis results (Only the IFFT block is changing)

IFFT Design Area (mm2)

ThroughputLatency

(CLKs/sym)

Min. Freq Required

Pipelined 5.25 04 1.0 MHz

Combinational 4.91 04 1.0 MHz

Folded(16 Bfly-4s)

3.97 04 1.0 MHz

Super-Folded(8 Bfly-4s)

3.69 06 1.5 MHz

SF(4 Bfly-4s) 2.45 12 3.0 MHz

SF(2 Bfly-4s) 1.84 24 6.0 MHzSF (1 Bfly4) 1.52 48 12 MHZ

TSMC .18 micron; numbers reported are before place and route.

The same source code

All these designs were done in less than 24 hours!

Page 26: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-26http://csg.csail.mit.edu/6.375

Why are the areas so similarFolding should have given a 3x improvement in IFFT area

BUT a constant twiddle allows low-level optimization on a Bfly-4 block a 2.5x area reduction!

Page 27: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-27http://csg.csail.mit.edu/6.375

Language notesPattern matching syntax

Vector syntax

Implicit conditions

Static vs dynamic expression

Page 28: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-28http://csg.csail.mit.edu/6.375

Pattern-matching: A convenient way to extract datastructure components

The &&& is a conjunction, and allows pattern-variables to come into scope from left to right

case (m) matches tagged Invalid : return 0; tagged Valid .x : return x;endcase

if (m matches (Valid .x) &&& (x > 10))

typedef union tagged { void Invalid; t Valid;} Maybe#(type t);

x will get bound to the appropriate part of m

Page 29: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-29http://csg.csail.mit.edu/6.375

Syntax: Vector of RegistersRegister

suppose x and y are both of type Reg. Then x <= y means x._write(y._read())

Vector of Int x[i] means sel(x,i) x[i] = y[j] means x = update(x,i, sel(y,j))

Vector of Registers x[i] <= y[j] does not work. The parser thinks it means

(sel(x,i)._read)._write(sel(y,j)._read), which will not type check

(x[i]) <= y[j] parses as sel(x,i)._write(sel(y,j)._read), and works correctly

Don’t ask me why

Page 30: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-30http://csg.csail.mit.edu/6.375

Making guards explicitrule recirculate (True); if (p) fifo.enq(8); r <= 7;endrule

rule recirculate ((p && fifo.enqG) || !p); if (p) fifo.enqB(8); r <= 7;endrule

Effectively, all implicit conditions (guards) are lifted and conjoined to the rule guard

Page 31: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-31http://csg.csail.mit.edu/6.375

Implicit guards (conditions)Rulerule <name> (<guard>); <action>; endrule

where

<action> ::= r <= <exp>| m.g(<exp>)| if (<exp>) <action>

endif| <action> ; <action>

make implicit guards explicit

m.gB(<exp>) when m.gG

Page 32: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-32http://csg.csail.mit.edu/6.375

Guards vs If’sA guard on one action of a parallel group of actions affects every action within the group(a1 when p1); (a2 when p2)

==> (a1; a2) when (p1 && p2)A condition of a Conditional action only affects the actions within the scope of the conditional action(if (p1) a1); a2

p1 has no effect on a2 ... Mixing ifs and whens(if (p) (a1 when q)) ; a2

((if (p) a1); a2) when ((p && q) | !p)

Page 33: Combinational Circuits and Simple Synchronous Pipelines Arvind

February 13, 2008 L05-33http://csg.csail.mit.edu/6.375

Static vs dynamic expressions

Expressions that can be evaluated at compile time will be evaluated at compile-time 3+4 7

Some expressions do not have run-time representations and must be evaluated away at compile time; an error will occur if the compile-time evaluation does not succeed Integers, reals, loops, lists, functions, …