1 Clockless Computing Montek Singh Thu, Sep 13, 2007.


1

Clockless Computing

Montek Singh

Thu, Sep 13, 2007

2

Dynamic Logic Pipelines (contd.)

Drawbacks of Williams’ PS0 Pipelines
Lookahead Pipelines [Singh/Nowick 2000]
High-Capacity Pipelines [Singh/Nowick 2000]

3

Drawbacks of PS0 Pipelining

1. Poor throughput:
  long cycle time: 6 events per cycle
  data “tokens” are forced far apart in time

2. Limited storage capacity:
  max only 50% of stages can hold distinct tokens
  data tokens must be separated by at least one spacer

My Research Goals have been:
  address both issues
  still maintain very low latency

4

Recent Approaches

3 novel styles for high-speed async pipelining:
  MOUSETRAP Pipelines [Singh/Nowick, TAU-00, ICCD-01]
  “Lookahead Pipelines” (LP) [Singh/Nowick, Async-00]
  “High-Capacity Pipelines” (HC) [Singh/Nowick, WVLSI-00]

Goal: significantly improve throughput of PS0

Two Distinct Strategies:
  LP: introduce protocol optimizations
    “shave off” components from critical cycle
  HC: fundamentally new protocol
    greater concurrency: “loosely-coupled” stages

5

Outline

New Asynchronous Pipelines:
  MOUSETRAP Pipelines
  Lookahead Pipelines (LP)
  High-Capacity Pipelines (HC)

Dynamic circuit style
Static circuit style

6

Lookahead Pipeline Styles

Singh and Nowick

Async-2000 [Best Paper Award]

7

Lookahead Pipelines: Strategy #1

Use non-neighbor communication:
  stage receives information from multiple later stages
  allows “early evaluation”

Benefit: stage gets head-start on next cycle

8

Lookahead Pipelines: Strategy #2

Use early completion detection:
  completion detector moved before stage (not after)
  stage indicates “early done” in parallel with computation

Benefit: again, stage gets head-start on next cycle

[Figure: stage with early completion detector]

9

Lookahead Pipelines: Overview

5 New Designs:

“Dual-Rail” Data Signaling:
  LP3/1: “early evaluation”
  LP2/2: “early done”
  LP2/1: “early evaluation” + “early done”

“Single-Rail” Bundled-Data Signaling:
  LPSR2/2: “early done”
  LPSR2/1: “early evaluation” + “early done”

10

Dual-Rail Design #1: LP3/1

Optimization = “early evaluation”
  each stage has two control inputs: from stages N+1 and N+2

Idea: shorten precharge phase
  terminate precharge early: when N+2 is done evaluating

[Figure: stages N, N+1, N+2, each a Processing Block with a Completion Detector, between Data in and Data out; control inputs PC and Eval, with Eval taken from N+2]

11

LP3/1 Protocol

PRECHARGE N: when N+1 completes evaluation
EVALUATE N: when N+2 completes evaluation (New!)

Enables “early evaluation!”

[Figure: event sequence across stages N, N+1, N+2]
  1. N evaluates
  2. N+1 evaluates
  3. N+2 evaluates; N+1 indicates “done”
  4. N+2 indicates “done”

12

LP3/1: Comparison with PS0

PS0: 6 events in cycle
  PRECHARGE N: when N+1 completes evaluation
  EVALUATE N: when N+1 completes precharging

LP3/1: Only 4 events in cycle!
  PRECHARGE N: when N+1 completes evaluation
  EVALUATE N: when N+2 completes evaluation
  Enables “early evaluation!”

[Figure: event diagrams for PS0 (events 1 to 6) and LP3/1 (events 1 to 4) across stages N, N+1, N+2; each stage evaluates in turn and then indicates “done”]

13

LP3/1 Performance

[Figure: LP3/1 event cycle, events 1 to 4, with the saved path marked]

Cycle Time = $3\,T_{EVAL} + T_{DETECT}$

Savings over PS0: 1 Precharge + 1 Completion Detection
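As a cross-check, the LP3/1 cycle time (three evaluations plus one completion detection) and the stated savings together imply a cycle time for PS0. The derivation below is a worked reading of the slide, not a formula taken from it; the symbol $T_{PRECH}$ for the precharge delay is introduced here for illustration.

```latex
\begin{align*}
T_{LP3/1} &= 3\,T_{EVAL} + T_{DETECT} \\
T_{PS0}   &= T_{LP3/1} + T_{PRECH} + T_{DETECT} \\
          &= 3\,T_{EVAL} + 2\,T_{DETECT} + T_{PRECH}
\end{align*}
```

Under this reading the PS0 cycle contains six delay terms and the LP3/1 cycle four, which agrees with the counts of 6 and 4 events per cycle quoted for the two protocols.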

14

LP3/1: Inside a Stage

Merging 2 Control Inputs:
  PC (From Stage N+1) and Eval (From Stage N+2) are merged by a NAND gate
  [Figure labels: “early Eval”, “old Eval”]

Timing Issues:
  must satisfy several simple constraints
  Ex.: PC must arrive before Eval de-asserted
  1-sided timing requirement
  easily satisfied in practice
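The merge of the two control inputs can be sketched as a truth table. This is a minimal illustration under stated assumptions, not the circuit from the slide: it assumes the stage precharges only while PC (from N+1) is asserted and Eval (from N+2) is not, and that the NAND sees Eval through an inverted input. The function names are hypothetical.

```python
from itertools import product


def in_precharge(pc_from_n1: bool, eval_from_n2: bool) -> bool:
    """Assumed merge rule: precharge only while PC is asserted and Eval is not."""
    return pc_from_n1 and not eval_from_n2


def merged_control(pc_from_n1: bool, eval_from_n2: bool) -> bool:
    """Assumed NAND realization: output is low (False) exactly during precharge."""
    return not (pc_from_n1 and (not eval_from_n2))


# Print the full truth table of the assumed merge rule.
for pc, ev in product([False, True], repeat=2):
    print(f"PC={int(pc)} Eval={int(ev)} -> precharge={in_precharge(pc, ev)}")
```

Under these assumptions the one-sided timing requirement has a simple reading: once the early Eval has released the stage, PC must be withdrawn before Eval is de-asserted, otherwise the stage would see PC=1, Eval=0 and precharge a second time.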

15

Dual-Rail Design #2: LP2/2

Optimization = “early done”

Idea: move completion detector before processing block
  stage indicates when “about to” precharge/evaluate

[Figure: “early” Completion Detector placed ahead of the Processing Block, between Data in and Data out, producing “early done”]

16

LP2/2 Completion Detector

Modified completion detectors needed:
  Done=1 when stage starts evaluating, and inputs valid
  Done=0 when stage starts precharging

[Figure: asymmetric C-element (C) producing Done; its inputs are PC and one OR gate per dual-rail bit (bit0, bit1, ..., bitn), with the OR outputs on “+” inputs]
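The set and reset conditions of the early completion detector can be modeled behaviorally. The sketch below is an assumption-laden illustration: it treats each dual-rail bit as a pair of rails that is valid when either rail is high, and it assumes the C-element holds its previous value when neither condition applies. The function name and argument names are hypothetical.

```python
from typing import Sequence, Tuple


def early_done(prev_done: bool, precharging: bool,
               rails: Sequence[Tuple[bool, bool]]) -> bool:
    """One update of the assumed asymmetric C-element behavior."""
    if precharging:
        return False  # Done falls as soon as the stage starts precharging
    inputs_valid = all(t or f for t, f in rails)  # one OR per dual-rail bit
    if inputs_valid:
        return True   # stage is evaluating and every input bit has arrived
    return prev_done  # otherwise hold the previous value


done = False
done = early_done(done, False, [(True, False), (False, False)])  # second bit missing
print(done)
done = early_done(done, False, [(True, False), (False, True)])   # all bits valid
print(done)
done = early_done(done, True, [(True, False), (False, True)])    # precharge starts
print(done)
```

The asymmetry shows up in the code as two different conditions: the rising edge needs both the evaluate condition and all input bits, while the falling edge depends on the precharge control alone.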

17

LP2/2 Protocol

Completion Detection:
  performed in parallel with evaluation/precharge of stage

[Figure: event sequence across stages N, N+1, N+2, events numbered 1 to 4]
  N evaluates
  N+1 evaluates
  “early done” of N+1 eval
  “early done” of N+2 eval
  “early done” of N+1 prech

18

LP2/2 Performance

[Figure: LP2/2 event cycle, events 1 to 4]

Cycle Time = $2\,T_{EVAL} + 2\,T_{DETECT}$

LP2/2 savings over PS0: 1 Evaluation + 1 Precharge

19

Dual-Rail Design #3: LP2/1

Hybrid of LP3/1 and LP2/2. Combines:
  early evaluation of LP3/1
  early done of LP2/2

Cycle Time = $2\,T_{EVAL} + T_{DETECT}$
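To see how the three dual-rail cycle-time formulas compare, they can be evaluated side by side. This is a rough sketch: the delay values are hypothetical unit delays, not measurements, and the PS0 entry is not given on the slides but is inferred from the savings quoted for LP3/1 (1 Precharge + 1 Completion Detection).

```python
def cycle_times(t_eval: float, t_detect: float, t_prech: float) -> dict:
    """Cycle time of each dual-rail style, in the same unit as the inputs."""
    return {
        # Inferred: LP3/1 cycle plus the precharge and detection it saves.
        "PS0": 3 * t_eval + 2 * t_detect + t_prech,
        "LP3/1": 3 * t_eval + t_detect,
        "LP2/2": 2 * t_eval + 2 * t_detect,
        "LP2/1": 2 * t_eval + t_detect,
    }


# Hypothetical delays in arbitrary units, chosen only for illustration.
times = cycle_times(t_eval=1.0, t_detect=1.0, t_prech=1.0)
for name, t in times.items():
    print(f"{name}: cycle = {t:.1f}, speedup over PS0 = {times['PS0'] / t:.2f}x")
```

With equal unit delays LP3/1 and LP2/2 tie; which of the two is faster in a real design depends on whether evaluation or completion detection is the slower step.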

20

Lookahead Pipelines: Overview

5 New Designs:

“Dual-Rail” Data Signaling:
  LP3/1: “early evaluation”
  LP2/2: “early done”
  LP2/1: “early evaluation” + “early done”

“Single-Rail” Bundled-Data Signaling:
  LPSR2/2: “early done”
  LPSR2/1: “early evaluation” + “early done”

21

Single-Rail Design: LPSR2/1

Derivative of LP2/1, adapted to single-rail:
  bundled-data: matched delays instead of completion detectors

“Ack” to previous stages is “tapped off early”
  once in evaluate (precharge), dynamic logic insensitive to input changes

[Figure: pipeline stages with matched delay elements]

22

Inside an LPSR2/1 Stage

PC and Eval are combined exactly as in LP3/1

“done” generated by an asymmetric C-element
  done=1 when stage evaluates, and data inputs valid
  done=0 when stage precharges

[Figure: stage with data in, data out, “req” in, “req” out through a matched delay, and “ack”; PC (From Stage N+1) and Eval (From Stage N+2) feed a NAND; an asymmetric C-element (aC, “+” input) generates done]

23

LPSR2/1 Protocol

Cycle Time = $2\,T_{EVAL} + T_{aC}$

$T_{aC}$ = Delay through asymmetric C-element

[Figure: event sequence across stages N, N+1, N+2, events numbered 1 to 3]
  N evaluates
  N+1 evaluates
  N+1 indicates “done”
  N+2 evaluates
  N+2 indicates “done”

24

FIFO Results (simulations)

0.19 CMOS, 3.3 V, 300°K

Design     Throughput (Giga items/sec)   Improvement (factor over PS0)
dual-rail:
PS0        0.51                          1
LP3/1      0.69                          1.3
LP2/2      0.90                          1.8
LP2/1      1.04                          2.0
single-rail:
LPSR2/2    1.31                          2.6
LPSR2/1    1.55                          3.0
HC         1.75                          3.4

LP dual-rail: over 80% faster than Williams’ PS0; comparable latency
LP single-rail: even faster
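The improvement figures in the table read as throughput ratios relative to PS0 rather than percentages. The short sketch below recomputes them from the throughput column and also converts each throughput to time per item; the variable names are illustrative. Because the published throughputs are rounded to two decimals, a recomputed ratio can differ from the table in the last digit (for example 0.69 / 0.51 is about 1.35, against 1.3 in the table).

```python
# Throughput in giga items per second, copied from the results table.
throughput_gips = {
    "PS0": 0.51, "LP3/1": 0.69, "LP2/2": 0.90, "LP2/1": 1.04,
    "LPSR2/2": 1.31, "LPSR2/1": 1.55, "HC": 1.75,
}

baseline = throughput_gips["PS0"]
for design, gips in throughput_gips.items():
    ratio = gips / baseline    # speedup factor relative to PS0
    ns_per_item = 1.0 / gips   # 1 / (1e9 items per second), expressed in ns
    print(f"{design:8s} {gips:5.2f} G items/s  {ratio:4.2f}x  {ns_per_item:5.3f} ns/item")
```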

25

Practicality of Gate-Level Pipelining

When datapath is wide:
  Can often split into narrow “streams”
  Use “localized” completion detector for each stream:
    need to examine only a few bits: small fan-in
    send “done” to only a few gates: small fan-out
  comp. det. fairly low cost!

[Figure: datapath width = 32 dual-rail bits!; comp. det. fan-in = 2; done fan-out = 2]
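The idea of localized completion detection can be sketched as a partition of the datapath. This is only an illustration: it assumes the 32 dual-rail bits can be grouped into independent streams of 2 bits each (matching the fan-in of 2 in the figure), and it models a bit as valid when either of its rails is high. The helper names are hypothetical.

```python
from typing import List, Tuple

Bit = Tuple[bool, bool]  # (true_rail, false_rail) of one dual-rail bit


def split_streams(bits: List[Bit], width: int) -> List[List[Bit]]:
    """Partition the datapath into consecutive streams of `width` bits."""
    return [bits[i:i + width] for i in range(0, len(bits), width)]


def stream_done(stream: List[Bit]) -> bool:
    """Local completion detector: examines only the bits of one stream."""
    return all(t or f for t, f in stream)


datapath = [(True, False)] * 32          # 32 dual-rail bits, all valid
streams = split_streams(datapath, width=2)
print(len(streams), "streams, detector fan-in =", len(streams[0]))
print(all(stream_done(s) for s in streams))
```

Each detector looks at two bits instead of all 32, which is the small fan-in the slide refers to; whether a given datapath can be split this way depends on the function being computed.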

26

High-Capacity Pipelines

Singh/Nowick, WVLSI-00, ISSCC-02, Async-02

27

HC Pipeline Style

High-Capacity Pipelines (HC)
  bundled datapaths; dynamic logic function blocks
  latch-free: no explicit latches needed
    dynamic logic provides implicit latching
  novel highly-concurrent protocol maximizes storage capacity
    traditional latch-free approaches: “spacers” limit capacity to 50%

Key Idea: Obtain greater control of stage’s operation
  separate control of pull-up/pull-down
  result = new “isolate phase”
  stage holds outputs/impervious to input changes

Advantage: Each stage can hold a distinct data item
  100% storage capacity

Extra Benefit: Obtain greater concurrency
  High throughput
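The 50% and 100% capacity figures follow from simple counting. The sketch below uses a deliberately simplified model, assumed here for illustration: when spacers are required, every data item occupies one stage and is paired with one spacer stage; with an isolate phase, every stage can hold its own item. The function name is hypothetical.

```python
def max_distinct_tokens(n_stages: int, needs_spacer: bool) -> int:
    """Upper bound on distinct data items held at once (simplified model)."""
    if needs_spacer:
        return n_stages // 2  # each item is paired with one spacer stage
    return n_stages           # isolate phase: every stage holds its own item


for n in (4, 8, 16):
    with_spacers = max_distinct_tokens(n, needs_spacer=True)
    with_isolate = max_distinct_tokens(n, needs_spacer=False)
    print(f"{n} stages: {with_spacers} items with spacers, {with_isolate} with isolate phase")
```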

28

HC: Basic Structure

Key Idea: 2 independent control signals:
  pc: controls precharge
  eval: controls evaluation

Allows novel 3-phase cycle:
  Evaluate
  “Isolate” (hold)
  Precharge

Single-rail “Bundled Datapath”:
  matched delay: produces delayed “done” signal
  worst-case delay: longer than slowest path for data

[Figure: stages N, N+1, N+2, each with a stage controller driving pc and eval, a matched delay element, and an ack signal]

29

HC: Inside a Stage

Independent Controls of pull-up and pull-down:
  allows new 3rd phase: “isolate”

  pc asserted: precharge
  eval asserted: evaluate
  pc and eval de-asserted: enter “isolate” (hold) phase

[Figure: dynamic gate with inputs, outputs and a “keeper”; pc controls precharge, eval controls evaluation]

30

HC: Protocol

Most Existing Protocols: 3 synchronization arcs
  1 forward arc: data dependency
  2 backward arcs: control synchronization

Our protocol: only 2 synchronization arcs
  only 1 backward arc
  once stage N+1 evaluates, N can complete entire next cycle!

[Figure: phase cycles of Stage N and Stage N+1, each Eval, Isolate, Precharge, with one arc between the stages marked X]
  Eval: pc=1, eval=1
  Isolate: pc=1, eval=0
  Precharge: pc=0, eval=0
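The three phases can be decoded directly from the two control levels listed for the figure (Eval: pc=1, eval=1; Isolate: pc=1, eval=0; Precharge: pc=0, eval=0). The sketch below is a small lookup, with hypothetical names; since no phase is listed for pc=0, eval=1, the sketch rejects that combination rather than guessing its meaning.

```python
from enum import Enum


class Phase(Enum):
    EVALUATE = "evaluate"
    ISOLATE = "isolate"
    PRECHARGE = "precharge"


# Signal levels (pc, eval) for each phase, as listed for the figure.
PHASE_BY_LEVELS = {
    (1, 1): Phase.EVALUATE,
    (1, 0): Phase.ISOLATE,
    (0, 0): Phase.PRECHARGE,
}


def decode_phase(pc: int, eval_: int) -> Phase:
    """Map the two control levels to a phase; reject unlisted combinations."""
    try:
        return PHASE_BY_LEVELS[(pc, eval_)]
    except KeyError:
        raise ValueError(f"pc={pc}, eval={eval_} is not a listed phase") from None


cycle = [(1, 1), (1, 0), (0, 0)]
print([decode_phase(pc, ev).value for pc, ev in cycle])
```

Stepping from Eval to Isolate and from Isolate to Precharge changes exactly one of the two signals each time, which is what having independent controls for pull-up and pull-down makes possible.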

31

Formal Specification of Controller

Problem: Specification too concurrent for direct synthesis
  desired precharge condition: N and N+1 have evaluated same data
  problem: this condition not uniquely captured by given signals!
    N may evaluate next data item, while N+1 stuck on current item!

[Figure: signal transition graph of the controller]
  eval+ (Start evaluate)
  S+ (Evaluate complete)
  eval- (Isolate)
  pc+ (Start precharge)
  S- (Precharge complete)
  pc-
  T+ (Evaluate of N+1 complete)
  T- (Precharge of N+1 complete)

32

Modified Specification of Controller

Solution: Add a state variable ok2pc
  ok2pc records whether N+1 has “absorbed” N’s data item
  ok2pc resets immediately when N deletes item (N precharges)
  ok2pc is set when N+1 deletes item (N+1 precharges)

[Figure: signal transition graph extended with ok2pc+ and ok2pc-; other transitions: pc+, eval+, S+, eval-, pc-, S-, T+ (Evaluate of N+1 complete), T- (Precharge of N+1 complete)]
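The role of ok2pc can be illustrated with a toy model of stage N's precharge decision. The sketch rests on assumptions that are not stated on the slide: that precharge is allowed when S, T and ok2pc are all true, and that ok2pc starts out true because N+1 is initially empty. The class and method names are hypothetical; S and T follow the slide's use (evaluation of N and of N+1 complete).

```python
class StageNControl:
    """Toy model of stage N's precharge decision (assumed rule)."""

    def __init__(self) -> None:
        self.S = False      # stage N has completed evaluation
        self.T = False      # stage N+1 has completed evaluation
        self.ok2pc = True   # assumed initial value: N+1 starts empty

    def naive_ok(self) -> bool:
        return self.S and self.T  # condition without the state variable

    def can_precharge(self) -> bool:
        return self.S and self.T and self.ok2pc

    def n_evaluates(self) -> None:
        self.S = True

    def n_precharges(self) -> None:
        if not self.can_precharge():
            raise RuntimeError("precharge not allowed yet")
        self.S = False
        self.ok2pc = False  # reset as soon as N deletes its item

    def n1_evaluates(self) -> None:
        self.T = True

    def n1_precharges(self) -> None:
        self.T = False
        self.ok2pc = True   # set when N+1 deletes its item


c = StageNControl()
c.n_evaluates()    # N computes item A
c.n1_evaluates()   # N+1 absorbs item A
c.n_precharges()   # allowed: S, T and ok2pc are all true
c.n_evaluates()    # N computes item B while N+1 still holds A
hazard = (c.naive_ok(), c.can_precharge())
print(hazard)
c.n1_precharges()  # N+1 deletes item A
c.n1_evaluates()   # N+1 absorbs item B
print(c.can_precharge())
```

The middle of the trace is the case the slide describes: S and T are both true, yet they refer to different data items, and only ok2pc tells the controller that N+1 has not yet taken the new item.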

33

Controller implementation

Controller implementation is very simple:
  each signal implemented using a single gate
  ok2pc typically off the critical path

[Figure: gates INV, NAND3 and aC (“+” input); signals S, T, ok2pc, pc, eval]

34

HC: Stage Implementation

state variable: off the critical path
self-loop: key to fast “isolation”
early ack

[Figure: stage with req, done, ack and a matched delay; NAND, INV and aC (“+” input) gates generate pc and eval from signals from current stage and from next stage]

35

HC: Operation

[Figure: event sequence across stages N and N+1, events numbered 1 to 3, annotated with (fast self-loop), (fast self-loop) and (early Ack)]
  N evaluates
  N+1 starts to evaluate
  N isolates
  N precharges
  N enables itself for next evaluation

Cycle Time = 8 CMOS gate delays

36

Performance

Cycle Time = $T_{EVAL} + T_{aC} + T_{NAND3} + T_{PRECH} + T_{INV}$

[Figure: event sequence across stages N, N+1, N+2, events numbered 1 to 3]
  N evaluates
  N+1 evaluates
  N isolates
  N precharges
  N enables itself for next evaluation

37

FIFO Results (simulations)

0.19 CMOS, 3.3 V, 300°K

Design     Throughput (Giga items/sec)   Improvement (factor over PS0)
dual-rail:
PS0        0.51                          1
LP3/1      0.69                          1.3
LP2/2      0.90                          1.8
LP2/1      1.04                          2.0
single-rail:
LPSR2/2    1.31                          2.6
LPSR2/1    1.55                          3.0
HC         1.75                          3.4

LP dual-rail: over 80% faster than Williams’ PS0; comparable latency
LP single-rail: even faster

38

Fabricated Chip: HC FIFO

2.5 GHz in 0.18u