Design of a High-Speed
Asynchronous Turbo Decoder
Pankaj Golani, George Dimou, Mallika Prakash and Peter A. Beerel
Asynchronous CAD/VLSI Group, Ming Hsieh Electrical Engineering Department
University of Southern California
ASYNC 2007 – Berkeley, California, March 12th 2007
Motivation and Goal
Mainstream acceptance of asynchronous design
• Leverage off of ASIC standard-cell library-based design flow
• Achieve benefits significant enough to overcome synchronous momentum

Our research goals for async designs…
• High-speed standard-cell flow
• Applications where async designs yield significant improvement
  • throughput and throughput per area
  • energy efficiency
Single Track Full Buffer (Ferretti’02)
Follows 2-phase protocol
High-performance standard-cell circuit family
Comparison to synchronous standard cells
• 4.5x better latency
• 1+ GHz in 0.18µm
  • ~2.4x faster than synchronous
• 2.8x more area
[Figure: STFB pipeline stage, showing the forward path and reset path between 1-of-N channels L and R]
Block Processing –
Pipelining and Parallelism
[Figure: people processing cases in parallel pipelines (photo: Steinhart Aquarium)]
• K cases, M people (pipelines), latency l
• Let c be the per-person cycle time
• First M cases arrive at t = l
• Subsequent M cases arrive every c time units

Throughput = K / ExecutionTime, where ExecutionTime = l + c·(K/M − 1)

Consider two scenarios
• Baseline: cycle time C1, latency L1
• Improved: cycle time C2 = C1/2.4, latency L2 = L1/4.5

Questions
• How does cycle time affect throughput?
• How does latency affect throughput?
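The block-processing model on this slide (first M cases finish at t = l, then M more every c time units) can be sketched directly; the cycle time and latency values below are arbitrary illustrative numbers, not measured data.

```python
def execution_time(K, M, c, l):
    """Time to finish K cases on M pipelines: first M at t = l,
    each subsequent group of M takes c more (assumes M divides K)."""
    return l + c * (K / M - 1)

def throughput(K, M, c, l):
    return K / execution_time(K, M, c, l)

# Baseline vs. improved (cycle time /2.4, latency /4.5), as on the slide.
c1, l1 = 1.0, 20.0            # hypothetical baseline values
c2, l2 = c1 / 2.4, l1 / 4.5

# The throughput ratio moves from ~4.5 (latency ratio) at small K
# toward 2.4 (cycle-time ratio) as K grows.
for K in (4, 1200):
    ratio = throughput(K, 1, c2, l2) / throughput(K, 1, c1, l1)
    print(K, round(ratio, 2))
```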
Block Processing –
Combined Cycle Time and Latency Effect
Large K: throughput ratio ≈ cycle time ratio
Small K: throughput ratio ≈ latency ratio
[Chart: Throughput vs number of cases (K = 0–45), Baseline vs Improved; annotated ratio 4.32]
[Chart: Throughput vs number of cases (K = 0–1200), Baseline vs Improved; annotated ratio 2.6]
Talk Outline
• Turbo coding and decoding – an introduction
• Tree soft-input soft-output (SISO) decoder
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Turbo Coding – Introduction
Error correcting codes
• Add redundancy
• The input data is K bits
• The output code word is N bits (N > K)
• The code rate is r = K/N
• Types of codes
  • Linear code
  • Convolutional code (CC)
  • Turbo code
[Figure: encoder maps a K-bit input to an N-bit code word]
Turbo Encoding - Introduction
Turbo Encoding
• Berrou, Glavieux and Thitimajshima (1993)
• Performance close to Shannon channel capacity
• Typically uses two convolutional codes and an interleaver
• Interleaver used to improve error correction
  • increases minimum distance of the code
  • creates a large block code
[Figure: turbo encoder with outer CC, interleaver, and inner CC]
Turbo Decoding
[Figure: turbo decoder with received-data memory, inner SISO, de-interleaver, interleaver, and outer SISO]
Turbo decoder components
• Two soft-in soft-out (SISO) decoders
  • one for the inner CC and one for the outer CC
  • soft input: a priori estimates of input data
  • soft output: a posteriori estimates of input data
  • SISO often based on Min-Sum formulation
• Interleaver / de-interleaver
  • maps SISO outputs to SISO inputs
  • same permutation as used in the encoder
Iterative nature of the algorithm leads to block processing
• One SISO must finish before the next SISO starts
The Decoding Problem
Requires finding paths in a graph called a trellis
• Node: state j of the encoder at time index k
• Edge: represents receiving a 0 or 1 in the node for state j at time k
• Path: represents a possible decoded sequence
  • the algorithm finds multiple paths
Example trellis
• For a 2-state encoder, encoding K bits
[Figure: example trellis from t = 0 to t = K, distinguishing edges for sent bit 1 and sent bit 0; a decoded sequence 0 1 0 0 0 1 0 1 0 0 is marked at t = k]
Min-Sum SISO Problem Formulation
Branch and path metrics
• Branch metric (BM)
  • indicates difference between expected and received values
• Path metric
  • sum of associated branch metrics
Min-Sum formulation: for each time index k find
• minimum path metric over all paths for which bit k = 1
• minimum path metric over all paths for which bit k = 0
[Figure: example trellis from t = 0 to t = K with branch metrics labeled; minimum path metric when bit k = 1 is 13, and when bit k = 0 is 16]
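A brute-force version of this Min-Sum formulation can be written in a few lines. The trellis below is a made-up toy (2 states, state = previous input bit, arbitrary branch metrics), not the slide's example; it only illustrates "minimum path metric over all paths with bit k fixed".

```python
from itertools import product

# Hypothetical 2-state trellis: bm[k][(s, b)] is the branch metric for
# taking input bit b from state s at step k (toy values).
bm = [
    {(0, 0): 1, (0, 1): 2, (1, 0): 0, (1, 1): 3},
    {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 2},
    {(0, 0): 0, (0, 1): 3, (1, 0): 2, (1, 1): 1},
    {(0, 0): 1, (0, 1): 1, (1, 0): 3, (1, 1): 0},
]
K = len(bm)

def min_path_metric(k, bit):
    """Minimum path metric over all length-K paths whose k-th bit is `bit`."""
    best = float("inf")
    for bits in product((0, 1), repeat=K):
        if bits[k] != bit:
            continue
        state, metric = 0, 0              # encoder starts in state 0
        for step, b in enumerate(bits):
            metric += bm[step][(state, b)]
            state = b                     # toy next-state: remember last bit
        best = min(best, metric)
    return best

# The Min-Sum SISO outputs, per time index k, the pair of minima below.
for k in range(K):
    print(k, min_path_metric(k, 0), min_path_metric(k, 1))
```

This enumeration is exponential in K; the conventional and tree SISOs in the following slides compute the same quantities efficiently.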
Talk Outline
• Turbo coding and decoding – an introduction
• Tree SISO low-latency turbo decoder architecture
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Conventional SISO - O(K) latency
Calculation of the minimum path can be divided into two phases
• Forward state metric for time k and state j:
  F_k^j = min( F_{k−1}^0 + BM_k^{0,j}, F_{k−1}^1 + BM_k^{1,j} )
• Backward state metric for time k and state j:
  B_k^j = min( B_{k+1}^0 + BM_{k+1}^{j,0}, B_{k+1}^1 + BM_{k+1}^{j,1} )
Data dependency loop prevents pipelining
• Cycle time limited to latency of 2-way ACS
• Latency is O(K)
[Figure: trellis around t = k showing the forward metric F_k^j and backward metric B_k^j, with edges distinguishing received bit 1 and received bit 0]
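The forward/backward recursions can be sketched as follows, again on a hypothetical 2-state toy trellis whose next state equals the input bit (an assumed convention for illustration, not the actual encoder in the talk).

```python
INF = float("inf")

# Hypothetical branch metrics: bm[k][(s, b)] for input bit b from state s.
bm = [
    {(0, 0): 1, (0, 1): 2, (1, 0): 0, (1, 1): 3},
    {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 2},
    {(0, 0): 0, (0, 1): 3, (1, 0): 2, (1, 1): 1},
    {(0, 0): 1, (0, 1): 1, (1, 0): 3, (1, 1): 0},
]
K = len(bm)

def forward(bm, K):
    """F[k][j]: minimum metric over paths reaching state j at time k."""
    F = [[INF, INF] for _ in range(K + 1)]
    F[0] = [0, INF]                       # encoder starts in state 0
    for k in range(1, K + 1):             # F[k] needs all of F[k-1]: O(K)
        for s in (0, 1):                  # state at time k-1
            for b in (0, 1):              # input bit; toy next state = b
                F[k][b] = min(F[k][b], F[k - 1][s] + bm[k - 1][(s, b)])
    return F

def backward(bm, K):
    """B[k][j]: minimum metric over paths from state j at time k to t = K."""
    B = [[INF, INF] for _ in range(K + 1)]
    B[K] = [0, 0]                         # any final state allowed
    for k in range(K - 1, -1, -1):
        for s in (0, 1):
            for b in (0, 1):
                B[k][s] = min(B[k][s], B[k + 1][b] + bm[k][(s, b)])
    return B

F, B = forward(bm, K), backward(bm, K)
# Min-Sum output for bit k: forward metric + branch metric + backward metric.
for k in range(K):
    out = {bit: min(F[k][s] + bm[k][(s, bit)] + B[k + 1][bit] for s in (0, 1))
           for bit in (0, 1)}
    print(k, out)
```

The serial dependency of F[k] on F[k−1] (and B[k] on B[k+1]) is exactly the loop that limits the conventional SISO to O(K) latency.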
Tree SISO – low latency architecture
Tree SISO (Beerel/Chugg JSAC'01)
• Calculate BMs for larger and larger segments of the trellis: BM_{k,l}^{i,j} is the minimum metric from state i at time k to state j at time l
• Adjacent segments merge by minimizing over the intermediate state m:
  BM_{0,4}^{i,j} = min_m ( BM_{0,2}^{i,m} + BM_{2,4}^{m,j} ), e.g. min(1+2, 2+2) = 3
• Analogous to creating group-wise PG logic for tree adders
• Tree SISO can process the entire trellis in parallel
• No data dependency loops, so finer pipelining is possible
• Latency is O(log K)
[Figure: example trellis segments over t = 0…4 with branch metrics]
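The segment-combining step described above is a min-plus "matrix product" over segment branch-metric matrices, and because that product is associative the segments can be merged in a balanced tree. A sketch with hypothetical 2-state metrics:

```python
from functools import reduce

def merge(A, B):
    """Min-plus product: best metric from state i to state j across two
    adjacent trellis segments, minimizing over the middle state m."""
    n = len(A)
    return [[min(A[i][m] + B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def tree_combine(segments):
    """Combine one-step segments pairwise in a balanced tree.

    Tree depth is O(log K), and all merges at a level are independent,
    so there are no data dependency loops across the trellis."""
    while len(segments) > 1:
        nxt = [merge(segments[i], segments[i + 1])
               for i in range(0, len(segments) - 1, 2)]
        if len(segments) % 2:             # odd segment passes through
            nxt.append(segments[-1])
        segments = nxt
    return segments[0]

# Four hypothetical one-step branch-metric matrices (2 states each).
segs = [[[1, 2], [0, 3]],
        [[2, 1], [1, 2]],
        [[0, 3], [2, 1]],
        [[1, 1], [3, 0]]]
# Associativity: the tree gives the same whole-trellis segment metrics
# as a left-to-right (O(K)-latency) sweep.
print(tree_combine(list(segs)) == reduce(merge, segs))
```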
Remainder of Talk Outline
• Turbo Coding – an introduction
• Turbo Decoding
• Tree SISO low-latency turbo decoder architecture
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Synchronous Base-Line Turbo Decoder
• Synchronous turbo decoder baseline
  • IBM 0.18µm Artisan standard-cell library
  • SCCC code with rate ½
  • 6 decoding iterations
• Gate-level pipelined to achieve high throughput
  • Performed timing-driven P&R
  • Peak frequency of 475 MHz
  • SISO area of 2.46 mm²
• To achieve high throughput, multiple blocks instantiated
Asynchronous Turbo Decoder
Static Single Track Full Buffer standard-cell library (Golani'06)
• Total of (only) 14 cells in IBM 0.18µm process
• Extensive SPICE simulations were performed
  • optimized trade-off between performance and robustness
Chip design
• Standard ASIC place-and-route flow (congestion-based)
• ECO optimization flow
Chip-level simulation
• Performed on critical sub-block (55K transistors)
• Verified timing constraints
• Measured latency and throughput using Synopsys NanoSim
Static Single Track Full Buffer (Ferretti'01)
Statically driven line → improves noise margin
1-of-N static single-track protocol
• one end holds the line low and drives it high to send
• the other end holds the line high and drives it low to consume
[Figure: SSTFB transistor-level schematic with keepers; sender and receiver exchanging 1-of-N data over an SST channel]
Asynchronous Implementation Challenges - I
[Figure: unbalanced fork-join pipeline]
• Degradation in throughput
  • Unbalanced fork and join structure
  • The token on the short branch is stalled due to the imbalance
  • This slows down the entire fork-join
[Figure: fork-join with an added slack-matching buffer]
• Slack matching
  • Improves throughput by adding a pipeline buffer
  • Identify fork/join bottlenecks and resolve them by adding buffers
  • After P&R, long wires can also create this problem
  • Solved by adding buffers on long wires using the ECO flow
Asynchronous Implementation Challenges - II
• SSTFB implements only point-to-point communication
• One option: dedicated fork cells
  • Creates another pipeline stage
  • Slack-matching buffers are then needed on the other paths
• Instead: integrate the fork within the full adder
  • 45% less area than a separate full adder and fork
  • Decreases the number of slack-matching buffers required
[Figure: adder network with a separate FORK cell vs. a full adder with integrated fork (FAFork)]
Asynchronous Implementation Challenges – III
• 60% of the design is slack-matching buffers
• Most of these buffers occur in linear chains
• To save area and power, two new cells were created
  • SLACK2: 17% area and 10% power improvement (vs. two buffers)
  • SLACK4: 30% area and 19% power improvement (vs. four buffers)
[Figure: a linear chain of buffers replaced by SLACK2 and SLACK4 cells]
Remainder of Talk Outline
• Turbo Coding – an introduction
• Turbo Decoding
• Tree SISO low-latency turbo decoder architecture
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Comparisons
• Synchronous
  • Peak frequency of 475 MHz
  • Logic area of 2.46 mm²
• Asynchronous
  • Peak frequency of 1.15 GHz
  • Logic area of 6.92 mm²
• Design time comparison
  • Synchronous: ~4 graduate-student months
  • Asynchronous: ~12 graduate-student months
Synch vs Async
Throughput = K / ExecutionTime, where ExecutionTime = l + c·(K/M − 1)
• K bits per block, M pipelined 8-bit Tree SISOs, latency l
• Let c be the sync clock cycle time (475 MHz)
• First M bits arrive at t = l
• Subsequent M bits arrive every c time units
Two implementations
• Synch: cycle time C1, latency L1
• Async: cycle time C2 = C1/2.4, latency L2 = L1/4.5
Desired comparisons
• Throughput comparison vs block size
• Energy comparison vs block size
[Figure: decoder organization with received memory, interleaver/de-interleaver, and the M Tree SISOs processing K-bit blocks]
Comparisons – Throughput / Area
• For small block sizes, asynchronous provides better throughput/area
• As block size increases, the two implementations become comparable
• For block sizes of 512 bits, synchronous cannot achieve the async throughput
[Chart: throughput/area (Mbps/mm²) vs block size (0–5000 bits), Async (M=1) vs Sync (variable M); annotated ratios 3.91 (M=11), 2.13 (M=8), 1.28 (M=3)]
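The throughput/area trend can be sketched with the same block-processing model. Only the cycle times (475 MHz sync, 1.15 GHz async, i.e. the 2.4x ratio) and per-SISO logic areas (2.46 / 6.92 mm²) come from the talk; the pipeline latency below is a hypothetical assumption, so the exact crossover point is illustrative only.

```python
C_SYNC, A_SYNC = 1 / 0.475, 2.46     # ns per cycle, mm^2 of logic per SISO
C_ASYNC, A_ASYNC = 1 / 1.15, 6.92

L_SYNC = 200 * C_SYNC                # assumed pipeline latency (hypothetical)
L_ASYNC = L_SYNC / 4.5               # latency ratio reported in the talk

def throughput_mbps(K, M, c_ns, l_ns):
    """Block-processing model: first M bits at t = l, M more every c."""
    return K / (l_ns + c_ns * (K / M - 1)) * 1e3

def async_advantage(K, M_sync):
    """Async (M = 1) throughput/area divided by sync with M_sync copies."""
    sync = throughput_mbps(K, M_sync, C_SYNC, L_SYNC) / (M_sync * A_SYNC)
    asyn = throughput_mbps(K, 1, C_ASYNC, L_ASYNC) / A_ASYNC
    return asyn / sync

# The async advantage shrinks as the block size grows and the synchronous
# design amortizes its longer latency.
print(async_advantage(64, 1), async_advantage(4096, 1))
```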
Comparisons – Energy/Block
For equivalent throughputs and small block sizes, asynchronous is more energy efficient than synchronous
Async advantages grow with a larger async library (e.g., with BUF1of4)
[Chart: energy per block (nJ, 0–14) vs block size (0–5000 bits), Async (M=1) vs Synch (variable M)]
Conclusions
• Asynchronous turbo decoder vs. synchronous baseline
  • static STFB offers significant improvements for small block sizes
    • more than 2x throughput/area
    • higher peak throughput (~500 Mbps)
    • more energy efficient
  • well-suited for low-latency applications (e.g. voice)
• High-performance async is advantageous for applications that require
  • high performance (e.g., pipelining)
  • low latency
  • block processing for which parallelism has diminishing returns
    • synchronous design requires extensive parallelism to achieve equivalent throughput
Future Work
• Library design
  • Larger library with more than one size per cell
  • 1-of-4 encoding
• Async CAD
  • Automated slack matching
  • Static timing analysis
Questions?