Design of a High-Speed
Asynchronous Turbo Decoder
Pankaj Golani, George Dimou, Mallika Prakash and Peter A. Beerel
Asynchronous CAD/VLSI Group, Ming Hsieh Electrical Engineering Department
University of Southern California
ASYNC 2007 – Berkeley, California, March 12th 2007
Motivation and Goal
Mainstream acceptance of asynchronous design
• Leverage off of ASIC standard-cell library-based design flow
• Achieve benefits significant enough to overcome synchronous momentum

Our research goals for async designs…
• High-speed standard-cell flow
• Applications where async designs yield significant improvement
  • throughput and throughput per area
  • energy efficiency
Single Track Full Buffer (Ferretti’02)
Follows 2-phase protocol
High-performance standard-cell circuit family
Comparison to synchronous standard cells
• 4.5x better latency
• 1+ GHz in 0.18µm
  • ~2.4x faster than synchronous
• 2.8x more area
[Figure: STFB pipeline stage, showing the forward path and reset path between 1-of-N channels L and R]
Block Processing –
Pipelining and Parallelism
[Figure: people processing cases in parallel pipelines (photo: Steinhart Aquarium)]
• K cases, M people (pipelines), latency l
• Let c be the per-person cycle time
• First M cases arrive at t = l
• Subsequent M cases arrive every c time units

Throughput = K / ExecutionTime, where ExecutionTime = l + c·(K/M − 1)

Consider two scenarios
• Baseline: cycle time C1, latency L1
• Improved: cycle time C2 = C1/2.4, latency L2 = L1/4.5

Questions
• How does cycle time affect throughput?
• How does latency affect throughput?
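The block-processing model on this slide (first M cases finish at t = l, then M more every c time units) can be sketched directly; the cycle time and latency values below are arbitrary illustrative numbers, not measured data.

```python
def execution_time(K, M, c, l):
    """Time to finish K cases on M pipelines: first M at t = l,
    each subsequent group of M takes c more (assumes M divides K)."""
    return l + c * (K / M - 1)

def throughput(K, M, c, l):
    return K / execution_time(K, M, c, l)

# Baseline vs. improved (cycle time /2.4, latency /4.5), as on the slide.
c1, l1 = 1.0, 20.0            # hypothetical baseline values
c2, l2 = c1 / 2.4, l1 / 4.5

# The throughput ratio moves from ~4.5 (latency ratio) at small K
# toward 2.4 (cycle-time ratio) as K grows.
for K in (4, 1200):
    ratio = throughput(K, 1, c2, l2) / throughput(K, 1, c1, l1)
    print(K, round(ratio, 2))
```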
Block Processing –
Combined Cycle Time and Latency Effect
Large K: throughput ratio ≈ cycle time ratio
Small K: throughput ratio ≈ latency ratio
[Chart: Throughput vs number of cases (K = 0–45), Baseline vs Improved; annotated ratio 4.32]
[Chart: Throughput vs number of cases (K = 0–1200), Baseline vs Improved; annotated ratio 2.6]
Talk Outline
• Turbo coding and decoding – an introduction
• Tree soft-input soft-output (SISO) decoder
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Turbo Coding – Introduction
Error correcting codes
• Add redundancy
• The input data is K bits
• The output code word is N bits (N > K)
• The code rate is r = K/N
• Types of codes
  • Linear code
  • Convolutional code (CC)
  • Turbo code
[Figure: encoder maps a K-bit input to an N-bit code word]
Turbo Encoding - Introduction
Turbo Encoding
• Berrou, Glavieux and Thitimajshima (1993)
• Performance close to Shannon channel capacity
• Typically uses two convolutional codes and an interleaver
• Interleaver used to improve error correction
  • increases minimum distance of the code
  • creates a large block code
[Figure: turbo encoder with outer CC, interleaver, and inner CC]
Turbo Decoding
[Figure: turbo decoder with received-data memory, inner SISO, de-interleaver, interleaver, and outer SISO]
Turbo decoder components
• Two soft-in soft-out (SISO) decoders
  • one for the inner CC and one for the outer CC
  • soft input: a priori estimates of input data
  • soft output: a posteriori estimates of input data
  • SISO often based on Min-Sum formulation
• Interleaver / de-interleaver
  • maps SISO outputs to SISO inputs
  • same permutation as used in the encoder
Iterative nature of the algorithm leads to block processing
• One SISO must finish before the next SISO starts
The Decoding Problem
Requires finding paths in a graph called a trellis
• Node: state j of the encoder at time index k
• Edge: represents receiving a 0 or 1 in the node for state j at time k
• Path: represents a possible decoded sequence
  • the algorithm finds multiple paths
Example trellis
• For a 2-state encoder, encoding K bits
[Figure: example trellis from t = 0 to t = K, distinguishing edges for sent bit 1 and sent bit 0; a decoded sequence 0 1 0 0 0 1 0 1 0 0 is marked at t = k]
Min-Sum SISO Problem Formulation
Branch and path metrics
• Branch metric (BM)
  • indicates difference between expected and received values
• Path metric
  • sum of associated branch metrics
Min-Sum formulation: for each time index k find
• minimum path metric over all paths for which bit k = 1
• minimum path metric over all paths for which bit k = 0
[Figure: example trellis from t = 0 to t = K with branch metrics labeled; minimum path metric when bit k = 1 is 13, and when bit k = 0 is 16]
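A brute-force version of this Min-Sum formulation can be written in a few lines. The trellis below is a made-up toy (2 states, state = previous input bit, arbitrary branch metrics), not the slide's example; it only illustrates "minimum path metric over all paths with bit k fixed".

```python
from itertools import product

# Hypothetical 2-state trellis: bm[k][(s, b)] is the branch metric for
# taking input bit b from state s at step k (toy values).
bm = [
    {(0, 0): 1, (0, 1): 2, (1, 0): 0, (1, 1): 3},
    {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 2},
    {(0, 0): 0, (0, 1): 3, (1, 0): 2, (1, 1): 1},
    {(0, 0): 1, (0, 1): 1, (1, 0): 3, (1, 1): 0},
]
K = len(bm)

def min_path_metric(k, bit):
    """Minimum path metric over all length-K paths whose k-th bit is `bit`."""
    best = float("inf")
    for bits in product((0, 1), repeat=K):
        if bits[k] != bit:
            continue
        state, metric = 0, 0              # encoder starts in state 0
        for step, b in enumerate(bits):
            metric += bm[step][(state, b)]
            state = b                     # toy next-state: remember last bit
        best = min(best, metric)
    return best

# The Min-Sum SISO outputs, per time index k, the pair of minima below.
for k in range(K):
    print(k, min_path_metric(k, 0), min_path_metric(k, 1))
```

This enumeration is exponential in K; the conventional and tree SISOs in the following slides compute the same quantities efficiently.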
Talk Outline
• Turbo coding and decoding – an introduction
• Tree SISO low-latency turbo decoder architecture
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Conventional SISO - O(K) latency
Calculation of the minimum path can be divided into two phases
• Forward state metric for time k and state j:
  F_k^j = min( F_{k−1}^0 + BM_k^{0,j}, F_{k−1}^1 + BM_k^{1,j} )
• Backward state metric for time k and state j:
  B_k^j = min( B_{k+1}^0 + BM_{k+1}^{j,0}, B_{k+1}^1 + BM_{k+1}^{j,1} )
Data dependency loop prevents pipelining
• Cycle time limited to latency of 2-way ACS
• Latency is O(K)
[Figure: trellis around t = k showing the forward metric F_k^j and backward metric B_k^j, with edges distinguishing received bit 1 and received bit 0]
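The forward/backward recursions can be sketched as follows, again on a hypothetical 2-state toy trellis whose next state equals the input bit (an assumed convention for illustration, not the actual encoder in the talk).

```python
INF = float("inf")

# Hypothetical branch metrics: bm[k][(s, b)] for input bit b from state s.
bm = [
    {(0, 0): 1, (0, 1): 2, (1, 0): 0, (1, 1): 3},
    {(0, 0): 2, (0, 1): 1, (1, 0): 1, (1, 1): 2},
    {(0, 0): 0, (0, 1): 3, (1, 0): 2, (1, 1): 1},
    {(0, 0): 1, (0, 1): 1, (1, 0): 3, (1, 1): 0},
]
K = len(bm)

def forward(bm, K):
    """F[k][j]: minimum metric over paths reaching state j at time k."""
    F = [[INF, INF] for _ in range(K + 1)]
    F[0] = [0, INF]                       # encoder starts in state 0
    for k in range(1, K + 1):             # F[k] needs all of F[k-1]: O(K)
        for s in (0, 1):                  # state at time k-1
            for b in (0, 1):              # input bit; toy next state = b
                F[k][b] = min(F[k][b], F[k - 1][s] + bm[k - 1][(s, b)])
    return F

def backward(bm, K):
    """B[k][j]: minimum metric over paths from state j at time k to t = K."""
    B = [[INF, INF] for _ in range(K + 1)]
    B[K] = [0, 0]                         # any final state allowed
    for k in range(K - 1, -1, -1):
        for s in (0, 1):
            for b in (0, 1):
                B[k][s] = min(B[k][s], B[k + 1][b] + bm[k][(s, b)])
    return B

F, B = forward(bm, K), backward(bm, K)
# Min-Sum output for bit k: forward metric + branch metric + backward metric.
for k in range(K):
    out = {bit: min(F[k][s] + bm[k][(s, bit)] + B[k + 1][bit] for s in (0, 1))
           for bit in (0, 1)}
    print(k, out)
```

The serial dependency of F[k] on F[k−1] (and B[k] on B[k+1]) is exactly the loop that limits the conventional SISO to O(K) latency.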
Tree SISO – low latency architecture
Tree SISO (Beerel/Chugg JSAC'01)
• Calculate BMs for larger and larger segments of the trellis: BM_{k,l}^{i,j} is the minimum metric from state i at time k to state j at time l
• Adjacent segments merge by minimizing over the intermediate state m:
  BM_{0,4}^{i,j} = min_m ( BM_{0,2}^{i,m} + BM_{2,4}^{m,j} ), e.g. min(1+2, 2+2) = 3
• Analogous to creating group-wise PG logic for tree adders
• Tree SISO can process the entire trellis in parallel
• No data dependency loops, so finer pipelining is possible
• Latency is O(log K)
[Figure: example trellis segments over t = 0…4 with branch metrics]
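The segment-combining step described above is a min-plus "matrix product" over segment branch-metric matrices, and because that product is associative the segments can be merged in a balanced tree. A sketch with hypothetical 2-state metrics:

```python
from functools import reduce

def merge(A, B):
    """Min-plus product: best metric from state i to state j across two
    adjacent trellis segments, minimizing over the middle state m."""
    n = len(A)
    return [[min(A[i][m] + B[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def tree_combine(segments):
    """Combine one-step segments pairwise in a balanced tree.

    Tree depth is O(log K), and all merges at a level are independent,
    so there are no data dependency loops across the trellis."""
    while len(segments) > 1:
        nxt = [merge(segments[i], segments[i + 1])
               for i in range(0, len(segments) - 1, 2)]
        if len(segments) % 2:             # odd segment passes through
            nxt.append(segments[-1])
        segments = nxt
    return segments[0]

# Four hypothetical one-step branch-metric matrices (2 states each).
segs = [[[1, 2], [0, 3]],
        [[2, 1], [1, 2]],
        [[0, 3], [2, 1]],
        [[1, 1], [3, 0]]]
# Associativity: the tree gives the same whole-trellis segment metrics
# as a left-to-right (O(K)-latency) sweep.
print(tree_combine(list(segs)) == reduce(merge, segs))
```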
Remainder of Talk Outline
• Turbo Coding – an introduction
• Turbo Decoding
• Tree SISO low-latency turbo decoder architecture
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Synchronous Base-Line Turbo Decoder
• Synchronous turbo decoder baseline
  • IBM 0.18µm Artisan standard-cell library
  • SCCC code with rate ½
  • 6 decoding iterations
• Gate-level pipelined to achieve high throughput
  • Performed timing-driven P&R
  • Peak frequency of 475 MHz
  • SISO area of 2.46 mm²
• To achieve high throughput, multiple blocks instantiated
Asynchronous Turbo Decoder
Static Single Track Full Buffer standard-cell library (Golani'06)
• Total of (only) 14 cells in IBM 0.18µm process
• Extensive SPICE simulations were performed
  • optimized trade-off between performance and robustness
Chip design
• Standard ASIC place-and-route flow (congestion-based)
• ECO optimization flow
Chip-level simulation
• Performed on critical sub-block (55K transistors)
• Verified timing constraints
• Measured latency and throughput using Synopsys NanoSim
Static Single Track Full Buffer (Ferretti'01)
Statically driven line → improves noise margin
1-of-N static single-track protocol
• one end holds the line low and drives it high to send
• the other end holds the line high and drives it low to consume
[Figure: SSTFB transistor-level schematic with keepers; sender and receiver exchanging 1-of-N data over an SST channel]
Asynchronous Implementation Challenges - I
[Figure: unbalanced fork-join pipeline]
• Degradation in throughput
  • Unbalanced fork and join structure
  • The token on the short branch is stalled due to the imbalance
  • This slows down the entire fork-join
[Figure: fork-join with an added slack-matching buffer]
• Slack matching
  • Improves throughput by adding a pipeline buffer
  • Identify fork/join bottlenecks and resolve them by adding buffers
  • After P&R, long wires can also create this problem
  • Solved by adding buffers on long wires using the ECO flow
Asynchronous Implementation Challenges - II
• SSTFB implements only point-to-point communication
• One option: dedicated fork cells
  • Creates another pipeline stage
  • Slack-matching buffers are then needed on the other paths
• Instead: integrate the fork within the full adder
  • 45% less area than a separate full adder and fork
  • Decreases the number of slack-matching buffers required
[Figure: adder network with a separate FORK cell vs. a full adder with integrated fork (FAFork)]
Asynchronous Implementation Challenges – III
• 60% of the design is slack-matching buffers
• Most of these buffers occur in linear chains
• To save area and power, two new cells were created
  • SLACK2: 17% area and 10% power improvement (vs. two buffers)
  • SLACK4: 30% area and 19% power improvement (vs. four buffers)
[Figure: a linear chain of buffers replaced by SLACK2 and SLACK4 cells]
Remainder of Talk Outline
• Turbo Coding – an introduction
• Turbo Decoding
• Tree SISO low-latency turbo decoder architecture
• Synchronous turbo decoder
• Asynchronous turbo decoder
• Comparisons and conclusions
Comparisons
• Synchronous
  • Peak frequency of 475 MHz
  • Logic area of 2.46 mm²
• Asynchronous
  • Peak frequency of 1.15 GHz
  • Logic area of 6.92 mm²
• Design time comparison
  • Synchronous: ~4 graduate-student months
  • Asynchronous: ~12 graduate-student months
Synch vs Async
Throughput = K / ExecutionTime, where ExecutionTime = l + c·(K/M − 1)
• K bits per block, M pipelined 8-bit Tree SISOs, latency l
• Let c be the sync clock cycle time (475 MHz)
• First M bits arrive at t = l
• Subsequent M bits arrive every c time units
Two implementations
• Synch: cycle time C1, latency L1
• Async: cycle time C2 = C1/2.4, latency L2 = L1/4.5
Desired comparisons
• Throughput comparison vs block size
• Energy comparison vs block size
[Figure: decoder organization with received memory, interleaver/de-interleaver, and the M Tree SISOs processing K-bit blocks]
Comparisons – Throughput / Area
• For small block sizes, asynchronous provides better throughput/area
• As block size increases, the two implementations become comparable
• For block sizes of 512 bits, synchronous cannot achieve the async throughput
[Chart: throughput/area (Mbps/mm²) vs block size (0–5000 bits), Async (M=1) vs Sync (variable M); annotated ratios 3.91 (M=11), 2.13 (M=8), 1.28 (M=3)]
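The throughput/area trend can be sketched with the same block-processing model. Only the cycle times (475 MHz sync, 1.15 GHz async, i.e. the 2.4x ratio) and per-SISO logic areas (2.46 / 6.92 mm²) come from the talk; the pipeline latency below is a hypothetical assumption, so the exact crossover point is illustrative only.

```python
C_SYNC, A_SYNC = 1 / 0.475, 2.46     # ns per cycle, mm^2 of logic per SISO
C_ASYNC, A_ASYNC = 1 / 1.15, 6.92

L_SYNC = 200 * C_SYNC                # assumed pipeline latency (hypothetical)
L_ASYNC = L_SYNC / 4.5               # latency ratio reported in the talk

def throughput_mbps(K, M, c_ns, l_ns):
    """Block-processing model: first M bits at t = l, M more every c."""
    return K / (l_ns + c_ns * (K / M - 1)) * 1e3

def async_advantage(K, M_sync):
    """Async (M = 1) throughput/area divided by sync with M_sync copies."""
    sync = throughput_mbps(K, M_sync, C_SYNC, L_SYNC) / (M_sync * A_SYNC)
    asyn = throughput_mbps(K, 1, C_ASYNC, L_ASYNC) / A_ASYNC
    return asyn / sync

# The async advantage shrinks as the block size grows and the synchronous
# design amortizes its longer latency.
print(async_advantage(64, 1), async_advantage(4096, 1))
```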
Comparisons – Energy/Block
For equivalent throughputs and small block sizes, asynchronous is more energy efficient than synchronous
Async advantages grow with a larger async library (e.g., with BUF1of4)
[Chart: energy per block (nJ, 0–14) vs block size (0–5000 bits), Async (M=1) vs Synch (variable M)]
Conclusions
• Asynchronous turbo decoder vs. synchronous baseline
  • static STFB offers significant improvements for small block sizes
    • more than 2x throughput/area
    • higher peak throughput (~500 Mbps)
    • more energy efficient
  • well-suited for low-latency applications (e.g. voice)
• High-performance async is advantageous for applications that require
  • high performance (e.g., pipelining)
  • low latency
  • block processing for which parallelism has diminishing returns
    • synchronous design requires extensive parallelism to achieve equivalent throughput
Future Work
• Library design
  • Larger library with more than one size per cell
  • 1-of-4 encoding
• Async CAD
  • Automated slack matching
  • Static timing analysis
Questions?