
Hardware Design and Realization for Iteratively

Decodable Codes

Emmanuel Boutillon and Guido Masera

1. Introduction

The transition from analog telecommunication equipment and terminals to digital systems and, more recently, the rapid development of wireless communications were made possible by key advances in IC (Integrated Circuit) technology. The continuous miniaturization of silicon devices known as Moore's law has led to an increase in transistor density of close to three orders of magnitude over the last 20 years. At the same time, large improvements have been achieved in methodologies and tools for the design of highly complex digital circuits, and productivity in IC design (measured as the average number of components designed per month by an expert designer) has been increasing by 20% every year. Both trends contributed in a decisive way to the hardware implementation of computationally intensive base-band algorithms, including powerful decoding architectures for turbo and LDPC codes.

From the methodological point of view, the design of efficient channel decoders for wireless applications is an extremely difficult and exciting domain of practice. In most cases, hardware design activity is abstracted from the addressed application, meaning that a set of area and performance constraints is extracted from the application and the digital designer can concentrate his attention on hardware-specific issues, which are largely independent of the application. The design of channel decoders requires a significantly different approach, where system and architecture levels are strictly interconnected and the hardware design activity cannot be separated from a deep knowledge of the underlying algorithms. Instead, decoding algorithms and hardware architectures must be jointly studied and optimized, keeping in mind that any choice at the implementation level has an impact on the system-level performance and vice versa. Hardware-aware algorithm simplifications and finite precision representation of both external and internal data are imperative to achieve area and energy efficiency, but their effect on communication performance must be carefully evaluated, usually by means of bit- and cycle-accurate software models and Monte Carlo simulations.

Preprint submitted to Elsevier, November 27, 2013

In general, the objectives of a designer often conflict, and an appropriate trade-off must be found among the following:

• Bit Error Rate (BER) performance: in other words, being as close as possible to the theoretical performance. A performance gap of 0.2 up to 0.5 dB in SNR for a given target Bit Error Rate is currently accepted.

• Low area: the smaller the design, the better, since a smaller area reduces the fabrication cost.

• High speed and low latency, to be able to cope with ever-growing communication rates.

• Low power: to increase the reliability of the design, the sustainability of the whole system and the autonomy of the receiver when it is supplied by batteries.

• Flexibility: having a single design supporting several standards, in order to share the design cost and the mask fabrication cost over a large number of products.

For most implementation approaches, the decoding throughput D (in Mbit/s) of a design can be expressed by the equation:

D = K · Fclk / (Nit^max × Nc)    (1)

where Fclk (in MHz) is the clock frequency, K is the number of information bits of a codeword, Nit^max is the maximum number of decoding iterations and, finally, Nc is the number of clock cycles needed for one decoding iteration. Equation (1) should be completed with two other architectural parameters, for a given technology, signal-to-noise ratio and level of performance: hardware efficiency and energy efficiency. The hardware efficiency He is defined as the decoding throughput per area unit (Mbit/s/mm²): He = D/S, where S is the area of the design [1]. The energy efficiency Ee is the average energy needed to decode one bit (in Joule/bit).
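As an illustration, equation (1) and the hardware efficiency metric can be evaluated numerically; the design point below (clock frequency, area, iteration count) is purely hypothetical and not taken from the chapter:

```python
def decoding_throughput(k_bits, f_clk_mhz, n_it_max, n_cycles_per_it):
    """Decoding throughput D in Mbit/s, following equation (1):
    D = K * Fclk / (Nit_max * Nc)."""
    return k_bits * f_clk_mhz / (n_it_max * n_cycles_per_it)

def hardware_efficiency(d_mbps, area_mm2):
    """Hardware efficiency He = D / S, in Mbit/s/mm^2."""
    return d_mbps / area_mm2

# Hypothetical design point: K = 6144 info bits, 500 MHz clock,
# 6 iterations, 6144 cycles per iteration (one trellis step per cycle).
d = decoding_throughput(6144, 500.0, 6, 6144)
he = hardware_efficiency(d, 2.0)   # assuming a 2 mm^2 design
```

At this (assumed) design point, D is about 83 Mbit/s; halving the number of cycles per iteration or the iteration count doubles the throughput, which is exactly the lever the later sections on speed optimization act upon.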


The first section presents the standard implementation of a Turbo-Code decoder and that of an LDPC decoder, in order to fix the internal accuracy and the notations. The three following sections are dedicated respectively to area optimization (parameter He), speed optimization (parameter D: increasing Fclk, decreasing Nc and Nit) and energy optimization (parameter Ee). To do so, we had to make choices: some optimization techniques (for example, bit-length optimization) influence the other criteria as well (reducing the word length decreases the area, and thus also the power dissipation and the critical path). Section five deals with the flexible implementation of iterative decoders. Section six explores non-standard decoding techniques (analog and stochastic decoders). Finally, section seven describes some relevant implementations.

2. Standard Implementation

In this section, we present the baseline architectures for a Turbo-Code decoder and for an LDPC decoder. The architecture is fully digital. The first subsection concerns the finite precision representation of the input message. Then the Turbo-Code is studied, followed by the LDPC code. The objective of this section is to give an insight into the problems involved and to define the notations used in the chapter.

2.1. Quantization of input LLR

The quantization of the input LLRs is an important issue. On one side, the quantization should use a maximum number of bits, in order to obtain optimal performance. On the other side, a high number of bits for the input LLR implies high hardware complexity and a low clock frequency. It is interesting to note that the function of a decoding process is the suppression of the channel noise. This robustness against channel noise also implies, as a beneficial side effect, a robustness against the internal quantization noise of the input LLR. Thus, compared to a classical DSP application, the internal precision of a turbo-decoder can be rather low without significant degradation of the performance.

Let us consider a binary code associated with a BPSK modulation, where the bit d is associated to a signal modulated with an amplitude X = −1 if d = 0, and X = +1 otherwise. The received symbol is thus equal to x = X + n, where n is a realization of a white Gaussian noise of variance σ². The LLR λ(d) = ln(P(d = 1|x) / P(d = 0|x)), assuming that there is no a priori knowledge on the value of d, is then:

λ(d) = 2x/σ².    (2)

The quantized value λ(d)Q of λ(d) on bLLR bits is given by λ(d)Q = Q(λ(d)), where the quantization function Q is defined as:

λQ = Q(λ) = sat( ⌊λ · (2^(bLLR−1) − 1)/A + 0.5⌋ , 2^(bLLR−1) − 1 )    (3)

where sat(a, b) = a if a belongs to [−b, b], sat(a, b) = sign(a) × b otherwise, and A is the dynamic range of the quantization (data are quantized between −A and A). This symmetrical quantization is needed to avoid a systematic advantage of one bit value over the other (a situation that would decrease the performance of the decoder).
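Equations (2) and (3) can be sketched in software as follows; this is a bit-accurate model of the quantizer only, used for simulation, not a description of any particular hardware unit:

```python
import math

def llr(x, sigma):
    """Channel LLR for BPSK over AWGN, equation (2): lambda = 2x / sigma^2."""
    return 2.0 * x / (sigma * sigma)

def quantize_llr(lam, b_llr, A):
    """Symmetrical quantization of equation (3):
    floor(lam * (2^(b-1) - 1) / A + 0.5), saturated to +/-(2^(b-1) - 1)."""
    qmax = 2 ** (b_llr - 1) - 1
    q = math.floor(lam * qmax / A + 0.5)
    return max(-qmax, min(qmax, q))
```

For bLLR = 5 and A = 3.25, the quantized values span the symmetric range [−15, +15], with inputs beyond ±A saturating at ±15.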

One can note that if A is very large, most of the inputs will be quantized to the value 0, i.e., an erasure. In that case, the decoding process will fail. On the other hand, a too small value of A would lead to a saturation of the quantized input most of the time, and the soft quantization would thus be equivalent to a hard decision. Clearly, for a given code rate and a given SNR, there is an optimal value of A. For example, Figure 1 shows the influence on performance of the range value A for bLLR = 5, for the LTE Turbo-Code with K = 6144 and a coding rate R = 1/3. Figure 2 shows the influence of bLLR on performance for the LTE Turbo-Code with K = 6144 and a coding rate R = 1/3. In each simulation, the value of A is selected to optimize the performance for a FER of 10^−2.

As a rule of thumb, for a 1/2 code rate, the optimal value of A is around 1.2. Moreover, bLLR = 6 gives almost optimal performance. In practical cases, bLLR varies between 3 and 6.

Figure 1: LTE code for K = 6144, R = 1/3, bLLR = 5 and several values of A (FER versus Eb/N0, for A = 2, 3, 3.25, 4, 5 and 6)

Figure 2: LTE code for K = 6144, R = 1/3 and bLLR varying from 3 to 8 (FER versus Eb/N0; for each bLLR, A is optimized: bLLR = 8, A = 5.5; bLLR = 7, A = 5; bLLR = 6, A = 4.5; bLLR = 5, A = 3.2; bLLR = 4, A = 2.5; bLLR = 3, A = 2)

Equations (2) and (3) show that the input value λ(d)Q of the turbo-decoder depends on the channel observation x and on the value A, as mentioned above, but also on the SNR of the signal, i.e., on the variance of the noise σ² (see equation (2)). The SNR is not available at the receiver side and it should be estimated using the statistics of the input signal. In practical cases, the estimation of the SNR is not always mandatory. For example, when the max-log-MAP algorithm is used, i.e., when the max* operation is approximated by the simple max operator (see section 3.4), this estimation is not necessary. In fact, the term 2/σ² of equation (2) is just, in the logarithmic domain, a common scaling factor that impacts both the inputs and the outputs of the max operator. Moreover, when the log-MAP algorithm is used, a pragmatic solution is to replace the real standard deviation σ by the maximum value σo of σ leading to a BER (or a FER) acceptable for the application. Note that when the effective noise standard deviation σ is below σo, the decoding process becomes sub-optimal due to an under-estimation of the λ(d). Nevertheless, the BER (or the FER) would still decrease and thus remain in the functional domain of the application.

To simplify the notations, we will use x instead of λ(d)Q to represent the quantized input of the decoder; thus xi is an integer, coded on bLLR bits, taking its values in [−2^(bLLR−1) + 1, 2^(bLLR−1) − 1].

2.2. Standard Turbo decoder architecture

The standard Turbo-decoder architecture and the associated decoding process are first presented. Then, a detailed description of the MAP decoder is given, first in the case of a binary code with known initial and final states, then for tail-biting codes [2].

2.2.1. Global architecture

In this section, we assume a received quantized message R with Ri = (xi, yi(1), yi(2)), i = 1..K, where y(j) represents the redundancy of the first (j = 1) and second (j = 2) encoder and K is the size of the information message. The basic architecture of the Turbo-Decoder is composed of a single MAP decoder and its associated memories, as shown in Figure 3.

The decoding process starts with the reception of the message R. The input values are assumed to be quantized Log-Likelihood Ratios (LLRs) (see section 2.1). The information bits {xi}i=1..K are stored in the memory RAMx. This memory is able to read the message in the natural order, when the first code is processed, or in the permuted order, when the second code is processed. In the latter case, RAMx delivers the interleaved sequence {xπ(i)}i=1..K, where π is the permutation function of the encoder. The redundancy bits are stored in the memory RAMy. This memory is able to output either the {y(1)} or the {y(2)} redundancy sequence.

The decoding process starts by decoding the first code with the inputs {x} and {y(1)}. The output of the extrinsic memory RAMext is set to zero, since no a priori information is available at this stage of the decoding. The a priori value ai entering the SISO decoder is thus ai = xi. The outputs of the MAP decoder are respectively the extrinsic information z(1) and the a posteriori LLR value Λ1. The dynamic range of the extrinsic information is possibly reduced by multiplying z(1) by a positive scaling factor δ1 less than or equal to 1, where the index h = 1 indicates the first half-iteration. Then the scaled extrinsic values δ1z(1) are serially stored in the RAMext memory.

Figure 3: Global architecture of the decoder

Once the first half-iteration is done, the second half-iteration starts. The memory RAMx delivers {xπ(i)}i=1..K while the memory RAMext outputs {δ1z(1)π(i)}i=1..K. The two values are added to generate the a priori input value ai that enters the SISO decoder. The RAMy memory outputs the y(2) information. Once the processing of the second code is finished, the MAP decoder outputs the extrinsic values {z(2)i}i=1..K. The extrinsic values are scaled again by a factor δ2 before entering the RAMext memory.

Then, the first code is decoded again, with the new extrinsic information δ2z(2) generated during the previous half-iteration. The RAMext reorders the extrinsic values δ2z(2) by reading them in the π^−1 order. The process is repeated for a fixed number of half-iterations. During the last decoding, the LLR values {Λi}i=1..K are output and a new codeword can be decoded.

The effect of the scaling value δh is to reduce the impact of the extrinsic values and to help the convergence of the turbo-decoding process. The optimal value of δh changes according to the half-iteration index h. Generally, δh is of the form δ = 1 − 2^−γ, where γ is a positive integer, in order to replace a multiplication by a shift operation (multiplication by 2^−γ) and an addition. Typical values of δh are 0.5 (γ = 1), 0.75 (γ = 2), or 0.875 (γ = 3) for all half-iterations, except the last half-iteration where the value is equal to 1. Figure 4 shows the impact of the δ values on the performance for an LTE code of size K = 6144, code rate 1/3, 6 decoding iterations, a quantization on bLLR = 5 bits and a range factor A = 3.3. The three numbers in the legend are respectively δ1, then δh for h = 2..11 and, finally, δ12. One can see that the series of δh values has a significant influence on the performance. A gap of 0.18 dB in Eb/N0 can be observed between the performance with and without scaling factors (all δh values equal to 1, i.e., the curve "1-1-1").
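The shift-and-add trick behind the choice δ = 1 − 2^−γ can be sketched as follows; the truncating behaviour of the arithmetic right shift is an assumption here, since the actual rounding is implementation dependent:

```python
def scale_extrinsic(z, gamma):
    """Multiply the integer extrinsic value z by delta = 1 - 2**-gamma
    using only a shift and a subtraction, as done in hardware:
    delta * z = z - z * 2**-gamma, with z >> gamma an arithmetic
    (truncating) right shift."""
    return z - (z >> gamma)
```

For instance, with z = 16, gamma = 1, 2 and 3 yield 8, 12 and 14, i.e., scaling by 0.5, 0.75 and 0.875 without any multiplier.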

Figure 4: Performance for several series of δh (FER versus Eb/N0; legend (δ1, δh for h = 2..11, δ12): 0.5-0.5-0.5, 1-1-1, 0.75-0.75-0.75, 0.75-0.75-1, 0.5-0.75-1)

2.2.2. Standard MAP architecture

In this section, we focus our attention on the finite-precision architecture of the log-MAP decoder. We first recall the equations of the log-MAP algorithm. Let m(s′, s) be the Branch Metric (BM) connecting state s′ to state s of the trellis, and (X(s′, s), Y(s′, s)) the associated modulated values for the information bit x(s′, s) and the redundancy bit y(s′, s). The Branch Metric m(s′, s) is given by

m(s′, s) = X(s′, s) a + Y(s′, s) y    (4)

One should note that, for a given input (ai, yi), there are only four possible values of the branch metric, according to the signs of X(s′, s) and Y(s′, s). These four possible values are given by:

m00i = −ai − yi
m01i = −ai + yi
m10i = +ai − yi
m11i = +ai + yi    (5)

The assignment of m(s′, s) to one of the four values of (5) depends on the encoder structure. The computation of the mi for a given trellis stage requires only 4 additions.
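A minimal sketch of the branch metric computation of equation (5); it makes explicit that, for one trellis stage, one addition and three negations (or equivalently four additions) are enough:

```python
def branch_metrics(a, y):
    """The four possible branch metric values of equation (5) for one
    trellis stage, indexed by the (information, redundancy) bit pair."""
    m11 = a + y        # X = +1, Y = +1
    m00 = -m11         # X = -1, Y = -1
    m10 = a - y        # X = +1, Y = -1
    m01 = -m10         # X = -1, Y = +1
    return {"00": m00, "01": m01, "10": m10, "11": m11}
```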


The forward recursion, for i = 1..K − 1, is given by:

MF_i(s) = max*( MF_i−1(s′0) + mi(s′0, s) ; MF_i−1(s′1) + mi(s′1, s) )    (6)

where the initial forward state metric MF_0 is given by MF_0(s = 0) = Inf and MF_0(s ≠ 0) = 0, where Inf is a high positive number in the range of the precision of the decoder.

The backward recursion, for i = K − 1 down to 1, is given by:

MB_i(s′) = max*( MB_i+1(s0) + mi(s′, s0) ; MB_i+1(s1) + mi(s′, s1) )    (7)

where the initial backward state metric MB_K+1 is given by MB_K+1(s = 0) = Inf and MB_K+1(s ≠ 0) = 0.

Finally, the output LLR Λ(di), for i = 1..K, is given by:

Λ(di) = max_(s′,s)∈T1 { MF_i−1(s′) + m(s′, s) + MB_i(s) } − max_(s′,s)∈T0 { MF_i−1(s′) + m(s′, s) + MB_i(s) }    (8)

where T1 (respectively T0) is the set of transitions from state s′ at time i − 1 to state s at time i associated to an encoded bit d = 1 (respectively d = 0).
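For illustration, equations (6)-(8) can be sketched on a generic trellis as follows; the max* operator is replaced here by a plain max (the max-log-MAP approximation of section 3.4), and the description of the trellis as a list of transitions is an assumption made for this sketch, not the chapter's notation:

```python
INF = 1000  # "Inf": a large positive number within the decoder precision

def max_log_map(transitions, n_states, a, y):
    """Max-log-MAP sketch of equations (6)-(8).
    transitions: list of (s_prev, s_next, x_bit, y_bit), x_bit/y_bit in {-1,+1}.
    a, y: per-stage a priori and redundancy values (length K).
    Returns the output LLRs Lambda(d_i), i = 1..K."""
    K = len(a)

    def m(t, i):                      # branch metric, equation (4)
        _, _, xb, yb = t
        return xb * a[i] + yb * y[i]

    # forward recursion, equation (6); state 0 is the known initial state
    MF = [[-INF] * n_states for _ in range(K + 1)]
    MF[0] = [INF if s == 0 else 0 for s in range(n_states)]
    for i in range(K):
        nxt = [-INF] * n_states
        for t in transitions:
            sp, sn, _, _ = t
            nxt[sn] = max(nxt[sn], MF[i][sp] + m(t, i))
        MF[i + 1] = nxt

    # backward recursion, equation (7); state 0 is the known final state
    MB = [[-INF] * n_states for _ in range(K + 1)]
    MB[K] = [INF if s == 0 else 0 for s in range(n_states)]
    for i in range(K - 1, -1, -1):
        cur = [-INF] * n_states
        for t in transitions:
            sp, sn, _, _ = t
            cur[sp] = max(cur[sp], MB[i + 1][sn] + m(t, i))
        MB[i] = cur

    # output LLR, equation (8): best d=+1 path minus best d=-1 path
    llrs = []
    for i in range(K):
        best = {1: -3 * INF, -1: -3 * INF}
        for t in transitions:
            sp, sn, xb, _ = t
            v = MF[i][sp] + m(t, i) + MB[i + 1][sn]
            best[xb] = max(best[xb], v)
        llrs.append(best[1] - best[-1])
    return llrs
```

On a degenerate one-state "trellis" whose redundancy repeats the information bit, each output LLR reduces to 2(ai + yi), which gives a quick sanity check of the recursions.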

Overall architecture. The MAP decoder architecture is given in Figure 5. The decoding process starts with the computation of the BM values from the ai and yi values, for i = 1..K (see equation (4)). The ai values enter the Last-In First-Out (LIFO) memory LIFOa of depth K. The branch metric values mi, for i = 1..K, directly enter the forward recursion unit (equation (6)), which generates MF_i for i = 1..K. The MF_i values enter the memory LIFOM of depth K. In parallel, the mi values also enter the LIFO memory LIFOm (also of depth K). Once the forward processing is finished, the backward process starts. The memory LIFOm outputs the sequence mK+1−i, i = 1..K, to perform the backward recursion (equation (7)). During the backward processing, the backward unit outputs MB_K+1−i, for i = 1..K. At the same time, LIFOM outputs MF_K−i in order to generate the decoded output ΛK−i in the LLR unit (see equation (8)). The final step of the MAP decoder is to subtract the a priori value aK−i (output of LIFOa) from ΛK−i to generate the extrinsic information z̄K−i, i = 1..K. One can note that, with this schedule, the outputs are generated in reverse order. This is not a big issue: in fact, it only changes the way the RAMext (see Figure 3) schedules its input/output accesses.


Figure 5: Architecture of the MAP decoder (Forward-Backward algorithm)


In order to perform a forward (or backward) step in one clock cycle, the data flow graph of equation (6) should be mapped directly onto hardware. This means that, for an encoder with 2^ν states, 2^ν interconnected ACS* (Add Compare Select*) units should be implemented. Figure 6 shows an example of such an architecture for a ν = 3 convolutional encoder with characteristic polynomials (in octal) (17, 13). Figure 7 shows the architecture of the ACS* unit that performs the operation max*(M0 + m0; M1 + m1).
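The ACS* operation can be sketched in software with a small lookup table for the correction term ln(1 + e^−|a−b|), a common hardware realization of max*; the table step and range below are illustrative assumptions (the real unit of Figure 7 works in fixed point):

```python
import math

# Correction term ln(1 + exp(-|a-b|)), tabulated (as an assumption)
# with a 0.5 step on |a-b| over [0, 4); beyond that, it is negligible.
LUT = [round(math.log1p(math.exp(-0.5 * k)), 3) for k in range(8)]

def acs_star(M0, m0, M1, m1):
    """ACS* operation: max*(M0 + m0, M1 + m1) = max of the two sums plus
    a LUT-based correction term (floating-point sketch, not fixed point)."""
    p, q = M0 + m0, M1 + m1
    idx = int(abs(p - q) / 0.5)
    corr = LUT[idx] if idx < len(LUT) else 0.0
    return max(p, q) + corr
```

Dropping the correction term altogether (corr = 0) turns the unit into the plain ACS of the max-log-MAP approximation.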

Since each forward (or backward) step is done in one clock cycle, the scheduling can be represented as in Figure 8.a. In this figure, time is represented on the x-axis and the index of the information bit on the y-axis. During the first K clock cycles, the branch metrics and the forward recursion are computed from index 1 to K (arrow "Forward" in Figure 8). The forward state metrics MF are stored in the LIFOM memory (represented by the grey area). From clock cycle K to clock cycle 2K, the backward unit performs the recursion from index K to index 1 (arrow "Backward" in Figure 8). Finally, the LLR unit computes the extrinsic information (arrow "LLR" in Figure 8).

2.2.3. MAP architecture for tail-biting code

In the case of tail-biting Turbo-Codes [2], there is no known initial state. To solve this problem, for the forward recursion, a preamble of length L starting from bit index K − L with an all-zero initial state metric (all states equiprobable) should be processed in order to obtain a correct estimation of the initial state MF_0 = MF_K, since the code is tail-biting. In an identical way, a backward recursion of length L should be performed from bit index L down to bit index 0 in order to obtain a correct estimation of the initial backward state MB_K = MB_0, as shown in Figure 8.b.

The value of L is called the convergence length. The higher L, the better the convergence of the algorithm (ideally, L = K), but in practice a value of L equal to 5 or 6 times the constraint length of the code is enough, for a rate-1/3 turbo-code, to obtain almost optimal performance. For high-rate Turbo-Codes, the convergence window should include a significant number of trellis sections with non-punctured redundancy bits to obtain near-optimal performance.

Figure 6: Example of parallel forward architecture for the convolutional encoder (in octal) (17,15)

Figure 7: Architecture of the ACS* operator

Figure 8: Global scheduling of the SISO decoder

2.3. Standard LDPC decoder architecture

The decoding of LDPC codes is often associated with a computational architecture resembling the structure of the Tanner graph, with processing elements (PE) associated to both variable and check nodes, memory units, and interconnects to support the exchange of messages between graph nodes. Therefore, generally speaking, processing elements, memory units and interconnect structures are the key components of any decoder.

Using these three types of components, a large variety of decoding architectures for LDPC codes have been proposed and implemented in recent years. However, one specific architecture has been adopted more often than all the others and can be considered as a de facto standard solution, able to provide relevant complexity and performance advantages in most applications. This commonly used decoding architecture is the partially parallel layered decoder; it will be detailed in the rest of the section, after a brief overview of fully parallel and serial solutions.

In a fully parallel architecture, all nodes of the Tanner graph are instantiated and served concurrently. The high-level architecture of a fully parallel decoder for an (N, M) LDPC code is shown in Figure 9.

Three elements are included in the architecture: check nodes (CN), variable nodes (VN) and an interconnect network, which supports each CN-to-VN edge in the Tanner graph. At iteration n, each VN j receives from the channel an LLR λj, together with a number of check-to-variable messages β(n)i,j arriving from a set of CNs i. Generated variable-to-check messages are sent back to the CNs: specifically, α(n)i,j is the message sent from VN j to CN i at iteration n. The two-phase decoding algorithm for LDPC codes adopts a scheduling where each iteration is split in two separated, non-overlapping parts:

1. VNs receive input messages from CNs and compute new messages to be sent back to CNs,

2. CNs receive messages from VNs and generate new messages forwarded to VNs.

For the sake of clarity, the key equations of two-phase decoding are reported here:

β(n)i,j = ΨCN( α(n−1)i,k , k ∈ N(i), k ≠ j )    (9)

α(n)i,j = λj + Σ_{k∈M(j), k≠i} β(n−1)k,j    (10)

where M(j) is the set of parity checks involving VN j and N(i) is the set of VNs that participate in parity check i; moreover, ΨCN is the check node update function, which has multiple formulations according to the adopted approximation.

Figure 9: Fully parallel decoder architecture

Ideally, this architecture optimizes the decoding throughput, since it guarantees the maximum degree of parallelism. However, it has been seldom adopted in practical implementations [3], for two reasons: on one side, it leads to a huge number of connections among PEs, which cause routing congestion in the case of large codes. On the other side, such a solution is dedicated to a single specific code and does not provide any flexibility.
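For illustration, the two-phase node updates can be sketched as follows; equation (10) is implemented literally, while Min-Sum (one of the possible ΨCN formulations mentioned above, see Section 3.4) is assumed for the check node:

```python
def vn_update(lam_j, incoming, exclude_i):
    """Variable node update, equation (10): alpha_ij is the channel LLR
    lambda_j plus all incoming check-to-variable messages except the one
    coming from CN i itself. 'incoming' maps CN index -> beta message."""
    return lam_j + sum(b for k, b in incoming.items() if k != exclude_i)

def cn_update_minsum(incoming, exclude_j):
    """Check node update, with Min-Sum assumed as Psi_CN: sign product and
    minimum magnitude over all inputs except the one from VN j.
    'incoming' maps VN index -> alpha message."""
    sign, mag = 1, float("inf")
    for k, a in incoming.items():
        if k == exclude_j:
            continue
        sign *= -1 if a < 0 else 1
        mag = min(mag, abs(a))
    return sign * mag
```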

In a serial architecture, a single PE is allocated and the graph nodes are served in a time-multiplexed way. In Figure 10, a serial decoder is shown, composed of a single CN unit, a single VN unit, an input buffer (IB) and a message memory (MSG RAM). Once the LLRs of a frame are loaded into IB, the decoding proceeds through a sequence of elementary steps to complete the variable-to-check half-iteration. For each VN j,

1. the VN unit reads LLR λj from IB,

2. the VN unit reads β(n−1)k,j from MSG RAM, for all k ∈ M(j),

3. the VN unit executes (10),

4. the obtained messages α(n)i,j are written back into MSG RAM.

The check-to-variable half-iteration is similarly organized:

1. the CN unit reads α(n−1)i,k from MSG RAM, for all k ∈ N(i),

2. the CN unit executes (9),

3. the obtained messages β(n)i,j are written back into MSG RAM.

Due to the sparse nature of the H matrix, the indices i and j cannot be directly used to address the MSG RAM; an address generator is required to store the messages in a compact form.

Contrary to the fully parallel case, the serial approach offers excellent flexibility in terms of supporting heterogeneous codes, but the achievable throughput is too low for most applications. The decoding of an (N, M) LDPC code with N variable nodes and M check nodes, assuming average degrees dc and dv for check and variable nodes respectively, would require a number of cycles given by

3 Nit (dv · N + dc · M)    (11)

where the factor three takes into account that, for every node, incoming messages must be read and processed, and finally the result must be written: in a fully serial decoder, each of these elementary operations takes one cycle. As an example, a (1000, 500) code with dc = 5 and dv = 8 needs 31,500 cycles per iteration, or 315,000 cycles for 10 iterations: even a very fast decoder implementation with a clock frequency of 1 GHz would decode only about 3,000 codewords per second, a throughput insufficient for many applications.

Figure 10: Serial decoder architecture
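A quick numerical check of equation (11) and of the resulting frame rate for the example above:

```python
def serial_decoder_cycles(n_vn, m_cn, dv, dc, n_it):
    """Cycle count of equation (11) for a fully serial decoder:
    3 * Nit * (dv*N + dc*M), i.e. read, process and write each message."""
    return 3 * n_it * (dv * n_vn + dc * m_cn)

# Example from the text: (1000, 500) code, dc = 5, dv = 8, 10 iterations.
cycles_per_frame = serial_decoder_cycles(1000, 500, 8, 5, 10)
frames_per_second = 1e9 / cycles_per_frame  # at a 1 GHz clock
```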

In the partially parallel architecture, the allocated PEs are fewer than the graph nodes and multiple nodes share the same PE. By tuning the number of PEs, different cost-throughput trade-offs can be targeted. Interconnection among PEs still induces relevant complexity, but less than for fully parallel architectures: permutation networks with acceptable implementation cost can be used to support the inter-PE communication needs associated to multiple LDPC codes. Proposed solutions range from fully flexible networks, capable of supporting any parity check matrix, to lower complexity structures that work on specific classes of codes. Examples of fully flexible solutions are the efficient non-blocking Benes networks [4] and application-specific NoCs (Networks on Chip) [5]. Benes networks provide full connectivity among N inputs and N outputs with a low complexity structure, made of 2 log2(N) − 1 layers of N/2 2×2 switching elements. Therefore the overall complexity of Benes networks grows with N as N log2(N), while complexity is quadratic in N for crossbar switches. The construction of an 8×8 Benes network from 2×2 switches is shown in Figure 11. A NoC based decoder makes use of a NoC to exchange extrinsic information among PEs. Both direct and indirect networks can be used for this purpose. A direct NoC is shown in Figure 12, together with the node structure. The network includes P nodes, each of which contains both a processing element (PE) and a routing element (RE). The generic PE includes storage and computational resources to execute the decoding algorithm on a portion of the received frame. On the other hand, the RE is built around an m×m input-queuing crossbar switch, where m is the number of input/output ports of the node. Received inputs are stored into m FIFOs and outputs are registered before leaving the node. The read enable signals of the input FIFOs, the crossbar controls and the load signals of the output registers are driven by a routing algorithm, which makes routing decisions that minimize delivery latency. The routing algorithm can be executed either on the fly or off-line: in the latter case, routing decisions are generated in advance and stored into dedicated memories, which are part of the RE. On the contrary, in the former case, routing information must be appended to the exchanged messages and dedicated processing resources are required in the RE to run the routing algorithm and to control the FIFOs and the crossbar switch.

Figure 11: Examples of Benes networks

Low complexity interconnect structures have been widely adopted to support structured LDPC codes. In these codes, the parity check matrix H is composed of Mb block rows and Nb block columns, which partition H into Mb · Nb entries. Each entry of H is replaced with a z × z matrix that is either a shifted permutation of the z × z identity matrix or the z × z all-zero matrix. This kind of structure provides both excellent communication performance and simple inter-PE connections. In particular, logarithmic barrel shifters can be used [6] [7] to support all the connections in the Tanner graph of structured codes.
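A sketch of the expansion of such a structured (quasi-cyclic) parity check matrix; the convention of marking all-zero blocks with −1 in the base matrix is an assumption made here, borrowed from common practice:

```python
def expand_base_matrix(base, z):
    """Expand a structured LDPC base matrix into the full binary H matrix.
    base[r][c] = -1 denotes the z*z all-zero block; a value s >= 0 denotes
    the z*z identity matrix cyclically shifted by s columns."""
    Mb, Nb = len(base), len(base[0])
    H = [[0] * (Nb * z) for _ in range(Mb * z)]
    for r in range(Mb):
        for c in range(Nb):
            s = base[r][c]
            if s < 0:
                continue  # all-zero block
            for i in range(z):
                # row i of the shifted identity has its 1 in column (i+s) mod z
                H[r * z + i][c * z + (i + s) % z] = 1
    return H
```

Because every non-zero block is a cyclic shift, routing a z-element block of messages only requires rotating it by s positions, which is exactly what a barrel shifter does.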

Although the two-phase scheduling naturally stems from the bipartite structure of the Tanner graph, better decoding efficiency is obtained with a different scheduling, known as layered decoding, shuffled decoding, horizontal scheduling or turbo decoding message passing [7]. In this decoding approach, the parity check matrix is seen as the concatenation of multiple layers, or constituent codes, each one associated to a number of rows. Within a given iteration, CN processing is applied to a layer and the obtained messages are immediately forwarded to the connected VNs. These VNs are then updated and their messages sent to the CNs belonging to the next layer. Thus, an iteration is divided into a number of sub-iterations, one for each layer. The reason why layered decoding has replaced the two-phase scheduling is its better efficiency: it reduces the number of iterations by up to 50%, with no penalty in communication performance, thus enabling a lower decoding latency than the original scheduling.

In layered decoding, (10) is modified as

α^(n)_{i,j} = λ_j + Σ_{k∈M(j)} β^(n)_{k,j} − β^(n−1)_{i,j} = α^(n)_j − β^(n−1)_{i,j}   (12)

Message α^(n)_j is now a unique value sent by VN j to all connected CNs; each CN derives its own message α^(n)_{i,j} as the difference between α^(n)_j and β^(n−1)_{i,j}.

Figure 12: A NoC based decoder architecture

Figure 13: Partially parallel layered decoder architecture

To implement the partially parallel layered decoder, the high level architecture in Figure 13 can be used. The main elements of the shown architecture are:

• the PE, here indicated as CNP, which implements (9) and (12),

• the memory units (MU), which initially receive λ_j and then store the cumulative variable to check messages α^(n)_j,

• the two permutation networks Π and Π^−1, which allow proper alignment of the messages α^(n)_j exchanged between CNPs and MUs,

• the controller, which generates addresses for the MUs and control signals for the permutation networks.

PE processing is also detailed in Figure 13: PE j receives α^(n)_j through the network from an MU. The first adder computes the difference between α^(n)_j and β^(n−1)_{i,j}, with the second operand read from memory S. The obtained message α^(n)_{i,j} is stored in a FIFO and also processed by the Ψ unit, which generates the new check to variable message β^(n)_{i,j} and stores it back to memory. Finally, the α^(n)_{i,j} message is retrieved from the FIFO and added to β^(n)_{i,j}: the resulting sum is an updated version of α^(n)_j that will be used when processing the following layer. It is then stored back into the MUs through the permutation network.
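The per-layer processing just described can be sketched in software as follows; for brevity the Ψ unit is modeled here with the Min–Sum check node update rather than the full SPA, and all names and data structures are illustrative assumptions:

```python
def process_layer(alpha, layer_rows, beta):
    """One layered-decoding sub-iteration. alpha[j]: cumulative VN value
    alpha_j; beta[(i, j)]: stored check-to-variable message; layer_rows:
    list of (check node index, list of connected VN indices)."""
    for i, cols in layer_rows:
        # (12): recover the variable-to-check messages from the cumulative value
        a = {j: alpha[j] - beta.get((i, j), 0.0) for j in cols}
        for j in cols:
            # Min-Sum stand-in for the Psi unit: sign product, minimum magnitude
            others = [a[k] for k in cols if k != j]
            sign = 1.0
            for v in others:
                sign = -sign if v < 0 else sign
            new_beta = sign * min(abs(v) for v in others)
            # write back the updated cumulative value and the new message
            alpha[j] = a[j] + new_beta
            beta[(i, j)] = new_beta
    return alpha
```

The FIFO of Figure 13 corresponds to keeping a[j] available while the new β is computed; here the dictionary `a` plays that role.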

Several variations are possible for this general architecture. First of all, the Ψ unit can implement either the original SPA or a reduced complexity version, such as the Min–Sum or other simplified algorithms mentioned in Section 3.4. The size of the S memory can be greatly reduced by adopting the compression method also described in Section 3.4, where only the first two minimum magnitudes are stored, together with the sign bits and the position of the minimum.

Possible changes are also applicable to the number of inputs/outputs in the PE. As shown in Figure 13, the PE is usually organized in a serial form: it receives and generates a message per clock cycle. Although parallel check node processing units have also been proposed to increase parallelism by handling multiple messages at the same time [8], the serial organization is almost always preferred because of its capability to easily adapt to different node degrees. The parallel PE architecture is detailed in Section 4, in the context of highly parallel decoders for multi Gb/s applications.

3. Low complexity decoder

From a historical point of view, the minimization of occupied area has been considered for a long time as the most important optimization target in the design of complex baseband processing components. Since production cost grows rapidly with Silicon area, area efficiency is clearly a very important characteristic in any product to be offered to the consumer and telecom market. Today, ultra-deep sub-micron CMOS technologies enable the integration of huge amounts of gates on a very small die, with an extremely low cost of allocated devices. Therefore, area efficiency per se is no longer viewed as the main design objective and other optimization criteria dominate the designer's choices. However, techniques developed to minimize the area are still of interest, not only for historical reasons, but also to reduce the static power dissipation, which is becoming a dramatic problem in the coming generations of CMOS processes.

There are several design approaches to improve area efficiency. Optimization of the internal precision of data is beneficial to both the simplification of the processing components and the reduction of memory footprint. Specific approximations can also be introduced at the algorithm level to avoid complex arithmetic operations and to reduce the amount of storage.

3.1. Internal precision optimization

In this section, we describe techniques to optimize the internal precision of the decoder. This issue is addressed from a twofold perspective. On one side, we intend to show what kinds of trade-offs can be achieved between error correcting capabilities, decoding speed and area complexity for arithmetic and processing units. As a matter of fact, the internal accuracy (bit width) of data affects both the combinational delay of the basic arithmetic components (such as adders, comparators, multiplexers, ...) and their area. Therefore, a careful choice of internal precision is one of the most important tasks in the design of any decoder.

For turbo codes, we specifically discuss the saturation of the state metrics and the extrinsic metrics.

For LDPC codes, we present the saturation of the soft output values.

On the other side, the bit width also influences the footprint of the memory components, which usually dominates the whole area complexity of a channel decoder. Thus, to reduce the memory cost, techniques have been proposed to internally compress the processed data before storing. Both lossless and lossy methods have been considered: lossless approaches do not introduce any drop in error correcting capabilities, while lossy solutions call for a careful analysis of the achievable compromises between BER performance and area saving.

3.2. Precision optimization in Turbo decoder

Since the log-MAP algorithm is implemented, the result of the decoding is not affected by adding a constant to all metrics (see equation (8)). Thus, for each stage i, it is possible to replace the computation of (4) by adding the constant value |a_i| + |y_i| to m_i(s′, s). In that case, (4) becomes:

m(s′, s) = 2X_i(s′, s)|a_i| + 2Y_i(s′, s)|y_i|   (13)

The advantage of this new formulation is that all branch metrics are now positive. Moreover, they are all multiples of 2, i.e., the least significant bit of the branch metrics (and, by extension, of the forward and backward state metrics) is always equal to zero. In order to remove this useless LSB, (13) is simply replaced by:

m(s′, s) = X_i(s′, s)|a_i| + Y_i(s′, s)|y_i|   (14)

Internal precision. Let the symbols b_LLR, b_z, b_a, b_m, b_M and, finally, b_Λ be the numbers of bits used to code, respectively, the quantized input value x (and y), the extrinsic value z (and its scaled version δ_h z), the a priori information a, the branch metric m(s′, s), the state metrics M^F and M^B and, finally, the output Λ. These values are not mutually independent.

According to Figure 3, m(s′, s) is the summation of y, coded on b_LLR bits, and a, coded on b_a bits. Assuming b_a ≥ b_LLR, m(s′, s) is coded on b_m = b_a + 1 bits. The relation between b_m and b_M has been intensively studied in the literature, first in the case of the Viterbi algorithm ([9], [10]), and then extended to the Forward-Backward algorithm ([11]). The principle of these algorithms is simple. Let us explain it in terms of probability: the branch metrics are expressed in the log domain in a quantized way, which means, in the probability domain, that all branches have a minimum non zero probability ε. Since the encoder has a constraint length ν, the ratio of probabilities between the state with the lowest probability and the state with the highest probability is always bounded by ε^ν. In fact, any couple of states at time i is connected to any state at time i − ν by two paths of length ν (here we assume i ≥ ν); the product of branch probabilities along those two paths is upper bounded by 1 and lower bounded by ε^ν. In the log domain, this means that the difference between the highest state metric and the lowest state metric is always bounded. The exact bound can be obtained by simulating the forward recursion (or backward recursion) with the most reliable possible received "all zero" sequence (see [11]), i.e., a received "all zero" message with maximum probability (i.e., x^0_i = −2^(b_x−1) + 1, y^0_i = −2^(b_x−1) + 1 and a^0 = −2^(b_a−1) + 1). With this message, the upper bound of the maximum difference between the lowest and the highest state metrics is given by:

∆ = max{ max{ |M^F_i(s) − M^F_i(s′)|, s ≠ s′ }, i = 1..K }   (15)

Note that the value of ∆ depends on the initial state vector M^F_0. Once ∆ is known, it is possible to derive b_M as:

b_M = ⌈log2(∆)⌉ + 2   (16)

where ⌈x⌉ denotes the smallest integer greater than or equal to x. In fact, using the rescaling method, when all state metrics are above 2^(b_M−1) (half the maximum value), 2^(b_M−1) is subtracted from all of them. This operation has a low cost in hardware: the test consists only in verifying that the Most Significant Bit (MSB) of every state metric is equal to 1 (done by an AND gate with 2^ν inputs), and the subtraction simply consists in setting the MSB to 0. It is also possible to perform all operations in mod 2^(b_M) arithmetic, as shown in [12], [13]. Note that it is also possible to use only b_M = ⌈log2(∆)⌉ + 1 bits for the state metrics if ∆ + 2^(b_m) < 2^(⌈log2(∆)⌉). The last value to be determined is b_z, which is generally set to b_x + 1. Typical values for an 8-state turbo decoder are b_x = 6, b_z = 7, b_a = 8, b_m = 8 and b_M = 10.
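The rescaling test and subtraction described above can be sketched as follows (a minimal software model under our own naming; the hardware performs the same check with a wide AND gate on the MSBs):

```python
def rescale_state_metrics(metrics, b_M):
    """Rescaling sketch: when every state metric has its MSB set
    (metric >= 2^(b_M - 1)), clear the MSB, i.e. subtract half the range
    from all metrics. Metrics are modeled as unsigned b_M-bit integers."""
    half = 1 << (b_M - 1)
    if all(m & half for m in metrics):          # AND over all the MSBs
        metrics = [m & ~half for m in metrics]  # clearing the MSB subtracts half
    return metrics
```

Since only metric differences matter in (8), subtracting the same constant from every state metric leaves the decoding result unchanged.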

Non uniform quantization schemes can be exploited to save a few bits in the finite precision representation of the data in the turbo decoder. This saves area by reducing the size of the memories. The drawback, generally, is a small degradation in terms of performance. There are many tricks that can be applied to make a design significantly more efficient than the "standard" one. The readers are invited to refer to [14], [15] (compression of the state metrics) or [16], [17], [18] (compression of the extrinsic information) to get some ideas of methods that can be used.

3.3. Sliding-Window technique in Turbo decoder

The sliding-window technique is used to reduce both the decoding latency of the MAP decoder and the size of the state metric memory. The idea is to split the frame of size N bits into Q windows of length L (Q = N/L) and to perform the forward-backward serially on each window, from 1 to L, then L + 1 to 2L and so on. For the forward recursion, the final state of a recursion can be used as the initial state of the next window. Nevertheless, a special processing is required to obtain the initial backward metrics. This can be done by first performing the backward recursion and saving the state metric M^B every L steps, as shown in figure 14.a. It can be seen in this figure that the size of the forward state metric memory is reduced from K to L, at the cost of Q additional backward state metric vectors, which can represent a large saving of memory area. The drawback is an increase of the decoding latency (from 2K to 3K) and the need for an additional backward recursion to find the initial states. It is possible to trade off latency against performance using a preamble of size L (as shown in figure 14.b) to estimate the initial backward state metrics of each window. The latency of this sliding window scheme is then reduced from 3K to 2K + L. An extensive description of the sliding window technique can be found in [11].

To avoid the overhead due to initializations, it is possible to use the state values of the backward recursion obtained during the previous iteration [19], as shown in figure 15. If the window size L is large enough, this method leads to excellent results.
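The memory and latency figures discussed above can be summarized with a small accounting sketch (counts are in trellis steps and state metric vectors; the function and its bookkeeping are our own illustrative assumptions, not the chapter's notation):

```python
def sliding_window_costs(K, L):
    """Rough accounting for the sliding-window MAP schedules of Figure 14:
    (a) backward pre-recursion with metrics saved every L steps,
    (b) preamble of length L estimating the initial backward metrics."""
    Q = K // L                                   # number of windows
    no_window = {"metric_vectors": K, "latency": 2 * K}
    scheme_a = {"metric_vectors": L + Q,         # L forward + Q saved backward
                "latency": 3 * K}
    scheme_b = {"metric_vectors": L,             # forward metrics of one window
                "latency": 2 * K + L}
    return no_window, scheme_a, scheme_b
```

For K = 1024 and L = 64, scheme (a) stores 80 metric vectors instead of 1024, at the cost of a latency of 3K instead of 2K.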


Figure 14: Sliding window technique


Figure 15: Sliding window with next-iteration initialization


Figure 16: Implementation of Log–MAP scheme

3.4. Algorithm simplifications

Implementation of the MAP algorithm in the logarithmic domain makes it possible to replace multiplications with simpler additions and to avoid exponential computations in the generation of branch metrics. However, additions in the original MAP algorithm become, in the new domain, complex transcendental computations of the following form:

ln(e^a + e^b) = max∗(a, b)   (17)

Multiple implementations for the max∗ operator are possible, with different degrees of complexity and accuracy. The two most widely known and used solutions are


1. the Log–MAP scheme, shown in Figure 16:

   ln(e^a + e^b) = max∗(a, b) = max(a, b) + ln(1 + e^−|a−b|)   (18)

   where the max∗ operator is implemented by means of a simple comparator to obtain the sign of the a − b difference, a multiplexer to select the largest value, a look-up table where the possible logarithmic correction terms are stored, and an adder. Two or three bits are usually enough to represent the corrections stored in the look-up table. The Log–MAP scheme enables the exact computation of max∗, but its hardware implementation has a non negligible area occupation: the area occupied by the circuit of Figure 16, synthesized for a 130 nm CMOS process with 8 and 10 bits, is given in the second row of Table 1.

2. the Max–Log–MAP approximation, which only needs a comparator and a multiplexer:

   ln(e^a + e^b) = max∗(a, b) ≅ max(a, b)   (19)

   As shown in the third row of Table 1, the complexity of the Max–Log–MAP approximation is much lower than that of Log–MAP. Moreover, the Max–Log–MAP introduces a shorter latency in the computation of max∗, equal to the propagation delay of the comparator plus the multiplexer, without the contribution coming from the final adder shown in Figure 16. Since the max∗ operator is included in the iterative equations for the update of the forward and backward state metrics, a shorter latency in the calculation of max∗ translates into a higher clock frequency and therefore a higher achievable throughput for the whole decoder. On the other hand, the Max–Log–MAP approximation causes a penalty in the error correcting capabilities of binary turbo codes, typically equal to a few tenths of a dB.

To obtain fair comparisons, the area values in Table 1 are derived under the same synthesis conditions: 130 nm CMOS technology, target clock frequency 200 MHz, maximum area optimization effort. Decoding performance also refers to the same simulation conditions: 16-state, rate 1/2 turbo code with generator polynomials (1, 33/23) in octal notation, codeword length N = 10^3, pseudo–random turbo interleaver, maximum number of decoding iterations 10, AWGN channel.

Several algorithms have been proposed to approximate max∗ with different methods and different trade–offs between implementation complexity and BER performance.

Table 1: Comparison among different max∗ implementations in terms of: (i) occupied area (µm²) on a 130 nm CMOS technology, (ii) difference of required SNR ∆SNR (dB) with respect to Log–MAP at BER 10^−4

Scheme             area for 8 bits   area for 10 bits   ∆SNR
Log–MAP            1022.72           1276.888           0
Max–Log–MAP        250.133           312.666            0.4
Constant Log–MAP   599.108           754.519            0.03
Linear Log–MAP     978.342           1264.784           0
Improved Log–MAP   891.602           1135.684           0
PWL Log–MAP        653.573           831.086            0.05

In most of them, the correction term in (18) is obtained by means of an approximated computation instead of being stored in a look-up table. The main approximations based on this approach are:

1. the Constant Log-MAP [20], where the correction term is equal to 3/8 if |a − b| < 2 and to 0 otherwise;

2. the Linear Log-MAP [21], where the correction term is obtained through a second maximum calculation: max(ln 2 − 0.25 · |a − b|, 0);

3. the Improved Log–MAP [22], with the correction evaluated as ln 2 · 2^−|a−b|.

A different approach has been adopted in [23]: instead of approximating the correction term in (18), the whole max∗ operator is approximated by means of robust geometric programming [24]. This approach leads to a Piece–Wise Linear (PWL) approximation, with very low implementation complexity in the case of three terms:

max∗(a, b) ≅ max(a, 0.5 · (a + b + 1), b)   (20)

As reported in Table 1, the Constant Log–MAP and the PWL based Log–MAP allow for a relevant area saving (more than 35%) with respect to Log–MAP, while the Linear and Improved Log–MAP approximations provide a much more limited area advantage. In terms of BER performance, all these approximated methods exhibit large gains with respect to Max–Log–MAP and they are very close to the performance offered by the Log–MAP algorithm (differences are around 0.05 dB at a BER level of 10^−4).
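The two-input max∗ variants above are easy to compare numerically; a small sketch, with function names of our own choosing, following (18), (19), (20) and the three approximations just listed:

```python
import math

def max_star(a, b):
    # Log-MAP: exact Jacobian logarithm (18)
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_log(a, b):
    # Max-Log-MAP approximation (19)
    return max(a, b)

def const_log(a, b):
    # Constant Log-MAP [20]: correction 3/8 when |a - b| < 2
    return max(a, b) + (0.375 if abs(a - b) < 2 else 0.0)

def linear_log(a, b):
    # Linear Log-MAP [21]
    return max(a, b) + max(math.log(2) - 0.25 * abs(a - b), 0.0)

def improved_log(a, b):
    # Improved Log-MAP [22]: correction ln2 * 2^(-|a-b|)
    return max(a, b) + math.log(2) * 2.0 ** (-abs(a - b))

def pwl_log(a, b):
    # Three-term piecewise-linear approximation (20)
    return max(a, 0.5 * (a + b + 1), b)
```

All corrections are largest at a = b, where the exact value exceeds max(a, b) by ln 2 ≈ 0.693.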

The max∗ operator is extensively used in turbo code decoders, therefore the low complexity approximations mentioned above are of great interest.


The max∗ operator is also often used with more than two input values. The dominant implementation approach in this case is the adoption of a sequential computational structure, where the max∗ operator is applied recursively over pairs of values. For example, with three input values, the computation is organized along two recursive steps:

max∗(a, b, c) = max∗(max∗(a, b), c)   (21)

Alternatively, a parallel computation structure can be used, with the advantage of reducing the overall latency, especially for large numbers of input values. Other area–latency trade–offs are obtained by resorting to approximate formulations of max∗. In [25], it is shown that the Chebyshev inequality can be exploited to approximate (21) as

max∗(a, b, c) ≅ max(a, b, c) + ln(1 + e^−δ)   (22)

where δ is the difference between the first and the second maximum. This approximation can be implemented by means of a very efficient computational structure, which allows for a relevant area saving with respect to the realization of max∗ in both parallel and recursive forms. The high level architecture view includes two components: one component finds the first and second maximum, while the second deals with the correction term ln(1 + e^−δ). The two largest input values can be selected using a tree structure [26], which recursively decomposes the search problem for a set of n values into two parallel search problems on n/2 values. A simple implementation of the correction term is obtained by means of a LUT accessed by δ. The area saving enabled by this approach, with respect to the usual implementation based on the recursive application of the max∗ operator as in (21), is between 50% and 70%. As for the effect on decoding capability, the approximation in (22) exhibits excellent performance: for a 16–state binary turbo code, (22) introduces a penalty of 0.05 dB at a BER of 10^−5 with respect to Log–MAP, while for an 8–state duo–binary turbo code a negligible performance degradation was observed at a BER of 10^−6.
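A software sketch of the n-input approximation (22), together with the recursive form (21) used as reference (function names are ours):

```python
import heapq
import math

def max_star_n_tree(values):
    """Approximation (22): exact maximum over all inputs plus a single
    correction term driven by delta = first max - second max."""
    first, second = heapq.nlargest(2, values)
    return first + math.log1p(math.exp(-(first - second)))

def max_star_n_recursive(values):
    """Recursive pairwise max* as in (21), used here for comparison."""
    acc = values[0]
    for v in values[1:]:
        acc = max(acc, v) + math.log1p(math.exp(-abs(acc - v)))
    return acc
```

The recursive form computes the exact log-sum-exp of the inputs, while the tree form trades a small accuracy loss for a much shorter critical path.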

Approximate computations have also been proposed for the most complex processing units in LDPC code decoders. In particular, in the sum–product algorithm (SPA), the check node update implies relevant complexity; moreover, it is very sensitive to finite word length effects. For these reasons, SPA is normally implemented in a simplified form. The most widely known solution to reduce the complexity of check node computations is the Min–Sum algorithm [27], which is here reported for the case of a degree n check node, with n inputs in_i and n outputs out_k:

out_k = ∏_{i=1, i≠k}^{n} sign(in_i) · min_{i≠k}(|in_i|),   k = 1, ..., n   (23)

where the sign function takes the sign of its argument and the min function is applied to the absolute values of the received inputs whose index differs from that of the computed output. This computation can be implemented by means of simple hardware components, as shown in Figure 17 for the case n = 5. Part (B) of the figure shows the basic box–plus component, able to process two inputs: two subtracters and two multiplexers compute the absolute values of inputs a and b; the minimum absolute value is identified by means of a third subtracter, which controls the final multiplexer; finally, the xor gate receives the two sign bits and its output is used as the selection signal for the two central multiplexers, which propagate the right sign values.

Part (A) of the figure shows an efficient structure to apply (23) to a number n of inputs: two sequences of box–plus components are allocated to combine sequentially inputs 1 to 4 in the forward direction and inputs 2 to 5 in the backward direction. The forward and backward chains generate outputs 5 and 1, respectively. Three additional components are used to generate outputs 2 to 4, by combining intermediate values along the two chains. This backward–forward structure can be easily generalized to process any number n of inputs. The number of required box–plus components is in general 3n − 6. When adopted to synthesize a parallel 8–input check node on a 130 nm technology for a target clock frequency of 200 MHz, this Min–Sum computation scheme leads to a complexity of 1.49 equivalent Kgates.
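The backward–forward structure of part (A) maps naturally to software; below is a sketch in which the box–plus component is modeled as the two-input Min–Sum update (function names are ours, and a check node degree of at least three is assumed):

```python
def box_plus(a, b):
    # Min-Sum box-plus: product of the signs, minimum of the magnitudes
    sign = 1 if (a >= 0) == (b >= 0) else -1
    return sign * min(abs(a), abs(b))

def check_node_min_sum(inputs):
    """Backward-forward Min-Sum check node, as in (23); needs n >= 3."""
    n = len(inputs)
    fwd = [inputs[0]] * (n - 1)            # fwd[i]: combine inputs[0..i]
    for i in range(1, n - 1):
        fwd[i] = box_plus(fwd[i - 1], inputs[i])
    bwd = [0] * n                          # bwd[i]: combine inputs[i..n-1]
    bwd[n - 1] = inputs[n - 1]
    for i in range(n - 2, 0, -1):
        bwd[i] = box_plus(bwd[i + 1], inputs[i])
    outs = [bwd[1]]                        # output 1 from the backward chain
    for k in range(1, n - 1):
        outs.append(box_plus(fwd[k - 1], bwd[k + 1]))
    outs.append(fwd[n - 2])                # output n from the forward chain
    return outs
```

The 2(n − 2) chain elements plus the n − 2 combining elements give the 3n − 6 box–plus count quoted above.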

Performance offered by the Min–Sum scheme in the decoding of LDPC codes has been improved by resorting to either additive or multiplicative correction, leading to the Offset Min–Sum and Normalized Min–Sum algorithms [27]. These two methods have a slightly larger complexity than the pure Min–Sum, 1.52 and 1.53 Kgates respectively, for the same parallel check node mentioned above.

PWL approximation has also been proposed to simplify SPA. For example, in [28] the proposed Min–Sum plus method uses a PWL computation where the multiplying factors are powers of two and can be easily implemented in hardware with shift operations.


Figure 17: Implementation of Min–Sum algorithm

Table 2: Comparison among different low complexity approximations for SPA, in terms of: (i) difference of required SNR ∆SNR (dB) with respect to SPA at BER 10^−4, (ii) implementation complexity in Kgates, (iii) average number of iterations (ANI).

Scheme               ∆SNR   complexity   ANI
Sum–Product          0      16.74        12.1
MacLaurin approx.    0      6.73         12.3
PWL Min–Sum          0.01   5.24         12.3
Normalized Min–Sum   0.06   1.53         12.5
Offset Min–Sum       0.22   1.52         13.0
Min–Sum              0.5    1.49         11.2

MacLaurin approximations have also been proposed to replace SPA [29], with close to optimum performance but a relevant increase of occupied area with respect to the previously mentioned approximations. Table 2 compares the mentioned approximations in terms of the difference of required SNR ∆SNR (dB) with respect to Sum–Product at BER 10^−4, the implementation complexity in Kgates and the average number of iterations (ANI). All reported data refer to homogeneous synthesis and simulation conditions: synthesis has been performed on a 130 nm CMOS technology with a target clock frequency of 200 MHz; simulations are run for a WiMAX LDPC code with rate 1/2 and codeword length 2304.

An interesting solution is introduced in [30], where the key operation in the check node update is formulated as the difference between two max∗ operations, and this idea is exploited to propose a low-complexity hardware component supporting the decoding of both turbo and LDPC codes. This approach offers an excellent trade–off between occupied area and performance in the implementation of a multi standard, multi mode turbo and LDPC decoder.

The use of the Min–Sum scheme in the decoding of LDPC codes is often associated with message compression [31]. Consider a generic parity check equation with degree d and messages represented on n bits, including the sign. In order to directly store all outgoing CN messages, n · d bits are necessary. It can be observed that, in the Min–Sum approximation, only two alternative results are possible when computing the CN output magnitudes: either the minimum or the one–but–minimum is selected. Therefore, only the first two minimum values need to be stored, together with all sign bits and the position of the minimum (an index that discriminates among d elements). Each individual message can be easily reconstructed from this compressed information, which uses 2(n−1) + d + ⌈log2 d⌉ bits. This approach may result in a large saving of memory space: for example, with d = 15 and n = 6, almost 70% of the memory capacity is saved with respect to direct storage [31].

3.5. Scheduling of turbo-code

In this section, we present the windowing techniques used to save memory (convergence length, next iteration initialisation). Several types of scheduling are described (F-B, B-F, butterfly), and the possibility to pipeline successive half iterations is also proposed. The impact of these parameters on hardware efficiency is discussed.

4. High throughput architectures

In the introduction, we presented the simplified equation (1) that gives the decoding throughput of a generic decoder as a function of the clock frequency F_clk, the number of information bits K, the maximum number of decoding iterations N_it^max, and the number of cycles per iteration N_c. For a parallel decoder, this equation can be extended as

D = (P_c · K · F_clk) / (N_it^avg × N_c)   (24)

The new parameter P_c represents the number of parallel components working concurrently in the architecture on independent codewords (discussed in Section 4.1). Compared to (1), the maximum number of decoding iterations N_it^max is replaced by the average number of decoding iterations N_it^avg, thanks to buffering techniques (described in Section 4.2). The reduction of N_c thanks to processing unit parallelism is presented in the three subsequent sub–sections. In fact, parallelism in the processing units requires parallel memory access, which can lead to memory conflicts. This problem is studied in Section 4.3, while Sections 4.4 and 4.5 are respectively dedicated to turbo and LDPC high throughput decoders. Finally, we omit in this section the maximization of F_clk, since it has been indirectly discussed in Section 3.1 about precision optimisation and in Section 3.4 about sub–optimal algorithms.

4.1. Basic solutions to increase decoding speed

Two straightforward approaches to increase the decoder throughput are the allocation of multiple processing units capable of decoding a number of received data blocks in parallel, and the unrolling of iterations. Both methods introduce a P_c factor larger than one in (24) and are characterized by constant hardware efficiency: for example, the duplication of a decoder simply introduces a factor of two in both occupied area and achieved throughput.

Allocation of multiple decoders in parallel is shown in Figure 18-(a): the simplified structure of the single instance decoder, which is composed of input and output buffers (IB and OB), a decoding core (DEC) and an internal memory (MEM), is replicated to handle multiple frames in parallel. Both processing and storage resources need to be replicated, leading to a linear growth of the occupied area with the degree of parallelism. The complexity increase is accompanied by a proportional power dissipation increase. On the other side, the throughput improves linearly with the number of parallel units, with no penalty in latency, which remains the same as for a single instance architecture.

Rather than having multiple complete decoders working at the same time, specific sub-parts of the decoding algorithm can be arranged for parallel processing. For example, in a turbo decoder, parallelism is routinely introduced in the implementation of the path metric update equations and in the simultaneous processing of several trellis windows.

Finally, parallelism can be added at the iteration level, allocating multiple decoders (N_it) that work in a pipelined fashion (Figure 18-(b)): at a given time, each of the N_it decoders performs a single iteration on a separate block, so supporting a decoding throughput N_it times higher than for a single decoder architecture. Similarly to the parallelization of whole decoders working on subsequent blocks, the unfolding of the iterative decoding algorithm in a pipelined architecture also requires additional storage for the multiple blocks that are simultaneously processed.

Figure 18: Parallel and unfolded architectures

Better efficiency can be achieved by exploiting parallelism and/or pipelining at some deeper level of the decoding algorithm.

4.2. Average vs worst case number of iterations

For a given SNR, the number of iterations required to decode a received codeword varies from one codeword to another. The difference between the average number of iterations N_avg needed to decode a codeword and the maximum number of iterations N_max required to obtain a given level of performance can exceed a factor of two (N_avg ≈ N_max/2). Sizing the hardware for the worst case means that, on average, half of the hardware resources of the decoder are not used. As shown in Section 5.2, there are methods to detect, at the end of each decoding iteration, the convergence of a frame. It is thus possible to size the hardware to process the average number N_avg of decoding iterations instead of the worst case N_max, provided that an input buffer is used to smooth the variation of the number of decoding iterations from one frame to another. This method has been presented by several authors [32], [33], [34]. Using the tools of queueing theory, they show that an extra input buffer of size two frames is enough in practice. In other words, without significant degradation in performance, it is possible to reduce the size of the hardware by a factor of 2 thanks to the input buffer. The effect of this method is to increase the maximum decoding latency of the decoder. If the throughput of the decoded bits needs to be constant, an output buffer is required as well; in that case, the decoding latency of the whole decoder remains constant.

4.3. Parallel error free memory access

Parallelism is often introduced in both turbo and LDPC decoders by dividing the received frame into several segments or windows that are processed simultaneously. This kind of parallelism demands particular attention in the definition of the interconnection structure; in particular, the decoder has to face the conflict problem in the access to memory banks. For both kinds of codes, the decoder architecture is composed of several PEs connected with memory banks through an interconnection network. In the case of turbo codes, the network executes interleaving and de–interleaving operations according to specific permutation laws. In the case of LDPC codes, the network supports the exchange of messages according to the connections between CNs and VNs defined by the H matrix.

Memory collisions may occur when several PEs need to access the same memory bank at the same time, leading to possible stalls in the decoding process with degradations in the achievable throughput. In turbo decoders, the collision problem arises because reading/writing operations occur in two different orders, the natural and the interleaved one; in LDPC decoders, it comes from the irregular structure of the H matrix. Therefore, the inter–PE communication structure has a key role in the decoder design, particularly when the number of nodes, the level of parallelism and the number of codes to be supported increase.

The general problem can be stated as follows: given N data, B memory banks and P PEs, find a partition of the N data among the B banks such that the P PEs are always able to access all the memory banks in parallel, with no conflicts, for both reading and writing data. Graphically, this partitioning problem can be represented by means of a graph, in which each vertex represents a bit position in the block (and therefore an address in an undivided interleaver memory) and the edges connect vertices that are accessed simultaneously: adjacent vertices represent cells that must be placed in different partitions (Figure 26). Thus, the partitioning of the interleaver into sections can be stated as a graph coloring problem.
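Once the problem is cast as graph coloring, a bank assignment can be attempted with a simple greedy coloring of the conflict graph; the sketch below uses our own naming and a first-fit heuristic, whereas practical designs may rely on more elaborate search or code-specific mappings:

```python
def partition_memory(accesses, num_banks):
    """Greedy coloring sketch: 'accesses' is a list of groups of data
    addresses read in the same clock cycle by the parallel PEs; addresses
    in the same group must land in different banks (colors)."""
    # build the conflict graph: one edge per pair of simultaneous accesses
    neighbors = {}
    for group in accesses:
        for a in group:
            neighbors.setdefault(a, set()).update(x for x in group if x != a)
    bank = {}
    # color highest-degree vertices first (a common greedy ordering)
    for a in sorted(neighbors, key=lambda x: -len(neighbors[x])):
        used = {bank[n] for n in neighbors[a] if n in bank}
        free = [b for b in range(num_banks) if b not in used]
        if not free:
            return None        # greedy coloring failed with this many banks
        bank[a] = free[0]
    return bank
```

If `partition_memory` returns None, either more banks are needed or a smarter mapping (or a collision-tolerant architecture) must be used.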

Several solutions have been proposed in order to avoid, or at least alleviate, the occurrence of memory collisions. In particular, three different approaches can be devised for both turbo and LDPC codes:

1. design of conflict free codes,

2. hardware structures and algorithm changes introduced to mitigate the effects of conflicts,

3. code independent solutions.

The first approach is a joint code-decoder design where the codes are designed imposing specific constraints that simplify the decoding process and avoid collisions in the memory access. In the design of LDPC decoders, the parity check matrix H is made of sub-matrices where the ones, i.e. the edges in the Tanner graph, are properly selected in order to avoid collisions.


For example, in [7], H is divided into several tiles or sub-matrices, each one obtained starting from identity matrices via row permutations, with the constraint that every column and row has a single element equal to one. This structure greatly simplifies the decoder implementation, since the exchanged messages can be retrieved from memories with a simple addressing scheme (i.e. shift registers). If attention is paid to avoiding small cycles in the Tanner graph, the girth is not reduced and these kinds of codes can reach communication performance equivalent or even superior to randomly generated irregular codes.

A similar approach was also proposed for the so-called Irregular Repeat-Accumulate (IRA) codes [35] and implemented in a number of works [36][37][38].

As for turbo codes, in order to design collision-free interleavers with assigned block size and number of parallel PEs, special transformations have been proposed [39][40], based on a sequence of cyclic shifts along the columns of an initial row-column interleaver. The effect of these transformations is to evenly distribute the accesses among the available memories, thus obtaining a collision-free interleaver. In order to reduce the regularity of the obtained permutation law and to increase its spreading properties, both intra-row and inter-row permutations have been suggested as additional transformations that do not affect the property of avoiding collisions, but at the same time improve decoder performance.
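The collision-free property can be checked mechanically. The sketch below is not the transformation of [39][40]; it is an illustrative construction with the same flavor: a barrel-shift style permutation for which P PEs, each scanning one window, always address P distinct banks in both natural and interleaved order:

```python
def is_collision_free(pi, P, W):
    """pi: interleaver permutation of size P*W; bank b holds addresses
    [b*W, (b+1)*W); PE p scans window p, one address per time step t."""
    bank = lambda addr: addr // W
    for t in range(W):
        natural = {bank(p * W + t) for p in range(P)}
        interleaved = {bank(pi[p * W + t]) for p in range(P)}
        if len(natural) < P or len(interleaved) < P:
            return False            # two PEs hit the same bank this cycle
    return True

P, W = 3, 4
# barrel-shift construction: datum (p, t) is mapped to bank (p + t) mod P
pi = [((p + t) % P) * W + t for p in range(P) for t in range(W)]
assert sorted(pi) == list(range(P * W))       # pi really is a permutation
assert is_collision_free(pi, P, W)

bad = list(range(P * W))
bad[1], bad[W] = bad[W], bad[1]   # now PE0 and PE1 both hit bank 0 at t = 0
assert not is_collision_free(bad, P, W)
```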

The generation of a collision-free S-random interleaver is described in [41], together with its implementation in the specific case of a rate 1/3 turbo code and block size equal to 4096. The proposed technique basically inserts collision avoidance constraints into a procedure for the generation of S-random interleavers, and the resulting penalty in terms of code performance is limited. The combination of spatial and temporal permutations is another technique that has been explored to obtain dividable interleavers resulting in good performance. In [42], a bit reversal permutation is concatenated with optimized shuffle patterns capable of avoiding low-weight error events. In [43], a two-stage interleaver is proposed, able to cover a wide range of interleaver sizes: the first stage implements a spatial permutation by means of a cross-bar unit, while the second one applies a temporal permutation to each parallel SISO. The most interesting feature of this solution is that it offers trade-offs between the latency of the obtained architecture and the performance of the code.

The design of conflict-free interleavers and parity check matrices has been proved to be a very efficient way to simplify the implementation of parallel

decoders. However, this approach suffers from a few limitations: (i) it is not applicable to all existing standards, but only to those standards that adopted conflict-free codes; (ii) it only supports the specific kinds of parallelism that have been taken into account in the design of the code, so that not all parallel decoders are necessarily feasible; (iii) this approach can hardly lead to the design of parallel flexible decoders, able to support a wide spectrum of codes.

In the second and third approaches listed at the beginning of this section, no constraints are posed on the code construction, but architectural strategies are adopted in the decoder design in order to avoid collisions for any specified code. Instead of constraining the interleaver design to be supported in a parallel decoder, the collision problem can be faced by designing proper architectures capable of masking the conflicts in the memory access. The indubitable advantage of this approach is that it is viable in principle for any permutation law and, as a consequence, it is standard compliant. However, the design of a fully flexible decoder, able to cope with any kind of code, can only be achieved at the cost of additional hardware resources and penalties in terms of decoding throughput.

When an interleaver is mapped to multiple parallel memories, frequent conflicts in the access to a specific bank are generated, but each memory is on average accessed once per cycle. Based on this observation, rather than modifying the permutation in order to eliminate the conflicts, one can simply manage the conflicts by buffering data that cannot be immediately written to (or read from) the memory. A number of papers have been published on this approach, proposing different buffering architectures [44][45][46]. For a parallel architecture with P PEs and interleaving memories, a ring-shaped network is proposed in [44] (Figure 19), consisting of P RIBB (Ring Interleaver Bottleneck Breaker) cells placed between PEs and memory banks. Each cell is connected to the previous and to the following one in a ring structure; moreover, it has connections with the corresponding SISO and memory. The RIBB cell contains local buffers for the lateral outputs and the memory connection. An LLR generated by a SISO is accompanied by a tag identifying the destination bank; according to this tag, the LLR is sent along the ring and, when received by the cell connected to the destination bank, it is either written into the memory or, in case a conflict occurs, it is stored locally for later writing. Lateral buffers at the left and right ring outputs accommodate simultaneously circulating data. The RIBB decoder can be viewed as a very simple Network-on-Chip (NoC) based decoder, where the


Figure 19: RIBB based Turbo decoder


ring topology offers lower implementation complexity than other networks, such as for example the crossbar switch. On the other hand, the ring architecture introduces additional latency, essentially because of the delay necessary to send the LLRs along the network and because of the buffers necessary for the resolution of collisions. The required length of the local buffer decreases with P, but the size of the right and left buffers increases; due to their impact on required area and dissipated energy, the large size sometimes reached by these buffers is the major limitation of the proposed solution.
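The buffering principle can be modeled in a few lines. The toy simulation below is not the RIBB architecture itself, only the underlying idea: writes that collide on a bank are queued, each bank commits one write per cycle, and the leftover queue contents translate into stall cycles:

```python
from collections import deque

def extra_cycles(write_schedule, n_banks):
    """write_schedule[t]: list of banks targeted by the PEs at cycle t.
    Each bank commits one write per cycle; colliding writes wait in a
    FIFO buffer. Returns the extra cycles needed to drain all buffers."""
    queues = [deque() for _ in range(n_banks)]
    for requests in write_schedule:
        for b in requests:
            queues[b].append(b)
        for q in queues:          # one write committed per bank per cycle
            if q:
                q.popleft()
    extra = 0
    while any(queues):            # drain what the conflicts left behind
        for q in queues:
            if q:
                q.popleft()
        extra += 1
    return extra

# conflict-free schedule: no extra cycles are needed
assert extra_cycles([[0, 1], [0, 1]], 2) == 0
# bank 0 is targeted twice in two consecutive cycles: one stall cycle overall
assert extra_cycles([[0, 0], [0, 0], [1, 1]], 2) == 1
```

This also illustrates the observation above: as long as each bank is targeted once per cycle on average, the buffers stay small and only occasional stalls occur.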

With the purpose of combating the growth of buffer size, the RIBB scheme has been generalized in a subsequent work [45], allowing for more connections between one cell and the others. The new scheme, named GIBB, for Generalized Interleaver Bottleneck Breaker, makes use of more complex network topologies that reduce both the latency in the transport of LLRs and the length of the buffers; on the other hand, in this case the area required for the communication structure increases significantly with respect to the RIBB case.

The RIBB approach was also proposed for LDPC codes [47]: again, memory collisions are not completely avoided, but they are significantly reduced by means of a ring buffer. When a collision occurs between two generated addresses for access to the same memory, one of the addresses is inserted in a circular buffer that delays the memory access until no collision occurs.

A network-based approach was originally proposed in [48], where the LDPC processing nodes are connected together by means of an NoC infrastructure (see Figure 12 in Section 2). This idea was later developed and deeply investigated to efficiently support virtually any turbo and LDPC code [49][5][50], with almost no penalty in terms of throughput with respect to implementations optimized for a specific class of codes. For example, both indirect and direct network topologies are studied in [51], and the obtained results show that an NoC based decoder can achieve a throughput of several hundreds of Mb/s, with a cost overhead limited to less than 15% with respect to a fully dedicated implementation.

A generic message passing architecture was proposed in [52] to decode generic turbo or LDPC codes. The main idea is to allocate a pair of properly controlled permutation networks that separate the memory banks from the decoder PEs: these permutation networks spatially scramble the messages, while the memories introduce a temporal interleaving (Figure 20). Input and output ports are connected to the decoder PEs, check and variable nodes


Figure 20: A flexible message exchange network for Turbo and LDPC decoders


for LDPC decoders, SISOs for turbo decoders. The permutation networks can be implemented as crossbar switches or Benes structures. The combination of the spatial and temporal permutations has been shown to be able to avoid collisions for any code and any degree of parallelism. Each network implements a proper permutation scheme that can be generated, starting from the knowledge of the parity check matrix, via an annealing algorithm which progressively refines an initial solution. As this algorithm is complicated, it has to be run off-line and the generated permutations need to be stored in additional memories. The proposed scheme is able to cope with any turbo and LDPC code, allowing the design of an effective flexible and reconfigurable decoder. On the other hand, the area requirements of the proposed approach tend to be dominated by the permutation networks and internal memories.

A different memory mapping methodology that provides conflict-free access to memory units for all PEs in a partially parallel LDPC decoder is proposed in [53]. The problem is modelled resorting to a tripartite graph, where the three sets of vertices include the time instances at which data are read, the messages to be exchanged, and the time instances at which data are written, respectively. Edges indicate time constraints, meaning that a vertex corresponding to a given message is connected to multiple vertices, one for each time instance at which the message has to be read or written. Graph coloring is then applied to the tripartite graph to color its edges so that messages can be exchanged between the PEs and the memory banks without any conflict at any time instance. This approach appears efficient and could be implemented in hardware in a form that allows for on-the-fly generation of the required partitions for any code.

4.4. High speed turbo-code

In this section, we present several architectural structures to decrease the number of clock cycles Nc needed to perform a decoding iteration. Three different levels of architecture are explored: trellis level, Forward-Backward level and, finally, half-iteration level.

4.4.1. Radix-4 architecture

The basic structure of the Forward-Backward decoder presented in section 2.2.2 processes a trellis section in one clock cycle (see equation (6)). By unfolding this equation, it is possible to compute directly $M^B_{i+2}$ as a function of $M^B_i$ and the branch metrics $m_i$ and $m_{i+1}$. In that case, there are four possible branches reaching each state of the trellis. This technique is often called "compressed trellis" or radix-4 trellis [54], [55], [56]. The advantage of this method is to reduce by a factor of 2 the number of clock cycles required to process a window of size L. The drawback is that it is more complex than the standard architecture (say, radix-2) and also has a longer critical path. Nevertheless, there are other parts of the circuit that constrain the maximum clock frequency (memory access, for example) and, in practice, the radix-4 architecture does not significantly affect the clock frequency of the whole decoder. In summary, it is an efficient method often used for high-speed turbo decoders. One should also note that duo-binary turbo codes (standardized, for example, in DVB-RCS) use by construction a radix-4 architecture. The radix-8 architecture (computation of $M^B_{i+3}$ from $M^B_i$ and the 3 intermediate branch metrics) is not often used in practice because of its complexity.
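The equivalence between one compressed-trellis step and two elementary steps can be verified numerically. The sketch below uses a hypothetical 2-state trellis and a max-log style recursion (max in place of the exact max* operator):

```python
# hypothetical 2-state trellis: from state s, input bit u leads to next_state[s][u]
next_state = [[0, 1], [1, 0]]

def radix2_step(M, gamma):
    """One radix-2 forward update (max-log approximation):
    M'[s'] = max over branches (s, u) reaching s' of M[s] + gamma[s][u]."""
    M_new = [float('-inf')] * 2
    for s in range(2):
        for u in range(2):
            sp = next_state[s][u]
            M_new[sp] = max(M_new[sp], M[s] + gamma[s][u])
    return M_new

def radix4_step(M, gamma_i, gamma_i1):
    """Compressed trellis: two sections at once; each state is reached by
    four combined branches (u_i, u_{i+1})."""
    M_new = [float('-inf')] * 2
    for s in range(2):
        for u0 in range(2):
            s_mid = next_state[s][u0]
            for u1 in range(2):
                sp = next_state[s_mid][u1]
                m = M[s] + gamma_i[s][u0] + gamma_i1[s_mid][u1]
                M_new[sp] = max(M_new[sp], m)
    return M_new

M0 = [0.0, -1.5]
g_i  = [[0.3, -0.7], [1.1, 0.2]]     # invented branch metrics
g_i1 = [[-0.4, 0.9], [0.5, -0.1]]
# one radix-4 step equals two consecutive radix-2 steps
assert radix4_step(M0, g_i, g_i1) == radix2_step(radix2_step(M0, g_i), g_i1)
```

The radix-4 unit pays for the halved cycle count with the deeper four-way selection visible in the inner loops, which is the longer critical path mentioned above.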

4.4.2. Forward-Backward parallelism

The Forward-Backward parallelism consists in processing several windows in parallel. Figure 21 shows a Forward-Backward unit with P windows processed in parallel.

As mentioned in section 4.3, having P Forward-Backward units in parallel also requires parallel memory access, leading to the memory access conflicts already discussed there. One should note that it is not possible to increase this type of parallelism indefinitely. In fact, a high level of parallelism implies small window sizes and, thus, a low quality estimation of the initial states at the borders of the windows, which, in turn, impacts the decoding performance. For a very high throughput requirement, it may be more efficient to use basic solutions to increase the decoding speed (see section 4.1) than to increase the level of parallelism P of the Forward-Backward unit.

4.4.3. Half-iteration parallelism

So far, the presented methods schedule two consecutive half-iterations in an iterative way: the second half-iteration starts when the first one is finished. For example, in figure 21.a, the second half-iteration starts at time K/P + L. The time to perform a decoding iteration, in this example, is thus Nc = 2(K/P + L). To reduce Nc, it is possible to schedule the two half-iterations with some time overlapping.


Figure 21: Example of windows scheduling with a parallelism P = 2


First, it can be noticed in figure 21.a that, during the first (respectively last) L clock cycles, the P backward (respectively forward) units are not used, which is a waste of resources. The simple solution to avoid this waste is to start the next half-iteration at time K/P. In that case, the second half-iteration may use data that have not yet been updated by the first half-iteration, which generates an update conflict. One solution is to suppress such conflicts a priori, by the specific design of the interleaver, as proposed in [1].

Second, it is also possible to perform both half-iterations simultaneously in parallel. Each Forward-Backward unit sends the new extrinsic information to the other as soon as it is generated. This method is called replica scheduling in [57] and has been tested in [58]. The main interest of this method is to reduce the decoding latency of the turbo decoder.

To conclude this section, we should mention paper [59] as a good synthesis of high-speed parallel architectures. The authors used most of the mentioned techniques to obtain a turbo decoder with a decoding throughput greater than 2 Gbit/s.

4.5. High speed LDPC code

For high-speed applications, it is necessary to use the maximum decoding parallelism. Therefore, in this case, parallelism is exploited both in the processing of check and variable nodes, and in the reading/writing of reliability messages from/to memory modules. LDPC decoding algorithms are natively parallel and therefore fully parallel architectures are possible, although they usually lead to routing congestion and lack of flexibility.

As already mentioned in Section 2, in fully parallel decoding architectures, check and variable nodes are mapped into individual PEs and hard-wired interconnects allow for the simultaneous exchange of all messages. Thus, each iteration only takes two clock cycles: in one cycle, CNs compute their outputs and send them to VNs; in the other cycle, VN processing is enabled and the generated outputs are sent back to CNs. The achievable throughput can be simply expressed as

$$D = \frac{K F_{clk}}{2 N_{it}} \qquad (25)$$

If VN and CN processing are combined in a single cycle, the factor two in the denominator of (25) is avoided, but the clock frequency is likely to decrease.
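Plugging illustrative numbers into (25) makes the scale concrete (the figures below are invented and do not correspond to any specific implementation):

```python
def fully_parallel_throughput(K, f_clk, n_it, cycles_per_iter=2):
    """Throughput of a fully parallel LDPC decoder, eq. (25):
    D = K * Fclk / (cycles_per_iter * Nit)."""
    return K * f_clk / (cycles_per_iter * n_it)

# illustrative numbers: K = 1024 bits, 200 MHz clock, 10 iterations
D = fully_parallel_throughput(K=1024, f_clk=200e6, n_it=10)
assert D == 1024 * 200e6 / 20             # about 10.24 Gb/s
# merging VN and CN processing into one cycle doubles the throughput
assert fully_parallel_throughput(1024, 200e6, 10, cycles_per_iter=1) == 2 * D
```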

The fully parallel architecture was applied in [3] to a 1024-bit regular code, decoded at 1 Gb/s. The large number of inter-PE connections leads to

routing congestion, which tends to increase the occupied area and reduce the clock frequency. To mitigate such effects, the longest wires can be identified and partitioned into multiple segments: by inserting registers between one segment and the following one, the transmission of a message is distributed over multiple cycles [60]. This approach helps reduce the worst-case wiring delay and increase the throughput: a 13 Gb/s decoder was designed in [60] using this partitioning.

In [61], a bit-serial architecture is adopted to combat the routing congestion of the fully parallel decoder. This solution led to a decoder for a 660-bit code achieving 3.3 Gb/s.

Modified decoding algorithms have also been introduced to improve the feasibility of the fully parallel architecture [62]. In particular, the Split-Row algorithm has been shown to provide a fivefold throughput speed-up, while at the same time saving 2/3 of the wiring area. These improvements come from the partitioning of check node updates: for each check-to-variable message to be computed, a partition is defined as a sub-set of the incoming variable-to-check messages, and local check node updates are independently applied to the identified partitions. This simplification greatly reduces the number of wires in the fully parallel architecture, at the cost of a small reduction in performance (around 0.3 dB).

A fully parallel decoder also needs to simultaneously update all check and variable node outputs. Therefore it is made of parallel PEs, able to receive all VN-CN messages in parallel and write back the results simultaneously to all connected VNs. This results in lower update latency and consequently higher throughput. On the other hand, computation is organized sequentially in a serial PE, which receives one input message per cycle and needs several cycles to update all outputs. The serial PE solution is particularly efficient if combined with Min-Sum decoding, where, out of all LLRs of a CN, only two magnitudes are of interest, i.e. the minimum and the second minimum. A serial check node usually executes the search for the minimum and second minimum over the stream of received values. The serial approach is slower than the parallel one, but it offers the flexibility to support variable degrees of parity checks.
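The two-minimum search performed by a serial Min-Sum check node can be sketched as follows (a generic illustration, not a specific published design):

```python
def serial_two_min(magnitudes):
    """Streamed search of the minimum, the second minimum and the index of
    the minimum: all a Min-Sum check node needs for its output magnitudes
    (every edge gets min1, except the edge holding min1, which gets min2)."""
    min1 = min2 = float('inf')
    idx1 = -1
    for i, m in enumerate(magnitudes):
        if m < min1:
            min1, min2, idx1 = m, min1, i
        elif m < min2:
            min2 = m
    return min1, min2, idx1

mags = [4.0, 1.5, 3.2, 0.7, 2.1]          # invented incoming magnitudes
m1, m2, i1 = serial_two_min(mags)
assert (m1, m2, i1) == (0.7, 1.5, 3)
# per-edge output magnitude: minimum over all *other* incoming edges
outputs = [m2 if i == i1 else m1 for i in range(len(mags))]
assert outputs == [0.7, 0.7, 0.7, 1.5, 0.7]
```

Because the loop consumes one message per cycle regardless of how many there are, the same unit serves check nodes of any degree, which is the flexibility argument made above.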

An example of parallel PE implementation is given in [63], but the proposed architecture can only support a unique check node degree. More flexibility is made possible by the tree-way structure [8], which exploits a generalized topology to realize a parallel check node architecture up to degree 32, with a fairly simple control mechanism. An example of the tree-way structure for the case of degree 6 is given in Figure 22. The tree-way architecture includes three stages:

• The Direct VN Comparison (DVC) stage, where pairs of adjacent incoming messages are processed by two-input processing units.

• The Multiple Shuffled Comparison (MSC) stage, which is divided into multiple sub-stages, each one using a shuffle network that implements a circular shifting permutation on the values provided by previous sub-stages. The required rotational shifts depend on the node degree and are stored in a small memory. Every sub-stage also includes two-input processing units that extend the CN update equation to additional input messages.

• The Extrinsic Calculation (EC) stage, where values coming from the MSC are combined with the proper input messages to obtain the final output messages.

The parallel CN architecture was also adapted for layered decoding and applied to the implementation of standard compliant decoders for the structured LDPC codes specified in the WiMAX and WiFi standards. The same solution was also shown to achieve much higher throughputs, e.g. 1.7 Gbps with 15 iterations at 300 MHz and an area occupation of 4.5 mm2 using a 130 nm CMOS process.

Sliced message passing (SMP) is a modified decoding algorithm proposed for the implementation of high throughput decoders [64]. The key idea in SMP is to divide each row in the parity check matrix into multiple segments, or slices, which are independently processed in separate clock cycles. This computational scheme offers a twofold advantage: first, the slice organization significantly reduces the interconnect requirements and as a consequence results in a large area saving; second, it allows for overlapped check and variable node processing, which reduces the decoding latency. A throughput of 5.3 Gb/s was achieved by a decoder implemented with a 90 nm technology and running SMP on the (2048, 1723) RS-based LDPC code adopted in the IEEE 10GBase-T standard. This implementation showed that multi-Gb/s throughput is also possible for decoders with a row-parallel, instead of fully


Figure 22: Tree–way architecture for a degree 6 check node


parallel, architecture. In general, row-parallel architectures have enough concurrent resources to simultaneously serve the parity check equations associated with a set of rows in the H matrix, but not all of them. These architectures have been introduced to efficiently implement layered decoding on QC-LDPC codes, which are mainly adopted in wireless communication standards. However, the same approach was successfully used to implement very fast decoders for applications characterized by much higher throughput than wireless communications. Recent examples of fast row-parallel decoders are given in [65] [66].

Finally, control of the iteration number (see Section 5) can be exploited not only to reduce energy dissipation in iterative decoders, but also to increase the average decoding throughput.

5. Energy efficient architectures

Reduction of energy consumption is one of the most important objectives in the implementation of channel decoders. The tremendous increase in wireless data traffic over the last decades goes along with a critical increase in energy consumption. In the context of wireless and mobile communications, the user terminal is usually recognized as the most important target for energy optimization: battery life is currently the biggest limitation for the development of novel terminals and power-hungry applications, such as video gaming, multimedia streaming and mobile TV. In a modern smartphone, there are two major sources of power consumption: the radio connection to the network (GPRS or WiFi) and the display (including the LCD panel, graphics driver and backlight). In particular, it is shown in [67] that the power dissipated during a session of intensive data exchange across the network is typically as high as several hundreds of mW (measurements were made on an Openmoko Neo Freerunner 2.5G smartphone). However, it was recently understood that base stations contribute a significant proportion of the overall energy dissipated by information and communication technology. For example, it is reported in [68] that, in a typical wireless cellular network, base stations are responsible for more than 55% of the dissipated power, corresponding to 13.3 kg of CO2 emissions per subscriber per year. Therefore, the design of energy efficient base stations will not only introduce great ecological benefits, but

also determine relevant economic advantages for operators.

Iterative decoders tend to have a larger power consumption than decoders for classical codes [69] (convolutional and block codes) for three main reasons: (1) the hardware complexity of turbo and LDPC decoders is larger than that of other decoders, such as Viterbi decoders; (2) the decoding process is iterative and this implies that the clock frequency must be kept high with respect to the decoding throughput; (3) large memories are included in the decoder architecture, such as the interleaver memories and the input-output buffers.

The problem of reducing the dissipated energy per decoded bit and per iteration can be faced at different levels of abstraction, from circuit level up to architecture and system levels. Techniques exploited at circuit and architecture level are essentially general purpose methods, capable of reducing the dissipated energy in any digital circuit. On the contrary, algorithm and system level methods are specifically introduced for the channel decoding application. These two classes of techniques are separately addressed in the following sections.

5.1. General purpose methods

At circuit and architecture levels, many techniques developed to minimize energy consumption in digital circuits can be adapted and applied to the specific case of digital channel decoders. The most popular general purpose methods to combat power dissipation are clock gating, dynamic voltage and frequency scaling, voltage over-scaling with reduced precision replicas, and increased parallelism.

Clock gating is an effective technique to reduce the energy dissipation associated with the clock signal in synchronous circuits. Implementations of a sample circuit with and without clock gating are shown in Figure 23. Based on the value of an enable signal (E), multiplexers are used in the left side of the picture (a) to either sample new data (Di) or re-circulate the previously stored data. However, the clock signal (ck) is distributed to each register independently of E and this generates a large switching activity on the clock lines. On the contrary, in the right side implementation (b), clock switching is filtered by the AND gate when the enable is off. This technique can be applied to large processing blocks of a digital system, where the internal state is only modified when the enable signal is active. In the opposite situation, no


Figure 23: Implementation of registers with enable signal

switching activity is generated within the blocks and the dissipated power is significantly reduced.

A good example of this approach adopted in a channel decoder is available in [70], where a low power LDPC decoder design for the next generation wireless handset System on Chip (SoC) is proposed. The design is based on a high-level synthesis tool that adopts a block clock gating technique and saves up to 29% of the dissipated power.

Dynamic voltage and frequency scaling (DVFS) is a well known method to reduce energy dissipation in digital systems, including microprocessors and memory components. In CMOS technology, the dissipated power depends linearly on the clock frequency, therefore reducing only the frequency does not save energy, because the required processing takes more time to complete: if the clock frequency is halved, the required power is also reduced to 50%, but the whole amount of dissipated energy is the same as in the full speed case. On the contrary, lowering the supply voltage and the operating frequency at the same time reduces both power and energy consumption.
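This argument can be made concrete with the first-order dynamic power model P = C·V²·f (all numbers below are illustrative):

```python
import math

def power(C, V, f):
    """First-order dynamic CMOS power: P = C * V^2 * f."""
    return C * V**2 * f

def energy(C, V, f, work_cycles):
    """Energy for a fixed task: E = P * (work_cycles / f) = C * V^2 * cycles."""
    return power(C, V, f) * (work_cycles / f)

# illustrative values: 1 nF switched capacitance, 1.2 V, 200 MHz, 1M cycles
C, V, f, cycles = 1e-9, 1.2, 200e6, 1e6
E_full = energy(C, V, f, cycles)
# halving only the frequency halves the power...
assert power(C, V, f / 2) == power(C, V, f) / 2
# ...but not the energy: the same task simply takes twice as long
assert math.isclose(energy(C, V, f / 2, cycles), E_full)
# lowering the supply voltage together with the frequency does save energy
assert math.isclose(energy(C, 0.9, f / 2, cycles), E_full * (0.9 / 1.2) ** 2)
```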


For non deterministic computations, the application of DVFS requires a predictor, able to accurately estimate the necessary workload. When the predicted workload is high, voltage and clock frequency are not scaled: in such conditions, the full processing capabilities are exploited to quickly complete the assigned tasks and a large amount of energy is spent for this purpose. On the other hand, when a lower workload is predicted, voltage and frequency are scaled, thus distributing the tasks over a longer period of time and saving energy. In the context of iterative decoders, the supply voltage and clock frequency can be adapted according to the instantaneous decoder workload, which is represented by the number of iterations required to decode the current frame. This workload is usually not uniform, but rather changes with the channel conditions: when the workload is below the peak value, that is, a lower number of iterations has to be allocated in the available time frame, the circuit had better be run at a low speed by scaling down the supply voltage. Under the hypothesis that the number of required iterations is known in advance, an optimum supply voltage could be selected for the current iteration. Unfortunately, this information is not available when the frame decoding is scheduled and the optimal voltage assignment cannot be determined; however, heuristic techniques have been studied to dynamically change the supply voltage according to the estimated SNR.

The adoption of dynamic voltage and frequency scaling in the design of an LDPC decoder has been explored in [71], where a low power real-time decoder that provides constant time processing of each frame using dynamic voltage and frequency scaling is proposed. The developed predictor is based on monitoring the number of errors in the check equations. For a given signal to noise ratio, this number has a Gaussian distribution and its average value decreases linearly with the signal to noise ratio. The number of errors in the check equations is therefore used to predict the required decoding effort for each received data frame; the prediction is then used to dynamically adjust the frequency. Such an approach does not work for AWGN channels, where the correlation between the initial check errors and the number of decoding iterations is insufficient. The approach was extended in [72] to the case of the AWGN channel. Up to 30% saving in decoding energy consumption is achieved with negligible coding performance degradation.
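The quantity monitored by this kind of predictor is simply the number of parity checks violated by the initial hard decisions. A minimal sketch of that measurement, on a toy parity check matrix invented for illustration:

```python
def unsatisfied_checks(H, hard_bits):
    """Count the parity check equations violated by the initial hard
    decisions; used as a proxy for the decoding effort of the frame."""
    return sum(
        sum(h * b for h, b in zip(row, hard_bits)) % 2 == 1
        for row in H
    )

# toy 3x6 parity check matrix (not a real standardized code)
H = [[1, 1, 0, 1, 0, 0],
     [0, 1, 1, 0, 1, 0],
     [1, 0, 1, 0, 0, 1]]
codeword = [0, 0, 0, 0, 0, 0]             # all-zero word satisfies every check
assert unsatisfied_checks(H, codeword) == 0
noisy = [1, 0, 0, 0, 0, 0]                # one bit flipped by the channel
assert unsatisfied_checks(H, noisy) == 2  # checks 0 and 2 involve bit 0
```

A DVFS controller would then map this count, through a calibrated threshold, to a frequency and voltage setting for the incoming frame.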

An as-slow-as-possible algorithm has been proposed in [73] to adapt the supply voltage to the instantaneous workload of a turbo decoder. The algorithm is based on estimations of the energy and delay associated with the decoding of a given data frame. The energy consumption for the $k$-th processing iteration of block $n$ is $E_{nk} = C_{eff} V_{nk}^2$, where $C_{eff}$ is the equivalent total switching capacitance of the decoder and $V_{nk}$ is the assigned supply voltage. The time delay in a digital circuit can be approximately expressed as

$$t_{nk} = \frac{\theta V_{nk}}{(V_{nk} - V_{th})^{\eta}}$$

where $\theta$ and $\eta$ are process dependent constants and $V_{th}$ is the threshold voltage. The problem is to find a set of voltages $V_{nk}$ to be assigned to each iteration applied to a sequence of $N$ data blocks, so that the total energy consumption is minimum and a given constraint on the decoding delay ($T_{tot}$) is met. The total energy for the decoding of the $N$ blocks can be expressed as

$$E_{tot}(V_{nk};\, n = 1..N,\, k = 1..K_n) = \sum_{n=1}^{N} \sum_{k=1}^{K_n} E_{nk}$$

where a variable number of iterations $K_n$ can be used for each block $n$; the delay constraint is stated as

$$\sum_{n=1}^{N} \sum_{k=1}^{K_n} t_{nk} \leq T_{tot}$$

The proposed optimization algorithm starts with a low voltage (as-slow-as-possible approach) and adapts the supply voltage from one iteration to the next, thus obtaining profiles of supply voltage such as the one provided in Figure 24: here the decoding of each frame is started by assigning the lowest admitted voltage level (V1). The decoding process then continues using values of supply voltage decided based on the available decoding time: when an additional iteration is started for the current frame, the supply voltage is increased by one step, in order to speed up the decoding. If the current frame turns out to take a large number of iterations to complete decoding, time needs to be saved in the processing of the following frame and therefore a high supply voltage is assigned (see the third and fourth blocks in Figure 24).
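The per-iteration model above can be evaluated numerically to see the trade-off the algorithm navigates; the process constants θ, η and Vth below are invented for illustration:

```python
def iteration_energy(C_eff, V):
    """E_nk = C_eff * V^2 (energy model from [73])."""
    return C_eff * V**2

def iteration_delay(V, theta=1e-9, eta=1.5, V_th=0.4):
    """t_nk = theta * V / (V - V_th)^eta; constants are illustrative only."""
    return theta * V / (V - V_th) ** eta

C_eff = 1e-9
levels = [0.8, 1.0, 1.2]                  # hypothetical supply voltage levels
energies = [iteration_energy(C_eff, v) for v in levels]
delays = [iteration_delay(v) for v in levels]
# raising the voltage speeds iterations up but costs quadratically more energy
assert energies == sorted(energies)
assert delays == sorted(delays, reverse=True)
```

The as-slow-as-possible policy exploits exactly this monotonic trade-off: it spends delay budget (low voltage) first and buys speed (high voltage) only when the deadline forces it to.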

In order to better exploit the potential energy advantage coming from supply voltage scaling, SNR estimation has been suggested as a method to determine in advance the required number of iterations and to optimally assign the V_nk values. It can be proved that the lowest energy consumption is achieved when the circuit runs at constant speed all the time; this implies that the best voltage assignment is found by distributing the global workload uniformly. The actual effectiveness of this approach depends on the SNR and on the number of available voltage levels: reported results show that with 5 levels the energy consumption can be reduced by percentages ranging from 10% at low SNR up to 80% at high SNR.

Figure 24: Profile of supply voltage assignments in different iterations

A further general-purpose technique to reduce the energy consumption of a digital circuit is based on extended processing parallelism and voltage over-scaling. The general idea can be introduced with the help of a very simple example. Assume that an FIR digital filter is implemented by means of the sequential scheme given in Figure 25-(a), where one incoming sample per clock period is processed. The filter implements the following discrete-time equation

y[n] = a x[n] + b x[n − 1] (26)

At each cycle, a new input sample is received and an output sample is generated; therefore the filter throughput is th = fs, where fs is the clock frequency. The maximum clock frequency for this computational scheme depends on the delay along the critical path, which is given by t_A + t_M, with t_A and t_M the combinational delays associated with the adder and the multiplier respectively. Denoting by C_L the overall capacitance switched at each cycle by the filter components and by V_DD the supply voltage, the dissipated power for the sequential architecture is P_s = C_L V_DD^2 f_s and the energy required per sample is E_s = C_L V_DD^2. If a scaling factor α is applied to the supply voltage, power is saved at the cost of additional processing time. However, as an effect of the voltage scaling, the clock frequency and therefore the throughput are decreased with respect to the initial case. The same initial throughput th can be maintained with scaled voltage if a parallel architecture is adopted. In Figure 25-(b), two input samples are handled and two output samples are generated per clock cycle; therefore the same throughput as the sequential architecture is achieved with a clock frequency f_p = f_s/2. The dissipated power in the parallel filter is given by

    P_p = 2 C_L (α V_DD)^2 f_p

The factor 2 takes into account that the parallel architecture has twice the number of components of the sequential scheme, so the total switched capacitance is 2C_L. Since the throughput is unchanged, the per-sample dissipated energy is E_p = P_p/f_s = C_L α^2 V_DD^2 = α^2 E_s, a relevant improvement with respect to the sequential case. For example, with C_L = 1 pF, V_DD = 3 V and α = 0.5, the sequential and parallel schemes dissipate E_s = 9 pJ and E_p = 2.25 pJ per sample respectively.

Figure 25: Sequential (a) and parallel (b) architectures for FIR filter implementation
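A quick numerical check of the two per-sample energy expressions (taking E_p as P_p divided by the throughput f_s):

```python
# Check of the voltage-scaling example: sequential filter at VDD versus
# a two-way parallel filter at ALPHA*VDD and half the clock rate.

CL = 1e-12     # switched capacitance (F)
VDD = 3.0      # nominal supply voltage (V)
ALPHA = 0.5    # voltage scaling factor

Es = CL * VDD ** 2            # sequential energy per sample: CL*VDD^2
# Parallel: Pp = 2*CL*(ALPHA*VDD)^2 * (fs/2) and the throughput is
# still fs samples/s, so the per-sample energy is Pp / fs:
Ep = CL * (ALPHA * VDD) ** 2
print(f"Es = {Es * 1e12:.2f} pJ, Ep = {Ep * 1e12:.2f} pJ")
```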

This general idea is adopted in [74], where the inherent parallelism of the LDPC decoding algorithm is exploited to achieve energy efficiency. The initial LDPC decoder architecture is composed of multiple shared variable node units (VNUs), multiple shared check node units (CNUs), and shared memory units that enable message exchange between VNUs and CNUs. The number of allocated VNUs and CNUs is then increased to obtain the same throughput as the original decoder with a scaled supply voltage. The described approach proves to be very efficient: for a medium-size LDPC code (frame size of 2048 bits), supply voltage scaling reduces the per-bit, per-iteration energy at an SNR of 4 dB from the initial value of 10.4 pJ/bit/iteration to 2.7 pJ/bit/iteration, with an unaltered decoding throughput of 648 Mb/s.

In the low-power design techniques mentioned above, voltage scaling has the effect of increasing circuit delay. This effect is compensated by either increasing the processing parallelism or reducing the number of iterations. When not compensated, voltage overscaling achieves energy saving with no throughput penalty, but it leads to intermittent errors whenever the delay over a signal path exceeds the clock period. In the context of iterative decoders, this effect can be seen as an additional source of noise, which introduces a BER performance degradation. The desired level of performance can be restored in the decoder if voltage overscaling is accompanied by some form of error control able to improve the overall reliability of the system. In [75], a technique known as reduced precision redundancy (RPR) is used to mitigate the effects introduced by voltage overscaling. Two implementations are generated for a processing unit: the first is a full-precision architecture operating with over-scaled voltage; the second is a replica running at full voltage on reduced-precision data. In the large majority of cases, the overscaled units are activated and used to decode received frames with a low energy consumption. When failures are detected in the decoding process, this event is interpreted as the detection of timing errors generated by voltage overscaling, and the power-hungry reduced-precision replica is activated to correct the errors.

5.2. Application specific methods

At the algorithm level, a popular technique to save decoding energy is controlling the number of iterations. In both turbo and LDPC codes, at each SNR value, the BER level improves with the number of performed iterations. Therefore, in general, the early stopping of iterations can introduce energy saving and performance loss at the same time.

However, the adoption of early stopping techniques does not necessarily imply a trade-off between BER performance and consumed energy. For a given target BER level, the number of required iterations depends on the instantaneous quality of the channel. As an example, assume that a decoder is designed to achieve the wanted level of BER with N1 iterations at signal to noise ratio SNR1; also assume that the same performance level is reached with N2 < N1 iterations at signal to noise ratio SNR2 (with SNR2 > SNR1). When channel conditions are good and the signal to noise ratio is SNR2, the decoding iterations can be limited to N2, with no performance degradation with respect to SNR1 and a percentage energy saving that can be estimated as

    (N1 − N2) / N1

A more accurate evaluation of the saved energy must also consider the dissipation contribution of the additional circuits required to detect the right conditions for early stopping.

When dealing with early stopping of iterations, we have to distinguish between two cases:

• undecodable frames, which cannot be corrected independently of the number of performed iterations, and

• decodable frames, which can be corrected by means of a finite number of iterations.

If early stopping of iterations is conceived to specifically address the first type of frames, no degradation of performance is introduced. Moreover, the early detection of undecodable frames may also result in a reduced latency for the whole transmission, when automatic repeat request (ARQ) is supported by the application: in this case, the ARQ is activated as soon as the frame is recognized as undecodable, without spending time in useless decoding iterations.

In order to save iterations in the processing of decodable frames, a successful decoding must be detected as early as possible, with a high level of reliability. LDPC codes incorporate an implicit stopping criterion for decodable frames: a correct codeword is recognized if and only if the set of parity check equations is verified,

    H · x = 0

where H is the parity check matrix and x is the column vector of hard decisions. This operation is normally executed in a decoder to see whether the frame has been successfully decoded or not. The test is run after a fixed number of iterations or, more often, after each iteration: if the test is passed, the process is stopped; if not, decoding is continued until success or until the maximum number of allowed iterations has been reached. The performance degradation associated with this parity check criterion is almost negligible; the average number of saved iterations is rather high at large SNR values, while it becomes very low at small SNRs.
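A minimal sketch of this stopping rule, wrapped around a toy Gallager-style bit-flipping decoder; the small H matrix and the single-error channel are illustrative assumptions:

```python
import numpy as np

# Parity-check early stopping around a toy bit-flipping decoder.
# The H matrix and the channel error are illustrative assumptions.

H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]], dtype=np.uint8)

def decode(x, max_iter=10):
    x = x.copy()
    for it in range(1, max_iter + 1):
        syndrome = H @ x % 2
        if not syndrome.any():          # H.x = 0: stop early
            return x, it
        # bit-flipping step: flip the bit in most unsatisfied checks
        counts = H.T @ syndrome
        x[np.argmax(counts)] ^= 1
    return x, max_iter

codeword = np.zeros(6, dtype=np.uint8)  # all-zero codeword
received = codeword.copy()
received[2] ^= 1                        # one channel error
decoded, iters = decode(received)
print(decoded, "stopped at iteration", iters)
```

Here the syndrome test both detects convergence and drives the early stop, so no iteration beyond the second is spent on this frame.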


An improvement over this approach is suggested in [76], where it is recognized that sometimes, in structured codes, a number of iterations are spent to correct errors among the parity bits of the code, while the information bits are already correct. Under this kind of condition, additional energy can be saved by stopping iterations as soon as all information bits turn out to be correct. When applied to WiFi and WiMAX LDPC codes, this method provides an additional 11% saving in the average number of iterations with respect to the simple verification of the parity check equations.

In the case of turbo codes, the early detection of successfully decoded frames can be implemented by resorting to a CRC (cyclic redundancy check) scheme concatenated with the turbo decoder: when the frame is completely decoded and no errors are detected by the CRC, unnecessary iterations are avoided by stopping the decoder. The procedure can be organized according to the following steps:

1 Initialize the number of iterations S to 1 and set a maximum number of supported iterations, S_MAX.

2 Decode the frame, increment S and check, using the concatenated CRC, whether there is any bit error in the frame.

3 Repeat step 2 until S = S_MAX or no errors are detected in the current frame.
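The steps above can be sketched as a loop skeleton; decode_iteration() and crc_ok() are hypothetical stand-ins for the real SISO iteration and the concatenated CRC check:

```python
# Loop skeleton for the CRC-based stopping procedure.  The callables
# decode_iteration() and crc_ok() are hypothetical stand-ins for the
# real SISO iteration and the concatenated CRC check.

def turbo_decode(frame, decode_iteration, crc_ok, s_max=8):
    s = 1                                    # step 1: S = 1
    decoded = frame
    while True:
        decoded = decode_iteration(decoded)  # step 2: one iteration
        s += 1
        if crc_ok(decoded) or s >= s_max:    # step 3: repeat until done
            return decoded, s

# Toy stand-ins: each "iteration" removes one residual bit error.
state = {"errors": 3}

def fake_iter(frame):
    state["errors"] = max(0, state["errors"] - 1)
    return frame

def fake_crc(frame):
    return state["errors"] == 0

_, used = turbo_decode("frame", fake_iter, fake_crc)
print("iterations used:", used)
```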

When the procedure is completed, the circuit is shut down before the next data frame arrives. As the energy dissipation tends to grow linearly with the number of iterations, the percentage of power saved with this approach is roughly equal to the average reduction achieved in the number of iterations. According to results reported in [73], this method applied to a rate 1/3 turbo decoder with block length 736 saves 75% of the energy at high SNR.

The iteration number can also be controlled by means of a decision-aided stop criterion, somehow related to the QoS requirements. Several early stopping criteria are based on the observation of the LLR values processed by the decoder. A known method makes use of output LLRs that are compared to a threshold previously set on the basis of the target bit error rate [77]: if the LLR magnitudes of all bits in a block are under the threshold, the decisions are not considered reliable enough and new iterations are scheduled, while a passed threshold indicates that sufficient confidence in the decoded outputs has been achieved and the block decoding is stopped. This technique also gives percentages of energy reduction as high as 50%. Simpler and less precise techniques to control the iteration number include the evaluation of the number of 1's accumulated from the output of the decoded block at each iteration [78]: a necessary condition for having identical decoded bits from the current and previous iterations is that the two accumulated values are equal.

A wide class of stopping methods work by computing a metric related to the convergence speed of the iterative decoding process. They decide whether the decoder has to be stopped or not based on the comparison between the computed metric and some proper threshold. Hard decision aided stopping criteria were originally proposed for the decoding of turbo codes [79] and then extended to the case of LDPC codes [80]. In these works, the computed metrics are derived from the hard decisions of the information sequence obtained at the end of each iteration. When applied to the decoding of LDPC codes, the hard decision method consists of the following two steps:

1. at the end of iteration k (k > 1), the hard decision Dk(j) is computed from the decoder LLR associated with the jth bit;

2. if Dk(j) = Dk−1(j) for all j, or k has reached the maximum number of iterations, then stop the decoding of the current frame; otherwise proceed to the next iteration.
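The two steps above can be sketched as follows; the LLR trajectories are simulated for illustration, not produced by a real LDPC decoder:

```python
import numpy as np

# Hard-decision-aided stopping.  The LLR trajectories below are
# simulated (illustrative assumptions), not real decoder output.

def hda_stop(llr_per_iter, k_max):
    """Return the iteration at which decoding would stop."""
    prev = None
    for k, llr in enumerate(llr_per_iter[:k_max], start=1):
        hard = (llr < 0).astype(np.uint8)        # D_k(j)
        if prev is not None and np.array_equal(hard, prev):
            return k                             # decisions are stable
        prev = hard
    return k_max

# Simulated LLRs whose hard decisions stabilize at iteration 3.
traj = [np.array([0.1, -0.2, 0.3]),
        np.array([0.5, 0.4, 0.8]),
        np.array([1.1, 0.9, 1.5]),
        np.array([2.0, 1.8, 2.4])]
print("stopped at iteration", hda_stop(traj, k_max=10))
```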

This class of methods was shown to provide good results in terms of saved iterations, for both turbo and LDPC codes, especially in the low SNR region, where they effectively detect undecodable frames and save a relevant amount of useless iterations. The percentage reduction in the average number of iterations becomes much less prominent at high SNRs, where most frames are decodable.

The cross entropy between the output LLR messages of two successive iterations can also be exploited to control iterations, as originally proposed in [81]. The evaluated metric is basically a measure of the amount of uncertainty associated with a given LLR: if the computed cross entropy is below a given threshold for all bits, additional iterations are not likely to modify the current decoding result. The quality of this early stopping criterion depends on the choice of the threshold: larger thresholds tend to decrease the average number of iterations but at the same time extend the performance degradation in terms of BER. Approximate expressions of the cross entropy have been proposed to reduce the hardware implementation complexity of this method.
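One commonly used approximation in this family computes T(k) = Σ_j |ΔL_j|² / exp(|L_j|) and stops when T(k) falls below a fraction of T(1); the threshold value and the simulated LLR trajectory below are assumptions for illustration:

```python
import numpy as np

# Sketch of an approximate cross-entropy stopping rule:
# T(k) = sum_j |dL_j|^2 / exp(|L_j|), stop when T(k) < THR * T(1).
# The threshold and the simulated LLRs are illustrative assumptions.

THR = 1e-3

def ce_metric(llr, llr_prev):
    return float(np.sum((llr - llr_prev) ** 2 / np.exp(np.abs(llr))))

def ce_stop(llr_per_iter, k_max=10):
    t1 = None
    for k in range(1, min(k_max, len(llr_per_iter))):
        t = ce_metric(llr_per_iter[k], llr_per_iter[k - 1])
        if t1 is None:
            t1 = t                       # reference value T(1)
        elif t < THR * t1:
            return k + 1                 # converged: stop here
    return k_max

traj = [np.array([0.5, -0.5]),
        np.array([2.0, -2.0]),
        np.array([6.0, -6.0]),
        np.array([6.001, -6.001])]
print("stopped at iteration", ce_stop(traj))
```

Dividing by exp(|L|) makes the metric insensitive to changes in already reliable bits, which is what keeps the hardware cost of the approximation low.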

A low complexity criterion for the early stopping of LDPC decoders is proposed in [82]. In this method, the variation in the number of satisfied parity-check constraints across decoding iterations is monitored to distinguish among different behaviours in the evolution of LLR values, namely oscillation, slow convergence and fast convergence. Proper setting of multiple thresholds is required for this method too, in order to achieve good performance. The key advantage of the method is its very low implementation complexity, due to the fact that conventional LDPC decoders already include units to detect and count satisfied parity check equations. Thus, no additional hardware resources are required.

A relevant saving of iterations in the low SNR region is offered by the Convergence Mean Magnitude (CMM) method proposed in [80]. The method monitors the evolution of the mean magnitude of the LLR messages along the decoding process, and stopping decisions are based on the convergence of the mean magnitude. Ideally, for a code with infinite length and unbounded girth, the mean magnitude of the LLRs either increases continually or reaches a constant value within a few iterations: in the first case, a correct codeword can be declared, while in the latter a decoding failure is obtained independently of the number of performed iterations. The evaluation of the slope of the mean magnitude can therefore be used as a criterion to stop the decoder early. Cycles and trapping sets in a real world code graph make the convergence much less smooth, with possible fluctuations between small and large values. However, a proper choice of thresholds leads to quite a reliable prediction of decodable and undecodable frames.
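A minimal sketch of the slope test at the core of CMM; the window, the threshold and the LLR trajectories are illustrative assumptions:

```python
import numpy as np

# Sketch of the CMM slope test of [80]: a growing mean |LLR| suggests
# a decodable frame, a plateau suggests an undecodable one.  The
# threshold, window and trajectories are illustrative assumptions.

def cmm_classify(llr_per_iter, grow_thr=0.2, window=2):
    means = [float(np.mean(np.abs(l))) for l in llr_per_iter]
    slope = means[-1] - means[-1 - window]
    return "decodable" if slope > grow_thr else "undecodable"

growing = [np.array([0.5]), np.array([1.0]),
           np.array([1.6]), np.array([2.5])]
plateau = [np.array([0.5]), np.array([0.9]),
           np.array([1.0]), np.array([1.02])]
print(cmm_classify(growing), cmm_classify(plateau))
```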

It is well known that in complex digital systems very large margins of power reduction are offered by architecture-level optimizations, which basically consist of a design exploration aimed at finding the best area-energy trade-off for a given application.

From the analysis of the main contributions to power consumption in a turbo decoder, many authors found that most of the power is dissipated in the memories; more precisely, for a sliding window approach, around 90% of the energy consumption in a SISO module is due to the branch and path metric memories; moreover, a large part of the energy consumption is typically due to the interleaver memories and to the input/output buffers. Of course, the relevance of the memory contribution to the global power dissipation tends to decrease when high-speed, parallel or pipelined architectures are considered, because of the increased amount of logic and the higher clock frequency.

In systems including memories, large percentage reductions in energy dissipation are achievable by means of RAM partitioning. In fact, splitting a memory into multiple RAMs allows relaxing the constraints on access times, and the gained delay margins can be exploited either to increase the speed of the system or to reduce the power consumption. In the latter case, the original throughput can be kept while the supply voltage is scaled down, largely reducing the dissipated energy. This technique is very effective because the dissipated dynamic power Pd depends quadratically on the supply voltage VDD, while the access delay depends almost linearly on VDD: a fairly typical expression for a first order estimation of the delay as a function of the supply voltage is the following [83],

    T_d ∝ V_DD / (V_DD − V_th)^η    (27)

where, for a 0.25 µm technology, η is equal to 1.3–1.5 and the threshold voltage V_th is 0.22 times the original supply voltage.

In the architecture of a turbo decoder, the access sequence for each memory is usually fixed and known in advance, although in some cases, such as in the 3GPP-UMTS standard, it can vary from frame to frame. Thus each memory can be split into separate sections so that two cells included in the same section are not accessed in two consecutive steps, and the time constraints are relaxed by a factor of 2. Extending this procedure, memories can be split so that two cells included in the same section are not accessed for s steps, relaxing the timing constraints by a factor of s. The partitioning is straightforward for input and output buffers, because the access scheduling to these memories is always the same, that is, the natural order from the first to the last address.

The interleaver memory, on the other hand, cannot be partitioned without applying proper techniques to avoid collisions. The method shown in Figure 26 can be used to represent the constraints that have to be imposed on a given interleaver in order to partition it into a number of separate memories. In the graph, each vertex represents a memory cell and the edges connect vertices that are accessed within s consecutive steps: adjacent vertices represent cells that must be placed in different partitions. Thus the partitioning of the interleaver into sections can be stated as a graph coloring problem.

Figure 26: Graph coloring on a sample interleaver

In the example of Figure 26, cells must be accessed according to both the natural order (0 to 8) and the interleaved one (2, 5, 0, 4, 7, 1, 3, 6, 8). Continuous edges indicate the in-order sequence, while dotted lines show the interleaved sequence. The graph nodes have been partitioned into three groups, corresponding to the three memories M1, M2 and M3; the obtained partitioning allows the simultaneous access of two consecutive values (s = 2) in both the ordered and interleaved sequences. The method has been applied to multiple real codes [84] and an average power saving of 75% has been reached, with an area overhead of 23%.
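The partitioning of this example can be reproduced with a first-fit greedy coloring; the sketch below builds the conflict edges from the two access sequences and, with this vertex ordering, ends up with three memories:

```python
# First-fit greedy coloring of the sample interleaver of Figure 26:
# cells accessed within s = 2 consecutive steps, in either the natural
# or the interleaved order, must be mapped to different memories.

NATURAL = list(range(9))
INTERLEAVED = [2, 5, 0, 4, 7, 1, 3, 6, 8]
S = 2

# Conflict edges: pairs of cells accessed less than S steps apart.
edges = set()
for seq in (NATURAL, INTERLEAVED):
    for i in range(len(seq)):
        for j in range(i + 1, min(i + S, len(seq))):
            edges.add(frozenset((seq[i], seq[j])))

adj = {v: set() for v in NATURAL}
for e in edges:
    a, b = tuple(e)
    adj[a].add(b)
    adj[b].add(a)

color = {}
for v in NATURAL:                 # first-fit greedy coloring
    used = {color[u] for u in adj[v] if u in color}
    color[v] = next(c for c in range(len(NATURAL)) if c not in used)

n_mem = max(color.values()) + 1
print(n_mem, "memories:", color)
```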

A wide range of trade-offs in turbo decoder architectures has been studied in [85] [86], and several architectural parameters have been introduced, with the aim of finding an energy-efficient storage organization. It is proved in [86] that the optimal choice of these parameters is strongly dependent on the specific turbo code and on the technology models used; the optimal parameter setting is obtained by simulation on high level models.

Data flow transformations are applied to reduce both the storage of state metrics and the number of memory transfers: in order to achieve this, the metrics actually stored in RAMs are limited to a fraction 1/θ of the total, while the other metrics are recalculated when needed or temporarily stored in register files. While this idea does not provide any area benefit, it is quite effective in reducing the energy dissipation: percentage reductions of around 50% are reported for values of θ between 2 and 4.

In the same work, the delay and energy benefits deriving from the adoption of some degree of parallelism in the decoding architecture are shown for a particular parallel decoding scheme, known as the double flow structure: in the proposed architecture, the usual sliding window approach is adopted for the first half of the data block, meaning that the α recursion is continuously applied starting from the initial trellis state and two simultaneous β recursions are processed along sliding windows. At the same time, the second half of the data block is processed with the α and β roles swapped:

• β metrics are updated in the backward direction starting from the block termination,

• α metrics are processed according to the sliding window scheme by means of two α processors that go through the trellis in the time reverse direction.


This architecture can be viewed as a particular case of the more general parametric description based on the DFG [7], but in this case it is shown to allow a 25% reduction in dissipated energy with respect to the usual single flow structure.

A further application specific method to reduce energy dissipation in iterative decoders is based on the dynamic re-encoding of received messages [87]. The basic idea, originally applied to the decoding of algebraic block codes and convolutional codes, aims at keeping the survivor path on the zero-path of the trellis, thus reducing the state transition activity. No performance loss is introduced by this approach, while a small additional implementation cost is due to the re-encoding process, which needs to be incorporated in the decoder architecture.

6. Exotic designs

Unusual decoder implementations are described here: for example, analog decoders, stochastic decoders, and decoders for polar and non-binary codes.

6.1. Analog decoders

Long latency and high power consumption are recognized weaknesses of digital iterative decoders. The use of analog circuit techniques to implement iterative decoding was proposed to avoid these problems [88] [89].

An analog decoder exploits continuous-time operation and deep parallelism to improve speed and reduce power dissipation. Internal metrics and soft values are associated with voltages or currents, and they propagate from one decoder component to the other through continuous-time analog feedback loops.

This means that, in analog decoders, iterations no longer exist and the whole circuit can be viewed as a continuous-time network, where the generated outputs tend to settle on stable values, which are returned as the decoded codewords.

The reduced power dissipation of analog decoders is primarily due to the possibility of using a single wire to propagate a whole soft value, while multiple wires are required to transmit the same information in binary form. Moreover, when soft values are represented by means of analog voltages, these voltages do not always swing across the entire supply range, while in a digital decoder each flip of a bit is associated with a rail-to-rail voltage swing, which dissipates the highest energy. Finally, since analog decoders operate in continuous time, the distribution of a high-speed clock signal, which is responsible for a large percentage of the energy dissipation in digital implementations, is not required.

On the other hand, the speed advantage provided by analog decoders comes from the use of very simple circuits that perform the required soft computations within a small occupied area. This enables the adoption of a degree of parallelism much larger than in digital decoders, with a large potential in terms of decoding speed-up.

In most analog implementations, decoder components are composed of transistors biased in weak-inversion mode, where drive currents are rather low. As a consequence, circuit bandwidth and achievable throughput are inherently low. High internal parallelism is used to achieve the wanted decoding speed; however, this approach turns out to introduce a limitation on the code length. While in a fully parallel decoder the number of circuit components increases linearly with the code size, the wiring among components grows at a much larger rate, both in terms of number of wires and in terms of wire length. In particular, long wires translate into large parasitic capacitances and therefore low achievable throughput.

Further difficulties that have been experienced with analog implementations come from

• the effects of device mismatch and other transistor non-idealities on the loss of error-correcting performance, which call for specific design procedures able to guarantee predictable results in terms of quality and performance,

• the limited capabilities of analog circuits in terms of reconfigurability/flexibility, which make it difficult to support multiple heterogeneous codes.

Notwithstanding these difficulties, several proposals to implement analog decoders have appeared in the last 15 years. The first ideas were published between 1998 and 2000 [88] [89]: in these initial attempts, the decoding operations were performed by means of bipolar junction transistors.


The first analog turbo decoder, designed for a 48 bit code, appeared in 2003 [90]. A longer, 500 bit turbo code is supported by another analog decoder proposed in the same period for application to magnetic recording [91].

A CMOS analog Min-Sum based decoder was proposed in 2005 [92]. Instead of the usual subthreshold MOS transistors, strongly inverted CMOS devices were used in this work to implement the key components. A partially continuous exchange of soft values was exploited in [93] to achieve high decoding speed with an analog turbo decoder, configurable up to a code length of 2432 bits.

So far, analog decoders have only been demonstrated on codes with block lengths up to a few thousand bits. Their maximum throughput is also lower than that of most digital decoders. Since device mismatches and difficulties with wiring and input buffering are expected to limit the scaling of current analog decoders to larger code sizes, digital turbo and LDPC decoders are likely to remain the best solution for current and new applications.

6.2. Stochastic decoders

The principle of stochastic computation was introduced in 1956 by John Von Neumann [94]. The principle is to represent a real value q between 0 and 1 by a binary stochastic stream X having probability q of being 1 (P(X = 1) = q) and 1 − q of being 0 (P(X = 0) = 1 − q). Given two independent stochastic streams X and Y of probabilities p and q, the logical AND of X and Y gives a new stochastic stream Z having probability l = p × q of being 1; in fact, both X and Y need to be 1 in order to obtain Z = 1. In other words, a simple AND gate can perform a multiplication in the stochastic domain: l = p × q. Of course, the result is exact only for an infinite number of realizations. In practice, the precision of the result can be tuned with the number of realizations.
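A quick Monte Carlo check of stochastic multiplication (stream length and probabilities chosen arbitrarily for illustration):

```python
import random

# Monte Carlo check of stochastic multiplication: the AND of two
# independent Bernoulli streams with P(1) = p and P(1) = q is a
# stream with P(1) = p * q.

random.seed(0)
p, q, n = 0.6, 0.5, 100_000
x = [random.random() < p for _ in range(n)]
y = [random.random() < q for _ in range(n)]
z = [a and b for a, b in zip(x, y)]
estimate = sum(z) / n
print(f"estimated {estimate:.3f}, exact {p * q:.3f}")
```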

Similarly, it is easy to verify that with an XOR gate (Figure 27.a) the output probability satisfies l = p × (1 − q) + (1 − p) × q, which is, in the probability domain, the processing required to generate the extrinsic information of a check node.

Figure 27: Elementary check node and variable node architectures for stochastic computation

Now, let us study the operator given in Figure 27.b. One can see that if the values of X and Y agree, the register takes the value of X (or Y) after the rising edge of the clock; otherwise, the register keeps its old value. Let l be the probability of Z being one: P(Z = 1) = l. The new value of Z is 1 in two cases: first, X = Y = 1, which is an event of probability p × q; second, when X and Y disagree and the previous value of Z was 1, an event of probability (p × (1 − q) + (1 − p) × q) × l. Thus, the value of l satisfies equation (28):

p × q + (p × (1 − q) + (1 − p) × q) × l = l    (28)

which gives the solution:

    l = (p × q) / (p × q + (1 − p) × (1 − q))    (29)
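The fixed point of equation (29) can be checked by simulating the register element of Figure 27.b (the stream length and input probabilities are arbitrary illustrative choices):

```python
import random

# Simulation of the register-based variable node of Figure 27.b: when
# the two input streams agree the register loads the input, otherwise
# it holds its value.  The empirical P(Z = 1) should approach Eq. (29).

random.seed(1)
p, q, n = 0.7, 0.6, 200_000
z, ones = 0, 0
for _ in range(n):
    x = random.random() < p
    y = random.random() < q
    if x == y:
        z = int(x)               # inputs agree: load the register
    ones += z
estimate = ones / n
exact = p * q / (p * q + (1 - p) * (1 - q))
print(f"simulated {estimate:.3f}, closed form {exact:.3f}")
```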

Thus, in the probability domain, with stochastic processing, both the elementary check node and the variable node can be implemented using very simple hardware components. A fully parallel decoder architecture is thus possible, and the question arises whether stochastic decoding can be an efficient alternative to classical decoder architectures using LLRs; in other words, whether the simplicity of the architecture can compensate for the high number of clock cycles required to obtain an accurate result. The principle of stochastic decoding is given in [95] and [96]. In these papers, the authors show that a straightforward implementation generates deadlock cycles that freeze the decoder in a given state. In fact, if the two inputs of a variable node differ, the output is stuck in its current state. This stuck value is propagated to the check node, and propagated again to the variable node (the LDPC code graph contains loops). In order to prevent this, the decoder needs to break such deadlock cycles by monitoring on the fly the probability value of a given edge and by generating an independent stochastic sequence with the same probability. W. Gross et al. have proposed several methods to perform such an operation and have produced hardware implementations of LDPC decoders with astonishing results [97], [98]. On average, their stochastic LDPC decoders perform better than classical "LLR based" LDPC decoders, but a few codewords require a very large number of decoding iterations (from a few thousand up to a few million) before reaching convergence. Recently, stochastic decoders have also been extended to the case of non-binary LDPC codes [99] as well as turbo codes [100], [101].

6.3. Conclusion

In this section, we have regrouped under the name "exotic" the decoder architectures that are not based on the LLR representation of the edge messages. These exotic implementations can also yield efficient architectures, with performance comparable to, or even better than, that of the standard implementations. Nevertheless, most of the research in these areas is done by academia and, so far, there is no large scale industrial utilization of this type of decoder. This situation may evolve in the future. For completeness, we should also mention the Finite Alphabet Iterative Decoders (FAID) [102], where message values belong to a finite set, without an explicit link to LLR values, and where specific update rules are defined in the variable node. Since there is no reported architecture using FAID decoders, they have been omitted from this chapter.

7. A survey of relevant implementations

Years after their introduction, iteratively decodable codes have been adopted in a number of communication standards as forward error correction (FEC) systems. The first important communication standard that specified the use of a turbo code for forward error correction was the 3GPP UMTS standard for wireless cellular communication: the selected solution is a parallel concatenation scheme, with two 8-state convolutional encoders and a wide selection of interleavers, up to a maximum size of 5120 bits; the global code rate is 1/3 and the highest bit rate is 2 Mb/s.

In subsequent years, turbo and LDPC codes have been adopted as mandatory or optional channel codes in several standards for wireless communications (WiFi, WiMAX, 3GPP-LTE), digital TV broadcasting (DVB-RCS, DVB-S2, DVB-T2), wireline communications (10GBASE-T, HPAV, HPAV2, ITU-G.hn) and space telemetry applications (CCSDS). The use of turbo and LDPC codes has also been considered for the coding of data in magnetic recording and in non volatile memory components, such as for example Flash memories.

In this wide spectrum of applications, a fairly large number of relevant hardware implementations can be found in the open literature. A complete list and comparison of all available implementations is not among the objectives of this chapter. However, in the following sections, a selection of important implementations is given.

Several commercial implementations have been developed for decoders compliant with the 3GPP standard, both in the form of proprietary "hard IP" cores, integrated in complete transceiver components by manufacturing companies, and in the form of "soft IP" cores, made available as synthesizable HDL (Hardware Description Language) source code. A set of encoder/decoder cores is made available by TrellisWare [103] for different commercial applications, such as HomePlug AV, IEEE 802.11n, IEEE 802.16e, DVB-S2 and CCSDS Near-Earth and Deep Space communications. These cores are targeted for FPGA or ASIC implementation; one of them, an LDPC decoding core indicated as F-LDPC (Flexible LDPC), can be customized in terms of block size, code rate and performance-cost tradeoff.

Quite a large number of cores (called Silicon IP) from different providers are found in [104]: in particular, the site gives access to advanced FEC cores for both wireless and wired applications and also for DVB-S2. In [105], several cores for turbo decoders are proposed, including 3GPP, DVB-RCS and 3GPP-LTE.

The ASIC approach is not the only solution for the implementation of turbo and LDPC decoders. Implementation platforms of different natures have been selected for the development of decoders, according to the addressed application and cost constraints. In addition to processing speed, latency, energy consumption, cost and implementation complexity, two important features, namely scalability and flexibility, are gaining an increasing role in the characterization of modern channel decoders. Scalability is the capability of a platform to adapt to different choices of system level parameters, such as the mother convolutional codes or the size of the processed block. Throughput, latency and power consumption typically change with these parameters, and additional implementation complexity is paid to support scalability. However, the decoder architecture is not changed or reconfigured when adapting to a different set of parameters.

On the other hand, the term flexibility is used to indicate the possibility of updating an implementation platform in order to support a completely different decoder that does not simply require a change in some parameters; as an example, a decoder that can be configured for different concatenation schemes (parallel as well as serial) would be flexible.

The decoder implementations reported in the following subsections are classified in four groups, according to their dominant characteristic:

1. High throughput decoders, where the key design objective was the optimization of decoding rate,

2. Low energy and low area decoders,

3. Flexible decoders, able to support multiple standards,


4. Hardware accelerators for performance evaluation of iterative decoders, especially in the high SNR region.

7.1. High throughput implementations

Most of the high throughput implementations of turbo decoders adopt parallel processing techniques. An alternative approach, exploited in [106], is based on high radix processing techniques, where throughput is improved by handling multiple trellis stages per clock cycle. A radix-16 modified log-MAP algorithm is implemented in [106] to achieve a maximum throughput of 50 Mbps. The decoder was implemented in a commercial 90 nm CMOS technology and occupies a core area of 1.6 mm2. Surprisingly enough, this decoder also offers excellent performance in terms of energy efficiency, which is as low as 13 pJ/bit/iteration.
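The high radix idea can be illustrated with a toy add-compare-select (ACS) recursion: a radix-4 step precombines the branch metrics of two consecutive trellis stages, so one step of the collapsed recursion equals two radix-2 steps. This is only a numerical sketch on a hypothetical 4-state trellis with random metrics, not the radix-16 log-MAP datapath of [106].

```python
import random

S = 4  # number of trellis states (hypothetical small trellis)
random.seed(0)
metrics = [random.uniform(0, 1) for _ in range(S)]                 # path metrics M[s]
g1 = [[random.uniform(0, 1) for _ in range(S)] for _ in range(S)]  # branch metrics, stage t
g2 = [[random.uniform(0, 1) for _ in range(S)] for _ in range(S)]  # branch metrics, stage t+1

def acs_radix2(M, g):
    """One trellis stage: M'[s2] = max_s1 (M[s1] + g[s1][s2])."""
    return [max(M[s1] + g[s1][s2] for s1 in range(S)) for s2 in range(S)]

def acs_radix4(M, g1, g2):
    """Two stages at once: precombine branch metrics G[s1][s3], then one ACS."""
    G = [[max(g1[s1][s2] + g2[s2][s3] for s2 in range(S)) for s3 in range(S)]
         for s1 in range(S)]
    return [max(M[s1] + G[s1][s3] for s1 in range(S)) for s3 in range(S)]

two_steps = acs_radix2(acs_radix2(metrics, g1), g2)   # two radix-2 clock cycles
one_step = acs_radix4(metrics, g1, g2)                # one radix-4 clock cycle
```

In hardware, the precombined metrics G come at the cost of wider compare-select trees, which is the usual trade-off of high radix designs.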

In [107] a very efficient turbo decoder targeted to the 3GPP-LTE standard is presented. The architecture includes eight parallel radix-4 MAP decoders and a couple of Batcher networks to propagate eight LLRs between MAP decoders and interleaving memories with no contention. When synthesized on a 130 nm technology, the whole decoder achieves a throughput of 390.6 Mbps, with an occupied area of 3.57 mm2 and an energy efficiency equal to 0.36 nJ/bit/iteration.

A 250 Mbps throughput is achieved in [108] for an LDPC decoder supporting the WiFi standard.

The WiMedia 1.5 standard for Ultra-Wideband (UWB) requires very high throughput (over 1 Gbit/s), which demands LDPC decoding architectures with increased parallelism compared to conventional approaches. In [109], the required throughput is achieved by exploiting additional parallelism at the check node level: the decoder, implemented in a 65 nm technology, has a maximum clock frequency of 264 MHz, occupies an area of 0.51 mm2 and offers throughputs between 0.72 and 1.22 Gbps, depending on the selected block size and code rate.

Iterative concatenated convolutional codes have also been specified by the digital video broadcasting standard (DVB-RCS), which requires a decoding throughput significantly higher than that of wireless cellular communications (68 Mb/s); even higher processing speeds would be required for other applications, such as high performance storage and optical communications. In order to support throughputs ranging from tens of Mb/s up to several Gb/s, high parallelism architectures have to be implemented, and the custom ASIC is the most suited platform for these applications. Implemented ASIC turbo decoders therefore offer high performance and reduced complexity at the cost of almost no flexibility in the obtained result. Among the commercially available ASIC solutions, a number of cores are offered in [105] for DVB-RCS (60 Mb/s) and other high speed applications, such as disk storage (320 Mb/s) and Echostar/Broadcom video broadcast (110 Mb/s).

7.2. Low energy and low area implementations

In [32], low-power VLSI techniques are adopted in the design of an 802.11n LDPC decoder. The decoder exploits the TDMP VSS technique with a 12-datapath architecture: separate variable-to-check and check-to-variable memory banks are instantiated, one per type for each datapath. Each of these macro banks contains 3 micro-banks, each storing 9 values per word. The internal degree of parallelism is thus effectively raised to 1227. VLSI implementation is efficiently tackled under less-than-worst-case conditions thanks to voltage overscaling and reduced precision replica methods [49] [50], described in Section 5.1.

Layered decoding is run in parallel on two block rows in the LDPC decoder described in [110]. This solution can be exploited in QC LDPC codes and provides additional parallelism that allows for higher throughput or reduced clock frequency and power dissipation (see Section 5.1). A low energy implementation of an LDPC decoder for the 10GBASE-T Ethernet protocol is shown in [66], where the energy efficiency is evaluated as 13.3 pJ/bit/iteration.

7.3. Flexible decoders

Typical platforms for the implementation of iterative decoders include software programmable devices, such as microprocessors, Digital Signal Processors (DSPs) and Application Specific Instruction set Processors (ASIPs), customized Application Specific Integrated Circuits (ASICs) and reconfigurable devices (e.g. FPGAs). General purpose microprocessors and DSPs take advantage of software programmability to achieve very high flexibility in terms of decodable turbo and LDPC codes. They also enable run time modification of various communication parameters. In the case of 3G wireless cellular communications, the use of DSPs to support several processing functions, including the forward error correction, has become very popular, and dedicated hardware units have been added to the execution units of some recent DSPs, with the purpose of achieving a sufficient speed-up in the processing to support decoding throughputs as high as 2 Mb/s. The turbo-decoder coprocessor of the TMS320C64x DSP is a typical example of this approach [111]. However, the achievable computational power is not sufficient in most cases. For example, an LDPC decoder implemented on a TMS320C64x yields a 5.4 Mb/s throughput running at 600 MHz [112]: this data rate is not compliant with the specifications of new and emerging wireless standards. At the same time, the large flexibility of microprocessors and DSPs is accompanied by a large overhead in terms of occupied area and dissipated power, which makes their use in the context of channel decoding impractical. Software programmable solutions are nonetheless largely used in the design, test and performance comparison of decoding algorithms.

Graphics processing units (GPUs) offer an attractive, flexible software-based implementation platform for the performance evaluation of decoding algorithms. They provide extremely high computational power, employing a huge number of processing cores working on a large set of data in parallel. Specific optimizations are required to efficiently map the decoding algorithm onto the GPU's hardware resources and to avoid bottlenecks in the access to memory. However, GPUs have been proved to constitute an excellent vehicle to accelerate computation-intensive DSP algorithms. As an example, in [113], a GPU based implementation of an LDPC decoder for the IEEE 802.11n WiFi and IEEE 802.16e WiMAX LDPC codes is demonstrated, with a throughput as high as 100.3 Mbps.

A high degree of flexibility is also offered by the use of embedded processors specifically targeted to the decoding application: this approach, also known as ASIP (Application Specific Instruction set Processor) or customizable processors, greatly overcomes the limitations of general purpose microprocessors, giving the designer the possibility to precisely tune the instruction set, functional units, internal pipelining and memory organization to the specific tasks required by the application. At the same time, an ASIP, being software programmed, is almost as flexible as a commercial DSP. First investigations on the design of ASIP based decoders were conducted in the context of the 3GPP UMTS standard for a turbo code [114] [115]. More recently, quite a large number of efforts have been devoted to the design of ASIP based decoders, and some of them try to extend decoder flexibility to cover both turbo and LDPC codes across multiple standards (WiFi, WiMAX, LTE, DVB-RCS). A multi-ASIP architecture is described in [116], where each core includes two instruction sets, one for LDPC and one for turbo codes. The complete decoder has 8 cores and a simple interconnect network, which allows for efficient memory sharing and collision avoidance.

A decoder processor with similar flexibility is proposed in [117]. Convolutional codes, binary/duo-binary turbo codes and structured LDPC codes are supported by the proposed ASIP, which follows the Single Instruction Multiple Datapath (SIMD) paradigm. The ASIP achieves payloads of 237 Mb/s and 257 Mb/s for WiMAX and WiFi respectively. A single SIMD ASIP with an internal parallelism of 96 is proposed in [118] to decode both binary (WiFi, WiMAX) and non-binary LDPC codes. This large flexibility comes at a high implementation cost, especially due to the allocated memories, which dominate the area occupation.

A high level of flexibility is also achieved by FPGA based solutions, which are provided with a large amount of fine grain parallelism and can be configured to reach better speed performance than DSP implementations, while still offering flexibility. The programming process for an FPGA consists of uploading a bit stream containing the information for the internal configuration of logic blocks, interconnects and memories. For most devices, the reconfiguration process takes a long time and implies that the hardware previously mapped to the FPGA is stopped; these two main difficulties are overcome in recent devices that support partial and dynamic reconfiguration. Another major limitation to the adoption of FPGA platforms for wireless communications comes from the high power dissipation of field programmable devices, especially in static conditions. However, the very short development time makes FPGA platforms the ideal solution for fast prototyping of decoders and rapid empirical testing of different decoding algorithms and implementation choices (Section 7.4). In general, FPGA devices are suited for datapath intensive designs and make use of programmable switch matrices, which are optimized for local routing. The high level of intrinsic parallelism in the LDPC decoding algorithm and the low adjacency of the parity check matrix lead to long and complex routing, which is not well suited to high clock frequencies. In some cases, time sharing of computational and storage resources is used as a method to limit global interconnect routing, at the cost of reduced throughput [119] [37]. A number of FPGA implementations have also been proposed to support decoding of turbo codes with medium throughput [120], [121], [122], [105].


Application Specific Integrated Circuits (ASICs) allow for the implementation of dedicated decoders, carefully customized to realize the desired throughput with the minimum cost in terms of occupied area and dissipated power. Of course, this approach provides limited flexibility, and therefore ASIC based decoders are usually intended for single standard applications. Some form of flexibility can be incorporated in an ASIC design by resorting to modular architectures with proper memory partitioning and programmable interconnects.

An interesting solution towards flexibility is described in [123]. In this work, the flexibility is limited to the LDPC codes specified by a single standard (intra-standard flexibility); however, two schedulings are supported by the same decoder, layered decoding and two phase message passing (TPMP). The potential benefit of this dual scheduling approach comes in the case of unstructured codes, where the layered decoding scheduling generates more collisions in the access to data memories than the two-phase decoding process. The decoder is built around a Reconfigurable Serial Processing Array (RSPA), which includes serial Min-Sum processing elements and reconfigurable logarithmic barrel shifters. The RSPA can be dynamically reconfigured to choose between the two decoding modes according to different block LDPC codes. WiMAX standard specifications in terms of throughput are met with a clock frequency of 260 MHz.
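The difference between the two schedulings can be made concrete on a toy code: in two phase (flooding) message passing, all check nodes consume the same variable-to-check messages, while in layered decoding the posteriors updated by one check row are immediately visible to the next row within the same iteration. The min-sum sketch below uses the (7,4) Hamming code purely for illustration; it is not the RSPA of [123].

```python
# Toy min-sum decoder for the (7,4) Hamming code under the two schedules.
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]
ROWS = [[j for j, h in enumerate(row) if h] for row in H]

def minsum(values, skip):
    """Min-sum check update: extrinsic message toward position `skip`."""
    sign, mag = 1.0, float("inf")
    for k, v in values.items():
        if k == skip:
            continue
        sign *= 1.0 if v >= 0 else -1.0
        mag = min(mag, abs(v))
    return sign * mag

def flooding(llr, iters=1):
    """Two phase (TPMP) schedule: all checks use the same v-to-c messages."""
    v2c = {(c, v): llr[v] for c, row in enumerate(ROWS) for v in row}
    for _ in range(iters):
        c2v = {(c, v): minsum({u: v2c[c, u] for u in ROWS[c]}, v)
               for c, row in enumerate(ROWS) for v in row}
        post = [llr[v] + sum(c2v[c, v] for c in range(len(ROWS)) if v in ROWS[c])
                for v in range(len(llr))]
        v2c = {(c, v): post[v] - c2v[c, v] for c, row in enumerate(ROWS) for v in row}
    return [0 if p >= 0 else 1 for p in post]

def layered(llr, iters=1):
    """Layered schedule: posteriors are updated row by row within an iteration."""
    post = list(llr)
    c2v = {(c, v): 0.0 for c, row in enumerate(ROWS) for v in row}
    for _ in range(iters):
        for c, row in enumerate(ROWS):          # one layer = one check row
            t = {v: post[v] - c2v[c, v] for v in row}
            for v in row:
                c2v[c, v] = minsum(t, v)
                post[v] = t[v] + c2v[c, v]      # immediately visible to next layer
    return [0 if p >= 0 else 1 for p in post]

noisy = [2.0] * 7
noisy[6] = -1.0                                 # all-zero codeword, one flipped bit
hard_flood = flooding(noisy)
hard_layer = layered(noisy)
```

On this example both schedules recover the all-zero codeword in a single iteration; the practical difference is that the layered schedule typically needs fewer iterations, at the price of the memory access conflicts discussed above.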

In the design of a multi-standard decoder, some general guidelines must be followed.

• Commonalities among the required processing and storage units must be identified and exploited, in order to enable the largest possible reuse of hardware components. For example, proper sharing of resources makes it possible to support both Log-MAP and Max-Log-MAP algorithms, and this enables the implementation of decoders capable of processing a binary as well as a duo-binary turbo code by reusing the same architecture and its logic [124]. Solutions have also been proposed to reuse most of the processing hardware of a classical turbo SISO unit for the LDPC case [125].

• Data flow among the allocated components must be flexible enough to support all interconnect needs. Permutation networks are available to dynamically establish the correct connections between the components. For example, fully flexible network approaches have been proposed [4] that allow any parity check matrix to be processed, and therefore support virtually any LDPC code. A more limited and less expensive flexibility is usually realized in LDPC decoders which only cover structured codes: in this case, the fully flexible networks can be replaced by a lower complexity logarithmic barrel shifter [126].

• In order to limit interconnect needs, a serial organization of processing is preferred to a parallel one, when possible. For example, in LDPC decoding, the elementary check node unit can be designed to be serial or parallel. In the first case, any value of check node degree can be easily supported with efficient hardware usage. Of course, very large values of check node degree may increase the latency and limit the achievable throughput to a great extent. Alternatively, parallelism at the check node level provides a significant increase in throughput with affordable complexity [8]. In order to achieve flexibility in the parallel approach, the check node unit is sized for the maximum check node degree required by the addressed applications, and all lower values are supported.
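A serial check node unit of the kind mentioned above can be sketched as follows: edge magnitudes arrive one per cycle and only the two smallest values, the position of the smallest, and the running sign product are kept, so the same small datapath serves any check node degree (the min1/min2 search is the problem studied in [26]). This is an illustrative model under min-sum decoding, not a specific published design.

```python
def serial_check_node(in_llrs):
    """Min-sum check node: serial accumulation, then per-edge extrinsic output."""
    # Accumulation phase: one input consumed per "clock cycle".
    min1, min2, arg1, sign_all = float("inf"), float("inf"), -1, 1
    for i, llr in enumerate(in_llrs):
        mag = abs(llr)
        sign_all *= 1 if llr >= 0 else -1
        if mag < min1:
            min1, min2, arg1 = mag, min1, i
        elif mag < min2:
            min2 = mag
    # Output phase: edge i gets min2 if it carried min1, otherwise min1.
    out = []
    for i, llr in enumerate(in_llrs):
        sign_i = sign_all * (1 if llr >= 0 else -1)  # remove the edge's own sign
        out.append(sign_i * (min2 if i == arg1 else min1))
    return out

msgs = serial_check_node([2.0, -0.5, 3.0, -1.0])
```

The state kept between cycles (two magnitudes, one index, one sign bit) is independent of the check node degree, which is exactly why the serial organization supports any degree with the same hardware.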

One of the most serious problems in the implementation of multi-standard decoders for LDPC codes is the handling of memory accesses without conflicts. The solution to this problem is facilitated in structured QC codes; however, memory access conflicts are unavoidable in the case of multi rate irregular QC codes. Block row and column permutations have been proposed in [127] to enable efficient management of memory accesses.

A modified layered decoding algorithm, known as Layered Message Passing Decoding with Identical Core Matrices (LMPD-ICM), is exploited in [128] to efficiently tackle the flexibility issue. The H matrix is partitioned into several layers, with each layer yielding a core matrix, which consists of the non-zero columns of that layer. The resulting core matrix is further divided into smaller and identical tasks. Applying LMPD-ICM to a QC-LDPC code reveals that the core matrices of the layers are column-permuted versions of each other and show similarities not only among different layers in a single code, but also among different codes within a same code class. When tested on the case of WiMAX QC-LDPC codes, this approach achieves a throughput of 200 Mb/s at a 400 MHz clock frequency.

Recently, Network-on-Chip (NoC) based decoders for both turbo and LDPC codes have been proposed, where the intrinsic flexibility of a NoC is exploited to facilitate communication among a number of units that share the decoding task. More precisely, a NoC-based decoder architecture relies on an NoC where each node contains a processing element and a routing element: each processing element implements the turbo and LDPC decoding algorithms on a given portion of data, while the routing element delivers values to the correct destination.

Unfortunately, the communication patterns for both turbo and LDPC codes suffer from collisions in the memory accesses. The idea of using an NoC to address these collisions was originally proposed in [129] for the case of turbo codes. Later, similar ideas have been explored and refined. In [130], the concept of intra-IP NoC was introduced to indicate an application specific NoC with features tailored to the requirements of a given intellectual property (IP). The use of an intra-IP NoC as the interconnection framework for flexible turbo code decoders was deeply investigated in [49], where it is shown that generalized de Bruijn and generalized Kautz topologies achieve the best results in terms of throughput and complexity. NoC based decoders for LDPC codes have also been proposed ([131], [132]). Finally, the design of a complete, fully flexible NoC based decoder for both turbo and LDPC codes is described in [50] [5]. Performance offered by the decoder on a wide set of cases, including IEEE 802.16e (WiMAX), IEEE 802.11n (WiFi), China Multimedia Mobile Broadcasting (CMMB), Digital Terrestrial Multimedia Broadcast (DTMB), HomePlug AV (HPAV), 3GPP Long Term Evolution (LTE), and Digital Video Broadcasting Return Channel via Satellite (DVB-RCS), is shown. The dynamic reconfiguration issue is also addressed, to enable on the fly switching between any couple of codes in the considered set of standards. Surprisingly enough, such a wide form of flexibility introduces a limited penalty in terms of additional occupied area. For example, a multi standard configuration covering WiMAX, HPAV and DVB-RCS coding results in an area occupation of 1.43 mm2, estimated after place and route on a 65 nm CMOS process. Two state of the art, single standard implementations compliant with the WiMAX specification and synthesized on the same technology occupy areas of 0.47 and 0.55 mm2 for LDPC and turbo decoding respectively [133].
Therefore the area penalty of the extended flexibility offered by the NoC based solution can be evaluated as 100 × (1.43 − 1.02)/1.02 ≈ 40%.


7.4. Hardware accelerators

The study of high performance error correcting codes in the low BER region implies extremely long simulation times. For example, in order to yield a statistically significant number of errors in the performance evaluation of a code at a BER level of 10^-12, at least 10^14 information bits must be processed, and this requires a huge amount of time, even on a powerful machine: with a simulation throughput of 1 Mb/s, more than 3 years are necessary to complete the simulation.
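The order of magnitude can be checked with simple arithmetic, assuming on the order of 100 observed error events are needed for a statistically significant estimate (a common rule of thumb, not a figure from the text):

```python
# Back-of-the-envelope simulation time for BER evaluation at 1e-12.
target_ber = 1e-12
error_events = 100                       # assumed significance threshold
throughput = 1e6                         # software simulation speed, 1 Mb/s

bits_needed = error_events / target_ber  # 1e14 information bits
seconds = bits_needed / throughput       # 1e8 seconds
years = seconds / (3600 * 24 * 365)
print(round(years, 1))                   # prints 3.2
```

At 10^8 seconds of pure simulation, even a 100x speed-up from hardware acceleration still leaves weeks of runtime, which motivates the FPGA clusters mentioned below.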

Therefore, FPGA based hardware accelerators have been proposed to speed up the investigation of code performance in such conditions. A relevant application domain where this kind of accelerator has been exploited is data storage on memories or magnetic supports.

Iteratively decodable codes, and in particular LDPC codes, are a promising solution to replace Reed-Solomon and BCH codes in the implementation of error correction systems for high density magnetic recording and non-volatile Flash memories. Both applications require extremely low bit error rates, typically between 10^-12 and 10^-15. Differently from the case of traditional block codes, the actual performance of LDPC codes at such BER levels needs to be evaluated empirically, in order to verify that the near capacity error correcting capabilities are maintained and no error floor is found. The usual Monte Carlo simulations run on PC or server platforms cannot provide reliable performance results within a tractable amount of time. Therefore, for these applications, hardware based emulation [134] has been adopted, by mapping the decoder algorithm and the whole verification chain onto high performance FPGA devices or even onto a parallel FPGA processing cluster. BER performance down to 10^-14 has been evaluated by means of this approach for large LDPC codes [135].

There are so many interesting and outstanding implementations that making an exhaustive list is an impossible mission. We have chosen to sample the abundant literature in order to give some insight to the reader.

8. Conclusion

In this chapter we have tried to give the reader some information about the implementation of error control codes. The focus is on LDPC codes and turbo codes, which have been well explored since the invention of turbo codes and the rediscovery of LDPC codes in the nineties: in IEEE Xplore, a search with the words "LDPC" OR "TURBO-CODES" and "architecture" returns more than 1000 papers! It is still a very active area, since new standards and applications keep appearing, as well as new hardware constraints like flexibility, low power and/or very high speed, but most of the papers are now incremental. Interestingly, most of the error control decoders used in, say, a cellular phone or a set-top box come from a small set of very small companies that are highly specialized in the design of error control IP (Intellectual Property) cores. It is also interesting to note that a standard decoder (say a few 100 Mbit/s in a current mid-2010 technology) is mainly memory (say 90%); the remaining 10% of the silicon area is composed of processing logic, state machines for control and input/output protocol. Those 10% of area represent most of the design complexity.

To conclude, there is a central question: which one, between turbo codes and LDPC codes, is the most efficient? Each code family has its own partisans, but there is no clear answer so far. The reader can refer to [136], where Kienle et al. synthesize their accumulated experience on the design of several ASICs to try to give a clear answer.

The story is not yet ended. For example, papers about architectures of polar code decoders and non-binary LDPC decoders are beginning to appear. Once a new family of codes is implemented in a standard, it will again trigger academic and industrial research activity.

[1] D. Gnaedig, E. Boutillon, J. Tousch, M. Jezequel, Towards an optimal parallel decoding of turbo codes, in: 4th International Symposium on Turbo Codes and Related Topics, 2006.

[2] A. Barbulescu, S. Pietrobon, Terminating the trellis of turbo-codes in the same state, Electronics Letters 31 (1) (1995) 22–23. doi:10.1049/el:19950008.

[3] A. Blanksby, C. Howland, A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder, Solid-State Circuits, IEEE Journal of 37 (3) (2002) 404–412. doi:10.1109/4.987093.

[4] G. Masera, F. Quaglio, F. Vacca, Implementation of a flexible LDPC decoder, Circuits and Systems II: Express Briefs, IEEE Transactions on 54 (6) (2007) 542–546. doi:10.1109/TCSII.2007.894409.

[5] C. Condo, M. Martina, G. Masera, A network-on-chip-based turbo/LDPC decoder architecture, in: Design, Automation Test in Europe Conference Exhibition (DATE), 2012, pp. 1525–1530.


[6] E. Boutillon, J. Castura, F. R. Kschischang, Decoder-first code design, in: Turbo Codes and Related Topics, 2000 2nd International Symposium on, 2000, pp. 459–462.

[7] M. Mansour, N. Shanbhag, VLSI architectures for SISO-APP decoders, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 11 (4) (2003) 627–650. doi:10.1109/TVLSI.2003.816136.

[8] M. Awais, A. Singh, E. Boutillon, G. Masera, A novel architecture for scalable, high throughput, multi-standard LDPC decoder, in: Digital System Design (DSD), 2011 14th Euromicro Conference on, 2011, pp. 340–347. doi:10.1109/DSD.2011.112.

[9] C. Shung, P. Siegel, G. Ungerboeck, H. Thapar, VLSI architectures for metric normalization in the Viterbi algorithm, in: Communications, 1990. ICC '90, Including Supercomm Technical Sessions. SUPERCOMM/ICC '90. Conference Record., IEEE International Conference on, 1990, pp. 1723–1728 vol.4. doi:10.1109/ICC.1990.117356.

[10] P. Tortelier, D. Duponteil, Dynamique des métriques dans l'algorithme de Viterbi, Annals of Telecommunications - Annales des Télécommunications 45 (7) (1990) 377–383. doi:10.1007/BF03000938.

[11] E. Boutillon, W. Gross, P. Gulak, VLSI architectures for the MAP algorithm, Communications, IEEE Transactions on 51 (2) (2003) 175–185. doi:10.1109/TCOMM.2003.809247.

[12] A. Hekstra, An alternative to metric rescaling in Viterbi decoders, Communications, IEEE Transactions on 37 (11) (1989) 1220–1222. doi:10.1109/26.46516.

[13] W. M. Gilbert, A. Worm, H. Michel, F. Gilbert, G. Kreiselmaier, M. Thul, N. Wehn, Advanced implementation issues of turbo-decoders, in: Proc. 2nd International Symposium on Turbo-Codes and Related Topics, 2000, pp. 351–354.

[14] H. Liu, J.-P. Diguet, C. Jego, M. Jezequel, E. Boutillon, Energy efficient turbo decoder with reduced state metric quantization, in: Signal Processing Systems, 2007 IEEE Workshop on, 2007, pp. 237–242. doi:10.1109/SIPS.2007.4387551.


[15] M. Martina, G. Masera, State metric compression techniques for turbo decoder architectures, Circuits and Systems I: Regular Papers, IEEE Transactions on 58 (5) (2011) 1119–1128. doi:10.1109/TCSI.2010.2090566.

[16] A. Singh, E. Boutillon, G. Masera, Bit-width optimization of extrinsic information in turbo decoder, in: Turbo Codes and Related Topics, 2008 5th International Symposium on, 2008, pp. 134–138. doi:10.1109/TURBOCODING.2008.4658686.

[17] J. Vogt, J. Ertel, A. Finger, Reducing bit width of extrinsic memory in turbo decoder realisations, Electronics Letters 36 (20) (2000) 1714–1716. doi:10.1049/el:20001177.

[18] O. Muller, A. Baghdadi, M. Jezequel, Bandwidth reduction of extrinsic information exchange in turbo decoding, Electronics Letters 42 (19) (2006) 1104–1105. doi:10.1049/el:20062209.

[19] A. Abbasfar, K. Yao, An efficient architecture for high speed turbo decoders, in: Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on, Vol. 4, 2003, pp. IV-521–4 vol.4. doi:10.1109/ICASSP.2003.1202694.

[20] W. Gross, P. Gulak, Simplified MAP algorithm suitable for implementation of turbo decoders, Electronics Letters 34 (16) (1998) 1577–1578. doi:10.1049/el:19981120.

[21] J.-F. Cheng, T. Ottosson, Linearly approximated log-MAP algorithms for turbo decoding, in: Vehicular Technology Conference Proceedings, 2000. VTC 2000-Spring Tokyo. 2000 IEEE 51st, Vol. 3, 2000, pp. 2252–2256 vol.3. doi:10.1109/VETECS.2000.851673.

[22] H. Wang, H. Yang, D. Yang, Improved log-MAP decoding algorithm for turbo-like codes, Communications Letters, IEEE 10 (3) (2006) 186–188. doi:10.1109/LCOMM.2006.1603379.

[23] S. Papaharalabos, P. Mathiopoulos, G. Masera, M. Martina, On optimal and near-optimal turbo decoding using generalized max operator, Communications Letters, IEEE 13 (7) (2009) 522–524. doi:10.1109/LCOMM.2009.090537.


[24] K.-L. Hsiung, S.-J. Kim, S. Boyd, Tractable approximate robust geometric programming, Optimization and Engineering 9 (2) (2008) 95–118. doi:10.1007/s11081-007-9025-z.

[25] S. Papaharalabos, P. Mathiopoulos, G. Masera, M. Martina, Non-recursive max* operator with reduced implementation complexity for turbo decoding, Communications, IET 6 (7) (2012) 702–707. doi:10.1049/iet-com.2011.0217.

[26] C.-L. Wey, M.-D. Shieh, S.-Y. Lin, Algorithms of finding the first two minimum values and their hardware implementation, Circuits and Systems I: Regular Papers, IEEE Transactions on 55 (11) (2008) 3430–3437. doi:10.1109/TCSI.2008.924892.

[27] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, X.-Y. Hu, Reduced-complexity decoding of LDPC codes, Communications, IEEE Transactions on 53 (8) (2005) 1288–1299. doi:10.1109/TCOMM.2005.852852.

[28] X.-Y. Hu, E. Eleftheriou, D.-M. Arnold, A. Dholakia, Efficient implementations of the sum-product algorithm for decoding LDPC codes, in: Global Telecommunications Conference, 2001. GLOBECOM '01. IEEE, Vol. 2, 2001, pp. 1036–1036E vol.2. doi:10.1109/GLOCOM.2001.965575.

[29] S. Papaharalabos, P. Mathiopoulos, Simplified sum-product algorithm for decoding LDPC codes with optimal performance, Electronics Letters 45 (2) (2009) 116–117. doi:10.1049/el:20092323.

[30] M. Martina, G. Masera, S. Papaharalabos, P. Mathiopoulos, F. Gioulekas, On practical implementation and generalizations of max* operator for turbo and LDPC decoders, Instrumentation and Measurement, IEEE Transactions on 61 (4) (2012) 888–895. doi:10.1109/TIM.2011.2173045.

[31] F. Guilloud, E. Boutillon, J.-L. Danger, Lambda-Min decoding algorithm of regular and irregular LDPC codes, in: 3rd International Symposium on Turbo Codes and Related Topics, 1-5 September, Brest, France, 2003.


[32] J. Vogt, A. Finger, Increasing throughput of iterative decoders, Electronics Letters 37 (12) (2001) 770–771. doi:10.1049/el:20010495.

[33] G. Bosco, G. Montorsi, S. Benedetto, Decreasing the complexity of LDPC iterative decoders, Communications Letters, IEEE 9 (7) (2005) 634–636. doi:10.1109/LCOMM.2005.1461688.

[34] S. Sweatlock, S. Dolinar, K. Andrews, Buffering requirements for variable-iterations LDPC decoders, in: Information Theory and Applications Workshop, 2008, pp. 523–530. doi:10.1109/ITA.2008.4601025.

[35] M. Mansour, High-performance decoders for regular and irregularrepeat-accumulate codes, in: Global Telecommunications Conference,2004. GLOBECOM ’04. IEEE, Vol. 4, 2004, pp. 2583–2588 Vol.4.doi:10.1109/GLOCOM.2004.1378472.

[36] D. Hocevar, LDPC code construction with flexible hardware im-plementation, in: Communications, 2003. ICC ’03. IEEE In-ternational Conference on, Vol. 4, 2003, pp. 2708–2712 vol.4.doi:10.1109/ICC.2003.1204466.

[37] T. Zhang, K. Parhi, A 54 mbps (3,6)-regular FPGA LDPC decoder,in: Signal Processing Systems, 2002. (SIPS ’02). IEEE Workshop on,2002, pp. 127 – 132. doi:10.1109/SIPS.2002.1049697.

[38] H. Zhong, T. Zhang, Design of VLSI implementation-orientedLDPC codes, in: Vehicular Technology Conference, 2003. VTC2003-Fall. 2003 IEEE 58th, Vol. 1, 2003, pp. 670–673 Vol.1.doi:10.1109/VETECF.2003.1285102.

[39] A. Giulietti, L. Van der Perre, A. Strum, Parallel turbo coding inter-leavers: avoiding collisions in accesses to storage elements, ElectronicsLetters 38 (5) (2002) 232–234. doi:10.1049/el:20020148.

[40] B. Bougard, A. Giulietti, L. Van Der Perre, F. Catthoor,A class of power efficient VLSI architectures for high speedturbo-decoding, in: Global Telecommunications Conference, 2002.GLOBECOM ’02. IEEE, Vol. 1, 2002, pp. 549–553 vol.1.doi:10.1109/GLOCOM.2002.1188139.



[41] J. Kwak, S. M. Park, S.-S. Yoon, K. Lee, Implementation of a parallel turbo decoder with dividable interleaver, in: Circuits and Systems, 2003. ISCAS '03. Proceedings of the 2003 International Symposium on, Vol. 2, 2003, pp. II-65–II-68 vol. 2. doi:10.1109/ISCAS.2003.1205889.

[42] A. Nimbalker, T. Blankenship, B. Classon, T. Fuja, D. Costello, Contention-free interleavers for high-throughput turbo decoding, Communications, IEEE Transactions on 56 (8) (2008) 1258–1267. doi:10.1109/TCOMM.2008.050502.

[43] R. Dobkin, M. Peleg, R. Ginosar, Parallel interleaver design and VLSI architecture for low-latency MAP turbo decoders, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 13 (4) (2005) 427–438. doi:10.1109/TVLSI.2004.842916.

[44] F. Gilbert, M. J. Thul, N. Wehn, Communication centric architectures for turbo-decoding on embedded multiprocessors, in: Design, Automation and Test in Europe Conference and Exhibition, 2003, 2003, pp. 356–361. doi:10.1109/DATE.2003.1253634.

[45] M. J. Thul, F. Gilbert, N. Wehn, Concurrent interleaving architectures for high-throughput channel coding, in: Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on, Vol. 2, 2003, pp. II-613–16 vol. 2. doi:10.1109/ICASSP.2003.1202441.

[46] F. Speziali, J. Zory, Scalable and area efficient concurrent interleaver for high throughput turbo-decoders, in: Digital System Design, 2004. DSD 2004. Euromicro Symposium on, 2004, pp. 334–341. doi:10.1109/DSD.2004.1333294.

[47] F. Kienle, Implementation issues of low-density parity-check decoders, Ph.D. thesis, University of Kaiserslautern, ISBN 978-3-939432-06-7 (2006).

[48] T. Theocharides, G. Link, N. Vijaykrishnan, M. Irwin, Implementing LDPC decoding on network-on-chip, in: VLSI Design, 2005. 18th International Conference on, 2005, pp. 134–137. doi:10.1109/ICVD.2005.109.



[49] M. Martina, G. Masera, Turbo NOC: A framework for the design of network-on-chip-based turbo decoder architectures, Circuits and Systems I: Regular Papers, IEEE Transactions on 57 (10) (2010) 2776–2789. doi:10.1109/TCSI.2010.2046257.

[50] C. Condo, M. Martina, G. Masera, VLSI implementation of a multi-mode turbo/LDPC decoder architecture, Circuits and Systems I: Regular Papers, IEEE Transactions on PP (99) (2012) 1–14. doi:10.1109/TCSI.2012.2221216.

[51] M. Martina, G. Masera, H. Moussa, A. Baghdadi, On chip interconnects for multiprocessor turbo decoding architectures, Microprocessors and Microsystems 35 (2) (2011) 167–181.

[52] A. Tarable, S. Benedetto, G. Montorsi, Mapping interleaving laws to parallel turbo and LDPC decoder architectures, Information Theory, IEEE Transactions on 50 (9) (2004) 2002–2009. doi:10.1109/TIT.2004.833353.

[53] A. Sani, P. Coussy, C. Chavet, E. Martin, A methodology based on transportation problem modeling for designing parallel interleaver architectures, in: Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, 2011, pp. 1613–1616. doi:10.1109/ICASSP.2011.5946806.

[54] M. Bickerstaff, L. Davis, C. Thomas, D. Garrett, C. Nicol, A 24 Mb/s radix-4 logMAP turbo decoder for 3GPP-HSDPA mobile wireless, in: Solid-State Circuits Conference, 2003. Digest of Technical Papers. ISSCC. 2003 IEEE International, 2003, pp. 150–484 vol. 1. doi:10.1109/ISSCC.2003.1234244.

[55] Y. Zhang, K. Parhi, High-throughput radix-4 logMAP turbo decoder architecture, in: Signals, Systems and Computers, 2006. ACSSC '06. Fortieth Asilomar Conference on, 2006, pp. 1711–1715. doi:10.1109/ACSSC.2006.355053.

[56] C. Thomas, M. Bickerstaff, L. Davis, T. Prokop, B. Widdup, G. Zhou, D. Garrett, C. Nicol, Integrated circuits for channel coding in 3G cellular mobile wireless systems, Communications Magazine, IEEE 41 (8) (2003) 150–159. doi:10.1109/MCOM.2003.1222732.



[57] J. Zhang, Y. Wang, M. Fossorier, J. S. Yedidia, Iterative decoding with replicas, Information Theory, IEEE Transactions on 53 (5) (2007) 1644–1663. doi:10.1109/TIT.2007.894683.

[58] O. Muller, A. Baghdadi, M. Jezequel, Exploring parallel processing levels for convolutional turbo decoding, in: Information and Communication Technologies, 2006. ICTTA '06. 2nd, Vol. 2, 2006, pp. 2353–2358. doi:10.1109/ICTTA.2006.1684774.

[59] T. Ilnseher, F. Kienle, C. Weis, N. Wehn, A 2.15 Gbit/s turbo code decoder for LTE Advanced base station applications, in: Turbo Codes and Iterative Information Processing (ISTC), 2012 7th International Symposium on, 2012, pp. 21–25. doi:10.1109/ISTC.2012.6325191.

[60] N. Onizawa, T. Hanyu, V. Gaudet, Design of high-throughput fully parallel LDPC decoders based on wire partitioning, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 18 (3) (2010) 482–489. doi:10.1109/TVLSI.2008.2011360.

[61] A. Darabiha, A. Carusone, F. Kschischang, A 3.3-Gbps bit-serial block-interlaced min-sum LDPC decoder in 0.13-µm CMOS, in: Custom Integrated Circuits Conference, 2007. CICC '07. IEEE, 2007, pp. 459–462. doi:10.1109/CICC.2007.4405773.

[62] T. Mohsenin, D. Truong, B. Baas, A low-complexity message-passing algorithm for reduced routing congestion in LDPC decoders, Circuits and Systems I: Regular Papers, IEEE Transactions on 57 (5) (2010) 1048–1061. doi:10.1109/TCSI.2010.2046957.

[63] M. Karkooti, J. Cavallaro, Semi-parallel reconfigurable architectures for real-time LDPC decoding, in: Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on, Vol. 1, 2004, pp. 579–585 vol. 1. doi:10.1109/ITCC.2004.1286526.

[64] L. Liu, C. J. R. Shi, Sliced message passing: High throughput overlapped decoding of high-rate low-density parity-check codes, Circuits and Systems I: Regular Papers, IEEE Transactions on 55 (11) (2008) 3697–3710. doi:10.1109/TCSI.2008.926995.

[65] Z. Zhang, V. Anantharam, M. Wainwright, B. Nikolic, An efficient 10GBASE-T Ethernet LDPC decoder design with low error floors, Solid-State Circuits, IEEE Journal of 45 (4) (2010) 843–855. doi:10.1109/JSSC.2010.2042255.

[66] A. Cevrero, Y. Leblebici, P. Ienne, A. Burg, A 5.35 mm2 10GBASE-T Ethernet LDPC decoder chip in 90 nm CMOS, in: Solid State Circuits Conference (A-SSCC), 2010 IEEE Asian, 2010, pp. 1–4. doi:10.1109/ASSCC.2010.5716619.

[67] A. Carroll, G. Heiser, An analysis of power consumption in a smartphone, in: Proceedings of the 2010 USENIX conference on USENIX annual technical conference, Vol. 1, 2010, pp. 271–284.

[68] C. Han, T. Harrold, S. Armour, I. Krikidis, S. Videv, P. Grant, H. Haas, J. Thompson, I. Ku, C.-X. Wang, T. A. Le, M. Nakhai, J. Zhang, L. Hanzo, Green radio: radio techniques to enable energy-efficient wireless networks, Communications Magazine, IEEE 49 (6) (2011) 46–54. doi:10.1109/MCOM.2011.5783984.

[69] A. Worthen, S. Hong, R. Gupta, W. Stark, Performance optimization of VLSI transceivers for low-energy communications systems, in: Military Communications Conference Proceedings, 1999. MILCOM 1999. IEEE, Vol. 2, 1999, pp. 1434–1438 vol. 2. doi:10.1109/MILCOM.1999.821440.

[70] Y. Sun, J. Cavallaro, T. Ly, Scalable and low power LDPC decoder design using high level algorithmic synthesis, in: SOC Conference, 2009. SOCC 2009. IEEE International, 2009, pp. 267–270. doi:10.1109/SOCCON.2009.5398044.

[71] W. Wang, G. Choi, Minimum-energy LDPC decoder for real-time mobile application, in: Design, Automation Test in Europe Conference Exhibition, 2007. DATE '07, 2007, pp. 1–6. doi:10.1109/DATE.2007.364615.

[72] W. Wang, G. Choi, K. Gunnam, Low-power VLSI design of LDPC decoder using DVFS for AWGN channels, in: VLSI Design, 2009 22nd International Conference on, 2009, pp. 51–56. doi:10.1109/VLSI.Design.2009.68.

[73] O.-H. Leung, C.-Y. Tsui, R.-K. Cheng, Reducing power consumption of turbo-code decoder using adaptive iteration with variable supply voltage, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 9 (1) (2001) 34–41. doi:10.1109/92.920817.

[74] A. Darabiha, A. Chan Carusone, F. Kschischang, Power reduction techniques for LDPC decoders, Solid-State Circuits, IEEE Journal of 43 (8) (2008) 1835–1845. doi:10.1109/JSSC.2008.925402.

[75] J. Cho, N. Shanbhag, W. Sung, Low-power implementation of a high-throughput LDPC decoder for IEEE 802.11n standard, in: Signal Processing Systems, 2009. SiPS 2009. IEEE Workshop on, 2009, pp. 40–45. doi:10.1109/SIPS.2009.5336223.

[76] Z. Chen, X. Zhao, X. Peng, D. Zhou, S. Goto, An early stopping criterion for decoding LDPC codes in WiMAX and WiFi standards, in: Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, 2010, pp. 473–476. doi:10.1109/ISCAS.2010.5537638.

[77] C. Schurgers, L. Van der Perre, M. Engels, H. De Man, Adaptive turbo decoding for indoor wireless communication, in: Signals, Systems, and Electronics, 1998. ISSSE 98. 1998 URSI International Symposium on, 1998, pp. 107–111. doi:10.1109/ISSSE.1998.738048.

[78] Z. Wang, H. Suzuki, K. Parhi, VLSI implementation issues of turbo decoder design for wireless applications, in: Signal Processing Systems, 1999. SiPS 99. 1999 IEEE Workshop on, 1999, pp. 503–512. doi:10.1109/SIPS.1999.822356.

[79] R. Shao, S. Lin, M. Fossorier, Two simple stopping criteria for turbo decoding, Communications, IEEE Transactions on 47 (8) (1999) 1117–1120. doi:10.1109/26.780444.

[80] J. Li, X.-H. You, J. Li, Early stopping for LDPC decoding: convergence of mean magnitude (CMM), Communications Letters, IEEE 10 (9) (2006) 667–669. doi:10.1109/LCOMM.2006.1714539.

[81] J. Hagenauer, E. Offer, L. Papke, Iterative decoding of binary block and convolutional codes, Information Theory, IEEE Transactions on 42 (2) (1996) 429–445. doi:10.1109/18.485714.



[82] D. Shin, K. Heo, S. Oh, J. Ha, A stopping criterion for low-density parity-check codes, in: Vehicular Technology Conference, 2007. VTC2007-Spring. IEEE 65th, 2007, pp. 1529–1533. doi:10.1109/VETECS.2007.319.

[83] R. Gonzalez, B. Gordon, M. Horowitz, Supply and threshold voltage scaling for low power CMOS, Solid-State Circuits, IEEE Journal of 32 (8) (1997) 1210–1216. doi:10.1109/4.604077.

[84] G. Masera, M. Mazza, G. Piccinini, F. Viglione, M. Zamboni, Architectural strategies for low-power VLSI turbo decoders, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 10 (3) (2002) 279–285. doi:10.1109/TVLSI.2002.1043330.

[85] C. Schurgers, F. Catthoor, M. Engels, Energy efficient data transfer and storage organization for a MAP turbo decoder module, in: Low Power Electronics and Design, 1999. Proceedings. 1999 International Symposium on, 1999, pp. 76–81.

[86] C. Schurgers, F. Catthoor, M. Engels, Memory optimization of MAP turbo decoder algorithms, Very Large Scale Integration (VLSI) Systems, IEEE Transactions on 9 (2) (2001) 305–312. doi:10.1109/92.924051.

[87] H. Liu, C. Jego, E. Boutillon, J.-P. Diguet, M. Jezequel, Scarce state transition turbo decoding based on re-encoding combined with dummy insertion, Electronics Letters 45 (16) (2009) 846–848. doi:10.1049/el.2009.0394.

[88] J. Hagenauer, M. Winklhofer, The analog decoder, in: Information Theory, 1998. Proceedings. 1998 IEEE International Symposium on, 1998, p. 145. doi:10.1109/ISIT.1998.708738.

[89] H.-A. Loeliger, F. Lustenberger, M. Helfenstein, F. Tarkoy, Probability propagation and decoding in analog VLSI, in: Information Theory, 1998. Proceedings. 1998 IEEE International Symposium on, 1998, p. 146. doi:10.1109/ISIT.1998.708739.

[90] V. Gaudet, P. Gulak, A 13.3-Mb/s 0.35-µm CMOS analog turbo decoder IC with a configurable interleaver, Solid-State Circuits, IEEE Journal of 38 (11) (2003) 2010–2015. doi:10.1109/JSSC.2003.818134.



[91] A. Xotta, D. Vogrig, A. Gerosa, A. Neviani, A. Graell i Amat, G. Montorsi, M. Bruccoleri, G. Betti, An all-analog CMOS implementation of a turbo decoder for hard-disk drive read channels, in: Circuits and Systems, 2002. ISCAS 2002. IEEE International Symposium on, Vol. 5, 2002, pp. V-69–V-72 vol. 5. doi:10.1109/ISCAS.2002.1010642.

[92] S. Hemati, A. Banihashemi, C. Plett, An 80-Mb/s 0.18-µm CMOS analog min-sum iterative decoder for a (32,8,10) LDPC code, in: Custom Integrated Circuits Conference, 2005. Proceedings of the IEEE 2005, 2005, pp. 243–246. doi:10.1109/CICC.2005.1568652.

[93] M. Arzel, C. Lahuec, F. Seguin, D. Gnaedig, M. Jezequel, Semi-iterative analog turbo decoding, Circuits and Systems I: Regular Papers, IEEE Transactions on 54 (6) (2007) 1305–1316. doi:10.1109/TCSI.2007.897770.

[94] J. von Neumann, Probabilistic logics and the synthesis of reliable organisms from unreliable components, Automata Studies 34 (1956) 43–98.

[95] V. Gaudet, A. Rapley, Iterative decoding using stochastic computation, Electronics Letters 39 (3) (2003) 299–301. doi:10.1049/el:20030217.

[96] C. Winstead, V. Gaudet, A. Rapley, C. Schlegel, Stochastic iterative decoders, in: Information Theory, 2005. ISIT 2005. Proceedings. International Symposium on, 2005, pp. 1116–1120. doi:10.1109/ISIT.2005.1523513.

[97] S. Tehrani, A. Naderi, G.-A. Kamendje, S. Hemati, S. Mannor, W. Gross, Majority-based tracking forecast memories for stochastic LDPC decoding, Signal Processing, IEEE Transactions on 58 (9) (2010) 4883–4896. doi:10.1109/TSP.2010.2051434.

[98] A. Naderi, S. Mannor, M. Sawan, W. Gross, Delayed stochastic decoding of LDPC codes, Signal Processing, IEEE Transactions on 59 (11) (2011) 5617–5626. doi:10.1109/TSP.2011.2163630.

[99] G. Sarkis, W. Gross, Efficient stochastic decoding of non-binary LDPC codes with degree-two variable nodes, Communications Letters, IEEE 16 (3) (2012) 389–391. doi:10.1109/LCOMM.2011.122211.111737.



[100] Q. T. Dong, M. Arzel, C. Jego, W. Gross, Stochastic decoding of turbo codes, Signal Processing, IEEE Transactions on 58 (12) (2010) 6421–6425. doi:10.1109/TSP.2010.2072924.

[101] Q. T. Dong, M. Arzel, C. Jego, Design and FPGA implementation of stochastic turbo decoder, in: New Circuits and Systems Conference (NEWCAS), 2011 IEEE 9th International, 2011, pp. 21–24. doi:10.1109/NEWCAS.2011.5981209.

[102] D. Declercq, E. Li, B. Vasic, S. Planjery, Approaching maximum likelihood decoding of finite length LDPC codes via FAID diversity, in: Information Theory Workshop (ITW), 2012 IEEE, 2012, pp. 487–491. doi:10.1109/ITW.2012.6404721.

[103] TrellisWare Technologies, FEC products, [Online; accessed 21-December-2012]. URL http://www.trellisware.com/products/fec-products/

[104] Design & Reuse, Silicon IP, [Online; accessed 21-December-2012] (2012). URL http://www.design-reuse.com/sip/

[105] iCODING, Iterative turbo decoder cores, [Online; accessed 21-December-2012] (2012). URL http://www.icoding.com/products.htm

[106] K.-T. Shr, Y.-C. Chang, C.-Y. Lin, Y.-H. Huang, A 6.6 pJ/bit/iter radix-16 modified log-MAP decoder using two-stage ACS architecture, in: Solid State Circuits Conference (A-SSCC), 2011 IEEE Asian, 2011, pp. 313–316. doi:10.1109/ASSCC.2011.6123575.

[107] C. Studer, C. Benkeser, S. Belfanti, Q. Huang, A 390 Mb/s 3.57 mm2 3GPP-LTE turbo decoder ASIC in 0.13 µm CMOS, in: Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2010 IEEE International, 2010, pp. 274–275. doi:10.1109/ISSCC.2010.5433918.

[108] X.-Y. Shih, C.-Z. Zhan, A.-Y. Wu, A 7.39 mm2 76 mW (1944, 972) LDPC decoder chip for IEEE 802.11n applications, in: Solid-State Circuits Conference, 2008. A-SSCC '08. IEEE Asian, 2008, pp. 301–304. doi:10.1109/ASSCC.2008.4708787.



[109] M. Alles, N. Wehn, F. Berens, A synthesizable IP core for WiMedia 1.5 UWB LDPC code decoding, in: Ultra-Wideband, 2009. ICUWB 2009. IEEE International Conference on, 2009, pp. 597–601. doi:10.1109/ICUWB.2009.5288833.

[110] X. Peng, Z. Chen, X. Zhao, D. Zhou, S. Goto, A 115 mW 1 Gbps QC-LDPC decoder ASIC for WiMAX in 65 nm CMOS, in: Solid State Circuits Conference (A-SSCC), 2011 IEEE Asian, 2011, pp. 317–320. doi:10.1109/ASSCC.2011.6123576.

[111] Texas Instruments, TMS320C64x DSP turbo-decoder coprocessor (TCP) reference guide, [Online; accessed 21-December-2012] (2004). URL http://www.ti.com/lit/ug/spru534b/spru534b.pdf

[112] G. Lechner, J. Sayir, M. Rupp, Efficient DSP implementation of an LDPC decoder, in: Acoustics, Speech, and Signal Processing, 2004. Proceedings. (ICASSP '04). IEEE International Conference on, Vol. 4, 2004, pp. iv-665–iv-668 vol. 4. doi:10.1109/ICASSP.2004.1326914.

[113] G. Wang, M. Wu, Y. Sun, J. Cavallaro, A massively parallel implementation of QC-LDPC decoder on GPU, in: Application Specific Processors (SASP), 2011 IEEE 9th Symposium on, 2011, pp. 82–85. doi:10.1109/SASP.2011.5941084.

[114] A. La Rosa, C. Passerone, F. Gregoretti, L. Lavagno, Implementation of a UMTS turbo-decoder on a dynamically reconfigurable platform, in: Design, Automation and Test in Europe Conference and Exhibition, 2004. Proceedings, Vol. 2, 2004, pp. 1218–1223 vol. 2. doi:10.1109/DATE.2004.1269062.

[115] O. Schliebusch, H. Meyr, R. Leupers, Optimized ASIP Synthesis from Architecture Description Language Models, Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2007.

[116] P. Murugappa, R. Al-Khayat, A. Baghdadi, M. Jezequel, A flexible high throughput multi-ASIP architecture for LDPC and turbo decoding, in: Design, Automation Test in Europe Conference Exhibition (DATE), 2011, 2011, pp. 1–6.

[117] M. Alles, T. Vogt, N. Wehn, FlexiChaP: A reconfigurable ASIP for convolutional, turbo, and LDPC code decoding, in: Turbo Codes and Related Topics, 2008 5th International Symposium on, 2008, pp. 84–89. doi:10.1109/TURBOCODING.2008.4658677.

[118] F. Naessens, A. Bourdoux, A. Dejonghe, A flexible ASIP decoder for combined binary and non-binary LDPC codes, in: Communications and Vehicular Technology in the Benelux (SCVT), 2010 17th IEEE Symposium on, 2010, pp. 1–5. doi:10.1109/SCVT.2010.5720462.

[119] M. Mansour, N. Shanbhag, Memory-efficient turbo decoder architectures for LDPC codes, in: Signal Processing Systems, 2002. (SIPS '02). IEEE Workshop on, 2002, pp. 159–164. doi:10.1109/SIPS.2002.1049702.

[120] S. Sharma, S. Attri, F. Chauhan, A simplified and efficient implementation of FPGA-based turbo decoder, in: Performance, Computing, and Communications Conference, 2003. Conference Proceedings of the 2003 IEEE International, 2003, pp. 207–213. doi:10.1109/PCCC.2003.1203701.

[121] X.-J. Zeng, Z.-L. Hong, Design and implementation of a turbo decoder for 3G W-CDMA systems, Consumer Electronics, IEEE Transactions on 48 (2) (2002) 284–291. doi:10.1109/TCE.2002.1010133.

[122] J. Steensma, C. Dick, FPGA implementation of a 3GPP turbo codec, in: Signals, Systems and Computers, 2001. Conference Record of the Thirty-Fifth Asilomar Conference on, Vol. 1, 2001, pp. 61–65 vol. 1. doi:10.1109/ACSSC.2001.986881.

[123] S. Huang, D. Bao, B. Xiang, Y. Chen, X. Zeng, A flexible LDPC decoder architecture supporting two decoding algorithms, in: Circuits and Systems (ISCAS), Proceedings of 2010 IEEE International Symposium on, 2010, pp. 3929–3932. doi:10.1109/ISCAS.2010.5537686.

[124] M. Martina, M. Nicola, G. Masera, A flexible UMTS-WiMax turbo decoder architecture, Circuits and Systems II: Express Briefs, IEEE Transactions on 55 (4) (2008) 369–373. doi:10.1109/TCSII.2008.919510.

[125] M. Scarpellino, A. Singh, E. Boutillon, G. Masera, Reconfigurable architecture for LDPC and turbo decoding: A NoC case study, in: Spread Spectrum Techniques and Applications, 2008. ISSSTA '08. IEEE 10th International Symposium on, 2008, pp. 671–676. doi:10.1109/ISSSTA.2008.131.

[126] C. Beuschel, H.-J. Pfleiderer, FPGA implementation of a flexible decoder for long LDPC codes, in: Field Programmable Logic and Applications, 2008. FPL 2008. International Conference on, 2008, pp. 185–190. doi:10.1109/FPL.2008.4629929.

[127] B. Xiang, D. Bao, S. Huang, X. Zeng, A fully-overlapped multi-mode QC-LDPC decoder architecture for mobile WiMAX applications, in: Application-specific Systems Architectures and Processors (ASAP), 2010 21st IEEE International Conference on, 2010, pp. 225–232. doi:10.1109/ASAP.2010.5540958.

[128] Y.-L. Wang, Y.-L. Ueng, C.-L. Peng, C.-J. Yang, Processing-task arrangement for a low-complexity full-mode WiMAX LDPC codec, Circuits and Systems I: Regular Papers, IEEE Transactions on 58 (2) (2011) 415–428. doi:10.1109/TCSI.2010.2071910.

[129] C. Neeb, M. Thul, N. Wehn, Network-on-chip-centric approach to interleaving in high throughput channel decoders, in: Circuits and Systems, 2005. ISCAS 2005. IEEE International Symposium on, 2005, pp. 1766–1769 vol. 2. doi:10.1109/ISCAS.2005.1464950.

[130] F. Vacca, G. Masera, H. Moussa, A. Baghdadi, M. Jezequel, Flexible architectures for LDPC decoders based on network on chip paradigm, in: Digital System Design, Architectures, Methods and Tools, 2009. DSD '09. 12th Euromicro Conference on, 2009, pp. 582–589. doi:10.1109/DSD.2009.235.

[131] C. Condo, G. Masera, A flexible NoC-based LDPC code decoder implementation and bandwidth reduction methods, in: Design and Architectures for Signal and Image Processing (DASIP), 2011 Conference on, 2011, pp. 1–8. doi:10.1109/DASIP.2011.6136889.

[132] W.-H. Hu, J. H. Bahn, N. Bagherzadeh, Parallel LDPC decoding on a network-on-chip based multiprocessor platform, in: Computer Architecture and High Performance Computing, 2009. SBAC-PAD '09. 21st International Symposium on, 2009, pp. 35–40. doi:10.1109/SBAC-PAD.2009.9.

[133] G. Gentile, M. Rovini, L. Fanucci, Low-complexity architectures of a decoder for IEEE 802.16e LDPC codes, in: Digital System Design Architectures, Methods and Tools, 2007. DSD 2007. 10th Euromicro Conference on, 2007, pp. 369–375. doi:10.1109/DSD.2007.4341494.

[134] X. Hu, Z. Li, B. Vijaya Kumar, R. Barndt, Error floor estimation of long LDPC codes on magnetic recording channels, Magnetics, IEEE Transactions on 46 (6) (2010) 1836–1839. doi:10.1109/TMAG.2010.2040026.

[135] Y. Cai, S. Jeon, K. Mai, B. Kumar, Highly parallel FPGA emulation for LDPC error floor characterization in perpendicular magnetic recording channel, Magnetics, IEEE Transactions on 45 (10) (2009) 3761–3764. doi:10.1109/TMAG.2009.2022318.

[136] F. Kienle, N. Wehn, H. Meyr, On complexity, energy- and implementation-efficiency of channel decoders, Communications, IEEE Transactions on 59 (12) (2011) 3301–3310. doi:10.1109/TCOMM.2011.092011.100157.
