LDPC Decoding Methods

DESIGN OF OPTIMAL LAYER DECODING FOR LDPC CODES

AJITH.C
M.E-VLSI DESIGN
SAKTHI MARIAMMAN ENGINEERING COLLEGE
CHENNAI-602 105
E-mail ID: [email protected]

Mr. P. KABILAMANI
ASSISTANT PROFESSOR, Dept. ECE
SAKTHI MARIAMMAN ENGINEERING COLLEGE
CHENNAI-602 105
E-mail ID: [email protected]

ABSTRACT

In this paper, we present a method for creating LDPC codes which are specifically designed to be hardware friendly. This method involves constraining the cyclic shift values in the base H-matrix to reduce the complexity of the cyclic shift hardware. We show that the decoder hardware implementation for these codes has higher throughput and lower power consumption than decoders designed for traditional LDPC codes. We provide results showing that these codes maintain the error rate performance expected of LDPC codes while achieving these throughput and power consumption improvements.

Keywords - low density parity check decoder; encoder; layered; LDPC; permutation network; shuffle; cyclic shift; constrained code design; low power hardware; high data rate; gigabit; communications; WiGig; flooding

I. INTRODUCTION

Low Density Parity Check (LDPC) codes [1] have received a great deal of attention in recent years. This is due to their ability to achieve performance close to the Shannon limit [2], their suitability for implementation in hardware [3][4], and their support of high data rates. However, data rates in modern communication systems are ever-increasing, and the rise of battery-powered mobile communication and computing devices necessitates low-power hardware design wherever possible. As a result, there is continuing work in the area of improving the power efficiency and throughput of LDPC encoding and decoding hardware.

Traditionally, LDPC code design has focused on

error rate performance and parallelization, while

decoder hardware design has focused on throughput

and power consumption. Often the approach used is

code-first design, where the hardware is designed

after a good-performing code has been created. In

this paper, our approach is a hardware-first design

approach focused on improving the cyclic shift

function (alternatively referred to as a shuffle

network or permutation network) used to vary the

connections between the variable nodes (VN) and

check nodes (CN) in the decoder.

This paper presents a method for designing LDPC

codes which are inherently hardware-friendly.

Initially, we discuss some background information on

LDPC codes and the hardware required to perform

cyclic permutations of the VN data in a layered

decoder. We then present our approach to code

design which minimizes the number of cyclic

permutations that must be supported by the hardware.

We show the improved hardware implementation for these codes and discuss the improved performance characteristics of this hardware. Finally, we demonstrate that the error-rate performance of these codes is comparable to that of codes designed by existing methods.

II. LDPC CODES AND LAYERED DECODER DESIGN

The structure of a specific LDPC code is

determined entirely by its H-matrix. Because of the

size and routing complexity involved in

implementing a fully-parallel decoder (also known as

a flooding decoder), most practical LDPC codes are

structured to support partially parallel decoders in

hardware (also known as layered-schedule decoders).

To create a code with good performance which

supports parallel decoding in hardware, one

commonly used approach is protograph-based code

design. Using this approach, a structured code is

created by expanding a relatively small Tanner graph

called a protograph [5]. The resultant code has an H-

matrix, which can be described by a set of

permutation matrices. In one form of protograph-

based code design, an expansion factor Z is chosen.

Each permutation matrix is of size Z×Z, and is

created by performing a right cyclic shift on the Z×Z

identity matrix. Fig. 1 shows an illustrative example

of the base H-matrix of a protograph-based LDPC

code designed using Z-factor expansion. Note that we

use the term base H-matrix to refer to the compact

form of the H-matrix, where each cell represents a

permutation matrix. In this example, the positive

number x refers to the identity matrix shifted x times

to the right (e.g., the number ‘5’ refers to the identity

matrix cyclically shifted five times to the right), and

the value ‘-1’ refers to the Z×Z all-zeros matrix. The

shift value determines the connection of the VNs to

the CNs for the particular row in the base H-matrix.

The amount of the cyclic shift (i.e., the value x)

ranges from 0 to Z-1 and generally varies across the

rows and columns of the base H-matrix. Careful

selection of the shift values is necessary to design a

code with good error-rate performance.
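The Z-factor expansion described above can be sketched in a few lines of Python. This is only an illustration: the function names and the small base matrix are hypothetical (the matrix is not the one in Fig. 1), and the shift convention, in which row i of a matrix shifted by x has its single one in column (i + x) mod Z, is an assumption.

```python
def shifted_identity(z, x):
    """Return the z-by-z identity matrix cyclically shifted x columns to the right.

    x == -1 denotes the all-zeros matrix, as in the base H-matrix notation.
    Assumed convention: row i has its one in column (i + x) mod z.
    """
    if x == -1:
        return [[0] * z for _ in range(z)]
    return [[1 if col == (row + x) % z else 0 for col in range(z)]
            for row in range(z)]


def expand_base_matrix(base, z):
    """Expand a base H-matrix (rows of shift values) into the full binary H-matrix."""
    full = []
    for base_row in base:
        blocks = [shifted_identity(z, x) for x in base_row]
        for r in range(z):
            full_row = []
            for block in blocks:
                full_row.extend(block[r])
            full.append(full_row)
    return full


# Hypothetical 2x3 base matrix with Z = 4 (illustrative values only).
example_base = [[0, 1, -1],
                [2, -1, 3]]
example_h = expand_base_matrix(example_base, 4)
```

Here `example_h` has 8 rows and 12 columns, and each row contains one 1 for every entry of its base row that is not -1.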

For any row of the base H-matrix, each VN will

be connected to at most one CN. This is easily

explained because the row weight and column weight

of the identity matrix (or, a cyclically shifted identity

matrix) are both one. This condition is important to

eliminate contention issues on the reading of VN

information and the writing of CN messages back to

the VN memory. This concept is usually generalized

by dividing the H-matrix into layers. Each layer

consists of a set of contention-free CNs (i.e., only one

CN can access a given VN memory at a specific

time), which are processed in parallel without

contention. Note that one can treat each row of the

base H-matrix as a layer. To increase decoding

throughput it is possible to process more than one

row of the base H-matrix in the same layer. In this

case, the protograph of the LDPC code needs to be

carefully designed to allow parallel processing of

multiple base H-matrix rows in a single layer.
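The condition for placing several base H-matrix rows in one layer can be checked mechanically. The sketch below assumes that two base rows may share a layer only if no base column is used by both (so that no VN memory is accessed by two CNs at once). The greedy grouping and all names are hypothetical and only test an existing matrix; they are not a code design procedure.

```python
def rows_can_share_layer(row_a, row_b):
    """True if no base column has a non-zero block (shift != -1) in both rows."""
    return all(a == -1 or b == -1 for a, b in zip(row_a, row_b))


def group_rows_into_layers(base):
    """Greedily place each base row into the first layer it does not conflict with."""
    layers = []
    for index, row in enumerate(base):
        for layer in layers:
            if all(rows_can_share_layer(row, base[other]) for other in layer):
                layer.append(index)
                break
        else:
            # No existing layer is contention-free for this row: open a new one.
            layers.append([index])
    return layers


# Hypothetical 4x4 base matrix (illustrative shift values only).
example_base = [[0, -1, 3, -1],
                [-1, 2, -1, 5],
                [1, 4, -1, -1],
                [-1, -1, 6, 2]]
example_layers = group_rows_into_layers(example_base)
```

For this example the rows pair up into two layers, rows 0 and 1 in the first and rows 2 and 3 in the second, halving the number of layers per iteration.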

Figure 1. Base H-matrix of a protograph-based LDPC code designed using Z-factor expansion (Z=7)

A protograph-based LDPC code is well-suited to

layered decoder design. To simplify description,

unless otherwise stated we consider that each layer

consists of one base H-matrix row. We also refer to

the hardware which processes all CNs in one row of

the base H-matrix as the CN processor [6, Ch. 5]. In

a layered decoder, one CN processor can be designed

to serially process the different rows of the base H-

matrix. That is, the decoder operates on one layer at a

time, and processes the layers in sequence until a full

iteration through all parity checks (i.e. the entire H-

matrix) is completed. However, in order to use the

same CN processor for each layer, the interconnect

between the VNs and the CNs changes with each

layer processed according to the cyclic shifts

represented by the values in the base H-matrix. The

base H-matrix has c columns, and each column

corresponds to Z elements of the VN data. For each

column, the hardware must be able to perform a

number of cyclic shifts equal to the column weight of

that column. Moreover, if the decoder supports

multiple code rates (multiple base H-matrices), the

CN processing machine needs to be designed to

process any row in the set of base H-matrices.

Consequently, the number of cyclic-shift possibilities

which must be supported increases accordingly. The

permutations of the protograph are usually designed

to result in codes of good error-rate performance

without paying attention to the number of cyclic-shift

possibilities in each column across the set of base H-

matrices. However, this can significantly increase the

complexity of the CN processing hardware as will be

shown later in this paper.
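To make the hardware cost concrete, one can count how many different cyclic shifts each column must support across a set of base H-matrices. The following sketch assumes that all base matrices share the same number of columns and that identical shift values in a column can reuse the same hardware path; the function name and the matrices are hypothetical.

```python
def shifts_per_column(base_matrices):
    """Collect, for each base column, the distinct shift values used across all matrices."""
    num_cols = len(base_matrices[0][0])
    used = [set() for _ in range(num_cols)]
    for base in base_matrices:
        for row in base:
            for col, x in enumerate(row):
                if x != -1:  # -1 marks the all-zeros block, which needs no shift
                    used[col].add(x)
    return used


# Two hypothetical base matrices, e.g. for two code rates (illustrative values only).
rate_a = [[0, 1, -1],
          [2, -1, 3]]
rate_b = [[0, 4, 3],
          [2, 6, -1]]
required = shifts_per_column([rate_a, rate_b])
counts = [len(shifts) for shifts in required]
```

In this example the first column needs two shift values, the second needs three, and the third needs only one, because both matrices reuse the same value there. Constraining the shift values during code design aims at keeping these counts small.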

There are several hardware approaches to

handling the cyclic shifts in the decoder. Fig. 2 shows

one common approach. This approach uses an

absolute right-cyclic-shift when data is prepared for

the CN processor and a complementary absolute left-

cyclic-shift when messages from the CN processor

are returned to the VNs.

Figure 2. Layered LDPC decoder using absolute cyclic shifts

An improved approach [7] is to replace the

dual right-shift/left-shift hardware with a single

relative shift. This approach is shown in Fig. 3. The

single shift performed in this method is a right

relative shift equal to the difference between the

current layer's absolute shift value and the previous

layer's absolute right shift value. This approach will

generally require less hardware (silicon) and have

less overall path delay and power consumption than

implementing two absolute shift functions.
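The relationship between absolute and relative shifts can be verified with a short sketch. It assumes that the VN data for one column starts unshifted and stays in its shifted position between layers; the per-layer shift values and the function names are hypothetical.

```python
def rotate_right(values, x):
    """Cyclically shift a list x positions to the right."""
    x %= len(values)
    if x == 0:
        return list(values)
    return values[-x:] + values[:-x]


def relative_shifts(absolute_shifts, z):
    """Relative right shift per layer: current absolute shift minus the previous one, mod z."""
    previous = 0  # assumption: the data is unshifted before the first layer
    relative = []
    for current in absolute_shifts:
        relative.append((current - previous) % z)
        previous = current
    return relative


Z = 7
vn_data = list(range(Z))
absolute = [5, 2, 6]  # hypothetical absolute shifts of one column over three layers
relative = relative_shifts(absolute, Z)

# Apply only the relative shift at each layer; the data is never shifted back.
state = vn_data
states = []
for shift in relative:
    state = rotate_right(state, shift)
    states.append(state)
```

After each layer, `state` equals the original data rotated by that layer's absolute shift, so a single shifter replaces the right-shift/left-shift pair.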

Figure 3. Layered LDPC decoder using relative cyclic shift

In the hardware implementation of an

LDPC decoder, the cyclic-shift functions are

typically constructed with multiplexers created from

CMOS logic or pass-transistor logic. The logic for

these multiplexers adds to the processing delay for

each layer of the code. This delay limits the operating

frequency of the CN processors, which in turn

reduces the throughput of the decoder. Moreover, the

switching of the multiplexer increases the power

consumption of the decoder.

The approaches in Figs. 2 and 3 are effective

decoder implementations for existing LDPC codes.

However, if we take a step back and consider the

hardware together with the code design, we can

design codes with higher throughput and lower power

consumption. By paying attention to the cyclic-shift

possibilities in each column across all base H-

matrices, we can intelligently select them to optimize

both hardware implementation and performance.

III. VECTOR DECODER ARCHITECTURE FOR QC-LDPC CODES

The vector decoder architecture overcomes the limitation of the scalar decoder (described in the previous section) by packing multiple messages into the same memory word. As noted before, this is possible because block RAMs can be configured into different aspect ratios. For the normalized min-sum algorithm (NMSA), the intrinsic and extrinsic messages are usually 6–8 bits wide, so up to six 6-bit messages can be packed into one memory word in the 512x36 block RAM

configuration. We define the number of messages

packed into one memory word as . The vector

decoder for a -regular code requires CNUs, VNUs,

and block RAMs for intrinsic and extrinsic

memories. Fig. 4.1 shows a vector decoder for the

code when . Potentially, the throughput of a vector

decoder can be times that of a scalar decoder, given

that there are times more functional units operating

simultaneously. However, without a proper data packing scheme, memory access conflicts will occur because multiple messages are accessed per cycle. In addition, efficient message alignment units are required to keep the additional logic small. The techniques to overcome these challenges are described next.
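The packing of several messages into one memory word can be illustrated as follows. The sketch models a 36-bit word holding six 6-bit messages, as in the 512x36 configuration mentioned above. Treating each message as an unsigned bit pattern and placing the first message in the least significant bits are assumptions, and the function names are hypothetical.

```python
WORD_BITS = 36
MESSAGE_BITS = 6
MESSAGES_PER_WORD = WORD_BITS // MESSAGE_BITS


def pack_messages(messages, bits=MESSAGE_BITS):
    """Pack unsigned message values into one integer word, first message in the low bits."""
    word = 0
    for i, message in enumerate(messages):
        if not 0 <= message < (1 << bits):
            raise ValueError("message does not fit in the given width")
        word |= message << (i * bits)
    return word


def unpack_messages(word, count, bits=MESSAGE_BITS):
    """Recover `count` messages of `bits` bits each from a packed word."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]


example_messages = [1, 63, 0, 17, 42, 5]
example_word = pack_messages(example_messages)
recovered = unpack_messages(example_word, MESSAGES_PER_WORD)
```

One read of `example_word` delivers all six messages at once, which is what allows several functional units to be fed per cycle.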

In a vector decoder, each block RAM location

holds multiple messages. Memory conflicts could

arise if the CNU and VNU try to access the same

location simultaneously. The key challenge with vector decoding is to reduce the potential for such conflicts and to ensure that the overhead of resolving these conflicts through alignment units does not increase the complexity of the decoder or limit scalability and clock frequency. We address this challenge with a combination of three techniques. First, we use double buffering,

i.e., the messages are replicated for CNU and VNU

access, so that they are stored in different ways to

match the access pattern of the CNU and VNU

processing. Though it doubles the amount of memory

needed for the storage, it does not increase the

number of block RAMs necessary, because we use

the same block RAM to store both CNU and VNU

memory. This works because typically the CPM sizes

are much smaller than the depth of the block RAMs

in an FPGA. Second, we develop a new packing

strategy that not only uses the block RAMs

efficiently but also reduces the potential for conflicts.

Third, we propose a sequential alignment unit and its

implementation to demonstrate that the alignment

task can be achieved with relatively low complexity,

which makes the scheme scalable.

Figure 4. Vector Decoder Architecture

Message Packing and Alignment

The CNU and VNU memories can be modeled as two-dimensional arrays. The nonzero entries in the CPM are stored in the block RAM. Each message appears in two different locations because of the double buffering. Each block RAM is partitioned logically into a VNU memory and a CNU memory, and the messages are stored in different orders to facilitate conflict-free access by the CN and VN processing units. Our scheme has two advantages. First, it does not require any read alignment units. Second, it works for any value of c and v, which is essential to support overlapped message passing.

Proposed Packing Scheme

The CNU and VNU memory can be

modeled as two dimensional arrays, denoted by and .

The variable-to-check messages are stored in the

CNU memory by the CNU access order, i.e., the

message is packed as . The check-to-variable

messages are packed in the VNU memory by the

VNU access order, i.e., the message is packed as . In

general, message is packed to the location in the

CNU memory and to in the VNU memory. Each

block RAM is partitioned logically into a VNU

memory and CNU memory and the messages are

stored in different order to facilitate conflict free

access by CN and VN processing units. We compare

our method with the scheme presented in Wang [12],

which is shown in Fig. There are two advantages of

our scheme over the scheme proposed in [12]. First,

our scheme does not require any read alignment units

(due to double buffering), whereas the packing

shown in Fig. 4(c) requires two read alignment units

in addition to two write alignment units (which we

also require). Second, our method works for any

value of and which is essential to support overlapped

message passing, whereas the method in [12] works

well for nonoverlapped message passing, i.e., and are

implicitly assumed to be 0. When and are nonzero as

in this example, the packing scheme in [12] becomes

inefficient. For example, when updating the messages in the third word of the VNU memory, one can see that these messages are spread across three different memory words, which would entail three reads and very complex alignment circuitry, with a concomitant increase in latency.

Sequential Write Alignment Unit:

Because a given message is mapped to

different locations in the CNU and VNU memory,

except when , the messages need to be aligned before

they are written to the memory. Given the double

buffered efficient packing scheme described above,

the alignment task is greatly simplified. We need just

the current word and the previous word to reconstruct

the input order for the CN and VN processing units.

The messages are assumed to be quantized to bits.

The hardware requirements are relatively modest. Let

for VNU alignment and for CNU alignment. For

vector length and CPM size , 1) when , or when , ,

no alignment unit circuitry is needed; 2) when of the

partially parallel decoder. Vector processing can be

combined with overlapped message passing to further

improve the throughput.
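The idea that only the current word and the previous word are needed for write alignment can be sketched as below. This is a simplified model under stated assumptions, not the alignment unit itself: words are lists of equal length, the destination word boundary is offset from the source boundary by a fixed number of message positions, and positions with no data yet are marked with `None`. The function name and the `offset` parameter are hypothetical.

```python
def align_words(words, offset):
    """Re-align a stream of equal-length words by `offset` message positions.

    Each output word is built from the tail of the previous input word
    followed by the head of the current input word.
    """
    length = len(words[0])
    if not 0 <= offset < length:
        raise ValueError("offset must be in the range [0, word length)")
    previous = [None] * length  # nothing has been received before the first word
    aligned = []
    for current in words:
        aligned.append(previous[length - offset:] + current[:length - offset])
        previous = current  # one word of storage is all the unit has to keep
    return aligned


stream = [[0, 1, 2, 3], [4, 5, 6, 7]]
shifted = align_words(stream, 1)
unchanged = align_words(stream, 0)
```

With an offset of zero the words pass through unchanged, which corresponds to the case where no alignment circuitry is needed. In this simplified model the last `offset` messages of the final word remain in `previous` and would need one more output step to be flushed.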

IV. RESULTS AND DISCUSSION

The proposed design has been implemented in Verilog HDL and simulated with Xilinx ISE 9.1i. The first step in the vectored layer decoder implementation is the detection of two errors, followed by error correction using quasi-cyclic error correction. The corrected output is then decoded to the original message.

Figure 5. Error Correction Using Layered Decoding

Figure 6. Error Correction Using Vectored Layer Decoding

V. CONCLUSION

In this paper, we explore a compact software implementation of layered and vectored layer decoder design. From the performance analysis of the proposed design, it is evident that the error-rate performance of the layered decoder can be improved by the proposed design. The Verilog HDL code for the layered and vectored layer decoding algorithms is developed block-wise. Optimized and synthesizable Verilog HDL code for each block is synthesized using Xilinx ISE 9.1i, and the Verilog HDL implementation has shown that the language provides a useful tool for practicing the algorithms without drawing a large number of logic gates. The design was verified to be functionally correct.

REFERENCES

[1] R. G. Gallager, "Low-density parity-check codes," Cambridge, MA: MIT Press, 1963.

[2] D. J. C. MacKay and R. M. Neal, "Near Shannon limit performance of low density parity check codes," Electronics Letters, vol. 32, pp. 1645-1646, Aug. 1996.

[3] E. Boutillon, J. Castura, and F. R. Kschischang, "Decoder-first code design," Proceedings of the 2nd Intl Symposium on Turbo Codes and Related Topics, pp. 459-462, Sep. 2000.

[4] T. Zhang and K. K. Parhi, "VLSI implementation-oriented (3,k)-regular low-density parity-check codes," IEEE Workshop on Signal Processing Systems, pp. 25-36, Sep. 2001.

[5] J. Thorpe, "Low-density parity-check (LDPC) codes constructed from protographs," IPN Progress Report 42-154, Apr.-Jun. 2003.

[6] W. E. Ryan and S. Lin, Channel Codes: Classical and Modern, Cambridge University Press, 2009.

[7] A. Blanksby, B. Shen, and J. Trachewsky, "LDPC code set for mmwave communication," Proceedings of the 2010 ACM international workshop on mmWave communications, pp. 39-43, Sep. 2010.

[8] G. Liva and M. Chiani, "Protograph LDPC Codes Design Based on EXIT Analysis," IEEE Global Telecommunication Conference, GLOBECOM, pp. 3250-3254, Nov. 2007.

[9] X.-Y. Hu, E. Eleftheriou, and D. M. Arnold, "Regular and irregular progressive edge growth Tanner graphs," IEEE Trans. Inform. Theory, vol. 51, no. 1, pp. 386-398, Jan. 2005.

[10] S. Abu-Surra, E. Pisek, and T. Henige, "Gigabit rate achieving low-power LDPC codes: Design and architecture," IEEE Wireless Communications and Networking Conference (WCNC), pp. 1994-1999, Mar. 2011.

[11] "IEEE 802.11ad. PHY/MAC Complete Proposal Specification (approved as IEEE 802.11ad D0.1)," http://tinyurl.com/2fqlkxx, May 2010.
