Technical Report on
FASTER THAN NYQUIST DECODER
HARDWARE DESIGN
University of British Columbia
By
Mohamed Matar
Abstract
Faster than Nyquist (FTN) signaling is a technique for sending data pulses faster than the
Nyquist limit, which introduces Inter-Symbol Interference (ISI) when the signal is received.
ISI can be mitigated by a BCJR decoder [9], a convolutional decoder that treats the ISI as a
form of channel encoding, as shown by the authors in [1]. Such a decoder is computationally
expensive and requires special hardware to perform the decoding algorithm quickly and
efficiently. Moreover, the BCJR decoder data flow is sequential and contains many data
dependencies, which makes it harder to design a fast hardware implementation.
In this report, we look at the constraints and realization parameters considered during
hardware design, such as bit representation, function approximation, and data-dependency
resolution. We explore the trade-offs that shape the design space of the BCJR decoder as
well as the entire receiver. We report system design solutions, implement a
high-performance solution optimized for latency, and verify it on an FPGA.
Introduction to FTN Decoder
Ideally, in a communication system, the receiver includes an LDPC decoder that accounts for channel decoding, as seen in figure 1. In an FTN system, however, there is an extra block called the BCJR decoder; this block is responsible for equalizing the effect of the Inter-Symbol Interference that occurs as a result of Faster-than-Nyquist sampling, as seen in figure 2. In the next section, we discuss the need for FTN and how its decoder works.
WHAT IS FTN?
FTN is a way to compress signal bandwidth by allowing overlap between symbols during
transmission. The overlapping symbols interfere with each other when their values are
sampled at the receiver, so ISI occurs.
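To make the overlap concrete, the sketch below (an illustrative model, not the report's actual pulse shaping) samples an ideal sinc Nyquist pulse at a compressed spacing of tau*T and shows that neighboring symbols contribute non-zero interference at the sample points:

```python
import math

def sinc(x):
    # Normalized sinc sin(pi*x)/(pi*x): the ideal Nyquist pulse shape,
    # exactly zero at every nonzero integer.
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def isi_taps(tau, num_taps=3):
    """Interference a neighbor k symbol periods away contributes when
    sampling at t = k * tau * T instead of t = k * T (tau < 1 is FTN)."""
    return [sinc(k * tau) for k in range(num_taps + 1)]

# Nyquist-rate sampling (tau = 1): neighbors contribute essentially zero.
print(isi_taps(1.0))

# Faster than Nyquist (tau = 0.8): neighbors leak into every sample -> ISI.
print(isi_taps(0.8))
```

With tau = 1 the off-center taps vanish (no ISI); with tau = 0.8 they are clearly nonzero, which is exactly the interference the BCJR decoder must equalize.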
The FTN decoder [1] is an iterative feedback-loop decoder: a BCJR decoder followed by an
LDPC decoder that feeds back into the BCJR decoder in an iterative loop (as shown in
figure 2). The loop runs for a variable number of iterations and terminates either when
the LDPC parity-check matrix is satisfied or when a maximum number of iterations is
reached, in which case the bits of the last iteration are streamed out.
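The loop control described above can be sketched as follows; `bcjr`, `ldpc` and `parity_ok` are illustrative placeholder callables, not the report's actual interfaces:

```python
def ftn_decode(llr_in, bcjr, ldpc, parity_ok, max_iters=10):
    """Sketch of the BCJR <-> LDPC feedback loop.

    bcjr(llr, prior)  -> extrinsic information after ISI equalization
    ldpc(extrinsic)   -> (hard bits, feedback to the BCJR stage)
    parity_ok(bits)   -> True when the parity-check matrix is satisfied
    All three are stand-ins for the real blocks."""
    prior = [0.0] * len(llr_in)          # no prior on the first pass
    bits = None
    for _ in range(max_iters):
        extrinsic = bcjr(llr_in, prior)  # equalize ISI
        bits, prior = ldpc(extrinsic)    # channel decode + feedback
        if parity_ok(bits):              # early termination on success
            break
    return bits                          # last iteration's bits otherwise
```

Note the hard sequential dependency: each iteration consumes the previous iteration's feedback, which is why the report focuses on per-iteration latency rather than cross-iteration parallelism.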
As seen in figure 3, the dataflow of the algorithm is sequential, since it is a feedback-loop
algorithm. This introduces a data dependency between each iteration's output and the next
iteration's input, which limits our ability to parallelize these computations.
In other words, the best implementation we can aim for is one that minimizes the latency of
each iteration.
To do so, each block in an iteration needs to be optimized for latency. In this
study we focus on optimizing the BCJR block, since it is the system bottleneck.
Figure 1
Figure 2
BCJR BLOCK
In order to optimize the BCJR block, we need to understand how it works.
Figure 3 shows the steps needed to compute the output of the BCJR block (the LLR). It also
shows the dependencies between data elements, which fall into two categories:
Intra-dependency:
In block ❶ potential parallelism is available in the 3 dimensions of (W, D, C).
In block ❷ potential parallelism is available only in the 2 dimensions of (D, C).
In block ❸ potential parallelism is available in the 3 dimensions of (W, D, C).
These dependencies limit the overall parallelism to that of block ❷
(Amdahl's Law): the whole system is limited by the part of the algorithm that
cannot be parallelized.
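The Amdahl's Law bound mentioned above can be computed directly; the fraction and unit count below are illustrative numbers, not measurements from this design:

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Amdahl's Law: overall speedup when only `parallel_fraction` of the
    work can be spread across `n_units` parallel resources."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)

# Even if 90% of the work parallelizes perfectly, 64 units yield under 10x,
# because the serial 10% (here, the W-sequential part) dominates.
print(amdahl_speedup(0.9, 64))
```

This is why block ❷, the least parallelizable block, caps the speedup achievable by parallelizing blocks ❶ and ❸.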
Inter-dependency:
These are dependencies across blocks. As seen in figure 3, block ❷ depends on
block ❶'s output, and block ❸ depends on the outputs of blocks ❶ and ❷.
This means that the LLR cannot be computed until blocks ❶ and ❷ have been computed
first. Furthermore, each LLR in the W dimension requires the values 𝛽, γ and 𝛼 at the
same index, which is a problem because 𝛼 and 𝛽 are computed sequentially in opposite
directions along the W dimension.
In other words, to compute the first output element LLR[0], we need 𝛽[0], γ[0] and 𝛼[0].
This is fine for γ and 𝛼; however, 𝛽[0] is the last element computed in block ❷ due to
the reverse, sequential nature of its computation. Such constraints make it hard to
directly optimize the BCJR block for latency.
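The opposing sweep directions can be sketched with a toy max-log recursion over a generic 2-state trellis; this is a simplified illustration, and the actual FTN trellis structure and branch metrics differ:

```python
def forward_backward(gamma):
    """Toy max-log recursions over a 2-state trellis.

    gamma[k][s][s2] is the branch metric at step k from state s to s2.
    alpha sweeps forward in k; beta sweeps backward -- so beta[0] is the
    LAST value produced, which is why LLR[0] has to wait."""
    n, S = len(gamma), 2
    alpha = [[0.0] * S for _ in range(n + 1)]
    beta = [[0.0] * S for _ in range(n + 1)]
    for k in range(n):                      # forward sweep for alpha
        for s2 in range(S):
            alpha[k + 1][s2] = max(alpha[k][s] + gamma[k][s][s2] for s in range(S))
    for k in range(n - 1, -1, -1):          # reverse sweep for beta
        for s in range(S):
            beta[k][s] = max(beta[k + 1][s2] + gamma[k][s][s2] for s2 in range(S))
    return alpha, beta
```

Each alpha step needs the previous alpha and each beta step needs the next beta, so neither sweep can be parallelized along k: exactly the sequential W-dimension dependency discussed above.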
In the next sections, we look at two main techniques proposed to minimize the latency of
this block, and we also study the parameters that affect hardware realization of the
whole system.
System Exploration
In this section, we look at the parameters that control latency in the design of each block
of the FTN decoder system. We first look at LDPC parameters, such as the number of decoding
iterations; then we look at BCJR parameters, such as data storage requirements, BCJR block
size, and function approximations.
LDPC BLOCK
In this study we do not focus on the design of the LDPC block; instead we use an LDPC IP
core [3] provided by the FPGA vendor Xilinx. Because it is a black box, we can only control
the parameters that contribute to its latency, such as the number of decoding iterations.
Figure 3
We simulate different values for the number of iterations and report the minimum
number of iterations needed to maintain the same system performance.
This evaluation is done by modeling the whole system in MATLAB while varying the
number of LDPC decoding iterations.
Figure 4 shows a bit-error-rate (BER) comparison across different signal-to-noise ratios
(Eb/No) between two configurations: the first uses the maximum number of LDPC
iterations, and the second uses the minimum number of iterations that reports almost the
same accuracy.
The figure shows that 8 iterations almost preserve the accuracy achieved with
50 iterations, which allows a performance speedup of 6.25x.
Figure 4
BCJR BLOCK
1. Fixed point representation
To realize arithmetic operations in hardware, values can be represented as
integers, floating point, or fixed point.
The computations in this application require fractional precision, so the choice is
between fixed point and floating point. In a nutshell, floating point versus fixed point
is an accuracy-versus-efficiency trade-off; we can choose fixed point to represent our
values provided we know their dynamic range and the number of integer and fraction bits
needed to represent them.
A good way to figure out the appropriate number of bits for the data structures is to
histogram their full-precision values. In the following figures we show histograms for
the main data structures used to produce the LLR output: 𝛼, 𝛽 and γ.
Figures 5, 6 and 7 show histograms of the dynamic range of 𝛼, 𝛽 and γ across a whole
code block. As seen, the range of values can be represented with 7 signed bits for the
integer part. We also simulated the number of fraction bits needed, and 2 fraction bits
report the best results. In figure 8, we report the best results achieved using 9 bits
(7 integer bits, 2 fraction bits).
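The chosen format can be modeled in software as follows. This is a sketch; we assume the sign bit is counted inside the 7 integer bits, which the report does not state explicitly:

```python
def to_fixed(x, int_bits=7, frac_bits=2):
    """Quantize x to a signed fixed-point value with `int_bits` integer
    bits (sign included -- an assumption about the report's convention)
    and `frac_bits` fraction bits, with round-to-nearest and saturation."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits - 1)) * scale        # most negative code
    hi = (1 << (int_bits - 1)) * scale - 1     # most positive code
    q = max(lo, min(hi, round(x * scale)))     # round, then saturate
    return q / scale

print(to_fixed(3.14))     # snapped to the nearest multiple of 0.25
print(to_fixed(1000.0))   # saturates at the format's maximum
```

A model like this, applied to the MATLAB reference values, is how the histogram-driven bit-width choice can be validated before committing it to hardware.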
Figure 5: Histogram of Alpha values (range of values vs number of occurrences)
Figure 6: Histogram of Beta values (range of values vs number of occurrences)
Figure 7: Histogram of Gamma values (range of values vs number of occurrences)
Figure 8
2. Sliding Window
As stated before, the first LLR to be produced must wait for blocks ❶ and ❷ to finish.
We also know that these two blocks operate sequentially in the W dimension, which
means the first LLR is available after (W x latency of block ❶) + (W x latency of
block ❷).
If W is large, the overall latency is large; if we can reduce W, the latency is
minimized by the smallest W that can be used.
In principle, W corresponds to the block window, which is defined by the LDPC code block
size. Code blocks can be ~10K, 32K or even 64K elements, where each element corresponds
to a received symbol.
In practice, these elements are dependent only to some degree: the dependency corresponds
to symbol interference within the same frame, which is unlikely to span the whole code
block. The authors in [2] proposed the sliding-window technique, which divides the whole
code block into smaller blocks that are assumed to be independent.
These small blocks can act independently provided they share an overlap that is used to
calculate an initial dependency estimate for each small block; the authors propose an
overlap size of 6 times the number of memory elements of the convolutional (BCJR)
encoder. In the FTN application, the convolution memory differs from that of a typical
convolutional encoder: it corresponds to the effective overlap between symbols after
sampling with interference.
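The window/overlap bookkeeping can be sketched as follows; this is index arithmetic only, and the block length, w and o values below are illustrative:

```python
def sliding_windows(block_len, w, o):
    """Partition a code block into windows of size w, each prefixed with
    an o-symbol overlap from the previous window that warms up the
    recursions. Returns (warmup_start, start, end) index triples."""
    windows = []
    start = 0
    while start < block_len:
        warmup = max(0, start - o)   # overlap region from the previous frame
        windows.append((warmup, start, min(start + w, block_len)))
        start += w
    return windows

# e.g. a 512-symbol block with w=128, o=16 -> 4 nearly independent windows
print(sliding_windows(512, 128, 16))
```

Each window's recursions run over its overlap region first to build an initial metric estimate, after which the windows can be processed independently.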
As shown in figure 9, the code block is divided into smaller frames, each of which can be
assumed independent if a certain overlap overhead is carried over from the previous frame.
We simulate different values of w and o to choose a proper trade-off between accuracy and
performance.
Figure 9
Figure 10 compares the results of figure 8 with two extra configurations of
different w and o values: the yellow line divides the code block into frames of w=128
with an overlap of o=16, while the green line uses w=64 and o=8.
With a target overall accuracy drop of less than 0.2 dB, the yellow-line configuration
better meets the accuracy target.
Figure 10
3. Max-function approximation
In figure 3, we can see a max operation performed in blocks ❷ and ❸. Ideally these
operations are log-sum-of-exponential operations, which are expensive to compute in
hardware. In practice, this operation can be performed using a max function, which is
easy to implement in hardware, plus a correction term that adjusts the max toward the
exact value. The correction term is usually implemented in hardware as a look-up table;
the bigger the table, the closer the output is to the original value. However, a big
table is costly in hardware, so in the next experiment we simulate the system accuracy
with and without the correction term to evaluate its effect.
As shown in figure 11, the effect of the correction term on the overall accuracy of the
system is negligible; hence we choose the max-only design.
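The operation being approximated is the standard Jacobian logarithm; the sketch below compares the exact form with the max-only version evaluated here:

```python
import math

def max_star(a, b):
    """Exact Jacobian logarithm: log(exp(a) + exp(b))
    = max(a, b) + a correction term that depends only on |a - b|."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_only(a, b):
    """Hardware-friendly approximation: drop the correction term."""
    return max(a, b)

# The correction term peaks at log(2) ~= 0.693 when a == b and
# vanishes quickly as the operands separate.
print(max_star(2.0, 2.0) - max_only(2.0, 2.0))
print(max_star(5.0, 0.0) - max_only(5.0, 0.0))
```

Since the correction term is bounded by log(2) and shrinks rapidly, dropping it (or coarsely tabulating it) costs little accuracy, which matches the figure 11 result.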
Figure 11
Architecture Evaluation
OVERLAPPED MAP DECODER
Due to the reversed orders of 𝛽 and 𝛼, the LLR can first be computed at the end of
block ❷, which in turn starts only at the end of block ❶, since 𝛽[w] needs γ[n], the
last element computed in block ❶.
Notice, however, that block ❶ has potential independence over all 3 dimensions, which
means the γ values can be computed out of order; this way 𝛽 and 𝛼 can start computing
block ❷ overlapped with block ❶, as seen in figure 12. A similar architecture was
proposed as a systolic structure by the authors of [8] for simpler Turbo decoders.
This way the first LLR can be produced after computing 𝛽[w/2] and 𝛼[w/2], as seen in
figure 12, which reduces the latency to (W/2 x latency of block ❷) instead of
(W x latency of block ❶) + (W x latency of block ❷).
The proposed technique achieves a performance speedup of more than 2x.
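The latency expressions above can be checked numerically; the unit per-element latencies below are illustrative, not the design's measured values:

```python
def first_llr_latency(w, lat1, lat2, overlapped):
    """Cycles until the first LLR, per the expressions above.

    Baseline: block 1 then block 2 must each sweep all w elements.
    Overlapped: gamma is computed out of order, so beta and alpha
    sweep block 2 from both ends and meet after w/2 steps."""
    if overlapped:
        return (w // 2) * lat2
    return w * lat1 + w * lat2

base = first_llr_latency(128, 1, 1, overlapped=False)
fast = first_llr_latency(128, 1, 1, overlapped=True)
print(base, fast, base / fast)   # overlapping cuts time-to-first-LLR
```

With equal per-element latencies the time to the first LLR drops by 4x, consistent with the report's "more than 2x" overall claim once the remaining non-overlapped work is accounted for.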
Figure 12
OVERLAPPED MAP DECODER PIPELINE
We implement the proposed overlapped MAP decoder with the following parameters: w=128,
o=16, C=4, D=64. We parallelize over the C and D dimensions, meaning we allocate enough
resources to compute the operations in the C and D dimensions in parallel.
Figure 12 shows the optimizations made over the baseline data flow of figure 3: by
overlapping the first two blocks, the latency to produce the first LLR is reduced to t=w.
Figure 13
Figures 13 and 14 show the pipeline architectures of the new reduced blocks. Each has
an initiation interval of 2 (the dependent, non-overlapped stages of the pipeline),
while the depth corresponds to the overlapped, independent stages of the pipeline.
Figure 14
Figure 15
NORMALIZATION
𝛽 and 𝛼 are accumulated values, each computed from its previous value. Since we use a
fixed-point representation, we are limited by the threshold that corresponds to the
maximum value representable in the chosen number of bits: as the values of both matrices
keep propagating, their range grows until it reaches that threshold, after which the
values saturate at the maximum and stop changing.
To overcome this limitation, we introduce a normalization operation that normalizes the
values after each iteration in the w dimension. However, normalization is a sequential
operation that changes the structure of the pipeline and increases its initiation
interval: the next window cannot overlap its depth until the normalization, which takes
three stages, is done.
Figures 16 and 17 show the corresponding data-flow and pipeline changes after
normalization. The pipeline depth gets deeper, since three sequential normalization
stages run before the final value of 𝛽 or 𝛼 is stored.
In figure 16, we can see the extra operations in each block; these are the normalization
procedure needed for both 𝛽 and 𝛼.
As seen in figure 17, normalization increases the initiation interval from 2 to 4.
Figure 16
Figure 17
Results
The proposed MAP decoder is implemented in HLS and synthesized using the VIVADO HLS
tool [4]. Table 1 shows the resource allocation after synthesis with the target clock.
HLS SYNTHESIS RESULTS
BRAM DSP FF LUT clock
Used 192 512 39580 170681 3.5ns
Total 5376 12288 3456K 1728K 4ns
Util(%) 3.5 4 1 9
Table 1
VIVADO IMPLEMENTATION RESULTS
The design is synthesized and implemented for the target FPGA (xcvu13p-fhgb2104), and
the implementation results are shown in table 2.
BRAM DSP FF LUT clock
Used 128 512 19527 66431 4.16ns
Total 5376 12288 3456K 1728K 4ns
Util(%) 3 4 1 3.8
Table 2
BCJR instance throughput = (128*4)/(252*4.1ns) = 490 Mbps
LDPC SYNTHESIS RESULTS
The provided LDPC core is synthesized using the Xilinx VIVADO tool, and the synthesis
results are as follows:
BRAM DSP FF LUT Clock
Used 124.5 0 61199 56235 2.5ns
Total 5376 12288 3456K 1728K 4ns
Util(%) 2.3 0 2 3.5
Table 3
LDPC throughput for (code block=10368, Rate=0.815) = 1.775 Gbps[3]
MAP DECODER VERIFICATION
The implemented design is verified against the MATLAB reference model by passing test
vectors from MATLAB and comparing the hardware and MATLAB results.
The snapshots in figures 18 and 19 show matching results between the HLS RTL output
simulation and the post-implementation simulation.
Figure 18
Figure 19
LDPC DECODER VERIFICATION
The LDPC IP is currently being verified against the MATLAB model. The currently available
model uses a standard LDPC code matrix from the DVB-S2 standard [5]. This standard
provides an H-matrix with a code-word length of 64800; however, the provided LDPC IP
imposes certain constraints on using a customized code. As shown in table 4 [3], the
maximum supported code length is 32768.
Table 4
Instead, we use an available standard that can be configured on the LDPC IP: the
Data-Over-Cable Service Interface Specifications, DOCSIS 3.1 [6]. The LDPC IP is
configured through an AXI interface [7].
The IP provides a MATLAB model to define LDPC codes in YAML format [8], as seen in
figure 20. This file is translated into AXI-Lite transactions that configure the register
space of the LDPC IP, as shown in the tables on the right of figure 20. For more details
on the complete register space, refer to the LDPC IP product guide [3].
Figure 20
SYSTEM THROUGHPUT
Based on the throughput results for both the BCJR and LDPC blocks, we can fit multiple
parallel units of each. For the target FPGA area, we can fit 6 BCJR and 2 LDPC instances
per channel, as shown in figure 21, which produces an overall throughput of 3.5 Gbps.
Figure 21
Future work
In this section, we discuss new techniques to further enhance the design throughput.
We introduce a coarse-grained normalization scheme we call the normalization interval.
NORMALIZATION INTERVAL
Currently, normalization is performed at every iteration in the W dimension
(fine-grained); intuitively, it should only be done when the fixed-point values are about
to saturate (coarse-grained).
For the current design with 9-bit data structures, simulations show that normalization
can be performed at intervals of 4 iterations. To illustrate, consider the example in
figure 21: each block corresponds to an iteration in the W dimension. The first block
computes 𝛽 and 𝛼 without normalization, since none is needed yet, and this continues
for 4 iterations until the values approach saturation, at which point normalization is
performed to adjust them.
This technique divides the pipeline into two unequal pipelines, each with its own depth
and latency, which requires special handling hardware.
In principle, it should boost the reported throughput, since it allows the BCJR decoder
to operate in fewer cycles. Using 9 bits, we can perform normalization every 4
iterations, which gives the following throughput:
Throughput = (128 x 4) / ([128 + (128/5) x 2] x 4.1 ns) = 735 Mbps (~1.5x)
Figure 21
References
[1] Jana et al., Interference Cancellation for Time-Frequency Packed Super-Nyquist WDM
Systems.
[2] Benedetto et al., A Soft-Input Soft-Output Maximum A Posteriori (MAP) Module to
Decode Parallel and Serial Concatenated Codes.
[3] https://www.xilinx.com/support/documentation/ip_documentation/ldpc/v2_0/pb052-
ldpc.pdf
[4] https://www.xilinx.com/support/documentation-navigation/design-hubs/dh0012-
vivado-high-level-synthesis-hub.html
[5] https://www.mathworks.com/help/comm/examples/dvb-s-2-link-including-ldpc-
coding.html
[6] https://apps.cablelabs.com/specification/CM-SP-PHYv3.1
[7] https://static.docs.arm.com/ihi0022/g/IHI0022G_amba_axi_protocol_spec.pdf
[8] Del Barco et al., FPGA implementation of high-speed parallel maximum a posteriori
(MAP) decoders.
[9] Bahl et al., Optimal decoding of linear codes for minimizing symbol error rate.