Technical Report on
FASTER THAN NYQUIST DECODER
HARDWARE DESIGN
University of British Columbia
By
Mohamed Matar
Abstract
Faster than Nyquist (FTN) signaling is a technique for sending data pulses faster than the
Nyquist limit, which introduces Inter-Symbol Interference (ISI) when the signal is received.
ISI can be mitigated by a BCJR decoder [9], a convolutional decoder that treats the ISI as a
form of channel encoding, as shown by the authors in [1]. Such a decoder is computationally
expensive and requires special hardware to perform the decoding algorithm quickly and
efficiently. Moreover, the BCJR decoder data flow is sequential and contains many data
dependencies, which makes it harder to design a fast hardware implementation.
In this report, we look at the constraints and realization parameters considered during
hardware design, such as bit representation, function approximation, and data-dependency
resolution. We explore the trade-offs that shape the design space of the BCJR decoder as
well as the entire receiver. We report system design solutions, implement a
high-performance solution optimized for latency, and verify it on an FPGA.
Introduction to FTN Decoder
Ideally, in a communication system, the receiver includes an LDPC decoder that accounts for channel decoding, as seen in figure 1. In an FTN system, however, there is an extra block called the BCJR decoder; this block is responsible for equalizing the effect of the Inter-Symbol Interference that occurs as a result of Faster-than-Nyquist sampling, as seen in figure 2. In the next section, we discuss the need for FTN and how its decoder works.
WHAT IS FTN?
FTN is a way to compress signal bandwidth by allowing overlap between symbols during
transmission. The overlapping symbols interfere with each other when their values are
sampled at the receiver, so ISI occurs.
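To make the overlap concrete, the sketch below (an illustrative model, not the report's actual pulse shaping) samples an ideal sinc Nyquist pulse at a compressed spacing of tau*T and shows that neighboring symbols contribute non-zero interference at the sample points:

```python
import math

def sinc(x):
    # Normalized sinc sin(pi*x)/(pi*x): the ideal Nyquist pulse shape,
    # exactly zero at every nonzero integer.
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)

def isi_taps(tau, num_taps=3):
    """Interference a neighbor k symbol periods away contributes when
    sampling at t = k * tau * T instead of t = k * T (tau < 1 is FTN)."""
    return [sinc(k * tau) for k in range(num_taps + 1)]

# Nyquist-rate sampling (tau = 1): neighbors contribute essentially zero.
print(isi_taps(1.0))

# Faster than Nyquist (tau = 0.8): neighbors leak into every sample -> ISI.
print(isi_taps(0.8))
```

With tau = 1 the off-center taps vanish (no ISI); with tau = 0.8 they are clearly nonzero, which is exactly the interference the BCJR decoder must equalize.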
The FTN decoder [1] is an iterative feedback-loop decoder: a BCJR decoder followed by an
LDPC decoder that feeds back into the BCJR decoder in an iterative loop (as shown in
figure 2). The loop runs for a variable number of iterations and terminates either when
the LDPC parity-check matrix is satisfied or when a maximum number of iterations is
reached, in which case the bits of the last iteration are streamed out.
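The loop control described above can be sketched as follows; `bcjr`, `ldpc` and `parity_ok` are illustrative placeholder callables, not the report's actual interfaces:

```python
def ftn_decode(llr_in, bcjr, ldpc, parity_ok, max_iters=10):
    """Sketch of the BCJR <-> LDPC feedback loop.

    bcjr(llr, prior)  -> extrinsic information after ISI equalization
    ldpc(extrinsic)   -> (hard bits, feedback to the BCJR stage)
    parity_ok(bits)   -> True when the parity-check matrix is satisfied
    All three are stand-ins for the real blocks."""
    prior = [0.0] * len(llr_in)          # no prior on the first pass
    bits = None
    for _ in range(max_iters):
        extrinsic = bcjr(llr_in, prior)  # equalize ISI
        bits, prior = ldpc(extrinsic)    # channel decode + feedback
        if parity_ok(bits):              # early termination on success
            break
    return bits                          # last iteration's bits otherwise
```

Note the hard sequential dependency: each iteration consumes the previous iteration's feedback, which is why the report focuses on per-iteration latency rather than cross-iteration parallelism.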
As seen in figure 3, the dataflow of the algorithm is sequential, since it is a feedback-loop
algorithm. This introduces a data dependency between each iteration's output and the next
iteration's input, which limits our ability to parallelize these computations.
In other words, the best implementation we can aim for is one that minimizes the latency of
each iteration.
To do so, each block in an iteration needs to be optimized for latency. In this
study we focus on optimizing the BCJR block, since it is the system bottleneck.
Figure 1
Figure 2
BCJR BLOCK
In order to optimize the BCJR block, we need to understand how it works.
Figure 3 shows the steps needed to compute the output of the BCJR block (the LLR). It also
shows the dependencies between data elements, which fall into two categories:
Intra-dependency:
In block ❶ potential parallelism is available in the 3 dimensions of (W, D, C).
In block ❷ potential parallelism is available only in the 2 dimensions of (D, C).
In block ❸ potential parallelism is available in the 3 dimensions of (W, D, C).
These dependencies limit the overall parallelism to that of block ❷
(Amdahl's Law): the whole system is limited by the part of the algorithm that
cannot be parallelized.
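The Amdahl's Law bound mentioned above can be computed directly; the fraction and unit count below are illustrative numbers, not measurements from this design:

```python
def amdahl_speedup(parallel_fraction, n_units):
    """Amdahl's Law: overall speedup when only `parallel_fraction` of the
    work can be spread across `n_units` parallel resources."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_units)

# Even if 90% of the work parallelizes perfectly, 64 units yield under 10x,
# because the serial 10% (here, the W-sequential part) dominates.
print(amdahl_speedup(0.9, 64))
```

This is why block ❷, the least parallelizable block, caps the speedup achievable by parallelizing blocks ❶ and ❸.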
Inter-dependency:
These are dependencies across blocks. As seen in figure 3, block ❷ depends on
block ❶'s output, and block ❸ depends on the outputs of blocks ❶ and ❷.
This means that the LLR cannot be computed until blocks ❶ and ❷ have been computed
first. Furthermore, each LLR in the W dimension requires the values 𝛽, γ and 𝛼 at the
same index, which is a problem because 𝛼 and 𝛽 are computed sequentially in opposite
directions along the W dimension.
In other words, to compute the first output element LLR[0], we need 𝛽[0], γ[0] and 𝛼[0].
This is fine for γ and 𝛼; however, 𝛽[0] is the last element computed in block ❷ due to
the reverse, sequential nature of its computation. Such constraints make it hard to
directly optimize the BCJR block for latency.
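The opposing sweep directions can be sketched with a toy max-log recursion over a generic 2-state trellis; this is a simplified illustration, and the actual FTN trellis structure and branch metrics differ:

```python
def forward_backward(gamma):
    """Toy max-log recursions over a 2-state trellis.

    gamma[k][s][s2] is the branch metric at step k from state s to s2.
    alpha sweeps forward in k; beta sweeps backward -- so beta[0] is the
    LAST value produced, which is why LLR[0] has to wait."""
    n, S = len(gamma), 2
    alpha = [[0.0] * S for _ in range(n + 1)]
    beta = [[0.0] * S for _ in range(n + 1)]
    for k in range(n):                      # forward sweep for alpha
        for s2 in range(S):
            alpha[k + 1][s2] = max(alpha[k][s] + gamma[k][s][s2] for s in range(S))
    for k in range(n - 1, -1, -1):          # reverse sweep for beta
        for s in range(S):
            beta[k][s] = max(beta[k + 1][s2] + gamma[k][s][s2] for s2 in range(S))
    return alpha, beta
```

Each alpha step needs the previous alpha and each beta step needs the next beta, so neither sweep can be parallelized along k: exactly the sequential W-dimension dependency discussed above.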
In the next sections, we look at two main techniques proposed to minimize the latency of
this block, and we also study the parameters that affect hardware realization of the
whole system.
System Exploration
In this section, we look at the parameters that control latency in the design of each block
of the FTN decoder system. We first look at LDPC parameters, such as the number of decoding
iterations; then we look at BCJR parameters, such as data storage requirements, BCJR block
size, and function approximations.
LDPC BLOCK
In this study we do not focus on the design of the LDPC block; instead we use an LDPC IP
core [3] provided by the FPGA vendor Xilinx. Because it is a black box, we can only control
the parameters that contribute to its latency, such as the number of decoding iterations.
Figure 3
We simulate different values for the number of iterations and report the minimum
number of iterations needed to maintain the same system performance.
This evaluation is done by modeling the whole system in MATLAB while varying the
number of LDPC decoding iterations.
Figure 4 shows a bit-error-rate (BER) comparison across different signal-to-noise ratios
(Eb/No) between two configurations: the first uses the maximum number of LDPC
iterations, and the second uses the minimum number of iterations that reports almost the
same accuracy.
The figure shows that 8 iterations almost preserve the accuracy achieved with
50 iterations, which allows a performance speedup of 6.25x.
Figure 4
BCJR BLOCK
1. Fixed point representation
To realize arithmetic operations in hardware, values can be represented as
integers, floating point, or fixed point.
The computations in this application require fractional precision, so the choice is
between fixed point and floating point. In a nutshell, floating point versus fixed point
is an accuracy-versus-efficiency trade-off; we can choose fixed point to represent our
values provided we know their dynamic range and the number of integer and fraction bits
needed to represent them.
A good way to figure out the appropriate number of bits for the data structures is to
histogram their full-precision values. In the following figures we show histograms for
the main data structures used to produce the LLR output: 𝛼, 𝛽 and γ.
Figures 5, 6 and 7 show histograms of the dynamic range of 𝛼, 𝛽 and γ across a whole
code block. As seen, the range of values can be represented with 7 signed bits for the
integer part. We also simulated the number of fraction bits needed, and 2 fraction bits
report the best results. In figure 8, we report the best results achieved using 9 bits
(7 integer bits, 2 fraction bits).
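The chosen format can be modeled in software as follows. This is a sketch; we assume the sign bit is counted inside the 7 integer bits, which the report does not state explicitly:

```python
def to_fixed(x, int_bits=7, frac_bits=2):
    """Quantize x to a signed fixed-point value with `int_bits` integer
    bits (sign included -- an assumption about the report's convention)
    and `frac_bits` fraction bits, with round-to-nearest and saturation."""
    scale = 1 << frac_bits
    lo = -(1 << (int_bits - 1)) * scale        # most negative code
    hi = (1 << (int_bits - 1)) * scale - 1     # most positive code
    q = max(lo, min(hi, round(x * scale)))     # round, then saturate
    return q / scale

print(to_fixed(3.14))     # snapped to the nearest multiple of 0.25
print(to_fixed(1000.0))   # saturates at the format's maximum
```

A model like this, applied to the MATLAB reference values, is how the histogram-driven bit-width choice can be validated before committing it to hardware.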
Figure 5: Histogram of Alpha values (range of values vs number of occurrences)
Figure 6: Histogram of Beta values (range of values vs number of occurrences)
Figure 7: Histogram of Gamma values (range of values vs number of occurrences)
Figure 8
2. Sliding Window
As stated before, the first LLR to be produced must wait for blocks ❶ and ❷ to finish.
We also know that these two blocks operate sequentially in the W dimension, which
means the first LLR is available after (W x latency of block ❶) + (W x latency of
block ❷).
If W is large, the overall latency is large; if we can reduce W, the latency is
minimized by the smallest W that can be used.
In principle, W corresponds to the block window, which is defined by the LDPC code block
size. Code blocks can be ~10K, 32K or even 64K elements, where each element corresponds
to a received symbol.
In practice, these elements are dependent only to some degree: the dependency corresponds
to symbol interference within the same frame, which is unlikely to span the whole code
block. The authors in [2] proposed the sliding-window technique, which divides the whole
code block into smaller blocks that are assumed to be independent.
These small blocks can act independently provided they share an overlap that is used to
calculate an initial dependency estimate for each small block; the authors propose an
overlap size of 6 times the number of memory elements of the convolutional (BCJR)
encoder. In the FTN application, the convolution memory differs from that of a typical
convolutional encoder: it corresponds to the effective overlap between symbols after
sampling with interference.
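The window/overlap bookkeeping can be sketched as follows; this is index arithmetic only, and the block length, w and o values below are illustrative:

```python
def sliding_windows(block_len, w, o):
    """Partition a code block into windows of size w, each prefixed with
    an o-symbol overlap from the previous window that warms up the
    recursions. Returns (warmup_start, start, end) index triples."""
    windows = []
    start = 0
    while start < block_len:
        warmup = max(0, start - o)   # overlap region from the previous frame
        windows.append((warmup, start, min(start + w, block_len)))
        start += w
    return windows

# e.g. a 512-symbol block with w=128, o=16 -> 4 nearly independent windows
print(sliding_windows(512, 128, 16))
```

Each window's recursions run over its overlap region first to build an initial metric estimate, after which the windows can be processed independently.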
As shown in figure 9, the code block is divided into smaller frames, each of which can be
assumed independent if a certain overlap overhead is carried over from the previous frame.
We simulate different values of w and o to choose a proper trade-off between accuracy and
performance.
Figure 9
Figure 10 compares the results of figure 8 with two extra configurations of
different w and o values: the yellow line divides the code block into frames of w=128
with an overlap of o=16, while the green line uses w=64 and o=8.
With a target overall accuracy drop of less than 0.2 dB, the yellow-line configuration
better meets the accuracy target.
Figure 10
3. Max-function approximation
In figure 3, we can see a max operation performed in blocks ❷ and ❸. Ideally these
operations are log-sum-of-exponential operations, which are expensive to compute in
hardware. In practice, this operation can be performed using a max function, which is
easy to implement in hardware, plus a correction term that adjusts the max toward the
exact value. The correction term is usually implemented in hardware as a look-up table;
the bigger the table, the closer the output is to the original value. However, a big
table is costly in hardware, so in the next experiment we simulate the system accuracy
with and without the correction term to evaluate its effect.
As shown in figure 11, the effect of the correction term on the overall accuracy of the
system is negligible; hence we choose the max-only design.
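The operation being approximated is the standard Jacobian logarithm; the sketch below compares the exact form with the max-only version evaluated here:

```python
import math

def max_star(a, b):
    """Exact Jacobian logarithm: log(exp(a) + exp(b))
    = max(a, b) + a correction term that depends only on |a - b|."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))

def max_only(a, b):
    """Hardware-friendly approximation: drop the correction term."""
    return max(a, b)

# The correction term peaks at log(2) ~= 0.693 when a == b and
# vanishes quickly as the operands separate.
print(max_star(2.0, 2.0) - max_only(2.0, 2.0))
print(max_star(5.0, 0.0) - max_only(5.0, 0.0))
```

Since the correction term is bounded by log(2) and shrinks rapidly, dropping it (or coarsely tabulating it) costs little accuracy, which matches the figure 11 result.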
Figure 11
Architecture Evaluation
OVERLAPPED MAP DECODER
Due to the reversed orders of 𝛽 and 𝛼, the LLR can first be computed at the end of
block ❷, which in turn starts only at the end of block ❶, since 𝛽[w] needs γ[n], the
last element computed in block ❶.
Notice, however, that block ❶ has potential independence over all 3 dimensions, which
means the γ values can be computed out of order; this way 𝛽 and 𝛼 can start computing
block ❷ overlapped with block ❶, as seen in figure 12. A similar architecture was
proposed as a systolic structure by the authors of [8] for simpler Turbo decoders.
This way the first LLR can be produced after computing 𝛽[w/2] and 𝛼[w/2], as seen in
figure 12, which reduces the latency to (W/2 x latency of block ❷) instead of
(W x latency of block ❶) + (W x latency of block ❷).
The proposed technique achieves a performance speedup of more than 2x.
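The latency expressions above can be checked numerically; the unit per-element latencies below are illustrative, not the design's measured values:

```python
def first_llr_latency(w, lat1, lat2, overlapped):
    """Cycles until the first LLR, per the expressions above.

    Baseline: block 1 then block 2 must each sweep all w elements.
    Overlapped: gamma is computed out of order, so beta and alpha
    sweep block 2 from both ends and meet after w/2 steps."""
    if overlapped:
        return (w // 2) * lat2
    return w * lat1 + w * lat2

base = first_llr_latency(128, 1, 1, overlapped=False)
fast = first_llr_latency(128, 1, 1, overlapped=True)
print(base, fast, base / fast)   # overlapping cuts time-to-first-LLR
```

With equal per-element latencies the time to the first LLR drops by 4x, consistent with the report's "more than 2x" overall claim once the remaining non-overlapped work is accounted for.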
Figure 12
OVERLAPPED MAP DECODER PIPELINE
We implement the proposed overlapped MAP decoder with the following parameters: w=128,
o=16, C=4, D=64. We parallelize over the C and D dimensions, meaning we allocate enough
resources to compute the operations in the C and D dimensions in parallel.
Figure 12 shows the optimizations made over the baseline data flow of figure 3: by
overlapping the first two blocks, the latency to produce the first LLR is reduced to t=w.
Figure 13
Figures 13 and 14 show the pipeline architectures of the new reduced blocks. Each has
an initiation interval of 2 (the dependent, non-overlapped stages of the pipeline),
while the depth corresponds to the overlapped, independent stages of the pipeline.
Figure 14
Figure 15
NORMALIZATION
𝛽 and 𝛼 are accumulated values, each computed from its previous value. Since we use a
fixed-point representation, we are limited by the threshold that corresponds to the
maximum value representable in the chosen number of bits: as the values of both matrices
keep propagating, their range grows until it reaches that threshold, after which the
values saturate at the maximum and stop changing.
To overcome this limitation, we introduce a normalization operation that normalizes the
values after each iteration in the w dimension. However, normalization is a sequential
operation that changes the structure of the pipeline and increases its initiation
interval: the next window cannot overlap its depth until the normalization, which takes
three stages, is done.
Figures 16 and 17 show the corresponding data-flow and pipeline changes after
normalization. The pipeline depth gets deeper, since three sequential normalization
stages run before the final value of 𝛽 or 𝛼 is stored.
In figure 16, we can see the extra operations in each block; these are the normalization
procedure needed for both 𝛽 and 𝛼.
As seen in figure 17, normalization increases the initiation interval from 2 to 4.
Figure 16
Figure 17
Results
The proposed MAP decoder is implemented in HLS and synthesized using the VIVADO HLS
tool [4]. Table 1 shows the resource allocation after synthesis with the target clock.
HLS SYNTHESIS RESULTS
BRAM DSP FF LUT clock
Used 192 512 39580 170681 3.5ns
Total 5376 12288 3456K 1728K 4ns
Util(%) 3.5 4 1 9
Table 1
VIVADO IMPLEMENTATION RESULTS
The design is synthesized and implemented for the target FPGA (xcvu13p-fhgb2104), and
the implementation results are shown in table 2.
BRAM DSP FF LUT clock
Used 128 512 19527 66431 4.16ns
Total 5376 12288 3456K 1728K 4ns
Util(%) 3 4 1 3.8
Table 2
BCJR instance throughput = (128*4)/(252*4.1ns) = 490 Mbps
LDPC SYNTHESIS RESULTS
The provided LDPC core is synthesized using the Xilinx VIVADO tool, and the synthesis
results are as follows:
BRAM DSP FF LUT Clock
Used 124.5 0 61199 56235 2.5ns
Total 5376 12288 3456K 1728K 4ns
Util(%) 2.3 0 2 3.5
Table 3
LDPC throughput for (code block=10368, Rate=0.815) = 1.775 Gbps[3]
MAP DECODER VERIFICATION
The implemented design is verified against the MATLAB reference model by passing test
vectors from MATLAB and comparing the hardware and MATLAB results.
The snapshots in figures 18 and 19 show matching results between the HLS RTL output
simulation and the post-implementation simulation.
Figure 18
Figure 19
LDPC DECODER VERIFICATION
The LDPC IP is currently being verified against the MATLAB model. The currently available
model uses a standard LDPC code matrix from the DVB-S2 standard [5]. This standard
provides an H-matrix with a code-word length of 64800; however, the provided LDPC IP
imposes certain constraints on using a customized code. As shown in table 4 [3], the
maximum supported code length is 32768.
Table 4
Instead, we use an available standard that can be configured on the LDPC IP: the
Data-Over-Cable Service Interface Specifications, DOCSIS 3.1 [6]. The LDPC IP is
configured through an AXI interface [7].
The IP provides a MATLAB model to define LDPC codes in YAML format [8], as seen in
figure 20. This file is translated into AXI-Lite transactions that configure the register
space of the LDPC IP, as shown in the tables on the right of figure 20. For more details
on the complete register space, refer to the LDPC IP product guide [3].
Figure 20
SYSTEM THROUGHPUT
Based on the throughput results for both the BCJR and LDPC blocks, we can fit multiple
parallel units of each. For the target FPGA area, we can fit 6 BCJR and 2 LDPC instances
per channel, as shown in figure 21, which produces an overall throughput of 3.5 Gbps.
Figure 21
Future work
In this section, we discuss new techniques to further enhance the design throughput.
We introduce a coarse-grained normalization scheme we call the normalization interval.
NORMALIZATION INTERVAL
Currently, normalization is performed at every iteration in the W dimension
(fine-grained); intuitively, it should only be done when the fixed-point values are about
to saturate (coarse-grained).
For the current design with 9-bit data structures, simulations show that normalization
can be performed at intervals of 4 iterations. To illustrate, consider the example in
figure 21: each block corresponds to an iteration in the W dimension. The first block
computes 𝛽 and 𝛼 without normalization, since none is needed yet, and this continues
for 4 iterations until the values approach saturation, at which point normalization is
performed to adjust them.
This technique divides the pipeline into two unequal pipelines, each with its own depth
and latency, which requires special handling hardware.
In principle, it should boost the reported throughput, since it allows the BCJR decoder
to operate in fewer cycles. Using 9 bits, we can perform normalization every 4
iterations, which gives the following throughput:
Throughput = (128 x 4) / ([128 + (128/5) x 2] x 4.1 ns) = 735 Mbps (~1.5x)
Figure 21
References
[1] Jana et al., Interference Cancellation for Time-Frequency Packed Super-Nyquist WDM
Systems.
[2] Benedetto et al., A Soft-Input Soft-Output Maximum A Posteriori (MAP) Module to
Decode Parallel and Serial Concatenated Codes.
[3] https://www.xilinx.com/support/documentation/ip_documentation/ldpc/v2_0/pb052-
ldpc.pdf
[4] https://www.xilinx.com/support/documentation-navigation/design-hubs/dh0012-
vivado-high-level-synthesis-hub.html
[5] https://www.mathworks.com/help/comm/examples/dvb-s-2-link-including-ldpc-
coding.html
[6] https://apps.cablelabs.com/specification/CM-SP-PHYv3.1
[7] https://static.docs.arm.com/ihi0022/g/IHI0022G_amba_axi_protocol_spec.pdf
[8] Del Barco et al., FPGA implementation of high-speed parallel maximum a posteriori
(MAP) decoders.
[9] Bahl et al., Optimal decoding of linear codes for minimizing symbol error rate.