
Advanced Baseband Processing Circuits and Systems for 5G Communications
IEEE SiPS 2018 Tutorial – Half Day

Chuan Zhang1,2 and Xiaosi Tan1,2

November 7, 2018
1 Lab of Efficient Architectures for Digital-communication and Signal-processing (LEADS)
2 National Mobile Communications Research Laboratory, Southeast University, Nanjing, China

Tutorial Speakers

Chuan Zhang is an associate professor at the National Mobile Communications Research Laboratory, School of Information Science and Engineering, Southeast University, Nanjing, China.

Xiaosi Tan is currently a research fellow at the National Mobile Communications Research Laboratory, School of Information Science and Engineering, Southeast University, Nanjing, China.

1


Outline

1. Introduction

2. Polar Code Decoder

3. Algorithms and Implementations for MIMO Detection

4. Deep Learning Based Baseband Signal Processing

2

Introduction.

Introduction

• Baseband signal processing in the 5G era
  - Massive MIMO;
  - Advanced coding and modulation;
  - Cognitive and cooperative radio baseband platform;
  - Configurable radio air-interface.

• What is challenging?
  - Advanced algorithms;
  - Implementation of circuits, architectures, and platforms;
  - Using AI techniques.

3

Introduction

[Figure 1 residue: transmitter side with per-stream channel encoders, NOMA encoders, a massive MIMO encoder/precoder P1…Pn, iFFTs, digital filters, and RF chains; receiver side with RF chains, time-domain pre-processing, FFTs, a massive MIMO detector/combiner, a NOMA decoder, channel decoders, and channel/noise estimation.]

Figure 1: System-level diagram of 5G wireless baseband architecture.

4

Introduction

• Targets
  - High spectral efficiency;
  - Low power and low area;
  - High throughput and low latency.

• This tutorial
  - Polar code decoder (Chuan Zhang);
  - Massive MIMO detection (Xiaosi Tan);
  - Uniform belief propagation processing (Chuan Zhang);
  - Deep learning based baseband signal processing (Xiaosi Tan).

5

Polar Code Decoder.

Polar Codes

Brief introduction of polar codes:

• A capacity-achieving code proposed by Erdal Arıkan in 2009¹.
• The forward error correction (FEC) code for eMBB's control channel.

• Linear block code:

$x_1^N = u_1^N G_N,$

where GN denotes the generator matrix.

1E. Arikan, “Channel polarization: a method for constructing capacity-achieving codes for symmetric binary-input memoryless channels,”IEEE Transactions on Information Theory, vol. 55, no. 7, pp. 3051–3073, 2009.

6
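As a concrete illustration of the encoding equation, the sketch below forms $G_N$ as the $\log_2 N$-fold Kronecker power of the kernel F = [[1, 0], [1, 1]] and multiplies over GF(2). This is a minimal sketch under stated assumptions: it omits the bit-reversal permutation used in Arıkan's original construction, and the frozen-bit placement in the example is illustrative only.

```python
import numpy as np

def polar_generator_matrix(N):
    """Generator matrix G_N as the log2(N)-fold Kronecker power of the kernel
    F = [[1, 0], [1, 1]] (bit-reversal permutation omitted for simplicity)."""
    F = np.array([[1, 0], [1, 1]], dtype=int)
    G = np.array([[1]], dtype=int)
    for _ in range(int(np.log2(N))):
        G = np.kron(G, F)
    return G

def polar_encode(u, N):
    """x_1^N = u_1^N G_N over GF(2); frozen positions of u are assumed to be 0."""
    return u @ polar_generator_matrix(N) % 2

# Example: N = 8 with the information bits placed on the last four positions (illustrative)
u = np.array([0, 0, 0, 0, 1, 0, 1, 1])
print(polar_encode(u, 8))
```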

Polar Codes

The biggest asset of polar coding compared to the state of the art (SoA) is its universal, flexible, and versatile nature:

• Universal: the same hardware can be used with different code lengths, rates, and channels.

• Flexible: the code rate can be adjusted readily to any number between 0 and 1.

• Versatile: can be used in multiple coding scenarios.

Figure 2: Channel polarization.

7

Polar Code’s Performance

With list decoding and CRC, polar codes deliver even better performance than the LDPC and Turbo codes used in present wireless standards².

[FER vs Es/N0 plot: P(1024,512), 4-QAM with L = 1 (CRC-0), L = 32 (CRC-0), and L = 32 (CRC-16), the dispersion bound for (1024,512), and WiMAX CTC (960,480).]

Figure 3: Comparison with Turbo codes.

2 E. Arıkan, “Polar coding for 5G wireless?,” International Workshop on Polar Code, 2015.

8

Polar Code’s Performance

With list decoding and CRC, polar codes deliver even better performance than the LDPC and Turbo codes used in present wireless standards³.

[FER vs Es/N0 plot: P(2048,1024), 4-QAM with L = 1 (CRC-0), L = 32 (CRC-0), and L = 32 (CRC-16), the dispersion bound for (2048,1024), and WiMAX LDPC (2304,1152) with Max Iter = 100.]

Figure 4: Comparison with LDPC codes.

3 E. Arıkan, “Polar coding for 5G wireless?,” International Workshop on Polar Code, 2015.

9

Contents

Polar Code Decoder

Successive Cancelation (SC) Decoding for Polar Codes

SC-based Decoding of Polar Codes

Stochastic Polar BP decoding

Approximate BP Decoding Implementation

10

Decoding Trellis of SC Decoder

[Decoding trellis of the N = 8 SC decoder: three stages of Type I and Type II processing elements (PEs) connect the channel LLRs L₁⁽¹⁾(y₁)…L₁⁽¹⁾(y₈) on the left to the decision LLRs for û₁…û₈ on the right; the numbers mark the clock cycle in which each PE is activated.]

• The trellis is FFT-like.

• The SC decoder is a serial decoder.

Type I PE:
$$L_N^{(2i)}\big(y_1^N, \hat{u}_1^{2i-1}\big) = (-1)^{\hat{u}_{2i-1}}\, L_{N/2}^{(i)}\big(y_1^{N/2},\, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}\big) + L_{N/2}^{(i)}\big(y_{N/2+1}^{N},\, \hat{u}_{1,e}^{2i-2}\big),$$

Type II PE:
$$L_N^{(2i-1)}\big(y_1^N, \hat{u}_1^{2i-2}\big) = \mathrm{sgn}\big[L_{N/2}^{(i)}\big(y_1^{N/2},\, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}\big)\big]\,\mathrm{sgn}\big[L_{N/2}^{(i)}\big(y_{N/2+1}^{N},\, \hat{u}_{1,e}^{2i-2}\big)\big]\cdot \min\Big[\big|L_{N/2}^{(i)}\big(y_1^{N/2},\, \hat{u}_{1,o}^{2i-2} \oplus \hat{u}_{1,e}^{2i-2}\big)\big|,\ \big|L_{N/2}^{(i)}\big(y_{N/2+1}^{N},\, \hat{u}_{1,e}^{2i-2}\big)\big|\Big].$$

11
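In LLR form the two PE types reduce to the usual g (partial-sum aided) and f (min-sum) update functions. The following is a minimal numpy sketch under that reading; the function names are illustrative, not taken from the tutorial's hardware.

```python
import numpy as np

def pe_type2_f(llr_a, llr_b):
    """Type II PE (f function): sign-magnitude min-sum combination of two LLRs."""
    return np.sign(llr_a) * np.sign(llr_b) * np.minimum(np.abs(llr_a), np.abs(llr_b))

def pe_type1_g(llr_a, llr_b, u_partial):
    """Type I PE (g function): combine two LLRs given the already-decoded
    partial-sum bit u_partial (0 or 1)."""
    return (-1) ** u_partial * llr_a + llr_b

# Example: one PE pair of an N = 8 decoder
a, b = 0.8, -1.5
print(pe_type2_f(a, b))        # used before the odd-indexed bit is decided
print(pe_type1_g(a, b, 1))     # used after the partial-sum bit 1 is known
```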

Decoding Trellis of SC Decoder

[The same N = 8 decoding trellis, with each processing element now marked with a red label (A1–A4, B1–B4, C1–C4, D1–D4, E1–E4, F1–F4) so that the data-flow graph can be read off.]

• Each processing element is marked with a red label.

• The data-flow graph (DFG) is then obtained.

(Type I and Type II PE equations as on the previous slide.)

12

DFG of SC Decoder

• First, this is a feed-forward DFG.
• The critical path of the DFG equals the processing time of a single PE.
• The entire DFG is composed of two identical parts.

[Feed-forward DFG of the N = 8 SC decoder: PEs A1–A4, B1–B4, C1–C2, D1–D2, E1, F1 connected through delay elements (D) and producing û₁…û₈; the second half of the graph repeats the first.]

13

DFG of SC Decoder – One More Step

• Since the PEs A1,2,3,4 are functionally identical, we can merge them together and represent the merged one with A.

• Similar approaches can be applied to the other PEs.

[Folded DFG: merged PEs A–F connected through delays from “start” to “end” across Stages 1–3, annotated with the clock cycles in which each merged PE is active (A: 1; B: 8; C: 2, 9; D: 5, 12; E: 3, 6, 10, 13; F: 4, 7, 11, 14).]

14

Arch. and Latency of SC Decoder

The decoding latency of an N-bit SC polar decoder equals 2(N − 1) clock cycles:

$$2 \times \sum_{i=1}^{\log_2 N} 2^{\,i-1} = 2 \times \frac{2^{\log_2 N} - 1}{2 - 1} = 2(N - 1).$$

[Architecture and scheduling residue: the N = 8 SC decoder built from Type I/II PEs, sign-bit units, multiplexers, and delay elements (main frame and feedback part), together with its cycle-by-cycle schedule.]

Figure 5: Arch. for SC decoder.

Figure 6: Scheduling of SC decoder.

15

Lower Latency Consideration

There are only 2 possible outputs for every Type I PE, depending on what value û₂ᵢ₋₁ will take.

[Architecture residue: SC decoder in which both possible Type I PE outputs are prepared and a multiplexer selects the correct one once û₂ᵢ₋₁ is known.]

Figure 7: Arch. for SC decoder.

[Scheduling residue: the same architecture annotated with the operations performed in clock cycle 1.]

Figure 8: Scheduling of SC decoder.

16

Pre-Computation DFG

According to the pre-computation look-ahead approach, Type I and Type II PEs in the same stage are activated within the same clock cycle. The DFG illustrated can then be modified as follows:

[Modified DFG: merged PE pairs A/B, C/D, E/F connected through delays from “start” to “end” across Stages 1–3, annotated with their activation cycles.]

Figure 9: Arch. for SC decoder.

Figure 10: Scheduling of SC decoder.

17

Pre-Comp. SC Decoder

The decoding latency reduces to (N − 1) clock cycles:

$$\sum_{i=1}^{\log_2 N} 2^{\,i-1} = \frac{2^{\log_2 N} - 1}{2 - 1} = N - 1.$$

[Pre-computation SC decoder residue: merged PEs with inputs I1, I2 and outputs O1–O3, multiplexers selecting between the pre-computed alternatives, delay elements, and the partial-sum feedback that produces û₂ᵢ₋₁ and û₂ᵢ.]

Figure 11: Pre-comp. SC decoder.

18

Contents

Polar Code Decoder

Successive Cancelation (SC) Decoding for Polar Codes

SC-based Decoding of Polar Codes

Stochastic Polar BP decoding

Approximate BP Decoding Implementation

19

SC Decoding

[Code tree for SC decoding: starting from the root (probability 1.00), each level extends the current path with bit 0 or 1 and carries the corresponding path probability (e.g. 0.60/0.40 at the first information bit); visited and non-visited nodes are distinguished.]

SC Decoding - Greedy Algorithm

• The frozen bit is set to 0.
• An information bit extends the path to both 0 and 1, and the corresponding transition probabilities are calculated.
• The most likely code bit is picked at each step, and the path extension continues.
• Decoding ends at the last level; the decoding result is “0000 0010”.

20

SC List Decoding & SC Stack Decoding

[Code tree for SC list decoding: at each level the L = 2 most likely paths (by path probability) are retained and extended.]

SC List (SCL) Decoding Algorithm for L = 2

[Code tree for SC stack decoding: candidate paths are kept ordered by path probability in a stack and the best path is extended first.]

SC Stack (SCS) Decoding Algorithm for L = 2

21

SC Heap Decoding

[Expansion in a non-full heap (Steps 1–3): the expanded paths and their metrics (e.g. −3, −5, −6, −9, −12, −14) are inserted and the heap property is restored.]

• Expansion in a non-full heap

[Expansion in a full heap (Steps 1–3): the new path metric (e.g. −16) is compared with a random node in the last layer and the heap is re-ordered.]

• Expansion in a full heap

22

Connection between Data Structure and SC Family

• Access: access each record exactly once.
• Search: find the location of the record with a given key value.
• Insertion: add a new record to the structure.
• Deletion: remove a record from the structure.

23

Connection between Data Structure and SC Family

Data Structure        Access     Search     Insertion   Deletion   Space Complexity
Array                 O(1)       O(n)       O(n)        O(n)       O(n)
Stack                 O(n)       O(n)       O(1)        O(1)       O(n)
Queue                 O(n)       O(n)       O(1)        O(1)       O(n)
Linked List           O(n)       O(n)       O(1)        O(1)       O(n)
Binary Search Tree    O(log n)   O(log n)   O(log n)    O(log n)   O(n)
Heap                  O(log n)   O(log n)   O(log n)    O(log n)   O(n)

• Access: access each record exactly once.
• Search: find the location of the record with a given key value.
• Insertion: add a new record to the structure.
• Deletion: remove a record from the structure.

24
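The table shows why a heap is attractive for managing candidate decoding paths: both insertion and removal of the best path cost O(log D). The following is a minimal Python sketch of that bookkeeping; the data layout (path metrics taken as log-probabilities paired with bit strings) is an illustrative assumption, not the tutorial's implementation.

```python
import heapq

# Candidate paths keyed by path metric (log-probabilities: closer to 0 is better).
# heapq is a min-heap, so the negated metric is stored to pop the best path first.
heap = []
for metric, bits in [(-3, "0"), (-5, "00"), (-12, "01")]:
    heapq.heappush(heap, (-metric, bits))        # O(log D) insertion

neg_metric, best_bits = heapq.heappop(heap)      # O(log D) removal of the best path
print(-neg_metric, best_bits)                    # -3 "0"
```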

Complexity Analysis

SC-based decoders   Insertion   Deletion   Searching    Computational Complexity   Space Complexity
SCL Decoder         O(1)        O(1)       O(L log L)   O(LN(log N + log L))       O(LN)
SCS Decoder         O(D)        O(1)       O(1)         O(LN log N + mD)           O(DN)
SCH Decoder         O(log D)    O(log D)   O(1)         O(LN log N + m log D)      O(DN)

25

Predicted SC-AVL Decoding Algorithm

[Step-by-step example of path management in an AVL tree: path metrics (e.g. −3, −5, −6, −9, −12) are inserted, the pointed path is deleted, and rotations rebalance the tree.]

• Searching: search for the best candidate path as well as the path pointed to by the pointer in the tree. O(1)

• Deletion: delete the pointed path from the tree and reconstruct the AVL tree⁴. O(log D)

• Insertion: insert the two expanded paths in the appropriate places and check the balance factor of the AVL tree. If the AVL tree is unbalanced, rotate to rebalance it. The pointer points to the right leaf node. O(log D)

4 It is named after its inventors G. M. Adelson-Velsky and Evgenii Landis.

26

Comparisons over SC-Based Decoding Algorithms

[FER vs Eb/N0 plot comparing conventional SC, SC List (2), SC List (4), SC Stack (4), SC Heap (4), and SC AVL-Tree (4); the SCH, SCT, and SCL(2) curves overlap, as do SCL(4) and SCS(4).]

FER performance comparison.

[Computational complexity (×10⁶) vs Eb/N0 plot for conventional SC, SC List (2), SC List (4), SC Stack (4), SC Heap (2), and SC AVL-Tree (2).]

Computational complexity comparison.

27

Memory-Aware Architecture

[Memory-aware architecture residue: memory blocks storing (path, length, metric) triples, a data-structure block, SC decoder cores, a control unit with a per-stage instruction set, pointer memory, memory-address logic, and a metrics sorter.]

SC-based decoders   Latency (µs)   Throughput (Mbps)
SCL Decoder         490.92         2.09
SCS Decoder         3628.53        0.28
SCH Decoder         3247.45        0.32
SCT Decoder         3315.28        0.31

28

Conclusion

• The connection between data structures and the SC decoder family is revealed.
• The complexity of existing algorithms can be analyzed and proved from a data-structure perspective.
• New algorithms can be proposed and predicted based on this methodology.
• A formal architecture for SC-based decoding is discussed.

29

Contents

Polar Code Decoder

Successive Cancelation (SC) Decoding for Polar Codes

SC-based Decoding of Polar Codes

Stochastic Polar BP decoding

Approximate BP Decoding Implementation

30

Deterministic BP Decoding

[Factor graph of polar BP decoding for N = 8: a 4 × 8 grid of nodes (i, j), i = 1…4, j = 1…8, connected through three stages of “+” (F) and “=” (G) nodes.]

Figure 12: Factor Graph of polar BP decoding with N = 8.

31

Deterministic BP Decoding

• Left-to-right and right-to-left messages
• Initialization of the 2 types of messages:

$$\text{from left to right: } R_{1,j} = \begin{cases} 1, & \text{if } j \in \mathcal{A},\\ 0, & \text{if } j \in \mathcal{A}^c, \end{cases}\qquad \text{from right to left: } L_{n+1,j} = \frac{P(y_j \mid x_j = 0)}{P(y_j \mid x_j = 1)}.\qquad(1)$$

32

Deterministic BP Decoding

• The iterative updating rules associated with $L^{(t)}_{i,j}$ and $R^{(t)}_{i,j}$:

$$\begin{aligned}
L^{(t)}_{i,j} &= g\big(L^{(t)}_{i+1,2j-1},\; L^{(t)}_{i+1,2j} + R^{(t)}_{i,j+N/2}\big),\\
L^{(t)}_{i,j+N/2} &= g\big(R^{(t)}_{i,j},\; L^{(t)}_{i+1,2j-1}\big) + L^{(t)}_{i+1,2j},\\
R^{(t)}_{i+1,2j-1} &= g\big(R^{(t)}_{i,j},\; L^{(t-1)}_{i+1,2j} + R^{(t)}_{i,j+N/2}\big),\\
R^{(t)}_{i+1,2j} &= g\big(R^{(t)}_{i,j},\; L^{(t-1)}_{i+1,2j-1}\big) + R^{(t)}_{i,j+N/2},
\end{aligned}\qquad(2)$$

where $g(x, y) \approx \mathrm{sign}(x)\,\mathrm{sign}(y)\,\min(|x|, |y|)$.

• After I iterations, we have

$$\hat{u}_j = \begin{cases} 0, & \text{if } R_{1,j} \ge 1,\\ 1, & \text{otherwise.} \end{cases}\qquad(3)$$

33
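A schematic numpy sketch of one sweep of the update rules (2), using the min-sum g. It follows the simplified per-stage indexing as written on the slide; an actual decoder also applies the stage-dependent node permutation of the factor graph and keeps the previous iteration's L messages for the R updates, both of which are omitted here.

```python
import numpy as np

def g(x, y):
    # Min-sum approximation: g(x, y) ~ sign(x) sign(y) min(|x|, |y|)
    return np.sign(x) * np.sign(y) * np.minimum(np.abs(x), np.abs(y))

def bp_iteration(L, R):
    """One sweep of the update rules (2) over an (n+1) x N message array.
    L[n] holds the channel LLRs and R[0] the frozen-bit priors (0-based stages)."""
    n, N = L.shape[0] - 1, L.shape[1]
    for i in range(n - 1, -1, -1):              # right-to-left (L) pass
        for j in range(N // 2):
            L[i, j]        = g(L[i+1, 2*j], L[i+1, 2*j+1] + R[i, j + N//2])
            L[i, j + N//2] = g(R[i, j], L[i+1, 2*j]) + L[i+1, 2*j+1]
    for i in range(n):                          # left-to-right (R) pass
        for j in range(N // 2):
            R[i+1, 2*j]     = g(R[i, j], L[i+1, 2*j+1] + R[i, j + N//2])
            R[i+1, 2*j+1]   = g(R[i, j], L[i+1, 2*j]) + R[i, j + N//2]
    return L, R
```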

Deterministic BP Decoding

[(a) BCB module: a basic computation block computing the four outgoing messages L⁽ᵗ⁾ᵢ,ⱼ, L⁽ᵗ⁾ᵢ,ⱼ₊N/2, R⁽ᵗ⁾ᵢ₊₁,₂ⱼ₋₁, R⁽ᵗ⁾ᵢ₊₁,₂ⱼ from the incoming ones. (b) BCB logic structure: F and G sub-units producing out1 and out2 from inputs A–D.]

Figure 13: BCB and its logic structure.

34

Stochastic Computing

• Low complexity required
• High fault tolerance
• Lack of accuracy

[Two unary bit streams A and B are combined by a bit-wise operation into an output stream C; each stream encodes a probability (a, b, c).]

Figure 14: Basic module for stochastic computing.

35

Stochastic BP Decoding

• Stochastic channel message

$$\Pr(y_i = 1) \triangleq P(y_i \mid x_i = 1) = \frac{1}{e^{-LR(y_i)} + 1}\qquad(4)$$

• New initialization rules:

$$\text{from left to right: } R_{1,j} = \begin{cases} 0.5, & \text{if } j \in \mathcal{A},\\ 0, & \text{if } j \in \mathcal{A}^c, \end{cases}\qquad \text{from right to left: } L_{n+1,j} = \Pr(y_j = 1).\qquad(5)$$

36

Stochastic BP Decoding

• Reformulation for F node:

$$P_z \triangleq \Pr(z = 1) = P_x(1 - P_y) + (1 - P_x)P_y\qquad(6)$$

• Reformulation for G node:

$$P_z = \frac{P_x P_y}{P_x P_y + (1 - P_x)(1 - P_y)}\qquad(7)$$

[Stochastic BCB: two JK flip-flops with inputs A, B and C, D produce out1 and out2.]

Figure 15: Architecture of the Basic Computation Block (BCB).

37
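The F-node reformulation (6) is exactly what a bit-wise XOR of two uncorrelated unary bit streams computes. Below is a minimal sketch; generating the streams by independent Bernoulli sampling is an assumption used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def unary_stream(p, m):
    """Length-m Bernoulli(p) bit stream encoding the probability p."""
    return (rng.random(m) < p).astype(int)

px, py, m = 0.7, 0.2, 4096
x, y = unary_stream(px, m), unary_stream(py, m)

z = x ^ y                       # bit-wise XOR realizes the F-node update (6)
print(z.mean())                 # ~ Px(1-Py) + (1-Px)Py = 0.62
```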

Stochastic BP Decoding

[BER vs Eb/N0 plot for N = 64 and N = 256: deterministic decoding vs. straightforward stochastic decoding.]

Figure 16: Numerical results for straightforward stochastic BP decoder.

38

Approaches for Improvement

• The stochastic computing correlation SCC(A, B) of two unary bit streams A and B is given by

$$SCC(A, B) = \begin{cases} \dfrac{P_{A\&B} - P_A P_B}{\min(P_A, P_B) - P_A P_B}, & \text{if } P_{A\&B} > P_A P_B,\\[2mm] \dfrac{P_{A\&B} - P_A P_B}{P_A P_B - \max(P_A + P_B - 1,\ 0)}, & \text{otherwise.} \end{cases}\qquad(8)$$

• For two maximally similar (or different) unary bit streams A and B, we get SCC(A, B) = +1 (or −1).
• If SCC(A, B) = 0, bit streams A and B are uncorrelated and suitable for stochastic computing.

39
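A direct numpy transcription of (8); representing the streams as 0/1 arrays is assumed.

```python
import numpy as np

def scc(A, B):
    """Stochastic computing correlation of two equal-length unary bit streams, per Eq. (8)."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    pa, pb = A.mean(), B.mean()
    pab = np.mean(A * B)                          # P_{A&B}
    if pab > pa * pb:
        return (pab - pa * pb) / (min(pa, pb) - pa * pb)
    return (pab - pa * pb) / (pa * pb - max(pa + pb - 1.0, 0.0))

print(scc([0, 1, 1, 0], [0, 1, 1, 0]))   # maximally similar streams -> +1
print(scc([0, 1, 0, 1], [1, 0, 1, 0]))   # maximally different streams -> -1
```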

Approaches for Improvement

• The stochastic computing correlation SCC(A, B) is defined as in (8).
• If the length of the unary bit streams A and B goes to infinity, we have

$$\lim_{L\to\infty} \frac{1}{L}\sum_{i=1}^{L} A_i B_i = \lim_{L\to\infty} P_{A\&B} = ab = P_a P_b.$$

40

Bit Length Increasing

[BER vs Eb/N0 plot for stochastic decoders with N = 64 and N = 256 and stream lengths m = 128, 512, 1024.]

Figure 17: Comparison of stochastic decoders with different stream lengths.

41

Permutation Change in Stochastic BP Decoding

[Running value of the bit stream 00101101 plotted against the number of bits consumed.]

Figure 18: Permutation graph of bit stream 00101101.

42

Permutation Change in Stochastic BP Decoding

[Running value of the unary bit streams in stages 1–4 plotted over 256 bits.]

Figure 19: Permutation change of unary bit streams in different stages.

43

Stage-wise Re-randomization

[Factor graph of the N = 8 polar BP decoder indicating where the stage-wise re-randomization is applied between stages.]

Figure 20: Schedule of the stage-wise re-randomization with N = 8.

44

Stage-wise Re-randomization

[Factor graph with the re-randomized nodes (2,2), (2,4), (2,6), (2,8), (3,2), (3,4), (3,6), (3,8) highlighted.]

Figure 21: Schedule of the Stage-wise Re-Randomization with N = 8

45

Stage-wise Re-randomization

[BER vs Eb/N0 plot for N = 64 and N = 256: deterministic decoding, original re-randomization, and modified re-randomization.]

Figure 22: Numerical results for different stochastic decoders.

46

Early Termination

[BER vs Eb/N0 plot for maximum iteration counts Imax = 20, 30, 40, 50.]

Figure 23: Performance comparison of different number of iterations.

47

Early Termination

• Here thr is an empirically chosen threshold, and we check

$$\mathrm{cov}_t = \Big|\sum_{j=1}^{m}\big(L^{t}_{2,j} - L^{t-1}_{2,j}\big)\Big| < thr,\qquad \mathrm{cov}_{t+1} = \Big|\sum_{j=1}^{m}\big(L^{t+1}_{2,j} - L^{t}_{2,j}\big)\Big| < thr.\qquad(9)$$

• If cov_t is less than the predetermined threshold, we believe that the number of iterations is sufficient and therefore terminate the decoding process.

48
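The termination test itself is a one-liner. The sketch below assumes the stage-2 messages of the current and previous iterations are available as arrays, and thr is the empirical threshold mentioned above; the helper name is illustrative.

```python
import numpy as np

def should_terminate(L2_curr, L2_prev, thr):
    """Early-termination test of Eq. (9): stop when the summed change of the
    stage-2 messages between consecutive iterations falls below thr."""
    return abs(np.sum(L2_curr - L2_prev)) < thr

# Hypothetical usage inside the iteration loop:
# if should_terminate(L[1], L_prev[1], thr=0.5): break
```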

Hardware Architecture

[Fully parallel stochastic decoder: N/2-wide BCB arrays with re-randomization (RE) units between stages, left/right message paths L_S1…L_Sn−1 and R_S2…R_Sn, a comparator/reorder unit, an early-termination (ET) block, and a switch feeding the channel probabilities p₁…p_{n+1}; the output is û.]

Figure 24: Hardware architecture of fully parallel N-bit stochastic decoder.

49

Comparison of Different BP Decoders

Table 1: Polar BP Decoders for N = 8.

Implementation        Logic Gates   Registers   Processing Delay
Det. BP               5,680         1,632       4 clks
Sto. BP               120           6,144       1,024 clks
Stage-wise sto. BP    120           2,048       512 clks
Early termination     120           2,048       less than 512 clks

50

Contents

Polar Code Decoder

Successive Cancelation (SC) Decoding for Polar Codes

SC-based Decoding of Polar Codes

Stochastic Polar BP decoding

Approximate BP Decoding Implementation

51

Motivation for Approximate Computing

• Applications of approximate computing

52

Motivation for Approximate Computing

• Exact results NOT necessary: a few erroneous pixels do not prevent a human from recognizing the image.

(a) Accurate Result. (b) Inexact Result.

Figure 25: Approximate computing in image processing.

53

Motivation for Approximate Computing

• Numerical values do NOT matter: handwritten digit recognition.

[Similarity scores for candidate digits 8, 0, 9, 1 (0.9257, 0.1563, 0.0984, 0.2435): only the ranking of the scores matters.]

54

Motivation for Approximate Computing

• Numerical values do NOT matter: handwritten digit recognition.

[Same example with perturbed similarity scores (0.9331, 0.8929 added): the recognized digit is unchanged as long as the ranking is preserved.]

55

Motivation for Approximate Computing

• NO Golden Standard Answer

56

Motivation for Approximate Computing

• Design a circuit that may not be 100% correct
• Target error-tolerant applications
• Trade accuracy for area/delay/power

A1A0 \ B1B0    00     01     11     10
00             000    000    000    000
01             000    001    011    010
11             000    011    1001   110
10             000    010    110    100

Figure 26: K-Map and digital circuit for 2-bit multiplier .

57

Motivation for Approximate Computing

• Design a circuit that may not be 100% correct
• Target error-tolerant applications
• Trade accuracy for area/delay/power

A1A0 \ B1B0    00     01     11     10
00             000    000    000    000
01             000    001    011    010
11             000    011    111    110
10             000    010    110    100

Figure 27: K-Map and digital circuit for approximate 2-bit multiplier .

58
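The approximate K-map changes a single entry (3 × 3 yields 7 instead of 9), which is easy to check exhaustively. A minimal sketch:

```python
def approx_mult2(a, b):
    """2-bit approximate multiplier: identical to the exact product except that
    3 x 3 returns 7 (binary 111) instead of 9 (binary 1001), so the output fits
    in 3 bits."""
    return 7 if (a == 3 and b == 3) else a * b

errors = sum(approx_mult2(a, b) != a * b for a in range(4) for b in range(4))
print(errors / 16.0)    # 0.0625: exactly 1 of the 16 input pairs is wrong
```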

Approximate BP Decoder

[Approximate G node: the comparator operates only on the upper bits a_{n−1}…a₂ and b_{n−1}…b₂; the ignored lower bits a₁a₀, b₁b₀ and the signs s_a, s_b drive multiplexers that select the output s_{n−1}…s₀.]

Figure 28: Proposed approximate architecture for G node.

59

Approximate BP Decoder

Error analysis of the proposed G node:

• Suppose that a and b are random input numbers with uniform distribution.
• The probability that a[n−1:k] = b[n−1:k] and the probability that a[k−1:0] < b[k−1:0] are derived as

$$P\big(a_{[n-1:k]} = b_{[n-1:k]}\big) = \Big(\frac12\Big)^{n-k},\qquad P\big(a_{[k-1:0]} < b_{[k-1:0]}\big) = \frac{2^k - 1}{2^{k+1}}.$$

• The error rate (ER) of the proposed approximate G node is then

$$ER = \Big(\frac12\Big)^{n-k}\cdot\frac{2^k - 1}{2^{k+1}} = \frac{2^k - 1}{2^{n+1}}.\qquad(10)$$

60
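A quick Monte Carlo sanity check of (10). It assumes the approximate comparator reports a ≥ b whenever the upper n − k bits tie; that tie-breaking rule is an assumption consistent with the derivation above, and the function name is illustrative.

```python
import numpy as np

def approx_comparator_error_rate(n=8, k=3, trials=10**6, seed=0):
    """Monte Carlo estimate vs. the closed form (2^k - 1) / 2^(n+1):
    the approximate comparator errs exactly when the upper bits are equal but
    the ignored lower bits satisfy a < b."""
    rng = np.random.default_rng(seed)
    a = rng.integers(0, 2**n, trials)
    b = rng.integers(0, 2**n, trials)
    err = ((a >> k) == (b >> k)) & ((a & (2**k - 1)) < (b & (2**k - 1)))
    return err.mean(), (2**k - 1) / 2**(n + 1)

print(approx_comparator_error_rate())   # ~ (0.0137, 0.0137)
```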

Approximate BP Decoder

[BER vs Eb/N0 plot for BP decoders: floating point vs. ignored bits k = 1, 2, 3, 4.]

Figure 29: Simulation results of BP decoders with different ignored bit k.

61

Approximate BP Decoder

[Conventional F node: the sign/magnitude inputs (S_a, M_a) and (S_b, M_b) each pass through an inverter and add-one unit (AOU) before an adder and a final inverter/AOU stage produce (S_s, M_s).]

Figure 30: Conventional architecture for F node.

62

Approximate BP Decoder

[Proposed approximate F node: an adder, a subtracter, and a comparator operate directly on the magnitudes M_a, M_b; a single inverter/AOU stage remains, and multiplexers driven by S_a, S_b select S_s and M_s.]

Figure 31: Proposed approximate architecture for F node.

63

Approximate BP Decoder

[Proposed 3-bit approximate AOU: only the lower bits a₂a₁a₀ pass through the add-one logic; the upper bits a_{n−1}…a₄a₃ are passed through unchanged.]

Figure 32: Proposed 3-bit approximate AOU.

64

Approximate BP Decoder

[BER vs Eb/N0 plot for BP decoders: floating point, fixed point, 3-bit AOU, and 1-bit AOU.]

Figure 33: Simulation results of BP decoders with different bits of AOU.

65

Approximate BP Decoder

Table 2: Implementation of Different Arch.s for (64, 32) Code.

Hardware overheads   Conventional   Approximate   Reduction
ALUT                 97,283         61,958        36.3%
Registers            16,214         14,751        9.0%

66

Conclusion

• A general methodology to implement approximate & stochastic BP decoders for polar codes

• Alleviating the contradiction between higher throughput and hardware consumption

• Significant hardware reduction with negligible performance degradation

• Future work will focus on the implementation of data-based approximate & stochastic BP decoders

67

Algorithms and Implementations for MIMO Detection.

Introduction of Large-Scale MIMO System

• Advantages
  - Increased spectral efficiency
  - Enhanced link reliability
  - Improved coverage

• Disadvantages
  - Hindering the application of conventional data detection methods

68

Limitations of Conventional Detection Methods

• Maximum likelihood (ML)
  - Computational complexity grows exponentially with the number of transmitting antennas

• K-Best and sphere decoding (SD)
  - Only favorable for small-scale MIMO
  - Suffers from excessive complexity for large-scale MIMO

• Minimum mean-squared error (MMSE)
  - Relies on the Neumann approximation
  - The approximation error is proportional to the ratio M²/N

69

Belief Propagation (BP) Algorithm

• Great advantages of BP
  - Provides better performance
  - Robust and insensitive to the selection of the initial solution
  - Matrix-inversion free

• Proposed BP
  - Symbol-based: suitable for high-order modulation
  - Real domain: reduces the constellation size
  - Balances performance and complexity

70

System Model

For a complex MIMO model with M transmitting (Tx) and N receiving (Rx) antennas, the received vector r is obtained by

$$r = Hx + n$$

• $r = [r_1, r_2, \dots, r_N]^T$, $x = [x_1, x_2, \dots, x_M]^T \in \Theta^M$
• $H = \{h_{i,j}\}_{1\le i\le N,\ 1\le j\le M}$ is a complex-valued channel matrix.
• $n = [n_1, n_2, \dots, n_N]^T$ is the additive white Gaussian noise (AWGN) with $n_i \sim \mathcal{CN}(0, \sigma^2)$, $1 \le i \le N$.
• The complex constellation $\Theta$ is composed of $Q = \|\Theta\| = 2^{M_c}$ distinct points.

71
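A minimal numpy sketch of generating one observation r = Hx + n. The QPSK constellation and the SNR-to-noise-variance mapping are assumptions chosen only for illustration.

```python
import numpy as np

def mimo_observation(M=8, N=32, snr_db=10.0, seed=0):
    """Generate r = Hx + n for an i.i.d. Rayleigh channel with QPSK symbols."""
    rng = np.random.default_rng(seed)
    qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
    x = qpsk[rng.integers(0, 4, M)]
    H = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    sigma2 = M / 10 ** (snr_db / 10)        # illustrative noise variance for unit-energy symbols
    n = np.sqrt(sigma2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
    return H @ x + n, H, x
```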

Channel Model

A correlated model is expressed by

$$H = R_r^{1/2}\, H_w\, R_t^{1/2}$$

• $R_r^{1/2} \in \mathbb{C}^{N\times N}$: correlation between Rx antennas
• $R_t^{1/2} \in \mathbb{C}^{M\times M}$: correlation between Tx antennas
• $H_w$: i.i.d. real Gaussian distributed

72

Channel Model

Three kinds of correlated channels

• Only correlation among Rx antennas considered: $H = R_r^{1/2}\, H_w\, \Sigma_t^{1/2}$
• Only correlation among Tx antennas considered: $H = \Sigma_r^{1/2}\, H_w\, R_t^{1/2}$
• Correlation among both Tx and Rx antennas considered: $H = R_r^{1/2}\, H_w\, R_t^{1/2}$

73
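A sketch of the Kronecker model H = R_r^{1/2} H_w R_t^{1/2}. The exponential correlation matrices R[i, j] = ρ^{|i−j|} are an assumption used here only to produce valid correlation matrices; the tutorial does not specify the correlation profile.

```python
import numpy as np

def msqrt(R):
    """Symmetric matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(R)
    return (V * np.sqrt(np.maximum(w, 0))) @ V.T

def correlated_channel(M=8, N=32, rho_t=0.5, rho_r=0.5, seed=0):
    """Kronecker model: Rr^{1/2} @ Hw @ Rt^{1/2} with exponential correlation."""
    rng = np.random.default_rng(seed)
    Rt = rho_t ** np.abs(np.subtract.outer(np.arange(M), np.arange(M)))
    Rr = rho_r ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
    Hw = (rng.standard_normal((N, M)) + 1j * rng.standard_normal((N, M))) / np.sqrt(2)
    return msqrt(Rr) @ Hw @ msqrt(Rt)
```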

A FG for MIMO Channels

[Factor graph: symbol nodes x₁…x_{2M} connected to observation nodes r₁…r_{2N}; a priori information flows from x_i to r_j and a posteriori information flows from r_j back to x_i.]

Figure 34: Factor graph of a real domain large-scale MIMO.

74

A Prior Information at Symbol Nodes

Soft information: $F_{x_i\to r_j} \triangleq \{p_{x_i\to r_j},\, \alpha_{x_i\to r_j}\}$

• a priori probability vector:

$$p^{(l)}_{x_i\to r_j} = \big[p^{(l)}_{i,j}(s_1), \dots, p^{(l)}_{i,j}(s_{\sqrt{Q}-1})\big]$$

• a priori LLR vector:

$$\alpha^{(l)}_{x_i\to r_j} = \big[\alpha^{(l)}_{i,j}(s_1), \dots, \alpha^{(l)}_{i,j}(s_{\sqrt{Q}-1})\big]$$

where

$$\alpha^{(l)}_{i,j}(s_k) = \ln\frac{p^{(l)}(x_i = s_k)}{p^{(l)}(x_i = s_0)},\quad k = 1, \dots, \sqrt{Q}-1.$$

75

A Posteriori Information at Observation Nodes

A posteriori information is defined as

$$\beta^{(l)}_{r_j\to x_i} = \big[\beta^{(l)}_{j,i}(s_1), \dots, \beta^{(l)}_{j,i}(s_{\sqrt{Q}-1})\big],$$

where

$$\beta^{(l)}_{j,i}(s_k) = \ln\frac{p^{(l)}(x_i = s_k \mid r_j, H)}{p^{(l)}(x_i = s_0 \mid r_j, H)}.$$

76

Symbol-Based BP Detection in Real Domain

[Factor graph as in Figure 34.]

• Step 1: message updating at observation nodes
  - Receive the a priori information from neighbouring symbol nodes
  - Update the a posteriori information
  - Pass it back to all symbol nodes

77

Symbol-Based BP Detection in Real Domain

[Factor graph as in Figure 34.]

• Step 2: message updating at symbol nodes
  - Receive the a posteriori information from neighbouring observation nodes
  - Update the a priori information
  - Pass it back to all connected observation nodes

78

Numerical Results with QPSK Modulation

[BER vs Eb/N0 plot: SISO AWGN bound, MMSE, general SE-BP, and the proposed BP, all with M = N = 16.]

Figure 35: Performance comparison of different detection methods for an i.i.d. Rayleigh fading channel (QPSK).

79

Numerical Results with 16-QAM Modulation

[BER vs Eb/N0 plot: SISO AWGN bound, MMSE, the approximate method (k = 3, 4), and the proposed BP for (M, N) = (8, 128), (8, 32), and (32, 64).]

Figure 36: Performance comparison of approximate MMSE and BP detections in an i.i.d. Rayleigh channel (16-QAM).

80

Numerical Results with 16-QAM Modulation

[BER vs Eb/N0 plot for M = 8, N = 32: MMSE, BP, and damped BP under i.i.d. Rayleigh, Rx-correlated, Tx-correlated, and Rx–Tx-correlated channels, plus the SISO AWGN bound.]

Figure 37: Performance comparison of BP detections for all kinds of MIMO channels with M = 8, N = 32 (16-QAM).

81

Numerical Results with 16-QAM Modulation

[BER vs Eb/N0 plot for M = 8, N = 32: general SE-BP vs. the proposed BP (and their damped variants) under i.i.d. Rayleigh, Rx-correlated, Tx-correlated, and Rx–Tx-correlated channels.]

Figure 38: Comparison of general and proposed BP detections for MIMO channels with M = 8, N = 32 (16-QAM).

82

Half Time Break

83

Deep Learning Based Baseband Signal Processing.

Applications of Deep Learning

84

Deep Learning in communication systems

Satellite communications Vehicle to everything Smart devices

Internet of things 5G networks

85

Deep Learning in communication systems

• Motivation
  - Problems hard to model mathematically
  - Advanced deep learning techniques
  - Near-optimal performance
  - Uniform architecture for various modules

• Challenges
  - Joint optimization of multiple modules
  - High training complexity
  - Unfriendly hardware architecture

86

Our work

Deep learning in the baseband

• Uniform baseband accelerator based on BP
• Massive MIMO detection with DNN
• DNN-based polar code decoder
• A CNN channel equalizer

87

Deep Learning techniques

• A model of machine learning
• Uses multiple layers of neural networks
• Learns data representations with multiple levels of abstraction using a training set

88

Deep Neural Network (DNN)

• Deep neural network model:

$$y = f(x_0; \theta)$$

• Mapping function of the l-th layer:

$$x_l = f^{(l)}(x_{l-1}; \theta_l),\quad l = 1, \dots, L$$

[Feedforward DNN: input data X passes through an input layer, hidden layers with weights W¹, W², W³, and an output layer producing Y; training chooses the weights to minimize Σₙ‖Dₙ − Yₙ‖² against the desired data D.]

Figure 39: Multi-layer architecture of a feedforward DNN.

89

Convolutional Neural Network (CNN)

• A class of feed-forward DNN
• Convolutional layers build feature maps:

$$c_{i,j} = \mathrm{ReLU}(h_{i,j} \star x + b_{i,j})$$

• Reduces the number of free parameters

Figure 40: The typical architecture of a CNN.

90

Contents

Deep Learning Based Baseband Signal Processing

Improving BP MIMO Detector with Deep Learning

Polar BP Decoder Based on Deep Learning

Convolutional Neural Network Channel Equalizer

91

Belief Propagation MIMO Detector

• Belief propagation (BP) detector
  - Pros: good BER performance, robustness, lower complexity
  - Cons: loopy factor graph, convergence rate

[Factor graph: symbol nodes x₁…x_n and observation nodes y₁…y_n exchanging a priori and a posteriori information.]

Figure 41: Factor Graph of a large MIMO system.

92

Modified BP Algorithms

• Damped BP: mitigate loopiness by message damping. Iterative updating rules:

$$\beta^{(l)}_{ji}(s_k) = \log\frac{p^{(l-1)}(x_i = s_k \mid y_j, H)}{p^{(l-1)}(x_i = s_1 \mid y_j, H)},\qquad \alpha^{(l)}_{ij}(s_k) = \sum_{t=1,\ t\ne j}^{N} \beta^{(l)}_{ti}(s_k),$$

$$p^{(l)}_{ij}(x_i = s_k) = \frac{\exp\big(\alpha^{(l)}_{ij}(s_k)\big)}{\sum_{m=1}^{K}\exp\big(\alpha^{(l)}_{ij}(s_m)\big)},\qquad p^{(l)}_{ij} \Leftarrow (1-\delta)\, p^{(l)}_{ij} + \delta\, p^{(l-1)}_{ij}.$$

93
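The damping step is a convex combination of the new and previous messages. A minimal sketch of one symbol-node update with a fixed damping factor (the value 0.3 and the function name are illustrative):

```python
import numpy as np

def damped_symbol_update(alpha, p_prev, delta=0.3):
    """Softmax of the accumulated LLRs alpha_ij(s_k), then the damped
    combination with the previous iteration's message."""
    a = alpha - alpha.max()                  # numerical stabilization
    p_new = np.exp(a) / np.exp(a).sum()      # p_ij^{(l)}(x_i = s_k)
    return (1.0 - delta) * p_new + delta * p_prev
```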

Multiscale Modified BP Algorithms

• Damped BP: mitigate loopiness by message damping. Iterative updating rules:

$$\beta^{(l)}_{ji}(s_k) = \log\frac{p^{(l-1)}(x_i = s_k \mid y_j, H)}{p^{(l-1)}(x_i = s_1 \mid y_j, H)},\qquad \alpha^{(l)}_{ij}(s_k) = \sum_{t=1,\ t\ne j}^{N} \beta^{(l)}_{ti}(s_k),$$

$$p^{(l)}_{ij}(x_i = s_k) = \frac{\exp\big(\alpha^{(l)}_{ij}(s_k)\big)}{\sum_{m=1}^{K}\exp\big(\alpha^{(l)}_{ij}(s_m)\big)},\qquad p^{(l)}_{ij} \Leftarrow \big(1-\delta^{(l)}_{ij}\big)\, p^{(l)}_{ij} + \delta^{(l)}_{ij}\, p^{(l-1)}_{ij}.$$

• Do multiscaled factors lead to better performance?
• How to find the optimal damping factors $\delta^{(l)}_{ij}$?

94

Multiscale Modified BP Algorithms

• Max-Sum (MS) BP: further reduced complexity by eliminating the division. Iterative updating rules:

$$\beta^{(l)}_{ji}(s_k) = \log\frac{p^{(l-1)}(x_i = s_k \mid y_j, H)}{p^{(l-1)}(x_i = s_1 \mid y_j, H)},\qquad \alpha^{(l)}_{ij}(s_k) = \sum_{t=1,\ t\ne j}^{N} \beta^{(l)}_{ti}(s_k),$$

$$p^{(l)}_{ij}(x_i = s_k) = \exp\Big(\alpha^{(l)}_{ij}(s_k) - \max_{s_m\in\Omega}\alpha^{(l)}_{ij}(s_m)\Big),\qquad p^{(l)}_{ij} \Leftarrow (1-\delta)\,\lambda\, p^{(l)}_{ij} - \omega + \delta\, p^{(l-1)}_{ij}.$$

95

Multiscale Modified BP Algorithms

• Max-Sum (MS) BP: further reduced complexity by eliminating the division. Iterative updating rules:

$$\beta^{(l)}_{ji}(s_k) = \log\frac{p^{(l-1)}(x_i = s_k \mid y_j, H)}{p^{(l-1)}(x_i = s_1 \mid y_j, H)},\qquad \alpha^{(l)}_{ij}(s_k) = \sum_{t=1,\ t\ne j}^{N} \beta^{(l)}_{ti}(s_k),$$

$$p^{(l)}_{ij}(x_i = s_k) = \exp\Big(\alpha^{(l)}_{ij}(s_k) - \max_{s_m\in\Omega}\alpha^{(l)}_{ij}(s_m)\Big),\qquad p^{(l)}_{ij} \Leftarrow \big(1-\delta^{(l)}_{ij}\big)\,\lambda^{(l)}_{ij}\, p^{(l)}_{ij} - \omega^{(l)}_{ij} + \delta^{(l)}_{ij}\, p^{(l-1)}_{ij}.$$

• Do multiscaled factors lead to better performance?
• How to find the optimal $\delta^{(l)}_{ij}$, $\lambda^{(l)}_{ij}$ and $\omega^{(l)}_{ij}$?

96

The Framework to Build DNN from BP

[The loopy factor graph of the BP detector (symbol nodes x₁…x_n and observation nodes y₁…y_n exchanging a priori / a posteriori information) is unrolled into a feedforward network with an input layer, hidden layers, and an output layer.]

97

Converting BP to DNN

Table 3: Mapping BP factor graph (FG) to DNN

BP FG                                    DNN
Nodes                                    Neurons
Received signals                         Input data x
Transmitted signals                      Output data y
l-th iteration                           l-th hidden layer
Belief messages α(l), β(l), p(l)         Hidden signals x_l
Message updating rules                   Mapping functions of layers
Modification factors δ, λ, ω             Parameters θ

98
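Following this mapping, the unfolded detector can be written directly in TensorFlow (the platform listed in the training details): each hidden layer applies one BP iteration and owns its trainable damping factors, initialized to 0.5. This is a simplified sketch under stated assumptions; `bp_layer` stands in for the actual message-update rules and is not the tutorial's implementation.

```python
import tensorflow as tf

def build_unfolded_detector(num_layers, num_edges, bp_layer):
    """DNN-dBP sketch: unfold num_layers BP iterations, with one trainable
    damping factor per edge and per layer (initialized to 0.5)."""
    deltas = [tf.Variable(0.5 * tf.ones([num_edges]), name=f"delta_{l}")
              for l in range(num_layers)]

    def forward(y, p0):
        p_prev = p0
        for d in deltas:
            p_new = bp_layer(y, p_prev)                 # one BP message update
            p_prev = (1.0 - d) * p_new + d * p_prev     # learned damping
        return p_prev                                   # soft symbol outputs

    return deltas, forward
```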

Architecture of the DNN MIMO Detector

[Unrolled network: input layer, hidden layers grouped so that each group of layers corresponds to one BP iteration step, and an output layer.]

Figure 42: The architecture of the DNN detector with 3 BP iterations.

99

Unfolded one full BP iteration in the DNN

Figure 43: One full iteration in the DNN with Tx = Rx = 4 and BPSK modulation.

100

Proposed DNN MIMO Detectors

Table 4: Summary of the proposed DNN MIMO detectors: DNN-dBP and DNN-MS.

Method                    DNN-dBP                DNN-MS
The iterative algorithm   Damped BP              Max-Sum BP
Training parameters Δ     δ                      δ, λ, ω
Inputs                    y, δ(0), p(0)_ij       y, δ(0), λ(0), ω(0), p(0)_ij
Mapping functions f(l)    The updating rules     The updating rules

Loss function (both): $L(x, O) = -\dfrac{1}{M}\sum_{i=1}^{M}\sum_{k=1}^{K} x_i(s_k)\log\big(O_i(s_k)\big)$

101

Training Details

• Configuration: Tx × Rx = 8 × 32
• SNR: 0, 5, 10, 15, 20 dB
• Optimization method: mini-batch stochastic gradient descent (SGD)
• Number of layers: selected by pre-simulation
• Platform: TensorFlow⁵
• Learning rate = 0.01 with Adam optimization⁶
• Parameters initialized to 0.5

5 M. Abadi et al., TensorFlow: large-scale machine learning on heterogeneous systems, software available from tensorflow.org, 2015.
6 D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

102

Experiment: DNN-dBP and DNN-MS in i.i.d. channels

[BER vs average received SNR plot: SD, MMSE, BP, DNN-dBP, HAD, MS, DNN-MS, AMP, SDR, and the SISO AWGN bound.]

Figure 44: Performance comparison of SD, MMSE, BP, DNN-dBP, HAD, MS and DNN-MS in i.i.d. Rayleigh channels with asymmetric antenna configuration (M = 8, N = 32).

103

Experiment: DNN-dBP in correlated channels

[BER vs average received SNR plot: BP, HAD, DNN-dBP, and MMSE under Rx-correlated, Tx-correlated, and Rx–Tx-correlated channels.]

Figure 45: Performance comparison of MMSE, BP, DNN-dBP and HAD in different correlated channels with asymmetric antenna configuration (M = 8, N = 32).

104

Experiment: DNN-MS in correlated channels

[BER vs average received SNR plot: BP, MS, and DNN-MS under Rx-correlated, Tx-correlated, and Rx–Tx-correlated channels.]

Figure 46: Performance comparison of MMSE, BP, MS and DNN-MS in different correlated channels with asymmetric antenna configuration (M = 8, N = 32).

105

Contents

Deep Learning Based Baseband Signal Processing

Improving BP MIMO Detector with Deep Learning

Polar BP Decoder Based on Deep Learning

Convolutional Neural Network Channel Equalizer

106

Polar BP Decoding

[Factor graph of polar codes with N = 8: information bits u₁…u₈ on the left connected through three stages of “+” and g nodes to the code bits x₁…x₈ on the right.]

Figure 47: Factor graph of polar codes with N = 8.

107

Polar BP Decoding

• Left-to-right and right-to-left propagations
• $L^{(t)}_{i,j}$ denotes the left-to-right propagation at the j-th node of the i-th stage during the t-th iteration.
• The iterative updating rules associated with $L^{(t)}_{i,j}$ and $R^{(t)}_{i,j}$:

$$\begin{aligned}
L^{(t)}_{i,j} &= g\big(L^{(t)}_{i+1,2j-1},\; L^{(t)}_{i+1,2j} + R^{(t)}_{i,j+N/2}\big),\\
L^{(t)}_{i,j+N/2} &= g\big(R^{(t)}_{i,j},\; L^{(t)}_{i+1,2j-1}\big) + L^{(t)}_{i+1,2j},\\
R^{(t)}_{i+1,2j-1} &= g\big(R^{(t)}_{i,j},\; L^{(t-1)}_{i+1,2j} + R^{(t)}_{i,j+N/2}\big),\\
R^{(t)}_{i+1,2j} &= g\big(R^{(t)}_{i,j},\; L^{(t-1)}_{i+1,2j-1}\big) + R^{(t)}_{i,j+N/2},
\end{aligned}$$

where $g(x, y) = \ln\dfrac{1 + e^{x+y}}{e^x + e^y}$.

108

Min-sum Decoding

• The computational complexity of the exponential and logarithm functions is prohibitive.
• A low-complexity min-sum approximation is introduced:

$$g(x, y) \approx \mathrm{sign}(x)\,\mathrm{sign}(y)\,\min(|x|, |y|)$$

• About 0.4 dB degradation from the min-sum approximation for long codes.

109

Scaled/Offset Min-sum Decoding

• To compensate for the degradation of MS decoding, scaled/offset min-sum is proposed:

$$g(x, y) \approx \alpha\cdot \mathrm{sgn}(x)\,\mathrm{sgn}(y)\,\min(|x|, |y|),$$
$$g(x, y) \approx \mathrm{sgn}(x)\,\mathrm{sgn}(y)\,\max\big(\min(|x|, |y|) - \beta,\; 0\big).$$

• The scaling and offset factors play key roles in performance.
• How to obtain the optimal parameters?

110
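Both approximations are one-liners. In the sketch below the default values are only the example factors that appear later in Figure 51 (α = 0.9375, β = 0.25) and are not claimed to be optimal.

```python
import numpy as np

def scaled_min_sum(x, y, alpha=0.9375):
    """Scaled min-sum: alpha * sgn(x) sgn(y) min(|x|, |y|)."""
    return alpha * np.sign(x) * np.sign(y) * np.minimum(np.abs(x), np.abs(y))

def offset_min_sum(x, y, beta=0.25):
    """Offset min-sum: sgn(x) sgn(y) max(min(|x|, |y|) - beta, 0)."""
    mag = np.maximum(np.minimum(np.abs(x), np.abs(y)) - beta, 0.0)
    return np.sign(x) * np.sign(y) * mag
```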

Network Architectures

• Unfolding iterative polar decoder into recurrent structure.

Figure 48: Recurrent architecture of neural network decoder.

111

Training Details

• Optimization method: mini-batch stochastic gradient descent (SGD)
• Learning rate = 0.01 with Adam optimization⁷
• Cross-entropy loss function:

$$L(\hat{y}, y) = -\frac{1}{N}\sum_{i=1}^{N}\big[u_i\log(o_i) + (1 - u_i)\log(1 - o_i)\big].$$

• All-zero codewords with AWGN noise at a single SNR (SNR = 1 dB).

7 D. P. Kingma and J. Ba, “Adam: a method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

112

Results

[Normalized loss and the left-to-right (L2R) and right-to-left (R2L) scaling factors plotted over 100 training epochs.]

Figure 49: Evolution of scaling factors for (1024, 512) polar code.

113

Results

[Normalized loss and the left-to-right (L2R) and right-to-left (R2L) offset factors plotted over 100 training epochs.]

Figure 50: Evolution of offset factors for (1024, 512) polar code.

114

Results

[BER vs Eb/N0 plot: BP, approximate MS, 2-D normalized MS with α = [0.875, 0.9375], and 2-D offset MS with β = [0.0, 0.25].]

Figure 51: Performance comparison of BP decoding for (1024, 512) polar codes.

115

VLSI Architecture

• Offset MS only requires one extra subtraction −→ more friendly to hardware design

• Bi-directional updating −→ reduces decoding latency by 50%

[(a) Modified offset PE: q-bit inputs X and Y pass through a comparator and an offset unit to produce Z. (b) Diagram of the polar BP decoder: PE arrays and routing networks between the channel output and the memory holding the L and R messages.]

Figure 52: Hardware architecture of polar BP decoder.

116

Contents

Deep Learning Based Baseband Signal Processing

Improving BP MIMO Detector with Deep Learning

Polar BP Decoder Based on Deep Learning

Convolutional Neural Network Channel Equalizer

117

Channel Equalization

• Inter-symbol interference (ISI): the channel with ISI is modeled as a finite impulse response (FIR) filter h. The signal with ISI is the convolution of the channel input with the FIR filter:

$$v = s \otimes h.$$

• Nonlinear distortion: the nonlinearities in the communication system are mainly caused by amplifiers and mixers:

$$r_i = g[v_i] + n_i.$$

118

Maximum Likelihood Equalizer

• Estimation: use the training sequence s to estimate the channel coefficients h that maximize the likelihood:

$$\hat{h}_{ML} = \arg\max_{h}\; p(r \mid s, h).$$

• BCJR: use the BCJR algorithm to find the codeword that maximizes the a posteriori probability:

$$p(s_i = s \mid r, \hat{h}_{ML}),\quad i = 1, 2, \dots, N.$$

• Drawbacks
  - Good performance for ISI, but poor for nonlinear distortion
  - Requires accurate a priori information about the channel
  - Prohibitive O(n²) complexity

119

Proposed CNN Equalizer

[System model: channel encoder → channel H(z) with nonlinearity g(v) → received signal r → CNN equalizer (stacked Conv + ReLU layers) → NN decoder → ŝ.]

Figure 53: System model.

• Fully convolutional neural network with 1-D convolutions:

$$y_{i,j} = \sigma\Big(\sum_{c=1}^{C}\sum_{k=1}^{K} W_{i,c,k}\, x_{c,k+j} + b_i\Big),$$

• Train the parameters to maximize the likelihood:

$$\hat{\theta} = \arg\max_{\theta}\; p(\hat{s} = s \mid r, \theta).$$

120

Training Details

• CNN structure: 6 layers with 1 × 3 filters
• Learning rate = 0.001
• Mean squared error (MSE) loss function:

$$L(\hat{s}, s) = \frac{1}{N}\sum_{i}|\hat{s}_i - s_i|^2,$$

• Random codewords from SNR 0 dB to 11 dB
• Weight initialization: $\mathcal{N}(\mu = 0, \sigma = 1)$

121
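A Keras sketch matching the listed setup (6 convolutional layers with size-3 kernels, MSE loss, Adam with learning rate 0.001). The number of filters per hidden layer is an assumption, since the slides do not specify it.

```python
import tensorflow as tf

def build_cnn_equalizer(seq_len, filters=16):
    """6-layer 1-D CNN equalizer sketch: ReLU on the hidden layers, linear output."""
    inputs = tf.keras.Input(shape=(seq_len, 1))
    x = inputs
    for _ in range(5):
        x = tf.keras.layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    outputs = tf.keras.layers.Conv1D(1, 3, padding="same")(x)   # 6th (linear) layer
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model
```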

Results on Linear Channel with ISI

• Channel coefficients:

$$H(z) = 0.3482 + 0.8704\,z^{-1} + 0.3482\,z^{-2}.$$

122

Results on Nonlinear Channel with ISI

• Channel coefficients:

$$H(z) = 0.3482 + 0.8704\,z^{-1} + 0.3482\,z^{-2}.$$

• Nonlinear distortion:

$$|g(v)| = |v| + 0.2|v|^2 - 0.1|v|^3 + 0.5\cos(\pi|v|).$$

123
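A numpy sketch of generating data for this channel: the FIR filter above followed by the memoryless nonlinearity and AWGN. Applying the nonlinearity to the magnitude with the sign preserved (for real-valued BPSK input) and the noise scaling are assumptions for illustration only.

```python
import numpy as np

def nonlinear_isi_channel(s, snr_db=8.0, seed=0):
    """FIR channel 0.3482 + 0.8704 z^-1 + 0.3482 z^-2, memoryless nonlinearity, AWGN."""
    rng = np.random.default_rng(seed)
    h = np.array([0.3482, 0.8704, 0.3482])
    v = np.convolve(s, h, mode="same")
    mag = np.abs(v)
    g = mag + 0.2 * mag**2 - 0.1 * mag**3 + 0.5 * np.cos(np.pi * mag)
    r = np.sign(v) * g                                  # sign preserved (assumption)
    sigma = np.sqrt(0.5 / 10 ** (snr_db / 10))          # illustrative noise scaling
    return r + sigma * rng.standard_normal(len(s))

# Example: BPSK input
s = 1 - 2 * np.random.default_rng(1).integers(0, 2, 1000).astype(float)
r = nonlinear_isi_channel(s)
```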

Results on Joint DNN-CNN Detector

[BER vs Eb/N0 plot (1–11 dB) comparing GPC+SC, DL, CNN+NND, and CNN+NND-Joint.]

• Outperforms GPC+SC.

• Only requires about 30% of the parameters of the DL method.

124

Conclusion

With deep learning techniques we achieve:

• A framework to design DNNs by unfolding BP
• A CNN channel equalizer
• Performance improvements
• Uniform architecture for baseband modules
• Efficient hardware implementations

125

Q & A

125