OFDM Physical Layer Architecture and Real-Time Multi-Path Fading Channel Emulation for the 3GPP Long Term Evolution Downlink

by

Elliot Briggs, B.S.E.E., M.S.E.E.

A Dissertation

In

Electrical Engineering

Submitted to the Graduate Faculty of Texas Tech University in

Partial Fulfillment of the Requirements for

the Degree of

Doctor of Philosophy

Approved

Dr. Brian Nutter

Dr. Tanja Karp

Dr. Sunanda Mitra

Dr. Dominick Casadonte
Interim Dean of the Graduate School

December, 2012

Texas Tech University, Elliot Briggs, December 2012

Contents

List of Tables

List of Figures

Nomenclature

Abstract

1 Preface
    1.1 Background
    1.2 Acknowledgments
    1.3 Organization

2 Introduction
    2.1 OFDM Receiver Synchronization and Equalization
    2.2 System Architecture for Real-Time Multi-Path Wireless Channel Emulation
    2.3 Cyclic Prefix Redundancy Combination and Arbitrary-Ratio Resampling
        2.3.1 Cyclic Prefix Redundancy Combination with LTE Context
        2.3.2 Arbitrary-Ratio Resampling Using a Reformulated Farrow Filter
    2.4 Contributions

3 OFDM Receiver Synchronization
    3.1 OFDM System Model
    3.2 Timing Synchronization
    3.3 Sampling Clock Frequency Offset and Symbol Timing Correction: A Joint Effort
    3.4 Time-Domain Detection of the LTE Primary Synchronization Signal
    3.5 Concluding Remarks

4 OFDM Channel Estimation and Equalization
    4.1 Linear Regression Techniques for Channel Estimation
    4.2 The Missing Link: Frequency-Time Interpolation
    4.3 Reference Symbol Arrangements and Their Relationship with Timing Synchronization
    4.4 Concluding Remarks

5 Resampling Techniques Using Locally Weighted Linear Regression

6 Exploitation of Excess Cyclic Prefix to Improve Reception Quality
    6.1 Concluding Remarks

7 Real-Time Wireless Channel Emulation
    7.1 Real-Time Multi-Path SISO Channel Emulation
        7.1.1 Stochastic Jakes Process Generation
        7.1.2 Arbitrary-Ratio Upsampler Design: User-Variable Doppler
    7.2 Real-Time Multi-Path MIMO Channel Emulation
    7.3 Implementation in FPGA Hardware
    7.4 Concluding Remarks

8 Conclusions

A Generic Multicarrier System Model
    A.1 Linear Transforms and Basis Functions
    A.2 Serial-to-Parallel and Mapping
    A.3 The Stationary AWGN Channel

B OFDM System Model

References


List of Tables

1 Root Indices for the LTE Primary Synchronization Signal
2 Multi-Rate PSS Detector Computation and Coefficient Storage Requirements
3 Dyadic Cascaded Upsampler Computation and Coefficient Storage Requirements
4 Breakdown of Computations of Overlap-Add PSS detection as Implemented by MATLAB’s “fftfilt” function
5 Computational Breakdown for Online Computation of Locally Weighted Linear Regression (m = 2)
6 Coefficients of Designed Dyadic Linear Phase Half-Band IIR Upsampler
7 IIR and FIR Interpolation Performance Comparison
8 Workload and Coefficient Storage Breakdown for a Single Variable-Rate Channel Coefficient Generator
9 FPGA Resource Consumption for a Single Channel Matrix Generator, Excluding WGN Source (f_s = 200 MHz)


List of Figures

1 FIR Filter That Models the Effects of Fractional Timing Error (M = 256)
2 Illustration of “Left” and “Right” Symbol Timing Error Positions
3 SIR vs. Symbol Timing Error m - Left and Right Errors
4 Data-Driven Critical Value for Symbol Timing Estimates
5 SNR Degradation vs. Subcarrier Index
6 SNR Degradation vs. SCO with Varying E_s/N_0
7 Receiver Architecture Capable of Synchronizing SCO and Symbol Timing
8 Received OFDM Signal Afflicted with 40 ppm of SCO: Effects on SNR and Symbol Timing
9 Received OFDM Signal Afflicted with -40 ppm of SCO After Resampling Using Measured Timing Drift Rate and Feedback Control Technique
10 Received OFDM Signal Afflicted with -40 ppm of SCO and EVA-200 Channel Model: Successful SCO Detection and Correction Using Only Time-Domain Information in High Mobility Channel Conditions
11 Cyclic Correlation Properties Between the LTE Downlink PSS ZC Sequences
12 PSS Position in an LTE Downlink OFDM Symbol
13 Linear Correlation Properties Between the LTE Downlink PSS ZC Sequences
14 Alias Zones Introduced by Initial 2x Downsampling in an LTE PSS Symbol (N_FFT = 256)
15 Frequency Response of LTE Test Signal Overlaid with Oversampled PSS Matched Filter (N_FFT = 256)
16 Dyadic Downsampling Filter Response Overlay
17 Dyadic Downsampling Filter: Cascaded Frequency Response
18 Dyadic Downsampling Filter with Matched Filter Bank: Implemented Structure
19 Dyadic Upsampling Filter Response Overlay
20 Dyadic Upsampling Filter: Cascaded Frequency Response
21 Multi-Rate PSS Detection Algorithm: Implemented Processing Structure
22 Multi-Rate vs. Overlap-Add PSS Correlation
23 (no caption given)
24 LMS Equalizer Results: Equalization Coefficients (top), Per-Channel Squared Error (bottom)
25 Exponential Weighting Kernel with Varying τ Parameter
26 Overlaid Locally Weighted Regression Results with Varying τ Kernel Parameter
27 MSE vs. Model Parameter τ
28 LWR Experiment: Mean-Squared Error vs. τ vs. N
29 MSE vs. Model Parameter τ: i.i.d. Trials with i.i.d. AWGN
30 Finding the Abscissa of a Quadratic Function’s Minimum Using Inverse Quadratic Interpolation
31 Finding the Abscissa of an Arbitrary Function’s Local Minimum Using the Successive Inverse Quadratic Minimum Finding Technique
32 Successive Inverse Quadratic Interpolation Minimum Finding Algorithm Finding the Minimum Across the Error Surface of the LWR Kernel Parameter Sweeps
33 Frequency-Staggered, Time-Spaced Reference Symbol Orientation in the “Extended” and “Normal” CP Modes Used in the LTE Downlink
34 Valid Output Samples of a Rate-6 Polyphase Upsampler Overlaid on the Known Channel Magnitude
35 Interpolation and Gap-Filling of a Periodic Signal using the Extended DFT Algorithm
36 Cubic Spline Interpolation/Extrapolation Along the Frequency Dimension (Step 1)
37 Cubic Spline Interpolation/Extrapolation Along the Time Dimension (Step 2)
38 MSE of Cubic Spline Interpolation/Extrapolation Operating Under the EVA Channel Model
39 Comparison Between LWR and LS Algorithms Applied to LTE RS configuration with Cubic Splines Interpolator, EVA Channel Model, σ^2 = .005
40 An Example Comparison of LWR vs. LS Equalization: QPSK Modulated Data, EPA-5 Channel Model
41 MSE Performance Comparison of LS and LWR Channel Estimators Using LTE Channel Models
42 Simultaneous Data Smoothing and (4x) Interpolation Using the LWR Algorithm (m = 4)
43 Farrow Filter Structure Derived from the LWR Algorithm
44 Q Matrix Row-Wise Taps (top row), Q Matrix Row-Wise Frequency Responses (bottom row), m = 5, p = 8, β = 30
45 Generated CVFD (Farrow) Filter’s Group Delay vs. ∆
46 Generated CVFD (Farrow) Filter’s Magnitude and Phase vs. ∆
47 Q Matrix Row-Wise Taps (top row), Q Matrix Row-Wise Frequency Responses (bottom row), m = 5, p = 24, β = 250
48 Q Matrix Row-Wise Taps (top row), Q Matrix Row-Wise Frequency Responses (bottom row), m = 5, p = 24, β = 14
49 Generated CVFD (Farrow) Filter’s Group Delay vs. ∆: m = 5, p = 24, β = 14
50 Generated CVFD (Farrow) Filter’s Magnitude vs. ∆: m = 5, p = 24, β = 14, Useful for Simultaneous Interpolation and Smoothing
51 Sidelobes Resulting from CVFD Rate Transition with Varying Levels of Input Oversampledness (m = 5, p = 8, β = 30)
52 Generalized CVDF (Farrow) Based Arbitrary-Ratio Upsampler Using 8x Polyphase Upsampling Preprocessor
53 Inconvenient-Rate Resampling for an LTE or UMTS System to 100 MHz Sampling Rate from 30.72 MHz
54 Farrow-Based LTE Resampling Filter
55 Receiver Architecture: Combining CP Redundancy in the Time Domain
56 SNR Enhancement Using CP Redundancy in AWGN Channel and Multi-Path Channel Conditions
57 Time-Varying SISO Channel Model
58 Designed Jakes FIR Filter: N_Jakes = 256, f_max = 100 Hz, f_d = .8
59 Single MACC Element Jakes Filter Processing p Complex Jakes Processes
60 Arbitrary-Ratio Resampler Architecture
61 Dyadic Half-Band 4x FIR Upsampler - Overlaid and Cascaded Frequency Response
62 Dyadic Half-Band 2x FIR Upsampler - Implementation Structure
63 Second Order Type-1 and Type-2 All-Pass Sections
64 An Example of Cascaded Half-Band IIR Upsamplers Constructed Using Cascaded 2nd-Order All-Pass Sections
65 Dyadic Half-Band Linear Phase 4x IIR Upsampler - Overlaid and Cascaded Frequency Response
66 Prototype Filter for 32x Upsampler: Exploiting the Oversampled Input Signal (5x magnification)
67 Dual Polyphase Filter Arbitrary-Ratio Resampler: Dual Commutator Traversal States with Extended Shift Register Positioning
68 Arbitrary-Ratio Upsampler: Rate-32 Polyphase Upsampling with Linear Interpolators
69 Frequency Response of the Cascaded Jakes and Arbitrary-Ratio Resampler: δ = 0.0225
70 Tapped Delay Line MIMO Channel Model
71 Channel Matrix Generator System Diagram
72 Hardware Matrix Multiplication Operation for Correlating i.i.d. Jakes Processes
73 Test Configuration of the Implemented Channel Emulator
74 Hardware-Sourced Jakes Impulse Response from MIMO Emulator
75 Two-Element Vector Defined in the Orthonormal Basis (x_0, x_1)
76 Two-Element Vector Redefined in the Orthonormal Basis (y_0, y_1)
77 x and y orthonormal basis vectors defined in the x basis
78 QAM Constellations: QPSK and 16QAM
79 Generic System Model
80 Effect of a noisy channel with memory: OFDM example
81 Illustration of the Cyclic Prefix in an OFDM signal
82 Generic OFDM System Model (Equalization Component not Shown)


Nomenclature

Acronyms

3GPP 3rd Generation Partnership Project

ACF Autocorrelation Function

ADC Analog to Digital Converter

ALU Arithmetic Logic Unit

AWGN Additive White Gaussian Noise

CCF Cross Correlation Function

CFO Carrier Frequency Offset

CP Cyclic Prefix

CVDF Continuously Variable Delay Filter

DAC Digital to Analog Converter

DDS Direct Digital Synthesis

DFT Discrete Fourier Transform

DRAM Dynamic Random Access Memory

DSP Digital Signal Processing

DTFT Discrete-Time Fourier Transform

EDFT Extended Discrete Fourier Transform

EPA Extended Pedestrian A

ETU Extended Typical Urban

EVA Extended Vehicular A

FBMC Filter Bank Multicarrier

FDD Frequency Division Duplexing

FFT Fast Fourier Transform

FIFO First In First Out

FIR Finite Impulse Response

flop Floating-Point Operation

FPGA Field Programmable Gate Array

HARQ Hybrid Automatic Repeat Request

i.i.d. Independent and Identically Distributed

ICI Inter-Carrier Interference

IIR Infinite Impulse Response

ISI Inter-Symbol Interference

LDPC Low Density Parity Check

LMMSE Linear Minimum Mean Squared Error

LMS Least Mean Squares

LS Least Squares

LTE Long Term Evolution

LWR Locally Weighted Regression

MACC Multiply Accumulate

MATLAB Matrix Laboratory

MF Matched Filter

ML Maximum Likelihood

MMSE Minimum Mean Squared Error

MSE Mean Squared Error

MUX Multiplexer

NLMS Normalized Least Mean Squares

OFDM Orthogonal Frequency Division Multiplexing

PDP Power Delay Profile

PHY Physical Layer

ppm Parts Per Million

PRACH Physical Random Access Channel

PSD Power Spectral Density

PSS Primary Synchronization Signal

QAM Quadrature Amplitude Modulation

QPSK Quadrature Phase Shift Keying

RAM Random Access Memory

RLS Recursive Least Squares

ROM Read-Only Memory

RS Reference Symbol

SCO Sampling Clock Offset

SERDES Serializer-Deserializer

SIR Signal to Interference Ratio

SNR Signal to Noise Ratio

SOS Sum of Sinusoids

SSS Secondary Synchronization Signal

TDD Time Division Duplexing

UMTS Universal Mobile Telecommunications System

WGN White Gaussian Noise

WOM Write-Only Memory

ZC Zadoff-Chu

Operator

(·)^(i) The i-th time index or iteration of a vector or matrix

(·)^H The Hermitian transpose of a vector or matrix

(·)^T The transpose of a vector or matrix

0_{M×N} An M × N rectangular matrix of zeros

0_M An M × M matrix of zeros

H^{-1} Inverse of matrix H

H_{:,j} The contents of the j-th column of matrix H

H_{i,:} The contents of the i-th row of matrix H

H_{i,j} The element located in the i-th row and j-th column of matrix H

H_{i,m:n} The contents of the m-th through the n-th column in the i-th row of matrix H

h_i The i-th element in the vector h

H_{m:n,j} The contents of the m-th through the n-th row in the j-th column of matrix H

h_{m:n} The contents of the m-th through n-th element in the vector h

H_M A square matrix H with dimensions M × M

I_M The M × M identity matrix

R_xx Autocorrelation matrix of x

R_xy Cross-correlation matrix of x and y

det[·] Determinant of a matrix

diag{·} Returns the diagonal of a matrix as a vector or constructs a diagonal matrix out of a vector

E[·] The expected value operator

min(·, ·) Minimum value of the listed arguments

∇_θ J(·) Gradient of the cost function J with respect to θ

⊗ Convolution operator or the Kronecker product of two matrices

‖·‖ The ℓ2 norm (Euclidean norm) of a vector, or the maximum singular value of a matrix

τ̂ Estimate of the variable τ

j = √(−1) The imaginary unit

r_xx Autocorrelation vector of x

r_xy Cross-correlation vector of x and y

S_xx Power Spectral Density of x

x(t) The continuous-time series x at time t

x[n] The time sequence x at index n


Abstract

This dissertation focuses on OFDM receiver algorithms, particularly receiver synchronization and channel equalization. These two topics are critical components in an LTE downlink receiver. The various aspects of receiver synchronization are presented and their impact on reception quality is quantified. Building on this information, a receiver architecture is constructed that is capable of simultaneously correcting symbol timing and sampling frequency offset using a feedback-controlled arbitrary-ratio resampler. The topic of channel estimation is presented by first investigating MMSE algorithms, leading to the more practical family of algorithms that use stochastic optimization techniques. A new family of algorithms is explored that is based on locally weighted linear regression. The regression algorithm uses an optimum parameterized kernel, found using offline training.

Throughout the dissertation, algorithms are tested using realistic models that emulate typical time-varying multi-path fading channel scenarios defined by the LTE standard for conformance testing. To perform extended simulations in real-time, a channel emulator architecture is developed, implemented, and tested in FPGA hardware. The developed architecture allows online programming of the desired spatial and temporal correlation properties of the channel and has been designed to be scalable to the desired spatial or temporal dimensions.

The primary goal of the dissertation is to offer high performance while maintaining a low-complexity, cost-effective hardware implementation. Although implementation details target an FPGA-based design, the concepts can be extrapolated to ASIC or even software-based targets.

1 Preface

1.1 Background

I was first introduced to DSP in 2005 when I took the course at Texas Tech, instructed by my committee co-chair Dr. Karp. At that point in time, I was already very enthusiastic about communication circuit design. Later, I signed up for “Modern Communications Circuits” instructed by Dr. Nutter, my committee chair, who presented the concept of software-defined radio at the end of the course. I was intrigued. I later became a graduate student studying embedded systems under Dr. Nutter, who sent me away for a six-month internship at Innovative Integration in Simi Valley, California. This is where the bits and pieces of interesting topics reached critical mass. At Innovative, I was introduced to FPGAs, spending most of my time learning how to implement DSP algorithms that run in real-time for customers’ software-defined radio projects. The connection between DSP, FPGAs and communications convinced me that I wasn’t done being a graduate student.

After a few months back at Texas Tech, I submitted a proposal to Dan McLane at Innovative to build an OFDM receiver in an FPGA, a project I thought would be challenging enough to last at least a semester or two. He had heard about this new “LTE” standard that seemed to be attracting the attention of many of his customers. Much of the work in this dissertation is a result of this project. FPGA implementation was paramount throughout the work with Innovative. Together, we developed and implemented an LTE receiver and a real-time wireless channel emulator. During this work, I became a firm believer in the “design for implementation” philosophy. The central theme of this dissertation is not only the various algorithms, but also their implementation in a real-time system.

1.2 Acknowledgments

My greatest gratitude goes to my wife, Kristin. Together, I believe we clearly demonstrate the “better than the sum of the pieces” concept. I’d like to thank her for her tolerance of my textbook addiction, especially while on vacation. I also owe a huge debt of gratitude to my committee chairs, Drs. Tanja Karp and Brian Nutter, who have withstood the inhumane job of reading my papers and answering my questions. My committee chairs have both given me great inspiration and have illustrated an impossibly high standard that I will always strive to reach. I’d also like to thank Dan McLane, the former co-founder and vice president at Innovative Integration. Without Dan’s support, I might never have reached the “critical mass” moment that I mentioned. I’d also like to thank my colleagues with whom I worked at Innovative, Amit Mane and Chunmei Kang.

1.3 Organization

This dissertation covers two main subjects. The first portion of the dissertation presents algorithms for the LTE downlink physical layer, staying true to the “design for implementation” philosophy. This portion of the dissertation is split into four chapters covering OFDM receiver synchronization, equalization, channel estimation, cyclic prefix redundancy combination, and finally, an arbitrary-ratio resampling technique. The arbitrary-ratio resampler is one of the main components in the presented receiver synchronization architecture. The second subject presents an architecture for real-time multi-path fading channel emulation for testing of MIMO-OFDM receivers in a laboratory environment. The architecture is presented as implemented in FPGA hardware, followed by hardware-derived test results. Finally, two appendices are included as an OFDM primer. The appendices establish a common conceptual and notational framework.

Throughout many of the chapters, implementation details are given with reference to the Xilinx “Virtex” family of FPGAs. These FPGAs offer tremendous compute power, but require much more design overhead than a microprocessor-based software approach. The goal of each presented technique is to minimize FPGA resource consumption in order to maximize cost-effectiveness and minimize power consumption. The developed algorithms are analyzed in terms of compute workload, usually specified in multiply-accumulates per second (MACCs/s). When available, implementation efficiency is also quantified by measuring the number of valuable FPGA resources the implementation requires. The consumption of Xilinx-specific resources, such as block RAM (BRAM) and the special DSP48E arithmetic logic units (ALUs), provides a good metric for implementation cost.

2 Introduction

The 3rd Generation Partnership Project’s “Long Term Evolution” (3GPP LTE) standard has been, and will continue to be, society-changing. As LTE is being deployed in the United States, cities are being blanketed with mobile Internet access that often exceeds the throughput of available residential DSL and cable Internet services [1–3]. The LTE standard is the first widely deployed MIMO-OFDM (multiple-input multiple-output orthogonal frequency division multiplexing) cellular air interface, offering a theoretical 1 Gbit/s throughput in its latest release, Release 10. The key enabling technology of the LTE downlink is OFDM, providing efficient multi-user spectrum utilization and offering wide bandwidth configurations that achieve tremendous throughput in the harshest mobile channel environments. OFDM is also elegantly extended to utilize MIMO techniques that dramatically increase the channel’s data-carrying capacity.

2.1 OFDM Receiver Synchronization and Equalization

Along with many benefits, OFDM brings many challenges. Successful OFDM reception depends on the special orthogonality condition that is achieved by the properties of the discrete Fourier transform (DFT). The receiver must synchronize symbol timing to prevent inter-symbol interference (ISI), and must cancel sampling and carrier frequency errors to ensure orthogonality among the subcarriers, preventing inter-carrier interference (ICI). Once synchronization provides reliable reception, the receiver must estimate the channel’s frequency response in order to equalize each subcarrier. After equalization, the receiver performs MIMO decoding, demaps the received symbols, and extracts the transmitted bits. At this point, the bits are still scrambled, interleaved, and protected by a powerful error correction code. The channel decoder undoes the scrambling and decodes the resulting binary stream. The physical layer (PHY) of the LTE downlink also includes several types of channel coding that provide tremendous error correction capability. In case of decoding failure, when the received data cannot be decoded without errors, the LTE PHY uses a hybrid automatic repeat request (HARQ) subsystem and protocol that works in tandem with the channel decoder. The HARQ mechanism requests and accepts repeated segments of parity bits. Along with the OFDM receiver, these elements comprise the “Layer 1” PHY.

The first portion of this dissertation will focus on techniques that achieve successful OFDM reception, including synchronization and equalization. A good overview of receiver synchronization is available in the following chapters, as well as in the available texts [4–7].
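The equalization step in the receive chain described above can be illustrated with a minimal NumPy sketch. It is not taken from the dissertation: the channel taps, FFT size, and CP length are hypothetical values, the channel is static and noise-free, a single symbol is sent so no previous symbol leaks into the CP, and the receiver is assumed to know the channel exactly. Under those assumptions, a CP at least as long as the channel memory turns the linear convolution into a circular one, so each subcarrier is recovered with a single complex division.

```python
import numpy as np

def ofdm_modulate(symbols, cp_len):
    """IDFT of one block of subcarrier symbols, with a cyclic prefix prepended."""
    x = np.fft.ifft(symbols)
    return np.concatenate([x[-cp_len:], x])

def ofdm_demodulate(rx, n_fft, cp_len, channel_taps):
    """Drop the CP, take the DFT, and equalize with one complex tap per subcarrier."""
    y = np.fft.fft(rx[cp_len:cp_len + n_fft])
    h_freq = np.fft.fft(channel_taps, n_fft)  # channel frequency response (assumed known)
    return y / h_freq

rng = np.random.default_rng(0)
n_fft, cp_len = 64, 16                        # illustrative sizes, not LTE parameters
qpsk = (rng.choice([-1, 1], n_fft) + 1j * rng.choice([-1, 1], n_fft)) / np.sqrt(2)
taps = np.array([1.0, 0.0, 0.5 + 0.2j, 0.0, 0.0, -0.25j])  # hypothetical multipath channel

tx = ofdm_modulate(qpsk, cp_len)
rx = np.convolve(tx, taps)                    # linear convolution with the channel, no noise
recovered = ofdm_demodulate(rx, n_fft, cp_len, taps)
max_error = np.max(np.abs(recovered - qpsk))  # limited only by floating-point rounding
```

In a real receiver the frequency response is not known and must be estimated from reference symbols, which is the subject of the channel estimation material.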

The proposed techniques use only information available in the time-domain signal. Any frequency-domain information relies on the synchronization process itself, and should not contribute as a primary information source. Using time-domain observations, a receiver architecture has been developed that simultaneously corrects sampling frequency errors and symbol timing. The architecture utilizes a special feedback-controlled arbitrary-ratio resampling technique that is developed in its own chapter. To enhance synchronization performance in the LTE downlink, a multi-rate signal processing technique has been developed that is able to detect the LTE primary synchronization signal in a computationally efficient manner. The computational workload and the implementation aspects of the developed technique are compared with those of a more traditional approach.
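As a rough illustration of what symbol timing from time-domain observations alone can look like, the sketch below uses the widely known cyclic-prefix correlation idea: the CP repeats the end of the symbol, so correlating the received stream with a copy of itself delayed by one FFT length peaks at the CP start. This is a generic textbook-style estimator written for this example, not necessarily the estimator developed in the dissertation; the sizes are hypothetical and the channel is ideal and noise-free.

```python
import numpy as np

def cp_timing_metric(r, n_fft, cp_len, n_candidates):
    """Correlation-minus-energy metric for each candidate CP start position."""
    metric = np.empty(n_candidates)
    for d in range(n_candidates):
        head = r[d:d + cp_len]                    # candidate cyclic prefix
        tail = r[d + n_fft:d + n_fft + cp_len]    # samples one FFT length later
        corr = np.abs(np.sum(head * np.conj(tail)))
        energy = 0.5 * np.sum(np.abs(head) ** 2 + np.abs(tail) ** 2)
        metric[d] = corr - energy                 # at most 0; reaches 0 when head matches tail
    return metric

rng = np.random.default_rng(0)
n_fft, cp_len = 64, 16                            # illustrative sizes
sym_len = n_fft + cp_len

def random_ofdm_symbol():
    data = (rng.choice([-1, 1], n_fft) + 1j * rng.choice([-1, 1], n_fft)) / np.sqrt(2)
    x = np.fft.ifft(data)
    return np.concatenate([x[-cp_len:], x])

stream = np.concatenate([random_ofdm_symbol() for _ in range(4)])
skip = 37                                         # receiver starts listening mid-symbol
r = stream[skip:]
true_start = sym_len - skip                       # first complete CP begins at this sample

metric = cp_timing_metric(r, n_fft, cp_len, n_candidates=sym_len)
estimated_start = int(np.argmax(metric))
```

With noise and multipath the peak broadens and shifts, which is one reason practical receivers average the metric over many symbols or track the estimate with a feedback loop.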

Next, a framework for adaptive channel equalization is presented that entirely bypasses the need to perform channel estimation. Unlike many other publications on the topic, the technique acknowledges that the second-order statistics of the channel’s frequency response are unlikely to be explicitly known by the receiver. The adaptive equalization methods approach the minimum mean-squared error (MMSE) solution. Other optimal channel estimation algorithms can be derived using a regression-based approach. Using a pre-defined model for the data, an optimal filter can be derived in a manner similar to the MMSE methods, using quadratic optimization to minimize error. In an attempt to provide optimality over a variety of conditions, a parameterized kernel is used and the optimal parameter is found using a developed offline training technique. Finally, several interpolation techniques are explored to construct the equalization matrix for the remaining subcarrier locations in the frequency-time grid. The featured interpolation procedure is optimized for minimum latency and memory usage.
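The regression-based approach can be sketched in a few lines of NumPy. The example below is an illustration only: the Gaussian-shaped kernel, the name `tau` for its width, the pilot spacing of six subcarriers, and the linear test channel are all assumptions made here, and the dissertation's own kernel and training procedure may differ. For each query subcarrier, nearby pilot estimates are weighted by the kernel, a low-order polynomial is fitted by weighted least squares, and the fitted value at the query point is taken as the channel estimate.

```python
import numpy as np

def lwr_estimate(x_obs, y_obs, x_query, tau, order=1):
    """Locally weighted polynomial regression evaluated at each query abscissa."""
    y_hat = np.empty(len(x_query), dtype=complex)
    for i, x0 in enumerate(x_query):
        dx = x_obs - x0
        w = np.exp(-dx ** 2 / (2.0 * tau ** 2))          # assumed kernel shape
        A = np.vander(dx, order + 1, increasing=True)    # columns: 1, dx, dx^2, ...
        Aw = A * w[:, None]
        theta = np.linalg.solve(A.T @ Aw, Aw.T @ y_obs)  # weighted normal equations
        y_hat[i] = theta[0]                              # fitted value at dx = 0
    return y_hat

x_pilot = np.arange(0.0, 72.0, 6.0)      # pilots on every 6th subcarrier (illustrative)
x_all = np.arange(72.0)

def h_true(x):
    """Hypothetical smooth channel frequency response used only for this demo."""
    return (0.5 + 0.01 * x) + 1j * (-0.2 + 0.02 * x)

rng = np.random.default_rng(0)
noise = 0.05 * (rng.standard_normal(x_pilot.size) + 1j * rng.standard_normal(x_pilot.size))
h_ls = h_true(x_pilot) + noise           # raw per-pilot estimates

h_exact = lwr_estimate(x_pilot, h_true(x_pilot), x_all, tau=6.0)  # noise-free sanity check
h_smooth = lwr_estimate(x_pilot, h_ls, x_all, tau=6.0)            # smoothed and interpolated
```

A small `tau` follows the pilots closely and passes more noise, while a large `tau` averages over more pilots and smooths more heavily, which is why the kernel parameter is worth tuning.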

2.2 System Architecture for Real-Time Multi-Path Wireless Channel Emulation

As a designer develops an OFDM receiver, performance-measuring simulations must constantly verify the intended operation. In the first stages of design, verification can be performed using short computer simulations of realistic scenarios in a software environment such as MATLAB. For realism, the simulations can mimic or emulate common conditions that occur in the intended operating environment using statistical models. Many industry-standard models have been established for commonly occurring mobile operating environments. As the developer moves beyond the prototyping phase and into implementation, simulation tasks become more time-critical. Finally, as the design nears deployment, long-term real-time simulation becomes increasingly important. At this stage, a real-time channel emulator is an invaluable tool, allowing the receiver to operate in a simulated environment in real-time for extended durations without encountering repetitive channel conditions. This dissertation presents a real-time MIMO channel emulator architecture that has been developed and implemented in FPGA hardware. The architecture allows the user to program the desired temporal and spatial aspects of the channel, allowing real-time simulation using industry-standard, as well as custom, channel models.
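To make the idea of a programmable multi-path channel concrete, the following sketch applies a time-varying tapped delay line to a sample stream in software. It is a simplified single-antenna illustration with hypothetical delays and gains. The tap gains here are plain rotating phasors standing in for the random fading processes a real emulator would generate, and nothing in it reflects the FPGA architecture itself.

```python
import numpy as np

def tapped_delay_line(x, delays, gains):
    """Apply y[n] = sum_l gains[l, n] * x[n - delays[l]] (time-varying multipath)."""
    x = np.asarray(x, dtype=complex)
    y = np.zeros(len(x), dtype=complex)
    for delay, g in zip(delays, gains):
        delayed = np.concatenate([np.zeros(delay, dtype=complex), x[:len(x) - delay]])
        y += g * delayed
    return y

rng = np.random.default_rng(0)
n = 256
x = rng.standard_normal(n) + 1j * rng.standard_normal(n)
delays = [0, 3, 7]                                   # hypothetical path delays in samples
static_gains = np.array([1.0, 0.5j, -0.3])           # hypothetical path gains

# Static case: every tap gain is constant in time.
gains_static = np.tile(static_gains[:, None], (1, n))
y_static = tapped_delay_line(x, delays, gains_static)

# Time-varying case: each tap rotates at its own (hypothetical) Doppler shift.
doppler = np.array([0.0, 0.002, -0.001])             # cycles per sample
gains_tv = static_gains[:, None] * np.exp(2j * np.pi * doppler[:, None] * np.arange(n))
y_tv = tapped_delay_line(x, delays, gains_tv)
```

With constant gains the structure reduces to an ordinary FIR filter, which gives a convenient check against `np.convolve`.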

2.3 Cyclic Prefix Redundancy Combination and Arbitrary-Ratio Resampling

Two additional topics are included that provide performance enhancement to OFDM reception. Cyclic prefix redundancy combination provides a boost in SNR, and the arbitrary-ratio resampler enables sampling frequency offset correction.

2.3.1 Cyclic Prefix Redundancy Combination with LTE Context

Channels with memory spread the energy of the transmitted OFDM symbols across time. In an OFDM system, symbol overlap causes harmful inter-symbol interference (ISI). To protect against ISI, a cyclic extension, or cyclic prefix (CP), of the transmitter’s IDFT output is added to provide a sacrificial guard interval. The CP is an elegant solution that has many attractive properties. It is phase-continuous with the symbol, minimizing additional spectral emissions; it provides tolerance to symbol timing errors; and it circularizes the channel convolution, providing single-tap-per-subcarrier equalization. By its nature, it also provides redundancy in each OFDM symbol. The segment of the CP that is left uncorrupted by ISI is available as a copy of a portion of the transmitted symbol. If the receiver is aware of the channel’s length, or excess delay, it can select the uncorrupted segment of the CP and utilize the available redundancy. The channel’s excess delay is readily available from the existing channel estimation components in the receiver. Using the knowledge of the channel’s excess delay, a method is shown that performs this combination at near-zero computational cost to achieve modest gains in signal-to-noise ratio (SNR).

2.3.2 Arbitrary-Ratio Resampling Using a Reformulated Farrow Filter

The locally weighted regression algorithm is a promising alternative to the typical stochastic optimization class of algorithms when used for channel estimation purposes.

By reformulating the regression algorithm, an optimal filter matrix is found for the

given kernel using the assumption of periodically sampled data. Along with Horner’s

rule, the filter’s matrix operation arrives at Farrow’s filter structure using an alterna-

tive formulation. Using this approach, the width of the pass-band, along with other

features of the Farrow filter can be “tuned”. When the Farrow filter is prefixed with

an upsampling component, arbitrary-ratio resampling can be performed with attrac-

tive stop-band attenuation properties. This design is utilized in the presented receiver

synchronization architecture and is featured throughout its simulation results.



2.4 Contributions

The significant contributions of this dissertation are summarized by the following list:

• An OFDM receiver architecture that simultaneously estimates and corrects sam-

pling frequency offset and symbol timing in the time domain. The technique is

presented and verified using a generic OFDM system configuration without spe-

cial training symbols or preambles while operating in a multi-path fading channel

with high mobility.

• A multi-rate algorithm for detecting symbol timing and the sector ID using the pri-

mary synchronization signal (PSS) in the LTE downlink. The multi-rate approach

is shown to have many superior properties when compared to the overlap-add

algorithm.

• A machine learning technique for optimal channel estimation. Using locally

weighted linear regression with a parameterized kernel, maximum likelihood es-

timation of the channel’s frequency response can be performed using a constant

filter matrix. A convex optimization technique is then used to approximate the

MMSE kernel parameter.

• The locally weighted linear regression technique formulates the well-known Farrow filter. Using the parameterized kernel, a variable-bandwidth Farrow filter is

created with continuously-variable linear group delay in the passband. Using

Horner’s rule, an efficient implementation structure is realized.

• A simple technique utilizes excess CP to achieve modest gains in SNR, when

provided knowledge of the channel’s excess delay. The redundancy combination

correlates the noise across frequency. The correlation relationship is explicitly

derived.

• An architecture is developed that implements an industry-standard model for

emulating spatiotemporally correlated multi-path fading channels in FPGA hard-

ware, operating in real-time, capable of processing modern wide-bandwidth sig-

nals.



3 OFDM Receiver Synchronization

Synchronization is perhaps the most performance-influencing component in an OFDM

receiver. Without reliable synchronization, the fundamental orthogonality property of

OFDM given by the DFT is lost. To ensure orthogonal operation, a wireless receiver

must perform the delicate balancing act of timing, frequency, and sample clock synchronization, each of which must be estimated using only the received signal. Oftentimes a “chicken and the egg” situation arises. Without timing synchronization, can the

receiver estimate the sample clock frequency error? With a sampling frequency error,

can the receiver estimate and perform timing synchronization? In the normal case, the

receiver comes online with all three of these inter-dependent synchronization tasks to

perform.

3.1 OFDM System Model

Before delving into the inner-workings of an OFDM receiver, its system model, derived

in detail in Appendices A and B, is reintroduced here.

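$$
v(k) = E\,W_M Z_R \left( \begin{bmatrix} H_0 & H_1 \end{bmatrix} \begin{bmatrix} Z_T & 0 \\ 0 & Z_T \end{bmatrix} \begin{bmatrix} W_M^H & 0_M \\ 0_M & W_M^H \end{bmatrix} \begin{bmatrix} x(k-1) \\ x(k) \end{bmatrix} + n(k) \right) \tag{1}
$$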

The system model in Eq. 1 illustrates the time-consecutive transmission of x vectors,

containing the transmitted mapped symbols and zeros. Each x vector is multiplied by

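an IDFT matrix $W_M^H$, defined by

$$
W_{m,n} = \frac{1}{\sqrt{M}}\, e^{j\frac{2\pi m n}{M}}, \qquad 0 \le m, n \le M-1, \tag{2}
$$

where M is the size of the (I)DFT operation. Next, a cyclic prefix is prepended to the result of the IDFT using the $Z_T$ permutation matrices:

$$
Z_T = \begin{bmatrix} \begin{matrix} 0_{L\times(M-L)} & I_L \end{matrix} \\ I_M \end{bmatrix}, \tag{3}
$$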

where L indicates the number of prefixed samples. The cyclic prefix operation allows

the system to operate in a channel with memory. The channel’s effects are modeled by



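the block matrix multiplication with $H_0$ and $H_1$

$$
H_0 = \begin{bmatrix}
0 & \cdots & h_d & \cdots & h_2 \\
\vdots & \ddots & & \ddots & \vdots \\
\vdots & & \ddots & & h_d \\
\vdots & & & \ddots & \vdots \\
0 & \cdots & \cdots & \cdots & 0
\end{bmatrix},
\qquad
H_1 = \begin{bmatrix}
h_1 & 0 & \cdots & \cdots & 0 \\
\vdots & \ddots & \ddots & & \vdots \\
h_d & \cdots & h_1 & \ddots & \vdots \\
\vdots & \ddots & & \ddots & 0 \\
0 & \cdots & h_d & \cdots & h_1
\end{bmatrix}, \tag{4}
$$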

and the addition of the WGN vector n. After the channel has had its influence on the

signal, the receiver selects the appropriate block of samples for its DFT operation using

the $Z_R$ permutation matrix

$$
Z_R = \begin{bmatrix} 0_{M\times L} & I_M \end{bmatrix}. \tag{5}
$$

The vector of selected samples is then multiplied by the DFT matrix, resulting in a

vector of symbols that have been affected by the channel’s frequency response. The

E matrix is used to equalize the channel’s effects, allowing for demapping and the ex-

traction of the transmitted data. In this model, the receiver is responsible for selecting

the block of samples for each symbol and equalizing the channel’s effects. These op-

erations will be two of the main receiver functions discussed in the following sections

and chapters.
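The chain of operations in Eq. 1 can be traced with a small numerical sketch. The code below is only an illustration under stated assumptions: the function names `dft` and `ofdm_link` are hypothetical, the DFT is written as a direct O(M²) sum for readability, the forward transform uses the conventional negative exponent, and the equalizer is a plain division by the channel's frequency response (zero-forcing with perfect channel knowledge, which is not necessarily how E is obtained in the receiver described here).

```python
import cmath

def dft(x, inverse=False):
    """Unitary (I)DFT written as a direct sum, O(M^2), for clarity only."""
    M = len(x)
    sign = 1 if inverse else -1
    scale = 1 / M ** 0.5
    return [scale * sum(x[n] * cmath.exp(sign * 2j * cmath.pi * m * n / M)
                        for n in range(M)) for m in range(M)]

def ofdm_link(x_prev, x_cur, h, L):
    """Send two consecutive symbols through IDFT, CP insertion, an FIR channel,
    CP removal, DFT and one-tap equalization; return the equalized x_cur.
    Assumes len(h) - 1 <= L and no noise."""
    M = len(x_cur)
    P = M + L

    def add_cp(x):
        y = dft(x, inverse=True)       # role of W_M^H
        return y[-L:] + y              # role of Z_T: last L samples prepended

    tx = add_cp(x_prev) + add_cp(x_cur)
    # Linear convolution with the channel (role of [H0 H1] acting on two symbols)
    rx = [sum(h[i] * tx[n - i] for i in range(len(h)) if n - i >= 0)
          for n in range(len(tx))]
    u = rx[P + L:P + L + M]            # role of Z_R: discard the CP, keep M samples
    v = dft(u)                         # role of W_M
    # Channel frequency response sampled on the M subcarriers
    H = [sum(h[i] * cmath.exp(-2j * cmath.pi * m * i / M) for i in range(len(h)))
         for m in range(M)]
    return [v[m] / H[m] for m in range(M)]   # role of E: one tap per subcarrier
```

Because the CP is at least as long as the channel memory in this sketch, the linear convolution seen by the selected block is circular, so a single complex division per subcarrier recovers the transmitted vector.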

3.2 Timing Synchronization

In an ideal OFDM system, as shown in Fig. 82, the serial-to-parallel and parallel-to-

serial components operate in perfect synchronization. If the perfect synchronization

assumption is removed, which is the case with all realizable wireless OFDM commu-

nications systems, the receiver operates on a time-shifted version of the transmitted

signal. A general description of the effect of timing offset in an OFDM system can be

used when the sampling clocks of the transmitter and receiver are perfectly matched

in frequency, but the sampling clock phases are not guaranteed to be aligned, causing

a constant fractional timing offset. For completeness, the phase offset between the



sampling clocks can extend beyond 2π so that modeled delays may extend beyond the

integer sample duration. The continuous-valued timing offset causes a phase shift in

the frequency domain according to

$$
\phi = e^{-j\frac{2\pi\tau k}{M}}, \qquad k = (0, 1, \cdots, M-1) - \frac{M}{2}, \qquad \tau \in \left[0, \frac{M}{2}\right), \tag{6}
$$

where M is the size of the DFT matrix WM in the OFDM system and τ is the continuous

delay that represents the timing offset between the transmitter and receiver in units of

samples.

To model fractional timing offset in simulation, when the system typically has per-

fectly synchronous sampling phase (i.e. when using MATLAB, Simulink, or other syn-

chronous mathematical descriptions of a wireless system), a simple FIR filter is able

to model the effects of fractional timing offset by linearly shifting the phase with fre-

quency.

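$$
\begin{aligned}
h &= \mathrm{sinc}(\tau - k), \qquad k = (0, 1, \cdots, M-1) - \frac{M}{2}, \qquad \tau \in \left[0, \frac{M}{2}\right), \\
\mathrm{sinc}(t) &\triangleq \begin{cases} 1 & t = 0 \\ \dfrac{\sin(\pi t)}{\pi t} & t \ne 0 \end{cases}
\end{aligned} \tag{7}
$$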

The FIR filter model is approximately all-pass and features a variable linear phase shift

that depends on the fractional delay parameter τ. A windowing operation is performed

on the impulse response vector h to reduce the Gibbs phenomenon near the band

edges of the filter. Fig. 1 illustrates several impulse responses generated using Eq. 7

with M = 256 and various values of τ. For clarity, the x axis (indicating the value of k)

has been magnified about k = 0.
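A minimal sketch of the filter in Eq. 7 follows. The text does not name the window that is applied, so the Hann window used here is an assumption, as are the function names `sinc` and `fractional_delay_fir`.

```python
import math

def sinc(t):
    """Normalized sinc as defined alongside Eq. 7."""
    return 1.0 if t == 0 else math.sin(math.pi * t) / (math.pi * t)

def fractional_delay_fir(tau, M, windowed=True):
    """Return (k, h) with h[k] = sinc(tau - k) for k = -M/2 ... M/2 - 1.
    The optional Hann window (an assumed choice) is centred on k = 0."""
    ks = [i - M // 2 for i in range(M)]
    h = [sinc(tau - k) for k in ks]
    if windowed:
        h = [hk * 0.5 * (1.0 + math.cos(2.0 * math.pi * k / M))
             for hk, k in zip(h, ks)]
    return ks, h
```

With τ = 0 the taps reduce to a unit impulse at k = 0, and for a fractional τ the energy spreads over neighbouring taps while the gain at DC stays close to one, which is the approximately all-pass behaviour described above.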

In a practical wireless OFDM receiver, the value of τ must be continuously esti-

mated to maintain proper reception. The receiver must use the estimated value of τ,

$\hat{\tau}$, to adjust the serial-to-parallel operation so u(k) ideally contains only the P samples

that correspond to the transmitted y(k). v(k) can be obtained only as long as u(k)

contains all of the M samples produced by the IDFT of x(k). Eq. 1 uses the ZR matrix



Figure 1: FIR Filter That Models the Effects of Fractional Timing Error (M = 256). [Plot omitted: “Fractional Timing Offset Model − Channel Impulse Response”, six panels for τ = −0.5, −0.3, −0.1, 0.1, 0.3, 0.5 versus relative time index k.]

to select the final block of M samples in each symbol. To keep this notation, it will

be assumed that the serial-to-parallel operation eliminates the integer portion of the

delay τint = floor(τ), leaving only the fractional delay τfrac = τ − τint. In the likely

case where 0 < τfrac < 1, the fractional delay can be lumped into the channel’s impulse response. Recalling Eq. 162, the channel’s effects are modeled using two horizontally concatenated matrices, H0 and H1, each generated using the channel’s impulse response vector h, which has d elements. The interpolated channel impulse response

that lumps the channel’s impulse response with the fractional delay effects can be defined by the vector g.

$$
g = \sum_{n=1}^{d} h_n\, \mathrm{sinc}\left(\tau_{\text{frac}} + (n-1) - k\right), \qquad k = 0, 1, \ldots, M-1 \tag{8}
$$

Figure 2: Illustration of “Left” and “Right” Symbol Timing Error Positions. [Diagram omitted: timing window positions relative to the CP, marking m = 0, m < −(L − d) (left error), and m > 0 (right error).]

Again, the sinc in Eq. 8 requires the use of a windowing operation to reduce the Gibbs

phenomenon near the band edges. When the channel is not the unit impulse, the

receiver cannot distinguish the effect of the channel vs. fractional timing error [8].

If the serial-to-parallel operation is misplaced, the block of samples selected by the ZR matrix may contain samples from adjacent symbols. In accordance with [9], ideal symbol timing occurs when the first element of the vector selected by the ZR matrix is juxtaposed with the CP, i.e. the selected symbol contains no energy from the CP or from any of the neighboring symbols. This position is denoted as m = 0. A “right” error occurs when the selected block of samples is positioned late in time so that energy from m samples of the next symbol is included in the receiver’s DFT operation. This position error occurs when m > 0. Similarly, a “left” error occurs when m < −(L − d), i.e. when the selected block is positioned early enough that it contains channel echo energy from the previous symbol (recall that d indicates the excess delay of the channel and L indicates the length of the CP). The receiver selects blocks of samples using integer indices; therefore the actual m will have a fractional component according to τfrac. The left and right timing error conditions are illustrated in Fig. 2.
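The three timing regions can be written as a short classification, sketched below. The function name `classify_timing` and the returned labels are hypothetical, and the left boundary uses the strict inequality m < −(L − d) drawn in Fig. 2.

```python
def classify_timing(m, L, d):
    """Classify an integer symbol timing position m.

    L is the CP length and d the channel excess delay, both in samples.
    Returns "right" for m > 0, "left" for m < -(L - d), and "isi-free"
    for positions inside the uncorrupted part of the CP (m = 0 is ideal).
    """
    if m > 0:
        return "right"
    if m < -(L - d):
        return "left"
    return "isi-free"
```

For example, with L = 52 and d = 12 every position from m = −40 to m = 0 avoids interference from neighbouring symbols, which is the margin that makes an intentional timing backoff possible.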



The level of interference caused by the two types of symbol timing errors is not

symmetric, i.e. the level of interference caused by a left error is not equal to a right

error, given the same absolute offset. The signal-to-interference ratio (SIR) of a right

error is defined by [9]

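$$
\mathrm{SIR}_r = \frac{(M-m)^2}{(2M-m)\,m - 2\,\dfrac{M-m}{\sigma_H^2} \displaystyle\sum_{k=0}^{m-1} \sum_{k'=k+2}^{d} \sigma_{h_{k'}}^2}, \qquad m > 0 \tag{9}
$$

where the channel gain $\sigma_H^2 = h^H h$ and the scalar $\sigma_{h_{k'}}^2 = h_{k'}^{*} h_{k'}$ is the power of a given channel tap at index k′. Similarly, the SIR for the left error is defined by

$$
\mathrm{SIR}_l = \frac{(M-c)^2}{(2M-c)\,c - 2\,\dfrac{M-c}{\sigma_H^2} \displaystyle\sum_{k=0}^{c-1} \sum_{k'=k+1}^{L+c+k} \sigma_{h_{k'}}^2}, \qquad c = d - (L+m), \qquad -L \le m \le (-L+d-1) \tag{10}
$$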

The example shown in [9] defines the channel vector h = [0.3484, 0.3910, 0, 0, 0.4386, 0.3910, 0, 0.3405, 0.3106, 0.2767, 0.2467, 0.1746], L = 52 and M = 512. Using the

specified channel vector, Fig. 3, reproduced from [9], shows that the level of interfer-

ence introduced by left and right errors is not equal, which can be intuitively justified.

The channel’s impulse response usually contains more energy in the elements with the

least amount of relative delay, thus for a left error, the ISI energy is initially less than

for the right error, given the same absolute error. With a left error, the desired symbol’s

energy tends to outweigh the interference in the erroneously selected portion of the CP.

In a right error, the samples erroneously selected from the next symbol contain chan-

nel echoes of the desired symbol, which aren’t useful for demodulation, and the next

symbol’s energy.

The symbol timing in a wireless receiver must be estimated using only the in-

formation in the received signal. Many symbol timing estimators exist that provide

good performance, however noise and channel conditions can affect estimation preci-

sion [10–13].

The integer portion $\hat{\tau}_{\text{int}}$ of the offset estimate $\hat{\tau}$ is used to adjust the serial-to-parallel operation so only the estimated fractional timing offset $\hat{\tau}_{\text{frac}}$ remains. The estimation error can be modeled by considering the symbol timing estimates to be a random variable denoted by $\hat{\tau}(t)$ and its instantaneous realization $\hat{\tau}$.



Figure 3: SIR vs. Symbol Timing Error m - Left and Right Errors. [Plot omitted: “SIR vs. Integer Symbol Timing Position m”, SIR on a logarithmic axis versus m, with one curve for left errors and one for right errors.]

The effect of a timing error can be observed by assuming that $\hat{\tau}(t)$ is normally distributed with a time-invariant mean τ and variance $\sigma_\tau^2$. The maximum-likelihood estimates of τ and $\sigma_\tau^2$ given N realizations of $\hat{\tau}(t)$ are $\mu_{\hat{\tau}(t)}$ and $\sigma^2_{\hat{\tau}(t)}$, realized by computing the mean and unbiased variance of the N realizations of $\hat{\tau}(t)$. For finite N, $\mu_{\hat{\tau}(t)}$ is denoted as a random variable with a Student’s t-distribution, defined by the degrees of freedom ν = N − 1. Assuming a finite N, the value of N must be chosen so timing position estimates are obtained by the receiver in a timely manner (and so that the receiver can track non-stationary statistics, if present), and so the variations among realizations of $\mu_{\hat{\tau}(t)}$ are small enough such that the probability of erroneous symbol placement is minimized.

As N → ∞, $\mu_{\hat{\tau}} \to \tau$ and $\mathrm{Var}\left(\mu_{\hat{\tau}}\right) = \hat{\sigma}^2_{\hat{\tau}} \to 0$; therefore $\hat{\sigma}^2_{\hat{\tau}} > 0$ for finite N. The non-zero variance requires careful consideration of the timing window placement. If the timing window is misplaced by a single sample, a right error can occur (Fig. 3), therefore the statistical confidence in each realization of $\mu_{\hat{\tau}(t)}$ must be analyzed so the decision of the integer symbol timing position can be given a statistical justification.

Because only integer symbol timing positions can be implemented by choosing an in-

teger number of samples for each DFT operation, the rounding decision policy must be

considered when deciding N . In the following analysis, assume that the timing position



used for the receiver’s serial-to-parallel operation is determined by nearest($\mu_{\hat{\tau}}$), which rounds $\mu_{\hat{\tau}}$ to the nearest integer.

Using the Student’s t-distribution, the estimated parameters $\mu_{\hat{\tau}}$ and $\sigma^2_{\hat{\tau}}$ can be used to determine $P(\tau \le \mu_{\hat{\tau}} + z)$, or the probability that the true timing position estimated by $\mu_{\hat{\tau}}$ is less than the value determined by the offset value z, which defines the probability threshold offset from the estimated mean, or “critical value”. The value of z is determined using a pre-defined level of probability denoted by α, such that

$$
P(\tau \le \mu_{\hat{\tau}} + z) = \alpha. \tag{11}
$$

In this analysis, α is constant and the critical value z will be computed according to

the gathered statistics. The remaining task is to define an acceptable value of α and to

compute the value of z, which can be computed using

$$
z = \mathrm{tcdf}(\alpha, \nu)\, S_e, \tag{12}
$$

where tcdf(α, ν) computes the solution to the inverse Student’s t cumulative-distribution function (cdf) integral $F^{-1}(\alpha, \nu)$, and the standard error $S_e$ is defined by

$$
S_e = \frac{\sigma_{\hat{\tau}}}{\sqrt{N}}. \tag{13}
$$
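Eq. 12 and Eq. 13 can be evaluated numerically as sketched below. This is not the dissertation's implementation: the inverse cdf is obtained here by bisection on a Simpson's-rule integral of the t density, which is an assumed stand-in for a library routine, and the names `t_pdf`, `t_cdf`, `t_inv` and `critical_value` are hypothetical. The fixed search range of ±50 makes the sketch suitable only for moderate to large ν.

```python
import math

def t_pdf(x, nu):
    """Student's t probability density with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))

def t_cdf(t, nu, steps=2000):
    """Student's t cdf: Simpson's rule on [0, |t|] plus symmetry about zero."""
    a = abs(t)
    if a == 0:
        return 0.5
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3
    return 0.5 + area if t > 0 else 0.5 - area

def t_inv(alpha, nu):
    """Inverse cdf F^-1(alpha, nu) by bisection (the cdf is monotonic)."""
    lo, hi = -50.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def critical_value(alpha, sigma_hat, N):
    """Eq. 12 with the standard error of Eq. 13 and nu = N - 1."""
    return t_inv(alpha, N - 1) * sigma_hat / math.sqrt(N)
```

As a rough check, `critical_value(0.99, 2.0, 128)` evaluates to about 0.4 samples, which is already close to the 0.5-sample boundary of a round-to-nearest policy.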

Given the severity of the interference caused by timing errors, and the desire for smaller averaging windows for more responsive receiver acquisition and tracking, it may be desirable to implement symbol “timing backoff”, offsetting the symbol position from the estimated value of τ to assure safe symbol timing placement (reducing the probability of error). To illustrate the usefulness (necessity) of timing backoff, suppose a sliding window of N symbol timing estimates is used to generate the symbolwise time series of estimates $\mu_{\hat{\tau}}[n]$ and $\sigma_{\hat{\tau}}[n]$. The estimates of τ can be recursively computed using

$$
s[n] = s[n-1] + \hat{\tau}[n] - \hat{\tau}[n-N], \qquad \mu_{\hat{\tau}}[n] = \frac{s[n]}{N}. \tag{14}
$$



Similarly, the unbiased variance and standard deviation can be recursively computed using

$$
\begin{aligned}
q[n] &= \hat{\tau}[n]^2 \\
v[n] &= v[n-1] + q[n] - q[n-N] \\
\sigma^2_{\hat{\tau}}[n] &= \frac{N\,v[n] - s[n]^2}{N(N-1)}
\end{aligned} \tag{15}
$$
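The rectangular-window recursions of Eq. 14 and Eq. 15 can be sketched as follows. The class name `SlidingStats` is hypothetical, and a single buffer of past estimates is kept here (squaring the oldest value on removal) instead of two separate buffers; the outputs are meaningful once N estimates have been pushed.

```python
from collections import deque

class SlidingStats:
    """Running mean and unbiased variance over the last N timing estimates."""

    def __init__(self, N):
        self.N = N
        self.buf = deque([0.0] * N, maxlen=N)  # last N estimates, oldest first
        self.s = 0.0                           # running sum, s[n] in Eq. 14
        self.v = 0.0                           # running sum of squares, v[n] in Eq. 15

    def update(self, tau_hat):
        oldest = self.buf[0]
        self.buf.append(tau_hat)               # drops the oldest entry
        self.s += tau_hat - oldest
        self.v += tau_hat ** 2 - oldest ** 2
        mean = self.s / self.N
        var = (self.N * self.v - self.s ** 2) / (self.N * (self.N - 1))
        return mean, var
```

Each update costs a fixed number of operations regardless of N, at the price of storing N past estimates.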

Note that Eq. 14 and 15 are both computed using an equally weighted (across time) sliding window of N realizations of $\hat{\tau}(t)$, each requiring a memory buffer of size N. The stochastic optimization algorithms presented in [14,15] frequently include techniques

that employ a “forgetting factor” λ to estimate parameters (statistics) that cannot be

assumed to be stationary. These iterative algorithms are attractive as they only require

a single-element memory buffer and are capable of implementing a variable-length av-

eraging window by adjusting the λ parameter. Each recursion applies an exponentially

diminishing weight to older or more “stale” realizations of the statistical parameter to

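be estimated. Using this concept, the mean $\mu_{\hat{\tau}}[n]$ can be recursively estimated using

$$
\mu_{\hat{\tau}}[n] = \lambda\,\mu_{\hat{\tau}}[n-1] + (1-\lambda)\,\hat{\tau}[n], \tag{16}
$$

and similarly for $\hat{\sigma}_{\hat{\tau}}[n]$

$$
\begin{aligned}
\sigma^2_{\hat{\tau}}[n] &= \lambda\,\sigma^2_{\hat{\tau}}[n-1] + (1-\lambda)\left(\hat{\tau}[n] - \mu_{\hat{\tau}}[n]\right)^2, \\
\sigma_{\hat{\tau}}[n] &= \sqrt{\sigma^2_{\hat{\tau}}[n]}
\end{aligned} \tag{17}
$$

By selecting a forgetting factor $\lambda = 1 - \frac{1}{N}$, the equally weighted, or “rectangular” window recursion and the “decaying” or “forgetting” window recursions can be compared.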
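A sketch of the forgetting-factor recursions of Eq. 16 and Eq. 17 is given below. The class name `ForgettingStats` and the zero initial state are assumptions, and the variance recursion is written on the variance itself so that both terms carry the same units.

```python
import math

class ForgettingStats:
    """Exponentially weighted mean and standard deviation of timing estimates."""

    def __init__(self, lam, mean0=0.0, var0=0.0):
        self.lam = lam        # forgetting factor, e.g. 1 - 1/N
        self.mean = mean0     # assumed initial state
        self.var = var0

    def update(self, tau_hat):
        lam = self.lam
        self.mean = lam * self.mean + (1 - lam) * tau_hat               # Eq. 16
        self.var = lam * self.var + (1 - lam) * (tau_hat - self.mean) ** 2  # Eq. 17
        return self.mean, math.sqrt(self.var)
```

Only the previous mean and variance are stored, so the effective window length can be changed at run time by adjusting `lam` without resizing any buffer.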

Fig. 4 demonstrates the confidence boundary z for α = .99 on a set of symbol timing estimates with τ = 0, variance $\sigma_\tau^2 = 4$, and a sliding window size of N = 128. The red curve (rectangular window recursion) indicates the true confidence boundary estimates generated using Eq. 14 and 15 and the definition of the probability in Eq. 11. The blue curve (forgetting recursion) indicates the confidence boundaries generated from the approximated values of $\mu_{\hat{\tau}}[n]$ and $\sigma_{\hat{\tau}}[n]$ using Eq. 16 and Eq. 17. In this example, the symbol position τ = 0 corresponds to m = 0. From Fig. 4, over realizations of $\hat{\tau}(t)$, there are often situations where the probability of the timing estimate being below the decision boundary of the rounding policy applied to $\mu_{\hat{\tau}}$ is less than .99. After the rounding process, the estimated probability of producing a right error then exceeds .01, or more than 1 out of 100 symbols are likely to contain interference from a right error. This analysis



Figure 4: Data-Driven Critical Value for Symbol Timing Estimates. [Plot omitted: $\mu_{\hat{\tau}} + z - \tau$ versus n for the rectangular recursion and the forgetting recursion, together with the rounding decision boundary; plot title parameters: α = .99, τ = −6, $\sigma_\tau^2$ = 4, N = 128, λ = 1 − (1/N).]

is useful in the consideration of the window size N , and the amount of timing backoff

that should be implemented when N is required to be undesirably large.

In the above example, to assure that a right error is prevented, the symbol position

can be taken from a position earlier than indicated by the estimate. If the timing

window is purposefully “backed off” of the position estimate, the probability of a right

error is reduced. In the example, if the symbol timing position is chosen to be −1

or even −2, the probability of rounding beyond the CP boundary to produce a right

error will be greatly reduced. The act of backing the symbol timing placement away

from the estimated position allows the receiver to use smaller values of N to achieve

the desired right-error probabilities. The penalty for the backoff is frequency-domain

phase shifts, which can be lumped into the already existent fractional timing offset

τfrac. In a practical receiver, the backoff value can be computed online using a constant

value for the inverse-cdf in Eq. 12 for a constant α, and the presented recursions.

To this point, the sampling clocks between the transmitter and receiver have been

assumed to be perfectly aligned in frequency, but are offset in phase. It will be shown

that this assumption will assure the incorrect operation of a practical OFDM receiver

over time.

When sampling clock frequency mismatch is present, the relative symbol timing

at the receiver is no longer stationary over time [16, 17] and the received signal is



degraded by inter-carrier interference (ICI) [18]. In a system with sample clock offset (SCO), the symbol timing has two parameters, the drift rate ∆τ and the current symbol timing position τ. The drift rate ∆τ will be denoted in units of samples per symbol, that is, the timing shift in unit-sample durations that occurs over the unit-time of a single OFDM symbol duration. The sample clock offset $\Delta f_{SCO}$ between the transmitter and receiver is defined by

$$
\Delta f_{SCO} = f_{TX} - f_{RX}, \tag{18}
$$

where $f_{TX}$ and $f_{RX}$ define the transmitter and receiver sampling clock frequencies, respectively, and $\Delta f_{SCO}$ is denoted in units of Hz, or alternatively, in samples per second.

The sampling clock error effectively resamples the transmitted signal. Suppose the sample clock error between the transmitter and receiver is $\Delta f_{SCO} = +100$ Hz ($-100$ Hz), causing 100 additional samples to be added to (100 samples to be removed from) the ideal received signal each second. Given an ideal number of transmitted samples per symbol P (Eq. 166), and a transmitter sampling rate $f_{TX}$, the transmitted symbol rate is defined by

$$
f_{TX_{sym}} = \frac{f_{TX}}{P}, \tag{19}
$$

with units of symbols per second. The constant timing window drift ∆τ is now defined by

$$
\Delta\tau = \frac{\Delta f_{SCO}}{f_{TX_{sym}}} \tag{20}
$$

with units of samples per symbol. The timing window drift is a symptom of sample

clock offset that is detectable using nothing but the time series of successive symbol

timing estimates. The drift parameter ∆τ contains both the magnitude and direction

information that allows the SCO to be directly estimated.

Ideally, the symbol timing positions are determined by

$$
\delta\tau[n] = \delta\tau[n-1] + \Delta\tau, \qquad \tau[n] = \delta\tau[n]. \tag{21}
$$

However, to model random estimation error, the zero-mean random variable s(t) with variance $\sigma_s^2$ is included.

$$
\delta\tau[n] = \delta\tau[n-1] + \Delta\tau, \qquad \hat{\tau}[n] = \delta\tau[n] + s[n]. \tag{22}
$$

$\hat{\tau}[n]$ must be realized for each received OFDM symbol to maintain the proper samples-per-symbol units used for ∆τ. The instantaneous measurement of the SCO can be obtained using Eq. 22 in Eq. 23.

$$
\Delta\tau[n] = \delta\tau[n] - \delta\tau[n-1], \qquad \widehat{\delta\tau}[n] = \hat{\tau}[n], \qquad \widehat{\delta\tau}[n-1] = \hat{\tau}[n-1] \tag{23}
$$

therefore

$$
\widehat{\Delta\tau}[n] = \hat{\tau}[n] - \hat{\tau}[n-1] \tag{24}
$$

with the variance

$$
\mathrm{Var}\left(\widehat{\Delta\tau}[n]\right) = 2\sigma_s^2 \tag{25}
$$

Note that 0 < |∆τ| << 1 for “normal” levels of SCO and the symbol timing esti-

mates are usually indicated by integers, requiring a large number of ∆τ estimates to

achieve sufficient precision. A moving average of SCO measurements can be obtained

using a recursive estimation algorithm (Eq. 16). However, the SCO does not need to be measured directly; instead, the drift estimate is used as a control variable for the receiver to actively cancel SCO using an arbitrary-ratio resampling component. More details on this

will be presented in the next section.
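The drift measurement of Eq. 22 to Eq. 25 can be illustrated with a short simulation. The sketch below is a toy model under stated assumptions: the estimation error s[n] is drawn as independent Gaussian noise, the function name `simulate_drift_estimates` is hypothetical, and the example drift of 0.1024 samples per symbol is simply an illustrative value.

```python
import random

def simulate_drift_estimates(delta_tau, sigma_s, n_symbols, seed=0):
    """Return the per-symbol drift measurements tau_hat[n] - tau_hat[n-1].

    delta_tau : true drift in samples per symbol
    sigma_s   : standard deviation of the timing estimation error s[n]
    """
    rng = random.Random(seed)
    pos = 0.0
    prev = None
    diffs = []
    for _ in range(n_symbols):
        pos += delta_tau                         # noiseless drift of Eq. 22
        tau_hat = pos + rng.gauss(0.0, sigma_s)  # noisy timing estimate
        if prev is not None:
            diffs.append(tau_hat - prev)         # Eq. 24
        prev = tau_hat
    return diffs
```

Averaging many such differences recovers the drift even though each individual measurement has a variance of twice the estimator variance, which is why a long averaging window is needed when |∆τ| is much smaller than one sample per symbol.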

3.3 Sampling Clock Frequency Offset and Symbol Timing Correction: A Joint

Effort

Each aspect of synchronization in an OFDM receiver is closely inter-related with the others. Such is the case between symbol timing and SCO. Symbol timing drift is

a symptom caused by the resampling process from the SCO. In this case, it is more

worthwhile to treat the disease that causes the symptom, rather than just treating the

symptom and ignoring the root cause. SCO produces other symptoms, most of which

are from the effect of the drifting timing window.

As the symbol timing drifts, the fractional component τfrac is continuously varied. If

the receiver is capable of tracking the integer timing offset component τint, frequency

domain phase-shifts introduced by τfrac are still present. With a time-invariant channel, it is possible, although not advisable, to measure SCO in the frequency domain. If the channel is time-varying, the receiver cannot be expected to



Figure 5: SNR Degradation vs. Subcarrier Index. [Plot omitted: “SNR Degradation vs. Subcarrier Index at Es/No = 50 dB”, degradation (dB) versus subcarrier index for 1, 5, 10, and 20 ppm.]

distinguish between the varying phase shift caused by SCO and the constantly chang-

ing frequency response of the channel [8]. The time-varying frequency domain phase

shifts indirectly resulting from SCO are a symptom of a symptom of the SCO. Addition-

ally, to measure SCO in the frequency domain, the receiver is assumed to have achieved

sufficient symbol timing synchronization to produce frequency-domain symbols with-

out ISI from timing errors and inter-carrier interference (ICI) from other sources. If

SCO is present, the resampling operation caused by the SCO degrades (destroys) the

orthogonality of the received subcarriers (orthogonality is an absolute term) [18], thus

frequency domain measurements are inherently polluted with ICI.

SCO introduces ICI caused by the mislocation of the DFT bin positions in the fre-

quency domain. The amount of ICI on each subcarrier depends on its absolute distance

from the central (DC) subcarrier. The index-dependent degradation of SNR is defined

by

$$
D_n = 10\log_{10}\left(1 + \frac{1}{3}\,\frac{E_s}{N_o}\left(\pi n\,\frac{\Delta f_{SCO}}{f_s}\right)^2\right), \qquad n \in \left\{-\frac{M}{2}, -\frac{M}{2}+1, \ldots, \frac{M}{2}-1\right\} \tag{26}
$$

where $E_s$ and $N_o$ are the expected symbol energy and noise energy density, respectively, $f_s$ indicates the nominal sampling frequency of the system, and n denotes the subcarrier, or DFT bin index [18]. Fig. 5 shows the SNR degradation per subcarrier position



Figure 6: SNR Degradation vs. SCO with Varying Es/N0. [Plot omitted: “SNR Degradation vs. SCO at Varying Es/No (subcarrier index = 1200)”, degradation on a logarithmic axis versus clock offset (ppm) for Es/No = 20, 30, 40, and 50 dB.]

n for several levels of SCO normalized to parts-per-million (ppm) in an example OFDM configuration with 1200 centrally-located subcarriers using an FFT size of M = 2048. The demonstrated SCO levels, which are quite realistic for low-cost clock oscillators, show significant SNR degradation. Note that Fig. 5 illustrates the SNR degradation at a fixed $E_s/N_o$ = 50 dB. Fig. 6 shows the maximum degradation levels located at the outermost subcarrier indices for several values of $E_s/N_o$.
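Eq. 26 is straightforward to evaluate directly. In the sketch below the function name `snr_degradation_db` is hypothetical, and the ratio of the SCO to the nominal sampling frequency is supplied in ppm.

```python
import math

def snr_degradation_db(n, sco_ppm, es_no_db):
    """SNR degradation D_n of Eq. 26 for subcarrier index n.

    sco_ppm  : sample clock offset relative to fs, in parts per million
    es_no_db : Es/No in dB
    """
    es_no = 10 ** (es_no_db / 10)      # dB to linear ratio
    rel = sco_ppm * 1e-6               # delta_f_SCO / fs
    return 10 * math.log10(1 + es_no / 3 * (math.pi * n * rel) ** 2)
```

For instance, `snr_degradation_db(600, 20, 50)` evaluates to about 16.8 dB, while the degradation at n = 0 is zero; the quadratic dependence on n explains why the outermost subcarriers suffer first.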

Receiver architectures exist that acknowledge the ability to detect SCO in the time-

domain using symbol timing information, yet fail to acknowledge the effects of the SCO

other than phase shifts from fractional symbol timing error [19]. This architecture only

tracks and corrects the phase from the fractional symbol timing errors as the

timing position drifts, leaving the ICI unmitigated. The only component that benefits

from this design is the channel estimator.

The proposed receiver architecture uses the symbol timing information to actively

cancel the SCO by resampling the received signal. The resampling action is performed

using an arbitrary-ratio resampler, requiring no additional physical hardware, such as a

phase-locked loop or a voltage-controlled crystal oscillator. After correctly resampling

the received signal, SCO-induced ICI is eliminated, and the symbol timing becomes

stationary. The arbitrary-ratio resampler is controlled using feedback techniques by ob-

serving symbol timing drift and appropriately adjusting the resampling ratio to achieve

stationary symbol timing.

The receiver’s architecture with the SCO cancelling components is shown in Fig. 7.



Figure 7: Receiver Architecture Capable of Synchronizing SCO and Symbol Timing. [Block diagram omitted. Blocks: Farrow-Based Resampling Filter, S/P, FFT, Symbol Timing Estimator, Loop Filter, Accumulator. Signal labels: received signal (RX clock), resampled signal (recovered TX clock), measured SCO, interpolation position, demodulated OFDM signal.]

In this receiver, the incoming signal is constantly resampled at the rate indicated by the output of the loop-filter-controlled accumulator. The loop filter indicates the measured SCO in units of samples per sample, which is subtracted from the value stored in an accumulator on each incoming sampling clock cycle. The accumulator produces the sampling indices used by the Farrow-based resampling filter component, which is designed in Ch. 5 and illustrated in Fig. 52 and Fig. 54.
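The role of the accumulator can be illustrated with a much simpler interpolator than the Farrow filter. In the sketch below, linear interpolation is an assumed stand-in for the Farrow-based resampling filter, the position is advanced once per output sample instead of once per input clock cycle, and the function name `resample_linear` is hypothetical.

```python
def resample_linear(x, sco, start=0.0):
    """Resample x with a position accumulator that advances by (1 - sco).

    sco is the measured offset in samples per sample. The integer part of
    the accumulator selects the input sample and the fractional part is the
    interpolation position between that sample and the next one.
    """
    out = []
    pos = start
    while pos <= len(x) - 1:
        i = int(pos)                        # integer sample index
        mu = pos - i                        # fractional interpolation position
        nxt = x[i + 1] if i + 1 < len(x) else x[i]
        out.append((1 - mu) * x[i] + mu * nxt)
        pos += 1.0 - sco                    # subtract the measured SCO each step
    return out
```

With `sco = 0` the input passes through unchanged; a small positive or negative `sco` slowly slides the interpolation position, so that samples are occasionally skipped or repeated in the sense of the resampling ratio.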

The following simulations illustrate the SCO detection and correction algorithm

on a generalized OFDM system. In this system, no special training symbols are made

available for timing information, and the symbol timing is derived entirely using the

Beek algorithm using the in-built correlation properties of the CP [13]. The example

uses an FFT size of M = 2048, a CP length of L = 512 with 1200 occupied subcarriers

and a sampling rate of 30.72 MHz. Each ppm of SCO adds 30.72 Hz of absolute error, introducing $\pm 2.56\times10^{-3}$ samples per symbol per ppm, or exactly $\pm 1\times10^{-6}$ samples per sample per ppm.

Fig. 8 shows the magnitude of a received OFDM symbol with a -40 ppm SCO error

magnitude alongside the estimated symbol timing indices found by the Beek algorithm.

The simulated SCO levels represent the worst-case error magnitude for two separate

low-cost 20 ppm clock sources at the transmitter and receiver. The outer subcarriers

clearly display large levels of ICI, corresponding with Eq. 26 and Fig. 5. No noise

has been added to the signal in this simulation; all of the SNR degradation is a result

of ICI. The symbol timing is also clearly impacted, drifting nearly 100 samples over

approximately 950 symbols.
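The drift figures quoted for this configuration follow from Eq. 18 to Eq. 20. The short check below assumes that the number of samples per symbol is P = M + L, which is consistent with the per-ppm value stated above; the function name `drift_per_symbol` is hypothetical.

```python
M, L = 2048, 512          # FFT size and CP length of the example system
FS = 30.72e6              # sampling rate in Hz
P = M + L                 # samples per OFDM symbol including the CP (assumed)

def drift_per_symbol(sco_ppm):
    """Timing drift in samples per symbol for a given SCO in ppm."""
    delta_f = sco_ppm * 1e-6 * FS      # offset in Hz (Eq. 18)
    f_sym = FS / P                     # symbol rate in symbols per second (Eq. 19)
    return delta_f / f_sym             # drift in samples per symbol (Eq. 20)
```

A 1 ppm offset gives 2.56×10⁻³ samples per symbol, and a 40 ppm offset accumulated over 950 symbols gives roughly 97 samples, in line with the drift of nearly 100 samples observed in Fig. 8.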

Fig. 9 shows the simulation result under the same conditions using the proposed

architecture. In this simulation, the loop filter has been implemented using a simple

integrating controller that adjusts the Farrow resampler’s input accumulator based on



Figure 8: Received OFDM Signal Afflicted with 40 ppm of SCO: Effects on SNR and Symbol Timing. [Plots omitted: “Received OFDM Symbol Magnitude Polluted with −40 ppm of SCO” (magnitude versus subcarrier index) and “Drifting Symbol Timing Estimates” (estimated symbol timing position in samples versus OFDM symbol index).]

Figure 9: Received OFDM Signal Afflicted with -40 ppm of SCO After Resampling Using Measured Timing Drift Rate and Feedback Control Technique. [Plots omitted: “OFDM Symbol Magnitude After Correction: −40 ppm SCO Error” (magnitude), “Symbol Timing Estimates” (estimated symbol timing position in samples), and “SCO Compensation: Control System Response” (samples/sample, showing the accumulator input and the known SCO drift rate).]



Figure 10: Received OFDM Signal Afflicted with -40 ppm of SCO and EVA-200 Channel Model: Successful SCO Detection and Correction Using Only Time-Domain Information in High Mobility Channel Conditions. [Plots omitted: “OFDM Symbol Magnitude After Correction: −40 ppm SCO Error” (magnitude), “Symbol Timing Estimates” (estimated symbol timing position in samples), and “SCO Compensation: Control System Response” (samples/sample, showing the accumulator input and the known SCO drift rate).]

the observed timing drift. The SCO estimates are generated for each symbol using a

recursively computed moving average over 120 symbols (10 ms), intentionally slow-

ing the response of feedback correction. The symbol magnitude plot in Fig. 9 shows

minimal SNR degradation or distortion from the resampling process after converging

to the ideal resampling ratio. Fig. 9 also verifies that the symbol timing becomes quite

stationary as the SCO compensation converges to the ideal state.
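The feedback behavior described above can be illustrated with a minimal, noise-free sketch. It is not the simulation used to produce Fig. 9: the symbol length, the loop gain, and the sign convention are assumptions chosen only for illustration, and the Farrow resampler is reduced to a single "correction" value that cancels part of the offset.

```python
from collections import deque

TRUE_SCO = 40e-6          # hypothetical offset magnitude (samples/sample)
SYMBOL_LENGTH = 2560      # assumed samples per symbol (2048 + 512-sample CP),
                          # consistent with 120 symbols per 10 ms at 30.72 MHz
WINDOW = 120              # moving-average length in symbols
LOOP_GAIN = 0.005         # assumed integrator gain (not taken from the text)
NUM_SYMBOLS = 3000

window = deque([0.0] * WINDOW, maxlen=WINDOW)
running_sum = 0.0
correction = 0.0          # value driven into the resampler's accumulator input
history = []

for _ in range(NUM_SYMBOLS):
    # Timing drift observed over one symbol after resampling: the residual
    # SCO multiplied by the symbol length (noise-free for clarity).
    drift = (TRUE_SCO - correction) * SYMBOL_LENGTH

    # Recursive moving average: add the newest sample, drop the oldest.
    running_sum += drift - window[0]
    window.append(drift)
    sco_estimate = (running_sum / WINDOW) / SYMBOL_LENGTH

    # Integrating controller: accumulate a fraction of the residual estimate.
    correction += LOOP_GAIN * sco_estimate
    history.append(correction)

print(f"final correction: {correction:.3e} (target {TRUE_SCO:.3e})")
```

With a small gain relative to the averaging window, the correction settles slowly toward the true offset, which mirrors the intentionally slowed response noted above. A larger gain would shorten convergence at the cost of stability margin, since the moving average adds delay inside the loop.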

The previous example demonstrates no unique characteristics of either time or fre-

quency domain derived SCO measurement techniques. The simple AWGN channel al-

lows observation of the SCO-induced symbol timing drift in the frequency domain with

minimal ICI in the central subcarrier locations. The next example (Fig. 10) highlights

the unique ability of the proposed method to measure and correct SCO in the midst of

high mobility and multi-path channel conditions. The Beek algorithm continues to be
the only source of symbol timing estimates, which are degraded by the spreading of the
CP’s correlation energy by the multi-path channel, an effect noted in Beek’s original
paper [13]. To counteract the increased estimate variance, the moving average window
is increased to 360 symbols (30 ms) and the loop filter gain is decreased.



The simulation results in Fig. 10 show the same OFDM signal configuration oper-

ating in the LTE-specified extended vehicular A (EVA) model with a 200 Hz maximum

Doppler frequency. The EVA model has 9 channel echoes with a relatively large ex-

cess delay, thus the CP correlation energy is widely distributed, aggressively varying

the symbol timing estimates. The simulation results show the algorithm converged to

nearly cancel the SCO mid-way through the simulation and maintained low steady-

state error despite the large symbol timing estimate variance. The symbol magnitude

subplot in Fig. 10 shows the near-zero ICI in the outer subcarrier locations despite the

severe SCO, mobility, and multi-path conditions.

3.4 Time-Domain Detection of the LTE Primary Synchronization Signal

The previous section highlighted that synchronization of both sampling frequency and

symbol timing can be performed jointly without relying on information from the
frequency domain, a particularly attractive property allowing synchronization to be
independent of demodulation reliability. In the LTE downlink, the receiver also has
at its disposal the primary synchronization signal (PSS), which is specifically designed to
have optimal correlation properties. Transmitted periodically every 5 ms, the PSS can

greatly enhance nearly every synchronization task in the receiver, including symbol

timing, SCO measurement, carrier frequency offset, and symbol index synchroniza-

tion [20–23].

Unlike the CP, which happens to be correlated with its respective symbol by cir-

cumstance, the LTE PSS is specifically designed to have maximal autocorrelation and

zero cross-correlation with other PSS signals. The CP correlation property is simply an

exploitation arising from the properties of the DFT. It is likely that the CP was never

directly intended to be used for symbol timing synchronization, but to simply provide

an elegant solution to “circularizing” the signal-channel convolution, and to provide a

phase-continuous guard interval to protect against ISI by exploiting the periodic na-

ture of the DFT. Essentially, CP-based timing information is built upon several layers of

exploitations. The goal of this section is not to eliminate the CP as a source of symbol

timing information, but to add another available source. Using the CP as well as the

PSS, overall synchronization performance can be enhanced.

The PSS is generated using Zadoff-Chu (ZC) sequences, which have excellent cyclic

correlation properties. To introduce the generation and properties of generalized ZC

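sequences, the analysis in [21, 23] will be followed closely. The ZC sequence $z_\gamma[n]$ is
defined according to

$$
z_\gamma[n] =
\begin{cases}
\exp\left(-j\,\dfrac{2\pi\gamma}{N}\,\dfrac{n(n+2q)}{2}\right), & N \text{ even} \\[2ex]
\exp\left(-j\,\dfrac{2\pi\gamma}{N}\,\dfrac{n(n+1+2q)}{2}\right), & N \text{ odd,}
\end{cases}
\tag{27}
$$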

where $N$ is the sequence length, $n = 0, 1, \ldots, N-1$, $q$ is an arbitrary integer, and $\gamma$ is a
positive integer, referred to as the “index”, which is relatively prime to $N$. In [21, 23],
$q = 0$ and $N$ is an odd prime. Notice that ZC sequences are always unit-magnitude.
This subtle, elegant property allows a ZC sequence to be stored using only the angles of
its elements. The cyclic autocorrelation function of the sequence $z_\gamma[n]$ is

$$
R_{z_\gamma z_\gamma}[m] = \sum_{n=0}^{N-1} z_\gamma[n]\, z_\gamma^*[(n+m) \bmod N], \quad m = 0, 1, \ldots, N-1,
\tag{28}
$$

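where $R_{z_\gamma z_\gamma}[0] = N$ and $R_{z_\gamma z_\gamma}[m] = 0$ for $m \neq 0 \pmod N$. Importantly, the cyclic
cross-correlation function of two sequences $z_{\gamma_i}$ and $z_{\gamma_j}$, both of length $N$, is defined by

$$
R_{z_{\gamma_i} z_{\gamma_j}}[m] = \sum_{n=0}^{N-1} z_{\gamma_i}[n]\, z_{\gamma_j}^*[(n+m) \bmod N], \quad m = 0, 1, \ldots, N-1.
\tag{29}
$$

When normalized by the autocorrelation peak $N$, the cross-correlation magnitude is
$\left| R_{z_{\gamma_i} z_{\gamma_j}}[m] \right| / N = 1/\sqrt{N}$ if $|\gamma_i - \gamma_j|$ is relatively prime to $N$, which is easily satisfied if
$N$ is a prime number, in which case the cyclic cross-correlation at all lags achieves the
minimum theoretical value for any two sequences that have ideal autocorrelation [23].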
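The correlation properties stated in Eq. 28 and Eq. 29 can be checked numerically. The following is a minimal sketch using only the Python standard library; the sequence length 31 and the indices 3 and 7 are arbitrary illustrative choices (an odd prime length with q = 0), not values used by LTE.

```python
import cmath
import math


def zadoff_chu(root, length, q=0):
    """Return the ZC sequence of Eq. 27 as a list of complex samples."""
    seq = []
    for n in range(length):
        # Eq. 27: the quadratic phase term differs for even and odd lengths.
        k = n * (n + 2 * q) if length % 2 == 0 else n * (n + 1 + 2 * q)
        # The phase is -pi*root*k/length; reducing root*k modulo 2*length
        # leaves the value unchanged and keeps the argument small.
        seq.append(cmath.exp(-1j * math.pi * ((root * k) % (2 * length)) / length))
    return seq


def cyclic_correlation(a, b):
    """Cyclic correlation of Eq. 28/29: R[m] = sum_n a[n] * conj(b[(n+m) mod N])."""
    size = len(a)
    return [
        sum(a[n] * b[(n + m) % size].conjugate() for n in range(size))
        for m in range(size)
    ]


N = 31                      # odd prime, chosen only for illustration
z3 = zadoff_chu(3, N)
z7 = zadoff_chu(7, N)

auto = [abs(r) for r in cyclic_correlation(z3, z3)]
cross = [abs(r) for r in cyclic_correlation(z3, z7)]

print("autocorrelation peak:", round(auto[0], 6))
print("largest autocorrelation sidelobe:", max(auto[1:]))
print("cross-correlation magnitudes (min, max):", min(cross), max(cross))
print("sqrt(N):", math.sqrt(N))
```

The autocorrelation is N at zero lag and numerically zero elsewhere, while the unnormalized cross-correlation magnitude equals the square root of N at every lag, i.e. 1/sqrt(N) of the autocorrelation peak.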

Another interesting property is shown in [21]. A duality exists between time and

frequency domain ZC sequences. Let

$$
Z_\gamma[k] = \sum_{n=0}^{N-1} z_\gamma[n] \exp\left(-j\,\frac{2\pi n k}{N}\right), \quad k = 0, 1, \ldots, N-1,
\tag{30}
$$

denote the DFT of the time-domain sequence. Both $z_\gamma$ and $Z_\gamma$ are periodic with
period $N$ and are related by

$$
Z_\gamma[k] = Z_\gamma[0]\, z_\gamma^*[\gamma' k], \quad k = 0, 1, \ldots, N-1,
\tag{31}
$$

where $\gamma'$ denotes the multiplicative inverse of $\gamma \bmod N$, thus $\gamma'\gamma = 1 \bmod N$; hence the
DFT of a ZC sequence is also a ZC sequence. This property allows ZC sequences to be
directly generated in the frequency domain without a DFT operation. This property is
more useful in the LTE uplink, where many ZC sequences must be frequently generated
for the physical random access channel (PRACH) signaling. Only 3 ZC sequences are
used in the downlink, two of which are complex conjugates. Pairing this property with
the constant amplitude property of ZC sequences, the total storage requirement for the
set of PSS sequences is reduced by 2/3.

Sector ID $N_{ID}^{(2)}$     Root index $u$
0                            25
1                            29
2                            34

Table 1: Root Indices for the LTE Primary Synchronization Signal
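The time-frequency duality of Eq. 31 can be verified directly for a small case. The sketch below assumes an odd length with q = 0 and uses N = 63 with index 25 purely as an example; the direct O(N^2) DFT and the helper names are illustrative, and `pow(gamma, -1, N)` requires Python 3.8 or later.

```python
import cmath
import math


def zadoff_chu_odd(root, length):
    """ZC sequence of Eq. 27 for odd length and q = 0."""
    return [
        cmath.exp(-1j * math.pi * ((root * n * (n + 1)) % (2 * length)) / length)
        for n in range(length)
    ]


def dft(x):
    """Direct O(N^2) DFT as written in Eq. 30."""
    size = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * ((n * k) % size) / size)
            for n in range(size))
        for k in range(size)
    ]


N, gamma = 63, 25
gamma_inv = pow(gamma, -1, N)     # multiplicative inverse of gamma mod N

z = zadoff_chu_odd(gamma, N)
Z = dft(z)

# Eq. 31: Z[k] = Z[0] * conj(z[(gamma_inv * k) mod N])
Z_dual = [Z[0] * z[(gamma_inv * k) % N].conjugate() for k in range(N)]
max_err = max(abs(a - b) for a, b in zip(Z, Z_dual))
print("gamma' =", gamma_inv, " max |Z - Z_dual| =", max_err)
```

The two spectra agree to within floating-point error, so a frequency-domain ZC sequence can be produced from the time-domain expression with a re-indexing and a single complex scale factor.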

Now that the properties of ZC sequences are shown, attention will be given to the

LTE downlink, which defines 3 ZC sequences for the 3 possible PSSs using Eq. 32 and

Tbl. 1 [20].

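$$
z_u[n] =
\begin{cases}
\exp\left(-j\,\dfrac{\pi u\, n(n+1)}{63}\right), & n = 0, 1, \ldots, 30 \\[2ex]
\exp\left(-j\,\dfrac{\pi u\, (n+1)(n+2)}{63}\right), & n = 31, 32, \ldots, 61
\end{cases}
\tag{32}
$$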

where the root index u indicates the sector ID according to Tbl. 1. Notice that N = 63,

while only 62 elements are generated. The central value is punctured so that the D.C.

subcarrier is not populated. These root indices were chosen for their good auto and

cross-correlation properties, as well as their low frequency-offset sensitivity, allowing

detection before frequency offset correction has taken place in the receiver [23].
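As a concrete illustration of Eq. 32 and Tbl. 1, the three PSS sequences can be generated in a few lines. This is a sketch only; the function and variable names are hypothetical. It also checks the two storage-related properties mentioned earlier: unit magnitude and the conjugate relationship, which holds between the u = 29 and u = 34 sequences because the two indices sum to 63.

```python
import cmath
import math

ROOT_INDEX = {0: 25, 1: 29, 2: 34}    # Tbl. 1: sector ID -> root index u


def pss_sequence(sector_id):
    """62-element frequency-domain PSS of Eq. 32 for the given sector ID."""
    u = ROOT_INDEX[sector_id]
    seq = []
    for n in range(62):
        # Entries 31..61 use (n+1)(n+2), which skips the punctured centre
        # element of the underlying length-63 ZC sequence.
        k = n * (n + 1) if n <= 30 else (n + 1) * (n + 2)
        seq.append(cmath.exp(-1j * math.pi * ((u * k) % 126) / 63))
    return seq


pss = {sid: pss_sequence(sid) for sid in ROOT_INDEX}

# Every element has unit magnitude, so only the angles need to be stored.
unit_mag = all(abs(abs(v) - 1.0) < 1e-12 for s in pss.values() for v in s)

# u = 29 and u = 34 sum to 63, which makes the two sequences conjugates.
conj_err = max(abs(a - b.conjugate()) for a, b in zip(pss[1], pss[2]))
print("unit magnitude:", unit_mag, " max conjugate mismatch:", conj_err)
```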

Fig. 11 shows the cyclic correlation of the ZC sequence using u = 25 with itself and
with the sequences of the other two root indices. The magnitude-squared auto- and
cross-correlations have been scaled to reflect the puncturing of the central entry in the overall
sequence. The autocorrelation of root index 25 shows the excellent properties of the
ZC sequences (Fig. 11, top). While the cross-correlation between the u = 25 and u = 29
sequences is almost ideal (Fig. 11, middle), some correlation energy is observed between
u = 25 and u = 34 (Fig. 11, bottom). Interestingly, [23] highlights the admirable
zero cross-correlation between u = 25 and u = 29, but avoids revealing the non-ideal
properties between u = 25 and u = 34.

In the LTE downlink, the PSS sequences $z_u[n]$ occupy the central 62 subcarriers in
the last symbol of the 0th and 10th slots in each 10 ms frame (each frame contains 20
slots of 500 µs) in frequency-division duplexing (FDD) operation and the third symbol in
the 3rd and 12th slots in time-division duplexing (TDD) (see [23]). The PSS remains in
these locations regardless of the bandwidth mode or any other configuration parameter.
Note that the DC subcarrier is null in the LTE downlink and the PSS is placed in the
62 surrounding subcarriers. Interestingly, the 10 total subcarriers surrounding the PSS
are null, providing a 75 kHz gap between the PSS and the surrounding miscellaneous
subcarriers on each side. Fig. 12 illustrates the general orientation of the PSS subcarriers
in an OFDM symbol, where $N_{FFT}$ is the size of the receiver’s FFT, determined by the
operating bandwidth.

Figure 11: Cyclic Correlation Properties Between the LTE Downlink PSS ZC Sequences
(Three panels titled “Cyclic Correlation Between ZC Sequences with Root Index 25 and 25,29,34”: $|R_{z_0 z_0}|^2/(N-1)^2$, $|R_{z_0 z_1}|^2/(N-1)^2$, and $|R_{z_0 z_2}|^2/(N-1)^2$ vs. cyclic shift (m−n).)

The cross-correlation properties of the ZC sequences are utilized to indicate the
sector identity ($N_{ID}^{(2)}$) to the receiver. The transmitted PSS is not correlated with the
sequences generated using the other root indices, so the receiver can easily determine
the sector ID by performing matched filtering with the received sequence against the 3
possibilities. Once detected, the sector ID provides the receiver key information used
to determine critical parameters such as reference symbol placement (frequency shift),
the descrambling PN sequence seeds, and the location of the secondary synchronization
signal (SSS), among other things. The SSS detection then provides $N_{ID}^{(1)}$, which is then
used to determine the cell ID.

Figure 12: PSS Position in an LTE Downlink OFDM Symbol
(Diagram: FFT index (frequency) from $-N_{FFT}/2$ to $N_{FFT}/2-1$, with the PSS between indices −32 and 32 around index 0 and data/other subcarriers on both sides.)

To minimize detection error, [22] uses an insightful technique to confirm that the
correct detection of $N_{ID}^{(2)}$ has been performed. Together, $N_{ID}^{(2)}$ and $N_{ID}^{(1)}$ make up the cell
identification number, which is used to determine the seed for the PN data source as
well as the location of the modulated reference symbols (RS) in the LTE frequency-time
resource grid. If the cell ID is correctly detected, very good cross-correlation should
exist between the anticipated and received values. If good correlation does not result, the
detection of $N_{ID}^{(2)}$ and $N_{ID}^{(1)}$ has almost certainly failed, signaling a retry attempt for cell
ID detection.

The correlation properties of the PSS can also be used to indicate symbol timing
information. After collecting several PSS position estimates, the drift of their timing
can then be used to generate SCO estimates, providing an additional SCO estimation
source, aiding in the proposed feedback-controlled SCO cancellation process. The
maximum likelihood symbol timing estimate $\hat{\tau}$ is derived using a cyclic matched filter with
the three possible ZC sequences

$$
\hat{\tau} = \arg\max_{m}\; \arg\max_{u \in \{25, 29, 34\}} \left| \sum_{n=0}^{N-1} x[n]\, z_u^*[(n+m) \bmod N] \right|^2, \quad m = 0, 1, \ldots, N-1,
\tag{33}
$$

where $x[n]$ is the received signal time series that contains the transmitted PSS. Eq. 33
also provides the detected $N_{ID}^{(2)}$ parameter, indicated by its respective root index $u$.
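A brute-force evaluation of the metric in Eq. 33 illustrates how the root index and the timing lag fall out of the same search. The sketch below is a simplification and not the detector developed in this section: it correlates against full length-63 time-domain ZC sequences (no puncturing, no IFFT-based filter generation), and the noiseless test input is hypothetical.

```python
import cmath
import math

N = 63
ROOTS = (25, 29, 34)


def zc(root):
    """Full length-63 ZC sequence (Eq. 27, N odd, q = 0), not punctured."""
    return [cmath.exp(-1j * math.pi * ((root * n * (n + 1)) % (2 * N)) / N)
            for n in range(N)]


def detect(x):
    """Return (root index, lag m) maximizing |sum_n x[n] conj(z_u[(n+m) mod N])|^2."""
    best = (None, None, -1.0)
    for u in ROOTS:
        z = zc(u)
        for m in range(N):
            acc = sum(x[n] * z[(n + m) % N].conjugate() for n in range(N))
            metric = abs(acc) ** 2
            if metric > best[2]:
                best = (u, m, metric)
    return best[0], best[1]


# Hypothetical noiseless test input: the u = 29 sequence advanced by 17 samples,
# so that x[n] = z_29[(n + 17) mod N] and the correlation peaks at m = 17.
true_u, true_shift = 29, 17
z_tx = zc(true_u)
x = [z_tx[(n + true_shift) % N] for n in range(N)]

print(detect(x))
```

The search visits all three sequences at every lag, which is the cost that the multi-rate structure discussed later aims to keep at the minimum sampling rate.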

Converting Eq. 33 into a linear convolution allows the matched filtering operation

to be performed using FIR structures and removes the assumption that the signal is

periodic, a better general assumption, especially when assuming that timing errors will

likely be present and that adjacent symbols exist in time.

$$
\hat{\tau} = \arg\max_{m}\; \arg\max_{u \in \{25, 29, 34\}} \left| \sum_{n=0}^{N-1} x[n]\, z_u^*[n+m] \right|^2, \quad m = 0, 1, \ldots, N-1.
\tag{34}
$$

Fig. 13 shows that the correlation properties are only negligibly degraded when
performing linear instead of cyclic correlation.

Figure 13: Linear Correlation Properties Between the LTE Downlink PSS ZC Sequences
(Three panels titled “Linear Correlation Between ZC Sequences with Root Index 25 and 25,29,34”: $|R_{z_0 z_0}|^2/(N-1)^2$, $|R_{z_0 z_1}|^2/(N-1)^2$, and $|R_{z_0 z_2}|^2/(N-1)^2$ vs. time shift.)

The oversampling of the PSS signals in the wider LTE bandwidth configurations
suggests the use of efficient linear convolution algorithms such as the overlap-add
or overlap-save algorithms [24], both of which take advantage of the computationally
efficient properties of the FFT. Both the FIR and overlap-add methods will be considered
for matched filtering of the incoming time sequence. The following discussion
compares a proposed multi-rate technique with the standard overlap-add algorithm.

When $N_{FFT} = 128$, according to [20] and as illustrated in Fig. 12, the entire
PSS-occupied OFDM symbol is populated with the PSS and null subcarriers. In this
configuration, the PSS symbol is oversampled by a factor of two. Computational savings can
be obtained by performing filtering at the minimum possible rate [25].

Stepping up to $N_{FFT} = 256$, the PSS is surrounded by data subcarriers, separated
by 5 null subcarriers, as shown in Fig. 14. In this configuration, downsampling by 2
simply folds the data subcarrier energy upon itself. Only the aliases of the sidelobes,
or “shoulders”, of the OFDM signal fold over to the PSS-occupied subcarrier indices.
Fig. 15 more clearly shows the sidelobes on a multi-symbol FFT of a specially generated
LTE test signal that contains PSS subcarriers in every symbol. Clearly, the sidelobes
generated by the PSS and the matched filter will both alias in the same manner when
downsampled, thus the matched filter and the signal both retain their correlation
properties. However, another downsampling operation by 2, reducing the matched filtering
operation to the minimum possible sampling rate, aliases the data subcarriers to the
PSS location, and therefore a traditional band-limiting or polyphase downsampler must
be used to prevent such interference.

Figure 14: Alias Zones Introduced by Initial 2x Downsampling in an LTE PSS Symbol ($N_{FFT} = 256$)
(Plot “PSS Symbol Subcarrier Occupation with Alias Regions from 2x Downsampling”: magnitude vs. FFT index, with the PSS at the center and a folding region on each side.)

Figure 15: Frequency Response of LTE Test Signal Overlaid with Oversampled PSS Matched Filter ($N_{FFT} = 256$)
(Plot: magnitude in dB, normalized to 0 dB, vs. subcarrier index; traces: LTE test signal, matched filter.)

Whether by design or coincidence, the initial no-cost 2x downsampling operation
never aliases over the PSS locations in any of the FFT sizes and bandwidth configurations
available in the LTE downlink ($N_{FFT} = 2^n$, $n \in \{7, 8, 9, 10, 11\}$), always allowing a
no-cost downsampling stage to initially reduce the sampling rate by 2. To further
reduce the workload in each FFT configuration, successive FIR downsampling can be
performed to allow constant minimum-rate PSS correlation.
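The folding argument can be checked with simple bin arithmetic. The sketch below considers only the nominally occupied bins of a PSS symbol and ignores sidelobe leakage; the occupied-subcarrier counts per FFT size come from the standard LTE numerology and are an assumption here, since they are not listed in the text.

```python
# Occupied subcarriers per FFT size: standard LTE numerology, assumed here
# (the text above does not list these values).
OCCUPIED = {128: 72, 256: 180, 512: 300, 1024: 600, 2048: 1200}

PSS_BINS = set(range(-31, 32)) - {0}                      # 62 PSS subcarriers around DC
GUARD_BINS = set(range(-36, -31)) | set(range(32, 37))    # 5 nulls per side


def fold(k, size):
    """Map FFT bin k onto the range [-size/2, size/2 - 1] after decimation."""
    return (k + size // 2) % size - size // 2


def aliased_onto_pss(n_fft, decimation):
    """Data bins of a PSS symbol that land on a PSS bin after unfiltered decimation."""
    half = OCCUPIED[n_fft] // 2
    data = (set(range(-half, half + 1)) - {0}) - PSS_BINS - GUARD_BINS
    out_size = n_fft // decimation
    return sorted(k for k in data if fold(k, out_size) in PSS_BINS)


for n_fft in sorted(OCCUPIED):
    hits = aliased_onto_pss(n_fft, 2)
    print(n_fft, "2x:", "clean" if not hits else f"{len(hits)} bins alias onto the PSS")

print(256, "4x:", len(aliased_onto_pss(256, 4)), "bins alias onto the PSS")
```

Under these assumptions the first 2x stage leaves the PSS bins untouched for every FFT size, while a second unfiltered 2x stage at $N_{FFT} = 256$ folds data bins directly onto the PSS, which is why the later stages need filtering.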

The goal of this multi-rate design is to perform matched filtering at a constant and

minimum possible rate across all LTE bandwidth modes and FFT sizes. Starting with the

smallest FFT size and its corresponding sampling rate, the minimum-rate matched filter

bank is generated by populating 128-element IFFT operations with the three possible

63-element frequency domain ZC sequences. The filter generated by a minimum-sized
64-element IFFT could be used for minimum-rate processing; however, for $N_{FFT} > 128$,
the required downsampling filter must have an excessively narrow transition band to
fit inside the gap provided by the 5 null subcarriers surrounding the PSS. Avoiding
degradation of the correlation with such a narrow transition band requires a filter order
high enough that oversampled PSS correlation is justified.

For $N_{FFT} = 128$, no rate transition is used, and the matched filter is oversampled
by a factor of 2. For $N_{FFT} = 256$, the matched filter bank is preceded by the no-cost 2x

downsampling operation. For each FFT size greater than 256, a 2x polyphase down-

sampler is added, creating a chain, or dyadic cascade of filters. Each filter only needs to

suppress frequencies that alias into the PSS location after its respective downsampling.

The overlaid frequency response of the entire filter chain is displayed in Fig. 16. Each

filter is designed to be a half-band filter, a special Nyquist filter that has zeros for nearly

half of its coefficient set, which provides a very efficient polyphase implementation and

reduces the coefficient storage requirement by a factor of 2.
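The zero-valued coefficient pattern of a half-band filter is easy to see in a small example. The sketch below uses a Hamming-windowed sinc prototype, which is an illustrative design choice only and not the method used to design h1 through h3; the tap count is likewise arbitrary.

```python
import math


def halfband_taps(num_side_taps):
    """Windowed-sinc half-band low-pass prototype with 2*num_side_taps + 1 taps.

    The windowed-sinc design and the Hamming window are illustrative choices only.
    """
    length = 2 * num_side_taps + 1
    taps = []
    for i in range(length):
        n = i - num_side_taps                       # offset from the centre tap
        ideal = 0.5 if n == 0 else math.sin(math.pi * n / 2) / (math.pi * n)
        window = 0.54 + 0.46 * math.cos(math.pi * n / num_side_taps)
        taps.append(ideal * window)
    return taps


taps = halfband_taps(7)          # 15-tap example
centre = len(taps) // 2
zero_taps = [i for i, h in enumerate(taps) if i != centre and abs(h) < 1e-12]
print("taps:", [round(h, 4) for h in taps])
print("zero-valued taps:", len(zero_taps), "of", len(taps))
```

Every tap at an even offset from the centre is zero, so in a 2x polyphase decomposition one branch collapses to a single delayed tap, and the symmetry of the remaining taps halves the number of distinct values to store.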

The cascaded filter response in Fig. 17 appears to show a particularly underwhelm-

ing frequency response. Large lobes emerge from the stop-band and the transition

band is nowhere near “sharp”. However, the filter gives excellent performance where

needed, not allowing interference to alias over the PSS while reducing the sampling

rate to a constant rate for a fixed matched filter bank, and using the minimum number

of computations to do so.

Figure 16: Dyadic Downsampling Filter Response Overlay
(Plot “Overlaid Dyadic Downsampling Filter Cascade − Magnitude Frequency Response (dB)”: magnitude in dB, normalized to 0 dB, vs. normalized frequency (×π rad/sample); traces: h1, h2, h3.)

Figure 17: Dyadic Downsampling Filter: Cascaded Frequency Response
(Plot “Dyadic Downsampling Filter Cascade − Cascaded Magnitude Frequency Response (dB)”: magnitude in dB, normalized to 0 dB, vs. normalized frequency.)

Figure 18: Dyadic Downsampling Filter with Matched Filter Bank: Implemented Structure
(Block diagram: RX input, a 2:1 stage, polyphase downsampling stages H1,1, H2,1, and H3,1 with delays z^-M1 to z^-M3, bypass multiplexers 1 to 4, the PSS0, PSS1, and PSS2 matched filters, and “max” blocks producing the sector ID and the symbol timing estimate.)

Filter   Input Sampling Rate   Coefficient Storage   Workload             Cumulative Workload
         (MHz)                 (samples)             (MMACCs/s)           (MMACCs/s)
PSS0     1.92                  65                    245.76 (complex)     245.76
PSS1     1.92                  65                    245.76 (complex)     491.52
PSS2     1.92                  65                    245.76 (complex)     737.28
h3       3.84                  4                     30.72 (real)         768
h2       7.68                  6                     92.16 (real)         860.16
h1       15.36                 6                     184.32 (real)        1,044.48
h0       30.72                 0                     0                    1,044.48

Table 2: Multi-Rate PSS Detector Computation and Coefficient Storage Requirements

Fig. 18 illustrates the implemented structure of the PSS detector at this point, show-

ing the cascade of 2x downsampling components, each separated by a bypass multi-

plexer used for selecting the appropriate rate transition for the incoming sampling rate

and respective FFT size. The multiplexers allow the final sampling rate to be constant,

regardless of the wide range of input sampling rates, allowing the matched filter bank

to operate with constant coefficients and at a constant rate.

A breakdown of the filter’s workload is listed in Tbl. 2. The cascaded design allows

each filter to work at a constant rate when enabled, simplifying the system’s workload

analysis. At the given workload, each of the matched filters can be implemented us-

ing a single complex MACC element in a Xilinx Virtex7 FPGA, each complex MACC

element requiring 6 DSP48E1 FPGA slices [26]. Two block RAMs (BRAMs) are required
to store the coefficients of the symmetric matched filters, two of which are complex
conjugates of each other, eliminating the need for a third BRAM. The remaining
polyphase downsamplers require too few coefficients to justify using valuable BRAM
elements; their coefficient storage should be relegated to discrete registers. The low
workload of each polyphase downsampler in the cascade allows it to be implemented using
a single DSP48E1, which can realistically provide 250 MMACCs/s. Summing up the
required resources, the downsamplers with matched filter bank should occupy
approximately 21 DSP48E1 elements and 3 BRAMs with some ancillary logic elements. The
DSP48E1 usage accounts for a mere 1.6% of the available elements in the smallest
Xilinx Virtex7 DSP-targeted FPGA.

Figure 19: Dyadic Upsampling Filter Response Overlay
(Plot “Overlaid Dyadic Upsampling Filter Cascade − Magnitude Frequency Response (dB)”: magnitude in dB, normalized to 0 dB, vs. normalized frequency; traces: h4, h5, h6, h7.)

The PSS detection performs admirably at this point in the design, capable of de-

tecting the correct sector ID in any of the possible sampling rates. A direct comparison

with the overlap-add method cannot yet be made. The output rate of the overlap-add

method is equal to its input. A side effect of the downsampling process in the multi-rate

design is reduced symbol timing resolution. To restore the original time resolution in

all operating sampling rates, variable power-of-2 upsampling can be performed in the

same manner as the downsampling direction, using a dyadic upsampling filter cascade

of half-band polyphase upsampling filters.

Fig. 19 shows the overlaid frequency response of the designed filter cascade at

the respective relative sampling rates. The upsampling filter has a more difficult job
than the downsampling portion. The upsampler must successively eliminate aliases
altogether, rather than eliminate aliases that only interfere with the PSS subcarriers.
The resulting filter design requires higher workloads in each stage. Fig. 20 shows the
cascaded frequency response of the upsampler.

Figure 20: Dyadic Upsampling Filter: Cascaded Frequency Response
(Plot “Dyadic Upsampling Filter Cascade − Full Cascaded Magnitude Frequency Response (dB)”: magnitude in dB, normalized to 0 dB, vs. normalized frequency.)

Filter   Input Sampling Rate   Coefficient Storage   Workload           Cumulative Workload
         (MHz)                 (samples)             (MMACCs/s)         (MMACCs/s)
h4       1.92                  8                     61.44 (real)       61.44
h5       3.84                  8                     122.88 (real)      184.32
h6       7.68                  4                     122.88 (real)      307.20
h7       15.36                 3                     184.32 (real)      491.52

Table 3: Dyadic Cascaded Upsampler Computation and Coefficient Storage Requirements

Fig. 21 shows the updated design with the attached upsampler cascade. The transposed
multiplexer shows that either multiplexer configuration can be used in the design.
The “max” component is now moved to the output of the upsampler chain to
take advantage of the unit-input sample time resolution achieved by the upsampling
operation.

The breakdown of the upsampler’s workload is shown in Tbl. 3. Like the downsampler,
the required coefficient storage space is very small, too small to waste valuable
BRAM in the FPGA, and general-purpose registers are a better choice. Also as with the
downsampler, each stage can be implemented using a single DSP48E1 MACC element,
totaling 4 in the entire design.

Figure 21: Multi-Rate PSS Detection Algorithm: Implemented Processing Structure
(Block diagram: the downsampling cascade and matched filter bank of Fig. 18 followed by upsampling stages H4,1 to H7,1 with delays z^-M4 to z^-M7, a mode selector (1 to 4), a “max” block producing the sector ID, and a “max(position)” block producing the symbol timing estimate.)

Analyzing Tbl. 2 and 3, and considering the anticipated hardware resource con-

sumption, it is hard to imagine that the overlap-add method can achieve better results.

The time-domain results in Fig. 22 show the overlaid outputs of the multi-rate and
overlap-add techniques. The overlap-add output is generated using MATLAB’s “fftfilt” function.
As expected, the resulting output sequences of the two techniques are quite similar. In
this example, the signal is generated according to the LTE specification using $N_{ID}^{(2)} = 0$
and $N_{FFT} = 2048$. The two PSS symbols are spaced by 5 ms in a single 10 ms radio
frame, according to the 20 MHz LTE mode of operation in the FDD format. In this
example, the overlap-add method uses a pair of 32,768-point inverse and forward FFTs
and an overlap of 2,047 samples, the ideal size automatically chosen by MATLAB’s
fftfilt function. Using this configuration, the overlap-add algorithm progresses through
the time series in strides of 30,721 samples, completing approximately 1,000 strides
per second. According to the fftfilt function, each (I)FFT operation requires 1,441,974
floating-point operations (flops).

Figure 22: Multi-Rate vs. Overlap-Add PSS Correlation
(Two panels, full view and zoomed view: magnitude-squared vs. sample index (time); traces: polyphase cascade, fftfilt.)

The breakdown analysis for the overlap-add PSS detection algorithm is listed in
Tbl. 4. The total sustained workload of 6.37 Gflops/s would require the theoretical
performance of almost 40 Cray-1 supercomputers from the year 1976, each capable of 160
Mflops/s [27]. Summing up the computations required for root index 25 matched
filtering, given in Tbl. 2 and 3, and assuming that a complex MACC requires 8 flops and a real
MACC requires 2, the total workload of the multi-rate design is 5.26 Gflops/s,
or nearly 33 Cray-1 computers.
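The overlap-add total quoted above can be reproduced from the fftfilt configuration. The short sketch below simply restates the four terms of Tbl. 4 as arithmetic; the per-operation flop counts (6 per complex multiplication, 2 per complex addition) and the 30.72 MHz input rate are taken from Tbl. 4 and Tbl. 2, while the variable names are illustrative.

```python
FFT_SIZE = 2 ** 15            # 32,768-point transforms
OVERLAP = 2047                # samples of overlap per stride
SAMPLE_RATE = 30.72e6         # samples/s for N_FFT = 2048
FLOPS_PER_FFT = 1_441_974     # per (I)FFT, as reported by fftfilt
NUM_FILTERS = 3               # one matched filter per root index
CRAY1_FLOPS = 160e6

stride = FFT_SIZE - OVERLAP                   # new samples consumed per block
strides_per_second = round(SAMPLE_RATE / stride)

rows = {
    "forward FFTs": strides_per_second * FLOPS_PER_FFT,
    "inverse FFTs": NUM_FILTERS * strides_per_second * FLOPS_PER_FFT,
    "complex multiplications": 6 * NUM_FILTERS * strides_per_second * FFT_SIZE,
    "complex additions": 2 * NUM_FILTERS * strides_per_second * OVERLAP,
}
total = sum(rows.values())

for name, flops in rows.items():
    print(f"{name:24s} {flops / 1e9:6.3f} Gflops/s")
print(f"{'total':24s} {total / 1e9:6.3f} Gflops/s  (~{total / CRAY1_FLOPS:.1f} Cray-1 units)")
```

The transforms dominate the total: the single shared forward FFT and the three inverse FFTs account for roughly 90% of the workload, with the frequency-domain multiplications making up most of the remainder.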

In addition to fewer computations, the multi-rate design has several other advantages.
The inner argmax term in Eq. 33 can be performed at the minimum rate. As a

result, the outer argmax term must only be evaluated on an individual output stream,
resulting in a two-thirds complexity reduction factor vs. the overlap-add method. The
multi-rate design also has a very low coefficient storage requirement, requiring only 39
real and 130 complex coefficients. Each of the filters used in the overlap-add technique
has $2^{15}$ elements, which can be computed from smaller coefficient sets; however, the
full-sized filter coefficients must reside in memory during operation. In addition to the
memory required for the filter coefficients, the FFT itself requires large tables of twiddle
factors that must also reside in RAM. A $2^{15}$-point FFT implemented in a Xilinx Virtex6
FPGA uses 201 BRAMs, 31 DSP48E elements, and has a latency of 397 µs [28].
In a software implementation, the large tables of twiddle factors and filter coefficient
sets may exceed the available cache space in a microprocessor, or may require large
amounts of dedicated cache space. The low computational workload of the overlap-add
method seems to come at the price of increased memory system complexity.

Operation                                   flops/s
1,000 2^15-pt. FFTs                         1,000 × 1,441,974
3,000 2^15-pt. IFFTs                        3,000 × 1,441,974
3,000 × 2^15 complex multiplications        6 × 3,000 × 2^15
3,000 × 2,047 complex additions             2 × 3,000 × 2,047
Σ                                           6.37 Gflops/s

Table 4: Breakdown of Computations of Overlap-Add PSS Detection as Implemented by MATLAB’s “fftfilt” Function

Figure 23
(Plot “PSS Symbol Timing Estimation, MSE vs. SNR, Overlap-Add (OA) vs. Multi-Rate (MR) Techniques vs. OFDM FFT size”: MSE on a logarithmic axis vs. SNR, for FFT sizes 128, 256, 512, 1024, and 2048.)

Detailed performance testing of the symbol timing capability is shown in Fig. 23.

As expected, the overlap-add and the multi-rate techniques are equivalent using the

smallest FFT size since the multi-rate technique actually doesn’t change the sampling

rate in this case. For FFT sizes 256 and 512, the MSE for the multi-rate algorithm is

roughly 2-3 dB SNR worse than overlap-add, according to the test results. Interestingly,

in the 1,024 and 2,048 FFT size operation, the multi-rate algorithm outperforms the



overlap-add algorithm in higher SNRs, crossing over at approximately 13-14 dB SNR

in both configurations. In these configurations, the multi-rate algorithm significantly

outperforms the overlap-add in high SNR conditions, where the MSE produced by the

overlap-add algorithm flattens out to a near-constant level, becoming independent of

SNR. The test was performed using fully occupied PSS/data OFDM symbols according
to the LTE specification, totaling 24,000 PSS symbol detections for each data point.

3.5 Concluding Remarks

Receiver synchronization using time domain measurements allows the receiver to be

compartmentalized, relying solely on pre-FFT information. In the case of sampling

clock frequency synchronization, time domain measurements offer many benefits that

not only simplify the receiver’s architecture but also eliminate any dependency on
post-FFT information, which itself is fundamentally dependent on synchronization
performance. In the analysis of symbol timing synchronization, it was shown that sampling
frequency errors and symbol timing errors are interrelated, and a receiver architecture
was developed that simultaneously synchronizes both. The proposed receiver
architecture demonstrated admirable performance on the most generic OFDM system, using only
CP-derived symbol timing information.

To enhance timing and sampling frequency error measurements in an LTE downlink

receiver, the PSS is detected in the time domain using an efficient multi-rate architec-

ture. The computational workload was shown to be less than that of a technique known

for efficiently implementing long-length filters, the overlap-add algorithm.

Despite the depth of content in this chapter, critical components have been
left undiscussed, particularly those involving carrier frequency offset (CFO) correction. CFO

and SCO both produce SNR-degrading ICI [29]. CFO correction is well-studied and

can even be performed using the previously mentioned Beek algorithm. Many papers

present time-domain CFO correction techniques that align with the established design

principles in the presented receiver architecture [10,13,30,31].



4 OFDM Channel Estimation and Equalization

OFDM is well known for its simple single-tap equalization procedure (Eq. 179). While

the equalization procedure is trivial, obtaining (estimating) the equalization matrix can

be the most computationally intensive portion of an OFDM receiver. The process of

estimating the equalization matrix must be preceded by estimating the channel’s fre-

quency response, i.e. Eq. 176, 177. The equalization matrix is obtained by inverting

the diagonal frequency response matrix, i.e. Eq. 178.

In an OFDM receiver, the exact values that make up H1 in Eq. 174, or more concisely,

the channel impulse response vector h that comprises H1, is unknown and must be

estimated using the received signal. Usually, reference symbols (RSs) are inserted into

the OFDM symbol vector that give the receiver a foothold for directly estimating the

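equalization matrix E. If the transmitted vector x(k) is known, then v(k), the noisy,
channel-corrupted version of the transmitted signal, is available:

$$
\begin{aligned}
\mathbf{v}(k) &= \mathbf{W}_M \mathbf{Z}_R \mathbf{H}_1 \mathbf{Z}_T \mathbf{W}_M^H \mathbf{x}(k) + \mathbf{W}_M \mathbf{Z}_R \mathbf{n}(k) \\
\tilde{\mathbf{n}}(k) &= \mathbf{W}_M \mathbf{Z}_R \mathbf{n}(k) \\
\mathbf{v}(k) &= \mathbf{D}\mathbf{x}(k) + \tilde{\mathbf{n}}(k)
\end{aligned}
\tag{35}
$$

The ideal equalization matrix E is obtained if the noise is explicitly known

$$
\mathbf{E} = \mathbf{D}^{-1} = \operatorname{diag}\left\{ \frac{\mathbf{x}(k)}{\mathbf{v}(k) - \tilde{\mathbf{n}}(k)} \right\},
\tag{36}
$$

which is impractical. More realistically, the receiver can obtain the least-squares
solution, where the division is taken element-wise:

$$
\begin{aligned}
\mathbf{h}_{LS} &= \frac{\mathbf{x}(k)}{\mathbf{v}(k)} \\
\mathbf{E}_{LS} &= \operatorname{diag}\left\{ \mathbf{h}_{LS} \right\}
\end{aligned}
\tag{37}
$$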

This type of channel estimation is commonly referred to as the “least-squares” (LS)

estimator because ELS is the least-squares solution to the system of linear equations.

The LS estimator gives poor performance but has a very low computational complexity

when compared to other channel estimation techniques.
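The LS estimator of Eq. 37 amounts to one complex division per reference subcarrier. The sketch below illustrates this on a hypothetical flat-per-subcarrier channel; the channel gains, noise level, QPSK reference symbols, and the assumption that every subcarrier carries a reference symbol are all illustrative and not taken from the LTE configuration.

```python
import cmath
import random

random.seed(0)

NUM_SUBCARRIERS = 8
NOISE_STD = 0.01          # hypothetical noise level for the illustration

# Hypothetical diagonal channel D: one complex gain per subcarrier.
channel = [cmath.rect(random.uniform(0.5, 1.5), random.uniform(-cmath.pi, cmath.pi))
           for _ in range(NUM_SUBCARRIERS)]

# Known reference symbols x(k): unit-magnitude QPSK points.
qpsk = [cmath.rect(1.0, cmath.pi / 4 + i * cmath.pi / 2) for i in range(4)]
x_ref = [random.choice(qpsk) for _ in range(NUM_SUBCARRIERS)]


def noise():
    """Complex Gaussian noise sample with independent real and imaginary parts."""
    return complex(random.gauss(0, NOISE_STD), random.gauss(0, NOISE_STD))


# Received reference symbol: v = D x + n  (third line of Eq. 35)
v_ref = [d * x + noise() for d, x in zip(channel, x_ref)]

# Eq. 37: element-wise ratio x/v forms the diagonal of the LS equalizer.
e_ls = [x / v for x, v in zip(x_ref, v_ref)]

# Apply the single-tap equalizer to a later data symbol through the same channel.
x_data = [random.choice(qpsk) for _ in range(NUM_SUBCARRIERS)]
v_data = [d * x + noise() for d, x in zip(channel, x_data)]
x_hat = [e * v for e, v in zip(e_ls, v_data)]

worst_error = max(abs(a - b) for a, b in zip(x_hat, x_data))
print("worst-case equalized symbol error:", worst_error)
```

Because the noise on the reference symbol passes straight into the equalizer coefficient, the estimate degrades quickly as the noise level rises, which is the weakness the LMMSE filter discussed next is intended to address.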

To improve the precision of the channel estimate in noisy conditions, the chan-

nel’s second-order statistics can be used to obtain the linear minimum-mean-squared

error (LMMSE) filter. After applying the filter, the LMMSE channel estimate is ob-

40


tained [32,33].

\[
\mathbf{h}_{MMSE} = \mathbf{Q}_{MMSE}\,\mathbf{h}_{LS}
\tag{38}
\]

The filter matrix QMMSE is obtained using the Normal equations.

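\[
\begin{aligned}
\mathbf{Q}_{MMSE} &= \mathbf{R}_{h}\mathbf{R}_{hn}^{-1} \\
\mathbf{Q}_{MMSE} &= \mathbf{R}_{h}\left( \mathbf{R}_{h} + \sigma_n^{2}\mathbf{I} \right)^{-1}
\end{aligned}
\tag{39}
\]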

This solution is almost as impractical as when the noise was assumed to be explic-

itly known by the receiver. This estimator assumes the receiver knows the channel’s

second-order statistics. These requirements are rarely met in a practical receiver, and

the autocorrelation matrix and noise variance must be estimated. If the statistics used

to compute QMMSE contain errors, the matrix inversion could amplify those errors or provide little benefit to the channel estimator. Also, the channel’s statistics are assumed

to be non-stationary in a mobile cellular application such as LTE. In mobile channel

conditions, the autocorrelation matrix is constantly changing with time, requiring the

constant refreshing of QMMSE. Sources that cite this technique often claim that the

second-order statistics are “assumed to be known by the receiver” [23,33,34].
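The filter in Eq. 39 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exponential correlation model for R_h, the value of `rho`, the noise variance, and the modeling of the LS estimate as "true response plus white noise" are all hypothetical choices, and the statistics are taken as perfectly known, which is exactly the condition the text calls impractical.

```python
import numpy as np

def lmmse_filter(R_h, noise_var):
    """Return Q = R_h (R_h + noise_var * I)^{-1}, as in Eq. 39."""
    A = R_h + noise_var * np.eye(R_h.shape[0])
    # Solve Q A = R_h via A^H Q^H = R_h^H to avoid forming an explicit inverse.
    return np.linalg.solve(A.conj().T, R_h.conj().T).conj().T

n = 16
rho = 0.9                                          # hypothetical neighbor correlation
idx = np.arange(n)
R_h = rho ** np.abs(idx[:, None] - idx[None, :])   # assumed exponential correlation model
noise_var = 0.5                                    # hypothetical noise variance
Q = lmmse_filter(R_h, noise_var)

rng = np.random.default_rng(0)
L = np.linalg.cholesky(R_h)
trials = 2000
h = L @ rng.normal(size=(n, trials))               # draws with covariance R_h
h_ls = h + np.sqrt(noise_var) * rng.normal(size=(n, trials))  # assumed noisy LS estimate
h_mmse = Q @ h_ls                                  # Eq. 38
mse_ls = np.mean((h_ls - h) ** 2)
mse_mmse = np.mean((h_mmse - h) ** 2)
print(mse_ls, mse_mmse)
```

When the assumed R_h and noise variance match the data, the filtered estimate has a lower mean-squared error than the raw one; the sketch says nothing about the mismatched case discussed above.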

The receiver benefits directly from the equalization matrix rather than the estimated

channel. Naturally, the two are inverses of each other. Rather than estimate the chan-

nel, the optimum equalization matrix can be found using the following optimization

problem:

\[
\min_{\mathbf{E}} \; \operatorname{E}\left\{ \left\| \mathbf{x} - \mathbf{E}\mathbf{v} \right\|^{2} \right\}
\tag{40}
\]

To find the solution to this optimization problem, the following cost function is established and expanded [15].

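\[
\begin{aligned}
J(\mathbf{E}) &\triangleq \operatorname{E}\left\{ \left\| \mathbf{x} - \mathbf{E}\mathbf{v} \right\|^{2} \right\} \\
&= \operatorname{E}\left\{ (\mathbf{x} - \mathbf{E}\mathbf{v})(\mathbf{x} - \mathbf{E}\mathbf{v})^{H} \right\} \\
&= \operatorname{E}\left\{ \mathbf{x}\mathbf{x}^{H} \right\} - \operatorname{E}\left\{ \mathbf{x}\mathbf{v}^{H} \right\}^{H}\mathbf{E} - \mathbf{E}^{H}\operatorname{E}\left\{ \mathbf{x}\mathbf{v}^{H} \right\} + \mathbf{E}\operatorname{E}\left\{ \mathbf{v}\mathbf{v}^{H} \right\}\mathbf{E}^{H} \\
&= \mathbf{R}_{x} - \mathbf{R}_{xv}^{H}\mathbf{E} - \mathbf{E}^{H}\mathbf{R}_{xv} + \mathbf{E}\mathbf{R}_{v}\mathbf{E}^{H} \\
&= \begin{bmatrix} 1 & \mathbf{E}^{H} \end{bmatrix}
\begin{bmatrix} \mathbf{R}_{x} & -\mathbf{R}_{vx} \\ -\mathbf{R}_{xv} & \mathbf{R}_{v} \end{bmatrix}
\begin{bmatrix} 1 \\ \mathbf{E} \end{bmatrix}.
\end{aligned}
\tag{41}
\]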

Using Schur factorization, the center matrix of the cost function can be decomposed



into a product of upper-triangular, diagonal, and lower-triangular matrices.

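\[
\begin{bmatrix} \mathbf{R}_{x} & -\mathbf{R}_{vx} \\ -\mathbf{R}_{xv} & \mathbf{R}_{v} \end{bmatrix}
=
\begin{bmatrix} 1 & -\mathbf{R}_{vx}\mathbf{R}_{v}^{-1} \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} \mathbf{R}_{x} - \mathbf{R}_{vx}\mathbf{R}_{v}^{-1}\mathbf{R}_{xv} & 0 \\ 0 & \mathbf{R}_{v} \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ -\mathbf{R}_{v}^{-1}\mathbf{R}_{xv} & 1 \end{bmatrix}
\tag{42}
\]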

Substituting Eq. 42 into the result of Eq. 41 gives the expanded cost function.

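\[
J(\mathbf{E}) = \left( \mathbf{R}_{x} - \mathbf{R}_{vx}\mathbf{R}_{v}^{-1}\mathbf{R}_{xv} \right)
+ \left( \mathbf{E} - \mathbf{R}_{v}^{-1}\mathbf{R}_{xv} \right)^{H} \mathbf{R}_{v} \left( \mathbf{E} - \mathbf{R}_{v}^{-1}\mathbf{R}_{xv} \right)
\tag{43}
\]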

Since Rv is positive semi-definite, the equalization matrix that minimizes the cost func-

tion is

\[
\mathbf{E} = \mathbf{E}_{o} = \mathbf{R}_{v}^{-1}\mathbf{R}_{xv}
\tag{44}
\]

and the minimum mean-squared error is

\[
J_{\min} = J(\mathbf{E}_{o}) = \text{m.m.s.e.} = \left( \mathbf{R}_{x} - \mathbf{R}_{vx}\mathbf{R}_{v}^{-1}\mathbf{R}_{xv} \right).
\tag{45}
\]

One popular method to solve for the optimal equalization matrix Eo in Eq. 44 is to use

the steepest descent algorithm, which is widely known for its ability to start from an

initial guess for Eo and make iterative improvements on the guess, ultimately converging on the true Eo. The general update procedure for the steepest descent algorithm is

given by

\[
\mathbf{E}_{i} = \mathbf{E}_{i-1} + \mu\mathbf{P}_{i}, \quad i \ge 0
\tag{46}
\]

where each update is performed using the P matrix and the step size µ. Each P matrix

must be computed such that the cost decreases monotonically with each iteration, i.e.

\[
J(\mathbf{E}_{i}) < J(\mathbf{E}_{i-1})
\tag{47}
\]

The update matrix P can be computed using the gradient of the cost function [15]

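\[
\mathbf{P}_{i} = -\left[ \nabla J\left( \mathbf{E}_{i-1} \right)^{H} \right]^{H} = \mathbf{R}_{xv} - \mathbf{R}_{v}\mathbf{E}_{i-1},
\tag{48}
\]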

so that

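\[
\mathbf{E}_{i} = \mathbf{E}_{i-1} + \mu\left( \mathbf{R}_{xv} - \mathbf{R}_{v}\mathbf{E}_{i-1} \right), \quad i \ge 0, \quad \mathbf{E}_{-1} = \text{initial guess}
\tag{49}
\]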

Also, the step size µ must satisfy the following condition to ensure Eq. 47

\[
0 < \mu < \frac{2}{\lambda_{\max}},
\tag{50}
\]


where λmax denotes the largest eigenvalue of Rv.
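The recursion in Eq. 49 and the step-size bound in Eq. 50 can be illustrated with a small numerical sketch. The correlation matrices below are random, hypothetical stand-ins (R_v is constructed to be Hermitian positive definite), the step size is set to 1/λmax as one admissible choice, and the iteration count is arbitrary.

```python
import numpy as np

def steepest_descent(R_v, R_xv, mu, n_iter, E0=None):
    """Iterate E_i = E_{i-1} + mu * (R_xv - R_v E_{i-1}), as in Eq. 49."""
    E = np.zeros_like(R_xv) if E0 is None else E0.copy()
    for _ in range(n_iter):
        E = E + mu * (R_xv - R_v @ E)
    return E

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
R_v = A @ A.conj().T + n * np.eye(n)           # hypothetical Hermitian positive-definite R_v
R_xv = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))  # hypothetical R_xv

lam_max = np.linalg.eigvalsh(R_v).max()
mu = 1.0 / lam_max                             # satisfies 0 < mu < 2 / lam_max (Eq. 50)
E_sd = steepest_descent(R_v, R_xv, mu, n_iter=2000)
E_o = np.linalg.solve(R_v, R_xv)               # closed-form optimum of Eq. 44
print(np.max(np.abs(E_sd - E_o)))
```

With known statistics the iteration simply reproduces the closed-form solution of Eq. 44; its value lies in the stochastic variants, where the correlation matrices are replaced by estimates.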

A receiver rarely has knowledge of the correlation matrices Rxv and Rv, which are usually time-varying. However, stochastic optimization techniques based on the steepest descent algorithm can be used to obtain an equalization matrix that approaches Eo by estimating the second-order statistics. Both the algorithms in Eq. 44 and 39 perform estimation assuming known statistics. If the statistics are estimated, the final product is a result of two consecutive layers of estimates.

Stochastic optimization methods do not require the explicit knowledge of the second-

order statistics (correlation matrices) of the channel and yet can approach the optimum

solution over a sequence of iterations. The least-mean-squares (LMS) algorithm operates similarly to the steepest descent algorithm, except that the correlation matrices are approximated using

the instantaneous realization of the outer products. In the LMS algorithm, the expecta-

tion operator is removed, and the correlation matrices are approximated by performing

the respective outer products, i.e.

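\[
\begin{aligned}
\hat{\mathbf{R}}_{xv} &= \mathbf{x}\mathbf{v}^{H} \\
\hat{\mathbf{R}}_{v} &= \mathbf{v}\mathbf{v}^{H}
\end{aligned}
\tag{51}
\]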

The LMS update equation can now be expressed as

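\[
\mathbf{E}_{i} = \mathbf{E}_{i-1} + \mu\left( \mathbf{x}_{i}\mathbf{v}_{i}^{H} - \mathbf{v}_{i}\mathbf{v}_{i}^{H}\mathbf{E}_{i-1} \right), \quad i \ge 0, \quad \mathbf{E}_{-1} = \text{initial guess},
\tag{52}
\]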

where the subscript i has been added to x and v to indicate the symbol index and the

step-size µ must satisfy Eq. 50.

Forcing E to be diagonal, the LMS algorithm in Eq. 52 approaches the optimal

value with finite steady-state error. Fig. 24 shows the results of the LMS algorithm

operating in the narrowest bandwidth LTE configuration, simultaneously estimating the

equalization coefficients for 24 RSs. The results show a wide variety of convergence

rates, ranging from near-instability to unreasonably slow. This effect is a result of

the differing relative step sizes, which depend on the input signal magnitudes. The

normalized LMS (NLMS) and ε-NLMS update the step sizes according to the magnitude

of the received signal, achieving equal convergence rates.
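The dependence of the convergence rate on the input magnitude can be reproduced with a small sketch of the diagonal form of Eq. 52, in which each subcarrier keeps one scalar coefficient. The three channel gains, the step sizes, the all-zero initial guess and the noise-free input are hypothetical simplifications, and the normalized branch is one common form of ε-NLMS rather than necessarily the variant used to produce Fig. 24.

```python
import numpy as np

def lms_diagonal(x, v, mu, normalized=False, eps=1e-6):
    """Diagonal (per-subcarrier) LMS or eps-NLMS version of Eq. 52.

    x, v: arrays of shape (n_symbols, n_subcarriers) holding known and received RS values.
    Returns the coefficient history with the same shape.
    """
    e = np.zeros(x.shape[1], dtype=complex)        # E_{-1}: all-zero initial guess
    history = np.empty(x.shape, dtype=complex)
    for i, (x_i, v_i) in enumerate(zip(x, v)):
        err = x_i - e * v_i                        # per-subcarrier a-priori error
        step = mu / (eps + np.abs(v_i) ** 2) if normalized else mu
        e = e + step * np.conj(v_i) * err          # diagonal of mu * (x v^H - v v^H E)
        history[i] = e
    return history

rng = np.random.default_rng(0)
n_sym = 2000
# Hypothetical weak, medium and strong subcarrier gains.
d = np.array([0.2, 1.0, 2.0]) * np.exp(1j * np.array([0.3, -1.0, 2.0]))
x = (rng.choice([-1.0, 1.0], (n_sym, 3)) + 1j * rng.choice([-1.0, 1.0], (n_sym, 3))) / np.sqrt(2)
v = d * x                                          # noise-free for clarity

e_lms = lms_diagonal(x, v, mu=0.2)
e_nlms = lms_diagonal(x, v, mu=0.5, normalized=True)
print(np.abs(e_lms[49] - 1.0 / d))                 # weak subcarrier still far from 1/d
print(np.abs(e_nlms[49] - 1.0 / d))                # all subcarriers converged
```

With a fixed step, the weak subcarrier converges far more slowly than the strong one, while the normalized update gives every subcarrier the same contraction per symbol.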

Figure 24: LMS Equalizer Results: Equalization Coefficients (top), Per-Channel Squared Error (bottom)

The LMS algorithm (and its variants) is derived using the steepest-descent technique, estimating the second-order statistics using instantaneous realizations. The recursive least squares (RLS) algorithm alternatively arrives at the optimal solution using Newton’s method, which recursively updates and improves its estimate of the second-order statistics with each iteration. To provide tracking capability in time-variant conditions, a “forgetting factor” is included that proportionately weights newer and “forgets” older information. Depending on the application and forgetting factor selection, the RLS algorithm is capable of faster convergence rates and better steady-state error at considerably higher computational cost than the LMS family of algorithms. Hou in [35] presents the RLS algorithm in an OFDM channel estimation context. A more general overview of adaptive filters and stochastic optimization techniques can be found in [14,15].

Stochastic optimization (adaptive filter) algorithms are well-studied in the litera-

ture. Specifically, Rom in [36] provides an extensive survey of MMSE channel estima-

tion algorithms, presented in the specific context of the LTE downlink. It is worthwhile



to consider an alternative class of algorithms that do not satisfy MMSE optimality, but are maximum-likelihood (ML) over the set of model parameters used to describe the signal. Unlike the

stochastic optimization algorithms that directly produce the equalization matrix, the

next approach estimates the channel.

4.1 Linear Regression Techniques for Channel Estimation

Assuming that an OFDM symbol contains training (reference) symbols (subcarriers), a

linear regression algorithm is a powerful tool that can be used to find the best-fit, or

ML parameter vector to fit the data using a given model [37–40].

Suppose it is desired to estimate or predict a value given the training vector y

observed at the coordinates indicated by the respective row in the matrix Z. Assuming

2 dimensions, the training set is indicated by the column vector z, which generates

the feature vector x using a mapping function. To model the data, the θ vector will

contain the coefficients of an mth order model. The coefficients that optimally fit the

data using the chosen model will be obtained through the learning algorithm. The

model is chosen to suit the anticipated properties of the data.

The function that maps the coordinates into the feature space defines the model.

As an example, let the θ vector contain coefficients of an mth order polynomial used to

describe y in the feature space. The respective m× 1 vector x can be defined using

the elements of z.

\[
\mathbf{x}^{(i)} = \left[ z_i^{0},\, z_i^{1},\, \cdots,\, z_i^{m-1} \right]^{T}
\tag{53}
\]

Using the model parameter θ and the feature vectors, the prediction function generates

output values denoted by

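\[
\hat{y}^{(i)} = h\left( \mathbf{x}^{(i)} \right) = \theta_{0}x_{0}^{(i)} + \theta_{1}x_{1}^{(i)} + \cdots + \theta_{m-1}x_{m-1}^{(i)} = \boldsymbol{\theta}^{T}\mathbf{x}^{(i)}, \quad 0 \le i \le n-1
\tag{54}
\]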

where n indicates the number of training variables and the vector θ contains the obtained parameters used for prediction.

Using the entire set of feature vectors, the optimum parameter vector θ can be

found by constructing an n×m Vandermonde “design matrix”, X, using the x vectors



generated using the set of n training variables.

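\[
\mathbf{X} =
\begin{bmatrix}
\left( \mathbf{x}^{(0)} \right)^{T} \\
\left( \mathbf{x}^{(1)} \right)^{T} \\
\vdots \\
\left( \mathbf{x}^{(n-1)} \right)^{T}
\end{bmatrix}
\tag{55}
\]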

Using the established property that $h\left(\mathbf{x}^{(i)}\right) = \boldsymbol{\theta}^{T}\mathbf{x}^{(i)} = \left(\mathbf{x}^{(i)}\right)^{T}\boldsymbol{\theta}$, the error function for a chosen θ vector can be established to be the difference between the prediction output and the training.

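\[
\mathbf{X}\boldsymbol{\theta} - \mathbf{y} =
\begin{bmatrix}
\left( \mathbf{x}^{(0)} \right)^{T}\boldsymbol{\theta} - y_{0} \\
\left( \mathbf{x}^{(1)} \right)^{T}\boldsymbol{\theta} - y_{1} \\
\vdots \\
\left( \mathbf{x}^{(n-1)} \right)^{T}\boldsymbol{\theta} - y_{n-1}
\end{bmatrix}
=
\begin{bmatrix}
h\left( \mathbf{x}^{(0)} \right) - y_{0} \\
h\left( \mathbf{x}^{(1)} \right) - y_{1} \\
\vdots \\
h\left( \mathbf{x}^{(n-1)} \right) - y_{n-1}
\end{bmatrix}
\tag{56}
\]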

To minimize the squared error, the following cost function parameterized by θ is es-

tablished.

\[
J(\boldsymbol{\theta}) = \frac{1}{2}\left\| \mathbf{X}\boldsymbol{\theta} - \mathbf{y} \right\|^{2}
\tag{57}
\]

To minimize J(θ) with respect to θ, the gradient ∇θ J(θ) can be found, set equal to zero, and solved for θ to find the vector that optimizes the established cost function, θopt.

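\[
\begin{aligned}
\nabla_{\theta}J(\boldsymbol{\theta}) &= \frac{1}{2}\nabla_{\theta}\left[ \left\| \mathbf{X}\boldsymbol{\theta} - \mathbf{y} \right\|^{2} \right] \\
&= \frac{1}{2}\nabla_{\theta}\left[ \left( \mathbf{X}\boldsymbol{\theta} - \mathbf{y} \right)^{T}\left( \mathbf{X}\boldsymbol{\theta} - \mathbf{y} \right) \right] \\
&= \frac{1}{2}\nabla_{\theta}\left[ \boldsymbol{\theta}^{T}\mathbf{X}^{T}\mathbf{X}\boldsymbol{\theta} - \boldsymbol{\theta}^{T}\mathbf{X}^{T}\mathbf{y} - \mathbf{y}^{T}\mathbf{X}\boldsymbol{\theta} + \mathbf{y}^{T}\mathbf{y} \right] \\
&= \frac{1}{2}\left[ 2\mathbf{X}^{T}\mathbf{X}\boldsymbol{\theta} - 2\mathbf{X}^{T}\mathbf{y} \right] \\
&= \mathbf{X}^{T}\mathbf{X}\boldsymbol{\theta} - \mathbf{X}^{T}\mathbf{y} \\
\mathbf{X}^{T}\mathbf{X}\boldsymbol{\theta}_{opt} - \mathbf{X}^{T}\mathbf{y} &= 0 \\
\boldsymbol{\theta}_{opt} &= \left( \mathbf{X}^{T}\mathbf{X} \right)^{-1}\mathbf{X}^{T}\mathbf{y}.
\end{aligned}
\tag{58}
\]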

To predict new target values, the optimum parameter vector θ opt is multiplied by the



design matrix X.

\[
\hat{\mathbf{y}} = \mathbf{X}\boldsymbol{\theta}_{opt}
\tag{59}
\]

The X matrix in Eq. 59 contains the mapped coordinates of the predicted variable $\hat{\mathbf{y}}$.
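The steps in Eq. 53 through Eq. 59 can be sketched as follows. The coordinates, the three-term parameter vector and the noise-free targets are hypothetical; the normal equations are solved with a linear solve rather than an explicit inverse, which is a numerical convenience and not a change to Eq. 58.

```python
import numpy as np

def design_matrix(z, m):
    """n x m Vandermonde design matrix with rows [z_i^0, ..., z_i^(m-1)] (Eq. 53, 55)."""
    return np.vander(np.asarray(z, dtype=float), N=m, increasing=True)

def fit_normal_equations(X, y):
    """theta_opt = (X^T X)^{-1} X^T y (Eq. 58), computed with a linear solve."""
    return np.linalg.solve(X.T @ X, X.T @ y)

z = np.linspace(-1.0, 1.0, 21)               # hypothetical training coordinates
theta_true = np.array([0.5, -2.0, 3.0])      # hypothetical model with m = 3 coefficients
X = design_matrix(z, 3)
y = X @ theta_true                           # noise-free training targets
theta_opt = fit_normal_equations(X, y)
y_hat = X @ theta_opt                        # prediction, Eq. 59
print(theta_opt)
```

Because the targets are generated by the same model that is being fitted, the recovered parameters match the generating ones up to rounding.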

While this linear regression technique is useful in general, it has several disadvan-

tages. When the contour of the data becomes increasingly detailed, the model order

must increase to produce low levels of prediction error. The number of computations

required to compute the m× m matrix inverse to obtain θ opt becomes prohibitive as

the model order increases. Another disadvantage to this algorithm is the equal weight-

ing of each element in the training set. Outliers and data with a great distance from

other data can introduce large bias errors throughout the entire regression space while

not having any significance to any data other than the data in the directly surrounding

region. In the machine learning context, performing regression using small windows of

data allows regression tasks to be performed using huge sets of training data without

having to process the set of data in its entirety.

To reduce the size of training data that must be processed to obtain a desired target

variable, windows of data directly surrounding the target variable location can be used

in place of the entire set. Using this concept with a local weighting kernel, a low-order

model can be used to compute locally optimum parameter vectors θ (i)opt for each target

variable. This operation can be implemented by adding a diagonal weight matrix to

the cost function that can be used to emphasize only the samples in close proximity to

the particular target variable.

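\[
\begin{aligned}
J\left( \boldsymbol{\theta}^{(i)} \right) &= \frac{1}{2}\left[ \left( \mathbf{X}\boldsymbol{\theta}^{(i)} - \mathbf{y} \right)^{T}\mathbf{W}\left( \mathbf{X}\boldsymbol{\theta}^{(i)} - \mathbf{y} \right) \right] \\
\mathbf{W} &= \operatorname{diag}\left\{ \mathbf{w}^{(i)} \right\}
\end{aligned}
\tag{60}
\]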

The w vector contains the set of weights determined by the chosen weight kernel.

When W = I, the cost function collapses to Eq. 57. Alternatively, a popular kernel

exponentially decreases the influence of data points as they increase in distance from

the target variable location according to

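\[
w^{(i)} = \exp\left( -\frac{\left( \mathbf{x}^{(i)} - \mathbf{x} \right)^{T}\left( \mathbf{x}^{(i)} - \mathbf{x} \right)}{2\tau^{2}} \right),
\tag{61}
\]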

which (coincidentally) resembles the Gaussian kernel, where τ is the “bandwidth” parameter that determines the width of the bell-shaped function.

The new optimum θ (i)opt for this locally weighted regression (LWR) algorithm is found

using the cost function in Eq. 60,

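\[
\begin{aligned}
\nabla_{\theta}J\left( \boldsymbol{\theta}^{(i)} \right) &= \frac{1}{2}\nabla_{\theta}\left[ \left( \mathbf{X}\boldsymbol{\theta}^{(i)} - \mathbf{y} \right)^{T}\mathbf{W}\left( \mathbf{X}\boldsymbol{\theta}^{(i)} - \mathbf{y} \right) \right] \\
&= \frac{1}{2}\nabla_{\theta}\left[ \left( \boldsymbol{\theta}^{(i)} \right)^{T}\mathbf{X}^{T}\mathbf{W}\mathbf{X}\boldsymbol{\theta}^{(i)} - \left( \boldsymbol{\theta}^{(i)} \right)^{T}\mathbf{X}^{T}\mathbf{W}\mathbf{y} - \mathbf{y}^{T}\mathbf{W}\mathbf{X}\boldsymbol{\theta}^{(i)} + \mathbf{y}^{T}\mathbf{W}\mathbf{y} \right] \\
&= \mathbf{X}^{T}\mathbf{W}\mathbf{X}\boldsymbol{\theta}^{(i)} - \mathbf{X}^{T}\mathbf{W}\mathbf{y} \\
\mathbf{X}^{T}\mathbf{W}\mathbf{X}\boldsymbol{\theta}^{(i)}_{opt} - \mathbf{X}^{T}\mathbf{W}\mathbf{y} &= 0 \\
\boldsymbol{\theta}^{(i)}_{opt} &= \left( \mathbf{X}^{T}\mathbf{W}\mathbf{X} \right)^{-1}\mathbf{X}^{T}\mathbf{W}\mathbf{y}
\end{aligned}
\tag{62}
\]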

Each individual regression point at position i is computed using its corresponding $\boldsymbol{\theta}^{(i)}_{opt}$

\[
\hat{y}^{(i)} = \mathbf{x}^{(i)}\boldsymbol{\theta}^{(i)}_{opt}
\tag{63}
\]
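A direct, unoptimized sketch of Eq. 60 through Eq. 63 is given below. It assumes the kernel distance in Eq. 61 is measured on the scalar coordinate z, which coincides with the feature-space distance when m = 2; the test signal, noise level and τ are hypothetical.

```python
import numpy as np

def lwr_predict(z, y, z_query, tau, m=2):
    """Locally weighted regression: one weighted fit (Eq. 62) per query point."""
    X = np.vander(z, N=m, increasing=True)             # design matrix, Eq. 55
    y_hat = np.empty(len(z_query))
    for i, zq in enumerate(z_query):
        w = np.exp(-(z - zq) ** 2 / (2.0 * tau ** 2))  # kernel of Eq. 61 on coordinate z
        XtW = X.T * w                                  # X^T W without forming diag(w)
        theta = np.linalg.solve(XtW @ X, XtW @ y)      # Eq. 62
        x_q = zq ** np.arange(m)                       # feature vector of the query point
        y_hat[i] = x_q @ theta                         # Eq. 63
    return y_hat

rng = np.random.default_rng(0)
z = np.arange(64, dtype=float)
clean = np.sin(2 * np.pi * z / 64)                     # hypothetical smooth signal
noisy = clean + 0.3 * rng.normal(size=z.size)
y_lwr = lwr_predict(z, noisy, z, tau=3.0)
mse_raw = np.mean((noisy - clean) ** 2)
mse_lwr = np.mean((y_lwr - clean) ** 2)
print(mse_raw, mse_lwr)
```

Each output requires its own weighted solve, which is the per-point cost the following paragraphs set out to reduce.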

At first, it may appear cumbersome to compute a θ (i)opt vector for each regression

output. With closer inspection, intriguing benefits of this algorithm are quickly re-

vealed. The most off-putting part of this algorithm is the requirement of an online

matrix inverse for each output; the weight kernel used to generate W must depend

on i. The XT WX term produces an m× m matrix, therefore when m = 2, the matrix

inversion is trivial using the following special property of 2× 2 matrices.

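\[
\mathbf{A}^{-1} = \frac{1}{\det(\mathbf{A})}
\begin{bmatrix}
\mathbf{A}_{(2,2)} & -\mathbf{A}_{(1,2)} \\
-\mathbf{A}_{(2,1)} & \mathbf{A}_{(1,1)}
\end{bmatrix}
\tag{64}
\]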
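Eq. 64 translates directly into code. The sketch below uses zero-based indices, so `A[0, 0]` corresponds to A(1,1) in the equation, and the example matrix is hypothetical.

```python
import numpy as np

def inv2x2(A):
    """Closed-form inverse of a 2 x 2 matrix (Eq. 64); assumes det(A) != 0."""
    det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
    return np.array([[A[1, 1], -A[0, 1]],
                     [-A[1, 0], A[0, 0]]]) / det

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])                   # hypothetical example, det = 10
A_inv = inv2x2(A)
print(A_inv)                                 # [[ 0.6 -0.7] [-0.2  0.4]]
```

The closed form costs a handful of multiplications and a single reciprocal, consistent with the inversion row of Table 5.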

If online computation must take place, it must be noted that even m = 2 provides good performance. When m = 2 and $\mathbf{x}^{(i)} = \left[ z_i^{0} = 1,\ z_i^{1} = z_i \right]^{T}$, $\boldsymbol{\theta}^{(i)}_{opt}$ indicates the parameters for the best-fit straight line for the weighted set of points. The first column of the resulting X matrix is occupied by ones, and the second column is occupied by the z vector. The column of ones causes the $\mathbf{X}^{T}\mathbf{W}\mathbf{X}$ operation to require many multiplications by 1, an operation that reduces to memory copies, significantly reducing the number of required multiplications. Additionally, the $\mathbf{X}^{T}\mathbf{W}$ operation occurs twice in Eq. 62; the result can be reused, further reducing the number of required computations. Perhaps the most significant algorithmic simplification is obtained by exploiting the fact that the significant elements in the W matrix are in proximity of the target variable location, and by definition, only these elements have a significant impact on the computed output (the result of the “local weighting”). If the exponential weighting kernel defined in Eq. 61 is used, many elements of the resulting w(i) vector are insignificant enough to be rounded to zero with little impact on the final outcome.

operation                                  ×          +               1/x
X^T W                                      p          0               0
(X^T W) X                                  2p         4(p − 1)        0
(X^T W X)^−1                               6          1               1
(X^T W) y                                  2p         2(p − 1)        0
θ(i)opt = (X^T W X)^−1 (X^T W y)           4          2               0
ŷ(i) = x(i) θ(i)opt                        2          1               0
total                                      5p + 12    6(p − 1) + 4    1

Table 5: Computational Breakdown for Online Computation of Locally Weighted Linear Regression (m = 2)

The above exploitations can be used to fundamentally modify the structure of the

algorithm. First, the insignificant values in the W matrix are removed, reducing its size

to p× p. This action requires the reduction in size of the X matrix and y vector, which

are accessed in a sliding block of p samples according to:

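\[
\boldsymbol{\theta}^{(i)}_{opt} = \left( \mathbf{X}^{T}_{((0:(p-1))+i,:)}\,\mathbf{W}\,\mathbf{X}_{((0:(p-1))+i,:)} \right)^{-1}\mathbf{X}^{T}_{((0:(p-1))+i,:)}\,\mathbf{W}\,\mathbf{y}_{((0:(p-1))+i,:)}.
\tag{65}
\]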

Table 5 shows the number of computations required for each part of the algorithm if

the computations are all performed online. If the data is equally spaced, a further sim-

plification can be made. The above calculations consider a sliding block of p samples,

in which case the X matrix is shifted along with the y vector index. As i gets large, espe-

cially when m > 2, the last column in the X matrix becomes dominant, and the matrix

quickly becomes ill-conditioned, preventing accurate inversion. A more logical solution

is to keep the X matrix fixed, sliding y through a window, rather than a window across

y. Doing so transforms most of the required operations into the m× p constant QLWR

matrix.

Figure 25: Exponential Weighting Kernel with Varying τ Parameter (p = 32; horizontal axis: offset, vertical axis: magnitude; curves for τ = 1 through τ = 6)

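\[
\begin{aligned}
\boldsymbol{\theta}^{(i)}_{opt} &= \mathbf{Q}_{LWR}\,\mathbf{y}_{((0:(p-1))+i,:)} \\
\mathbf{Q}_{LWR} &= \left( \mathbf{X}^{T}\mathbf{W}\mathbf{X} \right)^{-1}\mathbf{X}^{T}\mathbf{W} \\
\mathbf{X} &=
\begin{bmatrix}
1^{0} & 1^{1} & \cdots & 1^{m-1} \\
2^{0} & 2^{1} & \cdots & 2^{m-1} \\
\vdots & \vdots & \cdots & \vdots \\
p^{0} & p^{1} & \cdots & p^{m-1}
\end{bmatrix} \\
\hat{y}^{(i)} &= \mathbf{x}^{(i)}\boldsymbol{\theta}^{(i)}_{opt}
\end{aligned}
\tag{66}
\]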

Now, the computation of the $\boldsymbol{\theta}^{(i)}_{opt}$ vector requires only $m(p-1)$ additions, $p(m-1)$

multiplications and no inverse operations. The modification enables higher orders of

m without the exponentially increasing cost of online matrix inversion and the issues

with ill-conditioned matrices.
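The fixed-X formulation of Eq. 66 can be sketched as below. Two choices here are assumptions made for illustration rather than details taken from the text: the kernel is centered in the middle of the window, and the output is evaluated at that center, which lets the product of the center feature vector with Q_LWR collapse into a single length-p set of filter taps. An odd window length is used so that the center falls on a sample.

```python
import numpy as np

def lwr_filter_matrix(p, tau, m=2):
    """Constant m x p matrix Q_LWR = (X^T W X)^{-1} X^T W for fixed coordinates 1..p (Eq. 66)."""
    z = np.arange(1, p + 1, dtype=float)
    X = np.vander(z, N=m, increasing=True)
    center = (p + 1) / 2.0                             # assumed target location: window center
    w = np.exp(-(z - center) ** 2 / (2.0 * tau ** 2))  # kernel of Eq. 61 on the coordinate
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW), center

def lwr_sliding(y, p, tau, m=2):
    """Slide y through a fixed window; each output is the local fit evaluated at the center."""
    Q, center = lwr_filter_matrix(p, tau, m)
    x_c = center ** np.arange(m)                       # feature vector of the window center
    taps = x_c @ Q                                     # y_hat = x_c . (Q y_window)
    n_out = len(y) - p + 1
    return np.array([taps @ y[i:i + p] for i in range(n_out)])

t = np.arange(40, dtype=float)
y = 2.0 * t + 1.0                                      # hypothetical straight-line input
y_hat = lwr_sliding(y, p=9, tau=2.0)
print(np.max(np.abs(y_hat - y[4:-4])))                 # a line is reproduced at window centers
```

Since Q_LWR no longer depends on i, it can be computed once offline, and the online work reduces to inner products with the sliding block of y.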

To use this algorithm for channel estimation in an OFDM system, a kernel and its

corresponding value of p must be chosen. Fig. 25 displays the weighting function for

several values of the τ parameter in the exponential weighting kernel defined in Eq. 61.

While the kernel selection is somewhat arbitrary, the exponential kernel provides good

performance.

Figure 26: Overlaid Locally Weighted Regression Results with Varying τ Kernel Parameter (p = 32; panels for τ = 1 through τ = 6; blue = signal + noise, green = signal, red = regression results)

The chosen kernel and its parameters have a large impact on the performance of the regression technique. The effects of the τ parameter on the regression results using the exponential kernel are demonstrated in Fig. 26, showing 32 regression runs on

a signal with i.i.d. AWGN added to each experiment. The experiment is performed for

τ = [1, 2,3, 4,5, 6]. When τ = 1, the regression seems to “overfit” the signal, fitting

both the signal and its noise component. Conversely, when τ = 6, the regression is

unable to fit the “pointy” sections of the signal, producing regions of “underfitting”.

The τ that optimizes the MSE lies somewhere in-between 1 and 6. Once the optimum

value of τ is found, or learned either online, or using training data, the regression can

perform optimally on signals with similar characteristics.

Figure 27: MSE vs. Model Parameter τ

If τ is swept in a similar manner as the previous experiment (displayed in Fig. 26) and the computed MSE is used to measure the performance of the regression, the

relationship between τ and MSE as well as the optimum value of τ can be found.

Fig. 27 shows the MSE for a sweep of the τ parameter from τ = [.7, .8, . . . , 6] using

the same signal with i.i.d. AWGN between each evaluation. The MSE is computed

using the known noiseless signal and the regression result. Clearly the MSE has a local

minimum around τ = 2 and perhaps most importantly, the sweep reveals a convex

error function.

To understand the relationship τ has with various types of signals, WGN is gener-

ated and upsampled by the variable rate N . Small N provides little upsampling, and

the resulting signals have sharper curvature and contour. WGN is also added to each

signal and is i.i.d. between each data point to corrupt the signals with noise. Fig. 28

shows the result of this experiment. The local minima are marked with a solid red circle for each sweep of τ with each N. Intuitively, the larger values of τ are better suited for signals with more gradual contours, which is confirmed by Fig. 28, showing that as N is increased, the optimum value of τ that produces the minimum MSE also increases. For

signals with rapidly varying contours, choosing a large value for τ results in excessively

high absolute MSE, less so for signals with gradual contours. The relative MSE penalty

for excessively large values of τ is about equal within each trial between all N in this

experiment.

Figure 28: LWR Experiment: Mean-Squared Error vs. τ vs. N (curves for N = 12, 14, ..., 24; local minima marked)

The previous experiment brings forth an interesting problem. Although data can be optimally fitted using a pre-defined model and weight function, the fit is only optimal with the correct selection of τ using the exponential weighting kernel. Initially,

the optimum value of τ exists and is unknown. The previous experiments revealed that

the error function is convex within the bounded sweep interval. The exact definition

of the error function is unknown, but it is known to be convex and can be assumed

to contain its global minimum within a wide, bounded region. If an optimal value of

τ is found, will the value also be optimum for a signal drawn from the same process?

Fig. 29 shows the MSE vs. τ sweep for 32 i.i.d. signals generated using the same

statistical process with i.i.d. AWGN between each signal. Clearly, the MSE and the

error function vary within each trial, but the minima lie within the same region with

some variation. The region of the error surface surrounding the minimum is quite flat.

Along with the sampling resolution, the error surface flatness produces some variation

in the set of observed minima. This experiment reveals that an optimal value of τ

that minimizes the error for one signal is optimal, or nearly optimal for other signals

generated by the same process. In an OFDM channel estimation context, the optimum

τ could be determined by the channel’s excess delay, which largely determines the

contour characteristics of the channel’s frequency response curve.

Figure 29: MSE vs. Model Parameter τ: i.i.d. Trials with i.i.d. AWGN

To maintain computational simplicity, the error-minimizing τ can be found for one signal and used for all other signals with similar characteristics. To find the optimum

τ, a search algorithm taken from convex optimization theory [41] can be employed to

evaluate the function for different values of the optimization parameter and select

the minimum (or maximum) result. A good search algorithm finds the minimum (or

maximum) value of a function using the fewest possible number of evaluations and

produces results with low error. Evaluating the error for a single sweep parameter

requires the regression operation to be performed on the entire data set, a computa-

tionally expensive operation, therefore it is critical to find the optimal parameter using

the fewest possible number of evaluations.

The error surface in the last few examples has shown a convex and apparently quadratic error function. To find the minimum of a quadratic function, inverse quadratic interpolation can be performed, which requires only 3 function evaluations to obtain the 3 necessary points for the interpolation and doesn’t require explicit knowledge of the function or its derivative. If the error function is assumed to be approximately quadratic, the result of inverse quadratic interpolation will approximate the location of the function’s minimum. The resulting accuracy depends on the validity of the assumptions that are made. In many cases, the value produced by inverse quadratic interpolation is “good enough” after only a few function evaluations. The objective of the search algorithm is to improve the error performance of the LWR algorithm by finding the optimal value of τ while minimizing the number of required computations to do so. Trade-offs can be made between error and search time.

Figure 30: Finding the Abscissa of a Quadratic Function’s Minimum Using Inverse Quadratic Interpolation (example function f(x) = x² − 6x + 10; markers: bracketing points a, b, c and the resulting f(xmin))

Given three locations, a, b and c and values f (a), f (b) and f (c), such that a < b < c

and f (b)< f (a)≤ f (c) or f (b)< f (c)≤ f (a), the three locations are said to “bracket”

the minimum value of the function. Given that these inequalities are satisfied, it can

be inferred that a minimum of the function is “down there somewhere”. Assuming the

function is quadratic, the location of the minimum xmin can be found directly.

\[
x_{\min} = b - \frac{1}{2}\,\frac{(b-a)^{2}\left[ f(b) - f(c) \right] - (b-c)^{2}\left[ f(b) - f(a) \right]}{(b-a)\left[ f(b) - f(c) \right] - (b-c)\left[ f(b) - f(a) \right]}.
\tag{67}
\]
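Eq. 67 can be checked numerically on the quadratic f(x) = x² − 6x + 10, whose minimum lies at x = 3. The bracketing points a = 1, b = 2, c = 5 are hypothetical values chosen to satisfy the stated inequalities.

```python
def parabolic_min(a, b, c, fa, fb, fc):
    """Abscissa of the vertex of the parabola through (a, fa), (b, fb), (c, fc) (Eq. 67)."""
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return b - 0.5 * num / den

def f(x):
    return x ** 2 - 6.0 * x + 10.0

a, b, c = 1.0, 2.0, 5.0                      # f(b) < f(a) <= f(c): the points bracket the minimum
x_min = parabolic_min(a, b, c, f(a), f(b), f(c))
print(x_min)                                 # 3.0
```

For an exactly quadratic function a single application of the formula lands on the minimum, using only the three function values.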

Fig. 30 illustrates a toy example of finding the minimum value of a quadratic function. The filled red markers indicate the points that were found to bracket the minimum. The value of f(xmin) found by Eq. 67 is indicated by the large red “x” marker. This method works perfectly for this example because the function is quadratic. If the function isn’t quadratic, this method can be extended by performing “successive inverse quadratic interpolation”, where the point with the largest function value among the three points used to compute the initial quadratic interpolation is thrown out, and the two remaining points together with the newly computed point are used for another quadratic interpolation that more closely locates the true minimum. Each iteration costs only one additional function evaluation and converges to the true minimum super-linearly. The algorithm can stop after the difference between successive results is smaller than a threshold value, or the maximum number of permissible iterations has been exceeded. Fig. 31 shows the successive quadratic interpolation algorithm finding the minimum of a non-quadratic function using 9 function evaluations to achieve an accuracy within .01 of the true minimum location. Unfortunately, the points used to bracket the minimum greatly impact this algorithm’s speed of convergence, as one would expect.

Figure 31: Finding the Abscissa of an Arbitrary Function’s Local Minimum Using the Successive Inverse Quadratic Minimum Finding Technique (example function f(x) = 1/2 − x e^(−x²); markers: bracketing points a, b, c and the resulting f(xmin(i)))
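A minimal sketch of the successive procedure is shown below, applied to the test function named in Fig. 31, f(x) = 1/2 − x·exp(−x²), whose minimum lies at x = 1/√2. The starting bracket (0, 0.5, 2), the tolerance and the iteration cap are hypothetical, so the number of evaluations will not necessarily match the figure; the sketch also has no safeguard for the case where discarding a point loses the bracket.

```python
import math

def parabolic_step(pts):
    """Vertex abscissa of the parabola through three (x, f(x)) pairs (Eq. 67)."""
    (a, fa), (b, fb), (c, fc) = pts
    num = (b - a) ** 2 * (fb - fc) - (b - c) ** 2 * (fb - fa)
    den = (b - a) * (fb - fc) - (b - c) * (fb - fa)
    return None if den == 0.0 else b - 0.5 * num / den

def successive_parabolic_min(f, a, b, c, tol=1e-2, max_iter=50):
    """Repeat Eq. 67, replacing the worst of the three current points with the new estimate."""
    pts = [(a, f(a)), (b, f(b)), (c, f(c))]
    n_eval = 3
    x_prev = None
    for _ in range(max_iter):
        pts.sort()
        x_new = parabolic_step(pts)
        if x_new is None:                          # degenerate: collinear or repeated points
            break
        if x_prev is not None and abs(x_new - x_prev) < tol:
            return x_new, n_eval                   # successive estimates agree within tol
        worst = max(pts, key=lambda pt: pt[1])     # largest of the three current points
        pts.remove(worst)
        pts.append((x_new, f(x_new)))              # one additional function evaluation
        n_eval += 1
        x_prev = x_new
    return x_prev, n_eval

def g(x):
    return 0.5 - x * math.exp(-x * x)

x_min, n_eval = successive_parabolic_min(g, 0.0, 0.5, 2.0)
print(x_min, n_eval)
```

In the LWR setting, each call to the function would be one full regression run at a candidate τ, so the evaluation count returned here is the quantity of practical interest.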

An example training procedure that finds the optimum τ parameter uses a known

time-varying channel observed by an OFDM receiver, the LWR algorithm, and er-

ror feedback provided to the successive inverse quadratic minimum finding algorithm

(SIQMF). The test is performed using a 20 MHz LTE downlink OFDM signal that has

passed through an emulated multi-path fading channel using the LTE-specified ETU

model with a maximum Doppler frequency of 100 Hz. The ETU channel model pos-

sesses the greatest excess delay of the specified LTE channel models, therefore is very

frequency selective, preferring lower values of τ. The variance of the noise added to

the signal in this experiment is σ2 = .1.

Figure 32: Successive Inverse Quadratic Interpolation Minimum Finding Algorithm Finding the Minimum Across the Error Surface of the LWR Kernel Parameter Sweeps

Fig. 32 shows the results of the training experiment. The red curve in Fig. 32 shows the obtained minimum value of τ with an error tolerance of less than .01 and

a mean function evaluation count of 6.5 over 120 channel estimates. Fig. 32 shows

that the multi-path fading channel introduces a dynamic variance of the error surface

as time progresses. Despite the dynamic error surface, the successive inverse quadratic

interpolation minimum finding algorithm does an excellent job “snaking” through the

lowest points on the error surface as the symbols go by, while minimizing the number

of necessary function evaluations. Averaging the values of τ “trains” the LWR algorithm

for signals with similar contour features, which are largely determined by the channel’s

excess delay.

Figure 33: Frequency-Staggered, Time-Spaced Reference Symbol Orientation in the “Extended” and “Normal” CP Modes Used in the LTE Downlink (one panel per CP mode; axes: frequency index n vs. symbol index k; legend: reference symbol, other symbol (data))

4.2 The Missing Link: Frequency-Time Interpolation

In standards such as LTE, the reference symbols sparsely populate the frequency-time

resource grid. In the LTE downlink, known-value modulated QPSK reference symbols

(RSs) are interspersed throughout the OFDM time-frequency resource grid as indicated

by Fig. 33. In release 8, 9 and 10, the LTE downlink includes two possible CP lengths,

“normal” and “extended”, used to trade off throughput and compatibility with channels

that may have extra long excess delays. In both modes, the RSs are spaced along the

frequency dimension by 6 subcarrier positions in each OFDM symbol. However, in

the extended CP mode, the RSs are evenly spaced in time by 3 OFDM symbols rather

than the uneven arrangement of 4 and 3 symbol gaps in the normal CP mode [20].

The inclusion of both staggered and the evenly-spaced RS arrangement possibilities

slightly complicates the interpolation process, discussed shortly. The “phase”, or shift

in position of the RS grid pattern along the frequency dimension, is determined by the

cell identification number and the symbol position of the OFDM symbol in each time

slot.

The frequency-time arrangement of the LTE RSs implies a maximum mobile velocity

(given a carrier frequency) as well as the maximum excess channel delay. The spacing of the RSs along the time and frequency dimensions establishes respective Nyquist

boundaries that must contain the frequency content of the channel’s contour in each

dimension. The respective Nyquist sampling criterion must be satisfied under normal

operating conditions. Unfortunately, the effective “downsampling” induced by spacing

the RSs in time and frequency aliases any noise energy present at the base sampling



rate directly into much narrower Nyquist zones.

To reduce the noise in the estimated channel, the LWR algorithm can be used to

“refine” the estimates found using the standard least-squares method (Eq. ??) at the

RS positions. Through offline training, a set of kernels can be precomputed and

stored. By performing offline training, the complexity-performance benefits of the LWR

algorithm can be fully utilized.

The LTE RS arrangement resembles a rotated checkerboard pattern in the time-

frequency grid. Interpolating across two dimensions can potentially require very large

buffers of symbols. Different interpolation algorithms require varying numbers of sam-

ples to operate. Each buffered symbol along the time dimension requires up to 4 ad-

ditional symbols to be stored in memory that each await the interpolation result for

equalization. Clearly, minimizing the latency of the channel estimation and interpo-

lation algorithms has a wider impact on system-level complexity. Not only do larger

buffers increase memory size, they increase system latency. Latency is critical in stan-

dards such as LTE that are heavily reliant on closed-loop feedback to determine beam-

forming modes and hybrid ARQ handshaking.

To minimize the buffer sizes, the interpolation scheme used along the time dimen-

sion must be designed to require the least possible number of points. A good interpo-

lator for this application not only produces interpolated values with little error, but can

operate on juxtaposed blocks of data while maintaining smooth continuity between

blocks. Once an interpolator has been designed to operate along the time dimension,

the frequency dimension can be considered.

The interpolator that operates along the time dimension must consider the RS arrangements given by both the normal and extended CP modes. A polyphase FIR

interpolator is a good candidate for the extended CP configuration, but requires many

samples to generate outputs with low levels of error. When the staggered RS spacing

of the normal CP mode is added, the polyphase FIR interpolation process must be split

into two downsampled, periodic substreams. After zero-packing, the time-series of the

RS sequence in the normal CP configuration resembles

\[
\mathbf{x} = \left[ \ldots, 0, 0, x_1, 0, 0, 0, x_2, 0, 0, x_3, 0, 0, 0, x_4, 0, 0, x_5, 0, 0, 0, \ldots \right],
\tag{68}
\]

which can be split into even and odd indexed streams with equal spacing

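\[
\begin{aligned}
\mathbf{x}_{even} &= \left[ \ldots, 0, 0, x_1, 0, 0, 0, 0, 0, 0, x_3, 0, 0, 0, 0, 0, 0, x_5, 0, 0, 0, \ldots \right] \\
\mathbf{x}_{odd} &= \left[ \ldots, 0, 0, 0, 0, 0, 0, x_2, 0, 0, 0, 0, 0, 0, x_4, 0, 0, 0, 0, 0, 0, \ldots \right],
\end{aligned}
\tag{69}
\]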

which can be individually upsampled by a factor of 7 and summed to form the final

result. Theoretically, either the odd or even stream of samples by themselves could

be upsampled by 7 to obtain the needed result, but discarding information is never

advisable; therefore the two streams of samples will be upsampled and summed to

generate the final result.
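The zero-packing and stream splitting of Eqs. 68 and 69 can be sketched in a few lines of Python. This is only an illustration: the helper names `zero_pack` and `split_even_odd` are hypothetical, and placing the RS-bearing symbols at offsets 0 and 4 of each 7-symbol slot is an assumption chosen to reproduce the alternating 4/3 spacing seen in Eq. 68.

```python
def zero_pack(rs_values, positions, length):
    """Place RS observations at their symbol positions, zeros elsewhere."""
    x = [0.0] * length
    for value, pos in zip(rs_values, positions):
        x[pos] = value
    return x


def split_even_odd(rs_values, positions, length):
    """Split an unevenly spaced RS stream into two evenly spaced substreams."""
    x_even = zero_pack(rs_values[0::2], positions[0::2], length)
    x_odd = zero_pack(rs_values[1::2], positions[1::2], length)
    return x_even, x_odd


# Assumption: normal CP, 7 symbols per slot, RS symbols at slot offsets 0 and 4.
SYMBOLS_PER_SLOT = 7
RS_OFFSETS = (0, 4)
num_slots = 3
length = num_slots * SYMBOLS_PER_SLOT
positions = [slot * SYMBOLS_PER_SLOT + off
             for slot in range(num_slots) for off in RS_OFFSETS]
rs_values = [float(k + 1) for k in range(len(positions))]  # stand-ins for x1, x2, ...

x = zero_pack(rs_values, positions, length)
x_even, x_odd = split_even_odd(rs_values, positions, length)
```

Each substream has a uniform spacing of 7 symbols, and the element-wise sum of the two substreams restores the original zero-packed sequence.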

Unfortunately, splitting the data into two separate sub-streams violates the Nyquist criterion in the LTE-specified high speed train scenario, which has a maximum Doppler spread of 1340 Hz [42]. Assuming a classical Doppler spectrum, a channel null, or deep fade, for a particular subcarrier occurs at the time interval T0, determined by the time the receiver takes to travel half of the carrier’s wavelength (λ/2), corresponding to double the Doppler frequency fd [43,44].

$$ T_0 = \frac{\lambda/2}{v} = \frac{1/2}{f_d} \tag{70} $$

The time interval between RSs after splitting the single, unevenly spaced stream of RSs into two substreams is equal to the duration of an entire LTE time slot, 0.5 ms. This corresponds to a 2 kHz sampling rate of the channel, which cannot resolve time variations caused by Doppler frequencies greater than fd = 1000 Hz.
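The arithmetic behind this limit can be checked with a short Python sketch. The function names are hypothetical; the 0.5 ms slot duration and the 1340 Hz Doppler figure are the values quoted in the text.

```python
def max_resolvable_doppler_hz(rs_interval_s):
    """Nyquist limit for channel observations taken every rs_interval_s seconds."""
    return 1.0 / (2.0 * rs_interval_s)


def fade_interval_s(doppler_hz):
    """Interval between deep fades, T0 = (1/2) / f_d, as in Eq. 70."""
    return 0.5 / doppler_hz


SLOT_DURATION_S = 0.5e-3   # RS spacing of each substream after the split
HST_DOPPLER_HZ = 1340.0    # high speed train scenario quoted in the text

doppler_limit_hz = max_resolvable_doppler_hz(SLOT_DURATION_S)
t0_s = fade_interval_s(HST_DOPPLER_HZ)
substream_is_adequate = HST_DOPPLER_HZ <= doppler_limit_hz
```

Under these numbers the fade interval T0 is shorter than the substream spacing, so a single substream undersamples the channel in the high speed train scenario.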

To drive the point home, the polyphase FIR interpolator that performs rate-7 upsampling requires a prohibitively large number of samples to compute each output. Using the Remez filter design algorithm in MATLAB, a polyphase prototype filter with an excess bandwidth parameter α = 0.3 and a stop-band attenuation of only 60 dB requires 11 samples per polyphase arm; 11 RS-filled symbols must be stored along with all of the adjacent symbols that await equalization. Even in the extended CP mode, where the data does not need to be split and the Nyquist sampling criterion is met by design, a rate-3 prototype filter requires 5 samples per polyphase arm for roughly 60 dB of stop-band attenuation, given the same α. The stop-band attenuation and the α parameter could be relaxed further at the cost of performance, but the number of samples that must be held in storage still does not compare favorably with more efficient and higher performing algorithms, such as cubic spline interpolation.

Cubic spline interpolation is frequently used in 2-D interpolation tasks in image

processing, particularly for image scaling and rotation. Its main advantage is its ability

to fit data using low-order piece-wise polynomials with a continuous first and sec-

ond derivative between each piece. Spline interpolation has no restrictions on sample

spacing, making it an ideal class of interpolation algorithms for operating on both the

normal and extended RS grid configurations. The method used to compute the coeffi-

cients for each piecewise polynomial is quite simple, allowing a very computationally

efficient implementation. The mathematical background and theory of splines is well

established and will not be covered in detail here. The reader is referred to MATLAB’s

“spline”, “interp” and “interp2” functions for an implementation example, and [45] for

a thorough mathematical background.

The two cubic spline interpolation implementations present in the MATLAB functions are the “clamped” and “not-a-knot” varieties. The clamped variety allows the first derivative to be explicitly specified at either end-point of the interpolation, and the not-a-knot variety fits a single cubic across the first two subintervals and another across the last two. Either variety

can be implemented using a similar system of equations. The cubic spline interpolation

methods assign a continuous, piecewise polynomial segment for each contiguous set of

3 samples, defined by

$$ f_i(x) = a_i + b_i\left(x - x_i\right) + c_i\left(x - x_i\right)^2 + d_i\left(x - x_i\right)^3, \tag{71} $$

where $i = 0, 1, \ldots, n$ are the indices for the vector of ordered abscissas $x_i = [x_0, x_1, \ldots, x_n]^T$ such that $x_0 < x_1 < \ldots < x_n$ and the vector of corresponding points $y_i = [y_0, y_1, \ldots, y_n]^T$. Each polynomial is evaluated using the $i$th element of the polynomial coefficient vectors $\mathbf{a}, \mathbf{b}, \mathbf{c}, \mathbf{d}$ and is valid for $x_i \le x \le x_{i+1}$. The polynomial coefficient vectors can be solved using the linear equation

$$ \mathbf{A}\mathbf{m} = \mathbf{r}, \tag{72} $$

where the m vector will be used to generate the four vectors of polynomial coefficients.

To define the A matrix, it is convenient to first introduce the vector hi = xi+1 − xi. For

the clamped spline method, the A matrix is defined

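$$
\mathbf{A} = \begin{bmatrix}
2h_0 & h_0 & 0 & \cdots & \cdots & 0 \\
h_0 & 2(h_0 + h_1) & h_1 & 0 & \cdots & 0 \\
0 & h_1 & 2(h_1 + h_2) & h_2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & 0 & h_{n-2} & 2(h_{n-2} + h_{n-1}) & h_{n-1} \\
0 & 0 & 0 & 0 & h_{n-1} & 2h_{n-1}
\end{bmatrix}, \tag{73}
$$
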
followed by the r vector

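$$
\mathbf{r} = 6 \begin{bmatrix}
\dfrac{y_1 - y_0}{h_0} - \delta_{\text{begin}} \\
\dfrac{y_2 - y_1}{h_1} - \dfrac{y_1 - y_0}{h_0} \\
\vdots \\
\dfrac{y_n - y_{n-1}}{h_{n-1}} - \dfrac{y_{n-1} - y_{n-2}}{h_{n-2}} \\
\delta_{\text{end}} - \dfrac{y_n - y_{n-1}}{h_{n-1}}
\end{bmatrix}, \tag{74}
$$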

where δbegin and δend are the parameters used to force the derivative at the beginning

and end of each spline to a specific value. For the not-a-knot variant, the A matrix and

r vector are slightly modified. Note that the derivatives are known using the obtained

sets of piece-wise polynomials.

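$$
\mathbf{A} = \begin{bmatrix}
-1 & 2 & -1 & \cdots & \cdots & 0 \\
h_0 & 2(h_0 + h_1) & h_1 & 0 & \cdots & 0 \\
0 & h_1 & 2(h_1 + h_2) & h_2 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & 0 & h_{n-2} & 2(h_{n-2} + h_{n-1}) & h_{n-1} \\
0 & 0 & 0 & -1 & 2 & -1
\end{bmatrix}, \tag{75}
$$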

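$$
\mathbf{r} = 6 \begin{bmatrix}
0 \\
\dfrac{y_2 - y_1}{h_1} - \dfrac{y_1 - y_0}{h_0} \\
\vdots \\
\dfrac{y_n - y_{n-1}}{h_{n-1}} - \dfrac{y_{n-1} - y_{n-2}}{h_{n-2}} \\
0
\end{bmatrix}. \tag{76}
$$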

Solving for the m vector in Eq. 72 allows for the computation of the vector of polyno-

mial coefficients.

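$$ \mathbf{m} = \mathbf{A}^{-1}\mathbf{r} \tag{77} $$

$$
\begin{aligned}
a_i &= y_i \\
b_i &= \frac{y_{i+1} - y_i}{h_i} - \frac{h_i}{2} m_i - \frac{h_i}{6}\left(m_{i+1} - m_i\right) \\
c_i &= \frac{m_i}{2} \\
d_i &= \frac{m_{i+1} - m_i}{6 h_i}
\end{aligned} \tag{78}
$$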

using $i = 0, 1, \ldots, n-1$. A quick inspection of the A matrices for each method reveals the following structures for evenly spaced sampling intervals. Given the RS spacing along the time dimension in the extended CP mode, $x_i = [1, 4, 7, \ldots, 3n+1]^T$, therefore $h_i = [3, 3, 3, \ldots, 3]^T$, resulting in

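$$
\mathbf{A} = \begin{bmatrix}
6 & 3 & 0 & \cdots & \cdots & 0 \\
3 & 12 & 3 & 0 & \cdots & 0 \\
0 & 3 & 12 & 3 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & 0 & 3 & 12 & 3 \\
0 & 0 & 0 & 0 & 3 & 6
\end{bmatrix} \tag{79}
$$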

for the clamped algorithm and

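$$
\mathbf{A} = \begin{bmatrix}
-1 & 2 & -1 & \cdots & \cdots & 0 \\
3 & 12 & 3 & 0 & \cdots & 0 \\
0 & 3 & 12 & 3 & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & \ddots & \vdots \\
0 & 0 & 0 & 3 & 12 & 3 \\
0 & 0 & 0 & -1 & 2 & -1
\end{bmatrix} \tag{80}
$$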

for the not-a-knot. Note that the A matrix for the clamped algorithm is tridiagonal,

while the matrix for the not-a-knot algorithm is nearly so. There are very efficient

algorithms that can solve linear equations with matrices that have band-diagonal or

tridiagonal properties in O (n) operations [41], but these are unnecessary in the case

when the A matrix is constant and its inverse can be pre-computed, stored and multi-

plied by the dynamically updated r vector to solve for m. In the LTE RS configuration,

the RS spacing is periodic, allowing the A matrix and its inverse to be constant. Note that the $\mathbf{A}^{-1}$ matrix depends on its dimensions, the algorithm variety, and h. Also, the division operations required by the r vector and Eq. 78 can be reduced to multiplications by a pre-computed inverse or constant.
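As an illustration of Eqs. 71 through 78, the following Python sketch builds the clamped system, solves it, and evaluates the resulting piecewise cubic. It is a minimal example under stated assumptions, not the receiver implementation: the function names are hypothetical, and the system is solved with the O(n) tridiagonal (Thomas) algorithm rather than the pre-computed inverse preferred in the text, simply to keep the example self-contained.

```python
def clamped_spline_coeffs(x, y, d_begin, d_end):
    """Return per-segment (a, b, c, d) of Eq. 71 for a clamped cubic spline."""
    n = len(x) - 1
    h = [x[i + 1] - x[i] for i in range(n)]
    # Tridiagonal system A m = r (Eqs. 72-74); m holds the second derivatives.
    sub, diag, sup, r = ([0.0] * (n + 1) for _ in range(4))
    diag[0], sup[0] = 2.0 * h[0], h[0]
    r[0] = 6.0 * ((y[1] - y[0]) / h[0] - d_begin)
    for i in range(1, n):
        sub[i], diag[i], sup[i] = h[i - 1], 2.0 * (h[i - 1] + h[i]), h[i]
        r[i] = 6.0 * ((y[i + 1] - y[i]) / h[i] - (y[i] - y[i - 1]) / h[i - 1])
    sub[n], diag[n] = h[n - 1], 2.0 * h[n - 1]
    r[n] = 6.0 * (d_end - (y[n] - y[n - 1]) / h[n - 1])
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n + 1):
        w = sub[i] / diag[i - 1]
        diag[i] -= w * sup[i - 1]
        r[i] -= w * r[i - 1]
    m = [0.0] * (n + 1)
    m[n] = r[n] / diag[n]
    for i in range(n - 1, -1, -1):
        m[i] = (r[i] - sup[i] * m[i + 1]) / diag[i]
    # Polynomial coefficients of Eq. 78.
    coeffs = []
    for i in range(n):
        a = y[i]
        b = ((y[i + 1] - y[i]) / h[i] - h[i] / 2.0 * m[i]
             - h[i] / 6.0 * (m[i + 1] - m[i]))
        c = m[i] / 2.0
        d = (m[i + 1] - m[i]) / (6.0 * h[i])
        coeffs.append((a, b, c, d))
    return coeffs


def spline_eval(x, coeffs, t):
    """Evaluate the piecewise cubic of Eq. 71 at coordinate t."""
    i = 0
    while i < len(coeffs) - 1 and t > x[i + 1]:
        i += 1
    a, b, c, d = coeffs[i]
    dt = t - x[i]
    return a + dt * (b + dt * (c + dt * d))


# Extended-CP style spacing (h = 3) with illustrative data y = x^3.
knots = [1.0, 4.0, 7.0, 10.0]
values = [t ** 3 for t in knots]
segments = clamped_spline_coeffs(knots, values, d_begin=3.0, d_end=300.0)
```

Because the sample data lie on a cubic and the end slopes match its derivative, the spline reproduces the cubic between the knots, which gives a convenient sanity check on the system of equations.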

The cubic spline interpolation methods require a minimum of 4 points to be eval-

uated, therefore the minimum number of RS-filled symbols that must be stored along

the time dimension to perform interpolation between the RSs for both the extended

and normal CP modes is substantially fewer than the polyphase FIR interpolation tech-

nique. Using spline interpolation, only two slots (each consisting of 6 OFDM symbols

in the extended CP mode and 7 in the normal mode) must be stored to meet the 4

RS minimum along the time dimension, requiring 13 total symbols in the extended CP mode and 15 symbols in the normal CP mode.

Figure 34: Valid Output Samples of a Rate-6 Polyphase Upsampler Overlaid on the Known Channel Magnitude [plot titled "Interpolation Using a 6x Polyphase FIR Filter"; vertical axis: channel magnitude; legend: known channel, 6x polyphase interpolation result]

A solution for interpolation along the frequency dimension is to use a rate-6 FIR

polyphase interpolator, which can finally be allowed to use large buffers of RSs. To

produce valid results, an FIR filter must keep its memory buffers filled with valid sam-

ples. As the memory buffer of the filter fills and empties, the resulting output exhibits

“settling” as only a portion of the full set of coefficients are contributing to the con-

volution output. The samples produced during the settling period of the filter are not

reliable and exhibit large deviations from the curvature of the input signal. In “normal use”, when an FIR interpolator operates on an infinitely long time series with no beginning and no end, these characteristics need not be considered; otherwise, the transients produced as the filter settles must be dealt with. In this application, it is best to simply trim any output samples that are computed when the filter’s memory buffer is not completely filled. Fig. 34 illustrates the problem that the FIR filter interpolator presents. The interpolator cannot produce valid outputs near the edges, or outer subcarrier positions, although the results are quite good for the interior subcarrier indices.

Another possible solution to the interpolation problem is one that exploits the fact

that the channel’s frequency response, by definition from the FFT operation, is periodic

across the entire block of M samples produced by the receiver’s FFT operation. The

EDFT (extended discrete Fourier transform) algorithm introduced by [46] and avail-

able in [47] is specially designed to recover sparsely sampled periodic signals. The

algorithm doesn’t even require evenly spaced sampling and can recover signals with

large gaps of missing samples, a particularly attractive set of features considering the

channel’s frequency response in a typical OFDM signal is sparsely sampled only in the

central portion of the received frequency domain representation (the outer subcarriers

contain all zeros, i.e. the signal is oversampled). The algorithm is iterative, usually

requiring only a few iterations. The first iteration starts off using a diagonal weight

matrix $\mathbf{G}^{(1)} = \mathbf{I}_M$ to produce the Hermitian Toeplitz matrix

$$ \mathbf{R} = \frac{1}{N}\mathbf{W}_M \mathbf{G}^{(i)} \mathbf{W}_M^H, \tag{81} $$

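where the $\mathbf{W}_M$ matrix is the $M \times M$ DFT matrix that corresponds with the receiver’s FFT (in this case, the samples are uniformly spaced). Next, using the weight matrix $\mathbf{G}$ and the inverse of the $\mathbf{R}$ matrix, the complex $i$th estimate of the IDFT of the input signal vector $\mathbf{x}$, $\mathbf{F}^{(i)}$, is computed.

$$ \mathbf{F}^{(i)} = \mathbf{x}\mathbf{R}^{-1}\mathbf{W}_M\mathbf{G}^{(i)} \tag{82} $$

Next, to update the weight matrix, the already computed $\mathbf{x}\mathbf{R}^{-1}\mathbf{W}_M$ and $\mathbf{R}^{-1}\mathbf{W}_M$ terms are used to generate the diagonal amplitude spectrum matrix $\mathbf{S}^{(i)}$.

$$ \mathbf{S}^{(i)} = \frac{\mathbf{x}\mathbf{R}^{-1}\mathbf{W}_M}{\operatorname{diag}\left\{\mathbf{W}_M^H\mathbf{R}^{-1}\mathbf{W}_M\right\}} \tag{83} $$

$$ \mathbf{G}^{(i+1)} = \operatorname{diag}\left\{\left|\mathbf{S}^{(i)}\right|^2\right\} \tag{84} $$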

In normal circumstances, where an adequate number of RSs exist, the algorithm has adequately converged after the 4th or 5th iteration when the identity matrix is used as the initial weight matrix. If the final weight matrix from a previous result is used, iterations typically end earlier. Vilinis in [47] stops the iterations when R becomes ill-conditioned, or when the difference between successively computed F falls below a set threshold. It should be noted that the R matrix has Toeplitz structure and can be inverted using the computationally efficient Levinson-Durbin recursion in O(n²) operations [48,49].

Figure 35: Interpolation and Gap-Filling of a Periodic Signal using the Extended DFT Algorithm [plot titled "EDFT Used for Recovering a Sparsely Sampled FFT − OFDM Channel Estimation"; vertical axis: channel magnitude; legend: RS locations, known channel, EDFT algorithm result]

Fig. 35 illustrates an example OFDM symbol configuration (slightly different than the LTE configuration), which is sparsely sampled in the central subcarrier positions.

The given RSs are spaced periodically by 4 subcarrier indices. To assure a unique

solution, only the RSs are placed in the x vector and the FFT size is reduced from

2048 to 512, reducing the sampling rate so the portion of the signal given by the RSs

becomes critically sampled. This action prevents the algorithm from converging to a

solution that contains aliases (multiple solutions exist), or frequency multiples, while

reducing the computational complexity for each iteration. Size reduction is possible

when the starting FFT size is integer divisible by the RS spacing. Inconveniently, the FFT size in the LTE configuration is not integer divisible by the 6-times undersampling imposed by the RS spacing (2048/6 is not an integer); therefore, a preemptive upsample-by-3 operation should be performed to allow optimal use of the EDFT algorithm (3 × 2048 = 6144 = 6 × 1024). Once the algorithm has

operated on the downsampled RSs, zeros are inserted into the F matrix, which is then

multiplied by a full-sized DFT matrix, upsampling the result back to the original rate.

When the RS configuration permits, the algorithm gives excellent results, as seen in

Fig. 35, where the entire FFT is nearly perfectly recovered. After an FFT-wide recovery

of the channel’s frequency response, even the noise residing in the inactive, or unused

subcarrier positions can be equalized, allowing the receiver to directly gather statistics

on the channel’s noise, information that is invaluable to a wide variety of algorithms

that require knowledge of auto or cross-correlation matrices for linear prediction, or

MMSE estimation [14,39].

Perhaps one of the simplest, best performing methods for interpolating the chan-

nel’s frequency response is the cubic spline interpolation algorithm, introduced earlier

for interpolating the LTE RS grid along the time axis. The piecewise polynomial co-

efficients obtained by the algorithm can be used for extrapolation to obtain estimates

along the edges of the RS grid where no surrounding RSs exist.

To obtain the fully interpolated 2-D channel estimate, first the cubic spline algo-

rithm can be used to perform interpolation and extrapolation within each OFDM sym-

bol, providing channel estimates that span the entire width of the occupied subcarrier

space, as shown in Fig. 36. Note that this operation can be performed fully in parallel

using up to 5 simultaneous, independent processing elements, reducing added latency

and pile-up of symbols as the selected block of symbols awaits equalization. The in-

terpolated channel estimates obtained using interpolation and extrapolation along the

frequency axis are illustrated in Fig. 36 as step 1 in the 2 step procedure to obtain the

full grid of estimates. The cubic spline algorithm is the same for both the extended

and normal CP modes in the LTE configuration, requiring the storage of only one pre-

computed A−1 matrix (Eq. 77).

The next step is to use the cubic spline interpolation algorithm along the time di-

mension as previously described, using the fully interpolated symbols from the pre-

vious step. Fig. 37 shows this operation using 5 symbols to obtain the interpolated

channel estimates for a segment of 12 symbols. Indicated by the red boxes and arrows

as an example in Fig. 37, the interpolation process in step 2 can be split into many

parallel interpolation operations that each simultaneously ratchet along the frequency

dimension. The number of parallel operations that can exist depends on the available

computational resources. The interpolation process in step 2 can even begin while the

interpolators in step 1 are in operation, trailing behind the step 1 interpolator outputs

as they are produced to minimize latency and buffer sizes.

Figure 36: Cubic Spline Interpolation/Extrapolation Along the Frequency Dimension (Step 1) [grid plot; horizontal axis: OFDM symbol index (time); vertical axis: subcarrier index (frequency); legend: data subcarrier positions, RS, obtained by cubic spline interpolation, obtained by cubic spline extrapolation]

Using the EVA (extended vehicular A) channel model defined by the LTE specification, the overall performance of the combined 2-D cubic spline interpolation techniques can be measured by computing the MSE across many symbols. Fig. 38 shows an experiment averaging the MSE over 4 frames (480 symbols). The lobes of increased MSE

appear when the curvature of the channel’s frequency response is more dynamic. Also

note that the MSE increases along the edges and in the middle subcarrier indices. The

indices along the edges are where extrapolation has been performed, as seen in Fig. 36.

The increase in MSE found in the centrally located indices is caused by the wider RS

spacing introduced by the DC subcarrier, which isn’t active in the LTE configuration. In

this area, the two RSs are spaced by 7 subcarrier positions rather than 6. Otherwise,

the performance of the cubic spline interpolation algorithm is excellent.

Figure 37: Cubic Spline Interpolation/Extrapolation Along the Time Dimension (Step 2) [grid plot; horizontal axis: OFDM symbol index (time); vertical axis: subcarrier index (frequency); legend: data subcarrier positions, RS, obtained by cubic spline interpolation, obtained by cubic spline extrapolation]

Next, the performance of the combined LWR and cubic splines interpolation algorithms is evaluated. The test shown in Fig. 39 uses a kernel parameter obtained with prior training on another simulation using the EVA (vehicular) channel model with AWGN. In an implemented receiver, the proper kernel parameter must be chosen that matches the available training scenarios. As previously suggested, the receiver can use the measured excess delay of the channel to choose its kernel parameter. In this simulation, the m parameter of the LWR algorithm is set to 3, allowing the algorithm to find the best-fit parabola using the locally weighted data for each output. Increasing m allows the kernel to be widened, including more data to generate each output. Because the $\mathbf{Q}_{LWR}$ matrix is constant, the increase in m only linearly increases the computational

complexity. This particular test uses the Kaiser window as the kernel, which is pa-

rameterized by β . The Kaiser window’s parameter is normally used to trade transition

bandwidth for stopband attenuation in FIR filter design and spectral analysis; however

in this case the parameter is used to narrow and widen the window kernel. The Kaiser

window is defined by [24]:

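$$
w[n] = \begin{cases}
\dfrac{I_0\left(\beta\left(1 - \left[(n - \alpha)/\alpha\right]^2\right)^{1/2}\right)}{I_0(\beta)}, & 0 \le n \le p \\
0, & \text{otherwise}
\end{cases} \tag{85}
$$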

where β is the kernel parameter, α = p/2, and $I_0(\cdot)$ represents the zeroth-order modified Bessel function of the first kind. Fig. 39 illustrates the smoothing aspects of the LWR technique and the good results given by the cubic splines interpolation method. The LWR algorithm clearly smooths the noisy LS channel estimates using the locally best-fit parabolas generated using the Kaiser weighting kernel. The performance gains achieved by the LWR algorithm are clear in the equalized result of Fig. 40, which shows the constellation of an LTE signal populated with QPSK modulated data of 8 frames in duration, using the EPA channel model with 100 Hz Doppler and added WGN with a noise variance of σ² = 0.001. The EPA channel model yields the best results due to its low excess delay. Fig. 41 confirms that the excess delay is directly related to the performance of the LWR estimator, which offers little to no improvement in the ETU model, which has the largest excess delay. The test was performed using a full receiver, complete with synchronization and the LTE RS arrangement. The results of both estimators are interpolated to the full-sized frequency-time grid. This test was performed without mobility. Fig. 41 shows that the LWR estimator provided between 2 and 4 dB of SNR improvement over the LS method using the EPA and EVA models, which have excess delays of 410 ns and 2510 ns, respectively. Little improvement was attained by the LWR estimator when using the ETU model, which has the longest excess delay of 5 µs.

Figure 38: MSE of Cubic Spline Interpolation/Extrapolation Operating Under the EVA Channel Model [plot titled "Cubic Spline Interpolation/Extrapolation MSE − EVA Channel Model"; vertical axis: 20×log10(MSE)]
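The Kaiser kernel of Eq. 85 can be generated without any external library, as in the Python sketch below. The function names are hypothetical, and the Bessel function is evaluated with its power series, which is an implementation choice made here for self-containment rather than something specified in the text. Following Eq. 85 literally, the window is evaluated for n = 0, ..., p, which yields p + 1 points.

```python
import math


def bessel_i0(x):
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    total, term, k = 0.0, 1.0, 0
    while term > 1e-17 * total:
        total += term
        k += 1
        term *= (x / 2.0) ** 2 / (k * k)
    return total


def kaiser_kernel(p, beta):
    """Kaiser window of Eq. 85 for n = 0..p with alpha = p / 2."""
    alpha = p / 2.0
    denom = bessel_i0(beta)
    window = []
    for n in range(p + 1):
        ratio = (n - alpha) / alpha
        arg = beta * math.sqrt(max(0.0, 1.0 - ratio * ratio))
        window.append(bessel_i0(arg) / denom)
    return window


# Illustrative parameters only; beta narrows the kernel as it increases.
narrow = kaiser_kernel(p=8, beta=10.0)
wide = kaiser_kernel(p=8, beta=2.0)
```

A larger β concentrates the weight near the center of the window, which is how the parameter narrows or widens the kernel in the LWR estimator.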

Figure 39: Comparison Between LWR and LS Algorithms Applied to LTE RS configuration with Cubic Splines Interpolator, EVA Channel Model, σ² = .005 [horizontal axis: DFT index (frequency); vertical axis: magnitude; legend: LS, LWR, Interpolated LS, Interpolated LWR]

4.3 Reference Symbol Arrangements and Their Relationship with Timing Synchronization

Reference symbols are usually interleaved in the same symbol with data subcarriers

and are not usually included in every transmitted symbol, requiring the receiver to

interpolate the channel’s response for the subcarrier positions that do not contain ref-

erence symbols. It is common that the receiver is required to interpolate along both

the time and frequency axes, as shown throughout this chapter.

In Fig. 33, the receiver is able to directly estimate the channel at the reference sym-

bol locations. Spacing the reference symbols apart in frequency and time brings forth

several new restrictions on timing synchronization. In Eq. 6, the relationship between

the continuous-time symbol timing offset parameter τ and the frequency-domain phase

shift vector φ is given. The channel estimation and equalization component allows

some timing synchronization error tolerance. The phase shifts induced by symbol timing errors can be absorbed into the channel’s frequency response. In the previous case where reference symbols were assumed to be present in every subcarrier location, if the symbol timing placement satisfies −(L − d) < m ≤ 0, so that left and right errors are not present, the channel estimation component can resolve the phase shifts for all valid m.

Figure 40: An Example Comparison of LWR vs. LS Equalization: QPSK Modulated Data, EPA-5 Channel Model

Intermingling RSs and data subcarriers in the same symbol reduces the available

observations of the channel’s frequency response at the RSs. Assuming the RSs are

periodically spaced, increasing the spacing establishes a new Nyquist boundary for

sampling the channel. Spacing the RSs in frequency effectively downsamples in the

frequency domain, producing periodic downsampled versions of the CIR in the time

domain (after zero-packing in the non-RS positions in the frequency domain), each

period spanning W/q samples, where q denotes the downsampling factor. The W/q

length period allows the timing window to be delayed or advanced by W/2q without

phase aliasing in the frequency domain. The frequency domain phase shifts from timing

offsets more than W/2q in either direction cannot be resolved by the RSs resulting in

corrupted channel estimates.

Figure 41: MSE Performance Comparison of LS and LWR Channel Estimators Using LTE Channel Models [horizontal axis: SNR; vertical axis: MSE; legend: EPA LS, EPA LWR, EVA LS, EVA LWR, ETU LS, ETU LWR]

In the OFDM symbol configuration, where the CP separates two consecutive symbols, symbol timing advancement induces ISI. Only symbol timing delay is possible in ISI-free operation. The resolution boundary introduced by the RS spacing is combined with the restrictions imposed by the channel’s excess delay to form the following inequality

$$ -\min\left(\frac{M}{2q},\ (L - d)\right) < m \le 0, \tag{86} $$

which must be satisfied by the symbol timing synchronization component to assure

correct operation.

In the LTE configuration, the RSs are separated by 5 subcarriers, downsampling

the channel’s frequency response by q = 6. Using M = 2048, the maximum valid

symbol timing offset is 170.67 samples. Additional timing offsets are possible if the

entire frequency domain symbol is purposefully phase-rotated by a known amount,

effectively shifting the valid symbol timing window in either direction.
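The bound of Eq. 86 is easy to evaluate numerically. The Python sketch below is illustrative only: the function names are hypothetical, and the CP length and excess delay values used in the example (L = 512 and d = 40 samples) are assumptions chosen for demonstration, not values taken from the text.

```python
def timing_window_lower_bound(fft_size, rs_spacing, cp_length, excess_delay):
    """Exclusive lower bound on the symbol timing offset m from Eq. 86."""
    return -min(fft_size / (2.0 * rs_spacing), cp_length - excess_delay)


def timing_offset_is_valid(m, fft_size, rs_spacing, cp_length, excess_delay):
    """True when m satisfies the inequality of Eq. 86."""
    bound = timing_window_lower_bound(fft_size, rs_spacing,
                                      cp_length, excess_delay)
    return bound < m <= 0


# M = 2048 and q = 6 as in the text; L and d are assumed example values.
rs_limited = timing_window_lower_bound(2048, 6, cp_length=512, excess_delay=40)
cp_limited = timing_window_lower_bound(2048, 6, cp_length=144, excess_delay=40)
```

With a long CP the RS spacing sets the limit (about 170.67 samples), while with a short CP the remaining ISI-free portion of the CP becomes the tighter constraint.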

The spacing of the reference symbols along the time axis creates additional challenges for symbol timing synchronization. The channel estimate may be updated for each OFDM symbol that contains subcarriers with reference symbols. In many OFDM frequency-time arrangements, the RSs are sparsely populated in both time and frequency. The

channel can only be estimated at the RS positions, which may not be present in every

OFDM symbol. In LTE, the RS-occupied symbols are spaced 3 and sometimes 4 sym-

bols apart (Fig. 33). The spacing in time requires the non-RS symbols to be equalized

by interpolated channel estimates using the adjacent RS-occupied symbols. The pre-

sented cubic spline interpolation technique requires a buffer of 4 RS symbols. If the

symbol timing is not constant within this buffer, the phase rotations will be disjointed

and the channel estimation result that spans between the RS symbols will be adversely

affected. The disjoint in symbol timing is especially detrimental to adaptive channel

estimators [33]. If the symbol timing must change, the phase of the frequency domain

vectors must be rotated accordingly to compensate for the known differential shifts in

timing.
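The compensation described above follows from the DFT shift theorem: a circular delay of s samples multiplies bin k by exp(−j2πks/M), so multiplying by the conjugate phase ramp restores a common timing reference. The Python sketch below demonstrates this on a small block. The function names and the sign convention are assumptions of this example (the convention of Eq. 6 should be used in practice), and a direct DFT is used only to keep the example self-contained.

```python
import cmath


def dft(x):
    """Direct M-point DFT (adequate for a small illustrative block)."""
    size = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * k * n / size)
                for n in range(size))
            for k in range(size)]


def compensate_timing_shift(freq_symbol, shift):
    """Remove the linear phase caused by a circular delay of `shift` samples."""
    size = len(freq_symbol)
    return [value * cmath.exp(2j * cmath.pi * k * shift / size)
            for k, value in enumerate(freq_symbol)]


# A symbol observed with a 3-sample delay relative to its neighbors.
reference = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5, -2.0, 1.0]
shift = 3
delayed = [reference[(n - shift) % len(reference)]
           for n in range(len(reference))]

aligned = compensate_timing_shift(dft(delayed), shift)
expected = dft(reference)
```

After the rotation, the frequency-domain vector of the delayed symbol matches that of the reference timing, so the RS symbols in the interpolation buffer share a consistent phase slope.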


Designing a practical channel estimation algorithm requires the designer to make tradeoffs that greatly impact the performance and overall computational complexity of an OFDM receiver. An alternative to existing channel estimation techniques has been presented that offers ML optimality rather than MMSE optimality, trading performance for complexity. Practical algorithms for frequency-time interpolation have also been presented

that consider the latency of the channel estimation and equalization components in the

receiver. Many aspects of the channel estimation and interpolation algorithms have

been distilled to use constant matrices, minimizing costly online matrix manipulations.

Finally, an important link between the timing synchronization and channel estimation

component was introduced that defines limitations that arise with sparse RS spacing

along the frequency axis.

5 Resampling Techniques Using Locally Weighted Linear Regression

Many additional and interesting uses for the LWR algorithm exist outside of the channel

estimation context.

In previous discussions, the LWR algorithm is used to “refine” noisy channel esti-

mates by smoothing them, followed by interpolation using the cubic splines algorithm

to obtain the full channel estimate in the frequency dimension. The LWR algorithm

finds coefficients for the best fit polynomial for each weighted set of samples. These

coefficients can be used to interpolate data between samples, replacing the function of

the cubic splines algorithm. The cubic splines algorithm guarantees a continuous 2nd

derivative connecting each polynomial, a feature the LWR algorithm does not provide.

However, using the LWR algorithm to perform interpolation as well as data smoothing

provides excellent results and a substantial reduction in computational workload if a

subsequent interpolation stage can be eliminated. To utilize the LWR algorithm’s inter-

polation capability, the optimal coefficient vector is computed as in Eq. 66, and the x(i)

vector is substituted with coordinates other than those given in the original set (hence

interpolation). Used this way, the LWR algorithm is similar to cubic splines interpolation in that the polynomial coefficient vectors are valid within a specific range of coordinates. The example signal in Fig. 42 shows that the interpolation performance is excellent. This example shows the LWR algorithm implemented with m = 4, p = 32, and the Kaiser window as the kernel function. The example displays upsampling by a factor of 4. Notice the smoothness of the interpolated curve achieved

by fitting cubic polynomials, matching if not exceeding the performance of the cubic

spline interpolation algorithm. The LWR algorithm can be used to interpolate any co-

ordinate within the valid range of the computed best fit polynomials, enabling tasks

such as arbitrary-ratio resampling.

Figure 42: Simultaneous Data Smoothing and (4x) Interpolation Using the LWR Algorithm (m = 4) [plot titled "Data Smoothing and Interpolation Using the LWR Algorithm"; horizontal axis: vector index; vertical axis: amplitude; legend: noisy data, LWR interpolation, known signal]

The LWR algorithm can be implemented using the constant $\mathbf{Q}_{LWR}$ matrix, eliminating an online matrix inversion and several matrix multiplications. If the possible structure of the implemented algorithm is considered, a striking resemblance is revealed. To show this, the X matrix is defined:

$$
\mathbf{X} = \begin{bmatrix}
\left(-\frac{p}{2} + 1\right)^0 & \left(-\frac{p}{2} + 1\right)^1 & \cdots & \left(-\frac{p}{2} + 1\right)^{m-1} \\
\left(-\frac{p}{2} + 2\right)^0 & \left(-\frac{p}{2} + 2\right)^1 & \cdots & \left(-\frac{p}{2} + 2\right)^{m-1} \\
\vdots & \vdots & \cdots & \vdots \\
(0)^0 & (0)^1 & \cdots & (0)^{m-1} \\
(1)^0 & (1)^1 & \cdots & (1)^{m-1} \\
\vdots & \vdots & \cdots & \vdots \\
\left(\frac{p}{2}\right)^0 & \left(\frac{p}{2}\right)^1 & \cdots & \left(\frac{p}{2}\right)^{m-1}
\end{bmatrix}, \tag{87}
$$

which is now centered around the zero coordinate. Next, the kernel matrix is defined by using the kernel parameter β and Eq. 85 to place a length-p Kaiser window along the diagonal of the W matrix, which is populated with zeros elsewhere. As in Eq. 66, the m × p matrix $\mathbf{Q}_{LWR}$ is computed using X and W. Multiplying $\mathbf{Q}_{LWR}$ with the incoming p-element vector of data yields the optimum parameter vector $\theta^{(i)}_{opt}$ (Eq. 66). Finally, to generate an output $\hat{y}^{(i)}$, the $\theta^{(i)}_{opt}$ vector is multiplied by an x vector generated using the coordinate of the desired interpolation position that lies in the valid range for the particular $\theta^{(i)}_{opt}$, where $\mathbf{x} = \left[\Delta^0, \Delta^1, \ldots, \Delta^{m-1}\right]$.

Figure 43: Farrow Filter Structure Derived from the LWR Algorithm [block diagram labels: input y[n]; branch filters q1 through q5; branch outputs θ1[n] through θ5[n]; fractional coordinate Δ; output y[n+Δ]]

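The output vector $\theta^{(i)}_{opt}$ is found according to Eq. 66, which can be written as a time series using:

$$
\begin{aligned}
\mathbf{Q} &= \left(\mathbf{X}^T\mathbf{W}\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{W} \\
\boldsymbol{\theta}[n] &= \mathbf{Q}\mathbf{y}[n] = \begin{bmatrix}
\sum_{i=1}^{p} Q_{(1,i)}\, y_i \\
\sum_{i=1}^{p} Q_{(2,i)}\, y_i \\
\vdots \\
\sum_{i=1}^{p} Q_{(m,i)}\, y_i
\end{bmatrix}.
\end{aligned} \tag{88}
$$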

Each element of the vector θ [n] is obtained by performing a sum of products, which

can be implemented in hardware using a systolic array of MACC elements. Next, the

computed θ [n] is used to evaluate the polynomial using the coordinate ∆:

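$$
\begin{aligned}
y[n + \Delta] &= \mathbf{x}\boldsymbol{\theta}[n] \\
&= \Delta^0\theta_1[n] + \Delta^1\theta_2[n] + \ldots + \Delta^{m-1}\theta_m[n].
\end{aligned} \tag{89}
$$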

Using Horner’s rule, assuming m = 5 for this example, Eq. 89 can be evaluated more

efficiently according to

$$ y[n + \Delta] = \theta_1[n] + \Delta\left(\theta_2[n] + \Delta\left(\theta_3[n] + \Delta\left(\theta_4[n] + \Delta\,\theta_5[n]\right)\right)\right). \tag{90} $$

Considering these equations, the Farrow structure is revealed, as illustrated in Fig. 43. The Farrow filter is a widely used continuously variable delay filter (CVDF) that is also used for arbitrary resampling [16, 50]. It was originally introduced by Farrow [51], elaborated by Harris [25,50], and seems to remain an area of research in papers such as [52].

Figure 44: Q Matrix Row-Wise Taps (top row), Q Matrix Row-Wise Frequency Responses (bottom row), m=5, p=8, β = 30 [panels: Q(1,:) through Q(5,:) and f_Q(1,:) through f_Q(5,:)]
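The construction of Eqs. 87 through 90 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used for the figures: all function names are hypothetical, the Kaiser window is sampled at exactly p points (n = 0, ..., p − 1 with α = (p − 1)/2) so that it fits the diagonal of W, p is assumed even, and a small Gauss-Jordan routine stands in for the matrix inverse.

```python
import math


def bessel_i0(x):
    """Zeroth-order modified Bessel function of the first kind (power series)."""
    total, term, k = 0.0, 1.0, 0
    while term > 1e-17 * total:
        total += term
        k += 1
        term *= (x / 2.0) ** 2 / (k * k)
    return total


def kaiser_weights(p, beta):
    """p-point Kaiser kernel (assumption: n = 0..p-1, alpha = (p - 1) / 2)."""
    alpha = (p - 1) / 2.0
    denom = bessel_i0(beta)
    return [bessel_i0(beta * math.sqrt(max(0.0, 1.0 - ((n - alpha) / alpha) ** 2)))
            / denom for n in range(p)]


def solve(a, b):
    """Return a^{-1} b for a small square a using Gauss-Jordan elimination."""
    size = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(size)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda row: abs(aug[row][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for row in range(size):
            if row != col:
                factor = aug[row][col]
                aug[row] = [v - factor * w for v, w in zip(aug[row], aug[col])]
    return [row[size:] for row in aug]


def lwr_q_matrix(m, p, beta):
    """Q = (X^T W X)^{-1} X^T W of Eq. 88: an m x p matrix of branch filter taps."""
    coords = [i - p // 2 + 1 for i in range(p)]        # -p/2+1 ... p/2 (Eq. 87)
    w = kaiser_weights(p, beta)
    x = [[float(c) ** j for j in range(m)] for c in coords]
    xtw = [[x[i][j] * w[i] for i in range(p)] for j in range(m)]
    xtwx = [[sum(xtw[r][i] * x[i][c] for i in range(p)) for c in range(m)]
            for r in range(m)]
    return solve(xtwx, xtw)


def farrow_output(q, samples, delta):
    """theta = Q y (Eq. 88), then Horner evaluation at delta (Eq. 90)."""
    theta = [sum(qi * yi for qi, yi in zip(row, samples)) for row in q]
    y = 0.0
    for coeff in reversed(theta):
        y = y * delta + coeff
    return y


# Illustrative parameters; beta = 5 is an assumed value, not one from the text.
q_example = lwr_q_matrix(m=4, p=8, beta=5.0)
```

Each row of Q acts as one FIR branch of the Farrow structure, and Δ is measured from the sample located at the zero coordinate of the window. When the p input samples lie exactly on a polynomial of degree less than m, the weighted fit reproduces that polynomial, which provides a simple check of the construction.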

The results seen in the mentioned citations have been achieved using the LWR algorithm, an alternative formulation. If m = 5, as initially used in Eq. 90 and Fig. 43, with p = 8 and β = 30, the Q matrix can be computed and compared with the results illustrated in [50]; the result is shown in Fig. 44. Fig. 44 bears a striking resemblance to Figs. 10 and 11 in [50]. As [50] mentions, the first three rows of the Q matrix are an FIR low-pass filter, a first-order differentiator, and a second-order differentiator. Also note that the frequency responses of the first three filters are constant, linearly, and quadratically related to frequency, respectively, in the center portion of the frequency axis. Fig. 44 shows a smoother passband in the low-pass frequency response and consistent symmetry among the sets of taps in each filter, unlike the results shown in [50], which exhibit significant ripple.

As would be expected of a CVDF, the group delay is relatively flat and directly related to ∆ in the center portion of the frequency response, especially between ±0.25π rad/sample, as shown in Fig. 45. This range of frequencies is clearly where the passband of the filter exhibits minimal distortion and attenuation, as shown in Fig. 46. The next example shows the effect of the Kaiser window kernel function, whose β parameter gives the user an additional degree of freedom. Using m = 5 as before, but increasing to p = 24 and β = 250, nearly the same filter is generated as in the previous example. Fig. 47 shows the effect of the windowing operation effectively canceling the

Figure 45: Generated CVDF (Farrow) Filter's Group Delay vs. ∆ (Kaiser window kernel, p = 8, m = 5, β = 30; vertical axis: group delay in samples; curves for ∆ = −1, −0.75, −0.5, −0.25, 0, 0.25, 0.5, 0.75, 1)

Figure 46: Generated CVDF (Farrow) Filter's Magnitude and Phase vs. ∆ (Kaiser window kernel, p = 8, m = 5, β = 30; magnitude in dB normalized to 0 dB, phase in radians)

Figure 47: Q Matrix Row-Wise Taps (top row, panels Q(1,:) through Q(5,:)) and Q Matrix Row-Wise Frequency Responses (bottom row, panels f_Q(1,:) through f_Q(5,:)), m = 5, p = 24, β = 250

non-centrally located taps, leaving the same effective filter as in the previous, lower-order example (p = 8, β = 30). Decreasing β with more available taps (p = 24) widens the impulse response of each subfilter (rows of the Q matrix) and narrows the passband of the overall filter's frequency response. The subfilter impulse and frequency responses, the group delay, and the overall magnitude response are shown using β = 14 in Figs. 48, 49, and 50, respectively. With a wider window (smaller β), the entire coefficient set contributes significantly to each subfilter's output. In Fig. 49, notice that the zone of frequencies where the group delay is flat across frequency and directly related to ∆ has become narrower. The region of useful group delay properties is apparent from the passband shown in Fig. 50. Notice in Fig. 50 that the filter now has distinct pass, transition, and stop bands. The filter's sidelobes and sidelobe taper are also consistent with the characteristics given by the Kaiser window. The useful range of frequencies now lies between ±0.125π rad/sample, half of the range available in the first example. This property is useful for simultaneous smoothing and interpolation, as was seen when the algorithm was used for channel estimation and interpolation of noisy data.

The window, or kernel, used to generate the set of subfilters is not restricted to the Kaiser window. Other windows that are parameterized only by their length may also be useful for generating better stopband attenuation or narrower transition bands, particularly the Nuttall and Blackman windows (the Blackman-Harris window in particular).

An interesting conclusion is given in [50], stating that a Farrow filter (CVDF) performing an interpolation task with a ratio greater than 1-to-5

Figure 48: Q Matrix Row-Wise Taps (top row, panels Q(1,:) through Q(5,:)) and Q Matrix Row-Wise Frequency Responses (bottom row, panels f_Q(1,:) through f_Q(5,:)), m = 5, p = 24, β = 14

Figure 49: Generated CVDF (Farrow) Filter's Group Delay vs. ∆: m = 5, p = 24, β = 14 (Kaiser window kernel; vertical axis: group delay in samples; curves for ∆ = −1, −0.75, −0.5, −0.25, 0, 0.25, 0.5, 0.75, 1)

Figure 50: Generated CVDF (Farrow) Filter's Magnitude vs. ∆: m = 5, p = 24, β = 14, Useful for Simultaneous Interpolation and Smoothing (Kaiser window kernel; magnitude in dB normalized to 0 dB)

results in fewer computations than the traditional polyphase implementation. This conclusion is intuitive because the filter has m = 5 arms, as does a comparable polyphase upsampler; therefore any resampling ratio greater than 1-to-m or m-to-1 yields computational benefits.

Perhaps the most attractive capability of this class of filters is the ability to perform irrational-ratio or “inconvenient-ratio” resampling tasks. A good example of inconvenient-ratio resampling can be found when simulating a multi-path fading channel in software. Many channel models specify 5 or 10 ns tap delay resolution, implying a 200 or 100 MHz sampling rate, respectively. However, the standard UMTS (i.e., LTE) sampling rates are multiples of 30.72 MHz, so the signal must be resampled to match the channel model's rate before the specified models can be used.

To perform resampling, the delay of the filter is continuously varied so the appropriate interpolated points are produced according to the desired resampling ratio. The CVDF can be used to arbitrarily increase or decrease the sampling rate with arbitrary phase shift. Fig. 51 shows the sidelobes produced when the input signal is oversampled by varying degrees (indicated by N) and upsampled by 8 by decreasing ∆ by 1/8 for each successive output sample, wrapping back around to 1 upon underflowing below 0. The CVDF produces significant nulls located in the center of each spectral duplicate.

Figure 51: Sidelobes Resulting from CVDF Rate Transition with Varying Levels of Input Oversampledness (m = 5, p = 8, β = 30; 1-to-8 upsampling; magnitude in dB normalized to 0 dB; curves for N = 1, 2, 4, 8)

Increasingly narrow spectral duplicates are increasingly attenuated by the nulls. The oversampledness of the filter's input determines the cascaded stop-band performance. To achieve the generally desirable −96 dB stopband attenuation (or close to it), the input signal must be oversampled by at least a factor of 8. The −96 dB stop-band represents the theoretical quantization noise floor for signals with 16-bit precision. This level of precision is higher than what is effectively achieved by most realistic data converters, so the data converter is likely to remain the limiting factor in the system.

Fig. 52 illustrates the structure of a generalized Farrow-based arbitrary-ratio upsampler. If the input signal is oversampled, the design can be considered an arbitrary-ratio resampler rather than an upsampler. The input is fed into an 8x polyphase upsampling preprocessing filter, which provides the minimum necessary oversampledness to achieve desirable stopband performance, even for nearly critically sampled input signals. The 8x prototype filter is assumed to be “non-ideal” and thus has a finite-width transition band; hence the input signal can be “nearly” critically sampled, depending on the prototype filter's design characteristics. After the preprocessing step, the signal is sent to the Farrow structure. The only updates to the Farrow structure from Fig. 43

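Figure 52: Generalized CVDF (Farrow) Based Arbitrary-Ratio Upsampler Using 8x Polyphase Upsampling Preprocessor (block diagram elements: input y[n], polyphase branches h1 through h8 producing y[n/8], subfilters q1 through q5 producing θ1[n] through θ5[n], accumulator with z^-1 delay controlled by δ and α, output y[(n/8)+∆])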

are the added accumulator and offset adder, controlled by δ and α, which determine the upsampling ratio and the phase shift, respectively. To achieve the desired rate transition, the δ parameter must be set according to:

$$\delta = N\,\frac{f_{in}}{f_{out}}, \qquad (91)$$

where $f_{in}$ is the base sampling rate before the rate-N preprocessing stage, and $f_{out}$ is the final, desired sampling rate. Any continuous value of δ is allowed, such that 0 < δ ≤ N (assuming an ideal preprocessing upsampling filter). Similarly, α is permitted to be in the range −0.5 < α ≤ 0.5. Note that as δ approaches zero, the output sampling rate becomes extraordinarily high; upsampling factors on the order of millions or billions are possible given enough precision in the accumulator. Also note that the preprocessing upsampler will never be ideal, which allows δ to equal or slightly exceed N without violating Nyquist: the incoming signal must already have some excess bandwidth so that it is not affected by the finite-width transition band of a non-ideal prototype filter.
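The role of the accumulator can be sketched in a few lines. The sketch below is only an illustration under stated assumptions: it models the accumulator as a running position measured in preprocessed (rate-N) samples, splits that position into an integer sample index and a fractional offset, and lets the fraction increase with each step. The sign and wrap direction of ∆ in the actual hardware may differ, and the names `delta_for_rates` and `farrow_control` are hypothetical.

```python
import math

def delta_for_rates(n_pre, f_in, f_out):
    """Eq. 91: accumulator stride for a rate-n_pre preprocessing stage."""
    return n_pre * f_in / f_out

def farrow_control(delta_step, alpha, num_outputs):
    """Return (index, frac) pairs, one per output sample.

    index is the preprocessed sample the Farrow filter is centred on and
    frac is the fractional interpolation coordinate in [0, 1).
    """
    schedule = []
    for k in range(num_outputs):
        position = alpha + k * delta_step      # accumulator value plus phase offset
        index = math.floor(position)
        schedule.append((index, position - index))
    return schedule

# Example values taken from the 30.72 MHz to 100 MHz case discussed in the text.
delta = delta_for_rates(n_pre=8, f_in=30.72e6, f_out=100e6)
schedule = farrow_control(delta, alpha=0.0, num_outputs=1000)
```

Each pair would drive one evaluation of the Farrow polynomial: the integer part selects which block of preprocessed samples feeds the subfilters, and the fractional part is the coordinate at which the polynomial is evaluated.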

To illustrate an example of inconvenient-rate upsampling, the structure in Fig. 52 is used to transition a signal from a 30.72 MHz sampling rate to 100 MHz, which is desirable for processing LTE signals in a hardware or software wireless channel emulator.

Figure 53: Inconvenient-Rate Resampling for an LTE or UMTS System to 100 MHz Sampling Rate from 30.72 MHz (CVDF-based arbitrary upsampler magnitude response in dB, normalized to 0 dB, vs. frequency in MHz; δ = 2.4576)

The required rate transition for this example is:

$$\frac{f_{out}}{f_{in}} = \frac{100}{30.72} = 3.25520833333333\ldots = \frac{625}{192}, \qquad (92)$$

which can be precisely achieved by upsampling by 625 and downsampling the result by 192. The structure in Fig. 52 can perform this task by simply setting δ = (8 × 30.72)/100 = 2.4576, so the accumulator steps in strides of 2.4576 and thus downsamples the 8x-oversampled signal by this factor. The resulting magnitude frequency response of the upsampler is shown in Fig. 53. An LTE signal fits neatly within the 20 MHz wide passband of the resampler with minimal stop-band spectral residue from the resampling process.
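The ratios quoted above can be checked with exact rational arithmetic. The short sketch below uses only the rates mentioned in the text (30.72 MHz, 100 MHz, and the 200 MHz rate implied by a 5 ns tap resolution) and an 8x preprocessing stage; the variable names are illustrative.

```python
from fractions import Fraction

f_in = Fraction(3072, 100)            # 30.72 MHz, kept exact
n_pre = 8                             # preprocessing upsampling factor

ratio_100 = Fraction(100) / f_in      # overall ratio for a 100 MHz target
ratio_200 = Fraction(200) / f_in      # overall ratio for a 200 MHz target

delta_100 = n_pre * f_in / 100        # Eq. 91 for the 100 MHz target
delta_200 = n_pre * f_in / 200        # Eq. 91 for the 200 MHz target

print(ratio_100, ratio_200, float(delta_100), float(delta_200))
```

A conventional rational resampler would need an interpolate-by-625 stage for either target, whereas the Farrow-based structure only needs a different accumulator stride.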

As mentioned earlier, the prototype filter implemented in the preprocessing upsampler will have a finite-width transition band, and therefore its incoming signal must have excess bandwidth according to the prototype filter; otherwise portions of the spectrum will be attenuated by the filter's transition band. If the δ parameter is made slightly higher than the upsampling factor of the preprocessing filter, overall functionality can include downsampling without folding aliases into the passband and the signal of interest. Assuming the center of the transition band is located at the Nyquist boundary of the sampling rate of the original signal, the range of the δ parameter can be

Figure 54: Farrow-Based LTE Resampling Filter (CVDF (Farrow)-based resampler magnitude response in dB, normalized to 0 dB, vs. frequency in MHz; N = 8, δ = 8.17, α = 0.35)

extended to $0 < \delta < \left(N + \frac{\alpha}{2}\right)$ without overlapping aliases, where α in this context denotes the excess bandwidth of the preprocessing filter and thus of the input signal.

Using δ = 8.17 and configuring the preprocessing prototype filter's excess bandwidth using α = 0.35, the result in Fig. 54 shows minimal interference from aliases in the passband. The overall downsampling factor in this example is 8/8.17 ≅ 0.9792, and the system is able to recover LTE signals with sampling clock errors of just over +2%. This capability is a critical component of the architecture illustrated in Fig. 7 and introduced in Sec. 3.3, where the incoming signal is dynamically resampled to cancel sampling clock frequency errors based on time-domain estimates.
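As a quick check of this example, using only the values quoted above (N = 8, α = 0.35, δ = 8.17):

$$N + \frac{\alpha}{2} = 8 + \frac{0.35}{2} = 8.175 > \delta = 8.17, \qquad \frac{N}{\delta} = \frac{8}{8.17} \approx 0.9792, \qquad \frac{\delta}{N} - 1 = \frac{8.17}{8} - 1 \approx 2.1\%.$$

The chosen δ therefore sits just inside the extended range, and the corresponding rate mismatch agrees with the sampling clock error of just over +2% stated in the text.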



6 Exploitation of Excess Cyclic Prefix to Improve Reception Quality

As described in Sec. 3.2, ISI is introduced when the DFT operation contains energy from

more than one symbol. Orthogonal reception is maintained when the starting point of

the symbol position lies in the segment of CP after the echoes from the previous symbol

have subsided (Eq. 86). Echoes are introduced by a channel with “memory”. The size

of unusable CP is determined by the channel’s excess delay, denoted by d.

In a system operating in a memoryless channel (i.e. d = 0), no ISI energy is present

and the entire CP is redundant. In a channel with memory, the first segment of d

samples in the CP are corrupted by ISI, leaving L− d excess CP samples. It is assumed

that an OFDM system operating in its intended channel environment will be designed

with an adequately long CP to provide ISI-free operation in normal circumstances.

Therefore, in normal circumstances, the CP is longer than is required by the channel

and the received signal contains redundancy.

The LTE standard includes two CP modes, “normal” and “extended”, with durations of 4.6875 µs and 16.667 µs, respectively. Of the three channels used in the LTE conformance tests, the longest excess delay is 5 µs in the extended typical urban scenario [42], leaving inadequate CP for the “normal” mode and vast lengths of unused, excess CP for the extended mode. Considering the RS configuration discussed in Sec. 4.3, the frequency spacing of the RSs introduces downsampling of the channel's frequency response, limiting the maximum excess delay to a value far shorter than the entire CP duration. Normal operation therefore implies not just CP redundancy, but very large amounts of it. The LTE RS configuration can resolve excess delays and symbol timing shifts of up to M/(2q) samples, or 5.555 µs (Eq. 86). Assuming no symbol timing shift while operating in the extended CP mode, at least 2/3 of the CP is redundant under typical conditions.
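The durations quoted in this paragraph can be reproduced from sample counts. The constants below are assumptions chosen because they reproduce the figures in the text: a 30.72 MHz sampling rate with M = 2048, an RS frequency spacing of q = 6, and CP lengths of 144 and 512 samples (the slightly longer CP on the first symbol of each normal-CP slot is ignored here).

```python
FS_HZ = 30.72e6        # assumed sampling rate
M = 2048               # assumed DFT size
Q_SPACING = 6          # assumed RS frequency spacing q, in subcarriers
CP_NORMAL = 144        # assumed normal-CP length, in samples
CP_EXTENDED = 512      # assumed extended-CP length, in samples

def to_microseconds(samples):
    """Convert a sample count at FS_HZ to microseconds."""
    return samples / FS_HZ * 1e6

normal_cp_us = to_microseconds(CP_NORMAL)
extended_cp_us = to_microseconds(CP_EXTENDED)
max_resolvable_us = to_microseconds(M / (2 * Q_SPACING))    # M/(2q) samples
redundant_fraction = 1.0 - max_resolvable_us / extended_cp_us

print(normal_cp_us, extended_cp_us, max_resolvable_us, redundant_fraction)
```

Under these assumptions the resolvable delay is one third of the extended CP, which is where the "at least 2/3 redundant" figure comes from.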

Palenik in [53] likens the excess CP to an inner repetition channel code, while the added turbo or low-density parity check (LDPC) channel coding is the outer code. Palenik combines the redundant information after demodulation and demapping by summing the semi-correlated set of log-likelihood ratios computed by separate soft-decision demappers. Performing redundancy combination after equalization, demodulation, and demapping dramatically increases the computational complexity of the receiver and fails to benefit receiver components that could take advantage of the available redundancy, such as the channel estimation component, which primarily influences overall receiver performance.

Nearly every OFDM receiver must be capable of performing channel estimation and equalization, a process that requires the capability of estimating (the inverse of) the channel's frequency and/or impulse response. In all cases, an estimate of d is either directly or readily available, and can be used to estimate the available number of redundant CP samples. Many existing channel estimation algorithms take the channel's observed noisy frequency response, convert it to the CIR on which statistical estimation is performed, then convert the result back to the frequency response for equalization [32, 54, 55]. Through the process of converting the frequency response to the CIR, an estimate of d can be easily gleaned from the already existing information.

First, an inefficient but intuitive method to utilize the available redundancy requires a slight modification of the existing OFDM receiver by adding a second DFT operation. One DFT operates on the block of M contiguous samples starting at m = −(L − d) + 1, and the other starts at m = 0. The DFT starting at m = −(L − d) + 1 is offset by a known amount from the other, so the channel estimation component is not needed to derotate its phase. Phase correction (pre-equalization) can be performed by multiplying the DFT result with $\Phi^H$:

$$\Phi^H = \mathrm{diag}\left(\phi^H\right), \qquad (93)$$

where φ is the M × 1 vector defined in Eq. 6 using τ = −m. The timing offset between the DFT operations is known and the RSs are not needed to determine the phase rotation. In this case, the timing shift can extend beyond the frequency resolution limitation imposed by the RS frequency spacing. After equalizing the phase of the offset DFT, the DFT results can be summed to form a single vector. After the summation, the signal and noise power are not scaled equally. The signal power is doubled, and the noise power increase is dependent on the level of correlation between the additive noise residing in the samples used in the pair of DFT operations. The correlation is directly dependent on the time separation of the two DFT windows. When the two DFT windows contain exactly the same samples, the noise terms are equal and maximally correlated. Conversely, if the DFT windows are maximally separated, the number of mutually exclusive observations of the signal obtained by each DFT operation is maximized, thus minimizing the level of correlation in the noise. The receiver that combines the redundancy in the available excess CP using two DFT operations is defined by

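$$\begin{aligned} u(k) &= H Z_T W_M^H x(k) + n(k) \\ v_1(k) &= W_M Z_R\, u(k) \\ v_2(k) &= \Phi_m^H W_M Z_m\, u(k) \\ v(k) &= E\left(\tfrac{1}{2}\left(v_1(k) + v_2(k)\right)\right), \end{aligned} \qquad (94)$$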

where the new permutation matrix $Z_m$ is used to select any contiguous group of samples in u(k) starting at position m = −(L − d).

$$Z_m = \begin{bmatrix} 0_d & I_M & 0_{L-d} \end{bmatrix} \qquad (95)$$
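The two-DFT receiver can be illustrated with a small noise-free experiment. The sketch below is not the dissertation's implementation: it assumes NumPy's forward-DFT sign convention (so the correction phase is exp(+j2πk(L − d)/M), which may differ in sign from the convention of Eq. 6), uses a hypothetical 5-tap channel, and omits the equalizer E. The function names are illustrative.

```python
import numpy as np

def received_symbol(x_freq, L, h):
    """Noise-free received block of length M + L for one CP-OFDM symbol.

    The first len(h) - 1 samples lack the previous symbol's echoes; they
    stand in for the ISI-corrupted part of the CP and are never used below.
    """
    x_time = np.fft.ifft(x_freq)
    tx = np.concatenate([x_time[-L:], x_time])      # prepend the cyclic prefix
    return np.convolve(tx, h)[:len(tx)]

def two_dft_combine(u, M, L, d):
    """Average a conventional DFT window with one starting L - d samples early."""
    shift = L - d
    k = np.arange(M)
    v1 = np.fft.fft(u[L:L + M])                     # window aligned with the symbol body
    v2 = np.fft.fft(u[d:d + M])                     # window starting inside the excess CP
    phi_h = np.exp(2j * np.pi * k * shift / M)      # removes the known linear phase
    v2_aligned = phi_h * v2
    return 0.5 * (v1 + v2_aligned), v1, v2_aligned

M, L, d = 64, 16, 4
rng = np.random.default_rng(0)
x_freq = rng.choice([-1.0, 1.0], M) + 1j * rng.choice([-1.0, 1.0], M)
h = np.array([1.0, 0.0, 0.4, 0.0, 0.2])             # hypothetical channel, excess delay d = 4
u = received_symbol(x_freq, L, h)
v, v1, v2_aligned = two_dft_combine(u, M, L, d)
```

Without noise the two aligned DFT outputs coincide, so the average equals the usual channel-weighted symbol. With noise, the two windows share most of their samples, which is why the noise in the two branches is only partially decorrelated.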

A simpler, more computationally efficient method exists. Rather than performing two DFT operations, the redundancy can be captured in the time domain. The redundant samples can be directly summed together before the receiver's DFT operation. Assuming the noise is i.i.d. throughout the entire symbol, the noise contained in each redundant segment of the symbol is uncorrelated. Summing the two segments in the time domain reduces the relative noise variance while only requiring a single DFT operation. The time-domain combination is most effective when sampling frequency error has been minimized or eliminated. The matrix notation representation of this receiver is defined by

$$\begin{aligned} u(k) &= H Z_T W_M^H x(k) + n(k) \\ v(k) &= E\,W_M\left(Z_1 + \tfrac{1}{2}\left(Z_2 + Z_3\right)\right)u(k), \end{aligned} \qquad (96)$$

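where the M × P dimension permutation matrices are defined by

$$Z_1 = \begin{bmatrix} 0_L & I_{M-L-d} & 0_{L-d} \\ 0 & 0 & 0 \end{bmatrix}, \qquad Z_2 = \begin{bmatrix} 0 & 0 \\ 0 & I_{L-d} \end{bmatrix}, \qquad Z_3 = \begin{bmatrix} 0 & 0 & 0 \\ 0_d & I_{L-d} & 0_M \end{bmatrix}. \qquad (97)$$

$Z_1$ selects the first M − L − d samples after the CP, $Z_2$ selects the final L − d samples in u(k) and places zeros elsewhere, and $Z_3$ selects the L − d excess samples of the CP and shifts them to the end of a vector of zeros.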

The redundant segments summed in Eq. 96 contain i.i.d. WGN from the n(k) vector. Prior to the DFT operation, two levels of noise variance are present in the symbol vector, resulting from the redundancy combination. The first M − L − d samples are unaffected by the operation, having the baseline noise variance $\sigma_n^2$. However, the last L − d samples now have an expected noise variance of $\sigma_n^2/2$. The two segments of noise in each symbol are uncorrelated; therefore the noise variances of each segment can be summed to determine the joint variance.

$$\sigma_n^2 = \frac{\sigma_n^2}{M-L-d} + \frac{1}{2}\,\frac{\sigma_n^2}{L-d} \qquad (98)$$

According to the Parseval/Rayleigh energy conservation theorem, the noise power remains constant before and after the DFT. Therefore, by combining the excess CP samples, the noise variance is reduced by a level that depends on the channel's excess delay and the relative size of the CP to the DFT size.
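The time-domain combination can be sketched and checked empirically. This is an illustration under assumptions, not the author's implementation: the function name `combine_cp` is hypothetical, the block sizes are arbitrary, the noise is real-valued unit-variance WGN, and the unaffected part of the body is taken to be every sample not overlapped by the L − d averaged ones. The sketch measures the per-segment noise variance directly rather than evaluating Eq. 98.

```python
import numpy as np

def combine_cp(u, M, L, d):
    """Average the L - d ISI-free CP samples into the tail of the symbol body.

    u is one received block of M + L samples (CP first); the result has M
    samples and is what would be passed to the single DFT.
    """
    n_red = L - d
    body = np.array(u[L:L + M], dtype=complex)
    body[M - n_red:] = 0.5 * (body[M - n_red:] + u[d:L])
    return body

M, L, d, trials = 64, 16, 4, 4000
rng = np.random.default_rng(0)
noise = rng.standard_normal((trials, M + L))            # unit-variance WGN blocks
combined = np.array([combine_cp(block, M, L, d) for block in noise])

var_head = combined[:, :M - (L - d)].real.var()         # samples left untouched
var_tail = combined[:, M - (L - d):].real.var()         # averaged samples
var_mean = combined.real.var()                          # average over the whole symbol
```

In this experiment the untouched samples keep the baseline variance while the averaged tail shows roughly half of it, so the symbol-wide improvement grows with the fraction of the symbol covered by excess CP.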

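After combining the redundancy, a level of correlation is introduced to the noise across the frequency bins of the DFT, which is straightforward to prove. The noise profile can be represented using the rectangle, or “rect”, function and its discrete-time Fourier transform as defined by

$$\mathrm{rect}\left(\frac{n}{M}\right) = \begin{cases} 1, & \left|\frac{n}{M}\right| \le \frac{1}{2} \\ 0, & \text{otherwise} \end{cases} \;\overset{DTFT}{\longleftrightarrow}\; \frac{\sin\left(\omega\left(M+\frac{1}{2}\right)\right)}{\sin\left(\frac{\omega}{2}\right)}. \qquad (99)$$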

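and, using the comb function

$$\mathrm{comb}_M(n) = \sum_{k=-\infty}^{+\infty} \delta(n-kM) \;\overset{DTFT}{\longleftrightarrow}\; \frac{1}{M}\sum_{k=-\infty}^{+\infty} \delta\left(\frac{\omega}{2\pi}-\frac{k}{M}\right) = \sum_{k=-\infty}^{+\infty} e^{-j\omega Mk} \qquad (100)$$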

to periodically profile a discrete-time unit-variance zero-mean Gaussian random variable n(n)

$$\left(\frac{\sigma^2}{2}\left(\mathrm{rect}\left(\frac{n}{M-d}\right)+1\right) \otimes \mathrm{comb}_M\left(n-\frac{M-d}{2}\right)\right) \times n(n), \qquad (101)$$

where ⊗ denotes the discrete convolution operator and $\sigma_n^2$ denotes the variance of the

Figure 55: Receiver Architecture: Combining CP Redundancy in the Time Domain (block diagram elements: input u(k), z^-M delay, divide-by-2, multiplexer with inputs 0 and 1 and a select line, FFT, Channel Estimation, EQ, E, v(k))

channel's noise applied to n(n). The comb function makes the profiling periodic with M; therefore analysis can be performed using the DFT of size M, while absorbing the time shift into the rect function.

$$\frac{\sigma^2}{2}\left(\mathrm{rect}\left(\frac{n-\frac{M-d}{2}}{M-d}\right)+1\right) \times n(n) \qquad (102)$$

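Taking the DFT of Eq. 102 reveals a Dirichlet kernel with a linear phase shift and a DC component, circularly convolved with the DFT of the noise in the frequency domain.

$$\left[\sigma^2\,\frac{\sin\left(\omega\left(\frac{M-d}{2}+\frac{1}{2}\right)\right)}{2\sin\left(\frac{\omega}{2}\right)}\,e^{-j\omega\frac{M-d}{2}} + \left(M-\frac{d}{2}\right)\delta(\omega)\right] \otimes N(\omega) \qquad (103)$$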

The Dirichlet kernel imposes a circular “low-pass” filter on the noise in the frequency domain. This action correlates the noise across the frequency bins of the DFT. The circular autocorrelation of the noise is defined by the left-hand operand of the convolution in Eq. 103 and depends on d, the excess delay in the channel. The circular convolution in the frequency domain can be likened to low-pass filtering the noise, attenuating the high-frequency components with a cutoff dependent on d. Larger d widens the Dirichlet kernel and narrows the passband of the filter. Perhaps this was already intuitive, but now the correlation of the noise is explicitly defined, which is useful information to have when using statistical techniques that require an autocorrelation matrix, particularly the receiver's channel estimation component.

Fig. 55 depicts a system architecture for the receiver described by Eq. 96. The architecture requires a delay element of length M, a division by two, which can be performed by a simple bit-shift operation, and a multiplexer. The multiplexer selects u(k) (input 0) for the first M − L − d samples of each symbol, then switches to input 1

Figure 56: SNR Enhancement Using CP Redundancy in AWGN Channel and Multi-Path Channel Conditions (MSE vs. SNR with and without CP redundancy combining; curves: multi-path channel with combining enabled and disabled, AWGN channel with combining enabled and disabled)

to select the L − d redundant, averaged samples. Perhaps the most significant modification is using the estimated channel to estimate d.

This simple addition to the receiver can provide modest boosts in SNR (Eq. 98), as seen from the test results shown in Fig. 56. Testing was conducted using two channel conditions: an AWGN-only channel, and AWGN combined with a 3-tap channel that has an excess delay of 2,500 ns, equal to that of the EVA model found in the LTE conformance tests. In both scenarios, the LTE RS configuration is used, along with the most basic least-squares channel estimation algorithm coupled with spline interpolation to estimate the non-RS positions, the baseline configuration presented in Sec. 4.2. Both tests reveal a nearly constant level of MSE improvement across SNR. In this test, the CP redundancy combination provides 0.5 to 0.75 dB of SNR improvement.


In normal operation, when the excess delay of the channel is shorter than the duration of the CP, a received OFDM signal contains redundancy provided by the excess CP that remains ISI-free after passing through the channel. When the two redundant segments in each symbol are simply summed and divided by two, the signal power remains constant while the noise variance in the combined segment is reduced by a factor of 2. In an LTE system operating in normal conditions, CP redundancy is guaranteed. Assuming normal operating conditions, at least 2/3 of the CP is redundant in the extended CP configuration. The overall SNR gain achieved by the redundancy combination has been shown to be significant, given its simplicity. The SNR gain is obtained using an estimate of the channel's excess delay, information that is already available or easily obtained from the channel estimation component in the receiver.



7 Real-Time Wireless Channel Emulation

To test algorithms in a wireless communications system, the designer may first perform simplistic simulations using a time-stationary AWGN channel. Later, more complex simulations with time-varying channel conditions must be performed that take into account the channel conditions in the intended operating environment. To perform time-varying channel simulations, recorded channel conditions could be used, or even live field testing could be performed. These methods constrain the simulation to a specific operating environment, may not be repeatable, and may be cost prohibitive. Instead, if the channel can be modeled, computer simulation can be performed with user-defined channel properties that emulate real-world conditions. The user can program the emulator with industry standard models, or even their own models derived from empirical measurements in their desired scenario.

Using computer software, short simulations of time-varying channels can be performed with relatively little effort. However, computer simulation is rarely performed in real time and is not suitable for use with a communications system that has already been implemented in real-time hardware. Real-time hardware tests are an essential part of system development. It is typical for a designer to discover new or unforeseen problems when their implementation is exposed to real-world or real-time conditions, particularly those involving complex state machines or control systems embedded in the receiver architecture.

It has become common to generate and store pre-computed simulated signals or recorded field test signals in large banks of DRAM or disk storage. The stored signal is then “played” by streaming the samples to a DAC in real time. For long simulations, the signal must either repeat without cyclic continuity or end. When channel conditions are repeated, a large instantaneous discontinuity occurs. Not only are the discontinuities unrealistic, but they can also cause unexpected problems. The abrupt repetition boundaries can cause spurious spectral emissions, or can even cause internal receiver control systems and adaptive algorithms to fail or perform unexpectedly. Even if the channel conditions do have continuous repetition boundaries, the receiver experiences the same repeating scenario, which may give the designer false confidence in the receiver's general behavior.

For continuous, long-term testing of hardware receivers in real time, a hardware channel emulator capable of processing signals in real time becomes necessary. The transmitted signal is generated and stored in RAM but can be repeated with cyclic continuity at the symbol boundaries. For simulation with an LTE system, several frames or even a single frame of “clean” signal can be stored and looped with a repetition period of tens of milliseconds, then processed by a real-time hardware channel emulator for extended periods of repetition-free channel conditions.

This chapter introduces a theoretical framework for multi-path fading channel emulation for both SISO and MIMO channels. The highlight of the chapter and the biggest contribution of this work is the developed system architecture and hardware implementation of the channel emulators in FPGA hardware, achieving real-time operation.

7.1 Real-Time Multi-Path SISO Channel Emulation

In a single-input single-output (SISO) OFDM system, the influence of the wireless channel on the transmitted signal can be modeled by a linear convolution with the channel's (finite-length) impulse response. In a mobile channel, the channel's impulse response is time-varying or time-dependent and can be described by

$$h(t,\tau) = \sum_{i=1}^{p} c_i(t)\,\delta\left(\tau-\tau_i\right), \qquad (104)$$

where $\tau = \left[\tau_1 = 0, \tau_2, \ldots, \tau_p\right]^T$, $\tau_i \in \mathbb{R}$, $\tau_i > 0$ for $2 \le i \le p$, indicates the vector of p delays corresponding to each echo, or path, in the channel. Each echo also has a corresponding complex weight defined by $c(t) = \left[c_1(t), c_2(t), \ldots, c_p(t)\right]^T$, $c_i(t) \in \mathbb{C}$. The time-varying function h(t,τ) indicates the response of the channel at the continuous time t for the delay τ ∈ (−∞,+∞). The transmitted signal x(t) is linearly convolved with the time-varying h(t,τ) to produce the received signal y(t)

$$y(t) = \sum_{n=-\infty}^{\infty} h(n,\tau)\,x(t-n). \qquad (105)$$

The graphical representation of Eq. 105 is shown in Fig. 57. In a discrete-time system, the operating rate of the tapped delay-line structure shown in Fig. 57 determines the processing bandwidth and the tap-delay resolution.
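A discrete-time version of the tapped delay-line model can be sketched as follows. This is a minimal illustration, assuming integer sample delays and one complex gain per path per sample; the function name `tapped_delay_line`, the delays, and the gains are hypothetical.

```python
import numpy as np

def tapped_delay_line(x, delays, coeffs):
    """y[t] = sum_i c_i[t] * x[t - delays[i]] for integer sample delays.

    coeffs has shape (p, len(x)): one complex gain per path per sample.
    """
    x = np.asarray(x, dtype=complex)
    y = np.zeros(len(x), dtype=complex)
    for delay, gain in zip(delays, coeffs):
        delayed = np.zeros(len(x), dtype=complex)
        delayed[delay:] = x[:len(x) - delay]     # shift the input by this path's delay
        y += gain * delayed                      # apply the time-varying path weight
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal(256) + 1j * rng.standard_normal(256)
delays = [0, 3, 7]                               # hypothetical path delays, in samples
static_gains = np.array([1.0, 0.5j, -0.25])      # hypothetical path weights
coeffs = np.repeat(static_gains[:, None], len(x), axis=1)
y = tapped_delay_line(x, delays, coeffs)
```

With constant gains the structure reduces to an ordinary FIR convolution; making each row of `coeffs` a fading process turns it into the time-varying channel described above. The tap-delay resolution is one sample, which is why the operating rate of the structure matters.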

7.1.1 Stochastic Jakes Process Generation

Figure 57: Time-Varying SISO Channel Model (tapped delay line: input x[t], delays ∆t_2 through ∆t_p, coefficients c_1[t] through c_p[t], summed to give y[t])

In a mobile channel environment, the elements in the channel coefficient vector c(t) are time-dependent. The time-varying nature results from the mobile device traveling through space, encountering time-varying reflections and diffraction from the changing surroundings. To model the behavior of a multi-path fading channel in a dense scattering environment assuming an omnidirectional antenna radiation pattern, the elements in c(t) can each be modeled by i.i.d. stochastic Jakes processes [44, 56, 57], which can be characterized using only two parameters, the carrier wavelength λ and the velocity of the receiver v. These two parameters are used to define the maximum Doppler spread $f_{max}$, expressed in units of Hz, that results from the changing relative ray lengths from the reflections in the channel as the mobile device travels through space (the Jakes model assumes no line-of-sight component).

$$f_{max} = \frac{v}{\lambda} \qquad (106)$$

To define the Jakes processes that make up c(t), the elements of c(t) are first decomposed into their real and imaginary components.

$$c_i(t) = \mu_1(t) + j\mu_2(t) \qquad (107)$$

The real and imaginary components of $c_i(t)$ have the following statistical properties [57, 58].

$$r_{\mu_1\mu_2}(\tau) = 0, \quad \forall\tau, \qquad (108)$$

$$r_{\mu\mu}(\tau) = \sigma_\mu^2\,J_0\left(2\pi f_{max}\tau\right), \quad \forall\tau. \qquad (109)$$

According to Eqs. 108 and 109, the real and imaginary components have zero cross-correlation and an autocorrelation given by the zeroth-order Bessel function of the first kind, whose argument depends on $f_{max}$. Taking the Fourier transform of Eq. 109 reveals the continuous-frequency power spectral density (PSD) of the real and imaginary components of each $c_i(t)$.

$$S_{\mu\mu}\left(f, f_{max}\right) = \begin{cases} \dfrac{1}{\pi f_{max}\sqrt{1-\left(f/f_{max}\right)^2}}, & |f| \le f_{max} \\ 0, & |f| > f_{max} \end{cases}, \qquad f \in (-\infty,\infty) \qquad (110)$$

To generate a stochastic Jakes process, several methods are available in the liter-

ature. Pätzold in [58] presents a method that sums a large number of random-phase

sinusoids, weighted and spaced to fit the PSD defined in Eq. 110. The approximation

quality and the repetition length depend on the spacing and the number of sinusoids

used. The sum of sinusoids (SOS) method can be efficiently implemented in hardware

by storing a single period of the lowest frequency sinusoid in a ROM, as suggested

by [58]. Using the concept of direct-digital synthesis (DDS), the ROM can be accessed

and shared by a number of phase accumulators, each with different accumulate and

offset values. Instead of storing a full period in ROM, only one quarter of one period

is necessary if the symmetry properties of the sinusoid are exploited. The SOS method

has become popular, appearing in several recent publications [59–61].
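The quarter-period ROM idea can be sketched in software as follows. This is a minimal illustration rather than the hardware structure of [58]: the table size, the half-step sampling offset that makes the quadrant mirroring exact, and the names dds_sine and sum_of_sinusoids are all assumptions made for the example.

```python
import math

ROM_BITS = 8                      # assumed quarter-wave table depth: 2**8 entries
QUARTER = 1 << ROM_BITS
FULL = 4 * QUARTER                # phase steps in one full period

# Quarter period of the sinusoid, sampled at half-step offsets.
ROM = [math.sin(2.0 * math.pi * (i + 0.5) / FULL) for i in range(QUARTER)]


def dds_sine(phase_index):
    """Rebuild a full-period sine from the quarter-wave ROM using symmetry."""
    phase_index %= FULL
    quadrant, offset = divmod(phase_index, QUARTER)
    if quadrant & 1:              # quadrants 1 and 3 read the table backwards
        offset = QUARTER - 1 - offset
    value = ROM[offset]
    return -value if quadrant & 2 else value   # quadrants 2 and 3 are negated


def sum_of_sinusoids(n, increments, offsets, gains):
    """One output sample: several phase accumulators sharing a single ROM."""
    return sum(g * dds_sine(inc * n + off)
               for inc, off, g in zip(increments, offsets, gains))
```

Each (increment, offset) pair plays the role of one phase accumulator; only the accumulators are replicated, while the table is shared.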

Despite the promising implementation of the SOS method, the “traditional” method

used prior to the SOS method can also be implemented in a very computationally

efficient manner. The method suggested by [44, 57, 58, 62, 63] uses a discrete IIR

or FIR filter to process i.i.d. WGN to generate each Jakes process. The coefficients

for the discrete filters are derived from sampling the ACF of the Jakes process. The

literature tends to focus on the generation of a single channel coefficient and neglects

to consider the larger system-level view, i.e. when multiple channel coefficients must

be generated for a complex multi-path channel. [63] presents a method that generates

many i.i.d. WGN processes from a single serial-output WGN source by distributing the

output samples to a bank of Jakes filters using a deserializer. The time-multiplexing

of a single WGN generator is made possible by exploiting the property that a sample

taken from a WGN process is uncorrelated and independent of other samples; thus

any desired number of lower rate i.i.d. WGN sub-processes can be generated from a

single WGN parent process. Digital WGN sources typically have very long repetition

lengths and can be generated using simple logical elements [44]. Both methods give

good results, but it is believed that the “spectrum shaping” method scales more easily

by replicating the filters and adding outputs to the WGN deserializer.
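The deserializing step can be illustrated with a short behavioral sketch. It assumes a simple round-robin distribution of the serial samples, and it uses Python's pseudo-random Gaussian generator as a stand-in for the hardware WGN source; the function name deserialize is hypothetical.

```python
import random


def deserialize(parent_samples, n_branches):
    """Distribute a serial stream round-robin into n_branches lower-rate streams."""
    return [parent_samples[b::n_branches] for b in range(n_branches)]


# Stand-in for a single serial-output WGN source (assumption for illustration).
rng = random.Random(0)
parent = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Four sub-processes, each running at one quarter of the parent rate.
branches = deserialize(parent, 4)
```

Because successive samples of the parent process are independent, each branch receives an independent white sequence at 1/n_branches of the parent rate.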

To generate a Jakes process using the traditional method of filtering WGN, an FIR

[Figure 58 plots. Panel "frequency response of designed jakes filter, fd=.8, N=256": normalized magnitude (dB) versus normalized frequency (× fs). Panel "impulse response": normalized amplitude versus samples. Traces in both panels: Pre-Window and Post-Window.]

Figure 58: Designed Jakes FIR Filter: NJakes = 256, fmax = 100 Hz, fd = .8

filter can be generated by sampling the PSD given by Eq. 110.

$$ H(f_k) = \sqrt{S_{\mu\mu}\left(f_k, f_d\right)}, \qquad f_k = \frac{k-1}{N_{Jakes}} - \frac{1}{2}, \qquad k = 1, 2, \ldots, N_{Jakes}, \qquad f_d = \frac{2 f_{max}}{f_s}. \tag{111} $$

Here, NJakes represents the number of taps used in the FIR filter, fs indicates the sampling frequency, and fd defines the Doppler spread normalized to the sampling frequency. After sampling the PSD, a Kaiser window is applied to the inverse Fourier transform of H(fk) to obtain the set of FIR filter coefficients. To establish a running design example, the following parameters are chosen: p = 9, fmax = 100 Hz, fd = .8, and

NJakes = 256. Using these parameters and Eq. 111, Fig. 58 shows the designed Jakes

FIR filter. Before windowing, the filter’s impulse response decays very slowly due to the

[Figure 59 block diagram. Labeled blocks: WGN Generator; shift registers 1 to 2p; Jakes Filter Coeff ROM (ROM addr, REG addr); accumulator register (z-1) with reset; Scaling Coeff RAM (Channel addr, Scaling Data); FIFOs 1 to 2p; outputs µ1[n], µ2[n], ..., µ2p-1[n], µp[n].]

Figure 59: Single MACC Element Jakes Filter Processing p Complex Jakes Processes

Bessel function in the ACF. The windowing serves to suppress the infinite tails of the

ideal Jakes ACF to a finite window size, reducing the Gibbs phenomenon and minimizing

the out-of-band content to an acceptable level.
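The frequency-sampling design can be sketched with NumPy as follows. Several details are assumptions made for illustration: the Kaiser shape parameter (beta = 8) is not given in the text, the Doppler band edge is placed at fd/2 of the sampling rate (that is, at fmax/fs, since fd = 2 fmax/fs), and grid points at or beyond the band edge are set to zero to avoid the singularity of Eq. 110.

```python
import numpy as np


def design_jakes_fir(n_taps=256, fd=0.8, kaiser_beta=8.0):
    """Sample the Jakes PSD, take its square root, inverse-transform, and window."""
    k = np.arange(1, n_taps + 1)
    fk = (k - 1) / n_taps - 0.5            # frequency grid of Eq. 111, in units of fs
    edge = fd / 2.0                        # assumed band edge fmax/fs
    ratio = fk / edge
    psd = np.zeros(n_taps)
    inside = np.abs(ratio) < 1.0           # strict inequality skips the band-edge singularity
    psd[inside] = 1.0 / (np.pi * edge * np.sqrt(1.0 - ratio[inside] ** 2))
    H = np.sqrt(psd)
    # Move DC to bin 0, inverse DFT, then centre the impulse response.
    h = np.fft.fftshift(np.fft.ifft(np.fft.ifftshift(H)).real)
    h_windowed = h * np.kaiser(n_taps, kaiser_beta)
    # Normalize both responses to unit peak for comparison.
    return h / np.max(np.abs(h)), h_windowed / np.max(np.abs(h_windowed))


pre_window, post_window = design_jakes_fir()
```

Comparing the tails of pre_window and post_window shows the effect described above: the window forces the slowly decaying Bessel-like tails toward zero at the ends of the tap vector.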

An IIR design better describes the Jakes ACF and can match the infinitely long ideal

Jakes ACF. However, as seen in the IIR design presented by [58], large spectral peaks are generated as |f| approaches fmax, and the designed filter exhibits large amounts of

ripple. IIR designs have been found to have problems resulting from the spectral peaks

of the Jakes PSD. The discontinuity in the PSD causes the poles of the designed IIR filter

to be placed dangerously near the unit circle, as one would expect with the infinitely

long decaying, or “ringing” behavior of the impulse response, making the filter highly

susceptible to instability associated with numerical error.

The overall implementation structure of the multi-channel FIR-based Jakes process

generator is shown in Fig. 59. The implementation requires 2p shift registers, each

having NJakes storage elements, in addition to an (NJakes/2 + 1)-element coefficient ROM and a

bank of 2p FIFO buffers that hold at least 2 elements each to time-align the sequentially

generated output stream into vectors. The user programs the expected power of each

path by populating the scaling coefficient RAM with the p values indicated by the

desired channel power-delay profile (PDP).

7.1.2 Arbitrary-Ratio Upsampler Design: User-Variable Doppler

At this point in the design, the sampling rate of the Jakes processes is far too low

to be usable in a wideband system. The rate of the Jakes processes must match the

processing rate of the channel; therefore an upsampler must be inserted between the

coefficient generators and the channel processing structure (Fig. 57). If the imple-

mented upsampler allows its rate transition to be varied by the user while keeping the

rate of the channel processing component constant, the user can adjust the Doppler

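[Figure 60 block diagram. Labeled blocks and signals: input µ[n]; 4x upsampler; Polyphase Filter (output y); Polyphase Filter (output y*); accumulator with input δ, integer output k, and fraction output α; interpolation weights α and 1-α; output µ[δn/128].]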

Figure 60: Arbitrary-Ratio Resampler Architecture

frequency fmax . The change in rate adds a variable amount of additional excess band-

width to the nearly critically sampled Jakes processes. This action also reduces the

workload of the Jakes process generators. Increasing (decreasing) the rate transition

slows (speeds) the consumption of the samples produced by the Jakes processes. The

variable rate upsampler paces (determines the rate of) the Jakes process generator.

To perform arbitrary-ratio upsampling, the design shown in Fig. 60, introduced

in [25, 64], as well as the resampling architecture presented in Fig. 52 can be used.

Both are capable of producing virtually limitless upsampling factors with near arbitrary

resolution that depends on their respective accumulator widths. Both designs require a

preprocessing upsampling stage to assure adequate stop-band performance. The dual

polyphase design is the slightly more attractive of the two with fewer components that

must operate at the full output rate. The Farrow-based design requires a long chain of

adders and multipliers that evaluate output polynomials at the full output rate of the

system. While the chain can be pipelined to achieve good speed performance, it would

still require more full-rate resources than the dual polyphase technique.

The dual polyphase design exploits the linear-time-variant nature of polyphase fil-

ters (and resamplers in general). A rate-N polyphase upsampler can produce N ver-

sions of its output, depending on where the commutator is located at a particular time

instant. The clever concept of this design is that, given two polyphase upsampling

filters that process the same input, two phases (versions of the output signal) can be

generated at the output by having one of the commutator arms lag the other in the

adjacent position. In this design, even if δ > 1 and the commutators hop and skip

locations, the two filters produce two adjacent phases in the phase space. The frac-

tional component of the accumulator is then used for linear interpolation between the

two polyphase filter outputs. The linear interpolation upsamples the output signal of

the polyphase upsamplers while attenuating the spectral duplicates that are generated.

The process is equivalent to upsampling the signal by a variable amount and convolving

the result with a variable-width triangle pulse. The Fourier transform of the triangle

pulse is a squared sinc that has its spectral nulls aligned with the spectral duplicates

produced by the upsampling. If the output of the polyphase upsampler is oversampled

by a large enough factor, the spectral duplicates will be narrow enough to allow most of

the unwanted spectral content to reside deep within each null of the squared sinc func-

tion. This concept was also seen in the design of the Farrow-based arbitrary upsampler

in Fig. 51. If the incoming signal is first oversampled by 4x and then upsampled by a factor of 32x using the dual polyphase structure, the spectral duplicates will be narrow enough to be attenuated below the -96 dB stopband target.
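A behavioral sketch of the accumulator and the linear interpolator follows. It is a simplification under stated assumptions: the output of the polyphase upsampler is taken to be already available as a dense sequence, so the two adjacent "phases" are simply neighboring samples; the accumulator is held in floating point rather than in a fixed-width register; and the assignment of the weights α and 1-α to the leading and lagging outputs is assumed.

```python
def linear_resample(dense, delta, n_out):
    """Step through `dense` in increments of delta, interpolating between neighbors."""
    acc = 0.0
    out = []
    for _ in range(n_out):
        k = int(acc)                 # integer part: commutator position
        alpha = acc - k              # fractional part: interpolation weight
        if k + 1 >= len(dense):
            break                    # ran out of input samples
        out.append((1.0 - alpha) * dense[k] + alpha * dense[k + 1])
        acc += delta                 # delta < 1 upsamples, delta > 1 skips positions
    return out


# A ramp makes the interpolation easy to verify by eye.
ramp = [float(i) for i in range(8)]
print(linear_resample(ramp, 0.25, 8))
```

With a step of 0.25 on a ramp input, the interpolated outputs are spaced 0.25 apart, which is the expected behavior of a further 4x rate increase applied to the dense sequence.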

To design the arbitrary resampler, the two fixed-rate upsampling components are

designed for maximum efficiency. Several design options exist for the preprocessing

filter. A single-stage polyphase FIR structure requires a larger coefficient ROM than the

functionally similar upsampler that is split into cascaded stages of rate-2 components,

i.e. Fig. 21. This type of design uses half-band FIR filters, which only require one

quarter of their coefficients to be stored in ROM but may use more hardware multipliers

in an FPGA/ASIC implementation as a result of the split to two stages. Similar to the

previous option, the third option uses two stages of half-band polyphase IIR filters that

are designed for linear phase. The IIR filters are constructed for efficiency using all-

pass second-order sections, and are especially well-suited for software implementation.

The IIR design uses very few coefficients and has a very low workload.

The design of the filter will compare both FIR and IIR half-band designs. The over-

lay and cascaded response of the designed Jakes and resampling filters are shown in

Fig. 61. The two filters require 18 and 5 coefficients for the first and second stages,

respectively. The passband ripple of the cascaded response is 200 µdB using floating-

point coefficients, and the workload of the filter is only 5.75 MACCs/output.

Each 2x upsampling component in the cascade can be implemented using the struc-

ture shown in Fig. 62. The implementation structure exploits the half-band polyphase

structure, which has one of its polyphase arms collapsed to a delay-line. In the single

MACC implementation, the input shift register is doubly used as the delay-line for the

polyphase arm that contains only zero coefficients. The output MUX acts as a virtual

commutator, passing one element from the MACC operation and then a subsequent

element from the end of the input shift register.
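The alternating MACC/delay behavior can be sketched in a few lines. The 7-tap half-band filter used here is a textbook example chosen for illustration, not one of the designed filters, and the position of the delay tap within the shift register follows from that example filter's length; both are assumptions.

```python
def halfband_upsample_2x(x, macc_taps=(-1 / 16, 9 / 16, 9 / 16, -1 / 16), delay_tap=1):
    """Rate-2 half-band upsampler: one MACC arm and one pure-delay arm."""
    shift_reg = [0.0] * len(macc_taps)
    out = []
    for sample in x:
        shift_reg = [sample] + shift_reg[:-1]            # clock in one input sample
        macc = sum(c * r for c, r in zip(macc_taps, shift_reg))
        out.append(macc)                                 # MUX position 0: MACC arm
        out.append(shift_reg[delay_tap])                 # MUX position 1: delay arm
    return out


print(halfband_upsample_2x([1.0, 0.0, 0.0, 0.0]))
```

The delay arm reads directly from the same shift register that feeds the MACC, so the arm containing the single non-zero center coefficient costs no multiplications.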

[Figure 61 plots. Panels: "Jakes Overlay with Cascaded Rate-2 Half-Band FIR Filters" and "Cascaded Response". Axes: magnitude (dB) versus normalized output frequency (× fs).]

Figure 61: Dyadic Half-Band 4x FIR Upsampler - Overlaid and Cascaded Frequency Response

A special class of two-path (half-band) linear-phase IIR upsampling filters can be

implemented using a cascade of all-pass sections [25, 65–67]. Each half-band filter is

used as a rate-2 upsampler, capable of forming the same general structure as the dyadic

FIR half-band structure to raise the sampling rate of a signal by a power of 2. The IIR

half-band filter is constructed using cascaded type-1 and type-2 sections, each shown

in Fig. 63. The linear phase constraint increases the number of coefficients necessary to

achieve the desired frequency response features relative to a non-linear phase design

but still maintains a very low overall workload. The general structure of the linear

phase IIR upsampler is shown in Fig. 64. The design requires very little coefficient

storage, although the coefficients must have very high precision to maintain good per-

formance and stability. The resulting frequency response of the designed IIR cascade

is shown in Fig. 65, exhibiting near-equal performance to the FIR version. The coefficient sets of each filter in the IIR design are listed in Tbl. 6. The design could be greatly

simplified if the Jakes filter had slightly more excess bandwidth. The transition band

of the half-band IIR design is not symmetric about fs/4 as it is with the FIR version,

requiring the transition band of the first stage to be very narrow in order to adequately

suppress the spectral duplicates, increasing the overall complexity. Minimizing the ex-

[Figure 62 block diagram. Labeled blocks and signals: inputs c1[n/2], c2[n/2], ..., c2p-1[n/2], c2p[n/2]; Shift Reg 1 to Shift Reg 2p; Coefficient ROM (ROM addr, reg addr, comm. addr); accumulator register (z-1) with reset; FIFO 1 to FIFO 2p; outputs c1[n], c2[n], ..., c2p-1[n], c2p[n].]

Figure 62: Dyadic Half-Band 2x FIR Upsampler - Implementation Structure

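[Figure 63 signal-flow diagrams: a type-1 second-order all-pass section G(Z) and a type-2 second-order all-pass section H(Z), built from z^-M delays, coefficients α1 and α2, and subtracting adders.]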

Figure 63: Second Order Type-1 and Type-2 All-Pass Sections

[Figure 64 block diagram. Two cascaded 1:2 upsampling stages. Labeled elements: delays z^-4 and z^-12; all-pass sections G(Z) and H(Z); stage coefficients b1 to b4 and a1 to a12.]

Figure 64: An Example of Cascaded Half-Band IIR Upsamplers Constructed Using Cascaded 2nd-Order All-Pass Sections

[Figure 65 plots. Panels: "Jakes Overlay with Cascaded Rate-2 Linear-Phase Half-Band IIR Filters" and "Cascaded Response". Axes: magnitude (dB) versus normalized frequency from -0.5 to 0.5.]

Figure 65: Dyadic Half-Band Linear Phase 4x IIR Upsampler - Overlaid and Cascaded Frequency Response

Filter A                        Filter B
a1     0.842338371106072        b1     0.6136370164613714
a2    -0.406026593274168        b2    -0.1198054373949318
a3     0.836552788992820        b3    -0.03821849743531386
a4     0.256408499490701        b4     0.02196375889154819
a5     0.397589014368451
a6     0.200469375895506
a7    -0.0357235445134671
a8     0.179596315064722
a9    -0.430096176797810
a10    0.170254497049323
a11   -0.710357440120546
a12    0.166067239169935

Table 6: Coefficients of Designed Dyadic Linear Phase Half-Band IIR Upsampler

       coeff. storage    workload          passband ripple
       (samples)         (mults/output)    (dB)
FIR    23                5.75              200 µ
IIR    16                4                 1 n

Table 7: IIR and FIR Interpolation Performance Comparison

cess bandwidth in the Jakes filter in turn reduces the amount of upsampling required

to achieve the final rate of the channel processing component.

The final workload and coefficient storage breakdown for both filter designs is

shown in Tbl. 7. Given the same task, the half-band IIR filter performs it using fewer

multiplications and coefficients. However, the structure of the implemented IIR design

is not particularly well-suited for FPGAs/ASICs or (low precision) fixed-point implementation. If the implementation is in computer software, the half-band IIR method is the better of the two analyzed designs, offering superior performance, a lower workload, and fewer coefficients.

Next, the dual polyphase structure responsible for upsampling by 32x with dual

commutators must be designed. The design of the prototype filter can exploit the al-

ready oversampled nature of its incoming signal. The spectral duplicates produced by

the 32x upsampling are very narrow and are spaced far apart from each other. The pro-

totype filter only needs to suppress the segments of spectrum occupied by the spectral

duplicates, thus allowing a great simplification of the filter and reduction in workload.

Using the Remez filter design algorithm, only the frequency regions that require atten-

uation are included in the stop-band constraints. The remaining regions are considered

“don’t care” regions, allowing the size of the coefficient set to be dramatically reduced.

The response of the cascaded Jakes and the preprocessing FIR filter is overlaid with the

32x upsampling prototype filter in Fig. 66, magnified along the frequency axis for clarity. The designed prototype filter is symmetric and has a 192-tap impulse response,

requiring the storage of only 96 coefficients. The designed filter could be simplified if

spectral droop were tolerable in the main passband lobe.

As described in the literature, the dual polyphase design is capable of arbitrary re-

sampling, which permits any δ > 0. In this case, it is possible that the pair of commuta-

tor arms can skip locations as indicated by the integer component of the δ accumulator

when δ > 1. To implement the resampling structure, the shift register length required

by the prototype filter is extended by one element. This element will provide the nec-

essary “overhang” as the commutators wrap around and are simultaneously positioned

[Figure 66 plots. Panels: "Rate-32 FIR Prototype Filter Response Overlaid with Oversampled Jakes" and "Cascaded Response with Rate-32 FIR Prototype Filter". Axes: magnitude (dB) versus normalized frequency from -0.1 to 0.1.]

Figure 66: Prototype Filter for 32x Upsampler: Exploiting the Oversampled Input Signal (5x magnification)

at the first and last position in the polyphase structure. In this case, the two filter seg-

ments must process different sets of samples, one including the old “overhang” sample,

and the other the newly captured sample. Each stage in the traversal of the pair of

commutators across the polyphase structure is shown in Fig. 67. The workload of this

filter in this application is still low enough that the implemented structure only re-

quires a single MACC element for each commutator output, indicated by the top half of

Fig. 68. The integer portion of the accumulator k depicted in Fig. 60 is used to control

the virtual commutator address in the coefficient ROM. If it is guaranteed that δ ≤ 1,

the commutator pair never skips locations and the lagging commutator arm will always

produce a delayed copy of the other, therefore it can be replaced with a simple register

that stores the output of the commutator arm for the next iteration. This simplification

eliminates the entire processing structure of one of the polyphase filters.

The bottom half of Fig. 68 shows the variable linear interpolation components with

this simplification. Fig. 68 now shows the entire implementation that uses a single

MACC element for all 2p upsamplers. The linear interpolation components operate at

the final channel processing rate; 2p of them are needed for the entire design.

The value of the accumulator input δ determines the final upsampling factor. To

determine the desired value of δ for this implementation, the following equation can

[Figure 67 diagrams. Four traversal states of the commutator pair ("+ arm" and "- arm") across polyphase branches h0(z), h1(z), ..., hN-2(z), hN-1(z), fed from the extended shift register. Labeled inputs and outputs: x[0] with y[0], y*[0]; y[1], y*[1]; y[N-1], y*[N-1]; x[1] with y[N], y*[N].]

Figure 67: Dual Polyphase Filter Arbitrary-Ratio Resampler: Dual Commutator Traversal States with Extended Shift Register Positioning

be used

$$ \delta = \frac{256\, f_{max}}{f_d\, f_s}, \tag{112} $$

where fs is the operating rate of the channel processing component. If the parameters

fs = 100 MHz, fd = .8, fmax = 100 Hz are selected, δ = 3.2 × 10^−4. Given a value of

δ, the upsampling factor of the system, including the 4x preprocessing stage, can be

determined by

$$ M = \frac{128}{\delta}. \tag{113} $$

Therefore, for δ = 3.2 × 10^−4, M = 4 × 10^5.
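The arithmetic of Eqs. 112 and 113 can be reproduced with a few lines of code. The function names are hypothetical; the parameter values are those of the design example.

```python
def accumulator_step(fmax_hz, fd, fs_hz):
    """Eq. 112: accumulator input delta for a desired maximum Doppler frequency."""
    return 256.0 * fmax_hz / (fd * fs_hz)


def upsampling_factor(delta):
    """Eq. 113: overall upsampling factor, including the 4x preprocessing stage."""
    return 128.0 / delta


delta = accumulator_step(fmax_hz=100.0, fd=0.8, fs_hz=100e6)
M = upsampling_factor(delta)
print(f"delta = {delta:.3e}, M = {M:.0f}")

# Implied sampling rate of the Jakes process generator: fs / M.
jakes_rate_hz = 100e6 / M
```

The implied Jakes generator rate, fs/M = 250 Hz, agrees with fd = 2 fmax/fs from Eq. 111 evaluated at fmax = 100 Hz and fd = .8.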

An example shown in Fig. 69 illustrates the system’s cascaded frequency response

with δ = .0225, giving an inconvenient upsampling factor of approximately 5688.89. After

a 500x magnification, the bottom half of Fig. 69 shows the Jakes response. Both plots

in Fig. 69 show that the spectral duplicates have been more than adequately attenuated

[Figure 68 block diagram. Labeled blocks and signals: input c[n/4]; Shift Reg 1 to Shift Reg 2p; Coefficient ROM (reg addr, k); Addr. Generator; accumulator register (z-1) with reset; FIFO 1 to FIFO 2p; c[n/128]; per-stream linear interpolators with z-1 register, enable (en), and weights α and 1-α, mapping c1[n/128] to c1[δn/128] through c2p[n/128] to c2p[δn/128].]

Figure 68: Arbitrary-Ratio Upsampler: Rate-32 Polyphase Upsampling with Linear Interpolators

[Figure 69 plots. Panels: "Full System Frequency Response: δ=0.0225, M=5688.8889" and "Full System Frequency Response (500x zoom)". Axes: magnitude (dB) versus normalized frequency.]

Figure 69: Frequency Response of the Cascaded Jakes and Arbitrary-Ratio Resampler: δ = 0.0225

                      coeff. storage    workload
                      (elements)        (MACCs/output)
Jakes                 129               256
Rate-4 FIR cascade    18                5.75
Rate-32 dual FIR      97                6
Linear Interp         0                 2
total                 244               2 + (267.75/128) δ

Table 8: Workload and Coefficient Storage Breakdown for a Single Variable-Rate Channel Coefficient Generator

below the desired -96 dB level.

The overall workload analysis of the design can be generalized using δ to determine

the number of MACCs per output for the entire system. The workload breakdown is

shown in Tbl. 8. The total workload can be found using

$$ W_{total} = 2 + \frac{267.75}{128}\,\delta \tag{114} $$

Smaller δ reduces the rate of the upstream components, so large upsampling factors

reduce the overall workload. Using the workload total and the design example value

of δ = 3.2 × 10^−4, the resulting workload is a meager 2.000669 MACCs/output for each filter. Using 2p filters for p = 9 complex channel paths, the workload increases to only 2pWtotal = 36.012 MACCs/vector. In this system configuration, with a channel component operating at a sampling rate of 100 MHz, the channel coefficient generator must perform a mere 3.6012 GMACCs/s.
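The workload figures quoted above follow directly from Eq. 114, as the short calculation below shows. The variable names are illustrative; the numerical inputs are those of the design example.

```python
def workload_per_output(delta):
    """Eq. 114: MACCs per output sample for one real-valued coefficient stream."""
    return 2.0 + (267.75 / 128.0) * delta


delta = 3.2e-4      # accumulator input from the design example
p = 9               # number of complex channel paths
fs_hz = 100e6       # channel processing rate

w_total = workload_per_output(delta)       # MACCs/output for one filter
w_vector = 2 * p * w_total                 # MACCs per output vector (2p real streams)
macc_rate = w_vector * fs_hz               # MACCs per second

print(f"{w_total:.6f} MACCs/output, {w_vector:.3f} MACCs/vector, "
      f"{macc_rate / 1e9:.4f} GMACCs/s")
```

The constant term of 2 MACCs/output comes from the linear interpolators, the only components that run at the full output rate; everything upstream is scaled by δ/128.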

7.2 Real-Time Multi-Path MIMO Channel Emulation

The extension of the established SISO architecture to MIMO is relatively straightfor-

ward. The LTE specification has chosen the Kronecker model for the conformance

testing regimen of its MIMO-capable functions. This section will show the extension of

the developed SISO model to the Kronecker MIMO model based on the design in [63].

To establish a conceptual framework, a generic MIMO system is introduced. The

number of transmit and receive antennas in the MIMO system are indicated by M and

N , respectively. The M × 1 transmitted symbol vector x passes through the channel,

modeled by the multiplication of x with the N × M matrix of complex channel path

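[Figure 70 block diagram. Input x[t] feeds a tapped delay line with delays Δt2, Δt3, ..., Δtp; the taps are weighted by channel matrices H1[t], H2[t], ..., Hp[t] and summed (Σ) to form y[t].]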

Figure 70: Tapped Delay Line MIMO Channel Model

coefficients H, the channel matrix, to form the received N × 1 symbol vector y.

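$$ \mathbf{y} = \mathbf{H}\mathbf{x} $$

$$ \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} = \begin{bmatrix} h_{1,1} & h_{1,2} & \cdots & h_{1,M} \\ h_{2,1} & h_{2,2} & \cdots & h_{2,M} \\ \vdots & \vdots & \ddots & \vdots \\ h_{N,1} & h_{N,2} & \cdots & h_{N,M} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_M \end{bmatrix} \tag{115} $$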

The multiplication between H and x models the passage of the M transmitted symbols traveling through NM paths in the channel, which are then summed accordingly at the

receiver’s N antennas. To add the capability of the model to describe channels with

mobility, time indices are added to each element in Eq. 115, consequently adding time

indices to the individual elements that comprise both vectors and the channel matrix.

$$ \mathbf{y}[t] = \mathbf{H}[t]\,\mathbf{x}[t] \tag{116} $$

For the multi-path case, the channel can be modeled by

$$ A(t,\tau) = \sum_{i=1}^{p} \mathbf{H}_i[t]\,\delta\left(\tau - \tau_i\right), \tag{117} $$

and pictorially described by the structure shown in Fig. 70. Each channel path now has

its own matrix of complex weights. To keep the model simple, it has been assumed that

the paths that make up the elements in each channel matrix share the same p × 1 delay vector τ = [τ1 = 0, τ2, ..., τp]^T, where each element is a non-negative continuous-time delay.
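The tapped delay line of Eq. 117 can be sketched as a sum of delayed, matrix-weighted copies of the input vector. The sketch below assumes delays that are integer multiples of the sample period (the text allows continuous-time delays) and a dense array holding every channel matrix at every sample; the function name tdl_mimo and the array layout are illustrative choices.

```python
import numpy as np


def tdl_mimo(x, H_taps, delays):
    """y[t] = sum_i H_i[t] x[t - d_i] for integer sample delays d_i.

    x:      (M, T) transmitted samples
    H_taps: (p, T, N, M) channel matrix per path and per sample
    delays: p non-negative integer delays, in samples
    """
    p, T, N, M = H_taps.shape
    y = np.zeros((N, T), dtype=complex)
    for H_i, d in zip(H_taps, delays):
        for t in range(d, T):
            y[:, t] += H_i[t] @ x[:, t - d]
    return y


# Two static paths: a direct path and a half-amplitude echo two samples later.
T = 6
x = np.zeros((2, T), dtype=complex)
x[0, 0] = 1.0                                   # impulse on transmit antenna 1
H_taps = np.zeros((2, T, 2, 2), dtype=complex)
H_taps[0, :] = np.eye(2)
H_taps[1, :] = 0.5 * np.eye(2)
y = tdl_mimo(x, H_taps, delays=[0, 2])
```

For the impulse input, the received sequence on the first antenna shows the direct path at t = 0 and the attenuated echo at t = 2, which is the discrete-time picture of Fig. 70.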

The capacity of the SISO channel, i.e. M = N = 1 is given by [68]

$$ C = \log_2\left(1 + \rho\,|h|^{2}\right) \ \text{b/s/Hz} \tag{118} $$

where h is the normalized complex gain of a stationary channel, or of the instantaneous realization of a mobile channel, and ρ is the SNR at the receiver. In a MIMO system, the channel’s capacity is defined by

$$ C_{EP} = \log_2\left[\det\left(\mathbf{I}_M + \frac{\rho}{N}\,\mathbf{H}\mathbf{H}^{H}\right)\right] \ \text{b/s/Hz} \tag{119} $$

assuming equal power (EP) uncorrelated sources [68, 69]. The channel’s capacity

grows linearly with min (M , N). The determinant operator yields a product of min (M , N)

non-zero eigenvalues, which are determined by the properties of the channel matrix.

Each eigenvalue characterizes the SNR over the channel “eigenmode”. The channel’s

capacity is determined by the sum of the capacities of each individual eigenmode,

therefore the capacity increases with additional antennas and the spatial properties of

the channel [68]. Intuitively, an orthonormal channel matrix maximizes the channel’s

capacity.
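The effect of the channel matrix structure on capacity can be checked numerically. The sketch below evaluates the equal-power capacity for a given H and SNR; it sizes the identity matrix to match HH^H and divides the SNR by the number of columns of H, which coincides with the subscripts in Eq. 119 when M = N. The function name and the two test matrices are illustrative assumptions.

```python
import numpy as np


def capacity_equal_power(H, snr):
    """Equal-power capacity in b/s/Hz for channel matrix H (rows: receive antennas)."""
    H = np.asarray(H, dtype=complex)
    n_rx, n_tx = H.shape
    gram = H @ H.conj().T
    _, logdet = np.linalg.slogdet(np.eye(n_rx) + (snr / n_tx) * gram)
    return float(logdet / np.log(2.0))


snr = 10.0
c_orthonormal = capacity_equal_power(np.eye(2), snr)                    # two equal eigenmodes
c_rank_one = capacity_equal_power(np.ones((2, 2)) / np.sqrt(2.0), snr)  # one eigenmode
c_siso = capacity_equal_power([[0.5 + 0.5j]], 4.0)                      # reduces to Eq. 118

print(c_orthonormal, c_rank_one, c_siso)
```

Both 2 × 2 matrices carry the same total power, yet the orthonormal matrix supports two eigenmodes while the fully correlated one supports only a single eigenmode, so its capacity is lower.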

In a mobile channel, the elements of the H matrix become time-varying, and HH^H in Eq. 119 is replaced with E{HH^H}, which gives the expected channel capacity as determined by the statistics of the channel matrix. Clearly, the statistics of the

time-varying behavior of the channel matrix have a profound impact on the emulated

system.

In the special case where each element in H is i.i.d., each “sub-channel” is spatially

independent, or uncorrelated, maximizing capacity by diagonalizing the expected value

of the HH^H term and maximizing the value produced by the determinant in Eq. 119. The i.i.d. assumption can still hold using Jakes processes in each subchannel [70], which models the Doppler spectrum and temporal correlation properties of a typical channel scenario with mobility. Each Jakes process can be generated in real-time

using the Jakes process generator developed in Sec. 7.1.

In a realistic mobile scenario, the antennas on the mobile device are confined to a

small area, restricting the spacing between each antenna and limiting the achievable

spatial diversity. The base station antenna array is less restricted for space and can offer

better spatial diversity. The spatial characteristics of the channel and the properties of

the two antenna arrays introduce spatial correlation, revoking the ideal i.i.d. aspect of

the elements in H. With the i.i.d. assumption removed, the expected value of the HH^H term acquires non-zero off-diagonal elements, decreasing the eigenvalues and decreasing

the channel’s capacity from its maximum.

Establishing an accurate model for some common antenna configurations and chan-

nel conditions allows the spatial correlation properties to be user-selectable, just as the

Jakes model allows the user to select the temporal correlation by providing the velocity

of the mobile device and the carrier frequency. One popular model that allows this is

the Kronecker model, first introduced and verified in [70–72]. However, this model has

been claimed to be too “simplistic” and can be invalid for some special cases [73, 74].

The method presented in [74] adds a small modification to the Kronecker model that

significantly increases the accuracy of the modeled channel capacity in the many illus-

trated cases where the Kronecker model fails.

Despite the claimed inaccuracies, presently, the Kronecker model is very relevant;

the simple model is used in the LTE standard [42] among others for conformance

testing, making it desirable in the test equipment marketplace. Here, the Kronecker

model will be introduced along with its architecture as implemented and tested in

FPGA hardware in real-time.

The Kronecker model formulation starts by lumping the correlation properties of

the two antenna arrays and their local spatial features into correlation matrices. These

matrices represent the spatial statistics that result from antenna spacing, radiation pat-

tern, and the local scattering environment. According to [74], the two correlation

matrices accurately model the effects of the channel scatterers clustered around the

link ends, or antenna arrays, without considering any scatterers in between. This is an

accurate model for some cases in a cellular system. The base station antennas are typically surrounded by clutter from their antenna mast, while the mobile device is frequently

located in a clutter-filled environment such as a building or a vehicle. In this scenario,

assuming the base station and the mobile device are separated by a sparse scattering

environment such as free space, the assumptions made by the Kronecker model seem

conceivably valid.

To conceptualize the fundamental theory of the modeling as described by [68], the

power azimuth distribution function p (θ) is introduced, which defines the distribution

of scatterers in azimuth angle θ as seen by the base station, where θ ∈ [Θ−∆,Θ+∆],

Θ and ∆ indicating the angle of arrival at the receiver and the angle spread, respec-

tively. The angle spread is affected by the relative height between the base station and

the mobile device. The base station antennas are usually elevated above the mobile

device on an antenna mast. The angle of arrival is the angle at which the transmitted

signal energy arrives w.r.t. broadside at the receive antenna array. Given p (θ) and

using the notation in [68], the spatial correlation between the paths from receive and

transmit antennas Rn and Tm and Rn and Tm′ can be found using

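$$ \Psi\left(R_n T_m, R_n T_{m'}\right) = \int_{\Theta-\Delta}^{\Theta+\Delta} p(\theta)\, \exp\left( j\,\frac{2\pi \sin(\theta)}{\lambda}\, D\left(T_m, T_{m'}\right) \right) d\theta, \tag{120} $$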

where D(Tm, Tm′) indicates the distance between antennas Tm and Tm′, n ∈ [1, 2, ..., N], m ∈ [1, 2, ..., M], and the prime simply indicates n ≠ n′ and m ≠ m′. The Kronecker model approximates the correlation matrix Ψ by performing the Kronecker

product between the local transmit and receive correlation matrices Ψ_TX and Ψ_RX.

$$ \mathbf{\Psi} \approx \mathbf{\Psi}_{RX} \otimes \mathbf{\Psi}_{TX} \tag{121} $$

The statements regarding the Kronecker model inaccuracies in [74] are now more in-

tuitive with the added context. The model only considers the spatial channel features

immediately surrounding each antenna array that make up the correlation matrices, in-

dependently of the other, and independently of the channel features that exist between

the two arrays.

To apply the desired spatial correlation properties to the matrix of i.i.d. channel

coefficient processes, the Ψ matrix is first decomposed using Cholesky decomposition,

resulting in the product of the all-real lower-triangular C matrix and its transpose.

$$ \mathbf{\Psi} = \mathbf{C}\mathbf{C}^{T} \tag{122} $$

The C matrix is then multiplied by j, the vectorized matrix of i.i.d. complex Jakes

[Figure 71 block diagram. Processing chain: WGN Generator; Temporal Correlation (Jakes Doppler Model), producing J[n]; Spatial Correlation (Kronecker model), producing H[n]; Variable Rate Transition, producing H[δn/128].]

Figure 71: Channel Matrix Generator System Diagram

processes, generating the desired correlation properties between the elements.

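$$ \mathrm{Vec}\left\{\mathbf{H}[t]\right\} = \mathbf{C}\,\mathrm{Vec}\left\{\mathbf{J}[t]\right\} $$

$$ \begin{bmatrix} h_{1,1}[t] \\ \vdots \\ h_{1,N}[t] \\ h_{2,1}[t] \\ \vdots \\ h_{2,N}[t] \\ \vdots \\ h_{M,1}[t] \\ \vdots \\ h_{M,N}[t] \end{bmatrix} = \begin{bmatrix} c_{1,1} & 0 & \cdots & 0 \\ c_{2,1} & c_{2,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & 0 \\ c_{MN,1} & c_{MN,2} & \cdots & c_{MN,MN} \end{bmatrix} \begin{bmatrix} j_{1,1}[t] \\ \vdots \\ j_{1,N}[t] \\ j_{2,1}[t] \\ \vdots \\ j_{2,N}[t] \\ \vdots \\ j_{M,1}[t] \\ \vdots \\ j_{M,N}[t] \end{bmatrix} \tag{123} $$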

This operation is shown in Fig. 16 of [70] along with an extensive proof of the resulting

correlation properties of H.
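The chain of Eqs. 121 to 123 can be exercised numerically with NumPy. This is a sketch under stated assumptions: the 2 × 2 transmit and receive correlation matrices are made-up real-valued examples, white complex Gaussian samples stand in for the i.i.d. Jakes processes (only the spatial correlation is being checked), and the ordering of the vectorized elements is assumed to match the ordering of the Kronecker product.

```python
import numpy as np


def correlation_factor(psi_rx, psi_tx):
    """Eq. 121 followed by Eq. 122: Kronecker product, then Cholesky factor."""
    psi = np.kron(psi_rx, psi_tx)
    return psi, np.linalg.cholesky(psi)     # lower-triangular C with C C^T = psi


def correlate(C, J_vec):
    """Eq. 123: apply the spatial correlation to vectorized i.i.d. processes."""
    return C @ J_vec


# Hypothetical correlation matrices for a 2x2 system.
psi_rx = np.array([[1.0, 0.9], [0.9, 1.0]])
psi_tx = np.array([[1.0, 0.3], [0.3, 1.0]])
psi, C = correlation_factor(psi_rx, psi_tx)

# Unit-variance complex white noise as a stand-in for the i.i.d. Jakes processes.
rng = np.random.default_rng(0)
n_samples = 200_000
J_vec = (rng.standard_normal((4, n_samples))
         + 1j * rng.standard_normal((4, n_samples))) / np.sqrt(2.0)

H_vec = correlate(C, J_vec)
estimated_psi = (H_vec @ H_vec.conj().T) / n_samples
```

The sample correlation matrix estimated_psi approaches psi as the number of samples grows, which is the property that the multiplication by C is intended to impose.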

According to Eq. 123, the only modification required to generate spatiotemporally

correlated channel matrix coefficients is the multiplication of the i.i.d. Jakes processes

with the C matrix. In this architecture, the C matrix is constant, and programmable by

the user. The new channel matrix generation flow diagram is shown in Fig. 71. In this

architecture, the Jakes processes and spatial correlation components reside in the low-

rate end of the resampler, reducing complexity tremendously. The MIMO design adds

2 (MN − 1) additional Jakes processes along with the correlation matrix operation to

the existing SISO architecture. The real-valued, lower-triangular nature of C eliminates more than half of the computations that a fully populated complex-valued matrix would require. The implemented structure shown in Fig. 72 operates at

114


Jn,m[n] ...

FIFO 1

FIFO 2

FIFO 2NM-1

FIFO 2NM

Hn,m[n]z-1

reset

...

Cn,m...

matrix addr

vector addr

...

Figure 72: Hardware Matrix Multiplication Operation for Correlating i.i.d. Jakes Pro-cesses

a low enough rate that allows a single MACC element to perform the necessary oper-

ations. Replicating the full structure shown in Fig. 71 and connecting it to the MIMO

channel processing component in Fig. 70 completes the full MIMO channel emulator.

Each component shown in the SISO generator now must process 2N M p streams.

7.3 Implementation in FPGA Hardware

An M = N = 2, p = 1 structure has been implemented and tested in an X5-400M

XMC card made by Innovative Integration that features a Xilinx Virtex5 SX95T FPGA

and pairs of high-speed ADCs and DACs. The implemented design operates at fs =

200 MHz. The high sampling rate provides the capability of processing wide bandwidth

signals, suitable for the LTE and LTE-Advanced downlink.

Having fs = 200 MHz allows very fine 5 ns delay tap resolution in the channel

processing structure at little extra cost in the upsampling components in the channel

matrix generators. Instead of performing summed sinc interpolation to correlate ad-

jacent channel taps (i.e. Eq. 8), and allowing fractional tap delays while running at a

much lower rate, as shown in the architecture presented in [44] and used in MATLAB’s

“rayleighchan” and “mimochan” functions, the sampling rate is set to a frequency that

provides more than adequate tap delay resolution for any of the models in the LTE

conformance tests and other widely used models (ITU, COST, etc.). Adding a summed

sinc interpolation component vastly increases the complexity of the channel processing

structure at the benefit of reduced upsampling ratios in the channel matrix generators.

Interestingly, the workload of the implemented architecture decreases with increasing
upsampling ratios (Eq. 114); running the design at the highest possible rate is therefore
beneficial in more ways than one.

Figure 73: Test Configuration of the Implemented Channel Emulator (labels: Host Computer: ePC; PCI Express Interface and Signal Generation/Capture Software; PCIe 2.0; X5-400M FPGA Hardware; Channel Matrix Generator; Test Stimulus; WGN; H; x; y; select)

The X5-400M features a high-speed PCI-express 2.0 soft core in its FPGA, allow-

ing the host computer to stream data to and from the FPGA at hundreds of MBytes/s,

providing an ideal testbed for algorithm development and verification, shown in de-

tail in Fig. 73. The 6-7 orders of magnitude disparity between the sampling clock and

the Doppler frequencies make verifying this design very difficult in simulation. Syn-

thesizing the design into FPGA hardware and providing hardware and software-based

stimulus allows verification to be performed in approximately half-time (near real-

time). The PCI-express and the hard disk of the host computer are the bottleneck in

the system in this configuration. After testing, the data streams can be switched from

the PCI-express to the on-board data converters, which sustain a throughput stream of

1.4901 GBytes/s when processing signals in real-time at the full fs = 200 MHz.
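As a quick sanity check of the quoted figures, the arithmetic below shows one way the numbers can be reproduced. It is only a sketch under stated assumptions: the converter payload is taken to be 1.6e9 bytes per second (for instance two real channels at 400 MS/s with 2 bytes per sample), and "GBytes" is interpreted as a binary gigabyte (2^30 bytes). Neither assumption is stated explicitly in the text.

```python
# Tap delay resolution at the processing rate fs = 200 MHz.
fs_hz = 200e6
tap_resolution_ns = 1e9 / fs_hz            # 5 ns per delay tap

# Assumed converter payload: 2 channels x 400 MS/s x 2 bytes per sample.
channels = 2
sample_rate_hz = 400e6
bytes_per_sample = 2                       # assumption, not from the text
bytes_per_second = channels * sample_rate_hz * bytes_per_sample

# Interpreting "GBytes" as 2**30 bytes (assumption).
gbytes_per_second = bytes_per_second / 2**30
print(f"{tap_resolution_ns:.1f} ns, {gbytes_per_second:.4f} GBytes/s")
```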

While the X5-400M only features pairs of ADCs and DACs, 2× 2 MIMO operation

can take place if the input signals are all-real and centered at 100 MHz while being

sampled at fs = 400 MHz. Once sampled, the real signals are processed by a Hilbert

transform operation and frequency translation, converting the real signals into complex

baseband, each sampled at fs = 200 MHz. After being processed by the MIMO emu-

lator, the opposite operation is performed before DA conversion. This process can be

efficiently performed using polyphase heterodyned halfband Hilbert transformers [25],

which will not be discussed here.

The implemented design features a 22-bit accumulator width in the arbitrary-ratio

upsampler, allowing the user to select the Doppler frequency with 0.149 Hz resolution.

To test the Jakes generators, the WGN source is bypassed and a single test matrix
$\mathbf{W} = \begin{bmatrix} 1 - j1 & 1 - j1 \\ 1 - j1 & 1 - j1 \end{bmatrix}$
is sent through the Jakes generation filter cascade. Meanwhile, C = I is programmed into
the Kronecker model, passing the Jakes processes without modification, while a constant
$\mathbf{x} = \begin{bmatrix} 1 + j1 \\ 1 + j1 \end{bmatrix}$ is transmitted into the
channel processing component. This test reveals the full-scale impulse response of
the Jakes filter in the real components of both receive antenna outputs. Conjugating
x places the full-scale output in the imaginary components of each receive antenna.
Finally, zeroing individual rows of W while conjugating x isolates the individual real
and imaginary components in each element of the output y vector. This procedure
verifies that each Jakes filter cascade is performing as expected and allows the user to
observe the variation of the Jakes impulse response as the Doppler is varied.
The impulse response lengthens and shortens with the Doppler setting.

Figure 74: Hardware-Sourced Jakes Impulse Response from MIMO Emulator
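The logic of this stimulus can be checked with a static stand-in for the channel matrix. The sketch below replaces the time-varying Jakes filter output by the constant test matrix itself, so it only illustrates why the chosen W and x steer the output into the real or imaginary part; the variable names are hypothetical.

```python
import numpy as np

W = np.full((2, 2), 1 - 1j)       # test matrix, used here directly as the channel matrix
x = np.full(2, 1 + 1j)            # constant transmit vector

y = W @ x                         # (1 - j)(1 + j) = 2 per path, two paths per output
y_conj = W @ np.conj(x)           # (1 - j)(1 - j) = -2j per path

W_row_zeroed = W.copy()
W_row_zeroed[0, :] = 0            # zero the row feeding the first receive antenna
y_isolated = W_row_zeroed @ np.conj(x)

print(y, y_conj, y_isolated)
```

With the unmodified x the output is purely real, with the conjugated x it is purely imaginary, and zeroing a row of W silences the corresponding output element.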

The test is performed using hardware-sourced test stimulus from a small ROM located
in the FPGA, as shown in Fig. 73. The hardware-sourced test results in Fig. 74
show an impulse response of approximately 40 million taps. Reducing the Doppler to

pedestrian velocities (<10 Hz) increases the length of the impulse response to on the

order of 1 billion taps.

To test the correlation properties introduced by the Kronecker model, access to the

instantaneous channel matrices must be gained. With no way to access the values in

the channel matrix directly from the output y vector in hardware, verification of the

implemented design was performed in bit-true, cycle-true simulation. The expected
value of the channel matrix correlation $E[\mathbf{H}\mathbf{H}^H]$ can be obtained quickly
in simulation by bypassing the upsampling component in the channel matrix generator and
passing the nearly critically sampled Jakes processes directly into the spatial correlation
component, allowing the expected value to be estimated using far fewer samples.

            Used     Available   Percentage
slices      1,646    14,720      11%
BRAM        6        244         2%
DSP48E      38       640         6%

Table 9: FPGA Resource Consumption for a Single Channel Matrix Generator, Excluding WGN Source (fs = 200 MHz)

The LTE specification defines 3 Kronecker model matrices for conformance testing
of two antenna systems, providing low, medium and high levels of correlation. The
simulation results along with both correlation matrices taken from the LTE specification
for each antenna array are shown below. $\mathbf{R}_{HH}$ is the result obtained after averaging
500,000 $\mathbf{H}\mathbf{H}^H$ operations, closely estimating $E[\mathbf{H}\mathbf{H}^H]$.

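\[
\boldsymbol{\Psi}_{TX}^{\mathrm{low}} = \boldsymbol{\Psi}_{RX}^{\mathrm{low}} =
\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad
\mathbf{R}_{HH} =
\begin{bmatrix} 0.9994 & 0.0003 + j0.0009 \\ 0.0003 - j0.0009 & 0.9994 \end{bmatrix}
\tag{124}
\]

\[
\boldsymbol{\Psi}_{TX}^{\mathrm{med}} =
\begin{bmatrix} 1 & .3 \\ .3 & 1 \end{bmatrix}, \
\boldsymbol{\Psi}_{RX}^{\mathrm{med}} =
\begin{bmatrix} 1 & .9 \\ .9 & 1 \end{bmatrix}, \quad
\mathbf{R}_{HH} =
\begin{bmatrix} 1.0001 & 0.2996 - j0.0004 \\ 0.2996 + j0.0004 & 0.9982 \end{bmatrix}
\tag{125}
\]

\[
\boldsymbol{\Psi}_{TX}^{\mathrm{high}} = \boldsymbol{\Psi}_{RX}^{\mathrm{high}} =
\begin{bmatrix} 1 & .9 \\ .9 & 1 \end{bmatrix}, \quad
\mathbf{R}_{HH} =
\begin{bmatrix} 1.0001 & 0.9003 + j0.0003 \\ 0.9003 - j0.0003 & 1.0002 \end{bmatrix}
\tag{126}
\]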

The correlation results show good correspondence with the correlation matrices pro-

vided by the model. The matrices defined in the medium correlation scenario in Eq. 125

reinforce the earlier assumptions about the impact of antenna spacing and spatial cor-

relation at the base station and the mobile device.

The final resource consumption tabulation in Tbl. 9 reveals a very hardware-efficient
implementation, occupying a small fraction of the resources available in the
Virtex5 SX95T FPGA. The implemented model contains a single coefficient matrix generator,
which leaves the components before the variable rate transition idle much of
the time. Increasing the number of channel paths keeps these components busier. The
expected hardware resource consumption should not increase dramatically with M and
N, depending on the rates in the system and the FPGA clock speed.

Perhaps surprisingly, the largest consumers of DSP48E elements are the channel

processing element and the linear interpolators. The channel processing element be-

comes quite complex with scaled M , N and p, increasing the number of complex multi-

plications and additions super-linearly. The product of the complex channel matrix and

transmit vector must operate at full rate, requiring 16 dedicated DSP48E elements in

the implemented configuration. Similarly, the linear interpolation components require

2MN p multipliers, 8 in this design. These two components alone occupy more than

half of the total DSP48E consumption.
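The multiplier counts quoted above can be reproduced with a few lines of arithmetic. This is a sketch that assumes each complex multiplication is realized with four real multipliers (the direct form); a three-multiplier form would give a different count, so the assumption is flagged in the code. The DSP48E total is taken from Tbl. 9.

```python
M, N, p = 2, 2, 1                       # implemented configuration

# Assumption: (a + jb)(c + jd) implemented with 4 real multipliers.
real_mults_per_complex_mult = 4

# Channel matrix times transmit vector: M*N complex products per channel path.
channel_mults = M * N * p * real_mults_per_complex_mult    # 16

# Linear interpolators: one multiplier per real stream, 2*M*N*p streams.
interp_mults = 2 * M * N * p                               # 8

total_dsp48e = 38                                          # from Tbl. 9
share = (channel_mults + interp_mults) / total_dsp48e
print(channel_mults, interp_mults, round(share, 2))
```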

The variable delay elements shown in the channel processing structure are imple-

mented using the FPGA’s dual-port BRAM as a tail-chasing circular buffer that can

provide up to (1024/N) − 1 taps of delay per BRAM, and were not included in the

presented resource breakdown. This particular hardware implementation includes a

single channel tap, allowing the omission of the variable delay element.

Finally, if channel emulation is performed in software, or if the sampling rate of

the transmitted and received signal is restricted to 30.72 MHz, the Farrow resampler

introduced in Fig. 52 can transition the rate from 30.72 MHz directly to 200 MHz for

channel processing and back again. Adding a Farrow-based resampler at the input and

output of the channel-processing component enables user-selectable fractional sample

delay and arbitrary tap delay resolution.


The Jakes and Kronecker models, two widely used models for wireless channel emula-

tion, have been implemented and tested in FPGA hardware. A system architecture has

been developed that allows the user to program spatial as well as temporal correlation

properties to emulate the behavior of a mobile MIMO channel. The unique system

architecture is highly flexible, yet implemented in a very efficient structure while pro-

viding greater than 16-bit performance. An implemented design that supports high

dimension MIMO systems is also capable of emulating lower dimensionality, even SISO

systems, by programming the appropriate C matrix with padded zeros.



8 Conclusions

This dissertation has covered a wide range of topics with several notable contributions.

Ch. 3 introduces OFDM receiver synchronization concepts, showing a connection be-

tween sampling frequency offset and symbol timing synchronization. A receiver ar-

chitecture was introduced that simultaneously corrects sampling frequency offset and

symbol timing. The technique was shown to maintain excellent performance, even in

harsh multi-path highly-mobile channel conditions. To enhance performance in an LTE

system, a technique was developed that is able to efficiently detect symbol timing using

the primary synchronization signal (PSS) in the time domain. The detection method

exploits the band-limited nature of the PSS to minimize the number of necessary com-

putations.

Ch. 4 continued with the LTE OFDM receiver design by first introducing a stochastic

optimization technique to directly estimate the equalization matrix using the received

signal and the available reference symbols (RSs). An alternative method using locally
weighted regression was then explored. The regression technique features

a parametrized kernel that can be selected for a particular channel environment. The

kernel selection was found using offline training. The regression technique was shown

to significantly reduce estimation error in two out of the three LTE-specified channel

environments.

The regression technique used for channel estimation was found to be quite use-

ful for other tasks, such as arbitrary-ratio resampling. Ch. 5 showed that the locally

weighted regression algorithm can be formulated to generate the Farrow filter. Us-

ing a parametrized kernel, the response of the Farrow filter can be adjusted. Using a

pre-processing upsampler, the Farrow filter was found to exhibit excellent resampling

performance, which was demonstrated in the simulation results of Ch. 3.

Later, in Ch. 6 a technique that utilizes cyclic prefix redundancy was introduced,

capable of providing modest SNR improvements in an OFDM receiver operating in

“normal channel conditions”. The redundancy combination is performed using a single

addition and bit-shift for each received sample. The technique requires already-known

or readily available measurements of the channel’s excess delay.

Finally, Ch. 7 presents the theory and implementation of a real-time multi-path

MIMO fading channel emulator. The developed architecture was implemented and

tested in FPGA hardware. The unique architecture utilizes an arbitrary-ratio resampler

that guarantees 16-bit performance, enabling the user to select the desired Doppler

frequency at run-time with high resolution. The architecture also allows run-time
programming of the spatial correlation aspects of the channel, which determine the
expected channel capacity in a MIMO system. Hardware-based test results and FPGA
resource consumption reveal a very cost-effective, high-performance design.



A Generic Multicarrier System Model

A multicarrier system is usually implemented using a transmultiplexer [75, 76]. For

OFDM, the transmultiplexer is implemented using the discrete Fourier transform (DFT)

[77,78], which is an orthogonal filter bank. The DFT has very nice properties allowing
very efficient implementation in hardware, making OFDM a popular choice in multicarrier
systems. Other types of multicarrier, sometimes called FBMC (a more generic
term), can use orthogonal or nonorthogonal filter banks [76] instead of the DFT. In a
nonorthogonal, or oversampled filter bank, each sub-band contains some energy from
its neighbor, providing some redundancy [79]. There are many variants of FBMC,
which all rely on the fundamental properties of their transmultiplexers.

A.1 Linear Transforms and Basis Functions

In a multicarrier system, a vector of modulated symbols x is transformed into the vector

y using linear operations or transforms, i.e. [80]

\[
\mathbf{y} = \mathbf{T}\mathbf{x} , \tag{127}
\]

where T, the transform matrix, is generally unitary, i.e.

\[
\|\mathbf{y}\|_2 = \|\mathbf{x}\|_2 . \tag{128}
\]

This is equivalent to changing the coordinate system from the domain of x to the trans-

form domain of y.

To demonstrate the concept of linear transforms, consider a vector x in a two-dimensional
space. The vector is described in a coordinate system with orthonormal axes defined by
the vectors x0 and x1; the description of x by these elements, or “bases”, defines the
vector. For example,

\[
\mathbf{x} = \sqrt{\tfrac{1}{2}} \cdot \mathbf{x}_0 + \sqrt{\tfrac{1}{2}} \cdot \mathbf{x}_1 \tag{129}
\]

In the basis $\{\mathbf{x}_0, \mathbf{x}_1\}$, the vector x can be written

\[
\mathbf{x} = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix}
= \sqrt{\tfrac{1}{2}} \begin{bmatrix} 1 \\ 1 \end{bmatrix} . \tag{130}
\]

Figure 75: Two-Element Vector Defined in the Orthonormal Basis {x0, x1}

Figure 76: Two-Element Vector Redefined in the Orthonormal Basis {y0, y1}

If the coordinate system is rotated so the axes are defined by the new orthonormal
basis $\{\mathbf{y}_0, \mathbf{y}_1\}$, the old x vector can still be expressed in the new
transformed basis as the vector y (Fig. 76). For this example, let

\[
\mathbf{y} = 1 \cdot \mathbf{y}_0 + 0 \cdot \mathbf{y}_1 \tag{131}
\]

Using the basis $\{\mathbf{y}_0, \mathbf{y}_1\}$, the vector y can be written

\[
\mathbf{y} = \begin{bmatrix} y_0 \\ y_1 \end{bmatrix}
= \begin{bmatrix} 1 \\ 0 \end{bmatrix} . \tag{132}
\]

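The same vector can be expressed in both the x and y bases. The coordinate systems
can be related by the following expression, using the vectors shown in Fig. 77.

\[
\begin{bmatrix} y_0 \\ y_1 \end{bmatrix}
=
\begin{bmatrix}
\mathbf{y}_0^T \mathbf{x}_0 & \mathbf{y}_0^T \mathbf{x}_1 \\
\mathbf{y}_1^T \mathbf{x}_0 & \mathbf{y}_1^T \mathbf{x}_1
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \end{bmatrix}
=
\sqrt{\tfrac{1}{2}}
\begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \end{bmatrix} , \tag{133}
\]

\[
\mathbf{T} = \sqrt{\tfrac{1}{2}} \begin{bmatrix} 1 & 1 \\ -1 & 1 \end{bmatrix} \tag{134}
\]

Figure 77: x and y orthonormal basis vectors defined in the x basis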

Now the matrix T transforms the x basis to the y basis. Notice that T is an orthonormal
matrix, i.e.

\[
\mathbf{T}\mathbf{T}^H = \mathbf{T}^H\mathbf{T} = \mathbf{I} ; \tag{135}
\]

therefore the inverse transform, from the y basis back to the x basis, can be
easily performed. Note that the Hermitian transpose is used to support the possibility
of complex transform matrices.

\[
\mathbf{x} = \mathbf{T}^{-1}\mathbf{y} = \mathbf{T}^H\mathbf{y} \tag{136}
\]

This illuminates the concept of linear transforms using an orthonormal transformation

matrix. In a multicarrier system, the forward and reverse transforms are separated and

reside in the transmitter and receiver. In an ideal system, a vector of modulated symbols

x is transformed at the transmitter using an orthonormal transformation matrix, T. In

this discussion, T is simply a generic transformation matrix. The result of the transform

operation is the vector y, which is transmitted and received by the receiver unaltered.

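\[
\mathbf{y} = \mathbf{T}\mathbf{x} \tag{137}
\]

The receiver then uses the reverse transform to obtain the originally transmitted symbol
vector x

\[
\mathbf{x} = \mathbf{T}^H\mathbf{y} \;\Rightarrow\;
\mathbf{x} = \mathbf{T}^H\left[\mathbf{T}\mathbf{x}\right] = \mathbf{I}\mathbf{x} = \mathbf{x} \tag{138}
\]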
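The two-dimensional example can be verified numerically. The short sketch below uses the rotation matrix of Eq. 134 and the vector of Eq. 130; the variable names are chosen here for illustration only.

```python
import numpy as np

T = np.sqrt(0.5) * np.array([[1.0, 1.0],
                             [-1.0, 1.0]])      # Eq. 134
x = np.sqrt(0.5) * np.array([1.0, 1.0])         # Eq. 130

y = T @ x                                       # forward transform, Eq. 137
x_back = T.conj().T @ y                         # reverse transform, Eq. 138

print(y)                                        # coordinates in the y basis
print(np.linalg.norm(x), np.linalg.norm(y))     # norms agree for a unitary T
```

The forward transform yields the coordinates (1, 0) of Eq. 132, and applying the Hermitian transpose recovers the original vector.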

Figure 78: QAM Constellations: QPSK and 16QAM (axes: real, imag; QPSK points at ±1 labeled with 2-bit tuples; 16QAM points at ±1, ±3 labeled with 4-bit tuples)

A.2 Serial-to-Parallel and Mapping

To generate the x vector of modulated symbols, a binary data stream undergoes “mapping”.
First, a serial stream of data is split into a matrix B of size N × M. The serial
data fills each row of the matrix, forming rows of binary M-tuples. In the established
notation, an MN × 1 vector b contains all of the binary data to be modulated, which is
only the portion of a continuous stream of data that fits into a single symbol vector
transmission. The algorithm for serial-to-parallel conversion is shown in Alg. 1.

Algorithm 1 Transmitter Serial to Parallel Conversion
for i = 1 : N do
    for j = 1 : M do
        B_{i,j} = b_{((i−1)M)+j}
    end for
end for

The data is arranged into row-wise M-tuples to form the N × M B matrix. This operation is
performed so the next stage, the “mapping” stage, can take each M-tuple row-by-row and
map it to a representative point in a constellation of symbols. The constellation used
in mapping must be made up of $2^M$ symbols, so that any possible binary combination
is represented. Two possible constellations are illustrated in Fig. 78. The constellation
points are indicated in red, juxtaposed with the corresponding M-tuple. The mapped
coordinates for each constellation point are labeled on the real and imaginary axes. The
complex constellation coordinates for the constellation points fill the transmit symbol
vector for the linear transformation operation (the x vector in the example above).
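The serial-to-parallel and mapping steps can be sketched as follows. This is an illustrative example only: the QPSK bit-to-symbol assignment in the dictionary is hypothetical and may not match the labeling of Fig. 78, and the helper names are invented for this sketch. The nearest-point demapper is one simple decision rule, included so the round trip can be checked.

```python
import numpy as np

def serial_to_parallel(b, N, M):
    """Alg. 1: fill an N x M matrix row by row from the MN-element bit vector."""
    return np.asarray(b).reshape(N, M)

def parallel_to_serial(C):
    """Read an N x M bit matrix row by row back into a serial bit vector."""
    return np.asarray(C).reshape(-1)

# Hypothetical QPSK mapping (M = 2 bits per symbol).
QPSK = {(0, 0): -1 + 1j, (0, 1): 1 + 1j, (1, 0): -1 - 1j, (1, 1): 1 - 1j}

def map_rows(B):
    """Map each M-tuple (row of B) to its constellation point."""
    return np.array([QPSK[tuple(int(bit) for bit in row)] for row in B])

def demap(v):
    """Nearest-point decision per symbol; returns an N x M bit matrix."""
    labels = list(QPSK.keys())
    points = np.array([QPSK[k] for k in labels])
    nearest = np.argmin(np.abs(np.asarray(v)[:, None] - points[None, :]), axis=1)
    return np.array([labels[i] for i in nearest])

b = [0, 0, 0, 1, 1, 0, 1, 1]            # MN = 8 bits
B = serial_to_parallel(b, N=4, M=2)     # rows are 2-tuples
x = map_rows(B)                         # transmit symbol vector
c = parallel_to_serial(demap(x))        # ideal channel: bits are recovered
print(B.tolist(), x, c.tolist())
```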

The mapper is not constrained to assign any single modulation type to all of the
rows of the B matrix. For example, the mapper could use both of the modulation types
shown in Fig. 78 as long as each row's tuple uses only the number of bits required by
its assigned constellation (2 bits for QPSK, 4 bits for 16QAM). For simplicity, it will
be assumed that the constellation type will be uniform across all rows in B unless
otherwise noted. Fig. 78 does not show any normalization between constellations.
Normally, if multiple constellation types are simultaneously available to the mapper,
the constellation axes are scaled so that the expected symbol power (squared magnitude
of each symbol vector) is equal throughout each modulation type. If normalization is
performed, the symbols in the higher order modulation types lie more closely together
on the complex plane relative to the lower order constellations.
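The effect of normalization can be illustrated with the square constellations of Fig. 78. The sketch below assumes equiprobable symbols and the unnormalized levels ±1 (QPSK) and ±1, ±3 (16QAM); the helper name is hypothetical.

```python
import numpy as np

def square_constellation(levels):
    """All complex points whose real and imaginary parts take the given levels."""
    re, im = np.meshgrid(levels, levels)
    return (re + 1j * im).ravel()

qpsk = square_constellation([-1, 1])
qam16 = square_constellation([-3, -1, 1, 3])

# Expected symbol power with equiprobable symbols.
p_qpsk = np.mean(np.abs(qpsk) ** 2)     # 2
p_qam16 = np.mean(np.abs(qam16) ** 2)   # 10

# Scale both constellations to unit expected power.
qpsk_n = qpsk / np.sqrt(p_qpsk)
qam16_n = qam16 / np.sqrt(p_qam16)

def min_distance(points):
    """Smallest distance between two distinct constellation points."""
    d = np.abs(points[:, None] - points[None, :])
    return d[d > 0].min()

print(p_qpsk, p_qam16, min_distance(qpsk_n), min_distance(qam16_n))
```

After scaling, the minimum spacing of the 16QAM points is smaller than that of the QPSK points, which is the crowding described in the text.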

Given a constant output rate of the mapping component, the mapper determines

the data throughput of the system. Each vector x contains MN mapped bits, so if more

mapped symbols (N) or a higher order constellation is used (M), the data throughput

increases. In a multicarrier communications system, the rate of the mapper and the

size of the mapped vector of symbols (N) are usually fixed, allowing the modulation

order (M) to be varied to throttle the data throughput.

As seen in Fig. 79, after mapping, the linear transformation is applied, and a

parallel-to-serial operation is performed on the result. This operation simply reads the

transformed vector row by row. In the mathematical model using matrix notation, this

operation has no effect, but in a real system, the transformed vector must be converted

by a DAC and must be converted element by element. This operation is included to

illustrate the concept of the signal propagating through the wireless channel as a time

sequence.

After passing through the wireless channel, the signal arrives at the receiver and is

serially collected in groups of N elements by the receiver’s ADC. The received signal

comprises the received vector u after the serial-to-parallel operation. The u vector is

transformed using the transformation matrix that undoes the transmitter’s, producing

the v vector. The demapper then determines the most likely transmitted complex symbol
for each row in v and outputs the corresponding binary M-tuple, forming the N × M
matrix C. Finally, the parallel-to-serial operation transforms C into a serial stream
of bits. The algorithm for the receiver’s parallel-to-serial operation is shown in Alg. 2.

Figure 79: Generic System Model (blocks: serial-to-parallel, mapper, T, parallel-to-serial, channel, serial-to-parallel, T^H, demapper, parallel-to-serial; signals: b, B_{j,:}, x, y, u, v, C_{j,:}, c)

Algorithm 2 Receiver Parallel to Serial Conversion
for i = 1 : N do
    for j = 1 : M do
        c_{((i−1)M)+j} = C_{i,j}
    end for
end for

A.3 The Stationary AWGN Channel

Notice in Fig. 79, if the channel component passes the y vector unaltered to the receiver

(i.e. u = y,v = x,C = B and c = b) the system will be error free after demapping. In

a more realistic example, the transmitted signal travels through a channel and arrives

at the receiver with some alterations, such as added noise and echoes from multi-path

propagation through the channel.

In the most basic system model, the channel adds complex WGN (AWGN) to the
signal. To define the added noise, we must first introduce some notation used with
random variables. Let n(t) denote a complex Gaussian random variable with mean $\mu_n$ and
variance $\sigma_n^2$, i.e.

\[
n(t) = n_0(t) + j n_1(t) \tag{139}
\]

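where $n_0(t)$ and $n_1(t)$ are each real-valued, i.i.d., zero-mean Gaussian random
variables, i.e. the cross-correlation and autocorrelation of $n_0(t)$ and $n_1(t)$ are defined by

\[
\begin{aligned}
r_{n_0 n_1}(\tau) &= 0, \quad \forall\, \tau, \\
r_{n_i n_i}(\tau) &= \sigma_{n_i}^2 \delta(\tau), \quad \forall\, \tau,\ i \in \{0, 1\}
\end{aligned}
\tag{140}
\]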

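The mean $\mu_{n_i}$ and variance $\sigma_{n_i}^2$ of the real and imaginary components of n(t) are
defined by

\[
\begin{aligned}
\mu_{n_i} &= E\left[n_i\right], \\
\sigma_{n_i}^2 &= E\left[\left(n_i(t) - \mu_{n_i}\right)\left(n_i(t) - \mu_{n_i}\right)\right], \\
i &\in \{0, 1\} ;
\end{aligned}
\tag{141}
\]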

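therefore, the mean $\mu_n$ and variance $\sigma_n^2$ are defined similarly by

\[
\begin{aligned}
\mu_n &= E\left[n\right] \\
\sigma_n^2 &= E\left[\left(n(t) - \mu_n\right)\left(n(t) - \mu_n\right)^*\right]
\end{aligned}
\tag{142}
\]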

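In the wireless channel model, it will be assumed, unless otherwise noted, that
$\mu_{n_0} = \mu_{n_1} = 0$. To define a complex noise process, the mean and variance of n(t) will
be specified. To generate the noise, the variances of the i.i.d. real and imaginary parts
are related to the specified variance by $\sigma_{n_i}^2 = \sigma_n^2 / 2$ (equivalently,
$\sigma_{n_i} = \sigma_n / \sqrt{2}$), so that the two components together contribute the full
variance $\sigma_n^2$.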

To stay aligned with the established vector and matrix notation, the complex random
vector variable n(t) must be defined by

\[
\mathbf{n}(t) \triangleq \left[ n_1(t), n_2(t), \cdots, n_N(t) \right]^T , \tag{143}
\]

where each element in n is an i.i.d. complex random variable as defined in Eq. 139
and N defines the number of elements in the complex random vector.

Often in simulation, the noise in the AWGN channel will be varied to achieve a
desired SNR, or more specifically a signal-to-noise power spectral density ratio, i.e.
$E_s/N_0$, requiring the noise density to be selected according to the signal power. For a
complex random variable, its power is defined by the expected value of its Hermitian
product, which for a zero-mean variable is equivalent to the definition of its variance.

\[
P_n = \sigma_n^2 = E\left[n(t)^* n(t)\right] \tag{144}
\]

For the complex vector of complex random variables, the expected power is defined by
the expectation of its Hermitian inner product.

\[
P_n = \sigma_n^2 = E\left[\mathbf{n}(t)^H \mathbf{n}(t)\right] = N\sigma_{n_i}^2 , \tag{145}
\]

where $\sigma_{n_i}^2$ defines the noise variance for each of the N random complex variables that
comprise n(t).
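A minimal sketch of generating such a noise vector is given below. It assumes zero-mean noise and splits the per-element complex variance equally between the real and imaginary parts, each of which therefore receives half of it; the function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def complex_awgn(N, sigma2, rng):
    """N i.i.d. zero-mean complex Gaussian samples with per-element variance sigma2."""
    scale = np.sqrt(sigma2 / 2.0)       # each real component gets sigma2 / 2
    return scale * (rng.standard_normal(N) + 1j * rng.standard_normal(N))

sigma2 = 0.5
N = 100_000
n = complex_awgn(N, sigma2, rng)

var_real = n.real.var()                 # close to sigma2 / 2
var_imag = n.imag.var()                 # close to sigma2 / 2
power_per_element = np.mean(np.abs(n) ** 2)   # close to sigma2 (Eq. 144)
vector_power = np.vdot(n, n).real       # n^H n, close to N * sigma2 (Eq. 145)

print(var_real, var_imag, power_per_element, vector_power / N)
```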



Knowing the expected SNR in the wireless channel is useful for defining or pre-

dicting system performance. SNR is usually expressed logarithmically using the scaled

Briggsian (base-10) logarithm (named after the British mathematician Henry Briggs).

The expected SNR of a signal u = y+ n (as seen in Fig. 79), where y is the noiseless

signal and n is a realization of the complex Gaussian random variable n(t) is defined

by

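\[
E\left[\mathrm{SNR}(\mathbf{u})\right]
= 10\log_{10}\left(\frac{E\left[P_{signal}\right]}{E\left[P_{noise}\right]}\right)
= 10\log_{10}\left(\frac{E\left[\mathbf{y}^H\mathbf{y}\right]}{N\sigma_n^2}\right) . \tag{146}
\]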

The instantaneous SNR is defined by

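\[
\mathrm{SNR}(\mathbf{u})
= 10\log_{10}\left(\frac{P_{signal}}{P_{noise}}\right)
= 10\log_{10}\left(\frac{\mathbf{y}^H\mathbf{y}}{\mathbf{n}^H\mathbf{n}}\right) . \tag{147}
\]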
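The two SNR definitions can be related in a short simulation. The sketch below picks the per-element noise variance from a target expected SNR (Eq. 146, using the realized signal power in place of its expectation) and then measures the instantaneous SNR of one realization (Eq. 147). The unit-power QPSK signal and all names are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

N = 512
target_snr_db = 20.0

# Unit-power QPSK symbols as a stand-in for the noiseless signal y.
y = (rng.choice([-1.0, 1.0], size=N) + 1j * rng.choice([-1.0, 1.0], size=N)) / np.sqrt(2)

# Solve Eq. 146 for the per-element noise variance.
signal_power = np.vdot(y, y).real                    # y^H y
sigma2 = signal_power / (N * 10 ** (target_snr_db / 10))

# One noise realization and the received vector u = y + n.
n = np.sqrt(sigma2 / 2) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
u = y + n

# Instantaneous SNR of this realization (Eq. 147).
snr_inst_db = 10 * np.log10(np.vdot(y, y).real / np.vdot(n, n).real)
print(round(snr_inst_db, 2))
```

The instantaneous value scatters around the target from one realization to the next, and the scatter grows as N shrinks.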

In the instantaneous case, the expected mean and variance of n are 0 and σ2n. The

instantaneous mean and variance will, themselves, have a random distribution across

realizations, which can be problematic when measuring the noise in small vector sizes.

The maximum likelihood estimator for the mean and variance of an observed realiza-

tion n of n(t) is defined by

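\[
\begin{aligned}
\hat{\mu}_n &= \frac{1}{N}\sum_{i=1}^{N} n_i \\
\hat{\sigma}_n^2 &= \frac{1}{N}\sum_{i=1}^{N} \left(n_i - \hat{\mu}_n\right)\left(n_i - \hat{\mu}_n\right)^* \\
\hat{\mu}_n &\to \mu_n, \quad \hat{\sigma}_n^2 \to \sigma_n^2, \quad N \to \infty
\end{aligned}
\tag{148}
\]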

Note that the variance estimate depends on the estimated mean. Given large N, the
estimated mean and variance $\hat{\mu}_n$ and $\hat{\sigma}_n^2$ are distributed:

\[
\begin{aligned}
\hat{\mu}_n &\sim \mathcal{N}\left(\mu_n, \frac{\sigma_n^2}{N}\right) \\
\hat{\sigma}_n^2 &\sim \frac{\sigma_n^2}{N}\,\chi^2_{N-1} ,
\end{aligned}
\tag{149}
\]

where $\mathcal{N}(\mu, \sigma^2)$ denotes a real, Gaussian (normal) distribution with mean µ and
variance σ² and $\chi^2_{N-1}$ denotes a Chi-squared distribution with N − 1 degrees of freedom.
As N increases to infinity, the variance of the mean estimate approaches zero. The
variance of the variance estimate approaches zero as well, because the diminishing
scaling factor outweighs the growing number of degrees of freedom, as indicated in the
final line of Eq. 148.

In addition to noise, the channel can have “memory”. In the above case, where the
channel only adds noise to the signal, the signal propagating through the channel can
be modeled as convolution with an impulse, i.e. the channel is an all-pass filter that
passes the transmitted signal unmodified, only adding noise. In a nonideal system, the
channel’s impulse response is no longer a unit impulse. In the simplest channel with
memory, the channel only delays the signal, which can be modeled as a convolution
with a time-delayed impulse, causing a linear phase shift across frequency at the
receiver. The channel can be modeled as a causal FIR filter of order N, characterized
by [81]

\[
H(z) = \sum_{k=1}^{N+1} h_k z^{-(k-1)} , \tag{150}
\]

which is a polynomial in $z^{-1}$. The (N + 1) × 1 vector h contains the coefficients of this
polynomial and is assumed to be a unit vector such that $\|\mathbf{h}\|_2 = 1$. In the time domain,
the relation between the input (transmitted signal) x[n] and the output (received signal)
y[n] is

\[
y[n] = \sum_{k=1}^{N+1} h_k\, x\left[n - (k-1)\right] . \tag{151}
\]

The convolution operation can be performed by a matrix multiplication between the
input vector and a Toeplitz [48] matrix whose columns are successively shifted copies of
the filter coefficient vector h. The equation below shows the operation equivalent to
Eq. 151 using matrix notation [81].

\[
\mathbf{y} = \mathbf{H}\mathbf{x} =
\begin{bmatrix}
h_1 & 0 & \cdots & 0 & 0 \\
h_2 & h_1 & \cdots & \vdots & \vdots \\
h_3 & h_2 & \cdots & 0 & 0 \\
\vdots & h_3 & \cdots & h_1 & 0 \\
h_{M-1} & \vdots & \cdots & h_2 & h_1 \\
h_M & h_{M-1} & \vdots & \vdots & h_2 \\
0 & h_M & \cdots & h_{M-2} & \vdots \\
0 & 0 & \cdots & h_{M-1} & h_{M-2} \\
\vdots & \vdots & \vdots & h_M & h_{M-1} \\
0 & 0 & 0 & \cdots & h_M
\end{bmatrix}
\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix} , \tag{152}
\]

where M is now the length of the vector of coefficients in the convolution and N is
now the length of the signal being convolved. The H matrix must have dimensions
(M + N − 1) × N. To find the frequency response of the channel, multiply the channel
coefficient vector h by the N × N DFT matrix $\mathbf{W}_N$ defined by

\[
\left[\mathbf{W}_N\right]_{m,n} = \frac{1}{\sqrt{N}}\, e^{j2\pi \frac{mn}{N}}, \quad 0 \le m, n \le N-1 . \tag{153}
\]

As an example, if the channel vector element $h_m = 1$ and all other elements are zero, i.e.
$h_i = 0,\ i \ne m$, the frequency response can be defined by

\[
d_n = \mathbf{W}\mathbf{h} = \frac{1}{\sqrt{N}}\, e^{j2\pi \frac{n(m-1)}{N}}, \quad 0 \le n \le N-1 , \tag{154}
\]

which shows a constant magnitude and a linear phase dependence on m across the
frequency index n. In a wireless channel, Eq. 154 shows the effect of delay in the channel
for $m \ne 1$ (assuming the transmitter and receiver are synchronized). More realistically,
the channel coefficient vector will have multiple non-zero elements. In this case, both the
phase and the magnitude depend on the frequency index n.
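Both ideas, convolution as a Toeplitz matrix product and the linear phase of a pure delay, can be checked numerically. The sketch below builds the (M + N − 1) × N matrix column by column and compares it with a direct convolution, then evaluates the frequency response of a single delayed tap using the sign convention of Eq. 153. Zero-padding h to the DFT length and the helper name are assumptions of this illustration.

```python
import numpy as np

def convolution_matrix(h, N):
    """(M+N-1) x N Toeplitz matrix whose columns are shifted copies of h (Eq. 152)."""
    M = len(h)
    H = np.zeros((M + N - 1, N), dtype=complex)
    for col in range(N):
        H[col:col + M, col] = h
    return H

# Three-tap channel, normalized to unit norm.
h = np.array([0.8, 0.5, 0.3])
h = h / np.linalg.norm(h)
x = np.arange(1.0, 9.0)                     # length-8 test signal

H = convolution_matrix(h, len(x))
y = H @ x                                   # matrix form of the convolution

# Frequency response of a pure delay: single tap at 1-based position m.
N = 16
idx = np.arange(N)
W = np.exp(2j * np.pi * np.outer(idx, idx) / N) / np.sqrt(N)    # Eq. 153
m = 4
h_delay = np.zeros(N)                       # h zero-padded to the DFT length
h_delay[m - 1] = 1.0
d = W @ h_delay
expected = np.exp(2j * np.pi * idx * (m - 1) / N) / np.sqrt(N)  # Eq. 154

print(np.allclose(y, np.convolve(h, x)), np.allclose(d, expected))
```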

In Fig. 79, the u vector can now be defined by

\[
\mathbf{u} = \mathbf{H}\mathbf{y} + \mathbf{n} , \tag{155}
\]

where n is the realization of the random complex noise vector n(t) as defined in Eq. 143

and Hy is the convolution between the channel impulse response h and the transmitted

signal y.

An example shown in Fig. 80 demonstrates the effect of a noisy channel with mem-

ory. In this example, h has 3 non-zero elements. The top row of Fig. 80 shows various

aspects of the x vector at the transmitter. The x vector has been formed by mapping

a binary stream to QPSK symbols. In this example, 383 out of N = 512 elements in x

are occupied. Using OFDM for this example, the x vector undergoes OFDM modulation and

passes through the channel. The static channel effects are applied using Eq. 155. The

u vector arrives at the receiver where ideal OFDM demodulation is carried out to form

the v vector. Various aspects of the v vector are shown in the bottom row of Fig. 80. As

seen in the bottom row of Fig. 80, the magnitude and phase of the transmitted signal

have been greatly distorted by the channel. The blue dots indicate the known magni-

tude and phase of the channel. The expected SNR in this example is 20 dB (Eq. 146).

For the receiver to perform demapping, the magnitude and phase effects of the
channel must be removed. This process is called equalization. In multicarrier systems,
equalization is performed after the linear transformation operation. This operation will
be inserted as required to form various extensions of the system model shown in Fig. 79.

Figure 80: Effect of a noisy channel with memory: OFDM example (top row: x vector constellation on real/imag axes, ||x_k||^2 versus k, angle(x_k) in ×2π radians versus k; bottom row: v vector constellation, ||v_k||^2 versus k with received (red) and known (blue) values, angle(v_k) in ×2π radians versus k with received (red) and known (blue) values)



B OFDM System Model

Using the established model for a generic multicarrier system (Sec. A), a more specific

model can be developed to describe a generic OFDM system. As in the generic mul-

ticarrier system model, an OFDM system features a linear transform operation. The

DFT, which has many attractive properties, is used at the transmitter and receiver as a

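transmultiplexer. The orthonormal DFT matrix is defined as

\[
\left[\mathbf{W}_N\right]_{m,n} = \frac{1}{\sqrt{N}}\, e^{j2\pi \frac{mn}{N}}, \quad 0 \le m, n \le N-1 . \tag{156}
\]

The $\frac{1}{\sqrt{N}}$ term is used to scale the matrix to be orthonormal, i.e.

\[
\mathbf{W}\mathbf{W}^H = \mathbf{W}^H\mathbf{W} = \mathbf{W}^{-1}\mathbf{W} = \mathbf{W}\mathbf{W}^{-1} = \mathbf{I} \tag{157}
\]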

Using the above properties, the DFT performs restoring linear transformations using

the forward (W) and inverse (WH) transforms. As in Sec. A, a vector of mapped con-

stellation symbols x is transformed into the y vector.

\[
\mathbf{y} = \mathbf{W}^H\mathbf{x} \tag{158}
\]

The y vector is then transmitted into a static noisy channel with memory (as described
in Sec. A.3).

\[
\mathbf{u} = \mathbf{H}\mathbf{W}^H\mathbf{x} + \mathbf{n} \tag{159}
\]

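Finally, the receiver’s linear transform (DFT) undoes the transmitter’s (IDFT), producing
the v vector.

\[
\mathbf{v} = \mathbf{W}\left(\mathbf{H}\mathbf{W}^H\mathbf{x} + \mathbf{n}\right)
= \mathbf{W}\mathbf{H}\mathbf{W}^H\mathbf{x} + \mathbf{W}\mathbf{n}
= \mathbf{W}\mathbf{H}\mathbf{W}^H\mathbf{x} + \mathbf{n} \tag{160}
\]

The final step relabels Wn as n: because W is unitary, the transformed noise vector has
the same statistics as the original i.i.d. complex Gaussian noise vector.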
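The restoring property of the transform pair can be demonstrated with an explicit matrix. The sketch below builds W with the sign convention of Eq. 156, applies Eq. 158 and then the receiver transform for the ideal case H = I and n = 0, and also confirms that a unitary transform leaves the power of a noise realization unchanged. The QPSK symbols and variable names are assumptions of the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

N = 64
idx = np.arange(N)
W = np.exp(2j * np.pi * np.outer(idx, idx) / N) / np.sqrt(N)    # Eq. 156

# Orthonormality check, Eq. 157.
unitary_error = np.max(np.abs(W @ W.conj().T - np.eye(N)))

# Unit-power QPSK symbols as the mapped vector x.
x = (rng.choice([-1.0, 1.0], size=N) + 1j * rng.choice([-1.0, 1.0], size=N)) / np.sqrt(2)

y = W.conj().T @ x          # transmitter transform, Eq. 158
v = W @ y                   # receiver transform with H = I and n = 0

# A unitary transform preserves the power of a noise realization.
n = np.sqrt(0.05) * (rng.standard_normal(N) + 1j * rng.standard_normal(N))
power_before = np.vdot(n, n).real
power_after = np.vdot(W @ n, W @ n).real

print(unitary_error, np.allclose(v, x), power_before, power_after)
```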

Now, consider multiple symbols being transmitted contiguously, one after another

through the channel, i.e. after the y vector has been read row-wise, another y vector is

generated and transmitted. This process is illustrated using block matrix notation. The

transmitted y vectors make up the time series of vectors y(k) with k as the time index

(recall that the y vector is generated by multiplying the x vector with WHM) [82].

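\[
\mathbf{v}(k) = \mathbf{W}_M \left(
\begin{bmatrix} \mathbf{H}_0 & \mathbf{H}_1 \end{bmatrix}
\begin{bmatrix} \mathbf{W}_M^H & \mathbf{0}_M \\ \mathbf{0}_M & \mathbf{W}_M^H \end{bmatrix}
\begin{bmatrix} \mathbf{x}(k-1) \\ \mathbf{x}(k) \end{bmatrix}
+ \mathbf{n}(k) \right) , \tag{161}
\]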

133


where

H0 =

0 · · · hd · · · h2...

. . . . . ....

.... . . hd

.... . .

...

0 · · · · · · · · · 0

H1 =

h1 0 · · · · · · 0...

. . . . . ....

hd · · · h1. . .

.... . . . . . 0

0 hd · · · h1

.

(162)

where the excess delay d defines the time separation between the first and last echo in

the channel, or the time separation between the first and last energy-bearing elements

in the channel’s impulse response. In this example, H0 and H1 are each M×M matrices,

concatenated to form an M × 2M block matrix. If the channel coefficient vector meets

the requirement

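\[ h_j = \begin{cases} \text{non-zero} & \text{if } j = 1 \\ 0 & \text{otherwise,} \end{cases} \qquad (163) \]

the block matrix $\begin{bmatrix} \mathbf{H}_0 & \mathbf{H}_1 \end{bmatrix} = \begin{bmatrix} \mathbf{0}_M & h_1\mathbf{I}_M \end{bmatrix}$, and v(k) simplifies to Eq. 160 with added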

time indices. If Eq. 163 is not satisfied, such simplifications cannot be made, and

energy, or inter-symbol interference (ISI) from x(k − 1) is added to v(k). In this case

the “excess delay” is non-zero, i.e. d > 1, and H0 is no longer a matrix of zeros;

therefore ISI is introduced. If ISI is present, after the DFT operation, the subcarriers

are no longer orthogonal, and energy from the previous symbol is spread over each

subcarrier, degrading the receiver’s ability to properly equalize and demap the received

symbol vector. The effect can be likened to SNR degradation, where the leaked energy

is added to the noise term in the SNR equation.
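The ISI mechanism described above can be reproduced with a small numerical sketch. It builds H0 and H1 from a tap vector in the layout of Eq. 162, evaluates the noise-free form of Eq. 161 without any guard interval, and measures how much v(k) changes when only the previous symbol x(k−1) changes. The tap values, symbol vectors, and function names (`channel_blocks`, `receive`, `isi_leak`) are assumptions made for illustration.

```python
import cmath
import math

def idft(x):
    """y = W^H x with orthonormal scaling."""
    n = len(x)
    return [sum(x[m] * cmath.exp(2j * cmath.pi * m * k / n) for m in range(n)) / math.sqrt(n)
            for k in range(n)]

def dft(y):
    """v = W u with orthonormal scaling."""
    n = len(y)
    return [sum(y[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n)) / math.sqrt(n)
            for m in range(n)]

def channel_blocks(h, m):
    """Return (H0, H1) for taps h = [h_1, ..., h_d] and block size m."""
    d = len(h)
    h1 = [[h[r - c] if 0 <= r - c < d else 0.0 for c in range(m)] for r in range(m)]
    h0 = [[h[r - c + m] if 0 <= r - c + m < d else 0.0 for c in range(m)] for r in range(m)]
    return h0, h1

def matvec(a, v):
    return [sum(a[r][c] * v[c] for c in range(len(v))) for r in range(len(a))]

def receive(h, x_prev, x_cur):
    """Noise-free v(k) = W (H0 y(k-1) + H1 y(k)), no guard interval."""
    h0, h1 = channel_blocks(h, len(x_cur))
    u = [a + b for a, b in zip(matvec(h0, idft(x_prev)), matvec(h1, idft(x_cur)))]
    return dft(u)

def isi_leak(h, x_cur, prev_a, prev_b):
    """Largest change in v(k) when only the previous symbol changes."""
    va = receive(h, prev_a, x_cur)
    vb = receive(h, prev_b, x_cur)
    return max(abs(p - q) for p, q in zip(va, vb))

x_cur = [1-1j, 1+1j, -1-1j, -1+1j, 1+1j, 1-1j, -1+1j, -1-1j]
prev_a = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, 1-1j, -1+1j]
prev_b = [-s for s in prev_a]

leak_memoryless = isi_leak([0.9], x_cur, prev_a, prev_b)            # Eq. 163 satisfied
leak_with_memory = isi_leak([0.9, 0.4, 0.2], x_cur, prev_a, prev_b)  # d = 3
print(leak_memoryless, leak_with_memory)
```

When Eq. 163 holds, H0 is all zeros and the previous symbol has no influence; with additional taps the leak is clearly non-zero on the received subcarriers.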

The consequences of ISI are severe enough that a guard interval is inserted to pre-

vent ISI, allowing for reliable operation in channels with memory. The DFT has a very

nice periodic property that lends itself to an elegant solution to the ISI problem. If the

DFT operation of the transmitted symbol vector is visualized as a finite summation of



weighted complex sinusoids (Eq. 131),

\[ y_{k+1} = x_1 \cdot 1 + x_2 e^{j2\pi k/N} + x_3 e^{j2\pi 2k/N} + x_4 e^{j2\pi 3k/N} + \cdots + x_N e^{j2\pi (N-1)k/N}, \quad k = 0, 1, \cdots, N-1 \qquad (164) \]

it becomes clear that y is guaranteed to be periodic. Each element in the summation that comprises y makes an integer number of traversals around the complex plane. Similarly, elements can be added to y by extending the phase traversal of each element. If the new vector is generated using an extended k index, where $k = 0, 1, \cdots, N+L-1$, the y vector becomes extended by L elements. This action “suffixes” L additional samples to the end of the vector. The receiver can then choose the first N samples in the vector for its DFT operation. Having added additional samples, the periodic nature of y is no longer guaranteed unless L is an integer multiple of N. Also, notice that the phase traversals of each sinusoidal component all align at zero phase when $k = 0, N, 2N$; therefore, $y_1 = y_{N+1}$, $y_2 = y_{N+2}$, $y_3 = y_{N+3}$, etc. The elements $y_k$ for $k > N$ can be generated by simply copying the elements at the beginning of the vector and placing them at the end, e.g. $y_k = y_{k-N}$ for $k = N+1, N+2, \cdots, N+L$.

The suffixing of samples implies that the receiver must choose the first block of N elements in the y vector for operation so that no phase shift penalty in the frequency domain vector x is incurred (see Eq. 154). If any excess delay exists in the channel, this segment of y will become corrupted with ISI, forcing the receiver to select its N contiguous samples using higher values of k, i.e. $\mathbf{y}_{(d+1):(N+d)}$, where d is the excess delay. Selecting this later window forces a phase shift in the frequency domain (Eq. 154).
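The phase shift penalty of a delayed receive window can be seen in a short sketch. It evaluates the sinusoid sum of Eq. 164 over an extended index range (with the orthonormal 1/√N scaling added so that the DFT inverts it), takes N samples starting `delay` samples late, and compares the DFT output with the original symbols rotated by a linear phase. The value of `delay`, the symbols in `x`, and the helper names are assumptions for illustration.

```python
import cmath
import math

def ofdm_samples(x, k_values):
    """Sum of weighted complex sinusoids evaluated at the given sample indices."""
    n = len(x)
    return [sum(x[m] * cmath.exp(2j * cmath.pi * m * k / n) for m in range(n)) / math.sqrt(n)
            for k in k_values]

def dft(y):
    """Orthonormal forward DFT."""
    n = len(y)
    return [sum(y[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n)) / math.sqrt(n)
            for m in range(n)]

x = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, 1-1j, -1+1j]
N = len(x)
delay = 2                                             # hypothetical window offset in samples
late_window = ofdm_samples(x, range(delay, N + delay))  # suffix-extended signal, late window
v = dft(late_window)

# Each subcarrier m is rotated by exp(j*2*pi*m*delay/N) relative to x[m].
predicted = [x[m] * cmath.exp(2j * cmath.pi * m * delay / N) for m in range(N)]
rotation_error = max(abs(a - b) for a, b in zip(v, predicted))
print(f"max |v - rotated x| = {rotation_error:.2e}")
```

The amplitudes are preserved, but every subcarrier other than the first is rotated, which is the penalty the cyclic prefix arrangement avoids.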

A more elegant solution is, instead of extending the phase traversal at the end of

the vector, to offset the starting point to an earlier position as shown below.

\[ y_{j+1} = x_1 \cdot 1 + x_2 e^{j2\pi k/N} + x_3 e^{j2\pi 2k/N} + x_4 e^{j2\pi 3k/N} + \cdots + x_N e^{j2\pi (N-1)k/N}, \quad j = 0, 1, \cdots, N+L-1, \quad k = j - L \qquad (165) \]

This operation “cyclically prefixes” the last L elements of y to the beginning of the vector. The added vector elements are “sacrificial”, intended to absorb corruption caused by channels with memory, leaving the final N + L − d samples in y uncorrupted and available for the DFT operation. The receiver can now avoid the frequency domain phase rotation penalty by selecting the final N samples in y, i.e. $\mathbf{y}_{(L+1):(N+L)}$, even when the channel has excess delay.
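The equivalence between offsetting the starting index (Eq. 165) and copying the last L samples to the front can be confirmed with a brief sketch. The symbol values, the choice L = 3, and the function name `ofdm_samples` are assumptions used only for this example.

```python
import cmath

def ofdm_samples(x, k_values):
    """Sum of weighted complex sinusoids evaluated at the given indices."""
    n = len(x)
    return [sum(x[m] * cmath.exp(2j * cmath.pi * m * k / n) for m in range(n))
            for k in k_values]

x = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, 1-1j, -1+1j]
N, L = len(x), 3

core = ofdm_samples(x, range(N))                           # k = 0 .. N-1
prefixed = ofdm_samples(x, [j - L for j in range(N + L)])  # k = j - L, j = 0 .. N+L-1
copied = core[-L:] + core                                  # last L samples moved to the front

prefix_error = max(abs(a - b) for a, b in zip(prefixed, copied))
print(f"max difference = {prefix_error:.2e}")
```

Because the two constructions agree to rounding error, a transmitter can form the prefix by copying samples rather than by evaluating additional sinusoid sums.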

Figure 81: Illustration of the Cyclic Prefix in an OFDM signal (plot title: CP Illustration in an OFDM Signal, y_k; real and imaginary parts of the sample value versus k)

Note that in the cyclic prefixing or suffixing operations, the additional samples can

be generated using no arithmetic operations, and due to the signal’s periodic nature,

as seen in Fig. 81, the added elements are simply copied from the end of the vector to

the beginning, and phase continuity is maintained.

The system model must be updated with the cyclic prefix addition and removal

at the transmitter and receiver, respectively. The cyclic prefix operation will be used

throughout the remainder of the discussion, as it is far more common than the suffix

operation. To maintain the standard matrix notation, the cyclic prefix can be added

and removed using specially designed permutation matrices. Let L be the CP length,

M be the DFT dimension, and P = M + L be the total length of y. The y vector is now

generated using

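\[ \mathbf{y} = \mathbf{Z}_T\mathbf{W}_M^H\mathbf{x} \qquad (166) \]

where

\[ \mathbf{Z}_T = \begin{bmatrix} \mathbf{0}_{L\times(M-L)} & \mathbf{I}_L \\ \multicolumn{2}{c}{\mathbf{I}_M} \end{bmatrix}, \qquad (167) \]

and finally,

\[ \mathbf{v}(k) = \mathbf{W}_M\mathbf{Z}_R \left( \begin{bmatrix} \mathbf{H}_0 & \mathbf{H}_1 \end{bmatrix} \begin{bmatrix} \mathbf{Z}_T & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}_T \end{bmatrix} \begin{bmatrix} \mathbf{W}_M^H & \mathbf{0}_M \\ \mathbf{0}_M & \mathbf{W}_M^H \end{bmatrix} \begin{bmatrix} \mathbf{x}(k-1) \\ \mathbf{x}(k) \end{bmatrix} + \mathbf{n}(k) \right), \qquad (168) \]

where

\[ \mathbf{Z}_R = \begin{bmatrix} \mathbf{0}_{M\times L} & \mathbf{I}_M \end{bmatrix}. \qquad (169) \]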
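The action of the two CP matrices can be illustrated with small integer examples. The sketch below builds Z_T and Z_R as defined in Eqs. 167 and 169 and applies them to a test vector; the sizes M = 6 and L = 2, the test values, and the function names are assumptions for illustration (the construction requires L ≤ M).

```python
def cp_insert_matrix(m, l):
    """Z_T: (m + l) x m matrix that places the last l inputs ahead of all m inputs."""
    top = [[1 if c == m - l + r else 0 for c in range(m)] for r in range(l)]
    bottom = [[1 if c == r else 0 for c in range(m)] for r in range(m)]
    return top + bottom

def cp_remove_matrix(m, l):
    """Z_R: m x (m + l) matrix that discards the first l inputs."""
    return [[1 if c == r + l else 0 for c in range(m + l)] for r in range(m)]

def matvec(a, v):
    """Matrix-vector product for list-of-rows matrices."""
    return [sum(a[r][c] * v[c] for c in range(len(v))) for r in range(len(a))]

M, L = 6, 2
y = [10, 11, 12, 13, 14, 15]

with_cp = matvec(cp_insert_matrix(M, L), y)
recovered = matvec(cp_remove_matrix(M, L), with_cp)

print(with_cp)     # [14, 15, 10, 11, 12, 13, 14, 15]
print(recovered)   # [10, 11, 12, 13, 14, 15]
```

Both matrices contain only zeros and ones, so in an implementation they reduce to copying and discarding samples.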

Now that the CP has been integrated into the system model and signals can now

be properly received in the presence of a channel with memory, the act of frequency-

domain equalization can be investigated so that the receiver can perform proper demap-

ping of the symbol constellation. To determine the matrix that equalizes the effect of

the channel, we must first investigate the inner workings of Eq. 168. For equalization,

the zero-forcing equalization matrix E must satisfy the following.

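\[ \mathbf{E}\mathbf{W}_M\mathbf{Z}_R \begin{bmatrix} \mathbf{H}_0 & \mathbf{H}_1 \end{bmatrix} \begin{bmatrix} \mathbf{Z}_T & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}_T \end{bmatrix} \begin{bmatrix} \mathbf{W}_M^H & \mathbf{0}_M \\ \mathbf{0}_M & \mathbf{W}_M^H \end{bmatrix} = \begin{bmatrix} \mathbf{0}_M & \mathbf{I}_M \end{bmatrix} \qquad (170) \]

Alternatively,

\[ \mathbf{E}\mathbf{W}_M\tilde{\mathbf{H}}_0\mathbf{W}_M^H = \mathbf{0}_M , \qquad (171) \]

\[ \mathbf{E}\mathbf{W}_M\tilde{\mathbf{H}}_1\mathbf{W}_M^H = \mathbf{I}_M , \qquad (172) \]

where

\[ \tilde{\mathbf{H}} = \begin{bmatrix} \tilde{\mathbf{H}}_0 & \tilde{\mathbf{H}}_1 \end{bmatrix} = \mathbf{Z}_R \begin{bmatrix} \mathbf{H}_0 & \mathbf{H}_1 \end{bmatrix} \begin{bmatrix} \mathbf{Z}_T & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}_T \end{bmatrix}. \qquad (173) \]

Due to the added CP elements, the $\mathbf{H}_0$ and $\mathbf{H}_1$ matrices each have the dimension $(M+L)\times(M+L)$. The addition and removal of the CP reduces them to the matrices $\tilde{\mathbf{H}}_0$ and $\tilde{\mathbf{H}}_1$, each of dimension $M\times M$, i.e.

\[ \tilde{\mathbf{H}}_0 = \mathbf{Z}_R\mathbf{H}_0\mathbf{Z}_T, \qquad \tilde{\mathbf{H}}_1 = \mathbf{Z}_R\mathbf{H}_1\mathbf{Z}_T \qquad (174) \]

Interestingly, if the CP accommodates the channel’s excess delay (e.g. d ≤ L), the CP addition and removal matrices cause $\tilde{\mathbf{H}}_1$ to be circulant, and $\tilde{\mathbf{H}}_0$ to be the zero matrix.

\[ \tilde{\mathbf{H}}_0 = \mathbf{0}_M, \qquad \tilde{\mathbf{H}}_1 = \begin{bmatrix} h_1 & 0 & \cdots & 0 & h_d & \cdots & h_2 \\ \vdots & \ddots & \ddots & & \ddots & \ddots & \vdots \\ \vdots & & \ddots & \ddots & & \ddots & h_d \\ h_d & & & \ddots & \ddots & & 0 \\ 0 & \ddots & & & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & & & \ddots & 0 \\ 0 & \cdots & 0 & h_d & \cdots & \cdots & h_1 \end{bmatrix} \qquad (175) \]

In the case of $\tilde{\mathbf{H}}_0$ being the zero matrix, the CP prevents energy from the previous symbol from influencing the CP-stripped portion of u(k). Now, because of the special property of the DFT, the circulant matrix $\tilde{\mathbf{H}}_1$ is diagonalized. The diagonal matrix D can be found from the simplified version of Eq. 168.

\[ \mathbf{D} = \mathbf{W}_M\mathbf{Z}_R\mathbf{H}_1\mathbf{Z}_T\mathbf{W}_M^H = \mathbf{W}_M\tilde{\mathbf{H}}_1\mathbf{W}_M^H = \mathrm{diag}\left\{ \sqrt{M}\,\mathbf{W}_M\tilde{\mathbf{H}}_1(:,1) \right\} \qquad (176) \]

Therefore, Eq. 170 can be reduced to

\[ \mathbf{E}\mathbf{D} = \mathbf{I}_M , \qquad (177) \]

and the equalization matrix can now be found

\[ \mathbf{E} = \mathbf{D}^{-1} = \mathrm{diag}\left\{ \frac{1}{\sqrt{M}\,\mathbf{W}_M\tilde{\mathbf{H}}_1(:,1)} \right\}, \qquad (178) \]

where the division is taken element by element, revealing the final OFDM system model equation.

\[ \mathbf{v}(k) = \mathbf{E}\mathbf{W}_M\mathbf{Z}_R \left( \begin{bmatrix} \mathbf{H}_0 & \mathbf{H}_1 \end{bmatrix} \begin{bmatrix} \mathbf{Z}_T & \mathbf{0} \\ \mathbf{0} & \mathbf{Z}_T \end{bmatrix} \begin{bmatrix} \mathbf{W}_M^H & \mathbf{0}_M \\ \mathbf{0}_M & \mathbf{W}_M^H \end{bmatrix} \begin{bmatrix} \mathbf{x}(k-1) \\ \mathbf{x}(k) \end{bmatrix} + \mathbf{n}(k) \right) \qquad (179) \]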

The OFDM system model in Eq. 179 is shown in Fig. 82 with added mapping and

demapping components. The equalizer component has been left out of this diagram

because it can be implemented in either the time or frequency domain. Typically, equal-

ization is performed in the frequency domain.
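The complete chain, including the cyclic prefix and a one-tap-per-subcarrier zero-forcing equalizer, can be exercised end to end with a small sketch. Two symbols are sent back to back through a channel with memory, the receiver strips the CP of the second symbol, applies the DFT, and divides each subcarrier by the channel's frequency response. The diagonal of D is computed here as the unnormalized DFT of the zero-padded tap vector (equivalently, √M times the orthonormal DFT of the first column of the circulant channel matrix). The taps, symbols, sizes, and function names are assumptions for illustration, and noise is omitted.

```python
import cmath
import math

def idft(x):
    n = len(x)
    return [sum(x[m] * cmath.exp(2j * cmath.pi * m * k / n) for m in range(n)) / math.sqrt(n)
            for k in range(n)]

def dft(y):
    n = len(y)
    return [sum(y[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n)) / math.sqrt(n)
            for m in range(n)]

def add_cp(y, l):
    """Prefix the last l samples of y."""
    return y[-l:] + y

def convolve(signal, taps):
    """Causal linear convolution, truncated to the input length."""
    return [sum(taps[t] * signal[i - t] for t in range(len(taps)) if i - t >= 0)
            for i in range(len(signal))]

def channel_gains(h, m):
    """Diagonal of D: frequency response of the taps on the m subcarriers."""
    return [sum(h[t] * cmath.exp(-2j * cmath.pi * p * t / m) for t in range(len(h)))
            for p in range(m)]

M, L = 8, 3
h = [0.9, 0.4 + 0.1j, 0.2]             # hypothetical taps, d = 3 <= L
x_prev = [1+1j, 1-1j, -1+1j, -1-1j, 1+1j, -1-1j, 1-1j, -1+1j]
x_cur = [1-1j, 1+1j, -1-1j, -1+1j, 1+1j, 1-1j, -1+1j, -1-1j]

stream = add_cp(idft(x_prev), L) + add_cp(idft(x_cur), L)
received = convolve(stream, h)

P = M + L
u_k = received[P:2 * P]                # samples belonging to the current symbol
v = dft(u_k[L:])                       # remove CP, then forward DFT
D = channel_gains(h, M)
equalized = [vi / di for vi, di in zip(v, D)]

raw_error = max(abs(vi - xi) for vi, xi in zip(v, x_cur))
eq_error = max(abs(ei - xi) for ei, xi in zip(equalized, x_cur))
print(f"before equalization: {raw_error:.3f}, after: {eq_error:.2e}")
```

Because the CP is at least as long as the channel memory, the previous symbol does not affect the result, and a single complex division per subcarrier restores the transmitted constellation.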

Figure 82: Generic OFDM System Model (Equalization Component not Shown). Blocks recoverable from the diagram: mapper, serial-to-parallel, W^H, add cyclic prefix, parallel-to-serial, channel, serial-to-parallel, remove cyclic prefix, W, demapper, parallel-to-serial; labeled signals: b, x, y, u, v, c.


References

[1] S. Cherry, “Edholm’s law of bandwidth,” Spectrum, IEEE, vol. 41, no. 7, pp. 58–60, Jul. 2004.

[2] M. Plumb, “Fantastic 4G,” Spectrum, IEEE, vol. 49, no. 1, pp. 51–53, Jan. 2012.

[3] S. Cherry, “4G in the U.S.A.” Spectrum, IEEE, vol. 47, no. 1, p. 15, Jan. 2010.

[4] R. Nee and R. Prasad, OFDM for wireless multimedia communications, ser. Artech House universal personal communications series. Artech House, 2000.

[5] R. Prasad, OFDM for wireless communications systems, ser. Artech House universal personal communications series. Artech House, 2004.

[6] T. Chiueh and P. Tsai, OFDM baseband receiver design for wireless communications. John Wiley and Sons (Asia), 2007.

[7] Y. Lin, S. Phoong, and P. Vaidyanathan, Filter Bank Transceivers for OFDM and DMT Systems. Cambridge University Press, 2010.

[8] T. Pollet and M. Moeneclaey, “Synchronizability of OFDM signals,” in Global Telecommunications Conference, 1995. GLOBECOM ’95., IEEE, vol. 3, 1995, pp. 2054–2058.

[9] Y. Mostofi and D. Cox, “Mathematical analysis of the impact of timing synchronization errors on the performance of an OFDM system,” Communications, IEEE Transactions on, vol. 54, no. 2, pp. 226–230, Feb. 2006.

[10] T. Schmidl and D. Cox, “Robust frequency and timing synchronization for OFDM,” Communications, IEEE Transactions on, vol. 45, pp. 1613–1621, 1997.

[11] T. Schmidl, “Synchronization algorithms for wireless data transmission using orthogonal frequency division multiplexing (OFDM),” Ph.D. dissertation, Stanford University, USA, 1997.

[12] D. Lee and K. Cheun, “A new symbol timing recovery algorithm for OFDM systems,” Consumer Electronics, IEEE Transactions on, vol. 43, pp. 767–775, 1997.

[13] J. van de Beek, M. Sandell, and P. Borjesson, “ML estimation of time and frequency offset in OFDM systems,” Signal Processing, IEEE Transactions on, vol. 45, no. 7, pp. 1800–1805, Jul. 1997.

[14] M. Hayes, Statistical digital signal processing and modeling. John Wiley & Sons, 1996.

[15] A. Sayed, Adaptive Filters. Wiley-Interscience, 2008.



[16] E. Briggs, B. Nutter, and D. McLane, “Sample clock offset detection and correction in the LTE downlink,” Journal of Signal Processing Systems, pp. 1–9, 2011, 10.1007/s11265-011-0643-5. [Online]. Available: http://dx.doi.org/10.1007/s11265-011-0643-5

[17] E. Briggs, C. Kang, A. Mane, B. Nutter, and D. McLane, “Sample clock offset detection and correction in the LTE downlink receiver,” in European Wireless Innovation Forum Conference, Jun. 2011.

[18] T. Pollet, P. Spruyt, and M. Moeneclaey, “The BER performance of OFDM systems using non-synchronized sampling,” in Global Telecommunications Conference, 1994. GLOBECOM ’94. Communications: The Global Bridge., IEEE, Dec. 1994, pp. 253–257 vol.1.

[19] E. del Castillo-Sanchez, F. Lopez-Martinez, E. Martos-Naya, and J. Entrambasaguas, “Joint Time, Frequency and Sampling Clock Synchronization for OFDM-Based Systems,” in Wireless Communications and Networking Conference, 2009. WCNC 2009. IEEE, 2009, pp. 1–6.

[20] 3GPP. (2010) Physical channels and modulation. [Online]. Available: http://www.3gpp.org/ftp/Specs/archive/36_series/36.211/36211-890.zip

[21] M. Mansour, “Optimized architecture for computing Zadoff-Chu sequences with application to LTE,” in Global Telecommunications Conference, 2009. GLOBECOM 2009. IEEE, Dec. 2009, pp. 1–6.

[22] K. Manolakis, D. Gutierrez Estevez, V. Jungnickel, W. Xu, and C. Drewes, “A closed concept for synchronization and cell search in 3GPP LTE systems,” in Wireless Communications and Networking Conference, 2009. WCNC 2009. IEEE, Apr. 2009, pp. 1–6.

[23] S. Sesia, M. Baker, and I. Toufik, LTE, The UMTS Long Term Evolution: From Theory to Practice. Wiley, 2009.

[24] A. Oppenheim and R. Schafer, Discrete-time signal processing, ser. Prentice-Hall signal processing series. Prentice Hall, 2010.

[25] F. Harris, Multirate Signal Processing for Communications Systems. Prentice Hall PTR, 2004.

[26] Xilinx. (2012) Xilinx Virtex 7 DSP48E1 slice user’s guide. [Online]. Available: http://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf

[27] Cray. (2012) Cray history. [Online]. Available: http://www.cray.com/About/History.aspx

[28] Xilinx. (2012) Fast Fourier transform v8.0. [Online]. Available: http://www.xilinx.com/support/documentation/ip_documentation/ds808_xfft.pdf



[29] T. Pollet, M. Van Bladel, and M. Moeneclaey, “BER sensitivity of OFDM systems to carrier frequency offset and Wiener phase noise,” Communications, IEEE Transactions on, vol. 43, pp. 191–193, 1995.

[30] P. Moose, “A technique for orthogonal frequency division multiplexing frequency offset correction,” Communications, IEEE Transactions on, vol. 42, pp. 2908–2914, 1994.

[31] H. Minn, “A robust timing and frequency synchronization for OFDM systems,” Wireless Communications, IEEE Transactions on, vol. 2, pp. 822–839, 2003.

[32] J.-J. van de Beek, O. Edfors, M. Sandell, S. Wilson, and P. Borjesson, “On channel estimation in OFDM systems,” in Vehicular Technology Conference, 1995 IEEE 45th, vol. 2, Jul. 1995, pp. 815–819 vol.2.

[33] V. Srivastava, C. K. Ho, P. H. W. Fung, and S. Sun, “Robust MMSE channel estimation in OFDM systems with practical timing synchronization,” in Wireless Communications and Networking Conference, 2004. WCNC. 2004 IEEE, vol. 2, Mar. 2004, pp. 711–716 Vol.2.

[34] M. Noh, Y. Lee, and H. Park, “Low complexity LMMSE channel estimation for OFDM,” Communications, IEE Proceedings-, vol. 153, no. 5, pp. 645–650, Oct. 2006.

[35] X. Hou, S. Li, C. Yin, and G. Yue, “Two-dimensional recursive least square adaptive channel estimation for OFDM systems,” in Wireless Communications, Networking and Mobile Computing, 2005. Proceedings. 2005 International Conference on, vol. 1, 2005, pp. 232–236.

[36] C. Rom, “Physical layer parameter and algorithm study in a downlink OFDM-LTE context,” Ph.D. dissertation, Radio Access Technology Section, Department of Electronic Systems, Aalborg University, Denmark, 2008.

[37] R. Duda, P. Hart, and D. Stork, Pattern classification, ser. Pattern Classification and Scene Analysis: Pattern Classification. Wiley, 2001.

[38] C. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[39] A. Mertins, Signal Analysis: Wavelets, Filter Banks, Time-Frequency Transforms and Applications, ser. Ultrasound in Biomedicine Research Series. Wiley, 1999.

[40] P. Wand and C. Jones, Kernel Smoothing, ser. Monographs on Statistics and Applied Probability. Taylor & Francis, 1994.

[41] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery, Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.

[42] 3GPP. (2010) User Equipment (UE) Radio Transmission and Reception. [Online]. Available: http://www.3gpp.org/ftp/Specs/archive/36_series/36.101/36101-8h0.zip



[43] T. S. Rappaport, Wireless Communications: Principles and Practice, ser. Prentice Hall communications engineering and emerging technologies. Pearson Education, 2009.

[44] M. Jeruchim, P. Balaban, and K. Shanmugan, Simulation of Communication Systems: Modeling, Methodology, and Techniques, ser. Information technology: transmission, processing, and storage. Kluwer Academic/Plenum Publishers, 2000.

[45] C. de Boor, A Practical Guide to Splines, ser. Applied Mathematical Sciences. Springer, 2001, no. v. 27.

[46] L. Vilinis, “High resolution spectral analysis by using basis function adaptation approach,” Ph.D. dissertation, Univ. of Latvia, Riga (Latvia). Inst. of Electronics and Computer Science, 1997.

[47] L. Vilinis. (2012) Extended DFT. [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/11020-extended-dft

[48] T. Kailath and A. Sayed, Fast reliable algorithms for matrices with structure. Society for Industrial and Applied Mathematics, 1999.

[49] G. Golub and C. Van Loan, Matrix Computations, ser. Johns Hopkins Studies in the Mathematical Sciences. Johns Hopkins University Press, 1996.

[50] F. Harris, “Performance and design of Farrow filter used for arbitrary resampling,” in Digital Signal Processing Proceedings, 1997. DSP 97., 1997 13th International Conference on, vol. 2, Jul. 1997, pp. 595–599 vol.2.

[51] C. Farrow, “A continuously variable digital delay element,” in Circuits and Systems, 1988., IEEE International Symposium on, Jun. 1988, pp. 2641–2645 vol.3.

[52] L. Wu-Sheng and D. Tian-Bo, “An improved weighted least-squares design for variable fractional delay FIR filters,” Circuits and Systems II, IEEE Transactions on, vol. 46, no. 8, pp. 1035–1040, Aug. 1999.

[53] T. Palenik and P. Farkas, “Exploiting cyclic prefix redundancy in OFDM to improve performance of Tanner: graph based decoding,” Analog Integrated Circuits and Signal Processing, vol. 69, pp. 143–152, 2011, 10.1007/s10470-011-9662-1. [Online]. Available: http://dx.doi.org/10.1007/s10470-011-9662-1

[54] J. Beek, “Synchronization and channel estimation in OFDM systems,” Ph.D. dissertation, Luleå Univ. of Technology, Division of Signal Processing, 1998.

[55] M. Fernandez-Getino Garcia, J. Paez-Borrallo, and S. Zazo, “DFT-based channel estimation in 2D-pilot-symbol-aided OFDM wireless systems,” in Vehicular Technology Conference, 2001. VTC 2001 Spring. IEEE VTS 53rd, vol. 2, 2001, pp. 810–814.

[56] W. Jakes, Microwave Mobile Communications. John Wiley & Sons Inc, 1974.



[57] M. Pätzold, Mobile Fading Channels. J. Wiley, 2002.

[58] M. Pätzold, R. Garcia, and F. Laue, “Design of High-Speed Simulation Models for Mobile Fading Channels by Using Table Look-up Techniques,” Vehicular Technology, IEEE Transactions on, vol. 49, no. 4, pp. 1178–1190, Jul. 2000.

[59] C. Gutiérrez and M. Pätzold, “On the correlation and ergodic properties of the squared envelope of SOC Rayleigh fading channel simulators,” Wireless Personal Communications, pp. 1–17, 2012, 10.1007/s11277-011-0493-2. [Online]. Available: http://dx.doi.org/10.1007/s11277-011-0493-2

[60] A. Alimohammad, S. Fard, B. Cockburn, and C. Schlegel, “A Novel Technique for Efficient Hardware Simulation of Spatiotemporally Correlated MIMO Fading Channels,” in Communications, 2008. ICC ’08. IEEE International Conference on, May 2008, pp. 718–724.

[61] F. Ren and Y. Zheng, “Hardware Emulation of Wideband Correlated Multiple-Input Multiple-Output Fading Channels,” Journal of Signal Processing Systems, vol. 66, pp. 273–284, 2012.

[62] E. Briggs, D. McLane, and B. Nutter, “A Real-Time Multi-Path Fading Channel Emulator Developed for LTE Testing,” in Wireless Innovation Forum Conference, Dec. 2011.

[63] E. Briggs, T. Karp, B. Nutter, and D. McLane, “A system architecture for real-time multi-path MIMO fading channel emulation,” in European Wireless Innovation Forum Conference, June 2012.

[64] C. Dick and F. Harris, “Options for Arbitrary Resamplers in FPGA-Based Modulators,” in Signals, Systems and Computers, 2004. Conference Record of the Thirty-Eighth Asilomar Conference on, vol. 1, Nov. 2004, pp. 777–781 Vol.1.

[65] C. Dick and F. Harris, “On the structure, performance, and applications of recursive all-pass filters with adjustable and linear group delay,” in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 2, May 2002, pp. II–1517–II–1520.

[66] R. Valenzuela and A. Constantinides, “Digital signal processing schemes for efficient interpolation and decimation,” Electronic Circuits and Systems, IEE Proceedings G, vol. 130, no. 6, pp. 225–235, Dec. 1983.

[67] F. Harris, M. d’Oreye de Lantremange, and A. Constantinides, “Design and implementation of efficient resampling filters using polyphase recursive all-pass filters,” in Signals, Systems and Computers, 1991. 1991 Conference Record of the Twenty-Fifth Asilomar Conference on, Nov. 1991, pp. 1031–1036 vol.2.

[68] D. Gesbert, M. Shafi, D. shan Shiu, P. Smith, and A. Naguib, “From theory to practice: an overview of MIMO space-time coded wireless systems,” Selected Areas in Communications, IEEE Journal on, vol. 21, no. 3, pp. 281–302, 2003.



[69] G. J. Foschini and M. J. Gans, “On Limits of wireless communications in a fading environment when using multiple antennas,” Selected Areas in Communications, IEEE Journal on, vol. 6, pp. 311–335, 1998.

[70] J. Kermoal, L. Schumacher, K. Pedersen, P. Mogensen, and F. Frederiksen, “A Stochastic MIMO Radio Channel Model with Experimental Validation,” Selected Areas in Communications, IEEE Journal on, vol. 20, no. 6, pp. 1211–1226, Aug. 2002.

[71] K. Pedersen, J. Andersen, J. Kermoal, and P. Mogensen, “A Stochastic Multiple-Input-Multiple-Output Radio Channel Model for Evaluation of Space-Time Coding Algorithms,” in Vehicular Technology Conference, 2000. IEEE VTS-Fall VTC 2000. 52nd, vol. 2, 2000, pp. 893–897 vol.2.

[72] Y. Kai, M. Bengtsson, B. Ottersten, D. McNamara, P. Karlsson, and M. Beach, “Second Order Statistics of NLOS Indoor MIMO Channels Based on 5.2 GHz Measurements,” in Global Telecommunications Conference, 2001. GLOBECOM ’01. IEEE, vol. 1, 2001, pp. 156–160, vol. 1.

[73] A. Sayeed, “Deconstructing multiantenna fading channels,” Signal Processing, IEEE Transactions on, vol. 50, no. 10, pp. 2563–2579, 2002.

[74] W. Weichselberger, M. Herdin, H. Ozcelik, and E. Bonek, “A stochastic MIMO channel model with joint correlation of both link ends,” Wireless Communications, IEEE Transactions on, vol. 5, no. 1, pp. 90–100, 2006.

[75] M. Bellanger and J. Daguet, “TDM-FDM transmultiplexer: Digital polyphase and FFT,” Communications, IEEE Transactions on, vol. 22, no. 9, pp. 1199–1205, Sept. 1974.

[76] B. Farhang-Boroujeny, “OFDM versus filter bank multicarrier,” Signal Processing Magazine, IEEE, vol. 28, no. 3, pp. 92–112, May 2011.

[77] B. Hirosaki, “An orthogonally multiplexed QAM system using the discrete Fourier transform,” Communications, IEEE Transactions on, vol. 29, no. 7, pp. 982–989, Jul. 1981.

[78] S. Weinstein and P. Ebert, “Data transmission by frequency-division multiplexing using the discrete Fourier transform,” Communication Technology, IEEE Transactions on, vol. 19, no. 5, pp. 628–634, Oct. 1971.

[79] G. Cherubini, E. Eleftheriou, S. Oker, and J. Cioffi, “Filter bank modulation techniques for very high speed digital subscriber lines,” Communications Magazine, IEEE, vol. 38, no. 5, pp. 98–104, May 2000.

[80] S. Weiss, “Transforms and filter banks for computationally inexpensive implementations,” Steepest Ascent Lecture Notes, 2008.



[81] S. Mitra, Digital signal processing: a computer based approach. McGraw-Hill Higher Education, 2005.

[82] T. Karp, S. Trautmann, and N. Fliege, “Zero-forcing frequency-domain equalization for generalized DMT transceivers with insufficient guard interval.” EURASIP J. Adv. Sig. Proc., pp. 1446–1459, 2004.
