Learning from Data in Radio Algorithm Design

Timothy James O’Shea

Dissertation submitted to the Faculty of the

Virginia Polytechnic Institute and State University

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Electrical Engineering

T. Charles Clancy

Robert W. McGwier

Narendran Ramakrishnan

Sanjay Raman

Jeffrey Reed

Oct 26th, 2017

Arlington, Virginia

Keywords: deep learning, radio, physical layer, software radio, machine learning, neural networks, sensing, communications system design, modulation, coding

Copyright 2017, Timothy James O’Shea



ABSTRACT

Algorithm design methods for radio communications systems are poised to undergo a massive disruption over the next several years. Today, such algorithms are typically designed manually using compact analytic problem models. However, they are shifting increasingly to machine learning based methods using approximate models with high degrees of freedom, jointly optimized over multiple subsystems, and using real-world data to drive design which may have no simple compact probabilistic analytic form.

Over the past five years, this change has already begun occurring at a rapid pace in several fields. Computer vision tasks led deep learning, demonstrating that low level features and entire end-to-end systems could be learned directly from complex imagery datasets, when a powerful collection of optimization methods, regularization methods, architecture strategies, and efficient implementations were used to train large models with high degrees of freedom.

Within this work, we demonstrate that this same class of end-to-end deep neural network based learning can be adapted effectively for physical layer radio systems in order to optimize for sensing, estimation, and waveform synthesis systems to achieve state of the art levels of performance in numerous applications.

First, we discuss the background and fundamental tools used, then discuss effective strategies and approaches to model design and optimization. Finally, we explore a series of applications across estimation, sensing, and waveform synthesis where we apply this approach to reformulate classical problems and illustrate the value and impact this approach can have on several key radio algorithm design problems.



GENERAL AUDIENCE ABSTRACT

Radio communications and sensing systems are used pervasively in everyday life in the modern world to connect phones, computers, smart devices, industrial devices, internet services, space systems, emergency and military users, radar systems, interference monitoring systems, defense electronic systems, and others. Optimizing these systems to function together reliably and efficiently in an ever more complex world is becoming increasingly hard and impractical.

Our work introduces a new and radically different method for the design of radio systems by casting them in a new way as artificial intelligence problems relying on the field of machine learning called deep learning to find and optimize their design. We detail and demonstrate the first such deep learning based communications and sensing systems operating on raw radio signals and quantify their performance when compared to existing methods, showing them to be competitive with and in some cases significantly better performing than state of the art systems today.

These ideas, and the evidence of their viability, are central to the emerging field of machine learning communications systems, and will help to make tomorrow’s wireless systems faster, cheaper, more reliable, more adaptive, more efficient, and lower power than currently possible. In a world of ever increasing complexity and connectedness, this new approach to wireless system design from data using machine learning offers a powerful new strategy to improve systems by directly leveraging the complexity in real world data and experience to find efficiencies where current day approaches and insufficient simplified models and design tools cannot.

Acknowledgments

Thank you to all my current and former colleagues at Virginia Tech, NC State, Bell Labs, the US Government, the GNU Radio community, and industry who supported, critiqued, mentored, collaborated, co-authored and discussed countless ideas surrounding software radio, cognitive radio, and deep learning, especially my advisor Charles Clancy, who has been a constant source of support and inspiration, and has provided me with significant freedom to explore new and disruptive ideas.

I am also very grateful to the individuals and organizations who have supported me and my work throughout my studies, including VT, DeepSig, DARPA, NSF, DOD, LM, Hawkeye360, Federated Wireless and others who made much of this possible.

Dedication

This work is dedicated to my family, friends, colleagues, mentors, sponsors and research inspirations, all of whom have supported me and contributed to this work in countless immeasurable ways for which I am extremely grateful.

More abstractly, this work is dedicated to engineering as a creative discipline. While many engineering fields have become complex and tedious, end-to-end learning based approaches to design offer to relieve some of the tedium and slow progress surrounding the field today.

It is my sincere hope that the future of engineering will become more of a creative outlet for experimentalists, contrarians, pragmatists and makers, and that the expansion of machine learning will empower all people to create and to view engineering in a positive, fun, and creative light, as an artform accessible to all rather than as the obscure, slow moving, and specialized field that it can sometimes seem today.

Contents

1 Introduction

1.1 Chasing Optimality in Communication System Design

1.2 Neural Networks in Radio System Design

1.3 Implications, Trends and Challenges in Deep Learning

1.4 Deep Cognitive Radio Systems

2 Background

2.1 Radio Signal Processing

2.1.1 Digital Communications

2.1.2 Radio Channel Models

2.2 Cognitive Radio

2.2.1 Sensing Techniques

2.2.2 Control Modeling

2.3 Deep Learning Models

2.3.1 Error Feedback and Objectives

2.3.2 Network Model Primitives

2.3.3 Regularization

2.3.4 Architectural Strategies

2.3.5 High Performance Computing

2.3.6 Model Search

2.3.7 Model Introspection

3 Learning to Communicate

3.1 The Channel Autoencoder

3.2 Learning to Synchronize with Attention

3.3 Multi-User Interference Channel

3.4 Learning Multi-Antenna Diversity Channels

3.5 Learning MIMO with CSI Feedback

3.6 System Identification Over the Air

4 Learning to Label the Radio Spectrum

4.1 Learning Estimators from Data

4.2 Learning to Identify Modulation Types

4.2.1 Expert Features for Modulation Recognition (Baseline)

4.2.2 Time series Modulation Classification With CNNs

4.2.3 Deep Residual Network Time-series Modulation Classification

4.3 Learning to Identify Radio Protocols

4.4 Learning to Detect Signals

5 Learning Radio Structure

5.1 Unsupervised Structure Learning

5.2 Unsupervised Class Discovery

5.3 Neural Network Model Discovery and Optimization

6 Conclusion

6.1 Publication List

Bibliography

List of acronyms

ACF auto-correlation function

ADC analog-to-digital converter

AE autoencoder

AI artificial intelligence

AM amplitude modulation

ANN artificial neural network

ARP address resolution protocol

AWGN additive white Gaussian noise

BCE binary cross-entropy

BER bit error rate

BLER block error rate

BPSK binary phase shift keying

CAF cross ambiguity function

CCE categorical cross-entropy

CFO carrier frequency offset

CNN convolutional neural network

CQI channel quality information

CR cognitive radio

CSI channel state information

CUDA Compute Unified Device Architecture

DL deep learning

DAC digital-to-analog converter

DNN deep neural network

DNS domain name system

DOF degrees of freedom

DSA dynamic spectrum access

DSP digital signal processing

DTree decision tree

EM electromagnetic

FEC forward error correction

FFT fast Fourier transform

FLOPS floating point operations per second

FM frequency modulation

FSK frequency shift keying

FV Fisher Vector

GF Galois field

GR GNU Radio

GRU gated recurrent unit

GPGPU general purpose graphics processing unit

GPU graphics processing unit

GMR ground mobile radio

HMM hidden Markov model

HOC higher order cumulants

HOS higher order statistic

HOM higher order moment

I/Q In-phase and Quadrature

ICA independent component analysis

IEEE Institute of Electrical and Electronics Engineers

IID independent and identically distributed

IOU Intersection over union

ISM industrial, scientific, and medical radio

ISI inter-symbol interference

LDPC low density parity check

LO local oscillator

LOS line of sight

LTE long term evolution

LSTM long short-term memory

LTI linear time invariant

MAE mean absolute error

MAP maximum a posteriori

MF matched filter

MFCC Mel-frequency cepstral coefficient

MIMO multiple-input multiple-output

ML machine learning

MLD maximum likelihood

MLE maximum likelihood estimation

MLSP machine learning for signal processing

MMSE minimum mean square error

MNIST Modified National Institute of Standards and Technology

MRSA mean-response scaled initializations

MU multi-user

NNSP neural networks for signal processing

MSE mean squared error

NLP natural language processing

NN neural network

OFDM orthogonal frequency-division multiplexing

OODA observe orient decide act

OTA over-the-air

PAPR peak to average power ratio

PCA principal component analysis

PHY physical layer

PPM parts per million

PPB parts per billion

PSK phase-shift keying

QAM quadrature amplitude modulation

QRNN quasi-recurrent neural network

QoS quality of service

QPSK quadrature phase shift keying

R-CNN region-based convolutional neural network

ReLU rectified linear unit

ResNet residual network

RF radio frequency

RFIC radio frequency integrated circuit

RNN recurrent neural network

ROC receiver operating characteristic

RRC root-raised cosine

RTN radio transformer network

SCF spectral correlation function

SDR software-defined radio

SGD stochastic gradient descent

SELU scaled exponential linear units

SIC successive interference cancellation

SIFT scale-invariant feature transform

SNR signal-to-noise ratio

SRO symbol rate offset

STN spatial transformer network

SoC system-on-chip

STBC space-time block code

SVM support vector machine

t-SNE t-distributed stochastic neighbor embedding

TS time-slotted

USRP universal software radio peripheral

YOLO you only look once

ZF zero forcing

List of Figures

2.1 Direct Conversion Radio Front-End Architecture

2.2 Impulse Response Plots of Varying Delay Spreads

2.3 A single fully connected neuron

2.4 A simple 1D 2-long 2-filter convolutional layer

2.5 A sequence of 2D convolutional layers from AlexNet [1]

2.6 An example dilated convolution structure from WaveNet [2]

2.7 Dropout effect on network connectivity, from [3]

2.8 Example Effect of Dropout on Training and Validation Loss

2.9 A single residual network unit, from [4]

2.10 An exemplary residual network stack, from [4]

2.11 Spatial transformer network structure, from [5]

2.12 Single threading ceiling illustrated, from [6]

2.13 Concurrent GPU vs CPU compute architecture scaling (2017), from [7]

2.14 Evolutionary performance of image classifier search, from [8]

2.15 Layer 1 and 2 filter weights from CNN trained on ImageNet, from [9]

2.16 Filter activation visualization in convolutional neural networks (CNNs), from [9]

2.17 Optimization of input images for feature activation, from [10]

2.18 GradCAM Saliency Maps for Dogs and Cats, from [11]

2.19 Information theoretic visualization of deep learning, from [12]

3.1 Illustration of the many modular algorithms present in a modern wireless physical layer modem such as long term evolution (LTE)

3.2 The Fundamental Communications Learning Problem

3.3 A simple autoencoder for a 2D MNIST image, from [14]

3.4 A Simple Channel Autoencoder

3.5 BLER versus Eb/N0 for autoencoder

3.6 BLER versus Eb/N0 for autoencoder

3.7 Constellations produced by autoencoders using parameters (n, k): (a) (2, 2), (b) (2, 4), (c) (2, 4) with average power constraint, (d) (7, 4) 2-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of received symbols.

3.8 Learned QAM Modes for Example Mean Power (EMP)

3.9 Learned QAM Modes for Batch Mean Power (BMP)

3.10 Learned QAM Modes for Batch Mean Amplitude (BMA)

3.11 Learned QAM Modes for Batch Mean Max Power (BMMP)

3.12 Learned 4-Symbol QAM Modes using BMA for 2-bit, 4-bit, and 8-bit

3.13 Spatial Transformer Example on MNIST Digit from [5]

3.14 Radio Transformer Network Architecture

3.15 Autoencoder training loss with and without RTN

3.16 BLER versus Eb/N0 for various communication schemes over a channel with L = 3 Rayleigh fading taps

3.17 The two-user interference channel seen as a combination of two interfering autoencoders that try to reconstruct their respective messages

3.18 Block error rate (BLER) versus Eb/N0 for the two-user interference channel achieved by the autoencoder (AE) and 2^{2k/n}-quadrature amplitude modulation (QAM) time-slotted (TS) for different parameters (n, k)

3.19 Learned constellations for the two-user interference channel with parameters (a) (1, 1), (b) (2, 2), (c) (4, 4), and (d) (4, 8). The constellation points of Transmitter 1 and 2 are represented by red dots and black crosses, respectively.

3.20 Open Loop MIMO Channel Autoencoder Architecture

3.21 Alamouti Coding Scheme for 2x1 Open Loop MIMO

3.22 Error Rate Performance of Learned Diversity Scheme.

3.23 2x1 MIMO AE, Diagonal H

3.24 2x1 MIMO AE, Random H

3.25 Closed Loop MIMO Learning Autoencoder Architecture

3.26 Error Rate Performance of Learned 2x2 Scheme (Perfect CSI).

3.27 Closed Loop MIMO Autoencoder with Quantized Feedback

3.28 Bit Error Rate Performance of Baseline ZF Method

3.29 Bit Error Rate Performance Comparison of MIMO Autoencoder 2x2 Closed-Loop Scheme with Quantized CSI

3.30 Learned 2x2 Scheme 1-bit CSI Random Channels.

3.31 Learned 2x2 Scheme 1-bit CSI All-Ones Channel.

3.32 Learned 2x2 Scheme 2-bit CSI Random Channels.

3.33 Learned 2x2 Scheme 2-bit CSI All-Ones Channel.

3.34 Deployment Configuration for Quantized MIMO Autoencoder

4.1 CFO Expert Estimator Power Spectrum with simulated 2500 Hz offset

4.2 Timing Estimation MAE Comparison

4.3 Mean CFO Estimation Absolute Error for AWGN Channel

4.4 Mean CFO Estimation Absolute Error (Fading σ=0.5)

4.5 Mean CFO Estimation Absolute Error (Fading σ=1)

4.6 Mean CFO Estimation Absolute Error (Fading σ=2)

4.7 Traditional Approach to Modulation Recognition, from [15]

4.8 10 Modulation CNN performance comparison of accuracy vs signal-to-noise ratio (SNR)

4.9 Confusion matrix of the CNN (SNR = 10 dB)

4.10 System for modulation recognition dataset signal generation and synthetic channel impairment modeling

4.11 Over the air capture system diagram

4.12 Picture of over the air lab capture and training system

4.13 Example graphic of high level feature learning based residual network architecture for modulation recognition

4.14 Complex time domain examples of 24 modulations from the dataset at simulated 10 dB Eb/N0 and ℓ = 256

4.15 Complex time domain examples of 24 modulations over the air at high SNR and ℓ = 256

4.16 Complex constellation examples of 24 modulations from the dataset at simulated 10 dB Eb/N0 and ℓ = 256

4.17 Complex time domain examples of 24 modulations from the dataset at simulated 0 dB Eb/N0 and ℓ = 256

4.18 11-Modulation normal dataset performance comparison (N=1M)

4.19 24-Modulation difficult dataset performance comparison (N=240k)

4.20 Residual unit and residual stack architectures

4.21 ResNet performance under various channel impairments (N=240k)

4.22 Baseline performance under channel impairments (N=240k)

4.23 Comparison models under LO impairment

4.24 ResNet performance vs depth (L = number of residual stacks)

4.25 Modrec performance vs modulation type (ResNet on synthetic data with N=1M, σ_clk = 0.0001)

4.26 24-modulation confusion matrix for ResNet trained and tested on synthetic dataset with N=1M, additive white Gaussian noise (AWGN), and SNR ≥ 0 dB

4.27 Performance vs training set size (N) with ℓ = 1024

4.28 24-modulation confusion matrix for ResNet trained and tested on synthetic dataset with N=1M and σ_clk = 0.0001

4.29 Performance vs example length in samples (ℓ)

4.30 24-modulation confusion matrix for ResNet trained and tested on OTA examples with SNR ∼ 10 dB

4.31 ResNet transfer learning OTA performance

4.32 24-modulation confusion matrix for ResNet trained on synthetic σ_clk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (prior to fine-tuning)

4.33 24-modulation confusion matrix for ResNet trained on synthetic σ_clk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (after fine-tuning)

4.34 Transfer function of the LSTM unit, from [16]

4.35 Best LSTM256 confusion with RNN length of 512 time-steps

4.36 Detection Algorithm Trade-space Sensitivity vs Specialization

4.37 Computer Vision CNN-based Object Detection Trade Space, from [17]

4.38 Example bounding box detections in computer vision, from [17]

4.39 YOLO style per-grid-cell bounding box regression targets

4.40 Radio bounding box detection examples, from [18]

4.41 Over the air wideband signal bounding box prediction example

5.1 Example Radio Communications Basis Functions

5.2 Convolutional Autoencoder Architecture for Signal Compression

5.3 Convolutional Autoencoder reconstruction of QPSK example 1

5.4 Convolutional Autoencoder reconstruction of QPSK example 2

5.5 AE Encoder Filter Weights

5.6 AE Decoder Filter Weights

5.7 Supervised Embedding Approach

5.8 Unsupervised Embedding Approach

5.9 Supervised Signal Embeddings

5.10 Unsupervised Signal Embeddings

5.11 Compact Model Network Digraph and Hyper-Parameter Search Process

5.12 EvolNN ModRec Net Search Accuracy

5.13 EvolNN MNIST Net Search Accuracy

5.14 EvolNN CFO estimation network search loss

List of Tables

2.1 List of widely used neural network (NN) optimization loss functions

2.2 List of activation functions

3.1 Layout of the autoencoder used in Figs. 3.6 and 3.5. It has (2M + 1)(M + n) + 2M trainable parameters, resulting in 62, 791, and 135,944 parameters for the (2,2), (7,4), and (8,8) autoencoder, respectively.

3.2 Candidate channel autoencoder transmit normalization functions

3.3 Layout of the multi-user autoencoder model

4.1 ANN Architecture Used for CFO Estimation

4.2 ANN Architecture Used for Timing Estimation

4.3 Layout for our 10 modulation CNN modulation classifier

4.4 Random Variable Initialization

4.5 Features Used

4.6 CNN Network Layout

4.7 ResNet Network Layout

4.8 Protocol traffic classes considered for classification

4.9 Recurrent network architecture used for network traffic classification

4.10 Performance measurements for RNN protocol classification for varying sequence lengths

4.11 Table input/output shapes

5.1 Final small MNIST search CNN network

5.2 Final Modrec search CNN network

Chapter 1

Introduction

Algorithms in radio signal processing have advanced drastically over the past hundred years. Today’s radio physical layer has evolved to become a complex collection of highly specialized disciplines of research. Forward error correction (FEC), channel state information (CSI) estimation, equalization, multi-carrier modulation, multi-antenna transmission schemes, and numerous other specific areas of research have each become mature research fields in which many people specialize and achieve small incremental improvements within highly compartmentalized, modular, and specialized subsystems.

Meanwhile, deep learning (DL) [19] has been rapidly disrupting numerous algorithmic information processing fields by re-thinking problems as end-to-end optimization problems rather than as collections of highly specialized hand tailored subsystem models. Many problems in wireless communications are ripe for this form of high level rethinking in the context of end-to-end system optimization, and a new class of optimization tools offers the possibility to cope with system complexities and degrees of freedom which were previously intractable for direct complete-system optimization.

Throughout this work, we consider many current models and approaches to wireless communications, the application of neural networks to radio signal processing, recent advances in large scale network optimization behind deep learning, and disruptive applications and ways these techniques can fundamentally change how communications systems are designed.

Throughout this work, we motivate embracing wireless signal processing problems as data centric machine learning problems, demonstrating the significant potential of end-to-end learning approaches which can be used in contrast to more traditional simplified analytic subsystem model driven approaches. While ultimately some mix of the two is currently the best solution in many cases, much of this work is intended to provide a contrarian perspective to the status quo in the field, embracing and comparing quantitative performance with baselines as much as possible, but also attempting to see how far we can go in relying on completely learned systems rather than incremental hybrid approaches.


1.1 Chasing Optimality in Communication System Design

Since the seminal work of Shannon [20] in establishing upper bounds for capacity and performance in communications systems (further detailed in Section 2.1.1), much of the focus of radio communications research has been on trying to achieve this near-optimal level of performance in real world systems.
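As a point of reference for the capacity bound mentioned above, the short sketch below evaluates the Shannon-Hartley capacity of an ideal band-limited AWGN channel, C = B log2(1 + SNR), along with the minimum Eb/N0 required at a given spectral efficiency. The AWGN assumption and the function names are illustrative choices made here and are not taken from this work.

```python
import math


def awgn_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon-Hartley capacity in bits/s for an ideal AWGN channel (assumed model)."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def shannon_limit_ebn0_db(spectral_efficiency: float) -> float:
    """Minimum Eb/N0 in dB for reliable communication at eta bits/s/Hz.

    Follows from eta = log2(1 + eta * Eb/N0), i.e. Eb/N0 = (2**eta - 1) / eta.
    """
    ebn0_linear = (2.0 ** spectral_efficiency - 1.0) / spectral_efficiency
    return 10.0 * math.log10(ebn0_linear)


if __name__ == "__main__":
    # 1 MHz of bandwidth at 0 dB SNR supports at most 1 Mbit/s.
    print(awgn_capacity_bps(1e6, 0.0))
    # As spectral efficiency approaches zero the limit approaches about -1.59 dB.
    print(shannon_limit_ebn0_db(1e-6))
```

Practical coding and modulation schemes are commonly judged by how many dB of Eb/N0 they sit away from this limit at a given rate.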

In recent years, coding techniques such as turbo codes [21], turbo product codes [22], and low density parity check (LDPC) codes [23], together with modulation and transmission techniques such as orthogonal frequency-division multiplexing (OFDM) [24] and multiple-input multiple-output (MIMO), have allowed for performance which comes quite close to this limit. Key enablers for modern FEC codes have been large block sizes with probabilistic models (such as belief propagation) which iteratively compute the most likely codewords based on soft log-likelihoods estimated from received symbols.
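To make the notion of soft log-likelihoods concrete, the sketch below computes per-bit log-likelihood ratios (LLRs) for the simplest case: BPSK over a real AWGN channel with known noise variance, where the LLR reduces to 2y/σ². The mapping (bit 0 to +1, bit 1 to -1), the noise variance, and the function names are assumptions made for illustration; an iterative decoder such as belief propagation would take these LLRs as its input rather than the hard decisions shown at the end.

```python
import random


def bpsk_llrs(received, noise_var):
    """LLR = ln P(bit=0 | y) / P(bit=1 | y) for BPSK (0 -> +1, 1 -> -1) in real AWGN."""
    return [2.0 * y / noise_var for y in received]


def hard_decisions(llrs):
    """A positive LLR favors bit 0, a negative LLR favors bit 1."""
    return [0 if llr >= 0 else 1 for llr in llrs]


# Simulate a short uncoded transmission (illustrative parameters).
rng = random.Random(0)
noise_var = 0.25
bits = [rng.randint(0, 1) for _ in range(1000)]
symbols = [1.0 - 2.0 * b for b in bits]
received = [s + rng.gauss(0.0, noise_var ** 0.5) for s in symbols]

llrs = bpsk_llrs(received, noise_var)
decoded = hard_decisions(llrs)
errors = sum(b != d for b, d in zip(bits, decoded))

if __name__ == "__main__":
    print(f"bit errors: {errors} of {len(bits)}")
```

The magnitude of each LLR carries the reliability information that a hard decision discards, which is what allows soft-input decoders to operate closer to the capacity limit.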

Several attempts have been made to extend this maximum likelihood (MLD) block codeword selection task to probabilistically encompass earlier physical layer tasks such as equalization, synchronization and interference cancellation. Approaches in this field include successive interference cancellation (SIC) [25], as well as factor graph/belief propagation models [26, 27]. Both of these have been shown to be attractive from a sensitivity and bit error rate performance standpoint under certain harsh conditions, but both run into difficulty in practical use due to computational complexity limitations and the exponential growth in complexity that comes with increasing realism and degrees of freedom (DOF) in closed form analytic channel, emitter, and interferer models.

1.2 Neural Networks in Radio System Design

The use of artificial neural networks in radio signal processing is not a new idea. Significant interest in this area rose and fell in the 80s and 90s. The Institute of Electrical and Electronics Engineers (IEEE) even developed technical committees such as the neural networks for signal processing (NNSP) committee, which surged initially in interest, looking at applications of learning to signal processing tasks (later renamed to machine learning for signal processing (MLSP) when neural networks fell out of favor). Numerous ideas which I revisit in this work were proposed long ago: neuro-evolution was proposed in 1994 [28, 29], neural network based forward error correction code decoding was proposed in 1989 [30], neural network based modulation recognition was proposed in 1985 [31], and many other early works exist which first considered the ways in which neural networks could be applied to difficult regression and classification problems in the context of the radio signal processing domain.

Unfortunately, during this first surge of interest, the optimization algorithmic tools, computational tools, regularization tools, data storage capabilities, data gathering capabilities, and many other requisites for large scale data centric algorithm learning were not yet available for practitioners. Because of this, many people wrote these ideas off as intractable, or overly complex to be of practical use, and relied instead on more compact models based on either toy analytic problem representations or max-margin style data reducing optimization techniques (e.g. support vector machine (SVM) [32]). Artificial neural network (ANN) methods were generally regarded as failed and uninteresting for quite some time. Several researchers including, most famously, Hinton, Bengio, LeCun, and Schmidhuber continued heavy research into NN optimization quietly for many years, building and maturing the tools which today allow model and network complexity to scale many orders of magnitude above what was possible at that time.

With the emergence of deep learning in 2012 and the results demonstrating the ability

of such techniques to scale, it was clear that any prior assumptions made in the signal

processing domain with regards to model performance, complexity, feasibility and prac-

ticality needed to be completely re-evaluated in light of modern algorithms and compu-

tational capabilities.

In this work, I hope to provide a significant re-consideration of many of the core func-

tions within radio signal processing algorithm design, re-cast fundamental radio signal

processing tasks in the context of modern DL optimization tools, capabilities, and con-

structs, and compare the efficacy of data-centric design of algorithms to the state of the

art methods used today which rely currently much more heavily on complex manual

system engineering and algorithm design.


1.3 Implications, Trends and Challenges in Deep Learning

Many of the ideas which constitute DL have been around for quite some time. However,

it was not until relatively recently (e.g. AlexNet in 2012 [1]) that many ideas in network

architecture, training, regularization, and high performance implementation were

combined to great effect and deep learning gained widespread attention, adoption,

and success. AlexNet was one of the first major efforts and publications which employed

these techniques and provided a large improvement in machine learning

(ML) model performance, in this case on the ImageNet [33] dataset and classification

task, reducing top-1 classification error by around 10 percentage points, from 47.1% to

37.5%.

The key breakthrough here was that it was now possible to train very large

many-free-parameter models using gradient descent on high performance graphics processing

unit (GPU) architectures directly from large datasets, with sufficient regularization, using

end-to-end learning and low level feature learning, and that these models could outperform previous

state of the art systems with many years of analytic feature engineering and tuning

such as scale-invariant feature transform (SIFT) [34] and Fisher Vectors (FVs) [35]. Since

then, this trend of low-level feature learning outperforming hand engineered features

has replaced the state of the art in computer vision, and has shown the same capacity to

replace low level features such as Mel-frequency cepstral coefficients (MFCCs) in time domain

voice processing [36] and equivalents in natural language processing (NLP). This


trend towards low level feature learning from raw data is likely to continue to subsume

many domains’ existing feature extractors and pre-processors into learned equivalents

optimized directly on high level objectives.

Knowledge of domain specific information, however, is not discarded or made unnecessary in the

frightening way that this statement may suggest. For instance in [36], MFCC filters

are replaced with learned features, but the network architecture is set up to allow for

time domain convolutional filter learning of quite similarly structured filter taps which

happen to fit the real distribution of the human voice spectrum slightly better than a pure

dyadic scale. This trend of building NN architectures which leverage prior expert domain

knowledge and combine it with end-to-end learned architectures to find more optimal

solutions is bound to continue and to enable the combination of domain knowledge with

state of the art NN architecture approaches to yield new state of the art results in many

domains.

One of the key breakthroughs in understanding of deep learning and why it really works

was set forth in [37]. Here Dauphin et al. demonstrate that as the number of free parameters

in the model goes up, the probability of getting stuck in a (non-global) local minimum

goes down, and an optimizer such as stochastic gradient descent (SGD) is more likely

to instead encounter a saddle point, from which optimization can continue rather than

terminate. This result underlies why deep learning works, why it can find good solutions,

and why large/deep networks which are of higher dimension than the minimum

required solution are actually key to the ability to find globally good solutions rapidly. As


a result, the field of compressing or pruning these large networks to a smaller minimal

subset once trained has also become an important research area which has shown very

promising results in reducing computation and network size once a global solution has

been found [38, 39].

While we have already discussed feature learning as a key trend, end-to-end learning

continues to extend the scope of the model which can be trained in an end-to-end fash-

ion. Attention models or saliency are key methods in which end-to-end problems may

have their learning architecture decomposed into sub-tasks which help to deal with very

high dimensional inputs. By focusing attention within a high dimensional space, regional

proposal networks (Fast R-CNN [40]) and spatial transformer networks [5] both demon-

strated key methods in which low complexity front-end networks could direct a small

patch of transformed relevant input into a secondary discriminative network to operate

within the relevant input more effectively and with a canonicalized form with various

permutations removed. This design strategy is critical to high dimensional search problems

like the Google Street View House Number recognition task [41], and is extremely

applicable to high dimensional radio search spaces as we will discuss later.

1.4 Deep Cognitive Radio Systems

In this work, we consider how cognitive radio, which has been a slow moving idealized

dream over the past handful of years, can be truly realized from a ground-up level of


physical layer learning using the new tools for high dimensional model learning which

are now available. By combining the end-to-end and feature learning methodologies

which have been highly successful in the computer vision domain and other domains,

with a ground up approach to radio algorithm learning, the results shown herein demon-

strate that a true breakthrough in cognitive radio is finally possible, where we can learn

sensing, waveform synthesis, and control behaviors for radio signals which are uncon-

strained by rigid pre-processors, problem formulations, or other assumptions which were

previously necessary in order to make the problem tractable under an older learning

regime. For lack of a better expression, we term this combination of deep learning as

an enabler for realizing cognitive radio capabilities, deep cognitive radio. Throughout

this document we hope to better explain and quantitatively demonstrate the potential in

this concrete realization of this powerful union.

Chapter 2

Background

This work spans three distinct disciplines which are rapidly converging. Digital

signal processing (DSP) for radio communications systems provides the core knowledge

surrounding analog to digital conversion, the sampling theorem, dynamic range and signal

to noise ratio management, and algorithmic knowledge. Cognitive radio (CR) builds on

DSP and software radio, applying ideas from artificial intelligence (AI) to help automate

and optimize radio applications for specialized objectives. Deep learning (DL) has recently

grown as a rapidly accelerating field within AI which relies on large datasets,

error feedback, and high level objective functions to guide the formation of very large

parametric models, previously intractable with other AI approaches. Before combining

and extending these three technologies throughout this dissertation, an overview of key

background concepts, models, and approaches is provided within this section. This is

particularly important in the Radio Deep Learning field as most practitioners at this point


come either from the radio signal processing field or the machine learning field, and sel-

dom have a deep experience in both. This is likely to change quickly over the coming

years as the field of radio signal processing adopts these techniques and begins to use

data science centric language more accessible to the machine learning community.

2.1 Radio Signal Processing

Figure 2.1: Direct Conversion Radio Front-End Architecture

Radio signal processing has a rich history which can barely be scratched within the scope

of this background section. We focus primarily on modern digital radio communications

systems and radio sensing systems. In both cases radio front-end hardware typically

employs a single or multi-stage oscillator and mixer to convert signals between a specific

radio frequency and baseband (or low intermediate frequency) for digitization. Filters

are used to reject energy from outside of the desired radio frequency (RF) band both at RF

frequencies (RF band pass filters), and as low-pass image-rejection filters at either DC or


low IF frequencies. Analog-to-digital converters (ADCs) and digital-to-analog converters

(DACs) are used to convert analog baseband energy to and from a discretized, sampled

baseband representation within the digital side of a radio system.

Today, the vast majority of low cost radio systems leverage this form of direct conversion

[42] digital transceiver hardware architecture (shown in figure 2.1) and perform signal

specific signal processing (e.g. detection, modulation or demodulation) on the resulting

sampled complex valued quadrature baseband signal digitally in microprocessors (often

termed baseband processing units or digital signal processors).

In many mobile devices (e.g. phones, computers, tablets) this whole radio front-end hard-

ware architecture may all be combined into a single system-on-chip (SoC). Most com-

monly today, the Analog Devices AD9361 [43] is used to perform most of these steps in

common lab software-defined radio (SDR) hardware used herein, but numerous similar

chips exist from Qualcomm, Samsung, Broadcom, Intel and others.

2.1.1 Digital Communications

Today, the vast majority of radio systems are implemented using digital signal processing,

and of those many of the most ubiquitous which we use every day are digital modula-

tions carrying binary information between computing platforms such as phones, laptops,

tablets, cars, base stations, pagers, spacecraft, airplanes, boats, law enforcement radios,

and virtually every other platform frequented by mankind. These systems share a num-


ber of key properties which must be understood and will apply to machine learning based

radio communications systems as much as they apply to current day systems.

Sampling Theory gives us theoretical bounds on the conversion of information between

analog and digital forms. Nyquist observed [44] in 1928 that to perform undistorted reconstruction

of a radio telegraph signal of a given bandwidth (speed of signaling), one

must sample the signal at a rate of at least twice the bandwidth for it to be unambiguous

(to avoid aliasing of portions of the signal). This is today known as the Nyquist rate

(or critical sampling frequency) and represents the rate at which any signal must be

sampled in order to avoid distortion due to aliasing: two times the highest frequency of

the underlying signal. While there does exist an area of investigation in digital communications

into compressive sensing which breaks this assumption, virtually all systems we

use today sample at or above the Nyquist rate, and we shall continue to assume a

system sampled at or above the Nyquist rate for the purposes of our investigations herein. Compressive

sensing based machine learning systems which drop this assumption do pose an

interesting prospect, but we do not consider them in our work here. We also generally

do not model the effects of quantization which occur within the sampling process. An

analog signal of peak-to-peak voltage $V$ can be divided into $2^N$ discrete voltage levels

spanning $[-V/2, V/2]$ when converted to an $N$-bit digital equivalent with reconstruction

error bounded by $\epsilon \le V / 2^{N+1}$. However, for this to be true, we must assume the signal

amplitude is scaled appropriately for the converter's range, and that the dynamic range

of the analog signal plus noise can be sufficiently represented within $N$ bits. Many modern

ADCs employ 14 or 16 bit conversion, including those used in our measurements

(largely universal software radio peripheral (USRP) [45] devices based on the AD9361

[43] chip). These provide sufficient dynamic range for many applications wherein thermal

noise $N$ per sample is greater than quantization noise $\epsilon$, and we shall make this assumption

within our work as well. Most of the work herein is therefore conducted using 32 bit

floating point representations for simplicity. This presents more than enough dynamic

range for our applications, and can be reduced in precision in future work for many of

them.
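
The quantization bound above can be checked numerically. The following is a minimal sketch, assuming an idealized uniform quantizer that reconstructs each sample at the centre of its cell; the function names are illustrative and the model ignores converter non-idealities.

```python
def quantize(x: float, v_pp: float, n_bits: int) -> float:
    """Uniform quantizer with 2**n_bits cells spanning [-v_pp/2, v_pp/2]."""
    step = v_pp / 2 ** n_bits
    # Clip to the converter range (just inside the top edge).
    clipped = max(-v_pp / 2, min(v_pp / 2 - 1e-12, x))
    index = int((clipped + v_pp / 2) // step)
    # Reconstruct at the centre of the selected cell.
    return -v_pp / 2 + (index + 0.5) * step


def worst_case_error(v_pp: float, n_bits: int, n_points: int = 10001) -> float:
    """Largest reconstruction error over a sweep of in-range input voltages."""
    worst = 0.0
    for k in range(n_points):
        x = -v_pp / 2 + v_pp * k / (n_points - 1)
        worst = max(worst, abs(x - quantize(x, v_pp, n_bits)))
    return worst
```

For an in-range input the worst-case error returned by `worst_case_error(V, N)` should not exceed V / 2^(N+1), which is half of one quantization step.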

Information Theory can be used to express an upper bound on channel capacity [20].

This defines the maximum information throughput in bits per second per hertz which

can traverse a wireless channel. Most commonly this is expressed for a single transmitter,

single receiver channel where the impairment is given by AWGN, and signal power is

expressed as a signal to noise ratio (SNR) relating the signal power to the noise power.

This capacity equation is given traditionally by the following:

$$C = W \log_2\left(1 + \frac{P}{N_0 W}\right) \tag{2.1}$$

Here we obtain a maximum capacity $C$ in bits per second based on the transmit signal

power $P$, the noise power spectral density $N_0$, and the bandwidth $W$. This is considered one of the most

important bounds in communications, as it characterizes a fundamental limit on how

much information we can transmit over a given channel with a specific signal and noise


power.
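
As a worked illustration of equation 2.1, the short sketch below evaluates the AWGN capacity for a given bandwidth and power. It assumes N0 is interpreted as noise power per hertz, so that N0 W is the total in-band noise power; the function name and example values are illustrative only.

```python
import math


def awgn_capacity_bps(bandwidth_hz: float, signal_power_w: float,
                      noise_psd_w_per_hz: float) -> float:
    """Shannon capacity C = W log2(1 + P / (N0 W)) in bits per second."""
    snr = signal_power_w / (noise_psd_w_per_hz * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)


# Example: 1 MHz of bandwidth at a linear SNR of 15 (about 11.8 dB).
capacity = awgn_capacity_bps(1e6, 15e-12, 1e-18)
```

With these values the SNR term is 1 + 15 = 16, so the capacity is 4 bits per second per hertz, or roughly 4 Mbit/s over the 1 MHz channel.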

Achieving this bound has driven much of communications research and algorithm iter-

ation over the past 50 years as we seek systems which operate closer and closer to this

bound. Each specific modulation and coding scheme can also be expressed with

an expected analytic error bound under a given set of operating conditions (SNR, bandwidth),

and its information capacity is governed by this bound.

The Shannon bound however, does not in its common form, address multi-user capacity

(aggregate bits per second per hertz for all users sharing a common channel) or realis-

tic wireless channel impairments beyond thermal noise (e.g. fading, distortion, or other

sources of impairment). Numerous more complex formulations of capacity do exist for

more complex modulation and channel models, but no general solution exists for the

multi-user, realistically impaired channel and arbitrary emitter modulation case.

2.1.2 Radio Channel Models

Modeling of radio propagation channels is a highly mature field which has developed

throughout the history of communications systems. Typical channel models allow us to

come up with simplified parametric models which reasonably approximate the effects

seen over the wireless channel. High quality Monte Carlo simulation algorithms do exist

as well which can produce realistic sample by sample distributions of impairment models

for simulation [46], but often do not have a compact form.


These channel models may be used to perform analytic optimization, as they have

been in many instances, to simulate the transmission and reception of a wireless signal

in a Monte Carlo sense, or, as discussed later, may be used directly in the development

of domain specific attention models or simulation models, including within end-to-end

radio system optimization processes.

Thermal Noise is a key physical limitation in analog to digital conversion which limits

sensitivity and achievable signal to noise ratios for any given received signal power [47]

and bandwidth. We can model the absolute thermal noise power as $P = kTB$ where $P$ is the

power in Watts, $k$ is Boltzmann's constant ($1.38 \times 10^{-23}$ Joules/Kelvin), $T$ is the temperature

at the ADC in Kelvin, and $B$ is the conversion bandwidth (sample rate) in Hz. Given this

fundamental bound on SNR for specific receiver powers due to device physics, all radio

systems must function within the finite SNR margin governed by this limit. For simulation

and analysis, this is modeled accurately as an additive white Gaussian

noise (AWGN) process where received samples ($r$) are each the sum of some transmitted signal ($s$)

and a noise component ($N_{thermal}$). This can be expressed as $r = s + N_{thermal}$ where $N_{thermal} \sim \mathcal{N}(0, \sigma_N)$

and $|s|^2/\sigma_N^2$ expresses the SNR. This is typically expressed in dB as $10\log_{10}(|s|^2/\sigma_N^2)$ rather

than as a ratio.
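
The thermal noise floor and the AWGN model can be sketched in a few lines. This is an illustrative example under stated assumptions: noise is drawn as circularly symmetric complex Gaussian samples on a complex baseband signal, and the helper names are hypothetical rather than taken from this work.

```python
import math
import random

BOLTZMANN = 1.380649e-23  # J/K


def thermal_noise_power_w(temp_k: float, bandwidth_hz: float) -> float:
    """Thermal noise power P = kTB in Watts."""
    return BOLTZMANN * temp_k * bandwidth_hz


def watts_to_dbm(power_w: float) -> float:
    return 10.0 * math.log10(power_w / 1e-3)


def add_awgn(samples, snr_db: float, rng: random.Random):
    """Add complex Gaussian noise so the average SNR is close to snr_db."""
    signal_power = sum(abs(s) ** 2 for s in samples) / len(samples)
    noise_power = signal_power / 10 ** (snr_db / 10.0)
    sigma = math.sqrt(noise_power / 2.0)  # per real dimension (I and Q)
    return [s + complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
            for s in samples]
```

For example, `watts_to_dbm(thermal_noise_power_w(290.0, 1e6))` gives a noise floor of about -114 dBm for 1 MHz of bandwidth at room temperature.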

Delay spread occurs during wireless propagation when multiple copies of a transmitted

signal are received at differing delays and phase offsets. This is commonly due to multipath

fading, or the summation of many different propagation paths either direct, reflective

or otherwise, arriving and summing together at a receive antenna additively. For an


impulsive channel, shown in figure 2.2 for σ = 0, all energy arrives at a single time-delay

in the impulse response.

Figure 2.2: Impulse Response Plots of Varying Delay Spreads

For a frequency-selective fading channel (with non-zero delay spread), this energy arrives

at some combination of random time intervals which combine additively at the receive

antenna. This results in an impulse response which contains power at a range of different

frequencies, often following a distribution such as Rayleigh (where there is no dominant

mode or line-of-sight) or Rician (when a large line-of-sight component is present). Figure

2.2 shows fading channels for σ = 0.5, 1.0, 2.0 which we will refer to as light, medium and

harsh fading conditions later on. Typically this impulse response is considered stationary

for the ’coherence time’ of the channel, and this is the assumption we will be making in

many of our experiments later where, for instance we convolve a single example of 1024

time-samples with a single random impulse response as a simplifying assumption. This

sort of modeling is used routinely in communications systems today.
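
The simplification described above, convolving one example with one fixed random impulse response, can be sketched as follows. The tap generator is a hypothetical stand-in: it assumes complex Gaussian (Rayleigh-style) taps with an exponentially decaying power profile and is not the model used to produce Figure 2.2.

```python
import math
import random


def random_impulse_response(n_taps: int, delay_spread: float,
                            rng: random.Random):
    """Hypothetical fading taps, normalized to unit energy.

    delay_spread <= 0 returns a single impulse (no multipath).
    """
    if delay_spread <= 0:
        return [1 + 0j] + [0j] * (n_taps - 1)
    taps = []
    for k in range(n_taps):
        amp = math.exp(-k / (2.0 * delay_spread))  # power ~ exp(-k / spread)
        taps.append(complex(rng.gauss(0.0, amp), rng.gauss(0.0, amp)))
    norm = math.sqrt(sum(abs(h) ** 2 for h in taps))
    return [h / norm for h in taps]


def convolve(x, h):
    """Direct-form linear convolution of two complex sequences."""
    y = [0j] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for k, hk in enumerate(h):
            y[i + k] += xi * hk
    return y
```

Holding the taps fixed for the whole example corresponds to assuming the example is shorter than the channel coherence time.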

Timing Offsets are also present in all wireless systems, where path lengths and propagation

times can change based on radio mobility or changing path lengths due to reflection,

refraction, dispersion etc. This is of course governed by the propagation of radio-frequency

waves at the speed of light ($c = 3 \times 10^8$ m/s) over some distance ($d$), where the

time delay ($\tau_0$) is given by $d/c$. This time-delay $\tau_0$ can be treated as a random process for

simulation purposes, and is typically estimated in a radio receiver through the process

of synchronization. Most commonly, the use of a matched filter to some set of reference

tones at the beginning of a transmission, allows for time of arrival estimation and the

extraction of a received signal from the beginning of a single transmission.
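
A minimal sketch of correlation-based time of arrival estimation is given below. It assumes a known reference sequence at the start of the transmission; a Barker-13 code is used purely as a hypothetical preamble, and the speed of light is the rounded value quoted above.

```python
SPEED_OF_LIGHT = 3e8  # m/s, rounded value used in the text

# Barker-13 sequence, used here as a hypothetical known preamble.
PREAMBLE = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]


def propagation_delay_s(distance_m: float) -> float:
    """Time delay tau_0 = d / c for a path of length d."""
    return distance_m / SPEED_OF_LIGHT


def estimate_arrival_index(received, reference) -> int:
    """Return the lag at which |correlation(received, reference)| peaks."""
    best_lag, best_mag = 0, -1.0
    for lag in range(len(received) - len(reference) + 1):
        acc = sum(received[lag + k] * complex(reference[k]).conjugate()
                  for k in range(len(reference)))
        if abs(acc) > best_mag:
            best_lag, best_mag = lag, abs(acc)
    return best_lag
```

Sliding the conjugated reference across the received samples is equivalent to applying a matched filter, and the peak location gives the sample index at which the transmission begins.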

Clock Offsets occur because physically separated radios (e.g. a base station and a handset)

typically have separate free running clocks from which the digital sampling rate and

center-frequency tuning oscillator signals are derived. Free running clock rates can be

treated as a Gaussian random walk process, where they are stable on short time-intervals

and stability decreases looking at larger time intervals. This is typically characterized in

the hardware specification of a given hardware device in terms of expected clock error in

parts per million (PPM) or parts per billion (PPB). For short time-intervals we can make

the assumption of stationarity and assign a fixed estimate for symbol rate offset (SRO) and

carrier frequency offset (CFO) between a transmitter and a receiver, or between a received

signal and a receiver. Motion of transmitters, receivers, or reflectors can additionally introduce

SRO or CFO through the Doppler effect, where the CFO due to motion ($\Delta f_{doppler}$)

is given by $\Delta f_{doppler} = \frac{\Delta v}{c} F_c$, where $\Delta v$ is the difference in the velocity of the transmitter

and receiver along the path of transmission, and $F_c$ is the center frequency of the signal


emitter. Generally the CFO and SRO incident on a wireless receiver are a combination of

offset due to doppler and offset due to random clock offsets between hardware devices.

In much of our work, focused on short-time examples, we assume coherence of sample

rate and center frequency over a small number of samples in one example, and randomly

draw CFO and SRO from a normal distribution. In the case of CFO, we assume a carrier

frequency offset distribution $\Delta F_c \sim \mathcal{N}(0, \sigma_{CFO})$ and in the case of SRO, we assume a small

resampling ratio near one, $\Delta R \sim \mathcal{N}(1, \sigma_{SRO})$.
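
The offset models above reduce to a few lines of arithmetic. The sketch below is illustrative only: the function names and the example values in the usage note are assumptions, and the spread parameters are treated as standard deviations.

```python
import random

SPEED_OF_LIGHT = 3e8  # m/s, rounded value used in the text


def doppler_shift_hz(delta_v_mps: float, center_freq_hz: float) -> float:
    """CFO due to motion: (delta_v / c) * F_c."""
    return delta_v_mps / SPEED_OF_LIGHT * center_freq_hz


def ppm_to_hz(clock_error_ppm: float, center_freq_hz: float) -> float:
    """Frequency error implied by a clock accuracy given in parts per million."""
    return clock_error_ppm * 1e-6 * center_freq_hz


def draw_offsets(sigma_cfo_hz: float, sigma_sro: float, rng: random.Random):
    """Draw one (CFO, resampling ratio) pair, held fixed for one short example."""
    cfo_hz = rng.gauss(0.0, sigma_cfo_hz)
    resample_ratio = rng.gauss(1.0, sigma_sro)
    return cfo_hz, resample_ratio
```

For instance, a relative velocity of 30 m/s at a 1 GHz carrier gives a Doppler shift of about 100 Hz, while a 2 PPM oscillator error at 2.4 GHz corresponds to about 4.8 kHz, so clock error can easily dominate motion in such a case.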

Aggregate Effects are present in any real system, where all of the above uncertainties

about a channel are combined into a single simplified wireless propagation model. We can

express this as a transmitted signal, s(t), perturbed by a number of channel effects over

the air before being received as r(t) at the receiver. These effects include time delay,

time dilation, complex carrier phase rotation, carrier frequency offset, additive thermal

noise, and channel impulse responses being convolved with the signal, all of which are random

time-varying processes. A closed form of the analytic transform between time varying signals

s(t) and r(t) including each of these effects can be approximated as shown in the equation

below.

$$r(t) = e^{j 2\pi \Delta F_c(t)/F_s} \int_{\tau=\tau_0}^{\tau_0+T} s(t - \tau \Delta R(t))\, h(\tau)\, d\tau + n_{thermal}(t) \tag{2.2}$$

Unfortunately, such an expression is quite unwieldy when performing analytic optimization

of estimators in closed form, involving interpolation with a time-varying

delay function, and integration with a time-varying impulse response. To simplify this,

in many cases, the simplified expression below is used.

$$r(t) = s(t - \tau_0) + N_{thermal} \tag{2.3}$$

When considering time and frequency offsets a slightly more involved expression is also

commonly used.

$$r(t) = e^{j 2\pi \Delta F_c t}\, s(t - \tau_0) + N_{thermal} \tag{2.4}$$

Many estimators focus on the structure of s(t), which contains well

formed structures such as the following for quadrature phase shift keying (QPSK) when

considering perfectly sampled symbol periods.

$$s(t) = e^{j(2\pi N/4 + \pi/4)}, \quad N \in \{0, 1, 2, 3\} \tag{2.5}$$

Such structured forms of s(t) and simplified AWGN-only propagation models are key

to clever derivation of estimators today which are specialized to specific forms of s(t)

(e.g. realizing $s(t)^4$ falls on a single point for all N). However, once a more complex

or nonlinear analytic channel model is introduced and/or many different

s(t) transmitted signal structures need to be considered, this kind of manual analytic trick

begins to break down quite rapidly and requires practical model simplifications to remain

tractable.
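
The fourth-power observation can be illustrated with a small example. The sketch below first shows that every QPSK symbol of equation 2.5 maps to the same point when raised to the fourth power, and then uses that fact in a simple carrier frequency offset estimator. This is one well-known trick of this kind, shown under idealized assumptions (one sample per symbol, no noise in the usage example, offset magnitude below one eighth of a cycle per sample); it is not presented as the estimator used elsewhere in this work.

```python
import cmath
import math


def qpsk_symbol(n: int) -> complex:
    """QPSK constellation point exp(j(2*pi*n/4 + pi/4)) for n in {0, 1, 2, 3}."""
    return cmath.exp(1j * (2.0 * math.pi * n / 4.0 + math.pi / 4.0))


def estimate_cfo_cycles_per_sample(received) -> float:
    """Fourth-power CFO estimate from symbol-spaced QPSK samples.

    Raising to the 4th power strips the modulation, leaving a tone at four
    times the offset whose average phase step is measured and divided by 4.
    """
    acc = 0j
    for prev, cur in zip(received, received[1:]):
        acc += (cur ** 4) * (prev ** 4).conjugate()
    return cmath.phase(acc) / (4.0 * 2.0 * math.pi)
```

Each `qpsk_symbol(n) ** 4` evaluates to -1 regardless of n, which is why the modulation disappears. The estimator depends entirely on that structure, so a different constellation or a dispersive channel would require a different derivation.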

2.2 Cognitive Radio

Cognitive radio [48, 31] is a field which explores the potential ways in which our radio

and mobile devices can behave in much smarter and more efficient ways by leveraging

artificial intelligence to make better, more informed decisions and employ improved con-

trol systems and channel access schemes.

Common examples of this include radios saving power by intelligently searching for

towers based on expected locations and distributions, or historical information, conducting

hand-off more intelligently, managing finite resources (typically power and spectrum)

efficiently, and tuning RF communications systems and front-end parameters such as

gain, filtering, tuning or otherwise in order to improve radio performance [49].

Perhaps the most widely published application of cognitive radio is that of dynamic

spectrum access (DSA) [50, 51], which seeks to increase spectrum usage and efficiency by

allowing for much more dynamic spectrum sharing by secondary users through intelligent

sensing, radio user identification, and non-invasive access strategies designed not

to harm primary spectrum users, but to use the vacancies and holes available in

frequency and time.


Unfortunately, many of the techniques investigated within the first surge of interest in

cognitive radio and dynamic spectrum access (before the first cognitive radio winter) at-

tempted to solve very specific sensing or control system problems through a process of

specialized modeling of specific scenario features, processes, and distributions (often only

for one specific primary user or frequency band). This resulted in a number of potential

end solutions, for instance for inter-operability optimizations specifically with TV broad-

cast signals in TV broadcast bands, or control protocols to maximize fairness among shar-

ing secondary access nodes, however by and large it did not provide a general solution

which allowed us to generalize spectrum sensing, spectrum access, and control optimization

widely for many different scenarios, emitters, and bands. Due to this narrow applicability

and slow moving spectrum policy which has refused to allow for sensing based

secondary spectrum access, much of the research in this field yielded relatively narrow

interest and effected relatively minimal change in radio system design and deployment

as a whole.

2.2.1 Sensing Techniques

One of the earliest applications of artificial intelligence in radio systems was that of spec-

trum sensing for emitter identification. This is often a multi-stage expert system which

first performs a form of wideband energy detection, often by identifying concentrated

energy within the power spectral density, localizing and extracting carriers, and then

further characterizing these carriers through an iterative process of carrier estimation and


classification.

There have been numerous attempts to use neural network based approximations espe-

cially in the latter stage of signal classification (single signal identification on top of expert

feature sets), but many of them have relied on preprocessed feature spaces as input such

as the spectral correlation function (SCF) [52] to provide a relatively simple neural network

mapping task. The scope of previous expert sensing methods is quite large, and

we explore it partially in more depth in the later sensing section.

2.2.2 Control Modeling

Control system modeling in radio systems is another interesting task which was ad-

dressed within the scope of Cognitive Radio problems and publications. Control op-

timization approaches have been applied to many tasks such as channel frequency se-

lection in dynamic spectrum access systems and for avoidance of malicious users such

as following tone jammers. Two of the most commonly considered approaches include

modeling access opportunities of whitespace as a hidden Markov model (HMM) [53], as

well as modeling collective control problems as Game Theoretical problems [54]. Each

of these models and solutions is, however, highly specialized for the

specific scenario, band, and primary user considered in each of these works.

Works also considered the effects of optimal radio mode and tuning control using a va-

riety of methods [55], including the use of expert planning approaches [56] such as the


popular observe orient decide act (OODA)-loop concept. However, these two approaches

are also quite reliant on expert knowledge, modeling, descriptions, and specific scenario-centric

learning. We hope that with the methods presented here we can begin to devise

and build solutions to these classes of problems which generalize much better without

significant expert model construction and manual adaptation needed.

2.3 Deep Learning Models

The study of deep learning has recently brought together a collection of powerful optimization

tools, network architectural tools, regularization knowledge, high performance

implementation, and other techniques which can be used to learn powerful models from

datasets and simulators. Here, we highlight a number of key ideas and enablers in greater

depth for background. These will be applied in later sections to several core problems

in radio signal processing.

2.3.1 Error Feedback and Objectives

At its core, deep learning as it exists today is focused on the optimization of large para-

metric network models which can accommodate very high degrees of freedom, non-linear

transformations, and deep hierarchical structure.

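Table 2.1: List of widely used NN optimization loss functions

Name                               $\mathcal{L}(y, \hat{y})$
---------------------------------  ------------------------------------------------------------
Mean Squared Error (MSE)           $\lVert y - \hat{y} \rVert^2$
Mean Absolute Error (MAE)          $\lvert y - \hat{y} \rvert$
Binary cross-entropy (BCE)         $-y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$
Categorical cross-entropy (CCE)    $-\frac{1}{N} \sum_{i=0}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
Log-cosh                           $\frac{1}{N} \sum_{i=0}^{N} \log\left(\cosh(y_i - \hat{y}_i)\right)$
Huber                              $\frac{1}{N} \sum_{i=0}^{N} \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2 & \mathrm{abs}(y_i - \hat{y}_i) < 1 \\ (y_i - \hat{y}_i) & \mathrm{abs}(y_i - \hat{y}_i) \ge 1 \end{cases}$

Today, such networks define one or more loss functions ($\mathcal{L}$) between network output values

($\hat{y}$) and target network output values ($y$) (where $y_i$ denotes the i'th output value),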

and use a form of global error feedback from this loss function in order to train network

parameters (also referred to as learning). Artificial neural networks (ANNs or just NNs)

have long relied on back-propagation [57] of error gradients to fit the parameters in their

networks. In its simplest form, the iterative weight update process of back-propagation

for some function $\hat{y} = f(x, \theta)$ is given by the following simple weight update equation

with a learning rate ($\eta$).

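$$\theta_{n+1} = \theta_n - \eta \frac{\partial \mathcal{L}(y, \hat{y})}{\partial \theta} = \theta_n - \eta \frac{\partial \mathcal{L}(y, f(x, \theta))}{\partial \theta} \tag{2.6}$$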

This gradient can be derived in an automated fashion using automatic differentiation,

even for very complex functions representing entire networks. This is a key enabler for the flexibility

and rapid speed at which deep learning architectures are able to evolve today. One SGD

weight update evaluation of $\partial \mathcal{L}(y, f(x, \theta)) / \partial \theta$

is often referred to as a backwards pass, while

network evaluation of $\hat{y} = f(x, \theta)$ is often referred to as a forwards pass.
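
To make the forward pass, backward pass, and weight update of equation 2.6 concrete, the sketch below fits a two-parameter linear model with per-sample SGD. The model, the squared-error loss, the hand-derived gradient, and the hyperparameters are all illustrative assumptions; practical frameworks obtain the gradient by automatic differentiation instead.

```python
def forward(x: float, theta):
    """Forward pass: evaluate the model output f(x, theta) = theta0 + theta1 * x."""
    return theta[0] + theta[1] * x


def backward(x: float, y: float, theta):
    """Backward pass: gradient of the loss (f(x, theta) - y)**2 w.r.t. theta."""
    err = forward(x, theta) - y
    return [2.0 * err, 2.0 * err * x]


def sgd_fit(data, eta: float = 0.05, epochs: int = 200):
    """Repeat theta <- theta - eta * dL/dtheta over every (x, y) pair."""
    theta = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            grad = backward(x, y, theta)
            theta = [t - eta * g for t, g in zip(theta, grad)]
    return theta


# Noise-free samples of y = 2x + 1.
DATA = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```

Calling `sgd_fit(DATA)` should recover parameters close to `[1.0, 2.0]`, since the data are exactly linear and the learning rate is small enough for the updates to contract.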


This form of iterative weight update through SGD with global error feedback through

back-propagation is used today in virtually all DL model training applications. A wide

variety of loss functions are used for different applications, but many of the most commonly

used loss functions, including mean squared error (MSE) and categorical cross-entropy

(CCE), are shown in Table 2.1. MSE is commonly used for real-valued regression problems,

while CCE is typically used for classification problems. In classification with a CCE loss

function a so called "one-hot" encoding is typically used, where the output targets ($y_i$)

take the form of a zero vector with a one at the index of the correct class label. In this case

output predictions $\hat{y}_i$ for each class $i$ of $N$ fall on the range (0, 1), which can be enforced

with an output activation function with bounded (0, 1) output range such as sigmoid or

softmax (softmax is typically used). When bounded in this way, these output predictions

are often referred to as pseudo-probabilities, since they are trained to predict the discrete

target probabilities $p(y_i = 1)$ or $p(y_i = 0)$ for each output index.
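
The one-hot targets, softmax output activation, and cross-entropy loss described above can be sketched as follows. Note the assumption: the loss implemented here is the widely used form, the negative sum of y_i log(ŷ_i), which keeps only the first term of the CCE expression listed in Table 2.1 and omits the 1/N factor. The function names are illustrative.

```python
import math


def softmax(logits):
    """Map raw network outputs to pseudo-probabilities in (0, 1) that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label: int, n_classes: int):
    """Zero vector with a one at the index of the correct class."""
    return [1.0 if i == label else 0.0 for i in range(n_classes)]


def categorical_cross_entropy(y, y_hat, eps: float = 1e-12) -> float:
    """-sum_i y_i * log(y_hat_i), with clipping to avoid log(0)."""
    return -sum(yi * math.log(max(pi, eps)) for yi, pi in zip(y, y_hat))
```

With a one-hot target the loss reduces to the negative log of the pseudo-probability assigned to the correct class, so a uniform prediction over two classes yields a loss of log 2.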

SGD has improved drastically since the basic formulation shown in equation 2.6. Momentum

[58, 59] is an important enhancement on the simple formulation of SGD shown

above. With momentum, the effective step size is adapted dynamically based on the stability

of the gradient in each direction to prevent oscillation and to accelerate descent across

large nearly flat regions. The simple form of the gradient update expression with momentum

is given in equation 2.7, where velocity $v$ is now updated iteratively with decay factor $\gamma$ and used to

derive new weights $\theta$.


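$$\begin{aligned}
v_{n+1} &= \gamma v_n + \eta \frac{\partial \mathcal{L}(y, \hat{y})}{\partial \theta} \\
\theta_{n+1} &= \theta_n - v_{n+1}
\end{aligned} \tag{2.7}$$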
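
A minimal sketch of the momentum update follows, assuming a deterministic gradient function and a toy quadratic loss chosen only for illustration. The velocity is decayed by a factor gamma, incremented by the learning rate times the gradient, and then subtracted from the weights.

```python
def sgd_momentum(grad_fn, theta0, learning_rate: float = 0.1,
                 gamma: float = 0.9, steps: int = 500):
    """Gradient descent with classical momentum on a list of parameters."""
    theta = list(theta0)
    velocity = [0.0] * len(theta)
    for _ in range(steps):
        grad = grad_fn(theta)
        # Velocity update: decay the old velocity and add the scaled gradient.
        velocity = [gamma * v + learning_rate * g
                    for v, g in zip(velocity, grad)]
        # Weight update: step against the accumulated velocity.
        theta = [t - v for t, v in zip(theta, velocity)]
    return theta


def toy_gradient(theta):
    """Gradient of the toy loss (theta0 - 3)**2 + 10 * (theta1 + 1)**2."""
    return [2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)]
```

Starting from `[0.0, 0.0]`, `sgd_momentum(toy_gradient, [0.0, 0.0])` should settle near the minimum at `[3.0, -1.0]` even though the two directions have curvatures that differ by a factor of ten.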

This approach was accelerated further using Nesterov’s approach [60] which improves

momentum updates assuming the target loss manifold is a smooth function. Within the

past handful of years, both RMSProp [61] and Adam [62], which incorporate gradient

normalization into their momentum updates, have become widely used. In Adam, which is

used in the vast majority of the work included herein, the update equation is given in

equation 2.8.

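\[
\begin{aligned}
m_{n+1} &= \beta_1 m_n + (1 - \beta_1) \frac{\partial \mathcal{L}(y, \hat{y})}{\partial x} \\
v_{n+1} &= \beta_2 v_n + (1 - \beta_2) \left( \frac{\partial \mathcal{L}(y, \hat{y})}{\partial x} \right)^2 \\
\hat{m}_{n+1} &= \frac{m_{n+1}}{1 - \beta_1^{n+1}} \\
\hat{v}_{n+1} &= \frac{v_{n+1}}{1 - \beta_2^{n+1}} \\
\theta_{n+1} &= \theta_n - \frac{\eta \, \hat{m}_{n+1}}{\sqrt{\hat{v}_{n+1}} + \varepsilon}
\end{aligned}
\tag{2.8}
\]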
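
The Adam update can be written out directly in NumPy as a rough sketch. The helper name `adam_minimize`, the toy quadratic loss, and the hyper-parameter values are assumptions made for illustration; in practice the optimizer supplied by a deep learning framework would be used instead.

```python
import numpy as np

def adam_minimize(grad_fn, theta0, eta=0.05, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Apply the Adam update for a fixed number of steps and return the weights."""
    theta = np.array(theta0, dtype=float)
    m = np.zeros_like(theta)                     # first moment estimate
    v = np.zeros_like(theta)                     # second moment estimate
    for n in range(steps):
        g = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        m_hat = m / (1.0 - beta1 ** (n + 1))     # bias correction
        v_hat = v / (1.0 - beta2 ** (n + 1))
        theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Toy loss L(theta) = 0.5 * ||theta - target||^2, whose gradient is theta - target.
target = np.array([3.0, -2.0])
theta_final = adam_minimize(lambda th: th - target, theta0=[0.0, 0.0])
print("final weights:", theta_final)
```

Because the step is normalized by the running gradient magnitude, early steps have a size close to η regardless of the raw gradient scale.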

Even more recently, the problem of learning rate control during SGD has been readdressed

in novel ways which provide faster optimization (often at the cost of increased

computational complexity per iteration). These include the use of curvature and gradient

variance in a closed loop system [63], as well as casting the learning rate tuning problem

as a separate reinforcement learning problem naively [64] (e.g. learning to learn faster).


These methods have shown promising results, but are not in widespread use at this time

and appear to provide relatively incremental performance improvements in our limited

experimentation.

There has been significant discussion lately surrounding whether global error feedback is

really appropriate, optimal or biologically plausible within the human brain. The notion

of a global loss function and global error feedback both seem unlikely in the human mind.

More plausible formulations generally include a more localized form of loss computation

and a more localized and distributed form of error feedback. Numerous ideas on im-

proved optimization are currently under development, and will almost certainly provide

improvements in network training within the coming years. Key explorations in this field

include Feedback Alignment [65], Equilibrium Propagation [66], Inverse Autoregressive

Flow [67], and others. This is a very active area of research, and a challenging field.

Most of the work herein relies on mature global back-propagation using forms of SGD for

network optimization due to their maturity, effectiveness (current state of the art on most

tasks) and the availability of optimized implementations. However, given the promising

nature of emerging basic research into distributed and local-feedback optimization meth-

ods (which attempt to mirror more closely what the human brain is believed to do, i.e.

no single global loss function or global clock synchronization) and the similarity of the

network functions on which they may operate, we expect many of these methods will be

readily applicable to lend further improvements to much of the work shown here.


2.3.2 Network Model Primitives

Neural network architectures have come quite a long way since the early use of a very
small number of ’perceptrons’, but the formulation of a single feed-forward, memoryless
neuron has remained relatively straightforward, and is given by equation 2.9. Here a set
of input values X of size (N, 1) is concatenated with a ones vector of size (1, 1) (to
include a bias term) to form X_1, which is multiplied with a weight matrix W of size
(M, (N+1)) to produce an output vector H of size (M, 1).

Y = f(W ×X_1) (2.9)

An output value Y is then produced using some activation function f, which may be
linear (e.g. the identity function) or non-linear, such as a sigmoid or rectified linear
unit. Commonly used activation functions, f, are given in table 2.2. Sigmoid activation
functions have a long and rich history in the literature, but today a number of different
activations are used. The simple rectified linear unit (ReLU) activation [68] has
increasingly replaced the sigmoid due to a number of important properties. Both it and
its gradient are much cheaper to compute, and training typically converges much faster
than with smooth sigmoid or tanh activations, which suffer more from the vanishing
gradient problem [69]. That problem arises when back-propagating through many layers,
where gradient contributions to the loss can differ by orders of magnitude between
subsequent layers, making optimization very slow.
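
As an illustrative sketch of equation 2.9, the snippet below evaluates a single fully connected layer in NumPy. It assumes column vectors, so that the product W × X_1 is well defined, and it picks ReLU as the activation f; the sizes N and M and the random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 3                               # input and output sizes (arbitrary)

X = rng.normal(size=(N, 1))               # input column vector
X1 = np.vstack([X, np.ones((1, 1))])      # append a one so the last column of W acts as the bias
W = rng.normal(size=(M, N + 1))           # weights; W[:, -1] holds the bias terms

def relu(h):
    """Rectified linear unit, one choice for the activation f."""
    return np.maximum(0.0, h)

H = W @ X1                                # pre-activation, shape (M, 1)
Y = relu(H)                               # Y = f(W x X_1)

# The same result written with an explicit weight matrix and bias vector.
Y_explicit = relu(W[:, :N] @ X + W[:, N:])
print(Y.ravel(), Y_explicit.ravel())
```

Folding the bias into the weight matrix through the appended one is purely a notational convenience; most frameworks store the bias as a separate vector.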


Table 2.2: List of activation functions

Name              Function f(x)                              Range
Linear            x                                          (−∞, ∞)
ReLU [68]         max(0, x)                                  [0, ∞)
Leaky ReLU [70]   αx for x < 0;  x for x ≥ 0                 (−∞, ∞)
TanH              tanh(x)                                    (−1, 1)
ArcTan            tan^−1(x)                                  (−π/2, π/2)
Sigmoid           1 / (1 + e^−x)                             (0, 1)
SoftMax [71]      e^{x_i} / Σ_{j=0}^{N} e^{x_j}              (0, 1)
Step              0 for x < 0;  1 for x ≥ 0                  {0, 1}
ELU [72]          x for x > 0;  αe^x − α for x ≤ 0           (−α, ∞)
SELU [73]         λx for x > 0;  λ(αe^x − α) for x ≤ 0       (−λα, ∞)
SoftPlus [74]     ln(1 + e^x)                                (0, ∞)

Each of these activations is expressed compactly in table 2.2. In the case of tanh,
sigmoid, and softmax, exponentiation operations are required in the forward pass, while
a ReLU unit is a simple piecewise linear transfer function which is extremely cheap to
compute. In the table, α denotes a coefficient applied in the non-activated (negative)
region, while λ denotes a scaling factor; both are considered hyper-parameters (i.e.
defined with the network architecture and not updated during SGD). In each case, x
denotes a single output neuron (the activation of each output is independent), except in
the case of SoftMax, where each output xi is normalized by the sum of the exponentiated
versions of all outputs xj in the layer.
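
Several of the activations in table 2.2 can be sketched in NumPy as follows. The default α values are arbitrary illustrative choices, and the SELU constants are the commonly published values for that activation.

```python
import numpy as np

# Commonly published SELU constants (assumed here, not defined in this text).
SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x < 0, alpha * x, x)

def elu(x, alpha=1.0):
    # expm1 on the clipped input avoids overflow for large positive x
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def selu(x):
    return SELU_LAMBDA * elu(x, SELU_ALPHA)

def softplus(x):
    return np.log1p(np.exp(x))

def softmax(x):
    # Unlike the others, each output depends on every input in the layer.
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.linspace(-5.0, 5.0, 11)
for f in (relu, leaky_relu, elu, selu, softplus, softmax):
    print(f.__name__, np.round(f(x), 3))
```

All but `softmax` act element-wise, which is why they can be evaluated concurrently for every neuron in a layer.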

The perceptron description given in equation 2.9 and illustrated in figure 2.3 provides a
simple, highly compact matrix multiplication operation followed by an activation which
can generally be computed concurrently for each element of the output. This class of
layer is typically referred to as a fully-connected (or Dense) layer, where the number of
weights is the product of the input and output dimensions. It is the most expressive
layer, but also has the highest free-parameter count, making it flexible but also
data-hungry, since many examples are needed to fit all of the parameter values well.

Figure 2.3: A single fully connected neuron

Figure 2.4: A simple 1D 2-long 2-filter convolutional layer


One solution for reducing the free-parameter count and introducing invariance properties

which may be desired in certain layers is by leveraging the convolutional layer [75, 1]

which is commonly realized for 1D, 2D, 3D or higher dimensional input spaces. Here,

the weight vector W is decomposed into a number of distinct filter channels as shown

in figure 2.4, where each filter has some size smaller than the input dimension, and is

strided across the input vector typically at some periodic interval. This has two enormous

benefits. First, if the input is a translation invariant domain such as a signal arriving

at random time offsets, or an image occurring at random X,Y translations, this forms

a powerful regularization which learns the same features at all offsets within the input.

Second, the number of free-parameters is virtually always drastically reduced versus

the equivalent fully connected layer, which reduces the number of examples required to

fit the remaining parameters to similar accuracy.
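
Both benefits can be checked numerically with a small sketch. The snippet below strides two 2-tap filters across a 1D input with `numpy.correlate`, shows that shifting the input simply shifts the feature maps, and compares the weight count against a fully connected layer with the same input and output sizes. The input length, the test pulse, and the shift are arbitrary assumptions.

```python
import numpy as np

def conv1d(x, filters):
    """Slide each filter across x with stride 1 and no padding."""
    return np.stack([np.correlate(x, f, mode="valid") for f in filters])

rng = np.random.default_rng(0)
n_in, n_filters, width = 128, 2, 2          # a 2-long, 2-filter layer
filters = rng.normal(size=(n_filters, width))

x = np.zeros(n_in)
x[40:44] = [1.0, -2.0, 3.0, 0.5]            # a short "signal" at one time offset
shift = 17
x_shifted = np.roll(x, shift)               # the same signal arriving later

y = conv1d(x, filters)                      # shape (n_filters, n_in - width + 1)
y_shifted = conv1d(x_shifted, filters)

# The same features appear, only at a different offset.
print(np.allclose(np.roll(y, shift, axis=1), y_shifted))

conv_params = n_filters * width             # weights shared across all offsets
dense_params = n_in * y.size                # dense layer with the same input/output sizes
print(conv_params, "vs", dense_params)
```

Bias terms are omitted from both counts for simplicity; they do not change the orders-of-magnitude gap.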

Figure 2.5: A sequence of 2D convolutional layers from AlexNet [1]

Dilated convolutions [76] deserve a special mention within our discussion of radio time-

series as well. Their recent use in neural networks [2] has been conducted in the audio

and voice processing domain where dyadically scaled features of many temporal support


widths contribute key features within both music and natural language. However, this

ability of stacked dilated layers to represent exponentially different scalings of raw

features is equally critical in the radio domain, where high sample rates are used and

the temporal support of features may easily span 10x to 1000x or more.
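
The benefit of dyadic dilation can be made concrete by counting how many input samples a single output can see. The helper below is a small sketch that assumes stride-1 layers with a shared kernel size; the ten-layer depth is an arbitrary example.

```python
def receptive_field(kernel_size, dilations):
    """Input samples seen by one output of a stack of stride-1 dilated convolutions."""
    span = 1
    for d in dilations:
        span += (kernel_size - 1) * d     # each layer widens the span by (k - 1) * dilation
    return span

dyadic = [2 ** i for i in range(10)]      # dilations 1, 2, 4, ..., 512
plain = [1] * 10                          # ten ordinary (undilated) layers

print(receptive_field(2, dyadic))         # 1024
print(receptive_field(2, plain))          # 11
```

With the same number of layers and weights, the dilated stack spans roughly two orders of magnitude more samples, which is the property that matters for features with widely varying temporal support.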

Figure 2.6: An example dilated convolution structure from WaveNet [2]

Each of these constructs presumes a feed-forward model for information flow (e.g. each

layer only depends on preceding layers’ outputs). Recurrent layers relax this assumption

and allow for a ’memory’ connection within a single layer. This is a powerful tool which

has been demonstrated to be highly effective in temporal sequence modeling [77, 78] par-

tially due to the fact that it can relax the simplifying Markov assumption which is made

in the case of a HMM.


2.3.3 Regularization

One of the core problems with stochastic gradient descent based methods (and many

other machine learning methods) is the propensity of the training process to overfit the

model to training set data. To avoid overfitting, or aligning the model solution more

closely with the specific training examples than the general solution to the problem they

represent, a number of solutions have been proposed and used over time. Simple forms

of regularization may focus on the L1 or L2 norm of either activations or weight vectors,

attempting to push unused or rarely used conditions to zero, or reduce high magnitude

overfitting to specific cases. Ridge regression attempts to strike an optimal balance be-

tween these factors.

Dropout introduces an entirely new form of regularization [1, 3], which embraces the

combinatorially large number of neurons and paths through a network, and probabilistically

zeros neuron outputs during the training process, effectively removing connections

as shown in figure 2.7.

Figure 2.7: Dropout effect on network connectivity, from [3]

By doing this, networks cannot overly rely on any one specific neuron or network path
for a single use case or example. Training can instead be seen as fitting an exponentially
large ensemble of all possible sub-graphs of neurons through the network, sampled
randomly, at an enormous computational saving over actually training that many
separate independent graphs. The effect of Dropout is quite stark, as shown in the
example in figure 2.8. Here, when training on a very small (500 example) subset of the
canonical Modified National Institute of Standards and Technology (MNIST) dataset
without dropout, training loss quickly goes to near zero, but the model overfits, with
validation loss plateauing at a high level. With dropout, however, training and validation
loss track much more closely, and overfitting occurs much later and to a much lesser
degree, resulting in much better generalization.
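
A minimal sketch of the mechanism is given below. It uses the "inverted" variant of dropout, which rescales the surviving activations during training; this is an assumption made here because it is a common implementation choice, and it differs in bookkeeping (not in spirit) from rescaling at inference time.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training.

    Survivors are scaled by 1 / (1 - rate) so the expected output matches the
    input and no rescaling is needed at inference time.
    """
    if not training or rate == 0.0:
        return activations
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob   # a fresh random sub-graph per call
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
a = np.ones((1000, 100))
dropped = dropout(a, rate=0.5, rng=rng)

print("fraction zeroed:", (dropped == 0).mean())
print("mean activation:", dropped.mean())
```

Each training step draws a new mask, so every mini-batch effectively trains a different sub-network that shares weights with all the others.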

DropConnect [79] was more recently introduced, employing the same variety of proba-

bilistic dropout on network paths during training with slightly improved performance

vs Dropout, but dropping out fine grained neuron inputs rather than outputs. Unfortu-

nately DropConnect requires an increase in computational complexity when computing

ensemble outputs, and is not nearly as simple to implement as Dropout. Its adoption and

widespread usage has not been as notable as Dropout at this time.

More recently, batch normalization [80] has begun to be adopted widely as another form

of regularization (especially for convolutional layers) and functions surprisingly well.


Figure 2.8: Example Effect of Dropout on Training and Validation Loss

In batch normalization, inter-layer activations are normalized to zero mean and unit
variance over each mini-batch during training, resulting in more stable covariance
properties and providing a surprisingly good regularization effect. Currently this is one
of the most widely used regularization methods for state of the art CNNs. Very recently,
an approach has been devised [73] which employs carefully crafted network weight
initializations and scaled exponential linear units (SELUs) in order to guarantee the
same inter-layer activation properties (normalization) without explicitly having to scale
them. This can result in significantly faster convergence and lower computational
complexity in some cases.
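
The training-time normalization step can be sketched as follows. The function name, the `eps` value, and the synthetic mini-batch are assumptions for illustration; the running statistics that a full implementation keeps for use at inference time are deliberately left out.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of a mini-batch, then apply scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale (gamma) and shift (beta)

rng = np.random.default_rng(0)
batch = 3.0 * rng.normal(size=(256, 8)) + 5.0     # activations with mean near 5, std near 3
out = batch_norm_train(batch, gamma=np.ones(8), beta=np.zeros(8))

print("per-feature mean:", np.round(out.mean(axis=0), 6))
print("per-feature var: ", np.round(out.var(axis=0), 6))
```

Because the statistics are computed per mini-batch, each example is normalized slightly differently from step to step, which is one intuition for the regularizing effect noted above.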


2.3.4 Architectural Strategies

There are a number of high level architecture design strategies which have played im-

portant roles in deep neural network design over the past few years. Beyond basic layer

design, higher level connectivity design is important in shaping the flow of information,

combining features from different regions within larger networks, and achieving the right

structure with a limited number of free parameters. Early attempts at providing paths

through the network to combine low level inputs and features with higher level features

included the use of highway networks [81], which showed improvements in some cases.

However more recently, residual networks (ResNets) [4] have become widely adopted

within computer vision due to their ability to fit many features of varying scale, leverage

depth effectively, and avoid heavily overfitting to training sets. They are typically used with

batch normalization for regularization. A single ’residual unit’ is shown in figure 2.9.

Figure 2.9: A single residual network unit, from [4]

Many of these units can be stacked into a ’residual stack’, to form a network where fea-

tures may easily pass through many layers of embedding, or may bypass embeddings,

and may fit optimal sets of features which mix both types of features at many layers of


abstraction. This is an important breakthrough in multi-scale learning, and one that gen-

erally represents the state of the art today in computer vision architectures.
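
The bypass idea can be illustrated with a simplified sketch. The unit below uses two small dense layers in place of convolutions and omits batch normalization, so it is not the exact unit of [4]; the widths, depth, and weight scales are arbitrary assumptions.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_unit(x, W1, W2):
    """Return x + F(x), where F is two dense layers with a ReLU between them."""
    return x + relu(x @ W1) @ W2

def residual_stack(x, weights):
    """Chain several residual units; each may refine x or leave it untouched."""
    for W1, W2 in weights:
        x = residual_unit(x, W1, W2)
    return x

rng = np.random.default_rng(0)
width, depth = 8, 4
x = rng.normal(size=(1, width))

zero_weights = [(np.zeros((width, width)), np.zeros((width, width)))
                for _ in range(depth)]
small_weights = [(0.1 * rng.normal(size=(width, width)),
                  0.1 * rng.normal(size=(width, width)))
                 for _ in range(depth)]

print(np.allclose(residual_stack(x, zero_weights), x))    # features bypass every unit
print(residual_stack(x, small_weights).shape)
```

When the learned branch contributes nothing, the stack reduces to the identity, which is what lets features skip layers of embedding that they do not need.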

Figure 2.10: An exemplary residual network stack, from [4]

Attention or saliency is another key high level architectural design consideration in many

networks. Many networks have a hard time scaling to very large input sizes, so for tasks
such as the Google Street View challenge, which must consume very high resolution
imagery and discriminate house number digits, some method for directing attention to
the digits before discriminating can drastically reduce network complexity by
introducing domain appropriate transforms. In the case of vision, the 2D affine
transform works very well at resolving scale, translation, skew, and rotation in input
patches. Figure 2.11 illustrates the spatial transformer network (STN) architecture,
where a localization network estimates a set of parameters θ which are used by a
transformer to produce a canonical image, which can then be classified using a relatively
simple discriminative network.
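
As a rough sketch of the transformer stage only, the snippet below applies a 2x3 affine parameter matrix θ to a grid of normalized coordinates. The function name `affine_grid`, the coordinate convention of [-1, 1], and the example matrices are assumptions; the localization network that would estimate θ and the interpolation step that resamples the image are both omitted.

```python
import numpy as np

def affine_grid(theta, height, width):
    """Apply a 2x3 affine matrix to a grid of normalized (x, y) coordinates in [-1, 1]."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, height),
                         np.linspace(-1, 1, width), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(height * width)])  # shape (3, H*W)
    return (theta @ coords).reshape(2, height, width)   # sampling x and y for each output pixel

identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
zoom_and_shift = np.array([[0.5, 0.0, 0.25],     # zoom by 2x and shift right
                           [0.0, 0.5, 0.0]])

grid_id = affine_grid(identity, 4, 4)
grid_zoom = affine_grid(zoom_and_shift, 4, 4)
print(grid_zoom[0].min(), grid_zoom[0].max())    # x sampling range within the input
```

The six entries of θ are enough to express scale, translation, skew, and rotation, which is why so few parameters need to be estimated to canonicalize an input patch.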

Many of these architectures were developed for computer vision or for voice. However,
the high level concepts outlined here are at least as applicable in the radio domain, where
high dimensional search spaces may include time, frequency, spatial, polarization or other
dimensions with well understood transforms, as discussed later.


Figure 2.11: Spatial transformer network structure, from [5]

2.3.5 High Performance Computing

Usable computational capacity, delivered through high performance computing hardware and

expressive programming models for algorithms, has been a core enabler of deep learning. Since Gordon Moore’s

famous statement [82] that the number of components/transistors on an integrated circuit

appeared to double every year (later revised to roughly every two years), we have seen

one of the most incredible technological scaling processes in history, driving the growth

of computing and computing related industries. Unfortunately, over the past 10 years

we have begun to run into limitations on translating this transistor count into growth in

useful computation. The cause for this is best illustrated by the plots shown in figure

2.12. Transistor counts continue to scale, however clock speed and single threaded per-

formance have largely plateaued and no longer see the same exponential gains each year.

This has led to a growth in the number of cores per processor reaching a growth rate

almost equal to that of the number of transistors on chip.

Figure 2.12: Single threading ceiling illustrated, from [6]

In the past 10+ years, the computing industry has attempted to embrace this many-core
future by introducing numerous multi-core and many-core processing architectures,
along with many unique programming models with which to target them. Many of these
hardware architectures have achieved theoretical peak performance numbers which
continue to ride Moore’s law. However, some of them, such as the Cell Broadband
Engine (CBE) [83], the Tile processor [84] and others, did so while placing the vast
majority of the burden on the software programmer to effectively balance algorithm
distribution, data movement, thread communication, etc. between many cores.
Unfortunately, this led to highly limited adoption of such architectures, since significant
software development and tuning of algorithms for specific architectures was required in
order to obtain near-theoretical performance numbers. Around this time, we investigated
efficient high throughput software radio on the CBE with limited success [85], but
ultimately faced very long development times and an end-of-life’d processor roadmap
from IBM.

At the same time, GPUs were rapidly expanding to meet the needs of the wide, dense
matrix algebra operations required for high rate and high resolution rendering of games
and movies using OpenGL. To meet the needs of these rendering algorithms, graphics
cards generally turned to many-core solutions where operations could leverage wide
architectures, busses and concurrent processing at power-efficient clock speeds and very
high floating point throughput rates.

Around 2007, the notion of general purpose graphic processing unit (GPGPU) computing

began to come into the forefront. Nvidia released their Compute Unified Device Architec-

ture (CUDA) [86] software development kit, ATI released their Close-to-the-Metal (CTM)

SDK [87], and shortly thereafter OpenCL [88] emerged as an attempt at a mainstream

cross-vendor GPGPU programming solution.

CUDA, CTM (now discontinued), and OpenCL have all been used widely in specific ap-

plications and generally employ a more programmer friendly architecture than possible

with Tile or CBE, however their use in radio signal processing has been somewhat limited

to high computation kernels ported to them and tuned. Wideband channelization [89] has

seen widespread success in this space along with a variety of kernels [90, 91]. In general

these attempts have continued to be plagued by the problem of balancing I/O and com-

pute distribution among compute elements in a general way across a heterogeneous set

of algorithms.

Theano [92] introduced a new model in 2010, which relied on high level NumPy-like [93]

matrix algebra definition in Python and efficient data-flow computation graph

partitioning, GPU compilation, and mapping and optimization over distributed GPU and


CPU compute elements. This was a huge step in that it made the programming model

for concurrent architectures much more rapid and accessible without significant invest-

ment in custom CUDA code, and maintained portability across different CPU and GPU

backends. Google followed shortly thereafter with the release of TensorFlow [94] which

ultimately improved upon and displaced Theano (a university project) with a fully sup-

ported commercial open source project. While AlexNet [1] used CUDA implementations

of their convolutional neural network directly, it was very shortly thereafter that Theano

and similar languages began to be heavily leveraged for rapid model iteration and neural

network prototyping, leveraging a high level programming language and highly efficient

concurrent GPGPU compute architectures for rapid training.

Theano [92] and TensorFlow [94] in this sense really pioneered an entirely new class of

computing, based on the functional programming [95] style definition of very large matrix

algebra computation graphs. This capability has so far been heavily leveraged by the

machine learning and neural network community in libraries such as Keras [96], which

express entire networks as large tensor graphs and efficiently place them

onto multi-CPU or multi-GPU architectures for rapid training and inference. However,

the applicability of these models is actually far wider than solely in machine learning,

with countless signal processing applications standing to benefit from large functional

graph composition, partitioning, kernel synthesis, optimization layout, and orchestration

onto large distributed compute architectures. Within the past few years, the growth of

high performance computing frameworks centered around deep learning, and leverag-


ing these core ideas has been astounding: Caffe [97], Chainer [98], Torch [99], PyTorch

[100], MXNet [101], Lasagne [102], and many other frameworks have explored various

enhancements and syntaxes for such high level deep learning models.

Figure 2.13: Concurrent GPU vs CPU compute architecture scaling (2017), from [7]

In recent years, the spread between concurrent architectures able to continue to grow and

leverage Moore’s law, and those that are more limited in their ability to scale to wide ar-

chitectures has widened greatly, as illustrated in figure 2.13. At this point, virtually every

compute architecture is now following suit and providing very high throughput, wide

tensor operations which scale very well with neural network primitives. Not all
algorithms scale well on such architectures (those with tight sequential loop
dependencies, for instance, do not), but the class represented by most wide and deep
neural networks maps incredibly well and efficiently onto such wide architectures, where
they can be partitioned readily and automatically for both pipeline and data parallelism
from large functional data-flow graph definitions. This synergy between concurrent
model and compute architectures is one of the key enablers for the adoption of deep
learning models, which offer highly efficient realizations versus algorithms which rely on
more iterative or tightly looped designs. It means that any algorithm or approximation
fit to such a network will likely map well onto real world scalable, distributed compute
architectures and work within the limitations imposed on us by device physics.

2.3.6 Model Search

Since the original AlexNet paper [1] there have been numerous improvements in image

recognition architectures. Some of these have been due to significant algorithm enhance-

ments and others have been due to simple architectural and hyper-parameter adjustments

in the architectural elements or training procedures. This general problem of how to best

find an architecture for some learning problem, especially for new problems which (unlike

vision) have not been heavily explored, is still an open one. There have been a number of

attempts to explore this problem of architecture search or hyper-parameter search which

have yielded significant steps forward, but tools to address and solve these problems are

not yet widely disseminated and it is still a major need among many practitioners.

Approaches which have been explored in recent time to solve this problem include using

gradient descent on the hyper-parameters (so called hyper-gradient descent) [103, 104] as

well as reinforcement learning driven search processes [105] and evolutionary methods


[8, 106]. Evolutionary methods seem to currently show some of the most robust results.

Figure 2.14 illustrates the performance of one such evolutionary search for convolutional

network models to solve the CIFAR-10 and CIFAR-100 dataset image classification tasks

[107].

Figure 2.14: Evolutionary performance of image classifier search, from [8]

Unfortunately, the computational resources needed today for such very large NN model

evolutionary searches are quite high. As a result we introduce a simpler small scale

evolutionary strategy later in this work.

2.3.7 Model Introspection

One of the largest critiques of deep learning today is that it can be seen as a "Black Box"

method, in which inputs and output tasks are optimized, but there is little visibility into

what is going on inside the model. While there is some truth to this accusation, it is also


a bit unfair to say a trained neural network is a black box. Aside from the basic intuition

of specific layers’ capabilities, there are a number of techniques which can be employed

to visualize and measure the effects of what is going on within each layer.

Figure 2.15: Layer 1 and 2 filter weights from CNN trained on ImageNet, from [9]

For low level weights, direct inspection of weight vectors can be informative. Layer 1

CNN weights shown in figure 2.15 can provide some intuition as to what each filter rep-

resents. Various rotations and configurations of small low level patterns actually wind

up quite close in some cases to the Gabor filters which were previously used as an expert

low level feature extractor. However, at higher layers in a CNN architecture, the

direct meaning of a set of filter weights is not so immediately clear from direct weight

inspection.

Figure 2.16: Filter activation visualization in CNNs, from [9]

A popular technique for understanding high level CNN feature meaning is by looking

at activations of different features at different layers based on known image stimulus as


explored in [9]. Certain classes of objects which are known to stimulate class labels, can

be seen to activate a number of intermediate feature maps within the image. Example

top-9 activation maps are shown in figure 2.16 for a handful of high level features. Here,

activations can often be seen to be correlated directly with component features of high level

classes by observation. For instance specific facial features may produce activations at

one layer and combine to form a full face activation at a higher level as has been demon-

strated.

In classification tasks, it is possible to perform gradient-based optimization on the input
itself to find a synthetic image which maximally activates some class label. This method
was first performed in [108] and then improved in [10] by including a regularization
term (requiring a relatively smooth input). By doing this, synthetic inputs can be
generated which demonstrate what low level features drive any given activation within a
network. Figure 2.17 shows this technique applied to imagery, clearly illustrating some
of the visual features of each class which have been captured by the high level class
specific feature map activation.

Figure 2.17: Optimization of input images for feature activation, from [10]

Other methods for introspection focus on localizing where in the input vector the con-


tributions to a feature’s activation occur (a so called saliency map). This can be done in

several ways, but one of the most promising recent methods involves differentiating a
feature’s activation output with regard to pixels or points in the input image. This
method, the gradient class activation map (GradCAM) [11], is a powerful way of
localizing and highlighting which regions in an input correspond to which activations.
Figure 2.18 illustrates this technique on dog and cat classes within a single image for a
classifier trained on image labels without any location information.
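
The core operation, differentiating an output with respect to the input, can be sketched on a tiny network. Note that this is plain input-gradient saliency on a hypothetical two-layer network with random weights, not the full GradCAM procedure of [11], which additionally weights convolutional feature maps by their gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 16, 8, 3
W1 = rng.normal(size=(n_in, n_hidden))       # hypothetical, untrained weights
W2 = rng.normal(size=(n_hidden, n_classes))

def scores(x):
    """Class scores of a tiny two-layer network with a tanh hidden layer."""
    return np.tanh(x @ W1) @ W2

def saliency(x, k):
    """Gradient of class score k with respect to every input point (chain rule)."""
    h = np.tanh(x @ W1)
    d_hidden = (1.0 - h ** 2) * W2[:, k]     # d score_k / d (hidden pre-activation)
    return W1 @ d_hidden                     # d score_k / d x

x = rng.normal(size=n_in)
sal = saliency(x, k=1)
most_salient = int(np.argmax(np.abs(sal)))
print("input point with the largest influence on class 1:", most_salient)
```

Large-magnitude entries of the gradient mark the input points whose perturbation would change the chosen score the most, which is what a saliency map visualizes.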

Figure 2.18: GradCAM Saliency Maps for Dogs and Cats, from [11]

From an information theoretic point of view, newer work [12] examines the behavior of
each layer of a neural network by measuring the mutual information between input,
output, and intermediate layers throughout the deep learning training process.

Figure 2.19: Information theoretic visualization of deep learning, from [12]

In figure 2.19, we illustrate the information plane, which plots, for each layer of the
model during training, the mutual information between that layer and the raw input
(X-axis) against its mutual information with the output/targets (Y-axis). Interestingly,
we can see that as training progresses, the layers first move to represent more
information about the input while continually gaining information about the output, and
then finally enter a compression stage where they filter and remove information about
the input X while preserving information about the targets Y. This is an interesting
viewpoint for understanding information flow and compression through a so-called
’bottleneck’ during DL model training. On the right we see the mean and variance of the
gradients used to guide the gradient descent, which start with a high SNR (large mean
and low variance) and gradually decrease in SNR throughout training until they no
longer possess significant meaningful gradient information to further guide the solution.
Such an information centric view is quite important when considering deep learning for
numerous communications and signal processing tasks, where preservation or
compression of information throughout the network is often desired and a solid
understanding of how information is preserved or compressed can be helpful.

Chapter 3

Learning to Communicate

Since virtually the beginning of radio, radio transceivers and waveforms have been con-

ceived through human design. Original electromagnetic (EM) communications systems

such as the telegraph and the spark gap transmitter [109] were practical due to hardware

and EM understanding at the time.

Physical layer designs grew increasingly complex as multiple access schemes such as
frequency-division were introduced to allow additional users, higher data rates,
increased device power efficiency and decreased cost. In 1948 Shannon introduced
information theory and the notion of optimal channel capacity to the world, defining the
fundamental problem of communication as "reproducing at one point either exactly or
approximately a message selected at another point" [20]. This placed a theoretical upper
bound on the capabilities of single antenna transceivers over a Gaussian channel, but it

did not inform radio designers specifically how to attain those levels of performance.

Figure 3.1: Illustration of the many modular algorithms present in a modern wireless physical layer modem such as LTE

Since then, radio engineers have iterated through numerous modulation, coding, and ra-

dio design approaches every few years in an attempt to improve capacity, reduce cost

and power requirements, and generally push our devices closer to these capacity bounds.

In today’s world, modern modems look something like that shown in figure 3.1, which

depicts the physical layer of a modern wireless standard such as LTE with its many

modular algorithms. Here, each module represents one of numerous intense areas of re-

search surrounding optimal coding, MIMO precoding, subframe allocation, modulation,

and other tasks which are all composed sequentially and distinctly to form the powerful

and efficient standards we use today.

Within each of these modules typically lies some analytic formulation of the wireless

channel. In the case of error correction codes, random bit flips may be used when testing


or validating a code, and for modulations or MIMO coding schemes, Gaussian noise or

Rayleigh fading channels are frequently used to model the propagation channel. In each

of these cases, such an approach generally requires simplifying assumptions and modular

optimization of individual algorithmic components rather than as a whole.

This has proven to be effective, but generally leaves open several questions. Can we do better

with richer information about the real distributions of actual impairments in a specific

deployment scenario, and can we do better if we jointly optimize the system rather

than building components with rigid interfaces and intermediate values? Can we find a

more straightforward way to build complex communications systems which attain sim-

ilar performance without the need for thousands of man-hours in engineering, software

implementation and optimization time? And can we find such systems which maintain

near-Shannon levels of performance while maintaining flexibility to adapt the physical

layer more fully than this sort of rigid physical layer algorithm definition will allow?

3.1 The Channel Autoencoder

To answer these questions, we consider again the fundamental task of a radio communications system: “reproducing at one point either exactly or approximately a message selected at another point” [20]. This task is strikingly similar to that of an autoencoder, whose objective is to reconstruct some input vector x at the output $\hat{x}$ and minimize the loss between the two, by learning an encoder and a decoder for some input vector. We first introduce this idea in [110] and further refine it in [111].

Figure 3.2: The Fundamental Communications Learning Problem

Figure 3.3: A simple autoencoder for a 2D MNIST image, from [14]

Traditionally an autoencoder is used to learn a lower dimensionality sparse representa-

tion of the input vector x (such as the MNIST digits shown in figure 3.3), which may be

non-linear when using non-linear neural network activation functions. This approach for

learning encoding, decoding, and sparse representations has the benefits that it can be

fit non-linearly to the distribution of a given input dataset, can be tuned for a specific


loss function (e.g. MSE, binary cross-entropy (BCE), CCE), and that it can act as a fil-

ter to remove non-structural noise which does not lie within the learned support of the

compressed representation.

Figure 3.4: A Simple Channel Autoencoder

We can formulate the radio communications system problem as a similar autoencoder, where a message to transmit s, either a k-bit binary vector with $M = 2^k$ possible codewords or an equivalent one-hot codeword vector of length M, is encoded, passed through some set of channel impairments, and then decoded to recover $\hat{s}$, an estimate of the originally transmitted message. The channel layer may be stochastic in nature; stochastic layers have been used regularly within computer vision for their regularizing properties (e.g. [3], [112]).
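As a small illustration of the two equivalent message representations, the sketch below converts a k-bit vector into a codeword index and then into a one-hot vector of length M = 2^k. The function names and the most-significant-bit-first ordering are assumptions made here for illustration; the text does not fix a bit ordering.

```python
def bits_to_index(bits):
    """Map a k-bit vector (MSB first, an assumed convention) to an index in [0, 2**k)."""
    index = 0
    for b in bits:
        index = (index << 1) | int(b)
    return index


def one_hot(index, M):
    """Return a length-M one-hot codeword vector with a 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(M)]


k = 2
M = 2 ** k
s = one_hot(bits_to_index([1, 0]), M)  # bits "10" select codeword index 2
```

Either representation carries the same k bits; the one-hot form is the one that pairs naturally with a softmax output and a categorical cross-entropy loss.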

This channel autoencoder differs from the conventional use of an autoencoder in a few ways. First, the intermediate representation of the signal may actually be of higher dimension (as opposed to most autoencoders, which seek a sparse representation). Second, the channel layer introduces numerous lossy and mixing impairments rarely seen in other configurations (e.g. noise, fading, rotation, etc.). We consider s to be a number of bits k producing $2^k = M$ distinct messages which are encoded into some number, n, of real or complex valued digital samples. Controlling this ratio k/n (with the configuration further referred to as (n, k)) for a given sample rate and signal and noise power controls the information rate at which bits are transmitted over the channel. By modifying these dimensions, any rational rate system can be obtained using the same approach for arbitrary values of k and n, or simply M and n.
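To make the dependence on k/n concrete, the following sketch converts an Eb/N0 value into the per-sample noise standard deviation that an additive Gaussian noise layer would use. It assumes real-valued samples with unit average energy per sample (as a normalization layer would enforce) and a rate R = k/n in bits per sample; the function name is hypothetical and this is not necessarily the exact scaling used in the experiments.

```python
import math


def noise_std(ebn0_db, n, k):
    """Std. dev. of real Gaussian noise per sample for a given Eb/N0 in dB.

    Assumes unit energy per real channel sample, so Eb = 1 / R with
    R = k / n, and noise variance N0 / 2 = 1 / (2 * R * Eb/N0).
    """
    rate = k / n
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return math.sqrt(1.0 / (2.0 * rate * ebn0))


sigma_74 = noise_std(8.0, n=7, k=4)  # rate 4/7
sigma_88 = noise_std(8.0, n=8, k=8)  # rate 1
```

Under these assumptions, a lower-rate configuration such as (7,4) sees a larger per-sample noise level at the same Eb/N0 than a rate-1 configuration, because each information bit is spread over more transmitted energy.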

We construct the network using a relatively small architecture, shown in table 3.1, whose dimensions scale based on M and n. Interestingly, while a single fully connected linear layer in the encoder is fully capable of mapping all codewords to all possible real valued transmit symbols in one step, SGD cannot find a good solution when using only a single layer, and gets stuck in a sub-optimal local minimum during training. Adding a second layer of depth to the transmit and receive networks, however, allows the network to converge very rapidly to a very good global optimum for the network weights. This is an excellent illustration of the work in [37], which demonstrates that using a deeper network with a higher dimensional parametric search space actually helps networks converge to more globally optimal solutions, as they are much less likely to become trapped in a local minimum, simply because the curvature along all degrees of freedom is unlikely to align. In this deeper, higher dimensional space they are instead more likely to encounter a saddle point, which is not necessarily terminal in a gradient descent search when using a strong saddle-free optimization method (some, such as Newton’s method, may have difficulty).

Table 3.1: Layout of the autoencoder used in Figs. 3.5 and 3.6. It has (2M + 1)(M + n) + 2M trainable parameters, resulting in 62, 791, and 135,944 parameters for the (2,2), (7,4), and (8,8) autoencoder, respectively.

Layer              Output dimensions
Input              M
Dense + ReLU       M
Dense + linear     n
Normalization      n
Noise              n
Dense + ReLU       M
Dense + softmax    M
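The parameter count quoted in the caption of table 3.1 can be checked by summing the weights and biases of the four dense layers. The sketch below assumes that every dense layer carries a bias vector and that the normalization and noise layers have no trainable parameters; the helper names are hypothetical.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def autoencoder_params(n, k):
    """Trainable parameters of the table 3.1 layout for an (n, k) autoencoder."""
    M = 2 ** k
    # (inputs, outputs) of the four dense layers: two in the encoder, two in the decoder.
    layers = [(M, M), (M, n), (n, M), (M, M)]
    return sum(dense_params(n_in, n_out) for n_in, n_out in layers)


for n, k in [(2, 2), (7, 4), (8, 8)]:
    M = 2 ** k
    closed_form = (2 * M + 1) * (M + n) + 2 * M
    print((n, k), autoencoder_params(n, k), closed_form)
# (2, 2) 62 62
# (7, 4) 791 791
# (8, 8) 135944 135944
```

The layer-by-layer sum agrees with the closed form (2M + 1)(M + n) + 2M for all three configurations listed in the caption.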

In order to avoid the trivial solution of using very large values for x in the symbol en-

coding, to increase the effective SNR over a constant channel noise power, we introduce a

transmit normalization layer after the encoder which enforces a constant average power

for transmitted symbols during training, as indicated in figure 3.1. This can be done on a

per-symbol or per-batch level, and can be enforced in a number of ways including mean

amplitude, mean power, max power, or other similar constraint, yielding quite different

results for each in some cases.

In figure 3.5 from [111] we compare the performance of a learned physical layer encoding

for block sizes of 2 and 8 bits, and compare to the block/codeword error rate perfor-

mance of an uncoded binary phase shift keying (BPSK) modulation. In this case, we have

the interesting result that, for a 2-bit codeword size, 2xBPSK and the (2,2) autoencoder

obtain the same information rate (by definition), and align on an almost identical error


rate curve. As we increase the block size to 8 bits, we begin to see the (8,8) autoencoder

system outperform the un-coded 8xBPSK system by 1-2 dB at higher SNR values. This

indicates that the larger block size (8,8) autoencoder is in fact learning some form of error

correction, where its encoding scheme is more robust than the simple BPSK solution.
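For reference, the uncoded BPSK baseline in this comparison has a closed-form block error rate over an AWGN channel: a block of k independent bits is correct only if every bit is correct. The sketch below evaluates that expression; it is a standard textbook result added here for illustration, not code from this work, and the function names are hypothetical.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def uncoded_bpsk_bler(ebn0_db, k):
    """Block error rate of k uncoded BPSK bits over AWGN at a given Eb/N0 in dB."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    bit_error = q_function(math.sqrt(2.0 * ebn0))
    return 1.0 - (1.0 - bit_error) ** k


for ebn0_db in (0, 4, 8):
    print(ebn0_db, uncoded_bpsk_bler(ebn0_db, 2), uncoded_bpsk_bler(ebn0_db, 8))
```

Because the bits are decoded independently, the uncoded block error rate can only grow with block size at a fixed Eb/N0, which is why any gain of the (8,8) autoencoder over this curve suggests that it exploits joint coding across the block.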

Figure 3.5: BLER versus Eb/N0 for autoencoder

[Plot: block error rate (log scale, 10^-5 to 10^0) versus Eb/N0 [dB] from −2 to 10. Curves: Uncoded BPSK (8,8); Autoencoder (8,8); Uncoded BPSK (2,2); Autoencoder (2,2).]

In figure 3.6 we consider the comparison of an autoencoder with 4-bit codewords and 7

real valued symbols over the channel. Here, we consider three different baselines, first

the uncoded (4,4) BPSK solution which provides the worst performance, and then two baselines using a Hamming code with the same 4/7ths rate as the autoencoder. In the

case of the hard decision decoder, there is still a 1-2dB gap in performance, while for

MLD decoding, the performance is nearly identical. This is a very promising result as it

shows that for small block sizes, the channel autoencoder approach can learn very strong

solutions which rival commonly used modulation and error correction codes.
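The hard-decision Hamming (7,4) baseline also has a simple closed form over AWGN, since the code corrects exactly one bit error per block: a block fails whenever two or more of its seven coded bits are flipped. The sketch below evaluates this under the assumption of BPSK signalling with the energy per information bit spread over 7/4 coded bits; it is a textbook calculation added for illustration, with hypothetical function names.

```python
import math


def q_function(x):
    """Gaussian tail probability Q(x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def hamming74_hard_bler(ebn0_db):
    """BLER of Hamming (7,4) with hard-decision decoding of BPSK over AWGN."""
    rate = 4.0 / 7.0
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    p = q_function(math.sqrt(2.0 * rate * ebn0))  # coded-bit flip probability
    # Decoding succeeds with zero or exactly one flipped bit out of seven.
    return 1.0 - (1.0 - p) ** 7 - 7.0 * p * (1.0 - p) ** 6


def uncoded_bler(ebn0_db, k=4):
    """BLER of k uncoded BPSK bits over AWGN, for comparison."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    return 1.0 - (1.0 - q_function(math.sqrt(2.0 * ebn0))) ** k


gain_at_8db = uncoded_bler(8.0) / hamming74_hard_bler(8.0)
```

Maximum-likelihood decoding (MLD) of the same code has no equally compact expression and is usually evaluated by simulation, so it is not sketched here.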

Figure 3.6: BLER versus Eb/N0 for autoencoder

[Plot: block error rate (log scale, 10^-5 to 10^0) versus Eb/N0 [dB] from −4 to 8. Curves: Uncoded BPSK (4,4); Hamming (7,4) Hard Decision; Autoencoder (7,4); Hamming (7,4) MLD.]

To further understand the solutions learned by this naive autoencoder learning process, we can plot the constellations of each learned encoding scheme simply from their input to the channel module. Figure 3.7 illustrates the constellations learned for (2,2), (2,4), (2,4), and (7,4) schemes, where different power normalization constraints on (2,4) produce different constellations (e.g. 16-PSK or non-standard 16-QAM), and the 7-dimensional encoding space of the (7,4) code is visualized in 2 dimensions using t-SNE [113]. It is pleasing here that the canonical QPSK solution (with random rotation) is achieved for the (2,2) code, and that the familiar PSK as well as non-rectangular near-optimally packed 16-QAM is achieved for (2,4).

Figure 3.7: Constellations produced by autoencoders using parameters (n, k): (a) (2, 2), (b) (2, 4), (c) (2, 4) with average power constraint, (d) (7, 4) 2-dimensional t-SNE embedding of received symbols.

The training process for channel autoencoders is an interesting problem in which the model must learn to perform well in low and high SNR conditions, and the channel and training parameters may be manipulated during training. Experimentally, we find that training at a mid-range SNR (8 dB Eb/N0) works well, and that varying batch size from small (50) to large (10,000) in two passes works well to effectively train the system. This is an interesting result, as the batch size has an effect on the effective SNR of the gradients and the average receive symbol locations. In general in computer vision, high SNR images are used, which may have occlusions, permutations, or small objects, but generally do not have white noise competing with the ’signal power’ of an actual visual object. However, in vision there has also been discussion recently surrounding the use of increasing batch sizes, rather than a decreasing learning rate, throughout training as smaller (and more noisy) step sizes are needed during optimization.

The choice of transmit normalization is an interesting one which has no clear ’best choice’, but has a significant effect on the learned solution. We consider a number of normalization functions which map the output of the encoder $X_{i,j,k}$ to the input to the channel $X_t$, as $X_t = f_{norm}(X_{enc})$. Here, $X_{i,j,k}$ represents a 3 dimensional tensor, over i the example index, j the sample index within one example, and k the complex sample component index (i.e. I and Q), for one training iteration. Table 3.2 provides several possible transmit normalization functions $f_{norm}$ which can be used, where $N_x$ is the number of examples, $N_s$ is the number of samples, and $N_c$ is the number of components per sample (2).

Table 3.2: Candidate channel autoencoder transmit normalization functions

Tx Norm Method                  Expression
Example Mean Power (EMP)        $X_t = X (N_x N_s) / \sum_{i,j} \sqrt{\sum_{k} X_{i,j,k}^2}$
Batch Mean Power (BMP)          $X_t = X (N_x N_s N_c) / \sqrt{\sum_{i,j,k} X_{i,j,k}^2}$
Batch Mean Ampl. (BMA)          $X_t = X (N_x N_s N_c) / \sum_{i,j,k} \mathrm{abs}(X_{i,j,k})$
Batch Mean Max Power (BMMP)     $X_t = X (N_x N_s N_c) / \sqrt{\sum_{i,j,k} \max(X_{i,j,k}^2, 1)}$
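To give a feel for how such constraints differ, the sketch below implements three simplified normalizations in the spirit of the per-example, batch mean power, and batch mean amplitude options. The scale factors are chosen here so that the named quantity equals one, which is an assumption and may not match the constants of table 3.2 exactly; the function names are hypothetical, and the tensor is a plain nested list indexed as X[example][sample][component].

```python
import math


def sample_power(sample):
    """Power of one complex sample given as an (I, Q) pair."""
    return sum(v * v for v in sample)


def per_sample_unit_power(X):
    """Scale each (I, Q) sample to unit power, giving constant modulus points."""
    return [[[c / math.sqrt(sample_power(s)) for c in s] for s in ex] for ex in X]


def batch_unit_mean_power(X):
    """Apply one scale factor so the mean sample power over the batch is 1."""
    powers = [sample_power(s) for ex in X for s in ex]
    scale = 1.0 / math.sqrt(sum(powers) / len(powers))
    return [[[c * scale for c in s] for s in ex] for ex in X]


def batch_unit_mean_amplitude(X):
    """Apply one scale factor so the mean sample amplitude over the batch is 1."""
    amps = [math.sqrt(sample_power(s)) for ex in X for s in ex]
    scale = len(amps) / sum(amps)
    return [[[c * scale for c in s] for s in ex] for ex in X]


# Two examples with one complex sample each (no all-zero samples, as assumed above).
X = [[[3.0, 4.0]], [[0.0, 1.0]]]
P = per_sample_unit_power(X)
B = batch_unit_mean_power(X)
A = batch_unit_mean_amplitude(X)
```

The per-sample variant forces every point onto the unit circle, while the two batch variants preserve the relative amplitudes of the points and only fix an average, which is what allows multi-level constellations to emerge.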

To gather an intuition for the learned solutions of this class of learned constellation in a

traditional 2D (I/Q) single symbol space, we can compute and plot the learned constel-

lations for 2-QAM through 33-QAM for each normalization strategy below. Interestingly,

since we can map to any number of codewords trivially with this approach, we don’t need

an integer number of bits to transmit, only an integer number of codewords, leading to

numerous rate adaptation possibilities beyond the traditional $2^N$ constellations

used today for QAM.

First, in figure 3.8 we show the result of using the symbol power constraint per example (EMP). In this case each symbol takes on an average power of 1, leading to conventional constant modulus solutions of phase-shift keying (PSK).

Figure 3.8: Learned QAM Modes for Example Mean Power (EMP)

Figure 3.9: Learned QAM Modes for Batch Mean Power (BMP)


In figure 3.9 we use the average symbol power over an entire batch, which frees each indi-

vidual symbol up to vary to some degree as long as the mean is constrained. Here, we be-

gin to see multi-level constellations form which are quite interesting and non-conventional.

However, one interesting case here is that of 5-QAM where it has learned a relatively con-

stant power constellation which differs from BMA.

Figure 3.10: Learned QAM Modes for Batch Mean Amplitude (BMA)

Figure 3.10 shows the batch mean amplitude mode, where again we obtain a number of

novel solutions, such as for 5-QAM, where we obtain a QPSK looking constellation which

also uses the zero-power mode as a 5th constellation point.

Numerous additional constraints are possible. In figure 3.11 we use a constraint which limits mean power per batch, but considers $\max(X_{i,j,k}^2, 1)$ before averaging to avoid overly incentivizing low-power constellation points (e.g. all points under the average power are of equal penalty). These, for example, might lead to results with a very poor peak to average power ratio (PAPR) (leading to poor amplifier efficiency).

Figure 3.11: Learned QAM Modes for Batch Mean Max Power (BMMP)

These results are of course only for a single symbol. By scaling values of n and k, we can design a system which encodes an arbitrary number of bits into an arbitrary number of symbols. When encoding across multiple symbols, a typical solution appears to be a unique non-standard $2^k$-QAM constellation for each symbol, and then some kind of trellis-like combining across multiple symbols to obtain good coding gain. Examples of this are shown in figure 3.12, where we encode 2-bit, 4-bit, and 8-bit messages into groups of 4 sequential symbols. In this case, additional error correction capacity is obtained versus the single symbol form, and a distinct QAM arrangement is learned for each symbol with a highly non-intuitive arrangement. Here one codeword corresponds to a point in each of the four spaces. While two constellation points may be close together in one symbol, the points corresponding to the same message will be far apart in another symbol time, allowing for non-linear combining to perform an efficient representation and decoding over all the dimensions.

Figure 3.12: Learned 4-Symbol QAM Modes using BMA (for 2 bit, 4 bit, and 8 bit)

This method works surprisingly well, but one of the key challenges with it is scaling to much larger codeword sizes such as the 1000+ bits used in modern turbo codes. When using $L_{CCE}$ we must select among $2^k$ codeword indices (messages), scaling our network exponentially as bits are added. One solution is to use k binary inputs and k sigmoid binary bit outputs along with an $L_{BCE}$ loss function. In this case, the network scales more linearly with block size; however, we have not yet been able to obtain near optimal capacity performance from a network trained in such a fashion. Other strategies for scaling to larger networks have been explored very recently within the scope of error correction codes [114, 115, 116] through methods involving partitioning and leveraging belief propagation graphs to seed neural network weights; however, significant work remains to allow for scaling these techniques to large codeword sizes, such as are widely used today in modern LTE systems. Ultimately, methods such as replicating network structure within the full block size, whether through weight/connection tying or through some form of recurrent operation with state, hold significant promise for solving this problem in the future and allowing these methods to be competitive with state of the art error correction and modulation schemes.
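The exponential growth described above is easy to quantify with the parameter count of the table 3.1 layout. The sketch below evaluates it for a few block sizes with n = k (an illustrative choice) and contrasts the one-hot layer width 2^k with the width k of a bit-wise input and output; the function name is hypothetical.

```python
def one_hot_autoencoder_params(n, k):
    """Parameter count of the table 3.1 layout, which grows with M = 2**k."""
    M = 2 ** k
    return (2 * M + 1) * (M + n) + 2 * M


for k in (4, 8, 16):
    print(f"k={k:2d}  one-hot width={2 ** k:6d}  bit-wise width={k:2d}  "
          f"one-hot model parameters={one_hot_autoencoder_params(k, k)}")
```

Already at k = 16 the one-hot formulation needs more than eight billion parameters, which illustrates why block sizes in the thousands of bits are out of reach without a different input and output representation or shared network structure.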

3.2 Learning to Synchronize with Attention

When learning to discriminate between received symbols in a channel autoencoder (or

between classes in a classifier), the discriminative model must generally learn to classify

all forms of signal variation which may arrive at the receiver. In radio, permutations due

to the channel include additive noise, phase offset, frequency offset, delay spread, interference, and many other distortions such as hardware non-linearities and mixer intermodulation products. Previous results were shown only with AWGN impairments; however, real world systems include all of these effects and more.

Figure 3.13: Spatial Transformer Example on MNIST Digit from [5]

In computer vision, objects undergo a somewhat analogous set of permutations when being viewed, including scaling, rotation, skew, translation, occlusion, and noise. Since these permutations are geometrically well understood, a domain appropriate parametric transformation such as the 2D affine transform may be applied to correct them directly, as shown in figure 3.13 from [5]. By imparting expert knowledge about the domain appropriate parametric transforms, the task of canonicalizing an object may be reduced to estimating a set of parameters and then executing the transform. By splitting a classification task up into learned parameter estimation (localization), parametric transformation, and learned class discrimination, the model complexity needed to classify a range of permutations on the classes may be greatly simplified. If the parametric transform is implemented in a way in which it can maintain its differentiability, both localization and discrimination networks may be trained in an end-to-end fashion as a single task (e.g. minimize CCE) by using back-propagation from the global loss function both before and after the transform. This architecture has proven to be very effective for image classification, such as the Google Street View house number challenge, where the localization network helps locate and canonicalize digits and the discriminative network classifies digits.

Figure 3.14: Radio Transformer Network Architecture

The same architecture can be applied to radio communications problems (as we show

in [117, 111]), where current day transformations such as application of equalizer taps,

removal of carrier phase and frequency, or timing errors can be applied directly, as long

as they can be implemented in a differentiable manner. In this case, we can split the

network into a more general (not just spatial) parameter estimation network to estimate

CSI, and a discriminative network to perform symbol estimation (or any other task), while

maintaining our expert knowledge about the domain appropriate transforms in order to

simplify the target learning manifold task and often reduce the number of free parameters

needed in our model. Since we have imparted expert knowledge about the physical radio


effects, we have only specialized our solution for the domain in general (e.g. things that

happen to all radio signals). This is an important point, since we have not done anything

to specialize the parameter estimation or discriminative networks for any one specific

signal or modulation type, keeping domain-wide non-signal-specific generality in our

model architecture.

To validate the radio transformer network (RTN) approach, we consider several tasks. First, we consider the performance of a channel autoencoder under a Rayleigh fading channel with a tap length of L = 3. In this case, we allowed the estimated parameters, θ, to take the form of $h^{-1}$, the channel impulse response inverse, which can be directly convolved with the received signal to obtain a canonical impulsive copy of the signal. We implement the convolution in differentiable tensor algebra within Keras [96] as a set of dense matrix multiplies and adds (the standard TensorFlow convolution operation cannot be used when both the input and convolution taps are free variables).
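One way to see how a convolution can be written purely as multiplies and adds is to arrange delayed copies of the received signal into a matrix, so that the convolution with the estimated taps becomes a matrix-vector product. The sketch below shows this in plain Python with complex numbers; it is an illustration of the idea under that assumption and not the Keras implementation used in this work, and all names are hypothetical.

```python
def convolution_matrix(y, num_taps):
    """Matrix whose product with a tap vector equals the full convolution of y with the taps."""
    n_out = len(y) + num_taps - 1
    return [[y[i - j] if 0 <= i - j < len(y) else 0j for j in range(num_taps)]
            for i in range(n_out)]


def matvec(A, v):
    """Dense matrix-vector product using only multiplies and adds."""
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def direct_convolve(y, w):
    """Reference full convolution, used only to check the matrix form."""
    out = [0j] * (len(y) + len(w) - 1)
    for i, yi in enumerate(y):
        for j, wj in enumerate(w):
            out[i + j] += yi * wj
    return out


y = [1 + 1j, 2 + 0j, -1j]            # received samples (illustrative values)
w = [0.5 + 0j, -0.25j, 0.1 + 0j]     # estimated inverse taps (illustrative values)
equalized = matvec(convolution_matrix(y, len(w)), w)
```

Because every output sample is a sum of products of an input sample and a tap, the operation is bilinear, and gradients can flow both to the received signal path and to the network that estimates the taps.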

In figure 3.15 we illustrate the training complexity reduction for this task, comparing the

training loss curve for an autoencoder both with and without the CSI estimation network

and transformer in front of the symbol discrimination task. Here, we can see that it con-

verges to a solution for both, but in the case of the RTN, it converges much more quickly

to a good solution in only a few epochs, and ultimately achieves a much lower final CCE

loss (and BLER).

Comparing the performance of the autoencoder with and without the RTN synchronizer

on the front, we can observe the fully trained block error rate performance in figure 3.16.


Figure 3.15: Autoencoder training loss with and without RTN

[Plot: categorical cross-entropy loss (0 to 1) versus training epoch (0 to 100). Curves: Autoencoder; Autoencoder + RTN.]

Figure 3.16: BLER versus Eb/N0 for various communication schemes over a channel with L = 3 Rayleigh fading taps

[Plot: block error rate (log scale, 10^-4 to 10^0) versus Eb/N0 [dB] from 0 to 20. Curves: Autoencoder (8,4); DBPSK(8,7) MLE + Hamming(7,4); Autoencoder (8,4) + RTN.]


Here, the non-RTN version is unable to achieve a level of performance which outperforms the baseline method of MLD DBPSK decoding with a Hamming code, while the autoencoder with RTN achieves a significantly better performance result, especially for higher SNR values. This is quite an exciting result, as it shows that a fully learned approach can leverage expert domain knowledge about radio propagation physics, still maintain full generality among signals, and very quickly learn a good solution which outperforms common baseline levels of performance through the RTN approach of CSI estimation, transformation, and symbol estimation. In this case, the learned model may also benefit from the bias present in the fading channel model (the distribution of the taps): since it is constrained to a set of L = 3 Rayleigh fading taps, the solution space is not uniform over all possible real values for $h^{-1}$, which generally allows the system to specialize better for the actual distribution.

Such a result could be incredibly powerful in a wireless environment, where CSI estima-

tion and equalization could be heavily specialized and improved for the delay spread

distribution within specific deployment scenarios and conditions, but is also somewhat

troubling in that it is increasingly important that the simulations and impairment mod-

els used for training sufficiently match the possible channel conditions which may be

encountered in the real world at inference time.

This technique is a very general front-end strategy when constructing ANN models for

high dimensionality parametric search spaces, to leverage knowledge about appropriate

transforms. Results here are shown for the autoencoder and symbol decoding problem,


but preliminary results show that such an approach can also help in sensing and other

tasks such as signal type or modulation recognition or other sorts of signal property la-

beling through model learning on RF emissions in the spectrum.

3.3 Multi-User Interference Channel

One of the nice features of the channel autoencoder is the versatility with which it can

solve many different formulations of the radio communications problem with variations

on the same compact optimization problem, with no need to devise complex new phys-

ical layer encoding or signal processing strategies. One important such case is that of

the multi-user interference channel, where optimization of some aggregate multi-user

capacity is the goal rather than a single transmitter and receiver. This is a critical case

in wireless systems as it represents most wireless channels with which we interact on a

daily basis, where we share some piece of spectrum (e.g. cellular bands, industrial, scientific, and medical radio (ISM) bands, ground mobile radio (GMR) bands) with a number

of different users who must somehow share the available spectrum to optimize for some

joint objective such as capacity. While multi-user capacity bounds have been derived for

specific instances, no general solution exists to bound aggregate capacity under all condi-

tions, meaning we do not know how far current day systems are from optimal usage of the

interference channel. Unfortunately today, we have a slow iterative process of physical

layer design, optimization, analysis, and then manual redesign based on whatever intuition is gleaned from the analysis. Channel autoencoders offer us a tool by which

to break out of this painful cycle and directly seek to find a globally optimal multi-user

physical layer (PHY) scheme from the ground up, optimizing for aggregate capacity or

any other pertinent design objective or constraint deemed important for its application.

Figure 3.17: The two-user interference channel seen as a combination of two interfering autoencoders that try to reconstruct their respective messages

Using the same channel autoencoder construct previously used, we can formulate the problem with a new mixing channel within the channel layer of two autoencoders as shown in figure 3.17. Here there are two objectives to minimize, $L_1 = L_{CCE}(s_1, \hat{s}_1)$ and $L_2 = L_{CCE}(s_2, \hat{s}_2)$, the reconstruction loss for user 1 and 2 respectively. These can be treated as a single network, where each optimization step chooses a random batch of independent values for both $s_1$ and $s_2$ and completes a back-propagation step to minimize the two. Encoders in this case only have knowledge of their own transmit codeword, and the network architecture from table 3.3 is used, where dimensions [x, x] indicate two separated paths of size x and a dimension of [x] indicates a single path of size x.

Table 3.3: Layout of the multi-user autoencoder model

Layer              Output dimensions
Input              [M, M]
Dense + ReLU       [M, M]
Dense + linear     [n, n]
Normalization      [n, n]
Addition           [n]
Noise              [n, n]
Dense + ReLU       [M, M]
Dense + softmax    [M, M]

When optimizing for multiple loss functions there is often a question of how to combine them. This can be done additively, multiplicatively, or in many other ways, which all have an effect on the optimization process and the form of the resulting error gradients. The most straightforward approach is to simply sum the two loss functions. Unfortunately, when doing this, it is not uncommon for imbalance to occur between the two objectives (e.g. favoring one user’s CCE loss, and therefore BLER, over that of another). If equal loss is desired among the loss functions, some means for balancing the loss magnitudes must be used. In this case, we seek to obtain fair performance among two users accessing the same channel. As described in [111], to address this, we adopt the following joint loss term $L_I$ with loss weight term $\alpha_t$, which is given an initial condition of α = 0.5 and is updated each mini-batch time step t as follows.

$$L_I = \alpha L_1 + (1 - \alpha) L_2, \qquad \alpha_{t+1} = \frac{L_1}{L_1 + L_2}, \quad t > 0 \qquad (3.1)$$

Figure 3.18: BLER versus Eb/N0 for the two-user interference channel achieved by the AE and $2^{2k/n}$-QAM TS for different parameters (n, k)

[Plot: block error rate (log scale, 10^-5 to 10^0) versus Eb/N0 [dB] from 0 to 14. Curves: TS/AE (1, 1); TS/AE (2, 2); TS (4, 4); AE (4, 4); TS (4, 8); AE (4, 8).]

While this metric is heuristic in nature, it does a good job empirically balancing the two loss functions during training to arrive at a PHY with roughly equal BLERs and mean symbol powers.
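The weighting rule of equation 3.1 can be sketched in a few lines. The loss values below are made up purely for illustration, and the function names are hypothetical; in training, the two losses would come from the current mini-batch.

```python
def joint_loss(alpha, loss_1, loss_2):
    """Weighted joint loss L_I = alpha * L1 + (1 - alpha) * L2."""
    return alpha * loss_1 + (1.0 - alpha) * loss_2


def update_alpha(loss_1, loss_2):
    """Next weight: the share of the total loss contributed by user 1."""
    return loss_1 / (loss_1 + loss_2)


alpha = 0.5  # initial condition
history = []
for loss_1, loss_2 in [(0.9, 0.3), (0.6, 0.4), (0.5, 0.5)]:  # illustrative losses
    history.append((alpha, joint_loss(alpha, loss_1, loss_2)))
    alpha = update_alpha(loss_1, loss_2)
```

Whenever user 1 has the larger loss, its weight for the next step rises above 0.5, so the optimizer is pushed toward the user that is currently lagging; once the two losses match, the weight returns to 0.5.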

When comparing the aggregate BLER (and thus multi-user capacity) of such a system

with a completely orthogonal QAM based access sharing system such as time-sharing

(orthogonal time access (TDM) from [111]), as is shown in figure 3.18, we observe several

important results. First, the time-sharing autoencoder system (TS/AE) outperforms the baseline time-sharing QAM system (TS), as we have previously shown; in this case the autoencoder simply learns a single user access strategy within each of its time-slots. Secondly, the multi-user autoencoder (AE or multi-user (MU)/AE) learns a solution which

outperforms the TS/AE system even further. This result is illustrated for both 4-bit and

8-bit codeword sizes over a Gaussian interference channel in figure 3.18.


Figure 3.19: Learned constellations for the two-user interference channel with parameters (a) (1, 1), (b) (2, 2), (c) (4, 4), and (d) (4, 8). The constellation points of Transmitter 1 and 2 are represented by red dots and black crosses, respectively.

In the case of the MU/AE system, an aggregate BLER of roughly $10^{-3}$ is achieved for the 4-bit system at around 0.7 dB lower Eb/N0, while for the 8-bit system it is around 1 dB lower. This offers quite significant potential gains for future multi-user access systems, which generally only stand to improve as additional channel impairments and numbers of users increase.

Inspecting the constellations learned by the MU/AE system helps to provide some intu-

ition as to what has been learned. In figure 3.19, we illustrate the constellations learned

in the (1,1), (2,2), (4,4), and (4,8) MU/AE configurations.

For the (1,1) system, the solution is a nice, quite easy to interpret solution which a human designer might easily have come up with. Here the system has learned a set of two phase-orthogonal BPSK modulations at random rotation, providing, in this case, an orthogonal solution which does not reduce the rate of either user.
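A hand-built version of this (1,1) solution makes the orthogonality explicit. In the sketch below, user 1 places its BPSK symbol on the in-phase axis and user 2 on the quadrature axis, both transmissions add in the channel, and each receiver simply reads off its own axis. The zero rotation, the noise level, and all names are assumptions chosen for illustration; the learned constellations sit at an arbitrary common rotation.

```python
import random


def bpsk_i(bit):
    """User 1: BPSK on the in-phase (real) axis."""
    return complex(1.0 if bit else -1.0, 0.0)


def bpsk_q(bit):
    """User 2: BPSK on the quadrature (imaginary) axis."""
    return complex(0.0, 1.0 if bit else -1.0)


def interference_channel(x1, x2, sigma, rng):
    """Each receiver sees the sum of both transmissions plus its own noise."""
    def noise():
        return complex(rng.gauss(0.0, sigma), rng.gauss(0.0, sigma))
    return x1 + x2 + noise(), x1 + x2 + noise()


rng = random.Random(0)
errors = 0
for _ in range(1000):
    b1, b2 = rng.getrandbits(1), rng.getrandbits(1)
    y1, y2 = interference_channel(bpsk_i(b1), bpsk_q(b2), 0.1, rng)
    # Receiver 1 projects onto the real axis, receiver 2 onto the imaginary axis.
    errors += int((y1.real > 0) != bool(b1)) + int((y2.imag > 0) != bool(b2))
```

Since the other user's symbol lies entirely on the axis that a receiver ignores, each user sees an interference-free BPSK link and neither has to give up rate to the other.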

For the (2,2) system, the solution begins to become quite interesting. In this case, the

solution of a sort of super-position code, where slightly skewed and phase-offset 4-QAM

constellations are used by each user within each time-slot is found, where users alternate

opportunities as the high powered user. This is not necessarily an intuitive solution, but

inspecting the performance curve in figure 3.18, we see that it actually achieves better

performance than the obvious solution of purely orthogonal time-slotted QPSK.

For (4,4) and (4,8) systems, this trend of pseudo-orthogonal super-position code learning continues, but solutions begin to become increasingly complex and are hard to gather

significant intuition from. Inspecting the (4,4) code, we can see that each user for each

symbol uses a unique layout of 16-QAM to encode the 4 bits robustly across 4 symbols.

The learned decoding process appears to be able to combine these decision surfaces very

effectively into a robust low-error rate system for both cases. For the (4,8) system it is

difficult to glean much from the constellation layouts, but we can see that the clusters of

non-standard QAM-256 points form roughly oval shaped layouts where the major axis

appears to be orthogonal.

The exciting nature of this approach to physical layer MU-scheme design is that it can seemingly readily be learned for virtually any rate configuration, information density, impairment model, or other set of constraints introduced into the network training process. This opens up the door for highly efficient multi-user PHY schemes to be heavily


specialized to their deployment domain, impairment distributions, multi-user configura-

tions, and potentially higher level traffic patterns and requirements as well. Significant

work remains to be done to consider optimal fusion of higher level network traffic re-

quirements and source coding on top of the model presented here, as well as scaling the

models to additional impairment constraints and higher numbers of users.

3.4 Learning Multi-Antenna Diversity Channels

Many modern radios, such as LTE smart phones, do not use a single antenna element for transmit or receive. In fact, the LTE E-UTRA Physical Layer [118, 119] has required for several years that handsets (UEs) employ at least 2 receive antennas to allow for decoding of 2x2 MIMO [120] modes of transmission. Many phones today actually support 4 antenna receive, standards are now discussing 8x8 modes as a reality in future devices, and 5G test labs are evaluating techniques involving up to 128 base station antennas [121], or even 500-1000 antennas in some cases. The motivation for this is clear: MIMO systems have proven themselves to be invaluable both in extending range at the edge of coverage areas by coding redundancy across multiple propagation modes, and in increasing the achievable capacity in dense urban multi-path rich environments where separate information can be coded across multiple propagation modes in order to increase aggregate throughput to a single or multiple users.

Today, state of the art methods for encoding information at the physical layer for the MIMO channel typically rely on either open-loop (no CSI feedback) space-time block code (STBC) [122] methods (the simplest being the Alamouti code [123]), or closed-loop (with CSI feedback used for pre-coding) style spatial multiplexing [124] methods.

Figure 3.20: Open Loop MIMO Channel Autoencoder Architecture

First we consider the case of an open-loop MIMO system where no CSI is known at the

transmitter. We can structure this problem for an mt transmit antenna and mr receive

antenna system as an autoencoder as shown in figure 3.20, where each codeword is k bits

and spans n time-samples. Here, we encode some block of information s as before, using

a learned encoder, pass through a channel model, and then recover an estimate s from

the received signal y. The primary difference here is that x now takes the form of a 2D

mt×n tensor for each example, and y takes the form of a mr×n tensor for each example.

The process for complex MIMO Rayleigh channel matrix (H) generation and complex

valued tensor multiplication must be implemented in differentiable tensor form within

the channel impairment model, and then the same additive noise layer may be used to impose SNR constraints.

Figure 3.21: Alamouti Coding Scheme for 2x1 Open Loop MIMO

Figure 3.22: Error Rate Performance of Learned Diversity Scheme. (Plot "2x1 Spatial Diversity Code Comparison": bit error rate (BER) versus signal to noise ratio (dB), for 2x1 AE No CSI and 2x1 Alamouti.)
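The channel impairment model described above can be sketched as a forward computation, y = Hx + n, for one example. This is a minimal illustration only: the function names, the unit-variance normalization of H, and the unit-signal-power noise scaling are assumptions, and it uses plain Python rather than a differentiable tensor framework, so it shows the arithmetic that such a layer would need to express, not a trainable layer.

```python
import math
import random


def rayleigh_channel(m_r, m_t, rng):
    """Draw an m_r x m_t matrix of i.i.d. complex Gaussian (Rayleigh) taps."""
    s = math.sqrt(0.5)  # unit average power per tap (assumed normalization)
    return [[complex(rng.gauss(0, s), rng.gauss(0, s)) for _ in range(m_t)]
            for _ in range(m_r)]


def apply_channel(H, x, snr_db, rng):
    """Return y = H x + n for an m_t x n block x (list of rows).

    The noise variance assumes unit average signal power per sample.
    """
    m_r, m_t, n = len(H), len(x), len(x[0])
    sigma = math.sqrt(0.5 * 10 ** (-snr_db / 10))
    y = []
    for i in range(m_r):
        row = []
        for t in range(n):
            clean = sum(H[i][j] * x[j][t] for j in range(m_t))
            noise = complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
            row.append(clean + noise)
        y.append(row)
    return y
```

For a 2x1 system with n = 2, x is a 2x2 block and y is a 1x2 block, matching the mt×n and mr×n shapes given in the text.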

We compare the bit error rate performance of the learned autoencoder-based 2x1 MIMO

scheme based on the model in figure 3.20 to the conventional Alamouti code which is also

an open-loop 2x1 code shown in figure 3.21.
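For reference, the Alamouti baseline can be written in a few lines. The sketch below follows the standard textbook form of the 2x1 code (two symbols over two time slots, channel assumed constant across both slots); the function names are illustrative, and it is not taken from the simulation code used to produce the results. Note that "open-loop" refers to the transmitter having no CSI; the receiver still uses h1 and h2 for combining.

```python
def alamouti_encode(s1, s2):
    """Map two symbols onto 2 antennas x 2 time slots (rows are antennas)."""
    return [[s1, -s2.conjugate()],
            [s2, s1.conjugate()]]


def alamouti_receive(x, h1, h2, n=(0j, 0j)):
    """Single receive antenna; h1, h2 are held constant over both slots."""
    r1 = h1 * x[0][0] + h2 * x[1][0] + n[0]
    r2 = h1 * x[0][1] + h2 * x[1][1] + n[1]
    return r1, r2


def alamouti_combine(r1, r2, h1, h2):
    """Linear combining; returns estimates of (s1, s2) scaled to unit gain."""
    gain = abs(h1) ** 2 + abs(h2) ** 2
    s1_hat = (h1.conjugate() * r1 + h2 * r2.conjugate()) / gain
    s2_hat = (h2.conjugate() * r1 - h1 * r2.conjugate()) / gain
    return s1_hat, s2_hat
```

In the noise-free case the combiner returns the transmitted symbols exactly, since the cross terms in h1 and h2 cancel; both transmit antennas carry equal power in each slot.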

Results for the open-loop case are mixed, and not initially as favorable as prior results for autoencoder or multi-user schemes. In both cases, we compare a (2x1,4) system, where two QPSK symbols (4 bits) are encoded into two time-slots and one receive antenna. Performance between the two schemes is similar; however, we observe two distinct regions: at low SNR the Alamouti scheme tends to outperform, providing lower bit error rates, while at high SNR the learned scheme provides a 2-3 dB advantage for obtaining equivalent error rates. An additional comparison incorporating error correction may make sense, such as a (4x1,6) scheme where a 3/4 rate code is used to map 6 bits onto two (2x1,4) Alamouti code words, while allowing the autoencoder to directly learn a solution to the (4x1,6) problem. However, these results are promising enough to warrant further investigation, and suggest that strong open-loop schemes may be learned in a similar way.

Figure 3.23: 2x1 MIMO AE, Diagonal H

Figure 3.24: 2x1 MIMO AE, Random H

Inspecting the resulting constellations learned in figures 3.23 and 3.24 we observe that a

form of superposition code appears to be learned here as well to satisfy the average power


constraint. This is an interesting solution, but it suggests that different and/or possibly

better results could be obtained by introducing some kind of additional constraint to in-

centivize equal power between transmit antenna symbols (as is the case for Alamouti).

This does raise the question, to some extent, of whether the parameter search manifold

for this problem has several very large local minima, where in this case we have been

pulled into one solution which is sub-optimal despite the use of large networks, regular-

ization, and infinite (generative) training data.

3.5 Learning MIMO with CSI Feedback

In dense urban environments with many radio reflectors, spatial multiplexing modes

[124] and closed-loop MIMO are commonly used to increase throughput and improve

performance from multi-path propagation. These too can be represented through an ap-

propriate autoencoder architecture. Figure 3.25 illustrates an autoencoder architecture for

learning such a MIMO scheme which incorporates CSI (e.g. closed loop) into the transmit-

ter encoding process. Here we have collapsed the traditional radio transmitter functions

including FEC, modulation, and MIMO pre-coding all into a single encoder block which

is learned end-to-end with the channel and decoding processes.

We can structure the architecture here such that our random channel state, H is passed

to both the channel impairment model (the complex multiply) as well as into the encoder

module, simply by concatenating it with the symbol to transmit s.


Figure 3.25: Closed Loop MIMO Learning Autoencoder Architecture

Training such a system, we can compare to a variety of baseline methods such as zero

forcing (ZF) or minimum mean square error (MMSE) methods for pre-coding. In this case,

we consider the case where mt = 2 and mr = 2, which is the common 2x2 MIMO Case

used widely in LTE and other systems, but still a relatively small scale MIMO system.
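As a point of reference for the baseline, zero-forcing pre-coding in its simplest form inverts the channel at the transmitter so that the receiver observes the intended symbols. The sketch below is one common textbook formulation for the 2x2 case; the text does not spell out the exact baseline configuration, so the helper names and the absence of any transmit power normalization are assumptions.

```python
def inv2x2(H):
    """Invert a 2x2 complex matrix given as nested lists."""
    (a, b), (c, d) = H
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("channel matrix is (near) singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matvec(A, v):
    """Matrix-vector product for nested-list matrices."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def zf_precode(H, s):
    """Zero-forcing pre-coding: transmit x = H^{-1} s so that H x = s."""
    return matvec(inv2x2(H), s)
```

A practical system would additionally scale x to satisfy a transmit power constraint, which is where poorly conditioned channels cost ZF much of its performance.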

Figure 3.26: Error Rate Performance of Learned 2x2 Scheme (Perfect CSI). (Plot "2x2 Scheme Performance with Perfect CSI": bit error rate (BER) curves for 2x2 AE P-CSI and 2x2 Baseline.)

In this case, the learned scheme compares quite favorably to the baseline method. We see roughly a 5 dB improvement at a bit error rate (BER) of $10^{-2}$ and a 10 dB improvement at a BER of $10^{-3}$, both substantial. Of course the baseline could improve significantly with the introduction of error correction, but would have to give up some amount of information rate to do so, making the learned system extremely appealing.

Figure 3.27: Closed Loop MIMO Autoencoder with Quantized Feedback

However, in the real world, MIMO systems can not and do not transmit real-valued chan-

nel estimates (H) over the air (e.g. between eNodeBs and UEs). Instead they typically

must minimize protocol overhead used for channel quality information (CQI)/CSI feed-

back, which has led to the adoption of techniques like p-bit codebooks which contain

compact discrete valued codes indicating distinct channel modes.

Considering this task of compact discrete valued CQI feedback representation as part of

the end-to-end communications system learning architecture, we can cast the problem as

shown in figure 3.27. Here, we introduce a discretization network (dis(H)), which encodes

the real valued channel estimate H (H is used in our work without estimation error), into

a v-bit discrete value with one-hot encoding over $2^v$ possible channel modes. This one-hot


encoding is then concatenated with s to form the input to the MIMO encoder/modulator. This is quite

exciting as we have now cast the entire end-to-end problem of compact CSI feedback, CSI-

enhanced MIMO pre-coding, FEC encoding, modulation, over-the-air (OTA) representa-

tion, MIMO combining, demodulation, and decoding all into one single learned model

which jointly optimizes for all of these free parameters to maximize capacity for any dif-

ferentiable channel model.
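The data flow into the encoder can be illustrated with a small sketch. In the architecture described above, dis(H) is a learned network; here it is replaced by a hypothetical nearest-codebook rule purely so the example is self-contained. The names discretize, encoder_input, and codebook are illustrative assumptions, and only the one-hot-and-concatenate structure is taken from the text.

```python
def one_hot(index, size):
    """One-hot vector of the given size with a 1.0 at index."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def discretize(H, codebook):
    """Stand-in for dis(H): index of the nearest codebook entry.

    The real dis(H) is learned end-to-end; this fixed rule is an assumption.
    """
    flat = [z for row in H for z in row]

    def dist(entry):
        return sum(abs(a - b) ** 2 for a, b in zip(flat, entry))

    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def encoder_input(H, s_index, num_messages, codebook):
    """Concatenate the one-hot channel mode with the one-hot message s."""
    mode = discretize(H, codebook)
    return one_hot(mode, len(codebook)) + one_hot(s_index, num_messages)
```

With v feedback bits the codebook has 2**v entries, so the encoder input has 2**v + num_messages elements, exactly two of which are non-zero.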

Figure 3.28: Bit Error Rate Performance of Baseline ZF Method. (Plot "Baseline 2x2 Scheme Performance with Quantized CSI": bit error rate (BER) curves for the 2x2 baseline with perfect CSI, 8-bit CSI, 4-bit CSI, and 2-bit CSI.)

In figure 3.28 we illustrate the decline in performance when quantizing the real valued

H feedback values with the ZF 2x2 scheme. Here, real-values provide the best solution,

and while 8-bit CSI does not provide significant degradation, 4-bit and 2-bit CSI modes

are substantially degraded.

In stark contrast, we can easily train the autoencoder based system to learn a v-bit CSI feedback mode which attempts to be optimal for any positive non-zero value of v. Figure 3.29 illustrates the performance curves of 1-bit, 2-bit, 4-bit, and 8-bit CSI feedback modes, alongside perfect-CSI, the real-valued CSI feedback mode.

Figure 3.29: Bit Error Rate Performance Comparison of MIMO Autoencoder 2x2 Closed-Loop Scheme with Quantized CSI. (Plot "2x2 Scheme Performance With Quantized CSI": bit error rate (BER) curves for 2x2 AE with 1-bit, 2-bit, 4-bit, and 8-bit CSI, and for 2x2 AE P-CSI.)

Interestingly, we obtain the best performance from a 2-bit feedback mode rather than

larger numbers of bits or continuous valued feedback. This is likely because, for 2-bit

feedback, we have enough to effectively generate a 4 entry code-book, whereas 1-bit is

insufficient for the number of codebook modes required, and greater numbers of bits or

continuous valued feedback requires the encoder to learn a more complex manifold of

different or continuously varying encoder modes, which is made significantly simpler

and more rapidly trained for a small but sufficient number of bits (e.g. v = 2).


Figure 3.30: Learned 2x2 Scheme, 1-bit CSI, Random Channels.

Figure 3.31: Learned 2x2 Scheme, 1-bit CSI, All-Ones Channel.

Figure 3.32: Learned 2x2 Scheme, 2-bit CSI, Random Channels.

Figure 3.33: Learned 2x2 Scheme, 2-bit CSI, All-Ones Channel.

Inspecting the learned constellations for the 1-bit and 2-bit CSI feedback MIMO channel

autoencoders under random channel conditions, and under even-power per channel path

(all 1’s) assumptions for the H matrix, in figures 3.30, 3.31, 3.32, and 3.33 we can see that

our best performing 2-bit scheme learns a set of non-standard 16-QAM transmit constel-

lations which combine to form a relatively constant modulus non-standard PSK kind of

ring arrangement at the receiver.

The system in figure 3.27 can be easily produced in simulation, where knowledge of H is free. In a real world system, however, such a trained system would need to be deployed as shown in figure 3.34, where an estimate of H is produced at the receiver and used to form a discrete v-bit embedding to feed back to the encoder. This feedback could be included as digitally coded messages within a higher level media access control (MAC) protocol.

Figure 3.34: Deployment Configuration for Quantized MIMO Autoencoder

3.6 System Identification Over the Air

The key problem with using this approach in over the air systems is that we have relied on having a closed-form differentiable model for the channel during training. This is a reasonable assumption if such a model can be built, but in the real world it may be difficult to do so when faced with complex impairment distributions over a range of different channel effects. Very recent published work realizing such a system over the air [125] addresses this problem by only fine-tuning the receiver/decoder half of the channel autoencoder using error feedback from OTA data. This is a partial solution, but it does not allow the encoder or over the air representation to update to optimize for the real over the air impairments.

In general this is still an open system identification [126] problem in which we desire to

fit a function to the OTA data permutation which is occurring in the wireless channel.

This is an important area of research when combined with the channel autoencoder to

allow the systems to truly adapt under heavy real world impairments. By approximat-

ing the transfer function in a way that its gradient can be computed or approximated

accurately, we can continue to train such systems end-to-end with a black box physical

transform in the middle. Our future work and prototype systems will seek to solve this

problem thoroughly in order to fully realize the power of channel autoencoders in the

real world. Recent work is beginning to mature the approach of gradient approximation

and back-propagation for black-box functions [127] which holds significant promise for

this problem.

Chapter 4

Learning to Label the Radio Spectrum

Interpreting and labeling the radio spectrum is a critical building block on which count-

less radio capabilities are built today, and will increasingly be built tomorrow. In its sim-

plest form, wireless channel estimation consumes some form of radio signal in time or

frequency and produces an estimate for some parameter of an emitted and impaired sig-

nal. This is used in wireless synchronization to estimate time of arrival, digital symbol

clock rates, carrier frequency and phase, as well as impulse response over the channel.

Larger scale radio labeling problems involve detection and identification of radio signal

emissions, information about physical emitters, changes in channel propagation condi-

tions, user access patterns, and countless other applications which may help inform spec-

trum regulators, dynamic spectrum access systems, wireless cyber-intrusion detection

and anomaly detection systems, or other spectrum monitoring applications.


For many years radio data labeling problems have been treated as highly niche estimation

tasks, where compact models of the emitter signal, compact (usually simplified) models

for the wireless channel, and an analytical estimator derivation process are used to pro-

duce some analytic estimator expression. This has gotten us extremely far in the radio

and radio labeling domain, however it has several key drawbacks relating to insufficient

model detail and unfavorable estimator algorithm forms. Radio signal models

are often simplified when considered in the context of underlying data distributions,

hardware impairments, and other distortions. Radio channel models are almost always

simplified by assuming only AWGN, or by including only a compact simplified

fading model, often omitting other real world impairments. Estimator derivation, on the

other hand, often results in an analytically convenient small expression whose algorith-

mic implementation may be considered or approximated later when considering efficient

implementation on available compute hardware and/or instruction sets.

By leveraging deep learning based on large datasets for estimator and label learning, we

hope to demonstrate in this chapter how estimators, while merely serving as approxi-

mations, can often outperform the traditional way of doing things, by incorporating rich

emitter and contextual information, rich and accurate channel models, and by forcing ap-

proximations to take the form of highly efficient wide matrix operations which synthesize

efficiently onto modern wide/concurrent compute platforms, ultimately improving accu-

racy and sensitivity, reducing power, weight, and size requirements for resulting systems,

and greatly reducing the amount of manual engineering time and cost required to obtain


good practical solutions to new estimation problems.

4.1 Learning Estimators from Data

Synchronization is the principal difficult task of any radio receiver or modem. Aligning

time, frequency, phase, and impulse response correctly for a received signal enables opti-

mal decoding of transmitted symbols and reception of digital transmissions. Two of the

most widely used estimators in any communications system are the timing estimator and

the carrier frequency estimator.

Traditionally maximum a posteriori (MAP), maximum likelihood estimation (MLE), and

MMSE estimators are widely used for estimation of CSI values. We consider the canon-

ical task of timing and frequency recovery for a single carrier QPSK signal [128]. Here,

a common approach to carrier frequency offset (CFO) estimation is a fast Fourier transform

(FFT) based technique which estimates the frequency using a periodogram of the

mth power of the received signal [129]. The frequency offset detected by this technique is

then given by (4.1).

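$$\Delta f = \frac{F_s}{N \cdot m} \, \underset{f}{\arg\max} \left| \sum_{k=0}^{N-1} r^m[k] \, e^{-j 2 \pi k f / N} \right|, \qquad \left( -\frac{R_{sym}}{2} \le f \le \frac{R_{sym}}{2} \right) \tag{4.1}$$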


where $m$ is the modulation order, $r[k]$ is the received sequence, $R_{sym}$ is the symbol rate, $F_s$ is the sampling frequency, and $N$ is the number of samples. The algorithm searches for the frequency that maximizes the time average of the mth power of the received signal over the range $-R_{sym}/2 \le f \le R_{sym}/2$. Because the algorithm operates in the frequency domain, the center frequency offset manifests as the maximum peak in the spectrum of $r^m[k]$. Fig. 4.1 shows an example cyclic spectrum for a QPSK signal with a 2500 Hz center frequency offset (and a baud rate of 100 ksym/sec), where the peak indicates the center frequency offset for the burst.

Figure 4.1: CFO Expert Estimator Power Spectrum with simulated 2500 Hz offset
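The mth-power periodogram estimator of (4.1) can be sketched directly. The code below is a minimal illustration under stated assumptions: it evaluates a plain O(N^2) DFT over signed bin indices instead of an FFT, and the helper qpsk_burst_with_cfo generates a noise-free, rectangular-pulse QPSK test signal (no RRC shaping), which is a simplification introduced here only to exercise the estimator.

```python
import cmath
import math


def estimate_cfo(r, fs, m=4):
    """Estimate carrier frequency offset from the peak of the DFT of r[k]**m."""
    n = len(r)
    rm = [z ** m for z in r]  # raising to the m-th power strips the modulation
    best_bin, best_mag = 0, -1.0
    for f in range(-(n // 2), n - n // 2):  # signed frequency bin index
        acc = sum(rm[k] * cmath.exp(-2j * math.pi * k * f / n) for k in range(n))
        if abs(acc) > best_mag:
            best_bin, best_mag = f, abs(acc)
    return fs / (n * m) * best_bin


def qpsk_burst_with_cfo(num_samples, fs, cfo_hz, sps=4):
    """Noise-free rectangular-pulse QPSK with a repeating symbol pattern."""
    pts = [cmath.exp(1j * (math.pi / 4 + math.pi / 2 * q)) for q in range(4)]
    return [pts[(k // sps) % 4] * cmath.exp(2j * math.pi * cfo_hz * k / fs)
            for k in range(num_samples)]
```

With Fs = 400 kHz and m = 4, the signed bin range maps to offsets within ±50 kHz, which matches the ±Rsym/2 search range for a 100 kHz symbol rate. The frequency resolution is Fs/(N·m), so short blocks limit the attainable accuracy.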

We conduct timing offset estimation in the canonical way by using a matched filter on

the received sequence matched to a known preamble sequence. The time-offset which

maximizes the output of the matched filter’s convolution is then taken to be the time-

offset of the received signal. Matched filtering can be represented by (4.2)


$$y[n] = \sum_{k=-\infty}^{\infty} h[n-k] \, r[k], \tag{4.2}$$

where h[k] is the preamble sequence. The matched-filter is known as the optimal filter

for maximizing detection sensitivity in terms of SNR in the presence of additive stochastic

white noise.
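The matched-filter timing estimator can be sketched as a sliding correlation followed by an argmax. This is an illustrative implementation, not the evaluation code: the function name is assumed, and the correlation is written explicitly so that the relationship to (4.2) is visible (correlating with the preamble is the same as convolving with its time-reversed complex conjugate).

```python
def matched_filter_timing(r, preamble):
    """Return the sample offset that maximizes |correlation| with the preamble.

    The magnitude is used so that an unknown carrier phase does not
    affect the location of the peak.
    """
    n, length = len(r), len(preamble)
    best_offset, best_mag = 0, -1.0
    for offset in range(n - length + 1):
        acc = sum(r[offset + i] * preamble[i].conjugate() for i in range(length))
        if abs(acc) > best_mag:
            best_offset, best_mag = offset, abs(acc)
    return best_offset
```

Because the estimate is taken from a single peak, any channel delay spread that smears or shifts that peak translates directly into timing error.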

Our approximate, learned approach relies instead on constructing, training, and evaluating

an ANN based on a representative dataset. When relying on learned estimators, much

of the work and difficulty lies in generating a dataset which accurately reflects the final usage

conditions desired for the estimator. In our case, we produce numerous examples of

wireless emissions in complex baseband sampling with rich channel impairment effects

which are designed to match the intended real world conditions the system will operate

in. We associate target labels from ground truth for center frequency offset and timing

error which are used to optimize the estimator.

To train an ANN model, we consider the minimization of MSE and log-cosine hyperbolic

(log-cosh) [130] and Huber loss functions (shown in table 3.2). The latter are known to

have improved properties in robust learning, which may benefit such a regression learning

task on some datasets and tasks. In our initial experiments in this work, we observe

the best quantitative performance using the MSE loss function which we shall use for the

remainder.
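For concreteness, the three candidate regression losses can be written as follows. These are the standard textbook definitions, averaged over a batch; the exact scaling conventions in table 3.2 may differ, and the Huber threshold delta = 1.0 is an assumed default.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def log_cosh(y_true, y_pred):
    """Mean log-cosh error: quadratic near zero, roughly linear for large errors."""
    return sum(math.log(math.cosh(p - t)) for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    """Mean Huber error: quadratic within delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        e = abs(t - p)
        total += 0.5 * e * e if e <= delta else delta * (e - 0.5 * delta)
    return total / len(y_true)
```

For a single error of 10, MSE contributes 100 while Huber and log-cosh both contribute a little under 10, which is the sense in which the latter two are less sensitive to outliers.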

Table 4.1: ANN Architecture Used for CFO Estimation

Layer               Output dimensions
Input               (nsamp, 2)
Conv1D + ReLU       (variable, 32)
AveragePooling1D    (variable, 32)
Conv1D + ReLU       (variable, 128)
Conv1D + ReLU       (variable, 256)
Linear              1

Table 4.2: ANN Architecture Used for Timing Estimation

Layer               Output dimensions
Input               (2048, 2)
Conv1D + ReLU       (511, 32)
Conv1D + ReLU       (126, 64)
Conv1D + ReLU       (30, 128)
Conv1D + ReLU       (2, 256)
Dense + Linear      (1)

We search over a large range of model architectures using Adam [62] to perform gradient

descent to optimize each model's parameters based on our training dataset. This is

done by computing a loss function (e.g. $L_{MSE}$) and updating the weights of the neural

network model iteratively using back-propagation of loss gradients. More information

on the model search and selection process used is provided in chapter 5.3. This model

search and optimization process ideally seeks a model of minimal computational com-

plexity which achieves a satisfactory level of performance (the frontier of efficient models

represents a trade-off between model complexity and accuracy).

The ANN architectures used for our performance evaluation are shown in tables 4.1 and 4.2; both are

stacked convolutional neural networks with narrowing dimensions which map noisy

high dimensional raw time series data down to a compact single valued regression out-

put. In the case of CFO estimation architecture shown in Table 4.1, we find that an average

pooling layer works well to help improve performance and generalization of the initial

layer feature maps, while in the timing estimation architecture in table 4.2, no pooling or max-pooling tends to work better. This makes sense on an intuitive level, as CFO estimation distills all symbols received throughout the input into a best frequency estimate, while timing, in a traditional matched filter sense, is typically derived from a maximum response at a single offset.

We generate two different sets of data for evaluating the performance of the two com-

peting approaches. All generated data are based off of QPSK bursts with equiprobable

independent and identically distributed (IID) symbols, shaped with a root-raised cosine (RRC) filter with a roll-off β = 0.25 and a filter span of 6, and sampled at 400 kHz with a symbol rate of 100 kHz. We consider 4 channel conditions: AWGN with no fading, and three cases of Rayleigh fading with varying mean delay spreads in samples of σ = 0.5, 1, 2. Amplitude envelopes for a number of complex valued channel

responses for each of these delay spreads are shown in figure 2.2 to provide some visual

insight into the impact of Rayleigh fading effects at each of these delays. For the last case,

inter-symbol interference (ISI) is present in the data.

The first dataset generated is the timing dataset, in which we prepended the burst with

a known preamble of 64 symbols and random noise samples at the same SNR as the

data portion of the burst. The duration of the prepended noise is drawn from U(0, 1.25), in units of milliseconds. Additionally, a random phase offset drawn from U(0, 2π) is introduced for each burst in the dataset.

The second dataset generated is the center frequency offset data, in which every example

burst has a center frequency offset drawn from a U(−50e3, 50e3) distribution, in units of Hz. These bounds correspond to half the symbol rate, $R_{sym}/2$. Additionally, a random phase offset drawn from U(0, 2π) is introduced for each burst in the dataset.
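The labeling scheme for the CFO dataset can be illustrated with a simplified generator. This sketch is an assumption-laden stand-in for the actual data pipeline: it omits RRC pulse shaping and fading, uses rectangular pulses at 4 samples per symbol (400 kHz sampling over a 100 kHz symbol rate), and the function name and argument names are hypothetical. It does show how each example pairs raw complex samples with its ground truth offset.

```python
import cmath
import math
import random


def make_cfo_example(num_symbols, snr_db, rng, fs=400e3, sps=4, max_cfo=50e3):
    """Return (samples, cfo_label_hz) for one QPSK burst with random CFO and phase."""
    cfo = rng.uniform(-max_cfo, max_cfo)     # regression label, U(-50e3, 50e3) Hz
    phase = rng.uniform(0.0, 2.0 * math.pi)  # nuisance parameter, U(0, 2*pi)
    sigma = math.sqrt(0.5 * 10 ** (-snr_db / 10))  # assumes unit signal power
    samples = []
    for k in range(num_symbols * sps):
        if k % sps == 0:  # draw a new IID, equiprobable QPSK symbol
            sym = cmath.exp(1j * (math.pi / 4 + math.pi / 2 * rng.randrange(4)))
        rot = cmath.exp(1j * (2.0 * math.pi * cfo * k / fs + phase))
        noise = complex(rng.gauss(0, sigma), rng.gauss(0, sigma))
        samples.append(sym * rot + noise)
    return samples, cfo
```

Because the data are generated on demand, the training set size is limited only by compute, and the impairment model can be swapped for a richer one without changing the labeling logic.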


These datasets are generated for SNRs of 0 dB, 5 dB, and 10 dB and for an AWGN channel

and three different Rayleigh fading channels with different mean delay spread values

(0.5, 1, and 2) representing different levels of reflection in a given wireless channel envi-

ronment. We store the label of the timing offset and center frequency offsets as ground

truth for training and evaluation.

For each dataset generated above we optimize network weights using Adam [62] for 100

epochs, reducing the initial learning rate of 1e-3 by a factor of two for each 10 epochs

with no reduction in validation loss, ultimately using the parameters corresponding to

the epoch with the lowest validation loss. With the datasets generated above, we then

compute the test error using a separate data partition between ground truth labels for

timing and center frequency offset and predicted values generated using both expert and

deep learning/ANN based estimators. The mean absolute error (MAE) of the estimator

is used as our metric for comparison.
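The learning rate schedule, best-epoch selection, and MAE metric described above can be sketched as bookkeeping over a validation-loss history. The function names are illustrative, and the detail that the plateau counter resets after each reduction is an assumption modeled on common "reduce on plateau" behavior rather than something stated in the text.

```python
def plateau_schedule(val_losses, lr0=1e-3, patience=10, factor=0.5):
    """Replay a validation-loss history; return (lr used per epoch, best epoch)."""
    lr, best, best_epoch, stale = lr0, float("inf"), -1, 0
    lrs = []
    for epoch, loss in enumerate(val_losses):
        lrs.append(lr)  # learning rate in effect during this epoch
        if loss < best:
            best, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` epochs
                lr *= factor
                stale = 0
    return lrs, best_epoch


def mean_absolute_error(y_true, y_pred):
    """MAE between ground truth labels and estimator outputs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

The weights saved at best_epoch are the ones carried forward to the test partition, so late epochs with a reduced learning rate only matter if they lower the validation loss.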

In the timing estimation comparison, we show estimator MAE results in figure 4.2, for

each model AWGN(τ, χ) and Fading(τ, χ) where τ is the mean delay spread, and χ is the

SNR. Inspecting these results, we can see that the traditional matched filter (MF)/MLE achieves excellent performance under the AWGN channel model. Under the fading channel models, however, we see significant degradation of the MF/MLE baseline accuracy, as a simple matched filter MLE timing estimation approach has no ability to compensate for the expected range of channel delay spreads. The machine learning / artificial neural network (ML/ANN) estimator on average does not attain equivalent performance in all or even most cases. However, we see that this approach does attain an MAE within the same order of magnitude, and in some fading cases achieves a lower MAE.

Figure 4.2: Timing Estimation MAE Comparison

Quantitative results for estimation of center frequency offset error are shown in figures

4.3, 4.4, 4.5, and 4.6, summarizing the performance of both the baseline MLD method with dashed

lines and the ML/ANN method with solid lines. We compare the mean absolute center

frequency estimate error for each method at a range of different estimator block input

length sizes. As moment based methods generally improve for longer block sizes, we

compare performance over a range of short-time examples to longer-time examples.

In the AWGN case, in figure 4.3, we can see that for the 5 and 10 dB SNR cases, by the time we reach a block size of 1024 samples, the baseline estimator is doing quite well, and for larger block sizes (above 1024 samples) with SNR of at least 5 dB, performance of the baseline method is generally better. However, even in the AWGN case, for small block sizes we are able to achieve lower error using the ML/ANN approach, even at low SNR levels near 0 dB.

Figure 4.3: Mean CFO Estimation Absolute Error for AWGN Channel. (Plot "CFO MAE under AWGN Channel": estimator MAE (Hz) versus block size (samples), for the ML/ANN and MAP estimators.)

Figure 4.4: Mean CFO Estimation Absolute Error (Fading σ=0.5). (Plot "CFO MAE under Light Fading": estimator MAE (Hz) versus block size (samples); recovered legend entries include the MAP estimator at 0 dB, 5 dB, and 10 dB.)

In the cases of fading channels shown in figures 4.4, 4.5, and 4.6, we can see that performance

of the baseline estimator degrades enormously from the AWGN case under which it was

derived when delay spread is introduced. Performance gets progressively worse as σ increases

from 0.5 to 2 samples of mean delay spread. In the case of the ML/ANN estimator,

we also see a degradation of estimator accuracy as delay spread increases, but the effect

is not nearly as dramatic, ranging from 3.4 to 23254 Hz in the MLD case (almost a 7000x

increase in error) versus a range of 2027 to 3305 Hz in the ML/ANN case (around a 1.6x

increase in error).

From an accuracy standpoint, these results are quite interesting. We do not see significant improvement in timing estimation here against a matched filter; however, for frequency estimation, we see significant potential gains both for short-time estimators and for estimation under heavily impaired fading channel environments, where the AWGN assumptions used during derivation fail. This result helps illustrate how approximate, data-centric learned models can often outperform toy analytic solutions in cases where the simplified model assumptions do not hold and where the degrees of freedom are too high to allow for accurate and efficient closed form solutions.

Figure 4.5: Mean CFO Estimation Absolute Error (Fading σ=1). (Plot "CFO MAE under Medium Fading": estimator MAE (Hz) versus block size (samples).)

Figure 4.6: Mean CFO Estimation Absolute Error (Fading σ=2). (Plot "CFO MAE under Heavy Fading": estimator MAE (Hz) versus block size (samples); recovered legend entries include the MAP estimator at 0 dB, 5 dB, and 10 dB.)

4.2 Learning to Identify Modulation Types

One of the canonical tasks in radio estimation and detection is that of radio signal modulation identification. In radio sensing systems such as DSA systems [50], as well as in spectrum regulatory enforcement and other monitoring systems, signal modulation identification is often the first step towards identifying the emitter or protocol used by an emitter, and being able to communicate with or monitor it. This task can be treated simply as a classification problem among possible transmission modes (although this is a simplification of the possible hierarchical classification problem among emitter parameters). Significant literature exists on prior methods for performing radio signal type classification using analytically derived decision boundaries, as well as compact learned decision criteria with previous methods for machine learning such as decision trees (DTrees) or SVMs.

Our early work in this area, conducted in 2015 and first published publicly in 2016 [131]

has received significant attention, spurring international interest, numerous derivative

and related works at the IEEE DySpan 2017 Mod-Rec workshop [132, 133, 134,

135, 136, 137, 138] and elsewhere, DARPA’s RF Machine Learning Systems Program,

DARPA’s Battle-of-the-ModRecs Challenges, and parts of the DARPA Spectrum Collab-

oration Challenge (SC2), along with spurring internal research programs at numerous

companies.

Our basic approach relying on end-to-end feature learning on raw In-phase and Quadra-

ture (I/Q) data remains the same, but a number of techniques and methods have been

improved upon since the original paper [131], which lead to significant improvements in

detection sensitivity, power efficiency, and generality of such systems. Numerous drawbacks

of this approach, however, can not be overlooked. The need for labeled

data, robust and realistic datasets, and comprehensive metrics for comparison can not be


overstated, and these often limit the performance attainable for a given problem. Initial

attempts to address these needs by open sourcing classifiers, datasets/generators

[139], and metrics/scores were welcomed by a few, but have not been heavily adopted or

contributed to by many publishing in the field. The radio signal processing community

still has a long way to go to embrace data science in the way that has become the norm

in computer vision and many other disciplines. High quality public datasets from more

high profile institutions such as DARPA or NSF would be significant help in facilitating

this some day.

4.2.1 Expert Features for Modulation Recognition (Baseline)

Modulation recognition has long been used as a toy problem in the radio estimation and

detection world [140, 141, 142, 15, 143, 144, 138]. It sees some usage in spectrum monitoring

applications, but is not widely deployed or necessary in many widely deployed

communications systems.

Early work on this problem relies on analytically derived statistics and decision thresh-

olds typically derived probabilistically from a simplified analytic signal model (we refer

to these as expert methods [e.g. written explicitly by an expert in the domain]). Figure 4.7

(from [15]) illustrates one such traditional modulation recognition process for a digitally

modulated radio signal. Here a series of statistics ($v_n$) are compared to a series of analytically derived decision thresholds ($\eta_n$), and a rigid analytically formed decision tree is used to make a modulation recognition decision.

Figure 4.7: Traditional Approach to Modulation Recognition, from [15]

For our baseline features in this work, we leverage a number of compact higher order

statistics (HOSs). To obtain these we compute the higher order moments (HOMs) using

the expression given below:

$$M(p, q) = E\left[ x^{p-q} (x^*)^q \right] \tag{4.3}$$

From these HOMs we can derive a number of higher order cumulants (HOCs) which have been shown to be effective discriminators for many modulation types [145]. HOCs can be computed combinatorially using HOMs, each expression varying slightly; below we show one such example expression, for the C(4, 0) HOC.

$$C(4, 0) = \sqrt{M(4, 0) - 3 \times M(2, 0)^2} \tag{4.4}$$
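Expressions (4.3) and (4.4) translate directly into sample estimates. The sketch below replaces the expectation with a sample mean over a block of complex baseband samples; the function names are illustrative, and the complex square root is taken with cmath because the argument of the radical is in general complex.

```python
import cmath


def moment(x, p, q):
    """Sample estimate of M(p, q) = E[x^(p-q) (x*)^q]."""
    return sum(z ** (p - q) * z.conjugate() ** q for z in x) / len(x)


def c40(x):
    """C(4, 0) as written in (4.4), including the square root."""
    return cmath.sqrt(moment(x, 4, 0) - 3 * moment(x, 2, 0) ** 2)
```

For ideal unit-power constellations the magnitude of this statistic already separates some modulations: BPSK gives |C(4, 0)| = sqrt(2), while QPSK gives 1, because M(2, 0) vanishes for QPSK but not for BPSK.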


Additionally we consider a number of analog features which capture other statistical be-

haviors which can be useful, these include mean, standard deviation and kurtosis of the

normalized centered amplitude, the centered phase, instantaneous frequency, absolute

normalized instantaneous frequency, and several others which have been shown to be useful

in prior work [146].

Machine learning is also considered for the decision making based on these sets of fea-

tures. SVM and DTree are two commonly used methods which can be trained on the

low-dimensional feature space in order to derive an optimized set of decision criteria.

Prior work has generally used machine learning and pattern recognition on simpler sets

of features such as those described above. However, results have also been shown using

the increased complexity features such as the auto-correlation function (ACF), the SCF or

the α-profile (a one dimensional cut of the SCF) [147]. In our case, we compare instead

to the full dimensional input samples without imparting expert design about what form

features should take.

4.2.2 Time series Modulation Classification With CNNs

CNN layers have a very useful property: layer parameters (weights) correspond to specific filters or kernels which are evaluated at regular shift intervals across the input values, limiting the parameter count while enforcing weight re-use across time shifts. This key feature is well suited to any input domain where translation invariance is appropriate. In imagery, learning arbitrary 2D shifts of where an object occurs along an image's X and Y axes can be greatly simplified by ensuring that the same feature weights are used to form activations at all shifts in the input using a convolutional layer. This property is also extremely similar to the properties of linear time invariant (LTI) systems, which are widely used to model radio communications systems as 1D time series constructs. Because radio signals may arrive with random time offsets and consist of primitive objects such as symbols which occur randomly in time to form a hierarchical structure, CNNs are well suited to learning low level time-domain features or basis functions for representing them. In fact, we already know and use this structure heavily in communications, as we have used matched filters for preamble detection, symbol detection and decisions, and many other purposes throughout the history of communications. The primary differences are that we optimize filter weights during the training process rather than using pre-defined weights, we often use large hierarchies of multiple convolutional layers, and these layers often have many different filter channels operating in parallel to form higher feature-space representations.
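The link between a convolutional filter and a matched filter can be sketched as follows. The Barker-7 "preamble", the noise level, and the helper names are assumptions chosen for illustration; the point is only that one set of weights, applied at every shift, responds at whatever offset the pattern occurs.

```python
import numpy as np

rng = np.random.default_rng(0)
# Barker-7 sequence used as a toy "preamble" (hypothetical choice).
kernel = np.array([1, 1, 1, -1, -1, 1, -1], dtype=float)

def embed(offset, length=64):
    """Place the kernel at `offset` in weak background noise."""
    x = 0.1 * rng.standard_normal(length)
    x[offset:offset + kernel.size] += kernel
    return x

def conv_layer_1d(x, w):
    """One filter of a 1D convolutional layer (no bias, stride 1, no padding)."""
    # Deep learning "convolution" is cross-correlation: the same weights w
    # are applied at every time shift of the input.
    return np.correlate(x, w, mode="valid")

# The strongest activation follows the pattern as it moves in time.
peaks = [int(np.argmax(conv_layer_1d(embed(k), kernel))) for k in (5, 20, 41)]
```

In a trained CNN the weights `w` would be learned rather than fixed in advance, which is the first of the differences listed above.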

Building upon key trends discussed in more depth in chapter 1.3, the raw CNN approach to modulation recognition leverages the relatively recent abilities of training algorithms, network architectures, and computational platforms to train directly, using an end-to-end feature learning approach, on high dimensional raw radio time series as an alternative to trying to pre-engineer specific features such as statistical moments, cyclic moments, or other manually derived distillations of information. In both of [131, 111] we explore this approach in depth. Here we rely on a convolutional neural network operating on time series data to learn a deep net capable of performing robust classification of radio modulation types with random data.

Table 4.3: Layout for our 10 modulation CNN modulation classifier

Layer                                           Output dimensions
Input                                           2 × 128
Convolution (128 filters, size 2 × 8) + ReLU    128 × 121
Max Pooling (size 2, strides 2)                 128 × 60
Convolution (64 filters, size 1 × 16) + ReLU    64 × 45
Max Pooling (size 2, strides 2)                 64 × 22
Flatten                                         1408
Dense + ReLU                                    128
Dense + ReLU                                    64
Dense + ReLU                                    32
Dense + softmax                                 10
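The output dimensions in Table 4.3 can be checked with a short piece of arithmetic. The sketch below assumes stride-1 convolutions without padding and non-overlapping pooling without padding, which is what reproduces the tabulated sizes; the helper names are illustrative.

```python
def conv_out(length, kernel):
    """Output length of a stride-1 convolution without padding."""
    return length - kernel + 1

def pool_out(length, size=2, stride=2):
    """Output length of max pooling without padding."""
    return (length - size) // stride + 1

shapes = []
n = 128                 # samples per example (I and Q form the 2 input rows)
n = conv_out(n, 8)      # 2x8 kernels span both rows; 128 filters
shapes.append((128, n))
n = pool_out(n)
shapes.append((128, n))
n = conv_out(n, 16)     # 1x16 kernels; 64 filters
shapes.append((64, n))
n = pool_out(n)
shapes.append((64, n))
flat = 64 * n           # size of the flattened feature vector
```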

The only pre-processing used is to ensure zero mean and unit variance of the raw signal input vector, so that examples are nicely scaled to facilitate learning. In some cases, we only enforce unit variance, since certain classes are only differentiated by their mean shift (e.g. analog modulations with and without a carrier at DC).
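A minimal sketch of this pre-processing is given below. It assumes the normalization is applied per example over the complex sample vector (rather than separately to the I and Q rails), which the text does not specify; `normalize` is an illustrative name.

```python
import numpy as np

def normalize(x, remove_mean=True):
    """Scale one complex example to unit variance, optionally removing its mean."""
    x = np.asarray(x, dtype=complex)
    if remove_mean:
        x = x - x.mean()
    # For complex input np.std uses |x - mean|^2, so the result is real.
    # A constant (zero-variance) example would need special handling.
    return x / x.std()
```

Calling `normalize(x, remove_mean=False)` keeps a (scaled) DC component, which preserves the distinction between with-carrier and suppressed-carrier signals.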

As is widely done for image classification, we adopt a narrowing series of convolutional layers followed by dense/fully-connected layers and terminated with a dense softmax layer for our classifier (similar to a VGG architecture [148]). The dataset¹ for this benchmark consists of 1.2 M sequences of 128 complex-valued baseband I/Q samples corresponding to ten different digital and analog single-carrier modulation schemes (amplitude modulation (AM), frequency modulation (FM), PSK, QAM, etc.) that have gone through a wireless channel with harsh impairments including multi-path fading and both clock and carrier rate offset [131]. The samples are taken at 20 different SNRs within the range from −20 dB to 18 dB.

¹ RML2016.10b: https://radioml.com/datasets/radioml-2016-10-dataset/

Figure 4.8: 10 Modulation CNN performance comparison of accuracy vs SNR
[Plot: correct classification probability vs. SNR; curves: CNN, Boosted Tree, Single Tree, Random Guessing]

In Fig. 4.8, we compare the classification accuracy of the CNN against that of extreme gradient boosting with 1000 estimators, as well as a single scikit-learn decision tree [149], operating on a mix of 16 analog and cumulant expert features as proposed in [146] and [145]. The short-time nature of the examples places this task on the difficult end of the modulation classification spectrum, since we cannot compute expert features with high stability over long periods of time. The CNN outperforms the boosted feature-based classifier by around 4 dB in the low to medium SNR range, while the performance at high SNR is similar. Performance in the single tree case is about 6 dB worse than the CNN at medium SNR and 3.5% worse at high SNR.

Figure 4.9: Confusion matrix of the CNN (SNR = 10 dB)
[Confusion matrix: ground truth vs. prediction over the classes 8PSK, AM-DSB, BPSK, CPFSK, GFSK, PAM4, QAM16, QAM64, QPSK, WBFM; color scale from 0.0 to 1.0]

Fig. 4.9 shows the confusion matrix for the CNN at SNR = 10 dB, revealing confusing

cases between QAM16 and QAM64 and between Wideband FM (WBFM) and double-

sideband AM (AM-DSB). Despite the high SNR, classification is imperfect due to several

other impairments as described above. The distinction between AM-DSB and WBFM is

additionally complicated by the small observation window (0.64 ms of modulated speech

per example) and low information rate with frequent silence between words. Discrimi-

nating between QAM16 and QAM64 also suffers from short-time observations over only

a few symbols since constellations are higher order and share common points. The accu-


racy of the feature-based classifier saturates at high SNR for the same reasons, and neither

classifier reaches a perfect score on this dataset. In [150], the authors report on a success-

ful application of a similar CNN for the detection of black hole mergers in astrophysics

from noisy time-series data.

4.2.3 Deep Residual Network Time-series Modulation Classification

Since the publication of our original work [131, 111] on CNN based signal identification, described in the previous section, numerous advances have been made in neural network architecture with significant implications for structuring CNN solutions to the modulation recognition problem. Key among these are residual networks [4], batch normalization [41], self-normalizing networks [73], and the use of deep dilated convolutional architectures [2]. In this section, we detail updated results leveraging these techniques, considering performance over the air, and improving our synthetic dataset performance, while providing performance trade-off comparisons detailing the impact of a number of factors.

Dataset Structure and Improvements

It became clear from the dataset in [139] and prior datasets that streaming models with coherent channel impairments were not appropriate for training. Randomly drawing many examples with independent channel state, rather than adjacent examples with correlated channel state, provided significant gain and realism for the problem. To better characterize the distribution of the data, we introduce the random variables in table 4.4, each IID for every independent training example. The training data synthesis model is illustrated in figure 4.10.

Table 4.4: Random Variable Initialization

Random Variable    Distribution
α                  U(0.1, 0.4)
∆t                 U(0, 16)
∆fs                N(0, σclk)
θc                 U(0, 2π)
∆fc                N(0, σclk)
H                  Σi δ(t − Rayleighi(τ))
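Drawing an independent channel state per example, following Table 4.4, might look like the sketch below. Several points are assumptions not fixed by the text: the comments give one plausible reading of each symbol, σclk and τ are treated as a standard deviation and a Rayleigh scale respectively, and the number of paths `n_paths` is a hypothetical parameter.

```python
import numpy as np

def draw_impairments(rng, sigma_clk=0.0001, tau=0.5, n_paths=3):
    """Draw one independent set of channel parameters following Table 4.4."""
    return {
        "alpha": rng.uniform(0.1, 0.4),            # assumed: pulse shaping roll-off
        "delta_t": rng.uniform(0.0, 16.0),         # assumed: timing offset
        "delta_fs": rng.normal(0.0, sigma_clk),    # assumed: sample clock offset
        "theta_c": rng.uniform(0.0, 2 * np.pi),    # assumed: carrier phase
        "delta_fc": rng.normal(0.0, sigma_clk),    # assumed: carrier frequency offset
        "path_delays": rng.rayleigh(tau, size=n_paths),  # delays of the impulses in H
    }

rng = np.random.default_rng(1)
# One fresh draw per training example, so channel state is never shared.
states = [draw_impairments(rng) for _ in range(4)]
```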

Figure 4.10: System for modulation recognition dataset signal generation and synthetic channel impairment modeling

We consider two different compositions of the dataset. The first is a “Normal” dataset, which consists of 11 classes which all have relatively low information density and are commonly seen in impaired environments. These 11 signals represent a relatively simple classification task at high SNR in most cases, somewhat comparable to the canonical MNIST digits. The second is a “Difficult” dataset, which contains all 24 modulations. These include a number of high order modulations (QAM256 and APSK256), which are used in the real world in very high-SNR, low-fading channel environments such as impulsive line of sight (LOS) satellite links [151] (e.g. DVB-S2X). We, however, apply impairments beyond those one would expect to see in such a scenario and consider only relatively short-time observation windows for classification, where the number of samples is ℓ = 1024. Short-time classification is a hard problem, since decision processes cannot wait and acquire more data to increase certainty. This is the case in many real world systems when dealing with short observations (such as when rapidly scanning a receiver) or short signal bursts in the environment. Under these effects, with low SNR examples (from −20 dB to +30 dB Es/N0), one would not expect to be able to achieve anywhere near 100% classification rates on the full dataset, making it a good benchmark for comparison and future research.

The specific modulations considered within each of these two dataset types are as follows:

• Normal Classes: OOK, 4ASK, BPSK, QPSK, 8PSK, 16QAM, AM-SSB-SC, AM-DSB-SC, FM, GMSK, OQPSK

• Difficult Classes: OOK, 4ASK, 8ASK, BPSK, QPSK, 8PSK, 16PSK, 32PSK, 16APSK, 32APSK, 64APSK, 128APSK, 16QAM, 32QAM, 64QAM, 128QAM, 256QAM, AM-SSB-WC, AM-SSB-SC, AM-DSB-WC, AM-DSB-SC, FM, GMSK, OQPSK

The raw datasets will be made available on the RadioML website² after publication.

² https://radioml.org


Over the air dataset generation

In addition to simulating wireless channel impairments, we also implement an OTA test-bed in which we modulate and transmit signals using a USRP [152] B210 SDR. We use a second B210 (with a separate free-running local oscillator (LO)) to receive these transmissions in the lab, over a relatively benign indoor wireless channel in the 900 MHz ISM band. These radios use the Analog Devices AD9361 [153] radio frequency integrated circuit (RFIC) as their radio front-end and have an LO that provides a frequency (and clock) stability of around 2 parts per million (PPM). We off-tune our signal by around 1 MHz to avoid the DC signal impairment associated with direct conversion, but store signals at base-band (offset only by LO error). Received test emissions are stored unmodified along with ground truth labels for the modulation from the emitter. Figure 4.11 illustrates the hardware recording architecture used for our data capture, and the picture in figure 4.12 shows the actual hardware used for data capture, training, and evaluation.
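To give a feel for the residual LO error left in the stored base-band signals, the quoted stability can be converted to Hz. Treating 2 PPM as a bound on each radio's error, and adding the two errors for the worst case, are assumptions made for this sketch.

```python
carrier_hz = 900e6      # 900 MHz ISM band
stability_ppm = 2.0     # quoted LO stability per radio

# Frequency error of one oscillator at the quoted stability (about 1.8 kHz).
lo_error_hz = carrier_hz * stability_ppm * 1e-6

# Transmitter and receiver LOs are independent; if they err in opposite
# directions, the offsets add (about 3.6 kHz).
worst_case_offset_hz = 2 * lo_error_hz
```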

Baseline classification approach

Our baseline method leverages the list of HOMs and other aggregate signal behavior statistics given in table 4.5. We compute each of these statistics over each 1024 sample example, translating the example into feature space: a set of real values, one for each statistic. This new representation reduces the dimension of each example from $\mathbb{R}^{1024 \times 2}$ to $\mathbb{R}^{28}$, making the classification task much simpler but also discarding the vast majority of the data. We use an ensemble model of gradient boosted trees (XGBoost) [154] to classify modulations from these features, which significantly outperforms a single decision tree or SVM on the task. (We additionally evaluated methods including SVM [32], Naive Bayes, k-Nearest Neighbor, and deep neural network (DNN) on feature data in [131, 111], but XGBoost ultimately offered the strongest feature-based classification performance, which is why we focus on it here.)

Figure 4.11: Over the air capture system diagram

Figure 4.12: Picture of over the air lab capture and training system

Table 4.5: Features Used

Feature Name
M(2,0), M(2,1)
M(4,0), M(4,1), M(4,2), M(4,3)
M(6,0), M(6,1), M(6,2), M(6,3)
C(2,0), C(2,1)
C(4,0), C(4,1), C(4,2)
C(6,0), C(6,1), C(6,2), C(6,3)
Additional analog features (section 4.2.1)


Deep Learning based classification approaches

We evaluate and tune two classes of networks: first, a VGG-style CNN using max-pooling, shown in table 4.6, and second, a residual network leveraging dilated convolutions appropriate for time series radio signals and self-normalizing fully connected layers to map residual/CNN features to outputs, shown in table 4.7.

In [148], the question of how to structure such networks is explored, and several basic design principles for “VGG” networks are introduced (e.g. filter size is minimized at 3x3, and the smallest size pooling operations are used at 2x2). Following this approach has generally led to a straightforward way to construct CNNs with good performance. We adapt the VGG architecture principles to a 1D CNN, improving upon the similar networks in [131, 111]. This represents a simple DL CNN design approach which can be readily trained and deployed to effectively accomplish many small radio signal classification tasks.

Figure 4.13: Example graphic of high level feature learning based residual network architecture for modulation recognition


As network algorithms and architectures have improved since AlexNet, they have made the effective training of deeper networks using more and wider layers possible, leading to improved performance. In the computer vision space, the idea of deep residual networks has become increasingly effective [4]. In a deep residual network, as shown in figure 4.20, skip or bypass connections are used heavily, allowing features to operate at multiple scales and depths through the network. This has led to significant improvements in computer vision performance, and has also been used effectively on time-series audio data [2]. In [155], the use of residual networks for time-series radio classification is investigated and seen to train in fewer epochs, but not to provide significant performance improvements in terms of classification accuracy. We revisit the problem of modulation recognition with a modified residual network and obtain improved performance when compared to the CNN on this dataset. A high level depiction of this architecture is shown in figure 4.13. The basic residual unit and stack of residual units are shown in figure 4.20, while the complete network architecture for our best architecture for ℓ = 1024 is shown in table 4.7. We also employ self-normalizing neural networks [73] in the fully connected region of the network, employing the SELU activation function [73], mean-response scaled initializations (MRSA) [156], and Alpha Dropout [73], which provides a slight improvement over conventional ReLU performance.
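Two of the building blocks named above are small enough to sketch directly: the SELU activation and a skip connection. The constants are the standard values from the self-normalizing networks paper [73]. The function `f` stands in for the convolutional path of a residual unit, whose actual internals are given in figure 4.20 and are not reproduced here.

```python
import numpy as np

# Standard SELU constants from the self-normalizing networks paper [73].
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(x):
    """Scaled exponential linear unit."""
    x = np.asarray(x, dtype=float)
    # np.minimum keeps expm1 from overflowing on the unused positive branch.
    return SELU_SCALE * np.where(x > 0, x, SELU_ALPHA * np.expm1(np.minimum(x, 0)))

def residual_unit(x, f):
    """Skip connection: the input bypasses f and is added to its output."""
    return x + f(x)
```

If `f` outputs zeros, `residual_unit` reduces to the identity, which is one common explanation for why deep stacks of such units remain easy to optimize.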

Significant tuning time was spent optimizing both networks, and a collection of different trade studies is shown below. A thorough analysis of all of the hundreds of (or effectively limitless) possible network architecture design choices is difficult to address in this same depth. However, the architecture tuning process is revisited in more depth later in chapter 5.3, where we consider dealing with the model hyper-parameter design choices using a secondary optimization process.

Table 4.6: CNN Network Layout

Layer         Output dimensions
Input         2 × 1024
Conv          64 × 1024
Max Pool      64 × 512
Conv          64 × 512
Max Pool      64 × 256
Conv          64 × 256
Max Pool      64 × 128
Conv          64 × 128
Max Pool      64 × 64
Conv          64 × 64
Max Pool      64 × 32
Conv          64 × 32
Max Pool      64 × 16
Conv          64 × 16
Max Pool      64 × 8
FC/SeLU       128
FC/SeLU       128
FC/Softmax    24

Table 4.7: ResNet Network Layout

Layer             Output dimensions
Input             2 × 1024
Residual Stack    32 × 512
Residual Stack    32 × 256
Residual Stack    32 × 128
Residual Stack    32 × 64
Residual Stack    32 × 32
Residual Stack    32 × 16
FC/SeLU           128
FC/SeLU           128
FC/Softmax        24

Figure 4.14: Complex time domain examples of 24 modulations from the dataset at simulated 10 dB Eb/N0 and ℓ = 256

Figure 4.15: Complex time domain examples of 24 modulations over the air at high SNR and ℓ = 256

Figure 4.16: Complex constellation examples of 24 modulations from the dataset at simulated 10 dB Eb/N0 and ℓ = 256

We show a number of examples from both the synthetic and over the air datasets to give some intuition about what each example looks like at differing SNR levels, and how similar classes appear at lower SNR. Each example is 1024 complex valued samples at 1 MSamp/sec with a baud rate of 200 ksym/sec. We show time domain examples for all 24 classes, where figures 4.14 and 4.17 illustrate time domain signals at 10 dB and 0 dB respectively. Figure 4.15 illustrates an OTA capture of the dataset with relatively high SNR, and figure 4.16 illustrates the 10 dB SNR synthetic dataset in the complex plane, to provide an alternate perspective on the complex valued trajectories through modulation symbol points.

Figure 4.17: Complex time domain examples of 24 modulations from the dataset at simulated 0 dB Eb/N0 and ℓ = 256
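The quoted sample and symbol rates fix how much of a transmission each example actually covers, which is worked out below. The variable names are illustrative.

```python
sample_rate = 1_000_000   # samples per second
symbol_rate = 200_000     # symbols per second
n_samples = 1024          # samples per example

samples_per_symbol = sample_rate / symbol_rate         # oversampling factor
symbols_per_example = n_samples / samples_per_symbol   # symbols seen per decision
duration_ms = 1e3 * n_samples / sample_rate            # observation time in ms
```

Each classification decision is therefore based on roughly 205 symbols observed over about 1 ms, which gives a sense of how little of a high order constellation is visited in one example.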

Classification on low-order modulations

We first compare performance on the lower difficulty dataset of lower order modulation types. Training on a dataset of 1 million examples, each 1024 samples long, we obtain excellent performance at high SNR for both the VGG CNN and the ResNet (RN) CNN. In this case, the ResNet achieves roughly 5 dB higher sensitivity for equivalent classification accuracy than the baseline, and at high SNR a maximum classification accuracy of 99.8% is achieved by the ResNet, while the VGG network achieves 98.3% and the baseline method achieves 94.6%. At lower SNRs, the performance of the VGG and ResNet networks is virtually identical, but at high SNR performance improves considerably using the ResNet, which obtains almost perfect classification accuracy.

Figure 4.18: 11-Modulation normal dataset performance comparison (N=1M)
[Plot: correct classification probability vs. Es/N0 [dB]; curves: Baseline, VGG/CNN, ResNet]

For the remainder of this chapter, we will consider the much harder task of 24 class high

order modulations containing higher information rates and much more easily confused

classes between multiple high order PSKs, APSKs and QAMs.

Classification under AWGN

Signal classification under AWGN is the canonical problem which has been explored for many years in communications literature. It is a simple starting point, and it is the condition under which analytic feature extractors should generally perform their best (since they were derived under these conditions). In figure 4.19 we compare the performance of the ResNet (RN), VGG network, and the baseline (BL) method on our full dataset for ℓ = 1024 samples, N = 239,616 examples, and L = 6 residual stacks. Here, N indicates the number of examples in the dataset, ℓ indicates the number of samples of input per example, and L indicates the number of residual stacks included in the network (where a single residual stack architecture is shown in figure 4.20). The residual network provides the best performance at both high and low SNRs on the difficult dataset, by a margin of 2-6 dB in improved sensitivity for equivalent classification accuracy.

Figure 4.19: 24-Modulation difficult dataset performance comparison (N=240k)
[Plot: correct classification probability vs. Es/N0 [dB]; curves: BL AWGN, RN AWGN, VGG AWGN]
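A sensitivity margin "for equivalent classification accuracy" is the horizontal gap between two accuracy-vs-SNR curves at a fixed accuracy. One way to read it off is by interpolation, sketched below. The two logistic curves are toy stand-ins for illustration only and are not results from this work.

```python
import numpy as np

def snr_at_accuracy(snr_db, accuracy, target):
    """SNR (dB) at which a monotonically increasing accuracy curve reaches `target`."""
    return float(np.interp(target, accuracy, snr_db))

# Toy curves for illustration only; these are NOT measured results.
snr_db = np.arange(-20, 21, 2)
baseline = 1 / (1 + np.exp(-(snr_db - 4) / 3))
improved = 1 / (1 + np.exp(-(snr_db - 0) / 3))   # same shape, shifted 4 dB left

# Horizontal gap between the curves at 50% accuracy.
gain_db = snr_at_accuracy(snr_db, baseline, 0.5) - snr_at_accuracy(snr_db, improved, 0.5)
```

Because real curves are not simple shifts of one another, the gap depends on the target accuracy, which is why a range such as 2-6 dB is reported rather than a single number.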

Classification under Impairments

In any real world scenario, wireless signals are impaired by a number of effects. While AWGN is widely used in simulation and modeling, the effects of fading, carrier offset, and clock offset are present almost universally in wireless systems. It is interesting to inspect how well this class of learned classifiers performs under such impairments and to compare its rate of degradation with that of more traditional approaches to signal classification.

Figure 4.20: Residual unit and residual stack architectures

In figure 4.21 we plot the performance of the residual network based classifier under each considered impairment model. This includes AWGN, minor LO offset (σclk = 0.0001), moderate LO offset (σclk = 0.01), and several fading models ranging from minor (τ = 0.5) to harsh (τ = 4.0). Under all fading models, minor LO offset is assumed as well. Interestingly, in this plot ResNet performance improves under LO offset rather than degrading. Additional LO offset, which results in spinning or dilated versions of the original signal, appears to have a positive regularizing effect on the learning process, which provides quite a noticeable improvement in performance. At high SNR, performance ranges from around 80% in the best case down to about 59% in the worst case.
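The "spinning" effect of a carrier LO offset can be sketched as a progressive phase rotation of the complex baseband samples. The units of `delta_fc` (cycles per sample) are an assumption, and the sketch covers only the carrier term; the dilation caused by a sample clock offset would require resampling and is not shown.

```python
import numpy as np

def apply_lo_offset(x, delta_fc, theta_c=0.0):
    """Rotate complex baseband samples by a carrier offset.

    delta_fc is in cycles per sample (an assumed unit), theta_c in radians.
    """
    n = np.arange(len(x))
    return np.asarray(x) * np.exp(1j * (2 * np.pi * delta_fc * n + theta_c))

# A constant input spins around the unit circle; sample magnitudes are unchanged.
spun = apply_lo_offset(np.ones(8, dtype=complex), 0.25)
```

Since every training example sees a different rotation, the network is exposed to many rotated views of each constellation, which is consistent with the regularizing effect suggested above.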

In figure 4.22 we show the degradation of the baseline classifier under impairments. In this case, LO offset never helps; performance instead degrades with both LO offset and fading effects. In the best case at high SNR this method obtains about 61% accuracy, while in the worst case it degrades to around 45% accuracy.

Figure 4.21: Resnet performance under various channel impairments (N=240k)
[Plot: correct classification probability vs. Es/N0 [dB]; curves: RN AWGN, RN σclk = 0.01, RN σclk = 0.0001, RN τ = 0.5, RN τ = 1, RN τ = 2, RN τ = 4]

Figure 4.22: Baseline performance under channel impairments (N=240k)
[Plot: correct classification probability vs. Es/N0 [dB]; curves: BL AWGN, BL σclk = 0.01, BL σclk = 0.0001, BL τ = 0.5, BL τ = 1, BL τ = 2, BL τ = 4]

Figure 4.23: Comparison models under LO impairment
[Plot: correct classification probability vs. Es/N0 [dB]; curves: BL σclk = 0.01, RN σclk = 0.01, VGG σclk = 0.01]

Directly comparing the performance of each model under moderate LO impairment effects, figure 4.23 shows that, for many real world systems with unsynchronized LOs and Doppler frequency offset, the ResNet approach has nearly a 6 dB performance advantage over the baseline, and a 20% accuracy increase at high SNR. In this section, all models are trained using N = 239,616 and ℓ = 1024 for this comparison.

Classifier performance by network depth

Model size can have a significant impact on the ability of large neural network models to accurately represent complex features. In computer vision, convolutional layer based DL models for the ImageNet dataset started around 10 layers deep, but modern state of the art networks on ImageNet are often over 100 layers deep [157], and more recently even over 200 layers. Initial investigations of deeper networks in [155] did not show significant gains from such large architectures, but with the use of deep residual networks on this larger dataset, we begin to see quite a benefit from additional depth. This is likely due to the significantly larger number of examples and classes used. In figure 4.24 we show the increasing validation accuracy of deep residual networks as we introduce more residual stack units within the network architecture (i.e. making the network deeper). We see that performance steadily increases with depth in this case, with diminishing returns as we approach around 6 residual stacks. When considering all of the primitive layers within this network, when L = 6 the ResNet has 121 layers and 229k trainable parameters, while when L = 0 it has 25 layers and 2.1M trainable parameters. Results are shown for N = 239,616 and ℓ = 1024.

Figure 4.24: ResNet performance vs depth (L = number of residual stacks)
[Plot: correct classification probability vs. Es/N0 [dB]; curves: L=1, L=2, L=3, L=4, L=5, L=6]

Figure 4.25: Modrec performance vs modulation type (Resnet on synthetic data with N=1M, σclk=0.0001)
[Plot: correct classification probability vs. signal to noise ratio (Es/N0) [dB]; one curve per modulation: OOK, 4ASK, 8ASK, BPSK, QPSK, 8PSK, 16PSK, 32PSK, 16APSK, 32APSK, 64APSK, 128APSK, 16QAM, 32QAM, 64QAM, 128QAM, 256QAM, AM-SSB-WC, AM-SSB-SC, AM-DSB-WC, AM-DSB-SC, FM, GMSK, OQPSK]
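It may seem surprising that the deeper residual network has far fewer trainable parameters than the shallow one. One plausible reading, not stated explicitly in the text, is that each residual stack halves the time axis (Table 4.7), so the flattened input to the first dense layer shrinks with depth. The sketch below counts only that first dense layer, assuming 32 channels and 128 units as in Table 4.7; the function name is illustrative.

```python
def first_dense_params(n_stacks, n_in=1024, channels=32, units=128):
    """Weights plus biases of the first dense layer after `n_stacks` residual stacks."""
    # Each stack halves the time axis, following the sizes in Table 4.7.
    flat = channels * (n_in // 2 ** n_stacks)
    return flat * units + units

params_by_depth = {L: first_dense_params(L) for L in range(1, 7)}
```

Under these assumptions a single stack leaves a 32 × 512 feature map and about 2.1M weights in that one layer, close to the figure quoted for the shallowest network, while six stacks leave 32 × 16 and under 66k, with the remainder of the 229k residing in the convolutional layers and later dense layers.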

Classification performance by modulation type

In figure 4.25 we show the performance of the classifier for individual modulation types. Detection performance of each modulation type varies drastically over about 18 dB of SNR. Some signals with lower information rates and vastly different structure, such as AM and FM analog modulations, are much more readily identified at low SNR, while high-order modulations require higher SNRs for robust performance and never reach perfect classification rates. However, all modulation types reach rates above 80% accuracy by around 10 dB SNR. In figure 4.26 we show a confusion matrix for the classifier across all 24 classes for AWGN validation examples where SNR is greater than or equal to zero. We can see again here that the largest sources of error are between high order PSK (16/32-PSK), between high order QAM (64/128/256-QAM), and between AM modes (confusing with-carrier (WC) and suppressed-carrier (SC)). This is largely to be expected, as for short time observations, and under noisy observations, high order QAM and PSK modes can be extremely difficult to tell apart through any approach.

Figure 4.26: 24-modulation confusion matrix for ResNet trained and tested on synthetic dataset with N=1M, AWGN, and SNR ≥ 0 dB

Figure 4.27: Performance vs training set size (N) with ℓ = 1024
[Plot: correct classification probability vs. Es/N0 [dB]; curves: N=1k, N=2k, N=4k, N=8k, N=15k, N=31k, N=62k, N=125k, N=250k, N=500k, N=1M, N=2M]
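A confusion matrix of the kind discussed above can be computed from true and predicted class indices as in the following sketch. Normalizing each row by the number of true examples of that class is an assumption about how the figures are scaled.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix: entry [i, j] is P(predicted j | true class i)."""
    counts = np.zeros((n_classes, n_classes))
    # np.add.at accumulates correctly when the same (i, j) pair occurs repeatedly.
    np.add.at(counts, (np.asarray(y_true), np.asarray(y_pred)), 1)
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.maximum(row_sums, 1)   # rows with no examples stay all-zero

# Tiny example: class 0 is confused with class 1 half of the time.
cm = confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], 3)
```

Off-diagonal mass concentrated in small blocks, such as 16/32-PSK or 64/128/256-QAM, is what the error pattern described above would look like in such a matrix.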

Classifier Training Size Requirements

When using data-centric machine learning methods, the dataset often has an enormous impact on the quality of the model learned. We consider the influence of the number of example signals in the training set, N, as well as the time-length of each individual example in number of samples, ℓ.

In figure 4.27 we show how performance of the resulting model changes based on the total number of training examples used. Here we see that dataset size has a dramatic impact on model training: high SNR classification accuracy is near random until 4-8k examples, and improves 5-20% with each doubling until around 1M. These results illustrate that having sufficient training data is critical for performance. For the largest case, with 2 million examples, training on a single state of the art Nvidia V100 GPU (with approximately 125 tera-floating point operations per second (FLOPS)) takes around 16 hours to reach a stopping point, making significant experimentation at these dataset sizes cumbersome. We do not see significant improvement going from 1M to 2M examples, indicating a point of diminishing returns for number of examples around 1M with this configuration. With either 1M or 2M examples we obtain roughly 95% test set accuracy at high SNR. The class-confusion matrix for the best performing model, with ℓ = 1024 and N = 1M, is shown in figure 4.28 for test examples at or above 0 dB SNR. In all instances here we use the σclk = 0.0001 dataset, which yields slightly better performance than AWGN.

Figure 4.28: 24-modulation confusion matrix for ResNet trained and tested on synthetic dataset with N=1M and σclk = 0.0001

Figure 4.29: Performance vs example length in samples (ℓ)
[Plot: correct classification probability vs. Es/N0 [dB]; curves: ℓ=16, ℓ=32, ℓ=64, ℓ=128, ℓ=256, ℓ=512, ℓ=768, ℓ=1024]

Figure 4.29 shows how the model performance varies by window size, or the number of time-samples per example used for a single classification. Here we obtain approximately a 3% accuracy improvement for each doubling of the input size (with N=240k), with significantly diminishing returns once we reach ℓ = 512 or ℓ = 1024. We find that CNNs scale very well up to this 512-1024 size, but may need additional scaling strategies thereafter for larger input windows, simply due to memory requirements, training time requirements, and dataset requirements.


Over the air performance

We generate 1.44M examples of the 24 modulation dataset over the air using the USRP

setup described above. Using a partition of 80% training and 20% test, we can directly

train a ResNet for classification. Doing so on an Nvidia V100 in around 14 hours, we

obtain a 95.6% test set accuracy on the over the air dataset, where all examples are roughly

10dB SNR. A confusion matrix for this OTA test set performance based on direct training

is shown in figure 4.30.
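The 80%/20% partition of the 1.44M OTA examples can be sketched as a shuffled index split. Whether the examples were shuffled before partitioning is not stated, so the random permutation and the seed below are assumptions.

```python
import numpy as np

def split_indices(n_examples, train_fraction=0.8, seed=0):
    """Shuffle example indices and split them into train and test sets."""
    order = np.random.default_rng(seed).permutation(n_examples)
    n_train = int(round(train_fraction * n_examples))
    return order[:n_train], order[n_train:]

# 1.44M examples give 1,152,000 for training and 288,000 for testing.
train_idx, test_idx = split_indices(1_440_000)
```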

Figure 4.30: 24-modulation confusion matrix for ResNet trained and tested on OTA examples with SNR ∼ 10 dB


Figure 4.31: Resnet transfer learning OTA performance
[Plot: correct classification probability (test set) vs. transfer learning epochs; curves: AWGN, σclk=0.0001, σclk=0.01, τ = 0.5, τ = 1.0]

Transfer Learning to Over-the-air Performance

We also consider over the air signal classification as a transfer learning problem, where the model is trained on synthetic data and then only evaluated and/or fine-tuned on OTA data. Because full model training can take hours on a high end GPU and typically requires a large dataset to be effective, transfer learning is a convenient alternative for leveraging existing models and updating them on smaller computational platforms and target datasets. We consider transfer learning where we freeze network parameter weights for all layers except the last several fully connected layers (the last three layers from table 4.7) in our network while updating. This is commonly done today with computer vision models, where it is common to start from pre-trained VGG or other model weights for ImageNet or similar datasets and perform transfer learning using another dataset or set of classes. In this case, many low-level features work well for different classes or datasets, and do not need to change during fine tuning. In our case, we consider several cases where we start with models trained on simulated wireless impairment models using residual networks and then evaluate them on OTA examples. The accuracies of our initial models (trained with N=1M) on synthetic data are shown in figure 4.21, and these ranged from 84% to 96% on the hard 24-class dataset. Evaluating the performance of these models on OTA data, without any model updates, we obtain classification accuracies between 64% and 80%. By fine-tuning the last two layers of these models on the OTA data using transfer learning, we can recover approximately 10% of additional accuracy. The validation accuracies for this process are shown in figure 4.31. These ResNet update epochs on dense layers for 120k examples take roughly 60 seconds each on a Titan X card, instead of the full ∼500 seconds per epoch on a V100 card when updating all model weights.
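The idea of freezing the feature layers and updating only the output side can be sketched in a simplified form. Here the frozen layers are represented by precomputed features, and only a single softmax layer is trained by gradient descent, whereas the approach above updates the last few dense layers of the ResNet. The toy data, learning rate, and function names are all assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # subtract the row maximum for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fine_tune_head(features, labels, W, b, lr=0.5, epochs=100):
    """Gradient descent on a softmax output layer; `features` come from frozen layers."""
    onehot = np.eye(W.shape[1])[labels]
    losses = []
    for _ in range(epochs):
        p = softmax(features @ W + b)
        losses.append(-np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12)))
        grad = (p - onehot) / len(labels)   # gradient of mean cross-entropy w.r.t. logits
        W = W - lr * features.T @ grad
        b = b - lr * grad.sum(axis=0)
    return W, b, losses

# Toy stand-in for features produced by the frozen layers on the target data.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
features = np.eye(3)[labels] + 0.3 * rng.standard_normal((300, 3))
W, b, losses = fine_tune_head(features, labels, np.zeros((3, 3)), np.zeros(3))
```

Because the frozen features need to be computed only once and the trainable part is small, each update epoch is far cheaper than a full training epoch, in line with the timing difference reported above.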

Ultimately, the model trained on just moderate LO offset (σclk = 0.0001) performs the best

on OTA data. The model obtained 94% accuracy on synthetic data, and drops roughly

7% accuracy when evaluating on OTA data, obtaining an accuracy of 87%. The primary

confusion cases prior to training seem to be dealing with suppress or non-suppressed

carrier analog signals, as well as the high order QAM and APSK modes.

This seems like it is perhaps the best suited among our models to match the OTA data.

Very small LO impairments are present in the data, the radios used had extremely stable

oscillators present (GPSDO modules providing high stable 75 PPB clocks) over very short


Figure 4.32: 24-modulation confusion matrix for ResNet trained on syntheticσclk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (prior to fine-tuning)

example lengths (1024 samples), and that the two radios were essentially right next to

each other, providing a very clean impulsive direct path while any reflections from the

surrounding room were likely significantly attenuated in comparison, making for a near

impulsive channel. Training on harsher impairments seemed to degrade performance of

the OTA data significantly.

We suspect as we evaluate the performance of the model under increasingly harsh real

world scenarios, our transfer learning will favor synthetic models which are similarly

impaired and most closely match the real wireless conditions (e.g. matching LO distribu-

tions, matching fading distributions, etc). In this way, it will be important for this class

of systems to train either directly on target signal environments, or on very good im-

Figure 4.33: 24-modulation confusion matrix for ResNet trained on synthetic σclk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (after fine-tuning)

pairment simulations of them under which well suited models can be derived. A possible

mitigation is to include domain-matched attention mechanisms such as the ra-

dio transformer network [139] in the network architecture to improve generalization to

varying wireless propagation conditions.

Modulation Recognition Learning Analysis

We have extended prior work on using deep convolutional neural networks for radio sig-

nal classification by heavily tuning deep residual networks for the same task. We have

also conducted a much more thorough set of performance evaluations on how this type


of classifier performs over a wide range of design parameters, channel impairment con-

ditions, and training dataset parameters. This residual network approach achieves state

of the art modulation classification performance on a difficult new signal database both

synthetically and over the air. Other architectures still hold significant

potential; radio transformer networks, recurrent units, and other approaches all still need

to be adapted to the domain, tuned and quantitatively benchmarked against the same

dataset in the future. Other works have explored these to some degree, but generally not

with sufficient hyper-parameter optimization to be meaningful.

We have shown that, contrary to prior work, deep networks do provide significant per-

formance gains for time-series radio signals where the need for such deep feature hier-

archies was not apparent, and that residual networks are a highly effective way to build

these structures where more traditional CNNs such as VGG struggle to achieve the same

performance or make effective use of deep networks. We have also shown that simulated

channel effects, especially moderate LO impairments, improve the effectiveness of transfer learn-

ing to OTA signal evaluation performance, a topic which will require significant future

investigation to optimize the synthetic impairment distributions used for training.

DL methods continue to show enormous promise in improving radio signal identifi-

cation sensitivity and accuracy, especially for short-time observations. We have shown

deep networks to be increasingly effective when leveraging deep residual architectures

and have shown that synthetically trained deep networks can be effectively transferred

to over the air datasets with (in our case) a loss of around 7% accuracy or directly trained


effectively on OTA data if enough training data is available. While large well labeled

datasets can often be difficult to obtain for such tasks today, and channel models can be

difficult to match to real-world deployment conditions, we have quantified the real need

to do so when training such systems and helped quantify the performance impact of do-

ing so.

We still have much to learn about how to best curate datasets and training regimes for this

class of systems. However, we have demonstrated in this work that our approach pro-

vides roughly the same performance on high SNR OTA datasets as it does on the equiva-

lent synthetic datasets, a major step towards real world use. We have demonstrated that

transfer learning can be effective, but have not yet been able to achieve equivalent perfor-

mance to direct training on very large datasets by using transfer learning. As simulation

methods become better, and our ability to match synthetic datasets to real world data

distributions improves, this gap will close and transfer learning will become an increas-

ingly important tool when real data capture and labeling is difficult. The performance

trades shown in this work help shed light on these key parameters in data generation

and training, hopefully helping increase understanding and focus future efforts on the

optimization of such systems.


4.3 Learning to Identify Radio Protocols

The results in the previous section focused principally on sensing of modulation type,

but the same fundamental approach is valid for labeling many different properties of

digital communications waveforms at the PHY or MAC layer. As shown by Sainath et

al. in [36], features in a time series waveform which construct hierarchical time series

structure among short-time features (such as voice utterances), can be learned in an end-

to-end fashion with a higher level sequence model for effective sequence classification on

noisy time series data. This has proved incredibly effective in voice recognition, and the

approach can also be leveraged for higher level radio protocol identification on top of the

basic modulation features [158].

Protocol identification serves an important role in network quality of service (QoS) man-

agement, intrusion detection, and anomaly detection. Today, many such systems rely on

brittle parsing routines which are highly specialized to a specific set of protocols, can be-

come useless, or worse cause faults or vulnerabilities [159] when protocol fields change

or are malformed, and can be extremely expensive and time consuming to keep up to

date or constantly update to add new protocol modes. As an alternative, we consider a

data-based approach in which high level protocol labeling can be conducted directly on

a physical layer modulated signal through end-to-end learning of the low level modu-

lation features, and high level classification loss guided by curated protocol labels and

examples.


Figure 4.34: Transfer function of the LSTM unit, from [16]

Table 4.8: Protocol traffic classes considered for classification

Traffic Type Traffic Class

Streaming Video (ABC Video)

Streaming Video (YouTube)

Streaming Music (Spotify)

Utilities Apt-get

Utilities ICMP Ping

Utilities Git Version Control

Utilities IRC Chat

Browsing Bit-Torrent

Browsing Web-Browsing

Browsing FTP Transfer

Browsing HTTP Download

Several powerful recurrent network structures exist, such as the long short-term memory (LSTM)

[160, 161], the gated recurrent unit (GRU) [162], and more recently the computationally

efficient quasi-recurrent neural network (QRNN) [163]. For our work we leverage the

LSTM in both an RNN-DNN architecture and a CNN-RNN-DNN (CLDNN) architecture.

We generate a set of recorded IP traffic captures using Wireshark [164] from the list of

protocols in table 4.8 and re-modulate them over an un-coded QPSK with HDLC com-

munications link to produce labeled I/Q sample files for classification.


Table 4.9: Recurrent network architecture used for network traffic classification

Layer Output dimensions

Input N × (2 × 128)

LSTM 256

LSTM 256

LSTM 256

Dense + ReLU 64

Dense + softmax 11

Table 4.10: Performance measurements for RNN protocol classification for varying sequence lengths

Sequence Length Val. Loss Val. Accuracy Nsamples Nsymbols Nbits Sec/Epoch

32 1.2126 0.498805 1120 140 280 5

64 1.0386 0.553546 2144 268 536 18

128 0.7179 0.65894 4192 524 1048 17

256 0.4586 0.75621 8288 1036 2072 29

512 0.2711 0.836535 16480 2060 4120 38

768 0.5328 0.730413 24672 3084 6168 27

The recurrent neural network (RNN) architecture evaluated (which in this case

had the best performance on clean signal data) is shown in table 4.9. No network tuning

was used; this was the same network structure commonly used for character level RNNs

(char-rnn [165]).

We evaluate a range of different input sequence lengths N of the LSTM, comparing the

average number of input samples/bits required to obtain a good estimate of each mod-

ulated protocol traffic type. Table 4.10 tabulates the resulting network performance for


training and evaluating classification performance with differing sequence lengths of 128

complex sample windows. The title of Andrej Karpathy’s excellent article [165]

is apt here: the ‘unreasonable effectiveness’ of LSTMs allows them to quite effectively iden-

tify high level traffic protocol behaviors with access only to raw modulated I/Q data. In

this case, we obtain best performance with a sequence of 512 windows, with a validation

set accuracy of around 83.6%.
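The sample, symbol, and bit counts in table 4.10 can be reproduced from the sequence length with a little arithmetic. The sketch below is illustrative only: the window stride of 32 samples, the 8 samples per symbol, and the helper name observation_size are assumptions inferred from the tabulated values (they are not stated in the text), while the 128-sample window and the 2 bits per QPSK symbol come from the description above.

```python
def observation_size(n_windows, window=128, stride=32,
                     samples_per_symbol=8, bits_per_symbol=2):
    """Return (samples, symbols, bits) covered by n_windows overlapping windows.

    stride and samples_per_symbol are assumptions inferred from table 4.10.
    """
    # Overlapping windows: each new window adds `stride` fresh samples.
    n_samples = (n_windows - 1) * stride + window
    n_symbols = n_samples // samples_per_symbol
    n_bits = n_symbols * bits_per_symbol
    return n_samples, n_symbols, n_bits


if __name__ == "__main__":
    for n in (32, 64, 128, 256, 512, 768):
        print(n, observation_size(n))
```

Under these assumptions, the best performing sequence length of 512 windows corresponds to 16,480 samples, 2,060 symbols, and 4,120 bits.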

Figure 4.35: Best LSTM256 confusion with RNN length of 512 time-steps

The confusion matrix for the resulting classifier performance with sequence length of

N = 512 (16,480 samples) is shown in figure 4.35. Since the observation window is only

16 ms of traffic, some error is to be expected as not all observation windows

will contain distinctive traffic patterns and all classes may have some amount of common


background traffic (domain name server (DNS), address resolution protocol (ARP), etc).

These results indicate some initial promise of deep learning based protocol analysis even

down to the raw physical layer, but significant investment and work in larger scale ro-

bust dataset development needs to occur to significantly advance the field. Our efforts to

perform similar classification on impaired RF channels (including noise, fading, offsets,

etc) were less successful with a straightforward RNN approach. We believe this avenue

can certainly be fruitful (likely using a CLDNN style architecture), but newer tools for ar-

chitecture optimization, hyper-parameter tuning, domain specific dataset augmentation,

and generally larger datasets will be required to accomplish this task.


4.4 Learning to Detect Signals

Figure 4.36: Detection Algorithm Trade-space Sensitivity vs Specialization

Radio signal detection is a key task in spectrum diagnostic and monitoring systems as

well as cognitive radios such as those performing DSA. Today, systems which do de-

tection typically have to make a difficult design choice: specialize detection algorithms

heavily for features of a specific signal type or class of signals, or rely on highly generaliz-

able energy based detection methods with lower sensitivity. This is an unfortunate design

trade-off as it forces designers to either forego generality or performance during design or

dynamically at run-time using additional complex logic and estimation [166]. The gen-

erality of feature based detectors varies, for instance cyclo-stationary or moment based

detectors may have more generality than highly specialized features such as matched fil-


ters or cross ambiguity function (CAF) plane searches, but they are still highly specific

to a narrow class of modulation types or properties which is problematic, especially

as learned communications systems drastically increase the range of signal types possi-

ble. Figure 4.36 illustrates this trade-space at a high level, showing how objective based

learned feature detectors fill a much desired gap by obtaining both. This ideal class of de-

tectors, which achieves both high sensitivity and wide generality, can be obtained through

data centric machine learning approaches relying on feature learning, where, given suffi-

cient data, highly sensitive features are learned for many different signal types using the

same basic approach without the need for hand tuning or manual feature engineering.

There are many pre-processing signal representation domains in which detection strate-

gies can be applied: raw time domain, frequency domain, wavelet domain, combinations

of these, or others. As the most straightforward approach with analogues to existing work

in computer vision, we consider the 2D time-frequency spectrogram plane for our work

and leverage image object detection techniques which have already reached maturity,

surpassing human levels of performance in many cases [156, 167]. The intuition for this

approach is strong, as skilled domain engineers can regularly perform manual observa-

tion on spectrogram images and identify and localize signals highly accurately with their

eyes, illustrating the sufficient availability of information given the right interpretation.

This approach to object detection has recently come to the forefront in medical imaging and

other non-visual domains, providing computer assisted diagnosis in radiology and

other fields which in many cases outperforms panels of skilled radiologists in identifying


cancer [168], fractures, Alzheimer’s disease [169], and others.

Figure 4.37: Computer Vision CNN-based Object Detection Trade Space, from [17]

We consider the application of several leading computer vision object detection approaches

to the task of radio signal detection in [18].

Each of the leading techniques in recent years has relied on CNNs for learned features

on the front end, while numerous strategies exist for architectures, targets, loss functions,

iteration, and training. Each of these relies on large training sets containing annotations

with bounding boxes to indicate and localize ground truth of various object classes in the

image. Networks then typically learn to predict bounding boxes, class labels, and confi-

dence metrics through some means for which there are several strategies. Initial promis-

ing solutions to the problem relied on region proposal networks such as region-based

convolutional neural network (R-CNN) [170], Fast R-CNN [40], and newer versions of

this technique, which rely on conducting multiple network forward passes for each ob-

Figure 4.38: Example bounding box detections in computer vision, from [17]

ject or region proposal in an image to iteratively refine the region prediction. This works

well, but is quite expensive computationally and consequently slow when considering

the throughput of many classifications on finite computing resources. In radio detection,

we often seek to perform detection at extremely high rates and low latencies for many

wide-band spectrum sensing tasks, where speed is key. The you only look once (YOLO)

approach [171] solved this by proposing a single feed forward pass network which jointly

produces object bound and class proposals for a grid of regions within the image simul-

taneously. This approach of bounding box and class prediction within a single network

forward pass was improved upon by SSD [172] and then improved further in [17]. Among

the improvements are changes to network architectures, the use of anchor boxes, and im-

proved loss functions for regression.


We use a network architecture for this work which is a variant of YOLO (known as tiny-

YOLO) as is described in table 4.11. Note that this network is much smaller than the

full-size one used in [17]. Compared to visual object recognition tasks, recognition of

spectral events is a relatively simpler task in many cases, allowing for smaller networks

to be used. Additionally, a smaller network helps to reduce over-fitting on the currently

available smaller datasets for the task, and reduces the computational complexity of for-

ward passes, resulting in lower power and faster operation.

Table 4.11: Table input/output shapes

Layer Number Layer Type Kernel Size Number of Feature Maps

1,2,3,4,5,6 Conv+Maxpool (3,3) 16,32,64,128,256,512

7,8 Conv (3,3) 1024,1024

9 Conv (1,1) 30

We train our system using the same approach as presented in the YOLO method, but we

can make a handful of simplifications for detection. We consider an S × S grid of detec-

tions, predicting B bounding boxes for each cell along with a set of C class probabilities

as in [171]. We consider the YOLO loss function given below in equation 4.5, where 1obj

is evaluated only when the cell contains an object, and 1no−obj is evaluated only when

the cell does not contain an object. We do not use anchor boxes or Intersection over

union (IOU) loss for this initial work, performing direct regression of w and h instead,

leaving this for future work which we believe will almost certainly yield further improve-

ments.


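\begin{equation}
\begin{aligned}
L_{YOLO} = {} & \lambda_{c} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \, D_{L2}^{2}\big((x_i, y_i), (\hat{x}_i, \hat{y}_i)\big) \\
& + \lambda_{c} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \, D_{L2}\big((w_i, h_i), (\hat{w}_i, \hat{h}_i)\big) \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \, (C_i - \hat{C}_i)^2 \\
& + \lambda_{no-obj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{no-obj} \, (C_i - \hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2} \sum_{c \in classes} \big(p_i(c) - \hat{p}_i(c)\big)^2
\end{aligned}
\tag{4.5}
\end{equation}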

Figure 4.39: YOLO style per-grid-cell bounding box regression targets

Here, the first two terms of the loss minimize the L2 distance of the bounding box location

(x/y) and size (h/w) when an object is present (as shown in figure 4.39), while terms three

and four minimize error in the confidence metric, and the final term minimizes error in

class prediction probabilities. In our case, if we seek to perform object detection on a single class,

RF emissions, we can drop the final term and only perform bounding box

regression and confidence estimation for a single object class, simplifying the task and

network complexity significantly.
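The reduced single-class objective, keeping only bounding box regression and confidence estimation, can be sketched in NumPy as follows. This is a minimal illustration and not the training code used in this work: the function name, the (S, S, B, 5) tensor layout, and the default weights of 5.0 and 0.5 (the values suggested in the original YOLO paper [171]) are assumptions, and a squared L2 distance is used for both the location and the size terms for simplicity.

```python
import numpy as np


def single_class_yolo_loss(pred, target, lambda_c=5.0, lambda_noobj=0.5):
    """Bounding box regression plus confidence loss for one object class.

    pred, target: arrays of shape (S, S, B, 5) holding (x, y, w, h, C).
    target[..., 4] is 1 where a box is responsible for an object, else 0.
    The layout and default weights are illustrative assumptions.
    """
    obj = target[..., 4]          # indicator: object present in this box slot
    noobj = 1.0 - obj             # indicator: no object present
    # Squared L2 distance for box centre (x, y) and direct regression of (w, h).
    xy_err = np.sum((pred[..., 0:2] - target[..., 0:2]) ** 2, axis=-1)
    wh_err = np.sum((pred[..., 2:4] - target[..., 2:4]) ** 2, axis=-1)
    # Squared error of the confidence value.
    conf_err = (pred[..., 4] - target[..., 4]) ** 2
    loss = (lambda_c * np.sum(obj * xy_err)
            + lambda_c * np.sum(obj * wh_err)
            + np.sum(obj * conf_err)
            + lambda_noobj * np.sum(noobj * conf_err))
    return float(loss)


if __name__ == "__main__":
    target = np.zeros((2, 2, 1, 5))
    target[0, 0, 0] = [0.5, 0.5, 0.2, 0.1, 1.0]   # one emission in cell (0, 0)
    pred = target.copy()
    pred[1, 1, 0, 4] = 1.0                        # false detection in an empty cell
    print(single_class_yolo_loss(pred, target))
```

Because location and size errors are masked by the object indicator, an empty cell contributes only through its confidence error, weighted down by lambda_noobj.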

Figure 4.40: Radio bounding box detection examples, from [18]

In figure 4.40 we illustrate a synthetic wide-band bounding-box annotated radio dataset

generated for the DARPA Battle-of-the-ModRecs competition using a set of our custom

wide-band signal generation tools in GNU Radio [173]. Here we show ground truth

bounding boxes alongside predicted bounding boxes produced by our trained tiny-

YOLO detector on a validation portion of the dataset. In this case, we obtain excellent

performance in predicting good bounding box annotations and maintain resilience to

wideband noise emissions across the band which appear as energy at our detector.

We also illustrate the performance of the model as tested on an over the air wide-band

spectrogram using tools being developed by DeepSig Inc. In figure 4.41, we show the


Figure 4.41: Over the air wideband signal bounding box prediction example

received radio spectrogram for an ISM band, with a series of rapid bursty radio emissions

occurring throughout. This spectrogram has been labeled with annotations using a similar

YOLO-style network with bounding box regression and confidence prediction, where we

have thresholded out all low confidence boxes, which are not shown. Here we can

see that a number of traditionally difficult tasks such as discerning overlapping bursts,

adjacent bursts, and heavily faded bursts are all handled appropriately.

This is a key result for the learned detector approach: through a generic process of human

bounding box guidance we are able to rapidly train a detector to perform as desired for an

unknown signal type without significant investment in additional specialized detection

algorithms. This technique is especially powerful as the detector, operating over a receptive field, is

much more resilient to small impairments, occlusions (interference), or other distortions


in the signals which might have readily caused a simple energy based detector to mis-

detect or poorly bound a radio signal emission. Work remains to be done to quantify the

performance of the detector in a classical constant false alarm rate or receiver operating

characteristic (ROC) curve style sensitivity analysis against the classical binned energy

detector, but based on comparable results in computer vision and human visual capabil-

ities when performing this task manually, we believe such a study in future work will

yield excellent results soon.

Chapter 5

Learning Radio Structure

Much of the work discussed to this point has been focused on either learning new physi-

cal layer communications systems or learning in a supervised way how to detect, classify

and label radio emissions. This chapter takes a step back and looks at how unlabeled

radio signal data (which describes most available data in the world, and the data hitting

our sensors) can be used in order to learn structure of radio signals, enable compression

of radio signals, and to partition and learn to separate types of radio signals without train-

ing or through a semi-supervised approach. It also takes a deeper dive into the question

of how to select network architectures and hyper-parameters for training various tasks

through approaching it as a guided model search problem, a key enabler for radio algo-

rithm discovery and optimization.


5.1 Unsupervised Structure Learning

Widely used single-carrier radio signal time series modulation schemes today use a rel-

atively simple set of supporting basis functions to modulate information into the radio

spectrum. Digital modulations typically use sine wave basis functions with pseudo-

orthogonal properties in phase, amplitude, or frequency. Information bits are used to

map a symbol value si to a location in this space φj, φk, .... In figure 5.1 we show three

common basis functions where φ0 and φ1 form phase-orthogonal bases used in PSK and

QAM, while φ0 and φ2 show frequency-orthogonal bases used in frequency shift key-

ing (FSK). In the final panel of figure 5.1 we show a common mapping of constellation points

into this space used in Quadrature Phase Shift Keying (QPSK) to encode two bits of in-

formation per symbol.
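As a small numerical illustration of these basis functions, the sketch below builds sampled versions of φ0, φ1, and φ2, checks their orthogonality, and maps bit pairs onto QPSK symbols. The choice of 8 samples per symbol, one carrier cycle per symbol, the Gray-style bit mapping, and the function names are all assumptions made for illustration; they are not taken from figure 5.1.

```python
import numpy as np

N = 8                                  # samples per symbol (assumption)
t = np.arange(N) / N                   # one symbol period, normalised
phi0 = np.cos(2 * np.pi * t)           # in-phase basis
phi1 = np.sin(2 * np.pi * t)           # phase-orthogonal (quadrature) basis
phi2 = np.cos(4 * np.pi * t)           # frequency-orthogonal basis, as in FSK


def qpsk_modulate(bits):
    """Map pairs of bits onto (phi0, phi1); returns shape (n_symbols, N)."""
    bits = np.asarray(bits).reshape(-1, 2)
    i = (1 - 2 * bits[:, 0]) / np.sqrt(2)   # bit 0 -> +, bit 1 -> -
    q = (1 - 2 * bits[:, 1]) / np.sqrt(2)
    return i[:, None] * phi0 + q[:, None] * phi1


def qpsk_demodulate(wave):
    """Project each symbol back onto the basis functions and slice."""
    i_hat = (2.0 / N) * (wave @ phi0)
    q_hat = (2.0 / N) * (wave @ phi1)
    return np.stack([i_hat < 0, q_hat < 0], axis=1).astype(int).ravel()


if __name__ == "__main__":
    bits = [0, 0, 0, 1, 1, 0, 1, 1]
    print(phi0 @ phi1, phi0 @ phi2)             # both approximately zero
    print(qpsk_demodulate(qpsk_modulate(bits)))  # recovers the input bits
```

Projection onto the two phase-orthogonal bases recovers the in-phase and quadrature coordinates of each symbol, which is why two bits per symbol can be carried on a single carrier frequency.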

Digital modulation theory in communications is a rich subject explored in much greater

depth in numerous great texts such as [174].

Figure 5.1: Example Radio Communications Basis Functions


We seek to learn a sparse representation using learned convolutional basis functions

which maximally compresses radio signals of interest, obtaining the most sparse rep-

resentation possible. Given that there is random data modulated onto the radio signal and

CSI stored about its arrival mode, there is certainly some information theo-

retic limit to how far the information can be compressed while still allowing the same

information to be reconstructed. We can lower bound this by the entropy of

the data bits, but likely also need to consider the entropy of the encoded CSI.

Figure 5.2: Convolutional Autoencoder Architecture for Signal Compression

We set up a minimal convolutional autoencoder as shown in figure 5.2 where an input

complex time domain radio signal is decomposed into a small set of convolutional fil-

ters, compressed to a small number of activations through a fully-connected layer, then

decompressed and reconstructed through a similar fully-connected and convolutional re-

gression layer. In this case, we use linear activations on the convolutional layers, and

non-linear activations only on the fully-connected compression layers.
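The data flow of such an autoencoder can be sketched as a forward pass in NumPy. This is a shape-level illustration with random, untrained weights, not the model of figure 5.2: the filter count, filter length, the tanh activation in the decoder, and all function names are assumptions, while the 88-sample complex input (stored as I and Q columns) and the 44-value code follow the QPSK example discussed in this section.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv1d_same(x, kernels):
    """Linear 'same' 1D convolution (cross-correlation form).

    x: (length, in_channels); kernels: (taps, in_channels, out_channels).
    """
    taps = kernels.shape[0]
    pad_left = (taps - 1) // 2
    pad_right = taps - 1 - pad_left
    padded = np.pad(x, ((pad_left, pad_right), (0, 0)))
    # windows[t, l, i] = padded[l + t, i]
    windows = np.stack([padded[k:k + x.shape[0]] for k in range(taps)], axis=0)
    return np.einsum("tli,tio->lo", windows, kernels)


def init_params(rng, length=88, filters=4, taps=9, code_size=44):
    """Random weights; filters and taps are hypothetical sizes."""
    scale = 0.1
    return {
        "enc_conv": scale * rng.standard_normal((taps, 2, filters)),
        "enc_fc": scale * rng.standard_normal((length * filters, code_size)),
        "dec_fc": scale * rng.standard_normal((code_size, length * filters)),
        "dec_conv": scale * rng.standard_normal((taps, filters, 2)),
    }


def autoencode(x, params):
    h = conv1d_same(x, params["enc_conv"])              # linear conv encoder
    code = sigmoid(h.reshape(-1) @ params["enc_fc"])    # non-linear compression
    g = np.tanh(code @ params["dec_fc"])                # non-linear expansion
    g = g.reshape(x.shape[0], -1)
    x_hat = conv1d_same(g, params["dec_conv"])          # linear conv regression
    return code, x_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((88, 2))                    # I/Q columns
    code, x_hat = autoencode(x, init_params(rng))
    print(code.shape, x_hat.shape)
```

Training such a model would minimize a reconstruction loss such as the mean squared error between x and x_hat; the sigmoid on the code layer is what allows the intermediate values to saturate towards 0 or 1.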


Figure 5.3: Convolutional Autoencoder reconstruction of QPSK example 1

Inspecting a QPSK signal compressed in this way in figure 5.3, we see that the complex

continuous valued 88 sample input signal can be quite cleanly reconstructed at the output

while passing through an intermediate layer of 44 values which saturate

at 0 or 1. Interestingly, because only the structural portions of the signal are represented in

the basis functions, significant amounts of high frequency noise which does not lie on the

basis functions has naturally been removed in the reconstruction.

Another example is shown in figure 5.4 where relatively clean reconstruction is achieved

in the same way. Considering the compression occurring here, we have 88*2=176 float32

values for each input example, consisting of a total of approximately 5632 bits, while we

have a saturated sparse representation of approximately 44 bits. This is a compression

factor of approximately 128x.

If we instead consider the input signal to be dynamic range limited to approximately 20 dB

SNR (assuming optimal representation scaling), and assume the signal can be represented


Figure 5.4: Convolutional Autoencoder reconstruction of QPSK example 2

Figure 5.5: AE Encoder Filter Weights

Figure 5.6: AE Decoder Filter Weights

in 4-bit precision with quantization error not reducing SNR (e.g. 6.02 dB × 4 bits = 24.08 dB >

20 dB), then we can assume the input signal to be 704 bits of information compressed down

to 44 bits, still a compression factor of 16x. This is relatively encouraging for a scheme

which is perhaps the simplest convolutional autoencoder which could be employed for

such a thing with no tuning.
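The two compression factors quoted above follow directly from the sizes involved, as the short calculation below shows. The variable and function names are illustrative, and the figure of 6.02 dB of dynamic range per bit is the rule of thumb already used in the text.

```python
N_SAMPLES = 88           # complex samples per example
VALUES = N_SAMPLES * 2   # real values (I and Q)
CODE_BITS = 44           # saturated sparse representation


def compression_factor(bits_per_value):
    """Return (input bits, compression factor) for a given input precision."""
    input_bits = VALUES * bits_per_value
    return input_bits, input_bits / CODE_BITS


if __name__ == "__main__":
    print(compression_factor(32))   # float32 input
    print(compression_factor(4))    # 4-bit input
    print(6.02 * 4)                 # dynamic range of 4-bit quantization in dB
```

With float32 inputs this gives 5632 bits and a factor of 128; with 4-bit inputs it gives 704 bits and a factor of 16.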

Interestingly, if we inspect the filter weights learned in the convolutional encoding layer


and convolutional decoding layer in figures 5.5 and 5.6, we can see that the basis functions

for PSK modulation at the given relative symbol rate with RRC pulse shaping are learned

directly in the filter weights. This raises an interesting possibility for discovering the basis

functions for any new unknown modulation type simply based on learning a similar

sparse representation thereof. It also raises the question of whether some Galois field GF(2)

logic function exists to map the sparse representation bits into the transmitted data bits. If

that is the case, through compression we would have just naively learned a demodulator

for any new random modulation type solely through reconstruction loss.

Finally, the implications for denoising the input signal visible in figures 5.3 and 5.4 are

quite interesting. Through projection onto basis functions and reconstruction therefrom,

such an approach might offer a lower complexity alternative to the full demodulation, re-

modulation and subtraction currently used in successive interference cancelation (SIC),

offering the possibility of a computationally cheaper version of this technique.

5.2 Unsupervised Class Discovery

Labeling of datasets can be expensive, difficult and time consuming. For this reason, as

we turn increasingly to machine learning and data centric methods, it is important to

develop methods which exploit unsupervised learning as much as possible to minimize

the human curation requirement when unnecessary, and to maximally leverage human

guidance when it is needed. In [175], we consider a collection of techniques for unsu-

pervised and semi-supervised [176, 177, 178] identification of radio signal emission types

using structure learning, sparse embedding, and clustering.

Dimensionality reduction techniques such as principal component analysis (PCA) [179] and

independent component analysis (ICA) [180] have been used widely in signal processing

to obtain low dimension representations, to perform compression and de-noising, and for

other purposes. Non-linear versions such as kernel-PCA [181] exist which extend these

methods into the non-linear representation domain; however, the choice of kernel often

severely limits non-linear representation capacity, and leaves much to be desired in

terms of improved non-linear models for dimensionality reduction. Autoencoders with

non-linear activations as discussed in section 5.1 offer a potential for significantly im-

proved non-linear dimensionality reduction and representation beyond what has been

achievable with prior methods. Recent work for instance in image and video compression

domains [182, 183] has shown that such nonlinear autoencoder compression schemes can

achieve better and more compressed low-dimensional representations of image domain

examples than previously achievable with other techniques.
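For comparison with the non-linear autoencoder, the linear baseline can be written in a few lines. The sketch below implements PCA through the singular value decomposition in NumPy; the function names and the synthetic test data are illustrative assumptions and are not part of the experiments in this work.

```python
import numpy as np


def pca_fit(x, k):
    """Return the data mean and the top-k principal directions (k, dims)."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:k]


def pca_transform(x, mean, components):
    """Project examples onto the principal directions (linear encoder)."""
    return (x - mean) @ components.T


def pca_reconstruct(z, mean, components):
    """Map low-dimensional codes back to the input space (linear decoder)."""
    return z @ components + mean


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data lying on a 2-dimensional subspace of a 10-dimensional space.
    x = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 10))
    mean, comps = pca_fit(x, k=2)
    z = pca_transform(x, mean, comps)
    print(z.shape, np.allclose(pca_reconstruct(z, mean, comps), x))
```

PCA is exactly a linear encoder and decoder pair with tied weights, so it reconstructs perfectly only when the data lie on a linear subspace; the non-linear activations in an autoencoder remove that restriction.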

We consider both supervised and unsupervised methods for learning sparse representa-

tions or embeddings of RF signal examples in figures 5.7 and 5.8. These are both compres-

sive non-linear representations, but they have different objectives. In the case of the su-

pervised method, discriminative features are learned which help impart human guidance

on the objective class separation. In the case of the unsupervised method, reconstructive

features are learned which simply try to best reconstruct each example through the non-


Figure 5.7: Supervised Embedding Approach

Figure 5.8: Unsupervised Embedding Approach

linear compressed representations of supporting learned convolutional basis functions

which minimize reconstruction loss (e.g. MSE).

Each of these embeddings offers its own advantages. In the purely unsupervised case,

of course, requiring zero labeling work is appealing as large amounts of unlabeled

radio data are readily available. In the case of supervised learning, the features and rep-

resentations are already guided towards signal type discrimination, but in some cases

may not generalize well to separation of new modulation types.

In figures 5.9 and 5.10 we illustrate the resulting clustering of 11 radio signal modulation

classes using these two embedding approaches. Embeddings are further reduced

from ∼ 40 dimensions down to 2 for visualization using t-SNE [113]. For the supervised

features we use the embedding of the final layer of a VGG-style CNN, prior to the fi-

nal fully-connected SoftMax output layer, and for the unsupervised feature training we


Figure 5.9: Supervised Signal Embeddings

Figure 5.10: Unsupervised Signal Embeddings

use the output of a small convolutional autoencoder. We color example points with their

class labels for visualization. For the supervised embedding clustering, we can see excel-

lent separability of classes for virtually all classes, but label information was used in the

creation of the feature space. For unsupervised embedding clustering, we can see some

degree of separability in some of the more distinct classes (e.g. 8-PSK, AM-SSB, AM-DSB),

but see significant mixing between similar modulation types which share common basis

function properties (e.g. BPSK/QPSK mixing, QAM16/QAM64 mixing).

We can measure the ability of these approaches to generalize to some degree by training

and clustering them using hold-out classes which are introduced after embedding space

training, without labels. In doing so, we can begin to measure the quantitative accu-

racy with which each approach successfully detects new classes as new clusters. We also

create a clustering representation in which a human curator can begin to label examples

by cluster rather than by individual example. These are both important steps towards


creating learning systems which scale and learn from new data and emitters over time,

however much of the quantitative analysis and optimization of this approach is left for

future work.

5.3 Neural Network Model Discovery and Optimization

One of the biggest problems in the use of artificial neural networks for machine learn-

ing is the task of architecture selection and hyper-parameter optimization. Architectures

can make an enormous difference in the performance of a neural network in terms of

accuracy and computational cost (as recently demonstrated in [184]), by introducing ap-

propriate classes of tied weights (e.g. convolutional layers, dilated convolutions) and by

appropriately managing the degrees of freedom in a network (e.g. pooling, striding, etc)

to preserve enough information at each layer while keeping the free-parameter count low

enough and incorporating a domain appropriate distillation mode for information.

In section 5.3 we review a number of the published state of the art approaches in re-

cent deep learning literature for solving this problem. Unfortunately, many of these ap-

proaches are too computationally complex for people with finite computing resources

and funding (i.e. other than Google/Facebook).

Figure 5.11: Compact Model Network Digraph and Hyper-Parameter Search Process

As a solution we develop a model based on a simplified version of Google’s evolutionary model search approach in [8]. Here we represent a directed graph of high level neural network primitives and key hyper-parameters as a compact model description as shown in figure 5.11. We implement evolutionary routines [185] for random model generation, mutation and crossover of model graph structure and hyper-parameters, and leverage an evolutionary particle swarm optimization [186] approach to generating and breeding populations of models. In contrast to the approach in [8], which we presume is run across a large distributed cluster of computing nodes (to support population sizes of 1000), we evaluate our model on a single Nvidia Digits development server with 4 Titan X GPU cards, with substantially smaller population sizes and search lengths. We call this approach EvolNN (evolutionary neural network).
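The generate, mutate, and crossover routines described above can be illustrated with a minimal sketch. This is not the EvolNN implementation: it substitutes a plain elitist genetic algorithm for the evolutionary particle swarm method of [186], it flattens the model digraph to a simple layer sequence, and every name in it (`Genome`, `LAYER_TYPES`, `WIDTHS`, `toy_fitness`, `evolve`) is hypothetical. The fitness function is a toy stand-in; in a real search it would train each candidate network and return its validation score.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical primitive vocabulary; the text does not list EvolNN's actual primitives.
LAYER_TYPES = ("conv", "maxpool", "avgpool", "dropout")
WIDTHS = (8, 16, 32, 64, 128)
MAX_DEPTH = 8

@dataclass(frozen=True)
class Genome:
    """Compact model description: a layer sequence plus one hyper-parameter."""
    layers: Tuple[Tuple[str, int], ...]
    learning_rate: float

def random_layer(rng: random.Random) -> Tuple[str, int]:
    return (rng.choice(LAYER_TYPES), rng.choice(WIDTHS))

def random_genome(rng: random.Random) -> Genome:
    depth = rng.randint(1, 5)
    layers = tuple(random_layer(rng) for _ in range(depth))
    return Genome(layers, 10 ** rng.uniform(-4, -1))

def mutate(g: Genome, rng: random.Random, rate: float) -> Genome:
    # Resample individual layers, then possibly insert or delete one layer.
    layers = [random_layer(rng) if rng.random() < rate else layer for layer in g.layers]
    if rng.random() < rate and len(layers) < MAX_DEPTH:
        layers.insert(rng.randrange(len(layers) + 1), random_layer(rng))
    if rng.random() < rate and len(layers) > 1:
        layers.pop(rng.randrange(len(layers)))
    lr = g.learning_rate
    if rng.random() < rate:
        lr = min(1e-1, max(1e-5, lr * 10 ** rng.uniform(-0.5, 0.5)))
    return Genome(tuple(layers), lr)

def crossover(a: Genome, b: Genome, rng: random.Random) -> Genome:
    # Splice a prefix of one parent onto a suffix of the other.
    head = a.layers[: rng.randint(0, len(a.layers))]
    tail = b.layers[rng.randint(0, len(b.layers)):]
    layers = (head + tail)[:MAX_DEPTH] or a.layers[:1]
    return Genome(layers, rng.choice((a.learning_rate, b.learning_rate)))

def evolve(fitness: Callable[[Genome], float], population_size: int = 32,
           generations: int = 4, mutation_rate: float = 0.2,
           crossover_rate: float = 0.5, seed: int = 0) -> Tuple[Genome, List[float]]:
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        history.append(fitness(ranked[0]))
        parents = ranked[: max(2, population_size // 4)]
        children = [ranked[0]]  # elitism: the best model always survives
        while len(children) < population_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng) if rng.random() < crossover_rate else a
            children.append(mutate(child, rng, mutation_rate))
        population = children
    return max(population, key=fitness), history

def toy_fitness(g: Genome) -> float:
    # Stand-in for "train the network and return validation accuracy".
    convs = sum(1 for kind, _ in g.layers if kind == "conv")
    return convs - 0.5 * abs(len(g.layers) - 3)

best, history = evolve(toy_fitness, population_size=16, generations=5, seed=1)
```

Because the best genome is carried over unchanged each generation, the recorded best fitness can never decrease, which mirrors the monotone search trajectories one would hope to see when each evaluation is deterministic.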

Evaluating the model on several benchmark test sets, evolutionary model search finds solutions which score quite well on standard benchmarks like MNIST fairly readily (figure 5.13), while we can also apply the search to very difficult datasets such as the hard 24-modulation dataset from section 4.2.3, shown in figure 5.12. In this case, the tasks of image and modulation classification are completely different domains, but the same evolutionary approach is able to find reasonable solutions to both very quickly.

Figure 5.12: EvolNN ModRec Net Search Accuracy

Figure 5.13: EvolNN MNIST Net Search Accuracy

Table 5.1: Final small MNIST search CNN network

Layer        Output dimensions
Input        28 × 28 × 1
Conv         24 × 24 × 104
Dropout      24 × 24 × 104
Flatten      59904
FC/SoftMax   10

Table 5.2: Final Modrec search CNN network

Layer        Output dimensions
Input        1024 × 2
Conv         335 × 11
Conv         323 × 256
AvgPool      107 × 256
MaxPool      21 × 256
MaxPool      5 × 256
Flatten      1280
FC/SoftMax   24

Both of these tasks are configured by providing a reference dataset with input and output shapes, a categorical cross-entropy (CCE) loss function for classification, and an evolutionary model configuration including population size, number of generations, cross-over rate, mutation rate, etc. For the search accuracy trajectories shown above, we ultimately obtain the best models given in tables 5.1 and 5.2.
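A search task of this kind can be captured in a small configuration record. The sketch below is illustrative only: the class name `SearchConfig`, its field names, and the loss identifiers are hypothetical rather than the EvolNN interface. The MNIST shapes follow table 5.1, and the population size and generation count are those reported for the MNIST search in this section; the cross-over and mutation rates are not stated in the text, so the values here are placeholders.

```python
from dataclasses import dataclass
from typing import Tuple

SUPPORTED_LOSSES = ("categorical_crossentropy", "mse")

@dataclass(frozen=True)
class SearchConfig:
    """Hypothetical description of one evolutionary search task."""
    input_shape: Tuple[int, ...]
    output_shape: Tuple[int, ...]
    loss: str
    population_size: int
    generations: int
    crossover_rate: float
    mutation_rate: float

    def __post_init__(self) -> None:
        if self.loss not in SUPPORTED_LOSSES:
            raise ValueError(f"unsupported loss: {self.loss}")
        if self.population_size < 2 or self.generations < 1:
            raise ValueError("need at least 2 models and 1 generation")
        for rate in (self.crossover_rate, self.mutation_rate):
            if not 0.0 <= rate <= 1.0:
                raise ValueError("rates must lie in [0, 1]")

    @property
    def max_evaluations(self) -> int:
        # Upper bound on the number of networks trained during the search.
        return self.population_size * self.generations

mnist_search = SearchConfig(
    input_shape=(28, 28, 1),
    output_shape=(10,),
    loss="categorical_crossentropy",
    population_size=32,
    generations=4,
    crossover_rate=0.5,   # placeholder, not given in the text
    mutation_rate=0.2,    # placeholder, not given in the text
)
```

Under these settings the search trains at most 128 candidate networks, which gives a sense of why it fits on a single multi-GPU server.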


For the MNIST model, we find the solution in only 4 generations of population size 32, with the best model achieving an accuracy of 99.22% on the validation set. For the modulation recognition task, a significantly more difficult task, we obtain a slightly larger network, which learns to narrow the information representation gradually using several convolutional layers and pooling layers. In this case the best performance is only 42% validation set accuracy, compared to the ∼76% achieved through expert design, but we observe a steady, slow growth in performance throughout the evolution process, and believe that with additional search tuning and longer search times much better models could be found through this approach.
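The output dimensions in tables 5.1 and 5.2 can be checked with simple shape arithmetic. In the sketch below, the kernel sizes, strides, and pool sizes are assumptions chosen to reproduce the tabulated dimensions under unpadded ("valid") convolution; the text does not state them, and other choices give the same shapes (for example, a kernel of 20 or 21 with stride 3 also maps 1024 to 335). The helper names are hypothetical.

```python
def conv_out(length: int, kernel: int, stride: int = 1) -> int:
    """Output length of an unpadded convolution along one axis."""
    return (length - kernel) // stride + 1

def pool_out(length: int, size: int) -> int:
    """Output length of non-overlapping pooling along one axis."""
    return length // size

# Table 5.1: 28 x 28 x 1 input, one conv layer with 104 filters.
# A 5 x 5 kernel with stride 1 is assumed; it is the unpadded stride-1 choice
# that maps 28 to 24.
mnist_side = conv_out(28, kernel=5)
mnist_flat = mnist_side * mnist_side * 104

# Table 5.2: 1024-sample input, two conv layers, then three pooling layers.
# All kernel, stride, and pool sizes below are assumed, not taken from the text.
modrec_lengths = [conv_out(1024, kernel=22, stride=3)]
modrec_lengths.append(conv_out(modrec_lengths[-1], kernel=13))
for size in (3, 5, 4):
    modrec_lengths.append(pool_out(modrec_lengths[-1], size))
modrec_flat = modrec_lengths[-1] * 256

print(mnist_side, mnist_flat)        # 24 59904
print(modrec_lengths, modrec_flat)   # [335, 323, 107, 21, 5] 1280
```

The flatten sizes (59904 and 1280) follow directly from the tabulated dimensions and do not depend on the assumed kernel sizes.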

Figure 5.14: EvolNN CFO estimation network search loss

By simply changing the objective loss function of the evolutionary process (in this case to MSE) we can use the same infrastructure to search for good regression networks. In this case, the CFO estimation network we previously showed in table 4.1 is the best model found for a model search on our CFO estimation dataset and task. Figure 5.14 shows the evolutionary model loss over a number of generations, where we can see the estimator MSE converging to smaller values throughout the search process and ultimately arriving at a best MSE of 0.0011. Here we search for 32 generations, each with a population size of 32.
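The switch from classification to regression search amounts to swapping the loss that scores each candidate. A minimal sketch of the two losses named in this section, with a hypothetical `fitness` wrapper that converts a loss into a score to be maximized, might look as follows; the function names and the registry are illustrative and are not taken from the EvolNN code.

```python
import math
from typing import Callable, Dict, Sequence

def categorical_crossentropy(y_true: Sequence[Sequence[float]],
                             y_pred: Sequence[Sequence[float]],
                             eps: float = 1e-12) -> float:
    """Mean over examples of -sum_k y_true[k] * log(y_pred[k])."""
    total = 0.0
    for t_row, p_row in zip(y_true, y_pred):
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(t_row, p_row))
    return total / len(y_true)

def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean of squared differences between targets and estimates."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical registry: the search infrastructure only needs a name to look up.
LOSSES: Dict[str, Callable] = {
    "cce": categorical_crossentropy,
    "mse": mean_squared_error,
}

def fitness(loss_name: str, y_true, y_pred) -> float:
    # Lower loss means a fitter model, so negate to obtain a score to maximize.
    return -LOSSES[loss_name](y_true, y_pred)

# Classification: one-hot labels against predicted class probabilities.
cls_score = fitness("cce", [[1.0, 0.0]], [[0.5, 0.5]])
# Regression: scalar targets (e.g. frequency offsets) against estimates.
reg_score = fitness("mse", [0.0, 1.0], [0.0, 0.5])
```

Nothing else in the search loop needs to change, since selection only compares scores between candidates.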

Neuro-evolution [29] is a very powerful tool with significant biological grounding. We leverage this high-level intuition for both evolutionary model selection and loss-feedback-based model optimization; both are very rough approximations of how we believe biological learning and evolution occur. Both of these processes seem to have a very long way to go before any notion of optimality is reached, but initial results are promising and generalize reasonably well to new tasks such as estimator synthesis, for which no literature exists on best practices for manually crafting and optimizing model architectures. The models shown here are still trivially small compared with the full-size state of the art architectures used today, but we expect that sufficient computation and evolutionary tuning can close this gap. As computing and data become increasingly cheap, such guided search approaches to model and architecture selection are increasingly appealing when compared with lengthy, expensive and less effective manual architecture engineering and tuning cycles.

Chapter 6

Conclusion

Machine learning and parallel computing have provided a set of incredibly powerful tools over recent years, opening up orders of magnitude improvement in our ability to optimize very large scale, high degree of freedom problems through direct gradient descent on well-formed loss functions. While these tools are being readily applied in the computer vision and NLP spaces today, their full engineering impact will not be realized in applications and in industry for many years to come. These tools represent a major shift in algorithm design away from simplified model-based solutions and specialized software routines, towards data-centric model optimization using highly general parametric models capable of learning very high-dimensional solutions to many difficult tasks through end-to-end learning.

This enormous shift in design methodology does not mean we can’t perform quantitative analysis, measurement, and probabilistic characterization of the performance of such models, but it does make predicting or guaranteeing performance somewhat more difficult in many cases, principally because the models are derived from dataset distributions which are themselves not well characterized and are formed from complex real-world distributions. Many of the probabilistic tools for guaranteeing, explaining and optimizing performance are catching up quickly, but since such models now rely heavily on high-dimensional datasets directly for learning, performance guarantees will necessarily become a much more complex function of the dataset distribution rather than of a compact simplified model.

This same shift was extremely contentious in the computer vision domain before widespread adoption, and the same resistance is being felt in many other fields, including radio signal processing. Learning directly from rich distributions of large datasets, rather than assuming the conventional compact models which have been used for years in the wireless space, still meets with hostility. As stated by an anonymous [highly negative] reviewer for a conference paper this past year, “Radio spectra are not mere images of cats but are issued from well acknowledged and fairly accurate wireless communication models”; many people are not pleased with attempts to rely on data instead of solely on these models. In reality many of the models used are insufficient, and there is much still to gain by leveraging the best of both worlds.

Ultimately we are at a crossroads in wireless and signal processing, where practitioners of both analytic compact model construction and large approximate model construction must adopt good data science practices, such as benchmark tasks and datasets which truly reflect useful target tasks in the real world. We have attempted to help address this issue by open sourcing and publishing several datasets throughout this work, on which numerous classes of approach can be fairly and quantitatively compared. We truly hope that more people will adopt this approach to algorithm development and optimization, making benchmarks open and quantitatively tracking approach scores in a way similar to ImageNet [1], CIFAR [107], Kaggle challenges [187], or other well characterized and scored tasks.

Funding agencies such as DARPA, NIST, NSF, or industry can significantly help this pro-

cess by explicitly funding, promoting and publishing high quality datasets and data to

accompany desired tasks, which is no trivial task and can often require significant real in-

vestment. This approach has been highly successful in vision and other fields and stands

to revolutionize how communications system engineering is done today.

While much of the work throughout my dissertation studies has perhaps raised many more questions than it has answered about the topic, I believe many of the data-centric approaches to radio signal sensing, labeling, communications system synthesis, and design developed herein hold a high degree of inevitability for the field. Certainly specific optimization techniques and architectures will continue to change and advance over the coming years, but the basic shift towards optimization of high degree of freedom models on real datasets and impairment models seems likely to rapidly become the norm as quantitative performance results become stronger and more widely disseminated and accepted. The list of potential research directions and applications in the field as this transition occurs represents an enormously rich array of possibilities and areas for improvement. Initial results shown herein provide significant evidence of the disruptive potential for improvement in the radio signal processing space, in communications system learning, sensor system learning, and many similar applications considered from a machine learning perspective. Building, integrating and deploying such systems into the real world holds many remaining engineering challenges, but I look forward to rapidly maturing this field and learning from others as the field grows and machine learning based radio physical layers and signal processing techniques improve.

6.1 Publication List

Below is the relevant body of academic published work corresponding to my dissertation

research over the past several years.

Journal Articles

• T. O’Shea, J. Hoydis [An Introduction to Deep Learning for the Physical Layer], IEEE

Transactions on Cognitive Communications Systems, 2017 (accepted)

• T. O’Shea, T. Roy, T. Clancy [Over the Air Deep Learning Based Radio Signal Clas-

sification] IEEE JSTSP 2017 (accepted)

• T. O’Shea, T. Erpek, T. Clancy [Deep Learning-Based MIMO Communications], (under resubmission)

• T. O’Shea, T. Clancy, T. Roy, T. Erpek, K. Karra [Deep Learning and Data Centric

Approaches to Wireless Signal Processing Systems], (under resubmission)

• C Clancy, J Hecker, E Stuntebeck, T O’Shea [Applications of machine learning to

cognitive radio networks] Wireless Communications, IEEE 14 (4), 47-52

Peer Reviewed Conference Papers

• T. O’Shea, T. Roy, T. Clancy, [Learning Robust General Radio Signal Detection using Computer Vision Methods], Asilomar SSC 2017 (to appear)

• T. O’Shea, T. Erpek, T. Clancy, [Physical Layer Deep Learning of Encodings for the MIMO Fading Channel], Allerton Conference on Communications, Control, and Computing 2017

• T. O’Shea, K. Karra, T. Clancy, [Learning Approximate Neural Estimators for Wireless Channel State Information], IEEE MLSP 2017

• T. O’Shea, T. Roy, T. Erpek, [Spectral Detection and Localization of Radio Events with Learned Convolutional Neural Features], IEEE EUSIPCO 2017

• T. O’Shea, N. West, M. Vondal, T. Clancy [Semi-Supervised Radio Signal Identification], IEEE International Conference on Advanced Communications Technology, 2017 (outstanding paper award)

• N. West, T. O’Shea [Deep Architectures for Modulation Recognition], IEEE DySpan, 2017

• T. O’Shea, S. Hitefield, J. Corgan [End-to-end Traffic Sequence Recognition with Recurrent Neural Networks], IEEE GlobalSip, 2016

• T. O’Shea, K. Karra, T. Clancy, [Learning to Communicate: Channel auto-encoders, Domain Specific Regularizers, and Attention], IEEE International Symposium on Signal Processing and Information Technology 2016

• T. O’Shea, L. Pemula, D. Batra, T. Clancy, [Radio Transformer Networks: Attention Models for Learning to Synchronize in Wireless Systems], IEEE Asilomar Conference on Signals, Systems and Computing 2016

• T. O’Shea, N. West, [Radio Machine Learning Dataset Generation with GNU Radio], GNU Radio Conference 2016

• T. O’Shea, J. Corgan, T. Clancy, [Unsupervised Representation Learning of Structured Radio Communication Signals] International Workshop on Sensing, Processing and Learning for Intelligent Machines 2016

• T. O’Shea, J. Corgan, T. Clancy, [Convolutional Radio Modulation Recognition Networks] Engineering Applications of Neural Networks 2016

• D. CaJacob, N. McCarthy, T. O’Shea, R. McGwier, [Geolocation of RF Emitters with

a Formation-Flying Cluster of Three Microsatellites] Small Satellite Conference 2016


• T. O’Shea, K. Karra [GNU Radio Signal Processing Models for Dynamic Multi-User

Burst Modems] Software Radio Implementation Forum 2015

• S. Hitefield, V. Nguyen, C. Carlson, T. O’Shea, T. Clancy [Demonstrated LLC-layer

attack and defense strategies for wireless communication systems] IEEE Conference

on Communications and Network Security (CNS) 2014

• C. Carlson, V. Nguyen, S. Hitefield, T. O’Shea, T. Clancy [Measuring smart jammer

strategy efficacy over the air] IEEE Conference on Communications and Network

Security (CNS) 2014

• T. O’Shea, T. Rondeau, [A universal GNU radio performance benchmarking suite],

Karlsruhe Workshop on Software Radio 2014

• T. Rondeau, T. O’Shea, [Designing Analysis and Synthesis Filterbanks in GNU Ra-

dio], Karlsruhe Workshop on Software Radios 2014

Pre-Publication Papers

• T. O’Shea, T. Clancy, [Deep Reinforcement Learning Radio Control and Signal Detection with KeRLym, a Gym RL Agent], ArXiv Pre-publication 1605.09221 2016

• T. O’Shea, T. Clancy, R. McGwier [Recurrent Neural Radio Anomaly Detection],

ArXiv Pre-Publication 1611.00301 2016

• T. O’Shea, A. Mondl, T. Clancy [A Modest Proposal for Open Market Risk Assess-

ment to Solve the Cyber-Security Problem] ArXiv Pre-Publication 1604.08675 2016


Invited/Non-Paper Talks

• T. O’Shea, [The Future of Radio: Learning Efficient Signal Processing Systems],

GNU Radio Conference 2017

• T. O’Shea, [Learning Signal Processing and Communications Systems from Data],

IEEE CCAA Workshop Keynote 2017

• T. O’Shea, [Deep Learning on the Radio Physical Layer], JASON 2017 Summer

Study

• T. O’Shea, [TensorFlow Applications in Signal Processing], IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (TensorFlow BoF Hosted by Google) 2016

• T. O’Shea, [Radio Data Analytics with Machine Learning], International Symposium on Advanced Radio Technologies (ISART) 2016

• R. McGwier, T. O’Shea, K. Karra, M. Fowler, [Recent Developments in Artificial Intelligence Applications of Deep Learning for Signal Processing], Virginia Tech Wireless Symposium 2016

• T. O’Shea, [Handing Full Control of the Radio Spectrum over to the Machines], DEFCON Wireless Village 2016

• T. O’Shea, [Radio Machine Learning with FOSS, GNU Radio and TensorFlow] FOSDEM 2016

• T. O’Shea, [Rapid GNU Radio GPU Algorithm Prototyping from Python (gr-theano)], FOSDEM 2015

• T. O’Shea, [GNU Radio Tools for Radio Wrangling and Spectrum Domination], DE-

FCON 23 Wireless Village 2015

• T. O’Shea, [Tutorial: Exploring Data], GNU Radio Conference 2015

Bibliography

[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep

convolutional neural networks,” in Advances in neural information processing systems,

2012, pp. 1097–1105.

[2] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalch-

brenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw

audio,” arXiv preprint arXiv:1609.03499, 2016.

[3] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,

“Dropout: a simple way to prevent neural networks from overfitting.” Journal of

Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”

in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016,

pp. 770–778.

[5] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in

Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.



[6] C. Moore, “Data processing in exascale-class computer systems,” in The Salishan

Conference on High Speed Computing, 2011.

[7] J.-H. Huang, “Keynote and volta series product announcement,” in GPU Technology

Conference, 2017.

[8] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. V. Le, and A. Kurakin,

“Large-scale evolution of image classifiers,” CoRR, vol. abs/1703.01041, 2017.

[Online]. Available: http://arxiv.org/abs/1703.01041

[9] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional net-

works,” in European conference on computer vision. Springer, 2014, pp. 818–833.

[10] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional net-

works: Visualising image classification models and saliency maps,” arXiv preprint

arXiv:1312.6034, 2013.

[11] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” See https://arxiv.org/abs/1610.02391v3, 2016.

[12] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural

networks via information,” CoRR, vol. abs/1703.00810, 2017. [Online]. Available:

http://arxiv.org/abs/1703.00810




[13] “Lte phy lab, e-utra phy golden reference model,” https://www.is-wireless.com/

5g-toolset-old/lte-phy-lab-old/, (Accessed on 10/01/2017).

[14] F. Chollet, “Building autoencoders in keras,” https://blog.keras.io/building-autoencoders-in-keras.html, (Accessed on 10/01/2017).

[15] O. A. Dobre, A. Abdi, Y. Bar-Ness, and W. Su, “Survey of automatic modulation

classification techniques: classical approaches and new trends,” IET communica-

tions, vol. 1, no. 2, pp. 137–156, 2007.

[16] C. Olah, “Understanding lstm networks,” Online Article

http://colah.github.io/posts/2015-08-Understanding-LSTMs/, 2015.

[17] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint

arXiv:1612.08242, 2016.

[18] T. J. O’Shea, T. Roy, and T. C. Clancy, “Learning robust general radio signal detection using computer vision methods,” in 2017 51st Asilomar Conference on Signals, Systems and Computers, Nov 2017.

[19] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.

[20] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical

Journal, vol. 27, no. 3, pp. 379–423, Jul. 1948.

[21] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near shannon limit error-correcting

coding and decoding: Turbo-codes. 1,” in Communications, 1993. ICC’93 Geneva.



Technical Program, Conference Record, IEEE International Conference on, vol. 2. IEEE,

1993, pp. 1064–1070.

[22] R. M. Pyndiah, “Near-optimum decoding of product codes: Block turbo codes,”

IEEE Transactions on communications, vol. 46, no. 8, pp. 1003–1010, 1998.

[23] R. Gallager, “Low-density parity-check codes,” IRE Transactions on information the-

ory, vol. 8, no. 1, pp. 21–28, 1962.

[24] R. v. Nee and R. Prasad, OFDM for wireless multimedia communications. Artech

House, Inc., 2000.

[25] P. Patel and J. Holtzman, “Analysis of a simple successive interference cancella-

tion scheme in a ds/cdma system,” IEEE journal on selected areas in communications,

vol. 12, no. 5, pp. 796–807, 1994.

[26] H. Wymeersch, Iterative receiver design. Cambridge University Press Cambridge,

2007, vol. 234.

[27] D. J. Jakubisin, R. M. Buehrer, and C. R. da Silva, “Bp, mf, and ep for joint channel

estimation and detection of mimo-ofdm signals,” in Global Communications Confer-

ence (GLOBECOM), 2016 IEEE. IEEE, 2016, pp. 1–6.

[28] M. J. Demongeot, M. J. Mazoyer, M. P. Peretto, and M. D. Whitley, “Neural network

synthesis using cellular encoding and the genetic algorithm.” 1994.


[29] J. Branke, “Evolutionary algorithms for neural network design and training,” in Proceedings of the First Nordic Workshop on Genetic Algorithms and its Applications. Citeseer, 1995.

[30] J. Bruck and M. Blaum, “Neural networks, error-correcting codes, and polynomials

over the binary n-cube,” IEEE Transactions on information theory, vol. 35, no. 5, pp.

976–987, 1989.

[31] F. Jondral, “Automatic classification of high frequency signals,” Signal Processing,

vol. 9, no. 3, pp. 177–190, 1985.

[32] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector

machines,” IEEE Intelligent Systems and their applications, vol. 13, no. 4, pp. 18–28,

1998.

[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-

scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009.

CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.

[34] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Interna-

tional journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.

[35] J. Sanchez and F. Perronnin, “High-dimensional signature compression for large-

scale image classification,” in Computer Vision and Pattern Recognition (CVPR), 2011

IEEE Conference on. IEEE, 2011, pp. 1665–1672.


[36] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the

speech front-end with raw waveform cldnns,” in Sixteenth Annual Conference of the

International Speech Communication Association, 2015.

[37] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Iden-

tifying and attacking the saddle point problem in high-dimensional non-convex

optimization,” in Advances in neural information processing systems, 2014, pp. 2933–

2941.

[38] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural

networks with pruning, trained quantization and huffman coding,” arXiv preprint

arXiv:1510.00149, 2015.

[39] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional

neural networks for resource efficient inference,” 2016.

[40] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on com-

puter vision, 2015, pp. 1440–1448.

[41] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits

in natural images with unsupervised feature learning,” in NIPS workshop on deep

learning and unsupervised feature learning, vol. 2011, no. 2, 2011, p. 5.

[42] W. Namgoong and T. H. Meng, “Direct-conversion rf receiver design,” IEEE Trans-

actions on Communications, vol. 49, no. 3, pp. 518–529, 2001.


[43] R. AD9361, “Agile transceiver, data sheet, analog devices,” Inc, vol. 2, p. 014, 2013.

[44] H. Nyquist, “Certain topics in telegraph transmission theory,” Transactions of the

American Institute of Electrical Engineers, vol. 47, no. 2, pp. 617–644, 1928.

[45] M. Ettus, “Universal software radio peripheral,” 2009.

[46] T. O’Shea, “Gnu radio channel simulation,” in GNU Radio Conference 2013, 2013.

[47] D. Middleton, An introduction to statistical communication theory. IEEE Press, Piscataway, NJ, 1996.

[48] J. Mitola and G. Q. Maguire, “Cognitive radio: making software radios more per-

sonal,” IEEE personal communications, vol. 6, no. 4, pp. 13–18, 1999.

[49] T. W. Rondeau, “Application of artificial intelligence to wireless communications,”

Ph.D. dissertation, Virginia Polytechnic Institute and State University, 2007.

[50] P. J. Kolodzy, “Dynamic spectrum policies: promises and challenges,” CommLaw

Conspectus, vol. 12, p. 147, 2004.

[51] T. C. Clancy, “Dynamic spectrum access in cognitive radio networks,” Ph.D. disser-

tation, 2006.

[52] W. Gardner, W. Brown, and C.-K. Chen, “Spectral correlation of modulated signals:

Part ii–digital modulation,” IEEE Transactions on Communications, vol. 35, no. 6, pp.

595–601, 1987.


[53] S. Geirhofer, L. Tong, and B. M. Sadler, “Cognitive radios for dynamic spectrum

access-dynamic spectrum access in the time domain: Modeling and exploiting

white space,” IEEE Communications Magazine, vol. 45, no. 5, 2007.

[54] Z. Ji and K. R. Liu, “Cognitive radios for dynamic spectrum access-dynamic

spectrum sharing: A game theoretical overview,” IEEE Communications Magazine,

vol. 45, no. 5, 2007.

[55] A. Amanna and J. H. Reed, “Survey of cognitive radio architectures,” in IEEE South-

eastCon 2010 (SoutheastCon), Proceedings of the. IEEE, 2010, pp. 292–297.

[56] E. Stuntebeck, T. O’Shea, J. Hecker, and T. Clancy, “Architecture for an open-source cognitive radio,” in Proceedings of the SDR forum technical conference, 2006.

[57] P. J. Werbos, “Applications of advances in nonlinear sensitivity analysis,” in System

modeling and optimization. Springer, 1982, pp. 762–770.

[58] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural

networks, vol. 12, no. 1, pp. 145–151, 1999.

[59] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization

and momentum in deep learning,” in International conference on machine learning,

2013, pp. 1139–1147.

[60] A. Nemirovskii, D. B. Yudin, and E. R. Dawson, “Problem complexity and method

efficiency in optimization,” 1983.


[61] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running

average of its recent magnitude,” COURSERA: Neural Networks for Machine Learn-

ing, vol. 4, no. 2, 2012.

[62] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint

arXiv:1412.6980, 2014.

[63] J. Zhang, I. Mitliagkas, and C. Re, “Yellowfin and the art of momentum tuning,”

arXiv preprint arXiv:1706.03471, 2017.

[64] C. Xu, T. Qin, G. Wang, and T.-Y. Liu, “Reinforcement learning for learning rate

control,” arXiv preprint arXiv:1705.11159, 2017.

[65] T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman, “Random synaptic

feedback weights support error backpropagation for deep learning,” Nature com-

munications, vol. 7, 2016.

[66] B. Scellier and Y. Bengio, “Equilibrium propagation: Bridging the gap between

energy-based models and backpropagation,” Frontiers in computational neuroscience,

vol. 11, 2017.

[67] D. P. Kingma, T. Salimans, and M. Welling, “Improving variational inference with

inverse autoregressive flow,” arXiv preprint arXiv:1606.04934, 2016.

[68] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann ma-

chines,” in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 807–814.


[69] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Pro-

ceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics,

2011, pp. 315–323.

[70] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural

network acoustic models,” in Proc. ICML, vol. 30, no. 1, 2013.

[71] J. S. Bridle, “Training stochastic model recognition algorithms as networks can lead

to maximum mutual information estimation of parameters,” in Advances in neural

information processing systems, 1990, pp. 211–217.

[72] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network

learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.

[73] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural

networks,” arXiv preprint arXiv:1706.02515, 2017.

[74] C. Dugas, Y. Bengio, F. Belisle, C. Nadeau, and R. Garcia, “Incorporating second-

order functional knowledge for better option pricing,” in Advances in neural infor-

mation processing systems, 2001, pp. 472–478.

[75] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time

series,” The handbook of brain theory and neural networks, vol. 3361, no. 10, p. 1995,

1995.


[76] A. B. Geva, “Scalenet-multiscale neural-network architecture for time series predic-

tion,” IEEE Transactions on neural networks, vol. 9, no. 6, pp. 1471–1482, 1998.

[77] M. Sundermeyer, R. Schluter, and H. Ney, “Lstm neural networks for language

modeling,” in Thirteenth Annual Conference of the International Speech Communication

Association, 2012.

[78] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual predic-

tion with lstm,” 1999.

[79] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural

networks using dropconnect,” in Proceedings of the 30th international conference on

machine learning (ICML-13), 2013, pp. 1058–1066.

[80] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training

by reducing internal covariate shift,” in International Conference on Machine Learning,

2015, pp. 448–456.

[81] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint

arXiv:1505.00387, 2015.

[82] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics,

vol. 38, no. 8, 1965.

[83] M. Gschwind, “Chip multiprocessing and the cell broadband engine,” in Proceed-

ings of the 3rd conference on Computing frontiers. ACM, 2006, pp. 1–8.


[84] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina,

C.-C. Miao, J. F. Brown III, and A. Agarwal, “On-chip interconnection architecture

of the tile processor,” IEEE micro, vol. 27, no. 5, pp. 15–31, 2007.

[85] N. McCarthy, E. Blossom, N. Goergen, T. O’Shea, and C. Clancy, “High-performance sdr: Gnu radio and the ibm cell broadband engine,” in Virginia Tech Wireless Personal Communications Symposium, 2008.

[86] C. Nvidia, “Compute unified device architecture programming guide,” 2007.

[87] J. Hensley, “Close to the metal,” SIGGRAPH’07, 2007.

[88] A. Munshi, “Opencl: Parallel computing on the gpu and cpu,” SIGGRAPH, Tutorial,

pp. 11–15, 2008.

[89] G. Harrison, A. Sloan, W. Myrick, J. Hecker, and D. Eastin, “Polyphase channeliza-

tion utilizing general-purpose computing on a gpu,” in SDR 2008 technical conference

and product exposition, 2008.

[90] G. F. Zaki, W. Plishker, T. O’Shea, N. McCarthy, C. Clancy, E. Blossom, and S. S. Bhattacharyya, “Integration of dataflow optimization techniques into a software radio design framework,” in Signals, Systems and Computers, 2009 Conference Record of the Forty-Third Asilomar Conference on. IEEE, 2009, pp. 243–247.


[91] M. Piscopo, “Study on implementing opencl in common gnuradio blocks,”

Proceedings of the GNU Radio Conference, vol. 2, no. 1, p. 67, 2017. [Online]. Available:

https://pubs.gnuradio.org/index.php/grcon/article/view/15

[92] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian,

D. Warde-Farley, and Y. Bengio, “Theano: A cpu and gpu math compiler in

python,” in Proc. 9th Python in Science Conf, 2010, pp. 1–7.

[93] E. Jones, T. Oliphant, P. Peterson et al., “SciPy: Open source scientific tools for Python,” 2001–. [Online]. Available: http://www.scipy.org/

[94] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,

A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on

heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.

[95] J. McCarthy, “Recursive functions of symbolic expressions and their computation

by machine, part i,” Communications of the ACM, vol. 3, no. 4, pp. 184–195, 1960.

[96] F. Chollet, “keras,” https://github.com/fchollet/keras, 2015.

[97] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,

and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in

Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp.

675–678.



[98] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation open source

framework for deep learning,” in Proceedings of workshop on machine learning systems

(LearningSys) in the twenty-ninth annual conference on neural information processing sys-

tems (NIPS), vol. 5, 2015.

[99] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlab-like environment

for machine learning,” in BigLearn, NIPS Workshop, 2011.

[100] A. Paszke, S. Gross, and S. Chintala, “Pytorch,” 2017.

[101] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and

Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heteroge-

neous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.

[102] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri et al., “Lasagne:

First release.” Aug. 2015. [Online]. Available: http://dx.doi.org/10.5281/zenodo.27878

[103] D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter

optimization through reversible learning,” in Proceedings of the 32nd International

Conference on Machine Learning, 2015.

[104] A. G. Baydin, R. Cornish, D. M. Rubio, M. Schmidt, and F. Wood, “Online learning

rate adaptation with hypergradient descent,” arXiv preprint arXiv:1703.04782, 2017.


[105] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,”


[106] T. Desell, “Large scale evolution of convolutional neural networks using

volunteer computing,” CoRR, vol. abs/1703.05422, 2017. [Online]. Available:


[107] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny im-

ages,” 2009.

[108] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, “Visualizing higher-layer features

of a deep network,” University of Montreal, vol. 1341, p. 3, 2009.

[109] F. E. Terman et al., “Radio engineering,” 1937.

[110] T. J. O’Shea, K. Karra, and T. C. Clancy, “Learning to communicate: Channel auto-

encoders, domain specific regularizers, and attention,” in 2016 IEEE International

Symposium on Signal Processing and Information Technology (ISSPIT), Dec 2016, pp.

223–228.

[111] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,”

IEEE Transactions on Cognitive Communications and Networking, vol. PP, no. 99, pp.

1–1, 2017.

[112] Y. Li, R. Xu, and F. Liu, “Whiteout: Gaussian adaptive regularization noise in deep

neural networks,” arXiv preprint arXiv:1612.01490, 2016.



[113] L. v. d. Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res.,

vol. 9, no. Nov, pp. 2579–2605, 2008.

[114] F. Liang, C. Shen, and F. Wu, “An iterative bp-cnn architecture for channel decod-

ing,” arXiv preprint arXiv:1707.05697, 2017.

[115] S. Cammerer, T. Gruber, J. Hoydis, and S. t. Brink, “Scaling deep learning-based

decoding of polar codes via partitioning,” arXiv preprint arXiv:1702.06901, 2017.

[116] T. Gruber, S. Cammerer, J. Hoydis, and S. t. Brink, “On deep learning-based channel

decoding,” in 2017 51st Annual Conference on Information Sciences and Systems (CISS),

March 2017, pp. 1–6.

[117] T. J. O’Shea, L. Pemula, D. Batra, and T. C. Clancy, “Radio transformer networks:

Attention models for learning to synchronize in wireless systems,” in 2016 50th

Asilomar Conference on Signals, Systems and Computers, Nov 2016, pp. 662–666.

[118] N. TSGRANGRA, “Evolved universal terrestrial radio access (e-utra); multiplexing

and channel coding,” 3rd Generation Partnership Project (3GPP), vol. TS, vol. 36, 2009.

[119] L. ETSI, “Evolved universal terrestrial radio access (e-utra); physical channels and

modulation,” ETSI TS, vol. 136, no. 211, p. V9.

[120] D. Gesbert, M. Shafi, D.-s. Shiu, P. J. Smith, and A. Naguib, “From theory to practice:

An overview of mimo space-time coded wireless systems,” IEEE Journal on selected

areas in Communications, vol. 21, no. 3, pp. 281–302, 2003.


[121] E. Luther, “5g massive mimo testbed: From theory to reality,” white paper, available

online: https://studylib.net/doc/18730180/5g-massive-mimo-testbed--from-theory-to-reality, 2014.

[122] V. Tarokh, H. Jafarkhani, and A. R. Calderbank, “Space-time block codes from or-

thogonal designs,” IEEE Transactions on Information theory, vol. 45, no. 5, pp. 1456–

1467, 1999.

[123] S. M. Alamouti, “A simple transmit diversity technique for wireless communica-

tions,” IEEE Journal on Selected Areas in Communications, vol. 16, no. 8, pp. 1451–1458,

Oct 1998.

[124] A. J. Paulraj, R. W. Heath Jr, P. K. Sebastian, and D. J. Gesbert, “Spatial multiplexing

in a cellular network,” May 23 2000, US Patent 6,067,290.

[125] S. Dorner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep Learning-Based Com-

munication Over the Air,” ArXiv e-prints, Jul. 2017.

[126] T. Soderstrom and P. Stoica, System identification. Prentice-Hall, Inc., 1988.

[127] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud, “Backpropagation

through the void: Optimizing control variates for black-box gradient estimation,”



[128] T. J. O’Shea, K. Karra, and T. C. Clancy, “Learning approximate neural estimators

for wireless channel state information,” in 2016 IEEE International Workshop on Ma-

chine Learning for Signal Processing (MLSP), Sep 2017.

[129] Y. Wang, K. Shi, and E. Serpedin, “Non-data-aided feedforward carrier frequency

offset estimators for qam constellations: A nonlinear least-squares approach,”

EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 13, p. 856139, 2004.

[130] O. Catoni et al., “Challenging the empirical mean and empirical variance: a deviation

study,” in Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, vol. 48,

no. 4. Institut Henri Poincaré, 2012, pp. 1148–1185.

[131] T. J. O’Shea, J. Corgan, and T. C. Clancy, “Convolutional radio modulation recogni-

tion networks,” in International Conference on Engineering Applications of Neural Net-

works. Springer, 2016, pp. 213–226.

[132] K. S. K. Arumugam, I. A. Kadampot, M. Tahmasbi, S. Shah, M. Bloch, and

S. Pokutta, “Modulation recognition using side information and hybrid learning,”

in Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE International Symposium

on. IEEE, 2017, pp. 1–2.

[133] K. Triantafyllakis, M. Surligas, G. Vardakis, and S. Papadakis, “Phasma: An auto-

matic modulation classification system based on random forest,” in Dynamic Spec-

trum Access Networks (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017,

pp. 1–3.


[134] M. Laghate, S. Chaudhari, and D. Cabric, “Usrp n210 demonstration of wideband

sensing and blind hierarchical modulation classification,” in Dynamic Spectrum Ac-

cess Networks (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017, pp.

1–3.

[135] J. L. Ziegler, R. T. Arn, and W. Chambers, “Modulation recognition with gnu ra-

dio, keras, and hackrf,” in Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE

International Symposium on. IEEE, 2017, pp. 1–3.

[136] K. Karra, S. Kuzdeba, and J. Petersen, “Modulation recognition using hierarchical

deep neural networks,” in Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE

International Symposium on. IEEE, 2017, pp. 1–3.

[137] N. E. West, K. Harwell, and B. McCall, “Dft signal detection and channelization

with a deep neural network modulation classifier,” in Dynamic Spectrum Access Net-

works (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017, pp. 1–3.

[138] C. M. Spooner, A. N. Mody, J. Chuang, and J. Petersen, “Modulation recognition

using second-and higher-order cyclostationarity,” in Dynamic Spectrum Access Net-

works (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017, pp. 1–3.

[139] T. J. O’Shea and N. West, “Radio machine learning dataset generation with gnu

radio,” in Proceedings of the GNU Radio Conference, vol. 1, no. 1, 2016.


[140] C. Weaver, C. Cole, R. Krumland, and M. Miller, “The automatic classification of

modulation types by pattern recognition,” Stanford Electronics Labs, Stanford University,

Calif., Tech. Rep., 1969.

[141] J. Aisbett, “Automatic modulation recognition using time domain parameters,” Sig-

nal Processing, vol. 13, no. 3, pp. 323–328, 1987.

[142] W. A. Gardner and C. M. Spooner, “Cyclic spectral analysis for signal detection and

modulation recognition,” in Military Communications Conference, 1988. MILCOM 88,

Conference record. 21st Century Military Communications-What’s Possible? 1988 IEEE.

IEEE, 1988, pp. 419–424.

[143] ——, “Signal interception: performance advantages of cyclic-feature detectors,”

IEEE Transactions on Communications, vol. 40, no. 1, pp. 149–159, 1992.

[144] C. M. Spooner and W. A. Gardner, “Robust feature detection for signal intercep-

tion,” IEEE transactions on communications, vol. 42, no. 5, pp. 2165–2173, 1994.

[145] A. Abdelmutalab, K. Assaleh, and M. El-Tarhuni, “Automatic modulation classi-

fication based on high order cumulants and hierarchical polynomial classifiers,”

Physical Communication, vol. 21, pp. 10–18, 2016.

[146] A. K. Nandi and E. E. Azzouz, “Algorithms for automatic modulation recognition

of communication signals,” IEEE Transactions on communications, vol. 46, no. 4, pp.

431–436, 1998.


[147] A. Fehske, J. Gaeddert, and J. H. Reed, “A new approach to signal classification

using spectral correlation and neural networks,” in New Frontiers in Dynamic Spec-

trum Access Networks, 2005. DySPAN 2005. 2005 First IEEE International Symposium

on. IEEE, 2005, pp. 144–150.

[148] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale

image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[149] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-

del, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in

python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.

[150] D. George and E. Huerta, “Deep neural networks to enable real-time multimessen-

ger astrophysics,” arXiv preprint arXiv:1701.00008, 2016.

[151] S. Cioni, G. Colavolpe, V. Mignone, A. Modenini, A. Morello, M. Ricciulli,

A. Ugolini, and Y. Zanettini, “Transmission parameters optimization and receiver

architectures for dvb-s2x systems,” International Journal of Satellite Communications

and Networking, vol. 34, no. 3, pp. 337–350, 2016.

[152] M. Ettus and M. Braun, “The universal software radio peripheral (usrp) family of

low-cost sdrs,” Opportunistic Spectrum Sharing and White Space Access: The Practical

Reality, pp. 3–23, 2015.

[153] Analog Devices, “AD9361 RF agile transceiver,” url:

http://www.analog.com/static/imported-files/data_sheets/ad9361.pdf (visited on 09/14/08).


[154] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings

of the 22nd acm sigkdd international conference on knowledge discovery and data mining.

ACM, 2016, pp. 785–794.

[155] N. E. West and T. O’Shea, “Deep architectures for modulation recognition,” in Dy-

namic Spectrum Access Networks (DySPAN), 2017 IEEE International Symposium on.

IEEE, 2017, pp. 1–6.

[156] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing

human-level performance on imagenet classification,” in Proceedings of the IEEE in-

ternational conference on computer vision, 2015, pp. 1026–1034.

[157] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-

houcke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the

IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

[158] T. J. O’Shea, S. Hitefield, and J. Corgan, “End-to-end radio traffic sequence recogni-

tion with recurrent neural networks,” in 2016 IEEE Global Conference on Signal and

Information Processing (GlobalSIP), Dec 2016, pp. 277–281.

[159] R.-P. Weinmann, “Baseband attacks: Remote exploitation of memory corruptions in

cellular protocol stacks.” in WOOT, 2012, pp. 12–21.

[160] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “Lstm:

A search space odyssey,” IEEE transactions on neural networks and learning systems,

2017.


[161] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation,

vol. 9, no. 8, pp. 1735–1780, 1997.

[162] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recur-

rent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.

[163] J. Bradbury, S. Merity, C. Xiong, and R. Socher, “Quasi-recurrent neural networks,”


[164] A. Orebaugh, G. Ramirez, and J. Beale, Wireshark & Ethereal network protocol analyzer

toolkit. Syngress, 2006.

[165] A. Karpathy, “The unreasonable effectiveness of recurrent neural networks,” Online

Article http://karpathy.github.io/2015/05/21/rnn-effectiveness/, 2015.

[166] H. Kim and K. G. Shin, “In-band spectrum sensing in cognitive radio networks:

energy detection or feature detection?” in Proceedings of the 14th ACM international

conference on Mobile computing and networking. ACM, 2008, pp. 14–25.

[167] R. Ewerth, M. Springstein, L. A. Phan-Vogtmann, and J. Schutze, “Are machines

better than humans in image tagging? A user study adds to the puzzle,” in European

Conference on Information Retrieval. Springer, 2017, pp. 186–198.

[168] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, “Deep learning for

identifying metastatic breast cancer,” arXiv preprint arXiv:1606.05718, 2016.


[169] S. Sarraf, G. Tofighi et al., “Deepad: Alzheimer’s disease classification via deep

convolutional neural networks using mri and fmri,” bioRxiv, p. 070441, 2016.

[170] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional net-

works for accurate object detection and segmentation,” IEEE transactions on pattern

analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2016.

[171] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified,

real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision

and Pattern Recognition, 2016, pp. 779–788.

[172] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, SSD:

Single Shot MultiBox Detector. Cham: Springer International Publishing, 2016, pp.

21–37. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46448-0_2

[173] E. Blossom, “GNU radio: tools for exploring the radio frequency spectrum,” Linux

journal, vol. 2004, no. 122, p. 4, 2004.

[174] B. Sklar, Digital communications. Prentice Hall NJ, 2001, vol. 2.

[175] T. J. O’Shea, N. West, M. Vondal, and T. C. Clancy, “Semi-supervised radio signal

identification,” in Advanced Communication Technology (ICACT), 2017 19th Interna-

tional Conference on. IEEE, 2017, pp. 33–38.

[176] O. Chapelle and A. Zien, “Semi-supervised classification by low density separa-

tion.” in AISTATS, 2005, pp. 57–64.


[177] O. Chapelle, B. Schölkopf, and A. Zien, “Semi-supervised learning (chapelle, o. et

al., eds.; 2006)[book reviews],” IEEE Transactions on Neural Networks, vol. 20, no. 3,

pp. 542–542, 2009.

[178] X. Zhu and A. B. Goldberg, “Introduction to semi-supervised learning,” Synthesis

lectures on artificial intelligence and machine learning, vol. 3, no. 1, pp. 1–130, 2009.

[179] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics

and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.

[180] A. Hyvärinen, J. Karhunen, and E. Oja, Independent component analysis. John Wiley

& Sons, 2004, vol. 46.

[181] B. Schölkopf, A. Smola, and K.-R. Müller, “Kernel principal component analysis,”

in International Conference on Artificial Neural Networks. Springer, 1997, pp. 583–588.

[182] L. Theis, W. Shi, A. Cunningham, and F. Huszár, “Lossy image compression with

compressive autoencoders,” arXiv preprint arXiv:1703.00395, 2017.

[183] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell,

“Full resolution image compression with recurrent neural networks,” arXiv preprint

arXiv:1608.05148, 2016.

[184] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures

for scalable image recognition,” CoRR, vol. abs/1707.07012, 2017. [Online].

Available: http://arxiv.org/abs/1707.07012



[185] T. Bäck and H.-P. Schwefel, “An overview of evolutionary algorithms for parameter

optimization,” Evolutionary computation, vol. 1, no. 1, pp. 1–23, 1993.

[186] G. Venter and J. Sobieszczanski-Sobieski, “Particle swarm optimization,” AIAA

journal, vol. 41, no. 8, pp. 1583–1589, 2003.

[187] A. Goldbloom, “Data prediction competitions–far more than just a bit of fun,” in

Data Mining Workshops (ICDMW), 2010 IEEE International Conference on. IEEE, 2010,

pp. 1385–1386.
