Learning from Data in Radio Algorithm Design
Timothy James O’Shea
Dissertation submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Electrical Engineering
T. Charles Clancy
Robert W. McGwier
Narendran Ramakrishnan
Sanjay Raman
Jeffrey Reed
Oct 26th, 2017
Arlington, Virginia
Keywords: deep learning, radio, physical layer, software radio, machine learning, neural
networks, sensing, communications system design, modulation, coding
Copyright 2017, Timothy James O’Shea
ABSTRACT
Algorithm design methods for radio communications systems are poised to undergo a massive disruption over the next several years. Today, such algorithms are typically designed manually using compact analytic problem models. However, they are shifting increasingly to machine learning based methods using approximate models with high degrees of freedom, jointly optimized over multiple subsystems, and using real-world data to drive design which may have no simple compact probabilistic analytic form.
Over the past five years, this change has already begun occurring at a rapid pace in several fields. Computer vision tasks led deep learning, demonstrating that low level features and entire end-to-end systems could be learned directly from complex imagery datasets, when a powerful collection of optimization methods, regularization methods, architecture strategies, and efficient implementations were used to train large models with high degrees of freedom.
Within this work, we demonstrate that this same class of end-to-end deep neural network based learning can be adapted effectively for physical layer radio systems in order to optimize for sensing, estimation, and waveform synthesis systems to achieve state of the art levels of performance in numerous applications.
First, we discuss the background and fundamental tools used, then discuss effective strategies and approaches to model design and optimization. Finally, we explore a series of applications across estimation, sensing, and waveform synthesis where we apply this approach to reformulate classical problems and illustrate the value and impact this approach can have on several key radio algorithm design problems.
GENERAL AUDIENCE ABSTRACT
Radio communications and sensing systems are used pervasively in everyday life in the modern world to connect phones, computers, smart devices, industrial devices, internet services, space systems, emergency and military users, radar systems, interference monitoring systems, defense electronic systems, and others. Optimizing these systems to function together reliably and efficiently in an ever more complex world is becoming increasingly hard and impractical.
Our work introduces a new and radically different method for the design of radio systems by casting them in a new way as artificial intelligence problems relying on the field of machine learning called deep learning to find and optimize their design. We detail and demonstrate the first such deep learning based communications and sensing systems operating on raw radio signals and quantify their performance when compared to existing methods, showing them to be competitive with and in some cases significantly better performing than state of the art systems today.
These ideas, and the evidence of their viability, are central to the emerging field of machine learning communications systems, and will help to make tomorrow’s wireless systems faster, cheaper, more reliable, more adaptive, more efficient, and lower power than currently possible. In a world of ever increasing complexity and connectedness, this new approach to wireless system design from data using machine learning offers a powerful new strategy to improve systems by directly leveraging the complexity in real world data and experience to find efficiencies where current day approaches and insufficient simplified models and design tools cannot.
Acknowledgments
Thank you to all my current and former colleagues at Virginia Tech, NC State, Bell Labs,
the US Government, the GNU Radio Community and industry who supported, critiqued,
mentored, collaborated, co-authored and discussed countless ideas surrounding software
radio, cognitive radio, and deep learning, especially my advisor Charles Clancy, who has
been a constant source of support and inspiration, and has provided me with significant
freedom to explore new and disruptive ideas.
I am also very grateful to the individuals and organizations who have supported me
and my work throughout my studies including VT, DeepSig, DARPA, NSF, DOD, LM,
Hawkeye360, Federated Wireless and others who made much of this possible.
Dedication
This work is dedicated to my family, friends, colleagues, mentors, sponsors and research
inspirations, all of whom have supported me and contributed to this work in countless
immeasurable ways for which I am extremely grateful.
More abstractly, this work is dedicated to engineering as a creative discipline. While
many engineering fields have become complex and tedious, end-to-end learning based
approaches to design offer to relieve some of the tedium and slow progress surrounding
the field today.
It is my sincere hope that the future of engineering will become more of a creative outlet
for experimentalists, contrarians, pragmatists and makers. That the expansion of machine
learning will empower all people to create and to view engineering in a positive, fun, and
creative light and artform, accessible to all rather than as the obscure, slow moving, and
specialized field that it can sometimes seem today.
Contents
1 Introduction
1.1 Chasing Optimality in Communication System Design
1.2 Neural Networks in Radio System Design
1.3 Implications, Trends and Challenges in Deep Learning
1.4 Deep Cognitive Radio Systems
2 Background
2.1 Radio Signal Processing
2.1.1 Digital Communications
2.1.2 Radio Channel Models
2.2 Cognitive Radio
2.2.1 Sensing Techniques
2.2.2 Control Modeling
2.3 Deep Learning Models
2.3.1 Error Feedback and Objectives
2.3.2 Network Model Primitives
2.3.3 Regularization
2.3.4 Architectural Strategies
2.3.5 High Performance Computing
2.3.6 Model Search
2.3.7 Model Introspection
3 Learning to Communicate
3.1 The Channel Autoencoder
3.2 Learning to Synchronize with Attention
3.3 Multi-User Interference Channel
3.4 Learning Multi-Antenna Diversity Channels
3.5 Learning MIMO with CSI Feedback
3.6 System Identification Over the Air
4 Learning to Label the Radio Spectrum
4.1 Learning Estimators from Data
4.2 Learning to Identify Modulation Types
4.2.1 Expert Features for Modulation Recognition (Baseline)
4.2.2 Time series Modulation Classification With CNNs
4.2.3 Deep Residual Network Time-series Modulation Classification
4.3 Learning to Identify Radio Protocols
4.4 Learning to Detect Signals
5 Learning Radio Structure
5.1 Unsupervised Structure Learning
5.2 Unsupervised Class Discovery
5.3 Neural Network Model Discovery and Optimization
6 Conclusion
6.1 Publication List
Bibliography
List of acronyms
ACF auto-correlation function
ADC analog-to-digital converter
AE autoencoder
AI artificial intelligence
AM amplitude modulation
ANN artificial neural network
ARP address resolution protocol
AWGN additive white Gaussian noise
BCE binary cross-entropy
BER bit error rate
BLER block error rate
BPSK binary phase shift keying
CAF cross ambiguity function
CCE categorical cross-entropy
CFO carrier frequency offset
CNN convolutional neural network
CQI channel quality information
CR cognitive radio
CSI channel state information
CUDA Compute Unified Device Architecture
DL deep learning
DAC digital to analog converter
DNN deep neural network
DNS domain name server
DOF degrees of freedom
DSA dynamic spectrum access
DSP digital signal processing
DTree decision tree
EM electromagnetic
FEC forward error correction
FFT fast Fourier transform
FLOPS floating point operations per second
FM frequency modulation
FSK frequency shift keying
FV Fisher Vector
GF galois field
GR GNU Radio
GRU gated recurrent unit
GPGPU general purpose graphic processing unit
GPU graphic processing unit
GMR ground mobile radio
HMM hidden Markov model
HOC higher order cumulants
HOS higher order statistic
HOM higher order moment
I/Q In-phase and Quadrature
ICA independent component analysis
IEEE Institute of Electrical and Electronics Engineers
IID independent and identically distributed
IOU Intersection over union
ISM industrial, scientific, and medical radio
ISI inter-symbol interference
LDPC low density parity check
LO local oscillator
LOS line of sight
LTE long term evolution
LSTM long short-term memory
LTI linear time invariant
MAE mean absolute error
MAP maximum a posteriori
MF matched filter
MFCC Mel-frequency cepstral coefficient
MIMO multiple-input multiple-output
ML machine learning
MLD maximum likelihood
MLE maximum likelihood estimation
MLSP machine learning for signal processing
MMSE minimum mean square error
MNIST Modified National Institute of Standards and Technology
MRSA mean-response scaled initializations
MU multi-user
NNSP neural networks for signal processing
MSE mean squared error
NLP natural language processing
NN neural network
OFDM orthogonal frequency-division multiplexing
OODA observe orient decide act
OTA over-the-air
PAPR peak to average power ratio
PCA principal component analysis
PHY physical layer
PPM parts per million
PPB parts per billion
PSK phase-shift keying
QAM quadrature amplitude modulation
QRNN quasi-recurrent neural network
QoS quality of service
QPSK quadrature phase shift keying
R-CNN region-based convolutional neural network
ReLU rectified linear unit
ResNet residual network
RF radio frequency
RFIC radio frequency integrated circuit
RNN recurrent neural network
ROC receiver operating characteristic
RRC root-raised cosine
RTN radio transformer network
SCF spectral correlation function
SDR software-defined radio
SGD stochastic gradient descent
SELU scaled exponential linear units
SIC successive interference cancellation
SIFT scale-invariant feature transform
SNR signal-to-noise ratio
SRO symbol rate offset
STN spatial transformer network
SoC system-on-chip
STBC space-time block code
SVM support vector machine
t-SNE t-distributed stochastic neighbor embedding
TS time-slotted
USRP universal software radio peripheral
YOLO you only look once
ZF zero forcing
List of Figures
2.1 Direct Conversion Radio Front-End Architecture
2.2 Impulse Response Plots of Varying Delay Spreads
2.3 A single fully connected neuron
2.4 A simple 1D 2-long 2-filter convolutional layer
2.5 A sequence of 2D convolutional layers from AlexNet [1]
2.6 An example dilated convolution structure from WaveNet [2]
2.7 Dropout effect on network connectivity, from [3]
2.8 Example Effect of Dropout on Training and Validation Loss
2.9 A single residual network unit, from [4]
2.10 An exemplary residual network stack, from [4]
2.11 Spatial transformer network structure, from [5]
2.12 Single threading ceiling illustrated, from [6]
2.13 Concurrent GPU vs CPU compute architecture scaling (2017), from [7]
2.14 Evolutionary performance of image classifier search, from [8]
2.15 Layer 1 and 2 filter weights from CNN trained on ImageNet, from [9]
2.16 Filter activation visualization in convolutional neural networks (CNNs), from [9]
2.17 Optimization of input images for feature activation, from [10]
2.18 GradCAM Saliency Maps for Dogs and Cats, from [11]
2.19 Information theoretic visualization of deep learning, from [12]
3.1 Illustration of the many modular algorithms present in a modern wireless physical layer modem such as long term evolution (LTE)
3.2 The Fundamental Communications Learning Problem
3.3 A simple autoencoder for a 2D MNIST image, from [14]
3.4 A Simple Channel Autoencoder
3.5 BLER versus Eb/N0 for autoencoder
3.6 BLER versus Eb/N0 for autoencoder
3.7 Constellations produced by autoencoders using parameters (n, k): (a) (2, 2), (b) (2, 4), (c) (2, 4) with average power constraint, (d) (7, 4) 2-dimensional t-distributed stochastic neighbor embedding (t-SNE) embedding of received symbols.
3.8 Learned QAM Modes for Example Mean Power (EMP)
3.9 Learned QAM Modes for Batch Mean Power (BMP)
3.10 Learned QAM Modes for Batch Mean Amplitude (BMA)
3.11 Learned QAM Modes for Batch Mean Max Power (BMMP)
3.12 Learned 4-Symbol QAM Modes using BMA for 2 bit, 4 bit, and 8 bit
3.13 Spatial Transformer Example on MNIST Digit from [5]
3.14 Radio Transformer Network Architecture
3.15 Autoencoder training loss with and without RTN
3.16 BLER versus Eb/N0 for various communication schemes over a channel with L = 3 Rayleigh fading taps
3.17 The two-user interference channel seen as a combination of two interfering autoencoders that try to reconstruct their respective messages
3.18 Block error rate (BLER) versus Eb/N0 for the two-user interference channel achieved by the autoencoder (AE) and 2^(2k/n)-quadrature amplitude modulation (QAM) time-slotted (TS) for different parameters (n, k)
3.19 Learned constellations for the two-user interference channel with parameters (a) (1, 1), (b) (2, 2), (c) (4, 4), and (d) (4, 8). The constellation points of Transmitter 1 and 2 are represented by red dots and black crosses, respectively.
3.20 Open Loop MIMO Channel Autoencoder Architecture
3.21 Alamouti Coding Scheme for 2x1 Open Loop MIMO
3.22 Error Rate Performance of Learned Diversity Scheme.
3.23 2x1 MIMO AE, Diagonal H
3.24 2x1 MIMO AE, Random H
3.25 Closed Loop MIMO Learning Autoencoder Architecture
3.26 Error Rate Performance of Learned 2x2 Scheme (Perfect CSI).
3.27 Closed Loop MIMO Autoencoder with Quantized Feedback
3.28 Bit Error Rate Performance of Baseline ZF Method
3.29 Bit Error Rate Performance Comparison of MIMO Autoencoder 2x2 Closed-Loop Scheme with Quantized CSI
3.30 Learned 2x2 Scheme 1-bit CSI Random Channels.
3.31 Learned 2x2 Scheme 1-bit CSI All-Ones Channel.
3.32 Learned 2x2 Scheme 2-bit CSI Random Channels.
3.33 Learned 2x2 Scheme 2-bit CSI All-Ones Channel.
3.34 Deployment Configuration for Quantized MIMO Autoencoder
4.1 CFO Expert Estimator Power Spectrum with simulated 2500 Hz offset
4.2 Timing Estimation MAE Comparison
4.3 Mean CFO Estimation Absolute Error for AWGN Channel
4.4 Mean CFO Estimation Absolute Error (Fading σ=0.5)
4.5 Mean CFO Estimation Absolute Error (Fading σ=1)
4.6 Mean CFO Estimation Absolute Error (Fading σ=2)
4.7 Traditional Approach to Modulation Recognition, from [15]
4.8 10 Modulation CNN performance comparison of accuracy vs signal-to-noise ratio (SNR)
4.9 Confusion matrix of the CNN (SNR = 10 dB)
4.10 System for modulation recognition dataset signal generation and synthetic channel impairment modeling
4.11 Over the air capture system diagram
4.12 Picture of over the air lab capture and training system
4.13 Example graphic of high level feature learning based residual network architecture for modulation recognition
4.14 Complex time domain examples of 24 modulations from the dataset at simulated 10 dB Eb/N0 and ℓ = 256
4.15 Complex time domain examples of 24 modulations over the air at high SNR and ℓ = 256
4.16 Complex constellation examples of 24 modulations from the dataset at simulated 10 dB Eb/N0 and ℓ = 256
4.17 Complex time domain examples of 24 modulations from the dataset at simulated 0 dB Eb/N0 and ℓ = 256
4.18 11-Modulation normal dataset performance comparison (N=1M)
4.19 24-Modulation difficult dataset performance comparison (N=240k)
4.20 Residual unit and residual stack architectures
4.21 Resnet performance under various channel impairments (N=240k)
4.22 Baseline performance under channel impairments (N=240k)
4.23 Comparison models under LO impairment
4.24 ResNet performance vs depth (L = number of residual stacks)
4.25 Modrec performance vs modulation type (Resnet on synthetic data with N=1M, σ_clk=0.0001)
4.26 24-modulation confusion matrix for ResNet trained and tested on synthetic dataset with N=1M, additive white Gaussian noise (AWGN), and SNR ≥ 0 dB
4.27 Performance vs training set size (N) with ℓ = 1024
4.28 24-modulation confusion matrix for ResNet trained and tested on synthetic dataset with N=1M and σ_clk = 0.0001
4.29 Performance vs example length in samples (ℓ)
4.30 24-modulation confusion matrix for ResNet trained and tested on OTA examples with SNR ∼ 10 dB
4.31 Resnet transfer learning OTA performance
4.32 24-modulation confusion matrix for ResNet trained on synthetic σ_clk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (prior to fine-tuning)
4.33 24-modulation confusion matrix for ResNet trained on synthetic σ_clk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (after fine-tuning)
4.34 Transfer function of the LSTM unit, from [16]
4.35 Best LSTM256 confusion with RNN length of 512 time-steps
4.36 Detection Algorithm Trade-space Sensitivity vs Specialization
4.37 Computer Vision CNN-based Object Detection Trade Space, from [17]
4.38 Example bounding box detections in computer vision, from [17]
4.39 YOLO style per-grid-cell bounding box regression targets
4.40 Radio bounding box detection examples, from [18]
4.41 Over the air wideband signal bounding box prediction example
5.1 Example Radio Communications Basis Functions
5.2 Convolutional Autoencoder Architecture for Signal Compression
5.3 Convolutional Autoencoder reconstruction of QPSK example 1
5.4 Convolutional Autoencoder reconstruction of QPSK example 2
5.5 AE Encoder Filter Weights
5.6 AE Decoder Filter Weights
5.7 Supervised Embedding Approach
5.8 Unsupervised Embedding Approach
5.9 Supervised Signal Embeddings
5.10 Unsupervised Signal Embeddings
5.11 Compact Model Network Digraph and Hyper-Parameter Search Process
5.12 EvolNN ModRec Net Search Accuracy
5.13 EvolNN MNIST Net Search Accuracy
5.14 EvolNN CFO estimation network search loss
List of Tables
2.1 List of widely used neural network (NN) optimization loss functions
2.2 List of activation functions
3.1 Layout of the autoencoder used in Figs. 3.6 and 3.5. It has (2M + 1)(M + n) + 2M trainable parameters, resulting in 62, 791, and 135,944 parameters for the (2,2), (7,4), and (8,8) autoencoder, respectively.
3.2 Candidate channel autoencoder transmit normalization functions
3.3 Layout of the multi-user autoencoder model
4.1 ANN Architecture Used for CFO Estimation
4.2 ANN Architecture Used for Timing Estimation
4.3 Layout for our 10 modulation CNN modulation classifier
4.4 Random Variable Initialization
4.5 Features Used
4.6 CNN Network Layout
4.7 ResNet Network Layout
4.8 Protocol traffic classes considered for classification
4.9 Recurrent network architecture used for network traffic classification
4.10 Performance measurements for RNN protocol classification for varying sequence lengths
4.11 Table input/output shapes
5.1 Final small MNIST search CNN network
5.2 Final Modrec search CNN network
Chapter 1
Introduction
Algorithms in radio signal processing have advanced drastically over the past hundred
years. Today’s radio physical layer has evolved into a complex collection of highly
specialized research disciplines. Forward error correction (FEC), channel state
information (CSI) estimation, equalization, multi-carrier modulation, multi-antenna
transmission schemes, and numerous other areas have each become mature research fields
in which many people specialize and achieve small incremental improvements within
highly compartmentalized, modular, and specialized subsystems.
Meanwhile, deep learning (DL) [19] has been rapidly disrupting numerous algorithmic
information processing fields by re-thinking problems as end-to-end optimization
problems rather than as collections of highly specialized, hand-tailored subsystem models.
Many problems in wireless communications are ripe for this form of high level rethinking
in the context of end-to-end system optimization, and a new class of optimization tools
offers the possibility to cope with system complexities and degrees of freedom which
were previously intractable for direct complete-system optimization.
Throughout this work, we consider many current models and approaches to wireless
communications, the application of neural networks to radio signal processing, recent
advances in large scale network optimization behind deep learning, and disruptive
applications and ways these techniques can fundamentally change how communications
systems are designed.
We also motivate embracing wireless signal processing problems as data centric
machine learning problems, demonstrating the significant potential of end-to-end
learning approaches which can be used in contrast to more traditional simplified
analytic subsystem model driven approaches. While ultimately some mix of the two is
currently the best solution in many cases, much of this work is intended to provide a
contrarian perspective to the status quo in the field, embracing and comparing
quantitative performance with baselines as much as possible, but also attempting to see
how far we can go in relying on completely learned systems rather than incremental hybrid
approaches.
1.1 Chasing Optimality in Communication System Design
Since the seminal works of Shannon [20] established upper bounds on the capacity and
performance of communications systems (further detailed in Section 2.1.1), much of the
focus of radio communications research has been on trying to achieve this near-optimal
level of performance in real world systems.
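As a brief illustration of the kind of bound referred to here, the sketch below evaluates the standard Shannon capacity of a band-limited AWGN channel, C = B log2(1 + SNR), together with the minimum Eb/N0 needed to operate at a given spectral efficiency. The AWGN channel and the function names are assumptions made for illustration only; they are not taken from this work.

```python
import math


def awgn_capacity_bps(bandwidth_hz: float, snr_db: float) -> float:
    """Shannon capacity C = B * log2(1 + SNR) of a band-limited AWGN channel."""
    snr_linear = 10.0 ** (snr_db / 10.0)
    return bandwidth_hz * math.log2(1.0 + snr_linear)


def min_ebn0_db(spectral_efficiency: float) -> float:
    """Minimum Eb/N0 (dB) for reliable operation at eta bits/s/Hz.

    Follows from setting eta = log2(1 + eta * Eb/N0) and solving for Eb/N0,
    which gives Eb/N0 = (2**eta - 1) / eta.
    """
    eta = spectral_efficiency
    return 10.0 * math.log10((2.0 ** eta - 1.0) / eta)


if __name__ == "__main__":
    # Illustrative values only: a 1 MHz channel at a few SNR points.
    for snr_db in (0.0, 10.0, 20.0):
        capacity = awgn_capacity_bps(1e6, snr_db)
        print(f"SNR {snr_db:5.1f} dB -> capacity {capacity / 1e6:.3f} Mbit/s")
    # As eta approaches zero the bound approaches 10*log10(ln 2) dB.
    for eta in (1e-6, 1.0, 2.0, 4.0):
        print(f"eta {eta:g} bit/s/Hz -> minimum Eb/N0 {min_ebn0_db(eta):.2f} dB")
```

Any practical coding and modulation scheme can be placed against these curves to see how much margin remains to the bound.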
In recent years, coding techniques such as turbo codes [21], turbo product codes [22], and
low density parity check (LDPC) codes [23], together with modulation and transmission
techniques such as orthogonal frequency-division multiplexing (OFDM) [24] and
multiple-input multiple-output (MIMO), have allowed for performance which comes quite
close to this limit. Key enablers for modern FEC codes have been large block sizes with
probabilistic models (such as belief propagation) which iteratively compute the most
likely codewords based on soft log-likelihoods estimated from received symbols.
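To make the notion of a soft log-likelihood concrete, the following minimal sketch computes the per-bit log-likelihood ratio (LLR) for BPSK over a real AWGN channel, which is the kind of soft input an iterative decoder consumes. The bit mapping (0 to +1, 1 to -1), the equal bit priors, and all function names are assumptions chosen for illustration; the decoder itself is not shown.

```python
import random


def bpsk_llr(y: float, noise_var: float) -> float:
    """Closed-form LLR log(P(b=0|y) / P(b=1|y)) = 2*y / sigma^2.

    Assumes bit 0 -> +1, bit 1 -> -1, equal priors, and real AWGN with
    variance noise_var.
    """
    return 2.0 * y / noise_var


def bpsk_llr_from_likelihoods(y: float, noise_var: float) -> float:
    """Same LLR computed directly from the two Gaussian log-likelihoods."""
    log_p0 = -((y - 1.0) ** 2) / (2.0 * noise_var)  # symbol +1 sent
    log_p1 = -((y + 1.0) ** 2) / (2.0 * noise_var)  # symbol -1 sent
    return log_p0 - log_p1


def hard_decision(llr: float) -> int:
    """A negative LLR favours bit 1, a non-negative LLR favours bit 0."""
    return 1 if llr < 0 else 0


if __name__ == "__main__":
    rng = random.Random(0)
    noise_var = 0.5
    bits = [rng.randint(0, 1) for _ in range(8)]
    # Map bits to BPSK symbols and add Gaussian noise.
    received = [(1.0 - 2.0 * b) + rng.gauss(0.0, noise_var ** 0.5) for b in bits]
    for b, y in zip(bits, received):
        llr = bpsk_llr(y, noise_var)
        print(f"bit={b} y={y:+.3f} llr={llr:+.3f} decision={hard_decision(llr)}")
```

The magnitude of each LLR expresses confidence in the bit, which is the information a belief propagation decoder exchanges and refines across iterations.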
Several attempts have been made to extend this maximum likelihood (MLD) block
codeword selection task to probabilistically encompass earlier physical layer tasks such as
equalization, synchronization and interference cancellation. Approaches in this field
include successive interference cancellation (SIC) [25], as well as factor graph/belief
propagation models [26, 27]. Both have been shown to be attractive in terms of sensitivity
and bit error rate performance under certain harsh conditions, but both run into difficulty
in practical use due to computational complexity limitations and the exponential growth in
complexity incurred by increasing the realism and degrees of freedom (DOF) of closed
form analytic channel, emitter, and interferer models.
1.2 Neural Networks in Radio System Design
The use of artificial neural networks in radio signal processing is not a new idea.
Significant interest in this area rose and fell in the 1980s and 1990s. The Institute of
Electrical and Electronics Engineers (IEEE) even established technical committees such as
neural networks for signal processing (NNSP), which initially saw a surge of interest in
applications of learning to signal processing tasks (and was later renamed machine learning
for signal processing (MLSP) when neural networks fell out of favor). Numerous ideas which
I revisit in this work were proposed long ago: neuro-evolution was proposed in 1994
[28, 29], neural network based forward error correction code decoding was proposed in
1989 [30], neural network based modulation recognition was proposed in 1985 [31], and
many other early works exist which first considered the ways in which neural networks
could be applied to difficult regression and classification problems in the radio signal
processing domain.
Unfortunately, during this first surge of interest, the optimization algorithms,
computational tools, regularization tools, data storage capabilities, data gathering
capabilities, and many other requisites for large scale data centric algorithm learning were
not yet available to practitioners. Because of this, many people wrote these ideas off as
intractable, or too complex to be of practical use, and relied instead on more compact
models based on either toy analytic problem representations or max-margin style data
reducing optimization techniques (e.g. support vector machine (SVM) [32]). Artificial neural
network (ANN) methods were generally regarded as failed and uninteresting for quite
some time. Several researchers including, most famously, Hinton, Bengio, LeCun, and
Schmidhuber quietly continued heavy research into NN optimization for many years,
building and maturing the tools which today allow model and network complexity to
scale many orders of magnitude above what was possible at that time.
With the emergence of deep learning in 2012 and the results demonstrating the ability
of such techniques to scale, it was clear that any prior assumptions made in the signal
processing domain with regards to model performance, complexity, feasibility and prac-
ticality needed to be completely re-evaluated in light of modern algorithms and compu-
tational capabilities.
In this work, I hope to provide a significant re-consideration of many of the core func-
tions within radio signal processing algorithm design, re-cast fundamental radio signal
processing tasks in the context of modern DL optimization tools, capabilities, and con-
structs, and compare the efficacy of data-centric design of algorithms to the state of the
art methods used today which rely currently much more heavily on complex manual
system engineering and algorithm design.
1.3 Implications, Trends and Challenges in Deep Learning
Many of the ideas which constitute DL have been around for quite some time. However,
it was not until relatively recently (e.g. AlexNet in 2012 [1]), when many ideas in
network architecture, training, regularization, and high performance implementation were
combined to great effect, that deep learning really gained widespread attention, adoption,
and success. AlexNet was one of the first major efforts and publications which employed
these techniques and provided a large improvement in machine learning (ML) model
performance, in this case on the ImageNet [33] dataset and classification task, reducing
the top-1 classification error rate by around 10 percentage points, from 47.1% to 37.5%.
The key breakthrough here was that it was now possible to train very large
many-free-parameter models using gradient descent on high performance graphics processing
unit (GPU) architectures directly from large datasets, with sufficient regularization,
using end-to-end learning and low level feature learning, and that such models could
outperform previous state of the art systems built on many years of analytic feature
engineering and tuning, such as scale-invariant feature transform (SIFT) [34] and Fisher
Vectors (FVs) [35]. Since then, this trend of low-level feature learning outperforming
hand engineered features has replaced the state of the art in computer vision, and has
shown the same capacity to replace low level features such as Mel-frequency cepstral
coefficients (MFCCs) in time domain voice processing [36] and equivalents in natural
language processing (NLP). This
trend towards low level feature learning from raw data is likely to continue to subsume
many domains’ existing feature extractors and pre-processors into learned equivalents
optimized directly on high level objectives.
Domain specific knowledge is not, however, discarded or made unnecessary in the
frightening way that this statement may seem to imply. For instance in [36], MFCC filters
are replaced with learned features, but the network architecture is set up to allow for
time domain convolutional filter learning of quite similarly structured filter taps which
happen to fit the real distribution of the human voice spectrum slightly better than a pure
dyadic scale. This trend of building NN architectures which leverage prior expert domain
knowledge and combine it with end-to-end learned architectures to find better solutions
is bound to continue, and to yield new state of the art results in many domains.
One of the key breakthroughs in understanding of deep learning and why it really works
was set forth in [37]. Here Dauphin et al. demonstrate that as the number of free
parameters in the model goes up, the probability of getting stuck in a non-global local
minimum goes down, and an optimizer such as stochastic gradient descent (SGD) is more
likely to instead encounter a saddle point, from which optimization can continue rather
than terminate. This result underlies why deep learning works, why it can find good
solutions, and why large/deep networks which are of higher dimension than the minimum
required solution are actually key to the ability to find globally good solutions rapidly. As
a result, the field of compressing or pruning these large networks to a smaller minimal
subset once trained has also become an important research area which has shown very
promising results in reducing computation and network size once a global solution has
been found [38, 39].
While we have already discussed feature learning as a key trend, end-to-end learning
continues to extend the scope of the model which can be trained in an end-to-end fash-
ion. Attention models or saliency are key methods in which end-to-end problems may
have their learning architecture decomposed into sub-tasks which help to deal with very
high dimensional inputs. By focusing attention within a high dimensional space, regional
proposal networks (Fast R-CNN [40]) and spatial transformer networks [5] both demon-
strated key methods in which low complexity front-end networks could direct a small
patch of transformed relevant input into a secondary discriminative network to operate
within the relevant input more effectively and with a canonicalized form with various
permutations removed. This design strategy is critical to high dimensional search
problems like the Google Street View House Number recognition task [41], and is extremely
applicable to high dimensional radio search spaces as we will discuss later.
1.4 Deep Cognitive Radio Systems
In this work, we consider how cognitive radio, which has been a slow moving idealized
dream over the past handful of years, can be truly realized from a ground-up level of
physical layer learning using the new tools for high dimensional model learning which
are now available. By combining the end-to-end and feature learning methodologies
which have been highly successful in the computer vision domain and other domains,
with a ground up approach to radio algorithm learning, the results shown herein demon-
strate that a true breakthrough in cognitive radio is finally possible, where we can learn
sensing, waveform synthesis, and control behaviors for radio signals which are uncon-
strained by rigid pre-processors, problem formulations, or other assumptions which were
previously necessary in order to make the problem tractable under an older learning
regime. For lack of a better expression, we term this combination of deep learning as
an enabler for realizing cognitive radio capabilities, deep cognitive radio. Throughout
this document we hope to better explain and quantitatively demonstrate the potential in
this concrete realization of this powerful union.
Chapter 2
Background
This work spans three distinct disciplines which are rapidly converging. Digital
signal processing (DSP) for radio communications systems provides the core knowledge
surrounding analog to digital conversion, the sampling theorem, dynamic range and signal
to noise ratio management, and algorithmic knowledge. Cognitive radio (CR) builds on
DSP and software radio, applying ideas from artificial intelligence (AI) to help automate
and optimize radio applications for specialized objectives. Deep learning (DL) has
recently grown as a rapidly accelerating field within AI which relies on large datasets,
error feedback, and high level objective functions to guide the formation of very large
parametric models, previously intractable with other AI approaches. Before combining
and extending these three technologies throughout this dissertation, an overview of key
background concepts, models, and approaches is provided within this section. This is
particularly important in the Radio Deep Learning field as most practitioners at this point
come either from the radio signal processing field or the machine learning field, and sel-
dom have a deep experience in both. This is likely to change quickly over the coming
years as the field of radio signal processing adopts these techniques and begins to use
data science centric language more accessible to the machine learning community.
2.1 Radio Signal Processing
Figure 2.1: Direct Conversion Radio Front-End Architecture
Radio signal processing has a rich history which can barely be scratched within the scope
of this background section. We focus primarily on modern digital radio communications
systems and radio sensing systems. In both cases radio front-end hardware typically
employs a single or multi-stage oscillator and mixer to convert signals between a specific
radio frequency and baseband (or low intermediate frequency) for digitization. Filters
are used to reject energy from outside of the desired radio frequency (RF) band both at RF
frequencies (RF band pass filters), and as low-pass image-rejection filters at either DC or
low IF frequencies. Analog-to-digital converters (ADCs) and digital-to-analog converters
(DACs) are used to convert analog baseband energy to and from a discretized, sampled
baseband representation within the digital side of a radio system.
Today, the vast majority of low cost radio systems leverage this form of direct conversion
[42] digital transceiver hardware architecture (shown in figure 2.1) and perform signal
specific signal processing (e.g. detection, modulation or demodulation) on the resulting
sampled complex valued quadrature baseband signal digitally in microprocessors (often
termed baseband processing units or digital signal processors).
In many mobile devices (e.g. phones, computers, tablets) this whole radio front-end hard-
ware architecture may all be combined into a single system-on-chip (SoC). Most com-
monly today, the Analog Devices AD9361 [43] is used to perform most of these steps in
common lab software-defined radio (SDR) hardware used herein, but numerous similar
chips exist from Qualcomm, Samsung, Broadcom, Intel and others.
2.1.1 Digital Communications
Today, the vast majority of radio systems are implemented using digital signal processing,
and of those many of the most ubiquitous which we use every day are digital modula-
tions carrying binary information between computing platforms such as phones, laptops,
tablets, cars, base stations, pagers, spacecraft, airplanes, boats, law enforcement radios,
and virtually every other platform frequented by mankind. These systems share a number
of key properties which must be understood and will apply to machine learning based
radio communications systems as much as they apply to current day systems.
Sampling Theory gives us theoretical bounds on the conversion of information between
analog and digital forms. Nyquist observed [44] in 1928 that to perform undistorted
reconstruction of a radio telegraph signal of a given bandwidth (speed of signaling), one
must sample the signal at a rate of twice the bandwidth in order to be unambiguous
(to avoid aliasing of portions of the signal). This is today known as the Nyquist rate
(or critical sampling frequency) and represents the speed at which any signal must be
sampled in order to avoid distortion due to aliasing: two times the highest frequency of
the underlying signal. While there does exist an area of investigation in digital
communications into compressive sensing which breaks this assumption, virtually all
systems we use today sample at or above the Nyquist rate, and we shall continue to assume
a system sampled at or above Nyquist for the purposes of our investigations herein.
Compressive sensing based machine learning systems which drop this assumption do pose an
interesting prospect, but we do not consider them in our work here. We also generally
do not model the effects of quantization which occur within the sampling process. An
analog signal of peak-to-peak voltage $V$ can be divided into $2^N$ discrete voltage levels
spanning $[-V/2, V/2]$ when converted to an N-bit digital equivalent, with reconstruction
error bounded by $\epsilon \le V/2^{N+1}$. However, for this to be true, we must assume the
signal amplitude is scaled appropriately for the converter's range, and that the dynamic
range of the analog signal plus noise can be sufficiently represented within N bits. Many
modern ADCs employ 14 or 16 bit conversion, including those used in our measurements
(largely universal software radio peripheral (USRP) [45] devices based on the AD9361
[43] chip). These provide sufficient dynamic range for many applications wherein thermal
noise N per sample is greater than quantization noise ε, and we shall make this
assumption within our work as well. Most of the work herein is therefore conducted using
32 bit floating point representations for simplicity. This presents more than enough
dynamic range for our applications, and can be reduced in precision in future work for
many of them.
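
As a rough numerical check of the quantization bound above, the sketch below sweeps an idealized uniform mid-rise quantizer over its full input range and compares the worst reconstruction error to $V/2^{N+1}$. The quantizer, the 12-bit width, and the 2 V peak-to-peak range are illustrative assumptions and do not model any specific ADC used in this work.

```python
def quantize(x, v_pp, n_bits):
    """Map x in [-v_pp/2, v_pp/2] to the center of one of 2**n_bits uniform bins."""
    levels = 2 ** n_bits
    step = v_pp / levels
    # Index of the bin containing x, clamped so the top edge falls in the last bin.
    idx = int((x + v_pp / 2) / step)
    idx = min(max(idx, 0), levels - 1)
    # Reconstruct at the bin center.
    return -v_pp / 2 + (idx + 0.5) * step

V_PP = 2.0     # assumed peak-to-peak input range in volts
N_BITS = 12    # assumed converter width
bound = V_PP / 2 ** (N_BITS + 1)

# Sweep the input range and record the worst reconstruction error.
worst = 0.0
steps = 100_000
for k in range(steps + 1):
    x = -V_PP / 2 + V_PP * k / steps
    worst = max(worst, abs(x - quantize(x, V_PP, N_BITS)))

print(f"bound = {bound:.3e} V, worst observed error = {worst:.3e} V")
```

The bound only holds while the input stays inside the converter range, which is the scaling assumption stated above.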
Information Theory can be used to express an upper bound on channel capacity [20].
This defines the maximum information throughput, in bits per second (or bits per second
per hertz when normalized by bandwidth), which can traverse a wireless channel. Most
commonly this is expressed for a single transmitter, single receiver channel where the
impairment is given by additive white Gaussian noise (AWGN), and signal power is
expressed as a signal to noise ratio (SNR) relating the signal power to the noise power.
This capacity equation is given traditionally by the following:

$$C = W \log_2\left(1 + \frac{P}{N_0 W}\right) \tag{2.1}$$

Here we obtain a maximum capacity $C$ in bits per second based on the transmit signal
power $P$, the noise power spectral density $N_0$, and the bandwidth $W$. This is
considered one of the most important bounds in communications, as it characterizes a
fundamental limit on how much information we can transmit over a given channel with a
specific signal and noise power.
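
To make equation 2.1 concrete, the short sketch below evaluates the bound for one operating point. The bandwidth, noise density, and SNR values are arbitrary assumptions chosen so the arithmetic is easy to follow.

```python
import math

def awgn_capacity_bps(bandwidth_hz, signal_power_w, n0_w_per_hz):
    """Shannon capacity C = W log2(1 + P / (N0 W)) in bits per second."""
    snr = signal_power_w / (n0_w_per_hz * bandwidth_hz)
    return bandwidth_hz * math.log2(1.0 + snr)

W = 1e6            # assumed bandwidth: 1 MHz
N0 = 1e-15         # assumed noise power spectral density in W/Hz
P = 15 * N0 * W    # signal power chosen so that the linear SNR is 15

capacity = awgn_capacity_bps(W, P, N0)
print(f"capacity = {capacity / 1e6:.3f} Mbit/s "
      f"({capacity / W:.3f} bit/s/Hz)")
```

With a linear SNR of 15 the logarithm evaluates to about 4, so this channel supports roughly 4 bits per second per hertz.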
Achieving this bound has driven much of communications research and algorithm
iteration over the past 50 years as we seek systems which operate closer and closer to
this bound. Each specific modulation and coding scheme can also be expressed with
an expected analytic error bound given a similar set of operating conditions (SNR,
bandwidth) whose information capacity is governed by this bound.
The Shannon bound, however, does not in its common form address multi-user capacity
(aggregate bits per second per hertz for all users sharing a common channel) or
realistic wireless channel impairments beyond thermal noise (e.g. fading, distortion, or
other sources of impairment). Numerous more complex formulations of capacity do exist for
more complex modulation and channel models, but no general solution exists for the
multi-user, realistically impaired channel and arbitrary emitter modulation case.
2.1.2 Radio Channel Models
Modeling of radio propagation channels is a highly mature field which has developed
throughout the history of communications systems. Typical channel models allow us to
come up with simplified parametric models which reasonably approximate the effects
seen over the wireless channel. High quality Monte Carlo simulation algorithms also
exist which can produce realistic sample by sample distributions of impairment models
for simulation [46], but these often do not have a compact form.
These channel models may be used to perform analytic optimization, as they have been in
many instances, to simulate the transmission and reception of a wireless signal in a
Monte Carlo sense, or, as discussed later, directly in the development of domain specific
attention models or simulation models, including within end-to-end radio system
optimization processes.
Thermal Noise is a key physical limitation in analog to digital conversion which limits
sensitivity and achievable signal to noise ratios for any given received signal power [47]
and bandwidth. We can model the absolute thermal noise power as $P = kTB$ where $P$ is the
power in watts, $k$ is Boltzmann's constant ($1.38 \times 10^{-23}$ Joules/Kelvin), $T$ is
the temperature at the ADC in Kelvin, and $B$ is the conversion bandwidth (sample rate)
in Hz. Given this fundamental bound on SNR for specific received powers due to device
physics, all radio systems must function within the finite SNR margin governed by this
limit. For simulation and analysis, this is modeled accurately as an additive white
Gaussian noise (AWGN) process where received samples ($r$) are each the sum of some
transmitted signal ($s$) and a noise component ($N$). This can be expressed as
$r = s + N$ where $N_{thermal} \sim \mathcal{N}(0, \sigma_N)$ and $|s|^2/\sigma_N^2$
expresses the SNR. This is typically expressed in dB as $10\log_{10}(|s|^2/\sigma_N^2)$
rather than as a ratio.
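
The following sketch evaluates $P = kTB$ for one operating point and applies the $r = s + N$ model to a constant unit-power signal. The 290 K temperature, 1 MHz bandwidth, $\sigma_N = 0.1$, and the choice to split the noise power equally between the real and imaginary parts are assumptions made for illustration only.

```python
import math
import random

BOLTZMANN_J_PER_K = 1.380649e-23

def thermal_noise_power_w(temp_k, bandwidth_hz):
    """Absolute thermal noise power P = k T B in watts."""
    return BOLTZMANN_J_PER_K * temp_k * bandwidth_hz

def add_awgn(samples, sigma_n, rng):
    """Return r = s + N with total complex noise power sigma_n**2."""
    # Assumption: the noise power is split equally between I and Q.
    per_axis = sigma_n / math.sqrt(2.0)
    return [s + complex(rng.gauss(0.0, per_axis), rng.gauss(0.0, per_axis))
            for s in samples]

def snr_db(signal, sigma_n):
    """SNR in dB, 10 log10(|s|^2 / sigma_N^2), using the mean signal power."""
    power = sum(abs(s) ** 2 for s in signal) / len(signal)
    return 10.0 * math.log10(power / sigma_n ** 2)

noise_w = thermal_noise_power_w(290.0, 1e6)
noise_dbm = 10.0 * math.log10(noise_w / 1e-3)

rng = random.Random(0)
s = [complex(1.0, 0.0)] * 20000
sigma_n = 0.1
r = add_awgn(s, sigma_n, rng)
measured_noise_power = sum(abs(a - b) ** 2 for a, b in zip(r, s)) / len(s)

print(f"kTB noise floor: {noise_dbm:.1f} dBm, configured SNR: {snr_db(s, sigma_n):.1f} dB")
```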
Delay spread occurs during wireless propagation when multiple copies of a transmitted
signal are received at differing delays and phase offsets. This is commonly due to
multi-path fading, or the summation of many different propagation paths, either direct,
reflective or otherwise, arriving and summing together additively at a receive antenna.
For an
impulsive channel, shown in figure 2.2 for σ = 0, all energy arrives at a single time-delay
in the impulse response.
Figure 2.2: Impulse Response Plots of Varying Delay Spreads
For a frequency-selective fading channel (with non-zero delay spread), this energy arrives
at some combination of random time intervals which combine additively at the receive
antenna. This results in an impulse response which contains power at a range of different
delays, with tap amplitudes often following a distribution such as Rayleigh (where there
is no dominant mode or line-of-sight) or Rician (when a large line-of-sight component is
present). Figure 2.2 shows fading channels for σ = 0.5, 1.0, 2.0, which we will refer to
as light, medium and harsh fading conditions later on. Typically this impulse response is
considered stationary for the 'coherence time' of the channel, and this is the assumption
we will be making in many of our experiments later where, for instance, we convolve a
single example of 1024 time-samples with a single random impulse response as a
simplifying assumption. This sort of modeling is used routinely in communications
systems today.
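
A minimal sketch of the stationary-channel assumption described above: one 1024-sample example is convolved with one random impulse response. The tap generator here (a unit direct path plus exponentially decaying complex Gaussian echoes scaled by σ) is a hypothetical stand-in and is not the model used to produce figure 2.2.

```python
import math
import random

def random_impulse_response(num_taps, sigma, rng):
    """Hypothetical tap model: unit direct path plus decaying Gaussian echoes."""
    taps = [complex(1.0, 0.0)]
    for k in range(1, num_taps):
        amp = sigma * math.exp(-k)  # assumed decay profile, illustration only
        taps.append(complex(rng.gauss(0.0, amp), rng.gauss(0.0, amp)))
    # Normalize to unit energy so the channel does not change average power much.
    energy = sum(abs(h) ** 2 for h in taps)
    return [h / math.sqrt(energy) for h in taps]

def convolve(signal, taps):
    """Full linear convolution of two complex sequences."""
    out = [0j] * (len(signal) + len(taps) - 1)
    for i, s in enumerate(signal):
        for k, h in enumerate(taps):
            out[i + k] += s * h
    return out

rng = random.Random(0)
example = [complex(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(1024)]

# sigma = 0 gives an impulsive channel, so the example passes through unchanged.
flat = convolve(example, random_impulse_response(8, 0.0, rng))[:1024]
# sigma = 1.0 stands in for a "medium" fading draw held fixed over the example.
faded = convolve(example, random_impulse_response(8, 1.0, rng))[:1024]
```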
Timing Offsets are also present in all wireless systems, where path lengths and
propagation times can change based on radio mobility or changing path lengths due to
reflection, refraction, dispersion etc. This is of course governed by the propagation of
radio-frequency waves at the speed of light ($c \approx 3 \times 10^8$ m/s) over some
distance ($d$), where the time delay ($\tau_0$) is given by $d/c$. This time-delay $\tau_0$
can be treated as a random process for simulation purposes, and is typically estimated in
a radio receiver through the process of synchronization. Most commonly, the use of a
matched filter to some set of reference tones at the beginning of a transmission allows
for time of arrival estimation and the extraction of a received signal from the beginning
of a single transmission.
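
The sketch below illustrates time of arrival estimation by correlating a received buffer against a known reference. The Barker-13 reference sequence, the 3 km path, the 1 MHz sample rate, and the noise level are all assumptions chosen for illustration; the text above refers to reference tones, for which the same correlation structure applies.

```python
import random

SPEED_OF_LIGHT = 3e8  # m/s, rounded as in the text

def correlate_peak(received, reference):
    """Return the lag at which the matched-filter output magnitude is largest."""
    best_lag, best_mag = 0, -1.0
    for lag in range(len(received) - len(reference) + 1):
        acc = 0j
        for k, ref in enumerate(reference):
            acc += received[lag + k] * ref.conjugate()
        if abs(acc) > best_mag:
            best_lag, best_mag = lag, abs(acc)
    return best_lag

# Hypothetical reference: Barker-13, chosen for its low correlation sidelobes.
barker13 = [1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1]
reference = [complex(b, 0.0) for b in barker13]

distance_m = 3000.0      # assumed path length
sample_rate_hz = 1e6     # assumed sample rate
delay_samples = round(distance_m / SPEED_OF_LIGHT * sample_rate_hz)

rng = random.Random(0)
received = [0j] * delay_samples + reference + [0j] * 30
received = [x + complex(rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)) for x in received]

estimated_delay = correlate_peak(received, reference)
print(f"true delay = {delay_samples} samples, estimate = {estimated_delay} samples")
```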
Clock Offsets occur because physically separated radios (e.g. a base station and a
handset) typically have separate free running clocks from which the digital sampling rate
and center-frequency tuning oscillator signals are derived. Free running clock rates can
be treated as a Gaussian random walk process, where they are stable on short
time-intervals and stability decreases over larger time intervals. This is typically
characterized in the hardware specification of a given device in terms of expected clock
error in parts per million (PPM) or parts per billion (PPB). For short time-intervals we
can make the assumption of stationarity and assign a fixed estimate for symbol rate offset
(SRO) and carrier frequency offset (CFO) between a transmitter and a receiver, or between
a received signal and a receiver. Motion of transmitters, receivers, or reflectors can
additionally introduce SRO or CFO through the Doppler effect, where the CFO due to motion
($\Delta f_{doppler}$) is given by $\Delta f_{doppler} = \frac{\Delta v}{c} F_c$, where
$\Delta v$ is the difference in the velocity of the transmitter and receiver along the
path of transmission, and $F_c$ is the center frequency of the signal
emitter. Generally the CFO and SRO incident on a wireless receiver are a combination of
offset due to Doppler and offset due to random clock offsets between hardware devices.
In much of our work, focused on short-time examples, we assume coherence of sample
rate and center frequency over a small number of samples in one example, and randomly
draw CFO and SRO from a normal distribution. In the case of CFO, we assume a carrier
frequency offset distribution $\Delta F_c \sim \mathcal{N}(0, \sigma_{CFO})$ and in the
case of SRO, we assume a small resampling ratio near one,
$\Delta R \sim \mathcal{N}(1, \sigma_{SRO})$.
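
As a small worked example of the Doppler relation and the per-example random draws described above, the sketch below computes the shift for one velocity and draws one CFO and SRO pair. The velocity, center frequency, and both standard deviations are placeholder values, not the settings used in our experiments.

```python
import random

SPEED_OF_LIGHT = 3e8  # m/s, rounded

def doppler_shift_hz(delta_v_mps, center_freq_hz):
    """CFO due to motion: (delta_v / c) * Fc."""
    return delta_v_mps / SPEED_OF_LIGHT * center_freq_hz

def draw_offsets(sigma_cfo_hz, sigma_sro, rng):
    """Draw one CFO ~ N(0, sigma_cfo) and one resampling ratio ~ N(1, sigma_sro)."""
    cfo_hz = rng.gauss(0.0, sigma_cfo_hz)
    resample_ratio = rng.gauss(1.0, sigma_sro)
    return cfo_hz, resample_ratio

# A 30 m/s closing velocity at a 2.4 GHz center frequency (assumed values).
shift_hz = doppler_shift_hz(30.0, 2.4e9)

rng = random.Random(0)
cfo_hz, resample_ratio = draw_offsets(500.0, 1e-5, rng)  # hypothetical sigmas
print(f"Doppler shift = {shift_hz:.1f} Hz, drawn CFO = {cfo_hz:.1f} Hz, "
      f"drawn resampling ratio = {resample_ratio:.6f}")
```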
Aggregate Effects are present in any real system, where all of the above uncertainties
about a channel are combined into a single simplified wireless propagation model. We can
express this as a transmitted signal, $s(t)$, perturbed by a number of channel effects over
the air before being received as $r(t)$ at the receiver. We consider the effects of time
delay, time dilation, complex carrier phase rotation, carrier frequency offset, additive
thermal noise, and channel impulse responses being convolved with the signal, all of
which are random time-varying processes. A closed form of the analytic transform between
the time varying signals $s(t)$ and $r(t)$ including each of these effects can be
approximated as shown in the equation below.

$$r(t) = e^{j 2\pi \Delta F_c(t)/F_s} \int_{\tau=\tau_0}^{\tau_0+T} s\big(t - \tau \Delta R(t)\big)\, h(\tau)\, d\tau + n_{thermal}(t) \tag{2.2}$$
Unfortunately, such an expression is quite unwieldy when performing analytic
optimization of estimators in closed form, involving interpolation with a time-varying
delay function, and integration with a time-varying impulse response. To simplify this,
in many cases, the simplified expression below is used.

$$r(t) = s(t - \tau_0) + N_{thermal} \tag{2.3}$$
When considering time and frequency offsets, a slightly more involved expression is also
commonly used.

$$r(t) = e^{j 2\pi \Delta F_c t}\, s(t - \tau_0) + N_{thermal} \tag{2.4}$$
Many estimators focus on the structure of $s(t)$, which takes well formed shapes such as
the following for quadrature phase shift keying (QPSK) when considering perfectly
sampled symbol periods.

$$s(t) = e^{j(2\pi N/4 + \pi/4)}, \quad N \in \{0, 1, 2, 3\} \tag{2.5}$$
Such structured forms of $s(t)$ and simplified AWGN-only propagation models are key
to the clever derivation of estimators today which are specialized to specific forms of
$s(t)$ (e.g. realizing that $s(t)^4$ falls on a single point for all $N$). However, once a
more complex, nonlinear, or otherwise cumbersome analytic channel model is introduced,
and/or many different transmitted signal structures $s(t)$ need to be considered, this
kind of manual analytic trick begins to break down quite rapidly and requires practical
model simplifications to remain tractable.
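
The fourth-power observation above can be checked numerically. The sketch below confirms that $s(t)^4$ lands on a single point for all four QPSK symbols, then applies the noiseless, zero-delay form of equation 2.4 and recovers the frequency offset from the phase advance of $r^4$. The offset value and the estimator itself are a textbook-style illustration under those simplifying assumptions, not a method taken from later chapters.

```python
import cmath
import math
import random

def qpsk_symbol(n):
    """QPSK constellation point exp(j(2*pi*n/4 + pi/4)) for n in {0, 1, 2, 3}."""
    return cmath.exp(1j * (2 * math.pi * n / 4 + math.pi / 4))

# Raising any QPSK symbol to the fourth power removes the data: all map to -1.
fourth_powers = [qpsk_symbol(n) ** 4 for n in range(4)]

rng = random.Random(0)
cfo = 0.01  # assumed offset in cycles per sample, must satisfy |4 * cfo| < 0.5
tx = [qpsk_symbol(rng.randrange(4)) for _ in range(256)]

# Noiseless, zero-delay channel: r[k] = exp(j 2 pi cfo k) * s[k].
rx = [cmath.exp(2j * math.pi * cfo * k) * s for k, s in enumerate(tx)]

# r**4 is a pure tone at 4 * cfo, so its sample-to-sample phase advance is 8 pi cfo.
r4 = [r ** 4 for r in rx]
acc = sum(b * a.conjugate() for a, b in zip(r4, r4[1:]))
cfo_estimate = cmath.phase(acc) / (8 * math.pi)

print(f"true offset = {cfo}, estimated offset = {cfo_estimate:.6f}")
```

The estimate is exact here only because noise, delay spread, and timing error are absent, which is precisely the kind of simplification the paragraph above describes.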
2.2 Cognitive Radio
Cognitive radio [48, 31] is a field which explores the potential ways in which our radio
and mobile devices can behave in much smarter and more efficient ways by leveraging
artificial intelligence to make better, more informed decisions and employ improved con-
trol systems and channel access schemes.
Common examples of this include radios saving power by intelligently searching for
towers based on expected locations and distributions, or historical information,
conducting hand-off more intelligently, managing finite resources (typically power and
spectrum) efficiently, and tuning RF communications systems and front-end parameters
such as gain, filtering, tuning or otherwise in order to improve radio performance [49].
Perhaps the most widely published application of cognitive radio is that of dynamic
spectrum access (DSA) [50, 51], which seeks to increase spectrum usage and efficiency by
allowing for much more dynamic spectrum sharing by secondary users through intelligent
sensing, radio user identification, and non-invasive access strategies designed not
to harm primary spectrum users, but to use the vacancies and holes available in both
frequency and time.
Unfortunately, many of the techniques investigated within the first surge of interest in
cognitive radio and dynamic spectrum access (before the first cognitive radio winter)
attempted to solve very specific sensing or control system problems through a process of
specialized modeling of specific scenario features, processes, and distributions (often
only for one specific primary user or frequency band). This resulted in a number of
potential end solutions, for instance inter-operability optimizations specifically with
TV broadcast signals in TV broadcast bands, or control protocols to maximize fairness
among sharing secondary access nodes. By and large, however, it did not provide a general
solution which allowed us to generalize spectrum sensing, spectrum access, and control
optimization widely across many different scenarios, emitters, and bands. Due to this
narrow applicability, and to slow moving spectrum policy which has refused to allow for
sensing based secondary spectrum access, much of the research in this field attracted
relatively narrow interest and effected relatively minimal change in radio system design
and deployment as a whole.
2.2.1 Sensing Techniques
One of the earliest applications of artificial intelligence in radio systems was that of spec-
trum sensing for emitter identification. This is often a multi-stage expert system which
first performs a form of wideband energy detection, often by identifying concentrated
energy within the power spectral density, localizing and extracting carriers, and then
further characterizing these carriers through an iterative process of carrier estimation and
classification.
There have been numerous attempts to use neural network based approximations,
especially in the latter stage of signal classification (single signal identification on
top of expert feature sets), but many of them have relied on preprocessed feature spaces
as input, such as the spectral correlation function (SCF) [52], to provide a relatively
simple neural network mapping task. The scope of previous expert sensing methods is
quite large, and we explore part of it in more depth in the later sensing section.
2.2.2 Control Modeling
Control system modeling in radio systems is another interesting task which was
addressed within the scope of cognitive radio problems and publications. Control
optimization approaches have been applied to many tasks such as channel frequency
selection in dynamic spectrum access systems and avoidance of malicious users such
as following tone jammers. Two of the most commonly considered approaches are
modeling access opportunities of whitespace as a hidden Markov model (HMM) [53], and
modeling collective control problems as game-theoretic problems [54]. Each
of these models and solutions is, unfortunately, quite highly specialized for the
specific scenario, band, and primary user considered in many of these works.
Works also considered the effects of optimal radio mode and tuning control using a va-
riety of methods [55], including the use of expert planning approaches [56] such as the
popular observe orient decide act (OODA)-loop concept. However, these approaches
are also quite reliant on expert knowledge, modeling, descriptions, and specific
scenario-centric learning. We hope that with the methods presented here we can begin to
devise and build solutions to these classes of problems which generalize much better
without significant expert model construction and manual adaptation.
2.3 Deep Learning Models
The study of deep learning has recently brought together a collection of powerful
optimization tools, network architectural tools, regularization knowledge, high
performance implementation, and other techniques which can be used to learn powerful
models from datasets and simulators. Here, we highlight a number of key ideas and
enablers in greater depth for background. These will be applied in later sections to
several core problems in radio signal processing.
2.3.1 Error Feedback and Objectives
At its core, deep learning as it exists today is focused on the optimization of large para-
metric network models which can accommodate very high degrees of freedom, non-linear
transformations, and deep hierarchical structure.
Table 2.1: List of widely used NN optimization loss functions

Name                               $\mathcal{L}(y, \hat{y})$
Mean Squared Error (MSE)           $\|y - \hat{y}\|^2$
Mean Absolute Error (MAE)          $|y - \hat{y}|$
Binary cross-entropy (BCE)         $-y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$
Categorical cross-entropy (CCE)    $-\frac{1}{N} \sum_{i=0}^{N} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$
Log-cosh                           $\frac{1}{N} \sum_{i=0}^{N} \log\left(\cosh(y_i - \hat{y}_i)\right)$
Huber                              $\frac{1}{N} \sum_{i=0}^{N} \begin{cases} \frac{1}{2}(y_i - \hat{y}_i)^2 & \mathrm{abs}(y_i - \hat{y}_i) < 1 \\ (y_i - \hat{y}_i) & \mathrm{abs}(y_i - \hat{y}_i) \ge 1 \end{cases}$

Today, such networks define one or more loss functions ($\mathcal{L}$) between network
output values ($\hat{y}$) and target network output values ($y$) (where $y_i$ denotes the
i'th output value), and use a form of global error feedback from this loss function in
order to train network parameters (also referred to as learning). Artificial neural
networks (ANNs or just NNs) have long relied on back-propagation [57] of error gradients
to fit the parameters in their networks. In its simplest form, the iterative weight
update process of back-propagation for some function $\hat{y} = f(x, \theta)$ is given by
the following simple weight update equation with a learning rate ($\eta$).

$$\theta_{n+1} = \theta_n - \eta \frac{\partial \mathcal{L}(y, \hat{y})}{\partial \theta} = \theta_n - \eta \frac{\partial \mathcal{L}(y, f(x, \theta))}{\partial \theta} \tag{2.6}$$
This gradient can be derived in an automated fashion using automatic differentiation,
even for very complex functions representing entire networks. This is a key enabler for
the flexibility and rapid speed at which deep learning architectures are able to evolve
today. One SGD weight update evaluation of
$\frac{\partial \mathcal{L}(y, f(x, \theta))}{\partial \theta}$ is often referred to as a
backwards pass, while network evaluation of $\hat{y} = f(x, \theta)$ is often referred to
as a forwards pass.
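
To illustrate the forward pass, backward pass, and weight update of equation 2.6, the sketch below fits a one-parameter linear model with a squared error loss. The model, data, and learning rate are toy assumptions, and the gradient is written out by hand rather than obtained through automatic differentiation.

```python
def forward(x, theta):
    """Forward pass: evaluate the model y_hat = f(x, theta) = theta * x."""
    return theta * x

def backward(x, y, theta):
    """Backward pass: d/d theta of (y - theta * x)^2, derived by hand."""
    return -2.0 * x * (y - forward(x, theta))

# Toy dataset generated from a true parameter of 3.0 (assumed for illustration).
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

theta = 0.0
eta = 0.05  # learning rate
for epoch in range(200):
    # One example per update; the order is fixed here to keep the run repeatable.
    for x, y in data:
        theta = theta - eta * backward(x, y, theta)

print(f"learned theta = {theta:.6f}")
```

In practice the examples would be shuffled or sampled at random each epoch, which is what makes the procedure stochastic.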
This form of iterative weight update through SGD with global error feedback through
back-propagation is used today in virtually all DL model training applications. A wide
variety of loss functions are used for different applications; many of the most commonly
used, including mean squared error (MSE) and categorical cross-entropy (CCE), are shown
in table 2.1. MSE is commonly used for real-valued regression problems, while CCE is
typically used for classification problems. In classification with a CCE loss function,
a so called "one-hot" encoding is typically used, where the output targets ($y_i$)
take the form of a zero vector with a one at the index of the correct class label. In this
case output predictions $\hat{y}_i$ for each class $i$ of $N$ fall on the range (0, 1),
which can be enforced with an output activation function with bounded (0, 1) output range
such as sigmoid or softmax (softmax is typically used). When bounded in this way, these
output predictions are often referred to as pseudo-probabilities, since they are trained
to predict the discrete target probabilities $p(y_i = 1)$ or $p(y_i = 0)$ for each output
index.
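
The one-hot target and bounded softmax output described above can be sketched as follows. The logit values are arbitrary, and the loss is written in its common $-\sum_i y_i \log(\hat{y}_i)$ form, which is an assumption that differs slightly from the averaged expression listed for CCE in the loss function table.

```python
import math

def softmax(logits):
    """Map raw network outputs to pseudo-probabilities in (0, 1) that sum to one."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes):
    """Zero vector with a one at the index of the correct class."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def categorical_cross_entropy(y, y_hat):
    """Common CCE form: -sum_i y_i log(y_hat_i); zero-target terms contribute nothing."""
    return -sum(t * math.log(p) for t, p in zip(y, y_hat) if t > 0.0)

logits = [2.0, 0.5, -1.0]          # hypothetical network outputs for 3 classes
y_hat = softmax(logits)
y = one_hot(0, 3)
loss = categorical_cross_entropy(y, y_hat)
print("pseudo-probabilities:", [round(p, 4) for p in y_hat], "loss:", round(loss, 4))
```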
SGD has improved drastically since the basic formulation shown in equation 2.6.
Momentum [58, 59] is an important enhancement on the simple formulation of SGD shown
above. With momentum, the effective step size in each direction adapts dynamically based
on the stability of the gradient in that direction, to prevent oscillation and to accelerate
descent across large nearly flat regions. The simple form of the gradient update expression
with momentum is given in equation 2.7, where velocity v is now updated iteratively and used to
derive new weights θ.
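$$
\begin{aligned}
v_{n+1} &= \gamma v_n + \alpha \frac{\partial \mathcal{L}(y, \hat{y})}{\partial \theta} \\
\theta_{n+1} &= \theta_n - v_{n+1}
\end{aligned}
\tag{2.7}
$$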
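A minimal sketch of the momentum update in equation 2.7, assuming the velocity recursion uses the previous velocity $v_n$ and that the gradient is taken with respect to the weights θ. The quadratic toy loss and the values of γ and α are illustrative assumptions.

```python
import numpy as np

def momentum_step(theta, v, grad, gamma=0.9, alpha=0.05):
    """One update: v <- gamma * v + alpha * grad, then theta <- theta - v."""
    v_next = gamma * v + alpha * grad
    return theta - v_next, v_next

# Toy loss L(theta) = 0.5 * sum(c * theta**2), whose gradient is c * theta.
# The two directions have very different curvature (1 vs 10).
c = np.array([1.0, 10.0])
theta = np.array([1.0, 1.0])
v = np.zeros_like(theta)
for _ in range(200):
    theta, v = momentum_step(theta, v, c * theta)
```

The velocity accumulates along directions where the gradient sign is stable and partially cancels where it alternates, which is the behaviour the text attributes to momentum.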
This approach was accelerated further using Nesterov’s approach [60], which improves
momentum updates assuming the target loss manifold is a smooth function. Within the
past handful of years, both RMSProp [61] and Adam [62], which incorporate gradient
normalization into their updates, have become widely used. In Adam, which is
used in the vast majority of the work included herein, the update is given in
equation 2.8.
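$$
\begin{aligned}
m_{n+1} &= \beta_1 m_n + (1 - \beta_1) \frac{\partial \mathcal{L}(y, \hat{y})}{\partial \theta} \\
v_{n+1} &= \beta_2 v_n + (1 - \beta_2) \left( \frac{\partial \mathcal{L}(y, \hat{y})}{\partial \theta} \right)^2 \\
\hat{m}_{n+1} &= \frac{m_{n+1}}{1 - \beta_1^{n+1}} \\
\hat{v}_{n+1} &= \frac{v_{n+1}}{1 - \beta_2^{n+1}} \\
\theta_{n+1} &= \theta_n - \frac{\eta \, \hat{m}_{n+1}}{\sqrt{\hat{v}_{n+1}} + \epsilon}
\end{aligned}
\tag{2.8}
$$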
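The Adam update of equation 2.8 can be sketched as a single NumPy function. The defaults for η, β1, β2 and ε below are commonly used values and are assumptions here, as is the toy quadratic loss; the gradient is assumed to be taken with respect to the weights θ.

```python
import numpy as np

def adam_step(theta, m, v, grad, n, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; n is the zero-based index of the step being taken."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate
    m_hat = m / (1 - beta1 ** (n + 1))           # bias correction
    v_hat = v / (1 - beta2 ** (n + 1))
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for n in range(2000):
    grad = 2 * theta                             # gradient of sum(theta**2)
    theta, m, v = adam_step(theta, m, v, grad, n, eta=0.01)
```

Because of the normalization by the square root of the second moment, the very first step moves each weight by approximately η regardless of the gradient magnitude.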
Even more recently, the problem of learning rate control during SGD has been
readdressed in novel ways which provide faster optimization (often at the cost of increased
computational complexity per iteration). These include the use of curvature and gradient
variance in a closed loop system [63], as well as casting the learning rate tuning problem
as a separate reinforcement learning problem [64] (i.e. learning to learn faster).
These methods have shown promising results, but are not in widespread use at this time
and appear to provide relatively incremental performance improvements in our limited
experimentation.
There has been significant discussion lately surrounding whether global error feedback is
really appropriate, optimal or biologically plausible within the human brain. The notion
of a global loss function and global error feedback both seem unlikely in the human mind.
More plausible formulations generally include a more localized form of loss computation
and a more localized and distributed form of error feedback. Numerous ideas on im-
proved optimization are currently under development, and will almost certainly provide
improvements in network training within the coming years. Key explorations in this field
include Feedback Alignment [65], Equilibrium Propagation [66], Inverse Autoregressive
Flow [67], and others. This is a very active area of research, and a challenging field.
Most of the work herein relies on mature global back-propagation using forms of SGD for
network optimization due to their maturity, effectiveness (current state of the art on most
tasks) and the availability of optimized implementations. However, given the promising
nature of emerging basic research into distributed and local-feedback optimization meth-
ods (which attempt to mirror more closely what the human brain is believed to do, i.e.
no single global loss function or global clock synchronization) and the similarity of the
network functions on which they may operate, we expect many of these methods will be
readily applicable to lend further improvements to much of the work shown here.
2.3.2 Network Model Primitives
Neural network architectures have come quite a long way. Since the early use of a very small
number of ’perceptrons’, the formulation of a feed-forward memoryless single neuron
has been relatively straightforward, as given by equation 2.9. Here a set of input values X
of size (1, N) is concatenated with a ones vector (to include a bias term) of size (1, 1) to form X_1, which is
multiplied with a weight matrix W of size ((N+1), M) to produce an output vector H of size
(M, 1).
$$ Y = f(W \times X_1) \tag{2.9} $$
An output value Y is then produced using some activation function f, which may be linear
(e.g. the identity function) or non-linear, such as a sigmoid or
rectified linear unit. Commonly used activation functions, f, are given in table 2.2.
Sigmoid activation functions have a long and rich history in the literature, but today a number
of different activations are used. The simple rectified linear unit (ReLU) activation [68]
has been used increasingly in recent times instead of the sigmoid due to a number of
important properties. It is much cheaper to compute, as is its gradient,
and training typically converges much faster than when using smooth sigmoid or tanh
activations, which suffer more from the vanishing gradient problem [69] when
back-propagating through many layers: gradient contributions to the loss can differ by
orders of magnitude between subsequent layers, making optimization very slow.
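Table 2.2: List of activation functions

Name              Function f(x)                                                                          Range
Linear            $x$                                                                                    $(-\infty, \infty)$
ReLU [68]         $\max(0, x_i)$                                                                         $[0, \infty)$
Leaky ReLU [70]   $\begin{cases} \alpha x, & x < 0 \\ x, & x \ge 0 \end{cases}$                          $(-\infty, \infty)$
TanH              $\tanh(x_i)$                                                                           $(-1, 1)$
ArcTan            $\tan^{-1}(x)$                                                                         $(-\pi/2, \pi/2)$
Sigmoid           $\frac{1}{1 + e^{-x}}$                                                                 $(0, 1)$
SoftMax [71]      $\frac{e^{x_i}}{\sum_{j=0}^{N} e^{x_j}}$                                               $(0, 1)$
Step              $\begin{cases} 0, & x < 0 \\ 1, & x \ge 0 \end{cases}$                                 $\{0, 1\}$
ELU [72]          $\begin{cases} x, & x > 0 \\ \alpha e^{x} - \alpha, & x \le 0 \end{cases}$             $(-\alpha, \infty)$
SELU [73]         $\lambda \begin{cases} x, & x > 0 \\ \alpha e^{x} - \alpha, & x \le 0 \end{cases}$     $(-\lambda\alpha, \infty)$
SoftPlus [74]     $\ln(1 + e^{x})$                                                                       $(0, \infty)$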
Each of these activations is expressed compactly in table 2.2. In the case of tanh, sigmoid,
and softmax, exponentiation operations are used for forward passes, while in ReLU units
a simple piecewise linear transfer is incredibly cheap to compute. In the table, α denotes
some leaky (non-activated) coefficient, while λ denotes a scaling factor; both of these
are considered hyper-parameters (i.e. defined with the network architecture and not
updated during SGD). In each case, x denotes a single output neuron (activation of each
output is independent) except in the case of SoftMax, where each output x_i is scaled by
exponentiated versions of all outputs x_j in the layer.
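Several of the activations in table 2.2 can be written in a line or two of NumPy, as in the sketch below. The default values of α and λ are assumptions: 0.01 is a common leaky coefficient, and the SELU constants are the commonly quoted approximate values rather than anything specified in this work.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x < 0, alpha * x, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * np.exp(x) - alpha)

def selu(x, alpha=1.6733, lam=1.0507):
    return lam * elu(x, alpha)

def softplus(x):
    return np.log1p(np.exp(x))

def softmax(x):
    """Unlike the others, each output depends on all inputs in the layer."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.linspace(-5.0, 5.0, 11)
outputs = {
    "relu": relu(x), "leaky_relu": leaky_relu(x), "tanh": np.tanh(x),
    "arctan": np.arctan(x), "sigmoid": sigmoid(x), "elu": elu(x),
    "selu": selu(x), "softplus": softplus(x), "softmax": softmax(x),
}
```

Evaluating each function over a grid like this is a quick way to confirm the output ranges listed in the table.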
The perceptron description given in equation 2.9 and illustrated in figure 2.3 provides a simple,
highly compact matrix multiplication operation followed by some activation which can
generally be computed concurrently for each element in the matrix. This class of layer
is typically referred to as a fully-connected (or Dense) layer, where the number of
weights is the product of the input and output dimensions. This is the most expressive
layer, but it also contains the highest free-parameter count, making it both flexible and
data-hungry: many examples are needed to fit all the parameter values well.
Figure 2.3: A single fully connected neuron
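A minimal sketch of the fully-connected layer of equation 2.9 follows. It assumes a row-vector convention (input of shape (1, N), output of shape (1, M)) so that the stated weight size of (N+1, M) multiplies cleanly; the function name and the choice of tanh are illustrative.

```python
import numpy as np

def dense_forward(x, W, f=np.tanh):
    """x: (1, N) input row vector; W: (N + 1, M) weights, last row acts as the bias."""
    x1 = np.concatenate([x, np.ones((1, 1))], axis=1)  # X_1: input with a one appended
    return f(x1 @ W)                                   # output of shape (1, M)

rng = np.random.default_rng(0)
N, M = 4, 3
x = rng.normal(size=(1, N))
W = rng.normal(size=(N + 1, M))
y = dense_forward(x, W)
num_params = W.size  # (N + 1) * M free parameters
```

The parameter count grows with the product of input and output sizes, which is why such layers need many examples to fit well.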
Figure 2.4: A simple 1D 2-long 2-filter convolutional layer
One solution for reducing the free-parameter count and introducing invariance properties
which may be desired in certain layers is to use the convolutional layer [75, 1],
which can be realized for 1D, 2D, 3D or higher dimensional input spaces. Here,
the weight matrix W is decomposed into a number of distinct filter channels as shown
in figure 2.4, where each filter has some size smaller than the input dimension, and is
strided across the input vector, typically at some periodic interval. This has two enormous
benefits. First, if the input is a translation invariant domain such as a signal arriving
at random time offsets, or an image occurring at random X,Y translations, this forms
a powerful regularization which learns the same features at all offsets within the input.
Second, the number of free parameters is virtually always drastically reduced versus
the equivalent fully connected layer, reducing the number of examples required to obtain
similar accuracy, since fewer free parameters must be accurately fit.
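The parameter saving described above can be illustrated with a small NumPy sketch of a 1D layer with two filters of length two, in the spirit of figure 2.4. The function name, the explicit loop and the example filters are illustrative assumptions; as in most deep learning libraries, the operation is implemented as a cross-correlation.

```python
import numpy as np

def conv1d(x, filters, stride=1):
    """Valid 1D convolution of x (length L) with filters of shape (F, K)."""
    F, K = filters.shape
    steps = (len(x) - K) // stride + 1
    out = np.empty((F, steps))
    for i in range(steps):
        window = x[i * stride : i * stride + K]
        out[:, i] = filters @ window      # the same weights are reused at every offset
    return out

x = np.arange(8, dtype=float)
filters = np.array([[1.0, -1.0],          # a difference filter
                    [0.5, 0.5]])          # an averaging filter
y = conv1d(x, filters)

conv_params = filters.size                # 4 shared weights
dense_params = len(x) * y.size            # weights in a dense layer with the same output size
```

Here four shared weights replace the 112 that a fully connected layer with the same input and output sizes would need.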
Figure 2.5: A sequence of 2D convolutional layers from AlexNet [1]
Dilated convolutions [76] deserve a special mention within our discussion of radio time-
series as well. Their recent use in neural networks [2] has been conducted in the audio
and voice processing domain where dyadically scaled features of many temporal support
widths contribute key features within both music and natural language. However, this
property of helping (in multiple layer form) to represent exponentially different scalings
of raw features is critical in the radio domain as well, where high sample rates are used,
and features may easily span 10x to 1000x or more in temporal support
width.
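To make the exponential scaling concrete, the sketch below computes the receptive field (temporal support) of a stack of stride-one convolutional layers, comparing dyadic dilations with undilated layers. The helper name and the choice of ten layers with kernel size two are illustrative assumptions.

```python
def receptive_field(kernel_size, dilations):
    """Input samples seen by one output of a stack of stride-1 convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel_size - 1) * d   # each layer widens the support by (K - 1) * dilation
    return rf

dyadic = [2 ** i for i in range(10)]             # dilations 1, 2, 4, ..., 512
rf_dilated = receptive_field(2, dyadic)          # 1024 samples
rf_plain = receptive_field(2, [1] * 10)          # 11 samples
```

With the same number of layers and weights, the dilated stack covers roughly a hundred times more input samples than the undilated one.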
Figure 2.6: An example dilated convolution structure from WaveNet [2]
Each of these constructs presumes a feed-forward model for information flow (i.e. each
layer only depends on preceding layers’ outputs). Recurrent layers relax this assumption
and allow for a ’memory’ connection within a single layer. This is a powerful tool which
has been demonstrated to be highly effective in temporal sequence modeling [77, 78],
partially due to the fact that it can relax the simplifying Markov assumption which is made
in the case of an HMM.
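A minimal sketch of such a memory connection is the simple (Elman-style) recurrent layer below, in which the hidden state from the previous time step feeds back into the layer. The weights are hand-picked for illustration and are assumptions, not taken from this work.

```python
import numpy as np

def rnn_forward(xs, W_in, W_rec, b):
    """h_t = tanh(W_in x_t + W_rec h_{t-1} + b), returned for every time step."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x_t in xs:
        h = np.tanh(W_in @ x_t + W_rec @ h + b)
        states.append(h)
    return np.stack(states)

W_in = np.eye(2)
W_rec = 0.5 * np.eye(2)            # the 'memory' connection
b = np.zeros(2)

impulse = np.zeros((4, 2))
impulse[0] = [1.0, 0.0]            # a single input at the first time step
states = rnn_forward(impulse, W_in, W_rec, b)
silent = rnn_forward(np.zeros((4, 2)), W_in, W_rec, b)
```

The last state in `states` is still non-zero even though the only input arrived three steps earlier, which a memoryless feed-forward layer could not reproduce.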
2.3.3 Regularization
One of the core problems with stochastic gradient descent based methods (and many
other machine learning methods) is the propensity of the training process to overfit the
model to training set data. To avoid overfitting, or aligning the model solution more
closely with the specific training examples than the general solution to the problem they
represent, a number of solutions have been proposed and used over time. Simple forms
of regularization may focus on the L1 or L2 norm of either activations or weight vectors,
attempting to push unused or rarely used conditions to zero, or reduce high magnitude
overfitting to specific cases. Ridge regression attempts to strike an optimal balance be-
tween these factors.
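As a sketch of the simple norm-based regularizers mentioned above, L1 and L2 penalties on a weight vector can be added to a data loss as follows. The function names and the penalty weights `lam1` and `lam2` are illustrative assumptions.

```python
import numpy as np

def l1_penalty(w, lam):
    """Encourages rarely used weights to go exactly to zero."""
    return lam * np.sum(np.abs(w))

def l2_penalty(w, lam):
    """Discourages individual weights from growing to high magnitude."""
    return lam * np.sum(w ** 2)

def regularized_loss(data_loss, w, lam1=0.0, lam2=0.0):
    """Data loss plus a weighted combination of the two penalties."""
    return data_loss + l1_penalty(w, lam1) + l2_penalty(w, lam2)

w = np.array([1.0, -2.0, 0.0])
total = regularized_loss(data_loss=0.25, w=w, lam1=0.1, lam2=0.1)
```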
Dropout introduces an entirely new form of regularization [1, 3], which embraces the
combinatorially large number of neurons and paths through a network, and
probabilistically zeros neuron outputs during the training process, effectively removing connections
as shown in figure 2.7.
Figure 2.7: Dropout effect on network connectivity, from [3]
By doing this, networks cannot overly rely on any one specific neuron or network path for
a single use case or example. Training can instead be seen as randomly training an
exponentially large ensemble of all possible sub-graphs of neurons through the network,
at an enormous computational saving over actually training that many separate independent
graphs. The effect of Dropout is quite stark, as shown in the example in figure 2.8. Here,
when training on the canonical Modified National Institute of Standards and Technology
(MNIST) dataset without dropout, training loss goes near zero quickly, but the model overfits, with
validation loss plateauing at a high level. With dropout, however, training and validation
loss track much more closely, and overfitting does not occur until much later and to
a much lesser degree, giving much better generalization while training against a very
small (500 example) subset of MNIST.
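Dropout itself takes only a few lines. The sketch below uses the common "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at inference time; this scaling choice, the function name and the rate of 0.5 are assumptions rather than details from this work.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Zero each activation with probability `rate` during training only."""
    if not training or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate    # random mask, redrawn on every call
    return activations * keep / (1.0 - rate)        # rescale to preserve the expected value

rng = np.random.default_rng(0)
a = np.ones(1000)
train_out = dropout(a, rate=0.5, rng=rng)
eval_out = dropout(a, rate=0.5, rng=rng, training=False)
```

Each call draws a different mask, so every mini-batch effectively trains a different sub-graph of the network.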
DropConnect [79] was more recently introduced, employing the same variety of proba-
bilistic dropout on network paths during training with slightly improved performance
vs Dropout, but dropping out fine grained neuron inputs rather than outputs. Unfortu-
nately DropConnect requires an increase in computational complexity when computing
ensemble outputs, and is not nearly as simple to implement as Dropout. Its adoption and
widespread usage has not been as notable as Dropout at this time.
Figure 2.8: Example Effect of Dropout on Training and Validation Loss
More recently, batch normalization [80] has begun to be adopted widely as another form
of regularization (especially for convolutional layers) and functions surprisingly well.
In batch normalization, inter-layer activations are normalized to
zero mean and unit variance over each mini-batch during training, resulting in more
stable covariance properties and providing a surprisingly good regularization property.
Currently this is one of the most widely used regularization methods for state of the art
CNNs. Very recently, an approach has been devised [73] which employs carefully crafted
network weight initializations and scaled exponential linear units (SELUs) in order to
guarantee the same inter-layer activation properties (normalization) without explicitly
having to scale them. This can result in significantly faster convergence and lower
computational complexity in some cases.
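The training-time normalization step can be sketched as below. The learnable scale `gamma` and shift `beta` are part of the standard batch normalization formulation but are not described in the text above, so they are labeled here as assumptions, as are the function name and the `eps` value.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch axis, then scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = 3.0 * rng.normal(size=(64, 4)) + 5.0      # mini-batch of 64 examples, 4 features
out = batch_norm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```

At inference time, implementations typically substitute running averages of the mean and variance for the per-batch statistics; that bookkeeping is omitted here.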
2.3.4 Architectural Strategies
There are a number of high level architecture design strategies which have played im-
portant roles in deep neural network design over the past few years. Beyond basic layer
design, higher level connectivity design is important in shaping the flow of information,
combining features from different regions within larger networks, and achieving the right
structure with a limited number of free parameters. Early attempts at providing paths
through the network to combine low level inputs and features with higher level features
included the use of highway networks [81], which showed improvements in some cases.
More recently, however, residual networks (ResNets) [4] have become widely adopted
within computer vision due to their ability to fit many features of varying scale, leverage
depth effectively, and avoid heavy overfitting to training sets. They are typically used with
batch normalization for regularization. A single ’residual unit’ is shown in figure 2.9.
Figure 2.9: A single residual network unit, from [4]
Many of these units can be stacked into a ’residual stack’, to form a network where fea-
tures may easily pass through many layers of embedding, or may bypass embeddings,
and may fit optimal sets of features which mix both types of features at many layers of
abstraction. This is an important breakthrough in multi-scale learning, and one that gen-
erally represents the state of the art today in computer vision architectures.
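The bypass behaviour of a residual stack can be sketched with dense layers standing in for the convolutional and batch normalization layers used in [4]; that substitution, the function names and the weight scales are simplifying assumptions made for brevity.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_unit(x, W1, W2):
    """y = relu(x + F(x)) with F(x) = relu(x W1) W2; x bypasses F on the skip path."""
    return relu(x + relu(x @ W1) @ W2)

def residual_stack(x, weights):
    """Apply several residual units in sequence."""
    for W1, W2 in weights:
        x = residual_unit(x, W1, W2)
    return x

rng = np.random.default_rng(0)
x = np.abs(rng.normal(size=(1, 4)))
zero_weights = [(np.zeros((4, 4)), np.zeros((4, 4))) for _ in range(3)]
y_identity = residual_stack(x, zero_weights)   # features pass straight through

small = [(0.1 * rng.normal(size=(4, 4)), 0.1 * rng.normal(size=(4, 4))) for _ in range(3)]
y = residual_stack(x, small)                   # a mix of bypassed and embedded features
```

With all weights at zero the stack reduces to the identity on non-negative inputs, which illustrates how features can bypass any number of layers.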
Figure 2.10: An exemplary residual network stack, from [4]
Attention or saliency is another key high level architectural design consideration in many
networks. Many networks have a hard time scaling to very large input sizes, so for tasks
such as the Google Street View challenge, which must consume very high resolution
imagery and discriminate house number digits, some method for directing attention to the
digits before discriminating can drastically reduce network complexity by introducing
domain appropriate transforms. In the case of vision, the 2D affine transform works very
well at resolving scale, translation, skew and rotation in input patches. Figure 2.11
illustrates the spatial transformer network (STN) architecture, where a localization network
estimates some set of parameters θ which work with a transformer to produce a
canonical image, which can be classified using a relatively simple discriminative network.
Many of these architectures were developed for computer vision or for voice, however
the high level concepts outlined here are at least as applicable in the radio domain, where
high dimensional search spaces may include time, frequency, spatial, polarization or other
search spaces with well understood transforms as discussed later.
Figure 2.11: Spatial transformer network structure, from [5]
2.3.5 High Performance Computing
Usable computational capacity, through high performance computing and expressive
programming models for algorithms, has been a core enabler of deep learning. Since Gordon Moore’s
famous statement [82], that the number of components/transistors on an integrated cir-
cuit appeared to double every year (later adjusted to every 18 months), we have seen
one of the most incredible technological scaling processes in history, driving the growth
of computing and computing related industries. Unfortunately, over the past 10 years
we have begun to run into limitations on translating this transistor count into growth in
useful computation. The cause for this is best illustrated by the plots shown in figure
2.12. Transistor counts continue to scale, however clock speed and single threaded per-
formance have largely plateaued and no longer see the same exponential gains each year.
This has led to a growth in the number of cores per processor reaching a growth rate
almost equal to that of the number of transistors on chip.
Figure 2.12: Single threading ceiling illustrated, from [6]
In the past 10+ years, computing has attempted to embrace this many-core future by
introducing numerous processing architectures with multi-core or many-core structures, and
introducing many unique programming models to exploit them. While many
hardware architectures have been able to achieve theoretical peak performance numbers
which continue to ride Moore’s law, some of them, such as the Cell Broadband
Engine (CBE) [83], the Tile processor [84] and others, achieved it while placing the vast majority
of the burden on the software programmer to effectively balance algorithm distribution,
data movement, thread communication, etc. among many cores. Unfortunately, this led
to highly limited adoption of such architectures, since significant software development
and tuning of algorithms for specific architectures was required in order to obtain
near-theoretical performance numbers. Around this time, we investigated efficient high
throughput software radio on the CBE and obtained limited success [85], but ultimately
faced very large development times and an end-of-life’d processor roadmap from IBM.
At the same time, GPUs were rapidly expanding to meet the needs of the wide dense matrix
algebra operations required for high rate and high resolution rendering of games and
movies using OpenGL. To meet the needs of these rendering algorithms, graphics cards
generally turned to many-core solutions where operations could leverage wide
architectures, busses and concurrent processing at power-efficient clock speeds and very high
floating point throughput rates.
Around 2007, the notion of general purpose graphics processing unit (GPGPU) computing
began to come to the forefront. Nvidia released their Compute Unified Device
Architecture (CUDA) [86] software development kit, ATI released their Close-to-the-Metal (CTM)
SDK [87], and shortly thereafter OpenCL [88] emerged as an attempt at a mainstream
cross-vendor GPGPU programming solution.
CUDA, CTM (now discontinued), and OpenCL have all been used widely in specific ap-
plications and generally employ a more programmer friendly architecture than possible
with Tile or CBE, however their use in radio signal processing has been somewhat limited
to high computation kernels ported to them and tuned. Wideband channelization [89] has
seen widespread success in this space along with a variety of kernels [90, 91]. In general
these attempts have continued to be plagued by the problem of balancing I/O and com-
pute distribution among compute elements in a general way across a heterogeneous set
of algorithms.
Theano [92] in 2010 introduced a quite new model, which relied on high level Numpy-like
[93] matrix algebra definition in Python and efficient data-flow computation graph
partitioning, GPU compilation, and mapping and optimization over distributed GPU and
CPU compute elements. This was a huge step in that it made the programming model
for concurrent architectures much more rapid and accessible without significant
investment in custom CUDA code, and maintained portability across different CPU and GPU
backends. Google followed shortly thereafter with the release of TensorFlow [94], which
ultimately improved upon and displaced Theano (a university project) with a fully
supported commercial open source project. While AlexNet [1] used CUDA implementations
of its convolutional neural network directly, it was very shortly thereafter that Theano
and similar frameworks began to be heavily leveraged for rapid model iteration and neural
network prototyping, combining a high level programming language and highly efficient
concurrent GPGPU compute architectures for rapid training.
Theano [92] and TensorFlow [94] in this sense really pioneered an entirely new class of
computing, based on the functional programming [95] style definition of very large matrix
algebra computation graphs. This capability has so far been heavily leveraged by the
machine learning and neural network community in libraries such as Keras [96], which
express entire networks as large tensor graphs and efficiently place them
onto multi-CPU or multi-GPU architectures for rapid training and inference. However,
the applicability of these models is actually far wider than solely in machine learning,
with countless signal processing applications standing to benefit from large functional
graph composition, partitioning, kernel synthesis, optimization layout, and orchestration
onto large distributed compute architectures. Within the past few years, the growth of
high performance computing frameworks centered around deep learning and leveraging
these core ideas has been astounding: Caffe [97], Chainer [98], Torch [99], PyTorch
[100], MXNet [101], Lasagne [102], and many other frameworks have explored various
enhancements and syntaxes for such high level deep learning models.
Figure 2.13: Concurrent GPU vs CPU compute architecture scaling (2017), from [7]
In recent years, the spread between concurrent architectures able to continue to grow and
leverage Moore’s law, and those that are more limited in their ability to scale to wide
architectures, has widened greatly, as illustrated in figure 2.13. At this point, virtually every
compute architecture is following suit and providing very high throughput, wide
tensor operations which scale very well with neural network primitives. Not all
algorithms scale well on such architectures (those with tight sequential single loop
dependencies, for example, do not), but the class represented by most wide and deep neural networks maps incredibly
well and efficiently onto such wide architectures, where they can be partitioned readily for
both pipeline and data parallelism automatically from large functional data-flow graph
definitions. This synergy between concurrent model and compute architectures is one of
the key enablers for the adoption of deep learning models, which offer highly efficient
realizations versus algorithms which rely on more iterative or tightly looped designs. This
ensures that any algorithm or approximation fit to such a network will likely map well to
real world scalable compute architectures and play well with the limitations imposed
on us by device physics.
2.3.6 Model Search
Since the original AlexNet paper [1] there have been numerous improvements in image
recognition architectures. Some of these have been due to significant algorithm enhance-
ments and others have been due to simple architectural and hyper-parameter adjustments
in the architectural elements or training procedures. This general problem of how best
to find an architecture for some learning problem, especially for new problems which have
not been as heavily explored as vision, is still an open one. There have been a number of
attempts to explore this problem of architecture search or hyper-parameter search which
have yielded significant steps forward, but tools to address and solve these problems are
not yet widely disseminated, and this is still a major need among many practitioners.
Approaches which have recently been explored to solve this problem include using
gradient descent on the hyper-parameters (so called hyper-gradient descent) [103, 104] as
well as reinforcement learning driven search processes [105] and evolutionary methods
[8, 106]. Evolutionary methods seem to currently show some of the most robust results.
Figure 2.14 illustrates the performance of one such evolutionary search for convolutional
network models to solve the CIFAR-10 and CIFAR-100 dataset image classification tasks
[107].
Figure 2.14: Evolutionary performance of image classifier search, from [8]
Unfortunately, today the computational resources needed for such very large NN model
evolutionary search are quite high. As a result we introduce a simpler small scale
evolutionary strategy later in this work.
2.3.7 Model Introspection
One of the largest critiques of deep learning today is that it can be seen as a "Black Box"
method, in which inputs and output tasks are optimized, but there is little visibility into
what is going on inside the model. While there is some truth to this accusation, it is also
a bit unfair to say a trained neural network is a black box. Aside from the basic intuition
of specific layers’ capabilities, there are a number of techniques which can be employed
to visualize and measure the effects of what is going on within each layer.
Figure 2.15: Layer 1 and 2 filter weights from CNN trained on ImageNet, from [9]
For low level weights, direct inspection of weight vectors can be informative. Layer 1
CNN weights shown in figure 2.15 can provide some intuition as to what each filter rep-
resents. Various rotations and configurations of small low level patterns actually wind
up quite close in some cases to the Gabor filters which were previously used as an
expert low level feature extractor. However, at higher layers in a CNN architecture, the
direct meaning of a set of filter weights is not so immediately clear from direct weight
inspection.
Figure 2.16: Filter activation visualization in CNNs, from [9]
A popular technique for understanding high level CNN feature meaning is by looking
at activations of different features at different layers based on known image stimulus as
explored in [9]. Certain classes of objects which are known to stimulate class labels can
be seen to activate a number of intermediate feature maps within the image. Example
top-9 activation maps are shown in figure 2.16 for a handful of high level features. Here,
activations can often be seen by observation to be correlated directly with component features of high level
classes. For instance, specific facial features may produce activations at
one layer and combine to form a full face activation at a higher level, as has been
demonstrated.
In classification tasks, it is possible to perform gradient descent on the input, starting from a random
image, to find an image which maximally activates some class label. This method was first performed in [108]
and then improved in [10] by including a regularization term (requiring a relatively
smooth input). By doing this, inputs can be generated which demonstrate what
low level features drive any given activation within a network. Figure 2.17 shows this
technique used on imagery, clearly illustrating some of the visual features of each class
which have been captured by the high level class specific feature map activation.
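The idea of optimizing an input to activate a class label can be sketched on a deliberately tiny model. The example below uses a linear-softmax classifier with hypothetical weights `W` and a simple L2 penalty on the input in place of the smoothness regularizer of [10]; both are assumptions made so that the gradient can be written by hand, and a real network would obtain it by back-propagation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def maximize_class(W, c, x0, steps=200, lr=0.1, lam=0.01):
    """Gradient ascent on log p(c | x) - lam * ||x||^2 with respect to the input x."""
    x = x0.copy()
    target = np.eye(W.shape[0])[c]
    for _ in range(steps):
        p = softmax(W @ x)
        grad = W.T @ (target - p) - 2.0 * lam * x   # d/dx of the regularized objective
        x += lr * grad
    return x

W = np.eye(3)                        # hypothetical trained weights: 3 classes, 3 inputs
rng = np.random.default_rng(0)
x0 = 0.01 * rng.normal(size=3)       # small random starting input
x_opt = maximize_class(W, c=1, x0=x0)
p_before = softmax(W @ x0)[1]
p_after = softmax(W @ x_opt)[1]
```

The weights stay fixed throughout; only the input is updated, which is what distinguishes this procedure from ordinary training.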
Figure 2.17: Optimization of input images for feature activation, from [10]
Other methods for introspection focus on localizing where in the input vector the
contributions to a feature’s activation occur (a so-called saliency map). This can be done in
several ways, but one of the most promising recent methods involves differentiating a
feature’s activation output with regard to pixels or points in the input image. This method,
the gradient class activation map (GradCAM) [11], is a powerful method for localizing and
highlighting which regions in an input correspond to which activations. Figure 2.18
illustrates this technique on dog and cat classes within a single image for a classifier trained
on image labels without any location information.
Figure 2.18: GradCAM Saliency Maps for Dogs and Cats, from [11]
Newer work [12] looks at the performance
of each layer of a neural network from an information theoretical viewpoint, measuring
the joint information between input, output, and intermediate layers throughout the deep
learning training process.
Figure 2.19: Information theoretic visualization of deep learning, from [12]
In figure 2.19, we illustrate the information plane, which relates the joint information
between raw input (X-axis) and output/targets (Y-axis) to the information contained at each
layer of the model during training. Interestingly, we can see that as training progresses, the
layers move to represent more information about the input, while continually
representing more joint information with the output, and then finally enter a compression stage
where they filter and remove information about the input X while preserving
information about targets, Y. This is an interesting viewpoint for understanding information flow
and compression through a so-called ’bottleneck’ during DL model training. On the right
we see the mean and variance of gradients used to guide the gradient descent, which
start with a high SNR (large mean and low variance), and throughout training decrease
in SNR gradually until they no longer possess significant meaningful gradient
information to further guide the solution. Such an information centric view is quite important
when considering deep learning for numerous communications and signal processing
tasks where preservation or compression of information throughout the networks is
often desired, and a solid understanding of how information is preserved or compressed
can be helpful.
Chapter 3
Learning to Communicate
Since virtually the beginning of radio, radio transceivers and waveforms have been con-
ceived through human design. Original electromagnetic (EM) communications systems
such as the telegraph and the spark gap transmitter [109] were practical due to hardware
and EM understanding at the time.
Physical layer designs grew increasingly complex as multiple access schemes such
as frequency-division were introduced to allow additional users, higher data rates,
increased device power efficiency and decreased cost. In 1948 Shannon introduced
information theory and the notion of optimal channel capacity to the world, defining the
fundamental problem of communication as "reproducing at one point either exactly or
approximately a message selected at another point." [20] This placed a theoretical upper
bound on the capabilities of single antenna transceivers over a Gaussian channel, but it
Timothy J. O’Shea Chapter 3. Learning to Communicate 51
did not inform radio designers specifically how to attain those levels of performance.
Figure 3.1: Illustration of the many modular algorithms present in a modern wireless physical layer modem such as LTE
Since then, radio engineers have iterated through numerous modulation, coding, and ra-
dio design approaches every few years in an attempt to improve capacity, reduce cost
and power requirements, and generally push our devices closer to these capacity bounds.
In today’s world, modern modems look something like that shown in figure 3.1, which
depicts a modern wireless physical layer such as LTE with its many
modular algorithms. Here, each module represents one of numerous intense areas of
research surrounding optimal coding, MIMO precoding, subframe allocation, modulation,
and other tasks which are all composed sequentially and distinctly to form the powerful
and efficient standards we use today.
Within each of these modules typically lies some analytic formulation of the wireless
channel. In the case of error correction codes, random bit flips may be used when testing
or validating a code, and for modulations or MIMO coding schemes, Gaussian noise or
Rayleigh fading channels are frequently used to model the propagation channel. In each
of these cases, such an approach generally requires simplifying assumptions and modular
optimization of individual algorithmic components rather than as a whole.
This has proven to be effective, but generally leaves open several questions. Can we do
better with richer information about the real distributions of actual impairments in a
specific deployment scenario, and can we do better if we jointly optimize the system
rather than building components with rigid interfaces and intermediate values? Can we
find a more straightforward way to build complex communications systems which attain
similar performance without the need for thousands of man-hours in engineering, software
implementation, and optimization time? And can we find such systems which maintain
near-Shannon levels of performance while retaining the flexibility to adapt the physical
layer more fully than this sort of rigid physical layer algorithm definition will allow?
3.1 The Channel Autoencoder
To answer these questions, we consider again the fundamental task of a radio
communications system: “reproducing at one point either exactly or approximately a
message selected at another point” [20]. This task is strikingly similar to that of an
autoencoder, whose objective is to reconstruct some input vector x at the output x̂ and
minimize the loss between the two, by learning an encoder and a decoder for some input
vector. We first introduced this idea in [110] and further refined it in [111].
Figure 3.2: The Fundamental Communications Learning Problem
Figure 3.3: A simple autoencoder for a 2D MNIST image, from [14]
Traditionally an autoencoder is used to learn a lower dimensionality sparse
representation of the input vector x (such as the MNIST digits shown in figure 3.3),
which may be non-linear when using non-linear neural network activation functions. This
approach for learning encoding, decoding, and sparse representations has the benefits
that it can be fit non-linearly to the distribution of a given input dataset, can be
tuned for a specific loss function (e.g. mean squared error (MSE), binary cross-entropy
(BCE), or categorical cross-entropy (CCE)), and can act as a filter to remove
non-structural noise which does not lie within the learned support of the compressed
representation.
Figure 3.4: A Simple Channel Autoencoder
We can formulate the radio communications system problem as a similar autoencoder,
where a message to transmit s, either a k-bit binary vector with M = 2^k possible
codewords or an equivalent one-hot codeword vector of length M, is encoded, passed
through some set of channel impairments, and then decoded to recover ŝ, an estimate of
the originally transmitted message. The channel layer may be stochastic in nature;
stochastic layers have been regularly used within computer vision for their regularizing
properties (e.g. [3], [112]).
This channel autoencoder differs from the conventional use of an autoencoder in a few
ways. First, the intermediate representation of the signal may actually be of higher
dimension (as opposed to most autoencoders, which seek a sparse representation). Second,
the channel layer introduces numerous lossy and mixing impairments rarely seen in other
configurations (e.g. noise, fading, rotation, etc.). We consider s to be a number of bits
k producing 2^k = M distinct messages, which are encoded into some number, n, of real
or complex valued digital samples. Controlling the ratio k/n (with the configuration
further referred to as (n, k)) for a given sample rate and signal and noise power
controls the information rate at which bits are transmitted over the channel. By
modifying these dimensions, any rational rate system can be obtained using the same
approach for arbitrary values of k and n, or simply M and n.
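As a minimal sketch of the message representation described above, the following NumPy snippet maps a k-bit message to its index, builds the equivalent one-hot vector of length M = 2^k, and computes the rate k/n in bits per channel use. The function names and the MSB-first bit ordering are illustrative assumptions, not taken from the original work.

```python
import numpy as np

def bits_to_index(bits):
    """Interpret a k-bit vector (MSB first, an assumed convention) as an index in [0, 2^k)."""
    index = 0
    for b in bits:
        index = (index << 1) | int(b)
    return index

def one_hot(index, M):
    """Return the length-M one-hot codeword vector for a message index."""
    s = np.zeros(M, dtype=np.float32)
    s[index] = 1.0
    return s

def information_rate(n, k):
    """Bits carried per channel use for an (n, k) scheme."""
    return k / n

k, n = 4, 7                                   # the (7,4) configuration
M = 2 ** k                                    # 16 possible messages
s = one_hot(bits_to_index([1, 0, 1, 1]), M)   # one-hot vector for message index 11
rate = information_rate(n, k)                 # 4/7 bits per channel use
```

Non-power-of-two values of M fit the same one-hot representation, since only the vector length changes.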
We construct the network using a relatively small network, shown in table 3.1, whose
dimensions scale based on M and n. Interestingly, while a single fully connected linear
layer in the encoder is fully capable of mapping all codewords to all possible real
valued transmit symbols in one step, SGD cannot find a good solution when using only a
single layer, and gets stuck in a sub-optimal local minimum during training. Adding a
second layer of depth to the transmit and receive networks, however, allows the network
to converge very rapidly to a very good global optimum set of network weights. This is
an excellent illustration of the work in [37], which demonstrates that using a deeper
network with a higher dimensional parametric search space actually helps networks
converge to more globally optimal solutions: they are much less likely to become trapped
in a local minimum, simply because the curvature along all degrees of freedom is
unlikely to align. In this deeper, higher dimensional space they are instead more likely
to encounter a saddle point, which is not necessarily terminal in a gradient descent
search when using a strong saddle-free optimization method (some, such as Newton’s
method, may have difficulty).
Table 3.1: Layout of the autoencoder used in Figs. 3.5 and 3.6. It has (2M + 1)(M + n) + 2M trainable parameters, resulting in 62, 791, and 135,944
parameters for the (2,2), (7,4), and (8,8) autoencoder, respectively.
Layer              Output dimensions
Input              M
Dense + ReLU       M
Dense + linear     n
Normalization      n
Noise              n
Dense + ReLU       M
Dense + softmax    M
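The parameter counts quoted for table 3.1 can be checked with a short sketch. It assumes each dense layer carries a weight matrix plus a bias vector, and that the normalization and noise layers have no trainable parameters; under those assumptions the per-layer sum reproduces the closed form (2M + 1)(M + n) + 2M. The function names are illustrative.

```python
def autoencoder_param_count(n, k):
    """Sum weights and biases over the four dense layers of table 3.1."""
    M = 2 ** k
    # (inputs, outputs) of each dense layer: encoder M->M, M->n; decoder n->M, M->M
    layers = [(M, M), (M, n), (n, M), (M, M)]
    return sum(n_in * n_out + n_out for n_in, n_out in layers)

def closed_form_param_count(n, k):
    """Closed-form count stated in the caption of table 3.1."""
    M = 2 ** k
    return (2 * M + 1) * (M + n) + 2 * M

for n, k in [(2, 2), (7, 4), (8, 8)]:
    count = autoencoder_param_count(n, k)
    assert count == closed_form_param_count(n, k)
    print((n, k), count)   # 62, 791, and 135944 respectively
```

Because M = 2^k, the 2M^2 term dominates quickly, which is one way to see why this one-hot formulation grows exponentially with the number of bits per message.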
In order to avoid the trivial solution of using very large values for x in the symbol
encoding to increase the effective SNR over a constant channel noise power, we introduce
a transmit normalization layer after the encoder which enforces a constant average power
for transmitted symbols during training, as indicated in table 3.1. This can be done at a
per-symbol or per-batch level, and can be enforced in a number of ways including mean
amplitude, mean power, max power, or other similar constraints, in some cases yielding
quite different results for each.
In figure 3.5 from [111] we compare the performance of a learned physical layer encoding
for block sizes of 2 and 8 bits to the block/codeword error rate performance of uncoded
binary phase shift keying (BPSK) modulation. In this case, we have the interesting
result that, for a 2-bit codeword size, 2xBPSK and the (2,2) autoencoder obtain the same
information rate (by definition), and align on an almost identical error rate curve. As
we increase the block size to 8 bits, we begin to see the (8,8) autoencoder system
outperform the uncoded 8xBPSK system by 1-2 dB at higher SNR values. This indicates that
the larger block size (8,8) autoencoder is in fact learning some form of error
correction, where its encoding scheme is more robust than the simple BPSK solution.
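For reference, the uncoded BPSK baseline has a standard closed form over AWGN: each bit errs independently with probability Q(sqrt(2 Eb/N0)), so a block of k bits is in error unless every bit is received correctly. The sketch below evaluates that expression; it is offered as a textbook baseline under these assumptions, not as the code used to produce figure 3.5, and the function names are illustrative.

```python
import math

def q_function(x):
    """Gaussian tail probability Q(x) = P(N(0,1) > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def uncoded_bpsk_bler(ebn0_db, k):
    """Block error rate of k independent uncoded BPSK bits over AWGN."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)                  # dB to linear
    bit_error = q_function(math.sqrt(2.0 * ebn0))    # per-bit error probability
    return 1.0 - (1.0 - bit_error) ** k              # block fails if any bit fails

# Longer uncoded blocks are strictly more likely to contain an error.
for ebn0_db in (0.0, 4.0, 8.0):
    print(ebn0_db, uncoded_bpsk_bler(ebn0_db, 2), uncoded_bpsk_bler(ebn0_db, 8))
```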
Figure 3.5: BLER versus Eb/N0 for autoencoder
[Plot: block error rate (10^-5 to 10^0, log scale) versus Eb/N0 [dB] (-2 to 10). Curves: Uncoded BPSK (8,8), Autoencoder (8,8), Uncoded BPSK (2,2), Autoencoder (2,2).]
In figure 3.6 we consider the comparison of an autoencoder with 4-bit codewords and 7
real valued symbols over the channel. Here, we consider three different baselines: first
the uncoded (4,4) BPSK solution, which provides the worst performance, and then two
baselines using a Hamming code with the same 4/7 rate as the autoencoder. For the hard
decision decoder, there is still a 1-2 dB gap in performance relative to the
autoencoder, while for maximum likelihood decoding (MLD), the performance is nearly
identical. This is a very promising result as it shows that for small block sizes, the
channel autoencoder approach can learn very strong solutions which rival commonly used
modulation and error correction codes.
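The hard decision Hamming (7,4) baseline also admits a closed form, sketched here under stated assumptions: BPSK signalling with a hard decision on each coded bit, energy per coded bit equal to (4/7) Eb, and a decoder that corrects exactly one bit error per block (the Hamming (7,4) code is a perfect single-error-correcting code, so a block fails exactly when two or more of its seven bits flip). This is a textbook expression, not necessarily how the curve in figure 3.6 was generated.

```python
import math

def hamming74_hard_bler(ebn0_db):
    """Block error rate of Hamming (7,4) with hard-decision decoding over AWGN."""
    rate = 4.0 / 7.0
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    # Coded-bit error probability Q(sqrt(2 * R * Eb/N0)) = 0.5 * erfc(sqrt(R * Eb/N0))
    p = 0.5 * math.erfc(math.sqrt(rate * ebn0))
    # Success requires zero or one flipped bit among the seven coded bits
    return 1.0 - (1.0 - p) ** 7 - 7.0 * p * (1.0 - p) ** 6

def uncoded_bpsk44_bler(ebn0_db):
    """Block error rate of four uncoded BPSK bits, for comparison."""
    ebn0 = 10.0 ** (ebn0_db / 10.0)
    p = 0.5 * math.erfc(math.sqrt(ebn0))
    return 1.0 - (1.0 - p) ** 4

for ebn0_db in (2.0, 4.0, 6.0):
    print(ebn0_db, uncoded_bpsk44_bler(ebn0_db), hamming74_hard_bler(ebn0_db))
```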
Figure 3.6: BLER versus Eb/N0 for autoencoder
[Plot: block error rate (10^-5 to 10^0, log scale) versus Eb/N0 [dB] (-4 to 8). Curves: Uncoded BPSK (4,4), Hamming (7,4) Hard Decision, Autoencoder (7,4), Hamming (7,4) MLD.]
To further understand the solutions learned by this naive autoencoder learning process,
we can plot the constellations of each learned encoding scheme simply from their input
to the channel module. Figure 3.7 illustrates the constellations learned for (2,2), (2,4),
(2,4), and (7,4) schemes, where different power normalization constraints on (2,4) produce
different constellations (e.g. 16-PSK or non-standard 16-QAM), and the 7-dimensional
encoding space of the (7,4) code is visualized in 2 dimensions using t-SNE [113]. It is
pleasing here that the canonical QPSK solution (with random rotation) is achieved for the
(2,2) code, and that the familiar PSK as well as non-rectangular near-optimally packed
16-QAM is achieved for (2,4).
The training process for channel autoencoders is an interesting problem in which the
model must learn to perform well in low and high SNR conditions, and the channel and
training parameters may be manipulated during training. Experimentally, we find that
training at a mid-range SNR (8 dB Eb/N0) works well, and that varying the batch size from
small (50) to large (10,000) in two passes trains the system effectively. This is an
interesting result, as the batch size has an effect on the effective SNR of the
gradients and the average receive symbol locations. In general in computer vision, high
SNR images are used, which may have occlusions, permutations, or small objects, but
generally do not have white noise competing with the ‘signal power’ of an actual visual
object. However, in vision there has also been recent discussion surrounding the use of
increasing batch sizes, rather than decreasing learning rate, throughout training as
smaller (and more noisy) step sizes are needed during optimization.
Figure 3.7: Constellations produced by autoencoders using parameters (n, k): (a) (2, 2), (b) (2, 4), (c) (2, 4) with average power constraint, (d) (7, 4) 2-dimensional t-SNE
embedding of received symbols.
The choice of transmit normalization is an interesting one which has no clear ‘best
choice’, but has a significant effect on the learned solution. We consider a number of
normalization functions which map the output of the encoder X_{i,j,k} to the input to the
channel X_t, as X_t = f_norm(X_enc). Here, X_{i,j,k} represents a 3-dimensional tensor
over i, the example index, j, the sample index within one example, and k, the complex
sample component index (i.e. I and Q), for one training iteration. Table 3.2 provides
several possible transmit normalization functions f_norm which can be used, where N_x is
the number of examples, N_s is the number of samples, and N_c is the number of
components per sample (2).
Table 3.2: Candidate channel autoencoder transmit normalization functions
Tx Norm Method                   Expression
Example Mean Power (EMP)         X_t = X (N_x N_s) / \sum_{i,j} \sqrt{\sum_k X_{i,j,k}^2}
Batch Mean Power (BMP)           X_t = X (N_x N_s N_c) / \sqrt{\sum_{i,j,k} X_{i,j,k}^2}
Batch Mean Ampl. (BMA)           X_t = X (N_x N_s N_c) / \sum_{i,j,k} |X_{i,j,k}|
Batch Mean Max Power (BMMP)      X_t = X (N_x N_s N_c) / \sqrt{\sum_{i,j,k} \max(X_{i,j,k}^2, 1)}
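To make the difference between per-example and per-batch constraints concrete, the following NumPy sketch normalizes a tensor of shape (N_x, N_s, N_c) in three ways. The scaling constants here are chosen so that the mean power (or mean amplitude) equals one; this is an assumed, conventional choice and may differ by a constant factor from the expressions tabulated in this section. The function names are hypothetical.

```python
import numpy as np

def example_mean_power_norm(X):
    """Scale each example so its mean power per (I/Q) sample equals 1."""
    power = np.mean(np.sum(X ** 2, axis=2), axis=1, keepdims=True)  # shape (Nx, 1)
    return X / np.sqrt(power)[:, :, None]

def batch_mean_power_norm(X):
    """Scale the whole batch so the mean power per sample equals 1."""
    power = np.mean(np.sum(X ** 2, axis=2))
    return X / np.sqrt(power)

def batch_mean_amplitude_norm(X):
    """Scale the whole batch so the mean absolute component value equals 1."""
    return X / np.mean(np.abs(X))

rng = np.random.default_rng(0)
X = 3.0 * rng.standard_normal((5, 4, 2))   # Nx=5 examples, Ns=4 samples, Nc=2 (I and Q)

per_example = np.mean(np.sum(example_mean_power_norm(X) ** 2, axis=2), axis=1)
per_batch = np.mean(np.sum(batch_mean_power_norm(X) ** 2, axis=2), axis=1)
print(per_example)   # every example has unit mean power
print(per_batch)     # examples differ; only their average is constrained
```

The per-batch variants leave individual examples free to use more or less power than average. A practical implementation would also guard against division by zero for an all-zero input, which this sketch omits.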
To gather an intuition for the learned solutions of this class of learned constellation
in a traditional 2D (I/Q) single symbol space, we can compute and plot the learned
constellations for 2-QAM through 33-QAM for each normalization strategy below.
Interestingly, since we can map to any number of codewords trivially with this approach,
we don’t need an integer number of bits to transmit, only an integer number of
codewords, leading to numerous rate adaptation possibilities beyond the traditional
2^N-point constellations used today for QAM.
First, in figure 3.8 we show the result of using the symbol power constraint per example
(EMP). In this case each symbol takes on an average power of 1, leading to conventional
constant modulus solutions of phase-shift keying (PSK).
Figure 3.8: Learned QAM Modes for Example Mean Power (EMP)
Figure 3.9: Learned QAM Modes for Batch Mean Power (BMP)
In figure 3.9 we use the average symbol power over an entire batch, which frees each indi-
vidual symbol up to vary to some degree as long as the mean is constrained. Here, we be-
gin to see multi-level constellations form which are quite interesting and non-conventional.
However, one interesting case here is that of 5-QAM where it has learned a relatively con-
stant power constellation which differs from BMA.
Figure 3.10: Learned QAM Modes for Batch Mean Amplitude (BMA)
Figure 3.10 shows the batch mean amplitude mode, where again we obtain a number of
novel solutions, such as for 5-QAM, where we obtain a QPSK looking constellation which
also uses the zero-power mode as a 5th constellation point.
Numerous additional constraints are possible. In figure 3.11 we use a constraint which
limits mean power per batch, but considers max(X_{i,j,k}^2, 1) before averaging to avoid
overly incentivizing low-power constellation points (e.g. all points under the average
power are of equal penalty). Such points might, for example, lead to results with very
poor peak to average power ratio (PAPR) and hence poor amplifier efficiency.
Figure 3.11: Learned QAM Modes for Batch Mean Max Power (BMMP)
These results are of course only for a single symbol; by scaling values of n and k, we
can design a system which encodes an arbitrary number of bits into an arbitrary number
of symbols. When encoding across multiple symbols, a typical solution appears to be
a unique non-standard 2^k-QAM constellation for each symbol, and then some kind of
trellis-like combining across multiple symbols to obtain good coding gain. Examples of
this are shown in figure 3.12, where we encode 2-bit, 4-bit, and 8-bit messages into
groups of 4 sequential symbols. In this case, additional error correction capacity is
obtained versus the single symbol form, and a distinct QAM arrangement is learned for
each symbol with a highly non-intuitive arrangement. Here one codeword corresponds to a
point in each of the four spaces. While two constellation points may be close together
in one symbol, the points corresponding to the same pair of messages will be far apart
in another symbol time, allowing for non-linear combining to perform an efficient
representation and decoding over all the dimensions.
Figure 3.12: Learned 4-Symbol QAM Modes using BMA for 2-bit, 4-bit, and 8-bit messages
This method works surprisingly well, but one of the key challenges with it is scaling to
much larger codeword sizes such as the 1000+ bits used in modern turbo codes. When
using L_CCE we must select among 2^k codeword indices (messages), scaling our network
exponentially as bits are added. One solution is to use k binary inputs and k sigmoid
binary bit outputs along with an L_BCE loss function. In this case, the network scales
more linearly with block size; however, we have not yet been able to obtain near optimal
capacity performance from a network trained in such a fashion. Other strategies for
scaling to larger networks have been explored very recently within the scope of error
correction codes [114, 115, 116] through methods involving partitioning and leveraging
belief propagation graphs to seed neural network weights; however, significant work
remains to allow for scaling these techniques to large codeword sizes, such as are
widely used today in modern LTE systems. Ultimately, methods such as replicating network
structure within the full block size, whether through weight/connection tying or through
some form of recurrent operation with state, hold significant promise for solving this
problem in the future and allowing these methods to be competitive with state of the art
error correction and modulation schemes.
3.2 Learning to Synchronize with Attention
When learning to discriminate between received symbols in a channel autoencoder (or
between classes in a classifier), the discriminative model must generally learn to
classify all forms of signal variation which may arrive at the receiver. In radio,
permutations due to the channel include additive noise, phase offset, frequency offset,
delay spread, interference, and many other distortions such as hardware non-linearities
and mixer intermodulation products. Previous results were shown only with AWGN
impairments; however, real world systems include all of these effects and more.
Figure 3.13: Spatial Transformer Example on MNIST Digit from [5]
In computer vision, objects undergo a somewhat analogous set of permutations when
being viewed, including scaling, rotation, skew, translation, occlusion, and noise. Since
these permutations are geometrically well understood, a domain appropriate parametric
transformation such as the 2D affine transform may be applied to correct them directly,
as shown in figure 3.13 from [5]. By imparting expert knowledge about the domain
appropriate parametric transforms, the task of canonicalizing an object may be reduced to
estimating a set of parameters and then executing the transform. By splitting a
classification task up into learned parameter estimation (localization), parametric
transformation, and learned class discrimination, the model complexity needed to
classify a range of permutations on the classes may be greatly reduced. If the
parametric transform is implemented in a way in which it maintains its differentiability,
both localization and discrimination networks may be trained in an end-to-end fashion as
a single task (e.g. minimize CCE) by using back-propagation from the global loss function
both before and after the transform. This architecture has proven to be very effective
for image classification, such as the Google Street View house number challenge, where
the localization network helps locate and canonicalize digits and the discriminative
network classifies digits.
Figure 3.14: Radio Transformer Network Architecture
The same architecture can be applied to radio communications problems (as we show
in [117, 111]), where current day transformations such as application of equalizer taps
or removal of carrier phase, frequency, or timing errors can be applied directly, as
long as they can be implemented in a differentiable manner. In this case, we can split
the network into a more general (not just spatial) parameter estimation network to
estimate channel state information (CSI), and a discriminative network to perform symbol
estimation (or any other task), while maintaining our expert knowledge about the domain
appropriate transforms in order to simplify the target manifold to be learned and often
reduce the number of free parameters needed in our model. Although we have imparted
expert knowledge about the physical radio effects, we have only specialized our solution
for the domain in general (e.g. things that happen to all radio signals). This is an
important point: we have not done anything to specialize the parameter estimation or
discriminative networks for any one specific signal or modulation type, keeping
domain-wide, non-signal-specific generality in our model architecture.
To validate the radio transformer network (RTN) approach, we consider several tasks.
First, we consider the performance of a channel autoencoder under a Rayleigh fading
channel with a tap length of L = 3. In this case, we allow the estimated parameters θ to
take the form of h^{-1}, the inverse of the channel impulse response, which can be
directly convolved with the received signal to obtain a canonical impulsive copy of the
signal. We implement the convolution in differentiable tensor algebra within Keras [96]
as a set of dense matrix multiplies and adds (the standard TensorFlow convolution
operation cannot be used when both the input and convolution taps are free variables).
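One way to see how a convolution reduces to multiplies and adds is to write it as a product with a Toeplitz matrix built from the taps. The NumPy sketch below illustrates that idea only; it is not the Keras implementation referred to above, the function name is hypothetical, and the random taps stand in for the output of a parameter estimation network.

```python
import numpy as np

def convolution_matrix(taps, signal_len):
    """Build the (signal_len + L - 1) x signal_len Toeplitz matrix T such that
    T @ x equals the full linear convolution of x with the taps."""
    L = len(taps)
    T = np.zeros((signal_len + L - 1, signal_len), dtype=complex)
    for col in range(signal_len):
        T[col:col + L, col] = taps   # each column is a shifted copy of the taps
    return T

rng = np.random.default_rng(0)
y = rng.standard_normal(8) + 1j * rng.standard_normal(8)        # received samples
h_inv = rng.standard_normal(3) + 1j * rng.standard_normal(3)    # stand-in for estimated inverse taps

equalized = convolution_matrix(h_inv, len(y)) @ y               # convolution as a matrix product
assert np.allclose(equalized, np.convolve(y, h_inv))
```

Every entry of the matrix is either zero or one of the taps, so the output is a sum of products of taps and input samples and is differentiable with respect to both.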
In figure 3.15 we illustrate the training complexity reduction for this task, comparing
the training loss curve for an autoencoder both with and without the CSI estimation
network and transformer in front of the symbol discrimination task. Here, we can see
that training converges to a solution in both cases, but with the RTN it converges much
more quickly, reaching a good solution in only a few epochs, and ultimately achieves a
much lower final CCE loss (and BLER).
Comparing the performance of the autoencoder with and without the RTN synchronizer
on the front, we can observe the fully trained block error rate performance in figure 3.16.
Figure 3.15: Autoencoder training loss with and without RTN
[Plot: categorical cross-entropy loss (0 to 1) versus training epoch (0 to 100). Curves: Autoencoder, Autoencoder + RTN.]
Figure 3.16: BLER versus Eb/N0 for various communication schemes over a channel with L = 3 Rayleigh fading taps
[Plot: block error rate (10^-4 to 10^0, log scale) versus Eb/N0 [dB] (0 to 20). Curves: Autoencoder (8,4), DBPSK(8,7) MLE + Hamming(7,4), Autoencoder (8,4) + RTN.]
Here, the non-RTN version is unable to outperform the baseline method of MLD DBPSK
decoding with a Hamming code, while the autoencoder with RTN achieves significantly
better performance, especially at higher SNR values. This is quite an exciting result,
as it shows that a fully learned approach can leverage expert domain knowledge about
radio propagation physics, still maintain full generality among signals, and very
quickly learn a good solution which outperforms common baseline levels of performance
through the RTN approach of CSI estimation, transformation, and symbol estimation. In
this case, the learned model may also benefit from the bias present in the fading
channel model (the distribution of the taps): since the channel is constrained to a set
of L = 3 Rayleigh fading taps, the solution space is not uniform over all possible real
values for h^{-1}, which generally allows the system to specialize better for the actual
distribution.
Such a result could be incredibly powerful in a wireless environment, where CSI estima-
tion and equalization could be heavily specialized and improved for the delay spread
distribution within specific deployment scenarios and conditions, but is also somewhat
troubling in that it is increasingly important that the simulations and impairment mod-
els used for training sufficiently match the possible channel conditions which may be
encountered in the real world at inference time.
This technique is a very general front-end strategy for leveraging knowledge about
appropriate transforms when constructing ANN models for high dimensionality parametric
search spaces. Results here are shown for the autoencoder and symbol decoding problem,
but preliminary results show that such an approach can also help in sensing and other
tasks such as signal type or modulation recognition, or other sorts of signal property
labeling through model learning on RF emissions in the spectrum.
3.3 Multi-User Interference Channel
One of the nice features of the channel autoencoder is the versatility with which it can
solve many different formulations of the radio communications problem with variations
on the same compact optimization problem, with no need to devise complex new physical
layer encoding or signal processing strategies. One such important case is that of
the multi-user interference channel, where optimization of some aggregate multi-user
capacity is the goal rather than that of a single transmitter and receiver. This is a
critical case in wireless systems as it represents most wireless channels with which we
interact on a daily basis, where we share some piece of spectrum (e.g. cellular bands,
industrial, scientific, and medical radio (ISM) bands, ground mobile radio (GMR) bands)
with a number of different users who must somehow share the available spectrum to
optimize for some joint objective such as capacity. While multi-user capacity bounds
have been derived for specific instances, no general solution exists to bound aggregate
capacity under all conditions, meaning we do not know how far current day systems are
from optimal usage of the interference channel. Unfortunately, today we have a slow
iterative process of physical layer design, optimization, analysis, and then manual
redesign based on whatever intuition is gleaned from the analysis. Channel autoencoders
offer a tool with which to break out of this painful cycle and directly seek a globally
optimal multi-user physical layer (PHY) scheme from the ground up, optimizing for
aggregate capacity or any other pertinent design objective or constraint deemed
important for the application.
Figure 3.17: The two-user interference channel seen as a combination of two interfering autoencoders that try to reconstruct their respective messages
Using the same channel autoencoder construct previously used, we can formulate the
problem with a new mixing channel within the channel layer of two autoencoders, as
shown in figure 3.17. Here there are two objectives to minimize, L_1 = L_CCE(s_1, ŝ_1) and
L_2 = L_CCE(s_2, ŝ_2), the reconstruction losses for users 1 and 2 respectively. These can
be treated as a single network, where each optimization step chooses a random batch of
independent values for both s_1 and s_2 and completes a back-propagation step to minimize
the two. Encoders in this case only have knowledge of their own transmit codeword,
and the network architecture from table 3.3 is used, where a dimension of [x, x]
indicates two separate paths of size x and a dimension of [x] indicates a single path of
size x.
Table 3.3: Layout of the multi-user autoencoder model
Layer              Output dimensions
Input              [M, M]
Dense + ReLU       [M, M]
Dense + linear     [n, n]
Normalization      [n, n]
Addition           [n]
Noise              [n, n]
Dense + ReLU       [M, M]
Dense + softmax    [M, M]
When optimizing for multiple loss functions there is often a question of how to combine
them. This can be done additively, multiplicatively, or in many other ways which all have
an effect on the optimization process and the form of the resulting error gradients. The
most straightforward approach is to simply sum the two loss functions. Unfortunately,
when doing this, it is not uncommon for imbalance to occur between the two objectives
(e.g. favoring one user’s CCE loss and therefore BLER over that of another). If equal loss
is desired among the loss functions, some means for balancing the loss magnitudes must
be used. In this case, we seek to obtain fair performance among two users accessing the
same channel. As described in [111], to address this, we adopt the following joint loss
term L_I with loss weight term α_t, which is given an initial condition of α = 0.5 and is
updated at each mini-batch time step t as follows.
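\begin{equation}
\begin{aligned}
L_I &= \alpha L_1 + (1 - \alpha) L_2 \\
\alpha_{t+1} &= \frac{L_1}{L_1 + L_2}, \quad t > 0
\end{aligned}
\tag{3.1}
\end{equation}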
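The weighting rule in equation 3.1 can be sketched in a few lines of Python. The per-batch loss values below are made up purely for illustration, and the function names are hypothetical; in training, the two losses would come from the current mini-batch.

```python
def joint_loss(loss_1, loss_2, alpha):
    """Weighted joint loss L_I = alpha * L_1 + (1 - alpha) * L_2."""
    return alpha * loss_1 + (1.0 - alpha) * loss_2

def update_alpha(loss_1, loss_2):
    """Next weight alpha_{t+1} = L_1 / (L_1 + L_2)."""
    return loss_1 / (loss_1 + loss_2)

alpha = 0.5                                            # initial condition
history = []
for loss_1, loss_2 in [(0.9, 0.3), (0.6, 0.4), (0.5, 0.5)]:   # illustrative losses only
    combined = joint_loss(loss_1, loss_2, alpha)       # value to back-propagate
    alpha = update_alpha(loss_1, loss_2)               # weight used at the next step
    history.append((combined, alpha))
```

Whichever user currently has the larger loss receives the larger weight at the next step, which pushes the two losses toward each other; when they are equal, the weight returns to 0.5.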
While this metric is heuristic in nature, it does a good job empirically balancing the two
loss functions during training to arrive at a PHY with roughly equal BLERs and mean
symbol powers.
Figure 3.18: BLER versus Eb/N0 for the two-user interference channel achieved by the AE and 2^{2k/n}-QAM TS for different parameters (n, k)
[Plot: block error rate (10^-5 to 10^0, log scale) versus Eb/N0 [dB] (0 to 14). Curves: TS/AE (1, 1), TS/AE (2, 2), TS (4, 4), AE (4, 4), TS (4, 8), AE (4, 8).]
When comparing the aggregate BLER (and thus multi-user capacity) of such a system
with a completely orthogonal QAM based access sharing system such as time-sharing
(orthogonal time access (TDM) from [111]), as is shown in figure 3.18, we observe several
important results. First, the time-sharing autoencoder system (TS/AE) outperforms the
baseline time-sharing QAM system (TS), as we have previously shown; in this case the
autoencoder simply learns a single user access strategy within each of its time-slots.
Secondly, the multi-user autoencoder (AE or multi-user (MU)/AE) learns a solution which
outperforms the TS/AE system even further. This result is illustrated for both 4-bit and
8-bit codeword sizes over a Gaussian interference channel in figure 3.18.
Figure 3.19: Learned constellations for the two-user interference channel with parameters (a) (1, 1), (b) (2, 2), (c) (4, 4), and (d) (4, 8). The constellation points of Transmitter 1 and 2 are represented by red dots and black crosses, respectively.
In the case of the MU/AE system, an aggregate BLER of roughly 10^-3 is achieved for
the 4-bit system at around 0.7 dB lower Eb/N0, while for the 8-bit system it is around
1 dB lower. This offers quite significant potential gains for future multi-user access
systems, gains which generally only stand to improve as additional channel impairments
are introduced and numbers of users increase.
Inspecting the constellations learned by the MU/AE system helps to provide some intu-
ition as to what has been learned. In figure 3.19, we illustrate the constellations learned
in the (1,1), (2,2), (4,4), and (4,8) MU/AE configurations.
For the (1,1) system, the solution is quite easy to interpret, and one which a human
designer might easily have come up with. Here the system has learned a set of two
phase-orthogonal BPSK modulations at random rotation, providing in this case an
orthogonal solution which does not reduce the rate of either user.
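The phase-orthogonal BPSK solution can be mimicked by hand in a short NumPy sketch. It assumes a single complex channel use per message, a common rotation known to the receiver, and one shared noisy observation for both users; these are simplifications made for illustration, whereas the learned system infers its rotation and decision regions from data.

```python
import numpy as np

rng = np.random.default_rng(1)
num_symbols = 1000
bits_1 = rng.integers(0, 2, num_symbols)     # user 1 messages
bits_2 = rng.integers(0, 2, num_symbols)     # user 2 messages

rotation = np.exp(1j * 0.3)                  # arbitrary common rotation (assumed known)
x1 = (2 * bits_1 - 1) * rotation             # user 1: BPSK along the rotated real axis
x2 = (2 * bits_2 - 1) * 1j * rotation        # user 2: BPSK along the axis 90 degrees away

noise = 0.1 * (rng.standard_normal(num_symbols) + 1j * rng.standard_normal(num_symbols))
y = x1 + x2 + noise                          # additive interference channel

z = y * np.conj(rotation)                    # undo the rotation
bits_1_hat = (z.real > 0).astype(int)        # user 1 lives entirely on the real part
bits_2_hat = (z.imag > 0).astype(int)        # user 2 lives entirely on the imaginary part

print(np.mean(bits_1_hat != bits_1), np.mean(bits_2_hat != bits_2))
```

Because the two signals occupy orthogonal axes, each user can be decoded by a simple projection and neither sees the other as interference.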
For the (2,2) system, the solution begins to become quite interesting. In this case, the
system finds a sort of super-position code, in which slightly skewed and phase-offset
4-QAM constellations are used by each user within each time-slot, and users alternate
opportunities as the high powered user. This is not necessarily an intuitive solution,
but inspecting the performance curve in figure 3.18, we see that it actually achieves
better performance than the obvious solution of purely orthogonal time-slotted QPSK.
For the (4,4) and (4,8) systems, this trend of pseudo-orthogonal super-position code
learning continues, but solutions begin to become increasingly complex and are hard to
gather significant intuition from. Inspecting the (4,4) code, we can see that each user
for each symbol uses a unique layout of 16-QAM to encode the 4 bits robustly across 4
symbols. The learned decoding process appears to be able to combine these decision
surfaces very effectively into a robust low-error rate system for both cases. For the
(4,8) system it is difficult to glean much from the constellation layouts, but we can
see that the clusters of non-standard 256-QAM points form roughly oval shaped layouts
where the major axes appear to be orthogonal.
The exciting nature of this approach to multi-user physical layer scheme design is that
it can seemingly readily be learned for virtually any rate configuration, information
density, impairment model, or other set of constraints introduced into the network
training process. This opens the door for highly efficient multi-user PHY schemes to be
heavily specialized to their deployment domain, impairment distributions, multi-user
configurations, and potentially higher level traffic patterns and requirements as well.
Significant work remains to be done to consider optimal fusion of higher level network
traffic requirements and source coding on top of the model presented here, as well as
scaling the models to additional impairment constraints and higher numbers of users.
3.4 Learning Multi-Antenna Diversity Channels
Many modern radios, such as today's LTE smart phones, do not use a single antenna element
for transmit or receive. In fact, the LTE E-UTRA Physical Layer [118, 119] has required
for several years that handsets (UEs) employ at least 2 receive antennas to allow for
decoding of 2x2 MIMO [120] modes of transmission. Many phones today actually support
4 antenna receive, standards are now discussing 8x8 modes as a reality in future devices,
and 5G test labs are evaluating techniques involving up to 128 base station antennas [121],
or even 500-1000 antennas in some cases. The motivation for this is clear: MIMO systems
have proven themselves to be invaluable both in extending range at the edge of coverage
areas by coding redundancy across multiple propagation modes, and in increasing the
achievable capacity in dense urban multi-path rich environments, where separate
information can be coded across multiple propagation modes in order to increase
aggregate throughput to a single user or to multiple users.
Today, state of the art methods for encoding information at the physical layer for the
MIMO channel typically rely on either open-loop (no CSI feedback) space-time block code
(STBC) [122] methods (the simplest being the Alamouti code [123]), or closed-loop (with
CSI feedback used for pre-coding) style spatial multiplexing [124] methods.
Figure 3.20: Open Loop MIMO Channel Autoencoder Architecture
First we consider the case of an open-loop MIMO system where no CSI is known at the
transmitter. We can structure this problem for an m_t transmit antenna and m_r receive
antenna system as an autoencoder as shown in figure 3.20, where each codeword is k bits
and spans n time-samples. Here, we encode some block of information s as before, using
a learned encoder, pass it through a channel model, and then recover an estimate ŝ from
the received signal y. The primary difference here is that x now takes the form of a 2D
m_t × n tensor for each example, and y takes the form of an m_r × n tensor for each example.
The process for complex MIMO Rayleigh channel matrix (H) generation and complex
valued tensor multiplication must be implemented in differentiable tensor form within
the channel impairment model, and then the same additive noise layer may be used to
impose SNR constraints.
Figure 3.21: Alamouti Coding Scheme for 2x1 Open Loop MIMO
Figure 3.22: Error Rate Performance of Learned Diversity Scheme. [Plot: 2x1 Spatial Diversity Code Comparison; bit error rate (BER) versus signal to noise ratio (dB); curves: 2x1 AE No CSI, 2x1 Alamouti]
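As a rough illustration of the channel layer described above, the following NumPy sketch draws a flat Rayleigh channel matrix H, applies the complex multiply, and adds noise at a requested SNR. It is only a forward-pass sketch under assumptions of our own: the function name `rayleigh_mimo_channel`, the choice of one H per block, and the SNR definition (relative to mean transmit sample power) are illustrative, and a trainable version would express the same operations in a differentiable tensor framework.

```python
import numpy as np

def rayleigh_mimo_channel(x, m_r, snr_db, rng):
    """Forward pass of y = H x + noise for one block.

    x is an (m_t, n) complex array. H has i.i.d. CN(0, 1) entries and is
    held fixed over the n time samples (an assumption of this sketch).
    """
    m_t, n = x.shape
    h = (rng.standard_normal((m_r, m_t))
         + 1j * rng.standard_normal((m_r, m_t))) / np.sqrt(2)
    # Noise power is set relative to the mean transmit sample power (assumed definition).
    signal_power = np.mean(np.abs(x) ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.sqrt(noise_power / 2) * (rng.standard_normal((m_r, n))
                                        + 1j * rng.standard_normal((m_r, n)))
    return h @ x + noise, h

rng = np.random.default_rng(0)
x = np.ones((2, 4), dtype=complex)          # placeholder 2-antenna, 4-sample block
y, h = rayleigh_mimo_channel(x, m_r=1, snr_db=10.0, rng=rng)
```

Returning H alongside y is a convenience for a receiver or baseline that assumes channel knowledge.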
We compare the bit error rate performance of the learned autoencoder-based 2x1 MIMO
scheme based on the model in figure 3.20 to the conventional Alamouti code which is also
an open-loop 2x1 code shown in figure 3.21.
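For reference, the Alamouti baseline can be sketched in a few lines. This is the textbook 2x1 construction with perfect channel knowledge at the receiver, not code from this work; the function names and the 1/sqrt(2) power split between antennas are assumptions made for the illustration.

```python
import numpy as np

def alamouti_encode(s1, s2):
    """Map two symbols onto 2 antennas (rows) over 2 time slots (columns)."""
    return np.array([[s1, -np.conj(s2)],
                     [s2,  np.conj(s1)]]) / np.sqrt(2)

def alamouti_combine(r, h):
    """Linear combining at a single receive antenna with known h = (h1, h2)."""
    h1, h2 = h
    r1, r2 = r
    gain = (abs(h1) ** 2 + abs(h2) ** 2) / np.sqrt(2)
    s1_hat = (np.conj(h1) * r1 + h2 * np.conj(r2)) / gain
    s2_hat = (np.conj(h2) * r1 - h1 * np.conj(r2)) / gain
    return s1_hat, s2_hat

rng = np.random.default_rng(1)
h = (rng.standard_normal(2) + 1j * rng.standard_normal(2)) / np.sqrt(2)
s1, s2 = (1 + 1j) / np.sqrt(2), (-1 + 1j) / np.sqrt(2)    # two QPSK symbols
r = h @ alamouti_encode(s1, s2)                            # noise-free receive samples
s1_hat, s2_hat = alamouti_combine(r, h)
```

In the noise-free case the combiner returns the transmitted symbols exactly, since the cross terms between the two symbols cancel.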
Results for open-loop are mixed, and not initially as favorable as prior results for
autoencoder or multi-user schemes. In both cases, we compare a (2x1,4) system, where two
QPSK symbols (4 bits) are encoded into two time-slots and one receive antenna.
Performance between the two schemes is similar, however we observe two distinct regions:
at low SNR the Alamouti scheme tends to outperform, providing lower bit error rates, while
at high SNR the learned scheme provides a 2-3 dB advantage for obtaining equivalent error
rates. An additional comparison incorporating error correction may make sense, such as a
(4x1,6) scheme where a 3/4 rate code is used to map 6 bits onto two (2x1,4) Alamouti code
words, while allowing the autoencoder to directly learn a solution to the (4x1,6) problem.
However, these results are promising enough to warrant further investigation and suggest
that strong open-loop schemes may be learned in a similar way.
Figure 3.23: 2x1 MIMO AE, Diagonal H
Figure 3.24: 2x1 MIMO AE, Random H
Inspecting the resulting constellations learned in figures 3.23 and 3.24, we observe that a
form of superposition code appears to be learned here as well to satisfy the average power
constraint. This is an interesting solution, but it suggests that different and/or possibly
better results could be obtained by introducing some kind of additional constraint to
incentivize equal power between transmit antenna symbols (as is the case for Alamouti).
This raises the question of whether the parameter search manifold for this problem has
several very large local minima, where in this case we have been pulled into one solution
which is sub-optimal despite the use of large networks, regularization, and infinite
(generative) training data.
3.5 Learning MIMO with CSI Feedback
In dense urban environments with many radio reflectors, spatial multiplexing modes
[124] and closed-loop MIMO are commonly used to increase throughput and improve
performance from multi-path propagation. These too can be represented through an
appropriate autoencoder architecture. Figure 3.25 illustrates an autoencoder architecture
for learning such a MIMO scheme which incorporates CSI (i.e. closed loop) into the
transmitter encoding process. Here we have collapsed the traditional radio transmitter
functions including FEC, modulation, and MIMO pre-coding all into a single encoder block
which is learned end-to-end with the channel and decoding processes.
We can structure the architecture here such that our random channel state, H, is passed
both to the channel impairment model (the complex multiply) and into the encoder
module, simply by concatenating it with the symbol to transmit s.
Figure 3.25: Closed Loop MIMO Learning Autoencoder Architecture
Training such a system, we can compare to a variety of baseline methods such as zero
forcing (ZF) or minimum mean square error (MMSE) methods for pre-coding. In this case,
we consider m_t = 2 and m_r = 2, which is the common 2x2 MIMO case used widely in LTE
and other systems, but still a relatively small scale MIMO system.
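To make the zero forcing baseline concrete, a minimal sketch of ZF pre-coding for a 2x2 channel with perfect CSI is given below. It follows the standard textbook form (channel pseudo-inverse at the transmitter); the per-vector power normalization and the name `zf_precode` are assumptions of this sketch rather than details taken from the baseline used here.

```python
import numpy as np

def zf_precode(h, s):
    """Zero-forcing pre-coding: pre-invert the channel at the transmitter.

    Returns the transmit vector x and the scale beta, so that h @ x == beta * s.
    """
    w = np.linalg.pinv(h)                             # pseudo-inverse of the channel
    x = w @ s
    beta = 1.0 / np.sqrt(np.mean(np.abs(x) ** 2))     # normalize mean transmit power to 1
    return beta * x, beta

rng = np.random.default_rng(0)
h = (rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))) / np.sqrt(2)
s = np.array([1 + 1j, -1 + 1j]) / np.sqrt(2)          # one QPSK symbol per stream
x, beta = zf_precode(h, s)
received = h @ x                                       # equals beta * s without noise
```

The scale beta shrinks when H is poorly conditioned, which is one intuition for why ZF loses SNR on unfavorable channel draws.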
Figure 3.26: Error Rate Performance of Learned 2x2 Scheme (Perfect CSI). [Plot: 2x2 Scheme Performance with Perfect CSI; bit error rate (BER) versus signal to noise ratio (dB); curves: 2x2 AE P-CSI, 2x2 Baseline]
In this case, the learned scheme compares quite favorably to the baseline method. We see
roughly a 5 dB improvement at a bit error rate (BER) of 10^-2 and a 10 dB improvement at a
BER of 10^-3, both substantial. Of course the baseline could improve significantly with the
introduction of error correction, but would have to give up some amount of information
rate to do so, making the learned system extremely appealing.
Figure 3.27: Closed Loop MIMO Autoencoder with Quantized Feedback
However, in the real world, MIMO systems can not and do not transmit real-valued chan-
nel estimates (H) over the air (e.g. between eNodeBs and UEs). Instead they typically
must minimize protocol overhead used for channel quality information (CQI)/CSI feed-
back, which has led to the adoption of techniques like p-bit codebooks which contain
compact discrete valued codes indicating distinct channel modes.
Considering this task of compact discrete valued CQI feedback representation as part of
the end-to-end communications system learning architecture, we can cast the problem as
shown in figure 3.27. Here, we introduce a discretization network (dis(H)), which encodes
the real valued channel estimate H (H is used in our work without estimation error) into
a v-bit discrete value with one-hot encoding over 2^v possible channel modes. This one-hot
encoding is then concatenated with s to form the input to the MIMO encoder/modulator.
This is quite exciting, as we have now cast the entire end-to-end problem of compact CSI
feedback, CSI-enhanced MIMO pre-coding, FEC encoding, modulation, over-the-air (OTA)
representation, MIMO combining, demodulation, and decoding all into one single learned
model which jointly optimizes all of these free parameters to maximize capacity for any
differentiable channel model.
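The data flow of the quantized feedback path can be sketched as follows. Here dis(H) is replaced by a hypothetical stand-in, a fixed linear scoring of the channel entries followed by a hard argmax, purely to show the shapes involved; in the architecture above this mapping is a learned network, and training it would need some differentiable surrogate for the hard selection that this sketch does not attempt. The one-hot representation of s and all function names are likewise assumptions.

```python
import numpy as np

def one_hot(index, size):
    out = np.zeros(size)
    out[index] = 1.0
    return out

def discretize_csi(h, weights):
    """Stand-in for dis(H): score 2**v channel modes and one-hot the best one."""
    features = np.concatenate([h.real.ravel(), h.imag.ravel()])
    scores = weights @ features                   # one score per channel mode
    return one_hot(int(np.argmax(scores)), weights.shape[0])

def encoder_input(s_index, k, h, weights):
    """Concatenate the one-hot message (2**k entries) with the v-bit CSI code."""
    return np.concatenate([one_hot(s_index, 2 ** k), discretize_csi(h, weights)])

rng = np.random.default_rng(0)
v, k = 2, 4                                       # 2-bit feedback, 4-bit message
h = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
weights = rng.standard_normal((2 ** v, 8))        # placeholder, untrained parameters
z = encoder_input(5, k, h, weights)               # length 2**k + 2**v
```

Only the index of the active channel mode (v bits) would need to cross the feedback link; the one-hot expansion can be rebuilt at the transmitter.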
Figure 3.28: Bit Error Rate Performance of Baseline ZF Method. [Plot: Baseline 2x2 Scheme Performance with Quantized CSI; bit error rate (BER) versus signal to noise ratio (dB); curves: 2x2 Baseline Perfect CSI, 8-bit CSI, 4-bit CSI, 2-bit CSI]
In figure 3.28 we illustrate the decline in performance when quantizing the real valued
H feedback values with the ZF 2x2 scheme. Here, real-values provide the best solution,
and while 8-bit CSI does not provide significant degradation, 4-bit and 2-bit CSI modes
are substantially degraded.
In stark contrast, we can easily train the autoencoder based system to learn a v-bit CSI
feedback mode which attempts to be optimal for any positive value of v. Figure 3.29
illustrates the performance curves of 1-bit, 2-bit, 4-bit, and 8-bit CSI feedback modes,
alongside perfect-CSI, the real-valued CSI feedback mode.
Figure 3.29: Bit Error Rate Performance Comparison of MIMO Autoencoder 2x2 Closed-Loop Scheme with Quantized CSI. [Plot: 2x2 Scheme Performance With Quantized CSI; bit error rate (BER) versus signal to noise ratio (dB); curves: 2x2 AE 1 Bit, 2 Bit, 4 Bit, 8 Bit, and P-CSI]
Interestingly, we obtain the best performance from a 2-bit feedback mode rather than
larger numbers of bits or continuous valued feedback. This is likely because 2-bit
feedback is enough to effectively generate a 4 entry code-book, whereas 1-bit is
insufficient for the number of codebook modes required, and greater numbers of bits or
continuous valued feedback require the encoder to learn a more complex manifold of
different or continuously varying encoder modes. Learning is significantly simpler and
training more rapid for a small but sufficient number of bits (e.g. v = 2).
Figure 3.30: Learned 2x2 Scheme, 1-bit CSI, Random Channels.
Figure 3.31: Learned 2x2 Scheme, 1-bit CSI, All-Ones Channel.
Figure 3.32: Learned 2x2 Scheme, 2-bit CSI, Random Channels.
Figure 3.33: Learned 2x2 Scheme, 2-bit CSI, All-Ones Channel.
Inspecting the learned constellations for the 1-bit and 2-bit CSI feedback MIMO channel
autoencoders under random channel conditions, and under even-power per channel path
(all 1’s) assumptions for the H matrix, in figures 3.30, 3.31, 3.32, and 3.33 we can see that
our best performing 2-bit scheme learns a set of non-standard 16-QAM transmit constel-
lations which combine to form a relatively constant modulus non-standard PSK kind of
ring arrangement at the receiver.
The system in figure 3.27 can be easily produced in simulation, where knowledge of H is
free. In a real world system, however, such a trained system would need to be deployed
as shown in figure 3.34, where an estimate Ĥ is produced at the receiver and used
to form a discrete v-bit embedding to feed back to the encoder. This feedback could be
included in digitally coded messages within a higher level media access control (MAC)
protocol.
Figure 3.34: Deployment Configuration for Quantized MIMO Autoencoder
3.6 System Identification Over the Air
The key problem with this approach, and its use in over the air systems, is that we have
relied on having a closed-form differentiable model for the channel during training. This
is a reasonable assumption if such a model can be built, but in the real world it may be
difficult to do so when faced with complex impairment distributions over a range of
different channel effects. Very recent published work realizing such a system over the
air [125] addresses this problem by only fine-tuning the receiver/decoder half of the
channel autoencoder using error feedback from OTA data. This is a partial solution, but it
does not allow the encoder or over the air representation to update to optimize for the
real over the air impairments.
In general this is still an open system identification [126] problem in which we desire to
fit a function to the OTA data permutation which is occurring in the wireless channel.
This is an important area of research when combined with the channel autoencoder to
allow the systems to truly adapt under heavy real world impairments. By approximat-
ing the transfer function in a way that its gradient can be computed or approximated
accurately, we can continue to train such systems end-to-end with a black box physical
transform in the middle. Our future work and prototype systems will seek to solve this
problem thoroughly in order to fully realize the power of channel autoencoders in the
real world. Recent work is beginning to mature the approach of gradient approximation
and back-propagation for black-box functions [127] which holds significant promise for
this problem.
Chapter 4
Learning to Label the Radio Spectrum
Interpreting and labeling the radio spectrum is a critical building block on which count-
less radio capabilities are built today, and will increasingly be built tomorrow. In its sim-
plest form, wireless channel estimation consumes some form of radio signal in time or
frequency and produces an estimate for some parameter of an emitted and impaired sig-
nal. This is used in wireless synchronization to estimate time of arrival, digital symbol
clock rates, carrier frequency and phase, as well as impulse response over the channel.
Larger scale radio labeling problems involve detection and identification of radio signal
emissions, information about physical emitters, changes in channel propagation condi-
tions, user access patterns, and countless other applications which may help inform spec-
trum regulators, dynamic spectrum access systems, wireless cyber-intrusion detection
and anomaly detection systems, or other spectrum monitoring applications.
For many years radio data labeling problems have been treated as highly niche estimation
tasks, where compact models of the emitter signal, compact (usually simplified) models
for the wireless channel, and an analytical estimator derivation process are used to
produce some analytic estimator expression. This has gotten us extremely far in the radio
and radio labeling domain; however, it has several key drawbacks relating to insufficient
model detail and unfavorably structured estimator algorithms. Radio signal models are
often simplified with respect to underlying data distributions, hardware impairments,
and other distortions. Radio channel models are almost always simplified by assuming
AWGN only, or by including only a compact, simplified fading model, often omitting other
real world impairments. Estimator derivation, on the other hand, often results in an
analytically convenient small expression whose algorithmic implementation may only be
considered or approximated later, when considering efficient implementation on available
compute hardware and/or instruction sets.
By leveraging deep learning based on large datasets for estimator and label learning, we
hope to demonstrate in this chapter how learned estimators, while merely serving as
approximations, can often outperform the traditional approach. They do so by
incorporating rich emitter and contextual information and rich and accurate channel
models, and by forcing approximations to take the form of highly efficient wide matrix
operations which map efficiently onto modern wide/concurrent compute platforms. The
result is improved accuracy and sensitivity, reduced power, weight, and size requirements
for resulting systems, and a greatly reduced amount of manual engineering time and cost
required to obtain good practical solutions to new estimation problems.
4.1 Learning Estimators from Data
Synchronization is the principal difficult task of any radio receiver or modem. Aligning
time, frequency, phase, and impulse response correctly for a received signal enables opti-
mal decoding of transmitted symbols and reception of digital transmissions. Two of the
most widely used estimators in any communications system are the timing estimator and
the carrier frequency estimator.
Traditionally, maximum a posteriori (MAP), maximum likelihood estimation (MLE), and
MMSE estimators are widely used for estimation of CSI values. We consider the canonical
task of timing and frequency recovery for a single carrier QPSK signal [128]. Here,
a common approach to carrier frequency offset (CFO) estimation is a fast Fourier
transform (FFT) based technique which estimates the frequency using a periodogram of the
m-th power of the received signal [129]. The frequency offset detected by this technique
is then given by (4.1).
$$\Delta f = \frac{F_s}{N \cdot m} \, \underset{f}{\arg\max} \left| \sum_{k=0}^{N-1} r^m[k] \, e^{-j 2 \pi k f / N} \right|, \qquad \left( -\frac{R_{sym}}{2} \le f \le \frac{R_{sym}}{2} \right), \tag{4.1}$$
where m is the modulation order, r[k] is the received sequence, R_sym is the symbol rate,
F_s is the sampling frequency, and N is the number of samples. The algorithm searches for
a frequency that maximizes the time average of the m-th power of the received signal over
various frequencies in the range −R_sym/2 ≤ f ≤ R_sym/2. Because the algorithm operates
in the frequency domain, the center frequency offset manifests as the maximum peak in
the spectrum of r^m[k]. Fig. 4.1 shows an example cyclic spectrum for a QPSK signal
with a 2500 Hz center frequency offset (and a baud rate of 100 ksym/sec), where the peak
indicates the center frequency offset for the burst.
Figure 4.1: CFO Expert Estimator Power Spectrum with simulated 2500 Hz offset
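The estimator in (4.1) can be sketched directly with an FFT. The example below is an illustration only: it uses unshaped (rectangular pulse) QPSK without noise, and the function name `estimate_cfo` is hypothetical. With the sampling parameters quoted in this chapter (F_s = 400 kHz, m = 4), the full FFT span divided by m already corresponds to ±R_sym/2 = ±50 kHz, so no extra range restriction is applied.

```python
import numpy as np

def estimate_cfo(r, fs, m=4):
    """Coarse CFO estimate: locate the spectral peak of r**m and divide by m."""
    n = len(r)
    spectrum = np.abs(np.fft.fft(r ** m))
    freqs = np.fft.fftfreq(n, d=1.0 / fs)     # bin center frequencies in Hz
    return freqs[np.argmax(spectrum)] / m

rng = np.random.default_rng(0)
fs, n, cfo = 400e3, 4096, 2500.0
symbols = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, n // 4)))
r = np.repeat(symbols, 4)                     # 4 samples per symbol, rectangular pulses
r = r * np.exp(2j * np.pi * cfo * np.arange(n) / fs)
estimate = estimate_cfo(r, fs)
```

Raising QPSK to the fourth power removes the data modulation and leaves a tone at four times the offset, which is why the peak location is divided by m. The resolution of the estimate is F_s/(N·m), about 24 Hz for this block size.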
We conduct timing offset estimation in the canonical way by using a matched filter on
the received sequence matched to a known preamble sequence. The time-offset which
maximizes the output of the matched filter’s convolution is then taken to be the time-
offset of the received signal. Matched filtering can be represented by (4.2)
$$y[n] = \sum_{k=-\infty}^{\infty} h[n-k] \, r[k], \tag{4.2}$$
where h[k] is the preamble sequence. The matched filter is known to be the optimal filter
for maximizing detection sensitivity in terms of SNR in the presence of additive stochastic
white noise.
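A minimal sketch of this matched filter timing estimator is shown below. It correlates the received samples against a known preamble and takes the offset with the largest output magnitude, so that an unknown carrier phase does not matter. The preamble here is 64 random QPSK symbols at one sample per symbol, which is an assumption for brevity rather than the pulse-shaped burst format used in the experiments.

```python
import numpy as np

def estimate_timing_offset(r, preamble):
    """Return the sample offset that maximizes the matched-filter output magnitude."""
    # np.correlate conjugates its second argument, which gives the matched filter.
    mf_out = np.correlate(r, preamble, mode="valid")
    return int(np.argmax(np.abs(mf_out)))

rng = np.random.default_rng(0)
preamble = np.exp(1j * (np.pi / 4 + np.pi / 2 * rng.integers(0, 4, 64)))
true_offset = 37
burst = np.concatenate([np.zeros(true_offset), preamble, np.zeros(50)])
burst = burst * np.exp(1j * rng.uniform(0, 2 * np.pi))        # unknown carrier phase
burst = burst + 0.1 * (rng.standard_normal(burst.size)
                       + 1j * rng.standard_normal(burst.size))
estimated_offset = estimate_timing_offset(burst, preamble)
```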
Our approximate, learned approach relies instead on constructing, training, and
evaluating an ANN based on a representative dataset. When relying on learned estimators,
much of the work and difficulty lies in generating a dataset which accurately reflects the
final usage conditions desired for the estimator. In our case, we produce numerous
examples of wireless emissions in complex baseband sampling with rich channel impairment
effects which are designed to match the intended real world conditions the system will
operate in. We associate target labels from ground truth for center frequency offset and
timing error, which are used to optimize the estimator.
To train an ANN model, we consider the minimization of MSE and log-cosine hyperbolic
(log-cosh) [130] and Huber loss functions (shown in table 3.2). The latter are known to
have improved properties in robust learning, which may benefit such a regression learn-
ing task on some datasets and tasks. In our initial experiments in this paper, we observe
the best quantitative performance using the MSE loss function which we shall use for the
remainder.
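For concreteness, the three regression losses can be written as follows. These are the standard textbook definitions and may differ in scaling from the forms in table 3.2; the Huber threshold `delta` is a free parameter chosen here for illustration.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def log_cosh_loss(y_true, y_pred):
    e = np.abs(y_pred - y_true)
    # log(cosh(e)) rewritten as |e| + log(1 + exp(-2|e|)) - log(2) to avoid overflow
    return np.mean(e + np.log1p(np.exp(-2 * e)) - np.log(2))

def huber_loss(y_true, y_pred, delta=1.0):
    e = np.abs(y_pred - y_true)
    quadratic = 0.5 * e ** 2                  # used for small errors
    linear = delta * (e - 0.5 * delta)        # used for large errors
    return np.mean(np.where(e <= delta, quadratic, linear))

y_true = np.array([0.0, 0.0])
y_pred = np.array([1.0, 3.0])
losses = (mse_loss(y_true, y_pred),
          log_cosh_loss(y_true, y_pred),
          huber_loss(y_true, y_pred))
```

Both log-cosh and Huber grow linearly for large errors, which is the robustness property referred to above; MSE grows quadratically and so weights outliers more heavily.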
We search over a large range of model architectures, using Adam [62] to perform gradient
descent to optimize each model's parameters based on our training dataset. This is
done by computing a loss function (e.g. L_MSE) and updating the weights of the neural
network model iteratively using back-propagation of loss gradients. More information
on the model search and selection process used is provided in chapter 5.3. This model
search and optimization process ideally seeks a model of minimal computational
complexity which achieves a satisfactory level of performance (the frontier of efficient
models represents a trade-off between model complexity and accuracy).

Table 4.1: ANN Architecture Used for CFO Estimation
Layer               Output dimensions
Input               (nsamp, 2)
Conv1D + ReLU       (variable, 32)
AveragePooling1D    (variable, 32)
Conv1D + ReLU       (variable, 128)
Conv1D + ReLU       (variable, 256)
Linear              1

Table 4.2: ANN Architecture Used for Timing Estimation
Layer               Output dimensions
Input               (2048, 2)
Conv1D + ReLU       (511, 32)
Conv1D + ReLU       (126, 64)
Conv1D + ReLU       (30, 128)
Conv1D + ReLU       (2, 256)
Dense + Linear      (1)
The ANN architectures used for our performance evaluation are shown in tables 4.1 and
4.2. Both are stacked convolutional neural networks with narrowing dimensions which map
noisy high dimensional raw time series data down to a compact single valued regression
output. In the case of the CFO estimation architecture shown in table 4.1, we find that an
average pooling layer works well to help improve performance and generalization of the
initial layer feature maps, while in the timing estimation architecture in table 4.2 no
pooling or max-pooling tends to work better. This makes sense on an intuitive level, as
CFO estimation distills all symbols received throughout the input into a best frequency
estimate, while timing, in a traditional matched filter sense, is typically derived from a
maximum response at a single offset.
We generate two different sets of data for evaluating the performance of the two
competing approaches. All generated data are based on QPSK bursts with equiprobable
independent and identically distributed (IID) symbols, shaped with a root-raised cosine
(RRC) filter with a roll-off β = 0.25 and a filter span of 6, and sampled at 400 kHz with
a symbol rate of 100 kHz. We consider 4 channel conditions: AWGN with no fading, and
three cases of Rayleigh fading with mean delay spreads of σ = 0.5, 1, and 2 samples.
Amplitude envelopes for a number of complex valued channel responses for each of these
delay spreads are shown in figure 2.2 to provide some visual insight into the impact of
Rayleigh fading effects at each of these delays. For the last case, inter-symbol
interference (ISI) is present in the data.
The first dataset generated is the timing dataset, in which we prepend each burst with
a known preamble of 64 symbols and with random noise samples at the same SNR as the
data portion of the burst. The duration of the prepended noise is drawn from U(0, 1.25),
in units of milliseconds. Additionally, a random phase offset drawn from U(0, 2π) is
introduced for each burst in the dataset.
The second dataset generated is the center frequency offset dataset, in which every
example burst has a center frequency offset drawn from U(−50e3, 50e3), in units of Hz.
The bounds of this distribution correspond to half the symbol rate, R_sym/2. Additionally,
a random phase offset drawn from U(0, 2π) is introduced for each burst in the dataset.
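A much simplified generator for one CFO training example is sketched below to show how the label and the (nsamp, 2) I/Q input fit together. It keeps the sample rate, symbol rate, offset range, and phase distribution stated above, but omits RRC pulse shaping and Rayleigh fading, so it is an AWGN-only illustration with rectangular pulses and not the generator used for the reported results. The function name and the unit-power SNR convention are assumptions.

```python
import numpy as np

FS = 400e3                 # sample rate in Hz (from the text)
RSYM = 100e3               # symbol rate in Hz (from the text)
SPS = int(FS / RSYM)       # samples per symbol

def make_cfo_example(n_samp, snr_db, rng):
    """Return (iq, cfo): an (n_samp, 2) real array and its ground-truth offset in Hz."""
    n_sym = -(-n_samp // SPS)                                   # ceiling division
    idx = rng.integers(0, 4, n_sym)
    symbols = np.exp(1j * (np.pi / 4 + np.pi / 2 * idx))        # IID QPSK symbols
    x = np.repeat(symbols, SPS)[:n_samp]                        # rectangular pulses (simplification)
    cfo = rng.uniform(-RSYM / 2, RSYM / 2)
    phase = rng.uniform(0, 2 * np.pi)
    t = np.arange(n_samp) / FS
    x = x * np.exp(1j * (2 * np.pi * cfo * t + phase))
    noise_std = np.sqrt(10 ** (-snr_db / 10) / 2)               # unit signal power assumed
    x = x + noise_std * (rng.standard_normal(n_samp)
                         + 1j * rng.standard_normal(n_samp))
    return np.stack([x.real, x.imag], axis=-1), cfo

rng = np.random.default_rng(0)
iq, label = make_cfo_example(1024, snr_db=10.0, rng=rng)
```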
These datasets are generated for SNRs of 0 dB, 5 dB, and 10 dB, and for an AWGN channel
and three different Rayleigh fading channels with different mean delay spread values
(0.5, 1, and 2) representing different levels of reflection in a given wireless channel
environment. We store the timing offset and center frequency offset labels as ground
truth for training and evaluation.
For each dataset generated above we optimize network weights using Adam [62] for 100
epochs, starting from a learning rate of 1e-3 and halving it after every 10 epochs
with no reduction in validation loss, ultimately using the parameters corresponding to
the epoch with the lowest validation loss. We then compute the test error on a separate
data partition, comparing the ground truth labels for timing and center frequency offset
against the values predicted by both the expert and the deep learning/ANN based
estimators. The mean absolute error (MAE) of the estimator is used as our metric for
comparison.
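The schedule and metric described above can be expressed independently of any training framework. The sketch below assumes one plausible reading of the plateau rule (halve the rate once 10 consecutive epochs pass without a new best validation loss, then restart the count); the function names are hypothetical and the validation losses are made-up numbers used only to exercise the logic.

```python
import numpy as np

def plateau_schedule(val_losses, lr0=1e-3, factor=0.5, patience=10):
    """Return the learning rate used at each epoch and the index of the best epoch."""
    lr, best, wait, lrs = lr0, np.inf, 0, []
    for loss in val_losses:
        lrs.append(lr)
        if loss < best:
            best, wait = loss, 0              # new best: reset the patience counter
        else:
            wait += 1
            if wait >= patience:              # plateau reached: reduce the rate
                lr, wait = lr * factor, 0
    return lrs, int(np.argmin(val_losses))

def mean_absolute_error(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true))))

val_losses = [1.0] + [2.0] * 10 + [0.5]       # illustrative values only
lrs, best_epoch = plateau_schedule(val_losses)
mae = mean_absolute_error([0.0, 0.0], [1.0, -3.0])
```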
In the timing estimation comparison, we show estimator MAE results in figure 4.2 for
each model AWGN(τ, χ) and Fading(τ, χ), where τ is the mean delay spread and χ is the
SNR. Inspecting these results, we can see that the traditional matched filter (MF)/MLE
achieves excellent performance under the AWGN channel model. Under the fading channel
models, however, the MF/MLE baseline accuracy degrades significantly, as a simple matched
filter MLE timing estimation approach has no ability to compensate for the expected range
of channel delay spreads. In this comparison the machine learning / artificial neural
network (ML/ANN) estimator cannot, on average, attain equivalent performance in all or
even most cases. However, it does attain an MAE within the same order of magnitude, and
in some fading cases achieves a lower MAE than the baseline.
Figure 4.2: Timing Estimation MAE Comparison
Quantitative results for estimation of center frequency offset error are shown in figures
4.3, 4.4, 4.5, and 4.6, summarizing the performance of both the baseline MLD method
(dashed lines) and the ML/ANN method (solid lines). We compare the mean absolute center
frequency estimate error for each method over a range of estimator input block lengths.
As moment based methods generally improve for longer block sizes, we compare performance
over a range from short-time examples to longer-time examples.
In the AWGN case, in figure 4.3 we can see that for the 5 and 10 dB SNR cases, by the time
we reach a block size of 1024 samples the baseline estimator is doing quite well, and
for larger block sizes (above 1024 samples) with SNR of at least 5 dB, performance of the
baseline method is generally better. However, even in the AWGN case, for small block
sizes we are able to achieve lower error using the ML/ANN approach, even at low SNR
levels of near 0 dB.
Figure 4.3: Mean CFO Estimation Absolute Error for AWGN Channel. [Plot: CFO MAE under AWGN Channel; estimator MAE (Hz) versus block size (samples); legend entries include ML/ANN Estimator and MAP Estimator]
Figure 4.4: Mean CFO Estimation Absolute Error (Fading σ=0.5). [Plot: CFO MAE under Light Fading; estimator MAE (Hz) versus block size (samples); ML/ANN Estimator and MAP Estimator at 0 dB, 5 dB, and 10 dB]
In the cases of fading channels shown in figures 4.4, 4.5, and 4.6, we can see that
performance of the baseline estimator degrades enormously from the AWGN case under which
it was derived when delay spread is introduced. Performance gets progressively worse as σ
increases from 0.5 to 2 samples of mean delay spread. For the ML/ANN estimator, we also
see a degradation of estimator accuracy as delay spread increases, but the effect is not
nearly as dramatic: error ranges from 3.4 to 23254 Hz in the MLD case (almost a 7000x
increase in error) versus 2027 to 3305 Hz in the ML/ANN case (around a 1.6x increase in
error).
From an accuracy standpoint, these results are quite interesting. We do not see significant
improvement in timing estimation here against a matched filter; however, for frequency
estimation we see significant potential gains both for short-time estimators and for
estimation under heavily impaired fading channel environments, where the AWGN assumptions
used during derivation fail. This result helps illustrate how approximate, data centric
learned models can often outperform toy analytic solutions in cases where the simplified
model assumptions do not hold and where the degrees of freedom are too high to allow
for accurate and efficient closed form solutions.
Figure 4.5: Mean CFO Estimation Absolute Error (Fading σ=1). [Plot: CFO MAE under Medium Fading; estimator MAE (Hz) versus block size (samples); legend entries include ML/ANN Estimator and MAP Estimator]
Figure 4.6: Mean CFO Estimation Absolute Error (Fading σ=2). [Plot: CFO MAE under Heavy Fading; estimator MAE (Hz) versus block size (samples); ML/ANN Estimator and MAP Estimator at 0 dB, 5 dB, and 10 dB]
4.2 Learning to Identify Modulation Types
One of the canonical tasks in radio estimation and detection is that of radio signal
modulation identification. In radio sensing systems such as DSA systems [50], as well as
in spectrum regulatory enforcement and other monitoring systems, signal modulation
identification is often the first step towards identifying the emitter or the protocol
used by an emitter, and towards being able to communicate with or monitor it. This task
can be treated simply as a classification problem among possible transmission modes
(although this is a simplification of the possible hierarchical classification problem
among emitter parameters). Significant literature exists on prior methods for performing
radio signal type classification, using analytically derived decision boundaries as well
as compact learned decision criteria from earlier machine learning methods such as
decision trees (DTrees) or SVMs.
Our early work in this area, conducted in 2015 and first published publicly in 2016 [131],
has received significant attention, spurring international interest, numerous derivative
and related works at the IEEE DySpan 2017 Mod-Rec workshop [132, 133, 134, 135, 136, 137,
138] and elsewhere, DARPA's RF Machine Learning Systems Program, DARPA's
Battle-of-the-ModRecs Challenges, and parts of the DARPA Spectrum Collaboration Challenge
(SC2), along with spurring internal research programs at numerous companies.
Our basic approach, relying on end-to-end feature learning on raw In-phase and Quadrature
(I/Q) data, remains the same, but a number of techniques and methods have been
improved upon since the original paper [131], leading to significant improvements in
detection sensitivity, power efficiency, and generality of such systems. Numerous
drawbacks of this approach, however, cannot be ignored. The need for labeled
data, robust and realistic datasets, and comprehensive metrics for comparison cannot be
overstated, and these often limit the performance attainable for a given problem. Initial
attempts to address these needs by open sourcing classifiers, datasets/generators
[139], and metrics/scores were welcomed by a few, but have not been heavily adopted or
contributed to by many publishing in the field. The radio signal processing community
still has a long way to go to embrace data science in the way that has become the norm
in computer vision and many other disciplines. High quality public datasets from more
high profile institutions such as DARPA or NSF would be a significant help in facilitating
this some day.
4.2.1 Expert Features for Modulation Recognition (Baseline)
Modulation recognition has long been used as a toy problem in the radio estimation and
detection world [140, 141, 142, 15, 143, 144, 138]. It sees some usage in spectrum
monitoring applications, but is not widely deployed or necessary in many widely deployed
communications systems.
Early work on this problem relies on analytically derived statistics and decision
thresholds, typically derived probabilistically from a simplified analytic signal model
(we refer to these as expert methods, i.e. written explicitly by an expert in the domain).
Figure 4.7 (from [15]) illustrates one such traditional modulation recognition process for
a digitally modulated radio signal. Here a series of statistics (v_n) are compared to a
series of analytically derived decision thresholds (η_n), and a rigid analytically formed
decision tree is used to make a modulation recognition decision.
Figure 4.7: Traditional Approach to Modulation Recognition, from [15]
For our baseline features in this work, we leverage a number of compact higher order
statistics (HOSs). To obtain these we compute the higher order moments (HOMs) using
the expression given below:
$$M(p, q) = E\left[x^{p-q}\,(x^{*})^{q}\right] \tag{4.3}$$
From these HOMs we can derive a number of higher order cumulants (HOCs) which
have been shown to be effective discriminators for many modulation types [145]. HOCs
can be computed combinatorially using HOMs, each expression varying slightly; below
we show one example such expression for the C(4, 0) HOC.
$$C(4, 0) = \sqrt{M(4, 0) - 3 \times M(2, 0)^{2}} \tag{4.4}$$
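As a minimal sketch of equations 4.3 and 4.4 (not the feature extraction code used in this work), the expectation can be replaced by a sample mean over one example. The function names `hom` and `c40` and the noise-free test symbols are assumptions introduced only for illustration.

```python
import numpy as np

def hom(x, p, q):
    """Sample estimate of M(p, q) = E[x^(p-q) * conj(x)^q] (Eq. 4.3)."""
    x = np.asarray(x, dtype=complex)
    return np.mean(x ** (p - q) * np.conj(x) ** q)

def c40(x):
    """C(4, 0) as written in Eq. 4.4, including the square root."""
    return np.sqrt(hom(x, 4, 0) - 3.0 * hom(x, 2, 0) ** 2)

# Hypothetical noise-free symbols, not drawn from the thesis dataset.
bpsk = np.array([1, -1, 1, -1], dtype=complex)
qpsk = np.exp(1j * (np.pi / 4 + np.pi / 2 * np.arange(8)))

bpsk_c40 = abs(c40(bpsk))  # M(4,0) = 1, M(2,0) = 1, so |sqrt(1 - 3)|
qpsk_c40 = abs(c40(qpsk))  # M(4,0) = -1, M(2,0) = 0, so |sqrt(-1)|
```

The magnitudes differ between the two constellations, which is the kind of separation that makes such statistics usable as decision features.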
Additionally, we consider a number of analog features which capture other useful
statistical behaviors. These include the mean, standard deviation, and kurtosis of the
normalized centered amplitude, the centered phase, the instantaneous frequency, the
absolute normalized instantaneous frequency, and several others which have been shown to
be useful in prior work [146].
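The sketch below illustrates how three of these analog statistics could be computed for one complex example. The exact definitions in [146] are not reproduced here, so the normalization choices (amplitude divided by its mean minus one, unwrapped phase with its mean removed, phase difference in cycles per sample) and the function names are assumptions.

```python
import numpy as np

def mean_std_kurtosis(v):
    """Mean, standard deviation, and (non-excess) kurtosis of a real sequence."""
    v = np.asarray(v, dtype=float)
    mu, sigma = v.mean(), v.std()
    # Guard against (near) constant sequences, where kurtosis is undefined.
    kurt = np.mean((v - mu) ** 4) / sigma ** 4 if sigma > 1e-12 else 0.0
    return [mu, sigma, kurt]

def analog_features(x):
    """Nine illustrative features: 3 statistics each of amplitude, phase, frequency."""
    x = np.asarray(x, dtype=complex)
    amp = np.abs(x)
    amp_cn = amp / amp.mean() - 1.0             # assumed "normalized centered amplitude"
    phase = np.unwrap(np.angle(x))
    phase_c = phase - phase.mean()              # assumed "centered phase"
    inst_freq = np.diff(phase) / (2.0 * np.pi)  # cycles per sample
    return np.array(mean_std_kurtosis(amp_cn)
                    + mean_std_kurtosis(phase_c)
                    + mean_std_kurtosis(inst_freq))

# Hypothetical test input: a complex tone at 0.1 cycles per sample.
tone = np.exp(2j * np.pi * 0.1 * np.arange(256))
feats = analog_features(tone)
```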
Machine learning is also considered for the decision making based on these sets of fea-
tures. SVM and DTree are two commonly used methods which can be trained on the
low-dimensional feature space in order to derive an optimized set of decision criteria.
Prior work has generally used machine learning and pattern recognition on simpler sets
of features such as those described above. However, results have also been shown using
more complex features such as the auto-correlation function (ACF), the SCF, or
the α-profile (a one dimensional cut of the SCF) [147]. In our case, we compare instead
to the full dimensional input samples without imparting expert design about what form
features should take.
4.2.2 Time series Modulation Classification With CNNs
CNN layers have the useful property that layer parameters (weights) correspond to
specific filters or kernels which are evaluated at regular shift intervals across the input
values, limiting the parameter count while enforcing weight re-use across time shifts. This
key feature is well suited to any input domain where translation invariance is
appropriate. In imagery, learning arbitrary 2D shifts of where an object occurs along an
image’s X and Y axes can be greatly simplified by ensuring that the same feature weights
are used to form activations at all shifts in the input using a convolutional layer. This
property is also extremely similar to the properties of linear time invariant (LTI)
systems, which are widely used to model radio communications systems as 1D time series
constructs. Because radio signals may arrive with random time offsets and consist of
primitive objects such as symbols, which occur randomly in time to form a hierarchical
structure, CNNs are well suited to learning low level time-domain features or basis
functions for representing them. In fact, we already know and use this structure heavily
in communications, as we have used matched filters for preamble detection, symbol
detection and decisions, and many other purposes throughout the history of
communications. The primary differences are that we optimize filter weights during the
training process rather than using pre-defined weights, we often use large hierarchies of
multiple convolutional layers, and these layers often have many different filter channels
operating in parallel to form higher feature-space representations.
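The link between a convolutional layer and a matched filter can be illustrated with a short sketch. Here a single fixed kernel (a Barker-7 sequence, chosen only as an example preamble) is slid across the input exactly as one CNN filter would be, and delaying the input delays the response by the same amount. The helper name `conv1d_valid` and the test signal are assumptions for illustration.

```python
import numpy as np

def conv1d_valid(x, w):
    """Slide one shared kernel w over x (cross-correlation, as in CNN layers)."""
    k = len(w)
    return np.array([np.vdot(w, x[i:i + k]) for i in range(len(x) - k + 1)])

# Hypothetical preamble (Barker-7) embedded at a known offset in a silent buffer.
preamble = np.array([1, 1, 1, -1, -1, 1, -1], dtype=complex)
x = np.zeros(64, dtype=complex)
offset = 20
x[offset:offset + len(preamble)] = preamble

# Using the preamble itself as the kernel gives a matched filter.
y = conv1d_valid(x, preamble)
peak = int(np.argmax(np.abs(y)))

# Delaying the input by 5 samples delays the filter response by 5 samples.
y_shift = conv1d_valid(np.roll(x, 5), preamble)
peak_shift = int(np.argmax(np.abs(y_shift)))
```

In a trained CNN the kernel values would be learned rather than fixed, and many such kernels would run in parallel as separate filter channels.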
Building upon key trends discussed in more depth in chapter 1.3, the raw CNN approach
to modulation recognition leverages the relatively recent abilities of training algorithms,
network architectures, and computational platforms to directly train using an end-to-end
feature learning approach on high dimensional raw radio time series as an alternative to
trying to pre-engineer specific features such as statistical moments, cyclic moments, or
other manually derived distillations of information. In both [131] and [111] we explore
this approach in depth. Here we rely on a convolutional neural network on time series data
to learn a deep network capable of performing robust classification of radio modulation
types with random data.

Table 4.3: Layout for our 10 modulation CNN modulation classifier
Layer                                          Output dimensions
Input                                          2 × 128
Convolution (128 filters, size 2 × 8) + ReLU   128 × 121
Max Pooling (size 2, strides 2)                128 × 60
Convolution (64 filters, size 1 × 16) + ReLU   64 × 45
Max Pooling (size 2, strides 2)                64 × 22
Flatten                                        1408
Dense + ReLU                                   128
Dense + ReLU                                   64
Dense + ReLU                                   32
Dense + softmax                                10
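The output dimensions in table 4.3 are consistent with unpadded, stride-1 convolutions followed by non-overlapping pooling. That padding and stride choice is an inference from the table, not something stated in the text, and the helper names below are hypothetical. The arithmetic can be checked as follows.

```python
def conv_len(n, kernel):
    """Output length of an unpadded ('valid'), stride-1 convolution."""
    return n - kernel + 1

def pool_len(n, size=2, stride=2):
    """Output length of max pooling without padding."""
    return (n - size) // stride + 1

n = 128               # time samples per example; the 2 x 8 kernel spans both I/Q rows
n1 = conv_len(n, 8)   # first convolution
n2 = pool_len(n1)     # first max pooling
n3 = conv_len(n2, 16) # second convolution
n4 = pool_len(n3)     # second max pooling
flat = 64 * n4        # flatten 64 channels of length n4
```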
The only pre-processing used is to enforce zero mean and unit variance on the raw signal
input vector, so that examples are nicely scaled to facilitate learning. In some cases, we
only enforce unit variance since certain classes are only differentiated by their mean shift
(e.g. analog modulations with and without a carrier at DC).
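A minimal sketch of this pre-processing step is given below, with a flag covering the variance-only case. The function name `normalize` and the per-example treatment of the complex vector are assumptions; the text does not specify whether I and Q are scaled jointly or separately.

```python
import numpy as np

def normalize(x, remove_mean=True):
    """Scale a complex example to unit variance, optionally removing its mean."""
    x = np.asarray(x, dtype=complex)
    sigma = x.std()          # for complex input, NumPy uses |x - mean| internally
    if remove_mean:
        x = x - x.mean()
    return x / sigma if sigma > 0 else x

# Hypothetical example with a strong DC component (e.g. a carrier at DC).
rng = np.random.default_rng(0)
raw = 3.0 + rng.standard_normal(1000) + 1j * rng.standard_normal(1000)

both = normalize(raw)                          # zero mean, unit variance
var_only = normalize(raw, remove_mean=False)   # unit variance, DC term retained
```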
As is widely done for image classification, we adopt a narrowing series of convolutional
layers followed by dense/fully-connected layers and terminated with a dense softmax
layer for our classifier (similar to a VGG architecture [148]). The dataset¹ for this
benchmark consists of 1.2 M sequences of 128 complex-valued baseband I/Q samples
corresponding to ten different digital and analog single-carrier modulation schemes
(amplitude modulation (AM), frequency modulation (FM), PSK, QAM, etc.) that have gone
through a wireless channel with harsh impairments including multi-path fading and both
clock and carrier rate offset [131]. The samples are taken at 20 different SNRs within the
range from −20 dB to 18 dB.
¹ RML2016.10b: https://radioml.com/datasets/radioml-2016-10-dataset/
Figure 4.8: 10 Modulation CNN performance comparison of accuracy vs SNR
[Plot: correct classification probability (0 to 1) vs. SNR; curves: CNN, Boosted Tree,
Single Tree, Random Guessing]
In Fig. 4.8, we compare the classification accuracy of the CNN against that of extreme
gradient boosting with 1000 estimators, as well as a single scikit-learn decision tree [149],
operating on a mix of 16 analog and cumulant expert features as proposed in [146] and
[145]. The short-time nature of the examples places this task on the difficult end of the
modulation classification spectrum since we cannot compute expert features with high
stability over long periods of time. The CNN outperforms the boosted feature-based
classifier by around 4 dB in the low to medium SNR range while the performance at high
SNR is similar. Performance in the single tree case is about 6 dB worse than the CNN at
medium SNR and 3.5% worse at high SNR.
Figure 4.9: Confusion matrix of the CNN (SNR = 10 dB)
[Confusion matrix: ground truth vs. prediction over 8PSK, AM-DSB, BPSK, CPFSK, GFSK,
PAM4, QAM16, QAM64, QPSK, WBFM; color scale 0.0 to 1.0]
Fig. 4.9 shows the confusion matrix for the CNN at SNR = 10 dB, revealing confusion
between QAM16 and QAM64 and between Wideband FM (WBFM) and double-sideband AM
(AM-DSB). Despite the high SNR, classification is imperfect due to the other impairments
described above. The distinction between AM-DSB and WBFM is additionally complicated by
the small observation window (0.64 ms of modulated speech per example) and low
information rate with frequent silence between words. Discriminating between QAM16 and
QAM64 also suffers from short-time observations over only a few symbols, since the
constellations are higher order and share common points. The accuracy of the feature-based
classifier saturates at high SNR for the same reasons, and neither classifier reaches a
perfect score on this dataset. In [150], the authors report on a successful application of
a similar CNN for the detection of black hole mergers in astrophysics from noisy
time-series data.
4.2.3 Deep Residual Network Time-series Modulation Classification
Since the publication of our original CNN based signal identification work [131, 111],
described in the previous section, numerous advances have been made in neural network
architecture with significant implications for structuring CNN solutions to the modulation
recognition problem. Key among these are residual networks [4], batch normalization [41],
self-normalizing networks [73], and the use of deep dilated convolutional architectures
[2]. In this section, we detail updated results leveraging these techniques, considering
performance over the air, and improving our synthetic dataset performance, while providing
performance trade-off comparisons detailing the impact of a number of factors.
Dataset Structure and Improvements
It became clear from the dataset in [139] and prior datasets that streaming models with
coherent channel impairments were not appropriate for training. Randomly drawing many
examples with independent channel state, rather than adjacent examples with correlated
channel state, provided significant gain and realism for the problem. To better
characterize the distribution of the data, we introduce the random variables in table 4.4,
each IID for every independent training example. The training data synthesis model is
illustrated in figure 4.10.

Table 4.4: Random Variable Initialization
Random Variable    Distribution
α                  U(0.1, 0.4)
∆t                 U(0, 16)
∆fs                N(0, σclk)
θc                 U(0, 2π)
∆fc                N(0, σclk)
H                  Σ_i δ(t − Rayleigh_i(τ))
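A sketch of drawing one independent set of the scalar variables in table 4.4 per training example is shown below. It assumes that the second argument of N(0, σclk) is a standard deviation, and it omits the channel response H because its notation is not specified in enough detail here to reproduce. The dictionary keys and the function name are hypothetical.

```python
import numpy as np

def draw_impairments(rng, sigma_clk=0.0001):
    """Draw one IID set of the scalar random variables listed in Table 4.4."""
    return {
        "alpha": rng.uniform(0.1, 0.4),
        "delta_t": rng.uniform(0.0, 16.0),
        "delta_fs": rng.normal(0.0, sigma_clk),    # sigma_clk assumed to be a std. dev.
        "theta_c": rng.uniform(0.0, 2.0 * np.pi),
        "delta_fc": rng.normal(0.0, sigma_clk),
    }

rng = np.random.default_rng(0)
# A fresh, independent draw for every training example.
params = [draw_impairments(rng) for _ in range(1000)]
```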
Figure 4.10: System for modulation recognition dataset signal generation and synthetic
channel impairment modeling
We consider two different compositions of the dataset. The first is a “Normal” dataset,
which consists of 11 classes which are all relatively low information density and are
commonly seen in impaired environments. These 11 signals represent a relatively simple
classification task at high SNR in most cases, somewhat comparable to the canonical MNIST
digits. The second is a “Difficult” dataset, which contains all 24 modulations.
These include a number of high order modulations (QAM256 and APSK256), which are
used in the real world in very high-SNR, low-fading channel environments such as on line
of sight (LOS) impulsive satellite links [151] (e.g. DVB-S2X). We, however, apply
impairments beyond those one would expect to see in such a scenario and consider only
relatively short-time observation windows for classification, where the number of samples
is ℓ = 1024. Short time classification is a hard problem since decision processes
cannot wait and acquire more data to increase certainty. This is the case in many real
world systems when dealing with short observations (such as when rapidly scanning a
receiver) or short signal bursts in the environment. Under these effects, with low SNR
examples (from −20 dB to +30 dB Es/N0), one would not expect to be able to achieve
anywhere near 100% classification rates on the full dataset, making it a good benchmark
for comparison and future research.
The specific modulations considered within each of these two dataset types are as follows:
• Normal Classes: OOK, 4ASK, BPSK, QPSK, 8PSK, 16QAM, AM-SSB-SC, AM-DSB-
SC, FM, GMSK, OQPSK
• Difficult Classes: OOK, 4ASK, 8ASK, BPSK, QPSK, 8PSK, 16PSK, 32PSK, 16APSK,
32APSK, 64APSK, 128APSK, 16QAM, 32QAM, 64QAM, 128QAM, 256QAM, AM-
SSB-WC, AM-SSB-SC, AM-DSB-WC, AM-DSB-SC, FM, GMSK, OQPSK
The raw datasets will be made available on the RadioML website² after publication.
² https://radioml.org
Over the air dataset generation
In addition to simulating wireless channel impairments, we also implement an OTA
test-bed in which we modulate and transmit signals using a USRP [152] B210 SDR. We
use a second B210 (with a separate free-running local oscillator (LO)) to receive these
transmissions in the lab, over a relatively benign indoor wireless channel on the 900 MHz
ISM band. These radios use the Analog Devices AD9361 [153] radio frequency integrated
circuit (RFIC) as their radio front-end and have an LO that provides a frequency (and
clock) stability of around 2 parts per million (PPM). We off-tune our signal by around 1
MHz to avoid the DC signal impairment associated with direct conversion, but store signals
at base-band (offset only by LO error). Received test emissions are stored unmodified
along with ground truth labels for the modulation from the emitter. Figure 4.11 illustrates
the hardware recording architecture used for our data capture, and the picture in figure
4.12 shows the actual hardware used for data capture, training and evaluation.
Baseline classification approach
Our baseline method leverages the list of HOMs and other aggregate signal behavior
statistics given in table 4.5. Here we compute each of these statistics over each 1024
sample example and translate the example into feature space, a set of real values
associated with each statistic for the example. This new representation reduces the
dimension of each example from R^{1024×2} to R^{28}, making the classification task much
simpler but also discarding the vast majority of the data. We use an ensemble model of
gradient boosted trees (XGBoost) [154] to classify modulations from these features, which
significantly outperforms a single decision tree or SVM on the task. (We additionally
evaluated methods including SVM [32], Naive Bayes, k-Nearest Neighbor, and deep neural
network (DNN) on feature data in [131, 111], but ultimately XGBoost offered the strongest
performing feature-based classification approach, which is why we focus on it here.)

Figure 4.11: Over the air capture system diagram
Figure 4.12: Picture of over the air lab capture and training system

Table 4.5: Features Used
Feature Name
M(2,0), M(2,1)
M(4,0), M(4,1), M(4,2), M(4,3)
M(6,0), M(6,1), M(6,2), M(6,3)
C(2,0), C(2,1)
C(4,0), C(4,1), C(4,2)
C(6,0), C(6,1), C(6,2), C(6,3)
Additional analog (section 4.2.1)
Deep Learning based classification approaches
We evaluate and tune two classes of networks, first a VGG-style CNN using max-pooling
shown in table 4.6, and second, a residual network leveraging dilated convolutions ap-
propriate for time series radio signals and self-normalizing fully connected layers to map
residual/CNN features to outputs, shown in table 4.7.
In [148], the question of how to structure such networks is explored, and several basic
design principles for "VGG" networks are introduced (e.g. filter size is minimized at 3x3,
smallest size pooling operations are used at 2x2). Following this approach has generally
led to a straightforward way to construct CNNs with good performance. We adapt the
VGG architecture principles to a 1D CNN, improving upon the similar networks in [131,
111]. This represents a simple DL CNN design approach which can be readily trained
and deployed to effectively accomplish many small radio signal classification tasks.
Figure 4.13: Example graphic of high level feature learning based residual network
architecture for modulation recognition
As network algorithms and architectures have improved since AlexNet, they have made
the effective training of deeper networks using more and wider layers possible, leading
to improved performance. In the computer vision space, the idea of deep residual
networks has become increasingly effective [4]. In a deep residual network, as shown
in figure 4.20, the notion of skip or bypass connections is used heavily, allowing
features to operate at multiple scales and depths through the network. This has led to
significant improvements in computer vision performance, and has also been used
effectively on time-series audio data [2]. In [155], the use of residual networks for
time-series radio classification is investigated, and seen to train in fewer epochs, but
not to provide significant performance improvements in terms of classification accuracy.
We revisit the problem of modulation recognition with a modified residual network and
obtain improved performance when compared to the CNN on this dataset. A high level
depiction of this architecture is shown in figure 4.13. The basic residual unit and stack
of residual units are shown in figure 4.20, while the complete network architecture for
our best architecture for ℓ = 1024 is shown in table 4.7. We also employ self-normalizing
neural networks [73] in the fully connected region of the network, employing the SELU
activation function [73], mean-response scaled initializations (MRSA) [156], and Alpha
Dropout [73], which provide a slight improvement over conventional ReLU performance.
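The skip connection idea can be sketched in a few lines of NumPy. This is a generic residual unit (convolution, ReLU, linear convolution, then addition of the input), not a reproduction of the exact unit in figure 4.20. The layer ordering, zero padding, and the function names are assumptions.

```python
import numpy as np

def conv_same(x, w):
    """1D convolution with zero padding.

    x: (channels_in, length); w: (channels_out, channels_in, k) with odd k.
    """
    c_out, _, k = w.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    out = np.zeros((c_out, x.shape[1]))
    for t in range(x.shape[1]):
        # Contract the kernel against the window over (channels_in, k).
        out[:, t] = np.tensordot(w, xp[:, t:t + k], axes=([1, 2], [0, 1]))
    return out

def residual_unit(x, w1, w2):
    """y = x + F(x), where F is conv -> ReLU -> conv."""
    h = np.maximum(conv_same(x, w1), 0.0)
    h = conv_same(h, w2)
    return x + h  # skip (bypass) connection

# Hypothetical input: 2 channels, 6 time steps.
x_demo = np.arange(12.0).reshape(2, 6)
w_zero = np.zeros((2, 2, 3))
y_identity = residual_unit(x_demo, w_zero, w_zero)  # F(x) = 0, so the unit passes x through

rng = np.random.default_rng(0)
y_random = residual_unit(x_demo, rng.standard_normal((2, 2, 3)),
                         rng.standard_normal((2, 2, 3)))
```

With all-zero weights the unit reduces to the identity, which illustrates why stacking many such units does not by itself prevent the input from reaching deeper layers.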
Significant tuning time was spent optimizing both networks, and a collection of different
trade studies are shown below. A thorough analysis of the hundreds (or effectively
unlimited number) of possible network architecture design choices is difficult to address
in this same depth. However, the architecture tuning process is revisited in more depth
later in chapter 5.3, where we consider dealing with the model hyper-parameter design
choices using a secondary optimization process.

Table 4.6: CNN Network Layout
Layer         Output dimensions
Input         2 × 1024
Conv          64 × 1024
Max Pool      64 × 512
Conv          64 × 512
Max Pool      64 × 256
Conv          64 × 256
Max Pool      64 × 128
Conv          64 × 128
Max Pool      64 × 64
Conv          64 × 64
Max Pool      64 × 32
Conv          64 × 32
Max Pool      64 × 16
Conv          64 × 16
Max Pool      64 × 8
FC/SeLU       128
FC/SeLU       128
FC/Softmax    24

Table 4.7: ResNet Network Layout
Layer             Output dimensions
Input             2 × 1024
Residual Stack    32 × 512
Residual Stack    32 × 256
Residual Stack    32 × 128
Residual Stack    32 × 64
Residual Stack    32 × 32
Residual Stack    32 × 16
FC/SeLU           128
FC/SeLU           128
FC/Softmax        24
Figure 4.14: Complex time domain examples of 24 modulations from the dataset at
simulated 10 dB Eb/N0 and ℓ = 256
Figure 4.15: Complex time domain examples of 24 modulations over the air at high SNR
and ℓ = 256
Figure 4.16: Complex constellation examples of 24 modulations from the dataset at
simulated 10 dB Eb/N0 and ℓ = 256
We show a number of examples from both the synthetic and over the air datasets to
provide some intuition about what each example looks like at differing SNR levels, and
how similar classes appear at lower SNR. Each example is 1024 complex valued samples
at 1 MSamp/sec with a baud rate of 200 ksym/sec. We show time domain examples for
all 24 classes, where figures 4.14 and 4.17 illustrate time domain signals at 10 dB and
0 dB respectively. Figure 4.15 illustrates an OTA capture of the dataset with relatively
high SNR, and figure 4.16 illustrates the 10 dB SNR synthetic dataset in the complex
plane, to provide an alternate perspective on the complex valued trajectories through
modulation symbol points.
Figure 4.17: Complex time domain examples of 24 modulations from the dataset at
simulated 0 dB Eb/N0 and ℓ = 256
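The sample rate and baud rate quoted above fix how many symbols a classifier sees per example, which is worth making explicit. The short calculation below uses only those stated figures; the variable names are illustrative.

```python
sample_rate = 1_000_000   # samples per second (1 MSamp/sec)
symbol_rate = 200_000     # symbols per second (200 ksym/sec)
n_samples = 1024          # samples per example

samples_per_symbol = sample_rate / symbol_rate        # oversampling factor
symbols_per_example = n_samples / samples_per_symbol  # symbols observed per decision
duration_ms = 1e3 * n_samples / sample_rate           # observation time in milliseconds
```

Each decision is therefore made from roughly two hundred symbols observed over about one millisecond.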
Classification on low-order modulations
We first compare performance on the lower difficulty dataset of lower order modulation
types. Training on a dataset of 1 million examples, each 1024 samples long, we obtain
excellent performance at high SNR for both the VGG CNN and the ResNet (RN) CNN.
In this case, the ResNet achieves roughly 5 dB higher sensitivity for equivalent
classification accuracy than the baseline, and at high SNR a maximum classification
accuracy rate of 99.8% is achieved by the ResNet, while the VGG network achieves 98.3%
and the baseline method achieves 94.6% accuracy. At lower SNRs, the performance of the
VGG and ResNet networks is virtually identical, but at high SNR performance improves
considerably using the ResNet, obtaining almost perfect classification accuracy.
Figure 4.18: 11-Modulation normal dataset performance comparison (N=1M)
[Plot: correct classification probability (0 to 1) vs. Es/N0 [dB] (−20 to 15); curves:
Baseline, VGG/CNN, ResNet]
For the remainder of this chapter, we will consider the much harder task of 24 class high
order modulations containing higher information rates and much more easily confused
classes between multiple high order PSKs, APSKs and QAMs.
Classification under AWGN
Signal classification under AWGN is the canonical problem which has been explored for
many years in communications literature. It is a simple starting point, and it is the
condition under which analytic feature extractors should generally perform their best
(since they were derived under these conditions). In figure 4.19 we compare the performance
of the ResNet (RN), VGG network, and the baseline (BL) method on our full dataset for
ℓ = 1024 samples, N = 239,616 examples, and L = 6 residual stacks. Here, the residual
network provides the best performance at both high and low SNRs on the difficult dataset
by a margin of 2-6 dB in improved sensitivity for equivalent classification accuracy. Here,
N indicates the number of examples in the dataset, ℓ indicates the number of samples of
input per example, and L indicates the number of residual stacks included in the network
(where a single residual stack architecture is shown in figure 4.20).
Figure 4.19: 24-Modulation difficult dataset performance comparison (N=240k)
[Plot: correct classification probability (0 to 1) vs. Es/N0 [dB] (−20 to 15); curves:
BL AWGN, RN AWGN, VGG AWGN]
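For reference, the AWGN condition can be sketched as follows: complex Gaussian noise is added so that a requested Es/N0 is met for a given number of samples per symbol. This is an illustrative helper and not the dataset generator used in this work. The function name, the normalization of the sample rate to 1, and the default oversampling factor are assumptions.

```python
import numpy as np

def add_awgn(x, esn0_db, samples_per_symbol=1, rng=None):
    """Add complex white Gaussian noise to x for a target Es/N0 in dB."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=complex)
    # Energy per symbol with the sample rate normalized to 1.
    es = np.mean(np.abs(x) ** 2) * samples_per_symbol
    # Noise variance per complex sample under the same normalization.
    n0 = es / 10.0 ** (esn0_db / 10.0)
    noise = np.sqrt(n0 / 2.0) * (rng.standard_normal(x.shape)
                                 + 1j * rng.standard_normal(x.shape))
    return x + noise

# Hypothetical unit-power signal at 10 dB Es/N0 with one sample per symbol.
clean = np.ones(200_000, dtype=complex)
noisy = add_awgn(clean, esn0_db=10.0, rng=np.random.default_rng(0))
measured_noise_power = float(np.mean(np.abs(noisy - clean) ** 2))
```

With one sample per symbol and unit signal power, the measured noise power should come out close to 0.1, which is a quick sanity check on the scaling.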
Classification under Impairments
In any real world scenario, wireless signals are impaired by a number of effects. While
AWGN is widely used in simulation and modeling, the effects of fading, carrier offset, and
clock offset are present almost universally in wireless systems. It is interesting to inspect
how well this class of learned classifiers performs under such impairments and compare
their rate of degradation under impairments with that of more traditional approaches to
signal classification.
Figure 4.20: Residual unit and residual stack architectures
In figure 4.21 we plot the performance of the residual network based classifier under each
considered impairment model. This includes AWGN, minor LO offset (σclk = 0.0001),
moderate LO offset (σclk = 0.01), and several fading models ranging from minor (τ = 0.5)
to harsh (τ = 4.0). Under all fading models, minor LO offset is assumed as well.
Interestingly, in this plot ResNet performance improves under LO offset rather than
degrading. Additional LO offset, which results in spinning or dilated versions of the
original signal, appears to have a positive regularizing effect on the learning process
which provides quite a noticeable improvement in performance. At high SNR, performance
ranges from around 80% in the best case down to about 59% in the worst case.
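The "spinning" and "dilated" effects can be sketched as a carrier rotation and a resampling of the time axis. Both helpers below are illustrative assumptions: the offsets are taken to be normalized to the sample rate, and linear interpolation stands in for a proper fractional resampler.

```python
import numpy as np

def apply_lo_offset(x, delta_fc, theta_c=0.0):
    """Spin x by a carrier offset of delta_fc cycles/sample plus phase theta_c."""
    x = np.asarray(x, dtype=complex)
    n = np.arange(len(x))
    return x * np.exp(1j * (2.0 * np.pi * delta_fc * n + theta_c))

def apply_clock_offset(x, delta_fs):
    """Dilate x in time by a fractional sample-rate error (crude linear interpolation)."""
    x = np.asarray(x, dtype=complex)
    n = np.arange(len(x))
    t = n * (1.0 + delta_fs)  # positions at which the offset clock samples the signal
    return np.interp(t, n, x.real) + 1j * np.interp(t, n, x.imag)

# Hypothetical check: a quarter-cycle-per-sample offset rotates a constant by 90 degrees
# per sample.
spun = apply_lo_offset(np.ones(4), delta_fc=0.25)
unchanged = apply_clock_offset(np.arange(8) + 0j, delta_fs=0.0)
```

Because each training example receives its own offset, the network sees many rotated and stretched versions of every modulation, which is one plausible reading of the regularizing effect noted above.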
In figure 4.22 we show the degradation of the baseline classifier under impairments. In
this case, LO offset never helps; instead, performance degrades with both LO offset
and fading effects. In the best case at high SNR this method obtains about 61% accuracy,
while in the worst case it degrades to around 45% accuracy.
Figure 4.21: Resnet performance under various channel impairments (N=240k)
[Plot: correct classification probability (0 to 1) vs. Es/N0 [dB] (−20 to 15); curves:
RN AWGN, RN σclk = 0.01, RN σclk = 0.0001, RN τ = 0.5, RN τ = 1, RN τ = 2, RN τ = 4]
Figure 4.22: Baseline performance under channel impairments (N=240k)
[Plot: correct classification probability (0 to 1) vs. Es/N0 [dB] (−20 to 15); curves:
BL AWGN, BL σclk = 0.01, BL σclk = 0.0001, BL τ = 0.5, BL τ = 1, BL τ = 2, BL τ = 4]
Figure 4.23: Comparison models under LO impairment
[Plot: correct classification probability (0 to 1) vs. Es/N0 [dB] (−20 to 15); curves:
BL σclk = 0.01, RN σclk = 0.01, VGG σclk = 0.01]
Directly comparing the performance of each model under moderate LO impairment
effects, in figure 4.23 we show that for many real world systems with unsynchronized LOs
and Doppler frequency offset there is nearly a 6 dB performance advantage of the ResNet
approach vs the baseline, and a 20% accuracy increase at high SNR. All models in this
comparison are trained using N = 239,616 and ℓ = 1024.
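A sensitivity advantage "for equivalent classification accuracy" can be read off two accuracy-vs-SNR curves by interpolating the SNR at which each reaches a common target accuracy. The sketch below shows that calculation. The two logistic curves are synthetic placeholders constructed to be 6 dB apart purely to exercise the function. They are not results from this work, and the function name is hypothetical.

```python
import numpy as np

def snr_at_accuracy(snr_db, acc, target):
    """SNR at which a monotonically increasing accuracy curve reaches `target`."""
    # np.interp needs increasing x-coordinates, so accuracy plays the role of x here.
    return float(np.interp(target, acc, snr_db))

snr_db = np.arange(-20.0, 20.0, 2.0)
# Placeholder curves (not measured data): identical shape, shifted by 6 dB.
curve_ref = 1.0 / (1.0 + np.exp(-(snr_db - 6.0) / 3.0))
curve_new = 1.0 / (1.0 + np.exp(-(snr_db - 0.0) / 3.0))

gain_db = (snr_at_accuracy(snr_db, curve_ref, 0.5)
           - snr_at_accuracy(snr_db, curve_new, 0.5))
```

Real curves saturate at different accuracies, so the target must be chosen below the lower of the two plateaus for the comparison to be meaningful.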
Classifier performance by network depth
Model size can have a significant impact on the ability of large neural network models
to accurately represent complex features. In computer vision, convolutional layer based
DL models for the ImageNet dataset started around 10 layers deep, but modern state of
the art networks on ImageNet are often over 100 layers deep [157], and more recently
even over 200 layers. Initial investigations of deeper networks in [155] did not show
significant gains from such large architectures, but with the use of deep residual networks
on this larger dataset, we begin to see quite a benefit from additional depth. This is
likely due to the significantly larger number of examples and classes used. In figure 4.24
we show the increasing validation accuracy of deep residual networks as we introduce more
residual stack units within the network architecture (i.e. making the network deeper). We
see that performance steadily increases with depth in this case, with diminishing returns
as we approach around L = 6 residual stacks. When considering all of the primitive layers
within this network, when L = 6 the ResNet has 121 layers and 229k trainable parameters,
and when L = 0 it has 25 layers and 2.1M trainable parameters. Results are shown for
N = 239,616 and ℓ = 1024.
Figure 4.24: ResNet performance vs depth (L = number of residual stacks)
[Plot: correct classification probability (0 to 1) vs. Es/N0 [dB] (−20 to 15); curves:
L=1, L=2, L=3, L=4, L=5, L=6]
Figure 4.25: Modrec performance vs modulation type (Resnet on synthetic data with
N=1M, σclk=0.0001)
[Plot: correct classification probability (0 to 1) vs. signal to noise ratio (Es/N0) [dB]
(−20 to 15); one curve per modulation: OOK, 4ASK, 8ASK, BPSK, QPSK, 8PSK, 16PSK, 32PSK,
16APSK, 32APSK, 64APSK, 128APSK, 16QAM, 32QAM, 64QAM, 128QAM, 256QAM, AM-SSB-WC,
AM-SSB-SC, AM-DSB-WC, AM-DSB-SC, FM, GMSK, OQPSK]
Classification performance by modulation type
In figure 4.25 we show the performance of the classifier for individual modulation types.
Detection performance of each modulation type varies drastically over about 18 dB of
SNR. Some signals with lower information rates and vastly different structure, such as AM
and FM analog modulations, are much more readily identified at low SNR, while high-order
modulations require higher SNRs for robust performance and never reach perfect
classification rates. However, all modulation types reach rates above 80% accuracy by
around 10 dB SNR. In figure 4.26 we show a confusion matrix for the classifier across all
24 classes for AWGN validation examples where SNR is greater than or equal to 0 dB. We can
see again here that the largest sources of error are between high order PSK (16/32-PSK),
between high order QAM (64/128/256-QAM), as well as between AM modes (confusing
with-carrier (WC) and suppressed-carrier (SC)). This is largely to be expected, as for
short time observations, and under noisy observations, high order QAM and PSK modes can
be extremely difficult to tell apart through any approach.
Figure 4.26: 24-modulation confusion matrix for ResNet trained and tested on synthetic
dataset with N=1M, AWGN, and SNR ≥ 0 dB
Figure 4.27: Performance vs training set size (N) with ℓ = 1024
[Plot: correct classification probability (0 to 1) vs. Es/N0 [dB] (−20 to 18); curves:
N=1k, N=2k, N=4k, N=8k, N=15k, N=31k, N=62k, N=125k, N=250k, N=500k, N=1M, N=2M]
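Confusion matrices such as the one in figure 4.26 are built by counting (ground truth, prediction) pairs and normalizing each ground-truth row. A minimal sketch follows. The function names and the three-class toy labels are illustrative only and are not taken from the dataset.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with ground truth on rows and predictions on columns."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def row_normalize(cm):
    """Convert counts to per-class rates; rows with no examples stay at zero."""
    totals = cm.sum(axis=1, keepdims=True)
    return np.divide(cm, totals, out=np.zeros(cm.shape, dtype=float), where=totals > 0)

# Toy labels for three hypothetical classes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, n_classes=3)
rates = row_normalize(cm)
```

Off-diagonal entries in a row show which classes a given modulation is mistaken for, which is how the PSK, QAM, and AM confusions above are identified.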
Classifier Training Size Requirements
When using data-centric machine learning methods, the dataset often has an enormous
impact on the quality of the model learned. We consider the influence of the number
of example signals in the training set, N, as well as the time-length of each individual
example in number of samples, ℓ.
In figure 4.27 we show how performance of the resulting model changes based on the total
number of training examples used. Here we see that dataset size has a dramatic impact on
model training: high SNR classification accuracy is near random until 4-8k examples and
improves 5-20% with each doubling until around 1M. These results illustrate that having
sufficient training data is critical for performance. For the largest case, with 2 million
examples, training on a single state of the art Nvidia V100 GPU (with approximately
125 tera-floating point operations per second (FLOPS)) takes around 16 hours to reach
a stopping point, making significant experimentation at these dataset sizes cumbersome.
We do not see significant improvement going from 1M to 2M examples, indicating a point
of diminishing returns for number of examples around 1M with this configuration. With
either 1M or 2M examples we obtain roughly 95% test set accuracy at high SNR. The
class-confusion matrix for the best performing model with ℓ = 1024 and N = 1M is shown
in figure 4.28 for test examples at or above 0 dB SNR. In all instances here we use the
σclk = 0.0001 dataset, which yields slightly better performance than AWGN.
Figure 4.28: 24-modulation confusion matrix for ResNet trained and tested on synthetic
dataset with N=1M and σclk = 0.0001
Figure 4.29: Performance vs example length in samples (ℓ)
[Plot: correct classification probability (0 to 1) vs. Es/N0 [dB] (−20 to 15); curves:
ℓ=16, ℓ=32, ℓ=64, ℓ=128, ℓ=256, ℓ=512, ℓ=768, ℓ=1024]
Figure 4.29 shows how the model performance varies by window size, or the number of
time-samples per example used for a single classification. Here we obtain approximately
a 3% accuracy improvement for each doubling of the input size (with N=240k), with
significant diminishing returns once we reach ℓ = 512 or ℓ = 1024. We find that CNNs scale
very well up to this 512-1024 size, but may need additional scaling strategies thereafter for
larger input windows simply due to memory requirements, training time requirements,
and dataset requirements.
Over the air performance
We generate 1.44M examples of the 24 modulation dataset over the air using the USRP
setup described above. Using a partition of 80% training and 20% test, we can directly
train a ResNet for classification. Doing so on an Nvidia V100 in around 14 hours, we
obtain a 95.6% test set accuracy on the over the air dataset, where all examples are roughly
10dB SNR. A confusion matrix for this OTA test set performance based on direct training
is shown in figure 4.30.
Figure 4.30: 24-modulation confusion matrix for ResNet trained and tested on OTA examples with SNR ∼ 10 dB
Figure 4.31: ResNet transfer learning OTA performance
[Plot: correct classification probability (test set), 0.6 to 0.9, vs transfer learning epochs, 0 to 50; curves for AWGN, σclk = 0.0001, σclk = 0.01, τ = 0.5, τ = 1.0]
Transfer Learning to Over-the-air Performance
We also consider over the air signal classification as a transfer learning problem, where
the model is trained on synthetic data and then only evaluated and/or fine-tuned on
OTA data. Because full model training can take hours on a high end GPU and typi-
cally requires a large dataset to be effective, transfer learning is a convenient alternative
for leveraging existing models and updating them on smaller computational platforms
and target datasets. We consider transfer learning in which we freeze the network parameter
weights for all layers except the last several fully connected layers (last three layers from
table 4.7) while updating. This is commonly done today with computer
vision models, where it is common to start from pre-trained VGG or other model
weights for ImageNet or similar datasets and perform transfer learning using another
dataset or set of classes. In this case, many low-level features work well for different
classes or datasets, and do not need to change during fine tuning. In our case, we con-
sider several cases where we start with models trained on simulated wireless impairment
models using residual networks and then evaluate them on OTA examples. The accuracies
of our initial models (trained with N=1M) on synthetic data are shown in figure 4.21, and
these ranged from 84% to 96% on the hard 24-class dataset. Evaluating the performance of
these models on OTA data, without any model updates, we obtain classification accuracies
between 64% and 80%. By fine-tuning the last two layers of these models on the OTA
data using transfer learning, we can recover approximately 10% of additional accuracy.
The validation accuracies for this process are shown in figure 4.31. These ResNet
update epochs on dense layers for 120k examples take roughly 60 seconds on a Titan X
card to execute, instead of the full ∼ 500 seconds per epoch on a V100 card when updating
all model weights.
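The freezing strategy described above can be illustrated with a minimal, dependency-free sketch. This is not the ResNet of table 4.7: the fixed random projection `W_FROZEN` merely stands in for pre-trained feature layers, the toy three-class data stand in for OTA examples, and all names, sizes, and the learning rate are hypothetical. Only the final softmax layer `w_head` is updated.

```python
import math
import random

random.seed(0)
N_IN, N_FEAT, N_CLASS = 8, 6, 3  # hypothetical toy dimensions

# Stand-in for pre-trained feature layers: these weights are frozen.
W_FROZEN = [[random.gauss(0.0, 0.5) for _ in range(N_IN)] for _ in range(N_FEAT)]
frozen_snapshot = [row[:] for row in W_FROZEN]

# Trainable output layer, the only part touched during fine-tuning.
w_head = [[0.0] * N_FEAT for _ in range(N_CLASS)]

def features(x):
    """Frozen layers: linear projection followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in W_FROZEN]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(f):
    return softmax([sum(w * v for w, v in zip(row, f)) for row in w_head])

def mean_loss(feats, labels):
    """Mean categorical cross-entropy over the dataset."""
    return -sum(math.log(predict(f)[y]) for f, y in zip(feats, labels)) / len(labels)

def fine_tune_step(feats, labels, lr=0.02):
    """One full-batch gradient step on the head weights only."""
    grad = [[0.0] * N_FEAT for _ in range(N_CLASS)]
    for f, y in zip(feats, labels):
        p = predict(f)
        for c in range(N_CLASS):
            err = p[c] - (1.0 if c == y else 0.0)
            for j in range(N_FEAT):
                grad[c][j] += err * f[j] / len(labels)
    for c in range(N_CLASS):
        for j in range(N_FEAT):
            w_head[c][j] -= lr * grad[c][j]

# Toy "target domain" data: class y has inputs centred on y - 1.
labels = [i % N_CLASS for i in range(90)]
inputs = [[random.gauss(y - 1.0, 0.5) for _ in range(N_IN)] for y in labels]
feats = [features(x) for x in inputs]  # frozen features are computed once

loss_before = mean_loss(feats, labels)
for _ in range(50):
    fine_tune_step(feats, labels)
loss_after = mean_loss(feats, labels)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

Because the frozen layers never change, their outputs can be computed once and reused for every update, which is one plausible reason dense-layer update epochs are much cheaper than full training epochs.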
Ultimately, the model trained on just moderate LO offset (σclk = 0.0001) performs the best
on OTA data. The model obtained 94% accuracy on synthetic data, and drops roughly
7% in accuracy when evaluated on OTA data, obtaining an accuracy of 87%. The primary
confusion cases prior to fine-tuning appear to involve suppressed versus non-suppressed
carrier analog signals, as well as the high order QAM and APSK modes.
This model is perhaps the best suited among our models to match the OTA data.
Very small LO impairments are present in the data: the radios used had extremely stable
oscillators (GPSDO modules providing highly stable 75 PPB clocks) over very short
example lengths (1024 samples), and the two radios were essentially right next to
each other, providing a very clean impulsive direct path while any reflections from the
surrounding room were likely significantly attenuated in comparison, making for a near
impulsive channel. Training on harsher impairments seemed to degrade performance on
the OTA data significantly.
Figure 4.32: 24-modulation confusion matrix for ResNet trained on synthetic σclk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (prior to fine-tuning)
We suspect as we evaluate the performance of the model under increasingly harsh real
world scenarios, our transfer learning will favor synthetic models which are similarly
impaired and most closely match the real wireless conditions (e.g. matching LO distributions,
matching fading distributions, etc). In this way, it will be important for this class
of systems to train either directly on target signal environments, or on very good impairment
simulations of them under which well suited models can be derived. One possible
mitigation is to include domain-matched attention mechanisms such as the radio
transformer network [139] in the network architecture to improve generalization to
varying wireless propagation conditions.
Figure 4.33: 24-modulation confusion matrix for ResNet trained on synthetic σclk = 0.0001 and tested on OTA examples with SNR ∼ 10 dB (after fine-tuning)
Modulation Recognition Learning Analysis
We have extended prior work on using deep convolutional neural networks for radio sig-
nal classification by heavily tuning deep residual networks for the same task. We have
also conducted a much more thorough set of performance evaluations on how this type
of classifier performs over a wide range of design parameters, channel impairment con-
ditions, and training dataset parameters. This residual network approach achieves state
of the art modulation classification performance on a difficult new signal database both
synthetically and in over the air performance. Other architectures still hold significant
potential: radio transformer networks, recurrent units, and other approaches all still need
to be adapted to the domain, tuned, and quantitatively benchmarked against the same
dataset in the future. Other works have explored these to some degree, but generally not
with sufficient hyper-parameter optimization to be meaningful.
We have shown that, contrary to prior work, deep networks do provide significant per-
formance gains for time-series radio signals where the need for such deep feature hier-
archies was not apparent, and that residual networks are a highly effective way to build
these structures where more traditional CNNs such as VGG struggle to achieve the same
performance or make effective use of deep networks. We have also shown that simulated
channel effects, especially moderate LO impairments, improve the effect of transfer learning
on OTA signal evaluation performance, a topic which will require significant future
investigation to optimize the synthetic impairment distributions used for training.
DL methods continue to show enormous promise in improving radio signal identification
sensitivity and accuracy, especially for short-time observations. We have shown
deep networks to be increasingly effective when leveraging deep residual architectures
and have shown that synthetically trained deep networks can be effectively transferred
to over the air datasets with (in our case) a loss of around 7% accuracy or directly trained
effectively on OTA data if enough training data is available. While large well labeled
datasets can often be difficult to obtain for such tasks today, and channel models can be
difficult to match to real-world deployment conditions, we have quantified the real need
to do so when training such systems and helped quantify the performance impact of do-
ing so.
We still have much to learn about how to best curate datasets and training regimes for this
class of systems. However, we have demonstrated in this work that our approach pro-
vides roughly the same performance on high SNR OTA datasets as it does on the equiva-
lent synthetic datasets, a major step towards real world use. We have demonstrated that
transfer learning can be effective, but have not yet been able to achieve equivalent perfor-
mance to direct training on very large datasets by using transfer learning. As simulation
methods become better, and our ability to match synthetic datasets to real world data
distributions improves, this gap will close and transfer learning will become an increasingly
important tool when real data capture and labeling is difficult. The performance
trades shown in this work help shed light on these key parameters in data generation
and training, hopefully helping increase understanding and focus future efforts on the
optimization of such systems.
4.3 Learning to Identify Radio Protocols
The results in the previous section focused principally on sensing of modulation type,
but the same fundamental approach is valid for labeling many different properties of
digital communications waveforms at the PHY or MAC layer. As shown by Sainath et
al. in [36], hierarchical structure built from short-time features in a time series
waveform (such as voice utterances) can be learned in an end-to-end
fashion with a higher level sequence model for effective sequence classification on
noisy time series data. This has proved incredibly effective in voice recognition, and the
approach can also be leveraged for higher level radio protocol identification on top of the
basic modulation features [158].
Protocol identification serves an important role in network quality of service (QoS) man-
agement, intrusion detection, and anomaly detection. Today, many such systems rely on
brittle parsing routines which are highly specialized to a specific set of protocols, can be-
come useless, or worse cause faults or vulnerabilities [159] when protocol fields change
or are malformed, and can be extremely expensive and time consuming to keep up to
date or constantly update to add new protocol modes. As an alternative, we consider a
data-based approach in which high level protocol labeling can be conducted directly on
a physical layer modulated signal through end-to-end learning of the low level modu-
lation features, and high level classification loss guided by curated protocol labels and
examples.
Figure 4.34: Transfer function of the LSTM unit, from [16]
Table 4.8: Protocol traffic classes considered for classification
Traffic Type    Traffic Class
Streaming       Video (ABC Video)
Streaming       Video (YouTube)
Streaming       Music (Spotify)
Utilities       Apt-get
Utilities       ICMP Ping
Utilities       Git Version Control
Utilities       IRC Chat
Browsing        Bit-Torrent
Browsing        Web-Browsing
Browsing        FTP Transfer
Browsing        HTTP Download
Several powerful recurrent network structures exist, such as the long short-term memory (LSTM)
[160, 161], the gated recurrent unit (GRU) [162], and more recently the computationally
efficient quasi-recurrent neural network (QRNN) [163]. For our work we leverage the
LSTM in both an RNN-DNN architecture and a CNN-RNN-DNN (CLDNN) architecture.
We generate a set of recorded IP traffic captures using Wireshark [164] from the list of
protocols in table 4.8 and re-modulate them over an un-coded QPSK with HDLC com-
munications link to produce labeled I/Q sample files for classification.
Table 4.9: Recurrent network architecture used for network traffic classification
Layer              Output dimensions
Input              N × (2 × 128)
LSTM               256
LSTM               256
LSTM               256
Dense + ReLU       64
Dense + softmax    11
Table 4.10: Performance measurements for RNN protocol classification for varying sequence lengths
Sequence Length   Val. Loss   Val. Accuracy   Nsamples   Nsymbols   Nbits   Sec/Epoch
32                1.2126      0.498805        1120       140        280     5
64                1.0386      0.553546        2144       268        536     18
128               0.7179      0.65894         4192       524        1048    17
256               0.4586      0.75621         8288       1036       2072    29
512               0.2711      0.836535        16480      2060       4120    38
768               0.5328      0.730413        24672      3084       6168    27
The recurrent neural network (RNN) architecture evaluated (which in this case
had the best performance on clean signal data) is shown in table 4.9. No network tuning
was used; this was the same network structure commonly used for character level RNNs
(char-rnn [165]).
We evaluate a range of different LSTM input sequence lengths N, comparing the
average number of input samples/bits required to obtain a good estimate of each modulated
protocol traffic type. Table 4.10 tabulates the resulting network performance for
training and evaluating classification performance with differing sequence lengths of 128
complex sample windows. The title of Andrej Karpathy’s excellent article [165]
is apt here: the ’unreasonable effectiveness’ of LSTMs extends to quite effectively identifying
high level traffic protocol behaviors with only access to raw modulated I/Q data. In
this case, we obtain best performance with a sequence of 512 windows, with a validation
set accuracy of around 83.6%.
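The Nsamples, Nsymbols, and Nbits columns of table 4.10 can be reproduced with a short calculation. The sketch below assumes a window stride of 32 samples and 8 samples per symbol; neither value is stated in the text, both are inferred from the table itself. The 2 bits per symbol follow from the QPSK link.

```python
WINDOW_LEN = 128         # complex samples per window (input shape in table 4.9)
WINDOW_STRIDE = 32       # assumption: inferred from table 4.10, not stated in the text
SAMPLES_PER_SYMBOL = 8   # assumption: inferred from table 4.10, not stated in the text
BITS_PER_SYMBOL = 2      # QPSK carries two bits per symbol

def observation_footprint(n_windows):
    """Return (samples, symbols, bits) spanned by n_windows overlapping windows."""
    n_samples = WINDOW_LEN + (n_windows - 1) * WINDOW_STRIDE
    n_symbols = n_samples // SAMPLES_PER_SYMBOL
    n_bits = n_symbols * BITS_PER_SYMBOL
    return n_samples, n_symbols, n_bits

for n in (32, 64, 128, 256, 512, 768):
    print(n, *observation_footprint(n))
```

Under these assumptions every row of the table is matched, so the best-performing case (N = 512) observes 16,480 samples, or 4,120 bits of link traffic.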
Figure 4.35: Best LSTM256 confusion with RNN length of 512 time-steps
The confusion matrix for the resulting classifier performance with sequence length of
N = 512 (16,480 samples) is shown in figure 4.35. Since the observation window is only
16ms of traffic observation, some error is to be expected as not all observation windows
will contain distinctive traffic patterns and all classes may have some amount of common
background traffic (domain name server (DNS), address resolution protocol (ARP), etc).
These results indicate some initial promise of deep learning based protocol analysis even
down to the raw physical layer, but significant investment and work in larger scale robust
dataset development needs to occur to significantly advance the field. Our efforts to
perform similar classification on impaired RF channels (including noise, fading, offsets,
etc) were less successful with a straightforward RNN approach. We believe this avenue
can certainly be fruitful (likely using a CLDNN style architecture), but newer tools for architecture
optimization, hyper-parameter tuning, domain specific dataset augmentation,
and generally larger datasets will be required to accomplish this task.
4.4 Learning to Detect Signals
Figure 4.36: Detection Algorithm Trade-space Sensitivity vs Specialization
Radio signal detection is a key task in spectrum diagnostic and monitoring systems as
well as cognitive radios such as those performing DSA. Today, systems which do de-
tection typically have to make a difficult design choice: specialize detection algorithms
heavily for features of a specific signal type or class of signals, or rely on highly generaliz-
able energy based detection methods with lower sensitivity. This is an unfortunate design
trade-off as it forces designers to either forego generality or performance during design or
dynamically at run-time using additional complex logic and estimation [166]. The generality
of feature based detectors varies. For instance, cyclo-stationary or moment based
detectors may have more generality than highly specialized features such as matched filters
or cross ambiguity function (CAF) plane searches, but they are still highly specific
to a narrow class of modulation types or properties. This is problematic, especially
as learned communications systems drastically increase the range of possible signal types.
Figure 4.36 illustrates this trade-space at a high level, showing how objective based
learned feature detectors fill the much desired role of obtaining both sensitivity and
generality. This ideal class of detectors, which achieves both high sensitivity and wide
generality, can be obtained through
data centric machine learning approaches relying on feature learning, where, given sufficient
data, highly sensitive features are learned for many different signal types using the
same basic approach without the need for hand tuning or manual feature engineering.
There are many pre-processing signal representation domains in which detection strate-
gies can be applied: raw time domain, frequency domain, wavelet domain, combinations
of these, or others. As the most straightforward approach with analogues to existing work
in computer vision, we consider the 2D time-frequency spectrogram plane for our work
and leverage image object detection techniques which have already reached maturity,
surpassing human levels of performance in many cases [156, 167]. The intuition for this
approach is strong, as skilled domain engineers can regularly perform manual observa-
tion on spectrogram images and identify and localize signals highly accurately with their
eyes, illustrating the sufficient availability of information given the right interpretation.
This approach to object detection has recently come to the forefront in medical imaging
and other non-visual domains, providing computer assisted diagnosis in radiology and
other fields which in many cases outperforms panels of skilled radiologists in identifying
cancer [168], fractures, Alzheimer’s disease [169], and other conditions.
Figure 4.37: Computer Vision CNN-based Object Detection Trade Space, from [17]
We consider the application of several leading computer vision object detection approaches
to the task of radio signal detection in [18].
Each of the leading techniques in recent years has relied on CNNs for learned features
on the front end, while numerous strategies exist for architectures, targets, loss functions,
iteration, and training. Each of these relies on large training sets containing annotations
with bounding boxes to indicate and localize ground truth of various object classes in the
image. Networks then typically learn to predict bounding boxes, class labels, and confidence
metrics, for which there are several strategies. Initial promising
solutions to the problem relied on region proposal networks such as the region-based
convolutional neural network (R-CNN) [170], Fast R-CNN [40], and newer versions of
this technique, which rely on conducting multiple network forward passes for each object
or region proposal in an image iteratively to refine the region prediction. This works
well, but is quite expensive computationally and consequently slow when considering
the throughput of many classifications on finite computing resources. In radio detection,
we often seek to perform detection at extremely high rates and low latencies for many
wide-band spectrum sensing tasks, where speed is key. The you only look once (YOLO)
approach [171] solved this by proposing a single feed forward pass network which jointly
produces object bounding box and class proposals for a grid of regions within the image
simultaneously. This approach of bounding box and class prediction within a single network
forward pass was improved upon by SSD [172] and then improved further in [17]. Among
the improvements are new network architectures, the use of anchor boxes, and improved
loss functions for regression.
Figure 4.38: Example bounding box detections in computer vision, from [17]
We use a network architecture for this work which is a variant of YOLO (known as tiny-
YOLO) as is described in table 4.11. Note that this network is much smaller than the
full-size one used in [17]. Compared to visual object recognition tasks, recognition of
spectral events is a relatively simpler task in many cases, allowing for smaller networks
to be used. Additionally, a smaller network helps to reduce over-fitting on the currently
available smaller datasets for the task, and reduces the computational complexity of forward
passes, resulting in lower power and faster operation.
Table 4.11: Table input/output shapes
Layer Number    Layer Type      Kernel Size   Number of Feature Maps
1,2,3,4,5,6     Conv+Maxpool    (3,3)         16,32,64,128,256,512
7,8             Conv            (3,3)         1024,1024
9               Conv            (1,1)         30
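To give a rough sense of the size of the network in table 4.11, the sketch below counts its convolutional weights. It assumes a single-channel spectrogram input and ignores biases and any normalization parameters; both are assumptions made for illustration, since the text does not state them.

```python
# (kernel height, kernel width, output feature maps) for layers 1-9 of table 4.11
LAYERS = [
    (3, 3, 16), (3, 3, 32), (3, 3, 64), (3, 3, 128), (3, 3, 256), (3, 3, 512),
    (3, 3, 1024), (3, 3, 1024),
    (1, 1, 30),
]
INPUT_CHANNELS = 1  # assumption: one-channel spectrogram image

def conv_weight_counts(layers, in_channels):
    """Weights per layer: kernel area x input channels x output channels."""
    counts = []
    for k_h, k_w, out_channels in layers:
        counts.append(k_h * k_w * in_channels * out_channels)
        in_channels = out_channels
    return counts

counts = conv_weight_counts(LAYERS, INPUT_CHANNELS)
total_weights = sum(counts)
print(counts, total_weights)
```

Under these assumptions the two 1024-map layers account for most of the weights, so they are the natural place to trim if an even smaller detector is needed.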
We train our system using the same approach as presented in the YOLO method, but we
can make a handful of simplifications for detection. We consider an S × S grid of detec-
tions, predicting B bounding boxes for each cell along with a set of C class probabilities
as in [171]. We consider the YOLO loss function given below in equation 4.5, where 1obj
is evaluated only when the cell contains an object, and 1no−obj is evaluated only when
the cell does not contain an object. We do not use anchor boxes or intersection over
union (IOU) loss for this initial work, performing direct regression of w and h instead.
We leave these for future work, which we believe will almost certainly yield further
improvements.
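$$
\begin{aligned}
L_{YOLO} ={} & \lambda_c \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \, D^2_{L2}\big((x_i, y_i), (\hat{x}_i, \hat{y}_i)\big) \\
& + \lambda_c \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \, D_{L2}\big((w_i, h_i), (\hat{w}_i, \hat{h}_i)\big) \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{obj}_{ij} \, (C_i - \hat{C}_i)^2 \\
& + \lambda_{no\text{-}obj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}^{no\text{-}obj}_{ij} \, (C_i - \hat{C}_i)^2 \\
& + \sum_{i=0}^{S^2} \sum_{c \in classes} \big(p_i(c) - \hat{p}_i(c)\big)^2
\end{aligned}
\tag{4.5}
$$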
Figure 4.39: YOLO style per-grid-cell bounding box regression targets
Here, the first two terms of the loss minimize the L2 distance of the bounding box location
(x/y) and size (h/w) when an object is present (as shown in figure 4.39), terms three
and four minimize error in the confidence metric, and the final term minimizes error in
class prediction probabilities. In our case, if we seek to perform object detection on a single class,
RF emissions, we can drop the final term and only perform bounding box
regression and confidence estimation for a single object class, simplifying the task and
network complexity significantly.
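A minimal sketch of this simplified single-class loss is given below, keeping only the bounding box and confidence terms of equation 4.5. Several choices here are assumptions rather than details taken from this work: squared L2 distance is used for both the location and size terms, each grid-cell box is a flat `(x, y, w, h, confidence)` tuple, and the weights `lambda_c = 5.0` and `lambda_noobj = 0.5` are the defaults from the original YOLO paper [171].

```python
def single_class_yolo_loss(pred, target, lambda_c=5.0, lambda_noobj=0.5):
    """Sum of box-regression and confidence terms over all grid-cell boxes.

    pred, target: sequences of (x, y, w, h, conf) tuples, one per grid-cell box.
    A target conf of 1.0 marks a box that contains an object, 0.0 an empty box.
    """
    loss = 0.0
    for (px, py, pw, ph, pc), (tx, ty, tw, th, tc) in zip(pred, target):
        if tc > 0.5:
            # Object present: penalize location, size, and confidence error.
            loss += lambda_c * ((tx - px) ** 2 + (ty - py) ** 2)
            loss += lambda_c * ((tw - pw) ** 2 + (th - ph) ** 2)
            loss += (tc - pc) ** 2
        else:
            # No object: only a down-weighted confidence penalty applies.
            loss += lambda_noobj * (tc - pc) ** 2
    return loss

# Two boxes: one containing an emission, one empty.
targets = [(0.5, 0.5, 0.2, 0.1, 1.0), (0.0, 0.0, 0.0, 0.0, 0.0)]
preds = [(0.6, 0.5, 0.2, 0.1, 1.0), (0.0, 0.0, 0.0, 0.0, 0.4)]
example_loss = single_class_yolo_loss(preds, targets)
print(example_loss)
```

The down-weighting of empty boxes matters for spectrograms, where most grid cells typically contain no emission and would otherwise dominate the loss.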
Figure 4.40: Radio bounding box detection examples, from [18]
In figure 4.40 we illustrate a synthetic wide-band bounding-box annotated radio dataset
generated for the DARPA Battle-of-the-ModRecs competition using a set of our custom
wide-band signal generation tools in GNU Radio [173]. Here we show ground truth
bounding boxes alongside predicted bounding boxes produced by our trained tiny-YOLO
detector on a validation portion of the dataset. In this case, we obtain excellent
performance in predicting good bounding box annotations and maintain resilience to
wideband noise emissions across the band which appear as energy to our detector.
We also illustrate the performance of the model as tested on an over the air wide-band
spectrogram using tools being developed by DeepSig Inc. In figure 4.41, we show the
Figure 4.41: Over the air wideband signal bounding box prediction example
received radio spectrogram for an ISM band, with a series of rapid bursty radio emissions
occurring throughout. This spectrogram has been labeled with annotations using a similar
YOLO style network with bounding box regression and confidence prediction, where we
have thresholded and removed all the low confidence boxes, which are not shown. Here we can
see that a number of traditionally difficult tasks such as discerning overlapping bursts,
adjacent bursts, and heavily faded bursts are all handled appropriately.
This is a key result for the learned detector approach: through a generic process of human
bounding box guidance we are able to rapidly train a detector to perform as desired for an
unknown signal type without significant investment in additional specialized detection
algorithms. This technique is especially powerful as the detector has a receptive field and is
much more resilient to small impairments, occlusions (interference), or other distortions
in the signals which might have readily caused a simple energy based detector to mis-detect
or poorly bound a radio signal emission. Work remains to be done to quantify the
performance of the detector in a classical constant false alarm rate or receiver operating
characteristic (ROC) curve style sensitivity analysis against the classical binned energy
detector, but based on comparable results in computer vision and human visual capabil-
ities when performing this task manually, we believe such a study in future work will
yield excellent results soon.
Chapter 5
Learning Radio Structure
Much of the work discussed to this point has been focused on either learning new physi-
cal layer communications systems or learning in a supervised way how to detect, classify
and label radio emissions. This chapter takes a step back and looks at how unlabeled
radio signal data (which describes most available data in the world, and the data hitting
our sensors) can be used in order to learn structure of radio signals, enable compression
of radio signals, and to partition and learn to separate types of radio signals without train-
ing or through a semi-supervised approach. It also takes a deeper dive into the question
of how to select network architectures and hyper-parameters for training various tasks
through approaching it as a guided model search problem, a key enabler for radio algo-
rithm discovery and optimization.
Timothy J. O’Shea Chapter 5. Learning Radio Structure 151
5.1 Unsupervised Structure Learning
Widely used single-carrier radio signal time series modulation schemes today use a relatively
simple set of supporting basis functions to modulate information into the radio
spectrum. Digital modulations typically use sine wave basis functions with pseudo-orthogonal
properties in phase, amplitude, or frequency. Information bits are used to
map a symbol value si to a location in this space φj, φk, .... In figure 5.1 we show three
common basis functions where φ0 and φ1 form phase-orthogonal bases used in PSK and
QAM, while φ0 and φ2 show frequency-orthogonal bases used in frequency shift keying
(FSK). In the final panel of figure 5.1 we show a common mapping of constellation points
into this space used in Quadrature Phase Shift Keying (QPSK) to encode two bits of information
per symbol.
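As a small illustration of mapping bits onto two phase-orthogonal basis functions, the sketch below maps bit pairs to unit-energy QPSK constellation points and evaluates the corresponding passband waveform. The Gray-coded bit assignment, the sign convention of the passband expression, and all names are assumptions chosen for illustration; the exact mapping drawn in figure 5.1 may differ.

```python
import math

A = 1.0 / math.sqrt(2.0)  # scales each point to unit energy

# Assumed Gray mapping: neighbouring constellation points differ in one bit.
GRAY_QPSK = {
    (0, 0): complex(A, A),
    (0, 1): complex(-A, A),
    (1, 1): complex(-A, -A),
    (1, 0): complex(A, -A),
}

def qpsk_map(bits):
    """Map an even-length bit sequence to complex QPSK symbols (two bits each)."""
    if len(bits) % 2 != 0:
        raise ValueError("QPSK needs an even number of bits")
    return [GRAY_QPSK[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]

def passband_sample(symbol, carrier_hz, t):
    """Project a symbol onto the cosine and sine bases at time t (seconds)."""
    phase = 2.0 * math.pi * carrier_hz * t
    return symbol.real * math.cos(phase) - symbol.imag * math.sin(phase)

symbols = qpsk_map([0, 0, 0, 1, 1, 1, 1, 0])
print(symbols)
```

The real and imaginary parts of each symbol are the coordinates along the two phase-orthogonal bases, which is why a complex baseband sample is a natural input representation for the learned models in this chapter.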
Digital modulation theory in communications is a rich subject explored in much greater
depth in numerous great texts such as [174].
Figure 5.1: Example Radio Communications Basis Functions
We seek to learn a sparse representation using learned convolutional basis functions
which maximally compresses radio signals of interest, obtaining the most sparse representation
possible. Given there is random data modulated onto the radio signal and
CSI information stored about its arrival mode, there is certainly some information theoretic
limit to how far the signal can be compressed while still allowing the same
information to be recovered on reconstruction. We can lower bound this by the entropy of
the data bits, but likely need to also consider the entropy of the encoded CSI.
Figure 5.2: Convolutional Autoencoder Architecture for Signal Compression
We set up a minimal convolutional autoencoder as shown in figure 5.2 where an input
complex time domain radio signal is decomposed by a small set of convolutional filters,
compressed to a small number of activations through a fully-connected layer, then
decompressed and reconstructed through a similar fully-connected layer and convolutional
regression layer. In this case, we use linear activations on the convolutional layers, and
non-linear activations only on the fully-connected compression layers.
Figure 5.3: Convolutional Autoencoder reconstruction of QPSK example 1
Inspecting a QPSK signal compressed in this way in figure 5.3, we see that the complex
continuous valued 88 sample input signal can be quite cleanly reconstructed at the output
while passing through an intermediate layer of 44 values which saturate
at 0 or 1. Interestingly, because only the structural portions of the signal are represented in
the basis functions, significant amounts of high frequency noise which does not lie on the
basis functions have naturally been removed in the reconstruction.
Another example is shown in figure 5.4 where relatively clean reconstruction is achieved
in the same way. Considering the compression occurring here, we have 88*2=176 float32
values for each input example, for a total of 5632 bits, while we
have a saturated sparse representation of approximately 44 bits. This is a compression
factor of 128x.
Figure 5.4: Convolutional Autoencoder reconstruction of QPSK example 2
Figure 5.5: AE Encoder Filter Weights
Figure 5.6: AE Decoder Filter Weights
If we instead consider the input signal to be dynamic range limited to approximately 20dB
SNR (assuming optimal representation scaling), we can assume the signal can be represented
in 4-bit precision without quantization error reducing SNR (e.g. 6.02dB*4bits = 24.08dB >
20dB). We can then take the input signal to be 704 bits of information compressed down
to 44 bits, still a compression factor of 16x. This is relatively encouraging for a scheme
which is perhaps the simplest convolutional autoencoder which could be employed for
such a task, with no tuning.
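The two compression factors quoted above follow from simple bit counting, reproduced in the sketch below. The constant and function names are hypothetical; the figures (88 complex samples, 44 saturated activations, 32-bit and 4-bit values, 6.02 dB per bit) are those given in the text.

```python
N_COMPLEX_SAMPLES = 88   # input example length
LATENT_BITS = 44         # saturated (0 or 1) activations in the bottleneck
DB_PER_BIT = 6.02        # quantization SNR gained per bit of precision

def compression_factor(bits_per_value):
    """Return (input bits, compression factor) for a given sample precision."""
    input_bits = N_COMPLEX_SAMPLES * 2 * bits_per_value  # I and Q per sample
    return input_bits, input_bits / LATENT_BITS

float32_bits, float32_factor = compression_factor(32)
fourbit_bits, fourbit_factor = compression_factor(4)
quantization_snr_db = DB_PER_BIT * 4

print(float32_bits, float32_factor)   # float32 representation
print(fourbit_bits, fourbit_factor)   # 4-bit representation
print(quantization_snr_db)
```

The 4-bit case is the more conservative comparison, since it discounts precision that a 20 dB SNR signal cannot actually use.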
Interestingly, if we inspect the filter weights learned in the convolutional encoding layer
and convolutional decoding layer in figures 5.5 and 5.6, we can see that the basis functions
for PSK modulation at the given relative symbol rate with RRC pulse shaping are learned
directly in the filter weights. This raises an interesting possibility for discovering the basis
functions for any new unknown modulation type simply based on learning a similar
sparse representation thereof. It also raises the question of whether some Galois field GF(2)
logic function exists to map the sparse representation bits into the transmitted data bits. If
that is the case, through compression we would have naively learned a demodulator
for any new random modulation type solely through reconstruction loss.
Finally, the implications for denoising the input signal visible in figures 5.3 and 5.4 are
quite interesting. Through projection onto basis functions and reconstruction therefrom,
such an approach might offer a lower complexity alternative to the full demodulation,
re-modulation, and subtraction currently used in successive interference cancellation (SIC).
5.2 Unsupervised Class Discovery
Labeling of datasets can be expensive, difficult and time consuming. For this reason, as
we turn increasingly to machine learning and data centric methods, it is important to
develop methods which exploit unsupervised learning as much as possible to minimize
the human curation requirement when unnecessary, and to maximally leverage human
guidance when it is needed. In [175], we consider a collection of techniques for unsupervised
and semi-supervised [176, 177, 178] identification of radio signal emission types
using structure learning, sparse embedding, and clustering.
Dimensionality reduction techniques such as principal component analysis (PCA) [179] and
independent component analysis (ICA) [180] have been used widely in signal processing
to obtain low dimension representations, to perform compression and de-noising, and for
other purposes. Non-linear versions such as kernel-PCA [181] extend these
methods into the non-linear representation domain; however, the choice of kernel often
severely limits non-linear representation capacity, and leaves much to be desired in
terms of improved non-linear models for dimensionality reduction. Autoencoders with
non-linear activations as discussed in section 5.1 offer a potential for significantly im-
proved non-linear dimensionality reduction and representation beyond what has been
achievable with prior methods. Recent work for instance in image and video compression
domains [182, 183] has shown that such nonlinear autoencoder compression schemes can
achieve better and more compressed low-dimensional representations of image domain
examples than previously achievable with other techniques.
We consider both supervised and unsupervised methods for learning sparse representa-
tions or embeddings of RF signal examples in figures 5.7 and 5.8. These are both compres-
sive non-linear representations, but they have different objectives. In the case of the su-
pervised method, discriminative features are learned which help impart human guidance
on the objective class separation. In the case of the unsupervised method, reconstructive
features are learned which simply try to best reconstruct each example through the
non-linear compressed representations of supporting learned convolutional basis functions
which minimize reconstruction loss (e.g. MSE).
Figure 5.7: Supervised Embedding Approach
Figure 5.8: Unsupervised Embedding Approach
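The two embedding objectives described above can be written compactly. This is a sketch in notation that does not appear in the original text: the encoder f_θ, decoder g_φ, classifier head h_ψ, example count N, and class count C are all assumed symbols, and MSE and categorical cross-entropy are taken as the representative losses.

```latex
% Unsupervised (reconstructive) embedding, with z_i = f_\theta(x_i):
\mathcal{L}_{\mathrm{MSE}}(\theta,\phi)
  = \frac{1}{N}\sum_{i=1}^{N}
    \left\lVert x_i - g_\phi\!\left(f_\theta(x_i)\right) \right\rVert_2^2

% Supervised (discriminative) embedding, with one-hot labels y_{i,c}:
\mathcal{L}_{\mathrm{CCE}}(\theta,\psi)
  = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}
    y_{i,c}\,\log\!\left[h_\psi\!\left(f_\theta(x_i)\right)\right]_c
```

In both cases the embedding used for clustering is the intermediate code f_θ(x_i); only the signal that shapes it differs.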
Each of these embeddings offers its own advantages. In the purely unsupervised case,
the appeal is that no labeling work is required, as large amounts of unlabeled
radio data are readily available. In the supervised case, the features and representations
are already guided towards signal type discrimination, but in some cases
may not generalize well to separation of new modulation types.
In figures 5.9 and 5.10 we illustrate the resulting clustering of 11 radio signal modulation
classes using these two embedding approaches. Embeddings are further reduced
from ∼40 dimensions down to 2 for visualization using t-SNE [113]. For the supervised
features we use the embedding of the final layer of a VGG-style CNN, prior to the final
fully-connected SoftMax output layer, and for the unsupervised feature training we
use the output of a small convolutional autoencoder. We color example points with their
class labels for visualization. For the supervised embedding clustering, we can see excellent
separability for virtually all classes, but label information was used in the
creation of the feature space. For unsupervised embedding clustering, we can see some
degree of separability in some of the more distinct classes (e.g. 8-PSK, AM-SSB, AM-DSB),
but see significant mixing between similar modulation types which share common basis
function properties (e.g. BPSK/QPSK mixing, QAM16/QAM64 mixing).
Figure 5.9: Supervised Signal Embeddings
Figure 5.10: Unsupervised Signal Embeddings
We can measure the ability of these approaches to generalize to some degree by training
and clustering them using hold-out classes which are introduced after embedding space
training, without labels. In doing so, we can begin to measure the quantitative accu-
racy with which each approach successfully detects new classes as new clusters. We also
create a clustering representation in which a human curator can begin to label examples
by cluster rather than by individual example. These are both important steps towards
creating learning systems which scale and learn from new data and emitters over time;
however, much of the quantitative analysis and optimization of this approach is left for
future work.
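As an illustration of how hold-out classes might surface as new clusters, the following minimal sketch flags an embedded example as a candidate new class when it lies far from every known class centroid. It is not the method evaluated in [175]: the nearest-centroid rule, the `radius` threshold, the function names, and the 2-D toy embeddings are all assumptions made for the example.

```python
import math
from collections import defaultdict

def centroid(points):
    """Mean of a list of equal-length embedding vectors."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def fit_centroids(embeddings, labels):
    """One centroid per known (labeled) class in the embedding space."""
    groups = defaultdict(list)
    for z, y in zip(embeddings, labels):
        groups[y].append(z)
    return {y: centroid(zs) for y, zs in groups.items()}

def assign_or_flag(z, centroids, radius):
    """Return the nearest known class, or None if z is farther than
    `radius` from every centroid (a candidate for a new cluster)."""
    label = min(centroids, key=lambda y: math.dist(z, centroids[y]))
    return label if math.dist(z, centroids[label]) <= radius else None

# Toy 2-D embeddings standing in for the learned representations.
known = [(0.0, 0.0), (0.4, 0.2), (-0.4, -0.2), (5.0, 5.0), (5.5, 4.5), (4.5, 5.5)]
labels = ["BPSK"] * 3 + ["QAM16"] * 3
cents = fit_centroids(known, labels)

near_known = assign_or_flag((0.2, -0.1), cents, radius=1.5)   # close to BPSK
held_out = assign_or_flag((10.0, -4.0), cents, radius=1.5)    # far from both
```

Examples returned as `None` could then be grouped and handed to a human curator for labeling by cluster rather than one at a time.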
5.3 Neural Network Model Discovery and Optimization
One of the biggest problems in the use of artificial neural networks for machine learn-
ing is the task of architecture selection and hyper-parameter optimization. Architectures
can make an enormous difference in the performance of a neural network in terms of
accuracy and computational cost (as recently demonstrated in [184]), by introducing appropriate
classes of tied weights (e.g. convolutional layers, dilated convolutions) and by
appropriately managing the degrees of freedom in a network (e.g. pooling, striding, etc.)
to preserve enough information at each layer while keeping the free-parameter count low
and incorporating a domain-appropriate distillation mode for information.
In section 5.3 we review a number of the published state of the art approaches in re-
cent deep learning literature for solving this problem. Unfortunately, many of these ap-
proaches are too computationally complex for people with finite computing resources
and funding (i.e. other than Google/Facebook).
As a solution we develop a model based on a simplified version of Google’s evolutionary
model search approach in [8]. Here we represent a directed graph of high level neural
network primitives and key hyper-parameters as a compact model description as shown
in figure 5.11. We implement evolutionary routines [185] for random model generation,
mutation and crossover of model graph structure and hyper-parameters, and leverage
an evolutionary particle swarm optimization [186] approach to generating and breeding
populations of models. In contrast to the approach in [8], which we presume is run across
a large distributed cluster of computing nodes (to support population sizes of 1000), we
evaluate our model on a single Nvidia Digits development server with 4 Titan X GPU
cards with substantially smaller population sizes and search lengths. We call this approach
EvolNN (evolutionary neural network).
Figure 5.11: Compact Model Network Digraph and Hyper-Parameter Search Process
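The generate, mutate, crossover, and breed loop described above can be sketched as follows. This is not the EvolNN implementation: it substitutes a plain genetic algorithm with truncation selection and elitism for the evolutionary particle swarm approach, it uses a flat layer list rather than a directed graph, and the primitive set, hyper-parameter ranges, and `toy_fitness` (a stand-in for validation accuracy after training each candidate) are hypothetical.

```python
import random

LAYER_TYPES = ("conv", "maxpool", "avgpool", "dropout")  # hypothetical primitives

def random_layer(rng):
    """Draw one layer description with randomly chosen hyper-parameters."""
    kind = rng.choice(LAYER_TYPES)
    if kind == "conv":
        return {"type": "conv", "filters": rng.choice((16, 32, 64, 128, 256)),
                "kernel": rng.choice((3, 5, 7, 13))}
    if kind == "dropout":
        return {"type": "dropout", "rate": rng.choice((0.1, 0.25, 0.5))}
    return {"type": kind, "size": rng.choice((2, 3, 4))}

def random_model(rng, max_layers=6):
    """A compact model description: an ordered list of layer dicts."""
    return [random_layer(rng) for _ in range(rng.randint(1, max_layers))]

def mutate(model, rng, rate=0.2):
    """Resample some layers and occasionally insert a new one."""
    child = [random_layer(rng) if rng.random() < rate else dict(layer)
             for layer in model]
    if rng.random() < rate:
        child.insert(rng.randint(0, len(child)), random_layer(rng))
    return child

def crossover(a, b, rng):
    """Splice a prefix of one parent onto a suffix of the other."""
    child = a[:rng.randint(0, len(a))] + b[rng.randint(0, len(b)):]
    return [dict(layer) for layer in child] or [random_layer(rng)]

def evolve(fitness, pop_size=32, generations=4, seed=0):
    """Keep the top quarter each generation and breed the remainder."""
    rng = random.Random(seed)
    population = [random_model(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(2, pop_size // 4)]
        children = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
                    for _ in range(pop_size - len(parents))]
        population = parents + children
        best = max(population + [best], key=fitness)
    return best

def toy_fitness(model):
    """Placeholder score: rewards conv layers, penalizes depth."""
    convs = sum(1 for layer in model if layer["type"] == "conv")
    return convs - 0.3 * len(model)

best_model = evolve(toy_fitness)
```

In a real search the fitness call dominates the cost, since each candidate must be built and trained before it can be scored, which is why population size and generation count are the main budget controls.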
Evaluating the model on several benchmark test sets, evolutionary model search finds solutions
which score quite well on standard benchmarks like MNIST fairly readily (figure
5.13), while we can also apply the search problem to very difficult datasets such as the
hard 24-modulation dataset from section 4.2.3, shown in figure 5.12. In this case, the tasks
of image and modulation classification are completely different domains, but the same
evolutionary approach is able to find reasonable solutions to both very quickly.
Figure 5.12: EvolNN ModRec Net Search Accuracy
Figure 5.13: EvolNN MNIST Net Search Accuracy
Table 5.1: Final small MNIST search CNN network
Layer        Output dimensions
Input        28 × 28 × 1
Conv         24 × 24 × 104
Dropout      24 × 24 × 104
Flatten      59904
FC/SoftMax   10
Table 5.2: Final Modrec search CNN network
Layer        Output dimensions
Input        1024 × 2
Conv         335 × 11
Conv         323 × 256
AvgPool      107 × 256
MaxPool      21 × 256
MaxPool      5 × 256
Flatten      1280
FC/SoftMax   24
Both of these tasks are configured by providing a reference dataset with input and output
shapes, a categorical cross-entropy (CCE) loss function for classification, and an evolutionary
model configuration including population size, generations, cross-over rate, mutation rate, etc. For the
search accuracy trajectories shown above, we ultimately obtain the best models given in
tables 5.1 and 5.2.
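The output dimensions listed in tables 5.1 and 5.2 can be sanity-checked with a few lines of shape arithmetic. The kernel and pool sizes below are not stated in the text; they are one choice inferred from the listed shapes, assuming unpadded windows, and other combinations could produce the same dimensions.

```python
def window_out(length, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution or pooling window."""
    return (length - kernel) // stride + 1

# Table 5.1: 28 x 28 input -> 24 x 24 x 104 is consistent with an unpadded
# 5 x 5 kernel at stride 1 (inferred, not stated) and 104 feature maps.
mnist_side = window_out(28, kernel=5)
mnist_flat = mnist_side * mnist_side * 104

# Table 5.2: the pooling chain 323 -> 107 -> 21 -> 5 is consistent with
# non-overlapping pools of size 3, 5 and 4 (again inferred from the shapes).
modrec_lengths = []
length = 323
for pool in (3, 5, 4):
    length = window_out(length, kernel=pool, stride=pool)
    modrec_lengths.append(length)
modrec_flat = modrec_lengths[-1] * 256   # 256 feature maps before flattening
```

Both flattened sizes agree with the tables: 59904 for the MNIST network and 1280 for the modulation recognition network.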
For the MNIST model, we find the solution in only 4 generations of population size 32
with the best model achieving an accuracy of 99.22% on the validation set. For the mod-
ulation recognition task, a significantly more difficult task, we obtain a slightly larger
network, which learns to narrow the information representation gradually using several
convolutional layers and pooling layers. In this case the best performance is only 42%
validation set accuracy, compared to the ∼76% achieved through expert design, but we
observe steady, slow growth in performance throughout the evolution process, and believe
that with additional search tuning and longer search times much better models could be
found through this approach.
Figure 5.14: EvolNN CFO estimation network search loss
By simply changing the objective loss function of the evolutionary process (in this case to
MSE) we can use the same infrastructure to search for optimal regression networks. In
this case, the CFO estimation network we previously showed in table 4.1 is the best model
found for a model search on our CFO estimation dataset and task. Figure 5.14 shows the
evolutionary model loss over a number of generations, where we can see the estimator
MSE converging to smaller values throughout the search process and ultimately arriving
at a best MSE of 0.0011. Here we search for 32 generations, each with a population size of
32.
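Switching the search from classification to regression amounts to swapping the loss used to score each candidate. The sketch below shows one way this could look; the `LOSSES` registry, `score_model`, and the toy estimator are hypothetical names and data for illustration, not part of the EvolNN code.

```python
import math

def cce(y_true, y_pred, eps=1e-12):
    """Categorical cross-entropy for a one-hot target and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def mse(y_true, y_pred):
    """Mean squared error between target and predicted vectors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

LOSSES = {"cce": cce, "mse": mse}   # hypothetical registry keyed by task type

def score_model(predict, examples, loss_name):
    """Mean loss of a candidate over (input, target) pairs; lower is better."""
    loss = LOSSES[loss_name]
    return sum(loss(y, predict(x)) for x, y in examples) / len(examples)

# Toy stand-in for a CFO estimator that under-estimates the offset by 10%.
toy_estimator = lambda x: [0.9 * x[0]]
cfo_examples = [([0.1], [0.1]), ([-0.2], [-0.2])]
regression_score = score_model(toy_estimator, cfo_examples, "mse")

# The same scoring routine handles a classifier by naming the other loss.
toy_classifier = lambda x: [0.2, 0.7, 0.1]
class_examples = [([0.0], [0, 1, 0])]
classification_score = score_model(toy_classifier, class_examples, "cce")
```

Because the search only ever sees a scalar score per candidate, the rest of the evolutionary machinery is unchanged between the two task types.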
Neuro-evolution [29] is a very powerful tool with significant biological grounding
in living creatures. We leverage this very high level intuition for evolutionary model
selection and loss-feedback-based model optimization; both are very rough approximations
of how we believe biological learning and evolution occur. Both of these processes
seem to have a very long way to go before any notion of optimality is reached, but
initial results are promising and provide reasonably good results and generality
on new tasks such as estimator synthesis, for which no literature exists on best practices
for manually crafting and optimizing model architectures. The models shown here are
still trivially small compared with full size state of the art architectures used today, but
sufficient computational resources and evolutionary tuning should close this gap. As computing
and data become increasingly cheap, such guided search approaches to model and
architecture selection are increasingly appealing when compared with lengthy, expensive
and less effective manual architecture engineering and tuning cycles.
Chapter 6
Conclusion
Machine learning and parallel computing have provided a set of incredibly powerful
tools over recent years which have opened up orders of magnitude improvement in our
ability to optimize very large scale high degree of freedom problems through direct gra-
dient descent on well formed loss functions. While these tools are being readily applied
in the computer vision and NLP spaces today, their full engineering impact
will not be realized in applications and in industry for many years to come. These tools
represent a major shift in algorithm design away from simplified model based solutions
to problems and specialized software routines towards data-centric model optimization
using highly general parametric models capable of learning very highly dimensional so-
lutions to many difficult tasks through end-to-end learning.
This enormous shift in design methodology does not mean we can’t perform quantitative
analysis, measurement, and probabilistic characterization of the performance of such
models, but it does make predicting or guaranteeing performance somewhat more difficult
in many cases, principally because the models are derived from dataset distributions which
are themselves not well characterized and are formed from complex real world distributions.
Many of the probabilistic tools for guaranteeing, explaining and optimizing
performance are catching up quickly, but since such models now rely heavily on high-dimensional
datasets directly for learning, performance guarantees will necessarily become
a much more complex function of the dataset distribution rather than of a compact
simplified model.
This same shift was extremely contentious in the computer vision domain before widespread
adoption, and the same resistance is being felt in many other fields including radio
signal processing. There is real hostility towards learning directly from rich distributions of large
datasets rather than assuming the conventional compact models which have been used for
years in the wireless space. As stated by an anonymous [highly negative]
reviewer for a conference paper this past year, “Radio spectra are not mere images of
cats but are issued from well acknowledged and fairly accurate wireless communication
models”; many people are not pleased with attempts to rely on data instead of solely on these
models. In reality many of the models used are insufficient, and there is much still to gain
by leveraging the best of both worlds.
Ultimately we are at a crossroads in wireless and signal processing, where practitioners
of both analytic compact model construction and large approximate model construction
must adopt good practices for data science, such as using benchmark tasks and
datasets which truly reflect useful target tasks in the real world. We have attempted to
help address this issue by open sourcing and publishing several datasets throughout this
work, on which numerous classes of approach can be fairly compared in a quantitative fashion.
We truly hope that more people will adopt this approach to algorithm development
and optimization, making benchmarks open and quantitatively tracking approach
scores in a way similar to ImageNet [1], CIFAR [107], Kaggle challenges [187], or other
well characterized and scored tasks.
Funding agencies such as DARPA, NIST, and NSF, or industry, can significantly help this process
by explicitly funding, promoting and publishing high quality datasets to
accompany desired tasks, which is no trivial undertaking and can often require significant real
investment. This approach has been highly successful in vision and other fields and stands
to revolutionize how communications system engineering is done today.
While much of the work throughout my dissertation studies has perhaps raised many
more questions than it has answered about the topic, I believe many of the data-centric
approaches to radio signal sensing, labeling, communications system synthesis, and design
developed herein hold a high degree of inevitability for the field. Certainly specific
optimization techniques and architectures will continue to change and advance over the
coming years, but the basic shift towards optimization of high degree of freedom models
on real datasets and impairment models seems likely to rapidly become the norm
as quantitative performance results become stronger and more widely disseminated and
accepted. The list of potential research directions and applications in the field as this
transition occurs represents an enormously rich array of possibilities and areas for improvement.
Initial results shown herein provide significant evidence of the disruptive
potential for improvement in the radio signal processing space, in communications system
learning, sensor system learning, and many similar applications considered from a
machine learning perspective. Building, integrating and deploying such systems into the
real world holds many remaining engineering challenges, but I look forward to rapidly
maturing this field and learning from others as the field grows and machine learning
based radio physical layers and signal processing techniques improve.
6.1 Publication List
Below is the relevant body of academic published work corresponding to my dissertation
research over the past several years.
Journal Articles
• T. O’Shea, J. Hoydis [An Introduction to Deep Learning for the Physical Layer], IEEE
Transactions on Cognitive Communications Systems, 2017 (accepted)
• T. O’Shea, T. Roy, T. Clancy [Over the Air Deep Learning Based Radio Signal Clas-
sification] IEEE JSTSP 2017 (accepted)
• T. O’Shea, T. Erpek, T. Clancy [Deep Learning-Based MIMO Communications],
(under resubmission)
• T. O’Shea, T. Clancy, T. Roy, T. Erpek, K. Karra [Deep Learning and Data Centric
Approaches to Wireless Signal Processing Systems], (under resubmission)
• C Clancy, J Hecker, E Stuntebeck, T O’Shea [Applications of machine learning to
cognitive radio networks] Wireless Communications, IEEE 14 (4), 47-52
Peer Reviewed Conference Papers
• T. O’Shea, T. Roy, T. Clancy, [Learning Robust General Radio Signal Detection using
Computer Vision Methods], Asilomar SSC 2017 (to appear)
• T. O’Shea, T. Erpek, T. Clancy, [Physical Layer Deep Learning of Encodings for the
MIMO Fading Channel], Allerton Conference on Communications, Control, and
Computing 2017
• T. O’Shea, K. Karra, T. Clancy, [Learning Approximate Neural Estimators for Wireless
Channel State Information], IEEE MLSP 2017
• T. O’Shea, T. Roy, T. Erpek, [Spectral Detection and Localization of Radio Events with
Learned Convolutional Neural Features], IEEE EUSIPCO 2017
• T. O’Shea, N. West, M. Vondal, T. Clancy [Semi-Supervised Radio Signal Identification],
IEEE International Conference on Advanced Communications Technology,
2017 (outstanding paper award)
• N. West, T. O’Shea [Deep Architectures for Modulation Recognition], IEEE DySpan,
2017
• T. O’Shea, S. Hitefield, J. Corgan [End-to-end Traffic Sequence Recognition with Recurrent
Neural Networks], IEEE GlobalSip, 2016
• T. O’Shea, K. Karra, T. Clancy, [Learning to Communicate: Channel auto-encoders,
Domain Specific Regularizers, and Attention], IEEE International Symposium on
Signal Processing and Information Technology 2016
• T. O’Shea, L. Pemula, D. Batra, T. Clancy, [Radio Transformer Networks: Attention
Models for Learning to Synchronize in Wireless Systems], IEEE Asilomar Conference
on Signals, Systems and Computing 2016
• T. O’Shea, N. West, [Radio Machine Learning Dataset Generation with GNU Radio],
GNU Radio Conference 2016
• T. O’Shea, J. Corgan, T. Clancy, [Unsupervised Representation Learning of Structured
Radio Communication Signals] International Workshop on Sensing, Processing
and Learning for Intelligent Machines 2016
• T. O’Shea, J. Corgan, T. Clancy, [Convolutional Radio Modulation Recognition Networks]
Engineering Applications of Neural Networks 2016
• D. CaJacob, N. McCarthy, T. O’Shea, R. McGwier, [Geolocation of RF Emitters with
a Formation-Flying Cluster of Three Microsatellites] Small Satellite Conference 2016
• T. O’Shea, K. Karra [GNU Radio Signal Processing Models for Dynamic Multi-User
Burst Modems] Software Radio Implementation Forum 2015
• S. Hitefield, V. Nguyen, C. Carlson, T. O’Shea, T. Clancy [Demonstrated LLC-layer
attack and defense strategies for wireless communication systems] IEEE Conference
on Communications and Network Security (CNS) 2014
• C. Carlson, V. Nguyen, S. Hitefield, T. O’Shea, T. Clancy [Measuring smart jammer
strategy efficacy over the air] IEEE Conference on Communications and Network
Security (CNS) 2014
• T. O’Shea, T. Rondeau, [A universal GNU radio performance benchmarking suite],
Karlsruhe Workshop on Software Radio 2014
• T. Rondeau, T. O’Shea, [Designing Analysis and Synthesis Filterbanks in GNU Ra-
dio], Karlsruhe Workshop on Software Radios 2014
Pre-Publication Papers
• T. O’Shea, T. Clancy, [Deep Reinforcement Learning Radio Control and Signal Detection
with KeRLym, a Gym RL Agent], ArXiv Pre-publication 1605.09221 2016
• T. O’Shea, T. Clancy, R. McGwier [Recurrent Neural Radio Anomaly Detection],
ArXiv Pre-Publication 1611.00301 2016
• T. O’Shea, A. Mondl, T. Clancy [A Modest Proposal for Open Market Risk Assess-
ment to Solve the Cyber-Security Problem] ArXiv Pre-Publication 1604.08675 2016
Invited/Non-Paper Talks
• T. O’Shea, [The Future of Radio: Learning Efficient Signal Processing Systems],
GNU Radio Conference 2017
• T. O’Shea, [Learning Signal Processing and Communications Systems from Data],
IEEE CCAA Workshop Keynote 2017
• T. O’Shea, [Deep Learning on the Radio Physical Layer], JASON 2017 Summer
Study
• T. O’Shea, [TensorFlow Applications in Signal Processing], IEEE International Conference
for High Performance Computing, Networking, Storage and Analysis (TensorFlow
BoF Hosted by Google) 2016
• T. O’Shea, [Radio Data Analytics with Machine Learning], International Symposium
on Advanced Radio Technologies (ISART) 2016
• R. McGwier, T. O’Shea, K. Karra, M. Fowler, [Recent Developments in Artificial Intelligence
Applications of Deep Learning for Signal Processing], Virginia Tech Wireless
Symposium 2016
• T. O’Shea, [Handing Full Control of the Radio Spectrum over to the Machines], DEFCON
Wireless Village 2016
• T. O’Shea, [Radio Machine Learning with FOSS, GNU Radio and TensorFlow] FOSDEM
2016
• T. O’Shea, [Rapid GNU Radio GPU Algorithm Prototyping from Python (gr-theano)],
FOSDEM 2015
• T. O’Shea, [GNU Radio Tools for Radio Wrangling and Spectrum Domination], DEFCON
23 Wireless Village 2015
• T. O’Shea, [Tutorial: Exploring Data], GNU Radio Conference 2015
Bibliography
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep
convolutional neural networks,” in Advances in neural information processing systems,
2012, pp. 1097–1105.
[2] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalch-
brenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw
audio,” arXiv preprint arXiv:1609.03499, 2016.
[3] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov,
“Dropout: a simple way to prevent neural networks from overfitting.” Journal of
Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[4] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”
in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016,
pp. 770–778.
[5] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer networks,” in
Advances in Neural Information Processing Systems, 2015, pp. 2017–2025.
[6] C. Moore, “Data processing in exascale-class computer systems,” in The Salishan
Conference on High Speed Computing, 2011.
[7] J.-H. Huang, “Keynote and volta series product announcement,” in GPU Technology
Conference, 2017.
[8] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, Q. V. Le, and A. Kurakin,
“Large-scale evolution of image classifiers,” CoRR, vol. abs/1703.01041, 2017.
[Online]. Available: http://arxiv.org/abs/1703.01041
[9] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional net-
works,” in European conference on computer vision. Springer, 2014, pp. 818–833.
[10] K. Simonyan, A. Vedaldi, and A. Zisserman, “Deep inside convolutional net-
works: Visualising image classification models and saliency maps,” arXiv preprint
arXiv:1312.6034, 2013.
[11] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam:
Visual explanations from deep networks via gradient-based localization,” arXiv preprint
arXiv:1610.02391, 2016.
[12] R. Shwartz-Ziv and N. Tishby, “Opening the black box of deep neural
networks via information,” CoRR, vol. abs/1703.00810, 2017. [Online]. Available:
http://arxiv.org/abs/1703.00810
[13] “Lte phy lab, e-utra phy golden reference model,”
https://www.is-wireless.com/5g-toolset-old/lte-phy-lab-old/, (Accessed on 10/01/2017).
[14] F. Chollet, “Building autoencoders in keras,”
https://blog.keras.io/building-autoencoders-in-keras.html, (Accessed on 10/01/2017).
[15] O. A. Dobre, A. Abdi, Y. Bar-Ness, and W. Su, “Survey of automatic modulation
classification techniques: classical approaches and new trends,” IET communica-
tions, vol. 1, no. 2, pp. 137–156, 2007.
[16] C. Olah, “Understanding lstm networks,” Online Article
http://colah.github.io/posts/2015-08-Understanding-LSTMs/, 2015.
[17] J. Redmon and A. Farhadi, “Yolo9000: better, faster, stronger,” arXiv preprint
arXiv:1612.08242, 2016.
[18] T. J. O’Shea, T. Roy, and T. C. Clancy, “Learning robust general radio signal detection
using computer vision methods,” in 2017 51st Asilomar Conference on Signals,
Systems and Computers, Nov 2017.
[19] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.
[20] C. E. Shannon, “A mathematical theory of communication,” Bell System Technical
Journal, vol. 27, no. 3, pp. 379–423, Jul. 1948.
[21] C. Berrou, A. Glavieux, and P. Thitimajshima, “Near shannon limit error-correcting
coding and decoding: Turbo-codes. 1,” in Communications, 1993. ICC’93 Geneva.
Technical Program, Conference Record, IEEE International Conference on, vol. 2. IEEE,
1993, pp. 1064–1070.
[22] R. M. Pyndiah, “Near-optimum decoding of product codes: Block turbo codes,”
IEEE Transactions on communications, vol. 46, no. 8, pp. 1003–1010, 1998.
[23] R. Gallager, “Low-density parity-check codes,” IRE Transactions on information the-
ory, vol. 8, no. 1, pp. 21–28, 1962.
[24] R. v. Nee and R. Prasad, OFDM for wireless multimedia communications. Artech
House, Inc., 2000.
[25] P. Patel and J. Holtzman, “Analysis of a simple successive interference cancella-
tion scheme in a ds/cdma system,” IEEE journal on selected areas in communications,
vol. 12, no. 5, pp. 796–807, 1994.
[26] H. Wymeersch, Iterative receiver design. Cambridge University Press Cambridge,
2007, vol. 234.
[27] D. J. Jakubisin, R. M. Buehrer, and C. R. da Silva, “Bp, mf, and ep for joint channel
estimation and detection of mimo-ofdm signals,” in Global Communications Confer-
ence (GLOBECOM), 2016 IEEE. IEEE, 2016, pp. 1–6.
[28] M. J. Demongeot, M. J. Mazoyer, M. P. Peretto, and M. D. Whitley, “Neural network
synthesis using cellular encoding and the genetic algorithm.” 1994.
[29] J. Branke, “Evolutionary algorithms for neural network design and training,” in
Proceedings of the First Nordic Workshop on Genetic Algorithms and its Applications.
Citeseer, 1995.
[30] J. Bruck and M. Blaum, “Neural networks, error-correcting codes, and polynomials
over the binary n-cube,” IEEE Transactions on information theory, vol. 35, no. 5, pp.
976–987, 1989.
[31] F. Jondral, “Automatic classification of high frequency signals,” Signal Processing,
vol. 9, no. 3, pp. 177–190, 1985.
[32] M. A. Hearst, S. T. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector
machines,” IEEE Intelligent Systems and their applications, vol. 13, no. 4, pp. 18–28,
1998.
[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-
scale hierarchical image database,” in Computer Vision and Pattern Recognition, 2009.
CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248–255.
[34] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” Interna-
tional journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
[35] J. Sanchez and F. Perronnin, “High-dimensional signature compression for large-
scale image classification,” in Computer Vision and Pattern Recognition (CVPR), 2011
IEEE Conference on. IEEE, 2011, pp. 1665–1672.
[36] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the
speech front-end with raw waveform cldnns,” in Sixteenth Annual Conference of the
International Speech Communication Association, 2015.
[37] Y. N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y. Bengio, “Iden-
tifying and attacking the saddle point problem in high-dimensional non-convex
optimization,” in Advances in neural information processing systems, 2014, pp. 2933–
2941.
[38] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural
networks with pruning, trained quantization and huffman coding,” arXiv preprint
arXiv:1510.00149, 2015.
[39] P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz, “Pruning convolutional
neural networks for resource efficient inference,” 2016.
[40] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on com-
puter vision, 2015, pp. 1440–1448.
[41] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits
in natural images with unsupervised feature learning,” in NIPS workshop on deep
learning and unsupervised feature learning, vol. 2011, no. 2, 2011, p. 5.
[42] W. Namgoong and T. H. Meng, “Direct-conversion rf receiver design,” IEEE Trans-
actions on Communications, vol. 49, no. 3, pp. 518–529, 2001.
[43] R. AD9361, “Agile transceiver, data sheet, analog devices,” Inc, vol. 2, p. 014, 2013.
[44] H. Nyquist, “Certain topics in telegraph transmission theory,” Transactions of the
American Institute of Electrical Engineers, vol. 47, no. 2, pp. 617–644, 1928.
[45] M. Ettus, “Universal software radio peripheral,” 2009.
[46] T. O’Shea, “Gnu radio channel simulation,” in GNU Radio Conference 2013, 2013.
[47] D. Middleton, An introduction to statistical communication theory. IEEE Press,
Piscataway, NJ, 1996.
[48] J. Mitola and G. Q. Maguire, “Cognitive radio: making software radios more per-
sonal,” IEEE personal communications, vol. 6, no. 4, pp. 13–18, 1999.
[49] T. W. Rondeau, “Application of artificial intelligence to wireless communications,”
Ph.D. dissertation, Virginia Polytechnic Institute and State University, 2007.
[50] P. J. Kolodzy, “Dynamic spectrum policies: promises and challenges,” CommLaw
Conspectus, vol. 12, p. 147, 2004.
[51] T. C. Clancy, “Dynamic spectrum access in cognitive radio networks,” Ph.D. disser-
tation, 2006.
[52] W. Gardner, W. Brown, and C.-K. Chen, “Spectral correlation of modulated signals:
Part ii–digital modulation,” IEEE Transactions on Communications, vol. 35, no. 6, pp.
595–601, 1987.
[53] S. Geirhofer, L. Tong, and B. M. Sadler, “Cognitive radios for dynamic spectrum
access-dynamic spectrum access in the time domain: Modeling and exploiting
white space,” IEEE Communications Magazine, vol. 45, no. 5, 2007.
[54] Z. Ji and K. R. Liu, “Cognitive radios for dynamic spectrum access-dynamic
spectrum sharing: A game theoretical overview,” IEEE Communications Magazine,
vol. 45, no. 5, 2007.
[55] A. Amanna and J. H. Reed, “Survey of cognitive radio architectures,” in IEEE South-
eastCon 2010 (SoutheastCon), Proceedings of the. IEEE, 2010, pp. 292–297.
[56] E. Stuntebeck, T. O’Shea, J. Hecker, and T. Clancy, “Architecture for an open-source
cognitive radio,” in Proceedings of the SDR forum technical conference, 2006.
[57] P. J. Werbos, “Applications of advances in nonlinear sensitivity analysis,” in System
modeling and optimization. Springer, 1982, pp. 762–770.
[58] N. Qian, “On the momentum term in gradient descent learning algorithms,” Neural
networks, vol. 12, no. 1, pp. 145–151, 1999.
[59] I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization
and momentum in deep learning,” in International conference on machine learning,
2013, pp. 1139–1147.
[60] A. Nemirovskii, D. B. Yudin, and E. R. Dawson, “Problem complexity and method
efficiency in optimization,” 1983.
[61] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradient by a running
average of its recent magnitude,” COURSERA: Neural Networks for Machine Learn-
ing, vol. 4, no. 2, 2012.
[62] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint
arXiv:1412.6980, 2014.
[63] J. Zhang, I. Mitliagkas, and C. Re, “Yellowfin and the art of momentum tuning,”
arXiv preprint arXiv:1706.03471, 2017.
[64] C. Xu, T. Qin, G. Wang, and T.-Y. Liu, “Reinforcement learning for learning rate
control,” arXiv preprint arXiv:1705.11159, 2017.
[65] T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman, “Random synaptic
feedback weights support error backpropagation for deep learning,” Nature com-
munications, vol. 7, 2016.
[66] B. Scellier and Y. Bengio, “Equilibrium propagation: Bridging the gap between
energy-based models and backpropagation,” Frontiers in computational neuroscience,
vol. 11, 2017.
[67] D. P. Kingma, T. Salimans, and M. Welling, “Improving variational inference with
inverse autoregressive flow,” arXiv preprint arXiv:1606.04934, 2016.
[68] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann ma-
chines,” in Proc. Int. Conf. Mach. Learn. (ICML), 2010, pp. 807–814.
[69] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Pro-
ceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics,
2011, pp. 315–323.
[70] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve neural
network acoustic models,” in Proc. ICML, vol. 30, no. 1, 2013.
[71] J. S. Bridle, “Training stochastic model recognition algorithms as networks can lead
to maximum mutual information estimation of parameters,” in Advances in neural
information processing systems, 1990, pp. 211–217.
[72] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network
learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.
[73] G. Klambauer, T. Unterthiner, A. Mayr, and S. Hochreiter, “Self-normalizing neural
networks,” arXiv preprint arXiv:1706.02515, 2017.
[74] C. Dugas, Y. Bengio, F. Belisle, C. Nadeau, and R. Garcia, “Incorporating second-
order functional knowledge for better option pricing,” in Advances in neural infor-
mation processing systems, 2001, pp. 472–478.
[75] Y. LeCun, Y. Bengio et al., “Convolutional networks for images, speech, and time
series,” The handbook of brain theory and neural networks, 1995.
[76] A. B. Geva, “Scalenet-multiscale neural-network architecture for time series predic-
tion,” IEEE Transactions on neural networks, vol. 9, no. 6, pp. 1471–1482, 1998.
[77] M. Sundermeyer, R. Schluter, and H. Ney, “Lstm neural networks for language
modeling,” in Thirteenth Annual Conference of the International Speech Communication
Association, 2012.
[78] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to forget: Continual predic-
tion with lstm,” 1999.
[79] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, and R. Fergus, “Regularization of neural
networks using dropconnect,” in Proceedings of the 30th international conference on
machine learning (ICML-13), 2013, pp. 1058–1066.
[80] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training
by reducing internal covariate shift,” in International Conference on Machine Learning,
2015, pp. 448–456.
[81] R. K. Srivastava, K. Greff, and J. Schmidhuber, “Highway networks,” arXiv preprint
arXiv:1505.00387, 2015.
[82] G. E. Moore, “Cramming more components onto integrated circuits,” Electronics,
vol. 38, no. 8, 1965.
[83] M. Gschwind, “Chip multiprocessing and the cell broadband engine,” in Proceed-
ings of the 3rd conference on Computing frontiers. ACM, 2006, pp. 1–8.
[84] D. Wentzlaff, P. Griffin, H. Hoffmann, L. Bao, B. Edwards, C. Ramey, M. Mattina,
C.-C. Miao, J. F. Brown III, and A. Agarwal, “On-chip interconnection architecture
of the tile processor,” IEEE micro, vol. 27, no. 5, pp. 15–31, 2007.
[85] N. McCarthy, E. Blossom, N. Goergen, T. O’Shea, and C. Clancy, “High-performance
sdr: Gnu radio and the ibm cell broadband engine,” in Virginia Tech Wireless Personal
Communications Symposium, 2008.
[86] NVIDIA Corporation, “Compute unified device architecture programming guide,” 2007.
[87] J. Hensley, “Close to the metal,” SIGGRAPH’07, 2007.
[88] A. Munshi, “Opencl: Parallel computing on the gpu and cpu,” SIGGRAPH, Tutorial,
pp. 11–15, 2008.
[89] G. Harrison, A. Sloan, W. Myrick, J. Hecker, and D. Eastin, “Polyphase channeliza-
tion utilizing general-purpose computing on a gpu,” in SDR 2008 technical conference
and product exposition, 2008.
[90] G. F. Zaki, W. Plishker, T. O’Shea, N. McCarthy, C. Clancy, E. Blossom, and S. S.
Bhattacharyya, “Integration of dataflow optimization techniques into a software
radio design framework,” in Signals, Systems and Computers, 2009 Conference Record
of the Forty-Third Asilomar Conference on. IEEE, 2009, pp. 243–247.
[91] M. Piscopo, “Study on implementing opencl in common gnuradio blocks,”
Proceedings of the GNU Radio Conference, vol. 2, no. 1, p. 67, 2017. [Online]. Available:
https://pubs.gnuradio.org/index.php/grcon/article/view/15
[92] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian,
D. Warde-Farley, and Y. Bengio, “Theano: A cpu and gpu math compiler in
python,” in Proc. 9th Python in Science Conf, 2010, pp. 1–7.
[93] E. Jones, T. Oliphant, P. Peterson et al., “SciPy: Open source scientific
tools for Python,” 2001–. [Online]. Available: http://www.scipy.org/
[94] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin et al., “Tensorflow: Large-scale machine learning on
heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
[95] J. McCarthy, “Recursive functions of symbolic expressions and their computation
by machine, part i,” Communications of the ACM, vol. 3, no. 4, pp. 184–195, 1960.
[96] F. Chollet, “keras,” https://github.com/fchollet/keras, 2015.
[97] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in
Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014, pp.
675–678.
[98] S. Tokui, K. Oono, S. Hido, and J. Clayton, “Chainer: a next-generation open source
framework for deep learning,” in Proceedings of workshop on machine learning systems
(LearningSys) in the twenty-ninth annual conference on neural information processing sys-
tems (NIPS), vol. 5, 2015.
[99] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlab-like environment
for machine learning,” in BigLearn, NIPS Workshop, 2011.
[100] A. Paszke, S. Gross, and S. Chintala, “Pytorch,” 2017.
[101] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and
Z. Zhang, “Mxnet: A flexible and efficient machine learning library for heteroge-
neous distributed systems,” arXiv preprint arXiv:1512.01274, 2015.
[102] S. Dieleman, J. Schlüter, C. Raffel, E. Olson, S. K. Sønderby, D. Nouri et al., “Lasagne:
First release,” Aug. 2015. [Online]. Available: http://dx.doi.org/10.5281/zenodo.27878
[103] D. Maclaurin, D. Duvenaud, and R. P. Adams, “Gradient-based hyperparameter
optimization through reversible learning,” in Proceedings of the 32nd International
Conference on Machine Learning, 2015.
[104] A. G. Baydin, R. Cornish, D. M. Rubio, M. Schmidt, and F. Wood, “Online learning
rate adaptation with hypergradient descent,” arXiv preprint arXiv:1703.04782, 2017.
[105] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,”
arXiv preprint arXiv:1611.01578, 2016.
[106] T. Desell, “Large scale evolution of convolutional neural networks using
volunteer computing,” CoRR, vol. abs/1703.05422, 2017. [Online]. Available:
http://arxiv.org/abs/1703.05422
[107] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny im-
ages,” 2009.
[108] D. Erhan, Y. Bengio, A. Courville, and P. Vincent, “Visualizing higher-layer features
of a deep network,” University of Montreal, Tech. Rep. 1341, 2009.
[109] F. E. Terman et al., “Radio engineering,” 1937.
[110] T. J. O’Shea, K. Karra, and T. C. Clancy, “Learning to communicate: Channel auto-
encoders, domain specific regularizers, and attention,” in 2016 IEEE International
Symposium on Signal Processing and Information Technology (ISSPIT), Dec 2016, pp.
223–228.
[111] T. O’Shea and J. Hoydis, “An introduction to deep learning for the physical layer,”
IEEE Transactions on Cognitive Communications and Networking, vol. PP, no. 99, pp.
1–1, 2017.
[112] Y. Li, R. Xu, and F. Liu, “Whiteout: Gaussian adaptive regularization noise in deep
neural networks,” arXiv preprint arXiv:1612.01490, 2016.
[113] L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res.,
vol. 9, no. Nov, pp. 2579–2605, 2008.
[114] F. Liang, C. Shen, and F. Wu, “An iterative bp-cnn architecture for channel decod-
ing,” arXiv preprint arXiv:1707.05697, 2017.
[115] S. Cammerer, T. Gruber, J. Hoydis, and S. ten Brink, “Scaling deep learning-based
decoding of polar codes via partitioning,” arXiv preprint arXiv:1702.06901, 2017.
[116] T. Gruber, S. Cammerer, J. Hoydis, and S. ten Brink, “On deep learning-based channel
decoding,” in 2017 51st Annual Conference on Information Sciences and Systems (CISS),
March 2017, pp. 1–6.
[117] T. J. O’Shea, L. Pemula, D. Batra, and T. C. Clancy, “Radio transformer networks:
Attention models for learning to synchronize in wireless systems,” in 2016 50th
Asilomar Conference on Signals, Systems and Computers, Nov 2016, pp. 662–666.
[118] N. TSGRANGRA, “Evolved universal terrestrial radio access (e-utra); multiplexing
and channel coding,” 3rd Generation Partnership Project (3GPP), TS 36, 2009.
[119] L. ETSI, “Evolved universal terrestrial radio access (e-utra); physical channels and
modulation,” ETSI TS 136 211, V9.
[120] D. Gesbert, M. Shafi, D.-s. Shiu, P. J. Smith, and A. Naguib, “From theory to practice:
An overview of mimo space-time coded wireless systems,” IEEE Journal on selected
areas in Communications, vol. 21, no. 3, pp. 281–302, 2003.
[121] E. Luther, “5g massive mimo testbed: From theory to reality,” white paper, available
online: https://studylib.net/doc/18730180/5g-massive-mimo-testbed--from-theory-to-reality,
2014.
[122] V. Tarokh, H. Jafarkhani, and A. R. Calderbank, “Space-time block codes from or-
thogonal designs,” IEEE Transactions on Information theory, vol. 45, no. 5, pp. 1456–
1467, 1999.
[123] S. M. Alamouti, “A simple transmit diversity technique for wireless communica-
tions,” IEEE Journal on Selected Areas in Communications, vol. 16, no. 8, pp. 1451–1458,
Oct 1998.
[124] A. J. Paulraj, R. W. Heath Jr, P. K. Sebastian, and D. J. Gesbert, “Spatial multiplexing
in a cellular network,” May 23 2000, US Patent 6,067,290.
[125] S. Dorner, S. Cammerer, J. Hoydis, and S. ten Brink, “Deep Learning-Based Com-
munication Over the Air,” ArXiv e-prints, Jul. 2017.
[126] T. Soderstrom and P. Stoica, System identification. Prentice-Hall, Inc., 1988.
[127] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud, “Backpropagation
through the void: Optimizing control variates for black-box gradient estimation,”
arXiv preprint arXiv:1711.00123, 2017.
[128] T. J. O’Shea, K. Karra, and T. C. Clancy, “Learning approximate neural estimators
for wireless channel state information,” in 2017 IEEE International Workshop on Machine
Learning for Signal Processing (MLSP), Sep 2017.
[129] Y. Wang, K. Shi, and E. Serpedin, “Non-data-aided feedforward carrier frequency
offset estimators for qam constellations: A nonlinear least-squares approach,”
EURASIP Journal on Advances in Signal Processing, vol. 2004, no. 13, p. 856139, 2004.
[130] O. Catoni et al., “Challenging the empirical mean and empirical variance: a devia-
tion study,” in Annales de l’Institut Henri Poincare, Probabilites et Statistiques, vol. 48,
no. 4. Institut Henri Poincare, 2012, pp. 1148–1185.
[131] T. J. O’Shea, J. Corgan, and T. C. Clancy, “Convolutional radio modulation recognition
networks,” in International Conference on Engineering Applications of Neural Networks.
Springer, 2016, pp. 213–226.
[132] K. S. K. Arumugam, I. A. Kadampot, M. Tahmasbi, S. Shah, M. Bloch, and
S. Pokutta, “Modulation recognition using side information and hybrid learning,”
in Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE International Symposium
on. IEEE, 2017, pp. 1–2.
[133] K. Triantafyllakis, M. Surligas, G. Vardakis, and S. Papadakis, “Phasma: An auto-
matic modulation classification system based on random forest,” in Dynamic Spec-
trum Access Networks (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017,
pp. 1–3.
[134] M. Laghate, S. Chaudhari, and D. Cabric, “Usrp n210 demonstration of wideband
sensing and blind hierarchical modulation classification,” in Dynamic Spectrum Ac-
cess Networks (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017, pp.
1–3.
[135] J. L. Ziegler, R. T. Arn, and W. Chambers, “Modulation recognition with gnu ra-
dio, keras, and hackrf,” in Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE
International Symposium on. IEEE, 2017, pp. 1–3.
[136] K. Karra, S. Kuzdeba, and J. Petersen, “Modulation recognition using hierarchical
deep neural networks,” in Dynamic Spectrum Access Networks (DySPAN), 2017 IEEE
International Symposium on. IEEE, 2017, pp. 1–3.
[137] N. E. West, K. Harwell, and B. McCall, “Dft signal detection and channelization
with a deep neural network modulation classifier,” in Dynamic Spectrum Access Net-
works (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017, pp. 1–3.
[138] C. M. Spooner, A. N. Mody, J. Chuang, and J. Petersen, “Modulation recognition
using second-and higher-order cyclostationarity,” in Dynamic Spectrum Access Net-
works (DySPAN), 2017 IEEE International Symposium on. IEEE, 2017, pp. 1–3.
[139] T. J. O’Shea and N. West, “Radio machine learning dataset generation with gnu
radio,” in Proceedings of the GNU Radio Conference, vol. 1, no. 1, 2016.
[140] C. Weaver, C. Cole, R. Krumland, and M. Miller, “The automatic classification of
modulation types by pattern recognition,” Stanford University, Stanford Electronics
Laboratories, Tech. Rep., 1969.
[141] J. Aisbett, “Automatic modulation recognition using time domain parameters,” Sig-
nal Processing, vol. 13, no. 3, pp. 323–328, 1987.
[142] W. A. Gardner and C. M. Spooner, “Cyclic spectral analysis for signal detection and
modulation recognition,” in Military Communications Conference, 1988. MILCOM 88,
Conference record. 21st Century Military Communications-What’s Possible? 1988 IEEE.
IEEE, 1988, pp. 419–424.
[143] ——, “Signal interception: performance advantages of cyclic-feature detectors,”
IEEE Transactions on Communications, vol. 40, no. 1, pp. 149–159, 1992.
[144] C. M. Spooner and W. A. Gardner, “Robust feature detection for signal intercep-
tion,” IEEE transactions on communications, vol. 42, no. 5, pp. 2165–2173, 1994.
[145] A. Abdelmutalab, K. Assaleh, and M. El-Tarhuni, “Automatic modulation classi-
fication based on high order cumulants and hierarchical polynomial classifiers,”
Physical Communication, vol. 21, pp. 10–18, 2016.
[146] A. K. Nandi and E. E. Azzouz, “Algorithms for automatic modulation recognition
of communication signals,” IEEE Transactions on communications, vol. 46, no. 4, pp.
431–436, 1998.
[147] A. Fehske, J. Gaeddert, and J. H. Reed, “A new approach to signal classification
using spectral correlation and neural networks,” in New Frontiers in Dynamic Spec-
trum Access Networks, 2005. DySPAN 2005. 2005 First IEEE International Symposium
on. IEEE, 2005, pp. 144–150.
[148] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale
image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[149] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blon-
del, P. Prettenhofer, R. Weiss, V. Dubourg et al., “Scikit-learn: Machine learning in
python,” Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.
[150] D. George and E. Huerta, “Deep neural networks to enable real-time multimessen-
ger astrophysics,” arXiv preprint arXiv:1701.00008, 2016.
[151] S. Cioni, G. Colavolpe, V. Mignone, A. Modenini, A. Morello, M. Ricciulli,
A. Ugolini, and Y. Zanettini, “Transmission parameters optimization and receiver
architectures for dvb-s2x systems,” International Journal of Satellite Communications
and Networking, vol. 34, no. 3, pp. 337–350, 2016.
[152] M. Ettus and M. Braun, “The universal software radio peripheral (usrp) family of
low-cost sdrs,” Opportunistic Spectrum Sharing and White Space Access: The Practical
Reality, pp. 3–23, 2015.
[153] Analog Devices, “AD9361 RF agile transceiver,” data sheet. [Online]. Available:
http://www.analog.com/static/imported-files/data_sheets/ad9361.pdf (visited on 09/14/08).
[154] T. Chen and C. Guestrin, “Xgboost: A scalable tree boosting system,” in Proceedings
of the 22nd acm sigkdd international conference on knowledge discovery and data mining.
ACM, 2016, pp. 785–794.
[155] N. E. West and T. O’Shea, “Deep architectures for modulation recognition,” in Dy-
namic Spectrum Access Networks (DySPAN), 2017 IEEE International Symposium on.
IEEE, 2017, pp. 1–6.
[156] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification,” in Proceedings of the IEEE in-
ternational conference on computer vision, 2015, pp. 1026–1034.
[157] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-
houcke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[158] T. J. O’Shea, S. Hitefield, and J. Corgan, “End-to-end radio traffic sequence recogni-
tion with recurrent neural networks,” in 2016 IEEE Global Conference on Signal and
Information Processing (GlobalSIP), Dec 2016, pp. 277–281.
[159] R.-P. Weinmann, “Baseband attacks: Remote exploitation of memory corruptions in
cellular protocol stacks,” in WOOT, 2012, pp. 12–21.
[160] K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber, “Lstm:
A search space odyssey,” IEEE transactions on neural networks and learning systems,
2017.
[161] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation,
vol. 9, no. 8, pp. 1735–1780, 1997.
[162] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recur-
rent neural networks on sequence modeling,” arXiv preprint arXiv:1412.3555, 2014.
[163] J. Bradbury, S. Merity, C. Xiong, and R. Socher, “Quasi-recurrent neural networks,”
arXiv preprint arXiv:1611.01576, 2016.
[164] A. Orebaugh, G. Ramirez, and J. Beale, Wireshark & Ethereal network protocol analyzer
toolkit. Syngress, 2006.
[165] A. Karpathy, “The unreasonable effectiveness of recurrent neural networks,” 2015.
[Online]. Available: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
[166] H. Kim and K. G. Shin, “In-band spectrum sensing in cognitive radio networks:
energy detection or feature detection?” in Proceedings of the 14th ACM international
conference on Mobile computing and networking. ACM, 2008, pp. 14–25.
[167] R. Ewerth, M. Springstein, L. A. Phan-Vogtmann, and J. Schutze, “Are machines
better than humans in image tagging?-a user study adds to the puzzle,” in European
Conference on Information Retrieval. Springer, 2017, pp. 186–198.
[168] D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, “Deep learning for
identifying metastatic breast cancer,” arXiv preprint arXiv:1606.05718, 2016.
[169] S. Sarraf, G. Tofighi et al., “Deepad: Alzheimer’s disease classification via deep
convolutional neural networks using mri and fmri,” bioRxiv, p. 070441, 2016.
[170] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-based convolutional net-
works for accurate object detection and segmentation,” IEEE transactions on pattern
analysis and machine intelligence, vol. 38, no. 1, pp. 142–158, 2016.
[171] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified,
real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2016, pp. 779–788.
[172] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, SSD:
Single Shot MultiBox Detector. Cham: Springer International Publishing, 2016, pp.
21–37. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-46448-0_2
[173] E. Blossom, “GNU radio: tools for exploring the radio frequency spectrum,” Linux
journal, vol. 2004, no. 122, p. 4, 2004.
[174] B. Sklar, Digital communications. Prentice Hall NJ, 2001, vol. 2.
[175] T. J. O’Shea, N. West, M. Vondal, and T. C. Clancy, “Semi-supervised radio signal
identification,” in Advanced Communication Technology (ICACT), 2017 19th Interna-
tional Conference on. IEEE, 2017, pp. 33–38.
[176] O. Chapelle and A. Zien, “Semi-supervised classification by low density separation,”
in AISTATS, 2005, pp. 57–64.
[177] O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning (Chapelle, O. et
al., eds.; 2006) [book reviews],” IEEE Transactions on Neural Networks, vol. 20, no. 3,
pp. 542–542, 2009.
[178] X. Zhu and A. B. Goldberg, “Introduction to semi-supervised learning,” Synthesis
lectures on artificial intelligence and machine learning, vol. 3, no. 1, pp. 1–130, 2009.
[179] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics
and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987.
[180] A. Hyvarinen, J. Karhunen, and E. Oja, Independent component analysis. John Wiley
& Sons, 2004, vol. 46.
[181] B. Scholkopf, A. Smola, and K.-R. Muller, “Kernel principal component analysis,”
in International Conference on Artificial Neural Networks. Springer, 1997, pp. 583–588.
[182] L. Theis, W. Shi, A. Cunningham, and F. Huszar, “Lossy image compression with
compressive autoencoders,” arXiv preprint arXiv:1703.00395, 2017.
[183] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell,
“Full resolution image compression with recurrent neural networks,” arXiv preprint
arXiv:1608.05148, 2016.
[184] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning transferable architectures
for scalable image recognition,” CoRR, vol. abs/1707.07012, 2017. [Online].
Available: http://arxiv.org/abs/1707.07012
[185] T. Back and H.-P. Schwefel, “An overview of evolutionary algorithms for parameter
optimization,” Evolutionary computation, vol. 1, no. 1, pp. 1–23, 1993.
[186] G. Venter and J. Sobieszczanski-Sobieski, “Particle swarm optimization,” AIAA
journal, vol. 41, no. 8, pp. 1583–1589, 2003.
[187] A. Goldbloom, “Data prediction competitions–far more than just a bit of fun,” in
Data Mining Workshops (ICDMW), 2010 IEEE International Conference on. IEEE, 2010,
pp. 1385–1386.