THE AUTOMATIC GENERATION AND TESTING OF
SIGNAL RECOGNITION ALGORITHMS
APPROVED BY SUPERVISING COMMITTEE:
________________________________________
Dr. Kay A. Robbins, Chair
________________________________________
Dr. Weining Zhang
________________________________________
Dr. Tom Bylander
________________________________________
Dr. Qi Tian
________________________________________
Dr. Denise Varner, Outside Member
Accepted: ________________________________________
Dean, Graduate School
Copyright 2007
Kenneth Lynn Holladay
All Rights Reserved
Dedication
This dissertation is dedicated to the six most important people in my life. First, to my loving wife
Deborah for her selfless and unwavering support through thirty-one years of marriage and nine years of
graduate school. Second, to my three wonderful children, Aaron, Miriam, and Benjamin, who are a con-
stant source of joy and inspiration. Finally, to my parents, Dot and Bill Holladay, who taught me that
this was possible.
THE AUTOMATIC GENERATION AND TESTING OF
SIGNAL RECOGNITION ALGORITHMS
by
Kenneth Lynn Holladay, M.S.
DISSERTATION
Presented to the Graduate Faculty of
The University of Texas at San Antonio
in Partial Fulfillment of the Requirements
For the Degree of
DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE
THE UNIVERSITY OF TEXAS AT SAN ANTONIO
College of Sciences
Department of Computer Science
May 2007
Acknowledgments
I owe great debts of gratitude to many people for helping me complete this degree. Spe-
cial thanks go to my advisor, Dr. Kay Robbins. In the spring semester of 1998, she was teaching
the Operating Systems class in which I was enrolled. That was my first semester after acceptance
into the Computer Science Master's degree program at the University of Texas at San Antonio.
Dr. Robbins convinced me to transition into the PhD program, and has been extremely patient
with me over the past nine years. Special thanks also go to Dr. Jeffery von Ronne for his help
with the type rules.
Next, I am grateful to my employer, Southwest Research Institute (SwRI), for providing
funding in the form of Internal Research grants from the Advisory Committee for Research
(ACR) and for providing access to computers and radio equipment for this research. A number of
people at SwRI deserve special mention. My supervisor, Ron Reinhard, has been very supportive
through this whole process, and the Institute Scientists in the Signal Exploitation and Geoloca-
tion Division, especially Dr. Denise Varner, Dr. Jackie Hipp, and Rob Black, have all provided
valuable insight and guidance.
Finally, computational support was provided by NIH Research Centers in Minority Insti-
tutions grant 2G12RR1364-06A1 and by the UT System San Antonio Computational Biology
Initiative.
April 2007
THE AUTOMATIC GENERATION AND TESTING OF
SIGNAL RECOGNITION ALGORITHMS
Kenneth Lynn Holladay, Ph.D.
The University of Texas at San Antonio, 2007
Supervising Professor: Dr. Kay A. Robbins
Signal recognition algorithms are representative of a large class of problems in which the
complexity of the algorithm and its operational domain make it impractical to derive a valid
operational characterization using only mathematical or analytical techniques. They are also
representative of algorithms whose practical application demands high sensitivity and high speci-
ficity. This dissertation explores the use of empirical testing techniques to characterize algorithm
behavior and the application of those techniques to the automatic creation of new algorithms for
signal recognition.
An essential first step in algorithm development is an accurate specification of the prob-
lem domain. This dissertation deals with signal recognition in High Frequency (HF) radio
communication bands. This problem domain is appropriate because it has important real-world
application, it can be specified accurately, and it can be represented by reasonably sized test sets.
A result of the domain specification work was the innovative Signal Exploitation Markup Lan-
guage (SIGEXML), an XML schema capable of expressing signal, environment, and receiver
parameters. Variants of the schema have been adopted for use in several government systems.
A second essential step in automated algorithm development is creating a framework ca-
pable of managing the large-scale data collection and analysis that is necessary to exercise an
algorithm within a statistically significant volume of its specified domain. The Algorithm
Evaluation Framework (ALEF) created for this dissertation was successfully used to demonstrate
three common development tasks: characterizing well-known symbol rate algorithms, improving
a recently published symbol rate algorithm, and facilitating development of new algorithms.
Our efforts to develop a new algorithm based on KL decomposition demonstrated the dif-
ficult nature of manual algorithm development, even when assisted by automated tools.
Addressing this difficulty led to the final phase of the dissertation: automated development of
signal recognition algorithms using genetic programming (GP). Unfortunately, no available GP
language supported intrinsic vector representation, which we strongly believed was necessary for
automation to succeed. Therefore, we designed and implemented a new GP language, FIFTH™,
along with a fully distributed genetic programming environment. Using FIFTH as the creative
component and ALEF for detailed analysis, we demonstrated, by creation of an effective new
algorithm for determining signal symbol rate, that automated development of vector algorithms
with human-competitive results is possible.
Table of Contents
Acknowledgments ......................................................................................................................... v
Abstract......................................................................................................................................... vi
List of Tables ................................................................................................................................ xi
List of Figures.............................................................................................................................. xii
Chapter 1 Introduction ................................................................ 1
1.1 Motivation ................................................................ 1
1.2 Related work ................................................................ 2
1.3 A different approach ................................................................ 2
1.4 Dissertation overview ................................................................ 3
Chapter 2 HF Signal Classification ................................................................ 5
2.1 Introduction to HF communications ................................................................ 5
2.2 Describing HF signal characteristics ................................................................ 6
2.2.1 International standards ................................................................ 8
2.2.2 Ad hoc amateur radio schemes ................................................................ 8
2.2.3 Other schemes ................................................................ 9
2.3 Developing the XML schema ................................................................ 9
2.3.1 Single carrier signals ................................................................ 10
2.3.2 Multiple carrier signals ................................................................ 11
2.3.3 Multiple carrier signals that vary in time ................................................................ 12
2.3.4 The Signal Exploitation Markup Language (SIGEXML) ................................................ 13
2.4 Real world impact of SIGEXML ................................................................ 14
Chapter 3 Algorithm Evaluation Framework (ALEF) ................................................................ 15
3.1 ALEF components ................................................................ 16
3.1.1 Generating test signals ................................................................ 16
3.1.2 Executing tests ................................................................ 18
3.1.3 Analyzing results ................................................................ 19
3.2 Comparing symbol rate algorithms ................................................................ 20
3.2.1 Phase 1 – test signal generation ................................................................ 22
3.2.2 Phase 2 – test execution ................................................................ 24
3.2.3 Phase 3 – results analysis ................................................................ 24
3.3 Improving a wavelet based symbol rate algorithm ................................................................ 26
3.3.1 Overview of wavelets ................................................................ 26
3.3.2 Designing a wavelet based symbol rate algorithm ................................................................ 27
3.3.3 Validating the original MCWT behavior ................................................................ 28
3.3.4 Selecting wavelets and scales ................................................................ 29
3.3.5 Performance comparison with DPDT ................................................................ 31
3.3.6 Developing a quality metric ................................................................ 32
3.3.7 Summary ................................................................ 34
3.4 Developing a new modulation classification algorithm ................................................................ 35
3.4.1 Selecting a signal feature vector ................................................................ 36
3.4.2 Developing the basis vectors ................................................................ 37
3.4.3 Resulting modulation classification algorithm ................................................................ 38
3.4.4 Algorithm accuracy using test data set ................................................................ 39
3.4.5 Algorithm shortcomings and future exploration ................................................................ 39
3.5 Real world impact of ALEF ................................................................ 41
Chapter 4 FIFTH™ - A Vector-based Language for Automatic Algorithm Development ........ 42
4.1 Genetic programming background ................................................................ 43
4.1.1 Definition of terms ................................................................ 43
4.1.2 The genetic programming algorithm ................................................................ 44
4.1.3 The genetic programming environment ................................................................ 45
4.1.4 Current state of genetic programming research ................................................................ 51
4.2 Motivation for FIFTH ................................................................ 53
4.3 The FIFTH language ................................................................ 55
4.3.1 Parameter stack ................................................................ 56
4.3.2 Core vocabulary ................................................................ 56
4.3.3 Validating program execution ................................................................ 57
4.3.4 Flow control and function definition ................................................................ 58
4.3.5 Formal typing ................................................................ 58
4.3.6 Function validation ................................................................ 61
4.4 The FIFTH genetic programming environment (GPE5) ................................................................ 61
4.4.1 Random program generation ................................................................ 62
4.4.2 Fitness evaluation ................................................................ 62
4.4.3 Parent Pool Selection ................................................................ 64
4.4.4 Probability Ranking ................................................................ 64
4.4.5 Crossover and mutation ................................................................ 65
4.4.6 Implementation ................................................................ 68
4.5 Using GPE5 to solve a problem ................................................................ 68
4.5.1 Identify the terminal set ................................................................ 68
4.5.2 Identify the function set ................................................................ 69
4.5.3 Select the control parameters ................................................................ 69
4.6 Polynomial regression example problem ................................................................ 69
Chapter 5 Automatic Generation of Symbol Rate Algorithms ................................................................ 72
5.1 Preparing the GP run ................................................................ 72
5.1.1 Problem formulation ................................................................ 72
5.1.2 Terminal set and function set ................................................................ 73
5.1.3 Fitness evaluation strategy ................................................................ 74
5.1.4 Control parameters ................................................................ 74
5.2 A successfully evolved algorithm ................................................................ 74
5.2.1 Baseline performance of DPDT ................................................................ 74
5.2.2 Evolution results ................................................................ 74
5.2.3 P1333 algorithm structure ................................................................ 76
5.2.4 P1333 algorithm analysis ................................................................ 77
5.3 The efficacy of genetic programming ................................................................ 79
Chapter 6 Discussion and Future Work ................................................................ 80
6.1 SIGEXML: the Signal Exploitation Markup Language schema ................................................................ 80
6.2 ALEF: the automated Algorithm Evaluation Framework ................................................................ 81
6.3 FIFTH: a new look at genetic programming ................................................................ 82
6.4 The path forward ................................................................ 84
Appendix A Glossary and Acronyms........................................................................................ 86
Appendix B Typical Wideband Signal Surveillance System................................................... 88
Appendix C Signal Exploitation Markup Language Schemas ............................................... 90
Appendix D Index of ALEF Functions ..................................................................................... 93
D.1 Index for directory SigGen (test signal generator functions)........................................ 93
D.2 Index for directory SigGen\models (SigGen support functions) .................................. 94
D.3 Index for directory SAFramework (signal analysis framework) .................................. 95
D.4 Index for directory SymbolRate (symbol rate algorithm functions)............................. 96
D.5 Index of directory KLDecomp (KL decomposition functions) .................................... 96
Appendix E Analysis Report from DPDT Experiment ........................................................... 98
Appendix F Signal Histogram Viewer Samples ..................................................................... 103
Bibliography .............................................................................................................................. 108
Vita
List of Tables
Table 2.1: Common digital modulation techniques........................................................................ 7
Table 2.2: Typical HF digital signal properties .............................................................................. 8
Table 3.1: Peak search parameters for symbol rate experiments.................................................. 22
Table 3.2: HF signal property boundaries for symbol rate experiments....................................... 23
Table 3.3: Percent correct symbol rate estimation by algorithm and pulse shape........................ 25
Table 3.4: Wavelet algorithm parameters for symbol rate experiments....................................... 28
Table 3.5: Percent correct symbol rate estimation by wavelet type and scale.............................. 29
Table 3.6: Percent correct symbol rate estimation by wavelet and pulse shape ........................... 31
Table 3.7: Percent correct symbol rate estimation by algorithm and pulse shape........................ 32
Table 4.1: Example FORTH program execution trace ................................................................. 47
Table 4.2: Review of GP software................................................................................................ 52
Table 4.3: Summary of problems solved in GP books ................................................................. 52
Table 4.4: Intrinsic FIFTH words ................................................................................................. 57
Table 4.5: GPE5 fitness functions ................................................................................................ 63
Table 4.6: Word selection probabilities for the polynomial regression example ......................... 70
Table 5.1: Properties of the training set signals used for the symbol rate experiments................ 73
Table 5.2: Word selection probabilities for the symbol rate example .......................................... 73
Table 5.3: Percent correct performance by symbol rate value against the entire training set ...... 76
Table 5.4: Percent correct performance by symbol rate value against the test set ....................... 76
Table 5.5: Performance of variations on algorithm P1333 ........................................................... 77
Table C.1: Description of signal exploitation schema files .......................................................... 90
List of Figures
Figure 2.1: BPSK spectrogram ..................................................................................................... 11
Figure 2.2: VFT spectrogram........................................................................................................ 12
Figure 2.3: CODAN spectrogram................................................................................................. 13
Figure 3.1: Test framework block diagram................................................................................... 15
Figure 3.2: Simulink model for QAM and PSK signals with pulse shaping ................................ 17
Figure 3.3: Example receiver operating characteristic curves ...................................................... 19
Figure 3.4: Radar plot of percent correct symbol rate estimation for three wavelets................... 30
Figure 3.5: Wavelet function ψ for Haar (left) and db6 (right) ................................................... 30
Figure 3.6: Raised cosine symbol pulse........................................................................................ 30
Figure 3.7: Probability density for db6 wavelet using normalized mean ..................................... 33
Figure 3.8: Probability density for DPDT using normalized mean .............................................. 33
Figure 3.9: ROC curves using normalized mean of the FFT bins for MCWT and DPDT........... 34
Figure 3.10: Plot of transform clusters by modulation type using three KL vectors.................... 38
Figure 3.11: Signal separation capability of a single KL vector................................................... 39
Figure 4.1: Basic genetic programming algorithm ....................................................................... 44
Figure 4.2: Tree based representation of a LISP expression ........................................................ 46
Figure 4.3: Two programs before crossover ................................................................................. 50
Figure 4.4: Two programs after crossover.................................................................................... 50
Figure 4.5: FIFTH basic type rules ............................................................................................... 59
Figure 4.6: FIFTH control flow type rules.................................................................................... 59
Figure 4.7: FIFTH stack manipulation type rules......................................................................... 60
Figure 4.8: FIFTH type rules for selected operations ................................................................... 60
Figure 4.9: Type derivation of “/ SWAP /” .............................................................................. 61
Figure 4.10: Block diagram for the FIFTH genetic programming environment (GPE5)............. 62
Figure 4.11: Effect of bias constant c on exponential ranking ..................................................... 65
Figure 4.12: FIFTH crossover example showing legal structural change .................................... 67
Figure 4.13: Best fitness progression for a polynomial regression GP run .................................. 71
Figure 5.1: Best fitness progression for a symbol rate GP run ..................................................... 75
Figure 5.2: Feature vector and FFT vector for P1333 and DPDT, both correct ........................... 78
Figure 5.3: Feature vector and FFT vector for P1333 correct, DPDT incorrect........................... 78
Figure 5.4: Feature vector and FFT vector for P1333 and DPDT, both incorrect........................ 79
Figure B.1: Simplified block diagram of a wideband signal surveillance system........................ 89
Figure C.1: Dependency hierarchy for signal exploitation schema files ...................................... 90
Figure C.2: Pictorial representation of XML schema type externalReportFSK........................... 91
Figure C.3: Pictorial representation of XML schema type SegmentReportType......................... 92
Figure F.1: Histograms for AM signal........................................................................................ 103
Figure F.2: Histograms for OOK, 35 WPM signal..................................................................... 104
Figure F.3: Histograms for PSK, 4 level, 50 baud, no pulse shape signal.................................. 105
Figure F.4: Histograms for PSK, 4 level, 50 baud, RRC pulse shape signal.............................. 106
Figure F.5: Histograms for FSK, 2 level, 50 baud, mod index 1 signal ..................................... 107
Chapter 1 Introduction
1.1 Motivation
Non-cooperative, blind analysis and recognition of digital communication signals is im-
portant both in the communication industry for developing new receiver technology and in the
intelligence community for deciphering intercepted signals. In both domains, algorithm research
focuses on extracting signal characteristics, such as modulation type and symbol rate, even when
the received signal is distorted by noise and fading in the transmission channel.
While signal recognition has been an area of active research for many years, recent tech-
nology advances have accelerated interest in algorithm development, especially in the
Communication Intelligence (COMINT) community. First, new Software Defined Radio initia-
tives based on commercial microprocessor technology have made it easier to rapidly create new
digital communication waveforms. Examples range from formal programs, such as the US
Army’s Joint Tactical Radio System Program [74, 81], to amateur radios based on personal com-
puters [87]. Second, there is a renewed interest in the High Frequency (HF) radio band [33].
While reliable communication in this band has been historically difficult to achieve due to noise,
fading, and interference, HF has a distinct advantage in long-range communication since these
signals can bounce off of the ionosphere and travel around the world. Modern microprocessor
based radios have essentially overcome the reliability problems for cooperative HF communica-
tion, and with increased crowding and expense in satellite communication channels, HF has
become a viable, low-cost alternative for global communication.
The combination of rapid waveform changes and increased traffic in an environment that
is difficult to monitor has a direct impact on homeland security, compelling both commercial and
government organizations to devote additional resources to developing Digital Signal Processing
(DSP) algorithms that can detect, identify, and locate these signals. Many of the signal recogni-
tion software products in use today depend on multiple separate modules, each of which may
only distinguish a small number of distinct signal types. Adding new capability requires a time-
consuming and labor-intensive development process consisting of the usual design, implementa-
tion, testing, and deployment stages.
This dissertation explores the concept of automating significant portions of the develop-
ment process, effectively bringing a signal recognition algorithm from concept to deployment
with minimal human interaction.
1.2 Related work
Several universities and research sponsors are working to improve the mechanism used to
express an algorithm during its design. Their emphasis has been on model-based tools that allow
a design engineer (who is typically not a software developer) to express an algorithm as a model
that can be deterministically reduced directly to running software for deployment. For example,
the Vanderbilt Institute for Software Integrated Systems has developed a Generic Modeling En-
vironment (GME) [80], while the University of California at Berkeley’s Center for Hybrid and
Embedded Software Systems has a similar modeling tool called Ptolemy [6]. Both of these or-
ganizations participated in a Defense Advanced Research Projects Agency (DARPA) project
called Model Based Integration of Embedded Software (MOBIES) [17], which included an ex-
periment in signal processing algorithm development. That work led to the development of a
configurable signal recognizer framework [26] that was successful in reducing the time required
to translate an algorithm developed by a signal processing expert into running software [39].
However, those efforts did not address the larger issue of determining the suitability of an
algorithm to perform its intended function. In fact, the analysis and characterization of a new al-
gorithm requires significantly more effort than simply translating the algorithm to running
software.
1.3 A different approach
Analyzing the performance of a DSP algorithm should yield insight into its expected be-
havior as a function of all of the factors that can vary in its deployment environment. These
factors may include algorithm tuning parameters (such as maximum likelihood thresholds), sig-
nal characteristics (such as symbol rate, modulation type, and pulse shaping), and transmission
channel propagation effects (such as additive white Gaussian noise or fading). For automated
surveillance applications in non-cooperative environments, the list of applicable factors can be
very large.
Mathematical analysis of an algorithm that accounts for all relevant factors is unlikely to
yield a reasonable closed form solution, and the operational effect of any simplifying assump-
tions must be carefully considered. For this reason, most researchers include simulation testing in
their evaluation. However, the significant variations in the planning and execution of published
simulations render it difficult to make direct quantitative comparisons between algorithms.
A good simulation test starts by examining the intended operational domain to establish
quantitative bounds for all pertinent signal and transmission channel properties. Any property
that is not constant is potentially a factor that can affect the algorithm performance. The combi-
nation of these factors defines a multidimensional problem domain space. A comprehensive
analysis of the algorithm requires a body of test signals whose range of factor values comprise a
statistically significant population of this space [22]. These test signals can be either synthetic or
recorded, as long as all factor values are known for every test signal.
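To make the size of this factor space concrete, the following Python sketch enumerates a small hypothetical domain. The factor names and value ranges here are illustrative examples only, not the dissertation's actual domain bounds (those are established in Chapter 3):

```python
import itertools

# Illustrative factors for an HF test-signal domain (example values only)
factors = {
    "modulation": ["BPSK", "QPSK", "FSK"],
    "symbol_rate_baud": [50, 75, 100, 150, 300],
    "snr_db": [0, 5, 10, 15, 20],
    "pulse_shape": ["none", "raised_cosine"],
}

# Each combination is one point in the multidimensional problem domain,
# and each point needs at least one test signal with known factor values.
test_points = [dict(zip(factors, combo))
               for combo in itertools.product(*factors.values())]

print(len(test_points))  # 3 * 5 * 5 * 2 = 150 test signals
```

Even this toy domain with four factors requires 150 test signals for a single sample per point; adding factors such as fading models or frequency offsets multiplies the count further, which is why automated generation is essential.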
The complexity of the interacting algorithm, signal and channel factors suggests that
techniques pioneered in experimental algorithmics [53] could yield useful insight into perform-
ance. Experimental algorithmic methods emphasize several key concepts that include clearly
specifying the testing goal, articulating parameter variations and any hidden algorithmic factors,
carefully constructing large test sets that span the problem domain, and systematically evaluating
the performance results using statistical techniques.
1.4 Dissertation overview
The remainder of the dissertation is organized as follows. Chapter 2 is an introduction to
the HF signal domain and to the communication signal meta-data expression language developed
to describe HF signals using Extensible Markup Language (XML) schema. This work provides
an unambiguous characterization of the High Frequency signal space for research purposes.
Chapter 3 describes the Algorithm Evaluation Framework (ALEF) created to facilitate
characterizing signal processing algorithm behavior. The framework provides functions to auto-
mate the evaluation process including creating large numbers of test files, running these files
through the algorithm under test, and analyzing the results. Automating this cycle allows rapid
testing and evaluation of algorithm enhancements, as well as identifying the significant factors
that affect performance. The framework was implemented using a commercially available pro-
gramming and modeling environment (MATLAB® and Simulink® from The MathWorks, Inc.)
that includes a rich set of intrinsic digital communication functions. The last sections of Chapter
3 describe using the framework to perform several algorithm development tasks including com-
paring related algorithms, testing algorithm improvements, developing quality indicators, and
characterizing a new algorithm.
Chapter 4 introduces a new vector-based Genetic Programming (GP) language, FIFTH™,
designed to allow automatic discovery of human-competitive signal-processing algorithms. The
chapter reviews the current state of GP while highlighting the innovative features of FIFTH.
Chapter 5 describes using FIFTH and its programming environment to discover a new and highly
accurate symbol rate estimation algorithm.
Finally, Chapter 6 presents additional discussion of the impact of this work and plans for
its future continuation.
Chapter 2 HF Signal Classification
2.1 Introduction to HF communications
An essential first step in automated algorithm development is an accurate specification of
the domain. This is analogous to defining requirements and establishing operational bounds be-
fore developing software. Without it, the algorithm developer cannot determine whether selected
test cases are necessary or sufficient. This chapter chronicles the evolution of the data
specification and representation format for digital communication signal analysis in the High
Frequency (HF) communication band. The HF problem domain is well suited to algorithm explo-
ration because it has important real-world application, it can be specified accurately, and it can
be represented by reasonably sized test sets.
The HF band is defined as the frequency range from 3 to 30 MHz. In practice, most HF
radios use the spectrum from 1.6 to 30 MHz. In this range, the ever-present hazards of noise, fad-
ing, and interference make establishing and maintaining a viable HF communication link more
difficult than in the VHF (Very High Frequency, 30 – 300 MHz) or UHF (Ultra High Frequency,
300 MHz – 3 GHz) bands. However, HF signals (especially in the range of 4 to 18 MHz) have
the unique ability to bounce off of the ionosphere, enabling them to move information around the
world instead of being limited to line-of-sight [2]. Government regulations and international trea-
ties divide the band into sets of frequency ranges for specific communication purposes including
maritime, aviation, distress, standard time, amateur, broadcasting, and radio astronomy [3].
The development of satellite communications and the proliferation of VHF and UHF ra-
dio repeaters resulted in a declining interest in HF communication for many years. That trend is
now reversing. Satellite communication channels are crowded, and their cost is increasing. Also,
recent technology advances have overcome many of the former problems associated with HF
communication, thus renewing interest in HF as a viable and cost-effective long-range commu-
nication medium [33].
While analog signals (e.g., broadcast music and amateur radio) constitute a significant
percentage of the traffic in the HF band, most of the research and development activity is di-
rected towards digital communication. Monitoring, analyzing, and classifying digital signals is
especially important in the intelligence community for deciphering intercepted communications.
Current emphasis is shifting from manual monitoring techniques that require a skilled operator to
computer controlled automated processes. Appendix A contains a glossary of terms and acro-
nyms used in the HF domain. Appendix B presents a brief overview of the components that
comprise an automated monitoring system.
Agreed-upon standards for information interchange facilitate the accurate and timely dis-
semination of data among interested parties. For automated computer processing, an information
exchange standard must also be syntactically rigorous, semantically unambiguous, and techni-
cally complete. Initial work directed at expressing the key features of this domain revealed that
there are few common standards for encoding test files, recording data, presenting metadata, or
expressing results. This made it difficult to compare the performance of algorithms from dispa-
rate sources.
After considering various database formats, binary file formats, and the few existing
standards, we decided to adopt the Extensible Markup Language (XML) and define representa-
tions using XML schema [83]. There are considerable advantages to this approach. In many
disciplines, XML is rapidly becoming the universal format for information exchange, primarily
due to its simple structure, published standards, platform independence, and ubiquitous support.
However, these attractive features conceal the practical challenges of developing a complete and
usable XML schema. Successful adoption within a technical discipline requires expert knowl-
edge of both the application domain and XML technology.
Appendix C gives an overview of the resulting Signal Exploitation schemas. The sche-
mas use a layered approach, first defining base types for communication signals. Subsequent
layers define a format for signal libraries, a documentation format for algorithm test records, and
an interchange format for signal processing tasking.
The remaining sections in this chapter present an overview of some existing techniques
for classifying HF signals and then describe the development of the Signal Exploitation schema.
2.2 Describing HF signal characteristics
The basic theory for digital communication is well established, and there are many good
books available such as Sklar [69] and Proakis [59]. Digital information is encoded in a transmit-
ted signal by varying in time one or more of three fundamental characteristics: amplitude, phase,
and instantaneous frequency. Combinations of these characteristics produce a variety of signals
ranging from simple carrier amplitude modulation, such as Morse code, to intricate schemes,
such as Voice Frequency Telegraphy (VFT). Table 2.1 lists common abbreviations associated
with frequently encountered digital modulation techniques.
Table 2.1: Common digital modulation techniques
Abbreviation Description
CW Carrier Wave. This is the same as OOK (On Off Keying) or Morse code.
FSK Frequency Shift Keying. This is the same as PFM (Pulse Frequency Modulation), also known as PSM (Pulse Skipping Modulation)
MSK Minimum Shift Keying
PSK Phase Shift Keying. This is the same as PPM (Pulse Phase Modulation)
DPSK Differential Phase Shift Keying
OQPSK Offset Quadrature Phase Shift Keying
ASK Amplitude Shift Keying. This is the same as PAM (Pulse Amplitude Modulation)
QAM Quadrature Amplitude Modulation. This is the same as APSK (Amplitude and Phase Shift Keying)
CPFSK Continuous Phase Frequency Shift Keying
CPM Continuous Phase Modulation
GMSK Gaussian Minimum Shift Keying
The acronyms of Appendix A provide a general classification of signal types, but they do
not completely specify a signal. There are numerous other parameters associated with each
modulation technique, and some of these techniques can be combined in a single broadcast sig-
nal. Most modulation strategies can be described by analytical formulas. However, this form is
not suitable for a general signal library specification because real signals have noise and are dis-
torted by channel fading and other effects. A complete specification must include all of these
parameters in a flexible manner to accommodate broadcast signals recorded off the air as well as
synthetically generated signals. Even when incompletely specified, real signals are useful in gen-
erating test suites for exercising signal recognition algorithms. Table 2.2 lists some of the
common parameters associated with digitally modulated signals used in the HF bands and pro-
vides typical values for practical ranges.
Table 2.2: Typical HF digital signal properties
Property Association Example HF Values
Modulation type Signal FSK, MSK, PSK, DPSK, OQPSK, ASK
Pulse shape Signal None, raised cosine (RC), root RC (RRC), Gaussian
Excess bandwidth (rolloff) Signal Limit: 0.00 to <1.00 (fraction of Nyquist bandwidth). Typical: 0.10 to 0.35
Symbol rate Signal Typical: 10 to 2400 symbols per second (baud)
Symbol states Signal 2, 4, 8, 16 states
Duration Signal Practical range: 0.1 to 5 seconds
Signal to noise ratio (SNR) Channel Practical range: 0 to 60 dB
Frequency offset from baseband Receiver Possible range: 0 to 500 Hz
Sampling rate Receiver Practical range: 8 – 32 kHz
2.2.1 International standards
The International Telecommunication Union (ITU) headquartered in Geneva, Switzer-
land, is an international organization within the United Nations system where governments and
the private sector coordinate global telecommunication networks and services. They publish a
Radio Regulations book that defines a detailed emissions classification system. The intent of this
system is to identify emission sources for regulatory and compliance monitoring.
The ITU format consists of a four-character expression specifying signal bandwidth fol-
lowed by a five-character encoded description of the emission. For example, 2K11H2BFN
represents a “selective calling signal using sequential single frequency code, single-sideband full
carrier with a bandwidth of 2.11 KHz” [1]. Selecting the appropriate letters and numbers is suffi-
ciently complicated that the International Amateur Radio Union has published a simplified guide
that reduces the 15-page standard down to a 2-page table that covers the most common signal
types [4]. The guide states that in ambiguous cases, anything else can be classified as “Un-
known.” The ITU standard serves its intended purpose and could be encoded in an XML schema,
but it does not contain sufficient detail to describe an arbitrary digital signal.
2.2.2 Ad hoc amateur radio schemes
Amateur Radio enthusiasts constitute another active group of people listening to signals
in the HF band. Companies such as Monteria [35] and Klingenfuss [38] publish lists of frequen-
cies, descriptions, and sample recordings of monitored signals. Until it shut down in 2006, the
Worldwide Utility News organization published a widely distributed “Frequently Asked Ques-
tions” (FAQ) document that described and categorized many of the signals heard on HF
frequencies [64]. Since the target audience for this information consists of people who are listen-
ing to radios, the principal signal classes in this FAQ are distinguished by how the signals sound.
For example, “Synchronous Data Block” signals are described as having a “distinctive chirping
sound,” while “Synchronous Bit Stream” signals are “continuous and possess a trilling quality.”
Within each of the principal classes is amassed a significant quantity of additional techni-
cal information. However, the presentation of this information is not without problems. Some
terms are used inconsistently, while others can be ambiguous. For example, the word “tone” is
used to describe both a symbol value in simple FSK modulation and one of many sub-carriers in
more complicated modulation schemes. Once again, this description format is not well suited for
computer based analysis and information exchange.
2.2.3 Other schemes
L-3 Communications Analytic Corporation proposed a Waveform Description Language
[21] primarily for use with Software Defined Radios. The language, which was based on the
Unified Modeling Language (UML) from the Object Management Group (OMG) [55], relies
heavily on realizing the model in MATLAB and Simulink. It deals only with the signal charac-
teristics from a cooperative transmit/receive perspective and cannot express transmission channel
features. There is no public record of its further adoption.
Finally, several government organizations and commercial companies have designed and
built automated signal acquisition and analysis systems with varying degrees of capability. Many
of these systems use databases to store signal descriptions, but there are no published standards
for tables or field names.
2.3 Developing the XML schema
Clearly, signal analysts and researchers would benefit from a more rigorous signal de-
scription mechanism, like an XML schema, that addresses the shortcomings of current schemes.
Ideally, the XML schema should define domain specific data types and elements that can unam-
biguously describe almost any digitally encoded signal while retaining a structure that can
support typical automated operations such as cataloging, searching, and sorting.
Designing an XML schema of this magnitude is not a trivial undertaking. There are many
critical decisions such as element names, attributes, and hierarchy levels. In most cases, there are
multiple ways to achieve the same objective, with no clear indication of which is best. We chose
to decompose the problem, beginning with simple signals then progressing to more complicated
signals. This section presents an abbreviated chronicle of that process. Figures 2.1 through 2.3
are representative signal spectrograms included to help visualize the problem. Spectrograms are
widely used to visualize the frequency components of a signal as a function of time. The bright-
ness in the plot area indicates the energy (amplitude) of the signal. In these spectrograms, white
represents the highest energy level, and black represents the lowest. Since these are baseband
signals, the frequency to which the receiver is tuned becomes the zero point on the y axis.
In XML, there are two ways to encode tags that identify information. The most common
is the “element,” the name found between the angle brackets. The second encoding technique is
called an “attribute” and is located after the element name but is still within the angle brackets.
For example, consider the following expression. <baud tolerance=”5”>100</baud>
The element name is “baud,” and “tolerance” is an attribute associated with “baud.” Attribute
data is separated from its identifier by an equal sign and surrounded with quotes. Element data is
enclosed between opening and closing tags, where a closing tag has a forward slash immediately
following the opening pointed bracket.
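The distinction between elements and attributes can be seen mechanically with any XML parser; for example, using the Python standard library:

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<baud tolerance="5">100</baud>')

element_name = elem.tag                     # name between the angle brackets
attribute_value = elem.attrib["tolerance"]  # attribute inside the opening tag
element_value = elem.text                   # data between opening and closing tags
```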
2.3.1 Single carrier signals
Figure 2.1 shows a spectrogram for a simple Binary Phase Shift Keying (PSK with 2
phase states) signal transmitting at 100 symbols per second (baud). The spectrogram shows a
clear, single energy band 1000 Hz above the tuning frequency. The information in this signal is
encoded in phase changes that are not visible on the image. Leaving out a few details, this signal
might be described with the following XML snippet.

<carrier>
  <freqHz>1000</freqHz>
  <modulation>PSK</modulation>
  <baud>100</baud>
  <numStates>2</numStates>
</carrier>
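A signal matching this description can be simulated directly. The following NumPy sketch is illustrative only (the dissertation's generators are Simulink models); the sample rate, duration, and rectangular pulse shape are assumed values:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000                      # assumed receiver sample rate (Hz)
baud, f_c = 100, 1000          # values from the <carrier> description
sps = fs // baud               # 80 samples per symbol

symbols = rng.choice([1.0, -1.0], size=40)   # 2 phase states
baseband = np.repeat(symbols, sps)           # rectangular pulses (no shaping)
t = np.arange(len(baseband)) / fs
x = baseband * np.exp(2j * np.pi * f_c * t)  # carrier 1000 Hz above tuning
```

A spectrogram of x would show the single energy band at 1000 Hz; the phase transitions carrying the data are invisible in the magnitude display, as in Figure 2.1.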
Figure 2.1: BPSK spectrogram (spectrogram of Psk31Bpsk.wav; frequency in Hz versus time in seconds)
Examining signals that use other modulation schemes from Table 2.1 yields several additional necessary elements. For example, an FSK signal might be described as follows.

<carrier>
  <freq_Hz>1000</freq_Hz>
  <modulation>FSK</modulation>
  <baud>100</baud>
  <numStates>2</numStates>
  <shiftHz>3</shiftHz>
  <phaseContinuity>discontinuous</phaseContinuity>
</carrier>
This approach requires separate named fields for all possible parameters associated with all pos-
sible modulation types. Rather than continuing to add optional fields into a large flat structure,
we defined separate XML complex types for each modulation class.
2.3.2 Multiple carrier signals
Next, consider the class of signals comprised of more than one carrier. Figure 2.2 shows a
VFT spectrogram that has a single unmodulated carrier (the solid white line at about 200 Hz)
along with multiple other carriers that are independently modulated. There are two obvious
choices for describing this signal: either introduce a new complex type to describe multiple carri-
ers or add a hierarchy level that allows for multiple signal segments. While the latter approach
adds a layer of complexity for simple signals, it has a significant benefit for later processing in
that all signal types have the same structure.
Figure 2.2: VFT spectrogram (spectrogram of br6028.wav; frequency in Hz versus time in seconds)
2.3.3 Multiple carrier signals that vary in time
Many real signals transmit data in bursts. Although the signal energy turns on and off as a
function of time, as long as the modulation characteristics do not change from one burst to the
next, the previous “segment” description provides an adequate definition. However, this does not
allow for the possibility that the signal will change its character midstream.
Figure 2.3 shows a CODAN signal from a smart modem that can monitor the effective-
ness of an HF transmission and adapt its modulation to achieve a high data throughput with a
minimal error rate. The modulation characteristics are different in the two areas of signal energy.
To describe this, the segment must capture the signal start and stop time, and an additional hier-
archy level is required to contain multiple segments.
Figure 2.3: CODAN spectrogram (spectrogram of CODAN.wav; frequency in Hz versus time in seconds)
2.3.4 The Signal Exploitation Markup Language (SIGEXML)
This process of refining the descriptive elements was essential to creating a viable XML
schema for digital communication signals. Wherever practical, we defined enumerations and re-
stricted data types so that the XML validation process would support semantically correct use of
the types. In addition to the signal description, we defined types to describe the communication
channel (noise and fading) as well as types related to saving a signal in a file.
The previous XML snippets also illustrate an important design decision. Many elements
can be expressed using more than one unit of measure. Some published XML schemas provide a
“units” attribute to identify the specific unit associated with an element value. Rather than risk
the confusion associated with multiple possible units of measurement, the Signal Exploitation
schema defines a fixed unit for each element and embeds the unit as part of the element name.
This decision favors ease of automated search and comparison over ease of constructing user dis-
plays.
Another conscious decision was to prefer elements over attributes. We defined only two
optional attributes for general use: confidence and tolerance. These provide a straightforward
way for an algorithm to report confidence levels for calculations and to associate tolerances with
specification values. There are two distinct uses for this type of information. One is to describe a
synthesized or captured signal (designated as a “report” type), and the other is to describe the
desired characteristics of a signal that might be of interest but for which precise parameter values
are not known (designated as a “template” type).
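As an illustration of how the tolerance attribute supports template matching, consider the following Python sketch. The helper name and matching semantics are assumptions for illustration, not part of the schema definition:

```python
import xml.etree.ElementTree as ET

def baud_matches(template_xml, reported_baud):
    """True when a reported symbol rate falls within the template's value
    plus or minus its optional tolerance attribute (hypothetical helper)."""
    elem = ET.fromstring(template_xml)
    tolerance = float(elem.attrib.get("tolerance", 0))
    return abs(reported_baud - float(elem.text)) <= tolerance
```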
Appendix C briefly describes each of the files that comprise the Signal Exploitation
Markup Language (SIGEXML) schema. Also in that appendix, Figure C.2 shows a pictorial rep-
resentation of an FSK report, and Figure C.3 shows how the report types fit within a signal
segment. Due to the length of the SIGEXML documentation material (several hundred pages) it
is not included in this dissertation.
2.4 Real world impact of SIGEXML
The results of this work validate many of the XML claims. The data storage is well or-
ganized, and there are numerous available XML tools. Although the time and cost required to
develop the Signal Exploitation schema substantially exceeded initial estimates, the acceptance
of the schemas and resulting data exchange capabilities vindicated the effort. After initially pub-
lishing this work [29], two of the authors were invited to present the results to a government
agency. Derivatives of the schemas were subsequently adopted by Southwest Research Institute
and are now being used in operational systems worldwide.
Chapter 3 Algorithm Evaluation Framework (ALEF)
A second essential step in automated algorithm development is creating a framework ca-
pable of managing the large-scale data collection and analysis that is necessary to exercise an
algorithm within a statistically significant volume of its specified domain. Our automated test
framework takes a three-phase functional approach (test generation, test execution, and analysis)
as illustrated in Figure 3.1. Each phase stores its output in a format that facilitates systematic
identification, retrieval, and manipulation by the next phase, including all relevant input factors
and generated results.
Figure 3.1: Test framework block diagram
The Test Generator uses a collection of fully specified domain factors to produce a full-factorial set of test files. It records the domain properties of the test set in an XML signal library
file, and it encodes the specific factor values associated with each signal in the individual file
names. The Test Executor automatically invokes the algorithm against each test signal for every
combination of the algorithm parameters. It collects all pertinent test input as well as the algo-
rithm response for each test into a generalized, indexed structure that is saved in a single test re-
sults file. In the final analysis phase, a script applies ALEF analysis tools to the test results and
produces a report. The analysis tools include functions to produce summaries of general per-
formance, tables of results organized by test factors, and visualizations such as probability
density curves.
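The idea of encoding factor values in file names can be sketched as follows. The naming scheme shown here is hypothetical, not the framework's actual convention:

```python
def encode_name(mod, baud, states, snr_db):
    """Pack factor values into a test-file name, e.g. 'PSK_b100_s2_snr20.wav'."""
    return f"{mod}_b{baud}_s{states}_snr{snr_db}.wav"

def decode_name(name):
    """Recover the factor values from a file name built by encode_name."""
    mod, b, s, snr = name.rsplit(".", 1)[0].split("_")
    return {"modulation": mod, "baud": int(b[1:]),
            "states": int(s[1:]), "snr_db": int(snr[3:])}
```

Because every factor can be recovered from the name alone, test files remain self-describing throughout the pipeline.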
Section 3.1 describes the ALEF components. The remainder of the chapter applies ALEF
to the problem of determining the symbol rate of an HF signal. Section 3.2 uses ALEF to obtain
a quantitative comparison of three standard symbol rate algorithms over a realistic signal do-
main. Section 3.3 applies ALEF to tune an existing wavelet-based symbol rate algorithm, and
Section 3.4 illustrates how ALEF can be used to develop new algorithms. The design and some
preliminary results from using ALEF were reported in [27].
3.1 ALEF components
This section describes the three major components of the ALEF framework in more de-
tail. An index of ALEF functions can be found in Appendix D.
3.1.1 Generating test signals
The problem of obtaining standard test signal files for algorithm development has been
recognized for some time. Kremer and Shiels [46] stated, “There also appear to be no existing
databases of radio transmissions which could be used for our purpose.” In related areas of re-
search, such as speech recognition, significant effort has gone into establishing standard corpora
of test data files [5]. While there are a few communication signal resources available, such as
[19, 38, 48], these files usually have insufficient knowledge of one or more important signal at-
tributes to be useful for algorithm evaluation. Also, these repositories have no formal structure
and are unsuitable for automated selection of files that meet specific criteria. This is not surpris-
ing, however, because the number of files required to include every possible value of every
signal parameter would be enormous. As a result, most researchers are compelled to produce
their own body of simulated signals. Besides the tremendous duplication of effort, rarely are suf-
ficient details of the simulation process available to allow independent analysis of the test data
sets used to evaluate an algorithm.
Since developing a signal generation tool was essential to an automated testing frame-
work but was not itself intended to be a research project, we selected MATLAB and Simulink
from MathWorks as the development environment. These products provide a rich set of signal
processing related functions from which we constructed signal generators that required minimal
validation.
[Figure: Simulink block diagram. The model's annotations describe its four sections. Modulation section: symbols are read from a workspace variable; to produce impulses that will be pulse shaped, upsampling is performed after the modulator block. Zero generation section: generates complex zero values at the same sample rate as the modulation section, providing data for noise-only generation and a clean transition through the pulse shaping filter when the modulation section is switched out. Timing section: switches the modulation section on and off to provide unmodulated white noise at the beginning and end of the modulated signal. Filter and channel section: applies a filter to perform pulse shaping, then downsamples to the specified signal sample rate; up-sampling and down-sampling allow non-integer relations between the sample rate and the symbol rate.]
Figure 3.2: Simulink model for QAM and PSK signals with pulse shaping
After considering the capabilities, maintenance, ease of expansion and ease of use of five
different architectures, we decided to use a series of Simulink models to implement the signal
modulation and channel effects. The models are paired with one or more MATLAB functions to
set specific workspace variables that are conveyed to the models. The output from this first tier
(Simulink model and associated MATLAB function) is a single signal stored in a workspace
variable. A second tier of MATLAB functions accepts vectors of signal property values and uses
the first tier functions to produce a full factorial set of baseband signal files containing all com-
binations of these values. If desired, the functions can also create several variations of each sig-
nal with different symbol sequences and signal to noise ratios. The output consists of simulated
analytic (complex valued) signals stored in a standard WAV file format with in-phase data in
channel 1 and phase-quadrature data in channel 2. The test framework manages variable signal
duration by generating a test signal file that exceeds the maximum length of interest and then ex-
tracting sections as needed. Figure 3.2 shows the Simulink model for generating a QAM or PSK
signal with pulse shaping. PSK is a subclass of QAM where the signal has only one amplitude
value.
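The two-channel WAV convention (in-phase in channel 1, phase-quadrature in channel 2) can be reproduced with the Python standard library. The sketch below assumes 16-bit PCM samples and normalizes to full scale on write; function names are illustrative, not the framework's:

```python
import os
import tempfile
import wave

import numpy as np

def write_iq_wav(path, x, fs):
    """Write a complex baseband signal as 16-bit PCM WAV:
    in-phase data in channel 1, phase-quadrature data in channel 2."""
    scale = 32767.0 / max(np.abs(x.real).max(), np.abs(x.imag).max(), 1e-12)
    frames = np.empty((len(x), 2), dtype=np.int16)
    frames[:, 0] = np.round(x.real * scale)
    frames[:, 1] = np.round(x.imag * scale)
    with wave.open(path, "wb") as w:
        w.setnchannels(2)
        w.setsampwidth(2)          # 16-bit samples
        w.setframerate(fs)
        w.writeframes(frames.tobytes())

def read_iq_wav(path):
    """Read the two channels back into a complex signal."""
    with wave.open(path, "rb") as w:
        fs = w.getframerate()
        raw = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    frames = raw.reshape(-1, 2).astype(float) / 32767.0
    return frames[:, 0] + 1j * frames[:, 1], fs

# Round trip with a unit-amplitude 1 kHz complex exponential.
fs = 8000
x = np.exp(2j * np.pi * 1000 * np.arange(800) / fs)
path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_iq_wav(path, x, fs)
y, fs_read = read_iq_wav(path)
```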
3.1.2 Executing tests
In the Test Execution phase, a framework function processes the previously identified
body of test signals through the algorithm under test. The input to this phase must include all al-
gorithm specific parameters used to adjust performance. Incorporating these in the test process
requires iteratively executing the algorithm against each test signal using an appropriate range of
parameter values.
Note that the name of the function implementing the algorithm under test is supplied to
the framework as a string. The test executor uses a fixed calling convention of the form

[kResult kAlgParam] = sAlgName(x, Fs, kAlgParam, kActual)
where x is the input signal, and Fs is the sample rate in Hz. This is generic enough that the test
executor requires no explicit knowledge of the contents of the structures that carry algorithm pa-
rameters (kAlgParam), actual signal information (kActual), or test results (kResult).
While each invocation of the test executor function may process thousands of tests, the
results are saved in a single file. A three-level hierarchical structure provides an organized for-
mat that is easy for the analysis functions to access and interpret. The field names are fixed only
at the second level. The first level is a single variable (kTest) that holds all aggregate data from
the single test executor invocation. This variable contains the second level required structure
names that are indexed by test number (kAlgParam, kActual, and kResult). The content of these
structures (the third level) consists of named fields, but by using MATLAB run-time field name
discovery, the framework does not impose any fixed names or data format for those fields.
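In Python terms, the three-level structure might look like the following. The framework itself uses MATLAB structs and run-time field name discovery; the third-level field names below are illustrative only, since the framework imposes none:

```python
# Python analogue of the kTest hierarchy described in the text.
kTest = {
    "kAlgParam": [{"threshold": 0.2}, {"threshold": 0.4}],  # level 2, indexed by test
    "kActual":   [{"baud": 100},      {"baud": 600}],
    "kResult":   [{"baudEst": 99.7},  {"baudEst": 601.2}],
}

def fields(k_test, struct_name, test_index):
    """Run-time field discovery: analysis code enumerates field names
    (MATLAB's fieldnames; dict keys here) rather than assuming them."""
    return sorted(k_test[struct_name][test_index].keys())
```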
3.1.3 Analyzing results
Once the test results are collected, they can be analyzed using one or more of the frame-
work analysis tools. In addition to calculating standard error metrics, functions are available to
calculate N-Way Analysis of Variance (ANOVA) [8] and to develop Receiver Operating Charac-
teristic (ROC) curves [82].
ANOVA is a statistical model representing an output value as a linear combination of
factors. For this framework, the factors can be signal characteristics, transmission channel char-
acteristics, or algorithm parameters. Provided that the data is a reasonable fit to the model, it can
determine if the means in a set of data differ when grouped by multiple factors. If they do differ,
the ANOVA values can help determine which factors or combinations of factors are primarily
associated with (or cause) the difference.
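The framework computes N-way ANOVA; the underlying idea can be shown with a minimal one-way F statistic in NumPy. This is an illustrative reduction, not the framework's implementation:

```python
import numpy as np

def one_way_anova_F(*groups):
    """F statistic for a one-way layout: mean square between groups over
    mean square within groups (the N-way case generalizes this)."""
    data = np.concatenate([np.asarray(g, float) for g in groups])
    grand_mean = data.mean()
    ss_between = sum(len(g) * (np.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(((np.asarray(g, float) - np.mean(g)) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(data) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)
```

A large F value indicates that grouping the results by a factor explains much more variance than remains within the groups.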
Figure 3.3: Example receiver operating characteristic curves (true positive rate versus false positive rate)
ROC curves were developed in the 1950s as a method for characterizing receiver dis-
crimination of radio signals contaminated by noise. They have since been generalized and are
used extensively in epidemiology and medical research. The x-axis on the graph indicates the
probability of failure (0 to 1), commonly called a false positive, where the algorithm or test indi-
cates a true value when it is not true. The y-axis indicates the probability of success (0 to 1),
commonly called a true positive, where the algorithm or test indicates a true value when it is
true. Figure 3.3 illustrates two theoretical ROC curves for an algorithm. The best possible algo-
rithm would yield a graph that was a point in the upper left corner of the ROC space, i.e., 100%
sensitivity (all true positives) and 100% specificity (no false positives). This is rare, so desired
operating regions are usually specified by a maximum False Positive value and a minimum True
Positive value. Varying algorithm parameters will move the operating point along the curve.
Note that in Figure 3.3 the desired operating region (shown in gray) is intersected by the solid
curve but not by the dotted curve.
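Computing the points of an empirical ROC curve is straightforward; a NumPy sketch with illustrative detector scores:

```python
import numpy as np

def roc_points(scores, labels):
    """(false positive rate, true positive rate) pairs obtained by sweeping
    a decision threshold across the detector scores."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, int)
    n_pos = (labels == 1).sum()
    n_neg = (labels == 0).sum()
    points = []
    for t in sorted(set(scores), reverse=True):
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / n_pos   # sensitivity
        fpr = (pred & (labels == 0)).sum() / n_neg   # 1 - specificity
        points.append((fpr, tpr))
    return points
```

Varying the threshold (or an algorithm parameter) moves the operating point along the resulting curve, exactly as described above.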
3.2 Comparing symbol rate algorithms
Estimation of the symbol rate of an unknown digital communication signal is an impor-
tant early analysis step for most automated surveillance applications. Since later analysis stages
often rely on the availability of an accurate symbol rate, a practical symbol rate analyzer should
provide a reliable estimate of accuracy as well as the symbol rate value. The performance of
symbol rate algorithms is heavily dependent on the method of signal encoding. Therefore, many
automated applications determine signal type upstream of symbol rate detection. As a first test of
ALEF, we assumed upstream signal type detection and focused on symbol rate estimation for
Phase Shift Key (PSK) signals.
Symbol rate estimation algorithms have been developed, analyzed, and published for
many years. For example, Reed [60] used a fourth-power signal envelope algorithm, Koh [40]
developed an algorithm based on the absolute value of the signal envelope, Sills [68] developed a
Euclidean algorithm based on sample histograms and probability, and textbooks (e.g., Sklar [69])
describe classic techniques using autocorrelation, derivative of the phase, and delay and multi-
ply. In practice, it is difficult to quantitatively compare the relative performance of these
algorithms based solely on the published information. Su and Kosinski [73] observed, “Robust-
ness studies of modulation recognition performance in regards to symbol rate error, symbol
timing error, symbol resampling error and channel distortion are not found in our studies.” An
author may concentrate on a specific aspect of an algorithm and, due to space limitations, omit
key implementation details required to independently reproduce the results. In some cases, sim-
plifying assumptions required to perform a closed form mathematical analysis may not reflect
realistic operating environments. In addition, few publications characterize the operation of an
algorithm with respect to all of the parameters that may affect its performance. Typical simula-
tions fix some parameters and omit boundary values for others. Symbol rate detection is
representative of a large class of practical signal processing problems in which analytical evalua-
tion is an inadequate predictor of how these algorithms perform in practice.
Most symbol rate algorithms have a similar structure consisting of two computational
stages [84]. The first stage develops a feature space (a vector) by applying a transform to the
digitized signal. The goal of the transform is to locate and emphasize symbol transitions. The
second stage analyzes this feature vector to calculate a periodic value that is the estimate of the
symbol rate. To determine the applicable algorithm-specific parameters, we consider the two
stages independently.
Using the ALEF algorithm calling convention, we implemented three common first stage
transforms applied to PSK signals: the derivative of the phase angle (DPDT), the magnitude of
the signal squared (MAGSQ), and the average sum of the signal envelope (AVSE). In recent
years, there has been significant interest in using the wavelet transform, and several authors have
explored this potential [16, 24, 63]. The wavelet transform algorithm is explored in the next sec-
tion of this chapter.
The first stage of the DPDT algorithm transforms the input signal by taking the magni-
tude of the derivative (approximated using the first difference) of the signal’s unwrapped
instantaneous phase angle. As seen in the MATLAB statements below, there are no adjustable
parameters. In these statements, x is the analytic (complex) signal vector, and x1out is the first
stage output.

    xphase = unwrap(angle(x));
    x1out = abs(diff(xphase));
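The same first stage can be sketched in NumPy (an illustrative translation, not code from this work). Applied to a toy signal with a single phase jump, the output spikes at the symbol transition:

```python
import numpy as np

def dpdt_first_stage(x):
    """DPDT first stage: magnitude of the first difference of the unwrapped
    instantaneous phase of the analytic (complex) signal."""
    xphase = np.unwrap(np.angle(x))
    return np.abs(np.diff(xphase))

# Toy BPSK-like signal: constant phase for 8 samples, then a pi phase jump.
x = np.exp(1j * np.concatenate([np.zeros(8), np.pi * np.ones(8)]))
x1out = dpdt_first_stage(x)    # spikes at the transition, zero elsewhere
```

The transform is zero wherever the phase is constant, which is what makes the symbol transitions stand out for the second stage.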
The second stage is more challenging. Most published descriptions determine the perio-
dicity associated with the symbol rate by using a fast Fourier transform (FFT) of the first stage
output and selecting a peak. The process of selecting a peak varies from simply choosing the
value with the largest magnitude to sophisticated application of thresholds, filtering, and weight-
ing prior to value selection [47]. Parameters associated with this hidden peak search algorithm
are rarely disclosed, yet they affect practical performance and therefore must be characterized.
Rather than attempting to reproduce specific second stage implementations from sparse descrip-
tions, we developed a common second stage for all symbol rate algorithms.
Attempting to avoid the peak search ambiguities, our first implementation simply se-
lected the peak with the largest magnitude. Visualization of failed test cases revealed
circumstances where the correct peak was obvious but was not the maximum. Many of the failed
cases were signals with low symbol rates and minimal to no pulse shaping, which resulted in
significant harmonic peaks appearing in the output of the Fourier transform. Since the harmonics
occurred at multiples of the symbol rate, we included harmonic detection code in the second
stage to improve performance.
Our final second stage design contained several adjustable parameters including the de-
sired baud resolution (which affects the size of the FFT), whether to use a Hamming window
prior to each FFT to reduce spectral leakage, and whether to use an overlap and addition of the
FFT results to enhance the desired peaks. This last option also requires a low pass filter on the
final FFT accumulation to remove any trend at the low frequencies, introducing another parame-
ter for the width of the moving average.
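The second stage just described can be sketched as follows (an illustrative NumPy fragment with our own names; the harmonic detection logic and ALEF's actual implementation are omitted). It accumulates Hamming-windowed, 50%-overlapped FFT magnitudes of the first stage output, removes the low-frequency trend with a 16-point moving average, and reports the largest remaining peak above the minimum cutoff:

```python
import numpy as np

def second_stage(feature, sample_rate, nfft=4096, min_hz=20.0):
    """Simplified second stage: estimate the dominant periodicity (in Hz) of
    the first-stage feature vector from accumulated, detrended FFT magnitudes."""
    hop = nfft // 2                       # 50% overlap
    window = np.hamming(nfft)             # reduce spectral leakage
    accum = np.zeros(nfft // 2)
    for start in range(0, len(feature) - nfft + 1, hop):
        seg = feature[start:start + nfft] * window
        accum += np.abs(np.fft.rfft(seg))[:nfft // 2]
    # 16-point moving average removes the low-frequency trend
    trend = np.convolve(accum, np.ones(16) / 16, mode="same")
    detrended = accum - trend
    hz_per_bin = sample_rate / nfft
    lo = int(np.ceil(min_hz / hz_per_bin))        # minimum cutoff in bins
    peak_bin = lo + int(np.argmax(detrended[lo:]))
    return peak_bin * hz_per_bin
```

With a first stage output containing a strong 300 Hz periodicity sampled at 8000 Hz, this sketch returns an estimate near 300 Hz, limited by the roughly 2 Hz bin resolution of a 4096-point FFT.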
When we used the framework to characterize these second stage parameters, the results
indicated that the performance of the peak search behaved uniformly with respect to the different
first stage transforms. This was an important observation since it implied that the performance of
the second stage would not mask the performance of any of the first stage transforms. It also im-
plied that we could use the same fixed set of second stage parameters, and any variations in
performance would result solely from the interaction of the first stage transform with the varia-
tions in signal factors. Table 3.1 lists the final second stage peak search parameters selected for
the symbol rate experiments.
Table 3.1: Peak search parameters for symbol rate experiments
Parameter Stage Values used
Minimum cutoff 2 20 symbols per second (baud)
FFT Size 2 4096 points (~2 Hz resolution)
Overlap in FFT 2 50 %
Windowing 2 Hamming
Low pass filter 2 16 point moving average
Harmonic peak threshold 2 40 % of maximum peak value
3.2.1 Phase 1 – test signal generation
To reiterate, the signal domain for this set of experiments consisted of PSK modulated
signals in the High Frequency (HF) communication band, typically limited to channels with a
bandwidth of 5 kHz or less. Table 3.2 lists common signal properties that can affect the perform-
ance of a blind symbol rate estimation algorithm. The values selected for these experiments cor-
respond to expected realistic ranges for a surveillance system. We also included complexities
that arise in practice but are seldom considered in published studies, such as pulse shaping and
symbol rates that are not integer divisors of the sample rate.
Table 3.2: HF signal property boundaries for symbol rate experiments

Property                        Example HF Values                                   Values Used
Modulation type                 FSK, MSK, PSK, DPSK, OQPSK, ASK                     PSK
Pulse shape                     None, raised cosine (RC), root RC (RRC), Gaussian   None, RC, RRC
Excess bandwidth (rolloff)      Limit: 0.00 to <1.00; typical: 0.10 to 0.35         0.1, 0.2, 0.35
Symbol rate                     Typical: 10 to 2400 symbols per second              50, 100, 300, 1280, 2400
Symbol states                   2, 4, 8, 16                                         2, 4, 8
Signal to noise ratio (SNR)1    Practical range: 0 to 60 dB                         9, 12, 16, 20, 40
Frequency offset from baseband  Possible range: 0 to 500 Hz                         0, 50, 100
For each combination of the six multiple-valued test signal factors we generated 10 ran-
dom variations of noise and symbol content, yielding an ensemble of 15,750 test signals. The
framework manages frequency offset by heterodyning the baseband signal before calling the al-
gorithm. Generating the test files required a small MATLAB script to set up the parameters and
options before calling the framework functions. This excerpt from the script shows the general
format.

    sLibPath = 'C:\SymRateLib';
    kOptions.nVersions = 10;
    kChannelParam.arSNRdB = [9 12 16 20 40];
    kSignalParam.anStates = [2 4 8];
    genPSKfiles(kSignalParam, kChannelParam, kOptions, sLibPath);
Based on expected limitations and desired characteristics of actual real-time surveillance
systems that use these algorithms, we selected a fixed sampling rate of 8000 complex samples
per second and a maximum signal duration of approximately 2 seconds.
1 AWGN bandwidth measured at approximately the Nyquist sample rate.
3.2.2 Phase 2 – test execution
The following is an excerpt from the MATLAB script used to process the test files
through the DPDT algorithm (function name 'sr_dpdt').

    sFiles = fw_listdir( sLibPath );
    % Basic parameters
    rStartTimeSec = 0.6;
    rSampleRate = 8000;
    rDuration = 2 * 2^13 / rSampleRate;
    rOffsetHz = [0 50 100];
    % Common algorithm parameters
    kAlgParam(1).sPeakOpt = 'orh';
    kAlgParam(1).rMinSymbolsPerSec = 20;
    kAlgParam(1).sSignalOpt = 'c';
    sAlg = 'sr_dpdt';
    fw_runtest(sFiles, rStartTimeSec, rDuration, rOffsetHz, sAlg, kAlgParam);
For this experiment, we programmed the symbol rate algorithms to return not only the es-
timated symbol rate but also several standard statistical measures of dispersion and central
tendency for the first and second stage outputs, including the mean and standard deviation. These
values were later used to develop unique quality metrics for the symbol rate estimation values.
3.2.3 Phase 3 – results analysis
After running the test execution scripts to generate the raw data, additional MATLAB
scripts directed the framework analysis functions. The analysis scripts often require field names
as part of their calling convention. The following excerpt was used to analyze a kTest structure
and produce a report summarizing test information and algorithm performance (percent correct
responses). This experiment deemed a symbol rate estimate correct when the error was less than
2% of the actual value.

    sField = 'symbolsPerSec';
    sType = 'PctErr';
    rPass = 2;
    [bPass, hfig, sReport] = fw_errsummary( kTest, sField, sType, rPass );
    [rFactor, kInfo, sReport] = fw_factor( kTest );
The resulting bPass variable is an N × 1 vector containing a 1 for each test that passed
and a 0 for each test that failed (where N is the number of tests). The variable rFactor is an
N × M matrix with a column for each of the M multiple-valued factors, and kInfo is a structure
containing information about those factors.
To understand the flexibility of this approach, consider that the error calculation function
requires only a field name string to calculate the error for any continuous (non-discrete) value
recorded by the test. The following statement produces an N × 1 vector (rPctErr) containing the
symbol rate percent error for each test.

    rPctErr = (kTest.kActual(:).(sFieldName) - kTest.kResult(:).(sFieldName)) ...
              ./ kTest.kActual(:).(sFieldName) * 100;
Table 3.3 combines the result tables from analyzing the three algorithms with respect to
two of the most significant performance factors: actual symbol rate and symbol pulse shaping.
For the specified range of values, each of the algorithms demonstrated reasonably uniform per-
formance with respect to variations in offset frequency, number of symbol states, and noise.
Appendix E shows a sample of the report for DPDT generated by the analysis portion of the
framework.
Table 3.3: Percent correct symbol rate estimation by algorithm and pulse shape

Algorithm:           DPDT                  MAGSQ                 AVSE
Pulse shape:         None   RC     RRC    None   RC     RRC    None   RC     RRC
Actual symbol rate
  50                 79.6   77.8   85.0   0.0    0.0    9.1    0.0    1.6    20.7
  100                100.0  97.3   98.4   0.0    0.0    27.6   0.0    18.2   55.6
  300                100.0  100.0  100.0  0.0    5.8    54.7   0.0    89.6   98.4
  1280               100.0  100.0  100.0  2.0    44.2   89.1   1.3    100.0  100.0
  2400               100.0  100.0  100.0  3.3    65.6   99.3   2.7    100.0  100.0
For the outlined conditions, the results are striking. While DPDT showed the best overall
performance, its accuracy degraded significantly below 100 baud. This is primarily due to the
lower number of symbol transitions in our fixed length test signal. Some publications hide that
shortcoming by using a fixed number of symbol transitions (varying the signal length). This
might work for cooperative communication, but for a blind recognition system, the rate is un-
known. Therefore it is more likely that the analysis will be performed over a fixed length signal
capture.
3.3 Improving a wavelet based symbol rate algorithm
The failure of the common symbol rate algorithms in large parts of the practical signal
space motivated a search for more robust symbol rate algorithms. This section applies ALEF to
wavelet algorithms for symbol rate to illustrate how ALEF can be used to optimize the parame-
ters of algorithms to improve their performance.
3.3.1 Overview of wavelets
Wavelets are an alternative to, rather than a replacement for, Fourier techniques. Instead
of the continuous sine and cosine functions employed by Fourier analysis, wavelets are of
limited duration. A wavelet description typically starts at time t = 0 and terminates at some
time t = N. Outside of that interval, the value of the wavelet is zero.
Essentially, wavelet transformations capture localized behavior better than classical Fou-
rier transforms. Fourier transforms of localized events lose their temporal relation, while a
wavelet transform of a localized event retains some connection to the time when it occurs. In-
stead of transforming a pure time description (the signal data) into a pure frequency description,
wavelets attempt to find a good compromise. Unfortunately, the Heisenberg uncertainty principle
still applies. Just as with short duration Fourier techniques, time resolution is lost as frequency
resolution increases.
An extreme example is an instantaneous impulse with all frequencies in equal amounts.
Its Fourier transform has a constant magnitude over the whole spectrum. By contrast, a wavelet
transform will involve only a small fraction of the wavelets, those that overlap the impulse. This
allows a wavelet transform to indicate large energy contributions at the time of the impulse.
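This contrast is easy to demonstrate numerically (an illustrative NumPy fragment, not from this work): the FFT of a unit impulse has the same magnitude in every frequency bin, while a Haar wavelet response (here approximated by direct convolution at scale 4) is nonzero only near the impulse.

```python
import numpy as np

impulse = np.zeros(64)
impulse[10] = 1.0

# Fourier view: the impulse spreads equally over all frequency bins.
spectrum = np.abs(np.fft.fft(impulse))      # every bin has magnitude 1.0

# Wavelet view: only wavelets overlapping the impulse respond.
kernel = np.array([1.0, 1.0, -1.0, -1.0]) / 2.0   # Haar wavelet, scale 4
response = np.convolve(impulse, kernel, mode="same")
# response is nonzero only within a few samples of index 10
```

The wavelet response localizes the event in time, which is exactly what a symbol-transition detector needs.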
All digital communication signals exhibit some form of localized behavior at the symbol
transitions. For example, FSK signals have constant amplitude but shift their carrier frequency
while PSK signals change phase at intervals that are a function of the data rate. Theoretically,
wavelet analysis should be ideally suited to extracting such details.
Wavelets are attractive computationally as well. Wavelet analysis often scales linearly,
O(n), with the size of the input data set. For large data sets, this is potentially faster than Fourier
analysis, which scales as O(n log n).
3.3.2 Designing a wavelet based symbol rate algorithm
Chan et al. [16] present a detailed mathematical description of the wavelet algorithm.
Their paper also includes a brief description of the implementation used in their simulations,
which follows the two-stage format. The first stage applies the continuous wavelet transform us-
ing multiple scale values. The squared magnitudes of the resulting coefficients are summed
across the scales to produce a vector that remains time coherent. The following MATLAB code
snippet is a simplified rendition of their first stage algorithm. The variable x is the analytic (com-
plex) signal vector, coefs is a matrix containing the complex coefficients of the resulting
transform, and xfeature is the first stage output vector.

    coefs = cwt( x, nScale, sWavelet );
    xfeature = coefs .* conj(coefs);
    xfeature = sum(xfeature);
Since this publication had insufficient details to reconstruct the second stage peak search, we
used the same second stage component developed for the previous experiments.
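A minimal NumPy sketch of this first stage, restricted to the Haar wavelet and implemented by direct convolution rather than a CWT library call (the names and details are ours, not Chan et al.'s):

```python
import numpy as np

def haar_cwt_row(x, scale):
    """Haar continuous wavelet transform of x at one even integer scale,
    approximated by convolution with a sampled Haar kernel."""
    half = scale // 2
    kernel = np.concatenate([np.ones(half), -np.ones(half)]) / np.sqrt(scale)
    return np.convolve(x, kernel, mode="same")

def mcwt_first_stage(x, scales=(4, 6, 8)):
    """First-stage MCWT feature: sum over scales of the squared magnitude of
    the Haar CWT coefficients of the analytic signal x."""
    coefs = np.vstack([haar_cwt_row(x, s) for s in scales])
    return np.sum(coefs * np.conj(coefs), axis=0).real
```

Applied to a toy PSK signal with a single phase transition, the summed squared coefficients peak at the transition, which is the localized behavior the second stage's periodicity search exploits.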
There are two algorithm-specific parameters in the first stage of the wavelet algorithm:
the selection of the wavelet and the choice of scales. Chan et al. specified a Haar wavelet with
scales of 4, 6, and 8. The Haar wavelet is a popular choice because its simple structure allows
closed form analysis of the wavelet transform across a symbol change, and Haar is computation-
ally efficient. However, since the intended test set includes signals with pulse-shaped symbols, it
was neither analytically nor intuitively obvious that the Haar wavelet sufficiently matched realis-
tic input to adequately emphasize the symbol transitions. Therefore, we conducted experiments
that included a large number of available mother wavelet families, as well as an expanded num-
ber and combination of scales.
The wavelet algorithm also has two hidden parameters in the first stage: the sampling rate
and the length of the signal vector. Many published studies assume a sampling rate that is an in-
tegral multiple of the symbol rate. More importantly, these studies often assume a fixed number
of symbols, or worse, a fixed number of symbol changes. For the wavelet experiments we used
the same sample rate (8000 samples per second) as the previous symbol rate experiments and
maximum signal duration of approximately 2 seconds. Table 3.4 lists the algorithm parameter
values used as part of the various wavelet experiments. The final wavelet algorithm is referred to
hereafter as the multi-scale continuous wavelet transform (MCWT).
Table 3.4: Wavelet algorithm parameters for symbol rate experiments
Parameter Stage Values used
Wavelet 1 Haar, db6, sym6, coif3, bior1.3, bior5.5, mexh
Scales 1 2, 4, 6, 8, & combinations
Sample rate 1 8000 Hz
Signal duration 1 1 and 2 seconds
3.3.3 Validating the original MCWT behavior
The intent of the first wavelet experiment was to create a reasonable reproduction of the
original simulation conditions reported by Chan et al. to ensure that our implementation per-
formed comparably. However, since the published conditions included constraints that did not
correlate directly with our expected deployment environment, some changes and compromises
were required.
Limiting the test signal set satisfied two of the original constraints: no pulse shaped sym-
bols and a baseband carrier offset of zero. A third constraint, related to the sampling frequency,
was intentionally altered. Originally, the carrier frequency was fixed to be an integral multiple of
the symbol rate, and the sampling frequency was four times the carrier frequency. We used a
fixed sample rate of 8000 Hz and included multiple noise levels.
The final constraints imposed on the original test signals were that every symbol transi-
tion contained a phase change, and each test signal contained exactly 100 symbols. As already
indicated, we used fixed signal duration and randomly generated symbols to more realistically
simulate the target environment.
Theoretical analyses of symbol rate algorithms often rely on the symbols being independ-
ently and identically distributed (IID). For slow symbol rates, this assumption may be difficult to
satisfy in practice. By viewing the symbol distribution of a signal as a sampling problem (IID
relative to the number of symbol states), we can apply statistical methods to estimate the number
of symbols needed to ensure compliance with the assumed normal distribution. Equation 3.1 is
based on using a proportional sampling model [31].
n = z_{1-α/2}² · p(1 − p) / r²        (3.1)
Here n is the number of symbols, p is the reciprocal of the number of symbol states, r is the
desired accuracy, and z_{1-α/2} is the unit normal quantile at confidence level 1 − α. For a binary (two
symbol states) signal, we would need about 380 symbol samples to achieve 5% accuracy with a
95% confidence level. Satisfying this condition requires a reasonable 0.16 seconds at 2400 sym-
bols per second, but a lengthy 7.6 seconds at 50 symbols per second. Therefore, for the first
experiment, we limited the symbol rate to values above 300, and we fixed the input duration to
one second.
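The sample-size calculation above can be reproduced directly (a sketch; z = 1.96 is the standard two-sided 95% quantile, and the function name is ours):

```python
def symbols_needed(states, accuracy, z):
    """Equation 3.1: n = z^2 * p * (1 - p) / r^2, where p is the reciprocal
    of the number of symbol states and r is the desired accuracy."""
    p = 1.0 / states
    return z * z * p * (1.0 - p) / (accuracy * accuracy)

# Binary signal, 5% accuracy, 95% confidence:
n = symbols_needed(2, 0.05, 1.96)    # about 384 symbols, i.e. the "about 380" above
```

Dividing n by the symbol rate gives the required capture duration: roughly 0.16 seconds at 2400 baud but well over 7 seconds at 50 baud.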
For the conditions outlined, the MCWT algorithm achieved 100% correct responses for
the 450 test signals with an average symbol rate error of 0.04%.
3.3.4 Selecting wavelets and scales
The goal of the second wavelet experiment was to determine how the wavelet type and
combination of scales might affect the algorithm performance against a broader range of signals.
We expanded the test set to include pulse shaped symbols, lowered the symbol rate range to 50
baud, and included only the highest signal to noise ratio (40 dB) to yield a test set of 1050 sig-
nals. These were run against 49 different wavelets with 9 scale combinations each.
Table 3.5 summarizes the overall performance of four of the better-performing wavelets
plus the original Haar wavelet. The shaded areas mark the highest percent correct for each scale.
Note that combining scales does not always improve performance, and in many cases, it de-
grades performance.
Table 3.5: Percent correct symbol rate estimation by wavelet type and scale

Scale      Haar   sym6   db6    bior5.5  mexh
[2]        61.2   58.4   55.8   52.5     72.6
[4]        63.7   65.2   66.6   66.4     51.0
[6]        61.7   69.3   70.5   65.8     40.7
[8]        52.0   57.2   56.0   57.4     42.3
[2 4]      63.6   66.6   69.2   67.2     62.3
[4 6]      63.0   73.1   72.6   71.3     44.4
[6 8]      57.6   64.6   62.9   64.1     42.5
[2 4 6]    62.9   73.0   72.7   69.0     50.6
[4 6 8]    59.9   69.0   67.8   68.9     40.6
Figure 3.4 displays some of the data from Table 3.5 in a radar plot. The radial position
indicates the performance (percent correct) for each of the scale factors in the nine angular posi-
tions. This clearly shows that the db6 wavelet had the best overall performance at all scales
except the smallest.
[Figure 3.4 data: radar axes span 0 to 100 percent correct over the nine scale combinations [2], [4], [6], [8], [2 4], [4 6], [6 8], [2 4 6], and [4 6 8]; series plotted: Haar, mexh, db6.]
Figure 3.4: Radar plot of percent correct symbol rate estimation for three wavelets
Figure 3.5: Wavelet function ψ for Haar (left) and db6 (right)
Figure 3.6: Raised cosine symbol pulse
Like the Haar wavelet, db6 is a member of the Daubechies family. Figure 3.5 compares
the wavelet function for the Haar and db6 families. Notice that db6 more closely resembles the
raised cosine symbol pulse shown in Figure 3.6.
Space does not permit publishing the complete set of statistical analysis data. Table 3.6
summarizes the performance for db6 and Haar using the scale for each with the best overall per-
formance. It is readily apparent that while pulse shaping affects both wavelets, the db6 wavelet
outperformed Haar. Considering that both wavelets performed poorly with raised cosine (RC)
pulse shaping at low symbol rates, equation 3.1 indicates that increasing the signal length above
one second would likely improve performance.
Table 3.6: Percent correct symbol rate estimation by wavelet and pulse shape

Wavelet/Scale:   Haar [4]              db6 [2 4 6]
Pulse shape:     None   RC     RRC     None   RC     RRC
Symbol rate
  50             100.0  0.0    24.4    100.0  1.1    45.6
  100            100.0  12.2   42.2    100.0  1.1    100.0
  300            100.0  37.8   78.9    100.0  41.1   100.0
  1280           100.0  88.9   100.0   100.0  100.0  100.0
  2400           100.0  92.2   100.0   100.0  92.2   100.0
3.3.5 Performance comparison with DPDT
The final experiment applied both the original Haar wavelet and the db6 wavelet to the
full 15,750 test signals and extended the signal duration to two seconds. This test set also intro-
duced a carrier offset factor, which simulates the target system where the process of converting a
detected signal to baseband may not precisely center tune the signal. With these changes, the re-
sults can be compared directly with the DPDT results in section 3.2.3.
As seen in Table 3.7, both wavelets achieved 100% correct response for symbols with no
pulse shaping, demonstrating immunity to both noise and carrier offset. However, the perform-
ance against pulse shaped symbols deteriorated, especially at the lower symbol rates.
Table 3.7: Percent correct symbol rate estimation by algorithm and pulse shape

Algorithm:       Haar [4]              db6 [2 4 6]           DPDT
Pulse shape:     None   RC     RRC     None   RC     RRC     None   RC     RRC
Symbol rate
  50             100.0  0.2    5.4     100.0  2.2    18.9    79.6   77.8   85.0
  100            100.0  2.0    11.0    100.0  2.4    21.3    100.0  97.3   98.4
  300            100.0  20.2   51.8    100.0  13.7   21.3    100.0  100.0  100.0
  1280           100.0  95.8   100.0   100.0  99.8   100.0   100.0  100.0  100.0
  2400           100.0  99.3   100.0   100.0  98.9   100.0   100.0  100.0  100.0
3.3.6 Developing a quality metric
The performance achieved by MCWT in this experiment would not qualify this algorithm
for deployment, especially since DPDT performed better under a number of conditions. How-
ever, since DPDT was not 100% successful under all conditions, it would be beneficial if some
quality metric could be calculated that would indicate the probability that the estimated symbol
rate was correct.
The statistical data collected by ALEF provides the data necessary to search for such a
metric. Considering that the first stage transform emphasizes symbol transitions, statistical con-
ditions in the subsequent FFT vector might indicate the presence of a strong symbol rate peak.
Intuitively, the normalized dispersion (standard deviation) and central tendency (mean) of the bin
magnitude values should both be small. A low value for the mean of the FFT bin magnitudes
should imply that the FFT magnitude is small for most frequencies and that a peak associated
with the symbol rate would have good separation from the background. Similarly, a low value of
the standard deviation should imply a flat spectrum, again suggesting good peak separation.
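This intuition can be sketched as follows (our names, not ALEF's): normalize the mean and standard deviation of the FFT bin magnitudes by the peak magnitude, so a spectrum with one well-separated peak scores low and a cluttered spectrum scores high.

```python
import numpy as np

def fft_quality(fft_mags):
    """Candidate quality metrics: mean and standard deviation of the FFT bin
    magnitudes, normalized by the peak magnitude.  Low values suggest a flat
    background with a well-separated symbol rate peak."""
    mags = np.asarray(fft_mags, dtype=float)
    peak = mags.max()
    return mags.mean() / peak, mags.std() / peak

# One dominant peak over a low background -> small normalized mean:
peaked = np.full(4096, 0.01)
peaked[154] = 1.0
# No dominant peak -> normalized mean near 1:
cluttered = np.full(4096, 0.9)
```

A small normalized mean can then be thresholded to flag symbol rate estimates that are likely to be correct.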
For the db6 wavelet, Figure 3.7 shows the probability density curves produced by ALEF
of the normalized mean of the FFT magnitude bins for signals correctly analyzed (passed) and
incorrectly analyzed (failed). The y-axis is scaled so that the area under
each curve is one. Note the small overlap between the probability of the calculated symbol
rate being correct (solid line) and the probability of it being incorrect (dashed line). For the db6
wavelet, the normalized mean appears to be an excellent quality measure.
[Figure 3.7 data: probability density vs. normalized mean of the FFT bins (0 to 0.06); curves: Passed, Failed.]
Figure 3.7: Probability density for db6 wavelet using normalized mean
By contrast, Figure 3.8 shows the same probability density curves for the DPDT algo-
rithm. Note the large area of overlap indicating that this would be a poor quality measurement.
[Figure 3.8 data: probability density vs. normalized mean of the FFT bins (−0.02 to 0.12); curves: Passed, Failed.]
Figure 3.8: Probability density for DPDT using normalized mean
Based on the probability density curves, ALEF can calculate an ROC curve by varying a
threshold value applied to the normalized mean of the FFT bins as an indicator of whether the
calculated symbol rate is correct. Figure 3.9 shows the resulting curves for both MCWT (with
db6) and DPDT. This metric works well for MCWT but only marginally for DPDT. We explored
a number of statistical measures but were unable to find one that was a reasonable predictor of
performance for DPDT.
An example operating point might be a desired false positive rate below 1%. To achieve
this, we select a threshold value for the normalized mean of 0.016, resulting in a true positive
rate greater than 98%.
[Figure 3.9 data: True Positive vs. False Positive, both axes 0 to 1; curves: MCWT, DPDT.]
Figure 3.9: ROC curves using normalized mean of the FFT bins for MCWT and DPDT
3.3.7 Summary
This section demonstrated the use of ALEF to evaluate the effectiveness of wavelets for
symbol rate estimation in PSK signals. For signal parameters typically reported in the literature,
the Haar wavelet performed with nearly a 100% success rate. However, when pulse shaping,
noise, and realistic constraints on sampling rate and signal length were imposed, the results
changed dramatically. The Haar wavelet performed consistently worse than other wavelets.
While there was some improvement when multiple wavelet scales were considered, the gain was
not enough to raise the success rate to an acceptable level.
Practical automated surveillance systems often use a combination of algorithms during
analysis. An algorithm that provides a high level of confidence in the predictions it classifies as
correct can be useful, even if it is unable to make a prediction for every signal. For example,
while both the Haar and db6 wavelet algorithms were able to correctly estimate the symbol rate
for only 50% - 75% of the signals in the test set, they showed significant differences in quality.
We found a simple quality measurement that when used with the db6 wavelet designates correct
predictions with almost 100% accuracy. In contrast, no quality measurement was apparent for
accurately separating correct and incorrect predictions using the Haar wavelet or the phase de-
rivative algorithm.
3.4 Developing a new modulation classification algorithm
Another important category of signal recognition algorithms that continues to be an ac-
tive area of research is modulation classification. These algorithms process a narrowband signal
and attempt to determine the modulation technique used to encode information (see Table 2.1).
Two researchers, Azzouz and Nandi, began publishing collections of algorithms in the early
1990s [54]. Since then, many different methods have been used in these algorithms ranging from
statistical measures [18, 88] to neural networks [86]. While most published algorithms claim
very high accuracy, it is often at the expense of limiting the applicable domain to narrow regions
by excluding common signal characteristics. This often renders the algorithm ineffective for
blind recognition in a real operating environment.
Our third application of ALEF involved developing a modulation classification algorithm
for narrowband HF signals based on Karhunen-Loève (KL) decomposition. Our goal was to de-
velop a new technique and to characterize its performance without artificially restricting the
domain space.
KL decomposition, a principal component analysis technique, is one of several transforms
classified as unitary image transformations. Other transforms in this class include the discrete
Fourier transform (DFT) used to produce the spectrogram, the discrete cosine transform (DCT)
used in JPEG compression, and certain wavelet transforms such as the Haar. Unlike the other
transforms in this class, KL decomposition does not use a fixed set of basis vectors to accom-
plish the transformation. Instead, known data sets are analyzed to derive an optimized or minimal
set of basis vectors. While deriving the basis vectors is computationally intensive, using them to
transform new data is more efficient than most DFT implementations. The mathematical deriva-
tion of KL is well described in several textbooks (see chapter 3 of Kirby [36]).
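The core of the derivation can be sketched as standard principal component analysis (an illustrative NumPy fragment, not the implementation used in this work): estimate the covariance of the training vectors, take the eigenvectors with the largest eigenvalues as the basis, and transform new data by projecting onto them.

```python
import numpy as np

def kl_basis(training, k):
    """Derive k Karhunen-Loeve basis vectors from the rows of a training
    matrix: eigenvectors of the sample covariance with the largest eigenvalues."""
    mean = training.mean(axis=0)
    centered = training - mean
    cov = centered.T @ centered / (len(training) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :k]           # keep the top-k components
    return mean, basis

def kl_project(x, mean, basis):
    """Transform a new vector into its k KL coefficients."""
    return (x - mean) @ basis
```

Because the basis is derived from the data, k can often be far smaller than the vector length while retaining most of the energy, which is the property exploited in the ECG study cited above.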
The potential for data recognition and reduction is illustrated in [49] where KL decompo-
sition was used to analyze electrocardiogram (ECG) records. The data set consisted of a
collection of 105 normal and abnormal ECG records of 15 minutes each. From these, the re-
searchers extracted over 97,000 data sets for training. The result of this analysis showed that
almost 90% of the data energy was retained using only 4 KL coefficients. This provided a sensi-
tivity of 81% and a positive predictability of 80% for many abnormal heart conditions.
Applications of KL decomposition in other fields include visualization of cortical activity [62],
fingerprint identification [58], and facial recognition [37].
In signal processing applications, KL and related techniques such as singular value de-
composition (SVD) have been successfully employed for noise reduction and compression [32].
Classification applications analyzing satellite data have also been successful, especially when a
posteriori knowledge of the signal can be applied [79]. Ho et al. [25] used orthogonal transforms
to identify radar signatures. However, their technique does not appear applicable to HF signals
since it relies exclusively on the phase of the incoming signal, and the article revealed insuffi-
cient detail about the construction of a key component, the signal library. Lu et al. [51] used KL
in one part of their signal classification scheme after first pre-processing the data using a wavelet
transform to remove noise.
A common theme in most of the classification algorithms that use KL is the identification
of a single specific signal (or fingerprint, or face) from a library of all known signals. This pro-
ject attempted a related but more general approach. We wanted to create a sample database of
signals using a range of signal parameter values characteristic of the HF domain, then match
other arbitrary signals to the closest match in the library to identify the modulation type, even if
the library did not contain a specific match for all signal parameters.
This section describes the development of a KL template-based method for signal modu-
lation classification using the ALEF. First, we describe the selection and construction of the
signal feature vector and the process of using these vectors to create a signal library. Next, we
demonstrate how the new algorithm used the library to identify the modulation type for a signal.
Finally, we discuss the algorithm’s shortcomings.
3.4.1 Selecting a signal feature vector
Azzouz and Nandi [10], in their classic textbook on automatic modulation recognition,
state that “there are generally two philosophies for approaching the modulation recognition prob-
lems in the available references, namely, 1) a decision-theoretic approach and 2) a statistical
pattern recognition approach.” They also identified the principal underlying data used by most
techniques as either spectral processing (Fourier and time-frequency methods), statistical fea-
tures derived from one or more of the three principal signal attributes (amplitude, phase, and
instantaneous frequency), or histograms of one or more of the three principal signal attributes.
After a few attempts to derive basis vectors directly from the time series signal data, it
quickly became apparent that for any reasonable ensemble of signals there are an infinite number
of representations possible for any of the standard signal attributes. To achieve alignment and meaningful signal similarity, we needed a temporally neutral representation. This led to using histograms of amplitude, phase, and instantaneous frequency (IF), where IF was estimated as the
derivative of the phase. See Appendix F for screen captures of a MATLAB tool constructed to
explore these signal histograms.
After observing a wide range of signals, we selected 25 bins for amplitude, 33 bins for
phase, and 161 bins for IF. For amplitude, increased resolution did not yield additional useful
information. For phase, odd bin numbers allowed better balancing around zero, and 33 bins gave
approximately 11 degrees resolution. This provided 4 bins per state for our maximum expected 8
phase states in HF. Finally, IF resolution was selected to give approximately 1 bin separation for
a 50 baud signal with a modulation index of 1 (about 50 Hz resolution per bin). The final feature
vector for each signal consisted of the concatenation of the three histograms (normalized for
point count) in the order amplitude, phase, and IF, yielding a vector length of 219 points.
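The feature construction described above can be sketched as follows. This is an illustrative Python rendering, not the dissertation's actual MATLAB code; the sample rate and IF range parameters are assumptions introduced for the example.

```python
import numpy as np

# Hypothetical sketch of the feature vector: normalized histograms of
# amplitude, phase, and instantaneous frequency (IF), concatenated in that
# order.  Bin counts follow the text: 25 + 33 + 161 = 219 points.
N_AMP, N_PHASE, N_IF = 25, 33, 161

def feature_vector(signal, sample_rate=8000.0, if_span=4000.0):
    """Build a 219-point feature vector from a complex baseband signal.

    `sample_rate` and `if_span` are illustrative values, not parameters
    taken from the dissertation.
    """
    amplitude = np.abs(signal)
    phase = np.angle(signal)                      # radians in [-pi, pi]
    # IF estimated as the derivative of the (unwrapped) phase.
    inst_freq = np.diff(np.unwrap(phase)) * sample_rate / (2.0 * np.pi)

    h_amp, _ = np.histogram(amplitude, bins=N_AMP,
                            range=(0.0, amplitude.max() + 1e-12))
    h_phase, _ = np.histogram(phase, bins=N_PHASE, range=(-np.pi, np.pi))
    h_if, _ = np.histogram(inst_freq, bins=N_IF, range=(-if_span, if_span))

    # Normalize each histogram for point count so signal length cancels out.
    parts = [h.astype(float) / max(h.sum(), 1) for h in (h_amp, h_phase, h_if)]
    return np.concatenate(parts)                  # length 219
```

Because each histogram is normalized separately, signals of different lengths produce directly comparable vectors.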
Later, in an attempt to improve performance, we constructed two-dimensional histograms
using phase and IF. While the representations were interesting, the added dimensionality reduced
performance. We speculate that the increase in available bins diluted the average count per bin
and reduced the effective difference between points in the resulting database.
3.4.2 Developing the basis vectors
To create the signal database, we generated a set of training signals using four modula-
tion types (AM, OOK, PSK, and FSK) and no noise. For AM, we used a standard set of voice
files from the Linguistics Data Consortium [5]. For OOK, we used a fixed sequence of characters
for all signals, varying only the keying rate. For PSK and FSK signals, we used a contrived sym-
bol sequence that ensured a symbol transition at every symbol interval.
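One way to build such a contrived sequence is to step through the constellation points in order, which forces a state change at every symbol interval. The sketch below is an illustration of that idea for M-ary PSK, not the dissertation's generator; all parameter names and defaults are assumptions.

```python
import numpy as np

# Illustrative PSK training-signal generator: stepping through the M
# constellation points in order guarantees a transition at every symbol.
def psk_training_signal(n_symbols=64, samples_per_symbol=16, m=4):
    symbols = np.arange(n_symbols) % m            # 0,1,2,...,m-1,0,1,...
    phases = 2.0 * np.pi * symbols / m
    baseband = np.exp(1j * np.repeat(phases, samples_per_symbol))
    return symbols, baseband
```

Consecutive symbols always differ, so every symbol boundary exhibits a phase transition.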
The original vector set consisted of 369 training signals, each represented by a 219 point
vector. KL decomposition yielded 153 basis vectors. Since 99 percent of the training set energy
was captured in only 16 vectors, we used those to create a final algorithm database of 369 training signals with 16 points per signal. Figure 3.10 shows the clustering results of using only the
top three KL basis vectors to transform the training set into a three dimensional space.
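The energy-based truncation described above can be sketched as a standard KL (PCA-style) reduction. This is my illustrative rendering, not code from the dissertation; the SVD route is one common way to obtain the KL basis.

```python
import numpy as np

# Sketch of KL reduction: derive basis vectors from the training matrix,
# keep just enough to capture the target fraction of training-set energy,
# and project every training vector onto them.
def kl_reduce(X, energy=0.99):
    """X: (n_signals, n_features) matrix of feature vectors."""
    Xc = X - X.mean(axis=0)                       # center the ensemble
    # SVD of the centered data gives the KL basis (right singular vectors).
    _, s, vt = np.linalg.svd(Xc, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)          # cumulative energy fraction
    k = int(np.searchsorted(cum, energy)) + 1     # smallest k reaching target
    basis = vt[:k]                                # (k, n_features)
    return Xc @ basis.T, basis                    # projected data, basis
```

For the dissertation's data this step would take the 369 x 219 training matrix down to 369 x 16.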
Figure 3.10: Plot of transform clusters by modulation type using three KL vectors
Another interesting way to view these KL basis vectors is to plot the separation capability
of each vector, where Percent Separation = (Area of no overlap) / (Total area). The area of no
overlap is where at least one factor (in this case, modulation) is completely missing. Figure 3.11
shows this plot for the largest energy KL vector derived from the training set. While there is no
clear separation of a single modulation type, this vector does distinguish the amplitude modu-
lated signals (AM and OOK) from the phase and frequency modulated signals (FSK and PSK).
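The percent-separation metric can be made concrete as follows. This is my reading of the definition above, not code from the dissertation: bin one KL coordinate, and count a bin toward the "no overlap" area when at least one modulation class contributes no samples to it.

```python
import numpy as np

# Illustrative Percent Separation = (area of no overlap) / (total area).
def percent_separation(values, labels, bins=30):
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    edges = np.histogram_bin_edges(values, bins=bins)
    counts = np.array([np.histogram(values[labels == c], bins=edges)[0]
                       for c in np.unique(labels)])   # (n_classes, bins)
    total = counts.sum()
    # A bin "separates" when some class is completely missing from it.
    separating = (counts == 0).any(axis=0)
    return counts[:, separating].sum() / total
```

Fully disjoint class distributions yield a score of 1, while perfectly overlapping ones yield 0.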
3.4.3 Resulting modulation classification algorithm
A high level description of the modulation classification algorithm that uses the derived
basis vectors to achieve 100% correct recognition of the training set is as follows.
1. Create a histogram for the incoming signal,
2. Transform that histogram using the 16 basis vectors,
3. Find the nearest neighbor (Euclidean distance) in the database,
4. Declare the modulation type to be that of the nearest neighbor.
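The four steps above reduce to a few lines. The sketch below is illustrative; `basis`, `db_vectors`, and `db_labels` are assumed names for the 16 KL basis vectors and the transformed library, and the histogram of step 1 is assumed to have been computed already.

```python
import numpy as np

# Sketch of steps 2-4: project the incoming histogram onto the KL basis,
# then take the label of the nearest library entry.
def classify(x, basis, db_vectors, db_labels):
    z = basis @ x                                  # step 2: transform with basis vectors
    d = np.linalg.norm(db_vectors - z, axis=1)     # step 3: Euclidean distances
    return db_labels[int(np.argmin(d))]            # step 4: nearest neighbor's label
```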
[Histogram of bin number vs. fraction of total samples for the first KL vector (KLTRAIN16K_1dAmpPhIF Vector 1, PctSep = 1), with classes AM, FSK, OOK, PSK]
Figure 3.11: Signal separation capability of a single KL vector
3.4.4 Algorithm accuracy using test data set
The next step was to use the algorithm against a completely different set of signals that
had been generated using random symbols, similar but different signal parameters, and a range of
signal to noise ratios (SNR) from 9 to 40 dB. Across the 8805 test signals, the algorithm achieved only 69% overall accuracy. However, at an SNR of 40 dB it classified all modulation types 100% correctly, and at all noise levels PSK was classified 100% correctly.
3.4.5 Algorithm shortcomings and future exploration
The results from these experiments show a strong dependence on signal noise. At low
SNR, the histograms became distorted and were no longer able to match against the database.
Prior to constructing the algorithm, we also discovered a strong dependence on any phase rotation in PSK signals. When the training set included random initial phase conditions (essentially randomly rotated constellations), the resulting basis vectors could not reliably identify even the training set once reduced to a subset. We briefly worked with
several algorithms to align peaks in the histograms to target points (such as balancing the weight
around the middle point, moving the maximum to a specific location, etc.). While these helped
significantly, we were only able to automatically resolve the critical locations to within plus or
minus one bin on the phase histogram. The resulting basis vectors included derivative vectors to
account for this bin jitter, which severely degraded performance of this basis vector set against
any signals that were not in the training set.
Further research in the literature indicates that similar problems have been encountered
with other applications of KL. For example, Prabhakar and Jain [58] faced this problem with fin-
gerprint matching, noting that “Correlation-based techniques require the precise location of a
registration point and are affected by image translation and rotation.” Moghaddam and Pentland
[52] addressed this problem with facial recognition and determined that “most systems did not
address the problem of detection and used manual registration of the images.” Their own automated registration process took 15 times longer than the decomposition and matching phase, indicating that registration is a much harder problem to solve.
Based on our analysis of the data collected using ALEF, we identified several potential
avenues for improving the KL template matching algorithm. However, for each improvement
outlined below, significant additional work would be required.
• A low-pass filter would probably improve the SNR performance. Although it would
potentially distort the signal, the resulting histograms would not be as erratic.
• Instead of combining the amplitude, phase, and IF histograms into a single vector,
use them separately to derive different basis vectors for each attribute. This would
decouple the attribute relationships, but it would complicate signal library match-
ing.
• Due to the number of parameter variations applied to the PSK signals, there were
significantly more of them in the training set than the other modulation types. An
alternative approach would be to create a library for each modulation type and use a
nearest neighbor match against each separate library.
• The tight modulation-specific clusters in the three dimensional vector visualizations suggest that more advanced clustering techniques could be applied to delineate regions of the reduced space that contain training vectors from only a single
modulation class. This would facilitate both smaller database representation and
faster lookup while also allowing an “unknown” response.
3.5 Real world impact of ALEF
The Algorithm Evaluation Framework and accompanying symbol rate estimation work
have already had a significant impact. The Signal Exploitation Division at Southwest Research
Institute has adopted the signal generation capability for use on multiple projects, including a
training simulation environment for SIGINT operators. Portions of the framework have been
used on at least two other internal research projects.
Although the KL algorithm as currently implemented was not as accurate as existing al-
gorithms, its specificity for PSK is interesting and may be the subject for further research.
Finally, these algorithm experiments demonstrated that the process of designing, develop-
ing, testing, and optimizing signal processing algorithms is time consuming and difficult. Often,
simply finding the correct parameter settings for a given target environment requires significant
experimentation and tweaking. Without the framework, many of the exhaustive parameter
evaluations and searches for quality metrics would not have been feasible. In working on the KL
template matching algorithm, we realized that the number of potential changes in the feature ex-
traction was virtually infinite and that we could not possibly hope to manually optimize the
template features. Rather than trying to optimize our modulation algorithms, we decided to un-
dertake a significantly more challenging problem – that of automating the process for generating
feature extraction and detection algorithms. The final chapters of this thesis describe our efforts
in automatically generating signal processing algorithms.
Chapter 4 FIFTH™ - A Vector-based Language for Automatic
Algorithm Development
Many problems in image processing, data mining, and digital signal processing (DSP)
have a feature extraction phase prior to the application of a classification or other processing al-
gorithm. In most cases it is not possible to optimize the selection of features, and most
researchers define features using ad hoc trial and error techniques. Unfortunately, the range of
possible features to be extracted is virtually infinite. Automation is needed, yet the obstacles for
automatically generating algorithms are significant. We realized that ALEF provided a unique
opportunity to overcome these obstacles in a problem domain of significant practical interest.
ALEF has a fully characterized problem domain, can generate an unlimited number of test sig-
nals with known answers, and has the infrastructure for performing large scale analysis of the
results.
This chapter describes the development of FIFTH™, a new stack-based genetic pro-
gramming language that efficiently expresses solutions to a large class of feature recognition
problems. This problem class includes mining time-series data, classification of multivariate
data, image segmentation, and digital signal processing. FIFTH is based on principles originally
developed for the FORTH programming language. Key new features of FIFTH are a single data
stack for all data types and support for vectors and matrices as single stack elements. We demon-
strate that the language characteristics allow simple and elegant representation of signal
processing algorithms while maintaining the rules necessary to automatically evolve stack cor-
rect and control flow correct programs. FIFTH supports all essential program architecture
constructs such as loops, branches, variable storage, and automatically defined functions. An
XML configuration file provides easy selection from a rich set of operators, including domain
specific functions such as the Fourier transform (FFT). The fully-distributed genetic program-
ming environment for FIFTH (GPE5) uses CORBA for its underlying process communication.
4.1 Genetic programming background
4.1.1 Definition of terms
The general goal of creating computer systems that can solve problems without being
told precisely how to solve them has been articulated since the early days of modern computing.
Early references to this idea include the speculative essays of Alan Turing in the late 1940s and
the pioneering research work of R.M. Friedberg in the late 1950s [20].
Current literature uses several terms in relation to this general concept of computers auto-
matically learning to solve problems. While no officially sanctioned definitions exist, common
use suggests several delineations. Artificial intelligence (AI) is often used as an umbrella term to
encompass the broad goal of endowing computers with capabilities of human intelligence. As a
specific computer science discipline, AI research involves areas such as knowledge representa-
tion and neural networks. Another common name is machine learning, which is often classed as
a sub-field of artificial intelligence primarily concerned with learning from data using data min-
ing and statistical techniques.
In the 1960s, a number of researchers began using concepts from Darwin’s evolutionary
theories and applying them to AI problems [12]. This work is generally termed evolutionary
computing (EC) and is characterized by four basic operations: (1) working with large popula-
tions of potential solutions, (2) randomly modifying the population using genetically inspired
operations such as crossover (combining parts of two individuals in the population) and mutation
(randomly altering a single individual in the population), (3) determining the fitness of each in-
dividual as the population passes from one generation to the next, and (4) favoring the most fit
individuals to create each new generation.
There are two principal classes of EC techniques: genetic algorithms (GA) and genetic
programming (GP). Genetic algorithms are typically applied to search problems in a data space
using fixed length, syntax-free problem representation components called chromosomes [85].
Variable length chromosomes are occasionally employed, but the search remains in data space.
The resulting chromosome pattern is rarely directly executable. GP is characterized by using
techniques to search a function space to create directly executable programs. In many ways, ge-
netic programming is the research area that most closely approaches the initial artificial
intelligence goal of having a computer write its own software to solve a problem.
4.1.2 The genetic programming algorithm
Figure 4.1 illustrates the four high-level operations that comprise the typical genetic pro-
gramming process. The first step is to randomly generate an initial population of programs
representing possible solutions to the problem. Second, each individual in the population is
evaluated to determine its fitness in solving the problem. Once all individuals in the population
have been evaluated, comparison of the results with the termination criteria (successful solution
or maximum iterations) determines whether the algorithm continues. Finally, if no adequate solu-
tion exists in the current program population, evolutionary processes are applied to the current
program population to create the next generation of programs. The algorithm loops back to the
fitness evaluation step until the termination criteria are met.
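The loop in Figure 4.1 can be rendered generically as follows. This is an illustrative sketch, not any particular GP environment; for brevity the "programs" here are just bit strings evolved toward all-ones, standing in for real candidate solutions.

```python
import random

# Minimal rendering of the four steps: generate, evaluate, test termination,
# evolve.  The toy fitness is the number of ones in the genome.
def evolve(pop_size=40, genome_len=12, max_gens=200, seed=1):
    rng = random.Random(seed)
    # 1. Generate the initial population at random.
    pop = [[rng.randint(0, 1) for _ in range(genome_len)] for _ in range(pop_size)]
    for gen in range(max_gens):
        # 2. Evaluate the fitness of each individual.
        scored = sorted(pop, key=sum, reverse=True)
        # 3. Termination criteria: perfect solution found?
        if sum(scored[0]) == genome_len:
            return scored[0], gen
        # 4. Evolve the next generation: the fitter half survives, and
        #    crossover plus occasional mutation produce the rest.
        parents = scored[:pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)
            child = a[:cut] + b[cut:]               # crossover
            if rng.random() < 0.1:                  # mutation
                i = rng.randrange(genome_len)
                child[i] = 1 - child[i]
            children.append(child)
        pop = parents + children
    return max(pop, key=sum), max_gens
```

Keeping the fitter half unchanged (elitism) makes the best fitness monotone non-decreasing across generations.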
No
Generate InitialProgram Population
Evaluate Fitness ofEach Program
TerminationCriteria Met?
Evolve Next Generationof Programs
Yes
Finish
Start
Figure 4.1: Basic genetic programming algorithm
Each of these four steps hides a diverse landscape of theoretical considerations and im-
plementation decisions. In particular, the execution of this algorithm is in the context of a genetic
programming environment.
4.1.3 The genetic programming environment
A genetic programming environment is itself a software program or set of programs that
implements the framework necessary to execute the GP algorithm. There are several required
features of the environment, some of which are problem domain independent, while others are
highly problem domain dependent.
The most important domain independent aspect of a GP environment is the selection of
the programming language used to express the evolved programs. We will refer to this as the ge-
netic programming language, or GPL. Subsequent decisions affected by this initial choice
include:
• Selecting a model for the GPL based on either using an existing language like LISP, or
creating a new language,
• Selecting the internal structure, such as a tree or a linear sequence, used to represent a
GPL program during application of evolutionary manipulations,
• Selecting the set of primitive functions included in the GPL, and
• Determining the execution and evolution rules associated with the GPL.
The key features that are necessarily domain dependent include:
• The input data (referred to in GP literature as the set of terminals),
• Any domain specific functions in the GPL, and
• The fitness evaluation function.
4.1.3.1 Selecting a GP language
Theoretically, any programming language can be used to represent programs in a GP en-
vironment. However, the complicated syntax of languages like C, C++, and Java does not lend
itself easily to genetic manipulation. Deciding on a GPL is often driven by the internal represen-
tation chosen for implementing the genetic manipulations. Currently, there are three popular
representation schemes: tree-based, graph-based, and linear (or stack-based).
Early work by Koza [42] popularized the tree-based representation and the use of LISP as
both the GP language and GP environment implementation language. The parenthetical grouping
of LISP sub-expressions lends itself well to a tree-based representation of the program. For ex-
ample, Figure 4.2 shows a rooted, point-labeled tree with ordered branches that represents the
LISP expression (+ 2 (IF (> TIME 10) 3 4)). Note that the parenthetical grouping of LISP sub-
expressions is well suited to finding branch points and nodes required to perform program evolu-
tion. However, handling control structures such as loops and recursion is problematic for this
form of program representation.
Figure 4.2: Tree based representation of a LISP expression
Graph-based representations have been explored primarily for evolving parallel programs
[57]. This representation is not as common in the literature and has several shortcomings related
to generating syntactically correct programs when using control structures such as loops and re-
cursion.
The advantages of a linear, stack-based representation over tree-based representations
were introduced by Perkis [56]. This approach was more fully developed by Spector [70], who
recently introduced a GP environment that uses the stack-based language PUSH3. Others, such
as Tchernev [76], made the observation that FORTH [13] is a well developed stack-based lan-
guage especially suited for genetic manipulation. Stack-based languages tend to use postfix
notation (also known as reverse Polish notation or RPN) where all function parameters are taken
from a stack, and all function results are left on a stack. Figure 4.2 can be expressed in FORTH
as follows:

    Time @ 10 > IF 3 ELSE 4 THEN 2 +
The execution trace for this statement is shown in Table 4.1 with the value of the variable
Time equal to 20. Every non-numeric entity in the expression that is separated by one or more
spaces is a program token, also known in FORTH as a word. The top of the stack is the rightmost value in the Stack column. Note that since all parameters are stack-based, the following is functionally identical; the value 2 is simply pushed on the stack first and ends up under the 3 before the final + operation:

    2 Time @ 10 > IF 3 ELSE 4 THEN +
Table 4.1: Example FORTH program execution trace

Program   Stack (top ->)   Comment
Time      (address)        The address of the variable Time is pushed onto the stack.
@         20               The value of Time is fetched from memory and placed on the stack.
10        20 10            The number 10 is pushed onto the stack on top of the 20.
>         1                The > operator consumes two stack items and leaves a Boolean result.
IF                         IF consumes the top stack item. Since it is non-zero, execution continues to the ELSE.
3         3                Push the number 3 on the stack.
ELSE      3                Unconditionally jump to the code following THEN.
2         3 2              Push the number 2 on the stack.
+         5                Add the top two numbers on the stack and leave the result on the stack.
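The trace in Table 4.1 can be replayed with a toy evaluator. The sketch below is far from a real FORTH system; it handles only the words used in the example, and variables are modeled as a dictionary whose keys serve as "addresses".

```python
# Toy single-stack evaluator sufficient to replay the trace in Table 4.1.
def run_forth(tokens, variables):
    stack = []
    i = 0
    while i < len(tokens):
        t = tokens[i]
        if t.lstrip('-').isdigit():
            stack.append(int(t))
        elif t in variables:
            stack.append(t)                 # push the variable's "address" (its name)
        elif t == '@':
            stack.append(variables[stack.pop()])   # fetch the value at that address
        elif t == '>':
            b, a = stack.pop(), stack.pop()
            stack.append(1 if a > b else 0)
        elif t == '+':
            stack.append(stack.pop() + stack.pop())
        elif t == 'IF':
            if stack.pop() == 0:            # false: skip ahead to the ELSE branch
                while tokens[i] not in ('ELSE', 'THEN'):
                    i += 1
        elif t == 'ELSE':                   # reached after the true branch: jump to THEN
            while tokens[i] != 'THEN':
                i += 1
        elif t == 'THEN':
            pass
        i += 1
    return stack
```

Running it on both forms of the example program confirms they are functionally identical.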
4.1.3.2 Selecting the GPL function set
GPL primitive function sets usually include standard math, logic, and control flow opera-
tions that comprise the basic constructs used to express any program. Early work concentrated on
primitive math operations (PLUS, MINUS, TIMES, DIVIDE) and Boolean operations (AND,
OR, XOR). Researchers soon recognized that in order to evolve general programs capable of
solving large scale problems, the function set must include program control capabilities such as
memory, iteration, recursion, and branching [50]. Early attempts to evolve signal processing al-
gorithms using GP demonstrated the requirement for memory in order to store and access the
results of iterated calculations [67].
A major influence on the selection of the primitive function set is how the developer
chooses to represent different data types. Some GPLs use only floating point or only integer rep-
resentations, while others, such as PUSH3, use separate stacks for every data type including
Boolean, floating point, integer, and code execution. This requires significant repetition of intra-
stack manipulation primitives as well as inter-stack data movement primitives.
Deciding on the number and type of operators to include in the function set is non-trivial
for any reasonably complicated problem. The effective dimensionality of the search space increases exponentially as the number of available functions in the GPL increases. On the other
hand, the availability of higher-level, domain specific functions often reduces the number of
functions required to solve a problem and has been shown to be a large contributing factor in
whether a program solution evolves within reasonable time and computing constraints.
4.1.3.3 Defining the GPL execution and evolution rules
In addition to the GPL syntax, function set, and data representation, the GP environment
must define rules for how to evolve and evaluate programs. Some environments allow evolution
of syntactically incorrect programs but differ in how these programs are evaluated. For example,
Stoffel [72] employs a loose strategy that skips operations that would cause a stack underflow
and ignores any extra stack content when examining program output. Bruce [14] uses strict
evaluation rules and assigns a fitness of 0 to any program exhibiting either behavior, thus elimi-
nating it from the population. Tchernev [75] advocates program generation constraints that ensure stack correctness.
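The generation-constraint idea can be sketched as follows: track the stack depth while emitting tokens and only choose operators whose arity the current stack can satisfy, so every generated program is stack correct by construction. The operator set and probabilities here are illustrative, not taken from any cited system.

```python
import random

# Each operator maps to (values popped, values pushed).
OPS = {'+': (2, 1), '*': (2, 1), 'DUP': (1, 2), 'DROP': (1, 0), 'SWAP': (2, 2)}

def random_stack_correct(length=10, rng=None):
    rng = rng or random.Random()
    program, depth = [], 0
    for _ in range(length):
        legal = [op for op, (pops, _) in OPS.items() if depth >= pops]
        # Either push a literal or apply an operator the stack can satisfy.
        if not legal or rng.random() < 0.4:
            program.append(str(rng.randint(0, 9)))
            depth += 1
        else:
            op = rng.choice(legal)
            pops, pushes = OPS[op]
            depth += pushes - pops
            program.append(op)
    return program
```

Because an operator is only eligible when the tracked depth covers its arity, no generated program can underflow the stack.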
These decisions directly relate to reducing the search space. It is obvious that, no matter
what the programming language, the set of all syntactically incorrect programs is far greater than
the set of all syntactically correct programs. Therefore, early elimination of incorrect programs
from the search space is highly desirable.
4.1.3.4 Evaluating the fitness of each individual
GP environments use several different techniques for feeding data to evolved programs
and for examining program output to determine whether the program solved the problem. Some
environments are almost application domain independent. For example, Disciplus [7] employs a
simple ASCII file format from which the environment reads both the input data and the expected
output response. While simple and general, this approach imposes significant restrictions on the
types of problems that can be addressed. Other environments provide an application program-
ming interface (API). While this facilitates application to a wider range of problem domains, it
requires the researcher to write and test a significant amount of software.
Finally, some GP environments are very domain specific. For example, Koza [44] used
SPICE (a general-purpose circuit simulation program for nonlinear direct current, nonlinear transient, and linear alternating current analyses) as a key element of a GP environment for evolving
analog circuits such as filters and amplifiers.
Whatever the implementation, the fitness function must be applied to every individual in
the population at each generation, and the result must be a graded value that can distinguish a
better solution from a good one. The GP environment must organize this information so that it
can be used during the evolution of the next generation of programs.
A few experiments using GP to produce esthetic outputs such as art or music relied heav-
ily on a human evaluation of fitness. This is impractical for most problems due to the large
number of candidate program solutions and the large number of generations typically required to
achieve reasonable results.
4.1.3.5 Determining the termination criteria
Termination criteria are defined as either a maximum number of generations or the emer-
gence of a program that meets or exceeds the fitness target. Generic GP environments tend to use
single valued decision points, often in the form of a maximum error calculated by comparing the
program output with the desired output. Domain specific GP environments may embody a more
sophisticated fitness evaluation function but often still rely on a single threshold value. For ex-
ample, some of Koza’s analog circuit design work expressed the termination criteria as the mean
squared error between the output of the SPICE evaluation of the genetically evolved circuit and
the desired output waveform [41].
4.1.3.6 Evolving the next generation
In an attempt to emulate natural genetic processes, all GP environments rely on two
classes of evolution operators: innovation operators and conservation operators. Innovation op-
erators attempt to introduce new characteristics into a single individual. The most common
innovation technique is mutation, where randomly selected constants, functions, or program sec-
tions are randomly altered. There are many variations depending on the expected behavior of the
mutation. For example, in a stack-based language, a randomly selected function should only be
replaced with a new function that matches the previous stack signature (number of values ex-
pected and left on the stack). Otherwise, the new program will not be operationally correct.
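The signature-matched mutation just described can be sketched as follows. The function set and signatures below are illustrative examples, not any particular GPL's primitives: a randomly chosen function is replaced only by one with the same stack signature.

```python
import random

# Each function maps to its stack signature: (values consumed, values produced).
SIGNATURES = {'+': (2, 1), '-': (2, 1), '*': (2, 1), 'DUP': (1, 2),
              'NEGATE': (1, 1), 'ABS': (1, 1), 'SWAP': (2, 2)}

def mutate(program, rng=None):
    rng = rng or random.Random()
    ops = [i for i, t in enumerate(program) if t in SIGNATURES]
    if not ops:
        return list(program)
    i = rng.choice(ops)
    sig = SIGNATURES[program[i]]
    # Only functions with an identical signature are eligible replacements,
    # so the mutated program stays operationally correct.
    candidates = [f for f, s in SIGNATURES.items() if s == sig and f != program[i]]
    mutated = list(program)
    if candidates:
        mutated[i] = rng.choice(candidates)
    return mutated
```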
Conservation operators consolidate parts from two or more individuals in an attempt to
bring together good components from each. The most common conservation technique is cross-
over. The GP environment selects two individuals from the population at random. Within each
individual, a compatible transfer position is selected, and program fragments are exchanged. For
example, Figure 4.3 shows two parent programs in tree representation form before crossover.
The arrows identify the selected crossover points. Figure 4.4 shows the resulting two child pro-
grams after crossover.
[Parent 1: (+ 2 (IF (> Time 10) 3 4)); Parent 2: (* 5 (IF (< Time 5) 1 0)); arrows mark the selected crossover points]
Figure 4.3: Two programs before crossover
[The same two trees with the subtrees at the selected crossover points exchanged]
Figure 4.4: Two programs after crossover
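Crossover in the spirit of Figures 4.3 and 4.4 can be sketched with programs as nested lists in prefix form. For clarity the exchange points here are fixed at each root's second argument; a real GP system would pick compatible points at random.

```python
# Toy subtree crossover: swap one subtree between two parent trees.
def crossover(parent_a, parent_b):
    child_a, child_b = list(parent_a), list(parent_b)
    child_a[2], child_b[2] = parent_b[2], parent_a[2]   # exchange the subtrees
    return child_a, child_b

p1 = ['+', 2, ['IF', ['>', 'Time', 10], 3, 4]]
p2 = ['*', 5, ['IF', ['<', 'Time', 5], 1, 0]]
c1, c2 = crossover(p1, p2)
```

Each child keeps its own root and first argument while inheriting the other parent's conditional subtree.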
The GP environment’s selection of individuals in each generation for inclusion in the
evolutionary process must be both random and proportional to the fitness function. Typical set-
tings involve high rates of crossover and low rates of mutation. However, Banzhaf [11]
conducted experiments where a more equal balance led to better results on harder problems.
4.1.3.7 Optional GP environment features
As with any experimental environment, there are auxiliary functions that while not essen-
tial often contribute to the usefulness of the tool. In GP environments, these include visualization
of the evolutionary process, analysis tools, ability to distribute the evaluation process among
multiple processors, optimization of program parameters, and analysis of the resulting evolved
program solutions.
For example, one of the difficulties encountered with GP evolved programs is “code
bloat” [30, 71]. This is a condition in which large sections of nonfunctional code (i.e., the code
does not contribute to the problem solution) grow within an evolving program. These useless
sections are often called introns as they seem to be homologous to biological introns, which are
sections of seemingly useless polynucleotide sequences in a nucleic acid that do not code infor-
mation for protein synthesis. Biological introns are removed before translation of messenger
RNA, and removing GP introns is an active area of research. The GP environment can either
monitor development of introns as part of the evolutionary process, or it may wait for the final
program solution before applying code simplification techniques.
4.1.4 Current state of genetic programming research
Since Koza’s early work in 1996, there have been many GP environments developed in
the academic community. Table 4.2 lists a selection of GP environments that are mentioned in
conference papers, textbooks, and websites. The table only includes packages that could be veri-
fied as available and whose description indicated at least some level of maturity necessary to
attempt real problems. It is obvious from this survey that GP is still in its infancy. Most of the
GPEs referenced in academic papers have become dated and are no longer supported. A small
number have transitioned into open-source development efforts, one is strictly proprietary, and
one has gone commercial. Based on this survey, it does not appear that any of these environ-
ments are particularly well suited to serve as a foundation for developing signal processing
algorithms.
Table 4.2: Review of GP software

Software | Last Update | Impl. Lang. | Reference | Comments
EPROF (Evolutionary Programming Forth) | 2005 | FORTH | Elko Tchernev [[email protected]] | Source not available.
BEAGLE | 2004 | C++ | sourceforge.net/projects/beagle; beagle.gel.ulaval.ca | Large flexible framework. Limited documentation.
PUSH3 | 2004 | C++ | hampshire.edu/lspector/push.html | Stack syntax, but LISP-like tree representation.
Evolutionary Computation Framework | 2004 | C++ | eodev.sourceforge.net | Paradigm free, but not very complete.
Discipulus (RML Technologies) | 2004 | Unk. | www.aimlearning.com/products.htm | Commercial product. Source not available.
GPLab | 2004 | MATLAB | sourceforge.net/projects/gplab | Uses tree structure, limits MATLAB functions.
PMDGP (Distributed Genetic Programming) | 2002 | C++ | sourceforge.net/projects/pmdgp | Uses tree structure.
GPPS 2.0 | 1999 | C | Koza, GPII; www.genetic-programming.com | Source not available.
lilgp | 1998 | Java | garage.cse.msu.edu/software/lil-gp/ | Simplistic.
Genetic Programming Studio | 1998 | C++ | www.uco.es/~i52canoa/www3/index_en.htm | Source not available.
GPData | 1997 | C++ | ftp://cs.ucl.ac.uk/genetic/gp-code/ | Simplistic.
Table 4.3: Summary of problems solved in GP books

Book | Year | Problem Descriptions
I [42] | 1992 | Simple search problems such as 2D robotic arm movement, truck backup, moving boxes on a grid, and curve fitting.
II [43] | 1994 | Introduced Automatically Defined Functions (ADF). Applied these to elementary non-linear equations, basic biochemistry, and molecular biology sequence prediction.
III [44] | 1999 | Electrical engineering problems using SPICE to evaluate evolved models. Successfully evolved high performance analog filters, crossover networks, and amplifiers.
IV [45] | 2003 | Claims 36 instances where GP has produced human-competitive results, including 23 instances where GP has duplicated the functionality of a previously patented invention. Example problems shown in the areas of controllers, circuits, and antenna design.
While the scarcity of mature GP environments capable of handling complex problems
clearly indicates that this discipline is in its infancy, significant progress has been achieved since
the first GP exercises in the 1960s. One measure of progress comes from the examples presented
in four books written by GP pioneer John Koza. Table 4.3 lists the year each book was published
and gives a brief description of the types of problems addressed. The examples in book IV are
certainly more complex than those in book I, but the evolved programs still do not deal with the
advanced mathematics and vector manipulations necessary for digital signal processing.
4.2 Motivation for FIFTH
A fundamental difficulty in applying traditional GP environments to feature extraction
problems is the mismatch between the data handling capacity of the GPL and the requirements of
the applications. When examined from an abstract viewpoint, feature recognition algorithms can
usually be broken down into a series of transformations on vector spaces. Thus, the GP search
should explore functions on vector spaces. Unfortunately, common GP languages do not treat
vectors as single data elements. Even a simple vector operation such as multiplication of a vector
by a scalar requires a loop construct with branch and flow control. The probability that such con-
trol operations will be damaged during mutation and crossover is high. Even if the control
operations are maintained, it is unlikely that intrinsic sizes of vectors will be correctly preserved
as successive transformations are performed.
While there have been published applications of GP to vector-related problems, the re-
curring theme has been to circumvent the native vector handling problem rather than to increase
the expressiveness of the language. For example, to automatically classify sound and image
databases, Teller [78] designed a GP system that used special data access primitives with specific
bound constraints rather than using generalized vectors. In applying GP to signal processing al-
gorithms, Sharman [66] also avoided vector functions by introducing delay nodes and processing
time series data one point at a time. For pattern recognition, Rizki et al. [61] pre-processed the
data to reduce dimensionality (feature extraction) prior to applying GP to the problem.
The need to lift genetic program search to functions on vector spaces by supporting natu-
ral, control-free behavior for vector and matrix operations was a principal motivation for creating
a new genetic programming language. To ensure that the new language efficiently addressed a
large class of vector-based problems, including signal processing applications, we established
several essential design goals. First, the new language had to be capable of compactly expressing
sophisticated vector manipulation algorithms, naturally supporting mappings from ℝⁿ → ℝᵐ or,
in the case of complex values, ℂⁿ → ℂᵐ. Second, although vector operations would not require
control structures, the language still had to be Turing complete, supporting both branching and
looping operations. Finally, since the language is primarily for genetic programming use, its syn-
tax and structure needed to be tractable for automated manipulation. To accomplish these goals,
we chose to implement our new language using a linear, stack-based program representation with
syntax patterned closely after the FORTH language and early FORTH design considerations.
This parentage suggested the name for the new language: FIFTH™.
The following example serves to explain how FIFTH operates and to examine some of
the inherent difficulties that non-vector languages encounter in properly searching functions on
vector spaces. Consider again the feature development stage of the DPDT symbol rate algorithm
introduced in section 3.2 as written using MATLAB.

    xphase = unwrap(angle(x));
    x1out = abs(diff(xphase));
MATLAB’s built-in functions for calculating the phase angle as well as numerous other
vector functions make the code far simpler than corresponding code written in a typical proce-
dural language such as C. For example, the function angle, which calculates the phase angle of
each element in the vector from its corresponding complex sample, could be implemented by the
equivalent C function:

    double *angle(double *realXf, double *imagXf, int n) {
        int i;
        double *result = malloc(n * sizeof(double));
        for (i = 0; i < n; i++)
            result[i] = atan2(imagXf[i], realXf[i]);
        return result;
    }
This relatively simple C function illustrates two fundamental difficulties with trying to
evolve vector transforms in a non-vectorized language. First, simple transform operators require
control constructs (usually a loop) in a non-vectorized language. Second, since the elements of a
vector are not considered part of a single unit, the language cannot enforce consistent and correct
behavior. For example, the C version of angle assumes that the real and imaginary vectors are of
equal length n, a constraint that must be enforced outside this function. These two difficulties
indicate that the probability of such a program evolving in a way that preserves the correct be-
havior for standard algorithms is too low for practical consideration.
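The contrast with a vector-aware environment can be made concrete. The following sketch is not part of the dissertation's tool chain; it simply uses NumPy's unwrap, angle, diff, and abs as stand-ins for the MATLAB built-ins, with an invented complex test signal:

```python
import numpy as np

# An invented complex test signal: a pure tone at 0.05 cycles per sample.
t = np.arange(256)
x = np.exp(1j * 2 * np.pi * 0.05 * t)

# The DPDT first stage as whole-vector operations: no loops, no index
# bookkeeping, and the vector length is carried along automatically.
xphase = np.unwrap(np.angle(x))   # phase angle of each complex sample
x1out = np.abs(np.diff(xphase))   # magnitude of successive phase differences

print(x1out.shape)   # (255,): diff shortens the vector by one
```

Because each operator maps a whole vector to a whole vector, there is no control construct for mutation or crossover to damage.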
While it is possible to incorporate standard vector operations as primitives of the lan-
guage to eliminate some of these difficulties, there is no glue to preserve the integrity of data
vectors as they are transformed by these functions. From this perspective, a complete implemen-
tation of DPDT in C is quite complex, even if the Fourier transform (FFT) and angle functions
are provided as primitives.
To overcome these difficulties, we designed FIFTH to treat complex vectors and matrices
as single data units so that standard operations behave as expected and do not require control.
This is similar to the data model found in MATLAB. One reason that MATLAB is popular with
signal processing analysts who are not programmers is that variables do not have to be statically
declared, and functions automatically determine argument data types at run-time. Also,
MATLAB’s rich function set, together with run-time typing, makes the expression of many DSP
algorithms both shorter and easier to understand than the same algorithm written in C or C++.
Since FIFTH syntax is simpler than MATLAB syntax, similar programs are even more compact.
The following code is the equivalent implementation of the DPDT first stage in FIFTH.
Since FIFTH is a stack-based language, no intermediate storage variables are necessary.

    x ANGLE UNWRAP DIFF MAGNITUDE
From a GP perspective, there are a number of relevant observations. While the MATLAB
code is somewhat easier for a human to read, the FIFTH code is syntactically more compact
since it does not require any parenthetical grouping. Also, the MATLAB code demonstrates the
tradeoff between performing multiple function calls on a single line, which would be harder to
evolve in a syntactically correct manner, and splitting functions onto multiple lines, which re-
quires intermediate variables and the proper subsequent use of those variables.
4.3 The FIFTH language
The FIFTH programming language is a FORTH-like interpreted language that supports
vectors and two-dimensional arrays. The syntax for FIFTH is simple. A FIFTH program consists
of an input stream of tokens separated by white space. Tokens can either be words or numbers.
The design philosophy behind FIFTH is that all operators (FIFTH words) are expected to “do the
right thing” with whatever container types they find on the stack. This includes reshaping vectors
when required to match lengths and preserve dimensionality. It also includes gracefully handling
normal mathematical errors such as division by zero, typically by setting the result to 0. This ap-
proach has several benefits for genetic programming. First, it simplifies the syntax for complex
vector operations and reduces the number of structural words needed in the language, effectively
reducing the search space and simplifying genetic manipulation. Also, this approach increases
the viability of programs by reducing the number of run-time errors.
4.3.1 Parameter stack
All operators take their input data from a single stack and leave their results on this stack.
This is fundamentally different than FORTH or PUSH3, which use separate stacks for each data
type. The FIFTH parameter stack holds “containers” that have two principal attributes: type and
shape. Shape refers to the number of dimensions and the number of elements in each dimension
of the data in the container. Type refers to the data format and may be NUMERIC (with inherent
support for complex, real, and integer scalars and vectors), STRING (UTF-8), or ADDRESS (to sup-
port container stack addresses and word dictionary addresses). All functions assume row vectors
where vector elements progress in the x dimension.
Entry of vectors and matrices is supported by the “start vector” token represented by the
open curly brace. When the FIFTH interpreter encounters “start vector,” it creates an empty con-
tainer on the stack, then begins collecting numbers from the input stream and appending them to
this container until the “stop vector” token (the close curly brace) is encountered. Between the
“start vector” and “stop vector” tokens, the “next row” token (the vertical bar) can also appear. It
indicates the end of the previous vector and the beginning of the next vector. If necessary, vector
sizes are padded with zeros to achieve a uniform length. For example, the following line puts a
matrix on the stack consisting of two vectors with three complex elements each. The first vector
is automatically padded to match the length of the second vector.

    { 1+1i -2 | 4.0 5.1-2.2i 6 }
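The collect-and-pad behavior is easy to sketch in Python; parse_vector_literal is a hypothetical helper, and Python spells the imaginary unit with j rather than i:

```python
import numpy as np

def parse_vector_literal(tokens):
    """Hypothetical sketch of { ... | ... } parsing: collect numbers row by
    row, then zero-pad shorter rows so the matrix has uniform row length."""
    rows, current = [], []
    for tok in tokens:
        if tok == "|":        # "next row" token ends the previous vector
            rows.append(current)
            current = []
        else:
            current.append(complex(tok))
    rows.append(current)
    width = max(len(r) for r in rows)
    return np.array([r + [0] * (width - len(r)) for r in rows])

# The literal { 1+1i -2 | 4.0 5.1-2.2i 6 } from the text, in Python spelling
m = parse_vector_literal(["1+1j", "-2", "|", "4.0", "5.1-2.2j", "6"])
print(m.shape)   # (2, 3): the first row was padded with one zero
```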
Similarly the “quote” operator, followed by whitespace, causes the FIFTH interpreter to
collect tokens from the input stream until it finds the next double quote followed by whitespace.
All characters up to that point are put into a string container on the top of the stack. For example,
the following line puts a file name on the stack that is consumed by the function READWAV.

    " filename.wav" READWAV
4.3.2 Core vocabulary
Table 4.4 provides a list of the words currently implemented in FIFTH. The FIFTH core
vocabulary includes stack, arithmetic and logical operations as well as domain-specific functions
and control. When interpreting a token, FIFTH first searches its dictionary. If it finds a match,
the interpreter executes the word. Otherwise, the interpreter attempts to convert the token to a
number, place it in a container and push it onto the stack. If the token contains illegal characters
for numerical interpretation, the interpreter aborts with an error.
Table 4.4: Intrinsic FIFTH words
Category Word Set
Stack manipulation DEPTH DROP DUP OVER ROT SWAP
Container manipulation SHAPE HORZCAT VERTCAT
Basic math + - * / % ^
Basic logic GT GE LT LE NE 0EQ 0GT 0LT EQ AND OR NOT TRUE FALSE
Extended math SIN COS TAN ASIN ACOS ATAN EXP LN POW10 LOG10 SQRT CEILING FLOOR ROUND MAGNITUDE ANGLE DIFF REAL IMAGINARY PI
Statistics MEAN STDDEV VARIANCE
Vector manipulation ONES ZEROS RAMP VRAMP FLIP TRANSPOSE REVERSE LENGTH SETLENGTH NUMEL MIN MININDEX MAX MAXINDEX SUM PRODUCT UNFILLED { | } "
Branching IF ELSE THEN
Flow control BEGIN UNTIL
Memory CONSTANT VARIABLE @ ! +!
Compilation WORD ENDWORD EMPTY FORGET
Debug and Validation DATETIME EMPTY . .S .SS VER
File I/O LOAD READMAT READWAV WRITESTACK
Domain specific UNWRAP FFT FFTSHIFT IFFT FREQCOMP HILBERT MAGSQRD dBMAG WND_BMHARRIS WND_GAUSSIAN WND_HAMMING WND_HANNING
Run time control MAXALLOC SETMAXALLOC MAXEXEC SETMAXEXEC MAXSTACK SETMAXSTACK
4.3.3 Validating program execution
Each word in the dictionary has a stack signature that defines the number of containers
that the word requires on the data stack when it executes, as well as the change in the number of
items on the stack effected by the word. For example, MEAN has a stack signature that requires one
container on the stack before it can execute, and its stack delta is zero. That is, it consumes one
stack item but leaves one stack item so that the change in the number of items on the stack is
zero. The initial container on the stack may hold a single scalar (which is interpreted as a vector
of length one), a single vector, or multiple vectors. The resulting container will have a single sca-
lar value for each input vector representing the average of the values in that vector.
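This per-row behavior can be sketched with the stack modeled as a Python list of NumPy containers (fifth_mean is a hypothetical name, not part of the actual implementation):

```python
import numpy as np

def fifth_mean(stack):
    """Hypothetical sketch of MEAN: pop one container, push one container,
    so the net change (stack delta) in stack depth is zero."""
    c = np.atleast_2d(stack.pop())   # a scalar is treated as a length-1 vector
    stack.append(c.mean(axis=1))     # one scalar result per input row vector
    return stack

stack = [np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])]
fifth_mean(stack)
print(stack[-1])   # [2. 5.]: the average of each input vector
```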
4.3.4 Flow control and function definition
FIFTH uses a simple rule set to evaluate the contents of a container and derive a logical
true or false for branching or looping. If all values in the container are zero (this includes, for ex-
ample, an empty string) the check is considered false. If there are any non-zero or non-null
values, the container evaluates to true.
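The rule is simple enough to state directly in code (container_is_true is a hypothetical helper name used only for illustration):

```python
import numpy as np

def container_is_true(container):
    """Hypothetical sketch of the truth rule: false when every element is
    zero (an empty string included), true when any element is non-zero."""
    if isinstance(container, str):
        return len(container.strip("\0")) > 0
    return bool(np.any(np.asarray(container) != 0))

print(container_is_true(np.array([0, 0, 3])))   # True
print(container_is_true(np.array([0.0, 0.0])))  # False
print(container_is_true(""))                    # False
```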
FIFTH supports a single basic loop structure (BEGIN … UNTIL) and a single branching con-
struct (IF … ELSE … THEN), both patterned after the same words in FORTH. The word UNTIL
consumes a single container on the stack. If it evaluates to false, execution continues at the word
following UNTIL. If it evaluates to true, then execution unconditionally branches back to the
matching BEGIN. Similarly, IF consumes a single container from the stack. If the container evalu-
ates to true, then execution continues with the word immediately following the IF and proceeds
until it encounters the ELSE, at which point it unconditionally branches to the THEN. If the con-
tainer evaluates to false, execution skips the words between the IF and ELSE, continuing at the
first word immediately following the ELSE.
FIFTH also supports function generation in a natural way with WORD and ENDWORD. When
the FIFTH interpreter encounters WORD, the next token in the input stream is taken as the name of
a new word. The FIFTH interpreter compiles all subsequent tokens into the dictionary until it
encounters ENDWORD, which tells the FIFTH interpreter to complete the definition and to resume
interpretation.
4.3.5 Formal typing
FIFTH is a type safe language with formal type rules [15, 28]. Although the implementa-
tion does not yet take full advantage of it, the type system can provide information to the random
program generator and genetic manipulation operations to facilitate more aggressive program
constructs while maintaining syntactically and operationally correct programs. Filtering out in-
correct programs reduces the search space and improves run-time performance [23].
A selection of important typing rules is shown in Figure 4.5 through Figure 4.8. By con-
vention, an arbitrary FIFTH word is represented W, and a sequence of words by P. The type of
an individual slot on the stack is expressed with τ, σ, or ρ (with the top of the stack to the right),
and sequences of slots are represented with φ. The types of FIFTH words are expressed as func-
tions from initial stack typings to new stack typings. The type rules assume an environment of
mappings from declared words to their type; these are represented with Δ.
COMPOSE:
    Δ ⊢ W : φ1 → φ2        Δ ⊢ P : φ2 → φ3
    ------------------------------------------
    Δ ⊢ W P : φ1 → φ3

STACK:
    Δ ⊢ P : φ1 → φ2
    ------------------------
    Δ ⊢ P : θ·φ1 → θ·φ2

COMPILE:
    Δ, W : φ1 → φ2 ⊢ P1 : φ1 → φ2        Δ, W : φ1 → φ2 ⊢ P2 : φ3 → φ4
    ----------------------------------------------------------------------
    Δ ⊢ WORD W P1 ENDWORD P2 : φ3 → φ4

TAUT:
    ------------------------------
    Δ, W : φ1 → φ2 ⊢ W : φ1 → φ2
Figure 4.5: FIFTH basic type rules
Figure 4.5 shows the most important rules for producing coherent typings for entire pro-
grams. These include the COMPOSE, STACK, COMPILE, and TAUT rules. The COMPOSE rule
allows sequences of FIFTH words to be executed as long as each word's ending stack typing
matches the beginning stack typing of the next word. The STACK rule says that if a sequence has
a valid typing, additional slots can be added to the bottom of the stack as long as the top of the
stack matches the signature. The COMPILE rule allows new FIFTH words to be defined and
added to the dictionary. These new words can then be used in accordance with the TAUT rule.
IF:
    Δ ⊢ P1 : φ1 → φ2        Δ ⊢ P2 : φ1 → φ2
    --------------------------------------------
    Δ ⊢ IF P1 ELSE P2 THEN : φ1·n → φ2

LOOP:
    Δ ⊢ P : φ → φ·n
    ------------------------------
    Δ ⊢ BEGIN P UNTIL : φ → φ
Figure 4.6: FIFTH control flow type rules
The rules for control flow and stack manipulation are shown in Figure 4.6. The IF and
LOOP rules ensure that the stack is always in a consistent state at the end of these control con-
structs, regardless of the path that is taken through them.
Figure 4.7 shows the stack manipulation typing rules. These rules describe commonly
used stack manipulations that reorder, duplicate, or dispose of data. All of the stack manipulation
operators can be applied in any stack state in which there are enough elements on the stack, but
the final stack state is defined in terms of the initial stack state. Figure 4.8 shows typing rules for
a few representative mathematical operations. These demonstrate the ability to designate opera-
tions on a specific data type (where n is numeric).
    Δ ⊢ DEPTH : ∅ → n
    Δ ⊢ DROP : n → ∅
    Δ ⊢ DUP : τ → τ·τ
    Δ ⊢ OVER : τ·σ → τ·σ·τ
    Δ ⊢ ROT : τ·σ·ρ → σ·ρ·τ
    Δ ⊢ SWAP : τ·σ → σ·τ
Figure 4.7: FIFTH stack manipulation type rules
    Δ ⊢ * : n·n → n
    Δ ⊢ / : n·n → n
    Δ ⊢ GT : n·n → n
    Δ ⊢ FFT : n → n
    Δ ⊢ VRAMP : n → n·n
Figure 4.8: FIFTH type rules for selected operations
Figure 4.9 shows how the type rules can be used to derive the type n·n·n → n for the
FIFTH fragment “/ SWAP /” where “/” is the token for division. This means that if there are three
numeric values on the top of the stack when “/ SWAP /” begins to execute, then there will be
only one numeric value on the top of the stack when “/ SWAP /” finishes executing, and every-
thing on the stack below those values will remain unchanged. The rules for / and SWAP require no
prior assumptions and appear at the top of the derivation. The type variables σ and τ for SWAP are
instantiated as the real numeric type n. The COMPOSE rule is used twice, first to combine SWAP
with the last “/”, and again to combine the first “/” with the “SWAP /”. Before the second appli-
cation of the COMPOSE rule, the STACK rule had to be used to add an extra numeric value to the
signature of the left “/”, so that it matched the signature of “SWAP /”.
    SWAP : n·n → n·n        / : n·n → n
    ------------------------------------- COMPOSE
    SWAP / : n·n → n

    / : n·n → n
    ----------------- STACK
    / : n·n·n → n·n

    / : n·n·n → n·n        SWAP / : n·n → n
    ------------------------------------------ COMPOSE
    / SWAP / : n·n·n → n
Figure 4.9: Type derivation of “/ SWAP /”
A similar derivation can be written for any correctly typed FIFTH program allowing the
GP environment to automatically determine and validate stack usage. This extends to the defini-
tion of new words since they must be constructed of words that already exist in the dictionary.
The typing of individual fragments can also be used to determine if crossover or mutation candi-
dates have compatible stack signatures. For example, since the type n·n·n → n can also be
derived for “* *”, then “* *” can be substituted for “/ SWAP /” in all contexts.
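Ignoring element types and tracking only stack depths, the bookkeeping behind this derivation can be carried out mechanically. The following sketch represents a signature as a (consumed, produced) pair; compose plays the combined role of the COMPOSE and STACK rules:

```python
def compose(sig1, sig2):
    """Compose two depth-only stack signatures, each written as a pair
    (items consumed, items produced). The STACK rule shows up as the
    extra items borrowed from deeper on the stack when needed."""
    c1, p1 = sig1
    c2, p2 = sig2
    extra = max(0, c2 - p1)
    return (c1 + extra, p2 + max(0, p1 - c2))

SIG = {"/": (2, 1), "*": (2, 1), "SWAP": (2, 2)}

def signature(program):
    sig = (0, 0)
    for word in program.split():
        sig = compose(sig, SIG[word])
    return sig

print(signature("/ SWAP /"))   # (3, 1): three numbers in, one number out
print(signature("* *"))        # (3, 1): hence a legal substitute fragment
```

The full FIFTH signatures also carry type information, so this depth-only view is a simplification.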
4.3.6 Function validation
For any new language, it is essential that the mathematical operation of all functions be
validated as performing correctly. To facilitate this, we included two file operators, READMAT and
WRITESTACK, which read and write MATLAB revision 4 binary data files. WRITESTACK non-
destructively saves all containers on the stack into variables in the file named S00, S01, etc.,
where S00 contains the data from the top container on the stack. READMAT reads the file contents
into variables as named in the file. We selected the MATLAB format since MATLAB also han-
dles vectors natively, and it is capable of reading and writing vectors using multiple data types
including integer, real, and complex representations. These features allow us to implement all or
parts of an algorithm in both languages and exchange data for automatic comparison.
4.4 The FIFTH genetic programming environment (GPE5)
The genetic programming environment for FIFTH (GPE5) consists of three major com-
ponents, as illustrated in Figure 4.10. The GP5 component provides random program generation,
genetic manipulation, program interpretation and parsing, as well as an interactive FIFTH termi-
nal. The DEC5 (Distributed Evaluation Controller) component manages the distributed
evaluation of programs for each generation. One or more DPI5 (Distributed Program Interpreter)
components are required to run the programs against the problem data sets. Each DPI5 is man-
aged by the DEC5 through CORBA interfaces.
(Diagram components: Problem Configuration GPE5Config.xml; Problem Data Sets; GP5, Genetic Programmer in FIFTH; FIFTH Program Fragments; DEC5, Distributed Evaluation Controller; DPI5, Distributed Program Interpreter; Individual Program Results Gen####_Prog###.csv; Generation Results Gen####_Results.xml)
Figure 4.10: Block diagram for the FIFTH genetic programming environment (GPE5)
All of the configuration parameters for GPE5 are read from an XML file that conforms to
the GPE5 namespace schema defined in GPE5th.xsd. The expected name for the configuration
file is GPE5Config.xml.
4.4.1 Random program generation
The random program generator that is part of GP5 uses a two step approach to creating
syntactically correct programs with the desired output signature. After first generating a ran-
domly selected fraction of the maximum program size, it parses the new program fragment to
identify errors. The final output is the result of fixing all errors by inserting additional words and
terminals until all steps in the program have valid stack signatures, thus ensuring both syntactical
and operational correctness.
There are several important configuration parameters associated with random program
generation including the population size, minimum and maximum program sizes, and the desired
result stack signature. Also, program building blocks are specified by assigning probabilities of
selection to terminals, functions, and structures.
4.4.2 Fitness evaluation
Fitness evaluation for each generation is performed by the DPI5 components. Fitness is a
measure of how well a program has learned to predict the output(s) from the input(s). Generally,
continuous fitness functions are necessary to achieve good results with GP. While fitness can be
any continuous measure, GPE5 uses standardized forms where zero represents a perfect fit.
These include error, error squared, and relative error. Although relative error is not often encoun-
tered in the GP literature, it works well for problems where the acceptable deviation from the
actual output is proportional to the output value. GPE5 includes two unique fitness functions that
are applicable when the number of data sets is large. There are classes of problems for which a
calculated answer is deemed to be correct if it is within a specified tolerance (τ) of the actual
output. While a correct/incorrect assessment for a program output against a single data set is not
a continuous function, if the number of data sets is large, then the percentage of incorrect an-
swers begins to approximate a continuous fitness function.
Table 4.5 shows the equations implemented in GPE5 for calculating the fitness of a pro-
gram (fp) from the program results calculated for a data set (pi) and the known desired output
from a data set (oi). Any program that fails to execute properly is automatically assigned a fit-
ness value of infinity. Execution failure examples include infinite loops and allocating too much
memory.
Table 4.5: GPE5 fitness functions
Error Type                    Equation
Error                         f_p = (1/n) Σ_{i=1..n} |p_i − o_i|
Error Squared                 f_p = (1/n) Σ_{i=1..n} (p_i − o_i)²
Relative Error                f_p = (1/n) Σ_{i=1..n} |p_i − o_i| / |o_i|
Percent Incorrect             f_p = (100/n) Σ_{i=1..n} (|p_i − o_i| > τ)
Relative Percent Incorrect    f_p = (100/n) Σ_{i=1..n} (|p_i − o_i| / |o_i| > τ)
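The five measures of Table 4.5 can be sketched in a few lines of Python; fitness is a hypothetical helper and the sample data are invented for illustration:

```python
import numpy as np

def fitness(p, o, kind="error", tol=0.05):
    """Sketch of the Table 4.5 measures: p holds program results, o the known
    outputs, tol the tolerance tau. Zero always represents a perfect fit."""
    p, o = np.asarray(p, dtype=float), np.asarray(o, dtype=float)
    n = len(o)
    if kind == "error":
        return np.sum(np.abs(p - o)) / n
    if kind == "error_squared":
        return np.sum((p - o) ** 2) / n
    if kind == "relative_error":
        return np.sum(np.abs(p - o) / np.abs(o)) / n
    if kind == "percent_incorrect":
        return 100.0 * np.sum(np.abs(p - o) > tol) / n
    if kind == "relative_percent_incorrect":
        return 100.0 * np.sum(np.abs(p - o) / np.abs(o) > tol) / n
    raise ValueError(kind)

p, o = [1.0, 2.2, 3.0, 4.4], [1.0, 2.0, 3.0, 4.0]
print(fitness(p, o, "percent_incorrect"))  # 50.0: two of four exceed tau
```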
To avoid overfitting during program evolution and to avoid premature convergence to a
local minimum, GPE5 uses a different subset of the available data files for evaluating each genera-
tion. The number of files included in the subset is controlled by a configuration parameter. The
configuration also allows specific data files to be designated for required inclusion in the evalua-
tion set for each generation. These two options provide some of the attractive properties of the
bagging and boosting techniques in machine learning [22].
4.4.3 Parent Pool Selection
After evaluating the fitness of each program in a generation, the GP5 component must se-
lect the individual programs from generation n that comprise the pool of parents for generation
n+1. This requires sorting the entire population of generation n (containing λ individuals) by fit-
ness and selecting the μ most fit individuals as the parent pool, where μ ≤ λ. For this operation,
GP5 excludes all programs in generation n whose fitness value is infinite. In early generations,
this often means that the parent pool will be less than the desired μ value. For the equations in the
next section, μ is the actual number of parents in the parent pool.
4.4.4 Probability Ranking
Before performing the genetic manipulation operations, each individual in the parent pool
is assigned a probability of selection (pi) based on either a linear or exponential ranking as speci-
fied in the configuration file. The parent pool then consists of a list of programs ordered by
fitness as follows.
    i ∈ Ζ⁺, i ∈ [1, μ]: i = 1 (least fit), i = μ (most fit)
For linear ranking, the probability of selection for an individual parent is based on a lin-
ear relationship between the probability of the worst individual being selected (P⁰/μ) and the
probability of the best individual being selected (P¹/μ). The individual probability of selection
associated with each parent is:

    p_i = (1/μ) [ P⁰ + (P¹ − P⁰) (i − 1)/(μ − 1) ]        Constraint: P⁰ + P¹ = 2
For exponential ranking, the probability is computed using a bias constant (c). The intent
is to exponentially bias the ranking to favor more fit programs.
    p_i = (1 − c^(i/s)) / Σ_{j=1..μ} (1 − c^(j/s))        for c ≠ 1

    p_i = i / Σ_{j=1..μ} j                                for c = 1
When c < 1, the probability curve bends downward, favoring a few parents with the best
fitness. When c > 1, the probability curve bends upward, producing a more uniform bias toward
a larger group of the most fit parents. To avoid the equation anomaly at c = 1, GPE5 uses a line-
arly increasing probability. Finally, to avoid extremely sharp bends in the curve introduced by
large exponential values, GPE5 uses a scale value (s) so that s = 1 for μ < 50, and s = μ / 50 for μ
≥ 50. The effect of the bias constant on the curve shape is illustrated in Figure 4.11.
Figure 4.11: Effect of bias constant c on exponential ranking
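The ranking computation can be sketched as follows. The exponential weights here are reconstructed from the description above (normalizing 1 − c^(i/s) over the pool, with the scale s and the linear fallback at c = 1), so treat them as illustrative rather than definitive:

```python
import numpy as np

def selection_probabilities(mu, method="linear", p0=0.5, c=0.9):
    """Selection probability p_i for parents ranked i = 1 (least fit)
    through i = mu (most fit); mu is the actual parent pool size."""
    i = np.arange(1, mu + 1, dtype=float)
    if method == "linear":
        p1 = 2.0 - p0                     # constraint: P0 + P1 = 2
        return (p0 + (p1 - p0) * (i - 1) / (mu - 1)) / mu
    s = 1.0 if mu < 50 else mu / 50.0     # scale to soften large exponents
    if c == 1.0:
        return i / i.sum()                # linearly increasing probability
    w = 1.0 - c ** (i / s)                # reconstructed exponential weights
    return w / w.sum()

p = selection_probabilities(10, "linear", p0=0.5)
print(p[-1] > p[0])   # True: the most fit parent is favored
```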
4.4.5 Crossover and mutation
FIFTH supports crossover, mutation, and cloning as standard genetic manipulations. One
difficulty with mutation and crossover operations in genetic programming is the high probability
that the result will destroy the viability of the offspring because the program no longer executes
correctly after the operation. Tchernev [77] proposed that the destructive effects of crossover
could be mitigated in FORTH-like GP languages by requiring that the exchanged segments have
matching stack signatures. FIFTH follows this protocol by using a sophisticated stack signature
that includes depth tracing for stack manipulation, loops and branches. FIFTH mutation and
crossover operations allow a program fragment, which can vary in length from a single word to
multiple consecutive words, to be replaced by a fragment with a matching stack signature. While
this choice limits the search space, it increases the probability that the result of evolution will not
be destructive.
Since GPE5 uses a linear representation instead of a tree representation, crossover is
similar in function but different in form from the crossover described in section 4.1.3.6. The
process begins with the selection of two parents. In the first parent, instead of selecting a single
tree node, GP5 selects a random starting location in the program and a random length for the
fragment to be removed from the first parent. GP5 then characterizes the input-output effect for
the selected fragment to determine its signature. In the second parent, GP5 first locates multiple
fragments that are compatible with that signature, and then randomly selects one as the exchange
candidate. Swapping the two compatible fragments completes the operation.
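The fragment-exchange procedure can be sketched with a depth-only notion of signature, a simplification of FIFTH's full typed signatures; the word set and stack effects below are illustrative:

```python
import random

# Depth-only stack effects, (items consumed, items produced), for a few
# illustrative FIFTH words; the real system tracks full typed signatures.
SIG = {"x": (0, 1), "SIN": (1, 1), "FFT": (1, 1), "DIFF": (1, 1),
       "DUP": (1, 2), "*": (2, 1), "/": (2, 1), "SWAP": (2, 2)}

def fragment_signature(words):
    """Net stack effect of a word sequence."""
    consumed, produced = 0, 0
    for w in words:
        c, p = SIG[w]
        need = max(0, c - produced)   # items the fragment must find on entry
        consumed += need
        produced = produced - (c - need) + p
    return consumed, produced

def crossover(parent1, parent2, rng=random):
    """Cut a random fragment from parent1, then splice in a randomly chosen
    fragment of parent2 whose stack signature matches."""
    start = rng.randrange(len(parent1))
    length = rng.randrange(1, len(parent1) - start + 1)
    target = fragment_signature(parent1[start:start + length])
    matches = [(i, j) for i in range(len(parent2))
               for j in range(i + 1, len(parent2) + 1)
               if fragment_signature(parent2[i:j]) == target]
    if not matches:
        return None                   # no compatible fragment: no child
    i, j = rng.choice(matches)
    return parent1[:start] + parent2[i:j] + parent1[start + length:]

child = crossover("x SIN FFT DIFF".split(), "x DUP * SWAP /".split(),
                  random.Random(1))
print(" ".join(child))
```

Because the spliced fragment has the same net stack effect as the one removed, the child's overall signature matches its principal parent's.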
To demonstrate the important difference between linear crossover and tree crossover,
consider the following example taken from the genealogy of a successfully evolved symbol rate
algorithm presented in the next chapter. The linear program sequence is shown below with boxes
around the fragments that were selected in each parent. Both fragments have the same signature
requiring at least two items on the stack on entry and leaving the stack size unchanged on exit.
(The selected fragments are marked here with square brackets.)

Parent: Gen0019_Prog2412.5th
    x SIN FFT DIFF -2.0 * CEILING DIFF FFT x / FLOOR x REAL -2.0 MAGSQRD ROUND
    MAGSQRD + [ WND_HAMMING + x ] x REAL * 10.0 DIFF WND_HAMMING SIN IMAGINARY +
    DIFF -1.0 ANGLE * IMAGINARY ANGLE IMAGINARY WND_HAMMING CEILING MAGSQRD SQRT
    ANGLE - ROUND CEILING -2.0 /

Parent: Gen0019_Prog3907.5th
    x SIN FFT DIFF ROUND DIFF FFT x / FLOOR x - 2.0 MAGSQRD MAGSQRD REAL VRAMP
    FLOOR [ REVERSE RAMP DIFF ] * FLOOR REAL /

Child: Gen0020_Prog2845.5th
    x SIN FFT DIFF ROUND DIFF FFT x / FLOOR x - 2.0 MAGSQRD MAGSQRD REAL VRAMP
    FLOOR WND_HAMMING + x * FLOOR REAL /
Figure 4.12 illustrates this crossover using a tree-like representation. For clarity, only the
principal parent and child are shown. While the directed acyclic graphs are similar to the parse
trees shown in Figure 4.3, they contain an unusual branch caused by the operator VRAMP, which
peeks at the top item on the stack and produces a compatible ramp vector as a new item on the
stack (signature n → n·n). Note that after the crossover the resulting graph has changed struc-
ture. There is no equivalent binary tree crossover that would yield this result.
(Directed acyclic graphs of Parent Gen0019_Prog3907 and Child Gen0020_Prog2845)
Figure 4.12: FIFTH crossover example showing legal structural change
Mutation requires only one parent. GP5 performs the same initial fragment selection and
signature characterization that it does for crossover. The selected fragment is then replaced by a
randomly generated fragment that matches the signature, thus introducing new genetic material.
Using the signature ensures that the resulting modified program remains both syntactically and
operationally correct. This implies that the same structural changes that are possible during
crossover are also possible during mutation.
The GPE5 configuration contains several parameters that control the genetic operations.
Probabilities may be set separately for clone, mutation, and crossover. There are four different
length selection algorithms that may be specified independently for mutation and crossover.
4.4.6 Implementation
Our implementation of the FIFTH™ language and its associated genetic programming
environment is in C++ and is designed to be platform independent. Where operating system spe-
cific interaction is required, the ACE libraries provide an intermediate layer. CORBA
communication between the distributed components is based on the TAO libraries [65]. FIFTH
has been successfully compiled and run under Microsoft Windows XP, Linux, and Sun Solaris.
The design is object oriented and allows new base dictionary words to be implemented either in
FIFTH or as C++ functions. This feature is important for scientific applications where libraries
of optimized C++ code are available for matrix manipulation and domain specific algorithms that
are hard to debug and optimize. For example, we used the GNU Scientific Library [9] functions
for fast Fourier transform (FFT) and several other functions.
4.5 Using GPE5 to solve a problem
As with any GP environment, solving a problem using GPE5 requires a few preparatory
steps, most of which involve editing the XML configuration file from which all of the compo-
nents read their control information. The following descriptions are analogous to the preparatory
steps outlined by Koza [42].
4.5.1 Identify the terminal set
In GPE5, there are two categories of terminals: problem data inputs and ephemeral num-
bers. Problem data inputs may be scalars, vectors, or arrays. For each data set, the values are
stored as variables in MATLAB V4 .mat file format. The variable name(s) must be in the con-
figuration file for selection during random program generation. An inherent part of this
preparatory step is the identification and organization of the data set files. Each data set must be
in a separate .mat file. When read into the FIFTH program, each variable in the .mat file be-
comes a FIFTH CONSTANT.
The second category consists of numbers that may be selected during random program
generation. In the configuration file, groups of terminal tokens are assigned separate probabilities
of selection.
4.5.2 Identify the function set
The function set for GPE5 is divided into three categories controlled by the configuration
file. Groups of tokens within each category can be assigned an independent probability of selec-
tion. The first category consists of the subset of FIFTH words that are appropriate for the
problem under consideration. These may be math operations, logical comparisons, and stack ma-
nipulations, as well as domain specific functions such as filters and transforms.
The second category of functions, unique to GPE5, consists of user-defined FIFTH pro-
gram fragments. Each fragment is a valid FIFTH program stored in a file. If a fragment is
selected when randomly generating a program, the fragment code is inserted into the program so
that it is indistinguishable from the purely random material. In subsequent generations, the frag-
ments are subject to genetic manipulation just like the random parts of the program.
The third category describes the architectural and structural components of a program.
These include branch and loop constructs (IF, ELSE, THEN; BEGIN, UNTIL), automatically de-
fined words (bracketed by WORD and ENDWORD), and intermediate storage (VARIABLE, CONSTANT).
4.5.3 Select the control parameters
In addition to the ranking, fitness function, and genetic manipulation parameters already
discussed, the GPE5Config.xml file provides a mechanism for experimenting with several other
control parameters. Some of these affect the termination of the GP algorithm, including mini-
mum and maximum number of generations and the fitness target. Others control the operation of
the DPI5 components to work within available memory as they execute evolved programs. These
parameters include maximum execution stack depth, maximum number of words to execute (to
avoid endless loops in a program), and the maximum allowed memory allocation for a single
container.
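As an illustration of how these settings might be grouped, the fragment below sketches a possible configuration file. The element and attribute names are invented for this sketch and do not reflect the actual GPE5Config.xml schema.

```xml
<GPE5Config>
  <!-- Genetic operation probabilities (clone + mutation + crossover = 1.0) -->
  <Operations clone="0.05" mutation="0.50" crossover="0.45"/>
  <!-- Length selection: uniform | linear | gaussian | tree -->
  <LengthSelection mutation="uniform" crossover="uniform" min="1" max="4"/>
  <!-- Termination criteria -->
  <Generations min="10" max="50"/>
  <FitnessTarget value="0.005"/>
  <!-- DPI5 resource limits while executing evolved programs -->
  <Limits stackDepth="1000" maxWords="100000" maxContainerBytes="16777216"/>
</GPE5Config>
```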
4.6 Polynomial regression example problem
We selected polynomial regression for initial testing of the GPE5 to demonstrate its abil-
ity to solve standard GP test problems. For this example run, we used the following quadratic
equation:

y = 0.5x² + x/3 – 2.2
Pertinent configuration parameters were: population size = 1000, minimum program size = 15,
maximum program size = 150, parent pool size = 900, and exponential fitness ranking bias = 0.9.
The genetic operations included cloning, mutation, and crossover with respective probabilities of
0.05, 0.50 and 0.45. Both mutation and crossover used a uniform length selection of 1 to 4 pro-
gram tokens. Table 4.6 shows the terminal values and functions configured for use by the
random program generator.
Table 4.6: Word selection probabilities for the polynomial regression example
P(selection)   Terminals and Functions in the set
0.20           x (data input)
0.30           1 2 3 4 5 -1 -2 -3 -4 -5
0.40           * + - /
0.10           DUP SWAP OVER ROT
The following FIFTH solution is a reasonable human-generated program using 17 tokens:

1.0 2.0 / x * x * x 3.0 / + 1.0 5.0 / 2.0 + -
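To make the postfix semantics concrete, the following minimal Python sketch evaluates this 17-token program. It is not the actual FIFTH interpreter; the evaluator and its names are ours, and it handles only the scalar subset used here.

```python
def eval_postfix(tokens, x):
    """Minimal sketch of scalar FIFTH-style postfix evaluation (not the real
    interpreter): binary operators pop two values, b from the top, then a."""
    ops = {"*": lambda a, b: a * b, "+": lambda a, b: a + b,
           "-": lambda a, b: a - b, "/": lambda a, b: a / b}
    stack = []
    for tok in tokens.split():
        if tok in ops:
            b = stack.pop()
            a = stack.pop()
            stack.append(ops[tok](a, b))
        elif tok == "x":
            stack.append(x)           # the problem data input
        else:
            stack.append(float(tok))  # an ephemeral numeric terminal
    return stack.pop()

program = "1.0 2.0 / x * x * x 3.0 / + 1.0 5.0 / 2.0 + -"
y = eval_postfix(program, 2.0)  # 0.5*4 + 2/3 - 2.2
```

Tracing the program left to right shows why it matches the target polynomial: 1.0 2.0 / leaves 0.5, the two multiplications by x build 0.5x², x 3.0 / + adds x/3, and 1.0 5.0 / 2.0 + builds the constant 2.2 that the final - subtracts.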
Figure 4.13 shows the evolutionary progress of the example FIFTH run. The solid line
shows the best fitness at each generation. The dotted line shows the fitness of the best ancestor in
each generation for the solution program in generation 33 that met the fitness termination crite-
ria. This run was performed on three dual-core Pentium class machines with at least 1GB of
RAM running Windows XP. The machines were connected via a 100 Mbps LAN. Each machine
ran two instances of the DPI5, completing the run in just under four minutes.
The evolved solution is longer (43 tokens), but it is a correct general solution.

x x * 1.0 + 1.0 x -2.0 1.0 / - 5.0 - -3.0 -3.0 / -3.0 + -4.0 - -5.0 / -5.0 + -3.0 / x + -3.0 / OVER - -4.0 / - - - 2.0 / -4.0 - -5.0 +
Multiple trial runs with a variety of polynomials confirmed that GPE5 worked as ex-
pected for this typical GP problem class. While an interesting problem, polynomial regression
does not take advantage of a principal new feature of FIFTH: vector handling. The next chapter
describes the application of FIFTH and GPE5 to the symbol rate problem, which clearly benefits
from FIFTH’s native vector handling.
Figure 4.13: Best fitness progression for a polynomial regression GP run
Chapter 5 Automatic Generation of Symbol Rate Algorithms
5.1 Preparing the GP run
5.1.1 Problem formulation
For this experiment, we evolved only the first stage (feature extraction) of the symbol
rate algorithm. As described in section 3.2, we needed a second stage to convert the feature vec-
tor into a symbol rate. Since the configuration file allowed us to define post processing code, we
used a FIFTH second stage as follows, where x is the input signal vector and Fs is the sample
rate in Hertz.

FFT MAGNITUDE LENGTH 2.0 / SETLENGTH
VRAMP Fs 2 / * 10.0 GT * MAXINDEX
x LENGTH SWAP DROP Fs SWAP / *
The first line expects the feature vector on the data stack. It executes a Fourier transform
to develop the periodicity of the feature vector, preserves only the magnitude of the result, and
then deletes the redundant second half of the result. The second line finds the index of the high-
est spectral energy above 10 Hz, and the third line uses the sampling frequency to convert the
index into a symbol rate leaving a single number on the stack. For comparison, the following is
an equivalent MATLAB function where xf is the feature vector. The FIFTH lines are interleaved
as comments.

function baud = pickpeak( xf, Fs )
% FFT MAGNITUDE LENGTH 2.0 / SETLENGTH
XF = abs(fft(xf));
n = length(XF) / 2;
XF = XF(1:n);
% VRAMP Fs 2 / * 10.0 GT * MAXINDEX
mask = [0:n-1]/(n-1) * (Fs/2);
mask = mask > 10.0;
XF = XF(:) .* mask(:);
[c,idx] = max( XF );
% x LENGTH SWAP DROP Fs SWAP / *
baud = idx * (Fs/(n*2));
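The same second stage can also be re-sketched in plain Python. The function name, the naive O(n²) DFT, and the 0-based bin indexing below are ours, purely for illustration of the peak-picking idea.

```python
import cmath
import math

def pick_peak(xf, fs):
    """Sketch of the second stage: DFT the feature vector, ignore bins at or
    below 10 Hz, and convert the strongest remaining bin to a symbol rate."""
    n = len(xf)
    half = n // 2                                # keep the non-redundant half
    mags = [abs(sum(xf[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(half)]                # naive O(n^2) DFT magnitude
    best_k = max((k for k in range(half) if k * fs / n > 10.0),
                 key=lambda k: mags[k])
    return best_k * fs / n                       # bin index -> symbols/second

# A feature vector oscillating at 500 Hz sampled at 8000 Hz should yield
# an estimate of 500 symbols per second.
fs = 8000.0
xf = [math.cos(2 * math.pi * 500 * t / fs) for t in range(256)]
rate = pick_peak(xf, fs)  # -> 500.0
```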
5.1.2 Terminal set and function set
With the signal property ranges outlined in Table 5.1, we used ALEF to generate a train-
ing set consisting of 3680 signal files including five random symbol variations for each
combination of signal properties, a fixed 8000 Hz sample rate, and a fixed signal length of 16384
samples. All FSK signals used continuous phase transitions between symbols.
Table 5.1: Properties of the training set signals used for the symbol rate experiments
Property                      Typical values in HF                               Values used
Modulation                    FSK, MSK, PSK, DPSK, OQPSK, ASK                    FSK, PSK
Pulse shape                   None, raised cosine (RC), root RC (RRC), Gaussian  None, RC
Excess bandwidth (rolloff)    Limit: 0.00 to <1.00; Typical: 0.10 to 0.35        0, 0.1, 0.2, 0.35
Mod Index                     0.5 to 3                                           0.1
Symbol rate                   Typical: 10 to 2400 symbols per second             13 values in [50, 2400]
Symbol states                 2, 4, 8                                            2, 4, 8
Signal to noise ratio (SNR)   Practical range: 0 to 60 dB                        9, 12, 16, Inf
As shown in Table 5.2, the terminal values included a few numerical constants in addition to the
signal vector (x). The table also shows the FIFTH functions that were available to the random
program generator. They include standard math, stack manipulation, and several vector based
operations. The probabilities of selection shown in the table resulted in the successful algorithm
presented in section 5.2.
Table 5.2: Word selection probabilities for the symbol rate example
P(selection) Terminals and functions
0.10 x (signal vector)
0.05 1.0 2.0 3.0 4.0 5.0 10.0 -1.0 -2.0
0.10 DUP SWAP ROT OVER
0.10 * + - /
0.65 REAL IMAGINARY ANGLE UNWRAP FFT MAGNITUDE DIFF SQRT RAMP VRAMP MAGSQRD COS SIN FLOOR ROUND WND_HAMMING REVERSE CEILING
5.1.3 Fitness evaluation strategy
We set the evaluation configuration to use a subset strategy and to require several signal
files. Each generation was evaluated against 50 signals from the training set. Nine of these were
required files, and 41 were selected at random. The required files included four with PSK modu-
lation at symbol rates and rolloff values known to cause problems with the DPDT algorithm. The
other three were FSK modulated signals with a low, middle and high symbol rate. For this run,
we selected the relative error fitness function with a target of 0.5% error.
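The subset evaluation strategy can be sketched as follows. The function and field names are our own illustration, not ALEF code, and the aggregation (mean relative error, lower is better) is one plausible reading of the relative error fitness function.

```python
import random

def evaluate(program, training_set, required, rng, subset_size=50):
    # Sketch of the subset strategy: score a candidate against the required
    # files plus a random draw from the rest, using mean relative
    # symbol-rate error as the fitness (lower is better).
    pool = [s for s in training_set if s not in required]
    subset = required + rng.sample(pool, subset_size - len(required))
    errors = [abs(program(sig["x"]) - sig["baud"]) / sig["baud"]
              for sig in subset]
    return sum(errors) / len(errors)

# Toy data: 60 "signal files", the first nine of which are always evaluated.
training = [{"x": i, "baud": 100.0 * (i + 1)} for i in range(60)]
required = training[:9]
perfect = lambda x: 100.0 * (x + 1)   # a candidate that is always right
fitness = evaluate(perfect, training, required, random.Random(1))  # -> 0.0
```

Drawing a fresh random subset each generation keeps evaluation cheap while the required files guarantee that known hard cases are always exercised.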
5.1.4 Control parameters
The control configuration parameters for this experiment were set to include 4000 pro-
grams per generation, minimum program size of 5 tokens, maximum program size of 75 tokens,
parent pool size of 3500 programs, and exponential fitness ranking with a 0.9 bias. The genetic
operations included probabilities 0.8 for mutation and 0.2 for crossover. Both genetic operations
used a linear probability length selection between 1 and 10 program tokens.
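Since the exact weighting behind the 0.9 ranking bias is not spelled out here, the following sketch shows one plausible reading of exponential fitness ranking; the scheme and names are our guess, not the GPE5 implementation.

```python
import random

def exponential_rank_weights(n, bias=0.9):
    """One plausible exponential ranking: weight the i-th best program by
    bias**i, so selection pressure decays geometrically down the pool."""
    return [bias ** i for i in range(n)]

# Toy parent pool of (name, fitness) pairs; lower fitness is better.
pool = sorted([("p1", 0.4), ("p2", 0.1), ("p3", 0.9)], key=lambda p: p[1])
weights = exponential_rank_weights(len(pool))
rng = random.Random(0)
parent = rng.choices(pool, weights=weights, k=1)[0]
```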
5.2 A successfully evolved algorithm
5.2.1 Baseline performance of DPDT
As a performance comparison baseline, we used a FIFTH implementation of the DPDT
first stage, shown below, and ran it against the training set.

x ANGLE UNWRAP DIFF MAGNITUDE
Defining a correct answer as one with less than 1% error, this implementation of DPDT achieved
only 66.8% correct estimations. When the set was restricted to only PSK modulation (the domain
in which DPDT should work well) it achieved only 78% correct estimations.
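For reference, the DPDT first stage above can be re-sketched in Python. The function name and the toy test signal are ours, not part of ALEF; the body mirrors the FIFTH phrase token by token.

```python
import cmath
import math

def dpdt_features(x):
    """Python re-sketch of the FIFTH phrase: x ANGLE UNWRAP DIFF MAGNITUDE."""
    phase = [cmath.phase(s) for s in x]          # ANGLE
    unwrapped = [phase[0]]                       # UNWRAP: remove 2*pi jumps
    for p in phase[1:]:
        d = p - unwrapped[-1]
        d -= 2 * math.pi * round(d / (2 * math.pi))
        unwrapped.append(unwrapped[-1] + d)
    # DIFF then MAGNITUDE: spikes mark abrupt phase (symbol) transitions.
    return [abs(b - a) for a, b in zip(unwrapped, unwrapped[1:])]

# Toy signal: constant phase increment of 0.1 rad/sample with one extra
# 1.5 rad jump at sample 50; the feature vector spikes at the jump.
x = [cmath.exp(1j * (0.1 * t + (1.5 if t >= 50 else 0.0))) for t in range(100)]
feats = dpdt_features(x)
```

The feature vector is flat (0.1) everywhere except at the phase discontinuity, which is exactly the periodic spike train that the second-stage FFT turns into a symbol-rate peak.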
5.2.2 Evolution results
This example was run on a Sun Fire E2900 with 96 GB RAM and 12 dual core 1.5 GHz
processors running Solaris 10. With twenty DPI5 processes, the run completed in just over 19
hours. Figure 5.1 shows the progression of the run by plotting the fitness of the best program in
each generation. Note that the fitness scale is logarithmic. In generation 24, program 1333
achieved a sufficiently low fitness, and because no better program appeared in the next 10
generations, the run terminated. The dotted line in Figure 5.1 shows the ancestry of the best program
by plotting the best fitness of its direct ancestors.
Figure 5.1: Best fitness progression for a symbol rate GP run
The resulting best program (hereafter referred to as P1333) is very different from any
published algorithm encountered during the course of this research.

x SIN FFT DIFF ROUND DIFF FFT x / FLOOR x -
2.0 MAGSQRD MAGSQRD REAL VRAMP WND_HAMMING + x * FLOOR REAL /

The second line produces the constant value seventeen, so the algorithm can be written as:

x SIN FFT DIFF ROUND DIFF FFT x / FLOOR x - 17 x * FLOOR REAL /
When run against the entire training set, P1333 achieves an impressive 97.7% correct es-
timation. Table 5.3 compares the distribution of the correct answers by symbol rate (baud) and
modulation type for both DPDT and P1333.
To validate the performance of P1333, we ran it against a subset of the signal files used
with the wavelet symbol rate algorithm (see Table 3.2). Since these signals were not used during
the evolutionary process, they constituted a reasonable test set. The test set included two new
property values that were not present in the training set: root raised cosine (RRC) pulse shape
and phase discontinuous FSK symbol transitions. Using all five symbol rates available in that set,
and including all ten random symbol variations for each combination of property values, the se-
lected test set consisted of 6250 signal files.
Table 5.3: Percent correct performance by symbol rate value against the entire training set
         DPDT             P1333
Baud     FSK     PSK      FSK     PSK
50       0.0     45.4     78.3    87.9
75       0.0     59.6     83.3    96.3
96       0.0     65.8     85.0    99.6
100      0.0     72.1     98.3    99.6
125      1.7     72.9     96.7    100.0
200      1.7     75.0     91.7    100.0
300      0.0     75.0     93.3    100.0
600      12.5    100.0    100.0   100.0
800      0.0     75.0     100.0   100.0
1280     0.0     75.0     100.0   100.0
1500     75.0    100.0    100.0   100.0
1800     5.0     100.0    100.0   100.0
2400     NA      100.0    NA      100.0
As seen in Table 5.4, the performance of both algorithms was similar for both the training
and test sets. When we expanded the FSK signal properties to include other modulation indexes,
neither DPDT nor P1333 performed well against the new FSK signals. We are planning addi-
tional experiments with an augmented training set to evolve an even more general algorithm.
Table 5.4: Percent correct performance by symbol rate value against the test set
         DPDT             P1333
Baud     FSK     PSK      FSK     PSK
50       0.3     68.0     74.7    92.8
100      0.0     82.2     99.0    100.0
300      0.0     85.7     89.7    100.0
1280     2.0     85.7     100.0   100.0
2400     NA      100.0    NA      100.0
5.2.3 P1333 algorithm structure
One of the known difficulties with GP is its tendency to produce obscure code that may
contain sequences that have little effect on the code’s overall performance. For example, the se-
quence in the original P1333 (2.0 MAGSQRD MAGSQRD REAL VRAMP WND_HAMMING +) is certainly an
unusual way to create the constant value 17. To determine how the performance of P1333 might
be affected by simple changes, we ran a series of six algorithm variations against the training set.
A slight improvement occurred with the constant value set to 25. Otherwise, all other changes
were destructive.
Table 5.5: Performance of variations on algorithm P1333
Variant    Stage One Code                                                     % Correct
Original   x SIN FFT DIFF ROUND DIFF FFT x / FLOOR x - 17 x * FLOOR REAL /    97.7
A          x SIN FFT DIFF ROUND DIFF FFT x / FLOOR x - 10 x * FLOOR REAL /    95.0
B          x SIN FFT DIFF ROUND DIFF FFT x / FLOOR x - 25 x * FLOOR REAL /    97.8
C          x SIN FFT DIFF ROUND DIFF FFT x / FLOOR x - 100 x * FLOOR REAL /   94.9
D          x FFT DIFF DIFF FFT x / FLOOR x - 100 x * FLOOR REAL /             62.9
E          x SIN FFT DIFF DIFF FFT x / FLOOR x - 100 x * FLOOR REAL /         69.9
F          x FFT DIFF ROUND DIFF FFT x / FLOOR x - 100 x * FLOOR REAL /       94.8
5.2.4 P1333 algorithm analysis
Insight into the operation of P1333 can be gained by comparing its intermediate results
with those of DPDT. The two intermediate points of interest are the stage one feature vector and
the stage two vector just after the Fourier transform.
Figure 5.2 shows the feature vectors and stage two vectors for a signal that both algo-
rithms estimated correctly. The signal is two level PSK modulated at 50 baud with raised cosine
pulse shape, rolloff of 0.1, and SNR of 16. In the feature plots on the left, the symbol transitions
are clearly seen as large spikes, but the spikes in P1333 are several orders of magnitude greater
than for DPDT. The plots to the right show the stage two vectors with the x-axis scaled to repre-
sent the symbol rate. While the largest peak for both algorithms is located at the correct 50
symbols per second, the P1333 peak is several orders of magnitude larger (4x10^7 versus 1x10^3).
Figure 5.3 shows the equivalent analysis for a signal with the same properties (but differ-
ent symbol sequence) in which DPDT was incorrect and P1333 was correct. DPDT again shows
strong peaks at each symbol transition, but the peaks do not line up as well with the P1333 peaks.
DPDT exhibits a harmonic peak at 100 baud that is larger than the 50 baud peak. Even though
the harmonic is present in P1333, it is lower in magnitude than the 50 baud dominant peak.
Figure 5.2: Feature vector and FFT vector for P1333 and DPDT, both correct
Figure 5.3: Feature vector and FFT vector for P1333 correct, DPDT incorrect
There were a few signals that both P1333 and DPDT missed. Figure 5.4 shows another
two level PSK modulated signal at 50 baud, but with no pulse shaping and a SNR of 12. The
DPDT maximum peak occurs at the harmonic 150 baud value, while the P1333 maximum peak
occurs at a non-harmonic value of 14 baud. This error could have been easily corrected by in-
creasing the minimum acceptable symbol rate to 15 baud, in which case the maximum peak for
P1333 would have occurred at the correct 50 baud value. After examining multiple signals in this
way, we conclude that P1333 produces fewer and smaller harmonic peaks than does DPDT.
Figure 5.4: Feature vector and FFT vector for P1333 and DPDT, both incorrect
5.3 The efficacy of genetic programming
The results of this experiment clearly demonstrate that the genetically evolved P1333 al-
gorithm performs significantly better than the commonly used DPDT algorithm over the selected
range of signal properties. Based on this and other experiments, we conclude that our FIFTH ge-
netic programming environment is a viable technique for addressing real-world vector space
problems. The production of P1333 as a viable algorithm was not a fluke. Many of our explora-
tory runs produced potentially useful algorithms for the symbol rate problem.
Chapter 6 Discussion and Future Work
During the course of this dissertation we have explored the creative process required to
develop signal processing algorithms. The material that we selected for inclusion thus far repre-
sents a coherent thread through a body of work that stretches back over five years. As with any
journey of this duration, there were a number of significant milestones along the way, and the
effort is far from over. In this chapter, we present additional discussion, lessons learned, and
real-world impact associated with three of the most significant milestones: the signal exploitation
markup language schema (SIGEXML), the algorithm evaluation framework (ALEF), and the
genetic programming environment for FIFTH (GPE5). While the specific context continues to be
communication signal recognition in the HF band, the principles apply to any discipline that
processes time series data, making these conclusions pertinent to a much wider audience.
6.1 SIGEXML: the Signal Exploitation Markup Language schema
Practitioners in the field of signal intelligence understand that fully characterizing an al-
gorithm for use in a real-world environment is a difficult task, but one that must be undertaken if
the deployed system is to be successful. The intent is not to require that an algorithm work under
all conditions, but rather that the specific conditions under which the algorithm does work are
well defined, detectable, and controllable. This is sadly ignored in many published algorithms,
making them look attractive on paper but resulting in little practical application.
A necessary supporting element in the signal analysis process is an expressive method for
organizing and conveying the answers to key questions: What are we looking for? What did we
find? Where was the emitter located? How was the data collected? What conditions were present
that might affect the data? The existing description techniques introduced in section 2.2 lacked
sufficient clarity in one or more of these areas, and their organization precluded reasonable ex-
pansion. We also reviewed an unpublished signal description schema under development by a
knowledgeable group. However, their work produces the schema as an automatically derived ar-
tifact from a Unified Modeling Language (UML) based model of the signal domain. In our
opinion, the resulting schema was overly complicated, insufficiently detailed, and not appropriately
organized to accomplish our goal of a language that was tractable for both human and ma-
chine consumption. This was the primary impetus for developing SIGEXML.
Stating that we underestimated the complexity of this task does not begin to convey the
amount of work expended in this effort. After several hundred man-hours of meetings, discus-
sions, and reviews, the first revision of SIGEXML took form in a hierarchy of 10 files containing
over 4000 lines of XML schema code. The resulting work was subsequently adopted for use in a
number of large operational systems and has proven very effective. As with all first revisions, we
discovered a number of areas needing improvement, and SIGEXML recently underwent a sec-
ond round of design changes to address some of these issues.
While there are many factors that contributed to the success of SIGEXML, two deserve
particular attention. First, it represents a cooperative melding of the knowledge of several experts
in communication signal processing with the computer science related architectural and organ-
izational skills of the author. This combination of information content with functional
organization sets SIGEXML apart from similar descriptive systems. Second, although we elected
to develop directly in XML schema language instead of using a higher level form of expression
such as UML, this approach succeeded only because we presented the schema to the domain ex-
perts using the visualization format rendered by the product XMLSpy (e.g., see Figure C.2). This
allowed the domain experts to contribute effectively without having to learn all of the intricacies
of XML schema while the underlying language easily and fully supported the computer science
goals such as ensuring data integrity and providing for automated data validation.
6.2 ALEF: the automated Algorithm Evaluation Framework
As we progressed from characterizing the problem domain to working with algorithms
within the domain, it became obvious that tuning an algorithm to provide optimal performance
requires an understanding of all of the algorithm’s parameters, many of which may be hidden or
overlooked in typical publications. These hidden but important factors range from simple items
such as a sample rate to entire related algorithms such as peak search.
It is well understood that an automated testing environment can significantly contribute to
characterizing an algorithm, comparing competing algorithms, or developing a new algorithm.
However, selecting or constructing automation tools requires difficult compromises between the
competing goals of specificity and generality. A good framework provides a balanced approach.
A framework is a set of libraries, software modules, or other tools that form a structure within
which a software project can be organized and developed. By selecting MATLAB to construct
our framework, we inherently gained the general capabilities of a powerful programming and
scripting language. By implementing a series of small functions that build from general opera-
tions to specific applications, we created complete tools that can be easily used for any signal
processing domain as well as functional components that can be easily modified for closely re-
lated domains. The resulting ALEF framework consists of over 8000 lines of MATLAB code in
over 150 files.
In this dissertation we demonstrated just a few of the capabilities of ALEF for character-
izing and developing algorithms. However, the automation only partly mitigates the complexity.
There are virtually unlimited ways to construct algorithms using a two stage approach of feature
extraction followed by classification. Realizing that manual exploration could never cover more
than a small fraction of this limitless landscape, we tackled the problem of automating the gen-
eration of the algorithms themselves.
6.3 FIFTH: a new look at genetic programming
To accomplish the automation, we designed a new genetic programming language,
FIFTH, with intrinsic vector handling capability, and we implemented a completely distributed
GP environment to support it. The source is comprised of over 100 C++ files containing almost
25,000 lines of code. The simple, object-oriented design of both the language implementation
and the GP environment has proven to be easily extensible and maintainable.
There are aspects of the language and environment that seemed advantageous during de-
sign but which we truly appreciate after working with FIFTH to solve real problems. FIFTH
programs appear in text format and require no parsing beyond string tokenization. Although the
language is interpreted, the execution speed is reasonable due to the efficient implementation of
the dictionary and interpreter. The FIFTH runtime system has a small footprint, as do FIFTH
programs themselves, resulting in a system that runs well even on networks of older hardware
with limited memory. Another unique feature among GP environments is the interactive console
interpreter, which provides a convenient exploration environment for FIFTH programs whether
written manually by a human or automatically by the GPE.
The design philosophy underpinning FIFTH does not mirror the choices that have been
made for most genetic programming languages and environments. Banzhaf [12] and others ad-
vocate using function sets and data representation that closely resemble the structure of the
underlying CPU hardware. They reason that this provides a simple structure with significant exe-
cution speed advantages. We see a disadvantage in this approach: the resulting
programs are almost incomprehensible to humans without significant additional effort.
In addition, for many applications in the feature recognition domain, the performance
bottleneck occurs in the execution of matrix manipulation operations such as finding eigenval-
ues, performing convolutions and filtering, or calculating FFTs. Optimized libraries are available
for these types of operations. It is highly unlikely that a genetic programming system would ever
evolve either optimized or unoptimized versions of such complicated operations. By starting
with primitives that reflect the best practices of the domain and using the genetic programming
environment as the glue, it seems reasonable that we are more likely to end up with a working
program. In other words, instead of trying to evolve a horse from an amoeba, we try to breed a
better horse from a good bloodline. The former process is an example of general evolution, while
the latter is an example of special evolution [34].
The details of data handling and support are also critical. Recent publications using stack-
based representations have attempted to expand the basic data types available in a GPL by add-
ing separate stacks for each data type. This appears to be an extension of the direction taken by
early stack based languages such as FORTH. However, a separate stack design requires adding
not only a full set of data manipulation functions for each stack but also functions for data trans-
fer between stacks. As previously indicated, this choice adversely expands the dimensionality of
the program search space.
The key feature that clearly sets FIFTH apart from other GP languages is its ability to
natively handle a vector as an intrinsic unit of data. Not only does this facilitate the inclusion of
domain specific primitives (in this case, functions such as FFT and windowing) as indivisible
operators, it makes possible the expression of complicated serial transforms in a simple form that
is devoid of control operations based on vector lengths.
While our experiments have vindicated our approach and have demonstrated that the
automated development of human-competitive vector-based algorithms to solve real-world prob-
lems is possible, they have also highlighted the complexity associated with accomplishing this
task. Consistent with other published GP environments, only a fraction of our runs produced vi-
able results. The effectiveness of the search depended on the inclusion and the selection
probabilities of the operators needed to solve the problem. We also found that the closure proper-
ties of all operators had to be carefully considered. For example, when two vectors of unequal
size (or a vector and a matrix) are added, what should the footprint of the result be? Decisions
about these details made a strong impact on the results that were generated.
Even with these difficulties, this work is highly significant. It compellingly demonstrates
the utility of the GP approach to produce vector algorithms for feature extraction, which is the
first stage in many problem domains where large volumes of data are naturally expressed as vec-
tors. Without native vectors, it is difficult to formulate these problems in a way that GP analysis
can directly process the raw data. FIFTH solves this problem with a simple and easily manipu-
lated syntax, making it an attractive vehicle for GP research.
6.4 The path forward
Although our experiments have produced some excellent results, we have barely touched
the potential of this new language and its genetic programming environment. Additional research
and development are likely to yield important advances. For example, we have already identified
several areas in both the FIFTH language and the GPE5 infrastructure that need work.
• The language support for automatically defined functions, variables, branching, and
flow control has been implemented in FIFTH. However, since we were able to dem-
onstrate the viability of automatic algorithm discovery without using these constructs,
the genetic manipulation of programs containing these elements has not been com-
pletely tested.
• There are a number of standard transforms and vector operations that could prove
useful if added to FIFTH, including inner and outer vector products, matrix determi-
nant and inversion, wavelets, and so forth.
• We ran a small number of trials to determine which of the four available length selec-
tion algorithms produced the best results for our symbolic regression problem.
Qualitatively, we believe that more viable solutions emerged when using the linear
probability technique than with the uniform, Gaussian, or tree emulation selection
techniques. While this appeared to apply to the symbol rate problem as well, we have
yet to run the number of trials necessary to quantify the observation. Length selection
is just one of many GPE5 configuration parameters that affect the evolutionary proc-
ess but that have not been fully characterized.
• Qualitative observation of the elapsed time of an experiment as a function of the
number of DPI5 processes involved indicated an almost linear relationship. However,
due to available resource limitations, we have only tested GPE5 with up to 50 proces-
sors. We shortly expect to expand this to several hundred processors.
Furthermore, it is important that FIFTH be applied to a broader variety of vector-based
problems. In the domain of communication signals this would include signal modulation classifi-
cation, specific emitter recognition, signal detection and location in a wideband data stream, and
other classical problems. Beyond communication, there are countless disciplines that use time
series data where FIFTH might be able to provide creative insight into difficult problems.
To facilitate this expansion goal, we want to make ALEF and FIFTH available to other
researchers. We are preparing two packages that will be available from Southwest Research In-
stitute at no charge under an academic use license.
The ALEF package will include:
• The ALEF MATLAB files shown in Appendix D,
• Automatically generated documentation in HTML format,
• The pre-built library of signal files used for symbol rate algorithm comparison, and
• Example MATLAB scripts using the ALEF functions.
The GPE5 package will include:
• The complete C++ source for FIFTH and GPE5,
• MATLAB files for viewing runs and analyzing results,
• Automatically generated documentation in HTML format,
• The pre-built library of files for the example symbolic regression function,
• The pre-built library of signal files used for the symbol rate algorithm development, and
• A pre-built set of binary files to run GPE5 on Windows XP.
Please email the author ([email protected]) for availability and licensing details.
Appendix A
Glossary and Acronyms
ALE Automatic Link Establishment (Fed Std 1045)
AM Amplitude Modulation – analog modulation in which the amplitude of the carrier wave is varied
Baud Rate A measure of the modulation symbol rate, proportional to the message data rate
BPSK Bi-Phase Shift Keying – a digital modulation technique
BW Bandwidth – width of the frequency spectrum occupied by a signal, measured in Hz
CBNRC Communications Branch, National Research Council
CDF Canadian Defense Forces
COMINT Communications Intelligence
CW Carrier Wave or Continuous Wave – a continuous RF radio wave
dB Decibels – a logarithmic unit used to measure differences in power
DF Direction Finding or Direction Finder
DFSK Dual Frequency Shift Keying – digital modulation technique that uses four specific frequencies to encode a signal
DND Department of National Defense (Canada)
DoS Department of State
DSP Digital Signal Processing
ELINT Electronic Intelligence (radars)
FDM Frequency Division Multiplex
FDMA Frequency Division Multiple Access
FFT Fast Fourier Transform
FM Frequency Modulation – analog modulation that varies the frequency of the carrier wave
FSK Frequency Shift Key(ing) – digital modulation method in which a signal is shifted between two or more discrete frequency levels at a defined rate
GPS Global Positioning System – a constellation of 24 satellites used to derive worldwide position, time, and frequency information
GSM Global System for Mobile Communications – digital cellular communications (900 MHz)
HF High Frequency – radio spectrum ranging from 3 to 30 MHz
Hz Hertz – number of cycles in one second
IF Intermediate Frequency
In-Phase Every sinusoid can be expressed as the sum of a sine function (phase zero) and a cosine function (phase π/2). The sine part is called the “In-Phase” component; the cosine part is called the “Phase-Quadrature” component. In general, “phase-quadrature” means 90 degrees out of phase.
ISB Independent Sideband
kHz Kilohertz – frequency in thousands of cycles per second
KL Karhunen-Loève decomposition – a type of Principal Component Analysis
LQA Link Quality Assessment – a probe signal that provides communication network management functions
LSB Lower Sideband
MF Medium Frequency – radio spectrum ranging from 300 kHz to 3 MHz
MFSK Multi-Frequency Shift Keying – digital modulation that uses more than four separate frequency levels to encode a signal
MHz Megahertz – frequency in millions of cycles per second
MSK Minimum Shift Keying
NB Narrowband – a small bandwidth usually occupied by a single signal. In HF, NB is usually 4 kHz or smaller, but may be as large as 20 kHz
N-levels An arbitrary designator for any number of frequency levels (MFSK) or phase states (QAM), referenced for descriptive purposes only
NSA National Security Agency
OOK On-Off Key – modulation used for Morse messages
PM Phase Modulation – analog modulation where the phase of the carrier wave is varied in accordance with the modulating signal
PSK Phase Shift Key modulation – digital modulation where the phase of the carrier is modulated in discrete steps to encode a signal
QAM Quadrature Amplitude Modulation – both the amplitude and phase of a carrier wave are varied in discrete steps to encode a signal
QPSK Quadrature Phase Shift Keying – PSK that uses four phases
Quadrature See In-Phase
RF Radio Frequency
SIGINT Signals Intelligence
SNR Signal to Noise Ratio
SOI Signal of Interest
SS Spread Spectrum
SSB Single Sideband
TACS Total Access Communication System – a cellular system
TDMA Time Division Multiple Access
UHF Ultra-High Frequency – radio spectrum ranging from 300 to 3000 MHz
USA US Army
USAF US Air Force
USB Upper Sideband
USMC US Marine Corps
USN US Navy
VHF Very High Frequency – radio spectrum ranging from 30 to 300 MHz
WB Wideband – spectrum bandwidths that may contain more than one discrete signal
Appendix B
Typical Wideband Signal Surveillance System
In the recent past, Signal Intelligence work was characterized by rooms full of operators at radio receivers. These skilled individuals tuned through various bands, listening to their headphones and watching spectrogram displays for telltale signs of a signal of interest. Today, the volume and complexity of signal traffic easily overwhelms this type of operation. Instead, large automated systems are being developed that can monitor large portions of the radio spectrum simultaneously and alert an operator only when they find something interesting.
Figure B.1 shows a simplified block diagram for a wideband HF surveillance system. The dotted lines represent high volume signal data flow, while the solid lines represent lower volume command and control data flow. The system may consist of anywhere from a few to hundreds of interconnected computers.
The general blind signal recognition problem can be divided into five distinct sub-problems.
• Detection (finding a signal of interest in the received signal data).
• Classification (determining the signal modulation type and associated parameters).
• Demodulation (extracting a binary or audio data stream from the signal).
• Identification (using the demodulated data and possibly direction finding data to identify the signal source).
• Analysis (using the recognition data to extract information about the signal content).
The entry point for data is one or more receivers that digitize wide chunks (2 to 5 MHz) of the HF spectrum. This digital data is time tagged and stored in short-term volatile memory that provides several seconds of delay. Since this wideband data usually contains multiple narrowband signals, one of the first algorithmic tasks is to comb through the data looking for new energy emissions. When the New Energy Detection block locates a potential signal, it creates a mission request that contains its best estimate of the signal's starting time, frequency, bandwidth, and duration. It then passes the mission request to a Mission Queue. The Mission Queue's responsibility is to coordinate the job of passing these mission requests to the hundreds of Mission Prosecutors (MP) in the system.
Figure B.1: Simplified block diagram of a wideband signal surveillance system
The Mission Prosecutor is where the system must determine whether a particular new energy detection mission is actually a signal of interest (SOI). It begins by contacting the wideband memory component and requesting a narrowband (usually 5 to 20 kHz) stream at the indicated frequency and time. It is the mission prosecutor that executes the signal recognition algorithms against this narrowband data stream. The symbol rate estimation and modulation classification algorithms discussed in this dissertation are just a few of the many algorithms that comprise the "recognition" process. If the results match any of the criteria in the SOI database, the MP may record the data, notify operators, or post alerts to other surveillance systems.
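The mission flow described above can be sketched as a producer/consumer exchange. All names, fields, and criteria in this Python sketch are hypothetical illustrations, not the actual surveillance system interfaces:

```python
# Illustrative only: a New Energy Detection block posts a mission request; a
# Mission Prosecutor pulls it, runs recognition, and checks SOI criteria.
import queue
from dataclasses import dataclass

@dataclass
class MissionRequest:            # best estimate from the detection block
    start_time_s: float
    center_freq_hz: float
    bandwidth_hz: float
    duration_s: float

mission_queue = queue.Queue()

# Detection side: post a candidate emission
mission_queue.put(MissionRequest(12.5, 7_250_000.0, 3000.0, 4.0))

# Hypothetical SOI criteria the prosecutor tests recognition results against
soi_criteria = [{"modulation": "PSK", "min_baud": 40, "max_baud": 2400}]

def prosecute(req, classify):
    """Run recognition on the narrowband stream and test the SOI criteria."""
    result = classify(req)       # e.g. modulation type and symbol rate estimates
    for soi in soi_criteria:
        if (result["modulation"] == soi["modulation"]
                and soi["min_baud"] <= result["baud"] <= soi["max_baud"]):
            return "ALERT"
    return "DISCARD"

# Stand-in classifier so the sketch runs end to end
req = mission_queue.get()
print(prosecute(req, lambda r: {"modulation": "PSK", "baud": 1280}))  # ALERT
```

The real system distributes this loop across hundreds of prosecutor processes, but the request/queue/criteria structure is the same.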
Appendix C
Signal Exploitation Markup Language Schemas
Figure C.1 shows the include dependency hierarchy for the Signal Exploitation schema. Files at the bottom of the diagram are the ones used to describe high-level entities such as libraries and reports. The other files contain data types and common elements.
se1ActionPieces.xsd   se1BaseTypes.xsd
se2GeolocationPieces.xsd   se2SignalDescriptionPieces.xsd
se3Alerts.xsd   se3CheckTarget.xsd   se3Demod.xsd   se3Mission.xsd   se3SignalLibrary.xsd   se3SignalProsecutionPlan.xsd
Figure C.1: Dependency hierarchy for signal exploitation schema files
Table C.1: Description of signal exploitation schema files
File Name Contents Description
se1ActionPieces.xsd Defines actions (e.g. recording data, alerting operators, or generating reports) a system might perform in response to successful recognition of a signal.
se1BaseTypes.xsd Defines fundamental data types used in signal processing. Examples include frequency ranges, time ranges, and unique identifiers.
se2GeolocationPieces.xsd Defines structures for reporting geographic locations (Latitude, Longitude, Altitude) and directions (Line of Bearing)
se2SignalDescriptionPieces.xsd Definitions for templates that describe desired signal characteristics and reports that describe recorded or calculated signal characteristics.
se3Alerts.xsd Defines a general format for communicating alerts. Specific content is driven by an XML dictionary to allow for internationalization.
se3CheckTarget.xsd Defines both a list of signal targets to periodically check for direction finding accuracy, as well as a report structure for the results of the check.
se3Demod.xsd Defines interaction with an external signal demodulator.
se3Mission.xsd Defines a report for all information about the prosecution of a specific signal.
se3SignalLibrary.xsd Defines a storage mechanism for organizing libraries of signal files.
se3SignalProsecutionPlan.xsd Defines the work process for a signal interception system by describing signals of interest and where to expect those signals in frequency, time, and geographic location.
Figure C.2: Pictorial representation of XML schema type externalReportFSK
Figure C.3: Pictorial representation of XML schema type SegmentReportType
Appendix D
Index of ALEF Functions
D.1 Index for directory SigGen (test signal generator functions)
File Name Description
addAWGN Impose Additive White Gaussian Noise (AWGN) on a signal.
blanksigdesc Creates a blank signal description structure.
genAM Generate an Amplitude Modulation (AM) signal with added white Gaussian noise (AWGN).
genAMfiles Generate multiple AM signal files, saved in normalized .wav format.
genCPFSK Generate a Continuous Phase Frequency Shift Key (CPFSK) signal with added white Gaussian noise (AWGN).
genCPFSKfiles Generate multiple CPFSK signal files, saved in normalized .wav format.
genCPM Generate a Continuous Phase Modulation (CPM) signal with added white Gaussian noise (AWGN).
genCPMfiles Generate multiple CPM signal files, saved in normalized .wav format.
genDPSK Generate a Differential Phase Shift Key (DPSK) signal with added white Gaussian noise (AWGN).
genDPSKfiles Generate multiple DPSK signal files, saved in normalized .wav format.
genFM Generate a Frequency Modulated (FM) signal with added white Gaussian noise (AWGN).
genFMfiles Generate multiple FM signal files, saved in normalized .wav format.
genFSK Generate a Frequency Shift Key (FSK) signal with added white Gaussian noise (AWGN).
genFSKfiles Generate multiple FSK signal files, saved in normalized .wav format.
genGMSK Generate a Gaussian Minimum Shift Key (GMSK) signal with added white Gaussian noise (AWGN).
genMPAM Generate an M-ary Pulse Amplitude Modulated (MPAM) signal with added white Gaussian noise (AWGN).
genMSK Generate a Minimum Shift Key (MSK) signal with added white Gaussian noise (AWGN).
genMSKfiles Generate multiple MSK signal files, saved in normalized .wav format.
genOOK Generate an On Off Key (OOK) signal (Morse code) with added white Gaussian noise (AWGN).
genOOKfiles Generate multiple OOK signal files, saved in normalized .wav format.
genOQPSK Generate an Offset Quadrature Phase Shift Key (OQPSK) signal with added white Gaussian noise (AWGN).
genOQPSKfiles Generate multiple OQPSK signal files, saved in normalized .wav format.
genPSK Generate a Phase Shift Key (PSK) signal with added white Gaussian noise (AWGN).
genPSKfiles Generate multiple PSK signal files, saved in normalized .wav format.
genQAM Generate a Quadrature Amplitude Modulation (QAM) signal with added white Gaussian noise (AWGN).
genQAMfiles Generate multiple QAM signal files, saved in normalized .wav format.
genSSB Generate a Single Sideband (SSB) signal with added white Gaussian noise (AWGN).
genSSBfiles Generate multiple SSB signal files, saved in normalized .wav format.
heterodyne Mixes an analytic signal up or down in frequency.
hfchan Passes a signal through an HF channel simulator built in Simulink.
hfchanwav Passes wav files through an HF channel simulator built in Simulink.
impulsive Returns a vector that models impulsive noise that can be added to simulated signals.
killwaitbox Removes orphan progress boxes.
specgramgray Produces a spectrogram in inverse grayscale suitable for B/W publishing.
testgenfiles Script to test all of the gen*files.
testmdl Script to test the sim*.mdl files and the gen*.m files.
wav2signal Read a wav file and extract signal data.
wavname2sigdesc Parses a wave file name into a signal description structure.
wavname2title Creates a description string from a wav file naming convention.
wavwritescaled Write a normalized .wav file.
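Each of the generator functions above imposes noise at a caller-specified SNR, as addAWGN does. The ALEF functions are MATLAB; the following Python sketch illustrates only the scaling idea (the function name and structure here are ours, not ALEF's): scale white Gaussian noise so the result has the requested signal-to-noise ratio.

```python
# Illustrative sketch of the addAWGN concept (not a port of the MATLAB code).
import math
import random

def add_awgn(signal, snr_db, rng=random):
    """Return signal plus white Gaussian noise at the given SNR in dB."""
    sig_power = sum(s * s for s in signal) / len(signal)
    noise_power = sig_power / (10 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in signal]

rng = random.Random(42)
tone = [math.sin(2 * math.pi * 0.05 * n) for n in range(2000)]
noisy = add_awgn(tone, 10.0, rng)

# Measure the achieved SNR from the injected noise
noise = [y - s for y, s in zip(noisy, tone)]
snr_est = 10 * math.log10(
    (sum(s * s for s in tone) / len(tone)) /
    (sum(e * e for e in noise) / len(noise)))
print(round(snr_est, 1))  # close to the requested 10 dB
```

For complex baseband signals the noise power is split evenly between the real and imaginary components; the real-valued case above omits that detail.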
D.2 Index for directory SigGen\models (SigGen support functions)
File Name Description
bldpstable Build a structure array with valid combinations of pulse shape parameters.
chkchanstruct Check/Validate the structure of channel parameters and restructure it.
mkCPFSKcarriernode Create an XML carrier node for FSK as defined in digicom.xsd
mkCPMcarriernode Create an XML carrier node for CPM as defined in digicom.xsd
mkDPSKcarriernode Create an XML carrier node for PSK or DPSK as defined in digicom.xsd
mkFSKcarriernode Create an XML carrier node for FSK as defined in digicom.xsd
mkMSKcarriernode Create an XML carrier node for PSK as defined in digicom.xsd
mkOOKcarriernode Create an XML carrier node for OOK as defined in digicom.xsd
mkOQPSKcarriernode Create an XML carrier node for OQPSK as defined in digicom.xsd
mkQAMcarriernode Create an XML carrier node for PSK or QAM as defined in digicom.xsd
mkfiledescnode Create an XML file description node
mkfilenode Create the basic shell for a siglib.xsd signalFile node
mksegnode Create an XML signal segment as defined in digicom.xsd
mksigdescnode Create a signal description node
opensiglib Open a signal library xml file for read/write
qamconst Creates a QAM constellation of points in complex format
samplefactors Calculate rational fraction numbers for integer baud rate and sampling.
testsim Test script used to work with the simTest.mdl file.
Other MATLAB-specific files in this directory (Simulink models):
hfchansim.mdl
simAM.mdl
simAWGN.mdl
simCPFSK.mdl
simCPM.mdl
simDPSK.mdl
simDPSKps.mdl
simFM.mdl
simFSK_cont.mdl
simFSK_discont.mdl
simGMSK.mdl
simLSB.mdl
simMPAM.mdl
simMSK.mdl
simOQPSK.mdl
simOQPSKps.mdl
simQAM.mdl
simQAMps.mdl
simTestNoPS.mdl
simTestPS.mdl
simUSB.mdl
D.3 Index for directory SAFramework (signal analysis framework)
File Name Description
Readme Help summary for the Signal Analysis Framework functions.
fmtcsv Format a vector or matrix into a comma delimited string.
formatArgs Converts structure field names into variables in the caller's workspace.
formatOutput Converts variable names in the caller's workspace into fields in a structure.
fw_anova Performs analysis of variance of a metric against factor values
fw_combinetests Combine the results from two separate tests
fw_errsummary Produces a summary report for the test data
fw_factor Extracts the multiple valued factors from a kTest structure
fw_factorconst Reformats a factor matrix so that it represents values for a single constant factor value
fw_factordel Removes a factor column
fw_factorperf Produce a vector of % correct values for each value of a single factor
fw_factorstats Produce table of % correct for every multiple valued factor
fw_listbuild Build a list of file names that match the given criterion
fw_listcommon Find common file names in two file lists
fw_listdir Finds all .wav files in the given path
fw_listfailed Produces a list of all files that consistently failed.
fw_listprune Removes all signal files in a list except those matching the given criteria
fw_listread Reads a saved list of file names into a cell array of strings
fw_listwrite Writes a file list (cell array of strings) into a text file
fw_metric Calculates a metric from one or more fields in kTest.kResult
fw_plotcluster Plot pass/fail data and cluster to find centroids
fw_plotdensity Plots a probability density function for a pass/fail discriminator
fw_plotgroup Produce a group plot of multiple valued factors
fw_plotperf Creates 2D plot of performance. White = 0% fail, Black = 100% fail.
fw_plotroc Create a Receiver Operating Characteristic (ROC) curve
fw_rocpoint Select a threshold from ROC data created by fw_plotroc
fw_runtest Run tests on an algorithm. Results are stored in a kTest var in a saved .mat file.
fw_setfigdefaults Sets default font sizes for figures to enhance suitability for printing.
htmClose Close an HTML report file
htmImage Save a figure and insert a reference into an HTML report file
htmOpen Open an HTML report file
htmPara Insert a preformatted string into an HTML report file
htmSection Insert a section title into an HTML report file
htmTable Insert a formatted table into an HTML report file
D.4 Index for directory SymbolRate (symbol rate algorithm functions)
File Name Description
sr_acor Symbol rate using autocorrelation.
sr_avse Symbol rate using absolute value of the signal envelope.
sr_dpdt Symbol rate using phase derivative.
sr_magsqrd Symbol rate using magnitude squared of the signal envelope.
sr_wavelet Symbol rate using multi-scale wavelet transform.
sra_visual Symbol Rate Visualization GUI.
srh_featuredata Common symbol rate algorithm function to collect data on extracted features
srh_peakdata Common symbol rate algorithm function to collect data on baud rate peaks
srh_pickpeak Common symbol rate algorithm function to select the symbol rate peak
srh_spectral Common symbol rate algorithm function to develop the spectral lines
srh_visualize Common symbol rate algorithm function to plot symbol rate information
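The idea behind sr_magsqrd can be sketched in a few lines (this Python illustration is not the ALEF MATLAB code): square the magnitude of the complex envelope, then search its spectrum for the line at the symbol rate. The signal parameters below are arbitrary; RZ-style half-width pulses are used because full-width rectangular pulses place a spectral null exactly on the baud line.

```python
# Illustrative sketch of the magnitude-squared envelope symbol rate method.
import cmath
import random

fs, baud, sps, nsym = 1000.0, 125.0, 8, 64       # 8 samples/symbol, 512 samples
rng = random.Random(3)
bits = [rng.randint(0, 1) for _ in range(nsym)]

# Complex baseband OOK with half-width (RZ) pulses and a small carrier offset
x = [bits[n // sps] * cmath.exp(2j * cmath.pi * 33.0 * n / fs)
     if (n % sps) < sps // 2 else 0.0
     for n in range(nsym * sps)]

env = [abs(v) ** 2 for v in x]                   # magnitude-squared envelope
mean = sum(env) / len(env)
env = [e - mean for e in env]                    # remove the DC component

# Brute-force DFT (pure Python); find the strongest non-DC spectral line
N = len(env)
spectrum = [abs(sum(env[n] * cmath.exp(-2j * cmath.pi * k * n / N)
                    for n in range(N))) for k in range(N // 2)]
k_peak = max(range(1, N // 2), key=lambda k: spectrum[k])
print(k_peak * fs / N)   # strongest line sits at the 125 baud symbol rate
```

The production algorithms refine this with pulse-shape-aware preprocessing and peak selection logic (srh_spectral, srh_pickpeak), but the spectral line at the baud rate is the underlying feature.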
D.5 Index of directory KLDecomp (KL decomposition functions)
File Name Description
kl_basis2avi Create avi files of the KL basis vectors
kl_calcbins Calculate bin values for a given range.
kl_centerpeaks Find the amount to shift a vector to center its peaks.
kl_checktraining Test a KL database to see if the reconstruction matches the training set
kl_classify1d Modulation identification using vector match against KL derived database.
kl_decomp KL decomposition of a matrix of vectors.
kl_findpeakscirc Find peaks in a vector where the ends effectively wrap.
kl_fixsignal Try to fix a signal by removing carrier offset and centering phase peaks.
kl_plotbyfactor 3D scatter plot of data by constant factor value.
kl_ploteigval Produces a summary report for the test data
kl_plotrecon Test the decomposition and reconstruction of a signal
kl_rptanalyzedb Produces a summary report of the eigenvalues associated with the KL transform
kl_rptareabyfactor Area plot and locate best separation vectors for the histogram of a specific attribute
kl_rptfailedpairs Massage output from kl_checktraining to see whether a failed specific identification at least matches the factor.
kl_rptshow3 Produce a set of reports showing classifications using 3 KL vectors
kl_separation Calculate the minimum separation distance between tests with common factor values
kl_sig2hist1d Create a histogram vector that represents the given input signal.
kl_sig2hist2d Create a matrix (2D) histogram that represents the given input signal.
kl_train2avi Create avi files of the training set and resulting KL transform
kl_vdtheta Calculate the distance and angle between a vector and one or more other vectors.
kl_viewhist M-file for kl_viewhist.fig. View 1D histograms of signal components.
kl_viewhist2d M-file for kl_viewhist2d.fig. View 2D histograms of signal components.
kl_wav2hist Create an ensemble (matrix) of histogram vectors from a list of signal files.
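The core of kl_decomp can be illustrated compactly. This Python sketch (not the ALEF MATLAB code) finds the dominant KL basis vector of a small ensemble as the top eigenvector of its covariance matrix, using power iteration; the data values are invented for the example:

```python
# Illustrative sketch: the KL basis vectors are the eigenvectors of the
# mean-centered ensemble covariance matrix.
import math

ensemble = [(3, 1), (-3, -1), (6, 2), (-6, -2), (0.5, -1.5), (-0.5, 1.5)]

# Mean-center the ensemble, then form the 2x2 covariance matrix
mx = sum(v[0] for v in ensemble) / len(ensemble)
my = sum(v[1] for v in ensemble) / len(ensemble)
pts = [(x - mx, y - my) for x, y in ensemble]
cxx = sum(x * x for x, _ in pts) / len(pts)
cxy = sum(x * y for x, y in pts) / len(pts)
cyy = sum(y * y for _, y in pts) / len(pts)

# Power iteration: repeatedly apply C and renormalize to get the top KL vector
v = (1.0, 0.0)
for _ in range(50):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    n = math.hypot(*w)
    v = (w[0] / n, w[1] / n)

print(v)  # aligns with (3, 1)/sqrt(10), the dominant direction of the ensemble
```

For the real signal histograms the vectors are high-dimensional and the full eigendecomposition is used, but the geometric picture is the same: project each signal onto a few dominant basis vectors and classify by proximity in that reduced space (kl_classify1d).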
Appendix E
Analysis Report from DPDT Experiment
Algorithm: sr_dpdt
Basic Test Information and Performance Metrics
--------------------------------------------------
Test Results Identification
--------------------------------------------------
Algorithm: sr_dpdt
Description: Derivative of the Phase Angle
Test date: 30-Sep-2005 12:10:17
Analysis date: 30-Sep-2005 13:33:37
Error type: Pct Error
Result field: symbolsPerSec
Pass threshold: 2
--------------------------------------------------
Summary Statistics
--------------------------------------------------
# Tests: 15750
# Right: 15099
# Wrong: 651
% Right: 95.8667
--------------------------------------------------
File: sr_dpdt_PSK_20050930T121017_01.fig
Performance by factor
--------------------------------------------------
Single value factors
--------------------------------------------------
sampleRateHz: [ 8000 ]
durationSec: [ 2.048 ]
startTimeSec: [ 0.6 ]
modulation: { PSK }
modIndex: [ 0 ]
phaseContinuity: { Unknown }
rayleighCondition: { None }
ricianCondition: { None }
impnseCondition: { None }
rMinSymbolsPerSec: [ 20 ]
sPeakOpt: { orh }
bDropInitial: [ 0 ]
--------------------------------------------------
Multiple value factors
--------------------------------------------------
offsetFreqHz: [0 50 100]
symbolsPerSec: [50 100 300 1280 2400]
numStates: [2 4 8]
pulseShape: { None Normal sqrt }
rolloff: [0 0.1 0.2 0.35]
SNRdB: [9 12 16 20 40]
Statistics for each factor value
Factor           Value    #Tests   #Right   %Right
---------------  -------  -------  -------  -------
offsetFreqHz     0        5250     5087     96.90
offsetFreqHz     50       5250     5040     96.00
offsetFreqHz     100      5250     4972     94.70
symbolsPerSec    50       3150     2556     81.14
symbolsPerSec    100      3150     3093     98.19
symbolsPerSec    300      3150     3150     100.00
symbolsPerSec    1280     3150     3150     100.00
symbolsPerSec    2400     3150     3150     100.00
numStates        2        5250     5184     98.74
numStates        4        5250     4966     94.59
numStates        8        5250     4949     94.27
pulseShape       None     2250     2158     95.91
pulseShape       Normal   6750     6414     95.02
pulseShape       sqrt     6750     6527     96.70
rolloff          0        2250     2158     95.91
rolloff          0.1      4500     4195     93.22
rolloff          0.2      4500     4322     96.04
rolloff          0.35     4500     4424     98.31
SNRdB            9        3150     2996     95.11
SNRdB            12       3150     3062     97.21
SNRdB            16       3150     3080     97.78
SNRdB            20       3150     3079     97.75
SNRdB            40       3150     2882     91.49
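The per-factor tallies above come from grouping pass/fail results by factor value. This Python sketch illustrates the bookkeeping only (it is not fw_factorstats, and the test records are invented):

```python
# Illustrative sketch: tally % right for each value of every factor, using the
# same pass criterion as the report (percent error <= 2).
from collections import defaultdict

# Hypothetical test records: (factor dict, estimated rate, true rate)
tests = [
    ({"SNRdB": 9,  "symbolsPerSec": 50},  51.2,  50),
    ({"SNRdB": 9,  "symbolsPerSec": 300}, 300.4, 300),
    ({"SNRdB": 40, "symbolsPerSec": 50},  57.0,  50),
    ({"SNRdB": 40, "symbolsPerSec": 300}, 299.1, 300),
]

PASS_THRESHOLD_PCT = 2.0

def tally(tests):
    counts = defaultdict(lambda: [0, 0])          # (factor, value) -> [right, total]
    for factors, est, truth in tests:
        passed = abs(est - truth) / truth * 100.0 <= PASS_THRESHOLD_PCT
        for name, value in factors.items():
            counts[(name, value)][1] += 1
            counts[(name, value)][0] += passed
    return {k: 100.0 * r / t for k, (r, t) in counts.items()}

for (name, value), pct in sorted(tally(tests).items()):
    print(f"{name:<14} {value:<6} {pct:6.2f}")
```

Averaging over all other factors in this way is what makes a weakness (here, the low-baud rows of the real table) stand out against an otherwise high overall score.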
Summary of Performance: Symbol Rate Vs. Offset Frequency
Baud rate 0 Hz 50 Hz 100 Hz
50 84.76 80.38 78.29
100 99.71 99.62 95.24
300 100.00 100.00 100.00
1280 100.00 100.00 100.00
2400 100.00 100.00 100.00
Summary of Performance: Symbol Rate Vs. Pulse Shape
Baud rate None Normal sqrt
50 79.56 77.78 85.04
100 100.00 97.33 98.44
300 100.00 100.00 100.00
1280 100.00 100.00 100.00
2400 100.00 100.00 100.00
Probability density and ROC plots
File: sr_dpdt_PSK_20050930T121017_06.fig
File: sr_dpdt_PSK_20050930T121017_07.fig
Example Operating point: FP= 0.01 TP= 0.55 Thresh=0.00330433
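An operating point like the one above is found by sweeping the discriminator threshold along the ROC curve. This Python sketch illustrates the selection rule only (it is not fw_rocpoint, and the score lists are invented):

```python
# Illustrative sketch: choose the threshold with the highest true positive rate
# whose false positive rate stays at or below the budget.
def pick_operating_point(pos_scores, neg_scores, max_fp=0.01):
    best = None
    for thresh in sorted(set(pos_scores + neg_scores)):
        tp = sum(s >= thresh for s in pos_scores) / len(pos_scores)
        fp = sum(s >= thresh for s in neg_scores) / len(neg_scores)
        if fp <= max_fp and (best is None or tp > best[1]):
            best = (thresh, tp, fp)
    return best

# Hypothetical discriminator scores for signals of interest vs. everything else
pos = [0.2, 0.4, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99]
neg = [0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.3, 0.35]

thresh, tp, fp = pick_operating_point(pos, neg, max_fp=0.01)
print(f"Thresh={thresh} TP={tp} FP={fp}")
```

In the real report the scores come from the probability density plots above, and the false positive budget reflects the cost of alerting an operator on an uninteresting emission.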
File: sr_dpdt_PSK_20050930T121017_08.fig
Cluster plot File: sr_dpdt_PSK_20050930T121017_18.fig
Appendix F
Signal Histogram Viewer Samples
The following figures are example screen captures of the one-dimensional Signal Histogram Viewer application (kl_viewhist.m). The five figures show examples for AM, OOK, PSK without pulse shaping, PSK with pulse shaping, and FSK modulation types.
Figure F.1: Histograms for AM signal
Figure F.2: Histograms for OOK, 35 WPM signal
Figure F.3: Histograms for PSK, 4 level, 50 baud, no pulse shape signal
Figure F.4: Histograms for PSK, 4 level, 50 baud, RRC pulse shape signal
Figure F.5: Histograms for FSK, 2 level, 50 baud, mod index 1 signal
Bibliography
[1] "Appendix 1-Classification of emissions and necessary bandwidths," in Radio Regulations: International Telecommunication Union, 1990.
[2] "Radio Communications in the Digital Age Volume 1: HF Technology," Harris Corporation, http://www.rfcomm.harris.com/support/PDF/hfradio.pdf, 1996.
[3] "Manual of Regulations & Procedures for Federal Radio Frequency Management," National Telecommunications and Information Administration, http://www.ntia.doc.gov/osmhome/redbook/redbook.html, 2000.
[4] "Using The ITU Emission Classifications," International Amateur Radio Union, http://www.echelon.ca/iarumsr2/emisscode.html, 2002.
[5] University of Pennsylvania, "Linguistic Data Consortium," http://www.ldc.upenn.edu/.
[6] UC Berkeley, The Ptolemy Project, http://ptolemy.berkeley.edu/. 2004.
[7] Register Machine Learning Technologies, Discipulus, http://www.aimlearning.com/products.htm. 2005.
[8] Statistics Toolbox User's Guide. Natick, MA: The Mathworks, Inc., 2005.
[9] Free Software Foundation, "GNU Scientific Library," http://www.gnu.org/software/gsl/.
[10] E. Azzouz and A. K. Nandi, Automatic Modulation Recognition of Communication Signals. Norwell, MA: Kluwer Academic Publishers, 1996.
[11] W. Banzhaf, F. D. Francone, and P. Nordin, "The Effect of Extensive Use of the Mutation Operator on Generalization in Genetic Programming Using Sparse Data Sets," presented at Parallel Problem Solving from Nature IV, Proceedings of the International Conference on Evolutionary Computation, Berlin, Germany, 1996.
[12] W. Banzhaf, P. Nordin, R. E. Keller, et al., Genetic Programming -- An Introduction; On the Automatic Evolution of Computer Programs and its Applications. San Francisco: Morgan Kaufmann, dpunkt.verlag, 1998.
[13] L. Brodie, Starting Forth, 2 ed. Englewood Cliffs: Prentice-Hall, 1987.
[14] W. S. Bruce, "The Lawnmower Problem Revisited: Stack-Based Genetic Programming and Automatically Defined Functions," presented at Genetic Programming 1997: Proceedings of the Second Annual Conference, Stanford University, CA, USA, 1997.
[15] L. Cardelli, "Type Systems," in CRC Handbook of Computer Science and Engineering, A. B. Tucker, Ed. Boca Raton, FL: CRC Press, 2004.
[16] Y. T. Chan, J. W. Plews, and K. C. Ho, "Symbol rate estimation by the wavelet transform," presented at IEEE International Symposium on Circuits and Systems (ISCAS 97), 1997.
[17] DARPA, "Model Based Integration of Embedded Software (MoBIES)," http://dtsn.darpa.mil/ixo/programdetail.asp?progid=38, accessed September 2003.
[18] K. L. Davidson, J. R. Goldschneider, L. Cazzanti, et al., "Feature-based modulation classification using circular statistics," presented at Military Communications Conference, Monterey, CA, 2004.
[19] L. Dehio, "Monitoring Utility Stations," http://rover.vistecprivat.de/~signals/, accessed September 2002.
[20] R. M. Friedberg, "A Learning Machine: Part 1," IBM Journal of Research and Development, vol. 2, pp. 2-13, January 1958.
[21] M. Gudaitis, "A Waveform Description Language For Software Defined Radio," presented at Software Defined Radio Technical Conference, San Diego, California, 2002.
[22] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York, NY: Springer-Verlag, 2001.
[23] T. D. Haynes, D. A. Schoenefeld, and R. L. Wainwright, "Type Inheritance in Strongly Typed Genetic Programming," in Advances in Genetic Programming 2, P. J. Angeline and K. E. Kinnear, Jr., Eds. Cambridge, MA, USA: MIT Press, 1996, pp. 359-376.
[24] K. C. Ho, W. Prokopiw, and Y. T. Chan, "Modulation identification of digital signals by the wavelet transform," presented at IEEE Radar, Sonar and Navigation, 2000.
[25] K. C. Ho, A. E. Scheidl, Y. T. Chan, et al., "Signal identification by orthogonal transforms," IEEE Proceedings on Radar, Sonar, and Navigation, vol. 145, pp. 145-152, June 1998.
[26] K. Holladay, A. Burmeister, R. Dollarhide, et al., "A configurable signal analyzer for embedded systems," presented at Military Communications (MILCOM), Boston, MA, 2003.
[27] K. Holladay and K. Robbins, "A framework for automatic large-scale testing and characterization of signal processing algorithms," presented at Military Communications (MILCOM), Monterey, CA, 2004.
[28] K. Holladay, K. Robbins, and J. v. Ronne, "FIFTH™: A stack based GP language for vector processing," presented at EuroGP 2007, Valencia, Spain, 2007.
[29] K. Holladay, K. Robbins, and D. Varner, "An XML schema for digital communication signal research," presented at XML 2002, Baltimore, MD, 2002.
[30] D. Jackson, "Dormant program nodes and the efficiency of genetic programming," presented at GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation, Washington DC, USA, 2005.
[31] R. Jain, "Determining Sample Size," in The Art of Computer Systems Performance Analysis. New York, NY: John Wiley & Sons, 1990, p. 217.
[32] R. Kakarala and P. Ogunbona, "Signal Analysis Using a Multiresolution Form of the Singular Value Decomposition," IEEE Transactions on Image Processing, vol. 10, pp. 724-735, May 2001.
[33] J. Keller, "The Coming HF Radio Renaissance," Military & Aerospace Electronics, September 2002.
[34] G. A. Kerkut, Implications of Evolution. New York: Pergamon Press Inc., 1960.
[35] J. Kilgallen, Monteria LLC, "Global Communication Intelligence," http://www.monteriallc.com, accessed September 2002.
[36] M. Kirby, Geometric Data Analysis. New York, NY: John Wiley & Sons, 2001.
[37] M. Kirby and L. Sirovich, "Application of the Karhunen-Loeve procedure for the characterization of human faces," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 103-108, January 1990.
[38] Klingenfuss, Klingenfuss Publications, "Klingenfuss Radio Monitoring," http://www.klingenfuss.org, accessed September 2002.
[39] M. A. Koets, J. Porter, M. Moore, et al., "An analytical framework and implementation architecture for a reconfigurable signal classification system," presented at Military Communications Conference, 2004.
[40] B. S. Koh and H. S. Lee, "Detection of symbol rate of unknown digital communication signals," Electronics Letters, vol. 29, pp. 278-279, February 1993.
[41] J. Koza, F. H. Bennett III, D. Andre, et al., "The Design of Analog Circuits by Means of Genetic Programming," in Evolutionary Design by Computers, P. Bentley, Ed. San Francisco, USA: Morgan Kaufmann, 1999, pp. 365-385.
[42] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge: MIT Press, 1992.
[43] J. R. Koza, Genetic Programming II: Automatic Discovery of Reusable Programs. Cam-bridge: MIT Press, 1994.
[44] J. R. Koza, D. Andre, F. H. Bennett III, et al., Genetic Programming III: Darwinian Invention and Problem Solving: Morgan Kaufmann, 1999.
[45] J. R. Koza, M. A. Keane, M. J. Streeter, et al., Genetic Programming IV: Routine Human-Competitive Machine Intelligence: Kluwer Academic Publishers, 2003.
[46] S. Kremer and J. Shiels, "A Testbed for Automatic Modulation Recognition Using Artifi-cial Neural Networks," presented at Canadian Conference on Electrical and Computer Engineering (CCECE), St. John's, Newfoundland, 1997.
[47] M. Kueckenwaitz, F. Quint, and J. Reichert, "A robust baud rate estimator for noncooperative demodulation," presented at Military Communications Conference (MILCOM 2000), 2000.
[48] R. Lacroix, "MilSpec Communication Canada," http://www.milspec.ca, accessed September 2002.
[49] P. Laguna, G. Moody, J. Garcia, et al., "Analysis of the ST-T complex of the electrocardiogram using the Karhunen-Loeve transform: adaptive monitoring and alternans detection," Medical & Biological Engineering & Computing, vol. 37, pp. 175-189, 1999.
[50] W. B. Langdon, Genetic Programming and Data Structures: Genetic Programming + Data Structures = Automatic Programming!, vol. 1. Boston: Kluwer, 1998.
[51] F.-S. Lu, C.-X. Yang, and P.-L. Lin, "An improved Wigner distribution based algorithm for signal identification," presented at International Symposium on Underwater Technology, Taipei, TW, 2004.
[52] B. Moghaddam and A. Pentland, "An Automatic System for Model-Based Coding of Faces," presented at IEEE Data Compression Conference, Snowbird, Utah, 1995.
[53] B. M. Moret, "Towards a Discipline of Experimental Algorithmics," in Monographs in Discrete Mathematics and Theoretical Computer Science. American Mathematical Society, 2000.
[54] A. K. Nandi and E. E. Azzouz, "Algorithms for automatic modulation recognition of communication signals," IEEE Transactions on Communications, vol. 46, pp. 431-436, 1998.
[55] OMG, "Unified Modeling Language Specification," http://www.omg.org/, accessed September 2005.
[56] T. Perkis, "Stack-Based Genetic Programming," presented at Proceedings of the 1994 IEEE World Congress on Computational Intelligence, Orlando, Florida, USA, 1994.
[57] R. Poli, "Parallel Distributed Genetic Programming," University of Birmingham School of Computer Science, Birmingham, UK, September 1996.
[58] S. Prabhakar and A. Jain, "Fingerprint Identification," http://biometrics.cse.msu.edu/fingerprint.html, accessed September 2005.
[59] J. G. Proakis, Digital Communications. New York, NY: McGraw-Hill, 2000.
[60] D. E. Reed and M. A. Wickert, "Performance of a fourth-power envelope detector for symbol-rate tone generation," presented at International Conference on Communications (ICC), 1988.
[61] M. M. Rizki and L. A. Tamburino, "Evolutionary Computing Applied To Pattern Recognition," in Genetic Programming 1998: Proceedings of the Third Annual Conference, J. R. Koza, W. Banzhaf, et al., Eds. University of Wisconsin, Madison, Wisconsin, USA: Morgan Kaufmann, San Francisco, CA, USA, 1998, pp. 777-785.
[62] K. Robbins and D. Senseman, "Visualizing Differences in Movies of Cortical Activity," presented at Visualization '98, 1998.
[63] R. Sawai, H. Harada, H. Shirai, et al., "General-purpose symbol rate and symbol timing estimation method by using multi-resolution analysis based on wavelet transform for multimode software radio," presented at IEEE 50th Vehicular Technology, 1999.
[64] S. Scalsky, "Digital Signals FAQ Version 5.0," Worldwide Utility News, http://www.wunclub.com/digfaq/signals.html, 1997.
[65] D. C. Schmidt, "Real-time CORBA with TAO," http://www.cs.wustl.edu/~schmidt/TAO.html.
[66] K. C. Sharman, A. I. Esparcia-Alcazar, and Y. Li, "Evolving Digital Signal Processing Algorithms by Genetic Programming," Glasgow G12 8QQ, Scotland, 1995.
[67] K. C. Sharman, A. I. Esparcia-Alcazar, and Y. Li, "Evolving Signal Processing Algorithms by Genetic Programming," presented at First International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, GALESIA, Sheffield, UK, 1995.
[68] J. A. Sills and J. F. Wood, "Application of the Euclidean algorithm to optimal baud-rate estimation," presented at Military Communications Conference (MILCOM), McLean, VA, 1996.
[69] B. Sklar, Digital Communications Fundamentals and Applications. Upper Saddle River, NJ: Prentice Hall, 2001.
[70] L. Spector and A. Robinson, "Genetic Programming and Autoconstructive Evolution with the Push Programming Language," Genetic Programming and Evolvable Machines, vol. 3, pp. 7-40, 2002.
[71] J. Stevens, R. B. Heckendorn, and T. Soule, "Exploiting disruption aversion to control code bloat," presented at GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation, Washington DC, USA, 2005.
[72] K. Stoffel and L. Spector, "High-Performance, Parallel, Stack-Based Genetic Programming," presented at Genetic Programming 1996: Proceedings of the First Annual Conference, Stanford University, CA, USA, 1996.
[73] W. Su and J. Kosinski, "A survey of digital modulation recognition methods," presented at International Signal Processing Conference, 2003.
[74] B. Tarver, E. Christensen, and A. Miller, "Software defined radios (SDR) platform and application programming interfaces (API)," presented at Military Communications Conference (MILCOM), 2001.
[75] E. Tchernev, "Forth Crossover Is Not a Macromutation?," presented at Genetic Programming 1998: Proceedings of the Third Annual Conference, University of Wisconsin, Madison, Wisconsin, USA, 1998.
[76] E. Tchernev, "Stack-Correct Crossover Methods in Genetic Programming," presented at Late Breaking Papers at the Genetic and Evolutionary Computation Conference (GECCO-2002), New York, NY, 2002.
[77] E. B. Tchernev and D. S. Phatak, "Control structures in linear and stack-based Genetic Programming," presented at Late Breaking Papers at the 2004 Genetic and Evolutionary Computation Conference, Seattle, Washington, USA, 2004.
[78] A. Teller and M. Veloso, "Program Evolution for Data Mining," The International Journal of Expert Systems, vol. 8, pp. 216-236, 1995.
[79] T.-M. Tu, C.-H. Chen, and C.-I. Chang, "A Posteriori Least Squares Orthogonal Subspace Projection Approach to Desired Signature Extraction and Detection," IEEE Transactions on Geoscience and Remote Sensing, vol. 35, pp. 127-139, January 1997.
[80] Vanderbilt University, "The Generic Modeling Environment," http://www.isis.vanderbilt.edu/Projects/gme/, accessed September 2005.
[81] U.S. Army, "Joint Tactical Radio System (JTRS) Program," http://jtrs.army.mil/, accessed September 2005.
[82] H. L. Van Trees, Detection, Estimation, and Modulation Theory, Part I. New York, NY: John Wiley & Sons, 2001.
[83] P. Walmsley, Definitive XML Schema. Upper Saddle River, NJ: Prentice Hall, 2002.
[84] A. W. Wegener, "Practical techniques for baud rate estimation," presented at International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1992.
[85] M. Willis, H. Hiden, P. Marenbach, et al., "Genetic Programming: An Introduction and Survey of Applications," presented at Second International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, GALESIA, University of Strathclyde, Glasgow, UK, 1997.
[86] Z. Yaqin, R. Guanghui, W. Xuexia, et al., "Automatic digital modulation recognition using artificial neural networks," presented at International Conference on Neural Networks and Signal Processing, Nanjing, China, 2003.
[87] G. Youngblood, "A Software-Defined Radio for the Masses, Part 1," QEX, pp. 13-21, July/August 2002.
[88] E. Zorlu, N. Y. Guler, and I. Olcer, "Statistical methods based modulation recognition algorithm," presented at Signal Processing and Communications Applications Conference, 2004.
Vita
Background
Kenneth Holladay was born in Jacksonville, Florida on February 1, 1956. He is the elder
child of Billy and Dorcie Holladay. He graduated as valedictorian from Colonial High School,
Orlando, Florida in 1973. In 1978 he graduated with high honors from the University of Florida
with a Bachelor of Science in Chemical Engineering. In 1990 Mr. Holladay became a registered
professional engineer in the state of Florida. He received his Master of Science in Computer
Science from the University of Texas at San Antonio in 2005 and will complete his Ph.D. degree
in Computer Science in May 2007. Kenneth married his wife Deborah in 1975. Together they have
Computer Science in May 2007. Kenneth married his wife Deborah in 1975. Together they have
three children, Aaron, Miriam, and Benjamin.
Work History
In 1996 Mr. Holladay joined the staff of Southwest Research Institute in San Antonio,
Texas where he is currently a Staff Analyst in the Signal Exploitation and Geolocation division.
His responsibilities encompass software engineering, software design and development, and
distributed computing, primarily for signal processing systems.
Mr. Holladay began his professional career in 1978 at Milliken Research in Spartanburg,
South Carolina where he worked on rheological characterization of dye thickeners, flow control
of non-Newtonian dye mixtures, and dye filtration. In 1980 he went to work for CENTEC
Corporation in Fort Lauderdale, Florida where he spent the next five years working on energy and
environmental related projects in the food, agricultural chemical, and metal coating industries.
His work encompassed analysis and design for chemical processes and industrial control
systems. During this period he began to work extensively with microcomputer hardware and
software.
In 1986 Mr. Holladay co-founded Applied System Technologies in Fort Lauderdale,
Florida. He was the principal designer of the Cornerstone family of software products, which
manage communication and calibration of smart industrial instruments. He also served for ten
years as a member of the international HART Communication Foundation, a group responsible
for maintaining the standards for the Highway Addressable Remote Transducer (HART)
industrial communication protocol.
Patents
US 6,330,517 B1 Improved Interface for Managing Process Calibrators
Publications
Kenneth Holladay, Kay A. Robbins, and Jeffery von Ronne, “FIFTH: A stack based GP language for vector processing,” presented at EuroGP, Valencia, Spain, April 2007.
Kenneth Holladay, Kay A. Robbins, “A framework for automatic large-scale testing and characterization of signal processing algorithms,” presented at Military Communications (MILCOM), Monterey, CA, October, 2004.
Kenneth Holladay, Kay A. Robbins, “Experimental analysis of wavelet transforms for estimating PSK symbol rate,” presented at IASTED International Conference on Communication Systems and Applications, Banff, Canada, July 2004.
Kenneth Holladay, Michael Koets, Denise Varner, et al., “A configurable signal analyzer for embedded systems,” presented at Military Communications (MILCOM), Boston, MA, 2003.
Kenneth Holladay, Kay A. Robbins, Denise Varner, “An XML schema for digital communication signal research,” presented at XML 2002, Baltimore, MD, December 2002.
Kenneth Holladay, Danny Lents, "Field calibration chaos ceases," InTech, November 1999.
Kenneth Holladay, "Measurement: wrestling the mother of all nonstandards," InTech, August 1998.
Kenneth Holladay, "Calibrating HART transmitters," InTech, May 1996.
Kenneth Holladay, "Using the HART protocol to manage for quality," ISA Proceedings, Paper #94-617, October 1994.
Kenneth Holladay, "Are you calibrating your smart transmitters properly?" Instrumentation and Control Systems, September 1993.
Kenneth Holladay, "Results of energy audits conducted in citrus plants during 1979-1980," Transactions of the 1981 Citrus Engineering Conference, March, 1981.