
  • Acceleration of the 3D FDTD Algorithm in Fixed-point

    Arithmetic using Reconfigurable Hardware

    A Thesis Presented

    by

    Wang Chen

    to

    The Department of Electrical and Computer Engineering

    in partial fulfillment of the requirements

    for the degree of

    Doctor of Philosophy

    in

    Electrical Engineering

    in the field of

    Electrical Engineering

    Northeastern University

    Boston, Massachusetts

    August 2007

  • © Copyright 2007 by Wang Chen

    All Rights Reserved

  • NORTHEASTERN UNIVERSITY

    Graduate School of Engineering

    Thesis Title: Acceleration of the 3D FDTD Algorithm in Fixed-point Arithmetic

    using Reconfigurable Hardware.

    Author: Wang Chen.

    Department: Electrical and Computer Engineering.

    Approved for Thesis Requirements of the Doctor of Philosophy Degree

    Thesis Advisor: Prof. Miriam Leeser Date

    Thesis Committee Member: Prof. Carey Rappaport Date

    Thesis Committee Member: Prof. Charles DiMarzio Date

    Department Chair: Prof. Ali Abur Date

    Graduate School Notified of Acceptance:

    Director of the Graduate School: Prof. Yaman Yener Date


    Copy Deposited in Library:

    Reference Librarian Date

  • Abstract

    Understanding and predicting electromagnetic behavior is increasingly important in modern technology. The Finite-Difference Time-Domain (FDTD) method is a powerful computational electromagnetic technique for modelling electromagnetic space. However, the method is computationally complex and time consuming. Implementing the algorithm in hardware greatly increases its computational speed and widens its range of use.

    This dissertation presents a fixed-point 3D UPML (Uniaxial Perfectly Matched Layer) FDTD FPGA accelerator, which supports a wide range of materials including dispersive media. By analyzing the performance of fixed-point arithmetic in the 3D FDTD algorithm, the correct fixed-point representation is chosen to minimize the relative error between the fixed-point and floating-point results. The FPGA accelerator supports the UPML absorbing boundary conditions, which perform better in dispersive soil and human tissue media than PML boundary conditions.

    The 3D UPML FDTD hardware accelerator has been designed and implemented on a WildStar-II Pro FPGA computing board. The computational speed of the hardware implementation represents a 25 times speedup compared to the software implementation running on a 3.0 GHz PC. This result indicates that the custom hardware implementation can achieve significant speedup for the 3D UPML FDTD algorithm.

  • The speedup of the FDTD hardware implementation is due to three major factors: fixed-point representation, custom memory interface design, and pipelining and parallelism. The FDTD method is a data-intensive algorithm, and the bottleneck of the hardware design is its memory interface. Given the limited bandwidth between the FPGA and on-board memories, a carefully designed custom memory interface allows full utilization of the memory bandwidth and greatly improves performance. The FDTD algorithm is also computationally intensive. By considering the trade-offs between resources and performance, pipelining and parallelism are implemented to achieve optimal design performance with the available hardware resources.

  • Acknowledgements

    First of all, I would like to thank my advisor Dr. Miriam Leeser, not only for her invaluable advice and encouragement in carrying out my graduate research, but also for her great guidance and help during my six years of graduate study.

    I also greatly appreciate the friendship and support of Dr. Carey Rappaport and Panos Kosmas in CenSSIS, the Center for Subsurface Sensing and Imaging Systems. This dissertation would not have been possible without their help, support and their FDTD model.

    I would like to thank Dr. Charles DiMarzio, who served on both my master's and doctoral committees, for his constant help and support of my research.

    It has been a pleasure to cooperate with all of my colleagues and friends in the Reconfigurable Computing Laboratory at Northeastern University. I would like to thank all of them for their suggestions, help and support.

    Finally, I leave my special acknowledgement to my wife Lan, and my parents, Xiuzhi Chen and Yuanrui Zhang, for their constant love and great support.

    This research is supported in part by CenSSIS, the Center for Subsurface Sensing and Imaging Systems, under the Engineering Research Centers Program of the National Science Foundation.

  • Contents

    1 INTRODUCTION
    1.1 Overview
    1.2 Contributions
    1.3 Dissertation Structure

    2 BACKGROUND
    2.1 The Finite-Difference Time-Domain (FDTD) Method
    2.1.1 FDTD Basics
    2.1.2 FDTD Algorithm
    2.1.3 Absorbing Boundary Conditions
    2.1.3.1 The Mur-type absorbing boundary conditions
    2.1.3.2 The UPML absorbing boundary conditions
    2.2 FPGA and Reconfigurable Computing
    2.2.1 FPGA Basics
    2.2.2 FPGA Computing Boards
    2.2.3 Reconfigurable Computing Basics
    2.2.3.1 Pipelining
    2.2.3.2 Parallelism
    2.2.3.3 Memory Hierarchy and Interface
    2.2.3.4 Summary
    2.2.4 Reconfigurable Computing Design Flow
    2.3 Finite Precision
    2.3.1 Floating-Point Arithmetic

    2.3.2 Fixed-Point Arithmetic
    2.3.3 Fixed-Point Quantization
    2.4 Summary

    3 RELATED WORK
    3.1 Applications of the FDTD Method
    3.2 Parallel-computing Systems for FDTD Acceleration
    3.3 VLSI Implementations of the FDTD Method
    3.4 FPGA Implementations of the FDTD Method
    3.5 Summary

    4 FDTD ON FPGA
    4.1 The Advantages of Putting FDTD on an FPGA
    4.1.1 Speed, Size, Power and Flexibility
    4.1.2 Good for Parallelism and Deep Pipelining
    4.1.3 Suitable for Fixed-Point Arithmetic
    4.2 Key Problems to Solve - FDTD on an FPGA
    4.2.1 Determine the right precision for fixed-point representation
    4.2.2 Determine the memory organization and the memory interface
    4.2.3 Determine the proper use of pipelining and parallelism
    4.3 Summary

    5 ALGORITHM ANALYSIS
    5.1 Target FDTD Models
    5.1.1 The GPR Buried Object Detection Forward Model
    5.1.2 The Breast Cancer Detection Forward Model
    5.1.3 The Spiral Antenna Model
    5.1.4 Simplification and 2D Structure
    5.1.5 Six Target Models
    5.2 Data Analysis
    5.2.1 Fixed-point Analysis
    5.2.2 Fixed-point Quantization
    5.3 Hardware Design Analysis

    5.3.1 Memory hierarchy and memory interface
    5.3.2 Managed cache module
    5.3.2.1 The memory transfer bottleneck
    5.3.2.2 Optimizing data flow and processing core
    5.3.2.3 Expand to Three Dimensions
    5.3.3 Pipelining and Parallelism
    5.3.3.1 Pipelining
    5.3.3.2 Parallelism
    5.3.3.3 Two Hardware Implementations
    5.4 Summary

    6 FDTD DESIGN, IMPLEMENTATION and RESULTS
    6.1 Overview of the FDTD hardware implementation
    6.2 I/O Interface and DMA Transfer
    6.2.1 WildStar-II Pro's I/O Interface
    6.2.2 DMA Transfer
    6.2.3 DMA Control Logic and FIFO Components
    6.2.4 DMA Interface Extension
    6.3 Datapath
    6.3.1 Combined Datapath
    6.3.2 Pipelining
    6.4 Data Flow
    6.4.1 Data Arrangement
    6.4.2 On-board Memories
    6.4.3 BlockRAM Cache Module
    6.5 Control State Machines
    6.6 FDTD Implementations and Results
    6.6.1 2D UPML FDTD Implementation
    6.6.2 3D UPML FDTD Implementation — Single-Cell Full-UPML
    6.6.3 3D UPML FDTD Implementation — Double-Cell Full-UPML
    6.6.4 3D UPML FDTD Implementation — UPML/Center Combined
    6.6.5 Results Summary
    6.7 Summary

    7 CONCLUSIONS and FUTURE WORK
    7.1 Conclusion
    7.2 Future Work

    Bibliography

  • List of Tables

    2.1 List of the main features of the WildStar-II Pro FPGA boards
    4.1 Relative error between fixed-point arithmetic and floating-point arithmetic over 200000 timesteps
    5.1 Detailed Specifications of the Target FDTD Models
    5.2 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D GPR Landmine Detection Model
    5.3 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D GPR Landmine Detection Model
    5.4 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D Breast Cancer Detection Model
    5.5 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D Breast Cancer Detection Model
    5.6 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D Spiral Antenna Model
    5.7 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D Spiral Antenna Model
    5.8 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D GPR Landmine Detection Model
    5.9 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D GPR Landmine Detection Model
    5.10 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D Breast Cancer Detection Model
    5.11 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D Breast Cancer Detection Model
    5.12 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D Spiral Antenna Model
    5.13 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D Spiral Antenna Model
    6.1 Read/Write ratio in different FDTD implementations
    6.2 3D FDTD hardware implementation performance results

  • List of Figures

    2.1 The geometrical representation of the 3D Yee cell
    2.2 A Complete Flow Diagram of the FDTD Method
    2.3 Speed vs. flexibility of software, FPGAs and ASICs
    2.4 Xilinx FPGA structure and its CLB
    2.5 Block diagram of the WildStar-II Pro board
    2.6 Simple Structure Diagram of the Normal COTS Reconfigurable Computing Board
    2.7 Structure of a pipeline in the free space model of the 2D FDTD algorithm
    2.8 The FPGA Design Flow
    2.9 Single precision floating-point representation
    2.10 Double precision floating-point representation
    2.11 Uninterpreted fixed-point representation
    2.12 Interpreted fixed-point representation
    2.13 Multiplication of fixed-point representation
    5.1 3D FDTD application of land mine detection using ground penetrating radar
    5.2 3D FDTD application of microwave breast cancer detection
    5.3 The floor plan of the spiral antenna model
    5.4 Structural diagram of the simplification and the pseudo-2D FDTD algorithm
    5.5 Data structure of the fixed-point representation
    5.6 Relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths
    5.7 Structural Diagram of the Memory Interface
    5.8 Structural Diagram of the Simple 2-Row Cache Module
    5.9 Structural Diagram of the 2D Managed Cache Module
    5.10 Structural Diagram of the 4-Slice Caching Design
    5.11 Structure Diagram of the 4×3 Rows Caching Module
    5.12 Running Two Cells in Parallel
    5.13 The UPML boundary conditions cells and non-UPML center cells in the model space
    6.1 Overview Diagram of the FDTD hardware Design
    6.2 DMA Control Logic and FIFO Components
    6.3 The block diagram of the 3-level memory interface
    6.4 Diagram of the magnetic updating pipeline
    6.5 Diagram of the electric updating algorithm on different media
    6.6 Diagram of the electric updating pipeline
    6.7 Diagram of the storage of EM field data on on-board memories
    6.8 Diagram of detailed data flow on the 3D electric updating pipeline module
    6.9 Diagram of hierarchy level of control state machines
    6.10 Data flow of the 2D UPML FDTD hardware Design
    6.11 Performance of the 2D UPML FDTD hardware Design
    6.12 Data flow of the 3D FDTD field updating algorithms
    6.13 Data flow of the 3D UPML FDTD hardware design
    6.14 Difference between the 3D Single-Cell and Double-Cell Full-UPML FDTD hardware design
    6.15 Data flow of the 3D Double-Cell Full-UPML FDTD hardware design
    6.16 Performance of the 3D Double-Cell UPML FDTD hardware Design
    6.17 Data flow of the 3D UPML/Center combined FDTD hardware design

  • Chapter 1

    INTRODUCTION

    1.1 Overview

    Understanding and predicting electromagnetic behavior is increasingly needed in key electrical engineering technologies such as cellular phones, mobile computing, lasers and photonic circuits. Since the Finite-Difference Time-Domain (FDTD) method provides a direct, time-domain solution to Maxwell's equations in differential form with relatively good accuracy and flexibility, it has become a powerful method for solving a wide variety of electromagnetics problems. The FDTD method was not widely used until the past decade, when computing resources improved. Even today, the computational cost is still too high for real-time application of the FDTD method.

    Much effort has been spent on software acceleration: supercomputers and parallel computer arrays have been used to run the FDTD algorithm in software. However, real-time application of the FDTD algorithm needs faster speed as well as a smaller package. Although Application-Specific Integrated Circuits (ASICs) provide much faster speed, people hesitate to apply the FDTD algorithm to ASICs due to their high development cost.

    Recently, as high-capacity Field-Programmable Gate Arrays (FPGAs) have emerged, people have become interested in reconfigurable hardware implementations of the FDTD algorithm for faster calculation and real-time applications.

    FPGAs offer the benefits of both ASICs and software. Like an ASIC, an FPGA can implement millions of gates of logic, designed optimally for a specific task, in a single integrated circuit. Although FPGA-based designs may not match the performance of ASICs, few applications will ever merit the expense of a costly ASIC. Like software, FPGAs can be reprogrammed by designers at any time, and are therefore flexible and convenient. Although FPGA-based development is still costly compared to software, FPGA-based designs can offer extremely high performance, surpassing any other programmable solution.

    There have also been a number of recent efforts to implement the FDTD algorithm on graphics processing units (GPUs). Graphics cards are widely available and relatively cheap compared to FPGAs. Instead of working on a hardware design, designers use C++ and OpenGL to program the FDTD algorithm on the GPU. However, because the GPU is designed for graphics calculations, its hardware structure is fixed; there is little flexibility to upgrade the FDTD design or change the data type. An FPGA-based design offers much better flexibility, so it can achieve higher precision and faster speed.

    We present the first fixed-point 3D UPML FDTD FPGA accelerator, which supports a wide range of materials including dispersive media. By analyzing the performance of fixed-point arithmetic in the 3D FDTD algorithm, the correct fixed-point representation is chosen to minimize the relative error between fixed-point and floating-point results. The FPGA accelerator supports the UPML absorbing boundary conditions, which perform better in dispersive soil and human tissue media than PML boundary conditions.

    The 3D UPML FDTD hardware accelerator has been designed and implemented on a WildStar-II Pro FPGA computing board. The computational speed of the hardware implementation represents a 25 times speedup compared to the software implementation running on a 3.0 GHz PC. This result indicates that the custom hardware implementation can achieve significant speedup for the 3D UPML FDTD algorithm.

    The speedup of the FDTD hardware implementation is due to three major factors: fixed-point representation, custom memory interface design, and pipelining and parallelism. The FDTD method is a data-intensive algorithm, and the bottleneck of the hardware design is its memory interface. Given the limited bandwidth between the FPGA and on-board memories, a carefully designed custom memory interface allows full utilization of the memory bandwidth and greatly improves performance. The FDTD algorithm is also computationally intensive. By considering the trade-offs between resources and performance, pipelining and parallelism are implemented in order to achieve optimal design performance based on the available hardware resources.

    1.2 Contributions

    The contributions of this research are:

    • This research presents a fixed-point 3D UPML FDTD FPGA accelerator which supports a wide range of materials including dispersive media. The 3D UPML FDTD hardware accelerator has been designed and implemented on a WildStar-II Pro FPGA computing board. The computational speed of the hardware implementation represents a 25 times speedup compared to the software implementation running on a 3.0 GHz PC. This result indicates that the custom hardware implementation can achieve significant speedup for the 3D UPML FDTD algorithm.

    • The 3D UPML FDTD algorithm has been carefully analyzed to find an efficient hardware architecture. A carefully designed custom memory interface and managed cache module allow full utilization of the memory bandwidth and greatly improve hardware design performance. By considering the trade-offs between resources and performance, pipelining and parallelism are implemented to achieve optimal design performance based on the available hardware resources.

    • Our 3D FDTD FPGA accelerator is the first fixed-point 3D FDTD FPGA accelerator. Recent approaches to hardware acceleration of the FDTD algorithm focus on floating-point arithmetic implementations [1] [2]. Although floating-point representation provides relatively high data precision, because of the favorable data properties and algorithm structure of the FDTD method, the relative error between the fixed-point results and double-precision floating-point results can be kept very low as long as the fixed-point representation is chosen carefully. Since fixed-point arithmetic components are faster and consume less hardware resources than floating-point ones, the fixed-point arithmetic FDTD accelerator has better performance.

    • Our 3D FDTD FPGA accelerator is the first hardware implementation to support the UPML absorbing boundary condition (ABC), which performs better in dispersive soil and human tissue media than the PML ABC. The UPML ABC provides a better reflection ratio at all incidence angles in dispersive media compared to other approaches. Although the UPML FDTD algorithm is more computationally expensive and consumes more memory, its regular structure makes it relatively easy to implement in hardware compared to the PML and Mur-type ABCs.

    • Our 3D FDTD FPGA accelerator implements and supports GPR mine detection, breast cancer detection, and spiral antenna research. Our research provides a good example of FPGA acceleration of FDTD applications in fixed-point arithmetic. With the experience of our research, other FDTD applications can be ported to FPGAs more easily.
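    The fixed-point error analysis summarized in these contributions can be illustrated with a short sketch. This is not the thesis's actual analysis code; the random signal, seed, and bit-widths below are hypothetical stand-ins for FDTD field data:

    ```python
    import numpy as np

    def to_fixed(x, frac_bits):
        """Quantize x onto a fixed-point grid with frac_bits fractional bits."""
        scale = 2.0 ** frac_bits
        return np.round(x * scale) / scale

    # Hypothetical stand-in for FDTD field samples.
    rng = np.random.default_rng(0)
    field = rng.uniform(-1.0, 1.0, size=10_000)

    for frac_bits in (8, 16, 24):
        q = to_fixed(field, frac_bits)
        rel_err = np.abs(q - field).mean() / np.abs(field).mean()
        print(f"{frac_bits} fractional bits: mean relative error ~ {rel_err:.1e}")
    ```

    Each additional fractional bit roughly halves the quantization error, which is why a carefully chosen bit-width can keep fixed-point results very close to double-precision floating-point results.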

    1.3 Dissertation Structure

    The outline of the remainder of this dissertation is as follows.

    Chapter 2 starts with an introduction to the Finite-Difference Time-Domain method, the structure of the FDTD algorithm, and the UPML absorbing boundary conditions. It then discusses the FPGA and the WildStar-II Pro reconfigurable boards on which the design is implemented. Finally, finite precision theories are introduced for a better understanding of the algorithm data analysis in Chapter 5.

    Chapter 3 presents a survey of work related to our research, including research on the FDTD method, parallel computing, VLSI implementations, and FPGA implementations of the FDTD method.

    Chapter 4 first analyzes and presents the advantages of implementing the FDTD algorithm on an FPGA. Then it lists the key problems which need to be solved in order to have a successful FDTD hardware implementation. These problems are solved in Chapter 5 and Chapter 6.

    Chapter 5 starts with an introduction to the six target FDTD models we used for data analysis and design verification. It then gives the details of our research on the fixed-point data analysis of the 3D FDTD algorithm. It compares the relative error of the electromagnetic field data for different data bit-widths to choose a suitable bit-width for the fixed-point representation. Finally, this chapter carefully analyzes the 3D UPML FDTD algorithm to find an efficient hardware architecture. A carefully designed custom memory interface and managed cache module are proposed and analyzed. The implementation of pipelining and parallelism is discussed based on the trade-offs between hardware resources and design performance.

    Chapter 6 discusses several technical details of the 3D UPML FDTD implementation, including the I/O interface, DMA transfer, datapath, data flow, memory interface, managed cache module, and control state machines. After that, several FDTD implementations are introduced and their performance results are presented. These FDTD implementations are introduced in the sequence of our design work; their data flow and control logic become more and more complex as we further optimize the design to achieve better performance. This chapter also contains a summary of the performance of the FDTD implementations.

    Chapter 7 draws conclusions and gives suggestions for future work.

  • Chapter 2

    BACKGROUND

    In this chapter we introduce the background of the FDTD method, FPGAs, and finite precision, which are the basis of this dissertation.

    2.1 The Finite-Difference Time-Domain (FDTD) Method

    The discovery of Maxwell's equations was one of the outstanding achievements of 19th-century science. The equations give a unified and complete theory for understanding electromagnetic wave phenomena, and solving Maxwell's equations has become an important method for investigating the propagation, radiation and scattering of electromagnetic waves.

    The FDTD method provides a direct time-domain solution of Maxwell's equations. After Yee first introduced it in 1966 [3], people began to realize its accuracy and flexibility for solving electromagnetic problems. Because of its clear structure, easy-to-understand algorithm, and ability to solve Maxwell's equations on most scales in many different environments, the FDTD method has become a powerful technique for solving a wide variety of electromagnetic problems. Such problems include discrete scattering analysis, antenna design, radar research, digital circuit packaging on multi-layer circuit boards, tumor detection, and various medical studies of the effects of electromagnetic waves on the human body [4, 5, 6, 7, 8].

    In this section, we will first introduce the basics of the FDTD method by solving

    Maxwell’s equations step by step. Then we will introduce the structure and properties

    of the FDTD software algorithm. Finally we will talk about the absorbing boundary

    conditions for the FDTD method.

    2.1.1 FDTD Basics

    The differential form of Maxwell’s equations and constitutive relations can be written

    with the following equations:

    ∇× ~E = −∂~B

    ∂t− σm ~H − ~M (2.1)

    ∇× ~H = ∂~D

    ∂t+ σe ~E + ~J (2.2)

    ∇ · ~D = ρe; ∇ · ~B = ρm (2.3)

    ~D = � ~E; ~B = µ ~H (2.4)

    In Equations (2.1)-(2.4), the following symbols are used:

    ~E: electric field ~D: electric flux density~H: magnetic field ~B: magnetic flux density~J : electric current density ~M : equivalent magnetic current density�: electrical permittivity µ: magnetic permeabilityσe: electric conductivity σm: equivalent magnetic conductivityρe: electric charge density ρm: equivalent magnetic charge density


    First, the FDTD method replaces ~D and ~B in Equation (2.1)-(2.2) with ~E and ~H

    according to the constitutive relations in Equation (2.4), which yields Maxwell’s curl

    equations:

µ ∂~H/∂t = −∇ × ~E − σm ~H − ~M        (2.5)

ε ∂~E/∂t = ∇ × ~H − σe ~E − ~J        (2.6)

    All the curl operators are then written in differential form and replaced by partial

    derivative operators, as shown in Equation (2.7), with the ~E and ~H vector separated

    into three vectors in three dimensions(e.g. ~E is separated into Ex, Ey, Ez):

curl ~F = ∇ × ~F = x̂(∂Fz/∂y − ∂Fy/∂z) + ŷ(∂Fx/∂z − ∂Fz/∂x) + ẑ(∂Fy/∂x − ∂Fx/∂y)        (2.7)

    Then we can rewrite Maxwell’s curl equations into six equations in differential

    form in rectangular coordinates:

µ ∂Hx/∂t = ∂Ey/∂z − ∂Ez/∂y − σm Hx − Mx        (2.8)

µ ∂Hy/∂t = ∂Ez/∂x − ∂Ex/∂z − σm Hy − My        (2.9)

µ ∂Hz/∂t = ∂Ex/∂y − ∂Ey/∂x − σm Hz − Mz        (2.10)

ε ∂Ex/∂t = ∂Hz/∂y − ∂Hy/∂z − σe Ex − Jx        (2.11)

ε ∂Ey/∂t = ∂Hx/∂z − ∂Hz/∂x − σe Ey − Jy        (2.12)

ε ∂Ez/∂t = ∂Hy/∂x − ∂Hx/∂y − σe Ez − Jz        (2.13)

    Second, the FDTD method establishes a model space, which is the physical region

    where Maxwell’s equations are solved or the simulation is performed. The model


    space is then discretized to a number of cells and the time duration “t” is discretized

    to a number of time steps. The size of the unit cell is often set to less than one

    tenth of the electromagnetic wavelength. The unit time step is calculated as the time

    duration the electromagnetic wave spends travelling through a unit cell. Every cell in

    the model space has its associated electric and magnetic fields E and H. The material

    type of each cell is also specified by its permittivity, permeability and conductivity.

    We use model size to represent the number of cells in the model space. The model

    size depends on the size of unit cell. The smaller the unit cell, the larger the model

    size.
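These sizing rules can be sketched in a few lines of code (a free-space sketch; the function names and the 3 GHz example are illustrative, not from the thesis):

```python
# Sketch of the discretization rules above (assumption: free-space
# propagation at the speed of light c).
C = 299792458.0  # speed of light, m/s

def cell_size(wavelength, fraction=10):
    """Unit cell edge: set to less than one tenth of the wavelength."""
    return wavelength / fraction

def time_step(dx):
    """Unit time step: the time the wave spends crossing one unit cell.
    (Practical codes often use a slightly stricter Courant-limited step.)"""
    return dx / C

wl = C / 3e9          # a 3 GHz wave: wavelength of about 0.1 m
dx = cell_size(wl)    # cells of about 1 cm
dt = time_step(dx)    # about 33 ps per time step
```

Note how halving the unit cell doubles the model size in each dimension and also halves the time step, which is why small cells quickly become expensive.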

    The 3D grid shown in Figure 2.1, named the “Yee-Cell” [3], is very helpful for

    understanding the discretized electromagnetic model space.

    Figure 2.1: The geometrical representation of the 3D Yee cell.

    As illustrated in Figure 2.1, the “Yee-Cell” is a small cube, which can be treated

    as one single cell picked from the discretized model space. ∆x, ∆y, ∆z are the three

    dimensions of this cube. We use (i, j, k) to denote the point whose real coordinate

    is (i∆x, j∆y, k∆z) in the model space. Instead of placing E and H components in

    the center of each cell, the E and H field components are interlaced so that every E


    component is surrounded by four circulating H components, and every H component

is surrounded by four circulating E components. The operation ∇× or “curl” in Equations (2.5), (2.6) and (2.7) is easily interpreted using this figure. For example, the Hx component located at point (i, j + 1/2, k + 1/2) is surrounded by four circulating

    E components, two Ey components and two Ez components, matching Equation

    (2.8), which states that the Hx component increases directly in response to a curl

    of E components in the x direction with a constant µ. The constant µ specifies the

    magnetic permeability of the material at the location of this unit cell. Similarly, the

    E components increase directly in response to the curl of the H components, with a

    constant proportional to the electrical permittivity � of the material at the current

    location. Maxwell’s equations in rectangular coordinates, Equation (2.8)-(2.13), can

    be clearly illustrated by Yee’s cell.

In this section, we represent an electric component Ez at the discretized 3D coordinate (i∆x, j∆y, (k + 1/2)∆z) as Ez|_{i,j,k+1/2}, and when the current time is in the discretized N-th time step, we denote the same component as Ez|^{N}_{i,j,k+1/2}.

    Third, all the partial derivative operators in Equations (2.8)-(2.13) are replaced by

    their central difference approximations as illustrated in Equation (2.14). The second

    order part of the Taylor series expansion is discarded to keep the algorithm simple

    and reduce the computational cost. Also, the variable without partial derivative can

    be approximated by time averaging as shown in Equation (2.15), which has similar

    structure to the central difference approximation.

∂f(uo)/∂u = [f(uo + ∆u) − f(uo − ∆u)] / (2∆u) + O[(∆u)²]        (2.14)

f(uo) = [f(uo + ∆u) + f(uo − ∆u)] / 2        (2.15)
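The second-order accuracy claimed in Equation (2.14) is easy to verify numerically: halving the step size should cut the error by roughly a factor of four (the test function below is illustrative):

```python
# Numerical check of the central difference approximation in Equation (2.14):
# the leading error term is O[(du)^2], so halving du should shrink the error
# by about a factor of four.
def central_diff(f, u0, du):
    return (f(u0 + du) - f(u0 - du)) / (2 * du)

f = lambda u: u ** 3                          # exact derivative at u0=1 is 3
err1 = abs(central_diff(f, 1.0, 0.10) - 3.0)  # error with step 0.10
err2 = abs(central_diff(f, 1.0, 0.05) - 3.0)  # error with step 0.05
ratio = err1 / err2                           # close to 4 for a 2nd-order scheme
```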


    For example, Equation (2.8) is changed to:

µ [Hx(t0 + ∆t/2) − Hx(t0 − ∆t/2)] / ∆t
    = [Ey(z0 + ∆z/2) − Ey(z0 − ∆z/2)] / ∆z
    − [Ez(y0 + ∆y/2) − Ez(y0 − ∆y/2)] / ∆y
    − σm [Hx(t0 + ∆t/2) + Hx(t0 − ∆t/2)] / 2
    − Mx        (2.16)

    After these modifications, the FDTD method turns Maxwell’s equations into a

    set of linear equations from which we can calculate the electric and magnetic fields

    in every cell in the model space. We call these equations the electric and magnetic

    field updating algorithms. Six field updating algorithms form the basis of the FDTD

    method. For example, the field updating algorithm for the Hx component, derived

    from Equation (2.16) or Equation (2.8), is given by:

(µ/∆t + σm/2) Hx|^{N+1/2}_{i,j+1/2,k+1/2} = (µ/∆t − σm/2) Hx|^{N−1/2}_{i,j+1/2,k+1/2}
    + (1/∆z) [Ey|^{N}_{i,j+1/2,k+1} − Ey|^{N}_{i,j+1/2,k}]
    − (1/∆y) [Ez|^{N}_{i,j+1,k+1/2} − Ez|^{N}_{i,j,k+1/2}]
    − Mx|^{N}_{i,j+1/2,k+1/2}        (2.17)
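As a software sketch, the Hx updating equation (2.17) for a single cell translates directly into a few arithmetic operations (argument names are illustrative; ey_k1/ey_k0 and ez_j1/ez_j0 are the surrounding E components at step N):

```python
# Direct transcription of the Hx updating equation (2.17) for one cell.
# hx_prev is Hx at time step N-1/2; mu and sigma_m describe the material
# of this unit cell; mx is the equivalent magnetic current density term.
def update_hx(hx_prev, ey_k1, ey_k0, ez_j1, ez_j0, mx,
              mu, sigma_m, dt, dy, dz):
    lhs_coef = mu / dt + sigma_m / 2
    rhs_coef = mu / dt - sigma_m / 2
    curl = (ey_k1 - ey_k0) / dz - (ez_j1 - ez_j0) / dy
    return (rhs_coef * hx_prev + curl - mx) / lhs_coef
```

The two coefficients depend only on the material and the time step, so in practice they are precomputed once per material type rather than per update.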

    2.1.2 FDTD Algorithm

    The FDTD algorithm is based on these field updating equations. The flow diagram of

    the FDTD algorithm is shown in Figure 2.2. The FDTD algorithm first establishes the

    computational model space, specifies the storage memory for all the electromagnetic

    field data, specifies the material properties for every cell in the model space and

    specifies the excitation source. The excitation source can be a point source, a plane

wave, an electric field or another option depending on the application. The FDTD

    algorithm then runs through the field updating algorithms on every cell in the model

    space. The electromagnetic field data are updated and stored back to the original

    memory. Once it finishes, the algorithm will go to the next time step and run through


    all the cells again. The algorithm iterates the updating computations for each time

    step, until the desired time period is completed. The boundary condition part consists

    of special algorithms dealing with the unit cells located on the boundary of the model

    space. For these cells, some of the adjacent cells may not exist, so special methods

    are needed to update them appropriately. The output of the FDTD algorithm can

    be any electric or magnetic field data from any cell in any time step.
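The control flow above can be sketched as a time-stepping skeleton (the update, source, boundary and output routines below are placeholders, not the thesis implementation):

```python
# Skeleton of the FDTD time-stepping loop shown in Figure 2.2.
def fdtd_run(n_steps, update_h, update_e, apply_source, apply_abc, record):
    outputs = []
    for step in range(n_steps):
        update_h(step)      # magnetic field on all cells (uses stored E data)
        update_e(step)      # electric field on all cells (uses the new H data)
        apply_source(step)  # excitation: point source, plane wave, ...
        apply_abc(step)     # boundary conditions on the outer cells
        outputs.append(record(step))
    return outputs
```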

Figure 2.2: A Complete Flow Diagram of the FDTD Method (initialization establishes the model space and specifies the material properties and the excitation source; each time step then updates the magnetic field data on all cells, updates the electric field data on all cells, applies the source excitation and the boundary conditions, and advances to the next time step until the simulated time is over)

    The electric and magnetic fields depend on each other. As we can see from

    Equation (2.5), in Maxwell’s curl equations, the curl of the electric field E in space

    equals the partial derivative of the magnetic field H with respect to “t”. After the

    modifications of the FDTD method, the expansion of the curl operation depends on

    the E field in the cells surrounding the current cell. The expansion of the partial

    derivative of H with respect to “t” depends on the current magnetic field minus the

    stored previous time step’s magnetic field. In other words, at any cell in the model


    space, the current time step’s magnetic field depends on the stored previous time

    step’s magnetic field and the electric fields in the surrounding cells. Similarly, at any

    cell in the model space, the current time step’s electric field depends on the stored

    previous time step’s electric field and the magnetic fields in the surrounding cells.

    Because of the dependence between the electric and magnetic fields, we cannot

    update them in parallel. So the FDTD algorithm updates the electric and mag-

    netic fields in an interlaced manner as shown in Figure 2.2. Notice that electric and

    magnetic field updating algorithms are interlaced in the time domain so magnetic

    components will be updated first based on the previous stored electric data and then

    electric components will be updated based on the newly updated magnetic data. This

    interlacing mechanism will be iterated time step by time step until the job finishes.

    The FDTD algorithm is an accurate and successful modelling technique for com-

    putational electromagnetics. It is flexible, allowing the user to model a wide variety

    of electromagnetic materials and environments on most scales. It is also easy to un-

    derstand, with its clear structure and direct time domain calculation. However, the

    FDTD algorithm is a data intensive and computationally intensive algorithm. The

    amount of data in the FDTD model space can be very large for large model sizes,

    creating a heavy burden on both memory storage and access. The computation is

    also intensive for each cell in the FDTD model space, including updating six electric

    and magnetic fields and also updating the boundary conditions. The heavy burden

    on data access and complex computation makes the FDTD algorithm run slowly on a

    single processor. Modelling an electromagnetic problem using the FDTD method can

    easily require several hours. Without powerful computational resources, the FDTD

    models are too time-consuming to be implemented on a single computing node. Ac-

    celerating the FDTD method with inexpensive and compact hardware will greatly

    expand its application and popularity.


    2.1.3 Absorbing Boundary Conditions

    The above electric and magnetic updating algorithms work accurately inside the

    model space. However, because the cells on the boundary of the model space do

    not have the adjacent cells needed in the updating algorithms, the algorithm does

    not work properly on the boundary, so there are algorithm-introduced reflections

    on the boundary of the model space. Special techniques are necessary to deal with

    the cells on the boundary, prevent nonphysical reflections from outgoing waves and

    simulate the extension of the model space to infinity. These special techniques are

    called the absorbing boundary conditions (ABC). The development of efficient and

    accurate ABCs is very important for the FDTD method and several types of ABC

    have been proposed, which are grouped into two major approaches, analytical and

perfectly matched layer.

    2.1.3.1 The Mur-type absorbing boundary conditions

    The first group of ABCs are called the analytical ABCs, which are mainly deduced

    from the electromagnetic wave equation. The Mur-type ABC [9] is one of the popu-

    lar ABCs in this group. Mur-type ABC is deduced from the one-way wave equation,

    which is a part of the wave equation that allows wave propagation only in one di-

    rection [5]. The wave equation in three dimensions shown in Equation (2.18) can be

    written as Equation (2.19), or more compactly, as Equation (2.20). A part of the

    wave equation, L−U = 0, as in Equation (2.21), is the one-way wave equation which

    absorbs most of the impinging waves on the boundary of the model space.

(∂²/∂x² + ∂²/∂y² + ∂²/∂z² − (1/c²) ∂²/∂t²) U = 0        (2.18)


[∂²/∂x² − (1/c²) ∂²/∂t² (1 − (c² ∂²/∂y²)/(∂²/∂t²) − (c² ∂²/∂z²)/(∂²/∂t²))] U = 0        (2.19)

[∂²/∂x² − (1/c²) ∂²/∂t² (1 − S²)] U = LU = L+L−U = 0        (2.20)

(∂/∂x − (1/c) (∂/∂t) √(1 − S²)) U = L−U = 0        (2.21)

Depending on the approximation of √(1 − S²), there are first-order and second-order Mur-type ABCs. The first-order Mur-type ABC approximates √(1 − S²) = 1 and the second order uses √(1 − S²) = 1 − S²/2. The second-order Mur-type is more

    accurate but more complex than the first-order one. For example, the discretized

    version of the first-order Mur-type ABC for the Z component of the electric field is

    given by:

Ez|^{N+1}_{0,j,k+1/2} = Ez|^{N}_{1,j,k+1/2} + [(c∆t − ∆x)/(c∆t + ∆x)] [Ez|^{N+1}_{1,j,k+1/2} − Ez|^{N}_{0,j,k+1/2}]        (2.22)
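As a sketch, Equation (2.22) for one boundary cell at i = 0 becomes (argument names are illustrative):

```python
# First-order Mur update for a boundary cell at i=0, following Equation (2.22).
# ez_old_0 and ez_old_1 are Ez at step N at i=0 and i=1; ez_new_1 is Ez at
# step N+1 at i=1, already computed by the interior updating algorithm.
def mur_first_order(ez_old_0, ez_old_1, ez_new_1, c, dt, dx):
    k = (c * dt - dx) / (c * dt + dx)
    return ez_old_1 + k * (ez_new_1 - ez_old_0)
```

Note the in-place character the text describes: the update needs only field values that are already stored, so no extra memory is consumed.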

    The Mur-type ABC is easy to understand. It updates the electric field data on

    the boundary cells every time step after the main updating algorithms are finished,

    as shown in Figure 2.2. The updates of the field data are processed in-place with no

    extra memory consumption, which makes the Mur-type ABC a good memory saver.

    However, the biggest problem with the Mur-type ABC is that its quality is only

    good for incident angles up to 25 degrees [6]. Its quality deteriorates significantly for

    greater incidence angles. This property limits the usage of the Mur-type ABC.

    2.1.3.2 The UPML absorbing boundary conditions

    The other approach is called the Perfectly Matched Layer (PML) ABC [10]. PML

    ABC is a highly effective absorbing boundary condition. By setting the outer bound-

    ary of the model space to an absorbing material layer, PML can absorb most of

    the impinging waves and have low reflection. The most favorable property of the


    PML ABC is that its quality does not depend on incidence angle, polarization or

    frequency [6].

    The UPML (Uniaxial PML) ABC [7] is modified from the PML ABC by replacing

    the mathematical PML medium model with a physical uniaxial anisotropic PML

    medium, which keeps all the favorable properties of the PML ABC. Instead of setting

    only the outer boundary of the model space to the PML material, UPML uses a

    generalized formulation of the entire FDTD model space, providing both lossless,

    isotropic medium in the primary computation zone and individual UPML absorber

    in the outer boundary cells [6]. Based on the original UPML, a novel modification of

    UPML, the Single Pole Conductivity Model [11, 12], better supports lossy, dispersive

    media. The overall quality of the UPML ABC is more than 30dB better than the

    original PML on dispersive media. Compared to the Mur-type ABC, the quality of

    the UPML ABC is about 10dB better for incident angles less than 30 degrees and

    much better for incident angles larger than 30 degrees [12], because the Mur-type

    ABC deteriorates for greater incident angles [6].

    The advantage of this approach is its good quality. Besides the high quality of

    results in normal media, the quality of the UPML ABC is especially good for disper-

    sive media, which is useful in solving many realistic problems. Also, because UPML

    integrates boundary conditions with the electric field updating algorithms, the FDTD

    algorithm is more uniform in structure, making it a good model for hardware data-

    path design. The disadvantage of the UPML ABC is its high memory consumption.

The UPML ABC introduces nine extra field values for each cell in the model space,

    consuming 1.5 times more memory than the normal FDTD algorithm. Because ac-

    curacy of the ABC is very important for the FDTD algorithm, the PML and UPML

    ABCs are very popular in FDTD research and applications due to their high quality.
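This memory overhead can be checked with simple arithmetic (the model size and 4-byte field values below are illustrative assumptions):

```python
# UPML memory overhead: the basic FDTD algorithm stores 6 field components
# per cell, and UPML introduces 9 extra field values per cell, which is
# 1.5 times additional memory (2.5x total).
def fdtd_bytes(nx, ny, nz, fields=6, bytes_per_value=4):
    return nx * ny * nz * fields * bytes_per_value

normal = fdtd_bytes(100, 100, 100)              # 6 fields per cell
upml = fdtd_bytes(100, 100, 100, fields=6 + 9)  # 9 extra UPML fields
overhead = (upml - normal) / normal             # 1.5 times more memory
```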

    The time-harmonic Maxwell’s curl equation in frequency domain in the UPML


    regions can be written as:

∇ × E = −jωµ s̄ H;    ∇ × H = (jωε + σ) s̄ E        (2.23)

where s̄ is a tensor that can be used throughout the entire FDTD model space:

s̄ = diag[ sy sz / sx , sx sz / sy , sx sy / sz ]        (2.24)

sx = κx + σx/(jωε);    sy = κy + σy/(jωε);    sz = κz + σz/(jωε)        (2.25)

In the PML region, which is on the boundary of the model space, σx is the PML’s conductivity while κx is a real-valued parameter [6]. In the lossless primary computation zone, σ is always zero and κ is always one, so the medium is just the normal medium and the data-path is the normal FDTD computation. P and P² are introduced to reduce

    the complexity of this convolution calculation:

Px = (sy sz / sx) Ex,    Py = (sx sz / sy) Ey,    Pz = (sx sy / sz) Ez        (2.26)

P²x = (1/sy) Px,    P²y = (1/sz) Py,    P²z = (1/sx) Pz        (2.27)

After substituting the E components from Equation (2.26) into (2.23), and substituting ∂/∂t for jω in Equation (2.23), an example of the UPML algorithm in lossy,

    dielectric media is given by the following equations. The details of the UPML algo-

    rithm are well documented [6] and the novel modification of the UPML algorithm for

    dispersive media is introduced in several papers [11, 12, 13, 14, 15, 16, 17, 18]. For

    example, the discretized version of the UPML algorithm for Ex and Hx components


    are:

Px|^{N+1}_{i+1/2,j,k} = [(2ε − σ∆t)/(2ε + σ∆t)] Px|^{N}_{i+1/2,j,k}
    + [2∆t/(2ε + σ∆t)] ( [Hz|^{N+1/2}_{i+1/2,j+1/2,k} − Hz|^{N+1/2}_{i+1/2,j−1/2,k}] / ∆y
    − [Hy|^{N+1/2}_{i+1/2,j,k+1/2} − Hy|^{N+1/2}_{i+1/2,j,k−1/2}] / ∆z )        (2.28)

P²x|^{N+1}_{i+1/2,j,k} = [(2εκy − σy∆t)/(2εκy + σy∆t)] P²x|^{N}_{i+1/2,j,k}
    + [2ε/(2εκy + σy∆t)] [Px|^{N+1}_{i+1/2,j,k} − Px|^{N}_{i+1/2,j,k}]        (2.29)

Ex|^{N+1}_{i+1/2,j,k} = [(2εκz − σz∆t)/(2εκz + σz∆t)] Ex|^{N}_{i+1/2,j,k}
    + [1/(2εκz + σz∆t)] [(2εκx + σx∆t) P²x|^{N+1}_{i+1/2,j,k} − (2εκx − σx∆t) P²x|^{N}_{i+1/2,j,k}]        (2.30)

Bx|^{N+3/2}_{i,j+1/2,k+1/2} = [(2εκy − σy∆t)/(2εκy + σy∆t)] Bx|^{N+1/2}_{i,j+1/2,k+1/2}
    − [2ε∆t/(2εκy + σy∆t)] ( [Ez|^{N+1}_{i,j+1,k+1/2} − Ez|^{N+1}_{i,j,k+1/2}] / ∆y
    − [Ey|^{N+1}_{i,j+1/2,k+1} − Ey|^{N+1}_{i,j+1/2,k}] / ∆z )        (2.31)

Hx|^{N+3/2}_{i,j+1/2,k+1/2} = [(2εκz − σz∆t)/(2εκz + σz∆t)] Hx|^{N+1/2}_{i,j+1/2,k+1/2}
    + [1/((2εκz + σz∆t)µ)] [(2εκx + σx∆t) Bx|^{N+3/2}_{i,j+1/2,k+1/2} − (2εκx − σx∆t) Bx|^{N+1/2}_{i,j+1/2,k+1/2}]        (2.32)
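As a sketch, the Px update of Equation (2.28) for one cell reduces to two precomputable coefficients and a curl term (argument names are illustrative; hz_jp/hz_jm and hy_kp/hy_km are the neighboring H components at step N+1/2):

```python
# Transcription of the Px update in Equation (2.28) for one cell in a lossy
# dielectric with permittivity eps and conductivity sigma.
def update_px(px_prev, hz_jp, hz_jm, hy_kp, hy_km,
              eps, sigma, dt, dy, dz):
    c1 = (2 * eps - sigma * dt) / (2 * eps + sigma * dt)
    c2 = (2 * dt) / (2 * eps + sigma * dt)
    curl = (hz_jp - hz_jm) / dy - (hy_kp - hy_km) / dz
    return c1 * px_prev + c2 * curl
```

In the lossless primary computation zone sigma is zero, so c1 = 1 and c2 = dt/eps, and the update collapses to the normal FDTD computation, which is what makes the UPML formulation uniform across the whole model space.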

    2.2 FPGA and Reconfigurable Computing

    There are several options for implementing digital logic designs. The first option

    is to program the algorithms into computer software and run them on general pur-

    pose microprocessors. Designers can change the software instructions anytime so the

    digital logic can be changed easily and flexibly. Although software is very flexible,

    software running on general purpose microprocessors is slow compared to hardware


    implementations for many applications. The microprocessor must read and decode

    each software instruction and then execute it, creating a high execution overhead

    for each individual operation [19]. Also, the microprocessor can only handle limited

    parallelism and pipelining, while a custom hardware design can fully exploit the ben-

    efits of both. For many scientific computations and real-time applications, software

    implementation is too slow to be practical.

    The second option is to use an Application Specific Integrated Circuit (ASIC),

    which is designed for a specific computational task and achieves very fast speed and

    high efficiency when executing the exact computation for which it was designed [19].

    ASIC chips are small and consume less power than the other options, so they can be

    found in many products from mobile phones to aeroplanes. However, the development

    cost of ASICs is very high. What’s more, once the chip is fabricated, it cannot be

    modified at all. The chip needs to be redesigned and refabricated if any single part

    of the digital logic needs to be altered.

    Figure 2.3: Speed vs. flexibility of software, FPGAs and ASICs

    Reconfigurable computing devices, mostly known as Field Programmable Gate

    Arrays (FPGAs), offer the benefits of both ASICs and software. Like an ASIC, an

    FPGA can implement millions of gates of logic, which are designed optimally for


    a specific task, in a single integrated circuit, thus achieving high performance with

    low power and small size. Although the FPGA-based design may not challenge the

    performance of ASICs, the fact is that few applications will ever merit the expense of

    high cost ASICs. Also like software, FPGAs are reprogrammable by designers at any

    time, and are therefore flexible and convenient. Although FPGA-based development

    is still costly compared to software, FPGA-based design can offer extremely high per-

    formance, surpassing any other programmable solutions for many applications [20].

    Figure 2.3 compares speed and flexibility among software, FPGAs and ASICs. Rela-

    tive high speed, relative high flexibility and advantages in cost, size and power make

    FPGAs suitable for a wide variety of applications.

    In this section, we will first introduce the basics of reconfigurable devices and com-

    mercial FPGA computing boards. Then we will talk about reconfigurable computing

    based design.

    2.2.1 FPGA Basics

    An FPGA is a programmable hardware device where the hardware circuits are pro-

    grammable by the designer to create custom digital logic designs [21]. The structure

    of the FPGA is composed of configurable logic blocks (CLBs), programmable routing

    and Input/Output blocks (IOBs).

    Figure 2.4 shows the structure of an FPGA from Xilinx, Inc [22]. The CLBs are

    arranged in a matrix in the FPGA, and connected by programmable routing. IOBs

    are connected to CLBs via programmable routing also. Each CLB is a basic functional

    element that implements part of the algorithm. Digital logic is implemented in several

CLBs. The structure of CLBs varies in different FPGA chips; however, CLBs are

    normally made up of lookup tables (LUTs), carry logic and storage elements. The


Figure 2.4: Xilinx FPGA structure and its CLB (configurable logic blocks, I/O blocks and programmable routing; inset: the route map of a real FPGA chip)

    bottom right corner of Figure 2.4 shows the structural diagram of a slice in the Xilinx

    Virtex-II Pro FPGA chip; each CLB in the Virtex-II Pro chip has two slices. The

    lookup tables (LUTs) are the most important part of an FPGA. The 4-input LUT

shown in Figure 2.4 is composed of 2⁴ = 16 memory units. The input signal is the address

    data for the LUT to pick the corresponding value from the memory. The memory

    units can be programmed to store any truth table, or any function, so most logic

    can be efficiently implemented by one or more lookup tables. The storage elements

    are generally used as registers and the carry logic is usually added in order to avoid

    wasting LUTs and for faster speed.
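The lookup mechanism can be modeled in a few lines (an illustrative software model, not an FPGA primitive):

```python
# A 4-input LUT modeled as 2**4 = 16 programmable memory bits: the four
# input bits form an address that selects one stored bit, so the LUT can
# realize any 4-input boolean function.
class LUT4:
    def __init__(self, truth_table):
        assert len(truth_table) == 16
        self.mem = list(truth_table)

    def __call__(self, a, b, c, d):
        return self.mem[(a << 3) | (b << 2) | (c << 1) | d]

# Program the LUT to compute a XOR b XOR c XOR d (parity of the inputs).
xor4 = LUT4([bin(i).count("1") & 1 for i in range(16)])
```

Reprogramming the FPGA amounts to loading new contents into these memory bits, which is why any truth table, and hence most logic, maps onto one or more LUTs.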

    Most current FPGAs are SRAM-based, which means all the programmable LUTs


    and routing are based on SRAM bits [19]. Reconfiguring the FPGA is actually done

    by loading the bitstream into SRAM, which takes only milliseconds to complete.

    FPGAs can be reprogrammed in the target system without detaching the FPGA

    chip, a perfect match for hardware systems which need frequent upgrading or have

    several optional designs. FPGAs can be reprogrammed at run-time too, which means

    part of an FPGA can be reprogrammed when the system is running on the other

    part of the chip. These advantages make FPGAs very flexible and suitable for many

    applications.

    2.2.2 FPGA Computing Boards

    The FPGA computing resources we use in this research are typical commercial off

    the shelf (COTS) FPGA computing boards, which are widely available and easy to

set up. These boards normally contain one or two FPGA chips. Each FPGA chip may

    be connected to several on-board memories, including SRAM or DRAM. One FPGA

    and its associated on-board memory form the basic combination used for custom

    hardware computation. These computing boards are often PCI boards for a desktop

    computer or PCMCIA cards for a laptop computer. Data and control signals can

    be transferred between the FPGA computing board and the host PC via the PCI

interface, either via standard PCI transfer or fast DMA transfer. There are also some boards that use other interfaces, such as fiber-optic or RocketIO, for faster speed and more flexible connections.

    The board we use is a WildStarTM -II Pro/PCI reconfigurable FPGA computing

    board from Annapolis Micro Systems [23]. Table 2.1 lists the main features of the

WildStarTM-II Pro/PCI board, and Figure 2.5 shows its block diagram.

    There are two Xilinx Virtex-II Pro FPGAs on the WildStar-II Pro board. Each


WILDSTAR-II PRO/PCI board

FPGA chips:        Two Xilinx Virtex-II Pro XC2VP70 FPGAs [22]
                   (33088 slices, 328 embedded multipliers and 5904Kb BlockRAM)
Memory ports:      12 ports of DDR-II SRAM, 54MBytes total
                   (6 × 4.5MBytes for each FPGA chip)
Memory bandwidth:  11 GBytes/sec
                   (6 × 72-bit ports for each FPGA chip)
PCI interface:     133MHz/64-bit PCI-X, up to 1.03GBytes/sec

Table 2.1: List of the main features of the WildStar-II Pro FPGA board

FPGA has 328 embedded 18×18 signed multipliers and 328 18Kb BlockRAMs. The embedded multiplier built inside the FPGA is much faster than a multiplier component implemented by reconfigurable logic. So it is better to use the embedded

    multipliers if possible. The BlockRAMs are the fastest memory the designer can use

    in an FPGA design. Critical data interchange and interfacing can be programmed

    using the BlockRAMs. A pair consisting of an embedded multiplier and a Block-

    RAM shares the same data and address buses in the Xilinx Virtex-II architecture, so

    once we use the embedded multiplier, we cannot use its corresponding BlockRAM,

    and vice versa. Therefore the sum of the total number of embedded multipliers and

    BlockRAMs used must be less than 328.
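This constraint can be expressed as a simple design-time check (an illustrative sketch, not a vendor tool):

```python
# Resource-sharing check implied by the paired buses on the Virtex-II
# architecture: an embedded multiplier and its paired BlockRAM cannot both
# be used, so the total count of multipliers plus BlockRAMs in a design must
# stay within the number of available pairs.
def fits_on_chip(n_multipliers, n_blockrams, n_pairs=328):
    return n_multipliers + n_blockrams <= n_pairs

ok = fits_on_chip(200, 100)   # True: 300 of the 328 pairs used
```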

    Figure 2.5 shows the block diagram of the WildStar-II Pro FPGA board. Each

    FPGA is connected to six independent on-board memories. The on-board memories

are 1M×36bit DDR-II SRAMs, which have a 72-bit data bandwidth and run at speeds up to 200MHz. The size of each SRAM is 36Mbits, or 4.5MBytes, so the total SRAM

    attached to each FPGA is 27MBytes. The WildStar-II Pro board is connected to

    the desktop computer via a PCI-X interface, with a DMA data transfer rate up to

    1GB/s between the host PC and FPGA. Convenient interfaces between the FPGA

    chip, on board memories and host PC makes the WildStar-II Pro FPGA computing

    board a good platform for hardware implementation.


    Figure 2.5: Block diagram of the WildStarTM -II Pro board

    2.2.3 Reconfigurable Computing Basics

    As introduced in the beginning of this section, the key feature of reconfigurable

    computing is its ability to greatly speedup algorithm computation while maintaining

    flexibility [19]. How does reconfigurable hardware design achieve high performance?

    Based on the structure of the COTS FPGA computing board, we will introduce the

    basic elements of reconfigurable computing which play important roles in computation

    speedup.

As shown in Figure 2.6, an FPGA computing board normally contains an FPGA

    chip, on-board memories and a PCI controller; these form the basic components of

    custom hardware design. A typical hardware design reads the input data from the

    on-board memory or PCI interface, performs the computation inside the FPGA, and

    writes the results back to memory or PCI interface. The hardware design inside an


Figure 2.6: Simple Structure Diagram of the Normal COTS Reconfigurable Computing Board (a PCI interface and on-board memories connected to an FPGA containing several pipeline modules running in parallel)

FPGA normally consists of deep pipelines with as much parallelism as possible. The size and speed of the FPGA chip, the size and speed of the on-board memory

    and the speed of the PCI controller all determine the performance of the hardware

    design on this board. However, given a specific FPGA computing board, there are

    three design approaches which contribute most to the performance of the FPGA

    based design. They are pipelining, parallelism and memory hierarchy/interface.

    2.2.3.1 Pipelining

    Pipelining is an implementation technique whereby multiple instructions are over-

    lapped in execution [24]. A pipeline is like an assembly line.

    1. There are many instructions waiting to be processed just like there are many

    cars waiting to be produced in the assembly line.

    2. All the instructions should have similar structure. Instructions should use the

    same hardware processing units, just like all the cars on the assembly line should have

    identical production procedures. We call the part of an instruction that corresponds

    to each hardware processing unit a “task”, so each instruction is composed of several


    tasks. If there is only one instruction to be executed, each of the tasks will be executed

    one by one in N clock cycles. Of course, the hardware units here are not efficiently

    used because every hardware unit is used only once in the N clock cycles.

    3. The benefit of pipelining is that it can run several instructions at the same time

    and let them share hardware resources. The instructions are overlapped in execution,

    in the shape of a cascade as shown in Figure 2.7. Figure 2.7 is an example structure

    of a pipeline in the FDTD design. The rows represent instructions and the columns

    represent clock cycles. We can see from this figure that every hardware unit is fully

    utilized, since every hardware unit works on a different instruction in every clock

    cycle, just like a working assembly line. The computation results now pop out of

    the pipeline every clock cycle. Compared to the non-pipelined data-path, the design

    throughput is increased by N times.

    A pipeline is a perfect hardware structure when an algorithm needs to execute

    many similar instructions. To build the pipeline, registers need to be inserted between

    hardware units in the data-path to synchronize them according to a clock signal.

Because every hardware unit must finish its work within one clock cycle, the speed of

    the pipeline is determined by the slowest hardware unit or pipeline step in the data-

    path. The smaller the pipeline step, the higher the throughput. So the data-path and

    registers need to be balanced, also called re-timing [25], to make the length of pipeline

steps similar. In the case of large and slow arithmetic components like multipliers, we can use specially designed “pipelined components”, which separate their tasks internally and can be pipelined into several steps, like the pipelined multiplier in

    Figure 2.7.

    Pipelining has some limitations so that not all similar instructions can be pipelined.

    Data hazards may arise when an instruction depends on the results of a previous in-

    struction so it cannot be executed until the previous instruction is completed. Also,


    Figure 2.7: Structure of a pipeline in the free space model of the 2D FDTD algorithm

    pipelining adds extra registers to the data-path, so it increases latency and consumes more resources. In most cases, the extra latency and registers are small enough to be ignored, and the high throughput and overall design performance introduced by pipelining are important for the reconfigurable computing speedup.

    2.2.3.2 Parallelism

    One of the most important techniques in reconfigurable computing design is to par-

    allelize as many tasks as possible. As long as two tasks do not affect each other and

    can be processed at the same time, we can execute them in parallel in order to speed

    up the design. However, the hardware design is always limited by the availability of

    hardware resources. The hardware resources include slice area, on-chip memory area


    and input/output interface. More hardware resources mean higher cost and more

    power consumption. Parallelism will consume more resources, so there is a trade-off

    between hardware resources and speedup. Careful analysis is needed to decide how much parallelism is suitable for the current design to achieve the fastest speed at minimum cost.

    2.2.3.3 Memory Hierarchy and Interface

    Memory is an important part of a reconfigurable computing design. As shown in

    Figure 2.6, the FPGA can access the fast memories on board or the memories on the

    host computer via the slower PCI interface. Also, as we introduced in Section 2.2.2,

    the fastest memories are the on-chip BlockRAMs available inside the FPGA chip. Since

    the read/write speed of memory is important to the whole design’s speed, to avoid

    a bottleneck at the memory interface, fast memory is needed. But fast memory is

    expensive and its size is limited, so it is not practical to use fast memory for the

    whole design. A memory hierarchy can be organized according to the sizes of the

    different types of memories as well as their speed and cost. The goal of the memory

    hierarchy is to provide a memory system with the cost almost as low as a design that

    uses only cheap memory and with the speed almost as fast as a design that uses only

    fast memory.

    2.2.3.4 Summary

    General purpose processors (GPP) can support only limited parallelism and pipelin-

    ing. A GPP has a fixed number of Arithmetic Logic Units (ALUs) for parallelism,

    and it usually pipelines arithmetic operations in a limited way within each ALU. In


    contrast, a custom hardware design is organized for a specific algorithm. The data-

    path can be fully parallelized and pipelined. The memory hierarchy and interface

    can be optimized to support pipelining and parallelism. Compared to a software al-

    gorithm running on a GPP, the reconfigurable computing design can be much faster.

    The more pipelining and parallelism, the faster the design.

    Not all algorithms are suitable for reconfigurable computing. Considering the

    relatively higher cost of hardware implementation, a significant speedup should be

    achieved to make the implementation practical. From the discussion above, we can

    conclude that deep pipelining, parallelism and a suitable memory hierarchy are the keys to reconfigurable computing speedup. To fit these conditions, an algorithm should have a high volume of loops or repeated, similar instructions, intensive

    computations and limited memory access. Also, when most data in an algorithm have a similar range, they can be represented in fixed-point arithmetic, which is easier and faster to process in a custom hardware design. We will return to these topics in Chapter 6.

    2.2.4 Reconfigurable Computing Design Flow

    The FPGA design flow includes coding the algorithm in VHDL, simulation and veri-

    fication, synthesizing, placing and routing, and finally on-chip verification. Figure 2.8

    shows a diagram of the FPGA design flow. Given a target algorithm, the first step

    is algorithm analysis and hardware architectural design. Then we use VHDL code to

    describe the hardware design, which can be debugged and verified through simula-

    tion. The completed VHDL design is then synthesized to create the gate level netlist.

    The place and route tool then fits the synthesized design into the FPGA, creating

    a physical layout and a bit-stream file which can be loaded directly into the FPGA


    chip. We use a C program running on a host PC to communicate with the loaded

    FPGA chip for board level debugging and verification. If there is any problem in

    board level testing, we need to go back and modify the VHDL code. The final design

    which passes board level verification will be tested for performance.

    Figure 2.8: The FPGA Design Flow

    2.3 Finite Precision

    The original 3D FDTD model was written in Fortran, using 64-bit double precision

    floating-point data. Floating-point representation provides high resolution and large

    dynamic range, but it can be costly. In hardware design, floating-point representation requires slower arithmetic components and consumes more hardware resources; it is significantly more expensive than fixed-point representation in both speed and cost when

    implemented in hardware. On the other hand, fixed-point components in hardware

    design have much faster speed and occupy less space. In applications where the

    data resolution and dynamic range can be constrained, as in the FDTD algorithm, fixed-point arithmetic can provide similar precision at much higher speed than


    floating-point arithmetic.

    In this section, the definition and properties of floating-point representation and

    fixed-point representation will be introduced. Then we will discuss fixed-point quantization and the related algorithm analysis.

    2.3.1 Floating-Point Arithmetic

    Programmers use floating-point precision to represent real numbers in most computer

    programs. The IEEE single precision floating-point standard representation requires

    a 32 bit word. As shown in Figure 2.9, the most significant bit is the sign bit, ‘S’,

    the next eight bits are the exponent bits ‘E’, from E30 to E23, and the final 23 bits

    are the fraction bits ‘F’, from F22 to F0 [26] [27]:

    Figure 2.9: Single precision floating-point representation

    The value V represented by the word is determined as follows:

    Vfloat = (−1)^S * 1.F * 2^(E−BIAS), where BIAS = 127.

    Single precision floating-point representation can represent data in a range of roughly −2^128 to 2^128, with normalized magnitudes as small as 2^−126. This is a very large dynamic range, as it can represent very small and very large numbers, which fixed-point representation lacks.

    The IEEE double precision floating point standard representation requires a 64

    bit word. Similar to the single precision standard, as shown in Figure 2.10, the most

    significant bit is the sign bit, ‘S’, the next eleven bits are the exponent bits ‘E’, from

    E62 to E52, and the final 52 bits are the fraction bits ‘F’, from F51 to F0:

    Similar to single precision, the value V may be determined as:


    Figure 2.10: Double precision floating-point representation

    Vfloat = (−1)^S * 1.F * 2^(E−BIAS), where BIAS = 1023.

    2.3.2 Fixed-Point Arithmetic

    The fixed-point representation we use in this dissertation is two’s complement fixed-

    point representation. We can think of fixed-point data as a collection of N binary

    digits, with 2^N possible states, as shown in Figure 2.11. The N-bit data and 2^N states here do not mean anything until their interpretation is defined. For example, if we define the difference between two states as the integer 1, and we define all the data to lie between 0 and 2^N, then the N-bit data can be thought of as positive integers less than 2^N. However, if we define the most significant bit to be the sign bit, changing the data range to between −2^(N−1) and 2^(N−1) − 1, the N-bit data can be thought of as integers with absolute value at most 2^(N−1). And if we define the difference between two states as 0.1, the meaning of the N-bit data will change too; they are now rational numbers precise to 0.1. So the meaning of fixed-point N-bit binary data depends on its interpretation.

    Figure 2.11: Uninterpreted fix-point representation

    In most cases, fixed-point representation is used to represent rational number

    sets. Instead of defining the value of states, we can use a more direct way to define

    the interpretation: the position of the binary point. For example, the 35-bit signed


    fixed-point representation which will be used in this research has the form as shown

    in Figure 2.12.

    Figure 2.12: Interpreted fix-point representation

    The digits before the binary point are one sign bit and one integer bit; the digits

    after it are called the fraction part. The resolution, which is defined as the smallest

    non-zero magnitude representable [28], is determined by the format. In this example,

    the resolution is 2^−33. Since the position of the binary point is fixed, this representation is called “fixed-point”. We can write this representation as A(1,33) [28]. The

    N-bit signed fixed-point data representation A(a, b) has the value:

    Vfix = −S · 2^a + Σ_{n=0}^{a−1} A_n · 2^n + (1/2^b) · Σ_{n=0}^{b−1} B_n · 2^n

    Although fixed-point representation admits different interpretations, they all share the same rules for most arithmetic operations. We can treat fixed-point numbers as integers, then choose the necessary bits and reposition the binary point after the

    operation. For example, if we multiply two fixed-point numbers with the form shown

    in Figure 2.13, we can ignore the binary point and multiply the two integers first.

    After the multiplication, we pick the digits we need according to the position of the

    binary point. In this example, it is 3 digits before the binary point and 26 digits

    after. The result is still a 30-bit fixed-point datum.

    The fixed-point representation has a fixed data structure that only works well

    on a limited data range. For example, the fixed-point number with the format in

    Figure 2.13 only has a range from -8.0 to 7.9999... [29]. If an integer result is bigger


    Figure 2.13: Multiplication of fix-point representation

    than 2^3 − 1 = 7, then it cannot be represented by three digits before the binary point. The situation where a number cannot fit into the representable range is called

    overflow. The minimal absolute value which fixed-point data can represent is also

    limited. The situation where a number is less than the resolution of the representation

    is called underflow. Although the range and resolution in fixed-point representation

    are limited compared to same-length floating-point data, it is suitable for many

    applications which do not need both high precision and wide dynamic range.

    In the example in Figure 2.13, we ignored the 27th to 52nd bits after the binary

    point in order to fit the result into the current representation. If a value with higher

    resolution cannot be fit in the current representation, some information must be

    discarded in order to decrease its resolution. The most common methods are “trun-

    cation” and “rounding toward nearest”. Truncation is computationally simpler, just

    dropping all the digits beyond those required, as we did in the above example. Rounding, instead, introduces a smaller error. When all numbers are rounded to the nearest representable number, the result is more accurate than truncation half of the time, and the same the other half of the time. Of course rounding requires more effort to process; it can be accomplished by adding one at the bit position just below the last retained bit and then truncating.

    Depending on the number of states required, N in the N-bit fixed-point repre-

    sentation can be any length we want, so there is no standard length for fixed-point representation. Because of this flexibility, a designer can

    optimize the system by choosing the minimal-length fixed-point data with accept-

    able error. The shorter the data length, the lower the cost, the more area available

    for parallelism and finally the faster the design. This point is particularly useful in

    hardware design since speed and area are the ultimate goals of the hardware designer.

    Because of its simplicity and fast processing, fixed-point representation is widely used

    in custom hardware designs for applications where the data resolution and dynamic

    range can be constrained.

    2.3.3 Fixed-Point Quantization

    Fixed-point quantization refers to the process of approximating values in floating-

    point representation with a fixed-point representation. An N-bit signed fixed-point

    representation A(a, b), as shown in Figure 2.12, has the value:

    Vfix = −S · 2^a + Σ_{n=0}^{a−1} A_n · 2^n + (1/2^b) · Σ_{n=0}^{b−1} B_n · 2^n

    The rounding method we use in this research is rounding toward nearest. So if the

    lengths of a and b are decided, for any floating-point value Vfloat, we can approximate

    Vfloat with fixed-point by representing the floating-point number in two’s complement

    arithmetic, rounding it and then truncating the digits that do not fit in the fixed-point

    representation.

    The main problem remaining is how to choose the lengths of a and b. Fixed-point quantization creates error, which is the difference between the floating-point


    and the fixed-point representation. Usually the longer the fixed-point value, the less

    the quantization error. However, longer values require more hardware resources, so

    the fixed-point representation we choose should balance minimizing the error with

    minimizing the bit-width to save hardware resources. This is a trade-off between

    quantization error and hardware area. This decision is made according to the al-

    gorithm analysis of relative error and visual error. The detailed analysis will be

    discussed in Chapter 5.

    2.4 Summary

    This chapter presented the theories of the FDTD method, which is the target al-

    gorithm of this dissertation. Background on FPGAs and reconfigurable computing,

    which is the basis of this research, was then introduced. Finally, we discussed the concepts of finite precision and fixed-point quantization, which will be used in the

    algorithm analysis. In the next chapter, we will present related work.

  • Chapter 3

    RELATED WORK

    In this chapter we introduce work related to this research. Related work falls into the following categories: applications of the FDTD method, parallel-computing systems for FDTD acceleration, VLSI implementations for FDTD acceleration, and FPGA implementations for FDTD acceleration.

    3.1 Applications of the FDTD Method

    The FDTD method is an important tool for investigating the propagation, radi-

    ation and scattering of electromagnetic waves. Before the 1990s, the cost of solving

    Maxwell’s equations directly was large and most of the related research was for mil-

    itary defense purposes. For example, engineers used huge parallel supercomputing

    arrays to model the radar wave reflection of airplanes by solving Maxwell’s equations,

    trying to develop an airplane with low radar cross section [30]. The difficult task of

    solving Maxwell’s equations has had more economical solutions since 1990, with the

    development of fast computing resources applied to the FDTD method. Now the


  • CHAPTER 3. RELATED WORK 39

    application of the FDTD method has spread to many areas including:

    • Discrete scattering, by modelling electromagnetic wave scattering from discrete objects [7],

    • Antenna design and radar design, by modelling various kinds of antennas,

    • Digital circuit packaging on multi-layer circuit boards, by helping designers analyze electromagnetic wave phenomena on complex circuit boards [30],

    • Subsurface detection and ground penetrating radar (GPR), by modelling the electromagnetic phenomena of GPR on the subsurface model [31] [32], and

    • Medical studies, by modelling electromagnetic wave phenomena in the human body, including:

    – The study of the effect of electromagnetic waves from cell phones on the human brain,

    – The study of the computation of light scattering from cells [33], and

    – The study of breast cancer detection using electromagnetic antennas [34] [35].

    Among these various applications, several works are closely related to our research,

    both in the subsurface detection areas and in medical studies.

    The use of the FDTD method to simulate Ground Penetrating Radar (GPR)

    applications for anti-personnel mine detection has been introduced [31, 32]. With

    carefully modeled transmitting and receiving antenna locations, a 2D FDTD model

    can simulate the wave propagation and scatter response of 3D GPR geometries with

    realistic ground material [31]. An advanced 3D FDTD model was introduced [32]

    which models more complex geometries, provides more accurate results and allows


    complete freedom for the location of the transmitting and receiving antennas. The

    3D FDTD method implements realistic dispersive soil along with air, metal and di-

    electric media. The Mur-type absorbing boundary conditions were implemented and

    produced good results. Both the 2D and 3D models have been validated by experi-

    ments performed with a commercially available ground penetrating radar system and

    realistic soil. The results produced by the FDTD models are very accurate compared

    with experimental data.

    Research on FDTD modelling for microwave breast cancer detection has been pre-

    sented [34, 35]. Because of the large difference in electromagnetic properties between

    malignant tumor tissue and normal breast tissue, microwave breast cancer detection

    has attracted much interest as it may overcome some of the shortcomings of X-ray

    based detection. Accurate computational modeling of microwaves in human tissue

    with the FDTD method is very helpful for breast cancer detection research. In these

    papers, the authors build a 3D simulation of the human breast, which includes a

    semi-ellipsoid geometric representation of the breast and a planar chest wall. A sin-

    gle pole conductivity model [11] based on the Z-transform is presented as a simple and

    efficient way of modeling various dispersive media, in the range of 30MHz to 20GHz,

    including human tissue. The UPML Absorbing Boundary Condition is implemented

    and performs better than the Mur-type ABC.

    Clearly, the FDTD method is a powerful tool that can be used in many different

    applications. Our hardware implementations are based on both the mine detection

    and the breast cancer detection research mentioned above [31, 32, 34]. Our research

    leads to a hardware FDTD core supporting both GPR and medical applications, and, with minimal modification, other application areas.

    The FDTD method is computationally intensive. Without powerful computational resources, FDTD models are too time consuming to be implemented on a single


    computer node. The backward FDTD algorithm, or the time reversal algorithm [35]

    based on the FDTD method, especially needs speed-up because realistic subsurface

    detection instruments, which are based on the backward algorithm, place the most de-

    mands on computational resources in order to achieve real-time working speed. Much

    effort has been spent on accelerating FDTD implementations for better performance.

    3.2 Parallel-computing Systems for FDTD Acceleration

    One of the early approaches to accelerating the FDTD method was to use supercomputers or parallel computer arrays to calculate the FDTD algorithm in software.

    A two node parallel algorithm exhibiting perfect speedup should run twice as fast

    as a single node computer. However, because of communication bottlenecks, parallel

    systems cannot achieve perfect speed up. The goal is to achieve near-perfect speedup.

    A parallel implementation of the FDTD algorithm is presented [36]. In this pa-

    per, the authors compared parallel FDTD code running on three different parallel

    computing systems: a typical 16 node Beowulf system and two traditional parallel

    supercomputers. The traditional supercomputers have better inter-node communi-

    cation bandwidth and latency compared to the typical Beowulf system, but the cost

    of traditional supercomputers is much higher. Well written parallel FDTD code per-

    forms well on all three systems. The authors compare the performance results of the

    Beowulf system to the supercomputers only, not to a single node computer. We can assume the speedup is at most 16-fold, and somewhat less in practice due to communication overhead. In this paper, the authors also analyze the decomposition

    and communication problems of the parallel FDTD method, which is helpful to our


    research on the two-FPGA implementation of the FDTD method. According to the

    analysis, the FDTD code should scale to a large number of nodes easily, because most

    communication is between neighboring nodes.

    More detailed research on Beowulf system performance for the FDTD method

    is introduced [37]. The project investigated the performance of a parallel FDTD

    implementation on a Beowulf workstation cluster where the computation grid was

    divided among nodes. The authors point out that for a fixed-size problem, as the

    number of nodes increases, the spee