-
Acceleration of the 3D FDTD Algorithm in Fixed-point
Arithmetic using Reconfigurable Hardware
A Thesis Presented
by
Wang Chen
to
The Department of Electrical and Computer Engineering
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in
Electrical Engineering
in the field of
Electrical Engineering
Northeastern University
Boston, Massachusetts
August 2007
-
c© Copyright 2007 by Wang Chen
All Rights Reserved
-
NORTHEASTERN UNIVERSITY
Graduate School of Engineering
Thesis Title: Acceleration of the 3D FDTD Algorithm in Fixed-point Arithmetic
using Reconfigurable Hardware.
Author: Wang Chen.
Department: Electrical and Computer Engineering.
Approved for Thesis Requirements of the Doctor of Philosophy Degree
Thesis Advisor: Prof. Miriam Leeser Date
Thesis Committee Member: Prof. Carey Rappaport Date
Thesis Committee Member: Prof. Charles DiMarzio Date
Department Chair: Prof. Ali Abur Date
Graduate School Notified of Acceptance:
Director of the Graduate School: Prof. Yaman Yener Date
-
Copy Deposited in Library:
Reference Librarian Date
-
Abstract
Understanding and predicting electromagnetic behavior is increasingly important in
modern technology. The Finite-Difference Time-Domain (FDTD) method is a powerful
computational electromagnetics technique for modelling electromagnetic space.
However, the computation of this method is complex and time consuming. Implementing
this algorithm in hardware greatly increases its computational speed and
widens its usage.
This dissertation presents a fixed-point 3D UPML (Uniaxial Perfectly Matched
Layer) FDTD FPGA accelerator, which supports a wide range of materials including
dispersive media. By analyzing the performance of fixed-point arithmetic in the 3D
FDTD algorithm, the correct fixed-point representation is chosen to minimize the
relative error between the fixed-point and floating-point results. The FPGA accelerator
supports the UPML absorbing boundary conditions, which have better performance
in dispersive soil and human tissue media than PML boundary conditions.
The 3D UPML FDTD hardware accelerator has been designed and implemented
on a WildStar-II Pro FPGA computing board. The computational speed of the
hardware implementation represents a 25× speedup compared to the software
implementation running on a 3.0 GHz PC. This result indicates that the custom
hardware implementation can achieve significant speedup for the 3D UPML FDTD algorithm.
ii
-
The speedup of the FDTD hardware implementation is due to three major factors:
fixed-point representation, custom memory interface design, and pipelining and
parallelism. The FDTD method is a data-intensive algorithm, and the bottleneck of
the hardware design is its memory interface. Given the limited bandwidth between
the FPGA and on-board memories, a carefully designed custom memory interface
allows full utilization of the memory bandwidth and greatly improves performance.
The FDTD algorithm is also computationally intensive. By considering
the trade-offs between resources and performance, pipelining and parallelism are
implemented to achieve optimal design performance based on the available hardware
resources.
iii
-
Acknowledgements
First of all, I would like to thank my advisor Dr. Miriam Leeser, not only for her
invaluable advice and encouragement in guiding my graduate research, but also for
her great help throughout my six years of graduate study.
I also greatly appreciate the friendship and support of Dr. Carey Rappaport
and Panos Kosmas at CenSSIS, the Center for Subsurface Sensing and Imaging Systems.
This dissertation would not have been possible without their help, support, and
FDTD model.
I would like to thank Dr. Charles DiMarzio, who served on both my master's and
doctoral committees, for his constant help and support of my research.
It has been a pleasure to work with all of my colleagues and friends in the
Reconfigurable Computing Laboratory at Northeastern University. I would like
to thank all of them for their suggestions, help and support.
Finally, I give my special acknowledgement to my wife Lan, and my parents, Xiuzhi
Chen and Yuanrui Zhang, for their constant love and great support.
This research was supported in part by CenSSIS, the Center for Subsurface Sensing
and Imaging Systems, under the Engineering Research Centers Program of the
National Science Foundation.
iv
-
Contents
1 INTRODUCTION 1
1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Dissertation Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2 BACKGROUND 7
2.1 The Finite-Difference Time-Domain (FDTD) Method . . . . . . . . . 7
2.1.1 FDTD Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.2 FDTD Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 Absorbing Boundary Conditions . . . . . . . . . . . . . . . . . 15
2.1.3.1 The Mur-type absorbing boundary conditions . . . . 15
2.1.3.2 The UPML absorbing boundary conditions . . . . . 16
2.2 FPGA and Reconfigurable Computing . . . . . . . . . . . . . . . . . 19
2.2.1 FPGA Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2.2 FPGA Computing Boards . . . . . . . . . . . . . . . . . . . . 23
2.2.3 Reconfigurable Computing Basics . . . . . . . . . . . . . . . . 25
2.2.3.1 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3.2 Parallelism . . . . . . . . . . . . . . . . . . . . . . . 28
2.2.3.3 Memory Hierarchy and Interface . . . . . . . . . . . 29
2.2.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2.4 Reconfigurable Computing Design Flow . . . . . . . . . . . . . 30
2.3 Finite Precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.3.1 Floating-Point Arithmetic . . . . . . . . . . . . . . . . . . . . 32
v
-
2.3.2 Fixed-Point Arithmetic . . . . . . . . . . . . . . . . . . . . . . 33
2.3.3 Fixed-Point Quantization . . . . . . . . . . . . . . . . . . . . 36
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3 RELATED WORK 38
3.1 Applications of the FDTD Method . . . . . . . . . . . . . . . . . . . 38
3.2 Parallel-computing Systems for FDTD Acceleration . . . . . . . . . . 41
3.3 VLSI Implementations of the FDTD Method . . . . . . . . . . . . . . 43
3.4 FPGA Implementations of the FDTD Method . . . . . . . . . . . . . 45
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4 FDTD ON FPGA 50
4.1 The Advantages of Putting FDTD on an FPGA . . . . . . . . . . . . 50
4.1.1 Speed, Size, Power and Flexibility . . . . . . . . . . . . . . . . 51
4.1.2 Good for Parallelism and Deep Pipelining . . . . . . . . . . . 51
4.1.3 Suitable for Fixed-Point Arithmetic . . . . . . . . . . . . . . . 52
4.2 Key Problems to Solve - FDTD on an FPGA . . . . . . . . . . . . . 58
4.2.1 Determine the right precision for fixed-point representation . . 59
4.2.2 Determine the memory organization and the memory interface 60
4.2.3 Determine the proper use of pipelining and parallelism . . . . 61
4.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5 ALGORITHM ANALYSIS 63
5.1 Target FDTD Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.1.1 The GPR Buried Object Detection Forward Model . . . . . . 64
5.1.2 The Breast Cancer Detection Forward Model . . . . . . . . . . 66
5.1.3 The Spiral Antenna Model . . . . . . . . . . . . . . . . . . . . 67
5.1.4 Simplification and 2D Structure . . . . . . . . . . . . . . . . . 68
5.1.5 Six Target Models . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.1 Fixed-point Analysis . . . . . . . . . . . . . . . . . . . . . . . 71
5.2.2 Fixed-point Quantization . . . . . . . . . . . . . . . . . . . . . 78
5.3 Hardware Design Analysis . . . . . . . . . . . . . . . . . . . . . . . . 81
vi
-
5.3.1 Memory hierarchy and memory interface . . . . . . . . . . . . 81
5.3.2 Managed cache module . . . . . . . . . . . . . . . . . . . . . . 86
5.3.2.1 The memory transfer bottleneck . . . . . . . . . . . 86
5.3.2.2 Optimizing data flow and processing core . . . . . . 87
5.3.2.3 Expand to Three Dimensions . . . . . . . . . . . . . 90
5.3.3 Pipelining and Parallelism . . . . . . . . . . . . . . . . . . . . 92
5.3.3.1 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . 93
5.3.3.2 Parallelism . . . . . . . . . . . . . . . . . . . . . . . 93
5.3.3.3 Two Hardware Implementations . . . . . . . . . . . . 95
5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6 FDTD DESIGN, IMPLEMENTATION and RESULTS 98
6.1 Overview of the FDTD hardware implementation . . . . . . . . . . . 98
6.2 I/O Interface and DMA Transfer . . . . . . . . . . . . . . . . . . . . 100
6.2.1 WildStar-II Pro’s I/O Interface . . . . . . . . . . . . . . . . . 100
6.2.2 DMA Transfer . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2.3 DMA Control Logic and FIFO Components . . . . . . . . . . 104
6.2.4 DMA Interface Extension . . . . . . . . . . . . . . . . . . . . 106
6.3 Datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.1 Combined Datapath . . . . . . . . . . . . . . . . . . . . . . . 107
6.3.2 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.4 Data Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.4.1 Data Arrangement . . . . . . . . . . . . . . . . . . . . . . . . 111
6.4.2 On-board Memories . . . . . . . . . . . . . . . . . . . . . . . . 113
6.4.3 BlockRAM Cache Module . . . . . . . . . . . . . . . . . . . . 113
6.5 Control State Machines . . . . . . . . . . . . . . . . . . . . . . . . . . 116
6.6 FDTD Implementations and Results . . . . . . . . . . . . . . . . . . 117
6.6.1 2D UPML FDTD Implementation . . . . . . . . . . . . . . . . 118
6.6.2 3D UPML FDTD Implementation — Single-Cell Full-UPML . 120
6.6.3 3D UPML FDTD Implementation — Double-Cell Full-UPML 125
6.6.4 3D UPML FDTD Implementation — UPML/Center Combined 127
6.6.5 Results Summary . . . . . . . . . . . . . . . . . . . . . . . . . 129
vii
-
6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
7 CONCLUSIONS and FUTURE WORK 132
7.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
Bibliography 136
viii
-
List of Tables
2.1 List of the main features of the WildStar-II Pro FPGA boards . . . 24
4.1 Relative error between fixed-point arithmetic and floating-point arith-
metic over 200000 Timesteps . . . . . . . . . . . . . . . . . . . . . . . 57
5.1 Detailed Specifications of the Target FDTD Models . . . . . . . . . . 70
5.2 Average absolute error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 2D GPR Landmine Detec-
tion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.3 Average relative errors between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 2D GPR Landmine Detec-
tion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.4 Average absolute error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 2D Breast Cancer Detec-
tion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.5 Average relative error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 2D Breast Cancer Detec-
tion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.6 Average absolute error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 2D Spiral Antenna Model 75
5.7 Average relative error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 2D Spiral Antenna Model 75
ix
-
5.8 Average absolute error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 3D GPR Landmine Detec-
tion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.9 Average relative error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 3D GPR Landmine De-
tection Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.10 Average absolute error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 3D Breast Cancer Detec-
tion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.11 Average relative error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 3D Breast Cancer Detec-
tion Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.12 Average absolute error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 3D Spiral Antenna Model 78
5.13 Average relative error between fixed-point arithmetic and floating-
point arithmetic for different bit-widths for 3D Spiral Antenna Model 78
6.1 Read/Write ratio in different FDTD implementations . . . . . . . . . 129
6.2 3D FDTD hardware implementation performance results . . . . . . . 130
x
-
List of Figures
2.1 The geometrical representation of the 3D Yee cell. . . . . . . . . . . . 10
2.2 A Complete Flow Diagram of the FDTD Method . . . . . . . . . . . 13
2.3 Speed vs. flexibility of software, FPGAs and ASICs . . . . . . . . . . 20
2.4 Xilinx FPGA structure and its CLB . . . . . . . . . . . . . . . . . . . 22
2.5 Block diagram of the WildStar-II Pro board . . . . . . . . . . . . . 25
2.6 Simple Structure Diagram of the Normal COTS Reconfigurable Com-
puting Board . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.7 Structure of a pipeline in the free space model of the 2D FDTD algorithm 28
2.8 The FPGA Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.9 Single precision floating-point representation . . . . . . . . . . . . . . 32
2.10 Double precision floating-point representation . . . . . . . . . . . . . 33
2.11 Uninterpreted fixed-point representation . . . . . . . . . . . . . . . . 33
2.12 Interpreted fixed-point representation . . . . . . . . . . . . . . . . . 34
2.13 Multiplication of fixed-point representation . . . . . . . . . . . . . . 35
5.1 3D FDTD application of land mine detection using ground penetrating
radar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.2 3D FDTD application of microwave breast cancer detection . . . . . . 66
5.3 The floor plan of the spiral antenna model . . . . . . . . . . . . . . . 67
5.4 Structural diagram of the simplification and the pseudo-2D FDTD
algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.5 Data structure of the fixed-point representation . . . . . . . . . . . . 71
5.6 Relative error between fixed-point arithmetic and floating-point arith-
metic for different bit-widths . . . . . . . . . . . . . . . . . . . . . . . 80
xi
-
5.7 Structural Diagram of the Memory Interface . . . . . . . . . . . . . . 85
5.8 Structural Diagram of the Simple 2-Row Cache Module . . . . . . . . 88
5.9 Structural Diagram of the 2D Managed Cache Module . . . . . . . . 90
5.10 Structural Diagram of the 4-Slice Caching Design . . . . . . . . . . . 91
5.11 Structure Diagram of the 4×3 Rows Caching Module . . . . . . . . . 92
5.12 Running Two Cells in Parallel . . . . . . . . . . . . . . . . . . . . . 94
5.13 The UPML boundary conditions cells and non-UPML center cells in
the model space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.1 Overview Diagram of the FDTD hardware Design . . . . . . . . . . . 99
6.2 DMA Control Logic and FIFO Components . . . . . . . . . . . . . . 104
6.3 The block diagram of the 3 level memory interface . . . . . . . . . . . 106
6.4 Diagram of the magnetic updating pipeline . . . . . . . . . . . . . . . 107
6.5 Diagram of electric updating algorithm on different media . . . . . . 108
6.6 Diagram of the electric updating pipeline . . . . . . . . . . . . . . . . 110
6.7 Diagram of the storage of EM field data on on-board memories . . . . 112
6.8 Diagram of detailed data flow on 3D electric updating pipeline module 115
6.9 Diagram of hierarchy level of control state machines. . . . . . . . . . 116
6.10 Data flow of the 2D UPML FDTD hardware Design . . . . . . . . . . 118
6.11 Performance of the 2D UPML FDTD hardware Design . . . . . . . . 120
6.12 Data flow of the 3D FDTD field updating algorithms . . . . . . . . . 122
6.13 Data flow of the 3D UPML FDTD hardware design . . . . . . . . . . 124
6.14 Difference between the 3D Single-Cell and Double-Cell Full-UPML
FDTD hardware design . . . . . . . . . . . . . . . . . . . . . . . . . . 126
6.15 Data flow of the 3D Double-Cell Full-UPML FDTD hardware design 127
6.16 Performance of the 3D Double-Cell UPML FDTD hardware Design . 128
6.17 Data flow of the 3D UPML/Center combined FDTD hardware design 129
xii
-
Chapter 1
INTRODUCTION
1.1 Overview
Understanding and predicting electromagnetic behavior is increasingly needed
in key electrical engineering technologies such as cellular phones, mobile computing,
lasers and photonic circuits. Since the Finite-Difference Time-Domain (FDTD)
method provides a direct, time-domain solution to Maxwell's Equations in differential
form with relatively good accuracy and flexibility, it has become a powerful method
for solving a wide variety of electromagnetics problems. The FDTD method
was not widely used until the past decade, when computing resources improved. Even
today, the computational cost is still too high for real-time application of the FDTD
method.
Much effort has been spent on software acceleration, and researchers have
used supercomputers or parallel computer arrays to run the FDTD algorithm
in software. However, real-time application of the FDTD algorithm needs faster
speed as well as a smaller package. Although Application Specific Integrated Circuits
1
-
CHAPTER 1. INTRODUCTION 2
(ASICs) provide much faster speed, people hesitate to implement the FDTD algorithm in
ASICs due to their high development cost.
Recently, as high-capacity Field-Programmable Gate Arrays (FPGAs) have emerged,
interest has grown in reconfigurable hardware implementations of the
FDTD algorithm for faster calculation and real-time applications.
FPGAs offer the benefits of both ASICs and software. Like an ASIC, an FPGA
can implement millions of gates of logic, designed optimally for a specific
task, in a single integrated circuit. Although FPGA-based designs may not match
the performance of ASICs, few applications will ever merit the expense
of a high-cost ASIC. Like software, FPGAs are reprogrammable by designers at any
time, and are therefore flexible and convenient. Although FPGA-based development
is still costly compared to software, FPGA-based designs can offer extremely high
performance, surpassing any other programmable solution.
There have also been a number of recent efforts to implement the FDTD algorithm
on graphics processing units (GPUs). Graphics cards are widely available
and relatively cheap compared to FPGAs. Instead of working on a hardware design,
designers use C++ and OpenGL to program the FDTD algorithm on the GPU. However,
because the GPU is designed for graphics calculations, its hardware structure is fixed;
there is little flexibility to upgrade the FDTD design or change the data type. In
contrast, an FPGA-based design offers much better flexibility, so it can achieve higher
precision and faster speed.
We present the first fixed-point 3D UPML FDTD FPGA accelerator, which supports
a wide range of materials including dispersive media. By analyzing the performance
of fixed-point arithmetic in the 3D FDTD algorithm, the correct fixed-point
representation is chosen to minimize the relative error between fixed-point and
floating-point results. The FPGA accelerator supports the UPML absorbing boundary
-
conditions which have better performance in dispersive soil and human tissue media
than PML boundary conditions.
The 3D UPML FDTD hardware accelerator has been designed and implemented
on a WildStar-II Pro FPGA computing board. The computational speed of the
hardware implementation represents a 25× speedup compared to the software
implementation running on a 3.0 GHz PC. This result indicates that the custom hardware
implementation can achieve significant speedup for the 3D UPML FDTD algorithm.
The speedup of the FDTD hardware implementation is due to three major factors:
fixed-point representation, custom memory interface design, and pipelining and
parallelism. The FDTD method is a data-intensive algorithm, and the bottleneck of the
hardware design is its memory interface. Given the limited bandwidth between the
FPGA and on-board memories, a carefully designed custom memory interface
allows full utilization of the memory bandwidth and greatly improves performance.
The FDTD algorithm is also computationally intensive. By considering
the trade-offs between resources and performance, pipelining and parallelism are
implemented in order to achieve optimal design performance based on the available
hardware resources.
1.2 Contributions
The contributions of this research are:
• This research presents a fixed-point 3D UPML FDTD FPGA accelerator which supports a wide range of materials including dispersive media. The 3D UPML
FDTD hardware accelerator has been designed and implemented on a WildStar-
II Pro FPGA computing board. The computational speed of the hardware
-
implementation represents a 25× speedup compared to the software
implementation running on a 3.0 GHz PC. This result indicates that the custom
hardware implementation can achieve significant speedup for the 3D UPML
FDTD algorithm.
• The 3D UPML FDTD algorithm has been carefully analyzed to find an efficient hardware architecture. A carefully designed custom memory interface
and managed cache module allow full utilization of the memory bandwidth
and greatly improve hardware design performance. By considering the trade-offs
between resources and performance, pipelining and parallelism are implemented
to achieve optimal design performance based on the available hardware
resources.
• Our 3D FDTD FPGA accelerator is the first fixed-point 3D FDTD FPGA accelerator. Recent approaches to hardware acceleration of the FDTD algorithm
focus on floating-point arithmetic implementations [1] [2]. Although floating-point
representation provides relatively high data precision, because of the favorable
data properties and algorithm structure of the FDTD method, the relative error
between fixed-point and double-precision floating-point results can
be kept very low as long as the fixed-point representation is chosen carefully.
Since fixed-point arithmetic components are faster and
consume fewer hardware resources than floating-point ones, the fixed-point arithmetic
FDTD accelerator has better performance.
• Our 3D FDTD FPGA accelerator is the first hardware implementation to support the UPML absorbing boundary condition (ABC), which has better performance
in dispersive soil and human tissue media than the PML ABC. The UPML
-
ABC provides better performance in terms of the reflection ratio at all incidence
angles in dispersive media compared to other approaches. Although the UPML
FDTD algorithm is more computationally expensive and consumes more memory
space, its regular structure makes it relatively easy to implement in hardware
compared to the PML and Mur-type ABCs.
• Our 3D FDTD FPGA accelerator implements and supports GPR mine detection, breast cancer detection, and spiral antenna research. Our research
provides a good example of FPGA acceleration of FDTD applications in
fixed-point arithmetic. With the experience gained in our research, other FDTD
applications can be ported to FPGAs more easily.
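The fixed-point contribution above rests on the claim that, with a carefully chosen representation, fixed-point results track double-precision results closely. The following is a minimal software sketch of that error measurement; the 30-fractional-bit Q format, the helper names (`to_fixed`, `fixed_mul`), and the sample coefficient/field values are illustrative assumptions, not the bit-width or data chosen in Chapter 5.

```python
# Sketch: quantize operands to a hypothetical fixed-point format with
# 30 fractional bits and compare a multiply-accumulate chain against
# the exact double-precision result.

FRAC_BITS = 30          # fractional bits (assumed for illustration)
SCALE = 1 << FRAC_BITS

def to_fixed(x: float) -> int:
    """Quantize a float to a signed fixed-point integer by rounding."""
    return int(round(x * SCALE))

def to_float(q: int) -> float:
    """Convert a fixed-point integer back to a float."""
    return q / SCALE

def fixed_mul(a: int, b: int) -> int:
    """Fixed-point multiply: full-precision integer product, then rescale."""
    return (a * b) >> FRAC_BITS

# Made-up update coefficients and field samples, well inside the format's range.
coeffs = [0.4982, -0.1277, 0.3321]
fields = [0.25, -0.5, 0.125]

exact = sum(c * f for c, f in zip(coeffs, fields))   # double precision
acc = 0
for c, f in zip(coeffs, fields):                     # fixed-point MAC chain
    acc += fixed_mul(to_fixed(c), to_fixed(f))

rel_err = abs(to_float(acc) - exact) / abs(exact)
print(f"relative error = {rel_err:.2e}")
```

In hardware, the same multiply-and-rescale pattern maps directly onto integer multipliers and shifters, which is why a fixed-point datapath is smaller and faster than a floating-point one.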
1.3 Dissertation Structure
The outline of the remainder of this dissertation is as follows.
Chapter 2 starts with an introduction to the Finite-Difference Time-Domain method,
the structure of the FDTD algorithm, and the UPML absorbing boundary conditions.
It then discusses the FPGA and the WildStar-II Pro reconfigurable board on which the
design is implemented. Finally, finite precision theory is introduced for a better
understanding of the algorithm data analysis in Chapter 5.
Chapter 3 presents a survey of work related to our research, including research
on the FDTD method, parallel computing, VLSI implementations, and FPGA
implementations of the FDTD method.
Chapter 4 first analyzes and presents the advantages of implementing the FDTD
algorithm on an FPGA. Then it lists the key problems which need to be solved in
order to have a successful FDTD hardware implementation. These problems will be
-
solved in Chapter 5 and Chapter 6.
Chapter 5 starts with an introduction to the six target FDTD models we used for
data analysis and design verification. It then gives the details of our research on the
fixed-point data analysis of the 3D FDTD algorithm. It compares the relative error
of the electromagnetic field data for different data bit-widths to choose a suitable
bit-width for the fixed-point representation. Finally, this chapter carefully analyzes the
3D UPML FDTD algorithm to find an efficient hardware architecture. A carefully
designed custom memory interface and managed cache module are proposed and
analyzed. The implementation of pipelining and parallelism is discussed based on
the trade-offs between hardware resources and design performance.
Chapter 6 discusses several technical details of the 3D UPML FDTD implementation,
including the I/O interface, DMA transfer, datapath, data flow, memory interface,
managed cache module, and control state machines. After that, several FDTD
implementations are introduced and their performance results are presented. These
implementations are described in the order of our design work; their
data flow and control logic become increasingly complex as we further optimize
the design for better performance. This chapter also contains a summary of
the performance of the FDTD implementations.
Chapter 7 draws conclusions and gives suggestions for future work.
-
Chapter 2
BACKGROUND
In this chapter we introduce the background of the FDTD method, FPGAs, and
finite precision, which form the basis of this dissertation.
2.1 The Finite-Difference Time-Domain (FDTD)
Method
The discovery of Maxwell’s equations was one of the outstanding achievements of
19th century science. The equations give a unified and complete theory for people to
understand electromagnetic wave phenomena. Solving Maxwell's equations has become
an important method for investigating the propagation, radiation and scattering of
electromagnetic waves.
The FDTD method provides a direct time-domain solution of Maxwell’s Equa-
tions. After Yee first introduced it in 1966 [3], people began to realize its accuracy and
flexibility for solving electromagnetic problems. Because of its clear structure, easy-to-understand
algorithm, and ability to solve Maxwell's equations on most scales with
7
-
CHAPTER 2. BACKGROUND 8
many different environments, the FDTD method has become a powerful technique for
solving a wide variety of electromagnetic problems. Such problems include
discrete scattering analysis, antenna design, radar research, digital circuit packaging
on multi-layer circuit boards, tumor detection, and various medical studies of the
effects of electromagnetic waves on the human body [4, 5, 6, 7, 8].
In this section, we will first introduce the basics of the FDTD method by solving
Maxwell’s equations step by step. Then we will introduce the structure and properties
of the FDTD software algorithm. Finally we will talk about the absorbing boundary
conditions for the FDTD method.
2.1.1 FDTD Basics
The differential form of Maxwell’s equations and constitutive relations can be written
with the following equations:
∇ × E = −∂B/∂t − σ_m H − M                            (2.1)

∇ × H =  ∂D/∂t + σ_e E + J                            (2.2)

∇ · D = ρ_e ;   ∇ · B = ρ_m                           (2.3)

D = εE ;   B = μH                                     (2.4)
In Equations (2.1)-(2.4), the following symbols are used:
E: electric field                    D: electric flux density
H: magnetic field                    B: magnetic flux density
J: electric current density          M: equivalent magnetic current density
ε: electrical permittivity           μ: magnetic permeability
σ_e: electric conductivity           σ_m: equivalent magnetic conductivity
ρ_e: electric charge density         ρ_m: equivalent magnetic charge density
-
First, the FDTD method replaces ~D and ~B in Equation (2.1)-(2.2) with ~E and ~H
according to the constitutive relations in Equation (2.4), which yields Maxwell’s curl
equations:
μ ∂H/∂t = −∇ × E − σ_m H − M                          (2.5)

ε ∂E/∂t =  ∇ × H − σ_e E − J                          (2.6)
All the curl operators are then written in differential form and replaced by partial
derivative operators, as shown in Equation (2.7), with the E and H vectors each
separated into three components in three dimensions (e.g. E is separated into E_x, E_y, E_z):

∇ × F = x̂(∂F_z/∂y − ∂F_y/∂z) + ŷ(∂F_x/∂z − ∂F_z/∂x) + ẑ(∂F_y/∂x − ∂F_x/∂y)      (2.7)
Then we can rewrite Maxwell’s curl equations into six equations in differential
form in rectangular coordinates:
μ ∂H_x/∂t = ∂E_y/∂z − ∂E_z/∂y − σ_m H_x − M_x         (2.8)

μ ∂H_y/∂t = ∂E_z/∂x − ∂E_x/∂z − σ_m H_y − M_y         (2.9)

μ ∂H_z/∂t = ∂E_x/∂y − ∂E_y/∂x − σ_m H_z − M_z         (2.10)

ε ∂E_x/∂t = ∂H_z/∂y − ∂H_y/∂z − σ_e E_x − J_x         (2.11)

ε ∂E_y/∂t = ∂H_x/∂z − ∂H_z/∂x − σ_e E_y − J_y         (2.12)

ε ∂E_z/∂t = ∂H_y/∂x − ∂H_x/∂y − σ_e E_z − J_z         (2.13)
Second, the FDTD method establishes a model space, which is the physical region
where Maxwell’s equations are solved or the simulation is performed. The model
-
CHAPTER 2. BACKGROUND 10
space is then discretized to a number of cells and the time duration “t” is discretized
to a number of time steps. The size of the unit cell is often set to less than one
tenth of the electromagnetic wavelength. The unit time step is calculated as the time
duration the electromagnetic wave spends travelling through a unit cell. Every cell in
the model space has its associated electric and magnetic fields E and H. The material
type of each cell is also specified by its permittivity, permeability and conductivity.
We use model size to represent the number of cells in the model space. The model
size depends on the size of unit cell. The smaller the unit cell, the larger the model
size.
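The discretization rules just described (a unit cell below one tenth of the wavelength, and a time step tied to the time the wave needs to cross one cell) can be sketched as follows. The 3 GHz source frequency and the 0.5 Courant factor are illustrative assumptions, not parameters taken from the thesis models.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum (m/s)

def grid_parameters(freq_hz: float, cells_per_wavelength: int = 10,
                    courant: float = 0.5):
    """Choose a unit cell size and time step for a uniform 3D FDTD grid.

    The cell edge is set to (wavelength / cells_per_wavelength); the time
    step is a fraction (the Courant factor) of the time a wave needs to
    cross one cell.  For a uniform 3D grid, a factor <= 1/sqrt(3) keeps
    the leapfrog update stable.
    """
    wavelength = C0 / freq_hz
    dx = wavelength / cells_per_wavelength
    dt = courant * dx / C0
    return dx, dt

# Example: a hypothetical 3 GHz source on a lambda/10 grid.
dx, dt = grid_parameters(3e9)
print(f"dx = {dx * 1e3:.2f} mm, dt = {dt * 1e12:.2f} ps")
```

Halving the cell size doubles the model size in each dimension and also halves the time step, which is the root of the method's high computational cost.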
The 3D grid shown in Figure 2.1, named the “Yee-Cell” [3], is very helpful for
understanding the discretized electromagnetic model space.
Figure 2.1: The geometrical representation of the 3D Yee cell.
As illustrated in Figure 2.1, the “Yee-Cell” is a small cube, which can be treated
as one single cell picked from the discretized model space. ∆x, ∆y, ∆z are the three
dimensions of this cube. We use (i, j, k) to denote the point whose real coordinate
is (i∆x, j∆y, k∆z) in the model space. Instead of placing E and H components in the center of each cell, the E and H field components are interlaced so that every E component is surrounded by four circulating H components, and every H component is surrounded by four circulating E components. The operation ∇× or “curl” in Equations (2.5), (2.6) and (2.7) is easily interpreted using this figure. For example, the Hx component located at point (i, j+1/2, k+1/2) is surrounded by four circulating E components, two Ey components and two Ez components, matching Equation (2.8), which states that the Hx component increases directly in response to a curl of E components in the x direction with a constant µ. The constant µ specifies the magnetic permeability of the material at the location of this unit cell. Similarly, the E components increase directly in response to the curl of the H components, with a constant proportional to the electrical permittivity ε of the material at the current location. Maxwell's equations in rectangular coordinates, Equations (2.8)-(2.13), can be clearly illustrated by Yee's cell.
In this section, we represent an electric component Ez at the discretized 3D coordinate (i∆x, j∆y, (k+1/2)∆z) as E_z|_{i,j,k+1/2}, and when the current time is in the discretized N-th time step, we denote the same component as E_z|^N_{i,j,k+1/2}.
Third, all the partial derivative operators in Equations (2.8)-(2.13) are replaced by their central difference approximations, as illustrated in Equation (2.14). The terms of second order and higher in the Taylor series expansion are discarded to keep the algorithm simple and reduce the computational cost. A variable that appears without a partial derivative can be approximated by time averaging, as shown in Equation (2.15), which has a similar structure to the central difference approximation.
\frac{\partial f(u_0)}{\partial u} = \frac{f(u_0 + \Delta u) - f(u_0 - \Delta u)}{2\Delta u} + O[(\Delta u)^2]   (2.14)

f(u_0) = \frac{f(u_0 + \Delta u) + f(u_0 - \Delta u)}{2}   (2.15)
For example, Equation (2.8) is changed to:
\mu \frac{H_x(t_0 + \frac{\Delta t}{2}) - H_x(t_0 - \frac{\Delta t}{2})}{\Delta t} = \frac{E_y(z_0 + \frac{\Delta z}{2}) - E_y(z_0 - \frac{\Delta z}{2})}{\Delta z} - \frac{E_z(y_0 + \frac{\Delta y}{2}) - E_z(y_0 - \frac{\Delta y}{2})}{\Delta y} - \sigma_m \frac{H_x(t_0 + \frac{\Delta t}{2}) + H_x(t_0 - \frac{\Delta t}{2})}{2} - M_x   (2.16)
After these modifications, the FDTD method turns Maxwell’s equations into a
set of linear equations from which we can calculate the electric and magnetic fields
in every cell in the model space. We call these equations the electric and magnetic
field updating algorithms. Six field updating algorithms form the basis of the FDTD
method. For example, the field updating algorithm for the Hx component, derived
from Equation (2.16) or Equation (2.8), is given by:
\left(\frac{\mu}{\Delta t} + \frac{\sigma_m}{2}\right) H_x\big|^{N+\frac{1}{2}}_{i,j+\frac{1}{2},k+\frac{1}{2}} = \left(\frac{\mu}{\Delta t} - \frac{\sigma_m}{2}\right) H_x\big|^{N-\frac{1}{2}}_{i,j+\frac{1}{2},k+\frac{1}{2}} + \frac{1}{\Delta z}\left[E_y\big|^{N}_{i,j+\frac{1}{2},k+1} - E_y\big|^{N}_{i,j+\frac{1}{2},k}\right] - \frac{1}{\Delta y}\left[E_z\big|^{N}_{i,j+1,k+\frac{1}{2}} - E_z\big|^{N}_{i,j,k+\frac{1}{2}}\right] - M_x\big|^{N}_{i,j+\frac{1}{2},k+\frac{1}{2}}   (2.17)
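The updating equation (2.17) for a single cell can be sketched in C as follows. The scalar arguments stand in for neighboring 3D array entries, and all names are illustrative rather than taken from the thesis code:

```c
/* A one-cell sketch of the Hx updating equation (2.17). */
double update_hx(double hx_prev,              /* Hx at time step N-1/2      */
                 double ey_k1, double ey_k0,  /* Ey at (..., k+1), (..., k) */
                 double ez_j1, double ez_j0,  /* Ez at (..., j+1), (..., j) */
                 double mx,                   /* magnetic current term      */
                 double mu, double sigma_m,
                 double dt, double dy, double dz) {
    double ca = mu / dt + sigma_m / 2.0;      /* coefficient on new Hx      */
    double cb = mu / dt - sigma_m / 2.0;      /* coefficient on old Hx      */
    double curl = (ey_k1 - ey_k0) / dz - (ez_j1 - ez_j0) / dy;
    return (cb * hx_prev + curl - mx) / ca;
}
```

In lossless media (σ_m = 0, M_x = 0) this reduces to Hx_new = Hx_old + (∆t/µ)·curl, the familiar explicit leapfrog update.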
2.1.2 FDTD Algorithm
The FDTD algorithm is based on these field updating equations. The flow diagram of
the FDTD algorithm is shown in Figure 2.2. The FDTD algorithm first establishes the
computational model space, specifies the storage memory for all the electromagnetic
field data, specifies the material properties for every cell in the model space and
specifies the excitation source. The excitation source can be a point source, a plane
wave, an electric field or another option depending on the application. The FDTD
algorithm then runs through the field updating algorithms on every cell in the model
space. The electromagnetic field data are updated and stored back to the original
memory. Once it finishes, the algorithm will go to the next time step and run through
all the cells again. The algorithm iterates the updating computations for each time
step, until the desired time period is completed. The boundary condition part consists
of special algorithms dealing with the unit cells located on the boundary of the model
space. For these cells, some of the adjacent cells may not exist, so special methods
are needed to update them appropriately. The output of the FDTD algorithm can
be any electric or magnetic field data from any cell in any time step.
Figure 2.2: A Complete Flow Diagram of the FDTD Method
The electric and magnetic fields depend on each other. As we can see from
Equation (2.5), in Maxwell’s curl equations, the curl of the electric field E in space
equals the partial derivative of the magnetic field H with respect to “t”. After the
modifications of the FDTD method, the expansion of the curl operation depends on
the E field in the cells surrounding the current cell. The expansion of the partial
derivative of H with respect to “t” depends on the current magnetic field minus the
stored previous time step’s magnetic field. In other words, at any cell in the model
space, the current time step’s magnetic field depends on the stored previous time
step’s magnetic field and the electric fields in the surrounding cells. Similarly, at any
cell in the model space, the current time step’s electric field depends on the stored
previous time step’s electric field and the magnetic fields in the surrounding cells.
Because of the dependence between the electric and magnetic fields, we cannot
update them in parallel. So the FDTD algorithm updates the electric and mag-
netic fields in an interlaced manner as shown in Figure 2.2. Notice that electric and
magnetic field updating algorithms are interlaced in the time domain so magnetic
components will be updated first based on the previous stored electric data and then
electric components will be updated based on the newly updated magnetic data. This
interlacing mechanism will be iterated time step by time step until the job finishes.
The FDTD algorithm is an accurate and successful modelling technique for com-
putational electromagnetics. It is flexible, allowing the user to model a wide variety
of electromagnetic materials and environments on most scales. It is also easy to un-
derstand, with its clear structure and direct time domain calculation. However, the
FDTD algorithm is a data intensive and computationally intensive algorithm. The
amount of data in the FDTD model space can be very large for large model sizes,
creating a heavy burden on both memory storage and access. The computation is
also intensive for each cell in the FDTD model space, including updating six electric
and magnetic fields and also updating the boundary conditions. The heavy burden
on data access and complex computation makes the FDTD algorithm run slowly on a
single processor. Modelling an electromagnetic problem using the FDTD method can
easily require several hours. Without powerful computational resources, the FDTD
models are too time-consuming to be implemented on a single computing node. Ac-
celerating the FDTD method with inexpensive and compact hardware will greatly
expand its application and popularity.
2.1.3 Absorbing Boundary Conditions
The above electric and magnetic updating algorithms work accurately inside the
model space. However, because the cells on the boundary of the model space do
not have the adjacent cells needed in the updating algorithms, the algorithm does
not work properly on the boundary, so there are algorithm-introduced reflections
on the boundary of the model space. Special techniques are necessary to deal with
the cells on the boundary, prevent nonphysical reflections from outgoing waves and
simulate the extension of the model space to infinity. These special techniques are
called the absorbing boundary conditions (ABC). The development of efficient and
accurate ABCs is very important for the FDTD method and several types of ABC
have been proposed, which are grouped into two major approaches: analytical and perfectly matched layer.
2.1.3.1 The Mur-type absorbing boundary conditions
The first group of ABCs are called the analytical ABCs, which are mainly deduced
from the electromagnetic wave equation. The Mur-type ABC [9] is one of the popular ABCs in this group. The Mur-type ABC is deduced from the one-way wave equation,
which is a part of the wave equation that allows wave propagation only in one di-
rection [5]. The wave equation in three dimensions shown in Equation (2.18) can be
written as Equation (2.19), or more compactly, as Equation (2.20). A part of the
wave equation, L−U = 0, as in Equation (2.21), is the one-way wave equation which
absorbs most of the impinging waves on the boundary of the model space.
\left(\frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2} + \frac{\partial^2}{\partial z^2} - \frac{1}{c^2}\frac{\partial^2}{\partial t^2}\right) U = 0   (2.18)
\left[\frac{\partial^2}{\partial x^2} - \frac{1}{c^2}\frac{\partial^2}{\partial t^2}\left(1 - \frac{c^2\,\partial^2/\partial y^2}{\partial^2/\partial t^2} - \frac{c^2\,\partial^2/\partial z^2}{\partial^2/\partial t^2}\right)\right] U = 0   (2.19)

\left[\frac{\partial^2}{\partial x^2} - \frac{1}{c^2}\frac{\partial^2}{\partial t^2}\left(1 - S^2\right)\right] U = LU = L^+ L^- U = 0   (2.20)

\left(\frac{\partial}{\partial x} - \frac{1}{c}\frac{\partial}{\partial t}\sqrt{1 - S^2}\right) U = L^- U = 0   (2.21)
Depending on the approximation of √(1 − S²), there are first-order and second-order Mur-type ABCs. The first-order Mur-type ABC approximates √(1 − S²) ≈ 1 and the second order uses √(1 − S²) ≈ 1 − S²/2. The second-order Mur-type is more accurate but more complex than the first-order one. For example, the discretized version of the first-order Mur-type ABC for the z component of the electric field is given by:
E_z\big|^{N+1}_{0,j,k+\frac{1}{2}} = E_z\big|^{N}_{1,j,k+\frac{1}{2}} + \frac{c\Delta t - \Delta x}{c\Delta t + \Delta x}\left[E_z\big|^{N+1}_{1,j,k+\frac{1}{2}} - E_z\big|^{N}_{0,j,k+\frac{1}{2}}\right]   (2.22)
The Mur-type ABC is easy to understand. It updates the electric field data on
the boundary cells every time step after the main updating algorithms are finished,
as shown in Figure 2.2. The updates of the field data are processed in-place with no
extra memory consumption, which makes the Mur-type ABC a good memory saver.
However, the biggest problem with the Mur-type ABC is that its quality is only
good for incident angles up to 25 degrees [6]. Its quality deteriorates significantly for
greater incidence angles. This property limits the usage of the Mur-type ABC.
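The first-order Mur update of Equation (2.22) for one boundary cell on the x = 0 face can be sketched as follows; ez0_old and ez1_old are Ez at i = 0 and i = 1 from step N, ez1_new is Ez at i = 1 already advanced to step N+1, and all names are illustrative:

```c
/* One-cell sketch of the first-order Mur ABC, Equation (2.22). */
double mur_abc_x0(double ez0_old, double ez1_old, double ez1_new,
                  double c, double dt, double dx) {
    double k = (c * dt - dx) / (c * dt + dx);   /* Mur coefficient */
    return ez1_old + k * (ez1_new - ez0_old);
}
```

When c∆t = ∆x the coefficient k vanishes and the boundary value is simply the old value one cell inside, i.e. a normally incident wave leaves the grid without reflection; the in-place update is why the Mur-type ABC needs no extra storage.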
2.1.3.2 The UPML absorbing boundary conditions
The other approach is called the Perfectly Matched Layer (PML) ABC [10]. PML
ABC is a highly effective absorbing boundary condition. By setting the outer bound-
ary of the model space to an absorbing material layer, PML can absorb most of
the impinging waves and have low reflection. The most favorable property of the
PML ABC is that its quality does not depend on incidence angle, polarization or
frequency [6].
The UPML (Uniaxial PML) ABC [7] is modified from the PML ABC by replacing
the mathematical PML medium model with a physical uniaxial anisotropic PML
medium, which keeps all the favorable properties of the PML ABC. Instead of setting
only the outer boundary of the model space to the PML material, UPML uses a
generalized formulation of the entire FDTD model space, providing both lossless,
isotropic medium in the primary computation zone and individual UPML absorber
in the outer boundary cells [6]. Based on the original UPML, a novel modification of
UPML, the Single Pole Conductivity Model [11, 12], better supports lossy, dispersive
media. The overall quality of the UPML ABC is more than 30dB better than the
original PML on dispersive media. Compared to the Mur-type ABC, the quality of
the UPML ABC is about 10dB better for incident angles less than 30 degrees and
much better for incident angles larger than 30 degrees [12], because the Mur-type
ABC deteriorates for greater incident angles [6].
The advantage of this approach is its good quality. Besides the high quality of
results in normal media, the quality of the UPML ABC is especially good for disper-
sive media, which is useful in solving many realistic problems. Also, because UPML
integrates boundary conditions with the electric field updating algorithms, the FDTD
algorithm is more uniform in structure, making it a good model for hardware data-
path design. The disadvantage of the UPML ABC is its high memory consumption.
The UPML ABC introduces nine extra field values for each cell in the model space, consuming 1.5 times more memory than the normal FDTD algorithm. Because accuracy of the ABC is very important for the FDTD algorithm, the PML and UPML
ABCs are very popular in FDTD research and applications due to their high quality.
The time-harmonic Maxwell's curl equations in the frequency domain in the UPML regions can be written as:
\nabla \times \vec{E} = -j\omega\mu\,\bar{s}\,\vec{H}, \qquad \nabla \times \vec{H} = (j\omega\epsilon + \sigma)\,\bar{s}\,\vec{E}   (2.23)
where s̄ is a tensor that can be used throughout the entire FDTD model space:

\bar{s} = \mathrm{diag}\left[\frac{s_y s_z}{s_x}, \frac{s_x s_z}{s_y}, \frac{s_x s_y}{s_z}\right]   (2.24)

s_x = \kappa_x + \frac{\sigma_x}{j\omega\epsilon}, \qquad s_y = \kappa_y + \frac{\sigma_y}{j\omega\epsilon}, \qquad s_z = \kappa_z + \frac{\sigma_z}{j\omega\epsilon}   (2.25)
In the PML region, on the boundary of the model space, σx is the PML's conductivity while κx is a real-valued parameter [6]. In the lossless primary computation zone, σ is always zero and κ is always one, so the medium is just the normal medium and the data-path is the normal FDTD computation. The auxiliary variables P and P² are introduced to reduce the complexity of this convolution calculation:
P_x = \frac{s_y s_z}{s_x} E_x, \qquad P_y = \frac{s_x s_z}{s_y} E_y, \qquad P_z = \frac{s_x s_y}{s_z} E_z   (2.26)

P^2_x = \frac{1}{s_y} P_x, \qquad P^2_y = \frac{1}{s_z} P_y, \qquad P^2_z = \frac{1}{s_x} P_z   (2.27)
After substituting the E components from Equation (2.26) into (2.23), and substituting ∂/∂t for jω in Equation (2.23), an example of the UPML algorithm in lossy, dielectric media is given by the following equations. The details of the UPML algorithm are well documented [6] and the novel modification of the UPML algorithm for dispersive media is introduced in several papers [11, 12, 13, 14, 15, 16, 17, 18]. For example, the discretized versions of the UPML algorithm for the Ex and Hx components are:
P_x\big|^{N+1}_{i+\frac{1}{2},j,k} = \frac{2\epsilon - \sigma\Delta t}{2\epsilon + \sigma\Delta t}\, P_x\big|^{N}_{i+\frac{1}{2},j,k} + \frac{2\Delta t}{2\epsilon + \sigma\Delta t}\left(\frac{H_z\big|^{N+\frac{1}{2}}_{i+\frac{1}{2},j+\frac{1}{2},k} - H_z\big|^{N+\frac{1}{2}}_{i+\frac{1}{2},j-\frac{1}{2},k}}{\Delta y} - \frac{H_y\big|^{N+\frac{1}{2}}_{i+\frac{1}{2},j,k+\frac{1}{2}} - H_y\big|^{N+\frac{1}{2}}_{i+\frac{1}{2},j,k-\frac{1}{2}}}{\Delta z}\right)   (2.28)

P^2_x\big|^{N+1}_{i+\frac{1}{2},j,k} = \frac{2\epsilon\kappa_y - \sigma_y\Delta t}{2\epsilon\kappa_y + \sigma_y\Delta t}\, P^2_x\big|^{N}_{i+\frac{1}{2},j,k} + \frac{2\epsilon}{2\epsilon\kappa_y + \sigma_y\Delta t}\left[P_x\big|^{N+1}_{i+\frac{1}{2},j,k} - P_x\big|^{N}_{i+\frac{1}{2},j,k}\right]   (2.29)

E_x\big|^{N+1}_{i+\frac{1}{2},j,k} = \frac{2\epsilon\kappa_z - \sigma_z\Delta t}{2\epsilon\kappa_z + \sigma_z\Delta t}\, E_x\big|^{N}_{i+\frac{1}{2},j,k} + \frac{1}{2\epsilon\kappa_z + \sigma_z\Delta t}\left[(2\epsilon\kappa_x + \sigma_x\Delta t)\, P^2_x\big|^{N+1}_{i+\frac{1}{2},j,k} - (2\epsilon\kappa_x - \sigma_x\Delta t)\, P^2_x\big|^{N}_{i+\frac{1}{2},j,k}\right]   (2.30)

B_x\big|^{N+\frac{3}{2}}_{i,j+\frac{1}{2},k+\frac{1}{2}} = \frac{2\epsilon\kappa_y - \sigma_y\Delta t}{2\epsilon\kappa_y + \sigma_y\Delta t}\, B_x\big|^{N+\frac{1}{2}}_{i,j+\frac{1}{2},k+\frac{1}{2}} - \frac{2\epsilon\Delta t}{2\epsilon\kappa_y + \sigma_y\Delta t}\left(\frac{E_z\big|^{N+1}_{i,j+1,k+\frac{1}{2}} - E_z\big|^{N+1}_{i,j,k+\frac{1}{2}}}{\Delta y} - \frac{E_y\big|^{N+1}_{i,j+\frac{1}{2},k+1} - E_y\big|^{N+1}_{i,j+\frac{1}{2},k}}{\Delta z}\right)   (2.31)

H_x\big|^{N+\frac{3}{2}}_{i,j+\frac{1}{2},k+\frac{1}{2}} = \frac{2\epsilon\kappa_z - \sigma_z\Delta t}{2\epsilon\kappa_z + \sigma_z\Delta t}\, H_x\big|^{N+\frac{1}{2}}_{i,j+\frac{1}{2},k+\frac{1}{2}} + \frac{1}{(2\epsilon\kappa_z + \sigma_z\Delta t)\mu}\left[(2\epsilon\kappa_x + \sigma_x\Delta t)\, B_x\big|^{N+\frac{3}{2}}_{i,j+\frac{1}{2},k+\frac{1}{2}} - (2\epsilon\kappa_x - \sigma_x\Delta t)\, B_x\big|^{N+\frac{1}{2}}_{i,j+\frac{1}{2},k+\frac{1}{2}}\right]   (2.32)
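The auxiliary-variable structure of the Ex update, Equations (2.28)-(2.30), can be sketched per cell: Px is advanced from the curl of H, then P²x from Px, then Ex from P²x. Scalars stand in for the full 3D arrays; eps is the permittivity, sigma the loss, kx/ky/kz the κ values and sgx/sgy/sgz the PML conductivities. All names are illustrative:

```c
/* One-cell sketch of the three-step UPML Ex update, Eqs. (2.28)-(2.30). */
typedef struct {
    double px, p2x, ex;          /* per-cell state for the x component */
} UpmlCellX;

void upml_update_ex(UpmlCellX *c, double curl_h,
                    double eps, double sigma, double dt,
                    double kx, double ky, double kz,
                    double sgx, double sgy, double sgz) {
    /* Equation (2.28): Px advanced by the curl of H */
    double px_new = (2*eps - sigma*dt) / (2*eps + sigma*dt) * c->px
                  + 2*dt / (2*eps + sigma*dt) * curl_h;
    /* Equation (2.29): P2x advanced by the change in Px */
    double p2x_new = (2*eps*ky - sgy*dt) / (2*eps*ky + sgy*dt) * c->p2x
                   + 2*eps / (2*eps*ky + sgy*dt) * (px_new - c->px);
    /* Equation (2.30): Ex advanced by the change in P2x */
    c->ex = (2*eps*kz - sgz*dt) / (2*eps*kz + sgz*dt) * c->ex
          + ((2*eps*kx + sgx*dt) * p2x_new
           - (2*eps*kx - sgx*dt) * c->p2x) / (2*eps*kz + sgz*dt);
    c->px = px_new;
    c->p2x = p2x_new;
}
```

With σ = σx = σy = σz = 0 and κ = 1 the chain collapses to Ex += (∆t/ε)·curl(H), the normal lossless FDTD update, which is why the same data-path can cover both the primary zone and the absorbing layers.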
2.2 FPGA and Reconfigurable Computing
There are several options for implementing digital logic designs. The first option
is to program the algorithms into computer software and run them on general pur-
pose microprocessors. Designers can change the software instructions anytime so the
digital logic can be changed easily and flexibly. Although software is very flexible,
software running on general purpose microprocessors is slow compared to hardware
implementations for many applications. The microprocessor must read and decode
each software instruction and then execute it, creating a high execution overhead
for each individual operation [19]. Also, the microprocessor can only handle limited
parallelism and pipelining, while a custom hardware design can fully exploit the ben-
efits of both. For many scientific computations and real-time applications, software
implementation is too slow to be practical.
The second option is to use an Application Specific Integrated Circuit (ASIC),
which is designed for a specific computational task and achieves very fast speed and
high efficiency when executing the exact computation for which it was designed [19].
ASIC chips are small and consume less power than the other options, so they can be
found in many products from mobile phones to aeroplanes. However, the development
cost of ASICs is very high. What’s more, once the chip is fabricated, it cannot be
modified at all. The chip needs to be redesigned and refabricated if any single part
of the digital logic needs to be altered.
Figure 2.3: Speed vs. flexibility of software, FPGAs and ASICs
Reconfigurable computing devices, mostly known as Field Programmable Gate
Arrays (FPGAs), offer the benefits of both ASICs and software. Like an ASIC, an
FPGA can implement millions of gates of logic, which are designed optimally for
a specific task, in a single integrated circuit, thus achieving high performance with
low power and small size. Although the FPGA-based design may not challenge the
performance of ASICs, the fact is that few applications will ever merit the expense of
high cost ASICs. Also like software, FPGAs are reprogrammable by designers at any
time, and are therefore flexible and convenient. Although FPGA-based development
is still costly compared to software, FPGA-based design can offer extremely high per-
formance, surpassing any other programmable solutions for many applications [20].
Figure 2.3 compares speed and flexibility among software, FPGAs and ASICs. Rela-
tive high speed, relative high flexibility and advantages in cost, size and power make
FPGAs suitable for a wide variety of applications.
In this section, we will first introduce the basics of reconfigurable devices and com-
mercial FPGA computing boards. Then we will talk about reconfigurable computing
based design.
2.2.1 FPGA Basics
An FPGA is a programmable hardware device where the hardware circuits are pro-
grammable by the designer to create custom digital logic designs [21]. The structure
of the FPGA is composed of configurable logic blocks (CLBs), programmable routing
and Input/Output blocks (IOBs).
Figure 2.4 shows the structure of an FPGA from Xilinx, Inc [22]. The CLBs are
arranged in a matrix in the FPGA, and connected by programmable routing. IOBs
are connected to CLBs via programmable routing also. Each CLB is a basic functional
element that implements part of the algorithm. Digital logic is implemented in several
CLBs. The structure of CLBs vary in different FPGA chips; however, CLBs are
normally made up of lookup tables (LUTs), carry logic and storage elements. The
Figure 2.4: Xilinx FPGA structure and its CLB
bottom right corner of Figure 2.4 shows the structural diagram of a slice in the Xilinx
Virtex-II Pro FPGA chip; each CLB in the Virtex-II Pro chip has two slices. The
lookup tables (LUTs) are the most important part of an FPGA. The 4-input LUT
shown in Figure 2.4 is composed of 2^4 = 16 memory units. The input signal is the address
data for the LUT to pick the corresponding value from the memory. The memory
units can be programmed to store any truth table, or any function, so most logic
can be efficiently implemented by one or more lookup tables. The storage elements
are generally used as registers and the carry logic is usually added in order to avoid
wasting LUTs and for faster speed.
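The behavior of a 4-input LUT can be modeled in a few lines of C: the 16 configuration bits are the truth table, and the four input signals form a 4-bit address. This is a software illustration of the concept, not how any vendor tool represents LUTs:

```c
#include <stdint.h>

/* Software model of a 4-input LUT: 2^4 = 16 configuration bits held in
   one 16-bit word; the inputs address one stored truth-table bit.
   Programming the bits implements any 4-input boolean function. */
typedef uint16_t Lut4;                 /* the 16 SRAM configuration bits */

int lut4_eval(Lut4 config, int a, int b, int c, int d) {
    int addr = (a & 1) | ((b & 1) << 1) | ((c & 1) << 2) | ((d & 1) << 3);
    return (config >> addr) & 1;       /* pick the stored truth-table bit */
}

#define LUT4_AND ((Lut4)0x8000)        /* only address 1111 maps to 1     */
#define LUT4_XOR ((Lut4)0x6996)        /* 1 where the address parity is odd */
```

Reprogramming the LUT is just rewriting the 16 bits, which is exactly what loading a new bitstream into the SRAM-based configuration memory does.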
Most current FPGAs are SRAM-based, which means all the programmable LUTs
and routing are based on SRAM bits [19]. Reconfiguring the FPGA is actually done
by loading the bitstream into SRAM, which takes only milliseconds to complete.
FPGAs can be reprogrammed in the target system without detaching the FPGA
chip, a perfect match for hardware systems which need frequent upgrading or have
several optional designs. FPGAs can be reprogrammed at run-time too, which means
part of an FPGA can be reprogrammed when the system is running on the other
part of the chip. These advantages make FPGAs very flexible and suitable for many
applications.
2.2.2 FPGA Computing Boards
The FPGA computing resources we use in this research are typical commercial off
the shelf (COTS) FPGA computing boards, which are widely available and easy to
setup. These boards normally contain one or two FPGA chips. Each FPGA chip may
be connected to several on-board memories, including SRAM or DRAM. One FPGA
and its associated on-board memory form the basic combination used for custom
hardware computation. These computing boards are often PCI boards for a desktop
computer or PCMCIA cards for a laptop computer. Data and control signals can
be transferred between the FPGA computing board and the host PC via the PCI
interface, either via standard PCI transfer or fast DMA transfer. There are also some boards that use other interfaces, like fiber-optic or RocketIO, for faster speed and flexible connection.
The board we use is a WildStarTM -II Pro/PCI reconfigurable FPGA computing
board from Annapolis Micro Systems [23]. Table 2.1 lists the main features of the
WildStar™-II Pro/PCI board, and Figure 2.5 shows its block diagram.
There are two Xilinx Virtex-II Pro FPGAs on the WildStar-II Pro board. Each
WildStar-II Pro/PCI board
  FPGA chips:        Two Xilinx Virtex-II Pro XC2VP70 FPGAs [22]
                     (33,088 slices, 328 embedded multipliers, 5,904Kb BlockRAM each)
  Memory ports:      12 DDR-II SRAM ports, 54MBytes total (6 × 4.5MBytes per FPGA)
  Memory bandwidth:  11 GBytes/sec (6 × 72-bit ports per FPGA)
  PCI interface:     133MHz/64-bit PCI-X, up to 1.03GBytes/sec

Table 2.1: Main features of the WildStar-II Pro/PCI FPGA board
FPGA has 328 embedded 18×18 signed multipliers and 328 18Kb BlockRAMs. An embedded multiplier built inside the FPGA is much faster than a multiplier component implemented in reconfigurable logic. So it is better to use the embedded
multipliers if possible. The BlockRAMs are the fastest memory the designer can use
in an FPGA design. Critical data interchange and interfacing can be programmed
using the BlockRAMs. A pair consisting of an embedded multiplier and a Block-
RAM shares the same data and address buses in the Xilinx Virtex-II architecture, so
once we use the embedded multiplier, we cannot use its corresponding BlockRAM,
and vice versa. Therefore the sum of the total number of embedded multipliers and
BlockRAMs used must be less than 328.
Figure 2.5 shows the block diagram of the WildStar-II Pro FPGA board. Each
FPGA is connected to six independent on-board memories. The on-board memories
are 1M×36-bit DDR-II SRAMs which have 72-bit data bandwidth and speeds up to 200MHz. The size of each SRAM is 36Mbits, or 4.5MBytes, so the total SRAM
attached to each FPGA is 27MBytes. The WildStar-II Pro board is connected to
the desktop computer via a PCI-X interface, with a DMA data transfer rate up to
1GB/s between the host PC and FPGA. Convenient interfaces between the FPGA
chip, on board memories and host PC makes the WildStar-II Pro FPGA computing
board a good platform for hardware implementation.
Figure 2.5: Block diagram of the WildStarTM -II Pro board
2.2.3 Reconfigurable Computing Basics
As introduced in the beginning of this section, the key feature of reconfigurable
computing is its ability to greatly speedup algorithm computation while maintaining
flexibility [19]. How does reconfigurable hardware design achieve high performance?
Based on the structure of the COTS FPGA computing board, we will introduce the
basic elements of reconfigurable computing which play important roles in computation
speedup.
As shown in Figure 2.6, an FPGA computing board normally contains an FPGA
chip, on-board memories and a PCI controller; these form the basic components of
custom hardware design. A typical hardware design reads the input data from the
on-board memory or PCI interface, performs the computation inside the FPGA, and
writes the results back to memory or PCI interface. The hardware design inside an
Figure 2.6: Simple Structure Diagram of the Normal COTS Reconfigurable Computing Board
FPGA normally consists of deep pipelines with as much parallelism as possible. The size and speed of the FPGA chip, the size and speed of the on-board memory
and the speed of the PCI controller all determine the performance of the hardware
design on this board. However, given a specific FPGA computing board, there are
three design approaches which contribute most to the performance of the FPGA
based design. They are pipelining, parallelism and memory hierarchy/interface.
2.2.3.1 Pipelining
Pipelining is an implementation technique whereby multiple instructions are over-
lapped in execution [24]. A pipeline is like an assembly line.
1. There are many instructions waiting to be processed just like there are many
cars waiting to be produced in the assembly line.
2. All the instructions should have similar structure. Instructions should use the
same hardware processing units, just like all the cars on the assembly line should have
identical production procedures. We call the part of an instruction that corresponds
to each hardware processing unit a “task”, so each instruction is composed of several
tasks. If there is only one instruction to be executed, each of the tasks will be executed
one by one in N clock cycles. Of course, the hardware units here are not efficiently
used because every hardware unit is used only once in the N clock cycles.
3. The benefit of pipelining is that it can run several instructions at the same time
and let them share hardware resources. The instructions are overlapped in execution,
in the shape of a cascade as shown in Figure 2.7. Figure 2.7 is an example structure
of a pipeline in the FDTD design. The rows represent instructions and the columns
represent clock cycles. We can see from this figure that every hardware unit is fully
utilized, since every hardware unit works on a different instruction in every clock
cycle, just like a working assembly line. The computation results now pop out of
the pipeline every clock cycle. Compared to the non-pipelined data-path, the design
throughput is increased by N times.
A pipeline is a perfect hardware structure when an algorithm needs to execute
many similar instructions. To build the pipeline, registers need to be inserted between
hardware units in the data-path to synchronize them according to a clock signal.
Because all the hardware units need to be finished in one clock cycle, the speed of
the pipeline is determined by the slowest hardware unit or pipeline step in the data-
path. The smaller the pipeline step, the higher the throughput. So the data-path and
registers need to be balanced, also called re-timing [25], to make the length of pipeline
steps similar. In the case of large and slow arithmetic components like multipliers, we can use specially designed “pipelined components” which separate their tasks internally and can themselves be pipelined into several steps, like the pipelined multiplier in Figure 2.7.
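The throughput argument above can be made concrete with simple cycle counts; the S + M − 1 figure assumes one instruction issued per clock and is an idealization:

```c
/* Cycle-count arithmetic for an S-step pipeline processing M similar
   instructions: unpipelined, every instruction takes S cycles in turn;
   pipelined, the pipe fills once and then delivers one result per clock. */
long cycles_unpipelined(long stages, long instrs) {
    return stages * instrs;
}

long cycles_pipelined(long stages, long instrs) {
    return stages + instrs - 1;        /* fill latency + one per clock */
}

double pipeline_speedup(long stages, long instrs) {
    return (double)cycles_unpipelined(stages, instrs)
         / (double)cycles_pipelined(stages, instrs);
}
```

For a long stream of instructions the speedup approaches the number of pipeline steps S, which is why deeply pipelined data-paths pay off for loop-heavy algorithms like FDTD.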
Pipelining has some limitations so that not all similar instructions can be pipelined.
Data hazards may arise when an instruction depends on the results of a previous instruction, so it cannot be executed until the previous instruction is completed.

Figure 2.7: Structure of a pipeline in the free space model of the 2D FDTD algorithm

Also, pipelining adds extra registers to the data-path, so it increases the latency and consumes more resources. In most cases, the latency and registers are small enough to be ignored, and the high throughput and high overall design performance introduced by pipelining is important for the reconfigurable computing speedup.
2.2.3.2 Parallelism
One of the most important techniques in reconfigurable computing design is to par-
allelize as many tasks as possible. As long as two tasks do not affect each other and
can be processed at the same time, we can execute them in parallel in order to speed
up the design. However, the hardware design is always limited by the availability of
hardware resources. The hardware resources include slice area, on-chip memory area
and input/output interface. More hardware resources mean higher cost and more
power consumption. Parallelism will consume more resources, so there is a trade-off
between hardware resources and speedup. Careful analysis is needed to decide
how much parallelism is suitable for the current design to achieve the fastest speed
with minimum cost.
2.2.3.3 Memory Hierarchy and Interface
Memory is an important part of a reconfigurable computing design. As shown in
Figure 2.6, the FPGA can access the fast memories on board or the memories on the
host computer via the slower PCI interface. Also, as we introduced in Section 2.2.2,
the fastest memory is the on-chip BlockRAMs available inside the FPGA chip. Since
the read/write speed of memory is important to the whole design’s speed, to avoid
a bottleneck at the memory interface, fast memory is needed. But fast memory is
expensive and its size is limited, so it is not practical to use fast memory for the
whole design. A memory hierarchy can be organized according to the sizes of the
different types of memories as well as their speed and cost. The goal of the memory
hierarchy is to provide a memory system with the cost almost as low as a design that
uses only cheap memory and with the speed almost as fast as a design that uses only
fast memory.
2.2.3.4 Summary
General purpose processors (GPP) can support only limited parallelism and pipelin-
ing. A GPP has a fixed number of Arithmetic Logic Units (ALUs) for parallelism,
and it usually pipelines arithmetic operations in a limited way within each ALU. In
contrast, a custom hardware design is organized for a specific algorithm. The data-
path can be fully parallelized and pipelined. The memory hierarchy and interface
can be optimized to support pipelining and parallelism. Compared to a software al-
gorithm running on a GPP, the reconfigurable computing design can be much faster.
The more pipelining and parallelism, the faster the design.
Not all algorithms are suitable for reconfigurable computing. Considering the
relatively higher cost of hardware implementation, a significant speedup should be
achieved to make the implementation practical. From the discussion above, we can
conclude that deep pipelining and parallelism and suitable memory hierarchy are the
key points of reconfigurable computing speedup. To fit these conditions, an algo-
rithm should have a high volume of loops or repeated, similar instructions, intensive
computations and limited memory access. Also, when most data in an algorithm have a similar range, they can be represented in fixed-point arithmetic, which is easier and faster to process in a custom hardware design. We will continue discussing these topics in Chapter 6.
2.2.4 Reconfigurable Computing Design Flow
The FPGA design flow includes coding the algorithm in VHDL, simulation and veri-
fication, synthesizing, placing and routing, and finally on-chip verification. Figure 2.8
shows a diagram of the FPGA design flow. Given a target algorithm, the first step
is algorithm analysis and hardware architectural design. Then we use VHDL code to
describe the hardware design, which can be debugged and verified through simula-
tion. The completed VHDL design is then synthesized to create the gate level netlist.
The place and route tool then fits the synthesized design into the FPGA, creating
a physical layout and a bit-stream file which can be loaded directly into the FPGA
chip. We use a C program running on a host PC to communicate with the loaded
FPGA chip for board level debugging and verification. If there is any problem in
board level testing, we need to go back and modify the VHDL code. The final design
which passes board level verification will be tested for performance.
Figure 2.8: The FPGA Design Flow
2.3 Finite Precision
The original 3D FDTD model was written in Fortran, using 64-bit double precision
floating-point data. Floating-point representation provides high resolution and large
dynamic range, but it can be costly. In hardware design, floating-point arithmetic
components are slower and consume more resources, making floating-point
significantly more expensive than fixed-point in both speed and area. Fixed-point
components, by contrast, are much faster and occupy less space. In applications
where the data resolution and dynamic range can be constrained, like the FDTD
algorithm, fixed-point arithmetic can provide similar precision at much higher speed than
floating-point arithmetic.
In this section, the definitions and properties of floating-point representation and
fixed-point representation are introduced. We then discuss fixed-point
quantization and the related algorithm analysis.
2.3.1 Floating-Point Arithmetic
Programmers use floating-point precision to represent real numbers in most computer
programs. The IEEE single precision floating-point standard representation requires
a 32 bit word. As shown in Figure 2.9, the most significant bit is the sign bit, ‘S’,
the next eight bits are the exponent bits ‘E’, from E30 to E23, and the final 23 bits
are the fraction bits ‘F’, from F22 to F0 [26] [27]:
Figure 2.9: Single precision floating-point representation
The value V represented by the word may be determined as follows:
V_float = (−1)^S · 1.F · 2^(E−BIAS), where BIAS = 127.
Single precision floating-point representation can represent data in the range of about −2^128
to 2^128, with resolution as fine as about 2^−126. This is a very large dynamic range, as it can
represent very small and very large numbers, which fixed-point representation lacks.
The IEEE double precision floating point standard representation requires a 64
bit word. Similar to the single precision standard, as shown in Figure 2.10, the most
significant bit is the sign bit, ‘S’, the next eleven bits are the exponent bits ‘E’, from
E62 to E52, and the final 52 bits are the fraction bits ‘F’, from F51 to F0:
Similar to single precision, the value V may be determined as:
Figure 2.10: Double precision floating-point representation
V_float = (−1)^S · 1.F · 2^(E−BIAS), where BIAS = 1023.
2.3.2 Fixed-Point Arithmetic
The fixed-point representation we use in this dissertation is two's complement
fixed-point representation. We can think of fixed-point data as a collection of N binary
digits, with 2^N possible states, as shown in Figure 2.11. The N-bit data and 2^N states
mean nothing until their interpretation is defined. For example, if we
define the difference between two adjacent states as the integer 1, and define all the data to
lie between 0 and 2^N, then the N-bit data can be thought of as non-negative integers less
than 2^N. However, if we define the most significant bit to be the sign bit, changing
the data range to between −2^(N−1) and 2^(N−1) − 1, the N-bit data can be thought
of as signed integers. And if we define the difference
between two states as 0.1, the meaning of the N-bit data changes again; they are
now rational numbers with a resolution of 0.1. So the meaning of fixed-point N-bit binary data
depends on its interpretation.
Figure 2.11: Uninterpreted fixed-point representation
In most cases, fixed-point representation is used to represent rational number
sets. Instead of defining the value of states, we can use a more direct way to define
the interpretation: the position of the binary point. For example, the 35-bit signed
fixed-point representation which will be used in this research has the form as shown
in Figure 2.12.
Figure 2.12: Interpreted fixed-point representation
The digits before the binary point are one sign bit and one integer bit; the digits
after it are called the fraction part. The resolution, which is defined as the smallest
non-zero magnitude representable [28], is determined by the format. In this example,
the resolution is 2^−33. Since the position of the binary point is fixed, this
representation is called “fixed-point”. We can write this representation as A(1,33) [28]. The
N-bit signed fixed-point data representation A(a, b) has the value:
V_fix = −S · 2^a + Σ_{n=0}^{a−1} A_n · 2^n + (1/2^b) · Σ_{n=0}^{b−1} B_n · 2^n
Although fixed-point representations admit different interpretations, they all share the
same rules for most arithmetic operations. We can treat fixed-point numbers
as integers, then choose the necessary bits and reposition the binary point after the
operation. For example, if we multiply two fixed-point numbers with the form shown
in Figure 2.13, we can ignore the binary point and multiply the two integers first.
After the multiplication, we pick the digits we need according to the position of the
binary point. In this example, it is 3 digits before the binary point and 26 digits
after. The result is still a 30-bit fixed-point datum.
Fixed-point representation has a fixed data structure that only works well
over a limited data range. For example, a fixed-point number with the format in
Figure 2.13 only has a range from -8.0 to 7.9999... [29]. If an integer result is bigger
Figure 2.13: Multiplication of fixed-point representation
than 2^3 − 1 = 7, then it cannot be represented by three digits before the binary
point. The situation where a number cannot fit into the representable range is called
overflow. The minimal absolute value which fixed-point data can represent is also
limited; the situation where a number is smaller than the resolution of the representation
is called underflow. Although the range and resolution of fixed-point representation
are limited compared to floating-point data of the same length, it is suitable for many
applications that do not need both high precision and wide dynamic range.
In the example in Figure 2.13, we ignored the 27th through 52nd bits after the binary
point in order to fit the result into the current representation. If a value with higher
resolution cannot fit in the current representation, some information must be
discarded in order to decrease its resolution. The most common methods are “truncation”
and “rounding toward nearest”. Truncation is computationally simpler: it just
drops all the digits beyond those required, as we did in the above example.
Rounding, by contrast, introduces a smaller error. When all numbers are rounded to the
nearest representable value, the result is more accurate than truncation half of the time,
and the same the other half of the time. Of course it requires more effort to compute;
rounding can be accomplished by adding one at the position of the first discarded bit
(half an LSB) and then
truncating.
Depending on the number of states required, N in the N-bit fixed-point
representation can be any length we want, so there is no standard length
for fixed-point representation. Because of this flexibility, a designer can
optimize a system by choosing the minimal-length fixed-point data with acceptable
error. The shorter the data length, the lower the cost, the more area available
for parallelism, and ultimately the faster the design. This point is particularly useful in
hardware design, since speed and area are the ultimate goals of the hardware designer.
Because of its simplicity and fast processing, fixed-point representation is widely used
in custom hardware designs for applications where the data resolution and dynamic
range can be constrained.
2.3.3 Fixed-Point Quantization
Fixed-point quantization refers to the process of approximating values in floating-
point representation with a fixed-point representation. An N-bit signed fixed-point
representation A(a, b), as shown in Figure 2.12, has the value:
V_fix = −S · 2^a + Σ_{n=0}^{a−1} A_n · 2^n + (1/2^b) · Σ_{n=0}^{b−1} B_n · 2^n
The rounding method we use in this research is rounding toward nearest. So once
a and b are decided, for any floating-point value V_float, we can approximate
V_float in fixed-point by representing the floating-point number in two's complement
arithmetic, rounding it, and then truncating the digits that do not fit in the fixed-point
representation.
The main problem remaining is how to choose the lengths a and b. Fixed-point
quantization creates error: the difference between the floating-point
and the fixed-point representations. Usually, the longer the fixed-point value, the smaller
the quantization error. However, longer values require more hardware resources, so
the fixed-point representation we choose should balance minimizing the error against
minimizing the bit-width to save hardware resources. This is a trade-off between
quantization error and hardware area. The decision is made according to an
algorithm analysis of relative error and visual error; the detailed analysis will be
discussed in Chapter 5.
2.4 Summary
This chapter presented the theories of the FDTD method, which is the target al-
gorithm of this dissertation. Background on FPGAs and reconfigurable computing,
which is the basis of this research, is then introduced. Finally we talk about the
concepts of finite precision and fixed-point quantization, which will be used in the
algorithm analysis. In the next chapter, we will present related work.
Chapter 3
RELATED WORK
In this chapter we introduce work related to this research. Related work
falls into the following categories: applications of the FDTD method, parallel-computing
systems for FDTD acceleration, VLSI implementations for FDTD acceleration, and
FPGA implementations for FDTD acceleration.
3.1 Applications of the FDTD Method
The FDTD method is an important tool for investigating the propagation, radiation,
and scattering of electromagnetic waves. Before the 1990s, the cost of solving
Maxwell's equations directly was large, and most of the related research was for
military defense purposes. For example, engineers used huge parallel supercomputing
arrays to model the radar wave reflection of airplanes by solving Maxwell’s equations,
trying to develop an airplane with low radar cross section [30]. The difficult task of
solving Maxwell’s equations has had more economical solutions since 1990, with the
development of fast computing resources applied to the FDTD method. Now the
application of the FDTD method has spread to many areas including:
• Discrete scattering, by modelling electromagnetic wave scattering from discrete objects [7],
• Antenna design and radar design, by modelling various kinds of antennas,
• Digital circuit packaging on multi-layer circuit boards, by helping designers analyze electromagnetic wave phenomena on complex circuit boards [30],
• Subsurface detection and ground penetrating radar (GPR), by modelling the electromagnetic phenomena of GPR on a subsurface model [31] [32], and
• Medical studies, by modelling electromagnetic wave phenomena in the human body, including:
– The study of the effect of electromagnetic waves from cell phones on the human brain,
– The study of the computation of light scattering from cells [33], and
– The study of breast cancer detection using electromagnetic antennas [34] [35].
Among these various applications, several works are closely related to our research,
both in the subsurface detection area and in medical studies.
The use of the FDTD method to simulate Ground Penetrating Radar (GPR)
applications for anti-personnel mine detection has been introduced [31, 32]. With
carefully modeled transmitting and receiving antenna locations, a 2D FDTD model
can simulate the wave propagation and scattering response of 3D GPR geometries with
realistic ground material [31]. An advanced 3D FDTD model was introduced [32]
which models more complex geometries, provides more accurate results and allows
complete freedom for the location of the transmitting and receiving antennas. The
3D FDTD model implements realistic dispersive soil along with air, metal, and
dielectric media. The Mur-type absorbing boundary conditions were implemented and
produced good results. Both the 2D and 3D models have been validated by
experiments performed with a commercially available ground penetrating radar system and
realistic soil. The results produced by the FDTD models are very accurate compared
with experimental data.
Research on FDTD modelling for microwave breast cancer detection has been
presented [34, 35]. Because of the large difference in electromagnetic properties between
malignant tumor tissue and normal breast tissue, microwave breast cancer detection
has attracted much interest as it may overcome some of the shortcomings of X-ray
based detection. Accurate computational modeling of microwaves in human tissue
with the FDTD method is very helpful for breast cancer detection research. In these
papers, the authors build a 3D simulation of the human breast, which includes a
semi-ellipsoid geometric representation of the breast and a planar chest wall. A
single-pole conductivity model [11] based on the Z-transform is presented as a simple and
efficient way of modeling various dispersive media, in the range of 30 MHz to 20 GHz,
including human tissue. The UPML Absorbing Boundary Condition is implemented
and performs better than the Mur-type ABC.
Clearly, the FDTD method is a powerful tool that can be used in many different
applications. Our hardware implementations are based on both the mine detection
and the breast cancer detection research mentioned above [31, 32, 34]. Our research
leads to a hardware FDTD core that supports both GPR and medical applications, and
other areas, with minimal modification.
The FDTD method is computationally intensive. Without powerful
computational resources, FDTD models are too time consuming to run on a single
computer node. The backward FDTD algorithm, or time reversal algorithm [35]
based on the FDTD method, especially needs speed-up: realistic subsurface
detection instruments, which are based on the backward algorithm, place the greatest
demands on computational resources in order to achieve real-time operating speed. Much
effort has been spent on accelerating FDTD implementations for better performance.
3.2 Parallel-computing Systems for FDTD Acceleration
One of the early approaches to accelerating the FDTD method was to use
supercomputers or parallel computer arrays to calculate the FDTD algorithm in software.
A two-node parallel algorithm exhibiting perfect speedup should run twice as fast
as a single-node computer. However, because of communication bottlenecks, parallel
systems cannot achieve perfect speedup; the goal is to achieve near-perfect speedup.
A parallel implementation of the FDTD algorithm is presented [36]. In this
paper, the authors compare parallel FDTD code running on three different parallel
computing systems: a typical 16-node Beowulf system and two traditional parallel
supercomputers. The traditional supercomputers have better inter-node
communication bandwidth and latency than the typical Beowulf system, but the cost
of traditional supercomputers is much higher. Well-written parallel FDTD code
performs well on all three systems. The authors compare the performance results of the
Beowulf system only to the supercomputers, not to a single-node computer. We can
assume the speedup is at most 16-fold, and somewhat less in practice due to
communication overhead. In this paper, the authors also analyze the decomposition
and communication problems of the parallel FDTD method, which is helpful to our
research on the two-FPGA implementation of the FDTD method. According to the
analysis, the FDTD code should scale to a large number of nodes easily, because most
communication is between neighboring nodes.
More detailed research on Beowulf system performance for the FDTD method
is introduced [37]. The project investigated the performance of a parallel FDTD
implementation on a Beowulf workstation cluster where the computation grid was
divided among nodes. The authors point out that for a fixed-size problem, as the
number of nodes increases, the spee