
  • Acceleration of the 3D FDTD Algorithm in Fixed-point

    Arithmetic using Reconfigurable Hardware

    A Thesis Presented

    by

    Wang Chen

    to

    The Department of Electrical and Computer Engineering

    in partial fulfillment of the requirements

    for the degree of

    Doctor of Philosophy

    in

    Electrical Engineering

    in the field of

    Electrical Engineering

    Northeastern University

    Boston, Massachusetts

    August 2007

  • © Copyright 2007 by Wang Chen

    All Rights Reserved

  • NORTHEASTERN UNIVERSITY

    Graduate School of Engineering

    Thesis Title: Acceleration of the 3D FDTD Algorithm in Fixed-point Arithmetic

    using Reconfigurable Hardware.

    Author: Wang Chen.

    Department: Electrical and Computer Engineering.

    Approved for Thesis Requirements of the Doctor of Philosophy Degree

    Thesis Advisor: Prof. Miriam Leeser Date

    Thesis Committee Member: Prof. Carey Rappaport Date

    Thesis Committee Member: Prof. Charles DiMarzio Date

    Department Chair: Prof. Ali Abur Date

    Graduate School Notified of Acceptance:

    Director of the Graduate School: Prof. Yaman Yener Date


    Copy Deposited in Library:

    Reference Librarian Date

  • Abstract

    Understanding and predicting electromagnetic behavior is increasingly important in modern technology. The Finite-Difference Time-Domain (FDTD) method is a powerful computational electromagnetic technique for modelling electromagnetic space. However, the method is computationally complex and time consuming. Implementing the algorithm in hardware greatly increases its computational speed and widens its range of use.

    This dissertation presents a fixed-point 3D UPML (Uniaxial Perfectly Matched Layer) FDTD FPGA accelerator, which supports a wide range of materials including dispersive media. By analyzing the performance of fixed-point arithmetic in the 3D FDTD algorithm, the correct fixed-point representation is chosen to minimize the relative error between the fixed-point and floating-point results. The FPGA accelerator supports the UPML absorbing boundary conditions, which perform better in dispersive soil and human tissue media than PML boundary conditions.

    The 3D UPML FDTD hardware accelerator has been designed and implemented on a WildStar-II Pro FPGA computing board. The computational speed of the hardware implementation represents a 25 times speedup compared to the software implementation running on a 3.0 GHz PC. This result indicates that the custom hardware implementation can achieve significant speedup for the 3D UPML FDTD algorithm.

  • The speedup of the FDTD hardware implementation is due to three major factors: fixed-point representation, custom memory interface design, and pipelining and parallelism. The FDTD method is a data-intensive algorithm, and the bottleneck of the hardware design is its memory interface. Given the limited bandwidth between the FPGA and on-board memories, a carefully designed custom memory interface allows full utilization of the memory bandwidth and greatly improves performance. The FDTD algorithm is also computationally intensive. By considering the trade-offs between resources and performance, pipelining and parallelism are implemented to achieve optimal design performance with the available hardware resources.

  • Acknowledgements

    First of all, I would like to thank my advisor Dr. Miriam Leeser, not only for her invaluable advice and encouragement in carrying out my graduate research, but also for her great guidance and help during my six years of graduate study.

    I also greatly appreciate the friendship and support of Dr. Carey Rappaport and Panos Kosmas in CenSSIS, the Center for Subsurface Sensing and Imaging Systems. This dissertation would not have been possible without their help, support and their FDTD model.

    I would like to thank Dr. Charles DiMarzio, who served on both my master's and doctoral committees, for his constant help and support of my research.

    It has been a pleasure to cooperate with all of my colleagues and friends in the Reconfigurable Computing Laboratory at Northeastern University. I would like to thank all of them for their suggestions, help and support.

    Finally, I leave my special acknowledgement to my wife Lan, and my parents, Xiuzhi Chen and Yuanrui Zhang, for their constant love and great support.

    This research is supported in part by CenSSIS, the Center for Subsurface Sensing and Imaging Systems, under the Engineering Research Centers Program of the National Science Foundation.

  • Contents

    1 INTRODUCTION
    1.1 Overview
    1.2 Contributions
    1.3 Dissertation Structure

    2 BACKGROUND
    2.1 The Finite-Difference Time-Domain (FDTD) Method
    2.1.1 FDTD Basics
    2.1.2 FDTD Algorithm
    2.1.3 Absorbing Boundary Conditions
    2.1.3.1 The Mur-type absorbing boundary conditions
    2.1.3.2 The UPML absorbing boundary conditions
    2.2 FPGA and Reconfigurable Computing
    2.2.1 FPGA Basics
    2.2.2 FPGA Computing Boards
    2.2.3 Reconfigurable Computing Basics
    2.2.3.1 Pipelining
    2.2.3.2 Parallelism
    2.2.3.3 Memory Hierarchy and Interface
    2.2.3.4 Summary
    2.2.4 Reconfigurable Computing Design Flow
    2.3 Finite Precision
    2.3.1 Floating-Point Arithmetic

    2.3.2 Fixed-Point Arithmetic
    2.3.3 Fixed-Point Quantization
    2.4 Summary

    3 RELATED WORK
    3.1 Applications of the FDTD Method
    3.2 Parallel-computing Systems for FDTD Acceleration
    3.3 VLSI Implementations of the FDTD Method
    3.4 FPGA Implementations of the FDTD Method
    3.5 Summary

    4 FDTD ON FPGA
    4.1 The Advantages of Putting FDTD on an FPGA
    4.1.1 Speed, Size, Power and Flexibility
    4.1.2 Good for Parallelism and Deep Pipelining
    4.1.3 Suitable for Fixed-Point Arithmetic
    4.2 Key Problems to Solve - FDTD on an FPGA
    4.2.1 Determine the right precision for fixed-point representation
    4.2.2 Determine the memory organization and the memory interface
    4.2.3 Determine the proper use of pipelining and parallelism
    4.3 Summary

    5 ALGORITHM ANALYSIS
    5.1 Target FDTD Models
    5.1.1 The GPR Buried Object Detection Forward Model
    5.1.2 The Breast Cancer Detection Forward Model
    5.1.3 The Spiral Antenna Model
    5.1.4 Simplification and 2D Structure
    5.1.5 Six Target Models
    5.2 Data Analysis
    5.2.1 Fixed-point Analysis
    5.2.2 Fixed-point Quantization
    5.3 Hardware Design Analysis

    5.3.1 Memory hierarchy and memory interface
    5.3.2 Managed cache module
    5.3.2.1 The memory transfer bottleneck
    5.3.2.2 Optimizing data flow and processing core
    5.3.2.3 Expand to Three Dimensions
    5.3.3 Pipelining and Parallelism
    5.3.3.1 Pipelining
    5.3.3.2 Parallelism
    5.3.3.3 Two Hardware Implementations
    5.4 Summary

    6 FDTD DESIGN, IMPLEMENTATION and RESULTS
    6.1 Overview of the FDTD hardware implementation
    6.2 I/O Interface and DMA Transfer
    6.2.1 WildStar-II Pro's I/O Interface
    6.2.2 DMA Transfer
    6.2.3 DMA Control Logic and FIFO Components
    6.2.4 DMA Interface Extension
    6.3 Datapath
    6.3.1 Combined Datapath
    6.3.2 Pipelining
    6.4 Data Flow
    6.4.1 Data Arrangement
    6.4.2 On-board Memories
    6.4.3 BlockRAM Cache Module
    6.5 Control State Machines
    6.6 FDTD Implementations and Results
    6.6.1 2D UPML FDTD Implementation
    6.6.2 3D UPML FDTD Implementation — Single-Cell Full-UPML
    6.6.3 3D UPML FDTD Implementation — Double-Cell Full-UPML
    6.6.4 3D UPML FDTD Implementation — UPML/Center Combined
    6.6.5 Results Summary
    6.7 Summary

    7 CONCLUSIONS and FUTURE WORK
    7.1 Conclusion
    7.2 Future Work

    Bibliography

  • List of Tables

    2.1 List of the main features of the WildStar-II Pro FPGA boards
    4.1 Relative error between fixed-point arithmetic and floating-point arithmetic over 200000 timesteps
    5.1 Detailed Specifications of the Target FDTD Models
    5.2 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D GPR Landmine Detection Model
    5.3 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D GPR Landmine Detection Model
    5.4 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D Breast Cancer Detection Model
    5.5 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D Breast Cancer Detection Model
    5.6 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D Spiral Antenna Model
    5.7 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 2D Spiral Antenna Model
    5.8 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D GPR Landmine Detection Model
    5.9 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D GPR Landmine Detection Model
    5.10 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D Breast Cancer Detection Model
    5.11 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D Breast Cancer Detection Model
    5.12 Average absolute error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D Spiral Antenna Model
    5.13 Average relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths for the 3D Spiral Antenna Model
    6.1 Read/Write ratio in different FDTD implementations
    6.2 3D FDTD hardware implementation performance results

  • List of Figures

    2.1 The geometrical representation of the 3D Yee cell
    2.2 A Complete Flow Diagram of the FDTD Method
    2.3 Speed vs. flexibility of software, FPGAs and ASICs
    2.4 Xilinx FPGA structure and its CLB
    2.5 Block diagram of the WildStar-II Pro board
    2.6 Simple Structure Diagram of the Normal COTS Reconfigurable Computing Board
    2.7 Structure of a pipeline in the free space model of the 2D FDTD algorithm
    2.8 The FPGA Design Flow
    2.9 Single precision floating-point representation
    2.10 Double precision floating-point representation
    2.11 Uninterpreted fixed-point representation
    2.12 Interpreted fixed-point representation
    2.13 Multiplication of fixed-point representation
    5.1 3D FDTD application of land mine detection using ground penetrating radar
    5.2 3D FDTD application of microwave breast cancer detection
    5.3 The floor plan of the spiral antenna model
    5.4 Structural diagram of the simplification and the pseudo-2D FDTD algorithm
    5.5 Data structure of the fixed-point representation
    5.6 Relative error between fixed-point arithmetic and floating-point arithmetic for different bit-widths
    5.7 Structural Diagram of the Memory Interface
    5.8 Structural Diagram of the Simple 2-Row Cache Module
    5.9 Structural Diagram of the 2D Managed Cache Module
    5.10 Structural Diagram of the 4-Slice Caching Design
    5.11 Structure Diagram of the 4×3 Rows Caching Module
    5.12 Running Two Cells in Parallel
    5.13 The UPML boundary conditions cells and non-UPML center cells in the model space
    6.1 Overview Diagram of the FDTD hardware Design
    6.2 DMA Control Logic and FIFO Components
    6.3 The block diagram of the 3-level memory interface
    6.4 Diagram of the magnetic updating pipeline
    6.5 Diagram of the electric updating algorithm on different media
    6.6 Diagram of the electric updating pipeline
    6.7 Diagram of the storage of EM field data on on-board memories
    6.8 Diagram of detailed data flow on the 3D electric updating pipeline module
    6.9 Diagram of hierarchy level of control state machines
    6.10 Data flow of the 2D UPML FDTD hardware Design
    6.11 Performance of the 2D UPML FDTD hardware Design
    6.12 Data flow of the 3D FDTD field updating algorithms
    6.13 Data flow of the 3D UPML FDTD hardware design
    6.14 Difference between the 3D Single-Cell and Double-Cell Full-UPML FDTD hardware design
    6.15 Data flow of the 3D Double-Cell Full-UPML FDTD hardware design
    6.16 Performance of the 3D Double-Cell UPML FDTD hardware Design
    6.17 Data flow of the 3D UPML/Center combined FDTD hardware design

  • Chapter 1

    INTRODUCTION

    1.1 Overview

    Understanding and predicting electromagnetic behavior is increasingly needed in key electrical engineering technologies such as cellular phones, mobile computing, lasers and photonic circuits. Since the Finite-Difference Time-Domain (FDTD) method provides a direct, time-domain solution to Maxwell's equations in differential form with relatively good accuracy and flexibility, it has become a powerful method for solving a wide variety of electromagnetics problems. The FDTD method was not widely used until the past decade, when computing resources improved. Even today, the computational cost is still too high for real-time application of the FDTD method.

    Much effort has been spent on software acceleration: supercomputers and parallel computer arrays have been used to run the FDTD algorithm in software. However, real-time application of the FDTD algorithm needs faster speed as well as a smaller package. Although Application-Specific Integrated Circuits (ASICs) provide much faster speed, people hesitate to apply the FDTD algorithm to ASICs due to their high development cost.

    Recently, as high-capacity Field-Programmable Gate Arrays (FPGAs) have emerged, people have become interested in reconfigurable hardware implementations of the FDTD algorithm for faster calculation and real-time applications.

    FPGAs offer the benefits of both ASICs and software. Like an ASIC, an FPGA can implement millions of gates of logic, designed optimally for a specific task, in a single integrated circuit. Although FPGA-based designs may not match the performance of ASICs, few applications will ever merit the expense of a costly ASIC. Like software, FPGAs can be reprogrammed by designers at any time, and are therefore flexible and convenient. Although FPGA-based development is still costly compared to software, FPGA-based designs can offer extremely high performance, surpassing any other programmable solution.

    There have also been a number of recent efforts to implement the FDTD algorithm on graphics processing units (GPUs). Graphics cards are widely available and relatively cheap compared to FPGAs. Instead of working on a hardware design, designers use C++ and OpenGL to program the FDTD algorithm on the GPU. However, because the GPU is designed for graphics calculations, its hardware structure is fixed; there is little flexibility to upgrade the FDTD design or change the data type. An FPGA-based design offers much better flexibility, so it can achieve higher precision and faster speed.

    We present the first fixed-point 3D UPML FDTD FPGA accelerator, which supports a wide range of materials including dispersive media. By analyzing the performance of fixed-point arithmetic in the 3D FDTD algorithm, the correct fixed-point representation is chosen to minimize the relative error between fixed-point and floating-point results. The FPGA accelerator supports the UPML absorbing boundary conditions, which perform better in dispersive soil and human tissue media than PML boundary conditions.

    The 3D UPML FDTD hardware accelerator has been designed and implemented on a WildStar-II Pro FPGA computing board. The computational speed of the hardware implementation represents a 25 times speedup compared to the software implementation running on a 3.0 GHz PC. This result indicates that the custom hardware implementation can achieve significant speedup for the 3D UPML FDTD algorithm.

    The speedup of the FDTD hardware implementation is due to three major factors: fixed-point representation, custom memory interface design, and pipelining and parallelism. The FDTD method is a data-intensive algorithm, and the bottleneck of the hardware design is its memory interface. Given the limited bandwidth between the FPGA and on-board memories, a carefully designed custom memory interface allows full utilization of the memory bandwidth and greatly improves performance. The FDTD algorithm is also computationally intensive. By considering the trade-offs between resources and performance, pipelining and parallelism are implemented in order to achieve optimal design performance based on the available hardware resources.

    1.2 Contributions

    The contributions of this research are:

    • This research presents a fixed-point 3D UPML FDTD FPGA accelerator which supports a wide range of materials including dispersive media. The 3D UPML FDTD hardware accelerator has been designed and implemented on a WildStar-II Pro FPGA computing board. The computational speed of the hardware implementation represents a 25 times speedup compared to the software implementation running on a 3.0 GHz PC. This result indicates that the custom hardware implementation can achieve significant speedup for the 3D UPML FDTD algorithm.

    • The 3D UPML FDTD algorithm has been carefully analyzed to find an efficient hardware architecture. A carefully designed custom memory interface and managed cache module allow full utilization of the memory bandwidth and greatly improve hardware design performance. By considering the trade-offs between resources and performance, pipelining and parallelism are implemented to achieve optimal design performance based on the available hardware resources.

    • Our 3D FDTD FPGA accelerator is the first fixed-point 3D FDTD FPGA accelerator. Recent approaches to hardware acceleration of the FDTD algorithm focus on floating-point arithmetic implementations [1] [2]. Although floating-point representation provides relatively high data precision, because of the favorable data properties and algorithm structure of the FDTD method, the relative error between the fixed-point results and double-precision floating-point results can be kept very low as long as the fixed-point representation is chosen carefully. Since fixed-point arithmetic components are faster and consume less hardware resources than floating-point ones, the fixed-point arithmetic FDTD accelerator has better performance.

    • Our 3D FDTD FPGA accelerator is the first hardware implementation to support the UPML absorbing boundary condition (ABC), which performs better in dispersive soil and human tissue media than the PML ABC. The UPML ABC provides a better reflection ratio at all incidence angles in dispersive media compared to other approaches. Although the UPML FDTD algorithm is more computationally expensive and consumes more memory, its regular structure makes it relatively easy to implement in hardware compared to the PML and Mur-type ABCs.

    • Our 3D FDTD FPGA accelerator implements and supports GPR mine detection, breast cancer detection, and spiral antenna research. Our research provides a good example of FPGA acceleration of FDTD applications in fixed-point arithmetic. With the experience of our research, other FDTD applications can be ported to FPGAs more easily.
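    The fixed-point error analysis summarized in these contributions can be illustrated with a short sketch. This is not the thesis's actual analysis code; the random signal, seed, and bit-widths below are hypothetical stand-ins for FDTD field data:

    ```python
    import numpy as np

    def to_fixed(x, frac_bits):
        """Quantize x onto a fixed-point grid with frac_bits fractional bits."""
        scale = 2.0 ** frac_bits
        return np.round(x * scale) / scale

    # Hypothetical stand-in for FDTD field samples.
    rng = np.random.default_rng(0)
    field = rng.uniform(-1.0, 1.0, size=10_000)

    for frac_bits in (8, 16, 24):
        q = to_fixed(field, frac_bits)
        rel_err = np.abs(q - field).mean() / np.abs(field).mean()
        print(f"{frac_bits} fractional bits: mean relative error ~ {rel_err:.1e}")
    ```

    Each additional fractional bit roughly halves the quantization error, which is why a carefully chosen bit-width can keep fixed-point results very close to double-precision floating-point results.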

    1.3 Dissertation Structure

    The outline of the remainder of this dissertation is as follows.

    Chapter 2 starts with an introduction to the Finite-Difference Time-Domain method, the structure of the FDTD algorithm, and the UPML absorbing boundary conditions. It then discusses the FPGA and the WildStar-II Pro reconfigurable boards on which the design is implemented. Finally, finite precision theories are introduced for a better understanding of the algorithm data analysis in Chapter 5.

    Chapter 3 presents a survey of work related to our research, including research on the FDTD method, parallel computing, VLSI implementations, and FPGA implementations of the FDTD method.

    Chapter 4 first analyzes and presents the advantages of implementing the FDTD algorithm on an FPGA. Then it lists the key problems which need to be solved in order to have a successful FDTD hardware implementation. These problems are solved in Chapter 5 and Chapter 6.

    Chapter 5 starts with an introduction to the six target FDTD models we used for data analysis and design verification. It then gives the details of our research on the fixed-point data analysis of the 3D FDTD algorithm. It compares the relative error of the electromagnetic field data for different data bit-widths to choose a suitable bit-width for the fixed-point representation. Finally, this chapter carefully analyzes the 3D UPML FDTD algorithm to find an efficient hardware architecture. A carefully designed custom memory interface and managed cache module are proposed and analyzed. The implementation of pipelining and parallelism is discussed based on the trade-offs between hardware resources and design performance.

    Chapter 6 discusses several technical details of the 3D UPML FDTD implementation, including the I/O interface, DMA transfer, datapath, data flow, memory interface, managed cache module, and control state machines. After that, several FDTD implementations are introduced and their performance results are presented. These FDTD implementations are introduced in the sequence of our design work; their data flow and control logic become more and more complex as we further optimize the design to achieve better performance. This chapter also contains a summary of the performance of the FDTD implementations.

    Chapter 7 draws conclusions and gives suggestions for future work.

  • Chapter 2

    BACKGROUND

    In this chapter we introduce the background of the FDTD method, FPGAs, and finite precision, which are the basis of this dissertation.

    2.1 The Finite-Difference Time-Domain (FDTD) Method

    The discovery of Maxwell's equations was one of the outstanding achievements of 19th-century science. The equations give a unified and complete theory for understanding electromagnetic wave phenomena, and solving Maxwell's equations has become an important method for investigating the propagation, radiation and scattering of electromagnetic waves.

    The FDTD method provides a direct time-domain solution of Maxwell's equations. After Yee first introduced it in 1966 [3], people began to realize its accuracy and flexibility for solving electromagnetic problems. Because of its clear structure, easy-to-understand algorithm, and ability to solve Maxwell's equations on most scales in many different environments, the FDTD method has become a powerful technique for solving a wide variety of electromagnetic problems. Such problems include discrete scattering analysis, antenna design, radar research, digital circuit packaging on multi-layer circuit boards, tumor detection, and various medical studies of the effects of electromagnetic waves on the human body [4, 5, 6, 7, 8].

    In this section, we will first introduce the basics of the FDTD method by solving

    Maxwell’s equations step by step. Then we will introduce the structure and properties

    of the FDTD software algorithm. Finally we will talk about the absorbing boundary

    conditions for the FDTD method.

    2.1.1 FDTD Basics

    The differential form of Maxwell’s equations and constitutive relations can be written

    with the following equations:

    ∇× ~E = −∂~B

    ∂t− σm ~H − ~M (2.1)

    ∇× ~H = ∂~D

    ∂t+ σe ~E + ~J (2.2)

    ∇ · ~D = ρe; ∇ · ~B = ρm (2.3)

    ~D = � ~E; ~B = µ ~H (2.4)

    In Equations (2.1)-(2.4), the following symbols are used:

    ~E: electric field ~D: electric flux density~H: magnetic field ~B: magnetic flux density~J : electric current density ~M : equivalent magnetic current density�: electrical permittivity µ: magnetic permeabilityσe: electric conductivity σm: equivalent magnetic conductivityρe: electric charge density ρm: equivalent magnetic charge density


    First, the FDTD method replaces ~D and ~B in Equation (2.1)-(2.2) with ~E and ~H

    according to the constitutive relations in Equation (2.4), which yields Maxwell’s curl

    equations:

µ ∂~H/∂t = −∇ × ~E − σm ~H − ~M        (2.5)

ε ∂~E/∂t = ∇ × ~H − σe ~E − ~J        (2.6)

    All the curl operators are then written in differential form and replaced by partial

    derivative operators, as shown in Equation (2.7), with the ~E and ~H vector separated

    into three vectors in three dimensions(e.g. ~E is separated into Ex, Ey, Ez):

curl ~F = ∇ × ~F = x̂(∂Fz/∂y − ∂Fy/∂z) + ŷ(∂Fx/∂z − ∂Fz/∂x) + ẑ(∂Fy/∂x − ∂Fx/∂y)        (2.7)

    Then we can rewrite Maxwell’s curl equations into six equations in differential

    form in rectangular coordinates:

µ ∂Hx/∂t = ∂Ey/∂z − ∂Ez/∂y − σm Hx − Mx        (2.8)

µ ∂Hy/∂t = ∂Ez/∂x − ∂Ex/∂z − σm Hy − My        (2.9)

µ ∂Hz/∂t = ∂Ex/∂y − ∂Ey/∂x − σm Hz − Mz        (2.10)

ε ∂Ex/∂t = ∂Hz/∂y − ∂Hy/∂z − σe Ex − Jx        (2.11)

ε ∂Ey/∂t = ∂Hx/∂z − ∂Hz/∂x − σe Ey − Jy        (2.12)

ε ∂Ez/∂t = ∂Hy/∂x − ∂Hx/∂y − σe Ez − Jz        (2.13)

    Second, the FDTD method establishes a model space, which is the physical region

    where Maxwell’s equations are solved or the simulation is performed. The model


    space is then discretized to a number of cells and the time duration “t” is discretized

    to a number of time steps. The size of the unit cell is often set to less than one

    tenth of the electromagnetic wavelength. The unit time step is calculated as the time

    duration the electromagnetic wave spends travelling through a unit cell. Every cell in

    the model space has its associated electric and magnetic fields E and H. The material

    type of each cell is also specified by its permittivity, permeability and conductivity.

    We use model size to represent the number of cells in the model space. The model

    size depends on the size of unit cell. The smaller the unit cell, the larger the model

    size.
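These sizing rules can be sketched in a few lines of code (a free-space sketch; the function names and the 3 GHz example are illustrative, not from the thesis):

```python
# Sketch of the discretization rules above (assumption: free-space
# propagation at the speed of light c).
C = 299792458.0  # speed of light, m/s

def cell_size(wavelength, fraction=10):
    """Unit cell edge: set to less than one tenth of the wavelength."""
    return wavelength / fraction

def time_step(dx):
    """Unit time step: the time the wave spends crossing one unit cell.
    (Practical codes often use a slightly stricter Courant-limited step.)"""
    return dx / C

wl = C / 3e9          # a 3 GHz wave: wavelength of about 0.1 m
dx = cell_size(wl)    # cells of about 1 cm
dt = time_step(dx)    # about 33 ps per time step
```

Note how halving the unit cell doubles the model size in each dimension and also halves the time step, which is why small cells quickly become expensive.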

    The 3D grid shown in Figure 2.1, named the “Yee-Cell” [3], is very helpful for

    understanding the discretized electromagnetic model space.

    Figure 2.1: The geometrical representation of the 3D Yee cell.

    As illustrated in Figure 2.1, the “Yee-Cell” is a small cube, which can be treated

    as one single cell picked from the discretized model space. ∆x, ∆y, ∆z are the three

    dimensions of this cube. We use (i, j, k) to denote the point whose real coordinate

    is (i∆x, j∆y, k∆z) in the model space. Instead of placing E and H components in

    the center of each cell, the E and H field components are interlaced so that every E


    component is surrounded by four circulating H components, and every H component

is surrounded by four circulating E components. The operation ∇× or “curl” in Equations (2.5), (2.6) and (2.7) is easily interpreted using this figure. For example, the Hx component located at point (i, j + 1/2, k + 1/2) is surrounded by four circulating

    E components, two Ey components and two Ez components, matching Equation

    (2.8), which states that the Hx component increases directly in response to a curl

    of E components in the x direction with a constant µ. The constant µ specifies the

    magnetic permeability of the material at the location of this unit cell. Similarly, the

    E components increase directly in response to the curl of the H components, with a

    constant proportional to the electrical permittivity � of the material at the current

    location. Maxwell’s equations in rectangular coordinates, Equation (2.8)-(2.13), can

    be clearly illustrated by Yee’s cell.

In this section, we represent an electric component Ez at the discretized 3D coordinate (i∆x, j∆y, (k + 1/2)∆z) as Ez|_{i,j,k+1/2}, and when the current time is in the discretized N-th time step, we denote the same component as Ez|^{N}_{i,j,k+1/2}.

    Third, all the partial derivative operators in Equations (2.8)-(2.13) are replaced by

    their central difference approximations as illustrated in Equation (2.14). The second

    order part of the Taylor series expansion is discarded to keep the algorithm simple

    and reduce the computational cost. Also, the variable without partial derivative can

    be approximated by time averaging as shown in Equation (2.15), which has similar

    structure to the central difference approximation.

∂f(uo)/∂u = [f(uo + ∆u) − f(uo − ∆u)] / (2∆u) + O[(∆u)²]        (2.14)

f(uo) = [f(uo + ∆u) + f(uo − ∆u)] / 2        (2.15)
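The second-order accuracy claimed in Equation (2.14) is easy to verify numerically: halving the step size should cut the error by roughly a factor of four (the test function below is illustrative):

```python
# Numerical check of the central difference approximation in Equation (2.14):
# the leading error term is O[(du)^2], so halving du should shrink the error
# by about a factor of four.
def central_diff(f, u0, du):
    return (f(u0 + du) - f(u0 - du)) / (2 * du)

f = lambda u: u ** 3                          # exact derivative at u0=1 is 3
err1 = abs(central_diff(f, 1.0, 0.10) - 3.0)  # error with step 0.10
err2 = abs(central_diff(f, 1.0, 0.05) - 3.0)  # error with step 0.05
ratio = err1 / err2                           # close to 4 for a 2nd-order scheme
```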


    For example, Equation (2.8) is changed to:

µ [Hx(t0 + ∆t/2) − Hx(t0 − ∆t/2)] / ∆t
    = [Ey(z0 + ∆z/2) − Ey(z0 − ∆z/2)] / ∆z
    − [Ez(y0 + ∆y/2) − Ez(y0 − ∆y/2)] / ∆y
    − σm [Hx(t0 + ∆t/2) + Hx(t0 − ∆t/2)] / 2
    − Mx        (2.16)

    After these modifications, the FDTD method turns Maxwell’s equations into a

    set of linear equations from which we can calculate the electric and magnetic fields

    in every cell in the model space. We call these equations the electric and magnetic

    field updating algorithms. Six field updating algorithms form the basis of the FDTD

    method. For example, the field updating algorithm for the Hx component, derived

    from Equation (2.16) or Equation (2.8), is given by:

(µ/∆t + σm/2) Hx|^{N+1/2}_{i,j+1/2,k+1/2} = (µ/∆t − σm/2) Hx|^{N−1/2}_{i,j+1/2,k+1/2}
    + (1/∆z) [Ey|^{N}_{i,j+1/2,k+1} − Ey|^{N}_{i,j+1/2,k}]
    − (1/∆y) [Ez|^{N}_{i,j+1,k+1/2} − Ez|^{N}_{i,j,k+1/2}]
    − Mx|^{N}_{i,j+1/2,k+1/2}        (2.17)
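As a software sketch, the Hx updating equation (2.17) for a single cell translates directly into a few arithmetic operations (argument names are illustrative; ey_k1/ey_k0 and ez_j1/ez_j0 are the surrounding E components at step N):

```python
# Direct transcription of the Hx updating equation (2.17) for one cell.
# hx_prev is Hx at time step N-1/2; mu and sigma_m describe the material
# of this unit cell; mx is the equivalent magnetic current density term.
def update_hx(hx_prev, ey_k1, ey_k0, ez_j1, ez_j0, mx,
              mu, sigma_m, dt, dy, dz):
    lhs_coef = mu / dt + sigma_m / 2
    rhs_coef = mu / dt - sigma_m / 2
    curl = (ey_k1 - ey_k0) / dz - (ez_j1 - ez_j0) / dy
    return (rhs_coef * hx_prev + curl - mx) / lhs_coef
```

The two coefficients depend only on the material and the time step, so in practice they are precomputed once per material type rather than per update.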

    2.1.2 FDTD Algorithm

    The FDTD algorithm is based on these field updating equations. The flow diagram of

    the FDTD algorithm is shown in Figure 2.2. The FDTD algorithm first establishes the

    computational model space, specifies the storage memory for all the electromagnetic

    field data, specifies the material properties for every cell in the model space and

    specifies the excitation source. The excitation source can be a point source, a plane

wave, an electric field or another option depending on the application. The FDTD

    algorithm then runs through the field updating algorithms on every cell in the model

    space. The electromagnetic field data are updated and stored back to the original

    memory. Once it finishes, the algorithm will go to the next time step and run through


    all the cells again. The algorithm iterates the updating computations for each time

    step, until the desired time period is completed. The boundary condition part consists

    of special algorithms dealing with the unit cells located on the boundary of the model

    space. For these cells, some of the adjacent cells may not exist, so special methods

    are needed to update them appropriately. The output of the FDTD algorithm can

    be any electric or magnetic field data from any cell in any time step.
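The control flow above can be sketched as a time-stepping skeleton (the update, source, boundary and output routines below are placeholders, not the thesis implementation):

```python
# Skeleton of the FDTD time-stepping loop shown in Figure 2.2.
def fdtd_run(n_steps, update_h, update_e, apply_source, apply_abc, record):
    outputs = []
    for step in range(n_steps):
        update_h(step)      # magnetic field on all cells (uses stored E data)
        update_e(step)      # electric field on all cells (uses the new H data)
        apply_source(step)  # excitation: point source, plane wave, ...
        apply_abc(step)     # boundary conditions on the outer cells
        outputs.append(record(step))
    return outputs
```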

Figure 2.2: A Complete Flow Diagram of the FDTD Method (initialization establishes the model space and specifies the material properties and the excitation source; each time step then updates the magnetic field data on all cells, updates the electric field data on all cells, applies the source excitation and the boundary conditions, and advances to the next time step until the simulated time is over)

    The electric and magnetic fields depend on each other. As we can see from

    Equation (2.5), in Maxwell’s curl equations, the curl of the electric field E in space

    equals the partial derivative of the magnetic field H with respect to “t”. After the

    modifications of the FDTD method, the expansion of the curl operation depends on

    the E field in the cells surrounding the current cell. The expansion of the partial

    derivative of H with respect to “t” depends on the current magnetic field minus the

    stored previous time step’s magnetic field. In other words, at any cell in the model


    space, the current time step’s magnetic field depends on the stored previous time

    step’s magnetic field and the electric fields in the surrounding cells. Similarly, at any

    cell in the model space, the current time step’s electric field depends on the stored

    previous time step’s electric field and the magnetic fields in the surrounding cells.

    Because of the dependence between the electric and magnetic fields, we cannot

    update them in parallel. So the FDTD algorithm updates the electric and mag-

    netic fields in an interlaced manner as shown in Figure 2.2. Notice that electric and

    magnetic field updating algorithms are interlaced in the time domain so magnetic

    components will be updated first based on the previous stored electric data and then

    electric components will be updated based on the newly updated magnetic data. This

    interlacing mechanism will be iterated time step by time step until the job finishes.

    The FDTD algorithm is an accurate and successful modelling technique for com-

    putational electromagnetics. It is flexible, allowing the user to model a wide variety

    of electromagnetic materials and environments on most scales. It is also easy to un-

    derstand, with its clear structure and direct time domain calculation. However, the

    FDTD algorithm is a data intensive and computationally intensive algorithm. The

    amount of data in the FDTD model space can be very large for large model sizes,

    creating a heavy burden on both memory storage and access. The computation is

    also intensive for each cell in the FDTD model space, including updating six electric

    and magnetic fields and also updating the boundary conditions. The heavy burden

    on data access and complex computation makes the FDTD algorithm run slowly on a

    single processor. Modelling an electromagnetic problem using the FDTD method can

    easily require several hours. Without powerful computational resources, the FDTD

    models are too time-consuming to be implemented on a single computing node. Ac-

    celerating the FDTD method with inexpensive and compact hardware will greatly

    expand its application and popularity.


    2.1.3 Absorbing Boundary Conditions

    The above electric and magnetic updating algorithms work accurately inside the

    model space. However, because the cells on the boundary of the model space do

    not have the adjacent cells needed in the updating algorithms, the algorithm does

    not work properly on the boundary, so there are algorithm-introduced reflections

    on the boundary of the model space. Special techniques are necessary to deal with

    the cells on the boundary, prevent nonphysical reflections from outgoing waves and

    simulate the extension of the model space to infinity. These special techniques are

    called the absorbing boundary conditions (ABC). The development of efficient and

    accurate ABCs is very important for the FDTD method and several types of ABC

    have been proposed, which are grouped into two major approaches, analytical and

perfectly matched layer.

    2.1.3.1 The Mur-type absorbing boundary conditions

    The first group of ABCs are called the analytical ABCs, which are mainly deduced

    from the electromagnetic wave equation. The Mur-type ABC [9] is one of the popu-

    lar ABCs in this group. Mur-type ABC is deduced from the one-way wave equation,

    which is a part of the wave equation that allows wave propagation only in one di-

    rection [5]. The wave equation in three dimensions shown in Equation (2.18) can be

    written as Equation (2.19), or more compactly, as Equation (2.20). A part of the

    wave equation, L−U = 0, as in Equation (2.21), is the one-way wave equation which

    absorbs most of the impinging waves on the boundary of the model space.

(∂²/∂x² + ∂²/∂y² + ∂²/∂z² − (1/c²) ∂²/∂t²) U = 0        (2.18)


[∂²/∂x² − (1/c²) ∂²/∂t² (1 − (c² ∂²/∂y²)/(∂²/∂t²) − (c² ∂²/∂z²)/(∂²/∂t²))] U = 0        (2.19)

[∂²/∂x² − (1/c²) ∂²/∂t² (1 − S²)] U = LU = L+L−U = 0        (2.20)

(∂/∂x − (1/c) (∂/∂t) √(1 − S²)) U = L−U = 0        (2.21)

Depending on the approximation of √(1 − S²), there are first-order and second-order Mur-type ABCs. The first-order Mur-type ABC approximates √(1 − S²) = 1 and the second order uses √(1 − S²) = 1 − S²/2. The second-order Mur-type is more

    accurate but more complex than the first-order one. For example, the discretized

    version of the first-order Mur-type ABC for the Z component of the electric field is

    given by:

Ez|^{N+1}_{0,j,k+1/2} = Ez|^{N}_{1,j,k+1/2} + [(c∆t − ∆x)/(c∆t + ∆x)] [Ez|^{N+1}_{1,j,k+1/2} − Ez|^{N}_{0,j,k+1/2}]        (2.22)
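As a sketch, Equation (2.22) for one boundary cell at i = 0 becomes (argument names are illustrative):

```python
# First-order Mur update for a boundary cell at i=0, following Equation (2.22).
# ez_old_0 and ez_old_1 are Ez at step N at i=0 and i=1; ez_new_1 is Ez at
# step N+1 at i=1, already computed by the interior updating algorithm.
def mur_first_order(ez_old_0, ez_old_1, ez_new_1, c, dt, dx):
    k = (c * dt - dx) / (c * dt + dx)
    return ez_old_1 + k * (ez_new_1 - ez_old_0)
```

Note the in-place character the text describes: the update needs only field values that are already stored, so no extra memory is consumed.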

    The Mur-type ABC is easy to understand. It updates the electric field data on

    the boundary cells every time step after the main updating algorithms are finished,

    as shown in Figure 2.2. The updates of the field data are processed in-place with no

    extra memory consumption, which makes the Mur-type ABC a good memory saver.

    However, the biggest problem with the Mur-type ABC is that its quality is only

    good for incident angles up to 25 degrees [6]. Its quality deteriorates significantly for

    greater incidence angles. This property limits the usage of the Mur-type ABC.

    2.1.3.2 The UPML absorbing boundary conditions

    The other approach is called the Perfectly Matched Layer (PML) ABC [10]. PML

    ABC is a highly effective absorbing boundary condition. By setting the outer bound-

    ary of the model space to an absorbing material layer, PML can absorb most of

    the impinging waves and have low reflection. The most favorable property of the


    PML ABC is that its quality does not depend on incidence angle, polarization or

    frequency [6].

    The UPML (Uniaxial PML) ABC [7] is modified from the PML ABC by replacing

    the mathematical PML medium model with a physical uniaxial anisotropic PML

    medium, which keeps all the favorable properties of the PML ABC. Instead of setting

    only the outer boundary of the model space to the PML material, UPML uses a

    generalized formulation of the entire FDTD model space, providing both lossless,

    isotropic medium in the primary computation zone and individual UPML absorber

    in the outer boundary cells [6]. Based on the original UPML, a novel modification of

    UPML, the Single Pole Conductivity Model [11, 12], better supports lossy, dispersive

    media. The overall quality of the UPML ABC is more than 30dB better than the

    original PML on dispersive media. Compared to the Mur-type ABC, the quality of

    the UPML ABC is about 10dB better for incident angles less than 30 degrees and

    much better for incident angles larger than 30 degrees [12], because the Mur-type

    ABC deteriorates for greater incident angles [6].

    The advantage of this approach is its good quality. Besides the high quality of

    results in normal media, the quality of the UPML ABC is especially good for disper-

    sive media, which is useful in solving many realistic problems. Also, because UPML

    integrates boundary conditions with the electric field updating algorithms, the FDTD

    algorithm is more uniform in structure, making it a good model for hardware data-

    path design. The disadvantage of the UPML ABC is its high memory consumption.

The UPML ABC introduces nine extra field values for each cell in the model space,

    consuming 1.5 times more memory than the normal FDTD algorithm. Because ac-

    curacy of the ABC is very important for the FDTD algorithm, the PML and UPML

    ABCs are very popular in FDTD research and applications due to their high quality.
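This memory overhead can be checked with simple arithmetic (the model size and 4-byte field values below are illustrative assumptions):

```python
# UPML memory overhead: the basic FDTD algorithm stores 6 field components
# per cell, and UPML introduces 9 extra field values per cell, which is
# 1.5 times additional memory (2.5x total).
def fdtd_bytes(nx, ny, nz, fields=6, bytes_per_value=4):
    return nx * ny * nz * fields * bytes_per_value

normal = fdtd_bytes(100, 100, 100)              # 6 fields per cell
upml = fdtd_bytes(100, 100, 100, fields=6 + 9)  # 9 extra UPML fields
overhead = (upml - normal) / normal             # 1.5 times more memory
```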

    The time-harmonic Maxwell’s curl equation in frequency domain in the UPML


    regions can be written as:

∇ × E = −jωµ s̄ H;    ∇ × H = (jωε + σ) s̄ E        (2.23)

where s̄ is a tensor that can be used throughout the entire FDTD model space:

s̄ = diag[ sy sz / sx , sx sz / sy , sx sy / sz ]        (2.24)

sx = κx + σx/(jωε);    sy = κy + σy/(jωε);    sz = κz + σz/(jωε)        (2.25)

In the PML region, which is on the boundary of the model space, σx is the PML’s conductivity while κx is a real-valued parameter [6]. In the lossless primary computation zone, σ is always zero and κ is always one, so the medium is just the normal medium and the data-path is the normal FDTD computation. P and P² are introduced to reduce

    the complexity of this convolution calculation:

Px = (sy sz / sx) Ex,    Py = (sx sz / sy) Ey,    Pz = (sx sy / sz) Ez        (2.26)

P²x = (1/sy) Px,    P²y = (1/sz) Py,    P²z = (1/sx) Pz        (2.27)

After substituting the E components from Equation (2.26) into (2.23), and substituting ∂/∂t for jω in Equation (2.23), an example of the UPML algorithm in lossy,

    dielectric media is given by the following equations. The details of the UPML algo-

    rithm are well documented [6] and the novel modification of the UPML algorithm for

    dispersive media is introduced in several papers [11, 12, 13, 14, 15, 16, 17, 18]. For

    example, the discretized version of the UPML algorithm for Ex and Hx components


    are:

Px|^{N+1}_{i+1/2,j,k} = [(2ε − σ∆t)/(2ε + σ∆t)] Px|^{N}_{i+1/2,j,k}
    + [2∆t/(2ε + σ∆t)] ( [Hz|^{N+1/2}_{i+1/2,j+1/2,k} − Hz|^{N+1/2}_{i+1/2,j−1/2,k}] / ∆y
    − [Hy|^{N+1/2}_{i+1/2,j,k+1/2} − Hy|^{N+1/2}_{i+1/2,j,k−1/2}] / ∆z )        (2.28)

P²x|^{N+1}_{i+1/2,j,k} = [(2εκy − σy∆t)/(2εκy + σy∆t)] P²x|^{N}_{i+1/2,j,k}
    + [2ε/(2εκy + σy∆t)] [Px|^{N+1}_{i+1/2,j,k} − Px|^{N}_{i+1/2,j,k}]        (2.29)

Ex|^{N+1}_{i+1/2,j,k} = [(2εκz − σz∆t)/(2εκz + σz∆t)] Ex|^{N}_{i+1/2,j,k}
    + [1/(2εκz + σz∆t)] [(2εκx + σx∆t) P²x|^{N+1}_{i+1/2,j,k} − (2εκx − σx∆t) P²x|^{N}_{i+1/2,j,k}]        (2.30)

Bx|^{N+3/2}_{i,j+1/2,k+1/2} = [(2εκy − σy∆t)/(2εκy + σy∆t)] Bx|^{N+1/2}_{i,j+1/2,k+1/2}
    − [2ε∆t/(2εκy + σy∆t)] ( [Ez|^{N+1}_{i,j+1,k+1/2} − Ez|^{N+1}_{i,j,k+1/2}] / ∆y
    − [Ey|^{N+1}_{i,j+1/2,k+1} − Ey|^{N+1}_{i,j+1/2,k}] / ∆z )        (2.31)

Hx|^{N+3/2}_{i,j+1/2,k+1/2} = [(2εκz − σz∆t)/(2εκz + σz∆t)] Hx|^{N+1/2}_{i,j+1/2,k+1/2}
    + [1/((2εκz + σz∆t)µ)] [(2εκx + σx∆t) Bx|^{N+3/2}_{i,j+1/2,k+1/2} − (2εκx − σx∆t) Bx|^{N+1/2}_{i,j+1/2,k+1/2}]        (2.32)
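As a sketch, the Px update of Equation (2.28) for one cell reduces to two precomputable coefficients and a curl term (argument names are illustrative; hz_jp/hz_jm and hy_kp/hy_km are the neighboring H components at step N+1/2):

```python
# Transcription of the Px update in Equation (2.28) for one cell in a lossy
# dielectric with permittivity eps and conductivity sigma.
def update_px(px_prev, hz_jp, hz_jm, hy_kp, hy_km,
              eps, sigma, dt, dy, dz):
    c1 = (2 * eps - sigma * dt) / (2 * eps + sigma * dt)
    c2 = (2 * dt) / (2 * eps + sigma * dt)
    curl = (hz_jp - hz_jm) / dy - (hy_kp - hy_km) / dz
    return c1 * px_prev + c2 * curl
```

In the lossless primary computation zone sigma is zero, so c1 = 1 and c2 = dt/eps, and the update collapses to the normal FDTD computation, which is what makes the UPML formulation uniform across the whole model space.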

    2.2 FPGA and Reconfigurable Computing

    There are several options for implementing digital logic designs. The first option

    is to program the algorithms into computer software and run them on general pur-

    pose microprocessors. Designers can change the software instructions anytime so the

    digital logic can be changed easily and flexibly. Although software is very flexible,

    software running on general purpose microprocessors is slow compared to hardware


    implementations for many applications. The microprocessor must read and decode

    each software instruction and then execute it, creating a high execution overhead

    for each individual operation [19]. Also, the microprocessor can only handle limited

    parallelism and pipelining, while a custom hardware design can fully exploit the ben-

    efits of both. For many scientific computations and real-time applications, software

    implementation is too slow to be practical.

    The second option is to use an Application Specific Integrated Circuit (ASIC),

    which is designed for a specific computational task and achieves very fast speed and

    high efficiency when executing the exact computation for which it was designed [19].

    ASIC chips are small and consume less power than the other options, so they can be

    found in many products from mobile phones to aeroplanes. However, the development

    cost of ASICs is very high. What’s more, once the chip is fabricated, it cannot be

    modified at all. The chip needs to be redesigned and refabricated if any single part

    of the digital logic needs to be altered.

    Figure 2.3: Speed vs. flexibility of software, FPGAs and ASICs

    Reconfigurable computing devices, mostly known as Field Programmable Gate

    Arrays (FPGAs), offer the benefits of both ASICs and software. Like an ASIC, an

    FPGA can implement millions of gates of logic, which are designed optimally for


    a specific task, in a single integrated circuit, thus achieving high performance with

    low power and small size. Although the FPGA-based design may not challenge the

    performance of ASICs, the fact is that few applications will ever merit the expense of

    high cost ASICs. Also like software, FPGAs are reprogrammable by designers at any

    time, and are therefore flexible and convenient. Although FPGA-based development

    is still costly compared to software, FPGA-based design can offer extremely high per-

    formance, surpassing any other programmable solutions for many applications [20].

    Figure 2.3 compares speed and flexibility among software, FPGAs and ASICs. Rela-

    tive high speed, relative high flexibility and advantages in cost, size and power make

    FPGAs suitable for a wide variety of applications.

    In this section, we will first introduce the basics of reconfigurable devices and com-

    mercial FPGA computing boards. Then we will talk about reconfigurable computing

    based design.

    2.2.1 FPGA Basics

    An FPGA is a programmable hardware device where the hardware circuits are pro-

    grammable by the designer to create custom digital logic designs [21]. The structure

    of the FPGA is composed of configurable logic blocks (CLBs), programmable routing

    and Input/Output blocks (IOBs).

    Figure 2.4 shows the structure of an FPGA from Xilinx, Inc [22]. The CLBs are

    arranged in a matrix in the FPGA, and connected by programmable routing. IOBs

    are connected to CLBs via programmable routing also. Each CLB is a basic functional

    element that implements part of the algorithm. Digital logic is implemented in several

CLBs. The structure of CLBs varies in different FPGA chips; however, CLBs are

    normally made up of lookup tables (LUTs), carry logic and storage elements. The


Figure 2.4: Xilinx FPGA structure and its CLB (configurable logic blocks, I/O blocks and programmable routing; inset: the route map of a real FPGA chip)

    bottom right corner of Figure 2.4 shows the structural diagram of a slice in the Xilinx

    Virtex-II Pro FPGA chip; each CLB in the Virtex-II Pro chip has two slices. The

    lookup tables (LUTs) are the most important part of an FPGA. The 4-input LUT

shown in Figure 2.4 is composed of 2⁴ = 16 memory units. The input signal is the address

    data for the LUT to pick the corresponding value from the memory. The memory

    units can be programmed to store any truth table, or any function, so most logic

    can be efficiently implemented by one or more lookup tables. The storage elements

    are generally used as registers and the carry logic is usually added in order to avoid

    wasting LUTs and for faster speed.
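The lookup mechanism can be modeled in a few lines (an illustrative software model, not an FPGA primitive):

```python
# A 4-input LUT modeled as 2**4 = 16 programmable memory bits: the four
# input bits form an address that selects one stored bit, so the LUT can
# realize any 4-input boolean function.
class LUT4:
    def __init__(self, truth_table):
        assert len(truth_table) == 16
        self.mem = list(truth_table)

    def __call__(self, a, b, c, d):
        return self.mem[(a << 3) | (b << 2) | (c << 1) | d]

# Program the LUT to compute a XOR b XOR c XOR d (parity of the inputs).
xor4 = LUT4([bin(i).count("1") & 1 for i in range(16)])
```

Reprogramming the FPGA amounts to loading new contents into these memory bits, which is why any truth table, and hence most logic, maps onto one or more LUTs.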

    Most current FPGAs are SRAM-based, which means all the programmable LUTs


    and routing are based on SRAM bits [19]. Reconfiguring the FPGA is actually done

    by loading the bitstream into SRAM, which takes only milliseconds to complete.

    FPGAs can be reprogrammed in the target system without detaching the FPGA

    chip, a perfect match for hardware systems which need frequent upgrading or have

    several optional designs. FPGAs can be reprogrammed at run-time too, which means

    part of an FPGA can be reprogrammed when the system is running on the other

    part of the chip. These advantages make FPGAs very flexible and suitable for many

    applications.

    2.2.2 FPGA Computing Boards

    The FPGA computing resources we use in this research are typical commercial off

    the shelf (COTS) FPGA computing boards, which are widely available and easy to

set up. These boards normally contain one or two FPGA chips. Each FPGA chip may

    be connected to several on-board memories, including SRAM or DRAM. One FPGA

    and its associated on-board memory form the basic combination used for custom

    hardware computation. These computing boards are often PCI boards for a desktop

    computer or PCMCIA cards for a laptop computer. Data and control signals can

    be transferred between the FPGA computing board and the host PC via the PCI

interface, either via standard PCI transfer or fast DMA transfer. There are also some boards that use other interfaces, such as fiber-optic or RocketIO, for faster speed and more flexible connections.

    The board we use is a WildStarTM -II Pro/PCI reconfigurable FPGA computing

    board from Annapolis Micro Systems [23]. Table 2.1 lists the main features of the

WildStarTM-II Pro/PCI board, and Figure 2.5 shows its block diagram.

    There are two Xilinx Virtex-II Pro FPGAs on the WildStar-II Pro board. Each


WILDSTAR-II PRO/PCI board

FPGA chips:        Two Xilinx Virtex-II Pro XC2VP70 FPGAs [22]
                   (33088 slices, 328 embedded multipliers and 5904Kb BlockRAM)
Memory ports:      12 ports of DDR-II SRAM, 54MBytes total
                   (6 × 4.5MBytes for each FPGA chip)
Memory bandwidth:  11 GBytes/sec
                   (6 × 72-bit ports for each FPGA chip)
PCI interface:     133MHz/64-bit PCI-X, up to 1.03GBytes/sec

Table 2.1: List of the main features of the WildStar-II Pro FPGA board

FPGA has 328 embedded 18×18 signed multipliers and 328 18Kb BlockRAMs. The embedded multiplier built inside the FPGA is much faster than a multiplier component implemented by reconfigurable logic. So it is better to use the embedded

    multipliers if possible. The BlockRAMs are the fastest memory the designer can use

    in an FPGA design. Critical data interchange and interfacing can be programmed

    using the BlockRAMs. A pair consisting of an embedded multiplier and a Block-

    RAM shares the same data and address buses in the Xilinx Virtex-II architecture, so

    once we use the embedded multiplier, we cannot use its corresponding BlockRAM,

    and vice versa. Therefore the sum of the total number of embedded multipliers and

    BlockRAMs used must be less than 328.
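This constraint can be expressed as a simple design-time check (an illustrative sketch, not a vendor tool):

```python
# Resource-sharing check implied by the paired buses on the Virtex-II
# architecture: an embedded multiplier and its paired BlockRAM cannot both
# be used, so the total count of multipliers plus BlockRAMs in a design must
# stay within the number of available pairs.
def fits_on_chip(n_multipliers, n_blockrams, n_pairs=328):
    return n_multipliers + n_blockrams <= n_pairs

ok = fits_on_chip(200, 100)   # True: 300 of the 328 pairs used
```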

    Figure 2.5 shows the block diagram of the WildStar-II Pro FPGA board. Each

    FPGA is connected to six independent on-board memories. The on-board memories

are 1M×36bit DDR-II SRAMs, which have a 72-bit data bandwidth and run at speeds up to 200MHz. The size of each SRAM is 36Mbits, or 4.5MBytes, so the total SRAM

    attached to each FPGA is 27MBytes. The WildStar-II Pro board is connected to

    the desktop computer via a PCI-X interface, with a DMA data transfer rate up to

    1GB/s between the host PC and FPGA. Convenient interfaces between the FPGA

    chip, on board memories and host PC makes the WildStar-II Pro FPGA computing

    board a good platform for hardware implementation.


    Figure 2.5: Block diagram of the WildStarTM -II Pro board

    2.2.3 Reconfigurable Computing Basics

    As introduced in the beginning of this section, the key feature of reconfigurable

    computing is its ability to greatly speedup algorithm computation while maintaining

    flexibility [19]. How does reconfigurable hardware design achieve high performance?

    Based on the structure of the COTS FPGA computing board, we will introduce the

    basic elements of reconfigurable computing which play important roles in computation

    speedup.

As shown in Figure 2.6, an FPGA computing board normally contains an FPGA

    chip, on-board memories and a PCI controller; these form the basic components of

    custom hardware design. A typical hardware design reads the input data from the

    on-board memory or PCI interface, performs the computation inside the FPGA, and

    writes the results back to memory or PCI interface. The hardware design inside an


Figure 2.6: Simple Structure Diagram of the Normal COTS Reconfigurable Computing Board (a PCI interface and on-board memories connected to an FPGA containing several pipeline modules running in parallel)

FPGA normally consists of deep pipelines with as much parallelism as possible. The size and speed of the FPGA chip, the size and speed of the on-board memory

    and the speed of the PCI controller all determine the performance of the hardware

    design on this board. However, given a specific FPGA computing board, there are

    three design approaches which contribute most to the performance of the FPGA

    based design. They are pipelining, parallelism and memory hierarchy/interface.

    2.2.3.1 Pipelining

    Pipelining is an implementation technique whereby multiple instructions are over-

    lapped in execution [24]. A pipeline is like an assembly line.

    1. There are many instructions waiting to be processed just like there are many

    cars waiting to be produced in the assembly line.

    2. All the instructions should have similar structure. Instructions should use the

    same hardware processing units, just like all the cars on the assembly line should have

    identical production procedures. We call the part of an instruction that corresponds

    to each hardware processing unit a “task”, so each instruction is composed of several


    tasks. If there is only one instruction to be executed, each of the tasks will be executed

    one by one in N clock cycles. Of course, the hardware units here are not efficiently

    used because every hardware unit is used only once in the N clock cycles.

    3. The benefit of pipelining is that it can run several instructions at the same time

    and let them share hardware resources. The instructions are overlapped in execution,

    in the shape of a cascade as shown in Figure 2.7. Figure 2.7 is an example structure

    of a pipeline in the FDTD design. The rows represent instructions and the columns

    represent clock cycles. We can see from this figure that every hardware unit is fully

    utilized, since every hardware unit works on a different instruction in every clock

    cycle, just like a working assembly line. The computation results now pop out of

    the pipeline every clock cycle. Compared to the non-pipelined data-path, the design

    throughput is increased by N times.

    A pipeline is a perfect hardware structure when an algorithm needs to execute

    many similar instructions. To build the pipeline, registers need to be inserted between

    hardware units in the data-path to synchronize them according to a clock signal.

Because every hardware unit must finish its work within one clock cycle, the speed of

    the pipeline is determined by the slowest hardware unit or pipeline step in the data-

    path. The smaller the pipeline step, the higher the throughput. So the data-path and

    registers need to be balanced, also called re-timing [25], to make the length of pipeline

steps similar. In the case of large and slow arithmetic components like multipliers, we can use specially designed “pipelined components”, which separate their tasks internally and can be pipelined into several steps, like the pipelined multiplier in

    Figure 2.7.

    Pipelining has some limitations so that not all similar instructions can be pipelined.

    Data hazards may arise when an instruction depends on the results of a previous in-

    struction so it cannot be executed until the previous instruction is completed. Also,


    Figure 2.7: Structure of a pipeline in the free space model of the 2D FDTD algorithm

    pipelining adds extra registers to the data-path, so it increases latency and consumes more resources. In most cases, the extra latency and registers are small enough to be ignored, and the high throughput and overall design performance introduced by pipelining are important for the reconfigurable computing speedup.

    2.2.3.2 Parallelism

    One of the most important techniques in reconfigurable computing design is to par-

    allelize as many tasks as possible. As long as two tasks do not affect each other and

    can be processed at the same time, we can execute them in parallel in order to speed

    up the design. However, the hardware design is always limited by the availability of

    hardware resources. The hardware resources include slice area, on-chip memory area


    and input/output interface. More hardware resources mean higher cost and more

    power consumption. Parallelism will consume more resources, so there is a trade-off

    between hardware resources and speedup. Careful analysis is needed to decide how much parallelism is suitable for the current design to achieve the fastest speed at minimum cost.

    2.2.3.3 Memory Hierarchy and Interface

    Memory is an important part of a reconfigurable computing design. As shown in

    Figure 2.6, the FPGA can access the fast memories on board or the memories on the

    host computer via the slower PCI interface. Also, as we introduced in Section 2.2.2,

    the fastest memories are the on-chip BlockRAMs available inside the FPGA chip. Since

    the read/write speed of memory is important to the whole design’s speed, to avoid

    a bottleneck at the memory interface, fast memory is needed. But fast memory is

    expensive and its size is limited, so it is not practical to use fast memory for the

    whole design. A memory hierarchy can be organized according to the sizes of the

    different types of memories as well as their speed and cost. The goal of the memory

    hierarchy is to provide a memory system with the cost almost as low as a design that

    uses only cheap memory and with the speed almost as fast as a design that uses only

    fast memory.

    2.2.3.4 Summary

    General purpose processors (GPP) can support only limited parallelism and pipelin-

    ing. A GPP has a fixed number of Arithmetic Logic Units (ALUs) for parallelism,

    and it usually pipelines arithmetic operations in a limited way within each ALU. In


    contrast, a custom hardware design is organized for a specific algorithm. The data-

    path can be fully parallelized and pipelined. The memory hierarchy and interface

    can be optimized to support pipelining and parallelism. Compared to a software al-

    gorithm running on a GPP, the reconfigurable computing design can be much faster.

    The more pipelining and parallelism, the faster the design.

    Not all algorithms are suitable for reconfigurable computing. Considering the

    relatively higher cost of hardware implementation, a significant speedup should be

    achieved to make the implementation practical. From the discussion above, we can

    conclude that deep pipelining, parallelism and a suitable memory hierarchy are the keys to reconfigurable computing speedup. To fit these conditions, an algorithm should have a high volume of loops or repeated, similar instructions, intensive

    computations and limited memory access. Also, when most data in an algorithm have a similar range, they can be represented in fixed-point arithmetic, which is easier and faster to process in a custom hardware design. We will return to these topics in Chapter 6.

    2.2.4 Reconfigurable Computing Design Flow

    The FPGA design flow includes coding the algorithm in VHDL, simulation and veri-

    fication, synthesizing, placing and routing, and finally on-chip verification. Figure 2.8

    shows a diagram of the FPGA design flow. Given a target algorithm, the first step

    is algorithm analysis and hardware architectural design. Then we use VHDL code to

    describe the hardware design, which can be debugged and verified through simula-

    tion. The completed VHDL design is then synthesized to create the gate level netlist.

    The place and route tool then fits the synthesized design into the FPGA, creating

    a physical layout and a bit-stream file which can be loaded directly into the FPGA


    chip. We use a C program running on a host PC to communicate with the loaded

    FPGA chip for board level debugging and verification. If there is any problem in

    board level testing, we need to go back and modify the VHDL code. The final design

    which passes board level verification will be tested for performance.

    Figure 2.8: The FPGA Design Flow

    2.3 Finite Precision

    The original 3D FDTD model was written in Fortran, using 64-bit double precision

    floating-point data. Floating-point representation provides high resolution and large

    dynamic range, but it can be costly. In hardware design, floating-point representation requires slower arithmetic components and consumes more hardware resources; it is significantly more expensive than fixed-point representation in both speed and cost when

    implemented in hardware. On the other hand, fixed-point components in hardware

    design have much faster speed and occupy less space. In applications where the

    data resolution and dynamic range can be constrained, as in the FDTD algorithm, fixed-point arithmetic can provide similar precision at much higher speed than


    floating-point arithmetic.

    In this section, the definition and properties of floating-point representation and

    fixed-point representation will be introduced. Then we will discuss fixed-point quantization and the related algorithm analysis.

    2.3.1 Floating-Point Arithmetic

    Programmers use floating-point precision to represent real numbers in most computer

    programs. The IEEE single precision floating-point standard representation requires

    a 32 bit word. As shown in Figure 2.9, the most significant bit is the sign bit, ‘S’,

    the next eight bits are the exponent bits ‘E’, from E30 to E23, and the final 23 bits

    are the fraction bits ‘F’, from F22 to F0 [26] [27]:

    Figure 2.9: Single precision floating-point representation

    The value V represented by the word is determined as follows:

    Vfloat = (−1)^S * 1.F * 2^(E−BIAS), where BIAS = 127.

    Single precision floating-point representation can represent data in a range of roughly −2^128 to 2^128, with normalized magnitudes as small as 2^−126. This is a very large dynamic range, as it can represent very small and very large numbers, which fixed-point representation lacks.

    The IEEE double precision floating point standard representation requires a 64

    bit word. Similar to the single precision standard, as shown in Figure 2.10, the most

    significant bit is the sign bit, ‘S’, the next eleven bits are the exponent bits ‘E’, from

    E62 to E52, and the final 52 bits are the fraction bits ‘F’, from F51 to F0:

    Similar to single precision, the value V may be determined as:


    Figure 2.10: Double precision floating-point representation

    Vfloat = (−1)^S * 1.F * 2^(E−BIAS), where BIAS = 1023.

    2.3.2 Fixed-Point Arithmetic

    The fixed-point representation we use in this dissertation is two’s complement fixed-

    point representation. We can think of fixed-point data as a collection of N binary

    digits, with 2^N possible states, as shown in Figure 2.11. The N-bit data and 2^N states here do not mean anything until their interpretation is defined. For example, if we define the difference between two states as the integer 1, and we define all the data to lie between 0 and 2^N, then the N-bit data can be thought of as positive integers less than 2^N. However, if we define the most significant bit to be the sign bit, changing the data range to between −2^(N−1) and 2^(N−1) − 1, the N-bit data can be thought of as integers with absolute value at most 2^(N−1). And if we define the difference between two states as 0.1, the meaning of the N-bit data will change too; they are now rational numbers precise to 0.1. So the meaning of fixed-point N-bit binary data depends on its interpretation.

    Figure 2.11: Uninterpreted fix-point representation

    In most cases, fixed-point representation is used to represent rational number

    sets. Instead of defining the value of states, we can use a more direct way to define

    the interpretation: the position of the binary point. For example, the 35-bit signed


    fixed-point representation which will be used in this research has the form as shown

    in Figure 2.12.

    Figure 2.12: Interpreted fix-point representation

    The digits before the binary point are one sign bit and one integer bit; the digits

    after it are called the fraction part. The resolution, which is defined as the smallest

    non-zero magnitude representable [28], is determined by the format. In this example,

    the resolution is 2^−33. Since the position of the binary point is fixed, this representation is called “fixed-point”. We can write this representation as A(1,33) [28]. The

    N-bit signed fixed-point data representation A(a, b) has the value:

    Vfix = −S · 2^a + Σ_{n=0}^{a−1} A_n · 2^n + (1/2^b) · Σ_{n=0}^{b−1} B_n · 2^n

    Although fixed-point representation admits different interpretations, they all share the same rules for most arithmetic operations. We can treat fixed-point numbers as integers, then choose the necessary bits and reposition the binary point after the

    operation. For example, if we multiply two fixed-point numbers with the form shown

    in Figure 2.13, we can ignore the binary point and multiply the two integers first.

    After the multiplication, we pick the digits we need according to the position of the

    binary point. In this example, it is 3 digits before the binary point and 26 digits

    after. The result is still a 30-bit fixed-point datum.

    The fixed-point representation has a fixed data structure that only works well

    on a limited data range. For example, the fixed-point number with the format in

    Figure 2.13 only has a range from -8.0 to 7.9999... [29]. If an integer result is bigger


    Figure 2.13: Multiplication of fix-point representation

    than 2^3 − 1 = 7, then it cannot be represented by three digits before the binary point. The situation where a number cannot fit into the representable range is called

    overflow. The minimal absolute value which fixed-point data can represent is also

    limited. The situation where a number is less than the resolution of the representation

    is called underflow. Although the range and resolution in fixed-point representation

    are limited compared to same-length floating-point data, it is suitable for many

    applications which do not need both high precision and wide dynamic range.

    In the example in Figure 2.13, we ignored the 27th to 52nd bits after the binary

    point in order to fit the result into the current representation. If a value with higher

    resolution cannot be fit in the current representation, some information must be

    discarded in order to decrease its resolution. The most common methods are “trun-

    cation” and “rounding toward nearest”. Truncation is computationally simpler, just

    dropping all the digits beyond those required, as we did in the above example. Rounding, instead, introduces a smaller error. When all numbers are rounded to the nearest representable number, the result is more accurate than truncation half of the time, and the same the other half of the time. Of course rounding requires more effort to process; it can be accomplished by adding one at the bit position just below the last retained bit and then truncating.

    Depending on the number of states required, N in the N-bit fixed-point repre-

    sentation can be any length we want, so there is no standard length for fixed-point representation. Because of this flexibility, a designer can

    optimize the system by choosing the minimal-length fixed-point data with accept-

    able error. The shorter the data length, the lower the cost, the more area available

    for parallelism and finally the faster the design. This point is particularly useful in

    hardware design since speed and area are the ultimate goals of the hardware designer.

    Because of its simplicity and fast processing, fixed-point representation is widely used

    in custom hardware designs for applications where the data resolution and dynamic

    range can be constrained.

    2.3.3 Fixed-Point Quantization

    Fixed-point quantization refers to the process of approximating values in floating-

    point representation with a fixed-point representation. An N-bit signed fixed-point

    representation A(a, b), as shown in Figure 2.12, has the value:

    Vfix = −S · 2^a + Σ_{n=0}^{a−1} A_n · 2^n + (1/2^b) · Σ_{n=0}^{b−1} B_n · 2^n

    The rounding method we use in this research is rounding toward nearest. So if the

    lengths of a and b are decided, for any floating-point value Vfloat, we can approximate

    Vfloat with fixed-point by representing the floating-point number in two’s complement

    arithmetic, rounding it and then truncating the digits that do not fit in the fixed-point

    representation.

    The main problem remaining is how to choose the lengths of a and b. Fixed-point quantization creates error, which is the difference between the floating-point


    and the fixed-point representation. Usually the longer the fixed-point value, the less

    the quantization error. However, longer values require more hardware resources, so

    the fixed-point representation we choose should balance minimizing the error with

    minimizing the bit-width to save hardware resources. This is a trade-off between

    quantization error and hardware area. This decision is made according to the al-

    gorithm analysis of relative error and visual error. The detailed analysis will be

    discussed in Chapter 5.

    2.4 Summary

    This chapter presented the theories of the FDTD method, which is the target al-

    gorithm of this dissertation. Background on FPGAs and reconfigurable computing,

    which is the basis of this research, was then introduced. Finally, we discussed the concepts of finite precision and fixed-point quantization, which will be used in the

    algorithm analysis. In the next chapter, we will present related work.

  • Chapter 3

    RELATED WORK

    In this chapter we introduce work related to this research. Related work falls into the following categories: applications of the FDTD method, parallel-computing systems for FDTD acceleration, VLSI implementations for FDTD acceleration, and FPGA implementations for FDTD acceleration.

    3.1 Applications of the FDTD Method

    The FDTD method is an important tool for investigating the propagation, radi-

    ation and scattering of electromagnetic waves. Before the 1990s, the cost of solving

    Maxwell’s equations directly was large and most of the related research was for mil-

    itary defense purposes. For example, engineers used huge parallel supercomputing

    arrays to model the radar wave reflection of airplanes by solving Maxwell’s equations,

    trying to develop an airplane with low radar cross section [30]. The difficult task of

    solving Maxwell’s equations has had more economical solutions since 1990, with the

    development of fast computing resources applied to the FDTD method. Now the


  • CHAPTER 3. RELATED WORK 39

    application of the FDTD method has spread to many areas including:

    • Discrete scattering, by modelling electromagnetic wave scattering from discrete objects [7],

    • Antenna design and radar design, by modelling various kinds of antennas,

    • Digital circuit packaging on multi-layer circuit boards, by helping designers analyze electromagnetic wave phenomena on complex circuit boards [30],

    • Subsurface detection and ground penetrating radar (GPR), by modelling the electromagnetic phenomena of GPR on the subsurface model [31] [32], and

    • Medical studies, by modelling electromagnetic wave phenomena in the human body, including:

    – The study of the effect of electromagnetic waves from cell phones on the human brain,

    – The study of the computation of light scattering from cells [33], and

    – The study of breast cancer detection using electromagnetic antennas [34] [35].

    Among these various applications, several works are closely related to our research,

    both in the subsurface detection areas and in medical studies.

    The use of the FDTD method to simulate Ground Penetrating Radar (GPR)

    applications for anti-personnel mine detection has been introduced [31, 32]. With

    carefully modeled transmitting and receiving antenna locations, a 2D FDTD model

    can simulate the wave propagation and scatter response of 3D GPR geometries with

    realistic ground material [31]. An advanced 3D FDTD model was introduced [32]

    which models more complex geometries, provides more accurate results and allows


    complete freedom for the location of the transmitting and receiving antennas. The

    3D FDTD method implements realistic dispersive soil along with air, metal and di-

    electric media. The Mur-type absorbing boundary conditions were implemented and

    produced good results. Both the 2D and 3D models have been validated by experi-

    ments performed with a commercially available ground penetrating radar system and

    realistic soil. The results produced by the FDTD models are very accurate compared

    with experimental data.

    Research on FDTD modelling for microwave breast cancer detection has been pre-

    sented [34, 35]. Because of the large difference in electromagnetic properties between

    malignant tumor tissue and normal breast tissue, microwave breast cancer detection

    has attracted much interest as it may overcome some of the shortcomings of X-ray

    based detection. Accurate computational modeling of microwaves in human tissue

    with the FDTD method is very helpful for breast cancer detection research. In these

    papers, the authors build a 3D simulation of the human breast, which includes a

    semi-ellipsoid geometric representation of the breast and a planar chest wall. A sin-

    gle pole conductivity model [11] based on the Z-transform is presented as a simple and

    efficient way of modeling various dispersive media, in the range of 30MHz to 20GHz,

    including human tissue. The UPML Absorbing Boundary Condition is implemented

    and performs better than the Mur-type ABC.

    Clearly, the FDTD method is a powerful tool that can be used in many different

    applications. Our hardware implementations are based on both the mine detection

    and the breast cancer detection research mentioned above [31, 32, 34]. Our research

    leads to a hardware FDTD core supporting both GPR and medical applications, and, with minimal modification, other application areas.

    The FDTD method is computationally intensive. Without powerful computational resources, FDTD models are too time consuming to be implemented on a single


    computer node. The backward FDTD algorithm, or the time reversal algorithm [35]

    based on the FDTD method, especially needs speed-up because realistic subsurface

    detection instruments, which are based on the backward algorithm, place the most de-

    mands on computational resources in order to achieve real-time working speed. Much

    effort has been spent on accelerating FDTD implementations for better performance.

    3.2 Parallel-computing Systems for FDTD Acceleration

    One of the early approaches to accelerating the FDTD method was to use supercomputers or parallel computer arrays to calculate the FDTD algorithm in software.

    A two node parallel algorithm exhibiting perfect speedup should run twice as fast

    as a single node computer. However, because of communication bottlenecks, parallel

    systems cannot achieve perfect speed up. The goal is to achieve near-perfect speedup.

    A parallel implementation of the FDTD algorithm is presented [36]. In this pa-

    per, the authors compared parallel FDTD code running on three different parallel

    computing systems: a typical 16 node Beowulf system and two traditional parallel

    supercomputers. The traditional supercomputers have better inter-node communi-

    cation bandwidth and latency compared to the typical Beowulf system, but the cost

    of traditional supercomputers is much higher. Well written parallel FDTD code per-

    forms well on all three systems. The authors compare the performance results of the

    Beowulf system to the supercomputers only, not to a single node computer. We can assume the speedup is at most 16-fold, and somewhat less in practice due to communication overhead. In this paper, the authors also analyze the decomposition

    and communication problems of the parallel FDTD method, which is helpful to our


    research on the two-FPGA implementation of the FDTD method. According to the

    analysis, the FDTD code should scale to a large number of nodes easily, because most

    communication is between neighboring nodes.

    More detailed research on Beowulf system performance for the FDTD method

    is introduced [37]. The project investigated the performance of a parallel FDTD

    implementation on a Beowulf workstation cluster where the computation grid was

    divided among nodes. The authors point out that for a fixed-size problem, as the

    number of nodes increases, the spee