FullMonte: Fast Biophotonic Simulations
by
Jeffrey Cassidy
A thesis submitted in conformity with the requirements for the degree of Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
© Copyright 2014 by Jeffrey Cassidy
Abstract
FullMonte: Fast Biophotonic Simulations
Jeffrey Cassidy
Master of Applied Science
Graduate Department of Electrical and Computer Engineering
University of Toronto
2014
Modeling of light propagation through turbid (highly-scattering) media such as living tissue
is important for a number of medical applications including diagnostics and therapeutics. This
thesis studies methods of performing such simulations quickly and accurately. It begins with a
formal definition of the problem, a review of solution methods, and an overview of the current
state of the art in fast simulation methods encompassing both traditional software and more
specialized hardware acceleration approaches (GPU, custom logic). It introduces FullMonte,
the fastest mesh-based Monte Carlo software model available and highlights its novel optimiza-
tions. Additionally, it demonstrates the first fully three-dimensional hardware simulator using
Field-Programmable Gate Array (FPGA) custom logic, offering large gains in power efficiency (40x) and performance (3x). Next, a plan for significant future feature enhancements and performance scale-out is sketched out. Lastly, it proposes applying the simulators developed to a
number of problems relevant to current clinical and research practice.
Acknowledgements
It goes without saying that my two supervisors, Professor Vaughn Betz and Professor Lothar
Lilge, were both extremely important to the completion of this work. Were they “just” tech-
nically savvy, well-informed across a wide range of topics, and well-respected in their fields, I
would have been very fortunate. They are undoubtedly that, but my good fortune goes fur-
ther as they are also hard-working, excellent mentors, generous with their time, and energetic
supporters: truly outstanding role models. Their guidance and constant enthusiasm have made
graduate school enjoyable, indeed so much so that I look forward to many (but not too many!)
more years working for them during my PhD: an outcome that I had not originally planned
for, but one of the easiest decisions I’ve made.
I am very thankful to Professor Jonathan Rose, who provided the introduction to Lothar
without which this collaboration would not have happened.
Much gratitude is due to Emily Dobson for her love, support, and patience, particularly
during the “crunch” stage: writing this thesis while at the same time taking a full course load
towards my PhD. You have been incredible throughout - thank you so much!
My thanks also go to a number of friends and family whose support and encouragement
have been vital along the way. First and foremost, to my grandmother Geneva McNeil who is
the model of generosity, patience, and kindness. My aunt Donna McNeil has always been there
for me, and provided a peaceful place to work and/or relax when it was needed. My aunt Susan
and uncle Paul Douglas have been supportive during both my undergraduate and graduate
education. Good friends Chris Trendall and Nancy Wolf have provided many laughs and kind
words along the way. Dianna Lanteigne was very important in my decision to leave work and
return to graduate school.
I am also thankful for funding and in-kind contributions from several organizations. Blue-
spec Inc. provided the Bluespec Compiler and related software which made designing and
simulating the FPGA implementation very much faster and easier than I could have expected.
Altera Corp. provided the software tools used for hardware synthesis of the FPGA implementation. Financial support was provided by Altera Corporation, the Ontario Cancer Institute,
and the University of Toronto.
Contents
1 Introduction 1
1.1 Medical Uses of Light . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Photodynamic Therapy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.1.2 Bioluminescence Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Inverse Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Organization of Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2 Background 5
2.1 Applications of Diffuse Light . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Photodynamic Therapy (PDT) . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 Diffuse Optical Tomography . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.3 Bioluminescence Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1.4 Diffuse Optical Spectroscopy . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Tissue Optics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 Light Propagation Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Geometry Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Material Optical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.3 Source Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3.4 Output Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 Numerical Solution Implementations . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Finite Element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.4.2 Monte Carlo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.5 Existing Implementations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.1 MCML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.2 tMCimg . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.5.3 CUDAMC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.4 CUDAMCML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5.5 GPU-MCML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.6 NIRFAST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.7 TIM-OS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.8 MMCM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.5.9 MCX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5.10 FBM (MCML on FPGA) . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6 Computing Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.6.1 Central Processing Units (CPU) . . . . . . . . . . . . . . . . . . . . . . . 29
2.6.2 Graphics Processor Units (GPU) . . . . . . . . . . . . . . . . . . . . . . . 29
2.6.3 Field-Programmable Gate Array . . . . . . . . . . . . . . . . . . . . . . . 29
3 Software model 31
3.1 Design choices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.1 Monte Carlo simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.2 Geometry Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.3 Tools and Libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.1.4 Programming Language and Style . . . . . . . . . . . . . . . . . . . . . . 33
3.2 Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Performance enhancements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.2 Explicit parallelism through SIMD intrinsics . . . . . . . . . . . . . . . . 36
3.3.3 The wmin Russian roulette parameter . . . . . . . . . . . . . . . . . . . . 36
3.4 Output Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.5 Profiling information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.1 Geometry Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.5.2 Operation Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.5.3 Coordinate precision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.4 Spin Calculation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.5.5 Intersection Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4 FPGA Implementation 46
4.1 Motivation for Hardware Acceleration . . . . . . . . . . . . . . . . . . . . . . . . 46
4.1.1 GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.2 Intel Xeon Phi processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.1.3 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2 Design Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Hardware Platform: Altera-Terasic DE-5 . . . . . . . . . . . . . . . . . . 48
4.2.2 Implementation Language: Bluespec . . . . . . . . . . . . . . . . . . . . . 48
4.2.3 Design Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.4 Design Goals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.2.5 Data Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.2.6 Packet Loop Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3 Design Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.1 Random Number Generation . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.2 Photon launch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.3 Step length generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.4 Tetrahedron Lookup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.3.5 Intersection test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.6 Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
4.3.7 Absorption, roulette, spin, and step finish . . . . . . . . . . . . . . . . . . 64
4.3.8 Altera DSP Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.3.9 Mathematical operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5 Results 66
5.1 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.1 Unit Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.2 Assertions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.1.3 Conservation of Energy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.1.4 Comparison to Reference Simulators . . . . . . . . . . . . . . . . . . . . . 68
5.2 Algorithm Profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2.1 Operation Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.2 Memory Access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.3 Software Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3.1 Caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.2 Comparison to TIM-OS . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3.3 Multi-Threading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.4 wmin parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.4 Hardware Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.4.1 Area Requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.4.2 Power Consumption . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.5 Architecture Scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.1 Larger Meshes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.5.2 Parallelism for Greater Throughput . . . . . . . . . . . . . . . . . . . . . 87
5.5.3 Cost of Scale-Up . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6 Conclusions and Future Work 91
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.1 Contribution summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.2 FullMonte Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.1.3 FullMonte Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.1 FullMonte Software . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.2 FullMonte Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.2.3 New Acceleration Platforms . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.2.4 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
Bibliography 99
List of Tables
2.1 Summary of relevant tissue optical properties with typical values in the optical
window from Cheong [11] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 Comparison of existing simulators with key features: geometry, absorption scor-
ing, anisotropy, refraction, non-scattering voids, time-resolved data, and accelera-
tion methods: FPGA (Nx)=FPGA with N instances per chip; MT=multithreading;
SIMD=Intel SSE instructions, automatic or manual optimization; Asterisk indi-
cates planned future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.1 Core FPGA data structures for packet, geometry, and material representation . . 56
5.1 Test cases and variants used to evaluate operation complexity vs run time . . . . 77
5.2 Comparison of FullMonte and TIM-OS run times for Digimouse standard albedo
case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3 Run-time impact of changing wmin for three different Digimouse albedo scenarios 80
5.4 Area required for a single instance on Stratix V A7 device . . . . . . . . . . . . . 84
5.5 Performance and energy-efficiency comparison (FPGA vs CPU) at 210 MHz clock
rate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
5.6 Resource estimates for 8-pipeline cache hierarchy (DRAM peak b/w is 348M/sec,
so needs 27% efficiency); ∗ assuming 2 instances share 1 physical RAM; based
on Digimouse profiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
List of Figures
2.1 Absorption spectrum of principal tissue chromophores from Vogel and Venu-
gopalan [65], showing the tissue optical window from 630-1000nm . . . . . . . . . 6
2.2 Depiction of High-Resolution Diffuse Optical Tomography (HR-DOT) setup from
Habermehl et al [30] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Side-by-side depiction (L to R) of BLI image, CT scan, PET image, and dissec-
tion photograph of nude mouse with a bioluminescent xenograft tumour repro-
duced from [49] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1 Overview of hop, drop, spin flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.1 Block diagram for FPGA implementation, with stages requiring random numbers
shaded; the boxed group is actually a single block but is expanded to show packet
flow; see Fig 5.5 for event frequency details . . . . . . . . . . . . . . . . . . . . . 58
4.2 BSV example showing use of Randqueue to queue up random numbers . . . . . . 60
5.1 Results comparison (FullMonte software vs. TIM-OS) for Digimouse energy per
surface element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2 Results comparison (FullMonte software vs. TIM-OS) for Digimouse energy per
volume element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Validation of FullMonte hardware simulation vs FullMonte software . . . . . . . . 72
5.4 Photon packet event frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.5 Algorithm flow graph annotated with transition probabilities (edges) and average
per-packet operation counts (nodes) for Digimouse at standard albedo . . . . . . 74
5.6 Cacheability of four different test cases, showing relatively low hit rate for LRU
cache at top left/right (note logarithmic scale for cache size); static Zipf cache
at bottom left is better; bottom right shows L2 hit rate for two options with
Digimouse (std): Hybrid (L1 LRU, L2 LFU) requires 2377 elements for 50% hit
rate, while pure LRU (L1 LRU, L2 LRU) requires 8246 . . . . . . . . . . . . . . 76
5.7 Software run time vs. operation count: Mints and Mabs for a variety of test
cases, showing Mints as a predictor for run time . . . . . . . . . . . . . . . . . . 78
5.8 Result standard deviation vs result value at varying wmin values (Digimouse
surface emission at standard albedo) with vertical line showing 16-bit dynamic
range . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.9 Sandy Bridge i7-2600K die photo from Anandtech [61], showing the very large
area dedicated to caching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
5.10 Hardware block diagram of FullMonte (top) and FBM (bottom) showing latency
with core-loop edges in black; maximum loop latency is 100 for FBM and 52 (18)
for FullMonte . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.11 Proposed cache architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
List of Mathematical Symbols
Statistics and Random Variables
E [X] Expectation of random variable X
PrE Probability of some event E
Var [X] Variance of random variable X
cv(X) Coefficient of variation for random variable X, cv(X) = √(Var[X]) / E[X]
Bp Bernoulli distribution which returns 1 with success probability p, else 0
Uij Uniform distribution with output i ≤ x < j
Eµ Exponential distribution with CDF F(x) = 1 − e^(−µx) and mean 1/µ
Fk(x) Cumulative distribution function (CDF) for a distribution with parameter k
fk(x) Probability density function (PDF) for a distribution with parameter k
Fk^−1(x) Inverse CDF (ICDF) for distribution with parameter k
Photon packet properties
p [cm] Position
d Direction
a, b Auxiliary unit vectors orthonormal to d used in the scattering calculation
q [cm] Intersection of ray with material boundary
s [cm] Physical distance to intersection
l Base-2 dimensionless step length [0,∞)
t [ns] Time
w Weight (energy or equivalently expected number of photons)
Geometry
Ri Discrete homogeneous region
Si Discrete surface element
V [R] [cm3] Volume of region R
A[S] [cm2] Area of surface element S
i j k Unit vectors along the x, y, z axis respectively
n Interface normal vector
Ci [cm] Tetrahedron face constant (i ∈ [1, 4])
Tissue Optical Properties
g Anisotropy factor g = E [cos θ] where θ is the deflection angle
n Refractive index
α Albedo
β Persistence (number of steps from unit weight to roulette), β = −1/ln α
µa [cm−1] Absorption coefficient
µs [cm−1] Scattering coefficient
µ′s [cm−1] Reduced scattering coefficient µ′s = µs(1− g)
µt [cm−1] Total attenuation coefficient µt = µs + µa (reciprocal of Mean Free Path)
ρ [mol L−1] Concentration of absorbers
σs, σa [m2] Scattering (absorption) cross-section per molecule
ε [cm−1mol−1L] Molar extinction coefficient (molar absorptivity)
Physical Constants
NA [mol−1] Avogadro's number, 6.022 × 10^23
c0 [cm ns−1] Speed of light in vacuum, 29.98 cm ns−1
h [J s] Planck's constant, 6.626 × 10^−34
Simulation Parameters
N0 Total number of packets launched
m Probability of roulette survival
wmin Minimum packet weight to trigger roulette
Simulation Outputs
φ(x, t) [Js−1cm−2] Fluence rate (energy flux) at point x
Φ(x) [Jcm−2] Fluence Φ = ∫ φ(x, t) dt
ΦV [R] [Jcm−2] Average fluence over the volume of region R
EV [R] [J] Total energy deposited in region R
ΦA[S] [Jcm−2] Average fluence passing through a surface S
EA[S] [J] Total energy passing through surface S
Hardware-Related Symbols
ε The smallest representable value in a given number system
C Number of pipeline registers inserted into a dependence loop
fc [MHz] Core computational clock frequency
fmax [MHz] Maximum achievable system clock frequency
T Reciprocal throughput
L Latency (clock cycles)
Glossary and List of Abbreviations
BLI Bioluminescence imaging
CPU Central Processing Unit
CT X-ray Computed Tomography
CUDA Compute Unified Device Architecture, a GPU programming language by NVIDIA
CUDAMC CUDA-based time-resolved MC for semi-infinite homogeneous non-absorbing media
CUDAMCML A CUDA (GPU) implementation of MCML
CW Continuous-wave
DOS Diffuse Optical Spectroscopy
DOT Diffuse Optical Tomography
fNIRS Functional Near-Infrared Spectroscopy (synonym for DOT)
FPGA Field-Programmable Gate Array
GPGPU General-Purpose computing on Graphics Processing Unit
GPU Graphics Processing Unit
GPU-MCML GPU implementation of MCML
HLS High-Level Synthesis
HNC Head and Neck Cancers
IPDT Interstitial (within the body) PDT
MCML Monte Carlo for Multi-Layered media
MC Monte Carlo
MCX Monte Carlo Extreme, a voxelized GPU-based simulator
MFP Mean Free Path (1/µt)
MMCM Mesh-Based Monte Carlo Method
MRI Magnetic Resonance Imaging
NIRFAST Near-Infrared Fluorescence And Spectral Tomography, a Matlab-based diffusion solver
PDT Photodynamic therapy
PS Photosensitizer
PT Photodynamic Threshold, a dose definition
RNG Random Number Generator
RTE Radiative Transfer Equation
RTL Register-Transfer Level (detailed hardware design of a digital system)
SFDI Spatial Frequency-Domain Imaging
SFMT SIMD-Oriented Fast Mersenne Twister
SIMD Single Instruction Multiple Data
SMT Simultaneous Multi-Threading: multiple threads sharing one core (Intel “Hyperthreading”)
SPMD Single Program Multiple Data
SSE Intel SIMD Streaming Extensions (vector instructions)
TIM-OS Tetrahedral Inhomogeneous Mesh Optical Simulator
Chapter 1
Introduction
1.1 Medical Uses of Light
Many important medical applications make use of light in the “optical window” which is gener-
ally defined as wavelengths from deep red (≈ 630nm) into the near infrared (≈ 1060nm). The
region is so named because absorption by common tissue and blood constituents is at a minimum
there [65], allowing light to travel large distances into the body. Light at these wavelengths is
non-harmful, generally inexpensive to produce and detect, and easily guided by optical fibres
which can be applied at surfaces, through endoscopes, or inserted using needles.
As an imaging and detection method, light in the optical window can provide in vivo func-
tional information through non- or minimally-invasive means using highly-portable devices.
This contrasts with other imaging modalities such as magnetic resonance imaging (MRI) which
is very expensive and non-portable. Ionizing radiation such as x-rays (including CT scans) is
likewise non-portable and gradually harmful as the dose accumulates. Positron emission tomog-
raphy (PET) provides functional information based on glucose uptake but requires injection of
a radioactive tracer, which should be kept to a minimum for human patients. The absence of
all these drawbacks makes light an attractive choice for medical sensing.
As a treatment technology, light can be used in a targeted way to destroy unwanted cells
including cancer. Red and near-infrared light does not inherently have any cumulative toxic
effects. As a result, unlike ionizing radiation such as x-rays, light-based treatments do not
have any inherent limit on the number of times they can be applied. This is particularly
important for treating conditions like cancer where local recurrences may happen, requiring
re-treatment. In some cases, ionizing radiation treatment may not be usable a second time due
to the accumulated damage during the first treatment.
The utility of light in this window is limited by the fact that biological tissues are very
turbid, meaning they scatter light strongly. Hence, any light which travels through more than
a fraction of a millimetre of tissue will be scattered and become diffuse rather than focused.
In addition, any in vivo sensing or imaging will have to contend with background scatter from
surrounding tissue, which reduces the contrast. Accurate calculation of scattered light propagation is therefore essential for the design and function of medical devices, as well as correct
interpretation of results from measurements made with light. Safe and effective therapeutic
use (such as destroying unwanted cells) depends on the ability to predict the light distribution
and hence the correct distribution of absorbed energy within the tissue. A few examples are
highlighted below as motivation for the research presented here. Greater detail on state of the
art applications and simulation methods is presented in Chapter 2.
Each solution technique directly or indirectly uses a (possibly approximated) form of the
Radiative Transfer Equation (RTE), which is the basic conservation law governing light prop-
agation in turbid media. It states that for each point, the photon flux in a given direction is
governed by the incident flux in that direction, minus losses due to scattering and absorption,
plus scattering from other directions into this direction. Analytic methods and numerical sim-
ulations of light transport problems make use of varying techniques to produce solutions that
conform to the RTE.
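In symbols, one common time-independent form of the RTE described above is (notation follows standard radiative-transfer references, not necessarily the form used later in this thesis):

```latex
\hat{s} \cdot \nabla L(\mathbf{r}, \hat{s})
  = -\mu_t \, L(\mathbf{r}, \hat{s})
  + \mu_s \int_{4\pi} f(\hat{s} \cdot \hat{s}')\, L(\mathbf{r}, \hat{s}')\, \mathrm{d}\Omega'
  + q(\mathbf{r}, \hat{s})
```

Here L is the radiance at position r in direction ŝ; the −µt L term collects losses to scattering and absorption, the integral gathers light scattered into ŝ from every other direction ŝ′ through the phase function f, and q is a source term.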
1.1.1 Photodynamic Therapy
Photodynamic Therapy (PDT) [70] is an interesting and promising emerging medical applica-
tion of light. It is a targeted, minimally-invasive treatment used to selectively kill diseased cells
including cancer or bacteria. The patient is given a non-toxic photosensitizer (PS) which is
sensitive to light. When exposed to light of a specific wavelength, the photosensitizer excites
the oxygen normally present in living tissue into a reactive form with a short lifetime. The
excited oxygen quickly reacts with proteins and lipids in cells, causing damage that leads to cell
death if the accumulated damage is sufficient. Since the oxygen radicals have a short lifetime,
the effects are confined to the immediate area where light exposure, photosensitizer, and tissue
oxygen overlap so there is little to no systemic toxicity.
The optimal treatment plan provides a dose which damages all target cells while minimizing
collateral damage to nearby organs at risk. Without an accurate model of light transport in
tissue, it is impossible to predict the light energy received, and hence the PDT dose delivered and
treatment outcome. No sufficiently fast and accurate light-propagation simulation, with the related dose-evaluation and treatment-planning software, yet exists for this application; this gap has been a barrier to the use of interstitial PDT for complex anatomy. The current state of PDT
research and clinical use is summarized in Section 2.1.1. The goal of this thesis is to provide
part of the solution, namely the fast and accurate light-propagation calculation which will
enable progress in dose evaluation and treatment planning.
1.1.2 Bioluminescence Imaging
Another important application which relies on fast and accurate simulation of light propagation
through tissue is bioluminescence imaging (BLI) [54]. BLI is a popular research tool used on
small animals in which a cell line of interest presenting a disease is transfected with a gene
which causes it to produce a protein which luminesces (produces light) without application of
an external excitation source. This enables monitoring the spread of that cell line by observing
the luminescent emissions using a low-light camera. Most BLI work is currently qualitative,
using the images to track the progression and spread of disease. Quantitative BLI (QBLI) [46],
also known as Bioluminescence Tomography (BLT) is an emerging technique which attempts
to reconstruct an accurate geometric model of the volume of interest using knowledge of the
anatomical structure (usually obtained by MR or CT), optical properties, and further assump-
tions about the volume of interest. This information is provided as constraints to a numerical
solver which tries to find a simulation geometry which minimizes the difference between the
simulated light pattern and the observed pattern. Given sufficient time and computing power,
it is possible to obtain quantitative functional information about the volume of interest.
1.2 Inverse Problems
The applications introduced above and many others rely on solving a mathematical inverse to
the RTE, where a volume description (geometry, optical properties, sources) is sought which
gives a particular pattern of fluence. Since no closed-form analytic solution exists for complex
geometries, it is generally necessary to solve the problem using iterative techniques. The large
number of iterations required makes the successful use of such techniques dependent on a fast and
accurate implementation of the forward solution to the RTE so that many candidate solutions
may be tried to find the best.
Taking for instance PDT, a desired dose is defined in terms of constraints on energy per
unit volume within the treatment volume and the necessary source configuration must be solved
for. A physician will define the target volume, organs at risk, and the desired dose parameters
for the different structures. The goal is often described as a minimum dose to be delivered to
the target tissue and a maximum dose not to be exceeded for nearby healthy tissue. Thus,
a treatment plan specifies, for a specific patient and target dose profile, the number of fibre-optic sources and, for each, its position, shape, total input light intensity, and duration.
With such a large number of free parameters, the space of possible treatment plans is likewise
large. Lacking an analytic solution giving a treatment plan from a problem definition, it is
necessary to start from one or more guesses and successively refine them. Evaluation of each
candidate refinement requires calculation of a separate forward simulation, and we anticipate
that hundreds or thousands of such simulations may be necessary for optimization.
In bioluminescence imaging, the goal is similar except that the difference between observed
and simulated surface emission should be minimized. Given an observed distribution of light,
the researcher wishes to find the distribution of sources which gave rise to the observation.
The problem is generally constrained by additional anatomical information, either a reference
anatomy or by structural information from other imaging modalities (MRI, CT). Again the
minimization problem will likely take hundreds or thousands of iterations of the forward simu-
lation.
The common factor in these and other techniques is that a large number of forward simu-
lations must be conducted to arrive at a solution. Research and clinical relevance demand that
the overall cost and computation time be reasonable prior to widespread adoption. Hence, this
thesis focuses on making biophotonic simulations as fast as possible as an enabler of a wide
variety of important optical techniques in medical research and ultimately clinical practice.
1.3 Contributions
Given the need outlined above for fast and accurate simulations of light propagation from one
or more sources within a heterogeneous tissue volume, we first investigated and produced a fast
simulator. We started by understanding and improving on the best available software. The number of computational cores and the energy required to achieve practically useful simulation times were deemed excessive, so we investigated faster and more efficient computational platforms. Next, an implementation using custom digital logic was undertaken to provide further
integer-factor gains in performance and power efficiency. The principal contributions presented
in this thesis are as follows:
• The fastest tetrahedral-mesh-based¹ software Monte Carlo light propagation model available
• A novel, faster and more hardware-friendly method for computing scattering
• The first FPGA-based implementation of a tetrahedral mesh-based Monte Carlo light
scattering simulator
• A hardware-accelerated simulator which is faster (3x) and more power-efficient (40x) than
a CPU
1.4 Organization of Thesis
The balance of the thesis is organized as follows. A more thorough review of relevant ap-
plications, the physics of diffuse light propagation theory, and the current state of the art in
simulation methods is presented as background material in Chapter 2. Next, Chapter 3 presents
the FullMonte C++ CPU-based software model. It is the fastest existing simulator in its class,
and incorporates several novel features to enhance performance and customizability. Based on
the software model, a hardware implementation using Field-Programmable Gate Arrays was
created as described in Chapter 4. Chapter 5 shows simulation results in terms of functional val-
idation, power consumption, and performance for the two models. Finally, Chapter 6 presents
a discussion of future feature enhancements, performance scale-out, and application work to be
done.
¹A tetrahedral mesh description is, as explained later, the most flexible and accurate geometry model for light propagation simulations.
Chapter 2
Background
This chapter provides context for the research presented in the balance of the thesis. We start
with a discussion of therapeutic, diagnostic, and research applications of diffuse light. Next, we
present a brief summary of light-tissue interactions and the optical properties relevant to
propagation in tissue. In Section 2.3, we give a more formal definition of the forward problem
solved by the FullMonte simulators, as well as an abstract description of the simulation inputs
and outputs. Section 2.4 introduces the two principal solution algorithms: finite element with
the diffusion approximation, and Monte Carlo. It also gives a detailed description of the Monte
Carlo algorithm used in FullMonte, but without implementation details. The chapter ends with
a summary of the state of the art in diffuse propagation simulators, and an introduction to the
most common technologies for accelerating computation to place the FullMonte hardware effort
in context.
2.1 Applications of Diffuse Light
The relatively low absorption of living tissue in the tissue optical window ranging from dark red
to infrared (see Fig 2.1) presents a useful means for transporting energy into and out of tissue.
More importantly, a number of tissue constituents have distinctive spectral
features within this band. Consequently, many medical applications use light in this range to
measure and control biological processes, through even several centimetres of scattering tissue.
Several such applications are reviewed below to motivate the research undertaken.
Photodynamic therapy (PDT, Sec 2.1.1) is a light-mediated treatment where chemical reac-
tions are caused by the absorbed photons, requiring careful control of the fluence rate through-
out the planning volume. Fluorescence and absorption imaging methods like Diffuse Optical
Tomography (DOT, Sec 2.1.2) rely on both an excitation input and an observed return to ex-
tract information on the distribution of fluorescent or absorbing molecules. Bioluminescence
Imaging (BLI, Sec 2.1.3) uses light emitted from within tissues that arrives at the skin surface
to gain functional information. Diffuse Optical Spectroscopy (DOS, Sec 2.1.4) makes use of the
variation of optical properties across different wavelengths to infer material composition and
Figure 2.1: Absorption spectrum of principal tissue chromophores from Vogel and Venugopalan [65], showing the tissue optical window from 630-1000 nm
hence physiological parameters. All of these applications exploit the tissue optical window, and
work entirely with light that has been scattered many times. For all of them, accurate propa-
gation simulations and knowledge of optical properties are essential for correct interpretation
of measurements or achievement of intended results.
2.1.1 Photodynamic Therapy (PDT)
Introduction to PDT
Photodynamic therapy (PDT) is a minimally-invasive treatment for a number of medical conditions, including cancer and bacterial infections, that destroys diseased cells; the level of damage is a function of the light intensity. It uses a photosensitizer (PS) which is either
applied topically (for superficial treatment) or given intravenously to be absorbed by the pa-
tient and selectively retained by the target cells. When the photosensitizer has oxygen nearby
and is exposed to photons in its absorption band, it excites the oxygen into a short-lived re-
active state. A reaction then quickly occurs with proteins and lipids, causing cell damage in
the immediate area of photon absorption. If sufficient damage is accumulated, the cell will
die through either apoptosis or necrosis depending on the degree of damage. Therefore, PDT
offers a light-mediated method of selectively killing cells, which means that treatment safety
and effectiveness depend on having an accurate model of light propagation to evaluate the PDT
dose to be delivered.
Interstitial PDT (IPDT) is the use of photodynamic therapy within the body using light
delivered by optical fibres inserted via one or more needles. Use of PDT for non-superficial
applications complicates the treatment planning effort since it offers far more degrees of free-
dom in light configuration, and has the potential to treat lesions closer to organs at risk deep
within the body. In order to plan a safe and effective treatment, dose definitions such as the
Photodynamic Threshold model exist [45] [25] and rely critically on the distribution of three
factors: PS, light fluence, and tissue oxygen. This research aims to provide a means for fast
and accurate prediction of fluence, thereby advancing interstitial PDT for complicated anatomy
towards clinical utility.
Current Clinical Status
PDT is currently approved for a number of superficial indications and has been used with great
success for a number of applications: skin lesions including actinic keratosis [62] and some skin
cancers [27]; Barrett’s oesophagus [29], a pre-malignant lesion; and bladder cancer [42]. In
each of these applications, the target region is accessible from the surface, extends only a few
millimetres in depth, and can be illuminated from a large surface area. Conversely, interstitial
PDT requires light delivery within the body, usually by optical fibres placed via needles. One
of the critical factors in PDT, particularly in the interstitial case, is fast and accurate treatment
planning.
The state of research in PDT as of 2008 is summarized in a review by Wilson and Pat-
terson [70]. A number of locations are active in PDT research, with treatments for various
indications in clinical trials being administered to patients. Encouraging results have been
found for interstitial PDT of pancreatic [32] cancer in humans, where it was concluded that
such treatment was safe and possibly efficacious, and that tumour necrotic volume was propor-
tional to dose delivered. In 2004, D’Cruz et al [18] presented a study of 128 patients receiving
PDT for advanced head-and-neck cancer (HNC) that was accessible to superficial illumination.
Median survival was significantly improved for patients who showed complete initial response
compared to those who did not.
Biel [3] presents a summary of over 1,500 patients treated within the previous 18 years with
PDT for HNC. Notably, the study used no treatment planning, instead following a standard
bodyweight-proportional dose of PS and a constant light dose. For superficial tumours accessible
by laryngoscope, a fixed light intensity was delivered to target a fixed range of surface fluence
values (J/cm2). For larger tumours (depth exceeding 3mm), cylindrical diffusers were implanted
within the tumour bed with a fixed 1cm spacing. Patients with laryngeal carcinoma in situ and
stage I-II tumour without node involvement received PDT alone, were all discharged home on
the same day, and showed a five-year cure rate of 90% without significant side effects. Multi-
institutional phase II and III trials were completed demonstrating efficacy for early primary and
recurrent cancers. Biel also reports a small (18-patient) clinical trial in which PDT was used
as an intraoperative adjuvant to surgery, and summarizes another fourteen patients treated
intraoperatively by Dilkes where two cases were disease-free after five months, but two others
had carotid blowouts, a serious morbidity likely attributable to overexposure. This suggests
that more precise treatment planning may be beneficial to avoid overdosing structures at risk.
Other trials discussed within the summary showed promise in palliation of late-stage disease as
well as very strong cure rates for early disease.
Davidson et al [17] reported in 2009 on a Phase II clinical trial of prostate cancer using
TOOKAD for vascular-targeted interstitial PDT. This was the first trial with patient-specific
treatment planning for prostate PDT, which was conducted using the diffusion model since the
prostate is a relatively homogeneous organ. The authors note that speed of solution is critical
to clinical utility, since the treatment plan must be updated during treatment due to shifting
fibre positions, changing optical properties, and changing photosensitizer concentrations. It was
demonstrated there and elsewhere [21][52] that PDT is a viable treatment option for prostate
cancer, despite some inter-patient variability in photosensitizer concentration.
The Need for Treatment Planning
Jacques [36] highlights the importance of tissue optical properties in treatment planning to
control the fluence delivered. While superficial PDT is inherently limited to a few millimetres
of depth, interstitial PDT delivers light below the surface which places it closer to potential
organs at risk. For skin cancers, a multi-layered model is often adequate so radial or even
planar symmetry can be assumed which is not possible in the more general case. As a result,
interstitial PDT will require more complex planning and have higher consequences for error,
particularly if used in the head and neck which have a large number of sensitive structures. For
general use, the clinical target volume will have a significantly more complex anatomy than the
prostate, where the entire gland (healthy or not) can be treated and even overexposed, though
there are organs at risk which must be protected (urethra, rectum). This thesis confines itself to
modeling of the light distribution as a first step to a complete PDT treatment planning system.
2.1.2 Diffuse Optical Tomography
One of the applications that has pushed modeling of turbid media forward is Diffuse Optical
Tomography (DOT), also known as Functional Near Infrared Spectroscopy (fNIRS), a technique
which uses mathematical tomographic techniques to reconstruct three-dimensional images of
absorption contrast through scattering media from measured transmission between pairs of
sources and detectors. It often acts as a complement to Functional MRI (fMRI), providing
similar information but via different mechanisms and with different costs and benefits.
A typical DOT setup has many (tens of) light sources operating at two or more wavelengths, with detectors placed around the volume of interest as shown in Fig 2.2. The signal
propagating between each source-detector pair is measured at multiple wavelengths. One of the
difficulties of the technique is that the detected light has been scattered a very large number of
times and arrived via a multitude of paths spanning a large volume of tissue. However, localized
perturbations in optical properties can be inferred from the measurements given some a priori
knowledge about the geometry and optical properties of the target volume, a light-propagation
simulator, and an iterative algorithm. In many cases, the changes of interest are in the concen-
trations of oxy- and deoxy-hemoglobin which are prominent absorbers in the red wavelengths
and can be discriminated from one another by a suitable choice of wavelengths in the dark red
(see Fig 2.1). These signals provide a view of cerebral hemodynamics, which can be used to
diagnose disease [50] and to learn about normal brain function [40]. Functional MRI using the
BOLD (Blood Oxygen Level Dependent) signal is the established technique for making such
measurements, however its cost, relatively slow acquisition time, and lack of portability make it
less than ideal. One area where DOT and its precursor fNIRS (functional Near-Infrared Spec-
troscopy) show significant potential is for continuous monitoring of brain oxygenation, both for
premature infants in neonatal intensive care [43] and for stroke victims. MRI is not suitable for
continuous monitoring due to cost, size and comfort concerns, giving a significant advantage to
optical techniques for such applications.
Since the collected light did not travel a straight path, the core of the DOT technique is
the ability to simulate light propagation through the target volume of interest. Determining
the amount and location of the perturbations of optical properties which caused a given optical
signal is a mathematical inverse problem. No closed-form solution exists for the relevant geome-
tries so it requires many candidate solutions to be tried. Result quality is strongly linked to the
Figure 2.2: Depiction of High-Resolution Diffuse Optical Tomography (HR-DOT) setup from Habermehl et al [30]
Figure 2.3: Side-by-side depiction (L to R) of BLI image, CT scan, PET image, and dissection photograph of nude mouse with a bioluminescent xenograft tumour, reproduced from [49]
quality of simulation, and the method’s utility is limited by the computational requirements.
Several of the software packages described later in this chapter (tMCimg, MCX) were originally
designed to support DOT of the brain, and FullMonte could be used for this purpose as well.
2.1.3 Bioluminescence Imaging
Bioluminescence Imaging (BLI [54]) is the use of genetically-encoded fluorescent proteins to
trace cell lines of interest in vivo. For instance, by inserting the correct gene into a cancerous
tumour and implanting that tumour in a small animal model, it is possible to watch the progress
of the disease throughout the body including the formation of metastases. An example in a
nude mouse model is shown in Fig 2.3. With the success of DOT and other diffuse sensing and
imaging modalities, interest has been increasing in quantitative techniques for BLI.
2.1.4 Diffuse Optical Spectroscopy
Diffuse optical spectroscopy (DOS), also known as photon migration spectroscopy, is a non-
invasive technique for detecting disease and monitoring its response to treatment. Based on
the premise that the optical properties of tissue differ between healthy and diseased tissue,
DOS aims to extract information from measurements of the optical properties at multiple
wavelengths. For in vivo diagnostics, though, it is not possible to disentangle the effects of
scattering and absorption using only continuous-wave light sources. The absorption measured
is a function of the absorption coefficient and the path length travelled, which depends on
scattering. In the absence of scattering, transmission at a given wavelength follows Beer’s law
as it would in a cuvette of non-scattering liquid, T = exp(−εLc), where the extinction coefficient ε depends on the absorber and the wavelength. Measuring at multiple wavelengths, given knowledge of ε for the chromophores present in the tissue, allows inference of the concentrations c within the interrogated tissue. However, the presence of scatterers means that the expected
path length L taken by a detected photon, and hence its probability of being absorbed, is
longer than the physical source-detector distance. In a non-homogeneous non-infinite medium,
that expected path length L is also a function of the tissue geometry and boundary conditions.
Consequently, the results of CW-DOS may be improved by a more accurate simulation of light
propagation if the tissue geometry is known. A prime example of this effect would be DOS of
the human breast [64], which requires a scattering model to produce a useful fit. Variants of
the technique such as Spatial Frequency-Domain Imaging (SFDI) [16] also rely on models of
light transport through tissue. While techniques using pulsed or temporally modulated light
sources to measure optical path length exist, the equipment required is very complicated and
expensive, making computing-based approaches with CW sources more desirable.
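To make the multi-wavelength inference concrete, the following sketch recovers chromophore concentrations from noiseless absorbances via linear least squares. All numbers (extinction coefficients, concentrations) are invented for illustration, not literature values, and the sketch assumes the path length L is known, which is precisely the assumption that scattering invalidates.

```python
import numpy as np

# Hypothetical extinction coefficients eps[wavelength, chromophore] --
# illustrative numbers only, not measured values.
eps = np.array([[1.0, 3.0],
                [2.0, 1.5],
                [4.0, 0.5]])        # [L mol^-1 cm^-1], 3 wavelengths x 2 chromophores
true_c = np.array([0.8, 0.3])       # concentrations [mol/L]
L = 1.0                             # known path length [cm] -- the key assumption

# Beer's law at each wavelength: T = exp(-L * (eps @ c)),
# so the absorbance vector is A = -ln(T) = L * (eps @ c)
A = L * (eps @ true_c)

# With knowledge of eps, recover the concentrations by linear least squares
c_fit, *_ = np.linalg.lstsq(L * eps, A, rcond=None)
print(c_fit)                        # -> approximately [0.8, 0.3]
```

In scattering tissue the effective L differs per wavelength and geometry, so a propagation simulation must supply it before this fit becomes meaningful.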
2.2 Tissue Optics
This section presents a brief overview of tissue optics. Jacques [37] presents an overview of
tissue optical properties, and a compilation of values for a wide variety of wavelengths and
tissue types.
The primary optical effects of interest in tissue are scattering and absorption, which occur
frequently (≈ 10-1000 cm−1). In the optical window previously discussed, scattering is one
to two orders of magnitude more prominent than absorption so any given photon will likely
scatter a large number of times before being absorbed. Additionally, when the refractive index
n differs between regions, the normal physics involving internal reflection, Fresnel reflection,
and refraction must be considered.
When a polarized light source such as a laser is used, there is the possibility of observing
polarization-dependent effects, which are generally fairly small signals caused by tissue bire-
fringence and chiral activity. The present work focuses on multiply-scattered light on length
scales that would generally make measurement of polarization-related effects difficult. All of
the applications presented above can safely neglect polarization. Some interesting specialized
biophotonic measurements using diffuse polarimetry have been proposed by Ghosh et al [28],
but have yet to become mainstream.
Coherence effects such as speckle are also generally ignored when modeling propagation
in turbid media for several reasons. Even though PDT often uses lasers which are coherent
light sources, the treatment time is long enough (minutes) that even slowly-changing speckle
patterns average out during the treatment time due to gradual shifting of tissues. In other
applications relating to fluorescence and bioluminescence, it is not relevant since the source is
incoherent. Lastly, for macroscopic applications it is not possible to produce a sufficiently fine
description of the target material that a meaningful simulation output would result.
It should also be mentioned that no non-linear effects are modeled. The Monte Carlo for-
mulation used in this work relies on the assumption that photon trajectories, scattering, and
absorption probabilities are independent of local fluence rate. As a result, harmonic genera-
tion, two-photon absorption, and Raman are fundamentally not possible to model within this
framework. For the applications summarized above, though, the power used is low enough that
nonlinear effects are insignificant.
2.3 Light Propagation Models
A forward problem description completely specifies the situation for which light propagation
is to be modeled. It consists of:
1. A geometry description, consisting of one or more regions, each with an associated material
2. A set of materials with all relevant optical properties defined
3. A set of light sources with distribution parameters and weights
4. A definition of the output data to be collected
The light propagation models described below produce one or more sets of output data for
a given input forward problem description. Before introducing solution methods (Sec 2.4), we
first discuss in detail the problem definition below.
2.3.1 Geometry Descriptions
A number of different geometry descriptions are possible when modeling turbid media. Each
geometry consists of a set of regions Ri, each of which has a boundary with defined surface
normals n, an associated material, and a set of adjacent regions. Region descriptions range
in complexity but must support as a minimum testing whether a point p is within the region,
finding the point q where a ray intersects the boundary, calculating the volume V [R] and
specifying which region is adjacent at that point.
Infinite
The simplest problem-geometry description is an infinite homogeneous medium. Under the
diffusion approximation to transport theory, the infinite case from an isotropic source has an
analytic solution. It is also very simple to simulate via MC, since the optical properties remain
the same regardless of position and there are no material boundaries. Due to the compact
problem representation it allows very simple simulations that can achieve high computational
performance when the analytic diffusion approximation is not appropriate.
Semi-Infinite
The problem complexity is increased only slightly when moving to a semi-infinite medium, in
which there are two materials: one turbid medium of interest (generally some form of biological
tissue), surrounded by another medium (often air). Generally such a problem takes the coordi-
nate z to be depth below the surface, which spans the xy-plane at z = 0. In such a model, there
is a boundary if z changes sign during the step. If the boundary is encountered and there is a
refractive index mismatch, it is necessary to model Fresnel reflection, total internal reflection,
and refraction when computing the step result. The description requires only one additional
parameter (external medium refractive index) beyond the infinite case since the interface loca-
tion z = 0 and normal vector k are implicit. When following a ray (p, d), the physical step
length s along the ray to arrive at the boundary is also simple: s = −pz/dz if dz < 0.
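As an illustrative sketch (not FullMonte code), the semi-infinite boundary-distance rule can be written as:

```python
def distance_to_surface(p, d):
    """Step length s along ray (p, d) to the z = 0 surface, where p[2] is the
    depth below the surface (p[2] > 0 inside the tissue).  The boundary is
    reached only if the ray heads upward (d[2] < 0); then s = -p[2] / d[2]."""
    pz, dz = p[2], d[2]
    if dz >= 0:
        return None     # parallel to, or heading away from, the surface
    return -pz / dz

# A photon 0.5 cm deep travelling straight up exits after 0.5 cm
s = distance_to_surface((0.0, 0.0, 0.5), (0.0, 0.0, -1.0))
```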
Planar
Among the first widely-used Monte Carlo methods was a model using infinite planar slabs of
material (MCML, Sec 2.5.1). If there are N slabs (usually 5-10) lying in the xy plane and the
photons arrive along the z axis, then there is cylindrical symmetry around that axis. Describing
such a geometry requires only the z coordinate of the lower edge for each of the N slabs, along
with optical properties. Assuming the source distribution also has cylindrical symmetry, the
absorption scoring can be reduced to 2D as Φ(r, z) with r = √(x² + y²), which reduces the number of
packets required to achieve acceptable result variance. Like the semi-infinite case, the interface
normal is always ±k. The boundary (j = i or i+ 1) faced by the ray can be found by checking
the sign of dz, and then the distance can be found as s = (zj − z)/dz.
Voxelized
To represent more complex geometries, a natural extension is to break the problem into discrete
cubic voxels with each being assigned a material. However, the voxelized geometry description
does not lend itself to accurate description of curved surfaces and particularly does not provide
smooth surface normals for such surfaces. The model has been applied to Diffuse Optical
Tomography (DOT) of the brain [6], where the refractive index is generally matched. Finding
the boundary with another material in this model requires looking up the material ID of every
voxel along the path. For large homogeneous regions, this scheme is inefficient in terms of both
storage space and computational effort, and it provides only a global tradeoff between resolution
and geometry size. Binzoni et al [4] demonstrate some of the shortcomings due to artifacts in
producing surface normals when using the voxelized model to describe curved surfaces with
refractive index differences.
Mesh-based
Three-dimensional volumes with general shapes can also be modeled as the union of a set of
tetrahedra, at the cost of some additional complexity. Methods of handling (Matlab/GNU
Octave) and visualizing (Visualization Toolkit [34]) tetrahedral meshes are well-known from
other applications including Finite Element Analysis. While it lacks the regularity of cubic
voxels, the tetrahedral mesh description has two major advantages.
First, the normal for all interfaces is directly available. For a tetrahedron defined by
counterclockwise-oriented points P1,P2,P3,P4, the normal to the face opposite P4 is found
by normalizing (P2−P1)× (P3−P1). Any of the three other normals can be found by rotat-
ing the point array appropriately. Using these normals, the interior of the tetrahedron is the
intersection of four half-spaces, which is the set {x : ni · x ≥ Ci, i ∈ [1, 4]}. Whether a point
is inside the tetrahedron or not can be tested by direct evaluation of the four conditions just
given. For a more thorough introduction to representations and operations on polytopes (closed
N-D objects bounded by flat sides), the reader is directed to a reference book by O’Rourke [39].
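The face-normal and half-space computations above can be sketched as follows. This illustrative version (not the FullMonte implementation) adds a sign-flip step so it does not depend on the vertices being supplied in counterclockwise order:

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def tet_halfspaces(P):
    """Given the 4 vertices of a tetrahedron as rows of P, return (normals, C)
    such that the interior is {x : n_i . x >= C_i for all i}.  Each normal is
    the normalized cross product of two edges of the face opposite one vertex,
    flipped if necessary so that the opposite vertex lies on the positive side."""
    normals, C = [], []
    for i in range(4):
        face = np.delete(P, i, axis=0)   # the 3 vertices of the face opposite P[i]
        n = unit(np.cross(face[1] - face[0], face[2] - face[0]))
        c = n @ face[0]
        if n @ P[i] < c:                 # orient inward, toward the opposite vertex
            n, c = -n, -c
        normals.append(n); C.append(c)
    return np.array(normals), np.array(C)

def inside(x, normals, C, tol=1e-12):
    """Direct evaluation of the four half-space conditions."""
    return bool(np.all(normals @ x >= C - tol))

# Unit right tetrahedron: interior is {x,y,z >= 0, x+y+z <= 1}
P = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
N, C = tet_halfspaces(P)
```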
Second, the mesh can be made coarser or finer as needed for the problem; areas which do
not need large amounts of detail have a very compact representation, while curved surfaces
can be progressively refined as a set of piecewise-linear approximations. Consider the case
where a photon must take a step in a homogeneous region. In the voxelized representation, the
algorithm must advance voxel-by-voxel, constantly calculating a new grid index and fetching
the material code for it, possibly many times. By contrast, in the mesh representation a single
tetrahedron can represent an arbitrarily large region, and a step of any size within the region
requires only four intersection tests, one for each face. Only if the step crosses one of the faces
must a new, adjacent tetrahedron be loaded to continue the step.
2.3.2 Material Optical Properties
The relevant tissue optical properties and their typical values for turbid media in the optical window are summarized in Table 2.1. Absorption and scattering are specified by their coefficients, respectively µa and µs, which give the expected number of interactions per unit distance traveled, typically in cm−1. Their sum µt is the total interaction coefficient, which is the expected number of interactions (scattering or absorption, which are independent) a photon has per unit length. Its reciprocal µt−1 [cm] is the mean free path, the expected distance traveled by a photon between interactions. The albedo α = µs/(µa + µs), with 0 ≤ α ≤ 1, is derived from the absorption and scattering coefficients and gives the probability that a
Value  Unit     Range     Typical   Description
µs     [cm−1]   ≥ 0       ≲ 3000    Scattering coefficient
µa     [cm−1]   ≥ 0       ≲ 300     Absorption coefficient
g               (−1, 1)   ≳ 0.8     Anisotropy coefficient
n               ≥ 1       ≈ 1.5     Refractive index

Table 2.1: Summary of relevant tissue optical properties with typical values in the optical window, from Cheong [11]
given interaction is a scattering event. When the photon scatters, the anisotropy parameter
g = E[cos θ] = E[d′ · d] describes the expected value of the cosine of the deflection angle (the
angle between the direction vectors before and after). A value of −1 is perfect backwards reflection (mirror-like), 0 is biased neither forwards nor backwards (the outgoing energy in the forward and backward half-spheres is equal), positive values scatter dominantly forwards, and 1 indicates no scattering interaction at all¹. In some situations g is not used directly, but it modifies
the scattering coefficient to yield a reduced scattering coefficient µ′s = (1 − g)µs which gives
similar behavior assuming the absorption coefficient is small compared to scattering and that
material properties are locally homogeneous.
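For concreteness, the derived quantities of this section can be computed as below, using illustrative soft-tissue values in the optical window (round numbers chosen for the example, not drawn from a specific reference):

```python
# Derived optical quantities (Sec 2.3.2); illustrative soft-tissue values.
mu_s = 100.0    # scattering coefficient [cm^-1]
mu_a = 0.5      # absorption coefficient [cm^-1]
g = 0.9         # anisotropy parameter

mu_t = mu_a + mu_s               # total interaction coefficient [cm^-1]
mfp = 1.0 / mu_t                 # mean free path between interactions [cm]
albedo = mu_s / (mu_a + mu_s)    # probability that an interaction is a scatter
mu_s_reduced = (1.0 - g) * mu_s  # reduced scattering coefficient [cm^-1]

print(mu_t, mfp, albedo, mu_s_reduced)
# -> 100.5, ~0.00995 cm, ~0.995, ~10 cm^-1
```

Note the high albedo: with these values a photon scatters roughly 200 times, on average, before being absorbed, which is why Monte Carlo packets must be followed through many interactions.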
2.3.3 Source Descriptions
A number of different source descriptions are possible. An non-exhaustive list is presented
below:
• Normally-incident pencil beam (directed beam, delta-function profile)
• Isotropic point
• Isotropic volume
• Directed surface (finite-width beam)
Some of the solution methods may not support all source types due to inherent restrictions
on symmetry, or due to a design choice not to include them. It should be noted that the diffusion
approximation supports only isotropic sources since the diffusion approximation is incompatible
with the notion of a directed beam. Virtual sources [26] can be used to approximate other source
profiles as sums of point sources. It is also possible [67] to model finite-diameter beams through
convolution of infinitely-thin beams, if the geometry is symmetric around the beam.
2.3.4 Output Data
Most often, biophotonic simulations are done in terms of the fluence Φ(x), which is the amount
of light energy passing through an infinitesimal area dA at a point x over some time period.
¹Scattering is elastic, so if the direction/momentum does not change, there was no energy or momentum transfer and hence no interaction.
Typically, units of J/cm² are used. If a single absorber (e.g. a molecule) with absorption cross-section σ [cm²] is exposed to such fluence, it will be expected to absorb σΦ(x) joules of energy.
Given a density ρ mol L−1 of absorbers, the total energy they absorb in a volume dV is
E(x) = NAρσΦ(x) dV = µaΦ(x) dV (2.1)
Since the energy of a single photon at wavelength λ is E = hc0/λ, absorbed energy is directly
convertible into a number of photons absorbed. For PDT, the number of photons absorbed by
the PS is proportional to the number of radicals created and hence damage caused. In other
applications, the signal detected is generally proportional to the number of photons arriving at
a camera or detector so fluence is often the most relevant quantity.
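As a worked example (with illustrative, not clinical, values), the absorbed energy of Eq 2.1 can be converted into a photon count:

```python
# Photons absorbed in a small volume, from Eq 2.1 and E = h*c0/lambda.
# Values are illustrative only.
h = 6.62607015e-34      # Planck constant [J s]
c0 = 2.99792458e8       # vacuum speed of light [m/s]

wavelength = 630e-9     # 630 nm, in the tissue optical window [m]
mu_a = 0.5 * 100.0      # 0.5 cm^-1 expressed as 50 m^-1
fluence = 100.0 * 1e4   # 100 J/cm^2 expressed as 1e6 J/m^2
dV = 1e-9               # 1 mm^3 expressed in m^3

E_absorbed = mu_a * fluence * dV    # Eq 2.1: E = mu_a * Phi * dV = 0.05 J
E_photon = h * c0 / wavelength      # ~3.15e-19 J per 630 nm photon
n_photons = E_absorbed / E_photon   # ~1.6e17 photons absorbed
```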
As defined, fluence is a continuous scalar field which has an analytical solution only for simple
geometries. To produce an approximate solution to a non-trivial problem, one must resort to
numerical simulation by discretizing the problem and finding piecewise solutions which obey the
RTE to within some tolerance. In those methods, continuous scalar fields such as fluence are
represented by average values over a finite number of regions. As a convention, the discussion
below uses parentheses for continuous fields such as fluence Φ(x), while using square brackets
to denote discrete arrays like ΦV [R] for the average fluence over a discrete volume region R.
Volume Fluence
When discretizing volume, the problem geometry is split into a number of regions Ri with
homogeneous optical properties, which could be described by voxels, tetrahedral elements,
cylindrical sections, or otherwise. The average fluence in a region R with finite volume V [R] can
be found as

ΦV [R] = (1/V [R]) ∫R Φ(x) dV  [J cm−2]  (2.2)
When using Monte Carlo methods to simulate light propagation, the simulator scores the
photon absorption (proportional to energy) within the volume, which is ∫R E(x) dV . Using
Eq 2.1 and assuming a homogeneous µa > 0, the average energy per volume can be converted
to fluence:

ΦV [R] = (1/(V [R] µa[R])) ∫R E(x) dV = EV [R] / (V [R] µa[R])  [J cm−2]  (2.3)
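Eq 2.3 amounts to a one-line conversion from scored energy to fluence; a minimal sketch (not the FullMonte implementation), with a guard for the µa > 0 requirement:

```python
def fluence_from_energy(E_scored, volume, mu_a):
    """Average fluence in a region from Monte-Carlo-scored absorbed energy
    (Eq 2.3): Phi_V[R] = E_V[R] / (V[R] * mu_a[R]).  Requires mu_a > 0, since
    a non-absorbing region scores no energy from which to recover fluence."""
    if mu_a <= 0:
        raise ValueError("fluence recovery via Eq 2.3 needs mu_a > 0")
    return E_scored / (volume * mu_a)

# 0.05 J absorbed in a 0.001 cm^3 element with mu_a = 0.5 cm^-1 -> 100 J/cm^2
phi = fluence_from_energy(0.05, 1e-3, 0.5)
```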
Surface Emittance
For surface imaging problems such as BLI and DOT, the quantity of interest is actually the
fluence escaping the surface (emittance), which is detected. For discrete surface element S with
area A[S], an average surface fluence can be calculated similarly as
ΦA[S] = (1 / A[S]) ∫_S Φ(x) dS = EA[S] / A[S]    [J cm⁻²]    (2.4)
Chapter 2. Background 17
Detectors
Some applications such as DOT model the use of specialized detectors, typically fibre-optic probes. In the case of small isotropic diffusers, the result should not differ significantly from the fluence in the surrounding tissue. In Monte Carlo simulations it is possible to specify customized probes and evaluate whether a photon is captured or not using a wide range of criteria.
Time Resolution
For non-continuous-wave applications including DOT or DOS, the fluence within a time window
matters. In these cases, the input light can be considered to be an infinitely short delta-function
in time δ(t), yielding a flux φ(x, t) as a response to that impulse. For modulated systems using
phase-sensitive detection, the amplitude and phase response H(ω) at each detector can be found
from the Fourier transform of the impulse response function. In the case of pulsed systems, the
time histogram h(t) is generally produced directly by time-gating the detector.
A Monte Carlo simulator can produce this by keeping track of the simulation time t since a packet was launched and splitting recorded fluence into N discrete time bins [ti, ti+1), i ∈ [0, N − 1]. When moving a distance ∆s, the time counter advances according to the speed of light and distance traveled, so ∆t = (n/c0) ∆s. When the packet is absorbed, it is assigned to the appropriate time bin i and region R so that the recorded energy is

EV[R, i] = ∫_{ti}^{ti+1} ∫_R µa φ(x, t) dV dt    (2.5)
from which fluence can be derived using Eq 2.3. A similar treatment can be done for surface
elements or custom detectors.
Compared to non-time-resolved simulation, the cost is N times more storage space per element, though this can be offset by selecting only a subset of elements to record.
TIM-OS and other simulators already provide such a capability. FullMonte does not yet, though the software version could easily be upgraded to do so, limited only by the size of memory. If a time histogram is desired for a small number of detectors or surface elements, the hardware version could also accommodate time resolution, limited only by memory capacity.
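The time-binning scheme described above can be sketched as follows (a simplified illustration; the segment list and bin layout are assumptions, not any simulator's interface):

```python
# Illustrative time-resolved scoring: advance the packet clock by
# dt = n * ds / c0 per step and map the total elapsed time to one of N bins.
C0_MM_PER_NS = 299.792458  # speed of light in vacuum [mm/ns]

def time_bin(path_segments, t_max_ns, n_bins):
    """path_segments: list of (step_mm, refractive_index) pairs.
    Returns the bin index for the total elapsed time, or None past t_max."""
    t = 0.0
    for ds, n in path_segments:
        t += n * ds / C0_MM_PER_NS
    i = int(t / (t_max_ns / n_bins))
    return i if i < n_bins else None
```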
2.4 Numerical Solution Implementations
The Radiative Transfer Equation [55] (RTE, Eq 2.6) for a single wavelength is the conservation
relation which must be obeyed for light transport in turbid media. It describes the conditions
for a function L(x, Ω) to be a valid description of radiance at a point x, in direction Ω.
(1/v) ∂L(x, Ω, t)/∂t + Ω · ∇L(x, Ω, t) + µt(x) L(x, Ω, t) = s(x, Ω, t) + ∫_{4π} L(x, Ω′, t) dµs(x, Ω′ → Ω) dΩ′    (2.6)
In this equation, µt is the interaction coefficient previously introduced and dΩ is an element
of solid angle surrounding the point x with surface normal Ω. Scattering is characterized by
µs(x, Ω′ → Ω), which is the proportion of intensity that is scattered from incident direction
Ω′ into direction Ω. The left side gives three terms for radiance decreases: non-steady-state
pulse propagation; steady-state energy flow; and, energy absorbed or scattered away from the
direction Ω. At right, there are two terms for radiance increases: a source term; and, an
integral over all other directions of the energy scattered into the direction Ω. For steady-state
(non-time-resolved) solutions, the first term is assumed to be zero. We note also that the bulk scattering coefficient µs must equal ∫_{4π} dµs(x, Ω′ → Ω), and that by definition, the variable we want (fluence) is the integral of radiance at point x over all directions and time:

Φ(x) = ∫ ∫_{4π} L(x, Ω, t) dΩ dt    (2.7)
Being a complicated partial differential equation (PDE), the RTE has known analytic solu-
tions only for very simple and/or approximated cases. Solution for more general cases requires
numerical methods, of which two are commonly used: the Finite Element Method (FEM) or
Monte Carlo (MC). Either discretization must obey the RTE, though they do so in different
ways.
2.4.1 Finite Element
Under the diffusion approximation to the RTE, the fluence distribution can be modeled as a quantity diffusing down a concentration gradient, similar to heat. Qualitatively, diffuse light is light that has been scattered enough to have lost all directionality. The approximation assumes that L in the RTE above is isotropic, meaning uniform over Ω, so L(x, Ω) = L(x) and dµs(x, Ω′ → Ω) = µs.
The FEM involves discretizing the volume of interest into tetrahedral elements and reducing
the RTE to a system of linear equations. A thorough treatment of diffuse light propagation is
given by Jacques and Pogue in [38], so only a cursory review is given here. More formally, the
diffusion approximation requires:
1. The materials involved have high albedo (µs ≫ µa)
2. There are no non-scattering voids in the material (µs > 0: all materials scatter)
3. All sources are isotropic: s(x, Ω, t) = s(x)
4. Scattering anisotropy can be neglected, i.e. dµs(x, Ω′ → Ω) = µ′s
5. Results are not expected to be valid within a few mean free paths of a source
6. All materials have a uniform refractive index
Making these assumptions has a number of attractive features. First, it reduces the problem
to that of solving a sparse matrix for which many fast and accurate programs exist. Second, it
offers analytic solutions for simple cases with certain symmetries. Perturbation techniques can give quick approximations for small changes in the problem geometry (e.g. a small material inhomogeneity). A high-quality freely-available implementation, NIRFAST (described in Sec 2.5.6),
is also available.
Offsetting these, though, is the cost of the approximations made. The results are acknowl-
edged to be valid only if the distance from a source or a material boundary exceeds a few mean
free paths. This assumption could be problematic for applications like PDT, particularly when using extended sources such that a large tissue volume is located near a source. Likewise, in PDT for complex anatomy there will be a large number of material boundaries, possibly including air cavities with a strong refractive index change, which are not modeled properly in the
diffusion regime. Considering the relative merits, we chose to pursue a Monte Carlo method
since it is inherently parallel and offers the best possible accuracy by capturing all relevant
physics without restrictive approximations.
2.4.2 Monte Carlo
Computer-based Monte Carlo (MC) models of light transport in turbid media take a different
approach. Instead of modeling conservation laws on a large scale, MC models track individual
photons using appropriately-distributed random numbers so that their expected behavior is
physically correct. Millions of such photons are traced and, after a sufficient number of packets, the result will converge arbitrarily close to the expected answer.
Implementations of this method for biophotonics generally use a common core algorithm
which operates assuming that ballistic photons travel in straight lines through regions of piecewise-
constant optical properties until scattered, absorbed, reflected, or refracted. In this model, a
propagating photon is described by a position p, and a direction d. The scattering and absorp-
tion process, called “hop, drop, spin”, was originally proposed (but not so named) by Wilson
and Adam in 1983 [69]. Prahl et al in 1989 [57] refined the algorithm with the addition of
roulette and anisotropic scattering, and an open source implementation (MCML) was given by
Wang et al [44]. An overview is given below; for greater detail, the reader is directed to the
original MCML paper which gives a thorough treatment.
Launch
A photon packet is first randomly launched (assigned a position and direction) into the tissue
from a source distribution. For isotropic sources, the direction unit vector is randomly chosen
from the unit sphere. If the source is directed then the direction is simply some constant d0.
Likewise, the position may be a constant point or start randomly distributed over a line, area,
or volume.
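The isotropic case can be sketched as follows (a standard sampling method, shown here as an illustration rather than FullMonte's source):

```python
import math, random

# Illustrative isotropic launch: cos(theta) uniform on [-1, 1] and phi uniform
# on [0, 2*pi) yield a direction uniformly distributed over the unit sphere.
def isotropic_direction(rng=random):
    cos_t = rng.uniform(-1.0, 1.0)
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    return (sin_t * math.cos(phi), sin_t * math.sin(phi), cos_t)
```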
Hop
Interactions with the material, whether scattering or absorption, are modeled assuming the
Beer-Lambert law. Consider a rectangular prism of area A and thickness ds which contains
particles of cross-section σ with number density ρ moles per unit volume. Now look at a path
through the prism normally incident on that face. If the path is chosen using a uniform random
distribution over the face, it has a probability σ/A of hitting any one particle in the box. Since there are n = ρNA V = ρNA A ds independent randomly-distributed particles in the volume, it has a probability (1 − σ/A)^n of hitting exactly none of them. However it is generally a valid assumption that the particles are spaced sparsely enough that their probability of overlapping within the prism slice of thickness ds is zero. In that case, the probability of no interaction is just (1 − σ/A)^n ≈ 1 − nσ/A, which could also be derived by a binomial expansion for small σ, so

Pr (Interaction in ds) = 1 − Pr (No interaction) = nσ/A = σρNA ds = µ ds    (2.8)
The quantity µ = σρNA has units of reciprocal length (here cm⁻¹), and is called the coefficient of scattering (µs) or absorption (µa). Eq 2.8 defines a differential equation for the CDF of the step length before interaction S, the solution of which is exponential with parameter µ (denoted here S ∼ Eµ):

Pr (S ≤ s) = F(s) = 1 − e^(−µs)    (2.9)
Pr (s ≤ S < s + ds) = f(s) ds,  where f(s) = F′(s) = µ e^(−µs) = µ(1 − F(s))    (2.10)
That distribution has mean 1/µ. A photon will therefore travel on average 1/µs before being scattered or 1/µa before being absorbed, in a medium containing only scatterers or absorbers respectively. To combine them, we note that by definition scattering and absorption are independent, so their probabilities within a given infinitesimal length ds are additive. By the properties of the exponential distribution, the parameter becomes µt = µs + µa, whose reciprocal 1/µt is the Mean Free Path (MFP), the average distance traveled before scattering or absorption. Once the photon has an interaction, the probability it was scattered is:
Pr (Scatter in [s, s+ds)) / Pr (Interaction in [s, s+ds)) = µs (1 − F(s)) / ((µs + µa)(1 − F(s))) = µs / (µs + µa) = α    (2.11)
which gives a mathematical definition for albedo α which was introduced in Sec 2.3.2 as a
material property.
When modeling photon propagation using MC, we need to draw a step length from an
appropriate distribution. To generate an exponential random step length s ∼ Eµ we can use the
standard technique of drawing a uniform random variable u and transforming it by the inverse
exponential CDF:
s = F⁻¹(u) = −ln(1 − u)/µ ,  u ∼ U(0, 1)    (2.12)
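In code, the inverse-CDF draw is a one-liner (illustrative sketch):

```python
import math, random

# Illustrative: sample an exponential step length with mean 1/mu via the
# inverse CDF s = -ln(1 - u)/mu, with u uniform on [0, 1).
def step_length(mu, rng=random):
    u = rng.random()  # uniform on [0, 1), so log(1 - u) is always defined
    return -math.log(1.0 - u) / mu
```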
It is important to recall that the material properties µs and µa apply only within the current
region, so before completing the step we must ensure that the photon has stayed within the
region. To do so, we check whether the ray (p, d) intersects a region boundary at a distance less than s. If not, then the packet position is updated to p′ = p + sd, the “hop” phase is complete, and
the process moves on to “drop”.
When there is an intersection, consider the distribution of s for a ray passing through one
layer of thickness T (interaction coefficient µ1) into another with a different coefficient µ2. The CDF is just
F1(s) until it exits the first layer, which it does with probability 1 − F1(T ). From there on, it
travels an additional s′ = s− T according to the distribution for the second material.
F(s) = F1(s)                               s ≤ T
F(s) = F1(T) + (1 − F1(T)) F2(s − T)       s > T        (2.13)
Substituting the CDFs into the second case above (writing s1 = T for the distance traveled in the first material), we get

F(s) = (1 − e^(−µ1 s1)) + e^(−µ1 s1)(1 − e^(−µ2 s′))    (2.14)
     = 1 − e^(−µ1 s1 − µ2 s′)    (2.15)
But the step was originally drawn using a unit-exponential variate u, so that its interaction probability 1 − e^(−u) satisfied µ1 s = u; we must therefore set s′ = (u − µ1 s1)/µ2 to preserve the step probability. That expression has a special case when µ2 = 0 in transparent media (air, glass), so it is convenient here to introduce the dimensionless step length l, which is scaled so that it has a unit-exponential distribution regardless of the material.
l = s µt    (2.16)
F(l) = 1 − e^(−l)  ⟹  l ∼ E1    (2.17)
It is more convenient to draw l ∼ E1 and track the remainder l′ = l − sµt as the photon moves through materials. When needed, the physical step length can be calculated from its definition in Eq 2.16, or taken as infinite if µt = 0.
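This bookkeeping can be sketched as follows (names are illustrative):

```python
import math

# Illustrative dimensionless-step bookkeeping: a single draw l ~ E1 is spent
# as l' = l - s*mu_t across each material crossed; the remainder converts back
# to a physical length in the current material (infinite if mu_t == 0).
def remaining_physical_step(l, traversed, mu_t_current):
    """traversed: list of (distance, mu_t) segments already crossed."""
    for s, mu_t in traversed:
        l -= s * mu_t
    if mu_t_current == 0.0:
        return math.inf
    return l / mu_t_current
```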
Interface
If the photon encounters a boundary and that boundary is an interface (a change in refractive
index from ni to nt) then it may either reflect or have its angle to the normal refract from
incidence angle θi to transmitted angle θt. Snell’s Law states that
sin θt = (ni/nt) sin θi    if sin θi ≤ nt/ni    (2.18)
for refraction, or total internal reflection (TIR) occurs otherwise. Even when TIR does not
occur, Fresnel reflection may still apply. Since the simulation does not track polarization, it
assumes that the two polarizations (s,p) relative to the surface are equally probable, and hence
that the reflection coefficient is the average of the two reflection coefficients (Rs,Rp) given by
Fresnel.
R = (Rs + Rp)/2 = (1/2) [ |(ni cos θi − nt cos θt) / (ni cos θi + nt cos θt)|² + |(ni cos θt − nt cos θi) / (ni cos θt + nt cos θi)|² ]    (2.19)
Given a Fresnel reflection probability R, the event of photon reflection can be modeled as a Bernoulli random variable BR. If the ray reflects at the interface due to Fresnel or total internal reflection, then the “hop” step must advance the ray to the intersection point q and reflect its direction d about the unit surface normal n:

p′ = q    (2.20)
d′ = d − 2(d · n)n    (2.21)
l′ = l − |q − p| µt    (2.22)
If on the other hand it transmits and the mean free path 1/µt differs in the material being entered, then the physical step size must be updated using Eq 2.16.
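The reflection update can be sketched with the standard mirror formula (assuming a unit surface normal n):

```python
# Illustrative specular reflection: d' = d - 2(d . n)n for a unit normal n.
def reflect(d, n):
    dot = sum(di * ni for di, ni in zip(d, n))
    return tuple(di - 2.0 * dot * ni for di, ni in zip(d, n))
```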
Drop
At the conclusion of the “hop”, it is time for the photon to have an interaction with the material.
Recalling Eq 2.11, the photon has probability α ∈ [0, 1] to be scattered, and 1−α to terminate
through absorption. In the simplest formulation, this process can be simulated using a Bernoulli
random variable b ∼ Bα that returns 1 with probability α and 0 with probability 1 − α. The
packet would then drop its energy at the interaction site p and terminate if b = 0. If the energy
absorbed per unit volume is of interest, the amount dropped is accumulated in an array to form
part of the result. Otherwise if b = 1, the photon continues onwards.
A very common optimization originally proposed by Wilson and Adam [69] changes this
description somewhat by combining multiple photons into a packet that travels together, but
behaves identically in the expected sense to the simple case just described. While individual
photons must either be absorbed or terminated, the packet does both proportionally in such
a way as to keep the correct expected value. Each packet has a continuous weight which can
be thought of as an expected proportion of photons which would remain after following the
same path as the packet. Suppose a packet launched with N0 photons currently contains N = wN0 photons, and an interaction leaves N′ of them.
w′ = (1/N0) E[N′] = (1/N0)(αN + (1 − α) · 0) = αw    (2.23)
∆w = w − w′ = (1 − α)w    (2.24)
Instead of having either a scattering or an absorption event, the packet deposits weight
(1−α)w and has its weight decreased to αw. By allowing a packet to survive multiple absorption
events, the probability that the path traverses regions remote from the source is increased,
providing greater resolution in such regions. It also allows for some economy of computation
since multiple photons can share the calculation of a single hop length, intersection test, and
scattering event.
In an absorbing medium, the packet will continue to weaken, thus adding less and less to
the results with each absorption event but never actually reaching zero. Only when the packet
exits the medium does it cease to need further computing. For some geometries, this could
take a very long time, requiring extensive computation to add infinitesimal accuracy to the
model results. To avoid this problem, MCML introduced the random termination of weak
packets, called “Russian roulette”. When a packet’s weight becomes less than a minimum
value wmin, it is given a 1-in-m chance of surviving with weight mw. This process ensures that
weak packets, which do not contribute significantly to the output sum, are terminated without
violating conservation of energy in the expectation as shown below:
E[w′] = 0 · Pr(die) + mw · Pr(live) = 0 + mw · (1/m) = w    (2.25)
Termination of weak packets provides a balance between higher simulation accuracy in areas that receive very low fluence, versus the computational cost of obtaining that additional accuracy. The parameter wmin sets an energy threshold below which m weak (w < wmin) packets are effectively bundled into a stronger packet requiring 1/m times as much computation to trace. The side effect of this change is that instead of w being deposited randomly over m fluence bins at each step, mw is deposited into one, causing quantization noise in the lower-fluence bins. Further investigation and discussion of this trade-off are presented in Chapter 3.
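Roulette itself is a few lines (a sketch of the scheme described above):

```python
import random

# Illustrative Russian roulette: a weak packet survives with probability 1/m,
# boosted to m*w, so the expected weight E[w'] = w is preserved (Eq 2.25).
def roulette(w, m=10, rng=random):
    return m * w if rng.random() < 1.0 / m else 0.0
```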
Spin
Surviving photons then undergo a spin process to simulate the effect of scattering on their
direction. Generally, the scattering interaction can be characterized as a uniform azimuthal
angle φ around the incoming direction, and a deflection θ. The Henyey-Greenstein (HG) phase
function is often used for the deflection component [57], since it has a convenient parameter
g = E[cos θ] to express the anisotropy. Note that g = 1 always implies no scattering, since E[cos X] = 1 if and only if X ≡ 0 (mod 2π). When g = 0 is used as the parameter for the
HG function, the cosine of the deflection angle is uniformly distributed on [−1, 1], sending
equal amounts of energy in all directions (equivalently the outgoing direction is statistically
independent of the incoming). Generally, biological tissues fall in the range 0.8 ≲ g < 1 [11].
The inverse CDF for the Henyey-Greenstein function is shown below, facilitating generation of
appropriately-distributed scattering angles given a uniform random number.
cos θ = (1/(2g)) [ 1 + g² − ((1 − g²)/(1 − gq))² ] ,  q ∼ U(−1, 1)    (2.26)
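Sampling from the inverse CDF can be sketched as follows (with the g = 0 special case handled separately):

```python
import math, random

# Illustrative Henyey-Greenstein deflection sampling via the inverse CDF;
# for g = 0 the cosine of the deflection angle is uniform on [-1, 1].
def hg_cos_theta(g, rng=random):
    q = rng.uniform(-1.0, 1.0)
    if abs(g) < 1e-6:
        return q
    t = (1.0 - g * g) / (1.0 - g * q)
    return (1.0 + g * g - t * t) / (2.0 * g)
```

Averaging many samples recovers E[cos θ] = g, a useful sanity check.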
In the original formulation, Prahl et al [57] proposed calculating the new direction of travel
d′ given d, θ, φ as:
d′x = (sin θ / √(1 − dz²)) (dx dz cos φ − dy sin φ) + dx cos θ    (2.27)
d′y = (sin θ / √(1 − dz²)) (dy dz cos φ + dx sin φ) + dy cos θ    (2.28)
d′z = −sin θ cos φ √(1 − dz²) + dz cos θ    (2.29)
which can be rewritten as
d′ = d cos θ + sin θ (b cos φ − a sin φ)    (2.30)
Further deconstructing, it can be shown that a and b are two auxiliary unit vectors orthogonal to the direction of travel. Geometrically, these form an orthonormal basis for the azimuthal plane (normal to the direction of travel), which facilitates selection of a random vector in that plane using angle φ. The first, a, is formed by taking the cross-product of the direction with the z-axis unit vector k and normalizing. The second auxiliary vector is formed by crossing the direction with the first auxiliary as follows:
a = (d × k) / |d × k|    (2.31)
b = d × a    (2.32)
It can be verified by substitution that Eq 2.31-2.32 and Eq 2.30 result in the original
formulation. Once the azimuthal vector is found, the post-scatter direction is found by rotating
the incoming direction by θ towards it. FullMonte uses an alternative way of arriving at the
same formulation, as discussed in greater detail in Sec 3.5.4.
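The auxiliary-vector construction above can be sketched as follows (assuming a unit direction not parallel to the z-axis):

```python
import math

# Illustrative spin update: build the azimuthal basis a = (d x k)/|d x k|,
# b = d x a, then rotate d by theta towards the azimuthal vector chosen by phi.
def scatter(d, cos_t, phi):
    sin_t = math.sqrt(1.0 - cos_t * cos_t)
    dx, dy, dz = d
    norm = math.sqrt(dx * dx + dy * dy)          # |d x k| for unit d
    a = (dy / norm, -dx / norm, 0.0)             # first auxiliary vector
    b = (dz * dx / norm, dz * dy / norm, -norm)  # second auxiliary (d x a)
    return tuple(di * cos_t + sin_t * (bi * math.cos(phi) - ai * math.sin(phi))
                 for di, ai, bi in zip(d, a, b))
```

The result remains a unit vector with d · d′ = cos θ, which can be verified directly.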
Implementation     Method-Geometry   Abs  Aniso  Refr  Voids  TR   Acceleration
MCML               MC Planar          Y    Y      Y     Y     Y
tMCimg             MC Voxel           Y    Y                  Y
CUDAMC             MC Semi-inf             Y      Y           Y    GPU
CUDAMCML           MC Planar          Y    Y      Y     Y          GPU
GPU-MCML           MC Planar          Y    Y      Y     Y          GPU
NIRFAST            FEM Tet            Y                            Approximation
TIM-OS             MC Tet             Y    Y      Y     Y     Y    SIMD (auto), MT
MMCM               MC Tet             Y    Y      Y     Y     Y    SIMD (auto), MT
MCX                MC Voxel           Y    Y                  Y    GPU
FBM                MC Planar          Y    Y      Y     Y          FPGA (1x)
FullMonte (SW)     MC Tet             Y    Y      Y     Y     *    SIMD (man), MT
FullMonte (HW)     MC Tet             Y    Y            Y          FPGA (1x)
FullMonte (HW*)    MC Tet             Y    Y      *     Y     *    FPGA (4x)

Table 2.2: Comparison of existing simulators with key features: geometry, absorption scoring, anisotropy, refraction, non-scattering voids, time-resolved data, and acceleration methods. FPGA (Nx) = FPGA with N instances per chip; MT = multithreading; SIMD = Intel SSE instructions, automatic or manual optimization; an asterisk indicates planned future work.
2.5 Existing Implementations
There are a number of existing implementations, summarized by key features in Table 2.2 and
discussed in greater depth below. The FullMonte software version is the most customizable,
fastest, and (except for time-resolved output) most full-featured of all implementations. The
FPGA implementation described in this thesis is still faster, with a 3x performance advantage
over software and an architecture designed to increase that further to 12x while adding feature
support.
2.5.1 MCML
MCML, introduced by Wang et al [44], was one of the first widely-used Monte Carlo simulators
for turbid media. It accepts a planar slab geometry with a normally-incident pencil beam.
Since it is a Monte Carlo simulator, it is able to model scattering, absorption, anisotropy (using
the Henyey-Greenstein phase function), reflection, and refraction at boundaries. Extended
sources may be modeled as a convolution of simulation results, but the fundamental limitation
to normally-incident light remains so variations have been developed by researchers as needed.
2.5.2 tMCimg
One of the first open-source voxelized MC solvers is tMCimg [6], which was developed to model
the scalp, skull, and brain for DOT purposes. Since the application uses probes in contact with
the scalp and does not have large refractive index mismatches, the boundary roughness imposed
by a voxelized approach is not significant. Only in the event of refractive index mismatch is
the surface normal required for purposes of computing reflection or refraction. Binzoni et al [4]
describe some of the drawbacks of representing curved interfaces using voxels. It is also worth
noting when making performance comparisons that the implementation of tMCimg uses a single
thread of execution, owing to its development at a time when multi-core computers were rare.
It also does not use vector instructions which can provide significant performance increases
over non-vectorized code. Modifying the software to use the multiple cores available on modern
processors should not be difficult and would yield approximately N times better performance on
N cores (or even > N in the case of simultaneous multithreading (SMT)) based on experience
with FullMonte.
2.5.3 CUDAMC
Alerstam et al [2] present CUDAMC, which is a GPU-based specialization of MCML which
records time-resolved diffuse reflectance. It uses a homogeneous, semi-infinite, non-absorbing
model and produces time-resolved output. Reflection and refraction from the interface are
modeled. In comparison against a single-threaded CPU implementation of the same code, they
report a performance increase exceeding 1000x.
GPU computing provides very high floating-point throughput. Since there is but
a single homogeneous slab, all optical properties and geometry are global constants. Further,
since the material is non-absorbing there is no absorption to score or roulette calculation to
perform. As such, this result should be regarded as an approximate upper bound on the
acceleration available: the calculation never has to stall to fetch geometry information from
memory, and never has to access memory to record absorption so it is entirely compute-bound.
While it does access memory for the output histogram, that operation is quite rare (at most
once per packet) compared to scattering, step length generation, and intersection testing which
may happen hundreds of times.
2.5.4 CUDAMCML
With CUDAMC as a special-case subset, Alerstam et al [2] also present CUDAMCML, which
is a complete implementation of MCML for the GPU. The authors claim speedup on the order
of 100x, against the original relatively unoptimized single-core CPU implementation of MCML.
The performance reduction from CUDAMC (1000x) is notable, since the problem is nearly
identical in terms of calculation. There are a small number of planar slabs to be stored instead
of just one material set, though the memory size and bandwidth requirements thereby imposed
are not significant. Drawing step lengths, random number generation, and scattering remain
identical. Intersection checking also remains nearly identical, though instead of z > 0, the
condition becomes zi−1 ≤ z ≤ zi, i ∈ (0, n). What changes (and significantly so) is the need
to read, accumulate, and write one fluence value each time an absorption event happens. The
resulting memory bandwidth demand is the primary culprit for the order-of-magnitude decrease
in speedup.
2.5.5 GPU-MCML
A recent (2009) work by Lo [48], and later Alerstam and Lo [1] called GPU-MCML uses a
modern NVIDIA “Fermi” GPU to achieve up to 600x speedup relative to single-core CPU-based
MCML. The performance improvements over CUDAMCML are incremental, primarily due to
caching of the area immediately around the source, and all of the inherent model limitations of
MCML remain.
2.5.6 NIRFAST
Dehghani et al [19] use the diffusion approximation to formulate the problem on a tetrahedral
mesh using the Finite Element Method. The resulting system of sparse linear matrix equations
is solved using Matlab, and is freely available in a package called NIRFAST (Near Infrared
Fluorescence and Spectral Tomography). Tetrahedral meshes are used in a wide variety of
applications so they benefit from broad support in Matlab and other libraries for generation,
manipulation, and visualization. Likewise sparse matrices occur in many fields and thus benefit
from the wide availability of quality software code for their solution as well as many hardware
acceleration efforts. However, the model has significant limitations which prevent its use in
certain conditions. Most notably, the diffusion approximation breaks down in the presence of
weak scattering, strong absorption, and changes in refractive index.
2.5.7 TIM-OS
Prior to creation of FullMonte, the fastest tetrahedral mesh-based Monte Carlo simulator was
TIM-OS by Shen and Wang [60]. It uses the “hop, drop, spin” technique and related variance-
reduction techniques found in MCML but adapts them to a tetrahedral mesh.
In their paper, the authors of TIM-OS note that it is slightly faster than MCML on identical
problems (where a mesh is generated to represent infinite planar slabs). The performance
increase is likely due to superior performance tuning and the aggressive optimizations of the
Intel C Compiler, since the tetrahedral method inherently requires more arithmetic operations.
2.5.8 MMCM
Fang [22] presents an alternative to TIM-OS with substantially similar features. One difference
is that MMCM permits shapes other than tetrahedrons to be used in the mesh, but no benefit is
conclusively demonstrated, though there is an additional cost in complexity and performance. In
general, a polytope can be represented as a union of tetrahedra [39] so the additional complexity
adds no new capability. The performance of the code is slower than TIM-OS so it is not a
primary focus for comparison.
2.5.9 MCX
Fang and Boas [24] created MCX, which is a GPU implementation of the tMCimg algorithm
and therefore subject to the same assumptions and limitations. Compared to a single-core
CPU running tMCimg, MCX was shown to be 75-300x faster depending on options and the
specific problem. The option which most impacted run time was whether or not to require
atomic memory accesses. When disabled, some of the photon weight is lost due to memory
race conditions in which two separate GPU threads read the fluence accumulator value, each
separately adds a value, and then both write back. The second write overwrites the first, and
the value it added is lost. The authors demonstrate that the proportion is generally small, and
argue that it can be safely neglected for their test cases.
2.5.10 FBM (MCML on FPGA)
The first use of Field-Programmable Gate Array (FPGA) custom digital logic for acceleration
of biophotonic simulations was done by William Lo [47]. FBM implements MCML subject to
limitations on the number of layers (5) and the size of the absorption grid (200x200). Significant
gains in performance and energy-efficiency were demonstrated, with a 65x gain reported in
performance-per-power ratio, and a 45x gain in performance (single-core CPU vs single-FPGA).
Enhancements of the present work over Lo’s work include use of a more general geometry
model, and improvements in performance. Implementation of a tetrahedral model requires more
storage, more memory bandwidth, and more calculation. However, the hardware presented can
be taken as a proof-of-concept and an indication of the possible performance and power gains.
The performance gains should be treated carefully, though, as they were measured against a non-optimized single-threaded CPU implementation. Most importantly, SIMD vector instructions
were not used in the reference case so the processor could be capable of better performance.
2.6 Computing Platforms
With the end of clock frequency scaling, computer engineers can no longer rely on applications
automatically running faster year-over-year. The power and cooling cost of large-scale comput-
ing has also become an issue of concern recently. As a result, interest has increased in alterna-
tive computing platforms to achieve high performance in compact form factors and reasonable
power budgets. Different platforms present vastly different abstractions to the programmer,
along with different implementation tools, and a correspondingly wide range of architectural
tradeoffs. In seeking to accelerate Monte Carlo simulations for turbid media, three candidate
implementation platforms were identified: traditional CPU software, Graphics Processor Units,
and custom logic.
2.6.1 Central Processing Units (CPU)
Traditional Central Processing Units (CPUs) which form the core of computers are laid out by
the manufacturer and arrive fully fixed in their function. The CPU provides an instruction set to
the programmer, which can be used to implement the desired functions. The flexibility available
to the programmer is simply the sequence of instructions and data fed to the processor. This
von Neumann model [31] of computing has proven successful over the years due to its generality,
flexibility, and relative simplicity to program. Fundamentally, the paradigm for CPUs is for
a central data-processing unit to move data in from storage, execute a series of operations,
and move it back into storage. Significant amounts of energy and silicon area are expended on
moving data rather than actual calculation.
With the end of clock frequency scaling but continued scaling of transistor size, CPUs
now boast an increasing number of available cores and an ever-increasing set of specialized
instructions. Since even basic computers now come with two or four cores, it is no longer
reasonable to ignore multi-threaded programming when looking for performance. Likewise, use
of vector instructions is an important consideration for extracting peak performance [58].
The FullMonte software model presented here therefore uses both techniques to achieve its
performance advantage over other simulators.
2.6.2 Graphics Processor Units (GPU)
Graphics Processor Units (GPUs) have been used recently to accelerate computation. Origi-
nally designed to meet the needs of drawing graphics, they are optimized for highly-repetitive
operations and to provide extreme memory bandwidth. In contrast to CPUs, which have a small
number of very fast, flexible, and highly-tuned compute engines that can each operate independently,
GPUs rely on massive parallelism, with hundreds or thousands of simpler computing
elements. The cost of this simplicity is that each core operates far more slowly, and groups of
cores share scheduling logic, meaning they must execute the same program in lock-step. For
applications which are floating-point intensive and have significant data parallelism,
i.e. perform the same operations on many different contiguous pieces of data, GPUs can
offer significant performance increases.
2.6.3 Field-Programmable Gate Array
What CPU and GPU computing share is the paradigm of expressing computation as a sequence of steps, which
is a natural way for a human programmer to describe a solution. Field Programmable Gate
Arrays (FPGAs) are a form of programmable digital logic which implement spatial computing
through a configurable layout rather than a sequence of instructions. Fundamentally, an FPGA
is an array of fine-grained processing elements including memory blocks, arithmetic blocks
(usually offering variations of multiplication and/or addition), state elements (registers), and
programmable logic, connected by programmable connections. The name “field-programmable”
derives from the ability of FPGAs to be reprogrammed (“re-wired”) a nearly unbounded number
of times simply by reloading the bitstream, which takes under one second. The program or
bitstream specifies what functions the elements are to perform, and how they are to be wired
together. This reprogrammability allows state elements and compute elements to be intermixed,
permitting data to be stored closer to the location where it is processed. As a result, less energy
may be expended on moving data. Some commercially successful results showing performance
and power-efficiency increases for financial Monte Carlo applications are presented in a white
paper by Altera Corp [13].
On the extreme other end of the programmability spectrum are Application-Specific Inte-
grated Circuit (ASIC) and fully custom silicon devices. Such devices typically cost in the tens
or hundreds of millions of dollars to design and test, with the advantage of extremely low unit
cost and very high performance and power efficiency once running [33]. Development times and
risk are also correspondingly much higher. Clearly a very large production run is necessary to
justify the investment. FPGAs offer a middle ground between ASIC/full-custom and more tra-
ditional instruction-set (CPU/GPU) processing. Despite significant programmability overhead
compared to ASICs [41], significant power savings are still possible over CPU/GPU systems
without incurring the extreme engineering cost and risk.
Problems with a significant degree of pipeline parallelism, involving large chains of dependent
computations, tend to benefit from FPGA acceleration. Because the device program is a spatial
layout rather than a temporal sequence of instructions, it is possible for such computations to be
laid out such that outputs feed directly to dependent inputs and are located nearby. Keeping
connection lengths short saves power since shorter connections are easier to drive, and also
permits high performance since shorter links are faster. This minimizes the device area and
energy necessary to move data to where it is needed. More general instruction-based compute
models like CPU and GPU expend a very large amount of energy getting the data from memory,
cache, and registers to the compute units. Those compute units are also fixed in number and
position, which involves a degree of overhead if the application’s needs do not match the device
provided. When designing an FPGA bitstream, the available fixed-position state and compute
components may be connected in such a way as to provide just the right amount of each
computational resource and to locate just enough state elements nearby.
Chapter 3
Software model
This chapter introduces the FullMonte software simulator and highlights its important features.
3.1 Design choices
The preceding chapter presented an overview of existing solution techniques and software im-
plementations for the simulation of light propagation in turbid media. Given the large diversity
of options, the following goals were decided on to guide the present design:
1. Give correct results across many material properties (anisotropy, refractive index, albedo,
scattering)
2. Accommodate complex geometry
3. Be highly optimized for speed, running faster than any other simulator of equivalent
generality
4. Use only free and open-source tools and libraries
5. Be sufficiently flexible to incorporate new light source types easily
6. Make full use of parallel hardware and specialized functions available to the CPU
7. Offer the user and programmer a wide range of options for gathering output data
8. Offer the programmer a wide range of code instrumentation and profiling options
9. Incur no performance overhead for data or profiling features that are deselected
Based on the goals, a number of important high-level choices were made regarding what
type of simulator to implement and how. They include the nature of the simulator (Monte
Carlo), the geometry model (tetrahedral mesh), the programming language used (C++), as
well as related choices of programming style, tools, and libraries.
3.1.1 Monte Carlo simulation
Monte Carlo was a clear choice based on its ability to model complex geometry and the widest
variety of materials. Analytic solutions to the RTE are not known for non-trivial structures, and
the Finite Element Method is fast and simple but requires too many restrictive approximations
to be of use in the cases of interest, particularly IPDT.
As an additional benefit, MC methods are inherently very parallel because M computing
elements can be used with M different random seeds (to ensure statistical independence) with-
out any need to communicate during the simulation. At the end, the results can be summed
to produce an output with √M times lower standard deviation. Assuming the time required to
merge results after completion is much smaller than that required to generate them, this offers
a speedup very close to M times versus a single unit. Other solution techniques are not as
inherently parallel.
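The M-way scale-out described above can be sketched in a few lines. The π-estimation kernel below is a stand-in for the photon-packet tracer (all names are illustrative, not from FullMonte): each "compute element" runs the same estimator with its own seed and the results are merged only at the end.

```cpp
#include <random>
#include <cmath>

// Stand-in Monte Carlo kernel: each compute element runs this with its own seed.
double estimate_pi(unsigned seed, int n) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    int inside = 0;
    for (int i = 0; i < n; ++i) {
        double x = u(rng), y = u(rng);
        if (x * x + y * y < 1.0) ++inside;   // point falls inside quarter circle
    }
    return 4.0 * inside / n;
}

// M independent seeds -> no communication during the run; merging is a
// cheap reduction afterwards, so the speedup is very close to M times.
double merged_estimate(int M, int n, unsigned seed0) {
    double sum = 0.0;
    for (int s = 0; s < M; ++s)
        sum += estimate_pi(seed0 + s, n);
    return sum / M;
}
```

With M runs merged, the standard deviation of the estimate shrinks by √M relative to a single run of n samples.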
3.1.2 Geometry Representation
We chose a tetrahedral mesh for the geometry representation because of its ability to approx-
imate curved surfaces. Boundary-element and voxelized representations were also considered.
As previously noted, a voxelized representation is not adequate due to artifacts at curved edges
with refractive index changes.
A boundary-element representation, in which the surfaces of homogeneous regions are stored
as a mesh of triangles, is inappropriate because of the turbidity of the medium. Though it is a
common approach and yields a compact representation in computer graphics raytracing within
non-scattering volumes, the number of intersection tests required becomes excessive when used
for turbid media. Each time a packet is scattered, it changes direction and hence needs to have
a new set of intersection tests calculated. In the boundary element method each intersection
test requires fetching and checking for intersection of that ray with all surfaces that bound the
current region, which can be a large number for a complex surface. In contrast, a ray can exit
a tetrahedron only through one of the four faces, thus limiting the number that need to be
fetched and tested. When implementing the algorithm, there is a benefit in the simplicity of
having a fixed number of faces to fetch and test. The tetrahedral representation has no loss of
generality since any shape that can be represented by a triangular surface mesh can be converted
into a tetrahedral volume mesh. The resulting mesh is larger, and element boundaries are
crossed more frequently, but that is acceptable in exchange for reduced memory accesses and
computation per scattering event.
3.1.3 Tools and Libraries
It was decided that the software should use entirely free and open-source libraries and tools.
FullMonte uses the Boost open-source libraries and was compiled with the Gnu Compiler Col-
lection. TIM-OS, the other leading tetrahedral MC simulator, requires the Intel C Compiler
(ICC) and Math Kernel Libraries (MKL) which are not free. It also relies on the automatic
vectorization built into the ICC to achieve performance. The ICC’s auto-vectorization capabil-
ities are significantly better than GCC's, as reported by Fang [23] in a comparison of MMC with
TIM-OS using different compilers. That experiment showed a 1.6x speed increase from switch-
ing compilers alone. FullMonte provides superior performance without requiring proprietary
tools.
3.1.4 Programming Language and Style
C++ is a widely-used language for designing high-quality libraries and high-performance soft-
ware. It allows a number of high-level abstractions including object orientation, while still
allowing the programmer to optimize low-level features of the program. Since the Monte Carlo
simulator proposed here executes certain core functions a very large number of times, performance of these
inner loops is critical and can be optimized only if low-level calls to specific machine instruc-
tions are possible. The availability of high-quality numerical libraries (eg. for random-number
generation) is also important. Languages such as C and C++ meet these criteria.
On the other hand, significant flexibility is desirable so that the program’s functionality can
be changed easily and in a modular fashion at compile time. The C language falls significantly
short in its flexibility so C++ was chosen. FullMonte uses inlined C++ templates to allow
the programmer to alter or disable output-data gathering functions at compile time so that a
large variety of data can be collected, while paying the performance cost of only those features
selected. This design choice allows an easy upgrade path for future features, for instance time-
resolved calculation, without major alterations to the core simulator or branching the core
code.
The requirement for best-in-class performance implies that the implementation should be
designed in a hardware-aware way, involve detailed optimization where appropriate, and use
advanced processor features where possible. With the end of automatic processor performance
increases over time due to clock frequency increases, processor manufacturers are now placing
more and more computational cores on each die. To extract the full potential performance
from a modern processor, it is necessary to create a multi-threaded program which maximizes
utilization of all cores. Hence, FullMonte was designed from the beginning for multi-threaded
performance.
3.2 Design Overview
The basic simulation loop is shown in Fig 3.1, implementing the classic “hop, drop, spin”
algorithm. Multiple threads run concurrently, each launching a new packet when its current
packet retires. A thread will propagate the packet throughout the flow until it dies in roulette,
at which point the thread launches another. All threads have their own separately-seeded
random number generators to maintain independence.
To launch the packet, the launcher draws a random direction and position from the set
of sources and their parameters. Weight is initialized to one. At the moment of launch, the
enclosing tetrahedron ID is found and stored within the packet before it propagates to the hop
stage.
At the hop stage, a random step length is drawn per Eq 2.4.2 and the intersection test is
performed. If the hop terminates within the same element, the packet is passed onwards to
the “drop” stage. If instead it encounters a boundary with a material of the same refractive
index then it advances to the intersection point and tries again to complete the hop. Lastly,
and least frequently, if the boundary is with a material having a different refractive index then
the packet is passed to the interface code for testing of internal reflection, Fresnel reflection,
and refraction.
When a packet arrives at a refractive index interface, it is evaluated for total internal
reflection. If the condition proves true, then the direction is reflected through the normal,
otherwise the refracted ray is calculated since it provides information necessary to calculate
the Fresnel coefficients. Based on the incident and refracted components, the Fresnel reflection
probability R is calculated and a Bernoulli random variable BR is drawn to determine whether
the packet reflects or not. Internal reflection, refraction, and Fresnel reflection are all distinct
events in the logger, which is notified appropriately.
In the drop stage, the packet drops part of its energy. The surrounding environment is
stored as a special material ID zero. If the packet propagates into this region, the logger is
called to report an exit event. Otherwise, an absorption event is reported. Generally this will
mean that the element ID and weight dropped are placed in a queue for later merging; however,
in some cases (e.g. imaging applications) the internal fluence is not of interest and hence is not
recorded. If the weight following the drop is less than a threshold (wmin to be discussed below),
then it is sent to roulette for possible termination. Otherwise, the packet moves directly to
scattering.
If applicable, roulette is computed very simply using a Bernoulli random variable B1/m, where
a nonzero return means the packet continues. The logger is notified of a roulette loss or win as
appropriate.
When it arrives at the scatter function, random numbers are drawn and the Henyey-
Greenstein phase function is evaluated to give the scattering angles. The angles are applied
to the current direction of travel and the packet passes back to the hop stage for another
intersection test. Scattering events are also passed to the logger for possible action.
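FullMonte's actual scattering code is SSE-optimized; the scalar sketch below shows the standard inverse-CDF sampling formula for the Henyey-Greenstein phase function (the formula used in MCML-family codes). The helper hg_mean is illustrative, for checking that the sample mean of cos θ converges to g.

```cpp
#include <cmath>
#include <random>

// Samples cos(theta) from the Henyey-Greenstein phase function with
// anisotropy g, via the standard inverse-CDF formula.
double sample_hg_costheta(double g, double xi) {      // xi ~ Uniform[0,1)
    if (std::fabs(g) < 1e-6) return 2.0 * xi - 1.0;   // isotropic limit
    double s = (1.0 - g * g) / (1.0 - g + 2.0 * g * xi);
    double c = (1.0 + g * g - s * s) / (2.0 * g);
    return c < -1.0 ? -1.0 : (c > 1.0 ? 1.0 : c);     // guard round-off
}

// Sample mean of cos(theta) over n draws; should converge to g.
double hg_mean(double g, int n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double s = 0;
    for (int i = 0; i < n; ++i) s += sample_hg_costheta(g, u(rng));
    return s / n;
}
```

The azimuthal angle is drawn uniformly on [0, 2π), and the pair of angles is then applied to the current direction of travel.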
3.3 Performance enhancements
3.3.1 Multithreading
FullMonte, like some other simulators (TIM-OS, MMCM), uses a programmable number of
threads to do the computation. Each thread has its own random number generator (RNG)
[Flow chart: Launch → draw step → Hop. At a refractive-index interface the packet is tested for TIR and Fresnel reflection, then reflects or refracts; at a non-interface boundary it continues the hop; with no boundary it proceeds to Drop, then Spin (if w ≥ wmin) or Roulette (if w < wmin), in which it survives with Pr 1/m or dies with Pr (m−1)/m.]
Figure 3.1: Overview of hop, drop, spin flow
initialized with a different seed. Each independently launches photons, propagates them, and
sends event notifications to a Logger object for collection (details later).
In the default logging regime, the weight and mesh element ID for each absorption event
are placed in a thread-specific queue, similar to TIM-OS. When the absorption queue is full, the
thread locks a mutex (mutual exclusion lock) so that it has sole access to the absorption
array, and accumulates the information from the queue into the array. The locking process is
essential because if two threads access the array at the same time, they can write
conflicting data, violating the conservation of energy.
3.3.2 Explicit parallelism through SIMD intrinsics
Critical sections of code were identified from profiling information and then carefully hand-optimized
using Intel Streaming SIMD Extensions (SSE) instructions. Compiler intrinsics are
function calls that are translated directly into specific assembly instructions by the compiler.
They are embedded in source code like normal function calls and allow access to the most basic
level of machine instructions, while preserving some amount of code readability and convenience
for the programmer. SSE instructions are Intel-specific instructions that allow basic arithmetic
operations to be performed on groups of up to four numbers at a time for increased throughput.
FullMonte makes heavy use of such calls to achieve high performance for its most frequently
called operations: intersection testing and scattering.
The program uses an open-source (zlib license) library by Julien Pommier [56] that provides
fast vector math functions including sin, cos, and logarithm. FullMonte also relies on an imple-
mentation of the Mersenne Twister random-number generator by Saito and Matsumoto [59],
which generates uniform random bit sequences using high-performance Intel SIMD instructions.
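A toy example of the intrinsic style described above (not FullMonte's actual code): a single _mm_mul_ps call performs four single-precision multiplies in one instruction. This requires an x86 target; the horizontal sum here is done in scalar code for simplicity.

```cpp
#include <immintrin.h>  // SSE intrinsics (x86)

// Four-element dot product using SSE intrinsics: the building block of
// operations like the intersection test's plane evaluations.
static float dot4(const float* a, const float* b) {
    __m128 va = _mm_loadu_ps(a);        // load 4 floats (unaligned)
    __m128 vb = _mm_loadu_ps(b);
    __m128 prod = _mm_mul_ps(va, vb);   // 4 multiplies in one instruction
    float tmp[4];
    _mm_storeu_ps(tmp, prod);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];  // scalar horizontal sum
}
```

In production code the compiler inlines such a helper, so the intrinsics read like ordinary function calls while mapping one-to-one onto machine instructions.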
3.3.3 The wmin Russian roulette parameter
The wmin parameter introduced in the algorithmic description in Sec 2.4.2 also has a significant
impact on performance, which until now has not received much attention. It permits a trade-off
between faster simulation and higher output quality. MCML uses a value of 10⁻⁴, while TIM-OS
uses 10⁻⁵ and MMCM uses 10⁻⁶. All of these values can be shown to expend computing time
unnecessarily for some applications. Below, the impact in terms of both output quality (result
variance) and run time are discussed from a theoretical standpoint; detailed simulation results
are presented in Sec 5.3.4.
Performance Impact
Assume a photon packet of initial weight w traveling through a homogeneous medium with
albedo α, and let us define a new property called the material's persistence β, the number of
steps required for the packet to be attenuated by a factor of 1/e, given by β = −1/ln α.
By definition (Sec 2.4.2), the
weight remaining after i steps is wα^i. Roulette occurs when the remaining weight wα^i < wmin,
which happens after i > β ln(w/wmin). To get the least integer for which this is true, we take
the ceiling ⌈β ln(w/wmin)⌉. After that number of steps, roulette is done, in which there is a
1-in-m chance of the packet continuing with strength mw. Let T(w) be the expected number of
steps that a packet of weight w > wmin takes within a material of albedo α before losing at
roulette.
i = β ln(w/wmin) (3.1)

T(w) = ⌈i⌉ + (1/m) T(α^(⌈i⌉−i) m wmin) (3.2)
Assuming β ≫ 1 and α ≈ 1, the ceiling functions can be dropped, permitting a direct
solution at the cost of mild error.
T(w) ≈ β ln(w/wmin) + (1/m) T(m wmin) (3.3)
Substituting m·wmin into Eq 3.2 and collecting terms, a solution can be found for T(m·wmin),
which can be substituted back into Eq 3.3 again to find the value for any w, including a newly-launched
packet of weight 1:

T(1) = β ln(1/wmin) + (1/(m−1)) β ln m (3.4)
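Making the intermediate step explicit: setting w = m·wmin in Eq 3.3 gives

T(m·wmin) = β ln m + (1/m) T(m·wmin)  ⟹  T(m·wmin) = (m/(m−1)) β ln m

and substituting this solution back into Eq 3.3 at w = 1 produces Eq 3.4.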
This shows that, in the absence of exit events, the number of packet scattering events
is governed by the choice of roulette parameters m and wmin, which can be changed without
altering the expectation of the result, unlike β, which is a material property derived from the albedo.
Given the form of the equation, changes to wmin are far more significant to the outcome than
changes to m, so wmin is the primary quality-time control.
If some fraction e ∈ [0, 1] of packets do exit the medium before being terminated at roulette
at wmin, only those packets remaining in the medium are subject to increased calculation if
wmin is decreased by a factor of k, i.e. ΔT ≤ (1 − e) β ln k. The increase may be less than
predicted by Eq 3.4 because some of the packets may exit before terminating. Conversely, the
reduction in operation count from an increase in the parameter may decrease e since packets
will tend to terminate earlier.
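The expected-step-count analysis can be checked empirically. The sketch below traces packets through an idealized homogeneous medium with the roulette rule described above; the parameter values used in the test are illustrative, not taken from the thesis.

```cpp
#include <cmath>
#include <random>

// Trace packets of initial weight 1 through a homogeneous medium of albedo
// alpha with roulette parameters (wmin, m); return the mean step count to death.
double mean_steps(double alpha, double wmin, int m, int packets, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    long long total = 0;
    for (int p = 0; p < packets; ++p) {
        double w = 1.0;
        for (;;) {
            ++total;
            w *= alpha;                     // each "drop" attenuates by the albedo
            if (w >= wmin) continue;        // above threshold: no roulette
            if (u(rng) < 1.0 / m) w *= m;   // survive roulette with boosted weight
            else break;                     // packet dies
        }
    }
    return double(total) / packets;
}
```

With α = 0.99 (so β ≈ 99.5), wmin = 10⁻³, and m = 10, the measured mean is ≈713 steps, in agreement with the closed-form estimate β ln(1/wmin) + β ln m/(m−1) ≈ 713 once the dropped ceiling functions are restored.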
Output Quality (Variance) Impact
Having shown that the performance difference can be significant, we turn our attention to the
output quality difference. In Sec 2.4.2 it was shown that changes to the roulette constants do
not alter the expectation of energy and hence the accuracy of results. However, the variance of
the output is also important since it determines the uncertainty remaining after a given number
of packets is run.
Let P be a path consisting of a series of points p[i], i ∈ [0, N ] from the launch point p0 to
the arrival point pN. There can be infinitely many such paths for a given p0, pN. Along this
path, consider two different notions of weight: a physical weight w, which is the probability
that a physical photon launched from p0 arrives at pN given that it follows path P, regardless
of whether it is absorbed there; and a simulated weight W, which is a function W(w) that
may have a random component. The physical weight arriving at the end of the path must
be w = ∏_{i=1}^{N−1} α(pi), as presented in Sec 2.4.2. By physics, the energy absorbed at point x
must equal the product of fluence, infinitesimal volume, and absorption coefficient, so
(1 − α)w = Φ(x) µa dV. For the simulation output to be unbiased (correct in expectation), the
expected simulated weight must equal the physical weight.
Let the probability of the various quantities conditioned on arriving via a given path P
be called path-conditional on P. Let Pr_P be the probability of a path P being traversed,
regardless of the termination criteria, roulette, etc. The unconditional arrival probability can
be calculated as an expectation over all possible paths arriving at p.
E[W] = ∑_{P} E[W|P] · Pr_P = Φ µa dV (3.5)
But E[W|P] depends only on w, so we can define a probability density function f(w) which
gives the probability of arriving at p via any path that has physical weight w. By definition,
E[W|w] = w for the simulation to produce correct results. What is of interest is the variance
of the resulting simulated weight W collected. From probability we know that
Var[W] = E[Var[W|w]] + Var[E[W|w]] (3.6)

Var[W] = ∫₀¹ f(w) Var[W|w] dw + Var[w] (3.7)
In this formulation, the first term is the additional error injected by a termination scheme.
The second is the inherent variability in the process of randomly selecting a path to traverse.
Non-packetized propagation
In the non-packetized formulation, the photon is either alive with weight 1 or dead with weight
0. The path-conditional simulated weight is therefore a Bernoulli random variable:

W = S,  S ∼ B_w (3.8)

E[W|P] = w (3.9)

Var[W|P] = w(1 − w) (3.10)

cv(W) = √((1 − w)/w) (3.11)
The coefficient of variation above gives an intuition that the packet becomes increasingly
“noisy” or “quantized” as it becomes less probable to arrive at a given destination.
Packetized propagation without roulette
In the case where roulette is not performed, the packets will continue indefinitely unless terminated
by exiting the geometry or by other criteria (e.g. a time gate, or a maximum number of
steps).

W = w (3.12)

E[W|w] = w (3.13)

Var[W|w] = 0 (3.14)

No additional variance is introduced by the (absence of) termination criteria. However, the
computational cost is very large since all packets must be traced until they exit or are retired
due to other criteria (e.g. a time gate).
Roulette
In the roulette formulation, the photon packet weight always has a lower bound of wmin since
if the packet has weight w < wmin at the end of the step it either terminates or returns with
weight mw. To arrive in the roulette formulation, the packet would have to survive r times,
where

r = max(0, ⌈ln(wmin/w) / ln m⌉) (3.15)

W = m^r w S,  S ∼ B_{m^(−r)} (3.16)
As shown below, the path-conditional expected value remains the same so there is no bias
introduced, but the path-conditional variance changes:

E[W|P] = (1/m^r) m^r w = w (3.17)

Var[W|P] = (1/m^r) m^(2r) w² − w² = w²(m^r − 1) (3.18)

cv = √(m^r − 1) ≈ √(wmin/w) (3.19)
Since the packet weight always has a lower bound, the amount of energy deposited per step
(i.e. per unit computational cost) also has a lower bound, which is advantageous. The price is
that the output variance per packet traced is increased relative to the case where roulette is
not performed, and the variance increase becomes greater as the path becomes less probable.
However, it should be noted that each packet traced takes more computing resources.
Merging this result into Eq 3.7, we find

Var[W] = ∫₀¹ f(w) w²(m^r − 1) dw + Var[w] (3.20)
with the definition of r as in Eq 3.15. The distribution f(w) is not directly observable or
calculable, though it could theoretically be simulated by taking a histogram of the weight of
all packets arriving within a mesh element. Even without a value available, it does give some
intuition. The more probability a position has of receiving a high-weight packet, the smaller
the variance increase due to roulette. If on the other hand the bulk of the probability f(w) lies in
regions where w ≪ wmin, then large values of r will apply and the variance will be correspondingly
increased. Such behavior is observed in the simulation results and discussed in Chapter 5. By
tracking the expectation of m^r w, it should be possible to estimate the variance of each surface
and volume element in addition to estimating its mean, which is a novel capability.
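The unbiasedness and variance-inflation claims can be checked numerically. The sketch below simulates the roulette outcome for a packet arriving with physical weight w < wmin, with r computed from the definition above (illustrative code, not FullMonte's):

```cpp
#include <algorithm>
#include <cmath>
#include <random>

struct Stats { double mean, var; };

// Simulate n roulette outcomes for a packet of physical weight w below wmin:
// the packet survives with probability m^-r, carrying weight m^r * w.
Stats roulette_stats(double w, double wmin, int m, int n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    int r = std::max(0, (int)std::ceil(std::log(wmin / w) / std::log((double)m)));
    double p = std::pow((double)m, -r);         // survival probability m^-r
    double sum = 0, sumsq = 0;
    for (int i = 0; i < n; ++i) {
        double W = (u(rng) < p) ? w / p : 0.0;  // survivor carries m^r * w
        sum += W; sumsq += W * W;
    }
    double mean = sum / n;
    return { mean, sumsq / n - mean * mean };
}
```

For example, with wmin = 0.01, m = 10, and w = 0.004 (so r = 1), the sample mean converges to w = 0.004 while the sample variance converges to w²(m^r − 1) = 1.44×10⁻⁴, matching Eqs 3.17 and 3.18.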
3.4 Output Data
To address the design goal of flexibility, the main simulation loop was designed to accept a
template parameter which models the Logger concept. A logger is a class which has a method
corresponding to each of the following events, which the simulator calls when the event occurs:
• Launch
• Scattering
• Absorption
• Intersection with a material boundary
• Arrival at a refractive index interface
• Internal reflection
• Refraction
• Fresnel reflection
• Termination through roulette
• Roulette survival
By providing packet information as part of the method call, the programmer can change
the type and format of data captured by the logger without ever changing the core loop. Since
the changes are made at compile-time, any features not included do not impose any run-time
performance overhead due to efficient inlining and dead-code elimination by the compiler.
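The Logger concept can be sketched as follows; the method names and the stand-in loop are hypothetical, since the thesis does not specify exact signatures. The key point is that a no-op method on a template parameter inlines to nothing and is removed by dead-code elimination.

```cpp
// A logger that records nothing: every call compiles away entirely.
struct NullLogger {
    void absorb(unsigned, double) {}
    void exit(unsigned, double) {}
};

// A logger that counts absorption events.
struct CountingLogger {
    unsigned long absorptions = 0;
    void absorb(unsigned, double) { ++absorptions; }
    void exit(unsigned, double) {}
};

// Stand-in for the core simulation loop, templated on the Logger type so
// the event calls are resolved (and possibly eliminated) at compile time.
template <class Logger>
void trace_packet(Logger& log) {
    for (int step = 0; step < 100; ++step)
        log.absorb(/*element*/ 0, /*weight*/ 0.01);
    log.exit(0, 0.0);
}
```

Instantiating trace_packet with NullLogger yields the same machine code as a loop with no logging at all, which is how deselected features incur zero run-time cost.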
3.5 Profiling information
In general, a computer algorithm consists of data movement and computation, both of which
take time and device resources. Understanding the performance of an algorithm implementation
requires understanding both aspects and their interaction. Due to the flexible design of the main
loop and the logger concept, many useful pieces of profiling data can easily be acquired through
already-existing functionality.
3.5.1 Geometry Description
One of the key differences between MCML with its infinite planar slab geometry and a com-
plex tetrahedral mesh is the size of the geometry description. In the planar slab regime, the
entire geometry description for n layers (usually ≲ 10) can be encapsulated in just 5n numbers
representing µa, µs, g, n, z. This is not the case for more complex tetrahedral representations
which can use ≈ 10³–10⁶ mesh elements, each requiring at least 4 face descriptions, each
having a 3D vector, a constant, and a pointer to the next element. Unlike MCML, the entire
description does not necessarily fit into any of a typical computer’s caches. An efficient and
compact geometry description is therefore essential to the problem, and the ability to access it
quickly will be one of the limiting factors in performance.
To that end, profiling was undertaken using the Logger framework to understand the char-
acteristics of relevant problems. A memory profiler was created which receives notification each
time a packet moves to a new material through either a boundary event or refraction event. The
profiler stores the current tetra ID and a count of scattering events. When the packet arrives
in a new material, the logger writes the previous tetra ID and event count out to a file, then
resets the event count and stores the new tetra ID. The resulting trace is a run-length-encoded
history of memory addresses fetched for intersection testing.
Temporal Locality
Temporal locality refers to the correlation of memory addresses accessed by an algorithm over
time. Informally, it answers the question “what percentage of memory accesses refer to data
which have been accessed in the last n accesses?”. Most modern computing devices make use
of a memory hierarchy of storage devices, ranging from small fast caches closest to the com-
puting elements, to larger slower storage further away. When data is sought, the computer
first looks in its nearest caches, then searches progressively further afield only if the data is
not present. Modern CPUs [31] tend to have three levels (L1-L3) ranging from smallest/fastest
to largest/slowest before accessing main memory. Typical computer caches use a replacement
policy of storing the most recently accessed data and (if necessary) making space for it by eject-
ing the least-recently-used (LRU) data [31]. Algorithms which have temporal locality benefit
from such a cache, since it exploits the correlation in memory accesses over time. Based on the
memory traces described above, the simulator’s memory access patterns into the tetrahedron
memory were assessed for temporal locality, with results discussed in Chapter 5.
Spatial Locality
In addition to temporal locality, accesses can show an address-dependent frequency distribution.
The use of least-frequently-used (LFU) replacement in caches is well-known [7] for applications
such as web traffic and multimedia which follow a Zipf-like (power-law) distribution. The LFU
paradigm differs from LRU in that pages are evicted for being less frequently accessed
over the long term, rather than by a short-term measure of how recently they have been accessed.
Analysis to be presented later shows that a hybrid LRU/LFU cache scheme would perform best
for the simulator based on these observations.
Software was written to simulate cache accesses using the stored memory traces mentioned
above. Using a family of templated C++ classes that permit simulation of a memory hierarchy
(different sizes and types), simulations were conducted to determine the effectiveness of different
caching schemes. Because all packets are mutually independent, the statistics of the access
request stream are expected to be stationary over the long term, with short-term correlation
due to the limitation that a packet can move only to an adjacent mesh element (and has some
probability to step back after a short time).
Software does not permit explicit cache management, so this insight is not directly exploitable
when writing a software simulator. However, these data are useful in designing other
implementations, including GPU and FPGA designs, where memory access patterns and cacheability
are important to achieving high performance, particularly on FPGAs where it is possible to
implement custom cache logic. This point receives further discussion in the hardware chapter.
3.5.2 Operation Frequency
In addition to the need to move data to the compute units, the ability to carry out the calcu-
lations themselves is important for performance. To avoid premature optimization, the Logger
framework was used again to count the relative frequency of the different operations and identify
which are most critical to performance.
The frequency with which the various pipeline steps occur will dictate which has the biggest
influence on overall algorithm run time, and hence which are the best candidates for manual op-
timization. Based on operation counts, the following conclusions were drawn for the Digimouse
BLI setup:
• Intersection testing is the most frequent operation
• Scattering, absorption, and roulette are the next most frequent
• Interface-related events are very rare
It is intuitive that intersection testing should be the most frequent operation since it must
happen at least once per scattering event. It can happen more than once if the hop hits a
region boundary before completing, in which case the new element must be loaded and tested.
Scattering (“spin”), absorption (“drop”), and roulette should be equally frequent since packets
progress from one to the next, diverging only when a packet dies, which is rare.
Refractive interfaces are considerably rarer in the test cases studied, by 2-3 orders of magnitude. Describing a general shape with a tetrahedral mesh requires many tetrahedrons, so it stands to reason that each individual material region should comprise a large number of elements. Of these, only the boundary elements have faces which are interfaces, so interface-related operations should be much rarer. Biological tissues are also relatively homogeneous in refractive index, except for air cavities, so typical problems will have relatively few interfaces. Both data from a small number of test cases and intuition agree that the critical
path is composed of intersection testing and scattering, both of which were carefully optimized.
Future work should examine a broader range of problem definitions to assess the range of parameters; however, the conclusions are expected to remain qualitatively valid.
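As a sketch of the kind of instrumentation involved, the operation counting could look like the following (illustrative C++ only; the names and structure are assumptions, not FullMonte's actual Logger API):

```cpp
#include <array>
#include <cstdint>
#include <cassert>

// Hypothetical sketch of an operation-frequency counter in the spirit of the
// Logger framework described above.
enum Op { Intersection, Scatter, Absorb, Roulette, Interface, NumOps };

struct OpCounter {
    std::array<std::uint64_t, NumOps> count{};   // zero-initialized tallies
    void log(Op op) { ++count[op]; }
    // Ratio of one operation's frequency to another's, for profiling reports.
    double ratio(Op a, Op b) const {
        return double(count[a]) / double(count[b]);
    }
};
```

Counts gathered this way directly reveal the relative event frequencies (intersection tests slightly exceeding scatters, interface events orders of magnitude rarer) described above.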
3.5.3 Coordinate precision
Compared to other implementations, FullMonte uses a lower-precision floating-point represen-
tation (IEEE Single instead of Double). During development, assertions were added to the code
to check for effects of numerical round-off error, such as validating that the norms of unit vectors (eg. direction) remained within a reasonable tolerance of unity. No violations were found, suggesting that the double-precision values used in TIM-OS were unnecessary. Simulation results also converged to the same value regardless of precision, confirming that the additional precision
is not necessary. Switching to single-precision enabled many calculations to be done using
a single four-element floating-point Intel SSE vector instruction, instead of two two-element
double-precision instructions. This in turn had a significant impact on the instruction count
required in the inner loop. Newer processors (Intel Sandy Bridge and up) have 256-bit registers that hold four double-precision elements, so the gap will decrease. However, single precision remains useful as a way to decrease memory bandwidth requirements so that more elements may stay
resident in the cache.
While the performance benefit of using single precision vs double is not as large for newer
processors, it remains a useful finding for non-CPU implementations. FPGAs perform far faster on fixed-point computation than on floating-point, so a software validation of lower numerical precision is very useful. This reinforces results reported by Alerstam et al [2], which showed
that CUDAMC’s performance was not sensitive to the choice between float and double precision. GPUs also have far more single-precision floating-point units than double-precision units, so this finding can be applied to a GPU implementation as well.
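The kind of round-off assertion described above can be sketched as follows (an illustrative example, not FullMonte source; the rotation and tolerance are chosen purely for demonstration):

```cpp
#include <cassert>
#include <cmath>

// Illustrative sketch of the round-off checks described above: after many
// single-precision rotations, a direction vector's norm should remain within
// a small tolerance of unity.
struct Vec3 { float x, y, z; };

float norm(const Vec3& v) { return std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z); }

// Rotate about the z axis by an angle with cosine c and sine s,
// entirely in single precision.
Vec3 rotateZ(const Vec3& v, float c, float s) {
    return Vec3{ c*v.x - s*v.y, s*v.x + c*v.y, v.z };
}

bool unitWithin(const Vec3& v, float tol) {
    return std::fabs(norm(v) - 1.0f) < tol;
}
```

Even after thousands of single-precision rotations, the accumulated norm drift stays several orders of magnitude below any physically meaningful scale, which is the behaviour the development-time assertions confirmed.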
3.5.4 Spin Calculation Methods
Since the scatter event is called once per step and involves a number of mathematical operators,
it is an important target for optimization. A number of different algorithms and variants
were tested for speed using micro-benchmarks in which a single packet was repeatedly spun by pre-calculated angles θ, φ whose sines and cosines were stored in an array.
Pre-storing rather than calculating isolates the timing of just the inner loop, which is the part of interest. Repeatedly spinning the same packet minimizes the number of memory accesses required to complete the benchmark. It should also reflect the typical case when the software is running, where frequently-used values would be expected to be register-resident.
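The pre-computation strategy can be sketched as follows (illustrative only; the names are hypothetical):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Sketch of the micro-benchmark setup: sines and cosines of the deflection and
// azimuthal angles are pre-computed into a table, so the timed loop contains
// only the spin arithmetic itself.
struct SpinAngles { float cosTheta, sinTheta, cosPhi, sinPhi; };

std::vector<SpinAngles> makeAngleTable(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<float> u(0.0f, 6.2831853f);
    std::vector<SpinAngles> t(n);
    for (auto& a : t) {
        float theta = u(rng), phi = u(rng);
        a = { std::cos(theta), std::sin(theta), std::cos(phi), std::sin(phi) };
    }
    return t;
}
```

The timed loop then simply iterates over the table, spinning the same packet with each entry, so that trigonometric evaluation and random-number generation are excluded from the measurement.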
Cross Spin
To start, the original MCML spin calculation was implemented exactly as described in Sec 2.4.2
and the original MCML paper [44]. It makes no use of SIMD instructions or other hardware
optimizations.
Matrix Spin
In considering the spin formulation as originally proposed, we noted that calculating and discarding a, b requires computing a reciprocal and a square root. If these auxiliary vectors were maintained, it would save some computation at the expense of additional state information. Further, the original formulation requires a special case because it is singular when dz = ±1. Avoiding the need to check and handle this special case would be desirable.
In the original formulation used by MCML and subsequent derivatives, the new post-rotation
vectors a′, b′ were never calculated; the original a, b were calculated implicitly as part of Eq 2.27-
2.29 and then discarded. We developed a new formulation for FullMonte [9] that maintains the
auxiliary vectors a, b for use in Eq 2.30 directly, instead of discarding them and re-calculating.
The geometric interpretation is the same as in the original case above, except that the vectors
a, b are rotated along with d so that they remain orthonormal to d and may be used again.
The additional calculations required are:
a′ = a cosφ+ b sinφ (3.21)
b′ = −a sinφ+ b cosφ (3.22)
The new formulation avoids the special case where d = ±k as well as one square-root
and one reciprocal, the costs and benefits of which are discussed later in the implementation
descriptions.
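A minimal sketch of the auxiliary-vector update (illustrative, not the FullMonte implementation; signs follow the standard right-handed rotation convention) shows how rotating a and b by the azimuthal angle keeps the triad orthonormal:

```cpp
#include <cassert>
#include <cmath>

// Sketch of the auxiliary-vector update: rotating a and b about d by the
// azimuthal angle phi keeps (d, a, b) orthonormal so the pair can be reused
// on the next scattering event instead of being recomputed.
struct Vec3 { double x, y, z; };

// Linear combination cu*u + cv*v.
Vec3 combine(Vec3 u, Vec3 v, double cu, double cv) {
    return { cu*u.x + cv*v.x, cu*u.y + cv*v.y, cu*u.z + cv*v.z };
}
double dot(Vec3 u, Vec3 v) { return u.x*v.x + u.y*v.y + u.z*v.z; }

// a' = a cos(phi) + b sin(phi);  b' = -a sin(phi) + b cos(phi)
void rotateAux(Vec3& a, Vec3& b, double cphi, double sphi) {
    Vec3 a2 = combine(a, b, cphi,  sphi);
    Vec3 b2 = combine(a, b, -sphi, cphi);
    a = a2; b = b2;
}
```

After the update, a′ and b′ remain unit-length, mutually orthogonal, and orthogonal to d, which is the property that eliminates the square root and reciprocal from subsequent steps.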
The matrix spin described in Sec 2.4.2 was implemented using SIMD instructions, and outperformed the original formulation. This implementation is attractive for hardware which has a high density of multipliers but fewer other units (divide, square-root). FPGAs are exactly
such a platform since they have fast and power-efficient hard multiplier blocks. Similarly in
modern GPUs [12], there are a large number of simple cores with adders and multipliers but a
smaller number of shared, slower special function units for division and square-root. By trading
away division and square-root in favour of more multiplication, it may be possible to get faster
performance if the algorithm were to be implemented on a GPU.
SIMD Cross Spin
Subsequent to implementation of the matrix-spin algorithm above, the original “cross spin”
algorithm was further enhanced by use of SSE intrinsics leading to the fastest CPU-based
implementation of all the variants tried. In particular, substitution of square-root and division
by an explicit reciprocal-square-root instruction made a large difference.
The azimuthal vectors are formed by normalizing the cross product between the packet
direction and the k vector. This method has the advantage of simplicity since the two zero
components in the k vector reduce the number of nonzero terms in the output. It has a singular case where d ‖ k, so d × k = 0, which is handled separately. Performance enhancement over
the previous version was achieved by using a hardware approximate-reciprocal instruction in
place of a math library call. Additionally, the number of instructions was decreased by using
SIMD instructions which operate on more than one data item at once. The matrix formulation
remains in use in the hardware version, though, to shorten latency and make use of plentiful
hardware multipliers.
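A scalar sketch of the azimuthal-vector construction, including the singular case, is shown below (illustrative only; the production version uses SSE intrinsics and the hardware approximate reciprocal-square-root instruction rather than a library call):

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { float x, y, z; };

// a = normalize(d × k), where k = (0,0,1); the two zero components of k leave
// only two nonzero terms: d × k = (dy, -dx, 0).
// Singular case d ∥ k (dx = dy = 0) is handled by picking an arbitrary unit
// vector perpendicular to k.
Vec3 azimuthalVector(const Vec3& d) {
    float mag2 = d.x*d.x + d.y*d.y;
    if (mag2 == 0.0f)                     // d parallel to k
        return Vec3{1.0f, 0.0f, 0.0f};
    float r = 1.0f / std::sqrt(mag2);     // hardware uses an approximate rsqrt
    return Vec3{ d.y * r, -d.x * r, 0.0f };
}
```

Replacing the divide-and-square-root pair with a single approximate reciprocal-square-root instruction (refined if needed by a Newton step) is what produced the large speedup noted above.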
3.5.5 Intersection Testing
One small but significant optimization over previous implementations was a change to the storage of normals within a tetrahedron. Instead of storing one normal vector per four-element SIMD register, the coordinates were gathered by type: one vector each is dedicated to holding all of the x, y, z, and constant-offset components of the four face normals. Doing so avoids some shuffling of the vectors otherwise necessary to compute the required dot products. Since intersection testing is the most frequently-occurring operation in the entire pipeline, the impact is not trivial. The normal vector itself is not needed except on arrival at a refractive-index boundary, which is significantly rarer. When needed, the vector can be recovered by transposing the vectors in the tetrahedron definition.
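The layout change can be sketched as follows (an illustrative structure-of-arrays example; the field names are assumptions, not FullMonte's):

```cpp
#include <array>
#include <cassert>

// Sketch of the structure-of-arrays layout for tetrahedron face normals: one
// array per coordinate lets all four point-plane distances be computed with
// component-wise operations, each mapping onto a single 4-wide SIMD register.
struct TetraFacesSoA {
    // nx[i], ny[i], nz[i], c[i] describe face i's plane: height = n·p + c
    std::array<float,4> nx, ny, nz, c;
};

// Heights of point p above all four faces at once; no register shuffles needed.
std::array<float,4> faceHeights(const TetraFacesSoA& t,
                                float px, float py, float pz) {
    std::array<float,4> h;
    for (int i = 0; i < 4; ++i)       // vectorizes to one 4-wide multiply-add chain
        h[i] = t.nx[i]*px + t.ny[i]*py + t.nz[i]*pz + t.c[i];
    return h;
}
```

With the array-of-structures layout, the same computation would require transposing four normal vectors before the dot products; here the data is already in the shape the SIMD units consume.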
Chapter 4
FPGA Implementation
This chapter contains detailed technical descriptions of the FPGA-accelerated implementation,
and as such necessarily contains some jargon specific to computer engineering and digital logic
design. It may safely be skipped by readers who are not computer engineering experts; the key
results, validation, and performance comparisons are all summarized in Chapter 5 and discussed
in Chapter 6.
4.1 Motivation for Hardware Acceleration
Monte Carlo simulations are inherently parallel. All iterations are independent, and the statistical uncertainty (standard deviation) of the answer declines as 1/√N, where N is the number of paths simulated through each element or detector. The simplest approach to reduce runtime is to run M simulations of N/M paths on M parallel machines with independent random number sequences and sum the results. Assuming that the time to merge the results is much less than the time to generate them, this will take ≈ 1/M as much time.
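This merge strategy can be illustrated with a toy example (not FullMonte code; the per-path tally is just a stand-in for the absorbed weight):

```cpp
#include <cassert>
#include <random>

// Toy demonstration of the merge strategy: M workers each simulate N/M paths
// with independent RNG streams; summing their tallies gives the same estimator
// as one machine consuming all M streams sequentially.
double simulate(unsigned seed, int nPaths) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    double tally = 0.0;
    for (int i = 0; i < nPaths; ++i)
        tally += u(rng);              // stand-in for per-path absorbed weight
    return tally;
}

double mergeWorkers(int M, int N) {
    double total = 0.0;
    for (int m = 0; m < M; ++m)
        total += simulate(1000u + m, N / M);   // independent seed per worker
    return total;
}
```

Because the streams are independent and merging is a simple sum, the run time divides by M while the statistical quality of the merged tally is unchanged.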
However, this naive approach quickly runs into practical limits: computing a result of equal quality M times faster requires M times as much power, cost, and space, yielding a constant per-packet cost regardless of M. Conversely, computing a result with an M-times-smaller standard deviation in the same time takes M² times the power, cost, and space, since the standard deviation declines only as the square root of the number of paths. For MC simulations of complex geometries to really “break through” into everyday use in the clinic and research lab, they need to provide better numbers of packets simulated per unit cost, space, and power in addition to time. Given that CPU processor speeds are no longer increasing at their former rate [31], processor architecture and manufacturing alone will not provide significant improvement in single-core CPU performance over time, so alternative approaches are needed.
The alternatives involve either other algorithms to solve the same problem or other com-
puting architectures. Leaving aside the question of other algorithms, which has not yet yielded
an option for the most general materials and geometries, other compute architectures are a
compelling way forward. The three options considered are listed below.
4.1.1 GPU
Currently the most popular compute accelerator on the market is the GPU, used for General-Purpose computing on Graphics Processing Units (GPGPU) [31], but GPUs were considered and rejected for this application. Their programming model divides the program into many threads, but requires that groups of threads called warps access contiguous memory in a process known as coalescing. If a group of photon packets is launched from a point source, they start within the same tetrahedral element, but as they travel they rapidly diverge due to scattering and so begin to access non-contiguous memory, which would be expected to lead to very sub-optimal performance.
For applications such as PDT in which the volumetric fluence distribution is of interest, it also
requires accumulating values over a large array shared between threads which needs expensive
atomic memory access to ensure correct results. CUDAMC (Sec 2.5.3) uses a GPU to achieve
approximately 1000x run time decrease compared to a non-optimized single-thread implemen-
tation. That algorithm however requires virtually no memory access and so could be regarded
as a hard upper limit on performance when the problem is fully compute-bound, and indeed it
is also less compute-intensive than working with a full 3D tetrahedral mesh. The authors note
that acceleration decreases by an order of magnitude when implementing the full planar-slab
MCML. Tetrahedral mesh computation requires both more memory access and more computation, so it would be reasonable to expect further significant performance decreases. Since the acceleration results are reported against a single-core CPU implementation, a GPU implementation could be expected to achieve less than an order of magnitude better run time compared to a multi-threaded CPU implementation, and less of an advantage on an energy basis.
4.1.2 Intel Xeon Phi processor
Intel’s new Xeon Phi coprocessor systems [15] offer a highly-parallel compute accelerator aimed
at competing with GPGPU while using the Intel x86 instruction set. It is an instance of
Intel’s new Many Integrated Cores (MIC) architecture, using relatively lightweight in-order
cores coupled to a mid-sized cache (256 KB/core: larger than a GPU’s, smaller than an x86 CPU’s)
and fast memory. By increasing the number of cores and available memory bandwidth, it might
offer a performance increase for this application compared to a normal x86 processor without
imposing the overhead of a GPU, specifically the requirement for memory-access coalescing and
the heavy penalty for branch divergence. Cache coherency is also a very strong advantage when
considering the need to accumulate many absorption events across cores. Counterbalancing
that, the smaller cache relative to a full x86 processor may impose a penalty due to a higher
miss rate. The total power budget is also correspondingly larger, so it may or may not be a net
improvement in power-performance terms.
Implementation of the FullMonte software simulator on such a system would be relatively low-effort due to instruction-set compatibility (including all of the hand-optimized vector parts), making it a plausible candidate for accelerating the calculation and an interesting evaluation of the new technology. Due to the recent announcement of the device family (Nov 2012) and its novelty, it has not yet been targeted for a FullMonte implementation.
4.1.3 FPGA
FPGAs, as introduced in Sec 2.6.3, are programmable logic devices which offer far greater configurability, power efficiency, and in some cases compute capability, at the cost of greater difficulty in programming. Unlike GPUs, FPGAs offer fine-grained parallelism and the opportunity to
customize the memory hierarchy for the target application. Energy efficiency is also vastly
superior on the FPGA platform, which is a desirable attribute for scaling up the computation
to handle large volumes of simulations, particularly in the context of large-scale computing or
portable systems. They are also mature devices with a proven track record of energy-efficient and high-throughput computing. The two major vendors, Xilinx and Altera, offer large-scale FPGA
devices using modern manufacturing processes (28nm) with mature CAD tools, large IP port-
folios, and reasonably similar device architectures. Of the two, an Altera Stratix V FPGA was
chosen as the implementation medium.
4.2 Design Overview
4.2.1 Hardware Platform: Altera-Terasic DE-5
The Terasic DE-5Net [63] is a development platform for the Altera Stratix V FPGA [14], a high-
end modern 28nm FPGA. The board includes a Stratix V A7 device, which is a mid-size variant
of the Stratix family designed to provide a balance of logic, memory, and DSP functions. It also
supports two large DDR3 SO-DIMM memory modules and four QDR-II+ SRAM modules for
fast random-access memory. Listing for $8,000 USD, it is a common platform for prototyping
FPGA projects. As will be discussed later, it provides a good mix of FPGA and memory
technology for scaling up FullMonte to higher performance. The proposed scale-up architecture
would use all of the memory features just listed as well as nearly all of the available DSP
resources on the FPGA.
4.2.2 Implementation Language: Bluespec
FPGA designs are typically implemented either by writing Register-Transfer-Level (RTL) hard-
ware descriptions (VHDL or Verilog being the most-used languages), or by using High-Level
Synthesis tools. RTL design tends to be very laborious, verbose, error-prone, and to result
in code which is difficult to adapt to new contexts (new FPGA devices or new applications).
HLS tools often greatly restrict the method of expressing the problem and/or lead to inefficient
device resource usage due to excessive abstraction of important device details. A number of
commercial [35] and academic [8] tools start from recognizable sequential languages such as C
and Matlab or explicitly-parallel instruction-based languages like OpenCL. While some, particularly the Altera OpenCL [13] compiler, have shown success in a few applications, we judged the efficiency and flexibility given up by using HLS tools based on software programming languages to be excessive.
Choosing between the two traditional options poses a difficult dilemma between convenient
design but low performance on the one hand, and a difficult, tedious, error-prone process on
the other. The FPGA implementation of FullMonte used a third option: a new commercial
HLS tool called Bluespec and its related language Bluespec System Verilog (BSV), which take
a radically different approach from both RTL and other HLS systems. Derived from functional
programming languages which have a primarily academic heritage, the language makes a strong
distinction between (pure) functions whose return value is a function only of its explicit inputs
(ie. for the same input, it always gives the same output), and actions which may read and write
state elements. A quick introduction to the language and its core concepts is provided in the
book BSV By Example [53], while the BSV Reference Guide [5] provides a detailed language
reference.
Choosing a relatively new and unfamiliar language over “traditional” design methods was
a risk, but the results have justified the risk many times over: simulations ran an order of
magnitude faster, many errors were caught in the compilation stage, code volume was greatly
reduced, and code readability was enhanced. Overall, Bluespec provided a large productivity
increase throughout the design process and resulted in code that is far more maintainable
and reusable. Some highlights of the language and compiler are discussed below, with specific
references where appropriate in the detailed design description as well.
Guarded Atomic Actions
Bluespec programs consist of two fundamental elements: rules, each of which consists of a set of conditions (guards) and a set of actions that modify module state if and when the rule fires; and state elements (eg. registers, memories), which are modified by actions.
Conditions can be specified explicitly (do X if Y) by the programmer, or can be derived
implicitly from other conditions within the rule (do X, where X is only permitted to happen if
Z). Based on the program source, the compiler evaluates conflicts between the effects of rules
and generates a scheduler which decides what rules should fire when. By analyzing the conflicts,
the scheduler ensures that no two rules whose side effects are incompatible (eg. both writing
the same register) fire together. At each clock cycle, the scheduler evaluates the conditions
(implicit and explicit) for every rule, and determines which are permitted to fire. Based on
the assigned priorities and conflicts, it then selects which rules to fire within the cycle. While
this sounds like additional overhead, the same analysis must be done manually by a programmer to write correct RTL code. The compiler lifts this burden, and also reports errors if the program specification appears to be ambiguous or infeasible.
Each rule is atomic, which means that its actions execute entirely or not at all: if any part
of it is not able to execute due to conflict, the scheduler will not permit the rule to fire. Instead
of having to derive the scheduling logic for each state element manually, the programmer can
think in terms of what actions have to occur in what situations. The compiler then takes care
of making sure that the actions are attempted only when they are permitted, and that no two
rules fire which conflict. An oft-used example is the FIFO block provided in the Bluespec IP
libraries. If a rule involves enqueuing a value into the FIFO, that rule automatically carries
the condition that the FIFO is not full. Even better, if two rules must enqueue values into
the FIFO, it will warn the programmer to make a priority decision if they can conflict. Best
of all, though, suppose it is necessary to modify a working program so that under yet another
condition a value is enqueued into the same FIFO. That would be as simple as writing the new
rule and specifying its priority relative to the other two; no modification of the other rules (in
fact of any existing code at all) or manual rewriting of scheduling logic is necessary because the
compiler does it all.
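As a loose software analogy (written in C++, since BSV's scheduling happens at compile time and cannot be reproduced directly in sequential code), a guarded enqueue rule behaves like this:

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Loose C++ analogy of a guarded atomic action (illustrative only): enqueuing
// into a bounded FIFO implicitly carries the guard "FIFO not full", so the
// rule simply does not fire when the guard is false.
struct BoundedFifo {
    std::queue<int> q;
    std::size_t capacity = 0;
    bool notFull() const { return q.size() < capacity; }
};

// Fires atomically iff the implicit guard holds; returns whether it fired.
bool enqueueRule(BoundedFifo& f, int value) {
    if (!f.notFull()) return false;   // guard false: rule does not fire at all
    f.q.push(value);                  // action executes entirely or not at all
    return true;
}
```

In Bluespec, the programmer never writes the `notFull()` check: the compiler derives it from the FIFO's interface and folds it into the generated scheduler.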
Strong Typing
In Bluespec as in Haskell, the language is strongly typed and uses a type class system. All
expressions must have a type, and any type conversion must be explicitly requested by the
programmer, unlike in C, Matlab, or Verilog. While that may sound less convenient, several useful consequences follow. First, since each expression’s type is statically and unambiguously known at compile time, variables can be defined from other variables without explicitly
specifying their type (eg. “let x = ...” where the type of the RHS need not be stated by the
programmer). Second, there exist signed and unsigned versions for each length of bit vector
so common Verilog errors due to implicit extension, truncation, and sign conversion do not
happen; the programmer must ask for all of those conversions. Third, types may belong to
type classes for which groups of functions are defined. General functions can be defined which
take arguments of any type that belongs to a given type class. For instance, one could define twice_sum(x, y) = 2 * (x + y), which would then work for arguments x, y of any type that is a member of the Arith# class defining the basic arithmetic operations. Type classes provide polymorphism similar to but distinct from C++, since no direct inheritance of data members is
necessary. This convenience does not stop with functions: hardware modules too can be parameterized by type. Such parameterization drastically cuts down on “boilerplate” code for commonly-used design patterns including testbenches and module wrappers.
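A rough C++ analogue of the twice_sum example (C++ templates are not type classes, but the effect for this small example is similar):

```cpp
#include <cassert>

// Analogue of a function defined over the Arith# type class: this template is
// constrained only by the availability of the arithmetic operators it uses,
// so it works for any such type with no inheritance relationship required.
template <typename T>
T twice_sum(T x, T y) { return 2 * (x + y); }
```

As in Bluespec, no common base class is needed; any type supplying `+` and multiplication by an integer qualifies.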
Higher-Order Functions and Modules
Due to its heritage from functional languages and particularly Haskell, Bluespec allows functions
and hardware modules to be passed as arguments to other functions and modules. Three good
examples are given later, one for queueing of random numbers in Sec 4.3.1, one for simulating
imported Verilog modules in Sec 4.3.8 and another for test-bench creation in Sec 5.1.
Compiled Simulation
When running Monte Carlo simulations that may involve thousands of paths, each requiring
thousands of arithmetic operations, simulation speed is a significant factor in debugging pro-
ductivity. The Bluespec compiler can compile BSV code into a cycle- and bit-accurate C++
version which runs very quickly using the provided Bluesim simulator. A very rough estimate
would place the speedup at an order of magnitude or better. The Bluespec code can also integrate with user-provided C++ code, which is useful for testing and for exploring architecture options where some functions have not been fully implemented in Bluespec.
One limitation is that existing Verilog RTL code (eg. FPGA vendor IP, including mathe-
matical functions) cannot be incorporated into the C++-based simulation. Bluespec can also
emit Verilog for simulation using a normal RTL simulator (eg. Modelsim) but that gives up
the speed advantage inherent in the C++ compilation-based approach. However, if an accurate
C++- or Bluespec-based model for the IP can be created then Bluesim can still be used. That
approach was taken when incorporating Altera IP to instantiate DSP cores.
Libraries
Bluespec also ships with a large library of intellectual property including useful primitives
like First-In First-Out (FIFO) buffers, as well as Block RAM instances. These libraries are
quite useful because they are broadly parameterizable, eg. the FIFOs are parameterizable in
terms of both type and depth. Any type which is a member of class Bits#(), ie. any type (including user-defined ones) which can be represented using a fixed number of bits, can be stored in a FIFO. As mentioned earlier, the implicit conditions on all library modules are factored into
the scheduler so no explicit checking of FIFO full/empty conditions is required. There is also
a convenient library called StmtFSM which is useful for creating finite-state machines using an
easy sub-language. It works within the guarded atomic action framework such that the FSM
state advances only when all actions within that step are able to fire.
In contrast to the flexibility described above, vendor-specific IP libraries in RTL languages
will often require regeneration using a separate tool when changing width, depth, or other
parameters so the Bluespec IP model represents a significant convenience in terms of source
code flexibility. Vendor IP libraries also put the burden on the user of ensuring that the input-port conditions are correct when using the IP. With a Bluespec library, on the other hand, the constraints propagate upstream and are incorporated into the rules for using the IP.
4.2.3 Design Limitations
A few limitations and assumptions were made to make the problem scope tractable while still
enabling useful conclusions about the feasibility and performance of a full system:
1. At most 16 distinct materials may be simulated
2. Maximum mesh size is 64k elements
3. Internal reflection, refraction, and Fresnel reflection are currently omitted
4. Only isotropic point sources are supported
The number of distinct materials is representative of typical problem sizes. Since a user
must contour the different material regions and define optical properties, 16 was seen as a
reasonable number which few simulations are likely to exceed. Maximum mesh size was limited
to 64k elements due to limited on-chip memory availability. This was sufficient to run the
“cube 5med” test set, and can also accommodate a set which covers the majority of memory
accesses in real applications (> 95% for Digimouse BLI test set).
The current system also supports only isotropic point sources. A pencil beam is trivial to support but is not currently implemented, and the extensions to line and volume sources are simple and unlikely to limit overall system performance since launching is hundreds of times less frequent than intersection testing and scattering.
To reduce the algorithm complexity for a first prototype, calculations relating to index
of refraction (internal reflection, refraction, and Fresnel reflection) were excluded. As will be
demonstrated later, interface calculations are two orders of magnitude rarer than the most com-
mon operations (intersection testing and scattering) and therefore are not a major performance
bottleneck. Inclusion of these effects will be important for application of the system to the most
general class of problems, but would not be expected to limit the performance of the overall
system.
4.2.4 Design Goals
Given the selection of FPGA as the computational platform for implementing an accelerated
MC light propagation engine, we derived a set of design goals to take best advantage of the relative strengths and weaknesses of FPGAs. Based on a high-level analysis of the algorithm, the following objectives were set:
1. Insert pipeline registers as needed to maximize clock frequency (target 250MHz)
2. Exploit pipeline parallelism by keeping multiple packets in flight simultaneously
3. Minimize latency of the inner packet loop (hop-drop-spin)
4. Achieve maximal throughput by loop unrolling in critical operations
5. Avoid floating-point operations in favour of fixed-point
6. Maximize utilization (minimize idle time) of the most resource-intensive blocks
7. Share operators for less-frequently used functions
Pipelining for Maximum Frequency
As spatial computing devices, FPGA designs are best conceptualized in terms of an intercon-
nected spatial layout of logic, computing, and storage elements. In contrast to a CPU or GPU
whose core layout is fixed at manufacturing time, specific areas of the FPGA can be dedicated
to specific operations, such as random-number generation, intersection testing, mesh storage,
etc. Instead of bringing data to the computational core, processing it, and returning it to
memory, the calculations flow through the FPGA from input through intermediate stages and
to output.
Each storage and logic element within the FPGA has a delay associated with it, as does
each link carrying data between elements. Synchronous design, in which an input may be
accepted and an output may be provided at each tick of the clock, is by far the dominant
design style for FPGAs. To ensure correctness, the clock period must be no shorter than the delay of the slowest path between storage elements, so that all elements have finished computing before their results are stored. If that condition fails to hold, then an incomplete or garbled result will be stored
and passed onwards. For long chains of operations (called a pipeline since computation “flows” through them), the maximum clock speed may become intolerably low despite the large silicon area used for calculation. FPGA designs therefore generally make use of pipeline registers to store intermediate
results instead of having all the computation happen in a single cycle. By partitioning the
total path delay into segments between storage elements, the maximum segment delay can be
reduced and hence the maximum clock frequency increased. Inputs can therefore be accepted
more frequently, giving better total throughput [68].
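The arithmetic behind this can be made concrete (illustrative numbers only, chosen to match the 250MHz target stated in the design goals):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Worked example of the pipelining arithmetic described above: splitting a long
// combinational path into register-bounded segments reduces the worst segment
// delay, which sets the maximum clock frequency.
double fmaxMHz(const std::vector<double>& segmentDelaysNs) {
    double worst = *std::max_element(segmentDelaysNs.begin(),
                                     segmentDelaysNs.end());
    return 1000.0 / worst;   // 1 ns minimum period corresponds to 1000 MHz
}
```

For example, a single 12 ns combinational path limits the clock to roughly 83 MHz, while splitting it into three balanced 4 ns segments allows 250 MHz, at the cost of two extra cycles of latency.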
The clock-frequency increase from pipelining does not come for free, however. If a function expresses a recurrence ai+1 = f(ai, . . .), the subsequent value ai+1 cannot be calculated until ai is available, which takes C clock cycles if C registers have been inserted in the path. The present Monte Carlo simulation is just such a case, since a packet’s position after step i + 1 depends on where it was at step i.
Pipeline Parallelism
If a fixed sequence of steps needs to be applied to an input, then those steps can be laid out
in order with each feeding its successor. On the further condition that each element flowing through the pipeline is independent, i.e. that the path of packet i has no dependence on packet j (∀ i ≠ j), they may be computed in arbitrary order or in parallel. When many independent items run through a similar set of steps in parallel, there exists pipeline parallelism. In this case, the sequence is almost fixed, with the exception of some branches as depicted in Fig 4.2.6.
After a complete hop-drop-spin cycle, the packet repeats the process starting with drawing a
step length. The length of time (latency) for a single packet to complete a loop is not inherently
important; the throughput to calculate a large N (millions) of packets is what matters. In
this sense, Monte Carlo simulation is ideal for FPGAs because it involves simulation of many
independent sample paths. There is no dependency of the state or path between packets so
abundant pipeline parallelism exists.
To exploit pipeline parallelism, then, it is necessary to keep at least C packets in the pipeline
if the loop latency is C. Since packets are independent, a new packet may be launched at any
time, which provides a simple but effective way to guarantee there is always a packet being
provided to the draw step-hop blocks: any time there will be a “bubble” (idle time) in the
pipeline, a new packet is launched to fill it.
Latency
In general, each of the C packets being processed will be located in a different tetrahedron
whose definition must be readily available to complete the step computation. When scaling up
to larger problem geometries where it is not possible to keep all of the geometry in a single
storage location, the packet-loop latency will determine the number of tetrahedrons which must
be kept readily available in a local cache. Since caching is relatively expensive in terms of area,
energy, and complexity, minimizing the cache size required to serve the elements in progress by
minimizing latency is an important factor for ultimate performance and scalability. Introducing pipeline latency in a computation also requires inserting delay elements to align the delays of all elements of the packet state.
Optimization of the design requires a delicate balance between adding pipeline stages where
appropriate to increase clock frequency, while reducing latency where possible to reduce cache
and state-storage requirements. Latency can be reduced by running independent computations
in parallel (eg. the weight update due to absorption and the direction update due to scat-
tering). This design also implements some strategies for “hoisting” latency out of the main
loop, either by operator strength reduction or by pre-generating random numbers which are
data-independent.
Loop Unrolling for Throughput
The design is targeting maximum achievable throughput for packet computation for a given
area. Unrolling a loop by factor R increases throughput by R, reduces latency by R, decreases
control complexity, and increases area by R. For instance, the tuple (ab, cd, ef) could be
computed as ab, cd, and ef in three steps using a single multiplier, which would
have a latency of 3 and a throughput of one third (a new output tuple is produced every third
cycle) for area cost 1. It could also be unrolled so that three multipliers calculate in parallel for
latency 1 and throughput 1 but area cost 3. Throughput per area is kept roughly constant, but
latency and complexity are both reduced. A latency decrease is desirable as argued above, and
a decrease in control complexity makes it easier to achieve high clock frequency. All statically-
indexed loops on the critical loop should therefore be unrolled as far as possible.
Fixed-Point Computation
FPGAs natively support fixed-point multiplication and addition with “hard” optimized fixed-
function DSP blocks that are fast, plentiful, and power-efficient. The incremental cost of
supporting floating-point operations is quite high due to the need for additional logic to shift
the operands compared to fixed-point. Full support of the IEEE single or double standards also
requires handling of special cases such as infinity and not-a-number, which add logic complexity.
Special cases are avoided by careful construction of logic to ensure that divide-by-zero and
other pathologies never occur. The additional complexity of floating point is not necessary in
this application because the ranges of all variables are bounded. All spatial coordinates lie
within a bounded mesh-description range; directions are unit vectors which bounds the size of
their components; sines and cosines of angles are similarly bounded; and packet weight remains
always in the range of [wmin, 1]. Given bounded ranges, the only question that remains is how
many bits to allocate to each such that the increment ε between steps is sufficiently small.
For a Monte Carlo simulator, the expected result is correct so long as the expectation of
each step is the correct value, which means that no step introduces bias. A properly-chosen
fixed-point representation should not apply any bias, preserving correctness although additional
variance may be added due to quantization. Up to a point, the quantization noise should be
dominated by other sources of randomness in the system, and even exceeding that threshold
the variance may be overcome by running additional iterations so there exists a natural tradeoff
between area and required number of iterations to achieve a target variance level. By reducing
precision, the silicon area required is decreased (which also correlates to a clock frequency
increase) while the number of packets required to achieve identical variance increases.
Maximal Utilization of Expensive Blocks
When a large amount of device area is allocated to a specific function, we wish to maximize
the proportion of time that it is active. Given a system clock frequency fs and reciprocal
throughput T (cycles per accepted input), the block can perform at most fs/T computations per
second, provided it receives an input every time it is able to accept one.
In scaling the design up to multiple instances, it will be important to consider matching
the density of each functional block type to its relative frequency in the computation. If for
instance an interface happens once in ten hops, then it would be sensible to instantiate ten hop
cores sharing one interface core if the control and queueing costs are not excessive.
4.2.5 Data Representation
The bit widths for the most important data structures are given in Table 4.1. All quantities
are fixed-point, using element widths of 9, 18, 27, or 36 bits where possible, which fit naturally
into Altera hard DSP blocks and block RAM units. Fixed-point was chosen due to the presence
of definite bounds on all variable ranges.

Data item            Bits        Range       Precision   Comment
2D unit vector       2x18 (36)   ±1          8e-6
3D unit vector       3x18 (54)
3D position vector   3x18 (54)   ±8 cm       0.6 µm
Dimensionless step   18          0-63        1.2e-4
Physical step        18          0-63 cm     1.2 µm
Packet weight        36          0-1         1.5e-11
Absorbed weight      64          0-2e8       1.5e-11     200M absorptions per element before overflow
Tetrahedron ID       20          0-1e6                   3x more than Digimouse
Material ID          4           0-15
Interface ID         8           0-255                   Number of distinct material combinations at an interface
Packet               294                                 3x 3D unit vector, weight, 3D position, material ID, tetra ID, dimensionless step remaining
Tetra definition     404                                 4x adjacent tetra ID, 4x4x18 face normals & constants, material ID, 4x interface IDs

Table 4.1: Core FPGA data structures for packet, geometry, and material representation

The use of 36 bits for packet weight and 64 bits to accumulate absorbed weight are both
conservative. The weight is always at least wmin, so the smallest increment which can be
deposited is (1 − α)wmin, or approximately 10^-8 for an albedo of 99.9% and wmin = 10^-5.
Keeping full precision in the weight accumulator ensures there will be no roundoff error, and its
width ensures that the accumulator can handle at least 2^(64-36)/(1 − α) ≥ 10^9 absorption
events per element in the worst case, where the packet arrives with unit weight at α = 0.8.
For step lengths, the worst-case interaction coefficient from Cheong [11] is approximately
3000 cm⁻¹, which yields an average step length of 3 µm. By setting the resolution of both
position and step length several times finer than this value, the probability of any given step
getting “stuck” at a given position due to truncation is acceptably low.
The problem description size that can be handled by the data structure (although not
accommodated in on-chip memory) is 1M tetrahedra, 16 materials, and 256 distinct interfaces
(material pairs which are adjacent in the mesh).
4.2.6 Packet Loop Description
At a high level, the packet flow is implemented as presented in Fig 4.1. By inspection, the
intersection-test stage must be the most frequently occurring since it is the only step which
is involved in all cycles of the data flow graph. It is also among the most computationally
intensive (eight 4-element dot products), hence keeping it near 100% utilization is critical to
maximizing performance per unit area.
The boxed region in the figure depicts the drop, roulette, and spin stages which are actually
implemented in parallel to reduce latency. As shown in Sec 2.4.2, the packet will on average
pass the drop stage a large number of times before expiring at roulette. Since roulette does not
alter the position or direction of the packet, the finish-drop-roulette and finish-drop-spin edges
may be merged, so the spin is always executed speculating that the packet continues. If it does,
the result is available with lower latency. When the packet eventually loses, it terminates and
the effort to calculate the speculative spin result is wasted. However since the probability of
termination is on the order of 1% or lower, speculation is generally productive and the possible
savings available from avoiding the cost of mis-speculation are not worthwhile.
4.3 Design Details
4.3.1 Random Number Generation
To produce a set of U01 random numbers, a fully-parallel implementation of the TT800 “Tiny
Twister” (a variant of Mersenne Twister) of Matsumoto and Saito [59] was created. The
Mersenne Twister RNG was chosen because it is a high-quality random number generator with
very long period that uses only bitwise operations which are easily and cheaply implemented
on an FPGA. The original software version which was used as a template and for validation
produces a sequence of 32-bit integers from an 800-bit state vector.
The implementation used here updates all 800 state bits in parallel at a rate that can exceed
500MHz, providing a pseudo-random bit stream at up to 400 Gbit/s with negligible resource
cost. A smaller implementation would suffice, but is not worth the effort for the trivial cost
savings. To produce numbers with a particular statistical distribution, the uniform random
numbers feed a distribution function which manipulates them into the appropriate form.
Randqueue block
MC algorithms by nature require several streams of independent random variables with various
distributions. They are calculated by transforming one or more U01 random variables, which
requires some latency L to compute. Since these must be random and independent of the data
being processed, there is no input data dependency when creating the random variables. A
natural conclusion of this is that the distributed random variables can be computed in advance
and queued so they are ready immediately when needed, supporting the latency-minimization
objective by hoisting the latency out of the inner loop.
A random-number queue was devised which wraps the distribution function and a FIFO
queue of length L + 1. To initialize, L + 1 random numbers are drawn, fed to the calculation
engine, and the results are placed in the queue. When the last is complete, the queue signals
that it is ready to provide random numbers. When a value is subsequently drawn from the
distribution output queue, a new uniform random number is drawn and fed to the calculation
[Figure: block diagram — stages Launch, Draw step, Tetra lookup, Hop, Interface, Finish step,
Drop, Spin, Roulette, Dead]
Figure 4.1: Block diagram for FPGA implementation, with stages requiring random numbers
shaded; the boxed group is actually a single block but is expanded to show packet flow; see
Fig 5.5 for event frequency details
engine. After L cycles, the result is enqueued thus ensuring that there is always a result
available.
Implementing this design pattern in Bluespec was very simple. A random distribution
function is expressed as a module that has a port with type signature ServerFL#(in_t, out_t, lat):
it takes an input of type in_t and outputs a result of type out_t after lat cycles. Other ports are
permitted for use in configuring the distribution, gathering usage statistics, or for other pur-
poses. The input type must be convertible to bits (expressed by membership in the Bits#()
typeclass) so that it may be fed from a U01 RNG. Sample BSV code showing how to draw an
exponential random variable using Randqueue is given in Fig 4.2.
In some cases, multiple different parameters are used with a particular distribution (eg.
differing g values for the Henyey-Greenstein function) but the number of parameters are small
(n ≤ 16 materials). For those cases, a RandqueueMulti block allows a distribution with n
different parameter values to share a single calculation engine feeding n different queues. When
a number is drawn from queue i, a request is issued to the calculation engine with a random
number and the i-th parameter value. On completion, the new distributed random number is
placed back in the queue. The RandqueueMulti module as written can be used for distributions
with any parameter type param_t (including tuples, structures, etc.), with any random number
generator and any latency, without altering a single line of its definition. This is one example
of code composability and reuse in Bluespec.
Bernoulli Distribution
The Bernoulli distribution B_p returns 1 with probability p and 0 with probability 1 − p, corre-
sponding to the “success” or “failure” of an event. Where p = 2^-i, the variable can be created
by the bitwise AND of i random bits. For convenience, the roulette parameter m was chosen
to be 16, so i = 4. In the case where p is not known in advance (eg. Fresnel reflection) or
p ≠ 2^-i for any positive integer i, a U01 random number r is drawn and 1 is returned if r < p.
Uniform 2D Unit Vector
Several techniques exist for creating random numbers uniformly distributed around the unit
circle in R2. Such a vector can be characterized solely by the angle ψ measured clockwise from
the x axis, so a direct method involves drawing a random angle ψ ∼ U0,2π. Since latency is not
a concern and the direction vectors use a fairly low-resolution (18b) fixed-point representation,
the CORDIC [66] algorithm was chosen. It uses only comparisons, bit shifts, and additions to
compute trigonometric functions digit-by-digit which is ideal for use in an FPGA. A special
implementation of the CORDIC algorithm computes v(u) = (cos 2πu, sin 2πu) so that a uniform
random number u ∼ U01 can be used directly, saving a multiplication by 2π at the input. The
algorithm also exploits symmetry between the quadrants of sine and cosine.
// Tiny Twister 800b parallel RNG
let rng <- mkTT800;

// Wires (just like Verilog wires) to transmit random numbers
Wire#(Bit#(19))  rnd_step  <- mkWire;
Wire#(UInt#(18)) rnd_angle <- mkWire;

// on every clock, draw a random 800-bit number and send the lower 19 bits on wire rnd_step
rule drawStepRandom;
   let rnd800 <- rng.get;
   rnd_step  <= rnd800[18:0];
   rnd_angle <= unpack(rnd800[36:19]);
endrule

// instantiate a module to compute the log of a 19-bit number
let logfcn <- mkLog;

// pass the wire and the distribution function module to the RNG queue
// NOTES: 1) latency is implicit in the type of logfcn which is not shown here
//           (programmer doesn't even need to know to instantiate)
//        2) rnd_step doesn't have to be a wire; could be any module (incl user-defined)
//           in the ToGet#() typeclass

Randqueue_ifc#(UInt#(19)) rq_steplen <- mkRandqueue(toGet(rnd_step), logfcn);

// Now create a random-number queue for a 2D unit vector using a random input to sincos

let cordicCalc <- mkSinCos;
Randqueue_ifc#(UVect2D_18) rq_unitvector2d <- mkRandqueue(toGet(rnd_angle), cordicCalc);

// Draw and display numbers when available
rule showIt;
   let s <- rq_steplen.get;   // implicit condition here:
                              // rule can only fire if a number is available
   $display("At time ", $time, " drew a step of length ", s);
endrule
Figure 4.2: BSV example showing use of Randqueue to queue up random numbers
3D Unit Vector
Creating an appropriate uniform distribution over the unit sphere in R^3 is slightly more com-
plicated. A naive algorithm using spherical coordinates v = (1, θ, ψ) with θ, ψ ∼ U_{0,2π}, for
instance, does not give a correct distribution. If however cos θ ∼ U_{−1,1} and ψ ∼ U_{0,2π}, then a
correct distribution can be formed as shown below. In that formulation, cos ψ and sin ψ can be
calculated as a 2D unit vector as above, and sin θ = √(1 − cos²θ). The fact that sin θ ≥ 0 always
is not a problem, since all terms containing sin θ also contain either (but not both of) cos ψ or
sin ψ, which are symmetric around 0. In the FullMonte formulation, the auxiliary vectors a, b
are needed, and can be calculated directly from the sines and cosines above as follows:

d = (cos θ, − sin θ cos ψ, sin θ sin ψ)    (4.1)
a = (sin θ, cos θ cos ψ, − cos θ sin ψ)    (4.2)
b = (0, sin ψ, cos ψ)    (4.3)
Other techniques exist, such as rejection-sampling points x ∼ U_{[−1,1]³} in the cube until
‖x‖ ≤ 1, followed by normalization to get a unit vector. The current implementation was
chosen for its simplicity, predictable throughput, use of hard multipliers, and avoidance of
special functions (division, square root). This module is a candidate for resource reduction,
since new packets are launched fairly rarely, so the current fully-unrolled implementation offers
far more throughput than necessary.
Exponential Distribution
The CDF and ICDF of the exponential distribution E_µ with parameter µ (mean µ⁻¹) are

F_µ(x) = 1 − e^(−µx) = F_1(µx)    (4.4)

F_µ^(−1)(x) = −(1/µ) ln(1 − x) = (1/µ) F_1^(−1)(x)    (4.5)

The entire family of exponential distributions with different parameters µ can be generated
by appropriate scaling of the unit exponential F_1(x). To economize, the distribution used in
hardware actually computes F_{ln 2}^(−1)(x) = −(1/ln 2) ln(1 − x) = −log₂(1 − x), since it is
easier to compute for binary numbers. The constants k_{t,i} = µ_{t,i}/ln 2 are stored so that the
correct step lengths can be derived from the base-2 dimensionless step length.
The base-2 logarithm is calculated using

log₂(2^i (1 + x)) = i + (1/ln 2) ln(1 + x)    (4.6)

First the number of leading zeros is counted to find i, then a Taylor series for log₂(1 + x),
0 ≤ x < 1, is used for the remaining digits.
Henyey-Greenstein Phase Function
To calculate the Henyey-Greenstein function for the scattering deflection angle, the ICDF is
calculated using the formula of Eq 2.26. At its output, the HG function provides cos θ and sin θ
so that scattering can be accomplished by just multiplication and addition once the direction
vector and auxiliary vectors are provided. For more efficient hardware calculation, the Henyey-
Greenstein ICDF for cos θ can be partitioned into material-dependent constants k₀, k₁ and
functions of the random variable:

k₀ = (1 + g_m²) / (2 g_m)    (4.7)

k₁ = (1 − g_m²) / √(2 g_m)    (4.8)

cos θ = k₀ − (k₁ / (1 + g_m u))²,  u ∼ U_{−1,1}    (4.9)

which gives the desired equation (Eq 2.26). The sine is calculated from √(1 − cos²θ), and its
sign does not matter since it is the component in the azimuthal plane, which is controlled by a
uniform random vector.
4.3.2 Photon launch
Only isotropic point sources are supported in the current implementation, though a diversity of
sources could easily be added. To launch a new photon, the weight is set to unity, the position
to the position of the point source, and the direction to a randomly-drawn 3D unit vector as
described above. Since all quantities are either constant (weight, position) or drawn from the
random-number queue, this step has no latency.
4.3.3 Step length generation
Step lengths are generated in base-2 dimensionless terms (Sec 2.4.2), so only a single exponential
RV l ∼ E_{ln 2} is required. The conversion to physical dimensions is done within the
intersection-test block using the scaled parameter µ_t/ln 2. Function latency is hidden using a
Randqueue block so that step lengths are always available latency-free.
4.3.4 Tetrahedron Lookup
In the current implementation, all tetrahedrons are stored in a large array of block RAM. Each
element is 404 bits, and up to 64k tetrahedra can be stored, requiring an 11x128 array of block
RAM (1408 of 2560 blocks, 28160 of 51200 kbit of Stratix V A7 total capacity). That array size
covers the entire “cube 5med” mesh used for testing, or 20% of the Digimouse mesh. If the
most-frequently-used Digimouse elements were stored, the on-chip set would cover 95% of all
memory accesses.
4.3.5 Intersection test
All necessary quantities for the intersection tests are computed directly using multiply-add
hardware blocks. For a given ray and tetrahedron, we need to know if the ray intersects the
tetrahedron within the current physical step length. If it does, then we need the cosine of the
angle, the intersection point, and the (physical) distance. The calculation starts by finding
which face is the closest to the ray, by first finding the angle between the ray and each face,
and the height over that face:
cos θ_i = d · n_i    (4.10)

h_i = p · n_i − C_i    (4.11)
Of the four faces in a tetrahedron, a given ray can point towards at most three of them so one
can be eliminated from the comparison. For rays that point towards a given face (cos θi < 0),
the distance di to the face is given by hi = di cos θi. Since division is a long-latency operation, we
wish to avoid it where possible to minimize the number of in-flight packets at a given moment.
To find the closer of faces i and j,

d_i < d_j  ⟺  h_i / cos θ_i < h_j / cos θ_j    (4.12)

can be checked more quickly, and without division, by computing

h_i cos θ_j − h_j cos θ_i < 0    (4.13)

which can be done entirely within a Stratix V DSP block. The first two faces toward which the
ray points are compared this way, and the winner is then compared with the third to find the
face closest to the ray.
Lastly, the physical step length to the nearest face, d_i = h_i / cos θ_i, is checked against the
dimensionless length of the current step l. A similar trick is used to check whether the step
terminates inside the current tetrahedron without dividing.
4.3.6 Interface
Handling of refractive index boundaries is not currently supported, though it could be imple-
mented using largely existing blocks. The interface block is currently used only for calculating
the point where the ray meets the tetrahedron face. It contains a divider to evaluate s = h / cos θ,
and from that calculates the intersection point q = p + s d.
4.3.7 Absorption, roulette, spin, and step finish
As noted in the data flow diagram of Fig 4.1, the absorption, roulette, and step-finish stages
are merged because they operate on independent data.
Absorption
The albedos αm for materials m ∈ [0, 15] are stored in a lookup table. When the packet is
partially absorbed in material m, its weight is multiplied by the albedo so that w′ = wαm.
The difference w−w′ is computed and written to an output port of the module along with the
tetrahedron ID that the packet currently inhabits, for purposes of accumulating volume fluence.
Roulette
If the packet weight is below the threshold wmin at the conclusion of the absorption step,
then the packet is subjected to roulette (Sec 2.4.2). A B_{1/16} random variable is formed by the
bitwise AND of four bits from the random number generator. If the result is 1, then the packet
continues with increased weight 16w. The value m = 16 was chosen because it is easy to work
with using bit manipulation: multiplication by 16 is the same as a bitwise left-shift by 4 places,
and the Bernoulli random variable is easy to generate by bitwise AND. In parallel with the
roulette step, the packet speculatively continues through the spin step, since the probability
of termination is on the order of 1%.
Spin
Scattering occurs based on the Henyey-Greenstein distribution. Since there are a small number
(≤ 16) of materials, the constants k0, k1 (Eq 4.9) are stored for each material and connected to
a RandqueueMulti to hide the latency. Numerical precision was optimized as well, noting that
g ≳ 0.8 for typical biological materials in the optical window, which limits the range of many
of the components of the equation.
The Scatter function block itself is a deterministic application of the input deflection and
azimuthal angles represented by cos θ, sin θ, cosψ, sinψ to the input direction vectors d, a, b.
The matrix scattering formulation described in Sec 2.4.2 is applied directly using hard multiply
blocks to compute the scattering matrix inputs in two cycles, then multiply-add blocks to apply
the matrix to the input data in another two clock cycles.
By decoupling the calculation of the scattering angles from their application to a vector,
different phase functions can be used. Though the Henyey-Greenstein function is very common
in biophotonics, enforcing the distinction between generating the random angles and applying
them allows flexibility, improves clarity in the source code, and simplifies testing.
Step Finish
Since absorption modifies only weight, and spin modifies only direction, the packet may also
complete its step in parallel by traveling the originally-planned s units to update its position
to p′ = p + sd. This is a speculative step assuming the packet survives.
4.3.8 Altera DSP Primitives
In order to extract maximum performance from the Stratix V FPGA, we opted for explicit
instantiation of Verilog IP cores. Neither Verilog nor BSV-generated Verilog resulted in correct
inference of the hard-block multiply-add operation, so the synthesized hardware used several
times more DSP units than necessary. It appears that the issue is with Altera Quartus synthesis
rather than the Bluespec Compiler. However, there is an issue within the Bluespec compiler
wherein signed multiplications require more DSP units than expected. It can be worked around
by explicit instantiation of a DSP core or by calling out to a suitable Verilog module. Some such
issues remain at the time of writing as noted in the results discussion, but they can be fixed
given some time. Since Bluesim cannot handle BSV-Verilog mixed-language simulation, we
also created functionally-identical blocks in BSV for the DSP cores (signed multiplication and
dot-product). Large random test vectors were used in Modelsim to ensure exact correspondence
between the behavioral and RTL models.
4.3.9 Mathematical operators
The bulk of the operations in FullMonte are multiplication and multiply-add. All operations
which are not were implemented in Bluespec using standard algorithms, avoiding the use of
floating-point IP which tends to require more area and DSP units. A custom base-2 logarithm
module was written to exploit the low precision requirement (18b) and to avoid use of floating-
point IP. As previously discussed, a CORDIC-based sine-cosine module was written in BSV
both as an exploration of Bluespec for numerical algorithms, and to use an input range of [0, 1)
instead of [0, 2π), saving a multiplication. A digit-by-digit (CORDIC-like) square-root module
was also implemented for generation of 3D unit vectors, since latency is not a concern due
to the Randqueue structure and it had a fairly small logic footprint. There is also a module
calculating √(1 − x²) using a Taylor series, for obtaining sine given cosine or vice-versa. While
not tightly-optimized or thoroughly examined, we believe the use of custom modules instead of
floating-point vendor IP resulted in net DSP-unit savings and a good opportunity to evaluate
Bluespec for simple numerical algorithms. There remains room to tweak the speed and area of
the numerical cores, however they are not performance-critical at the moment.
Chapter 5
Results
We present two implementations of the FullMonte algorithm here. The first is a highly-
optimized C++ implementation using multi-threading and Intel SSE intrinsics to achieve high
performance on a standard CPU. Run-time requirements for a variety of scenarios are presented
for an Intel Sandy Bridge quad-core CPU with SMT¹ providing eight logical cores. The other
is a custom digital-logic implementation written in Bluespec SystemVerilog for an FPGA. We
did not create a physical realization; however we fully validated and synthesized the design and
are confident that the results presented here can be realized in functional, accurate hardware.
Non-trivial additional effort would be required to support data transfers from the host to the
device without gaining any additional insight into the core of the problem, so it has been deferred
to future work. We have also skipped implementing refractive index boundaries (Fresnel
and total internal reflection, refraction), though later discussion will demonstrate that has little
impact on the conclusions drawn. To demonstrate correctness, we did bit- and cycle-accurate
(identical-to-hardware) simulations using the Bluesim hardware simulation environment. Al-
tera's Quartus II program was used to synthesize the design, producing area, speed, and power
results for a current high-end 28nm FPGA, the Altera Stratix V A7 device (speed grade C1,
fastest available).
The balance of this chapter is divided into five sections. First, we demonstrate the cor-
rectness of both the software and hardware implementations by internal consistency checks and
external comparison with another simulator. Next, we show results from profiling tools built
into the FullMonte software simulator that identify the operations and memory accesses that
are most critical to performance. The software and hardware performance each receive one sec-
tion of detailed discussion, followed by the presentation of an innovative hardware architecture
to scale up to larger meshes and higher performance.
¹Simultaneous Multi-Threading, the sharing of one physical core by multiple execution threads. Intel brands
this “HyperThreading”.
5.1 Validation
Our validation strategy uses three parts: unit testing; internal online consistency checks in-
cluding assertions and conservation of energy; and external checks against a reference simulator
(TIM-OS). The validation of software and hardware are presented below in parallel since they
use the same techniques and concepts. We validated the FullMonte software simulator first
against the existing TIM-OS software simulator (Sec 2.5.7) using its provided test suite. Then,
confident of its accuracy, we used it to evaluate the output of the hardware design.
5.1.1 Unit Tests
Both the hardware and software were extensively unit-tested to ensure correct function of
individual blocks.
To create the software model, a number of libraries were used including Julien Pommier’s
fast SSE math routines [56] for sin/cos/log, as well as Saito and Matsumoto’s SFMT Mersenne
Twister RNG implementation [59]. Use of existing libraries provided highly-optimized, easy-
to-use routines that required minimal validation. We used Octave (a Matlab-like numerical
environment) to generate, manipulate, and visualize tetrahedral meshes, and to validate the
program blocks dealing with the mesh (eg. intersection, finding the tetrahedron enclosing a
point, etc). It was also used to test the statistical distribution of RVs including the Henyey-
Greenstein function.
For the hardware implementation, all major blocks were validated individually. Deter-
ministic blocks such as logarithm (used for the Eln 2 RV generator), sin/cos calculation, divi-
sion, square-root, intersection testing, step finishing, and Henyey-Greenstein evaluation were all
tested with large numbers of random inputs and cross-validated between software, hardware,
and separate implementations in Octave. The Tiny Twister RNG was compared against the
authors’ original software implementation for ten million cycles.
5.1.2 Assertions
Assertion checks are used in both the hardware and software implementations to verify that
certain invariants hold. For instance, the packet direction vectors must be orthonormal ie.
d · d = a · a = b · b = 1, and d · a = a · b = d · b = 0. We use assertions to check these and
other properties (such as non-overflow of queues). In this case, assertion failure would indicate
that excessive roundoff error had accumulated or that the spin calculation was incorrectly
applied. Since they carry a heavy performance penalty, assertions are disabled when compiling
the software to run performance tests. In the hardware implementation, assertions are used in
simulation only and automatically removed by the compiler before synthesis.
5.1.3 Conservation of Energy
The simulator should by design follow conservation of energy: the total packet weight launched
should equal the total that was absorbed plus the total that exited the geometry. Some zero-
mean noise is introduced through the roulette process, so the amount of energy added and
removed during roulette are both accumulated as well. These statistics are gathered during
the simulation in both the software and hardware versions to verify correct operation. Both
implementations obey conservation of energy to within very tight tolerances (within 10^-8 of
the weight launched).
5.1.4 Comparison to Reference Simulators
Two other public-domain simulators (MMCM and TIM-OS, presented in Ch 2) are able to
address the same problems as FullMonte. Of the two, TIM-OS is the more full-featured and
widely-used. It comes packaged with a suite of several test cases, covering a variety of source
types, optical properties, and geometries. We used the entire test suite to validate the software
simulator, and show one detailed example below.
The test case shown below simulates BLI using Digimouse [20], a widely-used, freely-available
digital model of a mouse often employed in bioluminescence imaging experiments.
The dataset contains co-registered PET, MRI, CT, and cryosection optical images along with
an anatomical atlas created by an expert to delineate organs. The TIM-OS test set includes
one Digimouse case where an extended light source is included to model a BLI-tagged tumour.
We ran simulations using one billion packets through both FullMonte software and TIM-OS,
collecting the emittance from each surface triangle and the fluence within each tetrahedral
element. The MC technique scores energy absorbed or emitted, which is then converted to
fluence using Eq 2.3 or Eq 2.4. Since the coefficient of variation for each measurement is
inversely proportional to the square root of the number of packets recorded (a count which is
proportional to the element's area or volume and roughly correlated with fluence), the comparison
was done in terms of energy per element instead of fluence. Using fluence would unduly amplify
the variation in small elements, making the results harder to compare.
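Eq 2.3 and 2.4 are not reproduced in this chapter; the sketch below assumes the usual mesh-based MC forms (volume fluence as absorbed energy per unit volume divided by µa, surface emittance as exiting energy per unit area):

```python
def fluence_from_absorption(e_abs, mu_a, volume):
    # Assumed form of Eq 2.3: Phi = E_abs / (mu_a * V)
    return e_abs / (mu_a * volume)

def emittance_from_exit(e_exit, area):
    # Assumed form of Eq 2.4: exiting energy per unit surface area
    return e_exit / area

# 0.05 units of energy absorbed in a 2 mm^3 element with mu_a = 0.1 /mm:
phi = fluence_from_absorption(0.05, 0.1, 2.0)
```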
Figure 5.1 shows the comparison for the energy exiting the geometry for each triangular
surface patch. Each figure shows four graphs, to be read in order left-to-right and then top-to-
bottom. The first at top left shows a log-log plot of output from FullMonte (B) versus TIM-OS
(A) on an element-by-element basis, i.e. (log A, log B). Since MC models a random process, the
outputs may differ for two reasons: bias, or random fluctuation that should diminish as the
packet count (recorded fluence) increases. Convergence to tight tolerances with increasing
fluence indicates that bias is not present. The second shows the percentage difference
(B − A)/A; as expected, the elements which recorded more energy showed
a lower coefficient of variation. The bottom-left panel shows a more detailed comparison for the
top 5000 elements (either surface patches or tetrahedral volume elements), which collectively
account for over 99.9% of surface energy emitted. Figure 5.2 shows a comparison for volume
elements the same way, with the top 5000 elements covering 91.6% of absorbed energy. Both
show that the simulator results agree.
Some features of the validation graphs require explanation. Generally, one assumes Gaussian
noise when examining the variance of a process which is a combination of many random factors.
In the top-left panel of Fig 5.2, there is noticeable asymmetry for volume elements with counts
of 100 and lower. The actual distribution cannot be Gaussian since it is constrained to be positive,
which enforces asymmetry. If the arrivals were actually IID binomial, some upwards skew would
be expected for small samples. The skew is increased because the absorption events are not
independent: given that a photon is in a tetrahedron, it is likely to deposit energy there multiple
times before expiring. There is also a curve at the bottom-left corner of the top-right plot.
Packets do not propagate with weight less than wmin, so the minimum quantum of absorption
is (1 − α)wmin ≈ 0.1 · 10−5 = 10−6. Around 10−6 one can see that the error is correspondingly
quantized between ≈ 0% and −100%.
The skewness is more pronounced in the lower-left panel where error is presented on a
linear scale as a percentage of the reference (TIM-OS) value. It is worth noting here that skew
is due to the result presentation, where a factor of two could result in error of -50% or +100%
depending on which way the ratio goes. There is also a much greater density of points at the
lower values, making the variance appear relatively greater as well since the large density of
points near zero error are not distinguishable. We note that the values appear to have a zero
median and clear convergence towards zero error. As the bottom-right panel shows, the error
follows a 1/√x curve to zero within tight bounds for the highest-fluence elements.
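The 1/√x behavior is what one expects of counting statistics. A quick empirical sketch (hypothetical Bernoulli scoring, not the actual simulator) shows the relative spread of a count halving when the packet count is quadrupled:

```python
import random

random.seed(1)

def relative_sd(n_packets, p_hit, trials=2000):
    """Empirical coefficient of variation of a binomial count; it should
    scale roughly as 1/sqrt(expected count)."""
    counts = [sum(random.random() < p_hit for _ in range(n_packets))
              for _ in range(trials)]
    mean = sum(counts) / trials
    var = sum((c - mean) ** 2 for c in counts) / trials
    return (var ** 0.5) / mean
```

With p_hit = 0.5, quadrupling n_packets from 100 to 400 roughly halves the relative standard deviation, matching the 1/√x convergence seen in the bottom-right panels.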
Hardware
We validated the hardware implementation using a bit- and cycle-accurate model compiled from
the original Bluespec SystemVerilog code into C++ using Bluesim. The comparison technique
was identical to the Digimouse case above, with two exceptions. First, we used a smaller test
case called “cube 5med” because Digimouse would not fit within the on-chip memory. Second,
we ran only 1.6 million packets due to the requirement to finish in reasonable time. The
geometry is a cube made of 48,000 tetrahedra and 4,800 surface elements, with five internal
layers of differing properties (µa, µs, g, n). We altered the case to make the index of refraction
homogeneous at n = 1.0, since reflection and refraction calculations are not implemented yet.
The outputs are compared against the FullMonte software in Fig 5.3, showing convergence
towards the correct value.
Contrary to the Digimouse case, the coefficient of variation is higher on the exiting energy
than the absorbed energy. The absorption map was built using on average over 700 absorption
events per packet, whereas the surface fluence is the result of just under 1 event per packet
(a very small number were terminated in roulette). Since only 1.6 million packets were run,
the results have not had sufficient packets to converge as tightly as the software comparison.
Despite the larger surface element variance at the chosen packet count, we are confident of the
Figure 5.1: Results comparison (FullMonte software vs. TIM-OS) for Digimouse energy per surface element
Figure 5.2: Results comparison (FullMonte software vs. TIM-OS) for Digimouse energy per volume element
Figure 5.3: Validation of FullMonte hardware simulation vs FullMonte software
simulation quality because the results agree tightly when volume fluence is considered and the
output obeys conservation of energy. Since the surface triangles are faces of the tetrahedral
volumes and the volumes show correct fluence, the surface fluence should be correct as well.
To simulate hardware running 1.6 million packets took 18 hours of PC time, approximately
2600x slower than the optimized software model on the same computer. While that may appear
slow, it is actually a noteworthy result for simulation of digital hardware, given the substantial
complexity involved in modeling exact device behavior including all necessary queues. The
Bluesim model is single-core only; comparing against a single software thread reduces the
equivalent gap between the detailed hardware simulation and the highly-optimized C++
implementation to 470x. Subjectively, based on experience, it compares very favourably with
(is several times faster than) more traditional methods of RTL-level simulation.
5.2 Algorithm Profiling
In order to optimize both FullMonte implementations, we gathered detailed information on
what operations are most frequent, and the distribution of memory accesses in time and space.
Figure 5.4: Photon packet event frequency
5.2.1 Operation Frequency
There exists a huge disparity in the frequency of various operations on a packet. As shown in
Fig 5.4, intersection testing is by far the most common operation, followed by scattering. Data
are presented for three variations of Digimouse (high-albedo with 2µs, standard, and low-albedo
with ½µs), and cube 5med. Interface-related calculations place far behind in the test cases used,
as would be expected: since the finest unit of geometry is a single tetrahedron, it is likely
to take many tetrahedra to describe any more complex shape that has refractive-index
differences. Since tetrahedra without interfaces far outnumber those with interfaces,
interface-related calculations should be rare.
The same data is presented in a different way in Fig 5.5 which shows an annotated flow
diagram derived from the Digimouse test case run with profiling enabled. Each node is labelled
with the average number of times the operation occurs in a simulation, while the edges are
tagged with the probability of a packet following that edge from the preceding node.
Figure 5.5: Algorithm flow graph annotated with transition probabilities (edges) and averageper-packet operation counts (nodes) for Digimouse at standard albedo
5.2.2 Memory Access
While the CPU has a fixed memory architecture, the programmer may still alter program
sections to make optimal use of the provided hardware. When designing an FPGA implemen-
tation, there is considerably more flexibility in the types of memory used and caching schemes
employed. Compared to simpler (MCML-like) geometries, the tetrahedral model requires or-
ders of magnitude more elements, each several times larger than a layer definition in MCML.
Fast access to memory is therefore critical to performance of this algorithm on any computing
device. Using the existing logging framework, a module was created which tracks all accesses
to the mesh storage and to the absorption array. A trace analyzer was created to do statistical
analysis of the data generated from actual simulation runs.
One of the distinctive features of modern CPUs compared to other computing platforms
(GPU, FPGA) is their very large LRU (least-recently-used) cache which serves to hide the very
long latency required to access main memory. Each time a memory address is requested for
read or write, the processor checks the address to see if the memory contents are held in the
cache. If so (a cache hit), it is able to complete the request using the cache copy rather than
waiting to access main memory. Otherwise (a cache miss), it fetches the result from memory
and puts it in the cache, ejecting the least recently used item in the cache to make space.
Considerable design effort, silicon area, and power are expended to provide a high-performance
cache, in particular the logic to determine which addresses are resident and to implement the
replacement policy. Under the assumption of temporal locality, ie. that items recently used will
likely be used again soon, such a cache is highly effective.
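A fully-associative perfect-LRU cache of the kind described can be modeled in a few lines (a behavioral sketch, not hardware):

```python
from collections import OrderedDict

class LRUCache:
    """Fully-associative cache with a perfect least-recently-used policy."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # insertion order doubles as recency order
        self.hits = self.misses = 0

    def access(self, addr):
        if addr in self.entries:
            self.hits += 1
            self.entries.move_to_end(addr)        # now most-recently used
        else:
            self.misses += 1
            self.entries[addr] = True
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least-recently used
```

Replaying a memory trace through such a model at varying capacities yields exactly the hit-rate-versus-cache-size curve plotted in Fig 5.6.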
Profiling of the FullMonte algorithm on the other hand found that the algorithm exhibits
Chapter 5. Results 75
limited temporal locality. The graph at top left of Fig 5.6 shows the statistical distribution
of the number of distinct accesses before a given address is accessed again, produced by the
trace analyzer. The graph can be interpreted as plotting hit rate against cache size n for a
cache implementing fully-associative perfect LRU. If the presently-requested address is one of
the n most recently accessed, it will be resident in the cache. A cache of only the eight most
recently used (MRU) elements can serve 60% of tetrahedron requests for most cases while the
next thousand elements increase that count only marginally as illustrated by the conditional
hit rate in the top right panel. Given that the element requested is not in the first 8 elements,
its probability of being in the next thousand is quite small (5-20%). Fortunately for the CPU,
its cache is large enough to hold the entire working set (n ≈ 10^5, the far right of the graph),
so the large penalty of accessing main memory is avoided. However, the allocation of
power and silicon area is not optimal, so other devices which allocate area differently may
be expected to outperform it.
What is evident, though, is a non-uniform distribution by address. The lower-left panel
shows the hit rate when the n most-frequently-used elements are stored in the cache, instead of
most-recently. Such a system is known as a Least-Frequently Used (LFU) or “Zipf” replacement
policy [7] which should provide better results at lower cost. Given the high hit rate for a small
LRU cache, it would be attractive to use a hybrid system with a small LRU cache whose
missed requests are served by a larger LFU cache. Further simplicity could be gained from the
observation that access probabilities are stationary within a given simulation which may last
minutes, so the cache set could be chosen statically. The theoretical conditional hit rate for
exactly such a system is shown in the bottom-right panel for the Digimouse (standard-albedo)
case. Similar results are seen for all four test cases, with the hybrid scheme significantly
outperforming pure LRU. The FPGA design proposed below in Sec 5.5 exploits exactly these
characteristics in a highly-efficient customized memory system.
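A sketch of the proposed hybrid scheme: a small LRU front cache backed by a static cache holding the most frequently used addresses, with frequencies taken from the whole trace (valid because, as noted, access probabilities are stationary within a simulation). Names and the trace format are hypothetical:

```python
from collections import Counter, deque

def hybrid_hit_rate(trace, lru_size, static_size):
    """Hit rate of a small LRU front cache backed by a statically chosen
    set of the static_size most-frequently-used addresses."""
    static_set = {a for a, _ in Counter(trace).most_common(static_size)}
    recent = deque(maxlen=lru_size)   # the small LRU front cache
    hits = 0
    for addr in trace:
        if addr in recent or addr in static_set:
            hits += 1
        if addr in recent:            # refresh recency
            recent.remove(addr)
        recent.append(addr)
    return hits / len(trace)
```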
5.3 Software Performance
All experiments were performed on an Intel i7-2600K 3.5 GHz quad-core CPU with SMT
allowing eight simultaneously active threads.
Since intersection testing is the most common operation, the most bandwidth-intensive,
and the most compute-intensive, it is a reasonable first approximation of overall performance.
Within a given test case, the ratio of intersection tests to other operations is fixed, so it
serves as a proxy for overall computing effort. For the balance of this discussion, the term
Mints (Millions of INtersection Tests per Second) will be used to measure an implementation’s
performance in performing such tests. Likewise, Mabs (Millions of Absorptions per Second)
stands for the number of absorptions recorded per second. Absorption events have very low
compute intensity but result in two memory accesses per absorption (read-accumulate-write)
which need to be atomic: a non-trivial requirement for highly parallel systems. FullMonte,
Chapter 5. Results 76
Figure 5.6: Cacheability of four different test cases, showing relatively low hit rate for LRU cache at top left/right (note logarithmic scale for cache size); static Zipf cache at bottom left is better; bottom right shows L2 hit rate for two options with Digimouse (std): Hybrid (L1 LRU, L2 LFU) requires 2377 elements for 50% hit rate, while pure LRU (L1 LRU, L2 LRU) requires 8246
Digimouse: Complex mesh representative of BLI applications. Ran with high-albedo (2µs) and low-albedo (½µs) variations.
Cube 5med: A regular cube with five layers of differing optical properties (modified from the original case by setting n = 1.0 for all layers). Also ran a variant with 2µs.
Fourlayer: Thin tissue section consisting of four layers.
Half-sphere air: Non-absorbing, non-scattering half-sphere.
Half-sphere tissue: A scattering version of the above case.
Onelayer: Single thin layer of tissue with four different combinations of optical properties spanning a range of 4x in scattering and 2x in absorption.
Table 5.1: Test cases and variants used to evaluate operation complexity vs run time
like TIM-OS, uses a per-thread queue of absorption events and then locks the main absorption
array. To scale up to a very large number of cores, the serialization imposed by such locking
may become a heavy penalty.
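The queue-then-lock scheme can be sketched as follows (a simplified single-lock model with hypothetical names, not the FullMonte or TIM-OS source):

```python
import threading

class AbsorptionArray:
    """Global absorption scores: each worker queues (element, weight) pairs
    locally and flushes them under one lock, amortizing the serialization
    cost over many absorption events."""
    def __init__(self, n_elements, flush_threshold=1024):
        self.totals = [0.0] * n_elements
        self.lock = threading.Lock()
        self.flush_threshold = flush_threshold
        self.local = threading.local()

    def record(self, element, weight):
        q = getattr(self.local, "queue", None)
        if q is None:
            q = self.local.queue = []
        q.append((element, weight))
        if len(q) >= self.flush_threshold:
            self.flush()

    def flush(self):
        q = getattr(self.local, "queue", [])
        with self.lock:               # the serialization point discussed above
            for element, weight in q:
                self.totals[element] += weight
        q.clear()
```

The larger the flush threshold, the less often threads contend for the lock, at the cost of per-thread buffer memory.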
Figure 5.7 below shows that the number of intersection tests required predicts run time very
well (R2 > 99%) across a wide variety of problem descriptions derived from the TIM-OS test
suite, as summarized in Table 5.1. The “half-sphere air” test case, which is non-scattering and
non-absorbing gives an upper bound on the performance achievable by the CPU for intersection
testing at 95 Mints. By removing virtually all of the other operations (packets must still be
launched), performance improves by only 35%, suggesting that intersection testing is responsible
for nearly 3x as much run time as the other operations in the average case. Clearly, it is the
essential factor for CPU performance.
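The “nearly 3x” figure follows from simple time-fraction arithmetic, taking the 35% speedup at face value:

```python
# Normalize total run time to 1. If eliminating all non-intersection work
# yields a 1.35x speedup, the remaining (intersection) fraction f satisfies
# 1 / f = 1.35:
speedup = 1.35
f_intersect = 1.0 / speedup        # ~0.74 of run time is intersection testing
f_other = 1.0 - f_intersect        # ~0.26 is everything else
ratio = f_intersect / f_other      # ~2.9, i.e. "nearly 3x"
```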
5.3.1 Caching
Cache-hit profiling using Cachegrind revealed that the miss rate of the last-level cache was
below 0.01% when running Digimouse, indicating that main memory latency has essentially
no impact on the algorithm’s performance. We saw considerable speedup from Simultaneous
Multi-Threading, which further suggests that the design is not bound by memory throughput
since all cores share the L3 cache and main memory. If the design were memory-throughput-
bound then adding additional computing cores would not increase speed. On the CPU at least,
the silicon area dedicated to caching exceeds what is necessary and a hypothetical device of
the same area with more compute capability and less caching would likely outperform. Other
architectures such as Intel MIC (Many Integrated Core) or GPU platforms may achieve better
results since they allocate their silicon area differently between caching and computing.
5.3.2 Comparison to TIM-OS
FullMonte’s software implementation provides slightly (10%) better performance than TIM-
OS when used at the same wmin value. Since TIM-OS is automatically vectorized by the
Figure 5.7: Software run time vs. operation count: Mints and Mabs for a variety of test cases, showing Mints as a predictor for run time
Threads   TIM-OS (s)   FullMonte (s)
1         447          443
2         227          228
4         119          123
8         83           76
16        83           76
32        83           76
Table 5.2: Comparison of FullMonte and TIM-OS run times for Digimouse standard albedo case
Intel C++ Compiler (ICC) while FullMonte has been hand-optimized using intrinsics, this is an
impressive result for the ICC. Limited further avenues exist to boost performance as discussed
in the chapter on future work. Details are provided in Table 5.2.
5.3.3 Multi-Threading
The FullMonte algorithm is very scalable across threads, showing a linear speedup from one to
four cores and an additional 55% boost when using the logical cores provided by SMT. Its
scalability slightly exceeds TIM-OS, possibly due to lower bandwidth requirements associated
with single-precision float instead of double. When using double-precision floats, the two logical
cores sharing a physical core may contend more for L1/L2 cache access, or it may increase the
contention rate when cores read from L3.
5.3.4 wmin parameter
As discussed in Sec 2.4.2, the wmin parameter provides an important quality-runtime tradeoff
that is independent of the other optimizations discussed here. Figure 5.8 shows the variance
impact of altering the parameter from its typical value of 10−5 up to 0.1. Generally, the higher
the proportion of packets terminating by roulette the larger the impact. If most or all packets
exit the geometry, then decreasing wmin has no impact since simulation terminates for reasons
other than roulette. This effect should be most pronounced when modeling BLI or IPDT-like
cases because they generally have few packets exiting. As shown in Table 5.3, the run-time
impact is significant while Fig 5.8 indicates that the quality loss (high variance) occurs in
elements with undetectably low fluence levels up until wmin = 10−3. The bold vertical line
shows the ideal dynamic-range limit of a 16-bit sensor as is typically used for BLI, assuming no
pixels are allowed to saturate. The variance lying to the left of the line would not be observable,
but the simulation would run about 40% faster.
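For reference, the standard Russian-roulette rule behind wmin (as discussed in Sec 2.4.2) can be sketched as follows; the survival factor m is a hypothetical parameter here:

```python
import random

def roulette(weight, w_min, m=10, rng=random.random):
    """Once a packet's weight drops below w_min it survives with probability
    1/m (weight scaled by m) or terminates. The expected weight is unchanged
    ((1/m) * m * w = w), which is why roulette adds only zero-mean noise to
    the energy balance."""
    if weight >= w_min:
        return weight
    return weight * m if rng() < 1.0 / m else 0.0
```

Raising w_min triggers roulette earlier, terminating low-weight packets sooner at the cost of higher variance in low-fluence elements, which is the tradeoff quantified in Fig 5.8 and Table 5.3.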
Figure 5.8: Result standard deviation vs result value at varying wmin values (Digimouse surface emission at standard albedo) with vertical line showing 16-bit dynamic range
          Low                 Standard            High
wmin      Time (s)  Speedup   Time (s)  Speedup   Time (s)  Speedup
10−5      474       1.0       845       1.0       1683      1.0
10−4      414       1.15      736       1.15      1447      1.16
10−3      352       1.35      605       1.40      1181      1.43
10−2      266       1.78      455       1.85      871       1.93
10−1      160       2.96      272       3.10      506       3.32
Table 5.3: Run-time impact of changing wmin for three different Digimouse albedo scenarios
5.3.5 Summary
Based on the analysis presented above, we highlight a number of conclusions regarding software
implementation of photon migration.
First, the FullMonte algorithm is compute-bound when running on a CPU. Since perfor-
mance scales up linearly with the number of threads, contention for the shared L3 cache and
main memory are evidently not limiting factors. The addition of more processing cores should
provide additional performance.
Second, the memory architecture of a CPU is over-designed for the problem at hand. As
the die photo in Fig 5.9 shows, great amounts of silicon area (also energy and design effort)
are expended to provide a large and fast LRU cache, at the expense of space for processing
cores. Note that only the largest, last-level (L3) cache is explicitly marked; there is more area
within each core dedicated to L2 and L1 cache. Current CPU L3 caches are both excessively
large and use an unduly complex replacement algorithm for the task at hand. The Intel Many
Integrated Core architecture may provide an interesting avenue for future work since it makes
different trade-offs regarding caching, core complexity, and core count.
Third, the ability to perform intersection testing limits performance across a range of sce-
narios. Both scattering events and traversing into an adjacent tetrahedron prompt the need
for an intersection test. If a given geometry is highly scattering relative to the mesh element
size, then the intersection-test count is dominated by scattering events and the mesh can be
refined with little performance penalty. At some point, though, one can expect an excessively
fine mesh to impose a penalty for two reasons: first, it expands the working set beyond the
cache size, causing cache misses; and second, the ratio of intersection tests to steps becomes
larger, requiring more computing.
Fourth, further study is required to determine the appropriate value of wmin for a given
application. It can provide a significant speed increase if the increased variance of low-fluence
elements is tolerable, which seems likely at least for BLI. It may also be the case for PDT,
which exhibits threshold behavior and hence does not need accurate results for fluence that is
well below the threshold.
Finally, the algorithm shows excellent scalability through parallelism due to the independence
of packets and the relatively small size of the working set (geometry description and absorption
array). The software implementation has been highly tuned and competes well with several
other packages, which suggests that there exists little more room to improve CPU-based per-
formance. A few incremental proposals are discussed in future work, but since only a 30% gain
results when completely eliminating scattering, reflection, and refraction, the remaining (and
essential) item is intersection testing which we believe to be very tightly optimized. Since the
time to combine result sets is minimal compared to the time to compute them, the CPU imple-
mentation can be scaled up at will using more cores, sockets, and nodes, albeit at the expense
of money, heat load, and power requirements. Significant per-core and per-watt performance
improvements through software changes are unlikely.
Figure 5.9: Sandy Bridge i7-2600K die photo from Anandtech [61], showing the very large area dedicated to caching
5.4 Hardware Performance
In comparison to the previous implementation by Lo [47] which simulated only infinite planar
layers, FullMonte uses a much richer geometry model. Despite the additional complexity, the
latency of the inner loop is actually considerably smaller due to careful choice of mathematical
precision (mostly 18-27 vs 32 bits in FBM), and the latency-hoisting transformations discussed
earlier. Technological progress in FPGA devices between Stratix III (FBM) and Stratix V
(FullMonte) has also helped, since more processing can be done within a given clock period and
hard multipliers have increased functionality. Figure 5.10 depicts the flow difference, with FullMonte
on top and Lo’s FBM on bottom. Edges coloured green are infrequent, giving a loop of 52 cycles,
while the core (simplest possible step) path is shaded in black and lasts 18 cycles. Operations
whose latency have been hoisted out of the inner loop through queueing are shaded gray.
Latency is a critical factor determining the size of cache required to scale up the design, so
its minimization is an important goal. FullMonte can run at a maximum clock frequency of 215
MHz, compared to 80 MHz for FBM: a significant gain which cannot be attributed solely to
process advancement, particularly in light of the decreased latency.2 Lo’s work uses a 100-stage
pipeline, meaning a packet exits the roulette core 100 stages after it enters, so 100 packets must
be in flight at a given moment to keep the pipeline fully utilized.
By introducing forks into the data flow, FullMonte is marginally more complex than FBM
and requires queues to balance the stages. Correct function is ensured by assertions that
check that there is always space in the queues when needed so no packets are dropped. The
benefit of this additional complexity is that it permits operations with high latency but low
probability (interface-related code) to be removed from the core loop. Building up from the
2Generally one can increase a circuit’s maximum clock frequency by increasing its latency as measured by number of clock cycles, whereas decreasing latency tends to lower the clock rate unless done very carefully.
Figure 5.10: Hardware block diagram of FullMonte (top) and FBM (bottom) showing latency with core-loop edges in black; maximum loop latency is 100 for FBM and 52 (18) for FullMonte
current foundation, the designer will have a chance to trade cache size against utilization, an
option that was not available in the FBM architecture. It is now possible to ensure that the utilization is
high (100% in the absence of interfaces) with a smaller cache size, since latency is extended
only when necessary. If interfaces are less than 1% of events, then the pipeline can be kept 99%
full with only 18 packets in flight, and will stall the remaining 1% of the time. Alternatively,
one could keep 52 packets in flight so that utilization is always perfectly full - a new tradeoff
available from this new architecture.
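A rough occupancy estimate illustrates the tradeoff (a simplification that ignores queueing dynamics, using the 18-cycle core path, the 52-cycle worst case, and an assumed 1% interface probability):

```python
def utilization(n_in_flight, p_interface, core_latency=18, worst_latency=52):
    """Approximate pipeline utilization: packets in flight divided by the
    mean loop latency, capped at 1.0 (fully utilized)."""
    mean_latency = (1 - p_interface) * core_latency + p_interface * worst_latency
    return min(1.0, n_in_flight / mean_latency)

u_small = utilization(18, 0.01)   # ~0.98 with only 18 packets in flight
u_full = utilization(52, 0.01)    # 1.0: 52 packets always cover the worst case
```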
5.4.1 Area Requirements
The area requirements to synthesize a single instance of the design are shown below in Table 5.4.
As expected, intersection testing is among the most resource-intensive blocks. Almost all of
the block RAM is accounted for by the tetrahedron storage. Fortunately it uses only half of
the available read ports. An additional intersection tester could use the other port of the same
storage at no cost.
Some of the resource counts are also slightly over-reported. There are situations where
Bluespec incorrectly instantiates much larger DSP units than necessary, which for instance
accounts for an additional four units in the scattering block that could be saved. Bluespec
FIFOs are also used extensively for pipeline delays to align data, which in the current form results
in extensive Block RAM usage. Using different primitives will result in different resource decisions,
Block                        Fmax   ALM     FF      DSP   BRAM
Henyey-Greenstein            364    1740    2857    4     0
Exponential Dist             479    112     128     1     0
Isotropic point source       401    371     723     11    2
Intersection test            329    510     799     20    0
Interface                    340    1707    2713    5     2
Scatter                      366    279     534     23    0
Step finish (est)            350    200     200     6     0
Storage                                                   1222
Queueing, control, RNG              3786    4665    1     325
Total                        215    8705    12619   71    1551
Fraction of device                  4%      3%      28%   61%
Adjusted total (est)                18705   22619   67    1251
Fraction                            8%      10%     26%   49%
Table 5.4: Area required for a single instance on Stratix V A7 device
reducing block RAM count by 200-300 at the cost of additional logic blocks. An adjusted total
area figure is included in the table reflecting a best estimate of the requirements after all issues
are fixed. Block RAM is a constraint on the achievable parallelism since it requires nearly half
the chip to serve two intersection testers. Alternative caching arrangements discussed later
may reduce the required resources, which will be a benefit since designs become difficult to
synthesize when they use too high a fraction of the chip’s capacity.
Utilization of the launcher is also quite low because packets take on average hundreds of
steps before expiry. It could easily be shared among many pipelines, reducing the per-instance
DSP count to 56 so that four instances can fit on the chip. The requirement could be further
reduced 2-3x by rolling the launcher implementation (Sec 4.2.4).
In summary, instantiation of multiple design instances is limited by both DSP units and
block RAMs on the S5 A7 device, accommodating up to four instances. Switching to another
family member such as D5 which has a higher density of DSP units could be beneficial, except
that it would reduce available on-chip memory, a trade-off which remains to be evaluated
rigorously with new caching schemes. In either case, it should be clear that four instances of the
pipeline can be accommodated within the device. A discussion of factors limiting performance
scale-out are discussed below in Sec 5.5.
5.4.2 Power Consumption
Since the computation occurs entirely on the FPGA chip (no external memories or other ele-
ments, and no host I/O during simulation), accelerator power consumption is due only to the
FPGA core power itself. We used Altera Quartus II to produce a quick vectorless estimate of
the total power a physical realization would consume assuming a standard 12.5% toggle rate at
215 MHz for internal digital signals. Ambient temperature was assumed to be 25 °C, with junction
temperature automatically calculated assuming a 23 mm heatsink and 200 LPM airflow, with no
                            Power (W)                         Normalized
                            Static   Dynamic   IO    Total    Speed   Energy/pkt
CPU (low range)                                      47.5     1.0     36.5
CPU (high range)                                     76       1.0     58.5
Single-instance Stratix V   1.2      2.1       0.6   3.9      3.0     1.0
Estimated 4 instances       2.4      8.4       0.6   11.4     12.0    0.75
Table 5.5: Performance and energy-efficiency comparison (FPGA vs CPU) at 210 MHz clock rate
board thermal model (conservative). As the design is scaled up to more instances on the chip,
we expect that dynamic power would increase proportionally, but that static power would in-
crease at a slower rate and I/O power should remain the same. The I/O power estimate is rough
because synthesis results were run by instantiating the core with its top-level ports connected
to general-purpose I/O connections. In a real implementation, a PCI-Express serial connection
would be used which would probably reduce the power consumption. Adding off-chip DRAM
access to accommodate larger geometries would increase I/O power as well.
To compute energy-efficiency per amount of computing, we must also account for the differ-
ence in simulation speeds. Based on run-time results, the CPU implementation is limited to 70
Mints across a variety of test cases. The hardware implementation is also Mints-limited, and is
able to achieve 100% utilization of the intersection-test block at 210 MHz, thus producing 210
Mints or 3x faster than the CPU using a single instance.
Measuring CPU power consumption fairly is a difficult matter. A typical computer system
may have a power supply rated for 200-300W, which sets a definite upper bound that includes
many elements that are not critical to actually carrying out simulations (graphics card, hard
disk, cooling, etc.) and hence should be excluded from the comparison. The processor used
has a thermal design power (TDP) rating of 95W, although again this is a maximum not
necessarily achieved. Since all cores are fully active, we can assume that the processor is fairly
heavily loaded, though as previously discussed it will not need to access main memory. There
are also no I/O or graphics operations required so portions of the chip will be idle. As a
reasonable estimate, we take a pair of values, 50% and 80% of TDP, as a proxy for CPU power
consumption.
A summary of the results is presented in Table 5.5, indicating that the FPGA system as
it stands has an energy-efficiency advantage on the order of 40x. Under conservative assump-
tions for scaling up the implementation, that gap could increase another 30%. That result is
much lower than previous work by Lo [47], who achieved nearly 700x using an FPGA and a
processor that are both two generations older. Processor power efficiency has increased greatly
since that comparison was made (from the 65nm node to 32nm), and the FullMonte software
implementation is also inherently far more efficient due to use of SIMD instructions. We also
use more conservative values for processor power consumption (Lo uses 50% TDP to estimate
Chapter 5. Results 86
using one of two cores; we use 50-80% while using all four cores). Conveniently for comparison, both
chips used are on similar manufacturing process nodes: 32nm for the CPU and 28nm for the FPGA.
5.5 Architecture Scalability
In addition to the prototype single pipeline described above, we also propose an architecture
below which would permit FullMonte to tackle larger problems and attain higher performance.
The architectural discussion includes a careful analysis of the factors which may limit perfor-
mance, based on the profiling results discussed above and the specifications of both the Stratix
V FPGA family and the DE-5 evaluation board.
5.5.1 Larger Meshes
The present implementation has a limited mesh size due to its use of on-chip memory exclusively.
As noted in the profiling results of Sec 5.2, the large majority (≳90-98%) of tetrahedron
accesses on the large Digimouse mesh occur within the 64k most frequent addresses, which are
already stored on-chip. To store the entire Digimouse mesh, the remaining elements (≈ 250k)
could be stored in off-chip memory. Since those accesses are only one-tenth as frequent as the
ones stored on chip, the memory needs to be only one-tenth as fast to avoid being a performance
limitation.
The Terasic DE-5 board selected has a Stratix V FPGA with two DDR3 SO-DIMM memory
modules (up to 8GB size), whose theoretical bandwidth is 136 Gbit/sec. At peak performance3
it could serve 348 million tetrahedron requests per second (Mtets). If the tetrahedra that
do not fit on-chip were saved in off-chip DDR memory running at 25% efficiency (87 Mtets)
and accounted for 10% of memory accesses (the complement of the 64k which cover 90% or
more), the system could fetch 870 Mtets before the off-chip bandwidth would limit performance.
Assuming there is also a cache holding the eight most-recently-used elements and a hit rate of
50% as shown in profiling, that would yield a total system performance limit of 1740 Mints,
which is nearly 8 pipelines running at 215 MHz or approximately 24x faster than the CPU
implementation.
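The bandwidth arithmetic above can be made explicit with a short sketch, using the figures as quoted (348 Mtets DDR3 peak, 25% efficiency, 10% of fetches going off-chip, 50% L1 hit rate):

```cpp
// Bandwidth-limited performance estimate for the scaled-up mesh store,
// using the assumptions stated in the text.
struct MeshBandwidthModel {
    double ddrPeakMtets    = 348.0;  // DDR3 theoretical peak
    double ddrEfficiency   = 0.25;   // assumed realized efficiency
    double offchipFraction = 0.10;   // fetches missing the on-chip 64k set
    double l1HitRate       = 0.50;   // 8-entry recently-used cache

    double offchipMtets() const { return ddrPeakMtets * ddrEfficiency; }        // 87
    double fetchLimitMtets() const { return offchipMtets() / offchipFraction; } // 870
    double mintsLimit() const { return fetchLimitMtets() / (1.0 - l1HitRate); } // 1740
    double pipelinesAt(double mhz) const { return mintsLimit() / mhz; }         // ~8
};
```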
It would also be necessary in many applications (particularly PDT) to record the absorption
events. QDRII+ uses separate read and write buses and a burst length of four, so a 72b read
and a 72b write can both be completed every other clock cycle. This scheme is ideal for
fluence accumulation since accumulation requires a sequence of read-add-write. Each memory
is capable of addressing 8MB of data, or 128k 72-bit words.
For 72-bit fluence accumulation, up to 512k elements can be stored on the four chips. Since
QDRII+ (unlike DDR3) has no bus efficiency overhead, the total off-chip access rate for the
four chips would be 900 million 72-bit read-write pairs per second. Since each absorption event results
³Peak values are guaranteed not to be exceeded. In a real implementation there would be some overhead which will detract from this value.
in one read-write pair, performance is limited to 900 Mabs of off-chip access in the absence of
any caching, sufficient to serve four pipelines at 225 MHz. That limit could be raised by an
appropriate caching scheme such that tetrahedron fetching is the performance-limiting factor.
Bearing in mind that each absorption event requires at least one intersection test, it means
that the DDR3 bandwidth limit for tetrahedron storage and the QDRII+ limit for fluence
accumulation would be compatible with each other and with very high performance.
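The QDRII+ figures can be checked with a small model. The 450 MHz per-chip clock used here is an assumption chosen to match the 900M pairs/s figure above; one 72-bit read and one 72-bit write complete every two cycles.

```cpp
// Bandwidth check for the four-chip QDRII+ fluence store; the per-chip clock
// is an assumed value consistent with the 900M pairs/s figure in the text.
struct QdrModel {
    double clockMhz = 450.0;  // assumed per-chip clock
    int    chips    = 4;

    // One read-write pair every two cycles per chip, in millions per second.
    double pairRateM() const { return chips * clockMhz / 2.0; }
    // Worst case: one absorption (one read-add-write pair) per pipeline cycle.
    int pipelinesServed(double pipelineMhz) const {
        return static_cast<int>(pairRateM() / pipelineMhz);
    }
};
```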
5.5.2 Parallelism for Greater Throughput
Since MC simulations are inherently parallel, running M instances of the pipeline with inde-
pendent RNG seeds would yield an M times speedup if the time to merge results is negligible.
In future work, the pipelines could share memory so that results are merged on the fly, though
they would need to share access bandwidth, which could limit performance.
Scaling up through parallelism would be trivial for a number of functional blocks. The Tiny
Twister 800 RNG produces 800 bits in parallel, of which less than 100 are used for a single
pipeline instance. Up to eight loop instances could receive independent bits from the single
RNG. Likewise, the packet launcher is sharable since the average packet takes anywhere from
50-500+ steps after being launched. That suggests that a single launcher could be shared by
50+ instances, or that it could be shared by four while having its implementation rolled up to
10x to economize on device resources.
One of the most significant current bounds on parallelism is the number of DSP units on the
device. The implementation uses 71 out of 256 available on a Stratix V A7 chip. Of those,
eleven can be shared among all instances for the launcher, and can be further reduced
by loop rolling since the launcher is needed only infrequently. Previously-discussed arithmetic
optimizations and bug fixes are expected to save five DSP blocks, after which the design will
require 3+55M DSP blocks to accommodate M independent parallel pipeline instances. When
a refractive-index interface block is added, it will add to these requirements but can be shared
across all pipelines due to its very sparse usage pattern.
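The resulting DSP budget (3 shared blocks plus 55 per pipeline, assuming the expected fixes land) can be checked with a trivial helper:

```cpp
// DSP budget from the discussion above: 3 shared blocks (launcher, after loop
// rolling and the expected fixes) plus 55 per independent pipeline instance.
int maxPipelines(int dspAvailable) {
    int m = 0;
    while (3 + 55 * (m + 1) <= dspAvailable)
        ++m;
    return m;
}
```

On the 256-DSP Stratix V A7, this gives four pipelines (223 DSP blocks used).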
Tetrahedron memory access would also become an important factor when scaling up. Fig-
ure 5.11 shows an efficient architecture based on the profiling results of Sec 5.2. Edges between
blocks indicate the access rate in millions per second. Based on profiling, more than half of
memory accesses can be served by an 8-element L1 LRU cache which would require eight stor-
age elements per in-flight packet for a total of 416 elements. Since the elements are 404b wide,
they can be stored in eleven parallel block RAMs. Misses from that cache could be directed
to an L2 Zipf cache with a static cache set that could be pre-determined by running a small
simulation (≈ 105 packets) on the host (or perhaps in future work using self-profiling FPGA
hardware). Since the L1 hit rate is better than 50%, two pipelines should be able to share a
single L2 cache port with some queueing. If the static L2 cache were implemented as a Block-
RAM-based ROM, then two read ports would be available per array, so there could be two
pipelines per port and two ports per Block RAM array. A quick estimate from the profiling
Level  Policy  MAccess/s  Hit rate  Elements  Inst  Ports  BRAM/inst  BRAM total
L1     LRU     300        60%       8x52      8     1R 1W  11         88
L2     LFU     240        50%       4k        4     1R     44*        176
L3     LFU     240        80%       32k       2     1R     352*       704
DRAM   -       96         100%      Millions  1     1R     0          0
Total  Hybrid  2400       -         Millions  -     -      -          968

Table 5.6: Resource estimates for the 8-pipeline cache hierarchy (DRAM peak bandwidth is 348M/sec, so 27% efficiency is needed); * assuming 2 instances share 1 physical RAM; based on Digimouse profiling
illustrated in Fig 5.6 indicates that a few thousand elements should suffice for an L2 cache. One
possibility would be to make L2 as large as possible and serve its misses from DRAM, allowing
four pipelines per chip which is the limit based on available DSP units.
For chips with a larger number of DSP units, eight pipelines might be feasible. It would
require instantiation of one of the previously-described 8-LRU L1 caches per pipeline, and an
L2 cache shared among 4 pipelines of sufficient size (4k) to serve at least 50% of requests. The
misses from the two L2 instances could be served by a shared L3 cache before going to main
memory. Total throughput would be governed by the ability of the L3/DRAM solution to serve
tetrahedron requests if the cache assumptions are correct. Fluence accumulation would also
need to keep pace, which should be achievable with a simple 8-LRU L1 scheme coupled to the
QDRII system described above. Given a 50% L1 miss rate, the QDRII RAM could serve 1800
Mabs which is the peak absorption output of eight pipelines at 225 MHz.
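For reference, the behaviour of the proposed 8-entry L1 LRU cache can be modelled in a few lines of software. This is a functional sketch only; a hardware version would hold the 404-bit tetrahedron record alongside each tag, and the static Zipf-style L2 would simply be a ROM lookup over the pre-profiled hot set.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Functional model of the proposed per-pipeline 8-entry L1 LRU cache.
class LruCache {
public:
    explicit LruCache(std::size_t capacity) : cap_(capacity) {}

    // Returns true on a hit; in either case 'id' becomes most recently used.
    bool access(std::uint32_t id) {
        auto it = std::find(tags_.begin(), tags_.end(), id);
        bool hit = (it != tags_.end());
        if (hit)
            tags_.erase(it);
        else if (tags_.size() == cap_)
            tags_.pop_back();                 // evict least recently used
        tags_.insert(tags_.begin(), id);      // front = most recently used
        return hit;
    }
private:
    std::size_t cap_;
    std::vector<std::uint32_t> tags_;
};
```

A packet stepping repeatedly through a small neighbourhood of tetrahedra hits in this structure better than 50% of the time, per the profiling above.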
5.5.3 Cost of Scale-Up
The reference PC workstation cost approximately $2200, of which several hundred each went
towards a GPU and solid-state storage (SSD) system which are not relevant to the problem at
hand. Allowing that the actual cost of processor, memory, and relevant components was on the
order of $1200, the CPU could be fairly compared to a Terasic DE-5-Net board [63] which hosts
the FPGA used for simulation on a PCI Express card for a list price of $8000. If the modest
scaling projections are achieved then the FPGA-based system will compare favourably with a
CPU-based system in terms of purchase cost per throughput (1.8x), energy efficiency (30-50x),
and throughput (12x).
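The cost-per-throughput figure follows directly from the numbers above; a sketch, with speeds expressed relative to the CPU baseline (so the result is only as reliable as the 12x projection):

```cpp
// Purchase-cost-per-throughput comparison from the figures in the text.
double costAdvantage() {
    const double cpuCost  = 1200.0, fpgaCost  = 8000.0;   // USD
    const double cpuSpeed = 1.0,    fpgaSpeed = 12.0;     // relative throughput
    return (cpuCost / cpuSpeed) / (fpgaCost / fpgaSpeed); // ~1.8x, FPGA's favour
}
```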
5.6 Summary
The FullMonte software implementation is a highly-optimized software model, and the fastest
tetrahedral-mesh Monte Carlo model for light propagation in turbid materials. Aside from
time-resolved output (which is a planned new feature), it supports the most general set of
geometries, materials, and output data in an efficient and customizable way.
[Figure: eight pipelines, each with a private 8-entry LRU L1 cache (60% hit rate, 300 MAccess/s per pipeline); each pair of pipelines shares one of four 4k-entry static LFU L2 caches (50% hit rate, 120 MAccess/s per input); two 32k-entry static LFU L3 caches (80% hit rate) absorb the L2 misses, each passing 48 MAccess/s to off-chip storage (348M/s at peak).]

Figure 5.11: Proposed cache architecture
The FullMonte hardware architecture demonstrates significant novelty and improvement
over previous work by Lo. Clock speed has been increased, partly due to technological advances
between device generations (65nm to 28nm), and partly due to careful optimization of bit
widths and algorithmic enhancements. Latency of the core loop has been cut in half (or by 5x,
if the interface path is ignored) even while increasing clock speed, which will prove important
for future scaling. The comparison against CPU is also more reliable since the FullMonte
software is highly-optimized, multi-threaded, and uses modern processor features in contrast to
Lo’s reference point, MCML, which is unoptimized single-threaded C code. We synthesized a
prototype which shows correct function and provides insight into area and power requirements,
with energy per simulation being reduced 30-50x from a highly-tuned CPU implementation and
a 3x performance increase while using less than a quarter of the FPGA device.
We gained insight into the factors that limit performance through extensive profiling, and
have identified novel techniques to increase performance and efficiency of hardware algorithms.
In addition, we proposed and analyzed a memory architecture which would enable scaling-up
of the prototype to handle larger meshes and higher performance. The use of a static Zipf-style
cache is new for this application, and would provide significant benefits in performance, area,
and complexity over the more-typical LRU policy. We presented analysis which shows that a
scaled-up system could support at least four parallel instances, with sufficient off-chip memory
to store the Digimouse mesh and record volume fluence for all elements. Such a system would be
attractive compared to a CPU-based system on measures of throughput, cost-per-throughput,
and power-per-throughput, as well as physical space and cooling required. Since it outperforms
CPUs on all those metrics, it would be the optimal choice for scaling such calculations up for
iterative solution of biophotonic inverse problems.
Chapter 6
Conclusions and Future Work
6.1 Conclusions
This chapter summarizes the principal contributions and findings of this thesis, and suggests
future research avenues. Future work can be divided into several sections: further improvements
to the software model; optimizing and adding features to the single-pipeline hardware; scaling
the prototype hardware up to larger problems and higher performance; applying the new insight
to other computational platforms; and putting the simulators to work on applications.
6.1.1 Contribution summary
The principal contributions demonstrated in this thesis include the following advances in tetra-
hedral mesh-based Monte Carlo simulations of light propagation through turbid media:
• Fastest available software simulator
• Most flexible available software simulator (configurable output data gathering with zero customization overhead)
• New method for scattering calculation
• New variance estimator
• Demonstrated feasibility of FPGA hardware with a 3x speed and 40x power-efficiency increase over a CPU
• Proposed and analyzed an FPGA architecture to achieve >12x speed increase over CPU
6.1.2 FullMonte Software
The FullMonte software model described in this thesis is now the fastest available open-source
tetrahedral MC simulator. It achieves this by making extensive use of manual optimizations
where appropriate, and by exploiting modern CPU capabilities. In the process, we generated
new profiling tools and data to analyze the basic algorithm, identify the factors limiting its
performance, and optimize both the hardware and software designs based on that profiling.
In conclusion, FullMonte is the current state of the art for software biophotonic simulations:
a highly optimized C++ implementation using Intel SSE instructions. In view of its considerable
optimization, we believe further efforts to accelerate CPU-based simulations offer room for only
incremental improvement.
6.1.3 FullMonte Hardware
In response to the diminishing returns from software optimization discussed above, a hardware
implementation was designed which shows the feasibility of fast Monte Carlo biophotonic sim-
ulations using FPGAs. The current FullMonte hardware achieves a 3x speedup while providing
a 40x benefit in power efficiency within a compact package. This is the first such hardware
design for complex geometry, and it has conclusively demonstrated that FPGAs can achieve
superior speed and power performance compared to a CPU for this application. We
have also presented an architecture to scale up to higher performance and bigger problems,
based on thorough application profiling and careful design analysis.
The current hardware design simulates a single instance of the core packet loop running at
215 MHz on a commercially-available FPGA device with resources to spare while consuming
far less power than a CPU. The design requires less than one quarter of the resources of an
Altera 5SGXMA7N1F45C1 FPGA, leaving room for future work to expand to multiple parallel
pipelines for greater performance. Development work to attach it to the PCI-Express bus and
write support drivers to interface it to a host computer requires effort but carries little to no
technical risk.
6.2 Future Work
6.2.1 FullMonte Software
Several important contributions were made to improve on the state of the art software model.
Further work could be undertaken both to improve the performance of the existing model and
to incorporate new features or capabilities to make it even more broadly applicable. A few such
avenues are sketched out below.
Variance Estimation
The new variance calculation proposed in Sec 3.3.3 should be implemented and validated. Since
MC simulations are inherently random in nature, the run time is directly proportional to the
number of paths simulated. There is therefore a natural tradeoff between run time and result
quality, which can now be rigorously quantified. A reliable estimate of the output
variance would permit the user to terminate simulations more quickly once a target level of
variance is reached, or to have an estimate of output variance for a given fixed number of
simulated packets. This could find use, for instance, in planning PDT treatments such that
confidence bounds can be placed around the simulated light dose to enhance patient safety.
Validation would be required, in which the realized sample variance per mesh element would
be computed over N independent runs and compared with the variance estimator as proposed.
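Such a validation could compute the realized per-element statistics online with Welford's algorithm, for example:

```cpp
// Running per-element statistics via Welford's algorithm: feed in one mesh
// element's fluence from each of N independent runs, then compare the realized
// sample variance against the prediction of the proposed estimator.
struct RunningStats {
    long   n    = 0;
    double mean = 0.0;
    double m2   = 0.0;  // sum of squared deviations from the running mean
    void add(double x) {
        ++n;
        double d = x - mean;
        mean += d / n;
        m2 += d * (x - mean);
    }
    double sampleVariance() const { return n > 1 ? m2 / (n - 1) : 0.0; }
};
```

One such accumulator per mesh element suffices, and the single-pass update avoids the numerical cancellation of the naive sum-of-squares formula.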
New Source Types
The current software model is limited to point sources (isotropic or directed), isotropic volume
sources, and directed face sources. The source code is designed so that new capabilities can
be added using the C++ inheritance mechanism without altering any core code. Conveniently,
the inheritance mechanism also allows reuse of aspects of sources when designing other sources.
Line sources with both uniform and customized longitudinal emission profiles (similar to work
by Rendon [10]) would be one area of particular interest to PDT applications.
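A hypothetical sketch of such an extension is shown below; the class and member names are illustrative only and do not match the actual FullMonte source hierarchy.

```cpp
#include <array>
#include <cmath>
#include <random>

using Vec3 = std::array<double, 3>;
struct Packet { Vec3 pos, dir; };

// Illustrative base class: a source knows how to launch a packet.
class Source {
public:
    virtual ~Source() = default;
    virtual Packet launch(std::mt19937& rng) const = 0;
};

// Line source with a uniform longitudinal emission profile: pick a point on
// the segment [a, b], then emit isotropically. A customized profile would
// instead draw the longitudinal position from a tabulated CDF.
class LineSource : public Source {
public:
    LineSource(Vec3 a, Vec3 b) : a_(a), b_(b) {}
    Packet launch(std::mt19937& rng) const override {
        std::uniform_real_distribution<double> u(0.0, 1.0);
        const double pi = std::acos(-1.0);
        double t = u(rng);                       // uniform along the segment
        Vec3 p;
        for (int i = 0; i < 3; ++i)
            p[i] = a_[i] + t * (b_[i] - a_[i]);
        double cosT = 2.0 * u(rng) - 1.0;        // isotropic direction
        double sinT = std::sqrt(1.0 - cosT * cosT);
        double phi = 2.0 * pi * u(rng);
        return {p, {sinT * std::cos(phi), sinT * std::sin(phi), cosT}};
    }
private:
    Vec3 a_, b_;
};
```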
Time Resolution
The primary target of interest for the present work is PDT which operates in the continuous-
wave mode, meaning time resolved calculations are not necessary. Some applications (DOT,
time-resolved DOS), however, require time-resolved output data. TIM-OS currently provides
time-resolved functionality, as does CUDAMC (for a limited geometry), providing ample bases
for comparison if and when that feature is introduced. If time resolution is desired, the optimal
technique for an arbitrary pulsed or modulated input is to calculate a temporal impulse response
function for a given source configuration, then compute the actual output by a convolution of the
impulse response and the input waveform (either modulation or finite-duration pulse). While
the calculations themselves are trivial, requiring only that the packet time from launch t be
stored and t′ = t + ns/c₀ (where n is the refractive index, s the step length, and c₀ the vacuum
speed of light) calculated at each scattering step, reporting output as a function of
discretized time adds another dimension to all output data arrays.
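The convolution step itself is straightforward; a direct sketch, on a shared time grid of spacing dt (an FFT-based version would be preferable for long records):

```cpp
#include <cstddef>
#include <vector>

// Direct discrete convolution of a simulated temporal impulse response with
// an arbitrary source waveform sampled on the same time grid.
std::vector<double> convolve(const std::vector<double>& irf,
                             const std::vector<double>& input, double dt) {
    std::vector<double> out(irf.size() + input.size() - 1, 0.0);
    for (std::size_t i = 0; i < irf.size(); ++i)
        for (std::size_t j = 0; j < input.size(); ++j)
            out[i + j] += irf[i] * input[j] * dt;
    return out;
}
```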
Intel AVX/AVX2 Instructions
The current FullMonte software uses Intel SSE instructions up to version 4.1, with four-element
single-precision registers. Recent Intel processors have a newer instruction set called AVX
(Advanced Vector eXtensions) which offers eight-element single-precision registers; newer chips
will implement its successor AVX2, and the announced AVX-512 extensions will expand that to
sixteen-element single-precision registers. While the simulation as written makes natural use of four-element registers (either
with three spatial dimensions, or using the four faces of a tetrahedron), some sections of code
may benefit from the new instructions. Certainly some of the vector-math functions, such
as logarithm, sine/cosine calculation, and random number generation, will see increased
throughput if the new instructions are used since more elements can be computed
in parallel, resulting in fewer calls to such functions. By nature, Monte Carlo simulations
make intensive use of random numbers, so an increase in performance of the RNG and related
functions (logarithm, sine/cosine) used to generate distributed random numbers may be
significant. Through rewriting of the main loop, it may be possible to extract additional
performance by batching multiple intersection tests or other operations within the larger
sixteen-element registers, though that would require significant restructuring.
To increase software simulator performance and maintain a fair CPU-FPGA comparison, the
software should be updated to make full use of all new processor features as they become
available.
One area which may pay significant dividends would be enabling vector calculation of the
Henyey-Greenstein function. With sixteen-element-wide registers, the effective cost of phase
function calculation could be cut to 1/16 per unit. The results would need to be stored in
a small queue, which would incur some slight overhead. A similar concept was used in the
hardware design to hoist latency out of the main packet loop (Sec 4.3.1).
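For reference, the standard inverse-CDF form of Henyey-Greenstein sampling used in MCML-family codes is shown below in scalar form; each draw is independent, so a sixteen-wide implementation could fill all lanes per call and push the results into the small queue described above.

```cpp
#include <cmath>

// Scalar inverse-CDF sampling of the Henyey-Greenstein deflection cosine
// (standard MCML-family form); xi is uniform on [0,1).
double sampleHgCosTheta(double g, double xi) {
    if (std::fabs(g) < 1e-6)
        return 2.0 * xi - 1.0;                        // isotropic limit
    double f = (1.0 - g * g) / (1.0 - g + 2.0 * g * xi);
    return (1.0 + g * g - f * f) / (2.0 * g);
}
```

The sampled cosine has mean g, which provides a simple correctness check.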
6.2.2 FullMonte Hardware
The FullMonte hardware implementation presented here is a proof of concept demonstrating
that significant run-time and power-efficiency gains can be made for complex biophotonic
problems by implementing the simulations on FPGAs.
Refractive-Index Interfaces
To make the hardware simulator fully applicable for PDT of complex volumes including HNC,
it will be necessary to support calculations at refractive index boundaries. The presence of air
voids in the sinuses and oral/esophageal cavity will make a significant difference in the fluence
distribution due to the sharp refractive-index change. As previously argued, the computational
cost should be modest and not an overall limit to system performance.
A module to handle refractive interfaces could be written using mostly existing building
blocks, based on the algorithm described in Sec 2.4.2. The intersection point and cosine of the
incidence angle are already provided as input. The code to calculate sin θ = √(1 − cos²θ) also
already exists. The condition for total internal reflection (TIR) can be checked by comparing
cos θ against a constant stored for each material interface (at most N_m² = 256). If TIR does
not occur, then the sine and cosine of transmitted angle can be calculated through Snell’s law,
and those quantities used to evaluate the Fresnel reflection probability R. Using a Bernoulli
random draw with probability R = f(n₁, n₂, θ), the packet will either entirely reflect or refract such
that the expected energy transmitted is physically correct.
When evaluating the new direction, the vectors d, a, b must all be adjusted for reflection or
refraction as appropriate. One option would involve calculating a′ = (d × n)/|d × n| and b′ = d′ × a′. Of
these, only the normalization of a′ is costly or high-latency, since it requires division by sin θ
which must already have been calculated for Snell’s Law. The additional computational cost
should be minor given that interface events are so much rarer than scattering. A hardware
implementation could conceivably use a single divider and a single DSP unit with loop rolling
to produce a low-throughput result.
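The algorithm just described can be sketched as follows. This is an illustrative helper, not the actual hardware module: the reflect/refract decision is a Bernoulli draw with probability equal to the unpolarized Fresnel reflectance.

```cpp
#include <cmath>
#include <random>

// Sketch of the interface step: given the incidence cosine and refractive
// indices, either reflect entirely (returning true) or refract, writing the
// transmitted-angle cosine to cosT. Reflection probability equals the
// unpolarized Fresnel reflectance R = f(n1, n2, theta).
bool reflectAtInterface(double n1, double n2, double cosI,
                        std::mt19937& rng, double& cosT) {
    double sinI = std::sqrt(1.0 - cosI * cosI);
    double sinT = (n1 / n2) * sinI;                 // Snell's law
    if (sinT >= 1.0) return true;                   // total internal reflection
    cosT = std::sqrt(1.0 - sinT * sinT);
    double rs = (n1 * cosI - n2 * cosT) / (n1 * cosI + n2 * cosT);
    double rp = (n1 * cosT - n2 * cosI) / (n1 * cosT + n2 * cosI);
    double R = 0.5 * (rs * rs + rp * rp);           // unpolarized reflectance
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < R;                              // all-or-nothing decision
}
```

At normal incidence on an n = 1.0 to n = 1.5 boundary this reflects about 4% of packets, the familiar Fresnel result, so the expected transmitted energy is physically correct.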
Time Resolution
The general technique and limitations of time-resolved simulation were discussed within the
context of the software model. Conceptually the same idea could be implemented in hardware,
with the caveat that it could easily become a limiting factor in simulation performance due
to the different memory architecture. Since an FPGA’s fast on-chip memories are of limited
size, to store results with fine temporal resolution or more than a few time-resolved data points
would require an excessive proportion of off-chip memory access. That access in turn would
be much slower and have a lower throughput which would limit the maximum calculation rate.
On the other hand if the results are only needed at a small number of probe locations known at
simulation time (e.g. as done in CUDAMC), then simulation speed would not be compromised.
Pipelining
The gap between the individual block maximum speeds (330MHz) and system maximum
(215MHz) shown in Table 5.4 indicates that there is room to improve performance by pipelining
interconnect between blocks. While it is not reasonable to expect to hit exactly the maximum,
since that is achieved by a single block in isolation with no interfering logic or routing, it should
be possible to increase speed by an additional 20-30% through careful pipelining under the
proviso that each stage should be justified since the overall design is sensitive to excess latency.
Arithmetic Optimization
The most significant issue is that the Bluespec compiler generates explicit sign-extension when
performing signed multiplication, which appears to lead to incorrect inference from the Altera
FPGA synthesis tools. Unfortunately fixing that requires explicit instantiation of a hard block
which makes the code less readable. There remain several instances of this issue at the time of
writing, which results in a cumulative impact of 7 additional DSP units consumed.
There are also some minor optimizations of mathematical functions that could reduce
latency and resource usage, particularly the square-root (or √(1 − x²)) and division operations.
They are not performance-limiting at the moment so such optimizations can wait. The current
implementations were chosen to get a working system quickly rather than carefully optimized.
6.2.3 New Acceleration Platforms
With constant demand for improved performance, the landscape for processors and compute ac-
celerators is ever changing. As device capabilities evolve, software must periodically be updated
to take advantage of the new capabilities. In addition, even without adding new capabilities,
architectures change in ways that require software tuning, for instance: core size, number of
cores, cache size, memory coherence models, multi-threading. Some of the recent and upcoming
changes which may be relevant to future development of the FullMonte software simulator are
summarized below. With the new algorithmic understanding developed in this thesis, it will
be easier to provide a rigorous analysis of the costs and benefits of each candidate architecture
and platform.
GPU
While there exist a number of previously-identified challenges to a GPU implementation of
the algorithm (recall Sec 4.1.1 discussing FullMonte platform choice and Sec 2.5.3, 2.5.4, 2.5.5
regarding previous attempts with GPU), it could be an interesting problem to attempt. Solving
problems efficiently using GPUs often requires clever transformations of the problem to tailor
an implementation to the particularities of the compute medium, notably memory coalescing,
divergence avoidance, and locality exploitation. As GPU caching and coalescing capabilities
continue to evolve, they may become more suitable as a compute medium for this problem. At
present, we believe that the problem would be difficult to accelerate on a GPU but that does
not constitute proof, and the effort would surely be illuminating whether successful or not.
Intel Xeon Phi
As previously discussed in Sec 4.1.1, Intel’s Xeon Phi is a very recent addition to the compute-
acceleration landscape. It is worth noting that several Top500 [51] supercomputers including the
top-ranked Tianhe-2 and seventh-ranked TACC Stampede systems use Xeon Phi coprocessors.
Portability of a functionally-correct algorithm would be trivial due to the shared instruction
set including all intrinsics used in FullMonte software. The increased core count would suggest
that Xeon Phi may run faster since FullMonte is compute-bound on current CPUs, however the
smaller cache size may mean more stalling while waiting on main memory. With some tuning,
it would be possible to give a fair evaluation of performance on the new platform, and this
should certainly be pursued.
6.2.4 Applications
Having created a high-quality, flexible, high-performance general software simulator, significant
work exists in the application domain to further illustrate the value of the new simulator.
Several such options which are immediately practical are introduced below.
Comparison vs Finite Element Method
Given the existence of the diffusion approximation, which can provide a much faster solution
via FEM, it is natural to ask why and when to use MC. Other authors have noted the qualitative
limitations of the diffusion approximation [38]. Rigorous testing of the differences between FEM
and a Monte Carlo solution in representative problem geometries would be valuable. Work is
in progress to allow FullMonte to read NIRFAST input files so that a direct comparison can
be made without any effort to translate the input files. Comparison of the output of validated
Monte Carlo solutions with FEM-derived solutions will indicate how large a difference exists
and in what cases. We are not aware of published work for non-trivial problem geometries.
Application to PDT
The motivating application for the present work is treatment planning for PDT. When treating
complex geometries such as HNC, large portions of the planning treatment volume are within
a few mean free paths of strong optical-property boundaries. When using extended sources the
volume of tissue within a few mean free paths of a source becomes significant. Further, the
delicacy of nearby organs at risk, particularly the carotid arteries in the HNC case, requires high
simulation accuracy. The demands of PDT, including recording absorbed energy throughout
the volume, the lack of need for time-resolved features, and representative material properties
and mesh sizes, drove the present architecture, so PDT is a natural first application for this hardware.
Consequently, we aim to use real anonymized patient data to perform simulations of PDT for
HNC and develop the necessary infrastructure to do a complete PDT fluence-evaluation system
based on the demonstrated hardware architecture.
Other Applications
When a completed hardware simulator is fully implemented, applications which are currently
infeasible due to high computing demands (HNC PDT, quantitative BLI with complex geome-
tries) will become more feasible thanks to an estimated 12-20x runtime decrease without the
high space and power requirements of a compute cluster. Several cards could fit within a work-
station, enabling desk-top biophotonic simulation reaching towards two orders of magnitude
speedup versus a CPU-based solution of the same size and power requirements. The portability
and modest power consumption will mean high-performance PDT dose evaluation can travel,
for instance into operating rooms where bringing a compute cluster would not be possible due
to space and power restrictions.
The use of this hardware and software for applications in the continuous-wave imaging/detection
regime (CW DOS, SFDI, BLI) would be relatively straightforward. In contrast to the PDT
application which is the primary focus of this hardware, for imaging and detection only the
photons exiting the material are of interest. Though packet exit events are not currently cap-
tured, their relative sparsity (roughly 20:1 to 1000:1 less common) compared to absorption
means that the output event rate is quite low relative to the hardware already implemented.
Due to the nature of the tetrahedral mesh description, the number of surface faces will also be
much less than the number of tetrahedra in the entire mesh. Consequently, both the bandwidth
needs and working-set size of a hardware emittance logger would be trivial compared to those of
the absorption logger already presented. A specialized implementation could exploit that fact
to dedicate more logic and memory resources to geometry fetching and less to event logging,
achieving still better Mints performance than the present design.
6.3 Summary
This thesis has presented several functional and performance enhancements to the state of the
art in software simulation of light propagation through turbid tissues. The open-source
FullMonte software implementation is the best in its class for both flexibility (imposing no
performance overhead for features not used) and speed, besting all comparable mesh-based
Monte Carlo implementations in run time. We have also demonstrated the feasibility and
performance of a power-efficient, compact, cost-effective hardware solution using FPGA
technology. In simulation, the prototype FPGA implementation achieved more than a 3x
performance increase over highly-optimized, best-in-class multi-threaded software while using
40x less power. A detailed analysis shows that a further 4x performance increase (to 12x vs
CPU) is achievable, with up to 20x estimated as possible with further optimization.
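The throughput and power figures above compound when expressed as energy per completed simulation: a design that is 3x faster while drawing 1/40 the power uses roughly 120x less energy per simulation. The arithmetic is sketched below; the ratios are the ones quoted in this summary, and the variable names are illustrative rather than taken from the thesis.

```python
# Back-of-envelope energy advantage of the FPGA prototype over the CPU baseline,
# using only the two ratios quoted above (no absolute wattages assumed).
speedup = 3.0        # FPGA throughput relative to multi-threaded CPU software
power_ratio = 40.0   # CPU power draw relative to the FPGA

# Energy per simulation = power x time, so the per-simulation energy advantage
# is the product of the speedup and the power ratio.
energy_advantage = speedup * power_ratio
```

This combined metric is often the more relevant one for deployment in power-constrained settings such as the operating-room scenario discussed earlier.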