
FullMonte: Fast Biophotonic Simulations

by

Jeffrey Cassidy

A thesis submitted in conformity with the requirements
for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering
University of Toronto

© Copyright 2014 by Jeffrey Cassidy


Abstract

FullMonte: Fast Biophotonic Simulations

Jeffrey Cassidy

Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

2014

Modeling of light propagation through turbid (highly-scattering) media such as living tissue is important for a number of medical applications including diagnostics and therapeutics. This thesis studies methods of performing such simulations quickly and accurately. It begins with a formal definition of the problem, a review of solution methods, and an overview of the current state of the art in fast simulation methods encompassing both traditional software and more specialized hardware acceleration approaches (GPU, custom logic). It introduces FullMonte, the fastest mesh-based Monte Carlo software model available, and highlights its novel optimizations. Additionally, it demonstrates the first fully three-dimensional hardware simulator using Field-Programmable Gate Array (FPGA) custom logic, offering large power-efficiency (40x) and performance (3x) gains. Next, a plan for significant future feature enhancements and performance scale-out is sketched. Lastly, it proposes applying the simulators developed to a number of problems relevant to current clinical and research practice.


Acknowledgements

It goes without saying that my two supervisors, Professor Vaughn Betz and Professor Lothar Lilge, were both extremely important to the completion of this work. Were they “just” technically savvy, well-informed across a wide range of topics, and well-respected in their fields, I would have been very fortunate. They are undoubtedly that, but my good fortune goes further as they are also hard-working, excellent mentors, generous with their time, and energetic supporters: truly outstanding role models. Their guidance and constant enthusiasm have made graduate school enjoyable, indeed so much so that I look forward to many (but not too many!) more years working for them during my PhD: an outcome that I had not originally planned for, but one of the easiest decisions I’ve made.

I am very thankful to Professor Jonathan Rose, who provided the introduction to Lothar without which this collaboration would not have happened.

Much gratitude is due to Emily Dobson for her love, support, and patience, particularly during the “crunch” stage: writing this thesis while at the same time taking a full course load towards my PhD. You have been incredible throughout - thank you so much!

My thanks also go to a number of friends and family whose support and encouragement have been vital along the way. First and foremost, to my grandmother Geneva McNeil, who is the model of generosity, patience, and kindness. My aunt Donna McNeil has always been there for me, and provided a peaceful place to work and/or relax when it was needed. My aunt Susan and uncle Paul Douglas have been supportive during both my undergraduate and graduate education. Good friends Chris Trendall and Nancy Wolf have provided many laughs and kind words along the way. Dianna Lanteigne was very important in my decision to leave work and return to graduate school.

I am also thankful for funding and in-kind contributions from several organizations. Bluespec Inc. provided the Bluespec Compiler and related software, which made designing and simulating the FPGA implementation very much faster and easier than I could have expected. Altera Corp. provided the software tools used for hardware synthesis of the FPGA implementation. Financial support was provided by Altera Corporation, the Ontario Cancer Institute, and the University of Toronto.


Contents

1 Introduction
  1.1 Medical Uses of Light
    1.1.1 Photodynamic Therapy
    1.1.2 Bioluminescence Imaging
  1.2 Inverse Problems
  1.3 Contributions
  1.4 Organization of Thesis

2 Background
  2.1 Applications of Diffuse Light
    2.1.1 Photodynamic Therapy (PDT)
    2.1.2 Diffuse Optical Tomography
    2.1.3 Bioluminescence Imaging
    2.1.4 Diffuse Optical Spectroscopy
  2.2 Tissue Optics
  2.3 Light Propagation Models
    2.3.1 Geometry Descriptions
    2.3.2 Material Optical Properties
    2.3.3 Source Descriptions
    2.3.4 Output Data
  2.4 Numerical Solution Implementations
    2.4.1 Finite Element
    2.4.2 Monte Carlo
  2.5 Existing Implementations
    2.5.1 MCML
    2.5.2 tMCimg
    2.5.3 CUDAMC
    2.5.4 CUDAMCML
    2.5.5 GPU-MCML
    2.5.6 NIRFAST
    2.5.7 TIM-OS
    2.5.8 MMCM
    2.5.9 MCX
    2.5.10 FBM (MCML on FPGA)
  2.6 Computing Platforms
    2.6.1 Central Processing Units (CPU)
    2.6.2 Graphics Processor Units (GPU)
    2.6.3 Field-Programmable Gate Array

3 Software model
  3.1 Design choices
    3.1.1 Monte Carlo simulation
    3.1.2 Geometry Representation
    3.1.3 Tools and Libraries
    3.1.4 Programming Language and Style
  3.2 Design Overview
  3.3 Performance enhancements
    3.3.1 Multithreading
    3.3.2 Explicit parallelism through SIMD intrinsics
    3.3.3 The wmin Russian roulette parameter
  3.4 Output Data
  3.5 Profiling information
    3.5.1 Geometry Description
    3.5.2 Operation Frequency
    3.5.3 Coordinate precision
    3.5.4 Spin Calculation Methods
    3.5.5 Intersection Testing

4 FPGA Implementation
  4.1 Motivation for Hardware Acceleration
    4.1.1 GPU
    4.1.2 Intel Xeon Phi processor
    4.1.3 FPGA
  4.2 Design Overview
    4.2.1 Hardware Platform: Altera-Terasic DE-5
    4.2.2 Implementation Language: Bluespec
    4.2.3 Design Limitations
    4.2.4 Design Goals
    4.2.5 Data Representation
    4.2.6 Packet Loop Description
  4.3 Design Details
    4.3.1 Random Number Generation
    4.3.2 Photon launch
    4.3.3 Step length generation
    4.3.4 Tetrahedron Lookup
    4.3.5 Intersection test
    4.3.6 Interface
    4.3.7 Absorption, roulette, spin, and step finish
    4.3.8 Altera DSP Primitives
    4.3.9 Mathematical operators

5 Results
  5.1 Validation
    5.1.1 Unit Tests
    5.1.2 Assertions
    5.1.3 Conservation of Energy
    5.1.4 Comparison to Reference Simulators
  5.2 Algorithm Profiling
    5.2.1 Operation Frequency
    5.2.2 Memory Access
  5.3 Software Performance
    5.3.1 Caching
    5.3.2 Comparison to TIM-OS
    5.3.3 Multi-Threading
    5.3.4 wmin parameter
    5.3.5 Summary
  5.4 Hardware Performance
    5.4.1 Area Requirements
    5.4.2 Power Consumption
  5.5 Architecture Scalability
    5.5.1 Larger Meshes
    5.5.2 Parallelism for Greater Throughput
    5.5.3 Cost of Scale-Up
  5.6 Summary

6 Conclusions and Future Work
  6.1 Conclusions
    6.1.1 Contribution summary
    6.1.2 FullMonte Software
    6.1.3 FullMonte Hardware
  6.2 Future Work
    6.2.1 FullMonte Software
    6.2.2 FullMonte Hardware
    6.2.3 New Acceleration Platforms
    6.2.4 Applications
  6.3 Summary

Bibliography


List of Tables

2.1 Summary of relevant tissue optical properties with typical values in the optical window, from Cheong [11]

2.2 Comparison of existing simulators with key features: geometry, absorption scoring, anisotropy, refraction, non-scattering voids, time-resolved data, and acceleration methods: FPGA (Nx) = FPGA with N instances per chip; MT = multithreading; SIMD = Intel SSE instructions, automatic or manual optimization; asterisk indicates planned future work

4.1 Core FPGA data structures for packet, geometry, and material representation

5.1 Test cases and variants used to evaluate operation complexity vs. run time

5.2 Comparison of FullMonte and TIM-OS run times for the Digimouse standard albedo case

5.3 Run-time impact of changing wmin for three different Digimouse albedo scenarios

5.4 Area required for a single instance on a Stratix V A7 device

5.5 Performance and energy-efficiency comparison (FPGA vs. CPU) at 210 MHz clock rate

5.6 Resource estimates for 8-pipeline cache hierarchy (DRAM peak b/w is 348M/sec, so needs 27% efficiency); * assuming 2 instances share 1 physical RAM; based on Digimouse profiling


List of Figures

2.1 Absorption spectrum of principal tissue chromophores from Vogel and Venugopalan [65], showing the tissue optical window from 630-1000nm

2.2 Depiction of High-Resolution Diffuse Optical Tomography (HR-DOT) setup from Habermehl et al [30]

2.3 Side-by-side depiction (L to R) of BLI image, CT scan, PET image, and dissection photograph of nude mouse with a bioluminescent xenograft tumour, reproduced from [49]

3.1 Overview of hop, drop, spin flow

4.1 Block diagram for FPGA implementation, with stages requiring random numbers shaded; the boxed group is actually a single block but is expanded to show packet flow; see Fig 5.5 for event frequency details

4.2 BSV example showing use of Randqueue to queue up random numbers

5.1 Results comparison (FullMonte software vs. TIM-OS) for Digimouse energy per surface element

5.2 Results comparison (FullMonte software vs. TIM-OS) for Digimouse energy per volume element

5.3 Validation of FullMonte hardware simulation vs. FullMonte software

5.4 Photon packet event frequency

5.5 Algorithm flow graph annotated with transition probabilities (edges) and average per-packet operation counts (nodes) for Digimouse at standard albedo

5.6 Cacheability of four different test cases, showing relatively low hit rate for LRU cache at top left/right (note logarithmic scale for cache size); static Zipf cache at bottom left is better; bottom right shows L2 hit rate for two options with Digimouse (std): hybrid (L1 LRU, L2 LFU) requires 2377 elements for 50% hit rate, while pure LRU (L1 LRU, L2 LRU) requires 8246

5.7 Software run time vs. operation count: Mints and Mabs for a variety of test cases, showing Mints as a predictor for run time

5.8 Result standard deviation vs. result value at varying wmin values (Digimouse surface emission at standard albedo), with vertical line showing 16-bit dynamic range

5.9 Sandy Bridge i7-2600K die photo from Anandtech [61], showing the very large area dedicated to caching

5.10 Hardware block diagram of FullMonte (top) and FBM (bottom) showing latency with core-loop edges in black; maximum loop latency is 100 for FBM and 52 (18) for FullMonte

5.11 Proposed cache architecture


List of Mathematical Symbols

Statistics and Random Variables

E[X]       Expectation of random variable X
Pr[E]      Probability of some event E
Var[X]     Variance of random variable X
cv(X)      Coefficient of variation for random variable X, cv(X) = √Var[X] / E[X]
Bp         Bernoulli distribution which returns 1 with success probability p, else 0
Uij        Uniform distribution with output i ≤ x < j
Eµ         Exponential distribution with CDF F(x) = 1 − e^(−µx) and mean 1/µ
Fk(x)      Cumulative distribution function (CDF) for a distribution with parameter k
fk(x)      Probability density function (PDF) for a distribution with parameter k
Fk⁻¹(x)    Inverse CDF (ICDF) for a distribution with parameter k

Photon packet properties

p [cm]     Position
d          Direction
a, b       Auxiliary unit vectors orthonormal to d used in the scattering calculation
q [cm]     Intersection of ray with material boundary
s [cm]     Physical distance to intersection
l          Base-2 dimensionless step length [0, ∞)
t [ns]     Time
w          Weight (energy, or equivalently the expected number of photons)

Geometry

Ri         Discrete homogeneous region
Si         Discrete surface element
V[R] [cm³]   Volume of region R
A[S] [cm²]   Area of surface element S
i, j, k    Unit vectors along the x, y, z axes respectively
n          Interface normal vector
Ci [cm]    Tetrahedron face constant (i ∈ [1, 4])

Tissue Optical Properties

g          Anisotropy factor g = E[cos θ], where θ is the deflection angle
n          Refractive index
α          Albedo
β          Persistence (number of steps from unit weight to roulette), β = −1 / ln α
µa [cm⁻¹]    Absorption coefficient
µs [cm⁻¹]    Scattering coefficient
µ′s [cm⁻¹]   Reduced scattering coefficient, µ′s = µs(1 − g)
µt [cm⁻¹]    Total attenuation coefficient, µt = µs + µa (reciprocal of the Mean Free Path)
ρ [mol L⁻¹]  Concentration of absorbers
σs, σa [m²]  Scattering (absorption) cross-section per molecule
ε [cm⁻¹ mol⁻¹ L]  Molar extinction coefficient (molar absorptivity)

Physical Constants

NA [mol⁻¹]    Avogadro's number, 6.022 × 10²³
c0 [cm ns⁻¹]  Speed of light in vacuum, 29.98 cm ns⁻¹
h [J s]       Planck's constant, 6.626 × 10⁻³⁴

Simulation Parameters

N0         Total number of packets launched
m          Probability of roulette survival
wmin       Minimum packet weight to trigger roulette

Simulation Outputs

φ(x, t) [J s⁻¹ cm⁻²]  Fluence rate (energy flux) at point x
Φ(x) [J cm⁻²]         Fluence, Φ = ∫ φ(x, t) dt
ΦV[R] [J cm⁻²]        Average fluence over the volume of region R
EV[R] [J]             Total energy deposited in region R
ΦA[S] [J cm⁻²]        Average fluence passing through a surface S
EA[S] [J]             Total energy passing through surface S

Hardware-Related Symbols

ε          The smallest representable value in a given number system
C          Number of pipeline registers inserted into a dependence loop
fc [MHz]   Core computational clock frequency
fmax [MHz]  Maximum achievable system clock frequency
T          Reciprocal throughput
L          Latency (clock cycles)


Glossary and List of Abbreviations

BLI Bioluminescence imaging

CPU Central Processing Unit

CT X-ray Computed Tomography

CUDA Compute Unified Device Architecture, a GPU programming language by NVIDIA

CUDAMC CUDA-based time-resolved MC for semi-infinite homogeneous non-absorbing media

CUDAMCML A CUDA (GPU) implementation of MCML

CW Continuous-wave

DOS Diffuse Optical Spectroscopy

DOT Diffuse Optical Tomography

fNIRS Functional Near-Infrared Spectroscopy (synonym for DOT)

FPGA Field-Programmable Gate Array

GPGPU General-Purpose computing on Graphics Processing Unit

GPU Graphics Processing Unit

GPU-MCML GPU implementation of MCML

HLS High-Level Synthesis

HNC Head and Neck Cancers

IPDT Interstitial (within the body) PDT

MCML Monte Carlo for Multi-Layered media

MC Monte Carlo

MCX Monte Carlo Extreme, a voxelized GPU-based simulator

MFP Mean Free Path (µt⁻¹)

MMCM Mesh-Based Monte Carlo Method

MRI Magnetic Resonance Imaging

NIRFAST Near-Infrared Fluorescence And Spectral Tomography, a Matlab-based diffusion solver

PDT Photodynamic therapy

PS Photosensitizer

PT Photodynamic Threshold, a dose definition

RNG Random Number Generator

RTE Radiative Transfer Equation

RTL Register-Transfer Level (detailed hardware design of a digital system)


SFDI Spatial Frequency-Domain Imaging

SFMT SIMD-Oriented Fast Mersenne Twister

SIMD Single Instruction Multiple Data

SMT Simultaneous Multi-Threading: multiple threads sharing one core (Intel “Hyperthreading”)

SPMD Single Program Multiple Data

SSE Intel SIMD Streaming Extensions (vector instructions)

TIM-OS Tetrahedral Inhomogeneous Mesh Optical Simulator


Chapter 1

Introduction

1.1 Medical Uses of Light

Many important medical applications make use of light in the “optical window”, which is generally defined as wavelengths from deep red (≈ 630nm) into the near infrared (≈ 1060nm). The region is so named because absorption by common tissue and blood constituents is at a minimum there [65], allowing light to travel large distances into the body. Light at these wavelengths is non-harmful, generally inexpensive to produce and detect, and easily guided by optical fibres, which can be applied at surfaces, through endoscopes, or inserted using needles.

As an imaging and detection method, light in the optical window can provide in vivo functional information through non- or minimally-invasive means using highly portable devices. This contrasts with other imaging modalities such as magnetic resonance imaging (MRI), which is very expensive and non-portable. Ionizing radiation such as x-rays (including CT scans) is likewise non-portable and gradually harmful as the dose accumulates. Positron emission tomography (PET) provides functional information based on glucose uptake but requires injection of a radioactive tracer, which should be kept to a minimum for human patients. The absence of all these drawbacks makes light an attractive choice for medical sensing.

As a treatment technology, light can be used in a targeted way to destroy unwanted cells, including cancer. Red and near-infrared light does not inherently have any cumulative toxic effect. As a result, unlike ionizing radiation such as x-rays, light-based treatments do not have any inherent limit on the number of times they can be applied. This is particularly important for treating conditions like cancer where local recurrences may happen, requiring re-treatment. In some cases, ionizing radiation treatment may not be usable a second time due to the damage accumulated during the first treatment.

The utility of light in this window is limited by the fact that biological tissues are very turbid, meaning they scatter light strongly. Hence, any light which travels through more than a fraction of a millimetre of tissue will be scattered and become diffuse rather than focused. In addition, any in vivo sensing or imaging will have to contend with background scatter from surrounding tissue, which reduces the contrast. Accurate calculation of scattered light propagation is therefore essential for the design and function of medical devices, as well as for correct interpretation of results from measurements made with light. Safe and effective therapeutic use (such as destroying unwanted cells) depends on the ability to predict the light distribution and hence the distribution of absorbed energy within the tissue. A few examples are highlighted below as motivation for the research presented here. Greater detail on state-of-the-art applications and simulation methods is presented in Chapter 2.

Each solution technique directly or indirectly uses a (possibly approximated) form of the Radiative Transfer Equation (RTE), which is the basic conservation law governing light propagation in turbid media. It states that, at each point, the photon flux in a given direction is governed by the incident flux in that direction, minus losses due to scattering and absorption, plus scattering from other directions into this direction. Analytic methods and numerical simulations of light transport problems make use of varying techniques to produce solutions that conform to the RTE.
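For reference, a standard steady-state form of the RTE (textbook notation, illustrative rather than quoted from this thesis: L is the radiance at position r in direction ŝ, p is the scattering phase function, and q is a source term) can be written as:

    \hat{s}\cdot\nabla L(\mathbf{r},\hat{s})
      = -(\mu_a+\mu_s)\,L(\mathbf{r},\hat{s})
        + \mu_s\int_{4\pi} p(\hat{s}'\cdot\hat{s})\,L(\mathbf{r},\hat{s}')\,d\Omega'
        + q(\mathbf{r},\hat{s})

The right-hand terms correspond respectively to loss by absorption and out-scattering, gain by scattering from all other directions into ŝ, and any local source, matching the verbal statement above.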

1.1.1 Photodynamic Therapy

Photodynamic Therapy (PDT) [70] is an interesting and promising emerging medical application of light. It is a targeted, minimally-invasive treatment used to selectively kill diseased cells, including cancer cells or bacteria. The patient is given a non-toxic photosensitizer (PS) which is sensitive to light. When exposed to light of a specific wavelength, the photosensitizer excites the oxygen normally present in living tissue into a reactive form with a short lifetime. The excited oxygen quickly reacts with proteins and lipids in cells, causing damage that leads to cell death if the accumulated damage is sufficient. Since the oxygen radicals have a short lifetime, the effects are confined to the immediate area where light exposure, photosensitizer, and tissue oxygen overlap, so there is little to no systemic toxicity.

The optimal treatment plan provides a dose which damages all target cells while minimizing collateral damage to nearby organs at risk. Without an accurate model of light transport in tissue, it is impossible to predict the light energy received, and hence the PDT dose delivered and the treatment outcome. There is not yet a sufficiently fast and accurate light propagation simulation with related dose-evaluation and treatment-planning software for this application, which has been a barrier to the use of interstitial PDT for complex anatomy. The current state of PDT research and clinical use is summarized in Section 2.1.1. The goal of this thesis is to provide part of the solution, namely the fast and accurate light-propagation calculation which will enable progress in dose evaluation and treatment planning.

1.1.2 Bioluminescence Imaging

Another important application which relies on fast and accurate simulation of light propagation through tissue is bioluminescence imaging (BLI) [54]. BLI is a popular research tool used on small animals in which a cell line of interest presenting a disease is transfected with a gene which causes it to produce a protein that luminesces (produces light) without application of an external excitation source. This enables monitoring the spread of that cell line by observing the luminescent emissions using a low-light camera. Most BLI work is currently qualitative, using the images to track the progression and spread of disease. Quantitative BLI (QBLI) [46], also known as Bioluminescence Tomography (BLT), is an emerging technique which attempts to reconstruct an accurate geometric model of the volume of interest using knowledge of the anatomical structure (usually obtained by MR or CT), optical properties, and further assumptions about the volume of interest. This information is provided as constraints to a numerical solver which tries to find a simulation geometry that minimizes the difference between the simulated light pattern and the observed pattern. Given sufficient time and computing power, it is possible to obtain quantitative functional information about the volume of interest.

1.2 Inverse Problems

The applications introduced above, and many others, rely on solving a mathematical inverse to the RTE, where a volume description (geometry, optical properties, sources) is sought which gives a particular pattern of fluence. Since no closed-form analytic solution exists for complex geometries, it is generally necessary to solve the problem using iterative techniques. The large number of iterations required makes the successful use of such techniques dependent on a fast and accurate implementation of the forward solution to the RTE, so that many candidate solutions may be tried to find the best.

Taking PDT for instance, a desired dose is defined in terms of constraints on energy per unit volume within the treatment volume, and the necessary source configuration must be solved for. A physician will define the target volume, organs at risk, and the desired dose parameters for the different structures. The goal is often described as a minimum dose to be delivered to the target tissue and a maximum dose not to be exceeded for nearby healthy tissue. Thus, a treatment plan specifies, for a specific patient and target dose profile, the number of fibre-optic sources, the source positions and shapes, and the total input light intensity and duration for each. With such a large number of free parameters, the space of possible treatment plans is likewise large. Lacking an analytic solution giving a treatment plan from a problem definition, it is necessary to start from one or more guesses and successively refine them. Evaluation of each candidate refinement requires calculation of a separate forward simulation, and we anticipate that hundreds or thousands of such simulations may be necessary for optimization.
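As an illustration only, the overall structure of such an iterative planning loop might look like the following C++ sketch. All names here (TreatmentPlan, runForwardSimulation, doseObjective, refinePlan) are hypothetical placeholders, not FullMonte's actual interface; the stub bodies stand in for the expensive forward solver and the clinical scoring function.

    #include <limits>
    #include <vector>

    struct TreatmentPlan { std::vector<double> fibrePowers; };      // hypothetical: source positions/powers/durations
    struct DoseMap       { std::vector<double> energyPerElement; }; // hypothetical: energy per unit volume per element

    // Stand-ins only: in practice these would wrap the Monte Carlo forward solver
    // and a dose score (target underdose plus organ-at-risk overdose penalties).
    DoseMap runForwardSimulation(const TreatmentPlan&) { return DoseMap{}; }
    double  doseObjective(const DoseMap&)              { return 0.0; }
    TreatmentPlan refinePlan(const TreatmentPlan& p, double /*score*/) { return p; }

    TreatmentPlan optimizePlan(TreatmentPlan candidate, unsigned maxIterations)
    {
        TreatmentPlan best = candidate;
        double bestScore = std::numeric_limits<double>::infinity();

        for (unsigned i = 0; i < maxIterations; ++i)
        {
            // Each candidate evaluation costs one complete forward simulation,
            // which is why the forward solver must be as fast as possible.
            const double score = doseObjective(runForwardSimulation(candidate));
            if (score < bestScore) { bestScore = score; best = candidate; }
            candidate = refinePlan(best, bestScore);
        }
        return best;
    }

The key point the sketch makes concrete is that the forward simulation sits inside the innermost loop, so its run time multiplies directly into the total planning time.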

In bioluminescence imaging, the goal is similar except that the difference between observed and simulated surface emission should be minimized. Given an observed distribution of light, the researcher wishes to find the distribution of sources which gave rise to the observation. The problem is generally constrained by additional anatomical information, either a reference anatomy or structural information from other imaging modalities (MRI, CT). Again, the minimization problem will likely take hundreds or thousands of iterations of the forward simulation.

The common factor in these and other techniques is that a large number of forward simulations must be conducted to arrive at a solution. Research and clinical relevance demand that the overall cost and computation time be reasonable prior to widespread adoption. Hence, this thesis focuses on making biophotonic simulations as fast as possible as an enabler of a wide variety of important optical techniques in medical research and, ultimately, clinical practice.

1.3 Contributions

Given the need outlined above for fast and accurate simulations of light propagation from one or more sources within a heterogeneous tissue volume, we first investigated and produced a fast simulator. We started by understanding and improving on the best available software. The number of computational cores and the energy required to achieve practically useful simulation times were deemed excessive, so we investigated faster and more efficient computational platforms. Next, an implementation using custom digital logic was undertaken to provide further integer-factor gains in performance and power efficiency. The principal contributions presented in this thesis are as follows:

• The fastest tetrahedral-mesh-based¹ software Monte Carlo light propagation model available

• A novel, faster, and more hardware-friendly method for computing scattering

• The first FPGA-based implementation of a tetrahedral-mesh-based Monte Carlo light scattering simulator

• A hardware-accelerated simulator which is faster (3x) and more power-efficient (40x) than a CPU

1.4 Organization of Thesis

The balance of the thesis is organized as follows. A more thorough review of relevant applications, the physics of diffuse light propagation, and the current state of the art in simulation methods is presented as background material in Chapter 2. Next, Chapter 3 presents the FullMonte C++ CPU-based software model. It is the fastest existing simulator in its class, and incorporates several novel features to enhance performance and customizability. Based on the software model, a hardware implementation using Field-Programmable Gate Arrays was created, as described in Chapter 4. Chapter 5 shows simulation results in terms of functional validation, power consumption, and performance for the two models. Finally, Chapter 6 presents a discussion of future feature enhancements, performance scale-out, and application work to be done.

¹ A tetrahedral mesh description is, as explained later, the most flexible and accurate geometry model for light propagation simulations.


Chapter 2

Background

This chapter provides context for the research presented in the balance of the thesis. We start with a discussion of therapeutic, diagnostic, and research applications of diffuse light. Next, we present a brief summary of light-tissue interactions and the optical properties which are relevant to propagation in tissue. In Section 2.3, we give a more formal definition of the forward problem solved by the FullMonte simulators, as well as an abstract description of the simulation inputs and outputs. Section 2.4 introduces the two principal solution algorithms: finite element with the diffusion approximation, and Monte Carlo. It also gives a detailed description of the Monte Carlo algorithm used in FullMonte, but without implementation details. The chapter ends with a summary of the state of the art in diffuse propagation simulators, and an introduction to the most common technologies for accelerating computation, to place the FullMonte hardware effort in context.

2.1 Applications of Diffuse Light

The relatively low absorption of living tissue in the tissue optical window ranging from dark red to infrared (see Fig 2.1) presents a useful means for transporting energy into and out of tissue. Even more importantly, a number of important tissue constituents have distinctive spectral features within this band. Consequently, many medical applications use light in this range to measure and control biological processes, through even several centimetres of scattering tissue. Several such applications are reviewed below to motivate the research undertaken.

Photodynamic therapy (PDT, Sec 2.1.1) is a light-mediated treatment where chemical reactions are caused by the absorbed photons, requiring careful control of the fluence rate throughout the planning volume. Fluorescence and absorption imaging methods like Diffuse Optical Tomography (DOT, Sec 2.1.2) rely on both an excitation input and an observed return to extract information on the distribution of fluorescent or absorbing molecules. Bioluminescence Imaging (BLI, Sec 2.1.3) uses light emitted from within tissues that arrives at the skin surface to gain functional information. Diffuse Optical Spectroscopy (DOS, Sec 2.1.4) makes use of the variation of optical properties across different wavelengths to infer material composition and hence physiological parameters. All of these applications exploit the tissue optical window, and work entirely with light that has been scattered many times. For all of them, accurate propagation simulations and knowledge of optical properties are essential for correct interpretation of measurements or achievement of intended results.

Figure 2.1: Absorption spectrum of principal tissue chromophores from Vogel and Venugopalan [65], showing the tissue optical window from 630-1000nm

2.1.1 Photodynamic Therapy (PDT)

Introduction to PDT

Photodynamic therapy (PDT) is a minimally-invasive treatment for a number of medical conditions including cancer and bacterial infections that destroys diseased cells, where the level of damage is a function of the light intensity. It uses a photosensitizer (PS) which is either applied topically (for superficial treatment) or given intravenously to be absorbed by the patient and selectively retained by the target cells. When the photosensitizer has oxygen nearby and is exposed to photons in its absorption band, it excites the oxygen into a short-lived reactive state. A reaction then quickly occurs with proteins and lipids, causing cell damage in the immediate area of photon absorption. If sufficient damage is accumulated, the cell will die through either apoptosis or necrosis depending on the degree of damage. Therefore, PDT offers a light-mediated method of selectively killing cells, which means that treatment safety and effectiveness depend on having an accurate model of light propagation to evaluate the PDT dose to be delivered.

Interstitial PDT (IPDT) is the use of photodynamic therapy within the body using light delivered by optical fibres inserted via one or more needles. Use of PDT for non-superficial applications complicates the treatment planning effort since it offers far more degrees of freedom in light configuration, and has the potential to treat lesions closer to organs at risk deep within the body. In order to plan a safe and effective treatment, dose definitions such as the Photodynamic Threshold model exist [45] [25] and rely critically on the distribution of three factors: PS, light fluence, and tissue oxygen. This research aims to provide a means for fast and accurate prediction of fluence, thereby advancing interstitial PDT for complicated anatomy towards clinical utility.

Current Clinical Status

PDT is currently approved for a number of superficial indications and has been used with great success for a number of applications: skin lesions including actinic keratosis [62] and some skin cancers [27]; Barrett's oesophagus [29], a pre-malignant lesion; and bladder cancer [42]. In each of these applications, the target region is accessible from the surface, extends only a few millimetres in depth, and can be illuminated from a large surface area. Conversely, interstitial PDT requires light delivery within the body, usually by optical fibres placed via needles. One of the critical factors in PDT, particularly in the interstitial case, is fast and accurate treatment planning.

The state of research in PDT as of 2008 is summarized in a review by Wilson and Patterson [70]. A number of locations are active in PDT research, with treatments for various indications in clinical trials being administered to patients. Encouraging results have been found for interstitial PDT of pancreatic cancer [32] in humans, where it was concluded that such treatment was safe and possibly efficacious, and that tumour necrotic volume was proportional to the dose delivered. In 2004, D'Cruz et al [18] presented a study of 128 patients receiving PDT for advanced head-and-neck cancer (HNC) that was accessible to superficial illumination. Median survival was significantly improved for patients who showed a complete initial response compared to those who did not.

Biel [3] presents a summary of over 1,500 patients treated within the previous 18 years with PDT for HNC. Notably, the study used no treatment planning, instead following a standard bodyweight-proportional dose of PS and a constant light dose. For superficial tumours accessible by laryngoscope, a fixed light intensity was delivered to target a fixed range of surface fluence values (J/cm²). For larger tumours (depth exceeding 3mm), cylindrical diffusers were implanted within the tumour bed with a fixed 1cm spacing. Patients with laryngeal carcinoma in situ and stage I-II tumours without node involvement received PDT alone, were all discharged home on the same day, and showed a five-year cure rate of 90% without significant side effects. Multi-institutional phase II and III trials were completed demonstrating efficacy for early primary and recurrent cancers. Biel also reports a small (18-patient) clinical trial in which PDT was used as an intraoperative adjuvant to surgery, and summarizes another fourteen patients treated intraoperatively by Dilkes, where two cases were disease-free after five months but two others had carotid blowouts, a serious morbidity likely attributable to overexposure. This suggests that more precise treatment planning may be beneficial to avoid overdosing structures at risk. Other trials discussed within the summary showed promise in palliation of late-stage disease as well as very strong cure rates for early disease.

Davidson et al [17] reported in 2009 on a Phase II clinical trial for prostate cancer using TOOKAD for vascular-targeted interstitial PDT. This was the first trial with patient-specific treatment planning for prostate PDT, which was conducted using the diffusion model since the prostate is a relatively homogeneous organ. The authors note that speed of solution is critical to clinical utility, since the treatment plan must be updated during treatment due to shifting fibre positions, changing optical properties, and changing photosensitizer concentrations. It was demonstrated there and elsewhere [21][52] that PDT is a viable treatment option for prostate cancer, despite some inter-patient variability in photosensitizer concentration.

The Need for Treatment Planning

Jacques [36] highlights the importance of tissue optical properties in treatment planning to control the fluence delivered. While superficial PDT is inherently limited to a few millimetres of depth, interstitial PDT delivers light below the surface, which places it closer to potential organs at risk. For skin cancers, a multi-layered model is often adequate, so radial or even planar symmetry can be assumed, which is not possible in the more general case. As a result, interstitial PDT will require more complex planning and have higher consequences for error, particularly if used in the head and neck, which have a large number of sensitive structures. For general use, the clinical target volume will have a significantly more complex anatomy than the prostate, where the entire gland (healthy or not) can be treated and even overexposed, though there are organs at risk which must be protected (urethra, rectum). This thesis confines itself to modeling of the light distribution as a first step to a complete PDT treatment planning system.

2.1.2 Diffuse Optical Tomography

One of the applications that has pushed modeling of turbid media forward is Diffuse Optical Tomography (DOT), also known as Functional Near-Infrared Spectroscopy (fNIRS), a technique which uses mathematical tomographic techniques to reconstruct three-dimensional images of absorption contrast through scattering media from measured transmission between pairs of sources and detectors. It often acts as a complement to functional MRI (fMRI), providing similar information but via different mechanisms and with different costs and benefits.

A typical DOT setup has multiple (tens of) light sources operating at at least two wavelengths, with detectors placed around the volume of interest, as shown in Fig 2.2. The signal propagating between each source-detector pair is measured at multiple wavelengths. One of the difficulties of the technique is that the detected light has been scattered a very large number of times and arrived via a multitude of paths spanning a large volume of tissue. However, localized perturbations in optical properties can be inferred from the measurements given some a priori knowledge about the geometry and optical properties of the target volume, a light-propagation simulator, and an iterative algorithm. In many cases, the changes of interest are in the concentrations of oxy- and deoxy-hemoglobin, which are prominent absorbers in the red wavelengths and can be discriminated from one another by a suitable choice of wavelengths in the dark red (see Fig 2.1). These signals provide a view of cerebral hemodynamics, which can be used to diagnose disease [50] and to learn about normal brain function [40]. Functional MRI using the BOLD (Blood Oxygen Level Dependent) signal is the established technique for making such measurements; however, its cost, relatively slow acquisition time, and lack of portability make it less than ideal. One area where DOT and its precursor fNIRS (functional Near-Infrared Spectroscopy) show significant potential is continuous monitoring of brain oxygenation, both for premature infants in neonatal intensive care [43] and for stroke victims. MRI is not suitable for continuous monitoring due to cost, size, and comfort concerns, giving a significant advantage to optical techniques for such applications.

Figure 2.2: Depiction of High-Resolution Diffuse Optical Tomography (HR-DOT) setup from Habermehl et al [30]

Since the collected light did not travel a straight path, the core of the DOT technique is the ability to simulate light propagation through the target volume of interest. Determining the amount and location of the perturbations of optical properties which caused a given optical signal is a mathematical inverse problem. No closed-form solution exists for the relevant geometries, so it requires many candidate solutions to be tried. Result quality is strongly linked to the quality of simulation, and the method's utility is limited by the computational requirements. Several of the software packages described later in this chapter (tMCimg, MCX) were originally designed to support DOT of the brain, and FullMonte could be used for this purpose as well.

Figure 2.3: Side-by-side depiction (L to R) of BLI image, CT scan, PET image, and dissection photograph of nude mouse with a bioluminescent xenograft tumour, reproduced from [49]

2.1.3 Bioluminescence Imaging

Bioluminescence Imaging (BLI [54]) is the use of genetically-encoded luminescent proteins to trace cell lines of interest in vivo. For instance, by inserting the correct gene into a cancerous tumour and implanting that tumour in a small animal model, it is possible to watch the progress of the disease throughout the body, including the formation of metastases. An example in a nude mouse model is shown in Fig 2.3. With the success of DOT and other diffuse sensing and imaging modalities, interest has been increasing in quantitative techniques for BLI.


2.1.4 Diffuse Optical Spectroscopy

Diffuse optical spectroscopy (DOS), also known as photon migration spectroscopy, is a non-invasive technique for detecting disease and monitoring its response to treatment. Based on the premise that the optical properties of tissue differ between healthy and diseased tissue, DOS aims to extract information from measurements of the optical properties at multiple wavelengths. For in vivo diagnostics, though, it is not possible to disentangle the effects of scattering and absorption using only continuous-wave light sources. The absorption measured is a function of the absorption coefficient and the path length travelled, which depends on scattering. In the absence of scattering, transmission at a given wavelength follows Beer's law as it would in a cuvette of non-scattering liquid, T = exp(−εLc), where the extinction coefficient ε depends on the absorber and the wavelength. Measuring at multiple wavelengths, given knowledge of ε for the chromophores present in the tissue, allows inference of the concentrations c within the interrogated tissue. However, the presence of scatterers means that the expected path length L taken by a detected photon, and hence its probability of being absorbed, is longer than the physical source-detector distance. In a non-homogeneous, non-infinite medium, that expected path length L is also a function of the tissue geometry and boundary conditions. Consequently, the results of CW-DOS may be improved by a more accurate simulation of light propagation if the tissue geometry is known. A prime example of this effect is DOS of the human breast [64], which requires a scattering model to produce a useful fit. Variants of the technique such as Spatial Frequency-Domain Imaging (SFDI) [16] also rely on models of light transport through tissue. While techniques using pulsed or temporally modulated light sources to measure optical path length exist, the equipment required is very complicated and expensive, making computing-based approaches with CW sources more desirable.
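To make the chromophore-unmixing step concrete, a minimal sketch using standard Beer-Lambert algebra (not a derivation taken from this thesis): in the absence of scattering, the attenuation at wavelength λ for a mixture of chromophores with concentrations c_i is

    A(\lambda) = -\ln T(\lambda) = L \sum_i \varepsilon_i(\lambda)\, c_i

so measurements at as many wavelengths as there are chromophores yield a linear system, e.g. for oxy- and deoxy-hemoglobin at two wavelengths:

    \begin{pmatrix} A(\lambda_1)\\ A(\lambda_2) \end{pmatrix}
      = L \begin{pmatrix}
            \varepsilon_{\mathrm{HbO_2}}(\lambda_1) & \varepsilon_{\mathrm{Hb}}(\lambda_1)\\
            \varepsilon_{\mathrm{HbO_2}}(\lambda_2) & \varepsilon_{\mathrm{Hb}}(\lambda_2)
          \end{pmatrix}
          \begin{pmatrix} c_{\mathrm{HbO_2}}\\ c_{\mathrm{Hb}} \end{pmatrix}

In scattering tissue the effective path length L becomes wavelength- and geometry-dependent, which is precisely the quantity the propagation model must supply.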

2.2 Tissue Optics

This section presents a brief overview of tissue optics. Jacques [37] presents an overview of tissue optical properties, and a compilation of values for a wide variety of wavelengths and tissue types.

The primary optical effects of interest in tissue are scattering and absorption, which occur frequently (≈ 10 − 1000/cm). In the optical window previously discussed, scattering is one to two orders of magnitude more prominent than absorption, so any given photon will likely scatter a large number of times before being absorbed. Additionally, when the refractive index n differs between regions, the normal physics involving internal reflection, Fresnel reflection, and refraction must be considered.
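As a purely illustrative numeric example (values chosen within the range quoted above, not measured properties of any particular tissue), taking µs = 100 cm⁻¹, µa = 1 cm⁻¹, and g = 0.9 and applying the definitions from the symbol list:

    \mu_t = \mu_a + \mu_s = 101\ \mathrm{cm^{-1}}
      \;\Rightarrow\; \mathrm{MFP} = 1/\mu_t \approx 0.0099\ \mathrm{cm} \approx 0.1\ \mathrm{mm}
    \qquad
    \mu_s' = \mu_s(1-g) = 10\ \mathrm{cm^{-1}}
      \;\Rightarrow\; 1/\mu_s' = 1\ \mathrm{mm}

That is, a photon scatters on average roughly every 0.1 mm and has its direction effectively randomized over roughly 1 mm, consistent with the earlier observation that light becomes diffuse after a fraction of a millimetre of tissue.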

When a polarized light source such as a laser is used, there is the possibility of observing

polarization-dependent effects, which are generally fairly small signals caused by tissue bire-

fringence and chiral activity. The present work focuses on multiply-scattered light on length

scales that would generally make measurement of polarization-related effects difficult. All of

Page 26: by Je rey Cassidy - University of Toronto T-Space · Je rey Cassidy Master of Applied Science Graduate Department of Electrical and Computer Engineering University of Toronto 2014

Chapter 2. Background 12

the applications presented above can safely neglect polarization. Some interesting specialized

biophotonic measurements using diffuse polarimetry have been proposed by Ghosh et al [28],

but have yet to become mainstream.

Coherence effects such as speckle are also generally ignored when modeling propagation

in turbid media for several reasons. Even though PDT often uses lasers which are coherent

light sources, the treatment time is long enough (minutes) that even slowly-changing speckle

patterns average out during the treatment time due to gradual shifting of tissues. In other

applications relating to fluorescence and bioluminescence, coherence is not relevant since the source is

incoherent. Lastly, for macroscopic applications it is not possible to produce a sufficiently fine

description of the target material that a meaningful simulation output would result.

It should also be mentioned that no non-linear effects are modeled. The Monte Carlo for-

mulation used in this work relies on the assumption that photon trajectories, scattering, and

absorption probabilities are independent of local fluence rate. As a result, harmonic genera-

tion, two-photon absorption, and Raman scattering are fundamentally not possible to model within this

framework. For the applications summarized above, though, the power used is low enough that

nonlinear effects are insignificant.

2.3 Light Propagation Models

A forward problem description is a complete description of a situation for which the propagation

is to be modeled. It consists of:

1. A geometry description, consisting of one or more regions, each with an associated material

2. A set of materials with all relevant optical properties defined

3. A set of light sources with distribution parameters and weights

4. A definition of the output data to be collected

The light propagation models described below produce one or more sets of output data for

a given input forward problem description. Before introducing solution methods (Sec 2.4), we

first discuss in detail the problem definition below.

2.3.1 Geometry Descriptions

A number of different geometry descriptions are possible when modeling turbid media. Each

geometry consists of a set of regions Ri, each of which has a boundary with defined surface

normals n, an associated material, and a set of adjacent regions. Region descriptions range

in complexity but must support as a minimum testing whether a point p is within the region,

finding the point q where a ray intersects the boundary, calculating the volume V [R] and

specifying which region is adjacent at that point.


Infinite

The simplest problem-geometry description is an infinite homogeneous medium. Under the

diffusion approximation to transport theory, the infinite case from an isotropic source has an

analytic solution. It is also very simple to simulate via MC, since the optical properties remain

the same regardless of position and there are no material boundaries. Due to the compact

problem representation it allows very simple simulations that can achieve high computational

performance when the analytic diffusion approximation is not appropriate.

Semi-Infinite

The problem complexity is increased only slightly when moving to a semi-infinite medium, in

which there are two materials: one turbid medium of interest (generally some form of biological

tissue), surrounded by another medium (often air). Generally such a problem takes the coordi-

nate z to be depth below the surface, which spans the xy-plane at z = 0. In such a model, there

is a boundary if z changes sign during the step. If the boundary is encountered and there is a

refractive index mismatch, it is necessary to model Fresnel reflection, total internal reflection,

and refraction when computing the step result. The description requires only one additional

parameter (external medium refractive index) beyond the infinite case since the interface loca-

tion z = 0 and normal vector k are implicit. When following a ray (p, d), the physical step

length s along the ray to arrive at the boundary is also simple: s = −pz/dz if dz < 0.

Planar

Among the first widely-used Monte Carlo methods was a model using infinite planar slabs of

material (MCML, Sec 2.5.1). If there are N slabs (usually 5-10) lying in the xy plane and the

photons arrive along the z axis, then there is cylindrical symmetry around that axis. Describing

such a geometry requires only the z coordinate of the lower edge for each of the N slabs, along

with optical properties. Assuming the source distribution also has cylindrical symmetry, the

absorption scoring can be reduced to 2D, Φ(r, z) = Φ(√(x² + y²), z), which reduces the number of

packets required to achieve acceptable result variance. Like the semi-infinite case, the interface

normal is always ±k. The boundary (j = i or i+ 1) faced by the ray can be found by checking

the sign of dz, and then the distance can be found by s = (zj − z)/dz.

Voxelized

To represent more complex geometries, a natural extension is to break the problem into discrete

cubic voxels with each being assigned a material. However, the voxelized geometry description

does not lend itself to accurate description of curved surfaces and particularly does not provide

smooth surface normals for such surfaces. The model has been applied to Diffuse Optical

Tomography (DOT) of the brain [6], where the refractive index is generally matched. Finding

the boundary with another material in this model requires looking up the material ID of every


voxel along the path. For large homogeneous regions, this scheme is inefficient in terms of both

storage space and computational effort, and it provides only a global tradeoff between resolution

and geometry size. Binzoni et al [4] demonstrate some of the shortcomings due to artifacts in

producing surface normals when using the voxelized model to describe curved surfaces with

refractive index differences.

Mesh-based

Three-dimensional volumes with general shapes can also be modeled as the union of a set of

tetrahedra, at the cost of some additional complexity. Methods of handling (Matlab/GNU

Octave) and visualizing (Visualization Toolkit [34]) tetrahedral meshes are well-known from

other applications including Finite Element Analysis. While it lacks the regularity of cubic

voxels, the tetrahedral mesh description has two major advantages.

First, the normal for all interfaces is directly available. For a tetrahedron defined by

counterclockwise-oriented points P1,P2,P3,P4, the normal to the face opposite P4 is found

by normalizing (P2−P1)× (P3−P1). Any of the three other normals can be found by rotat-

ing the point array appropriately. Using these normals, the interior of the tetrahedron is the

intersection of four half-spaces, which is the set {x : ni · x ≥ Ci, i ∈ [1, 4]}. Whether a point

is inside the tetrahedron or not can be tested by direct evaluation of the four conditions just

given. For a more thorough introduction to representations and operations on polytopes (closed

N-D objects bounded by flat sides), the reader is directed to a reference book by O’Rourke [39].

Second, the mesh can be made coarser or finer as needed for the problem; areas which do

not need large amounts of detail have a very compact representation, while curved surfaces

can be progressively refined as a set of piecewise-linear approximations. Consider the case

where a photon must take a step in a homogeneous region. In the voxelized representation, the

algorithm must advance voxel-by-voxel, constantly calculating a new grid index and fetching

the material code for it, possibly many times. By contrast, in the mesh representation a single

tetrahedron can represent an arbitrarily large region, and a step of any size within the region

requires only four intersection tests, one for each face. Only if the step crosses one of the faces

must a new, adjacent tetrahedron be loaded to continue the step.
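As an illustration of the face-normal and half-space tests just described, the following minimal sketch (hypothetical types and names, not FullMonte's actual data structures) tests whether a point lies inside a tetrahedron by checking the four face planes; orienting each normal toward the opposite vertex keeps the test independent of the vertex ordering convention:

#include <array>
#include <cmath>
#include <cstdio>

struct Vec3 { double x, y, z; };

static Vec3 sub(Vec3 a, Vec3 b)   { return { a.x-b.x, a.y-b.y, a.z-b.z }; }
static Vec3 cross(Vec3 a, Vec3 b) { return { a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x }; }
static double dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Unit normal of the face containing P[i0], P[i1], P[i2] (cf. normalizing (P2-P1) x (P3-P1))
static Vec3 faceNormal(const std::array<Vec3,4>& P, int i0, int i1, int i2)
{
    Vec3 n = cross(sub(P[i1], P[i0]), sub(P[i2], P[i0]));
    double m = std::sqrt(dot(n, n));
    return { n.x/m, n.y/m, n.z/m };
}

// Interior = intersection of four half-spaces { x : n_i . x >= C_i }
static bool inside(const std::array<Vec3,4>& P, Vec3 x)
{
    static const int face[4][4] = { {0,1,2,3}, {0,1,3,2}, {0,2,3,1}, {1,2,3,0} };  // 3 face vertices + opposite vertex
    for (const auto& f : face)
    {
        Vec3 n = faceNormal(P, f[0], f[1], f[2]);
        if (dot(n, sub(P[f[3]], P[f[0]])) < 0)   // make n point toward the interior
            n = { -n.x, -n.y, -n.z };
        if (dot(n, x) < dot(n, P[f[0]]))         // C_i = n_i . (any vertex on face i)
            return false;                        // outside this half-space
    }
    return true;
}

int main()
{
    std::array<Vec3,4> P = { Vec3{0,0,0}, Vec3{1,0,0}, Vec3{0,1,0}, Vec3{0,0,1} };
    std::printf("%d %d\n", inside(P, {0.1, 0.1, 0.1}), inside(P, {1, 1, 1}));   // expect 1 0
}

A step's intersection test uses the same four planes: the exit face is the first plane crossed along the ray, which is why at most four tests are needed per step.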

2.3.2 Material Optical Properties

The relevant tissue optical properties and their typical values for turbid media in the opti-

cal window are summarized in Table 2.3.2. Absorption and scattering are specified by their

coefficients, respectively µa, µs, which give the expected number of interactions per unit dis-

tance traveled, typically in cm−1. Their sum µt is the total interaction coefficient, which is the

expected number of interactions (scattering or absorption, which are independent) a photon

has per unit length. Its reciprocal µt⁻¹ [cm] is the Transport Mean Free Path, which is the expected distance traveled by a photon between interactions. The albedo α, derived from the absorption and scattering coefficients as 0 ≤ α = µs/(µa + µs) ≤ 1, is the probability that a


Value   Unit      Range      Typical    Description
µs      [cm⁻¹]    ≥ 0        ≲ 3000     Scattering coefficient
µa      [cm⁻¹]    ≥ 0        ≲ 300      Absorption coefficient
g       -         (−1, 1)    ≳ 0.8      Anisotropy coefficient
n       -         ≥ 1        ≲ 1.5      Refractive index

Table 2.1: Summary of relevant tissue optical properties with typical values in the optical window, from Cheong [11]

given interaction is a scattering event. When the photon scatters, the anisotropy parameter

g = E[cos θ] = E[d′ · d] describes the expected value of the cosine of the deflection angle (the

angle between the direction vector before and after). A value of −1 is perfect backwards reflec-

tion (mirror-like), 0 is biased neither forwards nor backwards (outgoing energy in the forward

and backward half-spheres are equal), positive values scatter dominantly forwards, and 1 indi-

cates no scattering interaction at all1. In some situations g is not used directly, but it modifies

the scattering coefficient to yield a reduced scattering coefficient µ′s = (1 − g)µs which gives

similar behavior assuming the absorption coefficient is small compared to scattering and that

material properties are locally homogeneous.

2.3.3 Source Descriptions

A number of different source descriptions are possible. An non-exhaustive list is presented

below:

• Normally-incident pencil beam (directed beam, delta-function profile)

• Isotropic point

• Isotropic volume

• Directed surface (finite-width beam)

Some of the solution methods may not support all source types due to inherent restrictions

on symmetry, or due to a design choice not to include them. It should be noted that the diffusion

approximation supports only isotropic sources since the diffusion approximation is incompatible

with the notion of a directed beam. Virtual sources [26] can be used to approximate other source

profiles as sums of point sources. It is also possible [67] to model finite-diameter beams through

convolution of infinitely-thin beams, if the geometry is symmetric around the beam.

2.3.4 Output Data

Most often, biophotonic simulations are done in terms of the fluence Φ(x), which is the amount

of light energy passing through an infinitesimal area dA at a point x over some time period.

1 Scattering is elastic, so if the direction/momentum does not change, there was no energy or momentum transfer and hence no interaction.


Typically, units of J/cm² are used. If a single absorber (e.g. a molecule) with absorption cross-section σ [cm²] is exposed to such fluence, it is expected to absorb σΦ(x) joules of energy.

Given a density ρ mol L−1 of absorbers, the total energy they absorb in a volume dV is

E(x) = NAρσΦ(x) dV = µaΦ(x) dV (2.1)

Since the energy of a single photon at wavelength λ is E = hc0/λ, absorbed energy is directly

convertible into a number of photons absorbed. For PDT, the number of photons absorbed by

the PS is proportional to the number of radicals created and hence damage caused. In other

applications, the signal detected is generally proportional to the number of photons arriving at

a camera or detector so fluence is often the most relevant quantity.

As defined, fluence is a continuous scalar field which has an analytical solution only for simple

geometries. To produce an approximate solution to a non-trivial problem, one must resort to

numerical simulation by discretizing the problem and finding piecewise solutions which obey the

RTE to within some tolerance. In those methods, continuous scalar fields such as fluence are

represented by average values over a finite number of regions. As a convention, the discussion

below uses parentheses for continuous fields such as fluence Φ(x), while using square brackets

to denote discrete arrays like ΦV [R] for the average fluence over a discrete volume region R.

Volume Fluence

When discretizing volume, the problem geometry is split into a number of regions Ri with

homogeneous optical properties, which could be described by voxels, tetrahedral elements,

cylindrical sections, or otherwise. The average fluence in a region R with finite volume V[R] can

be found as

ΦV[R] = (1/V[R]) ∫R Φ(x) dV    [J cm⁻²]    (2.2)

When using Monte Carlo methods to simulate light propagation, the simulator scores the

photon absorption (proportional to energy) within the volume, which is ∫R E(x) dV. Using

Eq 2.1 and assuming a homogeneous µa > 0, the average energy per volume can be converted

to fluence:

ΦV[R] = (1/(V[R] µa[R])) ∫R E(x) dV = EV[R] / (V[R] µa[R])    [J cm⁻²]    (2.3)

Surface Emittance

For surface imaging problems such as BLI and DOT, the quantity of interest is actually the

fluence escaping the surface (emittance), which is detected. For discrete surface element S with

area A[S], an average surface fluence can be calculated similarly as

ΦA[S] = (1/A[S]) ∫S Φ(x) dS = EA[S] / A[S]    [J cm⁻²]    (2.4)


Detectors

Some applications such as DOT model use of specialized detectors, typically fibre-optic probes.

In the case of small isotropic diffusers, the result should be not differ significantly from the

fluence in the surrounding tissue. In Monte Carlo simulations it is possible to specify customized

probes and evaluate whether a photon is captured or not using a wide range of criteria.

Time Resolution

For non-continuous-wave applications including DOT or DOS, the fluence within a time window

matters. In these cases, the input light can be considered to be an infinitely short delta-function

in time δ(t), yielding a flux φ(x, t) as a response to that impulse. For modulated systems using

phase-sensitive detection, the amplitude and phase response H(ω) at each detector can be found

from the Fourier transform of the impulse response function. In the case of pulsed systems, the

time histogram h(t) is generally produced directly by time-gating the detector.

A Monte Carlo simulator can produce this by keeping track of the simulation time t since

a packet was launched and splitting recorded fluence into N discrete time bins [ti, ti+1), i ∈ [0, N − 1]. When moving a distance ∆s, the time counter advances according to the speed of light and distance traveled, so ∆t = (n/c0) ∆s. When the packet is absorbed, it is assigned to the

appropriate time bin i and region R so that the recorded energy is

EV[R, i] = ∫[ti, ti+1] ∫R µa φ(x, t) dV dt    (2.5)

from which fluence can be derived using Eq 2.3. A similar treatment can be done for surface

elements or custom detectors.

Compared to non-time-resolved simulation, the cost is N times the storage

space per element, though this can be offset by selecting only a subset of elements to record.

TIM-OS and other simulators already provide such a capability. FullMonte does not yet, though

the software version could easily be upgraded to do so limited only by the size of memory. If

a time histogram is desired for a small number of detectors or surface elements, the hardware

version could also accommodate time resolution limited only by memory capacity.
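As a sketch of what such an upgrade could look like (hypothetical names and interfaces, not an existing FullMonte feature), the fragment below advances a packet clock by ∆t = (n/c0)∆s and accumulates deposited weight into per-region time bins:

#include <cmath>
#include <cstdio>
#include <vector>

struct TimeResolvedScorer
{
    static constexpr double c0 = 2.998e10;     // speed of light in vacuum [cm/s]
    double t0, dt;                             // start of the first bin and bin width [s]
    unsigned Nbins;
    std::vector<std::vector<double>> E;        // E[region][bin]: absorbed packet weight

    TimeResolvedScorer(unsigned Nregions, unsigned Nbins_, double t0_, double dt_) :
        t0(t0_), dt(dt_), Nbins(Nbins_), E(Nregions, std::vector<double>(Nbins_, 0.0)) {}

    // Advance the packet clock after moving ds [cm] in a medium of refractive index n
    static double advance(double t, double ds, double n) { return t + n/c0*ds; }

    // Deposit weight w absorbed in region R at packet time t (events outside the window are dropped)
    void score(unsigned R, double t, double w)
    {
        long i = (long)std::floor((t - t0)/dt);
        if (i >= 0 && i < (long)Nbins)
            E[R][i] += w;
    }
};

int main()
{
    TimeResolvedScorer s(1, 64, 0.0, 50e-12);                 // one region, 64 bins of 50 ps
    double t = TimeResolvedScorer::advance(0.0, 1.0, 1.4);    // a 1 cm step at n = 1.4: about 47 ps
    s.score(0, t, 0.05);                                      // deposit 5% of the packet weight (lands in bin 0)
    std::printf("bin 0 holds %g of the packet weight\n", s.E[0][0]);
}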

2.4 Numerical Solution Implementations

The Radiative Transfer Equation [55] (RTE, Eq 2.6) for a single wavelength is the conservation

relation which must be obeyed for light transport in turbid media. It describes the conditions

for a function L(x, Ω) to be a valid description of radiance at a point x, in direction Ω.

(1/v) ∂tL(x, Ω, t) + Ω · ∇L(x, Ω, t) + µt(x) L(x, Ω, t) = s(x, Ω, t) + ∫Ω L(x, Ω′, t) dµs(x, Ω′ → Ω) dΩ′    (2.6)


In this equation, µt is the interaction coefficient previously introduced and dΩ is an element

of solid angle surrounding the point x with surface normal Ω. Scattering is characterized by

µs(x, Ω′ → Ω), which is the proportion of intensity that is scattered from incident direction

Ω′ into direction Ω. The left side gives three terms for radiance decreases: non-steady-state

pulse propagation; steady-state energy flow; and, energy absorbed or scattered away from the

direction Ω. At right, there are two terms for radiance increases: a source term; and, an

integral over all other directions of the energy scattered into the direction Ω. For steady-state

(non-time-resolved) solutions, the first term is assumed to be zero. We note also that the

bulk scattering coefficient µs must be equal to ∫4π dµs(x, Ω′ → Ω), and that by definition, the variable we want (fluence) is the integral of radiance at point x:

Φ(x) = ∫ ∫Ω L(x, Ω, t) dΩ dt    (2.7)

Being a complicated partial differential equation (PDE), the RTE has known analytic solu-

tions only for very simple and/or approximated cases. Solution for more general cases requires

numerical methods, of which two are commonly used: the Finite Element Method (FEM) or

Monte Carlo (MC). Either discretization must obey the RTE, though they do so in different

ways.

2.4.1 Finite Element

Under the diffusion approximation to the RTE, the fluence distribution can be modeled as a

quantity diffusing down a concentration gradient, similar to heat. Qualitatively, diffuse light is

light that has been sufficiently scattered to have lost all directionality. It assumes that L in the

RTE above is isotropic, meaning uniform over Ω so L(x, Ω) = L(x) and dµs(x, Ω′ → Ω) = µs.

The FEM involves discretizing the volume of interest into tetrahedral elements and reducing

the RTE to a system of linear equations. A thorough treatment of diffuse light propagation is

given by Jacques and Pogue in [38], so only a cursory review is given here. More formally, the

diffusion approximation requires:

1. The materials involved have high albedo (µs ≫ µa)

2. There are no non-scattering voids in the material (µs > 0, all materials scatter)

3. All sources are isotropic s(x, Ω, t) = s(x)

4. Scattering anisotropy can be neglected, i.e. dµs(x, Ω′ → Ω) = µ′s

5. Results are not expected to be valid within a few mean free paths of a source

6. All materials have a uniform refractive index

Making these assumptions has a number of attractive features. First, it reduces the problem

to that of solving a sparse matrix for which many fast and accurate programs exist. Second, it


offers analytic solutions for simple cases with certain symmetry. Perturbation techniques can

give quick approximations for small changes in the problem geometry (eg small material inho-

mogeneity). A high-quality freely-available implementation, NIRFAST (described in Sec 2.5.6),

is also available.

Offsetting these, though, is the cost of the approximations made. The results are acknowl-

edged to be valid only if the distance from a source or a material boundary exceeds a few mean

free paths. This assumption could be problematic for applications like PDT, particularly if us-

ing extended sources such that a large tissue volume is located near a source. Likewise in PDT

for complex anatomy there will be a large number of material boundaries, possibly including

air cavities which have a strong refractive index change, which are not modeled properly in the

diffusion regime. Considering the relative merits, we chose to pursue a Monte Carlo method

since it is inherently parallel and offers the best possible accuracy by capturing all relevant

physics without restrictive approximations.

2.4.2 Monte Carlo

Computer-based Monte Carlo (MC) models of light transport in turbid media take a different

approach. Instead of modeling conservation laws on a large scale, MC models track individual

photons using appropriately-distributed random numbers so that their expected behavior is

physically correct. Millions or more of such photons are traced, and after a sufficient number of packets the result converges arbitrarily close to the expected answer.

Implementations of this method for biophotonics generally use a common core algorithm

which operates assuming that ballistic photons travel in straight lines through regions of piecewise-

constant optical properties until scattered, absorbed, reflected, or refracted. In this model, a

propagating photon is described by a position p, and a direction d. The scattering and absorp-

tion process, called “hop, drop, spin”, was originally proposed (but not so named) by Wilson

and Adam in 1983 [69]. Prahl et al in 1989 [57] refined the algorithm with the addition of

roulette and anisotropic scattering, and an open source implementation (MCML) was given by

Wang et al [44]. An overview is given below; for greater detail, the reader is directed to the

original MCML paper which gives a thorough treatment.

Launch

A photon packet is first randomly launched (assigned a position and direction) into the tissue

from a source distribution. For isotropic sources, the direction unit vector is randomly chosen

from the unit sphere. If the source is directed then the direction is simply some constant d0.

Likewise, the position may be a constant point or start randomly distributed over a line, area,

or volume.


Hop

Interactions with the material, whether scattering or absorption, are modeled assuming the

Beer-Lambert law. Consider a rectangular prism of area A and thickness ds which contains

particles of cross-section σ with number density ρ moles per unit volume. Now look at a path

through the prism normally incident on that face. If the path is chosen using a uniform random

distribution over the face, it has a probability σ/A of hitting any one particle in the box. Since there are n = ρV NA = ρA ds NA independent randomly-distributed particles in the volume, it has a probability (1 − σ/A)^n of hitting exactly none of them. However, it is generally a valid assumption that the particles are spaced sparsely enough that their probability of overlapping within the prism slice of thickness ds is zero. In that case, the probability of no interaction is just 1 − nσ/A, which could also be derived by a binomial expansion for small σ, so

Pr (Interaction in ds) = 1 − Pr (No interaction) = nσ/A = σρNA ds = µ ds    (2.8)

The quantity µ = σρNA

has units of reciprocal length (here cm−1), and is called the coefficient

of scattering (µs) or absorption (µa). Eq 2.8 defines a differential equation for the CDF of the

step length before interaction S, the solution of which is exponential with parameter µ (denoted

here S ∼ Eµ):

Pr (S ≤ s) = F(s) = 1 − e^(−µs)    (2.9)

Pr (s ≤ S < s + ds) = f(s) ds,  where f(s) = F′(s) = µ e^(−µs) = µ(1 − F(s))    (2.10)

That distribution has mean µ⁻¹. A photon will therefore travel on average 1/µs before being scattered or 1/µa before being absorbed, in a medium containing only scatterers or absorbers

respectively. To combine them, we note that by definition scattering and absorption are in-

dependent so their probabilities within a given infinitesimal length ds are additive. By the

properties of the exponential distribution, the parameter becomes µt = µs + µa, whose reciprocal µt⁻¹ is known as the transport Mean Free Path (MFP), the average distance traveled before

scattering or absorption. Once the photon has an interaction, the probability it was scattered

is:

Pr (Scatter in [s, s + ds)) / Pr (Interaction in [s, s + ds)) = µs(1 − F(s)) / [(µs + µa)(1 − F(s))] = µs/(µs + µa) = α    (2.11)

which gives a mathematical definition for albedo α which was introduced in Sec 2.3.2 as a

material property.

When modeling photon propagation using MC, we need to draw a step length from an

appropriate distribution. To generate an exponential random step length s ∼ Eµ we can use the

standard technique of drawing a uniform random variable u and transforming it by the inverse

exponential CDF:


s = F⁻¹(u) = −ln(1 − u)/µ ,   u ∼ U(0, 1)    (2.12)

It is important to recall that the material properties µs and µa apply only within the current

region, so before completing the step we must ensure that the photon has stayed within the

region. To do so, we check if the ray p, d intersects a region boundary in a distance less than

s. If not, then the packet position is updated to p′ = p + sd, the “hop” phase is complete, and

the process moves on to “drop”.

When there is an intersection, consider the distribution of s for a ray passing through one

layer of thickness T (interaction coefficient µ1) into another with a different interaction coefficient µ2. The CDF is just

F1(s) until it exits the first layer, which it does with probability 1 − F1(T ). From there on, it

travels an additional s′ = s− T according to the distribution for the second material.

F(s) = F1(s)    if s ≤ T
F(s) = F1(T) + (1 − F1(T)) F2(s − T)    if s > T        (2.13)

Substituting the CDF into the second case above, we get

F(s) = (1 − exp(−µ1 s1)) + exp(−µ1 s1)(1 − exp(−µ2 s′))    (2.14)

     = 1 − exp(−µ1 s1 − µ2 s′)    (2.15)

where s1 = T is the distance traveled in the first material. But the original step length u was drawn with probability 1 − exp(−µ1 u), so we must set s′ = (µ1 u − µ1 s1)/µ2 to

preserve the step probability. That expression has a special case when µ2 = 0 in transparent

media (air, glass), so it is convenient here to introduce the dimensionless step length l, which

is scaled so that it has a unit-exponential distribution regardless of the material.

l = s µt    (2.16)

F(l) = 1 − e^(−l)  ⟹  l ∼ E1    (2.17)

It is more convenient to draw l ∼ E1 and track l′ = l − sµt as the photon moves through

materials. When needed the physical step length can be calculated from its definition in Eq 2.16

or taken as infinite if µt = 0.
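A minimal sketch of this bookkeeping (illustrative values and names, not FullMonte's code) is shown below: a unit-exponential dimensionless length is drawn once and spent down as the packet crosses into a material with a different µt:

#include <cmath>
#include <cstdio>
#include <limits>
#include <random>

int main()
{
    std::mt19937_64 rng(12345);
    std::uniform_real_distribution<double> U(0.0, 1.0);

    // Draw a unit-exponential dimensionless length l ~ E1 (cf. Eq 2.12/2.17)
    double l = -std::log(1.0 - U(rng));

    const double mu_t1 = 100.0, mu_t2 = 20.0;   // total interaction coefficients [1/cm]
    const double d1    = 0.02;                  // distance to the boundary of material 1 [cm]

    if (l <= d1*mu_t1)
    {
        // Interaction occurs inside material 1 at physical distance l/mu_t1 (Eq 2.16)
        std::printf("interact in material 1 after %g cm\n", l/mu_t1);
    }
    else
    {
        // Cross the boundary carrying the leftover dimensionless length l' = l - s*mu_t
        l -= d1*mu_t1;
        double s2 = (mu_t2 > 0.0) ? l/mu_t2
                                  : std::numeric_limits<double>::infinity();   // transparent medium
        std::printf("cross boundary, then travel %g cm in material 2\n", s2);
    }
}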

Interface

If the photon encounters a boundary and that boundary is an interface (a change in refractive

index from ni to nt) then it may either reflect or have its angle to the normal refract from

incidence angle θi to transmitted angle θt. Snell’s Law states that


sin θt = (ni/nt) sin θi    if sin θi ≤ nt/ni    (2.18)

for refraction, or total internal reflection (TIR) occurs otherwise. Even when TIR does not

occur, Fresnel reflection may still apply. Since the simulation does not track polarization, it

assumes that the two polarizations (s,p) relative to the surface are equally probable, and hence

that the reflection coefficient is the average of the two reflection coefficients (Rs,Rp) given by

Fresnel.

R = (Rs + Rp)/2 = (1/2) [ |(ni cos θi − nt cos θt) / (ni cos θi + nt cos θt)|² + |(ni cos θt − nt cos θi) / (ni cos θt + nt cos θi)|² ]    (2.19)

Given a Fresnel reflection probability R, the event of photon reflection can be modeled as

a Bernoulli random variable BR. If the ray reflects at the interface due to Fresnel or internal

reflection, then the “hop” step must advance the ray to the intersection point and reflect its

direction d:

p′ = q    (2.20)

d′ = d − 2(d · n)n    (2.21)

l′ = l − |q − p| µt    (2.22)

If on the other hand it transmits and the transport mean free path µt⁻¹ differs in the material being entered, then the physical step size must be updated using Eq 2.16.
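The following sketch (hypothetical helper names, not FullMonte's implementation) shows the interface logic just described: Snell's law detects total internal reflection, the unpolarized Fresnel coefficient of Eq 2.19 gives the reflection probability, and a reflected ray has its direction mirrored about the face normal:

#include <cmath>
#include <cstdio>
#include <random>

struct Vec3 { double x, y, z; };
static double dot(Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// Mirror direction d about the unit face normal n: d' = d - 2(d.n)n (Eq 2.21)
static Vec3 reflect(Vec3 d, Vec3 n)
{
    double p = 2.0*dot(d, n);
    return { d.x - p*n.x, d.y - p*n.y, d.z - p*n.z };
}

// Unpolarized Fresnel reflection coefficient (Eq 2.19); returns 1 for total internal reflection
static double fresnelR(double ni, double nt, double cos_i)
{
    double sin_i = std::sqrt(1.0 - cos_i*cos_i);
    double sin_t = ni/nt*sin_i;                     // Snell's law (Eq 2.18)
    if (sin_t >= 1.0) return 1.0;                   // TIR
    double cos_t = std::sqrt(1.0 - sin_t*sin_t);
    double rs = (ni*cos_i - nt*cos_t)/(ni*cos_i + nt*cos_t);
    double rp = (ni*cos_t - nt*cos_i)/(ni*cos_t + nt*cos_i);
    return 0.5*(rs*rs + rp*rp);
}

int main()
{
    std::mt19937_64 rng(1);
    std::uniform_real_distribution<double> U(0.0, 1.0);

    Vec3 d{ 0.6, 0.0, 0.8 };                        // unit packet direction
    Vec3 n{ 0.0, 0.0, -1.0 };                       // unit face normal
    double R = fresnelR(1.4, 1.0, std::fabs(dot(d, n)));   // e.g. tissue (n=1.4) to air (n=1.0)

    if (U(rng) < R)                                 // Bernoulli(R): reflect...
        d = reflect(d, n);
    // ...otherwise refract and rescale the remaining dimensionless step with the new mu_t (Eq 2.16)

    std::printf("R = %.3f, direction now (%g, %g, %g)\n", R, d.x, d.y, d.z);
}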

Drop

At the conclusion of the “hop”, it is time for the photon to have an interaction with the material.

Recalling Eq 2.11, the photon has probability α ∈ [0, 1] to be scattered, and 1−α to terminate

through absorption. In the simplest formulation, this process can be simulated using a Bernoulli

random variable b ∼ Bα that returns 1 with probability α and 0 with probability 1 − α. The

packet would then drop its energy at the interaction site p and terminate if b = 0. If the energy

absorbed per unit volume is of interest, the amount dropped is accumulated in an array to form

part of the result. Otherwise if b = 1, the photon continues onwards.

A very common optimization originally proposed by Wilson and Adam [69] changes this

description somewhat by combining multiple photons into a packet that travels together, but

behaves identically in the expected sense to the simple case just described. While individual

photons must either be absorbed or terminated, the packet does both proportionally in such

a way as to keep the correct expected value. Each packet has a continuous weight which can

be thought of as an expected proportion of photons which would remain after following the

same path as the packet. Suppose N photons out of N0 originally launched are traveling together in a packet (weight w = N/N0) and have an interaction leaving N′.


w′ = (1/N0) E[N′] = (1/N0)(αN + (1 − α)·0) = αw    (2.23)

∆w = w − w′ = (1 − α)w    (2.24)

Instead of having either a scattering or an absorption event, the packet deposits weight

(1−α)w and has its weight decreased to αw. By allowing a packet to survive multiple absorption

events, the probability that the path traverses regions remote from the source is increased,

providing greater resolution in such regions. It also allows for some economy of computation

since multiple photons can share the calculation of a single hop length, intersection test, and

scattering event.

In an absorbing medium, the packet will continue to weaken, thus adding less and less to

the results with each absorption event but never actually reaching zero. Only when the packet

exits the medium does it cease to need further computing. For some geometries, this could

take a very long time, requiring extensive computation to add infinitesimal accuracy to the

model results. To avoid this problem, MCML introduced the random termination of weak

packets, called “Russian roulette”. When a packet’s weight becomes less than a minimum

value wmin, it is given a 1-in-m chance of surviving with weight mw. This process ensures that

weak packets, which do not contribute significantly to the output sum, are terminated without

violating conservation of energy in the expectation as shown below:

E[w′] = 0 · Pr(die) + mw · Pr(live) = 0 + mw·(1/m) = w    (2.25)

Termination of weak packets provides a balance between higher simulation accuracy in

areas that receive very low fluence, versus the computational cost of obtaining that additional

accuracy. The parameter wmin sets an energy threshold below which m weak (w < wmin)

packets are bundled into a stronger packet requiring 1/m times as much computation to trace.

The side effect of this change is that instead of w being deposited randomly over m fluence

bins at each step, mw is deposited into one thus causing quantization noise in the lower-fluence

bins. Further investigation and discussion of this trade-off are presented in Chapter 3.
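A compact sketch of the "drop" and roulette bookkeeping described above (hypothetical names; the hop and spin steps are omitted) is shown below. In an infinite absorbing medium, the deposited weight equals the launch weight in expectation:

#include <cstdio>
#include <random>

int main()
{
    std::mt19937_64 rng(7);
    std::uniform_real_distribution<double> U(0.0, 1.0);

    const double mu_s = 200.0, mu_a = 2.0;          // [1/cm]
    const double albedo = mu_s/(mu_s + mu_a);       // Eq 2.11
    const double w_min  = 1e-4;                     // roulette threshold
    const unsigned m    = 10;                       // 1-in-m survival chance

    double w = 1.0, absorbed = 0.0;
    bool alive = true;
    while (alive)
    {
        // Drop: deposit the absorbed fraction at the interaction site (Eq 2.23-2.24)
        absorbed += (1.0 - albedo)*w;
        w *= albedo;

        // Roulette: terminate weak packets without biasing the expected weight (Eq 2.25)
        if (w < w_min)
        {
            if (U(rng) < 1.0/m) w *= m;
            else                alive = false;
        }
        // (the hop and spin steps are omitted in this sketch)
    }
    std::printf("deposited weight: %g (launch weight was 1)\n", absorbed);
}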

Spin

Surviving photons then undergo a spin process to simulate the effect of scattering on their

direction. Generally, the scattering interaction can be characterized as a uniform azimuthal

angle φ around the incoming direction, and a deflection θ. The Henyey-Greenstein (HG) phase

function is often used for the deflection component [57], since it has a convenient parameter

g = E [cos θ] to express the anisotropy. Note that g = 1 always implies no scattering since

E [cosX] = 1 if and only if X ≡ 0 mod 2π). When g = 0 is used as the parameter for the

HG function, the cosine of the deflection angle is uniformly distributed on [−1, 1], sending


equal amounts of energy in all directions (equivalently the outgoing direction is statistically

independent of the incoming). Generally, biological tissues fall in the range 0.8 ≲ g < 1 [11].

The inverse CDF for the Henyey-Greenstein function is shown below, facilitating generation of

appropriately-distributed scattering angles given a uniform random number.

cos θ = (1/(2g)) [ 1 + g² − ((1 − g²)/(1 − gq))² ] ,    q ∼ U(−1, 1)    (2.26)

In the original formulation, Prahl et al [57] proposed calculating the new direction of travel

d′ given d, θ, φ as:

d′x = (sin θ/√(1 − dz²)) (dx dz cos φ − dy sin φ) + dx cos θ    (2.27)

d′y = (sin θ/√(1 − dz²)) (dy dz cos φ + dx sin φ) + dy cos θ    (2.28)

d′z = −sin θ cos φ √(1 − dz²) + dz cos θ    (2.29)

which can be rewritten as

d′ = d cos θ + sin θ (b cos φ − a sin φ)    (2.30)

Further deconstructing, it can be shown that a, b are two unit auxiliary vectors orthogonal

to the direction of travel. Geometrically, these form an orthonormal basis for the azimuthal

plane (normal to the direction of travel) which facilitates selection of a random vector in that

plane using angle φ. The first a is formed by taking the cross-product with the z-axis and

normalizing. The second auxiliary vector is formed by crossing the direction with the first

auxiliary as follows:

a = (d × k)/|d × k|    (2.31)

b = d × a    (2.32)

It can be verified by substitution that Eq 2.31-2.32 and Eq 2.30 result in the original

formulation. Once the azimuthal vector is found, the post-scatter direction is found by rotating

the incoming direction by θ towards it. FullMonte uses an alternative way of arriving at the

same formulation, as discussed in greater detail in Sec 3.5.4.
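The sketch below (illustrative only, with hypothetical vector helpers, not FullMonte's implementation) draws a Henyey-Greenstein deflection angle via Eq 2.26 and a uniform azimuth, then applies Eq 2.30-2.32 to update the direction; the printed |d′| should remain 1 up to rounding:

#include <cmath>
#include <cstdio>
#include <random>

struct Vec3 { double x, y, z; };
static Vec3 cross(Vec3 a, Vec3 b)   { return { a.y*b.z-a.z*b.y, a.z*b.x-a.x*b.z, a.x*b.y-a.y*b.x }; }
static double norm(Vec3 a)          { return std::sqrt(a.x*a.x + a.y*a.y + a.z*a.z); }
static Vec3 scale(Vec3 a, double s) { return { s*a.x, s*a.y, s*a.z }; }
static Vec3 add(Vec3 a, Vec3 b)     { return { a.x+b.x, a.y+b.y, a.z+b.z }; }

int main()
{
    const double PI = 3.14159265358979323846;
    std::mt19937_64 rng(42);
    std::uniform_real_distribution<double> U(0.0, 1.0);

    const double g = 0.9;                           // anisotropy
    Vec3 d{ 0.0, 0.6, 0.8 };                        // current unit direction (not parallel to z)

    // Henyey-Greenstein deflection cosine (Eq 2.26); isotropic when g == 0
    double q    = 2.0*U(rng) - 1.0;                 // q ~ U(-1,1)
    double cost = (g != 0.0)
        ? (1.0 + g*g - std::pow((1.0 - g*g)/(1.0 - g*q), 2))/(2.0*g)
        : q;
    double sint = std::sqrt(1.0 - cost*cost);
    double phi  = 2.0*PI*U(rng);                    // uniform azimuth

    // Auxiliary orthonormal basis of the azimuthal plane (Eq 2.31-2.32)
    Vec3 k{ 0.0, 0.0, 1.0 };
    Vec3 dxk = cross(d, k);
    Vec3 a = scale(dxk, 1.0/norm(dxk));
    Vec3 b = cross(d, a);

    // d' = d cos(theta) + sin(theta)(b cos(phi) - a sin(phi))   (Eq 2.30)
    Vec3 dp = add(scale(d, cost),
                  add(scale(b,  sint*std::cos(phi)),
                      scale(a, -sint*std::sin(phi))));

    std::printf("cos(theta) = %g, |d'| = %g\n", cost, norm(dp));
}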


Implementation    Method  Geometry  Abs  Aniso.  Refr  Voids  TR  Acceleration
MCML              MC      Planar    Y    Y       Y     Y      Y
tMCimg            MC      Voxel     Y    Y                    Y
CUDAMC            MC      Semi-inf       Y       Y            Y   GPU
CUDAMCML          MC      Planar    Y    Y       Y     Y          GPU
GPU-MCML          MC      Planar    Y    Y       Y     Y          GPU
NIRFAST           FEM     Tet       Y                             Approximation
TIM-OS            MC      Tet       Y    Y       Y     Y      Y   SIMD (auto), MT
MMCM              MC      Tet       Y    Y       Y     Y      Y   SIMD (auto), MT
MCX               MC      Voxel     Y    Y                    Y   GPU
FBM               MC      Planar    Y    Y       Y     Y          FPGA (1x)
FullMonte (SW)    MC      Tet       Y    Y       Y     Y      *   SIMD (man), MT
FullMonte (HW)    MC      Tet       Y    Y             Y          FPGA (1x)
FullMonte (HW*)   MC      Tet       Y    Y       *     Y      *   FPGA (4x)

Table 2.2: Comparison of existing simulators with key features: geometry, absorption scoring, anisotropy, refraction, non-scattering voids, time-resolved data, and acceleration methods. FPGA (Nx) = FPGA with N instances per chip; MT = multithreading; SIMD = Intel SSE instructions, with automatic or manual optimization. An asterisk indicates planned future work.

2.5 Existing Implementations

There are a number of existing implementations, summarized by key features in Table 2.2 and

discussed in greater depth below. The FullMonte software version is the most customizable,

fastest, and (except for time-resolved output) most full-featured of all implementations. The

FPGA implementation described in this thesis is still faster, with a 3x performance advantage

over software and an architecture designed to increase that further to 12x while adding feature

support.

2.5.1 MCML

MCML, introduced by Wang et al [44], was one of the first widely-used Monte Carlo simulators

for turbid media. It accepts a planar slab geometry with a normally-incident pencil beam.

Since it is a Monte Carlo simulator, it is able to model scattering, absorption, anisotropy (using

the Henyey-Greenstein phase function), reflection, and refraction at boundaries. Extended

sources may be modeled as a convolution of simulation results, but the fundamental limitation

to normally-incident light remains so variations have been developed by researchers as needed.

2.5.2 tMCimg

One of the first open-source voxelized MC solvers is tMCimg [6], which was developed to model

the scalp, skull, and brain for DOT purposes. Since the application uses probes in contact with

the scalp and does not have large refractive index mismatches, the boundary roughness imposed

by a voxelized approach is not significant. Only in the event of refractive index mismatch is


the surface normal required for purposes of computing reflection or refraction. Binzoni et al [4]

describe some of the drawbacks of representing curved interfaces using voxels. It is also worth

noting when making performance comparisons that the implementation of tMCimg uses a single

thread of execution, owing to its development at a time when multi-core computers were rare.

It also does not use vector instructions which can provide significant performance increases

over non-vectorized code. Modifying the software to use the multiple cores available on modern

processors should not be difficult and would yield approximately N times better performance on

N cores (or even > N in the case of simultaneous multithreading (SMT)) based on experience

with FullMonte.

2.5.3 CUDAMC

Alerstam et al [2] present CUDAMC, which is a GPU-based specialization of MCML which

records time-resolved diffuse reflectance. It uses a homogeneous, semi-infinite, non-absorbing

model and produces time-resolved output. Reflection and refraction from the interface are

modeled. In comparison against a single-threaded CPU implementation of the same code, they

report a performance increase exceeding 1000x.

GPU computing provides a very high rate of floating-point operations. Since there is but

a single homogeneous slab, all optical properties and geometry are global constants. Further,

since the material is non-absorbing there is no absorption to score or roulette calculation to

perform. As such, this result should be regarded as an approximate upper bound on the

acceleration available: the calculation never has to stall to fetch geometry information from

memory, and never has to access memory to record absorption so it is entirely compute-bound.

While it does access memory for the output histogram, that operation is quite rare (at most

once per packet) compared to scattering, step length generation, and intersection testing which

may happen hundreds of times.

2.5.4 CUDAMCML

With CUDAMC as a special-case subset, Alerstam et al [2] also present CUDAMCML, which

is a complete implementation of MCML for the GPU. The authors claim speedup on the order

of 100x, against the original relatively unoptimized single-core CPU implementation of MCML.

The performance reduction from CUDAMC (1000x) is notable, since the problem is nearly

identical in terms of calculation. There are a small number of planar slabs to be stored instead

of just one material set, though the memory size and bandwidth requirements thereby imposed

are not significant. Drawing step lengths, random number generation, and scattering remain

identical. Intersection checking also remains nearly identical, though instead of z > 0, the

condition becomes zi−1 ≤ z ≤ zi, i ∈ (0, n). What changes (and significantly so) is the need

to read, accumulate, and write one fluence value each time an absorption event happens. The

resulting memory bandwidth demand is the primary culprit for the order-of-magnitude decrease

in speedup.


2.5.5 GPU-MCML

A recent (2009) work by Lo [48], and later Alerstam and Lo [1] called GPU-MCML uses a

modern NVIDIA “Fermi” GPU to achieve up to 600x speedup relative to single-core CPU-based

MCML. The performance improvements over CUDAMCML are incremental, primarily due to

caching of the area immediately around the source, and all of the inherent model limitations of

MCML remain.

2.5.6 NIRFAST

Dehghani et al [19] use the diffusion approximation to formulate the problem on a tetrahedral

mesh using the Finite Element Method. The resulting system of sparse linear matrix equations

is solved using Matlab, and is freely available in a package called NIRFAST (Near Infrared

Fluorescence and Spectral Tomography). Tetrahedral meshes are used in a wide variety of

applications so they benefit from broad support in Matlab and other libraries for generation,

manipulation, and visualization. Likewise sparse matrices occur in many fields and thus benefit

from the wide availability of quality software code for their solution as well as many hardware

acceleration efforts. However, the model has significant limitations which prevent its use in

certain conditions. Most notably, the diffusion approximation breaks down in the presence of

weak scattering, strong absorption, and changes in refractive index.

2.5.7 TIM-OS

Prior to creation of FullMonte, the fastest tetrahedral mesh-based Monte Carlo simulator was

TIM-OS by Shen and Wang [60]. It uses the “hop, drop, spin” technique and related variance-

reduction techniques found in MCML but adapts them to a tetrahedral mesh.

In their paper, the authors of TIM-OS note that it is slightly faster than MCML on identical

problems (where a mesh is generated to represent infinite planar slabs). The performance

increase is likely due to superior performance tuning and the aggressive optimizations of the

Intel C Compiler, since the tetrahedral method inherently requires more arithmetic operations.

2.5.8 MMCM

Fang [22] presents an alternative to TIM-OS with substantially similar features. One difference

is that MMCM permits shapes other than tetrahedrons to be used in the mesh, but no benefit is

conclusively demonstrated, though there is an additional cost in complexity and performance. In

general, a polytope can be represented as a union of tetrahedra [39] so the additional complexity

adds no new capability. The performance of the code is slower than TIM-OS so it is not a

primary focus for comparison.


2.5.9 MCX

Fang and Boas [24] created MCX, which is a GPU implementation of the tMCimg algorithm

and therefore subject to the same assumptions and limitations. Compared to a single-core

CPU running tMCimg, MCX was shown to be 75-300x faster depending on options and the

specific problem. The option which most impacted run time was whether or not to require

atomic memory accesses. When disabled, some of the photon weight is lost due to memory

race conditions in which two separate GPU threads read the fluence accumulator value, each

separately adds a value, and then both write back. The second write overwrites the first, and

the value it added is lost. The authors demonstrate that the proportion is generally small, and

argue that it can be safely neglected for their test cases.
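For illustration of the accumulation issue (CPU threads standing in for GPU threads; names are hypothetical), the sketch below uses a compare-and-swap loop to make the shared fluence update atomic; replacing deposit() with a plain unsynchronized addition reproduces the lost-update behaviour the MCX authors describe:

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

static std::atomic<double> fluence{0.0};       // shared absorption accumulator for one voxel

// Atomic read-modify-write of a double via compare-and-swap
static void deposit(double w)
{
    double old = fluence.load();
    while (!fluence.compare_exchange_weak(old, old + w)) {}   // retry if another thread updated first
}

int main()
{
    const int T = 8, N = 100000;
    const double w = 1e-3;                     // weight deposited per absorption event

    std::vector<std::thread> threads;
    for (int t = 0; t < T; ++t)
        threads.emplace_back([&]{ for (int i = 0; i < N; ++i) deposit(w); });
    for (auto& th : threads) th.join();

    std::printf("accumulated %.3f (expected %.1f)\n", fluence.load(), T*N*w);
}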

2.5.10 FBM (MCML on FPGA)

The first use of Field-Programmable Gate Array (FPGA) custom digital logic for acceleration

of biophotonic simulations was done by William Lo [47]. FBM implements MCML subject to

limitations on the number of layers (5) and the size of the absorption grid (200x200). Significant

gains in performance and energy-efficiency were demonstrated, with a 65x gain reported in

performance-per-power ratio, and a 45x gain in performance (single-core CPU vs single-FPGA).

Enhancements of the present work over Lo’s work include use of a more general geometry

model, and improvements in performance. Implementation of a tetrahedral model requires more

storage, more memory bandwidth, and more calculation. However, the hardware presented can

be taken as a proof-of-concept and an indication of the possible performance and power gains.

The performance gains should be treated carefully, though, as they compared against a non-

optimized single-threaded CPU implementation. Most importantly, SIMD vector instructions

were not used in the reference case so the processor could be capable of better performance.

2.6 Computing Platforms

With the end of clock frequency scaling, computer engineers can no longer rely on applications

automatically running faster year-over-year. The power and cooling cost of large-scale comput-

ing has also become an issue of concern recently. As a result, interest has increased in alterna-

tive computing platforms to achieve high performance in compact form factors and reasonable

power budgets. Different platforms present vastly different abstractions to the programmer,

along with different implementation tools, and a correspondingly wide range of architectural

tradeoffs. In seeking to accelerate Monte Carlo simulations for turbid media, three candidate

implementation platforms were identified: traditional CPU software, Graphics Processor Units,

and custom logic.


2.6.1 Central Processing Units (CPU)

Traditional Central Processing Units (CPUs) which form the core of computers are laid out by

the manufacturer and arrive fully fixed in their function. The CPU provides an instruction set to

the programmer, which can be used to implement the desired functions. The flexibility available

to the programmer is simply the sequence of instructions and data fed to the processor. This

von Neumann model [31] of computing has proven successful over the years due to its generality,

flexibility, and relative simplicity to program. Fundamentally, the paradigm for CPUs is for

a central data-processing unit to move data in from storage, execute a series of operations,

and move it back into storage. Significant amounts of energy and silicon area are expended on

moving data rather than actual calculation.

With the end of clock frequency scaling but continued scaling of transistor size, CPUs

now boast an increasing number of available cores and an ever-increasing set of specialized

instructions. Since even basic computers now come with two or four cores, it is no longer

reasonable to ignore multi-threaded programming when looking for performance. Likewise, use

of vector instructions is an important consideration for extracting peak performance [58].

The FullMonte software model presented here therefore uses both techniques to achieve its

performance advantage over other simulators.

2.6.2 Graphics Processor Units (GPU)

Graphics Processor Units (GPUs) have been used recently to accelerate computation. Origi-

nally designed to meet the needs of drawing graphics, they are optimized for highly-repetitive

operations and to provide extreme memory bandwidth. In contrast to CPUs which have a small

number of very fast, flexible, and highly-tuned compute engines that can each operate inde-

pendently, GPUs rely on massive parallelism with hundreds or thousands of simpler computing

elements that work in lock-step. The cost of simplicity is that each core operates far slower,

and a number of cores share scheduling logic meaning they must execute the same program in

lock-step. For applications which are floating-point intensive and have significant data paral-

lelism, ie. perform the same operations on many different contiguous pieces of data, GPUs can

offer significant performance increases.

2.6.3 Field-Programmable Gate Array

What CPU and GPU computing share is the paradigm of thinking in a sequence of steps, which

is a natural process for a human programmer to solve a problem. Field Programmable Gate

Arrays (FPGAs) are a form of programmable digital logic which implement spatial computing

through a configurable layout rather than a sequence of instructions. Fundamentally, an FPGA

is an array of fine-grained processing elements including memory blocks, arithmetic blocks

(usually offering variations of multiplication and/or addition), state elements (registers), and

programmable logic, connected by programmable connections. The name “field-programmable”


derives from the ability of FPGAs to be reprogrammed (“re-wired”) a nearly unbounded number

of times, simply by reloading the bitstream which takes under one second. The program or

bitstream specifies what functions the elements are to perform, and how they are to be wired

together. This reprogrammability allows state elements and compute elements to be intermixed,

permitting data to be stored closer to the location where it is processed. As a result, less energy

may be expended on moving data. Some commercially successful results showing performance

and power-efficiency increases for financial Monte Carlo applications are presented in a white

paper by Altera Corp [13].

On the extreme other end of the programmability spectrum are Application-Specific Inte-

grated Circuit (ASIC) and fully custom silicon devices. Such devices typically cost in the tens

or hundreds of millions of dollars to design and test, with the advantage of extremely low unit

cost and very high performance and power efficiency once running [33]. Development times and

risk are also correspondingly much higher. Clearly a very large production run is necessary to

justify the investment. FPGAs offer a middle ground between ASIC/full-custom and more tra-

ditional instruction-set (CPU/GPU) processing. Despite significant programmability overhead

compared to ASICs [41], significant power savings are still possible over CPU/GPU systems

without incurring the extreme engineering cost and risk.

Problems with a significant degree of pipeline parallelism, involving large chains of dependent

computations, tend to benefit from FPGA acceleration. Because the device program is a spatial

layout rather than a temporal sequence of instructions, it is possible for such computations to be

laid out such that outputs feed directly to dependent inputs and are located nearby. Keeping

connection lengths short saves power since shorter connections are easier to drive, and also

permits high performance since shorter links are faster. This minimizes the device area and

energy necessary to move data to where it is needed. More general instruction-based compute

models like CPU and GPU expend a very large amount of energy getting the data from memory,

cache, and registers to the compute units. Those compute units are also fixed in number and

position, which involves a degree of overhead if the application’s needs do not match the device

provided. When designing an FPGA bitstream, the available fixed-position state and compute

components may be connected in such a way as to provide just the right amount of each

computational resource and to locate just enough state elements nearby.


Chapter 3

Software model

This chapter introduces the FullMonte software simulator and highlights its important features.

3.1 Design choices

The preceding chapter presented an overview of existing solution techniques and software im-

plementations for the simulation of light propagation in turbid media. Given the large diversity

of options, the following goals were decided on to guide the present design:

1. Give correct results across many material properties (anisotropy, refractive index, albedo,

scattering)

2. Accommodate complex geometry

3. Be highly optimized for speed, running faster than any other simulator of equivalent

generality

4. Use only free and open-source tools and libraries

5. Be sufficiently flexible to incorporate new light source types easily

6. Make full use of parallel hardware and specialized functions available to the CPU

7. Offer the user and programmer a wide range of options for gathering output data

8. Offer the programmer a wide range of code instrumentation and profiling options

9. Incur no performance overhead for data or profiling features that are de-selected

Based on the goals, a number of important high-level choices were made regarding what

type of simulator to implement and how. They include the nature of the simulator (Monte

Carlo), the geometry model (tetrahedral mesh), the programming language used (C++), as

well as related choices of programming style, tools, and libraries.


3.1.1 Monte Carlo simulation

Monte Carlo was a clear choice based on its ability to model complex geometry and the widest

variety of materials. Analytic solutions to the RTE are not known for non-trivial structures, and

the Finite Element Method is fast and simple but requires too many restrictive approximations

to be of use in the cases of interest, particularly IPDT.

As an additional benefit, MC methods are inherently very parallel because M computing

elements can be used with M different random seeds (to ensure statistical independence) with-

out any need to communicate during the simulation. At the end, the results can be summed

to produce an output with√M times less standard deviation. Assuming the time required to

merge results after completion is much smaller than that required to generate them, this offers

a speedup very close to M times versus a single unit. Other solution techniques are not as

inherently parallel.
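A minimal sketch of this embarrassingly-parallel structure (hypothetical names; tracePacket() stands in for a full hop-drop-spin loop) gives each worker its own seed and a private accumulator, merging only once at the end:

#include <algorithm>
#include <cstdio>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// Stand-in for a full hop/drop/spin packet trace: returns some scored quantity per packet
static double tracePacket(std::mt19937_64& rng)
{
    std::uniform_real_distribution<double> U(0.0, 1.0);
    return U(rng);
}

int main()
{
    const unsigned M = std::max(1u, std::thread::hardware_concurrency());
    const unsigned packetsPerWorker = 1u << 18;

    std::vector<double> partial(M, 0.0);        // one private accumulator per worker
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < M; ++i)
        workers.emplace_back([&partial, i, packetsPerWorker]{
            std::mt19937_64 rng(0xF00Du + i);   // distinct seed per worker for statistical independence
            for (unsigned p = 0; p < packetsPerWorker; ++p)
                partial[i] += tracePacket(rng);
        });
    for (auto& w : workers) w.join();

    // Merging is a single cheap reduction compared to the simulation itself
    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    std::printf("%u workers traced %u packets each; total score %g\n", M, packetsPerWorker, total);
}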

3.1.2 Geometry Representation

We chose a tetrahedral mesh for the geometry representation because of its ability to approx-

imate curved surfaces. Boundary-element and voxelized representations were also considered.

As previously noted, a voxelized representation is not adequate due to artifacts at curved edges

with refractive index changes.

A boundary-element representation, in which the surfaces of homogeneous regions are stored

as a mesh of triangles, is inappropriate because of the turbidity of the medium. Though it is a

common approach and yields a compact representation in computer graphics raytracing within

non-scattering volumes, the number of intersection tests required becomes excessive when used

for turbid media. Each time a packet is scattered, it changes direction and hence needs to have

a new set of intersection tests calculated. In the boundary element method each intersection

test requires fetching and checking for intersection of that ray with all surfaces that bound the

current region, which can be a large number for a complex surface. In contrast, a ray can exit

a tetrahedron only through one of the four faces, thus limiting the number that need to be

fetched and tested. When implementing the algorithm, there is a benefit in the simplicity of

having a fixed number of faces to fetch and test. The tetrahedral representation has no loss of

generality since any shape that can be represented by a triangular surface mesh can be converted

into a tetrahedral volume mesh. The resulting mesh is larger, and element boundaries are

crossed more frequently, but that is acceptable in exchange for reduced memory accesses and

computation per scattering event.
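For concreteness, a minimal sketch of the kind of per-element record this choice implies is shown below. The field names are illustrative only, not FullMonte's actual layout; the point is that each tetrahedron carries exactly four face planes and four adjacent-element IDs, so an intersection test touches a fixed, small amount of memory.

    struct FacePlane {          // one bounding face, stored as the plane n·x = c
        float nx, ny, nz;       // face normal components
        float c;                // plane offset constant
    };

    struct Tetra {              // hypothetical per-element record
        FacePlane face[4];      // the four bounding faces
        unsigned  adjTetra[4];  // ID of the neighbouring element across each face
        unsigned  matID;        // material (optical property) index
    };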

3.1.3 Tools and Libraries

It was decided that the software should use entirely free and open-source libraries and tools.

FullMonte uses the Boost open-source libraries and was compiled with the Gnu Compiler Col-

lection. TIM-OS, the other leading tetrahedral MC simulator, requires the Intel C Compiler


(ICC) and Math Kernel Library (MKL), which are not free. It also relies on the automatic vectorization built into ICC to achieve its performance. ICC's auto-vectorization capabilities are significantly better than GCC's, as reported by Fang [23] in a comparison of MMC with

TIM-OS using different compilers. That experiment showed a 1.6x speed increase from switch-

ing compilers alone. FullMonte provides superior performance without requiring proprietary

tools.

3.1.4 Programming Language and Style

C++ is a widely-used language for designing high-quality libraries and high-performance soft-

ware. It allows a number of high-level abstractions including object orientation, while still

allowing the programmer to optimize low-level features of the program. Since the Monte Carlo

simulator proposed here executes certain core functions very many times, performance of these

inner loops is critical and can be optimized only if low-level calls to specific machine instruc-

tions are possible. The availability of high-quality numerical libraries (eg. for random-number

generation) is also important. Languages such as C and C++ meet these criteria.

On the other hand, significant flexibility is desirable so that the program’s functionality can

be changed easily and in a modular fashion at compile time. The C language falls significantly

short in its flexibility so C++ was chosen. FullMonte uses inlined C++ templates to allow

the programmer to alter or disable output-data gathering functions at compile time so that a

large variety of data can be collected, while paying the performance cost of only those features

selected. This design choice allows an easy upgrade path for future features, for instance time-

resolved calculation, without major alterations to the core simulator or branching the core

code.

The requirement for best-in-class performance implies that the implementation should be

designed in a hardware-aware way, involve detailed optimization where appropriate, and use

advanced processor features where possible. With the end of automatic performance gains from ever-increasing clock frequencies, processor manufacturers are now placing more and more computational cores on each die. To extract the full potential performance

from a modern processor, it is necessary to create a multi-threaded program which maximizes

utilization of all cores. Hence, FullMonte was designed from the beginning for multi-threaded

performance.

3.2 Design Overview

The basic simulation loop is shown in Fig 3.1, implementing the classic “hop, drop, spin”

algorithm. Multiple threads run concurrently, each launching a new packet when its current

packet retires. A thread will propagate the packet throughout the flow until it dies in roulette,

at which point the thread launches another. All threads have their own separately-seeded

random number generators to maintain independence.


To launch the packet, the launcher draws a random direction and position from the set

of sources and their parameters. Weight is initialized to one. At the moment of launch, the

enclosing tetrahedron ID is found and stored within the packet before it propagates to the hop

stage.

At the hop stage, a random step length is drawn as described in Sec 2.4.2 and the intersection test is

performed. If the hop terminates within the same element, the packet is passed onwards to

the “drop” stage. If instead it encounters a boundary with a material of the same refractive

index then it advances to the intersection point and tries again to complete the hop. Lastly,

and least frequently, if the boundary is with a material having a different refractive index then

the packet is passed to the interface code for testing of internal reflection, Fresnel reflection,

and refraction.

When a packet arrives at a refractive index interface, it is evaluated for total internal

reflection. If the condition proves true, then the direction is reflected through the normal,

otherwise the refracted ray is calculated since it provides information necessary to calculate

the Fresnel coefficients. Based on the incident and refracted components, the Fresnel reflection

probability R is calculated and a Bernoulli random variable B_R is drawn to determine whether

the packet reflects or not. Internal reflection, refraction, and Fresnel reflection are all distinct

events in the logger, which is notified appropriately.

In the drop stage, the packet drops part of its energy. The surrounding environment is

stored as a special material ID zero. If the packet propagates into this region, the logger is

called to report an exit event. Otherwise, an absorption event is reported. Generally this will

mean that the element ID and weight dropped are placed in a queue for later merging, however

in some cases (eg. imaging applications) the internal fluence is not of interest and hence is not

recorded. If the weight following the drop is less than a threshold (wmin to be discussed below),

then it is sent to roulette for possible termination. Otherwise, the packet moves directly to

scattering.

If applicable, roulette is calculated very simply by drawing a Bernoulli random variable with survival probability 1/m, where a nonzero draw means the packet continues with its weight multiplied by m. The appropriate logger event is called to notify

of a roulette loss or win as appropriate.

When it arrives at the scatter function, random numbers are drawn and the Henyey-

Greenstein phase function is evaluated to give the scattering angles. The angles are applied

to the current direction of travel and the packet passes back to the hop stage for another

intersection test. Scattering events are also passed to the logger for possible action.
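The flow just described can be summarized as a per-packet loop. The sketch below is simplified pseudocode in a C++ style; the names (Packet, Mesh, Logger, drawHop, handleInterface, spin, and the parameters m and wmin) are illustrative placeholders rather than FullMonte's actual interfaces, and exit handling is omitted for brevity.

    // Hypothetical sketch of one thread's hop-drop-spin loop (not FullMonte's actual code).
    void tracePacket(Packet p, RNG &rng, const Mesh &mesh, Logger &log,
                     float wmin, float m)
    {
        log.eventLaunch(p);
        for (;;) {
            HopResult h = drawHop(p, rng, mesh);          // hop: step length + intersection test
            if (h.refractiveInterface) {                  // rare: TIR / Fresnel / refraction
                handleInterface(p, h, rng, log);
                continue;
            }
            p.pos = h.end;  p.tetra = h.tetra;
            float dw = (1.0f - mesh.albedo(p.tetra)) * p.w;
            log.eventAbsorb(p.tetra, dw);                 // drop: deposit energy in the element
            p.w -= dw;
            if (p.w < wmin) {                             // roulette, survival probability 1/m
                if (rng.uniform01() < 1.0f / m) p.w *= m;
                else { log.eventRouletteDie(p); return; }
            }
            spin(p, rng);                                 // scatter via Henyey-Greenstein
            log.eventScatter(p);
        }
    }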

3.3 Performance enhancements

3.3.1 Multithreading

FullMonte, like some other simulators (TIM-OS, MMCM) uses a programmable number of

threads to do the computation. Each thread has its own random number generator (RNG)


Figure 3.1: Overview of hop, drop, spin flow


initialized with a different seed. Each independently launches photons, propagates them, and

sends event notifications to a Logger object for collection (details later).

In the default logging regime, the weight and mesh element ID for each absorption event

is placed in a thread-specific queue similar to TIM-OS. When the absorption queue is full, the

thread locks a mutex (mutual exclusion lock) such that it has sole access to the absorption

array, and accumulates the information from the queue into the array. The locking process is

essential because if two threads were to update the array at the same time, they could write conflicting data, violating conservation of energy in the recorded output.
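A minimal sketch of this queue-and-flush pattern is given below, assuming hypothetical names (AbsorptionEvent, flushQueue); FullMonte's actual code differs in detail. The key point is that the mutex is taken once per full queue rather than once per absorption event.

    #include <mutex>
    #include <vector>

    struct AbsorptionEvent { unsigned tetraID; float weight; };

    std::vector<double> g_absorption;       // shared per-element absorption totals
    std::mutex          g_absorptionMutex;  // guards g_absorption

    // Each worker thread buffers events locally and merges them in bulk.
    void flushQueue(std::vector<AbsorptionEvent> &queue)
    {
        std::lock_guard<std::mutex> lock(g_absorptionMutex);
        for (const AbsorptionEvent &e : queue)
            g_absorption[e.tetraID] += e.weight;
        queue.clear();
    }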

3.3.2 Explicit parallelism through SIMD intrinsics

Critical sections of code were identified from profiling information and then carefully hand-

optimized using Intel SIMD Streaming Extensions (SSE) instructions. Compiler intrinsics are

function calls that are translated directly into specific assembly instructions by the compiler.

They are embedded in source code like normal function calls and allow access to the most basic

level of machine instructions, while preserving some amount of code readability and convenience

for the programmer. SSE instructions are Intel-specific instructions which allow basic arithmetic operations to be performed on groups of up to four numbers at a time for increased throughput.

FullMonte makes heavy use of such calls to achieve high performance for its most frequently

called operations: intersection testing and scattering.

The program uses an open-source (zlib license) library by Julien Pommier [56] that provides

fast vector math functions including sin, cos, and logarithm. FullMonte also relies on an imple-

mentation of the Mersenne Twister random-number generator by Saito and Matsumoto [59],

which generates uniform random bit sequences using high-performance Intel SIMD instructions.
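As a small illustration of how such intrinsics appear in source code (a generic example, not FullMonte's actual code), the function below computes four multiply-adds with two SSE instructions; each __m128 value holds four packed single-precision floats.

    #include <xmmintrin.h>   // SSE intrinsics

    // y[i] = a[i]*b[i] + c[i] for i = 0..3, computed with two SSE instructions.
    __m128 madd4(__m128 a, __m128 b, __m128 c)
    {
        return _mm_add_ps(_mm_mul_ps(a, b), c);
    }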

3.3.3 The wmin Russian roulette parameter

The wmin parameter introduced in the algorithmic description in Sec 2.4.2 also has a significant

impact on performance, which until now has not received much attention. It permits a trade-off

between faster simulation and higher output quality. MCML uses a value of 10⁻⁴, while TIM-OS uses 10⁻⁵ and MMCM uses 10⁻⁶. All of these values can be shown to expend computing time

unnecessarily for some applications. Below, the impact in terms of both output quality (result

variance) and run time are discussed from a theoretical standpoint; detailed simulation results

are presented in Sec 5.3.4.

Performance Impact

Assume a photon packet of initial weight w traveling through a homogeneous medium with albedo α, and let us define a new property, the material's persistence β = −1/ln α, which is the number of steps required for the packet to be attenuated by a factor of 1/e. By definition (Sec 2.4.2), the weight remaining after i steps is wα^i. Roulette occurs when the remaining weight wα^i < wmin,


which happens after i > β ln(w/wmin). To get the least integer number of steps for which this is true, we take the ceiling ⌈β ln(w/wmin)⌉. After that number of steps, roulette is done, in which there is a 1-in-m chance of the packet continuing with weight mw. Let T(w) be the expected number of steps that a packet of weight w > wmin takes within a material of albedo α before losing at roulette.

i = β ln(w/wmin)    (3.1)

T(w) = ⌈i⌉ + (1/m) T(α^(⌈i⌉−i) m wmin)    (3.2)

Assuming β ≫ 1 and α ≈ 1, the ceiling function can be dropped, permitting a direct solution at the cost of a mild error.

T(w) ≈ β ln(w/wmin) + (1/m) T(m wmin)    (3.3)

Substituting w = m·wmin into Eq 3.3 and collecting terms, a solution can be found for T(m·wmin), which can be substituted back into Eq 3.3 to find the value for any w, including a newly-launched packet of weight 1:

T(1) = β ln(1/wmin) + (m/(m−1)) β ln m    (3.4)

This shows that, in the absence of exit events, the number of packet scattering events is governed by the choice of the roulette parameters m and wmin, which can be changed without changing the expectation of the result, unlike β, which is a material property derived from the albedo.

Given the form of the equation, changes to wmin are far more significant to the outcome than

changes to m, so wmin is the primary quality-time control.
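As an illustrative calculation (example values, not measurements from the thesis): for α = 0.9 (β ≈ 9.5), m = 10, and wmin = 10⁻⁵, Eq 3.4 gives roughly 9.5·ln(10⁵) + (10/9)·9.5·ln 10 ≈ 134 expected steps per packet, and each further factor-of-10 reduction in wmin adds about β ln 10 ≈ 22 steps. The small helper below simply evaluates Eq 3.4 under these assumptions.

    #include <cmath>

    // Evaluate Eq 3.4: expected steps before a packet of weight 1 is lost at roulette,
    // for albedo alpha and roulette parameters m and wmin (illustrative only).
    double expectedSteps(double alpha, double m, double wmin)
    {
        const double beta = -1.0 / std::log(alpha);        // material persistence
        return beta * std::log(1.0 / wmin)                 // steps to reach wmin
             + m / (m - 1.0) * beta * std::log(m);         // contribution of roulette survivals
    }
    // Example: expectedSteps(0.9, 10, 1e-5) ≈ 134.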

If some fraction e ∈ [0, 1] of packets do exit the medium before being terminated at roulette

at wmin, only those packets remaining in the medium are subject to increased calculation if

wmin is decreased by a factor of k, ie. ∆T ≤ (1 − e)β ln k. The increase may be less than

predicted by Eq 3.4 because some of the packets may exit before terminating. Conversely, the

reduction in operation count from an increase in the parameter may decrease e since packets

will tend to terminate earlier.

Output Quality (Variance) Impact

Having shown that the performance difference can be significant, we turn our attention to the

output quality difference. In Sec 2.4.2 it was shown that changes to the roulette constants do

not alter the expectation of energy and hence the accuracy of results. However, the variance of

the output is also important since it determines the uncertainty remaining after a given number

of packets is run.

Let P be a path consisting of a series of points p[i], i ∈ [0, N ] from the launch point p0 to


the arrival point pN . There can be infinitely many such paths for a given p0, pN . Along this

path, consider two different notions of weight: a physical weight w which is the probability

that a physical photon launched from p0 arrives at pN given that it follows path P , regardless

of whether it is absorbed there, and a simulated weight W , which is a function W (w) that

may have a random component. The physical weight arriving at the end of the path must

be w = ∏_{i=1}^{N−1} α(p_i), as presented in Sec 2.4.2. By physics, the energy absorbed at point x must be equal to the product of fluence, infinitesimal volume, and absorption coefficient, so (1 − α)w = Φ(x) µa dV. For the simulation output to be unbiased (correct in expectation), the

expected simulated weight must equal the physical weight.

Let the probability of the various quantities conditioned on arriving via a given path P

be called path-conditional on P . Let PrP be the probability of a path P being traversed,

regardless of the termination criteria, roulette, etc. The unconditional arrival probability can

be calculated as an expectation over all possible paths arriving at p.

E [W ] =∑P∈P

E[W∣∣P ]PrP = Φµa dV (3.5)

But E[W|P] depends only on w, so we can define a probability density function f(w) which gives the probability of arriving at p via any path that has physical weight w. By definition, E[W|w] = w for the simulation to produce correct results. What is of interest is the variance

of the resulting simulation weight W collected. From probability we know that

Var[W] = E[Var[W|w]] + Var[E[W|w]]    (3.6)

Var[W] = ∫₀¹ f(w) Var[W|w] dw + Var[w]    (3.7)

In this formulation, the first term is the additional error injected by a termination scheme.

The second is the inherent variability in the process of randomly selecting a path to traverse.

Non-packetized propagation

In the non-packetized formulation, the photon is either alive with weight 1 or dead with weight

0. The path-conditional simulated weight W is therefore a Bernoulli random variable:

W = S,  S ~ B_w    (3.8)

E[W|P] = w    (3.9)

Var[W|P] = w(1 − w)    (3.10)

cv(W) = √((1 − w)/w)    (3.11)


The coefficient of variation above gives an intuition that the packet becomes increasingly

“noisy” or “quantized” as it becomes less probable to arrive at a given destination.

Packetized propagation without roulette

In the case where roulette is not performed, the packets will continue indefinitely unless termi-

nated by exiting the geometry or by other criteria (eg. a time gate, or a maximum number of

steps).

W = w    (3.12)

E[W|w] = w    (3.13)

Var[W|w] = 0    (3.14)

No additional variance is introduced by the (absence of) termination criteria. However, the

computational cost is very large since all packets must be traced until they exit or are retired

due to other criteria (eg. a time gate).

Roulette

In the roulette formulation, the photon packet weight always has a lower bound of wmin since

if the packet has weight w < wmin at the end of the step it either terminates or returns with

weight mw. To arrive at the destination in the roulette formulation, the packet would have to survive roulette r times, where

r = max(0, ⌊ln(wmin/w) / ln m⌋)    (3.15)

W = m^r w S,  S ~ B_(m^−r)    (3.16)

As shown below, the path-conditional expected value remains the same so there is no bias

introduced, but the path-conditional variance changes:

E[W|P] = (1/m^r)·m^r w = w    (3.17)

Var[W|P] = (1/m^r)·m^(2r) w² − w² = w²(m^r − 1)    (3.18)

cv = √(m^r − 1) ≈ √(wmin/w)    (3.19)

Since the packet weight always has a lower bound, the amount of energy deposited per step

(ie. per unit computational cost) also has a lower bound which is advantageous. The price is

that the output variance per packet traced is increased relative to the case where roulette is


not performed, and the variance increase becomes greater as the path becomes less probable.

However, it should be noted that each packet traced in the absence of roulette takes more computing resources.

Merging this result into Eq 3.7, we find

Var[W] = ∫₀¹ f(w) w²(m^r − 1) dw + Var[w]    (3.20)

with the definition of r as in Eq 3.15. The distribution f(w) is not directly observable or

calculable, though it could theoretically be simulated by taking a histogram of the weight of

all packets arriving within a mesh element. Even without a value available, it does give some

intuition. The more probability that a position has of receiving a high-weight packet, the less

the variance increase due to roulette. If on the other hand the bulk of the probability f(w) is in

areas where w ≪ wmin, then large values of r will apply and the variance will be correspondingly

increased. Such behavior is observed in the simulation results and discussed in Chapter 5. By

tracking the expectation of m^r·w, it should be possible to estimate the variance of each surface

and volume element in addition to estimating its mean, which is a novel capability.

3.4 Output Data

To address the design goal of flexibility, the main simulation loop was designed to accept a

template parameter which models the Logger concept. A logger is a class which has a method

corresponding to each of the following events, which the simulator calls when the event occurs:

• Launch

• Scattering

• Absorption

• Intersection with a material boundary

• Arrival at a refractive index interface

• Internal reflection

• Refraction

• Fresnel reflection

• Termination through roulette

• Roulette survival

By providing packet information as part of the method call, the programmer can change

the type and format of data captured by the logger without ever changing the core loop. Since

the changes are made at compile-time, any features not included do not impose any run-time

performance overhead due to efficient inlining and dead-code elimination by the compiler.
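A minimal sketch of how such a compile-time logger policy can look is given below; the class and method names are illustrative, and FullMonte's actual interface differs.

    #include <vector>

    // A "do-nothing" logger: every hook is an empty inline method, so the compiler
    // inlines and removes the calls, leaving zero run-time overhead.
    struct NullLogger {
        void eventLaunch()                          {}
        void eventAbsorb(unsigned tetraID, float w) {}
        void eventScatter()                         {}
        void eventExit()                            {}
        // ... one method per event in the list above
    };

    // A logger that accumulates absorbed weight per mesh element.
    struct AbsorptionLogger {
        std::vector<double> phi;                    // one accumulator per tetrahedron
        void eventLaunch()                          {}
        void eventAbsorb(unsigned tetraID, float w) { phi[tetraID] += w; }
        void eventScatter()                         {}
        void eventExit()                            {}
    };

    // The core loop is parameterized on the logger type at compile time.
    template <class Logger>
    void runSimulation(Logger &log /*, mesh, sources, ... */);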


3.5 Profiling information

In general, a computer algorithm consists of data movement and computation, both of which

take time and device resources. Understanding the performance of an algorithm implementation

requires understanding both aspects and their interaction. Due to the flexible design of the main

loop and the logger concept, many useful pieces of profiling data can easily be acquired through

already-existing functionality.

3.5.1 Geometry Description

One of the key differences between MCML with its infinite planar slab geometry and a com-

plex tetrahedral mesh is the size of the geometry description. In the planar slab regime, the

entire geometry description for n layers (usually ≲ 10) can be encapsulated in just 5n numbers

representing µa, µs, g, n, z. This is not the case for more complex tetrahedral representations

which can use ≈ 10³–10⁶ mesh elements, each requiring at least 4 face descriptions, each

having a 3D vector, a constant, and a pointer to the next element. Unlike MCML, the entire

description does not necessarily fit into any of a typical computer’s caches. An efficient and

compact geometry description is therefore essential to the problem, and the ability to access it

quickly will be one of the limiting factors in performance.

To that end, profiling was undertaken using the Logger framework to understand the char-

acteristics of relevant problems. A memory profiler was created which receives notification each

time a packet moves to a new material through either a boundary event or refraction event. The

profiler stores the current tetra ID and a count of scattering events. When the packet arrives

in a new material, the logger writes the previous tetra ID and event count out to a file, then

resets the event count and stores the new tetra ID. The resulting trace is a run-length-encoded

history of memory addresses fetched for intersection testing.
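A sketch of such a trace logger within the Logger framework might look as follows; the hook names and output format are illustrative assumptions, not FullMonte's exact code.

    #include <cstdio>

    // Writes a run-length-encoded trace: (tetra ID, number of events spent there).
    struct TraceLogger {
        std::FILE *out;
        unsigned   curTetra = 0;
        unsigned   count    = 0;

        void eventNewMaterial(unsigned tetraID)
        {
            if (count > 0)
                std::fprintf(out, "%u %u\n", curTetra, count);   // flush the previous run
            curTetra = tetraID;
            count    = 0;
        }
        void eventScatter() { ++count; }
    };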

Temporal Locality

Temporal locality refers to the correlation of memory addresses accessed by an algorithm over

time. Informally, it answers the question “what percentage of memory accesses refer to data

which have been accessed in the last n accesses?”. Most modern computing devices make use

of a memory hierarchy of storage devices, ranging from small fast caches closest to the com-

puting elements, to larger slower storage further away. When data is sought, the computer

first looks in its nearest caches, then searches progressively further afield only if the data is

not present. Modern CPUs [31] tend to have three levels (L1-L3) ranging from smallest/fastest

to largest/slowest before accessing main memory. Typical computer caches use a replacement

policy of storing the most recently accessed data and (if necessary) making space for it by eject-

ing the least-recently-used (LRU) data [31]. Algorithms which have temporal locality benefit

from such a cache, since it exploits the correlation in memory accesses over time. Based on the

memory traces described above, the simulator’s memory access patterns into the tetrahedron


memory were assessed for temporal locality, with results discussed in Chapter 5.

Spatial Locality

In addition to temporal locality, accesses can show an address-dependent frequency distribution.

The use of least-frequently used (LFU) replacement in caches is well-known [7] for applications

such as web traffic and multimedia which follow a Zipf-like (power-law) distribution. The LFU

paradigm differs from LRU in that pages are evicted based on being less frequently accessed

over the long term, rather than on a short-term measurement of how recently they have been accessed.

Analysis to be presented later shows that a hybrid LRU/LFU cache scheme would perform best

for the simulator based on these observations.

Software was written to simulate cache accesses using the stored memory traces mentioned

above. Using a family of templated C++ classes that permit simulation of a memory hierarchy

(different sizes and types), simulations were conducted to determine the effectiveness of different

caching schemes. Because all packets are mutually independent, the statistics of the access

request stream are expected to be stationary over the long term, with short-term correlation

due to the limitation that a packet can move only to an adjacent mesh element (and has some

probability to step back after a short time).
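As an illustration of the kind of cache model used (a sketch only; the actual templated class family is more general), a fully-associative LRU cache simulator over a trace of tetrahedron IDs can be written as:

    #include <list>
    #include <unordered_map>
    #include <vector>

    // Count hits for a fully-associative LRU cache of 'capacity' lines,
    // given a trace of tetrahedron IDs in access order.
    std::size_t lruHits(const std::vector<unsigned> &trace, std::size_t capacity)
    {
        std::list<unsigned> lru;                                    // most recent at front
        std::unordered_map<unsigned, std::list<unsigned>::iterator> pos;
        std::size_t hits = 0;

        for (unsigned id : trace) {
            auto it = pos.find(id);
            if (it != pos.end()) {                                  // hit: move to front
                ++hits;
                lru.erase(it->second);
            } else if (lru.size() >= capacity) {                    // miss with eviction (LRU)
                pos.erase(lru.back());
                lru.pop_back();
            }
            lru.push_front(id);
            pos[id] = lru.begin();
        }
        return hits;
    }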

Software does not permit explicit cache management, so this insight is not directly ex-

ploitable when writing a software simulator. However, these data are useful in designing other

implementations, including GPU and FPGA designs, where memory access patterns and cacheability are important to achieving high performance, particularly on the FPGA where it is possible to

implement custom cache logic. This point receives further discussion in the hardware chapter.

3.5.2 Operation Frequency

In addition to the need to move data to the compute units, the ability to carry out the calcu-

lations themselves is important for performance. To avoid premature optimization, the Logger

framework was used again to count the relative frequency of the different operations and identify

which are most critical to performance.

The frequency with which the various pipeline steps occur will dictate which has the biggest

influence on overall algorithm run time, and hence which are the best candidates for manual op-

timization. Based on operation counts, the following conclusions were drawn for the Digimouse

BLI setup:

• Intersection testing is the most frequent operation

• Scattering (with its associated absorption and roulette checks) is the next most frequent

• Interface-related events are very rare

It is intuitive that intersection testing should be the most frequent operation since it must

happen at least once per scattering event. It can happen more than once if the hop hits a


region boundary before completing, in which case the new element must be loaded and tested.

Scattering (“spin”), absorption (“drop”), and roulette should be equally frequent since packets

progress from one to the next with divergence only when a packet dies which is rare.

Refractive interfaces are considerably rarer in the test cases studied, by 2-3 orders of magni-

tude. Describing a general shape with a tetrahedral mesh requires many tetrahedra, so it stands to reason that each individual material region should comprise a large number of elements. Of these, only the boundary elements have faces which are interfaces, so interface-related operations should be much rarer. Biological tissues are also relatively homogeneous in their refractive index, except for air cavities, so typical problems will have relatively few

interfaces. Both data from a small number of test cases and intuition agree that the critical

path is composed of intersection testing and scattering, both of which were carefully optimized.

Future work should certainly look at a broader range of problem definitions to assess the range

of parameters, however the conclusions are expected to remain qualitatively valid.

3.5.3 Coordinate precision

Compared to other implementations, FullMonte uses a lower-precision floating-point represen-

tation (IEEE Single instead of Double). During development, assertions were added to the code

to check for effects of numerical round-off error, such as validating that the norm of unit vectors

(eg. direction) remained within a reasonable tolerance of unity. No violations were found, suggest-

ing that the double-precision values used in TIM-OS were unnecessary. Simulation results also

converged to the same value regardless of precision, suggesting that the additional precision

is not necessary. Switching to single-precision enabled many calculations to be done using

a single four-element floating-point Intel SSE vector instruction, instead of two two-element

double-precision instructions. This in turn had a significant impact on the instruction count

required in the inner loop. Newer processors (Intel Sandy Bridge and up) now have 256-bit

registers that hold four double-precision elements so the gap will decrease. However it remains

useful as a way to decrease memory bandwidth requirements so that more elements may stay

resident in the cache.
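An assertion of the kind described might look like the following (illustrative only, not FullMonte's exact code):

    #include <cassert>
    #include <cmath>

    // Verify that a direction vector remains unit-length to within a tolerance,
    // guarding against accumulated single-precision round-off error.
    inline void checkUnitNorm(float dx, float dy, float dz, float tol = 1e-5f)
    {
        float norm2 = dx*dx + dy*dy + dz*dz;
        assert(std::fabs(norm2 - 1.0f) < tol);
        (void)norm2;   // silences the unused-variable warning when NDEBUG is defined
    }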

While the performance benefit of using single precision vs double is not as large for newer

processors, it remains a useful finding for non-CPU implementations. FPGAs perform far

faster on fixed-point computation than on floating-point, and so a software validation of lower numerical

precision is very useful. This reinforces results reported by Alerstam et al [2] which showed

that CUDAMC’s performance was not sensitive to precision between float and double. GPUs

also have far more single-precision floating point units than double-precision so this finding can

be applied on a GPU implementation as well.

3.5.4 Spin Calculation Methods

Since the scatter event is called once per step and involves a number of mathematical operators,

it is an important target for optimization. A number of different algorithms and variants


were tested for speed. To assess speed, micro-benchmarks were run where a single packet was

repeatedly spun by pre-calculated angles θ, φ whose sines and cosines were stored in an array.

Pre-storing rather than calculating isolates the timing of just the inner loop which is of interest.

By repeatedly spinning the same packet, the number of memory accesses required to complete

the benchmark is minimized. It should also reflect the typical case when the software is running

where frequently-used values would be expected to be register-resident.

Cross Spin

To start, the original MCML spin calculation was implemented exactly as described in Sec 2.4.2

and the original MCML paper [44]. It makes no use of SIMD instructions or other hardware

optimizations.

Matrix Spin

In considering the spin formulation as originally proposed, we noted that calculating and dis-

carding a, b requires calculation of a reciprocal and a square root. If these auxiliary vectors

were maintained, it would save some computation at the expense of additional state informa-

tion. Further, the original formulation requires a special case because it is singular if dz = ±1.

Avoiding the need to check and handle the special case would be desirable.

In the original formulation used by MCML and subsequent derivatives, the new post-rotation

vectors a′, b′ were never calculated; the original a, b were calculated implicitly as part of Eq 2.27-

2.29 and then discarded. We developed a new formulation for FullMonte [9] that maintains the

auxiliary vectors a, b for use in Eq 2.30 directly, instead of discarding them and re-calculating.

The geometric interpretation is the same as in the original case above, except that the vectors

a, b are rotated along with d so that they remain orthogonal to d (and to each other) and may be used again.

The additional calculations required are:

a′ = a cos φ − b sin φ    (3.21)

b′ = −a sin φ + b cos φ    (3.22)

The new formulation avoids the special case where d = ±k as well as one square-root

and one reciprocal, the costs and benefits of which are discussed later in the implementation

descriptions.

The matrix spin described in Sec 2.4.2 was implemented using SIMD instructions, and out-

performed the original formulation. This implementation is attractive for hardware which has

a high density of multipliers but less other units (divide, square-root). FPGAs are exactly

such a platform since they have fast and power-efficient hard multiplier blocks. Similarly in

modern GPUs [12], there are a large number of simple cores with adders and multipliers but a

smaller number of shared, slower special function units for division and square-root. By trading


away division and square-root in favour of more multiplication, it may be possible to get faster

performance if the algorithm were to be implemented on a GPU.

SIMD Cross Spin

Subsequent to implementation of the matrix-spin algorithm above, the original “cross spin”

algorithm was further enhanced by use of SSE intrinsics leading to the fastest CPU-based

implementation of all the variants tried. In particular, substitution of square-root and division

by an explicit reciprocal-square-root instruction made a large difference.

The azimuthal vectors are formed by normalizing the cross product between the packet

direction and the k vector. This method has the advantage of simplicity since the two zero

components in the k vector reduce the number of nonzero terms in the output. It has a singular

case where d ‖ k so d × k = 0 which is handled separately. Performance enhancement over

the previous version was achieved by using a hardware approximate-reciprocal instruction in

place of a math library call. Additionally, the number of instructions was decreased by using

SIMD instructions which operate on more than one data item at once. The matrix formulation

remains in use in the hardware version, though, to shorten latency and make use of plentiful

hardware multipliers.
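A sketch of the kind of substitution involved (illustrative only): with k = (0, 0, 1), the azimuthal vector is normalize(d × k) = normalize(d_y, −d_x, 0), and the normalization can use the approximate reciprocal-square-root instruction in place of a library square root and a divide.

    #include <xmmintrin.h>

    // Compute a = (d × k)/|d × k| for k = (0,0,1), i.e. (dy, -dx, 0) normalized,
    // using _mm_rsqrt_ss (approximate 1/sqrt) instead of sqrt + divide.
    // Assumes the singular case d parallel to k is handled elsewhere.
    inline void azimuthalVector(float dx, float dy, float a[3])
    {
        float cx = dy, cy = -dx;
        float inv = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(cx*cx + cy*cy)));
        a[0] = cx * inv;
        a[1] = cy * inv;
        a[2] = 0.0f;
    }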

3.5.5 Intersection Testing

One small but significant optimization from previous implementations was a change to the

storage of normals within a tetrahedron. Instead of storing one normal vector per four-element

SIMD register, the coordinates were gathered by type. One vector each is dedicated to holding

all of the x, y, z, and constant offset components of the four faces. Doing so avoids some

manipulation of the vectors necessary to compute the required dot products. Since intersection

testing is actually the most frequently-occurring operation of the entire pipeline, the impact

is not trivial. The normal vector itself is not needed except in the case of arrival at a refrac-

tive index boundary which is significantly rarer. When needed, the vector can be found by

transposing the vectors in the tetrahedron definition.
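A sketch of the idea, with hypothetical field names: once the face-plane components are gathered by coordinate, the signed heights of a point above all four faces can be computed with a handful of packed operations and no shuffling of individual normal vectors.

    #include <xmmintrin.h>

    // Structure-of-arrays storage: element i of each register belongs to face i.
    struct TetraFacesSoA {
        __m128 nx, ny, nz;   // x, y, z components of the four face normals
        __m128 c;            // the four plane offset constants
    };

    // Signed distance of point (px, py, pz) to each of the four face planes.
    inline __m128 faceHeights(const TetraFacesSoA &t, float px, float py, float pz)
    {
        __m128 h = _mm_mul_ps(t.nx, _mm_set1_ps(px));
        h = _mm_add_ps(h, _mm_mul_ps(t.ny, _mm_set1_ps(py)));
        h = _mm_add_ps(h, _mm_mul_ps(t.nz, _mm_set1_ps(pz)));
        return _mm_sub_ps(h, t.c);
    }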


Chapter 4

FPGA Implementation

This chapter contains detailed technical descriptions of the FPGA-accelerated implementation,

and as such necessarily contains some jargon specific to computer engineering and digital logic

design. It may safely be skipped by readers who are not computer engineering experts; the key

results, validation, and performance comparisons are all summarized in Chapter 5 and discussed

in Chapter 6.

4.1 Motivation for Hardware Acceleration

Monte Carlo simulations are inherently parallel. All iterations are independent, and the statis-

tical uncertainty (standard deviation) of the answer declines as 1/√N, where N is the number

of paths simulated through each element or detector. The simplest approach to reduce runtime

is to run M simulations of N/M paths on M parallel machines with independent random number sequences and sum the results. Assuming that the time to merge the results is much less than the time to generate them, this will take ≈ 1/M as much time.

However, this naive approach runs into practical limits quickly: to compute a result of equal

quality M times faster, it requires M times as much power, cost, and space yielding a constant

per-packet ratio regardless of M. Conversely, since the standard deviation falls only as 1/√N, computing a result with M times smaller standard deviation in the same amount of time requires M² times as many packets, and hence M² times the power, cost, and space. For

MC simulations of complex geometries to really “break through” into everyday use in the clinic

and research lab, they need to provide better numbers of packets simulated per unit cost, space,

and power in addition to time. Given that CPU processor speeds are not increasing at their

former rate [31], processor architecture and manufacturing alone will not provide significant

improvement in single-core CPU performance over time so alternative approaches are needed.

The alternatives involve either other algorithms to solve the same problem or other com-

puting architectures. Leaving aside the question of other algorithms, which has not yet yielded

an option for the most general materials and geometries, other compute architectures are a

compelling way forward. The three options considered are listed below.


4.1.1 GPU

Currently the most popular compute accelerator in the market is GPGPU, or General Purpose

computing on Graphics Processor Units [31], but they were considered and discarded for this

application. Their programming model divides the program into many threads, but requires

that groups of threads called warps access contiguous memory in a process known as coalescing.

If a group of photons packets is launched from a point source, they start within the same tetra-

hedral element but as they travel they rapidly diverge due to scattering and so begin to access

non-contiguous memory which would be expected to lead to very sub-optimal performance.

For applications such as PDT in which the volumetric fluence distribution is of interest, it also

requires accumulating values over a large array shared between threads which needs expensive

atomic memory access to ensure correct results. CUDAMC (Sec 2.5.3) uses a GPU to achieve

approximately 1000x run time decrease compared to a non-optimized single-thread implemen-

tation. That algorithm however requires virtually no memory access and so could be regarded

as a hard upper limit on performance when the problem is fully compute-bound, and indeed it

is also less compute-intensive than working with a full 3D tetrahedral mesh. The authors note

that acceleration decreases by an order of magnitude when implementing the full planar-slab

MCML. Tetrahedral mesh computation requires both more memory access and more compu-

tation so it would be reasonable to expect further significant performance decreases. Since the

acceleration results are reported against a single-core CPU implementation, a GPU implemen-

tation could be expected to achieve less than an order of magnitude better performance in

run time compared to a multi-threaded CPU implementation, and less advantage on an energy

basis.

4.1.2 Intel Xeon Phi processor

Intel’s new Xeon Phi coprocessor systems [15] offer a highly-parallel compute accelerator aimed

at competing with GPGPU while using the Intel x86 instruction set. It is an instance of

Intel’s new Many Integrated Cores (MIC) architecture, using relatively lightweight in-order

cores coupled to a mid-sized cache (256 kB/core: larger than a GPU's, smaller than an x86 CPU's)

and fast memory. By increasing the number of cores and available memory bandwidth, it might

offer a performance increase for this application compared to a normal x86 processor without

imposing the overhead of a GPU, specifically the requirement for memory-access coalescing and

the heavy penalty for branch divergence. Cache coherency is also a very strong advantage when

considering the need to accumulate many absorption events across cores. Counterbalancing

that, the smaller cache relative to a full x86 processor may impose a penalty due to a higher

miss rate. The total power budget is also correspondingly larger, so it may or may not be a net

improvement in power-performance terms.

Implementation of the FullMonte software simulator on such a system would be relatively low-effort due to instruction-set compatibility (including all of the hand-optimized vector parts), making it a plausible candidate for accelerating the calculation and an interesting evaluation of the new


technology. Due to the recent announcement of the device family (Nov 2012) and its novelty it

has not yet been targeted for a FullMonte implementation.

4.1.3 FPGA

FPGAs, as introduced in Sec 2.6.3, are programmable logic devices which offer far greater programmability, power efficiency, and in some cases compute capability, at the cost of greater difficulty in programming. Unlike GPUs, FPGAs offer fine-grained parallelism and the opportunity to

customize the memory hierarchy for the target application. Energy efficiency is also vastly

superior on the FPGA platform, which is a desirable attribute for scaling up the computation

to handle large volumes of simulations, particularly in the context of large-scale computing or

portable systems. They are also a mature device with a proven track-record of energy-efficient

and high-throughput computing. Two major vendors, Xilinx and Altera, offer large-scale FPGA

devices using modern manufacturing processes (28nm) with mature CAD tools, large IP port-

folios, and reasonably similar device architectures. Of the two, an Altera Stratix V FPGA was

chosen as the implementation medium.

4.2 Design Overview

4.2.1 Hardware Platform: Altera-Terasic DE-5

The Terasic DE-5Net [63] is a development platform for the Altera Stratix V FPGA [14], a high-

end modern 28nm FPGA. The board includes a Stratix V A7 device, which is a mid-size variant

of the Stratix family designed to provide a balance of logic, memory, and DSP functions. It also

supports two large DDR3 SO-DIMM memory modules and four QDR-II+ SRAM modules for

fast random-access memory. Listing for $8,000 USD, it is a common platform for prototyping

FPGA projects. As will be discussed later, it provides a good mix of FPGA and memory

technology for scaling up FullMonte to higher performance. The proposed scale-up architecture

would use all of the memory features just listed as well as nearly all of the available DSP

resources on the FPGA.

4.2.2 Implementation Language: Bluespec

FPGA designs are typically implemented either by writing Register-Transfer-Level (RTL) hard-

ware descriptions (VHDL or Verilog being the most-used languages), or by using High-Level

Synthesis tools. RTL design tends to be very laborious, verbose, error-prone, and to result

in code which is difficult to adapt to new contexts (new FPGA devices or new applications).

HLS tools often greatly restrict the method of expressing the problem and/or lead to inefficient

device resource usage due to excessive abstraction of important device details. A number of

commercial [35] and academic [8] tools start from recognizable sequential languages such as C

and Matlab or explicitly-parallel instruction-based languages like OpenCL. While some, partic-


ularly the Altera OpenCL [13] compiler, have shown success in a few applications, we judged the efficiency and flexibility given up by using HLS tools based on software programming languages to be excessive.

Choosing between the two traditional options poses a difficult dilemma between convenient

design but low performance on the one hand, and a difficult, tedious, error-prone process on

the other. The FPGA implementation of FullMonte used a third option: a new commercial

HLS tool called Bluespec and its related language Bluespec System Verilog (BSV), which take

a radically different approach from both RTL and other HLS systems. Derived from functional

programming languages which have a primarily academic heritage, the language makes a strong

distinction between (pure) functions whose return value is a function only of its explicit inputs

(ie. for the same input, it always gives the same output), and actions which may read and write

state elements. A quick introduction to the language and its core concepts is provided in the

book BSV By Example [53], while the BSV Reference Guide [5] provides a detailed language

reference. The novel features of the language most relevant to this project are discussed below,

and also referenced where appropriate in the detailed design description that follows.

Choosing a relatively new and unfamiliar language over “traditional” design methods was

a risk, but the results have justified the risk many times over: simulations ran an order of

magnitude faster, many errors were caught in the compilation stage, code volume was greatly

reduced, and code readability was enhanced. Overall, Bluespec provided a large productivity

increase throughout the design process and resulted in code that is far more maintainable

and reusable. Some highlights of the language and compiler are discussed below, with specific

references where appropriate in the detailed design description as well.

Guarded Atomic Actions

Bluespec programs consist of two fundamental elements: rules, which consist of a set of conditions (guards) and a set of actions that modify module state if and when the rule fires; and

state elements (eg. registers, memories), which are modified by actions.

Conditions can be specified explicitly (do X if Y) by the programmer, or can be derived

implicitly from other conditions within the rule (do X, where X is only permitted to happen if

Z). Based on the program source, the compiler evaluates conflicts between the effects of rules

and generates a scheduler which decides what rules should fire when. By analyzing the conflicts,

the scheduler ensures that no two rules whose side effects are incompatible (eg. both writing

the same register) fire together. At each clock cycle, the scheduler evaluates the conditions

(implicit and explicit) for every rule, and determines which are permitted to fire. Based on

the assigned priorities and conflicts, it then selects which rules to fire within the cycle. While

this sounds like additional overhead, it must also be done by a programmer to write correct

RTL code. The compiler lifts this burden, and also provides errors if the program specification

appears to be ambiguous or infeasible.

Each rule is atomic, which means that its actions execute entirely or not at all: if any part


of it is not able to execute due to conflict, the scheduler will not permit the rule to fire. Instead

of having to derive the scheduling logic for each state element manually, the programmer can

think in terms of what actions have to occur in what situations. The compiler then takes care

of making sure that the actions are attempted only when they are permitted, and that no two

rules fire which conflict. An oft-used example is the FIFO block provided in the Bluespec IP

libraries. If a rule involves enqueuing a value into the FIFO, that rule automatically carries

the condition that the FIFO is not full. Even better, if two rules must enqueue values into

the FIFO, it will warn the programmer to make a priority decision if they can conflict. Best

of all, though, suppose it is necessary to modify a working program so that under yet another

condition a value is enqueued into the same FIFO. That would be as simple as writing the new

rule and specifying its priority relative to the other two; no modification of the other rules (in

fact of any existing code at all) or manual rewriting of scheduling logic is necessary because the

compiler does it all.

Strong Typing

In Bluespec as in Haskell, the language is strongly typed and uses a type class system. All

expressions must have a type, and any type conversion must be explicitly requested by the

programmer, unlike in C, Matlab, or Verilog. While that may sound restrictive, several convenient consequences follow. First, since each expression's type is statically and unambigu-

ously known at compile time, variables can be defined from other variables without explicitly

specifying their type (eg. “let x = ...” where the type of the RHS need not be stated by the

programmer). Second, there exist signed and unsigned versions for each length of bit vector

so common Verilog errors due to implicit extension, truncation, and sign conversion do not

happen; the programmer must ask for all of those conversions. Third, types may belong to

type classes for which groups of functions are defined. General functions can be defined which

take arguments of any type that belongs to a given type class. For instance, one could define twice_sum(x, y) = 2·(x + y), which would then work for arguments x, y of any type that is a member of the Arith# type class defining the basic arithmetic operations. Type classes provide

polymorphism similar to, but distinct from, that of C++, since no direct inheritance of data members is

necessary. This convenience does not stop with functions, but hardware modules too can ac-

tually be parameterized by type. Such parameterization drastically cuts down on “boilerplate”

code for commonly-used design patterns including testbenches and module wrappers.

Higher-Order Functions and Modules

Due to its heritage from functional languages and particularly Haskell, Bluespec allows functions

and hardware modules to be passed as arguments to other functions and modules. Three good

examples are given later, one for queueing of random numbers in Sec 4.3.1, one for simulating

imported Verilog modules in Sec 4.3.8 and another for test-bench creation in Sec 5.1.


Compiled Simulation

When running Monte Carlo simulations that may involve thousands of paths, each requiring

thousands of arithmetic operations, simulation speed is a significant factor in debugging pro-

ductivity. The Bluespec compiler can compile BSV code into a cycle- and bit-accurate C++

version which runs very quickly using the provided Bluesim simulator. A very rough estimate

would place the speedup at an order of magnitude or better. The Bluespec code can also

integrate with user-provided C++ code as well, which is useful for testing and for exploring

architecture options where some functions have not been fully implemented in Bluespec.

One limitation is that existing Verilog RTL code (eg. FPGA vendor IP, including mathe-

matical functions) cannot be incorporated into the C++-based simulation. Bluespec can also

emit Verilog for simulation using a normal RTL simulator (eg. Modelsim) but that gives up

the speed advantage inherent in the C++ compilation-based approach. However, if an accurate

C++- or Bluespec-based model for the IP can be created then Bluesim can still be used. That

approach was taken when incorporating Altera IP to instantiate DSP cores.

Libraries

Bluespec also ships with a large library of intellectual property including useful primitives

like First-In First-Out (FIFO) buffers, as well as Block RAM instances. These libraries are

quite useful because they are broadly parameterizable, eg. the FIFOs are parameterizable in

terms of both type and depth. Any type which is a member of the Bits#() type class, ie. any type (including user-defined types) which can be represented using a fixed number of bits, can be stored in

a FIFO. As mentioned earlier, the implicit conditions on all library modules are factored into

the scheduler so no explicit checking of FIFO full/empty conditions is required. There is also

a convenient library called StmtFSM which is useful for creating finite-state machines using an

easy sub-language. It works within the guarded atomic action framework such that the FSM

state advances only when all actions within that step are able to fire.

In contrast to the flexibility described above, vendor-specific IP libraries in RTL languages

will often require regeneration using a separate tool when changing width, depth, or other

parameters so the Bluespec IP model represents a significant convenience in terms of source

code flexibility. Vendor IP libraries also put the burden on the user of ensuring that the input-

port conditions are correct for using the IP. Using a Bluespec library, on the other hand, the

constraints will propagate upstream and be incorporated into rules for using the IP.

4.2.3 Design Limitations

A few limitations and assumptions were made to make the problem scope tractable while still

enabling useful conclusions about the feasibility and performance of a full system:

1. At most 16 distinct materials may be simulated


2. Maximum mesh size is 64k elements

3. Internal reflection, refraction, and Fresnel reflection are currently omitted

4. Only isotropic point sources are supported

The number of distinct materials is representative of typical problem sizes. Since a user

must contour the different material regions and define optical properties, 16 was seen as a

reasonable number which few simulations are likely to exceed. Maximum mesh size was limited

to 64k elements due to limited on-chip memory availability. This was sufficient to run the

“cube 5med” test set, and can also accommodate a set which covers the majority of memory

accesses in real applications (> 95% for Digimouse BLI test set).

The current system also supports only isotropic point sources. A pencil beam is trivial

to support but not currently done, and the extensions to line sources and volume sources are

simple and unlikely to limit overall system performance since launch is hundreds of times less

frequent than intersection testing and scattering.

To reduce the algorithm complexity for a first prototype, calculations relating to index

of refraction (internal reflection, refraction, and Fresnel reflection) were excluded. As will be

demonstrated later, interface calculations are two orders of magnitude rarer than the most com-

mon operations (intersection testing and scattering) and therefore are not a major performance

bottleneck. Inclusion of these effects will be important for application of the system to the most

general class of problems, but would not be expected to limit the performance of the overall

system.

4.2.4 Design Goals

Given the selection of FPGA as the computational platform for implementing an accelerated

MC light propagation engine, we derived a set of goals for the design to take best advantage

of the relative strengths and weaknesses of FPGAs. Based on a high-level analysis of the

algorithm, the following high-level objectives were set:

1. Insert pipeline registers as needed to maximize clock frequency (target 250MHz)

2. Exploit pipeline parallelism by keeping multiple packets in flight simultaneously

3. Minimize latency of the inner packet loop (hop-drop-spin)

4. Achieve maximal throughput by loop unrolling in critical operations

5. Avoid floating-point operations in favour of fixed-point

6. Maximize utilization (minimize idle time) of the most resource-intensive blocks

7. Share operators for less-frequently used functions


Pipelining for Maximum Frequency

As spatial computing devices, FPGA designs are best conceptualized in terms of an intercon-

nected spatial layout of logic, computing, and storage elements. In contrast to a CPU or GPU

whose core layout is fixed at manufacturing time, specific areas of the FPGA can be dedicated

to specific operations, such as random-number generation, intersection testing, mesh storage,

etc. Instead of bringing data to the computational core, processing it, and returning it to

memory, the calculations flow through the FPGA from input through intermediate stages and

to output.

Each storage and logic element within the FPGA has a delay associated with it, as does

each link carrying data between elements. Synchronous design, in which an input may be

accepted and an output may be provided at each tick of the clock, is by far the dominant

design style for FPGAs. To ensure correctness, the clock period must be no shorter than the delay of the slowest path within a block, so that all elements have finished computing before their results are stored. If that condition fails to hold, an incomplete or garbled result will be stored

and passed onwards. For long chains of operations (called a pipeline since computation “flows”

through it), the maximum speed may become intolerably slow despite the large silicon area used

for calculation. Generally FPGA designs make use of pipeline registers to store intermediate

results instead of having all the computation happen in a single cycle. By partitioning the

total path delay into segments between storage elements, the maximum segment delay can be

reduced and hence the maximum clock frequency increased. Inputs can therefore be accepted

more frequently, giving better total throughput [68].

The clock-frequency increase from pipelining does not come for free, however. If a function

expresses a recurrence a_{i+1} = f(a_i, . . .), the subsequent value a_{i+1} cannot be calculated until a_i is available, which takes C clock cycles if C registers have been inserted in the path. The

present Monte Carlo simulation is just such a case since a packet’s position after step i + 1

depends on where it was at step i.

Pipeline Parallelism

If a fixed sequence of steps needs to be applied to an input, then those steps can be laid out

in order with each feeding its successor. On the further condition that each element flowing

through the pipeline is independent, i.e. that the path of packet i has no dependence on packet

j (∀ i ≠ j), they may be computed in arbitrary order or in parallel. When many independent

items run through a similar set of steps in parallel there exists pipeline parallelism. In this case,

the sequence is almost fixed with the exception of some branches, as depicted in Fig 4.1.

After a complete hop-drop-spin cycle, the packet repeats the process starting with drawing a

step length. The length of time (latency) for a single packet to complete a loop is not inherently

important; the throughput to calculate a large N (millions) of packets is what matters. In

this sense, Monte Carlo simulation is ideal for FPGAs because it involves simulation of many

independent sample paths. There is no dependency of the state or path between packets so


abundant pipeline parallelism exists.

To exploit pipeline parallelism, then, it is necessary to keep at least C packets in the pipeline

if the loop latency is C. Since packets are independent, a new packet may be launched at any

time, which provides a simple but effective way to guarantee there is always a packet being

provided to the draw step-hop blocks: any time there will be a “bubble” (idle time) in the

pipeline, a new packet is launched to fill it.

Latency

In general, each of the C packets being processed will be located in a different tetrahedron

whose definition must be readily available to complete the step computation. When scaling up

to larger problem geometries where it is not possible to keep all of the geometry in a single

storage location, the packet-loop latency will determine the number of tetrahedrons which must

be kept readily available in a local cache. Since caching is relatively expensive in terms of area,

energy, and complexity, minimizing the cache size required to serve the elements in progress by

minimizing latency is an important factor for ultimate performance and scalability. Introducing

pipeline latency in a computation also requires inserting delay elements to keep the delays of all elements of the packet aligned.

Optimization of the design requires a delicate balance between adding pipeline stages where

appropriate to increase clock frequency, while reducing latency where possible to reduce cache

and state-storage requirements. Latency can be reduced by running independent computations

in parallel (eg. the weight update due to absorption and the direction update due to scat-

tering). This design also implements some strategies for “hoisting” latency out of the main

loop, either by operator strength reduction or by pre-generating random numbers which are

data-independent.

Loop Unrolling for Throughput

The design is targeting maximum achievable throughput for packet computation for a given

area. Unrolling a loop by a factor of R increases throughput by a factor of R, reduces latency by a factor of R, decreases control complexity, and increases area by a factor of R. For instance, the tuple (ab, cd, ef) could be computed in three steps using a single multiplier, which would have a latency of 3 and a throughput of one third (a new output tuple is produced every third cycle) at an area cost of 1. It could also be unrolled so that three multipliers compute in parallel, for a latency of 1 and a throughput of 1 at an area cost of 3. Throughput per area is kept roughly constant, but

latency and complexity are both reduced. A latency decrease is desirable as argued above, and

a decrease in control complexity makes it easier to achieve high clock frequency. All statically-

indexed loops on the critical loop should therefore be unrolled as far as possible.


Fixed-Point Computation

FPGAs natively support fixed-point multiplication and addition with “hard” optimized fixed-

function DSP blocks that are fast, plentiful, and power-efficient. The incremental cost of

supporting floating-point operations is quite high due to the need for additional logic to shift

the operands compared to fixed-point. Full support of the IEEE single or double standards also

requires handling of special cases such as infinity and not-a-number, which add logic complexity.

Special cases are avoided by careful construction of logic to ensure that divide-by-zero and

other pathologies never occur. The additional complexity of floating point is not necessary in

this application because the ranges of all variables are bounded. All spatial coordinates lie

within a bounded mesh-description range; directions are unit vectors which bounds the size of

their components; sines and cosines of angles are similarly bounded; and packet weight remains

always in the range of [wmin, 1]. Given bounded ranges, the only question that remains is how

many bits to allocate to each such that the increment ε between steps is sufficiently small.

For a Monte Carlo simulator, the expected result is correct so long as the expectation of

each step is the correct value, which means that no step introduces bias. A properly-chosen

fixed-point representation should not apply any bias, preserving correctness although additional

variance may be added due to quantization. Up to a point, the quantization noise should be

dominated by other sources of randomness in the system, and even exceeding that threshold

the variance may be overcome by running additional iterations so there exists a natural tradeoff

between area and required number of iterations to achieve a target variance level. By reducing

precision, the silicon area required is decreased (which also correlates to a clock frequency

increase) while the number of packets required to achieve identical variance increases.
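To make the bounded-range argument concrete, the following C++ sketch shows one plausible fixed-point format for a quantity bounded in [-1, 1], such as a unit-vector component; the Q1.17 split, type names, and rounding policy are illustrative assumptions, not taken from the FullMonte source.

#include <cmath>
#include <cstdint>

// Illustrative 18-bit signed fixed-point type for a quantity bounded in [-1, 1]
// (e.g. a unit-vector component): 1 sign bit and 17 fraction bits give a step of
// 2^-17 ~ 7.6e-6, comparable to the precision listed in Table 4.1.
struct Fix18 {
    int32_t raw;                                   // stored in a wider host integer
    static constexpr int FRAC = 17;
    static constexpr double SCALE = double(1 << FRAC);

    static Fix18 fromDouble(double x) {
        // round-to-nearest keeps the quantization error zero-mean, which is the
        // "no bias, only extra variance" property the text relies on
        return Fix18{ static_cast<int32_t>(std::lround(x * SCALE)) };
    }
    double toDouble() const { return raw / SCALE; }
};

// 18b x 18b multiply producing a 36b product, renormalized back to 18b: the kind
// of operation that maps directly onto a hard DSP block.
inline Fix18 mul(Fix18 a, Fix18 b) {
    int64_t p = static_cast<int64_t>(a.raw) * b.raw;
    return Fix18{ static_cast<int32_t>(p >> Fix18::FRAC) };
}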

Maximal Utilization of Expensive Blocks

When a large amount of device area is allocated to a specific function, we wish to maximize

the proportion of time that it is active. Given a system clock frequency f_s and a reciprocal throughput of T cycles per input, the block can perform at most f_s/T computations per second, provided it is given an input every time it is able to accept one.

In scaling the design up to multiple instances, it will be important to consider matching

the density of each functional block type to its relative frequency in the computation. If for

instance an interface happens once in ten hops, then it would be sensible to instantiate ten hop

cores sharing one interface core if the control and queueing costs are not excessive.

4.2.5 Data Representation

The bit widths for the most important data structures are given in Table 4.1. All quantities are

fixed-point, and strive to use element widths which are 9, 18, 27, or 36 bits which fit naturally

into Altera hard DSP blocks and block RAM units. Fixed-point was chosen due to the presence

of definite bounds on all variable ranges. The use of 36 bits for packet weight and 64 bits to


Data item          | Bits      | Range      | Precision   | Comment
2D unit vector     | 2x18 (36) | ±1         | 8x10^-6     |
3D unit vector     | 3x18 (54) |            |             |
3D position vector | 3x18 (54) | ±8 cm      | 0.6 µm      |
Dimensionless step | 18        | 0-63       | 1.2x10^-4   |
Physical step      | 18        | 0-63 cm    | 1.2 µm      |
Packet weight      | 36        | 0-1        | 1.5x10^-11  |
Absorbed weight    | 64        | 0-2x10^8   | 1.5x10^-11  | 200M absorptions per element before overflow
Tetrahedron ID     | 20        | 0-10^6     |             | 3x more than Digimouse
Material ID        | 4         | 0-15       |             |
Interface ID       | 8         | 0-255      |             | Number of distinct material combinations at an interface
Packet             | 294       |            |             | 3x 3D unit vector, weight, 3D position, material ID, tetra ID, dimensionless step remaining
Tetra definition   | 404       |            |             | 4x adjacent tetra ID, 4x4x18 face normals & constants, material ID, 4x interface IDs

Table 4.1: Core FPGA data structures for packet, geometry, and material representation

accumulate absorbed weight are both conservative, given that the weight is always at least w_min, so the smallest increment which can be deposited is (1 − α)w_min, or approximately 10^-8 for an albedo of 99.9% and w_min = 10^-5. Keeping the full precision in the weight accumulator ensures there will be no roundoff error, and its width ensures that the accumulator can handle at least 2^(64−36)/(1 − α) ≳ 10^9 absorption events per element in the worst case, where the packet arrives with unit weight at α = 0.8.

For step lengths, the worst-case interaction coefficient from Cheong [11] is ≈ 3000cm−1

which would yield an average step length of 3µm. By setting the resolution of both position

and step length several times lower than this value, the probability of any given step getting

“stuck” at a given position due to truncation is acceptably low.

The problem description size that can be handled by the data structure (although not

accommodated in on-chip memory) is 1M tetrahedra, 16 materials, and 256 distinct interfaces

(material pairs which are adjacent in the mesh).
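As an illustration only, the bit budget of Table 4.1 could be written down as the following C++ record declarations; the field names are hypothetical, and the actual hardware packs these fields into 294- and 404-bit words rather than host-sized integers.

#include <cstdint>

// Per-packet state from Table 4.1 (294 bits in hardware); hypothetical field names.
struct PacketState {
    int32_t  d[3], a[3], b[3];     // three 3D unit vectors, 18 bits per component
    uint64_t weight;               // 36-bit weight in [0, 1]
    int32_t  p[3];                 // 3D position, 18 bits per component
    uint8_t  material;             // 4-bit material ID (up to 16 materials)
    uint32_t tetraID;              // 20-bit tetrahedron ID
    uint32_t stepRemaining;        // 18-bit dimensionless step remaining
};

// Tetrahedron record from Table 4.1 (404 bits in hardware).
struct TetraDef {
    uint32_t adjTetra[4];          // 4 x 20-bit adjacent tetrahedron IDs
    int32_t  faceNormal[4][3];     // 4 faces x 3 components x 18 bits
    int32_t  faceConst[4];         // 4 x 18-bit plane constants C_i
    uint8_t  material;             // 4-bit material ID
    uint8_t  interfaceID[4];       // 4 x 8-bit interface IDs
};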

4.2.6 Packet Loop Description

At a high level, the packet flow is implemented as presented in Fig 4.1. By inspection, the

intersection-test stage must be the most frequently occurring since it is the only step which

is involved in all cycles of the data flow graph. It is also among the most computationally

intensive (eight 4-element dot products), hence keeping it near 100% utilization is critical to

maximizing performance per unit area.


The boxed region in the figure depicts the drop, roulette, and spin stages which are actually

implemented in parallel to reduce latency. As shown in Sec 2.4.2, the packet will on average

pass the drop stage a large number of times before expiring at roulette. Since roulette does not

alter the position or direction of the packet, the finish-drop-roulette and finish-drop-spin edges

may be merged, so the spin is always executed speculating that the packet continues. If it does,

the result is available with lower latency. When the packet eventually loses, it terminates and

the effort to calculate the speculative spin result is wasted. However since the probability of

termination is on the order of 1% or lower, speculation is generally productive and the possible

savings available from avoiding the cost of mis-speculation are not worthwhile.

4.3 Design Details

4.3.1 Random Number Generation

To produce a set of U01 random numbers, a fully-parallel implementation of the TT800 “Tiny

Twister” (a variant of Mersenne Twister) of Matsumoto and Saito [59] was created. The

Mersenne Twister RNG was chosen because it is a high-quality random number generator with

very long period that uses only bitwise operations which are easily and cheaply implemented

on an FPGA. The original software version which was used as a template and for validation

produces a sequence of 32-bit integers from an 800-bit state vector.

The implementation used here updates all 800 state bits in parallel at a rate that can exceed

500MHz, providing a pseudo-random bit stream at up to 400 Gbit/s with negligible resource

cost. A smaller implementation would suffice, but is not worth the effort for the trivial cost

savings. To produce numbers with a particular statistical distribution, the uniform random

numbers feed a distribution function which manipulates them into the appropriate form.

Randqueue block

MC algorithms by nature require several streams of independent random variables with various

distributions. They are calculated by transforming one or more U01 random variables, which

requires some latency L to compute. Since these must be random and independent of the data

being processed, there is no input data dependency when creating the random variables. A

natural conclusion of this is that the distributed random variables can be computed in advance

and queued so they are ready immediately when needed, supporting the latency-minimization

objective by hoisting the latency out of the inner loop.

A random-number queue was devised which wraps the distribution function and a FIFO

queue of length L + 1. To initialize, L + 1 random numbers are drawn, fed to the calculation

engine, and the results are placed in the queue. When the last is complete, the queue signals

that it is ready to provide random numbers. When a value is subsequently drawn from the

distribution output queue, a new uniform random number is drawn and fed to the calculation


[Figure: pipeline stages shown in the block diagram are Launch, Draw step, Tetra lookup, Hop, Interface, Finish step, Drop, Spin, Roulette, and Dead]

Figure 4.1: Block diagram for the FPGA implementation, with stages requiring random numbers shaded; the boxed group is actually a single block but is expanded to show packet flow; see Fig 5.5 for event frequency details


engine. After L cycles, the result is enqueued thus ensuring that there is always a result

available.

Implementing this design pattern in Bluespec was very simple. A random distribution func-

tion is expressed as a module that has a port with type signature ServerFL#(in t,out t,lat):

it takes an input type in t, and outputs a result of type out t after lat cycles. Other ports are

permitted for use in configuring the distribution, gathering usage statistics, or for other pur-

poses. The input type must be convertible to bits (expressed by membership in the Bits#()

typeclass) so that it may be fed from a U01 RNG. Sample BSV code showing how to draw an

exponential random variable using Randqueue is given in Fig 4.2.

In some cases, multiple different parameters are used with a particular distribution (eg.

differing g values for the Henyey-Greenstein function) but the number of parameters are small

(n ≤ 16 materials). For those cases, a RandqueueMulti block allows a distribution with n

different parameter values to share a single calculation engine feeding n different queues. When

a number is drawn from queue i, a request is issued to the calculation engine with a random

number and the i-th parameter value. On completion, the new distributed random number is

placed back in the queue. The RandqueueMulti module as written can be used for distributions

with any parameter type param t (including tuples, structures, etc), with any random number

generator, any latency, etc without altering a single line of its definition. This is one example

of code composability and reuse in Bluespec.
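The latency-hiding idea behind Randqueue can also be sketched in software as a pre-filled FIFO of transformed variates; the class below is a conceptual C++ analogue with illustrative names, not the BSV module itself.

#include <cmath>
#include <functional>
#include <queue>
#include <random>

// Conceptual software analogue of Randqueue: keep a FIFO of already-transformed
// variates so that a draw never waits for the distribution function.
class RandQueue {
public:
    RandQueue(std::function<double(double)> distFn, unsigned depth, unsigned seed)
        : dist_(std::move(distFn)), rng_(seed), u01_(0.0, 1.0) {
        for (unsigned i = 0; i < depth; ++i)    // pre-fill: hoist latency out of
            fifo_.push(dist_(u01_(rng_)));      //   the inner packet loop
    }
    double draw() {
        double v = fifo_.front();               // always available immediately
        fifo_.pop();
        fifo_.push(dist_(u01_(rng_)));          // refill keeps the queue full
        return v;
    }
private:
    std::function<double(double)> dist_;
    std::mt19937 rng_;
    std::uniform_real_distribution<double> u01_;
    std::queue<double> fifo_;
};

// e.g. a queue of base-2 dimensionless step lengths:
//   RandQueue steps([](double u){ return -std::log2(1.0 - u); }, 8, 42);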

Bernoulli Distribution

The Bernoulli distribution Bp returns 1 with probability p and 0 with probability 1− p, corre-

sponding to the “success” or “failure” of an event. Where p = 2−i, the variable can be created

by the bitwise AND of i random bits. For convenience, the roulette parameter m was chosen

to be 16 so i = 4. In the case where p is not known in advance (eg. Fresnel reflection) or

p ≠ 2^-i ∀ i ∈ I^+, a U01 random number r is drawn and 1 is returned if r < p.
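A minimal C++ sketch of both strategies (names illustrative; assumes i < 32):

#include <cstdint>
#include <random>

// For p = 2^-i, the AND of i random bits is 1 exactly when all i bits are set
// (probability 2^-i); otherwise compare a U[0,1) draw against p.
inline bool bernoulliPow2(std::mt19937& rng, unsigned i) {
    uint32_t mask = (1u << i) - 1u;
    return (rng() & mask) == mask;
}

inline bool bernoulli(std::mt19937& rng, double p) {
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    return u01(rng) < p;               // general case, e.g. Fresnel reflection
}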

Uniform 2D Unit Vector

Several techniques exist for creating random numbers uniformly distributed around the unit

circle in R2. Such a vector can be characterized solely by the angle ψ measured clockwise from

the x axis, so a direct method involves drawing a random angle ψ ∼ U0,2π. Since latency is not

a concern and the direction vectors use a fairly low-resolution (18b) fixed-point representation,

the CORDIC [66] algorithm was chosen. It uses only comparisons, bit shifts, and additions to

compute trigonometric functions digit-by-digit which is ideal for use in an FPGA. A special

implementation of the CORDIC algorithm computes v(u) = (cos 2πu, sin 2πu) so that a uniform

random number u ∼ U01 can be used directly, saving a multiplication by 2π at the input. The

algorithm also exploits symmetry between the quadrants of sine and cosine.


// Tiny Twister 800b parallel RNG
Bit#(800) rng <- mkTT800;

// Wires (just like Verilog wires) to transmit random numbers
Wire#(Bit#(19)) rnd_step <- mkWire;
Wire#(UInt#(18)) rnd_angle <- mkWire;

// on every clock, draw a random 800-bit number; send the lower 19 bits on wire
// rnd_step and the next 18 bits on wire rnd_angle
rule drawStepRandom;
    let rnd800 <- rng.get;
    rnd_step  <= rnd800[18:0];
    rnd_angle <= unpack(rnd800[36:19]);
endrule

// instantiate a module to compute the log of a 19-bit number
let logfcn <- mkLog;

// pass the wire and the distribution-function module to the RNG queue
// NOTES: 1) latency is implicit in the type of logfcn, which is not shown here
//           (the programmer does not even need to know it to instantiate)
//        2) rnd_step doesn't have to be a wire; it could be any module (incl. user-defined)
//           in the ToGet#() typeclass
Randqueue_ifc#(UInt#(19)) rq_steplen <- mkRandqueue(toGet(rnd_step), logfcn);

// Now create a random-number queue for a 2D unit vector using a random input to sincos
let cordicCalc <- mkSinCos;
Randqueue_ifc#(UVect2D_18) rq_unitvector2d <- mkRandqueue(toGet(rnd_angle), cordicCalc);

// Draw and display numbers when available
rule showIt;
    let s <- rq_steplen.get;   // implicit condition here:
                               // the rule can only fire if a number is available
    $display("At time ", $time, " drew a step of length ", s);
endrule

Figure 4.2: BSV example showing use of Randqueue to queue up random numbers


3D Unit Vector

Creating an appropriate uniform distribution over the unit sphere in R3 is slightly more com-

plicated. A naive algorithm using spherical coordinates v = (1, θ, ψ) with θ, ψ ∼ U_{0,2π}, for instance, does not give a correct distribution. If however cos θ ∼ U_{−1,1} and ψ ∼ U_{0,2π}, then a correct distribution can be formed as shown below. In that formulation, cos ψ and sin ψ can be calculated as a 2D unit vector as above, and sin θ = √(1 − cos^2 θ). The fact that sin θ ≥ 0 always is not a problem, since all terms containing sin θ also contain either (but not both of) cos ψ or sin ψ, which are symmetric around 0. In the FullMonte formulation, the auxiliary vectors a, b are needed, and can be calculated directly from the sines and cosines above as follows:

d = (cos θ, −sin θ cos ψ, sin θ sin ψ)   (4.1)
a = (sin θ, cos θ cos ψ, −cos θ sin ψ)   (4.2)
b = (0, sin ψ, cos ψ)   (4.3)

Other techniques exist using rejection sampling of points x ∼ U[01]3 in the unit cube to

find ‖x‖ ≤ 1 followed by normalization to get a unit vector. The current implementation

was chosen for its simplicity, predictable throughput, use of hard multipliers, and avoidance

of special functions (division, square-root). This module is a candidate for resource reduction

since new packets are launched fairly rarely, so the current fully-unrolled implementation offers far more throughput than necessary.
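A C++ sketch of Eqs 4.1-4.3 follows, using floating point for clarity where the hardware uses 18-bit fixed point; the function and type names are illustrative.

#include <array>
#include <cmath>
#include <random>

struct Basis { std::array<double,3> d, a, b; };   // direction plus auxiliary vectors

// Uniform random direction with auxiliary vectors per Eqs 4.1-4.3:
// cos(theta) ~ U(-1,1), psi ~ U(0,2*pi); sin(theta) = +sqrt(1-cos^2(theta)) is
// acceptable because every sin(theta) term is paired with cos(psi) or sin(psi),
// which are symmetric about zero.
inline Basis randomUnitVector3D(std::mt19937& rng) {
    const double TWO_PI = 6.283185307179586;
    std::uniform_real_distribution<double> uCos(-1.0, 1.0), uPsi(0.0, TWO_PI);
    double c = uCos(rng), s = std::sqrt(1.0 - c * c);
    double psi = uPsi(rng), cp = std::cos(psi), sp = std::sin(psi);
    Basis v;
    v.d = {  c,   -s * cp,  s * sp };   // Eq 4.1
    v.a = {  s,    c * cp, -c * sp };   // Eq 4.2
    v.b = {  0.0,  sp,      cp     };   // Eq 4.3
    return v;
}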

Exponential Distribution

The CDF and ICDF of the exponential distribution E_µ with parameter µ (mean µ^-1) are

F_µ(x) = 1 − e^(−µx) = F_1(µx)   (4.4)

F_µ^-1(y) = −(1/µ) ln(1 − y) = (1/µ) F_1^-1(y)   (4.5)

The entire family of exponential distributions with different parameters µ can be generated

by appropriate scaling of the unit exponential F_1(x). To economize, the distribution used in hardware actually computes F_{ln 2}^-1(x) = −(1/ln 2) ln(1 − x) = −log_2(1 − x), since it is easier to compute for binary numbers. The constants k_{t,i} = µ_{t,i}/ln 2 are stored so that the correct step lengths can be derived from the base-2 dimensionless step length.

The base-2 logarithm is calculated using

log_2(2^i (1 + x)) = i + (1/ln 2) ln(1 + x)   (4.6)

First the number of leading zeros is counted, then the Taylor series for log2(1+x), 0 ≤ x < 1

is used for the remaining digits.
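The base-2 step-length draw and its conversion to physical units can be sketched as follows in C++; std::log2 stands in for the hardware's leading-zero count plus Taylor series, and the names are illustrative.

#include <cmath>
#include <random>

// Base-2 dimensionless step length l = -log2(1 - u), u ~ U[0,1).  One unit
// exponential serves every material.
inline double drawStepBase2(std::mt19937& rng) {
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    return -std::log2(1.0 - u01(rng));
}

// Physical step for a material with total interaction coefficient mu_t:
// s = (-ln(1 - u)) / mu_t = l * ln(2) / mu_t, i.e. l divided by (mu_t / ln 2).
inline double physicalStep(double l, double mu_t) {
    const double LN2 = 0.6931471805599453;
    return l * LN2 / mu_t;
}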


Henyey-Greenstein Phase Function

To calculate the Henyey-Greenstein function for the scattering deflection angle, the ICDF is

calculated using the formula of Eq 2.26. At its output, the HG function provides cos θ, sin θ

so that scattering can be accomplished by just multiplication and addition once the direction

vector and auxiliary vectors are provided. For more efficient hardware calculation, the Henyey-

Greenstein ICDF for cos θ can be partitioned into material-dependent constants k0, k1 and

functions of the random variable:

k_0 = (1 + g_m^2) / (2 g_m)   (4.7)

k_1 = (1 − g_m^2) / √(2 g_m)   (4.8)

cos θ = k_0 − (k_1 / (1 + g_m u))^2,   u ∼ U_{−1,1}   (4.9)

which gives the desired equation (Eq 2.26). The sine is calculated from √(1 − cos^2 θ), and its

sign does not matter since it is the component in the azimuthal plane which is controlled by a

uniform random vector.
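A C++ sketch of Eqs 4.7-4.9 follows; the per-material constants are precomputed as in the hardware, while the clamp and the near-isotropic (g ≈ 0) guard are robustness additions of this sketch rather than details taken from the design.

#include <cmath>
#include <random>
#include <utility>

// Per-material constants of Eqs 4.7-4.8 (precomputed once per material).
struct HGConstants {
    double g, k0, k1;
    explicit HGConstants(double g_)
        : g(g_),
          k0((1.0 + g_ * g_) / (2.0 * g_)),
          k1((1.0 - g_ * g_) / std::sqrt(2.0 * g_)) {}
};

// Draw (cos theta, sin theta) per Eq 4.9 with u ~ U(-1,1).
inline std::pair<double, double> sampleHG(const HGConstants& m, std::mt19937& rng) {
    std::uniform_real_distribution<double> u11(-1.0, 1.0);
    double u = u11(rng);
    double c = (std::fabs(m.g) < 1e-6)
                   ? u                                           // isotropic limit (sketch-only guard)
                   : m.k0 - std::pow(m.k1 / (1.0 + m.g * u), 2.0);
    if (c > 1.0) c = 1.0;                                        // clamp against roundoff
    if (c < -1.0) c = -1.0;
    return { c, std::sqrt(1.0 - c * c) };
}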

4.3.2 Photon launch

Only isotropic point sources are supported in the current implementation, though a diversity of

sources could easily be added. To launch a new photon, the weight is set to unity, the position

to the position of the point source, and the direction to a randomly-drawn 3D unit vector as

described above. Since all quantities are either constant (weight, position) or drawn from the

random-number queue, this step has no latency.

4.3.3 Step length generation

Step lengths are generated in base-2 dimensionless terms (Sec 2.4.2) so only a single exponential

RV l ∼ E_{ln 2} is required. The conversion to physical dimensions is done within the intersection-test block using the scaled parameter µ_t/ln 2. Function latency is hidden using a Randqueue block

so that step lengths are always available latency-free.

4.3.4 Tetrahedron Lookup

In the current implementation, all tetrahedrons are stored in large array of Block RAM. Each

element is 404 bits and up to 64k tetrahedra can be stored, requiring an 11x128 array of block

RAM (1408/2560 blocks, 28160/51200 kbit of Stratix V A7 total capacity). That array size

covers the entire “cube 5med” mesh used for testing, or 20% of the Digimouse mesh. If the


most-frequently-used Digimouse elements were stored, the on-chip set would cover 95% of all

memory accesses.

4.3.5 Intersection test

All necessary quantities for the intersection tests are computed directly using multiply-add

hardware blocks. For a given ray and tetrahedron, we need to know if the ray intersects the

tetrahedron within the current physical step length. If it does, then we need the cosine of the

angle, the intersection point, and the (physical) distance. The calculation starts by finding

which face is the closest to the ray, by first finding the angle between the ray and each face,

and the height over that face:

cos θ_i = d · n_i   (4.10)

h_i = p · n_i − C_i   (4.11)

Of the four faces in a tetrahedron, a given ray can point towards at most three of them so one

can be eliminated from the comparison. For rays that point towards a given face (cos θi < 0),

the distance di to the face is given by hi = di cos θi. Since division is a long-latency operation, we

wish to avoid it where possible to minimize the number of in-flight packets at a given moment.

To find the closer of faces i and j, the test

d_i < d_j  ⟺  h_i/cos θ_i < h_j/cos θ_j   (4.12)

can be checked more quickly and without division by computing

h_i cos θ_j − h_j cos θ_i < 0   (4.13)

which can be done entirely within a Stratix V DSP block. The first two of the faces the ray points toward are compared in this way, and the winner is then compared with the third to find the face closest to the ray.

Lastly, the physical step length to the nearest face, d_i = h_i/cos θ_i, is checked against the dimensionless length of the current step l. A similar trick is done to check if the step terminates

inside the current tetrahedron without dividing.
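The division-free comparison can be sketched in C++ as below. The sign convention assumed here (inward-pointing face normals, so h_i ≥ 0 inside the tetrahedron and cos θ_i < 0 for faces the ray approaches) is an assumption of the sketch and may differ from the exact definitions used in the hardware.

struct Face { double n[3]; double C; };   // face plane: n . x = C

inline double dot3(const double u[3], const double v[3]) {
    return u[0]*v[0] + u[1]*v[1] + u[2]*v[2];
}

// Returns the index of the face the ray would hit first, or -1 if the ray points
// toward no face (should not occur for a position strictly inside the tetrahedron).
inline int closestFace(const Face f[4], const double p[3], const double d[3]) {
    int best = -1;
    double bestH = 0.0, bestCos = 0.0;
    for (int i = 0; i < 4; ++i) {
        double cosi = dot3(d, f[i].n);           // Eq 4.10
        double hi   = dot3(p, f[i].n) - f[i].C;  // Eq 4.11
        if (cosi >= 0.0) continue;               // ray points away from this face
        // distance d_i = h_i / (-cos theta_i); with h_i >= 0 and cos theta_i < 0,
        // d_i < d_best  <=>  h_i * cos(theta_best) > h_best * cos(theta_i),
        // so the comparison needs only multiplies (no division).
        if (best < 0 || hi * bestCos > bestH * cosi) {
            best = i; bestH = hi; bestCos = cosi;
        }
    }
    return best;
}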

4.3.6 Interface

Handling of refractive index boundaries is not currently supported, though it could be imple-

mented using largely existing blocks. The interface block is currently used only for calculating

the point where a ray meets the tetrahedron face. It contains a divider to evaluate s = h/cos θ

and from that calculate the intersection point q = p + sd.


4.3.7 Absorption, roulette, spin, and step finish

As noted in the data flow diagram of Fig 4.1, the absorption, roulette, and step-finish stages

are merged because they operate on independent data.

Absorption

The albedos αm for materials m ∈ [0, 15] are stored in a lookup table. When the packet is

partially absorbed in material m, its weight is multiplied by the albedo so that w′ = wαm.

The difference w−w′ is computed and written to an output port of the module along with the

tetrahedron ID that the packet currently inhabits, for purposes of accumulating volume fluence.

Roulette

If the packet weight is below the threshold wmin at the conclusion of the absorption step,

then the packet is subjected to roulette (Sec 2.4.2). A B_{1/16} random variable is formed by the

bitwise AND of four bits from the random number generator. If the result is 1, then the packet

continues with increased weight 16w. The value m = 16 was chosen because it is easy to work

with using bit manipulation: multiplication by 16 is the same as a bitwise left-shift by 4 places,

and the Bernoulli random variable is easy to generate by bitwise AND. In parallel with the

roulette step, the packet is speculatively continuing through the spin step since the probability

of termination is on the order of 1%.
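A C++ sketch of the merged drop and roulette logic described above; the types are simplified to doubles and the names are illustrative.

#include <random>

struct DropResult { double newWeight; double deposited; bool alive; };

// Merged drop + roulette: deposit (1 - albedo) * w into the current element, then
// roulette if the new weight has fallen below w_min.  Survival probability is
// 1/16 (AND of four random bits) and survivors are scaled by 16, so the expected
// weight is unchanged.
inline DropResult dropAndRoulette(double w, double albedo, double wmin,
                                  std::mt19937& rng) {
    double wNew = w * albedo;              // w' = w * alpha_m
    double dep  = w - wNew;                // scored against the current tetrahedron
    if (wNew >= wmin)
        return { wNew, dep, true };
    bool survive = (rng() & 0xFu) == 0xFu; // four random bits all set: p = 1/16
    return { survive ? 16.0 * wNew : 0.0, dep, survive };
}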

Spin

Scattering occurs based on the Henyey-Greenstein distribution. Since there are a small number

(≤ 16) of materials, the constants k0, k1 (Eq 4.9) are stored for each material and connected to

a RandqueueMulti to hide the latency. Numerical precision was optimized as well, noting that

g ≳ 0.8 for typical biological materials in the optical window, which limits the range of many

of the components of the equation.

The Scatter function block itself is a deterministic application of the input deflection and

azimuthal angles represented by cos θ, sin θ, cosψ, sinψ to the input direction vectors d, a, b.

The matrix scattering formulation described in Sec 2.4.2 is applied directly using hard multiply

blocks to compute the scattering matrix inputs in two cycles, then multiply-add blocks to apply

the matrix to the input data in another two clock cycles.

By decoupling the calculation of the scattering angles from their application to a vector,

different phase functions can be used. Though the Henyey-Greenstein function is very common

in biophotonics, enforcing the distinction between generating the random angles and applying

them allows flexibility, improves clarity in the source code, and simplifies testing.


Step Finish

Since absorption modifies only weight, and spin modifies only direction, the packet may also

complete its step in parallel by traveling the originally-planned s units to update its position

to p′ = p + sd. This is a speculative step assuming the packet survives.

4.3.8 Altera DSP Primitives

In order to extract maximum performance from the Stratix V FPGA, we opted for explicit

instantiation of Verilog IP cores. Neither Verilog nor BSV-generated Verilog resulted in correct

inference of the hard-block multiply-add operation, so the synthesized hardware used several

times more DSP units than necessary. It appears that the issue is with Altera Quartus synthesis

rather than the Bluespec Compiler. However, there is an issue within the Bluespec compiler

wherein signed multiplications require more DSP units than expected. It can be worked around

by explicit instantiation of a DSP core or by calling out to a suitable Verilog module. Some such

issues remain at the time of writing as noted in the results discussion, but they can be fixed

given some time. Since Bluesim cannot handle BSV-Verilog mixed-language simulation, we

also created functionally-identical blocks in BSV for the DSP cores (signed multiplication and

dot-product). Large random test vectors were used in Modelsim to ensure exact correspondence

between the behavioral and RTL models.

4.3.9 Mathematical operators

The bulk of the operations in FullMonte are multiplication and multiply-add. All operations

which are not were implemented in Bluespec using standard algorithms, avoiding the use of

floating-point IP which tends to require more area and DSP units. A custom base-2 logarithm

module was written to exploit the low precision requirement (18b) and to avoid use of floating-

point IP. As previously discussed, a CORDIC-based sine-cosine module was written in BSV

both as an exploration of Bluespec for numerical algorithms, and to use an input range of [0, 1)

instead of [0, 2π), saving a multiplication. A digit-by-digit (CORDIC-like) square-root module

was also implemented for generation of 3D unit vectors, since latency is not a concern due

to the Randqueue structure and it had a fairly small logic footprint. There is also a module

calculating √(1 − x^2) using a Taylor series for calculating sine given cosine or vice-versa. While

not tightly-optimized or thoroughly examined, we believe the use of custom modules instead of

floating-point vendor IP resulted in net DSP-unit savings and a good opportunity to evaluate

Bluespec for simple numerical algorithms. There remains room to tweak the speed and area of

the numerical cores, however they are not performance-critical at the moment.


Chapter 5

Results

We present two implementations of the FullMonte algorithm here. The first is a highly-

optimized C++ implementation using multi-threading and Intel SSE intrinsics to achieve high

performance on a standard CPU. Run-time requirements for a variety of scenarios are presented

for an Intel Sandy Bridge quad-core CPU with SMT¹ providing eight logical cores. The other

is a custom digital-logic implementation written in Bluespec SystemVerilog for an FPGA. We

did not create a physical realization; however we fully validated and synthesized the design and

are confident that the results presented here can be realized in functional, accurate hardware.

Non-trivial additional effort would be required to support data transfers from the host to the

device, without gaining any additional insight into the core of the problem so that has been de-

ferred to future work. We have also skipped implementing refractive index boundaries (Fresnel

and total internal reflection, refraction), though later discussion will demonstrate that has little

impact on the conclusions drawn. To demonstrate correctness, we did bit- and cycle-accurate

(identical-to-hardware) simulations using the Bluesim hardware simulation environment. Al-

tera's Quartus II program was used to synthesize the design, producing area, speed, and power

results for a current high-end 28nm FPGA, the Altera Stratix V A7 device (speed grade C1,

fastest available).

The balance of this chapter is divided into five sections. First, we demonstrate the cor-

rectness of both the software and hardware implementations by internal consistency checks and

external comparison with another simulator. Next, we show results from profiling tools built

into the FullMonte software simulator that identify the operations and memory accesses that

are most critical to performance. The software and hardware performance each receive one sec-

tion of detailed discussion, followed by the presentation of an innovative hardware architecture

to scale up to larger meshes and higher performance.

¹ Simultaneous Multi-Threading, the sharing of one physical core by multiple execution threads. Intel brands this “HyperThreading”.


5.1 Validation

Our validation strategy uses three parts: unit testing; internal online consistency checks in-

cluding assertions and conservation of energy; and external checks against a reference simulator

(TIM-OS). The validation of software and hardware are presented below in parallel since they

use the same techniques and concepts. We validated the FullMonte software simulator first

against the existing TIM-OS software simulator (Sec 2.5.7) using its provided test suite. Then,

confident of its accuracy, we used it to evaluate the output of the hardware design.

5.1.1 Unit Tests

Both the hardware and software were extensively unit-tested to ensure correct function of

individual blocks.

To create the software model, a number of libraries were used including Julien Pommier’s

fast SSE math routines [56] for sin/cos/log, as well as Saito and Matsumoto’s SFMT Mersenne

Twister RNG implementation [59]. Use of existing libraries provided highly-optimized, easy-

to-use routines that required minimal validation. We used Octave (a Matlab-like numerical

environment) to generate, manipulate, and visualize tetrahedral meshes, and to validate the

program blocks dealing with the mesh (eg. intersection, finding the tetrahedron enclosing a

point, etc). It was also used to test the statistical distribution of RVs including the Henyey-

Greenstein function.

For the hardware implementation, all major blocks were validated individually. Deter-

ministic blocks such as logarithm (used for the Eln 2 RV generator), sin/cos calculation, divi-

sion, square-root, intersection testing, step finishing, and Henyey-Greenstein evaluation were all

tested with large numbers of random inputs and cross-validated between software, hardware,

and separate implementations in Octave. The Tiny Twister RNG was compared against the

authors’ original software implementation for ten million cycles.

5.1.2 Assertions

Assertion checks are used in both the hardware and software implementations to verify that

certain invariants hold. For instance, the packet direction vectors must be orthonormal ie.

d · d = a · a = b · b = 1, and d · a = a · b = d · b = 0. We use assertions to check that this and

other properties (such as non-overflow of queues). In this case, assertion failure would indicate

that excessive roundoff error had accumulated or that the spin calculation was incorrectly

applied. Since they carry a heavy performance penalty, assertions are disabled when compiling

the software to run performance tests. In the hardware implementation, assertions are used in

simulation only and automatically removed by the compiler before synthesis.


5.1.3 Conservation of Energy

The simulator should by design follow conservation of energy: the total packet weight launched

should equal the total that was absorbed plus the total that exited the geometry. Some zero-

mean noise is introduced through the roulette process, so the amount of energy added and

removed during roulette are both accumulated as well. These statistics are gathered during

the simulation in both the software and hardware versions to verify correct operation. Both

implementations obey conservation of energy to within very tight tolerances (on the order of 10^-8 of the weight launched).
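One way the bookkeeping described above could be arranged is sketched below in C++; the exact balance equation and member names are assumptions of the sketch.

#include <cmath>

// Energy ledger: launched weight should equal absorbed + exited weight once the
// (zero-mean) roulette additions and removals are accounted for.
struct EnergyLedger {
    double launched = 0, absorbed = 0, exited = 0,
           rouletteAdded = 0, rouletteRemoved = 0;

    bool balanced(double relTol = 1e-8) const {
        double in  = launched + rouletteAdded;
        double out = absorbed + exited + rouletteRemoved;
        return std::fabs(in - out) <= relTol * launched;
    }
};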

5.1.4 Comparison to Reference Simulators

Two other public-domain simulators (MMCM and TIM-OS, presented in Ch 2) are able to

address the same problems as FullMonte. Of the two, TIM-OS is the more full-featured and

widely-used. It comes packaged with a suite of several test cases, covering a variety of source

types, optical properties, and geometries. We used the entire test suite to validate the software

simulator, and show one detailed example below.

The test cases shown below simulates BLI using Digimouse [20], which is a widely-used

freely-available digital model of a mouse often used for bioluminescence imaging experiments.

The dataset contains co-registered PET, MRI, CT, and cryosection optical images along with

an anatomical atlas created by an expert to delineate organs. The TIM-OS test set includes

one Digimouse case where an extended light source is included to model a BLI-tagged tumour.

We ran simulations using one billion packets through both FullMonte software and TIM-OS,

collecting the emittance from each surface triangle and the fluence within each tetrahedral

element. The MC technique scores energy absorbed or emitted, which is then converted to

fluence using Eq 2.3 or Eq 2.4. Since the coefficient of variation for each measurement is inversely proportional to the square root of the number of packets recorded (a count which is proportional to the element's

area or volume, and roughly correlated to fluence), the comparison was done in terms of energy

per element instead of fluence. To use fluence would unduly amplify variation in small elements

which would make the results harder to compare.

Figure 5.1 shows the comparison for the energy exiting the geometry for each triangular

surface patch. Each figure shows four graphs, to be read in order left-to-right and then top-to-

bottom. The first at top left shows a log-log plot of output from FullMonte (B) versus TIM-OS

(A) on an element-by-element basis ie (logA, logB). Since MC models a random process, the

outputs may differ for two reasons: either there is a bias, or due to random fluctuation in the

output which should reduce as packet count (recorded fluence) increases. Convergence to tight

tolerances with increasing fluence indicates that bias is not present. The second shows a measure

of percentage difference (B − A)/A, and as expected, the elements which recorded more energy showed

a lower coefficient of variation. The bottom-left panel shows a more detailed comparison for the

top 5000 elements (either surface patches or tetrahedral volume elements), which collectively

account for over 99.9% of surface energy emitted. Figure 5.2 shows a comparison for volume


elements the same way, with the top 5000 elements covering 91.6% of absorbed energy. Both

show that the simulator results agree.

Some features of the validation graphs require explanation. Generally, one assumes Gaussian

noise when examining the variance of a process which is a combination of many random factors.

In the top-left panel of Fig 5.2, there is noticeable asymmetry for volume elements with counts of 100

and lower. The actual distribution cannot be Gaussian since it is constrained to be positive,

which enforces asymmetry. If the arrivals were actually IID binomial, some upwards skew would

be expected for small samples. The skew is increased because the absorption events are not

independent: given that a photon is in a tetrahedron, it is likely to deposit energy there multiple

times before expiring. There is also a curve at the bottom-left corner of the top-right plot.

Packets do not propagate with weight less than wmin, so the minimum quantum of absorption

is (1 − α)wmin ≈ 0.1 · 10−5. Around 10−6 one can see that error is correspondingly quantized

between ≈ 0% and −100%.

The skewness is more pronounced in the lower-left panel where error is presented on a

linear scale as a percentage of the reference (TIM-OS) value. It is worth noting here that skew

is due to the result presentation, where a factor of two could result in error of -50% or +100%

depending on which way the ratio goes. There is also a much greater density of points at the

lower values, making the variance appear relatively greater as well since the large density of

points near zero error are not distinguishable. We note that the values appear to have a zero

median and clear convergence towards zero error. As the bottom-right panel shows, the error

follows a 1/√x curve to zero to within tight bounds for the highest-fluence elements.

Hardware

We validated the hardware implementation using a bit- and cycle-accurate model compiled from

the original Bluespec SystemVerilog code into C++ using Bluesim. The comparison technique

was identical to the Digimouse case above, with two exceptions. First, we used a smaller test

case called “cube 5med” because Digimouse would not fit within the on-chip memory. Second,

we ran only 1.6 million packets due to the requirement to finish in reasonable time. The

geometry is a cube made of 48,000 tetrahedra and 4,800 surface elements, with five internal

layers of differing properties (µa, µs, g, n). We altered the case to make the index of refraction

homogeneous at n = 1.0, since reflection and refraction calculations are not implemented yet.

The outputs are compared against the FullMonte software in Fig 5.3, showing convergence

towards the correct value.

Contrary to the Digimouse case, the coefficient of variation is higher on the exiting energy

than the absorbed energy. The absorption map was built using on average over 700 absorption

events per packet, whereas the surface fluence is the result of just under 1 event per packet

(a very small number were terminated in roulette). Since only 1.6 million packets were run,

the results have not had sufficient packets to converge as tightly as the software comparison.

Despite the larger surface element variance at the chosen packet count, we are confident of the


Figure 5.1: Results comparison (FullMonte software vs. TIM-OS) for Digimouse energy per surface element


Figure 5.2: Results comparison (FullMonte software vs. TIM-OS) for Digimouse energy per volume element


Figure 5.3: Validation of FullMonte hardware simulation vs FullMonte software

simulation quality because the results agree tightly when volume fluence is considered and the

output obeys conservation of energy. Since the surface triangles are faces of the tetrahedral

volumes and the volumes show correct fluence, the surface fluence should be correct as well.

To simulate hardware running 1.6 million packets took 18 hours of PC time, approximately

2600x slower than the optimized software model on the same computer. While that may appear

slow, it is actually a noteworthy result for simulation of digital hardware due to the substan-

tial complexity involved in modeling exact device behavior including all necessary queues. The

simulator is single-core only, reducing the equivalent gap between the detailed hardware simula-

tion and the highly-optimized C++ implementation to 470x. Subjectively, based on experience, it compares very favourably with (and is integer factors faster than) more traditional methods of RTL-level simulation.

5.2 Algorithm Profiling

In order to optimize both FullMonte implementations, we gathered detailed information on

what operations are most frequent, and the distribution of memory accesses in time and space.


Figure 5.4: Photon packet event frequency

5.2.1 Operation Frequency

There exists a huge disparity in the frequency of various operations on a packet. As shown in

Fig 5.4, intersection testing is by far the most common operation, followed by scattering. Data

are presented for three variations of Digimouse (high-albedo with 2µ_s, standard, and low-albedo with µ_s/2), and cube 5med. Interface-related calculations place far behind in the test cases used,

as would be expected since the finest unit of geometry is a single tetrahedron and it is likely

to take many tetrahedrons to describe any more complex shapes which would have refractive

index differences. Since tetrahedrons without interfaces far outnumber those with interfaces,

interface-related calculations should be rare.

The same data is presented in a different way in Fig 5.5 which shows an annotated flow

diagram derived from the Digimouse test case run with profiling enabled. Each node is labelled

with the average number of times the operation occurs in a simulation, while the edges are

tagged with the probability of a packet following that edge from the preceding node.


[Figure: flow graph with average per-packet operation counts of Launch 1, Draw step 376.4, Tetra lookup 542.3, Hop 542.8, Interface 166.5, Step finish 375.8; edge probabilities include 100.0%, 69.2%, 30.7%, 99.7%, 0.3%, and 99.9%]

Figure 5.5: Algorithm flow graph annotated with transition probabilities (edges) and average per-packet operation counts (nodes) for Digimouse at standard albedo

5.2.2 Memory Access

While the CPU has a fixed memory architecture, the programmer may still alter program

sections to make optimal use of the provided hardware. When designing an FPGA implemen-

tation, there is considerably more flexibility in the types of memory used and caching schemes

employed. Compared to simpler (MCML-like) geometries, the tetrahedral model requires or-

ders of magnitude more elements, each several times larger than a layer definition in MCML.

Fast access to memory is therefore critical to performance of this algorithm on any computing

device. Using the existing logging framework, a module was created which tracks all accesses

to the mesh storage and to the absorption array. A trace analyzer was created to do statistical

analysis of the data generated from actual simulation runs.

One of the distinctive features of modern CPUs compared to other computing platforms

(GPU, FPGA) is their very large LRU (least-recently-used) cache which serves to hide the very

long latency required to access main memory. Each time a memory address is requested for

read or write, the processor checks the address to see if the memory contents are held in the

cache. If so (a cache hit), it is able to complete the request using the cache copy rather than

waiting to access main memory. Otherwise (a cache miss), it fetches the result from memory

and puts it in the cache, ejecting the least recently used item in the cache to make space.

Considerable design effort, silicon area, and power are expended to provide a high-performance

cache, in particular the logic to determine which addresses are resident and to implement the

replacement policy. Under the assumption of temporal locality, ie. that items recently used will

likely be used again soon, such a cache is highly effective.

Profiling of the FullMonte algorithm on the other hand found that the algorithm exhibits


limited temporal locality. The graph at top left of Fig 5.6 shows the statistical distribution

of the number of distinct accesses before a given address is accessed again, produced by the

trace analyzer. The graph can be interpreted as plotting hit rate against cache size n for a

cache implementing fully-associative perfect LRU. If the presently-requested address is one of

the n most recently accessed, it will be resident in the cache. A cache of only the eight most

recently used (MRU) elements can serve 60% of tetrahedron requests for most cases while the

next thousand elements increase that count only marginally as illustrated by the conditional

hit rate in the top right panel. Given that the element requested is not in the first 8 elements,

its probability of being in the next thousand is quite small (5-20%). Fortunately for the CPU, its cache is large enough to hold the entire working set (n ≈ 10^5, the far right of the graph)

so the large penalty involved in accessing main memory is avoided. However, the allocation of

power and silicon area is not optimal and so other devices which allocate area differently may

be expected to outperform.

What is evident, though, is a non-uniform distribution by address. The lower-left panel

shows the hit rate when the n most-frequently-used elements are stored in the cache, instead of

most-recently. Such a system is known as a Least-Frequently Used (LFU) or “Zipf” replacement

policy [7] which should provide better results at lower cost. Given the high hit rate for a small

LRU cache, it would be attractive to use a hybrid system with a small LRU cache whose

missed requests are served by a larger LFU cache. Further simplicity could be gained from the

observation that access probabilities are stationary within a given simulation which may last

minutes, so the cache set could be chosen statically. The theoretical conditional hit rate for

exactly such a system is shown in the bottom-right panel for the Digimouse (standard-albedo)

case. Similar results are seen for all four test cases, with hybrid outperforming significantly.

The FPGA design proposed below in Sec 5.5 exploits exactly these characteristics to propose

a highly-efficient customized memory system.
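The kind of analysis behind the perfect-LRU curve can be sketched as a reuse-distance computation: the hit rate of a fully-associative LRU cache of size n is the fraction of accesses whose reuse distance is less than n. The C++ below is an illustrative, unoptimized version of such a trace analyzer, not the tool used in this work.

#include <cstddef>
#include <cstdint>
#include <list>
#include <unordered_map>

// Reuse-distance (LRU stack distance) analyzer: for each access, report how many
// distinct addresses were touched since the previous access to the same address.
// Linear scan kept for clarity; a production analyzer would use a balanced tree.
class ReuseDistance {
public:
    // Returns the stack distance, or SIZE_MAX for a first-time (compulsory miss) access.
    std::size_t access(uint32_t addr) {
        std::size_t dist = SIZE_MAX;
        auto it = pos_.find(addr);
        if (it != pos_.end()) {
            dist = 0;
            for (auto p = stack_.begin(); p != it->second; ++p) ++dist;  // count more-recent addresses
            stack_.erase(it->second);
        }
        stack_.push_front(addr);        // addr becomes the most recently used
        pos_[addr] = stack_.begin();
        return dist;
    }
private:
    std::list<uint32_t> stack_;         // MRU at the front
    std::unordered_map<uint32_t, std::list<uint32_t>::iterator> pos_;
};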

5.3 Software Performance

All experiments were performed on an Intel i7-2600K 3.5 GHz quad-core CPU with SMT

allowing eight simultaneously active threads.

Since intersection testing is the most common operation, the most bandwidth-intensive,

and the most compute-intensive, it is a reasonable first approximation of overall performance.

Within a given test case, the ratio of intersection tests to other operations must also be fixed

so it gives a proxy for overall computing effort. For the balance of this discussion, the term

Mints (Millions of INtersection Tests per Second) will be used to measure an implementation’s

performance in performing such tests. Likewise, Mabs (Millions of Absorptions per Second)

stands for the number of absorptions recorded per second. Absorption events have very low

compute intensity but result in two memory accesses per absorption (read-accumulate-write)

which need to be atomic: a non-trivial requirement for highly parallel systems. FullMonte,


Figure 5.6: Cacheability of four different test cases, showing relatively low hit rate for LRU cache at top left/right (note logarithmic scale for cache size); static Zipf cache at bottom left is better; bottom right shows L2 hit rate for two options with Digimouse (std): Hybrid (L1 LRU, L2 LFU) requires 2377 elements for 50% hit rate, while pure LRU (L1 LRU, L2 LRU) requires 8246


Digimouse          | Complex mesh representative of BLI applications; run with high-albedo (2µ_s) and low-albedo (µ_s/2) variations
Cube 5med          | A regular cube with five layers of differing optical properties (modified from the original case by setting n = 1.0 for all layers); also ran a variant with 2µ_s
Fourlayer          | Thin tissue section consisting of four layers
Half-sphere air    | Non-absorbing, non-scattering half-sphere
Half-sphere tissue | A scattering version of the above case
Onelayer           | Single thin layer of tissue with four different combinations of optical properties spanning a range of 4x in scattering and 2x in absorption

Table 5.1: Test cases and variants used to evaluate operation complexity vs run time

like TIM-OS, uses a per-thread queue of absorption events and then locks the main absorption

array. To scale up to a very large number of cores, the serialization imposed by such locking

may become a heavy penalty.

Figure 5.7 below shows that the number of intersection tests required predicts run time very well (R^2 > 99%) across a wide variety of problem descriptions derived from the TIM-OS test suite, as summarized in Table 5.1. The “half-sphere air” test case which is non-scattering and

non-absorbing gives an upper bound on the performance achievable by the CPU for intersection

testing at 95 Mints. By removing virtually all of the other operations (packets must still be

launched), performance improves by only 35%, suggesting that intersection testing is responsible

for nearly 3x as much run time as the other operations in the average case. Clearly, intersection testing is the dominant factor in CPU performance.

5.3.1 Caching

Cache-hit profiling using Cachegrind revealed that the miss rate of the last-level cache was

below 0.01% when running Digimouse, indicating that main memory latency has essentially

no impact on the algorithm’s performance. We saw considerable speedup from Simultaneous

Multi-Threading, which further suggests that the design is not bound by memory throughput

since all cores share the L3 cache and main memory. If the design were memory-throughput-

bound then adding additional computing cores would not increase speed. On the CPU at least,

the silicon area dedicated to caching exceeds what is necessary and a hypothetical device of

the same area with more compute capability and less caching would likely outperform it. Other

architectures such as Intel MIC (Many Integrated Core) or GPU platforms may achieve better

results since they allocate their silicon area differently between caching and computing.

5.3.2 Comparison to TIM-OS

FullMonte’s software implementation provides slightly (10%) better performance than TIM-

OS when used at the same wmin value. Since TIM-OS is automatically vectorized by the


Figure 5.7: Software run time vs. operation count: Mints and Mabs for a variety of test cases, showing Mints as a predictor of run time


                   Time (s)
Threads    TIM-OS    FullMonte
1          447       443
2          227       228
4          119       123
8          83        76
16         83        76
32         83        76

Table 5.2: Comparison of FullMonte and TIM-OS run times for the Digimouse standard-albedo case

Intel C Compiler (ICC) while FullMonte has been hand-optimized using intrinsics, this is an

impressive result for the ICC. Limited further avenues exist to boost performance as discussed

in the chapter on future work. Details are provided in Table 5.2.

5.3.3 Multi-Threading

The FullMonte algorithm is very scalable across threads, showing linear increase for 1-4 cores,

and an additional speed boost of 55% when using the logical cores provided by SMT. Its

scalability slightly exceeds that of TIM-OS, possibly due to the lower bandwidth requirements of single-precision floats instead of doubles. With double-precision values, the two logical cores sharing a physical core may contend more for L1/L2 cache capacity, or contention may increase when cores read from the shared L3.

5.3.4 wmin parameter

As discussed in Sec 2.4.2, the wmin parameter provides an important quality-runtime tradeoff

that is independent of the other optimizations discussed here. Figure 5.8 shows the variance

impact of altering the parameter from its typical value of 10⁻⁵ up to 0.1. Generally, the higher

the proportion of packets terminating by roulette the larger the impact. If most or all packets

exit the geometry, then decreasing wmin has no impact since simulation terminates for reasons

other than roulette. This effect should be most pronounced when modeling BLI or IPDT-like

cases because they generally have few packets exiting. As shown in Table 5.3, the run-time

impact is significant while Fig 5.8 indicates that the quality loss (high variance) occurs in

elements with undetectably low fluence levels up until wmin = 10⁻³. The bold vertical line

shows the ideal dynamic-range limit of a 16-bit sensor as is typically used for BLI, assuming no

pixels are allowed to saturate. The variance lying to the left of the line would not be observable,

but the simulation would run about 40% faster.
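For reference, the roulette rule controlled by wmin can be sketched as below; the survival multiplier m and the function name are illustrative rather than taken from the FullMonte source.

```cpp
#include <random>

// Russian roulette controlled by wmin: once a packet's weight drops below
// wmin, it survives with probability 1/m (with its weight scaled by m so that
// energy is conserved in expectation) and is terminated otherwise. Raising
// wmin terminates more packets early, trading variance in low-fluence
// elements for shorter run time.
bool rouletteSurvives(float& weight, float wmin, float m, std::mt19937& rng) {
    if (weight >= wmin)
        return true;                              // packet continues unchanged
    std::uniform_real_distribution<float> u01(0.0f, 1.0f);
    if (u01(rng) < 1.0f / m) {
        weight *= m;                              // survivor carries the terminated energy
        return true;
    }
    weight = 0.0f;
    return false;                                 // packet terminated
}
```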


Figure 5.8: Result standard deviation vs result value at varying wmin values (Digimouse surface emission at standard albedo), with vertical line showing 16-bit dynamic range

            Low                   Standard              High
wmin        Time (s)  Speedup     Time (s)  Speedup     Time (s)  Speedup
10⁻⁵        474       1.0         845       1.0         1683      1.0
10⁻⁴        414       1.15        736       1.15        1447      1.16
10⁻³        352       1.35        605       1.40        1181      1.43
10⁻²        266       1.78        455       1.85        871       1.93
10⁻¹        160       2.96        272       3.10        506       3.32

Table 5.3: Run-time impact of changing wmin for three different Digimouse albedo scenarios


5.3.5 Summary

Based on the analysis presented above, we highlight a number of conclusions regarding software

implementation of photon migration.

First, the FullMonte algorithm is compute-bound when running on a CPU. Since perfor-

mance scales up linearly with the number of threads, contention for the shared L3 cache and

main memory are evidently not limiting factors. The addition of more processing cores should

provide additional performance.

Second, the memory architecture of a CPU is over-designed for the problem at hand. As

the die photo in Fig 5.9 shows, great amounts of silicon area (also energy and design effort)

are expended to provide a large and fast LRU cache, at the expense of space for processing

cores. Note that only the largest, last-level (L3) cache is explicitly marked; there is more area

within each core dedicated to L2 and L1 cache. Current CPU L3 caches are both excessively

large and use an unduly complex replacement algorithm for the task at hand. The Intel Many

Integrated Core architecture may provide an interesting avenue for future work since it makes

different trade-offs regarding caching, core complexity, and core count.

Third, the ability to perform intersection testing limits performance across a range of sce-

narios. Both scattering events and traversing into an adjacent tetrahedron prompt the need

for an intersection test. If a given geometry is highly-scattering relative to the mesh element

size, then the intersection-test count is dominated by scattering events and the mesh can be refined with little

performance penalty. At some point, though, one can expect an excessively fine mesh to im-

pose a penalty for two reasons: first, it expands the working set beyond the cache size causing

cache misses; and second, the ratio of intersection tests to steps becomes larger requiring more

computing.

Fourth, further study is required to determine the appropriate value of wmin for a given

application. It can provide a significant speed increase if the increased variance of low-fluence

elements is tolerable, which seems likely at least for BLI. It may also be the case for PDT,

which exhibits threshold behavior and hence does not need accurate results for fluence that is

well below the threshold.

Finally, the algorithm shows excellent scalability through parallelism due to the independence of packet histories and the relatively small size of the working set (geometry description and absorption

array). The software implementation has been highly tuned and competes well with several

other packages, which suggests that there exists little more room to improve CPU-based per-

formance. A few incremental proposals are discussed in future work, but since only a 30% gain

results when completely eliminating scattering, reflection, and refraction, the remaining (and

essential) item is intersection testing which we believe to be very tightly optimized. Since the

time to combine result sets is minimal compared to the time to compute them, the CPU imple-

mentation can be scaled up at will using more cores, sockets, and nodes, albeit at the expense

of money, heat load, and power requirements. Significant per-core and per-watt performance

improvements through software changes are unlikely.


Figure 5.9: Sandy Bridge i7-2600K die photo from Anandtech [61], showing the very large area dedicated to caching

5.4 Hardware Performance

In comparison to the previous implementation by Lo [47] which simulated only infinite planar

layers, FullMonte uses a much richer geometry model. Despite the additional complexity, the

latency of the inner loop is actually considerably smaller due to careful choice of mathematical

precision (mostly 18-27 vs 32 bits in FBM), and the latency-hoisting transformations discussed

earlier. Technological progress in FPGA devices between Stratix III (FBM) and V (FullMonte)

have also helped since more processing can be done within a given clock period and hard

multipliers have increased functionality. Figure 5.10 depicts the flow difference, with FullMonte

on top and Lo’s FBM on bottom. Edges coloured green are infrequent, giving a loop of 52 cycles,

while the core (simplest possible step) path is shaded in black and lasts 18 cycles. Operations

whose latency has been hoisted out of the inner loop through queueing are shaded gray.

Latency is a critical factor determining the size of cache required to scale up the design, so

its minimization is an important goal. FullMonte can run at a maximum clock frequency of 215

MHz, compared to 80 MHz for FBM: a significant gain which cannot be attributed solely to

process advancement, particularly in light of the decreased latency.² Lo's work uses a 100-stage

pipeline, meaning a packet exits the roulette core 100 stages after it enters, so 100 packets must

be in flight at a given moment to keep the pipeline fully utilized.

By introducing forks into the data flow, FullMonte is marginally more complex than FBM

and requires queues to balance the stages. Correct function is ensured by assertions that

check that there is always space in the queues when needed so no packets are dropped. The

benefit of this additional complexity is that it permits operations with high latency but low

probability (interface-related code) to be removed from the core loop. Building up from the

²Generally, one can increase a circuit's maximum clock frequency by increasing its latency as measured in number of clock cycles, whereas decreasing latency tends to lower the achievable clock rate unless done very carefully.


Figure 5.10: Hardware block diagram of FullMonte (top) and FBM (bottom) showing latency, with core-loop edges in black; maximum loop latency is 100 for FBM and 52 (18) for FullMonte

current foundation, the designer will have a chance to trade cache size versus utilization which

was not available in the FBM architecture. It is now possible to ensure that the utilization is

high (100% in the absence of interfaces) with a smaller cache size, since latency is extended

only when necessary. If interfaces are less than 1% of events, then the pipeline can be kept 99%

full with only 18 packets in flight, and will stall the remaining 1% of the time. Alternatively, one could keep 52 packets in flight so that utilization is always 100%: a new tradeoff made available by this architecture.
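A simplified back-of-the-envelope model of this tradeoff, assuming interface events occur independently with probability p and that the pipeline simply idles for the extra cycles of the long path, is

\[
\bar{L} \approx (1-p)\,L_{\mathrm{core}} + p\,L_{\mathrm{if}}, \qquad
U(N) \approx \min\!\left(1,\ \frac{N}{\bar{L}}\right),
\]

so with L_core = 18, L_if = 52 and p = 0.01 the average loop latency is roughly 18.3 cycles and U(18) is approximately 98-99%, in line with the figure quoted above, while any N ≥ 52 keeps the pipeline full regardless of p.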

5.4.1 Area Requirements

The area requirements to synthesize a single instance of the design are shown below in Table 5.4.

As expected, intersection testing is among the most resource-intensive blocks. Almost all of

the block RAM is accounted for by the tetrahedron storage. Fortunately it uses only half of

the available read ports. An additional intersection tester could use the other port of the same

storage at no cost.

Some of the resource counts are also slightly over-reported. There are situations where

Bluespec incorrectly instantiates much larger DSP units than necessary, which for instance

accounts for an additional four units in the scattering block that could be saved. Bluespec

FIFOs are also used extensively for pipeline delays to align data, which in the current form results in extensive block RAM usage. Using different primitives would result in different resource decisions,


Block                     Fmax   ALM     FF      DSP   BRAM
Henyey-Greenstein         364    1740    2857    4     0
Exponential Dist          479    112     128     1     0
Isotropic point source    401    371     723     11    2
Intersection test         329    510     799     20    0
Interface                 340    1707    2713    5     2
Scatter                   366    279     534     23    0
Step finish (est)                350     200     6     0
Storage                                                1222
Queueing, control, RNG           3786    4665    1     325
Total                     215    8705    12619   71    1551
Fraction of device               4%      3%      28%   61%
Adjusted total (est)             18705   22619   67    1251
Fraction                         8%      10%     26%   49%

Table 5.4: Area required for a single instance on a Stratix V A7 device

reducing block RAM count by 200-300 at the cost of additional logic blocks. An adjusted total

area figure is included in the table reflecting a best estimate of the requirements after all issues

are fixed. Block RAM is a constraint on the achievable parallelism since it requires nearly half

the chip to serve two intersection testers. Alternative caching arrangements discussed later

may reduce the required resources, which will be a benefit since designs become difficult to

synthesize when they use too high a fraction of the chip’s capacity.

Utilization of the launcher is also quite low because packets take on average hundreds of

steps before expiry. It could easily be shared among many pipelines, reducing the per-instance

DSP count to 56 so that four instances can fit on the chip. The requirement could be further

reduced 2-3x by rolling the launcher implementation (Sec 4.2.4).

In summary, instantiation of multiple design instances is limited by both DSP units and

block RAMs on the S5 A7 device, accommodating up to four instances. Switching to another

family member such as D5 which has a higher density of DSP units could be beneficial, except

that it would reduce available on-chip memory, a trade-off which remains to be evaluated

rigorously with new caching schemes. In either case, it should be clear that four instances of the

pipeline can be accommodated within the device. The factors limiting performance scale-out are discussed below in Sec 5.5.

5.4.2 Power Consumption

Since the computation occurs entirely on the FPGA chip (no external memories or other ele-

ments, and no host I/O during simulation), accelerator power consumption is due only to the

FPGA core power itself. We used Altera Quartus II to produce a quick vectorless estimate of

the total power a physical realization would consume assuming a standard 12.5% toggle rate at

215 MHz for internal digital signals. Ambient temperature was assumed to be 25 °C, with junction temperature automatically calculated assuming a 23 mm heatsink and 200 LPM airflow, with no


                             Power (W)                                 Normalized
                             Core static  Core dynamic  IO    Total    Speed   Energy/pkt
CPU (low range)                                               47.5     1.0     36.5
CPU (high range)                                              76       1.0     58.5
Single-instance Stratix V    1.2          2.1           0.6   3.9      3.0     1.0
Estimated 4 instances        2.4          8.4           0.6   11.4     12.0    0.75

Table 5.5: Performance and energy-efficiency comparison (FPGA vs CPU) at a 210 MHz clock rate

board thermal model (conservative). As the design is scaled up to more instances on the chip,

we expect that dynamic power would increase proportionally, but that static power would in-

crease at a slower rate and I/O power should remain the same. The I/O power estimate is rough

because synthesis results were run by instantiating the core with its top-level ports connected

to general-purpose I/O connections. In a real implementation, a PCI-Express serial connection

would be used which would probably reduce the power consumption. Adding off-chip DRAM

access to accommodate larger geometries would increase I/O power as well.

To compute energy-efficiency per amount of computing, we must also account for the differ-

ence in simulation speeds. Based on run-time results, the CPU implementation is limited to 70

Mints across a variety of test cases. The hardware implementation is also Mints-limited, and is

able to achieve 100% utilization of the intersection-test block at 210 MHz, thus producing 210

Mints or 3x faster than the CPU using a single instance.

Measuring CPU power consumption fairly is a difficult matter. A typical computer system

may have a power supply rated for 200-300W, which sets a definite upper bound that includes

many elements that are not critical to actually carrying out simulations (graphics card, hard

disk, cooling, etc.) and hence should be excluded from the comparison. The processor used

has a thermal design power (TDP) rating of 95W, although again this is a maximum not

necessarily achieved. Since all cores are fully active, we can assume that the processor is fairly

heavily loaded, though as previously discussed it will not need to access main memory. There

are also no I/O or graphics operations required so portions of the chip will be idle. As a

reasonable estimate, we take a pair of values, 50% and 80% of TDP, as a proxy for CPU power

consumption.
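The normalized energy-per-packet figures in Table 5.5 follow directly from these power and throughput values, since energy per unit of work is proportional to power divided by Mints; normalizing to the single-instance FPGA,

\[
\frac{E_{\mathrm{CPU,low}}}{E_{\mathrm{FPGA}}}
= \frac{47.5\,\mathrm{W} / 70\,\mathrm{Mints}}{3.9\,\mathrm{W} / 210\,\mathrm{Mints}} \approx 36.5,
\]

and repeating the calculation with 76 W gives approximately 58.5, matching the table entries.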

A summary of the results is presented in Table 5.5, indicating that the FPGA system as

it stands has an energy-efficiency advantage on the order of 40x. Under conservative assump-

tions for scaling up the implementation, that gap could increase another 30%. That result is

much lower than previous work by Lo [47], who achieved nearly 700x using an FPGA and a

processor that are both two generations older. Processor power efficiency has increased greatly

since that comparison was made (from the 65nm node to 32nm), and the FullMonte software

implementation is also inherently far more efficient due to use of SIMD instructions. We also

use more conservative values for processor power consumption (Lo uses 50% TDP to estimate


using one of two cores, while we use 50-80% for using all four cores). Conveniently for comparison, both chips are on similar manufacturing process nodes: 32 nm for the CPU and 28 nm for the FPGA.

5.5 Architecture Scalability

In addition to the prototype single pipeline described above, we also propose an architecture

below which would permit FullMonte to tackle larger problems and attain higher performance.

The architectural discussion includes a careful analysis of the factors which may limit perfor-

mance, based on the profiling results discussed above and the specifications of both the Stratix

V FPGA family and the DE-5 evaluation board.

5.5.1 Larger Meshes

The present implementation has a limited mesh size due to its use of on-chip memory exclusively.

As noted in the profiling results of Sec 5.2, the majority (roughly 90-98%) of tetrahedron

accesses on the large Digimouse mesh occur within the 64k most frequent addresses, which are

already stored on-chip. To store the entire Digimouse mesh, the remaining elements (≈ 250k)

could be stored in off-chip memory. Since those accesses are only one-tenth as frequent as the

ones stored on chip, the memory needs to be only one-tenth as fast to avoid being a performance

limitation.

The Terasic DE-5 board selected has a Stratix V FPGA with two DDR3 SO-DIMM memory

modules (up to 8 GB), whose theoretical bandwidth is 136 Gbit/sec. At peak performance³

it could serve 348 million tetrahedron requests per second (Mtets). If the tetrahedra that

do not fit on-chip were saved in off-chip DDR memory running at 25% efficiency (87 Mtets)

and accounted for 10% of memory accesses (the complement of the 64k which cover 90% or

more), the system could fetch 870 Mtets before the off-chip bandwidth would limit performance.

Assuming there is also a cache holding the eight most-recently-used elements and a hit rate of

50% as shown in profiling, that would yield a total system performance limit of 1740 Mints,

which is nearly 8 pipelines running at 215 MHz or approximately 24x faster than the CPU

implementation.
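Spelled out under the stated assumptions (peak 348 Mtets, 25% DDR3 efficiency, 10% of accesses going off-chip, and a 50% hit rate in a small most-recently-used cache), the chain of estimates is

\[
348 \times 0.25 = 87\ \mathrm{Mtets}\ \text{(off-chip)}, \qquad
\frac{87}{0.10} = 870\ \mathrm{Mtets}\ \text{(total fetches)}, \qquad
\frac{870}{1-0.5} = 1740\ \mathrm{Mints},
\]

which at 215 MHz per pipeline corresponds to roughly eight pipelines, and is about 24-25x the CPU's roughly 70 Mints.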

It would also be necessary in many applications (particularly PDT) to record the absorption

events. QDRII+ uses separate read and write buses and a burst length of four, so a 72b read

and a 72b write can both be completed every other clock cycle. This scheme is ideal for

fluence accumulation since accumulation requires a sequence of read-add-write. Each memory

is capable of addressing 8MB of data, or 128k 72-bit words.

For 72-bit fluence accumulation, up to 512k elements can be stored on the four chips. Since

QDRII+ (unlike DDR3) has no bus efficiency overhead, the total off-chip access rate for the

four chips would be 900 MHz for 72-bit read-write pairs. Since each absorption event results

³Peak values are guaranteed not to be exceeded. In a real implementation there would be some overhead which will detract from this value.


in one read-write pair, limiting performance to 900 Mabs of off-chip access in the absence of

any caching, sufficient to serve four pipelines at 225 MHz. That limit could be raised by an

appropriate caching scheme such that tetrahedron fetching is the performance-limiting factor.

Bearing in mind that each absorption event requires at least one intersection test, it means

that the DDR3 bandwidth limit for tetrahedron storage and the QDRII+ limit for fluence

accumulation would be compatible with each other and with very high performance.

5.5.2 Parallelism for Greater Throughput

Since MC simulations are inherently parallel, running M instances of the pipeline with inde-

pendent RNG seeds would yield an M times speedup if the time to merge results is negligible.

In future work, the pipelines could share memory so they are merged on-the-fly but would need

to share access bandwidth which could limit performance.

Scaling up through parallelism would be trivial for a number of functional blocks. The Tiny

Twister 800 RNG produces 800 bits in parallel, of which less than 100 are used for a single

pipeline instance. Up to eight loop instances could receive independent bits from the single

RNG. Likewise, the packet launcher is sharable since the average packet takes anywhere from

50-500+ steps after being launched. That suggests that a single launcher could be shared by

50+ instances, or that it could be shared by four while having its implementation rolled up to

10x to economize on device resources.

One of the most significant current bounds on parallelism is the number of DSP units on the device. The implementation uses 71 out of the 256 available on a Stratix V A7 chip. Of those, the eleven used for the launcher can be shared among all pipeline instances, and can be further reduced by loop rolling since the launcher is needed only infrequently. Previously-discussed arithmetic

optimizations and bug fixes are expected to save five DSP blocks, after which the design will

require 3+55M DSP blocks to accommodate M independent parallel pipeline instances. When

a refractive-index interface block is added, it will add to these requirements but can be shared

across all pipelines due to its very sparse usage pattern.

Tetrahedron memory access would also become an important factor when scaling up. Fig-

ure 5.11 shows an efficient architecture based on the profiling results of Sec 5.2. Edges between

blocks indicate the access rate in millions per second. Based on profiling, more than half of

memory accesses can be served by an 8-element L1 LRU cache which would require eight stor-

age elements per in-flight packet for a total of 416 elements. Since the elements are 404b wide,

they can be stored in eleven parallel block RAMs. Misses from that cache could be directed

to an L2 Zipf cache with a static cache set that could be pre-determined by running a small

simulation (≈10⁵ packets) on the host (or perhaps in future work using self-profiling FPGA

hardware). Since the L1 hit rate is better than 50%, two pipelines should be able to share a

single L2 cache port with some queueing. If the static L2 cache were implemented as a Block-

RAM-based ROM, then two read ports would be available per array, so there could be two

pipelines per port and two ports per Block RAM array. A quick estimate from the profiling


                                                                   Block RAM
Level   Policy   MAccess/s   Hit rate   Elements   Inst   Ports    Per inst   Total
L1      LRU      300         60%        8x52       8      1R 1W    11         88
L2      LFU      240         50%        4k         4      1R       44*        176
L3      LFU      240         80%        32k        2      1R       352*       704
DRAM    -        96          100%       Millions   1      1R       0          0
Total   Hybrid   2400                   Millions                              968

Table 5.6: Resource estimates for an 8-pipeline cache hierarchy (DRAM peak bandwidth is 348 M/sec, so 27% efficiency is needed); * assuming 2 instances share 1 physical RAM; based on Digimouse profiling

illustrated in Fig 5.6 indicates that a few thousand elements should suffice for an L2 cache. One

possibility would be to make L2 as large as possible and serve its misses from DRAM, allowing

four pipelines per chip which is the limit based on available DSP units.

For chips with a larger number of DSP units, eight pipelines might be feasible. It would

require instantiation of one of the previously-described 8-LRU L1 caches per pipeline, and an

L2 cache shared among 4 pipelines of sufficient size (4k) to serve at least 50% of requests. The

misses from the two L2 instances could be served by a shared L3 cache before going to main

memory. Total throughput would be governed by the ability of the L3/DRAM solution to serve

tetrahedron requests if the cache assumptions are correct. Fluence accumulation would also

need to keep pace, which should be achievable with a simple 8-LRU L1 scheme coupled to the

QDRII system described above. Given a 50% L1 miss rate, the QDRII RAM could serve 1800

Mabs which is the peak absorption output of eight pipelines at 225 MHz.
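The static cache sets themselves could be chosen by the short host-side profiling pre-run suggested above. The sketch below illustrates only the selection step; the function name and parameters are hypothetical and independent of the actual FullMonte code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Choose the static L2 cache set: count how often each tetrahedron was
// fetched during a short profiling run (e.g. ~10^5 packets on the host) and
// keep the top 'cacheSize' IDs. Because access probabilities are stationary
// within a simulation, this set can stay fixed for the full-length run.
std::vector<std::uint32_t>
chooseStaticCacheSet(const std::vector<std::uint32_t>& profiledAccesses,
                     std::size_t cacheSize) {
    std::unordered_map<std::uint32_t, std::uint64_t> counts;
    for (std::uint32_t tetId : profiledAccesses)
        ++counts[tetId];

    std::vector<std::pair<std::uint32_t, std::uint64_t>> ranked(counts.begin(), counts.end());
    const std::size_t keep = std::min(cacheSize, ranked.size());
    std::partial_sort(ranked.begin(), ranked.begin() + keep, ranked.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });

    std::vector<std::uint32_t> cacheSet;
    cacheSet.reserve(keep);
    for (std::size_t i = 0; i < keep; ++i)
        cacheSet.push_back(ranked[i].first);
    return cacheSet;                              // e.g. 4k entries for the proposed L2
}
```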

5.5.3 Cost of Scale-Up

The reference PC workstation cost approximately $2200, of which several hundred each went

towards a GPU and solid-state storage (SSD) system which are not relevant to the problem at

hand. Allowing that the actual cost of processor, memory, and relevant components was on the

order of $1200, the CPU could be fairly compared to a Terasic DE-5-Net board [63] which hosts

the FPGA used for simulation on a PCI Express card for a list price of $8000. If the modest

scaling projections are achieved then the FPGA-based system will compare favourably with a

CPU-based system in terms of purchase cost per throughput (1.8x), energy efficiency (30-50x),

and throughput (12x).

5.6 Summary

The FullMonte software implementation is highly optimized and is the fastest tetrahedral-mesh Monte Carlo model for light propagation in turbid materials. Aside from

time-resolved output (which is a planned new feature), it supports the most general set of

geometries, materials, and output data in an efficient and customizable way.


Figure 5.11: Proposed cache architecture (eight pipelines, each with an 8-entry LRU L1 cache at ~60% hit rate, pairs sharing 4k static LFU L2 caches at ~50% hit rate, two 32k static LFU L3 caches at ~80% hit rate, backed by off-chip storage at 348 M/s peak)


The FullMonte hardware architecture demonstrates significant novelty and improvement

over previous work by Lo. Clock speed has been increased, partly due to technological advances

between device generations (65nm to 28nm), and partly due to careful optimization of bit

widths and algorithmic enhancements. Latency of the core loop has been cut in half (or by 5x,

if the interface path is ignored) even while increasing clock speed, which will prove important

for future scaling. The comparison against CPU is also more reliable since the FullMonte

software is highly-optimized, multi-threaded, and uses modern processor features in contrast to

Lo’s reference point, MCML, which is unoptimized single-threaded C code. We synthesized a

prototype which shows correct function and provides insight into area and power requirements,

with energy per simulation being reduced 30-50x from a highly-tuned CPU implementation and

a 3x performance increase while using less than a quarter of the FPGA device.

We gained insight into the factors that limit performance through extensive profiling, and

have identified novel techniques to increase performance and efficiency of hardware algorithms.

In addition, we proposed and analyzed a memory architecture which would enable scaling-up

of the prototype to handle larger meshes and higher performance. The use of a static Zipf-style

cache is new for this application, and would provide significant benefits in performance, area,

and complexity over the more-typical LRU policy. We presented analysis which shows that a

scaled-up system could support at least four parallel instances, with sufficient off-chip memory

to store the Digimouse mesh and record volume fluence for all elements. Such a system would be

attractive compared to a CPU-based system on measures of throughput, cost-per-throughput,

and power-per-throughput, as well as physical space and cooling required. Since it outperforms

CPUs on all those metrics, it would be the optimal choice for scaling such calculations up for

iterative solution of biophotonic inverse problems.


Chapter 6

Conclusions and Future Work

6.1 Conclusions

This chapter summarizes the principal contributions and findings of this thesis, and suggests

future research avenues. Future work can be divided into several sections: further improvements

to the software model; optimizing and adding features to the single-pipeline hardware; scaling

the prototype hardware up to larger problems and higher performance; applying the new insight

to other computational platforms; and putting the simulators to work on applications.

6.1.1 Contribution summary

The principal contributions demonstrated in this thesis include the following advances in tetra-

hedral mesh-based Monte Carlo simulations of light propagation through turbid media:

• Fastest available software simulator

• Most flexible available software simulator (configurable output data gathering with zero overhead for unused features)

• New method for scattering calculation

• New variance estimator

• Demonstrated feasibility of FPGA hardware with a 3x speed and 40x power-efficiency increase over a CPU

• Proposed and analyzed an FPGA architecture to achieve >12x speed increase over CPU

6.1.2 FullMonte Software

The FullMonte software model described in this thesis is now the fastest available open-source

tetrahedral MC simulator. It achieves this by making extensive use of manual optimizations

where appropriate, and by exploiting modern CPU capabilities. In the process, we generated


new profiling tools and data to analyze the basic algorithm, identify the factors limiting its

performance, and optimize both the hardware and software designs based on that profiling.

In conclusion, FullMonte is the current state-of-the-art for software biophotonic simulations,

using a highly optimized C++ implementation with Intel SSE instructions. In view of its considerable

optimization, we believe further efforts to accelerate CPU-based simulations offer room for only

incremental improvement.

6.1.3 FullMonte Hardware

In response to the diminishing returns from software optimization discussed above, a hardware

implementation was designed which shows the feasibility of fast Monte Carlo biophotonic sim-

ulations using FPGAs. The current FullMonte hardware achieves a 3x speedup while providing

a 40x benefit in power efficiency within a compact package. This is the first such hardware

design for complex geometry, and it has conclusively demonstrated that FPGAs can achieve

superior speed and power performance compared to a CPU for this application. We

have also presented an architecture to scale up to higher performance and bigger problems,

based on thorough application profiling and careful design analysis.

The current hardware design simulates a single instance of the core packet loop running at

215 MHz on a commercially-available FPGA device with resources to spare while consuming

far less power than a CPU. The design requires less than one quarter of the resources of an

Altera 5SGXMA7N1F45C1 FPGA, leaving room for future work to expand to multiple parallel

pipelines for greater performance. Development work to attach it to the PCI-Express bus and

write support drivers to interface it to a host computer requires effort but carries little to no

technical risk.

6.2 Future Work

6.2.1 FullMonte Software

Several important contributions were made to improve on the state of the art software model.

Further work could be undertaken both to improve the performance of the existing model and

to incorporate new features or capabilities to make it even more broadly applicable. A few such

avenues are sketched out below.

Variance Estimation

The new variance calculation proposed in Sec 3.3.3 should be implemented and validated. Since

MC simulations are inherently random in nature, the run time is directly proportional to the

number of paths simulated. There is therefore a natural tradeoff between run time and result

quality, which can now be rigorously quantified. A reliable estimate of the output

variance may permit the user to terminate simulations more quickly once a target level of


variance is reached, or to have an estimate of output variance for a given fixed number of

simulated packets. This could find use, for instance, in planning PDT treatments such that

confidence bounds can be placed around the simulated light dose to enhance patient safety.

Validation would be required, in which the realized sample variance per mesh element would

be computed over N independent runs and compared with the variance estimator as proposed.
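A simple validation harness for this comparison could compute the realized per-element sample variance across the N runs, as sketched below (names are illustrative, not the FullMonte API; N ≥ 2 is assumed).

```cpp
#include <cstddef>
#include <vector>

// Given per-element fluence results from N independent runs (runs[i][e] is
// the fluence of element e in run i), compute the unbiased per-element sample
// variance. These values can then be compared against the output of the
// proposed variance estimator for the same packet count.
std::vector<double>
perElementSampleVariance(const std::vector<std::vector<double>>& runs) {
    const std::size_t nRuns = runs.size();
    const std::size_t nElem = runs.front().size();
    std::vector<double> mean(nElem, 0.0), var(nElem, 0.0);

    for (const auto& run : runs)
        for (std::size_t e = 0; e < nElem; ++e)
            mean[e] += run[e] / nRuns;

    for (const auto& run : runs)
        for (std::size_t e = 0; e < nElem; ++e) {
            const double d = run[e] - mean[e];
            var[e] += d * d / (nRuns - 1);        // unbiased (N-1) normalization
        }
    return var;
}
```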

New Source Types

The current software model is limited to point sources (isotropic or directed), isotropic volume

sources, and directed face sources. The source code is designed so that new capabilities can

be added using the C++ inheritance mechanism without altering any core code. Conveniently,

the inheritance mechanism also allows reuse of aspects of sources when designing other sources.

Line sources with both uniform and customized longitudinal emission profiles (similar to work

by Rendon [10]) would be one area of particular interest to PDT applications.
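As a sketch of how such an extension might look (the class names here are hypothetical and simplified relative to the actual FullMonte interface), a new source type overrides a single emission method on an abstract base class:

```cpp
#include <array>
#include <cmath>
#include <random>

struct Packet { std::array<float, 3> position, direction; float weight = 1.0f; };

// Abstract source: a derived class decides where and in which direction each
// packet is launched; the core simulation loop is unchanged.
class Source {
public:
    virtual ~Source() = default;
    virtual Packet emit(std::mt19937& rng) const = 0;
};

// Example extension: a line source emitting isotropically from a point chosen
// uniformly along a segment, a simple stand-in for the customized
// longitudinal emission profiles discussed above.
class LineSource : public Source {
public:
    LineSource(std::array<float, 3> a, std::array<float, 3> b) : a_(a), b_(b) {}

    Packet emit(std::mt19937& rng) const override {
        std::uniform_real_distribution<float> u01(0.0f, 1.0f);
        Packet p;
        const float t = u01(rng);                      // uniform position along the segment
        for (int i = 0; i < 3; ++i)
            p.position[i] = a_[i] + t * (b_[i] - a_[i]);
        const float cosTheta = 2.0f * u01(rng) - 1.0f; // isotropic direction
        const float sinTheta = std::sqrt(1.0f - cosTheta * cosTheta);
        const float phi = 6.2831853f * u01(rng);
        p.direction = { sinTheta * std::cos(phi), sinTheta * std::sin(phi), cosTheta };
        return p;
    }

private:
    std::array<float, 3> a_, b_;
};
```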

Time Resolution

The primary target of interest for the present work is PDT which operates in the continuous-

wave mode, meaning time resolved calculations are not necessary. Some applications (DOT,

time-resolved DOS), however, require time-resolved output data. TIM-OS currently provides

time-resolved functionality, as does CUDAMC (for a limited geometry), providing ample bases

for comparison if and when that feature is introduced. If time resolution is desired, the optimal

technique for an arbitrary pulsed or modulated input is to calculate a temporal impulse response

function for a given source configuration, then compute the actual output by a convolution of the

impulse response and the input waveform (either modulation or finite-duration pulse). While

the calculations themselves are trivial, requiring only that the packet time from launch t be

stored and t′ = t + ns/c₀

calculated at each scattering step, reporting output as a function of

discretized time adds another dimension to all output data arrays.
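A sketch of the bookkeeping this would add (types and names are illustrative only): the packet carries its elapsed time, advanced by ns/c₀ at each step, and each deposit is routed into a time bin, adding one dimension to the output arrays.

```cpp
#include <cstddef>
#include <vector>

constexpr double c0_mm_per_ns = 299.792458;        // speed of light in vacuum, mm/ns

// Advance the packet clock by one step of length s (mm) in a medium with
// refractive index n: dt = n * s / c0.
inline double advanceTime(double t_ns, double n, double s_mm) {
    return t_ns + n * s_mm / c0_mm_per_ns;
}

// Time-resolved fluence: the absorption array gains a time dimension, and
// each deposit is routed to the bin containing the packet's current time.
struct TimeResolvedFluence {
    std::size_t nBins;
    double binWidth_ns;
    std::vector<std::vector<double>> fluence;      // [element][time bin]

    TimeResolvedFluence(std::size_t nElements, std::size_t bins, double width_ns)
        : nBins(bins), binWidth_ns(width_ns),
          fluence(nElements, std::vector<double>(bins, 0.0)) {}

    void deposit(std::size_t element, double t_ns, double weight) {
        std::size_t bin = static_cast<std::size_t>(t_ns / binWidth_ns);
        if (bin >= nBins) bin = nBins - 1;         // clamp late arrivals into the last bin
        fluence[element][bin] += weight;
    }
};
```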

Intel AVX/AVX2 Instructions

The current FullMonte software uses Intel SSE instructions up to version 4.1 (2011), with four-

element single-precision registers. Recent Intel processors have a new instruction set called AVX

(Advanced Vector eXtensions) which offers eight-element single-precision registers. Newer chips

will implement AVX2 and its successor AVX-512, the latter of which expands that to sixteen-element single-precision registers. While the simulation as written makes natural use of four-element registers (either

with three spatial dimensions, or using the four faces of a tetrahedron), some sections of code

may benefit from the new instructions. Certainly some of the vector-math functions such

as logarithm, sine/cosine calculation, and random number generation will see increased throughput if the new instructions are used, since more elements can be computed

in parallel, resulting in fewer calls to such functions. By nature, Monte Carlo simulations


make intensive use of random numbers so an increase in performance of the RNG and related

functions (logarithm, sine/cosine) used to generate distributed random numbers may be sig-

nificant. Through rewriting of the main loop, it may be possible (though probably difficult)

to extract additional performance by batching multiple intersection tests or other operations

within the larger sixteen-element registers, though that would require significant restructuring.

To increase software simulator performance and maintain a fair CPU-FPGA comparison, the

software should be updated to make full use of all new processor features as they become

available.

One area which may pay significant dividends would be enabling vector calculation of the

Henyey-Greenstein function. With sixteen-element-wide registers, the effective cost of phase

function calculation could be cut to 1/16 per unit. The results would need to be stored in

a small queue, which would incur some slight overhead. A similar concept was used in the

hardware design to hoist latency out of the main packet loop (Sec 4.3.1).
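For illustration, eight Henyey-Greenstein deflection cosines could be evaluated at once with 256-bit AVX intrinsics as sketched below; this is an assumed rearrangement of the phase-function sampling, not code from the current FullMonte implementation.

```cpp
#include <immintrin.h>

// Eight Henyey-Greenstein deflection cosines at once using 256-bit AVX:
//   cos(theta) = (1/(2g)) * (1 + g^2 - ((1 - g^2) / (1 - g + 2g*u))^2)
// where u is uniform on [0,1) and g != 0. This is a sketch of the batched
// phase-function idea, not code taken from FullMonte.
static inline __m256 hgCosTheta8(__m256 u, float g) {
    const __m256 one = _mm256_set1_ps(1.0f);
    const __m256 g2  = _mm256_set1_ps(g * g);
    const __m256 num = _mm256_sub_ps(one, g2);                        // 1 - g^2
    const __m256 den = _mm256_add_ps(_mm256_set1_ps(1.0f - g),
                                     _mm256_mul_ps(_mm256_set1_ps(2.0f * g), u)); // 1 - g + 2g*u
    const __m256 frac = _mm256_div_ps(num, den);
    const __m256 cosT = _mm256_sub_ps(_mm256_add_ps(one, g2),
                                      _mm256_mul_ps(frac, frac));     // 1 + g^2 - frac^2
    return _mm256_mul_ps(cosT, _mm256_set1_ps(0.5f / g));             // divide by 2g
}
```

The eight results would then be pushed into a small queue and consumed one at a time by the packet loop, mirroring the latency-hoisting queue used in the hardware design.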

6.2.2 FullMonte Hardware

The FullMonte hardware implementation presented here is a proof-of-concept

which demonstrates that significant run time and power-efficiency gains can be made for com-

plex biophotonic problems through implementing the simulations on FPGAs.

Refractive-Index Interfaces

To make the hardware simulator fully applicable for PDT of complex volumes including HNC,

it will be necessary to support calculations at refractive index boundaries. The presence of air

voids in the sinuses and oral/esophageal cavity will make a significant difference in the fluence

distribution due to the sharp refractive-index change. As previously argued, the computational

cost should be modest and not an overall limit to system performance.

A module to handle refractive interfaces could be written using mostly existing building

blocks, based on the algorithm described in Sec 2.4.2. The intersection point and cosine of the

incidence angle are already provided as input. The code to calculate sin θ =√

1− cos2 θ also

already exists. The condition for total internal reflection (TIR) can be checked by comparing

cos θ against a constant stored for each material interface (at most N2m = 256). If TIR does

not occur, then the sine and cosine of transmitted angle can be calculated through Snell’s law,

and those quantities used to evaluate the Fresnel reflection probability R. Using a Bernoulli random draw with probability R = f(n₁, n₂, θ), the packet will either entirely reflect or refract such that the expected energy transmitted is physically correct.

When evaluating the new direction, the vectors d, a, b must all be adjusted for reflection or

refraction as appropriate. One option would involve calculating a′ = (d × n)/|d × n| and b′ = d′ × a′. Of

these, only the normalization of a′ is costly or high-latency, since it requires division by sin θ

which must already have been calculated for Snell’s Law. The additional computational cost


should be minor given that interface events are so much rarer than scattering. A hardware

implementation could conceivably use a single divider and a single DSP unit with loop rolling

to produce a low-throughput result.
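A behavioural sketch of the reflect-or-refract decision follows (single precision, software style; the direction-vector update described above is omitted, and all names are illustrative rather than those of the hardware module).

```cpp
#include <cmath>
#include <random>

enum class InterfaceAction { Reflect, Refract };

// Reflect-or-refract decision at a refractive-index boundary, following the
// outline above: check for total internal reflection, otherwise apply Snell's
// law and the unpolarized Fresnel reflectance R, then make a single Bernoulli
// draw so that the expected transmitted energy is physically correct.
InterfaceAction interfaceDecision(float cosI, float n1, float n2, std::mt19937& rng) {
    cosI = std::fabs(cosI);
    const float sinI = std::sqrt(1.0f - cosI * cosI);
    const float sinT = n1 / n2 * sinI;             // Snell's law
    if (sinT >= 1.0f)
        return InterfaceAction::Reflect;           // total internal reflection

    const float cosT = std::sqrt(1.0f - sinT * sinT);
    const float rs = (n1 * cosI - n2 * cosT) / (n1 * cosI + n2 * cosT);
    const float rp = (n1 * cosT - n2 * cosI) / (n1 * cosT + n2 * cosI);
    const float R  = 0.5f * (rs * rs + rp * rp);   // unpolarized Fresnel reflectance

    std::uniform_real_distribution<float> u01(0.0f, 1.0f);
    return (u01(rng) < R) ? InterfaceAction::Reflect : InterfaceAction::Refract;
}
```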

Time Resolution

The general technique and limitations of time-resolved simulation were discussed within the

context of the software model. Conceptually the same idea could be implemented in hardware,

with the caveat that it could easily become a limiting factor in simulation performance due

to the different memory architecture. Since an FPGA’s fast on-chip memories are of limited

size, to store results with fine temporal resolution or more than a few time-resolved data points

would require an excessive proportion of off-chip memory access. That access in turn would

be much slower and have a lower throughput which would limit the maximum calculation rate.

On the other hand if the results are only needed at a small number of probe locations known at

simulation time (e.g., as done in CUDAMC), then simulation speed would not be compromised.

Pipelining

The gap between the individual block maximum speeds (≈330 MHz) and the system maximum (215 MHz) shown in Table 5.4 indicates that there is room to improve performance by pipelining

interconnect between blocks. While it is not reasonable to expect to hit exactly the maximum,

since that is achieved by a single block in isolation with no interfering logic or routing, it should

be possible to increase speed by an additional 20-30% through careful pipelining under the

proviso that each stage should be justified since the overall design is sensitive to excess latency.

Arithmetic Optimization

The most significant issue is that the Bluespec compiler generates explicit sign-extension when

performing signed multiplication, which appears to lead to incorrect inference from the Altera

FPGA synthesis tools. Unfortunately fixing that requires explicit instantiation of a hard block

which makes the code less readable. There remain several instances of this issue at the time of

writing, which results in a cumulative impact of 7 additional DSP units consumed.

There are also some minor optimizations of mathematical functions that could reduce la-

tency and resource usage, particularly the square-root (or √(1 − x²)) and division operations.

They are not performance-limiting at the moment so such optimizations can wait. The current

implementations were chosen to get a working system quickly rather than to be carefully optimized.

6.2.3 New Acceleration Platforms

With constant demand for improved performance, the landscape for processors and compute ac-

celerators is ever changing. As device capabilities evolve, software must periodically be updated

to take advantage of the new capabilities. In addition, even without adding new capabilities,


architectures change in ways that require software tuning, for instance: core size, number of

cores, cache size, memory coherence models, multi-threading. Some of the recent and upcoming

changes which may be relevant to future development of the FullMonte software simulator are

summarized below. With the new algorithmic understanding developed in this thesis, it will

be easier to provide a rigorous analysis of the costs and benefits of each candidate architecture

and platform.

GPU

While there exist a number of previously-identified challenges to a GPU implementation of

the algorithm (recall Sec 4.1.1 discussing FullMonte platform choice and Sec 2.5.3, 2.5.4, 2.5.5

regarding previous attempts with GPU), it could be an interesting problem to attempt. Solving

problems efficiently using GPUs often requires clever transformations of the problem to tailor

an implementation to the particularities of the compute medium, notably memory coalescing,

divergence avoidance, and locality exploitation. As GPU caching and coalescing capabilities

continue to evolve, they may become more suitable as a compute medium for this problem. At

present, we believe that the problem would be difficult to accelerate on a GPU but that does

not constitute proof, and the effort would surely be illuminating whether successful or not.

Intel Xeon Phi

As previously discussed in Sec 4.1.1, Intel’s Xeon Phi is a very recent addition to the compute-

acceleration landscape. It is worth noting that several Top500 [51] supercomputers including the

top-ranked Tianhe-2 and seventh-ranked TACC Stampede systems use Xeon Phi coprocessors.

Portability of a functionally-correct algorithm would be trivial due to the shared instruction

set including all intrinsics used in FullMonte software. The increased core count would suggest

that Xeon Phi may run faster since FullMonte is compute-bound on current CPUs, however the

smaller cache size may mean more stalling while waiting on main memory. With some tuning,

it would be possible to give a fair evaluation of performance on the new platform, and this

should certainly be pursued.

6.2.4 Applications

Having created a high-quality, flexible, high-performance general software simulator, significant

work exists in the application domain to further illustrate the value of the new simulator.

Several such options which are immediately practical are introduced below.

Comparison vs Finite Element Method

Given the existence of the diffusion approximation which can provide a much faster solution

via FEM, it is natural to ask why use MC and when. Other authors have noted the qualitative

limitations of the diffusion approximation [38]. Rigorous testing of the differences between FEM


and a Monte Carlo solution in representative problem geometries would be valuable. Work is

in progress to allow FullMonte to read NIRFAST input files so that a direct comparison can

be made without any effort to translate the input files. Comparison of the output of validated

Monte Carlo solutions with FEM-derived solutions will indicate how large a difference exists

and in what cases. We are not aware of published work for non-trivial problem geometries.

Application to PDT

The motivating application for the present work is treatment planning for PDT. When treating

complex geometries such as HNC, large portions of the planning treatment volume are within

a few mean free paths of strong optical-property boundaries. When using extended sources the

volume of tissue within a few mean free paths of a source becomes significant. Further, the

delicacy of nearby organs at risk, particularly the carotid arteries in the HNC case, requires high

simulation accuracy. The demands of PDT, including recording absorbed energy throughout

the volume, the lack of need for time-resolved features, and its representative material properties

and mesh sizes, drove the present architecture, so PDT is a natural first application for this hardware.

Consequently, we aim to use real anonymized patient data to perform simulations of PDT for

HNC and develop the necessary infrastructure to do a complete PDT fluence-evaluation system

based on the demonstrated hardware architecture.

Other Applications

When a completed hardware simulator is fully implemented, applications which are currently

infeasible due to high computing demands (HNC PDT, quantitative BLI with complex geome-

tries) will become more feasible thanks to an estimated 12-20x runtime decrease without the

high space and power requirements of a compute cluster. Several cards could fit within a work-

station, enabling desk-top biophotonic simulation reaching towards two orders of magnitude

speedup versus a CPU-based solution of the same size and power requirements. The portability

and modest power consumption will mean high-performance PDT dose evaluation can travel,

for instance into operating rooms where bringing a compute cluster would not be possible due

to space and power restrictions.

The use of this hardware and software for applications in the continuous-wave imaging/detection

regime (CW DOS, SFDI, BLI) would be relatively straightforward. In contrast to the PDT

application which is the primary focus of this hardware, for imaging and detection only the

photons exiting the material are of interest. Though packet exit events are not currently cap-

tured, their relative sparsity (roughly 20:1 to 1000:1 less common) compared to absorption

means that the output event rate is quite low relative to the hardware already implemented.

Due to the nature of the tetrahedral mesh description, the number of surface faces will also be

much less than the number of tetrahedra in the entire mesh. Consequently, both the bandwidth

needs and working-set size of a hardware emittance logger would be trivial compared to those of

the absorption logger already presented. A specialized implementation could exploit that fact


to dedicate more logic and memory resources to geometry fetching and less to event logging,

achieving still better Mints than the present design.

6.3 Summary

This thesis has presented several functional and performance enhancements to the state of the

art in software simulation of light propagation through turbid tissues. The FullMonte open-

source software implementation presented ranks as the best in its class for both flexibility (with-

out performance overhead for features not used) and performance, besting all comparable implementations

in run time. We have also demonstrated the feasibility and performance of a power-efficient,

compact, cost-effective hardware solution using FPGA technology. The prototype FPGA imple-

mentation was simulated to achieve more than 3x performance increase over highly-optimized

best-in-class multi-threaded software, while using 40x less power. A detailed analysis is pre-

sented showing that a further 4x performance increase (to 12x vs CPU) is achievable, with up

to 20x estimated as possible with further optimization.


Bibliography

[1] Erik Alerstam, William Chun Yip Lo, Tianyi David Han, Jonathan Rose, Stefan

Andersson-Engels, and Lothar Lilge. Next-generation acceleration and code optimization

for light transport in turbid media using GPUs. Biomedical Optics Express,

1(2):658–675, 2010.

[2] Erik Alerstam, Tomas Svensson, and Stefan Andersson-Engels. Parallel computing with

graphics processing units for high-speed Monte Carlo simulation of photon migration. Jour-

nal of biomedical optics, 13(6):060504, 2012.

[3] Merrill A Biel. Photodynamic Therapy. Methods in Molecular Biology, 635:281–293, 2010.

[4] T Binzoni, T S Leung, R Giust, D Rufenacht, and A H Gandjbakhche. Light transport

in tissue by 3D Monte Carlo: influence of boundary voxelization. Computer methods and

programs in biomedicine, 89(1):14–23, January 2008.

[5] Bluespec Inc. Bluespec SystemVerilog Reference Guide, January 2012.

[6] David Boas, J Culver, J Stott, and A Dunn. Three dimensional Monte Carlo code for

photon migration through complex heterogeneous media including the adult human head.

Optics express, 10(3):159–70, February 2002.

[7] Lee Breslau, Pei Cao, and Li Fan. Web caching and Zipf-like distributions: Evidence and

implications. In IEEE Infocom, volume XX, pages 126–134, 1999.

[8] Andrew Canis, Jongsok Choi, Mark Aldham, and Victor Zhang. LegUp: An Open-Source

High-Level Synthesis Tool for FPGA-Based Processor/Accelerator Systems. 13(2), 2013.

[9] Jeffrey Cassidy, Lothar Lilge, and Vaughn Betz. FullMonte: a framework for high-

performance Monte Carlo simulation of light through turbid media with complex geometry.

In Proc SPIE BiOS, volume 8592, pages 85920H–14, San Francisco, CA, February 2013.

SPIE.

[10] Cesar Augusto Rendon Restrepo. Biological and Physical Strategies to Improve the Ther-

apeutic Index of Photodynamic Therapy. PhD thesis, University of Toronto, 2008.


[11] Wai Fung Cheong. Optical-Thermal Response of Laser-Irradiated Tissue. In A J Welch

and M J C Van Gemert, editors, Optical-Thermal Response of Laser-Irradiated Tissue,

chapter 8, pages 275–301. Plenum Press, New York, 1st ed edition, 1995.

[12] NVIDIA Corp. NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110.

Technical report, 2012.

[13] Altera Corporation. Implementing FPGA Design with the OpenCL Compiler, 2012.

[14] Altera Corporation. Stratix V Device Handbook. Technical report, San Jose, CA, 2013.

[15] Intel Corporation. Intel Xeon Phi Processor Family, 2013.

[16] David J Cuccia, Frederic Bevilacqua, Anthony J Durkin, Frederick R Ayers, and Bruce J

Tromberg. Quantitation and mapping of tissue optical properties using modulated imaging.

Journal of biomedical optics, 14(2):024012, 2009.

[17] Sean R H Davidson, Robert A Weersink, Masoom A Haider, Mark R Gertner, Arjen Bo-

gaards, David Giewercer, Avigdor Scherz, Michael D Sherar, Mostafa Elhilali, Joseph L

Chin, John Trachtenberg, and Brian C Wilson. Treatment planning and dose analysis

for interstitial photodynamic therapy of prostate cancer. Physics in medicine and biology,

54(8):2293–313, April 2009.

[18] Anil K D’Cruz, Martin H Robinson, and Merrill a Biel. mTHPC-mediated photodynamic

therapy in patients with advanced, incurable head and neck cancer: a multicenter study

of 128 patients. Head & neck, 26(3):232–40, March 2004.

[19] Hamid Dehghani, Matthew E Eames, Phaneendra K Yalavarthy, Scott C Davis, Subhadra

Srinivasan, Colin M Carpenter, Brian W Pogue, and Keith D Paulsen. Near infrared optical

tomography using NIRFAST : Algorithm for numerical model and image reconstruction.

Communication in Numerical Methods in Engineering, 25(August 2008):711–732, 2008.

[20] Belma Dogdas, David Stout, Arion F Chatziioannou, and Richard M Leahy. Digimouse:

a 3D whole body mouse atlas from CT and cryosection data. Physics in medicine and

biology, 52(3):577–87, February 2007.

[21] K L Du, R Mick, T M Busch, T C Zhu, J C Finlay, G Yu, a G Yodh, S B Malkowicz,

D Smith, R Whittington, D Stripp, and S M Hahn. Preliminary results of interstitial

motexafin lutetium-mediated PDT for prostate cancer. Lasers in surgery and medicine,

38(5):427–34, June 2006.

[22] Qianqian Fang. Mesh-based Monte Carlo method using fast ray-tracing in Plucker coordi-

nates. Biomedical optics express, 1(1):165–75, August 2010.

[23] Qianqian Fang. Comment on ”A study on tetrahedron-based inhomogeneous Monte-Carlo

optical simulation”. Biomedical optics express, 2(5):1258–64, January 2011.


[24] Qianqian Fang and David A Boas. Monte Carlo simulation of photon migration in 3D turbid media accelerated by graphics processing units. Optics Express, 17(22):20178–20190, October 2009.
[25] T J Farrell, B C Wilson, M S Patterson, and M C Olivo. Comparison of the in vivo photodynamic threshold dose for photofrin, mono- and tetrasulfonated aluminum phthalocyanine using a rat liver model. Photochemistry and Photobiology, 68(3):394–399, September 1998.
[26] Thomas J. Farrell. A diffusion theory model of spatially resolved, steady-state diffuse reflectance for the noninvasive determination of tissue optical properties in vivo. Medical Physics, 19(4):879, 1992.
[27] Sari M Fien and Allan R Oseroff. Photodynamic therapy for non-melanoma skin cancer. Journal of the National Comprehensive Cancer Network, 5(5):531–540, 2007.
[28] Nirmalya Ghosh, Michael F G Wood, Shu-hong Li, Richard D Weisel, Brian C Wilson, Ren-Ke Li, and I Alex Vitkin. Mueller matrix decomposition for polarized light assessment of biological tissues. Journal of Biophotonics, 2(3):145–156, March 2009.
[29] J. Gray and G.M. Fullarton. Long term efficacy of Photodynamic Therapy (PDT) as an ablative therapy of high grade dysplasia in Barrett's oesophagus. Photodiagnosis and Photodynamic Therapy, September 2013.
[30] Christina Habermehl, Christoph H Schmitz, and Jens Steinbrink. Contrast enhanced high-resolution diffuse optical tomography of the human brain using ICG. Optics Express, 19(19):18636–18644, September 2011.
[31] John L Hennessy and David A Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, Waltham, MA, 5th edition, 2012.
[32] Matthew T Huggett, Michael Jermyn, Alice Gillams, Sandy Mosse, E Kent, Stephen G Bown, Tayyaba Hasan, Brian W Pogue, and Stephen P Pereira. Photodynamic therapy of locally advanced pancreatic cancer (VERTPAC study): final clinical results. In David H. Kessel and Tayyaba Hasan, editors, Proc. SPIE BiOS, volume 8568, pages 85680J–85680J-6, March 2013.
[33] Brad L Hutchings and Brent E Nelson. Implementing Applications with FPGAs. In Scott Hauck and Andre DeHon, editors, Reconfigurable Computing, chapter 21. Elsevier, Burlington, MA, 2008.
[34] Kitware Inc. The Visualization Toolkit.
[35] Xilinx Inc. Xilinx Vivado Design Suite, 2013.


[36] Steven L Jacques. How tissue optics affect dosimetry of photodynamic therapy. Journal of Biomedical Optics, 15(5):051608, 2010.
[37] Steven L Jacques. Optical properties of biological tissues: a review. Physics in Medicine and Biology, 58(14):5007–5008, July 2013.
[38] Steven L Jacques and Brian W Pogue. Tutorial on diffuse light transport. Journal of Biomedical Optics, 13(4):041302, 2008.
[39] Joseph O'Rourke. Computational Geometry in C. Cambridge University Press, 1998.
[40] Stefan P. Koch, Christina Habermehl, Jan Mehnert, Christoph H Schmitz, Susanne Holtze, Arno Villringer, Jens Steinbrink, and Hellmuth Obrig. High-resolution optical functional mapping of the human somatosensory cortex. Frontiers in Neuroenergetics, 2:1–8, June 2010.
[41] Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 26(2):203–215, 2007.
[42] Joo Yong Lee, Richilda Red Diaz, Kang Su Cho, Meng Shi Lim, Jae Seung Chung, Won Tae Kim, Won Sik Ham, and Young Deuk Choi. Efficacy and safety of photodynamic therapy for recurrent, high grade nonmuscle invasive bladder cancer refractory or intolerant to bacille Calmette-Guerin immunotherapy. The Journal of Urology, 190(4):1192–1199, October 2013.
[43] Steve M Liao, Nick M Gregg, Brian R White, Benjamin W Zeff, Katelin A Bjerkaas, Terrie E Inder, and Joseph P Culver. Neonatal hemodynamic response to visual cortex activity: high-density near-infrared spectroscopy study. Journal of Biomedical Optics, 15(2):026010, 2011.
[44] Lihong Wang, Steven L Jacques, and Liqiong Zheng. MCML - Monte Carlo modeling of light transport in multi-layered tissues. Computer Methods and Programs in Biomedicine, 1995.
[45] L Lilge, M C Olivo, S W Schatz, J A MaGuire, M S Patterson, and B C Wilson. The sensitivity of normal brain and intracranially implanted VX2 tumour to interstitial photodynamic therapy. British Journal of Cancer, 73(3):332–343, February 1996.
[46] Junting Liu, Yabin Wang, Xiaochao Qu, Xiangsi Li, Xiaopeng Ma, Runqiang Han, Zhenhua Hu, Xueli Chen, Dongdong Sun, Rongqing Zhang, Duofang Chen, Xiaoyuan Chen, Jimin Liang, Feng Cao, and Jie Tian. In vivo quantitative bioluminescence tomography using heterogeneous and homogeneous mouse models. Biomedical Optics Express, 18(12):13102–13113, 2010.


[47] William Chun Yip Lo. Hardware Acceleration of a Monte Carlo Simulation for Photodynamic Therapy Treatment Planning. Master's thesis, University of Toronto, 2009.
[48] William Chun Yip Lo, Keith Redmond, Jason Luu, Paul Chow, Jonathan Rose, and Lothar Lilge. Hardware acceleration of a Monte Carlo simulation for photodynamic therapy treatment planning. Journal of Biomedical Optics, 14(1):014019, 2009.
[49] Yujie Lu, Hidevaldo B Machado, Qinan Bao, David Stout, Harvey Herschman, and Arion F Chatziioannou. In vivo mouse bioluminescence tomography with radionuclide-based imaging validation. Molecular Imaging and Biology, 13(1):53–58, February 2011.
[50] Rickson C Mesquita, Maria A Franceschini, and David A Boas. Resting state functional connectivity of the whole head with near-infrared spectroscopy. Biomedical Optics Express, 1(1):324–336, January 2010.
[51] Hans Meuer, Erich Strohmaier, Jack Dongarra, and Horst Simon. Top500 Supercomputer Sites, 2013.
[52] Caroline M Moore, Mark Emberton, and Stephen G Bown. Photodynamic therapy for prostate cancer–an emerging approach for organ-confined disease. Lasers in Surgery and Medicine, 43(7):768–775, September 2011.
[53] Rishiyur S Nikhil and Kathy Czeck. BSV by Example: The next-generation language for electronic system design. Bluespec Inc., 2010.
[54] Vasilis Ntziachristos, Jorge Ripoll, Lihong V Wang, and Ralph Weissleder. Looking and listening to light: the evolution of whole-body photonic imaging. Nature Biotechnology, 23(3):313–320, March 2005.
[55] Michael S. Patterson, Brian C. Wilson, and Douglas R. Wyman. The propagation of optical radiation in tissue I. Models of radiation transport and their application. Lasers in Medical Science, 6(2):155–168, June 1991.
[56] Julien Pommier. No Title, 2007.
[57] Scott A Prahl, M Keijzer, Steven L Jacques, and A J Welch. A Monte Carlo Model of Light Propagation in Tissue. SPIE Institute Series, 5:102–111, 1989.
[58] Ravi Rao. Believe It or Not! Multi-core CPUs can Match GPU Performance for a FLOP-Intensive Application! In PACT'10, pages 537–538, Vienna, Austria, 2010. ACM.
[59] Mutsuo Saito and Makoto Matsumoto. SIMD-Oriented Fast Mersenne Twister: a 128-bit Pseudorandom Number Generator, pages 1–15. Springer, 2008.


[60] Haiou Shen and Ge Wang. A study on tetrahedron-based inhomogeneous Monte Carlo optical simulation. Biomedical Optics Express, 2(1):44–57, January 2010.
[61] Anand Lal Shimpi. The Sandy Bridge Review: Intel Core i7-2600K, i5-2500K and Core i3-2100 Tested, 2011.
[62] Mikael Tarstedt, Inger Rosdahl, Berit Berne, Katarina Svanberg, and Ann-Marie Wennberg. A randomized multicenter study to compare two treatment regimens of topical methyl aminolevulinate (Metvix)-PDT in actinic keratosis of the face and scalp. Acta Dermato-Venereologica, 85(5):424–428, January 2005.
[63] Terasic. DE5-Net FPGA Development Kit User Manual, 2012.
[64] B J Tromberg, N Shah, R Lanning, A Cerussi, J Espinoza, T Pham, L Svaasand, and J Butler. Non-invasive in vivo characterization of breast tumors using photon migration spectroscopy. Neoplasia, 2(1-2):26–40, 2000.
[65] Alfred Vogel and Vasan Venugopalan. Pulsed Laser Ablation of Soft Biological Tissues. In Ashley J. Welch and Martin J.C. van Gemert, editors, Optical-Thermal Response of Laser-Irradiated Tissue, pages 551–615. Springer Netherlands, Dordrecht, 2011.
[66] Jack E Volder. The CORDIC Trigonometric Computing Technique. IRE Transactions on Electronic Computers, EC-8(3):330–334, 1959.
[67] Lihong Wang, Steven L Jacques, and Liqiong Zheng. CONV - convolution for responses to a finite diameter photon beam incident on multi-layered tissues. Computer Methods and Programs in Biomedicine, 54:141–150, 1997.
[68] Nicholas Weaver. Retiming, Repipelining, and C-Slow Retiming. In Scott Hauck and Andre DeHon, editors, Reconfigurable Computing, chapter 18. Elsevier, Burlington, MA, 2008.
[69] BC Wilson and G Adam. A Monte Carlo model for the absorption and flux distributions of light in tissue. Medical Physics, 10(6):824–830, 1983.
[70] Brian C Wilson and Michael S Patterson. The physics, biophysics and technology of photodynamic therapy. Physics in Medicine and Biology, 53(9):R61–R109, May 2008.