The Pennsylvania State University
The Graduate School
Department of Mechanical and Nuclear Engineering
MONTE CARLO METHODS FOR NEUTRON TRANSPORT ON GRAPHICS
PROCESSING UNITS USING CUDA
A Thesis in
Nuclear Engineering
by
Adam Gregory Nelson
© 2009 Adam Gregory Nelson
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science
December 2009
The thesis of Adam Gregory Nelson was reviewed and approved* by the following:
Kostadin N. Ivanov
Distinguished Professor of Nuclear Engineering
Thesis Advisor
Maria Avramova
Assistant Professor of Nuclear Engineering
Jack Brenizer
J. "Lee" Everett Professor of Mechanical and Nuclear Engineering
Chair of Nuclear Engineering
*Signatures are on file in the Graduate School
ABSTRACT
This work examined the feasibility of utilizing Graphics Processing Units (GPUs) to
accelerate Monte Carlo neutron transport problems. These GPUs use many parallel processors to
perform the complex calculations necessary to create three-dimensional images at fast enough
rates for the video game industry. In 2006 NVIDIA released a programming framework (called
CUDA) that allows developers to easily program for the many cores provided by GPUs for
general purpose programs. Initial assessments have suggested that the MC algorithm may not
map well onto the constraints of the GPU. These constraints include the fact that MC codes are
highly dependent on branch statements (IF, ELSE, FOR, and WHILE), which can have a large
impact on GPU performance.
In this work, a Monte Carlo neutron transport code was written from scratch to be run on
both the x86 CPU platform and the GPU CUDA platform to understand the type of performance
that can be gained by utilizing GPUs.
After optimizing the code to run on the GPU, a speedup of nearly 21x was found when
using only single-precision floating point math. This can be further increased with no additional
effort if accuracy is sacrificed for speed: using a compiler flag, the speedup was increased to
nearly 24x. Further, if double-precision floating point math is desired for neutron tracking
through the geometry, a speedup of 11x was found.
While GPUs have proven to be useful, they are not without limitations: the maximum
memory currently available on a single GPU is 4 GB; the GPU RAM does not provide error
checking and correction; and the optimization required to decrease GPU runtime can lead to
code that is not readable by those who are not the original developers.
TABLE OF CONTENTS
LIST OF FIGURES ................................................................................................................. vi
LIST OF TABLES ................................................................................................................... vii
LIST OF ABBREVIATIONS .................................................................................................. viii
ACKNOWLEDGEMENTS ..................................................................................................... ix
Chapter 1 The Case For Acceleration of Monte Carlo Transport Simulations ....................... 1
Chapter 2 General Purpose Graphics Processing Units .......................................................... 3
2.1 The CUDA Architecture ............................................................................ 4
2.2 Additional CUDA Programming Information ........................................................... 10
2.3 Limitations of the CUDA Model ..................................................................... 11
Chapter 3 Methodology .......................................................................................................... 13
3.1 LADONc .................................................................................... 13
3.1.1 Geometry Representation ................................................................ 16
3.1.2 Material Representation ................................................................ 17
3.1.3 Nuclear Data and Cross-Section Representation .......................................... 18
3.1.4 Initial Source Neutrons ................................................................ 19
3.1.5 Random Number Generator ................................................................ 20
3.1.6 LADONc Algorithm Details ............................................................... 20
3.2 CERBERUSc ................................................................................. 25
3.3 Porting Codes to the GPU .................................................................. 26
3.4 Models Analyzed ........................................................................... 28
3.4.1 Model "Test" ........................................................................... 29
3.4.2 Model "Two" ............................................................................ 29
3.4.3 Model "Three" .......................................................................... 29
3.4.4 Model "Four" ........................................................................... 30
3.4.5 Model "Array2" ......................................................................... 30
3.5 Test System Hardware Specification ........................................................ 31
3.6 CERBERUSg Initial Performance ............................................................. 33
3.7 LADONg Initial Performance ................................................................ 33
Chapter 4 Optimizations ......................................................................................................... 34
4.1 Optimization Efforts ...................................................................... 34
4.1.1 Reducing Memory Latency ................................................................ 34
4.1.2 Register Pressure ...................................................................... 37
4.1.3 Cross-Section Lookups .................................................................. 38
4.1.4 Reducing Shared Memory Usage to Increase Blocks per SM ................................. 39
4.1.5 Thread Divergence ...................................................................... 40
4.1.6 Increasing the Speed of Data Transfer Between System RAM and GPU RAM ................... 41
4.1.7 Additional Optimizations Performed Which Reduce Accuracy ............................... 41
Chapter 5 Results .................................................................................................................... 43
5.1 Single-Precision Results .................................................................. 44
5.1.1 Use of the Accurate Math, Single-Precision Functions ................................... 44
5.1.2 Use of the Intrinsic Math, Single-Precision Functions .................................. 45
5.1.3 Examination of the Observed Peak ....................................................... 46
5.2 Double-Precision Results .................................................................. 48
5.2.1 Use of the Accurate Math, Double-Precision Functions ................................... 48
5.2.2 Use of the Intrinsic Math, Double-Precision Functions .................................. 50
5.3 Accuracy Comparison ................................................................................................ 51
Chapter 6 Applicability to Production Codes ......................................................................... 53
6.1 Limitations of CUDA for Monte Carlo Neutron Transport ..................................... 53
6.1.1 Maximum Available Memory ............................................................... 53
6.1.2 Accuracy of Computations ............................................................... 54
6.1.3 Error-Checking Memory .................................................................. 54
6.1.4 Maintainability ........................................................................ 55
6.1.5 Hardware Architecture Changes .......................................................... 55
6.1.6 Optimizations May Not Be As Successful For Larger Problems ............................. 56
6.1.7 Lack of Large-Scale GPU Cluster ........................................................ 56
6.1.8 CUDA Development Tools ................................................................. 57
6.2 Possible Applications of CUDA-Accelerated MC ..................................................... 58
Chapter 7 Summary and Conclusions ..................................................................................... 59
7.1 Conclusions ............................................................................... 59
7.2 Recommendations for Future Work ........................................................... 61
Bibliography ............................................................................................................................ 62
Appendix A LADONc Source Code ....................................................................................... 64
Appendix B LADONg Source Code ....................................................................................... 81
Appendix C Sample Input Files .............................................................................................. 105
C.1 Geometry File "array2.geo" ................................................................ 105
C.2 Material File "array2.mat" ................................................................ 109
C.3 Neutron Source File "array2.src" .......................................................... 111
C.4 Nuclear Data File for 1H, "1001" .......................................................... 114
LIST OF FIGURES
Figure 1: NVIDIA Peak Performance Through Time, Courtesy NVIDIA .............................. 4
Figure 2: CUDA Kernel Execution, Courtesy NVIDIA .......................................................... 8
Figure 3: CUDA Hardware Architecture, Courtesy NVIDIA ................................................. 9
Figure 4: History-Based Algorithm ......................................................................................... 15
Figure 5: Details of Geometry Tracking .................................................................................. 16
Figure 6: Sample Code Transferring Neutron Data to GPU .................................................... 27
Figure 7: BFG/NVIDIA GTX 275 OC, Courtesy BFG ........................................................... 32
Figure 8: CUDA Code Transferring Data to Constant Memory .............................................. 35
Figure 9: CUDA Code Transferring Global Memory Data to Shared Memory ...................... 37
Figure 10: Example of Variable Reuse to Reduce Registers ................................................... 37
Figure 11: Single-Precision Speedup ....................................................................................... 44
Figure 12: Single-Precision Speedup Using Fast Math ........................................................... 45
Figure 13: MC-HISTORYc Run Time in the Peak Region ..................................................... 47
Figure 14: MC-HISTORYg Run Time in the Peak Region ..................................................... 47
Figure 15: Speedup in the Peak Region ................................................................................... 48
Figure 16: Double-Precision Speedup ..................................................................................... 49
Figure 17: Double-Precision Speedup Using Fast Math .......................................................... 50
LIST OF TABLES
Table 1: Assumptions .............................................................................................................. 14
Table 2: Test System Hardware Specification ......................................................................... 31
Table 3: GPU Specifications, Courtesy BFG........................................................................... 32
Table 4: Accuracy Comparison ............................................................................................... 52
Table 5: Summary of Speedups ............................................................................................... 60
LIST OF ABBREVIATIONS
(Entries are listed alphabetically)
CM Center-of-Mass System
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
ECC Error Checking and Correcting
ENDF Evaluated Nuclear Data File
GPU Graphics Processing Unit
HPC High Performance Computer
MC Monte Carlo Neutron Transport
nvcc CUDA C Compiler
PCIe Peripheral Component Interconnect Express Bus
RAM Random Access Memory
SIMD Single-Instruction, Multiple-Data
SM Streaming Multiprocessor
ULP Unit in Last Place
ACKNOWLEDGEMENTS
The author would like to thank the following people for their invaluable assistance with
this work. First, he thanks his research advisor, Professor Kostadin N. Ivanov, for guidance
through the process. Secondly, the users of the NVIDIA CUDA Programming Forum deserve
many thanks for providing solutions to difficult programming challenges encountered during the
performance of this research. Finally, the author would like to thank his girlfriend for the
unending support she has given him over the past nine months as he continually was forced to
choose computer programming over his duties as a boyfriend (it is worth noting that in the
months following the completion of this thesis, she may wish the author had another topic to
pursue).
Chapter 1
The Case For Acceleration of Monte Carlo Transport Simulations
The Monte Carlo method of solving the Boltzmann Neutron Transport Equation (MC) is
capable of producing highly accurate solutions for nuclear engineering problems of interest. This
is enabled through the exact representation of problem geometry (as long as a mathematical
description of the geometry exists), the use of nuclear data that is a continuous function of energy
(which can be as accurate as the data produced by the experimental database), and the allowance
of a continuous spectrum of particle directions. MC solutions therefore provide a much higher fidelity
than diffusion theory or discrete ordinates based solutions which both require discretization of the
problem geometry, energy, and/or angles.
Despite these benefits, MC has not yet displaced diffusion theory or discrete ordinates
methodologies for design. This is mainly due to the relatively large computation time required to
produce useful results. A meaningful power distribution from an MC solution requires a
converged source distribution and fine-block tally regions in all regions of interest to produce
small uncertainties on results; both of these require many millions of neutron simulations
(depending on the problem size). Therefore, the Monte Carlo method cannot displace the
established solution methodologies as a prime tool for design until the computation time can be
decreased to manageable levels. For instance, Drs. Kord Smith (at the M&C 2003 conference)
(Smith 2003), Bill Martin (at the M&C 2007 conference) (Martin 2007), and Forrest Brown (at
the PHYSOR 2008 conference) (Brown 2008) have postulated that the computational power to
perform a full-core (PWR) calculation in one hour (which would require twenty billion neutron
histories to reach a 1% standard deviation) will not be available until the year 2018.
This thesis will evaluate one possible option of accelerating MC simulations to enable the
use of MC for design: utilizing the graphics processors installed in most desktop and laptop
computers.
Chapter 2
General Purpose Graphics Processing Units
A Graphics Processing Unit (GPU - commonly referred to as a 3-D accelerator, video
card, or graphics card) is a processor dedicated to performing the computationally intense
floating-point calculations required to render 3-D images in real-time. These cards are installed
in most desktop and laptop personal computers to increase gaming performance of the computer.
GPUs communicate with the central processing unit (CPU) and the system's random access
memory (RAM) through the Peripheral Component Interconnect Express (PCIe) bus which
currently provides up to eight gigabytes per second of bandwidth (NVIDIA Corporation 2009).
Recent GPUs utilize many processing elements (similar to 'cores' in CPUs) which operate in
parallel to convert the 3-D geometry and lighting conditions into a 2-D image to display on a
computer monitor. The details of the hardware architecture are discussed in more detail below.
Until recently GPUs were utilized solely for performing computations required to
produce the output image. However, in late 2006, NVIDIA released the G80 GPU, and with it,
the Compute Unified Device Architecture (CUDA) hardware and software architectures to
provide general-purpose processing capabilities on GPUs. The G80 hardware provided up to 371
gigaflops of peak performance in a package that cost only a few hundred dollars and consumed
less than 200W of power. Because these devices were primarily mainstream graphics cards,
40,000,000 CUDA capable cards were in home computers by the end of 2007 (Luebke 2007). A
comparison of the peak performance of CPUs and NVIDIA GPUs since 2003 can be seen in
Figure 1. This figure shows the enormous increase in GPU computational power, relative to
typical CPUs, since 2003.
Figure 1: NVIDIA Peak Performance Through Time, Courtesy NVIDIA
The large disparity between the performance of then-top-end CPUs and GPUs arises because
the CPU is designed for sequential code performance, with large instruction and data caches to
reduce data transfer overhead, while GPUs utilize many processors which execute in parallel,
with small local caches but very fast data transfer rates between the GPU's RAM and the
processors (approximately 10x the speed of system RAM). The GPU hardware and software
architectures will be discussed in more detail in the following section.
2.1 The CUDA Architecture
NVIDIA, in developing the CUDA environment, focused on creating a programming
environment which allows for essentially unlimited thread-level parallelism from the
programmer's perspective, which the underlying hardware then implements at runtime to solve
massively parallel problems. Since the hardware and software are so inter-related, the topics will
be discussed together. The following information is discussed in the CUDA Programming Guide,
and the reader is directed there for further details (NVIDIA Corporation 2009).
For CUDA programs, the GPU is a co-processor which can be utilized to offload work
from the CPU. An entire program does not execute solely on the GPU, but instead only portions
of the programs are sent to the GPU for execution, typically those portions of the code for which
the programmer has decided that the GPU is a more suitable compute environment. These
portions of code executed by the GPU are called kernels. The lowest computation unit in the
kernel is a thread, which is essentially an instruction set that operates on a particular piece of data.
Threads are grouped into blocks (with a maximum of 1024 per block). Threads inside a block are
all required to start at the same instruction of a stream, but divergence to a certain degree is
allowed after that point. These threads can share data, and therefore all threads within a block
must be executed on the same "Streaming Multiprocessor" (SM), which is somewhat akin to a
core on a CPU. Blocks have the property of being able to run in any order on any of the GPU's
compute units, which is a characteristic that the programmer must take into consideration.
Finally, the highest level of execution in a kernel is called the grid, which is an array of thread
blocks.
The threads inside of a block are split into groupings called warps. The current
generation of NVIDIA cards allocates thirty-two threads per warp. This grouping is transparent to
the user, and a fixed warp size should not be assumed. Warps are sets of threads which are all
processing the same instruction. If a portion of the threads in a warp needs to take one side of a
branch but another portion needs to take the other, a 'divergent branch' condition exists. This is
usually caused by the use of IF/ELSE statements or conditional loops. When this occurs, all
threads in the warp execute each of the possible paths, but only the result from the correct path
for each thread is stored. This is not optimal for decreasing the runtime of a code, as it reduces
the maximum parallelism possible. The GPU can, with minimal overhead, switch which warp an
SM is executing. This is useful when a particular warp has requested data from high-latency
memory: while the data is being fetched, the SM can switch the warp it is executing.
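As a first-order illustration, the cost of divergence can be sketched with a hypothetical cycle-count model (this is an editorial sketch, not taken from the thesis or from NVIDIA's documentation): a warp that diverges pays for both sides of the branch serially, while a uniform warp pays only for the side it takes.

```cpp
#include <cassert>

// First-order cost model for a warp encountering an IF/ELSE.
// If every thread takes the same side, the warp pays only for that path;
// if the warp diverges, both paths are executed serially and the results
// of the wrong path are discarded for each thread.
int warp_branch_cycles(int cycles_taken, int cycles_other, bool diverged) {
    return diverged ? cycles_taken + cycles_other  // both paths serialized
                    : cycles_taken;                // single path only
}
```

Under this model a warp whose threads split across a 10-cycle path and a 30-cycle path costs 40 cycles instead of 10, which is why minimizing divergent branches is a focus of the optimization work in Chapter 4.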
Now that the basic execution model has been presented, it is useful to discuss the
memory model. Refer to Figure 2 and Figure 3 during the following discussion. The CUDA
software and hardware architectures allow for a few different memory spaces to be utilized by the
kernel. These include per-thread registers, per-thread local memory, per-block shared memory,
per-program constant memory, per-program texture memory, and per-program global memory.
The register file is a set of 16,384 32-bit registers which are dynamically allocated to
each thread that is executing on an SM. Similar to CPU registers, these are very fast, locally
stored data which cannot be shared amongst threads. Local memory is provided as a 'spill-over'
region should the registers be filled. Unfortunately, the local memory is only local in name: local
memory is actually stored in the GPU RAM, and hence has much slower access than the register
file (approximately 200 cycles in latency to access the data).
Shared memory is memory stored on-chip with an SM that provides data that all threads
in a block can access. Since it is local to the SM, it provides for access times essentially as fast as
the registers. Each SM has 16 kB of shared memory, which is divided among all of the blocks
assigned to that SM.
Constant and texture memory are both read-only data stored in the GPU RAM; however,
both are cached. The constant memory space is approximately 64 kB, while the texture size is
essentially unlimited. Both spaces have caches, which provide rapid data access times if the
data is in the cache. The constant cache exploits linear locality, while the texture cache exploits
spatial locality. The texture memory can provide advanced functionality through
the texturing unit such as interpolation, but these features were not used in this analysis.
Global memory is an un-cached space whose size is only limited by the GPU RAM
available. Due to the un-cached nature, a read from global memory will cost approximately 200
cycles.
A consequence of this architecture is that the fastest computation times can only be achieved
by using coalesced memory reads and writes. That is, the data requested by the threads in a block
should be located in a contiguous memory space so that the entire memory transfer bus can be
filled with data which can be read in the minimum number of cycles.
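One common way to obtain contiguous accesses is to store particle attributes as a struct-of-arrays rather than an array-of-structs, so that thread i's read of an attribute lands next to thread i+1's. The sketch below is hypothetical; the thesis's actual data structures appear in Appendices A and B.

```cpp
#include <vector>

// Array-of-structs: thread i reading neutrons[i].energy touches addresses
// strided by sizeof(Neutron), so a warp's reads cannot be coalesced.
struct Neutron { float x, y, z, energy; };

// Struct-of-arrays: thread i reading energy[i] touches an address adjacent
// to thread i+1's read, allowing one wide, coalesced memory transaction.
struct NeutronBank {
    std::vector<float> x, y, z, energy;
    void add(float xi, float yi, float zi, float ei) {
        x.push_back(xi); y.push_back(yi); z.push_back(zi); energy.push_back(ei);
    }
};
```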
Since the registers and shared memory are divided amongst the blocks and threads on an SM,
there is an optimum amount of shared memory and registers to use per block and thread so that
more blocks can be run on each SM (and therefore more blocks can be run at a time on the entire
GPU). This optimum configuration can be determined with the CUDA Occupancy Calculator, a
tool from NVIDIA provided to developers in the CUDA Toolkit which calculates how many
blocks can fit on an SM.
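The arithmetic behind this can be approximated as follows. This is a simplified sketch using GT200-era limits (16,384 registers and 16 kB of shared memory per SM, at most 8 resident blocks); the real Occupancy Calculator also accounts for warp-granularity allocation and other hardware rules that are ignored here.

```cpp
#include <algorithm>
#include <cassert>

// Rough blocks-per-SM estimate: each resident block consumes registers
// (threads * registers per thread) from the 16,384-register file and a
// slice of the 16 kB shared memory; at most 8 blocks may be resident.
int blocks_per_sm(int threads_per_block, int regs_per_thread,
                  int shared_bytes_per_block) {
    const int reg_file = 16384, shared_mem = 16384, max_blocks = 8;
    int by_regs = reg_file / (threads_per_block * regs_per_thread);
    int by_shared = (shared_bytes_per_block > 0)
                        ? shared_mem / shared_bytes_per_block
                        : max_blocks;
    return std::min(std::min(by_regs, by_shared), max_blocks);
}
```

Reducing registers per thread or shared memory per block raises the number of resident blocks, which is exactly the trade examined in Sections 4.1.2 and 4.1.4.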
Finally, because global and local memory are not cached, and because cache misses can
occur for the cached memory locales, the CUDA architecture can effectively 'swap' the blocks
being executed at any given time so that while block A is awaiting data from global memory,
block B can be executed. This is called latency hiding. The developer can take advantage of this
by solving the problem at hand with more blocks than there are SMs on the GPU.
Figure 2: CUDA Kernel Execution, Courtesy NVIDIA
Figure 3: CUDA Hardware Architecture, Courtesy NVIDIA
2.2 Additional CUDA Programming Information
The CUDA programming environment is essentially an extension to the C programming
language. Because it is based on the C language, there is only a small learning curve for CUDA
development.
Some specifics necessary when reviewing source code in the following chapters are as
follows:
Before any kernel can be called, the data necessary for the kernel must be copied
from the system RAM to the GPU RAM. CUDA provides functions which
perform this operation (for transfers both to and from the GPU) for constant memory,
texture memory, and global memory. Shared memory, constant memory, global
memory, and textures are all declared by the programmer. Constant memory,
global memory, and textures have the life of a program, i.e., a programmer does
not have to load data into the memory spaces between each kernel call, unless the
data has changed.
Kernels are functions, like any other C function, except that the function definition
and header must include the "__global__" identifier, which tells the CUDA
compiler that this is a GPU function that is callable by the CPU. When a kernel
is called, it requires an "execution configuration" as well, which tells the GPU
how many blocks and threads to create. These execution configurations, in their
simplest form (and the form used in this work), look like:
"<<<dim3 grid, dim3 block>>>", where the grid dimensions (the number of
blocks) come first, followed by the block dimensions (the number of threads per
block). The "dim3" type is a data structure containing three integers to describe
the (up to) three-dimensional grid.
Functions which can only be called by the GPU must be written with the
"__device__" identifier before the function declaration and header.
Functions can be written which are used by both the CPU and the GPU. These
functions would have both the identifiers "__host__" and "__device__" before
the function declaration and header.
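The way an execution configuration carves work into blocks and threads can be emulated on the host. The sketch below is ordinary C++ (the names mirror CUDA's built-in blockIdx/blockDim/threadIdx variables but nothing here runs on a GPU): each thread computes one global index, and the host loops stand in for the grid.

```cpp
#include <cassert>
#include <vector>

// Each CUDA thread computes one global element index from its block and
// thread coordinates; here the whole 1-D grid is emulated on the host.
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Fill out[i] = 2*i, as a stand-in for per-thread kernel work.
std::vector<int> emulate_kernel(int grid_dim, int block_dim) {
    std::vector<int> out(grid_dim * block_dim, 0);
    for (int b = 0; b < grid_dim; ++b)         // blocks may run in any order
        for (int t = 0; t < block_dim; ++t) {  // threads within one block
            int i = global_index(b, block_dim, t);
            out[i] = 2 * i;
        }
    return out;
}
```

On the GPU the two loops disappear: every (block, thread) pair executes the loop body once, concurrently.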
2.3 Limitations of the CUDA Model
The discussion thus far has focused on the highlights of the use of GPUs to accelerate
general purpose programs. However, there are some downsides to the widespread adoption of
these cards for scientific applications, which are discussed in the following paragraphs.
The CUDA Programming Guide (NVIDIA Corporation 2009) discusses the degree to which
NVIDIA GPUs conform to the Institute of Electrical and Electronics Engineers (IEEE) Standard
for Floating-Point Arithmetic (IEEE-754). The guide states that most operations do match the
standard, with a few exceptions. Exceptions include: a lack of dynamically configurable
rounding modes; a lack of a mechanism for signaling that a floating-point exception has occurred;
and the fact that various functions are implemented in ways that are not IEEE-compliant (such as
square root and division). In the end this means that some functions are less accurate than their
CPU equivalents.
The latest generation of NVIDIA's GPUs, the GT200, is the first to support double-
precision floating-point calculations. However, each SM has only one double-precision unit
(versus eight single-precision compute units per SM). Therefore, a first-order estimate of double-
precision computation runtime suggests that double-precision calculations would require eight
times as long to run as single-precision floating-point calculations.
The highest-end GPUs (at the time of this thesis) provide up to 4GB of RAM. For most
problems (especially shielding applications) this is not enough space, and domain
decomposition techniques will be required to overcome the limit.
At this time GPUs do not provide error detection or correction. There is therefore no
means to ensure the accuracy of the data transferred to and from the GPU RAM. This is a
reliability issue for clusters implementing GPUs because of the increased chance that the entire
cluster will experience a memory error.
Chapter 3
Methodology
To evaluate GPUs for the acceleration of MC transport codes, a baseline code was written
from scratch for the Intel x86 CPU architecture. This code is referred to herein as "LADONc,"
where the c denotes CPU. A clean-sheet design was chosen over porting an existing code
because, to properly convert and optimize, one must have full knowledge of the inner workings
of the software, something that would have taken an immense amount of time to acquire for a
production code such as MCNP. LADONc utilizes the history-based algorithm.
LADONc was then modified to utilize the event-based algorithm discussed in a paper by
Forrest Brown and Bill Martin (Forrest B. Brown 1984). This too was written for the CPU. This
code will be referred to as "CERBERUSc."
LADONc and CERBERUSc were the baseline codes which were ported to the GPU
architecture by utilizing the CUDA programming language. These will be referred to as
"LADONg" and "CERBERUSg," where the g denotes GPU.
All of the above codes are discussed in further detail in the following sections.
3.1 LADONc
This code, depicted in flowchart form in Figure 4 and Figure 5, operates on one neutron
simulation at a time. Each neutron's calculation begins at birth (either from a user-defined source
or from a fission event), and the neutron is tracked until it leaks or is absorbed. To simplify the
development of LADONc, only neutron multiplication factor (kcalc) problems were considered.
Neutron physics are not modeled in full detail; features such as Doppler broadening and
advanced thermal scattering treatments are ignored. The free-gas model is used for all
scattering reactions. A list of assumptions and approximations can be found in Table 1. The
source code for LADONc is contained in Appendix A. Specifics of this code are discussed
below, as it provides the background for the rest of the work performed.
Table 1: Assumptions

General:
- Units for cross-sections are barns, number densities are atoms/b-cm, and geometries are in cm

Geometry:
- Only spheres and cuboids are supported
- None of the above shapes can be transformed
- If a neutron is exactly on the boundary of a cell it is considered inside the cell
- The user must define which cells are inside which other cells
- Cells must be input in the order that they are inside each other
- Cells must not intersect each other
- The number of cells is arbitrarily limited to 100

Nuclear Data:
- No special treatment of cross-sections (unresolved resonances, bound nuclei)
- The number of neutrons per fission is set at 2.53
- No discrete Q-values are used for inelastic scattering
- If the energy is less than the minimum in the cross-section data, 1/v scaling is used
- Cross-section and fission spectrum energies are limited to 20 MeV

Physics:
- Only the free-gas model is used for collisions

Material Data:
- The number of total nuclides is arbitrarily limited to 20
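One entry from Table 1, the 1/v fallback below the tabulated minimum energy, can be written directly. The sketch below uses illustrative numbers and is not LADONc's actual routine: since neutron speed v is proportional to the square root of energy, a "1/v" cross-section scales as sqrt(E_min / E).

```cpp
#include <cassert>
#include <cmath>

// Extrapolate a cross-section below the lowest tabulated energy using
// 1/v scaling: sigma(E) = sigma(E_min) * sqrt(E_min / E), which holds
// because neutron speed v is proportional to sqrt(E).
double xs_one_over_v(double sigma_at_min, double e_min, double e) {
    return sigma_at_min * std::sqrt(e_min / e);
}
```

For example, a 10-barn cross-section tabulated down to 1e-5 MeV doubles to 20 barns at one quarter of that energy.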
Figure 4: History-Based Algorithm
Figure 4 summarizes the history-based loop: for each neutron batch, the input data is read and
the tallies are initialized; then, for each neutron in the batch, the total optical thickness travelled
is calculated and the neutron is transported to the corresponding x, y, z; the nuclide the neutron
collided with is determined by interrogating the macroscopic cross-sections for the target
nuclide, current cell, and energy; and the reaction type is then sampled. A fission determines the
number of fission neutrons and adds that many to the source list; a radiative capture increases
the capture tally and ends the neutron; and an elastic or inelastic scatter changes the neutron
energy and direction according to the corresponding scattering laws. Finally, keff is updated and
the neutron is ended if it leaked.
Figure 5: Details of Geometry Tracking
[Flowchart: while the neutron still has distance to travel, get the initial cell, ΣT(E), and collision distance; calculate the position the neutron would reach if moved the collision distance in direction Ω; if the cell has not changed, set the neutron position to the calculated point and store the final x, y, z and cell; if it has changed, determine the distance to the cell boundary, subtract it from the distance to transport, and tally a leak and end the neutron if it leaked.]
3.1.1 Geometry Representation
To solve problems more complex than an infinite homogeneous medium, a Monte
Carlo neutron transport code must have methods to define geometric objects, or
cells, and be able to track which cell a particular point is in, and how far until the boundary of a
cell. This section describes how these actions are performed in LADONc.
LADONc is capable of modeling parallelepipeds and spheres, as these are easily defined
in terms of mathematical functions. These cells can be set to any size and placed at any location
in the model. The problem boundary is also defined through the use of one of these shapes.
LADONc makes use of geometric hierarchy; i.e., the user can define one cell to be inside of
another cell, displacing the material of the larger cell with the smaller cell's material. In a
production code this is useful because it reduces the number of input values that need to be
updated when one dimension of the model changes (as is usual during design iterations or
sensitivity studies). This hierarchy also reduces the need for gridding of the model, because
every cell has associated with it a list of the cells that are inside of it. For the purposes of this
code, the user inputs these 'inside_me' lists, as opposed to having the software perform this task
automatically.
Additionally, the user must input the cells in this hierarchical order. For example, if
object A is inside of object B, then A must come first in the geometry input file. This is not
necessary in a production code, but is used here to reduce the computational effort of determining
which cell a neutron is in: since the cells are checked in the order the user inputs them, the first
cell that returns a successful query is the smallest one the neutron can be in.
Geometric information is input into the program by use of the file "FILE_NAME.geo",
where FILE_NAME is defined by the user. A sample geometry input file can be seen in
Appendix C.
3.1.2 Material Representation
Every geometric cell has associated with it a list of materials. Each material is described
by using an integer identifier and number density for each nuclide. The integer identifier, or
'ZAID', represents the proton number (Z) and mass number (A) according to the following
formula:
ZAID = 1000 · Z + A
Equation 1
For example, the ZAID value for ¹⁶O (Z = 8, A = 16) is 8016. This format is used because it
saves memory and can be quickly decomposed later into the appropriate parts when necessary.
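As a brief illustration (these helper functions are not from the thesis source), the packing and decomposition can be written as:

```cpp
// Illustrative helpers (not from the thesis source) showing how a ZAID packs
// Z and A into a single integer and how it is decomposed again when needed.
#include <cassert>

int make_zaid(int z, int a) { return 1000 * z + a; }  // Equation 1
int zaid_to_z(int zaid)     { return zaid / 1000; }   // recover proton number
int zaid_to_a(int zaid)     { return zaid % 1000; }   // recover mass number
```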
Material information is input into the program by use of the file "FILE_NAME.mat",
where FILE_NAME is defined by the user. A sample material input file is shown in Appendix C.
3.1.3 Nuclear Data and Cross-Section Representation
To reduce development effort for this proof-of-concept analysis, no standardized cross-
section format (such as ACE) was used. Instead, a table was manually created for nuclides which
includes microscopic radiative capture, fission, elastic, and inelastic cross-sections as a function
of energy. Total cross-sections are calculated on the fly. These values could have been any
floating point number desired, but to have some resemblance of accuracy, values from the
Evaluated Nuclear Data File (ENDF) plotting program (ENDFPLOT 2.0) were used (KAERI,
Korea Atomic Energy Research Institute 2007). The inelastic scattering cross-sections are
simplified as well because only one set of eight Legendre coefficients is provided for each
nuclide. For fissile or fissionable nuclides, parameters describing the fission neutron energy
spectrum, χ(E), are also included in this data.
The cross-sections are considered point-wise: if a neutron's energy falls between two
energy points defined in the cross-section set, a simple linear interpolation is performed, even in
the resonance regions. A binary search algorithm is used to find the array index corresponding to
the energy just larger than the neutron energy in question. To avoid negative cross-sections, 1/v
extrapolation is used if a neutron's energy falls below the range of energies of the cross-section
data. The number of points in the energy mesh is small relative to production codes. For
instance, the U-235 cross-section file contains an energy mesh with 31,839 data points, ranging
from 0.0115 eV to 20.0 MeV.
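The lookup just described can be sketched as follows; this is an illustrative reconstruction (binary search, linear interpolation, 1/v extrapolation below the mesh), not the LADONc source, and the function and parameter names are assumptions.

```cpp
// Illustrative sketch (not the LADONc source) of the point-wise cross-section
// lookup described above: binary search for the bracketing energy points,
// linear interpolation between them, and 1/v extrapolation below the mesh.
#include <cassert>
#include <cmath>
#include <vector>

float lookup_xs(const std::vector<float>& energy,  // ascending energy mesh
                const std::vector<float>& xs,      // cross-section at each point
                float e)
{
    // Below the mesh, sigma*v is taken constant, so sigma scales as 1/sqrt(E).
    if (e <= energy.front())
        return xs.front() * std::sqrt(energy.front() / e);

    // Binary search for the first index whose energy exceeds e.
    std::size_t lo = 0, hi = energy.size() - 1;
    while (hi - lo > 1) {
        std::size_t mid = (lo + hi) / 2;
        if (energy[mid] > e) hi = mid;
        else                 lo = mid;
    }

    // Simple linear interpolation, even in the resonance regions.
    float frac = (e - energy[lo]) / (energy[hi] - energy[lo]);
    return xs[lo] + frac * (xs[hi] - xs[lo]);
}
```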
Nuclear data and cross-section information is stored in the program's executable
directory with one file per nuclide. For example, the file named "8016" contains cross-sections
for ¹⁶O. The code determines which cross-section files to load into memory based on the
ZAIDs defined in the material input file. A sample cross-section data input file is located in
Appendix C.
3.1.4 Initial Source Neutrons
The user can provide the initial source neutron locations and energies. Neutron
directions, however, are randomly generated by the program on the fly. These neutrons are
added to a neutron queue during problem initialization and are removed as they are used.
Neutrons produced from fission are also added to the end of this queue, using the location of the
fission that caused them as their initial location and selecting the initial energy randomly from the
fission energy spectrum. If the queue is depleted of user-input source neutrons and fission
neutrons, then neutrons are produced at the problem origin with an energy randomly selected
from the ²³⁵U fission energy spectrum.
Source neutron information is input into the program by use of the file
"FILE_NAME.src", where FILE_NAME is defined by the user. A sample source neutron input
file is located in Appendix C.
3.1.5 Random Number Generator
A linear congruential pseudo-random number generator from Numerical
Recipes (William Press 2002) was used. Of course, more complicated generators could have been used,
but this was considered appropriate given that the goal of this research is not exact answers, but
the feasibility of speedups through the use of GPUs.
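A minimal sketch of such a generator is shown below; the multiplier/increment pair (1664525, 1013904223) is a well-known 32-bit LCG choice from Numerical Recipes, though the exact constants used in the thesis code may differ.

```cpp
// Minimal sketch of a 32-bit linear congruential generator; the constants
// (1664525, 1013904223) are a well-known pair from Numerical Recipes, though
// the exact generator used in the thesis code may differ.
#include <cassert>
#include <cstdint>

struct Lcg {
    std::uint32_t state;

    explicit Lcg(std::uint32_t seed) : state(seed) {}

    // Advance the state: x_{n+1} = (a*x_n + c) mod 2^32.
    std::uint32_t next_u32() {
        state = 1664525u * state + 1013904223u;
        return state;
    }

    // Uniform float in [0, 1); the top 24 bits keep the result exact in float.
    float next_float() {
        return (next_u32() >> 8) * (1.0f / 16777216.0f);
    }
};
```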
3.1.6 LADONc Algorithm Details
The following sections discuss the steps required of the history-based algorithm. This
algorithm essentially performs the required neutron tracking and neutron-nuclide interaction on a
per-neutron basis. That is, each neutron is tracked from birth to death (absorption or leakage)
before the next neutron‘s tracking can begin. The following sections describe the steps
performed for each neutron. Delta tracking was considered for inclusion, but was judged to be as
likely to cause divergent branches as the above method, and so the more straightforward
approach was used.
3.1.6.1 Neutron Tracking
As can be seen in Figure 5, the neutron distance to collision is determined as follows:
1. Determine the cell that the neutron is in. This is performed by checking whether
the neutron's position is inside any of the objects in the geometry. Since the
input is entered so that any object that is inside of another one is entered first, the
smallest cell containing the neutron will be identified first, and this will be used.
2. A randomly generated number between 0.0 and 1.0 is created, called η.
3. The expected collision distance is calculated based on Equation 2.
coll_dist = −ln(η) / ΣT
Equation 2
However, this collision distance may or may not be the distance to proceed with.
If the collision distance is larger than the distance to the boundary further
calculations are required.
4. The current material total macroscopic cross-section, ΣT, is obtained for the
current cell and the current energy.
5. The distance to the current cell boundary is calculated. There is a chance that
there is another object inside of the current one, so all of those objects in the
'inside_me' list (discussed previously) must also be checked. The smallest of all
of those distances is kept as the actual travel distance.
6. If the collision distance is larger than the distance calculated in the previous step,
the following takes place:
a. The neutron is moved the distance calculated in Step 5 plus some small
additional distance (0.001) to assure that the neutron is outside of the cell
it was just in, accounting for any possible rounding error.
b. The random parameter η is then updated to reflect the fact that the
neutron has ‗used‘ some of the original η. This is performed per
Equation 3 below. Note that the 0.001 added above is accounted for
here.
η = exp[(distance travelled + 0.001 − collision distance) · ΣT]
Equation 3
c. The current cell is queried in the same way it is described in Step 1
above.
d. If the current cell is inside the problem boundary (i.e., if the neutron has
not leaked) move to Step 3 and repeat until the collision distance is less
than the distance to the cell boundary. Otherwise, flag the neutron as
being leaked, and terminate its history.
7. If the collision distance is less than the distance calculated in Step 5, the neutron
is moved that distance.
The program has now successfully tracked the neutron to its collision location and is ready to
select the reaction type that the neutron will undergo.
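The distance-to-collision and η-update relations above (Equations 2 and 3) can be sketched as follows; this is illustrative, not the LADONc source, and a single, constant total cross-section is assumed for simplicity.

```cpp
// Illustrative sketch (not the LADONc source) of Equations 2 and 3 from the
// tracking steps above; a single, constant total cross-section is assumed.
#include <cassert>
#include <cmath>

const float kNudge = 0.001f;  // small push past the boundary (rounding guard)

// Equation 2: expected collision distance for a sampled eta in (0, 1).
float collision_distance(float eta, float sigma_t)
{
    return -std::log(eta) / sigma_t;
}

// Equation 3: updated eta after travelling (boundary distance + nudge),
// so the unused portion of the flight carries over into the next cell.
float update_eta(float distance_travelled, float coll_dist, float sigma_t)
{
    return std::exp((distance_travelled + kNudge - coll_dist) * sigma_t);
}
```

Note that, with a constant ΣT, resampling with the updated η reproduces the remainder of the original flight: collision_distance(update_eta(d, c, ΣT), ΣT) equals c − d − 0.001.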
3.1.6.2 Determination of Target Nuclide and Reaction Type
Before a reaction type can be determined, the nuclide type that was collided with must be
determined since each cell can be composed of many different nuclides. This is performed by
first generating a random number between 0.0 and the total material macroscopic cross-section (at
the energy of interest). Then, for each isotope in the material, the total macroscopic cross-section
is determined and compared with this random number. The nuclide the neutron has interacted
with is therefore the one whose range of total cross-section contains the value of the random
number.
Next, the reaction type that the neutron underwent when it collided with the target
nuclide is determined. The types of reactions available for the nuclide are: fission absorption,
non-fission absorption, elastic scattering and inelastic scattering. The target nuclide microscopic
cross-sections for each of these reactions are determined. Again, a random number is generated
and compared with these cross-sections to determine the reaction type.
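The cumulative search described above can be sketched as follows; the names are illustrative, not taken from the LADONc source.

```cpp
// Illustrative sketch (not the LADONc source) of selecting the target nuclide:
// a random value drawn on [0, Sigma_total) is walked down the per-nuclide
// macroscopic cross-sections until it is exhausted.
#include <cassert>
#include <vector>

// sigma: each nuclide's total macroscopic cross-section at the neutron energy;
// xi: a random draw in [0, sum of sigma). Returns the index of the nuclide hit.
int sample_nuclide(const std::vector<float>& sigma, float xi)
{
    for (std::size_t i = 0; i + 1 < sigma.size(); ++i) {
        if (xi < sigma[i]) return (int)i;
        xi -= sigma[i];
    }
    return (int)sigma.size() - 1;  // last nuclide absorbs any rounding slack
}
```

The same pattern, applied to the target nuclide's microscopic cross-sections for fission, non-fission absorption, elastic scattering, and inelastic scattering, selects the reaction type.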
3.1.6.3 Fission Reactions
If the neutron caused fission, the code will then create fission neutrons and add them to
the queue. Since fission events can emit a random number of neutrons, the code must first
determine how many to add to the queue. This is done rather simply: a random number is
generated between 0.0 and 1.0. If the number is larger than 0.53 (the average value of ν for ²³⁵U
minus two), then three neutrons are created; if not, then two are created. Note that no
distinction is made between prompt and delayed neutrons. The number of fissions that occurred
and the fission neutrons produced are tallied separately.
For each neutron produced, the code will add a neutron to the queue with the initiating
neutron‘s location, but with an energy determined by sampling from the fissioning nuclide‘s
fission neutron energy spectrum, χ(E). The simulation of the initiating neutron is then terminated.
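The sampling rule described above can be written as a one-line helper; this is illustrative, not the LADONc source.

```cpp
// Illustrative one-line helper (not the LADONc source) for the rule above:
// draw xi on [0, 1) and create three fission neutrons when xi exceeds 0.53,
// otherwise two.
#include <cassert>

int sample_fission_neutrons(float xi)
{
    return (xi > 0.53f) ? 3 : 2;
}
```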
3.1.6.4 Non-Fission Absorption Reaction
If a neutron is absorbed without fissioning the target nuclide, then the neutron simulation
is terminated, and the absorption tally is incremented by one.
3.1.6.5 Scattering Treatment
Elastic and inelastic scattering events are treated in a similar manner. The difference lies
in the angular distribution that is sampled from and the Q-value of the reaction (for elastic
scattering, the Q-value is zero MeV). Both are solved through conservation of energy and
momentum. While the code uses separate functions for each, they will be described here in
parallel.
1. The ZAID of the target nuclide is decomposed into just the A value, since the
kinematics equations require knowledge of the atomic mass. Note that the mass
here is simplified as just A, instead of the actual atomic mass.
2. The first step is to determine the Center-Of-Mass (CM) system direction cosine,
µCM.
a. For elastic scattering, a value of µCM is picked by selecting a random
number between -1.0 and 1.0 from a uniform distribution. This
corresponds with isotropic scattering in the CM system.
b. For inelastic scattering, the same angle, µCM, is created through rejection
sampling from a distribution defined by a Legendre series truncated after
the 8th term. This is done by first randomly selecting µCM, from -1.0 to
1.0. This is then plugged into the Legendre polynomial expansion
(P(µCM)). Then another random number is generated from 0.0 to 8.0 (to
reflect the maximum possible value of the Legendre series in the range -
1.0 to 1.0). If the new random number is less than or equal to P(µCM)
then µCM is retained. If not, this step is repeated.
3. Now that µCM has been determined, the outgoing energy is determined.
a. For elastic scattering, this is calculated as shown in Equation 4.
Energy_out = Energy_in · (A² + 2·µCM·A + 1) / (A² + 2·A + 1)
Equation 4
b. For inelastic scattering, the reaction Q-value (the amount of energy
bound in the target nucleus during the process) is sampled from a
continuous distribution from 0.0 to Qmax. This is clearly a simplification, as the
Q-value spectrum is not continuous, but this method was chosen to simplify the
nuclear data requirements of the code. Qmax, which is negative for inelastic
scattering, is determined using the relation shown in Equation 5. This relation is
derived to restrict the Q-values to those which are physically possible; i.e., if
energy cannot be conserved, then the Q-value chosen is impossible. After Q is
determined, the outgoing energy is determined by the relation shown in
Equation 6.
Qmax = −Energy_in · A / (A + 1)
Equation 5
Energy_out = Energy_in + Q · (A + 1) / A
Equation 6
4. From this stage forward the elastic and inelastic scattering routines are the same.
The remaining tasks are performed to convert the ingoing angle, Ωin, to the
outgoing angle, Ωout, through the kinematics relations, µCM, and the energy
change.
After the scatter reaction changes the neutron's energy and direction, the process continues from
Step 1 of the neutron tracking algorithm until the neutron is lost through leakage or
absorption.
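The kinematics relations above (Equations 4 through 6) can be collected into a short sketch; the function names are assumptions, and single precision is used to match the codes.

```cpp
// Illustrative collection (not the LADONc source) of the scattering kinematics
// above: Equation 4 for the elastic outgoing energy, Equation 5 for the most
// negative allowed Q-value, and Equation 6 for the inelastic outgoing energy.
#include <cassert>
#include <cmath>

// Equation 4: elastic outgoing energy for CM scattering cosine mu_cm.
float elastic_energy_out(float e_in, float a, float mu_cm)
{
    return e_in * (a * a + 2.0f * mu_cm * a + 1.0f)
                / (a * a + 2.0f * a + 1.0f);
}

// Equation 5: most negative physically possible Q-value.
float q_max(float e_in, float a)
{
    return -e_in * a / (a + 1.0f);
}

// Equation 6: inelastic outgoing energy for a sampled Q (Q <= 0).
float inelastic_energy_out(float e_in, float q, float a)
{
    return e_in + q * (a + 1.0f) / a;
}
```

Substituting Q = Qmax from Equation 5 into Equation 6 gives an outgoing energy of exactly zero, which is the physical restriction the relation enforces.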
3.2 CERBERUSc
Monte Carlo neutron transport is characteristically driven by IF/ELSE tests and WHILE
loops. However, processors utilizing the Single Instruction-Multiple Data (SIMD) architecture,
such as vector computers and, in a sense, GPUs (since all threads in a warp must follow the same
execution path), suffer large performance losses if each compute unit (thread) requires different
instructions to be executed, so such divergence should be avoided if possible. To accomplish
this, the history-based algorithm of analyzing many events for a single neutron and repeating for
all neutrons is modified so that the computer analyzes many neutrons for a single event at once
(such as calculating the change in energy and direction due to a scatter event). This
methodology, as discussed by Brown and Martin, helps to reduce the number of IF/ELSE and
WHILE tests that can induce divergent branches in the vector portions of the code (Forrest B.
Brown 1984).
For this work, the LADONc source code was modified to utilize this algorithm and is
referred to as CERBERUSc. All physics and assumptions in CERBERUSc are the same as for
LADONc. CERBERUSc calculates four 'events' at once: transporting neutrons to the collision
location, determination of the reaction type, elastic scattering, and inelastic scattering. In
between each step, the neutron lists were re-shuffled to reflect the results of each event.
Intermediate calculations were performed, and the next event begun. This had to be repeated
until all neutrons in the batch were absorbed or leaked. This method requires a great deal of
event list shuffling.
3.3 Porting Codes to the GPU
In this section, work necessary to port the CPU codes to the GPU is discussed. Both the
event and history-based codes required similar work to be performed to port the codes.
Before any code can be run on the GPU, the required memory space must be allocated in
the GPU RAM and the data passed from system RAM to the GPU RAM. This is accomplished
through CUDA runtime API function calls similar to the C++ language functions malloc and
memcpy (in fact, they are called cudaMalloc and cudaMemcpy, respectively). For the case of the
MC codes, the cell list, materials list, and cross-section data must be transferred to the GPU
before any calculations can begin. The list of neutron data in the current batch is passed before
each batch begins. Data also can be transferred back from the GPU RAM to the system RAM.
This is accomplished through similar API calls as discussed above, except a flag used to indicate
the direction of data transfer is changed. See Figure 6 for example C++ code which transfers the
neutron batch list to the GPU RAM.
Figure 6: Sample Code Transferring Neutron Data to GPU
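The original figure is an image and its code is not reproduced in this transcript. A representative sketch of the transfer it describes might look as follows; the structure and variable names (neutron_t, h_neutrons, n_batch) are assumptions, not taken from the thesis source.

```cuda
// Hypothetical sketch of the neutron batch transfer described in the text; the
// type and variable names (neutron_t, h_neutrons, n_batch) are illustrative.
#include <cuda_runtime.h>

struct neutron_t {
    float x, y, z;        // position
    float u, v, w;        // direction cosines
    float energy;         // energy in MeV
    int   cell;           // current cell index
    unsigned long seed;   // per-neutron RNG seed
};

neutron_t *copy_batch_to_gpu(const neutron_t *h_neutrons, int n_batch)
{
    neutron_t *d_neutrons = NULL;
    size_t bytes = (size_t)n_batch * sizeof(neutron_t);

    // Allocate space in GPU global memory, then copy host data into it.
    cudaMalloc((void **)&d_neutrons, bytes);
    cudaMemcpy(d_neutrons, h_neutrons, bytes, cudaMemcpyHostToDevice);
    return d_neutrons;
}
```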
Once the data is on the GPU, the GPU can begin to perform calculations using it. But
first the code must be ported to use CUDA. This includes creating a kernel (the main execution
pathway) in which, instead of looping over all neutrons in a batch (or event set, for the
event-based algorithm), one thread is assigned to each neutron to be simulated. This is the only
major distinction necessary to keep in mind when creating the kernel. Otherwise, the same
functions can be used by the GPU code as by the CPU code, as long as the function definitions
are prepended with "__device__ __host__", a pair of keywords that tell the CUDA compiler to
compile versions of the function to be run on the GPU (device) and on the CPU (host). Note that
there can also be GPU-only functions (those prepended with "__device__" alone).
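A minimal sketch of this one-thread-per-neutron structure is shown below; the kernel and function names are assumptions, not taken from the thesis source.

```cuda
// Hypothetical sketch of the one-thread-per-neutron kernel structure the text
// describes; names (track_neutron, neutron_t, history_kernel) are illustrative.
#include <cuda_runtime.h>

struct neutron_t { float x, y, z, energy; };

// Compiled for both the GPU (device) and the CPU (host).
__device__ __host__ void track_neutron(neutron_t *n)
{
    // ... transport, collision physics, and tallies for one neutron ...
}

// Kernel: each thread picks one neutron instead of looping over the batch.
__global__ void history_kernel(neutron_t *neutrons, int n_batch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_batch)
        track_neutron(&neutrons[i]);
}
```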
For the initial code porting, all of the double-precision data types (―double‖ in
C++/CUDA) were switched to single-precision floating point numbers (―float‖ in C++/CUDA) in
the CPU and GPU versions of the codes.
For LADONg, the GPU was used to parallelize the calculation of each neutron in a batch.
Each thread worked on one neutron from birth until absorption or leakage. For CERBERUSg,
the neutrons in each event were parallelized; each thread worked on one neutron of the event.
Each event was a kernel, and the re-shuffling was performed on the CPU. In both codes, sets of
threads were collected into blocks to allow as many threads to be processed at once as possible
and to increase the latency hiding ability of the GPU.
For all GPU codes, all problem data (neutron information, geometry data, cross-sections,
and materials data) was stored in global memory. The compiler placed most runtime variables
that were not arrays into registers, per its default action.
Once the code is ported, it only has to be compiled with the CUDA compiler, nvcc,
which also calls the C++ compiler to build the CPU portions of the code. If desired, the resulting
program can be built as a single executable file.
The above is all the work necessary to perform the porting. More CUDA experience
would certainly have reduced this effort to a matter of hours. That being said, a programmer can
spend endless amounts of time optimizing the CUDA code once it is implemented to a working
level. The next chapter will discuss such optimization efforts.
3.4 Models Analyzed
Models with differing numbers of objects and scattering collisions per neutron will
respond differently to being run on the GPU. As the number of objects increases, more and more
objects need to be checked when determining which cell the neutron is in and the distance to the
nearest cell boundary. Also, if there is one large object in one area of the model and the rest are
very small, the neutrons in the large object have a higher chance of spending time in that one cell,
reducing the number of passes through the geometry-tracking WHILE loop in the code.
Therefore, a range of models must be utilized for a proper comparison between the CPU and
GPU codes.
For the purposes of this thesis, five different models were analyzed. Each will be
discussed in the following paragraphs.
3.4.1 Model “Test”
This model, called "Test", earned its name because it was the primary model used during
development of the code. It features one sphere, centered at the origin, with a radius of
22.36 cm (the radius squared is 500 cm²). The sphere is composed of ²³⁵U
(0.04819545 atoms/barn-cm) and H2O (0.05 molecules/barn-cm).
3.4.2 Model “Two”
This model contains two concentric spheres, both centered at the origin. The inner sphere
has a 5 cm radius, while the outer has a 22.36 cm radius. The inner sphere is composed of ²³⁵U
(0.04819545 atoms/barn-cm) and H2O (0.05 molecules/barn-cm), while the outer sphere is
composed of elemental zirconium (0.0425126 atoms/barn-cm).
3.4.3 Model “Three”
This model contains three spheres, two of which are inside a larger one, but not inside of
each other. The largest sphere has a radius of 9.05 cm (the radius squared is 82 cm²) and is
centered at the origin. The other two spheres both have a radius of 4 cm and are centered at
x-values of +5 cm and −5 cm (both are aligned at y = 0 cm and z = 0 cm). Both inner spheres are
made of the same uranium/water mixture as in the previous two models, while the outer sphere is
composed of water alone (0.033 molecules/barn-cm).
3.4.4 Model “Four”
The fourth model contains three spheres and one cuboid. Again there are two inner
spheres of the same geometric and material description as in the "Three" model. The cuboid is
centered at the origin and extends from −10 cm to +10 cm in the x, y, and z directions. Finally,
all three inner shapes are encapsulated by one large sphere, centered at the origin, with a radius of
18.71 cm (the radius squared is 350 cm²). The cuboid and outer sphere are both made of water
(0.033 molecules/barn-cm).
3.4.5 Model “Array2”
This model was created to tax the system. It features a large amount of water (hence,
many scattering reactions) and fifty-three objects. Fifty-two are spheres, with the outer shape
being a cuboid. Of the fifty-two spheres, forty-eight are used to create twenty-four objects
composed of a UO2 sphere surrounded by a shell of elemental zirconium. The inner UO2 spheres
have a radius of 40 cm, while the outer zirconium shells have a radius of 50 cm. The UO2 is
enriched to 3 w/o ²³⁵U, and the ¹⁶O has a number density of 0.07294 atoms/barn-cm. The outer
zirconium shell is the same material as used in model "Two." These spheres are placed in a
4x3x2 matrix (x, y, z) centered at the origin. The remaining four spheres, with radii of 25 cm, are
placed in the top four corners of the cuboid. The outer cuboid is centered at the origin, extending
from −375 cm to +375 cm in the x-direction and −250 cm to +250 cm in the y- and z-directions.
This object is composed of the same water as used in the previous models. Appendix C lists the
input files for this model.
3.5 Test System Hardware Specification
A large speedup could have easily been achieved by pairing the fastest CUDA-capable
GPU with the slowest CPU that could possibly support it. However, the results presented in the
next chapter were obtained by using relatively high-end, recent hardware. The full specs of the
system are shown in Table 2. The Core i7 CPU from Intel represents the latest in CPU
technology and outperforms a Core 2 Duo E6600 (2.40 GHz, 4 MB L2 cache) by 25% in tests of
LADONc. The GPU, the BFG GeForce GTX 275 OC, is a mid-range GPU that uses the latest
NVIDIA GPU architecture, the GT200. The specifications for this GPU can be seen in Table 3,
and an image of the card itself is shown in Figure 7. This GPU is of CUDA compute capability
1.3, which means that it supports double-precision calculations and has 16,384 registers available
per SM (among other features). Additionally, it is important to note that the GPU in use is
slightly factory overclocked. These core and memory clock speeds were not changed when
obtaining results. The GTX 275 card can support the processing of up to 30,720 concurrent threads.
Table 2: Test System Hardware Specification
CPU Intel Core i7-920 (quad-core, 8MB L2 cache, 2.66 GHz)
Motherboard MSI Platinum SLI X58
Chipset Intel X58
Hard Disk Seagate 1TB SATA Hard Drive (32MB Buffer)
Memory OCZ Technology 6GB (Three 2GB) DDR3-1600 RAM
Video Card BFG GeForce GTX 275 OC 896MB GDDR3 RAM
Video Drivers NVIDIA Driver Version 190.18
Operating System Ubuntu 9.04 x86_64
Table 3: GPU Specifications, Courtesy BFG
BFG NVIDIA GeForce GTX 275 OC 896MB PCIe 2.0
GPU NVIDIA® GeForce® GTX 275
Core Clock 648MHz (vs. 633MHz standard)
Shader Clock 1440MHz (vs. 1404MHz standard)
Streaming Multiprocessors 30
Processor Cores 240
Video Memory 896MB
Memory Type GDDR3
Memory Data Rate 2304MHz (vs. 2268MHz standard)
Memory Interface 448-bit
Memory Bandwidth 129GB/sec
CUDA Capability 1.3 (Double-Precision Support)
Peak Giga-flops per Second 1010.88 (Single-Precision)
Figure 7: BFG/NVIDIA GTX 275 OC, Courtesy BFG
3.6 CERBERUSg Initial Performance
CERBERUSg was the first conversion to be completed. Initial studies showed only a
13% speedup when compared to CERBERUSc (which is not the optimal algorithm for a serial
processor such as a CPU); CERBERUSg actually required more time to run than LADONc. The
reason for the slow runtime, despite the large number of threads available on the GPU, is that the
algorithm required many portions of the code to operate in serial instead of parallel (reducing the
maximum possible benefit), and the data transfer from GPU RAM to system RAM was required
for every re-shuffling of the event lists (adding time). CERBERUSc and CERBERUSg were thus
abandoned as soon as the first LADONg port was complete, due to the obviously large amount of
programming and optimization that would be necessary to generate a fast code.
3.7 LADONg Initial Performance
After conversion from the CPU to GPU, LADONg ran three times as fast as LADONc on
the small problems that could be run at that time (the ―test‖ model at 24,576 neutrons per batch).
This was clearly the most likely path for success. All subsequent efforts were directed towards
optimizing LADONg to achieve the fastest code possible. It should be noted that any algorithmic
changes made were also considered for application to LADONc, where they were expected to
decrease the CPU runtime as well. The major optimizations performed are
discussed in the following chapter.
Chapter 4
Optimizations
Optimization of the GPU code was an iterative process. The code was examined for any
possible change that could be made, and the results from before and after the change were
compared. If the change resulted in a performance increase then it was kept. If not, a lesson was
learned, and the change was removed. During this process the experiences of others are a very
useful resource; expert resources utilized during this work included the NVIDIA CUDA C
Programming Best Practices Guide (NVIDIA Corporation 2009), and the NVIDIA CUDA GPU
Computing user forums (a community of CUDA programmers available to answer questions and
provide suggestions) (NVIDIA Corporation 2009). This chapter will discuss the major LADONg
optimization efforts.
4.1 Optimization Efforts
4.1.1 Reducing Memory Latency
As discussed previously, the global memory access time is approximately 200 cycles.
The impact of this can be mitigated by running enough blocks of threads to allow the GPU to
perform calculations on other blocks while waiting for the data to be returned from global
memory. This is clearly not optimum while memory with less latency is available; there are
16kB of shared memory (memory shared by each SP, with a latency of approximately 2 cycles);
64kB of cached constant memory (which can be used by all threads and blocks, is as fast as
reading from registers if the data requested is in the 8kB cache per MP, otherwise it can cost
approximately 200 cycles); and large amounts of cached texture memory (which can be used by
35
all threads and blocks, and is as fast as reading from registers of the data is cached, otherwise it
costs approximately 200 cycles to access). The following paragraphs discuss optimizations
performed to utilize this available faster memory. It is important to note that most of these
speedups presented were performed comparing the ―test‖ model, run with 24,576 neutrons (a
multiple of the number of threads per block available on the GPU) run on the GPU compared
with the same model being analyzed on the CPU.
The first step examined was in placing the geometry information in constant memory.
Constant memory was used because the geometric data does not change during the simulation of
neutrons and the information is needed repeatedly during the collision detection portion of the
code. To do this, a "__constant__" variable for the shape data was declared at global scope, and
data was copied to this symbol in the routine that loads data to the GPU, as shown in Figure 8.
Figure 8: CUDA Code Transferring Data to Constant Memory
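The original figure is an image; a representative sketch of the constant-memory declaration and copy it describes might look like the following (the names shape_t, c_shapes, MAX_CELLS, and h_shapes are assumptions, not from the thesis source).

```cuda
// Hypothetical sketch of the constant-memory transfer described in the text;
// names (shape_t, c_shapes, MAX_CELLS, h_shapes) are illustrative only.
#include <cuda_runtime.h>

#define MAX_CELLS 100          // arbitrary cell limit from Table 1

struct shape_t { int type; float params[4]; };

// Constant memory must be declared at global (file) scope.
__constant__ shape_t c_shapes[MAX_CELLS];

void load_geometry_to_gpu(const shape_t *h_shapes, int n_shapes)
{
    // Copy the host-side shape list into the constant-memory symbol.
    cudaMemcpyToSymbol(c_shapes, h_shapes,
                       (size_t)n_shapes * sizeof(shape_t));
}
```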
This change immediately increased the speedup from three to five. This is
because the required geometric data easily fits inside the constant cache, so the number of
200-cycle-latency memory transfers required from global memory was significantly
reduced.
Next, the material information was moved from global memory to constant memory
using similar code to that used for the shape list. This array was rather large, as it includes a
number density and location for where to find the cross-section data for every nuclide in every
possible cell (for the arbitrary limits chosen, this equates to 100 cells × 20 nuclides per cell =
2,000 nuclide entries). This could have been programmed better, to not reserve memory for
unused nuclide slots in each object, but the increased complexity was deemed unnecessary, since
adequate constant memory space is available. Unfortunately, this work resulted in a speedup of
only a few percent: likely due to the fact that the portion of the code which requires material
information was still limited by the expensive cross-section lookup routine, which takes a large
portion of time in any MC transport code. Efforts to increase the efficiency of the cross-section
lookup are discussed later.
Since the Legendre polynomial coefficients for each nuclide used in the inelastic
scattering function were of small size, they too were included in constant memory as opposed to
global memory. The speedup increase from this effort was only a few percent, which should be
expected since this data is only required when a neutron undergoes an inelastic scattering event.
Finally, every thread requires a data structure to describe the neutron that it is working on
(including data such as: position, cell, random number seed, energy, etc.). This data is accessed
frequently by the kernel, and should not be placed in the high latency global memory. However,
it cannot be placed in constant memory because its values change during runtime. The more
logical location for this memory is in the shared memory, not because it must be shared by
threads, but because it must be modifiable while still being low latency. This was done by
allocating an array of neutron structures, one element of the array per thread, and copying the
starting neutron data from global memory to the newly allocated shared memory. After the
threads have completed the neutron simulation the results are copied back to the global memory
so the results can be transferred back to the system RAM. The code snippet in Figure 9 shows the code required to do this. Unfortunately, this too resulted in a speedup of only about four percent, again because the memory transfers in the cross-section lookup routines remained the dominant limitation.
Figure 9: CUDA Code Transferring Global Memory Data to Shared Memory
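The figure itself is not reproduced in this transcript. A minimal host-side sketch of the pattern it describes follows, with hypothetical structure and variable names; in the real kernel the block-local buffer is declared __shared__ and each thread indexes it with threadIdx.x, while a plain loop stands in for the thread grid here so the pattern can be run without a GPU:

```cpp
#include <cassert>

// A guess at the per-neutron data the text lists; the actual LADON structure
// is not shown in this transcript.
struct Neutron {
    float x, y, z;       // position
    float energy;
    unsigned long seed;  // random number generator seed
    short cell;          // current cell identification
};

const int THREADS_PER_BLOCK = 64;

void transportBlock(Neutron* batch, int blockStart) {
    Neutron local[THREADS_PER_BLOCK];  // would be a __shared__ array on the GPU
    for (int tid = 0; tid < THREADS_PER_BLOCK; ++tid) {
        local[tid] = batch[blockStart + tid];  // global -> shared copy
        local[tid].energy *= 0.5f;             // stand-in for the transport physics
        batch[blockStart + tid] = local[tid];  // copy the result back to global
    }
}
```

On the device the copies are per-thread assignments rather than a loop, and the completed batch array is then transferred back to system RAM as described above.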
4.1.2 Register Pressure
As the number of variables declared in the kernel increases, the number of registers required also increases. Therefore, extraneous variables were removed throughout the optimization process, either by revising the algorithm or by reusing variables that had already been declared. Reducing register usage can allow more blocks to fit on an SM (since the number of registers per SM is limited), and it can avoid the spillover of data from the registers to local memory (which has a long 200-cycle latency). An example of where this was performed is in the elastic scattering function, an excerpt of which is shown in Figure 10.
Figure 10: Example of Variable Reuse to Reduce Registers
This type of effort can be worthwhile at times (though the compiler should catch the most obvious opportunities for register reuse), but it tends to make the code less readable, and thus harder to maintain.
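Since Figure 10 is not reproduced in this transcript, the pattern can be illustrated with a hedged sketch: the standard elastic scattering energy relation E' = E(1 + alpha + (1 - alpha)mu)/2 written once with a temporary per intermediate value and once with a single reused scratch variable. Both functions are illustrative stand-ins for the actual excerpt:

```cpp
#include <cassert>

// Post-collision energy from elastic scattering
// (A: target-to-neutron mass ratio, mu: center-of-mass scattering cosine).

float elasticEnergyVerbose(float E, float A, float mu) {
    // one variable per intermediate quantity: easy to read, more registers live
    float alpha = ((A - 1.0f) * (A - 1.0f)) / ((A + 1.0f) * (A + 1.0f));
    float fraction = (1.0f + alpha + (1.0f - alpha) * mu) * 0.5f;
    return E * fraction;
}

float elasticEnergyReused(float E, float A, float mu) {
    // 'tmp' holds alpha, then is overwritten with the energy fraction:
    // one fewer live register, at the cost of readability
    float tmp = ((A - 1.0f) * (A - 1.0f)) / ((A + 1.0f) * (A + 1.0f));
    tmp = (1.0f + tmp + (1.0f - tmp) * mu) * 0.5f;
    return E * tmp;
}
```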
Another option is the CUDA compiler command-line option "--maxrregcount=N", where N is the maximum number of registers to be used by the kernel. This option forces the compiler to use at most N registers and to place the rest of the required storage locations into local memory. As noted earlier, local memory is slow. The benefit of using --maxrregcount is that if the speedup achieved by allowing more blocks to run on each SM (because the register count is lower) outweighs the slowdown due to the increased local memory usage, then an overall
gain will be achieved. For example, this was used in the double-precision version of LADONg.
Compiling this code without the --maxrregcount option results in 49 registers and 192 bytes of local memory per thread. This limits the kernel to four blocks per SM due to the number of registers available per SM. If --maxrregcount=40 is used, however, the local memory required increases to 248 bytes, but the number of blocks per SM increases to six. This particular change resulted in a runtime decrease of up to 5%.
4.1.3 Cross-Section Lookups
The cross-section lookup routine takes approximately 33% of the computation time in the
LADONc code. This is expected to also be true for LADONg, but a per-function profiler is not
yet available for CUDA so this assumption could not be validated. Great gains for little effort should therefore be expected from improving the performance of this part of the algorithm. While the algorithm itself looked sound, the best chance for optimization was, again, in optimizing the memory path. As discussed previously, the cross-section tables occupy megabytes of data. This
means that the only way to get faster memory access for the entire cross-section table is by
utilizing the texture memory.
At first texture memory was utilized to supply energy values necessary for performing
the binary search to determine which cross-section data array index contains the data to use for
cross-section interpolation. This was found to be detrimental to performance because little to no locality existed for the majority of the energy values being searched, and the very large number of texture memory fetches required by a binary search compounded the problem.
Instead of searching on the energy data in texture memory, the energy data from global
memory was used for the binary search and all other requests for this data (those that did not
perform many requests for the data every few cycles like in the binary search) utilized the texture
memory. This provided some of the benefit of the texture memory's cache. In addition, the cross-sections were stored in texture memory inside a four-component single-precision floating point structure. The four-component structure was used because it allowed the texture memory reads to gather all four cross-section values (fission, non-fission capture, elastic, and inelastic) in one memory read operation instead of four. Together these changes decreased the runtime by approximately 9%.
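The two-part strategy can be sketched as follows, with the search running over the plain energy grid (global memory on the GPU) and each grid point's four cross sections packed into one float4-like record so that interpolation needs a single fetch per bounding index (a texture fetch on the GPU). The names and data layout are illustrative:

```cpp
#include <cassert>
#include <cstddef>

struct XS4 { float fission, capture, elastic, inelastic; };  // float4 analogue

// Binary search: returns i such that grid[i] <= E < grid[i+1]
// (grid ascending, E assumed in range).
size_t lowerIndex(const float* grid, size_t n, float E) {
    size_t lo = 0, hi = n - 1;
    while (hi - lo > 1) {
        size_t mid = (lo + hi) / 2;
        if (grid[mid] <= E) lo = mid; else hi = mid;
    }
    return lo;
}

// Linear interpolation of all four cross sections at once: two packed reads
// (texture fetches on the GPU) instead of eight scalar ones.
XS4 interpolate(const float* grid, const XS4* xs, size_t n, float E) {
    size_t i = lowerIndex(grid, n, E);
    float f = (E - grid[i]) / (grid[i + 1] - grid[i]);
    XS4 a = xs[i], b = xs[i + 1];
    XS4 r;
    r.fission   = a.fission   + f * (b.fission   - a.fission);
    r.capture   = a.capture   + f * (b.capture   - a.capture);
    r.elastic   = a.elastic   + f * (b.elastic   - a.elastic);
    r.inelastic = a.inelastic + f * (b.inelastic - a.inelastic);
    return r;
}
```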
4.1.4 Reducing Shared Memory Usage to Increase Blocks per SM
Towards the end of the optimization process it became evident that the use of shared
memory for the neutron data structure was limiting the number of blocks that can be processed
concurrently on each SM. The shared memory was used solely by the neutron data structure.
This included data for the neutron position (three float variables), neutron direction (three float
variables), neutron energy (one float variable), material total macroscopic cross-section (one float variable), the random variable for collision distance detection ("eta", one float variable), random number generator seed (one unsigned long integer variable), current cell identification (one short integer), and a collided nuclide identifier (one short integer). These were present because in the
CPU code it made for a convenient organization of data specific to each neutron (except the
random number generator seed, which is specific to the GPU because of the parallel structure).
This is a large structure to store in shared memory for each thread in each block on an SM. Since the amount of shared memory per SM is limited, the large shared memory usage by each block limited the number of blocks that could be executed on each SM. For example, running 64 threads per block limited the maximum number of blocks on each SM to five. Since the direction, material total macroscopic cross-section, and random variable for collision distance detection did not need to be stored after the current neutron was transported by the kernel, these were removed from the neutron structure and placed directly into the available registers.
This allowed eight blocks to be run concurrently on each SM, speeding up the calculations by
approximately 25%.
4.1.5 Thread Divergence
In several functions, thread divergence could be avoided through some tweaks to the
algorithms. For example, the most natural change was to allow leaked neutrons to continue as if they had not yet leaked. To do this, the reaction type flag was set to 'l' (lower-case L) to denote leakage, and the neutron's current cell was set to 0 (a cell guaranteed to be defined because of the required ordering of cells in the geometry). The calculations then proceeded just as if the neutron had not leaked, except that a statement was added where the reaction type is determined: "If the neutron's flag is currently set to 'leak', do not modify it."
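A minimal sketch of this trick, with hypothetical names for the flag field and the reaction-sampling step:

```cpp
#include <cassert>

// A leaked neutron keeps executing the same instruction stream as its
// neighbors instead of exiting early (which would diverge the warp).
struct Neutron { char reaction; int cell; };

void markLeaked(Neutron& n) {
    n.reaction = 'l';  // lower-case L denotes leakage
    n.cell = 0;        // cell 0 is guaranteed to exist, so tracking stays valid
}

// Reaction-type determination refuses to overwrite the leak flag.
void setReaction(Neutron& n, char sampled) {
    if (n.reaction == 'l') return;  // "if flagged 'leak', do not modify it"
    n.reaction = sampled;
}
```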
4.1.6 Increasing the Speed of Data Transfer Between System RAM and GPU RAM
As shown in the CUDA Software Development Kit sample program "bandwidthTest", using page-locked memory for arrays in system RAM that will be transferred to and from the GPU RAM can increase data throughput. This is mainly because the data is page-locked in system RAM (i.e., it will always be resident in memory and never paged out to disk if the operating system decides it needs more RAM), and because the GPU can access the data over the PCIe bus without having to communicate through the CPU, removing a potential bottleneck from the process. This memory management strategy was implemented for the neutron batch list so that the data transfers to and from the GPU before and after each batch would be faster. This is achieved in CUDA by using the API function "cudaMallocHost()" to allocate memory in system RAM instead of the standard C library's "malloc()" routine, and "cudaFreeHost()" instead of "free()".
Since the system this was run on has 6 GB of RAM (i.e., there is plenty of space in RAM, so no data for any process needs to be paged out to disk), and the Core i7 CPU has very fast memory transfer speeds, not much speedup was expected for this application. This intuition proved correct, as only a 1.5% speedup was seen when this feature was implemented.
The feature was kept through the optimization process because it was expected to become more
important as the batch size increased.
4.1.7 Additional Optimizations Performed Which Reduce Accuracy
The CUDA architecture includes intrinsic floating point and integer math functions
(multiplication, division, sine, cosine, natural log, exponentials, etc.) which utilize specialized
hardware routines to perform the calculations. These intrinsic functions perform the computations faster than the standard software equivalents but with a reduction in accuracy. For example, the CUDA single-precision exponential function, "expf(x)", provides solutions with an accuracy of 1 ULP (Unit in the Last Place, a unit of accuracy for floating point numbers). The intrinsic version, "__expf(x)", has a maximum ULP error of 2 + floor(abs(1.16x)), at least twice as large as that of the software-based equivalent (NVIDIA Corporation 2009). These intrinsic
functions can be utilized for further runtime improvements in calculations where high accuracy is not required. A study was performed in which all math functions were replaced with their intrinsic counterparts (where an equivalent existed); the results are discussed in the next chapter.
Chapter 5
Results
This chapter presents a comparison of runtimes and kcalc values to show the speedups and levels of accuracy obtained. All "speedups" discussed are defined as the CPU compute time divided by the GPU compute time (i.e., LADONc time divided by LADONg time). These times are not the entire runtime from the beginning of the program to completion, but instead are measured from just before the first neutron batch begins running until the very last neutron batch is finished. This is done because loading the problem from files requires the same amount of time in both versions of the code, and the copying of the problem geometry, materials, and cross-section information from system RAM to GPU RAM is negligible when required only once (given transfer speeds of up to 5 GB/s). This timing scheme does, however, capture the
transfer of the neutron batch data from the system RAM to the GPU RAM, and back, for every
batch in LADONg.
For the following analyses, the CPU and GPU codes were run for the five previously
identified models at neutron batch sizes of 100, 1,000, 10,000, 24,576, 49,152, 98,304, 196,608,
and 1,000,000 histories. Fifty batches were run with the first fifteen of these discarded. The
following sections describe results for each of the analyses performed.
The single- and double-precision versions of LADONc were compiled using GNU's g++ with the following syntax: "g++ <filename.cpp> -o <executable name>". The default compiler options were used. The LADONg codes discussed were compiled using NVIDIA's nvcc compiler with the following syntax: "nvcc <filename.cu> -o <executable name> -arch sm_13 <additional options for each problem>". The "-arch sm_13" option defines which CUDA compute capability version the GPU used is compatible with. For the GTX 275, this is compute capability 1.3. The additional options are discussed in each section where applicable.
5.1 Single-Precision Results
5.1.1 Use of the Accurate Math, Single-Precision Functions
Figure 11 shows the speedups obtained with the LADONg code compiled without the faster but less accurate intrinsic math functions. The maximum speedup is 20.56x (model "four" at 1,000,000 neutrons per batch). The next best performing model was the array2 model, at 19.37x (from the 49,152 neutrons per batch run). The speedup levels off as the number of neutrons per batch increases because the GPU only needs a certain number of blocks in flight to hide the memory latency adequately; any additional blocks beyond that have a linear impact on runtime.
Figure 11: Single-Precision Speedup (speedup vs. neutrons per batch for the test, two, three, four, and array2 models)
5.1.2 Use of the Intrinsic Math, Single-Precision Functions
The CUDA compiler (nvcc) command line argument "-use_fast_math" forces the compiler to use the intrinsic single-precision functions, which run significantly faster than the standard functions at the cost of reduced accuracy. This option affects functions such as logf(), expf(), sinf(), cosf(), et cetera (the complete list can be found in the CUDA Programming Guide (NVIDIA Corporation 2009)). Figure 12 shows the speedup if LADONg is compiled with this option.
Figure 12: Single-Precision Speedup Using Fast Math
The maximum speedup is now seen in array2, at 23.91x (49,152 neutrons per batch). This is a 23% increase in speedup compared to the same run's performance without the fast math option. Four, the performance leader in the previous case, saw a speed increase of 8%. The larger runtime reduction for array2 is expected because array2
contains significantly more geometric objects in the model than any of the other cases (fifty-three versus at most four). The increased number of objects means more calculations are required in the neutron tracking portion of the code, where most of the functions converted to intrinsic functions are located, as seen in the source code.
5.1.3 Examination of the Observed Peak
In all of the above cases, there seems to be an ideal configuration for the array2 model.
Since the speedup is a function of LADONc and LADONg runtimes, the large speedup increase
could be due to an increase in CPU runtime (without a corresponding increase in GPU runtime),
or a decrease in GPU runtime (without a corresponding decrease in CPU runtime). To analyze
this trend, LADONc and LADONg were run for all models with finer resolution in the number of neutrons per batch around this 'peak' point. Figure 13 and Figure 14 show the
resultant runtimes for the CPU and GPU code, respectively. The CPU code shows variability in
the runtimes, especially for the array2 model, but the trend still seems to follow a linear
trajectory.
The GPU code displays more linear results except for abrupt jumps in runtime near 17,000 and 33,000 neutrons per batch. Interestingly, these jumps occur when multiples of the GPU's concurrent capacity are exceeded. With one neutron per thread, sixty-four threads per block, eight blocks running per SM, and 30 SMs per GPU, the GPU is at maximum capacity when 15,360 neutrons per batch are being simulated. Beyond that point the GPU places the excess blocks in a queue, where they wait until a slot opens up on an SM. Any multiple of this number causes a similar response, although smaller in magnitude.
Computing the speedups of these cases reveals the same sort of 'optimal' behavior, as shown in Figure 15. This explains why the peak is obtained for the 24,576 neutrons per batch case in these analyses.
Figure 13: LADONc Run Time in the Peak Region
Figure 14: LADONg Run Time in the Peak Region
Figure 15: Speedup in the Peak Region
5.2 Double-Precision Results
5.2.1 Use of the Accurate Math, Double-Precision Functions
To quantify the impact of performing double-precision floating point computations on the GPU, double-precision versions of the CPU and GPU codes were created. These versions did not force every floating point calculation to be performed in double-precision; instead, only those related to the geometric tracking were performed with double-precision math. This was because the cross-section data available in the Evaluated Nuclear Data File (ENDF) and other repositories is not of double-precision accuracy, and so should not require double-precision calculations. The double-precision version of LADONg was compiled with the option "--maxrregcount=40" to reduce the number of registers required from 49 to 40 and thereby increase the number of concurrent blocks per SM.
These results are shown in Figure 16. The runtime of the CPU code was only negligibly
impacted by switching to double-precision geometry computations. However, the GPU runtime
was increased by a factor of roughly one-and-a-half to two. The largest speedup obtained was 11.21x, from the runs of the "four" model.
As discussed previously, there is one double-precision math unit per SM, while each SM has eight single-precision math units. This would suggest that only 1/8th of the performance could be obtained using double-precision math. However, if the code is limited by memory latency and transfer time rather than calculation speed, then the memory performance of double-precision math is the limiting factor. Double-precision data requires twice as many bytes, so half of the memory transfer performance is to be expected. This is the case for LADONg and explains why its performance approaches 1/2 that of the single-precision version.
Figure 16: Double-Precision Speedup
5.2.2 Use of the Intrinsic Math, Double-Precision Functions
Finally, the fast-math compilation option was utilized with the double-precision version
of the GPU code to see the impact on runtime. As can be seen in Figure 17, the largest speedup achieved is 12.66x (from the "four" model at 1,000,000 neutrons per batch). This is approximately 17% faster than the same double-precision model without fast math. The same model (and neutrons per batch) runs in 56% of the time using single-precision fast math, which agrees with the previous conclusions about the expected performance penalty when switching from single-precision to double-precision floating point math.
Figure 17: Double-Precision Speedup Using Fast Math
5.3 Accuracy Comparison
Given the previous discussion of the non-IEEE-compliant floating-point arithmetic, a useful comparison is to observe the differences in the kcalc values determined by the versions of the codes presented above. Table 4 shows the difference in kcalc
between the CPU and GPU codes for some of the cases discussed above. A positive value
indicates that the value of kcalc for the GPU was larger than its CPU counterpart. The cases up to
and including 1,000,000 neutrons per batch were run with 50 batches (first 15 batches discarded).
The 4,000,000 case was run at 100 batches (25 discarded). No run was obtained for the array2
model at 4,000,000 neutrons per batch due to the large CPU runtime.
These results show three trends: 1) increasing the number of histories removes the
differences due to differing random number streams between the CPU and GPU codes; 2) a bias
still exists between the CPU and GPU codes even after 75 batches of 4,000,000 neutrons (likely
due to a fault in the parallel random number generator algorithm used); and 3) the bias that exists
is not a strong function of using the fast-math functions.
Table 4: Accuracy Comparison (differences in kcalc, GPU minus CPU)

Neutrons per Batch | Accuracy Case    | test    | two     | three   | four    | array2
10,000             | Single-Precision | -0.0104 |  0.0001 | -0.0011 | -0.0014 |  0.0003
10,000             | Single Fast-Math | -0.0104 | -0.0021 | -0.0011 | -0.0034 |  0.0005
10,000             | Double-Precision | -0.0053 | -0.0002 | -0.0037 | -0.0015 |  0.0017
24,576             | Single-Precision | -0.0030 | -0.0060 |  0.0016 | -0.0039 |  0.0034
24,576             | Single Fast-Math | -0.0030 | -0.0081 |  0.0033 | -0.0032 |  0.0010
24,576             | Double-Precision | -0.0030 | -0.0002 |  0.0010 | -0.0010 |  0.0046
49,152             | Single-Precision |  0.0033 |  0.0009 |  0.0009 | -0.0014 |  0.0006
49,152             | Single Fast-Math |  0.0020 | -0.0013 |  0.0009 | -0.0007 |  0.0018
49,152             | Double-Precision |  0.0008 |  0.0008 | -0.0005 |  0.0008 |  0.0002
98,304             | Single-Precision |  0.0010 |  0.0012 |  0.0001 |  0.0000 |  0.0001
98,304             | Single Fast-Math |  0.0018 | -0.0008 |  0.0001 | -0.0004 | -0.0003
98,304             | Double-Precision | -0.0019 | -0.0008 |  0.0010 | -0.0005 | -0.0005
196,608            | Single-Precision | -0.0017 |  0.0001 |  0.0021 |  0.0010 |  0.0006
196,608            | Single Fast-Math | -0.0016 | -0.0010 |  0.0005 |  0.0002 |  0.0007
196,608            | Double-Precision |  0.0001 | -0.0010 |  0.0013 |  0.0002 | -0.0005
1,000,000          | Single-Precision | -0.0002 |  0.0001 |  0.0008 |  0.0008 | -0.0003
1,000,000          | Single Fast-Math | -0.0012 | -0.0007 |  0.0009 |  0.0005 | -0.0003
1,000,000          | Double-Precision | -0.0013 |  0.0000 |  0.0060 |  0.0007 | -0.0005
4,000,000          | Single-Precision | -0.0002 |  0.0001 |  0.0009 |  0.0008 |  N/A
4,000,000          | Single Fast-Math | -0.0001 |  0.0002 |  0.0009 |  0.0008 |  N/A
4,000,000          | Double-Precision |  0.0001 |  0.0001 |  0.0011 |  0.0009 |  N/A
53
Chapter 6
Applicability to Production Codes
The results obtained in this thesis pertain directly only to the speedup of the LADONc code; the reader is more likely interested in how a GPU would accelerate their own MC applications. The following section discusses limitations identified from the experience gained porting the LADONc code to LADONg. After that is a discussion of ways in which a GPU-accelerated MC code could still prove rather useful.
6.1 Limitations of CUDA for Monte Carlo Neutron Transport
6.1.1 Maximum Available Memory
The NVIDIA Tesla S1070 (the top-of-the-line professional grade 1U rack mounted GPU
made exclusively for CUDA applications) provides four GPUs, each with exclusive access to
4GB of RAM, totaling 16GB per node. For a typical HPC node, it is not uncommon for a two
CPU system to have access to as much as 48 GB of RAM, all of it accessible by both CPUs.
Monte Carlo Transport codes solving reactor-sized problems will require the 48GB of memory
(maybe more) to store the geometry information, cross-section data, and fine-block region tally
data. Therefore, effective use of GPUs for full-core design calculations would require domain
decomposition techniques to reduce the memory requirements of each of the GPUs in a system.
This is an active area of research for MC applications (Thomas Brunner 2009).
It is important to keep in mind when considering domain decomposition techniques that, since data transferred from GPU to GPU and between CPU and GPU is limited by the PCIe bus (approximately 8 GB/s), communication is an even greater bottleneck for GPU-based solutions.
6.1.2 Accuracy of Computations
At this time, the floating-point arithmetic accuracy is not fully IEEE-754 compliant. This
may not end up being an issue, but that will not be known until a more fully-featured code is
written for CUDA and qualified against a suite of benchmark models. Additionally, NVIDIA has
complete control over the implementation of floating-point arithmetic on the GPUs. Any
qualification study performed may not be applicable to other generations of NVIDIA GPUs.
Besides the general accuracy questions above, another issue exists regarding double-precision floating point performance. For this application, double-precision calculations run at roughly twice the runtime of their single-precision counterparts. Double-precision math is desired for at least the geometric portions of the program to avoid neutrons being lost due to insufficient numerical accuracy during neutron tracking.
6.1.3 Error-Checking Memory
The current generation of NVIDIA GPUs does not support error-checking and correcting (ECC) RAM. This means that a fault in the data is not detected and fixed during runtime by the hardware (as it is in typical CPU cluster nodes), possibly resulting in incorrect and unpredictable results. However, as discussed in the paper by Maruyama, Nukada, and Matsuoka (Maruyama, 2009), software-based ECC is possible for GPU algorithms. Software ECC uses algorithms that detect and correct transient bit-flips while the main algorithm is running. This consumes resources, however; roughly a 60% performance loss was found with this method for bandwidth-limited applications such as LADONc. Computation-limited applications ran with a 7% overhead due to the software ECC.
6.1.4 Maintainability
A program with highly accessible source code and a small learning curve clearly provides long-term knowledge management benefits to an organization. While CUDA code is not difficult to understand on its own, any optimizations made to the code, including efforts to reduce the number of registers, can make it unreadable to those without intimate knowledge of its inner workings (i.e., the developers), reducing the code's maintainability (and possibly its longevity).
6.1.5 Hardware Architecture Changes
The CUDA architecture is relatively young, even in the world of computers. The first
CUDA-capable GPU was released in 2006. Conversely, the x86 technology used by most CPUs in use today was first introduced by Intel in 1978 (Intel Corporation n.d.); the x86 architecture has proven to be about as future-proof a target to develop for as a computing technology can be.
The fact that NVIDIA owns the CUDA architecture and can change it at any time limits the usefulness of a GPU-accelerated MC code for design work, simply because it may not be possible to guarantee that the code will last the life of the design (if an organization has such requirements).
It should be mentioned that work is progressing on an open, free standard for parallel
programming on many different platforms (including both CPUs and GPUs). This standard is
called OpenCL (short for Open Computing Language) (Khronos Group 2009). It is currently supported by many large organizations including Intel, NVIDIA, AMD, IBM, Texas Instruments, and even Los Alamos National Laboratory. This standard should allow a single code to run on any supported platform, automatically taking advantage of the hardware it is run on with minimal (if any) programmer effort. OpenCL 1.0 first shipped on August 28, 2009 in Apple's operating system, Mac OS X Snow Leopard.
6.1.6 Optimizations May Not Be As Successful For Larger Problems
As discussed in Chapter 4, the speedups achieved in this effort were obtained through
fitting the geometric and material data in the GPU's constant memory (a maximum of 64 KB). While the methods used for representing material data in this thesis were not ideal, the number of cells and materials had to be limited for these speedup levels to be achieved; the same speedups should not be expected without such limits. Alternatively, a 'pre-staging' algorithm could be incorporated into the code, dynamically loading the necessary geometric and material data into constant memory when needed instead of fitting it all in before the neutron simulations begin.
Examples in the CUDA programming guide (NVIDIA Corporation 2009) make use of this
method.
6.1.7 Lack of Large-Scale GPU Cluster
This work has shown that an MC code can be accelerated by GPUs (with the caveats described above). However, most organizations do not purchase a high-performance computer (HPC) strictly for use with one specific code. Since many codes will likely not be able to utilize GPUs (at least without a modest programming effort), there is expected to be little support for purchasing such a highly specific HPC until it is more generally applicable. This downside should lessen as general-purpose GPU usage increases and the scientific community adopts GPUs for calculations.
6.1.8 CUDA Development Tools
Currently, the only CUDA development tools are provided by NVIDIA. These include
the CUDA compiler (nvcc), a debugger (cuda-gdb), and a profiler (Visual Profiler).
The debugger is a ported version of the GNU Debugger (gdb), called cuda-gdb, which
performs similar functions to gdb (NVIDIA Corporation 2009). The product description for this
still-in-development-tool from NVIDIA states that: “[cuda-gdb] is designed to present the user
with an all-in-one debugging environment capable of debugging native host code as well as
CUDA code. Standard debugging features are inherently supported for host code, and additional
features have been provided to support debugging CUDA code.”
The Visual Profiler is useful for determining statistics about a kernel such as: amount of
time spent transferring data versus running a kernel; the number of local/global/shared memory
reads and writes (coalesced and uncoalesced); and the amount of thread divergence present in a
code. Unfortunately, the profiler does not yet provide statistics on individual functions within the
kernels – all results are presented in terms of kernels and data transfers.
These tools are very useful; however, during the development of both LADONc and LADONg it was evident that the state of CPU (C++) development tools is much more advanced than that of the equivalent GPU tools. Of course, CUDA has only been available for three years at this point, while C was first introduced in 1972. Until the development tools reach a similar level of usability, programmers should expect to spend more effort debugging and profiling than they are accustomed to.
6.2 Possible Applications of CUDA-Accelerated MC
Even if effort is not expended to address the above issues, a nearly direct application of the work done herein could serve as a design scoping tool or as a neutronics solver in a multi-physics framework. These tools require quick runtimes and can be more flexible about the level of accuracy required. As runtime is reduced, more design iterations can be performed in a given time period, producing a further optimized product for potentially less money.
The following address the previously noted deficiencies as they relate to scoping tools or multi-physics solvers:
- The size of each tally region required can be reduced. This makes it possible to fit a problem on a single GPU with access to only 4 GB of RAM without the use of domain decomposition.
- Single-precision floating-point math can be used (decreasing the runtime by a factor of at least two when compared to a double-precision code).
- For a scoping tool, error checking may not be necessary; if an unexpected deviation is encountered, the code can simply be re-run.
- For a scoping tool, the unknown future of CUDA is not as limiting, since the GPU-accelerated scoping tool would not be used for the final design; it does not have to remain maintainable for the life of the design. This does not, however, mitigate the risk of investing effort in a code only to have it become unusable when the GPU architecture changes.
The code complexity and maintainability issues of highly optimized code will remain in any case; a confusing source code is a confusing source code, regardless of how it is used.
Chapter 7
Summary and Conclusions
This work examined the feasibility of utilizing Graphics Processing Units (GPUs) to
accelerate Monte Carlo neutron transport problems. These GPUs use many parallel processors to
perform the complex calculations necessary to create three-dimensional images at fast enough
rates for the video game industry. The GPU (BFG/NVIDIA GTX 275 OC 896MB) used for this analysis has 240 of these parallel processors and is capable of approximately 1,088x10^9 floating point operations per second. In 2006 NVIDIA released a programming framework (called CUDA) that allows developers to easily create codes which can be executed on the many cores provided by GPUs for general purpose programs not necessarily related to graphics. Initial assessments had suggested that the MC algorithm may not be able to fully utilize the GPU because of its constraints, which include the fact that MC codes are highly dependent on branch statements (IF, ELSE, FOR, and WHILE) that can have a large impact on GPU performance.
This work went through the process of writing a Monte Carlo neutron transport code from scratch
for the x86 CPU platform and later porting it to the GPU CUDA platform, to understand the type
of performance that can be gained by utilizing GPUs.
7.1 Conclusions
After the CPU code was ported with little effort and no optimization, the GPU version was found
to run three times as fast. This result was quite promising, and further work went into
optimization to determine how fast the GPU code could be made to run.
Since the geometry and materials in use define the complexity of the problem, numerous
models were run in this analysis, ranging from simple models with one sphere up to a model
with 53 objects. The maximum speedups obtained can be seen in Table 5. These results show
that the more complex models achieved the largest speedups because more time
was spent in the geometry-tracking portions of the code, where the GPU performance was most
beneficial (rapid access to the important data). Since high-precision methods are desired for
production codes, single- and double-precision versions of the codes were also compared on the
same models. These results show an increase in GPU runtime of approximately a factor of two
when double-precision math is utilized.
Table 5: Summary of Speedups

Case                              Maximum Speedup Compared to CPU
Single-Precision                  20.56x
Single-Precision with Fast Math   23.91x
Double-Precision                  11.21x
Double-Precision with Fast Math   12.66x
Some disadvantages for production-level codes were discussed as well. These include: the
limited size and quality of memory on the GPUs (the maximum GPU RAM available is currently
4GB per card, and no error checking or correction is performed); floating-point operations that
are not IEEE compliant and are slightly less accurate than those that are; and a loss of
performance when double precision is desired. These limitations only act to slow down the
problem being run on the GPU; a good programmer will still be able to come away with a
sizeable speedup even after accounting for them.
7.2 Recommendations for Future Work
This work was not performed by an experienced programmer; any development team
experienced in both C or C++ and Monte Carlo neutron transport programming should be able
to achieve much better results than those discussed in this document. That said, these
results are enticing enough that developers are encouraged to examine porting their specific
MC codes to the CUDA environment. One candidate that comes to mind is the PSG2/Serpent
Monte Carlo transport code, written from scratch in 2004 in the C language (Leppänen
2009).
Bibliography
Brown, Forrest. "PHYSOR 2008 Conference Monte Carlo Workshop." Invited lecture.
Interlaken, Switzerland, 2008.
Brown, Forrest B., and William R. Martin. "Monte Carlo Methods for Radiation Transport
Analysis on Vector Computers." Progress in Nuclear Energy (Pergamon Press Ltd.) 14, no. 3
(July 1984): 269-299.
Brunner, Thomas, and Patrick Brantley. "An Efficient, Robust, Domain-Decomposition
Algorithm for Particle Monte Carlo." Journal of Computational Physics (Academic Press
Professional, Inc.) 228, no. 10 (June 2009): 3882-3890.
Intel Corporation. Corporate Archives Timeline.
http://www.intel.com/museum/archives/timeline/index.htm (accessed August 29, 2009).
KAERI, Korea Atomic Energy Research Institute. ENDFPLOT 2.0. 2007.
http://atom.kaeri.re.kr/cgi-bin/endfplot.pl (accessed March 2009).
Khronos Group. OpenCL Overview. 2009. http://www.khronos.org/opencl/ (accessed
August 29, 2009).
Leppänen, Jaakko. "Burnup Calculation Capability in the PSG2 / Serpent Monte Carlo
Reactor Physics Code." M&C 2009. Saratoga Springs, New York, 2009.
Luebke, David. "The Democratization of Parallel Computing." International Conference
for High Performance Computing, Networking, Storage and Analysis 2007 (SC07). Reno,
Nevada, November 2007.
Martin, William. "Joint International Topical Meeting on Mathematics and Computation
and Supercomputing in Nuclear Applications." Invited keynote. Monterey, California, April 2007.
Maruyama, N., A. Nukada, and S. Matsuoka. "Software-Based ECC for GPUs." 2009
Symposium on Application Accelerators in High Performance Computing (SAAHPC'09).
Urbana, Illinois, July 2009.
NVIDIA Corporation. CUDA Programming and Development. 2009.
http://forums.nvidia.com/index.php?showforum=71 (accessed September 8, 2009).
NVIDIA Corporation. NVIDIA CUDA C Programming Best Practices Guide. CUDA
Toolkit 2.3. Santa Clara, California, July 2009.
NVIDIA Corporation. NVIDIA CUDA Debugger - CUDA-GDB Debugger. Version 2.3
Beta. Santa Clara, California, July 2009.
NVIDIA Corporation. NVIDIA CUDA Programming Guide. CUDA Toolkit 2.3. Santa
Clara, California, July 2009.
Press, William, Saul Teukolsky, William Vetterling, and Brian Flannery. Numerical
Recipes in C++: The Art of Scientific Computing. 2nd ed. Cambridge: Cambridge University
Press, 2002.
Smith, Kord. "Nuclear Mathematical and Computational Sciences: A Century in Review,
A Century Anew." Invited keynote. Gatlinburg, Tennessee, May 2003.
Appendix A
LADONc Source Code
The following C++ code is the single-precision code for LADONc.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <iostream>
#include <float.h>
#include <fstream>
#include <string>
using namespace std;
const int OBJECT_MAX=100;
const int NOT_FISSION_FLAG=101;
const int NUCLIDE_MAX=20;
const float NU=0.53; //2+NU = nu, # neuts per fission
const float SMALLEST_float=0.001f; //used to move particle slightly
struct neutron {
    float x;
    float y;
    float z;
    float energy;
    float ox;
    float oy;
    float oz; //directions omegax, omegay, omegaz
    unsigned short cell;
};

struct shapes {
    char type; //s for sphere, p for parallelepiped, c for cylinder
    float x0, y0, z0, R2; //R2 because it is R^2
    float x1, y1, z1; //these are required to define a cube (and z1 for a cylinder)
    unsigned short inside_me[OBJECT_MAX];
};

struct materials {
    unsigned short num_mat;
    int zaid_loc;
    float density;
};

struct legendres {
    float a0, a1, a2, a3, a4, a5, a6, a7;
};

struct xs {
    //these are micros
    float * rad_cap;
    float * fission;
    float * elastic;
    float * inelastic;
    float * Emesh;
    legendres coeffs;
    float a, b, c, d;
    int mesh_length;
};

struct xsinfo {
    xs* cross_sec;
    int *list;
    int listlength;
};
void inputgeometry(string filename, unsigned short& shape_num, shapes
shape_list[]);
float rng(unsigned long int a, float bounds);
void inputXS(unsigned short shape_num, xsinfo* xs_data, materials
mat_list[][NUCLIDE_MAX]);
void inputmaterials(string filename, materials mat_list[][NUCLIDE_MAX]);
void inputsourcelist2(string filename, neutron* source);
void inputjob(char job[], string& in_filename, unsigned long long int& in_seed,
unsigned int& in_nbatch, unsigned int& in_batches, unsigned int&
in_batchskips);
void giveneutrondir(float* ox, float* oy, float* oz);
float sample_X(float a, float b, float c, float max_y);
bool inside_of(shapes shape, float x, float y, float z);
unsigned short getCell(unsigned short shape_num, shapes shape_list[], float x,
float y, float z);
float dist_to_boundary(shapes shape, float* x, float* y, float* z, float ox,
float oy, float oz);
void move_n(float x, float y, float z, float ox, float oy, float oz, float*
endx, float* endy, float* endz, float dr);
void scatter_iso(int zaid, float* u, float* v, float* w, float* energy);
void inelastic_scatter(int zaid, float* u, float* v, float* w, float* energy,
legendres coeffs);
int findzaid(xsinfo xs_data, int zaid);
unsigned int binarySearch(float sortedArray[], unsigned int first, unsigned int
last, float key);
float getmicro(float E, xs cross_sec, char type, unsigned int i);
int picknuclide(float allmatSigmaT, xsinfo xs_data, float E, materials
mat_def[]);
float getmaterialmacroT(float E, xsinfo xs_data, materials mat_def[]);
float getmacro(float E, xsinfo xs_data, char type, float density, int loc,
unsigned int i);
char get_Rxn(float E, xsinfo xs_data, materials mat_def);
int main(int argc, char* argv[])
clock_t time_1, time_2;
//get main input
string file;
unsigned long long int seed;
unsigned int nbatch; //neuts per batch
unsigned int batches;
unsigned int batchskips;
char jobname[]="input.job"; //!!! when using commandline, replace this with
//argv[1]
inputjob(jobname, file, seed, nbatch, batches, batchskips);
//init rng
rng(seed,1.0f);
xsinfo xs_data;
unsigned short num_shapes; //get from input
shapes shapelist[OBJECT_MAX]; //get from input
inputgeometry(file, num_shapes, shapelist);
materials mat_list[OBJECT_MAX][NUCLIDE_MAX];
inputmaterials(file, mat_list);
neutron* queuelist;
queuelist=(neutron*)malloc(2*nbatch*sizeof(neutron));
for (unsigned int i=0; i<2*nbatch; i++)
queuelist[i].energy=-1.0f;
inputsourcelist2(file,queuelist);
inputXS(num_shapes,&xs_data, mat_list);
cout <<"\nINPUT PARSING COMPLETE\nTIMER STARTED";
time_1=clock(); //start timer!
float keff=0;
float kbatch;
float ksum=0; //used for checking my kbatch algorithm
float keff_alt;//used for checking my kbatch algorithm
unsigned int queuecount=0;
unsigned int fissionqueuecount=0;
unsigned int lost_tally=0;
for (int batchcount=1; batchcount <= batches; batchcount++)
cout <<"\nBatch "<<batchcount;
unsigned int collisions=0;
unsigned int fission_tally=0;
unsigned int rad_cap_tally=0;
unsigned int leak_tally=0;
unsigned int fission_production_tally=0;
for (unsigned int count=1; count <= nbatch; count++)
//cout <<"\nbeginning neutron # "<<count;
bool neutron_alive=true;
float coll_dist;
char rxn_type;
int target_nuclide;
neutron neut;
//depends upon input source distribution, and fission neutron queue
//create neutron
if (queuelist[queuecount].energy==-1.0f)
//If there are no neutrons in queue, start at r=0 with random E
neut.x=0.0f;
neut.y=0.0f;
neut.z=0.0f;
neut.energy=sample_X(0.453f,-1.036f,2.29f,0.35820615f);
neut.cell=NOT_FISSION_FLAG;
queuecount=0;
else
neut.x=queuelist[queuecount].x;
neut.y=queuelist[queuecount].y;
neut.z=queuelist[queuecount].z;
neut.energy=queuelist[queuecount].energy;
neut.cell=queuelist[queuecount].cell;
if (queuecount<(2*nbatch-2))
queuecount++;
else queuecount=0;
giveneutrondir(&neut.ox,&neut.oy,&neut.oz);
if (neut.cell==OBJECT_MAX) //source in wrong xyz, leak and move on
neutron_alive=false;
while (neutron_alive)
//begin transporting neutron
neut.cell=getCell(num_shapes, shapelist, neut.x,neut.y,neut.z);
float tempx,tempy,tempz;
float eta=rng(0,1.0f);
bool finding_new_cell=true;
float SigmaT;
while (finding_new_cell)
float dist=dist_to_boundary(shapelist[neut.cell],
&neut.x,&neut.y,&neut.z,neut.ox,neut.oy,neut.oz);
if (dist<0.0f)
finding_new_cell=false; //Neut Lost!
neutron_alive=false;
lost_tally++;
//set cell temporarily to something to survive this
iteration of the loop
neut.cell=0;
SigmaT=getmaterialmacroT(neut.energy,xs_data,
mat_list[neut.cell]);
coll_dist=-log(eta)/(SigmaT);
bool find_dist=true;
int i=0;
while (find_dist)
float temp;
if (shapelist[neut.cell].inside_me[i]!=999)
temp=dist_to_boundary
(shapelist[shapelist[neut.cell].inside_me[i]],&neut.x,&neut.y,&neut.z,
neut.ox,neut.oy,neut.oz);
if ((temp>0)&&(temp<dist))
dist=temp;
i++;
else find_dist=false;
if (coll_dist>=dist)
neut.x=neut.x+neut.ox*(dist+SMALLEST_float);
neut.y=neut.y+neut.oy*(dist+SMALLEST_float);
neut.z=neut.z+neut.oz*(dist+SMALLEST_float);
neut.cell=getCell(num_shapes, shapelist,
neut.x,neut.y,neut.z);
if (neut.cell==OBJECT_MAX)
neutron_alive=false;
finding_new_cell=false;
leak_tally++;
else
eta=exp(((dist+SMALLEST_float)-
coll_dist)*SigmaT);
else
neut.x=neut.x+neut.ox*(coll_dist);
neut.y=neut.y+neut.oy*(coll_dist);
neut.z=neut.z+neut.oz*(coll_dist);
finding_new_cell=false;
if (neutron_alive)
target_nuclide=picknuclide(SigmaT, xs_data,
neut.energy, mat_list[neut.cell]);
rxn_type=get_Rxn(neut.energy, xs_data,
mat_list[neut.cell][target_nuclide]);
int loc=mat_list[neut.cell][target_nuclide].zaid_loc;
switch (rxn_type)
case 'f': //do fission stuff
//determine number of neutrons from fission
int num_fission_neuts;
if (rng(0,1.0f) > NU)
num_fission_neuts=3;
else num_fission_neuts=2;
for (int i=0; i<num_fission_neuts; i++)
neutron temp;
temp.x=neut.x;
temp.y=neut.y;
temp.z=neut.z;
temp.cell=neut.cell;
temp.energy=
sample_X(xs_data.cross_sec[loc].a,
xs_data.cross_sec[loc].b,
xs_data.cross_sec[loc].c,
xs_data.cross_sec[loc].d);
queuelist[fissionqueuecount]=temp;
if (fissionqueuecount<(2*nbatch-2))
fissionqueuecount++;
else
fissionqueuecount=0;
fission_production_tally++;
//store initial rx,ry,rz,E for fission n's in queue
fission_tally++;
neutron_alive=false;
break;
case 'c': //do capture stuff
neutron_alive=false;
rad_cap_tally++;
break;
case 'e':
collisions++;
scatter_iso(xs_data.list[ mat_list[neut.cell][target_nuclide].zaid_loc],
&neut.ox,&neut.oy,&neut.oz,&neut.energy);
break;
case 'i': //do inelastic stuff;
collisions++;
inelastic_scatter(xs_data.list[mat_list[neut.cell][target_nuclide].zaid_loc],
&neut.ox,&neut.oy,&neut.oz,&neut.energy, xs_data.cross_sec[loc].coeffs);
break;
default: cout << "\nError in get_Rxn; returned "<<
rxn_type;
cout <<"\nfission neutrons: "<<fission_production_tally
<<" absorptions: "<<rad_cap_tally+fission_tally
<<" leaked neutrons: "<< leak_tally;
cout <<"\nAverage collisions per neutron= "<<collisions/(float)nbatch;
kbatch=fission_production_tally /
((float)(leak_tally+rad_cap_tally+fission_tally));
if (batchcount>batchskips)
keff=((batchcount-batchskips-1)*(keff)+kbatch)/(batchcount-batchskips);
ksum+=kbatch;
cout <<"\nkbatch= "<<kbatch<<"\trunning keff= "<<keff;
time_2=clock();
cout <<"\nkeff = "<<keff;
keff_alt=ksum/((float)(batches-batchskips));
cout <<"\nkeff_alt = "<<keff_alt;
float deltat=(float) (time_2-time_1)/(float) CLOCKS_PER_SEC;
cout <<"\nComputation Time= " << deltat <<" seconds";
cout <<"\nAverage Neutrons per Second = "<<nbatch*batches/deltat;
cout <<"\nAverage Neutrons per Minute = "<<nbatch*batches/deltat*60;
cout <<"\nLost neutrons = " << lost_tally<<endl;
ofstream timeout;
timeout.open("time.out");
timeout<<deltat<<endl<<keff;
timeout.close();
return (EXIT_SUCCESS);
void inputjob(char job[], string& in_filename, unsigned long long int& in_seed,
              unsigned int& in_nbatch, unsigned int& in_batches, unsigned int& in_batchskips)
{
    ifstream fin;
    fin.open(job);
    fin>>in_filename;
    if (!fin) cout <<"\ninputjob file not open...";
    fin>>in_seed>>in_nbatch>>in_batches>>in_batchskips;
    fin.close();
}
void inputsourcelist2(string filename, neutron* source)
{
    neutron temp;
    filename.append(".src");
    ifstream sin(filename.c_str());
    if (!sin)
    {
        cout<<"\nerror reading .src file";
        return;
    }
    int number;
    sin>>number; //number of source neutrons
    for (int i=0; i<number; i++)
    {
        sin>>temp.x>>temp.y>>temp.z>>temp.energy;
        temp.cell=NOT_FISSION_FLAG;
        source[i]=temp;
    }
    sin.close();
}
void inputmaterials(string filename, materials mat_list[][NUCLIDE_MAX])
{
    for (int i=0; i<OBJECT_MAX; i++)
        for (int j=0; j<NUCLIDE_MAX; j++)
        {
            mat_list[i][j].num_mat=0;
            mat_list[i][j].zaid_loc=0;
            mat_list[i][j].density=0.0f;
        }
    filename.append(".mat");
    ifstream min(filename.c_str());
    if (!min)
    {
        cout<<"\nerror reading .mat file";
        return;
    }
    int number; //number of cells
    min>>number;
    for (int i=0; i<number; i++) //i is the same as cell #
    {
        min>>mat_list[i][0].num_mat;
        for (int j=0; j<mat_list[i][0].num_mat; j++)
        {
            min>>mat_list[i][j].zaid_loc>>mat_list[i][j].density;
            mat_list[i][j].num_mat=mat_list[i][0].num_mat;
        }
        for (int j=mat_list[i][0].num_mat; j<NUCLIDE_MAX; j++)
            mat_list[i][j].num_mat=mat_list[i][0].num_mat;
    }
    min.close();
}
void inputXS(unsigned short shape_num, xsinfo* xs_data, materials
             mat_list[][NUCLIDE_MAX])
{
    ifstream xsin;
    //collect all of the zaids and find how many different zaids there are
    int k=0;
    int temp[OBJECT_MAX*NUCLIDE_MAX];
    for (int i=0; i<OBJECT_MAX*NUCLIDE_MAX; i++)
        temp[i]=0;
    temp[k]=mat_list[0][0].zaid_loc;
    k++;
    for (int i=0; i<OBJECT_MAX; i++)
        for (int j=0; j<NUCLIDE_MAX; j++)
        {
            bool found_val=false;
            for (int z=0; z<k; z++)
                if (temp[z]==mat_list[i][j].zaid_loc)
                    found_val=true;
            if ((!found_val)&&(mat_list[i][j].zaid_loc!=0))
            {
                temp[k]=mat_list[i][j].zaid_loc;
                k++;
            }
        }
    //replace each zaid with its index into the unique-zaid list
    for (int z=0; z<k; z++)
        for (int i=0; i<OBJECT_MAX; i++)
            for (int j=0; j<NUCLIDE_MAX; j++)
                if (temp[z]==mat_list[i][j].zaid_loc)
                    mat_list[i][j].zaid_loc=z;
    (*xs_data).listlength=k;
    (*xs_data).list=(int*)malloc((*xs_data).listlength*sizeof(*(*xs_data).list));
    for (int i=0; i<k; i++)
        (*xs_data).list[i]=temp[i];
    //This should have produced a list of all of the zaids to get data for
    (*xs_data).cross_sec=
        (xs*)malloc((*xs_data).listlength*sizeof(*(*xs_data).cross_sec));
    for (int i=0; i<k; i++)
    {
        char xsfile[7];
        sprintf(xsfile, "%d",(*xs_data).list[i]);
        xsin.open(xsfile);
        if (!xsin)
        {
            cout<<"\nerror reading xs file";
            return;
        }
        //a,b,c are from: X(E)=a*exp(b*E)*sinh(sqrt(c*E))
        //d is the max of X(E).
        xsin>>(*xs_data).cross_sec[i].a>>(*xs_data).cross_sec[i].b
            >>(*xs_data).cross_sec[i].c>>(*xs_data).cross_sec[i].d;
        xsin>>(*xs_data).cross_sec[i].mesh_length;
        (*xs_data).cross_sec[i].Emesh=(float*)malloc(
            (*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].Emesh));
        (*xs_data).cross_sec[i].fission=(float*)malloc(
            (*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].fission));
        (*xs_data).cross_sec[i].rad_cap=(float*)malloc(
            (*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].rad_cap));
        (*xs_data).cross_sec[i].elastic=(float*)malloc(
            (*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].elastic));
        (*xs_data).cross_sec[i].inelastic=(float*)malloc(
            (*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].inelastic));
        if ((*xs_data).cross_sec[i].Emesh==NULL) cout <<"\ndidnt initialize Emesh...";
        for (int j=0; j<(*xs_data).cross_sec[i].mesh_length; j++)
            xsin>>(*xs_data).cross_sec[i].Emesh[j]>>(*xs_data).cross_sec[i].fission[j]
                >>(*xs_data).cross_sec[i].rad_cap[j]>>(*xs_data).cross_sec[i].elastic[j]
                >>(*xs_data).cross_sec[i].inelastic[j];
        //get legendre coeffs
        xsin>>(*xs_data).cross_sec[i].coeffs.a0>>(*xs_data).cross_sec[i].coeffs.a1
            >>(*xs_data).cross_sec[i].coeffs.a2>>(*xs_data).cross_sec[i].coeffs.a3
            >>(*xs_data).cross_sec[i].coeffs.a4>>(*xs_data).cross_sec[i].coeffs.a5
            >>(*xs_data).cross_sec[i].coeffs.a6>>(*xs_data).cross_sec[i].coeffs.a7;
        xsin.close();
    }
}
float rng(unsigned long int a, float bounds)
{
    //linear congruential generator (Numerical Recipes constants)
    const float mult=1.0f/4294967296.0f;
    static unsigned long int seed;
    if (a==0)
        a=seed;
    seed=(a * 1664525+1013904223) % 4294967296;
    a=seed;
    return seed*bounds*mult;
}
void inputgeometry(string filename, unsigned short& shape_num, shapes
                   shape_list[])
{
    filename.append(".geo");
    ifstream gin(filename.c_str());
    unsigned short number;
    if (!gin)
    {
        cout<<"\nerror reading .geo file";
        return;
    }
    gin >> number;
    shape_num=number;
    for (int i=0; i<(int)number; i++)
    {
        //get one object
        gin >>shape_list[i].type>>shape_list[i].x0>>shape_list[i].x1
            >>shape_list[i].y0>>shape_list[i].y1>>shape_list[i].z0
            >>shape_list[i].z1>>shape_list[i].R2;
        int temp;
        int j=0;
        bool continue_loop=true;
        while (continue_loop)
        {
            gin>>temp;
            shape_list[i].inside_me[j]=(unsigned short) temp;
            j++;
            if (temp==999) continue_loop=false;
        }
    }
    gin.close();
}
void giveneutrondir(float* ox, float* oy, float* oz)
{
    *oz=rng(0,2.0f)-1.0f;
    float temp=sqrtf(1.0f-(*oz)*(*oz));
    *ox=rng(0,6.283185307f); //random azimuthal angle in [0, 2*pi)
    *oy=sinf(*ox)*temp;
    *ox=cosf(*ox)*temp;
}
float sample_X(float a, float b, float c, float max_y)
{
    //rejection sampling of X(E)=a*exp(-b*E)*sinh(sqrt(c*E)); max_y is the max of X(E)
    float max_x=20.0f;
    float x;
    while (true)
    {
        x=rng(0,max_x);
        if (rng(0,max_y)<=a*exp(-b*x)*sinh(sqrt(c*x)))
            return x;
    }
}
int findzaid(xsinfo xs_data, int zaid)
{
    for (int i=0; i<xs_data.listlength; i++)
        if (xs_data.list[i]==zaid)
            return i;
    //if we made it here, the zaid was not found
    return 999;
}
bool inside_of(shapes shape, float x, float y, float z)
{
    //given x,y,z, determine whether the point is inside a given object ('shape').
    //Sphere: (x-x0)^2+(y-y0)^2+(z-z0)^2=R^2
    //Cube: z=z0, z=z1; x=x0, x=x1; y=y0, y=y1
    //Cylinder: (x-x0)^2+(y-y0)^2=R^2, z>=z0, z<=z1
    if (shape.type=='s') //spheres
    {
        float temp=(x-shape.x0)*(x-shape.x0) +
                   (y-shape.y0)*(y-shape.y0) +
                   (z-shape.z0)*(z-shape.z0);
        if (temp<=shape.R2) //then it is inside, or on the surface of, the sphere
            return true; //!!!This works if the smallest cell IDs mean they are
                         //inside the rest (so the 'inside_me' list is not needed)
    }
    else if (shape.type=='p') //parallelepiped
    {
        //z0 is top face, z1 is bottom face
        //x0 is larger x face, x1 is smaller x face
        //y0 is larger y face, y1 is smaller y face
        if (((z<=shape.z0)&&(z>=shape.z1))&&
            (((x<=shape.x0)&&(x>=shape.x1))&&
             ((y<=shape.y0)&&(y>=shape.y1))))
            return true;
    }
    return false;
}
unsigned short getCell(unsigned short shape_num, shapes shape_list[], float x,
                       float y, float z)
{
    for (unsigned short i=0; i<shape_num; i++)
        if (inside_of(shape_list[i],x,y,z))
            return i;
    return OBJECT_MAX; //not inside any shape: the particle has leaked
}
void move_n(float x, float y, float z, float ox, float oy, float oz, float*
            endx, float* endy, float* endz, float dr)
{
    *endx=x+ox*dr;
    *endy=y+oy*dr;
    *endz=z+oz*dr;
}
float dist_to_boundary(shapes shape, float* x, float* y, float* z, float ox,
                       float oy, float oz)
{
    if (shape.type=='s') //spheres
    {
        float d=(ox*ox+oy*oy+oz*oz);
        float b= 2.0f*(ox*(*x-shape.x0)+oy*(*y-shape.y0)+oz*(*z-shape.z0));
        float c= ((*x)*(*x)-2.0f*(*x)*shape.x0+shape.x0*shape.x0) +
                 ((*y)*(*y)-2.0f*(*y)*shape.y0+shape.y0*shape.y0) +
                 ((*z)*(*z)-2.0f*(*z)*shape.z0+shape.z0*shape.z0)-shape.R2;
        float temp=b*b-4.0f*d*c;
        if (temp < 0.0f)
            return -10.0f; //a flag indicating the ray did not intersect
        temp=sqrtf(temp);
        c = 0.5f*(temp-b)/d;
        d = 0.5f*(-temp-b)/d;
        if (d>0.0f) //pick the smallest positive root
            return d;
        else
            return c;
    }
    else if (shape.type=='p') //parallelepiped
    {
        //z0 is top face, z1 is bottom face
        //x0 is larger x face, x1 is smaller x face
        //y0 is larger y face, y1 is smaller y face
        //compute the distance to each candidate face
        float temp1=1.0E37f;
        float temp2=1.0E37f;
        float temp3=1.0E37f;
        if (ox>0.0f)
            temp1=(shape.x0-*x)/ox;
        else if (ox<0.0f)
            temp1=(shape.x1-*x)/ox;
        if (oy>0.0f)
            temp2=(shape.y0-*y)/oy;
        else if (oy<0.0f)
            temp2=(shape.y1-*y)/oy;
        if (oz>0.0f)
            temp3=(shape.z0-*z)/oz;
        else if (oz<0.0f)
            temp3=(shape.z1-*z)/oz;
        //the smallest distance is the exiting face
        temp1=min(temp1,temp2);
        temp1=min(temp1,temp3);
        return temp1;
    }
    else return -10.0f;
}
void scatter_iso(int zaid, float* u, float* v, float* w, float* energy)
{
    //This computes a new u,v,w, and energy after an isotropic elastic collision
    int A=zaid/1000;
    A=zaid-1000*A; //mass number from the ZAID
    float mu_cm=rng(0,2.0f)-1.0f;
    float new_energy=*energy*(A*A+2.0f*A*mu_cm+1.0f) /
                     ((float)(A*A+2.0f*A+1.0f));
    float temp=sqrt(*energy/new_energy);
    float cos_phi=cos(atan(sin(acos(mu_cm))/(1.0f/A+mu_cm)));
    float sin_phi=sin(acos(cos_phi));
    float cos_w=rng(0,2.0f)-1.0f;
    float sin_w=sin(acos(cos_w));
    temp=sin_phi/(sqrt(1.0f-(*w)*(*w))); //reused to save space
    float new_u, new_v, new_w;
    if (isinf(temp))
    {
        new_u=0.0f;
        new_v=0.0f;
        new_w=(*w)*cos_phi;
    }
    else
    {
        new_u=temp*((*v)*sin_w-(*v)*(*u)*cos_w)+(*u)*cos_phi;
        new_v=temp*(-(*u)*sin_w-(*w)*(*v)*cos_w)+(*v)*cos_phi;
        new_w=sin_phi*sqrt(1.0f-(*w)*(*w))*cos_w+(*w)*cos_phi;
    }
    //this is done since machine accuracy with these floats seems to be making
    //the omega vectors larger than unity. Rescaling.
    temp=new_u*new_u+new_v*new_v+new_w*new_w;
    if ((temp>1.00f)||(temp<0.999f))
    {
        temp=1.0f/sqrt(temp);
        new_u=new_u*temp;
        new_v=new_v*temp;
        new_w=new_w*temp;
    }
    *u=new_u;
    *v=new_v;
    *w=new_w;
    *energy=new_energy;
}
void inelastic_scatter(int zaid, float* u, float* v, float* w, float* energy,
                       legendres coeffs)
{
    int A=zaid/1000;
    A=zaid-1000*A; //mass number from the ZAID
    //First step: sample Q - done with a uniform RNG from 0 to maxQ
    float Q=rng(0,(*energy*-A/((float)A+1.0f)));
    float Eout=*energy+(A+1.0f)*(Q/A);
    *energy=Eout;
    //now sample mu from the Legendre expansion by rejection
    float mu;
    while (true)
    {
        mu=rng(0,2.0f)-1.0f;
        //first four terms, P0-P3
        float p=coeffs.a0+coeffs.a1*mu+coeffs.a2*0.5f*(3.0f*mu*mu-1.0f)
               +coeffs.a3*0.5f*(5.0f*mu*mu*mu-3.0f*mu);
        //P4-P5
        p+=0.125f*(coeffs.a4*(35.0f*mu*mu*mu*mu-30.0f*mu*mu+3.0f)
                  +coeffs.a5*(63.0f*mu*mu*mu*mu*mu-70.0f*mu*mu*mu+15.0f*mu));
        //P6-P7
        p+=0.0625f*(coeffs.a6*(231.0f*mu*mu*mu*mu*mu*mu-315.0f*mu*mu*mu*mu
                               +105.0f*mu*mu-5.0f)
                   +coeffs.a7*(429.0f*mu*mu*mu*mu*mu*mu*mu-693.0f*mu*mu*mu*mu*mu
                               +315.0f*mu*mu*mu-35.0f*mu));
        if (rng(0,8.0f)<=p)
            break;
    }
    //so now I have mu. use it just like in scatter_iso.
    float cos_phi=cos(atan(sin(acos(mu))/(1.0f/A+mu)));
    float sin_phi=sin(acos(cos_phi));
    float cos_w=rng(0,2.0f)-1.0f;
    float sin_w=sin(acos(cos_w));
    float temp=sin_phi/(sqrt(1.0f-(*w)*(*w))); //reused to save space
    float new_u,new_v,new_w;
    if (isinf(temp))
    {
        new_u=0.0f;
        new_v=0.0f;
        new_w=(*w)*cos_phi;
    }
    else
    {
        new_u=temp*((*v)*sin_w-(*v)*(*u)*cos_w)+(*u)*cos_phi;
        new_v=temp*(-(*u)*sin_w-(*w)*(*v)*cos_w)+(*v)*cos_phi;
        new_w=sin_phi*sqrt(1.0f-(*w)*(*w))*cos_w+(*w)*cos_phi;
    }
    //rescale: machine accuracy with floats can make omega vectors larger than unity
    temp=new_u*new_u+new_v*new_v+new_w*new_w;
    if ((temp>1.00f)||(temp<0.999f))
    {
        temp=1.0f/sqrt(temp);
        new_u=new_u*temp;
        new_v=new_v*temp;
        new_w=new_w*temp;
    }
    *u=new_u;
    *v=new_v;
    *w=new_w;
}
unsigned int binarySearch(float sortedArray[], unsigned int first, unsigned int
                          last, float key)
{
    if (key<sortedArray[0])
        return 1;
    while (first <= last)
    {
        unsigned int mid = (first + last) / 2; // compute mid point
        if (key > sortedArray[mid])
            first = mid + 1; // repeat search in top half
        else if (key < sortedArray[mid])
            last = mid - 1; // repeat search in bottom half
        else
            return mid; // found it; return position
    }
    return last+1; // failed to find key; return index of the next-larger element
}
float getmacro(float E, xsinfo xs_data, char type, float density, int loc,
               unsigned int i)
{
    //returns the macroscopic cross section for the requested reaction type
    float temp=0.0f;
    switch (type)
    {
        case ('t'):
            temp=getmicro(E,xs_data.cross_sec[loc],'t',i);
            break;
        case ('f'):
            temp=getmicro(E,xs_data.cross_sec[loc],'f',i);
            break;
        case ('c'):
            temp=getmicro(E,xs_data.cross_sec[loc],'c',i);
            break;
        case ('e'):
            temp=getmicro(E,xs_data.cross_sec[loc],'e',i);
            break;
        case ('i'):
            temp=getmicro(E,xs_data.cross_sec[loc],'i',i);
    }
    return (temp*density);
}
int picknuclide(float allmatSigmaT, xsinfo xs_data, float E, materials
                mat_def[])
{
    float eta=rng(0,1.0f)*allmatSigmaT;
    float running_sum=0.0f;
    for (int i=0; i<mat_def[0].num_mat; i++)
    {
        int loc=mat_def[i].zaid_loc;
        running_sum+=getmacro(E, xs_data, 't', mat_def[i].density, loc,
            binarySearch(xs_data.cross_sec[loc].Emesh, 0,
                         xs_data.cross_sec[loc].mesh_length-1, E));
        if (eta <= running_sum)
            return i;
    }
    return mat_def[0].num_mat-1; //round-off fallback: return the last nuclide
}
char get_Rxn(float E, xsinfo xs_data, materials mat_def)
{
    //This function determines which reaction occurs after we've
    //determined that one takes place
    //int loc=findzaid(xs_data,mat_def.zaid_loc);
    int loc=mat_def.zaid_loc;
    unsigned int i=binarySearch(xs_data.cross_sec[loc].Emesh, 0,
                                xs_data.cross_sec[loc].mesh_length-1, E);
    float sigis=getmacro(E,xs_data,'i',mat_def.density,loc,i);
    float sigf=getmacro(E,xs_data,'f',mat_def.density,loc,i);
    float sigrc=getmacro(E,xs_data,'c',mat_def.density,loc,i);
    float siges=getmacro(E,xs_data,'e',mat_def.density,loc,i);
    float sigt=sigf+sigrc+siges+sigis;
    float eta=rng(0,sigt);
    if (E<1E-10f)
        return 'c'; //below the mesh, treat the collision as capture
    //determine where in the range of scaled x/s the random number falls
    if (eta <= sigf) return 'f';
    else if (eta <= (sigf+sigrc)) return 'c';
    else if (eta <= (sigt-sigis)) return 'e';
    else if (eta <= sigt) return 'i';
    else return '0';
}
float getmaterialmacroT(float E, xsinfo xs_data, materials mat_def[])
{
    float macro=0.0;
    for (int i=0; i<mat_def[0].num_mat; i++) //num_mat is replicated across entries
    {
        int loc=mat_def[i].zaid_loc;
        macro+=getmicro(E,xs_data.cross_sec[loc],'t',
            binarySearch(xs_data.cross_sec[loc].Emesh, 0,
                         xs_data.cross_sec[loc].mesh_length-1, E))*mat_def[i].density;
    }
    return macro;
}
float getmicro(float E, xs cross_sec, char type, unsigned int i)
{
    //linear interpolation on the energy mesh; below the mesh, 1/v (1/sqrt(E)) behavior
    float temp;
    switch (type)
    {
        case 'f':
            if (E<=cross_sec.Emesh[0])
                temp=cross_sec.fission[0]*sqrt(cross_sec.Emesh[0]/E);
            else
                temp=(cross_sec.fission[i]-cross_sec.fission[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.fission[i-1];
            break;
        case 'c':
            if (E<=cross_sec.Emesh[0])
                temp=cross_sec.rad_cap[0]*sqrt(cross_sec.Emesh[0]/E);
            else
                temp=(cross_sec.rad_cap[i]-cross_sec.rad_cap[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.rad_cap[i-1];
            break;
        case 'e':
            if (E<=cross_sec.Emesh[0])
                temp=cross_sec.elastic[0]*sqrt(cross_sec.Emesh[0]/E);
            else
                temp=(cross_sec.elastic[i]-cross_sec.elastic[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.elastic[i-1];
            break;
        case 'i':
            if (E<=cross_sec.Emesh[0])
                temp=cross_sec.inelastic[0]*sqrt(cross_sec.Emesh[0]/E);
            else
                temp=(cross_sec.inelastic[i]-cross_sec.inelastic[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.inelastic[i-1];
            break;
        case 't':
            if (E<=cross_sec.Emesh[0])
                temp=(cross_sec.fission[0]+cross_sec.rad_cap[0]
                     +cross_sec.elastic[0]+cross_sec.inelastic[0])
                    *sqrt(cross_sec.Emesh[0]/E);
            else
            {
                temp=(cross_sec.fission[i]-cross_sec.fission[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.fission[i-1];
                temp+=(cross_sec.rad_cap[i]-cross_sec.rad_cap[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.rad_cap[i-1];
                temp+=(cross_sec.elastic[i]-cross_sec.elastic[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.elastic[i-1];
                temp+=(cross_sec.inelastic[i]-cross_sec.inelastic[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.inelastic[i-1];
            }
            break;
        default: cout<<"\nhow did i get here???"; temp=-0.5f;
    }
    return temp;
}
Appendix B
LADONg Source Code
The following CUDA/C++ code is the single-precision code for LADONg.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <iostream>
#include <float.h>
#include <fstream>
#include <string>
#include <cuda.h>
using namespace std;
const int OBJECT_MAX=100;
const int NOT_FISSION_FLAG=101;
const int NUCLIDE_MAX=20;
const float NU=0.53f; //2+NU = nu, # neuts per fission
const float SMALLEST_float=0.001f; //used to barely move particle
const short THREAD_COUNT=64;
const int CONST_SIZE=1000;
struct reduced_array {
    unsigned int index;
    float value;
};

struct neutron {
    float3 pos;
    float energy;
    unsigned short cell;
    char flag;
    short loc;
    unsigned long int seed;
};

struct neutron2 {
    unsigned short cell;
    char flag;
    short loc;
    unsigned long int seed;
};

struct shapes {
    char type; //s for sphere, p for parallelepiped, c for cylinder
    float x0, y0, z0, R2; //R2 because it is R^2
    float x1, y1, z1; //these are required to define a cube (and z1 for a cylinder)
    unsigned short inside_me[OBJECT_MAX];
};

struct materials {
    unsigned short num_mat;
    int zaid_loc;
    float density;
};

struct materials2 {
    int zaid_loc;
    float density;
};

struct legendres {
    float a0, a1, a2, a3, a4, a5, a6, a7;
};

struct xs {
    //these are micros
    float * rad_cap;
    float * fission;
    float * elastic;
    float * inelastic;
    float * Emesh;
    legendres coeffs;
    float a,b,c,d;
    int mesh_length;
};

struct xsinfo {
    xs* cross_sec;
    int *list;
    short listlength;
};
__constant__ shapes shapelist_d[OBJECT_MAX];
__constant__ unsigned short num_mat_d[OBJECT_MAX];
__constant__ materials2 mat_list_d[OBJECT_MAX*NUCLIDE_MAX];
texture<float,1,cudaReadModeElementType> Emesh_tex;
texture<float4,1,cudaReadModeElementType> xsmesh_tex;
__constant__ unsigned int Emesh_offsets_d[NUCLIDE_MAX];
__constant__ float4 xs0_d[NUCLIDE_MAX];
__constant__ float E0_d[NUCLIDE_MAX];
__constant__ reduced_array reduced_Emesh_d[CONST_SIZE];
__constant__ unsigned int reduced_offsets_d[NUCLIDE_MAX];
__constant__ legendres coeffs_d[NUCLIDE_MAX];
//***************************************************************************//
void inputgeometry(string filename,unsigned short& shape_num,shapes
shape_list[]);
float rng(unsigned long int a, float bounds);
void inputXS(unsigned short shape_num, xsinfo* xs_data, materials
mat_list[][NUCLIDE_MAX]);
void inputmaterials(string filename, materials mat_list[][NUCLIDE_MAX]);
void inputsourcelist2(string filename, neutron* source);
void inputjob(char job[], string& in_filename, unsigned long int& in_seed,
unsigned int& in_nbatch, unsigned int& in_batches, unsigned int&
in_batchskips);
float sample_X(float a, float b, float c, float max_y);
short cpu_findzaid(xsinfo* xs_data, int zaid);
//***************************************************************************//
__global__ void MCkernel(const unsigned short num_shapes, const xsinfo*
xs_data, const int length, neutron* list);
__device__ void gpugiveneutrondir(unsigned long int* a, float3* dir);
__host__ __device__ float gpurng(unsigned long int* a, const float bounds);
__device__ bool gpuinside_of(const unsigned short i, const float3* pos);
__device__ unsigned short gpugetCell(const unsigned short shape_num, const
float3* pos);
__device__ float gpudist_to_boundary(const unsigned short i, const float3* pos,
const float3* dir);
__device__ void gpuscatter_iso(neutron2* neut, const int zaid, float3* dir,
float* energy);
__device__ void gpuinelastic(neutron2* neut, const int zaid, float3* dir,
float* energy);
__device__ short findzaid(const xsinfo* xs_data, int zaid);
__device__ unsigned int accelSearch(unsigned int first, unsigned int last,
const float key);
__device__ unsigned int binarySearch(const float* sortedArray, unsigned int
first, unsigned int last, const float key);
__device__ unsigned int textureSearch(unsigned int first, unsigned int last,
const float key);
__device__ float getmicro_f_pretex(const float Ei, const float Eim1, const
float4* xsi, const float4* xsim1, const float E, const short loc);
__device__ float getmicro_c_pretex(const float Ei, const float Eim1, const
float4* xsi, const float4* xsim1, const float E, const short loc);
__device__ float getmicro_e_pretex(const float Ei, const float Eim1, const
float4* xsi, const float4* xsim1, const float E, const short loc);
__device__ float getmicro_i_pretex(const float Ei, const float Eim1, const
float4* xsi, const float4* xsim1, const float E, const short loc);
__device__ int gpupicknuclide(neutron2* neut, const xsinfo* xs_data, const
unsigned short cell, const float SigmaT, const float energy);
__device__ float gpugetmaterialmacroT(const xsinfo* xs_data, const neutron2*
neut, const float energy);
__device__ void gpuget_Rxn(neutron2* neut, const xsinfo* xs_data, const float
energy);
__device__ float getmicro_t(const float E, const xs* cross_sec, const short
loc);
void configure_cuda_call(int required_threads, int * blocks);
void convert2darray(materials mat2d[][NUCLIDE_MAX], materials * mat1d);
void checkCUDAError(const char *msg);
void getneutlist(unsigned int listlength, neutron* host, neutron* device);
void loadneutlist(unsigned int listlength, neutron* host, neutron* device);
void load_gpu(materials mat_list[][NUCLIDE_MAX], unsigned short num_shapes,
shapes shapelist[], xsinfo* xs_data, xsinfo** xs_data_d);
void Emesh_reduction(xsinfo* xs_data, reduced_array** list, unsigned int**
reduced_offsets, int* reduction_factor);
//***************************************************************************//
//MAIN
int main(int argc, char* argv[])
{
    xsinfo* xs_data_d=NULL;
    clock_t time_1, time_2;
    //get main input
    string file;
    unsigned long int seed;
    unsigned int nbatch; //neuts per batch
    unsigned int batches;
    unsigned int batchskips;
    char jobname[]="input.job";
    inputjob(jobname, file, seed, nbatch, batches, batchskips);
    //init rng
    srand(seed);
    rng(seed,1.0f);
    xsinfo xs_data;
    unsigned short num_shapes; //get from input
    shapes shapelist[OBJECT_MAX]; //get from input
    inputgeometry(file, num_shapes, shapelist);
    materials mat_list[OBJECT_MAX][NUCLIDE_MAX];
    inputmaterials(file, mat_list);
    neutron* queuelist;
    cudaMallocHost((void**)&queuelist,2*nbatch*sizeof(neutron));
    for (unsigned int i=0; i<2*nbatch; i++)
        queuelist[i].energy=-1.0f;
    inputsourcelist2(file,queuelist);
    inputXS(num_shapes,&xs_data, mat_list);
    load_gpu(mat_list, num_shapes, shapelist, &xs_data, &xs_data_d);
    cout <<"\nINPUT PARSING AND GPU INITIALIZATION COMPLETE\nTIMER STARTED";
    time_1=clock(); //start timer!
    float keff=0;
    float kbatch;
    float ksum=0; //used for checking my kbatch algorithm
    float keff_alt; //used for checking my kbatch algorithm
    unsigned int queuecount=0;
    unsigned int fissionqueuecount=0;
    neutron* batchlist=NULL;
    cudaMallocHost((void**)&batchlist,nbatch*sizeof(neutron));
    neutron* batchlist_d=NULL;
    cudaMalloc((void**)&batchlist_d,nbatch*sizeof(neutron));
    checkCUDAError("cudamalloc of batchlist");
    int threads=THREAD_COUNT;
    int blocks=0;
    // unsigned int lost_tally=0;
    configure_cuda_call(nbatch, &blocks);
    for (unsigned int batchcount=1; batchcount <= batches; batchcount++)
    {
        cout <<"\nBatch "<<batchcount;
        unsigned int collisions=0;
        unsigned int fission_tally=0;
        unsigned int rad_cap_tally=0;
        unsigned int leak_tally=0;
        unsigned int fission_production_tally=0;
        //extract queue from queuelist
        for (unsigned int i=0; i<nbatch; i++)
        {
            if (queuelist[queuecount].energy==-1.0)
            {
                //If there are no neutrons in queue, start at r=0 with random E
                batchlist[i].pos.x=0.0f;
                batchlist[i].pos.y=0.0f;
                batchlist[i].pos.z=0.0f;
                batchlist[i].energy=sample_X(0.453f,-1.036f,2.29f,0.35820615f);
                queuecount=0;
            }
            else
            {
                batchlist[i].pos=queuelist[queuecount].pos;
                batchlist[i].energy=queuelist[queuecount].energy;
                if (queuecount<(2*nbatch-2))
                    queuecount++;
                else queuecount=0;
            }
            batchlist[i].seed=(unsigned long int) rng(0,4294967296.0f);
        }
        loadneutlist(nbatch, batchlist, batchlist_d);
        checkCUDAError("load batchlist");
        MCkernel<<<blocks,threads>>>(num_shapes, xs_data_d, nbatch, batchlist_d);
        cudaThreadSynchronize();
        checkCUDAError("kernel invocation");
        getneutlist(nbatch, batchlist, batchlist_d);
        for (unsigned int i=0; i<nbatch; i++)
        {
            switch(batchlist[i].flag)
            {
            case 'f': //do fission stuff
                //determine number of neutrons from fission
                int num_fission_neuts;
                if (rng(0,1.0f) > NU)
                    num_fission_neuts=3;
                else num_fission_neuts=2;
                for (int j=0; j<num_fission_neuts; j++)
                {
                    neutron temp;
                    temp.pos=batchlist[i].pos;
                    short loc=batchlist[i].loc;
                    temp.energy=sample_X(xs_data.cross_sec[loc].a,
                        xs_data.cross_sec[loc].b,
                        xs_data.cross_sec[loc].c,
                        xs_data.cross_sec[loc].d);
                    queuelist[fissionqueuecount]=temp;
                    if (fissionqueuecount<(2*nbatch-2))
                        fissionqueuecount++;
                    else
                        fissionqueuecount=0;
                    fission_production_tally++;
                }
                //store initial rx,ry,rz,E for fission n's in queue
                fission_tally++;
                break;
            case 'c': //do capture stuff
                rad_cap_tally++;
                break;
            case 'l':
                leak_tally++;
                break;
            case 'e': //These will return 0, since no collisions reported.
                collisions++; //kept to maintain similar computations between cpu and gpu
                break;
            case 'i':
                collisions++;
                break;
            // case 'L':
            //     lost_tally++;
            //     break;
            default: cout<<"\nGot a neutron that doesn't have correct flag: "<<batchlist[i].flag<<"\n";
            }
        }
        cout <<"\nfission neutrons: "<<fission_production_tally
             <<" absorptions: "<<rad_cap_tally+fission_tally
             <<" leaked neutrons: "<< leak_tally;
        cout <<"\nAverage collisions per neutron= "<<collisions/(float)nbatch;
        kbatch=fission_production_tally /
            ((float)(leak_tally+rad_cap_tally+fission_tally));
        if (batchcount>batchskips)
        {
            keff=((batchcount-batchskips-1)*(keff)+kbatch)/(batchcount-batchskips);
            ksum+=kbatch;
        }
        cout <<"\nkbatch= "<<kbatch<<"\trunning keff= "<<keff;
    }
    time_2=clock();
    cout <<"\nkeff = "<<keff;
    keff_alt=ksum/((float)(batches-batchskips));
    cout <<"\nkeff_alt = "<<keff_alt;
    float deltat=(float) (time_2-time_1)/(float) CLOCKS_PER_SEC;
    cout <<"\nComputation Time= " << deltat <<" seconds";
    cout <<"\nAverage Neutrons per Second = "<<nbatch*batches/deltat;
    cout <<"\nAverage Neutrons per Minute = "<<nbatch*batches/deltat*60<<endl;
    cudaFree(xs_data_d);
    cudaFree(batchlist_d);
    cudaFreeHost(batchlist);
    cudaFreeHost(queuelist);
    ofstream timeout;
    timeout.open("time.out");
    timeout<<deltat<<endl<<keff;
    timeout.close();
    return (EXIT_SUCCESS);
}
//***************************************************************************//
//CPU FUNCTIONS
void inputjob(char job[], string& in_filename, unsigned long int& in_seed,
    unsigned int& in_nbatch, unsigned int& in_batches, unsigned int& in_batchskips)
{
    ifstream fin;
    fin.open(job);
    fin>>in_filename;
    if (!fin) cout <<"\ninputjob file not open...";
    fin>>in_seed>>in_nbatch>>in_batches>>in_batchskips;
    fin.close();
}
void inputsourcelist2(string filename, neutron* source)
{
    neutron temp;
    filename.append(".src");
    ifstream sin(filename.c_str());
    if (!sin)
    {
        cout<<"\nerror reading .src file";
        return;
    }
    int number;
    sin>>number; //number of source neuts
    for (int i=0; i<number; i++)
    {
        sin>>temp.pos.x>>temp.pos.y>>temp.pos.z>>temp.energy; //>>temp.dir.x>>temp.dir.y>>temp.dir.z;
        temp.cell=NOT_FISSION_FLAG;
        source[i]=temp;
    }
    sin.close();
}
void inputmaterials(string filename, materials mat_list[][NUCLIDE_MAX])
{
    for (int i=0; i<OBJECT_MAX; i++)
        for (int j=0; j<NUCLIDE_MAX; j++)
        {
            mat_list[i][j].num_mat=0;
            mat_list[i][j].zaid_loc=0;
            mat_list[i][j].density=0.0f;
        }
    filename.append(".mat");
    ifstream min(filename.c_str());
    if (!min)
    {
        cout<<"\nerror reading .mat file";
        return;
    }
    int number=0; //number of cells
    min>>number;
    for (int i=0; i<number; i++) //i is the same as cell #
    {
        min>> mat_list[i][0].num_mat;
        for (int j=0; j<mat_list[i][0].num_mat; j++)
        {
            min>>mat_list[i][j].zaid_loc>>mat_list[i][j].density;
            mat_list[i][j].num_mat=mat_list[i][0].num_mat;
        }
        for (int j=mat_list[i][0].num_mat; j<NUCLIDE_MAX; j++)
            mat_list[i][j].num_mat=mat_list[i][0].num_mat;
    }
    min.close();
}
void inputXS(unsigned short shape_num, xsinfo* xs_data, materials mat_list[][NUCLIDE_MAX])
{
    ifstream xsin;
    //get all of the zaids
    int k=0;
    //find how many diff zaids there are
    int temp[OBJECT_MAX*NUCLIDE_MAX];
    for (int i=0; i<OBJECT_MAX*NUCLIDE_MAX; i++)
        temp[i]=0;
    temp[k]=mat_list[0][0].zaid_loc;
    k++;
    for (int i=0; i<OBJECT_MAX; i++)
        for (int j=0; j<NUCLIDE_MAX; j++)
        {
            bool found_val=false;
            for (int z=0; z<k; z++)
                if (temp[z]==mat_list[i][j].zaid_loc)
                    found_val=true;
            if ((!found_val)&&(mat_list[i][j].zaid_loc!=0))
            {
                temp[k]=mat_list[i][j].zaid_loc;
                k++;
            }
        }
    for (int z=0; z<k; z++)
        for (int i=0; i<OBJECT_MAX; i++)
            for (int j=0; j<NUCLIDE_MAX; j++)
                if (temp[z]==mat_list[i][j].zaid_loc)
                    mat_list[i][j].zaid_loc=z;
    (*xs_data).listlength=k;
    (*xs_data).list=(int*)malloc((*xs_data).listlength*sizeof(*(*xs_data).list));
    for (int i=0; i<k; i++)
        (*xs_data).list[i]=temp[i];
    //This should have produced a list of all of the zaids to get data for
    (*xs_data).cross_sec=(xs*)malloc((*xs_data).listlength*sizeof(*(*xs_data).cross_sec));
    for (int i=0; i<k; i++)
    {
        char xsfile[7];
        sprintf(xsfile, "%d",(*xs_data).list[i]);
        xsin.open(xsfile);
        if (!xsin)
        {
            cout<<"\nerror reading xs file";
            return;
        }
        //a,b,c are from: X(E)=a*exp(b*E)*sinh(sqrt(c*E))
        //d is the max of X(E).
        xsin>>(*xs_data).cross_sec[i].a>>(*xs_data).cross_sec[i].b
            >>(*xs_data).cross_sec[i].c>>(*xs_data).cross_sec[i].d;
        xsin >>(*xs_data).cross_sec[i].mesh_length;
        (*xs_data).cross_sec[i].Emesh=(float*)malloc((*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].Emesh));
        (*xs_data).cross_sec[i].fission=(float*)malloc((*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].fission));
        (*xs_data).cross_sec[i].rad_cap=(float*)malloc((*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].rad_cap));
        (*xs_data).cross_sec[i].elastic=(float*)malloc((*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].elastic));
        (*xs_data).cross_sec[i].inelastic=(float*)malloc((*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].inelastic));
        if ((*xs_data).cross_sec[i].Emesh==NULL) cout <<"\ndidnt initialize Emesh...";
        for (int j=0; j<(*xs_data).cross_sec[i].mesh_length; j++)
            xsin>>(*xs_data).cross_sec[i].Emesh[j]>>(*xs_data).cross_sec[i].fission[j]
                >>(*xs_data).cross_sec[i].rad_cap[j]
                >>(*xs_data).cross_sec[i].elastic[j]>>(*xs_data).cross_sec[i].inelastic[j];
        //get legendre coeffs
        xsin>>(*xs_data).cross_sec[i].coeffs.a0>>(*xs_data).cross_sec[i].coeffs.a1
            >>(*xs_data).cross_sec[i].coeffs.a2>>(*xs_data).cross_sec[i].coeffs.a3
            >>(*xs_data).cross_sec[i].coeffs.a4>>(*xs_data).cross_sec[i].coeffs.a5
            >>(*xs_data).cross_sec[i].coeffs.a6>>(*xs_data).cross_sec[i].coeffs.a7;
        xsin.close();
    }
}
float rng(unsigned long int a, float bounds)
{
    const float mult=1.0f/4294967296.0f;
    static unsigned long int seed;
    if (a==0)
        a=seed;
    seed=(a * 1664525+1013904223)%4294967296;
    a=seed;
    return seed*bounds*mult;
}
void inputgeometry(string filename,unsigned short& shape_num,shapes shape_list[])
{
    filename.append(".geo");
    ifstream gin(filename.c_str());
    unsigned short number;
    if (!gin)
    {
        cout<<"\nerror reading .geo file";
        return;
    }
    gin >> number;
    shape_num=number;
    for(int i=0; i<(int)number; i++)
    {
        //get one object
        gin >>shape_list[i].type>>shape_list[i].x0>>shape_list[i].x1
            >>shape_list[i].y0>>shape_list[i].y1>>shape_list[i].z0
            >>shape_list[i].z1>>shape_list[i].R2;
        int temp;
        int j=0;
        bool continue_loop=true;
        while (continue_loop)
        {
            gin>>temp;
            shape_list[i].inside_me[j]=(unsigned short) temp;
            j++;
            if (temp==999) continue_loop=false;
        }
    }
    gin.close();
}
float sample_X(float a, float b, float c, float max_y)
{
    //X(E)=a*exp(-b*E)*sinh(sqrt(c*E))
    float max_x=20.0f;
    //float d is the max_y
    bool cont_loop=true;
    float x;
    while (cont_loop)
    {
        x=rng(0,max_x);
        if (rng(0,max_y)<=a*exp(-b*x)*sinh(sqrt(c*x)))
            return x;
    }
    return 1.0f;
}
short cpu_findzaid(xsinfo* xs_data, int zaid)
{
    for (short i=0; i<(*xs_data).listlength; i++)
        if ((*xs_data).list[i]==zaid)
            return i;
    //if made it here, wasnt found
    return 999;
}
//***************************************************************************//
//GPU-SPECIFIC FUNCTIONS
__global__ void MCkernel(const unsigned short num_shapes, const xsinfo* xs_data,
    const int length, neutron* list)
{
    __shared__ neutron2 working_neut[THREAD_COUNT];
    //the mul24 requires the total number of neuts to be < 16 million (2^24).
    int tid=mul24(blockIdx.x,blockDim.x)+threadIdx.x;
    if ((tid)<length)
    {
        float3 pos=list[tid].pos;
        working_neut[threadIdx.x].cell=list[tid].cell;
        working_neut[threadIdx.x].seed=list[tid].seed;
        float energy = list[tid].energy;
        working_neut[threadIdx.x].flag='g'; //g==go
        float3 dir;
        gpugiveneutrondir(&(working_neut[threadIdx.x].seed),&(dir));
        bool neutron_alive=true;
        working_neut[threadIdx.x].cell=gpugetCell(num_shapes, &(pos));
        if (working_neut[threadIdx.x].cell==OBJECT_MAX)
        {
            working_neut[threadIdx.x].cell=0;
            working_neut[threadIdx.x].flag='l';
        }
        while (neutron_alive)
        {
            //begin transporting neutron
            float eta=gpurng(&(working_neut[threadIdx.x].seed),1.0f);
            bool finding_new_cell=true;
            float SigmaT;
            while (finding_new_cell)
            {
                SigmaT=gpugetmaterialmacroT(xs_data, &(working_neut[threadIdx.x]),energy);
                float coll_dist=-logf(eta)/SigmaT;
                float dist=gpudist_to_boundary(working_neut[threadIdx.x].cell,&(pos),&(dir));
                //if ((dist<0.0f)&&(dist>-0.1f))
                //    working_neut[threadIdx.x].cell='L'; // L == 'lost', geometry single precision fix-up
                unsigned short i=0;
                //This loop checks to make sure there were no intersecting objects inbetween x,y,z and objects boundary
                //(think of two spheres inside a sphere - where x,y,z is inside of large sphere but outside small spheres)
                while (shapelist_d[working_neut[threadIdx.x].cell].inside_me[i] != 999)
                {
                    float temp= gpudist_to_boundary(shapelist_d[working_neut[threadIdx.x].cell].inside_me[i],&(pos),&(dir));
                    if ((temp>0.0f)&&(temp<dist))
                        dist=temp;
                    i++;
                }
                //See if neutron leaves current cell
                if (coll_dist>=dist)
                {
                    //it did, so just move it to outside that cell, and then update information
                    //next time we come around this loop it will be like starting with a fresh neutron
                    pos.x=pos.x+dir.x*(dist+SMALLEST_float);
                    pos.y=pos.y+dir.y*(dist+SMALLEST_float);
                    pos.z=pos.z+dir.z*(dist+SMALLEST_float);
                    working_neut[threadIdx.x].cell=gpugetCell(num_shapes, &(pos));
                    if (working_neut[threadIdx.x].cell == OBJECT_MAX)
                        working_neut[threadIdx.x].flag='l';
                    eta=expf(((dist+SMALLEST_float)-coll_dist)*SigmaT);
                }
                else //neutron did not leave cell, so just move it to where it should be
                {
                    pos.x=pos.x+dir.x*coll_dist;
                    pos.y=pos.y+dir.y*coll_dist;
                    pos.z=pos.z+dir.z*coll_dist;
                    finding_new_cell=false;
                }
                //if ((working_neut[threadIdx.x].flag=='l')||(working_neut[threadIdx.x].flag=='L'))
                if (working_neut[threadIdx.x].flag=='l')
                    finding_new_cell=false;
            }
            if (neutron_alive)
            {
                unsigned short celltemp=working_neut[threadIdx.x].cell;
                // if ((working_neut[threadIdx.x].flag=='l')||(working_neut[threadIdx.x].flag=='L'))
                if (working_neut[threadIdx.x].flag=='l')
                    celltemp=0; //let it use some cell, avoids thread seperation
                unsigned short target_nuclide=gpupicknuclide(&(working_neut[threadIdx.x]), xs_data,celltemp,SigmaT,energy);
                gpuget_Rxn(&(working_neut[threadIdx.x]),xs_data,energy);
                if (working_neut[threadIdx.x].flag=='e')
                    gpuscatter_iso(&(working_neut[threadIdx.x]),
                        (*xs_data).list[mat_list_d[target_nuclide+celltemp*NUCLIDE_MAX].zaid_loc],
                        &dir,&energy);
                else if (working_neut[threadIdx.x].flag=='i')
                    gpuinelastic(&(working_neut[threadIdx.x]),
                        (*xs_data).list[mat_list_d[target_nuclide+celltemp*NUCLIDE_MAX].zaid_loc],
                        &dir,&energy);
                else neutron_alive=false;
            }
        }
        list[tid].pos=pos;
        list[tid].cell=working_neut[threadIdx.x].cell;
        list[tid].loc=working_neut[threadIdx.x].loc;
        list[tid].seed=working_neut[threadIdx.x].seed;
        list[tid].energy=energy;
        list[tid].flag=working_neut[threadIdx.x].flag;
    }
}
__device__ void gpugiveneutrondir(unsigned long int* a, float3* dir)
{
    (*dir).z=gpurng(a,2.0f)-1.0f;
    float temp=sqrtf(1.0f-(*dir).z*(*dir).z);
    (*dir).x=gpurng(a,6.283185307f);
    (*dir).y=sinf((*dir).x)*temp;
    (*dir).x=cosf((*dir).x)*temp;
}
__host__ __device__ float gpurng(unsigned long int* a, float bounds)
{
    (*a)=((*a)*1664525+1013904223)%4294967296;
    return (*a)*bounds*2.32830644E-10f;
}
__device__ bool gpuinside_of(const unsigned short i, const float3* pos)
{
    //given x,y,z, if I am in a given object ('shape').
    //Sphere (x-x0)^2+(y-y0)^2+(z-z0)^2=R^2
    //Cube: z=z0; z=z1, x=x0, x=x1; y=y0, y=y1
    bool result=false;
    if (shapelist_d[i].type=='s') //spheres
    {
        float temp=((*pos).x-shapelist_d[i].x0)*((*pos).x-shapelist_d[i].x0) +
            ((*pos).y-shapelist_d[i].y0)*((*pos).y-shapelist_d[i].y0) +
            ((*pos).z-shapelist_d[i].z0)*((*pos).z-shapelist_d[i].z0);
        if (temp<=shapelist_d[i].R2) //then it is inside, or on the surface of the sphere
            result=true; //!!!This works if the smallest cellIDs mean they are inside the rest (so i dont need my 'inside_me' list...
    }
    else if (shapelist_d[i].type=='p') //paralellopiped
    {
        //z0 is top face, z1 is bottom face
        //x0 is larger x face, x1 is smaller x face
        //y0 is larger y face, y1 is smaller y face
        if ((((*pos).z<=shapelist_d[i].z0)&&((*pos).z>=shapelist_d[i].z1))&&
            ((((*pos).x<=shapelist_d[i].x0)&&((*pos).x>=shapelist_d[i].x1))&&
            (((*pos).y<=shapelist_d[i].y0)&&((*pos).y>=shapelist_d[i].y1))))
            result=true;
    }
    return result;
}
__device__ unsigned short gpugetCell(const unsigned short shape_num, const float3* pos)
{
    for (unsigned short i=0; i<shape_num; i++)
        if (gpuinside_of(i,pos))
            return i;
    return OBJECT_MAX;
}
__device__ float gpudist_to_boundary(const unsigned short i, const float3* pos,
    const float3* dir)
{
    if (shapelist_d[i].type=='s') //spheres
    {
        float d=(*dir).x*(*dir).x+(*dir).y*(*dir).y+(*dir).z*(*dir).z;
        float b= 2.0f*(((*dir).x)*((*pos).x-shapelist_d[i].x0)
            +((*dir).y)*((*pos).y-shapelist_d[i].y0)
            +((*dir).z)*((*pos).z-shapelist_d[i].z0));
        float c= (((*pos).x)*((*pos).x)-2.0f*((*pos).x)*shapelist_d[i].x0+shapelist_d[i].x0*shapelist_d[i].x0) +
            (((*pos).y)*((*pos).y)-2.0f*((*pos).y)*shapelist_d[i].y0+shapelist_d[i].y0*shapelist_d[i].y0) +
            (((*pos).z)*((*pos).z)-2.0f*((*pos).z)*shapelist_d[i].z0+shapelist_d[i].z0*shapelist_d[i].z0) -
            shapelist_d[i].R2;
        float temp=b*b-4.0f*d*c;
        if (temp < 0.0f)
            return -10.0f; //a flag so i know that it didnt intersect
        temp=sqrtf(temp);
        c = 0.5f*(temp-b)/d;
        d = 0.5f*(-temp-b)/d;
        if (d>0.0f) //pick smallest root
            return d;
        else
            return c;
    }
    else if (shapelist_d[i].type=='p') //parallelopiped
    {
        //z0 is top face, z1 is bottom face
        //x0 is larger x face, x1 is smaller x face
        //y0 is larger y face, y1 is smaller y face
        //compute distances
        float temp1, temp2, temp3;
        temp1=1.0E37f;
        temp2=1.0E37f;
        temp3=1.0E37f;
        if (((*dir).x)>0.0f)
            temp1=(shapelist_d[i].x0-(*pos).x)/((*dir).x);
        else if (((*dir).x)<0.0f)
            temp1=(shapelist_d[i].x1-(*pos).x)/((*dir).x);
        if (((*dir).y)>0.0f)
            temp2=(shapelist_d[i].y0-(*pos).y)/((*dir).y);
        else if (((*dir).y)<0.0f)
            temp2=(shapelist_d[i].y1-(*pos).y)/((*dir).y);
        if (((*dir).z)>0.0f)
            temp3=(shapelist_d[i].z0-(*pos).z)/((*dir).z);
        else if (((*dir).z)<0.0f)
            temp3=(shapelist_d[i].z1-(*pos).z)/((*dir).z);
        //find the smallest, that is our winner
        temp1=min(temp1,temp2);
        temp1=min(temp1,temp3);
        return temp1;
    }
    return -10.0f;
}
__device__ void gpuscatter_iso(neutron2* neut, const int zaid, float3* dir,
    float* energy)
{
    //This calcs a new u,v,w, and energy after an isotropic collision
    int A;
    int C=zaid/1000;
    A=zaid-1000*C;
    float mu_cm=gpurng(&(*neut).seed,2.0f)-1.0f;
    float temp;
    float new_energy=*energy*(A*A+2.0f*A*mu_cm+1.0f)/(A*A+2.0f*A+1.0f);
    *energy=new_energy;
    temp=sqrtf(*energy/new_energy);
    float cos_phi=cosf(atanf(sinf(acosf(mu_cm))/(1.0f/A+mu_cm)));
    float sin_phi=sinf(acosf(cos_phi));
    float cos_w=gpurng(&(*neut).seed,2.0f)-1.0f;
    float sin_w=sinf(acosf(cos_w));
    temp=sin_phi*(rsqrtf(1-((*dir).z)*((*dir).z))); //reused to save registers
    float new_u=temp*(((*dir).y)*sin_w-((*dir).y)*((*dir).x)*cos_w)+((*dir).x)*cos_phi;
    float new_v=temp*(-((*dir).x)*sin_w-((*dir).z)*((*dir).y)*cos_w)+((*dir).y)*cos_phi;
    temp=sin_phi*sqrtf(1-((*dir).z)*((*dir).z))*cos_w+((*dir).z)*cos_phi;
    //used instead of float new_w to save registers
    cos_phi=new_u*new_u+new_v*new_v+temp*temp; //reused to save registers
    if ((cos_phi>1.0f)||(cos_phi<0.999f))
    {
        cos_phi=rsqrtf(cos_phi);
        new_u=new_u*cos_phi;
        new_v=new_v*cos_phi;
        temp=temp*cos_phi;
    }
    (*dir).x=new_u;
    (*dir).y=new_v;
    (*dir).z=temp;
}
__device__ void gpuinelastic(neutron2* neut, const int zaid, float3* dir,
    float* energy)
{
    //This calcs a new u,v,w, and energy after an isotropic collision
    int A;
    int C=zaid/1000;
    A=zaid-1000*C;
    float Q=gpurng(&((*neut).seed),(-1.0f*(*energy*A/(A+1.0f))));
    *energy=*energy+(A+1.0f)*Q/A;
    bool cont_loop=true;
    float mu_cm;
    while (cont_loop)
    {
        mu_cm=gpurng(&((*neut).seed),2.0f)-1.0f;
        float p=coeffs_d[(*neut).loc].a0
            +coeffs_d[(*neut).loc].a1*mu_cm
            +coeffs_d[(*neut).loc].a2*0.5f*(3.0f*mu_cm*mu_cm-1.0f)
            +coeffs_d[(*neut).loc].a3*0.5f*(5.0f*mu_cm*mu_cm*mu_cm-3.0f*mu_cm)
            +0.125f*(coeffs_d[(*neut).loc].a4*(35.0f*mu_cm*mu_cm*mu_cm*mu_cm-30.0f*mu_cm*mu_cm+3.0f))
            +(coeffs_d[(*neut).loc].a5*(63.0f*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm-70.0f*mu_cm*mu_cm*mu_cm+15.0f*mu_cm))
            +.0625f*(coeffs_d[(*neut).loc].a6*(231.0f*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm-315.0f*mu_cm*mu_cm*mu_cm*mu_cm+105.0f*mu_cm*mu_cm-5.0f))
            +coeffs_d[(*neut).loc].a7*(429.0f*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm-693.0f*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm+315.0f*mu_cm*mu_cm*mu_cm-35.0f*mu_cm);
        if (gpurng(&((*neut).seed),8.0f)<=p)
            cont_loop=false;
    }
    float cos_phi=cosf(atanf(sinf(acosf(mu_cm))/(1.0f/A+mu_cm)));
    float sin_phi=sinf(acosf(cos_phi));
    float cos_w=gpurng(&((*neut).seed),2.0f)-1.0f;
    float sin_w=sinf(acosf(cos_w));
    float temp=sin_phi*(rsqrtf(1.0f-((*dir).z)*((*dir).z))); //reused to save registers
    float new_u=temp*(((*dir).y)*sin_w-((*dir).y)*((*dir).x)*cos_w)+((*dir).x)*cos_phi;
    float new_v=temp*(-((*dir).x)*sin_w-((*dir).z)*((*dir).y)*cos_w)+((*dir).y)*cos_phi;
    temp=sin_phi*sqrtf(1.0f-((*dir).z)*((*dir).z))*cos_w+((*dir).z)*cos_phi;
    //used instead of float new_w to save registers
    cos_phi=new_u*new_u+new_v*new_v+temp*temp; //reused to save registers
    if ((cos_phi>1.0f)||(cos_phi<0.999f))
    {
        cos_phi=rsqrtf(cos_phi);
        new_u=new_u*cos_phi;
        new_v=new_v*cos_phi;
        temp=temp*cos_phi;
    }
    (*dir).x=new_u;
    (*dir).y=new_v;
    (*dir).z=temp;
}
__device__ short findzaid(const xsinfo* xs_data, int zaid)
{
    short j;
    for (short i=0; i<(*xs_data).listlength; i++)
        if ((*xs_data).list[i]==zaid)
            j=i;
    return j;
}
__device__ unsigned int accelSearch(unsigned int first, unsigned int last,
    const float key)
{
    while (first <= last)
    {
        unsigned int mid = (first + last) / 2; // compute mid point.
        if (key > reduced_Emesh_d[mid].value)
            first = mid + 1; // repeat search in top half.
        else if (key < reduced_Emesh_d[mid].value)
            last = mid - 1; // repeat search in bottom half.
        else
            return mid; // found it. return position
    }
    return last+1; // failed to find key
}
__device__ unsigned int binarySearch(const float* sortedArray, unsigned int
    first, unsigned int last, const float key)
{
    while (first <= last)
    {
        unsigned int mid = (first + last) / 2; // compute mid point.
        if (key > sortedArray[mid])
            first = mid + 1; // repeat search in top half.
        else if (key < sortedArray[mid])
            last = mid - 1; // repeat search in bottom half.
        else
            return mid; // found it. return position
    }
    return last+1; // failed to find key
}
__device__ unsigned int textureSearch(unsigned int first, unsigned int last,
    const float key)
{
    while (first <= last)
    {
        unsigned int mid = (first + last) / 2; // compute mid point.
        float val=tex1Dfetch(Emesh_tex,mid);
        if (key > val)
            first = mid + 1; // repeat search in top half.
        else if (key < val)
            last = mid - 1; // repeat search in bottom half.
        else
            return mid; // found it. return position
    }
    return last+1; // failed to find key
}
//************************************************//
__device__ float getmicro_t_nosrch(const float E, const xs* cross_sec, const
    int i, const short loc)
{
    float temp;
    if (E<=(*cross_sec).Emesh[0])
        temp=((*cross_sec).fission[0]+(*cross_sec).rad_cap[0]+(*cross_sec).elastic[0]+(*cross_sec).inelastic[0])*sqrtf((*cross_sec).Emesh[0]/E);
    else
    {
        float Ei=tex1Dfetch(Emesh_tex,i+Emesh_offsets_d[loc]);
        float Eim1=tex1Dfetch(Emesh_tex,i-1+Emesh_offsets_d[loc]);
        float4 xsi=tex1Dfetch(xsmesh_tex,i+Emesh_offsets_d[loc]);
        float4 xsim1=tex1Dfetch(xsmesh_tex,i-1+Emesh_offsets_d[loc]);
        temp=(xsi.x-xsim1.x)/(Ei-Eim1)*(E-Eim1)+xsim1.x;
        temp+=(xsi.y-xsim1.y)/(Ei-Eim1)*(E-Eim1)+xsim1.y;
        temp+=(xsi.z-xsim1.z)/(Ei-Eim1)*(E-Eim1)+xsim1.z;
        temp+=(xsi.w-xsim1.w)/(Ei-Eim1)*(E-Eim1)+xsim1.w;
    }
    return temp;
}
__device__ float getmicro_f_pretex(const float Ei, const float Eim1, const
    float4* xsi, const float4* xsim1, const float E, const short loc)
{
    if (E<=E0_d[loc])
        return (xs0_d[loc].x*sqrtf(E0_d[loc]/E));
    else
        return (((*xsi).x-(*xsim1).x)/(Ei-Eim1)*(E-Eim1)+(*xsim1).x);
}
__device__ float getmicro_c_pretex(const float Ei, const float Eim1, const
    float4* xsi, const float4* xsim1, const float E, const short loc)
{
    if (E<=E0_d[loc])
        return (xs0_d[loc].y*sqrtf(E0_d[loc]/E));
    else
        return (((*xsi).y-(*xsim1).y)/(Ei-Eim1)*(E-Eim1)+(*xsim1).y);
}
__device__ float getmicro_e_pretex(const float Ei, const float Eim1, const
    float4* xsi, const float4* xsim1, const float E, const short loc)
{
    if (E<=E0_d[loc])
        return (xs0_d[loc].z*sqrtf(E0_d[loc]/E));
    else
        return (((*xsi).z-(*xsim1).z)/(Ei-Eim1)*(E-Eim1)+(*xsim1).z);
}
__device__ float getmicro_i_pretex(const float Ei, const float Eim1, const
    float4* xsi, const float4* xsim1, const float E, const short loc)
{
    if (E<=E0_d[loc])
        return (xs0_d[loc].w*sqrtf(E0_d[loc]/E));
    else
        return (((*xsi).w-(*xsim1).w)/(Ei-Eim1)*(E-Eim1)+(*xsim1).w);
}
//************************************************//
__device__ float getmicro_t(const float E, const xs* cross_sec, const short loc)
{
    float temp;
    if (E<=E0_d[loc])
        temp=(xs0_d[loc].x+xs0_d[loc].y+xs0_d[loc].z+xs0_d[loc].w)*sqrtf(E0_d[loc]/E);
    else
    {
        //unsigned int i=accelSearch(reduced_offsets_d[loc],reduced_offsets_d[loc+1]-1,E);
        //i=binarySearch((*cross_sec).Emesh,reduced_Emesh_d[i-1].index,reduced_Emesh_d[i].index,E);
        unsigned int i=binarySearch((*cross_sec).Emesh,0,(*cross_sec).mesh_length,E);
        float Ei=tex1Dfetch(Emesh_tex,i+Emesh_offsets_d[loc]);
        float Eim1=tex1Dfetch(Emesh_tex,i-1+Emesh_offsets_d[loc]);
        float4 xsi=tex1Dfetch(xsmesh_tex,i+Emesh_offsets_d[loc]);
        float4 xsim1=tex1Dfetch(xsmesh_tex,i-1+Emesh_offsets_d[loc]);
        temp=(xsi.x-xsim1.x)/(Ei-Eim1)*(E-Eim1)+xsim1.x;
        temp+=(xsi.y-xsim1.y)/(Ei-Eim1)*(E-Eim1)+xsim1.y;
        temp+=(xsi.z-xsim1.z)/(Ei-Eim1)*(E-Eim1)+xsim1.z;
        temp+=(xsi.w-xsim1.w)/(Ei-Eim1)*(E-Eim1)+xsim1.w;
    }
    return temp;
}
__device__ void gpuget_Rxn(neutron2* neut, const xsinfo* xs_data, const float energy)
{
    unsigned int i=1;
    float Ei, Eim1;
    float4 xsi, xsim1;
    if (energy>E0_d[(*neut).loc])
    {
        //i=accelSearch(reduced_offsets_d[(*neut).loc],reduced_offsets_d[(*neut).loc+1]-1,energy);
        //i=binarySearch((*xs_data).cross_sec[(*neut).loc].Emesh,reduced_Emesh_d[i-1].index,reduced_Emesh_d[i].index,energy);
        i=binarySearch((*xs_data).cross_sec[(*neut).loc].Emesh,0,(*xs_data).cross_sec[(*neut).loc].mesh_length,energy);
    }
    Ei=tex1Dfetch(Emesh_tex,i+Emesh_offsets_d[(*neut).loc]);
    Eim1=tex1Dfetch(Emesh_tex,i-1+Emesh_offsets_d[(*neut).loc]);
    xsi=tex1Dfetch(xsmesh_tex,i+Emesh_offsets_d[(*neut).loc]);
    xsim1=tex1Dfetch(xsmesh_tex,i-1+Emesh_offsets_d[(*neut).loc]);
    float sigt, sigf, sigrc, sigis;
    sigf=getmicro_f_pretex(Ei,Eim1, &xsi, &xsim1, energy, (*neut).loc);
    sigrc=getmicro_c_pretex(Ei,Eim1, &xsi, &xsim1, energy, (*neut).loc);
    //siges=getmicro_e_pretex(Ei,Eim1, &xsi, &xsim1, energy, (*neut).loc);
    //sigt used here since siges is not necessary because of the way i've done it later, so i'll get the value
    //but just use a future variable instead. makes less registers necessary.
    sigt=getmicro_e_pretex(Ei,Eim1, &xsi, &xsim1, energy, (*neut).loc);
    sigis=getmicro_i_pretex(Ei,Eim1, &xsi, &xsim1, energy, (*neut).loc);
    sigt+=sigf+sigrc+sigis;
    float eta=gpurng(&((*neut).seed),sigt);
    //determine where in the range of scaled x/s the rand # is
    // if (((*neut).flag!='l')||((*neut).flag!='L'))
    if ((*neut).flag!='l')
    {
        if (eta <= sigf) (*neut).flag= 'f';
        else if (eta <= (sigf+sigrc)) (*neut).flag= 'c';
        else if (eta <= (sigt-sigis)) (*neut).flag= 'e';
        else if (eta <= (sigt)) (*neut).flag= 'i';
    }
    if (energy<1E-10f)
        (*neut).flag= 'c';
}
//*************************************************************************//
__device__ float gpugetmaterialmacroT(const xsinfo* xs_data, const neutron2* neut,
    const float energy)
{
    float macro=0.0f;
    for (int i=0; i<num_mat_d[(*neut).cell]; i++)
    {
        short loc=mat_list_d[i+NUCLIDE_MAX*(*neut).cell].zaid_loc;
        macro+=getmicro_t(energy,&((*xs_data).cross_sec[loc]),loc)*mat_list_d[i+NUCLIDE_MAX*(*neut).cell].density;
    }
    return macro;
}
__device__ int gpupicknuclide(neutron2* neut, const xsinfo* xs_data, const
    unsigned short cell, const float SigmaT, const float energy)
{
    float eta=gpurng(&((*neut).seed),1.0f)*SigmaT;
    float running_sum=0.0f;
    for (int i=0; i<num_mat_d[cell]; i++)
    {
        short loc=mat_list_d[i+NUCLIDE_MAX*cell].zaid_loc;
        running_sum+=getmicro_t(energy,&((*xs_data).cross_sec[loc]),loc)*mat_list_d[i+NUCLIDE_MAX*cell].density;
        if (eta < running_sum)
        {
            (*neut).loc=loc;
            return i;
        }
    }
    (*neut).loc=2;
    return 999;
}
//***************************************************************************//
void load_gpu(materials mat_list[][NUCLIDE_MAX], unsigned short num_shapes, shapes shapelist[], xsinfo* xs_data, xsinfo** xs_data_d)
{
    //cudaSetDevice(0);
    checkCUDAError("SetDevice");
    //pass mat_list
    materials mat1d[OBJECT_MAX*NUCLIDE_MAX];
    convert2darray(mat_list,mat1d);
    materials2 mat_temp[OBJECT_MAX*NUCLIDE_MAX];
    unsigned short num_mat_temp[OBJECT_MAX];
    for (int i=0; i<OBJECT_MAX*NUCLIDE_MAX; i++)
    {
        mat_temp[i].zaid_loc=mat1d[i].zaid_loc;
        mat_temp[i].density=mat1d[i].density;
    }
    for (int i=0; i<OBJECT_MAX; i++)
        num_mat_temp[i]=mat1d[0+NUCLIDE_MAX*i].num_mat;
    //Now I have mat_list in the separate forms; have to send to the GPU constant vars.
    cudaMemcpyToSymbol(mat_list_d,mat_temp,OBJECT_MAX*NUCLIDE_MAX*sizeof(materials2),0,cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(num_mat_d,num_mat_temp,OBJECT_MAX*sizeof(unsigned short),0,cudaMemcpyHostToDevice);
    checkCUDAError("matlist");
    //pass shape info
    cudaMemcpyToSymbol(shapelist_d,shapelist,OBJECT_MAX*sizeof(shapes),0,cudaMemcpyHostToDevice);
    checkCUDAError("shapelist");
    //pass xs_data
    //setup texture and offsets list
    unsigned int* offset_temp;
    offset_temp=(unsigned int*)malloc(((*xs_data).listlength+1)*sizeof(unsigned int));
    unsigned int sum=0;
    for (int i=0; i<(*xs_data).listlength; i++)
    {
        offset_temp[i]=sum;
        sum+=(*xs_data).cross_sec[i].mesh_length;
    }
    offset_temp[(*xs_data).listlength]=sum;
    cudaMemcpyToSymbol(Emesh_offsets_d,offset_temp,((*xs_data).listlength+1)*sizeof(unsigned int),0,cudaMemcpyHostToDevice);
    //offsets is done, start making texture source array
    float* Emesh;
    Emesh=(float*)malloc((sum+10)*sizeof(float));
    unsigned int k=0;
    for (int i=0; i<(*xs_data).listlength; i++)
    {
        for (unsigned int j=0; j<(*xs_data).cross_sec[i].mesh_length; j++)
        {
            Emesh[k]=(*xs_data).cross_sec[i].Emesh[j];
            k++;
        }
    }
    float* Emesh_d;
    cudaMalloc((void**)&Emesh_d,offset_temp[(*xs_data).listlength]*sizeof(float));
    cudaMemcpy(Emesh_d,Emesh,offset_temp[(*xs_data).listlength]*sizeof(float),cudaMemcpyHostToDevice);
    cudaBindTexture(NULL,Emesh_tex,Emesh_d,offset_temp[(*xs_data).listlength]*sizeof(float));
    float4* xsmesh;
    xsmesh=(float4*)malloc(offset_temp[(*xs_data).listlength]*sizeof(float4));
    k=0;
    for (int i=0; i<(*xs_data).listlength; i++)
    {
        for (unsigned int j=0; j<(*xs_data).cross_sec[i].mesh_length; j++)
        {
            xsmesh[k].x=(*xs_data).cross_sec[i].fission[j];
            xsmesh[k].y=(*xs_data).cross_sec[i].rad_cap[j];
            xsmesh[k].z=(*xs_data).cross_sec[i].elastic[j];
            xsmesh[k].w=(*xs_data).cross_sec[i].inelastic[j];
            k++;
        }
    }
    float4* xsmesh_d;
    cudaMalloc((void**)&xsmesh_d,offset_temp[(*xs_data).listlength]*sizeof(float4));
    cudaMemcpy(xsmesh_d,xsmesh,offset_temp[(*xs_data).listlength]*sizeof(float4),cudaMemcpyHostToDevice);
    cudaBindTexture(NULL,xsmesh_tex,xsmesh_d,offset_temp[(*xs_data).listlength]*sizeof(float4));
    //do my 0s
    float* E0;
    float4* xs0;
    E0=(float*)malloc((*xs_data).listlength*sizeof(float));
    xs0=(float4*)malloc((*xs_data).listlength*sizeof(float4));
    for (int i=0; i<(*xs_data).listlength; i++)
    {
        E0[i]=(*xs_data).cross_sec[i].Emesh[0];
        xs0[i].x=(*xs_data).cross_sec[i].fission[0];
        xs0[i].y=(*xs_data).cross_sec[i].rad_cap[0];
        xs0[i].z=(*xs_data).cross_sec[i].elastic[0];
        xs0[i].w=(*xs_data).cross_sec[i].inelastic[0];
    }
    cudaMemcpyToSymbol(E0_d,E0,(*xs_data).listlength*sizeof(float),0,cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(xs0_d,xs0,(*xs_data).listlength*sizeof(float4),0,cudaMemcpyHostToDevice);
    free(E0);
    free(xs0);
    free(offset_temp);
    free(Emesh);
    free(xsmesh);
    //Done with texture
    reduced_array* reduced_Emesh;
    reduced_Emesh=(reduced_array*)malloc(CONST_SIZE*sizeof(reduced_array));
    unsigned int* reduced_offsets;
    reduced_offsets=(unsigned int*)malloc(((*xs_data).listlength+1)*sizeof(unsigned int));
    int reduction_factor;
    Emesh_reduction(xs_data,&reduced_Emesh,&reduced_offsets,&reduction_factor);
    cudaMemcpyToSymbol(reduced_Emesh_d,reduced_Emesh,CONST_SIZE*sizeof(reduced_array),0,cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(reduced_offsets_d,reduced_offsets,((*xs_data).listlength+1)*sizeof(unsigned int),0,cudaMemcpyHostToDevice);
    free(reduced_offsets);
    free(reduced_Emesh);
    //legendres
    legendres* coeffs_h=NULL;
    coeffs_h=(legendres*)malloc(NUCLIDE_MAX*sizeof(legendres));
    for (int j=0; j<(*xs_data).listlength; j++)
        coeffs_h[j]=(*xs_data).cross_sec[j].coeffs;
    cudaMemcpyToSymbol(coeffs_d,coeffs_h,NUCLIDE_MAX*sizeof(legendres),0,cudaMemcpyHostToDevice);
    cudaThreadSynchronize();
    free(coeffs_h);
    //continue with raw x/s data
    xsinfo tempxs;
    xs* temp;
    temp=(xs*)malloc((*xs_data).listlength*sizeof(xs));
    tempxs.cross_sec=(xs*)malloc((*xs_data).listlength*sizeof(xs));
    tempxs.list=(int*)malloc((*xs_data).listlength*sizeof(int));
    for (int j=0; j<(*xs_data).listlength; j++)
    {
        int mesh_length=(*xs_data).cross_sec[j].mesh_length;
        float* rc; //rad_cap
        float* f;  //fission
        float* e;  //elastic
        float* i;  //inelastic
        float* E;  //Emesh
        cudaMalloc((void**)&rc,mesh_length*sizeof(float));
        cudaMalloc((void**)&f,mesh_length*sizeof(float));
        cudaMalloc((void**)&e,mesh_length*sizeof(float));
        cudaMalloc((void**)&i,mesh_length*sizeof(float));
        cudaMalloc((void**)&E,mesh_length*sizeof(float));
        cudaMemcpy(rc,((*xs_data).cross_sec[j].rad_cap),mesh_length*sizeof(float),cudaMemcpyHostToDevice);
        cudaMemcpy(f,((*xs_data).cross_sec[j].fission),mesh_length*sizeof(float),cudaMemcpyHostToDevice);
        cudaMemcpy(e,((*xs_data).cross_sec[j].elastic),mesh_length*sizeof(float),cudaMemcpyHostToDevice);
        cudaMemcpy(i,((*xs_data).cross_sec[j].inelastic),mesh_length*sizeof(float),cudaMemcpyHostToDevice);
        cudaMemcpy(E,((*xs_data).cross_sec[j].Emesh),mesh_length*sizeof(float),cudaMemcpyHostToDevice);
        temp[j].coeffs=(*xs_data).cross_sec[j].coeffs;
        temp[j].a=(*xs_data).cross_sec[j].a;
        temp[j].b=(*xs_data).cross_sec[j].b;
        temp[j].c=(*xs_data).cross_sec[j].c;
        temp[j].d=(*xs_data).cross_sec[j].d;
        temp[j].mesh_length=mesh_length;
        temp[j].rad_cap=rc;
        temp[j].fission=f;
        temp[j].elastic=e;
        temp[j].inelastic=i;
        temp[j].Emesh=E;
    }
    xs* temp_d;
    cudaMalloc((void**)&temp_d,(*xs_data).listlength*sizeof(xs));
    cudaMemcpy(temp_d,temp,(*xs_data).listlength*sizeof(xs),cudaMemcpyHostToDevice);
    tempxs.cross_sec=temp_d;
    int* tempint;
    cudaMalloc((void**)&tempint,(*xs_data).listlength*sizeof(int));
    cudaMemcpy(tempint,(*xs_data).list,(*xs_data).listlength*sizeof(int),cudaMemcpyHostToDevice);
    tempxs.list=tempint;
    tempxs.listlength=(*xs_data).listlength;
    cudaMalloc((void**)xs_data_d,sizeof(xsinfo));
    cudaMemcpy(*xs_data_d,&tempxs,sizeof(xsinfo),cudaMemcpyHostToDevice);
    // free(tempxs.list);
    free(temp);
}
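The offset_temp array that load_gpu uploads to Emesh_offsets_d is an exclusive prefix sum of the per-nuclide mesh lengths, with one trailing entry holding the grand total, so that entry i gives where nuclide i's energy mesh starts in the flattened texture array and entry i+1 minus entry i recovers its length. A small standalone C sketch of that indexing scheme (names are illustrative):

```c
#include <assert.h>

/* Exclusive prefix sum over mesh lengths, as in load_gpu's offset_temp:
 * offsets[i] is the start of nuclide i's mesh in the flattened array;
 * offsets[n] is the total number of mesh points across all nuclides. */
void build_offsets(const unsigned int *mesh_length, int n, unsigned int *offsets)
{
    unsigned int sum = 0;
    for (int i = 0; i < n; i++)
    {
        offsets[i] = sum;
        sum += mesh_length[i];
    }
    offsets[n] = sum;   /* trailing total, so offsets has n+1 entries */
}
```

For mesh lengths {4, 2, 5} this yields offsets {0, 4, 6, 11}, matching the (listlength+1)-sized allocation in load_gpu.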
void loadneutlist(unsigned int listlength, neutron* host, neutron* device)
{
    //Assumes cudaMalloc already run for the lists
    cudaMemcpy(device, host, listlength*sizeof(neutron), cudaMemcpyHostToDevice);
    checkCUDAError("load");
}

void getneutlist(unsigned int listlength, neutron* host, neutron* device)
{
    //Assumes cudaMalloc already run for the lists
    cudaMemcpy(host, device, listlength*sizeof(neutron), cudaMemcpyDeviceToHost);
    checkCUDAError("get");
}
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}
void convert2darray(materials mat2d[][NUCLIDE_MAX], materials * mat1d)
{
    int k=0;
    for (int i=0; i<OBJECT_MAX; i++)
        for (int j=0; j<NUCLIDE_MAX; j++)
            mat1d[k++]=mat2d[i][j];
}
void configure_cuda_call(int required_threads, int * blocks)
{
    if ((required_threads%THREAD_COUNT)!=0)
        *blocks=required_threads/THREAD_COUNT+1;
    else
        *blocks=required_threads/THREAD_COUNT;
}
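configure_cuda_call rounds the block count up so that blocks*THREAD_COUNT covers every requested thread; the branch is equivalent to the usual one-line ceiling-division idiom. A quick C check of that equivalence (the THREAD_COUNT value below is an arbitrary stand-in, not the code's actual constant):

```c
#include <assert.h>

#define THREAD_COUNT 256   /* illustrative value only */

/* Same rounding as configure_cuda_call, written as ceiling division. */
int blocks_for(int required_threads)
{
    return (required_threads + THREAD_COUNT - 1) / THREAD_COUNT;
}
```

With 256 threads per block, 1000 required threads launch 4 blocks (1024 threads), so each kernel must still guard against the excess thread indices.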
void Emesh_reduction(xsinfo* xs_data, reduced_array** list, unsigned int** reduced_offsets, int* reduction_factor)
{
    //determine total size
    unsigned int tot_size=0;
    for (int i=0; i<(*xs_data).listlength; i++)
        tot_size+=(*xs_data).cross_sec[i].mesh_length;
    //determine reduction_factor value
    *reduction_factor=1+tot_size/(CONST_SIZE-(*xs_data).listlength);
    //Now go through and set reduced_array to what it needs, as well as the offsets to use for reference
    unsigned int k=0;
    for (unsigned int i=0; i<(*xs_data).listlength; i++)
    {
        (*reduced_offsets)[i]=k;
        unsigned int length;
        if ((*xs_data).cross_sec[i].mesh_length%(*reduction_factor)==0)
            length=(*xs_data).cross_sec[i].mesh_length/(*reduction_factor);
        else
            length=(*xs_data).cross_sec[i].mesh_length/(*reduction_factor)+1;
        for (unsigned int j=0; j<length; j++)
        {
            (*list)[k].value=(*xs_data).cross_sec[i].Emesh[(*reduction_factor)*j];
            (*list)[k].index=(*reduction_factor)*j;
            k++;
        }
        (*list)[k].value=(*xs_data).cross_sec[i].Emesh[(*xs_data).cross_sec[i].mesh_length-1];
        (*list)[k].index=(*xs_data).cross_sec[i].mesh_length-1;
        k++;
    }
    (*reduced_offsets)[(*xs_data).listlength]=k;
}
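Emesh_reduction coarsens the union energy mesh, keeping every reduction_factor-th point per nuclide (plus each nuclide's final mesh point) so the reduced list fits in the CONST_SIZE constant-memory budget, with one slot per nuclide reserved for that appended final point. The two sizing formulas can be sketched in isolation (all values below are illustrative, not the code's actual constants):

```c
#include <assert.h>

/* Stride as computed in Emesh_reduction: reserve one slot per nuclide for
 * its final mesh point, then size the stride so the coarse mesh fits. */
int compute_reduction_factor(unsigned int tot_size, unsigned int const_size,
                             unsigned int listlength)
{
    return 1 + (int)(tot_size / (const_size - listlength));
}

/* Coarse points kept for one nuclide: ceil(mesh_length / factor) strided
 * samples plus the explicitly appended final mesh point. */
unsigned int reduced_length(unsigned int mesh_length, unsigned int factor)
{
    unsigned int length = mesh_length / factor;
    if (mesh_length % factor != 0)
        length += 1;
    return length + 1;   /* +1 for the appended final point */
}
```

For example, with a factor of 3 the 228-point 1H mesh of Appendix C keeps 76 strided samples plus the final point, 77 entries in all.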
Appendix C
Sample Input Files
C.1 Geometry File “array2.geo”
53
s -225 0 -150 0 -150 0 1600
999
s -225 0 -150 0 -150 0 2500
0 999
s -75 0 -150 0 -150 0 1600
999
s -75 0 -150 0 -150 0 2500
2 999
s 75 0 -150 0 -150 0 1600
999
s 75 0 -150 0 -150 0 2500
4 999
s 225 0 -150 0 -150 0 1600
999
s 225 0 -150 0 -150 0 2500
6 999
s -225 0 0 0 -150 0 1600
999
s -225 0 0 0 -150 0 2500
8 999
s -75 0 0 0 -150 0 1600
999
s -75 0 0 0 -150 0 2500
10 999
s 75 0 0 0 -150 0 1600
999
s 75 0 0 0 -150 0 2500
12 999
s 225 0 0 0 -150 0 1600
999
s 225 0 0 0 -150 0 2500
14 999
s -225 0 150 0 -150 0 1600
999
s -225 0 150 0 -150 0 2500
16 999
s -75 0 150 0 -150 0 1600
999
s -75 0 150 0 -150 0 2500
18 999
s 75 0 150 0 -150 0 1600
999
s 75 0 150 0 -150 0 2500
20 999
s 225 0 150 0 -150 0 1600
999
s 225 0 150 0 -150 0 2500
22 999
s -225 0 -150 0 150 0 1600
999
s -225 0 -150 0 150 0 2500
24 999
s -75 0 -150 0 150 0 1600
999
s -75 0 -150 0 150 0 2500
26 999
s 75 0 -150 0 150 0 1600
999
s 75 0 -150 0 150 0 2500
28 999
s 225 0 -150 0 150 0 1600
999
s 225 0 -150 0 150 0 2500
30 999
s -225 0 0 0 150 0 1600
999
s -225 0 0 0 150 0 2500
32 999
s -75 0 0 0 150 0 1600
999
s -75 0 0 0 150 0 2500
34 999
s 75 0 0 0 150 0 1600
999
s 75 0 0 0 150 0 2500
36 999
s 225 0 0 0 150 0 1600
999
s 225 0 0 0 150 0 2500
38 999
s -225 0 150 0 150 0 1600
999
s -225 0 150 0 150 0 2500
40 999
s -75 0 150 0 150 0 1600
999
s -75 0 150 0 150 0 2500
42 999
s 75 0 150 0 150 0 1600
999
s 75 0 150 0 150 0 2500
44 999
s 225 0 150 0 150 0 1600
999
s 225 0 150 0 150 0 2500
46 999
s -325 0 -200 0 200 -200 625
999
s -325 0 200 0 200 -200 625
999
s 325 0 -200 0 200 -200 625
999
s 325 0 200 0 200 -200 625
999
p 375 -375 250 -250 250 -250 0
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 999
C.2 Material File “array2.mat”
53
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
1 40092 0.042512584
1 40092 0.042512584
1 40092 0.042512584
1 40092 0.042512584
3 1001 0.07 1002 1.00E-005 8016 0.033427788
C.3 Neutron Source File “array2.src”
95
-225 -150 -150 1.65
-225 -150 -150 0.24
-225 -150 -150 1.73
-225 -150 -150 1.62
-75 -150 -150 1.19
-75 -150 -150 0.31
-75 -150 -150 0.57
-75 -150 -150 0.65
75 -150 -150 1.02
75 -150 -150 1.65
75 -150 -150 1.47
75 -150 -150 0.02
225 -150 -150 0.71
225 -150 -150 1.38
225 -150 -150 1.32
225 -150 -150 0.24
-225 0 -150 0.82
-225 0 -150 1.72
-225 0 -150 0.29
-225 0 -150 0.26
-75 0 -150 0.56
-75 0 -150 0.91
-75 0 -150 0.18
-75 0 -150 0.63
75 0 -150 1.69
75 0 -150 1.15
75 0 -150 0.62
75 0 -150 0.45
225 0 -150 1.73
225 0 -150 1.26
225 0 -150 1.69
225 0 -150 1.58
-225 150 -150 0.09
-225 150 -150 0.96
-225 150 -150 1.44
-225 150 -150 0.33
-75 150 -150 1.22
-75 150 -150 0.21
-75 150 -150 1
-75 150 -150 0.85
75 150 -150 1.55
75 150 -150 0.97
75 150 -150 0.88
75 150 -150 0.62
225 150 -150 1.92
225 150 -150 0.07
225 150 -150 0.78
-225 -150 150 0.4
-225 -150 150 0.26
-225 -150 150 1.65
-225 -150 150 1.94
-75 -150 150 0.57
-75 -150 150 1.55
-75 -150 150 0.23
-75 -150 150 1.03
75 -150 150 1.56
75 -150 150 1.45
75 -150 150 0.82
75 -150 150 1.3
225 -150 150 0.78
225 -150 150 0.76
225 -150 150 0.57
225 -150 150 0.52
-225 0 150 1.74
-225 0 150 0.01
-225 0 150 0.42
-225 0 150 1.05
-75 0 150 0.38
-75 0 150 1.02
-75 0 150 1.17
-75 0 150 0.11
75 0 150 0.3
75 0 150 1.21
75 0 150 1.15
75 0 150 0.71
225 0 150 1.31
225 0 150 1.65
225 0 150 0.46
225 0 150 1.69
-225 150 150 1.88
-225 150 150 0.57
-225 150 150 0.11
-225 150 150 0.69
-75 150 150 0.72
-75 150 150 0.26
-75 150 150 1.02
-75 150 150 0.26
75 150 150 0.67
75 150 150 0.06
75 150 150 1.99
75 150 150 0.78
225 150 150 0.2
225 150 150 0.1
225 150 150 1.71
225 150 150 1.54
C.4 Nuclear Data File for 1H, “1001”
0.453
-1.036
2.29
0.35820615
228
1.000000000E-08 0.000000000E+00 5.298173371E-01 4.177776730E+01 0.000000000E+00
1.107437000E-08 0.000000000E+00 5.035555670E-01 4.015175885E+01 0.000000000E+00
1.265500000E-08 0.000000000E+00 4.711790902E-01 3.817722728E+01 0.000000000E+00
1.581625000E-08 0.000000000E+00 4.216244227E-01 3.523465417E+01 0.000000000E+00
1.897750000E-08 0.000000000E+00 3.849698537E-01 3.313620967E+01 0.000000000E+00
2.213875000E-08 0.000000000E+00 3.564179224E-01 3.155882285E+01 0.000000000E+00
2.530000000E-08 0.000000000E+00 3.333575608E-01 3.032768500E+01 0.000000000E+00
3.140196000E-08 0.000000000E+00 2.990841608E-01 2.857972139E+01 0.000000000E+00
3.750392000E-08 0.000000000E+00 2.735507801E-01 2.734986363E+01 0.000000000E+00
4.360588000E-08 0.000000000E+00 2.536139026E-01 2.643796224E+01 0.000000000E+00
4.970785000E-08 0.000000000E+00 2.375100203E-01 2.573547978E+01 0.000000000E+00
6.191177000E-08 0.000000000E+00 2.128573136E-01 2.472589906E+01 0.000000000E+00
7.411570000E-08 0.000000000E+00 1.946249396E-01 2.403706148E+01 0.000000000E+00
8.631962000E-08 0.000000000E+00 1.804030338E-01 2.353822209E+01 0.000000000E+00
9.852355000E-08 0.000000000E+00 1.688842471E-01 2.316088436E+01 0.000000000E+00
1.229314000E-07 0.000000000E+00 1.511597703E-01 2.262880036E+01 0.000000000E+00
1.473392000E-07 0.000000000E+00 1.380128318E-01 2.227209633E+01 0.000000000E+00
1.717471000E-07 0.000000000E+00 1.277882878E-01 2.201652395E+01 0.000000000E+00
1.961549000E-07 0.000000000E+00 1.195550201E-01 2.182446860E+01 0.000000000E+00
2.205628000E-07 0.000000000E+00 1.127417884E-01 2.167488172E+01 0.000000000E+00
2.693785000E-07 0.000000000E+00 1.020218294E-01 2.145697307E+01 0.000000000E+00
3.181942000E-07 0.000000000E+00 9.387577644E-02 2.130587858E+01 0.000000000E+00
3.670099000E-07 0.000000000E+00 8.741219777E-02 2.119494404E+01 0.000000000E+00
4.158257000E-07 0.000000000E+00 8.212085182E-02 2.111003071E+01 0.000000000E+00
5.134571000E-07 0.000000000E+00 7.389935283E-02 2.098858820E+01 0.000000000E+00
6.110886000E-07 0.000000000E+00 6.773638636E-02 2.090590029E+01 0.000000000E+00
7.087200000E-07 0.000000000E+00 6.289906779E-02 2.084595772E+01 0.000000000E+00
8.063515000E-07 0.000000000E+00 5.897139969E-02 2.080050293E+01 0.000000000E+00
1.001614000E-06 0.000000000E+00 5.290901083E-02 2.073611997E+01 0.000000000E+00
1.196877000E-06 0.000000000E+00 4.838950080E-02 2.069269073E+01 0.000000000E+00
1.392140000E-06 0.000000000E+00 4.486622308E-02 2.066140632E+01 0.000000000E+00
1.587403000E-06 0.000000000E+00 4.202557700E-02 2.063778986E+01 0.000000000E+00
1.977929000E-06 0.000000000E+00 3.765199272E-02 2.060448592E+01 0.000000000E+00
2.368455000E-06 0.000000000E+00 3.438960244E-02 2.058210845E+01 0.000000000E+00
2.758981000E-06 0.000000000E+00 3.185657365E-02 2.056602700E+01 0.000000000E+00
3.149507000E-06 0.000000000E+00 2.983053953E-02 2.055390553E+01 0.000000000E+00
3.930559000E-06 0.000000000E+00 2.670911071E-02 2.053682727E+01 0.000000000E+00
4.711611000E-06 0.000000000E+00 2.437652734E-02 2.052535340E+01 0.000000000E+00
5.492663000E-06 0.000000000E+00 2.256801568E-02 2.051710316E+01 0.000000000E+00
6.273715000E-06 0.000000000E+00 2.113030270E-02 2.051087946E+01 0.000000000E+00
7.835818000E-06 0.000000000E+00 1.891307671E-02 2.050209192E+01 0.000000000E+00
9.397922000E-06 0.000000000E+00 1.725598625E-02 2.049616727E+01 0.000000000E+00
1.096003000E-05 0.000000000E+00 1.597126059E-02 2.049189130E+01 0.000000000E+00
1.252213000E-05 0.000000000E+00 1.495041302E-02 2.048865466E+01 0.000000000E+00
1.564634000E-05 0.000000000E+00 1.337845769E-02 2.048405743E+01 0.000000000E+00
1.877055000E-05 0.000000000E+00 1.220592573E-02 2.048093233E+01 0.000000000E+00
2.189476000E-05 0.000000000E+00 1.129651330E-02 2.047865894E+01 0.000000000E+00
2.501897000E-05 0.000000000E+00 1.057158665E-02 2.047692474E+01 0.000000000E+00
3.126739000E-05 0.000000000E+00 9.458095307E-03 2.047443405E+01 0.000000000E+00
3.751581000E-05 0.000000000E+00 8.629782437E-03 2.047271502E+01 0.000000000E+00
4.376423000E-05 0.000000000E+00 7.987087330E-03 2.047144659E+01 0.000000000E+00
5.001265000E-05 0.000000000E+00 7.473133016E-03 2.047046604E+01 0.000000000E+00
6.250948000E-05 0.000000000E+00 6.685074944E-03 2.046903130E+01 0.000000000E+00
7.500632000E-05 0.000000000E+00 6.100163053E-03 2.046801696E+01 0.000000000E+00
8.750316000E-05 0.000000000E+00 5.646157483E-03 2.046725188E+01 0.000000000E+00
1.000000000E-04 0.000000000E+00 5.281919301E-03 2.046622089E+01 0.000000000E+00
1.231110000E-04 0.000000000E+00 4.758629197E-03 2.045534330E+01 0.000000000E+00
1.495397000E-04 0.000000000E+00 4.314992030E-03 2.044479941E+01 0.000000000E+00
1.813991000E-04 0.000000000E+00 3.915263847E-03 2.043439733E+01 0.000000000E+00
2.132585000E-04 0.000000000E+00 3.609086338E-03 2.042573052E+01 0.000000000E+00
2.531544000E-04 0.000000000E+00 3.310817300E-03 2.041658618E+01 0.000000000E+00
2.930504000E-04 0.000000000E+00 3.075822085E-03 2.040881000E+01 0.000000000E+00
3.418519000E-04 0.000000000E+00 2.846555201E-03 2.040065064E+01 0.000000000E+00
3.906535000E-04 0.000000000E+00 2.661778316E-03 2.039359925E+01 0.000000000E+00
4.496277000E-04 0.000000000E+00 2.480099055E-03 2.038618621E+01 0.000000000E+00
5.086020000E-04 0.000000000E+00 2.331055625E-03 2.037969961E+01 0.000000000E+00
5.784139000E-04 0.000000000E+00 2.185078966E-03 2.037294114E+01 0.000000000E+00
6.482258000E-04 0.000000000E+00 2.063403776E-03 2.036696178E+01 0.000000000E+00
7.298269000E-04 0.000000000E+00 1.944000937E-03 2.036074794E+01 0.000000000E+00
8.114280000E-04 0.000000000E+00 1.843322036E-03 2.035520706E+01 0.000000000E+00
1.000000000E-03 0.000000000E+00 1.659702848E-03 2.034354300E+01 0.000000000E+00
1.138267000E-03 0.000000000E+00 1.560128206E-03 2.030222925E+01 0.000000000E+00
1.288833000E-03 0.000000000E+00 1.470477857E-03 2.026189423E+01 0.000000000E+00
1.452245000E-03 0.000000000E+00 1.389192251E-03 2.022321533E+01 0.000000000E+00
1.629032000E-03 0.000000000E+00 1.315306590E-03 2.018609993E+01 0.000000000E+00
2.025060000E-03 0.000000000E+00 1.185236323E-03 2.011588892E+01 0.000000000E+00
2.481141000E-03 0.000000000E+00 1.068877755E-03 2.005067041E+01 0.000000000E+00
3.002791000E-03 0.000000000E+00 9.697747793E-04 1.998952169E+01 0.000000000E+00
3.593504000E-03 0.000000000E+00 8.817281932E-04 1.993216003E+01 0.000000000E+00
4.257798000E-03 0.000000000E+00 8.051065475E-04 1.987812610E+01 0.000000000E+00
5.000000000E-03 0.000000000E+00 7.375931888E-04 1.982645212E+01 0.000000000E+00
6.034926000E-03 0.000000000E+00 6.640425648E-04 1.966107082E+01 0.000000000E+00
7.207047000E-03 0.000000000E+00 6.000072207E-04 1.950570484E+01 0.000000000E+00
8.525727000E-03 0.000000000E+00 5.445084466E-04 1.935967800E+01 0.000000000E+00
1.000000000E-02 0.000000000E+00 4.958639821E-04 1.922153614E+01 0.000000000E+00
1.206215000E-02 0.000000000E+00 4.426316209E-04 1.892057209E+01 0.000000000E+00
1.440269000E-02 0.000000000E+00 3.975181418E-04 1.863943558E+01 0.000000000E+00
2.000000000E-02 0.000000000E+00 3.238853737E-04 1.812960941E+01 0.000000000E+00
2.464048000E-02 0.000000000E+00 2.832718308E-04 1.763113001E+01 0.000000000E+00
3.000000000E-02 0.000000000E+00 2.491047128E-04 1.717288258E+01 0.000000000E+00
3.500000000E-02 0.000000000E+00 2.225018744E-04 1.671338937E+01 0.000000000E+00
4.500000000E-02 0.000000000E+00 1.897098384E-04 1.592276684E+01 0.000000000E+00
5.500000000E-02 0.000000000E+00 1.651864579E-04 1.521350215E+01 0.000000000E+00
6.500000000E-02 0.000000000E+00 1.469345529E-04 1.457368221E+01 0.000000000E+00
9.000000000E-02 0.000000000E+00 1.164022971E-04 1.322601849E+01 0.000000000E+00
1.100000000E-01 0.000000000E+00 1.001933883E-04 1.233293832E+01 0.000000000E+00
1.400000000E-01 0.000000000E+00 8.325794400E-05 1.125504543E+01 0.000000000E+00
1.600000000E-01 0.000000000E+00 7.514343826E-05 1.065204516E+01 0.000000000E+00
1.900000000E-01 0.000000000E+00 6.585397216E-05 9.884190192E+00 0.000000000E+00
2.400000000E-01 0.000000000E+00 5.559232649E-05 8.878080473E+00 0.000000000E+00
3.000000000E-01 0.000000000E+00 4.829071169E-05 7.967857643E+00 0.000000000E+00
3.400000000E-01 0.000000000E+00 4.501850772E-05 7.485047186E+00 0.000000000E+00
4.000000000E-01 0.000000000E+00 4.149042868E-05 6.891235806E+00 0.000000000E+00
4.600000000E-01 0.000000000E+00 3.913823594E-05 6.411728021E+00 0.000000000E+00
5.500000000E-01 0.000000000E+00 3.739607821E-05 5.841250518E+00 0.000000000E+00
6.500000000E-01 0.000000000E+00 3.616309348E-05 5.350436405E+00 0.000000000E+00
7.500000000E-01 0.000000000E+00 3.516704102E-05 4.961127520E+00 0.000000000E+00
8.500000000E-01 0.000000000E+00 3.474201043E-05 4.642921467E+00 0.000000000E+00
1.000000000E+00 0.000000000E+00 3.446014919E-05 4.258621960E+00 0.000000000E+00
1.100000000E+00 0.000000000E+00 3.444000000E-05 4.047100000E+00 0.000000000E+00
1.200000000E+00 0.000000000E+00 3.441000000E-05 3.862500000E+00 0.000000000E+00
1.300000000E+00 0.000000000E+00 3.449000000E-05 3.699200000E+00 0.000000000E+00
1.400000000E+00 0.000000000E+00 3.436000000E-05 3.553300000E+00 0.000000000E+00
1.500000000E+00 0.000000000E+00 3.434000000E-05 3.421900000E+00 0.000000000E+00
1.600000000E+00 0.000000000E+00 3.431000000E-05 3.302500000E+00 0.000000000E+00
1.700000000E+00 0.000000000E+00 3.429000000E-05 3.193400000E+00 0.000000000E+00
1.800000000E+00 0.000000000E+00 3.427000000E-05 3.093100000E+00 0.000000000E+00
1.900000000E+00 0.000000000E+00 3.425000000E-05 3.000300000E+00 0.000000000E+00
2.000000000E+00 0.000000000E+00 3.423000000E-05 2.914200000E+00 0.000000000E+00
2.100000000E+00 0.000000000E+00 3.437814803E-05 2.833900000E+00 0.000000000E+00
2.200000000E+00 0.000000000E+00 3.452000000E-05 2.758800000E+00 0.000000000E+00
2.300000000E+00 0.000000000E+00 3.466785004E-05 2.688300000E+00 0.000000000E+00
2.400000000E+00 0.000000000E+00 3.481000000E-05 2.621900000E+00 0.000000000E+00
2.500000000E+00 0.000000000E+00 3.495760014E-05 2.559200000E+00 0.000000000E+00
2.600000000E+00 0.000000000E+00 3.510000000E-05 2.499900000E+00 0.000000000E+00
2.700000000E+00 0.000000000E+00 3.524738762E-05 2.443600000E+00 0.000000000E+00
2.800000000E+00 0.000000000E+00 3.539000000E-05 2.390200000E+00 0.000000000E+00
2.900000000E+00 0.000000000E+00 3.553720474E-05 2.339300000E+00 0.000000000E+00
3.000000000E+00 0.000000000E+00 3.568000000E-05 2.290700000E+00 0.000000000E+00
3.200000000E+00 0.000000000E+00 3.580000000E-05 2.199900000E+00 0.000000000E+00
3.400000000E+00 0.000000000E+00 3.591000000E-05 2.116600000E+00 0.000000000E+00
3.600000000E+00 0.000000000E+00 3.603000000E-05 2.039800000E+00 0.000000000E+00
3.800000000E+00 0.000000000E+00 3.614000000E-05 1.968600000E+00 0.000000000E+00
4.000000000E+00 0.000000000E+00 3.626000000E-05 1.902400000E+00 0.000000000E+00
4.200000000E+00 0.000000000E+00 3.629000000E-05 1.840600000E+00 0.000000000E+00
4.400000000E+00 0.000000000E+00 3.632000000E-05 1.782800000E+00 0.000000000E+00
4.600000000E+00 0.000000000E+00 3.636000000E-05 1.728600000E+00 0.000000000E+00
4.800000000E+00 0.000000000E+00 3.639000000E-05 1.677500000E+00 0.000000000E+00
5.000000000E+00 0.000000000E+00 3.642000000E-05 1.629400000E+00 0.000000000E+00
5.200000000E+00 0.000000000E+00 3.629000000E-05 1.583544516E+00 0.000000000E+00
5.400000000E+00 0.000000000E+00 3.616000000E-05 1.540638627E+00 0.000000000E+00
5.500000000E+00 0.000000000E+00 3.609940466E-05 1.520200000E+00 0.000000000E+00
5.600000000E+00 0.000000000E+00 3.604000000E-05 1.499846352E+00 0.000000000E+00
5.800000000E+00 0.000000000E+00 3.591000000E-05 1.460986149E+00 0.000000000E+00
6.000000000E+00 0.000000000E+00 3.578000000E-05 1.424400000E+00 0.000000000E+00
6.200000000E+00 0.000000000E+00 3.567000000E-05 1.389030836E+00 0.000000000E+00
6.400000000E+00 0.000000000E+00 3.556000000E-05 1.355621779E+00 0.000000000E+00
6.500000000E+00 0.000000000E+00 3.550453431E-05 1.339600000E+00 0.000000000E+00
6.600000000E+00 0.000000000E+00 3.545000000E-05 1.323642369E+00 0.000000000E+00
6.800000000E+00 0.000000000E+00 3.534000000E-05 1.292987067E+00 0.000000000E+00
7.000000000E+00 0.000000000E+00 3.523000000E-05 1.263900000E+00 0.000000000E+00
7.500000000E+00 0.000000000E+00 3.459000000E-05 1.195900000E+00 0.000000000E+00
8.000000000E+00 0.000000000E+00 3.394000000E-05 1.134500000E+00 0.000000000E+00
8.500000000E+00 0.000000000E+00 3.364000000E-05 1.078600000E+00 0.000000000E+00
9.000000000E+00 0.000000000E+00 3.333000000E-05 1.027700000E+00 0.000000000E+00
9.500000000E+00 0.000000000E+00 3.296000000E-05 9.810600000E-01 0.000000000E+00
1.000000000E+01 0.000000000E+00 3.259000000E-05 9.381600000E-01 0.000000000E+00
1.050000000E+01 0.000000000E+00 3.221000000E-05 8.985700000E-01 0.000000000E+00
1.100000000E+01 0.000000000E+00 3.182000000E-05 8.619400000E-01 0.000000000E+00
1.150000000E+01 0.000000000E+00 3.145000000E-05 8.279300000E-01 0.000000000E+00
1.200000000E+01 0.000000000E+00 3.108000000E-05 7.962800000E-01 0.000000000E+00
1.250000000E+01 0.000000000E+00 3.063000000E-05 7.667600000E-01 0.000000000E+00
1.300000000E+01 0.000000000E+00 3.018000000E-05 7.391500000E-01 0.000000000E+00
1.350000000E+01 0.000000000E+00 3.001000000E-05 7.132900000E-01 0.000000000E+00
1.400000000E+01 0.000000000E+00 2.983000000E-05 6.890000000E-01 0.000000000E+00
1.450000000E+01 0.000000000E+00 2.940000000E-05 6.661500000E-01 0.000000000E+00
1.500000000E+01 0.000000000E+00 2.896000000E-05 6.446200000E-01 0.000000000E+00
1.550000000E+01 0.000000000E+00 2.863000000E-05 6.242900000E-01 0.000000000E+00
1.600000000E+01 0.000000000E+00 2.830000000E-05 6.050700000E-01 0.000000000E+00
1.650000000E+01 0.000000000E+00 2.788000000E-05 5.868037565E-01 0.000000000E+00
1.700000000E+01 0.000000000E+00 2.745000000E-05 5.696100000E-01 0.000000000E+00
1.750000000E+01 0.000000000E+00 2.736000000E-05 5.531658724E-01 0.000000000E+00
1.800000000E+01 0.000000000E+00 2.726000000E-05 5.376400000E-01 0.000000000E+00
1.850000000E+01 0.000000000E+00 2.673000000E-05 5.227535194E-01 0.000000000E+00
1.900000000E+01 0.000000000E+00 2.620000000E-05 5.086600000E-01 0.000000000E+00
1.950000000E+01 0.000000000E+00 2.612000000E-05 4.951201309E-01 0.000000000E+00
2.100000000E+01 0.000000000E+00 2.553287162E-05 4.581500000E-01 0.000000000E+00
2.200000000E+01 0.000000000E+00 2.505853982E-05 4.360200000E-01 0.000000000E+00
2.300000000E+01 0.000000000E+00 2.461353168E-05 4.156500000E-01 0.000000000E+00
2.400000000E+01 0.000000000E+00 2.419487313E-05 3.968700000E-01 0.000000000E+00
2.500000000E+01 0.000000000E+00 2.380000000E-05 3.795000000E-01 0.000000000E+00
2.600000000E+01 0.000000000E+00 2.348033824E-05 3.634500000E-01 0.000000000E+00
2.700000000E+01 0.000000000E+00 2.317679620E-05 3.481100000E-01 0.000000000E+00
2.800000000E+01 0.000000000E+00 2.288800766E-05 3.329500000E-01 0.000000000E+00
2.900000000E+01 0.000000000E+00 2.261276583E-05 3.189100000E-01 0.000000000E+00
3.000000000E+01 0.000000000E+00 2.235000000E-05 3.056690000E-01 0.000000000E+00
3.200000000E+01 0.000000000E+00 2.168747690E-05 2.834800000E-01 0.000000000E+00
3.400000000E+01 0.000000000E+00 2.108303176E-05 2.639490000E-01 0.000000000E+00
3.500000000E+01 0.000000000E+00 2.080000000E-05 2.550341280E-01 0.000000000E+00
3.600000000E+01 0.000000000E+00 2.051871530E-05 2.466590000E-01 0.000000000E+00
3.800000000E+01 0.000000000E+00 1.998946910E-05 2.312710000E-01 0.000000000E+00
4.000000000E+01 0.000000000E+00 1.950000000E-05 2.175050000E-01 0.000000000E+00
4.200000000E+01 0.000000000E+00 1.903657740E-05 2.051280000E-01 0.000000000E+00
4.400000000E+01 0.000000000E+00 1.860497774E-05 1.939580000E-01 0.000000000E+00
4.500000000E+01 0.000000000E+00 1.840000000E-05 1.887745242E-01 0.000000000E+00
4.600000000E+01 0.000000000E+00 1.819764541E-05 1.838390000E-01 0.000000000E+00
4.800000000E+01 0.000000000E+00 1.781211419E-05 1.746410000E-01 0.000000000E+00
5.000000000E+01 0.000000000E+00 1.745000000E-05 1.662530000E-01 0.000000000E+00
5.200000000E+01 0.000000000E+00 1.701001484E-05 1.585780000E-01 0.000000000E+00
5.400000000E+01 0.000000000E+00 1.659711386E-05 1.515380000E-01 0.000000000E+00
5.500000000E+01 0.000000000E+00 1.640000000E-05 1.482352136E-01 0.000000000E+00
5.600000000E+01 0.000000000E+00 1.618772118E-05 1.450620000E-01 0.000000000E+00
5.800000000E+01 0.000000000E+00 1.578215898E-05 1.390920000E-01 0.000000000E+00
6.000000000E+01 0.000000000E+00 1.540000000E-05 1.335750000E-01 0.000000000E+00
6.200000000E+01 0.000000000E+00 1.506710831E-05 1.284640000E-01 0.000000000E+00
6.400000000E+01 0.000000000E+00 1.475164479E-05 1.237190000E-01 0.000000000E+00
6.500000000E+01 0.000000000E+00 1.460000000E-05 1.214765192E-01 0.000000000E+00
6.600000000E+01 0.000000000E+00 1.446365740E-05 1.193080000E-01 0.000000000E+00
6.800000000E+01 0.000000000E+00 1.420073033E-05 1.151970000E-01 0.000000000E+00
7.000000000E+01 0.000000000E+00 1.395000000E-05 1.113610000E-01 0.000000000E+00
7.200000000E+01 0.000000000E+00 1.370182002E-05 1.077740000E-01 0.000000000E+00
7.400000000E+01 0.000000000E+00 1.346467651E-05 1.044150000E-01 0.000000000E+00
7.500000000E+01 0.000000000E+00 1.335000000E-05 1.028174393E-01 0.000000000E+00
7.600000000E+01 0.000000000E+00 1.326691275E-05 1.012650000E-01 0.000000000E+00
7.800000000E+01 0.000000000E+00 1.310546711E-05 9.830760000E-02 0.000000000E+00
8.000000000E+01 0.000000000E+00 1.295000000E-05 9.552690000E-02 0.000000000E+00
8.200000000E+01 0.000000000E+00 1.286816341E-05 9.290890000E-02 0.000000000E+00
8.400000000E+01 0.000000000E+00 1.278879762E-05 9.044090000E-02 0.000000000E+00
8.500000000E+01 0.000000000E+00 1.275000000E-05 8.926185335E-02 0.000000000E+00
8.600000000E+01 0.000000000E+00 1.269844010E-05 8.811170000E-02 0.000000000E+00
8.800000000E+01 0.000000000E+00 1.259770183E-05 8.591080000E-02 0.000000000E+00
9.000000000E+01 0.000000000E+00 1.250000000E-05 8.382850000E-02 0.000000000E+00
9.200000000E+01 0.000000000E+00 1.243880486E-05 8.185630000E-02 0.000000000E+00
9.400000000E+01 0.000000000E+00 1.237921585E-05 7.998660000E-02 0.000000000E+00
9.500000000E+01 0.000000000E+00 1.235000000E-05 7.908985664E-02 0.000000000E+00
9.600000000E+01 0.000000000E+00 1.231922908E-05 7.821240000E-02 0.000000000E+00
9.800000000E+01 0.000000000E+00 1.225886126E-05 7.652760000E-02 0.000000000E+00
1.000000000E+02 0.000000000E+00 1.22000000