The Pennsylvania State University
The Graduate School
Department of Mechanical and Nuclear Engineering
MONTE CARLO METHODS FOR NEUTRON TRANSPORT ON GRAPHICS
PROCESSING UNITS USING CUDA
A Thesis in
Nuclear Engineering
by
Adam Gregory Nelson
© 2009 Adam Gregory Nelson
Submitted in Partial Fulfillment
of the Requirements
for the Degree of
Master of Science
December 2009
The thesis of Adam Gregory Nelson was reviewed and approved* by the following:
Kostadin N. Ivanov
Distinguished Professor of Nuclear Engineering
Thesis Advisor
Maria Avramova
Assistant Professor of Nuclear Engineering
Jack Brenizer
J. "Lee" Everett Professor of Mechanical and Nuclear Engineering
Chair of Nuclear Engineering
*Signatures are on file in the Graduate School
ABSTRACT
This work examined the feasibility of utilizing Graphics Processing Units (GPUs) to
accelerate Monte Carlo neutron transport problems. These GPUs use many parallel processors to
perform the complex calculations necessary to create three-dimensional images at fast enough
rates for the video game industry. In 2006 NVIDIA released a programming framework (called
CUDA) that allows developers to easily program for the many cores provided by GPUs for
general purpose programs. Initial assessments have suggested that the MC algorithm may not
map well onto the constraints of the GPU. These constraints include the fact that MC codes are
highly dependent on branch statements (IF, ELSE, FOR, and WHILE), which can have a large
impact on GPU performance.
In this work, a Monte Carlo neutron transport code was written from scratch to be run on
both the x86 CPU platform and the GPU CUDA platform to understand the type of performance
that can be gained by utilizing GPUs.
After optimizing the code to run on the GPU, a speedup of nearly 21x was found when
using only single-precision floating point math. This can be further increased with no additional
effort if accuracy is sacrificed for speed: using a compiler flag, the speedup was increased to
nearly 24x. Further, if double-precision floating point math is desired for neutron tracking
through the geometry, a speedup of 11x was found.
While GPUs have proven to be useful, they are not without limitations: the maximum
memory currently available on a single GPU is 4 GB; the GPU RAM does not provide error
checking and correction; and the optimization required to decrease GPU runtime can lead to
code that is not readable by those who are not the original developers.
TABLE OF CONTENTS
LIST OF FIGURES ................................................................................................................. vi
LIST OF TABLES ................................................................................................................... vii
LIST OF ABBREVIATIONS .................................................................................................. viii
ACKNOWLEDGEMENTS ..................................................................................................... ix
Chapter 1 The Case For Acceleration of Monte Carlo Transport Simulations ....................... 1
Chapter 2 General Purpose Graphics Processing Units .......................................................... 3
2.1 The CUDA Architecture ............................................................................ 4
2.2 Additional CUDA Programming Information ........................................................... 10
2.3 Limitations of the CUDA Model ..................................................................... 11
Chapter 3 Methodology .......................................................................................................... 13
3.1 LADONc .................................................................................... 13
3.1.1 Geometry Representation ................................................................ 16
3.1.2 Material Representation ................................................................ 17
3.1.3 Nuclear Data and Cross-Section Representation .......................................... 18
3.1.4 Initial Source Neutrons ................................................................ 19
3.1.5 Random Number Generator ................................................................ 20
3.1.6 LADONc Algorithm Details ............................................................... 20
3.2 CERBERUSc ................................................................................. 25
3.3 Porting Codes to the GPU .................................................................. 26
3.4 Models Analyzed ........................................................................... 28
3.4.1 Model "Test" ........................................................................... 29
3.4.2 Model "Two" ............................................................................ 29
3.4.3 Model "Three" .......................................................................... 29
3.4.4 Model "Four" ........................................................................... 30
3.4.5 Model "Array2" ......................................................................... 30
3.5 Test System Hardware Specification ........................................................ 31
3.6 CERBERUSg Initial Performance ............................................................. 33
3.7 LADONg Initial Performance ................................................................ 33
Chapter 4 Optimizations ......................................................................................................... 34
4.1 Optimization Efforts ...................................................................... 34
4.1.1 Reducing Memory Latency ................................................................ 34
4.1.2 Register Pressure ...................................................................... 37
4.1.3 Cross-Section Lookups .................................................................. 38
4.1.4 Reducing Shared Memory Usage to Increase Blocks per SM ................................. 39
4.1.5 Thread Divergence ...................................................................... 40
4.1.6 Increasing the Speed of Data Transfer Between System RAM and GPU RAM ................... 41
4.1.7 Additional Optimizations Performed Which Reduce Accuracy ............................... 41
Chapter 5 Results .................................................................................................................... 43
5.1 Single-Precision Results .................................................................. 44
5.1.1 Use of the Accurate Math, Single-Precision Functions ................................... 44
5.1.2 Use of the Intrinsic Math, Single-Precision Functions .................................. 45
5.1.3 Examination of the Observed Peak ....................................................... 46
5.2 Double-Precision Results .................................................................. 48
5.2.1 Use of the Accurate Math, Double-Precision Functions ................................... 48
5.2.2 Use of the Intrinsic Math, Double-Precision Functions .................................. 50
5.3 Accuracy Comparison ................................................................................................ 51
Chapter 6 Applicability to Production Codes ......................................................................... 53
6.1 Limitations of CUDA for Monte Carlo Neutron Transport ..................................... 53
6.1.1 Maximum Available Memory ............................................................... 53
6.1.2 Accuracy of Computations ............................................................... 54
6.1.3 Error-Checking Memory .................................................................. 54
6.1.4 Maintainability ........................................................................ 55
6.1.5 Hardware Architecture Changes .......................................................... 55
6.1.6 Optimizations May Not Be As Successful For Larger Problems ............................. 56
6.1.7 Lack of Large-Scale GPU Cluster ........................................................ 56
6.1.8 CUDA Development Tools ................................................................. 57
6.2 Possible Applications of CUDA-Accelerated MC ..................................................... 58
Chapter 7 Summary and Conclusions ..................................................................................... 59
7.1 Conclusions ............................................................................... 59
7.2 Recommendations for Future Work ........................................................... 61
Bibliography ............................................................................................................................ 62
Appendix A LADONc Source Code ....................................................................................... 64
Appendix B LADONg Source Code ....................................................................................... 81
Appendix C Sample Input Files .............................................................................................. 105
C.1 Geometry File "array2.geo" ................................................................ 105
C.2 Material File "array2.mat" ................................................................ 109
C.3 Neutron Source File "array2.src" .......................................................... 111
C.4 Nuclear Data File for 1H, "1001" .......................................................... 114
LIST OF FIGURES
Figure 1: NVIDIA Peak Performance Through Time, Courtesy NVIDIA .............................. 4
Figure 2: CUDA Kernel Execution, Courtesy NVIDIA .......................................................... 8
Figure 3: CUDA Hardware Architecture, Courtesy NVIDIA ................................................. 9
Figure 4: History-Based Algorithm ......................................................................................... 15
Figure 5: Details of Geometry Tracking .................................................................................. 16
Figure 6: Sample Code Transferring Neutron Data to GPU .................................................... 27
Figure 7: BFG/NVIDIA GTX 275 OC, Courtesy BFG ........................................................... 32
Figure 8: CUDA Code Transferring Data to Constant Memory .............................................. 35
Figure 9: CUDA Code Transferring Global Memory Data to Shared Memory ...................... 37
Figure 10: Example of Variable Reuse to Reduce Registers ................................................... 37
Figure 11: Single-Precision Speedup ....................................................................................... 44
Figure 12: Single-Precision Speedup Using Fast Math ........................................................... 45
Figure 13: MC-HISTORYc Run Time in the Peak Region ..................................................... 47
Figure 14: MC-HISTORYg Run Time in the Peak Region ..................................................... 47
Figure 15: Speedup in the Peak Region ................................................................................... 48
Figure 16: Double-Precision Speedup ..................................................................................... 49
Figure 17: Double-Precision Speedup Using Fast Math .......................................................... 50
LIST OF TABLES
Table 1: Assumptions .............................................................................................................. 14
Table 2: Test System Hardware Specification ......................................................................... 31
Table 3: GPU Specifications, Courtesy BFG........................................................................... 32
Table 4: Accuracy Comparison ............................................................................................... 52
Table 5: Summary of Speedups ............................................................................................... 60
LIST OF ABBREVIATIONS
(Entries are listed alphabetically)
CM Center-of-Mass System
CPU Central Processing Unit
CUDA Compute Unified Device Architecture
ECC Error Checking and Correcting
ENDF Evaluated Nuclear Data File
GPU Graphics Processing Unit
HPC High Performance Computer
MC Monte Carlo Neutron Transport
nvcc CUDA C Compiler
PCIe Peripheral Component Interconnect Express Bus
RAM Random Access Memory
SIMD Single-Instruction, Multiple-Data
SM Streaming Multiprocessor
ULP Unit in Last Place
ACKNOWLEDGEMENTS
The author would like to thank the following people for their invaluable assistance with
this work. First, he thanks his research advisor, Professor Kostadin N. Ivanov, for guidance
through the process. Secondly, the users of the NVIDIA CUDA Programming Forum deserve
many thanks for providing solutions to difficult programming challenges encountered during the
performance of this research. Finally, the author would like to thank his girlfriend for the
unending support she has given him over the past nine months as he continually was forced to
choose computer programming over his duties as a boyfriend (it is worth noting that in the
months following the completion of this thesis, she may wish the author had another topic to
pursue).
Chapter 1
The Case For Acceleration of Monte Carlo Transport Simulations
The Monte Carlo method of solving the Boltzmann Neutron Transport Equation (MC) is
capable of producing highly accurate solutions for nuclear engineering problems of interest. This
is enabled through the exact representation of problem geometry (as long as a mathematical
description of the geometry exists), the use of nuclear data that is a continuous function of energy
(which can be as accurate as the data produced by the experimental database), and the allowance
of a continuous spectrum of particle directions. MC solutions therefore provide a much higher fidelity
than diffusion theory or discrete ordinates based solutions which both require discretization of the
problem geometry, energy, and/or angles.
Despite these benefits, MC has not yet displaced diffusion theory or discrete ordinates
methodologies for design. This is mainly due to the relatively large computation time required to
produce useful results. A meaningful power distribution from an MC solution requires a
converged source distribution and fine-block tally regions in all regions of interest to produce
small uncertainties on results; both of these require many millions of neutron simulations
(depending on the problem size). Therefore, the Monte Carlo method cannot displace the
established solution methodologies as a prime tool for design until the computation time can be
decreased to manageable levels. For instance, Drs. Kord Smith (at the M&C 2003 conference)
(Smith 2003), Bill Martin (at the M&C 2007 conference) (Martin 2007), and Forrest Brown (at
the PHYSOR 2008 conference) (Brown 2008) have postulated that the computational power to
perform a full-core (PWR) calculation in one hour (which would require twenty billion neutron
histories to reach a 1% standard deviation) will not be available until the year 2018.
This thesis will evaluate one possible option of accelerating MC simulations to enable the
use of MC for design: utilizing the graphics processors installed in most desktop and laptop
computers.
Chapter 2
General Purpose Graphics Processing Units
A Graphics Processing Unit (GPU - commonly referred to as a 3-D accelerator, video
card, or graphics card) is a processor dedicated to performing the computationally intense
floating-point calculations required to render 3-D images in real-time. These cards are installed
in most desktop and laptop personal computers to increase gaming performance of the computer.
GPUs communicate with the central processing unit (CPU) and the system's random access
memory (RAM) through the Peripheral Component Interconnect Express (PCIe) bus which
currently provides up to eight gigabytes per second of bandwidth (NVIDIA Corporation 2009).
Recent GPUs utilize many processing elements (similar to 'cores' in CPUs) which operate in
parallel to convert the 3-D geometry and lighting conditions into a 2-D image to display on a
computer monitor. The details of the hardware architecture are discussed in more detail below.
Until recently GPUs were utilized solely for performing computations required to
produce the output image. However, in late 2006, NVIDIA released the G80 GPU, and with it,
the Compute Unified Device Architecture (CUDA) hardware and software architectures to
provide general-purpose processing capabilities on GPUs. The G80 hardware provided up to 371
gigaflops of peak performance in a package that cost only a few hundred dollars and consumed
less than 200W of power. Because these devices were primarily mainstream graphics cards,
40,000,000 CUDA capable cards were in home computers by the end of 2007 (Luebke 2007). A
comparison of the peak performance of CPUs and NVIDIA GPUs since 2003 can be seen in
Figure 1. This figure shows the enormous increase in GPU computational power, relative to
typical CPUs, since 2003.
Figure 1: NVIDIA Peak Performance Through Time, Courtesy NVIDIA
The large disparity between the performance of then-top-end CPUs and GPUs arises because
the CPU is designed for sequential code performance, with large instruction and data caches to
reduce data transfer overhead, while GPUs utilize many processors which execute in parallel,
with small local caches but very fast data transfer rates between the GPU's RAM and the
processors (approximately 10x the speed of system RAM). The GPU hardware and software
architectures will be discussed in more detail in the following section.
2.1 The CUDA Architecture
NVIDIA, in developing the CUDA environment, focused on creating a programming
environment which allows for essentially unlimited thread-level parallelism from the
programmer's perspective, which the underlying hardware then implements at runtime to solve
massively parallel problems. Since the hardware and software are so inter-related, the topics will
be discussed together. The following information is discussed in the CUDA Programming Guide,
and the reader is directed there for further details (NVIDIA Corporation 2009).
For CUDA programs, the GPU is a co-processor which can be utilized to offload work
from the CPU. An entire program does not execute solely on the GPU, but instead only portions
of the programs are sent to the GPU for execution, typically those portions of the code for which
the programmer has decided that the GPU is a more suitable compute environment. These
portions of code executed by the GPU are called kernels. The lowest computation unit in the
kernel is a thread, which is essentially an instruction set that operates on a particular piece of data.
Threads are grouped into blocks (with a maximum of 1024 per block). Threads inside a block are
all required to start at the same instruction of a stream, but divergence to a certain degree is
allowed after that point. These threads can share data, and therefore all threads within a block
must be executed on the same "Streaming Multiprocessor" (SM), which is somewhat akin to a
core on a CPU. Blocks have the property of being able to run in any order on any of the GPU's
compute units, which is a characteristic that the programmer must take into consideration.
Finally, the highest level of execution in a kernel is called the grid, which is an array of thread
blocks.
The threads inside of a block are split into groupings called warps. The current
generation of NVIDIA cards allocates thirty-two threads per warp. This grouping is transparent to
the user, and a fixed warp size should not be assumed. Warps are sets of threads which are all
processing the same instruction. If a portion of the threads in a warp needs to take one side of a
branch but another portion needs to take the other, a 'divergent branch' condition exists. This is
usually caused by the use of IF/ELSE statements or conditional loops. When this occurs, all
threads in the warp execute each of the possible paths, but only the result from the correct path
for each thread is stored. This is not optimal for decreasing the runtime of a code, as it reduces
the maximum parallelism possible. The GPU can, with minimal overhead, switch which warp an
SM is executing. This is useful when a particular warp has requested data from high-latency
memory: while the data is being fetched, the SM can switch the warp it is executing.
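As a first-order illustration, the cost of divergence can be sketched with a hypothetical cycle-count model (this is an editorial sketch, not taken from the thesis or from NVIDIA's documentation): a warp that diverges pays for both sides of the branch serially, while a uniform warp pays only for the side it takes.

```cpp
#include <cassert>

// First-order cost model for a warp encountering an IF/ELSE.
// If every thread takes the same side, the warp pays only for that path;
// if the warp diverges, both paths are executed serially and the results
// of the wrong path are discarded for each thread.
int warp_branch_cycles(int cycles_taken, int cycles_other, bool diverged) {
    return diverged ? cycles_taken + cycles_other  // both paths serialized
                    : cycles_taken;                // single path only
}
```

Under this model a warp whose threads split across a 10-cycle path and a 30-cycle path costs 40 cycles instead of 10, which is why minimizing divergent branches is a focus of the optimization work in Chapter 4.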
Now that the basic execution model has been presented, it is useful to discuss the
memory model. Refer to Figure 2 and Figure 3 during the following discussion. The CUDA
software and hardware architectures allow for a few different memory spaces to be utilized by the
kernel. These include per-thread registers, per-thread local memory, per-block shared memory,
per-program constant memory, per-program texture memory, and per-program global memory.
The register file is a set of 16,384 32-bit registers which are dynamically allocated to
each thread that is executing on an SM. Similar to CPU registers, these are very fast, locally
stored data which cannot be shared amongst threads. Local memory is provided as a 'spill-over'
region should the registers be filled. Unfortunately, the local memory is only local in name: local
memory is actually stored in the GPU RAM, and hence has much slower access than the register
file (approximately 200 cycles in latency to access the data).
Shared memory is memory stored on-chip with an SM that provides data that all threads
in a block can access. Since it is local to the SM, it provides for access times essentially as fast as
the registers. Each SM has 16 kB of shared memory, which is divided among all of the blocks
assigned to that SM.
Constant and texture memory are both read-only data stored in the GPU RAM; however,
both are cached. The constant memory space is approximately 64 kB, while the texture size is
essentially unlimited. Both spaces have caches, which provide rapid data access times if the
data is in the cache. The constant cache exploits linear locality, while the texture cache exploits
spatial locality. The texture memory can provide advanced functionality through
the texturing unit such as interpolation, but these features were not used in this analysis.
Global memory is an un-cached space whose size is only limited by the GPU RAM
available. Due to the un-cached nature, a read from global memory will cost approximately 200
cycles.
A consequence of this architecture is that the fastest computation times can only be achieved
by using coalesced memory reads and writes. That is, the data requested by the threads in a block
should be located in a contiguous memory space so that the entire memory transfer bus can be
filled with data which can be read in the minimum number of cycles.
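One common way to obtain contiguous accesses is to store particle attributes as a struct-of-arrays rather than an array-of-structs, so that thread i's read of an attribute lands next to thread i+1's. The sketch below is hypothetical; the thesis's actual data structures appear in Appendices A and B.

```cpp
#include <vector>

// Array-of-structs: thread i reading neutrons[i].energy touches addresses
// strided by sizeof(Neutron), so a warp's reads cannot be coalesced.
struct Neutron { float x, y, z, energy; };

// Struct-of-arrays: thread i reading energy[i] touches an address adjacent
// to thread i+1's read, allowing one wide, coalesced memory transaction.
struct NeutronBank {
    std::vector<float> x, y, z, energy;
    void add(float xi, float yi, float zi, float ei) {
        x.push_back(xi); y.push_back(yi); z.push_back(zi); energy.push_back(ei);
    }
};
```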
Since the registers and shared memory are divided amongst the blocks and threads on an SM,
there is an optimum amount of shared memory and registers to use per block and thread so that
more blocks can be run on each SM (and therefore more blocks can be run at a time on the entire
GPU). This optimum configuration can be determined with the CUDA Occupancy Calculator, a
tool from NVIDIA provided to developers in the CUDA Toolkit which calculates how many
blocks can fit on an SM.
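The arithmetic behind this can be approximated as follows. This is a simplified sketch using GT200-era limits (16,384 registers and 16 kB of shared memory per SM, at most 8 resident blocks); the real Occupancy Calculator also accounts for warp-granularity allocation and other hardware rules that are ignored here.

```cpp
#include <algorithm>
#include <cassert>

// Rough blocks-per-SM estimate: each resident block consumes registers
// (threads * registers per thread) from the 16,384-register file and a
// slice of the 16 kB shared memory; at most 8 blocks may be resident.
int blocks_per_sm(int threads_per_block, int regs_per_thread,
                  int shared_bytes_per_block) {
    const int reg_file = 16384, shared_mem = 16384, max_blocks = 8;
    int by_regs = reg_file / (threads_per_block * regs_per_thread);
    int by_shared = (shared_bytes_per_block > 0)
                        ? shared_mem / shared_bytes_per_block
                        : max_blocks;
    return std::min(std::min(by_regs, by_shared), max_blocks);
}
```

Reducing registers per thread or shared memory per block raises the number of resident blocks, which is exactly the trade examined in Sections 4.1.2 and 4.1.4.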
Finally, because global and local memory are not cached, and because cache misses can
occur for the cached memory locales, the CUDA architecture can effectively 'swap' the blocks
being executed at any given time so that while block A is awaiting data from global memory,
block B can be executed. This is called latency hiding. The developer can take advantage of this
by solving the problem at hand with more blocks than there are SMs on the GPU.
Figure 2: CUDA Kernel Execution, Courtesy NVIDIA
Figure 3: CUDA Hardware Architecture, Courtesy NVIDIA
2.2 Additional CUDA Programming Information
The CUDA programming environment is essentially an extension to the C programming
language. Because it is based on the C language, there is only a small learning curve for CUDA
development.
Some specifics necessary when reviewing source code in the following chapters are as
follows:
Before any kernel can be called, the data necessary for the kernel must be copied
from the system RAM to the GPU RAM. CUDA provides functions which
perform this operation (for transfers both to and from the GPU) for constant memory,
texture memory, and global memory. Shared memory, constant memory, global
memory, and textures are all declared by the programmer. Constant memory,
global memory, and textures have the life of a program, i.e., a programmer does
not have to load data into the memory spaces between each kernel call, unless the
data has changed.
Kernels are functions, like any other C function, except that the function definition
and header must include the "__global__" identifier, which tells the CUDA
compiler that this is a GPU function that is callable by the CPU. When a kernel
is called, it requires an "execution configuration" as well, which tells the GPU
how many blocks and threads to create. These execution configurations, in their
simplest form (and the form used in this work), look like:
"<<<dim3 grid, dim3 block>>>", where the grid dimensions (the number of
blocks) come first, followed by the block dimensions (the number of threads per
block). The "dim3" type is a data structure containing three integers to describe
the (up to) three-dimensional grid.
Functions which can only be called by the GPU must be written with the
"__device__" identifier before the function declaration and header.
Functions can be written which are used by both the CPU and the GPU. These
functions would have both the identifiers "__host__" and "__device__" before
the function declaration and header.
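The way an execution configuration carves work into blocks and threads can be emulated on the host. The sketch below is ordinary C++ (the names mirror CUDA's built-in blockIdx/blockDim/threadIdx variables but nothing here runs on a GPU): each thread computes one global index, and the host loops stand in for the grid.

```cpp
#include <cassert>
#include <vector>

// Each CUDA thread computes one global element index from its block and
// thread coordinates; here the whole 1-D grid is emulated on the host.
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Fill out[i] = 2*i, as a stand-in for per-thread kernel work.
std::vector<int> emulate_kernel(int grid_dim, int block_dim) {
    std::vector<int> out(grid_dim * block_dim, 0);
    for (int b = 0; b < grid_dim; ++b)         // blocks may run in any order
        for (int t = 0; t < block_dim; ++t) {  // threads within one block
            int i = global_index(b, block_dim, t);
            out[i] = 2 * i;
        }
    return out;
}
```

On the GPU the two loops disappear: every (block, thread) pair executes the loop body once, concurrently.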
2.3 Limitations of the CUDA Model
The discussion thus far has focused on the highlights of the use of GPUs to accelerate
general purpose programs. However, there are some downsides to the widespread adoption of
these cards for scientific applications, which are discussed in the following paragraphs.
The CUDA Programming Guide (NVIDIA Corporation 2009) discusses the degree to which
NVIDIA GPUs conform to the Institute of Electrical and Electronics Engineers (IEEE) Standard
for Floating-Point Arithmetic (IEEE-754). The guide states that most operations do match the
standard, with a few exceptions. Exceptions include: a lack of dynamically configurable
rounding modes; a lack of a mechanism for signaling that a floating-point exception has occurred;
and the fact that various functions are implemented in ways that are not IEEE-compliant (such as
square root and division). In the end this means that some functions are less accurate than their
CPU equivalents.
The latest generation of NVIDIA's GPUs, the GT200, is the first to support double-
precision floating-point calculations. However, each SM has only one double-precision unit
(versus eight single-precision compute units per SM). Therefore, a first-order estimate of double-
precision computation runtime suggests that double-precision calculations would require eight
times as long to run as single-precision floating-point calculations.
The highest-end GPUs (at the time of this thesis) provide up to 4GB of RAM. For most
problems (especially shielding applications) this is not enough space, and domain
decomposition techniques will be required to overcome the limit.
At this time GPUs do not provide error detection or correction. There is therefore no
means to ensure the accuracy of the data transferred to and from the GPU RAM. This is a
reliability issue for clusters implementing GPUs because of the increased chance that the entire
cluster will experience a memory error.
Chapter 3
Methodology
To evaluate GPUs for the acceleration of MC transport codes, a baseline code was written
from scratch for the Intel x86 CPU architecture. This code is referred to herein as "LADONc,"
where the c denotes CPU. A clean-sheet design was chosen over porting an existing code
because, to properly convert and optimize, one must have full knowledge of the inner workings
of the software, something that would have taken an immense amount of time to acquire for a
production code such as MCNP. LADONc utilizes the history-based algorithm.
LADONc was then modified to utilize the event-based algorithm discussed in a paper by
Forrest Brown and Bill Martin (Forrest B. Brown 1984). This too was written for the CPU. This
code will be referred to as "CERBERUSc."
LADONc and CERBERUSc were the baseline codes which were ported to the GPU
architecture by utilizing the CUDA programming language. These will be referred to as
"LADONg" and "CERBERUSg," where the g denotes GPU.
All of the above codes are discussed in further detail in the following sections.
3.1 LADONc
This code, depicted in flowchart form in Figure 4 and Figure 5, operates on one neutron
simulation at a time. Each neutron's calculation begins at birth (either from a user-defined source
or from a fission event), and the neutron is tracked until it leaks or is absorbed. To simplify the
development of LADONc, only neutron multiplication factor (kcalc) problems were considered.
Neutron physics are not modeled in full detail; features such as Doppler broadening and
advanced thermal scattering treatments are ignored. The free-gas model is used for all
scattering reactions. A list of assumptions and approximations can be found in Table 1. The
source code for LADONc is contained in Appendix A. Specifics of this code are discussed
below, as it provides the background for the rest of the work performed.
Table 1: Assumptions

General:
- Units for cross-sections are barns, number densities are atoms/b-cm, and geometries are in cm

Geometry:
- Only spheres and cuboids are supported
- None of the above shapes can be transformed
- If a neutron is exactly on the boundary of a cell it is considered inside the cell
- The user must define which cells are inside which other cells
- Cells must be input in the order that they are inside each other
- Cells must not intersect each other
- The number of cells is arbitrarily limited to 100

Nuclear Data:
- No special treatment of cross-sections (unresolved resonances, bound nuclei)
- The number of neutrons per fission is set at 2.53
- No discrete Q-values are used for inelastic scattering
- If the energy is less than the minimum in the cross-section data, 1/v scaling is used
- Cross-section and fission spectrum energies are limited to 20 MeV

Physics:
- Only the free-gas model is used for collisions

Material Data:
- The number of total nuclides is arbitrarily limited to 20
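One entry from Table 1, the 1/v fallback below the tabulated minimum energy, can be written directly. The sketch below uses illustrative numbers and is not LADONc's actual routine: since neutron speed v is proportional to the square root of energy, a "1/v" cross-section scales as sqrt(E_min / E).

```cpp
#include <cassert>
#include <cmath>

// Extrapolate a cross-section below the lowest tabulated energy using
// 1/v scaling: sigma(E) = sigma(E_min) * sqrt(E_min / E), which holds
// because neutron speed v is proportional to sqrt(E).
double xs_one_over_v(double sigma_at_min, double e_min, double e) {
    return sigma_at_min * std::sqrt(e_min / e);
}
```

For example, a 10-barn cross-section tabulated down to 1e-5 MeV doubles to 20 barns at one quarter of that energy.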
Figure 4: History-Based Algorithm
Figure 4 summarizes the history-based loop: for each neutron batch, the input data is read and
the tallies are initialized; then, for each neutron in the batch, the total optical thickness travelled
is calculated and the neutron is transported to the corresponding x, y, z; the nuclide the neutron
collided with is determined by interrogating the macroscopic cross-sections for the target
nuclide, current cell, and energy; and the reaction type is then sampled. A fission determines the
number of fission neutrons and adds that many to the source list; a radiative capture increases
the capture tally and ends the neutron; and an elastic or inelastic scatter changes the neutron
energy and direction according to the corresponding scattering laws. Finally, keff is updated and
the neutron is ended if it leaked.
Figure 5: Details of Geometry Tracking
[Flowchart: while the neutron still has distance to travel, get the initial cell, ΣT(E), and collision distance; calculate the position the neutron would reach if moved the collision distance in direction Ω; if the cell has not changed, set the neutron position to the calculated point and store the final x, y, z and cell; if it has changed, determine the distance to the cell boundary, subtract it from the distance to transport, and tally a leak and end the neutron if it leaked.]
3.1.1 Geometry Representation
To solve problems more complex than an infinite homogeneous medium, a Monte
Carlo neutron transport code must have methods to define geometric objects, or
cells, and be able to track which cell a particular point is in, and how far until the boundary of a
cell. This section describes how these actions are performed in LADONc.
LADONc is capable of modeling parallelepipeds and spheres, as these are easily defined
in terms of mathematical functions. These cells can be set to any size and placed at any location
in the model. The problem boundary is also defined through the use of one of these shapes.
LADONc makes use of geometric hierarchy; i.e., the user can define one cell to be inside of
another cell, displacing the material of the larger cell with the smaller cell's material. In a
production code this is useful because it reduces the number of input values that need to be
updated when one dimension of the model changes (as is usual during design iterations or
sensitivity studies). This hierarchy also reduces the need for gridding of the model, because
every cell has associated with it a list of the cells that are inside of it. For the purposes of this
code, the user inputs these 'inside_me' lists, as opposed to having the software perform this task
automatically.
Additionally, the user must input the cells in this hierarchical order. For example, if
object A is inside of object B, then A must come first in the geometry input file. This is not
necessary in a production code, but is used here to reduce the computational effort of determining
which cell a neutron is in: since the cells are checked in the order the user inputs them, the first
cell that returns a successful query is the smallest one the neutron can be in.
Geometric information is input into the program by use of the file "FILE_NAME.geo",
where FILE_NAME is defined by the user. A sample geometry input file can be seen in
Appendix C.
3.1.2 Material Representation
Every geometric cell has associated with it a list of materials. Each material is described
by using an integer identifier and number density for each nuclide. The integer identifier, or
'ZAID', represents the proton number (Z) and mass number (A) according to the following
formula:
ZAID = 1000 · Z + A
Equation 1
For example, the ZAID value for ¹⁶O (Z = 8, A = 16) is 8016. This format is used because it
saves memory and can be quickly decomposed later into the appropriate parts when necessary.
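As a brief illustration (these helper functions are not from the thesis source), the packing and decomposition can be written as:

```cpp
// Illustrative helpers (not from the thesis source) showing how a ZAID packs
// Z and A into a single integer and how it is decomposed again when needed.
#include <cassert>

int make_zaid(int z, int a) { return 1000 * z + a; }  // Equation 1
int zaid_to_z(int zaid)     { return zaid / 1000; }   // recover proton number
int zaid_to_a(int zaid)     { return zaid % 1000; }   // recover mass number
```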
Material information is input into the program by use of the file "FILE_NAME.mat",
where FILE_NAME is defined by the user. A sample material input file is shown in Appendix C.
3.1.3 Nuclear Data and Cross-Section Representation
To reduce development effort for this proof-of-concept analysis, no standardized cross-
section format (such as ACE) was used. Instead, a table was manually created for nuclides which
includes microscopic radiative capture, fission, elastic, and inelastic cross-sections as a function
of energy. Total cross-sections are calculated on the fly. These values could have been any
floating point number desired, but to have some resemblance of accuracy, values from the
Evaluated Nuclear Data File (ENDF) plotting program (ENDFPLOT 2.0) were used (KAERI,
Korea Atomic Energy Research Institute 2007). The inelastic scattering cross-sections are
simplified as well because only one set of eight Legendre coefficients is provided for each
nuclide. For fissile or fissionable nuclides, parameters describing the fission neutron energy
spectrum, χ(E), are also included in this data.
The cross-sections are considered point-wise: if a neutron's energy falls between two
energy points defined in the cross-section set, a simple linear interpolation is performed, even in
the resonance regions. A binary search algorithm is used to find the array index corresponding to
the energy just larger than the neutron energy in question. To avoid negative cross-sections, 1/v
extrapolation is used if a neutron's energy falls below the range of energies of the cross-section
data. The number of points in the energy mesh is small relative to production codes. For
instance, the U-235 cross-section file contains an energy mesh with 31,839 data points, ranging
from 0.0115 eV to 20.0 MeV.
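The lookup just described can be sketched as follows; this is an illustrative reconstruction (binary search, linear interpolation, 1/v extrapolation below the mesh), not the LADONc source, and the function and parameter names are assumptions.

```cpp
// Illustrative sketch (not the LADONc source) of the point-wise cross-section
// lookup described above: binary search for the bracketing energy points,
// linear interpolation between them, and 1/v extrapolation below the mesh.
#include <cassert>
#include <cmath>
#include <vector>

float lookup_xs(const std::vector<float>& energy,  // ascending energy mesh
                const std::vector<float>& xs,      // cross-section at each point
                float e)
{
    // Below the mesh, sigma*v is taken constant, so sigma scales as 1/sqrt(E).
    if (e <= energy.front())
        return xs.front() * std::sqrt(energy.front() / e);

    // Binary search for the first index whose energy exceeds e.
    std::size_t lo = 0, hi = energy.size() - 1;
    while (hi - lo > 1) {
        std::size_t mid = (lo + hi) / 2;
        if (energy[mid] > e) hi = mid;
        else                 lo = mid;
    }

    // Simple linear interpolation, even in the resonance regions.
    float frac = (e - energy[lo]) / (energy[hi] - energy[lo]);
    return xs[lo] + frac * (xs[hi] - xs[lo]);
}
```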
Nuclear data and cross-section information is stored in the program's executable
directory with one file per nuclide. For example, the file named "8016" contains cross-sections
for ¹⁶O. The code determines which cross-section files to load into memory based on the
ZAIDs defined in the material input file. A sample cross-section data input file is located in
Appendix C.
3.1.4 Initial Source Neutrons
The user can provide the initial source neutron locations and energies. Neutron
directions, however, are randomly generated by the program on the fly. These neutrons are
added to a neutron queue during problem initialization and are removed as they are used.
Neutrons produced from fission are also added to the end of this queue, using the location of the
fission that caused them as their initial location and selecting the initial energy randomly from the
fission energy spectrum. If the queue is depleted of user-input source neutrons and fission
neutrons, then neutrons are produced at the problem origin with an energy randomly selected
from the ²³⁵U fission energy spectrum.
Source neutron information is input into the program by use of the file
"FILE_NAME.src", where FILE_NAME is defined by the user. A sample source neutron input
file is located in Appendix C.
3.1.5 Random Number Generator
A linear congruential pseudo-random number generator from Numerical
Recipes (William Press 2002) was used. Of course, more complicated generators could have been used,
but this was considered appropriate given that the goal of this research is not exact answers, but
the feasibility of speedups through the use of GPUs.
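A minimal sketch of such a generator is shown below; the multiplier/increment pair (1664525, 1013904223) is a well-known 32-bit LCG choice from Numerical Recipes, though the exact constants used in the thesis code may differ.

```cpp
// Minimal sketch of a 32-bit linear congruential generator; the constants
// (1664525, 1013904223) are a well-known pair from Numerical Recipes, though
// the exact generator used in the thesis code may differ.
#include <cassert>
#include <cstdint>

struct Lcg {
    std::uint32_t state;

    explicit Lcg(std::uint32_t seed) : state(seed) {}

    // Advance the state: x_{n+1} = (a*x_n + c) mod 2^32.
    std::uint32_t next_u32() {
        state = 1664525u * state + 1013904223u;
        return state;
    }

    // Uniform float in [0, 1); the top 24 bits keep the result exact in float.
    float next_float() {
        return (next_u32() >> 8) * (1.0f / 16777216.0f);
    }
};
```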
3.1.6 LADONc Algorithm Details
The following sections discuss the steps required of the history-based algorithm. This
algorithm essentially performs the required neutron tracking and neutron-nuclide interaction on a
per-neutron basis. That is, each neutron is tracked from birth to death (absorption or leakage)
before the next neutron‘s tracking can begin. The following sections describe the steps
performed for each neutron. Delta tracking was considered for inclusion, but was judged to be as
likely to cause divergent branches as the above method, and so the more straightforward
approach was used.
3.1.6.1 Neutron Tracking
As can be seen in Figure 5, the neutron distance to collision is determined as follows:
1. Determine the cell that the neutron is in. This is performed by checking whether
the neutron's position is inside any of the objects in the geometry. Since the
input is entered so that any object that is inside of another one is entered first, the
smallest cell containing the neutron will be identified first, and this will be used.
2. A randomly generated number between 0.0 and 1.0 is created, called η.
3. The expected collision distance is calculated based on Equation 2.
coll_dist = −ln(η) / ΣT
Equation 2
However, this collision distance may or may not be the distance to proceed with.
If the collision distance is larger than the distance to the boundary further
calculations are required.
4. The current material total macroscopic cross-section, ΣT, is obtained for the
current cell and the current energy.
5. The distance to the current cell boundary is calculated. There is a chance that
there is another object inside of the current one, so all of those objects in the
'inside_me' list (discussed previously) must also be checked. The smallest of all
of those distances is kept as the actual travel distance.
6. If the collision distance is larger than the distance calculated in the previous step,
the following takes place:
a. The neutron is moved the distance calculated in Step 5 plus some small
additional distance (0.001) to assure that the neutron is outside of the cell
it was just in, accounting for any possible rounding error.
b. The random parameter η is then updated to reflect the fact that the
neutron has ‗used‘ some of the original η. This is performed per
Equation 3 below. Note that the 0.001 added above is accounted for
here.
η = exp[(distance travelled + 0.001 − collision distance) · ΣT]
Equation 3
c. The current cell is queried in the same way it is described in Step 1
above.
d. If the current cell is inside the problem boundary (i.e., if the neutron has
not leaked) move to Step 3 and repeat until the collision distance is less
than the distance to the cell boundary. Otherwise, flag the neutron as
being leaked, and terminate its history.
7. If the collision distance is less than the distance calculated in Step 5, the neutron
is moved that distance.
The program has now successfully tracked the neutron to its collision location and is ready to
select the reaction type that the neutron will undergo.
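The distance-to-collision and η-update relations above (Equations 2 and 3) can be sketched as follows; this is illustrative, not the LADONc source, and a single, constant total cross-section is assumed for simplicity.

```cpp
// Illustrative sketch (not the LADONc source) of Equations 2 and 3 from the
// tracking steps above; a single, constant total cross-section is assumed.
#include <cassert>
#include <cmath>

const float kNudge = 0.001f;  // small push past the boundary (rounding guard)

// Equation 2: expected collision distance for a sampled eta in (0, 1).
float collision_distance(float eta, float sigma_t)
{
    return -std::log(eta) / sigma_t;
}

// Equation 3: updated eta after travelling (boundary distance + nudge),
// so the unused portion of the flight carries over into the next cell.
float update_eta(float distance_travelled, float coll_dist, float sigma_t)
{
    return std::exp((distance_travelled + kNudge - coll_dist) * sigma_t);
}
```

Note that, with a constant ΣT, resampling with the updated η reproduces the remainder of the original flight: collision_distance(update_eta(d, c, ΣT), ΣT) equals c − d − 0.001.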
3.1.6.2 Determination of Target Nuclide and Reaction Type
Before a reaction type can be determined, the nuclide type that was collided with must be
determined since each cell can be composed of many different nuclides. This is performed by
first generating a random number between 0.0 and the total material macroscopic cross-section (at
the energy of interest). Then, for each isotope in the material, the total macroscopic cross-section
is determined and compared with this random number. The nuclide the neutron has interacted
with is therefore the one whose range of total cross-section contains the value of the random
number.
Next, the reaction type that the neutron underwent when it collided with the target
nuclide is determined. The types of reactions available for the nuclide are: fission absorption,
non-fission absorption, elastic scattering and inelastic scattering. The target nuclide microscopic
cross-sections for each of these reactions are determined. Again, a random number is generated
and compared with these cross-sections to determine the reaction type.
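The cumulative search described above can be sketched as follows; the names are illustrative, not taken from the LADONc source.

```cpp
// Illustrative sketch (not the LADONc source) of selecting the target nuclide:
// a random value drawn on [0, Sigma_total) is walked down the per-nuclide
// macroscopic cross-sections until it is exhausted.
#include <cassert>
#include <vector>

// sigma: each nuclide's total macroscopic cross-section at the neutron energy;
// xi: a random draw in [0, sum of sigma). Returns the index of the nuclide hit.
int sample_nuclide(const std::vector<float>& sigma, float xi)
{
    for (std::size_t i = 0; i + 1 < sigma.size(); ++i) {
        if (xi < sigma[i]) return (int)i;
        xi -= sigma[i];
    }
    return (int)sigma.size() - 1;  // last nuclide absorbs any rounding slack
}
```

The same pattern, applied to the target nuclide's microscopic cross-sections for fission, non-fission absorption, elastic scattering, and inelastic scattering, selects the reaction type.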
3.1.6.3 Fission Reactions
If the neutron caused fission, the code will then create fission neutrons and add them to
the queue. Since fission events can emit a random number of neutrons, the code must first
determine how many to add to the queue. This is done rather simply: a random number is
generated between 0.0 and 1.0. If the number is larger than 0.53 (the average value of ν for ²³⁵U
minus two), then three neutrons are created; if not, then two are created. Note that no
distinction is made between prompt and delayed neutrons. The number of fissions that occurred
and the fission neutrons produced are tallied separately.
For each neutron produced, the code will add a neutron to the queue with the initiating
neutron‘s location, but with an energy determined by sampling from the fissioning nuclide‘s
fission neutron energy spectrum, χ(E). The simulation of the initiating neutron is then terminated.
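The sampling rule described above can be written as a one-line helper; this is illustrative, not the LADONc source.

```cpp
// Illustrative one-line helper (not the LADONc source) for the rule above:
// draw xi on [0, 1) and create three fission neutrons when xi exceeds 0.53,
// otherwise two.
#include <cassert>

int sample_fission_neutrons(float xi)
{
    return (xi > 0.53f) ? 3 : 2;
}
```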
3.1.6.4 Non-Fission Absorption Reaction
If a neutron is absorbed without fissioning the target nuclide, then the neutron simulation
is terminated, and the absorption tally is incremented by one.
3.1.6.5 Scattering Treatment
Elastic and inelastic scattering events are treated in a similar manner. The difference lies
in the angular distribution that is sampled from and the Q-value of the reaction (for elastic
scattering, the Q-value is zero MeV). Both are solved through conservation of energy and
momentum. While the code uses separate functions for each, they will be described here in
parallel.
1. The ZAID of the target nuclide is decomposed into just the A value, since the
kinematics equations require knowledge of the atomic mass. Note that the mass
here is simplified as just A, instead of the actual atomic mass.
2. The first step is to determine the Center-Of-Mass (CM) system direction cosine,
µCM.
a. For elastic scattering, a value of µCM is picked by selecting a random
number between -1.0 and 1.0 from a uniform distribution. This
corresponds with isotropic scattering in the CM system.
b. For inelastic scattering, the same angle, µCM, is created through rejection
sampling from a distribution defined by a Legendre series truncated after
the 8th term. This is done by first randomly selecting µCM, from -1.0 to
1.0. This is then plugged into the Legendre polynomial expansion
(P(µCM)). Then another random number is generated from 0.0 to 8.0 (to
reflect the maximum possible value of the Legendre series in the range -
1.0 to 1.0). If the new random number is less than or equal to P(µCM)
then µCM is retained. If not, this step is repeated.
3. Now that µCM has been determined, the outgoing energy is determined.
a. For elastic scattering, this is calculated as shown in Equation 4.
Energy_out = Energy_in · (A² + 2·µCM·A + 1) / (A² + 2·A + 1)
Equation 4
b. For inelastic scattering, the reaction Q-value (the amount of energy
bound in the target nucleus during the process) is sampled from a
continuous distribution from 0.0 to Qmax. This is clearly a simplification, as the
Q-value spectrum is not continuous, but this method was chosen to simplify the
nuclear data requirements of the code. Qmax, which is negative for inelastic
scattering, is determined using the relation shown in Equation 5. This relation is
derived to restrict the Q-values to those which are physically possible; i.e., if
energy cannot be conserved, then the Q-value chosen is impossible. After Q is
determined, the outgoing energy is determined by the relation shown in
Equation 6.
Qmax = −Energy_in · A / (A + 1)
Equation 5
Energy_out = Energy_in + Q · (A + 1) / A
Equation 6
4. From this stage forward the elastic and inelastic scattering routines are the same.
The remaining tasks are performed to convert the ingoing angle, Ωin, to the
outgoing angle, Ωout, through the kinematics relations, µCM, and the energy
change.
After the scatter reaction changes the neutron's energy and direction, the process continues from
Step 1 of the neutron tracking algorithm until the neutron is lost through leakage or
absorption.
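The kinematics relations above (Equations 4 through 6) can be collected into a short sketch; the function names are assumptions, and single precision is used to match the codes.

```cpp
// Illustrative collection (not the LADONc source) of the scattering kinematics
// above: Equation 4 for the elastic outgoing energy, Equation 5 for the most
// negative allowed Q-value, and Equation 6 for the inelastic outgoing energy.
#include <cassert>
#include <cmath>

// Equation 4: elastic outgoing energy for CM scattering cosine mu_cm.
float elastic_energy_out(float e_in, float a, float mu_cm)
{
    return e_in * (a * a + 2.0f * mu_cm * a + 1.0f)
                / (a * a + 2.0f * a + 1.0f);
}

// Equation 5: most negative physically possible Q-value.
float q_max(float e_in, float a)
{
    return -e_in * a / (a + 1.0f);
}

// Equation 6: inelastic outgoing energy for a sampled Q (Q <= 0).
float inelastic_energy_out(float e_in, float q, float a)
{
    return e_in + q * (a + 1.0f) / a;
}
```

Substituting Q = Qmax from Equation 5 into Equation 6 gives an outgoing energy of exactly zero, which is the physical restriction the relation enforces.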
3.2 CERBERUSc
Monte Carlo neutron transport is characteristically driven by IF/ELSE tests and WHILE
loops. However, processors utilizing the Single Instruction-Multiple Data (SIMD) architecture,
such as vector computers and, in a sense, GPUs (since all threads in a warp must follow the same
execution path), suffer large performance losses if each compute unit (thread) requires different
instructions to be executed, so such divergence should be avoided if possible. To accomplish
this, the history-based algorithm of analyzing many events for a single neutron and repeating for
all neutrons is modified so that the computer analyzes many neutrons for a single event at once
(such as calculating the change in energy and direction due to a scatter event). This
methodology, as discussed by Brown and Martin, helps to reduce the number of IF/ELSE and
WHILE tests that can induce divergent branches in the vector portions of the code (Forrest B.
Brown 1984).
For this work, the LADONc source code was modified to utilize this algorithm and is
referred to as CERBERUSc. All physics and assumptions in CERBERUSc are the same as for
LADONc. CERBERUSc calculates four 'events' at once: transporting neutrons to the collision
location, determination of the reaction type, elastic scattering, and inelastic scattering. In
between each step, the neutron lists were re-shuffled to reflect the results of each event.
Intermediate calculations were performed, and the next event begun. This had to be repeated
until all neutrons in the batch were absorbed or leaked. This method requires a great deal of
event list shuffling.
3.3 Porting Codes to the GPU
In this section, work necessary to port the CPU codes to the GPU is discussed. Both the
event and history-based codes required similar work to be performed to port the codes.
Before any code can be run on the GPU, the required memory space must be allocated in
the GPU RAM and the data passed from system RAM to the GPU RAM. This is accomplished
through CUDA runtime API function calls similar to the C++ language functions malloc and
memcpy (in fact, they are called cudaMalloc and cudaMemcpy, respectively). For the case of the
MC codes, the cell list, materials list, and cross-section data must be transferred to the GPU
before any calculations can begin. The list of neutron data in the current batch is passed before
each batch begins. Data also can be transferred back from the GPU RAM to the system RAM.
This is accomplished through similar API calls as discussed above, except a flag used to indicate
the direction of data transfer is changed. See Figure 6 for example C++ code which transfers the
neutron batch list to the GPU RAM.
Figure 6: Sample Code Transferring Neutron Data to GPU
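The original figure is an image and its code is not reproduced in this transcript. A representative sketch of the transfer it describes might look as follows; the structure and variable names (neutron_t, h_neutrons, n_batch) are assumptions, not taken from the thesis source.

```cuda
// Hypothetical sketch of the neutron batch transfer described in the text; the
// type and variable names (neutron_t, h_neutrons, n_batch) are illustrative.
#include <cuda_runtime.h>

struct neutron_t {
    float x, y, z;        // position
    float u, v, w;        // direction cosines
    float energy;         // energy in MeV
    int   cell;           // current cell index
    unsigned long seed;   // per-neutron RNG seed
};

neutron_t *copy_batch_to_gpu(const neutron_t *h_neutrons, int n_batch)
{
    neutron_t *d_neutrons = NULL;
    size_t bytes = (size_t)n_batch * sizeof(neutron_t);

    // Allocate space in GPU global memory, then copy host data into it.
    cudaMalloc((void **)&d_neutrons, bytes);
    cudaMemcpy(d_neutrons, h_neutrons, bytes, cudaMemcpyHostToDevice);
    return d_neutrons;
}
```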
Once the data is on the GPU, the GPU can begin to perform calculations using it. But
first the code must be ported to use CUDA. This includes creating a kernel (the main execution
pathway) in which, instead of looping over all neutrons in a batch (or event set, for the
event-based algorithm), one thread is assigned to each neutron to be simulated. This is the only
major distinction necessary to keep in mind when creating the kernel. Otherwise, the same
functions can be used by the GPU code as by the CPU code, as long as the function definitions
are prepended with "__device__ __host__", a pair of keywords that tell the CUDA compiler to
compile versions of the function to be run on the GPU (device) and on the CPU (host). Note that
there can also be GPU-only functions (those prepended with "__device__" alone).
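A minimal sketch of this one-thread-per-neutron structure is shown below; the kernel and function names are assumptions, not taken from the thesis source.

```cuda
// Hypothetical sketch of the one-thread-per-neutron kernel structure the text
// describes; names (track_neutron, neutron_t, history_kernel) are illustrative.
#include <cuda_runtime.h>

struct neutron_t { float x, y, z, energy; };

// Compiled for both the GPU (device) and the CPU (host).
__device__ __host__ void track_neutron(neutron_t *n)
{
    // ... transport, collision physics, and tallies for one neutron ...
}

// Kernel: each thread picks one neutron instead of looping over the batch.
__global__ void history_kernel(neutron_t *neutrons, int n_batch)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_batch)
        track_neutron(&neutrons[i]);
}
```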
For the initial code porting, all of the double-precision data types (―double‖ in
C++/CUDA) were switched to single-precision floating point numbers (―float‖ in C++/CUDA) in
the CPU and GPU versions of the codes.
For LADONg, the GPU was used to parallelize the calculation of each neutron in a batch.
Each thread worked on one neutron from birth until absorption or leakage. For CERBERUSg,
the neutrons in each event were parallelized; each thread worked on one neutron of the event.
Each event was a kernel, and the re-shuffling was performed on the CPU. In both codes, sets of
threads were collected into blocks to allow as many threads to be processed at once as possible
and to increase the latency hiding ability of the GPU.
For all GPU codes, all problem data (neutron information, geometry data, cross-sections,
and materials data) was stored in global memory. The compiler placed most runtime variables
that were not arrays into registers, per its default action.
Once the code is ported, it only has to be compiled with the CUDA compiler, nvcc,
which also calls the C++ compiler to build the CPU portions of the code. If desired, the resulting
program can be built as a single executable file.
The above is all the work necessary to perform the porting. More CUDA experience
would certainly have reduced this effort to a matter of hours. That being said, a programmer can
spend endless amounts of time optimizing the CUDA code once it is implemented to a working
level. The next chapter will discuss such optimization efforts.
3.4 Models Analyzed
Models with differing numbers of objects and scattering collisions per neutron will
respond differently to being run on the GPU. As the number of objects increases, more and more
objects need to be checked when determining which cell the neutron is in and the distance to the
nearest cell boundary. Also, if there is one large object in one area of the model and the rest are
very small, the neutrons in the large object have a higher chance of spending time in that one cell,
reducing the number of passes through the geometry-tracking WHILE loop in the code.
Therefore, a range of models must be utilized for a proper comparison between the CPU and
GPU codes.
For the purposes of this thesis, five different models were analyzed. Each will be
discussed in the following paragraphs.
3.4.1 Model “Test”
This model, called "Test", earned its name because it was the primary model used during
development of the code. It features one sphere, centered at the origin, with a radius of
22.36 cm (the radius squared is 500 cm²). The sphere is composed of ²³⁵U
(0.04819545 atoms/barn-cm) and H2O (0.05 molecules/barn-cm).
3.4.2 Model “Two”
This model contains two concentric spheres, both centered at the origin. The inner sphere
has a 5 cm radius, while the outer has a 22.36 cm radius. The inner sphere is composed of ²³⁵U
(0.04819545 atoms/barn-cm) and H2O (0.05 molecules/barn-cm), while the outer sphere is
composed of elemental zirconium (0.0425126 atoms/barn-cm).
3.4.3 Model “Three”
This model contains three spheres, two of which are inside a larger one, but not inside of
each other. The largest sphere has a radius of 9.05 cm (the radius squared is 82 cm²) and is
centered at the origin. The other two spheres both have a radius of 4 cm and are centered at
x-values of +5 cm and −5 cm (both are aligned at y = 0 cm and z = 0 cm). Both inner spheres are
made of the same uranium/water mixture as in the previous two models, while the outer sphere is
composed of water alone (0.033 molecules/barn-cm).
3.4.4 Model “Four”
The fourth model contains three spheres and one cuboid. Again there are two inner
spheres of the same geometric and material description as in the "Three" model. The cuboid is
centered at the origin and extends from −10 cm to +10 cm in the x, y, and z directions. Finally,
all three inner shapes are encapsulated by one large sphere, centered at the origin, with a radius of
18.71 cm (the radius squared is 350 cm²). The cuboid and outer sphere are both made of water
(0.033 molecules/barn-cm).
3.4.5 Model “Array2”
This model was created to tax the system. It features a large amount of water (hence,
many scattering reactions) and fifty-three objects. Fifty-two are spheres, with the outer shape
being a cuboid. Of the fifty-two spheres, forty-eight are used to create twenty-four objects
composed of a UO2 sphere surrounded by a shell of elemental zirconium. The inner UO2 spheres
have a radius of 40 cm, while the outer zirconium shells have a radius of 50 cm. The UO2 is
enriched to 3 w/o ²³⁵U, and the ¹⁶O has a number density of 0.07294 atoms/barn-cm. The outer
zirconium shell is the same material as used in model "Two." These spheres are placed in a
4x3x2 matrix (x, y, z) centered at the origin. The remaining four spheres, with radii of 25 cm, are
placed in the top four corners of the cuboid. The outer cuboid is centered at the origin, extending
from −375 cm to +375 cm in the x-direction and −250 cm to +250 cm in the y- and z-directions.
This object is composed of the same water as used in the previous models. Appendix C lists the
input files for this model.
3.5 Test System Hardware Specification
A large speedup could have easily been achieved by pairing the fastest CUDA-capable
GPU with the slowest CPU that could possibly support it. However, the results presented in the
next chapter were obtained by using relatively high-end, recent hardware. The full specs of the
system are shown in Table 2. The Core i7 CPU from Intel represents the latest in CPU
technology and outperforms a Core 2 Duo E6600 (2.40 GHz, 4 MB L2 cache) by 25% in tests of
LADONc. The GPU, the BFG GeForce GTX 275 OC, is a mid-range GPU that uses the latest
NVIDIA GPU architecture, the GT200. The specifications for this GPU can be seen in Table 3,
and an image of the card itself is shown in Figure 7. This GPU is of CUDA compute capability
1.3, which means that it supports double-precision calculations and has 16,384 registers available
per SM (among other features). Additionally, it is important to note that the GPU in use is
slightly factory overclocked. These core and memory clock speeds were not changed when
obtaining results. The GTX 275 card can support the processing of up to 30,720 concurrent threads.
Table 2: Test System Hardware Specification
CPU Intel Core i7-920 (quad-core, 8MB L2 cache, 2.66 GHz)
Motherboard MSI Platinum SLI X58
Chipset Intel X58
Hard Disk Seagate 1TB SATA Hard Drive (32MB Buffer)
Memory OCZ Technology 6GB (Three 2GB) DDR3-1600 RAM
Video Card BFG GeForce GTX 275 OC 896MB GDDR3 RAM
Video Drivers NVIDIA Driver Version 190.18
Operating System Ubuntu 9.04 x86_64
Table 3: GPU Specifications, Courtesy BFG
BFG NVIDIA GeForce GTX 275 OC 896MB PCIe 2.0
GPU NVIDIA® GeForce® GTX 275
Core Clock 648MHz (vs. 633MHz standard)
Shader Clock 1440MHz (vs. 1404MHz standard)
Streaming Multiprocessors 30
Processor Cores 240
Video Memory 896MB
Memory Type GDDR3
Memory Data Rate 2304MHz (vs. 2268MHz standard)
Memory Interface 448-bit
Memory Bandwidth 129GB/sec
CUDA Capability 1.3 (Double-Precision Support)
Peak Giga-flops per Second 1010.88 (Single-Precision)
Figure 7: BFG/NVIDIA GTX 275 OC, Courtesy BFG
3.6 CERBERUSg Initial Performance
CERBERUSg was the first conversion to be completed. Initial studies showed only a
13% speedup when compared to CERBERUSc (which is not the optimal algorithm for a serial
processor such as a CPU); CERBERUSg actually required more time to run than LADONc. The
reason for the slow runtime, despite the large number of threads available on the GPU, is that the
algorithm required many portions of the code to operate in serial instead of parallel (reducing the
maximum possible benefit), and the data transfer from GPU RAM to system RAM was required
for every re-shuffling of the event lists (adding time). CERBERUSc and CERBERUSg were thus
abandoned as soon as the first LADONg port was complete, due to the obviously large amount of
programming and optimization that would be necessary to generate a fast code.
3.7 LADONg Initial Performance
After conversion from the CPU to GPU, LADONg ran three times as fast as LADONc on
the small problems that could be run at that time (the ―test‖ model at 24,576 neutrons per batch).
This was clearly the most likely path for success. All subsequent efforts were directed towards
optimizing LADONg to achieve the fastest code possible. It should be noted that any algorithmic
changes made were also considered for application to LADONc, where they were expected to
decrease the CPU runtime as well. The major optimizations performed are
discussed in the following chapter.
Chapter 4
Optimizations
Optimization of the GPU code was an iterative process. The code was examined for any
possible change that could be made, and the results from before and after the change were
compared. If the change resulted in a performance increase then it was kept. If not, a lesson was
learned, and the change was removed. During this process the experiences of others are a very
useful resource; expert resources utilized during this work included the NVIDIA CUDA C
Programming Best Practices Guide (NVIDIA Corporation 2009), and the NVIDIA CUDA GPU
Computing user forums (a community of CUDA programmers available to answer questions and
provide suggestions) (NVIDIA Corporation 2009). This chapter will discuss the major LADONg
optimization efforts.
4.1 Optimization Efforts
4.1.1 Reducing Memory Latency
As discussed previously, the global memory access time is approximately 200 cycles.
The impact of this can be mitigated by running enough blocks of threads to allow the GPU to
perform calculations on other blocks while waiting for the data to be returned from global
memory. This is clearly not optimum while memory with less latency is available; there are
16kB of shared memory (memory shared by each SP, with a latency of approximately 2 cycles);
64kB of cached constant memory (which can be used by all threads and blocks, is as fast as
reading from registers if the data requested is in the 8kB cache per MP, otherwise it can cost
approximately 200 cycles); and large amounts of cached texture memory (which can be used by
35
all threads and blocks, and is as fast as reading from registers of the data is cached, otherwise it
costs approximately 200 cycles to access). The following paragraphs discuss optimizations
performed to utilize this available faster memory. It is important to note that most of these
speedups presented were performed comparing the ―test‖ model, run with 24,576 neutrons (a
multiple of the number of threads per block available on the GPU) run on the GPU compared
with the same model being analyzed on the CPU.
The first step examined was in placing the geometry information in constant memory.
Constant memory was used because the geometric data does not change during the simulation of
neutrons and the information is needed repeatedly during the collision detection portion of the
code. To do this, a "__constant__" variable for the shape data was declared at global scope, and
data was copied to this symbol in the routine that loads data to the GPU, as shown in Figure 8.
Figure 8: CUDA Code Transferring Data to Constant Memory
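The original figure is an image; a representative sketch of the constant-memory declaration and copy it describes might look like the following (the names shape_t, c_shapes, MAX_CELLS, and h_shapes are assumptions, not from the thesis source).

```cuda
// Hypothetical sketch of the constant-memory transfer described in the text;
// names (shape_t, c_shapes, MAX_CELLS, h_shapes) are illustrative only.
#include <cuda_runtime.h>

#define MAX_CELLS 100          // arbitrary cell limit from Table 1

struct shape_t { int type; float params[4]; };

// Constant memory must be declared at global (file) scope.
__constant__ shape_t c_shapes[MAX_CELLS];

void load_geometry_to_gpu(const shape_t *h_shapes, int n_shapes)
{
    // Copy the host-side shape list into the constant-memory symbol.
    cudaMemcpyToSymbol(c_shapes, h_shapes,
                       (size_t)n_shapes * sizeof(shape_t));
}
```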
This change immediately increased the speedup from three to five. This is
because the required geometric data easily fits inside the constant cache, so the number of
200-cycle-latency memory transfers required from global memory was significantly
reduced.
Next, the material information was moved from global memory to constant memory
using similar code to that used for the shape list. This array was rather large, as it includes a
number density and location for where to find the cross-section data for every nuclide in every
possible cell (for the arbitrary limits chosen, this equates to 100 cells × 20 nuclides per cell =
2,000 nuclide entries). This could have been programmed better, to not reserve memory for
unused nuclide slots in each object, but the increased complexity was deemed unnecessary, since
adequate constant memory space is available. Unfortunately, this work resulted in a speedup of
only a few percent: likely due to the fact that the portion of the code which requires material
information was still limited by the expensive cross-section lookup routine, which takes a large
portion of time in any MC transport code. Efforts to increase the efficiency of the cross-section
lookup are discussed later.
Since the Legendre polynomial coefficients for each nuclide used in the inelastic
scattering function were of small size, they too were included in constant memory as opposed to
global memory. The speedup increase from this effort was only a few percent, which should be
expected since this data is only required when a neutron undergoes an inelastic scattering event.
Finally, every thread requires a data structure to describe the neutron that it is working on
(including data such as: position, cell, random number seed, energy, etc.). This data is accessed
frequently by the kernel, and should not be placed in the high latency global memory. However,
it cannot be placed in constant memory because its values change during runtime. The more
logical location for this memory is in the shared memory, not because it must be shared by
threads, but because it must be modifiable while still being low latency. This was done by
allocating an array of neutron structures, one element of the array per thread, and copying the
starting neutron data from global memory to the newly allocated shared memory. After the
threads have completed the neutron simulation the results are copied back to the global memory
so the results can be transferred back to the system RAM. The code snippet in Figure 9 shows the code required to do this. Unfortunately, this too resulted in a speedup of only about four percent, again because the memory transfers in the cross-section lookup routines remained the dominant limitation.
Figure 9: CUDA Code Transferring Global Memory Data to Shared Memory
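The figure itself is not reproduced in this transcript. A minimal host-side sketch of the pattern it describes follows, with hypothetical structure and variable names; in the real kernel the block-local buffer is declared __shared__ and each thread indexes it with threadIdx.x, while a plain loop stands in for the thread grid here so the pattern can be run without a GPU:

```cpp
#include <cassert>

// A guess at the per-neutron data the text lists; the actual LADON structure
// is not shown in this transcript.
struct Neutron {
    float x, y, z;       // position
    float energy;
    unsigned long seed;  // random number generator seed
    short cell;          // current cell identification
};

const int THREADS_PER_BLOCK = 64;

void transportBlock(Neutron* batch, int blockStart) {
    Neutron local[THREADS_PER_BLOCK];  // would be a __shared__ array on the GPU
    for (int tid = 0; tid < THREADS_PER_BLOCK; ++tid) {
        local[tid] = batch[blockStart + tid];  // global -> shared copy
        local[tid].energy *= 0.5f;             // stand-in for the transport physics
        batch[blockStart + tid] = local[tid];  // copy the result back to global
    }
}
```

On the device the copies are per-thread assignments rather than a loop, and the completed batch array is then transferred back to system RAM as described above.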
4.1.2 Register Pressure
As the number of variables declared in the kernel increases, the number of registers required also increases. Therefore, extraneous variables were removed throughout the optimization process, either by revising the algorithm or by reusing variables that had already been declared. Reducing register usage can allow more blocks to fit on an SM (since the number of registers per SM is limited), and it can avoid the spillover of data from the registers to local memory (which has a long 200-cycle latency). An example of where this was performed is in the elastic scattering function, an excerpt of which is shown in Figure 10.
Figure 10: Example of Variable Reuse to Reduce Registers
This type of effort can be worthwhile at times (though the compiler should catch the most obvious opportunities for register reuse), but it tends to make the code less readable, and thus harder to maintain.
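Since Figure 10 is not reproduced in this transcript, the pattern can be illustrated with a hedged sketch: the standard elastic scattering energy relation E' = E(1 + alpha + (1 - alpha)mu)/2 written once with a temporary per intermediate value and once with a single reused scratch variable. Both functions are illustrative stand-ins for the actual excerpt:

```cpp
#include <cassert>

// Post-collision energy from elastic scattering
// (A: target-to-neutron mass ratio, mu: center-of-mass scattering cosine).

float elasticEnergyVerbose(float E, float A, float mu) {
    // one variable per intermediate quantity: easy to read, more registers live
    float alpha = ((A - 1.0f) * (A - 1.0f)) / ((A + 1.0f) * (A + 1.0f));
    float fraction = (1.0f + alpha + (1.0f - alpha) * mu) * 0.5f;
    return E * fraction;
}

float elasticEnergyReused(float E, float A, float mu) {
    // 'tmp' holds alpha, then is overwritten with the energy fraction:
    // one fewer live register, at the cost of readability
    float tmp = ((A - 1.0f) * (A - 1.0f)) / ((A + 1.0f) * (A + 1.0f));
    tmp = (1.0f + tmp + (1.0f - tmp) * mu) * 0.5f;
    return E * tmp;
}
```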
Another option is the CUDA compiler command-line option "--maxrregcount=N", where N is the maximum number of registers to be used by the kernel. This option forces the compiler to use at most N registers and to place the rest of the required storage locations into local memory. As noted earlier, local memory is slow. The benefit of using --maxrregcount is that if the speedup achieved by allowing more blocks to run on each SM (because the register count is lower) outweighs the slowdown due to the increased local memory usage, then an overall
gain will be achieved. For example, this was used in the double-precision version of LADONg.
Compiling this code without the --maxrregcount option results in 49 registers and 192 bytes of local memory per thread. This limits the kernel to four blocks per SM due to the number of registers available per SM. If --maxrregcount=40 is used, however, the local memory required increases to 248 bytes, but the number of blocks per SM increases to six. This particular change resulted in a runtime decrease of up to 5%.
4.1.3 Cross-Section Lookups
The cross-section lookup routine takes approximately 33% of the computation time in the
LADONc code. This is expected to also be true for LADONg, but a per-function profiler is not
yet available for CUDA so this assumption could not be validated. Great gains for little effort should therefore be expected from improving the performance of this part of the algorithm. While the algorithm itself looked sound, the best chance for optimization was, again, in optimizing the memory path. As discussed previously, the cross-section tables occupy megabytes of data. This
means that the only way to get faster memory access for the entire cross-section table is by
utilizing the texture memory.
At first texture memory was utilized to supply energy values necessary for performing
the binary search to determine which cross-section data array index contains the data to use for
cross-section interpolation. This was found to be detrimental to performance because little to no locality existed for the majority of the energy values being searched, and the very large number of texture memory fetches required by a binary search compounded the problem.
Instead of searching on the energy data in texture memory, the energy data from global
memory was used for the binary search and all other requests for this data (those that did not
perform many requests for the data every few cycles like in the binary search) utilized the texture
memory. This provided some of the benefit of the texture memory's cache. In addition, the cross-sections were stored in texture memory inside a four-component single-precision floating point structure. The four-component structure was used because it allowed the texture memory reads to gather all four cross-section values (fission, non-fission capture, elastic, and inelastic) in one memory read operation instead of four. Together these changes decreased the runtime by approximately 9%.
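The two-part strategy can be sketched as follows, with the search running over the plain energy grid (global memory on the GPU) and each grid point's four cross sections packed into one float4-like record so that interpolation needs a single fetch per bounding index (a texture fetch on the GPU). The names and data layout are illustrative:

```cpp
#include <cassert>
#include <cstddef>

struct XS4 { float fission, capture, elastic, inelastic; };  // float4 analogue

// Binary search: returns i such that grid[i] <= E < grid[i+1]
// (grid ascending, E assumed in range).
size_t lowerIndex(const float* grid, size_t n, float E) {
    size_t lo = 0, hi = n - 1;
    while (hi - lo > 1) {
        size_t mid = (lo + hi) / 2;
        if (grid[mid] <= E) lo = mid; else hi = mid;
    }
    return lo;
}

// Linear interpolation of all four cross sections at once: two packed reads
// (texture fetches on the GPU) instead of eight scalar ones.
XS4 interpolate(const float* grid, const XS4* xs, size_t n, float E) {
    size_t i = lowerIndex(grid, n, E);
    float f = (E - grid[i]) / (grid[i + 1] - grid[i]);
    XS4 a = xs[i], b = xs[i + 1];
    XS4 r;
    r.fission   = a.fission   + f * (b.fission   - a.fission);
    r.capture   = a.capture   + f * (b.capture   - a.capture);
    r.elastic   = a.elastic   + f * (b.elastic   - a.elastic);
    r.inelastic = a.inelastic + f * (b.inelastic - a.inelastic);
    return r;
}
```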
4.1.4 Reducing Shared Memory Usage to Increase Blocks per SM
Towards the end of the optimization process it became evident that the use of shared
memory for the neutron data structure was limiting the number of blocks that can be processed
concurrently on each SM. The shared memory was used solely by the neutron data structure.
This included data for the neutron position (three float variables), neutron direction (three float
variables), neutron energy (one float variable), material total macroscopic cross-section (one float variable), the random variable for collision distance detection ("eta", one float variable), random number generator seed (one unsigned long integer variable), current cell identification (one short integer), and a collided nuclide identifier (one short integer). These were present because in the
CPU code it made for a convenient organization of data specific to each neutron (except the
random number generator seed, which is specific to the GPU because of the parallel structure).
This is a large structure to store in shared memory for each thread in each block on an SM. Since the amount of shared memory per SM is limited, the large shared memory usage by each block limited the number of blocks that could be executed on each SM. For example, running 64 threads per block limited the maximum number of blocks on each SM to five. Since the direction, material total macroscopic cross-section, and random variable for collision distance detection did not need to be stored after the current neutron was transported by the kernel, these were removed from the neutron structure and placed directly into the available registers.
This allowed eight blocks to be run concurrently on each SM, speeding up the calculations by
approximately 25%.
4.1.5 Thread Divergence
In several functions, thread divergence could be avoided through some tweaks to the
algorithms. For example, the most natural change was to allow leaked neutrons to continue as if they had not yet leaked. To do this, the reaction type flag was set to 'l' (lower-case L) to denote leakage, and the neutron's current cell was set to 0 (a cell guaranteed to be defined because of the required ordering of cells in the geometry). The calculations then proceeded just as if the neutron had not leaked, except that a statement was added where the reaction type is determined: "If the neutron's flag is currently set to 'leak', do not modify it."
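A minimal sketch of this trick, with hypothetical names for the flag field and the reaction-sampling step:

```cpp
#include <cassert>

// A leaked neutron keeps executing the same instruction stream as its
// neighbors instead of exiting early (which would diverge the warp).
struct Neutron { char reaction; int cell; };

void markLeaked(Neutron& n) {
    n.reaction = 'l';  // lower-case L denotes leakage
    n.cell = 0;        // cell 0 is guaranteed to exist, so tracking stays valid
}

// Reaction-type determination refuses to overwrite the leak flag.
void setReaction(Neutron& n, char sampled) {
    if (n.reaction == 'l') return;  // "if flagged 'leak', do not modify it"
    n.reaction = sampled;
}
```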
4.1.6 Increasing the Speed of Data Transfer Between System RAM and GPU RAM
As shown in the CUDA Software Development Kit sample program "bandwidthTest", using page-locked memory for arrays in system RAM that will be transferred to and from the GPU RAM can increase data throughput. This is mainly because the data is page-locked in system RAM (i.e., it will always be resident in memory and never paged out to disk if the operating system decides it needs more RAM), and because the GPU can access the data over the PCIe bus without having to communicate through the CPU, removing a potential bottleneck from the process. This memory management strategy was implemented for the neutron batch list so that the data transfers to and from the GPU before and after each batch would be faster. This is achieved in CUDA by using the API function "cudaMallocHost()" to allocate memory in system RAM instead of the standard C library's "malloc()" routine, and "cudaFreeHost()" instead of "free()".
Since the system this was run on has 6 GB of RAM (i.e., there is plenty of space in RAM, so no data for any process needs to be paged out to disk), and the Core i7 CPU has very fast memory transfer speeds, not much speedup was expected for this application. This intuition proved correct, as only a 1.5% speedup was seen when this feature was implemented.
The feature was kept through the optimization process because it was expected to become more
important as the batch size increased.
4.1.7 Additional Optimizations Performed Which Reduce Accuracy
The CUDA architecture includes intrinsic floating point and integer math functions
(multiplication, division, sine, cosine, natural log, exponentials, etc.) which utilize specialized
hardware routines to perform the calculations. These intrinsic functions perform the computations faster than the standard software equivalents but with a reduction in accuracy. For example, the CUDA single-precision exponential function, "expf(x)", provides solutions with an accuracy of 1 ULP (Unit in the Last Place, a unit of accuracy for floating point numbers). The intrinsic version, "__expf(x)", has a maximum ULP error of 2 + floor(abs(1.16x)), at least twice as large as that of the software-based equivalent (NVIDIA Corporation 2009). These intrinsic
functions can be utilized for further runtime improvements in calculations where high accuracy is not required. A study was performed in which all math functions were replaced with their intrinsic counterparts (where an equivalent existed); the results are discussed in the next chapter.
Chapter 5
Results
This chapter presents a comparison of runtimes and kcalc values to show the speedups and levels of accuracy obtained. All "speedups" discussed are defined as the CPU compute time divided by the GPU compute time (i.e., LADONc time divided by LADONg time). These times are not the entire runtime from the beginning of the program to completion, but instead are measured from just before the first neutron batch begins running until the very last neutron batch is finished. This is done because loading the problem from files requires the same amount of time in both versions of the code, and the copying of the problem geometry, materials, and cross-section information from system RAM to GPU RAM is negligible when required only once (given transfer speeds of up to 5 GB/s). This timing scheme does, however, capture the
transfer of the neutron batch data from the system RAM to the GPU RAM, and back, for every
batch in LADONg.
For the following analyses, the CPU and GPU codes were run for the five previously
identified models at neutron batch sizes of 100, 1,000, 10,000, 24,576, 49,152, 98,304, 196,608,
and 1,000,000 histories. Fifty batches were run with the first fifteen of these discarded. The
following sections describe results for each of the analyses performed.
The single- and double-precision versions of LADONc were compiled using GNU's g++ with the following syntax: "g++ <filename.cpp> -o <executable name>". The default compiler options were used. The LADONg codes discussed were compiled using NVIDIA's nvcc compiler with the following syntax: "nvcc <filename.cu> -o <executable name> -arch sm_13 <additional options for each problem>". The "-arch sm_13" option defines which CUDA compute capability version the GPU used is compatible with. For the GTX 275, this is compute capability 1.3. The additional options are discussed in each section where applicable.
5.1 Single-Precision Results
5.1.1 Use of the Accurate Math, Single-Precision Functions
Figure 11 shows the speedups obtained with the LADONg code compiled without the faster but less accurate intrinsic math functions. The maximum speedup is 20.56x (model "four" at 1,000,000 neutrons per batch). The next best performing model was the array2 model, at 19.37x (from the 49,152 neutrons per batch run). The speedup levels off as the number of neutrons per batch increases because the GPU only needs a certain number of blocks in flight to hide the memory latency adequately; any additional blocks beyond that have a linear impact on runtime.
Figure 11: Single-Precision Speedup (speedup vs. neutrons per batch for the test, two, three, four, and array2 models)
5.1.2 Use of the Intrinsic Math, Single-Precision Functions
The CUDA compiler (nvcc) command line argument "-use_fast_math" forces the compiler to use the intrinsic single-precision functions, which run significantly faster than the standard functions at the cost of reduced accuracy. This option affects functions such as logf(), expf(), sinf(), cosf(), et cetera (the complete list can be found in the CUDA Programming Guide (NVIDIA Corporation 2009)). Figure 12 shows the speedup if LADONg is compiled with this option.
Figure 12: Single-Precision Speedup Using Fast Math
The maximum speedup is now seen in array2, at 23.91x (49,152 neutrons per batch). This is a 23% increase in speedup compared to the same run's performance without the fast math option. Four, the performance leader in the previous case, saw a speed increase of 8%. The larger runtime reduction for array2 is expected because array2
contains significantly more geometric objects in the model than any of the other cases (fifty-three versus at most four). The increased number of objects means more calculations are required in the neutron tracking portion of the code, where most of the functions converted to intrinsic functions are located, as seen in the source code.
5.1.3 Examination of the Observed Peak
In all of the above cases, there seems to be an ideal configuration for the array2 model.
Since the speedup is a function of LADONc and LADONg runtimes, the large speedup increase
could be due to an increase in CPU runtime (without a corresponding increase in GPU runtime),
or a decrease in GPU runtime (without a corresponding decrease in CPU runtime). To analyze
this trend, LADONc and LADONg were run for all models with finer resolution in the number of neutrons per batch around this 'peak' point. Figure 13 and Figure 14 show the
resultant runtimes for the CPU and GPU code, respectively. The CPU code shows variability in
the runtimes, especially for the array2 model, but the trend still seems to follow a linear
trajectory.
The GPU code displays more linear results except for abrupt jumps in runtime near 17,000 and 33,000 neutrons per batch. Interestingly, these jumps occur when multiples of the GPU's concurrent capacity are exceeded. With one neutron per thread, sixty-four threads per block, eight blocks running per SM, and 30 SMs per GPU, the GPU is at maximum capacity when 15,360 neutrons per batch are being simulated. Beyond that point the GPU places the excess blocks in a queue, where they wait until a slot opens up on an SM. Any multiple of this number causes a similar response, although smaller in magnitude.
Computing the speedups of these cases reveals the same sort of 'optimal' behavior, as shown in Figure 15. This explains why the peak is obtained for the 24,576 neutrons per batch case in these analyses.
Figure 13: LADONc Run Time in the Peak Region
Figure 14: LADONg Run Time in the Peak Region
Figure 15: Speedup in the Peak Region
5.2 Double-Precision Results
5.2.1 Use of the Accurate Math, Double-Precision Functions
To quantify the impact of performing double-precision floating point computations on the GPU, double-precision versions of the CPU and GPU codes were created. These versions did not force every floating point calculation to be performed in double-precision; instead, only those related to the geometric tracking were performed with double-precision math. This was because the cross-section data available in the Evaluated Nuclear Data File (ENDF) and other repositories is not of double-precision accuracy, and so should not require double-precision calculations. The double-precision version of LADONg was compiled with the option "--maxrregcount=40" to reduce the number of registers required from 49 to 40 and thereby increase the number of concurrent blocks per SM.
These results are shown in Figure 16. The runtime of the CPU code was only negligibly
impacted by switching to double-precision geometry computations. However, the GPU runtime
was increased by a factor of roughly one-and-a-half to two. The largest speedup obtained was 11.21x, from the runs of the "four" model.
As discussed previously, there is one double-precision math unit per SM, while each SM has eight single-precision math units. This would suggest that only 1/8th of the performance could be obtained using double-precision math. However, if the code is limited by memory latency and transfer time rather than calculation speed, then the memory performance of double-precision math is the limiting factor. Double-precision data requires twice as many bytes, so half of the memory transfer performance is to be expected. This is the case for LADONg and explains why its performance approaches 1/2 that of the single-precision version.
Figure 16: Double-Precision Speedup
5.2.2 Use of the Intrinsic Math, Double-Precision Functions
Finally, the fast-math compilation option was utilized with the double-precision version
of the GPU code to see the impact on runtime. As can be seen in Figure 17, the largest speedup achieved is 12.66x (from the "four" model at 1,000,000 neutrons per batch). This is approximately 17% faster than the same double-precision model without fast math. The same model (and neutrons per batch) runs in 56% of the time using single-precision fast math, which agrees with the previous conclusions about the expected performance penalty when switching from single-precision to double-precision floating point math.
Figure 17: Double-Precision Speedup Using Fast Math
5.3 Accuracy Comparison
Given the previous discussion of the non-IEEE-compliant floating-point arithmetic, a useful comparison is to observe the differences in the kcalc values determined by the versions of the codes presented above. Table 4 shows the difference in kcalc
between the CPU and GPU codes for some of the cases discussed above. A positive value
indicates that the value of kcalc for the GPU was larger than its CPU counterpart. The cases up to
and including 1,000,000 neutrons per batch were run with 50 batches (first 15 batches discarded).
The 4,000,000 case was run at 100 batches (25 discarded). No run was obtained for the array2
model at 4,000,000 neutrons per batch due to the large CPU runtime.
These results show three trends: 1) increasing the number of histories removes the
differences due to differing random number streams between the CPU and GPU codes; 2) a bias
still exists between the CPU and GPU codes even after 75 batches of 4,000,000 neutrons (likely
due to a fault in the parallel random number generator algorithm used); and 3) the bias that exists
is not a strong function of using the fast-math functions.
Table 4: Accuracy Comparison (differences in kcalc, GPU minus CPU)

Neutrons per Batch | Accuracy Case    | test    | two     | three   | four    | array2
10,000             | Single-Precision | -0.0104 |  0.0001 | -0.0011 | -0.0014 |  0.0003
10,000             | Single Fast-Math | -0.0104 | -0.0021 | -0.0011 | -0.0034 |  0.0005
10,000             | Double-Precision | -0.0053 | -0.0002 | -0.0037 | -0.0015 |  0.0017
24,576             | Single-Precision | -0.0030 | -0.0060 |  0.0016 | -0.0039 |  0.0034
24,576             | Single Fast-Math | -0.0030 | -0.0081 |  0.0033 | -0.0032 |  0.0010
24,576             | Double-Precision | -0.0030 | -0.0002 |  0.0010 | -0.0010 |  0.0046
49,152             | Single-Precision |  0.0033 |  0.0009 |  0.0009 | -0.0014 |  0.0006
49,152             | Single Fast-Math |  0.0020 | -0.0013 |  0.0009 | -0.0007 |  0.0018
49,152             | Double-Precision |  0.0008 |  0.0008 | -0.0005 |  0.0008 |  0.0002
98,304             | Single-Precision |  0.0010 |  0.0012 |  0.0001 |  0.0000 |  0.0001
98,304             | Single Fast-Math |  0.0018 | -0.0008 |  0.0001 | -0.0004 | -0.0003
98,304             | Double-Precision | -0.0019 | -0.0008 |  0.0010 | -0.0005 | -0.0005
196,608            | Single-Precision | -0.0017 |  0.0001 |  0.0021 |  0.0010 |  0.0006
196,608            | Single Fast-Math | -0.0016 | -0.0010 |  0.0005 |  0.0002 |  0.0007
196,608            | Double-Precision |  0.0001 | -0.0010 |  0.0013 |  0.0002 | -0.0005
1,000,000          | Single-Precision | -0.0002 |  0.0001 |  0.0008 |  0.0008 | -0.0003
1,000,000          | Single Fast-Math | -0.0012 | -0.0007 |  0.0009 |  0.0005 | -0.0003
1,000,000          | Double-Precision | -0.0013 |  0.0000 |  0.0060 |  0.0007 | -0.0005
4,000,000          | Single-Precision | -0.0002 |  0.0001 |  0.0009 |  0.0008 |  N/A
4,000,000          | Single Fast-Math | -0.0001 |  0.0002 |  0.0009 |  0.0008 |  N/A
4,000,000          | Double-Precision |  0.0001 |  0.0001 |  0.0011 |  0.0009 |  N/A
53
Chapter 6
Applicability to Production Codes
The results obtained in this thesis pertain directly only to the speedup of the LADONc code; the reader is more likely interested in how a GPU would accelerate their own MC applications. The following section discusses limitations identified from the experience gained porting the LADONc code to LADONg. After that is a discussion of ways in which a GPU-accelerated MC code could still prove rather useful.
6.1 Limitations of CUDA for Monte Carlo Neutron Transport
6.1.1 Maximum Available Memory
The NVIDIA Tesla S1070 (the top-of-the-line professional grade 1U rack mounted GPU
made exclusively for CUDA applications) provides four GPUs, each with exclusive access to
4GB of RAM, totaling 16GB per node. For a typical HPC node, it is not uncommon for a two
CPU system to have access to as much as 48 GB of RAM, all of it accessible by both CPUs.
Monte Carlo Transport codes solving reactor-sized problems will require the 48GB of memory
(maybe more) to store the geometry information, cross-section data, and fine-block region tally
data. Therefore, effective use of GPUs for full-core design calculations would require domain
decomposition techniques to reduce the memory requirements of each of the GPUs in a system.
This is an active area of research for MC applications (Thomas Brunner 2009).
It is important to keep in mind when considering domain decomposition techniques that, since data transferred from GPU to GPU and between CPU and GPU is limited by the PCIe bus (approximately 8 GB/s), communication is an even greater bottleneck for GPU-based solutions.
6.1.2 Accuracy of Computations
At this time, the floating-point arithmetic accuracy is not fully IEEE-754 compliant. This
may not end up being an issue, but that will not be known until a more fully-featured code is
written for CUDA and qualified against a suite of benchmark models. Additionally, NVIDIA has
complete control over the implementation of floating-point arithmetic on the GPUs. Any
qualification study performed may not be applicable to other generations of NVIDIA GPUs.
Besides the general accuracy questions above, another issue exists regarding double-precision floating point performance. For this application, double-precision calculations run at roughly twice the runtime of their single-precision counterparts. Double-precision math is desired for at least the geometric portions of the program to avoid neutrons being lost due to insufficient numerical accuracy during neutron tracking.
6.1.3 Error-Checking Memory
The current generation of NVIDIA GPUs does not support error-checking and correcting (ECC) RAM. This means that a fault in the data is not detected and fixed during runtime by the hardware (as it is in typical CPU cluster nodes), possibly resulting in incorrect and unpredictable results. However, as discussed in the paper by Maruyama, Nukada, and Matsuoka (Maruyama, 2009), software-based ECC is possible for GPU algorithms. Software ECC uses algorithms that detect and correct transient bit-flips while the main algorithm is running. This consumes resources, however; roughly a 60% performance loss was found with this method for bandwidth-limited applications such as LADONc. Computation-limited applications ran with a 7% overhead due to the software ECC.
6.1.4 Maintainability
A program with highly accessible source code and a small learning curve clearly provides long-term knowledge management benefits to an organization. While CUDA code is not difficult to understand on its own, any optimizations made to the code, including efforts to reduce the number of registers, can make it unreadable to those without intimate knowledge of its inner workings (i.e., the developers), reducing the code's maintainability (and possibly its longevity).
6.1.5 Hardware Architecture Changes
The CUDA architecture is relatively young, even in the world of computers. The first
CUDA-capable GPU was released in 2006. Conversely, the x86 technology used by most CPUs in use today was first introduced by Intel in 1978 (Intel Corporation n.d.); the x86 architecture has proven to be about as future-proof a target to develop for as a computing technology can be.
The fact that NVIDIA owns the CUDA architecture and can change it at any time limits the usefulness of a GPU-accelerated MC code for design work, simply because it may not be possible to guarantee that the code will last the life of the design (if an organization has such requirements).
It should be mentioned that work is progressing on an open, free standard for parallel
programming on many different platforms (including both CPUs and GPUs). This standard is
called OpenCL (short for Open Computing Language) (Khronos Group 2009). It is currently supported by many large organizations including Intel, NVIDIA, AMD, IBM, Texas Instruments, and even Los Alamos National Laboratory. This standard should allow a single code to run on any supported platform, automatically taking advantage of the hardware it is run on with minimal (if any) programmer effort. OpenCL 1.0 first shipped on August 28, 2009 in Apple's operating system, Mac OS X Snow Leopard.
6.1.6 Optimizations May Not Be As Successful For Larger Problems
As discussed in Chapter 4, the speedups achieved in this effort were obtained through
fitting the geometric and material data in the GPU's constant memory (a maximum of 64 KB). While the methods used for representing material data in this thesis were not ideal, the number of cells and materials had to be limited for these speedup levels to be achieved; the same speedups should not be expected without such limits. Alternatively, a 'pre-staging' algorithm could be incorporated into the code, dynamically loading the necessary geometric and material data into constant memory when needed instead of fitting it all in before the neutron simulations begin.
Examples in the CUDA programming guide (NVIDIA Corporation 2009) make use of this
method.
6.1.7 Lack of Large-Scale GPU Cluster
This work has shown that an MC code can be accelerated by GPUs (with the caveats described above). However, most organizations do not purchase a high-performance computer (HPC) strictly for use with one specific code. Since many codes will likely not be able to utilize GPUs (at least without a modest programming effort), there is expected to be little support for purchasing such a highly specific HPC until it is more generally applicable. This downside should lessen as general-purpose GPU usage increases and the scientific community adopts GPUs for calculations.
6.1.8 CUDA Development Tools
Currently, the only CUDA development tools are provided by NVIDIA. These include
the CUDA compiler (nvcc), a debugger (cuda-gdb), and a profiler (Visual Profiler).
The debugger is a ported version of the GNU Debugger (gdb), called cuda-gdb, which
performs similar functions to gdb (NVIDIA Corporation 2009). The product description for this
still-in-development-tool from NVIDIA states that: “[cuda-gdb] is designed to present the user
with an all-in-one debugging environment capable of debugging native host code as well as
CUDA code. Standard debugging features are inherently supported for host code, and additional
features have been provided to support debugging CUDA code.”
The Visual Profiler is useful for determining statistics about a kernel such as: amount of
time spent transferring data versus running a kernel; the number of local/global/shared memory
reads and writes (coalesced and uncoalesced); and the amount of thread divergence present in a
code. Unfortunately, the profiler does not yet provide statistics on individual functions within the
kernels – all results are presented in terms of kernels and data transfers.
These tools are very useful; however, during the development of both LADONc and LADONg it was evident that the state of CPU (C++) development tools is much more advanced than that of the equivalent GPU tools. Of course, CUDA has only been available for three years at this point, while C was first introduced in 1972. Until the development tools reach a similar level of usability, programmers should expect to spend more effort debugging and profiling than they are accustomed to.
6.2 Possible Applications of CUDA-Accelerated MC
Even if effort is not expended to address the above issues, a nearly direct application of the work done herein could serve as a design scoping tool or as a neutronics solver in a multi-physics framework. These tools require quick runtimes and can be more flexible about the level of accuracy required. As runtime is reduced, more design iterations can be performed in a given time period, producing a further optimized product for potentially less money.
The following address the previously noted deficiencies as they relate to scoping tools or multi-physics solvers:
- The size of each tally region required can be reduced. This makes it possible to fit a problem on a single GPU with access to only 4 GB of RAM without the use of domain decomposition.
- Single-precision floating-point math can be used (decreasing the runtime by a factor of at least two when compared to a double-precision code).
- For a scoping tool, error checking may not be necessary; if an unexpected deviation is encountered, the code can simply be re-run.
- For a scoping tool, the unknown future of CUDA is not as limiting, since the GPU-accelerated scoping tool would not be used for the final design; it does not have to remain maintainable for the life of the design. This does not, however, mitigate the risk of investing effort in a code only to have it become unusable when the GPU architecture changes.
The code complexity and maintainability issues of highly optimized code will remain in any case; a confusing source code is a confusing source code, regardless of how it is used.
Chapter 7
Summary and Conclusions
This work examined the feasibility of utilizing Graphics Processing Units (GPUs) to
accelerate Monte Carlo neutron transport problems. These GPUs use many parallel processors to
perform the complex calculations necessary to create three-dimensional images at fast enough
rates for the video game industry. The GPU (BFG/NVIDIA GTX 275 OC 896MB) used for this analysis has 240 of these parallel processors and is capable of approximately 1,088x10^9 floating point operations per second. In 2006 NVIDIA released a programming framework (called CUDA) that allows developers to easily create codes which can be executed on the many cores provided by GPUs for general purpose programs not necessarily related to graphics. Initial assessments had suggested that the MC algorithm may not be able to fully utilize the GPU because of its constraints, which include the fact that MC codes are highly dependent on branch statements (IF, ELSE, FOR, and WHILE) that can have a large impact on GPU performance.
This work went through the process of writing a Monte Carlo neutron transport code from scratch
for the x86 CPU platform and later porting it to the GPU CUDA platform, to understand the type
of performance that can be gained by utilizing GPUs.
7.1 Conclusions
After the CPU code was ported with little effort and no optimization, the GPU version was found
to run three times as fast. This result was quite promising, and further work went into
optimization to determine how fast the GPU code could be made to run.
Since the geometry and materials in use define the complexity of the problem, numerous
models were run in this analysis, ranging from simple models with one sphere up to a model
with 53 objects. The maximum speedups obtained can be seen in Table 5. These results show
that the more complex models achieved the largest speedups because more time
was spent in the geometry-tracking portions of the code, where the GPU performance was most
beneficial (rapid access to the important data). Since high-precision methods are desired for
production codes, single- and double-precision versions of the codes were also compared on the
same models. These results show an increase in GPU runtime of approximately a factor of two
when double-precision math is utilized.
Table 5: Summary of Speedups

Case                              Maximum Speedup Compared to CPU
Single-Precision                  20.56x
Single-Precision with Fast Math   23.91x
Double-Precision                  11.21x
Double-Precision with Fast Math   12.66x
Some disadvantages for production-level codes were discussed as well. These include: the
limited size and quality of memory on the GPUs (the maximum GPU RAM available is currently
4GB per card, and no error checking or correction is performed); floating-point operations that
are not IEEE compliant and are slightly less accurate than those that are; and a loss of
performance when double precision is desired. These limitations only act to slow down the
problem being run on the GPU; a good programmer will still be able to come away with a
sizeable speedup even after accounting for them.
7.2 Recommendations for Future Work
This work was not performed by an experienced programmer; any development team
experienced in both C or C++ and Monte Carlo neutron transport programming should be able
to achieve much better results than those discussed in this document. That said, these
results are enticing enough that developers are encouraged to examine porting their specific
MC codes to the CUDA environment. One candidate that comes to mind is the PSG2/Serpent
Monte Carlo transport code, written from scratch in 2004 in the C language (Leppänen
2009).
Bibliography
Brown, Forrest. "PHYSOR 2008 Conference Monte Carlo Workshop." Invited lecture.
Interlaken, Switzerland, 2008.
Brown, Forrest B., and William R. Martin. "Monte Carlo Methods for Radiation Transport
Analysis on Vector Computers." Progress in Nuclear Energy (Pergamon Press Ltd.) 14, no. 3
(July 1984): 269-299.
Brunner, Thomas, and Patrick Brantley. "An Efficient, Robust, Domain-Decomposition
Algorithm for Particle Monte Carlo." Journal of Computational Physics (Academic Press
Professional, Inc.) 228, no. 10 (June 2009): 3882-3890.
Intel Corporation. Corporate Archives Timeline.
http://www.intel.com/museum/archives/timeline/index.htm (accessed August 29, 2009).
KAERI, Korea Atomic Energy Research Institute. ENDFPLOT 2.0. 2007.
http://atom.kaeri.re.kr/cgi-bin/endfplot.pl (accessed March 2009).
Khronos Group. OpenCL Overview. 2009. http://www.khronos.org/opencl/ (accessed
August 29, 2009).
Leppänen, Jaakko. "Burnup Calculation Capability in the PSG2 / Serpent Monte Carlo
Reactor Physics Code." M&C 2009. Saratoga Springs, New York, 2009.
Luebke, David. "The Democratization of Parallel Computing." International Conference
for High Performance Computing, Networking, Storage and Analysis 2007 (SC07). Reno,
Nevada, November 2007.
Martin, William. "Joint International Topical Meeting on Mathematics and Computation
and Supercomputing in Nuclear Applications." Invited keynote. Monterey, California, April 2007.
Maruyama, N., A. Nukada, and S. Matsuoka. "Software-Based ECC for GPUs." 2009
Symposium on Application Accelerators in High Performance Computing (SAAHPC'09).
Urbana, Illinois, July 2009.
NVIDIA Corporation. CUDA Programming and Development. 2009.
http://forums.nvidia.com/index.php?showforum=71 (accessed September 8, 2009).
NVIDIA Corporation. NVIDIA CUDA C Programming Best Practices Guide. CUDA
Toolkit 2.3. Santa Clara, California, July 2009.
NVIDIA Corporation. NVIDIA CUDA Debugger - CUDA-GDB Debugger. Version 2.3
Beta. Santa Clara, California, July 2009.
NVIDIA Corporation. NVIDIA CUDA Programming Guide. CUDA Toolkit 2.3. Santa
Clara, California, July 2009.
Press, William, Saul Teukolsky, William Vetterling, and Brian Flannery. Numerical
Recipes in C++: The Art of Scientific Computing. 2nd ed. Cambridge: Cambridge University
Press, 2002.
Smith, Kord. "Nuclear Mathematical and Computational Sciences: A Century in Review,
A Century Anew." Invited keynote. Gatlinburg, Tennessee, May 2003.
Appendix A
LADONc Source Code
The following C++ code is the single-precision code for LADONc.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <iostream>
#include <float.h>
#include <fstream>
#include <string>
using namespace std;
const int OBJECT_MAX=100;
const int NOT_FISSION_FLAG=101;
const int NUCLIDE_MAX=20;
const float NU=0.53; //2+NU = nu, # neuts per fission
const float SMALLEST_float=0.001f; //used to move particle slightly
struct neutron {
    float x;
    float y;
    float z;
    float energy;
    float ox;
    float oy;
    float oz; //directions omegax, omegay, omegaz
    unsigned short cell;
};

struct shapes {
    char type; //s for sphere, p for parallelepiped, c for cylinder
    float x0, y0, z0, R2; //R2 because it is R^2
    float x1, y1, z1; //these are required to define a cube (and z1 for a cylinder)
    unsigned short inside_me[OBJECT_MAX];
};

struct materials {
    unsigned short num_mat;
    int zaid_loc;
    float density;
};

struct legendres {
    float a0, a1, a2, a3, a4, a5, a6, a7;
};

struct xs {
    //these are micros
    float * rad_cap;
    float * fission;
    float * elastic;
    float * inelastic;
    float * Emesh;
    legendres coeffs;
    float a, b, c, d;
    int mesh_length;
};

struct xsinfo {
    xs* cross_sec;
    int *list;
    int listlength;
};
void inputgeometry(string filename, unsigned short& shape_num, shapes
shape_list[]);
float rng(unsigned long int a, float bounds);
void inputXS(unsigned short shape_num, xsinfo* xs_data, materials
mat_list[][NUCLIDE_MAX]);
void inputmaterials(string filename, materials mat_list[][NUCLIDE_MAX]);
void inputsourcelist2(string filename, neutron* source);
void inputjob(char job[], string& in_filename, unsigned long long int& in_seed,
unsigned int& in_nbatch, unsigned int& in_batches, unsigned int&
in_batchskips);
void giveneutrondir(float* ox, float* oy, float* oz);
float sample_X(float a, float b, float c, float max_y);
bool inside_of(shapes shape, float x, float y, float z);
unsigned short getCell(unsigned short shape_num, shapes shape_list[], float x,
float y, float z);
float dist_to_boundary(shapes shape, float* x, float* y, float* z, float ox,
float oy, float oz);
void move_n(float x, float y, float z, float ox, float oy, float oz, float*
endx, float* endy, float* endz, float dr);
void scatter_iso(int zaid, float* u, float* v, float* w, float* energy);
void inelastic_scatter(int zaid, float* u, float* v, float* w, float* energy,
legendres coeffs);
int findzaid(xsinfo xs_data, int zaid);
unsigned int binarySearch(float sortedArray[], unsigned int first, unsigned int
last, float key);
float getmicro(float E, xs cross_sec, char type, unsigned int i);
int picknuclide(float allmatSigmaT, xsinfo xs_data, float E, materials
mat_def[]);
float getmaterialmacroT(float E, xsinfo xs_data, materials mat_def[]);
float getmacro(float E, xsinfo xs_data, char type, float density, int loc,
unsigned int i);
char get_Rxn(float E, xsinfo xs_data, materials mat_def);
int main(int argc, char* argv[])
clock_t time_1, time_2;
//get main input
string file;
unsigned long long int seed;
unsigned int nbatch; //neuts per batch
unsigned int batches;
unsigned int batchskips;
char jobname[]="input.job"; //!!! when using commandline, replace this with
//argv[1]
inputjob(jobname, file, seed, nbatch, batches, batchskips);
//init rng
rng(seed,1.0f);
xsinfo xs_data;
unsigned short num_shapes; //get from input
shapes shapelist[OBJECT_MAX]; //get from input
inputgeometry(file, num_shapes, shapelist);
materials mat_list[OBJECT_MAX][NUCLIDE_MAX];
inputmaterials(file, mat_list);
neutron* queuelist;
queuelist=(neutron*)malloc(2*nbatch*sizeof(neutron));
for (unsigned int i=0; i<2*nbatch; i++)
queuelist[i].energy=-1.0f;
inputsourcelist2(file,queuelist);
inputXS(num_shapes,&xs_data, mat_list);
cout <<"\nINPUT PARSING COMPLETE\nTIMER STARTED";
time_1=clock(); //start timer!
float keff=0;
float kbatch;
float ksum=0; //used for checking my kbatch algorithm
float keff_alt;//used for checking my kbatch algorithm
unsigned int queuecount=0;
unsigned int fissionqueuecount=0;
unsigned int lost_tally=0;
for (int batchcount=1; batchcount <= batches; batchcount++)
cout <<"\nBatch "<<batchcount;
unsigned int collisions=0;
unsigned int fission_tally=0;
unsigned int rad_cap_tally=0;
unsigned int leak_tally=0;
unsigned int fission_production_tally=0;
for (unsigned int count=1; count <= nbatch; count++)
//cout <<"\nbeginning neutron # "<<count;
bool neutron_alive=true;
float coll_dist;
char rxn_type;
int target_nuclide;
neutron neut;
//depends upon input source distribution, and fission neutron queue
//create neutron
if (queuelist[queuecount].energy==-1.0f)
//If there are no neutrons in queue, start at r=0 with random E
neut.x=0.0f;
neut.y=0.0f;
neut.z=0.0f;
neut.energy=sample_X(0.453f,-1.036f,2.29f,0.35820615f);
neut.cell=NOT_FISSION_FLAG;
queuecount=0;
else
neut.x=queuelist[queuecount].x;
neut.y=queuelist[queuecount].y;
neut.z=queuelist[queuecount].z;
neut.energy=queuelist[queuecount].energy;
neut.cell=queuelist[queuecount].cell;
if (queuecount<(2*nbatch-2))
queuecount++;
else queuecount=0;
giveneutrondir(&neut.ox,&neut.oy,&neut.oz);
if (neut.cell==OBJECT_MAX) //source in wrong xyz, leak and move on
neutron_alive=false;
while (neutron_alive)
//begin transporting neutron
neut.cell=getCell(num_shapes, shapelist, neut.x,neut.y,neut.z);
float tempx,tempy,tempz;
float eta=rng(0,1.0f);
bool finding_new_cell=true;
float SigmaT;
while (finding_new_cell)
float dist=dist_to_boundary(shapelist[neut.cell],
&neut.x,&neut.y,&neut.z,neut.ox,neut.oy,neut.oz);
if (dist<0.0f)
finding_new_cell=false; //Neut Lost!
neutron_alive=false;
lost_tally++;
//set cell temporarily to something to survive this
iteration of the loop
neut.cell=0;
SigmaT=getmaterialmacroT(neut.energy,xs_data,
mat_list[neut.cell]);
coll_dist=-log(eta)/(SigmaT);
bool find_dist=true;
int i=0;
while (find_dist)
float temp;
if (shapelist[neut.cell].inside_me[i]!=999)
temp=dist_to_boundary
(shapelist[shapelist[neut.cell].inside_me[i]],&neut.x,&neut.y,&neut.z,
neut.ox,neut.oy,neut.oz);
if ((temp>0)&&(temp<dist))
dist=temp;
i++;
else find_dist=false;
if (coll_dist>=dist)
neut.x=neut.x+neut.ox*(dist+SMALLEST_float);
neut.y=neut.y+neut.oy*(dist+SMALLEST_float);
neut.z=neut.z+neut.oz*(dist+SMALLEST_float);
neut.cell=getCell(num_shapes, shapelist,
neut.x,neut.y,neut.z);
if (neut.cell==OBJECT_MAX)
neutron_alive=false;
finding_new_cell=false;
leak_tally++;
else
eta=exp(((dist+SMALLEST_float)-
coll_dist)*SigmaT);
else
neut.x=neut.x+neut.ox*(coll_dist);
neut.y=neut.y+neut.oy*(coll_dist);
neut.z=neut.z+neut.oz*(coll_dist);
finding_new_cell=false;
if (neutron_alive)
target_nuclide=picknuclide(SigmaT, xs_data,
neut.energy, mat_list[neut.cell]);
rxn_type=get_Rxn(neut.energy, xs_data,
mat_list[neut.cell][target_nuclide]);
int loc=mat_list[neut.cell][target_nuclide].zaid_loc;
switch (rxn_type)
case 'f': //do fission stuff
//determine number of neutrons from fission
int num_fission_neuts;
if (rng(0,1.0f) > NU)
num_fission_neuts=3;
else num_fission_neuts=2;
for (int i=0; i<num_fission_neuts; i++)
neutron temp;
temp.x=neut.x;
temp.y=neut.y;
temp.z=neut.z;
temp.cell=neut.cell;
temp.energy=
sample_X(xs_data.cross_sec[loc].a,
xs_data.cross_sec[loc].b,
xs_data.cross_sec[loc].c,
xs_data.cross_sec[loc].d);
queuelist[fissionqueuecount]=temp;
if (fissionqueuecount<(2*nbatch-2))
fissionqueuecount++;
else
fissionqueuecount=0;
fission_production_tally++;
//store initial rx,ry,rz,E for fission n's in queue
fission_tally++;
neutron_alive=false;
break;
case 'c': //do capture stuff
neutron_alive=false;
rad_cap_tally++;
break;
case 'e':
collisions++;
scatter_iso(xs_data.list[ mat_list[neut.cell][target_nuclide].zaid_loc],
&neut.ox,&neut.oy,&neut.oz,&neut.energy);
break;
case 'i': //do inelastic stuff;
collisions++;
inelastic_scatter(xs_data.list[mat_list[neut.cell][target_nuclide].zaid_loc],
&neut.ox,&neut.oy,&neut.oz,&neut.energy, xs_data.cross_sec[loc].coeffs);
break;
default: cout << "\nError in get_Rxn; returned "<<
rxn_type;
cout <<"\nfission neutrons: "<<fission_production_tally
<<" absorptions: "<<rad_cap_tally+fission_tally
<<" leaked neutrons: "<< leak_tally;
cout <<"\nAverage collisions per neutron= "<<collisions/(float)nbatch;
kbatch=fission_production_tally /
((float)(leak_tally+rad_cap_tally+fission_tally));
if (batchcount>batchskips)
keff=((batchcount-batchskips-1)*(keff)+kbatch)/(batchcount-batchskips);
ksum+=kbatch;
cout <<"\nkbatch= "<<kbatch<<"\trunning keff= "<<keff;
time_2=clock();
cout <<"\nkeff = "<<keff;
keff_alt=ksum/((float)(batches-batchskips));
cout <<"\nkeff_alt = "<<keff_alt;
float deltat=(float) (time_2-time_1)/(float) CLOCKS_PER_SEC;
cout <<"\nComputation Time= " << deltat <<" seconds";
cout <<"\nAverage Neutrons per Second = "<<nbatch*batches/deltat;
cout <<"\nAverage Neutrons per Minute = "<<nbatch*batches/deltat*60;
cout <<"\nLost neutrons = " << lost_tally<<endl;
ofstream timeout;
timeout.open("time.out");
timeout<<deltat<<endl<<keff;
timeout.close();
return (EXIT_SUCCESS);
void inputjob(char job[], string& in_filename, unsigned long long int& in_seed,
              unsigned int& in_nbatch, unsigned int& in_batches, unsigned int& in_batchskips)
{
    ifstream fin;
    fin.open(job);
    fin>>in_filename;
    if (!fin) cout <<"\ninputjob file not open...";
    fin>>in_seed>>in_nbatch>>in_batches>>in_batchskips;
    fin.close();
}
void inputsourcelist2(string filename, neutron* source)
{
    neutron temp;
    filename.append(".src");
    ifstream sin(filename.c_str());
    if (!sin)
    {
        cout<<"\nerror reading .src file";
        return;
    }
    int number;
    sin>>number; //number of source neutrons
    for (int i=0; i<number; i++)
    {
        sin>>temp.x>>temp.y>>temp.z>>temp.energy;
        temp.cell=NOT_FISSION_FLAG;
        source[i]=temp;
    }
    sin.close();
}
void inputmaterials(string filename, materials mat_list[][NUCLIDE_MAX])
{
    for (int i=0; i<OBJECT_MAX; i++)
        for (int j=0; j<NUCLIDE_MAX; j++)
        {
            mat_list[i][j].num_mat=0;
            mat_list[i][j].zaid_loc=0;
            mat_list[i][j].density=0.0f;
        }
    filename.append(".mat");
    ifstream min(filename.c_str());
    if (!min)
    {
        cout<<"\nerror reading .mat file";
        return;
    }
    int number; //number of cells
    min>>number;
    for (int i=0; i<number; i++) //i is the same as cell #
    {
        min>>mat_list[i][0].num_mat;
        for (int j=0; j<mat_list[i][0].num_mat; j++)
        {
            min>>mat_list[i][j].zaid_loc>>mat_list[i][j].density;
            mat_list[i][j].num_mat=mat_list[i][0].num_mat;
        }
        for (int j=mat_list[i][0].num_mat; j<NUCLIDE_MAX; j++)
            mat_list[i][j].num_mat=mat_list[i][0].num_mat;
    }
    min.close();
}
void inputXS(unsigned short shape_num, xsinfo* xs_data, materials
             mat_list[][NUCLIDE_MAX])
{
    ifstream xsin;
    //collect all of the zaids and find how many different zaids there are
    int k=0;
    int temp[OBJECT_MAX*NUCLIDE_MAX];
    for (int i=0; i<OBJECT_MAX*NUCLIDE_MAX; i++)
        temp[i]=0;
    temp[k]=mat_list[0][0].zaid_loc;
    k++;
    for (int i=0; i<OBJECT_MAX; i++)
        for (int j=0; j<NUCLIDE_MAX; j++)
        {
            bool found_val=false;
            for (int z=0; z<k; z++)
                if (temp[z]==mat_list[i][j].zaid_loc)
                    found_val=true;
            if ((!found_val)&&(mat_list[i][j].zaid_loc!=0))
            {
                temp[k]=mat_list[i][j].zaid_loc;
                k++;
            }
        }
    //replace each zaid with its index into the unique-zaid list
    for (int z=0; z<k; z++)
        for (int i=0; i<OBJECT_MAX; i++)
            for (int j=0; j<NUCLIDE_MAX; j++)
                if (temp[z]==mat_list[i][j].zaid_loc)
                    mat_list[i][j].zaid_loc=z;
    (*xs_data).listlength=k;
    (*xs_data).list=(int*)malloc((*xs_data).listlength*sizeof(*(*xs_data).list));
    for (int i=0; i<k; i++)
        (*xs_data).list[i]=temp[i];
    //This should have produced a list of all of the zaids to get data for
    (*xs_data).cross_sec=
        (xs*)malloc((*xs_data).listlength*sizeof(*(*xs_data).cross_sec));
    for (int i=0; i<k; i++)
    {
        char xsfile[7];
        sprintf(xsfile, "%d",(*xs_data).list[i]);
        xsin.open(xsfile);
        if (!xsin)
        {
            cout<<"\nerror reading xs file";
            return;
        }
        //a,b,c are from: X(E)=a*exp(b*E)*sinh(sqrt(c*E))
        //d is the max of X(E).
        xsin>>(*xs_data).cross_sec[i].a>>(*xs_data).cross_sec[i].b
            >>(*xs_data).cross_sec[i].c>>(*xs_data).cross_sec[i].d;
        xsin>>(*xs_data).cross_sec[i].mesh_length;
        (*xs_data).cross_sec[i].Emesh=(float*)malloc(
            (*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].Emesh));
        (*xs_data).cross_sec[i].fission=(float*)malloc(
            (*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].fission));
        (*xs_data).cross_sec[i].rad_cap=(float*)malloc(
            (*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].rad_cap));
        (*xs_data).cross_sec[i].elastic=(float*)malloc(
            (*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].elastic));
        (*xs_data).cross_sec[i].inelastic=(float*)malloc(
            (*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].inelastic));
        if ((*xs_data).cross_sec[i].Emesh==NULL) cout <<"\ndidnt initialize Emesh...";
        for (int j=0; j<(*xs_data).cross_sec[i].mesh_length; j++)
            xsin>>(*xs_data).cross_sec[i].Emesh[j]>>(*xs_data).cross_sec[i].fission[j]
                >>(*xs_data).cross_sec[i].rad_cap[j]>>(*xs_data).cross_sec[i].elastic[j]
                >>(*xs_data).cross_sec[i].inelastic[j];
        //get legendre coeffs
        xsin>>(*xs_data).cross_sec[i].coeffs.a0>>(*xs_data).cross_sec[i].coeffs.a1
            >>(*xs_data).cross_sec[i].coeffs.a2>>(*xs_data).cross_sec[i].coeffs.a3
            >>(*xs_data).cross_sec[i].coeffs.a4>>(*xs_data).cross_sec[i].coeffs.a5
            >>(*xs_data).cross_sec[i].coeffs.a6>>(*xs_data).cross_sec[i].coeffs.a7;
        xsin.close();
    }
}
float rng(unsigned long int a, float bounds)
{
    //linear congruential generator (Numerical Recipes constants)
    const float mult=1.0f/4294967296.0f;
    static unsigned long int seed;
    if (a==0)
        a=seed;
    seed=(a * 1664525+1013904223) % 4294967296;
    a=seed;
    return seed*bounds*mult;
}
void inputgeometry(string filename, unsigned short& shape_num, shapes
                   shape_list[])
{
    filename.append(".geo");
    ifstream gin(filename.c_str());
    unsigned short number;
    if (!gin)
    {
        cout<<"\nerror reading .geo file";
        return;
    }
    gin >> number;
    shape_num=number;
    for (int i=0; i<(int)number; i++)
    {
        //get one object
        gin >>shape_list[i].type>>shape_list[i].x0>>shape_list[i].x1
            >>shape_list[i].y0>>shape_list[i].y1>>shape_list[i].z0
            >>shape_list[i].z1>>shape_list[i].R2;
        int temp;
        int j=0;
        bool continue_loop=true;
        while (continue_loop)
        {
            gin>>temp;
            shape_list[i].inside_me[j]=(unsigned short) temp;
            j++;
            if (temp==999) continue_loop=false;
        }
    }
    gin.close();
}
void giveneutrondir(float* ox, float* oy, float* oz)
{
    *oz=rng(0,2.0f)-1.0f;
    float temp=sqrtf(1.0f-(*oz)*(*oz));
    *ox=rng(0,6.283185307f); //random azimuthal angle in [0, 2*pi)
    *oy=sinf(*ox)*temp;
    *ox=cosf(*ox)*temp;
}
float sample_X(float a, float b, float c, float max_y)
{
    //rejection sampling of X(E)=a*exp(-b*E)*sinh(sqrt(c*E)); max_y is the max of X(E)
    float max_x=20.0f;
    float x;
    while (true)
    {
        x=rng(0,max_x);
        if (rng(0,max_y)<=a*exp(-b*x)*sinh(sqrt(c*x)))
            return x;
    }
}
int findzaid(xsinfo xs_data, int zaid)
{
    for (int i=0; i<xs_data.listlength; i++)
        if (xs_data.list[i]==zaid)
            return i;
    //if we made it here, the zaid was not found
    return 999;
}
bool inside_of(shapes shape, float x, float y, float z)
{
    //given x,y,z, determine whether the point is inside a given object ('shape').
    //Sphere: (x-x0)^2+(y-y0)^2+(z-z0)^2=R^2
    //Cube: z=z0, z=z1; x=x0, x=x1; y=y0, y=y1
    //Cylinder: (x-x0)^2+(y-y0)^2=R^2, z>=z0, z<=z1
    if (shape.type=='s') //spheres
    {
        float temp=(x-shape.x0)*(x-shape.x0) +
                   (y-shape.y0)*(y-shape.y0) +
                   (z-shape.z0)*(z-shape.z0);
        if (temp<=shape.R2) //then it is inside, or on the surface of, the sphere
            return true; //!!!This works if the smallest cell IDs mean they are
                         //inside the rest (so the 'inside_me' list is not needed)
    }
    else if (shape.type=='p') //parallelepiped
    {
        //z0 is top face, z1 is bottom face
        //x0 is larger x face, x1 is smaller x face
        //y0 is larger y face, y1 is smaller y face
        if (((z<=shape.z0)&&(z>=shape.z1))&&
            (((x<=shape.x0)&&(x>=shape.x1))&&
             ((y<=shape.y0)&&(y>=shape.y1))))
            return true;
    }
    return false;
}
unsigned short getCell(unsigned short shape_num, shapes shape_list[], float x,
                       float y, float z)
{
    for (unsigned short i=0; i<shape_num; i++)
        if (inside_of(shape_list[i],x,y,z))
            return i;
    return OBJECT_MAX; //not inside any shape: the particle has leaked
}
void move_n(float x, float y, float z, float ox, float oy, float oz, float*
            endx, float* endy, float* endz, float dr)
{
    *endx=x+ox*dr;
    *endy=y+oy*dr;
    *endz=z+oz*dr;
}
float dist_to_boundary(shapes shape, float* x, float* y, float* z, float ox,
                       float oy, float oz)
{
    if (shape.type=='s') //spheres
    {
        float d=(ox*ox+oy*oy+oz*oz);
        float b= 2.0f*(ox*(*x-shape.x0)+oy*(*y-shape.y0)+oz*(*z-shape.z0));
        float c= ((*x)*(*x)-2.0f*(*x)*shape.x0+shape.x0*shape.x0) +
                 ((*y)*(*y)-2.0f*(*y)*shape.y0+shape.y0*shape.y0) +
                 ((*z)*(*z)-2.0f*(*z)*shape.z0+shape.z0*shape.z0)-shape.R2;
        float temp=b*b-4.0f*d*c;
        if (temp < 0.0f)
            return -10.0f; //a flag indicating the ray did not intersect
        temp=sqrtf(temp);
        c = 0.5f*(temp-b)/d;
        d = 0.5f*(-temp-b)/d;
        if (d>0.0f) //pick the smallest positive root
            return d;
        else
            return c;
    }
    else if (shape.type=='p') //parallelepiped
    {
        //z0 is top face, z1 is bottom face
        //x0 is larger x face, x1 is smaller x face
        //y0 is larger y face, y1 is smaller y face
        //compute the distance to each candidate face
        float temp1=1.0E37f;
        float temp2=1.0E37f;
        float temp3=1.0E37f;
        if (ox>0.0f)
            temp1=(shape.x0-*x)/ox;
        else if (ox<0.0f)
            temp1=(shape.x1-*x)/ox;
        if (oy>0.0f)
            temp2=(shape.y0-*y)/oy;
        else if (oy<0.0f)
            temp2=(shape.y1-*y)/oy;
        if (oz>0.0f)
            temp3=(shape.z0-*z)/oz;
        else if (oz<0.0f)
            temp3=(shape.z1-*z)/oz;
        //the smallest distance is the exiting face
        temp1=min(temp1,temp2);
        temp1=min(temp1,temp3);
        return temp1;
    }
    else return -10.0f;
}
void scatter_iso(int zaid, float* u, float* v, float* w, float* energy)
{
    //This computes a new u,v,w, and energy after an isotropic elastic collision
    int A=zaid/1000;
    A=zaid-1000*A; //mass number from the ZAID
    float mu_cm=rng(0,2.0f)-1.0f;
    float new_energy=*energy*(A*A+2.0f*A*mu_cm+1.0f) /
                     ((float)(A*A+2.0f*A+1.0f));
    float temp=sqrt(*energy/new_energy);
    float cos_phi=cos(atan(sin(acos(mu_cm))/(1.0f/A+mu_cm)));
    float sin_phi=sin(acos(cos_phi));
    float cos_w=rng(0,2.0f)-1.0f;
    float sin_w=sin(acos(cos_w));
    temp=sin_phi/(sqrt(1.0f-(*w)*(*w))); //reused to save space
    float new_u, new_v, new_w;
    if (isinf(temp))
    {
        new_u=0.0f;
        new_v=0.0f;
        new_w=(*w)*cos_phi;
    }
    else
    {
        new_u=temp*((*v)*sin_w-(*v)*(*u)*cos_w)+(*u)*cos_phi;
        new_v=temp*(-(*u)*sin_w-(*w)*(*v)*cos_w)+(*v)*cos_phi;
        new_w=sin_phi*sqrt(1.0f-(*w)*(*w))*cos_w+(*w)*cos_phi;
    }
    //this is done since machine accuracy with these floats seems to be making
    //the omega vectors larger than unity. Rescaling.
    temp=new_u*new_u+new_v*new_v+new_w*new_w;
    if ((temp>1.00f)||(temp<0.999f))
    {
        temp=1.0f/sqrt(temp);
        new_u=new_u*temp;
        new_v=new_v*temp;
        new_w=new_w*temp;
    }
    *u=new_u;
    *v=new_v;
    *w=new_w;
    *energy=new_energy;
}
void inelastic_scatter(int zaid, float* u, float* v, float* w, float* energy,
                       legendres coeffs)
{
    int A=zaid/1000;
    A=zaid-1000*A; //mass number from the ZAID
    //First step: sample Q - done with a uniform RNG from 0 to maxQ
    float Q=rng(0,(*energy*-A/((float)A+1.0f)));
    float Eout=*energy+(A+1.0f)*(Q/A);
    *energy=Eout;
    //now sample mu from the Legendre expansion by rejection
    float mu;
    while (true)
    {
        mu=rng(0,2.0f)-1.0f;
        //first four terms, P0-P3
        float p=coeffs.a0+coeffs.a1*mu+coeffs.a2*0.5f*(3.0f*mu*mu-1.0f)
               +coeffs.a3*0.5f*(5.0f*mu*mu*mu-3.0f*mu);
        //P4-P5
        p+=0.125f*(coeffs.a4*(35.0f*mu*mu*mu*mu-30.0f*mu*mu+3.0f)
                  +coeffs.a5*(63.0f*mu*mu*mu*mu*mu-70.0f*mu*mu*mu+15.0f*mu));
        //P6-P7
        p+=0.0625f*(coeffs.a6*(231.0f*mu*mu*mu*mu*mu*mu-315.0f*mu*mu*mu*mu
                               +105.0f*mu*mu-5.0f)
                   +coeffs.a7*(429.0f*mu*mu*mu*mu*mu*mu*mu-693.0f*mu*mu*mu*mu*mu
                               +315.0f*mu*mu*mu-35.0f*mu));
        if (rng(0,8.0f)<=p)
            break;
    }
    //so now I have mu. use it just like in scatter_iso.
    float cos_phi=cos(atan(sin(acos(mu))/(1.0f/A+mu)));
    float sin_phi=sin(acos(cos_phi));
    float cos_w=rng(0,2.0f)-1.0f;
    float sin_w=sin(acos(cos_w));
    float temp=sin_phi/(sqrt(1.0f-(*w)*(*w))); //reused to save space
    float new_u,new_v,new_w;
    if (isinf(temp))
    {
        new_u=0.0f;
        new_v=0.0f;
        new_w=(*w)*cos_phi;
    }
    else
    {
        new_u=temp*((*v)*sin_w-(*v)*(*u)*cos_w)+(*u)*cos_phi;
        new_v=temp*(-(*u)*sin_w-(*w)*(*v)*cos_w)+(*v)*cos_phi;
        new_w=sin_phi*sqrt(1.0f-(*w)*(*w))*cos_w+(*w)*cos_phi;
    }
    //rescale: machine accuracy with floats can make omega vectors larger than unity
    temp=new_u*new_u+new_v*new_v+new_w*new_w;
    if ((temp>1.00f)||(temp<0.999f))
    {
        temp=1.0f/sqrt(temp);
        new_u=new_u*temp;
        new_v=new_v*temp;
        new_w=new_w*temp;
    }
    *u=new_u;
    *v=new_v;
    *w=new_w;
}
unsigned int binarySearch(float sortedArray[], unsigned int first, unsigned int
                          last, float key)
{
    if (key<sortedArray[0])
        return 1;
    while (first <= last)
    {
        unsigned int mid = (first + last) / 2; // compute mid point
        if (key > sortedArray[mid])
            first = mid + 1; // repeat search in top half
        else if (key < sortedArray[mid])
            last = mid - 1; // repeat search in bottom half
        else
            return mid; // found it; return position
    }
    return last+1; // failed to find key; return index of the next-larger element
}
float getmacro(float E, xsinfo xs_data, char type, float density, int loc,
               unsigned int i)
{
    //returns the macroscopic cross section for the requested reaction type
    float temp=0.0f;
    switch (type)
    {
        case ('t'):
            temp=getmicro(E,xs_data.cross_sec[loc],'t',i);
            break;
        case ('f'):
            temp=getmicro(E,xs_data.cross_sec[loc],'f',i);
            break;
        case ('c'):
            temp=getmicro(E,xs_data.cross_sec[loc],'c',i);
            break;
        case ('e'):
            temp=getmicro(E,xs_data.cross_sec[loc],'e',i);
            break;
        case ('i'):
            temp=getmicro(E,xs_data.cross_sec[loc],'i',i);
    }
    return (temp*density);
}
int picknuclide(float allmatSigmaT, xsinfo xs_data, float E, materials
                mat_def[])
{
    float eta=rng(0,1.0f)*allmatSigmaT;
    float running_sum=0.0f;
    for (int i=0; i<mat_def[0].num_mat; i++)
    {
        int loc=mat_def[i].zaid_loc;
        running_sum+=getmacro(E, xs_data, 't', mat_def[i].density, loc,
            binarySearch(xs_data.cross_sec[loc].Emesh, 0,
                         xs_data.cross_sec[loc].mesh_length-1, E));
        if (eta <= running_sum)
            return i;
    }
    return mat_def[0].num_mat-1; //round-off fallback: return the last nuclide
}
char get_Rxn(float E, xsinfo xs_data, materials mat_def)
{
    //This function determines which reaction occurs after we've
    //determined that one takes place
    //int loc=findzaid(xs_data,mat_def.zaid_loc);
    int loc=mat_def.zaid_loc;
    unsigned int i=binarySearch(xs_data.cross_sec[loc].Emesh, 0,
                                xs_data.cross_sec[loc].mesh_length-1, E);
    float sigis=getmacro(E,xs_data,'i',mat_def.density,loc,i);
    float sigf=getmacro(E,xs_data,'f',mat_def.density,loc,i);
    float sigrc=getmacro(E,xs_data,'c',mat_def.density,loc,i);
    float siges=getmacro(E,xs_data,'e',mat_def.density,loc,i);
    float sigt=sigf+sigrc+siges+sigis;
    float eta=rng(0,sigt);
    if (E<1E-10f)
        return 'c'; //below the mesh, treat the collision as capture
    //determine where in the range of scaled x/s the random number falls
    if (eta <= sigf) return 'f';
    else if (eta <= (sigf+sigrc)) return 'c';
    else if (eta <= (sigt-sigis)) return 'e';
    else if (eta <= sigt) return 'i';
    else return '0';
}
float getmaterialmacroT(float E, xsinfo xs_data, materials mat_def[])
{
    float macro=0.0;
    for (int i=0; i<mat_def[0].num_mat; i++) //num_mat is replicated across entries
    {
        int loc=mat_def[i].zaid_loc;
        macro+=getmicro(E,xs_data.cross_sec[loc],'t',
            binarySearch(xs_data.cross_sec[loc].Emesh, 0,
                         xs_data.cross_sec[loc].mesh_length-1, E))*mat_def[i].density;
    }
    return macro;
}
float getmicro(float E, xs cross_sec, char type, unsigned int i)
{
    //linear interpolation on the energy mesh; below the mesh, 1/v (1/sqrt(E)) behavior
    float temp;
    switch (type)
    {
        case 'f':
            if (E<=cross_sec.Emesh[0])
                temp=cross_sec.fission[0]*sqrt(cross_sec.Emesh[0]/E);
            else
                temp=(cross_sec.fission[i]-cross_sec.fission[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.fission[i-1];
            break;
        case 'c':
            if (E<=cross_sec.Emesh[0])
                temp=cross_sec.rad_cap[0]*sqrt(cross_sec.Emesh[0]/E);
            else
                temp=(cross_sec.rad_cap[i]-cross_sec.rad_cap[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.rad_cap[i-1];
            break;
        case 'e':
            if (E<=cross_sec.Emesh[0])
                temp=cross_sec.elastic[0]*sqrt(cross_sec.Emesh[0]/E);
            else
                temp=(cross_sec.elastic[i]-cross_sec.elastic[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.elastic[i-1];
            break;
        case 'i':
            if (E<=cross_sec.Emesh[0])
                temp=cross_sec.inelastic[0]*sqrt(cross_sec.Emesh[0]/E);
            else
                temp=(cross_sec.inelastic[i]-cross_sec.inelastic[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.inelastic[i-1];
            break;
        case 't':
            if (E<=cross_sec.Emesh[0])
                temp=(cross_sec.fission[0]+cross_sec.rad_cap[0]
                     +cross_sec.elastic[0]+cross_sec.inelastic[0])
                    *sqrt(cross_sec.Emesh[0]/E);
            else
            {
                temp=(cross_sec.fission[i]-cross_sec.fission[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.fission[i-1];
                temp+=(cross_sec.rad_cap[i]-cross_sec.rad_cap[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.rad_cap[i-1];
                temp+=(cross_sec.elastic[i]-cross_sec.elastic[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.elastic[i-1];
                temp+=(cross_sec.inelastic[i]-cross_sec.inelastic[i-1])
                    /(cross_sec.Emesh[i]-cross_sec.Emesh[i-1])
                    *(E-cross_sec.Emesh[i-1])+cross_sec.inelastic[i-1];
            }
            break;
        default: cout<<"\nhow did i get here???"; temp=-0.5f;
    }
    return temp;
}
Appendix B
LADONg Source Code
The following CUDA/C++ code is the single-precision code for LADONg.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>
#include <iostream>
#include <float.h>
#include <fstream>
#include <string>
#include <cuda.h>
using namespace std;
const int OBJECT_MAX=100;
const int NOT_FISSION_FLAG=101;
const int NUCLIDE_MAX=20;
const float NU=0.53f; //2+NU = nu, # neuts per fission
const float SMALLEST_float=0.001f; //used to barely move particle
const short THREAD_COUNT=64;
const int CONST_SIZE=1000;
struct reduced_array {
    unsigned int index;
    float value;
};

struct neutron {
    float3 pos;
    float energy;
    unsigned short cell;
    char flag;
    short loc;
    unsigned long int seed;
};

struct neutron2 {
    unsigned short cell;
    char flag;
    short loc;
    unsigned long int seed;
};

struct shapes {
    char type; //s for sphere, p for parallelepiped, c for cylinder
    float x0, y0, z0, R2; //R2 because it is R^2
    float x1, y1, z1; //these are required to define a cube (and z1 for a cylinder)
    unsigned short inside_me[OBJECT_MAX];
};

struct materials {
    unsigned short num_mat;
    int zaid_loc;
    float density;
};

struct materials2 {
    int zaid_loc;
    float density;
};

struct legendres {
    float a0, a1, a2, a3, a4, a5, a6, a7;
};

struct xs {
    //these are micros
    float * rad_cap;
    float * fission;
    float * elastic;
    float * inelastic;
    float * Emesh;
    legendres coeffs;
    float a,b,c,d;
    int mesh_length;
};

struct xsinfo {
    xs* cross_sec;
    int *list;
    short listlength;
};
__constant__ shapes shapelist_d[OBJECT_MAX];
__constant__ unsigned short num_mat_d[OBJECT_MAX];
__constant__ materials2 mat_list_d[OBJECT_MAX*NUCLIDE_MAX];
texture<float,1,cudaReadModeElementType> Emesh_tex;
texture<float4,1,cudaReadModeElementType> xsmesh_tex;
__constant__ unsigned int Emesh_offsets_d[NUCLIDE_MAX];
__constant__ float4 xs0_d[NUCLIDE_MAX];
__constant__ float E0_d[NUCLIDE_MAX];
__constant__ reduced_array reduced_Emesh_d[CONST_SIZE];
__constant__ unsigned int reduced_offsets_d[NUCLIDE_MAX];
__constant__ legendres coeffs_d[NUCLIDE_MAX];
//***************************************************************************//
void inputgeometry(string filename,unsigned short& shape_num,shapes
shape_list[]);
float rng(unsigned long int a, float bounds);
void inputXS(unsigned short shape_num, xsinfo* xs_data, materials
mat_list[][NUCLIDE_MAX]);
void inputmaterials(string filename, materials mat_list[][NUCLIDE_MAX]);
void inputsourcelist2(string filename, neutron* source);
void inputjob(char job[], string& in_filename, unsigned long int& in_seed,
unsigned int& in_nbatch, unsigned int& in_batches, unsigned int&
in_batchskips);
float sample_X(float a, float b, float c, float max_y);
short cpu_findzaid(xsinfo* xs_data, int zaid);
//***************************************************************************//
__global__ void MCkernel(const unsigned short num_shapes, const xsinfo*
xs_data, const int length, neutron* list);
__device__ void gpugiveneutrondir(unsigned long int* a, float3* dir);
__host__ __device__ float gpurng(unsigned long int* a, const float bounds);
__device__ bool gpuinside_of(const unsigned short i, const float3* pos);
__device__ unsigned short gpugetCell(const unsigned short shape_num, const
float3* pos);
__device__ float gpudist_to_boundary(const unsigned short i, const float3* pos,
const float3* dir);
__device__ void gpuscatter_iso(neutron2* neut, const int zaid, float3* dir,
float* energy);
__device__ void gpuinelastic(neutron2* neut, const int zaid, float3* dir,
float* energy);
__device__ short findzaid(const xsinfo* xs_data, int zaid);
__device__ unsigned int accelSearch(unsigned int first, unsigned int last,
const float key);
__device__ unsigned int binarySearch(const float* sortedArray, unsigned int
first, unsigned int last, const float key);
__device__ unsigned int textureSearch(unsigned int first, unsigned int last,
const float key);
__device__ float getmicro_f_pretex(const float Ei, const float Eim1, const
float4* xsi, const float4* xsim1, const float E, const short loc);
__device__ float getmicro_c_pretex(const float Ei, const float Eim1, const
float4* xsi, const float4* xsim1, const float E, const short loc);
__device__ float getmicro_e_pretex(const float Ei, const float Eim1, const
float4* xsi, const float4* xsim1, const float E, const short loc);
__device__ float getmicro_i_pretex(const float Ei, const float Eim1, const
float4* xsi, const float4* xsim1, const float E, const short loc);
__device__ int gpupicknuclide(neutron2* neut, const xsinfo* xs_data, const
unsigned short cell, const float SigmaT, const float energy);
__device__ float gpugetmaterialmacroT(const xsinfo* xs_data, const neutron2*
neut, const float energy);
__device__ void gpuget_Rxn(neutron2* neut, const xsinfo* xs_data, const float
energy);
__device__ float getmicro_t(const float E, const xs* cross_sec, const short
loc);
void configure_cuda_call(int required_threads, int * blocks);
void convert2darray(materials mat2d[][NUCLIDE_MAX], materials * mat1d);
void checkCUDAError(const char *msg);
void getneutlist(unsigned int listlength, neutron* host, neutron* device);
void loadneutlist(unsigned int listlength, neutron* host, neutron* device);
void load_gpu(materials mat_list[][NUCLIDE_MAX], unsigned short num_shapes,
shapes shapelist[], xsinfo* xs_data, xsinfo** xs_data_d);
void Emesh_reduction(xsinfo* xs_data, reduced_array** list, unsigned int**
reduced_offsets, int* reduction_factor);
//***************************************************************************//
//MAIN
int main(int argc, char* argv[])
{
    xsinfo* xs_data_d=NULL;
    clock_t time_1, time_2;
    //get main input
    string file;
    unsigned long int seed;
    unsigned int nbatch; //neuts per batch
    unsigned int batches;
    unsigned int batchskips;
    char jobname[]="input.job";
    inputjob(jobname, file, seed, nbatch, batches, batchskips);
    //init rng
    srand(seed);
    rng(seed,1.0f);
    xsinfo xs_data;
    unsigned short num_shapes; //get from input
    shapes shapelist[OBJECT_MAX]; //get from input
    inputgeometry(file, num_shapes, shapelist);
    materials mat_list[OBJECT_MAX][NUCLIDE_MAX];
    inputmaterials(file, mat_list);
    neutron* queuelist;
    cudaMallocHost((void**)&queuelist,2*nbatch*sizeof(neutron));
    for (unsigned int i=0; i<2*nbatch; i++)
        queuelist[i].energy=-1.0f;
    inputsourcelist2(file,queuelist);
    inputXS(num_shapes,&xs_data, mat_list);
    load_gpu(mat_list, num_shapes, shapelist, &xs_data, &xs_data_d);
    cout <<"\nINPUT PARSING AND GPU INITIALIZATION COMPLETE\nTIMER STARTED";
    time_1=clock(); //start timer!
    float keff=0;
    float kbatch;
    float ksum=0; //used for checking my kbatch algorithm
    float keff_alt; //used for checking my kbatch algorithm
    unsigned int queuecount=0;
    unsigned int fissionqueuecount=0;
    neutron* batchlist=NULL;
    cudaMallocHost((void**)&batchlist,nbatch*sizeof(neutron));
    neutron* batchlist_d=NULL;
    cudaMalloc((void**)&batchlist_d,nbatch*sizeof(neutron));
    checkCUDAError("cudamalloc of batchlist");
    int threads=THREAD_COUNT;
    int blocks=0;
    // unsigned int lost_tally=0;
    configure_cuda_call(nbatch, &blocks);
    for (unsigned int batchcount=1; batchcount <= batches; batchcount++)
    {
        cout <<"\nBatch "<<batchcount;
        unsigned int collisions=0;
        unsigned int fission_tally=0;
        unsigned int rad_cap_tally=0;
        unsigned int leak_tally=0;
        unsigned int fission_production_tally=0;
        //extract queue from queuelist
        for (unsigned int i=0; i<nbatch; i++)
        {
            if (queuelist[queuecount].energy==-1.0)
            {
                //If there are no neutrons in queue, start at r=0 with random E
                batchlist[i].pos.x=0.0f;
                batchlist[i].pos.y=0.0f;
                batchlist[i].pos.z=0.0f;
                batchlist[i].energy=sample_X(0.453f,-1.036f,2.29f,0.35820615f);
                queuecount=0;
            }
            else
            {
                batchlist[i].pos=queuelist[queuecount].pos;
                batchlist[i].energy=queuelist[queuecount].energy;
                if (queuecount<(2*nbatch-2))
                    queuecount++;
                else queuecount=0;
            }
            batchlist[i].seed=(unsigned long int) rng(0,4294967296.0f);
        }
        loadneutlist(nbatch, batchlist, batchlist_d);
        checkCUDAError("load batchlist");
        MCkernel<<<blocks,threads>>>(num_shapes, xs_data_d, nbatch, batchlist_d);
        cudaThreadSynchronize();
        checkCUDAError("kernel invocation");
        getneutlist(nbatch, batchlist, batchlist_d);
        for (unsigned int i=0; i<nbatch; i++)
        {
            switch(batchlist[i].flag)
            {
            case 'f': //do fission stuff
                //determine number of neutrons from fission
                int num_fission_neuts;
                if (rng(0,1.0f) > NU)
                    num_fission_neuts=3;
                else num_fission_neuts=2;
                for (int j=0; j<num_fission_neuts; j++)
                {
                    neutron temp;
                    temp.pos=batchlist[i].pos;
                    short loc=batchlist[i].loc;
                    temp.energy=sample_X(xs_data.cross_sec[loc].a,
                        xs_data.cross_sec[loc].b,
                        xs_data.cross_sec[loc].c,
                        xs_data.cross_sec[loc].d);
                    queuelist[fissionqueuecount]=temp;
                    if (fissionqueuecount<(2*nbatch-2))
                        fissionqueuecount++;
                    else
                        fissionqueuecount=0;
                    fission_production_tally++;
                }
                //store initial rx,ry,rz,E for fission n's in queue
                fission_tally++;
                break;
            case 'c': //do capture stuff
                rad_cap_tally++;
                break;
            case 'l':
                leak_tally++;
                break;
            case 'e': //These will return 0, since no collisions reported.
                collisions++; //kept to maintain similar computations between cpu and gpu
                break;
            case 'i':
                collisions++;
                break;
            // case 'L':
            //     lost_tally++;
            //     break;
            default: cout<<"\nGot a neutron that doesn't have correct flag: "<<batchlist[i].flag<<"\n";
            }
        }
        cout <<"\nfission neutrons: "<<fission_production_tally
             <<" absorptions: "<<rad_cap_tally+fission_tally
             <<" leaked neutrons: "<< leak_tally;
        cout <<"\nAverage collisions per neutron= "<<collisions/(float)nbatch;
        kbatch=fission_production_tally /
            ((float)(leak_tally+rad_cap_tally+fission_tally));
        if (batchcount>batchskips)
        {
            keff=((batchcount-batchskips-1)*(keff)+kbatch)/(batchcount-batchskips);
            ksum+=kbatch;
        }
        cout <<"\nkbatch= "<<kbatch<<"\trunning keff= "<<keff;
    }
    time_2=clock();
    cout <<"\nkeff = "<<keff;
    keff_alt=ksum/((float)(batches-batchskips));
    cout <<"\nkeff_alt = "<<keff_alt;
    float deltat=(float) (time_2-time_1)/(float) CLOCKS_PER_SEC;
    cout <<"\nComputation Time= " << deltat <<" seconds";
    cout <<"\nAverage Neutrons per Second = "<<nbatch*batches/deltat;
    cout <<"\nAverage Neutrons per Minute = "<<nbatch*batches/deltat*60<<endl;
    cudaFree(xs_data_d);
    cudaFree(batchlist_d);
    cudaFreeHost(batchlist);
    cudaFreeHost(queuelist);
    ofstream timeout;
    timeout.open("time.out");
    timeout<<deltat<<endl<<keff;
    timeout.close();
    return (EXIT_SUCCESS);
}
//***************************************************************************//
//CPU FUNCTIONS
void inputjob(char job[], string& in_filename, unsigned long int& in_seed,
    unsigned int& in_nbatch, unsigned int& in_batches, unsigned int& in_batchskips)
{
    ifstream fin;
    fin.open(job);
    fin>>in_filename;
    if (!fin) cout <<"\ninputjob file not open...";
    fin>>in_seed>>in_nbatch>>in_batches>>in_batchskips;
    fin.close();
}
void inputsourcelist2(string filename, neutron* source)
{
    neutron temp;
    filename.append(".src");
    ifstream sin(filename.c_str());
    if (!sin)
    {
        cout<<"\nerror reading .src file";
        return;
    }
    int number;
    sin>>number; //number of source neuts
    for (int i=0; i<number; i++)
    {
        sin>>temp.pos.x>>temp.pos.y>>temp.pos.z>>temp.energy; //>>temp.dir.x>>temp.dir.y>>temp.dir.z;
        temp.cell=NOT_FISSION_FLAG;
        source[i]=temp;
    }
    sin.close();
}
void inputmaterials(string filename, materials mat_list[][NUCLIDE_MAX])
{
    for (int i=0; i<OBJECT_MAX; i++)
        for (int j=0; j<NUCLIDE_MAX; j++)
        {
            mat_list[i][j].num_mat=0;
            mat_list[i][j].zaid_loc=0;
            mat_list[i][j].density=0.0f;
        }
    filename.append(".mat");
    ifstream min(filename.c_str());
    if (!min)
    {
        cout<<"\nerror reading .mat file";
        return;
    }
    int number=0; //number of cells
    min>>number;
    for (int i=0; i<number; i++) //i is the same as cell #
    {
        min>> mat_list[i][0].num_mat;
        for (int j=0; j<mat_list[i][0].num_mat; j++)
        {
            min>>mat_list[i][j].zaid_loc>>mat_list[i][j].density;
            mat_list[i][j].num_mat=mat_list[i][0].num_mat;
        }
        for (int j=mat_list[i][0].num_mat; j<NUCLIDE_MAX; j++)
            mat_list[i][j].num_mat=mat_list[i][0].num_mat;
    }
    min.close();
}
void inputXS(unsigned short shape_num, xsinfo* xs_data, materials mat_list[][NUCLIDE_MAX])
{
    ifstream xsin;
    //get all of the zaids
    int k=0;
    //find how many diff zaids there are
    int temp[OBJECT_MAX*NUCLIDE_MAX];
    for (int i=0; i<OBJECT_MAX*NUCLIDE_MAX; i++)
        temp[i]=0;
    temp[k]=mat_list[0][0].zaid_loc;
    k++;
    for (int i=0; i<OBJECT_MAX; i++)
        for (int j=0; j<NUCLIDE_MAX; j++)
        {
            bool found_val=false;
            for (int z=0; z<k; z++)
                if (temp[z]==mat_list[i][j].zaid_loc)
                    found_val=true;
            if ((!found_val)&&(mat_list[i][j].zaid_loc!=0))
            {
                temp[k]=mat_list[i][j].zaid_loc;
                k++;
            }
        }
    for (int z=0; z<k; z++)
        for (int i=0; i<OBJECT_MAX; i++)
            for (int j=0; j<NUCLIDE_MAX; j++)
                if (temp[z]==mat_list[i][j].zaid_loc)
                    mat_list[i][j].zaid_loc=z;
    (*xs_data).listlength=k;
    (*xs_data).list=(int*)malloc((*xs_data).listlength*sizeof(*(*xs_data).list));
    for (int i=0; i<k; i++)
        (*xs_data).list[i]=temp[i];
    //This should have produced a list of all of the zaids to get data for
    (*xs_data).cross_sec=(xs*)malloc((*xs_data).listlength*sizeof(*(*xs_data).cross_sec));
    for (int i=0; i<k; i++)
    {
        char xsfile[7];
        sprintf(xsfile, "%d",(*xs_data).list[i]);
        xsin.open(xsfile);
        if (!xsin)
        {
            cout<<"\nerror reading xs file";
            return;
        }
        //a,b,c are from: X(E)=a*exp(b*E)*sinh(sqrt(c*E))
        //d is the max of X(E).
        xsin>>(*xs_data).cross_sec[i].a>>(*xs_data).cross_sec[i].b
            >>(*xs_data).cross_sec[i].c>>(*xs_data).cross_sec[i].d;
        xsin >>(*xs_data).cross_sec[i].mesh_length;
        (*xs_data).cross_sec[i].Emesh=(float*)malloc((*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].Emesh));
        (*xs_data).cross_sec[i].fission=(float*)malloc((*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].fission));
        (*xs_data).cross_sec[i].rad_cap=(float*)malloc((*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].rad_cap));
        (*xs_data).cross_sec[i].elastic=(float*)malloc((*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].elastic));
        (*xs_data).cross_sec[i].inelastic=(float*)malloc((*xs_data).cross_sec[i].mesh_length*sizeof(*(*xs_data).cross_sec[i].inelastic));
        if ((*xs_data).cross_sec[i].Emesh==NULL) cout <<"\ndidnt initialize Emesh...";
        for (int j=0; j<(*xs_data).cross_sec[i].mesh_length; j++)
            xsin>>(*xs_data).cross_sec[i].Emesh[j]>>(*xs_data).cross_sec[i].fission[j]
                >>(*xs_data).cross_sec[i].rad_cap[j]
                >>(*xs_data).cross_sec[i].elastic[j]>>(*xs_data).cross_sec[i].inelastic[j];
        //get legendre coeffs
        xsin>>(*xs_data).cross_sec[i].coeffs.a0>>(*xs_data).cross_sec[i].coeffs.a1
            >>(*xs_data).cross_sec[i].coeffs.a2>>(*xs_data).cross_sec[i].coeffs.a3
            >>(*xs_data).cross_sec[i].coeffs.a4>>(*xs_data).cross_sec[i].coeffs.a5
            >>(*xs_data).cross_sec[i].coeffs.a6>>(*xs_data).cross_sec[i].coeffs.a7;
        xsin.close();
    }
}
float rng(unsigned long int a, float bounds)
{
    const float mult=1.0f/4294967296.0f;
    static unsigned long int seed;
    if (a==0)
        a=seed;
    seed=(a * 1664525+1013904223)%4294967296;
    a=seed;
    return seed*bounds*mult;
}
void inputgeometry(string filename,unsigned short& shape_num,shapes shape_list[])
{
    filename.append(".geo");
    ifstream gin(filename.c_str());
    unsigned short number;
    if (!gin)
    {
        cout<<"\nerror reading .geo file";
        return;
    }
    gin >> number;
    shape_num=number;
    for(int i=0; i<(int)number; i++)
    {
        //get one object
        gin >>shape_list[i].type>>shape_list[i].x0>>shape_list[i].x1
            >>shape_list[i].y0>>shape_list[i].y1>>shape_list[i].z0
            >>shape_list[i].z1>>shape_list[i].R2;
        int temp;
        int j=0;
        bool continue_loop=true;
        while (continue_loop)
        {
            gin>>temp;
            shape_list[i].inside_me[j]=(unsigned short) temp;
            j++;
            if (temp==999) continue_loop=false;
        }
    }
    gin.close();
}
float sample_X(float a, float b, float c, float max_y)
{
    //X(E)=a*exp(-b*E)*sinh(sqrt(c*E))
    float max_x=20.0f;
    //float d is the max_y
    bool cont_loop=true;
    float x;
    while (cont_loop)
    {
        x=rng(0,max_x);
        if (rng(0,max_y)<=a*exp(-b*x)*sinh(sqrt(c*x)))
            return x;
    }
    return 1.0f;
}
short cpu_findzaid(xsinfo* xs_data, int zaid)
{
    for (short i=0; i<(*xs_data).listlength; i++)
        if ((*xs_data).list[i]==zaid)
            return i;
    //if made it here, wasnt found
    return 999;
}
//***************************************************************************//
//GPU-SPECIFIC FUNCTIONS
__global__ void MCkernel(const unsigned short num_shapes, const xsinfo* xs_data,
    const int length, neutron* list)
{
    __shared__ neutron2 working_neut[THREAD_COUNT];
    //the mul24 requires the total number of neuts to be < 16 million (2^24).
    int tid=mul24(blockIdx.x,blockDim.x)+threadIdx.x;
    if ((tid)<length)
    {
        float3 pos=list[tid].pos;
        working_neut[threadIdx.x].cell=list[tid].cell;
        working_neut[threadIdx.x].seed=list[tid].seed;
        float energy = list[tid].energy;
        working_neut[threadIdx.x].flag='g'; //g==go
        float3 dir;
        gpugiveneutrondir(&(working_neut[threadIdx.x].seed),&(dir));
        bool neutron_alive=true;
        working_neut[threadIdx.x].cell=gpugetCell(num_shapes, &(pos));
        if (working_neut[threadIdx.x].cell==OBJECT_MAX)
        {
            working_neut[threadIdx.x].cell=0;
            working_neut[threadIdx.x].flag='l';
        }
        while (neutron_alive)
        {
            //begin transporting neutron
            float eta=gpurng(&(working_neut[threadIdx.x].seed),1.0f);
            bool finding_new_cell=true;
            float SigmaT;
            while (finding_new_cell)
            {
                SigmaT=gpugetmaterialmacroT(xs_data, &(working_neut[threadIdx.x]),energy);
                float coll_dist=-logf(eta)/SigmaT;
                float dist=gpudist_to_boundary(working_neut[threadIdx.x].cell,&(pos),&(dir));
                //if ((dist<0.0f)&&(dist>-0.1f))
                //    working_neut[threadIdx.x].cell='L'; // L == 'lost', geometry single precision fix-up
                unsigned short i=0;
                //This loop checks to make sure there were no intersecting objects inbetween x,y,z and objects boundary
                //(think of two spheres inside a sphere - where x,y,z is inside of large sphere but outside small spheres)
                while (shapelist_d[working_neut[threadIdx.x].cell].inside_me[i] != 999)
                {
                    float temp= gpudist_to_boundary(shapelist_d[working_neut[threadIdx.x].cell].inside_me[i],&(pos),&(dir));
                    if ((temp>0.0f)&&(temp<dist))
                        dist=temp;
                    i++;
                }
                //See if neutron leaves current cell
                if (coll_dist>=dist)
                {
                    //it did, so just move it to outside that cell, and then update information
                    //next time we come around this loop it will be like starting with a fresh neutron
                    pos.x=pos.x+dir.x*(dist+SMALLEST_float);
                    pos.y=pos.y+dir.y*(dist+SMALLEST_float);
                    pos.z=pos.z+dir.z*(dist+SMALLEST_float);
                    working_neut[threadIdx.x].cell=gpugetCell(num_shapes, &(pos));
                    if (working_neut[threadIdx.x].cell == OBJECT_MAX)
                        working_neut[threadIdx.x].flag='l';
                    eta=expf(((dist+SMALLEST_float)-coll_dist)*SigmaT);
                }
                else //neutron did not leave cell, so just move it to where it should be
                {
                    pos.x=pos.x+dir.x*coll_dist;
                    pos.y=pos.y+dir.y*coll_dist;
                    pos.z=pos.z+dir.z*coll_dist;
                    finding_new_cell=false;
                }
                //if ((working_neut[threadIdx.x].flag=='l')||(working_neut[threadIdx.x].flag=='L'))
                if (working_neut[threadIdx.x].flag=='l')
                    finding_new_cell=false;
            }
            if (neutron_alive)
            {
                unsigned short celltemp=working_neut[threadIdx.x].cell;
                // if ((working_neut[threadIdx.x].flag=='l')||(working_neut[threadIdx.x].flag=='L'))
                if (working_neut[threadIdx.x].flag=='l')
                    celltemp=0; //let it use some cell, avoids thread seperation
                unsigned short target_nuclide=gpupicknuclide(&(working_neut[threadIdx.x]), xs_data,celltemp,SigmaT,energy);
                gpuget_Rxn(&(working_neut[threadIdx.x]),xs_data,energy);
                if (working_neut[threadIdx.x].flag=='e')
                    gpuscatter_iso(&(working_neut[threadIdx.x]),
                        (*xs_data).list[mat_list_d[target_nuclide+celltemp*NUCLIDE_MAX].zaid_loc],
                        &dir,&energy);
                else if (working_neut[threadIdx.x].flag=='i')
                    gpuinelastic(&(working_neut[threadIdx.x]),
                        (*xs_data).list[mat_list_d[target_nuclide+celltemp*NUCLIDE_MAX].zaid_loc],
                        &dir,&energy);
                else neutron_alive=false;
            }
        }
        list[tid].pos=pos;
        list[tid].cell=working_neut[threadIdx.x].cell;
        list[tid].loc=working_neut[threadIdx.x].loc;
        list[tid].seed=working_neut[threadIdx.x].seed;
        list[tid].energy=energy;
        list[tid].flag=working_neut[threadIdx.x].flag;
    }
}
__device__ void gpugiveneutrondir(unsigned long int* a, float3* dir)
{
    (*dir).z=gpurng(a,2.0f)-1.0f;
    float temp=sqrtf(1.0f-(*dir).z*(*dir).z);
    (*dir).x=gpurng(a,6.283185307f);
    (*dir).y=sinf((*dir).x)*temp;
    (*dir).x=cosf((*dir).x)*temp;
}
__host__ __device__ float gpurng(unsigned long int* a, float bounds)
{
    (*a)=((*a)*1664525+1013904223)%4294967296;
    return (*a)*bounds*2.32830644E-10f;
}
__device__ bool gpuinside_of(const unsigned short i, const float3* pos)
{
    //given x,y,z, if I am in a given object ('shape').
    //Sphere (x-x0)^2+(y-y0)^2+(z-z0)^2=R^2
    //Cube: z=z0; z=z1, x=x0, x=x1; y=y0, y=y1
    bool result=false;
    if (shapelist_d[i].type=='s') //spheres
    {
        float temp=((*pos).x-shapelist_d[i].x0)*((*pos).x-shapelist_d[i].x0) +
            ((*pos).y-shapelist_d[i].y0)*((*pos).y-shapelist_d[i].y0) +
            ((*pos).z-shapelist_d[i].z0)*((*pos).z-shapelist_d[i].z0);
        if (temp<=shapelist_d[i].R2) //then it is inside, or on the surface of the sphere
            result=true; //!!!This works if the smallest cellIDs mean they are inside the rest (so i dont need my 'inside_me' list...
    }
    else if (shapelist_d[i].type=='p') //paralellopiped
    {
        //z0 is top face, z1 is bottom face
        //x0 is larger x face, x1 is smaller x face
        //y0 is larger y face, y1 is smaller y face
        if ((((*pos).z<=shapelist_d[i].z0)&&((*pos).z>=shapelist_d[i].z1))&&
            ((((*pos).x<=shapelist_d[i].x0)&&((*pos).x>=shapelist_d[i].x1))&&
            (((*pos).y<=shapelist_d[i].y0)&&((*pos).y>=shapelist_d[i].y1))))
            result=true;
    }
    return result;
}
__device__ unsigned short gpugetCell(const unsigned short shape_num, const float3* pos)
{
    for (unsigned short i=0; i<shape_num; i++)
        if (gpuinside_of(i,pos))
            return i;
    return OBJECT_MAX;
}
__device__ float gpudist_to_boundary(const unsigned short i, const float3* pos,
    const float3* dir)
{
    if (shapelist_d[i].type=='s') //spheres
    {
        float d=(*dir).x*(*dir).x+(*dir).y*(*dir).y+(*dir).z*(*dir).z;
        float b= 2.0f*(((*dir).x)*((*pos).x-shapelist_d[i].x0)
            +((*dir).y)*((*pos).y-shapelist_d[i].y0)
            +((*dir).z)*((*pos).z-shapelist_d[i].z0));
        float c= (((*pos).x)*((*pos).x)-2.0f*((*pos).x)*shapelist_d[i].x0+shapelist_d[i].x0*shapelist_d[i].x0) +
            (((*pos).y)*((*pos).y)-2.0f*((*pos).y)*shapelist_d[i].y0+shapelist_d[i].y0*shapelist_d[i].y0) +
            (((*pos).z)*((*pos).z)-2.0f*((*pos).z)*shapelist_d[i].z0+shapelist_d[i].z0*shapelist_d[i].z0) -
            shapelist_d[i].R2;
        float temp=b*b-4.0f*d*c;
        if (temp < 0.0f)
            return -10.0f; //a flag so i know that it didnt intersect
        temp=sqrtf(temp);
        c = 0.5f*(temp-b)/d;
        d = 0.5f*(-temp-b)/d;
        if (d>0.0f) //pick smallest root
            return d;
        else
            return c;
    }
    else if (shapelist_d[i].type=='p') //parallelopiped
    {
        //z0 is top face, z1 is bottom face
        //x0 is larger x face, x1 is smaller x face
        //y0 is larger y face, y1 is smaller y face
        //compute distances
        float temp1, temp2, temp3;
        temp1=1.0E37f;
        temp2=1.0E37f;
        temp3=1.0E37f;
        if (((*dir).x)>0.0f)
            temp1=(shapelist_d[i].x0-(*pos).x)/((*dir).x);
        else if (((*dir).x)<0.0f)
            temp1=(shapelist_d[i].x1-(*pos).x)/((*dir).x);
        if (((*dir).y)>0.0f)
            temp2=(shapelist_d[i].y0-(*pos).y)/((*dir).y);
        else if (((*dir).y)<0.0f)
            temp2=(shapelist_d[i].y1-(*pos).y)/((*dir).y);
        if (((*dir).z)>0.0f)
            temp3=(shapelist_d[i].z0-(*pos).z)/((*dir).z);
        else if (((*dir).z)<0.0f)
            temp3=(shapelist_d[i].z1-(*pos).z)/((*dir).z);
        //find the smallest, that is our winner
        temp1=min(temp1,temp2);
        temp1=min(temp1,temp3);
        return temp1;
    }
    return -10.0f;
}
__device__ void gpuscatter_iso(neutron2* neut, const int zaid, float3* dir,
    float* energy)
{
    //This calcs a new u,v,w, and energy after an isotropic collision
    int A;
    int C=zaid/1000;
    A=zaid-1000*C;
    float mu_cm=gpurng(&(*neut).seed,2.0f)-1.0f;
    float temp;
    float new_energy=*energy*(A*A+2.0f*A*mu_cm+1.0f)/(A*A+2.0f*A+1.0f);
    *energy=new_energy;
    temp=sqrtf(*energy/new_energy);
    float cos_phi=cosf(atanf(sinf(acosf(mu_cm))/(1.0f/A+mu_cm)));
    float sin_phi=sinf(acosf(cos_phi));
    float cos_w=gpurng(&(*neut).seed,2.0f)-1.0f;
    float sin_w=sinf(acosf(cos_w));
    temp=sin_phi*(rsqrtf(1-((*dir).z)*((*dir).z))); //reused to save registers
    float new_u=temp*(((*dir).y)*sin_w-((*dir).y)*((*dir).x)*cos_w)+((*dir).x)*cos_phi;
    float new_v=temp*(-((*dir).x)*sin_w-((*dir).z)*((*dir).y)*cos_w)+((*dir).y)*cos_phi;
    temp=sin_phi*sqrtf(1-((*dir).z)*((*dir).z))*cos_w+((*dir).z)*cos_phi;
    //used instead of float new_w to save registers
    cos_phi=new_u*new_u+new_v*new_v+temp*temp; //reused to save registers
    if ((cos_phi>1.0f)||(cos_phi<0.999f))
    {
        cos_phi=rsqrtf(cos_phi);
        new_u=new_u*cos_phi;
        new_v=new_v*cos_phi;
        temp=temp*cos_phi;
    }
    (*dir).x=new_u;
    (*dir).y=new_v;
    (*dir).z=temp;
}
__device__ void gpuinelastic(neutron2* neut, const int zaid, float3* dir,
    float* energy)
{
    //This calcs a new u,v,w, and energy after an isotropic collision
    int A;
    int C=zaid/1000;
    A=zaid-1000*C;
    float Q=gpurng(&((*neut).seed),(-1.0f*(*energy*A/(A+1.0f))));
    *energy=*energy+(A+1.0f)*Q/A;
    bool cont_loop=true;
    float mu_cm;
    while (cont_loop)
    {
        mu_cm=gpurng(&((*neut).seed),2.0f)-1.0f;
        float p=coeffs_d[(*neut).loc].a0
            +coeffs_d[(*neut).loc].a1*mu_cm
            +coeffs_d[(*neut).loc].a2*0.5f*(3.0f*mu_cm*mu_cm-1.0f)
            +coeffs_d[(*neut).loc].a3*0.5f*(5.0f*mu_cm*mu_cm*mu_cm-3.0f*mu_cm)
            +0.125f*(coeffs_d[(*neut).loc].a4*(35.0f*mu_cm*mu_cm*mu_cm*mu_cm-30.0f*mu_cm*mu_cm+3.0f))
            +(coeffs_d[(*neut).loc].a5*(63.0f*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm-70.0f*mu_cm*mu_cm*mu_cm+15.0f*mu_cm))
            +.0625f*(coeffs_d[(*neut).loc].a6*(231.0f*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm-315.0f*mu_cm*mu_cm*mu_cm*mu_cm+105.0f*mu_cm*mu_cm-5.0f))
            +coeffs_d[(*neut).loc].a7*(429.0f*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm-693.0f*mu_cm*mu_cm*mu_cm*mu_cm*mu_cm+315.0f*mu_cm*mu_cm*mu_cm-35.0f*mu_cm);
        if (gpurng(&((*neut).seed),8.0f)<=p)
            cont_loop=false;
    }
    float cos_phi=cosf(atanf(sinf(acosf(mu_cm))/(1.0f/A+mu_cm)));
    float sin_phi=sinf(acosf(cos_phi));
    float cos_w=gpurng(&((*neut).seed),2.0f)-1.0f;
    float sin_w=sinf(acosf(cos_w));
    float temp=sin_phi*(rsqrtf(1.0f-((*dir).z)*((*dir).z))); //reused to save registers
    float new_u=temp*(((*dir).y)*sin_w-((*dir).y)*((*dir).x)*cos_w)+((*dir).x)*cos_phi;
    float new_v=temp*(-((*dir).x)*sin_w-((*dir).z)*((*dir).y)*cos_w)+((*dir).y)*cos_phi;
    temp=sin_phi*sqrtf(1.0f-((*dir).z)*((*dir).z))*cos_w+((*dir).z)*cos_phi;
    //used instead of float new_w to save registers
    cos_phi=new_u*new_u+new_v*new_v+temp*temp; //reused to save registers
    if ((cos_phi>1.0f)||(cos_phi<0.999f))
    {
        cos_phi=rsqrtf(cos_phi);
        new_u=new_u*cos_phi;
        new_v=new_v*cos_phi;
        temp=temp*cos_phi;
    }
    (*dir).x=new_u;
    (*dir).y=new_v;
    (*dir).z=temp;
}
__device__ short findzaid(const xsinfo* xs_data, int zaid)
{
    short j;
    for (short i=0; i<(*xs_data).listlength; i++)
        if ((*xs_data).list[i]==zaid)
            j=i;
    return j;
}
__device__ unsigned int accelSearch(unsigned int first, unsigned int last,
    const float key)
{
    while (first <= last)
    {
        unsigned int mid = (first + last) / 2; // compute mid point.
        if (key > reduced_Emesh_d[mid].value)
            first = mid + 1; // repeat search in top half.
        else if (key < reduced_Emesh_d[mid].value)
            last = mid - 1; // repeat search in bottom half.
        else
            return mid; // found it. return position
    }
    return last+1; // failed to find key
}
__device__ unsigned int binarySearch(const float* sortedArray, unsigned int
    first, unsigned int last, const float key)
{
    while (first <= last)
    {
        unsigned int mid = (first + last) / 2; // compute mid point.
        if (key > sortedArray[mid])
            first = mid + 1; // repeat search in top half.
        else if (key < sortedArray[mid])
            last = mid - 1; // repeat search in bottom half.
        else
            return mid; // found it. return position
    }
    return last+1; // failed to find key
}
__device__ unsigned int textureSearch(unsigned int first, unsigned int last,
    const float key)
{
    while (first <= last)
    {
        unsigned int mid = (first + last) / 2; // compute mid point.
        float val=tex1Dfetch(Emesh_tex,mid);
        if (key > val)
            first = mid + 1; // repeat search in top half.
        else if (key < val)
            last = mid - 1; // repeat search in bottom half.
        else
            return mid; // found it. return position
    }
    return last+1; // failed to find key
}
//************************************************//
__device__ float getmicro_t_nosrch(const float E, const xs* cross_sec, const
    int i, const short loc)
{
    float temp;
    if (E<=(*cross_sec).Emesh[0])
        temp=((*cross_sec).fission[0]+(*cross_sec).rad_cap[0]+(*cross_sec).elastic[0]+(*cross_sec).inelastic[0])*sqrtf((*cross_sec).Emesh[0]/E);
    else
    {
        float Ei=tex1Dfetch(Emesh_tex,i+Emesh_offsets_d[loc]);
        float Eim1=tex1Dfetch(Emesh_tex,i-1+Emesh_offsets_d[loc]);
        float4 xsi=tex1Dfetch(xsmesh_tex,i+Emesh_offsets_d[loc]);
        float4 xsim1=tex1Dfetch(xsmesh_tex,i-1+Emesh_offsets_d[loc]);
        temp=(xsi.x-xsim1.x)/(Ei-Eim1)*(E-Eim1)+xsim1.x;
        temp+=(xsi.y-xsim1.y)/(Ei-Eim1)*(E-Eim1)+xsim1.y;
        temp+=(xsi.z-xsim1.z)/(Ei-Eim1)*(E-Eim1)+xsim1.z;
        temp+=(xsi.w-xsim1.w)/(Ei-Eim1)*(E-Eim1)+xsim1.w;
    }
    return temp;
}
__device__ float getmicro_f_pretex(const float Ei, const float Eim1, const
    float4* xsi, const float4* xsim1, const float E, const short loc)
{
    if (E<=E0_d[loc])
        return (xs0_d[loc].x*sqrtf(E0_d[loc]/E));
    else
        return (((*xsi).x-(*xsim1).x)/(Ei-Eim1)*(E-Eim1)+(*xsim1).x);
}
__device__ float getmicro_c_pretex(const float Ei, const float Eim1, const
    float4* xsi, const float4* xsim1, const float E, const short loc)
{
    if (E<=E0_d[loc])
        return (xs0_d[loc].y*sqrtf(E0_d[loc]/E));
    else
        return (((*xsi).y-(*xsim1).y)/(Ei-Eim1)*(E-Eim1)+(*xsim1).y);
}
__device__ float getmicro_e_pretex(const float Ei, const float Eim1, const
    float4* xsi, const float4* xsim1, const float E, const short loc)
{
    if (E<=E0_d[loc])
        return (xs0_d[loc].z*sqrtf(E0_d[loc]/E));
    else
        return (((*xsi).z-(*xsim1).z)/(Ei-Eim1)*(E-Eim1)+(*xsim1).z);
}
__device__ float getmicro_i_pretex(const float Ei, const float Eim1, const
    float4* xsi, const float4* xsim1, const float E, const short loc)
{
    if (E<=E0_d[loc])
        return (xs0_d[loc].w*sqrtf(E0_d[loc]/E));
    else
        return (((*xsi).w-(*xsim1).w)/(Ei-Eim1)*(E-Eim1)+(*xsim1).w);
}
//************************************************//
__device__ float getmicro_t(const float E, const xs* cross_sec, const short loc)
{
    float temp;
    if (E<=E0_d[loc])
        temp=(xs0_d[loc].x+xs0_d[loc].y+xs0_d[loc].z+xs0_d[loc].w)*sqrtf(E0_d[loc]/E);
    else
    {
        //unsigned int i=accelSearch(reduced_offsets_d[loc],reduced_offsets_d[loc+1]-1,E);
        //i=binarySearch((*cross_sec).Emesh,reduced_Emesh_d[i-1].index,reduced_Emesh_d[i].index,E);
        unsigned int i=binarySearch((*cross_sec).Emesh,0,(*cross_sec).mesh_length,E);
        float Ei=tex1Dfetch(Emesh_tex,i+Emesh_offsets_d[loc]);
        float Eim1=tex1Dfetch(Emesh_tex,i-1+Emesh_offsets_d[loc]);
        float4 xsi=tex1Dfetch(xsmesh_tex,i+Emesh_offsets_d[loc]);
        float4 xsim1=tex1Dfetch(xsmesh_tex,i-1+Emesh_offsets_d[loc]);
        temp=(xsi.x-xsim1.x)/(Ei-Eim1)*(E-Eim1)+xsim1.x;
        temp+=(xsi.y-xsim1.y)/(Ei-Eim1)*(E-Eim1)+xsim1.y;
        temp+=(xsi.z-xsim1.z)/(Ei-Eim1)*(E-Eim1)+xsim1.z;
        temp+=(xsi.w-xsim1.w)/(Ei-Eim1)*(E-Eim1)+xsim1.w;
    }
    return temp;
}
__device__ void gpuget_Rxn(neutron2* neut, const xsinfo* xs_data, const float energy)
{
    unsigned int i=1;
    float Ei, Eim1;
    float4 xsi, xsim1;
    if (energy>E0_d[(*neut).loc])
    {
        //i=accelSearch(reduced_offsets_d[(*neut).loc],reduced_offsets_d[(*neut).loc+1]-1,energy);
        //i=binarySearch((*xs_data).cross_sec[(*neut).loc].Emesh,reduced_Emesh_d[i-1].index,reduced_Emesh_d[i].index,energy);
        i=binarySearch((*xs_data).cross_sec[(*neut).loc].Emesh,0,(*xs_data).cross_sec[(*neut).loc].mesh_length,energy);
    }
    Ei=tex1Dfetch(Emesh_tex,i+Emesh_offsets_d[(*neut).loc]);
    Eim1=tex1Dfetch(Emesh_tex,i-1+Emesh_offsets_d[(*neut).loc]);
    xsi=tex1Dfetch(xsmesh_tex,i+Emesh_offsets_d[(*neut).loc]);
    xsim1=tex1Dfetch(xsmesh_tex,i-1+Emesh_offsets_d[(*neut).loc]);
    float sigt, sigf, sigrc, sigis;
    sigf=getmicro_f_pretex(Ei,Eim1, &xsi, &xsim1, energy, (*neut).loc);
    sigrc=getmicro_c_pretex(Ei,Eim1, &xsi, &xsim1, energy, (*neut).loc);
    //siges=getmicro_e_pretex(Ei,Eim1, &xsi, &xsim1, energy, (*neut).loc);
    //sigt used here since siges is not necessary because of the way i've done it later, so i'll get the value
    //but just use a future variable instead. makes less registers necessary.
    sigt=getmicro_e_pretex(Ei,Eim1, &xsi, &xsim1, energy, (*neut).loc);
    sigis=getmicro_i_pretex(Ei,Eim1, &xsi, &xsim1, energy, (*neut).loc);
    sigt+=sigf+sigrc+sigis;
    float eta=gpurng(&((*neut).seed),sigt);
    //determine where in the range of scaled x/s the rand # is
    // if (((*neut).flag!='l')||((*neut).flag!='L'))
    if ((*neut).flag!='l')
    {
        if (eta <= sigf) (*neut).flag= 'f';
        else if (eta <= (sigf+sigrc)) (*neut).flag= 'c';
        else if (eta <= (sigt-sigis)) (*neut).flag= 'e';
        else if (eta <= (sigt)) (*neut).flag= 'i';
    }
    if (energy<1E-10f)
        (*neut).flag= 'c';
}
//*************************************************************************//
__device__ float gpugetmaterialmacroT(const xsinfo* xs_data, const neutron2* neut,
    const float energy)
{
    float macro=0.0f;
    for (int i=0; i<num_mat_d[(*neut).cell]; i++)
    {
        short loc=mat_list_d[i+NUCLIDE_MAX*(*neut).cell].zaid_loc;
        macro+=getmicro_t(energy,&((*xs_data).cross_sec[loc]),loc)*mat_list_d[i+NUCLIDE_MAX*(*neut).cell].density;
    }
    return macro;
}
__device__ int gpupicknuclide(neutron2* neut, const xsinfo* xs_data, const
    unsigned short cell, const float SigmaT, const float energy)
{
    float eta=gpurng(&((*neut).seed),1.0f)*SigmaT;
    float running_sum=0.0f;
    for (int i=0; i<num_mat_d[cell]; i++)
    {
        short loc=mat_list_d[i+NUCLIDE_MAX*cell].zaid_loc;
        running_sum+=getmicro_t(energy,&((*xs_data).cross_sec[loc]),loc)*mat_list_d[i+NUCLIDE_MAX*cell].density;
        if (eta < running_sum)
        {
            (*neut).loc=loc;
            return i;
        }
    }
    (*neut).loc=2;
    return 999;
}
//***************************************************************************//
void load_gpu(materials mat_list[][NUCLIDE_MAX], unsigned short num_shapes, shapes shapelist[], xsinfo* xs_data, xsinfo** xs_data_d)
{
    //cudaSetDevice(0);
    checkCUDAError("SetDevice");
    //pass mat_list
    materials mat1d[OBJECT_MAX*NUCLIDE_MAX];
    convert2darray(mat_list,mat1d);
    materials2 mat_temp[OBJECT_MAX*NUCLIDE_MAX];
    unsigned short num_mat_temp[OBJECT_MAX];
    for (int i=0; i<OBJECT_MAX*NUCLIDE_MAX; i++)
    {
        mat_temp[i].zaid_loc=mat1d[i].zaid_loc;
        mat_temp[i].density=mat1d[i].density;
    }
    for (int i=0; i<OBJECT_MAX; i++)
        num_mat_temp[i]=mat1d[0+NUCLIDE_MAX*i].num_mat;
    //Now I have mat_list in the separate forms; have to send to the GPU constant vars.
    cudaMemcpyToSymbol(mat_list_d,mat_temp,OBJECT_MAX*NUCLIDE_MAX*sizeof(materials2),0,cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(num_mat_d,num_mat_temp,OBJECT_MAX*sizeof(unsigned short),0,cudaMemcpyHostToDevice);
    checkCUDAError("matlist");
    //pass shape info
    cudaMemcpyToSymbol(shapelist_d,shapelist,OBJECT_MAX*sizeof(shapes),0,cudaMemcpyHostToDevice);
    checkCUDAError("shapelist");
    //pass xs_data
    //setup texture and offsets list
    unsigned int* offset_temp;
    offset_temp=(unsigned int*)malloc(((*xs_data).listlength+1)*sizeof(unsigned int));
    unsigned int sum=0;
    for (int i=0; i<(*xs_data).listlength; i++)
    {
        offset_temp[i]=sum;
        sum+=(*xs_data).cross_sec[i].mesh_length;
    }
    offset_temp[(*xs_data).listlength]=sum;
    cudaMemcpyToSymbol(Emesh_offsets_d,offset_temp,((*xs_data).listlength+1)*sizeof(unsigned int),0,cudaMemcpyHostToDevice);
    //offsets is done, start making texture source array
    float* Emesh;
    Emesh=(float*)malloc((sum+10)*sizeof(float));
    unsigned int k=0;
    for (int i=0; i<(*xs_data).listlength; i++)
    {
        for (unsigned int j=0; j<(*xs_data).cross_sec[i].mesh_length; j++)
        {
            Emesh[k]=(*xs_data).cross_sec[i].Emesh[j];
            k++;
        }
    }
    float* Emesh_d;
    cudaMalloc((void**)&Emesh_d,offset_temp[(*xs_data).listlength]*sizeof(float));
    cudaMemcpy(Emesh_d,Emesh,offset_temp[(*xs_data).listlength]*sizeof(float),cudaMemcpyHostToDevice);
    cudaBindTexture(NULL,Emesh_tex,Emesh_d,offset_temp[(*xs_data).listlength]*sizeof(float));
    float4* xsmesh;
    xsmesh=(float4*)malloc(offset_temp[(*xs_data).listlength]*sizeof(float4));
    k=0;
    for (int i=0; i<(*xs_data).listlength; i++)
    {
        for (unsigned int j=0; j<(*xs_data).cross_sec[i].mesh_length; j++)
        {
            xsmesh[k].x=(*xs_data).cross_sec[i].fission[j];
            xsmesh[k].y=(*xs_data).cross_sec[i].rad_cap[j];
            xsmesh[k].z=(*xs_data).cross_sec[i].elastic[j];
            xsmesh[k].w=(*xs_data).cross_sec[i].inelastic[j];
            k++;
        }
    }
    float4* xsmesh_d;
    cudaMalloc((void**)&xsmesh_d,offset_temp[(*xs_data).listlength]*sizeof(float4));
    cudaMemcpy(xsmesh_d,xsmesh,offset_temp[(*xs_data).listlength]*sizeof(float4),cudaMemcpyHostToDevice);
    cudaBindTexture(NULL,xsmesh_tex,xsmesh_d,offset_temp[(*xs_data).listlength]*sizeof(float4));
    //do my 0s
    float* E0;
    float4* xs0;
    E0=(float*)malloc((*xs_data).listlength*sizeof(float));
    xs0=(float4*)malloc((*xs_data).listlength*sizeof(float4));
    for (int i=0; i<(*xs_data).listlength; i++)
    {
        E0[i]=(*xs_data).cross_sec[i].Emesh[0];
        xs0[i].x=(*xs_data).cross_sec[i].fission[0];
        xs0[i].y=(*xs_data).cross_sec[i].rad_cap[0];
        xs0[i].z=(*xs_data).cross_sec[i].elastic[0];
        xs0[i].w=(*xs_data).cross_sec[i].inelastic[0];
    }
    cudaMemcpyToSymbol(E0_d,E0,(*xs_data).listlength*sizeof(float),0,cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(xs0_d,xs0,(*xs_data).listlength*sizeof(float4),0,cudaMemcpyHostToDevice);
    free(E0);
    free(xs0);
    free(offset_temp);
    free(Emesh);
    free(xsmesh);
    //Done with texture
    reduced_array* reduced_Emesh;
    reduced_Emesh=(reduced_array*)malloc(CONST_SIZE*sizeof(reduced_array));
    unsigned int* reduced_offsets;
    reduced_offsets=(unsigned int*)malloc(((*xs_data).listlength+1)*sizeof(unsigned int));
    int reduction_factor;
    Emesh_reduction(xs_data,&reduced_Emesh,&reduced_offsets,&reduction_factor);
    cudaMemcpyToSymbol(reduced_Emesh_d,reduced_Emesh,CONST_SIZE*sizeof(reduced_array),0,cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(reduced_offsets_d,reduced_offsets,((*xs_data).listlength+1)*sizeof(unsigned int),0,cudaMemcpyHostToDevice);
    free(reduced_offsets);
    free(reduced_Emesh);
    //legendres
    legendres* coeffs_h=NULL;
    coeffs_h=(legendres*)malloc(NUCLIDE_MAX*sizeof(legendres));
    for (int j=0; j<(*xs_data).listlength; j++)
        coeffs_h[j]=(*xs_data).cross_sec[j].coeffs;
    cudaMemcpyToSymbol(coeffs_d,coeffs_h,NUCLIDE_MAX*sizeof(legendres),0,cudaMemcpyHostToDevice);
    cudaThreadSynchronize();
    free(coeffs_h);
    //continue with raw x/s data
    xsinfo tempxs;
    xs* temp;
    temp=(xs*)malloc((*xs_data).listlength*sizeof(xs));
    tempxs.cross_sec=(xs*)malloc((*xs_data).listlength*sizeof(xs));
    tempxs.list=(int*)malloc((*xs_data).listlength*sizeof(int));
    for (int j=0; j<(*xs_data).listlength; j++)
    {
        int mesh_length=(*xs_data).cross_sec[j].mesh_length;
        float* rc; //rad_cap
        float* f;  //fission
        float* e;  //elastic
        float* i;  //inelastic
        float* E;  //Emesh
        cudaMalloc((void**)&rc,mesh_length*sizeof(float));
        cudaMalloc((void**)&f,mesh_length*sizeof(float));
        cudaMalloc((void**)&e,mesh_length*sizeof(float));
        cudaMalloc((void**)&i,mesh_length*sizeof(float));
        cudaMalloc((void**)&E,mesh_length*sizeof(float));
        cudaMemcpy(rc,((*xs_data).cross_sec[j].rad_cap),mesh_length*sizeof(float),cudaMemcpyHostToDevice);
        cudaMemcpy(f,((*xs_data).cross_sec[j].fission),mesh_length*sizeof(float),cudaMemcpyHostToDevice);
        cudaMemcpy(e,((*xs_data).cross_sec[j].elastic),mesh_length*sizeof(float),cudaMemcpyHostToDevice);
        cudaMemcpy(i,((*xs_data).cross_sec[j].inelastic),mesh_length*sizeof(float),cudaMemcpyHostToDevice);
        cudaMemcpy(E,((*xs_data).cross_sec[j].Emesh),mesh_length*sizeof(float),cudaMemcpyHostToDevice);
        temp[j].coeffs=(*xs_data).cross_sec[j].coeffs;
        temp[j].a=(*xs_data).cross_sec[j].a;
        temp[j].b=(*xs_data).cross_sec[j].b;
        temp[j].c=(*xs_data).cross_sec[j].c;
        temp[j].d=(*xs_data).cross_sec[j].d;
        temp[j].mesh_length=mesh_length;
        temp[j].rad_cap=rc;
        temp[j].fission=f;
        temp[j].elastic=e;
        temp[j].inelastic=i;
        temp[j].Emesh=E;
    }
    xs* temp_d;
    cudaMalloc((void**)&temp_d,(*xs_data).listlength*sizeof(xs));
    cudaMemcpy(temp_d,temp,(*xs_data).listlength*sizeof(xs),cudaMemcpyHostToDevice);
    tempxs.cross_sec=temp_d;
    int* tempint;
    cudaMalloc((void**)&tempint,(*xs_data).listlength*sizeof(int));
    cudaMemcpy(tempint,(*xs_data).list,(*xs_data).listlength*sizeof(int),cudaMemcpyHostToDevice);
    tempxs.list=tempint;
    tempxs.listlength=(*xs_data).listlength;
    cudaMalloc((void**)xs_data_d,sizeof(xsinfo));
    cudaMemcpy(*xs_data_d,&tempxs,sizeof(xsinfo),cudaMemcpyHostToDevice);
    // free(tempxs.list);
    free(temp);
}
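The offset_temp array that load_gpu uploads to Emesh_offsets_d is an exclusive prefix sum of the per-nuclide mesh lengths, with one trailing entry holding the grand total, so that entry i gives where nuclide i's energy mesh starts in the flattened texture array and entry i+1 minus entry i recovers its length. A small standalone C sketch of that indexing scheme (names are illustrative):

```c
#include <assert.h>

/* Exclusive prefix sum over mesh lengths, as in load_gpu's offset_temp:
 * offsets[i] is the start of nuclide i's mesh in the flattened array;
 * offsets[n] is the total number of mesh points across all nuclides. */
void build_offsets(const unsigned int *mesh_length, int n, unsigned int *offsets)
{
    unsigned int sum = 0;
    for (int i = 0; i < n; i++)
    {
        offsets[i] = sum;
        sum += mesh_length[i];
    }
    offsets[n] = sum;   /* trailing total, so offsets has n+1 entries */
}
```

For mesh lengths {4, 2, 5} this yields offsets {0, 4, 6, 11}, matching the (listlength+1)-sized allocation in load_gpu.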
void loadneutlist(unsigned int listlength, neutron* host, neutron* device)
{
    //Assumes cudaMalloc already run for the lists
    cudaMemcpy(device, host, listlength*sizeof(neutron), cudaMemcpyHostToDevice);
    checkCUDAError("load");
}

void getneutlist(unsigned int listlength, neutron* host, neutron* device)
{
    //Assumes cudaMalloc already run for the lists
    cudaMemcpy(host, device, listlength*sizeof(neutron), cudaMemcpyDeviceToHost);
    checkCUDAError("get");
}
void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}
void convert2darray(materials mat2d[][NUCLIDE_MAX], materials * mat1d)
{
    int k=0;
    for (int i=0; i<OBJECT_MAX; i++)
        for (int j=0; j<NUCLIDE_MAX; j++)
            mat1d[k++]=mat2d[i][j];
}
void configure_cuda_call(int required_threads, int * blocks)
{
    if ((required_threads%THREAD_COUNT)!=0)
        *blocks=required_threads/THREAD_COUNT+1;
    else
        *blocks=required_threads/THREAD_COUNT;
}
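configure_cuda_call rounds the block count up so that blocks*THREAD_COUNT covers every requested thread; the branch is equivalent to the usual one-line ceiling-division idiom. A quick C check of that equivalence (the THREAD_COUNT value below is an arbitrary stand-in, not the code's actual constant):

```c
#include <assert.h>

#define THREAD_COUNT 256   /* illustrative value only */

/* Same rounding as configure_cuda_call, written as ceiling division. */
int blocks_for(int required_threads)
{
    return (required_threads + THREAD_COUNT - 1) / THREAD_COUNT;
}
```

With 256 threads per block, 1000 required threads launch 4 blocks (1024 threads), so each kernel must still guard against the excess thread indices.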
void Emesh_reduction(xsinfo* xs_data, reduced_array** list, unsigned int** reduced_offsets, int* reduction_factor)
{
    //determine total size
    unsigned int tot_size=0;
    for (int i=0; i<(*xs_data).listlength; i++)
        tot_size+=(*xs_data).cross_sec[i].mesh_length;
    //determine reduction_factor value
    *reduction_factor=1+tot_size/(CONST_SIZE-(*xs_data).listlength);
    //Now go through and set reduced_array to what it needs, as well as the offsets to use for reference
    unsigned int k=0;
    for (unsigned int i=0; i<(*xs_data).listlength; i++)
    {
        (*reduced_offsets)[i]=k;
        unsigned int length;
        if ((*xs_data).cross_sec[i].mesh_length%(*reduction_factor)==0)
            length=(*xs_data).cross_sec[i].mesh_length/(*reduction_factor);
        else
            length=(*xs_data).cross_sec[i].mesh_length/(*reduction_factor)+1;
        for (unsigned int j=0; j<length; j++)
        {
            (*list)[k].value=(*xs_data).cross_sec[i].Emesh[(*reduction_factor)*j];
            (*list)[k].index=(*reduction_factor)*j;
            k++;
        }
        (*list)[k].value=(*xs_data).cross_sec[i].Emesh[(*xs_data).cross_sec[i].mesh_length-1];
        (*list)[k].index=(*xs_data).cross_sec[i].mesh_length-1;
        k++;
    }
    (*reduced_offsets)[(*xs_data).listlength]=k;
}
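Emesh_reduction coarsens the union energy mesh, keeping every reduction_factor-th point per nuclide (plus each nuclide's final mesh point) so the reduced list fits in the CONST_SIZE constant-memory budget, with one slot per nuclide reserved for that appended final point. The two sizing formulas can be sketched in isolation (all values below are illustrative, not the code's actual constants):

```c
#include <assert.h>

/* Stride as computed in Emesh_reduction: reserve one slot per nuclide for
 * its final mesh point, then size the stride so the coarse mesh fits. */
int compute_reduction_factor(unsigned int tot_size, unsigned int const_size,
                             unsigned int listlength)
{
    return 1 + (int)(tot_size / (const_size - listlength));
}

/* Coarse points kept for one nuclide: ceil(mesh_length / factor) strided
 * samples plus the explicitly appended final mesh point. */
unsigned int reduced_length(unsigned int mesh_length, unsigned int factor)
{
    unsigned int length = mesh_length / factor;
    if (mesh_length % factor != 0)
        length += 1;
    return length + 1;   /* +1 for the appended final point */
}
```

For example, with a factor of 3 the 228-point 1H mesh of Appendix C keeps 76 strided samples plus the final point, 77 entries in all.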
Appendix C
Sample Input Files
C.1 Geometry File “array2.geo”
53
s -225 0 -150 0 -150 0 1600
999
s -225 0 -150 0 -150 0 2500
0 999
s -75 0 -150 0 -150 0 1600
999
s -75 0 -150 0 -150 0 2500
2 999
s 75 0 -150 0 -150 0 1600
999
s 75 0 -150 0 -150 0 2500
4 999
s 225 0 -150 0 -150 0 1600
999
s 225 0 -150 0 -150 0 2500
6 999
s -225 0 0 0 -150 0 1600
999
s -225 0 0 0 -150 0 2500
8 999
s -75 0 0 0 -150 0 1600
999
s -75 0 0 0 -150 0 2500
10 999
s 75 0 0 0 -150 0 1600
999
s 75 0 0 0 -150 0 2500
12 999
s 225 0 0 0 -150 0 1600
999
s 225 0 0 0 -150 0 2500
14 999
s -225 0 150 0 -150 0 1600
999
s -225 0 150 0 -150 0 2500
16 999
s -75 0 150 0 -150 0 1600
999
s -75 0 150 0 -150 0 2500
18 999
s 75 0 150 0 -150 0 1600
999
s 75 0 150 0 -150 0 2500
20 999
s 225 0 150 0 -150 0 1600
999
s 225 0 150 0 -150 0 2500
22 999
s -225 0 -150 0 150 0 1600
999
s -225 0 -150 0 150 0 2500
24 999
s -75 0 -150 0 150 0 1600
999
s -75 0 -150 0 150 0 2500
26 999
s 75 0 -150 0 150 0 1600
999
s 75 0 -150 0 150 0 2500
28 999
s 225 0 -150 0 150 0 1600
999
s 225 0 -150 0 150 0 2500
30 999
s -225 0 0 0 150 0 1600
999
s -225 0 0 0 150 0 2500
32 999
s -75 0 0 0 150 0 1600
999
s -75 0 0 0 150 0 2500
34 999
s 75 0 0 0 150 0 1600
999
s 75 0 0 0 150 0 2500
36 999
s 225 0 0 0 150 0 1600
999
s 225 0 0 0 150 0 2500
38 999
s -225 0 150 0 150 0 1600
999
s -225 0 150 0 150 0 2500
40 999
s -75 0 150 0 150 0 1600
999
s -75 0 150 0 150 0 2500
42 999
s 75 0 150 0 150 0 1600
999
s 75 0 150 0 150 0 2500
44 999
s 225 0 150 0 150 0 1600
999
s 225 0 150 0 150 0 2500
46 999
s -325 0 -200 0 200 -200 625
999
s -325 0 200 0 200 -200 625
999
s 325 0 -200 0 200 -200 625
999
s 325 0 200 0 200 -200 625
999
p 375 -375 250 -250 250 -250 0
0 1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
30 31 32 33 34 35 36 37 38 39
40 41 42 43 44 45 46 47 48 49
50 51 999
C.2 Material File “array2.mat”
53
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
3 92235 0.001094165 92238 0.035377986 8016 0.072944303
1 40092 0.042512584
1 40092 0.042512584
1 40092 0.042512584
1 40092 0.042512584
1 40092 0.042512584
3 1001 0.07 1002 1.00E-005 8016 0.033427788
C.3 Neutron Source File “array2.src”
95
-225 -150 -150 1.65
-225 -150 -150 0.24
-225 -150 -150 1.73
-225 -150 -150 1.62
-75 -150 -150 1.19
-75 -150 -150 0.31
-75 -150 -150 0.57
-75 -150 -150 0.65
75 -150 -150 1.02
75 -150 -150 1.65
75 -150 -150 1.47
75 -150 -150 0.02
225 -150 -150 0.71
225 -150 -150 1.38
225 -150 -150 1.32
225 -150 -150 0.24
-225 0 -150 0.82
-225 0 -150 1.72
-225 0 -150 0.29
-225 0 -150 0.26
-75 0 -150 0.56
-75 0 -150 0.91
-75 0 -150 0.18
-75 0 -150 0.63
75 0 -150 1.69
75 0 -150 1.15
75 0 -150 0.62
75 0 -150 0.45
225 0 -150 1.73
225 0 -150 1.26
225 0 -150 1.69
225 0 -150 1.58
-225 150 -150 0.09
-225 150 -150 0.96
-225 150 -150 1.44
-225 150 -150 0.33
-75 150 -150 1.22
-75 150 -150 0.21
-75 150 -150 1
-75 150 -150 0.85
75 150 -150 1.55
75 150 -150 0.97
75 150 -150 0.88
75 150 -150 0.62
225 150 -150 1.92
225 150 -150 0.07
225 150 -150 0.78
-225 -150 150 0.4
-225 -150 150 0.26
-225 -150 150 1.65
-225 -150 150 1.94
-75 -150 150 0.57
-75 -150 150 1.55
-75 -150 150 0.23
-75 -150 150 1.03
75 -150 150 1.56
75 -150 150 1.45
75 -150 150 0.82
75 -150 150 1.3
225 -150 150 0.78
225 -150 150 0.76
225 -150 150 0.57
225 -150 150 0.52
-225 0 150 1.74
-225 0 150 0.01
-225 0 150 0.42
-225 0 150 1.05
-75 0 150 0.38
-75 0 150 1.02
-75 0 150 1.17
-75 0 150 0.11
75 0 150 0.3
75 0 150 1.21
75 0 150 1.15
75 0 150 0.71
225 0 150 1.31
225 0 150 1.65
225 0 150 0.46
225 0 150 1.69
-225 150 150 1.88
-225 150 150 0.57
-225 150 150 0.11
-225 150 150 0.69
-75 150 150 0.72
-75 150 150 0.26
-75 150 150 1.02
-75 150 150 0.26
75 150 150 0.67
75 150 150 0.06
75 150 150 1.99
75 150 150 0.78
225 150 150 0.2
225 150 150 0.1
225 150 150 1.71
225 150 150 1.54
C.4 Nuclear Data File for 1H, “1001”
0.453
-1.036
2.29
0.35820615
228
1.000000000E-08 0.000000000E+00 5.298173371E-01 4.177776730E+01 0.000000000E+00
1.107437000E-08 0.000000000E+00 5.035555670E-01 4.015175885E+01 0.000000000E+00
1.265500000E-08 0.000000000E+00 4.711790902E-01 3.817722728E+01 0.000000000E+00
1.581625000E-08 0.000000000E+00 4.216244227E-01 3.523465417E+01 0.000000000E+00
1.897750000E-08 0.000000000E+00 3.849698537E-01 3.313620967E+01 0.000000000E+00
2.213875000E-08 0.000000000E+00 3.564179224E-01 3.155882285E+01 0.000000000E+00
2.530000000E-08 0.000000000E+00 3.333575608E-01 3.032768500E+01 0.000000000E+00
3.140196000E-08 0.000000000E+00 2.990841608E-01 2.857972139E+01 0.000000000E+00
3.750392000E-08 0.000000000E+00 2.735507801E-01 2.734986363E+01 0.000000000E+00
4.360588000E-08 0.000000000E+00 2.536139026E-01 2.643796224E+01 0.000000000E+00
4.970785000E-08 0.000000000E+00 2.375100203E-01 2.573547978E+01 0.000000000E+00
6.191177000E-08 0.000000000E+00 2.128573136E-01 2.472589906E+01 0.000000000E+00
7.411570000E-08 0.000000000E+00 1.946249396E-01 2.403706148E+01 0.000000000E+00
8.631962000E-08 0.000000000E+00 1.804030338E-01 2.353822209E+01 0.000000000E+00
9.852355000E-08 0.000000000E+00 1.688842471E-01 2.316088436E+01 0.000000000E+00
1.229314000E-07 0.000000000E+00 1.511597703E-01 2.262880036E+01 0.000000000E+00
1.473392000E-07 0.000000000E+00 1.380128318E-01 2.227209633E+01 0.000000000E+00
1.717471000E-07 0.000000000E+00 1.277882878E-01 2.201652395E+01 0.000000000E+00
1.961549000E-07 0.000000000E+00 1.195550201E-01 2.182446860E+01 0.000000000E+00
2.205628000E-07 0.000000000E+00 1.127417884E-01 2.167488172E+01 0.000000000E+00
2.693785000E-07 0.000000000E+00 1.020218294E-01 2.145697307E+01 0.000000000E+00
3.181942000E-07 0.000000000E+00 9.387577644E-02 2.130587858E+01 0.000000000E+00
3.670099000E-07 0.000000000E+00 8.741219777E-02 2.119494404E+01 0.000000000E+00
4.158257000E-07 0.000000000E+00 8.212085182E-02 2.111003071E+01 0.000000000E+00
5.134571000E-07 0.000000000E+00 7.389935283E-02 2.098858820E+01 0.000000000E+00
6.110886000E-07 0.000000000E+00 6.773638636E-02 2.090590029E+01 0.000000000E+00
7.087200000E-07 0.000000000E+00 6.289906779E-02 2.084595772E+01 0.000000000E+00
8.063515000E-07 0.000000000E+00 5.897139969E-02 2.080050293E+01 0.000000000E+00
1.001614000E-06 0.000000000E+00 5.290901083E-02 2.073611997E+01 0.000000000E+00
1.196877000E-06 0.000000000E+00 4.838950080E-02 2.069269073E+01 0.000000000E+00
1.392140000E-06 0.000000000E+00 4.486622308E-02 2.066140632E+01 0.000000000E+00
1.587403000E-06 0.000000000E+00 4.202557700E-02 2.063778986E+01 0.000000000E+00
1.977929000E-06 0.000000000E+00 3.765199272E-02 2.060448592E+01 0.000000000E+00
2.368455000E-06 0.000000000E+00 3.438960244E-02 2.058210845E+01 0.000000000E+00
2.758981000E-06 0.000000000E+00 3.185657365E-02 2.056602700E+01 0.000000000E+00
3.149507000E-06 0.000000000E+00 2.983053953E-02 2.055390553E+01 0.000000000E+00
3.930559000E-06 0.000000000E+00 2.670911071E-02 2.053682727E+01 0.000000000E+00
4.711611000E-06 0.000000000E+00 2.437652734E-02 2.052535340E+01 0.000000000E+00
5.492663000E-06 0.000000000E+00 2.256801568E-02 2.051710316E+01 0.000000000E+00
6.273715000E-06 0.000000000E+00 2.113030270E-02 2.051087946E+01 0.000000000E+00
7.835818000E-06 0.000000000E+00 1.891307671E-02 2.050209192E+01 0.000000000E+00
9.397922000E-06 0.000000000E+00 1.725598625E-02 2.049616727E+01 0.000000000E+00
1.096003000E-05 0.000000000E+00 1.597126059E-02 2.049189130E+01 0.000000000E+00
1.252213000E-05 0.000000000E+00 1.495041302E-02 2.048865466E+01 0.000000000E+00
1.564634000E-05 0.000000000E+00 1.337845769E-02 2.048405743E+01 0.000000000E+00
1.877055000E-05 0.000000000E+00 1.220592573E-02 2.048093233E+01 0.000000000E+00
2.189476000E-05 0.000000000E+00 1.129651330E-02 2.047865894E+01 0.000000000E+00
2.501897000E-05 0.000000000E+00 1.057158665E-02 2.047692474E+01 0.000000000E+00
3.126739000E-05 0.000000000E+00 9.458095307E-03 2.047443405E+01 0.000000000E+00
3.751581000E-05 0.000000000E+00 8.629782437E-03 2.047271502E+01 0.000000000E+00
4.376423000E-05 0.000000000E+00 7.987087330E-03 2.047144659E+01 0.000000000E+00
5.001265000E-05 0.000000000E+00 7.473133016E-03 2.047046604E+01 0.000000000E+00
6.250948000E-05 0.000000000E+00 6.685074944E-03 2.046903130E+01 0.000000000E+00
7.500632000E-05 0.000000000E+00 6.100163053E-03 2.046801696E+01 0.000000000E+00
8.750316000E-05 0.000000000E+00 5.646157483E-03 2.046725188E+01 0.000000000E+00
1.000000000E-04 0.000000000E+00 5.281919301E-03 2.046622089E+01 0.000000000E+00
1.231110000E-04 0.000000000E+00 4.758629197E-03 2.045534330E+01 0.000000000E+00
1.495397000E-04 0.000000000E+00 4.314992030E-03 2.044479941E+01 0.000000000E+00
1.813991000E-04 0.000000000E+00 3.915263847E-03 2.043439733E+01 0.000000000E+00
2.132585000E-04 0.000000000E+00 3.609086338E-03 2.042573052E+01 0.000000000E+00
2.531544000E-04 0.000000000E+00 3.310817300E-03 2.041658618E+01 0.000000000E+00
2.930504000E-04 0.000000000E+00 3.075822085E-03 2.040881000E+01 0.000000000E+00
3.418519000E-04 0.000000000E+00 2.846555201E-03 2.040065064E+01 0.000000000E+00
3.906535000E-04 0.000000000E+00 2.661778316E-03 2.039359925E+01 0.000000000E+00
4.496277000E-04 0.000000000E+00 2.480099055E-03 2.038618621E+01 0.000000000E+00
5.086020000E-04 0.000000000E+00 2.331055625E-03 2.037969961E+01 0.000000000E+00
5.784139000E-04 0.000000000E+00 2.185078966E-03 2.037294114E+01 0.000000000E+00
6.482258000E-04 0.000000000E+00 2.063403776E-03 2.036696178E+01 0.000000000E+00
7.298269000E-04 0.000000000E+00 1.944000937E-03 2.036074794E+01 0.000000000E+00
8.114280000E-04 0.000000000E+00 1.843322036E-03 2.035520706E+01 0.000000000E+00
1.000000000E-03 0.000000000E+00 1.659702848E-03 2.034354300E+01 0.000000000E+00
1.138267000E-03 0.000000000E+00 1.560128206E-03 2.030222925E+01 0.000000000E+00
1.288833000E-03 0.000000000E+00 1.470477857E-03 2.026189423E+01 0.000000000E+00
1.452245000E-03 0.000000000E+00 1.389192251E-03 2.022321533E+01 0.000000000E+00
1.629032000E-03 0.000000000E+00 1.315306590E-03 2.018609993E+01 0.000000000E+00
2.025060000E-03 0.000000000E+00 1.185236323E-03 2.011588892E+01 0.000000000E+00
2.481141000E-03 0.000000000E+00 1.068877755E-03 2.005067041E+01 0.000000000E+00
3.002791000E-03 0.000000000E+00 9.697747793E-04 1.998952169E+01 0.000000000E+00
3.593504000E-03 0.000000000E+00 8.817281932E-04 1.993216003E+01 0.000000000E+00
4.257798000E-03 0.000000000E+00 8.051065475E-04 1.987812610E+01 0.000000000E+00
5.000000000E-03 0.000000000E+00 7.375931888E-04 1.982645212E+01 0.000000000E+00
6.034926000E-03 0.000000000E+00 6.640425648E-04 1.966107082E+01 0.000000000E+00
7.207047000E-03 0.000000000E+00 6.000072207E-04 1.950570484E+01 0.000000000E+00
8.525727000E-03 0.000000000E+00 5.445084466E-04 1.935967800E+01 0.000000000E+00
1.000000000E-02 0.000000000E+00 4.958639821E-04 1.922153614E+01 0.000000000E+00
1.206215000E-02 0.000000000E+00 4.426316209E-04 1.892057209E+01 0.000000000E+00
1.440269000E-02 0.000000000E+00 3.975181418E-04 1.863943558E+01 0.000000000E+00
2.000000000E-02 0.000000000E+00 3.238853737E-04 1.812960941E+01 0.000000000E+00
2.464048000E-02 0.000000000E+00 2.832718308E-04 1.763113001E+01 0.000000000E+00
3.000000000E-02 0.000000000E+00 2.491047128E-04 1.717288258E+01 0.000000000E+00
3.500000000E-02 0.000000000E+00 2.225018744E-04 1.671338937E+01 0.000000000E+00
4.500000000E-02 0.000000000E+00 1.897098384E-04 1.592276684E+01 0.000000000E+00
5.500000000E-02 0.000000000E+00 1.651864579E-04 1.521350215E+01 0.000000000E+00
6.500000000E-02 0.000000000E+00 1.469345529E-04 1.457368221E+01 0.000000000E+00
9.000000000E-02 0.000000000E+00 1.164022971E-04 1.322601849E+01 0.000000000E+00
1.100000000E-01 0.000000000E+00 1.001933883E-04 1.233293832E+01 0.000000000E+00
1.400000000E-01 0.000000000E+00 8.325794400E-05 1.125504543E+01 0.000000000E+00
1.600000000E-01 0.000000000E+00 7.514343826E-05 1.065204516E+01 0.000000000E+00
1.900000000E-01 0.000000000E+00 6.585397216E-05 9.884190192E+00 0.000000000E+00
2.400000000E-01 0.000000000E+00 5.559232649E-05 8.878080473E+00 0.000000000E+00
3.000000000E-01 0.000000000E+00 4.829071169E-05 7.967857643E+00 0.000000000E+00
3.400000000E-01 0.000000000E+00 4.501850772E-05 7.485047186E+00 0.000000000E+00
4.000000000E-01 0.000000000E+00 4.149042868E-05 6.891235806E+00 0.000000000E+00
4.600000000E-01 0.000000000E+00 3.913823594E-05 6.411728021E+00 0.000000000E+00
5.500000000E-01 0.000000000E+00 3.739607821E-05 5.841250518E+00 0.000000000E+00
6.500000000E-01 0.000000000E+00 3.616309348E-05 5.350436405E+00 0.000000000E+00
7.500000000E-01 0.000000000E+00 3.516704102E-05 4.961127520E+00 0.000000000E+00
8.500000000E-01 0.000000000E+00 3.474201043E-05 4.642921467E+00 0.000000000E+00
1.000000000E+00 0.000000000E+00 3.446014919E-05 4.258621960E+00 0.000000000E+00
1.100000000E+00 0.000000000E+00 3.444000000E-05 4.047100000E+00 0.000000000E+00
1.200000000E+00 0.000000000E+00 3.441000000E-05 3.862500000E+00 0.000000000E+00
1.300000000E+00 0.000000000E+00 3.449000000E-05 3.699200000E+00 0.000000000E+00
1.400000000E+00 0.000000000E+00 3.436000000E-05 3.553300000E+00 0.000000000E+00
1.500000000E+00 0.000000000E+00 3.434000000E-05 3.421900000E+00 0.000000000E+00
1.600000000E+00 0.000000000E+00 3.431000000E-05 3.302500000E+00 0.000000000E+00
1.700000000E+00 0.000000000E+00 3.429000000E-05 3.193400000E+00 0.000000000E+00
1.800000000E+00 0.000000000E+00 3.427000000E-05 3.093100000E+00 0.000000000E+00
1.900000000E+00 0.000000000E+00 3.425000000E-05 3.000300000E+00 0.000000000E+00
2.000000000E+00 0.000000000E+00 3.423000000E-05 2.914200000E+00 0.000000000E+00
2.100000000E+00 0.000000000E+00 3.437814803E-05 2.833900000E+00 0.000000000E+00
2.200000000E+00 0.000000000E+00 3.452000000E-05 2.758800000E+00 0.000000000E+00
2.300000000E+00 0.000000000E+00 3.466785004E-05 2.688300000E+00 0.000000000E+00
2.400000000E+00 0.000000000E+00 3.481000000E-05 2.621900000E+00 0.000000000E+00
2.500000000E+00 0.000000000E+00 3.495760014E-05 2.559200000E+00 0.000000000E+00
2.600000000E+00 0.000000000E+00 3.510000000E-05 2.499900000E+00 0.000000000E+00
2.700000000E+00 0.000000000E+00 3.524738762E-05 2.443600000E+00 0.000000000E+00
2.800000000E+00 0.000000000E+00 3.539000000E-05 2.390200000E+00 0.000000000E+00
2.900000000E+00 0.000000000E+00 3.553720474E-05 2.339300000E+00 0.000000000E+00
3.000000000E+00 0.000000000E+00 3.568000000E-05 2.290700000E+00 0.000000000E+00
3.200000000E+00 0.000000000E+00 3.580000000E-05 2.199900000E+00 0.000000000E+00
3.400000000E+00 0.000000000E+00 3.591000000E-05 2.116600000E+00 0.000000000E+00
3.600000000E+00 0.000000000E+00 3.603000000E-05 2.039800000E+00 0.000000000E+00
3.800000000E+00 0.000000000E+00 3.614000000E-05 1.968600000E+00 0.000000000E+00
4.000000000E+00 0.000000000E+00 3.626000000E-05 1.902400000E+00 0.000000000E+00
4.200000000E+00 0.000000000E+00 3.629000000E-05 1.840600000E+00 0.000000000E+00
4.400000000E+00 0.000000000E+00 3.632000000E-05 1.782800000E+00 0.000000000E+00
4.600000000E+00 0.000000000E+00 3.636000000E-05 1.728600000E+00 0.000000000E+00
4.800000000E+00 0.000000000E+00 3.639000000E-05 1.677500000E+00 0.000000000E+00
5.000000000E+00 0.000000000E+00 3.642000000E-05 1.629400000E+00 0.000000000E+00
5.200000000E+00 0.000000000E+00 3.629000000E-05 1.583544516E+00 0.000000000E+00
5.400000000E+00 0.000000000E+00 3.616000000E-05 1.540638627E+00 0.000000000E+00
5.500000000E+00 0.000000000E+00 3.609940466E-05 1.520200000E+00 0.000000000E+00
5.600000000E+00 0.000000000E+00 3.604000000E-05 1.499846352E+00 0.000000000E+00
5.800000000E+00 0.000000000E+00 3.591000000E-05 1.460986149E+00 0.000000000E+00
6.000000000E+00 0.000000000E+00 3.578000000E-05 1.424400000E+00 0.000000000E+00
6.200000000E+00 0.000000000E+00 3.567000000E-05 1.389030836E+00 0.000000000E+00
6.400000000E+00 0.000000000E+00 3.556000000E-05 1.355621779E+00 0.000000000E+00
6.500000000E+00 0.000000000E+00 3.550453431E-05 1.339600000E+00 0.000000000E+00
6.600000000E+00 0.000000000E+00 3.545000000E-05 1.323642369E+00 0.000000000E+00
6.800000000E+00 0.000000000E+00 3.534000000E-05 1.292987067E+00 0.000000000E+00
7.000000000E+00 0.000000000E+00 3.523000000E-05 1.263900000E+00 0.000000000E+00
7.500000000E+00 0.000000000E+00 3.459000000E-05 1.195900000E+00 0.000000000E+00
8.000000000E+00 0.000000000E+00 3.394000000E-05 1.134500000E+00 0.000000000E+00
8.500000000E+00 0.000000000E+00 3.364000000E-05 1.078600000E+00 0.000000000E+00
9.000000000E+00 0.000000000E+00 3.333000000E-05 1.027700000E+00 0.000000000E+00
9.500000000E+00 0.000000000E+00 3.296000000E-05 9.810600000E-01 0.000000000E+00
1.000000000E+01 0.000000000E+00 3.259000000E-05 9.381600000E-01 0.000000000E+00
1.050000000E+01 0.000000000E+00 3.221000000E-05 8.985700000E-01 0.000000000E+00
1.100000000E+01 0.000000000E+00 3.182000000E-05 8.619400000E-01 0.000000000E+00
1.150000000E+01 0.000000000E+00 3.145000000E-05 8.279300000E-01 0.000000000E+00
1.200000000E+01 0.000000000E+00 3.108000000E-05 7.962800000E-01 0.000000000E+00
1.250000000E+01 0.000000000E+00 3.063000000E-05 7.667600000E-01 0.000000000E+00
1.300000000E+01 0.000000000E+00 3.018000000E-05 7.391500000E-01 0.000000000E+00
1.350000000E+01 0.000000000E+00 3.001000000E-05 7.132900000E-01 0.000000000E+00
1.400000000E+01 0.000000000E+00 2.983000000E-05 6.890000000E-01 0.000000000E+00
1.450000000E+01 0.000000000E+00 2.940000000E-05 6.661500000E-01 0.000000000E+00
1.500000000E+01 0.000000000E+00 2.896000000E-05 6.446200000E-01 0.000000000E+00
1.550000000E+01 0.000000000E+00 2.863000000E-05 6.242900000E-01 0.000000000E+00
1.600000000E+01 0.000000000E+00 2.830000000E-05 6.050700000E-01 0.000000000E+00
1.650000000E+01 0.000000000E+00 2.788000000E-05 5.868037565E-01 0.000000000E+00
1.700000000E+01 0.000000000E+00 2.745000000E-05 5.696100000E-01 0.000000000E+00
1.750000000E+01 0.000000000E+00 2.736000000E-05 5.531658724E-01 0.000000000E+00
1.800000000E+01 0.000000000E+00 2.726000000E-05 5.376400000E-01 0.000000000E+00
1.850000000E+01 0.000000000E+00 2.673000000E-05 5.227535194E-01 0.000000000E+00
1.900000000E+01 0.000000000E+00 2.620000000E-05 5.086600000E-01 0.000000000E+00
1.950000000E+01 0.000000000E+00 2.612000000E-05 4.951201309E-01 0.000000000E+00
2.100000000E+01 0.000000000E+00 2.553287162E-05 4.581500000E-01 0.000000000E+00
2.200000000E+01 0.000000000E+00 2.505853982E-05 4.360200000E-01 0.000000000E+00
2.300000000E+01 0.000000000E+00 2.461353168E-05 4.156500000E-01 0.000000000E+00
2.400000000E+01 0.000000000E+00 2.419487313E-05 3.968700000E-01 0.000000000E+00
2.500000000E+01 0.000000000E+00 2.380000000E-05 3.795000000E-01 0.000000000E+00
2.600000000E+01 0.000000000E+00 2.348033824E-05 3.634500000E-01 0.000000000E+00
2.700000000E+01 0.000000000E+00 2.317679620E-05 3.481100000E-01 0.000000000E+00
2.800000000E+01 0.000000000E+00 2.288800766E-05 3.329500000E-01 0.000000000E+00
2.900000000E+01 0.000000000E+00 2.261276583E-05 3.189100000E-01 0.000000000E+00
3.000000000E+01 0.000000000E+00 2.235000000E-05 3.056690000E-01 0.000000000E+00
3.200000000E+01 0.000000000E+00 2.168747690E-05 2.834800000E-01 0.000000000E+00
3.400000000E+01 0.000000000E+00 2.108303176E-05 2.639490000E-01 0.000000000E+00
3.500000000E+01 0.000000000E+00 2.080000000E-05 2.550341280E-01 0.000000000E+00
3.600000000E+01 0.000000000E+00 2.051871530E-05 2.466590000E-01 0.000000000E+00
3.800000000E+01 0.000000000E+00 1.998946910E-05 2.312710000E-01 0.000000000E+00
4.000000000E+01 0.000000000E+00 1.950000000E-05 2.175050000E-01 0.000000000E+00
4.200000000E+01 0.000000000E+00 1.903657740E-05 2.051280000E-01 0.000000000E+00
4.400000000E+01 0.000000000E+00 1.860497774E-05 1.939580000E-01 0.000000000E+00
4.500000000E+01 0.000000000E+00 1.840000000E-05 1.887745242E-01 0.000000000E+00
4.600000000E+01 0.000000000E+00 1.819764541E-05 1.838390000E-01 0.000000000E+00
4.800000000E+01 0.000000000E+00 1.781211419E-05 1.746410000E-01 0.000000000E+00
5.000000000E+01 0.000000000E+00 1.745000000E-05 1.662530000E-01 0.000000000E+00
5.200000000E+01 0.000000000E+00 1.701001484E-05 1.585780000E-01 0.000000000E+00
5.400000000E+01 0.000000000E+00 1.659711386E-05 1.515380000E-01 0.000000000E+00
5.500000000E+01 0.000000000E+00 1.640000000E-05 1.482352136E-01 0.000000000E+00
5.600000000E+01 0.000000000E+00 1.618772118E-05 1.450620000E-01 0.000000000E+00
5.800000000E+01 0.000000000E+00 1.578215898E-05 1.390920000E-01 0.000000000E+00
6.000000000E+01 0.000000000E+00 1.540000000E-05 1.335750000E-01 0.000000000E+00
6.200000000E+01 0.000000000E+00 1.506710831E-05 1.284640000E-01 0.000000000E+00
6.400000000E+01 0.000000000E+00 1.475164479E-05 1.237190000E-01 0.000000000E+00
6.500000000E+01 0.000000000E+00 1.460000000E-05 1.214765192E-01 0.000000000E+00
6.600000000E+01 0.000000000E+00 1.446365740E-05 1.193080000E-01 0.000000000E+00
6.800000000E+01 0.000000000E+00 1.420073033E-05 1.151970000E-01 0.000000000E+00
7.000000000E+01 0.000000000E+00 1.395000000E-05 1.113610000E-01 0.000000000E+00
7.200000000E+01 0.000000000E+00 1.370182002E-05 1.077740000E-01 0.000000000E+00
7.400000000E+01 0.000000000E+00 1.346467651E-05 1.044150000E-01 0.000000000E+00
7.500000000E+01 0.000000000E+00 1.335000000E-05 1.028174393E-01 0.000000000E+00
7.600000000E+01 0.000000000E+00 1.326691275E-05 1.012650000E-01 0.000000000E+00
7.800000000E+01 0.000000000E+00 1.310546711E-05 9.830760000E-02 0.000000000E+00
8.000000000E+01 0.000000000E+00 1.295000000E-05 9.552690000E-02 0.000000000E+00
8.200000000E+01 0.000000000E+00 1.286816341E-05 9.290890000E-02 0.000000000E+00
8.400000000E+01 0.000000000E+00 1.278879762E-05 9.044090000E-02 0.000000000E+00
8.500000000E+01 0.000000000E+00 1.275000000E-05 8.926185335E-02 0.000000000E+00
8.600000000E+01 0.000000000E+00 1.269844010E-05 8.811170000E-02 0.000000000E+00
8.800000000E+01 0.000000000E+00 1.259770183E-05 8.591080000E-02 0.000000000E+00
9.000000000E+01 0.000000000E+00 1.250000000E-05 8.382850000E-02 0.000000000E+00
9.200000000E+01 0.000000000E+00 1.243880486E-05 8.185630000E-02 0.000000000E+00
9.400000000E+01 0.000000000E+00 1.237921585E-05 7.998660000E-02 0.000000000E+00
9.500000000E+01 0.000000000E+00 1.235000000E-05 7.908985664E-02 0.000000000E+00
9.600000000E+01 0.000000000E+00 1.231922908E-05 7.821240000E-02 0.000000000E+00
9.800000000E+01 0.000000000E+00 1.225886126E-05 7.652760000E-02 0.000000000E+00
1.000000000E+02 0.000000000E+00 1.22000000