
A Ray Tracing

Hardware Architecture

for Dynamic Scenes

by

Sven Woop

A thesis submitted in partial fulfillment of the requirements

for the degree of

Diplom-Informatiker

(Diploma in Computer Science)

Completed under the supervision of

Jörg Schmittler and Prof. Dr.-Ing. Philipp Slusallek

at the

Universität des Saarlandes

Fachrichtung 6.2 - Informatik

Computer Graphik

Im Stadtwald - Geb. 36.1, Raum 018

66123 Saarbrücken

March 29, 2004


Copyright © 2004, by Sven Woop


Statutory Declaration

I hereby declare under oath that I have written this thesis independently and have used no aids other than those indicated.

Saarbrücken, March 29, 2004

Sven Woop


Acknowledgements

I would like to thank Jörg Schmittler for his assistance and for spending several nights getting the prototype to work. Thanks to Prof. Slusallek for his support and constructive criticism.


Abstract

This thesis describes a ray tracing hardware architecture for dynamic scenes that makes it possible to ray trace highly complex scenes in real time. Ray tracing of dynamic scenes appears difficult to do efficiently, as ray tracing requires an acceleration structure whose creation is very costly. The well-known solution to this problem is to partition the scene into movable objects, which leads to a top-level acceleration structure over the objects and a bottom-level acceleration structure in each object. The presented architecture efficiently supports such partitioned scenes by using one transformation unit for both the triangle intersection and the object space transformation. A prototype of the hardware architecture has been implemented on an FPGA and is the first working special purpose real-time ray tracing hardware available today. The performance and implementation details of this prototype are discussed at the end of this thesis.

Contents

1 Introduction
2 Previous Work
3 The Basic Ray Tracing Algorithm
  3.1 k-D Trees
    3.1.1 k-D Tree Creation
    3.1.2 Recursive k-D Tree Traversal
    3.1.3 Packet k-D Tree Traversal
4 The Dynamic Ray Tracing Algorithm
  4.1 Top-Level k-D Tree Creation
  4.2 Bounding Box Clipping
  4.3 Overlapping Objects
    4.3.1 Hierarchical k-D Trees
    4.3.2 Mailboxing
    4.3.3 Multiple Scenes
  4.4 Ray Transformation
  4.5 Hit-Distance Transformation
  4.6 Normal Transformation
5 Triangle Intersection
  5.1 Affine Triangle Transformation
    5.1.1 Memory Efficient Triangle Transformation
    5.1.2 Normal Consistent Triangle Transformation
  5.2 Unit Triangle Intersection
6 The Dynamic SaarCOR Architecture
  6.1 Dynamic Ray Tracing Core
    6.1.1 Traversal Unit
    6.1.2 Mailboxed List Unit
    6.1.3 Transformation Unit
    6.1.4 Intersection Unit
    6.1.5 Balancing
  6.2 Shading Unit
    6.2.1 Primary Rays
    6.2.2 Light Rays
    6.2.3 Reflection Rays
7 FPGA Prototype
  7.1 Implementation Statistics
    7.1.1 Gate Count
    7.1.2 Complexity
  7.2 Performance Statistics
    7.2.1 Hardware Quality Index
    7.2.2 Graphics Hardware Quality Index
    7.2.3 Usage
    7.2.4 Cache Hit Rate
    7.2.5 Memory Bandwidth
    7.2.6 Performance
8 Conclusion
9 Future Work
10 Appendix A
  10.1 Office
  10.2 Gael
  10.3 Conference
  10.4 Trees4000

List of Figures

3.1 Ray Tracing Basics
3.2 k-D Tree Semantics
3.3 k-D Tree Example
3.4 k-D Tree Traversal Example
3.5 Hit-Distance Computation
3.6 Traversal Decisions
3.7 Packet Traversal
3.8 Example of an Invalid Packet
4.1 Dynamic Acceleration Structure
4.2 Ray Transformation into Object Space
4.3 Bounding Box of Object Instances
4.4 Bounding Box Clipping
4.5 Bounding Box Clipping Example
4.6 Overlapping Objects
4.7 Room Problem
4.8 Hierarchical k-D Trees as Solution to the Room Problem
4.9 Normal Transformation
5.1 Unit Triangle Intersection
6.1 Dynamic Ray Tracing Architecture
6.2 Traversal Unit
6.3 Mailboxed List Unit
6.4 Transformation Unit
6.5 Compressable Packets
6.6 Reflection Matrix Illustration
7.1 ADMXRC Development Platform
7.2 ADMXRC Top-Level Flowchart
7.3 Dynamic SaarCOR Prototype
7.4 Hardware Optimized Hilbert Curve
7.5 Cache Hit Rate using the Hardware Optimized Hilbert Curve
7.6 Hardware Quality Index
7.7 Graphics Hardware Quality Index
7.8 Usage of Units
7.9 Frame Rate
7.10 Cache Hit Rate
7.11 Memory Bandwidth
7.12 Scalability

List of Tables

4.1 Millions of operations for various strategies
7.1 Maximum Cache Size per Unit
7.2 Gate Count Computation
7.3 Complexity of one Ray Tracing Pipeline
7.4 Gate Count and Memory Bits per Unit using 32 Packets
7.5 Gate Count and Memory Bits per Unit using 512 Cache Lines

Chapter 1

Introduction

Ray tracing is one of the most popular rendering techniques for creating highly realistic images. However, because it is a computationally expensive recursive algorithm that requires large memory bandwidth, implementing it in hardware is a challenging task.

As a consequence, the state of the art in interactive 3D computer graphics is still rasterization hardware. The rasterization algorithm is efficient for scenes consisting of few triangles, while ray tracing is not. Today's computer graphics hardware can nevertheless handle scenes with several hundred thousand triangles, which is made possible by high memory bandwidth and high floating point performance. For instance, Nvidia's GeForce 3 [1] offers 76 GFlops at a clock rate of 200 MHz and has a 256 bit wide memory interface running at 230 MHz, delivering a memory bandwidth of 7.2 GB/s.

In recent years the scenes of standard computer games have become more and more detailed. Computer games are developed for the current graphics card standard, but rasterization hardware will become a limiting factor in the near future. Because the main concept behind rasterization hardware is to project each triangle of the scene to a frame- and z-buffer, the rasterization algorithm scales linearly in the number of triangles of the scene. Furthermore, it is difficult to parallelize the rasterization algorithm, as the bandwidth to the frame- and z-buffer becomes critical. This is because each triangle that is projected onto the image plane involves many memory accesses to the frame- and z-buffer. If the triangles of the scene are large, the performance consequently drops. For a detailed description of the rasterization algorithm see any standard textbook on computer graphics, for example that by Shirley [2].

Ray tracing does not suffer from these problems, as the tracing of single rays can trivially be parallelized, because the rays do not depend on each other. Moreover, it can be shown that the ray tracing algorithm scales logarithmically in the number of triangles in the scene [3]. The only problems might be that the initial hardware cost for ray tracing is high and that the memory interface to the scene database has to deliver sufficient bandwidth to the ray tracing units working in parallel. Later in this thesis it will be shown that it is possible to deliver the required bandwidth using fairly small caches.

A main advantage of the ray tracing algorithm is that it simulates reality,

by supporting different kinds of lighting effects like reflections, refractions,

shadows and even real-time global illumination [4]. For a human viewer

these effects are very important to understand the three dimensional relation

between the objects of the scene. Here, shadows play an especially important

role.

Rasterization hardware does support some of these effects, but only by using multi-pass rasterization tricks to fake them. These multi-pass rasterization techniques (to produce shadows, for instance) are often non-obvious and difficult to implement. In contrast, ray tracing offers an extremely simple and intuitive shading model. For instance, it is simple to shoot a ray from a point in the scene to a light source to check whether the point lies in the shadow of the light source or not.

In particular, the large number of memory accesses (which are more or less randomly distributed over the scene data) and the expensive computation made it impossible to create a real-time ray tracing system until recently. Lately a lot of work has been done to cope with these problems. Taking advantage of the coherence between neighboring rays to reduce the memory bandwidth and using a cluster of processors to provide enough computational power made real-time ray tracing possible. Such a software based real-time ray tracing system has been developed by the Computer Graphics Lab of Saarland University [5, 6, 7].

However, these techniques require a lot of costly, albeit standard, hardware. The SaarCOR project takes a different approach. Instead of using standard PC hardware for the computation, it is more efficient to create special purpose hardware that is optimized for the ray tracing application. Jörg Schmittler designed such an architecture, which is called SaarCOR (Saarbrücken's Coherence Optimized Ray Tracer). This architecture has been fully simulated with very promising results [8].

Up to now SaarCOR has been limited to static scenes and to a standard k-D tree as acceleration structure. In such a static scene the camera can be moved around, but no object itself can be moved. This is a severe limitation which, for instance, makes it impossible to develop a computer game for the standard SaarCOR architecture.

In this thesis a ray tracing hardware architecture for dynamic scenes is presented, based on the SaarCOR architecture. As ray tracing heavily relies on precomputations, it seems to be difficult to ray trace dynamic scenes. Thus a data structure is required that allows as many precomputations as possible to be done, but also allows objects in the scene to be moved around. This can be achieved by partitioning the scene into movable objects and building a top-level acceleration structure over them. This top-level acceleration structure needs to be recomputed each time an object has been moved. Each object itself contains a precomputed bottom-level acceleration structure that stays static. To traverse a ray through the bottom-level acceleration structure of an object, the ray has to be transformed to the object's local coordinate system. This requires a transformation unit, which can also be used as a kind of precomputation when using the new triangle intersection method described in Section 5.

Using the structured scene representation it is possible to share geometry by placing the same object at several positions. This reduces the size of the representation of most scenes. A prototype of the hardware architecture has been implemented on an FPGA and is the first working special purpose real-time ray tracing hardware available today.

Most of the concepts of this thesis can be understood without a detailed knowledge of FPGAs or ASICs, but a short description will be given here.

An FPGA (field programmable gate array) can be seen as an array of CLBs (configurable logic blocks) with some programmable routing resources to connect the individual CLBs. The internal structure of these CLBs differs from architecture to architecture. In Xilinx FPGAs, the CLBs mainly consist of some registers and LUTs (look up tables). LUTs are programmable 4-to-1 function generators that can be used together with the routing resources to implement any circuit. The circuit in the FPGA can be reconfigured arbitrarily often. For a detailed description of FPGAs see the book “Field Programmable Gate Arrays” [9].

In contrast, an ASIC (application specific integrated circuit) consists of an array of NAND gates. The interconnection between different gates is done by some extra silicon layers that are added to the chip. Thus a main difference from FPGAs is that ASICs are in no way reconfigurable. The advantages of ASICs are their high gate capacity, low price at high volumes and high speed compared to FPGAs. A description of ASIC design can be found in the book “Application-Specific Integrated Circuits” [10].

At the beginning of this thesis the basics of the ray tracing algorithm using k-D trees are explained. To achieve dynamics, the standard k-D trees are extended to 2-level k-D trees, and the transformations needed for the 2-level traversal algorithm are discussed, as well as some problems that might occur. The next Section describes a new triangle intersection method that is used in the hardware architecture. These Sections form the basis for understanding the ray tracing hardware architecture for dynamic scenes, presented in the following Chapter. The prototype implementation of the architecture is then described and a detailed analysis of its performance is given. The last part summarizes this thesis and shows areas of future work.

Chapter 2

Previous Work

The state of the art in interactive ray tracing consists of software based systems. Several approaches have already been realized on MIMD and SIMD architectures [11, 12, 13], exploiting the coherence between neighboring rays. Through parallelization of the algorithm on supercomputers [14, 15, 16, 17, 18, 19] and, recently, on standard PCs [6, 7], interactive ray tracing has become possible.

Besides these software based ray tracing systems, some special purpose hardware has been developed. As the most costly operation of the ray tracing algorithm is the ray triangle intersection, the first commercially available ray tracing accelerator performed this operation only [20, 21]. This ray tracing accelerator has no hardware support for the traversal operation and is therefore not able to do ray tracing in real time. In 1999, Pfister et al. published the VolumePro 500 architecture, which is a single-chip real-time volume rendering hardware [22].

A different approach is to map the ray tracing application to a multi-processor architecture on a single chip [23], which should be available in the near future. Purcell has simulated a ray tracer for such an architecture delivering real-time performance [24]. A kind of multi-processor vector architecture is present in today's high end graphics cards too, in the form of programmable pixel shaders. It has been shown that the ray tracing application can be mapped to these shaders [25].

Ray tracing of dynamic scenes is a new topic of research. The paper “Distributed Interactive Ray Tracing of Dynamic Scenes” [26] discusses the basics of the 2-level ray tracing algorithm for dynamic scenes used in the hardware architecture presented in this thesis. Instead of rebuilding the acceleration structure to achieve dynamics, it is possible to use special algorithms to update it. For example, using a hierarchical grid as an acceleration structure, it is possible to update an object's position in the scene in constant time [27].

Chapter 3

The Basic Ray Tracing

Algorithm

Ray tracing is a simulation technique to create realistic images of three-dimensional scenes. This is done by shooting imaginary rays through a scene and interpreting the resulting intersections, as described in this Section.

In a real environment light is emitted by some light sources and then distributed through the scene in a manner consistent with physical laws. If a camera is placed in this environment, some light enters it and an image is projected onto the image plane. The physical theory of this light distribution is well known today, but simulating it exactly is difficult, since the available computational power is strongly limited. Thus in practice some approximations need to be made.

In contrast to reality, ray tracing goes the opposite way and follows the

light back from the camera to the light sources. This is done by shooting

so called primary rays for each pixel of the image from the camera into the

scene and computing the closest object that is hit by the ray, the hit-object.

This shooting of a ray to determine the hit-object is called ray casting and

the origin of the primary rays is the projection center of the camera. The

point in 3D space where the object is hit is called the hit-point of the ray

(see Figure 3.1).
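The generation of primary rays can be sketched in a few lines. The following is only an illustration under assumptions that are not part of this thesis: a hypothetical pinhole camera placed at the origin, looking along the negative z-axis, with a vertical field of view `fov_deg`. The function name `primary_ray` and its parameters are chosen for this example.

```python
import math

def primary_ray(px, py, width, height, fov_deg=60.0):
    """Return (org, dir) of the primary ray through pixel (px, py).

    Assumption: pinhole camera at the origin looking down the -z axis.
    """
    aspect = width / height
    half = math.tan(math.radians(fov_deg) / 2.0)
    # Map the pixel centre to image plane coordinates.
    x = (2.0 * (px + 0.5) / width - 1.0) * half * aspect
    y = (1.0 - 2.0 * (py + 0.5) / height) * half
    length = math.sqrt(x * x + y * y + 1.0)
    org = (0.0, 0.0, 0.0)  # projection centre of the camera
    direction = (x / length, y / length, -1.0 / length)
    return org, direction
```

Calling `primary_ray` once per pixel yields the set of rays that are then passed to the ray casting operation.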

After computing the hit-object and hit-point of a primary ray, it is known which object is visible through the pixel; the algorithm thus performs a kind of visible surface computation. At this stage a shader corresponding to the material of the hit-object is called, which has the task of computing the color of the pixel using the intersection results of the ray with the hit-object.

Figure 3.1: A 2 dimensional example of the ray tracing algorithm. For each pixel of the image a primary ray is shot into the scene and the closest object that is hit by the ray is computed.

The shader computes the pixel color based on several material properties of the hit-object, like the object's color, surface normal, reflectivity and transparency, and on scene properties such as the light sources. More advanced shaders shoot secondary rays to simulate several light effects. For example, it is possible to shoot light rays from the light sources of the scene to the hit-point of the ray to compute whether the hit-point lies in the shadow of a light source or not. Even reflections can be computed by using the surface normal to compute a reflection ray, which determines which geometry is seen in the reflective surface.
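As a minimal sketch of how such secondary rays can be set up, the code below builds a light ray and a reflection direction. It assumes a unit-length surface normal and uses the standard mirror formula r = d − 2 (d · n) n; the helper names `light_ray` and `reflect` are hypothetical and not taken from the architecture described later.

```python
def dot(a, b):
    """Dot product of two 3D vectors given as tuples."""
    return sum(x * y for x, y in zip(a, b))

def reflect(direction, normal):
    """Mirror `direction` at a surface with unit normal `normal`."""
    k = 2.0 * dot(direction, normal)
    return tuple(d - k * n for d, n in zip(direction, normal))

def light_ray(light_pos, hit_point):
    """Ray from the light source to the hit-point.

    The direction is left unnormalized so that the hit-point is reached
    at ray parameter 1. Any object hit at a parameter strictly between
    0 and 1 therefore blocks the light.
    """
    direction = tuple(h - l for h, l in zip(hit_point, light_pos))
    return light_pos, direction
```

In practice a small tolerance around the parameter 1 is usually needed so that the surface containing the hit-point does not shadow itself.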

The shading computation in detail is outside the scope of this thesis. For further information about shading see the book “Fundamentals of Computer Graphics” [2].

A costly part of the algorithm is the ray casting operation to find the closest hit-object. To do this efficiently, a data structure which subdivides the space of the scene into subspaces is required. This allows the objects at a given location to be found efficiently. Such a data structure is called an acceleration structure, as it accelerates the ray casting operation. Many acceleration structures exist; some are recursive and others are flat data structures [28]. In the hardware architecture presented in this thesis only k-D trees are used as acceleration structure, thus the basics of k-D trees are explained in the next Section.

3.1 k-D Trees

A k-D tree is an acceleration structure that is typically used for ray tracing

to accelerate the ray casting operation. It subdivides a k-dimensional space

containing some objects recursively and axis aligned into subspaces, and

stores for each of these subspaces the contained geometry. Because ray

tracing is applied to a 3D space, only this case will be discussed here.

The scene subdivision is encoded as a binary tree, the k-D tree. Each

leaf node of this tree specifies one of the subspaces and contains a list of all

objects that lie in the subspace.

Using this recursive data structure it is possible to efficiently find the closest object hit by a ray. This is done by determining the subspaces through which the ray passes. In the order in which the ray traverses these subspaces, it is intersected with the geometry in each subspace. This walk through the subspaces is called the traversal operation, and it terminates if a hit-point in the current subspace has been found. The traversal operation is very efficient, as only the geometry in the subspaces the ray traverses needs to be used in the intersection calculation. Geometry far away from the ray will never be touched if the subdivision is fine enough.

Definition 3.1.1. A plane $h$ in $\mathbb{R}^3$ can be defined by an implicit function $H(x) = n \cdot x - d = 0$ with $n \neq 0$, $n \in \mathbb{R}^3$ and $d \in \mathbb{R}$. We define $h^+ = \{x \in \mathbb{R}^3 \mid H(x) \geq 0\}$ and $h^- = \{x \in \mathbb{R}^3 \mid H(x) < 0\}$ to be the positive and negative half-space bounded by $h$, respectively. Let $k \in \{1, 2, 3\}$ be the so-called splitting axis and $n = e_k$ be the $k$-th unit vector; then we call the plane $h$ an axis aligned splitting plane and the value $d$ the splitting position.
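For an axis aligned splitting plane the implicit function reduces to H(x) = x_k − d, so classifying a point only needs one subtraction and one comparison. The short sketch below illustrates this; the function names are hypothetical, and the axis k is 1-based as in Definition 3.1.1.

```python
def plane_value(x, k, d):
    """H(x) = e_k . x - d for the axis aligned splitting plane (k, d).

    k is 1-based (1 = x, 2 = y, 3 = z), matching the definition above.
    """
    return x[k - 1] - d

def half_space(x, k, d):
    """Return 'h+' if x lies in the positive half-space, else 'h-'."""
    return "h+" if plane_value(x, k, d) >= 0 else "h-"
```

Points lying exactly on the plane satisfy H(x) = 0 and therefore belong to the positive half-space.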

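Definition 3.1.2. A k-D tree $T$ is defined by the following grammar:

$$
\begin{aligned}
T \;=\;& \mathrm{Node}((k, d),\, T_{left},\, T_{right}) \\
\mid\;& \mathrm{Leaf}(\{Object_1, \dots, Object_n\}) \\
Object \;\subset\;& \mathbb{R}^3 \ \text{closed}
\end{aligned}
$$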


On the one hand, a node of a k-D tree can be a normal Node containing an axis aligned splitting plane $(k, d)$ and a left and right subtree ($T_{left}$ and $T_{right}$). On the other hand, it can be a Leaf node containing a set of objects. This set can be empty if the number of objects $n$ is 0. An object is a closed subset of $\mathbb{R}^3$. In practice mostly triangles or cubes will be used as objects.

The semantics of the k-D tree assigns a subspace $S(T)$ to each node $T$ of a k-D tree. The subspace of the root node is defined as $\mathbb{R}^3$. If $S(T)$ is the subspace of the node $T = \mathrm{Node}((k, d), T_{left}, T_{right})$ and $h$ the plane defined by $(k, d)$, then the subspace of the left subtree is $S(T_{left}) = S(T) \cap h^-$ and the subspace of the right subtree is $S(T_{right}) = S(T) \cap h^+$. Figure 3.2 shows this subdivision scheme of the space.

Figure 3.2: This Figure shows how the space is recursively subdivided by k-D trees. The large box is the subspace of node T and is split into two halves by the splitting plane p1 = (1, d). The normal of this splitting plane is parallel to the x-axis, and the plane passes through the point (d, 0, 0).

As the splitting planes in the nodes of the k-D tree are axis aligned, it is also called an axis aligned BSP tree (binary space partitioning tree). It is possible to use non axis aligned splitting planes too, which yields general BSP trees and more complex traversal computations. In the following only the case of axis aligned splitting planes will be considered.


3.1.1 k-D Tree Creation

The task of the k-D tree creation algorithm is to build a k-D tree for a scene consisting of several objects. Thus it has to subdivide the space of the scene recursively into subspaces. It starts with the complete space $\mathbb{R}^3$ containing all the geometry of the scene. Then an axis aligned splitting plane is selected which splits the space into the left and the right subspace according to the semantics of the k-D tree. For each of the two subspaces the objects that intersect it are computed. Note that objects can belong to both subspaces. The subspaces together with the objects intersecting them are handled recursively by the algorithm. If some termination criterion is fulfilled, the subdivision of the current subspace is terminated and a leaf node containing all objects in it is created.

This is the main concept for each k-D tree creation algorithm. Different

algorithms mostly differ only in the heuristics that are used to search the

splitting plane and in the termination criteria.

Algorithm 3.1.3 defines the createKDTree function in an abstract way. It takes a subspace S and a set O of objects and returns a k-D tree. The subspaces can be represented as simple bounding boxes (that are possibly infinite) and the sets of objects as arrays or lists. For an example of a simple k-D tree in 2 dimensions see Figure 3.3.

Figure 3.3: Figure (b) shows a k-D tree for the simple 2D scene of Figure (a). The labels on the inner nodes of the k-D tree give the splitting plane and the leaf nodes contain a list of objects.


Algorithm 3.1.3. k-D Tree Creation

function createKDTree (S, O)
begin
    if termination criterion is fulfilled then
        return Leaf(O)
    select an axis aligned splitting plane h by some criterion
    S_left  = S ∩ h−
    S_right = S ∩ h+
    O_left  = {x ∈ O | x ∩ S_left ≠ ∅}
    O_right = {x ∈ O | x ∩ S_right ≠ ∅}
    T_left  = createKDTree (S_left, O_left)
    T_right = createKDTree (S_right, O_right)
    return Node(h, T_left, T_right)
end

There are two issues we have not dealt with yet. The first one is how to select the splitting plane and the second one is what the termination criterion looks like. As a simple approach, the splitting plane can be selected such that the largest dimension of the subspace is split exactly in the middle. It can be shown that this is not very efficient, especially if the objects in the scene are not equally distributed [3]. As a termination criterion, a maximal tree depth in conjunction with a minimal number of objects in the leaves can be used, for instance.
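The creation algorithm with this simple heuristic can be sketched as follows. This is only an illustration under several assumptions that are not prescribed by the thesis: objects are axis aligned boxes given as `(lo, hi)` pairs, the recursion starts from a finite scene bounding box instead of the complete space, axes are 0-based, and the overlap test is closed, so an object touching the splitting plane is conservatively put into both subspaces. The names `create_kd_tree`, `max_depth` and `min_objects` are hypothetical.

```python
from typing import NamedTuple

class Leaf(NamedTuple):
    objects: list

class Node(NamedTuple):
    axis: int      # splitting axis k, 0-based in this sketch
    split: float   # splitting position d
    left: object
    right: object

def overlaps(obj, box):
    """Closed overlap test between two axis aligned boxes (lo, hi)."""
    return all(obj[0][i] <= box[1][i] and box[0][i] <= obj[1][i]
               for i in range(3))

def create_kd_tree(space, objects, depth=0, max_depth=8, min_objects=1):
    # Termination criterion: maximal depth or few enough objects.
    if depth >= max_depth or len(objects) <= min_objects:
        return Leaf(list(objects))
    lo, hi = space
    # Split the largest dimension of the subspace exactly in the middle.
    axis = max(range(3), key=lambda i: hi[i] - lo[i])
    split = 0.5 * (lo[axis] + hi[axis])
    left_hi = tuple(split if i == axis else hi[i] for i in range(3))
    right_lo = tuple(split if i == axis else lo[i] for i in range(3))
    s_left, s_right = (lo, left_hi), (right_lo, hi)
    o_left = [o for o in objects if overlaps(o, s_left)]
    o_right = [o for o in objects if overlaps(o, s_right)]
    return Node(axis, split,
                create_kd_tree(s_left, o_left, depth + 1, max_depth, min_objects),
                create_kd_tree(s_right, o_right, depth + 1, max_depth, min_objects))
```

An object that straddles a splitting plane ends up in leaves on both sides, which is why the depth limit is needed in addition to the object count.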

A different, more advanced approach is to search for the optimal splitting plane with respect to a cost function. Such a function was proposed by Havran [3] and can be used as a termination criterion too, by comparing the cost of a split with the cost of no split.


3.1.2 Recursive k-D Tree Traversal

The reason why we introduced k-D trees was to optimize the ray casting

operation, which means to compute the closest hit-point of a ray with the

scene. The k-D tree subdivides the scene into subspaces. Thus the sequence

of subspaces a ray traverses can be determined to intersect the ray with the

geometry stored in them. The algorithm that performs this enumeration

of the subspaces is called the k-D tree traversal algorithm. In conjunction

with an object intersection algorithm, the closest hit-point of the ray with

the scene can be computed.

Definition 3.1.4. A ray $R$ is represented by a tuple $R = (org, dir) \in (\mathbb{R}^3)^2$. The first component $org$ of the tuple is a point of $\mathbb{R}^3$ and represents the origin of the ray. The second component is a vector of $\mathbb{R}^3$ and specifies the direction of the ray. The points on the ray can be computed by $R(x) = org + x \cdot dir$ for $x \geq 0$.

Definition 3.1.5. Such a ray $R$ hits an object $obj$ if there is a $\lambda \in [0, +\infty[$ such that $R(\lambda) \in obj$. The minimal $\lambda$ with this property is called the hit-distance of the ray to the object and $R(\lambda)$ the hit-point. Because an axis aligned splitting plane is a closed subset of $\mathbb{R}^3$, we can define the terms hit-distance and hit-point the same way for rays and splitting planes.

Figure 3.4: The ray R of Figure (a) is traversed according to Figure (b) through the k-D tree.

Since a k-D tree is a recursive data structure, the k-D tree traversal

algorithm is a completely recursive algorithm as well. It works recursively

on the nodes of the k-D tree and makes a traversal decision at each node.


The traversal decision determines whether the ray traverses the subspace of

the left and/or right subtree and the order it traverses them. Using this

traversal decision the algorithm follows the ray through the k-D tree data

structure by working on the subtree that is traversed first and putting the

other one onto the stack. If a leaf node is reached the intersection algorithm

is called to intersect the ray with each object stored in the leaf node and the

closest hit-point is determined. If this hit-point lies in the subspace of the

leaf node a valid hit-point is found and the ray is called a terminated ray.

In such a case or if the stack is empty the algorithm terminates. Otherwise,

it continues by obtaining the next node from the stack.

To compute the traversal decision the algorithm needs the near and far values, which are the distances to the entry point and exit point of the ray with the subspace of the current node. Using these near and far values together with the distance d to the splitting plane of the current node, the traversal decision can be computed.

If δ is the splitting position and k the splitting axis, then the intersection

distance d of the ray R = (org, dir) to the splitting plane can be computed

according to the formula of Figure 3.5.

$$d = \frac{\delta - org_k}{dir_k}$$

Figure 3.5: Hit-Distance Computation
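The formula translates directly into code. The sketch below uses a 0-based axis index and, as an assumption of this example only, returns infinity for rays parallel to the splitting plane, where the division is undefined. The function name `plane_hit_distance` is hypothetical.

```python
import math

def plane_hit_distance(org, direction, k, delta):
    """Distance d = (delta - org_k) / dir_k to the splitting plane (k, delta).

    k is 0-based here. A negative result means the plane lies behind
    the ray origin.
    """
    if direction[k] == 0.0:
        # Assumption of this sketch: a parallel ray never reaches the plane.
        return math.inf
    return (delta - org[k]) / direction[k]
```

For example, a ray starting at the origin with direction (2, 0, 0) reaches the plane x = 4 at d = 2, since the direction is not normalized.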

To compute the traversal order the algorithm determines the half-space of the splitting plane that is closer to the origin of the ray. If $org_k \leq \delta$ this is the negative half-space (corresponding to the left subtree), otherwise the positive half-space. The closer subspace is traversed first, if the ray intersects it. The farther one follows later. In the first case the so-called traversal order is from left to right, otherwise from right to left.


Algorithm 3.1.6. k-D Tree Traversal

function traverseKDTree (R, T)
begin
    λ = ∞
    near = −∞
    far = ∞
    while true
    begin
        while T is of Node((k, split), T_left, T_right)
        begin
            d = (split − R.org_k) / R.dir_k
            if R.org_k ≤ split then
                T_near = T_left, T_far = T_right
            else
                T_far = T_left, T_near = T_right
            go_near = d ≥ near ∨ d ≤ 0
            go_far = d ≤ far ∧ d ≥ 0
            if go_near ∧ go_far then
                push far and T_far to the stack
                T = T_near, far = d
            else if go_near ∧ not go_far then
                T = T_near
            else if not go_near ∧ go_far then
                T = T_far
        end
        T is of Leaf({Object_1, . . . , Object_n})
        compute closest hit-distance λ for {Object_1, . . . , Object_n}
        if λ ≤ far then return λ
        if stack is empty then return λ
        near = far
        pop far and T from stack
    end
end


Figure 3.6: Traversal Decisions

Whether the ray really traverses through the nearer and/or farther side

is computed by the following formulas, which are illustrated in Figure 3.6.

go_near = d ≥ near ∨ d ≤ 0

go_far = d ≤ far ∧ d ≥ 0

One important invariant of the algorithm is that the near and far-

value is exactly the distance to the entry and exit point of the ray with

the subspace of the current node. This property is essential and has to be

maintained through the complete algorithm. Thus the near and far-values

have to be updated at each traversal step of the algorithm. If only one of

the subtrees is traversed by the ray, then the near and far values stay the

same (see Figure 3.6), but if both children have to be traversed, the near

and far values need to be updated. As the algorithm first traverses into the

closer child node the near value can be maintained but the far value has

to be set to the hit distance d. To restore the far-value later, it is pushed

onto the far-stack and the farther node onto the node-stack. If later a leaf

node is reached and no hit has been found in it a node is popped from the

node-stack and the near and far values are updated by setting near = far

and taking the far value from the far-stack as the new far value.

Using the near and far value it is possible to determine whether there

is a valid hit-point which is necessary to terminate the ray. A valid hit-point

is found if a leaf is encountered and the hit-distance to the current closest

hit-point is smaller than the current far-value, since then the found hit-

point lies in (or before) the leaf node’s subspace. Alternatively the ray can

be terminated at the next traversal step by testing if the closest hit-distance

is smaller than the current near-value.

Figure 3.6 shows the most important situations that occur in the traver-

sal algorithm. Besides these cases there are some degenerate ones that have

to be handled carefully. These cases occur if the ray does not have a

well-defined single hit-point with the splitting plane. If so the hit-distance

cannot be computed and the traversal decision formulas cannot be applied.

This can happen if the ray is parallel to the splitting plane or if it lies com-

pletely in it. The later hardware approach solves this problem by using

a normalized floating point representation that cannot represent the value

zero. Thus each ray has a hit-point with each possible splitting plane.
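As an illustration, Algorithm 3.1.6 can be sketched in Python as follows. This is a minimal sketch and not the implementation used in this thesis: the tuple encoding of nodes and leaves, the `intersect` callback and the `plane_hit` example object are assumptions made only for this example. All direction components are assumed to be non-zero, so the degenerate cases described above do not arise.

```python
import math

def traverse_kd_tree(ray, tree, intersect):
    """Return the closest hit-distance of ray in the k-D tree (inf if none).

    Hypothetical encoding: inner node = ("node", k, split, left, right),
    leaf = ("leaf", [objects]); intersect(ray, obj) returns a hit-distance
    or math.inf.
    """
    org, direction = ray
    lam, near, far = math.inf, -math.inf, math.inf
    stack = []
    node = tree
    while True:
        while node[0] == "node":
            _, k, split, left, right = node
            d = (split - org[k]) / direction[k]      # distance to splitting plane
            if org[k] <= split:
                t_near, t_far = left, right          # left-to-right order
            else:
                t_near, t_far = right, left          # right-to-left order
            go_near = d >= near or d <= 0
            go_far = d <= far and d >= 0
            if go_near and go_far:
                stack.append((far, t_far))           # remember far child and far-value
                node, far = t_near, d
            elif go_near:
                node = t_near
            else:
                node = t_far
        for obj in node[1]:                          # leaf: intersect all objects
            lam = min(lam, intersect(ray, obj))
        if lam <= far or not stack:                  # valid hit-point or nothing left
            return lam
        near = far
        far, node = stack.pop()

def plane_hit(ray, plane):
    """Example object: an infinite axis-aligned plane (axis, position)."""
    org, direction = ray
    axis, pos = plane
    t = (pos - org[axis]) / direction[axis]
    return t if t > 0 else math.inf
```

The stack stores the far-value together with the farther child, exactly as described for the far-stack and node-stack above; a single Python list is used for both only to keep the sketch short.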

3.1.3 Packet k-D Tree Traversal

A drawback of ray tracing is the large memory bandwidth that is needed for

the computation. Reducing this bandwidth is possible by exploiting the ray

coherence between rays corresponding to neighboring pixels on the screen.

This coherence derives from the fact that rays traversing through a similar

region of the 3D space, traverse similar nodes of the k-D tree and intersect

many of the same objects.

It is possible to take advantage of this ray coherence by traversing a

packet of some neighboring rays in parallel as if they were one single ray.

This strategy reduces the required memory bandwidth, as data is fetched for

a complete packet of rays instead of a single ray. Furthermore when imple-

menting such a packet traversal algorithm in software, SIMD architectures

available in today's standard PCs can be taken advantage of. Because these

SIMD architectures allow 4 computations to be done in parallel, packets of

4 rays can be handled efficiently using these special instructions.

The packet traversal algorithm is closely related to the standard traversal

algorithm, but instead of computing a traversal decision for a single ray it

computes a similar packet traversal decision for a packet of rays. In the


computation of this packet traversal decision, only so called active rays of

the packet are involved. A ray of a packet is active in the current node if it

is not terminated and if it intersects with the subspace of the node. Because

this active value is required for each ray in the packet an active vector for

the packet is needed. Although this active vector needs to be recomputed

at each traversal step this is quite simple since a ray is active in the left

child of a node if it is active in the current node and if it wants to traverse

through the left child. The same holds for the right child.

If one of the active rays of the packet wants to traverse through the left

child, then the packet traverses through the left child as well. The same

holds for the right child. The traversal order for the packet is inherited from

the active rays of the packet that traverse through both children, if it is the

same for each of these rays. The packet is terminated if each of its rays is

terminated.

If a pop operation is done, the active vector has to be updated, and

therefore needs to be pushed onto the stack together with a node. A further

situation that might occur is that a node is reached and ray R1 traverses

through both children and R2 through the farther child only. Here, the

farther node is pushed onto the stack and R1 traverses through the nearer

child. Later a pop operation obtains the farther node from the stack, and

both rays are active in this node. However, ray R1 needs to update its

near and far values, as it traversed the nearer and the farther child, unlike R2.

Thus a kind of both-vector also needs to be pushed onto the stack, indicating

for each ray whether it wants to traverse through both children, in order to update the near and far

values correctly.
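The packet traversal decision described above can be sketched as follows. This is only an illustrative sketch under assumptions: the function name, the per-ray lists `near` and `far`, and the string encoding of the traversal order are hypothetical, and direction components are assumed to be non-zero. The function returns the active vectors for the left and right child and the packet traversal order, which is `None` if the rays that traverse both children disagree on the order.

```python
def packet_traversal_decision(rays, active, near, far, k, split):
    """Compute active vectors for both children and the packet order.

    rays   : list of (org, dir) tuples
    active : list of bools, the active vector in the current node
    near, far : per-ray near and far values
    """
    left_active, right_active = [], []
    orders = set()
    for (org, direction), act, n, f in zip(rays, active, near, far):
        d = (split - org[k]) / direction[k]
        go_near = d >= n or d <= 0
        go_far = d <= f and d >= 0
        left_first = org[k] <= split
        go_left = go_near if left_first else go_far
        go_right = go_far if left_first else go_near
        # A ray is active in a child if it is active here and wants to go there.
        left_active.append(act and go_left)
        right_active.append(act and go_right)
        # Only active rays traversing both children determine the order.
        if act and go_left and go_right:
            orders.add("left-to-right" if left_first else "right-to-left")
    if len(orders) > 1:
        order = None            # no common order for the packet
    elif orders:
        order = orders.pop()
    else:
        order = "any"           # no ray constrains the order
    return left_active, right_active, order
```

The packet itself traverses a child if any entry of the corresponding active vector is set.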

Figure 3.7: The packet is traversed from left to right, as the rays R1 and R2 traverse from left to right. Thus the right node is pushed onto the stack and the operation continues in the left child. The rays R1 and R2 are active in both children, but R3 only in the right one.

A problem occurs if the traversal order is not the same for each active

ray of the packet that wants to traverse both children. Such a packet is

called an invalid packet. It is invalid since no valid packet traversal decision

can be computed. No matter which child is handled first there is always

a ray in the packet that wants to handle the other one first. If the packet

terminates in the first traversed child, a possible closer hit-point in the other

child is forgotten (see Figure 3.8).

In practice this case occurs very rarely and it can be shown that this

does not happen if there are no two rays of the packet that cross in at least

one of the 3 projections onto the xy-, yz- or xz-plane. This never occurs for

primary rays and light rays, since rays with the same origin never cross.

Therefore the algorithm can handle these types of packets correctly.

Figure 3.8: The Figure shows a situation in which no packet traversal decision exists. No matter which child is handled first, either ray R1 or ray R2 is not intersected with triangle tri3.

If no such packet traversal decision exists this situation can be handled as

a kind of special case. If a node for which no packet traversal decision exists

is reached, the left child is traversed first. The right child is remembered

and traversed later by treating it as a special case.

A different possibility is to split the packet before the traversal into sub

packets, in which the rays do not cross as explained above. To split the

packet this way, only the signs of the three components of the ray directions

must be compared. If there are two rays whose direction sign is different

in one dimension then the rays cross and have to be put in different sub

packets.
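Splitting a packet by direction signs can be sketched as follows. This is a minimal sketch under assumptions: the function name is hypothetical, rays are (org, dir) tuples, and direction components are assumed to be non-zero so that every component has a well-defined sign.

```python
from collections import defaultdict

def split_packet_by_direction_signs(rays):
    """Group rays into sub-packets whose directions agree in all three signs."""
    groups = defaultdict(list)
    for org, direction in rays:
        # The key is the sign pattern of the three direction components.
        key = tuple(c >= 0.0 for c in direction)
        groups[key].append((org, direction))
    return list(groups.values())
```

At most 8 sub-packets can result, one per sign pattern.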

One of these solutions needs only to be applied if a shader produces

invalid packets. This for instance can happen if a packet is reflected by a

curved surface. However, if only primary rays and light rays are allowed,


the problem never can occur. In the hardware architecture to be described

later only primary rays are used and the problem of crossing rays can be

safely ignored.

Chapter 4

The Dynamic Ray Tracing

Algorithm

In this Chapter a ray tracing algorithm for dynamic scenes is presented that

allows the movement of a huge number of triangles in the scene.

At first glance the efficient ray tracing of dynamic scenes does not

seem to be possible since fast ray tracing relies so much on precomputations.

In particular, the precomputed acceleration structure is a problem since it

has to be rebuilt or updated if the geometry of the scene has changed.

For a dynamic real-time ray tracing system this update must work even

if the complete scene consists of several million triangles. Here standard

acceleration structures cannot be used since the construction of a k-D tree

for instance takes at least linear time in the number of triangles in the scene (each

triangle has to be visited at least once). It is possible to build acceleration

structures that allow updating the position of triangles in constant time [27],

but several million triangles cannot be moved around this way.

There exists a simple solution to this problem if the scene is restricted to

some kind of structured motion [26]. The case of unstructured motion, in

which triangles are moved around arbitrarily, is not covered in this thesis. In

contrast, structured motion means that groups of triangles are

moved around together as one single object. For instance, in a scene

consisting of a table and a chair, normally all triangles in the chair or table

are moved around at once.

For such structured motion, the structure of the motion can be exploited

by packing the triangles into movable objects. These objects internally stay

static, thus a local bottom-level acceleration structure and a local bounding

volume can be precomputed for them. The local bounding volume contains

all the geometry of the object.

The object can be positioned, rotated and scaled in the scene by an

affine transformation. Such a positioned object is called an object instance

and consists of the affine transformation used and a reference to the object.

This concept of having some objects and one or more object instances to

each object leads to a kind of geometry sharing, as an object needs to be

saved only once.

To traverse rays efficiently through the object instances a dynamic top-

level acceleration structure must be built over them. Only this top-level

acceleration structure needs to be updated, if the position of an object in-

stance has changed. This is possible as long as the number of objects in the

scene stays small.

As there is a dynamic top-level acceleration structure over the object

instances and a bottom-level acceleration structure in each object, this is a

kind of 2-level acceleration structure (see Figure 4.1).

In the example of the chair and table, two objects have to be modeled:

one chair and one table. These two objects inside stay static over time but

they can be instantiated at several positions in the scene. Thus the top-

level acceleration structure is quite simple (it consists of few objects) but

the objects themselves can be fairly complex.

Figure 4.1: The Figure shows a dynamic top-level acceleration structure over 4 object instances i1, ..., i4 of 3 objects o1, o2, o3. The objects consist of their static bottom-level acceleration structure.

The traversal algorithm for 2-level acceleration structures first traverses

through the top-level acceleration structure until an object instance needs to

be intersected. This is done by transforming the ray to the local coordinate

system of the object and traversing through the local acceleration structure

to find the hit-triangle in the object. The transformation of the ray to the

local coordinate system is necessary as the acceleration structure of the ob-

ject is only valid in the coordinate system in which it has been created. Thus

the positioning of the object instance needs to be reversed by transforming

the ray. Thus the inverse of the transformation that was used to position

the object is required to transform the ray to the local coordinate system of

the object.

An important property of the concept is that the internal geometry of the

object is hidden from the rest of the world. Thus from outside the object’s

geometry is only represented by its local bounding volume, which needs

to be as accurate as possible to avoid unnecessary ray object intersections,

which are normally very costly.

Figure 4.2: Figure (a) shows a simple scene consisting of two instances of the same object. The drawn ray hits the left chair, thus it is transformed to its local coordinate system, as can be seen in Figure (b). There the splitting planes are again axis aligned so that the traversal can be continued in the object.

The concepts of the dynamic ray tracing algorithm do not depend on

a special acceleration structure or kind of local bounding volume, but in the

following only k-D trees and axis aligned bounding volumes will be used.

Furthermore no update strategy for the top-level k-D tree will be used; it is

simply rebuilt each time the object positions have changed.

The following Sections describe some details of the dynamic ray tracing

algorithm. Some special properties of the top-level k-D tree creation will be

discussed as well as problems that might occur using local bounding volumes.

As affine transformations are used to position objects in the scene, the way


a ray is transformed under an affine transformation needs to be analysed.

Furthermore we show that the hit-distance is maintained under an arbitrary

affine transformation which dramatically simplifies an implementation of the

algorithm. As most shading models need the normal of the geometry in the

world coordinate system, normal transformation is also discussed.

4.1 Top-Level k-D Tree Creation

The basic k-D tree creation algorithm has been described in Section 3.1.1.

This algorithm can be applied the same way to compute a top-level k-D

tree for a set of object instances by using the transformed local bounding

volume of the object instances as their simplified geometry. Because this

transformed bounding volume is no longer axis aligned determining if it

intersects with a subspace or not is very costly to compute.

As it is mostly required to rebuild the top-level acceleration structure

for each frame, some optimization needs to be done to speed up the top-

level k-D tree construction. This is done by computing the smallest axis

aligned bounding box that encloses the transformed bounding volume. This

is called the instance bounding volume and is used as the geometry of the

object instance in the k-D tree creation algorithm (see Figure 4.3). To

compute the intersection of the axis aligned instance bounding volume and

the subspace (which can be represented as a possibly infinite axis aligned

bounding box also) is trivial.
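The computation of the instance bounding volume can be sketched as follows: the 8 corners of the local bounding box are transformed and the component-wise minimum and maximum are taken. This is a minimal sketch; the function name and the representation of the affine transformation as a 3x3 row-tuple matrix `A` and a vector `B` are assumptions made for this example.

```python
from itertools import product

def instance_bounding_box(box_min, box_max, A, B):
    """Smallest axis aligned box enclosing the transformed local bounding box."""
    # zip pairs (min, max) per axis; product enumerates the 8 corners.
    corners = product(*zip(box_min, box_max))
    transformed = [
        tuple(sum(A[i][j] * c[j] for j in range(3)) + B[i] for i in range(3))
        for c in corners
    ]
    lo = tuple(min(p[i] for p in transformed) for i in range(3))
    hi = tuple(max(p[i] for p in transformed) for i in range(3))
    return lo, hi
```

Only the 8 corners are touched, so the cost per object instance is constant and independent of the number of triangles in the object.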

Figure 4.3: Figure (a) shows an object with its bounding box. In Figure (b) this object is instantiated using a rotation. The estimated bounding box for the object instance is drawn dotted. Figure (c) shows the best possible bounding box estimation for the object instance if the exact geometry of the object is used in the estimation.

This simplification has some disadvantages since the axis aligned instance

bounding volume is not optimal (see Figure 4.3). Although a best estimation

for the axis aligned instance bounding volume exists, it is not a good idea

to compute it, because then the internal structure of the object would have

to be involved in the computation, which might be too costly.

What can be done is to search for a better representation of the local

bounding volume of an object. Instead of an axis aligned box an ellipsoid

can be used which often is a better approximation. Such an elliptic bounding

volume of an object can be computed in O(n) [29]. A different optimization

would be to rotate the object in such a way that its initial bounding box

fits as well as possible. A situation like in the left most image of Figure 4.3

is in fact the worst case.

4.2 Bounding Box Clipping

Intersections with object instances are mostly very expensive, as this re-

quires one ray transformation and some traversal steps in the object. One

possibility to avoid and optimize ray object intersections is to perform a kind

of bounding box clipping on the instance bounding volume in the top-level

k-D tree and on the local bounding volume of the object at the beginning

of the bottom-level k-D tree.

Figure 4.4: Figure (a) shows a 2 dimensional rectangle with its clipping planes. The corresponding clipping tree is shown in Figure (b). Figure (c) shows the clipping tree for a box in 3 dimensions.

Using traversal steps this bounding box clipping has the task of determin-

ing if the ray intersects with an axis aligned bounding box or not. This can


be done by using 6 clipping planes that exactly correspond to the bounding

planes of the axis aligned bounding box (see Figure 4.4).
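The effect of clipping against the 6 bounding planes can be sketched as follows. Note that this sketch expresses the clipping directly as a shrinking of the near and far interval, not as traversal steps through a clipping tree as in Figure 4.4; the function name is hypothetical and all direction components are assumed to be non-zero.

```python
import math

def clip_ray_to_box(org, direction, box_min, box_max, near=0.0, far=math.inf):
    """Clip the interval [near, far] of a ray against an axis aligned box.

    Returns the clipped (near, far) interval, or None if the ray misses.
    """
    for k in range(3):
        # Hit-distances to the two clipping planes of axis k.
        t0 = (box_min[k] - org[k]) / direction[k]
        t1 = (box_max[k] - org[k]) / direction[k]
        if t0 > t1:
            t0, t1 = t1, t0
        near = max(near, t0)
        far = min(far, t1)
        if near > far:
            return None      # interval is empty: the box is not hit
    return near, far
```

The returned interval can serve as the initial near and far values for the traversal inside the box.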

The bounding box clipping to the instance bounding volume in the top-

level k-D tree guarantees that the bounding box of the object’s instance is

really hit if a leaf node containing this object is encountered.

The bounding box clipping at the beginning of the bottom-level traversal

is useful too, as the local bounding box available there is much more accu-

rate than the bounding box of the instance. Furthermore this bottom-level

bounding box clipping should be performed since many unnecessary traver-

sal steps can be avoided. This is due to the fact that otherwise the infinitely

large empty space around the object is not handled optimally as the clip-

ping planes at the border of the object reach to infinity. This causes many

traversal steps if the ray does not hit the object and traverses to infinity

(see Figure 4.5).

Figure 4.5: Figure (a) shows a chair without bounding box clipping, whose clipping planes reach to infinity. Here the drawn ray would traverse through many subspaces of the acceleration structure. In Figure (b) some bold extra clipping planes clip against the bounding box of the chair. Here the ray traverses only 2 subspaces of the acceleration structure.

For the same reason it is better to perform a kind of scene bounding

clipping at the beginning of the top-level acceleration structure; otherwise

ray losses (that is, rays that traverse to infinity and produce no hit) will

be costly. Only if no ray losses can occur in the scene should this scene bounding

clipping be omitted.

4.3 Overlapping Objects

Overlapping objects play a crucial role in 2-level k-D trees since in the

overlapping area each of the objects needs to be intersected. Consider a

scene consisting of n objects that overlap completely. A ray that intersects

this region in space needs to traverse through each of the n objects. For

such worst case scenes, the dynamic ray tracing algorithm scales linearly in

the number of objects. Thus overlapping of objects should be avoided

wherever possible when modelling a scene.

If two objects overlap only slightly it is usually best to partition the scene

in such a way that the area filled by both objects is separated by the clipping

planes (see Figure 4.6). Thus only in the overlapping area both objects need

to be intersected. The overlapping area cannot be handled more efficiently

since each of both objects could generate the closer hit-point, which is not

known in advance.

Figure 4.6: Figure (a) shows two object instances that overlap a bit. By the clipping planes h1, ..., h4, the overlapping area is separated. The corresponding k-D tree is shown in Figure (b).

Much more critical is the case where many objects lie inside

another object, as in Figure 4.7. If the standard algorithm to create an

acceleration structure is used, then the large object o1 (which contains the

other ones) is in each leaf node of the tree. This is a problem as during

traversal each time a leaf node is encountered the algorithm intersects with

object o1, but one intersection with it would be sufficient. This problem is

called the room problem, as it typically occurs, if a room is modeled with


some objects inside.

The resulting k-D tree for such a scene can be seen as a degenerate case

of the space subdivision, because after each subdivision the object o1 is in

both subspaces.

Figure 4.7: Figure (a) shows a large object o1 containing 3 other objects. The corresponding k-D tree in Figure (b) has object o1 in each leaf node.

4.3.1 Hierarchical k-D Trees

Several possible solutions to the room problem exist. One possibility is to

allow objects to be in k-D tree nodes too and not only in the leaves. This

concept is called hierarchical k-D trees as the hierarchy of the objects is

encoded to the k-D tree.

Figure 4.8: Figure (a) shows the same scene as in Figure 4.7. The corresponding hierarchical k-D tree can be seen in Figure (b). The difference to a normal k-D tree is that object o1 is in the inner object list of the root node of the hierarchical k-D tree.

Figure 4.8 shows a hierarchical k-D tree. Each node of it has a set of

so called inner objects which are intersected if this node is handled during

traversal. The structure of the k-D tree in Figure 4.8 forces the traversal

algorithm to intersect object o1 exactly one time.

Note that the size of the hierarchical k-D tree is reduced compared to

the last version, as the leaves are smaller. This is a principal property of the

concept, as each time an object is in all or almost all leaf nodes reachable

from a node N, it is better to put the object in the inner object list

of node N , which reduces the size of the tree.

4.3.2 Mailboxing

A different solution to the problem is known as mailboxing which is a kind

of object intersection cache. In a small cache the objects the ray has been

intersected with are saved. If an object needs to be intersected by the

traversal algorithm, the mailbox system looks up the cache. If the ray has

already been intersected with this object no further intersection is done.

Otherwise the object is intersected and added to the cache.

There are several possible strategies to handle the cache. The most

popular is to save the last n objects intersected with. Another would be to

use a hashing function to map the objects to slots.
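The strategy of saving the last n objects can be sketched as follows. This is an illustrative software sketch only, not the hardware mailbox of the later architecture; the class and method names are hypothetical, and one mailbox per ray is assumed.

```python
from collections import deque

class Mailbox:
    """Small cache of the last `size` objects a ray has been intersected with."""

    def __init__(self, size=8):
        # A deque with maxlen drops the oldest entry when a new one is appended.
        self.slots = deque(maxlen=size)

    def should_intersect(self, obj_id):
        """Return False if obj_id is cached; otherwise cache it and return True."""
        if obj_id in self.slots:
            return False
        self.slots.append(obj_id)
        return True
```

A traversal loop would call `should_intersect` before each object intersection and skip the object if it returns False. Since the cache is small, an object that has been evicted may be intersected again, which costs time but does not affect correctness.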

The mailboxing approach has been shown to be more efficient than using

hierarchical k-D trees. The reason is that hierarchical k-D trees alone only

solve the special room problem. However there are many more situations

where an object is intersected more than once since each object is mostly in

several leaf nodes.

4.3.3 Multiple Scenes

In some cases it is sufficient to use a much simpler solution to the problem.

Imagine a level of a typical shooting game, where there is usually a large main

scene, modelled as a single object, and perhaps some dynamic objects. The

main scene object is an object containing a lot of other objects, which is a

problem, as described earlier. Instead of putting the main scene object into

the root node of a k-D tree (as the hierarchical k-D tree concept would

do), first the main scene is traversed and then the other geometry. If there

is only one large object containing many other ones, this concept is nearly


equivalent to the hierarchical k-D tree concept but simpler.

In the following table some simulation results of the conference scene at

a resolution of 1024x768 are listed. The first line shows a simulation without

any of the optimizations followed by hierarchical k-D trees. Mailboxing is

simulated such that the last 8 objects are saved, and in the last simulation

the main scene object (room) of the conference scene is traversed before

the objects (chairs). The number of traversal operations (Trav-Ops), object

intersection operations (Obj-Int-Ops) and triangle intersections (Triangle-

Int-Ops) can be seen.

Optimization            Trav-Ops   Obj-Int-Ops   Triangle-Int-Ops

None                       295.4           6.4               57.6
Hierarchical k-D tree       70.3           2.0               10.9
Mailboxing                  63.2           1.6               10.3
Multiple Scenes             71.5           2.1               11.3

Table 4.1: Millions of operations for various strategies

It can be seen that mailboxing is the best of the three optimizations,

thus the later hardware architecture implements this strategy.

4.4 Ray Transformation

If an object instance is hit during traversal, the ray is first transformed into

the local coordinate system of that object. This is done by applying an

affine transformation to the ray. In this Section we show how a ray has to

be transformed using such an affine transformation.

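The affine transformation is given by $f(v) = Av + B$ with $A \in \mathrm{Mat}_{\mathbb{R}}(3 \times 3)$ and $B \in \mathbb{R}^3$ and maps points of $\mathbb{R}^3$ to points of $\mathbb{R}^3$. The ray is given

by a tuple $R = (\mathit{org}, \mathit{dir}) \in (\mathbb{R}^3)^2$. The origin of the ray can easily be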

transformed by plugging it into v, as it is a point. The direction of the ray

represents a vector not a point, thus it has to be transformed in a different

way. As vectors represent directions, this property has to be maintained by

the transformation. Assume there are two points X and Y given, then there

is a vector V = Y − X connecting X to Y . The transformed vector f(V )

has to fulfill the equation:

f(V ) = f(Y ) − f(X) = A Y + B − (A X + B) = A(Y − X) = A V

Thus the transformation of a complete ray looks like:

f(R) = (f(org), f(dir))

= (A · org + B, A · dir)

4.5 Hit-Distance Transformation

Some hit-point information needs to be computed during the traversal al-

gorithm: the hit-point with the splitting plane and the hit-point with the

scene.

One possibility would be to save the hit-point as a real point of $\mathbb{R}^3$ but

this has the disadvantage that it has to be transformed back to the world

coordinate system if the hit-point lies in an instantiated object. A much

better way is to store a hit-point with a ray R = (org, dir) indirectly as

a λ-value or hit-distance such that the real hit-point $H \in \mathbb{R}^3$ fulfills the

following equation:

H = R(λ) = org + λ · dir

On the one hand this hit-distance can be used to compute a traversal

decision (see Section 3.1.2) and on the other hand no back transformation

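of the hit-distance is required, which the following equations show. Let f

be an affine transformation f(x) = A · x + B. Then the following holds:

$$\begin{aligned}
f(H) &= f(\mathit{org} + \lambda \cdot \mathit{dir}) \\
&= A \cdot (\mathit{org} + \lambda \cdot \mathit{dir}) + B \\
&= (A \cdot \mathit{org} + B) + \lambda \cdot A \cdot \mathit{dir} \\
&= f(\mathit{org}) + \lambda \cdot f(\mathit{dir}) \\
\Longrightarrow \quad \mathit{org} + \lambda \cdot \mathit{dir} &= f^{-1}(f(\mathit{org}) + \lambda \cdot f(\mathit{dir}))
\end{aligned}$$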

This means that the same λ-value can be used to represent the hit-


point in both coordinate systems. Computing the hit-point in the object

and transforming it back to the world coordinate system is the same as

using the same λ to compute the hit-point in the world-coordinate system.

Thus the value λ is in some sense invariant under the application of affine

transformations.
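The ray transformation of Section 4.4 and the invariance of λ can be checked numerically with a small sketch. The helper names and the representation of A as a tuple of rows are assumptions made for this example; the matrix used in the check is an arbitrary scaling with shear, not taken from any scene in this thesis.

```python
def mat_vec(A, v):
    """Multiply the 3x3 matrix A (tuple of rows) with the vector v."""
    return tuple(sum(A[i][j] * v[j] for j in range(3)) for i in range(3))

def transform_point(A, B, p):
    """Points are transformed by f(p) = A p + B."""
    Ap = mat_vec(A, p)
    return tuple(Ap[i] + B[i] for i in range(3))

def transform_ray(A, B, ray):
    """f(R) = (A org + B, A dir): the direction is a vector, so B is not added."""
    org, direction = ray
    return transform_point(A, B, org), mat_vec(A, direction)

def point_on_ray(ray, lam):
    """R(lam) = org + lam * dir."""
    org, direction = ray
    return tuple(org[i] + lam * direction[i] for i in range(3))

# Check: transforming the hit-point equals evaluating the transformed ray
# at the same lambda.
A = ((2.0, 0.0, 0.0), (0.0, 1.0, 1.0), (0.0, 0.0, 3.0))
B = (1.0, -2.0, 0.5)
ray = ((1.0, 2.0, 3.0), (0.5, -1.0, 2.0))
lhs = transform_point(A, B, point_on_ray(ray, 4.0))
rhs = point_on_ray(transform_ray(A, B, ray), 4.0)
assert all(abs(x - y) < 1e-9 for x, y in zip(lhs, rhs))
```

Note that the transformed direction is generally not of unit length; this is exactly what allows the same λ to be used in both coordinate systems.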

With this background it can be explained why only affine transformations

are used in the 2-level k-D tree algorithm. The relevant point is that when

intersecting with an object instance, it is not the object in the instance that is

transformed, but the ray itself. If the object's geometry were transformed

(which is too costly), it would be possible to use an arbitrary transformation.

But transforming the ray has to result in a ray again. Affine transformations

fulfill this property and map rays to rays as the above equations show.

4.6 Normal Transformation

Most shading models (like Phong shading for example) need the normal of

the geometry at the hit-point to approximate the surface lighting behavior.

However, this normal is needed in the world coordinate system, but normals

are present only in the local coordinate system of the hit-object. Therefore,

the shader has to transform these normals back to the world coordinate

system using the inverse of the transformation that was used to position

the object. Thus we need to analyse how a normal is transformed under

an arbitrary affine transformation f(x) = Ax + B. Like for vectors, this

transformation has to be applied in a special way to preserve the normal

property. It is trivial to see that affine transformations map tangents to

tangents. This fact will be used to derive a transformation for normals.

Figure 4.9: Figure (a) shows a box and the normal n of the right side. If the box is transformed under an affine transformation like in Figure (b), the correctly transformed normal n_f is different from An.

The tangent in the source space is called t and the transformed one

in the destination space tf , which is equal to A t as tangents are vectors.

Analogously the normal in the source space is called n and the searched one

in the destination space nf . As n is a normal, nf is not the same as A n as

seen in Figure 4.9. The following shows that a matrix A′ can be found such

that $n_f = A' n$. The vectors n and t are perpendicular, which means that

the scalar product is zero: $n^T t = 0$. Doing some transformations yields:

$$n^T t = n^T A^{-1} A\, t = (n^T A^{-1})(A\, t) = \left((n^T A^{-1})^T\right)^T t_f = \left((A^{-1})^T n\right)^T t_f = 0$$

This equation shows that $(A^{-1})^T n$ is a vector that is perpendicular to

$t_f$, thus it has to be the searched normal. The transformation matrix $A'$ is

given by $A' = (A^{-1})^T$ and the complete mapping of a normal looks like:

$$n_f = (A^{-1})^T n$$

Even if the normal n was normalized, nf is usually not normalized.
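The normal transformation can be sketched in a few lines. The sketch uses the fact that the inverse transpose of a 3x3 matrix equals its cofactor matrix divided by its determinant, and that the rows of the cofactor matrix are cross products of the rows of A. The function names are assumptions for this example, and A is assumed to be invertible.

```python
def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def inverse_transpose(A):
    """(A^-1)^T of a 3x3 matrix given as a tuple of rows."""
    r0, r1, r2 = A
    cof = (cross(r1, r2), cross(r2, r0), cross(r0, r1))   # cofactor matrix
    det = dot(r0, cof[0])
    return tuple(tuple(c / det for c in row) for row in cof)

def transform_normal(A, n):
    """n_f = (A^-1)^T n; the result is in general not normalized."""
    return tuple(dot(row, n) for row in inverse_transpose(A))
```

For the shear A = ((1, 1, 0), (0, 1, 0), (0, 0, 1)) and the normal n = (1, 0, 0) with tangent t = (0, 1, 0), the transformed tangent is A t = (1, 1, 0). The sketch yields n_f = (1, -1, 0), which is perpendicular to A t, whereas A n = (1, 0, 0) is not.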

Chapter 5

Triangle Intersection

In order to decrease the required floating point resources of the hardware

architecture described later in this thesis, I developed a special triangle

intersection method that is based on affine ray transformations. Because

such affine ray transformations are necessary in the dynamic ray tracing

algorithm, using this intersection method makes it possible to save a lot

of hardware resources by sharing one transformation unit for two purposes.

The so called unit triangle intersection method consists of two stages.

First the ray is transformed, using a triangle specific affine triangle trans-

formation, to a coordinate system, in which the triangle looks like the unit

triangle ∆unit with the edge points (1, 0, 0), (0, 1, 0) and (0, 0, 0). In the sec-

ond stage, a simple intersection test of the transformed ray with the unit

triangle is done.
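The second stage can be sketched as follows. The details of the test are not given at this point of the text, so the sketch is an assumption derived only from the definition of the unit triangle: it lies in the plane z = 0 and covers the points with x ≥ 0, y ≥ 0 and x + y ≤ 1. The function name is hypothetical, and the ray is assumed to be already transformed.

```python
def intersect_unit_triangle(org, direction):
    """Intersect an already transformed ray with the unit triangle.

    Returns (t, u, v) with hit-distance t and the coordinates (u, v) of the
    hit-point in the plane z = 0, or None if there is no hit.
    """
    if direction[2] == 0.0:
        return None                      # ray parallel to the triangle plane
    t = -org[2] / direction[2]           # hit-distance to the plane z = 0
    u = org[0] + t * direction[0]
    v = org[1] + t * direction[1]
    if t >= 0.0 and u >= 0.0 and v >= 0.0 and u + v <= 1.0:
        return t, u, v
    return None
```

Because the hit-distance is invariant under affine transformations (Section 4.5), the value t found in the unit triangle space is also the hit-distance of the original ray with the original triangle.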

Figure 5.1: Unit Triangle Intersection

5.1 Affine Triangle Transformation

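The affine triangle transformation to a triangle $\Delta = (a, b, c)$ is an affine transformation $T_\Delta(x) = m \cdot x + n$ with $m \in \mathrm{Mat}_{\mathbb{R}}(3 \times 3)$ and $n \in \mathbb{R}^3$ that maps the triangle $\Delta$ to the unit triangle $\Delta_{unit}$. The inverse $T_\Delta^{-1}(x) = m' \cdot x + n'$ of $T_\Delta$ can easily be described by the following equations:

$$T_\Delta^{-1}\begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} = a \qquad T_\Delta^{-1}\begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} = b \qquad T_\Delta^{-1}\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix} = c$$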

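These equations map the edge points of the unit triangle to the edge points of the triangle. If $q \in \mathbb{R}^3$ is an arbitrary vector, then the solution $T_\Delta^{-1}$ of the equations takes the form:

$$m' = \begin{pmatrix} a_x - c_x & b_x - c_x & q_x \\ a_y - c_y & b_y - c_y & q_y \\ a_z - c_z & b_z - c_z & q_z \end{pmatrix} \qquad n' = \begin{pmatrix} c_x \\ c_y \\ c_z \end{pmatrix}$$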

Unfortunately the vector q is undetermined, but there are two useful

ways to choose q. The first minimizes the memory needed

to store a triangle matrix and the second allows some dot product

computations to be done for free.

5.1.1 Memory Efficient Triangle Transformation

The representation of the triangle transformation can be minimized by choosing q in such a way that the triangle transformation matrix m of $T_\Delta$ has the first column equal to $(1, 1, 1)^T$, which can be achieved by setting $q = -(a - c) - (b - c) + (1, 0, 0)^T$. Here it is not necessary to save the first column of the matrix.

$$m' = \begin{pmatrix} a_x - c_x & b_x - c_x & -(a_x - c_x) - (b_x - c_x) + 1 \\ a_y - c_y & b_y - c_y & -(a_y - c_y) - (b_y - c_y) + 0 \\ a_z - c_z & b_z - c_z & -(a_z - c_z) - (b_z - c_z) + 0 \end{pmatrix} \qquad n' = \begin{pmatrix} c_x \\ c_y \\ c_z \end{pmatrix}$$

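It needs to be shown that the inverse $T_\Delta$ of $T_\Delta^{-1}$ is of the form:

$$m = \begin{pmatrix} 1 & \beta_x & \gamma_x \\ 1 & \beta_y & \gamma_y \\ 1 & \beta_z & \gamma_z \end{pmatrix}, \qquad n = \begin{pmatrix} \delta_x \\ \delta_y \\ \delta_z \end{pmatrix}$$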

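Using properties of affine transformations, it can be shown that
$n = -m'^{-1} \cdot n'$. Thus it is equivalent to prove:

$$T_\Delta \cdot \begin{pmatrix}1\\0\\0\end{pmatrix} = \begin{pmatrix}1\\1\\1\end{pmatrix} + n = \begin{pmatrix}1\\1\\1\end{pmatrix} - m'^{-1} \cdot n'$$

This can be shown using the inverse of $T_\Delta$:

$$\begin{pmatrix}1\\0\\0\end{pmatrix} = T_\Delta^{-1}\left(\begin{pmatrix}1\\1\\1\end{pmatrix} - m'^{-1} \cdot n'\right) = m' \cdot \begin{pmatrix}1\\1\\1\end{pmatrix} - n' + n' = \begin{pmatrix}1\\0\\0\end{pmatrix}$$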

This proof requires the existence of $T_\Delta$, and it turns out that this inverse
does not always exist. Geometrically, the choice of q means that the
normal $N_{unit} = (0, 0, 1)^T$ of the unit triangle is mapped to the point
$T_\Delta^{-1}(N_{unit}) = -(a - c) - (b - c) + e_1 + c$.
In fact the part $-(a - c) - (b - c) + c$ of this sum
lies in the triangle plane. Thus the triangle transformation does not exist if
the triangle normal is perpendicular to $e_1$, since then $-(a - c) - (b - c) + e_1 + c$
lies in the triangle plane too. This problem can be solved by choosing q in
such a way that one of the other two columns of $m$ is $(1, 1, 1)^T$. The n-th
column can be set to $(1, 1, 1)^T$ if $q = -(a - c) - (b - c) + e_n$. The proof of this
is analogous to the one above.

To store the minimized representation of the triangle transformation it
is necessary to save the number of the column that is equal to $(1, 1, 1)^T$.
This can simply be encoded in 2 bits.

Furthermore a criterion is required that chooses the column to be set to
$(1, 1, 1)^T$. This is quite simple, since the column index n is optimal if the normal $N_{unit}$
is mapped to a point as far away from the triangle plane as possible. Thus n is
chosen such that the angle between $e_n$ and the normal of the triangle $\Delta$ is
minimal.
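The construction above can be sketched in a few lines of Python. This is only an illustrative sketch under assumptions, not part of the hardware design: the helper names (`inverse3`, `choose_column`, `memory_efficient_transform`, `apply_point`) are introduced here, exact rational arithmetic is used instead of the hardware number format, and the column is chosen as the axis with the largest absolute normal component, which is one way to read the minimal-angle criterion when the orientation of the normal is ignored.

```python
from fractions import Fraction


def inverse3(m):
    """Invert a 3x3 matrix (list of rows) via the adjugate; exact for Fractions."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("matrix is singular")
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]


def choose_column(a, b, c):
    """Index k of the axis e_k closest to the normal direction (largest |N_k|)."""
    u = [a[i] - c[i] for i in range(3)]
    v = [b[i] - c[i] for i in range(3)]
    normal = [u[1] * v[2] - u[2] * v[1],
              u[2] * v[0] - u[0] * v[2],
              u[0] * v[1] - u[1] * v[0]]
    return max(range(3), key=lambda i: abs(normal[i]))


def memory_efficient_transform(a, b, c, k):
    """Return (m, n) of T_delta for q = -(a-c) - (b-c) + e_k, with k in 0..2."""
    a, b, c = ([Fraction(x) for x in p] for p in (a, b, c))
    u = [a[i] - c[i] for i in range(3)]
    v = [b[i] - c[i] for i in range(3)]
    q = [-u[i] - v[i] + (1 if i == k else 0) for i in range(3)]
    m_prime = [[u[i], v[i], q[i]] for i in range(3)]   # columns a-c, b-c, q
    m = inverse3(m_prime)
    n = [-sum(m[i][j] * c[j] for j in range(3)) for i in range(3)]  # n = -m . n'
    return m, n


def apply_point(m, n, p):
    """Apply T(x) = m.x + n to a point."""
    return [sum(m[i][j] * p[j] for j in range(3)) + n[i] for i in range(3)]
```

For a triangle whose normal is perpendicular to the chosen axis, `inverse3` raises an error, which corresponds to the case in which the transformation does not exist.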


5.1.2 Normal Consistent Triangle Transformation

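A different possibility is to choose q in such a way that the normalized
normal $N = (a - c) \times (b - c) / |(a - c) \times (b - c)|$ of the triangle is mapped to
the normal of the unit triangle. Here the normal is transformed as a direction
vector, i.e. by the linear part of the transformation only:

$$T_\Delta^{-1}(N_{unit}) = T_\Delta^{-1}\begin{pmatrix}0\\0\\1\end{pmatrix} = N$$

The solution to $T_\Delta^{-1}$ looks like:

$$m' = \begin{pmatrix} a_x - c_x & b_x - c_x & N_x \\ a_y - c_y & b_y - c_y & N_y \\ a_z - c_z & b_z - c_z & N_z \end{pmatrix}, \qquad n' = \begin{pmatrix} c_x \\ c_y \\ c_z \end{pmatrix}$$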

The transformation $T_\Delta^{-1}$ is completely defined, and the inversion of $T_\Delta^{-1}$
again yields an affine transformation if the triangle is not degenerate. Thus
$T_\Delta$ exists for every non-degenerate triangle $\Delta$.
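As an illustration, the normal consistent transformation can be computed as in the following Python sketch. It is a minimal sketch under assumptions: the function names are introduced here, ordinary double precision floats are used, and the inverse of $m'$ is written with the standard cross product formula for the inverse of a matrix given by its columns.

```python
import math


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def normal_consistent_transform(a, b, c):
    """Return (m, n) with T(x) = m.x + n mapping triangle (a, b, c) to the unit triangle."""
    u = [a[i] - c[i] for i in range(3)]        # first column of m'
    v = [b[i] - c[i] for i in range(3)]        # second column of m'
    w = cross(u, v)
    length = math.sqrt(dot(w, w))
    if length == 0.0:
        raise ValueError("degenerate triangle")
    normal = [x / length for x in w]           # third column of m': unit normal N
    det = dot(u, cross(v, normal))             # det(m'), equals |(a-c) x (b-c)|
    # Rows of m = m'^-1 for the matrix with columns u, v, N.
    m = [[x / det for x in cross(v, normal)],
         [x / det for x in cross(normal, u)],
         [x / det for x in cross(u, v)]]
    n = [-dot(row, c) for row in m]            # n = -m . n' with n' = c
    return m, n
```

In this sketch the third row of `m` is the unit normal itself, so the z-component of a transformed direction is its dot product with the triangle normal.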

5.2 Unit Triangle Intersection

To intersect a ray $R = (org, dir)$ with a triangle $\Delta$, the ray R is
transformed using $T_\Delta$ to the unit triangle space. The intersection distance $\lambda$ and
the barycentric (u,v)-coordinates do not change under an arbitrary bijective
affine transformation. As the triangle transformation is bijective for
non-degenerate triangles, it is equivalent to compute the ray-triangle intersection
in the world coordinate system between R and $\Delta$, or in the unit triangle
coordinate system between the transformed ray $R'$ and $\Delta_{unit}$. The advantage
of the second method is that the intersection computation of a ray with
the unit triangle is quite simple, since the unit triangle lies in the xy-plane.

Let $R' = T_\Delta(R) = T_\Delta(org, dir) = (m \cdot org + n,\; m \cdot dir) = (org', dir')$ be
the ray transformed to the unit triangle space, then the intersection can be
computed by:

$$\lambda = -\frac{org'_z}{dir'_z}, \qquad u = \lambda \cdot dir'_x + org'_x, \qquad v = \lambda \cdot dir'_y + org'_y$$

The hit-point lies in the triangle if the so-called in-triangle test
$u \geq 0 \wedge v \geq 0 \wedge u + v \leq 1$ is fulfilled, and it has the barycentric triangle
coordinates $(u, v, 1 - u - v)$.
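The two stages can be sketched in Python as follows. This is a hedged software illustration, not the hardware pipeline: the function name and the representation of $T_\Delta$ as a 3x3 list `m` plus a vector `n` are assumptions made here, and hits behind the ray origin are not rejected, since that is assumed to be left to the caller.

```python
def intersect_unit_triangle(m, n, org, direction):
    """Intersect a ray with the triangle whose transformation is T(x) = m.x + n.

    Returns (lam, u, v) for a hit, or None if the ray misses the triangle
    or runs parallel to its plane.
    """
    # Stage 1: transform the ray. The origin is a point, the direction a vector.
    org_t = [sum(m[i][j] * org[j] for j in range(3)) + n[i] for i in range(3)]
    dir_t = [sum(m[i][j] * direction[j] for j in range(3)) for i in range(3)]
    # Stage 2: intersect with the unit triangle in the xy-plane.
    if dir_t[2] == 0:
        return None
    lam = -org_t[2] / dir_t[2]
    u = lam * dir_t[0] + org_t[0]
    v = lam * dir_t[1] + org_t[1]
    if u >= 0 and v >= 0 and u + v <= 1:
        return lam, u, v
    return None
```

For the triangle with the vertices (2, 0, 0), (0, 2, 0) and (0, 0, 0), for example, the transformation is simply a scaling of x and y by 1/2 with no translation.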

If the second triangle transformation, which maps the geometry normal
of the triangle to the normal of the unit triangle, is used, the dot product
between the ray direction and the triangle normal can be computed
in either coordinate system. In the unit triangle system the computation is
extremely simple:

$$dir' \cdot \begin{pmatrix}0\\0\\1\end{pmatrix} = dir'_z$$

Thus the z-component of the transformed ray direction is the dot
product between the ray direction and the geometry normal of the triangle.
If the ray direction of R was normalized, then $dir'_z$ is exactly the cosine
of the angle between the ray direction and the normal vector.

It is not obvious that the dot product with the normal is maintained under the unit
triangle transformation, but this special transformation has this property, as
it can be written as:

$$T_\Delta = T_{xy} \circ T_R \circ T_T$$

The transformation $T_T$ is a translation that maps the triangle vertex
c to (0, 0, 0). The rotation $T_R$ rotates the triangle into the xy-plane, and the
last transformation $T_{xy}$ is a composition of transformations that maps the
triangle in the xy-plane to the correct form. This last transformation does
not change the z-component of its input vector.

The translation and the rotation change neither angles nor lengths, and
the transformation $T_{xy}$ does not change the result of the dot product with
the normal vector (0, 0, 1), as it leaves the z-component untouched.
Thus the complete triangle transformation does not change the dot
product with the normal. Note that because the last transformation $T_{xy}$ changes the length
of the vectors, the angle between the ray direction and the normal is not
maintained by the triangle transformation, only the dot product.
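The same property can also be checked algebraically. As a short sketch, let $r_3$ denote the third row of $m$ (a name introduced here for illustration). Since $m \cdot m' = I$ and the columns of $m'$ are $a - c$, $b - c$ and $N$, the third row satisfies

$$r_3 \cdot (a - c) = 0, \qquad r_3 \cdot (b - c) = 0, \qquad r_3 \cdot N = 1$$

The first two equations state that $r_3$ is perpendicular to the triangle plane, so $r_3 = s \cdot N$ for some scalar $s$, and the third equation together with $|N| = 1$ gives $s = 1$. Hence

$$dir'_z = r_3 \cdot dir = N \cdot dir$$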

The described method can be used to compute the cosine between the

ray direction and triangle normal only if the direction of the initial ray is

normalized. Thus in conjunction with the dynamic ray tracing algorithm

the only transformations that can be used to instantiate objects are compo-

sitions of translation and rotation matrices, as otherwise the length of the

direction is changed.

Of course this concept to transform the ray first and then to intersect

with a unit object can be applied to many other types of objects like ellipses

or rectangles too. An advantage is that only one representation is required

for a wide range of objects, as the transformation to the unit object is

described by an affine transformation in each case. Additionally only the

type of the object has to be stored, to call the correct unit intersection

function.

A drawback of this triangle intersection method is that the triangle
matrix depends on all three vertices of the triangle. Thus, because of
limited computational accuracy, rays can slip through the gap between two triangles that lie beside
each other and have two vertices in common. This problem can be solved by
using a small epsilon in the comparisons of the in-triangle test. However,
most triangle intersection methods suffer from this problem as well.
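A tolerant in-triangle test could look like the following Python sketch. The function name and the way the tolerance is applied to all three comparisons are assumptions made for illustration; the thesis does not prescribe a particular epsilon value.

```python
def in_triangle(u, v, eps=0.0):
    """In-triangle test with an optional tolerance eps >= 0 on every comparison."""
    return u >= -eps and v >= -eps and u + v <= 1.0 + eps
```

With `eps > 0` the triangles are enlarged slightly, so that a ray hitting exactly between two adjacent triangles is accepted by at least one of them.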

Chapter 6

The Dynamic SaarCOR Architecture

The architecture presented in this chapter is a general approach for a
dynamic ray tracing hardware architecture which has many aspects in common
with the standard SaarCOR architecture [8]. The main difference is that the
Dynamic SaarCOR Architecture supports dynamic scenes, whereas the standard
SaarCOR architecture does not.

Dynamics is achieved by partitioning the scene into movable objects as

described in Section 4. The geometry in the objects remains static but the

objects themselves can be moved around. This requires the rebuilding of

a top-level acceleration structure over the objects in each frame, if some

objects have been moved. The architecture provides no hardware support for
rebuilding the top-level acceleration structure, as the host PC can do this
fast enough if the number of objects is less than 50000.

Hardware support is given for the triangle intersection, traversal through

the dynamic 2-level acceleration structure and the shading computation as

these are the most expensive operations. To support this a costly affine

ray transformation unit to transform rays to the local coordinate system of

an object is required. Because this unit is almost of the same complexity

as a standard triangle intersection unit a naive approach would double the

required chip area. But using the special unit triangle intersection method

as described in Section 5, it is possible to share the transformation unit for

two purposes. Furthermore the shader can use the transformation unit to

perform the primary or secondary ray computation.


The special triangle intersection method is used mainly in order to
share the transformation unit between the object space transformation
and the triangle intersection. In principle, it would be possible to separate
the transformation and the intersection using two independent units. But this
has some disadvantages, because the transformation unit would be used only
20% of the time if it were fully pipelined. A lot of computational power
would be wasted this way. The usage of the transformation unit could
be increased by performing the operation sequentially, such that approximately
5 cycles are required per ray transformation. But then the transformation
could slow down the complete pipeline if it is used much more frequently
in some parts of the scene. This slowing down is typical behavior if too
many special purpose units are used in a design.

To exploit coherence between neighboring rays the architecture handles

packets of rays as described in Section 3.1.3. By doing so data is always

accessed for a packet of rays reducing the size of the memory interface.

At a given time there are always several independent packets in the ray

tracing system to increase the usage of the units. This is necessary as the

special purpose pipelines needed for the computation are fairly deep. On the

other hand, memory latency can be hidden since during a memory request

of one packet, the other packets can do operations in the chip.

Because each packet can be seen as a single thread running in the system

this concept is a kind of multi-threading. Each packet corresponds to a

complete data-set in the chip, consisting of near and far value, stacks and

other required internal data. In order to guarantee that each packet accesses

only its data-set, a unique packet-id (pid) identifies it and is used to address

the correct data-set. This packet-id is passed from unit to unit, as a kind

of job-passing. If the traversal unit reaches a leaf node for instance, the

packet-id is delivered to a different unit that handles the list of objects.

A very important topic in ray tracing is the shading computation. Due to

the variety of possible shading models, the corresponding shading hardware

should be a fully programmable special purpose CPU. As shading is outside
the scope of this thesis, it is mentioned only marginally.

The Dynamic Ray Tracing Architecture (see Fig. 6.1) consists of one or
more Dynamic Ray Tracing Pipelines (DynRTP), each of which is subdivided into
a Ray Generation and Shading unit (RGS) and the Dynamic Ray Tracing
Core (DynRTC). The main task of the RGS unit is to do the shading
computations, using the Dynamic Ray Tracing Core (DynRTC) to shoot rays
through the scene, and to compute primary rays.

Figure 6.1: Dynamic Ray Tracing Architecture

The Dynamic Ray Tracing Core consists of four main parts. First there

is the traversal unit that traverses a packet of rays through the acceleration

structure. The lists of objects of the acceleration structure are handled by

the list unit. The transformation unit applies an affine transformation to a

packet of rays and the intersection unit intersects rays with the unit triangle.

The Ray Generation Controller tells the DynRTP units which pixels

to render next. The scene data as well as some other configuration data

(camera position, acceleration structure, etc.) are sent through a PCI or

AGP interface to the chip. Each Dynamic Ray Tracing pipeline has access

to the scene data through a cache interface. This cache interface consists of

four independent caches for each type of data that is used.

6.1 Dynamic Ray Tracing Core

The ray tracing core is the basic ray casting unit of the architecture. Thus

it is responsible for tracing packets of rays through the scene and returning

the information about the object that was hit. As a fundamental concept of the

dynamic ray tracing approach is the partitioning of the scene into movable

objects, the dynamic ray tracing core has to traverse the packet through a


top-level acceleration structure to find a possible hit-object and then trans-

form it to the local coordinate system of that object. There the traversal

needs to be continued to find a possible hit-triangle.

The Dynamic Ray Tracing Core is used by the shader unit (RGS) to

shoot rays through the scene. To do so the shader first needs to initialize

the dynamic ray tracing core by sending the k-D tree root node for the

next packet and the transformation to apply first. For a primary ray, this

transformation is a simple camera transformation. After that, the shader

sends the packet of rays in sequence to the pipeline. It always passes the

transformation unit first, which applies the stored transformation to it.

Because the transformed ray has to traverse through the scene it is saved

in the traversal and transformation unit for later use. The traversal unit

starts the top-level traversal of the packet until a leaf node is reached. It

sends the list of objects, saved in the leaf node, to the list unit which has

the task to handle the list. Thus it reads the first list entry out of the list

and sends it to the transformation unit. This one fetches the object, stores

the object’s root node into the traversal unit and applies the stored inverse

object transformation to the packet of rays. At this point the inverse of the

object transformation is required, since we do not position the object, but

transform the ray into the object.

The transformed ray is now in the local coordinate system of the object

and is saved in the traversal and transformation unit. The traversal starts

with the bottom-level traversal in the object with the transformed ray until

a leaf node is reached. The list unit handles the list again but the trans-

formation unit now reads unit triangle matrices out of memory and applies

these transformations to the packet.

The packet transformed to the unit triangle space is intersected with the

unit triangle by the intersection unit. The intersection result is stored in

the traversal unit which in particular needs the hit-distance to compute the

ray termination correctly. If the list of triangles still contains entries, the operation
is continued at the list unit, otherwise at the traversal unit.

6.1.1 Traversal Unit

The traversal unit traverses packets of rays in parallel through the scene.

This is done using a k-D tree and k-D tree traversal algorithm as explained

in the Sections 3.1.2 and 3.1.3.


The traversal unit consists of a memory interface, to load k-D tree nodes,

and a special purpose pipeline. This one is internally subdivided into some

traversal slices to handle the single rays of the packet in parallel. In each

pass through the pipeline a packet traversal step is computed.

Figure 6.2: The Figure shows the traversal unit consisting of the memory interface, 4 traversal slices, a packet traversal decision unit and the collect hits unit. For each of the units the necessary internal data is shown.

Figure 6.2 shows the internal structure of the traversal unit. The opera-

tion always starts at the memory interface which fetches the next or first k-D

tree node out of memory. If this node was a leaf then the packet together

with the list address is sent to the list unit to compute intersection results.

Otherwise the node is sent to the traversal slices which compute a traversal

decision for each of the rays in the packet. These single traversal decisions

are combined into a packet traversal decision by the packet traversal deci-

sion unit. The packet traversal decision is sent to the memory interface and

back to the traversal slices as these have to do stack operations depending

on it. Using the packet traversal decision the memory interface can fetch

the next node to process and do push/pop operations of the nodes.

Because the memory interface is responsible for the computation of the


node addresses it saves the current node and handles the node stack. In

contrast because the traversal slices compute the traversal decision for a ray

they need to store and update the near and far values, the ray and handle

the far-stack needed for the computations.

The collect hits unit computes the closest intersection for each ray of

the packet. If this unit gets a new intersection result, it determines whether

the new hit-distance is closer than the one saved. If so, the new intersection

result is saved and the one stored is deleted. The intersection result typically

consists of the hit-distance, hit-object and hit-triangle. Because the local

barycentric uv-coordinates of the hit-point are required to support textures

they need to be saved as intersection result too. As the special unit triangle

intersection method is used, the cosine between ray direction and triangle

normal can be computed for free. Therefore it is saved as intersection result

for later usage in the shader.

An important point is that the collect hits unit gives the traversal slices
access to the current hit-distance of their ray of the packet. Using this
information the traversal slices can terminate a ray. A ray is terminated if
there is a hit closer than the far value of the leaf node in which the hit occurred.
As the traversal unit terminates the ray only at the next traversal step, the
hit-distance is compared against the current near value there. If it lies before the near
value, every further hit would be farther away than the stored one. The
traversal operation is finished once every ray of the packet is terminated
or the stack is empty.
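The behavior of the collect hits unit and the termination check can be sketched in Python as follows. This is a simplified software model under assumptions: the class and function names are introduced here, and the intersection result is stored as a plain tuple.

```python
import math


class CollectHits:
    """Keeps the closest intersection result for every ray of one packet."""

    def __init__(self, packet_size=4):
        self.best = [None] * packet_size      # one result slot per ray

    def report(self, ray, distance, obj, tri, u, v, cos_n):
        """Store a new result only if it is closer than the saved one."""
        saved = self.best[ray]
        if saved is None or distance < saved[0]:
            self.best[ray] = (distance, obj, tri, u, v, cos_n)

    def hit_distance(self, ray):
        """Current hit-distance of a ray, infinity if nothing was hit yet."""
        saved = self.best[ray]
        return math.inf if saved is None else saved[0]


def ray_terminated(hit_distance, near):
    """Check at a traversal step: a stored hit in front of the current near
    value cannot be beaten by anything that is still to be traversed."""
    return hit_distance < near
```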

6.1.2 Mailboxed List Unit

The mailboxed list unit has the task of handling a list of object or triangle

addresses, filtering the addresses in a kind of intersection cache (mailbox)

and sending the passed addresses to the transformation unit.

This mailboxing is necessary as most objects are present in several leaf
nodes of the k-D tree. Therefore it can happen that an object is intersected
several times, which greatly reduces the performance (see Section 4.3).
The room problem in particular decreases the performance. Therefore
multiple intersections with the same objects and triangles have to be avoided. This is the task
of the mailbox unit, which saves already intersected objects in slots and
prevents packets from being intersected twice with an object.

Figure 6.3: Mailboxed List Unit

The list unit gets a job from the traversal unit consisting of a single
address of the list to handle. The first entry of the list is read and sent to

the mailbox unit. This one is a packet based mailbox which checks if the

packet has already been intersected with this object. If so control is returned

to the list unit to read the next list entry or to continue at the traversal unit

if this was the last list entry. If the list entry was not yet intersected, it is

sent to the transformation unit to be intersected.

The operation at the list unit is continued if a triangle intersection or

object intersection operation is done and the list was not empty. If the list

was empty the traversal operation is continued.

6.1.3 Transformation Unit

An essential part of the algorithm is the ray transformation which is done

by a specialized transformation unit. This unit performs the transformation

of the rays to the object’s coordinate system and transforms the rays to

the unit triangle system as a kind of precomputation for the intersection

unit. Furthermore the shader can use the transformation unit to apply the

camera transformation to compute a primary ray and to compute secondary

rays like light rays or reflection rays. The transformation of a packet is done

sequentially, which allows for a good balancing between the traversal unit

and the transformation unit (see Section 6.1.5).

Because most ray packets have a single ray origin, this origin needs to

be transformed only once. The transformation unit exploits this property


by a kind of packet compression that transforms a packet of n rays into

n + 1 vectors. The first vector is the common ray origin and the other

ones the direction vectors of the packet. Such a compressed packet can be

transformed by a fairly cheap transformation unit for vectors and decoded

to a normal packet of rays by a decompression unit.

Figure 6.4: Transformation Unit

A transformation job starts at the load matrix unit which reads the ma-

trix of an object or triangle column by column out of memory and stores

them in the transform unit. If the matrix was completely read, the send

packet unit gets the job. This unit has a copy of the rays of the packet to

process and sends these to the compress packet unit. This unit compresses

the packet and sends the vectors and points to be transformed sequentially

to the transform unit. This unit applies the previously stored affine trans-

formation to its input vectors. Finally, the packet is combined into a valid

packet again by the decompress packet unit.


There is an important path of the transformed packet to the send packet

unit. This path is needed if a packet was transformed to the local coordinate

system of an object, because then the transformed ray needs to be saved to

be intersected with the triangles in the object later.

There exist two modes for the transformation, one to transform points

and a different one to transform vectors. This is important as both have to

be transformed differently as explained in Section 4.4. Furthermore, there

exist two compression modes that indicate whether the packet has a common

origin or not. If so the packet is compressed. Otherwise, each origin and

direction of the packet is transformed, resulting in 2n transformations. The

compression mode is set by the shader, as it has the necessary information

about the type of the packet.

Figure 6.5: The Figure shows that primary rays as well as light rays are types of packets with a single origin. Even reflections at planar surfaces maintain this property, as the virtual origin can be seen as the common origin of the packet.

It turns out that most kinds of packets can be compressed (see Figure
6.5). Packets of primary rays are trivially compressible, since their origin
is the projection center of the camera. Light rays that are shot from the light
source to the hit-points have the light source as common origin. Even reflected
packets of rays retain their single origin if the packet was reflected by exactly
one planar surface. The reflection at curved surfaces yields a compressible
packet only in special cases.
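The effect of the packet compression on the number of vector transformations can be illustrated by the following Python sketch. It is a software model under assumptions, with function names introduced here; points are transformed with the linear part plus the translation and vectors with the linear part only.

```python
def transform_vector(m, x):
    """Linear part only: used for directions."""
    return [sum(m[i][j] * x[j] for j in range(3)) for i in range(3)]


def transform_point(m, n, x):
    """Linear part plus translation: used for origins."""
    return [y + t for y, t in zip(transform_vector(m, x), n)]


def transform_packet(m, n, rays, common_origin):
    """Transform a packet of (origin, direction) rays.

    Returns the transformed rays and the number of vector transformations used.
    """
    if common_origin:
        origin = transform_point(m, n, rays[0][0])         # 1 shared origin
        dirs = [transform_vector(m, d) for _, d in rays]    # one direction per ray
        return [(origin, d) for d in dirs], len(rays) + 1
    out = [(transform_point(m, n, o), transform_vector(m, d)) for o, d in rays]
    return out, 2 * len(rays)
```

For a packet of 4 rays this gives 5 transformations in the compressed mode and 8 otherwise.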

6.1.4 Intersection Unit

The intersection unit is a simple pipeline that intersects rays with the unit

triangle, applying the formulas of Section 5.2. As inputs it gets rays trans-


formed to the unit triangle space and computes an intersection result consist-

ing of the hit-distance, barycentric coordinates and the dot product between

the ray direction and the triangle normal vector. This intersection result is

combined with the hit-object and hit-triangle and then saved in the collect

hits unit of the traversal.

6.1.5 Balancing

The subdivision of an algorithm into special purpose units may become a
problem if the units are too specialized and used very rarely. Thus the balancing
between the individual units of the dynamic ray tracing core needs to be
analysed.

The most expensive units of the design are the traversal unit and the

transformation unit. Simulations showed that a balancing of 4 to 1 between

the traversal and intersection operation is optimal for the k-D tree algorithm

[8]. The same ratio can be used between the traversal and the
transformation unit too, which means that 4 times as many traversal operations
as ray transformations need to be done.

This ratio can approximately be achieved using a packet size of 4 rays per
packet, which are traversed in parallel and transformed sequentially. If the
packet can be compressed, 5 vectors have to be transformed, so the
transformation unit requires five times as many cycles to handle a packet
as the traversal unit; otherwise 8 vectors have to be transformed. Thus we have a
ratio of 5 to 1 if the packet can be compressed, or 8 to 1 otherwise. This
ratio of 5 to 1 has been shown to be optimal for the dynamic architecture,
as can be seen in the usage statistics in Appendix A.

6.2 Shading Unit

The shading unit should consist of several programmable special purpose

shading CPUs, because of the wide range of possible shading models. This

concept of the programmable shading unit will not be discussed, but rather

the interface between the shader unit and the Dynamic RTC.

This interface consists (besides a channel to send the k-D tree root node)

of a channel to store a transformation in the RTC. Because this stored

transformation is always applied to the packet sent to the ray tracing core,

the shader can compute primary rays, light rays or reflection rays, using

the transformation unit. Each of these computations can be performed by


storing a suitable transformation in the RTC and by sending a special ray

to be transformed. If all rays of the packet have been transformed, the RTC

starts with the traversal operation.

6.2.1 Primary Rays

Primary rays are rays from the camera into the scene, which are computed for
each pixel of the image. A camera can be represented by three orthogonal
vectors u, v, w and its position p. The vectors u, v and w define the local
coordinate system of the camera, such that u points to the right, v to the
top and w in the viewing direction of the camera. The primary ray belonging
to a pixel (x, y) on the screen is:

$$x' = \frac{x}{x_{max}} - \frac{1}{2}, \qquad y' = \frac{y}{y_{max}} - \frac{1}{2}$$

$$prim\_ray = (p,\; x' \cdot u + y' \cdot v + w)$$

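This primary ray can also be computed by the following ray transformation:

$$T_{shear} = \begin{pmatrix} \frac{1}{x_{max}} & 0 & -\frac{1}{2} & 0 \\ 0 & \frac{1}{y_{max}} & -\frac{1}{2} & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \qquad T_c = \begin{pmatrix} u_x & v_x & w_x & p_x \\ u_y & v_y & w_y & p_y \\ u_z & v_z & w_z & p_z \end{pmatrix}$$

$$pre\_prim\_ray = \left( \begin{pmatrix}0\\0\\0\end{pmatrix}, \begin{pmatrix}x\\y\\1\end{pmatrix} \right)$$

$$prim\_ray = T_c(T_{shear}(pre\_prim\_ray))$$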

The shown 4x3 matrices represent affine transformations, where the left
3x3 submatrix stands for the linear part and the fourth column for the affine part.
The transformation $T_{shear}$ is a shearing transformation that performs the
mapping of the pixel coordinates to the $x'$ and $y'$ values. The transformation
$T_c$ performs the affine composition of the u, v, w and p vectors with the $x'$, $y'$
values. If the special ray pre_prim_ray is transformed first with $T_{shear}$ and
then with $T_c$, the primary ray computation is performed.

Thus the RGS unit stores the camera matrix $T_c \circ T_{shear}$ as transformation
in the RTC and sends the pre-primary rays pre_prim_ray to it.
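The primary ray computation by a single stored camera matrix can be checked with a small Python sketch. The helper names (`apply_affine`, `compose`, `camera_matrix`) are assumptions introduced here; the matrices are stored as 3 rows with 4 entries each, the fourth entry being the affine part.

```python
def apply_affine(t, origin, direction):
    """Apply an affine matrix t to a ray (origin as point, direction as vector)."""
    new_origin = [sum(t[i][j] * origin[j] for j in range(3)) + t[i][3]
                  for i in range(3)]
    new_direction = [sum(t[i][j] * direction[j] for j in range(3))
                     for i in range(3)]
    return new_origin, new_direction


def compose(t2, t1):
    """Return the affine matrix of t2 applied after t1."""
    lin = [[sum(t2[i][k] * t1[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
    trans = [sum(t2[i][k] * t1[k][3] for k in range(3)) + t2[i][3]
             for i in range(3)]
    return [lin[i] + [trans[i]] for i in range(3)]


def camera_matrix(u, v, w, p, xmax, ymax):
    """Camera matrix T_c o T_shear as stored by the RGS unit."""
    t_shear = [[1 / xmax, 0, -0.5, 0],
               [0, 1 / ymax, -0.5, 0],
               [0, 0, 1, 0]]
    t_c = [[u[i], v[i], w[i], p[i]] for i in range(3)]
    return compose(t_c, t_shear)
```

Applying the result of `camera_matrix` to the pre-primary ray with origin (0, 0, 0) and direction (x, y, 1) yields the primary ray of pixel (x, y).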

6.2.2 Light Rays

Light rays are secondary rays that are computed to determine the amount

of light that illuminates the hit-point of a primary ray for instance. Such a

light ray goes from the light source to the hit-point of the primary ray.

To compute a light ray for a primary ray, the shader has to read back the

primary ray R = (org, dir) and the intersection result from the RGC. The

intersection result consists among other things of the hit-distance λ which

is needed to compute the hit-point R(λ). If L is the position of the light

source, the light ray can be computed by:

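$$R_{light} = (L,\; R(\lambda) - L) = (L,\; org + \lambda \cdot dir - L)$$

The same computation can be done by the following ray transformation:

$$T_{light} = \begin{pmatrix} org_x & dir_x & L_x & L_x \\ org_y & dir_y & L_y & L_y \\ org_z & dir_z & L_z & L_z \end{pmatrix}, \qquad R' = \left( \begin{pmatrix}0\\0\\0\end{pmatrix}, \begin{pmatrix}1\\ \lambda \\ -1\end{pmatrix} \right)$$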

The transformation of the ray $R'$ by $T_{light}$ yields the light ray from the
light source to the hit-point of the ray. Note that because this transformation
$T_{light}$ depends on the ray R, the shader has to load a separate matrix for each
of the rays of the packet. Furthermore, the actual hit-point of the ray does not
need to be computed in the shader.
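The following Python sketch checks this construction for one ray. The function name is an assumption introduced here; the origin of the special ray is transformed as a point and its direction as a vector.

```python
def light_ray_via_transform(org, direction, light, lam):
    """Build T_light for one ray and apply it to R' = ((0, 0, 0), (1, lam, -1))."""
    t_light = [[org[i], direction[i], light[i], light[i]] for i in range(3)]
    special_origin = [0, 0, 0]
    special_dir = [1, lam, -1]
    # The origin is transformed as a point (linear part plus fourth column) ...
    new_org = [sum(t_light[i][j] * special_origin[j] for j in range(3)) + t_light[i][3]
               for i in range(3)]
    # ... and the direction as a vector (linear part only).
    new_dir = [sum(t_light[i][j] * special_dir[j] for j in range(3))
               for i in range(3)]
    return new_org, new_dir
```

The returned origin is the light position L and the returned direction is org + lam * dir - L, without the hit-point being computed explicitly.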

6.2.3 Reflection Rays

Reflection rays are computed to simulate reflective surfaces. A ray
that hits a reflective surface is reflected by it and traversed further in the
reflection direction. The geometry hit by the reflected ray is exactly what is
seen in the reflective surface.

The reflection of a ray at a planar surface can be performed by an affine
reflection transformation. Such a reflection transformation can be
precomputed for each triangle of the scene using the normal consistent triangle
transformation. The concept is to transform the ray first to the unit
triangle space, then to reflect it at the xy-plane and to transform it back again.
This precomputation can be done by the following composition of 3 affine
transformations:

$$T_{reflect} = T_\Delta^{-1} \circ \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & -1 & 0 \end{pmatrix} \circ T_\Delta$$

This reflection transformation depends on the triangle and maps each
ray to the reflected ray. The reflected ray starts at the reflected origin and
has the reflected ray direction, as can be seen in Figure 6.6. To use a ray
reflected this way, the traversal of the reflected ray has to start at the
hit-distance of the unreflected ray. This can be done by setting the near value
of the traversal algorithm to this hit-distance and ignoring each hit that
is closer than this distance.
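That this composition is the usual mirror reflection at the triangle plane can be seen from a short calculation, given here as a sketch. Write the middle matrix as $S = I - 2 e_3 e_3^T$ with $e_3 = (0, 0, 1)^T$, and let $T_\Delta(x) = m \cdot x + n$ and $T_\Delta^{-1}(x) = m' \cdot x + n'$ be the normal consistent transformation. Then $m' e_3 = N$, and the third row of $m$ equals $N^T$, because $m \cdot m' = I$ forces it to be perpendicular to $a - c$ and $b - c$ and to have dot product 1 with the unit vector $N$. Therefore

$$m' \cdot S \cdot m = I - 2\,(m' e_3)(e_3^T m) = I - 2 N N^T$$

Using $n = -m \cdot c$ and $n' = c$, the complete transformation becomes

$$T_{reflect}(x) = (I - 2 N N^T)(x - c) + c$$

which reflects points at the plane through $c$ with normal $N$, while direction vectors are transformed by the linear part $I - 2 N N^T$ only.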


If the triangle lies in an object (which is always the case for the dynamic

ray tracing algorithm) two additional transformations have to be done. First

the ray has to be transformed into the object coordinate system, then to be

reflected, and at last to be transformed back again to the world coordinate

system. These additional transformations increase the cost of a reflection

ray, but can be performed in 3 passes by the transformation unit as well.

Figure 6.6: The Figure shows how the reflection matrix reflects a packet of rays at a surface.

Chapter 7

FPGA Prototype

In this chapter the prototype implementation of the dynamic ray tracing
architecture is described. The ADM-XRC-II PCI
board from Alpha Data [30] has been used as development platform. This board contains a Xilinx
Virtex-II 6000-4 [31] FPGA, 6 SRAM chips each with 4 MB of memory, a
PCI controller and some IO-adapters.

Figure 7.1: ADMXRC Development Platform

Figure 7.2: ADMXRC Top-Level Flowchart

These IO-adapters are used as a VGA-out interface by generating a dig-

ital RGB-signal in the chip which is translated by an external digital-to-

analog converter to an analog video signal.

My work on the prototype was the development of the dynamic ray tracing
core, which has been completely developed using JHDL [32] as hardware
description language. JHDL has been used as it has a powerful debugging
infrastructure that allows the complete RTC to be simulated in one piece
and data buses to be logged into files. The system was completely developed under
Linux.

Some limitations had to be accepted when mapping the architecture to the Xilinx
Virtex-II 6000 FPGA. Unfortunately there were only enough resources to
implement one ray tracing pipeline. The main problem was the strongly
limited memory resources (block RAMs) in the chip.

Another limitation was the dedicated multipliers of the Virtex-II
platform, which are only 18 bits wide. Thus a floating point representation with
a 16-bit mantissa, a 7-bit exponent and 1 sign bit is used. It turned out
that this accuracy is sufficient to do ray tracing even for complex and highly
detailed standard scenes.
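The precision of such a representation can be estimated with a small Python sketch that rounds a value to a 16-bit mantissa. This is only a rough model under assumptions: the function name is introduced here, the limited range of the 7-bit exponent is ignored, and the rounding mode and a possible hidden bit of the prototype format are not taken from the thesis.

```python
import math


def quantize(x, mantissa_bits=16):
    """Round x to the nearest value with the given number of mantissa bits."""
    if x == 0.0:
        return 0.0
    frac, exp = math.frexp(x)                 # x = frac * 2**exp, 0.5 <= |frac| < 1
    scaled = round(frac * (1 << mantissa_bits))
    return math.ldexp(scaled, exp - mantissa_bits)
```

Under these assumptions the relative rounding error of a single value is at most $2^{-16}$.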

The number of packets in the ray tracing chip can be adjusted from 1 to

64 packets for simulation purposes. Later it is shown that a number of 32

packets in the system is in some sense optimal.

Because the prototype is not capable of rebuilding the top-level k-D tree on the chip, the tree has to be computed by the host PC to which the PCI card is connected. After each frame, the updated top-level k-D tree is written to the ray tracing prototype.

Figure 7.3: Top-level chart of the Dynamic SaarCOR prototype. The numbers at the buses give the number of data and address bits used.

The traversal unit is subdivided into 4 traversal slices, as packets of 4 rays are handled in parallel. The two traversal levels (top-level and bottom-level) are distinguished by an internal depth bit. This bit is 0 in the top-level operation and 1 in the bottom-level operation. Some of the internal registers need to be duplicated for the two traversal levels, since their contents are largely unrelated. Because the stack is duplicated as well, both the top-level and the bottom-level operation support a stack depth of 31 entries. If one of the stacks is full, the traversal operation cannot be continued correctly. This problem can partially be solved by performing no further push operations and continuing the traversal. This strategy works quite well, as errors occur only in tiny details of the scene.
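The overflow strategy can be illustrated with a small software model. The Python sketch below is only an assumption about the behaviour described above, not the JHDL implementation: a push onto a full stack is discarded and the traversal simply continues.

```python
STACK_DEPTH = 31  # entries per traversal level, as stated in the text


class TraversalStack:
    """Bounded stack that drops pushes once it is full (hypothetical model)."""

    def __init__(self, depth: int = STACK_DEPTH):
        self.depth = depth
        self.entries = []
        self.dropped = 0  # number of discarded pushes, for diagnostics only

    def push(self, node) -> bool:
        """Store node; return False if the stack was full and node was dropped."""
        if len(self.entries) >= self.depth:
            self.dropped += 1
            return False
        self.entries.append(node)
        return True

    def pop(self):
        """Return the most recent entry, or None if the stack is empty."""
        return self.entries.pop() if self.entries else None
```

A dropped entry means that one subtree is never visited, which matches the observation that errors show up only in small details of the scene.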

The traversal unit works on k-D tree nodes of 64 bit width. Thus a 64

bit wide memory interface is required, delivering a bandwidth of 0.68 GB/s

at 85 MHz.
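The stated bandwidth follows directly from the bus width and the clock frequency. The short Python sketch below reproduces the arithmetic, assuming one transfer per clock cycle and 1 GB = 10^9 bytes; the function name is chosen for illustration only.

```python
def peak_bandwidth_gb_per_s(bus_width_bits: int, clock_hz: float) -> float:
    """Peak bandwidth in GB/s, assuming one transfer per clock cycle."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_hz / 1e9


# 64 bit wide node interface at 85 MHz: 8 bytes * 85e6 per second
node_bandwidth = peak_bandwidth_gb_per_s(64, 85e6)
print(f"{node_bandwidth:.2f} GB/s")
```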

The list unit reads 19 bit wide addresses out of a list and is one of the

most trivial units of the design, as it mainly consists of an address counter.

A special bit marks the last list entry.

The mailbox unit is implemented as a mailbox with 8 slots. Each time an object is handled that is not already present in the mailbox, it is saved into an empty slot. Because no strategy is implemented to clear the slots again, a full mailbox stays full. This simple mailbox has proved very efficient in the prototype. It is used at both the top level and the bottom level, and thus works for objects and triangles.
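The behaviour of such a mailbox can be sketched in a few lines of Python. This is a hypothetical software model of the description above; when the mailbox is reset (for example per packet) is not stated here and is therefore left out.

```python
class Mailbox:
    """Mailbox with a fixed number of slots and no replacement strategy."""

    def __init__(self, slots: int = 8):
        self.slots = slots
        self.ids = []

    def check_and_insert(self, item_id: int) -> bool:
        """Return True if item_id was already handled, otherwise remember it.

        Once all slots are used, new ids are no longer remembered, so the
        corresponding objects or triangles may be processed more than once.
        """
        if item_id in self.ids:
            return True
        if len(self.ids) < self.slots:
            self.ids.append(item_id)
        return False
```

Processing an item twice is only redundant work and does not change the result, which is why a mailbox without a replacement strategy can still be useful.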

The transformation unit can store an affine transformation for each

packet in the system. This strategy is wasteful, but allows transforma-

tions to be read out of memory independently of the transformation, which

simplified the low level design.

The object and triangle transformations are represented by a 4x3 matrix

and only normal consistent triangle matrices that map the triangle normal to

the unit triangle normal are used. Thus the cosine between the ray direction

and triangle normal can be computed in the intersection unit.

The memory interface consists of three caches: one for the k-D tree

nodes, one for the lists and one for the matrices. The FPGA has access to

six 32 bit wide SRAM chips with a 20 bit address space. Three of these

SRAM chips are used by the ray tracing core. The matrix columns are mapped to all three SRAMs, the 64 bit wide nodes to two of the SRAMs and the 32 bit wide list entries to one SRAM, as shown in Figure 6.1.

Thus the prototype has the following limitations on the scene size. The maximum number of k-D tree nodes, as well as the number of list entries, is limited to 524288. The total number of triangles and objects is limited to 131072. Note that scenes with more than 131072 triangles can be supported by using objects and instantiating the same object several times. In this way scenes with several billion triangles can be visualized.

The small direct mapped caches used (see Table 7.1) proved sufficient for a wide range of scenes. For simulation purposes the cache size can be adjusted in 10 steps from $2^0$ to $2^9$ cache lines. A direct mapped cache (as opposed to a 2-way cache, for instance) was chosen because of the coarse internal granularity of the FPGA memory, which is organized in 2 kB blocks.

Unit             Cache
Traversal        4 kB
List             2 kB
Transformation   6 kB
Total            12 kB

Table 7.1: Maximum Cache Sizes per Unit (without index structure)

The prototype shader is a simple eye light shading pipeline that uses

a color per triangle and the cosine between the ray direction and triangle

normal which is computed by the RTC. Light rays and reflection rays are

supported in the latest version too. The standard resolution of the prototype

is 512x384.

To increase the cache hit rate, the RGC unit performs no scanline ray generation, but uses a kind of hardware optimized Hilbert curve. Computing the image line by line results in bad cache hit rates, as the 2D image space is not scanned locally. If there is a triangle on the left of the image, it is very likely that it is no longer in the cache by the time the complete line is finished. Therefore it is important to work locally on the image, as the Hilbert curve does. The Hilbert curve itself, however, is too complicated to be computed in hardware.

Figure 7.4: Figure (a) shows the recursive pattern that is used to compute the hardware optimized Hilbert curve in Figure (b).

The curve used in the prototype can be computed efficiently in hardware but fulfills the same purpose as the Hilbert curve. The curve is computed by a simple counter whose output bits are interpreted as $\ldots y_3 x_3 y_2 x_2 y_1 x_1 y_0 x_0$. The coordinates (x[3:0], y[3:0]) generate a curve like the one in Figure 7.4.
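The bit interpretation described above can be sketched in software as follows. The function name and the default of 4 bits per axis are chosen for illustration; the even counter bits form x and the odd counter bits form y, an interleaving commonly known as Morton or Z-order. In hardware this amounts to pure rewiring of the counter bits.

```python
def curve_coordinates(counter: int, bits_per_axis: int = 4):
    """Split an interleaved counter ...y1 x1 y0 x0 into (x, y) coordinates."""
    x = 0
    y = 0
    for i in range(bits_per_axis):
        x |= ((counter >> (2 * i)) & 1) << i      # even bits: x0, x1, ...
        y |= ((counter >> (2 * i + 1)) & 1) << i  # odd bits:  y0, y1, ...
    return x, y


# The first four counter values visit a 2x2 block before moving on.
first_block = [curve_coordinates(c) for c in range(4)]
```

Consecutive counter values stay inside small square blocks of the image, which is the locality property the ray generation relies on.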

By using this curve to generate primary rays the cache hit rate is in-

creased by approximately 10% to 20%, especially for the list and matrix

cache (see Figure 7.5).

[Figure 7.5, two plots for the scene Gael: hit rate (0 to 100) over the number of cache lines (0 to 600) for the traversal, list and transformation caches.]

Figure 7.5: Both plots show the cache hit rate depending on the number of cache lines, with scanline ray generation on the left and the hardware optimized Hilbert curve on the right.


7.1 Implementation Statistics

This Section gives some statistics about the complexity of the ray tracing prototype. The numbers presented are in each case worst-case numbers computed from statistics of the Xilinx routing software.

7.1.1 Gate Count

The complexity of hardware circuits is usually measured in number of gates.

This gate count tells how many NAND gates are necessary to implement the

circuit. In the following Sections gate counts are stated for the prototype,

which are computed using the following mapping.

Unit                             Gate count
full adder                       9
D flip-flop                      6
D flip-flop with clock enable    8
4-input LUT                      1 to 9
3-input LUT                      1 to 6
memory bit                       4

Table 7.2: Gate Count Computation

The source of this data is the Xilinx application note XAPP059 [33].

In addition, dual port memory bits are counted as two single port memory bits, and each embedded 18-bit multiplier is counted as 7000 gates. The computations use the worst-case gate count for the LUTs and ignore the gates necessary to address the memory bits.
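The mapping of Table 7.2 can be written down as a small lookup, shown here as a Python sketch. The dictionary keys and the function name are chosen for illustration; the LUT entries use the worst-case values, as stated above.

```python
# Gates per resource according to Table 7.2 (worst case for the LUTs),
# plus 7000 gates per embedded 18-bit multiplier.
GATES = {
    "full_adder": 9,
    "d_flip_flop": 6,
    "d_flip_flop_ce": 8,
    "lut4": 9,
    "lut3": 6,
    "memory_bit": 4,
    "multiplier18": 7000,
}


def gate_count(resources: dict, dual_port_memory_bits: int = 0) -> int:
    """Worst-case gate count for a dict of resource name -> number of instances.

    Dual port memory bits are counted as two single port memory bits.
    """
    total = sum(GATES[name] * count for name, count in resources.items())
    total += 2 * GATES["memory_bit"] * dual_port_memory_bits
    return total
```

As a plausibility check, 448,224 memory bits at 4 gates per bit give 1,792,896 gates, which is the memory gate total listed in Table 7.4.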

7.1.2 Complexity

Table 7.3 lists the complexity of one ray tracing pipeline, measured in the number of floating-point units for addition, multiplication, division and comparison, respectively. The rightmost column additionally lists the amount of internal memory each unit uses to store ray data, stacks and other internal data.

Unit                           Add   Mul   Div   Comp   Mem
Traversal                        4     0     4     13   44.5 kB
List                             0     0     0      0    0.8 kB
Transformation                   9     9     0      0    9.3 kB
Intersection                     3     2     1      3    0.0 kB
Cache (with index structure)     0     0     0      0   15.6 kB
Total                           16    11     5     16   70.2 kB

Table 7.3: Complexity of one ray tracing pipeline with 32 packets and 512 cache lines (dual port memory bits counted as 2 bits)

Unit                       Logic gates   Bits per packet   Memory bits   Memory gates
DynamicRTC                      21,338                 0             0              0
Traversal                        8,470                 0             0              0
TraversalMemoryInterface         5,060             1,292        41,344        165,376
TraversalStackPointer            2,568                12           384          1,536
TraversalSlice0                 43,107             2,352        75,264        301,056
TraversalSlice1                 43,107             2,352        75,264        301,056
TraversalSlice2                 43,107             2,352        75,264        301,056
TraversalSlice3                 43,107             2,352        75,264        301,056
PacketTraversalDecision            309                 0             0              0
CollectHits                      4,155               688        22,016         88,064
List                             2,743                76         2,432          9,728
Mailbox                          7,108               136         4,352         17,408
LoadObject                       4,557                19           608          2,432
SendPacket                       2,262             1,152        36,864        147,456
PacketEncoder                    1,316                 0             0              0
Transformation                 148,040             1,152        36,864        147,456
PacketDecoder                      694                72         2,304          9,216
Intersection                   105,972                 0             0              0
Total                          487,020            14,007       448,224      1,792,896

Table 7.4: Gate Count and Memory Bits per Unit using 32 Packets

Table 7.4 shows the estimated number of gates for each of the units of the design. Further it shows the number of memory bits required per packet in the system as well as the required memory gates for the on chip memory for a system with 32 packets. Table 7.5 shows the gate count of the memory interface and caches, as well as the number of bits required per cache line. A system with 512 cache lines and 32 packets requires at most 2,807,671 gates.

Unit               Logic gates   Bits per cache line   Cache memory bits   Cache memory gates
MemoryInterface          4,323                     0                   0                    0
NodeCache                4,152                    83              42,496              169,984
ListCache                3,704                    51              26,112              104,448
MatrixCache              5,624                   115              58,880              235,520
Total                   17,803                   249             127,488              509,952

Total gates          2,807,671

Table 7.5: Gate Count and Memory Bits per Unit using 512 Cache Lines

If $P$ is the number of packets in the system and $CL$ the number of cache lines, then the gate count $C_{RTC}$ for the complete Dynamic RTC can be estimated by the following formula:

$$C_{RTC} = 487{,}020 + 56{,}028 \cdot P + 996 \cdot CL$$

The necessary internal memory bits can be computed by:

$$Bits_{RTC} = 14{,}007 \cdot P + 249 \cdot CL$$
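The two estimates are easy to evaluate for other configurations. The following Python sketch implements the formulas exactly as printed above; the function names are chosen for illustration.

```python
def gate_count_rtc(packets: int, cache_lines: int) -> int:
    """Estimated gate count of the Dynamic RTC (formula as printed)."""
    return 487_020 + 56_028 * packets + 996 * cache_lines


def memory_bits_rtc(packets: int, cache_lines: int) -> int:
    """Required internal memory bits of the Dynamic RTC (formula as printed)."""
    return 14_007 * packets + 249 * cache_lines


gates = gate_count_rtc(32, 512)   # 2,789,868
bits = memory_bits_rtc(32, 512)   # 575,712
```

For 32 packets and 512 cache lines the gate formula as printed evaluates to 2,789,868, which is 17,803 below the total of 2,807,671 gates in Table 7.5. The difference equals the logic gates of the memory interface and caches listed in that table, so the printed constant appears not to include them. The 575,712 memory bits correspond to about 70.3 kB, in line with the total of Table 7.3.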

7.2 Performance Statistics

This Section discusses the performance achieved with the ray tracing prototype. It shows the maximal performance as well as some analyses that estimate the quality of the design. These quality estimates are based on gate level computations and are thus only of interest for a mapping to an ASIC, not for an FPGA.

The Section describes several kinds of statistics that are listed in Appendix A for 4 test scenes.

7.2.1 Hardware Quality Index

It is easy to develop arithmetic units in hardware, but keeping these units fed with data is very difficult. To feed them, on-chip memory in the form of registers, stacks and caches is required. This on-chip memory is necessary, but in contrast to the arithmetic units most of its gates are idle during the computations. The following hardware quality index $Q_{HW}$ therefore describes the percentage of gates in the chip that are working.

$$Q_{HW} = \frac{U_{AU} \cdot C_{AU}}{C_{AU} + C_{IM}} \cdot 100$$

The value $U_{AU}$ is the usage ratio of the arithmetic units and $C_{AU}$ their cost in gates. Analogously, $C_{IM}$ is the cost of the internal memory in gates.
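The index is straightforward to evaluate once the three quantities are known. The Python sketch below implements the definition; the input values in the example are made up for illustration and are not measurements from the prototype.

```python
def hardware_quality_index(usage_ratio: float,
                           arithmetic_gates: float,
                           memory_gates: float) -> float:
    """Percentage of working gates: U_AU * C_AU / (C_AU + C_IM) * 100."""
    return usage_ratio * arithmetic_gates / (arithmetic_gates + memory_gates) * 100


# Illustrative values only: arithmetic units busy half of the time,
# 1 million gates of arithmetic and 4 million gates of internal memory.
example = hardware_quality_index(0.5, 1_000_000, 4_000_000)
```

The example shows how quickly idle memory lowers the index: even with the arithmetic units busy half of the time, only about a tenth of all gates are working.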

The hardware quality index can be used to compare two different versions

of the same hardware algorithm. The version with the higher quality index

is to be preferred, as it uses the gates more efficiently. Optimal system

parameters, such as cache size and the number of internal packets, can be

computed using this index.

Figure 7.6 shows the hardware quality index depending on the number of packets in the system for two scenes. The best gate usage of about 9.5% is achieved with 32 packets in the system.

This means that it is more efficient to put several ray tracing pipelines with 32 packets each onto the chip than a smaller number of pipelines with more than 32 packets. Because the same holds in the other direction, it is also better to use pipelines with 32 packets than a larger number of pipelines with fewer packets.

[Figure 7.6, two plots titled "Scene Gael, 512x384, 85 MHz" and "Scene Conference, 512x384, 85 MHz": hardware quality index (0 to 10) over the number of packets (0 to 70).]

Figure 7.6: Hardware Quality Index of the Dynamic Ray Tracing Core for the scenes Gael and Conference depending on the number of packets in the system.

The computed maximum is not optimal for an FPGA architecture, as there the cost should not be counted in gates. This is because today's FPGAs consist (besides CLBs) of special resources such as block RAMs and multiplier blocks. Thus memory can be much cheaper if these memory blocks can be used efficiently by the design.

The optimal values for several system parameters depend on each other.

Thus for the ray tracing architecture it is required to take into account

the available memory bandwidth, memory latency and delay, cache size,

packets in the system, pipeline depth of the internal pipelines and the kind

of scene to be handled efficiently. Therefore, in practice it is difficult to build

the perfect system, but using the described index it is possible to compare

different configurations of the hardware.

7.2.2 Graphics Hardware Quality Index

The hardware quality index described in the last Section has the disadvantage that it makes no statement about the quality of the ray tracing algorithm used, only about whether the algorithm is computed efficiently.

A different ray tracing algorithm might require fewer traversal steps to achieve the same result, but many more idle memory resources. Nevertheless it could be the better choice. The following graphics hardware quality index $Q_{GHW}$ can be used to compare different kinds of ray tracing and rasterization hardware algorithms, since it takes into account the performance in rays shot per cycle achieved by the algorithm.

$$Q_{GHW} = \frac{\text{rays per cycle}}{C_{AU} + C_{IM}} \cdot 1{,}000{,}000$$

The index $Q_{GHW}$ describes the number of rays a single gate of the circuit can shoot through the scene in 1,000,000 clock cycles. For rasterization hardware, the number of rays shot per cycle has to be replaced by the number of pixels rendered per cycle.
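A minimal Python sketch of this index is given below. Deriving the rays per cycle from a frame rate with one primary ray per pixel is an assumption made here for illustration, and the frame rate of 20 fps in the example is merely a value inside the range reported for the prototype, not a specific measurement.

```python
def primary_rays_per_cycle(frames_per_second: float, width: int, height: int,
                           clock_hz: float) -> float:
    """Rays per clock cycle, assuming exactly one primary ray per pixel."""
    return frames_per_second * width * height / clock_hz


def graphics_hardware_quality_index(rays_per_cycle: float,
                                    total_gates: float) -> float:
    """Rays shot by a single gate in 1,000,000 clock cycles."""
    return rays_per_cycle / total_gates * 1_000_000


# Illustrative configuration: 20 fps at 512x384 and 85 MHz, with the
# worst-case total of 2,807,671 gates from Table 7.5 as C_AU + C_IM.
rays = primary_rays_per_cycle(20, 512, 384, 85e6)
index = graphics_hardware_quality_index(rays, 2_807_671)
```

With these inputs the index comes out in the order of 0.01 to 0.02, the same order of magnitude as the axis range of the quality index plots.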

[Figure 7.7, two plots: ray tracing quality index (0 to 0.016) over the number of packets (0 to 70).]

Figure 7.7: Graphics Hardware Quality Index of the Dynamic Ray Tracing Core for the scenes Gael and Conference depending on the number of packets in the system.

Figure 7.7 shows the ray tracing quality of the prototype for two scenes. The maximal quality is again achieved with 32 packets in the system. As the rays shot per cycle are proportional to the usage of the arithmetic units, the hardware quality index and the graphics hardware quality index reach their maximum at the same position.

Unfortunately it is difficult to compute a fair quality index for today's rasterization hardware, as these chips support many extra features besides simple rasterization of triangles. In general, however, it can be said that for scenes consisting of few triangles the quality index for rasterization hardware will be much higher. In contrast, for scenes with several million triangles, ray tracing will become more efficient at some point.

7.2.3 Usage

The usage of a unit is the percentage of cycles where it is working. This

usage can be computed for the 4 most important units of the design and it

directly corresponds to the achieved performance.

Therefore it is an important task to adjust the system parameters in such a way that the usage is fairly high. The usage can be increased by using more packets in the system to fill the pipeline stages, or by larger caches to prevent long wait cycles for memory requests. Both parameters have to be increased with care, since too much internal memory is itself a drawback: the gates it requires compute nothing.

Figure 7.8 shows the usage of the individual units dependent on the

number of packets in the system. The usage increases with the number of

packets in the system as each packet can fill stages of the pipelines.

[Figure 7.8, two plots: usage (0 to 100) over the number of packets (0 to 70) for the traversal, list, transformation and intersection units.]

Figure 7.8: Usage of Units

There are several pipelines in the system that are separated by FIFO

queues (first in first out queues) and memory interfaces, and filled differently

by one packet. Thus a packet fills one pipeline stage in the traversal unit,

since the 4 rays of the packet are traversed in parallel, but normally 5 stages

in the transformation pipeline. This is because the rays of the packet are

transformed in sequence, which means first transforming the ray origin and

then the 4 ray directions.

It seems that the usage scales linearly in the number of packets in the

system. But this is only true if there are few packets in the system as the

number of total pipeline stages limits the linear scaling. Thus it is impossible

to increase the usage any more if the usage of one unit reaches nearly 100%.

Even the usage of the other units that is normally far below 100% cannot

be increased any more, as there is always a fixed ratio between the usage

values of the units for a given image.

In the limit, the curves of Figure 7.8 approach the maximal theoretical usage of each unit, and there is a fixed factor between any two curves that is independent of the number of packets in the system.

The frame rate as a function of the number of packets in the system corresponds directly to the usage of the individual units, because the usage of the units is proportional to the performance achieved.


[Figure 7.9, two plots: frames per second (0 to 25) over the number of packets (0 to 70).]

Figure 7.9: Frame Rate

7.2.4 Cache Hit Rate

The cache hit rates are an important aspect of ray tracing hardware al-

gorithms, since the required bandwidth behind the caches determines the

number of parallel working units that can be connected to the available

memory interface.

[Figure 7.10, two plots: cache hit rate (0 to 100) over the number of cache lines (0 to 600) for the traversal, list and transformation caches.]

Figure 7.10: Cache Hit Rate

Figure 7.10 shows the cache hit rate of the 3 types of caches depending on the number of cache lines for the Gael and Conference scenes. The required size of the direct mapped caches is extremely low, especially for the nodes. This is because 4 cache lines are required to map a complete matrix but only one to map a k-D tree node, and because the coherence of k-D tree nodes at the top of the k-D tree is much higher than for nodes at the bottom. The reason is that the subspace represented by a node at the top of the tree is much larger than that of a node near the leaves.

The cache hit rates for the triangle matrices are not satisfactory, but can be improved using more advanced cache strategies. Thus 2-way or 4-way caches should achieve much better cache hit rates in an ASIC implementation of the design.

7.2.5 Memory Bandwidth

One of the most critical points of most types of hardware is the memory

interface as it has to deliver the required bandwidth, otherwise the chip

cannot work to its limit.

One strategy that the ray tracing prototype uses to decrease the required

memory bandwidth is to traverse packets of rays in parallel. Here the k-D

tree nodes, list entries and matrices are fetched only once for 4 rays of a

packet. In spite of this optimization, the required memory bandwidth is

fairly high. Therefore it is necessary to use caches for each of the units in

the pipeline.

[Figure 7.11, two plots: frames per second (0 to 25) over the memory bandwidth in MB/s (0 to 1200).]

Figure 7.11: Achieved performance using 64 packets and 512 cache lines if the memory bandwidth is scaled by the memory clock ratio factor.

A point of interest is the memory bandwidth needed behind the caches, which is analysed in Figure 7.11. The maximal memory bandwidth of the RTC to the 3 SRAM chips is 1.02 GB/s at 85 MHz. The Figure shows how the performance drops if the memory bandwidth behind the caches is reduced to the specified value. Note that for most scenes it is possible to use 4 ray tracing pipelines in parallel, as scaling the memory bandwidth by a factor of 1/4 produces a drop in performance of only about 20%. The conference scene is an exception, as the performance drops sharply if the memory bandwidth is limited. This shows that larger or more efficient caches are required for this scene.

The data of Figure 7.11 can be used to compute a worst case frame rate if several parallel prototype RTCs, together with their small caches, are connected to the 1.02 GB/s memory interface. For instance, the performance of two RTC units at the 1.02 GB/s memory interface is higher than twice the performance that one unit reaches at a 0.5 GB/s memory interface. Figure 7.12 shows the performance that would be possible if the specified number of pipelines worked in parallel at the memory interface of 1.02 GB/s.
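The worst case estimate described above can be sketched as follows. The function is a hypothetical reading of the procedure: each of N units is assumed to receive an equal share of the total bandwidth, and its frame rate at that share is looked up in single-unit data such as Figure 7.11. The curve used in the example is a made-up placeholder, not measured data.

```python
def worst_case_fps(units: int, total_bandwidth_mb_s: float,
                   fps_at_bandwidth) -> float:
    """Lower bound on the frame rate of `units` RTCs sharing one interface.

    fps_at_bandwidth(b) must return the frame rate of a single RTC that is
    limited to b MB/s behind its caches.
    """
    share = total_bandwidth_mb_s / units
    return units * fps_at_bandwidth(share)


def placeholder_curve(bandwidth_mb_s: float) -> float:
    """Made-up saturating curve, used only to exercise the function."""
    return min(20.0, bandwidth_mb_s / 10.0)


estimate_4_units = worst_case_fps(4, 1020.0, placeholder_curve)
estimate_8_units = worst_case_fps(8, 1020.0, placeholder_curve)
```

The estimate is pessimistic because units that share one interface can use bandwidth that other units leave idle, which a fixed equal share does not capture.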

[Figure 7.12, two plots titled "Scalability, Scene Gael, 512x384, 85 MHz" and "Scalability, Scene Conference, 512x384, 85 MHz": frames per second over the number of RTC units (0 to 16).]

Figure 7.12: Scalability


7.2.6 Performance

The ray tracing prototype is able to achieve a real time performance of 10 to

30 frames per second for a wide range of scenes at a resolution of 512x384.

For a detailed overview of the reached performance for 4 test scenes see

Appendix A.

Depending on the routing achieved by the Xilinx software, maximal frequencies of 85 to 92 MHz are possible. For the statistics in Appendix A and the following performance values, the lower value of 85 MHz is used.

At a frequency of 85 MHz, the prototype has a floating point performance of 4.08 billion flops, which is a fairly low value compared to today's rasterization hardware. The frequency of the prototype cannot be increased much further, because the internal 18-bit wide multiplier blocks allow scaling to at most 110 MHz.

At most 85 million packet traversal steps per second can be performed, which is equivalent to 340 million single ray traversal steps. The transformation unit can transform approximately 68 million rays per second (if the packets are compressible), and consequently the same number of triangle intersections can be performed.
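These peak values can be reproduced with a few lines of arithmetic. The Python sketch below shows one plausible derivation: the flops figure is assumed to be the number of floating point units from Table 7.3 times the clock frequency, and a compressible packet is assumed to occupy 5 cycles of the transformation unit (one origin and four directions, as described in Section 7.2.3).

```python
CLOCK_HZ = 85e6
RAYS_PER_PACKET = 4

# Add, Mul, Div and Comp units of one pipeline according to Table 7.3
FP_UNITS = 16 + 11 + 5 + 16

# Assumption: every floating point unit delivers one result per cycle.
peak_flops = FP_UNITS * CLOCK_HZ

# One packet traversal step per cycle, four rays per packet.
ray_traversal_steps = RAYS_PER_PACKET * CLOCK_HZ

# Assumption: 5 transformation cycles per packet with a shared ray origin.
CYCLES_PER_COMPRESSED_PACKET = 5
rays_transformed = RAYS_PER_PACKET * CLOCK_HZ / CYCLES_PER_COMPRESSED_PACKET

print(peak_flops, ray_traversal_steps, rays_transformed)
```

Under these assumptions the script yields 4.08 billion flops, 340 million ray traversal steps and 68 million transformed rays per second, matching the figures given above.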

Chapter 8

Conclusion

This thesis has shown that creating special purpose real-time hardware for ray tracing is possible, even on FPGAs with their limited CLB and memory resources. The FPGA used is not the best available today, as there are new FPGA chips with about 60% more CLBs and four times more memory and multiplier blocks. These memory and multiplier resources in particular have been the most limiting factor in the prototype. Thus, using these new FPGAs, a ray tracing chip with two or four ray tracing pipelines should be possible.

By mapping the architecture to an ASIC it would be possible to do ray tracing at a resolution of 1024x768 in real time, even if some secondary rays are shot. This is because the capacity of today's high end ASICs is in the range of 52 million gates using a 0.095 µm silicon gate CMOS process. Since a ray tracing pipeline requires 2.8 million gates, at most 18 ray tracing units could be placed on the chip. But because programmable shaders are required, as well as some larger caches to supply the units working in parallel, 8 ray tracing pipelines per ASIC would be realistic. In conjunction with an increase of the frequency to about 266 MHz, a high end ASIC implementation would achieve about 20 times the performance of the prototype.
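The ASIC estimate can be checked with a short calculation. The Python sketch below assumes that performance scales linearly with both the number of pipelines and the clock frequency, which is an idealization made here and not a claim of the text.

```python
ASIC_GATES = 52_000_000       # capacity of a high end ASIC, as stated
PIPELINE_GATES = 2_800_000    # gates per ray tracing pipeline, as stated

max_pipelines = ASIC_GATES // PIPELINE_GATES

REALISTIC_PIPELINES = 8
ASIC_CLOCK_MHZ = 266
PROTOTYPE_CLOCK_MHZ = 85

# Idealized linear scaling in pipelines and clock frequency.
linear_speedup = REALISTIC_PIPELINES * ASIC_CLOCK_MHZ / PROTOTYPE_CLOCK_MHZ

print(max_pipelines, round(linear_speedup, 1))
```

The gate budget allows 18 pipelines, and ideal linear scaling with 8 pipelines at 266 MHz gives a factor of roughly 25. The factor of about 20 quoted above is somewhat lower than this idealized bound; how the remaining margin was estimated is not derived in the text.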

Because the described hardware architecture supports structured motion

the scene has to be partitioned into movable objects. A main part of the

traversal algorithm for such partitioned scenes is to transform the ray to the

local coordinate system of the object to continue the traversal in it. This

operation requires an affine ray transformation unit, which is fairly costly.

To reduce the required floating point resources on the chip, this transforma-

tion unit is also used to intersect with triangles. This is possible using the

described unit triangle intersection method. One further optimization was

to exploit the fact that since most packets of rays have the same ray origin

it needs only be transformed once for the packet.

The last Sections showed how optimal values for several system parameters, such as the number of packets and cache lines, can be computed. This is important when mapping the architecture to an ASIC, since for cost reasons the available gates have to be used as efficiently as possible.

In spite of the small caches it would be possible to use 2 or 4 ray tracing cores in parallel at the described memory interface delivering 1.02 GB per second. Using larger, more advanced caches and some cache hierarchy, it will be possible to use many more units in parallel.

Chapter 9

Future Work

Of course the development of the ray tracing prototype is not yet finished. To support larger scenes, cheaper DRAM resources should be used as a scene database. The Alpha Data development platform used contains 256 MB of DRAM on a 64 bit wide interface, but because of their simpler protocol only the SRAM resources have been used.

Although the top-level k-D tree was rebuilt fast enough on the host PC for our test scenes, hardware support for this operation should be added, especially if the number of objects gets too large. This hardware support should be available for k-D trees consisting of triangles too, because then vertex shaders can be used to modify the positions of the triangle vertices, followed by a k-D tree reconstruction.

Up to now the ray tracing prototype supports only a simple fixed eye

light shading model. This shader should be replaced by some programmable

special purpose shading CPUs that perform the color and secondary ray

computation. Shading CPUs are necessary because of the wide range of

shading models available for the ray tracing application.

The prototype uses a k-D tree as acceleration structure, but no analysis has been done as to whether this is the best choice for a hardware ray tracing approach. The k-D tree algorithm does seem to be the best choice in software based systems [3], but some other acceleration structures can be implemented using fairly simple traversal units. For the regular grid acceleration structure, for instance, there exist simple traversal algorithms based on integer arithmetic. This integer arithmetic leads to a much flatter traversal unit, which consequently requires fewer packets in the system. Furthermore, no stack is required in the grid traversal algorithm.

Chapter 10

Appendix A

The following Sections show statistics of the four test scenes used to test the prototype. The statistics diagrams shown are discussed in detail in Section 7.2. Unless specified otherwise, the standard configuration for the statistics is a resolution of 512x384, 64 packets in the system, 512 cache lines and the hardware optimized Hilbert curve.

The last two diagrams of each scene show a walk through the scene, to show typical frame rates that are achieved.

10.1 Office

Objects              1
Total Triangles      34,313
FPGA Scene Size      3.7 MB
Typical Frame Rate   20-30 fps
Resolution           512x384

[Plots for the scene Office at 512x384 and 85 MHz: frames per second and unit usage over the number of packets; hardware quality index and graphics hardware quality index over the number of packets (compare Figures 7.6 and 7.7); scalability (frames per second over the number of RTC units); cache hit rate over the number of cache lines; frames per second and unit usage over an axis from 0 to 1200 (presumably the memory bandwidth in MB/s, as in Figure 7.11); frames per second and unit usage over the frame number of the walk-through.]



10.2 Gael


[Plots for the scene Gael at 512x384 and 85 MHz: frames per second and unit usage over the number of packets; hardware quality index and graphics hardware quality index over the number of packets (compare Figures 7.6 and 7.7); scalability (frames per second over the number of RTC units); cache hit rate over the number of cache lines; frames per second and unit usage over an axis from 0 to 1200 (presumably the memory bandwidth in MB/s, as in Figure 7.11); frames per second and unit usage over the frame number of the walk-through.]



10.3 Conference


[Plots for the scene Conference at 512x384 and 85 MHz: frames per second and unit usage over the number of packets; hardware quality index and graphics hardware quality index over the number of packets (compare Figures 7.6 and 7.7); scalability (frames per second over the number of RTC units); cache hit rate over the number of cache lines; frames per second and unit usage over an axis from 0 to 1200 (presumably the memory bandwidth in MB/s, as in Figure 7.11); frames per second and unit usage over the frame number of the walk-through.]



10.4 Trees4000

Objects              4,000
Total Triangles      20 Million
FPGA Scene Size      3.4 MB
Typical Frame Rate   8-14 fps
Resolution           512x384

[Plots for the scene trees4000 at 512x384 and 85 MHz: frames per second and unit usage over the number of packets; hardware quality index and graphics hardware quality index over the number of packets (compare Figures 7.6 and 7.7); scalability (frames per second over the number of RTC units); cache hit rate over the number of cache lines; frames per second and unit usage over an axis from 0 to 1200 (presumably the memory bandwidth in MB/s, as in Figure 7.11); frames per second and unit usage over the frame number of the walk-through.]


Bibliography

[1] http://www.nvidia.com. Geforce3 - the world’s most advanced proces-

sor, 2001.

[2] Peter Shirley. Fundamentals of Computer Graphics. A K Peters Ltd,

June 2002.

[3] Vlastimil Havran. Heuristic Ray Shooting Algorithms. PhD thesis, Department of Computer Science and Engineering, Faculty of Electrical Engineering, Czech Technical University in Prague, http://www.cgg.cvut.cz/~havran/phdthesis.html, November 2000.

[4] Ingo Wald, Thomas Kollig, Carsten Benthin, Alexander Keller, and

Philipp Slusallek. Interactive Global Illumination using Fast Ray Trac-

ing. Rendering Techniques 2002, pages 15–24, 2002. (Proceedings of

the 13th Eurographics Workshop on Rendering).

[5] Ingo Wald and Philipp Slusallek. State-of-the-Art in Interactive Ray-

Tracing. In State of the Art Reports, Eurographics 2001, pages 21–42,

2001.

[6] Ingo Wald, Carsten Benthin, Markus Wagner, and Philipp Slusallek.

Interactive Rendering with Coherent Ray Tracing. Computer Graphics

Forum (Proceedings of EUROGRAPHICS 2001), 20(3), 2001.

[7] Ingo Wald, Philipp Slusallek, and Carsten Benthin. Interactive

Distributed Ray Tracing of Highly Complex Models. In Proceedings of the

12th EUROGRAPHICS Workshop on Rendering, June 2001. London.

[8] Jörg Schmittler, Ingo Wald, and Philipp Slusallek. SaarCOR – A

Hardware Architecture for Ray Tracing. In Proceedings of Eurographics

Workshop on Graphics Hardware, pages 27–36, 2002.


[9] John V. Oldfield and Richard C. Dorf. Field Programmable Gate

Arrays. Wiley-Interscience, January 1995.

[10] Michael John Sebastian Smith. Application-Specific Integrated Circuits.

Addison-Wesley, June 1997.

[11] Stuart A. Green and Derek J. Paddon. Exploiting coherence for

multiprocessor ray tracing. IEEE Computer Graphics and Applications,

9(6):12–26, 1989.

[12] Stuart A. Green and Derek J. Paddon. A highly flexible multiprocessor

solution for ray tracing. The Visual Computer, 6(2):62–73, 1990.

[13] Tony T.Y. Lin and Mel Slater. Stochastic Ray Tracing Using SIMD

Processor Arrays. The Visual Computer, pages 187–199, 1991.

[14] Michael J. Muuss. Towards real-time ray-tracing of combinatorial solid

geometric models. In Proceedings of BRL-CAD Symposium ’95, June

1995.

[15] M. J. Keates and Roger J. Hubbold. Interactive ray tracing on a

virtual shared-memory parallel computer. Computer Graphics Forum,

14(4):189–202, 1995.

[16] Steven Parker, Peter Shirley, Yarden Livnat, Charles Hansen, and

Peter Pike Sloan. Interactive ray tracing. In Interactive 3D Graphics

(I3D), pages 119–126, April 1999.

[17] Steven Parker, Michael Parker, Yarden Livnat, Peter Pike Sloan, Chuck

Hansen, and Peter Shirley. Interactive ray tracing for volume

visualization. IEEE Transactions on Visualization and Computer Graphics,

5(3), 1999.

[18] Steven Parker, Peter Shirley, Yarden Livnat, Charles Hansen, and

Peter Pike Sloan. Interactive ray tracing for isosurface rendering. In IEEE

Visualization ’98, 1998.

[19] Matt Pharr, Craig Kolb, Reid Gershbein, and Pat Hanrahan. Rendering

complex scenes with memory-coherent ray tracing. Computer Graphics,

31(Annual Conference Series):101–108, August 1997.

[20] Advanced Rendering Technologies. http://www.art-render.com.


[21] D. Hall. The AR350: Today’s ray trace rendering processor. In

Proceedings of the Eurographics/SIGGRAPH workshop on Graphics hardware

- Hot 3D Session 1, 2001.

[22] Hanspeter Pfister, Jan Hardenbergh, Jim Knittel, Hugh Lauer, and

Larry Seiler. The VolumePro real-time ray-casting system. In Computer

Graphics 31, pages 251–260, 1999.

[23] K. Mai, T. Paaske, N. Jayasena, R. Ho, W. Dally, and M. Horowitz.

Smart Memories: A Modular Reconfigurable Architecture. IEEE

International Symposium on Computer Architecture, 2000.

[24] Timothy Purcell. The SHARP Ray Tracing Architecture. SIGGRAPH

course on Interactive Ray Tracing, 2001.

[25] Timothy J. Purcell, Ian Buck, William R. Mark, and Pat Hanrahan.

Ray Tracing on Programmable Graphics Hardware. In Proceedings of

SIGGRAPH 2002, 2002.

[26] Ingo Wald, Carsten Benthin, and Philipp Slusallek. Distributed

Interactive Ray Tracing of Dynamic Scenes. In Proceedings of the IEEE

Symposium on Parallel and Large-Data Visualization and Graphics (PVG),

2003.

[27] Erik Reinhard, Brian Smits and Chuck Hansen. Dynamic acceleration

structures for interactive ray tracing. In Proceedings of SIGGRAPH,

2002.

[28] Allen Y. Chang. A Survey of Geometric Data Structures for Ray Trac-

ing. Technical report, Polytechnic University, October 2001.

[29] Emo Welzl. Smallest enclosing disks (balls and ellipsoids). In New

Results and New Trends in Computer Science (H. Maurer, ed.), pages

359–370. 1991.

[30] Alphadata. www.alpha-data.com.

[31] Xilinx, Virtex2-6000 FPGA. www.xilinx.com/virtex2.

[32] Peter Bellows and Brad Hutchings. JHDL - An HDL for Reconfigurable

Systems. Technical report, Department of Electrical and Computer

Engineering, www.jhdl.org.


[33] Xilinx. www.xilinx.com.
