    Simpler and Faster HLBVH with Work Queues

    Kirill Garanzha∗
    NVIDIA, Keldysh Institute of Applied Mathematics

    Jacopo Pantaleoni∗
    NVIDIA Research

    David McAllister∗
    NVIDIA

    Figure 1: Some of our test scenes, from left to right: Conference, Fairy Forest, Turbine Blade and Power Plant.

    Abstract

    A recently developed algorithm called Hierarchical Linear Bounding Volume Hierarchies (HLBVH) has demonstrated the feasibility of reconstructing the spatial index needed for ray tracing in real-time, even in the presence of millions of fully dynamic triangles. In this work we present a simpler and faster variant of HLBVH, where all the complex bookkeeping of prefix sums, compaction and partial breadth-first tree traversal needed for spatial partitioning has been replaced with an elegant pipeline built on top of efficient work queues and binary search. The new algorithm is both faster and more memory efficient, removing the need for temporary storage of geometry data for intermediate computations. Finally, the same pipeline has been extended to parallelize the construction of the top-level SAH-optimized tree on the GPU, eliminating round-trips to the CPU and accelerating the overall construction speed by a factor of 5 to 10x.

    CR Categories: I.3.2 [Graphics Systems (C.2.1, C.2.4, C.3)]: Stand-alone systems; I.3.7 [Three-Dimensional Graphics and Realism]: Color, shading, shadowing, and texture—Ray tracing;

    Keywords: ray tracing, real-time, spatial index

    1 Introduction

    For over 20 years, ray tracing has been considered an offline rendering technology. Historically, its need for a precomputed spatial index over the entire geometry dataset, such as a bounding volume hierarchy (BVH) or a k-d tree [Wald et al. 2007b], has been the highest barrier to overcome in making it usable at interactive rates. This barrier has started to fall in the last decade, as a new wave of graphics research began exploring the construction of acceleration structures at interactive rates on both serial [Wächter and Keller 2006; Wald et al. 2007a] and parallel architectures [Popov et al. 2006; Shevtsov et al. 2007; Wald 2007; Lauterbach et al. 2009; Zhou et al. 2008]. Finally, recent developments in massively parallel computing allowed tapping into the domain of real-time rendering, showing that the construction of bounding volume hierarchies can be performed in a few milliseconds even in the presence of millions of fully dynamic triangles [Pantaleoni and Luebke 2010].

    ∗e-mail: {kgaranzha,jpantaleoni,davemc}@nvidia.com

    In this paper we extend the work of Pantaleoni and Luebke [2010] by introducing a novel variant of their Hierarchical Linear Bounding Volume Hierarchies (HLBVH) that is at once simpler, faster and easier to generalize. Our first result is replacing their ad-hoc, complex mix of prefix-sum, compaction and partial breadth-first tree traversal primitives used to perform the actual object partitioning step with a single, elegant pipeline based on efficient work queues. Besides greatly simplifying the original algorithm and offering superior speed, the new pipeline also removes the need for all the additional temporary storage that was previously required. Our second result is the parallelization of the Surface Area Heuristic (SAH) optimized HLBVH hybrid. This combines the added flexibility of our task-based pipeline with the efficiency of a parallel binning scheme [Wald 2007]. Overall, this allows us to get a speedup factor of up to 10x over the state of the art. Beyond this speedup, parallelizing the entire pipeline lets it all run on the GPU, thereby eliminating the costly copies between the CPU and GPU memory spaces.

    As with the original HLBVH, all our algorithms are implemented in CUDA [Nickolls et al. 2008], relying on freely available efficient sorting primitives [Merrill and Grimshaw 2010]. On an NVIDIA GTX 480 GPU, the resulting system can build a very high quality BVH in under 10.5 ms for a model with over 1.75 million polygons.

    2 Background

    LBVH and HLBVH: The basis of our work is, as for the previous version of HLBVH, the original idea introduced by Lauterbach et al. [2009], who provided a novel BVH construction algorithm. The principle is very simple: the 3D extent of the scene is discretized using n bits per dimension, and each point is assigned a linear coordinate along a space-filling Morton curve of order n (which can be computed by interleaving the binary digits of the discretized coordinates). Primitives are then sorted according to the Morton code of their centroid. Finally, the hierarchy is built by grouping the primitives in clusters with the same 3n-bit code, then grouping the clusters with the same 3(n−1) high-order bits, and so on, until a complete tree is built. As the 3m high-order bits of a Morton code identify the parent voxel in a coarse grid with 2^m divisions per side, this process corresponds to recursively splitting the primitives at the spatial middle, from top to bottom.
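
    For concreteness, the bit-interleaving step can be sketched in CUDA device code as follows. This is a minimal sketch, not code from the paper: it assumes centroids already normalized to the scene bounding box, and uses a standard magic-constant formulation of the bit expansion; the helper names (expand_bits, morton30) are ours.

        // Spread the low 10 bits of v so two zero bits separate each
        // original bit: bit i of the input moves to bit 3i of the output.
        __device__ unsigned int expand_bits(unsigned int v)
        {
            v = (v * 0x00010001u) & 0xFF0000FFu;
            v = (v * 0x00000101u) & 0x0F00F00Fu;
            v = (v * 0x00000011u) & 0xC30C30C3u;
            v = (v * 0x00000005u) & 0x49249249u;
            return v;
        }

        // 30-bit Morton code (n = 10 bits per dimension) for a centroid
        // with coordinates normalized to [0,1]^3.
        __device__ unsigned int morton30(float x, float y, float z)
        {
            unsigned int xx = expand_bits((unsigned int)fminf(fmaxf(x * 1024.0f, 0.0f), 1023.0f));
            unsigned int yy = expand_bits((unsigned int)fminf(fmaxf(y * 1024.0f, 0.0f), 1023.0f));
            unsigned int zz = expand_bits((unsigned int)fminf(fmaxf(z * 1024.0f, 0.0f), 1023.0f));
            return (xx << 2) | (yy << 1) | zz;
        }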

    HLBVH has improved on the basic algorithm in two ways: first, it provided a faster construction algorithm applying a compress-sort-decompress strategy to exploit spatial and temporal coherence in the input mesh. Second, it introduced a high-quality hybrid builder, in which the top of the hierarchy is built using a Surface Area Heuristic (SAH) [Goldsmith and Salmon 1987] sweep builder over the clusters defined by the voxelization at level m.

    Ingo Wald [Wald 2010] reports a parallel binned-SAH BVH builder optimized for a prototype many-core architecture called MIC. Similarly to what we do, he builds a custom scheduler based on task-queues to essentially implement a very light-weight threading model, avoiding the overheads of the built-in hardware thread support. In order to achieve maximum speed, he spends significant effort quantizing bounding box extents to minimize memory traffic and packing and accessing data in an ad-hoc Array of Structures of Arrays format. Moreover, to perform the binning step he uses a substantial amount of local storage to keep duplicate per-thread bin statistics and avoid frequent use of atomic instructions on a shared set, requiring an additional parallel merging stage. All this seems to be unnecessary on our target architecture, most probably due to the much larger number of supported concurrent hardware threads and the consequent better latency hiding capabilities. Results are reported only for rather small scenes, and the achieved speed is significantly lower (4 to 5 times) than that offered by our hybrid SAH builder.

    Uniform Grids: Kalojanov and Slusallek [2009] introduced real-time algorithms for the construction of uniform grids. While relatively fast, the construction times for this method are higher than those required by HLBVH, most notably due to the fact that spatial partitioning requires object duplication, a data amplification step that translates into higher bandwidth requirements. Moreover, the resulting spatial indices often lead to significantly lower traversal performance during ray tracing, as they don't allow efficient skipping of empty space. Hierarchical variants [Kalojanov et al. 2011] improve both construction times and rendering performance, but they generally remain slower.

    3 Algorithm Overview

    As with the original HLBVH [Pantaleoni and Luebke 2010], our algorithms can be used to create both a standard LBVH and a higher quality SAH hybrid. However, as the standard LBVH involves a subset of the work needed for creating the high quality variant, we will concentrate on the description of the latter. Similarly to the original LBVH [Lauterbach et al. 2009], our algorithm starts by sorting primitives along a 30-bit Morton curve spanning the scene's bounding box. Unlike Pantaleoni and Luebke [2010], who used compress-sort-decompress to accelerate sorting, we perform this stage of the algorithm relying on a brute-force but more efficient least-significant-digit radix sorting algorithm that was not available at the time [Merrill and Grimshaw 2010]. However, following Pantaleoni and Luebke's observation that Morton codes define a hierarchical grid, where each 3n-bit code identifies a unique voxel in a regular grid with 2^n entries per side, and where the first 3m bits of the code identify the parent voxel in a coarser grid with 2^m subdivisions per side, we then proceed to form coarse clusters of objects falling in each 3m-bit bin. This is again an instance of a run-length encoding compression algorithm, and can be implemented with a single compaction operation.

    Figure 2: Efficient queuing with warp-wide reductions. Each thread computes how many elements to insert in the output queue independently, but the actual output slots are computed by a warp-synchronous reduction and a single per-warp atomic to a global counter. Different warps can proceed independently.

    After the clusters are identified, we partition all primitives inside each cluster using LBVH-style spatial middle splits, and then create a top-level tree, partitioning the clusters themselves with a binned SAH builder similar in spirit to the one described by Wald [2007]. Both the spatial middle split partitioning and the SAH builder rely on an efficient task-queue system to parallelize work over the individual nodes of the output hierarchies. The next sections describe the main parts of this system in detail.

    3.1 Task Queues

    The original LBVH performed partitioning in a breadth-first manner, creating the nodes of the final hierarchy level by level, starting from the root. As each input node could output either 0 or 2 nodes, each pass performed a prefix sum to compute the output position of the two children. HLBVH improved on this by performing a partial breadth-first traversal, where several levels were processed at the same time, outputting an entire treelet for each input node. The disadvantage of this approach is that processing each level or set of levels required several separate kernel launches, which is a relatively high-latency operation.

    Figure 3: Assignment of the Morton codes to the primitive centroids (a 2D projection and 4-bit Morton codes are shown). Bottom: sorted sequence of the primitives, where Morton codes are used as keys (for every respective primitive the Morton code bits are shown in separate rows; binary search partitions are shown with bold red lines).

    We take an entirely different approach, relying on simple task queues, where an individual task corresponds to processing a single node. At run time, each warp (the physical SIMT unit of work on NVIDIA GPUs) continues fetching sets of tasks to process from the input queue, with each set containing one task per thread, using a single global memory atomic add per warp to update the queue head. After each thread in a warp has computed the number of output tasks it will generate, all threads in the warp participate in a warp-wide prefix sum to compute the offset of their output tasks relative to the warp's common base. Finally, the first thread in the warp performs a single global memory atomic add to compute the base address in the output queue. By using a separate queue per level, we can do all the processing inside a single kernel call, while at the same time producing a breadth-first tree layout. The process is summarized in Figure 2.
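
    A minimal CUDA sketch of this per-warp emission scheme is given below. It is our own illustration, not code from the paper: it assumes a full active warp, uses modern warp shuffle intrinsics (which post-date the paper), and the names reserve_output_slots and out_tail are hypothetical.

        // Each thread decides how many output tasks it will emit (0 or 2
        // for a middle split); the warp computes an inclusive prefix sum
        // over these counts and lane 0 reserves space in the output queue
        // with a single atomicAdd.
        __device__ int reserve_output_slots(int n_out, int* out_tail)
        {
            const unsigned mask = 0xffffffffu;   // assumes all 32 lanes active
            int lane = threadIdx.x & 31;

            // Inclusive warp-wide prefix sum of the per-thread counts.
            int scan = n_out;
            for (int d = 1; d < 32; d <<= 1) {
                int v = __shfl_up_sync(mask, scan, d);
                if (lane >= d) scan += v;
            }
            int warp_total = __shfl_sync(mask, scan, 31);

            // One global atomic per warp to move the queue head.
            int base = 0;
            if (lane == 0) base = atomicAdd(out_tail, warp_total);
            base = __shfl_sync(mask, base, 0);

            // This thread's slots start at base plus its exclusive prefix.
            return base + (scan - n_out);
        }

    Each thread then writes its n_out output tasks starting at the returned index.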

    3.2 Middle Split Hierarchy Emission

    Once again, we deviate from the previous approaches used to perform middle splits in LBVH and HLBVH, while generating exactly the same primitive partitioning. We recall that each node corresponds to a consecutive range of primitives sorted by their Morton codes, and that splitting a node requires finding the first element in the range whose code differs from the preceding element.

    Previously, this was achieved in a backwards fashion: rather than processing a single node per thread, with each thread looking for such a split point, each thread processed a single Morton code and tried to detect whether it differed from its predecessor. If it did, it then elected the point as a split plane for its parent node. Unfortunately, locating the parent node first required precomputing a mapping between the primitives and the containing ranges, which in turn required another prefix sum.

    Figure 4: The middle-split queues corresponding to Figure 3.

    We avoid this complex machinery by reverting to the standard ordering that would be used on a serial device: we map each node to a single thread, and let each thread find its own split plane. However, instead of simply looping through the entire range of primitives in the node, we observe that it is possible to reformulate the problem as a simple binary search. In fact, if the node is located at level l, we know that the Morton codes of its primitives will have the exact same set of high l − 1 bits. If we find the first bit p ≥ l by which the first and the last Morton code in the node's range differ, it is then sufficient to perform a binary search to locate the first Morton code that contains a 1 at bit p. The process is illustrated in Figures 3 and 4.
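
    A sketch of this per-node split search follows. It assumes the node's sorted Morton codes occupy indices [first, last]; the function and variable names are ours, not from the paper.

        // Find the split index for one node: locate the highest bit p at
        // which the first and last codes differ, then binary-search for
        // the first code with a 1 at bit p. Touches O(log2 N) codes.
        __device__ int find_split(const unsigned int* codes, int first, int last)
        {
            unsigned int diff = codes[first] ^ codes[last];
            if (diff == 0)
                return -1;                       // all codes equal: no split

            int p = 31 - __clz((int)diff);       // highest differing bit

            // Invariant: codes[lo] has a 0 at bit p, codes[hi] has a 1.
            int lo = first, hi = last;
            while (lo + 1 < hi) {
                int mid = (lo + hi) / 2;
                if (codes[mid] & (1u << p)) hi = mid;
                else                        lo = mid;
            }
            return hi;                           // first code with a 1 at bit p
        }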

    The reason this process is particularly efficient is that, for a node containing N primitives, it finds the split plane by touching only O(log2(N)) memory cells. The original approach touched and processed the entire set of N Morton codes.

    Large Leaf Refinement: Middle splits can sometimes fail, leading to occasional large leaves. When such a failure is detected, it is easy to split these leaves at the object median.

    Bounding Box Computation: After the topology of the BVH has been computed, we run a simple bottom-up refitting procedure to compute the bounding boxes of each node in the tree. The process is made particularly simple by the fact that our BVH is stored in breadth-first order. We use one kernel launch per tree level and one thread per node in the level.
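
    A per-level refit kernel might look like the following sketch. The node layout, the leaf encoding and all names are our own assumptions; the paper only states one kernel launch per level and one thread per node, with the host looping from the deepest level up to the root.

        struct Node {
            float3 bbox_min, bbox_max;
            int    left, right;    // child indices; right < 0 marks a leaf
        };

        __device__ float3 min3(float3 a, float3 b) {
            return make_float3(fminf(a.x, b.x), fminf(a.y, b.y), fminf(a.z, b.z));
        }
        __device__ float3 max3(float3 a, float3 b) {
            return make_float3(fmaxf(a.x, b.x), fmaxf(a.y, b.y), fmaxf(a.z, b.z));
        }

        // One thread per node of the current level [level_begin, level_end).
        __global__ void refit_level(Node* nodes, int level_begin, int level_end)
        {
            int i = level_begin + blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= level_end) return;

            Node& n = nodes[i];
            if (n.right < 0) return;   // leaf boxes are set from primitives

            // Children live in deeper levels and are already refitted.
            n.bbox_min = min3(nodes[n.left].bbox_min, nodes[n.right].bbox_min);
            n.bbox_max = max3(nodes[n.left].bbox_max, nodes[n.right].bbox_max);
        }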

    3.3 Parallel Binned SAH Builder

    As previously discussed, we run a SAH-optimized tree construction algorithm over the coarse clusters defined by the first 3m bits of our Morton curve, with m typically between 5 and 7. Before proceeding to the detailed description of the algorithm, we note that our algorithms run in a bounded memory footprint: if we process Nc clusters, we need to preallocate space only for 2Nc − 1 nodes, since a binary tree with Nc leaves contains exactly Nc − 1 internal nodes.

    Figure 5: Pseudo-code for the SAH binning procedure.

    Pseudo-code for the whole process is given in Figure 5.

    In this pass we treat a cluster from the prior pass, with its aggregate bounding box, as a primitive. Once again, we split the computation into split-tasks organized in a single input queue and a single output queue. Each task corresponds to a node that needs to be split, and is described by three input fields: the node's bounding box, the number of clusters inside the node, and the node id. Two more fields are computed on-the-fly: the best split plane and the id of the first child split-task. As shown in Figure 6, we store these in structure-of-arrays (SOA) format, keeping five separate arrays indexed by task id. Additionally, we keep an array cluster_split_id that maps each cluster to the current node (i.e. split-task) it belongs to, which we update with every splitting operation.
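
    The task layout might be rendered as follows (a hypothetical sketch of the five arrays; the type and field names are ours, but the cluster distribution fragment below indexes exactly these fields):

        struct AABB { float3 lo, hi; };

        // SOA split-task queue: five arrays indexed by task id.
        struct SplitTaskQueue {
            AABB* bbox;          // input: node bounding box
            int*  cluster_count; // input: number of clusters in the node
            int*  node_id;       // input: id of the output BVH node
            int*  best_split;    // computed on the fly: chosen split bin
            int*  new_task;      // computed on the fly: first child task id
        };

        SplitTaskQueue queue[2]; // queue[in] and queue[out], swapped per iteration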

    The loop starts by assigning all clusters to the root node, forming split-task 0. Then, for each loop iteration, we perform the following three steps:

    binning: a step analogous to that described by Wald [2007], where each node's bounding box is split into M, typically eight, slab-shaped bins in each dimension. A bin stores an initially empty bounding box and a count. We accumulate each cluster's bounding box into the bin containing its centroid, and atomically increment the count of the number of clusters falling within the bin. This procedure is executed in parallel across the clusters: each thread looks at a single cluster and accumulates its bounding box into the corresponding bin within the corresponding split-task, using atomic min/max to grow the bins' bounding boxes.
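
    A sketch of this binning kernel for a single axis follows. The Bin layout, the order-preserving float encoding and all names are our illustration only; the paper specifies just the atomic count increments and atomic min/max growth of the bin bounds.

        // Bin bounds are stored as order-preserving unsigned ints so the
        // boxes can be grown with integer atomicMin/atomicMax. lo[] fields
        // are initialized to 0xFFFFFFFF and hi[] fields to 0 beforehand.
        struct Bin {
            unsigned int lo[3], hi[3];   // encoded AABB
            int          count;
        };

        __device__ unsigned int float_flip(float f)  // order-preserving encoding
        {
            unsigned int u = (unsigned int)__float_as_int(f);
            return (u & 0x80000000u) ? ~u : (u | 0x80000000u);
        }

        // One thread per cluster; x axis shown, the paper bins all three.
        __global__ void bin_clusters_x(const float3* c_min, const float3* c_max,
                                       const float3* centroid,
                                       const int* cluster_split_id, int* cluster_bin_id,
                                       const float* task_lo_x, const float* task_inv_w_x,
                                       Bin* bins, int num_clusters, int M)
        {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i >= num_clusters) return;

            int task = cluster_split_id[i];
            int b = min(M - 1, (int)((centroid[i].x - task_lo_x[task]) *
                                     task_inv_w_x[task] * M));
            cluster_bin_id[i] = b;

            Bin* bin = &bins[task * M + b];
            atomicAdd(&bin->count, 1);
            atomicMin(&bin->lo[0], float_flip(c_min[i].x));
            atomicMax(&bin->hi[0], float_flip(c_max[i].x));
            // (y and z components of the box are grown the same way)
        }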

    Figure 6: Data-flow visualization of the SAH binning procedure. First, clusters (in orange and green) contribute to forming the bin statistics of their parent node. Finally, nodes in the input task queue are split, generating 2 entries into the output queue (in pink).

    SAH evaluation: for each split-task in the input queue, we evaluate the surface area metric for all the split planes in each dimension between the uniformly distributed bins and select the best one. If the split-task contains a single cluster, we stop the subdivision; otherwise, we create two output split-tasks, with bounding boxes corresponding to the left and right subspaces determined by the SAH split.
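
    For reference, the quantity minimized here is the standard greedy binned-SAH cost, which the paper does not write out; in the usual form, a candidate plane p splitting a node into child bounds B_L and B_R holding N_L and N_R clusters is scored as

        C(p) = SA(B_L(p)) · N_L(p) + SA(B_R(p)) · N_R(p),

    where SA denotes surface area. The plane minimizing C(p) over the M − 1 candidate positions per axis can be found in O(M) per axis with one left-to-right and one right-to-left sweep over the bins' counts and boxes.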

    cluster distribution: the mapping between clusters and split-tasks is updated, mapping each cluster to one of the two output split-tasks generated by its previous owner. In order to determine the new split-task id, it is sufficient to compare the i-th cluster's bin id to the value stored in the best_split field of the corresponding split-task:

        int old_id   = cluster_split_id[i];           // current owner task
        int bin_id   = cluster_bin_id[i];             // bin of this cluster
        int split_id = queue[in].best_split[old_id];  // chosen split bin
        int new_id   = queue[in].new_task[old_id];    // first child task id
        // Clusters in bins left of the split go to the first child,
        // the rest to the second.
        cluster_split_id[i] = new_id + (bin_id < split_id ? 0 : 1);

    In conclusion, note that there is some flexibility in the order of the algorithm phases. For example, refitting can be performed separately for bottom-level and top-level phases to trade off cluster bounding box precision against parallelism.

    Scene             # Triangles   # 15-bit     BVH Memory          Build Time
                                    Clusters     final     temp      new       old
    Fairy             174k          2.4k         4 MB      33 MB     4.8 ms    23 ms
    Conference        282k          2.5k         6.5 MB    36 MB     6.2 ms    45 ms
    Stanford Dragon   871k          2.1k         20 MB     51 MB     8.1 ms    81 ms
    Turbine Blade     1.76M         2.3k         42 MB     75 MB     10.5 ms   137 ms
    Power Plant       12.7M         2.0k         290 MB    367 MB    62.0 ms   -

    Table 1: Build time and memory consumption statistics for various scenes. The Build Time columns report the building times for both our new binned SAH HLBVH implementation ("new") and the one from Pantaleoni and Luebke ("old"). The Power Plant could not be built with the previous version of HLBVH due to its higher memory usage.

    Stage                   HLBVH
    Morton code setup       0.19 ms
    radix sort              1.64 ms
    top-level SAH build     2.17 ms
    middle split emission   3.1 ms
    AABB refitting          1.0 ms

    Table 2: Timing breakdown for the Stanford Dragon.

    4 Results

    We have implemented all our algorithms using CUDA. The resulting source code will be freely available at http://code.google.com/p/hlbvh/.

    We have run our algorithms on a variety of typical ray tracing scenes of varying complexity: the Fairy Forest, the Conference, the Stanford Dragon, the Turbine Blade and the Power Plant (Figure 1). Our benchmark system uses a GeForce GTX 480 GPU with 1.5 GB of GPU memory, and a 3 GHz 4-core AMD Phenom with 16 GB DDR3 memory, running CUDA 3.2 and Windows XP-64.

    In Table 1 we report absolute build times for our new implementations of the SAH-based HLBVH. In all cases, our trees used 4 primitives per leaf. Table 2 provides a more detailed breakdown of the timings of the individual components of our builders on one scene.

    Besides measuring build performance, we measured the quality of the trees produced by both our algorithms. Figure 7 shows both the traversal performance predicted using the SAH cost metric [Goldsmith and Salmon 1987], and the measured performance in the context of a 3-bounce GPU path tracer.

    5 Summary and Discussion

    First, we have presented a novel implementation of HLBVH based on generic task queues, demonstrating how this flexible paradigm of work dispatching can be used to build simple and fast parallel algorithms. Second, we used the same mechanism to implement a massively parallel binned SAH builder for the high quality HLBVH variant. Since this is implemented on the GPU, the synchronization and memory copies between CPU and GPU are eliminated. Accounting for the elimination of these overheads, the resulting builder is 5 to 10 times faster than the previous state of the art; considering the kernel times alone, it is up to 3 times faster. The new implementation can produce high quality bounding volume hierarchies in real-time even for moderately complex models.

    The new algorithms are faster than the original HLBVH implementation despite the fact that they don't exploit any spatial or temporal coherence in the input meshes. This is possible thanks to the general simplification offered by the adoption of work queues, which allows significantly reducing the number of high-latency kernel launches and reducing data transformation passes.

    Figure 7: Predicted and measured traversal performance of middle-split HLBVH, our binned SAH-based HLBVH and a classical SAH sweep builder. The performance measurements were done in the context of 3-bounce path tracing. In the left column we report 1/SAH cost; on the right, actual traversal speed. All the values in the chart are normalized to 1 relative to the SAH sweep builder. Smaller values are better.

    5.1 Future Work

    As part of future work, we plan to integrate specialized builders for clusters of fine intricate geometry, such as hair, fur and foliage. Further on, this work can be easily integrated with triangle splitting strategies such as those described by Ernst and Greiner [2007]. Another interesting avenue is to try re-incorporating some of the original compress-sort-decompress techniques to exploit coherence internal to the mesh.

    5.2 Acknowledgements

    We thank the University of Utah for the Fairy scene, Anat Grynberg and Greg Ward for the Conference scene and the University of North Carolina for the Power Plant model. We are grateful to the anonymous reviewers for their valuable comments.

    References

    Ernst, M., and Greiner, G. 2007. Early split clipping for bounding volume hierarchies. In Symposium on Interactive Ray Tracing, 73–78.

    Goldsmith, J., and Salmon, J. 1987. Automatic creation of object hierarchies for ray tracing. IEEE Computer Graphics and Applications 7, 5, 14–20.

    Kalojanov, J., and Slusallek, P. 2009. A parallel algorithm for construction of uniform grids. In Proceedings of the Conference on High Performance Graphics 2009, ACM, New York, NY, USA, HPG '09, 23–28.

    Kalojanov, J., Billeter, M., and Slusallek, P. 2011. Two-level grids for ray tracing on GPUs. Computer Graphics Forum (4).

    Lauterbach, C., Garland, M., Sengupta, S., Luebke, D., and Manocha, D. 2009. Fast BVH construction on GPUs. Computer Graphics Forum 28, 2, 375–384.

    Merrill, D., and Grimshaw, A. 2010. Revisiting sorting for GPGPU stream architectures. Tech. Rep. CS2010-03, Department of Computer Science, University of Virginia, February.

    Nickolls, J., Buck, I., Garland, M., and Skadron, K. 2008. Scalable parallel programming with CUDA. ACM Queue 6, 2, 40–53.

    Pantaleoni, J., and Luebke, D. 2010. HLBVH: Hierarchical LBVH construction for real-time ray tracing of dynamic geometry. In High-Performance Graphics 2010, ACM Siggraph / Eurographics Symposium Proceedings, Eurographics, 87–95.

    Popov, S., Günther, J., Seidel, H.-P., and Slusallek, P. 2006. Experiences with streaming construction of SAH kd-trees. In Proceedings of the 2006 IEEE Symposium on Interactive Ray Tracing, IEEE Computer Society, 89–94.

    Shevtsov, M., Soupikov, A., and Kapustin, E. 2007. Highly parallel fast kd-tree construction for interactive ray tracing of dynamic scenes. Computer Graphics Forum 26, 3, 395–404.

    Wald, I., Boulos, S., and Shirley, P. 2007. Ray tracing deformable scenes using dynamic bounding volume hierarchies. ACM Transactions on Graphics 26, 1, 485–493.

    Wald, I., Mark, W. R., Günther, J., Boulos, S., Ize, T., Hunt, W., Parker, S. G., and Shirley, P. 2007. State of the art in ray tracing animated scenes. In Eurographics 2007 State of the Art Reports, Eurographics.

    Wald, I. 2007. On fast construction of SAH-based bounding volume hierarchies. In Proceedings of the 2007 Eurographics/IEEE Symposium on Interactive Ray Tracing, Eurographics.

    Wald, I. 2010. Fast construction of SAH BVHs on the Intel Many Integrated Core (MIC) architecture. IEEE Transactions on Visualization and Computer Graphics (to appear).

    Wächter, C., and Keller, A. 2006. Instant ray tracing: The bounding interval hierarchy. In Rendering Techniques 2006 – Proceedings of the 17th Eurographics Symposium on Rendering, Eurographics, 139–149.

    Zhou, K., Hou, Q., Wang, R., and Guo, B. 2008. Real-time kd-tree construction on graphics hardware. ACM Trans. Graph. 27, 5, 1–11.