An Efficient Alias-free Shadow Algorithm for Opaque and Transparent Objects using per-triangle Shadow Volumes

Erik Sintorn∗, Ola Olsson†, Ulf Assarsson‡

Chalmers University Of Technology


Figure 1: Images rendered with the novel shadow algorithm. All images rendered in 1024x1024, time taken to generate shadow buffers in parentheses. (a) Pixel accurate hard shadows in a game scene (7.29ms, 60k triangles). (b) Alpha-textured shadow casters (13ms, 35k triangles). (c) Colored transparent shadows. Image rendered using depth peeling of 8 layers (75.66ms, 5-19 ms per layer, 60k triangles).

Abstract

This paper presents a novel method for generating pixel-accurate shadows from point light-sources in real-time. The new method is able to quickly cull pixels that are not in shadow and to trivially accept large chunks of pixels, thanks mainly to using the whole triangle shadow volume as a primitive, instead of rendering the shadow quads independently as in the classic Shadow-Volume algorithm. Our CUDA implementation outperforms z-fail consistently and surpasses z-pass at high resolutions, although these latter two are hardware accelerated, while inheriting none of the robustness issues associated with these methods. Another, perhaps even more important, property of our algorithm is that it requires no pre-processing or identification of silhouette edges and so robustly and efficiently handles arbitrary triangle soups. In terms of view sample test and set operations performed, we show that our algorithm can be an order of magnitude more efficient than z-pass when rendering a game-scene at multi-sampled HD resolutions. We go on to show that the algorithm can be trivially modified to support textured, semi-transparent and colored semi-transparent shadow-casters and that it can be combined with either depth-peeling or stochastic transparency to also support transparent shadow receivers. Compared to recent alias-free shadow-map algorithms, our method has a very small memory footprint, does not suffer from load-balancing issues, and handles omni-directional lights without modification. It is easily incorporated into any deferred rendering pipeline and combines many of the strengths of shadow maps and shadow volumes.

CR Categories: I.3.7 [Computer Graphics]: Three-Dimensional Graphics and Realism—Color, shading, shadowing, and texture;

Keywords: shadows, alias-free, real time, transparency

∗e-mail: [email protected]  †e-mail: [email protected]  ‡e-mail: [email protected]


1 Introduction

Generating accurate shadows from point light-sources for each pixel sample remains a challenging problem for real-time applications. Despite generations of research, we have yet to see a pixel-accurate shadow-algorithm for point lights that requires no pre-processing, works on any arbitrary set of triangles and that runs at stable real-time frame rates for typical game-scenes on consumer level hardware. Traditional shadow mapping [Williams 1978] techniques generate shadows from a discretized image representation of the scene and so alias when queried for light visibility in screen space. Real-time techniques based on irregular rasterization [Sintorn et al. 2008] tend to generate unbalanced workloads that fit current GPUs poorly and consequently, frame rates are often very unstable. Real-time ray tracing algorithms rely heavily on geometry pre-processing to generate efficient acceleration structures. Finally, robust implementations of the Shadow-Volume algorithm require pre-processing the mesh to find edge connectivity, work poorly or not at all for polygon soups without connectivity, and have frame rates that are all but stable as the view of a complex scene changes. Nevertheless, the idea of directly rasterizing the volumes that represent shadows onto the view samples remains compelling. In the upcoming sections, we hope to convince the reader that this basic idea is sound and that choosing the right manner of rasterization is the key to efficiently generating shadows using shadow volumes.

Figure 2: A point lies in the shadow volume of a triangle if it lies behind all planes CAB, DAC, BAD and BCD. To optimize culling of tiles, let E be the eye position and test against the planes EDA and EAB, in order to get a less conservative tile-test (see Section 3.5). These two planes are depicted with red dashed lines.

The basic algorithm described in this paper can be easily summarized. We render the shadow volume of each triangle, see Figure 2, onto a depth buffer hierarchy generated from the standard depth buffer after rendering the camera view of the scene. A node in this hierarchy is a texel at some level, containing the min and max depths, and defines a bounding box in normalized device coordinates (see Section 3). To avoid confusion with traditional shadow volumes, which represent shadows from an object, we will refer to our single triangle shadow volumes as shadow frustums. A shadow frustum can either intersect a node, in which case the child-nodes must be inspected; completely envelop a node, in which case the node and all its child-nodes are in shadow and can be flagged as such; or the node can lie completely outside the frustum, in which case all child-nodes can be abandoned. We have implemented this algorithm in CUDA, where it takes the form of a hierarchical rasterizer operating on shadow frustum primitives. The algorithm has many things in common with the traditional Shadow-Volume algorithm, but by considering every shadow frustum separately (as opposed to forming the shadow volume from the silhouette edges of an object) and by maintaining the complete frustum throughout traversal (as opposed to splitting it into per-edge shadow-quads), we manage to elegantly steer clear of the many quirks and robustness issues that are associated with other Shadow-Volume algorithms.

Our main contribution is an accurate and efficient algorithm for determining light source visibility for all view samples which:

• works for any arbitrary triangle-soup without pre-processing.

• is demonstrated to be very efficient in the amount of work performed per shadowed view sample.

• has a very low memory footprint.

• trivially extends to allow for textured shadow-casters.

• trivially extends to allow for semi-transparent and colored semi-transparent shadow-casters.

• is easy to integrate into any deferred rendering pipeline.

We show that the shadow frustum–depth hierarchy traversal, as well as interpolation of per-vertex attributes, can be done entirely in homogeneous clip space, thereby eliminating any potential problems with near and far clip planes. Additionally, we suggest a novel method where binary light visibility is recorded stochastically per view sample with a probability equal to the opacity of the shadow-casting triangle. This allows us to store a single bit of visibility information per sample in Multi-Sample Anti Aliasing (MSAA), Coverage-Sample Anti Aliasing (CSAA) or Super-Sample Anti Aliasing (SSAA) rendering contexts and still get a correct-on-average visibility result when the pixel is resolved. We also propose a scheme to anti-alias the shadow-edges without actually using more than one depth-value per pixel in the depth-hierarchy. This method is equivalent to PCF filtering of an infinite resolution shadow map and, like PCF filtering, gives occasionally incorrect but visually pleasing results.

2 Previous Work

For a thorough overview of real-time shadow algorithms, we refer the reader to [Eisemann et al. 2009]. Below, we will review the work most relevant to the algorithm presented in this paper. Algorithms for rendering hard shadows in real-time can be roughly sorted into three categories:

Shadow mapping Today shadow mapping [Williams 1978] and related techniques constitute the de-facto standard shadowing algorithm for real-time applications, despite suffering from quite severe aliasing artifacts. This widespread adoption has come about due to good hardware support, ability to handle arbitrary geometry, and low variability in frame times.

Another reason for the popularity is that the shadow map can be filtered to hide artifacts, mimicking the effect of an area light-source. Filtering during lookup usually requires a large number of samples [Reeves et al. 1987; Fernando 2005], whereas pre-filtering requires additional attributes, which increases storage and bandwidth requirements [Donnelly and Lauritzen 2006; Annen et al. 2008]. Filtering can enable the use of relatively low resolution shadow maps while producing visually pleasing results.

However, without resorting to a very high-resolution shadow map, sharp shadows cannot be produced, leaving shadows artificially blurred or with obvious discretization artifacts. Several algorithms attempt to improve precision where it is most needed, without abhorrent memory requirements, either by warping or by partitioning the shadow map. Warping techniques [Stamminger and Drettakis 2002; Wimmer et al. 2004; Lloyd et al. 2008] can yield impressive results but suffer from special cases where they degenerate to ordinary shadow maps. Partitioning approaches can produce high-quality sharp shadows quickly for some scenes [Arvo 2004; Zhang et al. 2006; Lefohn et al. 2007; Lauritzen et al. 2011], but have difficulties if the scenes are very open with widely distributed geometry, leading to aliasing re-appearing, unpredictable run-time performance, or escalating memory requirements.

Alias-free shadow maps Alias-free shadow-mapping algorithms are exact per view sample [Aila and Laine 2004; Johnson et al. 2005; Sintorn et al. 2008]. To our knowledge, the only pixel accurate alias-free shadow algorithm that runs in real-time on current GPUs for complex scenes is [Sintorn et al. 2008]. While this algorithm runs admirably on some scenes and is likely to perform as well as or better than ours on views with a very high variance in the depth buffer, it breaks down in other configurations (e.g. when all view samples project to a single line in light space). Like our algorithm, these exact methods could trivially support semi-transparent and textured shadow casters.

Shadow volumes The Shadow-Volume algorithm, introduced by Crow in [1977], was implemented with hardware acceleration in 1985 [Fuchs et al. 1985] but did not see widespread use until a version was suggested that could be hardware accelerated on consumer grade graphics hardware with the z-pass algorithm [Heidmann 1991]. The idea is to isolate the silhouette-edges and extrude these away from the light-source, forming shadow-quads that enclose the shadow volume for an object. These shadow volumes are rendered onto the camera's depth buffer and the stencil buffer is incremented for front-facing quads and decremented for back-facing quads so that, when all quads are processed, the stencil buffer value will be 0 only for those pixels that do not lie in shadow. This essentially creates a per-pixel count of the number of shadow volumes that are entered by a ray cast from the eye to the view sample. Alternatively, the counting can be performed from the view samples to infinity, a method called z-fail [Bilodeau and Songy 1999; Carmack 2000].

Z-pass classically suffers from the eye-in-shadow problem, i.e. if the camera lies within one or more shadow volumes, or if the near-plane clips any shadow quad, the values in the stencil buffer will be incorrect. This is partly solved by Hornus et al. [2005], where the light's view is aligned with that of the camera and the light's far plane is set to equal the camera's near plane. Since the light is not in shadow, the scene can then be rendered from the light with a projection matrix set up to match that of the camera, and the stencil buffer is updated with all shadow quads that lie between the light and the camera near plane in a first pass. The algorithm has some robustness issues that get worse as the light's position approaches the camera's near plane. With the advent of depth-clamping, the solution to the eye-in-shadow problem is reduced to evaluating how many shadow casters lie between the camera and light, but to date no fully robust solution has been presented to this problem.

While the z-fail algorithm can be made practically robust [Everitt and Kilgard 2002], it is typically significantly slower due to higher overdraw [Laine 2005] and the need for near- and far capping geometry. An eight-bit stencil buffer (still the maximum allowed on current GPUs) can overflow, and resorting to using higher-precision color buffers will accentuate the rasterization cost significantly.

Several papers exist that aim to reduce fill rate requirements. In [Lloyd et al. 2004], the authors consider the objects in a scene graph and discuss a number of ways to prune the set of objects to find potential shadow-casters (for the current light view) and potential shadow receivers (for the current camera view). They also suggest a way to limit the distance a shadow caster has to extend its shadow volume in order to conservatively reach all potential shadow receivers. All of the optimizations in this paper are equally applicable to our algorithm (if we were to include a far plane for our shadow frustums), but as our rasterization already culls receiving tiles much more efficiently than the z-pass or z-fail algorithms, the overdraw reductions would probably not be large. Aila and Akenine-Moller [2004] suggest an optimization that in some ways resembles ours. In the first stage of their algorithm, 8x8 pixel tiles that lie on the shadow-volume boundary are identified (a minimum and maximum z for each tile is maintained, and thus, the 3D bounding box for the tile can be tested against each shadow-quad) and other tiles are classified as either fully in shadow or fully lit. One such shadow buffer (containing a boolean boundary flag and an eight-bit stencil value) is required per shadow volume. In the second stage, the per pixel shadow is calculated for boundary tiles whereas non-boundary tiles can be set to the shadow state of the tile. The authors suggest two hardware modifications to make the algorithm more efficient: a hierarchical stencil-buffer that would make classification of tiles faster and a so called Delay Stream that would help the rasterizer keep track of when the classification of a shadow volume is complete and final stencil buffer updates can begin. Nevertheless, this solution would be infeasible for a larger number of shadow volumes.

Chan and Durand [2004] use a shadow map to fill in the hard shadows and also identify shadow-silhouette pixels. Then, shadow volumes are used to generate correct shadows at these silhouette pixels. However, the algorithm relies on custom hardware to reject non-silhouette pixels and cannot guarantee not missing small silhouettes due to the discrete shadow map sampling.

Textured and semi transparent shadows Materials that are semi-transparent are common in real-time applications and present a problem for most shadow algorithms. A semi transparent surface is given an alpha or opacity value, α, which is defined as the ratio of received light that is absorbed or reflected at the surface (1 − α is the ratio that continues through the surface). This same model is commonly used to represent both materials that have partial coverage (e.g. a screen door) and materials that transmit light (e.g. thin glass) [McGuire and Enderton 2011]. For shadow-map based methods, these materials are difficult because a visibility lookup is no longer a binary function and so, a single depth value is not sufficient to define the visibility along a ray from the light towards the point being shadowed. Several techniques have been suggested to solve this by rendering shadow maps with several layers, the most common being Deep Shadow Maps [Lokovic and Veach 2000] for which the scene has to be rendered several times in the absence of a hardware accelerated A-buffer [Carpenter 1984], or by sampling visibility at discrete depths [Kim and Neumann 2001; Yuksel and Keyser 2008; Sintorn and Assarsson 2009] which works well for "fuzzy" geometry such as smoke or hair, but not so well for polygons where strong transitions in visibility happen at every surface. Recently, a different method was proposed [Enderton et al. 2010; McGuire and Enderton 2011] in which, when generating the shadow map, a triangle fragment is simply discarded with probability (1 − α). A PCF lookup into this map will return a value that is correct on average. However, if too few taps are taken from the filter region, the result will be noisy and so, to achieve sharp and noise free shadows, very many taps will be required from a very high resolution shadow map.

Semi-transparent shadows have been considered for shadow-volume type algorithms as well. A straightforward approach was suggested in [Kim et al. 2008], where the stencil buffer is replaced by a floating point buffer and the stencil increment/decrement operations are replaced by adding or removing log(1 − α), where α is the opacity of the object that generated the shadow quad. This elegantly produces a final stencil value, s, such that exp(s) = ∏i(1 − αi), where αi is the opacity value of a shadow caster that covers the sample point, which is exactly the visibility at that point. The method produces pixel-accurate sharp shadows and can easily be extended to support colored shadow-casters, but requires opacity to be constant per object. Needless to say, the additional math and blending operations exacerbate the overdraw problem inherent in the traditional Shadow-Volume algorithm. Also, textured shadow casters are not supported.

In [Hasselgren and Akenine-Moller 2007] the problem is instead solved simply by handling semi-transparent or textured objects separately, so that every triangle shadow volume is rendered for such objects. Despite the optimizations discussed in that paper, this causes a great deal of overdraw and will be prohibitively slow for complex geometry. The realization from that paper, that transparent and textured shadow casters are easily supported when the triangles' shadow volumes are handled separately, is, however, the basis for these extensions to our algorithm, which renders per-triangle shadow volumes very efficiently. It should also be mentioned that a similar idea was used for textured soft shadows [Forest et al. 2009].

In games and other real-time graphics applications, complex geometry is often approximated by mapping a binary alpha mask to simple geometry. A common example is the leaves on the tree in Figure 1(b). Casting shadows from such objects is simple when using shadow-mapping type algorithms, where, when rendering the shadow map, a fragment can simply be discarded if the alpha value is below some threshold. Unfortunately, as the filtered opacity value will not be binary even if the original alpha mask is, this technique introduces even more aliasing to the shadow-map algorithm. The problem with shadow casters of this kind, when using shadow-volume type algorithms, is that when a shadow-quad is rendered over a view sample, we know only that the sample lies within the shadow volume of some object but we have no means of determining which triangle covers the sample, and so, we cannot do a lookup into the alpha mask to see if the point is truly in shadow. If we render triangle shadow volumes individually however (as in [Hasselgren and Akenine-Moller 2007]), we may pass along the uv coordinates of each vertex, as well as the vertices themselves, and then do a ray-triangle intersection test prior to updating the stencil buffer, or by some other means find the uv coordinates on the triangle.

Figure 3: Every texel in the second level of the depth hierarchy defines a bounding box in normalized device coordinates and, equivalently, a world space frustum which contains all view samples inside.

3 Algorithm

We start out with as basic an approach to shadow volumes as we can imagine. The scene has been rendered to an off-screen set of buffers containing the ambient light component, direct lighting and the depth buffer. The depth buffer, along with the model-view-projection matrix, implicitly gives us the position of a point on a ray shot from the camera through the mid-sample of each pixel, at the closest triangle intersection. We call these positions view samples and they are the points for which we want to evaluate shadow (i.e. we want to evaluate whether these points are visible from the light source or not). We can do this by testing, for every view sample, whether it lies within the shadow frustum (i.e. the shadow volume of the triangle). If so, the sample is marked as shadowed in a separate buffer requiring a single bit per pixel. We will call this buffer the shadow buffer in the discussion below. To evaluate whether the sample is within the volume or not is a matter of testing the point against the four planes that make up the volume (see Figure 2).
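To make the four-plane test concrete, the sketch below builds the three side planes from the light position and the triangle edges, plus the triangle's own plane, and tests a world-space view sample against them. It is an illustration only: the vector helpers, the plane orientation convention and the names (makePlane, pointInShadowFrustum) are our own and not taken from the paper's implementation, and the light and triangle are labelled L and A, B, C here rather than with the letters used in Figure 2.

```cuda
// Minimal sketch of the naive per-sample test (Section 3): a point is in the
// shadow frustum of triangle (A, B, C) lit from L if it lies behind all four
// planes. Vector helpers and conventions are ours, not the paper's code.
struct Vec3 { float x, y, z; };

static Vec3  sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3  cross(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static float dot(Vec3 a, Vec3 b)   { return a.x*b.x + a.y*b.y + a.z*b.z; }

struct Plane { Vec3 n; float d; };        // points p with dot(n, p) + d <= 0 lie "behind" the plane

// Plane through points p0, p1, p2, with the normal direction given by the winding order.
static Plane makePlane(Vec3 p0, Vec3 p1, Vec3 p2) {
    Vec3 n = cross(sub(p1, p0), sub(p2, p0));
    return {n, -dot(n, p0)};
}

// True if the view sample s lies inside the shadow frustum of triangle ABC as
// seen from light L (three side planes through the light, plus the occluding
// triangle's own plane). A consistent winding is assumed so that all four
// normals point out of the volume.
static bool pointInShadowFrustum(Vec3 s, Vec3 L, Vec3 A, Vec3 B, Vec3 C) {
    Plane planes[4] = {
        makePlane(L, A, B),   // side plane through edge AB
        makePlane(L, B, C),   // side plane through edge BC
        makePlane(L, C, A),   // side plane through edge CA
        makePlane(A, B, C)    // the occluding triangle's plane
    };
    for (int i = 0; i < 4; ++i)
        if (dot(planes[i].n, s) + planes[i].d > 0.0f)
            return false;     // outside at least one plane: not shadowed by this triangle
    return true;
}
```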

While exhaustively testing all view samples against all triangle volumes is obviously not a good idea in practice, it is worthwhile to note a few things about this naive algorithm:

• It works without modification for any arbitrary triangle soup.

• It will be robust as long as some precautions are taken (see the paragraph on robustness below).

• It requires a very small amount of extra memory storage (a single bit per view sample).

• It requires no pre-processing nor does it matter where the light, camera near or far planes are situated.

• When a pixel has been marked as in-shadow, it need no longer be considered by other triangle shadow volumes.

Let's consider what could be done to alleviate the overdraw problems in an imaginary customized stamp rasterizer with a two-level hierarchical depth buffer and an equally sized two-level hierarchical shadow buffer (which can be thought of as a one bit stencil buffer). Let's say the upper level of the hierarchical depth buffer contains the min and max of the 4x4 depth-values of the lower level. A texel's (x, y) coordinates and these two depths then define an axis aligned bounding box in normalized device coordinates (or a bounding-frustum in world space, see Figure 3), for the view samples contained within. We will refer to such bounding-frustums as tiles from here on.

Our imaginary rasterizer takes a triangle and a light-source position as input and rasterizes the projected shadow volume. The main difference in how our rasterizer works, compared to how shadow quads are traditionally rasterized, lies in the way that we cull against the hierarchical depth buffers. Where a traditional rasterizer will cull a number of fragments of a shadow quad only if they all lie in front of the min depth stored in the upper level of the hierarchy (for z-fail), our rasterizer tests the tile against each plane of the shadow frustum. If the tile is found to lie outside any plane, it can be culled and no bits will be set for the contained view samples in the shadow buffer. Additionally, if the tile is found to lie inside all planes, we know that all contained view samples are covered by this triangle, and so, we simply set a bit in the higher level of our hierarchical shadow buffer and can then safely abandon the tile. If the bounding-box can be neither trivially rejected nor accepted, the individual view samples will be tested against the shadow volume planes and the lower level of the shadow buffer is updated.

Before the shadow buffer is used to determine whether a pixel should be considered in shadow or not, the two levels must be merged. This is done by, for each set bit in the higher level, also setting the corresponding 4x4 bits in the lower level, regardless of their current state.
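A possible sketch of this merge step for the two-level buffer, assuming the upper level stores one flag per 4x4 tile in a plain byte array; the buffer layout and names are ours, not the paper's:

```cuda
// Merge the two-level shadow buffer: every tile flagged as fully shadowed in
// the upper level forces the corresponding 4x4 block of bits in the lower
// level. One byte per flag is a simplification used for this sketch.
void mergeShadowBuffer(const unsigned char* upper, unsigned char* lower,
                       int tilesX, int tilesY) {
    for (int ty = 0; ty < tilesY; ++ty)
        for (int tx = 0; tx < tilesX; ++tx)
            if (upper[ty * tilesX + tx])
                for (int y = 0; y < 4; ++y)
                    for (int x = 0; x < 4; ++x)
                        lower[(ty * 4 + y) * (tilesX * 4) + (tx * 4 + x)] = 1;
}
```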

For the algorithm to be perfectly robust, two things must be considered. First, if an edge is shared by two triangles, a view sample will be tested against the same plane twice, only with opposite normals. If the sample lies very close to the plane, we can get the erroneous result that the sample lies outside both, unless we make sure that the equations for these planes are constructed in exactly the same way, which may not happen if we simply use the vertices in the order they are submitted. Instead, when constructing a plane from the light's position and two edge vertices, the vertices are taken in an order defined by their world space coordinates (any unique ordering will do). Similarly, for a perfectly robust solution, testing whether a sample lies below the plane formed by the triangle should be done exactly in the same way as when the original depth buffer value was created. A hardware vendor could ensure that this is the case, but our software implementation must resort to adding a bias to the triangle plane to avoid self shadowing artifacts. Unlike the bias required for shadow maps, this bias can be very small and constant and in practice it does not introduce any noticeable artifacts.
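One way to get an identical plane for a shared edge, regardless of which triangle submits it, is to sort the two edge vertices by a canonical key before building the plane. A small sketch using the Vec3, Plane and makePlane helpers from the earlier sketch; the lexicographic key is just one choice, as the paper notes that any unique ordering works:

```cuda
// Canonical ordering of the two edge vertices so that both triangles sharing
// an edge construct exactly the same plane equation (up to sign), avoiding
// cracks caused by floating-point asymmetry.
static bool lessXYZ(Vec3 a, Vec3 b) {
    if (a.x != b.x) return a.x < b.x;
    if (a.y != b.y) return a.y < b.y;
    return a.z < b.z;
}

// Side plane through the light L and edge (v0, v1), built from the vertices
// in canonical order. The caller is assumed to flip the plane if its normal
// ends up pointing into the frustum instead of out of it.
static Plane makeEdgePlane(Vec3 L, Vec3 v0, Vec3 v1) {
    if (lessXYZ(v1, v0)) { Vec3 t = v0; v0 = v1; v1 = t; }
    return makePlane(L, v0, v1);
}
```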

3.1 A software hierarchical shadow-volume rasterizer

To evaluate the new algorithm, we have designed a hierarchical shadow volume rasterizer and implemented the design in software using NVIDIA's CUDA platform. We chose a fully hierarchical approach because this is the most viable known approach to parallel SIMD software rasterization [Abrash 2009], while for a hardware implementation it is likely that a two-level stamp rasterizer would prove more efficient.

The rasterizer extends the two-level approach presented earlier to be fully hierarchical, with L = ⌈log_s N⌉ levels, for some branching factor s and number of pixels, N, in both the hierarchical z-buffer and shadow buffer. These hierarchies represent an implicit full tree with branching factor s, which is traversed during rasterization of a triangle shadow volume.

Our design broadly follows the approach used for the 2D triangle rasterizer in Larrabee [Abrash 2009], which is illustrated in Figure 4, as this is simpler to illustrate and exactly analogous to what we do.

Algorithm 1 Basic parallel traversal algorithm, for a square tile size T × T, which assumes SIMD with T × T lanes. The algorithm is expressed as a program running on an individual SIMD lane, identified by simdIndex ∈ [0..T × T) and able to broadcast a single bit to each other using BALLOT. We use ⊗ to denote element-wise multiplication. Tiles are referenced using integer tuples defining their location within the current hierarchy level.

1:  procedure TRAVERSAL(level, parentTile, tri)
2:    subTile ← (simdIndex mod T, simdIndex / T)
3:    tile ← parentTile ⊗ (T, T) + subTile
4:    if level is final level then
5:      if TESTVIEWSAMPLE(tile, tri) then
6:        UPDATESHADOWBUFFER(level, tile)
7:      return
8:    tileIntersects ← TESTTRIVOLUME(level, tile, tri)
9:    if tileIntersects = ACCEPT then
10:     UPDATESHADOWBUFFER(level, tile)
11:   else
12:     queue ← BALLOT(tileIntersects = REFINE)
13:     for each nonzero bit b_i in queue do
14:       child ← parentTile ⊗ (T, T) + (i mod T, i / T)
15:       TRAVERSAL(level + 1, child, tri)

The example shows 4 × 4 tiles, which would be suitable for 16-wide SIMD, with each lane processing a tile at the current level in the hierarchy. For each edge, a trivial accept corner and a trivial reject corner are found. These are the tile corners with greatest and least projection on the edge normal, as shown for tile 0 in the figure. If the trivial reject corner is outside any edge, then the tile can be rejected (shown in green). Conversely, if the trivial accept corner is inside all edges, then the tile can be trivially accepted (shown in blue). If neither of these conditions is satisfied (white), then the tile must be recursively refined at the next level of the hierarchy (Figure 4(b)). Note the single yellow tile shown (tile 9), which is obviously outside the triangle but cannot be rejected by the algorithm because its trivial reject corners are not outside any edge. This is sometimes called the triangle shadow [McCormack and McNamara 2000], and produces false positives unless some extra test is employed that is capable of rejecting such tiles.
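The trivial-corner selection can be written compactly: for an edge (or plane) with normal n, the corner of an axis-aligned tile with the greatest projection on n is found by picking, per axis, the max coordinate where the normal component is positive and the min coordinate otherwise, and vice versa for the least projection. A small 2D sketch with our own naming; the same per-axis idea carries over to the 3D plane-vs-AABB test used for shadow frustums:

```cuda
// For a 2D edge with normal n and an axis-aligned tile [mn, mx], pick the
// trivial-accept corner (greatest projection on n) and the trivial-reject
// corner (least projection on n).
struct Vec2 { float x, y; };

static void trivialCorners(Vec2 n, Vec2 mn, Vec2 mx,
                           Vec2* acceptCorner, Vec2* rejectCorner) {
    acceptCorner->x = (n.x > 0.0f) ? mx.x : mn.x;
    acceptCorner->y = (n.y > 0.0f) ? mx.y : mn.y;
    rejectCorner->x = (n.x > 0.0f) ? mn.x : mx.x;
    rejectCorner->y = (n.y > 0.0f) ? mn.y : mx.y;
}
```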

Figure 4: Hierarchical 2D triangle rasterization illustrated by two levels. In (a), the green tiles are trivially rejected, white tiles need more refining and the yellow tile (9) is in the triangle shadow. The purple and blue dots show, for tile 0, the trivial reject and accept corners, respectively, to use with the red edge. In (b), showing the next level of refinement for tile 6, the blue tiles are trivially accepted.

Our suggested process for rasterizing shadow frustums is very similar, with some notable differences. Firstly, as we are rasterizing shadow frustums, the three edge equations defining a triangle are replaced by four plane equations, which define the shadow frustum. Secondly, the hierarchical shadow buffer enables much simpler trivial accept handling; only a single bit needs to be updated in the correct hierarchy level. Lastly, our rendering is fully hierarchical instead of tiled, with each primitive traversing the hierarchy from the root and writing the results directly into the shadow-buffer hierarchy. Certain other implementation details are also different as we target an NVIDIA GPU rather than the Larrabee architecture. This is further elaborated on in Section 3.4.

Using homogeneous clip space is advantageous since it precedes the perspective divide and thus removes the need for clipping [Olano and Greer 1997]. The intersection test between a tile and a triangle shadow volume is very similar to the frustum vs. AABB test commonly used for view frustum culling, with the shadow volume being the frustum.

Traversing the hierarchy from the root, as opposed to using a tiled approach, has the advantages of improved scaling with resolution and that large shadow volumes can trivially accept or reject larger tiles. Even though triangles in today's complex scenes are often very small, the projected shadow frustums generated from such triangles can still be arbitrarily large, especially in the worst cases (see Figure 5). Efficiently handling the worst cases is important if we wish to construct a shadowing algorithm with low variability in frame times.

The basic SIMD traversal algorithm is shown in Algorithm 1. Each SIMD lane handles one sub-tile. The lanes then exchange their results as bits in a single word, before recursively descending to the next level.
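A rough CUDA shape of this loop, using one warp per shadow frustum, might look like the sketch below. The Frustum type and the helper functions (childTile, testTriVolume, testViewSample, updateShadowBuffer) are placeholders for the paper's actual implementation, and the modern __ballot_sync intrinsic is used where the paper used __ballot; this illustrates the control flow, not the authors' code.

```cuda
// One warp traverses the hierarchy for one shadow frustum. Each of the 32
// lanes classifies one sub-tile; the lanes then share the set of tiles to
// refine through a warp ballot and recurse into the flagged children.
enum TileResult { REJECT, ACCEPT, REFINE };

__device__ void traverse(int level, int2 parentTile, const Frustum& tri)
{
    const int lane = threadIdx.x & 31;            // lane id within the warp
    int2 tile = childTile(parentTile, lane);      // this lane's sub-tile (placeholder helper)

    if (level == FINAL_LEVEL) {                   // leaves: test individual view samples
        if (testViewSample(tile, tri))
            updateShadowBuffer(level, tile);
        return;
    }

    TileResult r = testTriVolume(level, tile, tri);
    if (r == ACCEPT)                              // whole tile in shadow: set a single bit
        updateShadowBuffer(level, tile);

    // Gather which lanes want refinement into one 32-bit mask.
    unsigned queue = __ballot_sync(0xffffffffu, r == REFINE);

    // All lanes walk the same queue together, descending into one child per
    // iteration so the warp stays converged.
    while (queue) {
        int child = __ffs(queue) - 1;             // index of lowest set bit
        queue &= queue - 1;                       // clear it
        traverse(level + 1, childTile(parentTile, child), tri);
    }
}
```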

3.2 Textured and transparent shadows

As described in Section 2, the original Shadow-Volume algorithm relies on extending shadow quads from the silhouette edges only. In that way, a large number of shadow quads can be culled away and overdraw is reduced, but we lose the ability to determine which triangles cover which view samples. If all we want is binary shadow information, this is acceptable, as a sample will be in shadow regardless of which or how many triangle shadow volumes it lies within. If one intends to draw textured shadows or shadows cast from semi-transparent triangles, however, all triangles that cover a view sample must be considered individually.

In our approach, the triangle shadow volumes are always considered individually. Hasselgren et al. [2007] show that if all triangle shadow volumes are rendered separately, textured and semi-transparent shadows are feasible, but they do not suggest any method to render shadow frustums efficiently and so are limited to rather low polygon counts. We incorporate their ideas into our efficient rendering of shadow frustums and can render hundreds of thousands of textured or semi-transparent shadows in real-time. Below, we describe how textured and semi-transparent shadow-casters can be taken care of with very small changes to the original algorithm.

Figure 5: Visualizing the overdraw caused by different algorithms (according to the metric given in Section 4). From left to right (total number of view sample test-and-set operations in parenthesis): The scene, z-pass algorithm (19.7 million), z-fail algorithm (18.3 million), ours (1.2 million). The overdraw in the first two algorithms is proportional to the sum of the areas of projected shadow-quads, whereas in our algorithm view-samples that lie outside the triangle shadow-frustums are quickly culled.

Semi-transparent shadow casters To incorporate semi transparent shadows in our method, we modify the hierarchical shadow buffer such that it contains a floating point value, instead of a bit, for every tile and view sample. The shadow buffer is cleared to zero. When updating the shadow buffer (UPDATESHADOWBUFFER in Algorithm 1), instead of setting a bit, log(1 − α) is atomically added. To merge the hierarchical shadow buffer into a single shadow buffer with a transmittance value per view sample, we simply add the value of a parent node to all its children instead of OR-ing them. The leaf nodes will now contain, for every view sample, ∑i log(1 − αi) = log(∏i(1 − αi)) for every triangle i that lies between it and the light source. To get the transmittance, for each view sample we simply raise e to the power of that value. This allows us to efficiently render shadows from models with per-triangle alpha values, which is often sufficient to generate compelling images (see Figure 1(c)). If the alpha-value needs to be interpolated over the triangle or fetched from a texture, we can not trivially accept an internal node in this simple way, as explained in the next paragraph. Colored transparent shadows are trivially supported by applying the above scheme to each wavelength (e.g. to a shadow buffer of RGB-tuples).
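As a concrete illustration of the transmittance bookkeeping, the sketch below accumulates log-space opacity per view sample and converts to transmittance at resolve time. It uses a flat per-sample buffer rather than the paper's hierarchy, and the names are our own.

```cuda
// Accumulate log-space transmittance for a view sample covered by a shadow
// caster of opacity alpha, then convert to transmittance when resolving.
__device__ void addCaster(float* logTransmittance, int sampleIndex, float alpha)
{
    // Several shadow frustums may hit the same sample concurrently, hence atomicAdd.
    atomicAdd(&logTransmittance[sampleIndex], logf(1.0f - alpha));
}

__global__ void resolveTransmittance(const float* logTransmittance,
                                     float* transmittance, int numSamples)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numSamples)
        transmittance[i] = expf(logTransmittance[i]);   // prod_i (1 - alpha_i)
}
```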

Textured shadow casters Computing interpolated texture coordinates to support textured shadow casters is surprisingly simple in our method. Recall from Section 3.1 that to determine whether a view sample is in shadow, we test its position against the four planes that make up the shadow frustum and evaluate the sign of the results. Though perhaps not immediately obvious, it can be shown that the distances obtained from each of the planes generated from the triangle edges (d0, d1, d2) are indeed a scaled version of the barycentric coordinates on the triangle. Scaling these distances such that d'0 + d'1 + d'2 = 1, we have the true barycentric coordinates and can obtain the texture coordinates for the view sample. Moreover, this same approach holds in clip space, so no transformations are required. Given the texture coordinates, we can get the alpha-mask value, opacity value, or colored opacity value from a texture and proceed to update the shadow buffer for a view sample.
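In code, the normalization is just a division by the sum of the three edge-plane distances. The sketch below is illustrative: the names are ours, the plane distances are assumed to be consistently oriented, and each distance is assumed to be paired with the vertex opposite the edge that generated it.

```cuda
// Recover barycentric weights from the three edge-plane distances and
// interpolate per-vertex texture coordinates for the view sample. d0 is the
// distance from the plane through the edge opposite vertex 0, and so on.
__device__ float2 interpolateUV(float d0, float d1, float d2,
                                float2 uv0, float2 uv1, float2 uv2)
{
    float invSum = 1.0f / (d0 + d1 + d2);   // scale so the weights sum to 1
    float b0 = d0 * invSum;
    float b1 = d1 * invSum;
    float b2 = d2 * invSum;
    return make_float2(b0 * uv0.x + b1 * uv1.x + b2 * uv2.x,
                       b0 * uv0.y + b1 * uv1.y + b2 * uv2.y);
}
```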

Note that while it is simple for us to introduce support for textured shadow casters, we are forced to abandon the trivial accept optimization described in Section 3.1. It is simply not the case any more that if a tile lies entirely within the shadow volume of a triangle, all sub-tiles will have the same shadow value. Indeed, the triangle may cast no shadow if its texture is empty. Instead, when a tile is trivially accepted, we flag it as such and immediately traverse all sub-tiles without testing, until we reach the individual samples, for which we evaluate the texture and update the final level of the shadow buffer. This gives worse performance, of course, than when trivial accept is as simple as updating a shadow-buffer node, but still works at acceptable frame rates for complex models. There is a large drop in performance in our implementation, however, when the shadow of a single or very few triangles covers a large part of the screen. In this case, only one or a few multiprocessors will have any work to do and load-balancing becomes a problem. To alleviate this, we could instead employ the method described in [Abrash 2009] to trivially accept a tile. The tile and shadow frustum pair would then simply be pushed to a work queue that could be processed efficiently in a separate pass.

As noted previously, alpha-masked shadow casters are trivially supported by the shadow-mapping algorithm. When rendering the shadow map, a fragment can simply be discarded if the alpha value is below some threshold. Real valued alpha textures, however, are not easily supported, and one has to use more complex shadow-mapping techniques for this to work (e.g. Stochastic Transparency [Enderton et al. 2010]). Even when rendering alpha-masked shadow casters, the projection of a fragment onto the shadow map rarely covers a single texel, and so, filtering should be employed, and then the simple alpha-mask texture again returns a real valued result which will be thresholded. Our method trivially handles filtered lookups into an alpha-mask texture, and consequently, produces higher quality shadows (this too was noted by [Hasselgren and Akenine-Moller 2007]).

Algorithm 2 Testing a shadow frustum against a tile. The algorithm first constructs the normalized device coordinate representation of the tile, and then tests the trivial-reject and trivial-accept corners against each of the four planes that define the shadow frustum. The xy extents of the tiles at a level, in normalized device coordinates, are available through the constant tileSize_level.

1:  procedure TESTTRIVOLUME(level, tile, tri)
2:    tileMin.xy ← (−1.0, −1.0) + tile ⊗ tileSize_level
3:    tileMin.z ← fetchMinDepth(level, tile)
4:    tileMax.xy ← tileMin.xy + tileSize_level
5:    tileMax.z ← fetchMaxDepth(level, tile)
6:    numInside ← 0
7:    for each plane p_i in tri do
8:      result ← TESTPLANEAABB(p_i, tileMin, tileMax)
9:      if result = OUTSIDE then
10:       return REJECT
11:     else if result = INSIDE then
12:       numInside ← numInside + 1
13:   if numInside = 4 then
14:     return ACCEPT
15:   else
16:     return REFINE
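A possible shape of the per-plane test, applying the trivial-corner idea from the 2D rasterizer (Section 3.1) to all three axes. The OUTSIDE/INSIDE/STRADDLING encoding, the names and the plane-orientation convention are ours.

```cuda
// Classify an axis-aligned box against one plane (n, d), with planes oriented
// so that points with dot(n, p) + d > 0 lie outside the shadow frustum.
// The trivial-reject corner has the least projection on n, the trivial-accept
// corner the greatest, so only two of the eight corners are evaluated.
enum PlaneResult { OUTSIDE, INSIDE, STRADDLING };

__device__ PlaneResult testPlaneAABB(float3 n, float d, float3 mn, float3 mx)
{
    float3 rejectCorner = make_float3(n.x > 0.0f ? mn.x : mx.x,
                                      n.y > 0.0f ? mn.y : mx.y,
                                      n.z > 0.0f ? mn.z : mx.z);
    float3 acceptCorner = make_float3(n.x > 0.0f ? mx.x : mn.x,
                                      n.y > 0.0f ? mx.y : mn.y,
                                      n.z > 0.0f ? mx.z : mn.z);

    if (n.x * rejectCorner.x + n.y * rejectCorner.y + n.z * rejectCorner.z + d > 0.0f)
        return OUTSIDE;      // even the most-inside corner is outside: reject
    if (n.x * acceptCorner.x + n.y * acceptCorner.y + n.z * acceptCorner.z + d <= 0.0f)
        return INSIDE;       // even the most-outside corner is inside: fully inside
    return STRADDLING;
}
```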

Stochastic transparent shadows When the shadow buffer contains a float per node instead of a single bit, the memory requirements are obviously much higher. Especially for high quality antialiased render targets (MSAA, CSAA or SSAA buffers) where every pixel has several depth samples, each of which should be tested against the shadow volumes for correct shadows, the memory footprint may be a limiting factor to the usefulness of our algorithm (or any other sample-accurate transparent shadows algorithm). For example, a 1920x1080 buffer with 16 depth samples per pixel and 32-bit float transmittance values would require 130MB of memory for the final level. Therefore, we suggest a different approach, where every pixel sample still holds a single bit of shadowing information. When updating the shadow buffer with a semi transparent shadow caster, the bit is simply set stochastically with a probability equal to α. The shadow buffer is used as per usual to decide whether each sample shall be considered lit or in shadow. When resolving the final pixel color the result will be noisy but correct on average (see Figure 6(a)). The proofs of why this works are equivalent to those in [Enderton et al. 2010], and, while we have not implemented it, the same scheme to stratify samples over a pixel as is presented in that paper should work well to reduce the noise.
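A sketch of the stochastic update with our own naming; CURAND is one way to obtain per-thread random numbers and is an assumption here, not necessarily what the paper used.

```cuda
#include <curand_kernel.h>

// Stochastically record a semi-transparent caster: with probability alpha the
// view sample's shadow bit is set, so the resolved pixel is correct on
// average. bits is a packed one-bit-per-sample shadow buffer.
__device__ void stochasticShadowUpdate(unsigned int* bits, int sampleIndex,
                                       float alpha, curandState* rng)
{
    if (curand_uniform(rng) < alpha)
        atomicOr(&bits[sampleIndex >> 5], 1u << (sampleIndex & 31));
}
```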

Transparent shadow receivers We have shown that semi-transparent shadow casters present no problems to our algorithm. Since it is, like the original shadow-volume algorithm, essentially a deferred rendering algorithm, transparent shadow receivers are not quite as trivial, though. Since the light-visibility calculations do not (as with e.g. shadow maps) happen during fragment shading, but in a post-processing pass, simple techniques where polygons are sorted on depth before rendering will not work with our algorithm. Oftentimes, these approaches are not sufficient anyway, as they are prone to errors (two triangles may span the same depth and cannot be uniquely sorted). Our algorithm works well with depth-peeling [Everitt 2001], where layers of transparent objects are rendered in several passes, and with Stochastic Transparency [Enderton et al. 2010], although the latter produces z-buffers with a high variance, which causes a hierarchical z-buffer to be less than optimal.


Figure 6: Stochastic (a) and real-valued (b) transparent shadows.

3.3 Antialiasing

Hard shadow edges often mean that two neighboring pixels will have vastly different intensities, and so, anti-aliasing can greatly improve image quality. Our algorithm works well with full screen anti-aliasing schemes like MSAA, CSAA or SSAA as long as the pixel-sample positions can be obtained. For the shadow calculations, such a buffer is simply considered a large render target, and when the shadow buffer has to be updated for a view sample, the pixel-sample position's offset is fetched from a table.

We also suggest a novel anti-aliasing scheme that requires no extra depth-samples per pixel. When the scene is rendered from the camera, an additional color buffer is rendered that contains the x and y derivatives of the fragment's depth. Building the depth buffer hierarchy works exactly as before, except the derivatives are used to find a minimum and a maximum depth already at the lowest level. The shadow buffer hierarchy is allocated with one additional level (so that a view sample will have a number of shadow bits instead of one single bit), and when traversing the shadow frustums through the hierarchical depth buffer, we simply traverse as though there were an additional level of the depth hierarchy, but the final view samples to be tested are generated from the samples' (x, y) positions in the pixel and the depth derivatives. The final shadow value used for the pixel will be the ratio of set bits to clear bits in the shadow buffer. Note that this is equivalent to projecting a fragment on a shadow map of infinite resolution and taking a number of PCF taps within this region.
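For illustration, a sub-sample position can be reconstructed from the pixel-centre depth and its screen-space derivatives with a first-order expansion. The function below is a sketch under that assumption; the names and the source of the per-sample offsets (a small lookup table) are ours.

```cuda
// Reconstruct the NDC position of one anti-aliasing sub-sample inside a pixel
// from the stored centre depth z and its screen-space derivatives (dzdx, dzdy).
// offset is the sub-sample position relative to the pixel centre, in pixels.
__device__ float3 subSamplePosition(float2 pixelCenterNdc, float2 ndcPerPixel,
                                    float z, float dzdx, float dzdy, float2 offset)
{
    return make_float3(pixelCenterNdc.x + offset.x * ndcPerPixel.x,
                       pixelCenterNdc.y + offset.y * ndcPerPixel.y,
                       z + offset.x * dzdx + offset.y * dzdy);
}
```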

3.4 Implementation

We have implemented our algorithm using CUDA, where the native SIMD group (called a warp) is 32 threads wide. One warp is issued per shadow frustum (with enough warps in each CUDA block to fully utilize the hardware), and the threads in each warp cooperate in rasterization of the frustum. The threads in a warp can efficiently exchange bit flags using the __ballot intrinsic (available on NVIDIA GPUs of compute capability 2.0 and above). On devices that lack this instruction, a parallel reduction in shared memory will yield the same result, at some cost in performance. Choosing a branching factor that matches the SIMD width allows the traversal to entirely avoid divergence (threads within a warp executing different code paths, for example if shadow frustums traverse the tree to different depths). A branching factor of 32 also matches the 32-bit word width, which makes updating bit masks simple and efficient.

However, 32 items cannot tile a square region. To construct a tree from a square frame buffer, we instead alternate between 8 × 4 and 4 × 8 at each level. The implementation is otherwise faithful to the traversal algorithm (Algorithm 1) presented earlier. To accumulate results in the shadow-buffer hierarchy, we use atomic operations, e.g. the atomicOr intrinsic. While atomic operations are often held to be slow, we were not able to observe any penalty from using them, which may be because of the relatively low load the depth-first traversal places on the memory subsystem.
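The alternating tiling means a lane's sub-tile coordinate depends on the parity of the hierarchy level. One way to write the mapping is sketched below; the naming is ours and the choice of which parity is 8 wide is arbitrary.

```cuda
// Map a lane index (0..31) to a sub-tile coordinate within its parent tile.
// Levels alternate between 8x4 and 4x8 tilings so that 32 lanes always cover
// the parent exactly.
__device__ int2 laneToSubTile(int lane, int level)
{
    if ((level & 1) == 0)
        return make_int2(lane & 7, lane >> 3);   // 8 wide, 4 high
    else
        return make_int2(lane & 3, lane >> 2);   // 4 wide, 8 high
}

// The parent's coordinates are scaled by the same alternating factors when
// computing a child tile's absolute position at the next level.
__device__ int2 childTileCoord(int2 parent, int lane, int level)
{
    int2 sub = laneToSubTile(lane, level);
    if ((level & 1) == 0)
        return make_int2(parent.x * 8 + sub.x, parent.y * 4 + sub.y);
    else
        return make_int2(parent.x * 4 + sub.x, parent.y * 8 + sub.y);
}
```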

In order to support transparency, we need to use 32 floating point numbers per tile instead of the 32 bits used for binary shadow. Updating these is done by using atomic add from each SIMD lane. To handle colored transparent shadows, we simply use three atomic adds, one per component.

3.5 Optimizations

Culling against shadow frustum silhouette As described in Section 3.1, our shadow frustum vs. tile test is conservative, and can thus produce false positives, leading to tiles being refined unnecessarily. This problem is the 3D equivalent of the triangle-shadow problem for 2D rasterization (see Section 3.1), and causes traversal to refine tiles that are outside the projected shadow volume. To improve culling efficiency, we also test two additional edges that define the 2D projection of the shadow volume (illustrated using red lines in Figure 2). The new edges are defined in 2D homogeneous clip space, and are tested in a very similar fashion to the planes already used. Adding this test helps ensure that we do not visit any tiles not actually within the on-screen shadow.

Front face culling When rendering closed objects, we can choose to use only the triangles that face the light or the back-facing triangles as shadow casters [Zioma 2003]. This is also employed in shadow mapping, where rendering only back facing triangles can reduce self shadowing artifacts. For our algorithm, there is an even more compelling reason to use this approach. Consider what happens when a front-facing triangle that is visible from the camera is used as a shadow caster. This triangle will have to traverse the hierarchical depth buffer all the way down to the view samples that belong to that triangle, since all of these will lie exactly on the triangle plane. Unlike the traditional shadow-volume algorithm, using this optimization does not require that the model is actually modeled as a two-manifold mesh. As long as the object will render correctly to screen with backface-culling, it will work robustly as a shadow caster with front-face culling. For example, unclosed backdrop geometry will cast shadows properly.

Remove unlit depth samples We want to avoid computing light visibility for view samples that are already unlit, either because of the sample not facing the light, or being unlit for other reasons like being part of the background. To achieve this, we flag unlit view samples and do not include their depths when building the hierarchical depth buffers.

Maintaining an updated shadow buffer When traversing all shadow frustums through the hierarchical depth buffer, we set bits in the hierarchical shadow buffer representing completely shadowed tiles or view samples. Naturally, once a tile or view sample is found to be in shadow, given some triangle frustum, its state cannot change. Therefore an obvious optimization to our traversal algorithm is to stop traversal as soon as we reach a node that is already marked as being in shadow. It is simple to modify Algorithm 1 to AND the bitmask queue with the inverse of the current shadow buffer value for the node. This, however, will only stop the shadow frustum from being traversed through nodes that have previously been trivially accepted by some shadow frustum (not tiles that have been filled by several different shadow frustums), so the gain in efficiency is modest. Alternatively, a thread that fills a node can either recursively propagate that change up the hierarchy or simply update the one level above and rely on the changes to propagate due to other threads over time. Neither method improves performance in our implementation, however, probably due to the increased cost of reading the shadow-buffer and the potentially long latency before an update is visible to other threads. An even more efficient optimization would be to keep the hierarchical depth buffer dynamically updated (by removing shadowed tiles from the hierarchy), but this seems less likely to be feasible.

4 Results and Discussion

To evaluate the performance of our algorithm, we have implemented carefully tuned versions of the z-pass and z-fail algorithms. Our implementations are similar to those suggested by [Aldridge and Woods 2004], except they are implemented using shaders and run entirely on the GPU. Since the stencil buffer can only be incremented or decremented by one, shadow quads shared by two triangles are rasterized twice, as this is much faster than replacing the stencil buffer with a color buffer (which, using blending, can be incremented by two or more). In the z-fail implementation, we render the near and far caps, while in the z-pass we need only render the shadow quads. Both implementations rely on depth clamping to avoid clipping artifacts. In the z-pass algorithm, we initialize the stencil buffer with a value corresponding to the number of shadow volumes the camera is in. As mentioned, establishing this value robustly is still an unsolved problem and occasionally causes grave artifacts. Both z-pass and z-fail can also fail if the eight-bit stencil buffer overflows.

The time taken to render the shadow volumes is plotted for a fly-through of a game-scene (see supplementary video) in Figure 7. We show results for two resolutions, 1024x1024 and 4096x4096. This latter resolution may seem extravagant, but really corresponds to approximately the same number of samples that would be processed for an image rendered in 1080p with 8xMSAA. The timings reported for our algorithm are those measured for generating the shadow buffer (build depth hierarchy, triangle-setup, rasterization and final merging) and omit any redundant buffer copies that happen when mixing OpenGL and CUDA. The timings presented for z-pass omit the time taken to evaluate how many shadow casters lie between the light and the camera. The graphs show the performance with and without the front-face culling (FFC) optimization described in Section 3.5. The scene used is a part of the freely available Epic Citadel [Epic Games 2011] (∼60k triangles), which contains many open edges in what are really closed objects, and so would not have worked with the simpler shadow-volume algorithms. The scene has been slightly modified to contain no one-sided geometry (the cloth in the original model). All timings were measured on an NVIDIA GTX480 GPU.

Figure 7: Comparing z-pass, z-fail and the new algorithm, with and without Front Face Culling (FFC), in a fly-through of the citadel scene. Above: Time taken (in ms) to generate per–view-sample shadow information. Below: Millions of test-and-set operations required. Left is for a render target of size 1024x1024, right 4096x4096.

As can be seen from Figure 7, when front-face culling is enabled, our software GPU implementation outperforms z-fail even at the lower resolution, and performs with a similar average as z-pass at the higher resolution, though with much lower variability. Without front-face culling, z-pass is still faster than our algorithm at the higher resolution but, as noted previously, the z-pass algorithm is not entirely robust and so our algorithm is a compelling alternative. Moreover, for meshes without connectivity information or with short silhouette loops (e.g. destructible buildings, or a flock of birds) our frame times stay low while z-pass rendering times would increase significantly.

To further examine the performance of our algorithm we have measured the total number of test-and-set operations per view sample. For the shadow-volume algorithms, we have ignored the work required by the rasterizer and early z-culling (as we lack information to properly evaluate that), and so the number of test-and-set operations reported is simply the total number of stencil updates required for a frame. For our algorithms, we have counted the total number of tile/shadow frustum and sample/shadow frustum tests performed.

Figure 7 shows the number of test-and-set operations for the same scene and animation as before. Clearly, the new hierarchical algorithm is more efficient, even at relatively low resolutions, while at the high resolution it is especially effective. As expected, our hierarchical approach scales well with increasing resolutions: raising the number of view samples sixteenfold only requires between about two and four times as many test-and-set operations, while for z-pass the increase is around 16 times. The results also demonstrate the low variability of the new algorithm, which is due to our algorithm's ability to trivially accept large tiles that lie within the shadow frustum and to trivially reject tiles that lie outside. Observe that, when front-face culling is disabled, the number of required test-and-set operations is more than doubled (roughly 2.2 times for the lower, and around 2.4 times for the higher resolution). This is the expected behavior, as visible front faces must be refined all the way to the sample level (see Section 3.5), and shows that front-face culling ought to be enabled whenever possible.

Another important consideration for a shadowing algorithm is robustness and the artifacts it may produce. The z-pass algorithm is generally not robust, because of the camera-in-shadow problem, and because of stencil buffer overflow, which also affects the z-fail algorithm. When these failures are encountered, the shadow computed for the entire screen may be incorrect, a highly disturbing artifact. Our algorithm, on the other hand, has no inherent robustness issues, and will at worst produce light leakage if the mesh is not properly welded.

The amount of memory required by our basic (non-transparent) algorithm is very low. For a 4096x4096 render target, our hierarchical shadow buffer requires only 2.1MB; besides the resident depth buffer, our hierarchical depth buffers require an additional 2.1MB, for a total of 4.2MB. An eight-bit stencil buffer, which is a bare minimum for shadow-volume algorithms, requires ∼16MB for the same resolution. The alias-free shadow-map implementation described in [Sintorn et al. 2008] stores all view sample positions (three floats) in a compact array of lists per light space pixel, which would take ∼200MB at this resolution. Additionally, they require a shadow map where each texel needs to store as many bits as the maximum list size. For a shadow map of 1024x1024 and a max list size of 512, that would mean an additional 16MB of memory. For comparison, a 4096x4096 omnidirectional shadow map would require 384MB.

The results of the stochastic shadow buffer described in Section 3.2 are demonstrated in Figure 6. The images are rendered at a resolution of 4096x4096 and downsampled to 1024x1024; again, this is to illustrate how the algorithm could work with images rendered using high-quality MSAA or CSAA. While it is quite noisy, the stochastic image renders slightly faster than when using a float value per sample and, more importantly, requires only a single bit of visibility information per view sample. The memory footprint of our algorithm is thus reduced from the original ∼64MB in Figure 6(b) to only ∼2MB in Figure 6(a).
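The core of the stochastic variant can be sketched in a few lines of device code; the hash function, the seeding scheme and the function names below are illustrative choices, not necessarily those used in the measured implementation.

#include <cuda_runtime.h>

// Simple integer hash used here to generate a per-(sample, occluder) random number.
__device__ unsigned int wangHash(unsigned int s)
{
    s = (s ^ 61u) ^ (s >> 16);
    s *= 9u;
    s ^= (s >> 4);
    s *= 0x27d4eb2du;
    s ^= (s >> 15);
    return s;
}

// A semi-transparent occluder of opacity 'alpha' blocks the view sample with
// probability 'alpha', so only one visibility bit is needed per sample; averaging
// the 16 samples behind each final pixel recovers the fractional shadowing.
__device__ void stochasticShadowTap(unsigned int* shadowBits,
                                    unsigned int sampleIndex,
                                    unsigned int occluderId,
                                    float alpha)
{
    float u = (wangHash(sampleIndex * 9781u + occluderId) & 0xffffffu) / 16777216.f;
    if (u < alpha)
        atomicOr(&shadowBits[sampleIndex >> 5], 1u << (sampleIndex & 31u));
}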

Figure 8: The results of rendering a scene with different variants of our algorithm.

Figure 8 shows the tree from Figure 1(b) rendered with different variants of our algorithm. The variants and the times taken to generate the shadows were (from left to right): standard binary visibility (7.9ms), semi-transparent shadow casters where all leaves have constant α = 0.5 (8.6ms), semi-transparent textured shadow casters (12.9ms), stochastic semi-transparent textured shadow casters (12.9ms), and colored semi-transparent shadow casters (10.4ms).

The choice of 32 as the branching factor for the hierarchical depth and shadow buffers is natural, as it matches both the native word size and the SIMD width. However, a high branching factor results in more wasted work; for example, every shadow frustum that is not culled in the setup phase must test all 32 tiles in the first level of the hierarchy, whereas a binary tree would only need to test two. A lower branching factor, on the other hand, results in deeper trees and more divergence. We have not explored this trade-off.
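The affinity between the branching factor and the hardware can be illustrated with a single warp vote; this is a conceptual sketch rather than the actual traversal code.

#include <cuda_runtime.h>

// Each of the 32 lanes of a warp tests one child of the current node and reports
// whether that child needs further refinement; one ballot packs the 32 outcomes
// into exactly one machine word, so the per-node mask is both a full warp's work
// and a single 32-bit store. A binary tree would need five such levels instead.
__device__ unsigned int packChildResults(bool childNeedsRefinement)
{
    return __ballot_sync(0xffffffffu, childNeedsRefinement);
}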

5 Future Work

The optimization where unlit samples are removed enables the use of a two-pass approach, where a first pass runs the algorithm only on those triangles that are expected to be good blockers according to some heuristic, then refreshes the hierarchy by removing yet more unlit samples, and finally traverses the remaining triangles [Olsson and Assarsson 2011]. As constructing the depth hierarchy is cheap, this optimization may yield a significant increase in efficiency, especially if blocker geometry (i.e. conservative, simple geometry) is placed manually or generated.

The algorithm could be extended to handle soft shadows quite simply, in a manner similar to that of [Sintorn et al. 2008]. The triangle frustums would then be expanded to include the whole influence region of the triangle, given an area light source, and the shadow buffer would contain, for each view sample, one bit per light sample. Also, our novel antialiasing scheme resembles PCF filtering in many ways; it seems likely that the algorithm could be modified to support samples taken outside a pixel's bounding box, enabling PCF-style blurred shadows. Several problems remain to be solved in these areas, though.

6 Conclusion

We have presented a novel shadow algorithm based on individual triangle shadow volumes, which combines many of the strengths of shadow maps and shadow volumes. We have also demonstrated a GPU software implementation of a hierarchical rasterizer that supports the algorithm. Despite running entirely in software, it competes well against highly tuned implementations of shadow volumes that rely heavily on hardware acceleration, and it offers real-time performance. Meanwhile, the new algorithm is completely robust, works for any arbitrary collection of triangles, and integrates easily into a deferred rendering pipeline, making it a compelling choice for rendering pixel-accurate shadows, especially at high resolutions.

References

ABRASH, M. 2009. Rasterization on Larrabee. Dr. Dobb's Journal.

AILA, T., AND AKENINE-MÖLLER, T. 2004. A hierarchical shadow volume algorithm. In Proc. of the ACM SIGGRAPH/EUROGRAPHICS conf. on Graphics hardware, HWWS '04, 15–23.

AILA, T., AND LAINE, S. 2004. Alias-free shadow maps. In Proc. of EGSR 2004, 161–166.

ALDRIDGE, G., AND WOODS, E. 2004. Robust, geometry-independent shadow volumes. In Proc. of the 2nd international conf. on Computer graphics and interactive techniques in Australasia and South East Asia, GRAPHITE '04, 250–253.

ANNEN, T., MERTENS, T., SEIDEL, H.-P., FLERACKERS, E., AND KAUTZ, J. 2008. Exponential shadow maps. In Proc. of Graphics Interface 2008, GI '08, 155–161.

ARVO, J. 2004. Tiled shadow maps. In Proc. of Computer Graphics International 2004, 240–246.

BILODEAU, W., AND SONGY, M., 1999. Real time shadows. Creativity 1999, Creative Labs Inc. Sponsored game developer conferences, Los Angeles, California, and Surrey, England.

CARMACK, J., 2000. Z-fail shadow volumes. Internet Forum.

CARPENTER, L. 1984. The A-buffer, an antialiased hidden surface method. SIGGRAPH Comput. Graph. 18 (January), 103–108.

CHAN, E., AND DURAND, F. 2004. An efficient hybrid shadow rendering algorithm. In Proc. of the EGSR, 185–195.

CROW, F. C. 1977. Shadow algorithms for computer graphics. SIGGRAPH Comput. Graph. 11 (July), 242–248.

DONNELLY, W., AND LAURITZEN, A. 2006. Variance shadow maps. In Proc. of I3D 2006, I3D '06, 161–165.

EISEMANN, E., ASSARSSON, U., SCHWARZ, M., AND WIMMER, M. 2009. Casting shadows in real time. In ACM SIGGRAPH Asia 2009 Courses, SIGGRAPH Asia 2009.

ENDERTON, E., SINTORN, E., SHIRLEY, P., AND LUEBKE, D. 2010. Stochastic transparency. IEEE TVCG 99.

EPIC GAMES, 2011. Unreal development kit: Epic Citadel. http://www.udk.com/showcase-epic-citadel.

EVERITT, C., AND KILGARD, M. J., 2002. Practical and robust stenciled shadow volumes for hardware-accelerated rendering. Published online at http://developer.nvidia.com.

EVERITT, C., 2001. Interactive order-independent transparency. Published online at http://www.nvidia.com/object/Interactive_Order_Transparency.html.

FERNANDO, R. 2005. Percentage-closer soft shadows. In ACM SIGGRAPH 2005 Sketches, SIGGRAPH 2005.

FOREST, V., BARTHE, L., GUENNEBAUD, G., AND PAULIN, M. 2009. Soft textured shadow volume. Computer Graphics Forum, EGSR 2009 28, 4, 1111–1121.

FUCHS, H., GOLDFEATHER, J., HULTQUIST, J. P., SPACH, S., AUSTIN, J. D., BROOKS, JR., F. P., EYLES, J. G., AND POULTON, J. 1985. Fast spheres, shadows, textures, transparencies, and image enhancements in Pixel-planes. SIGGRAPH Comput. Graph. 19 (July), 111–120.

HASSELGREN, J., AND AKENINE-MÖLLER, T. 2007. Textured shadow volumes. Journal of Graphics Tools, 59–72.

HEIDMANN, T. 1991. Real shadows, real time. Iris Universe 18, 28–31. Silicon Graphics, Inc.

HORNUS, S., HOBEROCK, J., LEFEBVRE, S., AND HART, J. C. 2005. ZP+: correct Z-pass stencil shadows. In ACM symp. on Inter. 3D Graphics and Games, I3D, April, 2005, 195–202.

JOHNSON, G. S., LEE, J., BURNS, C. A., AND MARK, W. R. 2005. The irregular z-buffer: Hardware acceleration for irregular data structures. ACM Trans. on Graphics 24, 4, 1462–1482.

KIM, T.-Y., AND NEUMANN, U. 2001. Opacity shadow maps. In Proc. EG Workshop on Rendering Techniques, 177–182.

KIM, B., KIM, K., AND TURK, G. 2008. A shadow-volume algorithm for opaque and transparent nonmanifold casters. Journal of graphics, gpu, and game tools 13, 3, 1–14.

LAINE, S. 2005. Split-plane shadow volumes. In Proc. of Graphics Hardware 2005, 23–32.

LAURITZEN, A., SALVI, M., AND LEFOHN, A. 2011. Sample distribution shadow maps. In Proc., I3D '11, 97–102.

LEFOHN, A. E., SENGUPTA, S., AND OWENS, J. D. 2007. Resolution matched shadow maps. ACM TOG 26, 4, 20:1–20:17.

LLOYD, B., WEND, J., GOVINDARAJU, N. K., AND MANOCHA, D. 2004. CC shadow volumes. In EGSR/Eurographics Workshop on Rendering Techniques, 197–206.

LLOYD, D. B., GOVINDARAJU, N. K., QUAMMEN, C., MOLNAR, S. E., AND MANOCHA, D. 2008. Logarithmic perspective shadow maps. ACM TOG 27 (November), 106:1–106:32.

LOKOVIC, T., AND VEACH, E. 2000. Deep shadow maps. In Proc. SIGGRAPH 2000 (Aug.), SIGGRAPH 2000, 385–392.

MCCORMACK, J., AND MCNAMARA, R. 2000. Tiled polygon traversal using half-plane edge functions. In Proc. of ACM workshop on Graphics hardware, HWWS '00, 15–21.

MCGUIRE, M., AND ENDERTON, E. 2011. Colored stochastic shadow maps. In Proc. of I3D '11 (February).

OLANO, M., AND GREER, T. 1997. Triangle scan conversion using 2D homogeneous coordinates. In Proc. of ACM workshop on Graphics hardware, 89–95.

OLSSON, O., AND ASSARSSON, U. 2011. Improved ray hierarchy alias free shadows. Technical Report 2011:09, Chalmers University of Technology, May.

REEVES, W. T., SALESIN, D. H., AND COOK, R. L. 1987. Rendering antialiased shadows with depth maps. In Proc., SIGGRAPH 87, 283–291.

SINTORN, E., AND ASSARSSON, U. 2009. Hair self shadowing and transparency depth ordering using occupancy maps. In Proc., I3D '09, 67–74.

SINTORN, E., EISEMANN, E., AND ASSARSSON, U. 2008. Sample-based visibility for soft shadows using alias-free shadow maps. CG Forum (EGSR 2008) 27, 4 (June), 1285–1292.

STAMMINGER, M., AND DRETTAKIS, G. 2002. Perspective shadow maps. In Proc., SIGGRAPH 2002, 557–562.

WILLIAMS, L. 1978. Casting curved shadows on curved surfaces. SIGGRAPH Comput. Graph. 12 (August), 270–274.

WIMMER, M., SCHERZER, D., AND PURGATHOFER, W. 2004. Light space perspective shadow maps. In Rendering Techniques 2004 (Proc. EGSR), 143–151.

YUKSEL, C., AND KEYSER, J. 2008. Deep opacity maps. Computer Graphics Forum (Proc. of EUROGRAPHICS 2008) 27, 2.

ZHANG, F., SUN, H., XU, L., AND LUN, L. K. 2006. Parallel-split shadow maps for large-scale virtual environments. In Proc. of the 2006 ACM international conf. on Virtual reality continuum and its applications, VRCIA '06, 311–318.

ZIOMA, R. 2003. Reverse extruded shadow volumes. In ShaderX2: Shader Programming Tips & Tricks with DirectX 9, W. Engel, Ed. Wordware Publishing, 587–593.