Title: Tiled Forward Lighting, Using CPU and Compute Shader

Student: Alan Raby

Supervisor: Laurent Noel

A project report submitted in partial fulfillment of the Degree of BSc (Hons.) Computer Games Development

1

Abstract

The project aimed to show the viability of Tiled Forward Rendering on a modern GPU architecture for use within modern computer games, and to compare it with other modern lighting techniques. It focuses mainly on the technical aspects of Tiled Forward Rendering and on the comparison between rendering with CPU-generated tiles and with compute-shader-generated tiles. A large focus of the project was on its final outcome: showing the clear positives and negatives of this type of lighting compared to Deferred Rendering.

The paper initially explains the fundamentals behind rendering a scene in 3D, including aspects such as rendering pipelines, shader architecture and key DirectX principles. It then goes on to explain different light rendering techniques, including deferred, forward and further techniques such as clustered rendering.

The paper then explains the fundamentals of rendering using a Tiled Forward renderer and how the geometry is decoupled from the lighting. It breaks the technique down into its component parts: creating the tiles, adding in the linked lights, culling out erroneous lighting and finally rendering these with the geometry. It then shows the results and compares CPU tiled rendering with compute shader tiled rendering.

Table of Contents

1. Introduction
1.1 Background
1.2 Overview

2. Rendering a 3D World
2.1 Overview
2.2 Structure of a 3D model
2.3 Local Space
2.4 World Space
2.5 View Space
2.6 Projection Transform
2.7 Input Layouts
2.8 Rendering

3. The Rendering Pipeline
3.1 Overview
3.2 The Input Assembler
3.3 Vertex Shader
3.4 Tessellator Stages
3.5 Geometry Shader
3.6 Rasterizer Stage
3.7 Pixel Shader Stage
3.8 Output-Merger Stage

4. Displaying a 3D World
4.1 Overview
4.2 Texture Mapping
4.3 Normal Mapping
4.4 Parallax Mapping
4.5 Forward Rendering
4.6 Types of Lighting

5. Comparisons between Lighting Algorithms
5.1 Overview
5.2 Traditional Forward Rendering
5.3 Deferred Rendering
5.4 Clustered Shading
5.5 Forward+ Rendering

6. What is Tiled Forward Rendering
6.1 Overview
6.2 What is Tiled Forward Rendering
6.3 Why use Tiled Forward Rendering?
6.4 The strength of Tiled Forward Rendering
6.5 Tiled Forward Rendering Explained
6.6 Theoretical Speed comparison
6.7 Issues with tiling

7. Implementation using a CPU tiling architecture
7.1 Overview
7.2 Implementation
7.3 Buffer Creation
7.4 How the tiling was done
7.5 The Pixel Shader and Tiled Rendering

8. Implementation using the Compute Shader
8.1 Overview
8.2 How the Compute Shader works
8.3 Design and Implementation
8.4 Pixel Shader with UAV ordered buffers
8.5 Conclusion

9. Results of Tiled Forward Rendering
9.1 Overview
9.2 Methodology
9.3 Results
9.4 Tile Depth
9.5 Different Architecture AMD vs NVidia
9.6 Conclusion

10. Evaluation and Reflection
10.1 Overview
10.2 Evaluation
10.3 Reflection, Optimisations and Future improvements

References
Appendix 1 – Project Proposal
Bibliography
Appendix 2 – Comparison Data

1. Introduction

1.1 Background

In modern 3D games, the use of forward rendering to light a scene has been limited by the complexity of handling multiple lights and the processing power this involves. The more lights a scene requires, the more processing power the GPU uses to light that scene (Lauritzen, A. 2010).

In recent years, Deferred rendering has been one solution to some of the issues with multiple light sources, but its drawback is an out-of-order sequence within the GPU’s pipeline, which means that some standard rendering techniques require ‘ubershaders’ to work. Its use of G-Buffers (Saito, T, Takahashi, T 2009) also puts an added strain on the GPU that can cancel out any benefit over typical scene management using Forward Rendering.

Tiled Forward Rendering is a method of rendering that attempts to overcome the combined problem of complex lighting and complex geometry. It allows multiple, complex light sources in a scene without the computational cost that standard Forward rendering incurs, while keeping the ability to handle complex rendering cases such as transparency and full-screen anti-aliasing (Billeter, et al, 2013).

1.2 Overview

1.2.1 Chapter 2 – Rendering a 3D world

This chapter explains the concepts behind rendering a 3D world in real time and the procedures involved. It goes into detail on relevant topics such as the rendering pipeline, simple scene lighting, texture mapping and other items that are used in generating a 3D scene.

1.2.2 Chapter 3 – The Rendering Pipeline

This chapter explains how a GPU renders a 3D scene using the supplied vertex, normal and texture information. While lighting is touched on, it primarily covers the technical aspects of rendering the scene without lighting applied.

1.2.3 Chapter 4 – Displaying a 3D world

This chapter explains the mathematics and techniques used in lighting a 3D world. Issues regarding different lighting techniques are brought up and discussed, along with general benefits for each type of technique.

1.2.4 Chapter 5 – Comparisons between lighting algorithms

Different techniques for lighting a scene have different positives and negatives. This chapter explains the differences between Forward, Deferred and other lighting algorithms.

1.2.5 Chapter 6 – What is Tiled Forward Rendering

This chapter provides technical insight into Tiled Forward Lighting and explains the positives and negatives of this approach. It explains why using this technique is justified over the simpler Forward Lighting, or Deferred Lighting.

1.2.6 Chapter 7 – Implementation using a CPU tiling architecture

This chapter describes in detail how the project was implemented for CPU-based Tiled Forward rendering. It explains how all of the resources were used, the algorithms, the tiling and finally how the tiled lights were used in the rendering pipeline.

1.2.7 Chapter 8 – Implementation using the Compute Shader

Within the project, the compute shader was used to create a comparison between CPU-based Tiled Forward Lighting and a GPU version of it. The compute shader is a way of using the GPU to run many calculations in parallel, which could lead to more efficient processing of the lights.

1.2.8 Chapter 9 – Results

This chapter shows the results of Tiled Forward Lighting and how it compares across scenes, light counts, tile sizes and other metrics. Deferred lighting is also included to provide a comparison.

1.2.9 Chapter 10 – Evaluation

The final chapter is a critical evaluation of the development, process and end result of the project. It discusses any failings in the project and the insights gained into what could be added going forward.

2. Rendering a 3D World

2.1 Overview

Generating a 3D world involves many steps from start to finish, and this section describes those steps in detail. It covers the maths that plays a large role in transforming models into the 3D world, from simple transformations to more complex rotations using matrices. It also covers the hardware process, in particular the GPU pipeline and the DirectX API (Application Programming Interface) that renders these 3D images to the screen.

2.2 Structure of a 3D model

A 3D model is made up of several components stored in a structured way in memory. The first stage of displaying a 3D world is to load a model into memory as a mesh. The mesh is usually a list of vertices stored contiguously in a vertex buffer (Luna D, 2012). These vertices are then used to make up a ‘wireframe’ mesh of the model.

The vertices themselves are grouped in a structured way, most commonly as either a triangle list or a triangle strip. These groupings are known as primitive topologies (Luna D, 2012) and are selected through the API. They range from simple point lists to more complex control point patch lists; however, most models and 3D engines deal with triangle strips and triangle lists rather than more complex polygons (Luebke D, 2007). These lists or strips are stored in the vertex buffer and are passed from the CPU to the GPU.

A triangle list stores 3 vertices for every triangle that you wish to display. The triangles don’t need to be connected to any other triangles in your model. While this gives you a lot of flexibility, it also means a larger memory footprint, because vertices shared between triangles are stored more than once.

A triangle strip, on the other hand, is a list of triangles that are connected by shared vertices. After the first triangle, each new triangle reuses the last two vertices of the previous one, so it adds only a single new vertex to the strip. This method of displaying the mesh is less flexible, as you need to think carefully about how the triangle strip is made up, but it is far less resource intensive because less memory is used (both GPU and system RAM), which ultimately results in a more efficient way to display the model (ideal for 3D game engines).
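The difference in vertex counts between the two topologies can be illustrated with a short sketch. It is written in Python purely for illustration (the project itself targets DirectX and HLSL); the function name `strip_to_triangles` and the vertex labels are assumptions made for this example.

```python
def strip_to_triangles(strip):
    """Expand a triangle strip into individual triangles.

    Every vertex after the first two completes one new triangle with the
    two vertices before it. Every second triangle swaps its first two
    vertices so that all triangles keep the same winding order.
    """
    triangles = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        triangles.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return triangles


strip = ["v0", "v1", "v2", "v3", "v4", "v5"]
triangles = strip_to_triangles(strip)

# A triangle list would need 3 vertices per triangle for the same mesh,
# whereas the strip needs only (number of triangles + 2).
list_vertex_count = 3 * len(triangles)
strip_vertex_count = len(strip)
```

For this six-vertex strip the same four triangles would need twelve vertices as a triangle list, which is the memory saving described above.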

An example of a triangle list. Each triangle is created using 3 vertices, without reusing any for the next triangle. Inefficient but easier to visualize.

In order to render these triangle strips, a separate list is created. This list defines how the vertices are put together to form the triangles in the triangle strip. It is stored in memory and passed with the vertices to the GPU, and is known as the index buffer. It consists of the order in which the vertices are used to make up each triangle.
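As a minimal sketch of the idea, the Python below builds a quad from four unique vertices and an index buffer. It uses a triangle list rather than a strip only to keep the example short (index buffers are used with both topologies); `assemble_triangles` and the vertex positions are hypothetical names and values chosen for illustration.

```python
# Four unique corner vertices of a quad, stored once each (x, y, z).
vertex_buffer = [
    (0.0, 0.0, 0.0),  # 0: bottom-left
    (0.0, 1.0, 0.0),  # 1: top-left
    (1.0, 0.0, 0.0),  # 2: bottom-right
    (1.0, 1.0, 0.0),  # 3: top-right
]

# Six indices describe two triangles; vertices 1 and 2 are referenced twice
# instead of being duplicated in the vertex buffer.
index_buffer = [0, 1, 2, 2, 1, 3]


def assemble_triangles(vertices, indices):
    """Look up each group of three indices to form one triangle."""
    return [
        tuple(vertices[i] for i in indices[n:n + 3])
        for n in range(0, len(indices), 3)
    ]


quad = assemble_triangles(vertex_buffer, index_buffer)
```

The vertex data is stored once and only the small integer indices repeat, which is where the saving comes from when vertices carry normals and texture coordinates as well as positions.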

2.3 Local Space

3D models initially start off in what is called local space (or model space). This is usually a 3D coordinate system with its origin around the centre of the model. Every vertex in the model is defined relative to this space until it is subsequently transformed into world space.

This allows a model to be created independently from the 3D world, using its own coordinates. This makes it far easier to manipulate prior to its transformation into world space. It also allows you to reuse a model across multiple scenes, or to instance the model several times in the same scene, with nothing more than a different local-to-world transform.

This can be thought of as building a model in one place before placing it alongside all your other models when you need to display it.

2.4 World Space

The world space is the fundamental coordinate system, or space. It is the space that all of the models ultimately end up transformed into, where the game or scene takes place. World space is not defined by any other parent space and the position and orientation of world space are not particularly special (Rabin, 2010).

An example of a triangle strip. Notice the red lines that indicate the final join. More complex than a triangle list, but more efficient.

An object in its own local space, able to rotate around its axis.

To place a model into world space, a transformation from local space to world space is required. This is done by means of a matrix multiplication on the vertices of the local model. During the transformation, scaling, skewing, rotation and translation are all done with a single matrix. This matrix, sometimes called the world matrix, is unique to each instance of an individual model. It is created by combining other matrices, generally a scale, a rotation and a translation matrix. The order in which you multiply these matrices is important because matrix multiplication is non-commutative (Van Verth, 2008).
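The effect of multiplication order can be shown with a small sketch. It is plain Python for illustration, assuming the row-vector convention commonly used with DirectX (a point is multiplied on the left of the matrix); all function names here are hypothetical helpers, not part of any API.

```python
import math


def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def scaling(s):
    return [[s, 0, 0, 0], [0, s, 0, 0], [0, 0, s, 0], [0, 0, 0, 1]]


def rotation_y(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0, -s, 0], [0, 1, 0, 0], [s, 0, c, 0], [0, 0, 0, 1]]


def translation(x, y, z):
    return [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [x, y, z, 1]]


def transform_point(point, m):
    """Transform a 3D point (row vector with w = 1) by a 4x4 matrix."""
    v = (point[0], point[1], point[2], 1.0)
    return tuple(sum(v[k] * m[k][j] for k in range(4)) for j in range(3))


scale = scaling(2.0)
rotate = rotation_y(math.pi / 2)
move = translation(10.0, 0.0, 0.0)

# Scale, then rotate, then translate: the usual order for a world matrix.
world = mat_mul(mat_mul(scale, rotate), move)
# The same three matrices combined in the opposite order.
reversed_order = mat_mul(mat_mul(move, rotate), scale)

placed = transform_point((1.0, 0.0, 0.0), world)              # about (10, 0, -2)
misplaced = transform_point((1.0, 0.0, 0.0), reversed_order)  # about (0, 0, -22)
```

The same three matrices place the same vertex in two very different positions, which is why the scale, rotation and translation are normally combined in that order.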

Once these transformations are finished, the model is now effectively part of the 3D scene, with all of the other transformed models. All of the models at this stage will be relative to each other in world space, ready to be transformed from world space to view space.

2.5 View Space

Once the models are all placed into world space, a further transformation into view space needs to be done. This is effectively a transformation to view the scene from a point and direction within the world space, usually visualised as a camera positioned in the world. Once the transformation to view space has taken place, the final projection transformation follows. Again, this is done via a matrix multiplication, which transforms the 3D scene into a 2D scene.
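One common way to build the view transform is a "look-at" matrix derived from the camera position, a target point and an up direction. The sketch below is illustrative Python assuming a left-handed, row-vector convention; `look_at` and `to_view_space` are hypothetical names, and the report does not state which construction the project used.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    length = math.sqrt(dot(v, v))
    return tuple(x / length for x in v)


def look_at(eye, target, up):
    """Build a view matrix for a camera at `eye` looking towards `target`."""
    forward = normalize(sub(target, eye))   # camera Z axis
    right = normalize(cross(up, forward))   # camera X axis
    true_up = cross(forward, right)         # camera Y axis
    return [
        [right[0], true_up[0], forward[0], 0.0],
        [right[1], true_up[1], forward[1], 0.0],
        [right[2], true_up[2], forward[2], 0.0],
        [-dot(right, eye), -dot(true_up, eye), -dot(forward, eye), 1.0],
    ]


def to_view_space(point, view):
    """Transform a world-space point (row vector, w = 1) into view space."""
    v = (point[0], point[1], point[2], 1.0)
    return tuple(sum(v[k] * view[k][j] for k in range(4)) for j in range(3))


# Camera five units behind the world origin, looking at the origin.
view = look_at((0.0, 0.0, -5.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
origin_in_view = to_view_space((0.0, 0.0, 0.0), view)  # 5 units in front of the camera
```

After the transform, the camera sits at the origin looking along +Z, so a point's view-space Z is simply its distance in front of the camera.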

An example of how a local space fits within a world space. Each has its own coordinate system, but the local object has its coordinates transformed into world space during the rendering cycle.

Image of the view from the camera as it views the world.

2.6 Projection Transform

Once the models are transformed to view space, they need to be projected onto the viewport (or virtual screen), which is effectively a 2D image of the scene. To do this a projection transform is performed, which allows the viewer to see the scene as the camera would see it. The projection transformation uses what is called the projection matrix to transform view space into projection space. At this point the volume of the world that is visible needs to be transformed from 3D to 2D (Luna D, 2012). The projection matrix typically scales and makes a perspective projection of the scene (Microsoft, 2011), along with defining the frustum of the transform (the volume that is viewed). The frustum is defined by the Field of View (FoV) of the camera, along with the near and far clipping planes.

These values are placed in the projection matrix, which is then applied to each position in view space. The result gives the 2D representation of a point in 3D space.
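A minimal sketch of this step is given below in Python. It assumes a left-handed perspective matrix with depth mapped to the 0 to 1 range, followed by the perspective divide and a viewport mapping; the function names and the 1280 x 720 viewport are assumptions made for the example.

```python
import math


def perspective(fov_y, aspect, z_near, z_far):
    """Perspective projection matrix (row-vector convention, depth in 0..1)."""
    y_scale = 1.0 / math.tan(fov_y / 2.0)
    x_scale = y_scale / aspect
    depth = z_far / (z_far - z_near)
    return [
        [x_scale, 0.0, 0.0, 0.0],
        [0.0, y_scale, 0.0, 0.0],
        [0.0, 0.0, depth, 1.0],
        [0.0, 0.0, -z_near * depth, 0.0],
    ]


def project(point, proj, width, height):
    """Project a view-space point to a pixel position and a 0..1 depth."""
    v = (point[0], point[1], point[2], 1.0)
    x, y, z, w = (sum(v[k] * proj[k][j] for k in range(4)) for j in range(4))
    # Perspective divide: w holds the view-space depth, so distant points
    # are pulled towards the centre of the screen.
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w
    pixel_x = (ndc_x + 1.0) * 0.5 * width
    pixel_y = (1.0 - ndc_y) * 0.5 * height  # screen Y runs downwards
    return pixel_x, pixel_y, ndc_z


proj = perspective(math.pi / 2, 16 / 9, 1.0, 100.0)
near_point = project((2.0, 0.0, 10.0), proj, 1280, 720)
far_point = project((2.0, 0.0, 20.0), proj, 1280, 720)
```

Both points share the same view-space X, but the farther one lands closer to the centre of the screen, which is the perspective effect the projection matrix provides.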

2.7 Input Layouts

Once your world, view and projection matrices have been sent to the GPU, you can send the vertex data to the GPU via the vertex buffer. The input layout describes how this per-vertex data is arranged in the buffer. The information sent is generally the vertex position in model space and the texture coordinates for the texture mapping process. Other information can also be sent, such as lighting information like the normals of the vertices.

Example of an Input layout
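Since the original figure is not reproduced here, the following Python sketch illustrates what an input layout conveys: which elements each vertex carries, in what order, and at what byte offset. It is not the project's layout; the element list and helper names are assumptions, with the semantic names borrowed from common HLSL usage.

```python
import struct

# (semantic name, struct format): three floats, three floats, two floats.
INPUT_LAYOUT = [
    ("POSITION", "3f"),
    ("NORMAL", "3f"),
    ("TEXCOORD", "2f"),
]


def describe_layout(layout):
    """Return the byte offset of each element and the total vertex stride."""
    offsets = {}
    offset = 0
    for semantic, fmt in layout:
        offsets[semantic] = offset
        offset += struct.calcsize("<" + fmt)
    return offsets, offset


def pack_vertex(position, normal, uv):
    """Pack one vertex into bytes in the order given by INPUT_LAYOUT."""
    fmt = "<" + "".join(f for _, f in INPUT_LAYOUT)
    return struct.pack(fmt, *position, *normal, *uv)


offsets, stride = describe_layout(INPUT_LAYOUT)
vertex_bytes = pack_vertex((1.0, 2.0, 3.0), (0.0, 1.0, 0.0), (0.5, 0.25))

# Reading the texture coordinate back relies only on the offset in the layout.
uv_read_back = struct.unpack_from("<2f", vertex_bytes, offsets["TEXCOORD"])
```

The offsets and stride are what allow the input assembler to pull the right bytes out of the vertex buffer for each shader input.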

The projection transforms from 3D to 2D

2.8 Rendering

After everything has been placed in the input buffers, resources, index buffers and other GPU-based variables (such as RW buffers, textures and arrays), the final stage of rendering happens. Until recently, the preferred method with DirectX was to create a technique that renders the image on the GPU. However, with the advent of DirectX 11, the use of techniques to render is being deprecated. Instead, the states must be set up individually prior to calling the shaders. While this has always been possible, it was not as popular as FX files (which hold the techniques) because FX files are easier to use; it does, however, give more flexibility overall.

An example of an HLSL technique. This one runs through the vertex shader, geometry shader and pixel shader in that order. It also indicates the type of blending that the final output uses (additive in this case), the type of polygon culling and whether the depth buffer is being used.

3. The Rendering Pipeline

3.1 Overview

The render pipeline on a DX11-capable card consists of many stages, from the first, the Input Assembler stage, to the last, the Output Merger stage. Generally each stage of the pipeline takes data from the previous one, ultimately giving a rendered 3D scene. Certain stages within the pipeline are programmable in High Level Shading Language (HLSL), while others just act on the data passing through.

3.2 The Input Assembler

The Input Assembler stage reads the user defined data (vertex data, normal, UV data, etc.) from a vertex buffer and assembles that information into the primitive data that will be used by the other pipeline stages (Microsoft, 2011).

3.3 Vertex Shader

This stage generally performs transformations on vertex data, such as the world, view and projection transforms. The vertex shader takes in one vertex and outputs a single vertex. Other information, such as normal information or texture UV coordinates, is generally passed through this stage without any transformation.

3.4 Tessellator Stages

The tessellation stages consist of the Hull Shader stage, the Tessellator stage and the Domain Shader stage. They are used when tessellation is required for a scene. While similar to a geometry shader, they take in more input data to allow more complex geometry to be created from low-polygon-count meshes. A good example would be taking in depth information for a stone wall and creating new primitives for it, resulting in a surface with real 3D detail rather than a flat 2D one. The three stages work together to give a more lifelike model, each using the previous stage’s output. These stages are complex and outside the scope of this document, but have been included for completeness.

3.5 Geometry Shader

The geometry shader, unlike the vertex shader, works on whole primitives, such as triangles, lines or points. The geometry shader’s input is the output from the vertex shader. In addition, each primitive can also include vertex data for edge-adjacent primitives. The geometry shader can both increase and decrease the number of primitives that it outputs. This can be useful in many areas of a 3D scene, from lighting, instancing, real world physics and particles to simple tessellation. The resulting data can then be passed to the Stream Output stage and recirculated back into the pipeline, or it can carry on to the next stage in the pipeline, the Rasterizer stage.

3.6 Rasterizer Stage

The rasterization stage converts primitive data into a raster image (composed of pixels), interpolating per-vertex values across the pixels within the viewport’s dimensions (Wei, Lei. 2005). Rasterization also includes clipping of the primitive to the viewport, which involves testing whether the primitive should be displayed on the screen. If the primitive lies entirely outside the boundary of the viewport, then no rasterization takes place and the pixel shader is never invoked.

When vertices enter the rasterizer, they are all assumed to be in homogeneous clip space, with the X axis pointing right, Y pointing up and Z pointing away from the camera. This allows the depth buffer to be tested at this point, along with culling of primitives based on whether they are drawn clockwise or counter-clockwise. After interpolation has taken place, the rasterizer invokes the pixel shader for every pixel of the clipped polygon that is to be displayed.
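Winding-order culling can be sketched with the signed area of the projected triangle. The Python below is an illustration only, assuming 2D positions with Y pointing up as described above and clockwise triangles treated as front-facing by default; `is_culled` and its parameter are hypothetical names.

```python
def signed_area(a, b, c):
    """Signed area of a 2D triangle; positive when a, b, c run counter-clockwise."""
    return 0.5 * ((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1]))


def is_clockwise(a, b, c):
    return signed_area(a, b, c) < 0.0


def is_culled(a, b, c, front_is_clockwise=True):
    """Return True when back-face culling would discard the triangle."""
    if signed_area(a, b, c) == 0.0:
        return True  # a degenerate triangle covers no pixels
    return is_clockwise(a, b, c) != front_is_clockwise


# Up, then right, then back to the start: a clockwise triangle.
front_facing = ((0.0, 0.0), (0.0, 1.0), (1.0, 0.0))
# The same corners in the opposite order wind counter-clockwise.
back_facing = (front_facing[0], front_facing[2], front_facing[1])

front_culled = is_culled(*front_facing)
back_culled = is_culled(*back_facing)
```

Because the test needs only three subtractions and two multiplications per triangle, it is cheap to run before any pixel work is started.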

3.7 Pixel Shader Stage

The pixel shader is generally the main stage for displaying a 3D scene. The input for this stage is an interpolated pixel based on the primitive that the rasterizer stage received from either the vertex shader or the geometry shader. The output is generally a single float4 value that represents the colour of the pixel as RGB plus an alpha channel. The pixel shader is also a programmable stage using HLSL and can use unordered access (UAV) buffers for read/write.

The pixel shader stage is generally the stage that textures the primitive, using the UV data passed in, and that performs per-pixel lighting. This makes the stage very flexible, and it is the prime stage for texturing and lighting, which will be discussed in depth in the next chapters.

Example of culling: the areas of the primitives outside of the viewport’s dimensions are not sent to the pixel shader.

3.8 Output-Merger Stage

The output-merger stage is the final stage in the pipeline. At this stage the output data from the pixel shader, depth buffer and stencil buffer are combined to produce the final pixel for the render target. In a basic scene this render target could be the back buffer, ready to display to the screen, or it could be a buffer that will be used by further post-processing techniques. If it is the back buffer that is being rendered to, this determines how the scene will look on the display.

4. Displaying a 3D World

4.1 Overview

In this chapter I will explain the different lighting techniques that a 3D game engine employs, from simple methods of lighting a surface using forward rendering to types of light mapping on textured surfaces. This sets the scene for further explanations in the next chapters.

4.2 Texture Mapping

When it comes to creating a rich and colourful 3D world, the preferred method in 3D games is to use texture maps. This is a technique of mapping image data, such as a photograph, onto the face of a polygon.

For this to happen, every vertex in a polygon is given a texture co-ordinate (also known as a UV co-ordinate). Then, when it comes to drawing a pixel on the screen, instead of using a plain colour, the pixel shader samples the colour stored in the texture map. This colour is then modified based on any lighting that is affecting the pixel.

Due to the nature of 3D graphics processing, the texture map may be needed in differing resolutions, both for efficiency and to smooth the texture when it is viewed up close or at a distance. This method, known as mipmapping, allows a suitably sized version of the texture to be picked based on depth. While you could have the GPU calculate these versions, it is more efficient to have them pre-sampled and stored prior to rendering.

As with all types of mapping, textures use what is called the UV coordinate system. This system uses the range 0 to 1 on both axes, with the V (vertical) axis running from top to bottom.
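The set of pre-sampled, progressively smaller copies described above is commonly called a mip chain. The short Python sketch below lists the sizes in such a chain, assuming each level halves the previous one down to 1 x 1; `mip_chain` is a hypothetical helper and says nothing about how the project stored its textures.

```python
def mip_chain(width, height):
    """List the (width, height) of every level from full size down to 1x1."""
    levels = []
    while True:
        levels.append((width, height))
        if width == 1 and height == 1:
            break
        # Each level halves both dimensions, never dropping below one texel.
        width = max(1, width // 2)
        height = max(1, height // 2)
    return levels


levels = mip_chain(256, 64)
total_texels = sum(w * h for w, h in levels)
```

For a square texture the extra levels add roughly a third to the memory used by the full-size image, which is the price paid for cheaper and smoother sampling at a distance.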

A final option that you can use with textures is the way they behave when you go beyond the 0-1 range. There are 4 modes that can be implemented (Rabin, 2010).

Wrap: Wrap makes the texture appear to start from the edge you finish. So if you have a UV coordinate of (1.5, 1.5), that would actually translate to (0.5, 0.5).

Clamp: Clamp causes the texture to use the outermost pixel for any coordinate beyond the edge. So calling a UV of (1.5, 1.5) would give you the pixel from (1, 1).

The above image is a simple stone texture map that could be used to overlay on top of a polygon.

Mirror: Mirror is similar to wrap, except that the texture is mirrored in the relevant direction. So a UV of (1.25, 1.25) would actually give you the pixel from (0.75, 0.75). This mirroring carries on forever, so a UV of (2.25, 2.25) would give you the pixel from (0.25, 0.25).

Mirror Once: Mirror once is very similar to mirror, but only mirrors once in the -1 to +1 range. After that, it behaves just like clamp.
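The four addressing modes above can be sketched as functions on a single coordinate, applied to U and V independently. This is illustrative Python rather than the hardware's exact sampling rules, and the function names are assumptions made for the example.

```python
import math


def wrap(t):
    """Repeat the texture: keep only the fractional part of the coordinate."""
    return t - math.floor(t)


def clamp(t):
    """Pin the coordinate to the nearest edge of the 0..1 range."""
    return min(max(t, 0.0), 1.0)


def mirror(t):
    """Flip the texture on every second repeat (period of 2)."""
    period = t % 2.0
    return period if period <= 1.0 else 2.0 - period


def mirror_once(t):
    """Mirror about zero once (covering -1..1), then clamp beyond that."""
    return clamp(abs(t))


examples = {
    "wrap": wrap(1.5),
    "clamp": clamp(1.5),
    "mirror": mirror(1.25),
    "mirror_again": mirror(2.25),
    "mirror_once": mirror_once(-0.25),
}
```

The values in `examples` correspond to the worked coordinates given in the descriptions of each mode.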

4.3 Normal Mapping

Normal mapping is another technique for adding detail to a 3D world without using too many resources. The point of a normal map is to add what look like bumps and light-reflecting details to an object, giving it a more natural look, especially when the object is moving around a light source. Normal maps are resources, generally stored in the RGB channels of a texture, where each value represents the direction vector of the surface (its normal) at that point. These surface normals are retrieved in the same manner as the texture map, using interpolation and UV co-ordinates.

When the vector is retrieved from the normal map, it is used as the surface normal for that part of the texture. It is combined with the light sources at this point to give the impression of depth for that part of the polygon.
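A minimal sketch of that lookup is shown below in Python. It assumes the common encoding in which each 8-bit colour channel maps from the 0 to 255 range onto the -1 to 1 range, and it works in the texture's own (tangent) space for simplicity; `decode_normal` and `lambert` are hypothetical names.

```python
import math


def decode_normal(r, g, b):
    """Convert an 8-bit RGB texel into a unit-length normal vector."""
    n = tuple(channel / 255.0 * 2.0 - 1.0 for channel in (r, g, b))
    length = math.sqrt(sum(c * c for c in n))
    return tuple(c / length for c in n)


def lambert(normal, to_light):
    """Brightness from the angle between the normal and the light direction."""
    return max(0.0, sum(n * l for n, l in zip(normal, to_light)))


light_direction = (0.0, 0.0, 1.0)  # light shining straight onto the surface

flat = decode_normal(128, 128, 255)     # the typical blue of a flat normal map
tilted = decode_normal(255, 128, 128)   # a texel leaning hard to one side

flat_brightness = lambert(flat, light_direction)
tilted_brightness = lambert(tilted, light_direction)
```

Two neighbouring texels on the same flat polygon can therefore receive very different amounts of light, which is what creates the impression of bumps.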

4.4 Parallax Mapping

Parallax mapping is a technique where the pixels of a texture are displaced by an amount stored in a parallax texture. This allows for 3D-looking surfaces using a 2D texture map. The parallax resource stores the height of each pixel of the texture; an algorithm in the pixel shader then displaces the texture lookup based on that height. The end result is a 3D effect on the texture which, when used with other lighting techniques, gives a look of depth to a flat surface. However, the illusion can be spotted when the surface is viewed edge-on, as there is no real depth.
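The report does not give the displacement algorithm, so the Python below sketches one simple and widely used form as an assumption: the texture coordinate is shifted along the view direction by an amount proportional to the stored height. The `scale` and `bias` values and the function name are illustrative, and sign conventions differ between implementations.

```python
def parallax_uv(uv, view_dir_tangent, height, scale=0.04, bias=-0.02):
    """Shift a UV coordinate along the view direction according to height.

    uv               -- original (u, v) texture coordinate
    view_dir_tangent -- unit vector towards the viewer in the surface's
                        tangent space, where z points out of the surface
    height           -- value sampled from the height map, in 0..1
    """
    offset = height * scale + bias
    return (uv[0] + view_dir_tangent[0] * offset,
            uv[1] + view_dir_tangent[1] * offset)


# Looking straight down at the surface: no sideways shift at all.
head_on = parallax_uv((0.5, 0.5), (0.0, 0.0, 1.0), height=1.0)

# Looking from an angle: high points shift one way, low points the other.
high_point = parallax_uv((0.5, 0.5), (0.6, 0.0, 0.8), height=1.0)
low_point = parallax_uv((0.5, 0.5), (0.6, 0.0, 0.8), height=0.0)
```

Because only the lookup position changes and no geometry is created, the surface silhouette stays flat, which is why the effect breaks down at grazing angles.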

The above image shows a height map that is used for a parallax map. Each pixel within the resource depicts the height at that point. It is combined with its corresponding texture to give the impression of a 3D image. The image on the left shows the result of using parallax mapping on a texture.

The left hand side of the image shows Parallax mapping, while the right is a standard texture.

The image above is a normal map, showing the vectors as RGB colours

4.5 Forward Rendering

Forward rendering is the basic method of rendering lighting and geometry at the same time. The scene and lights are sent to the GPU effectively together (coupled) and the pixel shader loops through all the lights as it shades each rasterised pixel. This method is most in tune with the architecture of the GPU but suffers from a computational explosion as every new light is added.

Forward rendering is the primary method of modern day GPUs to light a scene. It is by far the easiest type of lighting to do as it follows the rendering pipeline, taking place during the pixel shading of the geometry.

The advantage of this method of shading is that the GPU can create complex shading via the hardware, rather than using complex shaders. However, as more and more lights are added to the scene, the shader work grows in proportion to the amount of geometry shaded multiplied by the number of lights (Lazarov, D. 2011).

As you can see, this method quickly becomes a burden on the GPU once the number of lights starts to increase, as its Big O complexity shows: O(geometry screen pixels × number of lights).
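To make the growth concrete, the cost can be modelled as one lighting evaluation per shaded pixel per light. The sketch below is only a back-of-the-envelope model under that assumption (it ignores overdraw and any culling), and the function name is hypothetical.

```cpp
#include <cstdint>

// Hypothetical cost model for plain forward rendering:
// one lighting evaluation per shaded pixel per light.
std::uint64_t forwardLightingEvaluations(std::uint64_t shadedPixels,
                                         std::uint64_t lightCount)
{
    return shadedPixels * lightCount;
}
```

For a 1280 by 960 frame (1,228,800 pixels) with 100 lights, this gives 122,880,000 evaluations per frame, and doubling the number of lights doubles the work.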

In modern 3D games, the use of forward rendering for lighting is limited by the processing power needed to render multiple lights: the more lights a scene requires, the more processing power the GPU uses to light that scene (Lauritzen, A. 2010).

4.6 Types of Lighting

Within a scene there are many types of lighting that can be used, from simple ambient lighting to more complex specular lighting. All of these affect the texture that is displayed on the polygon in one way or another.

4.6.1 Diffuse Lighting

Diffuse lighting works by scattering the light from a light source in many directions. This gives the effect of lighting an object when a light source passes near enough to it. Diffuse lighting does not take the viewer’s angle into account, so the light on an object depends only on the angle at which the light hits the surface and on the attenuation (distance) from the light source.


An example of how a light hitting an object bounces off in all directions without any lessening of the power.

When diffuse lighting hits an object, the reflected light combines the colour of the light and the sampled texture colour of the object. This results in a natural looking object that the viewer sees.
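A minimal sketch of a diffuse term is given below. It uses the standard Lambert cosine factor together with a simple linear falloff to the light's radius; the falloff model is an assumption made for illustration, as the report does not specify one, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Returns a scalar in [0, 1] that would be multiplied by the light colour
// and the sampled texture colour. unitNormal is assumed to be unit length.
float diffuseFactor(const Vec3& surfacePos, const Vec3& unitNormal,
                    const Vec3& lightPos, float lightRadius)
{
    Vec3 toLight{ lightPos.x - surfacePos.x,
                  lightPos.y - surfacePos.y,
                  lightPos.z - surfacePos.z };
    float dist = std::sqrt(dot(toLight, toLight));
    if (dist <= 0.0f || dist >= lightRadius) return 0.0f;  // out of range

    Vec3 l{ toLight.x / dist, toLight.y / dist, toLight.z / dist };
    float lambert = std::max(0.0f, dot(unitNormal, l));    // angle term
    float attenuation = 1.0f - dist / lightRadius;         // assumed linear falloff
    return lambert * attenuation;
}
```

Note that the viewer's position does not appear anywhere in the calculation, which is the defining property of diffuse lighting.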

4.6.2 Specular Lighting

Unlike diffuse lighting, specular lighting takes into account the position (angle) of the viewer and the angle at which the light ray hits the object. This style of lighting creates a more natural look to an object as it creates highlighted parts of the image, based on the reflection of the light off that object.

Like diffuse lighting, the colour of the light and the colour of the sampled texture are important in the effect. If the light source is white, the reflected light will still hold much of its source colour, in this case white. Unlike diffuse lighting, the reflected light is usually brighter, as most of the light is focused rather than diffused in all directions.
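The dependence on the viewer can be sketched with the classic Phong specular term. This is one standard model chosen for illustration, not necessarily the one used in the project; the function names and the `shininess` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

float dot(const Vec3& a, const Vec3& b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Phong specular term. All three vectors are assumed to be unit length;
// toLight and toViewer both point away from the surface.
float phongSpecular(const Vec3& normal, const Vec3& toLight,
                    const Vec3& toViewer, float shininess)
{
    float nDotL = dot(normal, toLight);
    if (nDotL <= 0.0f) return 0.0f;  // light is behind the surface

    // Mirror the light direction about the normal: R = 2(N.L)N - L
    Vec3 reflected{ 2.0f * nDotL * normal.x - toLight.x,
                    2.0f * nDotL * normal.y - toLight.y,
                    2.0f * nDotL * normal.z - toLight.z };
    float rDotV = std::max(0.0f, dot(reflected, toViewer));
    return std::pow(rDotV, shininess);  // higher shininess gives a tighter highlight
}
```

The term is largest when the viewer looks straight down the mirrored light direction and falls away quickly on either side, producing the bright spot described above.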

4.6.3 Ambient Lighting

Ambient lighting is the simplest lighting effect used in 3D worlds. It is a fixed intensity light that affects all objects equally, regardless of distance or angle from the light. The viewing angle and position also have no bearing on how much light affects the object that is being viewed.

When a scene is rendered, every pixel that is shown will take on the colour of the ambient light as its base. Subsequent lighting techniques (such as diffuse, specular etc.) will combine with the ambient light to give the final lit pixel.

Ambient light can also be combined with an ambient occlusion map. This is a method that darkens parts of an image based on how much each area is occluded by surrounding surfaces. It is a cheaper alternative to the superior, but costlier, global illumination.

An example of how, unlike diffuse lighting, specular lighting reflects at an angle to give a bright spot on the model.

4.6.4 Global Illumination

Global illumination is a way to give more realistic levels of lighting throughout a scene. Its main basis is to light polygons using indirect lighting, such as shadows, refracted light and reflected light from other surfaces. The disadvantage of this sort of lighting is efficiency: global illumination requires a lot of work from both the GPU and the CPU. However, the results do create a sense of photo-realism when combined with other lighting and texturing techniques.

Below are two images that demonstrate global illumination. The first is without it, while the second shows global illumination in use.


This image shows how a scene looks prior to Global illumination being used. None of the surfaces reflect light onto others.

This image shows how a scene looks after global illumination is used. Note how the walls now have reflected green and red light on them.

5. Comparisons between Lighting Algorithms

5.1 Overview

This section looks at how the lighting in 3D engines is continually improving and how in the future techniques such as Tiled Forward Rendering, Tiled Deferred Rendering or other advanced lighting techniques could be used. It will look at new techniques that are still in their infancy, along with ones that are now entering the supply chain. It will also discuss current lighting schemes.

5.2 Traditional Forward Rendering

Forward rendering is the basic method of rendering lighting and geometry at the same time. The scene and lights are sent to the GPU effectively together (coupled) and the pixel shader loops through all the lights as it shades each rasterised pixel. This method is most in tune with the architecture of the GPU but suffers from a computational explosion as every new light is added. The advantage of this method of shading is that the GPU can create complex shading via the hardware, rather than using complex shaders. However, as more and more lights are added to the scene, the shader work grows in proportion to the amount of geometry shaded multiplied by the number of lights (Lazarov, D. 2011).

5.3 Deferred Rendering

Deferred rendering was a new approach to lighting a scene, using a technique of decoupling the lights from the geometry. Due to this approach, the number of lights that can be displayed is far greater than traditional forward rendering could achieve (Liktor, G. Dachsbacher, C., 2013).

In order for this to work the geometry is rendered to a G-buffer (Geometry Buffer) on a per pixel basis. For complex scenes the G-buffer must also hold extra information such as normal, diffuse colour, and world position. Once this data has been stored, the lighting calculations are then done on the G-buffer, giving the end result of a lit scene (Klint, J. 2008).

Deferred rendering isn’t without its drawbacks. Unlike forward rendering, it struggles with FSAA/MSAA and transparency. Transparency in itself requires another G-Buffer to be used, which takes extra memory (Liktor, G. Dachsbacher, C., 2013).

Also of note is that each G-Buffer tends to be quite complex and takes a lot of memory. In real world gaming situations this would mean displacing textures from the GPU to allow bigger G-Buffers to be used, or more of them (transparency) (Lauritzen, A. 2010).


Simplified workings of Deferred Shading

Shows a comparison between Tiled Deferred and Tiled Forward+, based on AMD’s own demo

As can be seen in the above figure, Deferred Lighting (in this case Tiled Deferred) has a performance gain over Tiled Forward until the number of lights passes about 2000. This doesn’t take into account MSAA or blending, which also favour Tiled Forward (AMD 2013).

5.4 Clustered Shading

Like all partitioning methods, Tiled Forward rendering (TFM) suffers from saturation issues where lights cluster on a single tile. This can cause a reduction in speed due to overdrawing. A solution that was recently demonstrated in AMD’s Leo tech demo was to incorporate 3D tiles. This solution is known as Clustered Shading (Billeter, M. Olsson, O. Assarsson, U. 2012), although AMD’s approach also incorporated other complex scene lighting.

The principles are the same as standard 2D tiling in that you create a linked list of lights for the tile but instead of linking them in a 2D buffer, you make your buffer 3D and link the lights based on depth as well as the usual height and width of the screen.
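The extra depth dimension can be sketched as a lookup that maps a pixel and its depth to a 3D tile (cluster) coordinate. The sketch below assumes the depth range is cut into equal slices for simplicity; published clustered shading implementations often use a non-uniform distribution instead, and the names used here are hypothetical.

```cpp
#include <algorithm>

struct ClusterCoord { int x, y, z; };

// Hypothetical helper: find the cluster for a pixel at a given view depth.
// Assumes uniform depth slices between nearZ and farZ.
ClusterCoord clusterForPixel(int pixelX, int pixelY, float viewDepth,
                             int tileSize, float nearZ, float farZ,
                             int depthSlices)
{
    float t = (viewDepth - nearZ) / (farZ - nearZ);   // 0 at near, 1 at far
    t = std::min(1.0f, std::max(0.0f, t));
    int slice = std::min(depthSlices - 1, static_cast<int>(t * depthSlices));
    return ClusterCoord{ pixelX / tileSize, pixelY / tileSize, slice };
}
```

Two pixels in the same 2D tile but at very different depths now land in different clusters, so each only has to consider the lights near its own depth.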

This ultimately allows you to target only the lights that affect the geometry being rendered at that point in time. The disadvantage is the added complexity of the shader, but this is outweighed by the speed increase, which ultimately allows a more complex and lifelike scene to be displayed.

Caption shows 3D tile placements in yellow (courtesy GPU Gems4)

5.5 Forward+ Rendering

Forward+ rendering is AMD’s version of Tiled Forward/Clustered rendering, but it also encompasses ray cast light creation to give an extra level of illumination, with rendered objects being lit by six rays per light. The area on which a ray lands is then turned into a 180 degree spot light, which gives off a global illumination effect on any subsequent objects that this newly created light illuminates (Harada T, et al. 2013).

Using this technique allows for almost film studio type lighting within a scene, as hundreds of reflected point lights can be used to create an ambient-like effect on rendered objects (AMD Tech Demo, 2013). It does, however, have some pitfalls.

Firstly, there is a computational overhead from ray casting and checking for occlusion of those rays hitting the backplane polygon. Secondly it suffers from clustering of lights, similar to all forward rendering techniques. This can either lead to slow down, on screen artefacts or a mixture of both.

However, careful scene management generally lessens or negates these issues.


6. What is Tiled Forward Rendering

6.1 Overview

This chapter explains what tiled forward rendering is and how it can be used in a 3D engine to create multiple lights, without the overhead that is normally associated with forward rendering.

6.2 What is Tiled Forward Rendering

Tiled Forward Rendering is a method of rendering that attempts to overcome the problem of combining complex lighting with complex geometry. It allows multiple and complex light sources in a scene without the computational complexity that standard forward rendering requires, while retaining the ability to render features such as transparency and full screen anti-aliasing (Billeter, et al, 2013).

6.3 Why use Tiled Forward Rendering?

Under standard forward rendering, the GPU must render every light for every object within a scene. Also, as the geometry in the scene becomes much more complex, with different techniques such as normal mapping and diffuse and specular shading, the overhead on the GPU becomes too much. Although some culling can be done, ultimately forward rendering becomes an issue for lighting (Zink, J et al. 2011).

Tiled forward rendering (TFM) is a method of increasing the number of lights in the scene over the traditional method of forward rendering. It does this by de-coupling the geometry from the lighting, while still maintaining traditional rendering techniques such as transparency and FSAA amongst others. Since the geometry is no longer linked to the lighting, culling lights that don’t affect the geometry is not only quicker to do, but also results in a scene that allows for more lighting than would otherwise be possible (Lewis, P., 2012).

6.4 The strength of Tiled Forward Rendering

In a normal scene, the geometry is sent to the GPU and all the lights are processed, regardless of whether a piece of geometry is actually lit by them, thus causing the GPU to use processing power that could be used elsewhere. TFM breaks this link by effectively passing only the lights that are required to the pixel shader when rasterization is taking place. This eliminates most of the superfluous lights that the shader would otherwise process. With other culling techniques based around z-buffer culling, this can be enhanced further, which greatly increases the number of lights that can be used (Shishkovtsov, O. 2005).


As it is a forward rendering technique, this fits into the natural flow of the GPU’s pipeline. So not only does it allow for an increased number of lights per scene, but the scene can still use hardware specific shading procedures such as transparency. Other techniques such as deferred rendering require a software method for these, which dramatically reduces the number of lights a deferred renderer could display. Tile based rendering also improves other areas such as memory bandwidth, overdraw, blending and multisampling AA (MSAA) (Ribble, M. 2012). This can have a dramatic effect on complex rendered scenes that still wish to use the GPU’s efficiencies, without needing to create complex shaders to do the same work, which would hurt performance and ultimately the look of a game.

6.5 Tiled Forward Rendering Explained

The power of TFM is in the way it stores and culls the lighting information. To do this, the screen is ‘tiled’ into blocks of pixels, with 16 x 16 to 32 x 32 currently being thought of as the ideal size (Lewis, P., 2012). Within these tiles a list of all of the lights that affect that tile is stored, for later retrieval by the pixel shader. If the tiles are too small, any optimisations gained from this approach are reduced by having too few lights per tile. The same can be said of tiles that are too big, where you end up having too many lights within a tile (Billeter, M. Olsson, O. Assarsson, U., 2013).

An example of the tiling, number of lights and complexity of a scene under Tiled Forward rendering (courtesy AMD Leo Demo)

The red rectangular boxes are the size of the light spheres as they are positioned within the tiles. This example shows each tile taking up 32 pixels in size

Once you have decided on your tile size, it is then a matter of creating these tiles on the GPU. One solution is to create three separate arrays (buffers). This currently seems to be the preferred choice and gives the most efficient use of memory (Billeter, M. Olsson, O. Assarsson, U., 2013), although other solutions exist.

The lighting information is split into its component parts within the buffers. The light grid (tiles) holds, for each tile, a link into the list of lights and an offset number for the loop through it (in this example). The linked list holds an index to each of the lights.
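One possible way to lay out such buffers is sketched below. It assumes the commonly described arrangement of a per-tile offset and count pointing into one flat list of light indices, with the third buffer being the global light array itself; the exact fields used in the project are not given, so `TileEntry`, `LightGrid` and `buildLightGrid` are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Per-tile entry: where this tile's lights start in the index list,
// and how many of them there are.
struct TileEntry { std::uint32_t offset; std::uint32_t count; };

struct LightGrid {
    std::vector<TileEntry> tiles;             // one entry per tile
    std::vector<std::uint32_t> lightIndices;  // indices into the global light array
};

// Flatten per-tile light lists into the grid and index buffers.
LightGrid buildLightGrid(const std::vector<std::vector<std::uint32_t>>& lightsPerTile)
{
    LightGrid grid;
    grid.tiles.reserve(lightsPerTile.size());
    for (const auto& tileLights : lightsPerTile) {
        grid.tiles.push_back({ static_cast<std::uint32_t>(grid.lightIndices.size()),
                               static_cast<std::uint32_t>(tileLights.size()) });
        grid.lightIndices.insert(grid.lightIndices.end(),
                                 tileLights.begin(), tileLights.end());
    }
    return grid;
}
```

With this layout an empty tile costs only one grid entry, and the index list grows only with the number of light and tile overlaps.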

At this stage the light’s area of influence is calculated using its radius. The light is added to the linked list of every tile that this area passes over. Differing methods have been proposed for this, ranging from a simple sphere of influence model to using the compute shader to perform a line-sphere test (Lewis, P., 2012).

Image shows how a screen can be split into tiles. The areas that the light bounding boxes cover on the screen will be added to each tile’s linked list for processing at the rasterization stage. (picture courtesy SIGGraph)

To improve efficiency at the rasterization stage later, two depth (Z-buffer) pre-passes can be used. The first is a per-pixel depth pass that is used at the rasterization stage to cull any lights that fall out of scope based on the depth value. Prior to that, a tile depth buffer that stores the minimum or maximum depth value of an entire tile can also be used (Pranckevičius, A. 2012). This acts as a quick screening before the more specific culling of the light at rasterization. While this can be done quickly and efficiently, today’s DX11 GPUs can also use DirectCompute to order the list before culling takes place (Lewis, P., 2012).
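The quick per-tile screening can be sketched as an overlap test between a light's depth extent and the depth bounds of the tile. This is a simplified illustration that treats the light as a sphere in view space and uses both the minimum and the maximum bound; the struct and function names are hypothetical.

```cpp
// Hypothetical per-tile depth bounds gathered from the depth pre-pass.
struct TileDepthBounds { float minZ; float maxZ; };

// Returns true when the light's sphere of influence overlaps the tile's
// depth range. lightViewZ is the view-space depth of the light centre.
bool lightOverlapsTileDepth(float lightViewZ, float lightRadius,
                            const TileDepthBounds& tile)
{
    return (lightViewZ + lightRadius >= tile.minZ) &&
           (lightViewZ - lightRadius <= tile.maxZ);
}
```

A light that fails this test cannot touch any geometry visible in the tile, so it can be left out of that tile's list before any per-pixel work is done.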

As rasterization takes place, you are then only looking at the tile that the current pixel lies under. You can link in only those lights in that portion of the screen from the hundreds that the entire scene may be using (Billeter, M. et al., 2013). This is where the strength of Tiled Forward shows. At this point you are only rendering the lights for the pixels within that tile. Since a Z culling has taken place prior to this step, efficiency issues brought with overdrawing (or lighting in this case) should be kept to a minimum.

With all rasterization of opaque objects done in a scene, you can then render your transparent objects as you normally would (Luna, F. 2012), again using the tiled linked lists to decide which lights are illuminating the current pixel. As this is no different to how a typical GPU renders transparent objects, there is no degradation in speed.

6.6 Theoretical Speed comparison

As has been discussed, Tiled forward rendering makes use of the existing GPU pipeline, yet allows for far greater efficiency due to its tiling methods. This allows for the GPU to work at its most efficient, without the bottleneck that standard forward rendering gives. With the added benefits of culling the lights, especially when using the compute shader for generating the minimum and maximum z-buffer values per tile, you can increase the speed even further (Lewis, P., 2012).

As can be seen in the figure, once light culling is used with Tiled Forward Rendering, the speed is on par with Tiled Deferred rendering and quicker than standard deferred rendering. However, it is not always possible to cull the lighting in a scene, so the resulting comparisons will differ depending on the location of the viewer and the viewable lights.

Rendering speed comparisons (courtesy of Persson E, Olsson O)

Showing the Z bounds of the Tiles

6.7 Issues with tiling

One of the issues that appears when using a tiling system is that a tile cannot hold more lights than its fixed capacity. When the number of lights affecting a tile goes above this set amount, scene artefacts start to appear and increase in severity as each additional light is effectively missed.

As can be seen by the image below, using tiles that become easily saturated can have dramatic effects on the scene. In this case around 1000 lights are trying to fit into tiles that can only fit 128 lights.

The image above shows screen artefacts from tiles being full. This is more apparent with the DirectCompute method of tiling than with the CPU based method.

One of the ways round this is to give tiles a capacity that is always greater than the maximum number of lights in a scene. However, this can lead to using up resources more quickly or, with CPU based tiling, to increased traffic across the PCI bus.

Another viable solution is to manage the lights better in a scene or create dynamic buffers that change as the scene changes, increasing or decreasing the tile size as the needs arise.


However this method can have its drawbacks as you need a more complex rendering engine to dynamically allocate light buffers dependent on the scene.

A further issue with tiling relates to its performance when culling. If a scene has many light sources that are hidden part of the time, then the rendering engine will speed up dramatically. If, however, the lights are always being displayed in a scene, then tiling starts to lose its effectiveness as it comes closer to being on par with forward rendering, albeit with advanced culling techniques.


7. Implementation using a CPU tiling architecture

7.1 Overview

This chapter describes in detail how the project was implemented. It includes details on the creation of the 3DTextures used for tiling information and the techniques used to create the tiles. It then explains how this information is used on the GPU to produce the forward tiling.

7.2 Implementation

The initial project was designed around tiling being set up on the CPU side, with the resultant information then being sent over to the GPU. As the project was intended to compare Tiled Forward Rendering against other methods of lighting, a template was chosen based around a deferred rendering program. This became the base of the program, and the additional compute shader, HLSL and CPU based techniques were added to this.

The code itself is split into three major parts. The first part is the creation of all of the resources for the GPU, including UAVBuffers, RWBuffers, constant buffers and 3DTextures, along with texture resources and mappings. The second major part was the creation of the actual tiling algorithm and its implementation within the code. The last major part was using all of the resources and tiling information to produce the tiled forward rendering in HLSL.

When all combined, the result would be a rendering engine for thousands of lights, that allowed for transparency and full screen anti-aliasing.

Along with all of the code for the CPU, Tiled Forward Rendering was also implemented using the compute shader. That is discussed in detail in the next chapter as the use of the compute shader is very complex and detailed.

7.3 Buffer Creation

The project needed many different types of buffers, ranging from simple texture buffers to more complex 3DTexture buffers, Unordered Access Buffers (UAVBuffers) and structured buffers (SRVBuffers). Due to the nature of the project, these buffers would also need to be mapped within system RAM, so that the contents could be transferred over to the GPU.

7.3.1 3DTextures Buffers

The tiling on the CPU side was implemented using a 3DTextureResource. This was in contrast to the compute shader method, which used two separate UAVBuffers. The reasoning behind this at the time was to use a resource that was designed around a 3D hierarchy and already implemented in DirectX.


A Texture3D resource. This was used for all of the tiling information to be stored in one texture prior to transferring over to the GPU

It was hoped that this would give efficiency savings when the data was transferred from system memory to GPU memory every frame.

The top level of the 3DTexture was set aside to hold data on how many lights were used in the tile, while the depth of the texture (the texture MIP levels when setting up the texture) would hold the actual tiling data.

When manually dealing with textures there are certain aspects that need to be remembered, these being the X and Y stride lengths. These are two values that are set by different GPU architectures when managing texture resources. The stride length denotes how much spare space is added at the end of each row of a texture, and at the end of each slice. This is effectively dead memory that is not used.

So when manually placing data into the 3DTexture, the relevant stride length must be added at the end of every row, and the same applies after the last row of each slice (this slice padding is not needed on 2D textures, only on 3D textures). If these values are not added, you will ultimately end up with corrupt data and incompatibility between hardware platforms.
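The addressing this implies can be sketched as below. In Direct3D 11 the two strides are reported as the `RowPitch` and `DepthPitch` members of `D3D11_MAPPED_SUBRESOURCE` when a texture is mapped; the sketch takes them as plain parameters so that it stands alone, and the helper names `texelOffset` and `writeTexel` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Byte offset of texel (x, y, z) in a mapped 3D texture.
// rowPitch and depthPitch are the hardware-reported strides in bytes and
// may be larger than the tightly packed row and slice sizes.
std::size_t texelOffset(std::size_t x, std::size_t y, std::size_t z,
                        std::size_t rowPitch, std::size_t depthPitch,
                        std::size_t bytesPerTexel)
{
    return z * depthPitch + y * rowPitch + x * bytesPerTexel;
}

// Write one 32-bit value into the mapped memory at texel (x, y, z).
void writeTexel(std::vector<std::uint8_t>& mapped,
                std::size_t x, std::size_t y, std::size_t z,
                std::size_t rowPitch, std::size_t depthPitch,
                std::uint32_t value)
{
    std::memcpy(mapped.data() + texelOffset(x, y, z, rowPitch, depthPitch, sizeof(value)),
                &value, sizeof(value));
}
```

Because the offset is always computed from the reported pitches rather than from the texture width, the same code works on hardware that pads rows and slices differently.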

7.3.2 Unordered Access Buffers

These buffers are special in that they are both readable and writable on the GPU. They were set up so that the compute shader could insert the tiling information into them, and the pixel shader could then use this information to create the tiled lighting effect.

7.3.3 Constant Buffer

Constant buffers are used for storing data in system memory and then transferring it over to the GPU. In this project they were primarily used for storing the light data itself (position, colour, and radius). Other than that, they can be thought of as arrays that are used within the GPU.

7.4 How the tiling was done

To create a tiling list, you first loop through the list of all the lights in the scene. As you are looping through, the first part of tiling is to check if any part of the light is projecting onto the scene.

This is done by first getting the light’s world position and creating a 3D bounding box based around the radius of the light. This gives you 8 new points per light, at equal distances from the light source (the light’s bounding box).

The first thing you need to do at this stage is check if all of these points fall outside the viewing frustum. If so, none of the light is on the scene and can be discarded (culled).

From here you now need to convert these points into screen space using a perspective transform. This gives you eight X and Y coordinates that are the screen positions of these points. The points then need to be sorted from lowest to highest value on each axis. At this point you find the highest and lowest X and Y values respectively, and they dictate which tiles the light fits into.

This image shows two lights with their corresponding bounding boxes based around the light radius

At this point, the actual placing of the lights in the tiles takes place. To lessen the complexity, the tiles are defined in an n by m array. The array size varies depending on the screen size and how many pixels each tile encompasses. So, for a screen size of 1280 by 960 with a tile size of 32 by 32 pixels, you would need 40 tiles across by 30 down.

However, because the light may start and finish outside the boundaries of the screen, you must clamp the stored min and max values: to 0 for any value below zero, and to 40 for the X max edge tile (30 for the Y max edge tile), or whatever the outer edge tile is for the viewport size and tile pixel size in use.

At this point you now have the dimensions of the tiles that the light fits into. So two loops are needed, one to fill the X axis and an inner loop that loops around the Y axis.

The following snippet of code shows the loops over the two axes; all that is left is to place the light into the lookup table.

At this point we have two loops that will loop through the correct tile information. It is at this point that you fill the tile in with the tiling information. The first thing is to check that the tile can fit any more lights. This limit is ultimately set by the tile depth variable that controls the MIP level on the 3DTexture, but as we are initially placing this information into an array, the current count is held in the tile counter array at [X, Y], where X and Y are the loop variables. So you first check that this count has not gone above the maximum allowed number of lights, then increment it, and lastly store it as LightsUsed in a temporary variable.

Now that you have found that you can place a new light in the tile, it is a simple matter of putting a pointer to the current light you are working on into the Tile array on [X, Y, LightsUsed].

Carrying on with the loop for each tile will then result in the light being placed in the correct tile for its screen position, ready for transferring over to the GPU.
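The steps above can be sketched as follows. This is a minimal illustration rather than the project's own listing: it assumes the light's screen-space bounding rectangle has already been found, clamps to the last valid tile index on each axis, and silently drops a light when a tile is full. `TileGrid` and `addLightToTiles` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side tile table: a per-tile counter plus a fixed number
// of light slots per tile, mirroring the [X, Y] counter array and the
// [X, Y, LightsUsed] tile array described above.
struct TileGrid {
    int tilesX, tilesY, maxLightsPerTile;
    std::vector<std::uint32_t> counts;  // tilesX * tilesY counters
    std::vector<std::uint32_t> slots;   // tilesX * tilesY * maxLightsPerTile

    TileGrid(int tx, int ty, int maxLights)
        : tilesX(tx), tilesY(ty), maxLightsPerTile(maxLights),
          counts(static_cast<std::size_t>(tx) * ty, 0),
          slots(static_cast<std::size_t>(tx) * ty * maxLights, 0) {}
};

// minPx..maxPy: the light's screen-space bounding rectangle in pixels.
void addLightToTiles(TileGrid& grid, std::uint32_t lightIndex,
                     float minPx, float minPy, float maxPx, float maxPy,
                     int tileSize)
{
    // Convert pixel extents to tile indices and clamp to the grid edges.
    int x0 = std::max(0, static_cast<int>(std::floor(minPx / tileSize)));
    int y0 = std::max(0, static_cast<int>(std::floor(minPy / tileSize)));
    int x1 = std::min(grid.tilesX - 1, static_cast<int>(std::floor(maxPx / tileSize)));
    int y1 = std::min(grid.tilesY - 1, static_cast<int>(std::floor(maxPy / tileSize)));

    for (int x = x0; x <= x1; ++x) {
        for (int y = y0; y <= y1; ++y) {
            std::size_t tile = static_cast<std::size_t>(y) * grid.tilesX + x;
            std::uint32_t lightsUsed = grid.counts[tile];
            if (lightsUsed >= static_cast<std::uint32_t>(grid.maxLightsPerTile))
                continue;  // tile is full: this light is dropped for the tile
            grid.slots[tile * grid.maxLightsPerTile + lightsUsed] = lightIndex;
            grid.counts[tile] = lightsUsed + 1;
        }
    }
}
```

A rectangle that lies entirely off one side of the screen produces an empty loop range after clamping, so such a light is never written to an edge tile by mistake.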

Once all of the lights have been transferred into the tiling array, it is then a simple matter of placing this data into a 3DTexture. The first MIP level holds the number of lights used in the tile, and the subsequent MIP levels hold the actual pointers to the light numbers.


Great care is needed at this point: not only does specific hardware have specific stride lengths for textures, resulting in individual column and row lengths, but there are also specific lengths that can be passed over to the GPU based on how the texture was set up. All of this needs to be taken into account; otherwise, as was mentioned earlier, the result will not be compatible between different hardware platforms.

7.5 The Pixel Shader and Tiled Rendering

At this point the scene is ready to be lit by the tiled lighting lists. As it is based around forward rendering, the standard path through the pipeline for lighting is used. Firstly, the geometry is sent to the vertex shader; it is then passed through the rasteriser before reaching the pixel shader.

Once it reaches the pixel shader, a perspective transform of the pixel is performed, which gives us the X and Y coordinates of the tile. As can be seen in the code below, this is a relatively simple task as the information is fed to us via the vertex shader. All that is required is to divide by how many pixels each tile uses.

Now that this information has been gathered, it is a simple matter of finding the number of lights used for the tile, with a simple lookup on the first Mip level of the 3DTexture holding the tiling information.

As the code line above shows, the value is “loaded” from the texture, rather than “sampled” in the usual way. This is because the sample command interpolates between texels, so the load command is used instead to give direct access to the relevant texel rather than an interpolated value.

If a value of 5 is found on the first MIP level of the texture, this indicates that 5 lights are used for that tile position. The shader then loops that many times along the Z depth of the texture. Each level holds a lookup to a light being used for that tile, so for the example given above it will access 5 levels of the texture’s Z axis.

The table to the left shows how the information is stored within the texture. Once the X and Y dimensions are found, the MIP level (or Z axis) is then used to find the lighting information. This is stored as a header in MIP level 0, followed by the light number information. The first item, located in MIP level 0, is the number of lights used in the tile; the subsequent cells each hold a pointer to a light used. In this example lights 12, 200, 7, 3 and 23 are lighting that tile on the scene.
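The lookup described above can be mimicked on the CPU to check the layout. The sketch below is a stand-in written in C++ rather than the project's HLSL: `TileTexture` is a hypothetical flat array playing the part of the 3DTexture, with level 0 holding the count and levels 1 onwards holding the light numbers.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the 3D tiling texture.
struct TileTexture {
    int tilesX, tilesY, depth;
    std::vector<std::uint32_t> texels;  // depth * tilesY * tilesX values

    // Direct, uninterpolated read of one texel (the equivalent of a load).
    std::uint32_t load(int x, int y, int z) const {
        return texels[(static_cast<std::size_t>(z) * tilesY + y) * tilesX + x];
    }
};

// Gather the light numbers that affect the pixel at (pixelX, pixelY).
std::vector<std::uint32_t> lightsForPixel(const TileTexture& tex,
                                          int pixelX, int pixelY, int tileSize)
{
    int tileX = pixelX / tileSize;                    // which tile the pixel is in
    int tileY = pixelY / tileSize;
    std::uint32_t count = tex.load(tileX, tileY, 0);  // header: number of lights

    std::vector<std::uint32_t> lights;
    for (std::uint32_t i = 1; i <= count; ++i)        // walk down the Z axis
        lights.push_back(tex.load(tileX, tileY, static_cast<int>(i)));
    return lights;
}
```

In the shader each light number returned this way would be used to index the light data and accumulate that light's contribution to the pixel.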


Once the light is found, you then light the pixel in the standard way for forward rendering. This could be diffuse, specular, normal or even parallax types of lighting, depending on how your scene is designed.

Once the lighting information has been calculated, the pixel shader loops through to the next light held in the texture and using additive blending, adds in any subsequent light information. Unlike deferred rendering, Alpha blending can also take place at this stage, which is one of the advantages of Tiled Forward Rendering.

Once all of the light and alpha information is calculated for the pixel to be rendered, the information is passed back into the pipeline, ready to be drawn on the screen.


8. Implementation using the Compute Shader

8.1 Overview

Because of the parallel nature of Tiled Lighting, it is possible to use the compute shader to create the light lists for a 3D scene. This section explains the design, implementation and result of using the compute shader to do this, and gives a brief synopsis comparing the results with CPU based Tiled Light Lists.

8.2 How the Compute Shader works

Modern day GPUs are no longer just hardware devices to create images on the screen. They are more multipurpose than ever before and allow for complex programming, similar to modern day CPUs.

While there are still limitations present in GPUs for multi-purpose programming, the brute force nature and parallelism of a GPU can allow you to design software that is hundreds or thousands of times quicker than a CPU alone (Ti, 2009). This scaling in performance is bounded by Amdahl’s Law, which describes the maximum speedup that a parallel designed program can achieve (Tatourian, 2013).

8.2.1 Initialising

Unlike the other shaders within the GPU, the compute shader doesn’t fit into any sequence as it is independent of the graphics pipeline. This allows for great flexibility when using it. You can implement very powerful routines within your program, pass data to the compute shader, and have it process that information and return the result. The end result can be passed back to the CPU, or stored within a resource on the GPU for use by the graphics pipeline.

To use a compute shader you can either call it directly or use a technique. The project was based around calling the compute shader using the FX11 technique code, so the calls themselves are made by first applying the technique, then dispatching it (calling the routine).

The code above shows how to initialise the compute shader within the C++ code, and how to call it. Notice the 3 integers used on the Dispatch command. These relate to thread groups and ultimately to how many parallel threads you want to run. The scheduler will run all of the threads requested, even if the hardware is only capable of running a certain number of threads at once. For example, if you set the Dispatch to run 1, 2, 1 then it will run 2 thread groups (1 x 2 x 1), each containing the number of threads declared in the shader. This is due to the nature of using Dispatch and the GPU threading architecture (Thibieroz & Cebenoyan, 2010).

8.2.2 Threads and Thread Groups

When using the compute shader, there are certain aspects of its internal architecture that need to be looked at. These are threads, the batches of threads that the hardware executes together (known as waves or wavefronts (Thibieroz & Cebenoyan, 2010)) and thread groups.

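To declare the size of a thread group in HLSL, you use the attribute [numthreads(x, y, z)], where the three values give the number of threads per group in each dimension. The shader code has access to its thread ID through system-value semantics such as SV_DispatchThreadID, as in the following.

[numthreads(64, 1, 1)]
void CSMain( uint3 DTid : SV_DispatchThreadID )
{
    // For a one-dimensional dispatch the x component is already a unique
    // thread index. Multiplying x, y and z would give 0 here, as y and z are 0.
    uint threadId = DTid.x;
}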

To call the above function, you would dispatch it with:-

g_pd3dContext->Dispatch(x, y, z);

Where x, y, z are values of your choosing that set the number of thread groups to launch in each dimension. Together with the numthreads values, they determine the range of thread IDs that the shader sees.

With the example above, threadId can be used to decide which thread you are in, and each thread group contains 64 threads. You must be careful in the above example, as every dispatched group runs all 64 of its threads regardless of how much work there is, so some bounds checking is required within the code if the amount of work is not a multiple of 64 (this is usually the case with image manipulation).
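The thread ID handling described above can be mimicked in C++ to show how a three-component ID is flattened into a unique index, and why a bounds check is needed. This is a sketch only: Uint3, flatten_id and has_work are hypothetical names, and width and height are assumed to be the total thread counts along x and y.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for an HLSL uint3 thread ID.
struct Uint3 { std::uint32_t x, y, z; };

// Flatten a 3D thread ID into a unique linear index.
// `width` and `height` are the total thread counts along x and y.
std::uint32_t flatten_id(Uint3 id, std::uint32_t width, std::uint32_t height) {
    return id.x + id.y * width + id.z * width * height;
}

// Mirrors the early-out a shader needs when more threads are launched
// than there are work items.
bool has_work(std::uint32_t linearId, std::uint32_t workItems) {
    return linearId < workItems;
}
```

Multiplying the three components together instead would map every thread with a zero component to index 0, which is why the additive form is used here.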

When the compute shader runs, the hardware batches its threads together in what are known as waves. Dependent on the hardware configuration, the threads of a wave run together in lockstep. On an AMD GPU a wave is 64 threads and on NVidia it is 32 (Thibieroz & Cebenoyan, 2010). Threads within the same thread group have access to fast group shared memory, while data written by other thread groups needs a more costly read from device memory.

Thread groups are made up of one or more waves. The threads of a single wave start and finish together, but separate waves and separate thread groups do not run in step with each other. Within a thread group, threads can still wait at a barrier for all the other threads of that group to hit the same point before continuing. Separate thread groups have no such synchronisation within a dispatch and can only communicate through device memory (Nvidia, 2010).


8.3 Design and Implementation

Due to the nature of tiling a scene for lighting, it was possible to use the power of the GPU to create the tiled light lists. Two unordered access (UAV) buffers were used for the creation of these lists: one for the number of lights used in each tile, and a second containing the tiles themselves, each holding a linked list of its lights. The following is the HLSL for the code used to create these buffers.

The structure first defines what is stored in the “CTilesUsed” buffer: how many lights are within the tile and the maximum depth of the lights within the tile. This keeps the data needed to a minimum, but allows for efficiency savings by doing a depth check before using any lights on a pixel. If the pixel falls outside the ZBounding, then it will not be lit by any of the lights within the tile.
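The depth check described above can be sketched in C++ as follows. TileInfo is a hypothetical CPU-side mirror of one “CTilesUsed” entry, and the field and function names are assumptions for illustration rather than the project's own.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of one "CTilesUsed" entry.
struct TileInfo {
    std::uint32_t lightCount; // number of lights stored in the tile
    float maxDepth;           // furthest depth reached by any light in the tile
};

// Returns true when the per-light loop can be skipped for this pixel:
// either the tile holds no lights, or the pixel lies beyond all of them.
bool can_skip_tile(const TileInfo& tile, float pixelDepth) {
    return tile.lightCount == 0 || pixelDepth > tile.maxDepth;
}
```

The benefit of a check like this is that a single comparison can reject every light in the tile before any per-light work is done.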

The next buffer is the array of tiles itself. The buffer is one-dimensional, but it is addressed as an X by Y array that mirrors how the tiles are laid out on the screen, followed by a third dimension for the maximum number of lights a tile can store.

So, if the screen was 1024 by 720 and the number of pixels per tile was 32, then the array would be 32 by 23 (720 / 32 = 22.5, rounded up), giving 736 entries, one for each tile on the screen. This would then be multiplied by the maximum number of lights that a tile could store. If we took, for example, a maximum of 32, then the resulting buffer would hold 23552 entries, or 94208 bytes at 4 bytes per entry.
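The arithmetic above can be reproduced with a short C++ sketch. The tile-major layout used by entry_index (all light slots of one tile stored contiguously) is an assumption made for illustration; the report does not state the exact ordering used in the project.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of the flattened tile buffer.
struct TileGrid {
    std::uint32_t tilesX, tilesY, maxLights;
};

// Number of tiles in each direction, rounding partial tiles up.
TileGrid make_grid(std::uint32_t width, std::uint32_t height,
                   std::uint32_t tileSize, std::uint32_t maxLights) {
    assert(tileSize > 0);
    return TileGrid{(width + tileSize - 1) / tileSize,
                    (height + tileSize - 1) / tileSize, maxLights};
}

// Total entries in the 1D buffer.
std::uint32_t entry_count(const TileGrid& g) {
    return g.tilesX * g.tilesY * g.maxLights;
}

// Index of light slot `slot` of tile (tileX, tileY), assuming a
// tile-major layout.
std::uint32_t entry_index(const TileGrid& g, std::uint32_t tileX,
                          std::uint32_t tileY, std::uint32_t slot) {
    return (tileY * g.tilesX + tileX) * g.maxLights + slot;
}
```

For a 1024 by 720 screen with 32 pixel tiles and 32 lights per tile, this gives a 32 by 23 grid, 23552 entries and 94208 bytes when each entry is a 4 byte integer.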

The resulting lookup table should then reflect the way the tiling algorithm is written for the compute shader. It is more efficient than the method used on the CPU, as the data is sequential, but the way the data is accessed is far harder to understand.

The next stage of the design was to decide how many threads were needed, or whether a loop should be used in their place. Both methods were analysed and their timings compared.

When just a loop was employed, with all of the lights looped through, performance dropped almost to zero, while using the maximum number of threads available produced the best results.

However, when the total number of lights is greater than the physical number of threads available on a GPU, slowdown occurs again, although not as dramatically as when using 1 thread for all of the lights (as is the case in the CPU based tiling).
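One common way to share more lights than there are threads is a strided loop, where each thread starts at its own index and steps by the total thread count. The sketch below shows this in C++ as an illustration of the trade-off between threads and loops. It is not necessarily the scheme the project used, and lights_for_thread is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Light indices handled by thread `tid` when `threadCount` threads share
// `lightCount` lights using a strided loop.
std::vector<std::uint32_t> lights_for_thread(std::uint32_t tid,
                                             std::uint32_t threadCount,
                                             std::uint32_t lightCount) {
    assert(threadCount > 0);
    std::vector<std::uint32_t> out;
    for (std::uint32_t i = tid; i < lightCount; i += threadCount) {
        out.push_back(i);
    }
    return out;
}
```

With 64 threads and 2000 lights, each thread processes 31 or 32 lights. With a single thread the loop degenerates into one thread visiting every light, which is the slow case described above.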


Because of the parallel nature of the code, accessing data becomes an issue, as another thread may be modifying the same data at the same time. To get around this issue the compute shader provides atomic operations, referred to here as atomic locking. This is a mechanism that locks the resource a thread is currently using, modifies it and then unlocks it for other threads to use.

During the time the resource is locked, any thread that is trying to modify the same resource stalls until the resource it is waiting on has become free.

Unfortunately this causes a slowdown, which can clearly be seen in the results in the next chapter. The more threads that are trying to modify a tile, the greater the slowdown. On the scene that clusters its lights closely together, the compute shader tiling slows down far more than the CPU based tiled rendering.

Below is an example of how locking works within the compute shader.

Due to the nature of the GPU, any thread that is accessing the same tile will halt at this section, waiting for the thread currently inserting a light in the same tile to finish. The extreme version of this could be every light being in one tile and all threads queuing to insert their relevant data.
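The idea of reserving a slot in a tile atomically before writing to it can be sketched in C++ with std::atomic, which plays a role comparable to HLSL's InterlockedAdd. This is an illustrative sketch under that assumption and not the project's shader code; Tile, insert_light and stored_count are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical tile with a fixed number of light slots.
struct Tile {
    std::atomic<std::uint32_t> count; // slots requested so far
    std::vector<std::uint32_t> lights;
    explicit Tile(std::uint32_t capacity) : count(0), lights(capacity) {}
};

// Atomically reserve the next slot, then write the light index into it.
// Safe to call from several threads: no two callers receive the same slot.
// Returns false when the tile is already full and the light is dropped.
bool insert_light(Tile& tile, std::uint32_t lightIndex) {
    std::uint32_t slot = tile.count.fetch_add(1); // atomic read-modify-write
    if (slot >= tile.lights.size()) return false;
    tile.lights[slot] = lightIndex;
    return true;
}

// Number of valid entries: the counter may overshoot the capacity.
std::uint32_t stored_count(const Tile& tile) {
    std::uint32_t n = tile.count.load();
    std::uint32_t cap = static_cast<std::uint32_t>(tile.lights.size());
    return n < cap ? n : cap;
}
```

Every insertion into the same tile has to pass through the same counter one at a time, which is why many lights in one tile serialise the threads involved.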

8.4 Pixel Shader with UAV ordered buffers

As with the pixel shader used for the CPU rendering version, it is at this point that you can retrieve the number of lights used for the pixel currently being rendered. However, as the GPU version isn’t using a 3D texture but a UAV buffer, it is done slightly differently.

As you can see, the information is not loaded from a texture, the method used for CPU rendering, but accessed like an array. This buffer holds the number of lights used and is accessed in virtually the same way as an array on a traditional CPU. The actual tile information is held within a separate buffer (array) called “CTiles” and is accessed in a similar fashion.

Here you can see how the loop goes through each light held in a tile, retrieving the light number that is a look up to a separate table of lights.

Again, once this light is found a traditional forward rendering of that pixel takes place.
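The per-pixel lookup described in this section can be sketched in C++ as follows. The types, the two-dimensional lights and the simple quadratic falloff are assumptions chosen to keep the example short. Only the access pattern (pixel to tile, tile to light count, loop over light indices, look up each light in a separate table) follows the text.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical 2D light used only for this illustration.
struct Light { float x, y, radius, intensity; };

// Hypothetical CPU-side copy of the two tile buffers.
struct TileData {
    std::uint32_t tileSize, tilesX, maxLights;
    std::vector<std::uint32_t> counts;       // one light count per tile
    std::vector<std::uint32_t> lightIndices; // maxLights slots per tile
};

// Accumulate lighting for one pixel from the lights listed in its tile.
float shade_pixel(std::uint32_t px, std::uint32_t py, const TileData& tiles,
                  const std::vector<Light>& lights) {
    std::uint32_t tile = (py / tiles.tileSize) * tiles.tilesX + px / tiles.tileSize;
    float total = 0.0f;
    for (std::uint32_t i = 0; i < tiles.counts[tile]; ++i) {
        // The tile stores an index into the separate table of lights.
        const Light& l = lights[tiles.lightIndices[tile * tiles.maxLights + i]];
        float dx = float(px) - l.x, dy = float(py) - l.y;
        float d2 = dx * dx + dy * dy, r2 = l.radius * l.radius;
        if (d2 < r2) total += l.intensity * (1.0f - d2 / r2); // simple falloff
    }
    return total;
}
```

A pixel in a tile with a count of zero does no lighting work at all, which is where the saving over plain forward rendering comes from.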


8.5 Conclusion

The results of using the compute shader differed not just from scene to scene, but also between positions within each scene. There was also a marked difference when using GPUs from NVIDIA compared with ATI branded cards.

When using a scene that had a lower number of lights, the compute shader technique usually won out over CPU based Tiled Shading, but with more lights the CPU version could come out ahead. Initially this seemed to contradict the advantage in parallelism that a GPU has over a CPU. Once I started to investigate why this was the case, it became clear that two overriding issues were the cause.

The first issue was that the more the GPU was used to create the tiling, the less time it had to render the lights. This was more apparent on the first scene, which included a lot of objects and had a lot of lights being overdrawn. Unfortunately there would be no way around this without resorting to a different technique, as the GPU would always need to both render the objects and create the tiles. Making the code more efficient would certainly allow more GPU rendered lights before the tipping point is reached and CPU tiling becomes more efficient.

The second issue was down to the way the algorithm was coded. This created many atomic locks per light, as the GPU threads would be fighting over access to the read/write buffer. The more densely packed the lights became, the more locking was required and the more apparent the slowdown became.

To overcome this issue a re-write of the Compute shader code would be needed. This would require the compute shader to loop through the tiles, rather than the lights; thus overcoming locking issues.
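A per-tile version of the culling, as proposed above, could look like the following C++ sketch, where one call stands in for one GPU thread that owns a single tile. Since only that thread writes to the tile's list, no atomic operations are needed. The circle against rectangle overlap test and all names are assumptions for illustration, not the project's code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical screen-space light: centre and radius in pixels.
struct ScreenLight { float x, y, radius; };

// Gather the lights that overlap one tile. Each tile is processed
// independently, so its list has a single writer.
std::vector<std::uint32_t> gather_lights_for_tile(
        std::uint32_t tileX, std::uint32_t tileY, std::uint32_t tileSize,
        const std::vector<ScreenLight>& lights, std::uint32_t maxLights) {
    float minX = float(tileX * tileSize), maxX = minX + float(tileSize);
    float minY = float(tileY * tileSize), maxY = minY + float(tileSize);
    std::vector<std::uint32_t> result;
    for (std::uint32_t i = 0; i < lights.size() && result.size() < maxLights; ++i) {
        const ScreenLight& l = lights[i];
        // Closest point on the tile rectangle to the light centre.
        float cx = std::max(minX, std::min(l.x, maxX));
        float cy = std::max(minY, std::min(l.y, maxY));
        float dx = l.x - cx, dy = l.y - cy;
        if (dx * dx + dy * dy <= l.radius * l.radius) result.push_back(i);
    }
    return result;
}
```

The trade-off is that every tile now loops over every light, so the work grows with tiles times lights, but the threads never wait on each other.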

Overall the use of compute shaders brings many different techniques into play, from the simple use of RWTextures when creating the data to the multithreading of HLSL code to accomplish the required results.


9. Results of Tiled Forward Rendering

9.1 Overview

This chapter explains the results that were achieved from Tiled Forward Rendering. The results are based on separate scenes, separate GPUs and differing setups for the program. They show that different hardware and software configurations can give dramatically different results.

9.2 Methodology

To try to keep the results accurate, I placed the camera in the same spot each time. This position was found by taking measurements around each of the scenes until a place was found that gave a roughly average frame rate. This was required due to the nature of Tiled Forward rendering, where speeds can vary heavily depending on the number of lights within the viewing frustum and on the viewing angle (which can cause clustering of lights on one section of the screen).

Two scenes were chosen. The first is a landscape scene with a lot of objects, which not only pushes the FPS down but gives a nice open perspective for faraway lights. The second is the widely used “Sponza” scene, which is far smaller and clusters lights far more, giving a more varied FPS from 200+ to less than 50.

9.3 Results

The results are split by scene. The first set is for the Landscape scene, which has the lights further apart, allowing for fewer lights per tile. The second is for the Sponza scene, which is a standard scene used for testing lighting. In this scene the lights are far closer together, increasing the number per tile and the overall scene coverage. The results were achieved on an AMD 270 GPU, with an Intel i5 CPU running at 3.3 GHz.

9.3.1 Landscape Scene

As can be seen in the table below, the compute shader performed better than the CPU when it used a small tile size, while increasing the tile size was more beneficial for the CPU.

This result seems to go against what should be expected, as increasing the number of tiles should put an extra burden on the GPU. The results for the CPU fall in line with expectations.


The scene above shows the compute shader displaying 2000 lights. When the size of the tile is within 8-16, it gives an increase in performance compared with a tile 32-64 pixels in size.

Lights  Tile Size (Pixels)  Tile Depth  Scene  Renderer        FPS
2000    n/a                 n/a         Grass  Deferred        57
2000    8                   128         Grass  Compute Shader  49
2000    16                  128         Grass  Compute Shader  49
2000    32                  128         Grass  CPU             44
2000    32                  128         Grass  Compute Shader  43
2000    64                  128         Grass  CPU             41
2000    16                  128         Grass  CPU             40
2000    8                   128         Grass  CPU             30
2000    64                  128         Grass  Compute Shader  27
2000    n/a                 n/a         Grass  Forward         2

As the table above shows, performance on the scene can vary based on factors such as tile size. GPU based tiling is, at best, around 15% slower than deferred rendering, while CPU based tiling is more than 20% slower in its best case. However, these figures do vary based on the angle of the view towards the horizon.

Lights  Tile Size (Pixels)  Tile Depth  Scene  Renderer        FPS
1024    n/a                 n/a         Grass  Deferred        59
1024    16                  128         Grass  Compute Shader  53
1024    32                  128         Grass  Compute Shader  52
1024    8                   128         Grass  Compute Shader  52
1024    64                  128         Grass  Compute Shader  50
1024    32                  128         Grass  CPU             47
1024    64                  128         Grass  CPU             45
1024    16                  128         Grass  CPU             42
1024    8                   128         Grass  CPU             35
1024    n/a                 n/a         Grass  Forward         4

When the number of lights is decreased to 1024, the results change dramatically for some of the tiling methods, notably the compute shader using a 64 pixel wide tile, which has gone from the worst performing tiled configuration at 27 FPS to 50 FPS, within a few frames of the other compute shader results. This seems unusual and hints that there could be an underlying inefficiency in the compute shader code. The CPU tiling broadly follows on from the previous table, showing a 32 pixel wide tile as the most efficient.

Lights  Tile Size (Pixels)  Tile Depth  Scene  Renderer        FPS
128     64                  128         Grass  Deferred        61
128     8                   128         Grass  Compute Shader  56
128     16                  128         Grass  Compute Shader  55
128     32                  128         Grass  Compute Shader  56
128     64                  128         Grass  Compute Shader  56
128     32                  128         Grass  CPU             49
128     64                  128         Grass  CPU             49
128     16                  128         Grass  CPU             48
128     8                   128         Grass  CPU             44
128     n/a                 n/a         Grass  Forward         31

With an even lower light count, the results show the compute shader with 8 pixel tiles giving results as good as any for Tiled Forward rendering, while 32 pixel tiles give the best results for CPU based tiling. This again suggests that the Direct Compute method of tiling, while more efficient than the CPU, has some underlying issue that makes it less efficient on larger tiles.

Lights  Tile Size (Pixels)  Tile Depth  Scene  Renderer        FPS
16      n/a                 128         Grass  Deferred        61
16      n/a                 n/a         Grass  Forward         60
16      8                   128         Grass  Compute Shader  56
16      32                  128         Grass  Compute Shader  56
16      16                  128         Grass  Compute Shader  55
16      64                  128         Grass  Compute Shader  55
16      32                  128         Grass  CPU             49
16      64                  128         Grass  CPU             49
16      16                  128         Grass  CPU             48
16      8                   128         Grass  CPU             44

Finally, the scene with 16 lights produces similar results to the scene with 128 lights. This shows that the tiling is at its most efficient between 16 and 128 lights on a wide open scene, likely because each light only fits into one tile at this point. As a comparison with standard forward lighting, forward lighting is better than or on par with compute shader tiling between 1 and 70 lights, and with CPU based tiling between 1 and 75 lights.

9.3.2 Results for Sponza scene

The Sponza scene is used throughout the industry to demonstrate lighting techniques. It is an enclosed scene that clusters the lighting, producing far more lights per tile and more tiles in use. In this respect it pushes tiled lighting to its limits early on. Because of this, the tile depth in the results was increased to 255. This is the maximum that the project would allow for the CPU based rendering due to its use of 3D textures, which have a limit of 255 on the MIP maps.


Lights  Tile Size (Pixels)  Tile Depth  Scene   Renderer        FPS
1024    n/a                 n/a         Sponza  Deferred        68
1024    32                  255         Sponza  CPU             24
1024    64                  255         Sponza  CPU             23
1024    32                  255         Sponza  Compute Shader  21
1024    64                  255         Sponza  Compute Shader  20
1024    16                  255         Sponza  CPU             20
1024    16                  255         Sponza  Compute Shader  20
1024    8                   255         Sponza  Compute Shader  16
1024    8                   255         Sponza  CPU             5
1024    n/a                 n/a         Sponza  Forward         3

When 1024 lights are used in the Sponza scene, the end results are dramatically different from the Landscape scene when it comes to Tiled Forward Shading. With a 32 pixel wide tile, the CPU based tiling beats the compute shader version by around 14% (24 FPS against 21 FPS). However, with an 8 pixel tile, the CPU based tiling barely manages to outperform the inefficient forward rendering.

Looking at these figures, and at the amount of saturation the tiles and screen are under, you could argue that Tiled Forward rendering at this point would not be useful over a standard deferred renderer.

Lights  Tile Size (Pixels)  Tile Depth  Scene   Renderer        FPS
256     n/a                 n/a         Sponza  Deferred        190
256     32                  255         Sponza  CPU             79
256     64                  255         Sponza  CPU             73
256     64                  255         Sponza  Compute Shader  66
256     32                  255         Sponza  Compute Shader  66
256     16                  255         Sponza  Compute Shader  60
256     16                  255         Sponza  CPU             53
256     8                   255         Sponza  Compute Shader  40
256     8                   255         Sponza  CPU             13
256     n/a                 n/a         Sponza  Forward         10

At 256 lights, it still looks like the deferred renderer is a much more suitable choice. There has been some improvement at the lower end of the spectrum for the compute shader, but at the larger tile sizes it is still lagging behind the CPU based tiling. This again is in stark contrast to the Landscape scene, where the compute shader was easily outperforming CPU based rendering.

Lights  Tile Size (Pixels)  Tile Depth  Scene   Renderer        FPS
128     n/a                 n/a         Sponza  Deferred        258
128     32                  255         Sponza  Compute Shader  119
128     64                  255         Sponza  Compute Shader  115
128     64                  255         Sponza  CPU             113
128     32                  255         Sponza  CPU             108
128     16                  255         Sponza  Compute Shader  96
128     16                  255         Sponza  CPU             70
128     8                   255         Sponza  Compute Shader  50
128     8                   255         Sponza  CPU             21
128     n/a                 n/a         Sponza  Forward         20

As the light count decreases to 128 lights, the compute shader overtakes the CPU at every tile size. The deferred renderer is still far more efficient at this point. For a game engine 120 FPS is more than enough, although being around 50% slower than deferred is still a big drawback. Unlike the Landscape scene, 8 and 16 pixel tiles are more of a liability for the compute shader.

Lights  Tile Size (Pixels)  Tile Depth  Scene   Renderer        FPS
16      n/a                 n/a         Sponza  Deferred        260
16      64                  255         Sponza  Compute Shader  250
16      32                  255         Sponza  Compute Shader  250
16      16                  255         Sponza  Compute Shader  250
16      n/a                 n/a         Sponza  Forward         148
16      32                  255         Sponza  CPU             116
16      64                  255         Sponza  CPU             113
16      16                  255         Sponza  CPU             104
16      8                   255         Sponza  Compute Shader  90
16      8                   255         Sponza  CPU             67

In the final set of results, the compute shader is now almost at parity with deferred rendering. This is a dramatic turnaround considering how inefficient it is with more lights. There could be many reasons for this, from inefficient code to cache misses with higher numbers of lights. One reason is almost certainly the small number of lights being stored in the tiling information. As was seen on the Landscape scene, the GPU struggles more when inserting multiple values into the buffer.

9.4 Tile Depth

One of the most important issues when it comes to tiling a scene is how much space you initialise for each tile. This is more important when creating your tiles on the CPU side and then transferring all of the information over to the GPU, because there is a performance cost when transferring data over the PCI bus.

Because of this, it is important to optimise the tile depth so that it is large enough not to cause artefacts on screen, but not so large that the transfer over the PCI bus affects performance dramatically.
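To give a feel for how the tile depth changes the amount of data sent each frame, the following C++ sketch computes the size of a tile light list. The 32 by 23 tile grid is borrowed from the example in section 8.3, and the 4 bytes per entry is purely an assumption, as the format of the 3D texture is not stated here.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a tile light list of the given dimensions.
// bytesPerEntry is an assumption; the texture format is not stated.
std::uint64_t tile_list_bytes(std::uint32_t tilesX, std::uint32_t tilesY,
                              std::uint32_t tileDepth,
                              std::uint32_t bytesPerEntry) {
    return std::uint64_t(tilesX) * tilesY * tileDepth * bytesPerEntry;
}
```

Under these assumptions a depth of 255 needs 750720 bytes per upload, while a depth of 92 needs 270848 bytes, a little over a third of the size.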

As can be seen in the above tables, the maximum depth when using the Sponza scene was set to 255 lights per tile. Even at this high number, the tiles became saturated to the point of creating artefacts. In fact, the scene with 1024 lights was possibly pushing the envelope for accurate lighting at that point.

However, the Landscape scene only needed 128 lights per tile and suffered no artefacts, even with 2000 lights being displayed, roughly twice as many as the Sponza scene. So what does this mean for overall performance? As the table below shows, CPU based tiling with an optimised tile depth gains a substantial increase in frame rate.

Lights  Tile Size  Tile Depth  Scene      FPS  Large Tile  Diff% +/-
16      32         32          Sponza     223  116         92.24
128     32         92          Sponza     150  108         38.89
16      32         32          Landscape  52   49          6.12
128     32         32          Landscape  51   49          4.08

As you can see, with the optimum depth for the tile, the Sponza scene with 16 lights has a 92% increase in speed, putting it slightly below the compute shader for frame rate. When you increase the lights to 128, with a tile depth of 92 to remove any artefacts, you get an increase of 39%, which is not only a substantial boost but firmly places the CPU in second place after deferred rendering.
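The Diff% figures quoted above are the relative gain of the optimised frame rate over the frame rate with the large tile depth. A minimal C++ sketch of that calculation is given here, with percent_gain being an illustrative name.

```cpp
#include <cassert>
#include <cmath>

// Percentage gain of an optimised frame rate over a baseline frame rate.
double percent_gain(double optimisedFps, double baselineFps) {
    assert(baselineFps > 0.0);
    return (optimisedFps - baselineFps) / baselineFps * 100.0;
}
```

For the Sponza scene with 16 lights this gives (223 - 116) / 116, or about 92.24%, and with 128 lights (150 - 108) / 108, or about 38.89%.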

When it comes to the GPU and the tile depth, there appears to be only a nominal change in speed when moving from large tile depths to small. The only issue would be storage space on the GPU, which is more critical than memory usage on the CPU side.


Putting these numbers into the tables shows what can be achieved through optimising the tiles themselves for the CPU.

Lights  Tile Size (Pixels)  Tile Depth  Scene   Renderer        FPS  Diff% +/-
128     n/a                 n/a         Sponza  Deferred        258  0
128     32                  92          Sponza  CPU             143  32
128     64                  92          Sponza  CPU             137  21
16      64                  16          Sponza  Compute Shader  117  0
128     16                  92          Sponza  CPU             114  63
16      32                  16          Sponza  Compute Shader  113  -1
16      16                  16          Sponza  Compute Shader  93   -1
128     8                   92          Sponza  CPU             55   162
16      8                   16          Sponza  Compute Shader  52   4
128     n/a                 n/a         Sponza  Forward         20   0

As can be seen from the table above, optimising the tile size and tile depth on an individual basis can give dramatic results. In one case a 162% increase in speed was achieved.

When sorting the data by tile size, you can see how tile size affects the outcome. As mentioned previously, a smaller tile size should not theoretically be the better solution, but because clustered lights repeatedly write to the same location, there is possibly some inefficiency in the way the GPU locks the memory prior to inserting the data into the tiles.

Lights  Tile Size (Pixels)  Tile Depth  Scene   Renderer        FPS
16      8                   16          Sponza  Compute Shader  52
16      16                  16          Sponza  Compute Shader  93
16      32                  16          Sponza  Compute Shader  113
16      64                  16          Sponza  Compute Shader  117
128     8                   92          Sponza  CPU             55
128     16                  92          Sponza  CPU             114
128     32                  92          Sponza  CPU             143
128     64                  92          Sponza  CPU             137

9.5 Different Architecture AMD vs NVidia

The project was designed around an AMD 6850 and an AMD R720. When the results were tested against NVidia GPUs, different results came back. Depending on the architecture of the GPU, the compute shader was either more or less efficient.

This was not a surprise, as different architectures do things in different ways. For completeness the results should also have been run using an AMD CPU, but an inability to procure one for testing meant those results are unknown.

Lights  Tile Size (Pixels)  Tile Depth  Scene   Renderer        FPS
128     n/a                 n/a         Sponza  Deferred        123
128     64                  92          Sponza  Compute Shader  80
128     32                  92          Sponza  Compute Shader  75
128     32                  92          Sponza  CPU             72
128     16                  92          Sponza  CPU             70
128     64                  92          Sponza  CPU             68
128     8                   92          Sponza  CPU             65
128     16                  92          Sponza  Compute Shader  61
128     8                   92          Sponza  Compute Shader  30
128     n/a                 n/a         Sponza  Forward         14

The above table shows the same Sponza scene being run on an NVidia GTX 660 GPU and an Intel i7 @ 3.3 GHz. As can be seen, the compute shader runs the scene more efficiently relative to the CPU tiling than it does on the AMD GPU. While the frame rates are not as high, this is down to the GPU being less powerful, and a GPU of equal price from NVidia would likely run the scene more efficiently. Without looking too deeply into this, a possible cause is the way locking works on NVidia hardware or the way it caches its memory.

9.6 Conclusion

As can be shown by the graphs and figures, the setup of the hardware and software can have dramatic effects on the results. Even the way the scene is laid out makes a marked change in the FPS that can be achieved.

When going for a lit scene with spread out lights, the compute shader seems to pull ahead. This is especially so when you reduce the size of the tile to 8-16 pixels. However, on scenes with tightly packed lights filling the entire screen, CPU based tiled rendering works better.

However, many of the CPU based tiling deficiencies can be lessened by using a tile depth that is less than the maximum needed. This is down to the amount of time spent sending larger 3D textures, the storage mechanism for the tiles on the CPU, over to the GPU via the PCI bus. This doesn’t make it perfect, as you then run the risk of artefacts appearing on the image being displayed, but careful consideration of light clustering and keeping the scene more evenly lit would certainly help.

Using the compute shader for the tiling can have massive benefits. There always seems to be a mode of rendering that is at worst as fast as CPU rendering and at best 2x as quick, although, as previously stated, optimising the CPU based tiling lessens this substantially.

There are of course drawbacks to using the compute shader for the tiles, but these are mainly GPU memory resources, which are usually smaller than a texture in size and not normally an issue on GPUs today.

One aspect of the results with the compute shader is the way smaller tiles give more dramatic increases in speed and how clustering of lights leads to poor performance. This seems at odds with how the algorithms should work. If anything, there should be a proportionate scaling from CPU based tiles to compute based tiles in terms of performance. These results indicate that there is a flaw in the compute shader, which is related in some way to having multiple lights per tile.


10. Evaluation and Reflection

10.1 Overview

This chapter aims to evaluate and critically review the project as a whole. Part of the aim of the project was to evaluate if Tiled Forward Rendering was suitable for a 3D game engine and whether its positives and negatives had any dramatic consequences on the way the project was designed and implemented.

10.2 Evaluation

With the need for ever more complex, lifelike scenes in computer games, traditional methods of displaying lighting are limiting the complexity that can be delivered at an acceptable frame rate. Even with advanced techniques such as Deferred Rendering, additional complex objects start to cause slowdowns within the scene, making it obvious that another approach is needed.

Tiled Forward Rendering is one way to solve this problem. It works on current and last generation GPUs, and has room for future improvements that make use of the compute shaders of current and next generation hardware. It also reduces the need for the large G-Buffers that traditional deferred renderers need to light a scene (and more so for complex geometry such as transparency).

Future shading techniques, such as Forward+ rendering, have shown that keeping a GPU’s pipeline in order while using a tiled/clustered technique for the lighting can give an almost film-like representation of the real world. This is in stark contrast to Deferred rendering, which has apparent limitations on the complexity of a scene.

While Tiled Shading does have some speed drawbacks compared with deferred rendering, these are only apparent on scenes with basic geometry that don’t include MSAA/FSAA and blending. Both Deferred Rendering and Tiled Forward Rendering are far better at displaying large numbers of lights than forward rendering, and while deferred rendering can manage more lights in scenes of two thousand lights or fewer, adding in blending and MSAA quickly lowers this number.

Tiled Forward Rendering has many advantages over deferred rendering and is currently future proof for game engines, as this rendering method works well on previous, current and next gen GPUs. While it may have a slightly slower rendering time than deferred in some circumstances, its big advantage is with complex geometry and the number of lights it can display.

Overall, if using a GPU to its fullest capacity and using Direct Compute for the culling and ordering of the linked lists, there appears to be every reason to use Tiled Forward Rendering. In the best cases measured it runs close to the speed of deferred rendering, it allows for more complex and more visually appealing scenes, and it is also future proof.

10.3 Reflection, Optimisations and Future improvements

Overall the project achieved the goals that I set out to achieve. The project set out to test whether Tiled Forward Rendering was a viable solution for modern day graphics engines, especially in the gaming industry.


From the results shown in the project it certainly seems that using this type of lighting has many benefits that increase the visual elements of a scene, with only a small reduction of performance. However this does come with certain caveats.

For one, the way clustering of lights can cause artefacts is a major problem on a heavily lit scene. Clustering causes issues both by maxing out the number of lights that can be stored per tile and, on scenes where lights cluster close to the viewing camera (covering vast amounts of the screen), by causing slowdown due to the nature of forward lighting.

The first problem, of filling a tile up to its maximum, can be alleviated by using a more advanced technique called Tiled Clustered Lighting, as explained in chapter 6. This not only solves many of the depth problems with the tiling, such as overdraw, but also gives another speed up by culling lights that do not affect 3D objects.

The second problem however is more a problem with Forward lighting and is much harder to solve. Tiled Clustered Rendering again solves some of the issues, especially where the lights are far apart on the Z plane, as overdraw is taken care of as the tiles are in 3 dimensions.

Some other optimisations can certainly take place. The CPU code is less optimised than the compute shader and requires an intermediate array which purely acts as storage. This could easily be phased out, with the code accessing the 3D texture directly instead. Likewise, the actual tiling code is especially raw and could do with refactoring. This is especially true for the light bounding boxes, which are quite inefficient. The use of SSE instructions, or just cleaner frustum culling, could help matters.

The compute shader has some quite hefty loops involved. I suspect this is one of the major slowdowns in the compute shader, as GPUs are generally bad at dealing with loops and branch prediction. Any savings on the loops would no doubt have a noticeable effect.

Lastly, the CPU tiling could be recoded to use multiple cores, similar to the proposed changes for the GPU. As can be seen from the results, the CPU is particularly fast even though it is only using one core.

For future improvements, implementing the part of the project that was missing would certainly help with speed issues. Due to time constraints the z buffering of tiles was only partially implemented. As parts of the documentation suggest, this can give a good improvement in speed for lights that are behind objects. This type of culling is quick and efficient and would certainly have increased performance on the Landscape scene due to the number of lights that fall under the scene. The Sponza scene would possibly have had a minor increase due to this, although, as stated earlier, clustered tiling would have a more dramatic effect on this scene due to the clustering of the lights.

Another improvement would be to change the compute shader code to work in a different way that avoids atomic locks. There is no doubt that atomic locking causes efficiency issues due to threads stalling as they wait for the locks to unlock. There is also the fact that using the locks themselves causes slowdown. In a future revision it would be more efficient to have a thread per tile, with the loops looping around the lights, rather than the other way around. This would remove the locking completely and should give a marked improvement in rendering speed. In some cases I suspect it could even be a better option than deferred for speed alone.


Adding in a newer algorithm based on Clustered Tiling would be an ideal improvement. This would no doubt increase the complexity of the code, but would improve the efficiency on scenes where the lights cluster tightly on screen but are widely separated in depth.

Lastly, changing the code to allow for dynamic switching between CPU and GPU based tiling could allow differently lit scenes to be rendered more efficiently. As could be seen between the two scenes in the project, both styles certainly had pros and cons. Having the option to switch between the two on the fly could give a better engine overall.

Overall I am happy with the final result of the project. I met all of the goals that I set out to do and integrated and compared a relatively new style of lighting.


References

AMD (2013). AMD Tiled Forward & Tiled Deferred Comparison. <http://developer.amd.com/wordpress/media/2013/06/TiledLighting11_v1.0.zip> Accessed [18/11/2013]

AMD Tech. Leo Time Forward+ Rendering Demo. [Online] <http://developer.amd.com/resources/documentation-articles/samples-demos/gpu-demos/amd-radeon-hd-7900-series-graphics-real-time-demos/> Accessed [04/11/2013]

Billeter, M., Olsson, O., Assarsson, U. (2013). Tiled Forward Shading. In Engel, W. ed. 2013. GPU Pro 4. CRC Press, Chapter 11.4

Billeter, M., Olsson, O., Assarsson, U. (2012). Tiled and Clustered Shading. [pdf] [Online] <http://www.cse.chalmers.se/~olaolss/get_file.php?filename=papers/tiled_shading_siggraph_2012.pdf> Accessed [13/10/2013]

Harada, T., McKee, J., Yang, J. (2013). Forward+: A Step Toward Film-Style Shading in Real Time. In Engel, W. ed. 2013. GPU Pro 4. CRC Press, Chapter 11.5

Klint, J. (2008). Deferred Rendering in Leadwerks Engine. <http://www.leadwerks.com/files/Deferred_Rendering_in_Leadwerks_Engine.pdf> Accessed [20/11/2013]

Lauritzen, A. (2010). Deferred Rendering for Current and Future Rendering Pipelines. [Online] <http://download-software.intel.com/sites/default/files/m/d/4/1/d/8/lauritzen_deferred_shading_siggraph_2010.pdf> Accessed [05/11/2013]

Lazarov, D. (2012). Physically Based Lighting in Call of Duty: Black Ops. [Online] <http://advances.realtimerendering.com/s2011/Lazarov-Physically-Based-Lighting-in-Black-Ops%20(Siggraph%202011%20Advances%20in%20Real-Time%20Rendering%20Course).pptx> Accessed [17/11/2013]

Lewis, P. (2012). Tile-Based Forward Rendering. [Online] <http://www.pjblewis.com/articles/tile-based-forward-rendering> Accessed [23/10/2013]

Liktor, G., Dachsbacher, C. (2013). Decoupled Deferred Shading on the GPU. In Engel, W. ed. 2013. GPU Pro 4. CRC Press, Chapter 11.3

Luebke, D., H.G. (2007). How GPUs Work. [Online] Nvidia (1). <http://www.cs.virginia.edu/~gfx/papers/pdfs/59_HowThingsWork.pdf> Accessed [03/04/2014]


Luna, F. (2012). Blending. In: Game Programming with Direct X 11. Mercury Learning, Chapter 9

Luna, F. (2012). The Rendering Pipeline. In: Game Programming with Direct X 11. Mercury Learning, Chapter 5

Microsoft, 2011. Direct X 11 Pipeline. [Online] Available at: http://msdn.microsoft.com/en-us/library/windows/desktop/bb205116(v=vs.85).aspx [Accessed 08 04 2014].

Persson E, Olsson O. Practical Clustered Deferred and Forward Shading (2013) [Online] <http://www.cse.chalmers.se/~olaolss/papers/siggraph_2013.pdf> Accessed [10/11/2013]

Pranckevičius, A. (2012) Theory for Forward Rendering [Online] <http://aras-p.info/blog/2012/03/02/2012-theory-for-forward-rendering/> Accessed [01/11/2013]

Rabin, S., 2010. Rendering Graphics. In: Introduction to Game Development. s.l.:s.n., Cengage Learning, p. 428.

Rabin, S., 2010. Rendering Graphics. In: Introduction to Game Development. s.l.:s.n., Cengage Learning, p. 444.

Thibieroz, N. & Cebenoyan, C., 2010. AMD/NVidia Direct Compute. [Online] Available at: http://developer.amd.com/wordpress/media/2012/10/DirectCompute%20Performance.ppsx

Ribble, M. Next Gen Tile-Based GPUs [Online] <http://developer.amd.com/wordpress/media/2012/10/gdc2008_ribble_maurice_TileBasedGpus.pdf> Accessed [17/11/2013]

Saito, T, Takahashi, T. (2009). Comprehensible Rendering of 3-D Shapes [Online] <http://www.cs.otago.ac.nz/cosc455/p197-saito.pdf> Accessed [01/11/2013]

Shishkovtsov, O. (2005). GPU Gems 2, Deferred Shading in S.T.A.L.K.E.R. [Online] <http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter09.html> Accessed [17/11/2013]

Ti, T., 2009. Features-Advantages-DirectCompute. [Online] Available at: http://on-demand.gputechconf.com/gtc/2009/presentations/1015-Features-Advantages-DirectCompute.pdf

Van Verth, J. B. L., 2008. Matrices and Linear Transformations. In: Essential Mathematics for Games. s.l.:Morgan Kaufmann.


Wei, Lei. 2005. A Crash Course on Programmable Graphics Hardware. [Online] Available at: http://graphics.stanford.edu/~liyiwei/courses/GPU/paper/paper.pdf [Accessed 26 04 2014].

Zink, J. Pettineo, M. Hoxley, J. (2011). Deferred Rendering. In: Practical Rendering and Computation with Direct3D 11. CRC Press, Chapter 11


Appendix 1 – Project Proposal

Department of Computing Degree Project Proposal

Name: Alan Raby
Course: Computer Games Development
Size: double
Discussed with (lecturer): Laurent Noel
Type: development

Previous and Current Modules
Games Development 1
Software Development
Advanced Programming With C++
Professional Skills
Computer Graphics
Software Engineering Practices

Problem Context
With standard deferred lighting, transparency and FSAA are difficult to implement and result in performance losses compared to standard forward rendering. However, forward rendering requires the lights to be coupled with the scene geometry, and the more lights are used, the greater the performance hit to the GPU. Deferred Shading and Tiled Forward Shading decouple the lighting from the geometry and usually require only one pass at render time.

The Problem
At the very least, I hope to achieve a forward rendering solution that allows hundreds of lights with transparency and FSAA still working.

Potential Ethical or Legal Issues
None

Specific Objectives
Show a functioning Tiled Forward Shading that is decoupled from geometry
Show Transparency working
Show FSAA in the scene
Demonstrate the advantages of the technique

The Approach
The approach used for this is based around the online information that has recently become available. Tiled Forward Rendering has the same characteristics as Tiled Deferred Rendering. From what I can gather, parts of each can be swapped for the other. The difference is that Tiled Forward Rendering allows for better scene management, as it uses the forward shading path correctly, whereas Tiled Deferred Shading allows for more lights, but lights that are less complex in nature.


I will be aiming to produce a scene that demonstrates the techniques that are currently being discussed online, as no “book” currently discusses them.

Resources
Visual Studio
DX11

Potential Commercial Considerations
Estimated costs and benefits
The costs for this project will mainly be down to man hours. If I assume that I would be paid £15/hour and that 200 hours of work will be done, then the cost to a customer would be £3000. Expenditure on Visual Studio, lighting/heating and food would also be included (and of course a healthy profit if it was a commercial venture). The benefit of this for a customer would be a working demonstration of Tiled Forward Shading.

Literature Review
Tiled Forward Rendering


Bibliography

http://www.youtube.com/watch?v=s2y7e3Zm1xc


http://www.pjblewis.com/articles/tile-based-forward-rendering/

http://www.cse.chalmers.se/~olaolss/main_frame.php?contents=publication&id=tiled_shading


Appendix 2 – Comparison Data

Manufacturer  Lights  Tile Depth  Tile Size (Pixels)  Scene   Renderer
AMD           2000    128         8                   Grass   Compute Shader
AMD           2000    128         16                  Grass   Compute Shader
AMD           2000    128         32                  Grass   Compute Shader
AMD           1024    128         8                   Grass   Compute Shader
AMD           1024    128         16                  Grass   Compute Shader
AMD           1024    128         32                  Grass   Compute Shader
AMD           1024    128         64                  Grass   Compute Shader
AMD           512     128         32                  Grass   Compute Shader
AMD           256     128         32                  Grass   Compute Shader
AMD           128     128         8                   Grass   Compute Shader
AMD           128     128         16                  Grass   Compute Shader
AMD           128     128         32                  Grass   Compute Shader
AMD           128     128         64                  Grass   Compute Shader
AMD           64      64          64                  Grass   Compute Shader
AMD           64      32          32                  Grass   Compute Shader
AMD           1024    255         8                   Sponza  Compute Shader
AMD           1024    255         16                  Sponza  Compute Shader
AMD           1024    255         32                  Sponza  Compute Shader
AMD           1024    255         64                  Sponza  Compute Shader
AMD           256     255         8                   Sponza  Compute Shader
AMD           256     255         16                  Sponza  Compute Shader
AMD           256     255         32                  Sponza  Compute Shader
AMD           256     255         64                  Sponza  Compute Shader
AMD           128     255         8                   Sponza  Compute Shader
AMD           128     255         16                  Sponza  Compute Shader
AMD           128     255         32                  Sponza  Compute Shader
AMD           128     255         64                  Sponza  Compute Shader
AMD           16      255         8                   Sponza  Compute Shader
AMD           16      255         16                  Sponza  Compute Shader
AMD           16      255         32                  Sponza  Compute Shader
AMD           16      255         64                  Sponza  Compute Shader
AMD           16      92          8                   Sponza  Compute Shader
AMD           16      92          16                  Sponza  Compute Shader
AMD           16      92          32                  Sponza  Compute Shader
AMD           16      92          32                  Sponza  Compute Shader
AMD           2000    128         8                   Grass   CPU
AMD           2000    128         16                  Grass   CPU
AMD           2000    128         32                  Grass   CPU
AMD           1024    128         8                   Grass   CPU
AMD           1024    128         16                  Grass   CPU
AMD           1024    128         32                  Grass   CPU
AMD           1024    128         64                  Grass   CPU
AMD           512     128         32                  Grass   CPU
AMD           256     128         32                  Grass   CPU
AMD           128     128         8                   Grass   CPU
AMD           128     128         16                  Grass   CPU
AMD           128     128         32                  Grass   CPU
AMD           128     128         64                  Grass   CPU
AMD           64      64          64                  Grass   CPU
AMD           64      32          32                  Grass   CPU
AMD           1024    255         8                   Sponza  CPU
AMD           1024    255         16                  Sponza  CPU
AMD           1024    255         32                  Sponza  CPU
AMD           1024    255         64                  Sponza  CPU
AMD           256     255         8                   Sponza  CPU
AMD           256     255         16                  Sponza  CPU
AMD           256     255         32                  Sponza  CPU
AMD           256     255         64                  Sponza  CPU
AMD           128     255         8                   Sponza  CPU
AMD           128     255         16                  Sponza  CPU
AMD           128     255         32                  Sponza  CPU
AMD           128     255         64                  Sponza  CPU
AMD           128     92          8                   Sponza  CPU
AMD           128     92          16                  Sponza  CPU
AMD           128     92          32                  Sponza  CPU
AMD           128     92          32                  Sponza  CPU
AMD           16      255         8                   Sponza  CPU
AMD           16      255         16                  Sponza  CPU
AMD           16      255         32                  Sponza  CPU
AMD           16      255         64                  Sponza  CPU
AMD           2000    128         8                   Grass   Deferred
AMD           2000    128         16                  Grass   Deferred
AMD           2000    128         32                  Grass   Deferred
AMD           1024    128         8                   Grass   Deferred
AMD           1024    128         16                  Grass   Deferred
AMD           1024    128         32                  Grass   Deferred
AMD           1024    128         64                  Grass   Deferred
AMD           512     128         32                  Grass   Deferred
AMD           256     128         32                  Grass   Deferred
AMD           128     128         8                   Grass   Deferred
AMD           128     128         16                  Grass   Deferred
AMD           128     128         32                  Grass   Deferred
AMD           128     128         64                  Grass   Deferred
AMD           64      64          64                  Grass   Deferred
AMD           64      32          32                  Grass   Deferred
AMD           1024    255         8                   Sponza  Deferred
AMD           1024    255         16                  Sponza  Deferred
AMD           1024    255         32                  Sponza  Deferred
AMD           1024    255         64                  Sponza  Deferred
AMD           256     255         8                   Sponza  Deferred
AMD           256     255         16                  Sponza  Deferred
AMD           256     255         32                  Sponza  Deferred
AMD           256     255         64                  Sponza  Deferred
AMD           128     255         8                   Sponza  Deferred
AMD           128     255         16                  Sponza  Deferred
AMD           128     255         32                  Sponza  Deferred
AMD           128     255         64                  Sponza  Deferred
AMD           16      255         8                   Sponza  Deferred
AMD           16      255         16                  Sponza  Deferred
AMD           16      255         32                  Sponza  Deferred
AMD           16      255         64                  Sponza  Deferred
AMD           1024    255         8                   Sponza  Forward
AMD           1024    255         16                  Sponza  Forward
AMD           1024    255         32                  Sponza  Forward
AMD           1024    255         64                  Sponza  Forward
AMD           256     255         8                   Sponza  Forward
AMD           256     255         16                  Sponza  Forward
AMD           256     255         32                  Sponza  Forward
AMD           256     255         64                  Sponza  Forward
AMD           128     255         8                   Sponza  Forward
AMD           128     255         16                  Sponza  Forward
AMD           128     255         32                  Sponza  Forward
AMD           128     255         64                  Sponza  Forward
AMD           16      255         8                   Sponza  Forward
AMD           16      255         16                  Sponza  Forward
AMD           16      255         32                  Sponza  Forward
AMD           16      255         64                  Sponza  Forward

NVIDIA        512     255         8                   Grass   Compute Shader
NVIDIA        512     255         16                  Grass   Compute Shader
NVIDIA        512     255         32                  Grass   Compute Shader
NVIDIA        512     255         64                  Grass   Compute Shader
NVIDIA        16      16          8                   Grass   Compute Shader
NVIDIA        16      16          16                  Grass   Compute Shader
NVIDIA        16      16          32                  Grass   Compute Shader
NVIDIA        16      16          64                  Grass   Compute Shader
NVIDIA        512     255         8                   Sponza  Compute Shader
NVIDIA        512     255         16                  Sponza  Compute Shader
NVIDIA        512     255         32                  Sponza  Compute Shader
NVIDIA        512     255         64                  Sponza  Compute Shader
NVIDIA        128     92          8                   Sponza  Compute Shader
NVIDIA        128     92          16                  Sponza  Compute Shader
NVIDIA        128     92          32                  Sponza  Compute Shader
NVIDIA        128     92          64                  Sponza  Compute Shader
NVIDIA        512     255         8                   Grass   CPU
NVIDIA        512     255         16                  Grass   CPU
NVIDIA        512     255         32                  Grass   CPU
NVIDIA        512     255         64                  Grass   CPU
NVIDIA        16      16          8                   Grass   CPU
NVIDIA        16      16          16                  Grass   CPU
NVIDIA        16      16          32                  Grass   CPU
NVIDIA        16      16          64                  Grass   CPU
NVIDIA        512     255         8                   Sponza  CPU
NVIDIA        512     255         16                  Sponza  CPU
NVIDIA        512     255         32                  Sponza  CPU
NVIDIA        512     255         64                  Sponza  CPU
NVIDIA        128     92          8                   Sponza  CPU
NVIDIA        128     92          16                  Sponza  CPU
NVIDIA        128     92          32                  Sponza  CPU
NVIDIA        128     92          64                  Sponza  CPU
NVIDIA        512     255         8                   Grass   Deferred
NVIDIA        16      16          8                   Grass   Deferred
NVIDIA        512     255         8                   Sponza  Deferred
NVIDIA        128     92          8                   Sponza  Deferred
NVIDIA        512     255         8                   Grass   Forward
NVIDIA        16      16          8                   Grass   Forward
NVIDIA        512     255         8                   Sponza  Forward
NVIDIA        128     92          8                   Sponza  Forward
