Bandwidth-Efficient Rendering

Marius Bjørge, ARM

Page 3: Agenda

• Efficient on-chip rendering
• Post-processing
– Bloom
– Blur filters

Page 4: Efficient on-chip rendering

• Extensions
– Framebuffer fetch
– Pixel Local Storage
• Why extensions?
– Surely mobile GPUs are already bandwidth-efficient?

Page 5: Framebuffer fetch

• Read the current fragment’s previous color value
• ARM also supports reading the previous depth and stencil values of the current fragment
• Useful for
– Programmable blending
– Programmable depth/stencil testing

Page 6: Pixel Local Storage (PLS)

• Per-pixel storage that is persistent throughout the lifetime of the frame
– Read/write access
– Storage stays on-chip
– Storage layout declared per fragment shader invocation; does not depend on framebuffer format
• Useful for
– Deferred shading
– Order Independent Transparency [1]
– Volume rendering

Page 7: Pixel Local Storage (PLS)

• Rendering pipeline changes slightly when PLS is enabled
– Writing to PLS bypasses blending
• Note
– Fragment order
– PLS and color share the same memory location

Page 8: Pixel Local Storage (PLS)

[Diagram labels: Opaque phase (Fill gbuffer, Light accumulation); OIT phase (Init OIT, Transparent rendering, Resolve); Color. Pixel Local Storage layouts shown: RGB10A2, RGB10A2, RG16F, RG16F and R32UI, R32UI, R32UI, R32UI.]

At this point we change the layout of the PLS
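Both layouts named on this slide consist of four 32-bit values, so each fits the same 128 bits of per-pixel storage. The sketch below checks that arithmetic and shows how one RGB10A2 word could be packed. The bit order (red in the lowest bits, alpha in the top two) and all helper names are assumptions made for illustration, not something the slides specify.

```python
# Bits per PLS member format named on the slide.
FORMAT_BITS = {"R32UI": 32, "RGB10A2": 10 + 10 + 10 + 2, "RG16F": 16 + 16}

LAYOUT_A = ["RGB10A2", "RGB10A2", "RG16F", "RG16F"]
LAYOUT_B = ["R32UI", "R32UI", "R32UI", "R32UI"]


def layout_bits(layout):
    """Total per-pixel storage, in bits, for a list of member formats."""
    return sum(FORMAT_BITS[name] for name in layout)


def _quantize(value, bits):
    """Map a float in [0, 1] to an unsigned integer with the given bit width."""
    max_value = (1 << bits) - 1
    value = min(max(value, 0.0), 1.0)
    return int(value * max_value + 0.5)


def pack_rgb10a2(r, g, b, a):
    """Pack four normalized floats into one 32-bit word (assumed bit order)."""
    return (
        (_quantize(a, 2) << 30)
        | (_quantize(b, 10) << 20)
        | (_quantize(g, 10) << 10)
        | _quantize(r, 10)
    )


def unpack_rgb10a2(word):
    """Inverse of pack_rgb10a2, returning normalized floats."""
    r = (word & 0x3FF) / 1023.0
    g = ((word >> 10) & 0x3FF) / 1023.0
    b = ((word >> 20) & 0x3FF) / 1023.0
    a = ((word >> 30) & 0x3) / 3.0
    return r, g, b, a


if __name__ == "__main__":
    print(layout_bits(LAYOUT_A), layout_bits(LAYOUT_B))
    print(hex(pack_rgb10a2(1.0, 0.5, 0.0, 1.0)))
```

Because both layouts have the same footprint, switching between them reinterprets the same on-chip storage rather than allocating more.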

Page 9: Post-processing

Page 10: Post-processing

• High-end mobile devices typically have small displays with massive resolutions
• Rendering at native resolution is often out of the question, especially if you add post-processing to the mix
• Solution: mixed resolution rendering
– Go as low as you can without sacrificing quality, and then upscale
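As a rough illustration of why lowering the resolution helps, the sketch below compares pixel counts. The two resolutions are illustrative choices (1080p and 720p are the ones used later in the performance comparison); the function names are hypothetical.

```python
def pixel_count(width, height):
    """Number of pixels shaded per full-screen pass."""
    return width * height


def pixel_ratio(full, reduced):
    """How many times more pixels the full resolution has than the reduced one."""
    return pixel_count(*full) / pixel_count(*reduced)


FULL_RES = (1920, 1080)     # 1080p
REDUCED_RES = (1280, 720)   # 720p

if __name__ == "__main__":
    ratio = pixel_ratio(FULL_RES, REDUCED_RES)
    print(f"1080p has {ratio:.2f}x the pixels of 720p")
```

Every full-screen post-processing pass reads and writes each pixel at least once, so the ratio applies to bandwidth as well as to shading work.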

Page 11: Mobile post-processing

On-chip
• Color Grading
• Tonemapping

Off-chip
• Anti-aliasing
• Bloom
• Depth of Field
• Screen Space Ambient Occlusion
• Screen Space Reflections

Page 12: Bloom

• Doesn’t have to be physically correct
• Wide + thin

[Diagram: Threshold, Blur, Composite]

Page 13: Blur

• What makes a good blur filter?
• Goal:
– High quality
– Stable
– High performance

Page 14: Box blur

• 5x5 box blur = 25 samples
• Separate the blurs
– 5 + 5 = 10 samples

[Diagram: Horizontal pass, Vertical pass]
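The sample counts above follow from the box kernel being separable: a horizontal pass followed by a vertical pass gives the same result as the full 2D kernel. A minimal CPU-side sketch, assuming clamped edges and images stored as lists of rows (all names are illustrative):

```python
def clamp(i, n):
    """Clamp index i into [0, n - 1], repeating edge pixels."""
    return max(0, min(n - 1, i))


def box_blur_2d(img, radius):
    """Direct 2D box blur: (2r + 1)^2 samples per pixel."""
    h, w = len(img), len(img[0])
    size = 2 * radius + 1
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    total += img[clamp(y + dy, h)][clamp(x + dx, w)]
            out[y][x] = total / (size * size)
    return out


def box_blur_horizontal(img, radius):
    """1D box blur along each row: 2r + 1 samples per pixel."""
    h, w = len(img), len(img[0])
    size = 2 * radius + 1
    return [[sum(img[y][clamp(x + d, w)] for d in range(-radius, radius + 1)) / size
             for x in range(w)] for y in range(h)]


def box_blur_vertical(img, radius):
    """1D box blur along each column: 2r + 1 samples per pixel."""
    h, w = len(img), len(img[0])
    size = 2 * radius + 1
    return [[sum(img[clamp(y + d, h)][x] for d in range(-radius, radius + 1)) / size
             for x in range(w)] for y in range(h)]


def box_blur_separable(img, radius):
    """Horizontal pass followed by vertical pass."""
    return box_blur_vertical(box_blur_horizontal(img, radius), radius)


def samples_per_pixel(radius):
    """(samples for the 2D kernel, samples for the two separable passes)."""
    size = 2 * radius + 1
    return size * size, 2 * size


if __name__ == "__main__":
    print(samples_per_pixel(2))
```

The trade-off is an extra pass and an intermediate render target, which matters once resolutions are mixed.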

Page 15: Gaussian blur

• Convolve a gaussian function over the image
• Separable just like the box filter
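Separability can be checked numerically: the normalized 2D Gaussian kernel equals the outer product of the normalized 1D kernel with itself. The radius and sigma below are arbitrary illustrative values, and the function names are hypothetical.

```python
import math


def gaussian_kernel_1d(radius, sigma):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_kernel_2d(radius, sigma):
    """Normalized 2D Gaussian weights evaluated directly."""
    weights = [[math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
                for x in range(-radius, radius + 1)]
               for y in range(-radius, radius + 1)]
    total = sum(sum(row) for row in weights)
    return [[w / total for w in row] for row in weights]


def outer_product(kernel):
    """Outer product of a 1D kernel with itself."""
    return [[a * b for b in kernel] for a in kernel]


if __name__ == "__main__":
    k1 = gaussian_kernel_1d(2, 1.0)
    k2 = gaussian_kernel_2d(2, 1.0)
    sep = outer_product(k1)
    worst = max(abs(k2[y][x] - sep[y][x]) for y in range(5) for x in range(5))
    print(f"largest difference: {worst:.3e}")
```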

Page 16: Linear sampling optimization [2]

• Reduce number of texture lookups by exploiting the HW texture unit
– Modify sample offsets and gaussian weights
• Get 9x9 at similar cost as 5x5
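The modified offsets and weights can be derived by merging each pair of neighbouring taps into one linearly filtered fetch: the merged weight is the sum of the pair, and the merged offset is their weighted average. The sketch below does this in 1D for a 9-tap kernel, giving 5 fetches per pass. The binomial weights and the software `sample_linear` stand-in for the texture unit are assumptions for illustration; integer positions are treated as texel centres.

```python
import math

# One-sided weights of a 9-tap binomial kernel: centre, then offsets 1..4.
WEIGHTS_9TAP = [70 / 256, 56 / 256, 28 / 256, 8 / 256, 1 / 256]


def linear_sampling_taps(weights):
    """Merge taps (1,2), (3,4), ... into (offset, weight) pairs.

    Expects the centre weight followed by an even number of side weights.
    """
    taps = [(0.0, weights[0])]
    for i in range(1, len(weights) - 1, 2):
        w = weights[i] + weights[i + 1]
        offset = (i * weights[i] + (i + 1) * weights[i + 1]) / w
        taps.append((offset, w))
        taps.append((-offset, w))
    return taps


def sample_linear(signal, pos):
    """Linearly interpolated fetch with clamped edges (texture unit stand-in)."""
    i0 = math.floor(pos)
    frac = pos - i0
    last = len(signal) - 1
    a = signal[min(max(i0, 0), last)]
    b = signal[min(max(i0 + 1, 0), last)]
    return a * (1.0 - frac) + b * frac


def blur_discrete(signal, weights):
    """Reference blur: one fetch per tap (9 fetches for WEIGHTS_9TAP)."""
    last = len(signal) - 1
    radius = len(weights) - 1
    return [sum(weights[abs(d)] * signal[min(max(x + d, 0), last)]
                for d in range(-radius, radius + 1))
            for x in range(len(signal))]


def blur_linear(signal, taps):
    """Same blur using the merged taps (5 fetches for WEIGHTS_9TAP)."""
    return [sum(w * sample_linear(signal, x + offset) for offset, w in taps)
            for x in range(len(signal))]


if __name__ == "__main__":
    for offset, weight in linear_sampling_taps(WEIGHTS_9TAP):
        print(f"offset {offset:+.4f}  weight {weight:.4f}")
```

Running the same merge horizontally and vertically gives a 9x9 footprint with the same number of fetches as an unmerged 5x5 separable blur.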

Page 17: Mixing resolutions

[Diagram: chain of passes labelled Down, Up, Down, Up]

Page 18: Mixing resolutions

• Gets increasingly complicated when using separable kernels

[Diagram: chain of passes labelled Down, Horz, Vert, Down, Horz, Vert, Up, Horz, Vert, Up]

Page 19: Kawase blur [3]

Page 20: Kawase blur

Page 21: “Dual filtering”

[Diagram: sample patterns for the Downsample filter and the Upsample filter. Weights shown: 1/2, 1/8, 1/8, 1/8, 1/8]
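The weights shown (1/2 plus four times 1/8) sum to one, which fits a downsample that combines one centre fetch with four surrounding fetches. The sketch below is one possible reading of that pattern, not the presenter's shader: it assumes each fetch is a bilinear sample that averages a 2x2 block of source pixels, and that the four outer fetches sit one source pixel away diagonally. Those sample positions and all names are assumptions.

```python
CENTER_WEIGHT = 1.0 / 2.0
CORNER_WEIGHT = 1.0 / 8.0


def block_average(img, x0, y0):
    """Average of the 2x2 source block with top-left pixel (x0, y0).

    Stands in for one bilinear fetch at the block centre; edges are clamped.
    """
    h, w = len(img), len(img[0])
    total = 0.0
    for dy in (0, 1):
        for dx in (0, 1):
            yy = min(max(y0 + dy, 0), h - 1)
            xx = min(max(x0 + dx, 0), w - 1)
            total += img[yy][xx]
    return total / 4.0


def dual_downsample(img):
    """Halve the resolution using 1 centre fetch and 4 diagonal fetches."""
    h, w = len(img), len(img[0])
    out = []
    for dst_y in range(h // 2):
        row = []
        for dst_x in range(w // 2):
            x0, y0 = 2 * dst_x, 2 * dst_y
            value = CENTER_WEIGHT * block_average(img, x0, y0)
            for sx in (-1, 1):
                for sy in (-1, 1):
                    value += CORNER_WEIGHT * block_average(img, x0 + sx, y0 + sy)
            row.append(value)
        out.append(row)
    return out


if __name__ == "__main__":
    ramp = [[float(x) for x in range(8)] for _ in range(8)]
    for row in dual_downsample(ramp):
        print(row)
```

Each output pixel costs five fetches while touching a 4x4 neighbourhood of source pixels, which is where the bandwidth saving comes from.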

Page 22: “Dual filtering”

Page 23: Comparing filters

Page 24: Comparison setup

• 97x97 blur
• Gaussian used as reference
• Kawase
– First downsample to 1/16th resolution
– Set up with passes at distances 0, 1, 2, 3, 4, 4, 5, 6, 7
• “Dual filtering” set up with 8 passes
• Naïve method which relies on glGenerateMipmap
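The Kawase setup lists one sampling distance per pass. The pass itself is not spelled out on the slides; the sketch below assumes the commonly described form, in which each pass averages four bilinear fetches placed diagonally at (distance + 0.5) pixels from the pixel centre. The sampler and function names are illustrative, and integer coordinates are treated as pixel centres.

```python
import math

# Pass distances as listed in the comparison setup.
KAWASE_DISTANCES = [0, 1, 2, 3, 4, 4, 5, 6, 7]


def sample_bilinear(img, x, y):
    """Bilinear fetch with clamped edges (texture unit stand-in)."""
    h, w = len(img), len(img[0])
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0

    def texel(ix, iy):
        return img[min(max(iy, 0), h - 1)][min(max(ix, 0), w - 1)]

    top = texel(x0, y0) * (1.0 - fx) + texel(x0 + 1, y0) * fx
    bottom = texel(x0, y0 + 1) * (1.0 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1.0 - fy) + bottom * fy


def kawase_pass(img, distance):
    """One pass: average of four diagonal bilinear fetches (assumed form)."""
    h, w = len(img), len(img[0])
    o = distance + 0.5
    return [[0.25 * sum(sample_bilinear(img, x + sx * o, y + sy * o)
                        for sx in (-1, 1) for sy in (-1, 1))
             for x in range(w)] for y in range(h)]


def kawase_blur(img, distances):
    """Apply one pass per listed distance, feeding each result to the next."""
    for d in distances:
        img = kawase_pass(img, d)
    return img


if __name__ == "__main__":
    impulse = [[0.0] * 9 for _ in range(9)]
    impulse[4][4] = 1.0
    for row in kawase_pass(impulse, 0)[3:6]:
        print(row[3:6])
```

Under these assumptions a distance-0 pass is a 3x3 kernel with weights 1-2-1 / 2-4-2 / 1-2-1 over 16, obtained with four fetches per pixel.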

Page 25: [Images: Input, Reference, Dual, Kawase]

PSNR: Dual 49.78 dB, Kawase 50.02 dB
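PSNR compares a filtered image against a reference, with higher values meaning a closer match. A minimal sketch of the usual definition follows; whether the slides used exactly this form, and which peak value, is an assumption. Images are passed as flat sequences of samples.

```python
import math


def mean_squared_error(reference, test):
    """Mean of squared per-sample differences between two equal-length images."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of samples")
    return sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)


def psnr(reference, test, peak=1.0):
    """Peak signal-to-noise ratio in dB; infinite when the images are identical."""
    mse = mean_squared_error(reference, test)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10((peak * peak) / mse)


if __name__ == "__main__":
    ref = [0.0, 0.0, 0.0, 0.0]
    noisy = [0.1, 0.1, 0.1, 0.1]
    print(f"{psnr(ref, noisy):.2f} dB")
```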

Page 26: Stability comparison

Page 27: [Images: Input, Reference, Dual, Kawase, Naive]

Page 28: [Images: Input, Reference, Dual, Kawase, Naive]

Page 29: [Images: Input, Reference, Dual, Kawase, Naive]

Page 30: Performance comparison

Page 31: Performance (ms)

Filter          1080p   720p
Gaussian        41.9    19.2
Box             21.4    9.9
5x5 gaussian    23.5    7.0
Kawase          4.5     2.8
Dual            2.8     1.7

Tested on a Mali-T760 MP8

Page 32: Bandwidth

[Chart: Reads and Writes (%) for Linear sampling, 5x5 gaussian reduction, Kawase, Dual. Data labels as extracted: 98%, 10%, 12%, 5%, 2%, 3%, 4%, 2%]

Page 33: Cache utilization

Filter                   Hit
Linear sampling          2%
5x5 gaussian reduction   70%
Kawase                   63%
Dual                     71%

Page 34: Summary

• On-chip rendering
– Please use the extensions
• Bloom
– Multi-pass mixed resolution
– “Dual filter” blur
• Next steps
– Work on getting on-chip rendering into future core APIs
– Look into alternative data flows for doing blurs