GDC2014: Boosting your ARM mobile 3D rendering performance with Umbra
Antwan Hätälä, Umbra 3 Lead Programmer
Boosting your ARM mobile 3D rendering performance with Umbra 3
INDEX
• Who are we?
• Games
• What is Umbra 3 and occlusion culling
• Bringing our system to the PlayStation 4
• Experiences and benefits
• Lessons learned
UMBRA SOFTWARE
Occlusion culling middleware for 3D games
Founded in 2007
14 employees
Based in Helsinki, Finland
Support office in Seattle, WA
Same problem – Different solutions
Mo Money – Mo Problems
“Level artists are there to fill the world with content. Integrating Umbra saved us not only artist time but the time to create and maintain an efficient visibility culling solution. Umbra’s support provides us with the solutions and features that we need.”
“Umbra’s technology is playing an important role in the creation of our next universe, by freeing our artists from the burden of manual markups typically associated with polygon soup.”
Occlusion culling basics
Occlusion Culling: Why bother?
• Process and render only what's visible
• Improved frame rate and rendering performance
• Allows you to put more detail into levels and create larger levels
What is Umbra?
Umbra 3 occluder rasterizer
• Determines visible objects fast to save further work on both the CPU and GPU
• Rasterizes automatically generated, proprietary occluder models on the CPU
• Operates in low resolution and generates conservative (dilated) results
• Rasterization is embarrassingly parallel in nature
• Parallelized across CPU cores; multiple pixels/elements processed in SIMD
• Optimized for SSE, AltiVec, Cell and ARM NEON
NEON overview
• Processing of multiple data elements (2 to 16) in a single instruction
• Separate execution pipeline: can execute in parallel with the ARM pipeline
• Separate register file: 16 128-bit regs (or 32 64-bit), SP floats or 8-64 bit integers
• Mandatory in Cortex-A8/A12/A15, optional in Cortex-A9
• For mobile 3D titles, it will be there
• Actual cycle counts will vary: 64-bit vs 128-bit, single vs dual issue, latencies
• For multi-platform, target A9 and enjoy free benefits on more advanced platforms
• Used in one of three ways: inline assembly, compiler intrinsics, compiler auto-vectorization (a minimal intrinsics sketch follows below)
• Similar to SSE and AltiVec, but for best performance you need to know your platform
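For the compiler-intrinsics route, a minimal sketch (not code from the talk; the routine and its names are assumptions for illustration) could compute dst[i] += src[i] * scale four floats at a time:

#include <arm_neon.h>

/* Hypothetical example of the compiler-intrinsics path: dst[i] += src[i] * scale,
   four single-precision floats per NEON instruction.
   n is assumed to be a multiple of 4 to keep the sketch short. */
void madd_scalar(float *dst, const float *src, float scale, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        float32x4_t d = vld1q_f32(dst + i);            /* load 4 floats */
        float32x4_t s = vld1q_f32(src + i);
        vst1q_f32(dst + i, vmlaq_n_f32(d, s, scale));  /* d + s * scale, then store */
    }
}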
NEON common best practices
• Collaborate with the compiler, but keep an eye on the output
• Align your data when possible
• Inline functions that operate on SIMD values
• Use __restrict to let the compiler reorder memory accesses (see the sketch after this list)
• Watch for register spilling
• Schedule enough NEON work, even when it might be redundant
• Loading data from ARM registers is relatively cheap; storing back is expensive
• Hide load/store latencies by interleaving with computation (unroll your loops)
• Never interleave VFP instructions with NEON: it means a pipeline flush, tens of cycles of penalty
• Watch for "s" register use in compiler output
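A hedged sketch of two of these tips (the routine and its names are made up for illustration): __restrict tells the compiler the pointers never alias, so it may reorder loads and stores, and the 2x unroll gives it NEON work to interleave with the load/store latencies. Here n is assumed to be a multiple of 8.

#include <arm_neon.h>

/* Add two float arrays; __restrict plus unrolling lets the compiler
   overlap the loads of one batch with the arithmetic of the other. */
void add_arrays(float * __restrict dst,
                const float * __restrict a,
                const float * __restrict b, int n)
{
    for (int i = 0; i < n; i += 8)
    {
        /* issue all loads up front so the adds can hide their latency */
        float32x4_t a0 = vld1q_f32(a + i);
        float32x4_t a1 = vld1q_f32(a + i + 4);
        float32x4_t b0 = vld1q_f32(b + i);
        float32x4_t b1 = vld1q_f32(b + i + 4);
        vst1q_f32(dst + i,     vaddq_f32(a0, b0));
        vst1q_f32(dst + i + 4, vaddq_f32(a1, b1));
    }
}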
NEON optimization tricks
• No penalty from interleaving 2-wide ops with 4-wide ops
  • Cortex-A8/A9 does 64-bit float operations per cycle
  • vget_high_xxx, vget_low_xxx to address quadword halves
• Narrow to 64 bits early
  • 16x4 and 8x8 are also 64 bits; for many operations 32 bits per channel is not needed
  • Even if the CPU can churn out 128 bits per cycle, there are savings to be had in result latency etc.
  • Use VMOVN or a coupled operate-and-narrow instruction
• Careful with your constants
  • VMOV and VMVN can encode lots of useful constants
  • Compilers do a good job of constant encoding, but can't choose the constants for you
• Killer instructions (a VTBL sketch follows below)
  • Shift-and-insert: VSRI, VSLI
  • Byte permute by table lookup: VTBL, VTBX
  • Gather load and scatter store: VLD2-4, VST2-4
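As a hedged illustration of the VTBL table lookup (an example made up here, not one shown in the talk), a single instruction can perform an arbitrary byte permute, such as reversing the bytes of a 64-bit vector:

#include <arm_neon.h>

/* Reverse the byte order of a 64-bit vector with one table lookup. */
uint8x8_t reverse_bytes(uint8x8_t v)
{
    static const uint8_t idx[8] = { 7, 6, 5, 4, 3, 2, 1, 0 };
    return vtbl1_u8(v, vld1_u8(idx));   /* out[i] = v[idx[i]] */
}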
NEON optimization example
Example routine: gather the sign bits of a large array of float values

function gather_signbits(flt_array):
    let output_bitmap = bitmap of size len(flt_array)
    foreach elem in flt_array at index idx:
        if (elem < 0)
            set_bit(output_bitmap, idx)
        else
            clear_bit(output_bitmap, idx)
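A straightforward NEON-intrinsics rendering of this routine might look roughly like the sketch below (an assumption for illustration, not code from the talk): compare four floats against zero per instruction, mask each lane down to its bit position, and collapse the lanes with pairwise adds. count is assumed to be a multiple of 8.

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

/* Hedged sketch of gather_signbits with NEON intrinsics:
   8 floats per iteration produce one byte of the output bitmap. */
void gather_signbits(const float *values, uint8_t *bitmap, size_t count)
{
    static const uint32_t lo_bits[4] = { 1u, 2u, 4u, 8u };      /* bits 0..3 */
    static const uint32_t hi_bits[4] = { 16u, 32u, 64u, 128u }; /* bits 4..7 */
    const float32x4_t zero = vdupq_n_f32(0.0f);
    const uint32x4_t  lo   = vld1q_u32(lo_bits);
    const uint32x4_t  hi   = vld1q_u32(hi_bits);

    for (size_t i = 0; i < count; i += 8)
    {
        /* elem < 0 yields an all-ones lane; AND keeps only that lane's bit */
        uint32x4_t m0 = vandq_u32(vcltq_f32(vld1q_f32(values + i),     zero), lo);
        uint32x4_t m1 = vandq_u32(vcltq_f32(vld1q_f32(values + i + 4), zero), hi);

        /* combine the 8 disjoint bits into lane 0 with pairwise adds */
        uint32x4_t m = vorrq_u32(m0, m1);
        uint32x2_t s = vpadd_u32(vget_low_u32(m), vget_high_u32(m));
        s = vpadd_u32(s, s);
        bitmap[i / 8] = (uint8_t)vget_lane_u32(s, 0);
    }
}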
NEON optimization example: first attempt
• Sufficient unrolling: handle 16 elements in one iteration
• Compare 4 values per instruction
• Bitwise AND for correct bit offsets
• Collapse with vertical OR (pairwise add)

20: add.w     r2, r0, #32
24: vld1.64   {d28-d29}, [r0 :128]
28: vld1.64   {d24-d25}, [r2 :128]
2c: add.w     r2, r0, #16
30: vclt.f32  q14, q14, #0
34: vld1.64   {d26-d27}, [r2 :128]
38: add.w     r2, r0, #48    ; 0x30
3c: vclt.f32  q12, q12, #0
40: vand      q14, q8, q14
44: vld1.64   {d30-d31}, [r2 :128]
48: vclt.f32  q13, q13, #0
4c: vand      q13, q11, q13
50: vclt.f32  q15, q15, #0
54: vand      q12, q10, q12
58: vand      q15, q9, q15
5c: vorr      q13, q14, q13
60: vorr      q12, q12, q15
64: vorr      q12, q13, q12
68: vpadd.i32 d24, d24, d25
6c: vpadd.i32 d24, d24, d24
70: vst1.32   {d24[0]}, [r0 :32], r1
NEON optimization example: shift-and-insert, narrow early
• Compare with zero = shift the sign bit
• Can shift and combine simultaneously with the VSRI instruction (an intrinsics sketch follows the listing below)
• Narrow to 16 bits (VMOVN) before proceeding further
• Half the amount of constants

18: vld1.64    {d18-d19}, [r0 :128]
1c: add.w      r3, r0, #16
20: adds       r1, #4
22: vshr.u32   q9, q9, #19
26: vld1.64    {d20-d21}, [r3 :128]
2a: add.w      r3, r0, #32
2e: vsri.32    q9, q10, #23
32: vld1.64    {d20-d21}, [r3 :128]
36: add.w      r3, r0, #48    ; 0x30
3a: vsri.32    q9, q10, #27
3e: vld1.64    {d20-d21}, [r3 :128]
42: vsri.32    q9, q10, #31
46: vmovn.i32  d18, q9
4a: vand       d18, d18, d16
4e: vshl.u16   d18, d18, d17
52: vpaddl.u16 d18, d18
56: vpadd.i32  d18, d18, d18
5a: vst1.32    {d18[0]}, [r0 :32], r2
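An intrinsics sketch along the same lines as the listing above might look as follows (an assumption for illustration; the bit ordering inside the result word is not element order and may differ from the slide's compiler output): shift the first group's sign bits into place, insert the remaining groups with VSRI, narrow to 16 bits, then spread and collapse the lanes. count is assumed to be a multiple of 16.

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>

/* Hedged sketch of the shift-and-insert, narrow-early variant:
   16 floats per iteration produce one 16-bit word of the bitmap. */
void gather_signbits_vsri(const float *values, uint16_t *bitmap, size_t count)
{
    static const int16_t lane_shift[4] = { 0, 1, 2, 3 };
    const int16x4_t  shifts = vld1_s16(lane_shift);
    const uint16x4_t mask   = vdup_n_u16(0x1111);

    for (size_t i = 0; i < count; i += 16)
    {
        uint32x4_t v0 = vreinterpretq_u32_f32(vld1q_f32(values + i));
        uint32x4_t v1 = vreinterpretq_u32_f32(vld1q_f32(values + i + 4));
        uint32x4_t v2 = vreinterpretq_u32_f32(vld1q_f32(values + i + 8));
        uint32x4_t v3 = vreinterpretq_u32_f32(vld1q_f32(values + i + 12));

        /* sign bit of v0 lands at bit 12; VSRI inserts the others at bits 8, 4, 0 */
        uint32x4_t acc = vshrq_n_u32(v0, 19);
        acc = vsriq_n_u32(acc, v1, 23);
        acc = vsriq_n_u32(acc, v2, 27);
        acc = vsriq_n_u32(acc, v3, 31);

        /* narrow early, then mask away the extra bits that the shifts dragged in */
        uint16x4_t nib = vand_u16(vmovn_u32(acc), mask);

        /* shift lane k left by k so all 16 bits are disjoint, then collapse */
        nib = vshl_u16(nib, shifts);
        uint32x2_t sum = vpaddl_u16(nib);
        sum = vpadd_u32(sum, sum);
        bitmap[i / 16] = (uint16_t)vget_lane_u32(sum, 0);
    }
}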
Thank you. For more on Umbra 3, go to: umbra3.com
[email protected]
Follow us on Twitter @umbrasoftware