© Copyright Khronos Group, 2004 - Page 1

The challenge of migration : desktop to handheld

Phil Atkin, Product Manager 3D Graphics

September 2004

Topics

Overview
• Definitions
  - What does ‘desktop’ mean?
  - What does ‘handheld’ mean?
• Challenges
  - Management of 3D resources
  - Management of CPU resources
• Case study
  - Realities of porting a desktop 3D framework to handheld
  - Demonstrations (Intel / Intrinsyc Carbonado)
  - Performance (PowerBook vs. Carbonado)
• Conclusions

Desktop vs. handheld systems

Desktop system
• CPU + GPU + 3D API
  - Powerful - 1GHz up to >3GHz CPU with SIMD floating-point
  - Big caches
  - Minimum ‘Free3D’ chipset
  - Maximum GeForce 6800 / Radeon X800
  - OpenGL 1.5 transitioning to OpenGL 2.0

Handheld system (PowerVR 3D)
• CPU + GPU + 3D API
  - CPU ranges from 100MHz to 500+MHz
  - Small caches
  - CPU may or may not have FP capability
  - Minimum MBX Lite no VGP - 1M tris, 100M pixels
  - Maximum MBX VGP - 4M tris, 350M pixels, free AA
  - OpenGL ES 1.0 transitioning to OpenGL ES 1.1

Handheld 3D

• Delivering accelerated handheld 3D is all about power management
• All chip vendors have access to similar process technologies
  - Leads to similar power / MHz
  - Leads to similar performance / mW
• All system vendors have access to similar battery technologies
  - Leads to similar ‘talk time / game-time’ per recharge
• Some architectures have clear power/performance advantages
  - Tile-based rendering, on-die framebuffers - minimize data passing between chips
• These factors lead to a relatively narrow spectrum of capabilities
• Low-end and high-end systems only differ by 3-4x
• Admittedly PowerVR sets a high baseline, but the generalization holds

Observations

Even low-end handheld 3D accelerators will offer excellent performance
• On par with 2nd / 3rd generation desktop accelerators
• Efficient API is in place and standardized
• Hence the path from the driver to the hardware is sorted - but …

What about the path from the application to the driver?
• How to structure application code to keep hardware busy?

Despite relatively narrow spectrum of 3D capabilities
• Potential for extremely large disparity between systems
• Floating point-less CPU, rasterizer-only 3D
• Very high performance CPU / FPU, vertex-programmable 3D

How to develop or port with such a spread of computational capabilities?

The challenge

Management of 3D capabilities is not the challenge
• The usual techniques learned in the desktop space can be used
• Resolution / triangle count / texture filtering / AA quality

Management of CPU resources is the challenge
• Lowering vertex counts to GPU will inherently lower CPU load
• But the problem is far bigger in scope than just this
• The data type float is essentially unavailable at the low end

Platform CPUs have such diverse capabilities - either
• Stratify in software, code explicitly to each market stratum
• Or code in a floating-point agnostic manner

The latter is achievable and allows a single code base across platforms

Why bother porting to an FPU-less platform?

Consider the following 3 likely classes of handheld device
• Class A
  - High-performance CPU, FPU, GPU with vertex processing
• Class B
  - High-performance CPU, GPU with vertex processing
• Class C
  - CPU, rasterizer
• Classes B and C will likely be smaller die, lower cost
• Will likely ship in higher volumes
• If so -
  - will offer more revenue opportunities for software vendors
  - yet platforms do not have floating-point capability
• But a Class A device may win out
• Software vendors must cover all the bases to guarantee success

Why not just make everything fixed point?

Because your desktop platform
• Will be faster in floating-point
• Does not have fixed-point OpenGL ES entrypoints!

If you really need
• The same code base to run on desktop and handheld
• High performance on all classes of handheld systems

You need to abstract out your numeric format

C++ class, build-time switchable from 16.16 to float
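
A minimal sketch of what such a build-time switchable number type might look like. The names Fixed16, Real, toFloat, lerp and the macro REAL_IS_FIXED are assumptions made for illustration; they are not taken from the presentation.

```cpp
#include <cstdint>

// Hypothetical 16.16 fixed-point type: 16 integer bits, 16 fractional bits.
class Fixed16 {
public:
    Fixed16() : raw_(0) {}
    explicit Fixed16(int v) : raw_(static_cast<int32_t>(v) * 65536) {}
    explicit Fixed16(float v) : raw_(static_cast<int32_t>(v * 65536.0f)) {}

    static Fixed16 fromRaw(int32_t r) { Fixed16 f; f.raw_ = r; return f; }
    int32_t raw() const { return raw_; }

    friend Fixed16 operator+(Fixed16 a, Fixed16 b) { return fromRaw(a.raw_ + b.raw_); }
    friend Fixed16 operator-(Fixed16 a, Fixed16 b) { return fromRaw(a.raw_ - b.raw_); }
    friend Fixed16 operator*(Fixed16 a, Fixed16 b) {
        // Widen to 64 bits so the intermediate product cannot overflow.
        // (Right shift of a negative value is arithmetic on mainstream compilers.)
        int64_t wide = static_cast<int64_t>(a.raw_) * b.raw_;
        return fromRaw(static_cast<int32_t>(wide >> 16));
    }

private:
    int32_t raw_;
};

// Conversions back to float, overloaded so calling code reads the same either way.
inline float toFloat(Fixed16 v) { return static_cast<float>(v.raw()) / 65536.0f; }
inline float toFloat(float v) { return v; }

// Build-time switch (REAL_IS_FIXED is an assumed macro name).
#ifdef REAL_IS_FIXED
typedef Fixed16 Real;
#else
typedef float Real;
#endif

// Application code is written once against Real.
inline Real lerp(Real a, Real b, Real t) { return a + (b - a) * t; }
```

Code such as `lerp(Real(2), Real(4), Real(0.5f))` compiles unchanged in either configuration, which is the property that lets the same source build for desktop (float) and handheld (16.16).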

Porting desktop software - 4 step program

• Observations
  - Debugging on a handheld is no fun
  - The porting process needs to be derisked as much as possible
  - Strive to get as close as possible to the handheld codebase without leaving the desktop
  - Code extremely defensively - make no assumptions regarding performance
• ‘Portification’
  - Yes, I know it’s not a real word…
  - The process of preparing for the port without actually executing on it
• Step 1 - implement the abstracted real number class
• Step 2 - portify 3D code
• Step 3 - portify application code
• Step 4 - do the port

Step 1 - implement real number class

• C++ operators for +-*/ and type conversion
• Note ARM does not have a divide instruction
  - Recommendation - normalize / reciprocate / multiply / denormalize
  - ARM does have a normalize instruction - CLZ
• Functions for common but expensive operations
  - E.g. implement your own sqrt and trig
  - Why - because you may wish to sidestep glRotate() etc.
• These functions will of course work in fixed or float
• Hence testability on desktop is high and immediate
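
One way the normalize / reciprocate / multiply / denormalize recommendation could be realized for 16.16 values is sketched below. The function names are hypothetical, and the reciprocal step (a linear first guess refined by Newton-Raphson) is an assumption; the slides do not say how the reciprocal is obtained. The loop in countLeadingZeros stands in for the single CLZ instruction.

```cpp
#include <cstdint>

// Portable stand-in for the ARM CLZ instruction; v must be nonzero.
inline int countLeadingZeros(uint32_t v) {
    int n = 0;
    while ((v & 0x80000000u) == 0u) { v <<= 1; ++n; }
    return n;
}

// Divide two 16.16 raw values using shifts and multiplies only. b must be nonzero.
inline int32_t fixedDivide(int32_t a, int32_t b) {
    const bool negative = (a < 0) != (b < 0);
    const uint32_t ua = a < 0 ? 0u - static_cast<uint32_t>(a) : static_cast<uint32_t>(a);
    const uint32_t ub = b < 0 ? 0u - static_cast<uint32_t>(b) : static_cast<uint32_t>(b);

    // Normalize: shift the divisor until its top bit is set; read as Q0.32 it lies in [0.5, 1).
    const int shift = countLeadingZeros(ub);
    const uint32_t x = ub << shift;

    // Reciprocate: first guess r = 48/17 - (32/17) * x in Q2.30,
    // then three Newton-Raphson steps r = r * (2 - x * r).
    uint32_t r = 3031741621u - static_cast<uint32_t>((2021161080ull * x) >> 32);
    for (int i = 0; i < 3; ++i) {
        const uint32_t xr = static_cast<uint32_t>((static_cast<uint64_t>(x) * r) >> 32);
        const uint32_t t = 0x80000000u - xr;  // 2.0 - x * r in Q2.30
        r = static_cast<uint32_t>((static_cast<uint64_t>(r) * t) >> 30);
    }

    // Multiply by the reciprocal and denormalize, rounding to nearest.
    const int down = 46 - shift;
    const uint64_t q = (static_cast<uint64_t>(ua) * r + (1ull << (down - 1))) >> down;
    return negative ? -static_cast<int32_t>(q) : static_cast<int32_t>(q);
}
```

The result is approximate in the last bit or so, and a quotient that does not fit in 16.16 is not detected here. Because the routine uses only integer operations, it can be checked on the desktop against ordinary division before it ever runs on the handheld.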

Step 2 - portify 3D code

• Isolate your 3D code if not already done
  - Minimize #include <gl/gl.h>
• Modify 3D code so it is OpenGL / OpenGL ES agnostic
• Modify it so it is floating point / fixed point agnostic
• And obviously modify your data too
• Make your world representable by 16.16
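
As an illustration of the last point, a signed 16.16 value covers roughly -32768 to +32768 in steps of 1/65536, so world coordinates outside that range have to be rescaled. The helper below is a hypothetical sketch (its name and the headroom parameter are assumptions): it returns a uniform scale that brings a set of coordinates inside the range, with headroom left for sums and transformed values.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Largest whole-number magnitude that a signed 16.16 value can hold.
const float kFixed16Limit = 32767.0f;

// Returns a uniform scale (at most 1) that keeps every coordinate within
// kFixed16Limit / headroom. A headroom above 1 leaves room for intermediate results.
inline float scaleToFit16_16(const std::vector<float>& coords, float headroom) {
    float largest = 0.0f;
    for (std::size_t i = 0; i < coords.size(); ++i) {
        const float magnitude = std::fabs(coords[i]);
        if (magnitude > largest) largest = magnitude;
    }
    const float allowed = kFixed16Limit / headroom;
    return largest > allowed ? allowed / largest : 1.0f;
}
```

Running a check like this over the desktop data set shows early whether the world needs rescaling, before any handheld build exists.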

Step 3 - portify application code

• Work out what maths absolutely must be floating-point
• Replace everything else with real number class
• But be really careful - for example
  - Really common case - distance between 2 points - Pythagoras
  - Squaring those numbers will blow up for almost all cases
  - Code defensively - implement a ‘radius’ function that will not blow up
• OK, you could keep this example as floats
  - But floats are so very expensive without FPU
  - It’s a common operation, and it’s easy to get it right in fixed-point
• Remember - conservation of CPU cycles is the challenge
  - The hardware developers and Khronos have taken care of the 3D
  - CPU cycles are precious, conserve them
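
To see why squaring blows up: in 16.16 the square of any value above about 181 already exceeds the representable range. A sketch of a ‘radius’ function that avoids this is given below, assuming 64-bit integer arithmetic is acceptable for the intermediate sum; the names isqrt64 and fixedRadius are hypothetical, and other approaches (for example pre-scaling by the larger component) would also work.

```cpp
#include <cstdint>

// Integer square root (floor) of a 64-bit value using shifts, adds and compares only.
inline uint32_t isqrt64(uint64_t n) {
    uint64_t root = 0;
    uint64_t bit = 1ull << 62;
    while (bit > n) bit >>= 2;
    while (bit != 0) {
        if (n >= root + bit) {
            n -= root + bit;
            root = (root >> 1) + bit;
        } else {
            root >>= 1;
        }
        bit >>= 2;
    }
    return static_cast<uint32_t>(root);
}

// Length of the vector (dx, dy), all values in raw 16.16.
// The squares are summed in 64 bits, so they cannot overflow; a result too
// large for 16.16 saturates instead of wrapping.
inline int32_t fixedRadius(int32_t dx, int32_t dy) {
    const uint64_t sum = static_cast<uint64_t>(static_cast<int64_t>(dx) * dx)
                       + static_cast<uint64_t>(static_cast<int64_t>(dy) * dy);
    const uint32_t r = isqrt64(sum);
    return r > 0x7FFFFFFFu ? 0x7FFFFFFF : static_cast<int32_t>(r);
}
```

The square root of the sum of the raw squares is already the raw 16.16 distance, so no rescaling is needed afterwards.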

Step 4 - port to the handheld platform

This step is really easy if the last 3 went well ...
• Take cross-compiler
• Turn on all the #ifdefs you prepared earlier
• Type ‘make’
• Or under Embedded Visual C++ hit F7

It will just work. Trust me, it will.

Case study - the Mobile Scene Graph

Framework for 3D applications
• Initial implementation - desktop
  - Interactive landscape, architecture and garden design review
  - Straightforward design
    - Classic app + cull + draw, frustum culling
    - C++, STL, polymorphic, RTTI
  - Target platform PowerBook G3 500MHz / OpenGL / glut
• Transitioned into
  - Desktop - interactive landscape, architecture and garden design review
  - Handheld - experimental testbed for OpenGL ES rendering
  - Target platforms
    - PowerBook G3 500MHz / OpenGL 1.4 / glut
    - Intel / Intrinsyc Carbonado / OpenGL ES 1.0 / egl
• Great opportunity to take on a port
  - Aiming for 100% application source code compatibility
  - Aiming to deliver highest possible performance on desktop and handheld

MSG Implementation details

• ‘MSGReal’
  - Build-time switchable float or OpenGL ES 16.16 fixed point
  - C++ operators provide +-*/ and common type conversions
  - Functions provide trig, sqrt / recipsqrt
  - All expensive operations implemented by piecewise quadratics
• Additional 4.12 ‘MSGShortFix’ type
  - Intermediate product fits into 32 bits, no double-length maths
  - Superbright unclamped colour accumulation
  - Reflection-mapping via quadratic approximation without overflow
• Only 2 internal functions use floating-point
  - Plane fitter for frustum construction
  - Determinant calculation in matrix inverter
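
The appeal of a 4.12 format can be sketched as follows: two 16-bit operands multiply into a 32-bit product, so no 64-bit intermediate is needed, and the range of about -8 to +8 leaves room for colour values above 1.0. The type and function names below are hypothetical and are not the MSGShortFix implementation.

```cpp
#include <cstdint>

// Hypothetical 4.12 type: 16-bit storage, 12 fractional bits, range about [-8, 8).
struct ShortFix412 {
    int16_t raw;
};

inline ShortFix412 shortFixFromFloat(float v) {
    ShortFix412 s;
    s.raw = static_cast<int16_t>(v * 4096.0f);
    return s;
}

inline float shortFixToFloat(ShortFix412 s) { return s.raw / 4096.0f; }

inline ShortFix412 operator*(ShortFix412 a, ShortFix412 b) {
    // 16 x 16 -> 32-bit product in 8.24 format; no double-length maths required.
    const int32_t product = static_cast<int32_t>(a.raw) * static_cast<int32_t>(b.raw);
    ShortFix412 s;
    s.raw = static_cast<int16_t>(product >> 12);
    return s;
}

inline ShortFix412 operator+(ShortFix412 a, ShortFix412 b) {
    // Unclamped: sums above 1.0 are kept as long as they stay below 8.0.
    ShortFix412 s;
    s.raw = static_cast<int16_t>(a.raw + b.raw);
    return s;
}
```

For example, accumulating two colour contributions of 0.75 gives 1.5 rather than clamping at 1.0, which is the ‘superbright’ behaviour the slide mentions.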

Porting realities - timescales

Approximately 3 man-months of portification
• Difficult to measure accurately
• Coding was in progress as portification began

Approximately 20,000 lines of code
• Only 800 lines can see <gl/gl.h>
• Just 8 #ifdefs in this module
• i.e. if this is representative, the portification process is manageable

2 evening porting sessions
• Just 6 hours at the desk from ‘move code onto PC’ to ‘run on handheld’
• … and one evening should have been enough

Then performance tuning
• Anticipated >30Hz was only 15-20Hz
• Now tuned up to >40Hz with no change in geometric load

Porting realities - gotchas

Handheld specific
• Performance not linear with clock for a variety of reasons
  - e.g. caching behaviour, driver behaviour, architectural
• Limited container class and template support
• Some C++ operations will hurt more than you expect
  - Very slow RTTI
  - STL list operations sort(), push_back(), pop_front() proved surprisingly expensive

3D gotchas
• Unanticipated differences in behaviour
  - E.g. multiple strips from single pointer setup - multiple TnL on Carbonado
  - Would benefit from glMultiDrawElements
• Short tristrip performance
  - Would benefit from glMultiDrawElements!!
• Best performance - glDrawElements(GL_TRIANGLES)
• Fixed-point to integer conversion in OpenGL ES interface
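
One way to act on the strip observations is to merge many short strips into a single indexed triangle list and submit it with one glDrawElements(GL_TRIANGLES, ...) call. The helper below is a hypothetical sketch of that conversion (the function name is an assumption); it preserves winding order by swapping the first two indices of every odd-numbered triangle, following the OpenGL triangle strip convention.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends the triangles of one triangle strip to an indexed triangle list.
// Degenerate triangles (repeated indices) are dropped.
inline void appendStripAsTriangles(const std::vector<uint16_t>& strip,
                                   std::vector<uint16_t>& triangles) {
    for (std::size_t i = 0; i + 2 < strip.size(); ++i) {
        const uint16_t a = strip[i];
        const uint16_t b = strip[i + 1];
        const uint16_t c = strip[i + 2];
        if (a == b || b == c || a == c) continue;
        if (i & 1) {
            // Odd triangles in a strip have reversed winding; swap to restore it.
            triangles.push_back(b);
            triangles.push_back(a);
            triangles.push_back(c);
        } else {
            triangles.push_back(a);
            triangles.push_back(b);
            triangles.push_back(c);
        }
    }
}
```

Calling this once per strip against a shared output vector yields a single index buffer, which could then be drawn with GL_UNSIGNED_SHORT indices in one call. The trade-off is a larger index buffer in exchange for fewer draw calls.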

Demonstrations

MSGRefMap - arithmetic performance test
• Single object, reflection mapped
  - Cull time virtually zero
  - Virtually all cycles spent in reflection-map code
  - This is fixed-point on all platforms
  - 16-bit skybox textures

MSGHurricane - frustum-culling test
• 2048 objects in hierarchical terrain
  - unlit, 8-bit luminance texture
• 7 animated aircraft
  - lit with 2 lights
  - 16-bit aircraft texture
  - 16-bit skybox textures

Performance

MSGRefMap
• PowerBook floating point
  - OpenGL renderer - 116 Hz
  - NULL renderer - 1360 Hz
• PowerBook fixed point
  - NULL renderer - 1620 Hz
• Carbonado fixed point
  - OpenGL ES renderer - 35.9 Hz
  - NULL renderer - 668.4 Hz
• Carbonado floating point
  - NULL renderer - 101.2 Hz

MSGHurricane
• PowerBook floating point
  - OpenGL renderer - 122 Hz
  - NULL renderer - 1890 Hz
• PowerBook fixed point
  - NULL renderer - 960 Hz
• Carbonado fixed point
  - OpenGL ES renderer - 34.6 Hz
  - NULL renderer - 271.5 Hz
• Carbonado floating point
  - NULL renderer - 46.25 Hz

• Fixed-point code averages 6x faster than FP emulation
  - Despite data structure traversal and other non-arithmetic code
  - Despite fixed point reflection-mapping code in floating point version
  - This is a fast CPU, yet it is too slow in FP emulation running MSGHurricane

Last word on performance

The missing case
• Floating point application code
• Fixed point framework / middleware
• Estimated by isolating application cycles on Carbonado
  - Time spent in application = 11% of frame time (NULL renderer)
• MSGHurricane
  - Fixed point frame time = 0.0037 sec
  - Floating point frame time = 0.021 sec
  - Mixed-mode frame = (89% * 0.0037) + (11% * 0.021) = 0.011 sec
  - Estimated 88Hz mixed-mode rate
• Within 33 ms budget
• But scale processor back to 150MHz and it becomes too slow again
• And this is just a demo - just splines, no physics, no gameplay
• Floating-point emulation is just too slow for even the simplest case

Conclusions

• The software migration process can be relatively painless
• Source code should be ‘portified’ - i.e. made
  - 3D API agnostic
    - Isolate and encapsulate your 3D API interactions
    - Structure desktop code to be OpenGL ES friendly
  - Floating point agnostic
    - Abstract out your real number format
    - At minimum in middleware layer
    - Ideally allow fixed-point from application down to hardware
• You can do all this from the safety of your workstation
  - No handheld platform debugging until project is mature
  - MSG ported to Carbonado in 2 evenings with just printf
• And if you get it right
  - It will just port and just work - but may require some tuning
  - Performance will be high across platforms
  - Resulting software will be highly portable and reusable