Post on 27-Dec-2021
Gameloft and Intel: Working together to bring High Quality graphics to x86 Android*
By Steve Hughes
Introduction
Most people, including gamers, pigeonhole computing devices as either desktop or mobile and expect
high-end effects on their desktop apps and lower level streamlined effects on their mobile devices. They
usually accept the gulf between the devices and don’t complain. However, when I started looking at the
4th generation Intel® Atom™ processor (codenamed Bay Trail) late last year, I realized that it is no lazy
piece of HW. In fact, I saw the potential to add some significant desktop-style effects to the right game
and with a bit of work produce a real showpiece app to demonstrate its capabilities. After a quick look
around, I decided to work with Gameloft on its racing title GT Racing 2 (GTR2). I already knew the team
at Gameloft, and they’ve always been eager to go the extra mile to optimize performance and make
their games stand out.
In this article I will describe the effects implemented by Gameloft in GTR2 and focus on how we
managed to fit those effects into the 30 frames per second (FPS) budget we had set ourselves. We were
also limited by time since we wanted to show off what we’d achieved at GDC 2014 in San Francisco, and the end of
2013 was already fast approaching.
The effects
The exact effects used in the game were chosen by Gameloft as they knew which effects they most
wanted to include. This is only fair: they know their engine, and we needed to get the effects in quickly
so we could spend time on optimization. Figures 1 and 2 show the before / after images, clearly showing
what we managed to do to enhance the image with the extra CPU and GPU time we had on the x86
device.
Light Shafts
To achieve the light shafts effect, the sun is rendered to a second render target and a radial blur pass is applied,
directed away from the sun’s position. This is carried out on a low resolution render target in
several passes, and the results look like this:
Figure 3. Initial render of the sun to a low resolution render target. The sun here is occluded by opaque scene objects to get the shape seen here.
Figure 4. Next, a set of radial blur passes is applied to get the image on the right. From the original partially obscured sun we now get the glow loosely representing airborne particles colliding with direct sunlight.
Figure 1. The normal appearance of GTR2 on existing ARM* devices. Great models but a bit of a letdown with normal lighting.
Figure 2. GTR2 on Intel® Atom™ processor-based tablets showing enhanced visuals from the bloom and light shaft effects.
Figure 5. The blurred image is then added back in to the original frame and the final effect can be seen here. This effect is applied real-time during the race.
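The light shaft passes described above can be sketched on the CPU as follows. This is a minimal Python illustration of a radial blur toward a light position followed by an additive composite, not Gameloft's shader code. The function names, the sample count, the decay value, and the strength factor are all assumptions made for the example.

```python
def radial_blur(mask, sun_xy, samples=8, decay=0.9):
    """Smear a grayscale sun mask (rows of floats in [0, 1]) away from the sun.

    Each output pixel averages samples taken along the line from that pixel
    toward the sun, so brightness near the sun streaks outward.
    """
    height, width = len(mask), len(mask[0])
    sun_x, sun_y = sun_xy
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, norm, weight = 0.0, 0.0, 1.0
            for i in range(samples):
                t = i / samples  # 0 = this pixel, approaching 1 = the sun
                px = min(width - 1, max(0, int(x + (sun_x - x) * t + 0.5)))
                py = min(height - 1, max(0, int(y + (sun_y - y) * t + 0.5)))
                total += mask[py][px] * weight
                norm += weight
                weight *= decay  # samples nearer the sun count a little less
            out[y][x] = total / norm
    return out


def add_shafts(frame, shafts, strength=0.6):
    """Add the blurred sun image back onto the frame, clamped to 1.0."""
    return [[min(1.0, f + strength * s) for f, s in zip(frow, srow)]
            for frow, srow in zip(frame, shafts)]
```

Running `radial_blur` on a small, low resolution mask keeps the nested loops cheap, which mirrors the reason the passes in the game run on a low resolution render target.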
Bloom
Bloom is a fairly stock effect but easy to get wrong. The object of bloom is to simulate the effect of a
sudden bright light in a scene saturating the image and leaking out into the scene around it.
Bloom is completed in three stages:
1: The original scene is filtered to remove any dark pixels leaving only the bright pixels in the scene. This
image is written to another render target (Figure 6).
Figure 6. Light pixels are extracted from the original image.
2: The new render target containing the light pixels is then blurred. This is to simulate the bright pixels
leaching into the surrounding dark pixels (Figure 7).
Figure 7. Light pixel image is then blurred.
3: Finally the blurred light pixel render target is added to the original scene to produce the bloom effect
(Figure 8).
Figure 8. Blurred light pixel images are added to original scene with some scaling to produce the final bloom image.
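The three bloom stages above can be sketched as a CPU-side reference in Python. This is an illustration only: the article does not say which blur kernel the game uses, so a simple box blur stands in for it, and the names `bright_pass`, `box_blur`, and `bloom` as well as the default threshold and intensity are assumptions.

```python
def bright_pass(scene, threshold):
    """Stage 1: keep only pixels at or above the brightness threshold."""
    return [[p if p >= threshold else 0.0 for p in row] for row in scene]


def box_blur(image, radius=1):
    """Stage 2: average each pixel with its neighbours (a simple box filter)."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total, count = 0.0, 0
            for ny in range(max(0, y - radius), min(height, y + radius + 1)):
                for nx in range(max(0, x - radius), min(width, x + radius + 1)):
                    total += image[ny][nx]
                    count += 1
            out[y][x] = total / count
    return out


def bloom(scene, threshold=0.8, intensity=0.5):
    """Stage 3: add the scaled, blurred bright pixels back onto the scene."""
    glow = box_blur(bright_pass(scene, threshold))
    return [[min(1.0, s + intensity * g) for s, g in zip(srow, grow)]
            for srow, grow in zip(scene, glow)]
```

For a 3x3 grayscale scene with one bright center pixel, the dark corner pixels pick up some of the glow while the center stays clamped at full brightness, which is the "leaking" behaviour the effect is after.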
Depth of field
To achieve depth of field, we start with the game scene and apply a horizontal and vertical Gaussian blur
pass.
Figure 9. The original game scene
After the two stages, we can see that the whole image is now blurred (Figure 10). We now have a
blurred image and the original sharp image, along with a depth buffer from the original render pass.
The next step is to select a depth value which will be our focal point - such as the center of the car. For
each pixel on the screen, we blend the blurred image and the sharp image based on the difference
between the depth of the current pixel and the focal point depth value. Pixels farther away in depth
from the focal point will have greater contribution from the blurred image, while pixels with a depth
value close to the focal point will have a greater contribution from the sharp image.
Figure 10. Blurred out of focus copy of game scene
Figure 11. Depth of field in action on Bay Trail. We left this effect to the menu and other non-game screens, because accurate distance vision is important in racing.
The net result (Figure 11) is a fairly good approximation to depth of field images such as you would get
from a camera.
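The per-pixel blend described in this section can be written down as a short Python sketch. The linear falloff and the `focus_range` parameter are assumptions for illustration; the article only states that the blend depends on the difference between the pixel depth and the focal depth.

```python
def dof_blend(sharp, blurred, depth, focal_depth, focus_range):
    """Blend sharp and blurred images per pixel based on distance from focus.

    sharp, blurred, depth: same-sized lists of rows of floats.
    focus_range: depth difference at which a pixel becomes fully blurred
    (hypothetical parameter; must be greater than zero).
    """
    out = []
    for srow, brow, drow in zip(sharp, blurred, depth):
        row = []
        for s, b, d in zip(srow, brow, drow):
            # 0.0 at the focal point, 1.0 once the pixel is far out of focus.
            t = min(1.0, abs(d - focal_depth) / focus_range)
            row.append(s * (1.0 - t) + b * t)
        out.append(row)
    return out
```

With a focal depth of 5.0 and a focus range of 5.0, a pixel at depth 5.0 is fully sharp, one at depth 7.5 is an even mix, and one at depth 20.0 is fully blurred.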
Heat Haze
Figure 12. Heat haze was reserved for the start grid, where it gave a realistic heat feel to the cars before the race start.
Heat haze effects try to simulate the air shimmer you see rising from heated objects in sunlight (Figure
12). The effect is created by applying an animating distortion effect to the original color buffer. To
confine the effect to the region around the car the effect is masked by an alpha channel image (Figure
13).
Figure 13. Heat haze mask generated from the camera viewpoint.
The effect was confined to the starting grid because accurate distance vision is essential to successful
racing.
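As a rough illustration of the masked distortion described above, the following Python sketch shifts each row of a color buffer sideways by an animated sine offset and uses the mask value as the blend weight. The sine-based distortion and the `amplitude`, `frequency`, and `speed` parameters are assumptions; the article does not describe the exact distortion function used in the game.

```python
import math


def heat_haze(color, mask, time, amplitude=1.0, frequency=0.8, speed=1.0):
    """Apply an animated horizontal distortion, confined by an alpha mask.

    color: rows of floats (one channel); mask: rows of floats in [0, 1].
    """
    height, width = len(color), len(color[0])
    out = []
    for y in range(height):
        # One horizontal offset per row, animated over time.
        offset = amplitude * math.sin(frequency * y + speed * time)
        row = []
        for x in range(width):
            sx = min(width - 1, max(0, math.floor(x + offset + 0.5)))
            a = mask[y][x]
            # Mask 0 keeps the original pixel, mask 1 uses the distorted one.
            row.append(color[y][x] * (1.0 - a) + color[y][sx] * a)
        out.append(row)
    return out
```

Where the mask is zero the frame is returned unchanged, which is how the effect stays confined to the region around the car.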
Getting started on Optimization
Developers often view game optimization as a path of diminishing returns. By that I mean a lot of work
generally goes in to optimizing a game to an average frame time of 33ms for 30FPS, but generally there
is no point optimizing a game past 30FPS because that is the rate at which it will be expected to run.
However, on mobile devices this is not true. In all we had about 12ms worth of effects to add that would
have increased the frame time to 45ms (nearer 22FPS). This meant we had to remove 12ms from an
already optimized game to achieve our final target frame rate with all the effects turned on.
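The budget arithmetic above is easy to check with a few lines of Python. The function names are made up for this example; the numbers are the ones quoted in the text.

```python
def fps(frame_ms):
    """Convert an average frame time in milliseconds to frames per second."""
    return 1000.0 / frame_ms


def required_savings(base_ms, effects_ms, target_ms):
    """Milliseconds that must be removed to fit the effects into the target."""
    return max(0.0, base_ms + effects_ms - target_ms)


base_ms = 33.0      # frame time of the already optimized game
effects_ms = 12.0   # estimated cost of the new effects

print(round(fps(base_ms), 1))                       # about 30 FPS
print(round(fps(base_ms + effects_ms), 1))          # about 22 FPS
print(required_savings(base_ms, effects_ms, 33.0))  # ms to claw back
```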
The place to start in any optimization process is to look at whether the game is GPU or CPU bound. That
is, determine if GPU or CPU code needs optimizing to improve frame time. Using System Analyzer, part of
Intel Graphics Performance Analyzers (GPA), we captured data for the following graph:
Figure 14. GPU Busy hovers around 90-100% for most of the race, while the CPU averages around 25%. It’s fairly clear that the
app is GPU bound, which is reasonable for a racing game.
It’s pretty easy to make this graph. Simply add the metrics you want to System Analyzer, then hit the
“CSV” button to dump out the metrics you want to a csv file. You can then load them in to Excel* or
other graphing software.
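If you would rather script the analysis than use a spreadsheet, a sketch like the following averages each metric in the exported file. The column names `gpu_busy` and `cpu_load` are hypothetical; the real header names depend on the metrics you selected in System Analyzer, and comparing two averages is only a crude first indication of where the bottleneck is.

```python
import csv
import io


def average_metrics(csv_text):
    """Return the mean of every numeric column in a CSV with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    totals, count = {}, 0
    for row in reader:
        count += 1
        for name, value in row.items():
            totals[name] = totals.get(name, 0.0) + float(value)
    if count == 0:
        return {}
    return {name: total / count for name, total in totals.items()}


def likely_bottleneck(averages, gpu_key, cpu_key):
    """Crude heuristic: whichever unit is busier on average."""
    return "GPU" if averages[gpu_key] > averages[cpu_key] else "CPU"
```

Given a file whose GPU column averages around 95 and whose CPU column averages around 25, `likely_bottleneck` returns `"GPU"`, matching the reading of the graph in Figure 14.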
A lot of developers don’t know that GPA works great on x86 mobile devices. It’s a great set of tools and
well worth looking at.
Drilling down on a frame
We captured a number of frames from the game before the effects were added using System Analyzer
then opened them with Frame Analyzer to see what low hanging fruit we could find. Figure 15 shows a
frame of the game before the effects were added that I used a lot in the early stages:
Figure 15. Frame is split in to two halves. Some big GPU events occur in the last half.
First and most obvious are the two calls to glClear() in the second half of the frame highlighted in purple
in the frame graph. This is an issue I often find in engines - render targets tend to be cleared first even
though they are going to be fully written to later. Removing these was an easy fix that gave us about
5ms, getting us well on the way.
The big blue bar in between the two glClear() calls is an interesting event. We had been experimenting
with the screen size of earlier development kits that Gameloft received, and with very large screens
(2560x1900) it was more efficient for them to render to a lower res back buffer then upsample to the
full size screen. The event in question is the upsample from the back buffer to the screen. This is a huge
event and needed some scrutiny. What I found here was that most of the time the EUs on the GPU were
stalled waiting for the texture sampler on this erg. This made sense actually because the fragment
shader was very simple and the size of the texture being copied was huge (>8MB), so naturally the
shader would spend a lot of time waiting for the data it needed in order to complete. This led me to
think that we could probably render to a full-size render target and get rid of the upsample. The net
result was not a performance improvement because what time we gained was used up by rendering to
the larger target. What we did gain was a fair bit of visual improvement.
The last thing of note in this frame was the 4 big ergs labelled A, B, C, and D. You may have noticed that
my approach here is to look at all the big ergs and see what can be done to remove or reduce them.
That’s the best way to get started with Graphics Performance Analyzers. In cases like these 4 ergs, we
could do very little: these are the 4 cars in view in the frame. This is a racing game so it is only right to
devote a fair amount of rendering time to the cars.
Platform Analyzer Investigation
One place we looked for performance was Platform Analyzer, which is a relatively new tool in GPA. With
Platform Analyzer you can look at the CPU / GPU holistically and see how the queues are managed on
the GPU (Figure 16). I was startled to see that we had a problem with the driver that was hurting us:
Figure 16. From Platform Analyzer. Horizontal scale is time; the stacked chart at the top shows queue depth on the GPU.
At first it looks like the GPU queues are always full and everything is fine. However, looking closely at the
marked points we noticed that about every 10 frames an event occurred that stalled the whole process
and drained out the queue, grinding almost to a halt before starting up again. We spent some time
looking for a periodical draw call that had some kind of dependency, but it was hard to know what to
look for.
This one turned out to be an Intel graphics driver issue. As often happens in prerelease HW, the drivers
were still being worked on. This turned out to be a stall, which had been fixed a few weeks before but
we hadn’t updated the driver because we were otherwise happy with the driver we had. We’re not sure
of the actual improvement we got from a frames-per-second perspective, but we did get a much
smoother frame rate as a result of the driver fix.
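A periodic stall like the one described above can also show up in plain frame-time logs. The following Python sketch flags frames that take much longer than the median and reports the spacing between them. It is not how Platform Analyzer works internally; the function names and the 1.5x threshold are assumptions for illustration.

```python
import statistics


def find_stalls(frame_ms, factor=1.5):
    """Return indices of frames slower than factor times the median."""
    median = statistics.median(frame_ms)
    return [i for i, t in enumerate(frame_ms) if t > factor * median]


def stall_intervals(stalls):
    """Return the number of frames between consecutive stalls."""
    return [b - a for a, b in zip(stalls, stalls[1:])]
```

A constant interval between stalls, such as one every 10 frames, points to something periodic rather than to a heavy scene, which is the kind of pattern that led us to the driver.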
Drilling down into the effects
At this point we had gained about 5ms frame time and improved the visual quality. We still needed to
find about another 7ms so we decided to look at the effects themselves. We weren’t going to skimp on
visual quality, but since they were all new we thought there might be some performance gains to find.
Figure 17. Frame Analyzer capture with bloom and light-shaft effects added. Note that glClears() are gone, so predictably there
is a lot more time spent in the second half of the frame, where all the post processing for the effects takes place.
Looking at this frame (Figure 17) we were drawn to the ergs labelled B and C, which turned out to be
blur and bright passes for the bloom effect. These were consuming 3-4ms each which we figured looked
a little high. After investigation, Gameloft made two changes to the effect that
resulted in a significant performance increase.
Firstly, they found that the blur stages were being executed on a full screen texture. This was reduced
to a quarter of the screen size, and as a result the blur almost dropped off the Frame Analyzer display
altogether.
Secondly, the bright pass render target was in full HD. Gameloft found that this could be safely reduced
in size to about half screen without visible changes, gaining another significant increase in
performance.
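To see why shrinking the render targets pays off so quickly, it helps to count pixels. The sketch below assumes a full HD target of 1920x1080 and treats the scale as a per-axis factor; the article does not say whether "quarter" and "half" refer to each axis or to the total area, so both readings are shown.

```python
def pixel_count(width, height, axis_scale=1.0):
    """Number of pixels in a render target scaled by axis_scale on each axis."""
    return int(width * axis_scale) * int(height * axis_scale)


full = pixel_count(1920, 1080)           # full HD target
half_axis = pixel_count(1920, 1080, 0.5)     # half per axis = quarter of the area
quarter_axis = pixel_count(1920, 1080, 0.25)  # quarter per axis = 1/16 of the area

print(full, half_axis, quarter_axis)
print(full // half_axis, full // quarter_axis)  # reduction factors
```

Because a full screen pass touches every pixel of its target, the fragment shader work for the blur and bright passes falls roughly in proportion to these counts.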
After the bloom render targets had been optimized and we had gained some performance, we started
to look more closely at the bloom itself. The general consensus was that the bloom looked a little
washed out (Figure 18), so after verifying that the blur and bright pass textures looked ok, we took a
look at the shaders.
The math in the bloom shader looked a little complex, as compared to a typical bloom shader like this
fragment:
lowp vec4 bloom = texture2D(blur, vCoord0) * 1.5 - threshold;
gl_FragColor = bloom * bloomFactor;
As an experiment I used a little known feature of GPA Frame Analyzer where you can modify shaders in
a captured frame and recompile them to see the difference in appearance, performance, etc. It didn’t
take long to invent a shader that did a simple bloom within the confines of the frame (you could change
the source, but you couldn’t touch inputs or constants in GPA at the time).
The shader ran a tiny bit faster than the original shader, but the significant contribution from the shader
changes was the visual quality. As a result, a new shader was created for the bloom pass which made
the effect significantly better. Compare figure 18 with figure 19 to see the difference we saw.
Figure 18. Bloom effect showing the “washed out feel” of the shadows and the rocks on the left.
Figure 19. New bloom, which looks almost HDR compared to the old one.
Conclusions
The aim of this project was to take a game already optimized to 30FPS and optimize it further to gain
enough ms per frame to allow room for about 12ms of effects to be added. We managed to pull about
5ms from the game itself and save another 7ms from the effects themselves and as a result of driver
fixes. We managed to prove that modern mobile devices like Bay Trail are capable of executing effects
that previously were reserved for consoles and desktop GPUs. None of what we did would have been
possible without GPA, and without a great working relationship with Gameloft.
About Gameloft
A leading global publisher of digital and social games, Gameloft® has established itself as one of the top
innovators in its field since 2000. Gameloft creates games for all digital platforms, including feature
phones, smartphones, tablets, set-top boxes and connected TVs. Gameloft operates its own established
franchises such as Asphalt®, Real Football®, Modern Combat and Order & Chaos®, and also partners
with major rights holders including Marvel®, Hasbro®, FOX®, Mattel® and Disney®. Gameloft is present
on all continents, distributes its games in over 120 countries and employs over 5,200 developers.
For more information, consult http://www.gameloft.com.
About the Author
Steve is a senior application engineer at Intel, providing technical support to game developers in the areas of 3D graphics enabling and multi-threading solutions on PC and mobile devices. Steve has 14 years of experience as a programmer in the gaming industry where he worked on 11 titles, went through 2 bankruptcies, and generally had a good time. Steve is a keen gamer, writes and plays music, and isn’t a writer!
Notices
INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS
OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS
DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL
ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO
SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A
PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER
INTELLECTUAL PROPERTY RIGHT.
UNLESS OTHERWISE AGREED IN WRITING BY INTEL, THE INTEL PRODUCTS ARE NOT DESIGNED NOR INTENDED FOR
ANY APPLICATION IN WHICH THE FAILURE OF THE INTEL PRODUCT COULD CREATE A SITUATION WHERE PERSONAL
INJURY OR DEATH MAY OCCUR.
Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not
rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel
reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities
arising from future changes to them. The information here is subject to change without notice. Do not finalize a
design with this information.
The products described in this document may contain design defects or errors known as errata which may cause
the product to deviate from published specifications. Current characterized errata are available on request.
Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your
product order.
Copies of documents which have an order number and are referenced in this document, or other Intel literature,
may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm
Software and workloads used in performance tests may have been optimized for performance only on Intel
microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer
systems, components, software, operations, and functions. Any change to any of those factors may cause the
results to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products.
Any software source code reprinted in this document is furnished under a software license and may only be used
or copied in accordance with the terms of that license.
Intel, the Intel logo, and Intel Atom are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2014 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.