Posted 15-Jan-2016
A Single (Unified) Shader GPU Microarchitecture for Embedded Systems
Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
Department of Computer Architecture, UPC
Roger Espasa
Intel DEG Barcelona
Introduction
- Graphics, and specifically 3D graphics, have become an important element in current PDAs, mobile phones and other handheld systems
  - OpenGL ES: a simplified OpenGL specification for embedded systems
- The classic GPU architecture for the PC is not suited for embedded systems, which require:
  - Low power
  - A low area budget
- We propose a single unified shader GPU architecture for embedded systems
Outline
- ATTILA PC
- ATTILA Embedded
- Triangle Setup in the Shader Unit
- ATTILA Simulation Framework
- Results
Attila Classic for PCs
- Optimized for large resolutions
  - Above 1024x768
- Optimized for high performance
- High power requirements
  - No power optimizations
  - 100+ watts on current high-end GPUs
- Large area budget
  - 300+ million transistors on current high-end GPUs
- Large dedicated memory bandwidth
  - 40+ GB/s on current high-end GPUs
- Specialized shader units
  - 2 to 8 vertex shader units
  - 1 to 6 fragment shader units
Attila PC
[Figure: Attila PC pipeline with specialized shaders. Blocks: Vertex Fetch, two Vertex Shaders, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, two Fragment Shaders (four fragments processed in parallel), two ROPs, two Memory Controllers.]
Embedded Requirements
- Optimized for small resolutions
  - 320x240 to 640x480
- Optimized for low power
  - Reduce frequency
  - Power optimizations
  - Improve efficiency
- Small area budget
  - Remove non-crucial hardware
- Low available bandwidth
- Reduced shading power
- Reduce design complexity
Attila Embedded
- No Hierarchical Z
- No Z compression
- Single unified shader
  - 1 SIMD ALU
  - Multithreaded: 16 threads of four vertex/triangle/fragment elements
  - 16 128-bit registers for temporary storage available per thread
- Texture unit outputs 1 bilinear sample for a whole fragment quad every 4 cycles
  - 4 KB texture cache
- ROP
  - One Z value and one color value updated per cycle in the framebuffer (a fragment quad every 4 cycles)
- Single 64-bit DDR channel
  - Limited by the current simulator implementation
  - Assimilated to a small (1 MB) embedded DRAM
- 32-bit high-latency bus to large system memory for textures
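A single SIMD ALU stays busy only if some thread is ready to issue while others wait on long-latency texture fetches, which is the point of running 16 threads. The Python sketch below illustrates that latency-hiding argument with a toy round-robin issue model. The function name, the fixed "one texture fetch every N instructions" program shape and the latency value are hypothetical; they are not ATTILA's actual scheduler or timings.

```python
from typing import List


def alu_utilization(num_threads: int, instrs_per_thread: int,
                    texture_every: int, texture_latency: int) -> float:
    """Fraction of cycles in which a single SIMD ALU issues an instruction.

    Toy model (hypothetical): every thread runs the same straight-line program;
    each `texture_every`-th instruction is a texture fetch that blocks its thread
    for `texture_latency` extra cycles. Each cycle the ALU issues one instruction
    from the next ready thread in round-robin order.
    """
    pc: List[int] = [0] * num_threads        # instructions issued so far, per thread
    ready_at: List[int] = [0] * num_threads  # first cycle each thread may issue again
    cycle = busy = next_rr = 0
    while any(p < instrs_per_thread for p in pc):
        for offset in range(num_threads):
            t = (next_rr + offset) % num_threads
            if pc[t] < instrs_per_thread and ready_at[t] <= cycle:
                pc[t] += 1
                busy += 1
                # A texture fetch stalls this thread; ALU ops only take one cycle.
                stall = texture_latency if pc[t] % texture_every == 0 else 0
                ready_at[t] = cycle + 1 + stall
                next_rr = t + 1
                break
        cycle += 1  # the cycle passes whether or not an instruction issued
    return busy / cycle


for threads in (1, 4, 16):
    print(threads, "thread(s):", round(alu_utilization(threads, 8, 4, 10), 2))
```

With these illustrative parameters one thread keeps the ALU busy for 8 of 18 cycles (about 44%), while 16 threads issue an instruction every cycle.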
Attila Embedded
[Figure: Attila Embedded pipeline with a single unified shader and a single-fragment-per-cycle pipeline. Blocks: Vertex Fetch, Primitive Assembly, Clipping, Rasterization, Scheduler and Distributor feeding vertices, triangles and fragments to the Shader, ROP, Memory Controller.]
Triangle Setup in the Shader
- 2D homogeneous rasterization (Olano & Greer)
- Triangle setup algorithm:
  - Calculate the setup matrix from the triangle vertex matrix
  - Calculate the interpolation equation for fragment Z
  - Cull triangles based on their facing direction (sign of the area)
- The algorithm is suited for a SIMD implementation in the unified shader
  - Inputs: four 3-component vectors for the triangle vertex positions
  - Outputs: three 4-component vectors with the triangle edge and Z interpolation equation coefficients, and one signed triangle area register for the face culling stage
- 26-instruction triangle shader program
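The setup step can be sketched in Python following the 2D homogeneous formulation the slide cites (Olano & Greer). Assumptions (ours): each vertex arrives as (x, y, z, w) in homogeneous screen space (viewport-scaled, not yet divided by w), all w > 0, and counter-clockwise triangles in a y-up frame are front-facing. The edge equations are taken from the adjugate of the vertex matrix, which avoids a division; this is a sketch of the technique, not ATTILA's 26-instruction shader program.

```python
from typing import NamedTuple, Optional, Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


class Setup(NamedTuple):
    edges: Tuple[Vec3, Vec3, Vec3]  # (a, b, c) per edge: E_i(x, y) = a*x + b*y + c
    z_plane: Vec3                   # z_plane . (x, y, 1) gives z/w at pixel (x, y)
    signed_area: float              # det of the vertex matrix = 2 * area * w0*w1*w2


def setup_triangle(v0, v1, v2) -> Optional[Setup]:
    """Each vertex is (x, y, z, w); w > 0 is assumed for every vertex."""
    rows = [(v[0], v[1], v[3]) for v in (v0, v1, v2)]  # vertex matrix M, one row per vertex
    # Columns of adj(M): edge i is the cross product of the other two rows,
    # so edge_i . row_j == det(M) when i == j and 0 otherwise.
    edges = (cross(rows[1], rows[2]), cross(rows[2], rows[0]), cross(rows[0], rows[1]))
    det = dot(rows[0], edges[0])
    if det == 0:
        return None  # zero-area triangle: nothing to rasterize
    zs = (v0[2], v1[2], v2[2])
    # z_plane = [z0 z1 z2] * M^-1, i.e. the edge functions weighted by vertex z.
    z_plane = tuple(sum(zs[i] * edges[i][k] for i in range(3)) / det for k in range(3))
    return Setup(edges, z_plane, det)


def is_back_facing(s: Setup) -> bool:
    # Assumed convention: counter-clockwise (positive area, y up) is front-facing.
    return s.signed_area < 0


def covers(s: Setup, px: float, py: float) -> bool:
    # Inside when every edge function has the same sign as the area.
    sign = 1.0 if s.signed_area > 0 else -1.0
    return all(sign * dot(e, (px, py, 1.0)) >= 0 for e in s.edges)
```

For a w = 1 right triangle with corners (0, 0), (4, 0), (0, 4) the determinant is 16, twice the area, and its positive sign marks the triangle as front-facing under the assumed convention.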
Triangle Setup in the Shader
- Benefits
  - Reduced area: no specialized hardware required for triangle setup
  - Reduced design complexity
  - Improved efficiency
    - The graphics workload in embedded applications may not fully utilize specialized triangle setup hardware in most cases
    - Higher utilization of the shader
- Costs
  - Shader workload increases
  - Rerouting of the rasterization pipeline is required
[Figure: ATTILA simulation framework in four phases: Collect, Verify, Simulate, Analyze. Blocks: OpenGL Application, GLInterceptor, Trace, GLPlayer, Vendor OpenGL Driver running on ATI R520/NVidia G70, ATTILA OpenGL Driver, ATTILA Simulator, the framebuffers produced by the vendor hardware and by the simulator (marked CHECK!), Statistics, Signal Traffic, Signal Visualizer.]
GLInterceptor
- Captures a trace of OpenGL API calls from a real game
GLPlayer
- Reproduces the captured trace
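The capture and replay steps can be illustrated with a small Python sketch: a wrapper records each API call and forwards it, the trace is serialized, and a player re-issues the calls against a different backend (in ATTILA's case, the ATTILA OpenGL driver and simulator instead of the vendor driver). The class and function names and the JSON trace format are hypothetical; this is not GLInterceptor's real interface or file format.

```python
import json
from typing import Any, Callable, Dict, List


class TraceRecorder:
    """Hypothetical GLInterceptor-style wrapper: logs each API call, then forwards it."""

    def __init__(self, backend: Any) -> None:
        self._backend = backend
        self.trace: List[Dict[str, Any]] = []

    def __getattr__(self, name: str) -> Callable[..., Any]:
        # Only called for names not defined on the recorder, i.e. API entry points.
        target = getattr(self._backend, name)

        def wrapper(*args: Any) -> Any:
            self.trace.append({"call": name, "args": list(args)})
            return target(*args)

        return wrapper

    def dump(self) -> str:
        return json.dumps(self.trace)


def replay(trace_json: str, backend: Any) -> int:
    """Hypothetical GLPlayer-style loop: re-issue every recorded call on another backend."""
    calls = json.loads(trace_json)
    for entry in calls:
        getattr(backend, entry["call"])(*entry["args"])
    return len(calls)


class RecordingBackend:
    """Toy stand-in for a driver; it only remembers what it was asked to do."""

    def __init__(self) -> None:
        self.calls: List[tuple] = []

    def glClearColor(self, r: float, g: float, b: float, a: float) -> None:
        self.calls.append(("glClearColor", r, g, b, a))

    def glClear(self, mask: int) -> None:
        self.calls.append(("glClear", mask))


# Usage: capture on one backend, replay on another, then compare.
vendor = RecordingBackend()
gl = TraceRecorder(vendor)
gl.glClearColor(0.0, 0.0, 0.0, 1.0)
gl.glClear(0x4000)  # GL_COLOR_BUFFER_BIT
simulated = RecordingBackend()
count = replay(gl.dump(), simulated)
```

Comparing what the two backends received mirrors, in miniature, the framebuffer check between the vendor driver and the simulator.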
OpenGL Library
- Transforms the fixed-function API into shader code
- 200 API calls supported
- ARB vertex and fragment program extensions
- Alpha test and fog emulated via shader code
Driver
- Low-level interface to the GPU hardware
- Attila memory management
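As an illustration of emulating fixed-function alpha and fog with shader code, the Python sketch below emits an ARB_fragment_program that implements a GL_GEQUAL-style alpha test with KIL and linear fog with LRP. The generator function and the choice of program.env[0] to hold the fog parameters are our assumptions; ATTILA's library may generate different code.

```python
from typing import List, Optional


def fixed_function_fragment_program(alpha_ref: Optional[float] = None,
                                    linear_fog: bool = False) -> str:
    """Hypothetical generator for an ARB fragment program emulating alpha test and fog."""
    decls: List[str] = ["!!ARBfp1.0", "TEMP col, t;"]
    body: List[str] = ["MOV col, fragment.color;"]
    if alpha_ref is not None:
        a = float(alpha_ref)
        decls.append(f"PARAM alphaRef = {{{a}, {a}, {a}, {a}}};")
        # KIL discards the fragment when a source component is negative,
        # so alpha - ref < 0 (alpha below the reference) fails the test.
        body.append("SUB t.x, col.w, alphaRef.x;")
        body.append("KIL t.x;")
    if linear_fog:
        # Assumed layout: x = fog end, y = 1 / (fog end - fog start).
        decls.append("PARAM fogParams = program.env[0];")
        # f = clamp((end - c) / (end - start)); rgb = f * color + (1 - f) * fog color.
        body.append("SUB t.y, fogParams.x, fragment.fogcoord.x;")
        body.append("MUL_SAT t.y, t.y, fogParams.y;")
        body.append("LRP col.xyz, t.y, col, state.fog.color;")
    body.append("MOV result.color, col;")
    return "\n".join(decls + body + ["END"]) + "\n"


print(fixed_function_fragment_program(alpha_ref=0.5, linear_fog=True))
```

Because the blend uses a .xyz write mask, fog changes only RGB and alpha passes through unchanged, as with fixed-function fog.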
ATTILA Simulator
- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-stage-deep pipeline
- Execute@Execute: functionality embedded at each pipeline stage
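The box-and-signal structure of a cycle-by-cycle simulator can be sketched in Python: each box is clocked once per cycle, boxes communicate only through signals with a latency of at least one cycle, and each box computes its output when it executes (the Execute@Execute idea). The Signal, Source, Doubler and Sink classes are hypothetical toys, not ATTILA's actual classes.

```python
from collections import deque
from typing import Any, Deque, List, Optional, Tuple


class Signal:
    """Hypothetical wire between two boxes with a fixed latency in cycles."""

    def __init__(self, latency: int = 1) -> None:
        if latency < 1:
            raise ValueError("latency must be at least one cycle")
        self.latency = latency
        self._in_flight: Deque[Tuple[int, Any]] = deque()  # (arrival cycle, payload)

    def write(self, cycle: int, payload: Any) -> None:
        self._in_flight.append((cycle + self.latency, payload))

    def read(self, cycle: int) -> Optional[Any]:
        # Return the oldest payload whose arrival cycle has been reached, if any.
        if self._in_flight and self._in_flight[0][0] <= cycle:
            return self._in_flight.popleft()[1]
        return None


class Source:
    """Box that emits one input item per cycle."""

    def __init__(self, out: Signal, items: List[Any]) -> None:
        self.out, self.items = out, deque(items)

    def clock(self, cycle: int) -> None:
        if self.items:
            self.out.write(cycle, self.items.popleft())


class Doubler:
    """Box that computes its result at the moment it executes."""

    def __init__(self, inp: Signal, out: Signal) -> None:
        self.inp, self.out = inp, out

    def clock(self, cycle: int) -> None:
        value = self.inp.read(cycle)
        if value is not None:
            self.out.write(cycle, 2 * value)


class Sink:
    """Box that records (arrival cycle, value) pairs."""

    def __init__(self, inp: Signal) -> None:
        self.inp, self.received = inp, []

    def clock(self, cycle: int) -> None:
        value = self.inp.read(cycle)
        if value is not None:
            self.received.append((cycle, value))


def simulate(boxes, cycles: int) -> None:
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)
```

Because every signal has a latency of at least one cycle, a value written in cycle c is never visible before cycle c + 1, so the result does not depend on the order in which boxes are clocked within a cycle.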
Spot the differences
[Figure: side-by-side renderings labeled Attila and NVidia GeForce FX 5900XT.]
Benchmark
- Unreal Tournament 2004: NOT AN EMBEDDED BENCHMARK
  - Up to 300K vertices per frame!
  - Fixed-function OpenGL API
  - Vertex and fragment shaders generated by our library
- 320x240 resolution
- 140 of 450 frames simulated
- 100+ frames take about 1 day of simulation on a Xeon P4 @ 2.0 GHz
Configurations
- We have evaluated:
  - 3 mid-range to low-end PC GPU configurations
  - 2 configurations for GPUs integrated on chipsets and high-end PDA GPUs
  - 4 low-end embedded GPU configurations
- We tried to keep a balance between memory bandwidth and shading computing power:
  - From 4 vertex shader units to none
  - From 2 quad fragment shader units to a single unified shader unit
  - From four 64-bit DDR memory channels to one
  - Store the framebuffer in a small (1 MB) GPU memory and textures in system memory
- Halved the frequency for embedded systems
  - Restricted design rules
  - Reduced power consumption
- Removed all optional features at the low end
  - Hierarchical Z
  - Z compression
  - Specialized triangle setup hardware
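A quick arithmetic check of why a 1 MB on-chip memory can hold the framebuffer at the evaluated 320x240 resolution. The slides do not give the buffer formats, so the sketch assumes double-buffered 32-bit RGBA color plus one 32-bit depth/stencil buffer, and reads 1 MB as 2^20 bytes.

```python
EDRAM_BYTES = 1 << 20  # "1 MB" read as 1 MiB (assumption)


def framebuffer_bytes(width: int, height: int, color_bytes: int = 4,
                      depth_stencil_bytes: int = 4, color_buffers: int = 2) -> int:
    """Bytes for `color_buffers` color buffers plus one depth/stencil buffer (assumed layout)."""
    return width * height * (color_bytes * color_buffers + depth_stencil_bytes)


for w, h in [(320, 240), (640, 480)]:
    size = framebuffer_bytes(w, h)
    print(f"{w}x{h}: {size / 1024:.0f} KiB, fits in eDRAM: {size <= EDRAM_BYTES}")
```

Under those assumptions 320x240 needs 900 KiB and fits, while 640x480 would need 3,600 KiB.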
Evaluated Configurations

| Conf | Res | MHz | Vertex shaders | Fragment/unified shaders | Fetch way | Regs/Thread | Setup | Buses | Cache | eDRAM | Hierarchical Z | Z compression |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| A | 1024 | 400 | 4 | 2x4 | 2 | 16x32 | Fixed | 4 | 16 KB | - | Y | Y |
| B | 320 | 400 | 4 | 2x4 | 2 | 16x32 | Fixed | 4 | 16 KB | - | Y | Y |
| C | 320 | 400 | 2 | 1x4 | 2 | 16x32 | Fixed | 2 | 16 KB | - | Y | Y |
| D | 320 | 400 | 2 | 1x4 | 2 | 16x32 | Fixed | 2 | 8 KB | - | N | Y |
| E | 320 | 200 | - | 1x2 | 2 | 8x32 | Fixed | 1 | 8 KB | - | N | Y |
| F | 320 | 200 | - | 1x2 | 2 | 8x32 | Fixed | 1 | 4 KB | - | N | N |
| G | 320 | 200 | - | 1x1 | 2 | 16x16 | Fixed | 1 | 4 KB | - | N | N |
| H | 320 | 200 | - | 1x1 | 1 | 16x16 | Fixed | 1 | 4 KB | - | N | N |
| I | 320 | 200 | - | 1x1 | 1 | 16x16 | Shader | 1 | 4 KB | - | N | N |
| J | 320 | 200 | - | 1x1 | 1 | 16x16 | Shader | 1 | 4 KB | 1 MB | N | N |
| K | 320 | 200 | - | 1x1 | 1 | 16x16 | Shader | 1 | 4 KB | 1 MB | Y | Y |
Configuration Comparison

| Configuration | A | B | C | D | E | F | G | H | I | J | K |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BW (GB/s) | 23.8 | 23.8 | 11.9 | 11.9 | 2.98 | 2.98 | 2.98 | 2.98 | 2.98 | 4.47 | 4.47 |
| GFlops | 76.8 | 76.8 | 38.4 | 38.4 | 6.4 | 6.4 | 3.2 | 1.6 | 1.6 | 1.6 | 1.6 |
| Caches (KB) | 96 | 96 | 48 | 24 | 24 | 12 | 12 | 12 | 12 | 12 | 12 |
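The BW and GFlops values can be reproduced from the configuration table under a few assumptions of ours that the slides do not state: each shader ALU is 4 wide and counts a multiply-add as 2 flops, "Fetch way" is the number of instructions issued per cycle, memory runs at the GPU clock with DDR transferring twice per clock, configurations J and K add a 32-bit DDR system-memory bus to the 64-bit channel, and GB/s means 2^30 bytes per second. With those assumptions the sketch below matches every value to the precision shown.

```python
# name: (vertex shader units, fragment/unified shader units, ALUs per unit,
#        fetch way, MHz, 64-bit DDR buses, extra 32-bit texture bus?)
CONFIGS = {
    "A": (4, 2, 4, 2, 400, 4, False), "B": (4, 2, 4, 2, 400, 4, False),
    "C": (2, 1, 4, 2, 400, 2, False), "D": (2, 1, 4, 2, 400, 2, False),
    "E": (0, 1, 2, 2, 200, 1, False), "F": (0, 1, 2, 2, 200, 1, False),
    "G": (0, 1, 1, 2, 200, 1, False), "H": (0, 1, 1, 1, 200, 1, False),
    "I": (0, 1, 1, 1, 200, 1, False), "J": (0, 1, 1, 1, 200, 1, True),
    "K": (0, 1, 1, 1, 200, 1, True),
}


def peak_gflops(vsh: int, units: int, width: int, fetch_way: int, mhz: int) -> float:
    alus = vsh + units * width                      # 4-component vector ALUs (assumed)
    return alus * fetch_way * 4 * 2 * mhz / 1000    # 4 lanes x 2 flops per MAD


def peak_bandwidth(buses: int, mhz: int, texture_bus: bool) -> float:
    bytes_per_s = buses * 8 * 2 * mhz * 1e6         # 64-bit DDR: 8 bytes, 2 transfers per clock
    if texture_bus:
        bytes_per_s += 4 * 2 * mhz * 1e6            # 32-bit DDR system bus (assumed)
    return bytes_per_s / 2**30                      # "GB/s" read as GiB/s


def rates(name: str):
    vsh, units, width, fetch, mhz, buses, tex = CONFIGS[name]
    return peak_gflops(vsh, units, width, fetch, mhz), peak_bandwidth(buses, mhz, tex)


for name in CONFIGS:
    gflops, bw = rates(name)
    print(f"{name}: {gflops:.1f} GFlops, {bw:.2f} GB/s")
```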
Performance
- Average of 20 frames per second at 320x240 for the lower-end single-shader configurations

| Configuration | A | B | C | D | E | F | G | H | I | J | K |
|---|---|---|---|---|---|---|---|---|---|---|---|
| FPS | 80.2 | 339 | 209 | 202 | 61.4 | 60.1 | 33.6 | 24.2 | 20.2 | 20.2 | 20.5 |
Efficiency
- The limiting factor for PC and high-end embedded configurations is memory bandwidth
  - Shaders are underutilized for the evaluated benchmark
- The limiting factor for low-end configurations is shader processing
  - Memory bandwidth could be reduced further
  - Caches seem oversized for the low-end embedded configurations
[Figure: FPS per GFlop, FPS per BW and FPS per cache KB for configurations A to K.]
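Assuming the efficiency chart plots plain ratios of frame rate to each resource (the slide does not say how the bars are computed), they can be recomputed from the BW, GFlops, cache and FPS values reported in the results slides. Configuration A runs at a higher resolution (1024 in the table), so its ratios are not directly comparable.

```python
# Values reported on the Configuration Comparison and Performance slides.
NAMES = "ABCDEFGHIJK"
FPS = [80.2, 339, 209, 202, 61.4, 60.1, 33.6, 24.2, 20.2, 20.2, 20.5]
GFLOPS = [76.8, 76.8, 38.4, 38.4, 6.4, 6.4, 3.2, 1.6, 1.6, 1.6, 1.6]
BW = [23.8, 23.8, 11.9, 11.9, 2.98, 2.98, 2.98, 2.98, 2.98, 4.47, 4.47]
CACHE_KB = [96, 96, 48, 24, 24, 12, 12, 12, 12, 12, 12]


def efficiency():
    """Frames per second per unit of each resource, keyed by configuration name."""
    return {n: {"per_gflop": f / g, "per_bw": f / b, "per_cache_kb": f / c}
            for n, f, g, b, c in zip(NAMES, FPS, GFLOPS, BW, CACHE_KB)}


for name, ratios in efficiency().items():
    print(name, {k: round(v, 2) for k, v in ratios.items()})
```

Under that reading, H to K get the most frames per GFlop and comparatively few frames per GB/s, while B gets few frames per GFlop, which is consistent with the slide's claims that low-end configurations are shading-limited and PC shaders are underutilized.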
Shaded Triangle Setup Performance
- No overhead on fragment-limited benchmarks
- 16% less performance in vertex- and triangle-limited traces
[Figure: performance of triangle setup on the shader relative to triangle setup on specific hardware (axis from 0.7 to 1) for the torus, UT-2004, lit spheres, spaceship and VL-II traces.]
Conclusion
- Attila Embedded achieves 20 frames per second on a single unified shader architecture at a 320x240 resolution when running a year-old PC benchmark
  - 1 MB of fast embedded DRAM provides more than enough bandwidth for framebuffer accesses
  - Texture data is stored in system memory
- Removing the specialized triangle setup unit reduces performance by 16% in the worst tested case
Questions?
Attila PC
[Figure: Attila PC pipeline with a unified shader pool. Blocks: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, four Shaders, four ROPs, four Memory Controllers.]
[Figure: performance, performance per GFlop, performance per BW and performance per cache KB for configurations A to K.]
PowerVR SGX