Heterogeneous Computing (HC) & Micro-Heterogeneous Computing (MHC)
| Heterogeneous Computing -> Fusion | saahpc 2010 1
Heterogeneous Computing -> Fusion
Norm Rubin
AMD Fellow
Definitions
Heterogeneous Computing
– A system composed of two or more compute engines with significant structural differences
– In our case, a low latency x86 CPU and a high throughput Radeon GPU
Fusion
– Bringing together two or more components and joining them into a single unified whole
– In our case, combining CPUs and GPUs on a single silicon die for higher performance and lower power
AMD Balanced Platform Advantage
Delivers optimal performance for a wide range of platform configurations
Other Highly Parallel Workloads
Graphics Workloads
Serial/Task-Parallel Workloads
CPU is ideal for scalar processing
Out of order x86 cores with low latency memory access
Optimized for sequential and branching algorithms
Runs existing applications very well
GPU is ideal for parallel processing
GPU shaders optimized for throughput computing
Ready for emerging workloads
Media processing, simulation, natural UI, etc
Three Eras of Processor Performance
Single-Core Era
[Chart: Single-thread Performance vs. Time; labels: “?”, “we are here”]
Enabled by:
Moore’s Law
Voltage Scaling
MicroArchitecture
Constrained by:
Power
Complexity
Multi-Core Era
[Chart: Throughput Performance vs. Time (# of Processors); label: “we are here”]
Enabled by:
Moore’s Law
Desire for Throughput
20 years of SMP arch
Constrained by:
Power
Parallel SW availability
Scalability
Heterogeneous Systems Era
[Chart: Targeted Application Performance vs. Time (Data-parallel exploitation); label: “we are here”]
Enabled by:
Moore’s Law
Abundant data parallelism
Power efficient GPUs
Temporarily constrained by:
Programming models
Communication overheads
Emerging Application Spaces
Category / Characteristics / Application Examples
Massive Data Mining
Characteristics: Full 64b addressing; Huge data sets; New data types
Application Examples: Image, Video, Audio processing; Pattern analytics and search
Natural User Interfaces
Characteristics: Massive “behind-the-scenes” computing
Application Examples: Face and gesture recognition; Real time video & audio proc; Physical world interpretation
Visualization
Characteristics: Advanced rendering; Interactive physics; Multi-layered Graphics
Application Examples: Holographic Displays; Scientific visualization & CAD; Next generation Gaming
Cloud + Client Applications
Characteristics: Seamless responsiveness; Workload partitioning
Application Examples: Next generation browsers; HTML5 Apps with Native Code from JavaScript
GPU SP ALU Performance
[Chart series: HD4870, HD5870, CPU]
GPU DP ALU Performance
[Chart series: HD4870, HD5870, CPU]
GPU BW Performance expectations over time
[Chart: axis from 0 to 300 in steps of 50; points labeled HD4870 and HD5870]
GPU Computing Efficiency Trend
[Chart: GFLOPS/W and GFLOPS/mm2. Highlighted values: 14.47 GFLOPS/W and 7.90 GFLOPS/mm2. Other data labels, in extraction order: 7.50, 4.56, 4.50, 2.24, 2.21, 0.92, 2.01, 1.06, 1.07, 0.42]
ATI Radeon™ HD 5870 Compute Architecture
20 SIMD Engines
1600 shader cores
Ultra-Threaded Dispatch Processor
Instruction and Constant Caches
Memory Export Buffer
Fetch path with multi-level caches
Global Data Store
Memory Hierarchy
Distributed Memory Controller
Optimized for latency hiding and memory access efficiency
GDDR5 memory at 150 GB/s
Up to 272 billion 32-bit fetches/second
Up to 1 TB/sec L1 texture fetch bandwidth
Up to 435 GB/sec between L1 & L2
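The fetch-rate and bandwidth figures on this slide can be cross-checked with simple arithmetic. The sketch below is only a consistency check, not AMD's own derivation: it assumes each 32-bit fetch moves 4 bytes, uses decimal units (1 GB = 10^9 bytes), takes the 20 SIMD engines from the compute architecture slide, and assumes an 850 MHz engine clock, which is not stated anywhere in these slides.

```python
# Figures quoted on the Memory Hierarchy slide.
FETCHES_PER_SECOND = 272e9     # "up to 272 billion 32-bit fetches/second"
BYTES_PER_FETCH = 4            # a 32-bit fetch is 4 bytes
GDDR5_GB_PER_S = 150           # off-chip memory bandwidth
L1_L2_GB_PER_S = 435           # bandwidth between L1 and L2

# Figures from elsewhere or assumed.
SIMD_ENGINES = 20              # from the compute architecture slide
ENGINE_CLOCK_HZ = 850e6        # ASSUMPTION: not stated in the slides

# Aggregate L1 fetch bandwidth implied by the fetch rate.
l1_gb_per_s = FETCHES_PER_SECOND * BYTES_PER_FETCH / 1e9

# How much faster each on-chip level is than the GDDR5 interface.
l1_vs_dram = l1_gb_per_s / GDDR5_GB_PER_S
l2_vs_dram = L1_L2_GB_PER_S / GDDR5_GB_PER_S

# Fetches per clock per SIMD engine under the assumed clock.
fetches_per_clock_per_simd = FETCHES_PER_SECOND / SIMD_ENGINES / ENGINE_CLOCK_HZ

print(f"L1 fetch bandwidth: {l1_gb_per_s:.0f} GB/s")
print(f"L1 / GDDR5: {l1_vs_dram:.2f}x, L1-L2 / GDDR5: {l2_vs_dram:.2f}x")
print(f"Fetches per clock per SIMD: {fetches_per_clock_per_simd:.0f}")
```

Under these assumptions the fetch rate works out to 1088 GB/s, which agrees with the slide's "up to 1 TB/sec" figure.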
Comparative Stats on ATI Radeon HD 5870 GPU
Metric               AMD Opteron™ Model 2435   ATI Radeon™ HD 4870   ATI Radeon™ HD 5870   One Year Difference
Die Size             346 mm2                   263 mm2               334 mm2               1.27x
Transistors          904 million               956 million           2.15 billion          2.25x
Memory Bandwidth     12.8 GB/s                 115 GB/s              153 GB/s              1.33x
SP GFlops            124.8                     1200                  2720                  2.25x
DP GFlops            62.4                      240                   544                   2.25x
ALUs                 54                        800                   1600                  2x
Board Power* (Idle)  15.5 W                    90 W                  27 W                  0.3x
Board Power* (Max)   115 W                     160 W                 188 W                 1.17x
* Based on internal AMD testing
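The efficiency figures quoted on the GPU Computing Efficiency Trend slide can be approximately reproduced from this table. The sketch below assumes that GFLOPS/W means single-precision GFlops divided by maximum board power, and that GFLOPS/mm2 means single-precision GFlops divided by die size. Neither definition is stated in the slides, and the dictionary layout and function names are illustrative only.

```python
# Values copied from the comparison table (single-precision GFlops,
# maximum board power in watts, die size in mm2).
specs = {
    "HD 4870": {"sp_gflops": 1200, "max_power_w": 160, "die_mm2": 263},
    "HD 5870": {"sp_gflops": 2720, "max_power_w": 188, "die_mm2": 334},
}

def gflops_per_watt(name):
    """ASSUMED definition: SP GFlops / maximum board power."""
    s = specs[name]
    return s["sp_gflops"] / s["max_power_w"]

def gflops_per_mm2(name):
    """ASSUMED definition: SP GFlops / die size."""
    s = specs[name]
    return s["sp_gflops"] / s["die_mm2"]

for name in specs:
    print(f"{name}: {gflops_per_watt(name):.2f} GFLOPS/W, "
          f"{gflops_per_mm2(name):.2f} GFLOPS/mm2")
```

With these definitions the HD 5870 comes out at 14.47 GFLOPS/W and the HD 4870 at 7.50 GFLOPS/W and 4.56 GFLOPS/mm2, all of which appear on the efficiency slide. The HD 5870 area figure comes out at 8.14 rather than the quoted 7.90 GFLOPS/mm2, so the two slides may be using slightly different die sizes.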
Yesterday’s Chip Designs Won’t Do
GPU
110 million transistors @150nm
2D and 3D gaming
Nascent video processing
CPU
105 million transistors @130nm
Compute tasks including video decode
Today We Are Evolving
TeraFLOPS-class GPU
2.15 billion transistors @40nm
3D OS
Multi-panel HD gaming
Full HD video and audio
Multi-core CPU
758 million transistors @45nm
Multi-tasking
Most compute tasks
Tomorrow Will Amaze
Significantly enhances active/resting battery life
High-bandwidth I/O
~1 billion transistors @32nm in one design
APU: Fusion of CPU & GPU compute power within one processor
AMD Fusion™ APUs Fill the Need
x86 CPU owns the Software World
Windows, MacOS and Linux franchises
Thousands of apps
Established programming and memory model
Mature tool chain
Extensive backward compatibility for applications and OSs
High barrier to entry
GPU Optimized for Modern Workloads
Enormous parallel computing capacity
Outstanding performance-per-watt-per-dollar
Very efficient hardware threading
SIMD architecture well matched to modern workloads: video, audio, graphics
Fusion APUs: Putting it all together
[Chart axes: Microprocessor Advancement; GPU Advancement]
Microprocessor Advancement: Single-Thread Era, Multi-Core Era, Heterogeneous Systems Era
GPU Advancement: Graphics Driver-based programs, OCL/DC Driver-based programs, System-level Programmable
Programmer Accessibility: Unacceptable, Experts Only, Mainstream
Other labels: Throughput Performance; High Performance Task Parallel Execution; Power-efficient Data Parallel Execution; Heterogeneous Computing; Fusion APU
PC with Discrete GPU
Fusion APU Based PC
Two x86 Cores Tuned for Target Markets
“Bulldozer”
Performance & Scalability
Mainstream Client and Server Markets
“Bobcat”
Flexibility, Low Power & Low Cost
Low Power Markets
Lower Cost
Cloud Optimized
Heterogeneous Computing: Next-Generation Software Ecosystem
Hardware & Drivers: AMD Fusion™, Discrete CPUs/GPUs
OpenCL & Direct Compute
Tools: HLL compilers, Debuggers, Profilers
Middleware/Libraries: Video, Imaging, Math/Sciences, Physics
High Level Frameworks
End-user Applications
Advanced Optimizations & Load Balancing
Load balance across CPUs and GPUs; leverage AMD Fusion™ performance advantages
Drive new features into industry standards
Increase ease of application development
Open Standards: Maximize Developer Freedom and Addressable Market
Vendor specific Cross-platform limiters
• Apple Display Connector
• 3dfx Glide
• Nvidia CUDA
• Nvidia Cg
• Rambus
• Unified Display Interface
Vendor neutral Cross-platform enablers
• Digital Visual Interface
• OpenCL™
• DirectX®
• Certified DP
• JEDEC
• OpenGL®
The Benefits of Fusion
Unparalleled processing capabilities in mobile form factors
Shared memory for the CPU and GPU
Eliminates copies, increasing performance
Reduces dispatch overhead
Lower latency from the GPU to memory
Power efficient design
Enables architectural innovations between CPU, GPU and the Memory System
Scalable architecture that can target a broad range of platforms from mobile to data center
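The "eliminates copies" and "reduces dispatch overhead" benefits listed above can be illustrated with a back-of-the-envelope timing model. This is a minimal sketch under stated assumptions, not a measurement: the function names and every number in the example (buffer sizes, link bandwidth, kernel time, dispatch latencies) are hypothetical. It also assumes the kernel runs equally fast in both configurations, which may not hold on real hardware.

```python
def discrete_gpu_time_s(bytes_in, bytes_out, kernel_s, link_gb_per_s, dispatch_s):
    """Offload time when inputs and outputs are copied over an external link."""
    copy_s = (bytes_in + bytes_out) / (link_gb_per_s * 1e9)
    return dispatch_s + copy_s + kernel_s

def shared_memory_time_s(kernel_s, dispatch_s):
    """Offload time when CPU and GPU share memory, so no copies are needed."""
    return dispatch_s + kernel_s

# HYPOTHETICAL workload: 64 MB in, 64 MB out, 5 ms of GPU work.
bytes_in = 64e6
bytes_out = 64e6
kernel_s = 5e-3

discrete = discrete_gpu_time_s(bytes_in, bytes_out, kernel_s,
                               link_gb_per_s=8.0,   # hypothetical link bandwidth
                               dispatch_s=50e-6)    # hypothetical dispatch latency
fused = shared_memory_time_s(kernel_s,
                             dispatch_s=10e-6)      # hypothetical, lower latency

print(f"discrete: {discrete * 1e3:.2f} ms, shared memory: {fused * 1e3:.2f} ms")
```

In this model the copy term grows with the data size while the kernel term does not, so short kernels over large buffers are where shared memory would be expected to help most.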
The Fusion Opportunity
A new architectural and performance balance point for computing
A new machine target for research
A high volume opportunity for new algorithms, new workloads and new applications
The deployment opportunity is especially strong in the consumer marketplace
Questions?
Backup slides
Thread Processors
5-way VLIW Architecture
4 Stream Cores and 1 Special Function Stream Core
Separate Branch Unit
All 5 cores co-issue
Scheduling across the cores is done by the compiler
Each core delivers a 32-bit result per clock
Thread Processor writes 5 results per clock
Stream Cores
4 32-bit FP MAD per clock
2 64-bit FP MUL or ADD per clock
1 64-bit FP MAD per clock
4 24-bit Int MUL or ADD per clock
Special functions
1 32-bit FP MAD per clock
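The per-clock rates on this slide imply a fixed ratio between single- and double-precision peak throughput. The sketch below works it out for one thread processor, assuming a multiply-add (MAD) counts as two floating-point operations, a common convention that the slides do not state. The constant and function names are illustrative.

```python
# Per-clock issue rates for one thread processor, from the slide.
SP_MADS_PER_CLOCK = 4 + 1   # 4 stream cores + 1 special-function core, one 32-bit MAD each
DP_MADS_PER_CLOCK = 1       # one 64-bit FP MAD per clock
FLOPS_PER_MAD = 2           # ASSUMPTION: multiply + add counted as two operations

def flops_per_clock(mads_per_clock):
    """Peak floating-point operations per clock for one thread processor."""
    return mads_per_clock * FLOPS_PER_MAD

sp = flops_per_clock(SP_MADS_PER_CLOCK)
dp = flops_per_clock(DP_MADS_PER_CLOCK)
sp_to_dp_ratio = sp / dp

# Ratio of the HD 5870 SP and DP GFlops figures in the comparison table.
table_ratio = 2720 / 544

print(f"SP: {sp} flops/clock, DP: {dp} flops/clock, ratio {sp_to_dp_ratio:.0f}:1")
print(f"Table ratio: {table_ratio:.0f}:1")
```

Both ratios come out to 5:1, so the per-core rates here are consistent with the SP and DP GFlops figures in the comparison table.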
SIMD Engines
Diagram shows 2 SIMD Engines
Each SIMD Unit includes:
16 Thread Processors (80 shader cores) + 32KB Local Data Share
Its own Thread Sequencer which operates a shared set of threads
A dedicated fetch unit with an 8KB L1 cache
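The per-SIMD figures above can be scaled up to the whole chip. The sketch below combines them with the 20 SIMD engines and 5-way thread processors described on other slides. The 850 MHz engine clock is an assumption that does not appear in the slides, as is the convention of counting a multiply-add as two floating-point operations.

```python
# Figures from the slides.
SIMD_ENGINES = 20                 # compute architecture slide
THREAD_PROCESSORS_PER_SIMD = 16   # this slide
CORES_PER_THREAD_PROCESSOR = 5    # 5-way VLIW thread processor
LDS_KB_PER_SIMD = 32              # Local Data Share per SIMD
L1_KB_PER_SIMD = 8                # L1 cache per SIMD fetch unit

# Assumptions, not stated in the slides.
ENGINE_CLOCK_MHZ = 850
FLOPS_PER_MAD = 2

cores_per_simd = THREAD_PROCESSORS_PER_SIMD * CORES_PER_THREAD_PROCESSOR
shader_cores = SIMD_ENGINES * cores_per_simd
total_lds_kb = SIMD_ENGINES * LDS_KB_PER_SIMD
total_l1_kb = SIMD_ENGINES * L1_KB_PER_SIMD

# Peak single-precision rate if every core issues one MAD per clock.
peak_sp_gflops = shader_cores * FLOPS_PER_MAD * ENGINE_CLOCK_MHZ / 1000

print(f"{cores_per_simd} cores per SIMD, {shader_cores} shader cores in total")
print(f"{total_lds_kb} KB Local Data Share, {total_l1_kb} KB L1 across the chip")
print(f"Peak SP: {peak_sp_gflops:.0f} GFLOPS at the assumed clock")
```

The totals match the 80 shader cores per SIMD and 1600 shader cores quoted in the slides, and under the assumed clock the peak matches the 2720 SP GFlops in the comparison table.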
TeraScale 2 Architecture – Radeon HD 5870
OpenCL™ and DirectX® 11 DirectCompute
How will developers choose?
DirectX® 11 DirectCompute
Easiest path to add compute capabilities to existing DirectX applications
Windows Vista® and Windows® 7 only
OpenCL™
Ideal path for new applications porting to the GPU for the first time
True multiplatform: Windows®, Linux®, MacOS
Natural programming without dealing with a graphics API