Heterogeneous Systems Architecture: The Next Area of Computing Innovation
Transcript of Heterogeneous Systems Architecture: The Next Area of Computing Innovation
HETEROGENEOUS SYSTEMS ARCHITECTURE:
THE NEXT AREA OF COMPUTING INNOVATION
CASE STUDY: THE HOLODECK
Dr. Lisa Su, Senior Vice President and GM, Global Business Units, AMD
ISSCC Conference
February 18, 2013
2 | ISSCC Keynote | February 18th, 2013
CHALLENGES TO MOORE’S LAW SCALING
Lithography challenges begin to severely limit area scaling at the 20nm node
– Fewer 1X metals due to cost
– Less aggressive feature scaling due to lithography challenges
Compounded by rapidly increasing lithography costs
– The 28nm-to-20nm transition is the inflection point, with dual exposure
– For the first time, there is no cost-per-transistor crossover at the 28nm-to-20nm transition
[Figures: Area Scaling by Technology Generation (Normalized Area, 0.0 to 1.0) and Cost Per Transistor Scaling (Normalized Cost/Transistor, 0.0 to 1.0), each across 45nm, 40nm, 32nm, 28nm, 20nm, and 20FinFET]
A PARADIGM SHIFT…
[Figure: Microprocessor Advancement (CPU) and GPU Advancement plotted against Programmability across the Single-Core Era, Multi-Core Era, and Heterogeneous Systems Era. Labels: Homogeneous Computing; Throughput Performance Accelerator; graphics driver-based programs; OpenCL/DX driver-based programs; high-level programmable; Heterogeneous Computing.]
HETEROGENEOUS SYSTEMS ARCHITECTURE MEMORY MODEL
[Figure: from 32 bit (yesterday) to 64 bit (today)]
ARCHITECTURES – A HISTORICAL PERSPECTIVE
[Figure: timeline (1981, 1990s, 2000s, 2010s) spanning the Legacy Processing Era and the Surround Computing Era: Traditionally Optimized Platforms, Single Core CPUs, Multi-Core CPUs/GPUs, APUs and legacy SOC, and Heterogeneous Architectures]
CHANGING THE THINKING, CHANGING THE GAME
HSA is designed to make the GPU hardware
directly accessible to the software, using the high-level
languages programmers already use on
the CPU
C, C++, Java, Python…even JavaScript, HTML5
ISA agnostic – e.g., x86, 64-bit ARM, Radeon, Mali
GPU becomes a peer processor to the CPU in
terms of system integration
Full programming language features
Shared virtual memory: pointer is a pointer
Coherency
Context switching
HSA Foundation – an
industry-wide initiative
BENEFITS OF HETEROGENEOUS SYSTEM ARCHITECTURE
EFFECTIVE COMPUTE OFFLOAD
Made easy by HSA
Unleash the best compute element for each task
[Figure: APU Accelerated Software Applications: Serial and Task Parallel Workloads and Data Parallel Workloads, both served by the HSA Accelerated Processing Unit]
BRINGING IT ALL TOGETHER
MOTION DSP 720P
[Figures: Performance (0 to 25 fps) and Power (0 to 35 W, split into CPU Cores, NB+GPU, and DRAM), each comparing CPU with CPU+GPU]
AMD internal testing: AMD E2-3200 APU (2 cores @ 2400 MHz, GPU: 2 CU @ 444 MHz), Windows 7 OS, MotionDSP vReveal application, 720p MP4 input
(http://www.vreveal.com/stabilization)
>4.0X Better Energy Efficiency¹
Synergistic use of GPU compute + shared memory = lower power and higher performance
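The slide does not define "energy efficiency". One common reading, assumed here, is energy per processed frame: average power divided by frame rate. Under that assumption the improvement factor works out as follows.

```latex
E_{\text{frame}} = \frac{P}{f}
\qquad \text{(J/frame, with } P \text{ in W and } f \text{ in frames/s)}

\text{gain}
  = \frac{E_{\text{frame}}^{\text{CPU}}}{E_{\text{frame}}^{\text{CPU+GPU}}}
  = \frac{P_{\text{CPU}} / f_{\text{CPU}}}{P_{\text{CPU+GPU}} / f_{\text{CPU+GPU}}}
  = \frac{f_{\text{CPU+GPU}}}{f_{\text{CPU}}} \cdot \frac{P_{\text{CPU}}}{P_{\text{CPU+GPU}}}
```

In this form a higher frame rate and a lower power draw multiply into the overall gain.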
TODAY’S DISCUSSION: FROM SURROUND COMPUTING TO
ENABLING THE HOLODECK
1. A fully featured Holodeck is
still many years away
2. Today our discussion will:
Establish a Holodeck framework
Identify Holodeck enabling technologies
Discuss how Heterogeneous Systems
Architecture (HSA) accelerates these
technologies
Undertake an HSA deep dive on one of
these enabling technologies
Look at how new dedicated processors
will enable Holodeck functionality
WHAT IS A HOLODECK?
THE HOLODECK FRAMEWORK: AN EVOLUTION OF SURROUND COMPUTING
Natural User Interfaces
Context Computing
360 Degree Virtual
Environments
HOLODECK ENABLING TECHNOLOGIES: PROFOUND IMPLICATIONS FOR COMPUTER ARCHITECTURE
Computational Photography: Delivering seamless and immersive video environments
Directional Audio: Using audio to enhance immersion and realism of our environments
Natural User Interfaces: Enabling realistic, natural human communication
Context Computing: Delivering an intuitive understanding of the user’s needs in real time
Augmented Reality: Bringing it all together – combining the real and the virtual
COMPUTATIONAL PHOTOGRAPHY 360 DEGREE VISUAL ENVIRONMENTS, PHOTOSTITCHING, PERIPHERAL VISION AND HSA
Mapping real-life scenes through finite images
Photo stitching of tiled environments and
perceptual correction
Detect interest points & match features
Projecting geometry with point features
using algorithms like RANSAC
Image processing to account for
curved screen surfaces
Modulate brightness to account for
peripheral vision
HSA presents a unified view of the
system with shared memory, so CPU and
GPU acceleration can be applied across the entire process
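RANSAC fits a model to feature matches while ignoring outliers. It repeatedly fits a hypothesis to a minimal random sample, counts the matches that agree, and keeps the largest consensus set. The minimal sketch below assumes a pure-translation model between two overlapping images, so one match is enough per hypothesis. Real photostitching typically estimates a homography from four matches. The function name, tolerance, and iteration count are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Match = Tuple[Point, Point]  # (point in image A, matched point in image B)


def ransac_translation(
    matches: Sequence[Match],
    iterations: int = 100,
    inlier_tol: float = 2.0,
    seed: int = 0,
) -> Tuple[Tuple[float, float], List[Match]]:
    """Estimate the (dx, dy) shift from image A to image B with RANSAC.

    A translation needs only one correspondence, so each hypothesis is
    built from a single randomly chosen match. Requires at least one match.
    """
    rng = random.Random(seed)
    best_inliers: List[Match] = []
    for _ in range(iterations):
        (ax, ay), (bx, by) = rng.choice(matches)  # minimal sample
        dx, dy = bx - ax, by - ay
        # Consensus set: matches whose offset agrees with this hypothesis.
        inliers = [
            m for m in matches
            if abs((m[1][0] - m[0][0]) - dx) <= inlier_tol
            and abs((m[1][1] - m[0][1]) - dy) <= inlier_tol
        ]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Refit on the consensus set: the least-squares translation is the mean offset.
    n = len(best_inliers)
    dx = sum(b[0] - a[0] for a, b in best_inliers) / n
    dy = sum(b[1] - a[1] for a, b in best_inliers) / n
    return (dx, dy), best_inliers


# Eight consistent matches shifted by (10, -5), plus two bad matches.
good = [((x, y), (x + 10, y - 5))
        for x, y in [(0, 0), (5, 3), (12, 7), (20, 1), (8, 15), (30, 22), (14, 9), (3, 18)]]
bad = [((2, 2), (102, 42)), ((40, 5), (10, 12))]
shift, inliers = ransac_translation(good + bad)
print(shift, len(inliers))
```

Each hypothesis is scored against every match independently. That scoring is the data-parallel part that suits GPU work items, while the small refit step stays cheap on the CPU.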
DIRECTIONAL AUDIO
Couples computationally demanding 3D
audio and spatialization effects with
"always on" background processing like
Voice Activity Detection (VAD)
Voice activity detection is best
implemented with special audio
processors and acceleration
techniques
Spatialization effects such as
“Convolution Reverb” are best
done with GPU acceleration
HSA enables seamless
integration of CPU and GPU
acceleration with other
independent accelerators
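Below is a minimal plain-Python sketch of the two workloads this slide names. The first function is a direct-form convolution, the core operation of convolution reverb: a dry signal is convolved with a room's impulse response. The second is a deliberately simple energy-threshold voice activity detector. The energy VAD is a stand-in chosen for brevity, not what dedicated audio processors actually run. Function names and parameters are hypothetical.

```python
from typing import List, Sequence


def convolve(dry: Sequence[float], impulse_response: Sequence[float]) -> List[float]:
    """Direct-form convolution: out[n] = sum over k of dry[k] * ir[n - k].

    Every output sample is computed independently of the others, which is
    why the work maps onto many GPU work items (one per output sample).
    """
    n_out = len(dry) + len(impulse_response) - 1
    out = [0.0] * n_out
    for n in range(n_out):
        acc = 0.0
        # Only k where both dry[k] and impulse_response[n - k] exist contribute.
        lo = max(0, n - len(impulse_response) + 1)
        hi = min(n, len(dry) - 1)
        for k in range(lo, hi + 1):
            acc += dry[k] * impulse_response[n - k]
        out[n] = acc
    return out


def energy_vad(samples: Sequence[float], frame_len: int, threshold: float) -> List[bool]:
    """Flag each full frame whose mean energy exceeds a fixed threshold.

    A trailing partial frame is ignored.
    """
    flags = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        flags.append(energy > threshold)
    return flags


print(convolve([1, 2, 3], [0, 1, 0.5]))  # [0.0, 1.0, 2.5, 4.0, 1.5]
print(energy_vad([0, 0, 0, 0, 0.5, -0.5, 0.5, -0.5], 4, 0.01))  # [False, True]
```

Production reverbs use FFT-based partitioned convolution for long impulse responses. The direct form is shown only because it makes the per-sample independence obvious.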
NATURAL USER INTERFACES
Speech Recognition:
Background processing – echo cancellation & noise suppression
Audio feature extraction
Voice pattern recognition through a Markov model or similar algorithm
Gesture Recognition:
Frame preprocessing & filtering
Optical flow or object tracking
Sophisticated computer vision algorithms to delineate the hand or body parts from the background
NUI algorithms all benefit from using the CPU, GPU, and audio processors to perform these functions efficiently at the lowest power
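The slide names a Markov model for voice pattern recognition. For a hidden Markov model, the standard decoding step is the Viterbi algorithm, which finds the most likely sequence of hidden states for a sequence of observations. The sketch below uses a toy two-state model. The state names "A"/"B", observation symbols, and probabilities are placeholders, not a speech model. Real recognizers use far larger models, acoustic feature vectors, and log-probabilities.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi(
    observations: Sequence[str],
    states: Sequence[str],
    start_p: Dict[str, float],
    trans_p: Dict[str, Dict[str, float]],
    emit_p: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Most likely hidden-state path for an HMM, using plain probabilities.

    Long inputs would underflow; production code works in log space.
    """
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best: List[Dict[str, Tuple[float, str]]] = [
        {s: (start_p[s] * emit_p[s][observations[0]], "") for s in states}
    ]
    for obs in observations[1:]:
        prev = best[-1]
        column: Dict[str, Tuple[float, str]] = {}
        for s in states:
            # Best predecessor r for state s at this step.
            p, from_state = max((prev[r][0] * trans_p[r][s], r) for r in states)
            column[s] = (p * emit_p[s][obs], from_state)
        best.append(column)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]


states = ("A", "B")
start = {"A": 0.6, "B": 0.4}
trans = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit = {"A": {"x": 0.5, "y": 0.4, "z": 0.1}, "B": {"x": 0.1, "y": 0.3, "z": 0.6}}
path, prob = viterbi(["x", "y", "z"], states, start, trans, emit)
print(path)  # ['A', 'A', 'B']
```

Within one time step, the per-state maximizations are independent of each other. That is the parallelism an accelerator can exploit, while the sequential dependence between time steps remains.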
CONTEXT COMPUTING BIOMETRICS EXAMPLE
• Facial Recognition:
• Face detection (is there a face) –
GPU acceleration
• Face identification (pattern
matching through algorithms like
Haar face detection) – CPU and
GPU acceleration
• Validation through blink detection
(make sure it is a real face) –
GPU acceleration
HSA enables mix and match of the best
acceleration for each phase of the
process
AUGMENTED REALITY
• Image Registration:
• Relies on robust and fast feature
detection – benefits from
CPU/GPU acceleration
• Object Tracking:
• Relies on “optical flow” algorithm
– benefits from CPU/GPU
acceleration
• Image Composition:
• Once the above information is available,
this becomes a classic
graphics rendering use case
The building blocks of HSA enable the
augmented reality world.
THE WAY FORWARD
Many technologies are required to
enable our vision
– Heterogeneous engines that
accelerate key client and server
workloads
– Datacenters optimized for
latency, scalability, and
efficiency
– Processors optimized for new
and emerging workloads
– Active research into new
algorithms
ENABLING TECHNOLOGY DEEP DIVE:
ACCELERATING NATURAL USER INTERFACES (HAAR
FACE DETECTION) WITH HETEROGENEOUS
SYSTEMS ARCHITECTURE
LOOKING FOR FACES IN ALL THE RIGHT PLACES
Quick HD Calculations
Search square = 21 x 21
Pixels = 1920 x 1080 = 2,073,600
Search squares = 1900 x 1060 = ~2 Million
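The slide's counts follow from sliding a 21 x 21 search square one pixel at a time across the frame. The one-pixel step is an assumption, but it reproduces the slide's 1900 x 1060 exactly: a square fits at (1920 − 21 + 1) horizontal and (1080 − 21 + 1) vertical positions. The helper name is hypothetical.

```python
def search_windows(width: int, height: int, window: int) -> int:
    """Positions a window x window square can occupy, stepping one pixel."""
    return (width - window + 1) * (height - window + 1)


pixels = 1920 * 1080
squares = search_windows(1920, 1080, 21)
print(pixels, squares)  # 2073600 2014000
```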
LOOKING FOR DIFFERENT SIZE FACES BY SCALING THE VIDEO FRAME
More HD Calculations
70% scaling in H and V
Total Pixels = 4.07 Million
Search squares = 3.8 Million
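The totals are consistent with an image pyramid in which each level is the previous one scaled to 70% in both directions. Each level then has 0.7² = 0.49 of the previous level's pixels, so the pixel total approaches 2,073,600 / (1 − 0.49) ≈ 4.07 million. The sketch below makes further assumptions: integer truncation at each level, a one-pixel search step, and stopping once a side is smaller than the 21-pixel window. Under these assumptions it gives roughly 4.06 million pixels and 3.87 million search squares, close to the slide's figures. The function name is hypothetical.

```python
from typing import Tuple


def pyramid_totals(width: int, height: int, window: int = 21,
                   scale_num: int = 7, scale_den: int = 10) -> Tuple[int, int, int]:
    """Sum pixels and search-window positions over an image pyramid.

    Each level is the previous one scaled by scale_num/scale_den (70% by
    default) in both directions with integer truncation. Levels stop once
    either side is smaller than the search window.
    Returns (total pixels, total windows, number of levels).
    """
    total_pixels = total_windows = levels = 0
    while width >= window and height >= window:
        total_pixels += width * height
        total_windows += (width - window + 1) * (height - window + 1)
        levels += 1
        width = width * scale_num // scale_den
        height = height * scale_num // scale_den
    return total_pixels, total_windows, levels


print(pyramid_totals(1920, 1080))
```

Roughly half of all search squares come from the full-resolution level, so most of the work stays in the first few levels.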
HAAR CASCADE STAGES
[Figure: Stage N and Stage N+1, each evaluating a set of features (k, l, m, p, q, r). After each stage: Face still possible? Yes: continue to the next stage. No: REJECT FRAME.]
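The following is a minimal sketch of the per-window early-out loop described above. It uses the integral-image technique (Viola-Jones style), which evaluates any rectangle sum with four lookups. The class names, the rectangle/weight layout, the voting scheme, and the toy cascade are illustrative assumptions. They are not AMD's implementation or a trained cascade file format.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# Hypothetical rectangle layout: (x, y, width, height, weight), relative to the window.
Rect = Tuple[int, int, int, int, float]


def integral_image(img: Sequence[Sequence[int]]) -> List[List[int]]:
    """Summed-area table padded with a zero row/column: ii[y][x] = sum of img[:y][:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii: List[List[int]], x: int, y: int, w: int, h: int) -> int:
    """Sum of the w x h rectangle with top-left corner (x, y), in four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


@dataclass
class WeakClassifier:
    rects: List[Rect]
    threshold: float
    vote_below: float  # added to the stage score if the feature value < threshold
    vote_above: float  # added otherwise

    def vote(self, ii: List[List[int]], ox: int, oy: int) -> float:
        value = sum(wt * rect_sum(ii, ox + x, oy + y, w, h)
                    for x, y, w, h, wt in self.rects)
        return self.vote_below if value < self.threshold else self.vote_above


@dataclass
class Stage:
    classifiers: List[WeakClassifier]
    threshold: float


def evaluate_window(ii: List[List[int]], ox: int, oy: int,
                    cascade: List[Stage]) -> Tuple[bool, int]:
    """Run the cascade on one window; return (face?, number of stages evaluated)."""
    for depth, stage in enumerate(cascade, start=1):
        score = sum(c.vote(ii, ox, oy) for c in stage.classifiers)
        if score < stage.threshold:
            return False, depth  # early out: this window cannot be a face
    return True, len(cascade)


# Toy two-stage cascade: "left half brighter than right half" on a 4x4 window.
edge = WeakClassifier(rects=[(0, 0, 2, 4, 1.0), (2, 0, 2, 4, -1.0)],
                      threshold=100.0, vote_below=-1.0, vote_above=1.0)
cascade = [Stage([edge], 0.5), Stage([edge], 0.5)]

bright_left = [[200, 200, 50, 50]] * 4
flat = [[100] * 4] * 4
print(evaluate_window(integral_image(bright_left), 0, 0, cascade))  # (True, 2)
print(evaluate_window(integral_image(flat), 0, 0, cascade))         # (False, 1)
```

The second value returned by `evaluate_window` is the cascade depth the window reached. The differing depths across neighboring windows cause the divergence discussed in the following slides.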
22 CASCADE STAGES, EARLY OUT BETWEEN EACH
[Figure: STAGE 1 through STAGE 22 in sequence; any stage can exit early with NO FACE, and a window that passes all 22 stages is FACE CONFIRMED]
Final HD Calculations
Search squares = 3.8 million
Average features per square = 124
Calculations per feature = 100
Calculations per frame = 47 GCalcs
Calculation Rate
30 frames/sec = 1.4 TCalcs/second
60 frames/sec = 2.8 TCalcs/second
…and this only detects front-facing faces
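The slide's workload estimate is a straight product of its four inputs, then scaled by frame rate. The sketch below reproduces it using only the slide's numbers. The function name is hypothetical.

```python
from typing import Tuple


def haar_workload(search_squares: float, features_per_square: float,
                  calcs_per_feature: float, fps: float) -> Tuple[float, float]:
    """Return (calculations per frame, calculations per second)."""
    per_frame = search_squares * features_per_square * calcs_per_feature
    return per_frame, per_frame * fps


per_frame, rate_30 = haar_workload(3.8e6, 124, 100, 30)
_, rate_60 = haar_workload(3.8e6, 124, 100, 60)
print(f"{per_frame / 1e9:.1f} GCalcs/frame")        # 47.1 GCalcs/frame
print(f"{rate_30 / 1e12:.2f} TCalcs/s at 30 fps")   # 1.41 TCalcs/s at 30 fps
print(f"{rate_60 / 1e12:.2f} TCalcs/s at 60 fps")   # 2.83 TCalcs/s at 60 fps
```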
CASCADE DEPTH ANALYSIS
[Figure: Cascade Depth (0 to 25), binned as 0-5, 5-10, 10-15, 15-20, and 20-25 stages]
UNBALANCING DUE TO EXITS IN EARLIER CASCADE STAGES
[Figure: Live and Dead work items]
When running on the GPU, we run each search rectangle on a separate
work item
Early-out algorithms, like HAAR, exhibit divergence between work items
– Some work items exit early
– Their neighbors continue
– SIMD packing suffers as a result
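The packing loss can be modeled roughly as follows. Assume work items execute in lockstep groups of a fixed SIMD width, with one cascade stage per step, so a group keeps issuing stages until its deepest lane finishes. Under those assumptions, utilization is useful lane-stages divided by issued lane-stages. The function name and the width of 4 in the usage are illustrative; real GPU wavefronts are wider.

```python
from typing import Sequence


def simd_utilization(exit_stages: Sequence[int], width: int) -> float:
    """Fraction of issued lane-stage slots that did useful work.

    exit_stages[i] is how many cascade stages work item i runs before it
    exits (early out or face confirmed). Items are packed into groups of
    `width` lanes that run in lockstep; exited lanes sit idle (masked off)
    until the deepest lane in their group finishes.
    """
    useful = issued = 0
    for start in range(0, len(exit_stages), width):
        group = exit_stages[start:start + width]
        useful += sum(group)
        issued += max(group) * width  # a partial last group still occupies full width
    return useful / issued


# Three windows rejected at stage 1, next to one window that runs all 22 stages:
print(simd_utilization([1, 1, 1, 22], width=4))  # 25 / 88, about 0.28
```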
PROCESSING TIME/STAGE
[Figure: Time (ms, 0 to 100) per Cascade Stage (1 through 8, then 9-22 combined) on GPU and CPU; A10-4600M (6 CU @ 497 MHz, 4 cores @ 2700 MHz)]
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G,
6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)
PERFORMANCE CPU-VS-GPU
[Figure: Images/Sec (0 to 12) vs. Number of Cascade Stages on GPU (0 through 8, and 22) for CPU, HSA, and GPU; AMD A10-4600M APU (6 CU @ 497 MHz, 4 cores @ 2700 MHz)]
AMD A10 4600M APU with Radeon™ HD Graphics; CPU: 4 cores @ 2.3 GHz (turbo 3.2 GHz); GPU: AMD Radeon HD 7660G,
6 compute units, 685MHz; 4GB RAM; Windows 7 (64-bit); OpenCL™ 1.1 (873.1)
HAAR SOLUTION: RUN DIFFERENT CASCADE STAGES ON GPU AND CPU
By seamlessly sharing data between CPU and GPU,
HSA allows the right processor to handle its appropriate
workload
+2.5x INCREASED PERFORMANCE
-2.5x DECREASED ENERGY PER FRAME
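The CPU/GPU split can be framed as choosing how many leading cascade stages run on the GPU before the surviving windows are handed to the CPU. The minimal sketch below rests on three assumptions: stages execute back to back, per-stage times for each device are known (for example, measured as on the processing-time chart), and the handoff between devices is free, which is roughly what HSA's shared memory aims for. The function name and the numbers in the usage are hypothetical, not measured values.

```python
from typing import Sequence, Tuple


def best_split(gpu_ms: Sequence[float], cpu_ms: Sequence[float]) -> Tuple[int, float]:
    """Pick n so stages 1..n run on the GPU and stages n+1.. run on the CPU.

    gpu_ms[i] and cpu_ms[i] are per-frame times for stage i+1 on each device.
    Assumes stages run back to back and the CPU/GPU handoff costs nothing.
    Returns (n, total time in ms).
    """
    if len(gpu_ms) != len(cpu_ms):
        raise ValueError("need one GPU time and one CPU time per stage")
    best_n, best_total = 0, float(sum(cpu_ms))  # n = 0: everything on the CPU
    for n in range(1, len(gpu_ms) + 1):
        total = sum(gpu_ms[:n]) + sum(cpu_ms[n:])
        if total < best_total:
            best_n, best_total = n, total
    return best_n, best_total


# Illustrative numbers: early stages are cheap on the GPU (many windows, little
# divergence); late stages are cheap on the CPU (few surviving windows).
print(best_split([2, 3, 10, 20], [30, 12, 4, 1]))  # (2, 10)
```

A real system would also charge for moving surviving windows between devices. Keeping that cost near zero is what makes a fine-grained split worthwhile.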
APPLICATION ACCELERATION USING HSA
[Figure: Acceleration vs. CPU (0 to 14x) for Face detect, Video stabilization, Stereo vision, Audio search, Visual Search, Voice recognition, Photo indexing, and Gesture recognition; bar values 2x, 4x, 4x, 5x, 9x, 10x, 10x, 12x]
AMD estimates. Source: AMD Whitepaper, Accelerating Consumer/Prosumer Multimedia with HSA, June 2012
HSA EVOLUTION
Physical Integration (Llano)
• Integrate CPU & GPU in silicon
• Unified Memory Controller
• Common Manufacturing Technology
Optimized Platforms (Trinity)
• GPU Compute C++ support
• User mode scheduling
• Bi-Directional Power Mgmt between CPU and GPU
Architectural Integration (Kaveri)
• Unified Address Space for CPU and GPU
• GPU uses pageable system memory via CPU pointers
• Fully coherent memory between CPU & GPU
System Integration (Next Gen)
• GPU compute context switch
• GPU graphics pre-emption
• Quality of Service
HSA PROGRAMMABILITY ADVANTAGE
• Works with today’s programming models and languages
• Architected to enable CPU-like programmability
• Promotes development and adoption of extended standards
• Write Once Run Anywhere – with Performance
[Figure: HSA software stack. Compute Acceleration: C, C++, Java … OpenCL, C++ AMP, Java8 … and Domain-Specific Ext / APIs, built on the HSA Intermediate Language (HSAIL) and Unified Programming Models (HSA Foundation). Graphics Acceleration: DX11, OpenGL …]
CONCLUSION
The age of traditional computing is
dead.
A paradigm shift in processing has
brought about the Heterogeneous
Systems Era
HSA will enable us to dramatically
scale processing power while
increasing power efficiency
The Holodeck is still years away, but
HSA and dedicated hardware
blocks will accelerate and enable
technologies as they emerge
ACKNOWLEDGEMENTS
Bill Herz
Phil Rogers
Marty Johnson
Chris Hook
Sumant Subramanian
THANK YOU
DISCLAIMER The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and
typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to
product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences
between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or
otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to
time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO
RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN
NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES
ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, Radeon, and combinations thereof
are trademarks of Advanced Micro Devices, Inc. Other names and logos are used for informational purposes only and may
be trademarks of their respective owners.