Steve Scott, Tesla CTO - Nvidia · Steve Scott, Tesla CTO SC’11 November 15, 2011. What goal do...
Steve Scott, Tesla CTO
SC’11
November 15, 2011
What goal do these products have in common?
Performance / W
Exaflop Expectations
Not constant size, cost or power
CM5: ~200 kW
K Computer: ~10 MW
First Exaflop Computer
The Road Ahead is Steep
In the Good Old Days
Leakage was not important, and voltage scaled with feature size
L’ = L/2
V’ = V/2
E’ = C’V’^2 = E/8
f’ = 2f
D’ = 1/L’^2 = 4D
P’ = P
Halve L and get 4x the
transistors and 8x the capability
for the same power!
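The scaling relations above can be chained into a short derivation. This is a sketch under two assumptions the slide does not spell out: capacitance scales with feature size (C' = C/2), and power per unit area is energy per switch times frequency times device density.

```latex
\begin{align*}
E' &= C'V'^2 = \frac{C}{2}\left(\frac{V}{2}\right)^2 = \frac{CV^2}{8} = \frac{E}{8} \\
P' &= E'\,f'\,D' = \frac{E}{8}\cdot 2f \cdot 4D = E\,f\,D = P \\
f'\,D' &= 2f \cdot 4D = 8\,fD
\end{align*}
```

The last line is the "8x the capability" claim: four times the transistors, each switching twice as fast, at unchanged power.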
MF to GF to TF and almost to PF
Technology was giving us 68% per year in perf/W!
The New Reality
Leakage has limited threshold voltage, largely ending voltage scaling
Halve L and get 2x the capability for the
same power.
Processors realized ~50% per year in perf/W…
(spent it on single thread performance)
Technology will give us only 19% per year in perf/W
The Future Belongs to the Efficient
Chips have become power (not area) constrained
Density increases quadratically as feature size shrinks
Energy/op decreases only linearly as feature size shrinks
Which Takes More Energy?
Performing a 64-bit floating-point FMA:
893,500.288914668 × 43.90230564772498 = 39,226,722.78026233027699
39,226,722.78026233027699 + 2.02789331400154 = 39,226,724.80815564
Or moving the three 64-bit operands 18 mm across the die?
Moving the operands takes over 4.2x the energy (40nm)!
Loading the data from off chip takes >> 100x the energy.
It’s getting worse: in 10nm, the relative cost will be 15x!
Flops are cheap. Communication is expensive.
Achieving Energy Efficiency
Reduce overhead.
Minimize data motion.
Multi-core CPUs
Industry has gone multi-core as a first response to power issues
Performance through parallelism
Dial back complexity and clock rate
Exploit locality
Less than 2% of chip power today goes to flops.
But CPUs are fundamentally designed for single-thread performance rather than energy efficiency
Fast clock rates with deep pipelines
Data and instruction caches optimized for latency
Superscalar issue with out-of-order execution
Dynamic conflict detection
Lots of predictions and speculative execution
Lots of instruction overhead per operation
GPU (Fermi, 40nm): 225 pJ/FLOP
Optimized for Throughput
Explicit Management of On-chip Memory
CPU (Westmere, 32nm): 1690 pJ/FLOP
Optimized for Latency
Caches
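To see what these energy-per-flop figures would mean at exascale, the sketch below converts pJ/FLOP into sustained power. Applying the 2011 Fermi and Westmere numbers to a 10^18 flop/s machine is an illustration, not a claim from the talk, and it ignores memory, interconnect, and cooling; `power_megawatts` is a hypothetical helper.

```c
/* Sustained power in megawatts for a machine delivering `flops_per_sec`
   at `pj_per_flop` picojoules per floating-point operation.
   Counts only the per-flop energy; everything else is ignored. */
double power_megawatts(double pj_per_flop, double flops_per_sec)
{
    double watts = pj_per_flop * 1e-12 * flops_per_sec;
    return watts / 1e6;
}
```

With these inputs, `power_megawatts(225.0, 1e18)` is about 225 MW and `power_megawatts(1690.0, 1e18)` is about 1690 MW, a ratio of roughly 7.5x between the two designs.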
Growing Momentum for GPUs in Supercomputing
Tesla Powers 3 of 5 Top Systems (November 2011)
#1: K Computer, 88K Fujitsu Sparc CPUs, 10.5 PFLOPS
#2: Tianhe-1A, 7168 Tesla GPUs, 2.6 PFLOPS
#3: Jaguar, 36K AMD Opteron CPUs, 1.8 PFLOPS
#4: Nebulae, 4650 Tesla GPUs, 1.3 PFLOPS
#5: Tsubame 2.0, 4224 Tesla GPUs, 1.2 PFLOPS (most efficient PF system)
Titan: 18000 Tesla GPUs, >20 PFLOPS
NVIDIA GPU Computing Uptake
Compute-capable NVIDIA GPUs: >300,000,000
NVIDIA SW Development Toolkit Downloads: >500,000
Active NVIDIA GPU Computing Developers: >100,000
Universities Teaching GPU Computing: >450
Widespread adoption by HPC OEMs
ORNL Adopts GPUs for Next-Gen Supercomputer
Titan Cray XK6
18,000 Tesla GPUs
20+ PetaFlops
~90% of flops from GPUs
2x Faster, 3x More Energy Efficient (and much smaller!) than Current #1 (K Computer)
NCSA Mixes GPUs into Blue Waters
“By incorporating a future version of the XK6 system, Blue Waters will provide a bridge to the future of scientific computing.”
--NCSA Director Thom Dunning
Contains over 30 Cray XK cabinets
With over 3000 NVIDIA Tesla GPUs
What About Programming?
GPU Libraries: Plug In & Play
Parallel Algorithms
QUDA: Lattice QCD
Dense Linear Algebra
cuBLAS
Directives: Ease of Programming and Portability
Available from PGI, CAPS, and soon Cray
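#include <stdio.h>

#define N 1000000L  /* number of intervals (value assumed; not given on the slide) */

int main(void) {
    double pi = 0.0; long i;
    /* Midpoint-rule integration of 4/(1+t^2) over [0,1]; threads share the loop. */
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++)
    {
        double t = (i + 0.5) / N;  /* midpoint of interval i */
        pi += 4.0 / (1.0 + t*t);
    }
    printf("pi = %f\n", pi/N);
    return 0;
}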
CPU
OpenMP
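#include <stdio.h>

#define N 1000000L  /* number of intervals (value assumed; not given on the slide) */

int main(void) {
    double pi = 0.0; long i;
    /* Cray's proposed OpenMP accelerator directives, reproduced from the slide;
       they need a compiler that supports them. */
    #pragma omp acc_region_loop
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++)
    {
        double t = (i + 0.5) / N;  /* midpoint of interval i */
        pi += 4.0 / (1.0 + t*t);
    }
    #pragma omp end acc_region_loop
    printf("pi = %f\n", pi/N);
    return 0;
}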
CPU GPU
Cray Directives
OpenACC: Open Parallel Programming Standard
Easy, Fast, Portable
“OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.”
--Buddy Bland, Titan Project Director, Oak Ridge National Lab
“OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.”
--Michael Wong, CEO, OpenMP Directives Board
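For comparison with the OpenMP and Cray versions of the pi loop, here is how the same reduction might look with an OpenACC directive. This is a sketch and was not shown in the talk: `estimate_pi` is a hypothetical function name, and the `parallel loop` construct with a `reduction` clause comes from the OpenACC specification.

```c
/* Midpoint-rule estimate of pi: integrates 4/(1+t^2) over [0,1] in n steps.
   With an OpenACC compiler the loop is offloaded to the accelerator;
   a compiler without OpenACC enabled typically ignores the pragma (possibly
   with a warning) and runs the loop serially with the same result. */
double estimate_pi(long n)
{
    double sum = 0.0;
    long i;
    #pragma acc parallel loop reduction(+:sum)
    for (i = 0; i < n; i++) {
        double t = (i + 0.5) / n;   /* midpoint of interval i */
        sum += 4.0 / (1.0 + t * t);
    }
    return sum / n;
}
```

The only change from plain serial C is the single directive line, which is the portability argument the slide is making.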
Focus on Exposing Parallelism
With directives, tuning work focuses on exposing parallelism, which makes codes inherently better
Example: Application tuning work using directives for new Titan system at ORNL
S3D: Research more efficient combustion with next-generation fuels
• Tuning top 3 kernels (90% of runtime)
• 3 to 6x faster on CPU+GPU vs. CPU+CPU
• But also improved all-CPU version by 50%
CAM-SE: Answer questions about specific climate change adaptation and mitigation scenarios
• Tuning top key kernel (50% of runtime)
• 6.5x faster on CPU+GPU vs. CPU+CPU
• Improved performance of CPU version by 100%
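The slide does not say whether the quoted speedups apply to the tuned kernels alone or to the whole application. Amdahl's law, which is not part of the talk, relates the two; `overall_speedup` below is a hypothetical helper that assumes the untuned part of the runtime is unchanged.

```c
/* Amdahl's law: overall speedup when `fraction` of the original runtime
   is accelerated by `kernel_speedup` and the remainder runs as before. */
double overall_speedup(double fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup);
}
```

If the figures are per-kernel, 90% coverage at 3x to 6x gives 2.5x to 4x overall, while 50% coverage at 6.5x gives about 1.7x overall and can never exceed 2x. This is one reason to keep widening the fraction of the code that exposes parallelism.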
The Future of HPC is Green
We’re constrained by power
You can’t simultaneously optimize for single-thread performance and power efficiency
Most work must be done by cores designed for throughput and efficiency
Locality is key – explicit control of memory hierarchy
GPUs are the path to our tightly-coupled, energy-efficient, hybrid processor future
Thank You.
Questions?