Steve Scott, Tesla CTO - Nvidia · Steve Scott, Tesla CTO SC’11 November 15, 2011. What goal do...

Steve Scott, Tesla CTO

SC’11

November 15, 2011

What goal do these products have in common?

Performance / W

Exaflop Expectations

Not constant size, cost or power

CM5: ~200 kW

K Computer: ~10 MW

First Exaflop Computer

The Road Ahead is Steep

In the Good Old Days

Leakage was not important, and voltage scaled with feature size

$L' = L/2$

$V' = V/2$

$E' = C'V'^2 = E/8$

$f' = 2f$

$D' = 1/L'^2 = 4D$

$P' = P$

Halve L and get 4x the transistors and 8x the capability for the same power!
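The "same power" result can be checked by multiplying the scaling factors together. The sketch below assumes capacitance scales with L (so C' = C/2) and that chip power is proportional to energy per operation times frequency times density; neither assumption is spelled out on the slide.

```latex
\begin{aligned}
E' &= C'V'^2 = \frac{C}{2}\cdot\frac{V^2}{4} = \frac{E}{8} \\
P' &\propto E'\,f'\,D' = \frac{E}{8}\cdot 2f\cdot 4D = E\,f\,D \propto P \\
\text{capability}' &\propto f'\,D' = 2f\cdot 4D = 8\,f\,D
\end{aligned}
```

Energy per operation falls by exactly the factor that frequency times density rises, which is why the 8x capability comes at constant power.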

MF to GF to TF and almost to PF

Technology was giving us 68% per year in perf/W!

The New Reality

Leakage has limited threshold voltage, largely ending voltage scaling

Halve L and get 2x the capability for the same power.

Processors realized ~50% per year in perf/W…

(spent it on single thread performance)

Technology will give us only 19% per year in perf/W
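One way to read the 68% and 19% figures: compounded over four years, they give roughly the 8x and 2x per-halving gains quoted on the slides. The four-year interval per halving of L is an assumption here (two process generations of about two years each), not a number from the talk, and `compound` is a hypothetical helper name.

```c
/* Cumulative improvement factor after `years` years of compounding at
 * `annual_gain` per year (0.68 means 68% per year).
 * Assumption: one halving of feature size takes about four years. */
double compound(double annual_gain, int years)
{
    double factor = 1.0;
    for (int y = 0; y < years; y++)
        factor *= 1.0 + annual_gain;
    return factor;
}
```

Under that assumption, `compound(0.68, 4)` is about 7.97 (the old 8x per halving) and `compound(0.19, 4)` is about 2.01 (the new 2x per halving).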

The Future Belongs to the Efficient

Chips have become power (not area) constrained

Density increases quadratically as feature size shrinks

Energy/op decreases only linearly as feature size shrinks

Which Takes More Energy?

Performing a 64-bit floating-point FMA:

893,500.288914668 × 43.90230564772498 = 39,226,722.78026233027699

39,226,722.78026233027699 + 2.02789331400154 = 39,226,724.80815564

Or moving the three 64-bit operands 18 mm across the die?

Moving the operands takes over 4.2x the energy (40nm)!

Loading the data from off chip takes >> 100x the energy.

It’s getting worse: in 10nm, relative cost will be 15x!

Flops are cheap. Communication is expensive.
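A rough way to apply these ratios is to count energy in units of one FMA. The sketch below is a toy model, not something from the talk: the function names are hypothetical, both constants are lower bounds taken from the slide ("over 4.2x", ">> 100x"), and it assumes the off-chip figure is also relative to one FMA.

```c
/* Energy in units of one 64-bit FMA at 40nm.
 * ON_DIE_MOVE_COST: moving one FMA's three operands 18 mm across the die.
 * OFF_CHIP_LOAD_COST: loading the data from off chip (lower bound only). */
#define ON_DIE_MOVE_COST   4.2
#define OFF_CHIP_LOAD_COST 100.0

/* Total energy of a kernel, in FMA units. */
double relative_energy(long fmas, long on_die_moves, long off_chip_loads)
{
    return (double)fmas
         + ON_DIE_MOVE_COST * (double)on_die_moves
         + OFF_CHIP_LOAD_COST * (double)off_chip_loads;
}

/* Fraction of the total energy that is spent on arithmetic. */
double flop_fraction(long fmas, long on_die_moves, long off_chip_loads)
{
    double total = relative_energy(fmas, on_die_moves, off_chip_loads);
    return total > 0.0 ? (double)fmas / total : 0.0;
}
```

In this model, a kernel that moves operands across the die for every FMA spends less than a fifth of its energy on arithmetic, and a single off-chip load per FMA pushes that below 1%.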

Achieving Energy Efficiency

Reduce overhead.

Minimize data motion.
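Loop tiling is one common way to minimize data motion in software. The talk does not show this code; it is a minimal sketch of the idea on a matrix multiply, with `TILE` as a hypothetical tuning parameter and `matmul_naive` included only as a reference for comparison.

```c
#include <stddef.h>

#define TILE 32  /* hypothetical tile edge; tune to the cache or scratchpad size */

/* Reference version: C += A * B for n x n row-major matrices. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] += sum;
        }
}

/* Tiled version: same result, but each TILE x TILE block of A and B is
 * reused while it is still close to the arithmetic units instead of
 * being fetched from memory again. */
void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE) {
                /* Clamp tile bounds so sizes that are not multiples of TILE work. */
                size_t i_end = ii + TILE < n ? ii + TILE : n;
                size_t k_end = kk + TILE < n ? kk + TILE : n;
                size_t j_end = jj + TILE < n ? jj + TILE : n;
                for (size_t i = ii; i < i_end; i++)
                    for (size_t k = kk; k < k_end; k++) {
                        double a = A[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
            }
}
```

Both functions perform the same number of flops; only the order of memory accesses differs, which is the point of the slide.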

Multi-core CPUs

Industry has gone multi-core as a first response to power issues

Performance through parallelism

Dial back complexity and clock rate

Exploit locality

Less than 2% of chip power today goes to flops.

But CPUs are fundamentally designed for single thread performance rather than energy efficiency

Fast clock rates with deep pipelines

Data and instruction caches optimized for latency

Superscalar issue with out-of-order execution

Dynamic conflict detection

Lots of predictions and speculative execution

Lots of instruction overhead per operation

CPU (Westmere, 32nm): 1690 pJ/FLOP

Optimized for Latency

Caches

GPU (Fermi, 40nm): 225 pJ/FLOP

Optimized for Throughput

Explicit Management of On-chip Memory
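To see what these per-flop energies would mean at exascale, multiply energy per flop by flops per second. This is back-of-the-envelope arithmetic, not a projection from the talk: it assumes the pJ/FLOP figures stay unchanged at 10^18 flops/s and ignores everything outside the processor. `power_watts` is a hypothetical helper name.

```c
/* Sustained power in watts for a machine delivering `flops_per_sec`
 * at `pj_per_flop` picojoules per floating-point operation. */
double power_watts(double pj_per_flop, double flops_per_sec)
{
    return pj_per_flop * 1e-12 * flops_per_sec;
}
```

With the slide's numbers, `power_watts(225.0, 1e18)` is about 225 MW and `power_watts(1690.0, 1e18)` is about 1.69 GW. Both are far above the ~10 MW the K Computer draws for 10.5 PFLOPS.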

Growing Momentum for GPUs in Supercomputing

Tesla Powers 3 of 5 Top Systems (November 2011)

#1 : K Computer, 88K Fujitsu SPARC CPUs, 10.5 PFLOPS

#2 : Tianhe-1A, 7168 Tesla GPUs, 2.6 PFLOPS

#3 : Jaguar, 36K AMD Opteron CPUs, 1.8 PFLOPS

#4 : Nebulae, 4650 Tesla GPUs, 1.3 PFLOPS

#5 : Tsubame 2.0, 4224 Tesla GPUs, 1.2 PFLOPS (most efficient PF system)

Titan: 18000 Tesla GPUs, >20 PFLOPS

NVIDIA GPU Computing Uptake

Compute-capable NVIDIA GPUs: >300,000,000

NVIDIA SW Development Toolkit Downloads: >500,000

Active NVIDIA GPU Computing Developers: >100,000

Universities Teaching GPU Computing: >450

Widespread adoption by HPC OEMs

ORNL Adopts GPUs for Next-Gen Supercomputer

Titan Cray XK6

18,000 Tesla GPUs

20+ PetaFlops

~90% of flops from GPUs

2x Faster, 3x More Energy Efficient (and much smaller!) than Current #1 (K Computer)

NCSA Mixes GPUs into Blue Waters

“By incorporating a future version of the XK6 system, Blue Waters will provide a bridge to the future of scientific computing.”

-- NCSA Director Thom Dunning

Contains over 30 Cray XK cabinets

With over 3000 NVIDIA Tesla GPUs

What About Programming?

GPU Libraries: Plug In & Play

Parallel Algorithms

QUDA: Lattice QCD

Dense Linear Algebra

cuBLAS

Directives: Ease of Programming and Portability

Available from PGI, CAPS, and soon Cray

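OpenMP (CPU)

#include <stdio.h>

#define N 1000000  /* number of integration steps (value not given on the slide) */

int main(void) {
    double pi = 0.0; long i;

    /* Midpoint-rule integration of 4/(1+t^2) over [0,1], which equals pi. */
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++)
    {
        double t = (i + 0.5) / N;  /* midpoint of the i-th subinterval */
        pi += 4.0 / (1.0 + t * t);
    }
    printf("pi = %f\n", pi / N);
    return 0;
}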

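Cray Directives (CPU + GPU)

#include <stdio.h>

#define N 1000000  /* number of integration steps (value not given on the slide) */

int main(void) {
    double pi = 0.0; long i;

    /* Cray's accelerator directives, kept as shown on the slide: they mark
       the OpenMP loop below as an accelerator region. */
    #pragma omp acc_region_loop
    #pragma omp parallel for reduction(+:pi)
    for (i = 0; i < N; i++)
    {
        double t = (i + 0.5) / N;  /* midpoint of the i-th subinterval */
        pi += 4.0 / (1.0 + t * t);
    }
    #pragma omp end acc_region_loop
    printf("pi = %f\n", pi / N);
    return 0;
}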

OpenACC: Open Parallel Programming Standard

Easy, Fast, Portable

“OpenACC will enable programmers to easily develop portable applications that maximize the performance and power efficiency benefits of the hybrid CPU/GPU architecture of Titan.”

-- Buddy Bland, Titan Project Director, Oak Ridge National Lab

“OpenACC is a technically impressive initiative brought together by members of the OpenMP Working Group on Accelerators, as well as many others. We look forward to releasing a version of this proposal in the next release of OpenMP.”

-- Michael Wong, CEO, OpenMP Directives Board
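For comparison with the OpenMP and Cray variants, the same pi loop might look like this with an OpenACC directive. This is a sketch, not code from the talk: it uses the OpenACC `parallel loop` combined directive with a `reduction` clause, evaluates the integrand at the interval midpoint (i + 0.5)/n, and `estimate_pi` is a hypothetical function name.

```c
/* Midpoint-rule estimate of pi as the integral of 4/(1+x^2) over [0,1].
 * The directive asks an OpenACC compiler to offload the loop and combine
 * the per-thread partial sums. A compiler without OpenACC support ignores
 * the pragma and runs the loop serially, giving the same result up to
 * floating-point rounding. */
double estimate_pi(long n)
{
    double pi = 0.0;
    long i;

    #pragma acc parallel loop reduction(+:pi)
    for (i = 0; i < n; i++) {
        double t = (i + 0.5) / n;
        pi += 4.0 / (1.0 + t * t);
    }
    return pi / n;
}
```

Calling `estimate_pi(1000000)` from a `main` and printing the result gives a value close to 3.14159. The loop body is unchanged from the CPU version; only the directive differs, which is the portability argument the slide makes.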

Focus on Exposing Parallelism

With directives, tuning work focuses on exposing parallelism, which makes codes inherently better

Example: Application tuning work using directives for new Titan system at ORNL

S3D: Research more efficient combustion with next-generation fuels

• Tuning top 3 kernels (90% of runtime)

• 3 to 6x faster on CPU+GPU vs. CPU+CPU

• But also improved all-CPU version by 50%

CAM-SE: Answer questions about specific climate change adaptation and mitigation scenarios

• Tuning top key kernel (50% of runtime)

• 6.5x faster on CPU+GPU vs. CPU+CPU

• Improved performance of CPU version by 100%

The Future of HPC is Green

We’re constrained by power

You can’t simultaneously optimize for single thread performance and power efficiency

Most work must be done by cores designed for throughput and efficiency

Locality is key – explicit control of memory hierarchy

GPUs are the path to our tightly-coupled, energy-efficient, hybrid processor future

Thank You.

Questions?
