© NVIDIA Corporation 2009

Early 3D Graphics

Perspective study of a chalice

Paolo Uccello, circa 1450


Early Graphics Hardware

Perspective study of a chalice

Paolo Uccello, circa 1450

Artist using a perspective machine

Albrecht Dürer, 1525


Early Electronic Graphics Hardware

SKETCHPAD: A Man-Machine Graphical Communication System

Ivan Sutherland, 1963


The Graphics Pipeline

The Geometry Engine: A VLSI Geometry System for Graphics

Jim Clark, 1982


The Graphics Pipeline

Vertex Transform & Lighting

Triangle Setup & Rasterization

Texturing & Pixel Shading

Depth Test & Blending

Framebuffer


The Graphics Pipeline

Key abstraction of real-time graphics

Hardware used to look like this

One chip/board per stage

Fixed data flow through pipeline

Vertex → Rasterize → Pixel → Test & Blend → Framebuffer


!"#$%&'()(*+%!"#$%&'()*%)" +,#-.%/0,%-.//0&12%34+

SGI RealityEngine (1993)

Vertex → Rasterize → Pixel → Test & Blend → Framebuffer



SGI InfiniteReality (1997)

Vertex → Rasterize → Pixel → Test & Blend → Framebuffer


The Graphics Pipeline

Remains a useful abstraction

Hardware used to look like this

Vertex → Rasterize → Pixel → Test & Blend → Framebuffer


The Graphics Pipeline

Hardware used to look like this:

Vertex, pixel processing became programmable

Vertex → Rasterize → Pixel → Test & Blend → Framebuffer

!!"#$%&"'&()$*"+)(,-(./"-0)"+$1(231/)"$**1'1-0

4456-7$644 8-1*"8)%9**:,6-$';"9<",6-$';"=<",6-$';">?

@

10' 1 A"'&()$*B*CDC E"76-%FG1.DC ;"76-%FB*CDCH

>I1J"A"9I1J"E"=I1JH

K


The Graphics Pipeline

Hardware used to look like this

Vertex, pixel processing became programmable

New stages added

Vertex → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer

!!"#$%&"'&()$*"+)(,-(./"-0)"+$1(231/)"$**1'1-0

4456-7$644 8-1*"8)%9**:,6-$';"9<",6-$';"=<",6-$';">?

@

10'"1"A"'&()$*B*CDC E"76-%FG1.DC ;"76-%FB*CDCH

>I1J"A"9I1J"E"=I1JH

K


The Graphics Pipeline

Hardware used to look like this

Vertex, pixel processing became programmable

New stages added

GPU architecture increasingly centers around shader execution

Vertex → Tessellation → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer

!!"#$%&"'&()$*"+)(,-(./"-0)"+$1(231/)"$**1'1-0

4456-7$644 8-1*"8)%9**:,6-$';"9<",6-$';"=<",6-$';">?

@

10'"1"A"'&()$*B*CDC E"76-%FG1.DC ;"76-%FB*CDCH

>I1J"A"9I1J"E"=I1JH

K

Modern GPUs: Unified Design

[Diagram: discrete design, with fixed shader stages A, B, C, D each reading and writing its own intermediate buffers, vs. unified design with a single shader core]

Vertex shaders, pixel shaders, etc. become threads running different programs on a flexible core

GeForce 8: Modern GPU Architecture

[Block diagram: Host → Input Assembler → Vertex / Geometry / Pixel Thread Issue and Setup & Rasterize, feeding a unified Thread Processor array (SP pairs with shared texture units and L1 caches), backed by L2 caches and framebuffer partitions]


Modern GPU Architecture: GT200

[Block diagram: Host → Input Assembler → Vertex / Geometry / Pixel Thread Issue, Setup & Rasterize, and a Thread Scheduler feeding a larger array of streaming processors (SP triples with shared texture units and L1 caches), backed by L2 caches and framebuffer partitions]

Next-Gen GPU Architecture: Fermi

NVIDIA next-gen “Fermi” architecture

[Die diagram: GigaThread engine, host interface, six DRAM interfaces, shared L2 cache]

GPUs Today

Lessons from Graphics Pipeline

Throughput is paramount

Create, run, & retire lots of threads very rapidly

Use multithreading to hide latency

!""# $%%% $%%# $%!%

!"#$%&'(

)*%+,-./

012-.314 '56

')*%+,-./

012-.31 27

&'5*%+,-./

012-.31%((88

6(&*%+,-./

012-.31 )%

68*%+,-./

!"#$%&'

)9 +,-./

Perspective

1 TeraFLOP in 1993

The “New” Moore’s Law

Computers no longer get faster, just wider

You must re-think your algorithms to be parallel!

Data-parallel computing is most scalable solution

Why GPU Computing?

Throughput architecture → massive parallelism

Massive parallelism → computational horsepower

Fact: nobody cares about theoretical peak

Challenge: harness GPU power for real application performance

Peak performance, CPU vs. GPU:

NVIDIA Tesla C1060: 240 cores, 936 GFLOPS

Intel Core i7 960: 4 cores, 102 GFLOPS

CUDA Successes

146X  Medical Imaging, U of Utah
36X   Molecular Dynamics, U of Illinois
18X   Video Transcoding, Elemental Tech
50X   Matlab Computing, AccelerEyes
100X  Astrophysics, RIKEN
149X  Financial Simulation, Oxford
47X   Linear Algebra, Universidad Jaime
20X   3D Ultrasound, Techniscan
130X  Quantum Chemistry, U of Illinois
30X   Gene Sequencing, U of Maryland


Accelerating Insight

CPU Only → Heterogeneous with Tesla GPU:

4.6 Days → 27 Minutes
2.7 Days → 30 Minutes
8 Hours → 13 Minutes
3 Hours → 16 Minutes


GPU Computing 1.0

(Ignoring prehistory: Ikonas, Pixel Machine, Pixel-Planes…)

GPU Computing 1.0: compute pretending to be graphics

Disguise data as textures or geometry

Disguise algorithm as render passes

“Trick graphics pipeline into doing your computation”

Term GPGPU coined by Mark Harris

Typical GPGPU Constructs

GPGPU Hardware & Algorithms

GPUs get progressively more capable

Fixed-function → register combiners → shaders

fp32 pixel hardware greatly extends reach

Algorithms get more sophisticated

Cellular automata → PDE solvers → ray tracing

Clever graphics tricks

High-level shading languages emerge

GPU Computing 2.0: Enter CUDA

GPU Computing 2.0: direct compute

Program GPU directly, no graphics-based restrictions

GPU Computing supplants graphics-based GPGPU

November 2006: NVIDIA introduces CUDA


CUDA In One Slide

Thread: per-thread local memory

Block: per-block shared memory; local barrier

Kernel foo(): per-device global memory

Global barrier

Kernel bar()
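A minimal sketch (not from the slide) of how these pieces map to CUDA C: per-thread locals, a per-block __shared__ array with __syncthreads() as the local barrier, per-device global memory, and the implicit global barrier between dependent kernel launches. Kernel and variable names here are illustrative.

// Illustrative only: a block-wise sum whose per-block results are combined
// by a second kernel launched after the first one completes.
// Assumes the first kernel is launched with 256 threads per block.
__global__ void partial_sums(const float *in, float *block_sums, int n)
{
    __shared__ float buf[256];                     // per-block shared memory
    int tid = threadIdx.x;                         // per-thread local variables
    int i   = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // local barrier for this block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = buf[0]; // per-device global memory
}

__global__ void total_sum(const float *block_sums, float *total, int nblocks)
{
    // Launching this kernel after partial_sums acts as the global barrier:
    // all of the first kernel's writes are visible before this one runs.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float s = 0.0f;
        for (int b = 0; b < nblocks; ++b) s += block_sums[b];
        *total = s;
    }
}

Launched back to back on the same stream, the second kernel only starts after the first finishes, which is the global barrier the slide refers to.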

CUDA C Example

!"#$%&'()*+&,-#'./#01%02%3."'1%'2%3."'1%4(2%3."'1%4*5

6

3"- /#01%#%7%89%# : 09%;;#5

*<#=%7%'4(<#=%;%*<#=9

>

??%@0!"A,%&,-#'. BCDEF%A,-0,.

&'()*+&,-#'./02%GH82%(2%*59

++I."J'.++%!"#$%&'()*+)'-'..,./#01%02%3."'1%'2%3."'1%4(2%3."'1%4*5

6

#01%#%7%J."KA@$(H(4J."KAL#MH(%;%1N-,'$@$(H(9

#3 /# : 05%%*<#=%7%'4(<#=%;%*<#=9

>

??%@0!"A,%)'-'..,. BCDEF%A,-0,. O#1N%GPQ%1N-,'$&?J."KA

#01%0J."KA&%7%/0%;%GPP5%?%GPQ9

&'()*+)'-'..,.:::0J."KA&2%GPQRRR/02%GH82%(2%*59

Serial C Code

Parallel C Code


Heterogeneous Programming

Use the right processor for the right job

Serial Code
. . .
Parallel Kernel
foo<<< nBlk, nTid >>>(args);

Serial Code
. . .
Parallel Kernel
bar<<< nBlk, nTid >>>(args);

GPU Computing 3.0: An Ecosystem

GPU Computing 3.0: an emerging ecosystem

Hardware & product lines

Algorithmic sophistication

Cross-platform standards

Education & research

Consumer applications

High-level languages

!"#$%&'()*$+,$-./)(*01$$%&0()2$3,2&*4$56/+7,89*4$/55*


Products

CUDA is in products from laptops to supercomputers



Emerging HPC Products

New class of hybrid GPU-CPU servers

SuperMicro 1U GPU Server: 2 Tesla M1060 GPUs

Bull bullx Blade Enclosure: up to 18 Tesla M1060 GPUs


Algorithmic Sophistication

Sort

Sparse matrix (see the SpMV sketch below)

Hash tables

Fast multipole method

Ray tracing (parallel tree traversal)

012$3"4/5.4$6'7($%89.)():+.)7#$76$;9+'4"$<+.')=-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007


Cross-Platform Standards

GPU Computing Applications

NVIDIA GPU with the CUDA Parallel Computing Architecture

CUDA C, OpenCL™, DirectCompute, CUDA Fortran, Java, Python, .NET, MATLAB, …

OpenCL is a trademark of Apple Inc., used under license to the Khronos Group Inc.


CUDA Momentum

Over 250 universities teach CUDA

Over 1,000 research papers

Over 500 CUDA apps: www.nvidia.com/CUDA

CUDA-powered TSUBAME: 29th fastest supercomputer in the world

180 million CUDA GPUs

100,000 active developers


Data Structures

! !"#$%!&&'()*+(,)(+!-#

! !"#$%!&&"-%!,)(+!-#

! !"#$%!&&'()*+(,.!#

! Etc.

Algorithms

! !"#$%!&&%-#!

! !"#$%!&&#('$+(

! !"#$%!&&(/+0$%*)(,%+12

! Etc.

thrust

Thrust: a library of data-parallel algorithms & data structures for CUDA, with an interface similar to the C++ Standard Template Library

C++ template metaprogramming automatically chooses the fastest code path at compile time


thrust::sort

sort.cu

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;

    // sort 140M 32b keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());

    return 0;
}

http://thrust.googlecode.com
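The other algorithms listed above follow the same STL-like pattern. For example, continuing inside main() from sort.cu (a sketch; the extra headers needed are thrust/reduce.h, thrust/scan.h and thrust/functional.h):

    // sum of all elements, computed on the device
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());

    // exclusive prefix sum into a second device_vector
    thrust::device_vector<int> prefix(d_vec.size());
    thrust::exclusive_scan(d_vec.begin(), d_vec.end(), prefix.begin());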


GPU Begins to Vanish

Ever-increasing number of codes & commercial packages “just go faster” when a GPU is present

Some bio/chem codes available or porting:

NAMD / VMD, GROMACS (alpha), HOOMD, GPU HMMER, MUMmerGPU, AutoDock, …

LAMMPS, CHARMM, Q-Chem, Gaussian, AMBER, …

Consumer applications, e.g. Photoshop

Badaboom, vReveal, Nero MoveIt

OS: Microsoft Windows 7, Mac OS X Snow Leopard


Next-Gen GPU Architecture: Fermi

3 billion transistors

Over 2x the cores (512 total)

~2x the memory bandwidth

L1 and L2 caches

8x the peak DP performance

ECC

C++


SM Microarchitecture

[SM block diagram: Instruction Cache, dual Scheduler / Dispatch units, Register File, 32 Cores, 16 Load/Store Units, 4 Special Function Units, Interconnect Network, 64K Configurable Cache/Shared Mem, Uniform Cache]

Objective: optimize for GPU computing

New ISA

Revamp issue / control flow

New CUDA core architecture

32 cores per SM (512 total)

64KB of configurable L1$ / shared memory

Ops / clk per SM: FP32: 32, FP64: 16, INT: 32, SFU: 4, LD/ST: 16


New IEEE 754-2008 arithmetic standard

Fused Multiply-Add (FMA) for SP & DP

New integer ALU optimized for 64-bit and extended precision ops
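A minimal sketch of what FMA means in device code: fmaf() / fma() compute a*b + c in single and double precision with a single rounding step. The kernel and variable names here are illustrative, not from the deck.

__global__ void axpy_fma(const float *a, const float *b, const float *c,
                         float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fmaf(a[i], b[i], c[i]);   // single-precision fused multiply-add
}

__global__ void axpy_fma_dp(const double *a, const double *b, const double *c,
                            double *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fma(a[i], b[i], c[i]);    // double-precision fused multiply-add
}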

[CUDA core detail: Dispatch Port, Operand Collector, FP Unit, INT Unit, Result Queue]

Hardware Thread Scheduling

Concurrent kernel execution + faster context switch

[Timeline: serial kernel execution (Kernels 1 through 5 run back to back) vs. parallel kernel execution (independent kernels overlap in time)]
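A rough sketch of how concurrent kernels are expressed from the host: independent kernels launched into different CUDA streams may overlap on hardware that supports it. kernelA / kernelB and the launch configurations are placeholders.

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<64, 256, 0, s1>>>();   // independent work in stream 1
    kernelB<<<64, 256, 0, s2>>>();   // independent work in stream 2

    cudaDeviceSynchronize();         // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);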

More Fermi Goodness

ECC protection for DRAM, L2, L1, RF

Unified 40-bit address space for local, shared, global

5-20x faster atomics (see the sketch below)

Dual DMA engines for CPU ↔ GPU transfers

ISA extensions for C++ (e.g. virtual functions)
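As a rough example of code that leans on atomics, here is a global-memory histogram where every thread bumps one bin. This is a sketch; the 256-bin layout and all names are assumptions.

// Each thread classifies one input byte and atomically increments its bin.
__global__ void histogram256(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // contended global-memory atomic
}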

Key CUDA Challenges

Express other programming models elegantly

Producer-consumer: persistent thread blocks reading and writing work queues (see the sketch after this list)

Task-parallel: kernel, thread block or warp as parallel task

C*<"$7*FF*2$%6*'#+-;-#+($G?HB$6/<<#+2-

Foster more high-level languages & platforms

Improve & mature development environment

Key GPU Workloads

Computational graphics

Scientific and numeric computing

Image processing: video & images

Computer vision

Speech & natural language

Machine learning


The Future of Computing?

Forward-looking statements:

All future interesting problems are throughput problems.

GPUs will evolve to be the general-purpose throughput processors.

CPUs as we know them will become (already are?) “good enough”, and shrink to a corner of the die.

Final Thoughts: Education

We should teach parallel computing in CS 1 or CS 2

G*F6;<#+-$I*2,<$9#<$=/-<#+A$:;-<$'@I#+

Manycore is the future of computing

Heapsort and mergesort

Both O(n lg n)

One parallel-friendly, one not

Students need to understand this early

now

Miscellaneous Thoughts


NVIDIA Resources Available

NVIDIA enables heterogeneous computing research and teaching:

Grants for leading researchers in GPU Computing

Ph.D. Fellowship program

Professor Partnership program

Discounted hardware

GPU Computing Ventures: venture fund focused on GPU computing

Technical Training and Assistance