© NVIDIA Corporation 2009
Early 3D Graphics
Perspective study of a chalice
Paolo Uccello, circa 1450
Early Graphics Hardware
Perspective study of a chalice
Paolo Uccello, circa 1450
Artist using a perspective machine
Albrecht Dürer, 1525
Early Electronic Graphics Hardware
SKETCHPAD: A Man-Machine Graphical Communication System
Ivan Sutherland, 1963
The Graphics Pipeline
The Geometry Engine: A VLSI Geometry System for Graphics
Jim Clark, 1982
The Graphics Pipeline
Vertex Transform & Lighting
Triangle Setup & Rasterization
Texturing & Pixel Shading
Depth Test & Blending
Framebuffer
The Graphics Pipeline
Key abstraction of real-time graphics
Hardware used to look like this
One chip/board per stage
Fixed data flow through pipeline
Vertex → Rasterize → Pixel → Test & Blend → Framebuffer
Kurt Akeley, "RealityEngine Graphics", SIGGRAPH 93
SGI RealityEngine (1993)
Vertex → Rasterize → Pixel → Test & Blend → Framebuffer
John Montrym et al., "InfiniteReality: A Real-Time Graphics System", SIGGRAPH 97
SGI InfiniteReality (1997)
Vertex → Rasterize → Pixel → Test & Blend → Framebuffer
The Graphics Pipeline
Remains a useful abstraction
Hardware used to look like this
Vertex → Rasterize → Pixel → Test & Blend → Framebuffer
The Graphics Pipeline
Hardware used to look like this:
Vertex, pixel processing became programmable
Vertex → Rasterize → Pixel → Test & Blend → Framebuffer
// Each thread performs one pairwise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}
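The kernel maps each array element to one thread via i = threadIdx.x + blockDim.x * blockIdx.x. A quick CPU-side sketch (my illustration, not from the slides) makes the implicit grid/block parallelism explicit as two nested loops:

```cpp
#include <vector>

// CPU emulation of the CUDA index calculation (illustrative sketch):
// i = threadIdx.x + blockDim.x * blockIdx.x, with the implicit
// grid/block parallelism written out as two nested loops.
std::vector<float> vecAddEmulated(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  int blockDim) {
    std::vector<float> C(A.size());
    int n = static_cast<int>(A.size());
    int gridDim = (n + blockDim - 1) / blockDim;      // enough blocks to cover n
    for (int blockIdx = 0; blockIdx < gridDim; ++blockIdx)
        for (int threadIdx = 0; threadIdx < blockDim; ++threadIdx) {
            int i = threadIdx + blockDim * blockIdx;  // global element index
            if (i < n)                                // guard the partial last block
                C[i] = A[i] + B[i];
        }
    return C;
}
```

The `if (i < n)` guard handles sizes not divisible by the block width; the SAXPY example later in the talk adds the same guard on the GPU.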
The Graphics Pipeline
Hardware used to look like this
Vertex, pixel processing became programmable
New stages added
Vertex → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer
The Graphics Pipeline
Hardware used to look like this
Vertex, pixel processing became programmable
New stages added
GPU architecture increasingly
centers around shader execution
Vertex → Tessellation → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer
Modern GPUs: Unified Design
[Figure: discrete design (Shader A, B, C, D stages, each with its own input/output buffers) vs. unified design (a single shared Shader Core)]
Vertex shaders, pixel shaders, etc. become threads
running different programs on a flexible core
[Diagram: Host, Input Assembler, Vertex/Geom/Pixel Thread Issue, Setup / Rasterize, a Thread Processor array of eight clusters (SP pairs with L1 caches and texture filter units), and multiple L2 / framebuffer partitions]
GeForce 8: Modern GPU Architecture
Modern GPU Architecture: GT200
[Diagram: Host, Input Assembler, Vertex/Geom/Pixel Thread Issue, Setup / Rasterize, Thread Scheduler, ten clusters of SP triples with L1 caches and texture filter units, and multiple L2 / framebuffer partitions]
[Die diagram: central L2 cache surrounded by DRAM interfaces, a host interface, and the GigaThread scheduler]
NVIDIA's next-generation CUDA architecture
Next-Gen GPU Architecture: Fermi
GPUs Today
Lessons from Graphics Pipeline
Throughput is paramount
Create, run, & retire lots of threads very rapidly
Use multithreading to hide latency
[Chart, 1995-2010: GPU transistor counts: RIVA 128 (3M xtors), GeForce 256 (23M xtors), GeForce 3 (60M xtors), GeForce FX (125M xtors), GeForce 8800 (681M xtors), GT200 (1.4B xtors)]
Perspective
1 TeraFLOP in 1993
The "New" Moore's Law
Computers no longer get faster, just wider
You must re-think your algorithms to be parallel!
Data-parallel computing is the most scalable solution
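As a minimal illustration of that claim (my sketch, not from the slides), a SAXPY-style loop can be split across host threads; each worker owns a disjoint chunk, so scaling comes from adding workers ("wider"), not from a faster clock:

```cpp
#include <algorithm>
#include <thread>
#include <vector>

// Data-parallel SAXPY sketch: chunks are disjoint, so there are no
// cross-thread dependencies and no locking is needed.
void saxpyChunked(int nthreads, float a,
                  const std::vector<float>& x, std::vector<float>& y) {
    int n = static_cast<int>(x.size());
    int chunk = (n + nthreads - 1) / nthreads;  // elements per worker
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            int begin = t * chunk;
            int end = std::min(n, begin + chunk);
            for (int i = begin; i < end; ++i)   // independent iterations
                y[i] = a * x[i] + y[i];
        });
    for (std::thread& w : workers) w.join();
}
```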
Why GPU Computing?
Throughput architecture → massive parallelism
Massive parallelism → computational horsepower
Fact:
nobody cares about theoretical peak
Challenge:
harness GPU power for real application performance
[Chart: peak GFLOPS over time, CPU vs. GPU: NVIDIA Tesla C1060 (240 cores, 933 GFLOPs) vs. Intel Core i7 965 (4 cores, 102 GFLOPs)]
CUDA Successes
146X  Medical Imaging (U of Utah)
36X   Molecular Dynamics (U of Illinois)
18X   Video Transcoding (Elemental Tech)
50X   Matlab Computing (AccelerEyes)
100X  Astrophysics (RIKEN)
149X  Financial simulation (Oxford)
47X   Linear Algebra (Universidad Jaime)
20X   3D Ultrasound (Techniscan)
130X  Quantum Chemistry (U of Illinois)
30X   Gene Sequencing (U of Maryland)
Accelerating Insight
[Chart, CPU only vs. heterogeneous with Tesla GPU: 4.6 Days to 27 Minutes; 2.7 Days to 30 Minutes; 8 Hours to 13 Minutes; 3 Hours to 16 Minutes]
GPU Computing 1.0
(Ignoring prehistory: Ikonas, Pixel Machine, Pixel-Planes 5)
GPU Computing 1.0: compute pretending to be graphics
Disguise data as textures or geometry
Disguise algorithm as render passes
Trick the graphics pipeline into doing your computation!
Term GPGPU coined by Mark Harris
Typical GPGPU Constructs
GPGPU Hardware & Algorithms
GPUs get progressively more capable
Fixed-function → register combiners → shaders
fp32 pixel hardware greatly extends reach
Algorithms get more sophisticated
Cellular automata → PDE solvers → ray tracing
Clever graphics tricks
High-level shading languages emerge
GPU Computing 2.0: Enter CUDA
GPU Computing 2.0: direct compute
Program GPU directly, no graphics-based restrictions
GPU Computing supplants graphics-based GPGPU
November 2006: NVIDIA introduces CUDA
CUDA In One Slide
Thread: per-thread local memory
Block: per-block shared memory, local barrier
Kernel foo(): per-device global memory
Global barrier
Kernel bar()
CUDA C Example

Serial C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Heterogeneous Programming
Use the right processor for the right job
Serial Code
Parallel Kernel: foo<<< nBlk, nTid >>>(args);
Serial Code
Parallel Kernel: bar<<< nBlk, nTid >>>(args);
GPU Computing 3.0: An Ecosystem
GPU Computing 3.0: an emerging ecosystem
Hardware & product lines
Algorithmic sophistication
Cross-platform standards
Education & research
Consumer applications
High-level languages
Products
CUDA is in products from laptops to supercomputers
Emerging HPC Products
New class of hybrid GPU-CPU servers
SuperMicro 1U GPU Server: 2 Tesla M1060 GPUs
Bull Bullx Blade Enclosure: up to 18 Tesla M1060 GPUs
Algorithmic Sophistication
Sort
Sparse matrix
Hash tables
Fast multipole method
Ray tracing (parallel tree traversal)
From "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Williams et al., Supercomputing 2007
Cross-Platform Standards
GPU Computing Applications
NVIDIA GPU with the CUDA Parallel Computing Architecture
CUDA C, OpenCL™, DirectCompute, CUDA Fortran
Java, Python, .NET, MATLAB
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
CUDA Momentum
Over 250 universities teach CUDA
Over 1000 research papers
Over 500 CUDA apps (www.nvidia.com/CUDA)
CUDA-powered TSUBAME: 29th fastest supercomputer in the world
180 million CUDA GPUs
100,000 active developers
thrust
thrust: a library of data parallel algorithms & data structures with an interface similar to the C++ Standard Template Library, for CUDA
C++ template metaprogramming automatically chooses the fastest code path at compile time
Data Structures:
thrust::device_vector
thrust::host_vector
thrust::device_ptr
Etc.
Algorithms:
thrust::sort
thrust::reduce
thrust::exclusive_scan
Etc.
thrust::sort

// sort.cu
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;

    // sort 140M 32b keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());
    return 0;
}
http://thrust.googlecode.com
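For comparison, a sketch of the equivalent program against the C++ Standard Template Library itself (my illustration, using only standard headers); thrust mirrors this iterator interface, swapping std::vector for thrust::device_vector and std::sort for thrust::sort:

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// STL analogue of the thrust example: same shape, same iterator-based
// interface; thrust swaps in device containers and GPU algorithms.
std::vector<int> generateAndSort(std::size_t n) {
    std::vector<int> vec(n);
    std::generate(vec.begin(), vec.end(), std::rand);  // random data on the host
    std::sort(vec.begin(), vec.end());                 // sorted in place on the CPU
    return vec;
}
```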
GPU Begins to Vanish
Ever-increasing number of codes & commercial packages "just go faster" when a GPU is present
Some bio/chem codes available or porting:
NAMD / VMD, GROMACS (alpha), HOOMD, GPU HMMER, MUMmerGPU, AutoDock,
LAMMPS, CHARMM, Q-Chem, Gaussian, AMBER
Consumer applications, e.g. Photoshop,
Badaboom, vReveal, Nero MoveIt
OS: Microsoft Windows 7, Mac OS X Snow Leopard
Next-Gen GPU Architecture: Fermi
3 billion transistors
Over 2x the cores (512 total)
~2x the memory bandwidth
L1 and L2 caches
8x the peak DP performance
ECC
C++
SM Microarchitecture
[Diagram of one SM: Instruction Cache; dual Scheduler + Dispatch; Register File; 32 Cores; Load/Store Units x 16; Special Func Units x 4; Interconnect Network; 64K Configurable Cache/Shared Mem; Uniform Cache]
Objective: optimize for GPU computing
New ISA
Revamp issue / control flow
New CUDA core architecture
32 cores per SM (512 total)
64KB of configurable L1$ / shared memory
Ops/clk: FP32: 32, FP64: 16, INT: 32, SFU: 4, LD/ST: 16
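The issue rates above imply a simple peak-throughput estimate: each FP32 core can retire one FMA (2 flops) per clock. A small sketch of the arithmetic (mine, not from the slides; the 1.5 GHz shader clock is a hypothetical value, since the slide does not state one):

```cpp
// Peak-throughput estimate. Only the core count (512) and the
// 2 flops per FMA come from the slides; the 1.5 GHz clock is a
// hypothetical value for illustration.
double peakGflops(int cores, double flopsPerClock, double clockGhz) {
    return cores * flopsPerClock * clockGhz;  // GFLOPS
}
```

With these inputs, 512 x 2 x 1.5 = 1536 GFLOPS single precision.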
New IEEE 754-2008 arithmetic standard
Fused Multiply-Add(FMA) for SP & DP
New integer ALU optimized for 64-bit and extended precision ops
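One way to observe the single-rounding property of FMA (my example, not from the slides) is to recover the low-order tail of a product that a separately rounded multiply must discard:

```cpp
#include <cmath>

// With a = 1 + 2^-30, the exact product a*a = 1 + 2^-29 + 2^-60 has a
// tail term (2^-60) that an ordinary multiply must round away; fma keeps
// it, because the multiply-add is rounded only once.
double fmaTail() {
    double a = 1.0 + std::ldexp(1.0, -30);
    volatile double prod = a * a;         // separately rounded product: 1 + 2^-29
    double separate = prod - 1.0;         // exact subtraction: 2^-29
    double fused = std::fma(a, a, -1.0);  // single rounding: 2^-29 + 2^-60
    return fused - separate;              // the recovered tail: 2^-60
}
```

The `volatile` forces the product to be rounded on its own, preventing the compiler from contracting the multiply-subtract into an FMA itself.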
[Diagram of one CUDA Core: Dispatch Port, Operand Collector, FP Unit, INT Unit, Result Queue]
Hardware Thread Scheduling
Concurrent kernel execution + faster context switch
[Timeline: serial kernel execution runs Kernels 1-5 one after another; parallel kernel execution overlaps independent kernels on the GPU]
More Fermi Goodness
ECC protection for DRAM, L2, L1, RF
Unified 40-bit address space for local, shared, global
5-20x faster atomics
Dual DMA engines for CPU ↔ GPU transfers
ISA extensions for C++ (e.g. virtual functions)
Key CUDA Challenges
Express other programming models elegantly
Producer-consumer: persistent thread blocks reading and writing work queues
Task-parallel: kernel, thread block or warp as parallel task
Both are common "power-user" CUDA patterns
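A host-side C++ sketch of the producer-consumer pattern (my illustration; on a GPU the consumers would be persistent thread blocks polling a queue in global memory):

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// A work queue: producers push() items, persistent consumers pop() them,
// blocking while the queue is empty.
class WorkQueue {
public:
    void push(int item) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(item);
        }
        ready_.notify_one();               // wake one waiting consumer
    }

    int pop() {                            // blocks until work is available
        std::unique_lock<std::mutex> lock(mutex_);
        ready_.wait(lock, [this] { return !queue_.empty(); });
        int item = queue_.front();
        queue_.pop();
        return item;
    }

private:
    std::queue<int> queue_;
    std::mutex mutex_;
    std::condition_variable ready_;
};
```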
Foster more high-level languages & platforms
Improve & mature development environment
Key GPU Workloads
Computational graphics
Scientific and numeric computing
Image processing: video & images
Computer vision
Speech & natural language
Machine learning
The Future of Computing?
Forward-looking statements:
All future interesting problems are throughput problems.
GPUs will evolve to be the general-purpose throughput processors.
CPUs as we know them will become (already are?) "good enough", and shrink to a corner of the die.
Final Thoughts: Education
We should teach parallel computing in CS 1 or CS 2
Computers don't get faster, just wider
Manycore is the future of computing
Heapsort and mergesort
Both O(n lg n)
One parallel-friendly, one not
Students need to understand this early, starting now
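A sketch of why mergesort is the parallel-friendly one (my illustration, not from the slides): its two halves are independent and can be sorted concurrently, whereas heapsort pops one element at a time from a single shared heap, a serial dependency chain:

```cpp
#include <algorithm>
#include <future>
#include <vector>

// Parallel mergesort sketch: the halves are disjoint, so one can be
// sorted on another thread while this thread sorts the other. Small
// ranges fall back to serial std::sort to bound thread creation.
void parallelMergeSort(std::vector<int>& v, std::size_t lo, std::size_t hi) {
    if (hi - lo < 1024) {
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    std::size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,     // sort left half in parallel
                           [&] { parallelMergeSort(v, lo, mid); });
    parallelMergeSort(v, mid, hi);                 // sort right half here
    left.wait();                                   // join before merging
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}
```

Heapsort offers no such split: each pop depends on the heap state left by the previous one.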
Miscellaneous Thoughts
NVIDIA Resources Available
NVIDIA enables heterogeneous computing research and teaching:
Grants for leading researchers in GPU Computing
Ph.D. Fellowship program
Professor Partnership program
Discounted hardware
GPU Computing Ventures: venture fund focused on GPU computing
Technical Training and Assistance