© NVIDIA Corporation 2009

Early 3D Graphics

Perspective study of a chalice

Paolo Uccello, circa 1450


Early Graphics Hardware

Perspective study of a chalice

Paolo Uccello, circa 1450

Artist using a perspective machine

Albrecht Dürer, 1525


Early Electronic Graphics Hardware

SKETCHPAD: A Man-Machine Graphical Communication System

Ivan Sutherland, 1963


The Graphics Pipeline

The Geometry Engine: A VLSI Geometry System for Graphics

Jim Clark, 1982


The Graphics Pipeline

Vertex Transform & Lighting

Triangle Setup & Rasterization

Texturing & Pixel Shading

Depth Test & Blending

Framebuffer


The Graphics Pipeline

Key abstraction of real-time graphics

Hardware used to look like this

One chip/board per stage

Fixed data flow through pipeline

Vertex → Rasterize → Pixel → Test & Blend → Framebuffer


!"#$%&'()(*+%!"#$%&'()*%)" +,#-.%/0,%-.//0&12%34+

SGI RealityEngine (1993)

Vertex → Rasterize → Pixel → Test & Blend → Framebuffer



SGI InfiniteReality (1997)

Vertex → Rasterize → Pixel → Test & Blend → Framebuffer


The Graphics Pipeline

Remains a useful abstraction

Hardware used to look like this

Vertex → Rasterize → Pixel → Test & Blend → Framebuffer


The Graphics Pipeline

Hardware used to look like this:

Vertex, pixel processing became programmable

Vertex → Rasterize → Pixel → Test & Blend → Framebuffer

!!"#$%&"'&()$*"+)(,-(./"-0)"+$1(231/)"$**1'1-0

4456-7$644 8-1*"8)%9**:,6-$';"9<",6-$';"=<",6-$';">?

@

10' 1 A"'&()$*B*CDC E"76-%FG1.DC ;"76-%FB*CDCH

>I1J"A"9I1J"E"=I1JH

K


The Graphics Pipeline

Hardware used to look like this

Vertex, pixel processing became programmable

New stages added

Vertex → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer

!!"#$%&"'&()$*"+)(,-(./"-0)"+$1(231/)"$**1'1-0

4456-7$644 8-1*"8)%9**:,6-$';"9<",6-$';"=<",6-$';">?

@

10'"1"A"'&()$*B*CDC E"76-%FG1.DC ;"76-%FB*CDCH

>I1J"A"9I1J"E"=I1JH

K


The Graphics Pipeline

Hardware used to look like this

Vertex, pixel processing became programmable

New stages added

GPU architecture increasingly centers around shader execution

Vertex → Tessellation → Geometry → Rasterize → Pixel → Test & Blend → Framebuffer

!!"#$%&"'&()$*"+)(,-(./"-0)"+$1(231/)"$**1'1-0

4456-7$644 8-1*"8)%9**:,6-$';"9<",6-$';"=<",6-$';">?

@

10'"1"A"'&()$*B*CDC E"76-%FG1.DC ;"76-%FB*CDCH

>I1J"A"9I1J"E"=I1JH

K

Modern GPUs: Unified Design

[Diagram: discrete design, with fixed shader stages A, B, C, D each reading and writing its own intermediate buffers, vs. unified design with a single shader core]

Vertex shaders, pixel shaders, etc. become threads running different programs on a flexible core

GeForce 8: Modern GPU Architecture

[Block diagram: Host → Input Assembler → Vertex / Geometry / Pixel Thread Issue and Setup & Rasterize, feeding a unified Thread Processor array (SP pairs with shared texture units and L1 caches), backed by L2 caches and framebuffer partitions]


Modern GPU Architecture: GT200

[Block diagram: Host → Input Assembler → Vertex / Geometry / Pixel Thread Issue, Setup & Rasterize, and a Thread Scheduler feeding a larger array of streaming processors (SP triples with shared texture units and L1 caches), backed by L2 caches and framebuffer partitions]

Next-Gen GPU Architecture: Fermi

NVIDIA next-gen “Fermi” architecture

[Die diagram: GigaThread engine, host interface, six DRAM interfaces, shared L2 cache]

GPUs Today

Lessons from Graphics Pipeline

Throughput is paramount

Create, run, & retire lots of threads very rapidly

Use multithreading to hide latency

!""# $%%% $%%# $%!%

!"#$%&'(

)*%+,-./

012-.314 '56

')*%+,-./

012-.31 27

&'5*%+,-./

012-.31%((88

6(&*%+,-./

012-.31 )%

68*%+,-./

!"#$%&'

)9 +,-./

Perspective

1 TeraFLOP in 1993

The “New” Moore’s Law

Computers no longer get faster, just wider

You must re-think your algorithms to be parallel!

Data-parallel computing is most scalable solution

Why GPU Computing?

Throughput architecture → massive parallelism

Massive parallelism → computational horsepower

Fact: nobody cares about theoretical peak

Challenge: harness GPU power for real application performance

Peak performance, CPU vs. GPU:

NVIDIA Tesla C1060: 240 cores, 936 GFLOPS

Intel Core i7 960: 4 cores, 102 GFLOPS

CUDA Successes

146X  Medical Imaging, U of Utah
36X   Molecular Dynamics, U of Illinois
18X   Video Transcoding, Elemental Tech
50X   Matlab Computing, AccelerEyes
100X  Astrophysics, RIKEN
149X  Financial Simulation, Oxford
47X   Linear Algebra, Universidad Jaime
20X   3D Ultrasound, Techniscan
130X  Quantum Chemistry, U of Illinois
30X   Gene Sequencing, U of Maryland


Accelerating Insight

CPU Only → Heterogeneous with Tesla GPU:

4.6 Days → 27 Minutes
2.7 Days → 30 Minutes
8 Hours → 13 Minutes
3 Hours → 16 Minutes


GPU Computing 1.0

(Ignoring prehistory: Ikonas, Pixel Machine, Pixel-Planes…)

GPU Computing 1.0: compute pretending to be graphics

Disguise data as textures or geometry

Disguise algorithm as render passes

“Trick graphics pipeline into doing your computation”

Term GPGPU coined by Mark Harris

Typical GPGPU Constructs

GPGPU Hardware & Algorithms

GPUs get progressively more capable

Fixed-function → register combiners → shaders

fp32 pixel hardware greatly extends reach

Algorithms get more sophisticated

Cellular automata → PDE solvers → ray tracing

Clever graphics tricks

High-level shading languages emerge

GPU Computing 2.0: Enter CUDA

GPU Computing 2.0: direct compute

Program GPU directly, no graphics-based restrictions

GPU Computing supplants graphics-based GPGPU

November 2006: NVIDIA introduces CUDA


CUDA In One Slide

Thread: per-thread local memory

Block: per-block shared memory; local barrier

Kernel foo(): per-device global memory

Global barrier

Kernel bar()
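A minimal sketch (not from the slide) of how these pieces map to CUDA C: per-thread locals, a per-block __shared__ array with __syncthreads() as the local barrier, per-device global memory, and the implicit global barrier between dependent kernel launches. Kernel and variable names here are illustrative.

// Illustrative only: a block-wise sum whose per-block results are combined
// by a second kernel launched after the first one completes.
// Assumes the first kernel is launched with 256 threads per block.
__global__ void partial_sums(const float *in, float *block_sums, int n)
{
    __shared__ float buf[256];                     // per-block shared memory
    int tid = threadIdx.x;                         // per-thread local variables
    int i   = blockIdx.x * blockDim.x + tid;
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                               // local barrier for this block

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = buf[0]; // per-device global memory
}

__global__ void total_sum(const float *block_sums, float *total, int nblocks)
{
    // Launching this kernel after partial_sums acts as the global barrier:
    // all of the first kernel's writes are visible before this one runs.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float s = 0.0f;
        for (int b = 0; b < nblocks; ++b) s += block_sums[b];
        *total = s;
    }
}

Launched back to back on the same stream, the second kernel only starts after the first finishes, which is the global barrier the slide refers to.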

CUDA C Example

!"#$%&'()*+&,-#'./#01%02%3."'1%'2%3."'1%4(2%3."'1%4*5

6

3"- /#01%#%7%89%# : 09%;;#5

*<#=%7%'4(<#=%;%*<#=9

>

??%@0!"A,%&,-#'. BCDEF%A,-0,.

&'()*+&,-#'./02%GH82%(2%*59

++I."J'.++%!"#$%&'()*+)'-'..,./#01%02%3."'1%'2%3."'1%4(2%3."'1%4*5

6

#01%#%7%J."KA@$(H(4J."KAL#MH(%;%1N-,'$@$(H(9

#3 /# : 05%%*<#=%7%'4(<#=%;%*<#=9

>

??%@0!"A,%)'-'..,. BCDEF%A,-0,. O#1N%GPQ%1N-,'$&?J."KA

#01%0J."KA&%7%/0%;%GPP5%?%GPQ9

&'()*+)'-'..,.:::0J."KA&2%GPQRRR/02%GH82%(2%*59

Serial C Code

Parallel C Code


Heterogeneous Programming

Use the right processor for the right job

Serial Code
. . .
Parallel Kernel
foo<<< nBlk, nTid >>>(args);

Serial Code
. . .
Parallel Kernel
bar<<< nBlk, nTid >>>(args);

GPU Computing 3.0: An Ecosystem

GPU Computing 3.0: an emerging ecosystem

Hardware & product lines

Algorithmic sophistication

Cross-platform standards

Education & research

Consumer applications

High-level languages

!"#$%&'()*$+,$-./)(*01$$%&0()2$3,2&*4$56/+7,89*4$/55*


Products

CUDA is in products from laptops to supercomputers



Emerging HPC Products

New class of hybrid GPU-CPU servers

SuperMicro 1U GPU Server: 2 Tesla M1060 GPUs

Bull bullx Blade Enclosure: up to 18 Tesla M1060 GPUs


Algorithmic Sophistication

Sort

Sparse matrix (see the SpMV sketch below)

Hash tables

Fast multipole method

Ray tracing (parallel tree traversal)

012$3"4/5.4$6'7($%89.)():+.)7#$76$;9+'4"$<+.')=-Vector Multiplication on Emerging Multicore Platforms", Williams et al, Supercomputing 2007


Cross-Platform Standards

GPU Computing Applications

NVIDIA GPU with the CUDA Parallel Computing Architecture

CUDA C, OpenCL™, DirectCompute, CUDA Fortran, Java, Python, .NET, MATLAB, …

OpenCL is a trademark of Apple Inc., used under license to the Khronos Group Inc.


CUDA Momentum

Over 250 universities teach CUDA

Over 1,000 research papers

Over 500 CUDA apps: www.nvidia.com/CUDA

CUDA-powered TSUBAME: 29th fastest supercomputer in the world

180 million CUDA GPUs

100,000 active developers


Data Structures

! !"#$%!&&'()*+(,)(+!-#

! !"#$%!&&"-%!,)(+!-#

! !"#$%!&&'()*+(,.!#

! Etc.

Algorithms

! !"#$%!&&%-#!

! !"#$%!&&#('$+(

! !"#$%!&&(/+0$%*)(,%+12

! Etc.

thrust

Thrust: a library of data-parallel algorithms & data structures for CUDA, with an interface similar to the C++ Standard Template Library

C++ template metaprogramming automatically chooses the fastest code path at compile time


thrust::sort

sort.cu

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <cstdlib>

int main(void)
{
    // generate random data on the host
    thrust::host_vector<int> h_vec(1000000);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // transfer to device and sort
    thrust::device_vector<int> d_vec = h_vec;

    // sort 140M 32b keys/sec on GT200
    thrust::sort(d_vec.begin(), d_vec.end());

    return 0;
}

http://thrust.googlecode.com
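The other algorithms listed above follow the same STL-like pattern. For example, continuing inside main() from sort.cu (a sketch; the extra headers needed are thrust/reduce.h, thrust/scan.h and thrust/functional.h):

    // sum of all elements, computed on the device
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());

    // exclusive prefix sum into a second device_vector
    thrust::device_vector<int> prefix(d_vec.size());
    thrust::exclusive_scan(d_vec.begin(), d_vec.end(), prefix.begin());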


GPU Begins to Vanish

Ever-increasing number of codes & commercial packages “just go faster” when a GPU is present

Some bio/chem codes available or porting:

NAMD / VMD, GROMACS (alpha), HOOMD, GPU HMMER, MUMmerGPU, AutoDock, …

LAMMPS, CHARMM, Q-Chem, Gaussian, AMBER, …

Consumer applications, e.g. Photoshop

Badaboom, vReveal, Nero MoveIt

OS: Microsoft Windows 7, Mac OS X Snow Leopard


Next-Gen GPU Architecture: Fermi

3 billion transistors

Over 2x the cores (512 total)

~2x the memory bandwidth

L1 and L2 caches

8x the peak DP performance

ECC

C++


SM Microarchitecture

[SM block diagram: Instruction Cache, dual Scheduler / Dispatch units, Register File, 32 Cores, 16 Load/Store Units, 4 Special Function Units, Interconnect Network, 64K Configurable Cache/Shared Mem, Uniform Cache]

Objective: optimize for GPU computing

New ISA

Revamp issue / control flow

New CUDA core architecture

32 cores per SM (512 total)

64KB of configurable L1$ / shared memory

Ops / clk per SM: FP32: 32, FP64: 16, INT: 32, SFU: 4, LD/ST: 16


New IEEE 754-2008 arithmetic standard

Fused Multiply-Add (FMA) for SP & DP

New integer ALU optimized for 64-bit and extended precision ops
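A minimal sketch of what FMA means in device code: fmaf() / fma() compute a*b + c in single and double precision with a single rounding step. The kernel and variable names here are illustrative, not from the deck.

__global__ void axpy_fma(const float *a, const float *b, const float *c,
                         float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fmaf(a[i], b[i], c[i]);   // single-precision fused multiply-add
}

__global__ void axpy_fma_dp(const double *a, const double *b, const double *c,
                            double *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d[i] = fma(a[i], b[i], c[i]);    // double-precision fused multiply-add
}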

[CUDA core detail: Dispatch Port, Operand Collector, FP Unit, INT Unit, Result Queue]

Hardware Thread Scheduling

Concurrent kernel execution + faster context switch

[Timeline: serial kernel execution (Kernels 1 through 5 run back to back) vs. parallel kernel execution (independent kernels overlap in time)]
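A rough sketch of how concurrent kernels are expressed from the host: independent kernels launched into different CUDA streams may overlap on hardware that supports it. kernelA / kernelB and the launch configurations are placeholders.

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<64, 256, 0, s1>>>();   // independent work in stream 1
    kernelB<<<64, 256, 0, s2>>>();   // independent work in stream 2

    cudaDeviceSynchronize();         // wait for both streams to finish
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);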

More Fermi Goodness

ECC protection for DRAM, L2, L1, RF

Unified 40-bit address space for local, shared, global

5-20x faster atomics (see the sketch below)

Dual DMA engines for CPU ↔ GPU transfers

ISA extensions for C++ (e.g. virtual functions)
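As a rough example of code that leans on atomics, here is a global-memory histogram where every thread bumps one bin. This is a sketch; the 256-bin layout and all names are assumptions.

// Each thread classifies one input byte and atomically increments its bin.
__global__ void histogram256(const unsigned char *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1u);   // contended global-memory atomic
}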

Key CUDA Challenges

Express other programming models elegantly

Producer-consumer: persistent thread blocks reading and writing work queues (see the sketch after this list)

Task-parallel: kernel, thread block or warp as parallel task

C*<"$7*FF*2$%6*'#+-;-#+($G?HB$6/<<#+2-

Foster more high-level languages & platforms

Improve & mature development environment

Key GPU Workloads

Computational graphics

Scientific and numeric computing

Image processing: video & images

Computer vision

Speech & natural language

Machine learning


The Future of Computing?

Forward-looking statements:

All future interesting problems are throughput problems.

GPUs will evolve to be the general-purpose throughput processors.

CPUs as we know them will become (already are?) “good enough”, and shrink to a corner of the die.

Final Thoughts: Education

We should teach parallel computing in CS 1 or CS 2

G*F6;<#+-$I*2,<$9#<$=/-<#+A$:;-<$'@I#+

Manycore is the future of computing

Heapsort and mergesort

Both O(n lg n)

One parallel-friendly, one not

Students need to understand this early

now

Miscellaneous Thoughts


NVIDIA Resources Available

NVIDIA enables heterogeneous computing research and teaching:

Grants for leading researchers in GPU Computing

Ph.D. Fellowship program

Professor Partnership program

Discounted hardware

GPU Computing Ventures: venture fund focused on GPU computing

Technical Training and Assistance