CIS 565: GPU Programming and Architecture Original Slides by: Suresh Venkatasubramanian Updates by...

CIS 565: GPU Programming and Architecture

Original Slides by: Suresh Venkatasubramanian

Updates by Joseph Kider and Patrick Cozzi

Administrivia

Meeting: Monday and Wednesday, 1:30-3:00pm, Towne 309
Recorded lectures upon request
Website: http://www.seas.upenn.edu/~cis565/


Instructor: Joseph Kider, kiderj@seas

Teaching Assistant: Qing Sun

Administrivia: Prerequisites

CIS 460: Introduction to Computer Graphics
CIS 501: Computer Architecture
Most important: C/C++ and OpenGL

CIS 534: Multicore Programming and Architecture

Course Description: This course is a pragmatic examination of multicore programming and the hardware architecture of modern multicore processors. Unlike the sequential single-core processors of the past, utilizing a multicore processor requires programmers to identify parallelism and write explicitly parallel code. Topics covered include: the relevant architectural trends and aspects of multicores, approaches for writing multicore software by extracting data parallelism (vectors and SIMD), thread-level parallelism, and task-based parallelism, efficient synchronization, and program profiling and performance tuning. The course focuses primarily on mainstream shared-memory multicores with some coverage of graphics processing units (GPUs). Cluster-based supercomputing is not a focus of this course. Several programming assignments and a course project will provide students first-hand experience with programming, experimentally analyzing, and tuning multicore software. Students are expected to have a solid understanding of computer architecture and strong programming skills (including experience with C/C++).

We will not overlap very much

What is GPU (Parallel) Computing

Parallel computing: using multiple processors to…
  More quickly perform a computation, or
  Perform a larger computation in the same time
PROGRAMMER expresses parallelism

Slide courtesy of Milo Martin

Clusters of computers (MPI, networks, cloud computing, ...): NOT COVERED

Shared-memory multiprocessor (called “multicore” when on the same chip): CIS 534 MULTICORE

GPU (graphics processing units): COURSE FOCUS CIS 565


Course Overview

System and GPU architecture
Real-time graphics programming with OpenGL and GLSL
General purpose programming with CUDA and OpenCL
Problem domain: up to you
Hands-on


Goals

Program massively parallel processors:
  High performance
  Functionality and maintainability
  Scalability
Gain Knowledge
  Parallel programming principles and patterns
  Processor architecture features and constraints
  Programming API, tools, and techniques


Grading

Homeworks (4-5): 40%
Paper Presentation: 10%
Final Project: 40% + 5%
Final: 10%

Administrivia

Bonus days: five per person
  No-questions-asked one-day extension
  Multiple bonus days can be used on the same assignment
  Can be used for most, but not all assignments
Strict late policy: not turned in by:
  11:59pm of due date: 25% deduction
  2 days late: 50%
  3 days late: 75%
  4 or more days: 100%
Add a Readme when using bonus days


Academic Honesty

Discussion with other students, past or present, is encouraged
Any reference to assignments from previous terms or web postings is unacceptable
Any copying of non-trivial code is unacceptable
  Non-trivial = more than a line or so
  Includes reading someone else’s code and then going off to write your own.


Academic Honesty

Penalties for academic dishonesty:
  Zero on the assignment for the first occasion
  Automatic failure of the course for repeat offenses

Administrivia

Textbook: None
Related graphics books:
  Graphics Shaders
  OpenGL Shading Language
  GPU Gems 1 - 3
Related general GPU books:
  Programming Massively Parallel Processors
  Patterns for Parallel Programming


Do I need a GPU?

Yes: NVIDIA GeForce 8 series or higher
No:
  Moore 100b - NVIDIA GeForce 9800s
  SIG Lab - NVIDIA GeForce 8800s, two GeForce 480s, and one Fermi Tesla


Demo: What GPU do I have?
Demo: What version of OpenGL/CUDA/OpenCL does it support?

Aside: This class is about 3 things

PERFORMANCE
PERFORMANCE
PERFORMANCE

Ok, not really
  Also about correctness, “-abilities”, etc.
  Nitty Gritty real world wall-clock performance
  No Proofs!

Slide courtesy of Milo Martin

Exercise

Parallel Sorting

Credits

David Kirk (NVIDIA)
Wen-mei Hwu (UIUC)
David Luebke
Wolfgang Engel
Etc. etc.

What is a GPU?

GPU: Graphics Processing Unit
Processor that resides on your graphics card.
GPUs allow us to achieve the unprecedented graphics capabilities now available in games

What is a GPU?

Demo: NVIDIA GTX 400
Demo: Triangle throughput

Why Program the GPU?

Chart from: http://ixbtlabs.com/articles3/video/cuda-1-p1.html

Why Program the GPU?

Compute
  Intel Core i7 – 4 cores – 100 GFLOPS
  NVIDIA GTX280 – 240 cores – 1 TFLOPS
Memory Bandwidth
  System Memory – 60 GB/s
  NVIDIA GT200 – 150 GB/s
Install Base
  Over 200 million NVIDIA G80s shipped
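The figures above can be turned into ratios with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slide and treats them as peak figures rather than measured results; the variable names are illustrative.

```python
# Peak figures as quoted on the slide (not measurements).
cpu_gflops, cpu_cores = 100.0, 4         # Intel Core i7
gpu_gflops, gpu_cores = 1000.0, 240      # NVIDIA GTX280 (1 TFLOPS)
sys_bw_gbs, gpu_bw_gbs = 60.0, 150.0     # system memory vs. NVIDIA GT200

# Whole-chip comparison.
compute_ratio = gpu_gflops / cpu_gflops      # GPU peak compute relative to CPU
bandwidth_ratio = gpu_bw_gbs / sys_bw_gbs    # GPU memory bandwidth relative to system memory

# Per-core comparison: each GPU core is individually slower than a CPU core.
cpu_per_core = cpu_gflops / cpu_cores
gpu_per_core = gpu_gflops / gpu_cores

print(compute_ratio, bandwidth_ratio, cpu_per_core, round(gpu_per_core, 2))
```

On these numbers the GPU advantage comes from the number of cores, not from faster individual cores, which is why the rest of the course stresses massively parallel workloads.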

How did this happen?

Games demand advanced shading
Fast GPUs = better shading
Need for speed = continued innovation
The gaming industry has overtaken the defense, finance, oil and healthcare industries as the main driving factor for high performance processors.

GPU = Fast co-processor?

GPU speed increasing at cubed-Moore’s Law.
This is a consequence of the data-parallel streaming aspects of the GPU.
GPUs are cheap! Put a couple together, and you can get a super-computer.

NYT May 26, 2003: TECHNOLOGY; From PlayStation to Supercomputer for $50,000:

National Center for Supercomputing Applications at University of Illinois at Urbana-Champaign builds supercomputer using 70 individual Sony Playstation 2 machines; project required no hardware engineering other than mounting Playstations in a rack and connecting them with high-speed network switch

So can we use the GPU for general-purpose computing ?

Yes! Wealth of applications

Voronoi Diagrams
Data Analysis
Motion Planning
Geometric Optimization
Physical Simulation
Matrix Multiplication
Conjugate Gradient
Sorting and Searching
Force-field simulation
Particle Systems
Molecular Dynamics
Graph Drawing
Signal Processing
Database queries
Range queries
… and graphics too !!
Image Processing
Radar, Sonar, Oil Exploration
Finance
Planning
Optimization

When does “GPU=fast co-processor” work?

Real-time visualization of complex phenomena
The GPU (like a fast parallel processor) can simulate physical processes like fluid flow, n-body systems, molecular dynamics
In general: Massively Parallel Tasks

When does “GPU=fast co-processor” work?

Interactive data analysis
For effective visualization of data, interactivity is key

When does “GPU=fast co-processor” work?

Rendering complex scenes
Procedural shaders can offload much of the expensive rendering work to the GPU. Still not the Holy Grail of “80 million triangles at 30 frames/sec*”, but it helps.

* Alvy Ray Smith, Pixar.

Note: NVIDIA Quadro 5000 is calculated to push 950 million triangles per second

http://www.nvidia.com/object/product-quadro-5000-us.html

Stream Programming

A stream is a sequence of data (could be numbers, colors, RGBA vectors,…)
A kernel is a (fragment) program that runs on each element of a stream, generating an output stream (pixel buffer).
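A minimal CPU-side sketch of these two definitions, written in Python for readability rather than in a shading language. The names `run_kernel` and `darken` are hypothetical; a real GPU would run the kernel on many elements at once, whereas this sketch only shows that each output element depends on exactly one input element.

```python
from typing import Callable, Iterable, List, Tuple

RGBA = Tuple[float, float, float, float]

def run_kernel(kernel: Callable[[RGBA], RGBA], stream: Iterable[RGBA]) -> List[RGBA]:
    """Apply the kernel to every element independently; output order matches input order."""
    return [kernel(element) for element in stream]

def darken(pixel: RGBA) -> RGBA:
    """Hypothetical kernel: halve the colour channels and keep alpha unchanged."""
    r, g, b, a = pixel
    return (r * 0.5, g * 0.5, b * 0.5, a)

input_stream = [(1.0, 0.5, 0.0, 1.0), (0.2, 0.4, 0.6, 0.5)]
output_stream = run_kernel(darken, input_stream)
print(output_stream)
```

Because `darken` never looks at neighbouring elements, the elements could be processed in any order or all at once, which is the property stream hardware exploits.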

Stream Programming

Kernel = vertex/fragment shader
Input stream = stream of vertices, primitives, or fragments
Output stream = frame buffer or other buffer (transform feedback)
Multiple kernels = multi-pass rendering sequence on the GPU.
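The "multiple kernels = multi-pass" idea can be sketched as feeding the output stream of one pass into the next. This is an illustration in plain Python, not OpenGL code; `run_pass`, `run_passes`, `scale`, and `offset` are hypothetical names.

```python
from functools import reduce

def run_pass(kernel, stream):
    """One pass: apply a single kernel to every element of the stream."""
    return [kernel(x) for x in stream]

def run_passes(kernels, stream):
    """Feed the stream through each kernel in turn, like successive render passes."""
    return reduce(lambda data, kernel: run_pass(kernel, data), kernels, stream)

def scale(x):
    return x * 2.0      # hypothetical first-pass kernel

def offset(x):
    return x + 1.0      # hypothetical second-pass kernel

result = run_passes([scale, offset], [0.0, 1.0, 2.0])
print(result)
```

Each pass finishes over the whole stream before the next one starts, which mirrors rendering into a buffer and then reading that buffer in the following pass.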

To program the GPU, one must think of it as a (parallel) stream processor.

What is the cost of a stream program?

Number of kernels
  Readbacks from the GPU to main memory are expensive, and so is transferring data to the GPU.
Complexity of kernel
  More complexity takes longer to move data through a rendering pipeline
Number of memory accesses
  Non-local memory access is expensive
Number of branches
  Divergent branches are expensive
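To see why divergent branches are expensive, consider a deliberately simplified model: a group of lanes executes in lockstep, and if the lanes disagree on a branch condition, the group runs both sides one after the other with the inactive lanes masked off. The group size of 8 and the cycle costs below are made-up illustration values, not figures for any real GPU.

```python
def group_cost(conditions, then_cost, else_cost):
    """Cycles spent by one lockstep group under the simplified model.

    conditions: one boolean per lane (True = lane takes the 'then' side).
    """
    takes_then = any(conditions)        # at least one lane needs the 'then' side
    takes_else = not all(conditions)    # at least one lane needs the 'else' side
    return (then_cost if takes_then else 0) + (else_cost if takes_else else 0)

# All 8 lanes agree: only one side of the branch is executed.
uniform = group_cost([True] * 8, then_cost=10, else_cost=30)

# One lane disagrees: the whole group pays for both sides.
divergent = group_cost([True] * 7 + [False], then_cost=10, else_cost=30)

print(uniform, divergent)
```

In this model a single disagreeing lane makes the whole group pay for both sides of the branch, so keeping branch conditions uniform across neighbouring elements matters.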

What will this course cover?

1. Stream Programming Principles

OpenGL programmable pipeline
The principles of stream hardware
How do we program with streams?

2. Shaders and Effects

How do we compute complex effects found in today’s games? Examples:
  Parallax Mapping
  Reflections
  Skin and Hair
  Particle Systems
  Deformable Mesh
  Morphing
  Animation

3. GPGPU / GPU Computing

How do we use the GPU as a fast co-processor?
GPGPU Languages: CUDA and OpenCL
High Performance Computing
Numerical methods and linear algebra:
  Inner products
  Matrix-vector operations
  Matrix-Matrix operations
  Sorting
  Fluid Simulations
  Fast Fourier Transforms
  Graph Algorithms
  And More…
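As one illustration of how an item from this list maps onto data-parallel passes, an inner product can be written as an elementwise multiply followed by a pairwise (tree) reduction. The sketch below runs sequentially in Python; the function name `inner_product` and the padding strategy are assumptions made for the example, not the course's reference implementation.

```python
def inner_product(a, b):
    """Elementwise multiply (a map kernel), then a pairwise tree reduction."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    partial = [x * y for x, y in zip(a, b)]    # pass 1: independent products
    if not partial:
        return 0.0
    while len(partial) > 1:                     # each loop iteration is one reduction pass
        if len(partial) % 2:                    # pad odd-length data with the additive identity
            partial.append(0.0)
        half = len(partial) // 2
        # Every sum in this pass is independent of the others, so a GPU could do them at once.
        partial = [partial[i] + partial[i + half] for i in range(half)]
    return partial[0]

value = inner_product([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(value)
```

The reduction needs roughly log2(n) passes instead of n sequential additions, at the price of launching several passes.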

At what point does the GPU become faster than the CPU for matrix operations? For other operations?
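One rough way to reason about this question is a linear cost model: CPU time is n / cpu_rate, while GPU time is a fixed overhead (kernel launch, transfers) plus n / gpu_rate. The model and every number below are assumptions chosen for illustration; real crossover points have to be measured.

```python
def break_even_size(cpu_rate, gpu_rate, fixed_overhead):
    """Problem size at which the two modelled times are equal, or None.

    cpu_rate, gpu_rate: elements processed per second (hypothetical values).
    fixed_overhead: seconds of GPU setup/transfer cost independent of n.
    """
    if gpu_rate <= cpu_rate:
        return None    # under this model the GPU never catches up
    # Solve n / cpu_rate == fixed_overhead + n / gpu_rate for n.
    return fixed_overhead / (1.0 / cpu_rate - 1.0 / gpu_rate)

# Made-up rates: GPU 10x faster per element, 1 ms of fixed overhead.
n_star = break_even_size(cpu_rate=1e8, gpu_rate=1e9, fixed_overhead=1e-3)
print(round(n_star))
```

Below the break-even size the fixed overhead dominates and the CPU wins; above it the GPU's higher throughput pays off.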

4. Optimizations

How do we use the full potential of the GPU?
What tools are there to analyze the performance of our algorithms?

What we want you to get out of this course!

1. Understanding of the GPU as a graphics pipeline
2. Understanding of the GPU as a high performance compute device
3. Understanding of GPU architectures
4. Programming in GLSL, CUDA, and OpenCL
5. Exposure to many core graphics effects performed on GPUs
6. Exposure to many core parallel algorithms performed on GPUs
