CIS 565: GPU Programming and Architecture
Original Slides by: Suresh Venkatasubramanian
Updates by Joseph Kider and Patrick Cozzi
Administrivia
Meeting: Monday and Wednesday, 1:30-3:00pm, Towne 309
Recorded lectures upon request
Website: http://www.seas.upenn.edu/~cis565/
Administrivia
Instructor: Joseph Kider (kiderj@seas)
Administrivia
Teaching Assistant: Qing Sun
Administrivia
Prerequisites
  CIS 460: Introduction to Computer Graphics
  CIS 501: Computer Architecture
  Most important: C/C++ and OpenGL
CIS 534: Multicore Programming and Architecture
Course Description
This course is a pragmatic examination of multicore programming and the hardware architecture of modern multicore processors. Unlike the sequential single-core processors of the past, utilizing a multicore processor requires programmers to identify parallelism and write explicitly parallel code. Topics covered include: the relevant architectural trends and aspects of multicores, approaches for writing multicore software by extracting data parallelism (vectors and SIMD), thread-level parallelism, and task-based parallelism, efficient synchronization, and program profiling and performance tuning. The course focuses primarily on mainstream shared-memory multicores with some coverage of graphics processing units (GPUs). Cluster-based supercomputing is not a focus of this course. Several programming assignments and a course project will provide students first-hand experience with programming, experimentally analyzing, and tuning multicore software. Students are expected to have a solid understanding of computer architecture and strong programming skills (including experience with C/C++).
We will not overlap very much
What is GPU (Parallel) Computing
Parallel computing: using multiple processors to…
  More quickly perform a computation, or
  Perform a larger computation in the same time
PROGRAMMER expresses parallelism
Slide courtesy of Milo Martin
Clusters of computers (MPI, networks, cloud computing, …): NOT COVERED
Shared-memory multiprocessor (called “multicore” when on the same chip): CIS 534 MULTICORE
GPU (graphics processing units): COURSE FOCUS, CIS 565
Administrivia
Course Overview
  System and GPU architecture
  Real-time graphics programming with OpenGL and GLSL
  General purpose programming with CUDA and OpenCL
  Problem domain: up to you
  Hands-on
Administrivia
Goals
Program massively parallel processors:
  High performance
  Functionality and maintainability
  Scalability
Gain Knowledge
  Parallel programming principles and patterns
  Processor architecture features and constraints
  Programming API, tools, and techniques
Administrivia
Grading
  Homeworks (4-5): 40%
  Paper Presentation: 10%
  Final Project: 40% + 5%
  Final: 10%
Administrivia
Bonus days: five per person
  No-questions-asked one-day extension
  Multiple bonus days can be used on the same assignment
  Can be used for most, but not all assignments
Strict late policy: not turned in by:
  11:59pm of due date: 25% deduction
  2 days late: 50%
  3 days late: 75%
  4 or more days: 100%
Add a Readme when using bonus days
Administrivia
Academic Honesty
  Discussion with other students, past or present, is encouraged
  Any reference to assignments from previous terms or web postings is unacceptable
  Any copying of non-trivial code is unacceptable
    Non-trivial = more than a line or so
    Includes reading someone else’s code and then going off to write your own.
Administrivia
Academic Honesty
Penalties for academic dishonesty:
  Zero on the assignment for the first occasion
  Automatic failure of the course for repeat offenses
Administrivia
Textbook: None
Related graphics books:
  Graphics Shaders
  OpenGL Shading Language
  GPU Gems 1 - 3
Related general GPU books:
  Programming Massively Parallel Processors
  Patterns for Parallel Programming
Administrivia
Do I need a GPU?
  Yes: NVIDIA GeForce 8 series or higher
  No
    Moore 100b - NVIDIA GeForce 9800s
    SIG Lab - NVIDIA GeForce 8800s, two GeForce 480s, and one Fermi Tesla
Administrivia
Demo: What GPU do I have?
Demo: What version of OpenGL/CUDA/OpenCL does it support?
Aside: This class is about 3 things
  PERFORMANCE
  PERFORMANCE
  PERFORMANCE
Ok, not really
  Also about correctness, “-abilities”, etc.
  Nitty Gritty real world wall-clock performance
  No Proofs!
Slide courtesy of Milo Martin
Exercise
Parallel Sorting
Credits
  David Kirk (NVIDIA)
  Wen-mei Hwu (UIUC)
  David Luebke
  Wolfgang Engel
  Etc. etc.
What is a GPU?
GPU: Graphics Processing Unit
  Processor that resides on your graphics card.
GPUs allow us to achieve the unprecedented graphics capabilities now available in games
What is a GPU?
Demo: NVIDIA GTX 400
Demo: Triangle throughput
Why Program the GPU?
Chart from: http://ixbtlabs.com/articles3/video/cuda-1-p1.html
Why Program the GPU?
Compute
  Intel Core i7 – 4 cores – 100 GFLOPS
  NVIDIA GTX280 – 240 cores – 1 TFLOPS
Memory Bandwidth
  System Memory – 60 GB/s
  NVIDIA GT200 – 150 GB/s
Install Base
  Over 200 million NVIDIA G80s shipped
How did this happen?
Games demand advanced shading
Fast GPUs = better shading
Need for speed = continued innovation
The gaming industry has overtaken the defense, finance, oil and healthcare industries as the main driving factor for high performance processors.
GPU = Fast co-processor?
GPU speed increasing at cubed-Moore’s Law.
  This is a consequence of the data-parallel streaming aspects of the GPU.
GPUs are cheap! Put a couple together, and you can get a super-computer.
NYT May 26, 2003: TECHNOLOGY; From PlayStation to Supercomputer for $50,000:
National Center for Supercomputing Applications at University of Illinois at Urbana-Champaign builds supercomputer using 70 individual Sony Playstation 2 machines; project required no hardware engineering other than mounting Playstations in a rack and connecting them with high-speed network switch
So can we use the GPU for general-purpose computing ?
Yes! Wealth of applications
Voronoi Diagrams
Data Analysis
Motion Planning
Geometric Optimization
Physical Simulation
Matrix Multiplication
Conjugate Gradient
Sorting and Searching
Force-field simulation
Particle Systems
Molecular Dynamics
Graph Drawing
Signal Processing
Database queries
Range queries
… and graphics too !!
Image Processing
Radar, Sonar, Oil Exploration
Finance
Planning
Optimization
When does “GPU=fast co-processor” work?
Real-time visualization of complex phenomena
The GPU (like a fast parallel processor) can simulate physical processes like fluid flow, n-body systems, molecular dynamics
In general: Massively Parallel Tasks
When does “GPU=fast co-processor” work?
Interactive data analysis
For effective visualization of data, interactivity is key
When does “GPU=fast co-processor” work?
Rendering complex scenes
Procedural shaders can offload much of the expensive rendering work to the GPU. Still not the Holy Grail of “80 million triangles at 30 frames/sec*”, but it helps.
* Alvy Ray Smith, Pixar.
Note: NVIDIA Quadro 5000 is calculated to push 950 million triangles per second
http://www.nvidia.com/object/product-quadro-5000-us.html
Stream Programming
A stream is a sequence of data (could be numbers, colors, RGBA vectors,…)
A kernel is a (fragment) program that runs on each element of a stream, generating an output stream (pixel buffer).
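The stream and kernel definitions above can be sketched on the CPU in plain C++. This is only an illustration of the model, not GPU code: the names `Rgba`, `grayscale_kernel`, and `run_kernel` are hypothetical, the luma weights are the common Rec. 601 values chosen for the example, and the loop that a GPU would run in parallel is executed sequentially here.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// One stream element: an RGBA color with channels in [0, 1].
using Rgba = std::array<float, 4>;

// Hypothetical kernel: converts one color to grayscale and keeps alpha.
// It reads only its own input element, so all elements are independent.
Rgba grayscale_kernel(const Rgba& c) {
    const float luma = 0.299f * c[0] + 0.587f * c[1] + 0.114f * c[2];
    return {luma, luma, luma, c[3]};
}

// Runs a kernel on every element of the input stream and writes the
// results to a preallocated output stream (the "pixel buffer").
// A GPU would run many kernel instances at once; this sketch does not.
template <typename Kernel>
void run_kernel(const std::vector<Rgba>& input, std::vector<Rgba>& output,
                Kernel kernel) {
    assert(output.size() == input.size());
    std::transform(input.begin(), input.end(), output.begin(), kernel);
}
```

A caller allocates `output` with the same length as `input` and then calls `run_kernel(input, output, grayscale_kernel)`. Because the kernel never looks at neighboring elements, the order in which elements are processed does not matter.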
Stream Programming
Kernel = vertex/fragment shader
Input stream = stream of vertices, primitives, or fragments
Output stream = frame buffer or other buffer (transform feedback)
Multiple kernels = multi-pass rendering sequence on the GPU.
To program the GPU, one must think of it as a (parallel) stream processor.
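The "multiple kernels = multi-pass" idea can be sketched as a chain of passes in which each pass reads the previous pass's output. The code below is a CPU illustration under assumptions: `Stream`, `Kernel`, and `run_passes` are hypothetical names, the stream holds plain floats rather than vertices or fragments, and the two buffers that swap roles stand in for the render targets a real multi-pass renderer would alternate between.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Stream = std::vector<float>;
using Kernel = std::function<float(float)>;

// Applies the kernels in order, one full pass over the stream per kernel.
// The two buffers swap roles after every pass ("ping-pong"), so each pass
// consumes the output stream of the pass before it.
Stream run_passes(const Stream& input, const std::vector<Kernel>& kernels) {
    Stream src = input;
    Stream dst(input.size());
    for (const Kernel& kernel : kernels) {
        assert(kernel != nullptr);  // an empty std::function cannot be called
        for (std::size_t i = 0; i < src.size(); ++i) {
            dst[i] = kernel(src[i]);
        }
        std::swap(src, dst);  // this pass's output becomes the next input
    }
    return src;
}
```

For example, passing a "double" kernel followed by an "add one" kernel maps the stream 1, 2, 3 to 3, 5, 7. Each extra kernel costs one more complete pass over the data.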
What is the cost of a stream program?
Number of kernels
  Readbacks from the GPU to main memory are expensive, and so is transferring data to the GPU.
Complexity of kernel
  More complexity takes longer to move data through a rendering pipeline
Number of memory accesses
  Non-local memory access is expensive
Number of branches
  Divergent branches are expensive
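Why divergent branches are expensive can be seen in a toy cost model. The sketch below assumes a group of lanes that executes in lockstep and must run a branch path whenever at least one lane takes it; the function name `lockstep_branch_cost` and the step counts are hypothetical, and real hardware details (group size, scheduling) are deliberately left out.

```cpp
#include <cassert>
#include <vector>

// Counts the steps a lockstep group of lanes spends on
//   if (cond) { then-path } else { else-path }
// under the simplifying assumption that the whole group executes a path
// whenever at least one of its lanes takes that path.
int lockstep_branch_cost(const std::vector<bool>& lane_takes_then,
                         int then_steps, int else_steps) {
    assert(then_steps >= 0 && else_steps >= 0);
    bool any_then = false;
    bool any_else = false;
    for (bool takes_then : lane_takes_then) {
        if (takes_then) {
            any_then = true;
        } else {
            any_else = true;
        }
    }
    // A uniform group pays for one path; a divergent group pays for both.
    return (any_then ? then_steps : 0) + (any_else ? else_steps : 0);
}
```

With a 10-step then-path and a 20-step else-path, a group whose lanes all agree pays 10 or 20 steps, while a group with even one dissenting lane pays 30 in this model.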
What will this course cover?
1. Stream Programming Principles
  OpenGL programmable pipeline
  The principles of stream hardware
  How do we program with streams?
2. Shaders and Effects
How do we compute complex effects found in today’s games? Examples:
  Parallax Mapping
  Reflections
  Skin and Hair
  Particle Systems
  Deformable Mesh
  Morphing
  Animation
3. GPGPU / GPU Computing
How do we use the GPU as a fast co-processor?
GPGPU Languages: CUDA and OpenCL
High Performance Computing
Numerical methods and linear algebra:
  Inner products
  Matrix-vector operations
  Matrix-Matrix operations
  Sorting
  Fluid Simulations
  Fast Fourier Transforms
  Graph Algorithms
  And More…
At what point does the GPU become faster than the CPU for matrix operations? For other operations?
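As one illustration of how an item from the list above maps onto data-parallel hardware, an inner product can be split into an element-wise multiply followed by a pairwise (tree) reduction. This is a sequential C++ sketch of that structure, not the course's implementation: `inner_product_by_reduction` is a hypothetical name, and the iterations that are independent within each pass are the ones a GPU could run concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inner product as two data-parallel phases:
//   1. map: multiply matching elements (every product is independent);
//   2. reduce: add pairs, roughly halving the active length on each pass.
// Within one pass, writes touch only the lower part of the buffer and
// reads touch only the upper part, so iterations do not interfere.
float inner_product_by_reduction(const std::vector<float>& a,
                                 const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> partial(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        partial[i] = a[i] * b[i];
    }
    for (std::size_t n = partial.size(); n > 1; n = (n + 1) / 2) {
        const std::size_t half = (n + 1) / 2;
        for (std::size_t i = 0; i + half < n; ++i) {
            partial[i] += partial[i + half];
        }
    }
    return partial.empty() ? 0.0f : partial[0];
}
```

The reduction needs about log2(n) passes instead of n sequential additions, which is the property that makes it attractive on a massively parallel processor. Note that the summation order differs from a left-to-right loop, so floating-point results can differ slightly for general inputs.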
4. Optimizations
How do we use the full potential of the GPU?
What tools are there to analyze the performance of our algorithms?
What we want you to get out of this course!
1. Understanding of the GPU as a graphics pipeline
2. Understanding of the GPU as a high performance compute device
3. Understanding of GPU architectures
4. Programming in GLSL, CUDA, and OpenCL
5. Exposure to many core graphics effects performed on GPUs
6. Exposure to many core parallel algorithms performed on GPUs