Efficient Data Parallel Computing on GPUs

31
Efficient Data Parallel Efficient Data Parallel Computing on GPUs Computing on GPUs Cliff Woolley University of Virginia / NVIDIA

description

Efficient Data Parallel Computing on GPUs. Cliff Woolley University of Virginia / NVIDIA. Overview. Data Parallel Computing Computational Frequency Profiling and Load Balancing. Data Parallel Computing. Data Parallel Computing. Vector Processing small-scale parallelism - PowerPoint PPT Presentation

Transcript of Efficient Data Parallel Computing on GPUs

Page 1: Efficient Data Parallel Computing on GPUs

Efficient Data Parallel Efficient Data Parallel Computing on GPUsComputing on GPUs

Cliff Woolley University of Virginia / NVIDIA

Page 2: Efficient Data Parallel Computing on GPUs

OverviewOverviewOverviewOverview

• Data Parallel Computing

• Computational Frequency

• Profiling and Load Balancing

Page 3: Efficient Data Parallel Computing on GPUs

Data Parallel ComputingData Parallel ComputingData Parallel ComputingData Parallel Computing

Page 4: Efficient Data Parallel Computing on GPUs

• Vector Processing– small-scale parallelism

• Data Layout– large-scale parallelism

Data Parallel ComputingData Parallel ComputingData Parallel ComputingData Parallel Computing

Page 5: Efficient Data Parallel Computing on GPUs

frag2frame Smooth(vert2frag IN, uniform samplerRECT Source : texunit0, uniform samplerRECT Operator : texunit1, uniform samplerRECT Boundary : texunit2, uniform float4 params){ frag2frame OUT;

float2 center = IN.TexCoord0.xy; float4 U = f4texRECT(Source, center); // Calculate Red-Black (odd-even) masks float2 intpart; float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart)); float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y); if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y)) { float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f), params.x*center.y - 0.5f*(params.x-1.0f)); ... float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f, center.y - 1.0f, center.y + 1.0f); float central = -2.0f*(O.x + O.y); float poisson = ((params.x*params.x)*U.z + (-O.x * f1texRECT(Source, float2(neighbor.x, center.y)) + -O.x * f1texRECT(Source, float2(neighbor.y, center.y)) + -O.y * f1texRECT(Source, float2(center.x, neighbor.z)) + -O.z * f1texRECT(Source, float2(center.x, neighbor.w)))) / O.w; OUT.COL.x = poisson; } ... return OUT;}

Learning by example:Learning by example:A really naïve shaderA really naïve shaderLearning by example:Learning by example:A really naïve shaderA really naïve shader

Page 6: Efficient Data Parallel Computing on GPUs

frag2frame Smooth(vert2frag IN, uniform samplerRECT Source : texunit0, uniform samplerRECT Operator : texunit1, uniform samplerRECT Boundary : texunit2, uniform float4 params){ frag2frame OUT;

float2 center = IN.TexCoord0.xy; float4 U = f4texRECT(Source, center); // Calculate Red-Black (odd-even) masks float2 intpart; float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart)); float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y); if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y)) { float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f), params.x*center.y - 0.5f*(params.x-1.0f)); ... float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f, center.y - 1.0f, center.y + 1.0f); float central = -2.0f*(O.x + O.y); float poisson = ((params.x*params.x)*U.z + (-O.x * f1texRECT(Source, float2(neighbor.x, center.y)) + -O.x * f1texRECT(Source, float2(neighbor.y, center.y)) + -O.y * f1texRECT(Source, float2(center.x, neighbor.z)) + -O.z * f1texRECT(Source, float2(center.x, neighbor.w)))) / O.w; OUT.COL.x = poisson; } ... return OUT;}

Data Parallel Computing: Data Parallel Computing: Vector ProcessingVector ProcessingData Parallel Computing: Data Parallel Computing: Vector ProcessingVector Processing

Page 7: Efficient Data Parallel Computing on GPUs

float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f), params.x*center.y - 0.5f*(params.x-1.0f));

float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f, center.y - 1.0f, center.y + 1.0f);

Data Parallel Computing: Data Parallel Computing: Vector ProcessingVector ProcessingData Parallel Computing: Data Parallel Computing: Vector ProcessingVector Processing

Page 8: Efficient Data Parallel Computing on GPUs

float2 offset = center.xy - 0.5f;offset = offset * params.xx + 0.5f; // MADR is cool too – one cycle, two flops

float4 neighbor = center.xxyy + float4(-1.0f,1.0f,-1.0f,1.0f);

Data Parallel Computing: Data Parallel Computing: Vector ProcessingVector ProcessingData Parallel Computing: Data Parallel Computing: Vector ProcessingVector Processing

Page 9: Efficient Data Parallel Computing on GPUs

Data Parallel Computing:Data Parallel Computing:Data LayoutData LayoutData Parallel Computing:Data Parallel Computing:Data LayoutData Layout

• Pack scalar data into RGBA in texture memory

Page 10: Efficient Data Parallel Computing on GPUs

Computational FrequencyComputational FrequencyComputational FrequencyComputational Frequency

Page 11: Efficient Data Parallel Computing on GPUs

Computational FrequencyComputational FrequencyComputational FrequencyComputational Frequency

• Think of your CPU program and your vertex and fragment programs as different levels of nested looping.

– CPU Program– Vertex Program

– Fragment Program

Page 12: Efficient Data Parallel Computing on GPUs

Computational FrequencyComputational FrequencyComputational FrequencyComputational Frequency

• Branches– Avoid these, especially in the inner loop – i.e., the fragment

program.

Page 13: Efficient Data Parallel Computing on GPUs

Computational Frequency: Computational Frequency: Avoid inner-loop branchingAvoid inner-loop branchingComputational Frequency: Computational Frequency: Avoid inner-loop branchingAvoid inner-loop branching

• Static branch resolution– write several variants of each fragment program to handle

boundary cases– eliminates conditionals in the fragment program– equivalent to avoiding CPU inner-loop branching

case 2: accounts for boundaries

case 1: no boundaries

Page 14: Efficient Data Parallel Computing on GPUs

Computational FrequencyComputational FrequencyComputational FrequencyComputational Frequency

• Branches– Avoid these, especially in the inner loop – i.e., the fragment

program.

– Ian’s talk will give some strategies for branching if you absolutely cannot avoid it.

Page 15: Efficient Data Parallel Computing on GPUs

Computational FrequencyComputational FrequencyComputational FrequencyComputational Frequency

• Precompute

• Precompute

• Precompute

Page 16: Efficient Data Parallel Computing on GPUs

Computational Frequency: Computational Frequency: Precomputing texcoordsPrecomputing texcoordsComputational Frequency: Computational Frequency: Precomputing texcoordsPrecomputing texcoords

• Take advantage of under-utilized hardware– vertex processor– rasterizer

• Reduce instruction count at the per-fragment level

• Avoid lookups being treated as texture indirections

Page 17: Efficient Data Parallel Computing on GPUs

vert2frag smooth(app2vert IN, uniform float4x4 xform : C0, uniform float2 srcoffset, uniform float size){ vert2frag OUT;

OUT.position = mul(xform,IN.position); OUT.center = IN.center; OUT.redblack = IN.center - srcoffset; OUT.operator = size*(OUT.redblack - 0.5f) + 0.5f; OUT.hneighbor = IN.center.xxyx + float4(-1.0f, 1.0f, 0.0f, 0.0f); OUT.vneighbor = IN.center.xyyy + float4(0.0f, -1.0f, 1.0f, 0.0f);

return OUT;}

Computational Frequency: Computational Frequency: Precomputing texcoordsPrecomputing texcoordsComputational Frequency: Computational Frequency: Precomputing texcoordsPrecomputing texcoords

Page 18: Efficient Data Parallel Computing on GPUs

frag2frame Smooth(vert2frag IN, uniform samplerRECT Source : texunit0, uniform samplerRECT Operator : texunit1, uniform samplerRECT Boundary : texunit2, uniform float4 params){ frag2frame OUT;

float2 center = IN.TexCoord0.xy; float4 U = f4texRECT(Source, center); // Calculate Red-Black (odd-even) masks float2 intpart; float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart)); float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y); if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y)) { float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f), params.x*center.y - 0.5f*(params.x-1.0f)); ... float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f, center.y - 1.0f, center.y + 1.0f); float central = -2.0f*(O.x + O.y); float poisson = ((params.x*params.x)*U.z + (-O.x * f1texRECT(Source, float2(neighbor.x, center.y)) + -O.x * f1texRECT(Source, float2(neighbor.y, center.y)) + -O.y * f1texRECT(Source, float2(center.x, neighbor.z)) + -O.z * f1texRECT(Source, float2(center.x, neighbor.w)))) / O.w; OUT.COL.x = poisson; } ... return OUT;}

Computational Frequency: Computational Frequency: Precomputing texcoordsPrecomputing texcoordsComputational Frequency: Computational Frequency: Precomputing texcoordsPrecomputing texcoords

Page 19: Efficient Data Parallel Computing on GPUs

Computational Frequency: Computational Frequency: Precomputing other valuesPrecomputing other valuesComputational Frequency: Computational Frequency: Precomputing other valuesPrecomputing other values

• Same deal! Factor other computations out:– Anything that varies linearly across the geometry– Anything that has a complex value computed per-vertex– Anything that is uniform across the geometry

Page 20: Efficient Data Parallel Computing on GPUs

Computational Frequency: Computational Frequency: Precomputing on the CPUPrecomputing on the CPUComputational Frequency: Computational Frequency: Precomputing on the CPUPrecomputing on the CPU

• Use glMultiTexCoord4f() creatively

• Extract as much uniformity from uniform parameters as you can

Page 21: Efficient Data Parallel Computing on GPUs

// Calculate Red-Black (odd-even) masksfloat2 intpart;float2 place = floor(1.0f - modf(round(center + 0.5f) / 2.0f, intpart));float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y);

if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y)){ ...

Computational Frequency:Computational Frequency:Precomputed lookup tablesPrecomputed lookup tablesComputational Frequency:Computational Frequency:Precomputed lookup tablesPrecomputed lookup tables

Page 22: Efficient Data Parallel Computing on GPUs

half4 mask = f4texRECT(RedBlack, IN.redblack);/* * mask.x and mask.w tell whether IN.center.x and IN.center.y * are both odd or both even, respectively. either of these two * conditions indicates that the fragment is red. params.x==1 * selects red; params.y==1 selects black. */if (dot(mask,params.xyyx)){ ...

Computational Frequency:Computational Frequency:Precomputed lookup tablesPrecomputed lookup tablesComputational Frequency:Computational Frequency:Precomputed lookup tablesPrecomputed lookup tables

Page 23: Efficient Data Parallel Computing on GPUs

Computational Frequency:Computational Frequency:Precomputed lookup tablesPrecomputed lookup tablesComputational Frequency:Computational Frequency:Precomputed lookup tablesPrecomputed lookup tables

• Be careful with texture lookups – cache coherence is crucial

• Use the smallest data types you can get away with to reduce bandwidth consumption

• “Computation is cheap; memory accesses are not.”...if you’re memory access limited.

Page 24: Efficient Data Parallel Computing on GPUs

Profiling and Load BalancingProfiling and Load BalancingProfiling and Load BalancingProfiling and Load Balancing

Page 25: Efficient Data Parallel Computing on GPUs

Profiling and Load Profiling and Load BalancingBalancingProfiling and Load Profiling and Load BalancingBalancing

• Software profiling

• GPU pipeline profiling

• GPU load balancing

Page 26: Efficient Data Parallel Computing on GPUs

Profiling and Load Balancing: Profiling and Load Balancing: Software profilingSoftware profilingProfiling and Load Balancing: Profiling and Load Balancing: Software profilingSoftware profiling

• Run a standard software profiler!– Rational Quantify– Intel VTune– AMD CodeAnalyst

Page 27: Efficient Data Parallel Computing on GPUs

Profiling and Load Balancing: Profiling and Load Balancing: GPU pipeline profilingGPU pipeline profilingProfiling and Load Balancing: Profiling and Load Balancing: GPU pipeline profilingGPU pipeline profiling

• This is where it gets tricky.

• Some tools exist to help you:– NVPerfHUD

http://developer.nvidia.com/docs/IO/8343/How-To-Profile.pdf

– NVShaderPerfhttp://developer.nvidia.com/object/nvshaderperf_home.html

– Apple OpenGL Profilerhttp://developer.apple.com/opengl/profiler_image.html

Page 28: Efficient Data Parallel Computing on GPUs

Profiling and Load Balancing: Profiling and Load Balancing: GPU load balancingGPU load balancingProfiling and Load Balancing: Profiling and Load Balancing: GPU load balancingGPU load balancing

• This is a whole talk in and of itself– e.g., http://developer.nvidia.com/docs/IO/8343/Performance-Optimisation.pdf

• Sometimes you can get more hints from third parties than from the vendors themselves– http://www.3dcenter.de/artikel/cinefx/index6_e.php

– http://www.3dcenter.de/artikel/nv40_technik/

Page 29: Efficient Data Parallel Computing on GPUs

ConclusionsConclusionsConclusionsConclusions

Page 30: Efficient Data Parallel Computing on GPUs

ConclusionsConclusionsConclusionsConclusions

• Get used to thinking in terms of vector computation

• Understand how frequently each computation will run, and reduce that frequency wherever possible

• Track down bottlenecks in your application, and shift work to other parts of the system that are idle

Page 31: Efficient Data Parallel Computing on GPUs

Questions?Questions?Questions?Questions?

• Acknowledgements– Pat Brown and Nick Triantos at NVIDIA– NVIDIA for having given me a job this summer– Dave Luebke, my advisor– GPGPU course presenters– NSF Award #0092793