Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture...
Transcript of Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture...
![Page 1: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/1.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Query Processing on GPUs
Graphics Processor OverviewMapping Computation to GPUsDatabase and data mining applications
Database queriesQuantile and frequency queriesExternal memory sortingScientific computations
Summary
![Page 2: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/2.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
CPU vs. GPU
CPU(3 GHz)
System Memory(2 GB)
AGP Memory(512 MB)
PCI-E Bus(4 GB/s)
Video Memory(512 MB)
GPU (690 MHz)
Video Memory(512 MB)
GPU (690 MHz)
2 x 1 MB Cache
![Page 3: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/3.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Query Processing on CPUs
Slow random memory accessesSmall CPU caches (< 2MB)Random memory accesses slower than even sequential disk accesses
High memory latencyHuge memory to compute gap!
CPUs are deeply pipelinedPentium 4 has 30 pipeline stagesDo not hide latency - high cycles per instruction (CPI)
CPU is under-utilized for data intensive applications
![Page 4: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/4.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Graphics Processing Units (GPUs)
Commodity processor for graphics applicationsMassively parallel vector processorsHigh memory bandwidth
Low memory latency pipelineProgrammable
High growth ratePower-efficient
![Page 5: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/5.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
GPU: Commodity Processor
Cell phones Laptops Consoles
PSPDesktops
![Page 6: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/6.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Graphics Processing Units (GPUs)
Commodity processor for graphics applicationsMassively parallel vector processors
10x more operations per sec than CPUs
High memory bandwidthLow memory latency pipelineProgrammable
High growth ratePower-efficient
![Page 7: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/7.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Parallelism on GPUs
Graphics FLOPS
GPU – 1.3 TFLOPS
CPU – 25.6 GFLOPS
![Page 8: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/8.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Graphics Processing Units (GPUs)
Commodity processor for graphics applicationsMassively parallel vector processorsHigh memory bandwidth
Low memory latency pipelineProgrammable10x more memory bandwidth than CPUs
High growth ratePower-efficient
![Page 9: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/9.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
vertex
setuprasterizer
pixel
texture
image
per-pixel texture, fp16 blending
Graphics Pipeline
programmable vertexprocessing (fp32)
programmable per-pixel math (fp32)
polygonpolygon setup,culling, rasterization
Z-buf, fp16 blending,anti-alias (MRT) memory
Hides memory latency!!
Low
pip
elin
e de
pth
56 GB/s
![Page 10: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/10.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
data
setuprasterizer
data
data
data
data fetch, fp16 blending
NON-Graphics Pipeline Abstraction
programmable MIMDprocessing (fp32)
programmable SIMDprocessing (fp32)
listsSIMD“rasterization”
predicated write, fp16blend, multiple output memory
Courtesy: David Kirk,Chief Scientist, NVIDIA
![Page 11: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/11.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Graphics Processing Units (GPUs)
Commodity processor for graphics applicationsMassively parallel vector processorsHigh memory bandwidth
Low memory latency pipelineProgrammable
High growth ratePower-efficient
![Page 12: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/12.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Exploiting Technology MovingFaster than Moore’s Law
CPU Growth Rate
GPU Growth Rate
![Page 13: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/13.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Graphics Processing Units (GPUs)
Commodity processor for graphics applicationsMassively parallel vector processorsHigh memory bandwidth
Low memory latency pipelineProgrammable
High growth ratePower-efficient
![Page 14: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/14.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
CPU vs. GPU(Henry Moreton: NVIDIA, Aug. 2005)
101.620.00.2GFLOPS/W0.565130Power (W)50.8130025.6Graphics GFLOPs
GPU/CPU7800GTXPEE 840
![Page 15: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/15.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Outline
Graphics Processor OverviewMapping Computation to GPUsDatabase and data mining applications
Database queriesQuantile and frequency queriesExternal memory sortingScientific computations
Summary
![Page 16: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/16.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
The Importance of Data Parallelism
GPUs are designed for graphicsHighly parallel tasks
GPUs process independent vertices & fragments
Temporary registers are zeroedNo shared or static dataNo read-modify-write buffers
Data-parallel processingGPUs architecture is ALU-heavy
• Multiple vertex & pixel pipelines, multiple ALUs per pipe
Hide memory latency (with more computation)
![Page 17: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/17.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Arithmetic Intensity
Arithmetic intensityops per word transferredComputation / bandwidth
Best to have high arithmetic intensityIdeal GPGPU apps have
Large data setsHigh parallelismMinimal dependencies between data elements
![Page 18: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/18.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Data Streams & Kernels
StreamsCollection of records requiring similar computation
• Vertex positions, Voxels, FEM cells, etc.
Provide data parallelism
KernelsFunctions applied to each element in stream
• transforms, PDE, …
Few dependencies between stream elements• Encourage high Arithmetic Intensity
![Page 19: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/19.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Example: Simulation Grid
Common GPGPU computation styleTextures represent computational grids = streams
Many computations map to gridsMatrix algebraImage & Volume processingPhysically-based simulation
Non-grid streams can be mapped to grids
![Page 20: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/20.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Stream Computation
Grid Simulation algorithmMade up of stepsEach step updates entire gridMust complete before next step can begin
Grid is a stream, steps are kernelsKernel applied to each stream element
Cloud simulation algorithm
![Page 21: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/21.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Scatter vs. Gather
Grid communicationGrid cells share information
![Page 22: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/22.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Computational Resources Inventory
Programmable parallel processorsVertex & Fragment pipelines
RasterizerMostly useful for interpolating addresses (texture coordinates) and per-vertex constants
Texture unitRead-only memory interface
Render to textureWrite-only memory interface
![Page 23: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/23.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Vertex Processor
Fully programmable (SIMD / MIMD)Processes 4-vectors (RGBA / XYZW)Capable of scatter but not gather
Can change the location of current vertexCannot read info from other verticesCan only read a small constant memory
Latest GPUs: Vertex Texture FetchRandom access memory for vertices≈Gather (But not from the vertex stream itself)
![Page 24: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/24.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Fragment Processor
Fully programmable (SIMD)Processes 4-component vectors (RGBA / XYZW)Random access memory read (textures)Capable of gather but not scatter
RAM read (texture fetch), but no RAM writeOutput address fixed to a specific pixel
Typically more useful than vertex processorMore fragment pipelines than vertex pipelinesDirect output (fragment processor is at end of pipeline)
![Page 25: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/25.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
CPU-GPU Analogies
CPU programming is familiarGPU programming is graphics-centric
Analogies can aid understanding
![Page 26: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/26.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
CPU-GPU Analogies
CPU GPU
Stream / Data Array = TextureMemory Read = Texture Sample
![Page 27: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/27.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Kernels
Kernel / loop body / algorithm step = Fragment Program
CPU GPU
![Page 28: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/28.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Feedback
Each algorithm step depends on the results of previous steps
Each time step depends on the results of the previous time step
![Page 29: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/29.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Feedback
.Grid[i][j]= x;
.
.
.
Array Write = Render to Texture
CPU GPU
![Page 30: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/30.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
GPU Simulation Overview
Analogies lead to implementationAlgorithm steps are fragment programs
• Computational kernels
Current state is stored in texturesFeedback via render to texture
One question: how do we invoke computation?
![Page 31: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/31.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Invoking Computation
Must invoke computation at each pixel
Just draw geometry!Most common GPGPU invocation is a full-screen quad
Other Useful AnalogiesRasterization = Kernel InvocationTexture Coordinates = Computational DomainVertex Coordinates = Computational Range
![Page 32: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/32.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Typical “Grid” Computation
Initialize “view” (so that pixels:texels::1:1)glMatrixMode(GL_MODELVIEW);glLoadIdentity();glMatrixMode(GL_PROJECTION);glLoadIdentity();glOrtho(0, 1, 0, 1, 0, 1);glViewport(0, 0, outTexResX, outTexResY);
For each algorithm step:Activate render-to-textureSetup input textures, fragment programDraw a full-screen quad (1x1)
![Page 33: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/33.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Example: N-Body Simulation
Brute force N = 8192 bodiesN2 gravity computations
64M force comps. / frame~25 flops per force10.5 fps 17+ GFLOPs sustained
GeForce 7800 GTX
Nyland, Harris, Prins,GP2 2004 poster
![Page 34: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/34.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Computing Gravitational Forces
Each body attracts all other bodiesN bodies, so N2 forces
Draw into an NxN bufferPixel (i,j) computes force between bodies i and jVery simple fragment program
![Page 35: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/35.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Computing Gravitational Forces
NN--body force Texturebody force Texture
force(force(ii,,jj))
NNii
NN
00
j
ii
jj
Body Position TextureBody Position Texture
F(i,j) = gMiMj / r(i,j)2,
r(i,j) = |pos(i) - pos(j)|
Force is proportional to the inverse square Force is proportional to the inverse square of the distance between bodiesof the distance between bodies
![Page 36: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/36.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Computing Gravitational Forces
float4 force(float2 ij : WPOS,uniform sampler2D pos) : COLOR0
{// Pos texture is 2D, not 1D, so we need to// convert body index into 2D coords for pos texfloat4 iCoords = getBodyCoords(ij);float4 iPosMass = texture2D(pos, iCoords.xy);float4 jPosMass = texture2D(pos, iCoords.zw);float3 dir = iPos.xyz - jPos.xyz;float r2 = dot(dir, dir);dir = normalize(dir);return dir * g * iPosMass.w * jPosMass.w / r2;
}
![Page 37: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/37.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Computing Total Force
Have: array of (i,j) forcesNeed: total force on each particle i
Sum of each column of the force array
Can do all N columns in parallel
This is called a Parallel Reduction
force(i,j)
NN--body force Texturebody force Texture
NNii
NN
00
![Page 38: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/38.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Parallel Reductions
1D parallel reduction: sum N columns or rows in paralleladd two halves of texture togetherrepeatedly...Until we’re left with a single row of texels ++
NNxxNN
NNx(x(NN/2)/2)NNx(x(NN/4)/4)NNx1x1
Requires logRequires log22NN stepssteps
![Page 39: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/39.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Update Positions and Velocities
Now we have a 1-D array of total forces
One per body
Update Velocityu(i,t+dt) = u(i,t) + Ftotal(i) * dtSimple pixel shader reads previous velocity and force textures, creates new velocity texture
Update Positionx(i, t+dt) = x(i,t) + u(i,t) * dtSimple pixel shader reads previous position and velocity textures, creates new position texture
![Page 40: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/40.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Summary
Presented mappings of basic computational concepts to GPUs
Basic concepts and terminologyFor introductory “Hello GPGPU” sample code, see http://www.gpgpu.org/developer
Only the beginning:Rest of course presents advanced techniques, strategies, and specific algorithms.
![Page 41: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/41.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Outline
Graphics Processor OverviewMapping Computation to GPUsApplications
Database queriesQuantile and frequency queriesExternal memory sortingScientific computations
Summary
![Page 42: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/42.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Basic DB Operations
Basic SQL query Select AFrom TWhere C
A= attributes or aggregations (SUM, COUNT, MAX etc)
T=relational tableC= Boolean Combination of Predicates (using
operators AND, OR, NOT)
N. Govindaraju, B. Lloyd, W. Wang, M. Lin and D. Manocha , Proc. of ACM SIGMOD 04
![Page 43: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/43.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Database Operations
Predicates ai op constant or ai op aj
op: <,>,<=,>=,!=, =, TRUE, FALSE
Boolean combinations Conjunctive Normal Form (CNF)
AggregationsCOUNT, SUM, MAX, MEDIAN, AVG
![Page 44: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/44.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Data Representation
Attribute values ai are stored in 2D textures on the GPUA fragment program is used to copy attributes to the depth buffer
![Page 45: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/45.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Copy Time to the Depth Buffer
![Page 46: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/46.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Predicate Evaluation
ai op constant (d)Copy the attribute values ai into depth bufferSpecify the comparison operation used in the depth testDraw a screen filling quad at depth d and perform the depth test
![Page 47: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/47.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Screen
PIf ( ai op d )pass fragment
Else
reject fragment
ai op d
d
![Page 48: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/48.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Predicate Evaluation
CPU implementation — Intel compiler 7.1 with SIMD optimizations
GPU is nearly 20 times faster than 2.8 GHz Xeon
![Page 49: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/49.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Predicate Evaluation
ai op ajEquivalent to (ai – aj) op 0
![Page 50: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/50.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Boolean Combination
CNF: (A1 AND A2 AND … AND Ak) whereAi = (Bi
1 OR Bi2 OR … OR Bi
mi )
Performed using stencil test recursivelyC1 = (TRUE AND A1) = A1
Ci = (A1 AND A2 AND … AND Ai) = (Ci-1 AND Ai)
Different stencil values are used to code the outcome of Ci
Positive stencil values — pass predicate evaluation Zero — fail predicate evaluation
![Page 51: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/51.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
A1 AND A2
A1
B21
B22
B23
A2 = (B21 OR B2
2 OR B23 )
![Page 52: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/52.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
A1 AND A2
A1
Stencil value = 1
![Page 53: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/53.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
A1 AND A2
A1
Stencil value = 0
Stencil value = 1
TRUE AND A1
![Page 54: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/54.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
A1 AND A2
A1
Stencil = 0
Stencil = 1
B21
Stencil=2
B22
Stencil=2
B23
Stencil=2
![Page 55: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/55.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
A1 AND A2
A1
Stencil = 0
Stencil = 1
B21
B22
B23
Stencil=2
Stencil=2Stencil=2
![Page 56: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/56.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
A1 AND A2
Stencil = 0
Stencil=2A1 AND B2
1
Stencil = 2A1 AND B2
2 Stencil=2
A1 AND B23
![Page 57: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/57.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Multi-Attribute Query
![Page 58: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/58.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Range Query
Compute ai within [low, high]Evaluated as ( ai >= low ) AND ( ai <= high )
Use NVIDIA depth bounds test to evaluate both conditionals in a single clock cycle
![Page 59: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/59.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Range Query
GPU is nearly 20 times faster than 2.8 GHz Xeon
![Page 60: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/60.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Aggregations
COUNT, MAX, MIN, SUM, AVG
![Page 61: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/61.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
COUNT
Use occlusion queries to get the number of pixels passing the testsSyntax:
Begin occlusion queryPerform database operationEnd occlusion queryGet count of number of attributes that passed database operation
Involves no additional overhead!Efficient selectivity computation
![Page 62: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/62.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
MAX, MIN, MEDIAN
Kth-largest numberTraditional algorithms require data rearrangementsWe perform
no data rearrangements no frame buffer readbacks
![Page 63: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/63.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
K-th Largest Number
Given a set S of valuesc(m) —number of values ≥ mvk — the k-th largest number
We haveIf c(m) > k-1, then m ≤ vk
If c(m) ≤ k-1, then m > vk
Evaluate one bit at a time
![Page 64: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/64.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
0011 1011 1101
0111 0101 0001
0111 1010 0010
m = 0000v2 = 1011
2nd Largest in 9 Values
![Page 65: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/65.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
0011 1011 1101
0111 0101 0001
0111 1010 0010
m = 1000v2 = 1011
Draw a Quad at Depth 8 Compute c(1000)
![Page 66: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/66.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
0011 1011 1101
0111 0101 0001
0111 1010 0010
m = 1000v2 = 1011
c(m) = 3
1st bit = 1
![Page 67: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/67.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
0011 1011 1101
0111 0101 0001
0111 1010 0010
m = 1100v2 = 1011
Draw a Quad at Depth 12 Compute c(1100)
![Page 68: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/68.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
0011 1011 1101
0111 0101 0001
0111 1010 0010
m = 1100v2 = 1011
c(m) = 1
2nd bit = 0
![Page 69: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/69.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
0011 1011 1101
0111 0101 0001
0111 1010 0010
m = 1010v2 = 1011
Draw a Quad at Depth 10 Compute c(1010)
![Page 70: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/70.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
0011 1011 1101
0111 0101 0001
0111 1010 0010
m = 1010v2 = 1011
c(m) = 3
3rd bit = 1
![Page 71: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/71.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
0011 1011 1101
0111 0101 0001
0111 1010 0010
m = 1011v2 = 1011
Draw a Quad at Depth 11 Compute c(1011)
![Page 72: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/72.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
0011 1011 1101
0111 0101 0001
0111 1010 0010
m = 1011v2 = 1011
c(m) = 2
4th bit = 1
![Page 73: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/73.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Our algorithm
Initialize m to 0Start with the MSB and scan all bits till LSBAt each bit, put 1 in the corresponding bit-position of mIf c(m) < k, make that bit 0Proceed to the next bit
![Page 74: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/74.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Median
GPU is nearly 6 times faster than 2.8 GHz Xeon!
![Page 75: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/75.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Outline
Graphics Processor OverviewMapping Computation to GPUsDatabase and data mining applications
Database queriesQuantile and frequency queriesExternal memory sortingScientific computations
Summary
![Page 76: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/76.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Streaming
Stream is a continuous sequence of data values arriving at a port
Many real world applications process data streams
Networking data, Stock marketing and financial data, Data collected from sensorsData logs from web trackers
N. Govindaraju, N. Raghuvanshi, D. ManochaProc. Of ACM SIGMOD 05
![Page 77: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/77.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Stream Queries
Applications perform continuous queries and usually collect statistics on streams
Frequencies of elementsQuantiles in a sequenceAnd many more (SUM, MEAN, VARIANCE, etc.)
Widely studied in databases, networking, computational geometry, theory of algorithms, etc.
![Page 78: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/78.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Stream Queries
Massive amounts of data is processed in real-time!
Memory limitations – estimate the query results instead of exact results
![Page 79: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/79.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Approximate Queries
Stream historyStream
![Page 80: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/80.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Approximate Queries
StreamDiscarded InformationSliding Window
![Page 81: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/81.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
ε-Approximate Queries
ϕ-quantile : element with rank ϕ N, 0 < ϕ < 1
ε-approximate ϕ-quantile : Any element with rank (ϕ ± ε) N 0 < ε < 1
Frequency : Number of occurrences of an element f
ε-approximate frequency : Any element with frequency f ≥ f’ ≥ (f - εN), 0 < ε < 1
![Page 82: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/82.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
ε-Approximate Queries
Queries computed using a ε-approximatesummary data structure
Performed by batch insertion a subset of window elements
Ref: [Manku and Motwani 2002, Greenwald and Khanna 2004, Arasu and Manku 2004]
![Page 83: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/83.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
ε-Approximate Summary Construction
Histogram Computation
Merge Operation
CompressOperation
ε-Approximate Summary
ε-Approximate Summary
Window of elements
![Page 84: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/84.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
ε-Approximate Summary Construction
Histogram Computation
Merge Operation
CompressOperation
Computes histogram by sorting the elements
Window of elements
![Page 85: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/85.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
ε-Approximate Summary Construction
Histogram Computation
Merge Operation
CompressOperation
Inserts a subset of the histogram into the summary
ε-Approximate Summary
![Page 86: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/86.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
ε-Approximate Summary Construction
Histogram Computation
Merge Operation
CompressOperation
ε-Approximate Summary
Deletes elements from the summary
![Page 87: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/87.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
ε-Approximate Summary Construction
Histogram Computation
Merge Operation
CompressOperation
ε-Approximate Summary
Window of elements
![Page 88: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/88.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
ε-Approximate Summary Construction
Histogram Computation
Merge Operation
CompressOperation
70 - 95% of the entire time!
Window of elements
ε-Approximate Summary
![Page 89: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/89.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Timing Breakup: Frequency Estimation
Sorting takes nearly 90% time on CPUs
Computational Performance
0
5
10
15
20
25
30
35
40
0.3814 0.1907 0.09536 0.04768 0.02384 0.01192Epsilon*105
Tim
e (s
ecs)
Sorting TimeMerge time
SortingMerge + Compress
![Page 90: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/90.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Sorting on CPUs
Well studiedOptimized Quicksort performs better [LaMarca and Ladner 1997]
Performance mainly governed by cache sizesLarge overhead per cache miss – nearly 100 clock cycles
![Page 91: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/91.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Sorting on CPUs
Sorting incurs cache missesIrregular data access patterns in sorting Small cache sizes (few KB)
Additional stalls - branch mispredictions
Degrading performance in new CPUs! [LaMarca and Ladner 97]
![Page 92: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/92.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Sorting on GPUs
Use the high data parallelism, and memory bandwidth on GPUs for fast sorting
Many sorting algorithms require writes to arbitrary locations
Not supported on GPUsMap algorithms with deterministic access pattern to GPUs (e.g., periodic balanced sorting network [Dowd 89])Represent data in 2D images
![Page 93: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/93.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Sorting Networks
Multi-stage algorithmEach stage involves multiple steps
In each step1. Compare one pixel against exactly one other pixel2. Perform a conditional assignment (MIN or MAX) at
each pixel
![Page 94: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/94.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
![Page 95: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/95.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
2D Memory Addressing
GPUs optimized for 2D representationsMap 1D arrays to 2D arraysMinimum and maximum regions mapped to row-aligned or column-aligned quads
![Page 96: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/96.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
1D – 2D Mapping
MIN MAX
![Page 97: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/97.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
1D – 2D Mapping
MIN
Effectively reduce instructions per element
![Page 98: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/98.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Sorting on GPU: Pipelining and Parallelism
Input Vertices
Texturing, Caching and 2D Quad
Comparisons
Sequential Writes
![Page 99: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/99.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Sorting Analysis
Performed entirely on GPUO(log2n) stepsEach step performs n comparisonsTotal comparisons: O(n log2n)
Data sent and readback from GPUBandwidth: O(n) — low bandwidth requirement from CPU to GPU
![Page 100: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/100.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Comparison with GPU-Based Algorithms
3-6x faster than prior GPU-based algorithms!
![Page 101: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/101.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
GPU vs. High-End Multi-Core CPUs
2-2.5x faster thanIntel high-end processors
Single GPU performance comparable tohigh-end dual coreAthlon
Hand-optimized CPU code from Intel Corporation!
![Page 102: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/102.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
GPU Cache Model
Small data cachesLow memory latencyVendors do not disclose cache information –critical for scientific computing on GPUs
We design simple modelDetermine cache parameters (block and cache sizes)Improve sorting performance
N. Govindaraju, J. Gray, D. ManochaMicrosoft Research TR 2005
![Page 103: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/103.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Cache Evictions
Cache Eviction
Cache Eviction
Cache Eviction
Cache
![Page 104: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/104.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Cache issues
Cache Eviction
Cache Eviction
Cache Eviction
h
Cache misses per step = 2 W H/ (h B)
Cache
![Page 105: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/105.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Analysis
lg n possible steps in bitonic sorting network
Step k is performed (lg n – k+1) times and h = 2k-1
Data fetched from memory = 2 n f(B)where f(B)=(B-1) (lg n -1) + 0.5 (lg n –lg B)2
![Page 106: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/106.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Block Sizes on GPUs
![Page 107: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/107.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Cache-Efficient Algorithm
h
Cache
![Page 108: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/108.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Cache Sizes on GPUs
![Page 109: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/109.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Cache-Efficient Algorithm Performance
![Page 110: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/110.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Super-Moore’s Law Growth
50 GB/s on a single GPU
Peak Performance: Effectively hide memory latency with 15 GOP/s
![Page 111: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/111.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Frequency Estimation
Freq
uenc
yElement
1 2
3 4
5 6
7 8
Window=1/ εStream
Histogram
Merge
Freq
uenc
y
Element2 4 5 7
1
2
Compress
Freq
uenc
y
1 2
3 4
5 6
7 8
Element
![Page 112: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/112.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Applications: Frequency Estimation
![Page 113: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/113.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Quantile Estimation
Window
Sort
Sample (error=ε/2)Merge Operation
error ε1
merge with
error ε2error max(ε1, ε2)Compress Operation
error ε
Uniformly sample
B elementserror <= ε+1/2B
![Page 114: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/114.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Applications: QuantileEstimation
![Page 115: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/115.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Applications: Sliding Windows
![Page 116: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/116.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Advantages
Sorting performed as stream operations entirely on GPUs
Uses specialized functionality of texture mapping and blending - high performance
Low bandwidth requirement<10% of total computation time
![Page 117: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/117.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Outline
Graphics Processor OverviewMapping Computation to GPUsDatabase and data mining applications
Database queriesQuantile and frequency queriesExternal memory sortingScientific computations
Summary
![Page 118: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/118.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
External Memory Sorting
Performed on Terabyte-scale databases
Two phases algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]
Limited main memoryFirst phase – partitions input file into large data chunks and writes sorted chunks known as “Runs”Second phase – Merge the “Runs” to generate the sorted file
N. Govindaraju, J. Gray, R. Kumar, D. Manocha, Proc. Of ACM SIGMOD 06
![Page 119: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/119.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
External Memory Sorting
Performance mainly governed by I/O
Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if and only if the run size R in phase 1 is given by R ≈ √(TN)
![Page 120: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/120.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Salzberg Analysis
If N=100GB, T=2MB, then R ≈ 230MB
Large data sorting is inefficient on CPUs
R » CPU cache sizes – memory latency
![Page 121: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/121.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
External memory sorting
External memory sorting on CPUs has low performance due to
High memory latencyOr low I/O performance
Our algorithm Sorts large data arrays on GPUsPerform I/O operations in parallel on CPUs
![Page 122: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/122.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
GPUTeraSort
![Page 123: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/123.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
I/O Performance
Salzberg Analysis: 100 MB Run Size
![Page 124: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/124.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
I/O Performance
Pentium IV: 25MB Run Size
Less work and only 75% IO efficient!
Salzberg Analysis: 100 MB Run Size
![Page 125: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/125.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
I/O Performance
Dual 3.6 GHz Xeons: 25MB Run size
More cores, less work but only 85% IO efficient!
Salzberg Analysis: 100 MB Run Size
![Page 126: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/126.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
I/O Performance
7800 GT: 100MB run size
Ideal work, and 92% IO efficient with single CPU!
Salzberg Analysis: 100 MB Run Size
![Page 127: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/127.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Task Parallelism
Performance limited by IO and memory
![Page 128: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/128.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Overall Performance
Faster and more scalable than Dual Xeon processors (3.6 GHz)!
![Page 129: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/129.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
1.8x faster than current Terabyte sorter
World’s best performance/$ system
N. Govindaraju, J. Gray, R. Kumar, D. Manocha, Proc. Of ACM SIGMOD 06
http://research.microsoft.com/barc/SortBenchMark/
Performance/$
![Page 130: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/130.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Advantages
Exploit high memory bandwidth on GPUs
Higher memory performance than CPU-based algorithms
High I/O performance due to large run sizes
![Page 131: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/131.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Advantages
Offload work from CPUsCPU cycles well-utilized for resource management
Scalable solution for large databases
Best performance/price solution for terabyte sorting
![Page 132: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/132.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Applications
Frequency estimation [Manku and Motwani 02]
Quantile estimation [Greenwald and Khanna 01, 04]
Sliding windows [Arasu and Manku 04]
![Page 133: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/133.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Outline
Graphics Processor OverviewMapping Computation to GPUsDatabase and data mining applications
Database queriesQuantile and frequency queriesExternal memory sortingScientific computations
Summary
![Page 134: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/134.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Scientific Computations
Applied extensively in data mining algorithms
Least square fits, dimensionality reduction, classification etc.
We present mapping of LU-decomposition on GPUs
Extensions to QR-decomposition, singular value decomposition (GPU-LAPACK)
![Page 135: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/135.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
LU decomposition
Sequence of row eliminations: Scale and add: A(i,j) = A(i,j) – A(i,k) A(k,j)Input data mapping: 2 distinct memory areasNo data dependencies
Pivoting: row/column swapPointer-swap vs. data copy
k
k
k
N. Galoppo, N. Govindaraju, M. Henson, D. Manocha, Proc. Of ACM SuperComputing 05
![Page 136: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/136.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
LU decomposition
Theoretical complexity: 2/3n3 + O(n2)
Performance <> ArchitectureOrder of operationsData access (latency)Memory bandwidth
![Page 137: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/137.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Commodity CPUs
LINPACK Benchmark:
Intel Pentium 4, 3.06 GHz: 2.88 GFLOPs/s
(Dongarra, Oct’05)
![Page 138: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/138.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Motivation for LU-GPU
LU decomposition maps well:Stream programFew data dependencies
PivotingParallel pivot searchExploit large memory bandwidth
![Page 139: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/139.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
GPU based algorithms
Data representation
Algorithm mapping
![Page 140: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/140.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Data representation
Matrix elements 2D texture memoryOne-to-one mapping
Texture memory = on-board memoryExploit bandwidthAvoid CPU-GPU data transfer
![Page 141: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/141.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Stream computation
Rasterize quadrilateralsGenerates computation streamInvokes SIMD unitsRasterization simulates blocking
Rasterization pass = row elimination
Alternating memory regions
![Page 142: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/142.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Stream computation
![Page 143: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/143.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Benchmarks
Memory clockMemoryCore clockSIMD unitsGPU
430 MHz
425 MHz
350 MHz
24
16
12
1200 MHz256 Mb7800 Ultra
1100 MHz256 Mb6800 Ultra
900 MHz256 Mb6800 GT
3.4 GHz Pentium IV with Hyper-Threading1 MB L2 cacheLAPACK getrf() (blocked algorithm, SSE-optimized ATLAS library)
Commodity CPU
![Page 144: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/144.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Results: No pivoting
0
1
2
3
4
5
6
7
8
9
1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
ATLAS GETRF (Partial Pivot)Ultra 6800 LU (no pivoting)GT 6800 LU (no pivoting)7800 LU (no pivoting)
0
1
2
3
4
5
6
7
8
9
1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
ATLAS GETRF (Partial Pivot)Ultra 6800 LU (no pivoting)GT 6800 LU (no pivoting)7800 LU (no pivoting)
0
1
2
3
4
5
6
7
8
9
1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
ATLAS GETRF (Partial Pivot)Ultra 6800 LU (no pivoting)GT 6800 LU (no pivoting)7800 LU (no pivoting)
0
1
2
3
4
5
6
7
8
9
1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
ATLAS GETRF (Partial Pivot)Ultra 6800 LU (no pivoting)GT 6800 LU (no pivoting)7800 LU (no pivoting)
![Page 145: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/145.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Results: Partial pivoting
0
2
4
6
8
10
12
500 1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
ATLAS GETRF (Partial Pivot)GT 6800 Partial PivotUltra 6800 Partial Pivot7800 Partial Pivot
0
2
4
6
8
10
12
500 1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
ATLAS GETRF (Partial Pivot)GT 6800 Partial PivotUltra 6800 Partial Pivot7800 Partial Pivot
0
2
4
6
8
10
12
500 1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
ATLAS GETRF (Partial Pivot)GT 6800 Partial PivotUltra 6800 Partial Pivot7800 Partial Pivot
0
2
4
6
8
10
12
500 1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
ATLAS GETRF (Partial Pivot)GT 6800 Partial PivotUltra 6800 Partial Pivot7800 Partial Pivot
![Page 146: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/146.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Results: Full Pivoting
0
50
100
150
200
250
500 1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
Ultra 6800 Full Pivot
LAPACK sgetc2 (IMKL)
7800 Full Pivot
0
50
100
150
200
250
500 1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
Ultra 6800 Full Pivot
LAPACK sgetc2 (IMKL)
7800 Full Pivot
0
50
100
150
200
250
500 1000 1500 2000 2500 3000 3500Matrix size N
Tim
e (s
)
Ultra 6800 Full Pivot
LAPACK sgetc2 (IMKL)
7800 Full Pivot
![Page 147: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/147.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
LUGPU Library
http://gamma.cs.unc.edu/LUGPULIB
![Page 148: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/148.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Outline
Graphics Processor OverviewMapping Computation to GPUsDatabase and data mining applicationsSummary
![Page 149: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/149.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Conclusions
Novel algorithms to perform Database management on GPUs• Evaluation of predicates, boolean combinations
of predicates, aggregations and join queries
Data streaming on GPUs• Quantile and Frequency estimation
Terabyte data managementData mining applications• LU decomposition, QR decomposition
![Page 150: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/150.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Conclusions
Algorithms take into account GPU limitations
No data rearrangementsNo frame buffer readbacks
Preliminary comparisons with optimized CPU implementations is promisingGPU as a useful co-processor
![Page 151: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/151.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
GPGP: GPU-based Algorithms
Spatial Database computationsSun, Agrawal, Abbadi 2003Bandi, Sun, Agrawal, Abbadi 2004
Data streamingBuck et al. 04, McCool et al. 04
Scientific computationsBolz et al. 03, Kruger et al. 03
CompilersBrook-GPU (Stanford), Sh (U. Waterloo), Accelerator (Microsoft Research)
…More at http://www.gpgpu.org
![Page 152: Query Processing on GPUsnatassa/aapubs/tutorials/ICDE06_partII.pdf · Latest GPUs: Vertex Texture Fetch Random access memory for vertices ≈Gather (But not from the vertex stream](https://reader036.fdocuments.in/reader036/viewer/2022071007/5fc5011862f6d3022c6cd758/html5/thumbnails/152.jpg)
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL@Carnegie MellonDatabases
Advantages
Algorithms progress at GPU growth rateOffload CPU work
Streaming processor parallel to CPU
Fast Massive parallelism on GPUsHigh memory bandwidth
Commodity hardware!