MICROSOFT MANY CORE APPLICATIONS INCUBATION RESEARCH
SortBenchmark: A Benchmark for Many-core Computing
Naga K. Govindaraju
Microsoft Many-core Incubation
Sorting
“I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!”
Don Knuth, Stanford University
There is a lot of effort on multi-core processors, and comparatively little effort on addressing the “core” problems: (1) the memory architecture, and (2) the way processors access memory. Sort demonstrates those problems very clearly.
Jim Gray, Microsoft Research
Sorting
Well studied
High performance computing
Databases
Computer graphics
Programming languages
...
Google MapReduce algorithm
SPEC benchmark routine!
Massive Databases
Terabyte data sets are common
Google sorts more than 100 billion terms in its index
More than 1 trillion records indexed on the web!
Database sizes are rapidly increasing!
Max DB size increases 3x per year (http://www.wintercorp.com)
Processor improvements not matching information explosion
CPU vs. GPU
[Diagram labels: CPU (3 GHz), 4 x 6 MB cache, system memory (16 GB), PCI-E bus (8 GB/s), GPU (690 MHz), video memory (4 GB)]
Massive Data Handling on CPUs
Require random memory accesses
Small CPU caches
Random memory accesses slower than even sequential disk accesses
High memory latency
Huge memory to compute gap!
CPUs are deeply pipelined
Pentium 4 has 30 pipeline stages
Do not hide latency - high cycles per instruction (CPI)
CPU is under-utilized for data intensive applications
Why Many-Core Sort?
http://research.microsoft.com/barc/SortBenchmark
[Chart: records/sec/cpu (log scale, 1.E+1 to 1.E+6) versus year (1985 to 2005), with points labeled Mini, Super, cache conscious, and GPUTeraSort]
Records per Second per CPU: slow improvement after 1995
Massive Data Handling on CPUs
Sorting is hard!
GPUs are a potentially scalable solution to terabyte sorting and scientific computing
We provide a scalable solution on GPUs
Graphics Processing Units (GPUs)
Commodity processor for graphics applications
Massively data-parallel processors
High memory bandwidth
Low memory latency pipeline
Programmable
High growth rate
GPU: Commodity Processor
Cell phones, laptops, consoles, PSP, desktops
Graphics Processing Units (GPUs)
Commodity processor for graphics applications
Massively data-parallel processors
10x more operations per sec than CPUs
High memory bandwidth
Pipeline better hides memory latency
Programmable
High growth rate
Parallelism on GPUs
Peak FLOPS
GPU – 933 GFLOPS
CPU – 100 GFLOPS
Graphics Processing Units (GPUs)
Commodity processor for graphics applications
Massively data-parallel processors
High memory bandwidth
Pipeline better hides memory latency
Programmable
10x more memory bandwidth than CPUs
High growth rate
Traditional GPGPU Pipeline
[Diagram: graphics pipeline stages (Input Assembler, Vertex Shader, Tessellation, Geometry Shader, Rasterizer, Pixel Shader, Output Merger) connected to memory at 140 GB/s]
Very high parallelism
Hides memory latency!!
DirectX11: Compute Shader
[Diagram: graphics pipeline stages (Input Assembler, Vertex Shader, Tessellation, Geometry Shader, Rasterizer, Pixel Shader, Output Merger) and a Compute Shader operating on a data structure, connected to memory at 140 GB/s]
Graphics Processing Units (GPUs)
Commodity processor for graphics applications
Massively data-parallel processors
High memory bandwidth
Pipeline better hides memory latency
Programmable
High growth rate
Memory Performance on GPUs
[Chart: GPU memory bandwidth in GB/s (0 to 160), Feb-02 to Feb-08]
GPUs for Sorting: Issues
Random writes are expensive
Optimized CPU algorithms do not map!
Lack of support for recursion
Out-of-core algorithms
Limited GPU memory
Outline
Overview
Sorting on GPUs
Conclusions and Future Directions
Sorting on GPUs
Adaptive sorting algorithms
Extent of sorted order in a sequence
General sorting algorithms
External memory sorting algorithms
Adaptive Sorting on GPUs
Prior adaptive sorting algorithms require random data writes
In insertion sort, processors may operate on different numbers of elements, causing load imbalance
GPUs optimized for data-parallel algorithms
Use optimized data-parallel primitives such as scans
Design adaptive sorting using only data-parallel computations
Avoid load imbalance by operating on same number of elements on each processor
Adaptive Sorting Algorithm
Multiple iterations
Each iteration uses a two pass algorithm
First pass – Compute an increasing sequence M
Second pass - Compute the sorted elements in M
Iterate on the remaining unsorted elements
N. Govindaraju, M. Henson, M. Lin and D. Manocha, Proc. of ACM I3D, 2005
Increasing Sequence
Given a sequence S = {x1, …, xn}, an element xi belongs to M if and only if xi ≤ xj for every xj in S with i < j
Increasing Sequence
[Diagram: X1 X2 X3 … Xi-1 Xi Xi+1 … Xn-2 Xn-1 Xn, with the elements of M marked; each is ≤ every element that follows it]
M is an increasing sequence
Increasing Sequence Computation
Scan X1 X2 … Xi-1 Xi Xi+1 … Xn-1 Xn from right to left, maintaining a running minimum Min (initially ∞)
Start: Xn ≤ ∞, so Xn is in M and Min = Xn
At each Xi: is Xi ≤ Min? If yes, prepend Xi to M and set Min = Xi
End: X1 is in M if X1 ≤ {X2, …, Xn}
Computing Sorted Elements
Theorem 1: Given the increasing sequence M, the rank of an element xi in M is determined if xi < min(I − M)
Computing Sorted Elements
[Diagram: the sequence X1 … Xn split into two rows, the elements of M and the remaining elements, with ≤ and ≥ comparisons between them]
Linear-time algorithm
Maintaining minimum
Scan X1 X2 … Xi-1 Xi … from left to right, maintaining the minimum (min) of the elements not in M seen so far
At each Xi: is Xi in M?
No. Update min
Yes. If Xi ≤ min, append Xi to sorted list
Algorithm Analysis
Knuth’s measure of disorder:
Given a sequence I and its longest increasing sequence LIS(I), the sequence of disordered elements is Y = I − LIS(I)
Theorem 2: Given a sequence I and LIS(I), our adaptive algorithm sorts I in at most (2 ||Y|| + 1) iterations
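The bound in Theorem 2 can be evaluated for a concrete input by measuring ||Y||. The sketch below is illustrative only: the function names are hypothetical, it assumes "increasing" allows equal neighbours (non-decreasing), and it finds the LIS length with the standard patience-sorting technique rather than anything described in the slides.

```python
from bisect import bisect_right
from typing import List, Sequence


def disorder(seq: Sequence[float]) -> int:
    """Return ||Y|| = len(seq) - length of a longest non-decreasing subsequence."""
    tails: List[float] = []  # tails[k] = smallest tail of a subsequence of length k + 1
    for x in seq:
        pos = bisect_right(tails, x)
        if pos == len(tails):
            tails.append(x)
        else:
            tails[pos] = x
    return len(seq) - len(tails)


def iteration_bound(seq: Sequence[float]) -> int:
    """Upper bound on iterations from Theorem 2: 2 * ||Y|| + 1."""
    return 2 * disorder(seq) + 1


print(disorder([8, 1, 2, 3, 4, 5, 6, 7, 9]))         # 1
print(iteration_bound([8, 1, 2, 3, 4, 5, 6, 7, 9]))  # 3
```

For an almost sorted input, ||Y|| is small, so the bound on the number of iterations stays small regardless of the input length.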
Pictorial Proof
[Diagram: X1 X2 … Xl Xl+1 Xl+2 … Xm Xm+1 Xm+2 ... Xq Xq+1 Xq+2 ... Xn, with three groups each labeled "2 iterations"]
Example
Input: 8 1 2 3 4 5 6 7 9
After the first iteration: 1 2 3 4 5 6 7 are sorted; 8 and 9 remain
Scan: All-prefix-sums
For an input sequence A = [a0, a1, …, an-1] and a binary associative operation ⊕ with left identity i, scan(A) = [i, a0, a0 ⊕ a1, …, a0 ⊕ a1 ⊕ … ⊕ an-2]
Example: ⊕ is addition, i = 0
A = [1, 1, 1, 1]
scan(A) = [0, 1, 2, 3]
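A minimal serial sketch of this all-prefix-sums (exclusive scan) definition follows. The function name and signature are illustrative assumptions; a GPU implementation would compute the same result in parallel.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def exclusive_scan(a: Sequence[T], op: Callable[[T, T], T], identity: T) -> List[T]:
    """Output k combines a[0..k-1] with op; output 0 is the identity."""
    out: List[T] = []
    acc = identity
    for x in a:
        out.append(acc)   # emit the running value before folding in x
        acc = op(acc, x)
    return out


print(exclusive_scan([1, 1, 1, 1], lambda x, y: x + y, 0))  # [0, 1, 2, 3]
```

Passing `min` with an identity of infinity gives the MIN scan used by the adaptive sorting algorithm.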
Computing Increasing Sequence
Compute MIN scan in backward direction: I’
Compare each element in I with I’
Elements less than or equal to the corresponding elements in I’ are in M
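As a rough serial illustration of this step, the sketch below computes a backward exclusive MIN scan and flags the elements that belong to M, using the definition of M given earlier (an element is in M when it is ≤ every later element). The function names are hypothetical, and the loop stands in for a data-parallel scan.

```python
import math
from typing import List, Sequence


def backward_min_scan(seq: Sequence[float]) -> List[float]:
    """out[i] = minimum of seq[i+1:], or infinity for the last position."""
    out = [math.inf] * len(seq)
    running = math.inf
    for i in range(len(seq) - 1, -1, -1):
        out[i] = running
        running = min(running, seq[i])
    return out


def increasing_sequence_flags(seq: Sequence[float]) -> List[bool]:
    """True where the element is <= every element that follows it."""
    suffix_min = backward_min_scan(seq)
    return [x <= m for x, m in zip(seq, suffix_min)]


flags = increasing_sequence_flags([8, 1, 2, 3, 4, 5, 6, 7, 9])
print(flags)  # [False, True, True, True, True, True, True, True, True]
```

The comparison is independent per element, so once the scan is available every position can be tested in parallel.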
Computing Sorted Elements
Compute forward MIN scan of elements in I-M: I’’
Elements in M are set to ∞ in I and the scan is performed on I, so that only elements of I-M contribute
Compare elements in M with I’’
Elements that are less than or equal to the corresponding elements in I’’ are sorted
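Putting the two passes together, one possible serial sketch of the whole iteration is shown below. It is an illustration under assumptions, not the authors' GPU code: the function name is hypothetical, plain loops stand in for the data-parallel scans, and members of M are masked with infinity so that the forward scan sees only the elements outside M.

```python
import math
from typing import List, Sequence, Tuple


def adaptive_sort(seq: Sequence[float]) -> Tuple[List[float], int]:
    """Return (sorted list, number of iterations used)."""
    remaining = list(seq)
    result: List[float] = []
    iterations = 0
    while remaining:
        iterations += 1
        n = len(remaining)

        # Pass 1: backward exclusive MIN scan, then flag members of M.
        suffix_min = [math.inf] * n
        running = math.inf
        for i in range(n - 1, -1, -1):
            suffix_min[i] = running
            running = min(running, remaining[i])
        in_m = [remaining[i] <= suffix_min[i] for i in range(n)]

        # Pass 2: forward exclusive MIN scan over elements outside M.
        prefix_min = [math.inf] * n
        running = math.inf
        for i in range(n):
            prefix_min[i] = running
            if not in_m[i]:
                running = min(running, remaining[i])
        done = [in_m[i] and remaining[i] <= prefix_min[i] for i in range(n)]

        # Sorted elements are emitted in order; the rest go to the next iteration.
        result.extend(x for x, d in zip(remaining, done) if d)
        remaining = [x for x, d in zip(remaining, done) if not d]
    return result, iterations


print(adaptive_sort([8, 1, 2, 3, 4, 5, 6, 7, 9]))
# ([1, 2, 3, 4, 5, 6, 7, 8, 9], 2)
```

Each iteration emits at least the current minimum, so the loop terminates, and an almost sorted input such as the one above finishes in very few iterations.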
Data Parallel Scans
Well-studied in parallel computing
Implemented on GPUs [Horn 05, Harris et al. 07, Sengupta et al. 07, Dotsenko et al. 08]
Exploit shared memory to achieve higher memory efficiency
Optimized libraries available
Tree-based scan algorithms may not scale well with shared memory
Instead use matrix-based representations
Y. Dotsenko, N. Govindaraju, P.-P. Sloan, C. Boyd and J. Manferdelli, Proc. of ACM ICS, 2008
Forward Unsegmented Scan
[Chart: relative elapsed time (0.8 to 2.6, lower is better) versus sequence length in 4-byte words (1024 to 67108864) for CUDPP-gems3.2, CUDPP-1.0alpha, and Ours]
Adaptive Sorting: Lessons
Do not try to remap serial adaptive algorithms
Design better data parallel algorithms to achieve scalable performance
Linear in the input size and sorted extent
Works well on almost sorted input
Use data parallel primitives such as scans
General Sorting on GPUs
General datasets
High performance
N. Govindaraju, N. Raghuvanshi and D. Manocha, Proc. of ACM SIGMOD, 2005
General Sorting on GPUs
Design sorting algorithms with deterministic memory accesses
Achieve high memory bandwidth
Can better hide the memory latency!!
Require minimum and maximum computations
Low branching overhead
No data dependencies
Utilize high parallelism on GPUs
GPU-Based Sorting Networks
Represent data as 2D arrays
Multi-stage algorithm
Each stage involves multiple steps
In each step
1. Compare one array element against exactly one other element at fixed distance
2. Perform a conditional assignment (MIN or MAX) at each element location
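The stage and step structure above can be sketched with a bitonic network, chosen here only as one well-known sorting network with this shape. The sketch is a 1D serial illustration with a hypothetical function name, not the 2D GPU implementation: in every step each position reads exactly one partner at a fixed distance and keeps either the MIN or the MAX.

```python
from typing import List, Sequence


def bitonic_sort(values: Sequence[float]) -> List[float]:
    """Sort a power-of-two-length sequence with a bitonic sorting network."""
    n = len(values)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(values)
    k = 2
    while k <= n:                 # stage: merge bitonic runs of length k
        j = k // 2
        while j >= 1:             # step: partner distance j
            out = a[:]
            for i in range(n):    # every position is independent (data-parallel)
                partner = i ^ j
                ascending = (i & k) == 0
                keep_min = (i < partner) == ascending
                if keep_min:
                    out[i] = min(a[i], a[partner])
                else:
                    out[i] = max(a[i], a[partner])
            a = out
            j //= 2
        k *= 2
    return a


print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The partner index and the MIN or MAX choice depend only on the position, never on the data, which is what makes the memory accesses deterministic.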
Sorting Animation
2D Memory Addressing
GPUs optimized for 2D representations
Map 1D arrays to 2D arrays
Minimum and maximum regions mapped to row-aligned or column-aligned quads
1D – 2D Mapping
[Diagram: MIN and MAX regions of the 2D array]
Effectively reduce instructions per element
Sorting on GPU: Pipelining and Parallelism
Input Vertices
Texturing, Caching and 2D Quad Comparisons
Sequential Writes
Comparison with GPU-Based Algorithms
3-6x faster than prior GPU-based algorithms!
Sorting on Recent GPGPU APIs
Use shared memory to perform local sorting
Bitonic sorting can be implemented in O(n log n) memory accesses
Memory access patterns similar to FFT butterfly networks
FFT Performance on GPUs
[Chart: GFlops (0 to 250) versus log2 N (1 to 23) for Ours GTX280, Ours G92, CUFFT, and MKL; driver 177.41]
Preliminary performance of our algorithms:
• 4-8x faster than CUFFT on GTX280
• 2x faster using G92 than CUFFT on GTX280
• 10-30x faster than Intel MKL on high-end quad-core CPUs
N. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith and J. Manferdelli, Proc. of ACM/IEEE SuperComputing, 2008 (to appear)
External Memory Sorting
Performed on Terabyte-scale databases
Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]
Limited main memory
First phase – partitions input file into large data chunks and writes sorted chunks known as “Runs”
Second phase – Merge the “Runs” to generate the sorted file
N. Govindaraju, J. Gray, R. Kumar and D. Manocha,
Proc. of ACM SIGMOD 2006 (to appear)
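The two phases can be sketched in a few lines. This is a toy illustration under stated assumptions: in-memory lists stand in for the input file and for the runs written to disk, the function names are hypothetical, and the merge uses Python's standard `heapq.merge` rather than any technique described in the slides.

```python
import heapq
from typing import Iterable, List


def make_runs(records: Iterable[int], run_size: int) -> List[List[int]]:
    """Phase 1: cut the input into chunks of run_size and sort each chunk."""
    runs: List[List[int]] = []
    chunk: List[int] = []
    for record in records:
        chunk.append(record)
        if len(chunk) == run_size:
            runs.append(sorted(chunk))
            chunk = []
    if chunk:                      # final, possibly shorter, run
        runs.append(sorted(chunk))
    return runs


def external_sort(records: Iterable[int], run_size: int) -> List[int]:
    """Phase 2: k-way merge of the sorted runs into one sorted output."""
    return list(heapq.merge(*make_runs(records, run_size)))


print(make_runs([5, 3, 8, 1, 9, 2, 7], 3))      # [[3, 5, 8], [1, 2, 9], [7]]
print(external_sort([5, 3, 8, 1, 9, 2, 7], 3))  # [1, 2, 3, 5, 7, 8, 9]
```

In a real external sort the run size is bounded by main memory, and phase 2 reads each run from disk in blocks instead of holding all runs in memory.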
External Memory Sorting
Performance mainly governed by I/O
Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)
Salzberg Analysis
If N=100GB, T=2MB, then R ≈ 230MB
Large data sorting is inefficient on CPUs
R ≫ CPU cache sizes, so sorting a run incurs high memory latency
![Page 61: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/61.jpg)
External memory sorting
External memory sorting on CPUs can have low performance due to
High memory latency
Or low I/O performance
Our algorithm sorts large data arrays on GPUs
Performs I/O operations in parallel on CPUs
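The idea of overlapping I/O with sorting can be sketched as a three-stage pipeline. This is only an illustration under stated assumptions: the function name `pipeline` is hypothetical, Python's built-in `sorted` stands in for the GPU sort, and in-memory lists stand in for disk reads and writes. It is not the GPUTeraSort implementation.

```python
import queue
import threading


def pipeline(chunks, sort_fn=sorted):
    """Overlap reading, sorting, and writing of chunks using three threads.

    sort_fn stands in for the GPU sort; the reader and writer stand in
    for CPU-driven disk I/O. Returns the sorted runs in input order.
    """
    to_sort = queue.Queue(maxsize=2)   # bounded queues cap memory in flight
    to_write = queue.Queue(maxsize=2)
    runs = []

    def reader():
        for chunk in chunks:           # placeholder for reading from disk
            to_sort.put(chunk)
        to_sort.put(None)              # sentinel: no more input

    def sorter():
        while (chunk := to_sort.get()) is not None:
            to_write.put(sort_fn(chunk))
        to_write.put(None)             # forward the sentinel

    def writer():
        while (run := to_write.get()) is not None:
            runs.append(run)           # placeholder for writing a run to disk

    threads = [threading.Thread(target=f) for f in (reader, sorter, writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return runs
```

While one chunk is being sorted, the reader can already fetch the next chunk and the writer can flush the previous run, so neither the sorter nor the disks sit idle waiting for the other.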
![Page 62: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/62.jpg)
GPUTeraSort
![Page 63: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/63.jpg)
I/O Performance
Salzberg Analysis: 100 MB run size
![Page 64: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/64.jpg)
I/O Performance
Pentium IV: 25 MB run size
Less work, and only 75% I/O efficient!
Salzberg Analysis: 100 MB run size
![Page 65: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/65.jpg)
I/O Performance
Dual 3.6 GHz Xeons: 25 MB run size
More cores, less work, but only 85% I/O efficient!
Salzberg Analysis: 100 MB run size
![Page 66: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/66.jpg)
I/O Performance
7800 GT: 100 MB run size
Ideal work, and 92% I/O efficient with a single CPU!
Salzberg Analysis: 100 MB run size
![Page 67: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/67.jpg)
Task Parallelism
Performance limited by I/O and memory
![Page 68: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/68.jpg)
Overall Performance
Faster and more scalable than Dual Xeon processors (3.6 GHz)!
![Page 69: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/69.jpg)
Performance/$
1.8x faster than the Terabyte sorter in 2006
World’s best performance/$ system in 2006
![Page 70: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/70.jpg)
Advantages
Exploit high memory bandwidth on GPUs
Higher memory performance than CPU-based algorithms
High I/O performance due to large run sizes
![Page 71: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/71.jpg)
Advantages
Offload work from CPUs
CPU cycles well-utilized for resource management
Scalable solution for large databases
Best performance/price solution for terabyte sorting in 2006
![Page 72: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/72.jpg)
Conclusions
Benchmarking is important for both IHVs and ISVs
Sorting is an important workload
Design better algorithms on GPUs
Do not try to remap serial algorithms
Design scalable primitives (e.g. scans) and libraries (e.g. MapReduce), and exploit them for adaptive, general and external memory sorting algorithms
![Page 73: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/73.jpg)
Conclusions
Exploit the memory models
FFT algorithms currently achieving over 0.25 TFLOPS per GPU
Applicable to many scientific computing algorithms on many-core architectures
Novel external memory sorting algorithm as a scalable solution
Achieves high I/O performance on CPUs
Best performance/price solution – world’s fastest sorting system in 2006
![Page 74: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/74.jpg)
GPU Roadmap
GPUs are becoming more general purpose
Fewer limitations in Microsoft DirectX 11 API
• IEEE floating point support and optional double support
• Integer instruction support
• More programmable stages, etc.
Significant advance in performance
GPUs are being widely adopted in commercial applications
Image and media processing, signal processing, finance, etc.
![Page 75: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/75.jpg)
Call to Action
Pay attention to data parallelism
Don’t put all your eggs in the multi-core basket
If you want TeraOps – go where they are
If you want memory bandwidth – go where the memory bandwidth is.
CPU-GPU gap is widening
![Page 76: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/76.jpg)
Acknowledgements
Collaborators:
Jim Gray (Microsoft Research)
Ming Lin (UNC)
Qiong Luo (HKUST)
Dinesh Manocha (UNC)
Peter-Pike Sloan (Disney Interactive)
Brandon Lloyd (Microsoft)
Yuri Dotsenko (Microsoft)
Chas Boyd (Microsoft)
Burton Smith (Microsoft)
John Manferdelli (Microsoft)
![Page 77: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/77.jpg)
Acknowledgements
Supporters:
Fred Brooks (UNC Chapel Hill)
Craig Mundie (Microsoft)
![Page 78: SortBenchmark: A Benchmark for Many-core · PDF fileSortBenchmark: A Benchmark for Many-core Computing ... Computer graphics ... Use optimized data-parallel primitives such as scans](https://reader033.fdocuments.in/reader033/viewer/2022051320/5ab0f73a7f8b9ad9788bae80/html5/thumbnails/78.jpg)
Thank You
Questions or Comments?