MICROSOFT MANY CORE APPLICATIONS INCUBATION RESEARCH

SortBenchmark: A Benchmark for Many-core Computing

Naga K. Govindaraju

Microsoft Many-core Incubation

Sorting

“I believe that virtually every important aspect of programming arises somewhere in the context of sorting or searching!”

Don Knuth, Stanford University

There is a lot of effort on multi-core processors, and comparatively little effort on addressing the “core” problems: (1) the memory architecture, and (2) the way processors access memory. Sort demonstrates those problems very clearly.

Jim Gray, Microsoft Research

Sorting

Well studied

High performance computing

Databases

Computer graphics

Programming languages

...

Google MapReduce algorithm

SPEC benchmark routine!

Massive Databases

Terabyte data sets are common

Google sorts more than 100 billion terms in its index

More than 1 trillion records indexed on the web!

Database sizes are rapidly increasing!

Max DB size increases 3x per year (http://www.wintercorp.com)

Processor improvements not matching information explosion

CPU vs. GPU

[Diagram: CPU (3 GHz) with 4 x 6 MB cache and system memory (16 GB), connected over a PCI-E bus (8 GB/s) to GPUs (690 MHz), each with video memory (4 GB). The GPU and video memory pair is shown three times.]

Massive Data Handling on CPUs

Require random memory accesses

Small CPU caches

Random memory accesses slower than even sequential disk accesses

High memory latency

Huge memory to compute gap!

CPUs are deeply pipelined

Pentium 4 has 30 pipeline stages

Do not hide latency - high cycles per instruction (CPI)

CPU is under-utilized for data intensive applications

Why Many-Core Sort?

http://research.microsoft.com/barc/SortBenchmark

[Chart: records per second per CPU (log scale, 1.E+1 to 1.E+6) versus year, 1985 to 2005. Slow improvement after 1995. Labeled points: Mini, Super, cache conscious, GPUTeraSort.]

Massive Data Handling on CPUs

Sorting is hard!

GPU a potentially scalable solution to terabyte sorting and scientific computing

We provide a scalable solution on GPUs

Graphics Processing Units (GPUs)

Commodity processor for graphics applications

Massively data-parallel processors

High memory bandwidth

Low memory latency pipeline

Programmable

High growth rate

GPU: Commodity Processor

Cell phones, laptops, consoles, PSP, desktops

Graphics Processing Units (GPUs)

Commodity processor for graphics applications

Massively data-parallel processors

10x more operations per sec than CPUs

High memory bandwidth

Better hides memory latency pipeline

Programmable

High growth rate

Parallelism on GPUs

Peak FLOPS

GPU – 933 GFLOPS

CPU – 100 GFLOPS

Graphics Processing Units (GPUs)

Commodity processor for graphics applications

Massively data-parallel processors

High memory bandwidth

10x more memory bandwidth than CPUs

Better hides latency pipeline

Programmable

High growth rate

Traditional GPGPU Pipeline

[Diagram: graphics pipeline stages (Input Assembler, Vertex Shader, Pixel Shader, Tessellation, Rasterizer, Output Merger, Geometry Shader) connected to memory at 140 GB/s. Annotations: very high parallelism; hides memory latency!!]

DirectX11: Compute Shader

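[Diagram: graphics pipeline stages (Input Assembler, Vertex Shader, Pixel Shader, Tessellation, Rasterizer, Output Merger, Geometry Shader) connected to memory at 140 GB/s, with a Compute Shader stage and a data structure added alongside the pipeline.]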

Graphics Processing Units (GPUs)

Commodity processor for graphics applications

Massively data-parallel processors

High memory bandwidth

Better hides latency pipeline

Programmable

High growth rate

Memory Performance on GPUs

[Chart: GPU memory bandwidth in GB/s (0 to 160) from Feb-02 to Feb-08.]

GPUs for Sorting: Issues

Random writes are expensive

Optimized CPU algorithms do not map!

Lack of support for recursion

Out-of-core algorithms

Limited GPU memory

Outline

Overview

Sorting on GPUs

Conclusions and Future Directions

Sorting on GPUs

Adaptive sorting algorithms

Extent of sorted order in a sequence

General sorting algorithms

External memory sorting algorithms

Adaptive Sorting on GPUs

Prior adaptive sorting algorithms require random data writes

In insertion sort, a processor may operate on a different number of elements – load imbalance

GPUs optimized for data-parallel algorithms

Use optimized data-parallel primitives such as scans

Design adaptive sorting using only data-parallel computations

Avoid load imbalance by operating on the same number of elements on each processor

Adaptive Sorting Algorithm

Multiple iterations

Each iteration uses a two-pass algorithm

First pass – Compute an increasing sequence M

Second pass – Compute the sorted elements in M

Iterate on the remaining unsorted elements

N. Govindaraju, M. Henson, M. Lin and D. Manocha, Proc. of ACM I3D, 2005

Increasing Sequence

Given a sequence S = {x1, …, xn}, an element xi belongs to M if and only if xi ≤ xj for every xj in S with i < j

Increasing Sequence

X1 X2 X3… Xi-1 Xi Xi+1 … Xn-2 Xn-1 Xn

M is an increasing sequence

Increasing Sequence Computation

X1 X2 … Xi-1 Xi Xi+1 … Xn-1 Xn

Compute

Increasing Sequence Computation

Xn

Compute: Xn ≤ ∞

Increasing Sequence Computation

Xi Xi+1 … Xn-1 Xn

Compute: xi ≤ Min?

Yes: prepend xi to M and set Min = xi

Increasing Sequence Computation

X1 X2 … Xi-1 Xi Xi+1 … Xn-1 Xn

Compute: x1 ≤ {x2, …, xn}?
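The right-to-left walk shown on these slides can be sketched sequentially as follows. This is a minimal illustration in Python, not the GPU implementation from the talk; the function name `increasing_sequence` and the sample inputs are assumptions made for the example.

```python
import math


def increasing_sequence(seq):
    """Return M: the elements that are <= every element to their right."""
    m = []
    current_min = math.inf  # Xn is compared against infinity, so it is always in M
    for x in reversed(seq):
        if x <= current_min:
            m.append(x)      # collected back to front, reversed below
            current_min = x
    m.reverse()              # equivalent to prepending each accepted element
    return m


if __name__ == "__main__":
    print(increasing_sequence([1, 2, 3, 8, 4, 5, 6, 7, 9]))
```

In this hypothetical input, 8 is left out of M because the smaller element 4 follows it.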

Computing Sorted Elements

Theorem 1: Given the increasing sequence M of an input sequence I, the rank of an element xi in M is determined if xi < min(I − M)

Computing Sorted Elements

X1 X2 X3… Xi-1 Xi Xi+1 … Xn-2 Xn-1 Xn

Computing Sorted Elements

[Diagram: the sequence split into two rows, X2 … Xi-1 … Xn and X1 X3 … Xi Xi+1 … Xn-2 Xn-1, related by ≤ comparisons.]

Computing Sorted Elements

Linear-time algorithm

Maintaining minimum

Computing Sorted Elements

[Diagram: compute over the two rows X1 X3 … Xi Xi+1 … Xn-2 Xn-1 and X2 … Xi-1 … Xn.]

Computing Sorted Elements

X1 X2 … Xi-1 Xi

Compute: Xi in M?

No: update min

Yes: Xi ≤ min? If so, append Xi to sorted list
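The decision flow on this slide can be sketched as a single left-to-right pass. This is a hedged illustration only: the names `split_sorted` and `in_m` are hypothetical, and reading `min` as the smallest element outside M seen so far is an assumption drawn from Theorem 1.

```python
import math


def split_sorted(seq, in_m):
    """Split seq into (sorted elements, elements left for the next iteration).

    in_m[i] is True when seq[i] belongs to the increasing sequence M.
    """
    sorted_part, remaining = [], []
    min_outside = math.inf  # smallest element outside M seen so far
    for x, member in zip(seq, in_m):
        if member and x <= min_outside:
            sorted_part.append(x)      # rank of x is already determined
        else:
            remaining.append(x)
            if not member:
                min_outside = min(min_outside, x)  # "update min"
    return sorted_part, remaining


if __name__ == "__main__":
    data = [1, 2, 3, 8, 4, 5, 6, 7, 9]
    mask = [True, True, True, False, True, True, True, True, True]
    print(split_sorted(data, mask))
```

Only elements outside M update the running minimum, so an element of M that fails the test (9 in the sample) is simply deferred to the next iteration.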

Computing Sorted Elements

X1 X2 … Xi-1 Xi Xi+1 … Xn-1 Xn

Compute

Algorithm Analysis

Knuth’s measure of disorder:

Given a sequence I and its longest increasing sequence LIS(I), the sequence of disordered elements is Y = I − LIS(I)

Theorem 2: Given a sequence I and LIS(I), our adaptive algorithm sorts in at most (2 ||Y|| + 1) iterations

Pictorial Proof

X1 X2 … Xl Xl+1 Xl+2 … Xm Xm+1 Xm+2 … Xq Xq+1 Xq+2 … Xn

Pictorial Proof

X1 X2 … Xl Xl+1 Xl+2 … Xm Xm+1 Xm+2 … Xq Xq+1 Xq+2 … Xn

2 iterations, 2 iterations, 2 iterations

Example

Input: 1 2 3 4 5 6 7 9, with 8 out of place

Example

Sorted: 1 2 3 4 5 6 7

Not yet sorted: 8 9

Scan: All-prefix-sums

For an input sequence A = [a0, a1, …, an-1] and a binary associative operation ⊕ with left identity e, scan(A) = [e, a0, a0 ⊕ a1, …, a0 ⊕ a1 ⊕ … ⊕ an-2]

Example: ⊕ is addition, e = 0

A = [1, 1, 1, 1]

scan(A) = [0, 1, 2, 3]
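The definition above can be checked with a short sequential reference. This is a sketch for illustration, not a data-parallel implementation; the function name `exclusive_scan` and its parameters are assumptions.

```python
import math
import operator


def exclusive_scan(values, op, identity):
    """All-prefix-sums: out[i] combines values[0..i-1]; out[0] is the identity."""
    out = []
    acc = identity
    for x in values:
        out.append(acc)   # emit the combination of everything before x
        acc = op(acc, x)
    return out


if __name__ == "__main__":
    print(exclusive_scan([1, 1, 1, 1], operator.add, 0))   # the slide's example
    print(exclusive_scan([5, 3, 4, 1], min, math.inf))     # a MIN scan
```

The second call swaps in `min` with identity ∞, which is the form of scan the sorting slides rely on.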

Computing Increasing Sequence

Compute a MIN scan in the backward direction: I’

Compare each element in I with I’

Elements less than or equal to the corresponding elements in I’ are in M
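As a rough sketch of this step, the backward MIN scan and the element-wise comparison can be written as below. It assumes an exclusive scan seeded with ∞, so each position sees the minimum of the elements to its right; the function names are hypothetical, and the loop is sequential where a GPU version would use a parallel scan primitive.

```python
import math


def backward_min_scan(seq):
    """Exclusive MIN scan from the right: out[i] = min(seq[i+1:]), or inf."""
    out = [math.inf] * len(seq)
    running = math.inf
    for i in range(len(seq) - 1, -1, -1):
        out[i] = running
        running = min(running, seq[i])
    return out


def increasing_sequence_mask(seq):
    """Flag the elements of M with one element-wise comparison against I'."""
    suffix_min = backward_min_scan(seq)
    return [x <= m for x, m in zip(seq, suffix_min)]


if __name__ == "__main__":
    print(increasing_sequence_mask([1, 2, 3, 8, 4, 5, 6, 7, 9]))
```

Every position performs the same comparison, which is what keeps the work per processor uniform.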


Computing Sorted Elements

Compute forward MIN scan of elements in I-M: I’’

Elements in M are set to ∞ in a copy of I and the scan is performed on that copy

Compare elements in M with I’’

Elements that are less than or equal to the corresponding elements in I’’ are sorted
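Putting the two passes together, one possible reading of the whole adaptive sort is sketched below. It is an illustration under stated assumptions rather than the authors' implementation: the names `exclusive_min_scan` and `adaptive_sort` are hypothetical, both scans are exclusive and seeded with ∞, elements of M are masked with ∞ before the forward scan, and the loops are sequential stand-ins for GPU scan primitives.

```python
import math


def exclusive_min_scan(values, reverse=False):
    """Exclusive MIN scan; reverse=True scans from the right."""
    n = len(values)
    out = [math.inf] * n
    running = math.inf
    order = range(n - 1, -1, -1) if reverse else range(n)
    for i in order:
        out[i] = running
        running = min(running, values[i])
    return out


def adaptive_sort(seq):
    """Return (sorted list, number of iterations)."""
    remaining = list(seq)
    result = []
    iterations = 0
    while remaining:
        iterations += 1
        # First pass: backward MIN scan, then flag the members of M.
        suffix_min = exclusive_min_scan(remaining, reverse=True)
        in_m = [x <= s for x, s in zip(remaining, suffix_min)]
        # Second pass: forward MIN scan over the elements outside M.
        masked = [math.inf if flag else x for x, flag in zip(remaining, in_m)]
        prefix_min = exclusive_min_scan(masked)
        done = [flag and x <= p for x, flag, p in zip(remaining, in_m, prefix_min)]
        # Emit the sorted elements and iterate on the rest.
        result.extend(x for x, d in zip(remaining, done) if d)
        remaining = [x for x, d in zip(remaining, done) if not d]
    return result, iterations


if __name__ == "__main__":
    print(adaptive_sort([1, 2, 3, 8, 4, 5, 6, 7, 9]))
```

On this nearly sorted sample the sketch finishes in two iterations, while a fully reversed input needs one iteration per element, which is consistent with the algorithm being adaptive to the extent of sorted order.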


Data Parallel Scans

Well-studied in parallel computing

Implemented on GPUs [Horn 05, Harris et al. 07, Sengupta et al. 07, Dotsenko et al. 08]

Exploit shared memory to achieve higher memory efficiency

Optimized libraries available

Tree-based scan algorithms may not scale well with shared memory

Instead use matrix-based representations


Y. Dotsenko, N. Govindaraju, P.-P. Sloan, C. Boyd and J. Manferdelli, Proc. of ACM ICS, 2008

Forward Unsegmented Scan

[Chart: relative elapsed time (0.8 to 2.6, lower is better) versus sequence length in 4-byte words (1024 to 67108864) for CUDPP-gems3.2, CUDPP-1.0alpha and Ours.]

Adaptive Sorting: Lessons

Do not try to remap serial adaptive algorithms

Design better data parallel algorithms to achieve scalable performance

Linear in the input size and sorted extent

Works well on almost sorted input

Use data parallel primitives such as scans

General Sorting on GPUs

General datasets

High performance

N. Govindaraju, N. Raghuvanshi and D. Manocha, Proc. of ACM SIGMOD, 2005

General Sorting on GPUs

Design sorting algorithms with deterministic memory accesses

Achieve high memory bandwidth

Can better hide the memory latency!!

Require minimum and maximum computations

Low branching overhead

No data dependencies

Utilize high parallelism on GPUs

GPU-Based Sorting Networks

Represent data as 2D arrays

Multi-stage algorithm

Each stage involves multiple steps

In each step

1. Compare one array element against exactly one other element at a fixed distance

2. Perform a conditional assignment (MIN or MAX) at each element location
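The stage and step structure described above can be illustrated with a bitonic sorting network. The bitonic network is chosen here only as a familiar example (the slide does not say which network it depicts), the function name `bitonic_sort` is hypothetical, and the sketch uses a 1D list rather than the 2D arrays used on the GPU.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:              # one stage per bitonic block size k
        j = k // 2
        while j >= 1:          # each step compares elements at distance j
            nxt = list(data)
            for i in range(n):  # every location is independent of the others
                partner = i ^ j
                ascending = (i & k) == 0
                lo = min(data[i], data[partner])
                hi = max(data[i], data[partner])
                # Conditional assignment: keep MIN or MAX at this location.
                keep_min = (i < partner) == ascending
                nxt[i] = lo if keep_min else hi
            data = nxt
            j //= 2
        k *= 2
    return data


if __name__ == "__main__":
    print(bitonic_sort([3, 7, 4, 8, 6, 2, 1, 5]))
```

Because each location reads two fixed positions and writes only itself, the access pattern is deterministic and free of data-dependent branching, which matches the properties listed on the earlier slide.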

Sorting Animation


2D Memory Addressing

GPUs optimized for 2D representations

Map 1D arrays to 2D arrays

Minimum and maximum regions mapped to row-aligned or column-aligned quads

1D – 2D Mapping

MIN MAX

1D – 2D Mapping

MIN

Effectively reduce instructions per element

Sorting on GPU: Pipelining and Parallelism

Input Vertices

Texturing, Caching and 2D Quad Comparisons

Sequential Writes

Comparison with GPU-Based Algorithms

3-6x faster than prior GPU-based algorithms!

Sorting on Recent GPGPU APIs

Use shared memory to perform local sorting

Bitonic sorting can be implemented in O(n log n) memory accesses

Memory access patterns similar to FFT butterfly networks


FFT Performance on GPUs

[Chart: GFlops (0 to 250) versus log2N (1 to 23) for Ours GTX280, Ours G92, CUFFT and MKL. Driver: 177.41]

Preliminary performance of our algorithms:

• 4-8x faster than CUFFT on GTX280

• 2x faster using G92 than CUFFT on GTX280

• 10-30x faster than Intel MKL on high-end quad-core CPUs

N. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith and J. Manferdelli, Proc. of ACM/IEEE SuperComputing, 2008 (to appear)

External Memory Sorting

Performed on Terabyte-scale databases

Two-phase algorithm [Vitter01, Salzberg90, Nyberg94, Nyberg95]

Limited main memory

First phase – partitions input file into large data chunks and writes sorted chunks known as “Runs”

Second phase – Merge the “Runs” to generate the sorted file

N. Govindaraju, J. Gray, R. Kumar and D. Manocha, Proc. of ACM SIGMOD 2006 (to appear)
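The two phases can be sketched on a small scale as follows. This is a toy illustration with integer records and temporary text files, not GPUTeraSort; the function names and the `run_size` parameter (counted in records, not bytes) are assumptions made for the example.

```python
import heapq
import itertools
import os
import tempfile


def write_runs(records, run_size, directory):
    """Phase 1: cut the input into chunks, sort each, write it out as a run."""
    paths = []
    it = iter(records)
    while True:
        chunk = sorted(itertools.islice(it, run_size))
        if not chunk:
            break
        fd, path = tempfile.mkstemp(dir=directory, suffix=".run")
        with os.fdopen(fd, "w") as f:
            f.writelines(f"{value}\n" for value in chunk)
        paths.append(path)
    return paths


def read_run(path):
    """Stream one sorted run back from disk."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(records, run_size):
    """Phase 2: merge the sorted runs into one sorted stream."""
    with tempfile.TemporaryDirectory() as directory:
        paths = write_runs(records, run_size, directory)
        yield from heapq.merge(*(read_run(p) for p in paths))


if __name__ == "__main__":
    print(list(external_sort([5, 3, 9, 1, 7, 2, 8], run_size=3)))
```

Only one run needs to fit in memory during phase 1, and phase 2 reads each run sequentially, which is why the run size and the read size per run dominate I/O behavior.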

External Memory Sorting

Performance mainly governed by I/O

Salzberg Analysis: Given the main memory size M and the file size N, if the I/O read size per run is T in phase 2, external memory sorting achieves efficient I/O performance if the run size R in phase 1 is given by R ≈ √(TN)

Salzberg Analysis

If N=100GB, T=2MB, then R ≈ 230MB

Large data sorting is inefficient on CPUs

R » CPU cache sizes – memory latency

External memory sorting

External memory sorting on CPUs can have low performance due to

High memory latency

Or low I/O performance

Our algorithm

Sorts large data arrays on GPUs

Perform I/O operations in parallel on CPUs

GPUTeraSort

I/O Performance

Salzberg Analysis: 100 MB Run Size

I/O Performance

Pentium IV: 25MB run size

Less work and only 75% IO efficient!

Salzberg Analysis: 100 MB Run Size

I/O Performance

Dual 3.6 GHz Xeons: 25MB run size

More cores, less work but only 85% IO efficient!

Salzberg Analysis: 100 MB Run Size

I/O Performance

7800 GT: 100MB run size

Ideal work, and 92% IO efficient with single CPU!

Salzberg Analysis: 100 MB Run Size

Task Parallelism

Performance limited by IO and memory

Overall Performance

Faster and more scalable than Dual Xeon processors (3.6 GHz)!

Performance/$

1.8x faster than the Terabyte sorter in 2006

World’s best performance/$ system in 2006

Advantages

Exploit high memory bandwidth on GPUs

Higher memory performance than CPU-based algorithms

High I/O performance due to large run sizes

Advantages

Offload work from CPUs

CPU cycles well-utilized for resource management

Scalable solution for large databases

Best performance/price solution for terabyte sorting in 2006

Conclusions

Benchmarking is important for both IHVs and ISVs

Sorting is an important workload

Design better algorithms on GPUs

Do not try to remap serial algorithms

Design scalable primitives (e.g. scans), libraries (e.g. MapReduce) and exploit them for adaptive, general and external memory sorting algorithms

Conclusions

Exploit the memory models

FFT algorithms currently achieving over 0.25 TFLOPS per GPU

Applicable to many scientific computing algorithms on many-core architectures

Novel external memory sorting algorithm as a scalable solution

Achieves high I/O performance on CPUs

Best performance/price solution – world’s fastest sorting system in 2006

GPU Roadmap

GPUs are becoming more general purpose

Fewer limitations in Microsoft DirectX11 API

• IEEE floating point support and optional double support

• Integer instruction support

• More programmable stages, etc.

Significant advance in performance

GPUs are being widely adopted in commercial applications

Image and media processing, signal processing, finance, etc.

Call to Action

Pay attention to data parallelism

Don’t put all your eggs in the multi-core basket

If you want TeraOps – go where they are

If you want memory bandwidth – go where the memory bandwidth is.

CPU-GPU gap is widening

Acknowledgements

Collaborators:

Jim Gray (Microsoft Research)

Ming Lin (UNC)

Qiong Luo (HKUST)

Dinesh Manocha (UNC)

Peter-Pike Sloan (Disney Interactive)

Brandon Lloyd (Microsoft)

Yuri Dotsenko (Microsoft)

Chas Boyd (Microsoft)

Burton Smith (Microsoft)

John Manferdelli (Microsoft)


Acknowledgements

Supporters:

Fred Brooks (UNC Chapel Hill)

Craig Mundie (Microsoft)

Thank You

Questions or Comments?

[email protected]