Effective, Scalable Multi-GPU Joins

Tim Kaldewey, Nikolay Sakharnykh and Jiri Kraus, March 20th 2019

S9557 EFFECTIVE, SCALABLE MULTI-GPU JOINS


RECAP JOINS

Business question: counts the number of orders in a given quarter of a given year in which at least one lineitem was received by the customer later than its committed date. The query lists the count of such orders for each order priority, sorted in ascending priority order.

Joins are implicit in a business question.

Database operators: predicate (filter), join, aggregate, sort

SQL:

select
    o_orderpriority,
    count(o_orderkey) as order_count
from
    orders
where
    o_orderdate >= date '[DATE]' and
    o_orderdate < date '[DATE]' + interval '3' month and
    exists (select * from lineitem
            where l_orderkey = o_orderkey and
                  l_commitdate < l_receiptdate)
group by
    o_orderpriority
order by
    o_orderpriority;

TPC-H SCHEMA

lineitem (l_): ORDERKEY, LINENUMBER, PARTKEY, SUPPKEY, COMMITDATE, RECEIPTDATE
order (o_): ORDERKEY, CUSTKEY, ORDERDATE, ORDPRIORITY, ORDERSTATUS
customer (c_): CUSTKEY, NAME, ADDRESS, CITY
supplier (s_): SUPPKEY, NAME, ADDRESS, CITY, NATIONKEY
part (p_): PARTKEY, NAME, MFGR, CATEGORY, BRAND
nation (n_): NATIONKEY, NAME

RELATIONAL JOIN

Lineitem (1) joined with Order (2) on l_orderkey = o_orderkey

Lineitem, foreign key l_orderkey: 23, 14, 56, 11, 39, 27, 23

Order, primary key and payload:

o_orderkey  o_orderpriority
11          1
23          5
27          2
29          4

Join results:

o_orderkey  o_orderpriority
23          5
11          1
27          2
23          5

(1) after applying predicate "l_commitdate < l_receiptdate"
(2) after applying predicates "o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month"

HASH JOIN

Build hash table: Order (2), primary key and payload

o_orderkey  o_orderpriority
11          1
23          5
27          2
29          4

Probe inputs: Lineitem (1), foreign key l_orderkey: 23, 14, 56, 11, 39, 27, 23

Join results:

o_orderkey  o_orderpriority
23          5
11          1
27          2
23          5

(1) after applying predicate "l_commitdate < l_receiptdate"
(2) after applying predicates "o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month"
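The build and probe phases of the hash join slide can be sketched in a few lines of Python. This is only an illustration on the slide's example keys: a Python dict stands in for the GPU hash table, and the function name hash_join is an assumption, not something from the talk.

```python
def hash_join(build_rows, probe_keys):
    """Inner hash join: build on the primary-key side, probe with foreign keys."""
    # Build phase: o_orderkey -> o_orderpriority (a dict stands in for the GPU HT).
    hash_table = {key: payload for key, payload in build_rows}
    # Probe phase: look up every l_orderkey and keep only the matches.
    results = []
    for key in probe_keys:
        if key in hash_table:
            results.append((key, hash_table[key]))
    return results


# Example data taken from the slide.
orders = [(11, 1), (23, 5), (27, 2), (29, 4)]
lineitem_keys = [23, 14, 56, 11, 39, 27, 23]

join_results = hash_join(orders, lineitem_keys)
print(join_results)  # [(23, 5), (11, 1), (27, 2), (23, 5)]
```

Key 23 appears twice on the probe side, so it produces two result rows; key 29 is in the hash table but is never probed.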

JOINS & E2E PERFORMANCE

CPU TPC-H Q4 execution breakdown: join 99%, group-by 1%
GPU TPC-H Q4 execution breakdown: join 99%, group-by 1%

18 of the 22 TPC-H queries involve joins, and they are the longest running ones (1)

(1) cf. recently published TPC-H results at http://www.tpc.org/tpch/results/tpch_last_ten_results.asp

IMPLEMENTING GPU JOINS
In Heterogeneous Systems

[Diagram: DB tables (Key, Payload; e.g. 23 5, 27 2) in 1TB+ DDR; Build & Probe; Hash Table(s) in 32GB HBM]

If the hash table fits in GPU memory, performance is primarily bound by random memory access (1).

Let's ignore the CPU-GPU interconnect for a moment.

(1) cf. "How to Get the Most out of GPU Accelerated Database Operators", GTC Silicon Valley 2018, Session ID S8289

PERFORMANCE

                                Peak memory bandwidth (1)   Random 8B access (1)
High-end CPU (6-channel DDR4)   120 GB/s                    6 GB/s
NVIDIA Tesla V100               900 GB/s                    60 GB/s

Random 8B access: 10x (V100 vs. CPU)

(1) cf. "How to Get the Most out of GPU Accelerated Database Operators", GTC Silicon Valley 2018, Session ID S8289, http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75

PERFORMANCE VS. CAPACITY

                                Peak memory bandwidth (1)   Random 8B access (1)   Memory capacity
High-end CPU (6-channel DDR4)   120 GB/s                    6 GB/s                 1 TB+
NVIDIA Tesla V100               900 GB/s                    60 GB/s                32 GB
NVIDIA DGX-2 (16x V100)         16x 900 GB/s                16x 60 GB/s            512 GB

Memory capacity relative to the CPU: 1/32 for a single V100, 1/2 for the DGX-2

(1) cf. "How to Get the Most out of GPU Accelerated Database Operators", GTC Silicon Valley 2018, Session ID S8289, http://on-demand-gtc.gputechconf.com/gtc-quicklink/ar9zi75

IS A SINGLE V100 FAST/LARGE ENOUGH?
TPC-H query 4 @SF1000 = 1000GB data warehouse

GPU execution breakdown: join 99%, group-by 1%
GPU execution breakdown, compressed data: join 99%, group-by 1%
Times shown with the charts: 3.8 s, 7.0 s

Hash table sizes:

Query   SF1K      SF3K      SF10K
Q4      1.5 GB    4.5 GB    15 GB
Q18     21 GB     63 GB     210 GB
Q21     10.5 GB   31.5 GB   105 GB

For further speedup, or for scale factors above SF 1000, the hash table needs to be distributed across multiple GPUs

DESIGNED TO TRAIN THE PREVIOUSLY IMPOSSIBLE
NVIDIA DGX-2

- NVIDIA Tesla V100 32GB
- Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card
- Twelve NVSwitches: 2.4 TB/sec bisection bandwidth
- Eight EDR Infiniband/100 GigE: 1600 Gb/sec total bi-directional bandwidth
- Two Intel Xeon Platinum CPUs
- 1.5 TB system memory
- 30 TB NVME SSDs internal storage
- Two high-speed Ethernet: 10/25/40/100 GigE

POTENTIAL DGX-2 IMPLEMENTATION
Use 2.4TB/s bisection BW to exchange FT chunks

[Diagram: GPU0 to GPU15, all connected through the NVSwitch Fabric]


SCALING OF INNER JOIN


DISCLAIMER

For a production system some additional aspects need to be considered:

- Data Skew

- Cardinality estimation

- Query optimizer

This investigation is ongoing

SCALING OF INNER JOIN
redundant build of replicated HT (step 0, steps 1..#GPU-1, step #GPU)

GPU 0: build table rows 0…B1-1, probe table rows 0…P1-1, full HT
GPU 1: build table rows B1…B2-1, probe table rows P1…P2-1, full HT
GPU 2: build table rows B2…B3-1, probe table rows P2…P3-1, full HT
…
GPU #GPU: build table rows B#…B-1, probe table rows P#…P-1, full HT

SCALING OF INNER JOIN
parallel probe of replicated HT

GPU 0: build table rows 0…B1-1, probe table rows 0…P1-1, full HT
GPU 1: build table rows B1…B2-1, probe table rows P1…P2-1, full HT
GPU 2: build table rows B2…B3-1, probe table rows P2…P3-1, full HT
…
GPU #GPU: build table rows B#…B-1, probe table rows P#…P-1, full HT
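The replicated-HT scheme can be simulated on the CPU as a rough sketch. The assumptions here are mine: Python lists stand in for GPUs, dicts for hash tables, and the order in which build partitions visit each GPU is a simple rotation. The helper names split_evenly and replicated_join are hypothetical.

```python
def split_evenly(rows, num_gpus):
    """Split rows into num_gpus contiguous, nearly equal partitions."""
    bounds = [len(rows) * g // num_gpus for g in range(num_gpus + 1)]
    return [rows[bounds[g]:bounds[g + 1]] for g in range(num_gpus)]


def replicated_join(build_rows, probe_keys, num_gpus):
    build_parts = split_evenly(build_rows, num_gpus)
    probe_parts = split_evenly(probe_keys, num_gpus)

    # Redundant build: in each step every GPU inserts one build partition
    # into its own full HT; after num_gpus steps every HT holds all rows.
    full_hts = [dict() for _ in range(num_gpus)]
    for step in range(num_gpus):
        for gpu in range(num_gpus):
            full_hts[gpu].update(build_parts[(gpu + step) % num_gpus])

    # Parallel probe: every GPU probes only its local probe partition
    # against its local full HT, so no inter-GPU traffic is needed.
    results = []
    for gpu in range(num_gpus):
        ht = full_hts[gpu]
        results.append([(k, ht[k]) for k in probe_parts[gpu] if k in ht])
    return full_hts, results


orders = [(11, 1), (23, 5), (27, 2), (29, 4)]
lineitem_keys = [23, 14, 56, 11, 39, 27, 23]
full_hts, per_gpu_results = replicated_join(orders, lineitem_keys, num_gpus=3)
flat_results = [row for part in per_gpu_results for row in part]
print(flat_results)
```

Every simulated GPU does the full build work, which is why the build phase of this variant cannot speed up with more GPUs, while the probe work is divided.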

SCALING OF INNER JOIN
Benchmark Problem

- randomly generated 8-byte keys
- build table size = probe table size = 335544320 rows (worst case for HT creation fitting in the memory of a single GPU: 2x 2.5GiB for tables, 2x 10GiB for HT + staging buffers; needed for the strong scaling experiment)
- HT occupancy = 50%
- selectivity = 0 (for analytical purposes; we will look at a real problem later)
- build and probe tables are evenly partitioned across GPUs

GPU 0: build table rows 0…B1-1, probe table rows 0…P1-1
GPU 1: build table rows B1…B2-1, probe table rows P1…P2-1
GPU 2: build table rows B2…B3-1, probe table rows P2…P3-1
…
GPU #GPU: build table rows B#…B-1, probe table rows P#…P-1

SCALING OF INNER JOIN ON DGX-2
with redundant build of replicated HT

[Chart: runtime [ms] (total, build, probe) and parallel efficiency (total, build, probe) vs. #GPUs, 1 to 16]

Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)

SCALING OF INNER JOIN
Basic Idea

Open addressing hash table with N buckets

key -> hash_value = hf(key) -> bucket_idx = hash_value % N

Partition N hash table buckets equally onto GPUs:

GPU 0: buckets 0…N1-1
GPU 1: buckets N1…N2-1
…
GPU #GPU: buckets N#…N-1

The bucket_idx and target HT partition can be computed locally from the key
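A minimal sketch of the key-to-partition mapping follows. The slides do not say which hash function hf is used, so the multiplicative hash below is only a stand-in, and the contiguous equal-sized bucket ranges (with a possibly shorter last range) are an assumption about how N1, N2, … are chosen.

```python
MASK64 = (1 << 64) - 1


def hf(key):
    """Stand-in 64-bit multiplicative hash (an assumption, not the talk's hf)."""
    return (key * 0x9E3779B97F4A7C15) & MASK64


def bucket_and_partition(key, num_buckets, num_gpus, hash_fn=hf):
    """Return (bucket_idx, owning GPU) using only the key and global constants."""
    bucket_idx = hash_fn(key) % num_buckets
    # Contiguous ranges of ceil(N / #GPU) buckets per GPU.
    buckets_per_gpu = -(-num_buckets // num_gpus)
    return bucket_idx, bucket_idx // buckets_per_gpu


for key in [11, 23, 27, 29]:
    print(key, bucket_and_partition(key, num_buckets=16, num_gpus=4))
```

Because the mapping needs no lookup table or communication, any GPU can decide locally where a key belongs.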

SCALING OF INNER JOIN
parallel build of a replicated HT (phase 1: step 0, steps 1..#GPU-1, step #GPU)

GPU 0: build table rows 0…B1-1, probe table rows 0…P1-1, temp HT; insert if hash to bucket 0..N1-1
GPU 1: build table rows B1…B2-1, probe table rows P1…P2-1, temp HT; insert if hash to bucket N1..N2-1
…
GPU #GPU: build table rows B#…B-1, probe table rows P#…P-1, temp HT; insert if hash to bucket N#..N-1

SCALING OF INNER JOIN
parallel build of a replicated HT (phase 2 – merge step)

Before the merge, each temp HT holds one bucket range:

GPU 0 temp HT: 0…N1-1
GPU 1 temp HT: N1…N2-1
GPU 2 temp HT: N2…N3-1
GPU # temp HT: N#…N-1

After the merge, each res HT (GPU 0, GPU 1, GPU 2, …, GPU #) holds all bucket ranges: 0…N1-1, N1…N2-1, N2…N3-1, …, N#…N-1
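The two phases can be sketched as a small CPU simulation. Assumptions, all mine: dicts stand in for the open addressing hash tables, the hash is simply key % num_buckets, build partitions rotate through the GPUs one per step, and the merge is modeled as every GPU copying all temp HT partitions. The function name is hypothetical.

```python
def parallel_build_replicated(build_parts, num_buckets):
    num_gpus = len(build_parts)
    buckets_per_gpu = -(-num_buckets // num_gpus)  # ceil(N / #GPU)

    def owner(key):
        # GPU whose bucket range contains this key's bucket.
        return (key % num_buckets) // buckets_per_gpu

    # Phase 1: each build partition visits every GPU once; a GPU inserts
    # only the keys that hash into its own bucket range.
    temp_hts = [dict() for _ in range(num_gpus)]
    for step in range(num_gpus):
        for gpu in range(num_gpus):
            for key, payload in build_parts[(gpu + step) % num_gpus]:
                if owner(key) == gpu:
                    temp_hts[gpu][key] = payload

    # Phase 2 (merge): every GPU gathers all temp HT partitions into its
    # own result HT, so the full table is replicated everywhere.
    res_hts = []
    for _ in range(num_gpus):
        res = {}
        for temp in temp_hts:
            res.update(temp)
        res_hts.append(res)
    return temp_hts, res_hts


build_parts = [[(11, 1), (23, 5)], [(27, 2), (29, 4)]]
temp_hts, res_hts = parallel_build_replicated(build_parts, num_buckets=8)
print(temp_hts)
print(res_hts[0])
```

In the sketch, phase 1 divides the insert work across GPUs, while phase 2 copies the whole table to every GPU regardless of how many GPUs take part.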

SCALING OF INNER JOIN ON DGX-2
with parallel build of replicated HT

[Chart: runtime [ms] (total, build, probe) and parallel efficiency (total, build, probe) vs. #GPUs, 1 to 16]

Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)

SCALING OF INNER JOIN ON DGX-2
with parallel build of replicated HT

With 16 GPUs most of the time is spent in HT merging

SCALING OF INNER JOIN
parallel build of partitioned HT and parallel probe

Replicated (a full HT on each of GPU 0, GPU 1, GPU 2):
- Limited capacity
- Slower building: need to merge HT partitions
- Faster probing: no inter-GPU traffic

Partitioned (one full HT spread across GPU 0, GPU 1, GPU 2):
- High capacity
- Faster building: no need to merge partitions
- Slower probing: need to access remote partitions

SCALING OF INNER JOIN
parallel build of a partitioned HT (ring exchange) (step 0, steps 1..#GPU-1, step #GPU)

GPU 0: build table rows 0…B1-1, probe table rows 0…P1-1, HT buckets 0…N1-1; insert if hash to bucket 0..N1-1
GPU 1: build table rows B1…B2-1, probe table rows P1…P2-1, HT buckets N1…N2-1; insert if hash to bucket N1..N2-1
…
GPU #GPU: build table rows B#…B-1, probe table rows P#…P-1, HT buckets N#…N-1; insert if hash to bucket N#..N-1

SCALING OF INNER JOIN
parallel probe of a partitioned HT (ring exchange) (step 0, steps 1..#GPU-1, step #GPU)

GPU 0: build table rows 0…B1-1, probe table rows 0…P1-1, HT buckets 0…N1-1; probe if hash to bucket 0..N1-1
GPU 1: build table rows B1…B2-1, probe table rows P1…P2-1, HT buckets N1…N2-1; probe if hash to bucket N1..N2-1
…
GPU #GPU: build table rows B#…B-1, probe table rows P#…P-1, HT buckets N#…N-1; probe if hash to bucket N#..N-1
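The ring-exchange probe can be sketched as a CPU simulation. This is a reading of the diagrams rather than the authors' code: it assumes that probe partitions rotate through the GPUs one step at a time, that each GPU probes only the keys falling into its own bucket range, and that key % num_buckets serves as the hash. The name ring_probe is hypothetical.

```python
def ring_probe(ht_parts, probe_parts, num_buckets):
    """Probe a partitioned HT; ht_parts[g] holds the bucket range owned by GPU g."""
    num_gpus = len(ht_parts)
    buckets_per_gpu = -(-num_buckets // num_gpus)  # ceil(N / #GPU)
    results = [[] for _ in range(num_gpus)]

    for step in range(num_gpus):
        for gpu in range(num_gpus):
            # In this step GPU `gpu` sees the probe partition that started
            # on GPU (gpu + step) % num_gpus.
            for key in probe_parts[(gpu + step) % num_gpus]:
                owns_bucket = (key % num_buckets) // buckets_per_gpu == gpu
                if owns_bucket and key in ht_parts[gpu]:
                    results[gpu].append((key, ht_parts[gpu][key]))
    return results


# HT partitions for 8 buckets on 2 GPUs: buckets 0..3 on GPU 0, 4..7 on GPU 1.
ht_parts = [{11: 1, 27: 2}, {23: 5, 29: 4}]
probe_parts = [[23, 14, 56, 11], [39, 27, 23]]

ring_results = ring_probe(ht_parts, probe_parts, num_buckets=8)
print(ring_results)  # [[(11, 1), (27, 2)], [(23, 5), (23, 5)]]
```

Every probe partition is scanned by every GPU, so the total scan work grows with the number of GPUs even though each key is looked up only once.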

SCALING OF INNER JOIN ON DGX-2
parallel build of partitioned HT and parallel probe (ring exchange)

[Chart: inner join with parallel build of distributed HT (ring exchange); runtime [ms] (total, build, probe) and parallel efficiency (total, build, probe) vs. #GPUs, 1 to 16]

Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)

SCALING OF INNER JOIN ON DGX-2
parallel build of partitioned HT – Memory Subsystem Metrics

[Chart: metric value [%] vs. #GPUs, 2 to 16; series: Unified Cache Hit Rate, L2 Cache Hit Rate, random mem ops/coalesced mem ops per step]

SCALING OF INNER JOIN
parallel probe of a partitioned HT (staged direct send) (round 0, round (k-1), round #GPU)

GPU 0: build table rows 0…B1-1, probe table rows 0…P1-1, HT buckets 0…N1-1
GPU 1: build table rows B1…B2-1, probe table rows P1…P2-1, HT buckets N1…N2-1
…
GPU #GPU: build table rows B#…B-1, probe table rows P#…P-1, HT buckets N#…N-1

Round 0: keys are selected by destination: if hash to bucket 0..N1-1, if hash to bucket N1..N2-1, if hash to bucket N2..N3-1, …, if hash to bucket N#..N-1

Round (k-1), shown for GPU 0, GPU K (BK…BK+1-1, PK…PK+1-1, NK…NK+1-1) and GPU 2K (B2K…B2K+1-1, P2K…P2K+1-1, N2K…N2K+1-1): if hash to bucket NK…NK+1-1, if hash to bucket N2K…N2K+1-1

SCALING OF INNER JOIN ON DGX-2
parallel build of partitioned HT and parallel probe (staged direct send)

[Chart: runtime [ms] (total, build, probe) and parallel efficiency (total, build, probe) vs. #GPUs, 1 to 16]

Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)

SCALING OF INNER JOIN ON DGX-2
replicated HT vs. partitioned HT (16 GPUs, total # rows = 671088640)

[Chart: runtime [ms] with replicated HT and with partitioned HT, and speedup (replicated, partitioned), vs. probe tbl size / build tbl size, 1 to 512]

Runtimes are the minimum of 5 repetitions for probe + build (excluding setup overhead, e.g. allocation of hash tables or temp buffers)


REAL OLAP QUERIES

TPC-H BENCHMARK

SQL code for TPC-H Query 4:

select
    o_orderpriority,
    count(o_orderkey) as order_count
from
    orders
where
    o_orderdate >= date '[DATE]' and
    o_orderdate < date '[DATE]' + interval '3' month and
    exists (select * from lineitem
            where l_orderkey = o_orderkey and
                  l_commitdate < l_receiptdate)
group by
    o_orderpriority
order by
    o_orderpriority;

The exists subquery is a semi-join.

CPU execution breakdown: join 99%, group-by 1%

Q4: INPUT DATA

orders (1.5M rows per SF):

o_orderkey  o_orderdate  o_orderpriority
7           1996-01-10   2-HIGH
32          1995-07-16   2-HIGH
33          1993-10-27   3-MEDIUM
34          1998-07-21   3-MEDIUM

lineitem (6M rows per SF):

l_orderkey  l_commitdate  l_receiptdate
7
7
7
32          1995-10-07  >  1995-08-27
32          1995-08-20  <  1995-09-14
32          1995-10-01  >  1995-09-03
34
34

Q4 JOIN: BUILD

Example orders row on GPU 0: 32, 1995-07-16, 2-HIGH

- filter (selectivity 3.8%): o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month
- compute destination HT partition
- push (o_orderkey, o_orderpriority) to the remote GPU (GPU 1)
- insert (o_orderkey, o_orderpriority) into the local HT partition

Q4 JOIN: PROBE

Example lineitem row on GPU 0: 32, 1995-08-20, 1995-09-14

- filter (selectivity 63%): l_commitdate < l_receiptdate
- compute destination HT partition
- push l_orderkey to the remote GPU (GPU 1)
- probe against the local HT partition
- on a match: remove element from HT (semi-join), increment o_orderpriority counter (groupby)
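The build and probe steps for Q4 can be sketched on a single device with the example rows from the input data slide. The date range below is a hypothetical stand-in for the [DATE] placeholder, a dict stands in for the hash table, and the routing to remote HT partitions is left out; the function name q4 is an assumption.

```python
from collections import Counter
from datetime import date


def q4(orders, lineitems, start, end):
    # Build: filter orders by date, insert o_orderkey -> o_orderpriority.
    ht = {key: prio for key, odate, prio in orders if start <= odate < end}

    counts = Counter()
    # Probe: filter lineitems, then look up l_orderkey.
    for key, commit, receipt in lineitems:
        if commit < receipt and key in ht:
            # Semi-join: remove the matched order so it is counted only once,
            # then increment the counter of its priority (group-by).
            counts[ht.pop(key)] += 1
    return sorted(counts.items())


orders = [
    (7, date(1996, 1, 10), "2-HIGH"),
    (32, date(1995, 7, 16), "2-HIGH"),
    (33, date(1993, 10, 27), "3-MEDIUM"),
    (34, date(1998, 7, 21), "3-MEDIUM"),
]
lineitems = [
    (32, date(1995, 10, 7), date(1995, 8, 27)),
    (32, date(1995, 8, 20), date(1995, 9, 14)),
    (32, date(1995, 10, 1), date(1995, 9, 3)),
]

# Hypothetical quarter: 1995-07-01 up to (not including) 1995-10-01.
q4_result = q4(orders, lineitems, date(1995, 7, 1), date(1995, 10, 1))
print(q4_result)  # [('2-HIGH', 1)]
```

Removing the entry on the first match is what turns the inner join into a semi-join: further late lineitems of the same order no longer find a match.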

TEST SETUP

Performance metrics: time, parallel efficiency, throughput (input data size / time)

Use 8B keys, 2B encoded dates, 1B encoded priority string

TPC-H Q4 SF1000:

- All tables in CSV format: 1000GB
- Input columns used in Q4: 89GB
- GPU hash table (50% HT occupancy): 1.4GB
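The 89GB figure for the input columns can be roughly cross-checked from the stated encodings and the per-SF row counts of the Q4 input data slide. This is back-of-the-envelope arithmetic, assuming decimal gigabytes and exactly three columns per table (key, date, priority for orders; key and two dates for lineitem).

```python
SF = 1000

orders_rows = 1_500_000 * SF    # 1.5M rows per SF
lineitem_rows = 6_000_000 * SF  # 6M rows per SF

# orders: 8B o_orderkey + 2B o_orderdate + 1B o_orderpriority
orders_bytes = orders_rows * (8 + 2 + 1)
# lineitem: 8B l_orderkey + 2B l_commitdate + 2B l_receiptdate
lineitem_bytes = lineitem_rows * (8 + 2 + 2)

total_gb = (orders_bytes + lineitem_bytes) / 1e9
print(total_gb)  # 88.5
```

The result of 88.5 GB is consistent with the 89GB shown on the slide; the lineitem columns account for 72 GB of it.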

PERFORMANCE RESULTS ON DGX-2
Q4 SF1000, input distributed in GPU memory

[Chart: Q4 execution time (s) vs. # of GPUs, 1 to 16; series: 6M rows chunk]

[Chart: Q4 parallel efficiency vs. # of GPUs, 1 to 16; series: 6M rows chunk]

DGX-2 PROFILE: INPUT IN GPU MEMORY

the main bottleneck is HT build (74% of the overall query time)

CUDA API overhead (kernel launches, recording events)

OPTIMIZED CHUNK SIZE ON DGX-2
Q4 SF1000, input distributed in GPU memory

[Chart: Q4 execution time (s) vs. # of GPUs, 1 to 16; series: 6M rows chunk, 1 chunk per GPU]

[Chart: Q4 parallel efficiency vs. # of GPUs, 1 to 16; series: 6M rows chunk, 1 chunk per GPU]

PERFORMANCE RESULTS ON DGX-2
Q4 SF1000, input in system memory

[Chart: throughput (GB/s), axis 0 to 60, for: single V100; replicated HT - redundant build, parallel probe; replicated HT - cooperative build, parallel probe; partitioned HT - cooperative build, parallel probe. Reference marks: PCIe3 x16, 4x PCIe3 x16]


DGX-2 PROFILE: INPUT IN CPU MEMORY

the main bottleneck is HT probe (82% of the overall query time)

IS THIS THE BEST WE CAN DO?

l_orderkey (8B)  l_commitdate (2B)  l_receiptdate (2B)
7
7
7
32               1995-10-07         1995-08-27
32               1995-08-20         1995-09-14
32               1995-10-01         1995-09-03
34
34

- filters can be executed on the CPU
- l_orderkey can be compressed to <8B per key
- l_commitdate and l_receiptdate can be compressed to <2B per date

RLE-DELTA-RLE COMPRESSION (1)

Uncompressed: 1 1 2 2 3 3

RLE: vals 1,2,3 and runs 2,2,2
runs 2,2,2 -> bit-packing -> 2:0,0,0
vals 1,2,3 -> Delta -> 1,1,1 -> RLE -> vals 1, runs 3 -> bit-packing -> 1:0 and 3:0

(1) cf. "Breaking the Speed of Interconnect with Compression for Database Applications", GTC Silicon Valley 2018, Session ID S8417, http://on-demand-gtc.gputechconf.com/gtc-quicklink/7LVQs
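The RLE, Delta, RLE chain can be sketched in plain Python on the slide's example. The final bit-packing stage is left out, the delta stage assumes an implicit starting value of 0 (which matches 1,2,3 becoming 1,1,1), and all function names are hypothetical.

```python
def rle_encode(values):
    """Run-length encode into parallel lists (vals, runs)."""
    vals, runs = [], []
    for v in values:
        if vals and vals[-1] == v:
            runs[-1] += 1
        else:
            vals.append(v)
            runs.append(1)
    return vals, runs


def rle_decode(vals, runs):
    out = []
    for v, r in zip(vals, runs):
        out.extend([v] * r)
    return out


def delta_encode(values):
    """Differences between neighbours, starting from an implicit 0."""
    prev, out = 0, []
    for v in values:
        out.append(v - prev)
        prev = v
    return out


def delta_decode(deltas):
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out


def compress(values):
    vals, runs = rle_encode(values)                          # first RLE
    delta_vals, delta_runs = rle_encode(delta_encode(vals))  # Delta, second RLE
    return runs, delta_vals, delta_runs


def decompress(runs, delta_vals, delta_runs):
    vals = delta_decode(rle_decode(delta_vals, delta_runs))
    return rle_decode(vals, runs)


compressed = compress([1, 1, 2, 2, 3, 3])
print(compressed)  # ([2, 2, 2], [1], [3])
```

Sorted keys with repeats, such as l_orderkey in lineitem, suit this chain: the first RLE removes the repeats, and the delta step turns steadily increasing values into constant runs that the second RLE collapses.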

APPLYING COMPRESSION TO TPC-H Q4

- Use RLE + Delta + RLE + bit-packing
- Compression rate for SF1K l_orderkey: 14x
- Multiple streams per GPU
- Pipeline decompress & probe kernels

l_orderkey decompression throughput (GB/s), reading from system memory:

Uncompressed (8B)    12
RLE+bp               62
RLE+Delta+RLE+bp     113

TPC-H SF1000 Q4 RESULTS

Query time (s), lower is better:

Best published CPU-only results* (2x Intel Xeon Platinum 8180)   3.2
DGX-2, GPU HT, CPU input, w/o compression                        1.8
DGX-2, GPU HT, CPU input, with compression                       1.0
DGX-2, GPU HT, GPU input                                         0.06

*CPU-only results from: http://www.tpc.org/tpch/results/tpch_result_detail.asp?id=117111701

TAKEAWAY

1. Joins are the key bottleneck in OLAP
2. Multi-GPU joins improve perf and enable larger workloads
3. Speed-ups on real analytical queries

DGX-2 can run TPC-H Q4 SF1K in 1 second! (input data in system memory)

If the columns are preloaded to GPU memory, Q4 time goes down to just 60 ms

[Charts repeated from earlier slides: join vs. group-by execution breakdown (99% / 1%); runtime and parallel efficiency vs. #GPUs; TPC-H SF1000 Q4 query times (3.2 s, 1.8 s, 1.0 s, 0.06 s)]