6/1/2015 UNIVERSITY OF WISCONSIN

Toward GPUs being mainstream in analytic processing

An initial argument using simple scan-aggregate queries

Jason Power, Yinan Li, Mark D. Hill, Jignesh M. Patel, David A. Wood

<powerjg@cs.wisc.edu>

DaMoN 2015

Summary

▪ GPUs are energy efficient
▪ Discrete GPUs unpopular for DBMS
▪ New integrated GPUs solve the problems

▪ Scan-aggregate GPU implementation
  ▪ Wide bit-parallel scan
  ▪ Fine-grained aggregate GPU offload

▪ Up to 70% energy savings over multicore CPU
  ▪ Even more in the future

Analytic Data is Growing

▪ Data is growing rapidly

▪ Analytic DBs increasingly important

Want: High performance
Need: Low energy

Source: IDC’s Digital Universe Study. 2012.

GPUs to the Rescue?

▪ GPUs are becoming more general
  ▪ Easier to program
  ▪ Integrated GPUs are everywhere

▪ GPUs show great promise [Govindaraju ’04, He ’14, He ’14, Kaldewey ’12, Satish ’10, and many others]
  ▪ Higher performance than CPUs
  ▪ Better energy efficiency

▪ Analytic DBs look like GPU workloads

GPU Microarchitecture

[Figure: block diagram of a graphics processing unit. Labels: L2 Cache; Compute Unit (CU) containing I-Fetch/Sched, SP lanes, Register File, L1 Cache, Scratchpad Cache.]

Discrete GPUs

[Figure: a CPU chip (Cores) and a discrete GPU, each with its own Memory Bus, connected by a PCIe Bus. The final build step is labeled "➍ And repeat".]

▪ Copy data over PCIe
  ▪ Low bandwidth
  ▪ High latency

▪ Small working memory

▪ High latency user→kernel calls

▪ Repeated many times

98% of time spent not computing

Integrated GPUs

[Figure: a single heterogeneous chip containing CPU cores and GPU CUs, attached to one Memory Bus.]

Heterogeneous System Architecture (HSA)

▪ API for tightly-integrated accelerators

▪ Industry support
  ▪ Initial hardware support today
  ▪ HSA Foundation (AMD, ARM, Qualcomm, others)

▪ No need for data copies
  ▪ Cache coherence and shared address space

▪ No OS kernel interaction
  ▪ User-mode queues

Outline

▪ Background

▪ Algorithms
  ▪ Scan
  ▪ Aggregate

▪ Results

Analytic DBs

▪ Resident in main-memory

▪ Column-based layout

▪ WideTable & BitWeaving [Li and Patel ’13 & ’14]
  ▪ Convert queries to mostly scans by pre-joining tables
  ▪ Fast scan by using sub-word parallelism
  ▪ Similar to industry proposals [SAP Hana, Oracle Exalytics, IBM DB2 BLU]

▪ Scan-aggregate queries

Running Example

Shirt Color   Shirt Amount
Green         1
Green         3
Blue          1
Green         5
Yellow        7
Red           2
Yellow        1
Blue          4
Yellow        2

Color    Code
Red      0
Blue     1
Green    2
Yellow   3

Shirt Color (encoded): 2 2 1 2 3 0 3 1 3

Count the number of green shirts in the inventory

Scan the color column for green (2)

Aggregate amount where there is a match
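As a minimal sketch of the running example (not the authors' implementation), the two steps can be written as a scan that produces one match bit per row, followed by an aggregate that sums the amounts of the matching rows. The function names `scan_equal` and `aggregate_sum` are hypothetical; the data comes from the tables above.

```python
# Dictionary encoding from the slide: each color maps to a small integer code.
COLOR_CODE = {"Red": 0, "Blue": 1, "Green": 2, "Yellow": 3}

# Encoded Shirt Color column and Shirt Amount column from the running example.
shirt_color = [2, 2, 1, 2, 3, 0, 3, 1, 3]
shirt_amount = [1, 3, 1, 5, 7, 2, 1, 4, 2]


def scan_equal(column, code):
    """Scan: return one match bit per row (1 where the row equals `code`)."""
    return [1 if value == code else 0 for value in column]


def aggregate_sum(bit_vector, values):
    """Aggregate: sum the values of the rows whose match bit is set."""
    return sum(v for bit, v in zip(bit_vector, values) if bit)


matches = scan_equal(shirt_color, COLOR_CODE["Green"])
total_green = aggregate_sum(matches, shirt_amount)
print(matches)      # [1, 1, 0, 1, 0, 0, 0, 0, 0]
print(total_green)  # 9
```

The match bits for the first eight rows, 11010000, are the same result bit vector that appears on the scan slides.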

Traditional Scan Algorithm

Shirt Color: 2 (10), 2 (10), 1 (01), 2 (10), 3 (11), 0 (00), 3 (11), 1 (01), 3 (11)

Column data: 10 10 01 10 ...
Compare code (Green): 10
Result bit vector: 11010000 0000...

Vertical Layout

Color: c0 2 (10), c1 2 (10), c2 1 (01), c3 2 (10), c4 3 (11), c5 0 (00), c6 3 (11), c7 1 (01), c8 3 (11), c9 0 (00)

word   c0  c1  c2  c3  c4  c5  c6  c7
w0     1   1   0   1   1   0   1   0
w1     0   0   1   0   1   0   1   1

word   c8  c9
w2     1   0
w3     1   0

In memory: 11011010 00101011 10000000 ...
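The layout above can be sketched in a few lines. This is an illustration under assumptions, not the BitWeaving code: `to_vertical` is a hypothetical helper, and it uses 8-bit words to match the slide, where real hardware would use 64-bit or wider words. For each segment of codes it emits one word per bit position, most significant bit first, with the first code in the leftmost bit.

```python
def to_vertical(codes, code_bits, word_bits):
    """Bit-slice `codes`: per segment of `word_bits` codes, emit `code_bits` words."""
    words = []
    for start in range(0, len(codes), word_bits):
        segment = codes[start:start + word_bits]
        # One output word per bit position, most significant bit first.
        for bit in range(code_bits - 1, -1, -1):
            word = 0
            for position, code in enumerate(segment):
                if (code >> bit) & 1:
                    # The first code in the segment occupies the leftmost bit.
                    word |= 1 << (word_bits - 1 - position)
            words.append(word)
    return words


codes = [2, 2, 1, 2, 3, 0, 3, 1, 3, 0]  # c0..c9 from the slide
words = to_vertical(codes, code_bits=2, word_bits=8)
print([format(w, "08b") for w in words])
# ['11011010', '00101011', '10000000', '10000000']
```

The four printed words correspond to w0 through w3 in the table, with the unused positions of the second segment padded with zeros.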

CPU BitWeaving Scan

Column data: 11011010 00101011 10000000
Compare code: 11111111 00000000
Result bit vector: 11010000 0000...

CPU width: 64-bits, up to 256-bit SIMD

GPU BitWeaving Scan

Column data: 11011010 00101011 10000000
Compare code: 11111111 11111111 11111111
Result bit vector: 11010000 0000...

GPU width: 16,384-bit SIMD

GPU Scan Algorithm

▪ GPU uses very wide “words”
  ▪ CPU: 64-bits or 256-bits with SIMD
  ▪ GPU: 16,384 bits (256 lanes × 64-bits)

▪ Memory and caches optimized for bandwidth

▪ HSA programming model
  ▪ No data copies
  ▪ Low CPU-GPU interaction overhead
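The bit-parallel equality scan can be emulated on the CPU to show why word width matters. This is a sketch under assumptions and not GPU code: `bitweaving_equal` is a hypothetical function, and a single Python integer stands in for one wide word, so the same logic runs at 8 bits (as drawn on the slides) or at 16,384 bits (256 lanes × 64 bits). The bit planes are assumed to be stored most significant bit first.

```python
def bitweaving_equal(planes, code, code_bits, width):
    """Equality scan over bit planes packed into `width`-bit integers.

    planes[0] holds the most significant bit of every code, with the
    first row in the leftmost bit of the word.
    """
    all_ones = (1 << width) - 1
    result = all_ones  # every row starts as a candidate match
    for i, plane in enumerate(planes):
        wanted = (code >> (code_bits - 1 - i)) & 1
        # Broadcast the wanted bit across the word, then keep rows that agree.
        mask = all_ones if wanted else 0
        result &= ~(plane ^ mask) & all_ones
    return result


# Rows c0..c7 of the running example as two 8-bit planes.
planes = [0b11011010, 0b00101011]
hits = bitweaving_equal(planes, code=0b10, code_bits=2, width=8)
print(format(hits, "08b"))  # 11010000

# The same scan over one GPU-sized word: the 8-row pattern repeated 2048 times.
GPU_WORD_BITS = 256 * 64  # 16384
wide_planes = [int(format(p, "08b") * (GPU_WORD_BITS // 8), 2) for p in planes]
wide_hits = bitweaving_equal(wide_planes, 0b10, 2, GPU_WORD_BITS)
print(bin(wide_hits).count("1"))  # 6144
```

Both calls execute two plane comparisons. The wider word simply covers 16,384 rows per comparison instead of 8, which is the effect the slide attributes to the GPU.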

CPU Aggregate Algorithm

Result bit vector: 11010000 0000...
Shirt Amount: 1 3 1 5 7 2 1 4 2
Result: 1+3+5+...

GPU Aggregate Algorithm

On CPU:
Result bit vector: 11010000 0000...
Column offsets: 0, 1, 3, ...

On GPU:
Column offsets: 0, 1, 3, ...
Shirt Amount: 1 3 1 5 7 2 1 4 2
Result: 1+3+5+...

Aggregate Algorithm

▪ Two phases
  ▪ Convert from BitVector to offsets (on CPU)
  ▪ Materialize data and compute (offload to GPU)

▪ Two group-by algorithms (see paper)

▪ HSA programming model
  ▪ Fine-grained sharing
  ▪ Can offload subset of computation
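The two phases can be sketched on the running example. Both functions below are hypothetical and run on the CPU; in the slides the second phase is the part offloaded to the GPU, and the group-by variants from the paper are not shown here.

```python
def bitvector_to_offsets(bit_vector):
    """Phase 1 (on CPU in the slides): turn match bits into row offsets."""
    return [row for row, bit in enumerate(bit_vector) if bit]


def gather_and_sum(offsets, column):
    """Phase 2 (offloaded to GPU in the slides): fetch matching values and reduce."""
    return sum(column[row] for row in offsets)


result_bits = [1, 1, 0, 1, 0, 0, 0, 0, 0]   # scan output for Green
shirt_amount = [1, 3, 1, 5, 7, 2, 1, 4, 2]

offsets = bitvector_to_offsets(result_bits)
print(offsets)                                # [0, 1, 3]
print(gather_and_sum(offsets, shirt_amount))  # 9
```

Splitting the work this way means only the offsets cross between the two phases, which fits the fine-grained sharing the slide describes.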

Outline

▪ Background

▪ Algorithms

▪ Results

Experimental Methods

▪ AMD A10-7850
  ▪ 4-core CPU
  ▪ 8-compute-unit GPU
  ▪ 16 GB capacity, 21 GB/s DDR3 memory
  ▪ Separate discrete GPU

▪ Watts-Up meter for full-system power

▪ TPC-H @ scale-factor 10

Scan Performance & Energy

Takeaway: Integrated GPU most efficient for scans

TPC-H Queries

[Figure: Query 12 Performance and Query 12 Energy.]

Integrated GPU faster for both aggregate and scan computation

More energy: Decrease in latency does not offset power increase

Less energy: Decrease in latency AND decrease in power
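Both annotations follow from energy being average power multiplied by run time. As a sketch, with $P$ for average full-system power and $t$ for query latency (symbols assumed here, not taken from the slides), the energy of a GPU configuration relative to the CPU baseline is:

```latex
E = P \cdot t,
\qquad
\frac{E_{\text{GPU}}}{E_{\text{CPU}}}
  = \frac{P_{\text{GPU}}}{P_{\text{CPU}}} \cdot \frac{t_{\text{GPU}}}{t_{\text{CPU}}}
```

Energy falls only when the product of the two ratios is below 1. A configuration that shortens latency but raises power by a larger factor uses more energy, while one that lowers both latency and power always uses less.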

Future Die Stacked GPUs

▪ 3D die stacking

▪ Same physical & logical integration

▪ Increased compute

▪ Increased bandwidth

[Figure: die-stacked chip. Labels: DRAM, CPU, GPU, Board.]

Power et al. Implications of 3D GPUs on the Scan Primitive. SIGMOD Record, Volume 44, Issue 1, March 2015.

Conclusions

                   Discrete GPUs   Integrated GPUs   3D Stacked GPUs
Performance        High ☺          Moderate          High ☺
Memory Bandwidth   High ☺          Low ☹             High ☺
Overhead           High ☹          Low ☺             Low ☺
Memory Capacity    Low ☹           High ☺            Moderate

Questions?

HSA vs CUDA/OpenCL

▪ HSA defines a heterogeneous architecture
  ▪ Cache coherence
  ▪ Shared virtual addresses
  ▪ Architected queuing
  ▪ Intermediate language

▪ CUDA/OpenCL are a level above HSA
  ▪ Come with baggage
  ▪ Not as flexible
  ▪ May not be able to take advantage of all features

Scan Performance & Energy

Group-by Algorithms

All TPC-H Results

Average TPC-H Results

[Figure: Average Performance and Average Energy.]

What’s Next?

▪ Developing cost model for GPU
  ▪ Using the GPU is just another algorithm to choose
  ▪ Evaluate exactly when the GPU is more efficient

▪ Future “database machines”
  ▪ GPUs are a good tradeoff between specialization and commodity

Conclusions

▪ Integrated GPUs viable for DBMS?
  ▪ Solve problems with discrete GPUs
  ▪ (Somewhat) better performance and energy

▪ Looking toward the future...
  ▪ CPUs cannot keep up with bandwidth
  ▪ GPUs perfectly designed for these workloads