Vectorized Execution (Part I)
@Andy_Pavlo // 15-721 // Spring 2018
Advanced Database Systems
Lecture #21
CMU 15-721 (Spring 2018)
Background
Hardware
Vectorized Algorithms (Columbia)
VECTORIZATION
The process of converting an algorithm's scalar implementation, which processes a single pair of operands at a time, into a vector implementation that performs one operation on multiple pairs of operands at once.
WHY THIS MATTERS
Say we can parallelize our algorithm over 32 cores.
Each core has 4-wide SIMD registers.
Potential Speed-up: 32x × 4x = 128x
MULTI-CORE CPUS
Use a small number of high-powered cores.
→ Intel Xeon Skylake / Kaby Lake
→ High power consumption and area per core.

Massively superscalar and aggressive out-of-order execution:
→ Instructions are issued from a sequential stream.
→ Check for dependencies between instructions.
→ Process multiple instructions per clock cycle.
MANY INTEGRATED CORES (MIC)
Use a larger number of low-powered cores.
→ Intel Xeon Phi
→ Low power consumption and area per core.
→ Expanded SIMD instructions with larger register sizes.

Knights Ferry (Columbia Paper)
→ Non-superscalar and in-order execution.
→ Cores = Intel P54C (aka Pentium from the 1990s).

Knights Landing (Since 2016)
→ Superscalar and out-of-order execution.
→ Cores = Silvermont (aka Atom).
SINGLE INSTRUCTION, MULTIPLE DATA
A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously.
All major ISAs have microarchitecture support for SIMD operations.
→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX-512
→ PowerPC: AltiVec
→ ARM: NEON
SIMD EXAMPLE

X + Y = Z, where X = [8 7 6 5 4 3 2 1] and Y = [1 1 1 1 1 1 1 1].

for (i = 0; i < n; i++) {
  Z[i] = X[i] + Y[i];
}

SISD: each loop iteration adds a single pair (xi + yi), producing one element of Z per instruction.

SIMD: four elements of X and four elements of Y are packed into 128-bit SIMD registers, a single SIMD add produces all four sums (x1+y1, x2+y2, …) at once, and the four elements of Z are written back together.
STREAMING SIMD EXTENSIONS (SSE)
SSE is a collection of SIMD instructions that target special 128-bit SIMD registers.

These registers can be packed with four 32-bit scalars, after which an operation can be performed on each of the four elements simultaneously.
First introduced by Intel in 1999.
SIMD INSTRUCTIONS (1)
Data Movement
→ Moving data in and out of vector registers.

Arithmetic Operations
→ Apply operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes).
→ Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN

Logical Instructions
→ Logical operations on multiple data items.
→ Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS
SIMD INSTRUCTIONS (2)
Comparison Instructions
→ Comparing multiple data items (==, <, <=, >, >=, !=).

Shuffle Instructions
→ Move data between SIMD registers.

Miscellaneous
→ Conversion: Transform data between x86 and SIMD registers.
→ Cache Control: Move data directly from SIMD registers to memory (bypassing the CPU cache).
INTEL SIMD EXTENSIONS
Year  Extension  Width     Integers  Single-P  Double-P
1997  MMX        64 bits   ✔
1999  SSE        128 bits  ✔         ✔ (×4)
2001  SSE2       128 bits  ✔         ✔         ✔ (×2)
2004  SSE3       128 bits  ✔         ✔         ✔
2006  SSSE3      128 bits  ✔         ✔         ✔
2006  SSE4.1     128 bits  ✔         ✔         ✔
2008  SSE4.2     128 bits  ✔         ✔         ✔
2011  AVX        256 bits  ✔         ✔ (×8)    ✔ (×4)
2013  AVX2       256 bits  ✔         ✔         ✔
2017  AVX-512    512 bits  ✔         ✔ (×16)   ✔ (×8)

Source: James Reinders
WHY NOT GPUS?
Moving data back and forth between DRAM and the GPU is slow over the PCIe bus.
There are some newer GPU-enabled DBMSs.
→ Examples: MapD, SQream, Kinetica

Emerging co-processors that can share the CPU's memory may change this.
→ Examples: AMD's APU, Intel's Knights Landing
https://db.cs.cmu.edu/seminar2018
VECTORIZATION
Choice #1: Automatic Vectorization
Choice #2: Compiler Hints
Choice #3: Explicit Vectorization
(Figure: the three choices lie on a spectrum trading ease of use for programmer control, with automatic vectorization the easiest to use and explicit vectorization giving the most control. Source: James Reinders)
AUTOMATIC VECTORIZATION
The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation.

Works only for simple loops and is rare in database operators. Requires hardware support for SIMD instructions.
AUTOMATIC VECTORIZATION
This loop is not legal to automatically vectorize: the code is written such that the addition is described as being done sequentially, and the pointers might refer to the same address (e.g., Z = X + 1 would make each Z[i] alias X[i+1]).

void add(int *X, int *Y, int *Z) {
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
COMPILER HINTS
Provide the compiler with additional information about the code to let it know that it is safe to vectorize.
Two approaches:→ Give explicit information about memory locations.→ Tell the compiler to ignore vector dependencies.
COMPILER HINTS
The restrict keyword (standard in C99; compilers provide __restrict in C++) tells the compiler that the arrays are distinct locations in memory.

void add(int *restrict X,
         int *restrict Y,
         int *restrict Z) {
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
COMPILER HINTS
This pragma tells the compiler to ignore loop dependencies for the vectors.

It's up to you to make sure that this is correct.

void add(int *X, int *Y, int *Z) {
  #pragma ivdep
  for (int i = 0; i < MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}
EXPLICIT VECTORIZATION
Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions.
Potentially not portable.
EXPLICIT VECTORIZATION
Store the vectors in 128-bit SIMD registers.

Then invoke the intrinsic to add together the vectors and write them to the output location.

void add(int *X, int *Y, int *Z) {
  __m128i *vecX = (__m128i*)X;
  __m128i *vecY = (__m128i*)Y;
  __m128i *vecZ = (__m128i*)Z;
  for (int i = 0; i < MAX/4; i++) {
    _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
  }
}
VECTORIZATION DIRECTION
Approach #1: Horizontal
→ Perform operation on all elements together within a single vector.
→ Example: a SIMD add over [0 1 2 3] yields the single sum 6.

Approach #2: Vertical
→ Perform operation in an elementwise manner on elements of each vector.
→ Example: a SIMD add of [0 1 2 3] and [1 1 1 1] yields [1 2 3 4].

Source: Przemysław Karpiński
EXPLICIT VECTORIZATION
Linear Access Operators
→ Predicate evaluation
→ Compression

Ad-hoc Vectorization
→ Sorting
→ Merging

Composable Operations
→ Multi-way trees
→ Bucketized hash tables
Source: Orestis Polychroniou
VECTORIZED DBMS ALGORITHMS
Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality.
→ Favor vertical vectorization by processing different input data per lane.
→ Maximize lane utilization by executing different things per lane subset.

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES, SIGMOD 2015
FUNDAMENTAL OPERATIONS
Selective Load
Selective Store
Selective Gather
Selective Scatter
FUNDAMENTAL VECTOR OPERATIONS

Selective Load
Vector: [A B C D]   Mask: [0 1 0 1]   Memory: [U V W X Y Z …]
Consecutive values from memory are loaded into the vector lanes where the mask is set, producing [A U C V].

Selective Store
Vector: [A B C D]   Mask: [0 1 0 1]   Memory: [U V W X Y Z …]
The vector lanes where the mask is set are written to consecutive memory locations, storing B and then D.
Selective Gather
Value Vector: [A B C D]   Index Vector: [2 1 5 3]   Memory: [U V W X Y Z …] (offsets 0–5)
The value at each index is loaded into the corresponding lane, producing [W V Z X].

Selective Scatter
Value Vector: [A B C D]   Index Vector: [2 1 5 3]   Memory: [U V W X Y Z …]
Each lane's value is written to memory at its index (A→offset 2, B→1, C→5, D→3).
ISSUES
Gathers and scatters are not really executed in parallel because the L1 cache only allows one or two distinct accesses per cycle.
Gathers are only supported in newer CPUs.
Selective loads and stores are also emulated in Xeon CPUs using vector permutations.
VECTORIZED OPERATORS
Selection Scans
Hash Tables
Partitioning
Paper provides additional info:
→ Joins, Sorting, Bloom filters.
RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES, SIGMOD 2015
SELECTION SCANS
SELECT * FROM table
 WHERE key >= $(low)
   AND key <= $(high)
Scalar (Branching)

i = 0
for t in table:
  key = t.key
  if (key ≥ low) && (key ≤ high):
    copy(t, output[i])
    i = i + 1

Scalar (Branchless)

i = 0
for t in table:
  copy(t, output[i])
  key = t.key
  m = (key ≥ low ? 1 : 0) && (key ≤ high ? 1 : 0)
  i = i + m

Source: Bogdan Raducanu
SELECTION SCANS

Vectorized

i = 0
for vt in table:
  simdLoad(vt.key, vk)
  vm = (vk ≥ low ? 1 : 0) && (vk ≤ high ? 1 : 0)
  simdStore(vt, vm, output[i])
  i = i + |vm ≠ false|

Example: SELECT * FROM table WHERE key >= "O" AND key <= "U"
Key Vector: [J O Y S U X] (IDs 1–6)
SIMD Compare → Mask: [0 1 0 1 1 0]
All Offsets: [0 1 2 3 4 5] → SIMD Store with the mask → Matched Offsets: [1 3 4]
SELECTION SCANS (results)

[Figure: throughput (billion tuples/sec) vs. selectivity (%) for Scalar (Branching), Scalar (Branchless), Vectorized (Early Mat), and Vectorized (Late Mat) on MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). On both platforms throughput is capped by memory bandwidth.]
HASH TABLES – PROBING

Scalar
Hash the input key (k1 → h1) to pick a slot in a linear-probing hash table, then compare the probe key against one table key at a time (k9, then k3, k8, …) until the match k1 is found.

Vectorized (Horizontal)
Use a linear-probing bucketized hash table, where each bucket holds several keys. Hash the input key (k1 → h1) to pick a bucket, then compare it against all keys in the bucket (k9 k3 k8 k1) with a single SIMD compare, producing a matched mask [0 0 0 1].
Vectorized (Vertical)
Probe multiple input keys at once against a linear-probing hash table:
→ Hash an input key vector [k1 k2 k3 k4] into a hash index vector [h1 h2 h3 h4].
→ SIMD gather the table keys at those indexes (e.g., [k1 k99 k88 k4]).
→ SIMD compare the gathered keys against the input keys → matched mask [1 0 0 1].
→ Matched lanes are refilled with new input keys (k5, k6); unmatched lanes advance to the next slot (h2+1, h3+1) and probe again.
HASH TABLES – PROBING (results)

[Figure: throughput (billion tuples/sec) vs. hash table size for Scalar, Vectorized (Horizontal), and Vectorized (Vertical) on MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). The vectorized speed-ups shrink once the hash table falls out of cache.]
PARTITIONING – HISTOGRAM

Use scatters and gathers to increment counts.

Replicate the histogram to handle collisions.

SIMD radix hashes the input key vector [k1 k2 k3 k4] into a hash index vector [h1 h2 h3 h4]. A plain SIMD add into a single histogram loses updates when multiple lanes map to the same bucket, so each vector lane scatters its +1 into its own replica of the histogram (# of replicas = # of vector lanes); the replicas are then summed into the final histogram.
JOINS
No Partitioning
→ Build one shared hash table using atomics.
→ Partially vectorized.

Min Partitioning
→ Partition the build-side table.
→ Build one hash table per thread.
→ Fully vectorized.

Max Partitioning
→ Partition both tables repeatedly.
→ Build and probe cache-resident hash tables.
→ Fully vectorized.
JOINS (results)

[Figure: join time (sec) split into Partition, Build, and Probe phases for Scalar vs. Vector implementations of No Partitioning, Min Partitioning, and Max Partitioning. 200M ⨝ 200M tuples (32-bit keys & payloads), Xeon Phi 7120P – 61 Cores + 4×HT.]
PARTING THOUGHTS
Vectorization is essential for OLAP queries.
These algorithms don’t work when the data exceeds your CPU cache.
We can combine all the intra-query parallelism optimizations we've talked about in a DBMS.
→ Multiple threads processing the same query.
→ Each thread can execute a compiled plan.
→ The compiled plan can invoke vectorized operations.
NEXT CLASS
Vectorization (Part II)