Optimizing array-based data structures to the limit

Optimizing array-based data structures to the limit Roman Leventov Higher Frequency Trading Ltd. [email protected] August 28, 2014

description

A performance comparison of different approaches to array indexing, encoding of discrete states, and memory layout. Useful for implementing array-based data structures and algorithms in Java.

Transcript of Optimizing array-based data structures to the limit

Page 1: Optimizing array-based data structures to the limit

Optimizing array-based data structures

to the limit

Roman Leventov

Higher Frequency Trading Ltd.

[email protected]

August 28, 2014

Page 2: Optimizing array-based data structures to the limit

Overview

Indexing

Encoding of distinct entry states
    Object data
    Primitive data

Layout of tuples of primitives

Page 3: Optimizing array-based data structures to the limit

Benchmarking environments

1. AMD K10 (2007), L1 cache: 128 KB, L2: 512 KB, L3: 6 MB
2. Intel Sandy Bridge (2011), L1: 64 KB, L2: 256 KB, L3: 20 MB
3. Intel Haswell (2013), L1: 64 KB, L2: 256 KB, L3: 3 MB

64-bit Java 1.8.0-b129–8u20
JMH ??–0.9.8

If not specified, measurements are in CPU clock cycles per operation or loop iteration.

Page 4: Optimizing array-based data structures to the limit

Section 1

Indexing

Page 5: Optimizing array-based data structures to the limit

Indexing

Simple

int e = a[i];

vs.

Unsafe

long off = ((long) i) << INT_SCALE_SHIFT;

int e = U.getInt(a, INT_BASE + off);
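The slide does not show how U, INT_BASE and INT_SCALE_SHIFT are defined. The following is a minimal sketch of one possible setup, assuming a HotSpot-like JVM on which sun.misc.Unsafe can be obtained through reflection (newer JDKs may print warnings or restrict this). The class name and the two helper methods are hypothetical.

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class UnsafeIndexingSketch {
    static final Unsafe U;
    static final long INT_BASE;
    static final int INT_SCALE_SHIFT;

    static {
        try {
            // Assumption: the singleton is held in the private static field "theUnsafe".
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
        // Offset of element 0 from the start of an int[] object.
        INT_BASE = U.arrayBaseOffset(int[].class);
        // The element size is a power of 2, so scaling can be done with a shift.
        INT_SCALE_SHIFT = Integer.numberOfTrailingZeros(U.arrayIndexScale(int[].class));
    }

    // Simple indexing: the JVM inserts a bounds check unless the JIT proves it redundant.
    static int getSimple(int[] a, int i) {
        return a[i];
    }

    // Unsafe indexing: no bounds check at all; an out-of-range i reads arbitrary memory.
    static int getUnsafe(int[] a, int i) {
        long off = ((long) i) << INT_SCALE_SHIFT;
        return U.getInt(a, INT_BASE + off);
    }
}
```

Both methods return the same element for any valid index; the unsafe variant trades the safety net for the guarantee that no check is emitted.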

Page 6: Optimizing array-based data structures to the limit

Why ever unsafe indexing?

HotSpot JIT doesn’t eliminate bound checks as perfectly as you probably think.

Page 7: Optimizing array-based data structures to the limit

Why ever simple indexing?

In performance-critical code

Simple

    ;cmp r8d, ebx
    ;jae <IOOBE location>
    mov r11, [r9 + r8*4 + 16]

Unsafe

    mov r10, r8
    shl r10, 2
    mov r11, [r9 + r10 + 16]

%r9 = a; %r8 = i
16 = INT_BASE: object header (12 bytes) + array length field (4 bytes)

Page 8: Optimizing array-based data structures to the limit

Iteration over parallel arrays
Indexing case #1

@Benchmark
public int _2_simple(State st) {
    int[] xs = st.xs, ys = st.ys;
    int dummy = 0;
    for (int i = xs.length; i-- > 0;)
        dummy ^= xs[i] + ys[i];
    return dummy;
}

Bound checks are fully eliminated!

Page 9: Optimizing array-based data structures to the limit

Iteration over parallel arrays
Indexing case #1

@Benchmark
public int _2_unsafe(State st) {
    int[] xs = st.xs, ys = st.ys;
    int dummy = 0;
    // Widen before multiplying so the byte offset cannot overflow int.
    long off = (long) xs.length * INT_SCALE;
    while ((off -= INT_SCALE) >= 0)
        dummy ^= U.getInt(xs, INT_BASE + off) +
                 U.getInt(ys, INT_BASE + off);
    return dummy;
}

Page 10: Optimizing array-based data structures to the limit

Iteration over parallel arrays
Indexing case #1

# of arrays       1      2      3      4

SB   Simple    0.78    1.3    2.2    3.4
     Unsafe    1.6     1.8    2.5    3.2
HW   Simple    1.2     2.1    3.3    4.9
     Unsafe    2.1     2.6    3.2    4.3
K10  Simple    1.6     5.8   13.1   19.5
     Unsafe    2.9     6.4   11.8   17.1

Unsafe indexing is slower when there is a single or 2-3 parallel arrays because of an odd instruction in the tight loop. JIT compiler fault?

Page 11: Optimizing array-based data structures to the limit

Binary heap
Indexing case #2

Page 12: Optimizing array-based data structures to the limit

Binary heap
Indexing case #2

int leftChildI = parentI * 2 + 1;
int rightChildI = leftChildI + 1;

long leftChildOff = parentOff * 2 + INT_SCALE;
long rightChildOff = leftChildOff + INT_SCALE;
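A short sketch of why the offset form works: if offsets are measured from the first element (that is, without INT_BASE) and INT_SCALE is the element size, then doubling a parent's offset and adding INT_SCALE gives exactly the offset of the left child's index. The class and method names are hypothetical, and INT_SCALE = 4 is an assumption that holds for int[] on HotSpot.

```java
final class HeapOffsetSketch {
    // Assumption: 4-byte int elements.
    static final long INT_SCALE = 4;

    // Byte offset of element `index`, relative to the first array element.
    static long offsetOf(int index) {
        return index * INT_SCALE;
    }

    static int leftChildIndex(int parentI) {
        return parentI * 2 + 1;
    }

    // (2 * i + 1) * INT_SCALE == 2 * (i * INT_SCALE) + INT_SCALE
    static long leftChildOffset(long parentOff) {
        return parentOff * 2 + INT_SCALE;
    }

    static long rightChildOffset(long parentOff) {
        return leftChildOffset(parentOff) + INT_SCALE;
    }
}
```

Working with offsets directly avoids converting an index into an address on every access while sifting through the heap.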

Page 13: Optimizing array-based data structures to the limit

Binary heap sort
Indexing case #2

Heapsort version with unsafe indexing is faster by 12–13% on 4 KB array and by 7–10% on 4 MB array.

With simple indexing lower bound checks are eliminated, but upper mostly aren’t.

Page 14: Optimizing array-based data structures to the limit

Linear hash
Indexing case #3

def any_lhash_op(key[, ...]):
    i = hash(key) % table_size
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        i = (i + 1) % table_size

First access is random, then sequential.

Table size is a power of 2, therefore bitwise masking & (table_size - 1) is used instead of modulo.
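The pseudocode above could look roughly like this in Java. This is only a sketch with simple indexing, under several assumptions that are not in the slides: keys are non-zero ints, 0 marks an empty slot, the table never grows, and the hash function is an arbitrary placeholder. All names are hypothetical.

```java
final class LinearHashSketch {
    static final int EMPTY = 0;          // assumption: 0 is never a valid key
    private final int[] keys;
    private final int mask;              // table size - 1, table size is a power of 2
    private int size;

    LinearHashSketch(int capacity) {
        if (capacity < 2 || Integer.bitCount(capacity) != 1)
            throw new IllegalArgumentException("capacity must be a power of 2");
        keys = new int[capacity];
        mask = capacity - 1;
    }

    // Placeholder hash function (multiplicative mixing), not taken from the slides.
    private static int hash(int key) {
        int h = key * 0x9E3779B9;
        return h ^ (h >>> 16);
    }

    boolean add(int key) {
        if (key == EMPTY) throw new IllegalArgumentException("0 is reserved");
        int i = hash(key) & mask;        // masking instead of modulo; random access
        while (true) {
            int k = keys[i];
            if (k == EMPTY) {
                // Keep one slot empty so that probing always terminates.
                if (size == mask) throw new IllegalStateException("table is full");
                keys[i] = key;
                size++;
                return true;
            }
            if (k == key) return false;
            i = (i + 1) & mask;          // next probe; sequential access
        }
    }

    boolean contains(int key) {
        if (key == EMPTY) return false;
        int i = hash(key) & mask;
        while (true) {
            int k = keys[i];
            if (k == EMPTY) return false;
            if (k == key) return true;
            i = (i + 1) & mask;
        }
    }
}
```

Every probe is an array read at a data-dependent index, which is where an uneliminated bounds check would hurt.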

Page 15: Optimizing array-based data structures to the limit

Quadratic hash
Indexing case #3

def any_qhash_op(key[, ...]):
    i = hash(key) % table_size
    step = 0
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        step += 1
        i = (i + step) % table_size

Random, then local, then non-local access.

Two-way modification of this algorithm is tested, in which table size isn’t a power of 2: one integral division per op.

Page 16: Optimizing array-based data structures to the limit

Double hash
Indexing case #3

def any_dhash_op(key[, ...]):
    i = hash(key) % table_size
    step = hash2(key)
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        i = (i + step) % table_size

Random access.

Table size isn’t a power of 2, one or two (on collisions) integral divisions per op.

Page 17: Optimizing array-based data structures to the limit

Composite hash benchmark
Indexing case #3

load factor         0.3            0.6            0.9

L.  SB          1.9 ± 1.0      1.7 ± 1.0     −2.1 ± 1.1
    HW          5.5 ± 1.3      4.9 ± 0.7      4.3 ± 0.9
    K10        10.3 ± 0.5      8.2 ± 0.2      1.6 ± 0.7
Q.  SB         −0.2 ± 1.9     −2.0 ± 1.8     −0.9 ± 1.9
    HW          2.3 ± 2.3      2.7 ± 1.4     −0.3 ± 1.5
    K10         1.6 ± 0.5     −0.5 ± 0.2     −5.6 ± 0.3
D.  SB        −11.5 ± 2.5    −15.2 ± 1.1    −23.7 ± 1.3
    HW         −9.9 ± 2.3    −13.5 ± 1.2    −26.2 ± 1.0
    K10        −4.3 ± 0.2     −9.4 ± 0.1    −17.6 ± 0.4

Relative difference of unsafe indexing time to simple, in percent.

Page 18: Optimizing array-based data structures to the limit

Indexing: bottom line

Unsafe indexing is worth considering in the hottest methods. Tried to avoid this, but: "measure don’t guess".

Was not investigated:
- Performance of unsafe indexing on 32-bit VMs and CPUs, all results should be rechecked.
- Interference of unsafe indexing with loop unrolling and vectorization.

Page 19: Optimizing array-based data structures to the limit

Section 2

Encoding of distinct entry states

Page 20: Optimizing array-based data structures to the limit

Use-cases of entry states

"Full" state + data, or "empty" state:
- Open hash table implementations (taken/empty slots)
- "Nullable" non-object data in the subject domain
- Lists or queues with "half-lazy" in-place filtering

Collections of tuples of primitive/object and boolean (or binary state).

Page 21: Optimizing array-based data structures to the limit

Object data

Obvious: null in slots of "empty" state, domain objects in "full" slots.

But what if domain objects are nullable themselves?

Page 22: Optimizing array-based data structures to the limit

What if nullable Object data?

Special "empty" object

static final Object EMPTY_SLOT = new Object();

Domain nulls - as is.

Masking domain nulls

static final Object NULL_MASK = new Object();
...
Object maskedData = data != null ? data : NULL_MASK;

null in slots of "empty" state.
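A minimal sketch of the second option (masking domain nulls), assuming a plain Object[] as the slot array. The helper names store, load and isEmpty are hypothetical and only illustrate where masking and unmasking happen.

```java
final class NullMaskSketch {
    // Stands in for a domain null inside the slot array.
    static final Object NULL_MASK = new Object();

    // A slot holding Java null is in the "empty" state.
    static boolean isEmpty(Object[] slots, int i) {
        return slots[i] == null;
    }

    // Mask on the way in: a domain null is stored as NULL_MASK.
    static void store(Object[] slots, int i, Object data) {
        slots[i] = data != null ? data : NULL_MASK;
    }

    // Unmask on the way out; must be called only for non-empty slots.
    static Object load(Object[] slots, int i) {
        Object stored = slots[i];
        return stored != NULL_MASK ? stored : null;
    }
}
```

A freshly allocated Object[] is already all "empty", and the frequent emptiness check is a comparison against null rather than against a special object.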

Page 23: Optimizing array-based data structures to the limit

What if nullable Object data?

The rule: null should be stored in memory, or compared to other objects, more frequently than the special object. Often the right option for both goals is the same.

Page 24: Optimizing array-based data structures to the limit

Why store nulls
Nullable Object data + states

Don’t forget about the amortized costs of storing Objects rather than nulls: at least one extra dereference and check per location during garbage collection.

The array doesn’t need to be filled with nulls after initialization.

Page 25: Optimizing array-based data structures to the limit

Why compare to null
Nullable Object data + states

Explicit null checks are almost always costless, merged with VM-generated ones (to throw NPE).

In the remaining cases comparison to null is still cheaper than to the special object, because

- null doesn’t have to be read from anywhere in advance
- Checks against zero are "featured" on x86

Page 26: Optimizing array-based data structures to the limit

And what if nullable Object data?

In hash tables, the domain null (at most one!) should be masked, and empty slots should be filled with nulls. But the implementation is harder than with a special "empty" object.

Got it right: java.util.IdentityHashMap.

Got it wrong: almost all other open hash implementations.

Page 27: Optimizing array-based data structures to the limit

Primitive data

No natural way to express "nullability". Even no natural word :)

Arrays of boxed primitives

Page 28: Optimizing array-based data structures to the limit

Separate byte state
Primitive data + states

boolean[] or byte[] and data arrays in parallel:

if (used[i])
    doSomething(data[i]);

The easiest to implement.

Page 29: Optimizing array-based data structures to the limit

Separate bit state
Primitive data + states

Hand-written bit set and data arrays in parallel:

long word = bitWords[i >> 6];
if ((word & (1L << i)) != 0)
    doSomething(data[i]);

Page 30: Optimizing array-based data structures to the limit

Advantages of separate bit state
Primitive data + states

Almost no additional memory is used.

Sequential state checks often don’t require memory reads (until the word is exhausted).

Iteration could employ the very cheap numberOfLeading(Trailing)Zeros intrinsic.
Intel: Haswell+
AMD: Leading on K10+, Trailing on Piledriver+
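A sketch of iteration over bit states with Long.numberOfTrailingZeros, assuming long data and one state bit per slot. The class and its methods are hypothetical. Each 64-bit word is read once, and its set bits are visited without further reads of the state array.

```java
import java.util.function.LongConsumer;

final class BitStateSketch {
    final long[] bitWords;   // one bit per slot: 1 = "full", 0 = "empty"
    final long[] data;

    BitStateSketch(int capacity) {
        bitWords = new long[(capacity + 63) >>> 6];
        data = new long[capacity];
    }

    void put(int i, long value) {
        data[i] = value;
        // A long shift uses only the low 6 bits of i, so no explicit "& 63" is needed.
        bitWords[i >> 6] |= 1L << i;
    }

    boolean isFull(int i) {
        return (bitWords[i >> 6] & (1L << i)) != 0;
    }

    // Visits the data of every "full" slot in ascending index order.
    void forEach(LongConsumer action) {
        for (int w = 0; w < bitWords.length; w++) {
            long word = bitWords[w];
            while (word != 0) {
                int bit = Long.numberOfTrailingZeros(word);
                action.accept(data[(w << 6) + bit]);
                word &= word - 1;   // clear the lowest set bit
            }
        }
    }
}
```

Empty words are skipped with a single comparison, which is why sparse tables iterate quickly with this encoding.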

Page 31: Optimizing array-based data structures to the limit

Disadvantages of separate bit state
Primitive data + states

Only for binary state.

On pure random access, no advantage over byte states except memory usage, just perform extra work to extract bits.

Relatively tricky to implement. (java.util.BitSet: no way.)

Page 32: Optimizing array-based data structures to the limit

Special value as a state
Primitive data + states

long d = data[i];
if (d != EMPTY)
    doSomething(d);

Suitable only when there is a "full" state and one or several "empty" states.

Page 33: Optimizing array-based data structures to the limit

Advantages of special values
Primitive data + states

Zero memory overhead.

All entry data could reside in a single memory location:

- Fewer memory reads are required
- Cache-friendly
- Possibility of atomic updates
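To illustrate the last point: when state and data share one memory location, a slot can be claimed with a single compare-and-set. The sketch below uses java.util.concurrent.atomic.AtomicLongArray and assumes 0 is outside the data domain; the class and method names are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class AtomicSlotSketch {
    static final long EMPTY = 0L;        // assumption: 0 is never a valid value
    final AtomicLongArray slots;

    AtomicSlotSketch(int capacity) {
        slots = new AtomicLongArray(capacity);   // all slots start as EMPTY (0)
    }

    // Atomically moves slot i from "empty" to "full" with the given value.
    // Returns false if the slot was already taken.
    boolean tryClaim(int i, long value) {
        if (value == EMPTY) throw new IllegalArgumentException("0 is reserved");
        return slots.compareAndSet(i, EMPTY, value);
    }

    boolean isFull(int i) {
        return slots.get(i) != EMPTY;
    }
}
```

With a separate state array the same transition would touch two locations and could not be made atomic without extra synchronization.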

Page 34: Optimizing array-based data structures to the limit

Special value management
Primitive data + states

When the data domain is bounded, special values are a clear winner for encoding states: just pick a constant outside the data domain, preferably 0, as the special value.

Page 35: Optimizing array-based data structures to the limit

Special value management
Primitive data + states

However, if the data domain is unbounded, a number of disadvantages of special values as states appear:

- The special value must be stored within the data structure and read on each query.
- Comparison to a non-constant is slower, especially compared to comparison to zero.
- On collision, the special value must be replaced, which is impossible without locking if the data structure must be thread-safe, or if it is "offline" in any sense.
- The implementation becomes more complicated.

Page 36: Optimizing array-based data structures to the limit

Zero value as a state
Primitive data + states

An attempt to resolve one of the problems of dynamic special values: data is compared to zero, and when zero is passed as data itself, it is masked with another value:

if (data == zeroMask) changeZeroMask();
data = data == 0 ? zeroMask : data;
...
long d = data[i];
if (d != 0)
    doSomething(d);

But now data must be masked/unmasked all the time, and the implementation gets even more complicated.

Page 37: Optimizing array-based data structures to the limit

Byte along state
Primitive data + states

Like separate byte state, but more memory-local:

On the other hand:
- Only unsafe access (see section 1)
- Tiring to implement
- Cross cache line boundary memory IO, which 1) has a penalty on many CPUs, 2) is not atomic: out-of-thin-air values could appear unless the data structure is synchronized or IO is performed only via CAS ops (Nitsan Wakart).

Page 38: Optimizing array-based data structures to the limit

Benchmarking LHash queries, random queries
Primitive data + states

All the hash data is in L1:
- Load factors 0.3-0.6: typically byte states win
- Load factor 0.9: bit states win, sometimes special values

Big hashes (don’t fit caches):
- Successful queries: special values win
- Unsuccessful queries, including insertions: bit states win
- Byte along states outperform simple byte states

Zero states (with replacement) is never an option.

Page 39: Optimizing array-based data structures to the limit

Benchmarking LHash queries, iteration
Primitive data + states

Internal iteration (forEach): special values win.

External iteration (iterators): byte states win.

But on Haswell and K10, of course, bit states beat them all.

Byte along states and zero states with replacement always lose.

Page 40: Optimizing array-based data structures to the limit

Encoding of distinct entry states: bottom line

Object[] arrays: more nulls.

Primitive arrays: special values as states, when applicable. Bit states for iteration on Haswell+ and K10+.

Page 41: Optimizing array-based data structures to the limit

Section 3

Layout of tuples of primitives

Page 42: Optimizing array-based data structures to the limit

Layout of tuples of primitives

When random access is needed, we always strive for memory locality.

Page 43: Optimizing array-based data structures to the limit

Two fields of the same length
Layout of tuples of primitives

byte+byte, char+short, long+double (longBitsToDouble() is a no-op).

For up to 8 bytes, use arrays of the longer primitive, e.g. long[] for int+int tuples.

- Guarantees the tuple lies on the same cache line.
- Allows getting closer to the Java array size limits.
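A sketch of the int+int case stored in a long[], using plain shifts and masks rather than Unsafe. The class name and the choice of putting the first field in the high half are assumptions made for illustration.

```java
final class IntPairArraySketch {
    final long[] pairs;   // one long per (first, second) tuple

    IntPairArraySketch(int size) {
        pairs = new long[size];
    }

    void set(int i, int first, int second) {
        // High 32 bits: first. Low 32 bits: second, masked to avoid sign extension.
        pairs[i] = ((long) first << 32) | (second & 0xFFFFFFFFL);
    }

    int first(int i) {
        return (int) (pairs[i] >>> 32);
    }

    int second(int i) {
        return (int) pairs[i];
    }
}
```

Both fields of a tuple are fetched with a single 8-byte read, and each tuple occupies one array element instead of two.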

Page 44: Optimizing array-based data structures to the limit

One field is two times longer than another
Layout of tuples of primitives

byte+short, int+double, ...

If cross cache line boundary IO is not an option, use the following layout:

Requires accessing individual fields via Unsafe.

Page 45: Optimizing array-based data structures to the limit

One field is 4-8 times longer than another
Layout of tuples of primitives

If cross cache line boundary IO is not an option, the only reasonable approach is:

k1, long, 8 bytes
k2, long, 8 bytes
k3, long, 8 bytes
v1, short, 2 bytes
v2, short, 2 bytes
v3, short, 2 bytes
2 bytes gap
k4, long, 8 bytes
...

Page 46: Optimizing array-based data structures to the limit

One field is 4-8 times longer than another
Layout of tuples of primitives

Fields of the same tuple will anyway lie on different cache lines with some probability.

Indexing:

long kOff = (i / 3) * 32L + (i % 3) * 8;
long vOff = (i / 3) * 32L + 24 + (i % 3) * 2;

Integral division :(

Page 47: Optimizing array-based data structures to the limit

Integral division by small constant

- Maybe this will help? (see Hacker’s Delight)

long quot = (i * 0x55555556L) >>> 32;
long rem = i - quot * 3;
long kOff = quot * 32 + rem * 8;
long vOff = quot * 32 + 24 + rem * 2;

- No, it won’t, because we need to obtain the remainder as well as the quotient.
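A small sketch that checks the multiply-and-shift quotient against ordinary division and derives both offsets for the long+short layout listed earlier (32-byte blocks: three 8-byte keys, then three 2-byte values starting at byte 24, then a 2-byte gap). It assumes a non-negative int index; the class and method names are hypothetical.

```java
final class DivideByThreeSketch {
    // Equals i / 3 for every non-negative int i (multiplication is done in long).
    static long quotientByThree(int i) {
        return (i * 0x55555556L) >>> 32;
    }

    // Offset of the 8-byte key of tuple i, relative to the start of the blocks.
    static long keyOffset(int i) {
        long quot = quotientByThree(i);
        long rem = i - quot * 3;
        return quot * 32 + rem * 8;
    }

    // Offset of the 2-byte value of tuple i: values start at byte 24 of each block.
    static long valueOffset(int i) {
        long quot = quotientByThree(i);
        long rem = i - quot * 3;
        return quot * 32 + 24 + rem * 2;
    }
}
```

For example, tuple 4 is the second tuple of the second block, so its key is at offset 40 and its value at offset 58.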

Page 48: Optimizing array-based data structures to the limit

The End