Optimizing array-based data structures to the limit


Roman Leventov

Higher Frequency Trading Ltd.


August 28, 2014

Overview

Indexing

Encoding of distinct entry states
- Object data
- Primitive data

Layout of tuples of primitives

Benchmarking environments

1. AMD K10 (2007), L1 cache: 128 KB, L2: 512 KB, L3: 6 MB
2. Intel Sandy Bridge (2011), L1: 64 KB, L2: 256 KB, L3: 20 MB
3. Intel Haswell (2013), L1: 64 KB, L2: 256 KB, L3: 3 MB

64-bit Java 1.8.0-b129–8u20
JMH ??–0.9.8

Unless specified otherwise, measurements are in CPU clock cycles per operation or loop iteration.

Section 1

Indexing

Indexing

Simple

int e = a[i];

vs.

Unsafe

long off = ((long) i) << INT_SCALE_SHIFT;

int e = U.getInt(a, INT_BASE + off);

Why ever use unsafe indexing?
HotSpot JIT doesn’t eliminate bounds checks as perfectly as you probably think.

Why ever use simple indexing?
In performance-critical code

Simple

; cmp r8d, ebx
; jae <IOOBE location>
mov r11, [r9 + r8*4 + 16]

Unsafe

mov r10, r8
shl r10, 2
mov r11, [r9 + r10 + 16]

%r9: a; %r8: i
16: INT_BASE, i.e. object header (12 bytes) + array length field (4 bytes)
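The constants used in the unsafe variant are not defined on the slides. A minimal sketch of how they could be obtained, assuming sun.misc.Unsafe is reachable through reflection (it is an unsupported API and is increasingly restricted on recent JDKs); the class name and the getSimple/getUnsafe helpers are hypothetical:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

final class UnsafeIndexingSketch {
    static final Unsafe U;
    static final long INT_BASE;        // offset of element 0 inside an int[]
    static final int INT_SCALE_SHIFT;  // log2 of the element size

    static {
        try {
            // Assumption: the singleton is stored in the private field "theUnsafe".
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            U = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
        INT_BASE = U.arrayBaseOffset(int[].class);
        INT_SCALE_SHIFT = Integer.numberOfTrailingZeros(U.arrayIndexScale(int[].class));
    }

    // Bounds-checked access.
    static int getSimple(int[] a, int i) {
        return a[i];
    }

    // Unchecked access: the caller must guarantee 0 <= i < a.length.
    static int getUnsafe(int[] a, int i) {
        long off = ((long) i) << INT_SCALE_SHIFT;
        return U.getInt(a, INT_BASE + off);
    }
}
```

Both accessors return the same element for any valid index; only the unsafe one skips the range check, so an invalid index reads arbitrary memory instead of throwing.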

Iteration over parallel arrays
Indexing case #1

@Benchmark
public int _2_simple(State st) {
    int[] xs = st.xs, ys = st.ys;
    int dummy = 0;
    for (int i = xs.length; i --> 0;)
        dummy ^= xs[i] + ys[i];
    return dummy;
}

Bound checks are fully eliminated!


@Benchmark
public int _2_unsafe(State st) {
    int[] xs = st.xs, ys = st.ys;
    int dummy = 0;
    // Widen before multiplying so the byte offset cannot overflow int.
    long off = (long) xs.length * INT_SCALE;
    while ((off -= INT_SCALE) >= 0)
        dummy ^= U.getInt(xs, INT_BASE + off) +
                 U.getInt(ys, INT_BASE + off);
    return dummy;
}


# of arrays       1      2      3      4

SB   Simple    0.78    1.3    2.2    3.4
     Unsafe     1.6    1.8    2.5    3.2
HW   Simple     1.2    2.1    3.3    4.9
     Unsafe     2.1    2.6    3.2    4.3
K10  Simple     1.6    5.8   13.1   19.5
     Unsafe     2.9    6.4   11.8   17.1

(SB: Sandy Bridge, HW: Haswell)

Unsafe indexing is slower with a single array or 2-3 parallel arrays because of an odd instruction in the tight loop. A JIT compiler fault?

Binary heap
Indexing case #2

int leftChildI = parentI * 2 + 1;
int rightChildI = leftChildI + 1;

long leftChildOff = parentOff * 2 + INT_SCALE;
long rightChildOff = leftChildOff + INT_SCALE;

Binary heap sort
Indexing case #2

The heapsort version with unsafe indexing is faster by 12–13% on a 4 KB array and by 7–10% on a 4 MB array.

With simple indexing, lower bounds checks are eliminated, but upper bounds checks mostly aren’t.
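The benchmarked heapsort itself is not shown on the slides. A minimal sketch of a heapsort with simple indexing, built on the child index arithmetic above, could look as follows (class and method names are hypothetical). The child indexes are computed from parentI rather than from a loop counter, which is the kind of access where an upper bounds check is hard for the JIT to prove redundant.

```java
final class HeapSortSketch {
    // Restores the max-heap property for the subtree rooted at parentI,
    // looking only at the first `size` elements of `a`.
    static void siftDown(int[] a, int parentI, int size) {
        int v = a[parentI];
        while (true) {
            int leftChildI = parentI * 2 + 1;
            if (leftChildI >= size) break;
            int rightChildI = leftChildI + 1;
            int maxChildI = leftChildI;
            if (rightChildI < size && a[rightChildI] > a[leftChildI])
                maxChildI = rightChildI;
            if (a[maxChildI] <= v) break;
            a[parentI] = a[maxChildI];   // move the larger child up
            parentI = maxChildI;
        }
        a[parentI] = v;
    }

    static void heapSort(int[] a) {
        int n = a.length;
        // Build the heap bottom-up.
        for (int i = n / 2 - 1; i >= 0; i--) siftDown(a, i, n);
        // Repeatedly move the maximum to the end of the shrinking heap.
        for (int end = n - 1; end > 0; end--) {
            int t = a[0]; a[0] = a[end]; a[end] = t;
            siftDown(a, 0, end);
        }
    }
}
```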

Linear hash
Indexing case #3

def any_lhash_op(key[, ...]):
    i = hash(key) % table_size
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        i = (i + 1) % table_size

First access is random, then sequential.

Table size is a power of 2, therefore bitwise masking & (table_size - 1) is used instead of modulo.
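The pseudocode above, with masking in place of modulo, can be sketched in Java roughly like this. The class name, the multiplicative hash function, and the use of 0 as the "empty" marker are assumptions made for the illustration; there is no resizing, so the caller has to keep at least one slot empty.

```java
final class LinearHashSketch {
    static final int EMPTY = 0;   // assumption: 0 never occurs as a key
    final int[] table;
    final int mask;               // table.length - 1, valid because length is a power of 2

    LinearHashSketch(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1)
            throw new IllegalArgumentException("capacity must be a power of 2");
        table = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
    }

    // Hypothetical hash function, chosen only for the sketch.
    static int hash(int key) {
        return key * 0x9E3779B9;
    }

    boolean add(int key) {
        int i = hash(key) & mask;          // random first access
        while (true) {
            int k = table[i];
            if (k == EMPTY) { table[i] = key; return true; }
            if (k == key) return false;
            i = (i + 1) & mask;            // sequential probing, wraps around
        }
    }

    boolean contains(int key) {
        int i = hash(key) & mask;
        while (true) {
            int k = table[i];
            if (k == EMPTY) return false;
            if (k == key) return true;
            i = (i + 1) & mask;
        }
    }
}
```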

Quadratic hash
Indexing case #3

def any_qhash_op(key[, ...]):
    i = hash(key) % table_size
    step = 0
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        step += 1
        i = (i + step) % table_size

Random, then local, then non-local access.

A two-way modification of this algorithm is tested, in which the table size isn’t a power of 2: one integer division per operation.

Double hash
Indexing case #3

def any_dhash_op(key[, ...]):
    i = hash(key) % table_size
    step = hash2(key)
    while True:
        if is_empty_slot(i): ...
        if key_at(i) == key: ...
        i = (i + step) % table_size

Random access.

Table size isn’t a power of 2: one or two (on collisions) integer divisions per operation.
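One way the "one or two divisions" count can come about is sketched below: the first division picks the initial slot, the second one computes the step and is only paid after a collision, and the wrap-around inside the probe loop uses a compare and subtract instead of a modulo. The class name, the hash functions, and 0 as the "empty" marker are assumptions for the illustration; the capacity is assumed to be a prime greater than 2, and the table must never become full.

```java
final class DoubleHashSketch {
    static final int EMPTY = 0;   // assumption: 0 never occurs as a key
    final int[] table;

    DoubleHashSketch(int primeCapacity) {
        table = new int[primeCapacity];
    }

    // Returns the slot that holds `key`, or the empty slot where it would be inserted.
    int findSlot(int key) {
        int n = table.length;
        int h = key & 0x7FFFFFFF;        // hypothetical hash: non-negative key bits
        int i = h % n;                   // division #1
        int k = table[i];
        if (k == EMPTY || k == key) return i;
        int step = 1 + h % (n - 1);      // division #2, only on collision; 1 <= step < n
        while (true) {
            i += step;
            if (i >= n) i -= n;          // wrap around without a division
            k = table[i];
            if (k == EMPTY || k == key) return i;
        }
    }

    boolean add(int key) {
        int i = findSlot(key);
        if (table[i] == key) return false;
        table[i] = key;
        return true;
    }

    boolean contains(int key) {
        return table[findSlot(key)] == key;
    }
}
```

With a prime capacity, every step between 1 and capacity - 1 visits all slots before repeating, so a lookup terminates as long as one slot is empty.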

Composite hash benchmark
Indexing case #3

load factor        0.3            0.6            0.9

L.  SB         1.9 ± 1.0      1.7 ± 1.0     −2.1 ± 1.1
    HW         5.5 ± 1.3      4.9 ± 0.7      4.3 ± 0.9
    K10       10.3 ± 0.5      8.2 ± 0.2      1.6 ± 0.7
Q.  SB        −0.2 ± 1.9     −2.0 ± 1.8     −0.9 ± 1.9
    HW         2.3 ± 2.3      2.7 ± 1.4     −0.3 ± 1.5
    K10        1.6 ± 0.5     −0.5 ± 0.2     −5.6 ± 0.3
D.  SB       −11.5 ± 2.5    −15.2 ± 1.1    −23.7 ± 1.3
    HW        −9.9 ± 2.3    −13.5 ± 1.2    −26.2 ± 1.0
    K10       −4.3 ± 0.2     −9.4 ± 0.1    −17.6 ± 0.4

Relative difference of unsafe indexing time to simple, in percent (L.: linear, Q.: quadratic, D.: double hash).

Indexing: bottom line

Unsafe indexing is worth considering in the hottest methods. Tried to avoid this, but: "measure, don’t guess".

Was not investigated:
- Performance of unsafe indexing on 32-bit VMs and CPUs; all results should be rechecked.
- Interference of unsafe indexing with loop unrolling and vectorization.

Section 2

Encoding of distinct entry states

Use-cases of entry states

"Full" state + data, or "empty" state:
- Open hash table implementations (taken/empty slots)
- "Nullable" non-object data in the subject domain
- Lists or queues with "half-lazy" in-place filtering

Collections of tuples of primitive/object and boolean (or binary state).

Object data

Obvious: null in slots of "empty" state, domain objects in "full" slots.

But what if domain objects are nullable themselves?

What if nullable Object data?

Special "empty" object

static final Object EMPTY_SLOT = new Object();

Domain nulls: as is.

Masking domain nulls

static final Object NULL_MASK = new Object();
...
Object maskedData = data != null ? data : NULL_MASK;

null in slots of "empty" state.
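The masking variant can be sketched as a pair of mask/unmask steps around a plain Object[] array. This is only an illustration: the class name and the put/get/isEmpty/clear helpers are hypothetical, and only NULL_MASK comes from the slide.

```java
final class NullMaskingSlotsSketch {
    static final Object NULL_MASK = new Object();

    // A fresh Object[] is already all nulls, i.e. all slots start "empty".
    final Object[] slots;

    NullMaskingSlotsSketch(int size) {
        slots = new Object[size];
    }

    // Stores domain data; a domain null is replaced by the mask object.
    void put(int i, Object data) {
        slots[i] = data != null ? data : NULL_MASK;
    }

    // The "empty" check is a plain comparison to null.
    boolean isEmpty(int i) {
        return slots[i] == null;
    }

    // Unmasks on the way out; meaningful only when !isEmpty(i).
    Object get(int i) {
        Object stored = slots[i];
        return stored != NULL_MASK ? stored : null;
    }

    void clear(int i) {
        slots[i] = null;
    }
}
```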

What if nullable Object data?

The rule: null should be stored in memory, or compared to other objects, more frequently than the special object. Often the right option for both goals is the same.

Why store nulls
Nullable Object data + states

Don’t forget the amortized cost of storing objects rather than nulls: at least one extra dereference and check per location during garbage collection.

The array doesn’t have to be filled with nulls after initialization.

Why compare to null
Nullable Object data + states

Explicit null checks are almost always free: they are merged with the VM-generated ones (those that throw NPE).

In the remaining cases, comparison to null is still cheaper than comparison to the special object, because
- null doesn’t have to be read from anywhere in advance
- Checks against zero are "featured" on x86

And what if nullable Object data?

In hash tables, the domain null (at most one!) should be masked, and empty slots should be filled with nulls. But the implementation is harder than with a special "empty" object.

Got it right: java.util.IdentityHashMap.

Got it wrong: almost all other open hash implementations.

Primitive data

No natural way to express "nullability". Not even a natural word :)

Arrays of boxed primitives

Separate byte state
Primitive data + states

boolean[] or byte[] and data arrays in parallel:

if (used[i])
    doSomething(data[i]);

The easiest to implement.

Separate bit state
Primitive data + states

Hand-written bit set and data arrays in parallel:

long word = bitWords[i >> 6];
// 1L, not 1: an int shift would wrap at 32 bits instead of 64
if ((word & (1L << i)) != 0)
    doSomething(data[i]);

Advantages of separate bit state
Primitive data + states

Almost no additional memory is used.

Sequential state checks often don’t require memory reads (until the word is exhausted).

Iteration can employ the very cheap numberOfLeadingZeros/numberOfTrailingZeros intrinsics.
Intel: Haswell+
AMD: Leading: K10+, Trailing: Piledriver+
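Iteration driven by the trailing-zeros intrinsic can be sketched like this: each 64-bit word is read once, and the positions of the "full" entries are extracted without touching memory again until the word is exhausted. The class and method names are hypothetical.

```java
import java.util.function.LongConsumer;

final class BitStateSketch {
    static boolean isFull(long[] bitWords, int i) {
        // Java masks a long shift distance to its low 6 bits, so (1L << i) means (1L << (i & 63)).
        return (bitWords[i >> 6] & (1L << i)) != 0;
    }

    static void setFull(long[] bitWords, int i) {
        bitWords[i >> 6] |= 1L << i;
    }

    // Calls `action` for the data of every entry whose state bit is set.
    static void forEachFull(long[] bitWords, long[] data, LongConsumer action) {
        for (int w = 0; w < bitWords.length; w++) {
            long word = bitWords[w];                          // one memory read per 64 entries
            while (word != 0) {
                int bit = Long.numberOfTrailingZeros(word);   // index of the lowest set bit
                action.accept(data[(w << 6) + bit]);
                word &= word - 1;                             // clear the lowest set bit
            }
        }
    }
}
```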

Disadvantages of separate bit state
Primitive data + states

Only for binary state.

On purely random access there is no advantage over byte states except memory usage: extracting the bits is just extra work.

Relatively tricky to implement (java.util.BitSet is not an option).

Special value as a state
Primitive data + states

long d = data[i];
if (d != EMPTY)
    doSomething(d);

Suitable only when there is a "full" state and one or several "empty" states.

Advantages of special values
Primitive data + states

Zero memory overhead.

All entry data can reside in a single memory location:
- Fewer memory reads are required
- Cache-friendly
- Possibility of atomic updates

Special value management
Primitive data + states

When the data domain is bounded, special values are a clear winner for encoding states: just pick a constant outside the data domain, preferably 0, as the special value.

Special value management
Primitive data + states

However, if the data domain is unbounded, special values as states have a number of disadvantages:
- The special value has to be stored within the data structure and read on each query.
- Comparison to a non-constant is slower, especially compared to comparison to zero.
- On collision, the special value has to be replaced, which is impossible without locking if the data structure has to be thread-safe, or if it is "offline" in any sense.
- The implementation becomes more complicated.

Zero value as a state
Primitive data + states

An attempt to resolve one of the problems of dynamic special values: data is compared to zero, and when zero is passed as data itself, it is masked with another value:

if (data == zeroMask) changeZeroMask();
data = data == 0 ? zeroMask : data;
...
long d = data[i];
if (d != 0)
    doSomething(d);

But now data has to be masked/unmasked all the time, and the implementation gets even more complicated.

Byte along state
Primitive data + states

Like separate byte state, but more memory-local.

On the other hand:
- Only unsafe access (see section 1)
- Tiring to implement
- Memory IO across cache line boundaries, which 1) has a penalty on many CPUs, 2) is not atomic: out-of-thin-air values could appear unless the data structure is synchronized or IO is performed only via CAS ops (Nitsan Wakart).

Benchmarking LHash queries, random queries
Primitive data + states

All the hash data is in L1:
- Load factors 0.3-0.6: typically byte states win
- Load factor 0.9: bit states win, sometimes special values

Big hashes (don’t fit caches):
- Successful queries: special values win
- Unsuccessful queries, including insertions: bit states win
- Byte along states outperform simple byte states

Zero states (with replacement) are never an option.

Benchmarking LHash queries, iteration
Primitive data + states

Internal iteration (forEach): special values win.

External iteration (iterators): byte states win.

But on Haswell and K10, of course, bit states beat them all.

Byte along states and zero states with replacement always lose.

Encoding of distinct entry states: bottom line

Object[] arrays: more nulls.

Primitive arrays: special values as states, when applicable. Bit states for iteration on Haswell+ and K10+.

Section 3

Layout of tuples of primitives

Layout of tuples of primitives

When random access is needed, we always strive for memory locality.

Two fields of the same length
Layout of tuples of primitives

byte+byte, char+short, long+double (longBitsToDouble() is a no-op).

For tuples of up to 8 bytes, use arrays of a wider primitive, e.g. long[] for int+int tuples.
- Guarantees that the tuple lies on a single cache line.
- Allows getting closer to the Java array size limits.
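For the int+int case, the packing into long[] can be sketched as follows. The class name and the choice of putting the first field into the high half are assumptions for the illustration.

```java
final class IntPairArraySketch {
    final long[] pairs;   // one long per (first, second) tuple

    IntPairArraySketch(int size) {
        pairs = new long[size];
    }

    void set(int i, int first, int second) {
        // Mask the low half so that a negative `second` does not sign-extend into `first`.
        pairs[i] = ((long) first << 32) | (second & 0xFFFFFFFFL);
    }

    int first(int i) {
        return (int) (pairs[i] >>> 32);
    }

    int second(int i) {
        return (int) pairs[i];
    }
}
```

Both fields of a tuple are read or written with a single aligned 8-byte access, so they cannot straddle a cache line.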

One field is two times longer than another
Layout of tuples of primitives

byte+short, int+double, ...

If IO across cache line boundaries is not an option, use the following layout:

Requires accessing individual fields via Unsafe.

One field is 4-8 times longer than another
Layout of tuples of primitives

If IO across cache line boundaries is not an option, the only reasonable approach is:

k1, long, 8 bytes
k2, long, 8 bytes
k3, long, 8 bytes
v1, short, 2 bytes
v2, short, 2 bytes
v3, short, 2 bytes
2 bytes gap
k4, long, 8 bytes
...

One field is 4-8 times longer than another
Layout of tuples of primitives

Fields of the same tuple will lie on different cache lines with some probability anyway.

Indexing:

long kOff = (i / 3) * 32L + (i % 3) * 8;
// Values start 24 bytes into the 32-byte block and are 2 bytes apart.
long vOff = (i / 3) * 32L + 24 + (i % 3) * 2;

Integer division :(

Integer division by a small constant

- Maybe this will help? (see Hacker’s Delight)

long quot = (i * 0x55555556L) >>> 32;
long rem = i - quot * 3;
long kOff = quot * 32 + rem * 8;
long vOff = quot * 32 + 24 + rem * 2;

- No, it won’t, because we need to obtain the remainder as well as the quotient.
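A small sketch that computes the offsets both ways and can be used to check that the multiply-and-shift form agrees with plain division for non-negative indexes. The class and method names are hypothetical, and the value offset (24 bytes into the block plus 2 bytes per value) is an assumption derived from the block layout listed earlier: three 8-byte keys, three 2-byte values, and a 2-byte gap per 32-byte block.

```java
final class TripleBlockLayoutSketch {
    // Offsets relative to the start of the data, using plain division.
    static long keyOffset(int i) {
        return (i / 3) * 32L + (i % 3) * 8;
    }

    static long valueOffset(int i) {
        return (i / 3) * 32L + 24 + (i % 3) * 2;
    }

    // Same key offset via multiplication by a "magic" constant; valid for i >= 0.
    static long keyOffsetNoDiv(int i) {
        long quot = (i * 0x55555556L) >>> 32;   // floor(i / 3)
        long rem = i - quot * 3;                // i % 3
        return quot * 32 + rem * 8;
    }

    static long valueOffsetNoDiv(int i) {
        long quot = (i * 0x55555556L) >>> 32;
        long rem = i - quot * 3;
        return quot * 32 + 24 + rem * 2;
    }
}
```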

The End
