
Transcript of Five cool ways the JVM can run Apache Spark faster

Page 1: Five cool ways the JVM can run Apache Spark faster

© 2015 IBM Corporation

Five cool ways the JVM can run Apache Spark faster

Tim Ellison

IBM Runtime Technology Center, Hursley, UK

tellison

@tpellison

Page 2: Five cool ways the JVM can run Apache Spark faster

What's so cool about IBM's Java implementation?

[Diagram: IBM Java is built by the IBM Runtime Technology Centre from reference Java technology (OpenJDK, etc.), adding quality engineering, performance, security, reliability, scalability and serviceability, driven by production requirements, strategic software and hardware exploitation, and user feedback.]

A certified Java, optimized to run on a wide variety of hardware platforms

Performance tuned for key software workloads

World class service and support

http://www.ibm.com/developerworks/java/jdk/index.html

[Diagram: the Spark stack on IBM Java — the Spark APIs and the Scala-based Spark backend sit on the Java Platform APIs (AWT, Swing, Java2D; XML, Crypto, Serializer), the core libraries and the virtual machine, running on Intel, POWER, z Series and ARM hardware.]

Page 3: Five cool ways the JVM can run Apache Spark faster

Five cool things you and I can do with the JVM to enhance the performance of Apache Spark!

1) JIT compiler enhancements and writing JIT-friendly code

2) Improving the object serializer

3) Faster IO – networking and storage

4) Data access acceleration & auto SIMD

5) Offloading tasks to graphics co-processors (GPUs)

Page 4: Five cool ways the JVM can run Apache Spark faster

Five cool things you and I can do with the JVM to enhance the performance of Apache Spark!

1) JIT compiler enhancements and writing JIT-friendly code

2) Improving the object serializer

3) Faster IO – networking and storage

4) Data access acceleration & auto SIMD

5) Offloading tasks to graphics co-processors (GPUs)

input/output-bound jobs

Page 5: Five cool ways the JVM can run Apache Spark faster

Five cool things you and I can do with the JVM to enhance the performance of Apache Spark!

1) JIT compiler enhancements and writing JIT-friendly code

2) Improving the object serializer

3) Faster IO – networking and storage

4) Data access acceleration & auto SIMD

5) Offloading tasks to graphics co-processors (GPUs)

processing-bound jobs

Page 6: Five cool ways the JVM can run Apache Spark faster

JIT compiler enhancements, and writing JIT-friendly code

Page 7: Five cool ways the JVM can run Apache Spark faster

Help yourself. JNI calls are not free!

https://github.com/xerial/snappy-java/blob/develop/src/main/java/org/xerial/snappy/SnappyNative.cpp

Page 8: Five cool ways the JVM can run Apache Spark faster

Style: Using JNI has an impact...

The cost of calling from Java code to natives, and from natives to Java code, is significantly higher (maybe 5x longer) than a normal Java method call.

– The JIT can't inline native methods.

– The JIT can't do data flow analysis into JNI calls, e.g. it has to assume that all parameters are always used.

– The JIT has to set up the call stack and parameters for the C calling convention, i.e. maybe rearranging items on the stack.

JNI can introduce additional data copying costs

– There's no guarantee that you will get a direct pointer to the array / string with Get<type>ArrayElements(), even when using the GetPrimitiveArrayCritical versions.

– The IBM JVM will always return a copy (to allow GC to continue).

Tip:

– JNI natives are more expensive than plain Java calls.

– e.g. create an Unsafe-based Snappy-like package written in Java code so that the JNI cost is eliminated.
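For illustration, a minimal Java sketch (hypothetical class and method names) of the contrast described above: the native entry point is opaque to the JIT, while the pure-Java equivalent can be inlined and analysed together with its callers.

public class CompressorExample {
    // JNI entry point: the JIT cannot inline this, must marshal arguments for
    // the C calling convention, and on IBM Java the array arguments are copied.
    public static native int compressNative(byte[] src, byte[] dst);

    // Pure-Java equivalent: an ordinary method the JIT can inline and optimize
    // together with its callers (no copies, full data-flow analysis).
    public static int compressJava(byte[] src, byte[] dst) {
        int out = 0;
        for (byte b : src) {          // stand-in for real compression logic
            dst[out++] = b;
        }
        return out;
    }
}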

Page 9: Five cool ways the JVM can run Apache Spark faster

Style: Use JIT optimizations to reduce overhead of logging checks

Tip: check for a non-null value of a static field reference to an instance of a logging class singleton (a minimal sketch follows below).

– Uses the JIT's speculative optimization to avoid the explicit test for logging being enabled; instead it ...

1) Generates an internal JIT runtime assumption (e.g. InfoLogger.class is undefined),
2) NOPs the test for trace enablement,
3) Uses a class initialization hook for the InfoLogger.class (already necessary for instantiating the class),
4) The JIT will regenerate the test code if the class initialization event is fired.

Spark's logging calls are gated on checks of a static boolean value (see Spark's trait Logging).
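A minimal Java sketch of the described pattern (InfoLogger and Worker are hypothetical names, not Spark classes): the hot path tests a static singleton field for null; the JIT speculates that InfoLogger.class has never been initialized, so the field must still be null and the test can be NOPed away until that assumption is invalidated.

class InfoLogger {
    void info(String msg) { System.out.println(msg); }
}

class Worker {
    // Assigned only when logging is enabled, which forces InfoLogger to be initialized.
    static InfoLogger logger;

    static void enableLogging() { logger = new InfoLogger(); }

    void doWork() {
        if (logger != null) {        // with the "InfoLogger.class is undefined" assumption, this test is NOPed
            logger.info("doing work");
        }
        // ... hot path continues ...
    }
}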

Page 10: Five cool ways the JVM can run Apache Spark faster

Style: Judicious use of polymorphism

Spark has a number of highly polymorphic interface call sites and high fan-in (several calling contexts invoking the same callee method) in map, reduce, filter, flatMap, ...

– e.g. ExternalSorter.insertAll is very hot (drains an iterator using hasNext/next calls)

Pattern #1:

– InterruptibleIterator → Scala's mapIterator → Scala's filterIterator → …

Pattern #2:

– InterruptibleIterator → Scala's filterIterator → Scala's mapIterator → …

The JIT can only choose one pattern to inline!

– Makes JIT devirtualization and speculation more risky; using profiling information from a different context could lead to incorrect devirtualization.

– More conservative speculation, or good phase change detection and recovery, are needed in the JIT compiler to avoid getting it wrong.

Lambdas and functions as arguments, by definition, introduce different code flow targets

– Passing in widely implemented interfaces produces many different bytecode sequences

– When we inline we have to put runtime checks ahead of the inlined method bodies to make sure we are going to run the right method!

– Often specialized classes are used only in a very limited number of places, but the majority of the code does not use these classes and pays a heavy penalty

– e.g. Scala's attempt to specialize the Tuple2 Int argument does more harm than good!

Tip: Use polymorphism sparingly, use the same order / patterns for nested & wrapped code, and keep call sites homogeneous (a minimal sketch follows below).
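For illustration, a minimal Java sketch (hypothetical, not Spark code) of the problem described above: one drain loop is reached from many calling contexts with differently ordered iterator wrappers, so the JIT sees several receiver types at the same call site and cannot safely devirtualize and inline hasNext()/next().

import java.util.Iterator;

class Drainer {
    // Hot loop, similar in shape to ExternalSorter.insertAll draining its iterator.
    static long drain(Iterator<Integer> it) {
        long sum = 0;
        while (it.hasNext()) {   // megamorphic: map-, filter- and interruptible-style wrappers all seen here
            sum += it.next();
        }
        return sum;
    }
}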

Page 11: Five cool ways the JVM can run Apache Spark faster

Replacing the object serializer

Page 12: Five cool ways the JVM can run Apache Spark faster

Writing a Spark-friendly object serializer

Expand the list of specialist serialized forms

– Having custom write/read object methods allows for reduced time in reflection and smaller on-wire payloads.

– Types such as Tuple and Some are given special treatment in the serializer.

Reduce the payload size further using variable-length encoding of primitive types (a sketch follows after this list).

– All objects are eventually decomposed into primitives.

Adaptive stack-based recursive serialization vs. state machine serialization

– Use the stack to track state wherever possible, but fall back to a state machine for deeply nested objects (e.g. big RDDs).

Sharing object representations within the serialized stream to reduce the payload

– But this may be defeated if supportsRelocationOfSerializedObjects is required.

Special replacement of deserialization calls to avoid stack-walking to find the class loader context

– Optimization in the JIT to redirect some regular calls to more efficient versions.

Tip: These are opaque to the application; no special coding patterns are required.
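As a sketch of the variable-length encoding point above (a generic varint scheme shown for illustration, not necessarily this serializer's actual wire format), a small int value takes one byte on the wire instead of four:

// Write an int using 7 bits per byte with a continuation bit; returns the next free offset.
static int writeVarInt(byte[] buf, int offset, int value) {
    while ((value & ~0x7F) != 0) {
        buf[offset++] = (byte) ((value & 0x7F) | 0x80);  // low 7 bits plus a "more bytes follow" flag
        value >>>= 7;
    }
    buf[offset++] = (byte) value;                         // final byte, high bit clear
    return offset;
}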

Page 13: Five cool ways the JVM can run Apache Spark faster

Faster IO – networking and storage

Page 14: Five cool ways the JVM can run Apache Spark faster

Remote Direct Memory Access (RDMA) Networking

[Diagram: two Spark nodes, each a Spark VM with an on-heap buffer and an off-heap buffer, connected through RDMA NICs/HCAs and an Ethernet/InfiniBand switch; RDMA transfers go directly between application buffers (zero-copy, OS bypass) rather than being buffer-copied through the OS.]

Acronyms: Z-Copy – Zero Copy; B-Copy – Buffer Copy; IB – InfiniBand; Ether – Ethernet; NIC – Network Interface Card; HCA – Host Channel Adapter

● Low-latency, high-throughput networking
● Direct 'application to application' memory pointer exchange between remote hosts
● Off-loads network processing to the RDMA NIC/HCA – OS/kernel bypass (zero-copy)
● Introduces new IO characteristics that can influence the Spark transfer plan

Page 15: Five cool ways the JVM can run Apache Spark faster

Faster network IO with RDMA-enabled Spark

New dynamic transfer plan that adapts to the load and responsiveness of the remote hosts.

New “RDMA” shuffle IO mode with lower latency and higher throughput.

[Diagram: the layers of Spark network IO, from the high-level API down to block manipulation (i.e., RDD partitions), with each layer marked as either JVM-agnostic or IBM JVM only; the RDMA support is a JVM-agnostic working prototype.]

Page 16: Five cool ways the JVM can run Apache Spark faster

Faster storage with POWER CAPI/Flash

The POWER8 architecture offers a 40 TB Flash drive attached via the Coherent Accelerator Processor Interface (CAPI)

– Provides simple coherent block IO APIs
– No file system overhead

Power Service Layer (PSL)

– Performs address translations
– Maintains cache
– Simple, but powerful, interface to the Accelerator unit

Coherent Accelerator Processor Proxy (CAPP)

– Maintains a directory of cache lines held by the Accelerator
– Snoops the PowerBus on behalf of the Accelerator

Page 17: Five cool ways the JVM can run Apache Spark faster

Faster disk IO with CAPI/Flash-enabled Spark

When under memory pressure, Spark spills RDDs to disk.

– Happens in ExternalAppendOnlyMap and ExternalSorter

We have modified Spark to spill to the high-bandwidth, coherently attached Flash device instead.

– Replacement for DiskBlockManager
– A new FlashBlockManager handles spill to/from flash

Making this pluggable requires some further abstraction in Spark (a minimal interface sketch follows below):

– The spill code assumes it is using disks, and depends on DiskBlockManager
– We are spilling without using a file system layer

Dramatically improves the performance of executors under memory pressure.

Allows similar performance to be reached with much less memory (denser deployments).

[Photo: IBM FlashSystem 840 attached to POWER8 via CAPI]
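A minimal Java sketch of the kind of abstraction suggested by the "making this pluggable" point above (SpillTarget and its methods are hypothetical names, not Spark's actual API): a spill destination that both a disk-backed and a flash-backed block manager could implement, so the spill path no longer assumes a file system.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

interface SpillTarget {
    OutputStream newSpillStream(String blockId) throws IOException;  // where a spilled block is written
    InputStream  readSpill(String blockId) throws IOException;       // read it back during merge
}

// A DiskBlockManager-style implementation would back this with files;
// a FlashBlockManager-style implementation would back it with CAPI block IO.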

Page 18: Five cool ways the JVM can run Apache Spark faster

Data Access Acceleration & auto SIMD

Page 19: Five cool ways the JVM can run Apache Spark faster

Data Access Accelerator (DAA)

Provides JIT-recognized operations directly to/from byte arrays.

– Think of it as the org.apache.spark.unsafe.Platform APIs on steroids!

Optimized data conversion, arithmetic, etc. on native-format data records and types.

– Avoids object creation, initialization, data copying, abstraction, boxing, etc.

– Range of operations on packed decimals; BigDecimal and BigInteger conversions from and to external decimal and Unicode decimal types; marshalling and unmarshalling primitive types (short, int, long, float, and double) to and from byte arrays

– e.g. see the worked example on the next page.

Benefits:

Exposes hardware acceleration in a platform- and JVM-neutral manner (2 – 100x speed-up)

Can provide significant speed-up to record parsing, data marshalling and inter-language communication

Tip: Don't be limited to the current set of sun.misc.Unsafe APIs. Use the existing DAA APIs, or write equivalent code in places in Spark that the JIT can recognize.

Page 20: Five cool ways the JVM can run Apache Spark faster

Data Access Accelerator – example

Problem: there is no intrinsic representation of a “packed decimal” in Java, e.g. when you wish to add two packed decimal numbers.

Solution: trivially add two packed decimals by converting to BigDecimal:

Step 1: convert from byte[] to BigDecimal
Step 2: add the BigDecimals together
Step 3: convert the BigDecimal result back to byte[]

IBM Java provides a helper method to do this operation for you:

Old Approach:

byte[] addPacked(byte[] a, byte[] b) {
    BigDecimal a_bd = convertPackedToBd(a);
    BigDecimal b_bd = convertPackedToBd(b);
    BigDecimal sum = a_bd.add(b_bd);      // BigDecimal is immutable, so keep the result
    return convertBDtoPacked(sum);
}

BigDecimal convertPackedToBd(byte[] myBytes) {
    … ~30 lines …
}

New Approach:

byte[] addPacked(byte[] a, byte[] b) {
    DAA.addPacked(a, b);   // the result is written in place into a
    return a;
}

p.s. zSeries hardware has packed decimal machine instructions, e.g. AP (AddPacked). On IBM Java 7 Release 1, “DAA.addPacked(a,b)” is compiled down to this one instruction!

http://www-01.ibm.com/support/knowledgecenter/SSYKE2_8.0.0/com.ibm.java.lnx.80.doc/user/daa.html

Page 21: Five cool ways the JVM can run Apache Spark faster

Automatic Vector Processing: SIMD – Single Instruction Multiple Data

The IBM JVM can operate on multiple data-elements (vectors) simultaneously

Can offer dramatic speed-up to data-parallel operations (matrix ops, string processing, etc)

Vector registers are 128 bits wide and can be used to operate concurrently on:

• Two 64-bit floating point values
• Four 32-bit floating point values
• Sixteen 8-bit characters
• Two 64-bit integer values
• etc.
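For illustration, a hedged Java sketch of the loop shape an auto-vectorizing JIT typically targets: a simple counted loop over arrays with independent iterations. Whether this exact loop is vectorized depends on the JIT and the platform.

// Element-wise add: a candidate for 128-bit vector operations (four floats per instruction).
static void addArrays(float[] a, float[] b, float[] c) {
    for (int i = 0; i < c.length; i++) {
        c[i] = a[i] + b[i];
    }
}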

Page 22: Five cool ways the JVM can run Apache Spark faster

Offloading tasks to graphics co-processors

Page 23: Five cool ways the JVM can run Apache Spark faster

GPU-enabled array sort method

IBM Power 8 with Nvidia K40m GPU

Some Arrays.sort() methods will offload work to GPUs today

– e.g. sorting large arrays of ints

Page 24: Five cool ways the JVM can run Apache Spark faster

GPU optimization of Lambda expressions

[Chart: speed-up factor (log scale, 0.01x to 1000x) against matrix size, comparing auto-SIMD parallel forEach on the CPU with parallel forEach on the GPU, measured on IBM POWER8 with an Nvidia K40m GPU.]

The JIT can recognize parallel stream code and automatically compile it down to the GPU.
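A minimal Java sketch of the kind of parallel stream code meant here (illustrative only; whether this particular loop is offloaded depends on the JIT, the data size, and the hardware): a data-parallel forEach whose iterations are independent.

import java.util.stream.IntStream;

class Scale {
    static void scale(float[] in, float[] out, float factor) {
        IntStream.range(0, in.length).parallel()
                 .forEach(i -> out[i] = in[i] * factor);   // each index is handled independently
    }
}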

Page 25: Five cool ways the JVM can run Apache Spark faster

GPU off-loading of high-level ML functions

Page 26: Five cool ways the JVM can run Apache Spark faster

Summary

All the changes are being made in the Java runtime or being pushed out to the Spark community.

We are focused on core performance to get a multiplier up the Spark stack.

– More efficient code, more efficient memory usage/spilling, more efficient serialization & networking.

– Need to take a look at the bytecode codegen changes coming in via Tungsten.

There are hardware and software technologies we can bring to the party.

– We can tune the stack, from the hardware up to high-level structures, for running Spark.

Spark and Scala developers can help themselves through their coding style.

There is lots more I don't have time to talk about, such as GC optimizations, object layout, monitoring VM/Spark events, hardware compression, security, etc.