Wide-Area Parallel Computing in Java Henri Bal Vrije Universiteit Amsterdam Faculty of Sciences...

Wide-Area Parallel Computing in Java

Henri Bal

Vrije Universiteit Amsterdam

Faculty of Sciences, vrije Universiteit

Introduction

• Distributed supercomputing
  - Parallel applications on a geographically distributed computing system (computational grid)
  - Examples: SETI@home, RSA-155

• Programming support
  - Language-neutral systems: Legion, Globus
  - Language-centric: Java

• Goal: study wide-area parallel computing in Java
  - Programming model: Remote Method Invocation

Outline

• Wide-area parallel computing

• Java Remote Method Invocation (RMI)

• Performance of JDK RMI

• The Manta high-performance Java system

• Wide-area parallel Java applications using RMI

• Application performance

Wide-area parallel computing

• Challenge
  - Tolerating poor latency and bandwidth of WANs

• Basic assumption: wide-area system is hierarchical
  - Connect clusters, not individual workstations
  - Most links are fast

• General approach
  - Optimize applications to exploit hierarchical structure -> most communication is local

Distributed ASCI Supercomputer

Clusters (number of nodes): VU (128), UvA (24), Leiden (24), Delft (24), connected by 6 Mb/s ATM

Node configuration
  - 200 MHz Pentium Pro
  - 64-128 MB memory
  - 2.5 GB local disks
  - Myrinet LAN
  - Redhat Linux 2.0.36

Java

• Growing interest in Java for parallel applications
  - Java Grande forum

• Parallel programming support in Java
  - Shared memory: multithreading
  - Distributed memory: Remote Method Invocation

• Study suitability of Java RMI for (wide-area) parallel programming
  - Optimizing performance of local RMI [PPoPP'99]
  - Wide-area parallel programming using RMI [JavaGrande'99]

RMI (1)

• Flexible object-oriented RPC-like primitive
  - Easy interoperability between Java Virtual Machines
  - Polymorphism -> dynamic bytecode loading

    void species(Animal x) throws … { System.out.println("Species " + x.name()); }

    o.species(new Orca());    -> "Species orca"
    o.species(new Panda());   -> "Species panda"
    o.species(new Manta());   -> "Species manta"

Class hierarchy (figure): Animal, with subclasses Orca, Panda, and Manta
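The polymorphic call on this slide can be imitated without a network. The sketch below is only an approximation under stated assumptions: it does not use the RMI runtime at all, but copies the argument through standard Java serialization, which is how RMI passes non-remote parameters by value. The names `SpeciesDemo` and `copyByValue` are hypothetical, and the `Animal` hierarchy is a guess at what the slide's classes look like.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical hierarchy mirroring the slide's Animal / Orca / Panda / Manta.
interface Animal extends Serializable {
    String name();
}

class Orca implements Animal { public String name() { return "orca"; } }
class Panda implements Animal { public String name() { return "panda"; } }
class Manta implements Animal { public String name() { return "manta"; } }

class SpeciesDemo {
    // Stand-in for the remote method: it only knows the static type Animal.
    static String species(Animal x) {
        return "Species " + x.name();
    }

    // Imitates by-value parameter passing: serialize the argument, then
    // rebuild it from the bytes. The dynamic type travels with the data.
    static Animal copyByValue(Animal x) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(x);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Animal) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(species(copyByValue(new Orca())));   // Species orca
        System.out.println(species(copyByValue(new Panda())));  // Species panda
        System.out.println(species(copyByValue(new Manta())));  // Species manta
    }
}
```

In a real RMI setting the receiving JVM may not have the subclass on its class path, which is where dynamic bytecode loading comes in; this local sketch sidesteps that because all classes live in one JVM.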

RMI (2)

• Designed for client-server applications

• Automatic serialization (marshalling)

• Normally used in a high latency environment
  - E.g. Internet

• Is RMI fast enough for parallel programming?

JDK RMI Performance

Latency (microseconds)

            Fast Ethernet   Myrinet
JDK RMI     1711            1228
C RPC       228             30

(200 MHz Pentium Pro, JDK 1.1.4)

Why is JDK RMI slow?

• Serialization uses run-time type inspection

• Protocol overhead (class information)

• Thread creation for incoming calls

• TCP/IP

• Most code is written in Java

The Manta system

• Designed for high-performance computing

• Native (static) compilation
  - Source -> executable

• New fast RMI protocol between Manta nodes

• Support (polymorphic) RMIs with JVMs

• Implemented on wide-area DAS system

JDK versus Manta

                 JDK            time (µs)   Manta          time (µs)
Serialization    Runtime        670         Compiler       11
RMI protocol     Heavy-weight   950         Light-weight   10
Communication    TCP/IP         280         RPC/LFC        30

(200 MHz Pentium Pro, Myrinet, JDK 1.1.4 interpreter, 1 object as parameter)

Manta serialization

Java source:

    class Test implements Serializable { int i; double d; Object o; }

JDK: the object is serialized at run time by the general-purpose java.io.ObjectOutputStream. The slide shows a long excerpt of that class's source (writeObject, checkSpecialClasses, defaultWriteObject, reset, resetStream, outputString, outputClass, outputClassDescriptor, outputArray, outputObject, serializeNullAndRepeat, findWireOffset, assignWireOffset, hashInsert, addReplacement, write, ...).

Manta: the compiler generates a serializer for this class:

    void PackageClass__Test(…) { WRITE_INT(type_id); WRITE_INT(i); WRITE_DOUBLE(d); WRITE_OBJECT(o); }
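The contrast between run-time type inspection and a class-specific serializer can be sketched in plain Java. This is an illustration only, not Manta's generated code: `Sample`, `SerializationSketch`, the type id value, and the wire layout are all assumptions, and the object-reference field of the slide's `Test` class is left out to keep the example short.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.reflect.Field;

// Hypothetical class, a reduced version of the slide's Test (no Object field).
class Sample {
    int i = 42;
    double d = 3.5;
}

class SerializationSketch {
    static final int SAMPLE_TYPE_ID = 1; // hypothetical type id

    // Generic path: looks up every field and its type at run time,
    // in the spirit of a general-purpose serializer.
    static byte[] writeReflectively(Object obj, int typeId, String... fieldNames) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(typeId);
            for (String fieldName : fieldNames) {
                Field f = obj.getClass().getDeclaredField(fieldName);
                f.setAccessible(true);
                Class<?> type = f.getType();
                if (type == int.class) {
                    out.writeInt(f.getInt(obj));
                } else if (type == double.class) {
                    out.writeDouble(f.getDouble(obj));
                } else {
                    throw new IllegalArgumentException("unsupported field type: " + type);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Specialized path: the field layout is fixed in the code, so no
    // type inspection happens when an object is written.
    static byte[] writeSample(Sample s) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(SAMPLE_TYPE_ID);
            out.writeInt(s.i);
            out.writeDouble(s.d);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Sample s = new Sample();
        byte[] generic = writeReflectively(s, SAMPLE_TYPE_ID, "i", "d");
        byte[] specialized = writeSample(s);
        System.out.println(java.util.Arrays.equals(generic, specialized)); // true
    }
}
```

Both paths emit the same 16 bytes here; the difference is that the generic one pays for field lookup and type tests on every call, while the specialized one is straight-line code.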

RMI protocol

• Light-weight RMI protocol
  - Send minimal type information

• Avoid thread creation
  - Simple nonblocking methods executed directly

• Avoid interrupts
  - Poll network when processor is idle

• Everything is written in C
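The "avoid thread creation" point can be illustrated with a small dispatcher. This is a hypothetical sketch in Java, not Manta's C runtime: the class `DispatchSketch` and the `nonBlocking` flag are assumptions, and how a real system decides that a method is simple and nonblocking is not shown.

```java
class DispatchSketch {
    // Runs an incoming call. Calls marked nonblocking run directly on the
    // receiving thread; all other calls get a thread of their own.
    // Returns the thread that executes the call.
    static Thread dispatch(Runnable call, boolean nonBlocking) {
        if (nonBlocking) {
            call.run();
            return Thread.currentThread();
        }
        Thread worker = new Thread(call);
        worker.start();
        return worker;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread direct = dispatch(() -> System.out.println("short call, run directly"), true);
        Thread spawned = dispatch(() -> System.out.println("possibly blocking call, own thread"), false);
        spawned.join();
        System.out.println(direct == Thread.currentThread());  // true
        System.out.println(spawned == Thread.currentThread()); // false
    }
}
```

Running short calls inline saves a thread creation per call, at the price that a call which does block would stall the receiving thread, hence the restriction to nonblocking methods.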

Communication software

• Panda user space RPC protocol

• LFC Myrinet control program
  - Similar to active messages
  - Implemented partly on Myrinet network interfaces
  - Myrinet network interfaces mapped in user space

Protocol stack (figure): Manta RMI on top of Panda RPC; Panda RPC on top of LFC (Myrinet), UDP (Ethernet), and TCP (ATM)

Interoperability with JVMs

• Manta RMI protocol incompatible with JDK
  - Use fast RMI between Manta nodes
  - Use JDK-compliant protocol with JVMs

• Polymorphic RMI requires exchanging bytecodes
  - Also generate bytecodes when compiling a program
  - Dynamically compile and link bytecodes into running program

Null-RMI latency

Latency (microseconds)

            Fast Ethernet   Myrinet
JDK         1711            1228
Manta       233             39.9
C RPC       228             30

RMI Throughput

Throughput (Mbyte/second)

            Fast Ethernet   Myrinet
JDK         0.97            4.66
Manta       7.3             38.6
C RPC       10.3            55.7

Manta on wide-area DAS

               Protocol   Null-latency (µsec)   Bandwidth (MByte/sec)
Myrinet LAN    LFC        39.9                  38.6
ATM WAN        TCP/IP     5600                  0.55

• 2 orders of magnitude between intra-cluster (LAN) and inter-cluster (WAN) communication performance

• Manta exposes hierarchical structure to application
  - Applications are optimized to reduce WAN-overhead
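To make the gap concrete: from the figures on this slide, WAN latency is roughly 140 times the LAN latency (5600 / 39.9) and LAN bandwidth is roughly 70 times the WAN bandwidth (38.6 / 0.55). The sketch below plugs those numbers into a simple linear cost model, time = latency + size / bandwidth. The model, the class name `LinkCostSketch`, the 1000-byte message size, and the reading of 1 MByte as 10^6 bytes are assumptions made for illustration, not measurements from the talk.

```java
class LinkCostSketch {
    // Estimated one-way transfer time in microseconds under a linear model.
    // With 1 MByte = 10^6 bytes, a rate of 1 MByte/s equals 1 byte per microsecond.
    static double transferMicros(double latencyMicros, double mbytePerSec, long bytes) {
        return latencyMicros + bytes / mbytePerSec;
    }

    public static void main(String[] args) {
        long size = 1000; // hypothetical message size in bytes
        double lan = transferMicros(39.9, 38.6, size);
        double wan = transferMicros(5600, 0.55, size);
        System.out.printf("LAN: %.1f us, WAN: %.1f us, WAN/LAN: %.0f%n", lan, wan, wan / lan);
        System.out.printf("latency ratio: %.0f, bandwidth ratio: %.0f%n",
                5600 / 39.9, 38.6 / 0.55);
    }
}
```

Under this model even small messages cost on the order of a hundred times more across the WAN, which is why the following slides concentrate on reducing or hiding wide-area communication.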

Wide-area programming

• Problem: how to tolerate difference between LAN and WAN performance

• Wide-area system is structured hierarchically
  - Most links are fast

• Approach: application-level optimizations that exploit the hierarchical structure
  - Reduce wide-area communication

Application experience

• Parallel applications
  - Successive overrelaxation (SOR)
  - All-pairs shortest paths problem (ASP)
  - Traveling salesperson problem (TSP)
  - Iterative Deepening A* (IDA*)

• Measurements on wide-area DAS
  - 1-4 clusters with 16 nodes each
  - Comparison with single 64-node cluster

Successive Overrelaxation

• Red/black SOR
  - Neighbor communication, using RMI

• Problem: nodes at cluster-boundaries
  - Overlap wide-area communication with computation
  - RMI is synchronous -> use multithreading

Figure: Cluster 1 (CPU 1, CPU 2, CPU 3) and Cluster 2 (CPU 4, CPU 5, CPU 6); 40 µs between CPUs within a cluster, 5600 µs between the clusters
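Overlapping a synchronous remote call with local work can be sketched as follows. This is a simplified stand-in, not the SOR code from the talk: it uses a one-dimensional averaging stencil instead of red/black SOR, `CompletableFuture` instead of hand-written threads, and a `Supplier` named `remoteGhost` in place of the blocking RMI that fetches the neighbor's boundary value. All of these names are hypothetical.

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

class OverlapSketch {
    // One relaxation step on this node's cells (at least two cells expected).
    // Cell 0 is a fixed boundary; the last cell needs a ghost value owned by
    // a neighbor in another cluster.
    static double[] iterate(double[] old, Supplier<Double> remoteGhost) {
        int n = old.length;
        double[] next = old.clone();
        // Start the slow wide-area fetch on another thread...
        CompletableFuture<Double> ghost = CompletableFuture.supplyAsync(remoteGhost);
        // ...and update the interior cells, which need no remote data, meanwhile.
        for (int i = 1; i < n - 1; i++) {
            next[i] = 0.5 * (old[i - 1] + old[i + 1]);
        }
        // Only the cell at the cluster boundary waits for the remote value.
        next[n - 1] = 0.5 * (old[n - 2] + ghost.join());
        return next;
    }

    // Stand-in for a blocking remote call; the sleep mimics WAN latency.
    static Double slowRemoteGhost(double value) {
        try {
            Thread.sleep(6);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return value;
    }

    public static void main(String[] args) {
        double[] next = iterate(new double[] {0, 2, 4}, () -> slowRemoteGhost(6.0));
        System.out.println(java.util.Arrays.toString(next)); // [0.0, 2.0, 4.0]
    }
}
```

The wide-area round trip is hidden only to the extent that the interior update takes at least as long as the fetch, so the benefit depends on how much local work each node has per iteration.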

All-pairs shortest paths

• Broadcast at beginning of each iteration

• Problem: broadcasting over wide-area links
  - Lack of broadcast in Java -> use spanning tree
  - Use coordinator node per cluster
  - Do asynchronous send to all remote coordinators
  - Implemented using threads

Figure: clusters 1, 2, 3
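The coordinator-based broadcast can be modeled in memory. The sketch below is an assumption-laden illustration, not the ASP implementation: nodes are plain objects, a "wide-area send" is a thread that hands the row to a remote cluster, and the coordinator's forwarding is a loop over that cluster's nodes. The names `BroadcastSketch`, `Node`, `makeClusters`, and `forwardLocally` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class BroadcastSketch {
    static class Node {
        private final List<int[]> received = new ArrayList<>();
        synchronized void deliver(int[] row) { received.add(row); }
        synchronized int receivedCount() { return received.size(); }
    }

    static List<List<Node>> makeClusters(int clusters, int nodesPerCluster) {
        List<List<Node>> result = new ArrayList<>();
        for (int c = 0; c < clusters; c++) {
            List<Node> cluster = new ArrayList<>();
            for (int n = 0; n < nodesPerCluster; n++) cluster.add(new Node());
            result.add(cluster);
        }
        return result;
    }

    // What a cluster coordinator does: pass the row on over fast local links.
    static void forwardLocally(List<Node> cluster, int[] row) {
        for (Node node : cluster) node.deliver(row);
    }

    // Broadcasts a row from one cluster; returns the number of wide-area sends.
    static int broadcast(List<List<Node>> clusters, int senderCluster, int[] row) {
        List<Thread> sends = new ArrayList<>();
        for (int c = 0; c < clusters.size(); c++) {
            if (c == senderCluster) continue;
            final List<Node> remote = clusters.get(c);
            // One asynchronous send per remote cluster, each on its own thread.
            Thread send = new Thread(() -> forwardLocally(remote, row));
            send.start();
            sends.add(send);
        }
        forwardLocally(clusters.get(senderCluster), row); // local delivery overlaps the sends
        for (Thread send : sends) {
            try {
                send.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return sends.size();
    }

    public static void main(String[] args) {
        List<List<Node>> clusters = makeClusters(3, 4);
        int wideAreaSends = broadcast(clusters, 0, new int[] {1, 2, 3});
        System.out.println(wideAreaSends); // 2
    }
}
```

With this scheme each row crosses every wide-area link once, regardless of how many nodes sit in the remote cluster.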

Traveling salesperson problem

• Replicated-worker style parallel search algorithm

• Problem: work distribution
  - Central job-queue has high overhead
  - Statically distribute jobs over clusters
  - Use centralized job-queue per cluster
  - Easy to express using RMI

Figure labels: 1, 2, 3
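The per-cluster job queue idea can be sketched with ordinary Java collections. This is a hypothetical, single-JVM illustration: the round-robin split is one possible static distribution (the slide does not say which one was used), and `JobQueueSketch`, `distribute`, and `nextJob` are made-up names. In the distributed setting each queue would be a remote object living in its own cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class JobQueueSketch {
    // Statically splits the jobs over the clusters (round-robin here),
    // giving each cluster one queue of its own.
    static <J> List<Queue<J>> distribute(List<J> jobs, int clusters) {
        List<Queue<J>> queues = new ArrayList<>();
        for (int c = 0; c < clusters; c++) {
            queues.add(new ConcurrentLinkedQueue<>());
        }
        for (int j = 0; j < jobs.size(); j++) {
            queues.get(j % clusters).add(jobs.get(j));
        }
        return queues;
    }

    // A worker only ever asks the queue of its own cluster, so fetching
    // a job never crosses a wide-area link. Returns null when the queue is empty.
    static <J> J nextJob(List<Queue<J>> queues, int cluster) {
        return queues.get(cluster).poll();
    }

    public static void main(String[] args) {
        List<Queue<Integer>> queues = distribute(Arrays.asList(0, 1, 2, 3, 4, 5, 6), 3);
        System.out.println(queues.get(0)); // [0, 3, 6]
        System.out.println(nextJob(queues, 1)); // 1
    }
}
```

The trade-off is load balance: once a cluster's queue runs dry its workers idle, since this scheme never moves jobs between clusters.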

Iterative Deepening A*

• Parallel search algorithm using work stealing

• Problem: inter-cluster work stealing

• Optimization: first look for work in local cluster
  - Easy to express using RMI

Figure: clusters 1, 2
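The local-first stealing order can be sketched as below. It is a single-threaded, hypothetical model: `StealSketch`, `Worker`, and the use of strings as jobs are assumptions, and a real implementation would steal through remote calls and need synchronization on the job deques.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class StealSketch {
    static class Worker {
        final int cluster;
        final Deque<String> jobs = new ArrayDeque<>();
        Worker(int cluster) { this.cluster = cluster; }
    }

    // Tries to steal one job for the thief: first from workers in the same
    // cluster, and only if that fails from workers in other clusters.
    // Returns null when nobody has work left.
    static String steal(Worker thief, List<Worker> all) {
        for (int pass = 0; pass < 2; pass++) {
            boolean wantLocal = (pass == 0);
            for (Worker victim : all) {
                if (victim == thief) continue;
                boolean isLocal = victim.cluster == thief.cluster;
                if (isLocal != wantLocal) continue;
                String job = victim.jobs.pollLast();
                if (job != null) return job;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Worker thief = new Worker(0);
        Worker remote = new Worker(1);
        Worker local = new Worker(0);
        remote.jobs.add("remote job");
        local.jobs.add("local job");
        List<Worker> all = Arrays.asList(thief, remote, local);
        System.out.println(steal(thief, all)); // local job
        System.out.println(steal(thief, all)); // remote job
        System.out.println(steal(thief, all)); // null
    }
}
```

Remote victims are listed before local ones in the example on purpose: the two-pass loop still takes the local job first, so a wide-area steal happens only when the whole local cluster is out of work.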

Performance

Figure: bar chart of speedup (axis 0-70) for SOR, ASP, TSP, and IDA* in four configurations: 1 x 16 CPUs, 4 x 16 CPUs, 4 x 16 CPUs (optimized), and 1 x 64 CPUs

• Wide-area DAS system: 4 clusters of 16 CPUs

• Comparison with single 16-node and 64-node cluster

Conclusions

• Fast RMI possible through
  - Compiler-generated serialization, light-weight communication & RMI protocols

• Optimized wide-area applications are efficient
  - Reduce wide-area communication, or hide its latency

• Java RMI is easy to use, but some optimizations are awkward to express
  - No asynchronous communication, collective comm.

• Programming systems should take hierarchical structure of wide-area systems into account

http://www.cs.vu.nl/manta

Performance breakdown Manta

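Figure: stacked bar chart (Fast Ethernet, axis 0-300) with one bar each for empty, 1 object, 2 objects, and 3 objects, broken down into Communication, RMI Overhead, and Serialization. Data labels as extracted: 227, 232, 235, 243 (one per bar); the remaining labels are fused in the extraction ("11109105", "2417") and cannot be separated reliably.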