Wide-Area Parallel Computing in Java
Henri Bal
Vrije Universiteit Amsterdam
Faculty of Sciences

Introduction
• Distributed supercomputing
  - Parallel applications on geographically distributed computing systems (computational grids)
  - Examples: SETI@home, RSA-155
• Programming support
  - Language-neutral systems: Legion, Globus
  - Language-centric: Java
• Goal: study wide-area parallel computing in Java
  - Programming model: Remote Method Invocation

Outline
• Wide-area parallel computing
• Java Remote Method Invocation (RMI)
• Performance of JDK RMI
• The Manta high-performance Java system
• Wide-area parallel Java applications using RMI
• Application performance

Wide-area parallel computing
• Challenge
  - Tolerating poor latency and bandwidth of WANs
• Basic assumption: wide-area system is hierarchical
  - Connect clusters, not individual workstations
  - Most links are fast
• General approach
  - Optimize applications to exploit the hierarchical structure, so that most communication is local

Distributed ASCI Supercomputer
• Clusters: VU (128), UvA (24), Leiden (24), Delft (24)
• Wide-area links: 6 Mb/s ATM
• Node configuration
  - 200 MHz Pentium Pro
  - 64-128 MB memory
  - 2.5 GB local disks
  - Myrinet LAN
  - Redhat Linux 2.0.36

Java
• Growing interest in Java for parallel applications
  - Java Grande forum
• Parallel programming support in Java
  - Shared memory: multithreading
  - Distributed memory: Remote Method Invocation
• Study suitability of Java RMI for (wide-area) parallel programming
  - Optimizing performance of local RMI [PPoPP'99]
  - Wide-area parallel programming using RMI [JavaGrande'99]

RMI (1)
• Flexible object-oriented RPC-like primitive
  - Easy interoperability between Java Virtual Machines
  - Polymorphism requires dynamic bytecode loading

void species(Animal x) throws … {
    System.out.println("Species " + x.name());
}

o.species(new Orca());    // prints "Species orca"
o.species(new Panda());   // prints "Species panda"
o.species(new Manta());   // prints "Species manta"

Class hierarchy: Animal, with subclasses Orca, Panda, and Manta
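
The polymorphic call above can be tried without a network. The following is a minimal sketch, not the RMI machinery itself: it copies the argument through standard Java serialization (which is how RMI passes non-remote objects by value) and shows that the receiver, which only knows the static type Animal, still sees the dynamic type. The class name PolymorphicParameterSketch and the helper passByValue are hypothetical. Because sender and receiver share one JVM here, the dynamic bytecode loading step is not exercised.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class PolymorphicParameterSketch {
    // Hypothetical hierarchy mirroring the slide's Animal example.
    static abstract class Animal implements Serializable {
        abstract String name();
    }
    static class Orca extends Animal { String name() { return "orca"; } }
    static class Panda extends Animal { String name() { return "panda"; } }
    static class Manta extends Animal { String name() { return "manta"; } }

    // Stand-in for the remote method: it is written against Animal only.
    static String species(Animal x) {
        return "Species " + x.name();
    }

    // Copies the argument through Java serialization (assumed helper, not an RMI API).
    static Animal passByValue(Animal a) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(a);
            }
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Animal) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(species(passByValue(new Orca())));
        System.out.println(species(passByValue(new Panda())));
        System.out.println(species(passByValue(new Manta())));
    }
}
```

In a real deployment the receiving JVM may never have seen the Orca class, which is why polymorphic RMI needs to ship or load bytecode at run time.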

RMI (2)
• Designed for client-server applications
• Automatic serialization (marshalling)
• Normally used in a high-latency environment
  - E.g., the Internet
• Is RMI fast enough for parallel programming?

JDK RMI Performance
Latency (microseconds); 200 MHz Pentium Pro, JDK 1.1.4

                 JDK RMI   C RPC
Fast Ethernet    1711      228
Myrinet          1228      30

Why is JDK RMI slow?
• Serialization uses run-time type inspection
• Protocol overhead (class information)
• Thread creation for incoming calls
• TCP/IP
• Most code is written in Java
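
To make the first bullet concrete, here is a small sketch of what run-time type inspection involves: the fields of an object are discovered through reflection on every call, instead of being known at compile time. This illustrates the general technique only and is not the JDK's serialization code; the class ReflectiveInspectionSketch and the method describeFields are hypothetical names.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

class ReflectiveInspectionSketch {
    // Same shape as the Test class used on the serialization slide.
    static class Test {
        int i = 7;
        double d = 2.5;
        Object o = "payload";
    }

    // Walks the declared fields of an arbitrary object at run time.
    static List<String> describeFields(Object obj) {
        List<String> out = new ArrayList<>();
        for (Field f : obj.getClass().getDeclaredFields()) {
            // Skip static and compiler-generated fields; only instance state is of interest.
            if (Modifier.isStatic(f.getModifiers()) || f.isSynthetic()) {
                continue;
            }
            f.setAccessible(true);
            try {
                out.add(f.getType().getSimpleName() + " " + f.getName() + " = " + f.get(obj));
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeFields(new Test())) {
            System.out.println(line);
        }
    }
}
```

Every field costs a reflective lookup and a boxed read, which is the kind of per-call work a compiler can remove when the class layout is known statically.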

The Manta system
• Designed for high-performance computing
• Native (static) compilation
  - Source to executable
• New fast RMI protocol between Manta nodes
• Supports (polymorphic) RMIs with JVMs
• Implemented on wide-area DAS system

JDK versus Manta

                 JDK            time (µs)   Manta          time (µs)
Serialization    Runtime        670         Compiler       11
RMI protocol     Heavy-weight   950         Light-weight   10
Communication    TCP/IP         280         RPC/LFC        30

200 MHz Pentium Pro, Myrinet, JDK 1.1.4 interpreter, 1 object as parameter

Manta serialization

Java source:

class Test implements Serializable {
    int i;
    double d;
    Object o;
}

JDK: generic serialization code executed at run time (excerpt):

package java.io;

import java.util.Stack;

public class ObjectOutputStream extends OutputStream
        implements ObjectOutput, ObjectStreamConstants
{
    public ObjectOutputStream(OutputStream out) throws IOException {
        this.out = out;
        dos = new DataOutputStream(this);
        buf = new byte[1024];
        writeStreamHeader();
        resetStream();
    }

    public final void writeObject(Object obj) throws IOException {
        Object prevObject = currentObject;
        ObjectStreamClass prevClassDesc = currentClassDesc;
        boolean oldBlockDataMode = setBlockData(false);
        recursionDepth++;
        try {
            if (serializeNullAndRepeat(obj))
                return;
            if (checkSpecialClasses(obj))
                return;
            if (enableReplace) {
                Object altobj = replaceObject(obj);
                if (obj != altobj) {
                    if (!(altobj instanceof Serializable)) {
                        String clname = altobj.getClass().getName();
                        throw new NotSerializableException(clname);
                    }
                    if (serializeNullAndRepeat(altobj)) {
                        addReplacement(obj, altobj);
                        return;
                    }
                    addReplacement(obj, altobj);
                    if (checkSpecialClasses(altobj))
                        return;
                    obj = altobj;
                }
            }
            outputObject(obj);
        } catch (ObjectStreamException ee) {
            if (abortIOException == null) {
                try {
                    setBlockData(false);
                    writeCode(TC_EXCEPTION);
                    resetStream();
                    this.writeObject(ee);
                    resetStream();
                    abortIOException = ee;
                } catch (IOException fatal) {
                    abortIOException = new StreamCorruptedException(fatal.getMessage());
                }
            }
        } catch (IOException ee) {
            if (abortIOException == null)
                abortIOException = ee;
        } finally {
            recursionDepth--;
            currentObject = prevObject;
            currentClassDesc = prevClassDesc;
            setBlockData(oldBlockDataMode);
        }
        IOException pending = abortIOException;
        if (recursionDepth == 0)
            abortIOException = null;
        if (pending != null) {
            throw pending;
        }
    }

    private boolean checkSpecialClasses(Object obj) throws IOException {
        if (obj instanceof Class) {
            outputClass((Class) obj);
            return true;
        }
        if (obj instanceof ObjectStreamClass) {
            outputClassDescriptor((ObjectStreamClass) obj);
            return true;
        }
        if (obj instanceof String) {
            outputString((String) obj);
            return true;
        }
        if (obj.getClass().isArray()) {
            outputArray(obj);
            return true;
        }
        return false;
    }

    public final void defaultWriteObject() throws IOException {
        if (currentObject == null || currentClassDesc == null)
            throw new NotActiveException("defaultWriteObject");
        if (currentClassDesc.getFieldSequence() != null) {
            boolean prevmode = setBlockData(false);
            outputClassFields(currentObject, currentClassDesc.forClass(),
                              currentClassDesc.getFieldSequence());
            setBlockData(prevmode);
        }
    }

    public void reset() throws IOException {
        if (currentObject != null || currentClassDesc != null)
            throw new IOException("Illegal call to reset");
        setBlockData(false);
        writeCode(TC_RESET);
        resetStream();
        abortIOException = null;
    }

    private void resetStream() throws IOException {
        wireHandle2Object = new Object[100];
        wireNextHandle = new int[100];
        wireHash2Handle = new int[101];
        for (int i = 0; i < wireHash2Handle.length; i++) {
            wireHash2Handle[i] = -1;
        }
        classDescStack = new Stack();
        nextWireOffset = 0;
        replaceObjects = null;
        nextReplaceOffset = 0;
        setBlockData(true);
    }

    protected void annotateClass(Class cl) throws IOException {
    }

    protected Object replaceObject(Object obj) throws IOException {
        return obj;
    }

    protected final boolean enableReplaceObject(boolean enable) throws SecurityException {
        boolean previous = enableReplace;
        if (enable) {
            ClassLoader loader = this.getClass().getClassLoader();
            if (loader == null) {
                enableReplace = true;
                return previous;
            }
            throw new SecurityException("Not trusted class");
        } else {
            enableReplace = false;
        }
        return previous;
    }

    protected void writeStreamHeader() throws IOException {
        writeShort(STREAM_MAGIC);
        writeShort(STREAM_VERSION);
    }

    private void outputString(String s) throws IOException {
        assignWireOffset(s);
        writeCode(TC_STRING);
        writeUTF(s);
    }

    private void outputClass(Class aclass) throws IOException {
        writeCode(TC_CLASS);
        ObjectStreamClass v = ObjectStreamClass.lookup(aclass);
        if (v == null)
            throw new NotSerializableException(aclass.getName());
        outputClassDescriptor(v);
        assignWireOffset(aclass);
    }

    private void outputClassDescriptor(ObjectStreamClass classdesc) throws IOException {
        if (serializeNullAndRepeat(classdesc))
            return;
        writeCode(TC_CLASSDESC);
        String classname = classdesc.getName();
        writeUTF(classname);
        writeLong(classdesc.getSerialVersionUID());
        assignWireOffset(classdesc);
        classdesc.write(this);
        boolean prevMode = setBlockData(true);
        annotateClass(classdesc.forClass());
        setBlockData(prevMode);
        writeCode(TC_ENDBLOCKDATA);
        ObjectStreamClass superdesc = classdesc.getSuperclass();
        outputClassDescriptor(superdesc);
    }

    private void outputArray(Object obj) throws IOException {
        Class currclass = obj.getClass();
        ObjectStreamClass v = ObjectStreamClass.lookup(currclass);
        writeCode(TC_ARRAY);
        outputClassDescriptor(v);
        assignWireOffset(obj);
        int i, length;
        Class type = currclass.getComponentType();
        if (type.isPrimitive()) {
            if (type == Integer.TYPE) {
                int[] array = (int[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeInt(array[i]);
                }
            } else if (type == Byte.TYPE) {
                byte[] array = (byte[]) obj;
                length = array.length;
                writeInt(length);
                write(array, 0, length);
            } else if (type == Long.TYPE) {
                long[] array = (long[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeLong(array[i]);
                }
            } else if (type == Float.TYPE) {
                float[] array = (float[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeFloat(array[i]);
                }
            } else if (type == Double.TYPE) {
                double[] array = (double[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeDouble(array[i]);
                }
            } else if (type == Short.TYPE) {
                short[] array = (short[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeShort(array[i]);
                }
            } else if (type == Character.TYPE) {
                char[] array = (char[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeChar(array[i]);
                }
            } else if (type == Boolean.TYPE) {
                boolean[] array = (boolean[]) obj;
                length = array.length;
                writeInt(length);
                for (i = 0; i < length; i++) {
                    writeBoolean(array[i]);
                }
            } else {
                throw new InvalidClassException(currclass.getName());
            }
        } else {
            Object[] array = (Object[]) obj;
            length = array.length;
            writeInt(length);
            for (i = 0; i < length; i++) {
                writeObject(array[i]);
            }
        }
    }

    private void outputObject(Object obj) throws IOException {
        currentObject = obj;
        Class currclass = obj.getClass();
        currentClassDesc = ObjectStreamClass.lookup(currclass);
        if (currentClassDesc == null) {
            throw new NotSerializableException(currclass.getName());
        }
        writeCode(TC_OBJECT);
        outputClassDescriptor(currentClassDesc);
        assignWireOffset(obj);
        if (currentClassDesc.isExternalizable()) {
            Externalizable ext = (Externalizable) obj;
            ext.writeExternal(this);
        } else {
            int stackMark = classDescStack.size();
            try {
                ObjectStreamClass next;
                while ((next = currentClassDesc.getSuperclass()) != null) {
                    classDescStack.push(currentClassDesc);
                    currentClassDesc = next;
                }
                do {
                    if (currentClassDesc.hasWriteObject()) {
                        setBlockData(true);
                        invokeObjectWriter(obj, currentClassDesc.forClass());
                        setBlockData(false);
                        writeCode(TC_ENDBLOCKDATA);
                    } else {
                        defaultWriteObject();
                    }
                } while (classDescStack.size() > stackMark &&
                         (currentClassDesc = (ObjectStreamClass) classDescStack.pop()) != null);
            } finally {
                classDescStack.setSize(stackMark);
            }
        }
    }

    private boolean serializeNullAndRepeat(Object obj) throws IOException {
        if (obj == null) {
            writeCode(TC_NULL);
            return true;
        }
        if (replaceObjects != null) {
            for (int i = 0; i < nextReplaceOffset; i += 2) {
                if (replaceObjects[i] == obj) {
                    obj = replaceObjects[i + 1];
                    break;
                }
            }
        }
        int handle = findWireOffset(obj);
        if (handle >= 0) {
            writeCode(TC_REFERENCE);
            writeInt(handle + baseWireHandle);
            return true;
        }
        return false;
    }

    private int findWireOffset(Object obj) {
        int hash = System.identityHashCode(obj);
        int index = (hash & 0x7FFFFFFF) % wireHash2Handle.length;
        for (int handle = wireHash2Handle[index]; handle >= 0; handle = wireNextHandle[handle]) {
            if (wireHandle2Object[handle] == obj)
                return handle;
        }
        return -1;
    }

    private void assignWireOffset(Object obj) throws IOException {
        if (nextWireOffset == wireHandle2Object.length) {
            Object[] oldhandles = wireHandle2Object;
            wireHandle2Object = new Object[nextWireOffset * 2];
            System.arraycopy(oldhandles, 0, wireHandle2Object, 0, nextWireOffset);
            int[] oldnexthandles = wireNextHandle;
            wireNextHandle = new int[nextWireOffset * 2];
            System.arraycopy(oldnexthandles, 0, wireNextHandle, 0, nextWireOffset);
        }
        wireHandle2Object[nextWireOffset] = obj;
        hashInsert(obj, nextWireOffset);
        nextWireOffset++;
        return;
    }

    private void hashInsert(Object obj, int offset) {
        int hash = System.identityHashCode(obj);
        int index = (hash & 0x7FFFFFFF) % wireHash2Handle.length;
        wireNextHandle[offset] = wireHash2Handle[index];
        wireHash2Handle[index] = offset;
    }

    private void addReplacement(Object orig, Object replacement) {
        if (replaceObjects == null) {
            replaceObjects = new Object[10];
        }
        if (nextReplaceOffset == replaceObjects.length) {
            Object[] oldhandles = replaceObjects;
            replaceObjects = new Object[2 + nextReplaceOffset * 2];
            System.arraycopy(oldhandles, 0, replaceObjects, 0, nextReplaceOffset);
        }
        replaceObjects[nextReplaceOffset++] = orig;
        replaceObjects[nextReplaceOffset++] = replacement;
    }

    private void writeCode(int tag) throws IOException {
        writeByte(tag);
    }

    private boolean blockDataMode;
    private byte[] buf;
    private int count;
    private OutputStream out;

    public void write(int data) throws IOException {
        if (count >= buf.length)
            drain();
        buf[count++] = (byte) data;
    }

    public void write(byte b[]) throws IOException {
        write(b, 0, b.length);
    }

    public void write(byte b[], int off, int len) throws IOException {
        if (len < 0)
            throw new IndexOutOfBoundsException();
        int avail = buf.length - count;
        if (len <= avail) {
            System.arraycopy(b, off, buf, count, len);
            count += len;
        } else {
            drain();
            if (blockDataMode) {
                if (len <= 255) {
                    out.write(TC_BLOCKDATA);
                    out.write(len);
                } else {
                    out.write(TC_BLOCKDATALONG);
                    out.write((len >> 24) & 0xFF);
                    out.write((len >> 16) & 0xFF);
                    out.write((len >> 8) & 0xFF);
                    out.write(len & 0xFF);
                }
            }
            out.write(b, off, len);
        }
    }
    // (the listing ends here on the slide)
}

Manta: serializer generated by the compiler for class Test:

void PackageClass__Test(…) {
    WRITE_INT( type_id );
    WRITE_INT( i );
    WRITE_DOUBLE( d );
    WRITE_OBJECT( o );
}
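
The idea of a class-specific serializer can also be sketched in plain Java. The code below is an illustration under stated assumptions, not Manta's generated code: the type id value 42, the class name GeneratedSerializerSketch, and the methods writeTest and readTest are hypothetical, and the Object field is narrowed to a String so the sketch stays self-contained. The point is that the field layout is fixed in the code, so no type inspection happens when an object is written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class GeneratedSerializerSketch {
    static final int TEST_TYPE_ID = 42; // hypothetical id a compiler might assign to Test

    static class Test {
        int i;
        double d;
        String o; // assumption: narrowed from Object to String for this sketch
    }

    // What a generated per-class writer could look like: type id, then each field in order.
    static byte[] writeTest(Test t) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        try {
            out.writeInt(TEST_TYPE_ID);
            out.writeInt(t.i);
            out.writeDouble(t.d);
            out.writeUTF(t.o);
            out.flush();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Matching reader: checks the type id, then reads the fields in the same order.
    static Test readTest(byte[] data) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        try {
            int typeId = in.readInt();
            if (typeId != TEST_TYPE_ID) {
                throw new IllegalArgumentException("unexpected type id " + typeId);
            }
            Test t = new Test();
            t.i = in.readInt();
            t.d = in.readDouble();
            t.o = in.readUTF();
            return t;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Test t = new Test();
        t.i = 1;
        t.d = 2.0;
        t.o = "hi";
        byte[] wire = writeTest(t);
        Test copy = readTest(wire);
        System.out.println(wire.length + " bytes, i=" + copy.i + ", d=" + copy.d + ", o=" + copy.o);
    }
}
```

The wire format carries only a type id and the field values, which matches the "send minimal type information" goal of a light-weight protocol.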

RMI protocol
• Light-weight RMI protocol
  - Send minimal type information
• Avoid thread creation
  - Simple nonblocking methods executed directly
• Avoid interrupts
  - Poll network when processor is idle
• Everything is written in C
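
The "avoid thread creation" rule can be illustrated with a small dispatcher. This is a sketch of the general idea only, written in Java for readability even though the slide states that the Manta runtime itself is written in C. The names DirectDispatchSketch and dispatch, and the nonBlocking flag (standing in for whatever analysis decides a method is simple and nonblocking), are assumptions.

```java
import java.util.function.Supplier;

class DirectDispatchSketch {
    // Runs an incoming call either directly on the receiving thread or on a new thread.
    static String dispatch(boolean nonBlocking, Supplier<String> call) {
        if (nonBlocking) {
            // Simple nonblocking method: execute it right here, no thread is created.
            return call.get();
        }
        // A method that may block gets its own thread so the receiver is not stalled.
        String[] result = new String[1];
        Thread worker = new Thread(() -> result[0] = call.get());
        worker.start();
        try {
            // Joining is only done so this sketch can return the result to the caller.
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return result[0];
    }

    public static void main(String[] args) {
        Supplier<String> whoRanMe = () -> Thread.currentThread().getName();
        System.out.println("direct:   " + dispatch(true, whoRanMe));
        System.out.println("threaded: " + dispatch(false, whoRanMe));
    }
}
```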

Communication software
• Panda user-space RPC protocol
• LFC Myrinet control program
  - Similar to active messages
  - Implemented partly on Myrinet network interfaces
  - Myrinet network interfaces mapped in user space

Protocol stack (figure), top to bottom:
  Manta RMI
  Panda RPC
  LFC, UDP, TCP
  Myrinet, Ethernet, ATM

Interoperability with JVMs
• Manta RMI protocol incompatible with JDK
  - Use fast RMI between Manta nodes
  - Use JDK-compliant protocol with JVMs
• Polymorphic RMI requires exchanging bytecodes
  - Also generate bytecodes when compiling a program
  - Dynamically compile and link bytecodes into running program

Null-RMI latency
Latency (microseconds)

                 JDK     Manta   C RPC
Fast Ethernet    1711    233     228
Myrinet          1228    39.9    30

RMI Throughput
Throughput (MByte/second)

                 JDK     Manta   C RPC
Fast Ethernet    0.97    7.3     10.3
Myrinet          4.66    38.6    55.7

Manta on wide-area DAS

               Protocol   Null-latency (µs)   Bandwidth (MByte/s)
Myrinet LAN    LFC        39.9                38.6
ATM WAN        TCP/IP     5600                0.55

• Two orders of magnitude between intra-cluster (LAN) and inter-cluster (WAN) communication performance
• Manta exposes hierarchical structure to application
  - Applications are optimized to reduce WAN overhead
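
The size of the LAN/WAN gap follows directly from the numbers in the table. The short program below just divides them; the class name LanWanGapSketch and the method ratio are hypothetical. The latency gap comes out at roughly 140x and the bandwidth gap at roughly 70x, which is the "two orders of magnitude" the slide refers to.

```java
class LanWanGapSketch {
    // Plain ratio helper, kept separate so the two gaps read the same way.
    static double ratio(double larger, double smaller) {
        return larger / smaller;
    }

    public static void main(String[] args) {
        double latencyGap = ratio(5600.0, 39.9);  // WAN latency over LAN latency (µs)
        double bandwidthGap = ratio(38.6, 0.55);  // LAN bandwidth over WAN bandwidth (MByte/s)
        System.out.println("latency gap:   " + Math.round(latencyGap) + "x");
        System.out.println("bandwidth gap: " + Math.round(bandwidthGap) + "x");
    }
}
```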

Wide-area programming
• Problem: how to tolerate the difference between LAN and WAN performance
• Wide-area system is structured hierarchically
  - Most links are fast
• Approach: application-level optimizations that exploit the hierarchical structure
  - Reduce wide-area communication

Application experience
• Parallel applications
  - Successive overrelaxation (SOR)
  - All-pairs shortest paths problem (ASP)
  - Traveling salesperson problem (TSP)
  - Iterative Deepening A* (IDA*)
• Measurements on wide-area DAS
  - 1-4 clusters with 16 nodes
  - Comparison with single 64-node cluster

Successive Overrelaxation
• Red/black SOR
  - Neighbor communication, using RMI
• Problem: nodes at cluster boundaries
  - Overlap wide-area communication with computation
  - RMI is synchronous, so use multithreading

Figure: Cluster 1 (CPUs 1-3) and Cluster 2 (CPUs 4-6); 40 µs between neighbors within a cluster, 5600 µs between clusters
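
One way to overlap a synchronous call with computation is to issue the call from a separate thread. The sketch below assumes this structure; it is not the SOR code from the talk. The class SorOverlapSketch, the methods exchangeBoundary, relaxInterior, and iterate, and the 6 ms sleep that stands in for a slow wide-area RMI are all hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SorOverlapSketch {
    // Hypothetical stand-in for a synchronous RMI to a neighbor in another cluster.
    static double[] exchangeBoundary(double[] myBoundary) {
        try {
            Thread.sleep(6); // simulated wide-area delay
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return myBoundary.clone(); // the pretend neighbor simply echoes the row
    }

    // Local work that does not need the remote row: here, the mean of the interior points.
    static double relaxInterior(double[] interior) {
        double sum = 0.0;
        for (double v : interior) {
            sum += v;
        }
        return sum / interior.length;
    }

    // One iteration: start the blocking exchange, compute locally, then collect the reply.
    static double[] iterate(double[] boundary, double[] interior) {
        ExecutorService comm = Executors.newSingleThreadExecutor();
        try {
            Callable<double[]> call = () -> exchangeBoundary(boundary);
            Future<double[]> remote = comm.submit(call); // communication thread blocks, not us
            double local = relaxInterior(interior);      // overlaps with the pending call
            double[] neighborRow = remote.get();         // wait only when the data is needed
            return new double[] { local, neighborRow[0] };
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            comm.shutdown();
        }
    }

    public static void main(String[] args) {
        double[] result = iterate(new double[] { 1.0 }, new double[] { 2.0, 4.0 });
        System.out.println("local=" + result[0] + ", neighbor=" + result[1]);
    }
}
```

The extra thread is the price of a synchronous primitive: the programmer has to build the asynchrony by hand.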

All-pairs shortest paths
• Broadcast at beginning of each iteration
• Problem: broadcasting over wide-area links
  - Lack of broadcast in Java -> use spanning tree
  - Use coordinator node per cluster
  - Do asynchronous send to all remote coordinators
  - Implemented using threads

Figure: Clusters 1, 2, and 3
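
The coordinator scheme can be sketched with in-memory inboxes standing in for nodes. This is an illustration under assumptions, not the ASP implementation from the talk: the class ClusterBroadcastSketch and its methods are hypothetical, and a thread per remote cluster plays the role of the asynchronous send to that cluster's coordinator, which then forwards the row over its fast local links.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class ClusterBroadcastSketch {
    // clusters.get(c).get(n) is the inbox of node n in cluster c.
    static List<List<Queue<int[]>>> makeClusters(int clusters, int nodesPerCluster) {
        List<List<Queue<int[]>>> all = new ArrayList<>();
        for (int c = 0; c < clusters; c++) {
            List<Queue<int[]>> nodes = new ArrayList<>();
            for (int n = 0; n < nodesPerCluster; n++) {
                nodes.add(new ConcurrentLinkedQueue<int[]>());
            }
            all.add(nodes);
        }
        return all;
    }

    // What a coordinator does on receipt: forward the row to every node in its cluster.
    static void deliverWithinCluster(int[] row, List<Queue<int[]>> cluster) {
        for (Queue<int[]> inbox : cluster) {
            inbox.add(row);
        }
    }

    // Sends one wide-area message per remote cluster; returns how many were sent.
    static int broadcast(int[] row, List<List<Queue<int[]>>> clusters, int senderCluster) {
        List<Thread> senders = new ArrayList<>();
        for (int c = 0; c < clusters.size(); c++) {
            if (c == senderCluster) {
                continue;
            }
            List<Queue<int[]>> remote = clusters.get(c);
            Thread t = new Thread(() -> deliverWithinCluster(row, remote)); // asynchronous send
            t.start();
            senders.add(t);
        }
        deliverWithinCluster(row, clusters.get(senderCluster)); // local links, no extra thread
        for (Thread t : senders) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return senders.size();
    }

    public static void main(String[] args) {
        List<List<Queue<int[]>>> clusters = makeClusters(3, 4);
        int wideAreaSends = broadcast(new int[] { 0, 5, 9 }, clusters, 0);
        System.out.println("wide-area sends: " + wideAreaSends);
    }
}
```

With this structure the number of wide-area messages per broadcast depends on the number of clusters, not on the number of nodes.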

Traveling salesperson problem
• Replicated-worker style parallel search algorithm
• Problem: work distribution
  - Central job queue has high overhead
  - Statically distribute jobs over clusters
  - Use centralized job queue per cluster
  - Easy to express using RMI
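
A per-cluster job queue with static distribution might look as follows. This is a minimal sketch with hypothetical names (ClusterJobQueuesSketch, distribute, nextJob); jobs are plain integers and are dealt round-robin, which is one simple static policy and not necessarily the one used in the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class ClusterJobQueuesSketch {
    // Deals the jobs round-robin over one queue per cluster, once, before the search starts.
    static List<Queue<Integer>> distribute(List<Integer> jobs, int clusters) {
        List<Queue<Integer>> queues = new ArrayList<>();
        for (int c = 0; c < clusters; c++) {
            queues.add(new ConcurrentLinkedQueue<Integer>());
        }
        for (int j = 0; j < jobs.size(); j++) {
            queues.get(j % clusters).add(jobs.get(j));
        }
        return queues;
    }

    // A worker only ever asks the queue of its own cluster; null means that queue is empty.
    static Integer nextJob(List<Queue<Integer>> queues, int myCluster) {
        return queues.get(myCluster).poll();
    }

    public static void main(String[] args) {
        List<Integer> jobs = Arrays.asList(0, 1, 2, 3, 4, 5, 6, 7, 8, 9);
        List<Queue<Integer>> queues = distribute(jobs, 4);
        for (int c = 0; c < queues.size(); c++) {
            System.out.println("cluster " + c + ": " + queues.get(c));
        }
    }
}
```

Fetching a job never crosses a wide-area link; the trade-off is that a static split can leave one cluster idle while another still has work.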

Iterative Deepening A*
• Parallel search algorithm using work stealing
• Problem: inter-cluster work stealing
• Optimization: first look for work in local cluster
  - Easy to express using RMI

Figure: Clusters 1 and 2
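
The local-first policy can be written as two passes over the candidate victims. The sketch below is single-threaded and uses hypothetical names (LocalFirstStealingSketch, Worker, steal); a real system would make each steal attempt a remote call and would need synchronization on the job queues.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class LocalFirstStealingSketch {
    static class Worker {
        final int cluster;
        final Deque<String> jobs = new ArrayDeque<>();

        Worker(int cluster) {
            this.cluster = cluster;
        }
    }

    // Tries every victim in the thief's own cluster before any victim in another cluster.
    static String steal(Worker thief, List<Worker> all) {
        for (boolean localPass : new boolean[] { true, false }) {
            for (Worker victim : all) {
                if (victim == thief) {
                    continue;
                }
                boolean sameCluster = victim.cluster == thief.cluster;
                if (sameCluster != localPass) {
                    continue; // wrong pass for this victim
                }
                String job = victim.jobs.pollLast();
                if (job != null) {
                    return job;
                }
            }
        }
        return null; // nobody has work left
    }

    public static void main(String[] args) {
        Worker thief = new Worker(0);
        Worker remote = new Worker(1);
        Worker local = new Worker(0);
        remote.jobs.add("remote job");
        local.jobs.add("local job");
        List<Worker> all = new ArrayList<>();
        all.add(thief);
        all.add(remote);
        all.add(local);
        System.out.println(steal(thief, all));
        System.out.println(steal(thief, all));
    }
}
```

The wide-area link is only used once the local cluster has run dry, which keeps most steal attempts on the fast network.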

Performance
[Bar chart: speedup (axis 0-70) for SOR, ASP, TSP, and IDA* on 1 x 16 CPUs, 4 x 16 CPUs, 4 x 16 CPUs (optimized), and 1 x 64 CPUs]
• Wide-area DAS system: 4 clusters of 16 CPUs
• Comparison with single 16-node and 64-node cluster

Conclusions
• Fast RMI is possible through
  - Compiler-generated serialization, light-weight communication and RMI protocols
• Optimized wide-area applications are efficient
  - Reduce wide-area communication, or hide its latency
• Java RMI is easy to use, but some optimizations are awkward to express
  - No asynchronous communication or collective communication
• Programming systems should take the hierarchical structure of wide-area systems into account

http://www.cs.vu.nl/manta

Performance breakdown Manta
[Stacked bar chart (axis 0-300), Fast Ethernet: cost of an RMI with an empty parameter list, 1 object, 2 objects, and 3 objects, split into Communication, RMI Overhead, and Serialization; data labels on the slide include 227, 232, 235, and 243]