Cluster Computing with Java Threads Philip J. Hatcher University of New Hampshire...

Date posted: 21-Dec-2015

Page 1: Cluster Computing with Java Threads Philip J. Hatcher University of New Hampshire Philip.Hatcher@unh.edu.

Cluster Computing with Java Threads

Philip J. Hatcher
University of New Hampshire
Philip.Hatcher@unh.edu

Page 2: Collaborators

• UNH/Hyperion
  – Mark MacBeth and Keith McGuigan
• ENS-Lyon/DSM-PM2
  – Gabriel Antoniu, Luc Bougé and Raymond Namyst

Page 3: Focus

• Use Java “as is” for high-performance computing
  – support computationally intensive applications
  – utilize parallel computing hardware

Page 4: Outline

• Our Vision
• Java Threads
• The PM2 Run-time Environment
• Hyperion: Java Threads on Clusters
• Evaluation
• Related Work
• Conclusions

Page 5: Why Java?

• Soon to be ubiquitous!
  – use of Java is growing very rapidly
• Designed for portability:
  – develop programs on your desktop
  – run programs on a distant cluster

Page 6: Why Java?

• Explicitly parallel!
  – includes a threaded programming model
• Relaxed memory model
  – consistency model aids an implementation on distributed-memory parallel computers

Page 7: Unique Opportunity

• Use Java to bring parallelism to the “masses”
• Let’s not miss it!
• But, programmers will not accept syntax or model changes

Page 8: Open Question

• Parallelism via Java access to distributed-computing techniques?
  – e.g. RMI (remote method invocation)
• Or, parallelism via Java threads?

Page 9: That is, ...

• Does a user prefer to view a cluster as a collection of distinct machines?
• Or, does a user prefer to view a cluster as a “black box” that will simply run Java code faster?

Page 10: Are you “in a box”?

Page 11: Or, are you “thinking outside of the box”?

Page 12: Climb out of the box!

• Use Java threads “as is” to program clusters of computers.
• Program for the threaded Java virtual machine.
• Allow the implementation to handle the details of executing in a cluster.

Page 13: Java Threads

• Threads are objects.
• The class java/lang/Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.

Page 14: java/lang/Thread methods

• Thread() - constructor for thread object.
• start() - start the thread executing.
• run() - method invoked by ‘start’.
• stop(), suspend(), resume(), join(), yield().
• setPriority().
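A minimal sketch of the constructor/start()/run()/join() life cycle listed above. The class names `Worker` and `ThreadDemo` and the arithmetic in `run()` are hypothetical and chosen only for illustration; nothing here is specific to Hyperion.

```java
// Illustrative only: Worker and ThreadDemo are hypothetical names.
class Worker extends Thread {
    private final int id;
    private final long[] results;

    Worker(int id, long[] results) {
        this.id = id;
        this.results = results;
    }

    // run() is the body executed by the new thread after start() is called.
    @Override
    public void run() {
        long sum = 0;
        for (int i = 1; i <= 1000; i++) {
            sum += (long) i * (id + 1);
        }
        results[id] = sum; // each thread writes only its own slot
    }
}

class ThreadDemo {
    // Creates n threads, starts them, and waits for all of them with join().
    static long[] runWorkers(int n) {
        long[] results = new long[n];
        Worker[] workers = new Worker[n];
        for (int i = 0; i < n; i++) {
            workers[i] = new Worker(i, results);
            workers[i].start();
        }
        for (Worker w : workers) {
            try {
                w.join(); // results written by w are visible after join() returns
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return results;
    }

    public static void main(String[] args) {
        long[] r = runWorkers(4);
        for (int i = 0; i < r.length; i++) {
            System.out.println("worker " + i + " -> " + r[i]);
        }
    }
}
```

The program only talks to the thread API; where each thread actually runs is left to the implementation.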

Page 15: Java Synchronization

• Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it.
• Monitors utilize locks.
• There is a lock associated with each object.

Page 16: synchronized keyword

• synchronized ( Exp ) Block
• public class Q {
    synchronized void put(…) { … }
  }

Page 17: java/lang/Object methods

• wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released.
• notify() - an arbitrary thread in the wait set of this object is awakened and then competes again to get the lock for the object.
• notifyAll() - all waiting threads are awakened.
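The monitor operations above can be sketched with a small bounded queue. The class name `Q` and the method name `put` echo the earlier slide, but the `int` parameter, the `get` method, the capacity and the `QDemo` driver are assumptions made for this example, not the author's code.

```java
import java.util.ArrayDeque;

// Hypothetical bounded queue guarded by its own monitor.
class Q {
    private final ArrayDeque<Integer> items = new ArrayDeque<>();
    private final int capacity;

    Q(int capacity) {
        this.capacity = capacity;
    }

    // The caller holds this object's lock while inside a synchronized method.
    synchronized void put(int value) throws InterruptedException {
        while (items.size() == capacity) {
            wait(); // releases the lock and joins this object's wait set
        }
        items.addLast(value);
        notifyAll(); // wake waiting threads; they compete again for the lock
    }

    synchronized int get() throws InterruptedException {
        while (items.isEmpty()) {
            wait();
        }
        int value = items.removeFirst();
        notifyAll();
        return value;
    }
}

class QDemo {
    // A producer thread puts 1..count; the calling thread gets and sums them.
    static int transfer(int count) {
        Q q = new Q(2);
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    q.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < count; i++) {
                sum += q.get();
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }
}
```

The `while` loops around `wait()` re-check the condition after waking, since an awakened thread must compete for the lock again and the state may have changed in the meantime.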

Page 18: Shared-Memory Model

• Java threads execute in a virtual shared memory.
• All threads are able to access all objects.
• But threads may not access each other’s stacks.

Page 19: Java Memory Consistency

• A variant of release consistency.
• Threads can keep locally cached copies of objects.
• Consistency is provided by requiring that:
  – a thread's object cache be flushed upon entry to a monitor.
  – local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.

Page 20: PM2: A Distributed, Multithreaded Runtime Environment

• Thread library: Marcel
  – User-level
  – Supports SMP
  – POSIX-like
  – Preemptive thread migration
• Communication library: Madeleine
  – Portable: BIP, SISCI/SCI, MPI, TCP, PVM
  – Efficient

             Context Switch   Create
SMP          0.250 µs         2 µs
Non-SMP      0.120 µs         0.55 µs

             Latency          Bandwidth
SCI/SISCI    6 µs             70 MB/s
BIP/Myrinet  8 µs             125 MB/s

             Thread Migration
SCI/SISCI    24 µs
BIP/Myrinet  75 µs

Page 21: DSM-PM2: Architecture

• DSM comm:
  – send page request
  – send page
  – send invalidate request
  – …
• DSM page manager:
  – set/get page owner
  – set/get page access
  – add/remove to/from copyset
  – ...

[Figure: DSM-PM2 components (DSM Protocol Policy, DSM Protocol lib, DSM Page Manager, DSM Comm) and PM2 components (Madeleine Comms, Marcel Threads)]

Page 22: DSM-PM2: Performance

• SCI cluster has 450 MHz Pentium II nodes
• Myrinet cluster has 200 MHz Pentium Pro nodes

Operation/Protocol        SISCI/SCI   BIP/Myrinet   TCP/Myrinet
Page fault                18          56            56
Fault handling            1           2             2
Transmitting request      17          30            190
Processing request        1           2             2
Sending back 4 kB page    85          134           412
Installing the page       12          24            24
Total                     134 µs      248 µs        686 µs
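As a quick consistency check on the table above, the six per-step costs in each column add up to the reported total. The sketch below redoes that arithmetic and also computes the share of each total spent on the network (transmitting the request plus sending back the page). The class and method names are hypothetical, and the "network share" is a derived figure that does not appear on the slide.

```java
// Hypothetical helper for checking the cost breakdown in the table above.
class PageFaultCost {
    // Order: page fault, fault handling, transmitting request,
    // processing request, sending back 4 kB page, installing the page.
    static final int[] SISCI_SCI = {18, 1, 17, 1, 85, 12};
    static final int[] BIP_MYRINET = {56, 2, 30, 2, 134, 24};
    static final int[] TCP_MYRINET = {56, 2, 190, 2, 412, 24};

    // Sum of all steps, in the same time unit as the table.
    static int total(int[] steps) {
        int sum = 0;
        for (int step : steps) {
            sum += step;
        }
        return sum;
    }

    // Percentage of the total spent in the two network steps.
    static double networkShare(int[] steps) {
        return 100.0 * (steps[2] + steps[4]) / total(steps);
    }

    public static void main(String[] args) {
        System.out.printf("SISCI/SCI:   total %d, network %.1f%%%n",
                total(SISCI_SCI), networkShare(SISCI_SCI));
        System.out.printf("BIP/Myrinet: total %d, network %.1f%%%n",
                total(BIP_MYRINET), networkShare(BIP_MYRINET));
        System.out.printf("TCP/Myrinet: total %d, network %.1f%%%n",
                total(TCP_MYRINET), networkShare(TCP_MYRINET));
    }
}
```

By this measure the two network steps account for roughly two thirds to nearly nine tenths of a remote page load, depending on the protocol.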

Page 23: Hyperion

• Executes threaded Java programs on clusters.
• Built on top of PM2 and DSM-PM2.
  – Provides both portability and efficiency

Page 24: Reversing the Bytecode Stream

• Conventionally, users “pull” bytecode to their machines for local execution.
• Our vision:
  – users develop their high-performance Java programs using the Java toolset on their desktop.
  – they then “push” the resulting bytecode to a Hyperion server for high-performance cycles.

Page 25: Supporting High Performance

• Utilizes a bytecode-to-C translator.
• Parallel execution via spreading of Java threads across nodes of the cluster.
• Java threads implemented as lightweight threads using PM2 library.

Page 26: Compiling Java

• Hyperion designed for computationally intensive applications, so small overhead of translating bytecode is not important.
• Translating to C allows us to leverage the native C compiler and optimizer.

Page 27: General Hyperion Overview

[Figure: compilation pipeline. prog.java → javac (Sun's Java compiler) → prog.class (bytecode) → java2c (instruction-wise translation) → prog.[ch] → gcc -O6, with runtime libraries (libs) → prog]

Page 28: The Hyperion Run-Time System

• Collection of modules to allow “plug-and-play” implementations:
  – inter-node communication
  – threads
  – memory and synchronization
  – etc.

Page 29: Hyperion Internal Structure

[Figure: Hyperion modules (load balancer, native Java API, thread subsystem, memory subsystem, comm. subsystem), the PM2 API (pm2_rpc, pm2_thread_create, etc.), and PM2 modules (DSM subsystem, thread subsystem, comm. subsystem)]

Page 30: Thread and Object Allocation

• Currently, threads are allocated to processors in round-robin fashion.
• Currently, an object is allocated to the processor that holds the thread that is creating the object.
• Currently, DSM-PM2 is used to implement the Java memory model.
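The two placement rules above can be sketched in a few lines. The class `Placement` and its method names are hypothetical; Hyperion's own load balancer is not shown in these slides, so this only illustrates the stated policy.

```java
// Hypothetical model of the placement policy described above.
class Placement {
    private final int nodeCount;
    private int nextNode = 0;

    Placement(int nodeCount) {
        if (nodeCount <= 0) {
            throw new IllegalArgumentException("need at least one node");
        }
        this.nodeCount = nodeCount;
    }

    // Round-robin: successive threads go to nodes 0, 1, ..., nodeCount-1, 0, ...
    synchronized int placeThread() {
        int node = nextNode;
        nextNode = (nextNode + 1) % nodeCount;
        return node;
    }

    // An object's home is the node that holds the thread creating it.
    static int homeOfObject(int creatingThreadNode) {
        return creatingThreadNode;
    }
}
```

Under this policy the programmer influences data locality only indirectly, through which thread allocates which object.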

Page 31: Hyperion’s DSM API

• loadIntoCache
• invalidateCache
• updateMainMemory
• get
• put

Page 32: DSM Implementation

• Node-level caches.
• Page-based and home-based protocol.
• Log mods made to remote objects.
• Use explicit in-line checks in get/put.
• Each node allocates objects from a different range of the virtual address space.

Page 33: Details

• Objects are aligned on 64-byte boundaries.
• An object reference is the address of the base of the object.
• The bottom 6 bits of the ref can be used to store the node number of the object’s home.
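Because a 64-byte-aligned address always has its six low bits equal to zero, those bits are free to carry a node number from 0 to 63. The sketch below models addresses as `long` values to show the bit manipulation; the class `TaggedRef` and its methods are hypothetical, and the real runtime performs this in the generated C code rather than in Java.

```java
// Hypothetical model of a reference whose low 6 bits hold the home node.
class TaggedRef {
    static final int ALIGNMENT = 64;            // objects start on 64-byte boundaries
    static final long NODE_MASK = ALIGNMENT - 1; // 0x3F: the six low bits

    // Combine an aligned base address with a home node number (0..63).
    static long encode(long baseAddress, int homeNode) {
        if ((baseAddress & NODE_MASK) != 0) {
            throw new IllegalArgumentException("address is not 64-byte aligned");
        }
        if (homeNode < 0 || homeNode > NODE_MASK) {
            throw new IllegalArgumentException("node number must fit in 6 bits");
        }
        return baseAddress | homeNode;
    }

    static int homeNode(long ref) {
        return (int) (ref & NODE_MASK);
    }

    static long baseAddress(long ref) {
        return ref & ~NODE_MASK;
    }

    // The kind of test an inline check would make before touching the object.
    static boolean isRemote(long ref, int localNode) {
        return homeNode(ref) != localNode;
    }
}
```

With this encoding, deciding whether an object is remote costs a mask and a compare, with no table lookup.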

Page 34: More details

• loadIntoCache checks the 6 bits to see if an object is remote.
• If so, and if not already locally cached, DSM-PM2 is used to load the page(s) containing the object.
• When a remote object is cached, a bit is turned on in its header.

Page 35: Yet more details

• The put primitive checks the header bit to see if a modification should be logged.
• updateMainMemory sends the logged changes to the home node.
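A toy, single-process model of this logging scheme follows. It assumes objects are plain `int` arrays and that the home node's memory is just another array in the same JVM. The names `put`, `get` and `updateMainMemory` come from the slides, but their signatures here, and the classes `ObjectCopy` and `NodeCache`, are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical view of one object as seen from one node.
class ObjectCopy {
    final int[] fields;         // this node's copy of the fields
    final int[] homeFields;     // stands in for the home node's memory
    final boolean cachedRemote; // the "header bit": true for a cached remote object

    private ObjectCopy(int[] fields, int[] homeFields, boolean cachedRemote) {
        this.fields = fields;
        this.homeFields = homeFields;
        this.cachedRemote = cachedRemote;
    }

    // A local object: the node's copy is the home copy.
    static ObjectCopy local(int[] fields) {
        return new ObjectCopy(fields, fields, false);
    }

    // A cached copy of an object whose home is elsewhere.
    static ObjectCopy cacheOf(int[] homeFields) {
        return new ObjectCopy(homeFields.clone(), homeFields, true);
    }
}

class NodeCache {
    private static final class Mod {
        final ObjectCopy target;
        final int index;
        final int value;

        Mod(ObjectCopy target, int index, int value) {
            this.target = target;
            this.index = index;
            this.value = value;
        }
    }

    private final List<Mod> log = new ArrayList<>();

    int get(ObjectCopy obj, int index) {
        return obj.fields[index];
    }

    // Write locally; log the write only if the header bit marks a cached remote object.
    void put(ObjectCopy obj, int index, int value) {
        obj.fields[index] = value;
        if (obj.cachedRemote) {
            log.add(new Mod(obj, index, value));
        }
    }

    // Replay the logged writes at the home copies; returns how many were sent.
    int updateMainMemory() {
        int sent = log.size();
        for (Mod m : log) {
            m.target.homeFields[m.index] = m.value;
        }
        log.clear();
        return sent;
    }
}
```

In this model, calling `updateMainMemory` at monitor exit is what would make a thread's writes visible at the home node, matching the consistency rule stated earlier in the talk.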

Page 36: Evaluation

• Minimal-cost map-coloring application.
• Branch-and-bound algorithm.
• 64 threads, each with its own priority queue.
• Current best solution is shared.
• Problem size: 29 eastern-most states of USA with 4 colors of differing costs.
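The benchmark's source is not included in the slides. The sketch below is a much smaller stand-in that shows the same ingredients: branch-and-bound over partial colorings, one priority queue per thread, and a shared best cost protected by a monitor. It starts one thread per color of the first region rather than 64 threads, assumes at least one region and non-negative color costs, and uses hypothetical names throughout.

```java
import java.util.PriorityQueue;

class MapColoring {
    // Shared "current best solution" (cost only), guarded by its own monitor.
    static class Best {
        private int cost = Integer.MAX_VALUE;

        synchronized int get() { return cost; }

        synchronized void offer(int candidate) {
            if (candidate < cost) cost = candidate;
        }
    }

    // Partial coloring: regions 0..next-1 are colored, at the given total cost.
    static class Node implements Comparable<Node> {
        final int[] colors;
        final int next;
        final int cost;

        Node(int[] colors, int next, int cost) {
            this.colors = colors;
            this.next = next;
            this.cost = cost;
        }

        @Override
        public int compareTo(Node other) {
            return Integer.compare(cost, other.cost);
        }
    }

    // True if giving 'color' to region n.next clashes with a colored neighbor.
    static boolean conflicts(boolean[][] adj, Node n, int color) {
        for (int r = 0; r < n.next; r++) {
            if (adj[n.next][r] && n.colors[r] == color) return true;
        }
        return false;
    }

    // Each thread runs this with its own priority queue.
    static void search(boolean[][] adj, int[] colorCost, Node root, Best best) {
        PriorityQueue<Node> queue = new PriorityQueue<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            Node n = queue.poll();
            if (n.cost >= best.get()) continue;   // bound: cannot beat the best
            if (n.next == adj.length) {           // complete coloring
                best.offer(n.cost);
                continue;
            }
            for (int c = 0; c < colorCost.length; c++) {  // branch
                if (conflicts(adj, n, c)) continue;
                int[] colors = n.colors.clone();
                colors[n.next] = c;
                queue.add(new Node(colors, n.next + 1, n.cost + colorCost[c]));
            }
        }
    }

    // Returns the minimal total cost, or Integer.MAX_VALUE if no coloring exists.
    static int solve(boolean[][] adj, int[] colorCost) {
        Best best = new Best();
        Thread[] threads = new Thread[colorCost.length];
        for (int c = 0; c < colorCost.length; c++) {
            int[] colors = new int[adj.length];
            colors[0] = c;
            Node root = new Node(colors, 1, colorCost[c]);
            threads[c] = new Thread(() -> search(adj, colorCost, root, best));
            threads[c].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return best.get();
    }
}
```

The only data shared between threads is the `Best` object, so every read or update of the bound goes through a monitor.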

Page 37: Experimental Setting

• Two Linux 2.2 clusters:
  – eight 200 MHz Pentium Pro processors connected by Myrinet switch and using MPI over BIP.
  – four 450 MHz Pentium II processors connected by a SCI network and using SISCI.
• gcc 2.7.2.3 with -O6

Page 38: Performance Results

[Figure: time (sec, axis 0 to 700) versus number of nodes (1, 2, 4, 8) for 450MHz/SCI and 200MHz/BIP]

Page 39: Parallelizability

[Figure: parallelizability (axis 0 to 10) versus number of nodes (1, 2, 4, 8) for 200MHz/BIP, 450MHz/SCI and Ideal]

Page 40: Baseline Performance

• Compared serial Java to serial C for map-coloring application.
• Each program has single queue, single thread.

Page 41: Serial Java versus Serial C

[Figure: time (sec, axis 0 to 350) for C, Java, Java v2 and Java v3]

• Java v2: DSM checks disabled
• Java v3: DSM and array-bound checks disabled
• Executing on a single 450 MHz Pentium II

Page 42: Inline checks are expensive!

• Genericity of DSM-PM2 allows an alternative implementation.
• Use page-fault detection rather than inline check to detect non-local object.

Page 43: Using Page Faults: details

• An object reference is the address of the base of the object.
• loadIntoCache does nothing.
• DSM-PM2 is used to handle page faults generated by the get/put primitives.

Page 44: More details

• When an object is allocated, its address is appended to a list attached to the page that contains its header.
• When a page is loaded on a remote node, the list is used to turn on the header bit for all object headers on the page.
• The put primitive uses the header bit in the same manner as the inline-check version.
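The per-page bookkeeping described above can be modelled as follows. This is a sketch under simplifying assumptions: a page is represented as a list of object headers rather than raw memory, and the names `ObjectHeader`, `Page` and `loadOnRemoteNode` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical object header carrying the "cached remote copy" bit.
class ObjectHeader {
    final int id;
    boolean cachedRemote;

    ObjectHeader(int id) {
        this.id = id;
    }
}

class Page {
    // List attached to the page: one entry per object whose header lies on it.
    private final List<ObjectHeader> headers = new ArrayList<>();

    // Allocation appends the new object's header to the page's list.
    ObjectHeader allocate(int id) {
        ObjectHeader header = new ObjectHeader(id);
        headers.add(header);
        return header;
    }

    // Models loading this page on a remote node: the remote copy gets the
    // header bit turned on for every object found through the list.
    Page loadOnRemoteNode() {
        Page copy = new Page();
        for (ObjectHeader h : headers) {
            copy.allocate(h.id).cachedRemote = true;
        }
        return copy;
    }

    List<ObjectHeader> headers() {
        return headers;
    }
}
```

The list is needed because a raw page carries no record of where its object headers begin, so the loader could not otherwise find the bits to set.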

Page 45: Inline Check versus Page Fault

• IC has higher overhead for accessing objects (either local or locally cached).
• PF has higher overhead (signal handling and memory protection) for loading a page into the cache.

Page 46: IC versus PF: serial map-coloring

[Figure: time (sec, axis 0 to 350) for C, Java IC, Java PF, Java IC v2, Java PF v2, Java IC v3 and Java PF v3]

• Java XX v2: DSM checks disabled
• Java XX v3: DSM and array-bound checks disabled
• Executing on a single 450 MHz Pentium II

Page 47: IC versus PF: parallel map-coloring

[Figure: time (sec, axis 0 to 300) versus number of nodes (1, 2, 4) for IC and PF]

• Executing on 450MHz/SCI cluster.

Page 48: Related Work

• Java/MPI: cluster nodes are explicit
• Java/RMI: ditto
• Remote objects via RMI: nearly transparent
  – e.g. JavaParty, Do!
• Distributed interpreters
  – e.g. Java/DSM, MultiJav, cJVM

Page 49: Conclusions

• Approach is clean: Java “as is”
• Approach is promising
  – good parallelizability for map-coloring
  – need better scalar compilation
    • e.g. array bound-check removal
  – need further parallel application studies
    • are thread/object placement heuristics sufficient for programmers to write efficient programs?