
On the Conformance of a Cluster Implementation of the Java Memory Model

Philip J. Hatcher

University of New Hampshire

[email protected]

Collaborators

• ENS-Lyon
  – Gabriel Antoniu, Luc Bougé, Raymond Namyst

Cluster Environment

• collection of machines connected by a network
  – commercial distributed-memory parallel computer
  – “build-your-own” parallel computer

Cluster Implementation of Java

• Single JVM running on cluster.

• Nodes of cluster are transparent.

• Multithreaded applications exploit multiple processors of cluster.

Examples

• Java/DSM (Rice Univ. - Houston)
  – transparent heterogeneous computing

• cJVM (IBM Research - Haifa)
  – scalable Java server

• Jackal (Vrije Univ. - Amsterdam)
  – high-performance computing

Hyperion

• Cluster implementation of Java developed at the Univ. of New Hampshire.

• Currently built on top of the PM2 distributed, multithreaded runtime environment from ENS-Lyon.

Motivation

• Use Java “as is” for high-performance computing
  – support computationally intensive applications
  – utilize parallel computing hardware

Why Java?

• Explicitly parallel!
  – includes a threaded programming model

• Relaxed memory model
  – consistency model aids an implementation on distributed-memory parallel computers

Java Threads

• Threads are objects.

• The class java/lang/Thread contains all of the methods for initializing, running, suspending, querying and destroying threads.

java/lang/Thread methods

• Thread() - constructor for thread object.

• start() - start the thread executing.

• run() - method invoked by ‘start’.

• stop(), suspend(), resume(), join(), yield().

• setPriority().
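The methods listed above can be exercised with a small sketch like the following. It is an illustration only: the names SumWorker, ThreadDemo and parallelSum are hypothetical and do not come from the talk, and only Thread(), start(), run() and join() are used.

```java
// Hypothetical worker: sums the integers in [from, to).
class SumWorker extends Thread {
    private final int from, to;
    long sum; // read by the parent only after join()

    SumWorker(int from, int to) {
        this.from = from;
        this.to = to;
    }

    @Override
    public void run() { // executed in the new thread once start() is called
        for (int i = from; i < to; i++) {
            sum += i;
        }
    }
}

class ThreadDemo {
    // Splits [0, n) across nThreads workers and adds up their partial sums.
    static long parallelSum(int n, int nThreads) {
        SumWorker[] workers = new SumWorker[nThreads];
        int chunk = n / nThreads;
        for (int t = 0; t < nThreads; t++) {
            int lo = t * chunk;
            int hi = (t == nThreads - 1) ? n : lo + chunk;
            workers[t] = new SumWorker(lo, hi);
            workers[t].start();
        }
        long total = 0;
        try {
            for (SumWorker w : workers) {
                w.join(); // wait for the worker to finish before reading its sum
                total += w.sum;
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(parallelSum(1000, 4));
    }
}
```

On a cluster JVM of the kind described here, each started thread could in principle land on a different node without any change to this code.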

Java Synchronization

• Java uses monitors, which protect a region of code by allowing only one thread at a time to execute it.

• Monitors utilize locks.

• There is a lock associated with each object.

synchronized keyword

• synchronized ( Exp ) Block

• public class Q {
    synchronized void put(…) {
      …
    }
  }
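As a hedged illustration of the two forms above, the sketch below uses a synchronized block and a synchronized method that both take the lock of the same object. The classes Counter and CounterDemo are hypothetical names introduced for this example.

```java
class Counter {
    private int count;

    // Statement form: synchronized ( Exp ) Block, here locking this object.
    void increment() {
        synchronized (this) {
            count++;
        }
    }

    // Method form: implicitly takes the same lock (this) for the whole body.
    synchronized int get() {
        return count;
    }
}

class CounterDemo {
    // Starts the given number of threads, each incrementing a shared counter.
    static int run(int threads, int perThread) {
        Counter counter = new Counter();
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            ts[t].start();
        }
        try {
            for (Thread t : ts) {
                t.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4, 1000));
    }
}
```

Without the synchronized block, concurrent increments could be lost, so the final count would not be guaranteed.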

java/lang/Object methods

• wait() - the calling thread, which must hold the lock for the object, is placed in a wait set associated with the object. The lock is then released.

• notify() - an arbitrary thread in the wait set of this object is awakened and then competes again to get the lock for the object.

• notifyAll() - all waiting threads are awakened.
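A minimal sketch of how wait() and notifyAll() cooperate, assuming a one-slot queue. The class name Q echoes the earlier slide, but its fields, the get() method and the QDemo driver are assumptions made for this example.

```java
// One-slot queue: put() blocks while the slot is full, get() blocks while it is empty.
class Q {
    private Integer slot; // null means the slot is empty

    synchronized void put(int value) throws InterruptedException {
        while (slot != null) {
            wait(); // joins this object's wait set and releases its lock
        }
        slot = value;
        notifyAll(); // wake any thread waiting in get()
    }

    synchronized int get() throws InterruptedException {
        while (slot == null) {
            wait();
        }
        int value = slot;
        slot = null;
        notifyAll(); // wake any thread waiting in put()
        return value;
    }
}

class QDemo {
    // A producer thread sends 1..count through the queue; the caller sums what it receives.
    static int transfer(int count) {
        Q q = new Q();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) {
                    q.put(i);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < count; i++) {
                sum += q.get();
            }
            producer.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(transfer(100));
    }
}
```

The wait() calls sit inside while loops because an awakened thread must compete for the lock again and may find that the condition no longer holds.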

Shared-Memory Model

• Java threads execute in a virtual shared memory.

• All threads are able to access all objects.

• But threads may not access each other’s stacks.

Java Memory Consistency

• A variant of release consistency.

• Threads can keep locally cached copies of objects.

• Consistency is provided by requiring that:
  – a thread's object cache be flushed upon entry to a monitor.
  – local modifications made to cached objects be transmitted to the central memory when a thread exits a monitor.
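The two rules above can be pictured with a toy, single-process model. It is a sketch under stated assumptions and is not Hyperion code: MainMemory, CachingThread and ConsistencyDemo are hypothetical names, variables are identified by strings, and pending modifications are written back before the cache is flushed so that they are not lost.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for the central (main) memory.
class MainMemory {
    final Map<String, Integer> values = new HashMap<>();
}

// Stand-in for one thread with a private cache of variables.
class CachingThread {
    private final MainMemory memory;
    private final Map<String, Integer> cache = new HashMap<>(); // locally cached copies
    private final Map<String, Integer> dirty = new HashMap<>(); // local modifications

    CachingThread(MainMemory memory) {
        this.memory = memory;
    }

    // Reads through the cache; a miss fetches the value from main memory.
    int use(String name) {
        return cache.computeIfAbsent(name, k -> memory.values.getOrDefault(k, 0));
    }

    void assign(String name, int value) {
        cache.put(name, value);
        dirty.put(name, value);
    }

    // Rule 1: flush the cache on monitor entry (assumption: write back pending changes first).
    void enterMonitor() {
        writeBack();
        cache.clear();
    }

    // Rule 2: transmit local modifications to main memory on monitor exit.
    void exitMonitor() {
        writeBack();
    }

    private void writeBack() {
        memory.values.putAll(dirty);
        dirty.clear();
    }
}

class ConsistencyDemo {
    // Returns {value seen from a stale cache, value seen after entering a monitor}.
    static int[] run() {
        MainMemory memory = new MainMemory();
        memory.values.put("x", 7);
        CachingThread t0 = new CachingThread(memory);
        CachingThread t1 = new CachingThread(memory);

        t0.use("x");            // t0 caches x = 7
        t1.enterMonitor();
        t1.assign("x", 17);
        t1.exitMonitor();       // x = 17 reaches main memory

        int stale = t0.use("x"); // still the cached copy
        t0.enterMonitor();       // flush forces a fresh fetch
        int fresh = t0.use("x");
        t0.exitMonitor();
        return new int[] { stale, fresh };
    }

    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1]);
    }
}
```

The stale read outside a monitor is what the relaxed model permits, and it is what lets a distributed implementation avoid communicating on every access.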

General Hyperion Overview

[Figure: compilation pipeline. prog.java → javac (Sun's Java compiler) → prog.class (bytecode) → java2c (instruction-wise translation) → prog.[ch] → gcc -O6, linked with the runtime libraries (libs) → prog]

The Hyperion Run-Time System

• Collection of modules to allow “plug-and-play” implementations:
  – inter-node communication
  – threads
  – memory and synchronization
  – etc.

Thread and Object Allocation

• Currently, threads are allocated to processors in round-robin fashion.

• Currently, an object is allocated to the processor that holds the thread that is creating the object.

• Currently, DSM-PM2 is used to implement the Java memory model.
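The two placement policies above are simple enough to state in a few lines. The sketch below is illustrative only: the Placement class and its method names are hypothetical, and nodes are identified by plain integers.

```java
class Placement {
    private final int nodes;
    private int next = 0;

    Placement(int nodes) {
        if (nodes <= 0) {
            throw new IllegalArgumentException("nodes must be positive");
        }
        this.nodes = nodes;
    }

    // Round-robin: each new thread goes to the next node, wrapping around.
    synchronized int nodeForNewThread() {
        int node = next;
        next = (next + 1) % nodes;
        return node;
    }

    // An object is placed on the node that holds its creating thread.
    static int nodeForNewObject(int creatorNode) {
        return creatorNode;
    }

    public static void main(String[] args) {
        Placement p = new Placement(3);
        for (int i = 0; i < 4; i++) {
            System.out.println("thread " + i + " -> node " + p.nodeForNewThread());
        }
    }
}
```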

Hyperion Internal Structure

[Figure: layered architecture. Hyperion modules (load balancer, native Java API, thread subsystem, memory subsystem, comm. subsystem) sit on the PM2 API (pm2_rpc, pm2_thread_create, etc.); PM2 provides the DSM subsystem, thread subsystem and comm. subsystem.]

PM2: A Distributed, Multithreaded Runtime Environment

• Thread library: Marcel
  – User-level
  – Supports SMP
  – POSIX-like
  – Preemptive thread migration

• Communication library: Madeleine
  – Portable: BIP, SISCI/SCI, MPI, TCP, PVM
  – Efficient

Marcel thread operations:

            Context switch   Create
SMP         0.250 µs         2 µs
Non-SMP     0.120 µs         0.55 µs

Madeleine communication:

              Latency   Bandwidth
SCI/SISCI     6 µs      70 MB/s
BIP/Myrinet   8 µs      125 MB/s

Thread migration:

SCI/SISCI     24 µs
BIP/Myrinet   75 µs

DSM-PM2: Architecture

• DSM comm:
  – send page request
  – send page
  – send invalidate request
  – …

• DSM page manager:
  – set/get page owner
  – set/get page access
  – add/remove to/from copyset
  – ...

[Figure: DSM-PM2 layers. DSM protocol policy and DSM protocol lib sit on the DSM page manager and DSM comm, which are built on PM2 (Marcel threads, Madeleine comms).]

Hyperion’s DSM API

• loadIntoCache

• invalidateCache

• updateMainMemory

• get

• put

DSM Implementation

• Node-level caches.

• Page-based and home-based protocol.

• Use page faults to detect remote objects.

• Log modifications made to remote objects.

• Each node allocates objects from a different range of the virtual address space.

Using Page Faults: details

• An object reference is the address of the base of the object.

• loadIntoCache does nothing.

• DSM-PM2 is used to handle page faults generated by the get/put primitives.

More details

• When an object is allocated, its address is appended to a list attached to the page that contains its header.

• When a page is loaded on a remote node, the list is used to turn on the header bit for all object headers on the page.

• The put primitive checks the header bit to see if a modification should be logged.

• updateMainMemory sends the logged changes to the home node.
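The header-bit check and modification log described above might look roughly like the sketch below. This is a single-process model written for illustration, not Hyperion's implementation (which is generated C on top of DSM-PM2): HObject, NodeDsm, cachedCopyOf and pendingMods are hypothetical names, objects hold only int fields, and "sending to the home node" is modelled as a direct write to the home copy.

```java
import java.util.ArrayList;
import java.util.List;

// An object is either a home copy (home == null) or a cached copy on a remote node.
class HObject {
    final int[] fields;
    final HObject home;  // the home copy, or null if this is the home copy
    boolean headerBit;   // set when the object arrived on a page loaded from a remote home

    HObject(int nFields) {
        this.fields = new int[nFields];
        this.home = null;
    }

    private HObject(HObject home) {
        this.fields = home.fields.clone();
        this.home = home;
        this.headerBit = true; // models "turn on the header bit" at page load
    }

    static HObject cachedCopyOf(HObject home) {
        return new HObject(home);
    }
}

// Per-node state: the log of modifications made to cached (remote) objects.
class NodeDsm {
    private static final class Mod {
        final HObject target;
        final int field;
        final int value;

        Mod(HObject target, int field, int value) {
            this.target = target;
            this.field = field;
            this.value = value;
        }
    }

    private final List<Mod> log = new ArrayList<>();

    int get(HObject o, int field) {
        return o.fields[field];
    }

    // put checks the header bit to decide whether the write must be logged.
    void put(HObject o, int field, int value) {
        o.fields[field] = value;
        if (o.headerBit) {
            log.add(new Mod(o, field, value));
        }
    }

    int pendingMods() {
        return log.size();
    }

    // Applies the logged changes to the home copies, then empties the log.
    void updateMainMemory() {
        for (Mod m : log) {
            m.target.home.fields[m.field] = m.value;
        }
        log.clear();
    }
}
```

A write to a home copy is never logged because it already modifies main memory directly, which matches the later remark that values whose homes are on this node are directly accessed.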

Benchmarking

• Two Linux 2.2 clusters:
  – twelve 200 MHz Pentium Pro processors connected by a Myrinet switch and using BIP.
  – six 450 MHz Pentium II processors connected by a SCI network and using SISCI.

• gcc 2.7.2.3 with -O6

[Figure: Pi (50M intervals). Execution time in seconds versus number of nodes (1 to 12) for the 200 MHz/BIP and 450 MHz/SCI clusters.]

[Figure: Jacobi (1024x1024). Execution time in seconds versus number of nodes (1 to 12) for the 200 MHz/BIP and 450 MHz/SCI clusters.]

[Figure: Traveling Salesperson (17 cities). Execution time in seconds versus number of nodes (1 to 12) for the 200 MHz/BIP and 450 MHz/SCI clusters.]

[Figure: All-pairs Shortest Path (2K nodes). Execution time in seconds versus number of nodes (1 to 12) for the 200 MHz/BIP and 450 MHz/SCI clusters.]

[Figure: Barnes-Hut (16K bodies). Execution time in seconds versus number of nodes (1 to 12) for the 200 MHz/BIP and 450 MHz/SCI clusters.]

Correctness

• Does the Hyperion approach fully support the Java Memory Model?

• Hyperion follows the operational specification of the JMM with two exceptions:
  – page granularity
  – node-level caches

Java Memory Model: the actors

• Threads: lock, unlock, load, store, use, assign

• Main Memory: lock, unlock, read, write

Entry to Monitor

[Figure: ordering of actions on entry to a monitor. Thread actions: lock, load, use. Main memory actions: lock, read.]

Exit from Monitor

[Figure: ordering of actions on exit from a monitor. Thread actions: assign, store, unlock. Main memory actions: write, unlock.]

Serializability

• Main memory actions are serializable for a single variable or lock.

Page Granularity

• Hyperion fetches remote memory locations with page granularity.
  – always OK to perform load/read actions “early”.

Node-level Caches

• All threads on a node share a cache.

• Cache contains values being accessed whose homes are on other nodes.

• Values whose homes are on this node are directly accessed.
  – as if every use is immediately preceded by a load/read and every assign is immediately followed by a store/write.

Node-level Caches

• If one thread invalidates the cache, the cache is invalidated for all threads.
  – always OK to do “extra” load/read actions.

• If one thread updates main memory, all threads do.
  – always OK to do “early” store/write actions.

Node-level Caches

• If one thread performs a load/read, then all threads see the result.
  – as if all threads perform the load/read.

• If one thread performs an assign, then other threads see the result before the subsequent store/write.
  – hmmm...

Implementation Traces

• load/read across cluster implemented by request-send message pair.

• store/write across cluster implemented by transmit-receive message pair.

An Example

Node 0
  request x
  T0: use x (7)
  T1: assign x = 17
  T2: use x (17)
  transmit x (17)

Home(x)
  send x (value 7) to Node 0
  send x (7) to Node 1
  receive x (17) from Node 0
  receive x (19) from Node 1

Node 1
  request x
  T3: use x (7)
  T4: assign x = 19
  T5: use x (19)
  transmit x (19)

Serialization

Main Memory
  read x (value 7) for T0
  read x (7) for T3
  write x (17) for T1
  read x (17) for T2
  write x (19) for T4
  read x (19) for T5
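One way to see why this serialization is acceptable is to check that every read returns the value of the most recent write to x. The sketch below performs only that check for a single variable; it is not the construction algorithm itself. MemAction and SerializationCheck are hypothetical names, and the initial value 7 is taken from the example.

```java
import java.util.Arrays;
import java.util.List;

// One main-memory action on a single variable: a read or a write of a value for a thread.
class MemAction {
    final boolean isWrite;
    final int value;
    final String thread;

    private MemAction(boolean isWrite, int value, String thread) {
        this.isWrite = isWrite;
        this.value = value;
        this.thread = thread;
    }

    static MemAction read(int value, String thread) {
        return new MemAction(false, value, thread);
    }

    static MemAction write(int value, String thread) {
        return new MemAction(true, value, thread);
    }
}

class SerializationCheck {
    // True if each read in the sequence returns the latest written (or initial) value.
    static boolean isLegal(int initial, List<MemAction> actions) {
        int current = initial;
        for (MemAction a : actions) {
            if (a.isWrite) {
                current = a.value;
            } else if (a.value != current) {
                return false;
            }
        }
        return true;
    }

    // The main-memory sequence from the example.
    static List<MemAction> slideExample() {
        return Arrays.asList(
            MemAction.read(7, "T0"),
            MemAction.read(7, "T3"),
            MemAction.write(17, "T1"),
            MemAction.read(17, "T2"),
            MemAction.write(19, "T4"),
            MemAction.read(19, "T5"));
    }

    public static void main(String[] args) {
        System.out.println(isLegal(7, slideExample()));
    }
}
```

Note that T2's read of 17 is placed after T1's write, even though on Node 0 the use happened before the value was transmitted; the serialization describes an order that main memory could have followed, not the wall-clock order of messages.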

Algorithm

• The model-satisfying serialization can always be constructed from the node-level traces of the Hyperion actions.

• Merge node traces, controlled by the pairing of request-send and transmit-receive actions.

• Correct, if memory actions at the home node are serializable.

Therefore:

• Node-level caches conform to the JMM.

• Simplify implementation.

• Reduce memory consumption.

• Facilitate pre-fetching.

• However, “invalidate one, invalidate all”.

Implementation Difficulties

• Node-level concurrency makes the implementation tricky. (duh!)
  – concurrent page fetches
  – cache invalidate while page fetch pending?
  – thread reading page as it is being installed

Java API Additions?

• These would be desirable:
  – barrier synchronizations
  – data reductions
  – query cluster configuration
    • i.e. number of nodes

• Is this cheating?
  – no longer “as is” Java?

API implementation

• Careful implementation of API extensions can lessen potential cost of “invalidate one, invalidate all”.

• Implementation of barrier should only invalidate local cache once all threads have reached barrier.
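The barrier idea above can be sketched with the standard java.util.concurrent.CyclicBarrier, whose barrier action runs once per barrier point, after the last party has arrived. This is an analogy on a single JVM, not the proposed cluster API: java.util.concurrent postdates the JDK 1.1 baseline mentioned in this talk, and invalidateCache here is a hypothetical stand-in that only counts how often it is called.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicInteger;

class NodeBarrierDemo {
    // Counts calls to the stand-in for the node-level cache invalidation.
    static final AtomicInteger invalidations = new AtomicInteger();

    // Hypothetical stand-in for invalidating the node-level cache.
    static void invalidateCache() {
        invalidations.incrementAndGet();
    }

    // Runs the given number of threads through one barrier; returns the invalidation count.
    static int run(int threads) {
        invalidations.set(0);
        // The barrier action runs once, when the last thread reaches the barrier.
        CyclicBarrier barrier = new CyclicBarrier(threads, NodeBarrierDemo::invalidateCache);
        Thread[] ts = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            ts[t] = new Thread(() -> {
                try {
                    barrier.await();
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            ts[t].start();
        }
        try {
            for (Thread t : ts) {
                t.join();
            }
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return invalidations.get();
    }

    public static void main(String[] args) {
        System.out.println(run(4));
    }
}
```

If each thread instead invalidated the shared cache as it arrived, threads still computing before the barrier would lose their cached pages and have to refetch them.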

Level of Transparency

• Consider the current Hyperion thread/object allocation strategies:
  – not mandated by Java Language Spec
  – might be superseded by smarter run-time lib
  – but, still good guidelines for programmer?
    • i.e. if I didn’t create the object, it might be expensive to access.
    • not unreasonable to expect user to be aware that there might be an extended memory hierarchy.

Final Thoughts

• Java is an extremely attractive vehicle for programming parallel computers.

• Can acceptable performance be obtained?
  – Scalar execution
  – DSM implementation
  – Application locality
  – Scalability

More Final Thoughts

• Extended API for threads required.

• Transparent implementations are necessary.

• JMM specification under review but it will remain based upon “release consistency”, which is important.

Support for Reflection

• Object headers point to virtual method table, which also contains a pointer to a class database structure.

• The classDB contains all information pertinent to class introspection.

• The classDBs are initialized at program start-up using the tree of static class dependences rooted at the class that contains the main method to be called.

Static Initialization

• Static initializers are executed in a postorder traversal of the static class dependence tree.
  – violates the JLS as static initializers should not execute until the class is first accessed.
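The JLS behaviour referred to above can be observed on a standard JVM with a sketch like the following, where a nested class's static initializer runs only when one of its static fields is first read. InitOrderDemo and Lazy are hypothetical names used for illustration.

```java
class InitOrderDemo {
    // Records the order in which events happen.
    static final StringBuilder trace = new StringBuilder();

    static class Lazy {
        static int value; // not a compile-time constant, so reading it triggers initialization

        static {
            trace.append("Lazy-init;");
            value = 42;
        }
    }

    static String run() {
        trace.append("main-start;");
        int v = Lazy.value; // first access: the JLS requires Lazy to be initialized here
        trace.append("read-" + v + ";");
        return trace.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

Under an eager scheme such as the start-up traversal described above, the initializer would presumably run before the program's own code, so a program whose behaviour depends on this ordering could observe the difference.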

Class Loading

• Not currently supported.

• But conceptually possible.
  – dynamically compile and then dynamically link in the new class.
  – many details to be worked out, however.

java2c Details

• Maps JVM stack locations to C locals.
  – relies on gcc to map these to registers

• Interface method invocation is implemented via a table of pairs:
  – (interface VMT, implementation VMT)

• Exceptions are implemented via setjmp and longjmp.

• Null references are detected by segfault handler.

Future Work

• Continue to investigate DSM protocols
  – eliminate check in put primitive
  – compare object-based and page-based protocols

• Implement message aggregation
  – send mods for multiple pages in one message
  – prefetch pages in groups

Future Work

• Utilize multiprocessor nodes
  – true node-level concurrency puts more demands on the implementation

• java2IC
  – java2c has been modified to generate a (three-address) intermediate code
  – want to connect to SUIF, or some other compiler toolset

Future Work

• Further benchmarking
  – converting SPLASH-2 applications to Java

• Java API
  – minimal support now (and still JDK 1.1)
  – is there a “clean” API implementation?
  – evaluate extensions: barriers, etc.
  – distributed garbage collection