Porting NANOS on SDSM
Transcript of Porting NANOS on SDSM
Porting NANOS on SDSM

GOAL: porting a shared memory environment to distributed memory. What is missing from current SDSMs?

Christian Perez
Who am I?
• December 1999: PhD at LIP, ENS Lyon, France
  – data-parallel languages, distributed memory, load balancing, preemptive thread migration
• Winter 1999/2000: TMR at UPC
  – OpenMP, Nanos, SDSM
• October 2000: INRIA researcher
  – distributed programs, code coupling
Contents
• Motivation
• Related work
• Nanos execution model (NthLib)
• Nanos on top of 2 SDSMs (JIAJIA & DSM-PM2)
• Missing SDSM functionalities
• Conclusion
Motivation
• OpenMP: emerging standard
  – simplicity (no data distribution)
• Clusters of machines (mono- or multiprocessor)
  – excellent performance/price ratio
⇒ OpenMP on top of a cluster!
OpenMP / Cluster: HOW?
• OpenMP paradigm: shared memory
• Cluster paradigm: message passing
⇒ Use a software DSM system!
Hardware DSM systems, e.g. SCI (write: 2 µs):
  – specific hardware
  – not yet stable
Related work
• Several OpenMP/DSM implementations
  – OpenMP NOW!, Omni
• But:
  – modification of OpenMP semantics
  – one level of parallelism
  – do not exploit high-performance networks
OpenMP on classical DSM
• Compiler extracts shared data from the stack
  – expensive local variable creation
  – shared memory allocation
• Modification of the OpenMP standard:
  – default should be private instead of shared
  – new synchronization primitives:
    • condition variables & semaphores
OpenMP on classical DSM
• One level of parallelism (SPMD)

!$omp parallel do
do i = 1,4
  x(i) = x(i) + x(i+1)
end do

is compiled (SPMD) to:

call dsm_barrier()
call schedule(lb, ub, …)
do i = lb, ub
  x(i) = x(i) + x(i+1)
end do
call dsm_barrier()

Taken from pdplab.trc.rwcp.or.jp/pdperf/Omni/wgcc2k/
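The schedule(lb, ub, …) call in the compiled form above computes each node's contiguous chunk of the iteration space. A minimal sketch of that block scheduling, with schedule_chunk as a hypothetical stand-in for the runtime's real call:

```c
#include <stddef.h>

/* Hypothetical stand-in for the runtime's schedule(lb, ub, ...) call:
 * split the iteration space [1, n] into near-equal contiguous chunks,
 * one per node; the first `rem` nodes take one extra iteration. */
static void schedule_chunk(int n, int nodes, int node, int *lb, int *ub)
{
    int base = n / nodes, rem = n % nodes;
    *lb = 1 + node * base + (node < rem ? node : rem);
    *ub = *lb + base - 1 + (node < rem ? 1 : 0);
}
```

Each node then runs its own `do i = lb, ub` body and meets the others at the closing dsm_barrier().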
Omni compilation approach
Our goals
• Support the OpenMP standard
• High performance
• Allow exploitation of
  – multithreading (SMP)
  – high-performance networks
Nanos OpenMP compiler
• Converts an OpenMP program to a task graph
• Communications via shared memory

!$omp parallel do
do i = 1,4
  x(i) = x(i) + x(i+1)
end do

Task graph: two tasks, i=1,2 and i=3,4
NthLib runtime support
• The Nanos compiler generates intermediate code
• Communications still via shared memory

call nthf_depadd(…)
do nth_p = 1, proc
  nth = nthf_create_1s(…, f, …)
end do
call nth_block()

subroutine f(…)
  x(i) = x(i) + x(i+1)
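nthf_depadd/nth_block rest on a dependence-counting model: the parent registers how many descendants it waits on and may proceed once the count drains to zero. A single-threaded C sketch of the idea (names such as nth_desc and depadd are illustrative, not NthLib's actual API):

```c
/* Illustrative dependence counter behind nthf_depadd()/nth_block():
 * the parent declares how many successors must finish before it may
 * continue; each finishing nano-thread decrements the count. */
typedef struct nth_desc { int ndeps; } nth_desc;

static void depadd(nth_desc *t, int n) { t->ndeps += n; }
static void dep_satisfied(nth_desc *t) { t->ndeps--; }
static int  blocked(const nth_desc *t) { return t->ndeps > 0; }
```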
NthLib details
• Assumes it runs on top of kernel threads
• Provides user-level threads (QT)
  – stack management (allocation)
  – stack initialization (arguments)
  – explicit context switch
NthLib queues
• Global/local
• Thread descriptor
  – rich functionality
• Work descriptor
  – high performance
NthLib: memory management
• Mutual exclusion ⇒ mmap allocation
• SLOT_SIZE stack alignment
Diagram: stack slot layout — nano-thread descriptor (with its successors), the stack, then a guard zone.
Porting NthLib to SDSM
• Data consistency
• Shared memory management
• Nanos threads
• JIAJIA implementation
• DSM-PM2 implementation
• Summary of DSM requirements
Data consistency
• Mutual exclusion for defined data structures ⇒ acquire/release
• User-level shared memory data ⇒ barrier
Diagram: barriers separate the phases that access user-level shared data.
Shared memory management
• Asynchronous shared memory allocation
• Alignment parameter (> PAGE_SIZE)
• Global variables / common declarations ⇒ not yet supported
Nano-threads
• Run-to-block execution model
• Shared stacks (father/sons relationship)
• Implicit thread migration (scheduler)
JIAJIA
• Developed in China by W. Hu, W. Shi & Z. Tang
• Public-domain DSM
• User-level DSM
• DSM: lock/unlock, barrier, condition variables
• MP: send/receive, broadcast, reduce
• Solaris, AIX, Irix, Linux, NT (not distributed)
JIAJIA: memory allocation
• No control of memory alignment (x2)
• Synchronous memory allocation primitive
⇒ Development of an RPC version
  – based on the send/receive primitives
  – adds a user-level message handler
⇒ Problems
  – global lock
  – interference with JIAJIA blocking functions
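The RPC version built over JIAJIA's send/receive boils down to a handler table driven by the user-level message handler. In this self-contained sketch the "network" is collapsed to a direct call, and rpc_register/rpc_call/double_it are illustrative names, not JIAJIA's API:

```c
/* Requests carry a handler id; the user-level message handler looks
 * the id up and dispatches. In the real port the request and reply
 * travel through JIAJIA's send/receive primitives. */
typedef int (*rpc_handler)(int arg);

#define MAX_HANDLERS 8
static rpc_handler handlers[MAX_HANDLERS];

static void rpc_register(int id, rpc_handler h) { handlers[id] = h; }

static int rpc_call(int id, int arg)   /* network transport elided */
{
    return handlers[id](arg);
}

static int double_it(int arg) { return 2 * arg; }  /* sample handler */
```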
JIAJIA: discussion
• Global barrier for data synchronization
  ⇒ no multiple levels of parallelism
• Not thread-aware
  ⇒ no efficient use of SMP nodes
DSM/PM2
• Developed at LIP by G. Antoniu (PhD student)
• Public domain
• User level, module of PM2
• Generic, multi-protocol DSM
• DSM: lock/unlock
• MP: LRPC
• Linux, Solaris, Irix (32 bits)
PM2 organization
Diagram: PM2 sits on top of TBX and NTBX; MARCEL (mono, SMP, activation) provides threads; the DSM module communicates through MAD1 (TCP, PVM, MPI, SCI, VIA, SBP) or MAD2 (TCP, MPI, SCI, VIA, BIP).
http://www.pm2.org
DSM/PM2: memory allocation
• Only static memory allocation
⇒ Build a dynamic memory allocation primitive
  – centralized memory allocation: LRPC to node 0
  – integration of an alignment parameter
Summer 2000: dynamic memory allocation ready!
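The node-0 side of that centralized allocator amounts to a bump pointer with the alignment parameter folded in. This sketch shows only the arithmetic; the base address and shared_alloc are illustrative, and the LRPC transport is elided:

```c
#include <stdint.h>
#include <stddef.h>

/* Node 0 owns a cursor into the shared region and hands out aligned
 * blocks; other nodes reach this code through an LRPC. `align` must
 * be a power of two (e.g. >= PAGE_SIZE, as required above). */
static uintptr_t cursor = 0x10000;   /* illustrative region base */

static uintptr_t shared_alloc(size_t size, size_t align)
{
    uintptr_t p = (cursor + align - 1) & ~((uintptr_t)align - 1);
    cursor = p + size;
    return p;
}
```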
DSM/PM2: marcel descriptor
Diagram: the marcel_t descriptor sits at a page boundary and is located via (sp & MASK) + SLOT_SIZE.
NthLib requirement: one kernel thread runs many nano-threads.
DSM/PM2: marcel descriptor
Diagram: with many nano-threads per kernel thread, the slot instead stores a marcel_t* at the page boundary; the descriptor is reached through one extra indirection, *((sp & MASK) + SLOT_SIZE).
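The (sp & MASK) + SLOT_SIZE trick can be sketched as below: because every stack slot is SLOT_SIZE-aligned, masking any stack address recovers the slot, and the marcel_t* stored at its top gives the current descriptor. desc_slot and the sample marcel_t are illustrative; the pointer is placed sizeof(marcel_t*) below the boundary so the access stays in-bounds:

```c
#include <stdint.h>
#include <stdlib.h>

#define SLOT_SIZE ((uintptr_t)(64 * 1024))   /* illustrative; power of two */
#define MASK      (~(SLOT_SIZE - 1))

typedef struct marcel { int id; } marcel_t;  /* stand-in descriptor */

/* With many nano-threads per kernel thread, the slot top holds a
 * marcel_t* (one extra indirection), so every nano-thread running in
 * the slot resolves to the kernel thread's descriptor. */
static marcel_t **desc_slot(uintptr_t sp)
{
    return (marcel_t **)((sp & MASK) + SLOT_SIZE - sizeof(marcel_t *));
}
```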
DSM/PM2: discussion
• Uses page-level sequential consistency
  + no need for barriers (multiple levels of parallelism)
  – false sharing ⇒ dedicated stack layout: pad up to the page boundary before the marcel_t*
DSM/PM2: discussion (cont.)
• No alternate stack for the signal handler
  ⇒ prefetch pages before context switch: O(n)
  ⇒ pad to the next page boundary before opening parallelism, so shared data starts on its own page
DSM/PM2 improvements
• Availability of an asynchronous DSM malloc
• Lazy data consistency protocols under evaluation
  – eager consistency, multiple writers
  – scope consistency
• Support for stacks in shared memory (Linux)
DSM/PM2 shared stack support
Diagram (animation): the marcel_t descriptor is found via (sp & MASK) + SLOT_SIZE; a dedicated SEGV stack handles the page faults raised while running on a shared stack.
DSM requirements
• Support for static global shared variables
  – efficient code: removes one indirection level
  – enables use of a classical compiler: support for common
• "Sharedization" of already allocated memory
  dsm_to_shared(void* p, size_t size);
• Support for multiple levels of parallelism
  – partial barriers (group management)
  – dependency support (like acquire/release, but without a lock)
DSM requirements
• Dependency support: like acquire/release, but without a lock
Figure (animation): start(1) … stop(1); update(1, 2); start(2) … stop(2) — the modifications made in interval 1 are propagated to interval 2.
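The start/stop/update sequence in the figure is the proposed dependency primitive: acquire/release semantics driven by the task graph rather than by a lock. A toy single-process model of the intended behavior, where the bitmap representation and all names are illustrative:

```c
/* Each interval records which pages it modified; update(from, to)
 * makes those modifications visible to interval `to` without any
 * lock, mirroring start(1)..stop(1); update(1,2); start(2)..stop(2). */
#define NPAGES     8
#define NINTERVALS 4

static unsigned char dirty[NINTERVALS][NPAGES];   /* pages written per interval */
static unsigned char visible[NINTERVALS][NPAGES]; /* updates each interval sees */
static int current = -1;

static void start(int i)      { current = i; }
static void stop(int i)       { (void)i; current = -1; }
static void write_page(int p) { dirty[current][p] = 1; }

static void update(int from, int to)
{
    for (int p = 0; p < NPAGES; p++)
        visible[to][p] |= dirty[from][p];
}
```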
Summary of DSM requirements
• Support for static global shared variables
• "Sharedization" of already allocated memory
• Acquire/release primitives
• Partial barriers (group management)
• Asynchronous shared memory allocation
• Alignment parameter for memory allocation
• Threads (SMP nodes)
• Optimized stack management
Conclusion
• Successfully ported Nanos to 2 SDSMs: JIAJIA & DSM-PM2
• DSM requirements to obtain performance:
  – support for the MIMD model
  – automatic thread migration
• Performance?