User-Level Threads and OpenMP
Sangmin Seo
Assistant Computer Scientist, Argonne National Laboratory
November 7, 2016

Top500 Supercomputers Today
1. Sunway TaihuLight (Sunway MPP): 10,649,600 cores
2. Tianhe-2 (Intel Xeon + Xeon Phi): 3,120,000 cores
3. Titan (Cray XK7 + Nvidia K20x): 560,640 cores
4. Sequoia (BlueGene/Q): 1,572,864 cores
5. K computer (SPARC64): 705,024 cores
6. Mira (BlueGene/Q): 786,432 cores
[Figure: Number of cores per socket in Top500 (share of systems, 0% to 100%; categories 1, 2, 4, 6, 8, 10, 12, 14-)]
[Figure: Total number of cores in Top500 rank #1 (y-axis: Total Number of Cores, 0K to 12,000K)]
4th XMP Workshop (11/7/2016)

Massive On-node Parallelism
• To address massive on-node parallelism, the number of work units (e.g., threads) must increase by 100X
• MPI+OpenMP is sufficient for many apps, but the implementation is poor
  – Today MPI+OpenMP == MPI+Pthreads
• The Pthread abstraction is too generic and not suitable for HPC
  – Lack of fine-grained scheduling, memory management, network management, signaling, etc.
• A better runtime can significantly improve MPI+OpenMP performance and support other emerging programming models
[Figure: an MPI process with many OpenMP threads running on the cores of a node]

Outline
• Motivation
• User-Level Threads (ULTs)
• Argobots
• BOLT: OpenMP over Lightweight Threads
• Summary

User-Level Threads (ULTs)
• What is a user-level thread (ULT)?
  – Provides thread semantics in user space
  – Execution model: cooperative time sharing
    • More than one ULT can be mapped to a single kernel thread
    • ULTs on the same OS thread do not execute in parallel
  – Can be implemented with coroutines
    • Enable explicit suspend and resume of a ULT's progress by preserving its execution state
    • Some languages such as Python and Go use coroutines for asynchronous I/O
[Figure: timeline in which ULT1 and ULT2 alternate on one kernel thread through context switches; ULTs are mapped onto kernel threads, which run on eight cores]

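The cooperative time-sharing model described above can be sketched in a few lines of C. This is only an illustration, assuming a Linux/glibc system that still provides the `ucontext` routines (`getcontext`, `makecontext`, `swapcontext`); the names `run_two_ults`, `ult_body`, and `ult_yield` are invented for this sketch, and it is not how Argobots implements its context switch. Two ULTs share one kernel thread, each saves its execution state when it yields, and a trivial round-robin loop resumes them in turn.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

ucontext_t sched_ctx;                 /* context of the caller ("scheduler") */
ucontext_t ult_ctx[2];                /* saved execution state of each ULT */
char ult_stack[2][STACK_SIZE];        /* each ULT needs its own stack */
char trace[64];                       /* records the order of execution */

/* Explicit suspend: save this ULT's state and resume the scheduler. */
static void ult_yield(int id) { swapcontext(&ult_ctx[id], &sched_ctx); }

/* Body of a ULT: do one step of work, then yield, twice. */
static void ult_body(int id)
{
    for (int step = 0; step < 2; step++) {
        char buf[16];
        snprintf(buf, sizeof buf, "U%d.%d ", id + 1, step);
        strcat(trace, buf);
        ult_yield(id);
    }
}

/* Runs two ULTs on the calling kernel thread; returns the interleaving. */
const char *run_two_ults(void)
{
    trace[0] = '\0';
    for (int id = 0; id < 2; id++) {
        int rc = getcontext(&ult_ctx[id]);
        assert(rc == 0);
        (void)rc;
        ult_ctx[id].uc_stack.ss_sp = ult_stack[id];
        ult_ctx[id].uc_stack.ss_size = STACK_SIZE;
        ult_ctx[id].uc_link = &sched_ctx;   /* where to go when the ULT returns */
        makecontext(&ult_ctx[id], (void (*)(void))ult_body, 1, id);
    }
    /* Each ULT yields twice and then returns, so it is resumed three times. */
    for (int round = 0; round < 3; round++)
        for (int id = 0; id < 2; id++)
            swapcontext(&sched_ctx, &ult_ctx[id]);
    return trace;
}
```

Calling `run_two_ults()` yields the trace `U1.0 U2.0 U1.1 U2.1 `: the two ULTs interleave but never run at the same time, because every switch happens at an explicit yield on a single kernel thread.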
User-Level Threads (ULTs)
• Why ULTs?
  – Conventional threads (e.g., Pthreads) are too expensive to express massive parallelism
  – If we create more Pthreads than the number of cores (i.e., oversubscription):
    • Context-switch and synchronization overhead will increase dramatically
  – ULTs can mitigate the high overhead of Pthreads but need explicit control
• Where to use?
  – To better overlap computation and communication/IO
    • The low context-switching overhead of ULTs gives more opportunities to hide communication/IO latency
  – To exploit fine-grained task parallelism
    • Lightweight ULTs are more suitable for expressing massive task parallelism
[Figure: several ULTs (U) multiplexed on one OS thread]

Pthreads vs. ULTs
• Average time for creating and joining one thread
• Pthread: 6.6 us - 21.2 us (avg. 34,953 cycles)
• ULT (Argobots): 78 ns - 130 ns (avg. 191 cycles)
• ULT is 64x - 233x faster than Pthread
  – How fast is ULT?
    • L1$ access: 1.112 ns, L2$ access: 5.648 ns, memory access: 18.4 ns*
    • Context switch (2 processes): 1.64 us*
* Measured using LMbench3
[Figure: Avg. Create & Join Time/thread (ns, log scale from 1 to 100000) vs. Number of Threads (1 to 2048); series: Pthread, ULT (Argobots)]

Growing Interests in ULTs
• ULT and task libraries
  – Converse threads, Qthreads, MassiveThreads, Nanos++, Maestro, GNU Pth, StackThreads/MP, Protothreads, Capriccio, StateThreads, TiNy-threads, etc.
• OS support
  – Windows fibers, Solaris threads
• Language and programming models
  – Cilk, OpenMP task, C++11 task, C++17 coroutine proposal, Stackless Python, Go coroutines, etc.
• Pros
  – Easy to use with a Pthreads-like interface
• Cons
  – The runtime tries to do something smart (e.g., work-stealing)
  – This may conflict with the characteristics and demands of applications

Argobots
A lightweight low-level threading and tasking framework (http://www.mcs.anl.gov/argobots/)
Overview
• Separation of mechanisms and policies
• Massive parallelism
  – Exec. Streams guarantee progress
  – Work Units execute to completion
    • User-level threads (ULTs) vs. Tasklets
• Clearly defined memory semantics
  – Consistency domains
    • Provide Eventual Consistency
  – Software can manage consistency
Argobots Innovations
• Enabling technology, but not a policy maker
  – High-level languages/libraries such as OpenMP or Charm++ have more information about the user application (data locality, dependencies)
• Explicit model:
  – Enables dynamism, but always managed by high-level systems
[Figure: Programming Models (MPI, OpenMP, Charm++, PaRSEC, ...) on top of Argobots on top of the processor cores; Execution Streams with private pools and a shared pool holding lightweight work units (U = User-Level Thread, T = Tasklet)]
* Current team members: Pavan Balaji, Sangmin Seo, Halim Amer (ANL), L. Kale, Nitin Bhat (UIUC)

Argobots Execution Model
• Execution Streams (ES)
  – Sequential instruction stream
    • Can consist of one or more work units
  – Mapped efficiently to a hardware resource
  – Implicitly managed progress semantics
    • One blocked ES cannot block other ESs
• User-level Threads (ULTs)
  – Independent execution units in user space
  – Associated with an ES when running
  – Yieldable and migratable
  – Can make blocking calls
• Tasklets
  – Atomic units of work
  – Asynchronous completion via notifications
  – Not yieldable, migratable before execution
  – Cannot make blocking calls
• Scheduler
  – Stackable scheduler with pluggable strategies
• Synchronization primitives
  – Mutex, condition variable, barrier, future
• Events
  – Communication triggers
[Figure: Argobots Execution Model: execution streams ES1 ... ESn, each with a scheduler (Sched) and pools of work units; legend: S = Scheduler, U = ULT, T = Tasklet, E = Event]

Explicit Mapping ULT/Tasklet to ES
• The user needs to map work units to ESs
• No smart scheduling, no work-stealing unless the user wants to use it
• Benefits
  – Allow locality optimization
    • Execute work units on the same ES
  – No expensive lock is needed between ULTs on the same ES
    • They do not run in parallel
    • A flag is enough
[Figure: work units U0 to U5, T1, and T2 explicitly mapped onto ES1 and ES2]

Stackable Scheduler with Pluggable Strategies
• Associated with an ES
• Can handle ULTs and tasklets
• Can handle schedulers
  – Allows schedulers to be stacked hierarchically
• Can handle asynchronous events
• Users can write schedulers
  – Provides mechanisms, not policies
  – Replace the default scheduler
    • E.g., FIFO, LIFO, Priority Queue, etc.
• A ULT can explicitly yield to another ULT
  – Avoids scheduler overhead
[Figure: a scheduler (Sched) with pools holding ULTs (U), tasklets (T), events (E), and stacked schedulers (S); yield() switches from a ULT back to the scheduler, while yield_to(target) switches directly to the target ULT]

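The idea of "mechanisms, not policies" can be illustrated with a small C sketch in which the scheduling loop is fixed and the pop policy is passed in as a function pointer. Everything here (`pool`, `work_unit`, `pop_fifo`, `pop_lifo`, `sched_run`, `record_id`) is a hypothetical name invented for the illustration; it does not use or mirror the Argobots API, and it ignores stacking, events, and concurrency.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_CAP 16

typedef void (*work_fn)(void *arg);
typedef struct { work_fn fn; void *arg; } work_unit;

/* A bounded pool of work units stored as a ring buffer. */
typedef struct {
    work_unit items[POOL_CAP];
    size_t head;    /* index of the oldest unit */
    size_t count;   /* number of units in the pool */
} pool;

/* A pluggable strategy removes and returns the next unit of a non-empty pool. */
typedef work_unit (*pop_strategy)(pool *p);

void pool_push(pool *p, work_fn fn, void *arg)
{
    assert(p->count < POOL_CAP);
    size_t tail = (p->head + p->count) % POOL_CAP;
    p->items[tail].fn = fn;
    p->items[tail].arg = arg;
    p->count++;
}

/* FIFO policy: run the oldest unit first. */
work_unit pop_fifo(pool *p)
{
    work_unit wu = p->items[p->head];
    p->head = (p->head + 1) % POOL_CAP;
    p->count--;
    return wu;
}

/* LIFO policy: run the most recently pushed unit first. */
work_unit pop_lifo(pool *p)
{
    size_t last = (p->head + p->count - 1) % POOL_CAP;
    p->count--;
    return p->items[last];
}

/* The mechanism: drain the pool, asking the plugged-in policy what runs next. */
void sched_run(pool *p, pop_strategy next)
{
    while (p->count > 0) {
        work_unit wu = next(p);
        wu.fn(wu.arg);
    }
}

/* Example work function that logs the integer id it was given. */
int order_log[POOL_CAP];
size_t order_len;
void record_id(void *arg) { order_log[order_len++] = *(int *)arg; }
```

Pushing units with ids 1, 2, 3 and calling `sched_run(&p, pop_fifo)` runs them in the order 1, 2, 3, whereas `pop_lifo` runs them as 3, 2, 1; the loop itself never changes, only the strategy does.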
Performance: Create/Join Time
• Ideal scalability
  – If the ULT runtime is perfectly scalable, the time should be the same regardless of the number of ESs
[Figure: Create/Join Time per ULT (cycles, log scale from 10 to 10000) vs. Number of Execution Streams (Workers) (1 to 72); series: Qthreads, MassiveThreads (H), MassiveThreads (W), Argobots (ULT), Argobots (Tasklet)]

Argobots' Position
Argobots is a low-level threading/tasking runtime!
[Figure: software stack on each node: Applications; Domain Specific Languages (DSLs); High-Level Programming Models/Libraries; Argobots alongside Comm. Lib.; Node OS]

Argobots Ecosystem
[Figure: projects connected to the Argobots runtime: MPI+Argobots, Charm++, CilkBots, PaRSEC, OpenMP, Mercury RPC, OmpSs]
More connections: TASCEL, XMP, ROSE, GridFTP, Kokkos, RAJA, etc.

OpenMP
• Directive-based programming model
• Commonly used for shared-memory programming in a node
• Many different implementations
  – Typically on top of Pthreads library
  – Intel, GCC, Clang, IBM, etc.
Sequential code
    for (i = 0; i < N; i++) {
        do_something();
    }
OpenMP code
    #pragma omp parallel for
    for (i = 0; i < N; i++) {
        do_something();
    }

Nested Parallel Loop: Microbenchmark
    int in[1000][1000], out[1000][1000];
    int compute(int v);              /* per-element kernel, defined elsewhere */
    void lib_compute(int x);

    void outer_loop(void)
    {
        /* A thread for each CPU is created by default */
        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            lib_compute(i);          /* Each thread executes a portion */
        }
    }

    void lib_compute(int x)
    {
        /* Each thread creates more threads for the second loop */
        #pragma omp parallel for
        for (int j = 0; j < 1000; j++)
            out[x][j] = compute(in[x][j]);   /* Each inner thread executes a portion */
    }
Contribution: Adrian Castello (Universitat Jaume I)

Nested Parallel Loop: Implementations
• GCC
  – Does not reuse the idle threads in nested parallel constructs
  – All thread teams inside a parallel region need to be created
• ICC
  – Reuses idle threads
    • If there are no more threads available, new threads are created
    • All created threads are OS threads and add overhead
• Implementation using Argobots
  – Creates ULTs or tasklets for both the outer loop and the inner loop
[Figure: one ES for each core (ES0 ... ESN-1); a work unit (WU) for the outer loop on each ES; each work unit executes a portion of the inner loop; inner-loop and outer-loop synchronization points (S)]

Nested Parallel Loop: Performance
Execution time for 36 threads in the outer loop
[Figure: Time (s, log scale up to 10.00) vs. # OMP Threads | Argobots ULTs/tasks (inner loop) (1 to 35); series: GCC/Pthreads, GCC/Argobots ULTs, GCC/Argobots tasks; lower is better]
[Figure: Time (s, log scale up to 10.00) vs. # OMP Threads | Argobots ULTs/tasks (inner loop) (1 to 35); series: ICC/Pthreads, ICC/Argobots ULTs, ICC/Argobots tasks; lower is better]
The GCC OpenMP implementation does not reuse idle threads in nested parallel regions; all the teams of threads need to be created in each iteration
Some overhead is added by creating ULTs instead of tasks

Nested Parallel Loop: Analysis
• How does each implementation manage the threads in nested parallel regions?

| Imple. | # Created Threads | # Reused Threads | # Created ULTs | # Created Tasks |
| --- | --- | --- | --- | --- |
| GCC | C+N*(T-1) = 35,036 | 0 | --- | --- |
| ICC | C+C*(T-1) = 1,296 | (N-C)*(T-1) = 33,740 | --- | --- |
| Argobots tasks | C = 36 | 0 | C = 36 | N*T = 36,000 |

[Table annotations: Sequential creation, Parallel creation]
§ Parameters:
  • C: number of cores (or threads created by user at the beginning)
  • N: number of iterations of the outer loop
  • M: number of iterations of the inner loop
  • T: number of threads for the inner loop
Example -> C: 36, N: 1,000, M: 1,000, T: 36

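The counts in the analysis table follow directly from its formulas, and a short C sketch makes the arithmetic easy to check for other parameter values. The struct and function names (`nested_cost`, `gcc_cost`, `icc_cost`, `argobots_tasks_cost`) are invented here; entries the table marks as "---" are represented as 0.

```c
#include <assert.h>

/* One row of the analysis table. */
typedef struct {
    long created_threads;
    long reused_threads;
    long created_ults;    /* 0 where the table shows "---" */
    long created_tasks;   /* 0 where the table shows "---" */
} nested_cost;

/* GCC row: a new team of T-1 extra threads for each of the N outer iterations. */
nested_cost gcc_cost(long C, long N, long T)
{
    assert(C >= 1 && N >= 0 && T >= 1);
    nested_cost r = { C + N * (T - 1), 0, 0, 0 };
    return r;
}

/* ICC row: teams are created for the first C iterations, then reused. */
nested_cost icc_cost(long C, long N, long T)
{
    assert(C >= 1 && N >= C && T >= 1);
    nested_cost r = { C + C * (T - 1), (N - C) * (T - 1), 0, 0 };
    return r;
}

/* Argobots tasks row: C OS-level threads, C ULTs, and N*T tasks. */
nested_cost argobots_tasks_cost(long C, long N, long T)
{
    assert(C >= 1 && N >= 0 && T >= 1);
    nested_cost r = { C, 0, C, N * T };
    return r;
}
```

With the example parameters C = 36, N = 1,000, T = 36, `gcc_cost` gives 35,036 created threads, `icc_cost` gives 1,296 created and 33,740 reused threads, and `argobots_tasks_cost` gives 36,000 tasks, matching the table.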
BOLT: A Lightning-Fast OpenMP Implementation
• About BOLT
  – BOLT is a recursive acronym that stands for "BOLT is OpenMP over Lightweight Threads"
  – https://www.mcs.anl.gov/bolt/
• Objective
  – OpenMP framework that exploits lightweight threads and tasks
    • Improved Nested Massive Parallelism
    • Enhanced Fine-Grained Task Parallelism
    • Better Interoperability with MPI and Other Internode Programming Models

Approach & Development
• Basic approach
  – Compiler simply generates runtime API calls, while the runtime creates ULTs/tasklets and manages them over a fixed set of computational resources
  – Use Argobots as the underlying threading and tasking mechanism
  – ABI compatibility with Intel OpenMP compilers, LLVM/Clang, and GCC (i.e., can be used with these compilers)
• Development
  – Runtime
    • Based on Intel OpenMP Runtime API
    • Generates Argobots work units from OpenMP pragmas
    • Can generate ULTs or tasklets depending on code characteristics
  – Compiler (planned)
    • LLVM/Clang
    • Passes characteristics of parallel region or task (e.g., existence of blocking calls) to the runtime
    • Extends pragmas with the option "nonblocking"

BOLT Execution Model
• OpenMP threads and tasks are translated into Argobots work units (i.e., ULTs and tasklets)
• Shared pools are utilized to handle nested parallelism
• A customized Argobots scheduler manages scheduling of work units across execution streams
[Figure: OpenMP threads (#pragma omp parallel) and OpenMP tasks (#pragma omp task) are translated into Argobots ULTs (U) and tasklets (T), placed in private pools and a shared pool, and run by execution streams on CPU cores]

Prototype Implementation of BOLT Runtime
• Based on Intel's open-source OpenMP runtime
  – http://openmp.llvm.org/
• Kept the original runtime API for ABI compatibility
• Designed and implemented the threading layer using Argobots and modified the runtime internal layer
[Figure: runtime layers: Runtime API Layer; Runtime Internal Layer; Threading Layer (Pthreads, Argobots)]

OpenMP Pragma Translation
1. A set of N threads is created at runtime
   – If they have not been created yet
   – Commonly as many as the number of CPU cores
2. The iterations are divided among all the threads
3. A synchronization point is added after the for loop
   – Implicit barrier at the end of parallel for

    #pragma omp parallel for     /* (1), (2) */
    for (i = 0; i < N; i++) {
        do_something();
    }                            /* (3) */

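Step 2 above, dividing the iterations among the threads, can be sketched as a block partition in C. This is one common static division and only an assumption for illustration: the slides do not say which schedule is used, and OpenMP leaves the default schedule to the implementation. The names `iter_range` and `static_chunk` are hypothetical.

```c
#include <assert.h>

/* Half-open iteration range [begin, end) assigned to one thread. */
typedef struct { long begin; long end; } iter_range;

/*
 * Splits n_iters iterations into n_threads contiguous blocks whose sizes
 * differ by at most one; the first (n_iters % n_threads) threads get the
 * extra iteration.
 */
iter_range static_chunk(long n_iters, int n_threads, int tid)
{
    assert(n_iters >= 0 && n_threads > 0 && tid >= 0 && tid < n_threads);
    long base = n_iters / n_threads;
    long extra = n_iters % n_threads;
    long begin = tid * base + (tid < extra ? tid : extra);
    long len = base + (tid < extra ? 1 : 0);
    iter_range r = { begin, begin + len };
    return r;
}
```

For example, with 1,000 iterations and 36 threads, thread 0 receives iterations [0, 28) and thread 35 receives [973, 1000); each thread then loops over its own range and reaches the barrier of step 3 when it finishes.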
OpenMP Compiler & BOLT Runtime
Clang and Intel compiler translate #pragma omp parallel into a call to the Intel OpenMP Runtime API:
    __kmpc_fork_call(…)
    {
        __kmp_fork_call(…)
        __kmp_join_call(…)
    }
BOLT runtime
• __kmp_fork_call(…)
  – Create Execution Streams (if needed)
  – Add a ULT or tasklet to each ES
  – Launch the work
• __kmp_join_call(…)
  – Join work units created

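The division of labor between compiler and runtime can be illustrated with a toy C sketch: the loop body is "outlined" into a function, and a fork call hands that function to the runtime, which decides how to execute it. `mock_fork_call`, `microtask_fn`, `loop_args`, `outlined_double`, and `double_all` are hypothetical names; the mock runs the work units one after another on the calling thread and is far simpler than the real `__kmpc_fork_call` entry point, whose arguments the slide elides.

```c
#include <assert.h>

/* Signature of an outlined parallel-region body in this sketch. */
typedef void (*microtask_fn)(int tid, int n_threads, void *shared);

/*
 * Hypothetical stand-in for the runtime's fork/join entry point.
 * A real runtime would create one work unit per thread id and run them
 * concurrently; this mock runs them serially, then "joins" by returning.
 */
void mock_fork_call(int n_threads, microtask_fn body, void *shared)
{
    assert(n_threads > 0);
    for (int tid = 0; tid < n_threads; tid++)
        body(tid, n_threads, shared);
}

/* Variables shared by the parallel region, packed by the "compiler". */
typedef struct { const int *in; int *out; int n; } loop_args;

/* Outlined body of: #pragma omp parallel for  out[i] = 2 * in[i]; */
void outlined_double(int tid, int n_threads, void *shared)
{
    loop_args *a = shared;
    for (int i = tid; i < a->n; i += n_threads)   /* cyclic distribution */
        a->out[i] = 2 * a->in[i];
}

/* What the user-visible function becomes after outlining. */
void double_all(const int *in, int *out, int n)
{
    loop_args a = { in, out, n };
    mock_fork_call(4, outlined_double, &a);
}
```

Because the user code only ever reaches the runtime through the fork call, swapping the threading layer underneath (Pthreads or Argobots work units) does not require recompiling the outlined code, which is the point of keeping the runtime API unchanged.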
parallel for
    #pragma omp parallel for     /* Creates threads; divides all iterations among threads */
    for (i = 0; i < N; i++) {
        do_something();
    }                            /* Synchronization point */
Implementation using Argobots
• One Execution Stream for each CPU core (or hardware thread)
• Each work unit executes a portion of the for loop
• A synchronization point is added
[Figure: ES0, ES1, ..., ESK, each running a work unit (WU) followed by a synchronization point (S)]

OpenUH OpenMP Validation Suite 3.1

| | GCC 6.1 | ICC 17.0.0 + Intel OpenMP | ICC 17.0.0 + BOLT runtime (Argobots) | BOLT (clang + Argobots) |
| --- | --- | --- | --- | --- |
| # of tested OpenMP constructs | 62 | 62 | 62 | 62 |
| # of used tests | 123 | 123 | 123 | 123 |
| # of successful tests | 118 | 118 | 122 | 122 |
| # of failed tests | 5 | 5 | 1 | 1 |
| Pass rate (%) | 95.9 | 95.9 | 99.2 | 99.2 |

• The BOLT prototype functionally works well!

Nested Parallel Loop Microbenchmark
[Figure: Execution Time (us, log scale from 10 to 100000000) vs. # of Threads for the Inner Loop (1 to 35); series: gcc 6.1.0, icc 17.0.0, BOLT (ULT), BOLT (ULT + tasklet), Argobots (ULT)]
* The number of threads for the outer loop was fixed at 36.

Application Study: KIFMM
• Kernel-Independent Fast Multipole Method (KIFMM)
  – Offload dgemv operations to Intel MKL
• Evaluated the efficiency of the nested parallelism support in Intel OpenMP and BOLT during the Downward stage
  – 9 threads for the application (outer parallel region)
[Figure: Execution Time (s, 0 to 14) vs. # of Threads for Intel MKL (1, 2, 4, 8); series: Intel OMP: core-close, Intel OMP: core-true, Intel OMP: no-binding, BOLT]

Application Study: ACME mini-app
• ACME (Accelerated Climate Modeling for Energy)
  – Implementing additional levels of parallelism through OpenMP nested parallel loops for upcoming many-core machines
• Preliminary results of testing the transport_se mini-app version of HOMME (ACME's CAM-SE dycore)
[Figure: ACME mini-app (transport_se): Normalized Execution Time (%, 0% to 120%) for H=16,V=1; H=8,V=2; H=4,V=4; H=4,V=8; H=8,V=4; series: ICC + Intel OpenMP (15.0.0), ICC + BOLT (Argobots); lower is better (up to 3.16x faster); annotation: oversubscription]

Summary
• Massive on-node parallelism is inevitable
  – Need runtime systems utilizing such parallelism
• User-level threads (ULTs)
  – Lightweight threads more suitable for fine-grained dynamic parallelism and computation-communication overlap
• Argobots
  – A lightweight low-level threading/tasking framework
  – Provides efficient mechanisms, not policies, to users (library developers or compilers)
    • They can build their own solutions
• BOLT: OpenMP over Lightweight Threads
  – More efficient support of nested parallelism with Argobots ULTs and tasklets
  – Preliminary results show that BOLT is promising

Argo Concurrency Team
• Argonne National Laboratory (ANL)
  – Pavan Balaji (co-lead)
  – Sangmin Seo
  – Abdelhalim Amer
  – Pete Beckman (PI)
• University of Illinois at Urbana-Champaign (UIUC)
  – Laxmikant Kale (co-lead)
  – Marc Snir
  – Nitin Kundapur Bhat
• University of Tennessee, Knoxville (UTK)
  – George Bosilca
  – Thomas Herault
  – Damien Genet
• Pacific Northwest National Laboratory (PNNL)
  – Sriram Krishnamoorthy
Past Team Members:
• Cyril Bordage (UIUC)
• Prateek Jindal (UIUC)
• Jonathan Lifflander (UIUC)
• Esteban Meneses (University of Pittsburgh)
• Huiwei Lu (ANL)
• Yanhua Sun (UIUC)

BOLT Collaborations
• Maintainers
  – Argonne National Laboratory
    • Sangmin Seo
    • Abdelhalim Amer
    • Pavan Balaji
• Contributors
  – Universitat Jaume I de Castelló
    • Adrián Castelló
    • Rafael Mayo
    • Enrique S. Quintana-Ortí
  – Barcelona Supercomputing Center (BSC)
    • Antonio J. Peña
    • Jesus Labarta
  – RIKEN
    • Jinpil Lee
    • Mitsuhisa Sato

Try Argobots & BOLT
• Argobots
  – http://www.mcs.anl.gov/argobots/
  – git repository
    • https://github.com/pmodels/argobots
  – Wiki
    • https://github.com/pmodels/argobots/wiki
  – Doxygen
    • http://www.mcs.anl.gov/~sseo/public/argobots/
• BOLT
  – http://www.mcs.anl.gov/bolt/
  – git repository
    • https://github.com/pmodels/bolt-runtime

Funding Acknowledgments
Funding Grant Providers
Infrastructure Providers

Programming Models and Runtime Systems Group
Group Lead
  – Pavan Balaji (computer scientist and group lead)
Current Staff Members
  – Abdelhalim Amer (postdoc)
  – Yanfei Guo (postdoc)
  – Rob Latham (developer)
  – Lena Oden (postdoc)
  – Ken Raffenetti (developer)
  – Sangmin Seo (assistant computer scientist)
  – Min Si (postdoc)
Past Staff Members
  – Antonio Pena (postdoc)
  – Wesley Bland (postdoc)
  – Darius T. Buntinas (developer)
  – James S. Dinan (postdoc)
  – David J. Goodell (developer)
  – Huiwei Lu (postdoc)
  – Min Tian (visiting scholar)
  – Yanjie Wei (visiting scholar)
  – Yuqing Xiong (visiting scholar)
  – Jian Yu (visiting scholar)
  – Junchao Zhang (postdoc)
  – Xiaomin Zhu (visiting scholar)
Current and Recent Students
  – Ashwin Aji (Ph.D.)
  – Abdelhalim Amer (Ph.D.)
  – Md. Humayun Arafat (Ph.D.)
  – Alex Brooks (Ph.D.)
  – Adrian Castello (Ph.D.)
  – Dazhao Cheng (Ph.D.)
  – Hoang-Vu Dang (Ph.D.)
  – James S. Dinan (Ph.D.)
  – Piotr Fidkowski (Ph.D.)
  – Priyanka Ghosh (Ph.D.)
  – Sayan Ghosh (Ph.D.)
  – Ralf Gunter (B.S.)
  – Jichi Guo (Ph.D.)
  – Yanfei Guo (Ph.D.)
  – Marius Horga (M.S.)
  – John Jenkins (Ph.D.)
  – Feng Ji (Ph.D.)
  – Ping Lai (Ph.D.)
  – Palden Lama (Ph.D.)
  – Yan Li (Ph.D.)
  – Huiwei Lu (Ph.D.)
  – Jintao Meng (Ph.D.)
  – Ganesh Narayanaswamy (M.S.)
  – Qingpeng Niu (Ph.D.)
  – Ziaul Haque Olive (Ph.D.)
  – David Ozog (Ph.D.)
  – Renbo Pang (Ph.D.)
  – Nikela Papadopoulou (Ph.D.)
  – Sreeram Potluri (Ph.D.)
  – Sarunya Pumma (Ph.D.)
  – Li Rao (M.S.)
  – Gopal Santhanaraman (Ph.D.)
  – Thomas Scogland (Ph.D.)
  – Min Si (Ph.D.)
  – Brian Skjerven (Ph.D.)
  – Rajesh Sudarsan (Ph.D.)
  – Lukasz Wesolowski (Ph.D.)
  – Shucai Xiao (Ph.D.)
  – Chaoran Yang (Ph.D.)
  – Boyu Zhang (Ph.D.)
  – Xiuxia Zhang (Ph.D.)
  – Xin Zhao (Ph.D.)
Advisory Board
  – Pete Beckman (senior scientist)
  – Rusty Lusk (retired, STA)
  – Marc Snir (division director)
  – Rajeev Thakur (deputy director)

Q&A
• Thank you for your attention!
Questions?