![Page 1: Lina Yu, Hongfeng Yu - cecsresearch.orgcecsresearch.org/vcl/ASH/index_files/2016/legion_BIG_DATA_2016.pdf · Fig. 4: The /me results of sort-first (a) and sort-last (b) parallel](https://reader030.fdocuments.in/reader030/viewer/2022041220/5e09284f52bdb3685613550c/html5/thumbnails/1.jpg)
Legion-based Scientific Data Analytics on Heterogeneous Processors
Lina Yu, Hongfeng Yu
Department of Computer Science & Engineering
University of Nebraska-Lincoln, Lincoln, Nebraska
Outline
• Motivation
• Contributions
• Framework
• Examples
• Experiments and Results
• Conclusion
Motivation
• It is challenging to efficiently use today’s supercomputers
– Deep, distributed memory hierarchies
– Heterogeneous processing units
• Communication costs are a critical issue for parallel system and software designers to consider
– A scientific analytics workflow consists of multiple operations that intrinsically incur different communication or data movement requirements between compute nodes
• Legion: programming model + runtime system
– Describe hierarchical organizations of both data and computation at an abstract level
• Legion assists a programmer with common programming burdens
– Discover/verify the correctness of parallel execution
– Manage communication
• At a high level, mapping a Legion program requires making two kinds of decisions
– For each task, select a processor on which to run the task through the mapping interface
– For each logical region that a task uses, select a memory in which to create and use a physical instance of that logical region
Our Contribution
• Investigate the feasibility of using Legion to perform analytics for large-scale scientific data on heterogeneous processors
• Help users simplify programming of data partitioning, data organization, and data movement for distributed-memory heterogeneous architectures
• Facilitate the simultaneous execution of multiple analytics operations on modern and future supercomputers
• Demonstrate the scalability and usability of our approach using several representative analytics operations on a heterogeneous supercomputer
Mapper Interface
• We design a custom mapper based on Legion’s mapper interface
– Map operations onto target processors
– Specify which memories are used to host the physical instances of the logical regions requested by such operations
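Figure (mapper interface): the mapper interface takes the set of operations OP = {op_1, ..., op_v} and the available processors GPU = {gpu_1, ..., gpu_n} and CPU = {cpu_1, ..., cpu_m}, and produces one tuple <op_i, CPUs_i, GPUs_i> per operation, from <op_1, CPUs_1, GPUs_1> to <op_v, CPUs_v, GPUs_v>.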
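The mapping from operations to processor subsets can be sketched in a few lines of Python. This is only an illustration of the tuples <op_i, CPUs_i, GPUs_i>, not the Legion mapper API: the names `Assignment` and `build_mapping`, the processor labels, and the particular subsets are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    """One tuple <op_i, CPUs_i, GPUs_i> (hypothetical representation)."""
    op: str
    cpus: tuple
    gpus: tuple


def build_mapping(requests, cpus, gpus):
    """Build one Assignment per operation.

    requests maps an operation name to (cpu_subset, gpu_subset).
    Raises ValueError if a subset names a processor that is not available.
    """
    mapping = []
    for op, (cpu_subset, gpu_subset) in requests.items():
        unknown = (set(cpu_subset) - set(cpus)) | (set(gpu_subset) - set(gpus))
        if unknown:
            raise ValueError(f"{op}: unknown processors {sorted(unknown)}")
        mapping.append(Assignment(op, tuple(cpu_subset), tuple(gpu_subset)))
    return mapping


# Illustrative processor sets and subsets; none of these come from the slides.
CPUS = ["cpu1", "cpu2", "cpu3", "cpu4"]
GPUS = ["gpu1"]
MAPPING = build_mapping(
    {
        "ray_casting": (["cpu1"], ["gpu1"]),
        "entropy": (["cpu2", "cpu3"], []),
        "image_compositing": (["cpu4"], []),
    },
    CPUS,
    GPUS,
)
```

Keeping the assignment as plain data separates the decision of where an operation runs from the code of the operation itself, which is the role the slides give to the mapper.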
Region Construction and Task Scheduling
• Main steps of our approach, which make an operation op_i run on heterogeneous processors
– Construct a field space of the logical region, and allocate the field space for each portion of data.
– Construct an index space of the logical region for the input data of each operation.
– Create a logical region using the index space and the field space defined in the previous two steps.
– Use coloring to partition a logical region (colorings are objects that describe an intended partition of an index space).
– Create a corresponding physical region to hold the physical instances (i.e., the real values for the input data).
– Execute operations on GPUs and CPUs according to the mapper interface we designed.
Figure (region construction and task scheduling): the index space and field space define the logical region, which has a corresponding physical region and is split into the logical partition {lp_1, ..., lp_p}. The task scheduler applies Op_i to the logical partition: Opg_i processes lp_1, ..., lp_u on (GPUs_i)_1, ..., (GPUs_i)_u, and Opc_i processes lp_(u+1), ..., lp_p on (CPUs_i)_1, ..., (CPUs_i)_v.
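The steps above can be sketched in plain Python to show how the pieces relate. This is a toy model under stated assumptions, not Legion code: `make_coloring`, `schedule`, the ten-element data, and the round-robin CPU assignment are all hypothetical, and the per-partition sum merely stands in for an analytics operation.

```python
def make_coloring(num_points, num_colors):
    """Coloring: color -> half-open index range of the index space."""
    base, extra = divmod(num_points, num_colors)
    coloring, start = {}, 0
    for color in range(num_colors):
        size = base + (1 if color < extra else 0)
        coloring[color] = range(start, start + size)
        start += size
    return coloring


def schedule(coloring, gpus, cpus):
    """Give the first len(gpus) partitions to GPUs, the rest to CPUs.

    CPUs are reused round-robin; cpus must be non-empty if any partition
    is left over after the GPUs are assigned.
    """
    plan = {}
    for color in sorted(coloring):
        if color < len(gpus):
            plan[color] = gpus[color]
        else:
            plan[color] = cpus[(color - len(gpus)) % len(cpus)]
    return plan


index_space = range(10)                      # one point per data element
field_space = {"value": float}               # fields stored at each point
logical_region = (index_space, field_space)  # names the data, holds no values
coloring = make_coloring(len(index_space), 4)
physical_instance = {"value": [float(i) for i in index_space]}  # real values
plan = schedule(coloring, gpus=["gpu1"], cpus=["cpu1", "cpu2"])

# Stand-in "operation": sum the values of each partition.
results = {
    color: sum(physical_instance["value"][i] for i in coloring[color])
    for color in coloring
}
```

The point of the sketch is the separation it makes visible: the logical region and its coloring describe what data exists and how it is split, while the physical instance holds values and the plan decides which processor handles each piece.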
Examples
• Sort-last parallel volume rendering with entropy analysis
– Mapper interface
Figure (two example mappings): in both, the mapper interface receives the operations ray_casting, entropy, and image_compositing together with GPU = {gpu_1, ..., gpu_n} and CPU = {cpu_1, ..., cpu_m}.
– First mapping: GPUs_1 run ray_casting, CPUs_1 run entropy, and CPUs_2 run image_compositing.
– Second mapping: GPUs_1 and CPUs_1 run ray_casting, CPUs_2 run entropy, and CPUs_3 run image_compositing.
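The slides do not define the entropy operation. As a rough illustration of the kind of per-block analysis that could run on CPUs alongside ray casting, the sketch below computes the Shannon entropy of a histogram of data values. Both the choice of Shannon entropy and the function `shannon_entropy` with its binning parameters are assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def shannon_entropy(values, num_bins=4, lo=0.0, hi=1.0):
    """Shannon entropy, in bits, of a histogram of values within [lo, hi].

    Values are assumed to lie in [lo, hi]; hi itself falls in the last bin.
    """
    counts = Counter(
        min(int((v - lo) / (hi - lo) * num_bins), num_bins - 1) for v in values
    )
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform_block = [0.1, 0.3, 0.6, 0.9]   # one value per bin
constant_block = [0.1, 0.1, 0.1, 0.1]  # all values in one bin
```

A block whose values spread evenly over the bins gives the maximum entropy for that bin count, while a constant block gives zero, which is why entropy is often used as a simple measure of how much information a data block carries.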
Examples
• Sort-last parallel volume rendering with entropy analysis
– Region construction and task scheduling
Figure (sort-last region construction and task scheduling), with the following components:
– Mapper (TaskID: type): RAY_CASTING_TASK1: GPUs; RAY_CASTING_TASK2: CPUs; ENTROPY_TASK: CPUs; IMAGE_COMPOSITING_TASK: CPUs
– 3D volume logical region: index {vol_index_space}, field {vol_field_space}
– 3D volume physical region: (voxel index, value) entries
– 3D volume logical partition: (logical partition ID, start, offset) entries
– 2D image logical region: index {img_index_space}, field {img_field_space}
– 2D image physical region: (pixel index, value) entries
– 2D image logical partition: (logical partition ID, start, offset) entries
– Tasks: (logical partition ID, task ID) entries, launched as ray_casting on GPUs and CPU cores, and as entropy and image_compositing on CPU cores
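To make the sort-last data flow concrete, the following Python sketch partitions a tiny volume along depth, ray casts each slab independently, and composites the partial images front to back. It is a toy model, not the authors' renderer: the function names, the transfer function (a voxel value used as both color and opacity), and the reading of "offset" as slab length are all assumptions.

```python
def partition(num_slices, num_parts):
    """Return one (start, offset) pair per logical partition."""
    base, extra = divmod(num_slices, num_parts)
    parts, start = [], 0
    for p in range(num_parts):
        offset = base + (1 if p < extra else 0)
        parts.append((start, offset))
        start += offset
    return parts


def ray_cast(slices):
    """Front-to-back accumulation along depth for every pixel.

    Each slice holds one voxel value in [0, 1] per pixel. Returns a list of
    (premultiplied_color, alpha) pairs, one per pixel.
    """
    image = [(0.0, 0.0)] * len(slices[0])
    for depth_slice in slices:
        image = [
            (c + (1.0 - a) * v * v, a + (1.0 - a) * v)
            for (c, a), v in zip(image, depth_slice)
        ]
    return image


def composite(front, back):
    """'Over' operator on premultiplied (color, alpha) images."""
    return [
        (cf + (1.0 - af) * cb, af + (1.0 - af) * ab)
        for (cf, af), (cb, ab) in zip(front, back)
    ]


# 5 depth slices, 2 pixels per slice (illustrative values).
volume = [[0.1, 0.5], [0.2, 0.0], [0.4, 0.3], [0.0, 0.9], [0.6, 0.2]]
parts = partition(len(volume), 2)
partials = [ray_cast(volume[start:start + n]) for start, n in parts]

final = partials[0]
for partial in partials[1:]:
    final = composite(final, partial)

reference = ray_cast(volume)  # rendering the whole volume in one pass
```

Because the over operator is associative, compositing the slab images in depth order matches rendering the undivided volume up to floating-point rounding, which is what lets sort-last rendering distribute the volume across processors at the price of an extra image compositing step.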
Examples
• Sort-first parallel volume rendering with entropy analysis
– Mapper interface
  • Ray casting task (GPUs)
  • Entropy task (CPUs)
– Region construction and task scheduling
  • Divide the 2D image into uniform 2D grids
  • Each processor is responsible for rendering one image portion
  • No need to divide the 3D volume data
  • No need for image compositing
• The sort-first and sort-last algorithms differ in their data partitioning and distribution requirements, but our solution provides a simple and feasible way to incorporate different operations in a unified framework using logical regions
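The sort-first image division described above can be illustrated with a short Python sketch that splits an image into a uniform grid of tiles and gives one tile to each processor. The function `tile_image`, the 4 x 2 grid, and the processor names are hypothetical; the slides do not specify the grid layout.

```python
def tile_image(width, height, tiles_x, tiles_y):
    """Split a width x height image into a tiles_x x tiles_y grid.

    Returns half-open pixel rectangles (x0, y0, x1, y1), row by row.
    """
    def edges(length, count):
        # Integer boundaries that cover [0, length] without gaps or overlap.
        return [i * length // count for i in range(count + 1)]

    xs, ys = edges(width, tiles_x), edges(height, tiles_y)
    return [
        (xs[i], ys[j], xs[i + 1], ys[j + 1])
        for j in range(tiles_y)
        for i in range(tiles_x)
    ]


tiles = tile_image(1024, 1024, 4, 2)
owners = {f"proc{rank}": tile for rank, tile in enumerate(tiles)}
```

Each tile is a complete piece of the final image, so the tiles only need to be placed side by side, which is why sort-first rendering needs no compositing step.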
Experiments and Results
• Conduct experiments on Titan, a Cray XK7 supercomputer located at the Oak Ridge Leadership Computing Facility
– Each node of Titan contains one 16-core AMD Opteron CPU and an NVIDIA Tesla K20 GPU
• Test sort-first and sort-last parallel rendering
• Conduct scalability comparisons using a combustion data set with a resolution of 1600 × 1375 × 430
• Test with 1 to 256 processors and two output image resolutions, 1024² and 2048²
Experiments and Results
• The overall time breakdown, data partition time, rendering time, and data movement time for different numbers of nodes, for sort-first rendering and sort-last rendering
Fig. 1: (a) The time breakdown of sort-first parallel volume rendering for different numbers of nodes. (b) The data partition time. (c) The rendering time. (d) The data movement time. Two output image resolutions, 1024² and 2048², are used.
Fig. 2: (a) The time breakdown of sort-last parallel volume rendering for different numbers of nodes. (b) The data partition time. (c) The rendering time. (d) The image compositing time. Two output image resolutions, 1024² and 2048², are used.
Experiments and Results
• Interactive rendering time and data movement time of sort-first parallel rendering on 64 nodes with an image resolution of 1024²
Fig. 3: The rendering time and data movement time of sort-first rendering on 64 nodes from multiple view angles. The output image resolution is 1024².
Experiments and Results
• The rendering time results of sort-first and sort-last parallel rendering for node counts from 1 to 256 with an image resolution of 1024²
Fig. 4: The time results of sort-first (a) and sort-last (b) parallel rendering for node counts from 1 to 256. The output image resolution is 1024².
Experiments and Results
Fig. 5: The time results of ray casting and entropy analysis with various allocation ratios. The output image resolution is 1024².
• Legion job-stealing scheduling performance
– CPU ray casting time is 1.347 seconds (5%)
– CPU entropy time is 0.936 seconds
– GPU ray casting time is 2.833 seconds (95%)
• Given that each node has a 16-core CPU, we tested different ratios between ray casting and entropy operations
Conclusion
• A study of conducting scientific data analytics on distributed heterogeneous architectures by leveraging the Legion programming model and runtime system
• Consider both scalability and usability in our design
• Facilitate complex analytics operations with completely different data partitioning and distribution requirements in a nearly unified manner
• Perform operations across CPUs and GPUs and balance workload by automatic or manual scheduling strategies
Acknowledgement
• This research has been sponsored in part by the Department of Energy through the ExaCT Center for Exascale Simulation of Combustion in Turbulence and the National Science Foundation through grant IIS-1423487.
• The allocation of supercomputing time on the Oak Ridge Leadership Computing Facility (OLCF) has been sponsored by the Department of Energy through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.
Thank You!