H5Spark
CUG, May 2016

H5Spark: Bridging the I/O Gap between Spark and Scientific Data Formats on HPC Systems

Jialin Liu, Evan Racah, Quincey Koziol, Richard Shane Canon,
Alex Gittens, Lisa Gerhardt, Suren Byna, Mike F. Ringenburg, Prabhat
Jalnliu@lbl.gov
National Energy Research Scientific Computing Center (NERSC)
H5Spark: Outline

• Introduction, Spark
• Motivation
• H5Spark Design
• H5Spark Evaluation
• H5Spark Future
H5Spark: Big Data Analytics, Spark

• Apache Spark is an open-source cluster computing framework
  – Developed at UC Berkeley's AMPLab; v1.0 in 2014, v2.0 in 2016
• Actively developed: 1000+ contributors in 2015
  – Productive programming interface
    • 6 vs. 28 lines of code compared to Hadoop MapReduce
  – Implicit data parallelism
  – Fault tolerance
• Spark for data-intensive computing
  – Stream processing
  – SQL
  – Machine learning (MLlib)
  – Graph processing
H5Spark: Porting Spark onto HPC

• Advantages of porting Spark onto HPC
  – A more productive API for data-intensive computing
  – Relieves users from the concurrency control, communication, and memory management of the traditional MPI model
  – Embarrassingly parallel computing: data.map(f)
  – Fault tolerance: recompute()
• But scientific data formats in HPC are not supported
  – HDF5/netCDF are among the top 5 libraries at NERSC (2015)
    • 750+ unique users @ NERSC; millions of users worldwide
  – Born in 1987 at NCSA & UIUC; NASA ships HDF-EOS to 2.4 million end users
  – Hierarchical data organization
  – Parallel I/O
H5Spark: Data in Spark

• RDD: Resilient Distributed Datasets
  – A read-only, partitioned collection of records in Spark
  – An RDD can contain any type of Python/Java/Scala objects
  – Fault tolerant
• Transformations on RDDs
  – filter, map, join, etc.
• Actions on RDDs
  – reduce, collect, etc.
• Spark operations are lazy
• RDDs allow in-memory processing
  – rdd.cache() or rdd.persist()
  – Good for iterative or interactive processing
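As a rough illustration of the lazy-transformation versus eager-action split above, plain Python generators behave similarly (a sketch only: real RDDs are partitioned, distributed, and fault-tolerant, which this does not show):

```python
# Plain-Python analogue of Spark's evaluation model (illustration only --
# generators are lazy like transformations; reduce forces evaluation like an action).
from functools import reduce

data = range(1, 10)                            # like sc.parallelize(1 to 9)
doubled = (x * 2 for x in data)                # "transformation": lazy, like map
filtered = (x for x in doubled if x > 6)       # "transformation": lazy, like filter
total = reduce(lambda a, b: a + b, filtered)   # "action": forces evaluation, like reduce
print(total)                                   # → 78
```

Nothing is computed until the final reduce runs, which mirrors why Spark can pipeline a chain of transformations into a single pass over the data.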
H5Spark: Data in Spark

[Figure: a one-dimensional array of records (1–9) split into the partitions of an RDD]
H5Spark: Data in HDF5

• Hierarchical Data Format v5

[Figure: an HDF5 file is organized hierarchically: groups contain datasets, and datasets carry attributes]
H5Spark: Support HDF5 in Spark

• What does Spark already have for reading various data formats?
  – Text file: sc.textFile()
  – Parquet: sc.read.parquet()
  – JSON: sc.read.json()
• Challenges: functionality and performance
  – How to transform an HDF5 dataset into an RDD?
  – How to utilize the HDF5 I/O libraries in Spark?
  – How to enable parallel I/O on HPC?
  – What is the impact of Lustre striping?
  – What is the effect of caching on I/O in Spark?

HDF5 → Parquet?
H5Spark: Software Overview

• Scala/Python implementation
  – Spark favors Scala and Python
  – H5Spark uses the HDF5 Java library
  – Underneath is the HDF5 C POSIX library
  – No MPI-IO support
• H5Spark as a standalone package
  – Users can load it in their Spark applications
  – H5Spark module on Cori
  – sbt package → h5spark_2.10-1.0.jar
• Open source
  – GitHub: https://github.com/valiantljk/h5spark

[Figure: software stack: H5Spark (Scala/Python) sits on the JHI5 Java bindings and on h5py for Python, both on top of the HDF5 C library v1.8.14; the HDF5 MPI path is not used]
H5Spark: Design

• RDD Seeder
• Metadata Analyzer
• Hyperslab Partitioner
• RDD Constructor

[Figure: H5Spark architecture: the user application calls H5Spark, whose RDD Seeder, Metadata Analyzer, Hyperslab Partitioner, and RDD Constructor turn an HDF5 file (groups and datasets) on the Lustre file system into an RDD via parallel I/O]

CUG, May 2016
H5Spark: From HDF5 to RDD

• Input:
  – HDF5 file path: f
  – Dataset name: v
  – SparkContext: sc
  – Spark partition count*: p
• Output: RDD r
• Under the hood: reading HDF5 into an RDD
  – Adjust partitions: p = p > dim[sid] ? dim[sid] : p
  – Determine hyperslab: offset[i] = dim[sid]/p * i
  – Seed RDD: r_seed = sc.parallelize(offset, p)
  – Perform parallel I/O: r_seed.flatMap(h5read(f, v))

*The Spark partition count determines the degree of parallelism (the analogue of MPI processes plus OpenMP threads); p may exceed the number of cores.
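The "under the hood" partition arithmetic can be sketched in a few lines of plain Python (hyperslab_partitions is a hypothetical helper for illustration, not the H5Spark API; the remainder handling follows the MPI sample shown later in the slides, where the last rank absorbs the leftover rows):

```python
# Sketch of H5Spark's hyperslab partitioning: split dimension size `dim`
# (the slowest-varying dimension `sid`) across p Spark partitions.
def hyperslab_partitions(dim, p):
    """Return (offset, count) pairs covering `dim` rows with p partitions."""
    p = min(p, dim)              # adjust partitions: p = p > dim ? dim : p
    base = dim // p              # rows per partition: offset[i] = dim/p * i
    parts = []
    for i in range(p):
        offset = base * i
        # the last partition absorbs any remainder rows
        count = dim - offset if i == p - 1 else base
        parts.append((offset, count))
    return parts

print(hyperslab_partitions(9, 4))   # → [(0, 2), (2, 2), (4, 2), (6, 3)]
```

H5Spark would then seed the offset list into an RDD with sc.parallelize(offset, p) and have each task read its own hyperslab, so the parallel read itself happens inside the flatMap.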
H5Spark: How to Use

• H5Spark APIs (input: sc, f, v, p)

  Function       Output
  h5read         An RDD of double arrays
  h5read_point   An RDD of (key, value) pairs
  h5read_vec     An RDD of vectors
  h5read_irow    An RDD of indexed rows
  h5read_imat    An RDD of indexed row matrices

• Corresponds to the Spark MLlib interface: import org.apache.spark.mllib.linalg
  – Data types: Vector, LabeledPoint, Matrix, IndexedRowMatrix, etc.
H5Spark: How to Use

• Sample code, H5Spark vs. MPI

H5Spark parallel read (Scala, 3 lines):

  val sc = new SparkContext()
  val rdd = h5read(sc, f, v, p)
  sc.stop()

MPI parallel read (C, ~30 lines):

  MPI_Init(&argc, &argv);
  MPI_Comm_size(comm, &mpi_size);
  MPI_Comm_rank(comm, &mpi_rank);
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_fapl_mpio(fapl, comm, info);
  file = H5Fopen(f, H5F_ACC_RDONLY, fapl);
  dataset = H5Dopen(file, v, H5P_DEFAULT);
  hid_t dataspace = H5Dget_space(dataset);
  int rank = H5Sget_simple_extent_ndims(dataspace);
  hsize_t dims_out[rank];
  H5Sget_simple_extent_dims(dataspace, dims_out, NULL);
  hsize_t offset[rank];
  hsize_t count[rank];
  hsize_t rest = dims_out[0] % mpi_size;
  if (mpi_rank != (mpi_size - 1)) {
      count[0] = dims_out[0] / mpi_size;
  } else {
      count[0] = dims_out[0] / mpi_size + rest;
  }
  offset[0] = dims_out[0] / mpi_size * mpi_rank;
  for (i = 1; i < rank; i++) {
      offset[i] = 0;
      count[i] = dims_out[i];
  }
  herr_t status = H5Sselect_hyperslab(dataspace,
      H5S_SELECT_SET, offset, NULL, count, NULL);
  hsize_t rankmemsize = 1;
  for (i = 0; i < rank; i++) rankmemsize *= count[i];
  hid_t memspace = H5Screate_simple(rank, count, NULL);
  double *data_t = (double *)malloc(sizeof(double) * rankmemsize);
  H5Dread(dataset, H5T_NATIVE_DOUBLE, memspace,
      dataspace, H5P_DEFAULT, data_t);
  MPI_Finalize();
H5Spark: Evaluation

• About the system
  – Cori Phase 1, a Cray XC40 supercomputer: 1600 compute nodes, 248 Lustre OSTs
  – Each compute node has 32 cores and 128 GB of RAM in total. The peak I/O bandwidth is 700 GB/s.
• Experimental setup
  – PCA on 2.2 TB of global ocean temperature data and 16 TB of CAM5 atmosphere data
  – 2.2 TB and 16 TB, HDF5 format, double precision
  – Number of nodes: 45, 90, 135, 1600
  – Stripe counts: 1, 8, 24, 72, 144, 248

CAM5, 16 TB: finding the principal causes of variability in large-scale 3D fields.
H5Spark: Evaluation

• Scaling/profiling H5Spark with Lustre striping
  – 45 nodes, 1440 cores, 3000 partitions, 2.2 TB data, 1 MB stripe size

[Figures: I/O bandwidth with Lustre striping; H5Spark task-launching delay]

The OST count should be another factor in Spark's scheduling, besides CPU/memory.
H5Spark: Evaluation

• Scaling H5Spark with partitions
  – 45 nodes, 2.2 TB

The number of partitions can be tuned based on the workload and resources.
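For a rough sense of where the 3000-partition setting used in these runs sits, a common Spark rule of thumb (an assumption here, not stated in the slides) is two to three partitions per core:

```python
# Rough check of the experiment's partition count against the common
# "2-3 partitions per core" Spark guideline (guideline is an assumption).
nodes, cores_per_node = 45, 32
total_cores = nodes * cores_per_node           # → 1440 cores
guideline = (2 * total_cores, 3 * total_cores) # → (2880, 4320)
print(total_cores, guideline)
```

The 3000 partitions chosen for the 45-node, 1440-core runs fall inside that range, which is consistent with the lesson that partitions should scale with the available resources.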
H5Spark: Evaluation

• Scaling H5Spark with executors and/or partitions
  – 2.2 TB; 45, 90, 135 nodes

Lesson: increase the number of executors and partitions at the same time.
H5Spark: Evaluation

• H5Spark has been tested at full scale on Cori Phase 1

  Tests       Size (TB)  I/O (s)  B/W (GB/s)  OSTs  Executors  Partitions
  135 nodes   2.2        37       59.7        144   135        9000
  Full scale  16         120      136.5       144   1522       52100
H5Spark: Evaluation

• H5Spark: Python vs. Scala

  Version  I/O (s)  B/W (GB/s)  Speedup  Mem (GB)  Ratio
  Python   162      13.65       1        479       1
  Scala    90       24.56       1.8      2210      4.61

Scala is faster than Python (at the cost of more memory).
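The Speedup and Ratio columns are derived from the measured I/O times and memory footprints; recomputing them as a sanity check:

```python
# Re-derive the Speedup and memory Ratio columns from the measured values.
io_s = {"Python": 162, "Scala": 90}      # I/O time in seconds
mem_gb = {"Python": 479, "Scala": 2210}  # memory in GB
speedup = round(io_s["Python"] / io_s["Scala"], 1)      # 162 / 90
mem_ratio = round(mem_gb["Scala"] / mem_gb["Python"], 2)  # 2210 / 479
print(speedup, mem_ratio)  # → 1.8 4.61
```

So the 1.8x speedup of Scala comes at roughly 4.6x the memory use, matching the table.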
H5Spark: Evaluation

• H5Spark vs. MPI-IO

MPI scales well with OSTs. H5Spark scales well with nodes (while MPI saturates the I/O). Again: storage on HPC is an important scheduling factor. (Partitions are also increased.)
H5Spark: Evaluation

• H5Spark @ LBNL vs. SciSpark @ NASA
  – https://github.com/SciSpark/SciSpark
  – https://github.com/valiantljk/h5spark
H5Spark: Conclusion & Future Work

• H5Spark:
  – An efficient HDF5 file loader for Spark
  – Users can now use Spark to perform big data analysis on HDF5 data
  – H5Spark gets closer to MPI-IO performance
• H5Spark future:
  – Finer-grained Spark I/O profiling / lazy evaluation
  – Parallel write / filter
  – Storage-aware scheduling
H5Spark

Thanks to the SciSpark team.
Thanks to Douglas Jacobsen @ NERSC and Jey Kottalam @ AMPLab.