Designing High-Performance, Resilient and Heterogeneity-Aware Key-Value Storage for
Modern HPC Clusters
Dipti Shankar
Dr. Dhabaleswar K. Panda (Advisor), Dr. Xiaoyi Lu (Co-Advisor)
Department of Computer Science & Engineering, The Ohio State University
SC 2018 Doctoral Showcase, Network Based Computing Laboratory
Outline
• Introduction and Problem Statement
• Research Highlights and Results
• Broader Impact
• Conclusion & Future Avenues
Key-Value Storage in HPC and Data Centers
• General-purpose distributed memory-centric storage
  – Allows aggregating spare memory from multiple nodes (e.g., Memcached)
• Accelerating online and offline analytics in High-Performance Computing (HPC) environments
• Our basis: current high-performance and hybrid key-value stores for modern HPC clusters
  – High-performance network interconnects (e.g., InfiniBand): low end-to-end latencies with IP-over-InfiniBand (IPoIB) and Remote Direct Memory Access (RDMA)
  – ‘DRAM+SSD’ hybrid memory designs: extend storage capacity beyond DRAM using high-speed SSDs
[Figure: Typical deployment. Web frontend servers (Memcached clients) connect over high-performance networks to Memcached servers, which in turn front database servers. Offline analytical workloads use a software-assisted burst-buffer; online analytical workloads use an OLTP/NoSQL query cache.]
Drivers of Modern HPC Cluster Architectures
• Multi-core/many-core technologies
• Remote Direct Memory Access (RDMA)-enabled networking (InfiniBand and RoCE)
• Solid State Drives (e.g., PCIe/NVMe-SSDs), NVRAM (e.g., PCM, 3D XPoint), Parallel Filesystems (e.g., Lustre)
• Accelerators (e.g., NVIDIA GPGPUs)
• Production-scale HPC Clusters: SDSC Comet, TACC Stampede, OSC Owens, etc.
[Figure: Building blocks of modern HPC clusters: high-performance interconnects; multi-core processors with vectorization support plus accelerators (GPUs); large-memory nodes (DRAM); SSD, NVMe-SSD, and NVRAM storage.]
Holistic Approach for HPC-Centric KVS
• Our focus: a holistic approach to designing high-performance and resilient key-value storage for current and emerging HPC systems
  – RDMA-enabled networking
  – Hybrid NVM storage-awareness: high-speed NVMe-/PCIe-SSDs and upcoming NVRAM technologies
  – Heterogeneous compute devices (e.g., SIMD capabilities of CPUs and GPUs)
• Goals: maximize end-to-end performance to
  – Optimally exploit available compute/storage/network capabilities
  – Enable data-intensive Big Data applications to leverage memory-centric key-value storage to improve their overall performance
Research Framework
High-Performance, Hybrid and Resilient Key-Value Storage
• Application workloads
  – Online (e.g., SQL/NoSQL query cache, LinkBench and TAO graph KV workloads)
  – Offline (e.g., burst-buffer over PFS for Hadoop MapReduce/Spark)
• Design components
  – Non-blocking RDMA-aware API extensions
  – Fast online erasure coding (reliability)
  – NVRAM-aware communication protocols (persistence)
  – Accelerations on SIMD-based (e.g., GPU) architectures (scalability)
  – Heterogeneity-aware (DRAM/NVRAM/NVMe/PFS) key-value storage engine
• Modern HPC system architecture
  – Volatile/non-volatile memory technologies (DRAM, NVRAM)
  – Multi-core nodes (w/ SIMD units), GPUs
  – RDMA-capable networks (InfiniBand, 40GbE RoCE)
  – Parallel file system (Lustre); local storage (PCIe-SSD/NVMe-SSD)
High-Performance Non-Blocking API Semantics
• Heterogeneous storage-aware key-value stores (e.g., ‘DRAM + PCIe/NVMe-SSD’)
  – Higher data retention at the cost of SSD I/O; suitable for out-of-memory scenarios
  – Performance limited by blocking API semantics
• Goal: achieve near in-memory speeds while being able to exploit hybrid memory
• Approach: novel non-blocking API semantics that extend the RDMA-Libmemcached library
  – memcached_(iset/iget/bset/bget) APIs for SET/GET
  – memcached_(test/wait) APIs for progressing communication
D. Shankar, X. Lu , N. Islam , M. W. Rahman , and D. K. Panda, “High-Performance Hybrid Key-Value Store on Modern Clusters with RDMA Interconnects and
SSDs: Non-blocking Extensions, Designs, and Benefits”, IPDPS 2016
[Figure: Blocking vs. non-blocking API flow between a Libmemcached client and a server with a hybrid slab manager (RAM+SSD), both sides using the RDMA-enhanced communication library. Non-blocking requests (NonB-i/NonB-b) return immediately; Test/Wait calls progress completion.]
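The issue-then-progress pattern behind the non-blocking APIs above can be illustrated with a toy single-process mock. The names `kv_iset`, `kv_wait`, and `kv_get` are hypothetical stand-ins, not the RDMA-Libmemcached API; the sketch only models how a SET returns immediately and a later wait call completes all pending operations:

```c
#include <string.h>

/* Toy mock of non-blocking SET semantics: kv_iset() only enqueues the
 * request and returns; kv_wait() acts as the progress engine that
 * completes every pending operation (in the real library, this is where
 * RDMA communication is progressed and SSD I/O is overlapped). */

#define MAX_KEYS 16
#define MAX_PENDING 16

typedef struct { char key[32]; char val[64]; int used; } kv_slot;
typedef struct { char key[32]; char val[64]; } kv_req;

static kv_slot store[MAX_KEYS];
static kv_req pending[MAX_PENDING];
static int npending = 0;

/* Issue a SET without waiting for completion; caller overlaps other work. */
int kv_iset(const char *key, const char *val) {
    if (npending >= MAX_PENDING) return -1;
    snprintf(pending[npending].key, 32, "%s", key);
    snprintf(pending[npending].val, 64, "%s", val);
    npending++;
    return 0;
}

/* Progress engine: complete all queued operations (models test/wait). */
void kv_wait(void) {
    for (int i = 0; i < npending; i++) {
        int slot = -1;
        for (int j = 0; j < MAX_KEYS; j++) {
            if (store[j].used && strcmp(store[j].key, pending[i].key) == 0) { slot = j; break; }
            if (!store[j].used && slot < 0) slot = j;
        }
        if (slot >= 0) {
            store[slot].used = 1;
            strcpy(store[slot].key, pending[i].key);
            strcpy(store[slot].val, pending[i].val);
        }
    }
    npending = 0;
}

/* Blocking GET, used here only to verify completed operations. */
const char *kv_get(const char *key) {
    for (int j = 0; j < MAX_KEYS; j++)
        if (store[j].used && strcmp(store[j].key, key) == 0) return store[j].val;
    return 0;
}
```

In the real design the window between `kv_iset` and `kv_wait` is where the client hides SSD I/O and network latency behind useful work.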
High-Performance Non-Blocking API Semantics
• SET/GET latency with non-blocking APIs: up to 8x improvement in overall latency vs. blocking API semantics over the RDMA+SSD hybrid design
• Up to 2.5x throughput gain observed at the client; request and response phases can be overlapped to hide SSD I/O overheads
[Figure: Average SET/GET latency (us) for IPoIB-Mem, RDMA-Mem, H-RDMA-Def, H-RDMA-Opt-Block, H-RDMA-Opt-NonB-i, and H-RDMA-Opt-NonB-b. Stacked latency components: miss penalty (backend DB access overhead), client wait, server response, cache update, cache check+load (memory and/or SSD read), and slab allocation (with SSD write on out-of-memory). The non-blocking variants overlap SSD I/O access, reaching near in-memory latency.]
Fast Online Erasure Coding with RDMA
• Erasure Coding (EC): a low storage-overhead alternative to replication
• Bottlenecks: (1) encode/decode compute overheads; (2) communication overhead of scattering/gathering distributed data/parity chunks
• Goal: make online EC viable for key-value stores
• Approach: non-blocking RDMA-aware semantics to enable compute/communication overlap
• Encode/decode offload capabilities integrated into the Memcached client (CE/CD) and server (SE/SD)

[Figure: EC performance vs. memory overhead trade-off. Replication is high-performance but not memory-efficient; no fault tolerance is memory-efficient but unprotected; the ideal scenario combines memory efficiency with high performance.]
RS(3,2) => 66% storage overhead vs. 200% for Replication-Factor=3
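The storage-overhead comparison above is simple arithmetic: RS(k,m) stores m parity chunks per k data chunks, while n-way replication stores n-1 extra copies. The helper names below are illustrative, not from the paper:

```c
/* Extra storage cost of Reed-Solomon RS(k,m), as a percentage of the
 * original data size: m parity chunks for every k data chunks. */
int rs_overhead_pct(int k, int m) {
    return (100 * m) / k;   /* RS(3,2): 2/3 of the data size extra, ~66% */
}

/* Extra storage cost of n-way replication: n-1 full copies. */
int replication_overhead_pct(int factor) {
    return 100 * (factor - 1);  /* factor 3: two full copies, 200% */
}
```

This is why RS(3,2) can tolerate two lost chunks for 66% overhead, whereas Replication-Factor=3 pays 200% for the same loss tolerance.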
Fast Online Erasure Coding with RDMA
D. Shankar , X. Lu , and D. K. Panda, “High-Performance and Resilient Key-Value Store with Online Erasure Coding for Big Data Workloads”, 37th International Conference on Distributed Computing Systems (ICDCS 2017)
[Figure (left): Total SET latency (us) vs. value size (512 B to 1 M) on an Intel Westmere + QDR-IB cluster (Rep=3 vs. RS(3,2)), comparing Sync-Rep=3, Async-Rep=3, Era(3,2)-CE-CD, Era(3,2)-SE-SD, and Era(3,2)-SE-CD; online EC gains ~1.6x to ~2.8x.]
[Figure (right): Aggregate throughput (Kops/sec) for 4K/16K/32K values under YCSB-A and YCSB-B, comparing Memc-RDMA-NoRep, Async-Rep=3, Era(3,2)-CE-CD, and Era(3,2)-SE-CD; gains of ~1.34x to ~1.5x.]
• Experiments with YCSB for online EC vs. async. replication:
  – 150 clients on 10 nodes of the SDSC Comet cluster (IB FDR + 24-core Intel Haswell) over a 5-node RDMA-Memcached cluster
  – (1) CE-CD gains ~1.34x for update-heavy workloads, with SE-CD on par; (2) CE-CD/SE-CD on par for read-heavy workloads
Co-Designing a Key-Value Store-based Burst Buffer over PFS
• Offline data analytics use case: Boldio, a hybrid and resilient key-value store-based burst-buffer system over Lustre
  – Overcomes local storage limitations on HPC nodes while retaining the performance benefits of data locality
  – Lightweight, transparent interface to Hadoop/Spark applications
• Accelerating I/O-intensive Big Data workloads
  – Non-blocking RDMA-Libmemcached APIs to maximize overlap
  – Client-based replication or online erasure coding with RDMA for resilience
  – Asynchronous persistence to the Lustre parallel file system at the RDMA-Memcached servers
D. Shankar, X. Lu, D. K. Panda, “Boldio: A Hybrid and Resilient Burst-Buffer over Lustre for Accelerating Big Data I/O”, IEEE International Conference on Big Data 2016 (Short Paper)
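The write path described above (writes complete against the in-memory tier, persistence to the PFS happens later) can be sketched as a toy write-behind buffer. This is an analogy, not Boldio's implementation: in Boldio the flush runs asynchronously at the RDMA-Memcached servers, while here it is a plain function and a local file stands in for Lustre; all names (`bb_write`, `bb_flush`) are illustrative:

```c
#include <stdio.h>

#define BB_MAX_RECS 32
#define BB_REC_LEN 64

/* In-memory burst buffer: the fast tier that absorbs application writes. */
typedef struct {
    char recs[BB_MAX_RECS][BB_REC_LEN];
    int count;
} burst_buffer;

/* Fast path: buffer the record in memory and return immediately. */
int bb_write(burst_buffer *bb, const char *rec) {
    if (bb->count >= BB_MAX_RECS) return -1;
    snprintf(bb->recs[bb->count], BB_REC_LEN, "%s", rec);
    bb->count++;
    return 0;
}

/* Slow path (asynchronous in the real design): drain the buffer to the
 * backing store and report how many records were persisted. */
int bb_flush(burst_buffer *bb, const char *pfs_path) {
    FILE *f = fopen(pfs_path, "a");
    if (!f) return -1;
    for (int i = 0; i < bb->count; i++)
        fprintf(f, "%s\n", bb->recs[i]);
    fclose(f);
    int flushed = bb->count;
    bb->count = 0;
    return flushed;
}
```

Decoupling the write from the flush is what lets the burst buffer deliver near-local-storage latency to Hadoop/Spark while the parallel file system absorbs the data at its own pace.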
Co-Designing Key-Value Store-based Burst Buffer over PFS
• TestDFSIO on SDSC Gordon Cluster (16-core Intel Sandy Bridge and IB QDR) with 16-node MapReduce Cluster + 4-node Boldio Cluster
• Boldio can sustain 3x and 6.7x gains in read and write throughputs over stand-alone Lustre
[Figure (left): TestDFSIO aggregate throughput (MBps) for 60 GB and 100 GB write/read workloads, Lustre-Direct vs. Boldio: 6.7x write and 3x read gains.]
[Figure (right): DFSIO aggregate throughput (MBps) for write/read, comparing Lustre-Direct, Alluxio-Remote, Boldio_Async-Rep=3, and Boldio_Online-EC; online EC adds no overhead (Rep=3 vs. RS(3,2)).]
• TestDFSIO on an Intel Westmere cluster (8-core Intel Sandy Bridge and IB QDR); 8-node MapReduce cluster + 5-node Boldio cluster over Lustre
• Performance gains over designs like Alluxio (formerly Tachyon) in HPC environments with no local storage
Exploring Opportunities with NVRAM and RDMA
[Figure: Two nodes (initiator/target), each with cores and L1/L2 caches, shared L3, memory controller with DRAM/NVRAM, I/O controller, and HCA on the PCIe bus; the durability path for RDMA into/from NVRAM and the local durability point are highlighted.]
• Emerging non-volatile memory (NVRAM) technologies
  – Potential: byte-addressable and persistent; capable of RDMA
• Observation: RDMA writes into NVRAM need to guarantee remote durability
  – Appliance method: hardware-assisted remote direct persistence
  – General-purpose server method: server-assisted software-based persistence
• Opportunity: designing high-performance RDMA-based persistence protocols (e.g., a persistent in-memory KVS over NVRAM)
D. Shankar, X. Lu, D. K. Panda, “RDMP-KVS: Remote Direct Memory Persistence Aware Key-Value Store for NVRAM Systems” (Under Submission)
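The server-assisted method above can be sketched as follows. This is an analogy, not the RDMP-KVS protocol: a plain mmap'ed file stands in for NVRAM, `memcpy` models the incoming RDMA write, and `msync()` stands in for the cache-line-flush-plus-fence sequence (e.g., clwb/sfence) a real NVRAM-aware server would issue before acknowledging durability:

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Server-assisted software persistence, sketched: after data lands in the
 * NVRAM-backed mapping (here via memcpy, in reality via an RDMA write
 * deposited by the NIC), the server's CPU must explicitly push it past
 * the volatile caches to the durability point before replying. */
int persist_value(const char *path, const char *val, size_t len) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    if (ftruncate(fd, (off_t)len) != 0) { close(fd); return -1; }
    char *region = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { close(fd); return -1; }
    memcpy(region, val, len);             /* models the incoming RDMA write */
    int rc = msync(region, len, MS_SYNC); /* force data to the durability point */
    munmap(region, len);
    close(fd);
    return rc;
}
```

The appliance method removes this server-side step by letting the hardware guarantee that a completed RDMA write is already durable at the target.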
Broader Impact: The High-Performance Big Data (HiBD) Project
• RDMA for Apache Spark (RDMA-Spark), Apache Hadoop 2.x (RDMA-Hadoop-2.x), RDMA for Apache HBase
• RDMA for Memcached (RDMA-Memcached)
  – RDMA-aware ‘DRAM+SSD’ hybrid Memcached server design
  – Non-blocking RDMA-based client API designs (RDMA-Libmemcached)
  – Based on Memcached 1.5.3 and Libmemcached client 1.0.18
  – Available for InfiniBand and RoCE
• OSU HiBD-Benchmarks (OHB)
  – Memcached Set/Get micro-benchmarks for blocking and non-blocking APIs, and hybrid Memcached designs
  – YCSB plugin for RDMA-Memcached
  – Also includes HDFS, HBase, and Spark micro-benchmarks
• http://hibd.cse.ohio-state.edu
• User base: 290 organizations, 34 countries, 27,950 downloads
Conclusion & Future Avenues
• Holistic approach to designing key-value storage by exploiting the capabilities of HPC clusters for (1) performance, (2) scalability, and (3) data resilience/availability
• RDMA-capable networks: (1) proposed non-blocking RDMA-based Libmemcached APIs; (2) fast online EC-based RDMA-Memcached designs
• Heterogeneous storage-awareness: (1) leveraged ‘RDMA+SSD’ hybrid designs; (2) ‘RDMA+NVRAM’ persistent key-value storage
• Application co-design: memory-centric data-intensive applications on HPC clusters
  – Online (e.g., SQL query cache, YCSB) and offline data analytics (e.g., the Boldio burst-buffer for Hadoop I/O)
• Future work: ongoing directions in this thesis
  – Heterogeneous compute capabilities of CPU/GPU: end-to-end SIMD-aware KVS designs
  – Exploring co-design of (1) read-intensive graph-based workloads (e.g., LinkBench, RedisGraph) and (2) a key-value storage engine for ML parameter server frameworks
http://www.cse.ohio-state.edu/~shankard
Thank You!
Network-Based Computing Laboratory: http://nowlab.cse.ohio-state.edu/
The High-Performance Big Data Project: http://hibd.cse.ohio-state.edu/