Optimized Distributed Data Sharing Substrate
in Multi-Core Commodity Clusters: A
Comprehensive Study with Applications
K. Vaidyanathan, P. Lai, S. Narravula and D. K. Panda
Network Based Computing Laboratory (NBCL)
The Ohio State University
Presentation Outline
• Introduction and Motivation
• Distributed Data Sharing Substrate
• Proposed Design Optimizations
• Experimental Results
• Conclusions and Future Work
Introduction and Motivation
• Interactive data-driven applications
– Stock trading, airline tickets, medical imaging, online auction, online
banking, web streaming, …
– Ability to interact, synthesize and visualize data
• Datacenters enable such capabilities
– Process data and reply to client queries
– Common and increasing in size (IBM, Amazon, Google)
• Datacenters are unable to meet increasing client demands
Datacenter Architecture
[Figure: clients connect over a WAN to a three-tier datacenter: Tier 0 proxy/web servers (Apache, STORM), Tier 1 application servers (PHP, CGI), and Tier 2 database servers (MySQL, DB2) backed by storage, with cross-tier services for resource monitoring (Ganglia), resource management (IBM WebSphere), and caching. Example domains: stock markets, airline industries, medical imaging, online auctions.]
• Applications host web content online
• Services improve performance and scalability
• State sharing is common in applications and services
– Communicate and synchronize (intra-node, intra-tier and inter-tier)
– More computation and communication requirements
State Sharing in Datacenters
[Figure: examples of state sharing. Intra-node: Apache with caching, servlets with the application, and Apache with resource management communicate via IPC. Intra-tier: STORM A and STORM B exchange state via IPC. Inter-tier: Tier 0 proxy servers (Apache, STORM) and Tier 1 application servers exchange data through memory copies over the network. The shared state includes system state, cached data, Tier 1 load, resource-monitoring and load-balancing information, and resource-adaptation decisions.]
State Sharing in Datacenters…
• Several applications employ their own
– data management protocols
– versioning of stored data
– synchronization primitives
• Datacenter services frequently exchange
– system load, system state, locks
– cached data
Issues
• Ad-hoc messaging protocols for exchanging data/resources
• Same data/resource maintained at multiple places (e.g., load information, data)
• Protocols used are typically TCP/IP, IPC mechanisms, memory copies, etc.
• Performance may depend on the back-end load
• Scalability issues
High-Performance Networks
• InfiniBand, 10 Gigabit Ethernet
• High performance
– Low latency (< 1 usec) and high bandwidth (> 32 Gbps with QDR adapters)
• Novel features
– One-sided RDMA and atomics, multicast, QoS
• OpenFabrics Alliance (http://www.openfabrics.org/)
– Common stack for several networks including iWARP (LAN/WAN)
Datacenter Research at OSU
[Figure: OSU datacenter research stack. Existing datacenter components sit atop advanced system services (dynamic content caching: active caching, cooperative caching; active resource adaptation: reconfiguration, resource monitoring; advanced service primitives; QoS & admission control). These are built on the distributed data/resource sharing substrate (soft shared state, lock manager, global memory aggregator), advanced communication protocols and subsystems (Sockets Direct Protocol, RDMA, atomics, multicast), and high-performance networks (InfiniBand, iWARP 10GigE).]
Datacenter Homepage: http://nowlab.cse.ohio-state.edu/projects/data-centers/
Distributed Data Sharing Substrate
[Figure: datacenter applications and services issue Get and Put operations on shared state (load information, system state, meta-data, data) through the substrate.]
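The Get/Put interface shown in the figure can be sketched as a simple keyed shared-state store. This is an illustrative single-process stand-in, not the actual DDSS API; the class and method names are hypothetical.

```python
# Hypothetical sketch of the substrate's Get/Put interface: applications
# and services read and update shared state (load info, system state,
# cached data) through put/get operations keyed by name. In the real
# substrate the table is distributed and accessed over the network.
import threading

class SharedStateStore:
    """In-process stand-in for the distributed shared-state table."""
    def __init__(self):
        self._table = {}
        self._lock = threading.Lock()  # DDSS provides real lock primitives

    def put(self, key, value):
        with self._lock:
            self._table[key] = value

    def get(self, key):
        with self._lock:
            return self._table.get(key)

# A monitoring service publishes load; a load balancer consumes it.
store = SharedStateStore()
store.put("tier1/load", 0.72)
print(store.get("tier1/load"))  # -> 0.72
```

In the distributed setting, the same interface hides whether the key lives locally or on a remote node.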
Multicore Architectures
• Increasing cores per chip
– More parallelism available
• Intel, AMD
– Dual-core, quad-core
– 80-core prototypes have been built
• Significant benefits for datacenters
– Applications are multi-threaded in nature
– Design optimizations in state sharing mechanisms
– Opportunities for dedicating one or more cores
[Figure: future multicore systems]
Objective
• Can we enhance the distributed data sharing
substrate using the features of multicore
architectures by dedicating one or more of
the cores?
• How do these enhancements help in
improving the overall performance with
datacenter applications and services?
Distributed Data Sharing Substrate
• Use of a common service thread to access the shared state
• Applications get shared state information via the service thread
• Several design optimizations in communicating with the service thread
– Message queues (MQ-DDSS)
– Memory-mapped queues for requests (RMQ-DDSS)
– Memory-mapped queues for requests and responses (RCQ-DDSS)
Message Queue-based DDSS (MQ-DDSS)
[Figure: application threads produce requests into, and consume completions from, kernel message queues shared with the service thread (IPC_Send/IPC_Recv). A kernel thread delivers interrupt events, so the kernel is involved on both the request and the completion path.]
Message Queue-based DDSS
• Kernel involvement
– IPC send and receive operations
– Communication progress
• Limitations
– Several context switches
– Interrupt overheads
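The cost MQ-DDSS pays can be seen in a minimal sketch of kernel-mediated request/response queues. Here an OS pipe stands in for the kernel message queues (the slides use SysV-style IPC_Send/IPC_Recv); every enqueue and dequeue is a system call, which is exactly the context-switch and interrupt overhead listed above.

```python
# Sketch of the MQ-DDSS pattern: each request and completion crosses
# the kernel. An os.pipe() stands in for a kernel message queue; every
# os.write/os.read is a system call.
import os

req_r, req_w = os.pipe()   # request queue (application -> service thread)
cmp_r, cmp_w = os.pipe()   # completion queue (service thread -> application)

# Application thread: enqueue a request (system call #1).
os.write(req_w, b"get tier1/load")

# Service thread: dequeue the request (#2), process it,
# and post a completion (#3).
request = os.read(req_r, 64)
os.write(cmp_w, b"0.72")

# Application thread: block for the completion (system call #4).
response = os.read(cmp_r, 64)
print(request, response)  # four kernel crossings for one shared-state operation
```

The memory-mapped variants described next remove these crossings from the request path (RMQ-DDSS) or from both paths (RCQ-DDSS).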
Request/Message Queue-based
DDSS (RMQ-DDSS)
[Figure: requests travel through a user-space memory-mapped request queue (application threads produce, the service thread consumes), while completions still pass through kernel message queues (IPC_Recv/IPC_Send), so kernel involvement is limited to the completion path.]
Request/Completion Queue-based
DDSS (RCQ-DDSS)
[Figure: both the request and the completion queue are memory-mapped in user space; application threads and the service thread produce and consume entries directly, with no kernel involvement.]
RMQ-DDSS and RCQ-DDSS Schemes
• RMQ-DDSS scheme
+ Fewer interrupts and context switches compared to MQ-DDSS
+ Improved response time, as requests are sent via memory-mapped queues
– May occupy significant CPU
• RCQ-DDSS scheme
+ Avoids kernel involvement
+ Significant improvement in response time, as requests and responses are sent via memory-mapped queues
– May occupy more CPU than RMQ-DDSS: applications and the service thread must poll the completion queue
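The RCQ-DDSS trade-off (no kernel crossings, but CPU spent polling) can be sketched with a memory-mapped region shared by an application thread and the service thread. The slot layout, flag byte, and function names below are illustrative assumptions, not the actual DDSS format; Python threads and an anonymous mmap stand in for processes sharing a mapped segment.

```python
# Sketch of the RCQ-DDSS idea: request and completion queues live in a
# memory-mapped region, so enqueue/dequeue are plain loads and stores
# plus busy-wait polling; the kernel is never entered on the data path.
import mmap
import threading

SLOT = 64
region = mmap.mmap(-1, 2 * SLOT)   # anonymous mapping: [request][completion]

def post(off, payload):
    """Produce: write the payload, then set the slot's 'full' flag byte."""
    region[off + 1 : off + 1 + len(payload)] = payload
    region[off] = 1

def poll(off):
    """Consume: busy-wait on the flag (the CPU cost of RCQ), then read."""
    while region[off] == 0:
        pass
    data = bytes(region[off + 1 : off + SLOT]).rstrip(b"\x00")
    region[off] = 0                 # mark the slot consumed
    return data

def service_thread():
    req = poll(0)                   # consume from the request queue
    post(SLOT, b"value:" + req)     # produce into the completion queue

t = threading.Thread(target=service_thread)
t.start()
post(0, b"tier1/load")              # application produces a request
result = poll(SLOT)                 # application polls for the completion
t.join()
print(result)  # -> b'value:tier1/load'
```

RMQ-DDSS would keep the `post(0, ...)` / `poll(0)` pair but replace the completion queue with a kernel message queue, blocking instead of polling on the response.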
Experimental Testbed
• InfiniBand experiments
– 560-core cluster: 70 compute nodes with dual 2.33 GHz Intel Xeon quad-core processors
– Mellanox MT25208 dual-port HCA
• 10-Gigabit Ethernet experiments
– Intel dual quad-core Xeon 3.0 GHz, 512 MB memory
– Chelsio T3B 10 GigE PCI-Express adapters
• OpenFabrics stack
– OFED 1.2
• Experimental outline
– Microbenchmarks (performance and scalability)
– Application performance (R-Trees, B-Trees, STORM, checkpointing)
– Dedicating cores for datacenter services (resource monitoring)
Basic Performance of DDSS
[Figure: DDSS latency vs. message size (1 byte to 16K) on InfiniBand and 10-Gigabit Ethernet, and IPC latency vs. number of client threads (1 to 7), for RCQ-DDSS, RMQ-DDSS, and MQ-DDSS.]
• RCQ-DDSS scales with increasing client threads
• RCQ-DDSS performs better than RMQ-DDSS and MQ-DDSS
DDSS Scalability
[Figure: DDSS latency and IPC latency vs. number of client threads (1 to 512) for RCQ-DDSS, RMQ-DDSS, and MQ-DDSS, with keys on a single node and with keys distributed.]
• Hybrid approach is required for scalability with a large number of threads
• DDSS scales when keys are distributed
Performance with R-Trees, B-Trees, STORM
[Figure: B-Tree and R-Tree query time vs. fraction of records accessed (20% to 100%), and STORM query time vs. number of records (10K to 1000K), comparing RCQ-SS, RMQ-SS, and MQ-SS against the traditional implementations.]
• MQ-SS shows significant improvement compared to traditional implementations, but RCQ-SS shows only marginal improvement over MQ-SS
Data Sharing Performance in Applications
[Figure: data sharing time vs. fraction of records accessed (20% to 100%) for B-Trees and R-Trees, and vs. number of records (1K to 1000K) for STORM, comparing RCQ-DDSS, RMQ-DDSS, and MQ-DDSS.]
• RCQ-DDSS shows significant improvement as compared to RMQ-DDSS and MQ-DDSS
Performance with checkpointing
[Figure: checkpointing latency and execution time vs. number of client threads for RCQ-DDSS, RMQ-DDSS, and MQ-DDSS, with clients on different nodes and on a single node (non-distributed).]
• Hybrid approach is required for scalability with a large number of threads
Performance with Dedicated Cores
• Dedicating a core for resource monitoring can avoid up
to 50% degradation in client response time
[Figure: client response latency over 1000 iterations for 4, 8, 16, and 32 servers, with and without a core dedicated to resource monitoring.]
Conclusions & Future Work
• Proposed multicore optimizations for the distributed data sharing substrate
• Evaluations with several applications show significant improvement
• Showed the benefits of dedicating cores for services in datacenters
• Future work: dedicating cores to other datacenter services and datacenter-specific operations