Designing High Performance DSM Systems using InfiniBand...
Designing High Performance DSM Systems using InfiniBand Features
Ranjit Noronha and Dhabaleswar K. Panda
The Ohio State University
NBC
Outline
• Introduction
• Motivation
• Design and Implementation
• Results
• Conclusions
• Future Work
Introduction
Software DSM
• HLRC/VIA (Rutgers), TreadMarks (Rice), JIAJIA (ICT China)
• Depends on user and software layer
• Depends on communication protocols provided by the system such as TCP, UDP, etc.
• Degraded performance because of false sharing and high overhead of communication
• Has scaling problems
Introduction
Modern Interconnects (InfiniBand, Myrinet, Quadrics)
Low Latency (InfiniBand 5.0 µs)
High Bandwidth (InfiniBand 4X up to 10 Gbps)
Programmable NIC
User Level Protocols (VAPI, GM)
Can deliver performance close to that of the underlying hardware
RDMA Write/Read, Atomic Operations, Service Levels, Multicast
Motivation
Traditional DSM
• Uses Request/Response Communication Model (asynchronous)
• Separate signal handler thread needed
• Application processing interrupted
• Cache effects
Can network-based features be used to reduce interrupt overhead?
[Figure: timeline between nodes 0 and 1; labels: Send REQ, Interrupt, Recv REQ, Process, Send RES]
Motivation
Asynchronous communication model
• Use network features to achieve the same effect (synchronous/hybrid communication model)
Potential Advantages
• Partial offload of protocol to network
• More application processing time
• Reduced copying
• Better caching
Potential Disadvantages
• Longer protocol execution time
• Ordering problems
• Consistency issues
Outline
• Introduction
• Motivation
• Design and Implementation
• Results
• Conclusions
• Future Work
Preliminaries
RDMA
• Remote Direct Memory Access
• Allows access to memory on a remote node
• No involvement from the remote node
• RDMA Write
• RDMA Read
RDMA Write Example
[Figure: hosts A and B connected through their NICs; data X shown on both hosts]
RDMA Read Example
[Figure: hosts A and B connected through their NICs; data P shown on both hosts]
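As a rough illustration of the one-sided semantics in the two RDMA examples, the following Python sketch models each host's registered memory as a byte buffer and lets a NIC object copy bytes into or out of a peer's buffer without running any code on the peer. The names `Host`, `Nic`, `rdma_write`, and `rdma_read` are hypothetical and are not the VAPI interface, and the sketch assumes host A initiates both operations. It is a toy model, not a real RDMA implementation.

```python
class Host:
    """A host with a registered memory region and a count of CPU-handled messages."""

    def __init__(self, name, size=16):
        self.name = name
        self.memory = bytearray(size)   # stands in for a registered buffer
        self.cpu_interrupts = 0         # would only grow with two-sided traffic


class Nic:
    """Toy NIC that can access the memory of a peer host directly."""

    def __init__(self, host):
        self.host = host

    def rdma_write(self, peer, local_off, remote_off, length):
        # Copy local bytes into the peer's memory; the peer's CPU is not involved.
        data = self.host.memory[local_off:local_off + length]
        peer.memory[remote_off:remote_off + length] = data

    def rdma_read(self, peer, remote_off, local_off, length):
        # Copy peer bytes into local memory; the peer's CPU is not involved.
        data = peer.memory[remote_off:remote_off + length]
        self.host.memory[local_off:local_off + length] = data


a, b = Host("A"), Host("B")
nic_a = Nic(a)

a.memory[0:1] = b"X"
nic_a.rdma_write(b, local_off=0, remote_off=0, length=1)   # X is now also on B

b.memory[4:5] = b"P"
nic_a.rdma_read(b, remote_off=4, local_off=4, length=1)    # P is now also on A
```

In both calls host B only has its memory touched; its interrupt counter never changes, which is the property the DSM design in these slides relies on.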
Preliminaries - Remote Atomic Operations
Remote Atomic Operations
• Compare and Swap (CMP_AND_SWAP): conditionally change a location on a remote machine atomically
• Fetch and Add
Remote Atomic Operations Example
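• Compare and Swap
[Figure: hosts A and B, each with a NIC. A supplies a compare value Z and a swap value S; the location on B holds Y. Test: Z == Y? If so, S replaces Y.]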
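The compare-and-swap test in this example (Z == Y?) can be sketched in Python as below. This is a toy model: the class `RemoteWord` and its methods are hypothetical names, and a local lock stands in for the atomicity that the NIC provides in hardware. It assumes the common convention that both operations return the value the location held before the operation.

```python
import threading


class RemoteWord:
    """A memory word on the target host (toy model)."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()   # stands in for NIC-enforced atomicity

    def compare_and_swap(self, compare, swap):
        """If the word equals `compare`, store `swap`. Always return the old value."""
        with self._lock:
            old = self._value
            if old == compare:
                self._value = swap
            return old

    def fetch_and_add(self, delta):
        """Add `delta` to the word and return the value it held before."""
        with self._lock:
            old = self._value
            self._value = old + delta
            return old

    def load(self):
        with self._lock:
            return self._value


Y, Z, S = 7, 7, 42                      # illustrative values only
word = RemoteWord(Y)
returned = word.compare_and_swap(compare=Z, swap=S)   # Z == Y, so S is stored
failed = word.compare_and_swap(compare=Z, swap=99)    # word now holds S, so no change
```

The caller can tell whether its swap took effect by comparing the returned value with the compare value it supplied.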
Preliminaries - HLRC
HLRC/VIA (Rutgers)
• Home Based Lazy Release Consistency Model
• Page Based DSM System
Basic Operations
• Page
• Diff
• Lock
• Use interrupts
• Referred to as ASYNC
HLRC Programming Example
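• Initial value of X = 0
• B is home node for page P containing X
Lock-protected updates shown in the figure:
Acquire_Lock(L1); X = X * 2; Release_Lock(L1)
Acquire_Lock(L1); X = X + 1; Release_Lock(L1)
[Figure: timeline (Time axis) for nodes A and B running the updates above; annotations: "Read page P (containing X) from B" and "Send diffs for P to B"]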
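The "Send diffs for P to B" step can be sketched as follows. This assumes the usual twin-and-diff technique of home-based lazy release consistency: the writer keeps an unmodified copy (a twin) of the page before its first write and, at release time, sends only the changed words to the home node. The page size, word granularity, and function names are illustrative assumptions, not taken from HLRC/VIA.

```python
PAGE_WORDS = 8  # tiny page for illustration; real pages are much larger


def make_twin(page):
    """Copy the page before the first write so changes can be found later."""
    return list(page)


def compute_diff(twin, page):
    """Return (index, new_value) pairs for every word that changed."""
    return [(i, new) for i, (old, new) in enumerate(zip(twin, page)) if old != new]


def apply_diff(home_page, diff):
    """Merge a writer's changes into the home node's copy of the page."""
    for index, value in diff:
        home_page[index] = value


# B is the home of page P; X lives in word 0 and starts at 0.
home_copy = [0] * PAGE_WORDS

# Writer: fetch P from the home, twin it, update X inside the critical section.
local_copy = list(home_copy)        # "Read page P (containing X) from B"
twin = make_twin(local_copy)
local_copy[0] = local_copy[0] + 1   # X = X + 1
diff = compute_diff(twin, local_copy)

apply_diff(home_copy, diff)         # "Send diffs for P to B"
```

Only one word travels to the home node here, even though the whole page was fetched.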
HLRC Design
[Figure: HLRC layered on ASYNC, which handles the Page, Diff, and Lock operations]
Our Design
Design consists of 2 protocols
• ARDMAR (Atomic and RDMA Read)
• DRAW (Diff using RDMA Write)
ARDMAR is a synchronous protocol
DRAW is a hybrid protocol
NEWGENDSM = ARDMAR + DRAW
NEWGENDSM
[Figure: HLRC over ASYNC (Page, Diff, Lock) shown beside HLRC over NEWGENDSM (Page via ARDMAR, Diff via DRAW, Lock)]
ASYNC (page fetch)
[Figure: nodes A, B, and C; "Home for page 2"; request/response message sequence with labels DEFAULT, HOME, REQ, RES, and PAGE; home entries shown as B]
ARDMAR (Atomic and RDMA Read)
[Figure: nodes A, B, and C; "Home for page 2"; the home entry changes from "--" to B; operations shown: CMP AND SWAP (twice) and RDMA READ]
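One possible reading of this diagram is sketched below in Python; it is an interpretation, not a description of the authors' implementation. It assumes that a home directory entry starts unassigned ("--"), that a faulting node issues a compare-and-swap to install itself as home when no home exists, that the value returned by the compare-and-swap otherwise names the current home, and that the page is then pulled from the home with an RDMA Read. `HomeTable`, `fetch_page`, and `NO_HOME` are hypothetical names.

```python
NO_HOME = None  # assumed meaning of the "--" entry in the diagram


class HomeTable:
    """Home directory kept on a default node; updated only through CAS."""

    def __init__(self):
        self.entries = {}

    def compare_and_swap(self, page, compare, swap):
        # Stands in for a remote atomic: returns the old entry in every case.
        old = self.entries.get(page, NO_HOME)
        if old == compare:
            self.entries[page] = swap
        return old


def fetch_page(node, page, table, memories):
    """Return (home, data) for `page` as seen by `node`."""
    # Try to become home; this succeeds only if no home has been recorded yet.
    old = table.compare_and_swap(page, compare=NO_HOME, swap=node)
    home = node if old is NO_HOME else old
    if home == node:
        return home, memories[node].setdefault(page, bytes(4))
    # Otherwise copy the page out of the home's memory (stands in for RDMA Read).
    data = memories[home][page]
    memories[node][page] = data
    return home, data


memories = {"A": {}, "B": {}, "C": {}}
table = HomeTable()
memories["B"][2] = b"data"

home_b, _ = fetch_page("B", 2, table, memories)        # B touches page 2 first
home_a, data_a = fetch_page("A", 2, table, memories)   # A learns that B is home
```

Neither call runs a handler on another node, which is the sense in which the slides call ARDMAR a synchronous protocol.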
NEWGENDSM
[Figure: HLRC over ASYNC (Page, Diff, Lock) shown beside HLRC over NEWGENDSM (Page via ARDMAR, Diff via DRAW, Lock)]
ASYNC (diff)
[Figure: nodes A and B, pages P1 and P2; message labels: DIFF (P1), ACK (P1), DIFF (P2), ACK (P2), TIMESTAMP (P1), TIMESTAMP (P2)]
DRAW
[Figure: nodes A and B, pages P1 and P2; message labels: RDMA WRITE DIFF (P1), RDMA WRITE DIFF (P2), TIMESTAMP (P1 and P2)]
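To make the difference between the ASYNC (diff) and DRAW diagrams concrete, the Python sketch below lists the message labels each one shows for a set of pages. It assumes the two-page pattern generalizes to any number of pages, and the order of the ASYNC list is not meant to reproduce the order in the diagram. The function names are hypothetical.

```python
def async_diff_messages(pages):
    """Labels from the ASYNC (diff) diagram: DIFF, ACK, and TIMESTAMP per page."""
    msgs = []
    for p in pages:
        msgs += [f"DIFF ({p})", f"ACK ({p})", f"TIMESTAMP ({p})"]
    return msgs


def draw_diff_messages(pages):
    """Labels from the DRAW diagram: one RDMA write per page, one shared timestamp."""
    msgs = [f"RDMA WRITE DIFF ({p})" for p in pages]
    msgs.append("TIMESTAMP (" + " and ".join(pages) + ")")
    return msgs


async_msgs = async_diff_messages(["P1", "P2"])
draw_msgs = draw_diff_messages(["P1", "P2"])
```

For the two pages in the diagrams this gives six labels for ASYNC and three for DRAW, and only the single timestamp message in DRAW is not an RDMA write.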
Outline
• Introduction
• Motivation
• Design and Implementation
• Results
• Conclusions
• Future Work
Experimental Setup
• HLRC/VIA (Rutgers) modified to work with VAPI
• InfiniScale MT43132 eight-port 4X switch
• Mellanox InfiniHost MT23108 dual-port 4X HCAs
• SuperMicro SUPER P4DL6: dual Pentium Xeon 2.4 GHz, 512 MB memory, 133 MHz PCI-X bus
• Linux 2.4.7-10 SMP kernel
Evaluation
Micro-benchmarks (modified from TreadMarks suite)
• Page: average time to fetch a page from a home node when a number of nodes are accessing it
• Diff: measure Compute Time and Apply Time for a small diff (single word) and a large diff (entire page)
Applications from SPLASH-2 suite (Barnes, TSP, 3Dfft, Radix)
Application   Parameter        Size
Barnes        Bodies           32678
3Dfft         Grid size        128
Radix         Number of keys   2621440
TSP           Tour size        20 (large)
Microbenchmarks (Page)
[Figure: Page microbenchmark; x-axis: Number of nodes (2 to 8); y-axis: Page fetch time (usec), 0 to 160; series: ASYNC, ARDMAR]
• Page fetch time with ARDMAR is lower than with ASYNC at 8 nodes
Microbenchmarks (Diff)
[Figure: Diff microbenchmark; x-axis: Diff Component (Compute (Small), Apply (Small), Compute (Large), Apply (Large)); y-axis: Time (milliseconds), 0 to 50; series: ASYNC, DRAW]
• DRAW performs better than ASYNC in all cases
Application Speedup
[Figure: Application speedup on 8 nodes; x-axis: Application (Barnes, TSP, 3Dfft, Radix); y-axis: Speedup (8 nodes), 0 to 6; series: ASYNC, ARDMAR, DRAW, NEWGENDSM]
• Speedup w.r.t. sequential running times
• Radix: NEWGENDSM speedup is 1.63 times that of ASYNC
• Barnes: NEWGENDSM speedup is 1.59 times that of ASYNC
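For reference, the Radix comparison can be written out. This assumes speedup is defined in the usual way, as sequential running time over 8-node parallel running time; the symbols below are introduced here and do not appear on the slide.

```latex
S_{\text{proto}} = \frac{T_{\text{seq}}}{T_{\text{proto}}(8)},
\qquad
\frac{S_{\text{NEWGENDSM}}}{S_{\text{ASYNC}}}
  = \frac{T_{\text{ASYNC}}(8)}{T_{\text{NEWGENDSM}}(8)}
  \approx 1.63 \quad \text{(Radix)}
```

Because both speedups share the same sequential baseline, a speedup 1.63 times higher corresponds to an 8-node running time of roughly 1/1.63, or about 0.61, of the ASYNC running time.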
Breakdown
• Diff time is a part of Barrier Compute Time
• Page time reduced significantly
Asynchronous Handler Time
• Asynchronous handler time substantially reduced for Barnes and 3Dfft
[Figure: Asynchronous handler time; x-axis: Application (Barnes, TSP, 3Dfft, Radix); y-axis: Time (milliseconds), 0 to 350; series: ASYNC, ARDMAR, DRAW, NEWGENDSM]
Conclusions
• Explored reducing asynchronous protocol processing time
• Used network features like RDMA Read/Write and atomic operations
• Incorporated in a protocol, NEWGENDSM
• Microbenchmark and application level evaluation
• Improvement in parallel speedup of up to 1.63 times
Future Work
• Exploit small message latency to implement “critical word first”
• RDMA Read for “early restart”
• Atomic operations for locking
• Migrating home protocol
Web Pointers
NBC home page: http://nowlab.cis.ohio-state.edu/
E-mail: {noronha, panda}@cis.ohio-state.edu
Breakdown
• Page time reduced for Barnes