05-10-2017
© Atos
Portals4 LND Overview
Grégoire Pichon - Atos
New Lustre Network Driver
| 05-10-2017 | Grégoire Pichon | © Atos Atos | BDS | HPC R&D Data Management
LNet
Lustre Network layer
▶ communication infrastructure
– between Lustre clients and servers
▶ supports many commonly-used network types
– Infiniband, TCP/IP networks
– allows routing between networks
▶ key features
– high availability
– recovery
▶ Lustre Network Driver
– provides support for a particular network type
– implements LNet-LND API
[Diagram: Portal RPC and LNet selftest on top of LNet; LNet on top of socklnd (TCP/IP) and o2iblnd (Infiniband)]
Portals 4 LND
New Lustre Network Driver
▶ ptl4lnd
– based on Sandia National Laboratories Portals 4 Network Programming Interface
– LU-8932 lnet: define new network driver ptl4lnd
– kernel module: kptl4lnd.ko
– network name: ptlf
– network adapter identified by device number
▶ hardware
– Bull eXascale Interconnect (BXI)
[Diagram: ptl4lnd (BXI) added under LNet next to socklnd (TCP/IP) and o2iblnd (Infiniband)]
networks=ptlf0(0),ptlf1(1)
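The `networks=ptlf0(0),ptlf1(1)` setting above pairs each ptlf network with a device number. As a rough illustration of that syntax, the following C sketch splits such a string into (network number, device number) pairs. The struct and function names are hypothetical and the accepted grammar is only the `ptlf<net>(<dev>)` form shown on the slide; this is not the parser used by LNet or ptl4lnd.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one "ptlf<net>(<dev>)" entry. */
struct ptlf_net {
    unsigned int net;   /* network number, e.g. 0 in ptlf0 */
    unsigned int dev;   /* device number between the parentheses */
};

/*
 * Parse a comma-separated list such as "ptlf0(0),ptlf1(1)".
 * Returns the number of entries stored in out[], or -1 on a malformed
 * entry, an over-long string, or more entries than 'max'.
 */
int parse_ptlf_networks(const char *spec, struct ptlf_net *out, int max)
{
    char buf[128];
    char *tok;
    int n = 0;

    assert(spec != NULL && out != NULL && max > 0);
    if (strlen(spec) >= sizeof(buf))
        return -1;
    strcpy(buf, spec);

    for (tok = strtok(buf, ","); tok != NULL; tok = strtok(NULL, ",")) {
        char close = '\0';
        int consumed = 0;

        if (n == max)
            return -1;
        /* %n records how many characters were consumed by the match. */
        if (sscanf(tok, "ptlf%u(%u%c%n", &out[n].net, &out[n].dev,
                   &close, &consumed) != 3 ||
            close != ')' || tok[consumed] != '\0')
            return -1;
        n++;
    }
    return n;
}
```

Called on the string from the slide, the function fills two entries: network 0 on device 0 and network 1 on device 1.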
Portals 4
API for high-performance networking
▶ Common semantics for MPI and PGAS
– Target memory descriptors: “persistent” (one-sided ops.) or “use once” (two-sided ops.)
– Rich operations library: Put, Get, Swap, FetchAtomic
▶ Full hardware offloading from the host CPU
– Performance is not impacted by heavy load on the host CPU
– Triggered operations for protocol offloading (collectives, etc.)
▶ Non-connected reliable protocol
– No connection establishment time
– Constant memory footprint regardless of the number of communicators
▶ Unexpected message aggregation
– N to 1 communications optimization: single buffer for multiple messages
Portals 4 Communication Model
[Diagram: a request (MSG) is sent with nid / pid, portals idx and matchbits. At the target Network Interface (NI), the Portals Table Entry selected by the index holds a list of Match Entries (ME), each with its own matchbits (ma, mb, … mz) and an associated memory region.]
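The matching step of this model can be sketched in C as a walk over the match entries of one portals table entry. This is a simplified model, not the Portals 4 API: the types and the `pt_match()` function are hypothetical, matching is a plain equality test on the match bits (the ignore bits defined by Portals 4 are left out), and "persistent" versus "use once" is reduced to a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified match entry: match bits plus the memory region they expose. */
struct match_entry {
    uint64_t match_bits;
    void *start;
    size_t length;
    int use_once;   /* 1: unlinked after the first match, 0: persistent */
    int linked;     /* 1 while the entry is still on the list */
};

/* One portals table entry modelled as a fixed array of match entries. */
struct pt_entry {
    struct match_entry *me;
    size_t count;
};

/*
 * Walk the list in order and return the first linked entry whose match bits
 * equal those of the incoming request, or NULL when nothing matches.
 */
struct match_entry *pt_match(struct pt_entry *pt, uint64_t match_bits)
{
    size_t i;

    assert(pt != NULL);
    for (i = 0; i < pt->count; i++) {
        struct match_entry *me = &pt->me[i];

        if (!me->linked || me->match_bits != match_bits)
            continue;
        if (me->use_once)
            me->linked = 0;   /* a use-once entry serves a single request */
        return me;
    }
    return NULL;
}
```

A persistent entry keeps matching every request that carries its match bits, while a use-once entry matches exactly one request and then disappears from the list.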
Ptl4lnd internals Small data transfer (from node A to node B)
▶ Node A: lnd_send()
– build ptl4lnd imm. msg with LNet hdr
– copy LNet payload into msg
– PtlPut(): IMMEDIATE message with fixed matchbits
▶ Node B: lnd_recv()
– persistent ME pre-posted with fixed matchbits
– match a persistent ME
– parse LNet hdr
– copy payload from msg to LNet destination memory
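The small-transfer path above amounts to copying the payload inline behind the header on the sender and copying it out again on the receiver. The C sketch below illustrates only that copy-in / copy-out idea. The header layout, the 256-byte payload limit and all names are assumptions made for the example; the real ptl4lnd message format is not described on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IMM_MAX_PAYLOAD 256   /* assumed limit for this sketch */

/* Stand-in for the LNet header; the real structure has more fields. */
struct fake_lnet_hdr {
    uint32_t type;
    uint32_t payload_length;
};

/* Hypothetical immediate message: header followed by inline payload. */
struct imm_msg {
    struct fake_lnet_hdr hdr;
    unsigned char payload[IMM_MAX_PAYLOAD];
};

/* Sender side: build the message; returns bytes to send or 0 if too large. */
size_t imm_build(struct imm_msg *msg, uint32_t type,
                 const void *data, size_t len)
{
    assert(msg != NULL && (data != NULL || len == 0));
    if (len > IMM_MAX_PAYLOAD)
        return 0;
    msg->hdr.type = type;
    msg->hdr.payload_length = (uint32_t)len;
    if (len > 0)
        memcpy(msg->payload, data, len);
    return offsetof(struct imm_msg, payload) + len;
}

/* Receiver side: parse the header and copy the payload to its destination. */
size_t imm_deliver(const struct imm_msg *msg, void *dst, size_t dst_len)
{
    size_t len;

    assert(msg != NULL && dst != NULL);
    len = msg->hdr.payload_length;
    if (len > IMM_MAX_PAYLOAD || len > dst_len)
        return 0;
    memcpy(dst, msg->payload, len);
    return len;
}
```

The two copies are the price of the single-message path, which is why it is reserved for small payloads.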
Ptl4lnd internals RDMA data transfer (from node A to node B)
▶ Node A: lnd_send()
– get a unique matchbits
– prepare a use-once ME
– build ptl4lnd rdma msg with LNet hdr & matchbits
– PtlPut(): RDV message with fixed matchbits
▶ Node B: lnd_recv()
– persistent ME pre-posted with fixed matchbits
– match a persistent ME
– parse LNet hdr
– prepare a Memory Descriptor (MD)
– PtlGet(): BULK message with unique matchbits
▶ Node A
– match use-once ME, remove it from list
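The rendezvous exchange above can be modelled in plain C to make the role of the unique matchbits visible. In this sketch a counter hands out match bits, the sender's use-once entries live in a small array, and the receiver's Get is replaced by a lookup plus `memcpy()`. Every name is hypothetical and no Portals call is made; it only mirrors the order of the steps.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_BULK_ME 8

/* Use-once match entry exposing the sender's bulk buffer. */
struct bulk_me {
    uint64_t match_bits;
    const void *start;
    size_t length;
    int linked;
};

/* Sender-side state: next unique match bits and the posted entries. */
struct sender {
    uint64_t next_match_bits;
    struct bulk_me me[MAX_BULK_ME];
};

/* Content of the rendezvous (RDV) message sent with the fixed match bits. */
struct rdv_msg {
    uint64_t bulk_match_bits;
    size_t bulk_length;
};

/* Node A: expose 'buf' under fresh match bits and fill the RDV message. */
int sender_post_bulk(struct sender *s, const void *buf, size_t len,
                     struct rdv_msg *rdv)
{
    int i;

    assert(s != NULL && buf != NULL && rdv != NULL);
    for (i = 0; i < MAX_BULK_ME; i++) {
        if (s->me[i].linked)
            continue;
        s->me[i].match_bits = s->next_match_bits++;
        s->me[i].start = buf;
        s->me[i].length = len;
        s->me[i].linked = 1;
        rdv->bulk_match_bits = s->me[i].match_bits;
        rdv->bulk_length = len;
        return 0;
    }
    return -1;   /* no free entry */
}

/* Node B: stand-in for the Get; copies the data and unlinks the entry. */
int receiver_get_bulk(struct sender *s, const struct rdv_msg *rdv,
                      void *dst, size_t dst_len)
{
    int i;

    assert(s != NULL && rdv != NULL && dst != NULL);
    for (i = 0; i < MAX_BULK_ME; i++) {
        struct bulk_me *me = &s->me[i];

        if (!me->linked || me->match_bits != rdv->bulk_match_bits)
            continue;
        if (me->length > dst_len)
            return -1;
        memcpy(dst, me->start, me->length);
        me->linked = 0;   /* use-once: removed after the first match */
        return 0;
    }
    return -1;   /* nothing matched */
}
```

A second Get with the same match bits fails in this model, which mirrors the "match use-once ME, remove it from list" step.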
Ptl4lnd internals resource management
▶ reception (RX) resources
– Portals Table entry PTL4_PORTAL
– overflow list: persistent MEs (start, length, fixed matchbits) receiving IMM. or RDV messages
– priority list: use-once MEs (start, length or iovec, unique matchbits) serving BULK messages
– RX Event Queue
▶ transmission (TX) resources
– tx descriptors (TX)
– Memory Descriptor (MD)
– TX Event Queue
Ptl4lnd parameters and tuning
▶ credits (32) and peercredits (8)
– number of concurrent sends to all peers and to a single peer
– must be set uniformly across cluster nodes
– higher values allow higher bandwidth but consume more resources
▶ checksum (off)
– check integrity of non-bulk messages
▶ nscheds (2)
– number of scheduler threads that handle RX buffers, TX finalization, …
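To illustrate how the two credit limits interact, here is a minimal C sketch of credit accounting. It assumes the simplest possible scheme: a send takes one interface-wide credit and one per-peer credit and gives both back on completion. The names are hypothetical and the real driver's accounting (for example how credits are returned by the peer) may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Defaults quoted on the slide. */
#define CREDITS      32
#define PEER_CREDITS 8

/* Sends still allowed on the whole interface. */
struct ni_state {
    int credits;
};

/* Sends still allowed towards one peer. */
struct peer_state {
    int credits;
};

void ni_init(struct ni_state *ni)
{
    assert(ni != NULL);
    ni->credits = CREDITS;
}

void peer_init(struct peer_state *peer)
{
    assert(peer != NULL);
    peer->credits = PEER_CREDITS;
}

/* Take one credit of each kind; return 0 if the send must wait instead. */
int tx_try_start(struct ni_state *ni, struct peer_state *peer)
{
    assert(ni != NULL && peer != NULL);
    if (ni->credits == 0 || peer->credits == 0)
        return 0;
    ni->credits--;
    peer->credits--;
    return 1;
}

/* Give both credits back once the transmission has completed. */
void tx_done(struct ni_state *ni, struct peer_state *peer)
{
    assert(ni != NULL && peer != NULL);
    assert(ni->credits < CREDITS && peer->credits < PEER_CREDITS);
    ni->credits++;
    peer->credits++;
}
```

With the default values, a burst of ten sends to one peer starts eight of them and leaves two waiting, while 24 interface credits remain available for other peers.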
Ptl4lnd debug and statistics
▶ slots in event queues
– number of reserved slots
– total number of slots
▶ peer status
– nid-pid, state, credits, alive time
▶ statistics
– number of TX of different types
▶ dump TX and RX states
# cat /sys/kernel/kptl4lnd/0/stats
     PUT  GET  IMM  NOOP  HELL  NAK  BULK
tx     3    0   23     1     0    0     0
# cat /sys/kernel/kptl4lnd/0/peers
<NID>:<PID> | <state> | <credits> | <alive>
24:9 | ACTIVE | 8 / 1 + 7 | 253s
25:9 | ACTIVE | 8 / 3 + 5 | 212s
# cat /sys/kernel/kptl4lnd/0/events
tx events: 14 reserved / 65536
Ptl4lnd performance achievements LNet performance
Lustre IEEL 2.7.21, BXI 1.2, NIC – SWITCH – NIC
[Figure: LNet Selftest – brw, portals4: write and read throughput (MiB/s, axis 0 to 10000) versus IO size (axis 0 to 1024)]
Ptl4lnd performance achievements Lustre performance
Lustre IEEL 2.7.21, BXI 1.2, NIC – SWITCH – NIC
[Figure: IOR bandwidth, sequential, file-per-process, single client: write and read throughput (MiB/s, axis 0 to 10000) versus number of tasks (axis 0 to 12)]
Ptl4lnd development Next steps
▶ Finalize LND code
– perform some optimizations
– handle remaining corner cases
▶ Ensure compatibility with recent LNet changes
– currently runs on Lustre IEEL 2.7.21
– build and test on Lustre master
▶ Push code to the Lustre community
Development Challenges
LNet – LND interface
▶ register to LNet
– lnet_register_lnd()
– lnet_unregister_lnd()
▶ provide an lnet_lnd structure
– lnd id (SOCKLND, O2IBLND, LOLND, GNILND, PTL4LND)
– startup/shutdown network communication on the network interface
– send/receive LNet message on the network interface
– notification / query on peer health / aliveness
– control commands
▶ use LNet callbacks
– lnet_parse(), lnet_finalize(), lnet_set_reply_msg_len(), …
struct lnet_lnd ptlflnd = {
        .lnd_type     = PTL4LND,
        .lnd_startup  = ptlf_startup,
        .lnd_shutdown = ptlf_shutdown,
        .lnd_ctl      = ptlf_ctl,
        .lnd_send     = ptlf_send,
        .lnd_recv     = ptlf_recv,
        .lnd_query    = ptlf_query,
};
looks simple …
LNet – LND interface … but still some open questions
▶ lack of documentation
– routine semantics
• when should an LND call lnet_notify()?
• which LNet callbacks should or must an LND use?
– parameters description
– call context
/* Start receiving 'mlen' bytes of payload data, skipping the following
 * 'rlen' - 'mlen' bytes. 'private' is the 'private' passed to
 * lnet_parse(). Return non-zero for immediate failure, otherwise
 * complete later with lnet_finalize(). This also gives back a receive
 * credit if the LND does flow control. */
int (*lnd_recv)(struct lnet_ni *ni, void *private, struct lnet_msg *msg,
                int delayed, unsigned int niov,
                struct kvec *iov, lnet_kiov_t *kiov,
                unsigned int offset, unsigned int mlen, unsigned int rlen);
Tests and bringup
▶ LNet selftest is not convenient for LNet debugging or LND bringup
– test infrastructure is set up with … LNet messages
– not designed to perform a single LNetPut() or LNetGet() operation
– cannot test
• specific (exotic) memory region transfers
• error cases (non-matching ME, short MD)
▶ development of an LNet test kernel module
– define a pseudo device for ioctl commands
• post LNet ME-MD for reception
• launch LNet data transfer
– also post permanent LNet ME-MD
Opportunities offered by modern Interconnects
Network advanced features (1) … that Lustre could benefit from
▶ Atomic operations
– atomically read and update data located in remote memory regions
– host bypass, low latency
– operations
• min, max, sum, prod, or, and, swap, conditional swap, …
– usage
• distributed lock
• object sequence number
– LNetAtomic(), LNetFetchAtomic(), LNetSwap()
[Diagram: several hosts send Atomic Increment operations to an object sequence number held in one host's memory; the NIC applies them]
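The "object sequence number" usage boils down to a fetch-and-add on a shared counter. The C sketch below shows that semantics with a local C11 atomic standing in for the remote operation the NIC would execute. The names are hypothetical; LNetFetchAtomic() is a proposal on the slide and is not called here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Counter that would live in the memory region exposed by the owning node. */
struct seq_object {
    _Atomic uint64_t next;
};

/*
 * Fetch-and-add: return the current sequence number and advance it by one.
 * A local C11 atomic stands in here for the operation the NIC would perform.
 */
uint64_t seq_fetch_increment(struct seq_object *obj)
{
    assert(obj != NULL);
    return atomic_fetch_add(&obj->next, 1);
}
```

Each caller receives a distinct value even when requests arrive concurrently, which is the property a sequence number allocator needs.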
Network advanced features (2) … that Lustre could benefit from
▶ Triggered operations
– set up operations triggered by incoming messages
– host bypass, low latency
– usage:
• tree-based reduction
• recovery synchronization
– LNetTriggeredPut(), LNetTriggeredAtomic()
[Diagram: two Host NICs each receive N * Atomic Increments from their hosts and issue Triggered Atomic Increments]
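A triggered operation can be pictured as an action armed on a counter that fires once enough messages have arrived. The following C sketch models one node of a tree-based reduction in that way. It is a host-side simulation with hypothetical names; on real hardware the counting and the firing would happen in the NIC, and LNetTriggeredAtomic() is only a proposal on the slide.

```c
#include <assert.h>
#include <stddef.h>

/* Operation armed on a counter; it fires once the threshold is reached. */
struct triggered_op {
    unsigned long threshold;
    void (*fire)(void *arg);
    void *arg;
    int fired;
};

/* Counter bumped by every incoming message. */
struct counter {
    unsigned long value;
    struct triggered_op *op;   /* at most one armed operation in this sketch */
};

/* Account for one incoming message and launch the armed operation if due. */
void counter_increment(struct counter *ct)
{
    struct triggered_op *op;

    assert(ct != NULL);
    ct->value++;
    op = ct->op;
    if (op != NULL && !op->fired && ct->value >= op->threshold) {
        op->fired = 1;
        op->fire(op->arg);
    }
}

/* Example action: forward one increment to the parent of a reduction tree. */
void forward_to_parent(void *arg)
{
    counter_increment((struct counter *)arg);
}
```

Arming `forward_to_parent` with a threshold of N on an intermediate node means the parent sees a single increment after N children have reported, without any host code running in between.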
Atos, the Atos logo, Atos Codex, Atos Consulting, Atos Worldgrid, Worldline, BlueKiwi, Bull, Canopy the Open Cloud Company, Unify, Yunano, Zero Email, Zero Email Certified and The Zero Email Company are registered trademarks of the Atos group. September 2017. © 2017 Atos. Confidential information owned by Atos, to be used by the recipient only. This document, or any part of it, may not be reproduced, copied, circulated and/or distributed nor quoted without prior written approval from Atos.
Thanks