Supporting OFED over Non-InfiniBand SANs

Devesh Sharma
Hardware Technology Development Group
Centre for Development of Advanced Computing
Pune, India

e-mail: [email protected]

Abstract—OpenFabrics Enterprise Distribution (OFED) is open-source software committed to providing a common communication stack to all RDMA-capable System Area Networks (SANs). It supports high-performance MPIs and legacy protocols for the HPC domain and the data-centre community. Currently, it supports InfiniBand (IB) and the Internet Wide Area RDMA Protocol (iWARP). This paper presents a technique to support the OFED software stack over non-IB RDMA-capable SANs. We propose the design of a Virtual Management Port (VMP) to enable the IB subnet management model. Integrating VMP with the IB-Verbs interface driver avoids hardware and OFED modifications and enables the connection manager that is mandatory to run user applications. The performance evaluation shows that VMP is lightweight.

Keywords: non-InfiniBand SANs; OFED; subnet management model emulation

I. INTRODUCTION

The latest Top500 supercomputer list shows that the existence of a proprietary (non-IB) RDMA-capable SAN is fairly possible, although 85% of the clusters use GigE or IB as the primary interconnect. Supporting legacy protocols (TCP/IP, Sockets Direct Protocol, iSCSI Extensions for RDMA) and MPIs over non-IB SANs requires a proprietary software stack implementation. Developing a stack for every protocol is time consuming.

OFED is an open-source software stack developed by the OpenFabrics Alliance (OFA). It was originally developed for IB. However, with the widespread acceptance of RDMA, OFA is now committed to providing a single software stack for all RDMA-capable interconnects. The latest releases of OFED support iWARP as well. OFED includes most of the legacy protocols and the latest MPIs that take advantage of RDMA to extract performance.

In this paper, we present a technique that makes the OFED software stack available over non-IB RDMA-capable SANs. To enable the connection manager, the IB subnet management model is supported without changing OFED or the underlying hardware. We propose the design of VMP to emulate an IB switch management-port and to route Directed Routed Subnet Management Packets (DR-SMPs). VMP is integrated with the IB-Verbs interface driver and introduces very little overhead on the CPUs of the cluster end-nodes.

II. BACKGROUND

In order to run user applications over legacy protocols and MPIs, it is important to support the connection manager available with OFED. However, the connection manager uses basic services of the IB subnet management model. Therefore, for a non-IB interconnect, subnet management support is mandatory.

The IB Architecture (IBA) defines the subnet management model to configure and maintain the subnet after power-on. The Subnet Manager (SM) approaches the Subnet Management Agent (SMA) of every device to obtain device-specific attributes. The SM uses control packets, called DR-SMPs, to traverse the unconfigured subnet. The means to exchange DR-SMPs over the SAN is provided by the Subnet Management Interface (SMI), which maintains a consistent DR-SMP header using the directed route algorithm [1].
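For reference, the fields of a DR-SMP that matter for directed routing can be pictured with the following trimmed C sketch (field names follow the OFED struct ib_smp layout, but the struct below is only an illustration, not the exact wire format):

#include <stdint.h>

/* Trimmed view of a directed-route SMP (cf. struct ib_smp in OFED);
 * only the fields used by the directed-route algorithm are shown. */
struct dr_smp {
    uint8_t  mgmt_class;        /* 0x81: directed-route subnet management class */
    uint8_t  method;            /* SubnGet / SubnSet / SubnGetResp              */
    uint8_t  hop_ptr;           /* current position in the path vectors         */
    uint8_t  hop_cnt;           /* number of directed-route hops                */
    uint16_t dr_slid;           /* directed-route source LID                    */
    uint16_t dr_dlid;           /* directed-route destination LID               */
    uint8_t  data[64];          /* attribute payload, e.g. PortInfo             */
    uint8_t  initial_path[64];  /* outbound port number for each hop            */
    uint8_t  return_path[64];   /* filled in hop by hop for the response        */
};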

III. DESIGN AND IMPLEMENTATION

To keep both the OFED software stack and the underlying non-IB hardware unchanged, all the complexity related to VMP is handled in the vendor-specific IB-Verbs interface driver. Further, rather than running the SM on a non-IB switch, it is run on an end-node within the subnet. In the rest of the paper, we refer to the node hosting the SM as the SM-node.

VMP provides a solution to emulate an IB switch management-port in the form of the Management-port Emulator. The Address Translator handles DR-SMP routing. Fig. 1 shows the block diagram of VMP.

[Figure 1 block diagram: vendor-specific ib_post_send implementation; the SQ of QP0 feeds the VMP (Address Translator and Mgmt-Port Emulator); the SQs of other QPs and the VMP output feed the Host Channel Adaptor (HCA).]

Figure 1. VMP integration with IB-Verbs Driver. DR-SMPs are pre-processed in the Send Queue (SQ) of Queue-Pair zero (QP0).
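As a rough illustration of the integration point shown in Fig. 1, the vendor-specific ib_post_send path can branch on QP0 and hand DR-SMPs to VMP before they reach the HCA. The following C sketch is hypothetical: the types and the functions vmp_emulate_mgmt_port(), vmp_translate_address(), and hw_post_send() stand in for the Management-port Emulator, the Address Translator, and the native send path, and are not the actual driver symbols.

#include <stdint.h>

/* Minimal mock types standing in for the driver's own structures. */
struct work_request { uint8_t mgmt_class; /* ... addressing, payload ... */ };
struct qp           { uint32_t qpn;       /* ... hardware queue state ... */ };

int  is_dr_smp(const struct work_request *wr);
int  dr_smp_targets_virtual_switch(const struct work_request *wr);
int  vmp_emulate_mgmt_port(struct work_request *wr);
void vmp_translate_address(struct work_request *wr);
int  hw_post_send(struct qp *qp, struct work_request *wr);

/* Hypothetical dispatch inside a vendor-specific ib_post_send:
 * DR-SMPs posted on the SQ of QP0 are pre-processed by VMP, while all
 * other work-requests go straight to the hardware send queue. */
int vendor_post_send(struct qp *qp, struct work_request *wr)
{
    if (qp->qpn == 0 && is_dr_smp(wr)) {
        if (dr_smp_targets_virtual_switch(wr))
            return vmp_emulate_mgmt_port(wr);   /* answered locally on the SM-node */
        vmp_translate_address(wr);              /* rewrite source/destination      */
    }
    return hw_post_send(qp, wr);                /* native non-IB send path         */
}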

A. Management-port Emulator

In an IB-compliant switch, the SMA and SMI are attached to a special management-port. For a non-IB switch, however, supporting an IB-compliant management-port may not be possible due to architectural differences.

The emulator utilizes the resources of the end-nodes within the subnet to perform the management-port functionality for a non-IB switch. It emulates the IB switch SMA on the SM-node. The emulator maintains the information generated by the SM during the various subnet management stages [2, 3] and maintains the read-only attributes of the switch, e.g. the number of available physical ports.
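One way to picture the emulated SMA state is a small record of the virtual switch's read-only attributes, filled in from the non-IB switch configuration and returned to the SM on query. The structure and handler below are illustrative only and assume a simplified attribute layout, not the IBA SwitchInfo format.

#include <stdint.h>
#include <string.h>

/* Illustrative read-only attributes kept by the Management-port Emulator
 * for the single virtual IB switch it exports to the SM. */
struct vswitch_info {
    uint64_t node_guid;   /* GUID advertised for the virtual switch          */
    uint8_t  num_ports;   /* physical ports of the underlying non-IB switch  */
    uint16_t lid;         /* LID assigned by the SM during configuration     */
};

/* Fill the 64-byte attribute payload of a response DR-SMP from the stored
 * state; the byte layout here is simplified for illustration. */
static void vswitch_get_attr(const struct vswitch_info *sw, uint8_t resp[64])
{
    memset(resp, 0, 64);
    memcpy(resp, &sw->node_guid, sizeof(sw->node_guid));
    resp[8] = sw->num_ports;
    memcpy(&resp[9], &sw->lid, sizeof(sw->lid));
}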

The switch SMI functionality is distributed across all the end-nodes of the subnet. The header of a DR-SMP is pre-processed on the end-node that generates the packet. Every end-node executes the directed route algorithm on behalf of the non-IB switch to maintain a consistent packet header, which prevents packet drops due to an inconsistent header at the receiving end.
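The consistency rule each end-node enforces amounts to advancing the hop pointer and recording the arrival port, much as a real IB switch would, so that the response can retrace the path. Below is a deliberately simplified C sketch reusing the struct dr_smp above; the exact index handling in the IBA directed route algorithm differs in detail.

/* Simplified outbound-direction update performed on behalf of the virtual
 * switch: remember the port the packet logically arrived on and advance the
 * hop pointer; the next logical egress port is initial_path[hop_ptr]. */
static void dr_smp_forward_outbound(struct dr_smp *smp, uint8_t in_port)
{
    if (smp->hop_ptr < smp->hop_cnt) {
        smp->return_path[smp->hop_ptr] = in_port;  /* record the way back  */
        smp->hop_ptr++;                            /* advance to next hop  */
    }
}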

The switch management-port emulation exports one virtual IB switch irrespective of the number of non-IB physical switches present within the subnet. During the subnet discovery phase [2, 3], the SM detects a virtual IB switch and proceeds beyond it to explore all the end-nodes connected to the switches. Emulating the management-port on the end-nodes avoids introducing hardware or firmware changes in the non-IB switches.

B. Address Translator

In an IB-compliant network, DR-SMPs are forwarded hop by hop using path-vector lookup. This mechanism is very specific to IBA and may not be supported by a non-IB SAN. Therefore, a non-IB SAN is not capable of routing DR-SMPs without hardware and firmware changes.

In order to route DR-SMPs without changing the underlying hardware, the Address Translator modifies the work-requests of all DR-SMPs originating from the SM-node. In a work-request, the source and destination addresses are replaced with the already known non-IB HCA addresses. Executing the modified work-request causes the DR-SMP to reach the intended end-node using the native routing mechanism. To generate a valid destination address for a DR-SMP, the Address Translator uses a hash function that takes the value of initial-path-vector(2) in the DR-SMP header as the key.
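A minimal sketch of the translation step, assuming a cluster-wide table that maps the second initial-path entry (the virtual-switch port leading to the target end-node) to the known non-IB HCA addresses; the table, the address type, and the function names are hypothetical.

#include <stdint.h>

#define MAX_NODES 64

/* Hypothetical table of native (non-IB) HCA addresses, indexed by a hash of
 * initial_path[2] from the DR-SMP header. */
static uint32_t hca_addr_table[MAX_NODES];

static uint32_t dr_smp_dest_hash(uint8_t port)
{
    return port % MAX_NODES;                  /* trivial illustrative hash */
}

struct wr_addr { uint32_t src; uint32_t dst; };  /* work-request addressing */

/* Rewrite the work-request addresses so that the native fabric delivers the
 * DR-SMP to the end-node behind virtual-switch port initial_path[2]. */
static void vmp_translate(struct wr_addr *a, uint32_t local_hca_addr,
                          const uint8_t initial_path[64])
{
    a->src = local_hca_addr;                                     /* SM-node's own HCA */
    a->dst = hca_addr_table[dr_smp_dest_hash(initial_path[2])];  /* target end-node   */
}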

The DR-SMPs travelling towards the SM-node do not require address translation, owing to the architecture of the OFED software stack. While creating a response packet, the stack uses the query packet's work-completion:source-address to fill the response packet's work-request:destination-address. Executing the response work-request therefore causes the packet to reach the SM-node.

Address translation solves the problem of routing DR-SMPs over a non-IB SAN. As a result, the master SM exchanges DR-SMPs with the other end-nodes of the subnet and completes the various subnet management stages.

IV. VALIDATION AND PERFORMANCE EVALUATION

VMP can be implemented for any non-IB SAN to make OFED available. We validated VMP with PARAMNet-3 (Pnet-3) [4], a 10 Gbps RDMA-capable SAN designed and developed by C-DAC. The test-bed consisted of 8 nodes, each with four 2.93 GHz Intel Xeon Tigerton quad-core processors, 64 GB RAM, a PCIe Pnet-3 HCA, OFED-1.2, and RHEL-4.5. OpenSM-1.0.3 was used as the SM for the cluster. A successful run of OpenSM generated a table of all the devices it detected; we checked the table to find the virtual IB switch and all the HCAs. OpenSM initiates hop-1 packets for the switch management-port emulator, while hop-2 packets are destined for the other end-nodes but are pre-processed on the SM-node. The Pnet-3 IB-Verbs interface driver was instrumented to measure the CPU cycles required to process hop-1 and hop-2 DR-SMPs on the SM-node.
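The per-packet costs reported in Table I can be obtained by reading the CPU timestamp counter around the DR-SMP processing path. The following is a minimal x86 sketch using __rdtsc(); it is not the instrumentation actually built into the Pnet-3 driver.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc() */

/* Measure the CPU cycles spent pre-processing one DR-SMP; process_dr_smp
 * stands in for the VMP hop-1/hop-2 handling in the IB-Verbs driver. */
static uint64_t measure_dr_smp_cycles(void (*process_dr_smp)(void *), void *smp)
{
    uint64_t start = __rdtsc();
    process_dr_smp(smp);
    return __rdtsc() - start;
}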

Table I shows that the CPU cycles required to process a single DR-SMP (hop-1 or hop-2) are very few and independent of the cluster size, which shows that VMP is lightweight. However, as the cluster size increases, OpenSM injects more and more DR-SMPs into the network on every sweep cycle. This increases the overhead on the SM-node and may hamper application performance. For a small cluster, the overhead on the SM-node can be reduced by adjusting the sweep timeout while running OpenSM; for very large clusters, a dedicated SM-node can be allocated. The distributed switch SMI requires very few CPU cycles from the other cluster end-nodes.

We have successfully run the NAS Parallel Benchmarks over the test-bed to validate the availability of the connection manager to user applications.

V. CONCLUSION

We have presented a technique that can be implemented by any non-IB RDMA-capable SAN vendor to support OFED. To enable the connection manager, VMP enabled the subnet management model while keeping the OFED software stack as well as the underlying hardware unchanged. As a result, the SM successfully built its database, which the connection manager used during application start-up to establish connections. The performance evaluation showed that VMP is lightweight.

REFERENCES

[1] InfiniBand Architecture Specification Volume 1, Release 1.2, InfiniBand Trade Association, October 2004.

[2] A. Bermúdez, R. Casado, F. J. Quiles, T. M. Pinkston, and J. Duato, "Evaluation of a Subnet Management Mechanism for InfiniBand Networks," in Proc. Int'l Conference on Parallel Processing, Kaohsiung, Taiwan (ROC), October 2003, pp. 117-124.

[3] A. Vishnu, A. R. Mamidala, H. W. Jin, and D. K. Panda, "Performance Modeling of Subnet Management on Fat Tree InfiniBand Networks using OpenSM," in IPDPS'05, 2005.

[4] PARAMNet-3 on C-DAC. [Online]. Available: http://www.cdac.in/html/htdg/products.asp

TABLE I. NUMBER OF DR-SMPS INJECTED BY OPENSM AND CPU CYCLES REQUIRED PER DR-SMP, AS CLUSTER SIZE VARIES.

Number of Nodes        1     2     4     8    16    32
Injected DR-SMPs      21    30    48    84   164   325
Hop-1 (CPU cycles)   798   752   763   735   644   631
Hop-2 (CPU cycles)     0    95   117   124   115   376
