Intro to IB for End Users
Transcript of Intro to IB for End Users
8/2/2019 Intro to IB for End Users
Introduction to InfiniBand for End Users
Industry-Standard Value and Performance for High Performance Computing and the Enterprise
Paul Grun
InfiniBand Trade Association
Contents
Acknowledgements ..............................................................................4
About the Author ..................................................................................4
Introduction .........................................................................................5
Chapter 1 Basic Concepts ..................................................................7
Chapter 2 InfiniBand for HPC ..........................................................14
InfiniBand MPI Clusters...................................................................15
Storage and HPC ............................................................................16
Upper Layer Protocols for HPC .........................................................18
Chapter 3 InfiniBand for the Enterprise ..............................................19
Devoting Server Resources to Application Processing ..........................20
A Flexible Server Architecture ...........................................................21
Investment Protection ......................................................................24
Cloud Computing: An Emerging Data Center Model ...........................25
InfiniBand in the Distributed Enterprise .............................................28
Chapter 4 Designing with InfiniBand ..................................................31
Building on Top of the Verbs ............................................................32
Chapter 5 InfiniBand Architecture and Features ...................................37
Address Translation .........................................................................38
The InfiniBand Transport .................................................................39
InfiniBand Link Layer Considerations ................................................42
Management and Services ...............................................................42
InfiniBand Management Architecture ................................................43
Chapter 6 Achieving an Interoperable Solution ....................................45
An Interoperable Software Solution ...................................................45
Ensuring Hardware Interoperability ...................................................46
Chapter 7 InfiniBand Performance Capabilities and Examples ...............48
Chapter 8 Into the Future .................................................................52
Acknowledgements

Special thanks to Gilad Shainer at Mellanox for his work and content related to Chapter 7 - InfiniBand Performance Capabilities and Examples.

Thanks to the following individuals for their contributions to this publication: David Southwell, Ariel Cohen, Rupert Dance, Jay Diepenbrock, Jim Ryan, Cheri Winterberg and Brian Sparks.
About the Author

Paul Grun has been working on I/O for mainframes and servers for more than 30 years and has been involved in high performance networks and RDMA technology since before the inception of InfiniBand. As a member of the InfiniBand Trade Association, Paul served on the Link Working Group during the development of the InfiniBand Architecture; he is a member of the IBTA Steering Committee and is a past chair of the Technical Working Group. He recently served as chair of the IBXoE Working Group, charged with developing the new RDMA over Converged Ethernet (RoCE) specification. He is currently chief scientist for System Fabric Works, Inc., a consulting and professional services company dedicated to delivering RDMA and storage solutions for high performance computing, commercial enterprise and cloud computing systems.
Introduction
InfiniBand is not complex. Despite its reputation as an exotic technology, the concepts behind it are surprisingly straightforward. One purpose of this book is to clearly describe the basic concepts behind the InfiniBand Architecture.

Understanding the basic concepts is all well and good, but not very useful unless those concepts deliver a meaningful and significant value to the user of the technology. And that is the real purpose of this book: to draw straight-line connections between the basic concepts behind the InfiniBand Architecture and how those concepts deliver real value to the enterprise data center and to high performance computing (HPC).

The readers of this book are exquisitely knowledgeable in their respective fields of endeavor, whether those endeavors are installing and maintaining a large HPC cluster, overseeing the design and development of an enterprise data center, deploying servers and networks, or building devices/appliances targeted at solving a specific problem requiring computers. But most readers of this book are unlikely to be experts in the arcane details of networking architecture. Although the working title for this book was "InfiniBand for Dummies," its readers are anything but dummies. This book is designed to give its readers, in one short sitting, a view into the InfiniBand Architecture that you probably could not get by reading the specification itself.

The point is that you should not have to hold an advanced degree in obscure networking technologies in order to understand how InfiniBand technology can bring benefits to your chosen field of endeavor and to understand what those benefits are. This minibook is written for you. At the end of the next hour you will be able to clearly see how InfiniBand can solve problems you are facing today.
By the same token, this book is not a detailed technical treatise of the underlying theory, nor does it provide a tutorial on deploying the InfiniBand Architecture. All we can hope for in this short book is to bring a level of enlightenment about this exciting technology. The best measure of our success is if you, the reader, feel motivated after reading this book to learn more about how to deploy the InfiniBand Architecture in your computing environment.

We will avoid the low-level bits and bytes of the elegant architecture (which are complex) and focus on the concepts and applications that are visible to the user of InfiniBand technology. When you decide to move forward with deploying InfiniBand, there is ample assistance available in the market to help you deploy it, just as there is for any other networking technology like traditional TCP/IP/Ethernet. The InfiniBand Trade Association and its website (www.infinibandta.org) are great sources
of access to those with deep experience in developing, deploying and using the InfiniBand Architecture.
The InfiniBand Architecture emerged in 1999 as the joining of two competing proposals known as Next Generation I/O and Future I/O. These proposals, and the InfiniBand Architecture that resulted from their merger, are all rooted in the Virtual Interface Architecture (VIA). The Virtual Interface Architecture is based on two synergistic concepts: direct access to a network interface (e.g. a NIC) straight from application space, and an ability for applications to exchange data directly between their respective virtual buffers across a network, all without involving the operating system directly in the address translation and networking processes needed to do so. This is the notion of Channel I/O: the creation of a virtual channel directly connecting two applications that exist in entirely separate address spaces. We will come back to a detailed description of these key concepts in the next few chapters.
InfiniBand is often compared, and not improperly, to a traditional network such as TCP/IP/Ethernet. In some respects it is a fair comparison, but in many other respects the comparison is wildly off the mark. It is true that InfiniBand is based on networking concepts and includes the usual layers found in a traditional network, but beyond that there are more differences between InfiniBand and an IP network than there are similarities. The key is that InfiniBand provides a messaging service that applications can access directly. The messaging service can be used for storage, for InterProcess Communication (IPC) or for a host of other purposes, anything that requires an application to communicate with others in its environment. The key benefits that InfiniBand delivers accrue from the way that the InfiniBand messaging service is presented to the application, and from the underlying technology used to transport and deliver those messages. This is much different from TCP/IP/Ethernet, which is a byte-stream oriented transport for conducting bytes of information between sockets applications.
The first chapter gives a high-level view of the InfiniBand Architecture and reviews the basic concepts on which it is founded. The next two chapters relate those basic concepts to value propositions which are important to users in the HPC community and in the commercial enterprise. Each of those communities has unique needs, but both benefit from the advantages InfiniBand can bring. Chapter 4 addresses InfiniBand's software architecture and describes the open source software stacks that are available to ease and simplify the deployment of InfiniBand. Chapter 5 delves slightly deeper into the nuances of some of the key elements of the architecture. Chapter 6 describes the efforts undertaken by both the InfiniBand Trade Association and the OpenFabrics Alliance (www.openfabrics.org) to ensure that solutions from a range of vendors will interoperate and that they will behave in conformance with the specification. Chapter 7 describes some comparative performance proof points, and finally Chapter 8 briefly reviews the ways in which the InfiniBand Trade Association ensures that the InfiniBand Architecture continues to move ahead into the future.
Chapter 1 Basic Concepts
Networks are often thought of as the set of routers, switches and cables that are plugged into servers and storage devices. When asked, most people would probably say that a network is used to connect servers to other servers, storage and a network backbone. It does seem that traditional networking generally starts with a bottom-up view, with much attention focused on the underlying wires and switches. This is a very network-centric view of the world; the way that an application communicates is driven partly by the nature of the traffic. This is what has given rise to the multiple dedicated fabrics found in many data centers today.

A server generally provides a selection of communication services to the applications it supports: one for storage, one for networking and frequently a third for specialized IPC traffic. This complement of communications stacks (storage, networking, etc.) and the device adapters accompanying each one comprise a shared resource. This means that the operating system owns these resources and makes them available to applications on an as-needed basis. An application, in turn, relies on the operating system to provide it with the communication services it needs. To use one of these services, the application uses some sort of interface or API to make a request to the operating system, which conducts the transaction on behalf of the application.

InfiniBand, on the other hand, begins with a distinctly application-centric view by asking a profoundly simple question: how can an application's accesses to other applications and to storage be made as simple, efficient and direct as possible? If one looks at the I/O problem from this application-centric perspective, one comes up with a much different approach to network architecture.
The basic idea behind InfiniBand is simple: it provides applications with an easy-to-use messaging service. This service can be used to communicate with other applications or processes or to access storage. That's the whole idea. Instead of making a request to the operating system for access to one of the server's communication resources, an application accesses the InfiniBand messaging service directly. Since the service provided is a straightforward messaging service, the complex dance between an application and a traditional
network is eliminated. Since storage traffic can be viewed as consisting of control messages and data messages, the same messaging service is equally at home for use in storage applications.
A key to this approach is that the InfiniBand Architecture gives every application direct access to the messaging service. Direct access means that an application need not rely on the operating system to transfer messages. This idea is in contrast to a standard network environment, where the shared network resources (e.g. the TCP/IP network stack and associated NICs) are solely owned by the operating system and cannot be accessed by the user application. This means that an application cannot have direct access to the network and instead must rely on the involvement of the operating system to move data from the application's virtual buffer space, through the network stack and out onto the wire. There is an identical involvement of the operating system at the other end of the connection. InfiniBand avoids this through a technique known as stack bypass. How it does this, while ensuring the same degree of application isolation and protection that an operating system would provide, is the one key piece of secret sauce that underlies the InfiniBand Architecture.
InfiniBand provides the messaging service by creating a channel connecting an application to any other application or service with which the application needs to communicate. The applications using the service can be either user space applications or kernel applications such as a file system. This application-centric approach to the computing problem is the key differentiator between InfiniBand and traditional networks. You could think of it as a top-down approach to architecting a networking solution. Everything else in the InfiniBand Architecture is there to support this one simple goal: provide a message service to be used by an application to communicate directly with another application or with storage.

The challenge for InfiniBand's designers was to create these channels between virtual address spaces which would be capable of carrying messages of varying sizes, and to ensure that the channels are isolated and protected. These channels would need to serve as pipes or connections between entirely disjoint virtual address spaces. In fact, the two virtual spaces might even be located in entirely disjoint physical address spaces; in other words, hosted by different servers, even over a distance.
[Figure 1 shows two servers, each containing an application (App), its buffer (Buf), an operating system (OS) and a NIC, with a channel running directly between the two applications' buffers.]

Figure 1: InfiniBand creates a channel directly connecting an application in its virtual address space to an application in another virtual address space. The two applications can be in disjoint physical address spaces hosted by different servers.
It is convenient to give a name to the endpoints of the channel; we'll call them Queue Pairs (QPs). Each QP consists of a Send Queue and a Receive Queue, and each QP represents one end of a channel. If an application requires more than one connection, more QPs are created. The QPs are the structure by which an application accesses InfiniBand's messaging service. In order to avoid involving the OS, the applications at each end of the channel must have direct access to these QPs. This is accomplished by mapping the QPs directly into each application's virtual address space. Thus, the application at each end of the connection has direct, virtual access to the channel connecting it to the application (or storage) at the other end of the channel. This is the notion of Channel I/O.
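The queue pair idea can be sketched in a few lines of Python. This is an illustrative model only: real QPs are structures managed by the channel adapter and mapped into the application's virtual address space, and the class and method names here are invented for the sketch.

```python
from collections import deque

class QueuePair:
    """Toy model of a channel endpoint: a Send Queue plus a Receive Queue."""
    def __init__(self):
        self.send_queue = deque()   # work waiting for the (modeled) hardware
        self.recv_queue = deque()   # pre-posted receive buffers

    def post_send(self, message):
        # The application enqueues work directly; no operating system call
        # appears anywhere in this step.
        self.send_queue.append(message)

    def post_recv(self, buf):
        self.recv_queue.append(buf)

# One QP at each end of the channel; a second channel needs a second pair.
qp_a, qp_b = QueuePair(), QueuePair()
qp_a.post_send(b"hello")
qp_b.post_recv(bytearray(16))
```

The point of the model is the ownership: the queues belong to the application, not to the OS, which is what makes stack bypass possible.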
Taken altogether then, this is the essence of the InfiniBand Architecture. It creates private, protected channels between two disjoint virtual address spaces, it provides a channel endpoint, called a QP, to the applications at each end of the channel, and it provides a means for a local application to transfer messages directly between applications residing in those disjoint virtual address spaces. Channel I/O in a nutshell.

Having established a channel, and having created a virtual endpoint to the channel, there is one further architectural nuance needed to complete the channel I/O picture, and that is the actual method for transferring a message. InfiniBand provides two transfer semantics: a channel semantic, sometimes called SEND/RECEIVE, and a pair of memory semantics called RDMA READ and RDMA WRITE. When using the channel semantic, the message is received in a data structure provided by the application on the receiving side. This data structure was pre-posted on its receive queue. Thus, the sending side does not have visibility into the buffers or data structures on the receiving side; instead, it simply SENDS the message and the receiving application RECEIVES the message.
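The channel semantic can be modeled in a short sketch. The class and method names are invented for illustration; the behavior being modeled is that a receive buffer must be posted before a SEND arrives, and that incoming SENDs consume pre-posted buffers in order, without the sender ever naming a buffer.

```python
from collections import deque

class ReceivingApp:
    """Channel-semantic receiver: buffers must be posted before a SEND lands."""
    def __init__(self):
        self.recv_queue = deque()   # pre-posted application buffers
        self.delivered = []

    def post_recv(self, buf):
        self.recv_queue.append(buf)

    def on_send_arrival(self, message):
        # The incoming SEND consumes the next pre-posted buffer in order;
        # the sender never chose, or even saw, this buffer.
        buf = self.recv_queue.popleft()
        buf[:len(message)] = message
        self.delivered.append(buf)

rx = ReceivingApp()
rx.post_recv(bytearray(16))        # pre-post first, then the peer may SEND
rx.on_send_arrival(b"status: ok")
```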
The memory semantic is somewhat different; in this case the receiving side application registers a buffer in its virtual memory space. It passes control of that buffer to the sending side, which then uses RDMA READ or RDMA WRITE operations to either read or write the data in that buffer.

A typical storage operation may illustrate the difference. The initiator wishes to store a block of data. To do so, it places the block of data in a buffer in its virtual address space and uses a SEND operation to send a storage request to the target (e.g. the storage device). The target, in turn, uses RDMA READ operations to fetch the block of data from the initiator's virtual buffer. Once it has completed the operation, the target uses a SEND operation to return ending status to the initiator. Notice that the initiator, having requested service from the target, was free to go about its other business while the target asynchronously completed the storage operation, notifying the initiator on completion. The data transfer phase and the storage operation, of course, comprise the bulk of the operation.
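The exchange just described can be sketched as a toy model. All names here are invented for the sketch; the shape to notice is that the target pulls the data directly from the initiator's registered buffer (the RDMA READ) and only afterwards SENDs ending status back.

```python
class Initiator:
    """Holds a registered virtual buffer and waits for ending status."""
    def __init__(self, block: bytes):
        self.registered_buffer = block
        self.status = None

class StorageTarget:
    def __init__(self):
        self.stored = None

    def handle_store_request(self, initiator: Initiator):
        # RDMA READ: fetch the block straight out of the initiator's virtual
        # buffer, with no work performed by the initiator's CPU.
        self.stored = bytes(initiator.registered_buffer)
        # SEND: return ending status once the storage operation completes.
        initiator.status = "complete"

ini = Initiator(b"block of application data")
tgt = StorageTarget()
tgt.handle_store_request(ini)   # the initiator is free to do other work meanwhile
```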
Included as part of the InfiniBand Architecture and located right above the transport layer is the software transport interface. The software transport interface contains the QPs; keep in mind that the queue pairs are the structure by which the RDMA message transport service is accessed. The software transport interface also defines all the methods and mechanisms that an application needs to take full advantage of the RDMA message transport service. For example, the software transport interface describes the methods that applications use to establish a channel between them. An implementation of the software transport interface includes the APIs and libraries needed by an application to create and control the channel and to use the QPs in order to transfer messages.
[Figure 2 shows the InfiniBand communications stack: an Application sitting on the Software Transport Interface, which sits on the Transport, Network, Link and Physical layers. The Software Transport Interface together with the Transport layer forms the RDMA Message Transport Service; the full stack beneath the application forms the InfiniBand Messaging Service.]

Figure 2: The InfiniBand Architecture provides an easy-to-use messaging service to applications. The messaging service includes a communications stack similar to the familiar OSI reference model.
Underneath the covers of the messaging service, all this still requires a complete network stack, just as you would find in any traditional network. It includes the InfiniBand transport layer to provide reliability and delivery guarantees (similar to the TCP transport in an IP network), a network layer (like the IP layer in a traditional network) and link and physical layers (wires and switches). But it's a special kind of network stack, because it has features that make it simple to transport messages directly between applications' virtual memory, even if the applications are remote with respect to each other. Hence, the combination of InfiniBand's transport layer together with the software transport interface is better thought of as a Remote Direct Memory Access (RDMA) message transport service. The entire stack taken together, including the software transport interface, comprises the InfiniBand messaging service.

How does the application actually transfer a message? The InfiniBand Architecture provides simple mechanisms, defined in the software transport interface, for placing a request to perform a message transfer on a queue. This queue is the QP, representing the channel endpoint. The request is called a Work Request (WR) and represents a single quantum of work that the application wants to perform. A typical WR, for example, describes a message that the application wishes to have transported to another application.
The notion of a WR as representing a message is another of the key distinctions between InfiniBand and a traditional network; the InfiniBand Architecture is said to be message-oriented. A message can be any size up to 2^31 bytes. This means that the entire architecture is oriented around moving messages. This messaging service is a distinctly different service than is provided by other traditional networks such as TCP/IP, which are adept at moving strings of bytes from the operating system in one node to the operating system in another node. As the name suggests, a byte stream-oriented network presents a stream of bytes to an application. Each time an Ethernet packet arrives, the server NIC hardware places the bytes comprising the packet into an anonymous buffer in main memory belonging to the operating system. Once the stream of bytes has been received into the operating system's buffer, the TCP/IP network stack signals the application to request a buffer from the application into which the bytes can be placed. This process is repeated each time a packet arrives until the entire message is eventually received.
InfiniBand, on the other hand, delivers a complete message to an application. Once an application has requested transport of a message, the InfiniBand hardware automatically segments the outbound message into a number of packets; the packet size is chosen to optimize the available network bandwidth. Packets are transmitted through the network by the InfiniBand hardware, and at the receiving end they are delivered directly into the receiving application's virtual buffer, where they are re-assembled into a complete message. Once the entire
message has been received, the receiving application is notified. Neither the sending nor the receiving application is involved until the complete message is delivered to the receiving application's virtual buffer.
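The segmentation and reassembly the hardware performs can be modeled in a few lines. The 2048-byte packet size below is only an example; the packet size actually used on a given path is negotiated, and everything here happens in the HCA rather than in application code.

```python
MTU = 2048  # example packet payload size; the real value is negotiated per path

def segment(message: bytes, mtu: int = MTU):
    """Split an outbound message into packets, as the sending hardware does."""
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]

def reassemble(packets):
    """Rebuild the complete message in the receiver's virtual buffer."""
    return b"".join(packets)

msg = bytes(5000)     # a 5000-byte message
pkts = segment(msg)   # split into 2048 + 2048 + 904 bytes
```

Neither application sees the packets; only the whole message, `reassemble(pkts)`, is delivered.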
How, exactly, does a sending application send a message? The InfiniBand software transport interface specification defines the concept of a verb. The word "verb" was chosen since a verb describes action, and this is exactly how an application requests action from the messaging service: it uses a verb. The set of verbs, taken together, is simply a semantic description of the methods the application uses to request service from the RDMA message transport service. For example, Post Send Request is a commonly used verb to request transmission of a message to another application.

The verbs are fully defined by the specification, and they are the basis for specifying the APIs that an application uses. But the InfiniBand Architecture specification doesn't define the actual APIs; that is left to other organizations such as the OpenFabrics Alliance, which provides a complete set of open source APIs and software which implements the verbs and works seamlessly with the InfiniBand hardware. A richer description of the concept of verbs, and how one designs a system using InfiniBand and the verbs, is contained in a later chapter.
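As a rough sketch of the shape a verbs-style interface takes, consider the model below. The names echo, but are not, the OpenFabrics API; this is a hedged illustration of "build a work request, post it on the send queue, later collect a completion," not real verbs code.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class WorkRequest:
    """One quantum of work: here, a message the application wants sent."""
    opcode: str          # e.g. "SEND", "RDMA_WRITE", "RDMA_READ"
    local_buffer: bytes  # the message data (or the region to read/write)

class SendQueue:
    def __init__(self):
        self.pending = deque()

    def post_send(self, wr: WorkRequest):
        """Model of the Post Send Request verb: queue work on the QP."""
        self.pending.append(wr)

    def complete_one(self):
        # Stand-in for the hardware consuming the work request and the
        # application later polling a completion for it.
        return self.pending.popleft().opcode

sq = SendQueue()
sq.post_send(WorkRequest(opcode="SEND", local_buffer=b"hello"))
```

Note the asynchronous split: posting the request returns immediately, and the application learns of the outcome only when it collects the completion.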
The InfiniBand Architecture defines a full set of hardware components necessary to deploy the architecture. Those components are:

HCA - Host Channel Adapter. An HCA is the point at which an InfiniBand end node, such as a server or storage device, connects to the InfiniBand network. InfiniBand's architects went to significant lengths to ensure that the architecture would support a range of possible implementations, thus the specification does not require that particular HCA functions be implemented in hardware, firmware or software. Regardless, the collection of hardware, software and firmware that implements the HCA's functions provides the applications with full access to the network resources. The HCA contains address translation mechanisms, under the control of the operating system, that allow an application to access the HCA directly. In effect, the Queue Pairs that the application uses to access the InfiniBand hardware appear directly in the application's virtual address space. The same address translation mechanism is the means by which an HCA accesses memory on behalf of a user level application. The application, as usual, refers to virtual addresses; the HCA has the ability to translate these into physical addresses in order to effect the actual message transfer.

TCA - Target Channel Adapter. This is a specialized version of a channel adapter intended for use in an embedded environment such as a storage appliance. Such an appliance may be based on an embedded operating system, or may be entirely based on state machine logic and therefore may not require a standard interface for applications. The concept of a TCA is
not widely deployed today; instead, most I/O devices are implemented using standard server motherboards controlled by standard operating systems.
Switches - An InfiniBand Architecture switch is conceptually similar to any other standard networking switch, but molded to meet InfiniBand's performance and cost targets. For example, InfiniBand switches are designed to be cut-through for performance and cost reasons, and they implement InfiniBand's link layer flow control protocol to avoid dropped packets. This is a subtle but key element of InfiniBand, since it means that packets are never dropped in the network during normal operation. This "no drop" behavior is central to the operation of InfiniBand's highly efficient transport protocol.
Routers - Although not currently in wide deployment, an InfiniBand router is intended to be used to segment a very large network into smaller subnets connected together by an InfiniBand router. Since InfiniBand's management architecture is defined on a per-subnet basis, using an InfiniBand router allows a large network to be partitioned into a number of smaller subnets, thus enabling the deployment of InfiniBand networks that can be scaled to very large sizes without the adverse performance impacts due to the need to route management traffic throughout the entire network. In addition to this traditional use, specialized InfiniBand routers have also been deployed in a unique way as range extenders to interconnect two segments of an InfiniBand subnet that is distributed across wide geographic distances.
Cables and Connectors - Volume 2 of the Architecture Specification is devoted to the physical and electrical characteristics of InfiniBand and defines, among other things, the characteristics and specifications for InfiniBand cables, both copper and optical. This has enabled vendors to develop and offer for sale a wide range of both copper and optical cables in a broad range of widths (4x, 12x) and speed grades (SDR, DDR, QDR).
That's it for the basic concepts underlying InfiniBand. The InfiniBand Architecture brings a broad range of value to its users, ranging from ultra-low latency for clustering, to extreme flexibility in application deployment in the data center, to dramatic reductions in energy consumption, all at very low price/performance points. Figuring out which value propositions are relevant to which situations depends entirely on the intended use.

The next two chapters describe how these basic concepts map onto particular value propositions in two important market segments. Chapter 2 describes how InfiniBand relates to High Performance Computing (HPC) clusters, and Chapter 3 does the same thing for enterprise data centers. As we will see, InfiniBand's basic concepts can help solve a range of problems, whether you are trying to solve a sticky performance problem or your data center is suffering from severe space, power or cooling constraints. A wise choice of an appropriate interconnect can profoundly impact the solution.
Chapter 2 InfiniBand for HPC
In this chapter we will explore how InfiniBand's fundamental characteristics map onto addressing the problems being faced today in high performance computing environments.

High performance computing is not a monolithic field; it covers the gamut from the largest Top 500 class supercomputers down to small desktop clusters. For our purposes, we will generally categorize HPC as being that class of systems where nearly all of the available compute capacity is devoted to solving a single large problem for substantial periods of time. Looked at another way, HPC systems generally don't run traditional enterprise applications such as mail, accounting or productivity applications. Some examples of HPC applications include atmospheric modeling, genomics research, automotive crash test simulations, oil and gas extraction models and fluid dynamics.

HPC systems rely on a combination of high-performance storage and low-latency InterProcess Communication (IPC) to deliver performance and scalability to scientific applications. This chapter focuses on how the InfiniBand Architecture's characteristics make it uniquely suited for HPC. With respect to high performance computing, InfiniBand's low-latency/high-bandwidth performance and the nature of a channel architecture are crucially important.
Ultra-low latency for:
- Scalability
- Cluster performance

Channel I/O delivers:
- Scalable storage bandwidth performance
- Support for shared disk cluster file systems and parallel file systems
HPC clusters are dependent to some extent on low latency IPC for scalability and application performance. Since the processes comprising a cluster application are distributed across a number of cores, processors and servers, it is clear that a low-latency IPC interconnect is an important determinant of cluster scalability and performance.

In addition to the demand for low-latency IPC, HPC systems frequently place
enormous demands on storage bandwidth, performance (measured in I/Os per second), scalability and capacity. As we will demonstrate in a later section, InfiniBand's channel I/O architecture is well suited to high-performance storage and especially parallel file systems.

InfiniBand support for HPC ranges from upper layer protocols tailored to support the Message Passing Interface (MPI) in HPC, to support for large scale high-performance parallel file systems such as Lustre. MPI is the predominant messaging middleware found in HPC systems.
InfiniBand MPI Clusters
MPI is the de acto standard and dominant model in use today or com-
munication in a parallel system. Although it is possible to run MPI on a shared
memory system, the more common deployment is as the communicationlayer connecting the nodes o a cluster. Cluster perormance and scalability
are closely related to the rate at which messages can be exchanged between
cluster nodes; the classic low latency requirement. This is native InniBand
territory and its clear stock in trade.
MPI is sometimes described as communication middleware, but perhaps a
better, more unctional description would be to say that it provides a communi-
cation service to the distributed processes comprising the HPC application.
To be effective, the MPI communication service relies on an underlying
messaging service to conduct the actual communication messages from one
node to another. The InfiniBand Architecture, together with its accompanying
application interfaces, provides an RDMA messaging service to the MPI layer.
In the first chapter, we discussed direct application access to the messaging
service as being one of the fundamental principles behind the InfiniBand Archi-
tecture; applications use this service to transfer messages between them. In the
case of HPC, one can think of the processes comprising the HPC application
as being the clients of the MPI communication service, and MPI, in turn, as the
client of the underlying RDMA message transport service. Viewed this way, one
immediately grasps the significance of InfiniBand's channel-based message
passing architecture: its ability to enable its client, the MPI communications
middleware in this case, to communicate among the nodes comprising the
cluster without involvement of any CPUs in the cluster. InfiniBand's copy avoid-
ance architecture and stack bypass deliver ultra-low application-to-application
latency and high bandwidth coupled with extremely low CPU utilization.
MPI is easily implemented on InfiniBand by deploying one of a number of
implementations which take full advantage of the InfiniBand Architecture.
Storage and HPC
An HPC application comprises a number of processes running in parallel. In
an HPC cluster, these processes are distributed; the distribution could be across
the cores comprising a processor chip, or across a number of processor chips
comprising a server, or the processes could be distributed across a number of
physical servers. This last model is the one most often thought of when thinking
about HPC clusters.
Depending on the configuration and purpose of the cluster, it is often
the case that each of the distributed processes comprising the application
requires access to storage. As the cluster is scaled up (in terms of the number
of processes comprising the application) the demand for storage bandwidth
increases. In this section we'll explore two common HPC storage architectures,
shared disk cluster file systems and parallel file systems, and discuss the
role channel I/O plays. Other storage architectures, such as network attached
storage, are also possible.
A shared disk cluster file system is exactly as its name implies and is illus-
trated in the diagram below; it permits the distributed processes comprising the
cluster to share access to a common pool of storage. Typically in a shared disk
file system, the file system is distributed among the servers hosting the distrib-
uted application, with an instance of the file system running on each processor
and a side band communication channel between file system instances used
for distributed lock management. By creating a series of independent chan-
nels, each instance of the distributed file system is given direct access to the
shared disk storage system. The file system places a certain load on the storage
system which is equal to the sum of the loads placed by each instance of the
file system. By distributing this load across a series of unique channels, the
channel I/O architecture provides benefits in terms of parallel accesses and
bandwidth allocation.
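The load summing just described can be made concrete with a small sketch; the per-instance bandwidth figures below are invented purely for illustration:

```python
# Sketch: in a shared disk cluster file system, the storage system sees
# the sum of the loads offered by every file system instance. Channel I/O
# gives each instance its own independent channel, so this aggregate load
# is spread across parallel channels rather than funneled through one path.

def storage_system_load(instance_loads_gbps):
    """Aggregate load (Gb/s) on the shared storage system: the sum of the
    loads placed by each file system instance."""
    return sum(instance_loads_gbps)

# Three servers, each hosting one file system instance (figures invented).
loads = [4.0, 6.5, 2.5]
print(f"aggregate load: {storage_system_load(loads):.1f} Gb/s "
      f"across {len(loads)} independent channels")
```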
[Figure 3: Shared disk cluster file system. Each instance of the distributed file system
has a unique connection to the elements of the storage system.]
One final point: the performance of a shared disk file system, and hence the
overall system, is dependent on the speed with which locks can be exchanged
between file system instances. InfiniBand's ultra low latency characteristic
allows the cluster to extract maximum performance from the storage system.
A parallel file system such as the Lustre file system is similar in some ways
to a shared disk file system in that it is designed to satisfy storage demand
from a number of clients in parallel, but with several distinct differences.
Once again, the processes comprising the application are distributed across a
number of servers. The file system, however, is sited together with the storage
devices rather than being distributed. Each process comprising the distributed
application is served by a relatively thin file system client.
Within the storage system, the control plane (file system metadata) is often
separated from the data plane (object servers) containing user data. This pro-
motes parallelism by reducing bottlenecks associated with accessing file system
metadata, and it permits user data to be distributed across a number of storage
servers which are devoted strictly to serving user data.
[Figure 4: A parallel file system. A thin file system client co-located with each instance
of the distributed application creates parallel connections to the metadata server(s) and
the object servers. The large arrow represents the physical InfiniBand network.]
Each file system client creates a unique channel to the file system metadata.
This accelerates the metadata look-up process and allows the metadata server
to service a fairly large number of accesses in parallel. In addition, each file
system client creates unique channels to the object storage servers where user
data is actually stored.
By creating unique connections between each file system client and
the storage subsystem, a parallel file system is perfectly positioned to reap
maximum advantage from InfiniBand's channel I/O architecture. The unique
and persistent low latency connections to the metadata server(s) allow a high
degree of scalability by eliminating the serialization of accesses sometimes
experienced in a distributed lock system. By distributing the data across an
array of object servers and by providing each file system client with a unique
connection to the object servers, a high degree of parallelism is created as
multiple file system clients attempt to access objects. By creating unique and
persistent low-latency connections between each file system client and the
metadata servers, and between each file system client and the object storage
servers, the Lustre file system delivers scalable performance.
Upper Layer Protocols for HPC
The concept of upper layer protocols (ULPs) is described in some detail in
Chapter 4, Designing with InfiniBand. For now, it is sufficient to say that open
source software supporting the InfiniBand Architecture, and specifically focused
on HPC, is readily available, for example from the OpenFabrics Alliance. ULPs
are available to support file system accesses such as the Lustre file system;
MPI implementations and a range of other APIs necessary to support a
complete HPC system are also provided.
Chapter 3 InfiniBand for the Enterprise
The previous chapter addressed applications of InfiniBand in HPC and
focused on the values an RDMA messaging service brings to that space: ultra-
low application-to-application latency and extremely high bandwidth, all
delivered at a cost of almost zero CPU utilization. As we will see in this chapter,
InfiniBand brings an entirely different set of values to the enterprise. But those
values are rooted in precisely the same set of underlying principles.
Of great interest to the enterprise are values such as risk reduction, invest-
ment protection, continuity of operations, ease of management, flexible use
of resources, reduced capital and operating expenses, and reduced power,
cooling and floor space costs. Naturally, there are no hard rules here; there
are certainly areas in the enterprise space where InfiniBand's performance
(latency, bandwidth, CPU utilization) is of extreme importance. For example,
certain applications in the financial services industry (such as arbitrage trading)
demand the lowest possible application-to-application latencies at nearly any
cost. Our purpose in this chapter is to help the enterprise user understand more
clearly how the InfiniBand Architecture applies in this environment.
Reducing expenses associated with purchasing and operating a data center
means being able to do more with less: fewer servers, fewer switches, fewer
routers and fewer cables. Reducing the number of pieces of equipment in a
data center leads to reductions in floor space and commensurate reductions in
demand for power and cooling. In some environments, severe limitations on
available floor space, power and cooling effectively throttle the data center's
ability to meet the needs of a growing enterprise, so overcoming these obsta-
cles can be crucial to an enterprise's growth. Even in environments where this
is not the case, each piece of equipment in a data center demands a certain
amount of maintenance and operating cost, so reducing the number of pieces
of equipment leads to reductions in operating expenses. Aggregated across an
enterprise, these savings are substantial. For all these reasons, enterprise data
center owners are incented to focus on the efficient use of IT resources. As
we will see shortly, appropriate application of InfiniBand can advance each of
these goals.
Simply reducing the amount of equipment is a noble goal unto itself, but in
an environment where workloads continue to increase, it becomes critical to
do more with less. This is where the InfiniBand Architecture can help; it enables
the enterprise data center to serve more clients with a broader range of applications
with faster response times, while reducing the number of servers, cables and
switches required. Sounds like magic.
Server virtualization is a strategy being successfully deployed now in an
attempt to reduce the cost of server hardware in the data center. It does so by
taking advantage of the fact that many data center servers today operate at
very low levels of CPU utilization; it is not uncommon for data center servers
to consume less than 15-20 percent of the available CPU cycles. By hosting
a number of Virtual Machines (VMs) on each physical server, each of which
hosts one or more instances of an enterprise application, virtualization can
cleverly consume CPU cycles that had previously gone unused. This means
that fewer physical servers are needed to support approximately the same
application workload.
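The consolidation arithmetic implied here can be sketched in a few lines. The 15 percent utilization figure comes from the text; the fleet size and the target utilization of the consolidated hosts are invented for illustration, and the model deliberately assumes CPU load simply adds up:

```python
import math

def servers_after_consolidation(n_servers, avg_utilization, target_utilization):
    """Estimate how many physical servers remain after consolidating
    lightly loaded servers onto virtualized hosts, assuming the CPU
    loads add linearly (a deliberate simplification)."""
    total_load = n_servers * avg_utilization
    return math.ceil(total_load / target_utilization)

# 100 servers each ~15% busy, consolidated onto hosts run at ~60% busy.
print(servers_after_consolidation(100, 0.15, 0.60))  # -> 25
```

Even this rough model shows why virtualization cuts server counts so sharply when the starting utilization is as low as the text describes.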
Even though server virtualization is an effective strategy for driving up
consumption of previously unused CPU cycles, it does little to improve the
effective utilization (efficiency) of the existing processor/memory complex. CPU
cycles and memory bandwidth are still devoted to executing the network stack
at the expense of application performance. Effective utilization, or efficiency,
is the ratio of a server resource, such as CPU cycles or memory bandwidth,
devoted directly to executing an application program compared to the
server resources devoted to housekeeping tasks. Housekeeping tasks
principally include services provided to the application by the operating system,
devoted to conducting routine network communication or storage activities on
behalf of an application. The trick is to apply technologies that can fundamen-
tally improve the server's overall effective utilization, without assuming unnec-
essary risk in deploying them. If successful, the data center owner will achieve
much, much higher usage of the data center servers, switches and storage that
were purchased. That is the subject of this chapter.
Devoting Server Resources to Application Processing
Earlier we described how an application using InfiniBand's RDMA mes-
saging service avoids the use of the operating system (using operating system
bypass) in order to transfer messages between a sender and a receiver. This
was described as a performance improvement, since avoiding operating system
calls and unnecessary buffer copies significantly reduces latency. But operating
system bypass produces another effect of much greater value to the enterprise
data center.
Server architecture and design is an optimization process. At a given server
price point, the server architect can afford to provide only a certain amount
of main memory; the important corollary is that there is a finite amount of
memory bandwidth available to support application processing at that price
point. As it turns out, main memory bandwidth is a key determinant of server
performance. Here's why.
The precious memory bandwidth provided on a given server motherboard
must be allocated between application processing and I/O processing. This
is true because traditional I/O, such as TCP/IP networking, relies on main
memory to create an anonymous buffer pool, which is integral to the opera-
tion of the network stack. Each inbound packet must be placed into an
anonymous buffer and then copied out of the anonymous buffer and delivered
to a user buffer in the application's virtual memory space. Thus, each inbound
packet is copied at least twice. A single 10Gb/s Ethernet link, for example,
can consume between 20Gb/s and 30Gb/s of memory bandwidth to support
inbound network traffic.
[Figure 5: Each inbound network packet consumes from two to three times the memory
bandwidth as the inbound packet is placed into an anonymous buffer by the NIC and
then subsequently copied from the anonymous buffer into the application's virtual
buffer space.]
The load offered to the memory by network traffic is largely dependent on
how much aggregate network traffic is created by the applications hosted on
the server. As the number of processor sockets on the motherboard increases,
and as the number of cores per socket increases, and as the number of execu-
tion threads per core increases, the offered network load increases approxi-
mately proportionally. And each increment of offered network load consumes
two to three times as much memory bandwidth as the network's underlying
signaling rate. For example, if an application consumes 10Gb/s of inbound
network traffic (which translates into 20Gb/s of memory bandwidth consumed),
then multiple instances of that application running on a multicore CPU will
consume proportionally more network bandwidth and twice that in memory band-
width. And, of course, executing the TCP/IP network protocol stack consumes
CPU processor cycles. The cycles consumed in executing the network protocol
and copying data out of the anonymous buffer pool and into the application's
virtual buffer space are cycles that cannot be devoted to application processing;
they are purely overhead required in order to provide the application with the
data it requested.
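The two-to-three-times memory bandwidth multiplier described above reduces to a one-line calculation; the 10Gb/s link figure is the text's own example:

```python
def memory_bw_consumed(inbound_gbps, copies=2):
    """Memory bandwidth (Gb/s) consumed by inbound network traffic when
    each packet is written into an anonymous buffer and then copied into
    the application's buffer: at least two memory transits per packet."""
    return inbound_gbps * copies

link = 10.0  # a single 10Gb/s Ethernet link, as in the text
print(f"best case:  {memory_bw_consumed(link, 2):.0f} Gb/s of memory bandwidth")
print(f"worst case: {memory_bw_consumed(link, 3):.0f} Gb/s of memory bandwidth")
```

This is why a server's network load is best measured against its memory bandwidth budget, not just its link speed.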
Consider: when one contemplates the purchase of a server such as a server
blade, the key factors that are typically weighed against the cost of the server
are power consumption and capacity. Processor speed, number of cores,
number of threads per core and memory capacity are the major contributors to
both power consumption and server performance. So if you could reduce the
number of CPUs required or the amount of memory bandwidth consumed,
without reducing the ability to execute user applications, significant savings
could be realized in both capital expense and operating expense. If using the
available CPU cycles and memory bandwidth more efficiently enables the data
center to reduce the number of servers required, then the capital and operating
expense improvements multiply dramatically.
And this is exactly what the InfiniBand Architecture delivers; it dramati-
cally improves the server's effective utilization (its efficiency) by dramatically
reducing demand on the CPU to execute the network stack or copy data out of
main memory, and it dramatically reduces I/O demand on memory bandwidth,
making more memory bandwidth available for application processing. This
allows the user to direct precious CPU and memory resources to useful appli-
cation processing and to reap the full capacity of the server for which the user
has paid.
Again, server virtualization increases the number of applications that a
server can host by time slicing the server hardware among the hosted virtual
machines. Channel I/O decreases the amount of time the server devotes to
housekeeping tasks and therefore increases the amount of time that the server
can devote to application processing. It does this by lifting from the CPU and
memory the burden of executing I/O and copying data.
A Flexible Server Architecture
Most modern servers touch at least two different I/O fabrics: an Ethernet
network and a storage fabric, commonly Fibre Channel. In some cases there
is a third fabric dedicated to IPC. For performance and scalability reasons, a
server often touches several instances of each type of fabric. Each point of
contact between a server and one of the associated fabrics is implemented with
a piece of hardware, either a Network Interface Card (NIC) for networking or
a Host Bus Adapter (HBA) for storage. The server's total I/O bandwidth is the
sum of the bandwidths available through these different fabrics. Each of these
fabrics places a certain load both on memory bandwidth and CPU resources.
The allocation of the total available I/O bandwidth between storage and
networking is established at the time the server is provisioned with I/O hard-
ware. This process begins during motherboard manufacture in the case of
NICs soldered directly to the motherboard (LAN on Motherboard, or LOM) and
continues when the server is provisioned with a complement of PCI Express-
attached Host Bus Adapters (HBAs, used for storage) and NICs. Even though
the complement of PCI Express adapters can be changed, the overall costs
of doing so are prohibitive; so by and large the allocation of I/O bandwidth
between storage and networking is essentially fixed by the time the server has
been installed.
[Figure 6: Memory bandwidth is divided between application processing and I/O. I/O
bandwidth allocation between storage, networking and IPC is predetermined by the
physical hardware, the NICs and HBAs, in the server.]
The ideal ratio of storage bandwidth to networking bandwidth is solely
a function of the application(s) being hosted by the server; each different
enterprise application presents a different characteristic workload on the I/O
subsystem. For example, a database application may create a high disk storage
workload and a relatively modest networking workload as it receives queries
from its clients and returns results. Hence, if the mix of enterprise applications
hosted by a given server changes, its ideal allocation of I/O bandwidth between
networking and storage also changes. This is especially true in virtualized
servers, where each physical server hosts a number of application containers
(Virtual Machines, or VMs), each of which may be presenting a different charac-
teristic workload to the I/O subsystem.
This creates a bit of a conundrum for the IT manager. Does the IT man-
ager over-provision each server with a maximum complement of storage and
networking devices, and in what proportions? Does the IT manager lock a given
application, or application mix, down to a given server and provision that server
with the idealized I/O complement of storage and networking resources? What
happens to the server as the workload it is hosting is changed? Will it still have
proportional access to both networking and storage fabrics?
In a later chapter we introduce the notion of a unified fabric as being a
fabric that is optimal for storage, networking and IPC. An InfiniBand fabric
has the characteristics that make it well suited for use as a unified fabric. For
example, its message orientation allows it to efficiently handle the small
message sizes that are often found in network applications, as well as the large
message sizes that are common in storage. In addition, the InfiniBand Architec-
ture contains features such as a virtual lane architecture specifically designed
to support various types of traffic simultaneously.
Carrying storage and networking traffic efficiently over a single InfiniBand net-
work using the RDMA message transport service means that the total number
of I/O adapters required (HBAs and NICs) can be radically reduced. This is a
significant saving, and even more significant when one factors in the 40Gb/s
data rates of commonly available commodity InfiniBand components versus
leading edge 10Gb/s Ethernet. For our purposes here, we are ignoring the
approximately 4x improvement in price/performance provided by InfiniBand.
But far more important than the simple cost savings associated with the
reduction in the number of required adapter types is this: by using an Infini-
Band network, a single type of resource (HCAs, buffers, queue pairs and so
on) is used for both networking and storage. Thus the balancing act between
networking and storage is eliminated, meaning that the server will always have
the correct proportion of networking and storage resources available to it, even
as the mix of applications changes and the characteristic workload offered by
the applications changes!
This is a profound observation; it means that it is no longer necessary to
overprovision a server with networking and storage resources in anticipation of
an unknown application workload; the server will always be naturally provi-
sioned with I/O resources.
Investment Protection
Enterprise data centers typically represent a significant investment in equip-
ment and software accumulated over time. Only rarely does an opportunity
arise to install a new physical plant and accompanying control and manage-
ment software. Accordingly, valuable new technologies must be introduced to
the data center in such a way as to provide maximum investment protection
and to minimize the risk of disruption to existing applications. For example, a
data center may contain a large investment in SANs for storage which it would
be impractical to simply replace. In this section we explore some methods to
allow the data center's servers to take advantage of InfiniBand's messaging
service while protecting existing investments.
Consider the case of a rack of servers, or a chassis of bladed servers. As
usual, each server in the rack requires connection to up to three types of
fabrics, and in many cases, multiple instances of each type of fabric. There
may be one Ethernet network dedicated to migrating virtual machines, another
Ethernet network dedicated to performing backups, and yet another one for con-
necting to the internet. Connecting to all these networks requires multiple NICs,
HBAs, cables and switch ports.
Earlier we described the advantages that accrue in the data center by
reducing the types of I/O adapters in any given server. These included reduc-
tions in the number and types of cables, switches and routers, and improve-
ments in the server's ability to host changing workloads because it is no longer
limited by the particular I/O adapters that happen to be installed.
Nevertheless, a server may still require access to an existing centrally man-
aged pool of storage such as a Fibre Channel SAN, which requires that each
server be equipped with a Fibre Channel HBA. On the face of it, this is at
odds with the notion of reducing the number of I/O adapters. The solution is to
replace the FC HBA with a virtual HBA driver. Instead of using its own dedi-
cated FC HBA, the server uses a virtual device adapter to access a physical
Fibre Channel adapter located in an external I/O chassis. The I/O chassis pro-
vides a pool of HBAs which are shared among the servers. The servers access
the pool of HBAs over an InfiniBand network.
The set of physical adapters thus eliminated from each server is replaced
by a single physical InfiniBand HCA and a corresponding set of virtual device
adapters (VNICs and VHBAs). By so doing, the servers gain the flexibility
afforded by the one fat pipe, benefit from the reduction in HBAs and NICs, and
retain the ability to access the existing infrastructure, thus preserving
the investment.
A typical server rack contains a so-called top of rack switch which is used
to aggregate all of a given class of traffic from within the rack. This is an ideal
place to site an I/O chassis, which thus provides the entire rack with virtualized
access to data center storage resources, with access to the data center net-
work, and with access to an IPC service for communicating with servers located
in other racks.
Cloud Computing: An Emerging Data Center Model
Cloud computing has emerged as an interesting variant on enterprise com-
puting. Although widespread adoption has only recently occurred, the funda-
mental concepts beneath the idea of a cloud have been under discussion for
quite a long period under different names such as utility computing, grid com-
puting, and agile or flexible computing. Although cloud computing is marketed
and expressed in a number of ways, at its heart it is an expression of the idea
of flexible allocation of compute resources (including processing, network and
storage) to support an ever-shifting pool of applications. In its fullest realization,
the cloud represents a pool of resources capable of hosting a vast array of
applications, the composition of which is regularly shifting. This is dramatically
different from the classical enterprise data center, in which the applications
hosted by the data center are often closely associated with, and in some cases
statically assigned to, a specific set of hardware and software resources: a
server. For example, the classical three-tier data center is an expression of this
relatively static allocation of servers to applications.
Given the above discussion concerning the flexible allocation of I/O resources
to applications, it is immediately clear that InfiniBand's ability to transport
messages on behalf of any application is a strong contributor to the idea that
applications must not be enslaved to the specific I/O architecture of the under-
lying server on which the application is hosted. Indeed, the cloud computing
model is the very embodiment of the idea of flexible computing: the ability
to flexibly allocate processors, memory, networking and storage resources to
support application execution anywhere within the cloud. By using the set of
Upper Layer Protocols described briefly in the first chapter, the application is
provided with a simple way to use the InfiniBand messaging service. By doing
so, the application's dependence on a particular mix of storage and networking
I/O is effectively eliminated. Thus, the operator of a cloud infrastructure gains
an unprecedented level of flexibility in hosting applications capable of executing
on any available compute resource in the cloud.
One last point should be made before we leave behind the topic of cloud
computing. Cloud infrastructures, as they are being deployed today, represent
some of the highest densities yet seen in the computing industry. Cloud sys-
tems are often delivered in container-like boxes that push the limits of compute
density, storage density, and power and cooling density in a confined physical
space. In that way, they share the same set of obstacles found in some of the
highest density commercial data centers, such as those found in the financial
services industry in Manhattan. These data centers face many of the same
stifling constraints on available power, cooling and floor space.
We mentioned in an earlier section that enterprise data centers are often
constrained in their ability to adopt relevant new technologies, since they must
balance the advantages gained by deploying a new technology against require-
ments to protect investments in existing technologies. In that way, many of
the cloud computing infrastructures now coming online are very different from
their brethren in other commercial enterprises. Because of the modular way
that modern cloud infrastructures are assembled and deployed, the opportu-
nity does exist for simple adoption of a vastly improved I/O infrastructure like
InfiniBand. For example, a typical cloud infrastructure might be composed of
a series of self-contained containers resembling the familiar shipping con-
tainer seen on trucks. This container, sometimes called a POD, contains racks,
servers, switches, networks and infrastructure. They are designed to be rolled
in, connected to a power and cooling source, and be operational right away. A
POD might comprise a complete, self-contained data center or it may represent
a granular unit of expansion in a larger cloud installation.
By deploying InfiniBand as the network internal to the container, and by
taking full advantage of its messaging service, the container manufacturer can
deliver to his or her client the densest possible configuration, presenting the
highest effective utilization of compute, power and cooling resources possible.
Such a container takes full advantage of InfiniBand's complete set of value
propositions:
• Its low latency and extremely high bandwidth remove bottlenecks that pre-
vent applications from delivering maximum achievable performance.
• The ease and simplicity with which an application directly uses InfiniBand's
messaging service means that precious CPU cycles and, more importantly,
memory bandwidth, are devoted to application processing and not to
housekeeping I/O chores. This drives up the effective utilization of available
resources. Thus, the overall return on investment is higher, and the delivered
performance per cubic foot of floor space, and per watt of power consump-
tion, is higher.
• By providing each server with one fat pipe, the server platform gains an
intrinsic guarantee that its mixture of storage I/O, networking I/O and IPC will
always be perfectly balanced.
• InfiniBand's efficient management of the network topology offers excellent
utilization of the available network connectivity. Because InfiniBand deploys
a centralized subnet manager, switch forwarding tables are programmed in
such a way that routes are balanced and available links are well used. This
efficient topology handling scales to thousands of nodes and thus satisfies
the needs of large cloud installations.
• InfiniBand's native, commodity bandwidths (currently 40Gb/s) are nearly
four times higher than 10Gb/s Ethernet, at prices that have historically been
extremely competitive compared to commodity 1Gb/s or 10Gb/s Ethernet so-
lutions. Since each InfiniBand cable carries nearly four times the bandwidth
of a corresponding 10Gb/s Ethernet cable, the number of connectors, cables,
switches and so on that are required is dramatically reduced.
• And finally, the simplified communication service provided by an InfiniBand
network inside the container simplifies the deployment of applications on
the pool of computing, storage and networking resources provided by the
container, and reduces the burden of migrating applications onto available
resources.
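The cable count argument is simple arithmetic. A sketch using the text's 40Gb/s versus 10Gb/s link rates (the aggregate bandwidth target for the rack is invented):

```python
import math

def cables_needed(aggregate_gbps, per_link_gbps):
    """Number of cables required to carry a given aggregate bandwidth."""
    return math.ceil(aggregate_gbps / per_link_gbps)

aggregate = 480.0  # hypothetical aggregate bandwidth for a rack, in Gb/s
print(f"10Gb/s Ethernet:   {cables_needed(aggregate, 10.0)} cables")
print(f"40Gb/s InfiniBand: {cables_needed(aggregate, 40.0)} cables")
```

At four times the per-link bandwidth, the connector, cable and switch port counts fall by roughly the same factor.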
physically copied to the local site. This is because enterprises rely on a file
transfer protocol to replicate data from site to site. The time delay to accomplish
the transfer can often be measured in minutes or hours depending on file sizes,
the capacity of the underlying network and the performance of the storage
system. This poses a difficult problem in synchronizing storage between sites.
Suppose, instead of copying files from the remote file system to the local file
system, the remote file system could simply be mounted on the local user's
workstation in exactly the same way that the local file system is mounted?
This is exactly the model enabled by executing InfiniBand over the WAN. Over
long haul connections with relatively long flight times, the InfiniBand protocol
offers extremely high bandwidth efficiency across distances that cause TCP/IP
to stumble.
Figure 7. An enterprise application reads data through a storage client. The
storage client connects to each storage server via RDMA; connections to remote
storage servers are made via the WAN. Thus, the user has direct access to all
data stored anywhere on the system.
It turns out that the performance of any windowing protocol is a function
of the so-called bandwidth-delay product; this product represents the
maximum amount of data that can be stored in transit on the WAN at any
given moment. Once the limit of the congestion window has been reached,
the transport must stop transmitting until some of the bytes in the pipe can be
drained at the far end. In order to achieve maximum utilization of the avail-
able WAN bandwidth, the transport's congestion window must be sufficiently
large to keep the wire continuously full. As the wire length increases, or as
the bandwidth on the wire increases, so does the bandwidth-delay product.
And hence, the congestion window that is basic to any congestion-managed
network, such as InfiniBand or TCP/IP, must grow very large in order to support
the large bandwidth-delay products commonly found on high-speed WAN links.
Because of the fundamental architecture of the InfiniBand transport, it is easily
able to support these exceedingly large bandwidth-delay products.
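As a concrete illustration, the short sketch below computes the bandwidth-delay product for a long link. The 40Gb/s rate matches the InfiniBand data rate cited earlier; the 50ms round-trip time is an assumed, illustrative WAN figure, not a value from the text:

```python
def bandwidth_delay_product_bytes(bandwidth_bps, rtt_seconds):
    """Bytes that must be in flight to keep the wire continuously full."""
    return int(bandwidth_bps * rtt_seconds / 8)

# A 40 Gb/s link with an assumed 50 ms round-trip time:
bdp = bandwidth_delay_product_bytes(40e9, 0.050)
print(bdp)  # 250000000 -- the congestion window must cover roughly 250 MB
```

At these sizes, a transport whose window cannot grow this large simply cannot keep the wire full, which is the effect described above.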
On the other hand, lossy flow control protocols such as TCP degrade
rapidly as the bandwidth-delay product increases, causing turbulent rather
than smooth packet flow. These effects cause a rapid roll-off in achieved
bandwidth performance as link distance or wire bandwidth increases. These
effects are already pronounced at 10Gb/s, and become much worse at 40 or
100Gb/s and beyond.
Because the InfiniBand transport, when run over the WAN, is able to effec-
tively manage the bandwidth-delay product of these long links, it becomes
possible to actually mount the remote file system directly to the local user's
workstation and hence to deliver performance approximately equivalent to that
which the local user would experience when accessing a local file system. Of
course, speed-of-light delays are an unavoidable byproduct of access over the
WAN, but for many applications the improvement in response time from min-
utes or hours to milliseconds or seconds represents an easy tradeoff.
Long distance InfiniBand links may span Campus or Metro Area Networks or
even global WANs. Over shorter optical networks, applications directly benefit
from InfiniBand's low latency, low jitter, low CPU loading and high throughput.
Examples of such applications include:
Data Center Expansion - spill an InfiniBand network from a primary data
center to nearby buildings using multiple optical InfiniBand connections. This
approach causes no disruption to data center traffic, is transparent to applica-
tions and has sufficiently low latency as to be negligible to operations.
Campus System Area Networks - InfiniBand makes an excellent System
Area Network, carrying storage, IPC and Internet traffic over its unified fabric.
Campus Area InfiniBand links allow islands of compute, storage and I/O
devices, such as high-resolution display walls, to be federated and shared
across departments. Location-agnostic access performance greatly assists in
the efficient utilization of organization-wide resources.
Chapter 4 Designing with InfiniBand
As with any network architecture, the InfiniBand Architecture provides a
service; in the case of InfiniBand it provides a messaging service. That mes-
sage service can be used to transport IPC messages, to transport control
and data messages for storage, or to transport messages associated with any
of a range of other usages. This makes InfiniBand different from a traditional
TCP/IP/Ethernet network, which provides a byte stream transport service, or
Fibre Channel, which provides a transport service specific to the Fibre Channel
wire protocol. By application, we mean, for example, a user application using
the sockets interface to conduct IPC, or a kernel-level file system application
executing SCSI block-level commands to a remote block storage device.

The key feature that sets InfiniBand apart from other network technologies
is that InfiniBand is designed to provide message transport services directly to
the application layer, whether it is a user application or a kernel application.
This is very different from a traditional network, which requires the application
to solicit assistance from the operating system to transfer a stream of bytes.
At first blush, one might think that such application-level access requires the
application to fully understand the networking and I/O protocols provided by
the underlying network. Nothing could be further from the truth.

The top layer of the InfiniBand Architecture defines a software transport
interface; this section of the specification defines the methods that an applica-
tion uses to access the complete set of services provided by InfiniBand.
An application accesses InfiniBand's message transport service by posting
a Work Request (WR) to a work queue. As described earlier, the work queue
is part of a structure, called a Queue Pair (QP), representing the endpoint of a
channel connecting the application with another application. A QP contains
two work queues, a Send Queue and a Receive Queue, hence the expression
QP. A WR, once placed on the work queue, is in effect an instruction to the
InfiniBand RDMA message transport to transmit a message on the channel, or
to perform some sort of control or housekeeping function on the channel.

The methods used by an application to interact with the InfiniBand trans-
port are called verbs and are defined in the software transport interface section
of the specification. An application uses a POST SEND verb, for example, to
request that the InfiniBand transport send a message on the given channel.
This is what is meant when we say that the InfiniBand network stack goes all
the way up to the application layer: it provides methods that are used directly
by an application to request service from the InfiniBand transport. This is very
different from a traditional network, which requires the application to solicit
assistance from the operating system in order to access the server's network
services.
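The sequence just described can be sketched in pseudocode. The names below are patterned on the widely used libibverbs C implementation of the verbs; the fragment is illustrative only, not a complete program, and real code must also transition the QP through its connection states and handle errors:

```
// Illustrative pseudocode; names patterned on the libibverbs C API
pd = ibv_alloc_pd(device_context)            // allocate a protection domain
mr = ibv_reg_mr(pd, buffer, length, access)  // register the application's buffer
qp = ibv_create_qp(pd, qp_attributes)        // create the Queue Pair
                                             // (Send Queue + Receive Queue)

// Build a Work Request describing one message on the channel
wr.sg_list = { address: buffer, length: length, lkey: mr.lkey }
wr.opcode  = IBV_WR_SEND

ibv_post_send(qp, wr)      // the POST SEND verb: hand the WR to the transport
ibv_poll_cq(send_cq, ...)  // later, poll the completion queue for the result
```

Note that the operating system is involved only in the setup steps (allocating the protection domain, registering memory, creating the QP); posting the work request itself goes directly from the application to the hardware.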
Figure 8. The software transport interface sits between the application and
the InfiniBand transport layer. The application (the consumer of InfiniBand
message services) posts work requests to a queue; each work request
represents a message, a unit of work. The channel interface (verbs) allows the
consumer to request services. The InfiniBand channel interface provider
maintains work queues, manages address translation, and provides completion
and event mechanisms. The IB transport packetizes messages, provides
transport services (reliable/unreliable, connected/unconnected, datagram),
implements the RDMA protocol (send/receive, RDMA read/write, atomics), and
implements end-to-end reliability, assuring reliable delivery.
Building on Top of the Verbs
The verbs, as defined, are sufficient to enable an application developer
to design and code an application for use directly on top of an InfiniBand
network. But if the application developer were to do so, he or she would need
to create a set of customized application interface code, at least one set for
each operating system to be supported. In a world of specialized applications,
such as high performance computing, where the last ounce of performance
is important, such an approach makes a great deal of sense: the range of
applications in question is fairly well defined, and the variety of systems on
which such applications are deployed is similarly constrained. But in a world of
more broadly distributed applications and operating systems it may be prohibi-
tive for the application or hardware developer, and the profusion of application
interfaces may be an obstacle inhibiting an IT manager from adopting a given
application or technology.
In more broadly deployed systems, an application accesses system services
via a set of APIs, often supplied as part of the operating system.
InfiniBand is an industry-standard interconnect and is not tied to any given
operating system or server platform. For this reason, it does not specify any
APIs directly. Instead, the InfiniBand verbs fully specify the required behaviors
of the APIs. In effect, the verbs represent a semantic definition of the applica-
tion interface by specifying the methods and mechanisms that an application
uses to control the network and to access its services. Since the verbs specify
a set of required behaviors, they are sufficient to support the development
of OS-specific APIs, and thus ensure interoperability between, for example,
an application executing on a Windows platform and an application executing
on a Linux platform. The verbs ensure cross-platform and cross-vendor
interoperability. APIs can be tested both for conformance to the verbs
specification and for interoperability, allowing different vendors of
InfiniBand solutions to assure their customers that the devices and software
will operate in conformance to the specification and will interoperate with solu-
tions provided by other vendors.
However, a semantic description of a required behavior is of little practical
use without an accompanying set of APIs to implement it.
Fortunately, other organizations do provide APIs, even if the InfiniBand Trade
Association does not. For example, the OpenFabrics Alliance and other com-
mercial providers supply complete software stacks that are both
necessary and sufficient to deploy an InfiniBand-based system. The software
stack available from the OFA, as the name suggests, is open source and freely
downloadable under both BSD and GNU licenses. The OFA is broadly sup-
ported by providers of InfiniBand hardware, by major operating system sup-
pliers, by suppliers of middleware and by the Linux open source community.
All of these organizations, and many others, contribute open source code to the
OFA repository, where it undergoes testing and refinement and is made gener-
ally available to the community at large.
But even supplying APIs is not, by itself, sufficient to deploy InfiniBand.
Broadly speaking, the software elements required to implement InfiniBand fall
into three major areas: a set of Upper Layer Protocols (ULPs) and
associated libraries; a set of mid-layer functions that are helpful in con-
figuring and managing the underlying InfiniBand fabric and that provide needed
services to the upper layer protocols; and a set of hardware-specific
device drivers. Some of the mid-layer functions are described in a little more
detail in the chapter describing the InfiniBand Management Architecture. All of
these components are freely available through open source for easy download,
installation and use. They can be downloaded directly from the OpenFabrics
Alliance, or, in the case of Linux, they can be acquired as part of a number of
standard Linux distributions. The complete set of software components
provided by the OpenFabrics Alliance is formally known
as the OpenFabrics Enterprise Distribution (OFED).
There is an important point associated with the ULPs that bears further
mention. This chapter began by describing InfiniBand as a message transport
service. As described above, in some circumstances it is entirely practical
to design an application to access the message transport service directly by
coding the application to conform to the InfiniBand verbs. Purpose-built HPC
applications, for example, often do this in order to gain the full measure of per-
formance. But doing so requires the application, and the application developer,
to be RDMA aware. For example, the application would need to understand
the sequence of verbs used to create a queue pair, to establish a connec-
tion, to send and receive messages over the connection and so forth. However,
most commercial applications are designed to take advantage of a standard
networking or storage subsystem by leveraging existing, industry-standard
APIs. But that does not mean that such an application cannot benefit from the
performance and flexibility advantages provided by InfiniBand. The set of ULPs
provided through OFED allows an existing application to take easy advantage of
InfiniBand. Each ULP can be thought of as having two interfaces: an upward-
facing interface presented to the application, and a downward-facing interface
connecting to the underlying InfiniBand messaging service via the queue pairs
defined in the software transport interface. At its upward-facing interface, the
ULP presents a standard interface easily recognizable by the application. For
example, the SCSI RDMA Protocol ULP (SRP) presents a standard SCSI class
driver interface to the file system above it. By providing the advantages of an
RDMA network, the ULP allows a SCSI file system to reap the full benefits
of InfiniBand. A rich set of ULPs is provided by the OpenFabrics Alliance,
including:
SDP - Sockets Direct Protocol. This ULP allows a sockets application to
take advantage of an InfiniBand network with no change to the application.

SRP - SCSI RDMA Protocol. This allows a SCSI file system to directly con-
nect to a remote block storage chassis using RDMA semantics. Again, there
is no impact to the file system itself.

iSER - iSCSI Extensions for RDMA. iSCSI is a protocol allowing a block
storage file system to access a block storage device over a generic network.
iSER allows the user to operate the iSCSI protocol over an RDMA-capable
network.

IPoIB - IP over InfiniBand. This important part of the suite of ULPs allows
an application hosted in, for example, an InfiniBand-based network to com-
municate with sources outside the InfiniBand network using standard
IP semantics. Although often used to transport TCP/IP over an InfiniBand
network, the IPoIB ULP can be used to transport any of the suite of IP
protocols, including UDP, SCTP and others.
NFS-RDMA - Network File System over RDMA. NFS is a well-known and
widely-deployed file system providing file-level I/O (as opposed to block-
level I/O) over a conventional TCP/IP network, enabling easy file shar-
ing. NFS-RDMA extends the protocol and enables it to take full advantage
of the high bandwidth and parallelism provided naturally by InfiniBand.

Lustre support - Lustre is a parallel file system enabling, for example, a set
of clients hosted on a number of servers to access the data store in parallel.
It does this by taking advantage of InfiniBand's Channel I/O architecture, al-
lowing each client to establish an independent, protected channel between
itself and the Lustre Metadata Servers (MDS) and associated Object Storage
Servers and Targets (OSS, OST).

RDS - Reliable Datagram Sockets offers a Berkeley sockets API allow-
ing messages to be sent to multiple destinations from a single socket. This
ULP, originally developed by Oracle, is ideally designed to allow database
systems to take full advantage of the parallelism and low latency character-
istics of InfiniBand.

MPI - The MPI ULP for HPC clusters provides full support for MPI function
calls.
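The point of ULPs such as SDP and IPoIB is that an ordinary sockets program needs no modification at all: the same code runs over Ethernet or, through these ULPs, over an InfiniBand fabric. The sketch below is a minimal, self-contained sockets exchange; it uses the loopback address purely so the example can run anywhere, and over IPoIB only the address would differ:

```python
import socket
import threading

def echo_once(server_sock):
    """Accept one connection and echo back whatever arrives."""
    conn, _ = server_sock.accept()
    with conn:
        conn.sendall(conn.recv(1024))

# Standard sockets code: nothing here is InfiniBand-specific.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))       # loopback, for a self-contained demo
server.listen(1)
port = server.getsockname()[1]

t = threading.Thread(target=echo_once, args=(server,))
t.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(("127.0.0.1", port))
client.sendall(b"unchanged sockets application")
reply = client.recv(1024)
t.join()
print(reply)  # b'unchanged sockets application'
```

Pointing the same code at an address on an IPoIB interface (or running it over SDP) is exactly the "no change to the application" property claimed for these ULPs.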
Figure 9. Upper Layer Protocols connect applications to a single InfiniBand
fabric. Sockets applications, block storage, IP-based applications, clustered
database access and file systems sit above ULPs such as SDP, SRP, iSER,
IPoIB, RDS, NFS-RDMA, MPI and cluster file systems, all of which use the
InfiniBand verbs interface to reach the InfiniBand RDMA message transport
service and the InfiniBand fabric interconnect.
There is one more important corollary to this. As described above, a tradi-
tional API provides access to a particular kind of service, such as a network
service or a storage service. These services are usually provided over a specific
type of network or interconnect. The sockets API, for example, gives an appli-
cation access to a network service and is closely coupled with an underlying
TCP/IP network. Similarly, a block-level file system is closely coupled with a
purpose-built storage interconnect such as Fibre Channel. This is why data
center networks today are commonly thought of as consisting of as many as
three separate interconnects: one for networking, one for storage and possibly
a third for IPC. Each type of interconnect consists of its own physical plant
(cables, switches, NICs or HBAs and so on), its own protocol stack and, most
importantly, its own set of application-level APIs.
This is where the InfiniBand Architecture is different: the set of ULPs
provides a series of interfaces that an application can use to access a particular
kind of service (storage service, networking service, IPC service or others).
However, each of those services is provided over a single underlying network:
at their lower interfaces, the ULPs are bound to a single InfiniBand network.
There is no need for multiple underlying networks coupled to each type of
application interface. In effect, the set of ULPs gives an application access to
a single unified fabric, completely hiding the details associated with that
unified fabric, while avoiding the need for multiple data center networks.
Chapter 5 InfiniBand Architecture and Features
Chapter 1 provided a high-level overview of the key concepts underpin-
ning InfiniBand. Some of these concepts include channel I/O (described as a
connection between two virtual address spaces), a channel endpoint called a
queue pair (QP), and direct application-level access to a QP via virtual-to-phys-
ical memory address mapping. A detailed description of the architecture and
its accompanying features is the province of an entire multi-volume book and
is well outside the scope of this minibook. This chapter is designed to deepen
your understanding of the architecture with an eye toward understanding how
InfiniBand's key concepts are achieved and how they are applied.

Volume 1 of the InfiniBand Architecture specification co