
Introduction to InfiniBand for End Users

Industry-Standard Value and Performance for High Performance Computing and the Enterprise

Paul Grun

InfiniBand Trade Association


    Contents

Acknowledgements
About the Author
Introduction
Chapter 1  Basic Concepts
Chapter 2  InfiniBand for HPC
  InfiniBand MPI Clusters
  Storage and HPC
  Upper Layer Protocols for HPC
Chapter 3  InfiniBand for the Enterprise
  Devoting Server Resources to Application Processing
  A Flexible Server Architecture
  Investment Protection
  Cloud Computing - An Emerging Data Center Model
  InfiniBand in the Distributed Enterprise
Chapter 4  Designing with InfiniBand
  Building on Top of the Verbs
Chapter 5  InfiniBand Architecture and Features
  Address Translation
  The InfiniBand Transport
  InfiniBand Link Layer Considerations
  Management and Services
  InfiniBand Management Architecture
Chapter 6  Achieving an Interoperable Solution
  An Interoperable Software Solution
  Ensuring Hardware Interoperability
Chapter 7  InfiniBand Performance Capabilities and Examples
Chapter 8  Into the Future


    Acknowledgements

Special thanks to Gilad Shainer at Mellanox for his work and content related to Chapter 7 - InfiniBand Performance Capabilities and Examples.

Thanks to the following individuals for their contributions to this publication: David Southwell, Ariel Cohen, Rupert Dance, Jay Diepenbrock, Jim Ryan, Cheri Winterberg and Brian Sparks.

    About the Author

Paul Grun has been working on I/O for mainframes and servers for more than 30 years and has been involved in high performance networks and RDMA technology since before the inception of InfiniBand. As a member of the InfiniBand Trade Association, Paul served on the Link Working Group during the development of the InfiniBand Architecture; he is a member of the IBTA Steering Committee and is a past chair of the Technical Working Group. He recently served as chair of the IBXoE Working Group, charged with developing the new RDMA over Converged Ethernet (RoCE) specification. He is currently chief scientist for System Fabric Works, Inc., a consulting and professional services company dedicated to delivering RDMA and storage solutions for high performance computing, commercial enterprise and cloud computing systems.


    Introduction

InfiniBand is not complex. Despite its reputation as an exotic technology, the concepts behind it are surprisingly straightforward. One purpose of this book is to clearly describe the basic concepts behind the InfiniBand Architecture.

Understanding the basic concepts is all well and good, but not very useful unless those concepts deliver a meaningful and significant value to the user of the technology. And that is the real purpose of this book: to draw straight-line connections between the basic concepts behind the InfiniBand Architecture and how those concepts deliver real value to the enterprise data center and to high performance computing (HPC).

The readers of this book are exquisitely knowledgeable in their respective fields of endeavor, whether those endeavors are installing and maintaining a large HPC cluster, overseeing the design and development of an enterprise data center, deploying servers and networks, or building devices/appliances targeted at solving a specific problem requiring computers. But most readers of this book are unlikely to be experts in the arcane details of networking architecture. Although the working title for this book was "InfiniBand for Dummies," its readers are anything but dummies. This book is designed to give its readers, in one short sitting, a view into the InfiniBand Architecture that you probably could not get by reading the specification itself.

The point is that you should not have to hold an advanced degree in obscure networking technologies in order to understand how InfiniBand technology can bring benefits to your chosen field of endeavor and to understand what those benefits are. This minibook is written for you. At the end of the next hour you will be able to clearly see how InfiniBand can solve problems you are facing today.

By the same token, this book is not a detailed technical treatise of the underlying theory, nor does it provide a tutorial on deploying the InfiniBand Architecture. All we can hope for in this short book is to bring a level of enlightenment about this exciting technology. The best measure of our success is if you, the reader, feel motivated after reading this book to learn more about how to deploy the InfiniBand Architecture in your computing environment.

We will avoid the low-level bits and bytes of the elegant architecture (which are complex) and focus on the concepts and applications that are visible to the user of InfiniBand technology. When you decide to move forward with deploying InfiniBand, there is ample assistance available in the market to help you deploy it, just as there is for any other networking technology like traditional TCP/IP/Ethernet. The InfiniBand Trade Association and its website (www.infinibandta.org) are great sources


of access to those with deep experience in developing, deploying and using the InfiniBand Architecture.

The InfiniBand Architecture emerged in 1999 as the joining of two competing proposals known as Next Generation I/O and Future I/O. These proposals, and the InfiniBand Architecture that resulted from their merger, are all rooted in the Virtual Interface Architecture, VIA. The Virtual Interface Architecture is based on two synergistic concepts: direct access to a network interface (e.g. a NIC) straight from application space, and an ability for applications to exchange data directly between their respective virtual buffers across a network, all without involving the operating system directly in the address translation and networking processes needed to do so. This is the notion of Channel I/O: the creation of a virtual channel directly connecting two applications that exist in entirely separate address spaces. We will come back to a detailed description of these key concepts in the next few chapters.

InfiniBand is often compared, and not improperly, to a traditional network such as TCP/IP/Ethernet. In some respects it is a fair comparison, but in many other respects the comparison is wildly off the mark. It is true that InfiniBand is based on networking concepts and includes the usual layers found in a traditional network, but beyond that there are more differences between InfiniBand and an IP network than there are similarities. The key is that InfiniBand provides a messaging service that applications can access directly. The messaging service can be used for storage, for InterProcess Communication (IPC) or for a host of other purposes, anything that requires an application to communicate with others in its environment. The key benefits that InfiniBand delivers accrue from the way that the InfiniBand messaging service is presented to the application, and the underlying technology used to transport and deliver those messages. This is much different from TCP/IP/Ethernet, which is a byte-stream oriented transport for conducting bytes of information between sockets applications.

The first chapter gives a high-level view of the InfiniBand Architecture and reviews the basic concepts on which it is founded. The next two chapters relate those basic concepts to value propositions which are important to users in the HPC community and in the commercial enterprise. Each of those communities has unique needs, but both benefit from the advantages InfiniBand can bring. Chapter 4 addresses InfiniBand's software architecture and describes the open source software stacks that are available to ease and simplify the deployment of InfiniBand. Chapter 5 delves slightly deeper into the nuances of some of the key elements of the architecture. Chapter 6 describes the efforts undertaken by both the InfiniBand Trade Association and the OpenFabrics Alliance (www.openfabrics.org) to ensure that solutions from a range of vendors will interoperate and that they will behave in conformance with the specification. Chapter 7 describes some comparative performance proof points, and finally Chapter 8 briefly reviews the ways in which the InfiniBand Trade Association ensures that the InfiniBand Architecture continues to move ahead into the future.


    Chapter 1 Basic Concepts

Networks are often thought of as the set of routers, switches and cables that are plugged into servers and storage devices. When asked, most people would probably say that a network is used to connect servers to other servers, storage and a network backbone. It does seem that traditional networking generally starts with a bottoms up view, with much attention focused on the underlying wires and switches. This is a very network centric view of the world; the way that an application communicates is driven partly by the nature of the traffic. This is what has given rise to the multiple dedicated fabrics found in many data centers today.

A server generally provides a selection of communication services to the applications it supports; one for storage, one for networking and frequently a third for specialized IPC traffic. This complement of communications stacks (storage, networking, etc.) and the device adapters accompanying each one comprise a shared resource. This means that the operating system owns these resources and makes them available to applications on an as-needed basis. An application, in turn, relies on the operating system to provide it with the communication services it needs. To use one of these services, the application uses some sort of interface or API to make a request to the operating system, which conducts the transaction on behalf of the application.

InfiniBand, on the other hand, begins with a distinctly application centric view by asking a profoundly simple question: how to make application accesses to other applications and to storage as simple, efficient and direct as possible? If one begins to look at the I/O problem from this application centric perspective, one comes up with a much different approach to network architecture.

The basic idea behind InfiniBand is simple; it provides applications with an easy-to-use messaging service. This service can be used to communicate with other applications or processes or to access storage. That's the whole idea. Instead of making a request to the operating system for access to one of the server's communication resources, an application accesses the InfiniBand messaging service directly. Since the service provided is a straightforward messaging service, the complex dance between an application and a traditional


network is eliminated. Since storage traffic can be viewed as consisting of control messages and data messages, the same messaging service is equally at home for use in storage applications.

A key to this approach is that the InfiniBand Architecture gives every application direct access to the messaging service. Direct access means that an application need not rely on the operating system to transfer messages. This idea is in contrast to a standard network environment where the shared network resources (e.g. the TCP/IP network stack and associated NICs) are solely owned by the operating system and cannot be accessed by the user application. This means that an application cannot have direct access to the network and instead must rely on the involvement of the operating system to move data from the application's virtual buffer space, through the network stack and out onto the wire. There is an identical involvement of the operating system at the other end of the connection. InfiniBand avoids this through a technique known as stack bypass. How it does this, while ensuring the same degree of application isolation and protection that an operating system would provide, is the one key piece of "secret sauce" that underlies the InfiniBand Architecture.

InfiniBand provides the messaging service by creating a channel connecting an application to any other application or service with which the application needs to communicate. The applications using the service can be either user space applications or kernel applications such as a file system. This application-centric approach to the computing problem is the key differentiator between InfiniBand and traditional networks. You could think of it as a "tops down" approach to architecting a networking solution. Everything else in the InfiniBand Architecture is there to support this one simple goal: provide a message service to be used by an application to communicate directly with another application or with storage.

The challenge for InfiniBand's designers was to create these channels between virtual address spaces which would be capable of carrying messages of varying sizes, and to ensure that the channels are isolated and protected. These channels would need to serve as pipes or connections between entirely disjoint virtual address spaces. In fact, the two virtual spaces might even be located in entirely disjoint physical address spaces; in other words, hosted by different servers, even over a distance.


Figure 1: InfiniBand creates a channel directly connecting an application in its virtual address space to an application in another virtual address space. The two applications can be in disjoint physical address spaces hosted by different servers.

It is convenient to give a name to the endpoints of the channel; we'll call them Queue Pairs (QPs). Each QP consists of a Send Queue and a Receive Queue, and each QP represents one end of a channel. If an application requires more than one connection, more QPs are created. The QPs are the structure by which an application accesses InfiniBand's messaging service. In order to avoid involving the OS, the applications at each end of the channel must have direct access to these QPs. This is accomplished by mapping the QPs directly into each application's virtual address space. Thus, the application at each end of the connection has direct, virtual access to the channel connecting it to the application (or storage) at the other end of the channel. This is the notion of Channel I/O.

Taken altogether then, this is the essence of the InfiniBand Architecture. It creates private, protected channels between two disjoint virtual address spaces, it provides a channel endpoint, called a QP, to the applications at each end of the channel, and it provides a means for a local application to transfer messages directly between applications residing in those disjoint virtual address spaces. Channel I/O in a nutshell.
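For readers who want to see what a channel endpoint looks like in practice, here is a minimal sketch using the open source libibverbs API (the verbs interface discussed later in this book). It opens an HCA and creates one Queue Pair; the device choice, queue depths and minimal error handling are illustrative assumptions, and connecting the QP to a remote peer (commonly done with the companion librdmacm library) is not shown.

    /* Minimal sketch: creating a Queue Pair (one channel endpoint) with libibverbs.
     * Assumes a single InfiniBand HCA is present; queue depths are arbitrary examples. */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);              /* open the HCA      */
        struct ibv_pd      *pd  = ibv_alloc_pd(ctx);                     /* protection domain */
        struct ibv_cq      *cq  = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* completion queue  */

        /* A QP is a Send Queue plus a Receive Queue mapped into this process. */
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                         .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,   /* reliable connected: one end of a channel */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        if (!qp) { fprintf(stderr, "QP creation failed\n"); return 1; }

        printf("created QP number 0x%x\n", qp->qp_num);

        ibv_destroy_qp(qp);
        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }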

Having established a channel, and having created a virtual endpoint to the channel, there is one further architectural nuance needed to complete the channel I/O picture, and that is the actual method for transferring a message. InfiniBand provides two transfer semantics: a channel semantic sometimes called SEND/RECEIVE, and a pair of memory semantics called RDMA READ and RDMA WRITE. When using the channel semantic, the message is received in a data structure provided by the application on the receiving side. This data structure was pre-posted on its receive queue. Thus, the sending side does not have visibility into the buffers or data structures on the receiving side; instead, it simply SENDS the message and the receiving application RECEIVES the message.


The memory semantic is somewhat different; in this case the receiving side application registers a buffer in its virtual memory space. It passes control of that buffer to the sending side, which then uses RDMA READ or RDMA WRITE operations to either read or write the data in that buffer.

A typical storage operation may illustrate the difference. The initiator wishes to store a block of data. To do so, it places the block of data in a buffer in its virtual address space and uses a SEND operation to send a storage request to the target (e.g. the storage device). The target, in turn, uses RDMA READ operations to fetch the block of data from the initiator's virtual buffer. Once it has completed the operation, the target uses a SEND operation to return ending status to the initiator. Notice that the initiator, having requested service from the target, was free to go about its other business while the target asynchronously completed the storage operation, notifying the initiator on completion. The data transfer phase and the storage operation, of course, comprise the bulk of the operation.
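The two semantics map directly onto work request opcodes in the verbs API. The hypothetical fragment below posts a SEND (the channel semantic the initiator uses for its request and the target uses for ending status) and an RDMA READ (the memory semantic the target uses to fetch the data); an RDMA WRITE is posted the same way, with the data flowing in the opposite direction. The addresses, keys and length are placeholders a real program would obtain from memory registration and from information exchanged with its peer.

    /* Sketch: channel semantic (SEND) versus memory semantic (RDMA READ).
     * 'laddr'/'lkey' describe a locally registered buffer; 'raddr'/'rkey' were
     * handed over by the remote peer when it passed control of its buffer. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    void post_examples(struct ibv_qp *qp, uint64_t laddr, uint32_t lkey,
                       uint64_t raddr, uint32_t rkey, uint32_t len)
    {
        struct ibv_sge sge = { .addr = laddr, .length = len, .lkey = lkey };
        struct ibv_send_wr wr, *bad;

        /* Channel semantic: SEND. The payload lands in whatever buffer the
         * receiver pre-posted on its receive queue; the sender never sees it. */
        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 1;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;
        ibv_post_send(qp, &wr, &bad);

        /* Memory semantic: RDMA READ. The requester names the remote buffer
         * explicitly using the virtual address and rkey it was given. */
        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 2;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_READ;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = raddr;
        wr.wr.rdma.rkey        = rkey;
        ibv_post_send(qp, &wr, &bad);
    }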

Included as part of the InfiniBand Architecture, and located right above the transport layer, is the software transport interface. The software transport interface contains the QPs; keep in mind that the queue pairs are the structure by which the RDMA message transport service is accessed. The software transport interface also defines all the methods and mechanisms that an application needs to take full advantage of the RDMA message transport service. For example, the software transport interface describes the methods that applications use to establish a channel between them. An implementation of the software transport interface includes the APIs and libraries needed by an application to create and control the channel and to use the QPs in order to transfer messages.

Figure 2: The InfiniBand Architecture provides an easy-to-use messaging service to applications. The messaging service includes a communications stack similar to the familiar OSI reference model. (The figure shows an application sitting on the InfiniBand messaging service, which comprises the S/W Transport Interface and the RDMA message transport service spanning the Transport, Network, Link and Physical layers.)


Underneath the covers of the messaging service, all this still requires a complete network stack just as you would find in any traditional network. It includes the InfiniBand transport layer to provide reliability and delivery guarantees (similar to the TCP transport in an IP network), a network layer (like the IP layer in a traditional network) and link and physical layers (wires and switches). But it's a special kind of network stack because it has features that make it simple to transport messages directly between applications' virtual memory, even if the applications are remote with respect to each other. Hence, the combination of InfiniBand's transport layer together with the software transport interface is better thought of as a Remote Direct Memory Access (RDMA) message transport service. The entire stack taken together, including the software transport interface, comprises the InfiniBand messaging service.

How does the application actually transfer a message? The InfiniBand Architecture provides simple mechanisms, defined in the software transport interface, for placing a request to perform a message transfer on a queue. This queue is the QP, representing the channel endpoint. The request is called a Work Request (WR) and represents a single quantum of work that the application wants to perform. A typical WR, for example, describes a message that the application wishes to have transported to another application.
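In code, that quantum of work looks roughly like the sketch below: a Work Request describing a registered message buffer is posted to the send queue, and the application later learns the outcome by polling the completion queue. The busy-wait loop and the parameter values are simplifications for illustration.

    /* Sketch: posting one Work Request and reaping its completion.
     * 'qp' and 'cq' are assumed to be a connected QP and its completion queue;
     * 'buf_addr', 'buf_len' and 'lkey' describe a registered message buffer. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int send_one_message(struct ibv_qp *qp, struct ibv_cq *cq,
                         uint64_t buf_addr, uint32_t buf_len, uint32_t lkey)
    {
        struct ibv_sge sge = { .addr = buf_addr, .length = buf_len, .lkey = lkey };
        struct ibv_send_wr wr, *bad;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 42;                 /* application-chosen tag for this WR */
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;  /* ask for a completion entry         */

        if (ibv_post_send(qp, &wr, &bad))   /* place the WR on the send queue     */
            return -1;

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                               /* busy-poll; real code might block   */

        return (wc.status == IBV_WC_SUCCESS && wc.wr_id == 42) ? 0 : -1;
    }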

The notion of a WR as representing a message is another of the key distinctions between InfiniBand and a traditional network; the InfiniBand Architecture is said to be "message-oriented." A message can range in size up to 2**31 bytes. This means that the entire architecture is oriented around moving messages. This messaging service is a distinctly different service than is provided by other traditional networks such as TCP/IP, which are adept at moving strings of bytes from the operating system in one node to the operating system in another node. As the name suggests, a byte stream-oriented network presents a stream of bytes to an application. Each time an Ethernet packet arrives, the server NIC hardware places the bytes comprising the packet into an anonymous buffer in main memory belonging to the operating system. Once the stream of bytes has been received into the operating system's buffer, the TCP/IP network stack signals the application to request a buffer from the application into which the bytes can be placed. This process is repeated each time a packet arrives until the entire message is eventually received.

InfiniBand, on the other hand, delivers a complete message to an application. Once an application has requested transport of a message, the InfiniBand hardware automatically segments the outbound message into a number of packets; the packet size is chosen to optimize the available network bandwidth. Packets are transmitted through the network by the InfiniBand hardware, and at the receiving end they are delivered directly into the receiving application's virtual buffer where they are re-assembled into a complete message. Once the entire


message has been received, the receiving application is notified. Neither the sending nor the receiving application is involved until the complete message is delivered to the receiving application's virtual buffer.
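The contrast shows up clearly in a short, hypothetical receive path written both ways. The sockets version must loop until it has reassembled the message from however many pieces the byte stream delivers (how the caller knows the message length is outside this sketch), while the verbs version pre-posts a single buffer and gets one completion for the whole message. The names and buffer handling are illustrative assumptions.

    /* Byte-stream receive: the application reassembles the message itself. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    ssize_t recv_full_message(int fd, char *buf, size_t msg_len)
    {
        size_t got = 0;
        while (got < msg_len) {                                 /* loop until every    */
            ssize_t n = recv(fd, buf + got, msg_len - got, 0);  /* byte has arrived    */
            if (n <= 0)
                return -1;                                      /* error or disconnect */
            got += n;
        }
        return (ssize_t)got;
    }

    /* Message-oriented receive with verbs: one pre-posted buffer, one completion.
     * 'qp', 'cq', 'buf_addr', 'buf_len' and 'lkey' are assumed to be set up already. */
    int recv_one_message(struct ibv_qp *qp, struct ibv_cq *cq,
                         uint64_t buf_addr, uint32_t buf_len, uint32_t lkey)
    {
        struct ibv_sge sge = { .addr = buf_addr, .length = buf_len, .lkey = lkey };
        struct ibv_recv_wr wr = { .wr_id = 7, .sg_list = &sge, .num_sge = 1 }, *bad;

        if (ibv_post_recv(qp, &wr, &bad))        /* hand the whole buffer to the HCA */
            return -1;

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                                    /* one completion = whole message   */
        return (wc.status == IBV_WC_SUCCESS) ? (int)wc.byte_len : -1;
    }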

How, exactly, does a sending application, for example, send a message? The InfiniBand software transport interface specification defines the concept of a "verb." The word "verb" was chosen since a verb describes action, and this is exactly how an application requests action from the messaging service; it uses a verb. The set of verbs, taken together, are simply a semantic description of the methods the application uses to request service from the RDMA message transport service. For example, "Post Send Request" is a commonly used verb to request transmission of a message to another application.

The verbs are fully defined by the specification, and they are the basis for specifying the APIs that an application uses. But the InfiniBand Architecture specification doesn't define the actual APIs; that is left to other organizations such as the OpenFabrics Alliance, which provides a complete set of open source APIs and software which implements the verbs and works seamlessly with the InfiniBand hardware. A richer description of the concept of verbs, and how one designs a system using InfiniBand and the verbs, is contained in a later chapter.

The InfiniBand Architecture defines a full set of hardware components necessary to deploy the architecture. Those components are:

HCA - Host Channel Adapter. An HCA is the point at which an InfiniBand end node, such as a server or storage device, connects to the InfiniBand network. InfiniBand's architects went to significant lengths to ensure that the architecture would support a range of possible implementations, thus the specification does not require that particular HCA functions be implemented in hardware, firmware or software. Regardless, the collection of hardware, software and firmware that represents the HCA's functions provides the applications with full access to the network resources. The HCA contains address translation mechanisms, under the control of the operating system, that allow an application to access the HCA directly. In effect, the Queue Pairs that the application uses to access the InfiniBand hardware appear directly in the application's virtual address space. The same address translation mechanism is the means by which an HCA accesses memory on behalf of a user level application. The application, as usual, refers to virtual addresses; the HCA has the ability to translate these into physical addresses in order to effect the actual message transfer. (A short sketch of this memory registration step appears just after this list of components.)

TCA - Target Channel Adapter. This is a specialized version of a channel adapter intended for use in an embedded environment such as a storage appliance. Such an appliance may be based on an embedded operating system, or may be entirely based on state machine logic and therefore may not require a standard interface for applications. The concept of a TCA is


not widely deployed today; instead most I/O devices are implemented using standard server motherboards controlled by standard operating systems.

Switches - An InfiniBand Architecture switch is conceptually similar to any other standard networking switch, but molded to meet InfiniBand's performance and cost targets. For example, InfiniBand switches are designed to be "cut through" for performance and cost reasons, and they implement InfiniBand's link layer flow control protocol to avoid dropped packets. This is a subtle but key element of InfiniBand, since it means that packets are never dropped in the network during normal operation. This "no drop" behavior is central to the operation of InfiniBand's highly efficient transport protocol.

Routers - Although not currently in wide deployment, an InfiniBand router is intended to be used to segment a very large network into smaller subnets connected together by an InfiniBand router. Since InfiniBand's management architecture is defined on a per subnet basis, using an InfiniBand router allows a large network to be partitioned into a number of smaller subnets, thus enabling the deployment of InfiniBand networks that can be scaled to very large sizes without the adverse performance impacts due to the need to route management traffic throughout the entire network. In addition to this traditional use, specialized InfiniBand routers have also been deployed in a unique way as "range extenders" to interconnect two segments of an InfiniBand subnet that is distributed across wide geographic distances.

Cables and Connectors - Volume 2 of the Architecture Specification is devoted to the physical and electrical characteristics of InfiniBand and defines, among other things, the characteristics and specifications for InfiniBand cables, both copper and optical. This has enabled vendors to develop and offer for sale a wide range of both copper and optical cables in a broad range of widths (4x, 12x) and speed grades (SDR, DDR, QDR).
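Returning to the address translation role described under the HCA above, the sketch below shows the registration step an application performs so that the HCA can translate its virtual addresses and, through the rkey, allow a remote peer to reach the buffer. The buffer size and access flags are illustrative assumptions.

    /* Sketch: registering ordinary virtual memory with the HCA. */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);                 /* plain virtual memory */
        if (!buf)
            return NULL;

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            free(buf);
            return NULL;
        }

        /* lkey is used in local work requests; rkey is handed to the remote peer
         * so its RDMA READ/WRITE operations can name this buffer. */
        printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);
        return mr;
    }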

That's it for the basic concepts underlying InfiniBand. The InfiniBand Architecture brings a broad range of value to its users, ranging from ultra-low latency for clustering, to extreme flexibility in application deployment in the data center, to dramatic reductions in energy consumption, all at very low price/performance points. Figuring out which value propositions are relevant to which situations depends entirely on the intended use.

The next two chapters describe how these basic concepts map onto particular value propositions in two important market segments. Chapter 2 describes how InfiniBand relates to High Performance Computing (HPC) clusters, and Chapter 3 does the same thing for enterprise data centers. As we will see, InfiniBand's basic concepts can help solve a range of problems, regardless of whether you are trying to solve a sticky performance problem or your data center is suffering from severe space, power or cooling constraints. A wise choice of an appropriate interconnect can profoundly impact the solution.


Chapter 2 InfiniBand for HPC

In this chapter we will explore how InfiniBand's fundamental characteristics map onto addressing the problems being faced today in high performance computing environments.

High performance computing is not a monolithic field; it covers the gamut from the largest Top 500 class supercomputers down to small desktop clusters. For our purposes, we will generally categorize HPC as being that class of systems where nearly all of the available compute capacity is devoted to solving a single large problem for substantial periods of time. Looked at another way, HPC systems generally don't run traditional enterprise applications such as mail, accounting or productivity applications. Some examples of HPC applications include atmospheric modeling, genomics research, automotive crash test simulations, oil and gas extraction models and fluid dynamics.

HPC systems rely on a combination of high-performance storage and low-latency InterProcess Communication (IPC) to deliver performance and scalability to scientific applications. This chapter focuses on how the InfiniBand Architecture's characteristics make it uniquely suited for HPC. With respect to high performance computing, InfiniBand's low-latency/high-bandwidth performance and the nature of a channel architecture are crucially important.

Ultra-low latency for:
- Scalability
- Cluster performance

Channel I/O delivers:
- Scalable storage bandwidth performance
- Support for shared disk cluster file systems and parallel file systems

HPC clusters are dependent to some extent on low latency IPC for scalability and application performance. Since the processes comprising a cluster application are distributed across a number of cores, processors and servers, it is clear that a low-latency IPC interconnect is an important determinant of cluster scalability and performance.

In addition to demand for low-latency IPC, HPC systems frequently place


enormous demands on storage bandwidth, performance (measured in I/Os per second), scalability and capacity. As we will demonstrate in a later section, InfiniBand's channel I/O architecture is well suited to high-performance storage and especially parallel file systems.

InfiniBand support for HPC ranges from upper layer protocols tailored to support the Message Passing Interface (MPI) in HPC to support for large scale high-performance parallel file systems such as Lustre. MPI is the predominant messaging middleware found in HPC systems.

InfiniBand MPI Clusters

MPI is the de facto standard and dominant model in use today for communication in a parallel system. Although it is possible to run MPI on a shared memory system, the more common deployment is as the communication layer connecting the nodes of a cluster. Cluster performance and scalability are closely related to the rate at which messages can be exchanged between cluster nodes; the classic low latency requirement. This is native InfiniBand territory and its clear stock in trade.

MPI is sometimes described as communication middleware, but perhaps a better, more functional description would be to say that it provides a communication service to the distributed processes comprising the HPC application. To be effective, the MPI communication service relies on an underlying messaging service to conduct the actual communication messages from one node to another. The InfiniBand Architecture, together with its accompanying application interfaces, provides an RDMA messaging service to the MPI layer.

In the first chapter, we discussed direct application access to the messaging service as being one of the fundamental principles behind the InfiniBand Architecture; applications use this service to transfer messages between them. In the case of HPC, one can think of the processes comprising the HPC application as being the clients of the MPI communication service, and MPI, in turn, as the client of the underlying RDMA message transport service. Viewed this way, one immediately grasps the significance of InfiniBand's channel-based message passing architecture: its ability to enable its client, the MPI communications middleware in this case, to communicate among the nodes comprising the cluster without involvement of any CPUs in the cluster. InfiniBand's copy avoidance architecture and stack bypass deliver ultra-low application-to-application latency and high bandwidth coupled with extremely low CPU utilization.

MPI is easily implemented on InfiniBand by deploying one of a number of implementations which take full advantage of the InfiniBand Architecture.
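As a small illustration of MPI riding on the RDMA message transport, the ping-pong fragment below measures average round-trip time between two ranks. It is ordinary MPI code; when the MPI library is built with InfiniBand (verbs) support, these calls are carried over the RDMA messaging service with no change to the application. The message size and iteration count are arbitrary illustrative choices.

    /* Minimal MPI ping-pong between ranks 0 and 1; run with: mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char buf[8] = "ping";
        const int iters = 1000;
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0)
            printf("average round trip: %.2f microseconds\n",
                   (MPI_Wtime() - t0) / iters * 1e6);

        MPI_Finalize();
        return 0;
    }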


    Storage and HPC

An HPC application comprises a number of processes running in parallel. In an HPC cluster, these processes are distributed; the distribution could be across the cores comprising a processor chip, or across a number of processor chips comprising a server, or they could be distributed across a number of physical servers. This last model is the one most often thought of when thinking about HPC clusters.

Depending on the configuration and purpose for the cluster, it is often the case that each of the distributed processes comprising the application requires access to storage. As the cluster is scaled up (in terms of the number of processes comprising the application), the demand for storage bandwidth increases. In this section we'll explore two common HPC storage architectures, shared disk cluster file systems and parallel file systems, and discuss the role channel I/O plays. Other storage architectures, such as network attached storage, are also possible.

A shared disk cluster file system is exactly as its name implies and is illustrated in the diagram below; it permits the distributed processes comprising the cluster to share access to a common pool of storage. Typically in a shared disk file system, the file system is distributed among the servers hosting the distributed application, with an instance of the file system running on each processor and a side band communication channel between file system instances used for distributed lock management. By creating a series of independent channels, each instance of the distributed file system is given direct access to the shared disk storage system. The file system places a certain load on the storage system which is equal to the sum of the loads placed by each instance of the file system. By distributing this load across a series of unique channels, the channel I/O architecture provides benefits in terms of parallel accesses and bandwidth allocation.

Figure 3: Shared disk cluster file system. Each instance of the distributed file system has a unique connection to the elements of the storage system.


One final point: the performance of a shared disk file system, and hence the overall system, is dependent on the speed with which locks can be exchanged between file system instances. InfiniBand's ultra low latency characteristic allows the cluster to extract maximum performance from the storage system.

A parallel file system such as the Lustre file system is similar in some ways to a shared disk file system in that it is designed to satisfy storage demand from a number of clients in parallel, but with several distinct differences. Once again, the processes comprising the application are distributed across a number of servers. The file system, however, is sited together with the storage devices rather than being distributed. Each process comprising the distributed application is served by a relatively thin file system client.

Within the storage system, the control plane (file system metadata) is often separated from the data plane (object servers) containing user data. This promotes parallelism by reducing bottlenecks associated with accessing file system metadata, and it permits user data to be distributed across a number of storage servers which are devoted to strictly serving user data.

Figure 4: A parallel file system. A thin file system client co-located with each instance of the distributed application creates parallel connections to the metadata server(s) and the object servers. The large arrow represents the physical InfiniBand network.

Each file system client creates a unique channel to the file system metadata. This accelerates the metadata look-up process and allows the metadata server to service a fairly large number of accesses in parallel. In addition, each file system client creates unique channels to the object storage servers where user data is actually stored.

By creating unique connections between each file system client and the storage subsystem, a parallel file system is perfectly positioned to reap maximum advantage from InfiniBand's channel I/O architecture. The unique and persistent low latency connections to the metadata server(s) allow a high degree of scalability by eliminating the serialization of accesses sometimes


experienced in a distributed lock system. By distributing the data across an array of object servers and by providing each file system client with a unique connection to the object servers, a high degree of parallelism is created as multiple file system clients attempt to access objects. By creating unique and persistent low-latency connections between each file system client and the metadata servers, and between each file system client and the object storage servers, the Lustre file system delivers scalable performance.

Upper Layer Protocols for HPC

The concept of upper layer protocols (ULPs) is described in some detail in Chapter 4, Designing with InfiniBand. For now, it is sufficient to say that open source software to support the InfiniBand Architecture, and specifically focused on HPC, is readily available, for example from the OpenFabrics Alliance. There are ULPs available to support file system accesses such as the Lustre file system, and MPI implementations and a range of other APIs necessary to support a complete HPC system are also provided.


Chapter 3 InfiniBand for the Enterprise

The previous chapter addressed applications of InfiniBand in HPC and focused on the values an RDMA messaging service brings to that space: ultra-low application-to-application latency and extremely high bandwidth, all delivered at a cost of almost zero CPU utilization. As we will see in this chapter, InfiniBand brings an entirely different set of values to the enterprise. But those values are rooted in precisely the same set of underlying principles.

Of great interest to the enterprise are values such as risk reduction, investment protection, continuity of operations, ease of management, flexible use of resources, reduced capital and operating expenses and reduced power, cooling and floor space costs. Naturally, there are no hard rules here; there are certainly areas in the enterprise space where InfiniBand's performance (latency, bandwidth, CPU utilization) is of extreme importance. For example, certain applications in the financial services industry (such as arbitrage trading) demand the lowest possible application-to-application latencies at nearly any cost. Our purpose in this chapter is to help the enterprise user understand more clearly how the InfiniBand Architecture applies in this environment.

Reducing expenses associated with purchasing and operating a data center means being able to do more with less: fewer servers, fewer switches, fewer routers and fewer cables. Reducing the number of pieces of equipment in a data center leads to reductions in floor space and commensurate reductions in demand for power and cooling. In some environments, severe limitations on available floor space, power and cooling effectively throttle the data center's ability to meet the needs of a growing enterprise, so overcoming these obstacles can be crucial to an enterprise's growth. Even in environments where this is not the case, each piece of equipment in a data center demands a certain amount of maintenance and operating cost, so reducing the number of pieces of equipment leads to reductions in operating expenses. Aggregated across an enterprise, these savings are substantial. For all these reasons, enterprise data center owners are incented to focus on the efficient use of IT resources. As we will see shortly, appropriate application of InfiniBand can advance each of these goals.


Simply reducing the amount of equipment is a noble goal unto itself, but in an environment where workloads continue to increase, it becomes critical to do more with less. This is where the InfiniBand Architecture can help; it enables the enterprise data center to serve more clients with a broader range of applications with faster response times, while reducing the number of servers, cables and switches required. Sounds like magic.

Server virtualization is a strategy being successfully deployed now in an attempt to reduce the cost of server hardware in the data center. It does so by taking advantage of the fact that many data center servers today operate at very low levels of CPU utilization; it is not uncommon for data center servers to consume less than 15-20 percent of the available CPU cycles. By hosting a number of Virtual Machines (VMs) on each physical server, each of which hosts one or more instances of an enterprise application, virtualization can cleverly consume CPU cycles that had previously gone unused. This means that fewer physical servers are needed to support approximately the same application workload.

Even though server virtualization is an effective strategy for driving up consumption of previously unused CPU cycles, it does little to improve the effective utilization (efficiency) of the existing processor/memory complex. CPU cycles and memory bandwidth are still devoted to executing the network stack at the expense of application performance. Effective utilization, or efficiency, is the ratio of a server resource, such as CPU cycles or memory bandwidth, devoted directly to executing an application program compared to the use of server resources devoted to housekeeping tasks. Housekeeping tasks principally include services provided to the application by the operating system devoted to conducting routine network communication or storage activities on behalf of an application. The trick is to apply technologies that can fundamentally improve the server's overall effective utilization, without assuming unnecessary risk in deploying them. If successful, the data center owner will achieve much, much higher usage of the data center servers, switches and storage that were purchased. That is the subject of this chapter.

    Devoting Server Resources to Application Processing

Earlier we described how an application using InfiniBand's RDMA messaging service avoids the use of the operating system (using operating system bypass) in order to transfer messages between a sender and a receiver. This was described as a performance improvement, since avoiding operating system calls and unnecessary buffer copies significantly reduces latency. But operating system bypass produces another effect of much greater value to the enterprise data center.

Server architecture and design is an optimization process. At a given server price point, the server architect can afford to provide only a certain amount


of main memory; the important corollary is that there is a finite amount of memory bandwidth available to support application processing at that price point. As it turns out, main memory bandwidth is a key determinant of server performance. Here's why.

The precious memory bandwidth provided on a given server motherboard must be allocated between application processing and I/O processing. This is true because traditional I/O, such as TCP/IP networking, relies on main memory to create an anonymous buffer pool, which is integral to the operation of the network stack. Each inbound packet must be placed into an anonymous buffer and then copied out of the anonymous buffer and delivered to a user buffer in the application's virtual memory space. Thus, each inbound packet is copied at least twice. A single 10Gb/s Ethernet link, for example, can consume between 20Gb/s and 30Gb/s of memory bandwidth to support inbound network traffic.

Figure 5: Each inbound network packet consumes from two to three times the memory bandwidth, as the inbound packet is placed into an anonymous buffer by the NIC and then subsequently copied from the anonymous buffer into the application's virtual buffer space.

The load offered to the memory by network traffic is largely dependent on how much aggregate network traffic is created by the applications hosted on the server. As the number of processor sockets on the motherboard increases, and as the number of cores per socket increases, and as the number of execution threads per core increases, the offered network load increases approximately proportionally. And each increment of offered network load consumes two to three times as much memory bandwidth as the network's underlying signaling rate. For example, if an application consumes 10Gb/s of inbound network traffic (which translates into 20Gb/s of memory bandwidth consumed), then multiple instances of an application running on a multicore CPU will consume proportionally more network bandwidth and twice the memory bandwidth. And, of course, executing the TCP/IP network protocol stack consumes CPU processor cycles. The cycles consumed in executing the network protocol


and copying data out of the anonymous buffer pool and into the application's virtual buffer space are cycles that cannot be devoted to application processing; they are purely overhead required in order to provide the application with the data it requested.
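The arithmetic behind these numbers fits in a few lines. The sketch below simply multiplies an assumed inbound link rate by the number of times each byte crosses the memory bus; the rates, copy counts and instance count are illustrative assumptions, not measurements.

    /* Back-of-the-envelope: memory bandwidth consumed by inbound network I/O. */
    #include <stdio.h>

    int main(void)
    {
        double link_gbps  = 10.0;  /* inbound Ethernet traffic consumed by one app instance */
        double copies_min = 2.0;   /* NIC -> anonymous buffer, then buffer -> app buffer     */
        double copies_max = 3.0;   /* some paths touch main memory a third time              */
        int    instances  = 4;     /* application instances on a multicore server            */

        printf("one instance:  %.0f Gb/s inbound costs %.0f-%.0f Gb/s of memory bandwidth\n",
               link_gbps, link_gbps * copies_min, link_gbps * copies_max);
        printf("%d instances: roughly %.0f-%.0f Gb/s of memory bandwidth for I/O alone\n",
               instances, instances * link_gbps * copies_min,
               instances * link_gbps * copies_max);
        return 0;
    }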

Consider: when one contemplates the purchase of a server such as a server blade, the key factors that are typically weighed against the cost of the server are power consumption and capacity. Processor speed, number of cores, number of threads per core and memory capacity are the major contributors to both power consumption and server performance. So if you could reduce the number of CPUs required or the amount of memory bandwidth consumed, but without reducing the ability to execute user applications, significant savings can be realized in both capital expense and operating expense. If using the available CPU cycles and memory bandwidth more efficiently enables the data center to reduce the number of servers required, then capital and operating expense improvements multiply dramatically.

And this is exactly what the InfiniBand Architecture delivers; it dramatically improves the server's effective utilization (its efficiency) by dramatically reducing demand on the CPU to execute the network stack or copy data out of main memory, and it dramatically reduces I/O demand on memory bandwidth, making more memory bandwidth available for application processing. This allows the user to direct precious CPU and memory resources to useful application processing and allows the user to reap the full capacity of the server for which the user has paid.

Again, server virtualization increases the number of applications that a server can host by time slicing the server hardware among the hosted virtual machines. Channel I/O decreases the amount of time the server devotes to housekeeping tasks and therefore increases the amount of time that the server can devote to application processing. It does this by lifting from the CPU and memory the burden of executing I/O and copying data.

    A Flexible Server Architecture

Most modern servers touch at least two different I/O fabrics: an Ethernet network and a storage fabric, commonly Fibre Channel. In some cases there is a third fabric dedicated to IPC. For performance and scalability reasons, a server often touches several instances of each type of fabric. Each point of contact between a server and one of the associated fabrics is implemented with a piece of hardware, either a Network Interface Card (NIC) for networking or a Host Bus Adapter (HBA) for storage. The server's total I/O bandwidth is the sum of the bandwidths available through these different fabrics. Each of these fabrics places a certain load both on memory bandwidth and CPU resources.

The allocation of the total available I/O bandwidth between storage and networking is established at the time the server is provisioned with I/O hardware.


This process begins during motherboard manufacture in the case of NICs soldered directly to the motherboard (LAN on Motherboard, LOM) and continues when the server is provisioned with a complement of PCI Express-attached Host Bus Adapters (HBAs, used for storage) and NICs. Even though the complement of PCI Express adapters can be changed, the overall costs of doing so are prohibitive; so by and large the allocation of I/O bandwidth between storage and networking is essentially fixed by the time the server has been installed.

Figure 6: Memory bandwidth is divided between application processing and I/O. I/O bandwidth allocation between storage, networking and IPC is predetermined by the physical hardware, the NICs and HBAs, in the server.

The ideal ratio of storage bandwidth to networking bandwidth is solely a function of the application(s) being hosted by the server; each different enterprise application will present a different characteristic workload on the I/O subsystem. For example, a database application may create a high disk storage workload and a relatively modest networking workload as it receives queries from its clients and returns results. Hence, if the mix of enterprise applications hosted by a given server changes, its ideal allocation of I/O bandwidth between networking and storage also changes. This is especially true in virtualized servers, where each physical server hosts a number of application containers (Virtual Machines, VMs), each of which may be presenting a different characteristic workload to the I/O subsystem.

This creates a bit of a conundrum for the IT manager. Does the IT manager over-provision each server with a maximum complement of storage and networking devices, and in what proportions? Does the IT manager lock a given application, or application mix, down to a given server and provision that server


with the idealized I/O complement of storage and networking resources? What happens to the server as the workload it is hosting is changed? Will it still have proportional access to both networking and storage fabrics?

In a later chapter we introduce the notion of a unified fabric as being a fabric that is optimal for storage, networking and IPC. An InfiniBand fabric has the characteristics that make it well suited for use as a unified fabric; for example, its message orientation and its ability to efficiently handle the small message sizes that are often found in network applications, as well as the large message sizes that are common in storage. In addition, the InfiniBand Architecture contains features such as a virtual lane architecture specifically designed to support various types of traffic simultaneously.

Carrying storage and networking traffic efficiently over a single InfiniBand network using the RDMA message transport service means that the total number of I/O adapters required (HBAs and NICs) can be radically reduced. This is a significant saving, and even more significant when one factors in the 40Gb/s data rates of commonly available commodity InfiniBand components versus leading edge 10Gb/s Ethernet. For our purposes here, we are ignoring the approximately 4x improvement in price/performance provided by InfiniBand.

But far more important than the simple cost savings associated with the reduction in the number of required adapter types is this: by using an InfiniBand network, a single type of resource (HCAs, buffers, queue pairs and so on) is used for both networking and storage. Thus the balancing act between networking and storage is eliminated, meaning that the server will always have the correct proportion of networking and storage resources available to it, even as the mix of applications changes and the characteristic workload offered by the applications changes!

This is a profound observation; it means that it is no longer necessary to overprovision a server with networking and storage resources in anticipation of an unknown application workload; the server will always be naturally provisioned with I/O resources.

    Investment Protection

Enterprise data centers typically represent a significant investment in equipment and software accumulated over time. Only rarely does an opportunity arise to install a new physical plant and accompanying control and management software. Accordingly, valuable new technologies must be introduced to the data center in such a way as to provide maximum investment protection and to minimize the risk of disruption to existing applications. For example, a data center may contain a large investment in SANs for storage which it would be impractical to simply replace. In this section we explore some methods to allow the data center's servers to take advantage of InfiniBand's messaging service while protecting existing investments.


Consider the case of a rack of servers, or a chassis of bladed servers. As usual, each server in the rack requires connection to up to three types of fabrics, and in many cases, multiple instances of each type of fabric. There may be one Ethernet network dedicated to migrating virtual machines, another Ethernet network dedicated to performing backups and yet another one for connecting to the internet. Connecting to all these networks requires multiple NICs, HBAs, cables and switch ports.

Earlier we described the advantages that accrue in the data center by reducing the types of I/O adapters in any given server. These included reductions in the number and types of cables, switches and routers, and improvements in the server's ability to host changing workloads because it is no longer limited by the particular I/O adapters that happen to be installed.

Nevertheless, a server may still require access to an existing centrally managed pool of storage, such as a Fibre Channel SAN, which requires that each server be equipped with a Fibre Channel HBA. On the face of it, this is at odds with the notion of reducing the number of I/O adapters. The solution is to replace the FC HBA with a virtual HBA driver. Instead of using its own dedicated FC HBA, the server uses a virtual device adapter to access a physical Fibre Channel adapter located in an external I/O chassis. The I/O chassis provides a pool of HBAs which are shared among the servers. The servers access the pool of HBAs over an InfiniBand network.

The set of physical adapters thus eliminated from each server is replaced by a single physical InfiniBand HCA and a corresponding set of virtual device adapters (VNICs and VHBAs). By so doing, the servers gain the flexibility afforded by the one fat pipe, benefit from the reduction in HBAs and NICs, and retain the ability to access the existing infrastructure, thus preserving the investment.

A typical server rack contains a so-called top-of-rack switch, which is used to aggregate all of a given class of traffic from within the rack. This is an ideal place to site an I/O chassis, which thus provides the entire rack with virtualized access to data center storage resources, with access to the data center network, and with access to an IPC service for communicating with servers located in other racks.

Cloud Computing - An Emerging Data Center Model

Cloud computing has emerged as an interesting variant on enterprise computing. Although widespread adoption has only recently occurred, the fundamental concepts beneath the idea of a cloud have been under discussion for quite a long period under different names such as utility computing, grid computing and agile or flexible computing. Although cloud computing is marketed and expressed in a number of ways, at its heart it is an expression of the idea of flexible allocation of compute resources (including processing, network and


storage) to support an ever-shifting pool of applications. In its fullest realization, the cloud represents a pool of resources capable of hosting a vast array of applications, the composition of which is regularly shifting. This is dramatically different from the classical enterprise data center, in which the applications hosted by the data center are often closely associated with, and in some cases statically assigned to, a specific set of hardware and software resources: a server. For example, the classical three-tier data center is an expression of this relatively static allocation of servers to applications.

Given the above discussion concerning the flexible allocation of I/O resources to applications, it is immediately clear that InfiniBand's ability to transport messages on behalf of any application is a strong contributor to the idea that applications must not be enslaved to the specific I/O architecture of the underlying server on which the application is hosted. Indeed, the cloud computing model is the very embodiment of the idea of flexible computing: the ability to flexibly allocate processors, memory, networking and storage resources to support application execution anywhere within the cloud. By using the set of Upper Layer Protocols described briefly in the first chapter, the application is provided with a simple way to use the InfiniBand messaging service. By doing so, the application's dependence on a particular mix of storage and networking I/O is effectively eliminated. Thus, the operator of a cloud infrastructure gains an unprecedented level of flexibility in hosting applications capable of executing on any available compute resource in the cloud.

One last point should be made before we leave behind the topic of cloud computing. Cloud infrastructures, as they are being deployed today, represent some of the highest densities yet seen in the computing industry. Cloud systems are often delivered in container-like boxes that push the limits of compute density, storage density, power and cooling density in a confined physical space. In that way, they share the same set of obstacles found in some of the highest density commercial data centers, such as those found in the financial services industry in Manhattan. These data centers face many of the same stifling constraints on available power, cooling and floor space.

We mentioned in an earlier section that enterprise data centers are often constrained in their ability to adopt relevant new technologies, since they must balance the advantages gained by deploying a new technology against requirements to protect investments in existing technologies. In that way, many of the cloud computing infrastructures now coming online are very different from their brethren in other commercial enterprises. Because of the modular way that modern cloud infrastructures are assembled and deployed, the opportunity does exist for simple adoption of a vastly improved I/O infrastructure like InfiniBand. For example, a typical cloud infrastructure might be composed of a series of self-contained containers resembling the familiar shipping container seen on trucks. This container, sometimes called a POD, contains racks,


servers, switches, networks and infrastructure. They are designed to be rolled in, connected to a power and cooling source, and be operational right away. A POD might comprise a complete, self-contained data center, or it may represent a granular unit of expansion in a larger cloud installation.

By deploying InfiniBand as the network internal to the container, and by taking full advantage of its messaging service, the container manufacturer can deliver to his or her client the densest possible configuration, presenting the highest effective utilization of compute, power and cooling resources possible. Such a container takes full advantage of InfiniBand's complete set of value propositions:

• Its low latency and extremely high bandwidth remove bottlenecks that prevent applications from delivering maximum achievable performance.

• The ease and simplicity with which an application directly uses InfiniBand's messaging service means that precious CPU cycles and, more importantly, memory bandwidth are devoted to application processing and not to housekeeping I/O chores. This drives up the effective utilization of available resources. Thus, the overall return on investment is higher, and the delivered performance per cubic foot of floor space and per watt of power consumption is higher.

• By providing each server with one fat pipe, the server platform gains an intrinsic guarantee that its mixture of storage I/O, networking I/O and IPC will always be perfectly balanced.

• InfiniBand's efficient management of the network topology offers excellent utilization of the available network connectivity. Because InfiniBand deploys a centralized subnet manager, switch forwarding tables are programmed in such a way that routes are balanced and available links are well used. This efficient topology handling scales to thousands of nodes and thus satisfies the needs of large cloud installations.

• InfiniBand's native, commodity bandwidths (currently 40Gb/s) are nearly four times higher than 10Gb/s Ethernet, at prices that have historically been extremely competitive compared to commodity 1Gb/s or 10Gb/s Ethernet solutions. Since each InfiniBand cable carries nearly four times the bandwidth of a corresponding 10Gb/s Ethernet cable, the number of connectors, cables, switches and so on that are required is dramatically reduced.

• And finally, the simplified communication service provided by an InfiniBand network inside the container simplifies the deployment of applications on the pool of computing, storage and networking resources provided by the container, and reduces the burden of migrating applications onto available resources.


physically copied to the local site. This is because enterprises rely on a file transfer protocol to replicate data from site to site. The time delay to accomplish the transfer can often be measured in minutes or hours, depending on file sizes, the capacity of the underlying network and the performance of the storage system. This poses a difficult problem in synchronizing storage between sites. Suppose, instead of copying files from the remote file system to the local file system, that the remote file system could simply be mounted to the local user's workstation in exactly the same way that the local file system is mounted? This is exactly the model enabled by executing InfiniBand over the WAN. Over long-haul connections with relatively long flight times, the InfiniBand protocol offers extremely high bandwidth efficiency across distances that cause TCP/IP to stumble.

Figure 7: An enterprise application reads data through a storage client; the storage client connects to each storage server via RDMA; connections to remote storage servers are made via the WAN; thus, the user has direct access to all data stored anywhere on the system. (The original figure shows a user application and its buffer on a storage client, connected both to local storage and, across the WAN, to remote storage.)

It turns out that the performance of any windowing protocol is a function of the so-called bandwidth-delay product; this product represents the maximum amount of data that can be stored in transit on the WAN at any given moment. Once the limit of the congestion window has been reached, the transport must stop transmitting until some of the bytes in the pipe can be drained at the far end. In order to achieve maximum utilization of the available WAN bandwidth, the transport's congestion window must be sufficiently large to keep the wire continuously full. As the wire length increases, or as the bandwidth on the wire increases, so does the bandwidth-delay product. And hence, the congestion window that is basic to any congestion-managed network such as InfiniBand or TCP/IP must grow very large in order to support the large bandwidth-delay products commonly found on high speed WAN links. Because of the fundamental architecture of the InfiniBand transport, it is easily able to support these exceedingly large bandwidth-delay products.
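To give a sense of scale, the short sketch below computes the bandwidth-delay product for two illustrative links. The 50 ms round-trip time and the 10Gb/s and 40Gb/s line rates are assumptions chosen purely for illustration, not figures drawn from any particular deployment.

/* Back-of-the-envelope sketch: how much data must be "in flight" to keep a
 * long-haul link full?  The link rates and the 50 ms round-trip time below
 * are illustrative assumptions only. */
#include <stdio.h>

int main(void)
{
    const double rtt_seconds = 0.050;            /* assumed 50 ms round trip */
    const double rates_gbps[] = { 10.0, 40.0 };  /* 10GbE vs. 40Gb/s InfiniBand */

    for (int i = 0; i < 2; i++) {
        /* bandwidth-delay product = line rate (bytes/s) * round-trip time */
        double bdp_bytes = (rates_gbps[i] * 1e9 / 8.0) * rtt_seconds;
        printf("%4.0f Gb/s link, %.0f ms RTT: window must hold ~%.1f MB\n",
               rates_gbps[i], rtt_seconds * 1e3, bdp_bytes / 1e6);
    }
    return 0;
}

At 40Gb/s and a 50 ms round trip this works out to roughly 250 MB of unacknowledged data, which is why a transport whose window cannot grow to that size leaves the wire idle.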

On the other hand, lossy flow control protocols such as TCP degrade rapidly as the bandwidth-delay product increases, causing turbulent rather


than smooth packet flow. These effects cause a rapid roll-off in achieved bandwidth performance as link distance or wire bandwidth increases. These effects are already pronounced at 10Gbits/s, and become much worse at 40 or 100Gbits/s and beyond.

Because the InfiniBand transport, when run over the WAN, is able to effectively manage the bandwidth-delay product of these long links, it becomes possible to actually mount the remote file system directly to the local user's workstation and hence to deliver performance approximately equivalent to that which the local user would experience when accessing a local file system. Of course, speed-of-light delays are an unavoidable byproduct of access over the WAN, but for many applications the improvement in response time from minutes or hours to milliseconds or seconds represents an easy tradeoff.

Long distance InfiniBand links may span Campus or Metro Area Networks or even Global WANs. Over shorter optical networks, applications directly benefit from InfiniBand's low latency, low jitter, low CPU loading and high throughput. Examples of such applications include:

• Data Center Expansion - spill an InfiniBand network from a primary data center to nearby buildings using multiple optical InfiniBand connections. This approach causes no disruption to data center traffic, is transparent to applications, and adds so little latency that operations can ignore it.

• Campus System Area Networks - InfiniBand makes an excellent System Area Network, carrying storage, IPC and Internet traffic over its unified fabric. Campus Area InfiniBand links allow islands of compute, storage and I/O devices, such as high-resolution display walls, to be federated and shared across departments. Location-agnostic access performance greatly assists in the efficient utilization of organization-wide resources.


Chapter 4 - Designing with InfiniBand

As with any network architecture, the InfiniBand Architecture provides a service; in the case of InfiniBand it provides a messaging service. That message service can be used to transport IPC messages, to transport control and data messages for storage, or to transport messages associated with any of a range of other usages. This makes InfiniBand different from a traditional TCP/IP/Ethernet network, which provides a byte stream transport service, or Fibre Channel, which provides a transport service specific to the Fibre Channel wire protocol. By application, we mean, for example, a user application using the sockets interface to conduct IPC, or a kernel-level file system application executing SCSI block-level commands to a remote block storage device.

The key feature that sets InfiniBand apart from other network technologies is that InfiniBand is designed to provide message transport services directly to the application layer, whether it is a user application or a kernel application. This is very different from a traditional network, which requires the application to solicit assistance from the operating system to transfer a stream of bytes.

    to solicit assistance rom the operating system to transer a stream o bytes.

    At rst blush, one might think that such application level access requires the

    application to ully understand the networking and I/O protocols provided by

    the underlying network. Nothing could be urther rom the truth.The top layer o the InniBand Architecture denes asotware transport

    interace; this section o the specication denes the methods that an applica-

    tion uses to access the complete set o services provided by InniBand.

An application accesses InfiniBand's message transport service by posting a Work Request (WR) to a work queue. As described earlier, the work queue is part of a structure called a Queue Pair (QP), representing the endpoint of a channel connecting the application with another application. A QP contains two work queues, a Send Queue and a Receive Queue, hence the expression QP. A WR, once placed on the work queue, is in effect an instruction to the InfiniBand RDMA message transport to transmit a message on the channel, or to perform some sort of control or housekeeping function on the channel.

The methods used by an application to interact with the InfiniBand transport are called verbs and are defined in the software transport interface section of the specification. An application uses a POST SEND verb, for example, to


request that the InfiniBand transport send a message on the given channel. This is what is meant when we say that the InfiniBand network stack goes all the way up to the application layer. It provides methods that are used directly by an application to request service from the InfiniBand transport. This is very different from a traditional network, which requires the application to solicit assistance from the operating system in order to access the server's network services.

Figure 8: The layers between an application and the IB transport. The application (the consumer of InfiniBand message services) posts work requests to a queue; each work request represents a message, a unit of work. The channel interface (verbs) allows the consumer to request services. The InfiniBand channel interface provider maintains the work queues, manages address translation and provides completion and event mechanisms. The IB transport packetizes messages, provides the transport services (reliable/unreliable, connected/unconnected, datagram), implements the RDMA protocol (send/receive, RDMA read/write, atomics) and assures reliable end-to-end delivery.
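As a concrete illustration of the POST SEND semantic described above, the sketch below uses the libibverbs API distributed by the OpenFabrics Alliance, which is one common realization of the verbs. It assumes a queue pair, a registered memory region and a buffer have already been set up (those steps, along with completion handling, are omitted), and the helper name post_one_send is our own invention for this sketch, not part of any specification.

/* Minimal sketch (not a complete program): posting one send Work Request
 * onto a QP's Send Queue using the OpenFabrics libibverbs API.
 * Connection setup, memory registration and completion polling are omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int post_one_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    /* Scatter/gather element describing the registered buffer to transmit. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,          /* local key from memory registration */
    };

    /* The Work Request: "send this message on the channel". */
    struct ibv_send_wr wr = {
        .wr_id      = 0x1234,        /* opaque id returned in the completion */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,   /* the POST SEND channel semantic */
        .send_flags = IBV_SEND_SIGNALED,
    };

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);   /* returns 0 on success */
}

The application would later reap a completion entry carrying the same wr_id, which is how it learns that the message has been handed to the transport.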

Building on Top of the Verbs

The verbs, as defined, are sufficient to enable an application developer to design and code an application for use directly on top of an InfiniBand network. But if the application developer were to do so, he or she would need to create a set of customized application interface code, at least one set for each operating system to be supported. In a world of specialized applications, such as high performance computing, where the last ounce of performance is important, such an approach makes a great deal of sense: the range of applications in question is fairly well defined, and the variety of systems on which such applications are deployed is similarly constrained. But in a world of more broadly distributed applications and operating systems, it may be prohibitive for the application or hardware developer, and the profusion of application interfaces may be an obstacle inhibiting an IT manager from adopting a given application or technology.

    In more broadly deployed systems, an application accesses system services


via a set of APIs often supplied as part of the operating system.

InfiniBand is an industry-standard interconnect and is not tied to any given operating system or server platform. For this reason, it does not specify any APIs directly. Instead, the InfiniBand verbs fully specify the required behaviors of the APIs. In effect, the verbs represent a semantic definition of the application interface, specifying the methods and mechanisms that an application uses to control the network and to access its services. Since the verbs specify a set of required behaviors, they are sufficient to support the development of OS-specific APIs and thus ensure interoperability between, for example, an application executing on a Windows platform and an application executing on a Linux platform. The verbs ensure cross-platform and cross-vendor interoperability. APIs can be tested both for conformance to the verbs specification and for interoperability, allowing different vendors of InfiniBand solutions to assure their customers that the devices and software will operate in conformance to the specification and will interoperate with solutions provided by other vendors.

However, a semantic description of a required behavior is of little actual use without an accompanying set of APIs to implement the required behavior. Fortunately, other organizations do provide APIs, even if the InfiniBand Trade Association does not. For example, the OpenFabrics Alliance and other commercial providers supply a complete set of software stacks which are both necessary and sufficient to deploy an InfiniBand-based system. The software stack available from the OFA, as the name suggests, is open source and freely downloadable under both BSD and GNU licenses. The OFA is broadly supported by providers of InfiniBand hardware, by major operating system suppliers, by suppliers of middleware, and by the Linux open source community. All of these organizations, and many others, contribute open source code to the OFA repository, where it undergoes testing and refinement and is made generally available to the community at large.

But even supplying APIs is not, by itself, sufficient to deploy InfiniBand. Broadly speaking, the software elements required to implement InfiniBand fall into three major areas: the set of Upper Layer Protocols (ULPs) and associated libraries; a set of mid-layer functions that are helpful in configuring and managing the underlying InfiniBand fabric and that provide needed services to the upper layer protocols; and a set of hardware-specific device drivers. Some of the mid-layer functions are described in a little more detail in the chapter describing the InfiniBand Management Architecture. All of these components are freely available through open source for easy download, installation and use. They can be downloaded directly from the OpenFabrics Alliance or, in the case of Linux, acquired as part of the standard distributions provided by a number of Linux distributors. The complete set of software components provided by the OpenFabrics Alliance is formally known

  • 8/2/2019 Intro to IB for End Users

    34/54

    [ 34 ]

    INTRO TO INFINIBAND FOR END USERS

as the OpenFabrics Enterprise Distribution (OFED).
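As a small, hedged example of what the OFED software looks like from an application's point of view, the sketch below simply enumerates the RDMA-capable devices visible through the libibverbs library; it assumes an OFED (or distribution-supplied) verbs stack is installed and does nothing beyond listing device names.

/* Sketch: enumerate RDMA devices through the OFED libibverbs library.
 * Assumes the OFED (or distribution) verbs stack is installed. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num; i++)
        printf("found RDMA device: %s\n", ibv_get_device_name(devs[i]));

    ibv_free_device_list(devs);
    return 0;
}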

There is an important point associated with the ULPs that bears further mention. This chapter began by describing InfiniBand as a message transport service. As described above, in some circumstances it is entirely practical to design an application to access the message transport service directly by coding the application to conform to the InfiniBand verbs. Purpose-built HPC applications, for example, often do this in order to gain the full measure of performance. But doing so requires the application, and the application developer, to be RDMA aware. For example, the application would need to understand the sequence of verbs used to create a queue pair, to establish a connection, to send and receive messages over the connection, and so forth. Most commercial applications, however, are designed to take advantage of a standard networking or storage subsystem by leveraging existing, industry-standard APIs. That does not mean that such an application cannot benefit from the performance and flexibility advantages provided by InfiniBand. The set of ULPs provided through OFED allows an existing application to take easy advantage of InfiniBand. Each ULP can be thought of as having two interfaces: an upward-facing interface presented to the application, and a downward-facing interface connecting to the underlying InfiniBand messaging service via the queue pairs defined in the software transport interface. At its upward-facing interface, the ULP presents a standard interface easily recognizable by the application. For example, the SCSI RDMA Protocol ULP (SRP) presents a standard SCSI class driver interface to the file system above it. By providing the advantages of an RDMA network, the ULP allows a SCSI file system to reap the full benefits of InfiniBand. A rich set of ULPs is provided by the OpenFabrics Alliance, including:

• SDP - Sockets Direct Protocol. This ULP allows a sockets application to take advantage of an InfiniBand network with no change to the application.

• SRP - SCSI RDMA Protocol. This allows a SCSI file system to directly connect to a remote block storage chassis using RDMA semantics. Again, there is no impact to the file system itself.

• iSER - iSCSI Extensions for RDMA. iSCSI is a protocol allowing a block storage file system to access a block storage device over a generic network. iSER allows the user to operate the iSCSI protocol over an RDMA-capable network.

• IPoIB - IP over InfiniBand. This important part of the suite of ULPs allows an application hosted in, for example, an InfiniBand-based network to communicate with other sources outside the InfiniBand network using standard IP semantics. Although often used to transport TCP/IP over an InfiniBand network, the IPoIB ULP can be used to transport any of the suite of IP protocols, including UDP, SCTP and others.


• NFS-RDMA - Network File System over RDMA. NFS is a well-known and widely-deployed file system providing file-level I/O (as opposed to block-level I/O) over a conventional TCP/IP network. This enables easy file sharing. NFS-RDMA extends the protocol and enables it to take full advantage of the high bandwidth and parallelism provided naturally by InfiniBand.

• Lustre support - Lustre is a parallel file system enabling, for example, a set of clients hosted on a number of servers to access the data store in parallel. It does this by taking advantage of InfiniBand's Channel I/O architecture, allowing each client to establish an independent, protected channel between itself and the Lustre Metadata Servers (MDS) and associated Object Storage Servers and Targets (OSS, OST).

• RDS - Reliable Datagram Sockets offers a Berkeley sockets API allowing messages to be sent to multiple destinations from a single socket. This ULP, originally developed by Oracle, is ideally suited to allowing database systems to take full advantage of the parallelism and low latency characteristics of InfiniBand.

• MPI - The MPI ULP for HPC clusters provides full support for MPI function calls (a minimal illustration follows this list).
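To illustrate the point that an MPI application never touches the verbs directly, the sketch below is an ordinary two-rank MPI exchange; whether the bytes travel over InfiniBand is decided entirely by the MPI library and its configuration underneath, not by this code. It is an illustrative example, not taken from the text.

/* Illustrative two-rank MPI exchange.  The application uses only standard MPI
 * calls; the MPI library decides whether the message moves over InfiniBand. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int payload = 42;
    if (rank == 0) {
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}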

Figure 9: The Upper Layer Protocols sit between the applications and the InfiniBand verbs interface. Sockets applications, block storage, IP-based applications, clustered database access and file systems each use a corresponding ULP (VNIC, SDP, IPoIB, iSER, RDS, NFS-RDMA, MPI, SRP and cluster file systems), all of which run over the InfiniBand verbs interface, the InfiniBand RDMA message transport service and the InfiniBand fabric interconnect.

There is one more important corollary to this. As described above, a traditional API provides access to a particular kind of service, such as a network service or a storage service. These services are usually provided over a specific type of network or interconnect. The sockets API, for example, gives an application access to a network service and is closely coupled with an underlying TCP/IP network. Similarly, a block-level file system is closely coupled with a


purpose-built storage interconnect such as Fibre Channel. This is why data center networks today are commonly thought of as consisting of as many as three separate interconnects: one for networking, one for storage and possibly a third for IPC. Each type of interconnect consists of its own physical plant (cables, switches, NICs or HBAs and so on), its own protocol stack and, most importantly, its own set of application-level APIs.

This is where the InfiniBand Architecture is different: the set of ULPs provides a series of interfaces that an application can use to access a particular kind of service (storage service, networking service, IPC service or others). However, each of those services is provided using a single underlying network. At their lower interfaces, the ULPs are bound to a single InfiniBand network. There is no need for multiple underlying networks coupled to each type of application interface. In effect, the set of ULPs gives an application access to a single unified fabric, completely hiding the details associated with a unified fabric, while avoiding the need for multiple data center networks.


Chapter 5 - InfiniBand Architecture and Features

Chapter 1 provided a high-level overview of the key concepts underpinning InfiniBand. Some of these concepts include channel I/O (described as a connection between two virtual address spaces), a channel endpoint called a queue pair (QP), and direct application-level access to a QP via virtual-to-physical memory address mapping. A detailed description of the architecture and its accompanying features is the province of an entire multi-volume book and is well outside the scope of this minibook. This chapter is designed to deepen your understanding of the architecture with an eye toward understanding how InfiniBand's key concepts are achieved and how they are applied.

Volume 1 of the InfiniBand Architecture specification co