
Introduction to InfiniBand for End Users

Industry-Standard Value and Performance for High Performance Computing and the Enterprise

Paul Grun

InfiniBand Trade Association


    Contents

Acknowledgements
About the Author
Introduction
Chapter 1  Basic Concepts
Chapter 2  InfiniBand for HPC
  InfiniBand MPI Clusters
  Storage and HPC
  Upper Layer Protocols for HPC
Chapter 3  InfiniBand for the Enterprise
  Devoting Server Resources to Application Processing
  A Flexible Server Architecture
  Investment Protection
  Cloud Computing - An Emerging Data Center Model
  InfiniBand in the Distributed Enterprise
Chapter 4  Designing with InfiniBand
  Building on Top of the Verbs
Chapter 5  InfiniBand Architecture and Features
  Address Translation
  The InfiniBand Transport
  InfiniBand Link Layer Considerations
  Management and Services
  InfiniBand Management Architecture
Chapter 6  Achieving an Interoperable Solution
  An Interoperable Software Solution
  Ensuring Hardware Interoperability
Chapter 7  InfiniBand Performance Capabilities and Examples
Chapter 8  Into the Future


    Acknowledgements

Special thanks to Gilad Shainer at Mellanox for his work and content related to Chapter 7 - InfiniBand Performance Capabilities and Examples.

Thanks to the following individuals for their contributions to this publication: David Southwell, Ariel Cohen, Rupert Dance, Jay Diepenbrock, Jim Ryan, Cheri Winterberg and Brian Sparks.

    About the Author

Paul Grun has been working on I/O for mainframes and servers for more than 30 years and has been involved in high performance networks and RDMA technology since before the inception of InfiniBand. As a member of the InfiniBand Trade Association, Paul served on the Link Working Group during the development of the InfiniBand Architecture; he is a member of the IBTA Steering Committee and is a past chair of the Technical Working Group. He recently served as chair of the IBXoE Working Group, charged with developing the new RDMA over Converged Ethernet (RoCE) specification. He is currently chief scientist for System Fabric Works, Inc., a consulting and professional services company dedicated to delivering RDMA and storage solutions for high performance computing, commercial enterprise and cloud computing systems.


    Introduction

InfiniBand is not complex. Despite its reputation as an exotic technology, the concepts behind it are surprisingly straightforward. One purpose of this book is to clearly describe the basic concepts behind the InfiniBand Architecture.

Understanding the basic concepts is all well and good, but not very useful unless those concepts deliver a meaningful and significant value to the user of the technology. And that is the real purpose of this book: to draw straight-line connections between the basic concepts behind the InfiniBand Architecture and how those concepts deliver real value to the enterprise data center and to high performance computing (HPC).

The readers of this book are exquisitely knowledgeable in their respective fields of endeavor, whether those endeavors are installing and maintaining a large HPC cluster, overseeing the design and development of an enterprise data center, deploying servers and networks, or building devices/appliances targeted at solving a specific problem requiring computers. But most readers of this book are unlikely to be experts in the arcane details of networking architecture. Although the working title for this book was "InfiniBand for Dummies," its readers are anything but dummies. This book is designed to give its readers, in one short sitting, a view into the InfiniBand Architecture that you probably could not get by reading the specification itself.

The point is that you should not have to hold an advanced degree in obscure networking technologies in order to understand how InfiniBand technology can bring benefits to your chosen field of endeavor and to understand what those benefits are. This minibook is written for you. At the end of the next hour you will be able to clearly see how InfiniBand can solve problems you are facing today.

By the same token, this book is not a detailed technical treatise of the underlying theory, nor does it provide a tutorial on deploying the InfiniBand Architecture. All we can hope for in this short book is to bring a level of enlightenment about this exciting technology. The best measure of our success is if you, the reader, feel motivated after reading this book to learn more about how to deploy the InfiniBand Architecture in your computing environment.

We will avoid the low-level bits and bytes of the elegant architecture (which are complex) and focus on the concepts and applications that are visible to the user of InfiniBand technology. When you decide to move forward with deploying InfiniBand, there is ample assistance available in the market to help you deploy it, just as there is for any other networking technology like traditional TCP/IP/Ethernet. The InfiniBand Trade Association and its website (www.infinibandta.org) are great sources


of access to those with deep experience in developing, deploying and using the InfiniBand Architecture.

The InfiniBand Architecture emerged in 1999 as the joining of two competing proposals known as Next Generation I/O and Future I/O. These proposals, and the InfiniBand Architecture that resulted from their merger, are all rooted in the Virtual Interface Architecture, VIA. The Virtual Interface Architecture is based on two synergistic concepts: direct access to a network interface (e.g. a NIC) straight from application space, and an ability for applications to exchange data directly between their respective virtual buffers across a network, all without involving the operating system directly in the address translation and networking processes needed to do so. This is the notion of Channel I/O: the creation of a virtual channel directly connecting two applications that exist in entirely separate address spaces. We will come back to a detailed description of these key concepts in the next few chapters.

InfiniBand is often compared, and not improperly, to a traditional network such as TCP/IP/Ethernet. In some respects it is a fair comparison, but in many other respects the comparison is wildly off the mark. It is true that InfiniBand is based on networking concepts and includes the usual layers found in a traditional network, but beyond that there are more differences between InfiniBand and an IP network than there are similarities. The key is that InfiniBand provides a messaging service that applications can access directly. The messaging service can be used for storage, for InterProcess Communication (IPC) or for a host of other purposes, anything that requires an application to communicate with others in its environment. The key benefits that InfiniBand delivers accrue from the way that the InfiniBand messaging service is presented to the application, and the underlying technology used to transport and deliver those messages. This is much different from TCP/IP/Ethernet, which is a byte-stream oriented transport for conducting bytes of information between sockets applications.

The first chapter gives a high-level view of the InfiniBand Architecture and reviews the basic concepts on which it is founded. The next two chapters relate those basic concepts to value propositions which are important to users in the HPC community and in the commercial enterprise. Each of those communities has unique needs, but both benefit from the advantages InfiniBand can bring. Chapter 4 addresses InfiniBand's software architecture and describes the open source software stacks that are available to ease and simplify the deployment of InfiniBand. Chapter 5 delves slightly deeper into the nuances of some of the key elements of the architecture. Chapter 6 describes the efforts undertaken by both the InfiniBand Trade Association and the OpenFabrics Alliance (www.openfabrics.org) to ensure that solutions from a range of vendors will interoperate and that they will behave in conformance with the specification. Chapter 7 describes some comparative performance proof points, and finally Chapter 8 briefly reviews the ways in which the InfiniBand Trade Association ensures that the InfiniBand Architecture continues to move ahead into the future.


    Chapter 1 Basic Concepts

Networks are often thought of as the set of routers, switches and cables that are plugged into servers and storage devices. When asked, most people would probably say that a network is used to connect servers to other servers, storage and a network backbone. It does seem that traditional networking generally starts with a bottoms up view, with much attention focused on the underlying wires and switches. This is a very network centric view of the world; the way that an application communicates is driven partly by the nature of the traffic. This is what has given rise to the multiple dedicated fabrics found in many data centers today.

A server generally provides a selection of communication services to the applications it supports; one for storage, one for networking and frequently a third for specialized IPC traffic. This complement of communications stacks (storage, networking, etc.) and the device adapters accompanying each one comprise a shared resource. This means that the operating system owns these resources and makes them available to applications on an as-needed basis. An application, in turn, relies on the operating system to provide it with the communication services it needs. To use one of these services, the application uses some sort of interface or API to make a request to the operating system, which conducts the transaction on behalf of the application.

InfiniBand, on the other hand, begins with a distinctly application centric view by asking a profoundly simple question: how to make application accesses to other applications and to storage as simple, efficient and direct as possible? If one begins to look at the I/O problem from this application centric perspective, one comes up with a much different approach to network architecture.

The basic idea behind InfiniBand is simple; it provides applications with an easy-to-use messaging service. This service can be used to communicate with other applications or processes or to access storage. That's the whole idea. Instead of making a request to the operating system for access to one of the server's communication resources, an application accesses the InfiniBand messaging service directly. Since the service provided is a straightforward messaging service, the complex dance between an application and a traditional


network is eliminated. Since storage traffic can be viewed as consisting of control messages and data messages, the same messaging service is equally at home for use in storage applications.

A key to this approach is that the InfiniBand Architecture gives every application direct access to the messaging service. Direct access means that an application need not rely on the operating system to transfer messages. This idea is in contrast to a standard network environment where the shared network resources (e.g. the TCP/IP network stack and associated NICs) are solely owned by the operating system and cannot be accessed by the user application. This means that an application cannot have direct access to the network and instead must rely on the involvement of the operating system to move data from the application's virtual buffer space, through the network stack and out onto the wire. There is an identical involvement of the operating system at the other end of the connection. InfiniBand avoids this through a technique known as stack bypass. How it does this, while ensuring the same degree of application isolation and protection that an operating system would provide, is the one key piece of "secret sauce" that underlies the InfiniBand Architecture.

InfiniBand provides the messaging service by creating a channel connecting an application to any other application or service with which the application needs to communicate. The applications using the service can be either user space applications or kernel applications such as a file system. This application-centric approach to the computing problem is the key differentiator between InfiniBand and traditional networks. You could think of it as a "tops down" approach to architecting a networking solution. Everything else in the InfiniBand Architecture is there to support this one simple goal: provide a message service to be used by an application to communicate directly with another application or with storage.

The challenge for InfiniBand's designers was to create these channels between virtual address spaces which would be capable of carrying messages of varying sizes, and to ensure that the channels are isolated and protected. These channels would need to serve as pipes or connections between entirely disjoint virtual address spaces. In fact, the two virtual spaces might even be located in entirely disjoint physical address spaces; in other words, hosted by different servers, even over a distance.


Figure 1: InfiniBand creates a channel directly connecting an application in its virtual address space to an application in another virtual address space. The two applications can be in disjoint physical address spaces hosted by different servers.

It is convenient to give a name to the endpoints of the channel; we'll call them Queue Pairs (QPs). Each QP consists of a Send Queue and a Receive Queue, and each QP represents one end of a channel. If an application requires more than one connection, more QPs are created. The QPs are the structure by which an application accesses InfiniBand's messaging service. In order to avoid involving the OS, the applications at each end of the channel must have direct access to these QPs. This is accomplished by mapping the QPs directly into each application's virtual address space. Thus, the application at each end of the connection has direct, virtual access to the channel connecting it to the application (or storage) at the other end of the channel. This is the notion of Channel I/O.

Taken altogether then, this is the essence of the InfiniBand Architecture. It creates private, protected channels between two disjoint virtual address spaces, it provides a channel endpoint, called a QP, to the applications at each end of the channel, and it provides a means for a local application to transfer messages directly between applications residing in those disjoint virtual address spaces. Channel I/O in a nutshell.
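For readers who want to see what a channel endpoint looks like in practice, here is a minimal sketch using the open source libibverbs API (the verbs interface discussed later in this book). It opens an HCA and creates one Queue Pair; the device choice, queue depths and minimal error handling are illustrative assumptions, and connecting the QP to a remote peer (commonly done with the companion librdmacm library) is not shown.

    /* Minimal sketch: creating a Queue Pair (one channel endpoint) with libibverbs.
     * Assumes a single InfiniBand HCA is present; queue depths are arbitrary examples. */
    #include <infiniband/verbs.h>
    #include <stdio.h>

    int main(void)
    {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

        struct ibv_context *ctx = ibv_open_device(devs[0]);              /* open the HCA      */
        struct ibv_pd      *pd  = ibv_alloc_pd(ctx);                     /* protection domain */
        struct ibv_cq      *cq  = ibv_create_cq(ctx, 16, NULL, NULL, 0); /* completion queue  */

        /* A QP is a Send Queue plus a Receive Queue mapped into this process. */
        struct ibv_qp_init_attr attr = {
            .send_cq = cq,
            .recv_cq = cq,
            .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                         .max_send_sge = 1, .max_recv_sge = 1 },
            .qp_type = IBV_QPT_RC,   /* reliable connected: one end of a channel */
        };
        struct ibv_qp *qp = ibv_create_qp(pd, &attr);
        if (!qp) { fprintf(stderr, "QP creation failed\n"); return 1; }

        printf("created QP number 0x%x\n", qp->qp_num);

        ibv_destroy_qp(qp);
        ibv_destroy_cq(cq);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }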

Having established a channel, and having created a virtual endpoint to the channel, there is one further architectural nuance needed to complete the channel I/O picture, and that is the actual method for transferring a message. InfiniBand provides two transfer semantics: a channel semantic sometimes called SEND/RECEIVE, and a pair of memory semantics called RDMA READ and RDMA WRITE. When using the channel semantic, the message is received in a data structure provided by the application on the receiving side. This data structure was pre-posted on its receive queue. Thus, the sending side does not have visibility into the buffers or data structures on the receiving side; instead, it simply SENDS the message and the receiving application RECEIVES the message.


The memory semantic is somewhat different; in this case the receiving side application registers a buffer in its virtual memory space. It passes control of that buffer to the sending side, which then uses RDMA READ or RDMA WRITE operations to either read or write the data in that buffer.

A typical storage operation may illustrate the difference. The initiator wishes to store a block of data. To do so, it places the block of data in a buffer in its virtual address space and uses a SEND operation to send a storage request to the target (e.g. the storage device). The target, in turn, uses RDMA READ operations to fetch the block of data from the initiator's virtual buffer. Once it has completed the operation, the target uses a SEND operation to return ending status to the initiator. Notice that the initiator, having requested service from the target, was free to go about its other business while the target asynchronously completed the storage operation, notifying the initiator on completion. The data transfer phase and the storage operation, of course, comprise the bulk of the operation.
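The two semantics map directly onto work request opcodes in the verbs API. The hypothetical fragment below posts a SEND (the channel semantic the initiator uses for its request and the target uses for ending status) and an RDMA READ (the memory semantic the target uses to fetch the data); an RDMA WRITE is posted the same way, with the data flowing in the opposite direction. The addresses, keys and length are placeholders a real program would obtain from memory registration and from information exchanged with its peer.

    /* Sketch: channel semantic (SEND) versus memory semantic (RDMA READ).
     * 'laddr'/'lkey' describe a locally registered buffer; 'raddr'/'rkey' were
     * handed over by the remote peer when it passed control of its buffer. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    void post_examples(struct ibv_qp *qp, uint64_t laddr, uint32_t lkey,
                       uint64_t raddr, uint32_t rkey, uint32_t len)
    {
        struct ibv_sge sge = { .addr = laddr, .length = len, .lkey = lkey };
        struct ibv_send_wr wr, *bad;

        /* Channel semantic: SEND. The payload lands in whatever buffer the
         * receiver pre-posted on its receive queue; the sender never sees it. */
        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 1;
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;
        ibv_post_send(qp, &wr, &bad);

        /* Memory semantic: RDMA READ. The requester names the remote buffer
         * explicitly using the virtual address and rkey it was given. */
        memset(&wr, 0, sizeof(wr));
        wr.wr_id               = 2;
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.opcode              = IBV_WR_RDMA_READ;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = raddr;
        wr.wr.rdma.rkey        = rkey;
        ibv_post_send(qp, &wr, &bad);
    }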

Included as part of the InfiniBand Architecture, and located right above the transport layer, is the software transport interface. The software transport interface contains the QPs; keep in mind that the queue pairs are the structure by which the RDMA message transport service is accessed. The software transport interface also defines all the methods and mechanisms that an application needs to take full advantage of the RDMA message transport service. For example, the software transport interface describes the methods that applications use to establish a channel between them. An implementation of the software transport interface includes the APIs and libraries needed by an application to create and control the channel and to use the QPs in order to transfer messages.

Figure 2: The InfiniBand Architecture provides an easy-to-use messaging service to applications. The messaging service includes a communications stack similar to the familiar OSI reference model. (The figure shows an application sitting on the InfiniBand messaging service, which comprises the S/W Transport Interface and the RDMA message transport service spanning the Transport, Network, Link and Physical layers.)


Underneath the covers of the messaging service, all this still requires a complete network stack just as you would find in any traditional network. It includes the InfiniBand transport layer to provide reliability and delivery guarantees (similar to the TCP transport in an IP network), a network layer (like the IP layer in a traditional network) and link and physical layers (wires and switches). But it's a special kind of network stack because it has features that make it simple to transport messages directly between applications' virtual memory, even if the applications are remote with respect to each other. Hence, the combination of InfiniBand's transport layer together with the software transport interface is better thought of as a Remote Direct Memory Access (RDMA) message transport service. The entire stack taken together, including the software transport interface, comprises the InfiniBand messaging service.

How does the application actually transfer a message? The InfiniBand Architecture provides simple mechanisms, defined in the software transport interface, for placing a request to perform a message transfer on a queue. This queue is the QP, representing the channel endpoint. The request is called a Work Request (WR) and represents a single quantum of work that the application wants to perform. A typical WR, for example, describes a message that the application wishes to have transported to another application.
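In code, that quantum of work looks roughly like the sketch below: a Work Request describing a registered message buffer is posted to the send queue, and the application later learns the outcome by polling the completion queue. The busy-wait loop and the parameter values are simplifications for illustration.

    /* Sketch: posting one Work Request and reaping its completion.
     * 'qp' and 'cq' are assumed to be a connected QP and its completion queue;
     * 'buf_addr', 'buf_len' and 'lkey' describe a registered message buffer. */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int send_one_message(struct ibv_qp *qp, struct ibv_cq *cq,
                         uint64_t buf_addr, uint32_t buf_len, uint32_t lkey)
    {
        struct ibv_sge sge = { .addr = buf_addr, .length = buf_len, .lkey = lkey };
        struct ibv_send_wr wr, *bad;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id      = 42;                 /* application-chosen tag for this WR */
        wr.sg_list    = &sge;
        wr.num_sge    = 1;
        wr.opcode     = IBV_WR_SEND;
        wr.send_flags = IBV_SEND_SIGNALED;  /* ask for a completion entry         */

        if (ibv_post_send(qp, &wr, &bad))   /* place the WR on the send queue     */
            return -1;

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                               /* busy-poll; real code might block   */

        return (wc.status == IBV_WC_SUCCESS && wc.wr_id == 42) ? 0 : -1;
    }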

The notion of a WR as representing a message is another of the key distinctions between InfiniBand and a traditional network; the InfiniBand Architecture is said to be "message-oriented." A message can range in size up to 2**31 bytes. This means that the entire architecture is oriented around moving messages. This messaging service is a distinctly different service than is provided by other traditional networks such as TCP/IP, which are adept at moving strings of bytes from the operating system in one node to the operating system in another node. As the name suggests, a byte stream-oriented network presents a stream of bytes to an application. Each time an Ethernet packet arrives, the server NIC hardware places the bytes comprising the packet into an anonymous buffer in main memory belonging to the operating system. Once the stream of bytes has been received into the operating system's buffer, the TCP/IP network stack signals the application to request a buffer from the application into which the bytes can be placed. This process is repeated each time a packet arrives until the entire message is eventually received.

InfiniBand, on the other hand, delivers a complete message to an application. Once an application has requested transport of a message, the InfiniBand hardware automatically segments the outbound message into a number of packets; the packet size is chosen to optimize the available network bandwidth. Packets are transmitted through the network by the InfiniBand hardware, and at the receiving end they are delivered directly into the receiving application's virtual buffer where they are re-assembled into a complete message. Once the entire


message has been received, the receiving application is notified. Neither the sending nor the receiving application is involved until the complete message is delivered to the receiving application's virtual buffer.
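The contrast shows up clearly in a short, hypothetical receive path written both ways. The sockets version must loop until it has reassembled the message from however many pieces the byte stream delivers (how the caller knows the message length is outside this sketch), while the verbs version pre-posts a single buffer and gets one completion for the whole message. The names and buffer handling are illustrative assumptions.

    /* Byte-stream receive: the application reassembles the message itself. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <infiniband/verbs.h>

    ssize_t recv_full_message(int fd, char *buf, size_t msg_len)
    {
        size_t got = 0;
        while (got < msg_len) {                                 /* loop until every    */
            ssize_t n = recv(fd, buf + got, msg_len - got, 0);  /* byte has arrived    */
            if (n <= 0)
                return -1;                                      /* error or disconnect */
            got += n;
        }
        return (ssize_t)got;
    }

    /* Message-oriented receive with verbs: one pre-posted buffer, one completion.
     * 'qp', 'cq', 'buf_addr', 'buf_len' and 'lkey' are assumed to be set up already. */
    int recv_one_message(struct ibv_qp *qp, struct ibv_cq *cq,
                         uint64_t buf_addr, uint32_t buf_len, uint32_t lkey)
    {
        struct ibv_sge sge = { .addr = buf_addr, .length = buf_len, .lkey = lkey };
        struct ibv_recv_wr wr = { .wr_id = 7, .sg_list = &sge, .num_sge = 1 }, *bad;

        if (ibv_post_recv(qp, &wr, &bad))        /* hand the whole buffer to the HCA */
            return -1;

        struct ibv_wc wc;
        while (ibv_poll_cq(cq, 1, &wc) == 0)
            ;                                    /* one completion = whole message   */
        return (wc.status == IBV_WC_SUCCESS) ? (int)wc.byte_len : -1;
    }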

How, exactly, does a sending application, for example, send a message? The InfiniBand software transport interface specification defines the concept of a "verb." The word "verb" was chosen since a verb describes action, and this is exactly how an application requests action from the messaging service; it uses a verb. The set of verbs, taken together, are simply a semantic description of the methods the application uses to request service from the RDMA message transport service. For example, "Post Send Request" is a commonly used verb to request transmission of a message to another application.

The verbs are fully defined by the specification, and they are the basis for specifying the APIs that an application uses. But the InfiniBand Architecture specification doesn't define the actual APIs; that is left to other organizations such as the OpenFabrics Alliance, which provides a complete set of open source APIs and software which implements the verbs and works seamlessly with the InfiniBand hardware. A richer description of the concept of verbs, and how one designs a system using InfiniBand and the verbs, is contained in a later chapter.

The InfiniBand Architecture defines a full set of hardware components necessary to deploy the architecture. Those components are:

HCA - Host Channel Adapter. An HCA is the point at which an InfiniBand end node, such as a server or storage device, connects to the InfiniBand network. InfiniBand's architects went to significant lengths to ensure that the architecture would support a range of possible implementations, thus the specification does not require that particular HCA functions be implemented in hardware, firmware or software. Regardless, the collection of hardware, software and firmware that represents the HCA's functions provides the applications with full access to the network resources. The HCA contains address translation mechanisms, under the control of the operating system, that allow an application to access the HCA directly. In effect, the Queue Pairs that the application uses to access the InfiniBand hardware appear directly in the application's virtual address space. The same address translation mechanism is the means by which an HCA accesses memory on behalf of a user level application. The application, as usual, refers to virtual addresses; the HCA has the ability to translate these into physical addresses in order to effect the actual message transfer. (A short sketch of this memory registration step appears just after this list of components.)

TCA - Target Channel Adapter. This is a specialized version of a channel adapter intended for use in an embedded environment such as a storage appliance. Such an appliance may be based on an embedded operating system, or may be entirely based on state machine logic and therefore may not require a standard interface for applications. The concept of a TCA is


not widely deployed today; instead most I/O devices are implemented using standard server motherboards controlled by standard operating systems.

Switches - An InfiniBand Architecture switch is conceptually similar to any other standard networking switch, but molded to meet InfiniBand's performance and cost targets. For example, InfiniBand switches are designed to be "cut through" for performance and cost reasons, and they implement InfiniBand's link layer flow control protocol to avoid dropped packets. This is a subtle but key element of InfiniBand, since it means that packets are never dropped in the network during normal operation. This "no drop" behavior is central to the operation of InfiniBand's highly efficient transport protocol.

Routers - Although not currently in wide deployment, an InfiniBand router is intended to be used to segment a very large network into smaller subnets connected together by an InfiniBand router. Since InfiniBand's management architecture is defined on a per subnet basis, using an InfiniBand router allows a large network to be partitioned into a number of smaller subnets, thus enabling the deployment of InfiniBand networks that can be scaled to very large sizes without the adverse performance impacts due to the need to route management traffic throughout the entire network. In addition to this traditional use, specialized InfiniBand routers have also been deployed in a unique way as "range extenders" to interconnect two segments of an InfiniBand subnet that is distributed across wide geographic distances.

Cables and Connectors - Volume 2 of the Architecture Specification is devoted to the physical and electrical characteristics of InfiniBand and defines, among other things, the characteristics and specifications for InfiniBand cables, both copper and optical. This has enabled vendors to develop and offer for sale a wide range of both copper and optical cables in a broad range of widths (4x, 12x) and speed grades (SDR, DDR, QDR).
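Returning to the address translation role described under the HCA above, the sketch below shows the registration step an application performs so that the HCA can translate its virtual addresses and, through the rkey, allow a remote peer to reach the buffer. The buffer size and access flags are illustrative assumptions.

    /* Sketch: registering ordinary virtual memory with the HCA. */
    #include <infiniband/verbs.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);                 /* plain virtual memory */
        if (!buf)
            return NULL;

        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            free(buf);
            return NULL;
        }

        /* lkey is used in local work requests; rkey is handed to the remote peer
         * so its RDMA READ/WRITE operations can name this buffer. */
        printf("registered %zu bytes: lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);
        return mr;
    }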

That's it for the basic concepts underlying InfiniBand. The InfiniBand Architecture brings a broad range of value to its users, ranging from ultra-low latency for clustering, to extreme flexibility in application deployment in the data center, to dramatic reductions in energy consumption, all at very low price/performance points. Figuring out which value propositions are relevant to which situations depends entirely on the intended use.

The next two chapters describe how these basic concepts map onto particular value propositions in two important market segments. Chapter 2 describes how InfiniBand relates to High Performance Computing (HPC) clusters, and Chapter 3 does the same thing for enterprise data centers. As we will see, InfiniBand's basic concepts can help solve a range of problems, regardless of whether you are trying to solve a sticky performance problem or your data center is suffering from severe space, power or cooling constraints. A wise choice of an appropriate interconnect can profoundly impact the solution.


Chapter 2 InfiniBand for HPC

In this chapter we will explore how InfiniBand's fundamental characteristics map onto addressing the problems being faced today in high performance computing environments.

High performance computing is not a monolithic field; it covers the gamut from the largest Top 500 class supercomputers down to small desktop clusters. For our purposes, we will generally categorize HPC as being that class of systems where nearly all of the available compute capacity is devoted to solving a single large problem for substantial periods of time. Looked at another way, HPC systems generally don't run traditional enterprise applications such as mail, accounting or productivity applications. Some examples of HPC applications include atmospheric modeling, genomics research, automotive crash test simulations, oil and gas extraction models and fluid dynamics.

HPC systems rely on a combination of high-performance storage and low-latency InterProcess Communication (IPC) to deliver performance and scalability to scientific applications. This chapter focuses on how the InfiniBand Architecture's characteristics make it uniquely suited for HPC. With respect to high performance computing, InfiniBand's low-latency/high-bandwidth performance and the nature of a channel architecture are crucially important.

Ultra-low latency for:
- Scalability
- Cluster performance

Channel I/O delivers:
- Scalable storage bandwidth performance
- Support for shared disk cluster file systems and parallel file systems

HPC clusters are dependent to some extent on low latency IPC for scalability and application performance. Since the processes comprising a cluster application are distributed across a number of cores, processors and servers, it is clear that a low-latency IPC interconnect is an important determinant of cluster scalability and performance.

In addition to demand for low-latency IPC, HPC systems frequently place


enormous demands on storage bandwidth, performance (measured in I/Os per second), scalability and capacity. As we will demonstrate in a later section, InfiniBand's channel I/O architecture is well suited to high-performance storage and especially parallel file systems.

InfiniBand support for HPC ranges from upper layer protocols tailored to support the Message Passing Interface (MPI) in HPC to support for large scale high-performance parallel file systems such as Lustre. MPI is the predominant messaging middleware found in HPC systems.

InfiniBand MPI Clusters

MPI is the de facto standard and dominant model in use today for communication in a parallel system. Although it is possible to run MPI on a shared memory system, the more common deployment is as the communication layer connecting the nodes of a cluster. Cluster performance and scalability are closely related to the rate at which messages can be exchanged between cluster nodes; the classic low latency requirement. This is native InfiniBand territory and its clear stock in trade.

MPI is sometimes described as communication middleware, but perhaps a better, more functional description would be to say that it provides a communication service to the distributed processes comprising the HPC application. To be effective, the MPI communication service relies on an underlying messaging service to conduct the actual communication messages from one node to another. The InfiniBand Architecture, together with its accompanying application interfaces, provides an RDMA messaging service to the MPI layer.

In the first chapter, we discussed direct application access to the messaging service as being one of the fundamental principles behind the InfiniBand Architecture; applications use this service to transfer messages between them. In the case of HPC, one can think of the processes comprising the HPC application as being the clients of the MPI communication service, and MPI, in turn, as the client of the underlying RDMA message transport service. Viewed this way, one immediately grasps the significance of InfiniBand's channel-based message passing architecture: its ability to enable its client, the MPI communications middleware in this case, to communicate among the nodes comprising the cluster without involvement of any CPUs in the cluster. InfiniBand's copy avoidance architecture and stack bypass deliver ultra-low application-to-application latency and high bandwidth coupled with extremely low CPU utilization.

MPI is easily implemented on InfiniBand by deploying one of a number of implementations which take full advantage of the InfiniBand Architecture.
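As a small illustration of MPI riding on the RDMA message transport, the ping-pong fragment below measures average round-trip time between two ranks. It is ordinary MPI code; when the MPI library is built with InfiniBand (verbs) support, these calls are carried over the RDMA messaging service with no change to the application. The message size and iteration count are arbitrary illustrative choices.

    /* Minimal MPI ping-pong between ranks 0 and 1; run with: mpirun -np 2 ./pingpong */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char buf[8] = "ping";
        const int iters = 1000;
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, sizeof(buf), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        if (rank == 0)
            printf("average round trip: %.2f microseconds\n",
                   (MPI_Wtime() - t0) / iters * 1e6);

        MPI_Finalize();
        return 0;
    }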


    Storage and HPC

An HPC application comprises a number of processes running in parallel. In an HPC cluster, these processes are distributed; the distribution could be across the cores comprising a processor chip, or across a number of processor chips comprising a server, or they could be distributed across a number of physical servers. This last model is the one most often thought of when thinking about HPC clusters.

Depending on the configuration and purpose for the cluster, it is often the case that each of the distributed processes comprising the application requires access to storage. As the cluster is scaled up (in terms of the number of processes comprising the application), the demand for storage bandwidth increases. In this section we'll explore two common HPC storage architectures, shared disk cluster file systems and parallel file systems, and discuss the role channel I/O plays. Other storage architectures, such as network attached storage, are also possible.

A shared disk cluster file system is exactly as its name implies and is illustrated in the diagram below; it permits the distributed processes comprising the cluster to share access to a common pool of storage. Typically in a shared disk file system, the file system is distributed among the servers hosting the distributed application, with an instance of the file system running on each processor and a side band communication channel between file system instances used for distributed lock management. By creating a series of independent channels, each instance of the distributed file system is given direct access to the shared disk storage system. The file system places a certain load on the storage system which is equal to the sum of the loads placed by each instance of the file system. By distributing this load across a series of unique channels, the channel I/O architecture provides benefits in terms of parallel accesses and bandwidth allocation.

Figure 3: Shared disk cluster file system. Each instance of the distributed file system has a unique connection to the elements of the storage system.


One final point: the performance of a shared disk file system, and hence the overall system, is dependent on the speed with which locks can be exchanged between file system instances. InfiniBand's ultra low latency characteristic allows the cluster to extract maximum performance from the storage system.

A parallel file system such as the Lustre file system is similar in some ways to a shared disk file system in that it is designed to satisfy storage demand from a number of clients in parallel, but with several distinct differences. Once again, the processes comprising the application are distributed across a number of servers. The file system, however, is sited together with the storage devices rather than being distributed. Each process comprising the distributed application is served by a relatively thin file system client.

Within the storage system, the control plane (file system metadata) is often separated from the data plane (object servers) containing user data. This promotes parallelism by reducing bottlenecks associated with accessing file system metadata, and it permits user data to be distributed across a number of storage servers which are devoted to strictly serving user data.

Figure 4: A parallel file system. A thin file system client co-located with each instance of the distributed application creates parallel connections to the metadata server(s) and the object servers. The large arrow represents the physical InfiniBand network.

Each file system client creates a unique channel to the file system metadata. This accelerates the metadata look-up process and allows the metadata server to service a fairly large number of accesses in parallel. In addition, each file system client creates unique channels to the object storage servers where user data is actually stored.

By creating unique connections between each file system client and the storage subsystem, a parallel file system is perfectly positioned to reap maximum advantage from InfiniBand's channel I/O architecture. The unique and persistent low latency connections to the metadata server(s) allow a high degree of scalability by eliminating the serialization of accesses sometimes


experienced in a distributed lock system. By distributing the data across an array of object servers and by providing each file system client with a unique connection to the object servers, a high degree of parallelism is created as multiple file system clients attempt to access objects. By creating unique and persistent low-latency connections between each file system client and the metadata servers, and between each file system client and the object storage servers, the Lustre file system delivers scalable performance.

Upper Layer Protocols for HPC

The concept of upper layer protocols (ULPs) is described in some detail in Chapter 4, Designing with InfiniBand. For now, it is sufficient to say that open source software to support the InfiniBand Architecture, and specifically focused on HPC, is readily available, for example from the OpenFabrics Alliance. There are ULPs available to support file system accesses such as the Lustre file system, and MPI implementations and a range of other APIs necessary to support a complete HPC system are also provided.


Chapter 3 InfiniBand for the Enterprise

The previous chapter addressed applications of InfiniBand in HPC and focused on the values an RDMA messaging service brings to that space: ultra-low application-to-application latency and extremely high bandwidth, all delivered at a cost of almost zero CPU utilization. As we will see in this chapter, InfiniBand brings an entirely different set of values to the enterprise. But those values are rooted in precisely the same set of underlying principles.

Of great interest to the enterprise are values such as risk reduction, investment protection, continuity of operations, ease of management, flexible use of resources, reduced capital and operating expenses and reduced power, cooling and floor space costs. Naturally, there are no hard rules here; there are certainly areas in the enterprise space where InfiniBand's performance (latency, bandwidth, CPU utilization) is of extreme importance. For example, certain applications in the financial services industry (such as arbitrage trading) demand the lowest possible application-to-application latencies at nearly any cost. Our purpose in this chapter is to help the enterprise user understand more clearly how the InfiniBand Architecture applies in this environment.

Reducing expenses associated with purchasing and operating a data center means being able to do more with less: fewer servers, fewer switches, fewer routers and fewer cables. Reducing the number of pieces of equipment in a data center leads to reductions in floor space and commensurate reductions in demand for power and cooling. In some environments, severe limitations on available floor space, power and cooling effectively throttle the data center's ability to meet the needs of a growing enterprise, so overcoming these obstacles can be crucial to an enterprise's growth. Even in environments where this is not the case, each piece of equipment in a data center demands a certain amount of maintenance and operating cost, so reducing the number of pieces of equipment leads to reductions in operating expenses. Aggregated across an enterprise, these savings are substantial. For all these reasons, enterprise data center owners are incented to focus on the efficient use of IT resources. As we will see shortly, appropriate application of InfiniBand can advance each of these goals.


Simply reducing the amount of equipment is a noble goal unto itself, but in an environment where workloads continue to increase, it becomes critical to do more with less. This is where the InfiniBand Architecture can help; it enables the enterprise data center to serve more clients with a broader range of applications with faster response times, while reducing the number of servers, cables and switches required. Sounds like magic.

Server virtualization is a strategy being successfully deployed now in an attempt to reduce the cost of server hardware in the data center. It does so by taking advantage of the fact that many data center servers today operate at very low levels of CPU utilization; it is not uncommon for data center servers to consume less than 15-20 percent of the available CPU cycles. By hosting a number of Virtual Machines (VMs) on each physical server, each of which hosts one or more instances of an enterprise application, virtualization can cleverly consume CPU cycles that had previously gone unused. This means that fewer physical servers are needed to support approximately the same application workload.

Even though server virtualization is an effective strategy for driving up consumption of previously unused CPU cycles, it does little to improve the effective utilization (efficiency) of the existing processor/memory complex. CPU cycles and memory bandwidth are still devoted to executing the network stack at the expense of application performance. Effective utilization, or efficiency, is the ratio of a server resource, such as CPU cycles or memory bandwidth, devoted directly to executing an application program compared to the use of server resources devoted to housekeeping tasks. Housekeeping tasks principally include services provided to the application by the operating system devoted to conducting routine network communication or storage activities on behalf of an application. The trick is to apply technologies that can fundamentally improve the server's overall effective utilization, without assuming unnecessary risk in deploying them. If successful, the data center owner will achieve much, much higher usage of the data center servers, switches and storage that were purchased. That is the subject of this chapter.

    Devoting Server Resources to Application Processing

Earlier we described how an application using InfiniBand's RDMA messaging service avoids the use of the operating system (using operating system bypass) in order to transfer messages between a sender and a receiver. This was described as a performance improvement, since avoiding operating system calls and unnecessary buffer copies significantly reduces latency. But operating system bypass produces another effect of much greater value to the enterprise data center.

Server architecture and design is an optimization process. At a given server price point, the server architect can afford to provide only a certain amount


of main memory; the important corollary is that there is a finite amount of memory bandwidth available to support application processing at that price point. As it turns out, main memory bandwidth is a key determinant of server performance. Here's why.

The precious memory bandwidth provided on a given server motherboard must be allocated between application processing and I/O processing. This is true because traditional I/O, such as TCP/IP networking, relies on main memory to create an anonymous buffer pool, which is integral to the operation of the network stack. Each inbound packet must be placed into an anonymous buffer and then copied out of the anonymous buffer and delivered to a user buffer in the application's virtual memory space. Thus, each inbound packet is copied at least twice. A single 10Gb/s Ethernet link, for example, can consume between 20Gb/s and 30Gb/s of memory bandwidth to support inbound network traffic.

Figure 5: Each inbound network packet consumes from two to three times the memory bandwidth, as the inbound packet is placed into an anonymous buffer by the NIC and then subsequently copied from the anonymous buffer into the application's virtual buffer space.

The load offered to the memory by network traffic is largely dependent on how much aggregate network traffic is created by the applications hosted on the server. As the number of processor sockets on the motherboard increases, and as the number of cores per socket increases, and as the number of execution threads per core increases, the offered network load increases approximately proportionally. And each increment of offered network load consumes two to three times as much memory bandwidth as the network's underlying signaling rate. For example, if an application consumes 10Gb/s of inbound network traffic (which translates into 20Gb/s of memory bandwidth consumed), then multiple instances of an application running on a multicore CPU will consume proportionally more network bandwidth and twice the memory bandwidth. And, of course, executing the TCP/IP network protocol stack consumes CPU processor cycles. The cycles consumed in executing the network protocol


and copying data out of the anonymous buffer pool and into the application's virtual buffer space are cycles that cannot be devoted to application processing; they are purely overhead required in order to provide the application with the data it requested.
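The arithmetic behind these numbers fits in a few lines. The sketch below simply multiplies an assumed inbound link rate by the number of times each byte crosses the memory bus; the rates, copy counts and instance count are illustrative assumptions, not measurements.

    /* Back-of-the-envelope: memory bandwidth consumed by inbound network I/O. */
    #include <stdio.h>

    int main(void)
    {
        double link_gbps  = 10.0;  /* inbound Ethernet traffic consumed by one app instance */
        double copies_min = 2.0;   /* NIC -> anonymous buffer, then buffer -> app buffer     */
        double copies_max = 3.0;   /* some paths touch main memory a third time              */
        int    instances  = 4;     /* application instances on a multicore server            */

        printf("one instance:  %.0f Gb/s inbound costs %.0f-%.0f Gb/s of memory bandwidth\n",
               link_gbps, link_gbps * copies_min, link_gbps * copies_max);
        printf("%d instances: roughly %.0f-%.0f Gb/s of memory bandwidth for I/O alone\n",
               instances, instances * link_gbps * copies_min,
               instances * link_gbps * copies_max);
        return 0;
    }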

Consider: when one contemplates the purchase of a server such as a server blade, the key factors that are typically weighed against the cost of the server are power consumption and capacity. Processor speed, number of cores, number of threads per core and memory capacity are the major contributors to both power consumption and server performance. So if you could reduce the number of CPUs required or the amount of memory bandwidth consumed, but without reducing the ability to execute user applications, significant savings can be realized in both capital expense and operating expense. If using the available CPU cycles and memory bandwidth more efficiently enables the data center to reduce the number of servers required, then capital and operating expense improvements multiply dramatically.

And this is exactly what the InfiniBand Architecture delivers; it dramatically improves the server's effective utilization (its efficiency) by dramatically reducing demand on the CPU to execute the network stack or copy data out of main memory, and it dramatically reduces I/O demand on memory bandwidth, making more memory bandwidth available for application processing. This allows the user to direct precious CPU and memory resources to useful application processing and allows the user to reap the full capacity of the server for which the user has paid.

Again, server virtualization increases the number of applications that a server can host by time slicing the server hardware among the hosted virtual machines. Channel I/O decreases the amount of time the server devotes to housekeeping tasks and therefore increases the amount of time that the server can devote to application processing. It does this by lifting from the CPU and memory the burden of executing I/O and copying data.

    A Flexible Server Architecture

Most modern servers touch at least two different I/O fabrics: an Ethernet network and a storage fabric, commonly Fibre Channel. In some cases there is a third fabric dedicated to IPC. For performance and scalability reasons, a server often touches several instances of each type of fabric. Each point of contact between a server and one of the associated fabrics is implemented with a piece of hardware, either a Network Interface Card (NIC) for networking or a Host Bus Adapter (HBA) for storage. The server's total I/O bandwidth is the sum of the bandwidths available through these different fabrics. Each of these fabrics places a certain load both on memory bandwidth and CPU resources.

The allocation of the total available I/O bandwidth between storage and networking is established at the time the server is provisioned with I/O hardware.


This process begins during motherboard manufacture in the case of NICs soldered directly to the motherboard (LAN on Motherboard, LOM) and continues when the server is provisioned with a complement of PCI Express-attached Host Bus Adapters (HBAs, used for storage) and NICs. Even though the complement of PCI Express adapters can be changed, the overall costs of doing so are prohibitive; so by and large the allocation of I/O bandwidth between storage and networking is essentially fixed by the time the server has been installed.

Figure 6: Memory bandwidth is divided between application processing and I/O. I/O bandwidth allocation between storage, networking and IPC is predetermined by the physical hardware, the NICs and HBAs, in the server.

The ideal ratio of storage bandwidth to networking bandwidth is solely a function of the application(s) being hosted by the server; each different enterprise application will present a different characteristic workload on the I/O subsystem. For example, a database application may create a high disk storage workload and a relatively modest networking workload as it receives queries from its clients and returns results. Hence, if the mix of enterprise applications hosted by a given server changes, its ideal allocation of I/O bandwidth between networking and storage also changes. This is especially true in virtualized servers, where each physical server hosts a number of application containers (Virtual Machines, VMs), each of which may be presenting a different characteristic workload to the I/O subsystem.

This creates a bit of a conundrum for the IT manager. Does the IT manager over-provision each server with a maximum complement of storage and networking devices, and in what proportions? Does the IT manager lock a given application, or application mix, down to a given server and provision that server


with the idealized I/O complement of storage and networking resources? What happens to the server as the workload it is hosting is changed? Will it still have proportional access to both networking and storage fabrics?

In a later chapter we introduce the notion of a unified fabric as being a fabric that is optimal for storage, networking and IPC. An InfiniBand fabric has the characteristics that make it well suited for use as a unified fabric; for example, its message orientation and its ability to efficiently handle the small message sizes that are often found in network applications, as well as the large message sizes that are common in storage. In addition, the InfiniBand Architecture contains features such as a virtual lane architecture specifically designed to support various types of traffic simultaneously.

Carrying storage and networking traffic efficiently over a single InfiniBand network using the RDMA message transport service means that the total number of I/O adapters required (HBAs and NICs) can be radically reduced. This is a significant saving, and even more significant when one factors in the 40Gb/s data rates of commonly available commodity InfiniBand components versus leading edge 10Gb/s Ethernet. For our purposes here, we are ignoring the approximately 4x improvement in price/performance provided by InfiniBand.

But far more important than the simple cost savings associated with the reduction in the number of required adapter types is this: by using an InfiniBand network, a single type of resource (HCAs, buffers, queue pairs and so on) is used for both networking and storage. Thus the balancing act between networking and storage is eliminated, meaning that the server will always have the correct proportion of networking and storage resources available to it, even as the mix of applications changes and the characteristic workload offered by the applications changes!

This is a profound observation; it means that it is no longer necessary to overprovision a server with networking and storage resources in anticipation of an unknown application workload; the server will always be naturally provisioned with I/O resources.

    Investment Protection

Enterprise data centers typically represent a significant investment in equipment and software accumulated over time. Only rarely does an opportunity arise to install a new physical plant and accompanying control and management software. Accordingly, valuable new technologies must be introduced to the data center in such a way as to provide maximum investment protection and to minimize the risk of disruption to existing applications. For example, a data center may contain a large investment in SANs for storage which it would be impractical to simply replace. In this section we explore some methods to allow the data center's servers to take advantage of InfiniBand's messaging service while protecting existing investments.


Consider the case of a rack of servers, or a chassis of bladed servers. As usual, each server in the rack requires connection to up to three types of fabrics, and in many cases, multiple instances of each type of fabric. There may be one Ethernet network dedicated to migrating virtual machines, another Ethernet network dedicated to performing backups and yet another one for connecting to the internet. Connecting to all these networks requires multiple NICs, HBAs, cables and switch ports.

Earlier we described the advantages that accrue in the data center by reducing the types of I/O adapters in any given server. These included reductions in the number and types of cables, switches and routers, and improvements in the server's ability to host changing workloads because it is no longer limited by the particular I/O adapters that happen to be installed.

Nevertheless, a server may still require access to an existing centrally managed pool of storage, such as a Fibre Channel SAN, which requires that each server be equipped with a Fibre Channel HBA. On the face of it, this is at odds with the notion of reducing the number of I/O adapters. The solution is to replace the FC HBA with a virtual HBA driver. Instead of using its own dedicated FC HBA, the server uses a virtual device adapter to access a physical Fibre Channel adapter located in an external I/O chassis. The I/O chassis provides a pool of HBAs which are shared among the servers. The servers access the pool of HBAs over an InfiniBand network.

The set of physical adapters thus eliminated from each server is replaced by a single physical InfiniBand HCA and a corresponding set of virtual device adapters (VNICs and VHBAs). By so doing, the servers gain the flexibility afforded by the one fat pipe, benefit from the reduction in HBAs and NICs, and retain the ability to access the existing infrastructure, thus preserving the investment.

A typical server rack contains a so-called top-of-rack switch, which is used to aggregate all of a given class of traffic from within the rack. This is an ideal place to site an I/O chassis, which thus provides the entire rack with virtualized access to data center storage resources, with access to the data center network, and with access to an IPC service for communicating with servers located in other racks.

Cloud Computing - An Emerging Data Center Model

Cloud computing has emerged as an interesting variant on enterprise computing. Although widespread adoption has only recently occurred, the fundamental concepts beneath the idea of a cloud have been under discussion for quite a long period under different names such as utility computing, grid computing and agile or flexible computing. Although cloud computing is marketed and expressed in a number of ways, at its heart it is an expression of the idea of flexible allocation of compute resources (including processing, network and


storage) to support an ever-shifting pool of applications. In its fullest realization, the cloud represents a pool of resources capable of hosting a vast array of applications, the composition of which is regularly shifting. This is dramatically different from the classical enterprise data center, in which the applications hosted by the data center are often closely associated with, and in some cases statically assigned to, a specific set of hardware and software resources: a server. For example, the classical three-tier data center is an expression of this relatively static allocation of servers to applications.

Given the above discussion concerning the flexible allocation of I/O resources to applications, it is immediately clear that InfiniBand's ability to transport messages on behalf of any application is a strong contributor to the idea that applications must not be enslaved to the specific I/O architecture of the underlying server on which the application is hosted. Indeed, the cloud computing model is the very embodiment of the idea of flexible computing: the ability to flexibly allocate processors, memory, networking and storage resources to support application execution anywhere within the cloud. By using the set of Upper Layer Protocols described briefly in the first chapter, the application is provided with a simple way to use the InfiniBand messaging service. By doing so, the application's dependence on a particular mix of storage and networking I/O is effectively eliminated. Thus, the operator of a cloud infrastructure gains an unprecedented level of flexibility in hosting applications capable of executing on any available compute resource in the cloud.

One last point should be made before we leave behind the topic of cloud computing. Cloud infrastructures, as they are being deployed today, represent some of the highest densities yet seen in the computing industry. Cloud systems are often delivered in container-like boxes that push the limits of compute density, storage density, power and cooling density in a confined physical space. In that way, they share the same set of obstacles found in some of the highest density commercial data centers, such as those found in the financial services industry in Manhattan. These data centers face many of the same stifling constraints on available power, cooling and floor space.

We mentioned in an earlier section that enterprise data centers are often constrained in their ability to adopt relevant new technologies, since they must balance the advantages gained by deploying a new technology against requirements to protect investments in existing technologies. In that way, many of the cloud computing infrastructures now coming online are very different from their brethren in other commercial enterprises. Because of the modular way that modern cloud infrastructures are assembled and deployed, the opportunity does exist for simple adoption of a vastly improved I/O infrastructure like InfiniBand. For example, a typical cloud infrastructure might be composed of a series of self-contained containers resembling the familiar shipping container seen on trucks. This container, sometimes called a POD, contains racks,


servers, switches, networks and infrastructure. They are designed to be rolled in, connected to a power and cooling source, and be operational right away. A POD might comprise a complete, self-contained data center, or it may represent a granular unit of expansion in a larger cloud installation.

By deploying InfiniBand as the network internal to the container, and by taking full advantage of its messaging service, the container manufacturer can deliver to his or her client the densest possible configuration, presenting the highest effective utilization of compute, power and cooling resources possible. Such a container takes full advantage of InfiniBand's complete set of value propositions:

• Its low latency and extremely high bandwidth remove bottlenecks that prevent applications from delivering maximum achievable performance.

• The ease and simplicity with which an application directly uses InfiniBand's messaging service means that precious CPU cycles and, more importantly, memory bandwidth are devoted to application processing and not to housekeeping I/O chores. This drives up the effective utilization of available resources. Thus, the overall return on investment is higher, and the delivered performance per cubic foot of floor space and per watt of power consumption is higher.

• By providing each server with one fat pipe, the server platform gains an intrinsic guarantee that its mixture of storage I/O, networking I/O and IPC will always be perfectly balanced.

• InfiniBand's efficient management of the network topology offers excellent utilization of the available network connectivity. Because InfiniBand deploys a centralized subnet manager, switch forwarding tables are programmed in such a way that routes are balanced and available links are well used. This efficient topology handling scales to thousands of nodes and thus satisfies the needs of large cloud installations.

• InfiniBand's native, commodity bandwidths (currently 40Gb/s) are nearly four times higher than 10Gb/s Ethernet, at prices that have historically been extremely competitive compared to commodity 1Gb/s or 10Gb/s Ethernet solutions. Since each InfiniBand cable carries nearly four times the bandwidth of a corresponding 10Gb/s Ethernet cable, the number of connectors, cables, switches and so on that are required is dramatically reduced.

• And finally, the simplified communication service provided by an InfiniBand network inside the container simplifies the deployment of applications on the pool of computing, storage and networking resources provided by the container, and reduces the burden of migrating applications onto available resources.


physically copied to the local site. This is because enterprises rely on a file transfer protocol to replicate data from site to site. The time delay to accomplish the transfer can often be measured in minutes or hours, depending on file sizes, the capacity of the underlying network and the performance of the storage system. This poses a difficult problem in synchronizing storage between sites. Suppose, instead of copying files from the remote file system to the local file system, that the remote file system could simply be mounted to the local user's workstation in exactly the same way that the local file system is mounted? This is exactly the model enabled by executing InfiniBand over the WAN. Over long-haul connections with relatively long flight times, the InfiniBand protocol offers extremely high bandwidth efficiency across distances that cause TCP/IP to stumble.

Figure 7: An enterprise application reads data through a storage client; the storage client connects to each storage server via RDMA; connections to remote storage servers are made via the WAN; thus, the user has direct access to all data stored anywhere on the system. (The original figure shows a user application and its buffer on a storage client, connected both to local storage and, across the WAN, to remote storage.)

It turns out that the performance of any windowing protocol is a function of the so-called bandwidth-delay product; this product represents the maximum amount of data that can be stored in transit on the WAN at any given moment. Once the limit of the congestion window has been reached, the transport must stop transmitting until some of the bytes in the pipe can be drained at the far end. In order to achieve maximum utilization of the available WAN bandwidth, the transport's congestion window must be sufficiently large to keep the wire continuously full. As the wire length increases, or as the bandwidth on the wire increases, so does the bandwidth-delay product. And hence, the congestion window that is basic to any congestion-managed network such as InfiniBand or TCP/IP must grow very large in order to support the large bandwidth-delay products commonly found on high speed WAN links. Because of the fundamental architecture of the InfiniBand transport, it is easily able to support these exceedingly large bandwidth-delay products.
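To give a sense of scale, the short sketch below computes the bandwidth-delay product for two illustrative links. The 50 ms round-trip time and the 10Gb/s and 40Gb/s line rates are assumptions chosen purely for illustration, not figures drawn from any particular deployment.

/* Back-of-the-envelope sketch: how much data must be "in flight" to keep a
 * long-haul link full?  The link rates and the 50 ms round-trip time below
 * are illustrative assumptions only. */
#include <stdio.h>

int main(void)
{
    const double rtt_seconds = 0.050;            /* assumed 50 ms round trip */
    const double rates_gbps[] = { 10.0, 40.0 };  /* 10GbE vs. 40Gb/s InfiniBand */

    for (int i = 0; i < 2; i++) {
        /* bandwidth-delay product = line rate (bytes/s) * round-trip time */
        double bdp_bytes = (rates_gbps[i] * 1e9 / 8.0) * rtt_seconds;
        printf("%4.0f Gb/s link, %.0f ms RTT: window must hold ~%.1f MB\n",
               rates_gbps[i], rtt_seconds * 1e3, bdp_bytes / 1e6);
    }
    return 0;
}

At 40Gb/s and a 50 ms round trip this works out to roughly 250 MB of unacknowledged data, which is why a transport whose window cannot grow to that size leaves the wire idle.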

On the other hand, lossy flow control protocols such as TCP degrade rapidly as the bandwidth-delay product increases, causing turbulent rather


than smooth packet flow. These effects cause a rapid roll-off in achieved bandwidth performance as link distance or wire bandwidth increases. These effects are already pronounced at 10Gbits/s, and become much worse at 40 or 100Gbits/s and beyond.

Because the InfiniBand transport, when run over the WAN, is able to effectively manage the bandwidth-delay product of these long links, it becomes possible to actually mount the remote file system directly to the local user's workstation and hence to deliver performance approximately equivalent to that which the local user would experience when accessing a local file system. Of course, speed-of-light delays are an unavoidable byproduct of access over the WAN, but for many applications the improvement in response time from minutes or hours to milliseconds or seconds represents an easy tradeoff.

Long distance InfiniBand links may span Campus or Metro Area Networks or even Global WANs. Over shorter optical networks, applications directly benefit from InfiniBand's low latency, low jitter, low CPU loading and high throughput. Examples of such applications include:

• Data Center Expansion - spill an InfiniBand network from a primary data center to nearby buildings using multiple optical InfiniBand connections. This approach causes no disruption to data center traffic, is transparent to applications, and adds so little latency that operations can ignore it.

• Campus System Area Networks - InfiniBand makes an excellent System Area Network, carrying storage, IPC and Internet traffic over its unified fabric. Campus Area InfiniBand links allow islands of compute, storage and I/O devices, such as high-resolution display walls, to be federated and shared across departments. Location-agnostic access performance greatly assists in the efficient utilization of organization-wide resources.


Chapter 4 - Designing with InfiniBand

As with any network architecture, the InfiniBand Architecture provides a service; in the case of InfiniBand it provides a messaging service. That message service can be used to transport IPC messages, to transport control and data messages for storage, or to transport messages associated with any of a range of other usages. This makes InfiniBand different from a traditional TCP/IP/Ethernet network, which provides a byte stream transport service, or Fibre Channel, which provides a transport service specific to the Fibre Channel wire protocol. By application, we mean, for example, a user application using the sockets interface to conduct IPC, or a kernel-level file system application executing SCSI block-level commands to a remote block storage device.

The key feature that sets InfiniBand apart from other network technologies is that InfiniBand is designed to provide message transport services directly to the application layer, whether it is a user application or a kernel application. This is very different from a traditional network, which requires the application to solicit assistance from the operating system to transfer a stream of bytes.

    to solicit assistance rom the operating system to transer a stream o bytes.

    At rst blush, one might think that such application level access requires the

    application to ully understand the networking and I/O protocols provided by

    the underlying network. Nothing could be urther rom the truth.The top layer o the InniBand Architecture denes asotware transport

    interace; this section o the specication denes the methods that an applica-

    tion uses to access the complete set o services provided by InniBand.

An application accesses InfiniBand's message transport service by posting a Work Request (WR) to a work queue. As described earlier, the work queue is part of a structure called a Queue Pair (QP), representing the endpoint of a channel connecting the application with another application. A QP contains two work queues, a Send Queue and a Receive Queue, hence the expression QP. A WR, once placed on the work queue, is in effect an instruction to the InfiniBand RDMA message transport to transmit a message on the channel, or to perform some sort of control or housekeeping function on the channel.

The methods used by an application to interact with the InfiniBand transport are called verbs and are defined in the software transport interface section of the specification. An application uses a POST SEND verb, for example, to


request that the InfiniBand transport send a message on the given channel. This is what is meant when we say that the InfiniBand network stack goes all the way up to the application layer. It provides methods that are used directly by an application to request service from the InfiniBand transport. This is very different from a traditional network, which requires the application to solicit assistance from the operating system in order to access the server's network services.

Figure 8: The layers between an application and the IB transport. The application (the consumer of InfiniBand message services) posts work requests to a queue; each work request represents a message, a unit of work. The channel interface (verbs) allows the consumer to request services. The InfiniBand channel interface provider maintains the work queues, manages address translation and provides completion and event mechanisms. The IB transport packetizes messages, provides the transport services (reliable/unreliable, connected/unconnected, datagram), implements the RDMA protocol (send/receive, RDMA read/write, atomics) and assures reliable end-to-end delivery.
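As a concrete illustration of the POST SEND semantic described above, the sketch below uses the libibverbs API distributed by the OpenFabrics Alliance, which is one common realization of the verbs. It assumes a queue pair, a registered memory region and a buffer have already been set up (those steps, along with completion handling, are omitted), and the helper name post_one_send is our own invention for this sketch, not part of any specification.

/* Minimal sketch (not a complete program): posting one send Work Request
 * onto a QP's Send Queue using the OpenFabrics libibverbs API.
 * Connection setup, memory registration and completion polling are omitted. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stddef.h>

int post_one_send(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, size_t len)
{
    /* Scatter/gather element describing the registered buffer to transmit. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,          /* local key from memory registration */
    };

    /* The Work Request: "send this message on the channel". */
    struct ibv_send_wr wr = {
        .wr_id      = 0x1234,        /* opaque id returned in the completion */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,   /* the POST SEND channel semantic */
        .send_flags = IBV_SEND_SIGNALED,
    };

    struct ibv_send_wr *bad_wr = NULL;
    return ibv_post_send(qp, &wr, &bad_wr);   /* returns 0 on success */
}

The application would later reap a completion entry carrying the same wr_id, which is how it learns that the message has been handed to the transport.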

Building on Top of the Verbs

The verbs, as defined, are sufficient to enable an application developer to design and code an application for use directly on top of an InfiniBand network. But if the application developer were to do so, he or she would need to create a set of customized application interface code, at least one set for each operating system to be supported. In a world of specialized applications, such as high performance computing, where the last ounce of performance is important, such an approach makes a great deal of sense: the range of applications in question is fairly well defined, and the variety of systems on which such applications are deployed is similarly constrained. But in a world of more broadly distributed applications and operating systems, it may be prohibitive for the application or hardware developer, and the profusion of application interfaces may be an obstacle inhibiting an IT manager from adopting a given application or technology.

    In more broadly deployed systems, an application accesses system services


via a set of APIs often supplied as part of the operating system.

InfiniBand is an industry-standard interconnect and is not tied to any given operating system or server platform. For this reason, it does not specify any APIs directly. Instead, the InfiniBand verbs fully specify the required behaviors of the APIs. In effect, the verbs represent a semantic definition of the application interface, specifying the methods and mechanisms that an application uses to control the network and to access its services. Since the verbs specify a set of required behaviors, they are sufficient to support the development of OS-specific APIs and thus ensure interoperability between, for example, an application executing on a Windows platform and an application executing on a Linux platform. The verbs ensure cross-platform and cross-vendor interoperability. APIs can be tested both for conformance to the verbs specification and for interoperability, allowing different vendors of InfiniBand solutions to assure their customers that the devices and software will operate in conformance to the specification and will interoperate with solutions provided by other vendors.

However, a semantic description of a required behavior is of little actual use without an accompanying set of APIs to implement the required behavior. Fortunately, other organizations do provide APIs, even if the InfiniBand Trade Association does not. For example, the OpenFabrics Alliance and other commercial providers supply a complete set of software stacks which are both necessary and sufficient to deploy an InfiniBand-based system. The software stack available from the OFA, as the name suggests, is open source and freely downloadable under both BSD and GNU licenses. The OFA is broadly supported by providers of InfiniBand hardware, by major operating system suppliers, by suppliers of middleware, and by the Linux open source community. All of these organizations, and many others, contribute open source code to the OFA repository, where it undergoes testing and refinement and is made generally available to the community at large.

But even supplying APIs is not, by itself, sufficient to deploy InfiniBand. Broadly speaking, the software elements required to implement InfiniBand fall into three major areas: the set of Upper Layer Protocols (ULPs) and associated libraries; a set of mid-layer functions that are helpful in configuring and managing the underlying InfiniBand fabric and that provide needed services to the upper layer protocols; and a set of hardware-specific device drivers. Some of the mid-layer functions are described in a little more detail in the chapter describing the InfiniBand Management Architecture. All of these components are freely available through open source for easy download, installation and use. They can be downloaded directly from the OpenFabrics Alliance or, in the case of Linux, acquired as part of the standard distributions provided by a number of Linux distributors. The complete set of software components provided by the OpenFabrics Alliance is formally known

  • 8/2/2019 Intro to IB for End Users

    34/54

    [ 34 ]

    INTRO TO INFINIBAND FOR END USERS

as the OpenFabrics Enterprise Distribution (OFED).
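As a small, hedged example of what the OFED software looks like from an application's point of view, the sketch below simply enumerates the RDMA-capable devices visible through the libibverbs library; it assumes an OFED (or distribution-supplied) verbs stack is installed and does nothing beyond listing device names.

/* Sketch: enumerate RDMA devices through the OFED libibverbs library.
 * Assumes the OFED (or distribution) verbs stack is installed. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num; i++)
        printf("found RDMA device: %s\n", ibv_get_device_name(devs[i]));

    ibv_free_device_list(devs);
    return 0;
}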

There is an important point associated with the ULPs that bears further mention. This chapter began by describing InfiniBand as a message transport service. As described above, in some circumstances it is entirely practical to design an application to access the message transport service directly by coding the application to conform to the InfiniBand verbs. Purpose-built HPC applications, for example, often do this in order to gain the full measure of performance. But doing so requires the application, and the application developer, to be RDMA aware. For example, the application would need to understand the sequence of verbs used to create a queue pair, to establish a connection, to send and receive messages over the connection, and so forth. Most commercial applications, however, are designed to take advantage of a standard networking or storage subsystem by leveraging existing, industry-standard APIs. That does not mean that such an application cannot benefit from the performance and flexibility advantages provided by InfiniBand. The set of ULPs provided through OFED allows an existing application to take easy advantage of InfiniBand. Each ULP can be thought of as having two interfaces: an upward-facing interface presented to the application, and a downward-facing interface connecting to the underlying InfiniBand messaging service via the queue pairs defined in the software transport interface. At its upward-facing interface, the ULP presents a standard interface easily recognizable by the application. For example, the SCSI RDMA Protocol ULP (SRP) presents a standard SCSI class driver interface to the file system above it. By providing the advantages of an RDMA network, the ULP allows a SCSI file system to reap the full benefits of InfiniBand. A rich set of ULPs is provided by the OpenFabrics Alliance, including:

• SDP - Sockets Direct Protocol. This ULP allows a sockets application to take advantage of an InfiniBand network with no change to the application.

• SRP - SCSI RDMA Protocol. This allows a SCSI file system to directly connect to a remote block storage chassis using RDMA semantics. Again, there is no impact to the file system itself.

• iSER - iSCSI Extensions for RDMA. iSCSI is a protocol allowing a block storage file system to access a block storage device over a generic network. iSER allows the user to operate the iSCSI protocol over an RDMA-capable network.

• IPoIB - IP over InfiniBand. This important part of the suite of ULPs allows an application hosted in, for example, an InfiniBand-based network to communicate with other sources outside the InfiniBand network using standard IP semantics. Although often used to transport TCP/IP over an InfiniBand network, the IPoIB ULP can be used to transport any of the suite of IP protocols, including UDP, SCTP and others.


• NFS-RDMA - Network File System over RDMA. NFS is a well-known and widely-deployed file system providing file-level I/O (as opposed to block-level I/O) over a conventional TCP/IP network. This enables easy file sharing. NFS-RDMA extends the protocol and enables it to take full advantage of the high bandwidth and parallelism provided naturally by InfiniBand.

• Lustre support - Lustre is a parallel file system enabling, for example, a set of clients hosted on a number of servers to access the data store in parallel. It does this by taking advantage of InfiniBand's Channel I/O architecture, allowing each client to establish an independent, protected channel between itself and the Lustre Metadata Servers (MDS) and associated Object Storage Servers and Targets (OSS, OST).

• RDS - Reliable Datagram Sockets offers a Berkeley sockets API allowing messages to be sent to multiple destinations from a single socket. This ULP, originally developed by Oracle, is ideally suited to allowing database systems to take full advantage of the parallelism and low latency characteristics of InfiniBand.

• MPI - The MPI ULP for HPC clusters provides full support for MPI function calls (a minimal illustration follows this list).
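To illustrate the point that an MPI application never touches the verbs directly, the sketch below is an ordinary two-rank MPI exchange; whether the bytes travel over InfiniBand is decided entirely by the MPI library and its configuration underneath, not by this code. It is an illustrative example, not taken from the text.

/* Illustrative two-rank MPI exchange.  The application uses only standard MPI
 * calls; the MPI library decides whether the message moves over InfiniBand. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int payload = 42;
    if (rank == 0) {
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", payload);
    }

    MPI_Finalize();
    return 0;
}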

Figure 9: The Upper Layer Protocols sit between the applications and the InfiniBand verbs interface. Sockets applications, block storage, IP-based applications, clustered database access and file systems each use a corresponding ULP (VNIC, SDP, IPoIB, iSER, RDS, NFS-RDMA, MPI, SRP and cluster file systems), all of which run over the InfiniBand verbs interface, the InfiniBand RDMA message transport service and the InfiniBand fabric interconnect.

There is one more important corollary to this. As described above, a traditional API provides access to a particular kind of service, such as a network service or a storage service. These services are usually provided over a specific type of network or interconnect. The sockets API, for example, gives an application access to a network service and is closely coupled with an underlying TCP/IP network. Similarly, a block-level file system is closely coupled with a


purpose-built storage interconnect such as Fibre Channel. This is why data center networks today are commonly thought of as consisting of as many as three separate interconnects: one for networking, one for storage and possibly a third for IPC. Each type of interconnect consists of its own physical plant (cables, switches, NICs or HBAs and so on), its own protocol stack and, most importantly, its own set of application-level APIs.

This is where the InfiniBand Architecture is different: the set of ULPs provides a series of interfaces that an application can use to access a particular kind of service (storage service, networking service, IPC service or others). However, each of those services is provided using a single underlying network. At their lower interfaces, the ULPs are bound to a single InfiniBand network. There is no need for multiple underlying networks coupled to each type of application interface. In effect, the set of ULPs gives an application access to a single unified fabric, completely hiding the details associated with a unified fabric, while avoiding the need for multiple data center networks.


Chapter 5 - InfiniBand Architecture and Features

Chapter 1 provided a high-level overview of the key concepts underpinning InfiniBand. Some of these concepts include channel I/O (described as a connection between two virtual address spaces), a channel endpoint called a queue pair (QP), and direct application-level access to a QP via virtual-to-physical memory address mapping. A detailed description of the architecture and its accompanying features is the province of an entire multi-volume book and is well outside the scope of this minibook. This chapter is designed to deepen your understanding of the architecture with an eye toward understanding how InfiniBand's key concepts are achieved and how they are applied.

Volume 1 of the InfiniBand Architecture specification co