
IPU-M2000 DIRECT ATTACH

EARLY ACCESS

Build and test guide


Table of contents

Overview ............................................................................................. 4

1.1 Acronyms and abbreviations ............................................................................ 4

1.2 System summary ............................................................................................... 5

The IPU-M2000 direct attach kits ........................................................ 8

2.1 IPU-M2000 overview ........................................................................................ 9

2.2 IPU-M2000 direct attach configurations ........................................................ 10

2.3 System pre-requisites ..................................................................................... 11

Physical mounting ............................................................................. 12

3.1 Preparing the rack ........................................................................................... 12

3.1.1 Unit orientation 12

3.1.2 Rack preparation 12

3.1.3 Adjusting front and rear vertical rails 12

3.1.4 Installing the IPU-M2000 rails 13

3.2 Installing the equipment ................................................................................. 16

3.2.1 Installing the IPU-M2000s 16

3.2.2 Installing the Dell R6525 server 20

3.2.3 Installing the PDUs 21

3.3 Cabling the system .......................................................................................... 22

3.3.1 IPU-M2000 to IPU-M2000 IPU-Link connectivity (OSFP) 23

3.3.2 IPU-M2000 to IPU-M2000 Sync-Link cabling (Cat5e) 25

3.3.3 IPU-M2000 to IPU-M2000 management cabling (Cat5e) 26

3.3.4 IPU-M2000 to server management cabling (Cat5e) 27

3.3.5 IPU-M2000 to server data cabling (QSFP) 28

3.4 Power cabling .................................................................................................. 30

3.5 Completing the rack ........................................................................................ 30

3.5.1 Blanking panels 30

3.5.2 Rack completion 30

Server configuration .......................................................................... 31

4.1 Server hardware and storage ......................................................................... 31

4.1.1 Hardware requirements 31

4.1.2 Storage configuration 31

4.1.3 BIOS configuration 32

4.2 Operating system and packages ..................................................................... 32

4.2.1 Ubuntu OS installation and packages 32

4.2.2 CentOS OS installation and packages 33

4.2.3 Python packages 34


4.3 User accounts and groups............................................................................... 35

4.4 Network interfaces ......................................................................................... 36

4.4.1 Overview 36

4.4.2 IPU-M2000 direct attach network interfaces 37

4.4.3 Server network configuration 38

4.5 Services ........................................................................................................... 41

4.5.1 DHCP (Dynamic Host Configuration Protocol) 41

4.5.2 NTP (Network Time Protocol) 44

4.5.3 Syslog 45

IPU-M2000 software installation ....................................................... 46

5.1 V-IPU server installation ................................................................................. 46

5.1.1 Interactive installation 47

5.1.2 Batch installation 47

5.1.3 V-IPU socket 48

5.2 IPU-M2000 system software installation ....................................................... 49

5.2.1 Download latest IPU-M2000 system software 49

5.2.2 Install and configure rack_tool 49

5.3 IPU-M2000s system software validation ........................................................ 51

5.3.1 Verify V-IPU server and IPU-M2000 version compatibility 51

5.3.2 Software upgrade of IPU-M2000 units 53

5.4 rack_tool ................................................................................................... 54

5.4.1 Synopsis 55

5.4.2 Options 56

5.4.3 Commands 57

5.4.4 Exit status 57

5.4.5 Files and directories 58

System testing ................................................................................... 60

6.1 Running system BISTs ..................................................................................... 60

6.2 Troubleshooting .............................................................................................. 60

6.2.1 Rack_tool testing 60

6.2.2 V-IPU testing 60

Revision history ................................................................................. 65

Legal notices ..................................................................................... 66


Overview

This guide is for properly trained service personnel and technicians who are required to install

the IPU-M2000 direct attach system.

Warning: Only qualified personnel should install, service, or replace the equipment

described in this document.

Note

The IPU-M2000 direct attach (early access) system provides integrators

with the opportunity to get experience with the IPU-M2000 direct attach

system ahead of the full product launch. Operation of the configured

system will be the same as in the final product. Additional manual

configuration steps are required in the early access system and are

described in this document. It is noted throughout the document where

these steps will not be required in the final system. Hardware provided for

the construction of early access systems is fully qualified: the additional

functionality of the final product is provided by updates to the firmware

and software bundles.

1.1 Acronyms and abbreviations

This is a short list that describes some of the most commonly used terms in this document.

BMC Baseboard Management Controller – standby power domain

service processor providing system hardware management

BOM Bill of Materials

EA Early access

GW Short for IPU-Gateway, a device that disaggregates the server

and the four IPUs in the IPU-M2000 across a RoCE network,

provides external IPU Exchange Memory, and enables IPU-Link

scaleout across 100GbE (IPU-GW-link) for rack-to-rack

connectivity

GCD A graph compile domain operated by a single Poplar Instance

within the system, either within a single IPU-M2000 unit or

within several units connected by IPU-Link cables

IPU-Link High speed communication links that interconnect IPUs within

and between IPU-M2000 units in a GCD. Special cables are

required for IPU-Links between IPU-M2000 units

PDU Power Distribution Unit

RDMA Remote DMA

RNIC RDMA Network Interface Controller

RoCE RDMA over converged Ethernet


1.2 System summary

The IPU-M2000 is a 1 rack unit (RU) compute platform delivering 1 petaFLOPS (FP16.16) of AI compute. It contains 4 Colossus GC200 IPUs with 3.6GB In-Processor-Memory™ and is pre-configured with 128GB (2x64GB) Streaming Memory™, 1x 100GbE RoCEv2 NIC card for host server connectivity and 1TB of NVMe M.2 SSD. In addition, the IPU-M2000 has connectors for the IPU-Fabric™ that provide high speed interfaces (total 2.8Tbps) for connecting to other IPU-M2000 units.

An installed and fully operational IPU-M2000 direct attach system will consist of:

• A customer provided host server

• The IPU-M2000 direct attach AI compute platform

o Configuration 1: IPU-M2000 direct attach x1 (single IPU-M2000, 4 IPUs)

o Configuration 2: IPU-M2000 direct attach x4 (four IPU-M2000, 16 IPUs)

o Additional options to be announced

• Pre-installed and configured Virtual IPU (V-IPU) management software with embedded

management through a Web UI that offers easy installation and integration with pre-

existing infrastructure

• The Graphcore Poplar SDK software stack to be downloaded and installed on the host

server


An example IPU-M2000 direct attach x1 system is illustrated in the diagram below:

Note

In the IPU-M2000 direct attach (early access) system, the V-IPU

management software is downloaded and installed onto the host server

and is not contained in the IPU-M2000 unit.

All the IPU-M2000 direct attach configuration options are fully supported by Graphcore’s

Poplar® software development environment, providing a complete scalable platform for

accelerated development. Existing ML frameworks such as TensorFlow, ONNX, and PyTorch

are fully supported as well as industry standard converged infrastructure management tools

including Open BMC, Redfish, Docker containers, and orchestration with Slurm and

Kubernetes. The PopVision™ visualisation and analysis tools provide monitoring of

performance across one or more IPUs - the graphical analysis enables detailed inspection of all

processing activities.

See the “IPU-M2000 Direct Attach Getting Started Guide” and the “Poplar and PopLibs User

Guide” on the documentation page (https://docs.graphcore.ai/) for details of Poplar

installation and use.


Pictures of a complete IPU-M2000 direct attach x4 system are shown below:

Front view (cold aisle)

Rear view (hot aisle)

Note Cable colours may differ from those supplied in the kits.


The IPU-M2000 direct attach kits

The IPU-M2000 units required to build IPU-M2000 direct attach systems can be provided

singly in two forms: with or without the associated cables. In addition, cable kits can be

provided.

• IPU-M2000

o 1x IPU-M2000 unit, including pre-installed M.2 SSD and 100GbE RNIC

o 1x slider kit for installation in a rack

o 2x 1.45m IEC60320 AC power cables for the IPU-M2000 unit

• IPU-M2000 Founders Edition

o 1x IPU-M2000 unit, including pre-installed M.2 SSD and 100GbE RNIC

o 1x slider kit for installation in a rack

o 2x 1.45m IEC60320 AC power cables for the IPU-M2000 unit

o 1x 1.5m QSFP cable to connect the IPU-M2000 to the server (black)

o 1x 1.5m Cat5e cable to connect the IPU-M2000 to the server (blue)

o 1x 0.6m Cat5e cable to connect management ports between IPU-M2000 units

(blue)

o 4x 0.3m OSFP cables to connect IPU-Link ports between IPU-M2000 units

(black)

o 2x 0.15m Cat5e cables to connect Sync-Link ports between IPU-M2000 units

(red)

• IPU-M2000 cable kit

o 2x 1.45m IEC60320 AC power cables for the IPU-M2000 unit

o 1x 1.5m QSFP cable to connect the IPU-M2000 to the server (black)

o 1x 1.5m Cat5e cable to connect the IPU-M2000 to the server (blue)

o 1x 0.6m Cat5e cable to connect management ports between IPU-M2000 units

(blue)

o 4x 0.3m OSFP cables to connect IPU-Link ports between IPU-M2000 units

(black)

o 2x 0.15m Cat5e cables to connect Sync-Link ports between IPU-M2000 units

(red)

Depending on the quantity ordered, the above items may be supplied in single cartons, or a

bulk carton containing 4 sets. Note that not all the supplied cables are used when building any

of the possible IPU-M2000 direct attach configurations.


2.1 IPU-M2000 overview

The IPU-M2000 front panel contains (from left to right in the figure):

• 2 RNIC ports for connection to the server

• 8 Sync-Link ports for connection between units

• 2 management GbE ports for connection to the server

• 2 GW-Link ports – not used in direct attach systems

• 8 IPU-Link ports for connection between units

Front panel

The IPU-M2000 back panel contains:

• 2 power connectors per IPU-M2000

• 5 Fan units (n+1 redundant and removable)

• Unit QR code

Back panel

The QR code contains the following information for each IPU-M2000:

• Company name (Graphcore)

• Serial number

• Part number

• BMC Ethernet MAC address

• GW Ethernet MAC address

• Graphcore support web URL (https://www.graphcore.ai/support)

QR code


2.2 IPU-M2000 direct attach configurations

It is possible to build a number of IPU-M2000 direct attach configurations, based on the

different numbers of IPU-M2000s, rack types and PDU types. Minimum requirements for the

rack environment and PDUs are given in the next section. The following diagrams illustrate the

range of system configurations: others are also possible.

System (a) is a baseline IPU-M2000 direct attach x1, shown with redundant horizontally rack-

mounted PDUs. System (b) is functionally the same IPU-M2000 direct attach x1 configuration

where space has been allowed between the first IPU-M2000 unit and the server for future

expansion up to a direct attach x4 system. System (c) is a baseline IPU-M2000 direct attach x4,

with redundant horizontally rack-mounted PDUs. In all cases, power cabling to the PDUs can

sometimes be made easier by allowing space between the lowest IPU-M2000 unit and the

PDUs.

Systems (d) and (e) show the extremes for x1 and x4 systems. System (d) is a minimally

configured IPU-M2000 direct attach x1 using just one horizontally mounted PDU. While this is

not recommended in production deployments due to lack of redundancy, it could be

appropriate for small development systems. System (e) shows a larger rack deployment where

more than one x4 system is deployed. In this case, it may be more effective to use two higher

power vertically mounted PDUs to supply more than one x4 system. Given power and cooling,

it is possible to build racks containing 4 or more IPU-M2000 direct attach x4 systems powered

by such vertical PDUs.


2.3 System pre-requisites

In addition to the parts included in the IPU-M2000 kits described above, the following are

needed to build an IPU-M2000 direct attach system:

• 1 host server with an installed 100GbE RoCEv2 NIC

o This document describes the use of a Dell PowerEdge R6525 server. Other

servers are supported – see the Approved server list or contact Graphcore

support for details of other supported server types.

• Mounting rack

o The IPU-M2000 is an Open Compute compliant 1U 19-inch chassis so the

mounting rack needs to support this standard.

o The IPU-M2000 mounting system requires a rail-to-rail distance of 720mm.

o Additional height may be required for any horizontal PDUs and ToR

datacentre switch

• 2 PDU units

o For power feed redundancy the IPU-M2000 direct attach system will require 2

PDU units supporting output voltage in the range 115-230V AC.

o Each IPU-M2000 unit connects to both PDUs with C15-to-C14 cables. The

recommended PowerEdge server connects to both PDUs with two C13-to-C14

cables.

o Examples of PDUs:

▪ Horizontal mounting (1U rackmount): Tripp Lite PDUH32HV (supports

one x4 system)

▪ Vertical mounting (zero U): AP8886 PDU (supports multiple x4

systems)

• Power delivery

o Power requirement x1 direct attach: 2500W per PDU for redundancy

o Power requirement x4 direct attach: 7kW per PDU for redundancy

• Cooling delivery

o Airflow requirements x1 direct attach: 150 CFM or higher at 35°C maximum

o Airflow requirements x4 direct attach: 465 CFM or higher at 35°C maximum

• Suitable management and data connectivity into the server to service the workload

requirements of the completed system.


Physical mounting

3.1 Preparing the rack

3.1.1 Unit orientation

The airflow for the IPU-M2000 units is from the edge containing the network ports to the

edge containing the fans. Therefore, the network ports face the cold aisle.

The airflow for the R6525 server is from the edge containing bezel and display to the edge

containing the network ports. Therefore, the network ports face the hot aisle.

3.1.2 Rack preparation

The IPU-M2000 mounting system requires a rail-to-rail distance of 720mm. Details of this

adjustment will depend on the choice of rack.

The vertical accessory channels should be positioned at the very front and very rear of the

rack. If necessary, move these from their shipping positions.

3.1.3 Adjusting front and rear vertical rails

The rear vertical rack rails should be positioned such that there is 20mm of distance between

the rear face of the vertical rack rail and the rack’s rear frame. This should result in a square

symbol being visible through the alignment window at the top and bottom of the rail. Details

of this adjustment will depend on the choice of rack.

The front vertical rack rails should be positioned tight against the front vertical cable

organisers such that only a single diamond symbol is visible through the alignment window at

the top and bottom of the rail.


3.1.4 Installing the IPU-M2000 rails

The IPU-M2000 rail kit comprises two mated inner and outer rack rails and an accessory bag

containing screws. The inner rail fixes to the body of the IPU-M2000 and the outer rail fixes to

the vertical rack rails in the server cabinet.

1) Separate the mated inner and outer rails:

a) Fully extend the rails by pulling on the end which has the captive thumb screw attached:

b) Whilst pulling on the thumb screw end of the rails, push the white plastic release tab

towards the thumb screw end:

c) The inner and outer rails will now separate:

d) Repeat these steps for the number of IPU-M2000 units to be installed.

2) Fix the inner rail to the body of the IPU-M2000:

a) Offer up the inner rail to the side of the IPU-M2000 and ensure that all fixing pins are

sitting within the enlarged opening of the retention channel. The inner rails are the

thinner of the two separated rails and have a captive thumb screw at one end. The inner

rail should be oriented such that the captive thumb screw end is at the end of the IPU-

M2000 containing the network ports. The inner rails are mirrored and are not handed, so

the procedure for fixing the inner rail is the same for both the left- and right-hand inner

rails.


b) Push the inner rail towards the front of the IPU-M2000 (containing the network ports):

you should hear a click as the latching mechanism locks behind the head of a fixing pin.

c) Ensure all fixing pins are correctly engaged with their respective retention channel.

d) Locate the four flat head fixing screws from the rack rail accessory bag:


e) Using the above screws, fix the inner rail to the body of the IPU-M2000:

The inner rails are now securely affixed to the IPU-M2000 body.

f) Repeat these steps for the number of IPU-M2000 units to be installed.

3) Place the outer rails to one side for later use:

IPU-M2000 outer rails


3.2 Installing the equipment

The following sections describe the installation of the IPU-M2000s and server into the rack.

In the case of an IPU-M2000 direct attach x1 system, one IPU-M2000 is installed. If

subsequently you plan to expand to an IPU-M2000 x4 direct attach system, you may wish to

leave enough free rack space (3 rows) between the IPU-M2000 unit and the server.

In the case of an IPU-M2000 direct attach x4 system, four IPU-M2000s are installed. All IPU-

M2000 units and the server should be adjacent to each other in the rack.

The installation of PDUs is not covered in this guide since that will depend upon the units

selected.

3.2.1 Installing the IPU-M2000s

Earlier in the guide we fixed the inner rack rails to the IPU-M2000 body. We now need to

install the outer rack rails into the rack and install the IPU-M2000 units.

1) Install the outer rack rails into the rack

It is possible to identify the front and rear of the outer rail by finding the large metal latching

mechanism – this is to be located at the rear of the rack.

The outer rail is also embossed with the text “FRONT” at the front end of the rail.

For each rack position where there is to be an IPU-M2000 installed, perform the steps below

with both the left-hand and right-hand outer rack rails:

a) Pull on each end of the outer rail to adjust the rail length to suit your rack

b) Locate the front end of the outer rail and hold it behind the square holes in the vertical

rack rail for your installation. Pull the outer rail towards the vertical rack rail and the

latching mechanism will click and hold the outer rail in place:


c) Locate the rear end of the outer rail and slightly open the large metal latch, then press

the upper and lower locating pins into the square holes. Release the large metal latch

and the outer rail will now be secured to the vertical rack rail:

d) Included in the rack rail accessory bag are two screws and two washers. One screw with

one washer should be screwed through the vertical rack rail and into the outer rack rail

threaded hole. The washer should be used in such a way that the washer sits flush with

the head of the screw – like a cup.

This should be repeated for both outer rack rails.


2) To install an IPU-M2000 unit into the rails

a) Pull the sliding rail located within the outer rack rail completely forward such that it

locks into the fully extended position:

b) Place the IPU-M2000 onto an appropriately suited server lift and adjust the height such

that it is suitable for the sliders. If a lift is not available, then this is a two-person

operation.

c) Slide the protruding inner rails into the receiving channel of the extended outer rails


d) Whilst the server lift is supporting the full weight of the IPU-M2000, slide the IPU-M2000

into the extended outer rails until you feel both sides engage a stopping mechanism.

e) Simultaneously pull on the blue tabs for the release mechanism at each side of the IPU-

M2000 and then push the IPU-M2000 unit fully into the rack:

f) Screw the captive thumb screw into the inner rack rail:

g) Repeat these steps for the number of IPU-M2000 units to be installed.


3.2.2 Installing the Dell R6525 server

The server should be installed above the IPU-M2000 units.

Note

The R6525 is installed such that the rear of the server (containing the

management and data ports) is on the opposite side of the rack to the ports

on the front of the IPU-M2000 units. This is due to airflow direction.

Remove and discard the cable management arm brackets from the rear of each tool-less

sliding rail.

Install the tool-less sliding rail kit(s) in the required location.

Pull out the rail and fit the server to the rail ensuring the T pins on the side of the server locate

in the slots on the rail. Ensure that the power supplies on the server face the rear of the rack.

Note Use an appropriate server lift or have two people installing the servers to

ensure correct fitting

Push the server gently from the front to lock it into the slides then press the tab on the side of

the slides and push the server fully home in the rack. Repeat the above process for each

server if installing multiple servers.

Remove the Velcro tape from the light pipes on the rear of the servers.


Remove the small plastic tab on the left front side of the server bezel and clip the bezel in

place on the front of the server ensuring the connection pins on the right-hand side of the

bezel line up with the connector on the server, as shown below:

3.2.3 Installing the PDUs

The method for installing the PDUs will depend on the choice of PDU type and location. If the

PDUs are to be installed horizontally within the rack, the recommendation is that these are

positioned beneath the lowest IPU-M2000. Allowing some space between the lowest IPU-

M2000 unit and the PDUs may make power cabling easier.


3.3 Cabling the system

The following sections detail the wiring of the IPU-M2000 direct attach system within the rack.

The cabling is very straightforward, as indicated by diagrams of the completed systems below.

The following sections take you through wiring up each group of cables in turn.

IPU-M2000 x1 direct attach system

IPU-M2000 x4 direct attach system


3.3.1 IPU-M2000 to IPU-M2000 IPU-Link connectivity (OSFP)

Note This step is only required for direct attach systems with more than one IPU-

M2000.

There are 8 OSFP IPU-Link ports on the right side of each IPU-M2000.

Using 0.3m OSFP cables, link the top row of four IPU-Link ports (5-8) to the bottom row of four

IPU-Link ports (1-4) in the IPU-M2000 that is installed directly above (see figure and table

below). The top row (5-8) of the top-most IPU-M2000 (#4) and the bottom row (1-4) of the

bottom-most IPU-M2000 (#1), are left unconnected.

Before attempting to install the OSFP cables, it is beneficial to manipulate the cable to form a

tight loop such that the white side of the connector tabs face away from each other. During

manufacture and shipping, the cables can form quite a stiff shape, so manipulating the cables

before installing them reduces stresses on the socket during install.

After installing a cable, pull gently on the black cable to ensure the plugs are firm in the

sockets on the IPU-M2000.

Note The white tab is on the top of the cable when inserted into the IPU-M2000.


IPU-M2000 to IPU-M2000 IPU-Links                                               Cables

IPU-M2000 #3 IPU-Link ports 5,6,7,8  to  IPU-M2000 #4 IPU-Link ports 1,2,3,4   OSFP 0.3m

IPU-M2000 #2 IPU-Link ports 5,6,7,8  to  IPU-M2000 #3 IPU-Link ports 1,2,3,4   OSFP 0.3m

IPU-M2000 #1 IPU-Link ports 5,6,7,8  to  IPU-M2000 #2 IPU-Link ports 1,2,3,4   OSFP 0.3m

IPU-M2000 OSFP port mapping

The figure below shows the IPU-M2000 to IPU-M2000 IPU-Link cabling for an IPU-M2000 x4

direct attach system.

IPU-M2000 x4 direct attach system


3.3.2 IPU-M2000 to IPU-M2000 Sync-Link cabling (Cat5e)

Note This step is only required for direct attach systems with more than one IPU-

M2000.

There are 8 Cat5e Sync-Link ports in the middle of each IPU-M2000.

Using 0.15m red Ethernet Cat5e cables, link the top two central Sync-Link ports (6-7) of each

IPU-M2000 to the bottom two central Sync-Link ports (2-3) of the IPU-M2000 that is installed

directly above (see figure and table below). The top row (5-8) of the top-most IPU-M2000 (#4)

and the bottom row (1-4) of the bottom-most IPU-M2000 (#1), are left unconnected.

IPU-M2000 to IPU-M2000 Sync-Link connections                                   Cables

IPU-M2000 #3 Sync-Link ports 6-7  to  IPU-M2000 #4 Sync-Link ports 2-3   2x Cat5e 0.15m red

IPU-M2000 #2 Sync-Link ports 6-7  to  IPU-M2000 #3 Sync-Link ports 2-3   2x Cat5e 0.15m red

IPU-M2000 #1 Sync-Link ports 6-7  to  IPU-M2000 #2 Sync-Link ports 2-3   2x Cat5e 0.15m red

IPU-M2000 Sync-Link port mapping

The figure below shows the IPU-M2000 Sync-Link cabling for an IPU-M2000 x4 direct attach

system.

IPU-M2000 x4 direct attach system


3.3.3 IPU-M2000 to IPU-M2000 management cabling (Cat5e)

Note This step is only required for direct attach systems with more than one IPU-

M2000.

There are 2 Cat5e management ports in the middle of each IPU-M2000.

Using 0.3m blue Ethernet Cat5e cables, link the top management port (2) of each IPU-M2000

to the bottom management port (1) of the IPU-M2000 that is installed directly above (see

figure and table below). The top management port (2) of the top-most IPU-M2000 (#4) and

the bottom management port (1) of the bottom-most IPU-M2000 (#1), are left unconnected.

IPU-M2000 to IPU-M2000 management connections                                  Cables

IPU-M2000 #3 management port 2  to  IPU-M2000 #4 management port 1   Cat5e 0.3m blue

IPU-M2000 #2 management port 2  to  IPU-M2000 #3 management port 1   Cat5e 0.3m blue

IPU-M2000 #1 management port 2  to  IPU-M2000 #2 management port 1   Cat5e 0.3m blue

IPU-M2000 management port mapping

The figure below shows the IPU-M2000 management cabling for an IPU-M2000 x4 direct

attach system.

IPU-M2000 x4 direct attach system


3.3.4 IPU-M2000 to server management cabling (Cat5e)

There are 2 Cat5e management ports in the middle of each IPU-M2000.

Using a 1.5m blue Ethernet Cat5e cable, link the top management port (port #2) of the top IPU-M2000 to management port #1 of the R6525 server.

IPU-M2000 to server management connections                                     Cables

Top IPU-M2000 management port 2  to  R6525 server management port 1   Cat5e 1.5m blue

The figures below show the IPU-M2000 management cabling for an IPU-M2000 x1 direct

attach system and an IPU-M2000 x4 direct attach system.

IPU-M2000 x1 direct attach system

IPU-M2000 x4 direct attach system


3.3.5 IPU-M2000 to server data cabling (QSFP)

There are two QSFP RNIC data ports on the left of each IPU-M2000.

Only one of these (port #2) should be connected from each IPU-M2000 to the server with

1.5m QSFP cables, as shown in the table below. In an IPU-M2000 x1 direct attach system, only

IPU-M2000 # 1 is present.

IPU-M2000 RNIC port mapping                                                    Cables

IPU-M2000 #4 port 2  to  Server NIC slot 2, port 2   QSFP 1.5m

IPU-M2000 #3 port 2  to  Server NIC slot 2, port 1   QSFP 1.5m

IPU-M2000 #2 port 2  to  Server NIC slot 1, port 2   QSFP 1.5m

IPU-M2000 #1 port 2  to  Server NIC slot 1, port 1   QSFP 1.5m

IPU-M2000 RNIC port mapping

The figures below show the IPU-M2000 data cabling for both IPU-M2000 x1 direct attach and

IPU-M2000 x4 direct attach systems.

IPU-M2000 x1 direct attach system


IPU-M2000 x4 direct attach system

Note IPU-M2000 #1 (the lowest in the stack) is always connected to Port 1 in

server Slot 1.


3.4 Power cabling

The method for cabling the PDUs will depend on the choice of PDU type and location. Each

IPU-M2000 has two power inputs and these should be supplied from separate power

distribution units for redundancy.

3.5 Completing the rack

The following steps describe completing the rack: fitting blanking panels and re-installing the

doors and side panels, if required.

3.5.1 Blanking panels

For proper airflow it is recommended to install 1U blanking panels in every unoccupied rack

slot at the front of the rack.

1U blanking panels

3.5.2 Rack completion

Re-install the front and rear doors. Ensure the earth cables are reconnected to the cable on

the rack. Re-install the top and bottom side panels on each side of the rack.


Server configuration

This chapter describes how to configure the server in an IPU-M2000 direct attach system. The

high-level steps for configuring the server are as follows:

• Select appropriate server specifications

• Configure the server storage and RAID arrays

• Install the server operating system and packages

• Install required python packages

• Configure the required users and groups

• Configure the network interfaces

• Configure various services

Note Scripts for the server installation can be provided: please contact Graphcore

support.

4.1 Server hardware and storage

4.1.1 Hardware requirements

This document describes building a system using a PowerEdge R6525 server. Contact

Graphcore support for details of other supported server types. Other servers may have

different installation requirements.

The recommended configuration of the Dell R6525 is as follows:

• Dell R6525 containing dual AMD EPYC 7742 processors

• 16x 32GB RDIMM PC4-25600 ECC registered dual-rank X4 1.2V

• 2x 480GB SSD-SATA 6Gbps 2.5 inch hot-swap

• 7x 1TB NVMe SSD PCIe 4x 3.1

• Dual port Gigabit BASE-T PCIe

• 1x single port Mellanox ConnectX-5 EN 100Gb/s Ethernet (1x system)

• 2x dual port Mellanox ConnectX-5 EN 100Gb/s Ethernet (4x system)
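
Once an operating system is available on the server (see Section 4.2), the presence and PCIe placement of the RNIC(s) can be verified with standard tools; for example (the output will vary with server build):

lspci | grep -i mellanox    # should list one ConnectX-5 (x1 system) or two dual-port cards (x4 system)
ibv_devices                 # from the ibverbs-utils package: lists the RDMA-capable devices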

4.1.2 Storage configuration

The recommendation is to configure two types of server storage: SSD-SATA for the operating

system and NVME SSD for data storage.

Operating system:

• 2x 480GB SSD-SATA units as a RAID 1 via hardware controller

• Partitioned to use ext4 file system

Data storage:

• 7x 1TB NVMe SSD units as a logical RAID 6 managed with mdadm (an example is shown below)

• Partitioned to use xfs file system
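
The data-storage array can be created with mdadm and formatted with xfs. The commands below are a minimal sketch only: they assume the seven NVMe drives appear as /dev/nvme0n1 to /dev/nvme6n1 and use a hypothetical /data mount point, so adjust device names and paths to your server before running them.

sudo mdadm --create /dev/md0 --level=6 --raid-devices=7 \
    /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 \
    /dev/nvme4n1 /dev/nvme5n1 /dev/nvme6n1
sudo mkfs.xfs /dev/md0                                            # format the RAID 6 array with xfs
sudo mkdir -p /data                                               # hypothetical mount point
echo '/dev/md0  /data  xfs  defaults,nofail  0 0' | sudo tee -a /etc/fstab
sudo mount /data
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf    # use /etc/mdadm.conf on CentOS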


4.1.3 BIOS configuration

The NUMA and multi-threading configuration of the R6525 can impact the performance of the

system. The recommended settings in the BIOS for these are as follows:

• Simultaneous multi-threading (SMT): ON

• NUMA nodes per socket: 4 (NPS4)

4.2 Operating system and packages

Please contact your Graphcore representative or use the support portal support.graphcore.ai

for information about operating system support. This document describes the following

operating systems:

• Ubuntu 18.04.4 LTS (bionic)

• CentOS 7.2 / 8

4.2.1 Ubuntu OS installation and packages

The Ubuntu OS should be installed from the following default public Ubuntu 18.04.4

repositories:

deb http://archive.ubuntu.com/ubuntu/ bionic main restricted

deb http://archive.ubuntu.com/ubuntu/ bionic-updates main restricted

deb http://archive.ubuntu.com/ubuntu/ bionic universe

deb http://archive.ubuntu.com/ubuntu/ bionic-updates universe

deb http://archive.ubuntu.com/ubuntu/ bionic multiverse

deb http://archive.ubuntu.com/ubuntu/ bionic-updates multiverse

deb http://archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe

multiverse

deb http://archive.canonical.com/ubuntu bionic partner

deb http://security.ubuntu.com/ubuntu bionic-security main restricted

deb http://security.ubuntu.com/ubuntu bionic-security universe

deb http://security.ubuntu.com/ubuntu bionic-security multiverse

In order to have a stable system where IPU related software can run, several packages need to

be installed on the system via the APT package manager (an example command is given after the package list):

apt-transport-https ibverbs-utils openjdk-8-jdk python3-virtualenv

autoconf ipmitool php-cli python3-wheel

automake jq php-curl qtcreator

bc kcachegrind policykit-1 rdma-core

build-essential libaio-dev protobuf-compiler screen

ccache libboost-all-dev python-boto3 software-properties-common

clang libeigen3-dev python-dev sshpass

cmake libjson-c-dev python-lxml subversion

curl libjson-c-doc python-numpy swig

direnv libpci-dev python-pip sysfsutils

dkms libpixman-1-dev python-pytest tar


emacs libprotobuf-dev python-recommonmark tmux

ethtool libtool python-requests u-boot-tools

exuberant-ctags lldpad python-setuptools unzip

flex m4 python-wheel valgrind

g++ minicom python-yaml vim

gawk moreutils python2 virtualenv

gcc net-tools python3 wdiff

gdb netcat python3-dev wget

git parallel python3-numpy zip

golang-go pciutils python3-pip

htop perl python3-setuptools
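
Assuming the repositories listed above are enabled, the packages can be installed with apt. A short example with a representative subset of the list (extend the package list as required):

sudo apt update
sudo apt install -y build-essential cmake dkms ipmitool net-tools rdma-core \
    ibverbs-utils python3 python3-pip python3-virtualenv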

4.2.2 CentOS OS installation and packages

In order to have a stable system where IPU related software can run, several packages need to

be installed on the system via the yum package manager (an example command is given after the package list):

bc libaio-devel python2-numpy vim

centos-release-scl libboost-devel python2-pip wdiff

clang libibverbs-utils python2-pytest wget

cmake libuser python27-python-devel

containerd.io lldpad qt5-qtbase

devtoolset-7 minicom rdma-core

dhcp moreutils rh-python36

dkms nano rh-python38-python-lxml

docker-ce nc rh-python38-python-numpy

docker-ce-cli net-tools rh-python38-python-setuptools

eigen3 ntp rh-python38-python-wheel

emacs parallel rh-python38-scldevel

golang-go pciutils-devel rh-python38

htop php-cli screen

ipmitool protobuf-devel snapd

java-latest-jdk python-anymarkup sshpass

jq python-boto3 sysfsutils

json-c-devel python-requests tmux

json-c-doc python-wheel uboot-tools

kcachegrind python2 valgrind
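
The packages can be installed with yum once the required repositories are enabled. A short example with a representative subset of the list; note that docker-ce needs the Docker CE repository, the rh-python38 collections need Software Collections (centos-release-scl), and some packages are assumed to come from EPEL:

sudo yum install -y epel-release centos-release-scl
sudo yum install -y bc dkms ipmitool net-tools rdma-core libibverbs-utils \
    rh-python38 minicom moreutils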


4.2.3 Python packages

Several Python packages are required for both OS installations. They can be installed using the

pip installation tool (an example command is given after the list).

autograd paramiko pylint scp

jstyleson pep8 pyyaml yapf

mock pexpect requests
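
For example, the listed packages can be installed for the current user with pip:

python3 -m pip install --user autograd jstyleson mock paramiko pep8 pexpect \
    pylint pyyaml requests scp yapf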


4.3 User accounts and groups

The following accounts are required as part of the default server configuration:

Accounts

root A root user account secured with a password is recommended.

itadmin An admin account secured with a password is recommended.

Home folder located at /home/itadmin using bash shell.

ipuuser

An account dedicated to IPU software and IPU-M2000 management

software is mandatory. Home folder located at /home/ipuuser using

bash shell.

poplaruser An account dedicated to Poplar software is mandatory.

Home folder located at /home/poplaruser using bash shell.

The following groups are required as part of the default server configuration:

Groups

root: A root group to locate the root account is mandatory.

dhcpd: A group to allocate the DHCP service is mandatory (usually

configured automatically when the DHCP service is installed).

ipugroup: A group to allocate ipuuser is mandatory.

poplargroup: A group to allocate poplaruser is mandatory.

ipupodgroup: A group to allocate both ipuuser and poplaruser is mandatory.
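
A minimal sketch of creating the mandatory accounts and groups with standard commands is shown below; user and group IDs are left at the system defaults here (remember they must be unique) and passwords should be set according to your site policy:

sudo groupadd ipugroup
sudo groupadd poplargroup
sudo groupadd ipupodgroup
sudo useradd -m -d /home/itadmin    -s /bin/bash itadmin
sudo useradd -m -d /home/ipuuser    -s /bin/bash -g ipugroup    -G ipupodgroup ipuuser
sudo useradd -m -d /home/poplaruser -s /bin/bash -g poplargroup -G ipupodgroup poplaruser
sudo passwd itadmin; sudo passwd ipuuser; sudo passwd poplaruser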

The following table gives the default usernames for the IPU-M2000 system:

Login to:                    Username

IPU-M2000 BMC OS             root

IPU-M2000 GW OS              itadmin

Server - Poplar SDK user     poplaruser

Server - IPU admin user      ipuuser

Server - IT admin user       itadmin

Server - iDRAC port          root

PDU (depending on choice)    apc

The default passwords are available from Graphcore support (support.graphcore.ai).

Note that users need to have unique user IDs and group IDs.


4.4 Network interfaces

4.4.1 Overview

The following figure gives a logical overview of the network setup within the IPU-M2000 direct

attach system.


4.4.2 IPU-M2000 direct attach network interfaces

The following table shows the interfaces in the system and how these are set up:

Port: IPU-M2000 Mgmt Port 1 and Mgmt Port 2
Role: BMC + GW management ports
Speed: 1GE
IP address: BMC 10.1.1.1-4/22, GW 10.1.2.1-4/22
Config from: Static DHCP lease from server

Port: IPU-M2000 #1, #2, #3, #4
Role: Host-link, data-plane link to IPU-M2000s
Speed: 100GE
IP address: 10.1.5.2/30, 10.1.5.6/30, 10.1.5.10/30, 10.1.5.14/30
Config from: DHCP lease from server

Port: Dell R6525 eno1
Role: Management of IPU-M2000
Speed: 1GE
IP address: 10.1.3.101/22
Config from: Server local netplan

Port: Server iDRAC
Role: Server BMC
Speed: 1GE
IP address: x.x.x.x
Config from: Site specific setup

Port: Server enp129s0f0, enp129s0f1, enp161s0f0, enp161s0f1
Role: RDMA to IPU-M2000
Speed: 100GE
IP address: 10.1.5.1/30, 10.1.5.5/30, 10.1.5.9/30, 10.1.5.13/30
Config from: Server local netplan

Note

These items need to be set up manually for the early access release. A

configuration script will be included in the production release to automate the

process.

Note Server interface names may vary with server build and configuration.
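
The interface names and addresses actually present on the server can be listed with standard iproute2 commands, for example:

ip -br link    # brief list of interfaces and their link state
ip -br addr    # brief list of interfaces with their assigned IP addresses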


4.4.3 Server network configuration

Note The following manual configuration is only required for the early access system.

The examples below assume an IPU-M2000 x4 system where one server is directly attached

via QSFP cables to four IPU-M2000 units.

The IPU-M2000 x1 direct attach variant is also supported by the same network configuration

by omitting the RNIC interfaces with the highest indexes.

The manual setup of the networking for the early access system requires the use of both

netplan and systemd/network config files. The netplan file is used to configure the

management interfaces while the systemd-networkd files are used to configure the IPU-

M2000 RNIC interfaces.

Example netplan file (Dell R6525 running Ubuntu 18.04):

• This file will configure (during server start-up) four IP subnets, each mapped to the

point-to-point link between the server and the IPU-M2000. The mapping to four IP

subnets is a mandatory requirement for the IPUoF protocol.

• Default location: /etc/netplan/01-netcfg.yaml

• Interface eno1 serves the 1GE management network, where a daisy chain of all

IPU-M2000s gives access to all BMC and Gateway (GW) interfaces

• Interface eno2 serves the user’s network.

network:
  version: 2
  renderer: networkd
  ethernets:
    eno1:
      addresses:
        - 10.1.3.101/22
    eno2:
      addresses:
        CUSTOMER TODO: Site specific network setup

Example systemd/network files (Dell R6525 running Ubuntu 18.04):

• These files are used under the hood of netplan, and are required due to a missing

feature in netplan:

o All RNIC interfaces to be served by DHCP must be up when the DHCP server

starts up after a power down.

o Forcing the RNIC interfaces to be up during boot is required since the GW may

be powered off (no carrier) and the server’s RNIC interface would otherwise be

down.

• A mapping of IPU-M2000s to four IP subnets is a mandatory requirement for the

IPUoF protocol.


• Only the first configuration file is required for an x1 system

• Interfaces named “enp*” are serving point-to-point QSFP cables towards the IPU-

M2000 100GE RNIC interfaces.

• IPU-M2000 #1: /etc/systemd/network/10-netplan-enp129s0f0.network

[Match]

Name=enp129s0f0

[Link]

RequiredForOnline=no

[Network]

LinkLocalAddressing=ipv6

Address=10.1.5.1/30

ConfigureWithoutCarrier=true

• IPU-M2000 #2: /etc/systemd/network/10-netplan-enp129s0f1.network

[Match]

Name=enp129s0f1

[Link]

RequiredForOnline=no

[Network]

LinkLocalAddressing=ipv6

Address=10.1.5.5/30

ConfigureWithoutCarrier=true

• IPU-M2000 #3: /etc/systemd/network/10-netplan-enp161s0f0.network

[Match]

Name=enp161s0f0

[Link]

RequiredForOnline=no

[Network]

LinkLocalAddressing=ipv6

Address=10.1.5.9/30

ConfigureWithoutCarrier=true


• IPU-M2000 #4: /etc/systemd/network/10-netplan-enp161s0f1.network

[Match]

Name=enp161s0f1

[Link]

RequiredForOnline=no

[Network]

LinkLocalAddressing=ipv6

Address=10.1.5.13/30

ConfigureWithoutCarrier=true
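
After the netplan and systemd-networkd files are in place, the configuration can be applied without a reboot. A minimal sketch using standard commands:

sudo netplan apply                         # regenerate and apply the netplan configuration
sudo systemctl restart systemd-networkd    # re-read the .network files created above
networkctl status                          # verify that the RNIC interfaces are configured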


4.5 Services

4.5.1 DHCP (Dynamic Host Configuration Protocol)

Note Manual installation and configuration of this is only required for the early

access version.

An ISC-DHCP-Server service is required to provide DHCP network configuration to IPU-

M2000s. It can be installed from the Ubuntu or CentOS public repositories.

File structure:

• /etc/default/isc-dhcp-server (file)

o This file contains the network interfaces which DHCP is going to use

o root:root 0644

Dell R6525 server using 2 RNICs (x4 product variant):

INTERFACES="eno1 enp129s0f0 enp129s0f1 enp161s0f0 enp161s0f1"

Dell R6525 server using 1 RNIC (x1 product variant):

INTERFACES="eno1 enp129s0f0"

• /etc/dhcp/ (folder)

o Folder containing DHCP related files

o root:dhcpd 0575

• /etc/dhcp/dhcpd.conf (file)

o Main DHCP server configuration file

o root:root 0444

default-lease-time 600;

max-lease-time 1200;

ddns-update-style none;

authoritative;

log-facility local7;

# NOTE: Keep the current site specific settings, and

# Add Graphcore setup as the last section

include "/var/lib/gc-ipu-network/ea/rnic.conf";

include "/var/lib/gc-ipu-network/ea/gw+bmc.conf";


• /var/lib/gc-ipu-network/ea/ (folder)

o Folder to contain IPU-M2000 network configuration files so that the later install script can identify a manual setup (early access) of the product.

o root:dhcpd 0770

• /var/lib/gc-ipu-network/ea/rnic.conf (file)

o Specific file related to IPU-M2000 100Gb RNIC interfaces. The file lists four IP

subnets with exactly one free dynamic lease address per subnet. There is no

need to map this to a static “MAC-address to IP address” lease since there is

only one end point on the cable.

o root:dhcpd 0660

default-lease-time 600;

max-lease-time 1200;

ddns-update-style none;

authoritative;

log-facility local7;

subnet 10.1.5.0 netmask 255.255.255.252 {

option subnet-mask 255.255.255.252;

range 10.1.5.2 10.1.5.2;

}

subnet 10.1.5.4 netmask 255.255.255.252 {

option subnet-mask 255.255.255.252;

range 10.1.5.6 10.1.5.6;

}

subnet 10.1.5.8 netmask 255.255.255.252 {

option subnet-mask 255.255.255.252;

range 10.1.5.10 10.1.5.10;

}

subnet 10.1.5.12 netmask 255.255.255.252 {

option subnet-mask 255.255.255.252;

range 10.1.5.14 10.1.5.14;

}

NOTE: Skip the last two subnets if using only one RNIC, as in x1 product

variants.


• /var/lib/gc-ipu-network/ea/gw+bmc.conf (file)

o Specific file related to GW+BMC DHCP requests from IPU-M2000 1Gb BASE-T

management interfaces. It has entries for each IPU-M2000 daisy chained to

the management interface of the server. Each entry defines a static IP address

assigned to the port that includes the specified MAC address in its DHCP request

packet.

o root:dhcpd 0660

default-lease-time 600;

max-lease-time 1200;

ddns-update-style none;

authoritative;

log-facility local7;

subnet 10.1.0.0 netmask 255.255.252.0 {

option subnet-mask 255.255.252.0;

}

host ipum1bmc { hardware ethernet 70:69:79:20:03:A8; fixed-address 10.1.1.1; }

host ipum1gw { hardware ethernet 70:69:79:20:03:A9; fixed-address 10.1.2.1; }

host ipum2bmc { hardware ethernet 70:69:79:20:01:48; fixed-address 10.1.1.2; }

host ipum2gw { hardware ethernet 70:69:79:20:01:49; fixed-address 10.1.2.2; }

host ipum3bmc { hardware ethernet 70:69:79:20:03:80; fixed-address 10.1.1.3; }

host ipum3gw { hardware ethernet 70:69:79:20:03:81; fixed-address 10.1.2.3; }

host ipum4bmc { hardware ethernet 70:69:79:20:03:E0; fixed-address 10.1.1.4; }

host ipum4gw { hardware ethernet 70:69:79:20:03:E1; fixed-address 10.1.2.4; }
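
Before enabling the service, the combined configuration can be checked for syntax errors with the standard ISC dhcpd test option:

sudo dhcpd -t -cf /etc/dhcp/dhcpd.conf    # parse the configuration without starting the daemon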

The dhcp service is enabled/started/stopped using:

sudo systemctl enable isc-dhcp-server

sudo systemctl start isc-dhcp-server

sudo systemctl stop isc-dhcp-server


4.5.2 NTP (Network Time Protocol)

An NTP service is recommended to provide network time configuration to IPU-M2000 systems. It

can be installed from the Ubuntu or CentOS public repositories.

File structure:

• /etc/ntp.conf (file)

o NTP tool configuration file

o root:root 0444

disable monitor

driftfile /var/lib/ntp/drift

fudge 127.127.1.0 stratum 10

includefile /etc/ntp/crypto/pw

keys /etc/ntp/keys

restrict ::1

restrict 127.0.0.1

restrict default nomodify notrap nopeer noquery

server 127.127.1.0

server 0.pool.ntp.org iburst

server 1.pool.ntp.org iburst

server 2.pool.ntp.org iburst

The ntp service is started using:

sudo systemctl enable ntpd

sudo systemctl start ntpd
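
Once the service is running, time synchronisation can be checked with, for example:

ntpq -p              # list the configured peers with their reachability and offset
timedatectl status   # confirm whether the system clock is reported as synchronised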


4.5.3 Syslog

Syslog is a software utility for forwarding log messages in an IP network.

File structure:

• /etc/rsyslog.d (folder)

o Rsyslog tool configuration folder.

o root:root 0755

• /etc/rsyslog.conf (file)

o Rsyslog tool configuration file.

o root:root 0444

module(load="imuxsock")

module(load="imudp")

input(type="imudp" port="514")

module(load="imtcp")

input(type="imtcp" port="514")

module(load="imklog" permitnonkernelfacility="on")

$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat

$RepeatedMsgReduction on

$FileOwner syslog

$FileGroup adm

$FileCreateMode 0640

$DirCreateMode 0755

$Umask 0022

$PrivDropToUser syslog

$PrivDropToGroup syslog

$WorkDirectory /var/spool/rsyslog

$IncludeConfig /etc/rsyslog.d/*.conf

• /etc/rsyslog.d/99_ipum.conf (file)

o Rsyslog rules configuration file.

o root:root 0444

$template precise,"%fromhost-

ip%,%HOSTNAME%,%syslogpriority%,%syslogfacility%,%timegenerated::fulltime%,-

%syslogtag%,%msg%\n"

:HOSTNAME, contains, "ipum" /var/log/ipulogs;precise

& ~

• /etc/rsyslog.d/99_dhcpd.conf (file)

o Rsyslog rules configuration file.

o root:root 0444

local7.* /var/log/dhcpd.log
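
After the configuration files are in place, restart rsyslog and, if desired, generate a test message on the local7 facility, which the rule above routes to /var/log/dhcpd.log:

sudo systemctl restart rsyslog
logger -p local7.info "dhcpd log routing test"
tail /var/log/dhcpd.log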


IPU-M2000 software installation

Note

The software installation requirements described in this section are for the early

access version of the IPU-M2000 direct attach system. In this version, the server

provides management functions which will be performed by one of the IPU-

M2000 units in the production system.

The following Graphcore software packages need to be installed on the server:

a) V-IPU server contains management and control software for IPU resource control,

built-in self-test (BIST) and monitoring of the IPU-M2000s and IPUs. There is a V-IPU

Admin Guide and a V-IPU User Guide available.

b) IPU-M2000 system software contains the latest IPU-M2000 resident software for

update, if required. It also includes the server resident tool rack_tool which is

required for updating the IPU-M2000s resident software and testing the system

hardware.

5.1 V-IPU server installation

Please read carefully the release notes for the V-IPU software release before any software

installation or upgrade is performed. Both the release notes and the V-IPU software release

tarball are available from the Graphcore download portal https://downloads.graphcore.ai.

An installation script called install.sh is included with the V-IPU tarball. The installation

script has been tested and verified to work with Ubuntu and CentOS distros that

use systemd as the default service manager. The installation script needs to be executed with

root privileges (sudo ./install.sh) as it copies the vipu-server, vipu-

admin and vipu binaries to /usr/local/bin.

Note You may need to log in as user itadmin to run the script because the default

configuration for the username ipuuser does not have permission to run as root

The script expects the user ipuuser to be present in the system, as the vipu-

server systemd service is configured to be executed by ipuuser. If that user does not exist,

the script will display an error.

The script can be run in interactive or batch modes. The simplest use of these is shown below.

For more details and configuration options, please see the detailed V-IPU documentation.


5.1.1 Interactive installation

If the ipuuser user is present in the system, the installation will ask if vipu-server should be

configured to run as a service in this host and which interface should be used for auto-

discovery. If the answer is yes to the former and a valid interface is entered for the latter, the

script will configure and start vipu-server.service.

In the following example, the system is cabled according to the standard instructions where

“eno1” is the host server interface that connects to the top IPU-M2000 management port at

the top of the stack. See Section 4.4.2.

$ sudo ./install.sh

Do you want to start the vipu-server as a service in this host?

Note that you should have vipu-server running only in one host

and use vipu/vipu-admin to connect to it from all other hosts.

(N/y) y

Choose an interface to use for agent auto-discovery:

eno0

eno1

docker0

lo

Enter disable to deactivate the auto-discovery

Which interface should be used for auto-discovery? eno1

- vipu-server will be configured to be run as a service in this host

- Initialising /etc/vipu/config.hcl

5.1.2 Batch installation

For installation without interaction, certain flags can be set when executing the installation

script (see output below). Set the --yes flag to automatically create vipu-server as a

service. The interface to use for auto-discovery of agents can be set with -a INTERFACE,

where INTERFACE is the interface to be used (for example, -a eth0). If INTERFACE is set

to disabled, the auto-discovery mechanism is disabled.

$ sudo ./install.sh --yes -a eno1

- vipu-server will be configured to be run as a service in this host

- Initialising /etc/vipu/config.hcl

When the installation is over, and if vipu-server was configured to start as a service, you can

use the vipu-admin command from the same host to configure the vipu-server without

any additional configuration.
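
If vipu-server was configured as a service, it can be checked with standard systemd commands before proceeding:

sudo systemctl status vipu-server.service    # should report the service as active (running)
journalctl -u vipu-server.service -n 20      # show the most recent log lines from the service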


5.1.3 V-IPU socket

The V-IPU server creates a Unix socket that provides access for the VIPU utilities. The default

location to create the Unix socket is the current working directory, but this can be changed

using the --socket path option of the vipu-server command when this is run.

The V-IPU utilities will look for a local Unix socket in a set of predefined paths. Therefore, if

the V-IPU utilities are being run from the same host as the V-IPU server, the socket is located

in the correct search path and the user has the necessary permissions to access this Unix

socket, then the path to the socket does not need to be specified. If this is not the case, the

path will need to be given using the -H option with the V-IPU utilities (for example -H

localhost).

The default path list for the socket is the following:

1) ./vipu-server.sock

2) $HOME/.vipu-server/vipu-server.sock

3) /var/run/vipu/vipu-server.sock

4) /var/run/vipu-server.sock


5.2 IPU-M2000 system software installation

Please read the release notes for the IPU-M2000 system software release carefully before any software upgrade is performed. Both the release notes and the IPU-M2000 software release are available from the Graphcore download portal https://downloads.graphcore.ai.

The IPU-M2000 system software contains a set of upgradable software and FPGA sub-components that are targeted to be executed on the IPU-M2000 units. The release also contains the tool rack_tool, which is used for the software upgrade and other rack-related tasks targeting the IPU-M2000s.

Note

Graphcore has qualified the IPU-M2000 software release only with the documented set of software sub-component versions; other combinations of software component versions are not guaranteed to work.

5.2.1 Download latest IPU-M2000 system software

The server needs to be loaded with the correct IPU-M2000 system software bundle before the

software update of the IPU-M2000s can be performed. To perform the download, please

follow these steps:

1. Go to the Graphcore download portal https://downloads.graphcore.ai and download

the latest release into the /tmp directory

2. Log in to the Poplar host as “ipuuser” and unpack the tarball:

tar xvfz <tar-ball.tar.gz>

Unpacking the IPU-M2000 system software onto the server’s file system automatically creates a file tree with a leading directory containing the release version number. This allows several releases to be kept on the server, in case there is a need to revert the IPU-M2000s to a previous release. If this is not needed, older releases (both the unpacked files and the downloaded tar file) can be removed from the server.
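
As an illustrative sketch of steps 1 and 2 (the tarball name below is hypothetical; use the file actually downloaded from the portal):

cd /tmp
# download or copy the release tarball here, then unpack it as ipuuser
tar xvfz ipu-m2000-system-software-<version>.tar.gz
# a release directory named after the version number is created
ls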

5.2.2 Install and configure rack_tool

Note: A full man page describing how to use rack_tool is found in section 5.4.

1. Install (unpack) the release by running:

cd /tmp/<release_dir>

./install.sh

The install script will do the following:

• Create $HOME/IPU-M_releases and copy in the release files

• Create a symlink to $HOME/.local/bin/rack_tool which links to

rack_tool.py in the release that was installed.

• Install any required Python dependencies

All install.sh options can be listed by supplying the -h flag; environment variables can be set to change the storage location of the software files.


2. Setup the rack_config.json file

rack_tool requires a config file which contains information on all the IPU-M2000s it will

control. The information in the config file defines all IP addresses of the BMC, GW and

RNIC interfaces.

The IPU-M2000 system software comes with a rack_config.json template file that can

be found in:

<release_dir>/maintenance_tools/rack_config.json.template

rack_tool can use any config file using the --config-file flag, or the config file can be

copied to the default location. The latter is recommended as this makes it easier to

perform subsequent software upgrades using a central rack_tool config file:

cd /tmp/<release_dir>/maintenance_tools

mkdir -p $HOME/.rack_tool

cp rack_config.json.template $HOME/.rack_tool/rack_config.json

# Edit the file in $HOME/.rack_tool/rack_config.json

Depending on your product variant, this file MUST be edited to match the number of IPU-

M2000s (x1 or x4 product variants). The config file also requires the factory default

passwords for the BMC and GW software to be added. The passwords can be obtained

from Graphcore support.

See section 4.4 for details on IP addresses to be used in this config file.

To ease the editing, refer to the pre-defined “machines” section of this rack_tool config file given in section 5.4.5. Copy that text over the “machines” section of the copied template file.

3. Copy the root-overlay file system

A root-overlay file system is used to pass NTP and syslog configuration into the IPU-M2000 software. The rack_config.json file above refers to the path of these files; the path is either relative to the location of the rack_config.json or an absolute path. The easiest approach is to copy the files to the default location:

cd /tmp/<release_dir>/maintenance_tools/ipu_pod_config

cp -r root-overlay $HOME/.rack_tool/

If a non-default IP addressing scheme is in use, the files within the root-overlay file system

will have to be updated to reflect the new scheme.

4. Remove the downloaded release tarball (optional)

rm /tmp/<tar-ball.tar.gz>

5. rack_tool should now be available from the PATH.

If it is not available, log out and log in again to pick up the new $HOME/.local/bin path.
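
As an alternative to logging out and back in (a minimal sketch, assuming a bash shell), the path can be added to the current session and the installation verified:

export PATH="$HOME/.local/bin:$PATH"
rack_tool --version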


5.3 IPU-M2000 system software validation

Before validating which IPU-M2000 system software is running on the IPU-M2000s, follow the steps given in section 5.2, IPU-M2000 system software installation.

The easiest way to verify which version of the IPU-M2000 system software is installed on the

IPU-M2000s is to use the included rack_tool application.

1. Login on the server using the ipuuser user account.

2. Run the following command to check for versions running on the IPU-M2000s.

rack_tool status

3. The version field of this output will show which IPU-M2000 system software is running on

the IPU-M2000s. If any IPU-M2000 is showing a different version to the others, then a re-

install is necessary. This status output also shows the port status of the three network

interfaces on each IPU-M2000.

ipuuser@ipu_m2000:~$ rack_tool status

14:10:18: Reading configuration from rack_config.json

a04 BMC:[ UP ] GW:[ UP ] RNIC:[ UP ] Version:[ 2.0.0 ]

a03 BMC:[ UP ] GW:[ UP ] RNIC:[ UP ] Version:[ 2.0.0 ]

a02 BMC:[ UP ] GW:[ UP ] RNIC:[ UP ] Version:[ 2.0.0 ]

a01 BMC:[ UP ] GW:[ UP ] RNIC:[ UP ] Version:[ 2.0.0 ]

4. Compare the version number reported against the latest version available and upgrade if necessary. The upgrade procedure is explained in section 5.3.2 below.

5.3.1 Verify V-IPU server and IPU-M2000 version compatibility

The V-IPU server installed on the server must match the V-IPU client which is part of the IPU-M2000 system software. If both are updated to the latest versions, they will match automatically.

To verify that the versions of the V-IPU server and IPU-M2000 system software are

compatible:

1. Log in as ipuuser on the server

2. Run the following commands to check that the V-IPU server installation is consistent:

vipu --version

vipu --server-version

The output from the V-IPU binary and the server version should match:

$ vipu --version

v1.12.2

$ vipu --server-version

version: v1.12.2

host: vipu-server-host:8090


If there is a mismatch between the versions, the V-IPU server package should be

reinstalled.

3. Run the following command to check for a mismatch between the V-IPU server and the

IPU-M2000 system software:

$ vipu-admin create agent ag01 --host 10.1.2.1

create agent (ag01): failed: version mismatch: vipu-server(v1.12.2) <->

ag01(v1.11.1).
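
If either check reveals a mismatch, the V-IPU server can be reinstalled by re-running the installer from the unpacked V-IPU server release, as described in section 5.1.2. A minimal sketch (the release directory path is a placeholder):

cd <vipu-server-release-dir>
sudo ./install.sh --yes -a eno1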

The V-IPU Admin Guide contains much more information and should be consulted for further details; it is available here. The V-IPU User Guide is also useful and can be found here. Make sure the document version selected on these pages matches the first two (major.minor) version numbers of the V-IPU software being used.


5.3.2 Software upgrade of IPU-M2000 units

If the IPU-M2000 system software version is not the latest available, or an incompatibility is shown between the IPU-M2000 system software and the V-IPU server, it may be necessary to update the IPU-M2000 system software.

Note

The software upgrade process can currently NOT be run while ML jobs are running, since the install process reboots the IPU-M2000s once complete.

Having unpacked the software release onto the server’s file system, follow these steps:

1. First check that all IPU-M2000s are up and available by running:

rack_tool status

This will also show which rack_config.json is in use and, therefore, which IPU-M2000s will be upgraded when the upgrade command is run in the next step.

2. Trigger the upgrade by running:

rack_tool upgrade

3. If an upgrade fails on a particular machine, it can be restarted for that machine only by running:

rack_tool upgrade --select a01

where a01 refers to a name specified in the rack_config.json file.

rack_tool will read a default config file to learn how to access the IPU-M2000s.

The default location of this file is: $HOME/.rack_tool/rack_config.json.

The upgrade process will take several minutes and all the IPU-M2000s will be upgraded in

parallel.

The upgrade process reboots each IPU-M2000 multiple times during the upgrade to activate the different software components.

Finally, rack_tool verifies that the upgrade has completed with all sub-components upgraded to the same version. To confirm that the upgrade procedure has been followed correctly, check that “rack_tool status” reports an installed version corresponding to the one defined in the release notes.
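
A minimal post-upgrade check, combining the commands introduced above:

# confirm that all IPU-M2000s report the expected version
rack_tool status

# confirm that the V-IPU server and IPU-M2000 system software versions still match
vipu --server-version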


5.4 rack_tool

rack_tool is a utility that is supplied with the IPU-M2000 system software pack. It is always installed under the account that performs IPU resource management (by default, the ipuuser account).

rack_tool is used for the following:

a) Installing system software on a single IPU-M2000 or on all IPU-M2000s

b) Querying the version of all IPU-M2000s

c) Running connectivity tests on the RDMA data-plane ports and the GW and BMC management-plane ports of all IPU-M2000s listed in rack_tool’s default config file

d) Restarting the IPU-M2000 GW and BMC in different ways (power cycling or OS reboot)

e) Controlling power on/off for the GW part of the IPU-M2000

f) Running commands on several IPU-M2000s for troubleshooting

g) Updating the “root overlay” files on all IPU-M2000s if the NTP or syslog server has changed

h) Running hardware and connectivity tests (see section 6 for the built-in self-test

capabilities)

The supported options will evolve over time, so please refer to the official help menu (rack_tool --help) or the accompanying man pages for the rack_tool version installed on your system. The currently supported options are listed below:


5.4.1 Synopsis

rack_tool.py [--version] [--help] <command> [<args>]

[--ipum <name bmc-ip gw-ip bmc-username bmc-password gw-username gw-password>]

[--config-file <path-to-config file>]

[--global-credentials <bmc-username bmc-password gw-username gw-password>]

rack_tool.py upgrade [--help] [--gw-root-overlay <path-to-root-overlay>]

rack_tool.py bist [--help]

rack_tool.py vipu-test [--help] [--vipu-path <path-to-vipu-binaries>]

rack_tool.py status [--help] [--no-color]

rack_tool.py hostname [--help]

rack_tool.py install-key [--help]

rack_tool.py update-root-overlay [--help] [--overlay <path-to-root-overlay>]

rack_tool.py run-command [--help] -c <command> -d <device>

<device> is: gw|bmc

rack_tool.py bmc-factory-reset [--help]

rack_tool.py power-off [--help] [--hard]

rack_tool.py power-on [--help]

rack_tool.py power-cycle [--help]

rack_tool.py logging-server [--help] -a address -p port -d <device>

<device> is: gw|bmc


5.4.2 Options

--ipum, --global-credentials and --config-file

These options have to be provided after the command parameter.

-v, --version

Print the version of the tool

-h, --help

Prints the synopsis and a list of all available commands. --help can also be given

after a command to show individual help for each command.

--ipum <name bmc-ip gw-ip bmc-username bmc-password gw-username gw-password>

Option to manually define which machines to operate on instead of using a config file. Several machines can be selected by passing the --ipum option several times.

Example: rack_tool.py upgrade --ipum machine1 10.1.1.1 10.1.1.2 root password

itadmin password

--global-credentials <bmc-username bmc-password gw-username gw-password>

Option to set a common set of login details for the machines selected with the --ipum option. If this option is used, the username and password parameters of the --ipum option can be omitted.

Example: rack_tool.py upgrade --ipum machine1 10.1.1.1 10.1.1.2 --ipum

machine2 10.1.2.1 10.1.2.2 --global-credentials root password itadmin password

--config-file <config file>

Config file with information about the list of IPU-M2000s to connect to. The config file will be ignored if the --ipum parameter is given.

Example: rack_tool.py upgrade --config-file my_config_file.json


5.4.3 Commands

upgrade Start upgrade of all machines.

bist Run a built-in self-test per machine that checks that most components on the board are available and functional.

vipu-test Run Virtual-IPU connectivity and cabling tests. This command overwrites the V-IPU database, whose previous contents are then lost. This is not a problem during installation, but it will be once the V-IPU Controller has a site-specific database.

status Show network connectivity status and SW versions for IPU-M2000s in

a rack.

hostname Set hostname on GW and BMC.

install-key Install the current user's public SSH key on all machines.

update-root-overlay Copy all files in ~/.rack_tool/root-overlay to all machines.

run-command Run a command on a device on all machines.

bmc-factory-reset Perform a factory reset of the BMC on all machines.

power-off Power off the GW and IPUs on all machines.

power-on Power on the GW and IPUs on all machines.

power-cycle Power cycle the GW and IPUs on all machines.

logging-server Set the logging server on a device on all machines.
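
As an illustrative sketch of the run-command and logging-server forms shown in the synopsis (the command string, address and port below are hypothetical examples):

# run a command on the GW of every machine in the default rack_config.json
rack_tool run-command -c "uptime" -d gw

# point the BMC logging output of every machine at a logging server
rack_tool logging-server -a 10.1.3.10 -p 514 -d bmc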

5.4.4 Exit status

0 Successful program execution.

1 Unsuccessful program execution.


5.4.5 Files and directories

GW root overlay

This is a directory (~/.rack_tool/root-overlay, i.e. /home/ipuuser/.rack_tool/root-overlay for the ipuuser account) containing all files that should be copied to the GW after an upgrade. It makes it possible to create site-specific files that persist on the GW after an upgrade. The content of this directory is copied to the GW automatically if its location is given either as an object in the rack config file or as an input parameter. The structure of the directory should mirror the root file system on the GW.

Examples of files that are useful to keep in the GW root overlay are those that relate to external services, such as NTP and syslog configuration files.

rack_config.json file format

rack_config.json is a JSON file that rack_tool uses to determine how to connect to all the machines in a rack. It consists of one mandatory object, "machines", and two optional objects, "global_credentials" and "gw_root_overlay".

global_credentials

This is an object that holds the login details of the BMC and GW. The object has the

following key/value pairs:

"global_credentials": {

"bmc_username": "<username>",

"bmc_passwd": "<password>",

"gw_username": "<username>",

"gw_passwd": "<password>"

}

gw_root_overlay

This object is a key/value pair that points to the location of the GW root overlay.

"gw_root_overlay": "~/.rack_tool/root-overlay",

machines

This is an array of machine objects that holds information about each machine in the

rack. Each machine object consists of the following key/value pairs:

"machines": [

{

"name": "ipum1”,

"bmc_ip": "10.1.1.1",

"gw_ip": "10.1.2.1",

"mx_ip": "10.1.5.1"

},

{

Page 59: IPU-M2000 DIRECT ATTACH EARLY ACCESS

IPU-M2000 direct attach build and test guide – early access

59

"name": "ipum2”,

"bmc_ip": "10.1.1.2",

"gw_ip": "10.1.2.2",

"mx_ip": "10.1.5.5"

},

{

"name": "ipum3”,

"bmc_ip": "10.1.1.3",

"gw_ip": "10.1.2.3",

"mx_ip": "10.1.5.9"

},

{

"name": "ipum4”,

"bmc_ip": "10.1.1.4",

"gw_ip": "10.1.2.4",

"mx_ip": "10.1.5.13"

}

]

NOTE: The “machines” section of this JSON file for x4 product variants should look like the text above. For an x1 product variant, only the first IPU-M2000 entry should be included.
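
For reference, a complete rack_config.json for an x1 system might look like the sketch below; it combines the objects described above, assumes the default addressing scheme from section 4.4, and uses placeholder credentials that must be replaced with the factory defaults obtained from Graphcore support:

{
  "global_credentials": {
    "bmc_username": "<username>",
    "bmc_passwd": "<password>",
    "gw_username": "<username>",
    "gw_passwd": "<password>"
  },
  "gw_root_overlay": "~/.rack_tool/root-overlay",
  "machines": [
    {
      "name": "ipum1",
      "bmc_ip": "10.1.1.1",
      "gw_ip": "10.1.2.1",
      "mx_ip": "10.1.5.1"
    }
  ]
}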

~/.rack_tool/ This directory is the default location for configuration files.

~/.rack_tool/rack_config.json This is the rack configuration file for the rack.

~/.rack_tool/root-overlay/ This is the default directory for GW root overlay files.


System testing

6.1 Running system BISTs

There are two parts to testing the IPU-M2000 direct attach system: running built-in self-tests (BIST) and checking the system using the V-IPU management tool.

$ ./rack_tool.py bist - performs chassis hardware testing

$ ./rack_tool.py vipu-test - performs V-IPU connectivity tests

6.2 Troubleshooting

This section contains useful information about what to do if you encounter problems while

installing and testing the rack. If you can’t find the answer to your query here and are still

experiencing problems, then please contact your Graphcore representative or use the

resources on the Graphcore support portal: https://www.graphcore.ai/support.

6.2.1 Rack_tool testing

$ ./rack_tool.py bist

This test will generate a very low-level hardware verification report/log that will need to be

analyzed by Graphcore support in case any errors are reported. The logs are located at

“./maintenance_tools/logs” relative to the current directory from which the command

is executed.

The command prints “Done BIST on …” if the test is successful and “Failed BIST on …” if the test fails. In both cases the output points to the name of the generated log file.
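
To locate the report after a run (a minimal sketch; the exact log file names depend on the release), list the newest files in the log directory:

ls -lt ./maintenance_tools/logs | head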

6.2.2 V-IPU testing

$ ./rack_tool.py vipu-test

The following section is based on excerpts from the V-IPU Admin Guide which should be

consulted for a detailed and updated overview of BISTs. This guide is available here. The V-IPU

User Guide is also useful and can be found here. The collection of V-IPU connectivity tests can

be invoked by the ./rack_tool.py vipu-test command or by directly using V-IPU CLI

commands as described below.

The V-IPU Controller implements a cluster testing suite that runs a series of tests to verify

installation correctness. A V-IPU cluster test can be executed against a cluster entity before

any partitions are created. It is strongly recommended to run all the test types provided by the

cluster testing suite before deploying any applications in a cluster.

Assume we have created a cluster named “cluster1” formed by four IPU-M2000s (a complete IPU-M2000 x4 direct attach system) using the command:

vipu-admin create cluster cluster1 --agents ipum1,ipum2,ipum3,ipum4 --mesh

./rack_tool.py vipu-test will create V-IPU (VIRM) agents for each IPU-M2000 and automatically create this cluster by applying this command.
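
For reference, a sketch of the equivalent manual steps (the agent names and GW IP addresses below follow the default addressing scheme from section 4.4 and are illustrative):

vipu-admin create agent ipum1 --host 10.1.2.1
vipu-admin create agent ipum2 --host 10.1.2.2
vipu-admin create agent ipum3 --host 10.1.2.3
vipu-admin create agent ipum4 --host 10.1.2.4
vipu-admin create cluster cluster1 --agents ipum1,ipum2,ipum3,ipum4 --mesh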


The simplest way to run a complete cluster test for this cluster is to run ./rack_tool.py

vipu-test. The test performs the V-IPU self-tests shown below.

vipu-admin test cluster cluster1

Showing test results for cluster cluster1

Test Type | Duration | Passed | Summary

---------------------------------------------------------------------------

Version | 0.00s | 4/4 | All component versions are consistent

Cabling | 8.76s | 4/4 | All cables connected as expected

Sync-Link | 0.35s | 8/8 | Sync Link test passed

Link-Training | 20.16s | 76/76 | All Links Passed

Traffic | 42.00s | 1/1 | Traffic test passed

GW-Link | 0.00s | 0/0 | GW Link test skipped

The output above shows a successful test with no errors reported.

As the test results show, five test types were executed on “cluster1”. The results for each test

type are printed one per line in the output. Each test type tested zero or more elements of the

cluster as can be seen from the “Passed” column. Each test type is explained in detail in the

rest of this section.

Note that the vipu-test command blocks the CLI until the cluster test is completed and may

take several minutes to finish. To avoid blocking the CLI for prolonged periods of time, cluster

tests can be executed asynchronously with the --start, --status and --stop options.
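
A sketch of the asynchronous form, assuming the options are passed to the test cluster command as named above:

vipu-admin test cluster cluster1 --start    # start the test and return immediately
vipu-admin test cluster cluster1 --status   # show progress and the latest results
vipu-admin test cluster cluster1 --stop     # stop a running test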

Depending on how the cluster is created, some of the link tests will be omitted. In the above example the V-IPU GW-Link test is skipped since the GW-Link is not used in an IPU-M2000 direct attach system.

Errors discovered during testing will be similar to those shown below. The error text will, where possible, indicate which ports are relevant to the problem detected. The port numbers used are aligned with the connector numbering schemes described earlier in this document.

When a cluster test is running, some restrictions are imposed on the actions an administrator

can perform to the system:

• Partition creation in a cluster where a test is in progress is forbidden.

• Removal of a cluster where a test is in progress is forbidden.

• Only one cluster test can be running at any given time on a V-IPU server, even if the V-

IPU server controls more than one cluster.

• Cluster test results are not persisted. Only the results of the last test can be retrieved with the --status option, and only as long as the V-IPU server has not been restarted.


IPU-Link cabling test:

The cabling test can be used to verify that the external IPU-Link cables are connected and properly inserted as expected in a cluster. The cabling test reads the serial ID of the OSFP cable at each end of a link and verifies that the cable connects the expected ports together.

Cabling tests are invoked by passing the --cabling flag to the test cluster command.

If the test fails, details of which connections failed are displayed. This gives the user a hint about which cables to physically inspect and correct; very often, a loose cable is the root cause of the problem. Below is an example of a test run where the four OSFP cables between ipum1 and ipum2 in the cluster are not connected.

$ vipu-admin test cluster cluster1 --cabling

Showing test results for cluster cluster1

Test Type | Duration | Passed | Summary

------------------------------------------------------------------------------

Cabling | 21.77s | 8/12 | ipum1 (IPU-Cluster Port 5) x--> ipum2 (IPU-Cluster

port 11) (cable not connected)

| | | ipum1 (IPU-Cluster Port 6) x--> ipum2 (IPU-Cluster

port 12) (cable not connected)

| | | ipum1 (IPU-Cluster Port 7) x--> ipum2 (IPU-Cluster

port 13) (cable not connected)

| | | ipum1 (IPU-Cluster Port 8) x--> ipum2 (IPU-Cluster

port 14) (cable not connected)

------------------------------------------------------------------------------

This is an indication of either faulty cabling or an incorrect cluster definition that doesn't reflect the intended cabling.


Sync-Link test:

The Sync-Link test verifies the external Sync-Link cabling between IPU-M2000s. You can run a

Sync-Link test by passing the --sync option to the test cluster command.

A failing Sync-Link test reports the cables that do not satisfy the cluster topology being tested, by pointing to the IPU-M2000s and Sync-Link port numbers of each failing Sync-Link. In the example below, two Sync-Link cables between “ipum1” and “ipum2” fail:

- the link between “ipum1” Sync-Link port 6 and “ipum2” Sync-Link port 2

- the link between “ipum1” Sync-Link port 7 and “ipum2” Sync-Link port 3

This is an indication of either faulty cabling or an incorrect cluster definition that doesn't reflect the intended cabling.

$ vipu-admin test cluster cluster1 --sync

Showing test results for cluster cluster1

Test Type | Duration | Passed | Summary

------------------------------------------------------

Sync-Link | 0.90s | x/y | Failed Sync Links:

| | | ipum1:6 <--> 2:ipum2

| | | ipum1:7 <--> 3:ipum2

------------------------------------------------------

test (cluster): failed: Some tests failed.

IPU-Link training test:

The IPU-Link training test verifies IPU-Link readiness for IPU-Links between and within IPU-

M2000s (OSFP cables). An IPU-Link test can be invoked with the --ipulink option in the test

cluster command. A failing test will indicate which IPU-Links are failing by pointing to the

agent and cluster port enumeration of the failing IPU-Link. In the following example, we test a

cluster where the IPU-Links have been disconnected between the first and second IPU-M2000

units.

$ vipu-admin test cluster cluster1 --ipulink

Showing test results for cluster cluster1

Test Type | Duration | Passed | Summary

-----------------------------------------------------------------------------------

IPU-Link | 34.57s | x/y | Failed Links

| | | ipum1:4 [pending g1x1] <--> ipum2:8 [pending g1x1]

| | | ipum1:3 [pending g1x1] <--> ipum2:7 [pending g1x1]

| | | ipum1:2 [pending g1x1] <--> ipum2:6 [pending g1x1]

| | | ipum1:1 [pending g1x1] <--> ipum2:5 [pending g1x1]

-----------------------------------------------------------------------------------

test (cluster): failed: Some tests failed.


IPU-Link traffic test:

The traffic test acts as a smoke test for all IPU-Links of a cluster before deploying applications.

The traffic test can be invoked with the --traffic option. Note that for a traffic test to pass, a

prerequisite is that the IPU-Link and IPU-Link training tests have passed.

$ vipu-admin test cluster cluster1 --traffic

Test | Duration | Passed | Summary

---------------------------------------------------------------------------------

Traffic | 92.23s | 3/4 | Traffic test failed

| | | Errors encountered in traffic test 1

| | | corrected link errors: 460

| | | - error counter IPU-Link 1 in ipum1, IPU '1' is 250

| | | - error counter IPU-Link 1 in ipum4, IPU '1' is 210

---------------------------------------------------------------------------------

test cluster (cluster1): failed: Some tests failed.

This example shows a situation where the IPU-Link traffic test has failed because too many correctable errors were detected. Should this occur, please try reseating the IPU-Link cables associated with the referenced IPU-M2000 units. If that does not resolve the issue, please contact Graphcore support.


Revision history

This document’s revision history is as follows:

Version | Date               | Notes
1.0     | 18th December 2020 | Initial release


Legal notices

This document is confidential and is provided subject to one or more confidentiality

obligations between you and Graphcore, including a Non-Disclosure Agreement. The

information disclosed to you hereunder (the “Materials”) is provided solely for the selection

and use of Graphcore products. To the maximum extent permitted by applicable law: (1)

Materials are made available "AS IS" and with all faults, Graphcore hereby DISCLAIMS ALL

WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY, INCLUDING BUT NOT

LIMITED TO WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR FITNESS FOR ANY

PARTICULAR PURPOSE; and (2) Graphcore shall not be liable (whether in contract or tort,

including negligence, or under any other theory of liability) for any loss or damage of any kind

or nature related to, arising under, or in connection with, the Materials (including your use of

the Materials), including for any direct, indirect, special, incidental, or consequential loss or

damage (including loss of data, profits, goodwill, or any type of loss or damage suffered as a

result of any action brought by a third party) even if such damage or loss was reasonably

foreseeable or Graphcore had been advised of the possibility of the same. Graphcore assumes

no obligation to correct any errors contained in the Materials or to notify you of updates to

the Materials or to product specifications. You may not reproduce, modify, distribute, or

publicly display the Materials without prior written consent. Certain products are subject to

the terms and conditions of Graphcore’s limited warranty. Graphcore products are not

designed or intended to be fail-safe or for use in any application requiring fail-safe

performance; you assume sole risk and liability for use of Graphcore products in such critical

applications.

Trademarks & copyright

Graphcore® and Poplar® are Registered Trademarks of Graphcore Ltd.

Colossus™, IPU-Core™, In-Processor-Memory™, Exchange Memory™, Streaming Memory™,

IPU-Tile™, IPU-Exchange™, IPU-Machine™, IPU-POD™, IPU-Link™, Virtual-IPU™, AI-Float™, IPU-

Fabric™ and PopVision™ are Trademarks of Graphcore Ltd.

All other trademarks are the property of their respective owners.

© Copyright 2020, Graphcore Ltd.