Confidential – Internal Use Only

Scyld ClusterWare System Administration


Orientation Agenda – Part 1

Scyld ClusterWare foundations

» Booting process

• Startup scripts

• File systems

• Name services

» Cluster Configuration

Cluster Components

» Networking infrastructure

» NFS File servers

» IPMI Configuration

Break


Orientation Agenda – Part 2

Parallel jobs

» MPI configuration

» Infiniband interconnect

Queuing

» Initial setup

» Tuning

» Policy case studies

Other software and tools

Troubleshooting

Questions and Answers



Cluster Virtualization Architecture Realized

Minimal in-memory OS with a single daemon, rapidly deployed in seconds; no disk required

» Less than 20 seconds

Virtual, unified process space enables intuitive single sign-on and job submission

» Effortless job migration to nodes

Monitor & manage efficiently from the Master

» Single System Install

» Single Process Space

» Shared cache of the cluster state

» Single point of provisioning

» Better performance due to lightweight nodes

» No version skew, which is inherently more reliable

[Diagram: master node connected to compute nodes (optional disks) via the interconnection network; the master also faces the Internet or an internal network]

Manage & use a cluster like a single SMP machine


Elements of Cluster Systems

Some important elements of a cluster system

» Booting and Provisioning

» Process creation, monitoring and control

» Update and consistency model

» Name services

» File Systems

» Physical management

» Workload virtualization



Booting and Provisioning

Integrated, automatic network boot

Basic hardware reporting and diagnostics in the Pre-OS stage

Only CPU, memory and NIC needed

Kernel and minimal environment from master

Just enough to say “what do I do now?”

Remaining configuration driven by master

Logs are stored in:

» /var/log/messages

» /var/log/beowulf/node.*
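To follow a node's boot from the master, a quick sketch using the log locations above (node 0 as the example):

  tail -f /var/log/beowulf/node.0   # watch node 0's boot progress
  tail /var/log/messages            # head-node syslog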


DHCP and TFTP services

Started from /etc/rc.d/init.d/beowulf

» Locates vmlinuz in /boot

» Configures syslog and other parameters on the head node

» Loads kernel modules

» Sets up libraries

» Creates the ramdisk image for compute nodes

» Starts the DHCP/TFTP server (beoserv)

» Configures NAT for IP forwarding if needed

» Starts the kickback name service daemon (4.2.0+)

» Tunes the network stack


Compute Node Boot Process

Starts with /etc/beowulf/node_up

» Calls /usr/lib/beoboot/bin/node_up

• Usage: node_up <nodenumber>

• Sets up:

– System date

– Basic network configuration

– Kernel modules (device drivers)

– Network routing

– setup_fs

– Name services

– chroot

– Prestages files (4.2.0+)

– Other init scripts in /etc/beowulf/init.d



Subnet configuration

The default used to be a class C network

» netmask 255.255.255.0

» Limited to 155 compute nodes ( 100 + $NODE < 255 )

» Last octet denotes special devices

• x.x.x.10 switches

• x.x.x.30 storage

» Infiniband is a separate network

• x.x.1.$(( 100 + $NODE ))

» Needed eth0:1 to reach IPMI network

• x.x.2.$(( 100 + $NODE ))

• /etc/sysconfig/network-scripts/ifcfg-eth0:1

• ifconfig eth0:1 10.54.2.1 netmask 255.255.255.0
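Under this scheme a node's addresses follow from simple arithmetic; a small sketch (the 10.54 prefix matches the examples used throughout this deck):

  NODE=4
  echo "10.54.0.$(( 100 + NODE ))"   # compute node n4 on the cluster network
  echo "10.54.1.$(( 100 + NODE ))"   # its Infiniband address
  echo "10.54.2.$(( 100 + NODE ))"   # its IPMI address, reached via eth0:1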


Subnet configuration

The new standard is a class B network

» netmask 255.255.0.0

» Limited to 100 * 256 compute nodes

• 10.54.50.x – 10.54.149.x

» Third octet denotes special devices

• x.x.10.x switches

• x.x.30.x storage

» Infiniband is a separate network

• x.$(( x+1)).x.x

» IPMI is on the same network (eth0:1 not needed)

• x.x.150.$NODE



Setup_fs

Script is in /usr/lib/beoboot/bin/setup_fs

Configuration file: /etc/beowulf/fstab

» # Select which FSTAB to use.
  if [ -r /etc/beowulf/fstab.$NODE ] ; then
      FSTAB=/etc/beowulf/fstab.$NODE
  else
      FSTAB=/etc/beowulf/fstab
  fi
  echo "setup_fs: Configuring node filesystems using $FSTAB..."

$MASTER is determined and populated

“nonfatal” option allows compute nodes to finish boot process and log errors in /var/log/beowulf/node.*

NFS mounts of external servers need to be specified by IP address because name services have not been configured yet
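As an illustration, a hypothetical /etc/beowulf/fstab combining local and NFS entries; note the nonfatal option and the IP address (not a hostname) for the external server:

  # hypothetical compute node fstab
  /dev/sda1         swap      swap  defaults           0 0
  /dev/sda2         /scratch  ext2  defaults,nonfatal  0 0
  $MASTER:/home     /home     nfs   defaults,nonfatal  0 0
  10.54.30.0:/data  /data     nfs   defaults,nonfatal  0 0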


beofdisk

Beofdisk configures partition tables on compute nodes

» To configure the first drive:

  • bpsh 0 fdisk /dev/sda

  – Typical interactive usage

» Query the partition table:

  • beofdisk -q --node 0

» Write partition tables to the other nodes:

  • for i in $(seq 1 10); do beofdisk -w --node $i ; done

» Create devices initially, using the head node's /dev/sd* as reference:

  [root@scyld beowulf]# ls -l /dev/sda*
  brw-rw---- 1 root disk 8, 0 May 20 08:18 /dev/sda
  brw-rw---- 1 root disk 8, 1 May 20 08:18 /dev/sda1
  brw-rw---- 1 root disk 8, 2 May 20 08:18 /dev/sda2
  brw-rw---- 1 root disk 8, 3 May 20 08:18 /dev/sda3
  [root@scyld beowulf]# bpsh 0 mknod /dev/sda1 b 8 1
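To replicate those device nodes across a range of compute nodes in one pass, a sketch (major/minor numbers mirror the head node listing above):

  for n in $(seq 0 9); do
      bpsh $n mknod /dev/sda1 b 8 1
      bpsh $n mknod /dev/sda2 b 8 2
      bpsh $n mknod /dev/sda3 b 8 3
  done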


Create local filesystems

After partitions have been created, run mkfs

» bpsh -an mkswap /dev/sda1

» bpsh -an mkfs.ext2 /dev/sda2

• ext2 is a non-journaled filesystem, faster than ext3 for a scratch file system

• If corruption occurs, simply mkfs again

Copy the int18 bootblock if needed:

» bpcp /usr/lib/beoboot/bin/int18_bootblock $NODE:/dev/sda

/etc/beowulf/config options for file system creation:

» # The compute node file system creation and consistency checking policies.
  fsck full
  mkfs never



Name services

/usr/lib/beoboot/bin/node_up populates /etc/hosts and /etc/nsswitch.conf on compute nodes

beo name service determines values from /etc/beowulf/config file

bproc name service determines values from current environment

‘getent’ can be used to query entries

» getent netgroup cluster

» getent hosts 10.54.0.1

» getent hosts n3

If system-config-authentication is run, ensure that proper entries still exist in /etc/nsswitch.conf (head node)


BeoNSS Hostnames

Opportunity: we control IP address assignment

» Assign node IP addresses in node order

» Reduces name lookup to simple addition

» Master: 10.54.0.1
  GigE Switch: 10.54.10.0
  IB Switch: 10.54.11.0
  NFS/Storage: 10.54.30.0
  Nodes: 10.54.50.$node

Name format

» Cluster hostnames have the base form n<N>

» Options for admin-defined names and networks

Special names for "self" and "master"

» Current machine is ".-2" or "self".

» Master is known as ".-1", “master”, “master0”

[Diagram: master node (".-1"/"master") and compute nodes n0–n5 on the interconnection network]


Changes

Prior to 4.2.0

» Hostnames default to .<NODE> form

» /etc/hosts had to be populated with alternative names and IP addresses

» May break @cluster netgroup and hence NFS exports

» /etc/passwd and /etc/group needed on compute nodes for Torque

4.2.0+

» Hostnames default to n<NODE> form

» Configuration is driven by /etc/beowulf/config and beoNSS

» Username and groups can be provided by kickback daemon for Torque



ClusterWare Filecache functionality

Provided by filecache kernel module

Configured by /etc/beowulf/config libraries directives

Dynamically controlled by ‘bplib’

Capabilities exist in all ClusterWare 4 versions

» 4.2.0 adds the prestage keyword in /etc/beowulf/config

» Prior versions needed additional scripts in /etc/beowulf/init.d

For libraries listed in /etc/beowulf/config, files can be prestaged by running ‘md5sum’ on the file

» # Prestage selected libraries. The keyword is generic, but the current
  # implementation only knows how to "prestage" a file that is open'able on
  # the compute node: through the libcache, across NFS, or already exists
  # locally (which isn't really a "prestaging", since it's already there).
  prestage_libs=`beoconfig prestage`
  for libname in $prestage_libs ; do
      # failure isn't always fatal, so don't use run_cmd
      echo "node_up: Prestage file:" $libname
      bpsh $NODE md5sum $libname > /dev/null
  done



Compute node init.d scripts

Located in /etc/beowulf/init.d

Scripts run on the head node and need explicit bpsh and beomodprobe calls to operate on compute nodes

$NODE has been prepopulated by /usr/lib/beoboot/bin/node_up

Order is based on file name

» Numbered files can be used to control order

beochkconfig is used to set the +x bit on files, which controls whether a script runs (see the sketch below)
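A minimal sketch of such a script; the name 25sysctl and the sysctl value are hypothetical, but the pattern (runs on the head node, acts on the compute node via bpsh with the prepopulated $NODE) is the one described above:

  #!/bin/sh
  # hypothetical /etc/beowulf/init.d/25sysctl
  # $NODE is set by /usr/lib/beoboot/bin/node_up before this runs
  bpsh $NODE sysctl -w vm.min_free_kbytes=16384
  exit 0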


Cluster Configuration

/etc/beowulf/config is the central location for cluster configuration

Features are documented in ‘man beowulf-config’

Compute node order is determined by ‘node’ parameters

Changes can be activated by doing a ‘service beowulf reload’
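For orientation, an illustrative fragment of /etc/beowulf/config (values are examples, not defaults; ‘man beowulf-config’ is authoritative):

  interface eth0                  # cluster-facing NIC on the master
  node 00:11:22:33:44:55          # node order follows these entries
  node 00:11:22:33:44:66
  libraries /lib64 /usr/lib64     # cached out to compute nodes
  prestage /usr/lib64/libmpi.so   # 4.2.0+ prestage keyword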





Remote Filesystems

Remote - Share a single disk among all nodes

» Every node sees same filesystem

» Synchronization mechanisms manage changes

» Locking has either high overhead or causes serial blocking

» "Traditional" UNIX approach

» Relatively low performance

» Doesn't scale well; server becomes bottleneck in large systems

» Simplest solution for small clusters, reading/writing small files



NFS Server Configuration

Head node NFS services

» Configuration in /etc/exports

» Provides system files (/bin, /usr/bin)

» Increase number of NFS daemons

• echo 'RPCNFSDCOUNT=64' >> /etc/sysconfig/nfs ; service nfs restart (append with >> so existing settings are preserved)

Dedicated NFS server

» SLES10 was recommended; RHEL5 now includes some xfs support

• xfs has better performance

• OS has better IO performance than RHEL4

» Network trunking can be used to increase bandwidth (with caveats)

» Hardware RAID

• Adaptec RAID card

  – CTRL-A at boot

  – arcconf utility from http://www.adaptec.com/en-US/support/raid/

» External storage (Xyratex or nStor)

• SAS-attached

• Fibre channel attached
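For the head-node exports mentioned above, a hypothetical /etc/exports using the @cluster netgroup that beoNSS provides (run ‘exportfs -ra’ after editing):

  /home  @cluster(rw,sync,no_root_squash)
  /opt   @cluster(ro)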


Network trunking

Use multiple physical links as a single pipe for data

» Configuration must be done on host and switch

SLES 10 configuration

» Create a configuration file /etc/sysconfig/network/ifcfg-bond0 for the bond0 interface

» BOOTPROTO=static
  DEVICE=bond0
  IPADDR=10.54.30.0
  NETMASK=255.255.0.0
  STARTMODE=onboot
  MTU=''
  BONDING_MASTER=yes
  BONDING_SLAVE_0=eth0
  BONDING_SLAVE_1=eth1
  BONDING_MODULE_OPTS='mode=0 miimon=500'
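After bringing the interface up, the Linux bonding driver's status file confirms the mode and slave state:

  ifup bond0
  cat /proc/net/bonding/bond0   # shows bonding mode, MII status, and each slave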


Network trunking

HP switch configuration

» Create trunk group via serial or telnet interface

Netgear (admin:password)

» Create trunk group via http interface

Cisco

» Create etherchannel configuration


External Storage

Xyratex arrays have a configuration interface

» Text based via serial port

» Newer devices (nStor 5210, Xyratex F/E 5402/5412/5404) have embedded StorView

• http://storage0:9292

  – admin:password

» RAID arrays, logical drives are configured and monitored

• LUNs are numbered and presented on each port. Highest LUN is the controller itself

• Multipath or failover needs to be configured


Need for QLogic Failover

Collapse LUN presentation in OS to a single instance per LUN

Minimize the potential for user error while maintaining failover and static load balancing


Physical Management

ipmitool

» Intelligent Platform Management Interface (IPMI) is integrated into the baseboard management controller (BMC)

» Serial-over-LAN (SOL) can be implemented

» Allows access to hardware such as sensor data or power states

» E.g. ipmitool -H n$NODE-ipmi -U admin -P admin power {status,on,off}

bpctl

» Controls the operational state and ownership of compute nodes

» Examples might be to reboot or power off a node

• Reboot: bpctl -S all -R

• Power off: bpctl -S all -P

» Limit user and group access to run on a particular node or set of nodes
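A sketch of that per-node access control, assuming bpctl's user and group options (-u and -g; the user and group names are hypothetical):

  bpctl -S 4 -u alice   # only user alice may run on node 4
  bpctl -S 4 -g hpc     # likewise restrict by group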


IPMI Configuration

Full spec is available here:

» http://www.intel.com/design/servers/ipmi/pdf/IPMIv2_0_rev1_0_E3_markup.pdf

Penguin Specific configuration

» Recent products all have IPMI implementations. Some are in-band (share physical media with eth0), some are out-of-band (separate port and cable from eth0)

• Altus 1300, 600, 650 – In-band, lan channel 6

• Altus 1600, 2600, 1650, 2650; Relion 1600, 2600, 1650, 2650, 2612 – Out-of-band, lan channel 2

• Relion 1670 – In-band, lan channel 1

• Altus x700/x800, Relion x700 – Out-of-band OR in-band, lan channel 1

Some ipmitool versions have a bug and need the following command to commit a write

» bpsh $NODE ipmitool raw 12 1 $CHANNEL 0 0
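As an illustration, setting a node's BMC LAN parameters from the master with ipmitool's standard lan set subcommands ($CHANNEL per the product list above; the addressing is an example):

  bpsh $NODE ipmitool lan set $CHANNEL ipsrc static
  bpsh $NODE ipmitool lan set $CHANNEL ipaddr 10.54.2.$(( 100 + NODE ))
  bpsh $NODE ipmitool lan set $CHANNEL netmask 255.255.255.0
  bpsh $NODE ipmitool raw 12 1 $CHANNEL 0 0   # commit, for the buggy versions noted above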


Orientation Agenda – Part 2

Parallel jobs

» MPI configuration

» Infiniband interconnect

Queuing

» Initial setup

» Tuning

» Policy case studies

Other software and tools

Questions and Answers


Explicitly Parallel Programs

Different paradigms exist for parallelizing programs

» Shared memory

» OpenMP

» Sockets

» PVM

» Linda

» MPI

Most distributed parallel programs are now written using MPI

» Different options for MPI stacks: MPICH, OpenMPI, HP, and Intel

» ClusterWare comes integrated with customized versions of MPICH and OpenMPI


Compiling MPICH programs

mpicc, mpiCC, mpif77, mpif90 are used to automatically compile code and link in the correct MPI libraries from /usr/lib64/MPICH

» GNU, PGI, and Intel compilers are supported

The wrappers effectively set the library and include paths for compiling and linking

» prefix="/usr"
  part1="-I${prefix}/include"
  part2=""
  part3="-lmpi -lbproc"
  …
  part1="-L${prefix}/${lib}/MPICH/p4/gnu $part1"
  …
  $cc $part1 $part2 $part3
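A typical compile-and-run sequence with the wrappers (hello.c stands in for your MPI source):

  mpicc -o hello hello.c   # links the MPICH libraries automatically
  mpirun -np 4 ./hello     # launch with 4 processes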


Running MPICH programs

mpirun is used to launch MPICH programs

If Infiniband is installed, the interconnect fabric can be chosen using the machine flag:

» -machine p4

» -machine vapi

» Done by changing LD_LIBRARY_PATH at runtime

• export LD_LIBRARY_PATH=${libdir}/MPICH/${MACHINE}/${compiler}:${LD_LIBRARY_PATH}

» Hooks for using mpiexec for the queue system

• elif [ -n "${PBS_JOBID}" ]; then
      for var in NP NO_LOCAL ALL_LOCAL BEOWULF_JOB_MAP
      do
          unset $var
      done
      for hostname in `cat $PBS_NODEFILE`
      do
          NODENUMBER=`getent hosts ${hostname} | awk '{print $3}' | tr -d '.'`
          BEOWULF_JOB_MAP="${BEOWULF_JOB_MAP}:${NODENUMBER}"
      done
      # Clean a leading : from the map
      export BEOWULF_JOB_MAP=`echo ${BEOWULF_JOB_MAP} | sed 's/^://g'`
      # The -n 1 argument is important here
      exec mpiexec -n 1 ${progname} "$@"


Environment Variable Options

Additional environment variable control:

» NP — The number of processes requested, but not the number of processors. As in the example earlier in this section, NP=4 ./a.out will run the MPI program a.out with 4 processes.

» ALL_CPUS — Set the number of processes to the number of CPUs available to the current user. Similar to the example above, ALL_CPUS=1 ./a.out would run the MPI program a.out on all available CPUs.

» ALL_NODES — Set the number of processes to the number of nodes available to the current user. Similar to the ALL_CPUS variable, but you get a maximum of one CPU per node. This is useful for running a job per node instead of per CPU.

» ALL_LOCAL — Run every process on the master node; used for debugging purposes.

» NO_LOCAL — Don’t run any processes on the master node.

» EXCLUDE — A colon-delimited list of nodes to be avoided during node assignment.

» BEOWULF_JOB_MAP — A colon-delimited list of nodes. The first node listed will be the first process (MPI Rank 0) and so on.
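Putting a few of these together (a.out as the example program, as above):

  NP=8 NO_LOCAL=1 ./a.out          # 8 processes, none on the master
  EXCLUDE=2:3 NP=4 ./a.out         # 4 processes, avoiding nodes 2 and 3
  BEOWULF_JOB_MAP=0:0:1:1 ./a.out  # ranks 0-1 on node 0, ranks 2-3 on node 1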


Running MPICH programs

Prior to ClusterWare 4.1.4, mpich jobs were spawned outside of the queue system

» BEOWULF_JOB_MAP had to be set based on machines listed in $PBS_NODEFILE

• number_of_nodes=`cat $PBS_NODEFILE | wc -l`
  hostlist=`cat $PBS_NODEFILE | head -n 1`
  for i in $(seq 2 $number_of_nodes) ; do
      hostlist=${hostlist}:`cat $PBS_NODEFILE | head -n $i | tail -n 1`
  done
  BEOWULF_JOB_MAP=`echo $hostlist | sed 's/\.//g' | sed 's/n//g'`
  export BEOWULF_JOB_MAP

Starting with ClusterWare 4.1.4, mpiexec was included with the distribution. mpiexec is an alternative spawning mechanism that starts processes as part of the queue system

Other MPI implementations have alternatives. HP-MPI and Intel MPI use rsh and run outside of the queue system. OpenMPI uses libtm to properly start processes


MPI Primer

Only a brief introduction to MPI is provided here. Many other in-depth tutorials are available on the web and in published sources.

» http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html

» http://www.llnl.gov/computing/tutorials/mpi/

Paradigms for writing parallel programs depend upon the application

» SIMD (single-instruction multiple-data)

» MIMD (multiple-instruction multiple-data)

» MISD (multiple-instruction single-data)

SIMD will be presented here as it is a commonly used template

» A single application source is compiled to perform operations on different sets of data

» The data is read by the different threads or passed between threads via messages (hence MPI = message passing interface)

• Contrast this with shared memory or OpenMP, where data is shared locally via memory

• Optimizations in the MPI implementation can perform localhost optimization; however, the program is still written using a message passing construct

The MPI specification has many functions; however, most MPI programs can be written with only a small subset


Infiniband Primer

Infiniband provides a low-latency, high-bandwidth interconnect for message passing, minimizing I/O overhead for tightly coupled parallel applications

Infiniband requires hardware, kernel drivers, O/S support, userland drivers, and application support

Prior to 4.2.0, the software stack was provided by SilverStorm

Starting with 4.2.0, ClusterWare migrated to using the OpenFabrics (ofed, openIB) stack


Infiniband Subnet Manager

Every Infiniband network requires a Subnet Manager to discover and manage the topology

» Our clusters typically ship with a Managed QLogic Infiniband switch with an embedded subnet manager (10.54.0.20; admin:adminpass)

» Subnet Manager is configured to start at switch boot

» Alternatively, a software Subnet Manager (e.g. openSM) can be run on a host connected to the Infiniband fabric.

» Typically the embedded subnet manager is more robust and provides a better experience


Communication Layers

Verbs API (VAPI) provides a hardware-specific interface to the transport media

» Any program compiled with VAPI can only run on the same hardware profile and drivers

» Makes portability difficult

Direct Access Programming Language (DAPL) provides a more consistent interface

» DAPL layers can communicate with IB, Myrinet, and 10GigE hardware

» Better portability for MPI libraries

TCP/IP interface

» Another upper-layer protocol provides IP-over-IB (IPoIB), where the IB interface is assigned an IP address and most standard TCP/IP applications work


MPI Implementation Comparison

MPICH is provided by Argonne National Laboratory

» Runs only over Ethernet

Ohio State University has ported MPICH to use the Verbs API => MVAPICH

» Similar to MPICH but uses Infiniband

LAM-MPI was another implementation which provided a more modular format

OpenMPI is the successor to LAM-MPI and has many options

» Can use different physical interfaces and spawning mechanisms

» http://www.openmpi.org

HP-MPI, Intel-MPI

» Licensed MPICH2 code and added functionality

» Can use a variety of physical interconnects


OpenMPI Configuration

./configure --prefix=/opt/openmpi --with-udapl=/usr --with-tm=/usr --with-openib=/usr --without-bproc --without-lsf_bproc --without-grid --without-slurm --without-gridengine --without-portals --without-gm --without-loadleveler --without-xgrid --without-mx --enable-mpirun-prefix-by-default --enable-static

make all

make install

Create scripts in /etc/profile.d to set default environment variables for all users

mpirun -v -mca pls_rsh_agent rsh -mca btl openib,sm,self -machinefile machinefile ./IMB-MPI1
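The profile.d scripts mentioned above can be a minimal sketch like this (the path matches the --prefix from the configure line; the file name is hypothetical):

  # /etc/profile.d/openmpi.sh
  export PATH=/opt/openmpi/bin:$PATH
  export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH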


Queuing

How are resources allocated among multiple users and/or groups?

» Statically by using bpctl user and group permissions

» ClusterWare supports a variety of queuing packages

• TaskMaster (advanced MOAB policy-based scheduler integrated with ClusterWare)

• Torque

• SGE


Interacting with TaskMaster

Because TaskMaster uses the MOAB scheduler with Torque pbs_server and pbs_mom components, all of the Torque commands are still valid

» qsub will submit a job to Torque; MOAB then polls pbs_server to detect new jobs

» msub will submit a job to MOAB, which then pushes the job to pbs_server

Other TaskMaster commands

» qstat -> showq

» qdel, qhold, qrls -> mjobctl

» pbsnodes -> showstate

» qmgr -> mschedctl, mdiag

» Configuration in /opt/moab/moab.cfg


Torque Initial Setup

‘/usr/bin/torque.setup root’ can be used to start with a clean slate

» This will delete any current configuration that you have

» qmgr -c 'set server keep_completed=300'
  qmgr -c 'set server query_other_jobs=true'
  qmgr -c 'set server operators += [email protected]'
  qmgr -c 'set server managers += [email protected]'

/var/spool/torque/server_priv/nodes stores node information

» n0 np=8 prop1 prop2

» qterm -t quick
  Edit /var/spool/torque/server_priv/nodes
  service pbs_server start

/var/spool/torque/sched_priv/sched_config configures default FIFO scheduler

/var/spool/torque/mom_priv/config configures the pbs_mom daemons

» Copied out to compute nodes by /etc/beowulf/init.d/torque
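A quick smoke test once pbs_server and the moms are up (assumes the default batch queue exists):

  echo hostname | qsub   # submit a trivial job from stdin
  qstat -a               # watch it queue and run
  pbsnodes -a | head     # confirm nodes report a usable state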


TaskMaster Initial Setup

Edit configuration in /opt/moab/moab.cfg

» SCHEDCFG[Scyld] MODE=NORMAL SERVER=scyld.localdomain:42559

• Ensure the hostname is consistent with ‘hostname’

» ADMINCFG[1] USERS=root

• Add additional users who can be queue managers

» RMCFG[base] TYPE=PBS

• TYPE=PBS integrates with a traditional Torque configuration


Tuning

Default walltime can be set in Torque using:

» qmgr -c 'set queue batch resources_default.walltime=16:00:00'

If many small jobs need to be submitted, uncomment the following in /opt/moab/moab.cfg

» JOBAGGREGATIONTIME 10

To exactly match node and processor requests, add the following to /opt/moab/moab.cfg

» JOBNODEMATCHPOLICY EXACTNODE

Changes in /opt/moab/moab.cfg can be activated by doing a ‘service moab restart’


Case Studies

Case Study #1

» Multiple queues for interactive, high priority, and standard jobs

Case Study #2

» Different types of hardware configuration

» Setup with FairShare

• http://www.clusterresources.com/products/mwm/moabdocs/5.1.1priorityoverview.shtml

• http://www.clusterresources.com/products/mwm/moabdocs/5.1.2priorityfactors.shtml


Troubleshooting

Log files

» /var/log/messages

» /var/log/beowulf/node*

» /var/spool/torque/server_logs

» /var/spool/torque/mom_logs

» /opt/moab/log

Diagnostic commands

» qstat -f

» tracejob

» mdiag

» strace -p

» gdb


Hardware Maintenance

pbsnodes -o n0: mark the node offline and allow jobs to drain

bpctl -S 0 -s unavailable: prevent interactive user commands from running on the node

Wait until the node is idle

bpctl -S 0 -P: power off the node

Perform maintenance

Power on node

pbsnodes -c n0: clear the offline flag, returning the node to service (the full cycle is sketched below)
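The whole cycle as a sketch (node 0 / n0, matching the commands above; the IPMI power-on assumes the naming and credentials from the IPMI slide):

  pbsnodes -o n0              # offline in Torque; queued work drains
  bpctl -S 0 -s unavailable   # block interactive use via bproc
  # ...wait until the node is idle, then:
  bpctl -S 0 -P               # power off
  # ...perform maintenance, then power back on, e.g.:
  ipmitool -H n0-ipmi -U admin -P admin power on
  pbsnodes -c n0              # clear offline; node accepts jobs again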


Questions??