
Flux for PBS Users
HPC 105

Dr. Charles J Antonelli, LSAIT ARS

August, 2013


Flux

Flux is a university-wide shared computational discovery / high-performance computing service.

Interdisciplinary

Provided by Advanced Research Computing at U-M (ARC)

Operated by CAEN HPC

Hardware procurement, software licensing, billing support by U-M ITS

Used across campus

Collaborative since 2010:

Advanced Research Computing at U-M (ARC)

College of Engineering’s IT Group (CAEN)

Information and Technology Services

Medical School

College of Literature, Science, and the Arts

School of Information


http://arc.research.umich.edu/resources-services/flux/


The Flux cluster

Login nodes, compute nodes, storage, and a data transfer node


Flux node

12 Intel cores

48 GB RAM

Local disk

Ethernet and InfiniBand interconnects


Flux Large Memory node

40 Intel cores

1 TB RAM

Local disk

Ethernet and InfiniBand interconnects

Flux hardware

Standard Flux: 8,016 Intel cores, 632 nodes, 48/64 GB RAM per node, 4 GB RAM per core (allocated)

Large Memory Flux: 200 Intel cores, 5 nodes, 1 TB RAM per node, 25 GB RAM per core (allocated)

4X InfiniBand network (interconnects all nodes): 40 Gbps, < 2 µs latency

Latency an order of magnitude less than Ethernet

Lustre Filesystem

Scalable, high-performance, open

Supports MPI-IO for MPI jobs

Mounted on all login and compute nodes


Flux software

Licensed software:

http://cac.engin.umich.edu/resources/software/flux-software et al.

Compilers & libraries: Intel, PGI, GNU

OpenMP

OpenMPI


Using Flux

Three basic requirements to use Flux:

1. A Flux account
2. An MToken (or a Software Token)
3. A Flux allocation


Using Flux

1. A Flux account

Allows login to the Flux login nodes

Develop, compile, and test code

Available to members of U-M community, free

Get an account by visiting https://www.engin.umich.edu/form/cacaccountapplication


Flux Account Policies

To qualify for a Flux account:

You must have an active institutional role on the Ann Arbor campus

Not a Retiree or Alumni role

Your uniqname must have a strong identity type (not a friend account)

You must be able to receive email sent to [email protected]

You must have run a job in the last 13 months

http://cac.engin.umich.edu/resources/systems/user-accounts


Using Flux

2. An MToken (or a Software Token)

Required for access to the login nodes

Improves cluster security by requiring a second means of proving your identity

You can use either an MToken or an application for your mobile device (called a Software Token) for this

Information on obtaining and using these tokens at http://cac.engin.umich.edu/resources/login-nodes/tfa


Using Flux

3. A Flux allocation

Allows you to run jobs on the compute nodes

Current rates (through June 30, 2016): $18 per core-month for Standard Flux

$24.35 per core-month for Large Memory Flux

$8 cost-share per core-month for LSA, Engineering, and Medical School

Details at http://arc.research.umich.edu/resources-services/flux/flux-pricing/

To inquire about Flux allocations please email [email protected]


Flux Allocations

To request an allocation, send email to [email protected] with:

the type of allocation desired (Regular or Large-Memory)

the number of cores needed

the start date and number of months for the allocation

the shortcode for the funding source

the list of people who should have access to the allocation

the list of people who can change the user list and augment or end the allocation

http://arc.research.umich.edu/resources-services/flux/managing-a-flux-project/


Flux Allocations

An allocation specifies resources that are consumed by running jobs

Explicit core count

Implicit memory usage (4 or 25 GB per core)

When any resource is fully in use, new jobs are blocked

An allocation may be ended early (on a monthly anniversary)

You may have multiple active allocations; jobs draw resources from all active allocations


lsa_flux Allocation

LSA funds a shared allocation named lsa_flux

Usable by anyone in the College

60 cores

For testing, experimentation, and exploration; not for production runs

Each user limited to 30 concurrent jobs

https://sites.google.com/a/umich.edu/flux-support/support-for-users/lsa_flux
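For example, a batch job could be submitted against this shared allocation like so (a minimal sketch; myjob.pbs is a hypothetical script name):

qsub -A lsa_flux -l qos=flux -q flux myjob.pbs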


Monitoring Allocations

Visit https://mreports.umich.edu/mreports/pages/Flux.aspx

Select your allocation from the list at upper left; you'll see all allocations you can submit jobs against

Four sets of outputs:

Allocation details (start & end date, cores, shortcode)

Financial overview (cores allocated vs. used, by month)

Usage summary table (core-months by user and month); drill down for individual job run data

Usage charts (by user)

Details & screenshots: http://arc.research.umich.edu/resources-services/flux/check-my-flux-allocation/


Storing data on Flux

Lustre filesystem mounted on /scratch on all login, compute, and transfer nodes

640 TB of short-term storage for batch jobs

Pathname depends on your allocation and uniqname, e.g., /scratch/lsa_flux/cja

Can share through UNIX groups (see the example below)

Large, fast, short-term: data deleted 60 days after allocation expires

http://cac.engin.umich.edu/resources/storage/flux-high-performance-storage-scratch
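A sketch of sharing a scratch directory through UNIX groups (the group name mygroup and the shared subdirectory are hypothetical; the path follows the example above):

chgrp -R mygroup /scratch/lsa_flux/cja/shared
chmod -R g+rX /scratch/lsa_flux/cja/shared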

NFS filesystems mounted on /home and /home2 on all nodes

80 GB of storage per user for development & testing

Small, slow, long-term


Storing data on Flux

Flux does not provide large, long-term storage

Alternatives:

LSA Research Storage

ITS Value Storage

Departmental server

CAEN HPC can mount your storage on the login nodes

Issue the df -kh command on a login node to see what other groups have mounted


Storing data on Flux

LSA Research Storage

2 TB of secure, replicated data storage, available to each LSA faculty member at no cost

Additional storage available at $30/TB/yr

Turn in existing storage hardware for additional storage

Request by visiting https://sharepoint.lsait.lsa.umich.edu/Lists/Research%20Storage%20Space/NewForm.aspx?RootFolder=

Authenticate with Kerberos login and password

Select NFS as the method for connecting to your storage


Copying data to Flux

Using the transfer host:

rsync -avz /your/cluster1/directory flux-xfer.engin.umich.edu:newdirname

rsync -avz /your/cluster1/directory flux-xfer.engin.umich.edu:/scratch/youralloc/youruniqname

Or use scp, sftp, WinSCP, Cyberduck, FileZilla
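For example, an scp transfer into scratch might look like this (a sketch; the allocation and uniqname are placeholders, as above):

scp -r /your/cluster1/directory [email protected]:/scratch/youralloc/youruniqname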

http://cac.engin.umich.edu/resources/login-nodes/transfer-hosts


Globus Online

Features:

High-speed data transfer, much faster than SCP or SFTP

Reliable & persistent

Minimal client software: Mac OS X, Linux, Windows

GridFTP Endpoints

Gateways through which data flow

Exist for XSEDE, OSG, …

UMich: umich#flux, umich#nyx

Add your own server endpoint: contact [email protected]

Add your own client endpoint!

More information: http://cac.engin.umich.edu/resources/login-nodes/globus-gridftp


Connecting to Flux

ssh flux-login.engin.umich.edu

Login with token code, uniqname, and Kerberos password

You will be randomly connected to a Flux login node (currently flux-login1 or flux-login2)

Do not run compute- or I/O-intensive jobs here; processes are killed automatically after 30 minutes

Firewalls restrict access to flux-login. To connect successfully, either:

Physically connect your ssh client platform to the U-M campus wired or MWireless network, or

Use VPN software on your client platform, or

Use ssh to log in to an ITS login node (login.itd.umich.edu), and ssh to flux-login from there
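A sketch of that last, two-hop option (youruniqname is a placeholder; authenticate at each hop as usual):

ssh [email protected]
ssh flux-login.engin.umich.edu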


Lab 1

Task: Use the multicore package

The multicore package allows you to use multiple cores on the same node

module load R

Copy sample code to your login directory:
cd
cp ~cja/hpc-sample-code.tar.gz .
tar -zxvf hpc-sample-code.tar.gz
cd ./hpc-sample-code

Examine Rmulti.pbs and Rmulti.R

Edit Rmulti.pbs with your favorite Linux editor

Change #PBS -M email address to your own
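The edited directive in Rmulti.pbs might then read (the address shown is only a placeholder):

#PBS -M [email protected]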


Lab 1

Task: Use the multicore package

Submit your job to Flux:
qsub Rmulti.pbs

Watch the progress of your job:
qstat -u uniqname

where uniqname is your own uniqname

When complete, look at the job's output:
less Rmulti.out


Lab 2

Task: Run an MPI job on 8 cores

Compile c_ex05:
cd ~/cac-intro-code
make c_ex05

Edit the file run with your favorite Linux editor

Change the #PBS -M address to your own

I don’t want Brock to get your email!

Change #PBS -A allocation to FluxTraining_flux, or to your own allocation, if desired

Change #PBS -l allocation to flux

Submit your job:
qsub run


PBS resources (1)

A resource request (-l) can specify:

Request wallclock (that is, running) time:
-l walltime=HH:MM:SS

Request C MB of memory per core:
-l pmem=Cmb

Request T MB of memory for the entire job:
-l mem=Tmb

Request M cores on arbitrary node(s):
-l procs=M

Request a token to use licensed software:
-l gres=stata:1
-l gres=matlab
-l gres=matlab%Communication_toolbox
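Putting several of these together, a job script preamble might look like the following sketch (the job name, email address, and allocation are placeholders, not required values):

#PBS -N my_analysis
#PBS -l walltime=02:00:00
#PBS -l procs=8
#PBS -l pmem=4000mb
#PBS -A youralloc_flux
#PBS -l qos=flux
#PBS -q flux
#PBS -M [email protected]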


PBS resources (2)

A resource request (-l) can specify:

For multithreaded code:

Request M nodes with at least N cores per node:
-l nodes=M:ppn=N

Request M cores with exactly N cores per node (note the difference vis-à-vis ppn syntax and semantics!):
-l nodes=M,tpn=N
(you'll only use this for specific algorithms)
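For instance, a multithreaded job wanting 2 nodes with at least 8 cores on each could request (a sketch):

#PBS -l nodes=2:ppn=8
#PBS -l walltime=01:00:00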


Interactive jobs

You can submit jobs interactively:

qsub -I -V -l procs=2 -l walltime=15:00 -A youralloc_flux -l qos=flux -q flux

This queues a job as usual; your terminal session will be blocked until the job runs

When it runs, you will be connected to one of your nodes

Invoked serial commands will run on that node

Invoked parallel commands (e.g., via mpirun) will run on all of your nodes

When you exit the terminal session your job is deleted

Interactive jobs allow you to:

Test your code on cluster node(s)

Execute GUI tools on a cluster node with output on your local platform’s X server

Utilize a parallel debugger interactively


Lab 3

Task: Compile and execute an MPI program on a compute node

Copy sample code to your login directory:
cd
cp ~brockp/cac-intro-code.tar.gz .
tar -xvzf cac-intro-code.tar.gz
cd ./cac-intro-code

Start an interactive PBS session:
qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux

On the compute node, compile & execute MPI parallel code:
cd $PBS_O_WORKDIR
mpicc -O3 -ipo -no-prec-div -xHost -o c_ex01 c_ex01.c
mpirun -np 2 ./c_ex01


Lab 4

Task: Run Matlab interactively

module load matlab

Start an interactive PBS session:
qsub -I -V -l procs=2 -l walltime=30:00 -A FluxTraining_flux -l qos=flux -q flux

Run Matlab in the interactive PBS session:
matlab -nodisplay


The Scheduler (1/3)

Flux scheduling policies:

The job’s queue determines the set of nodes you run on

flux, fluxm

The job’s account determines the allocation to be charged

If you specify an inactive allocation, your job will never run

The job’s resource requirements help determine when the job becomes eligible to run

If you ask for unavailable resources, your job will wait until they become free

There is no pre-emption


The Scheduler (2/3)

Flux scheduling policies:

If there is competition for resources among eligible jobs in the allocation or in the cluster, two things help determine when you run:

How long you have waited for the resource

How much of the resource you have used so far

This is called “fairshare”

The scheduler will reserve nodes for a job with sufficient priority

This is intended to prevent starving jobs with large resource requirements


The Scheduler (3/3)

Flux scheduling policies:

If there is room for shorter jobs in the gaps of the schedule, the scheduler will fit smaller jobs in those gaps

This is called “backfill”

(Diagram: job schedule plotted as cores vs. time, with smaller jobs backfilled into the gaps)


Job monitoring

There are several commands you can run to get some insight into your jobs' execution:

freenodes : shows the number of free nodes and cores currently available

mdiag -a youralloc_name : shows resources defined for your allocation and who can run against it

showq -w acct=yourallocname: shows jobs using your allocation (running/idle/blocked)

checkjob jobid : Can show why your job might not be starting

showstart -e all jobid : Gives you a coarse estimate of job start time; use the smallest value returned
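Example invocations (the job id shown is hypothetical; lsa_flux is the shared allocation described earlier):

freenodes
showq -w acct=lsa_flux
checkjob 12345678
showstart -e all 12345678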


Job Arrays

• Submit copies of identical jobs
• Invoked via qsub -t:

qsub -t array-spec pbsbatch.txt

Where array-spec can be

m-n

a,b,c

m-n%slotlimit

e.g.

qsub -t 1-50%10
Fifty jobs, numbered 1 through 50; only ten can run simultaneously

• $PBS_ARRAYID records the array identifier
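A sketch of a job array script body that uses $PBS_ARRAYID to pick a per-task input file (the program and file names are hypothetical):

cd $PBS_O_WORKDIR
./my_program input.$PBS_ARRAYID > output.$PBS_ARRAYID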


Dependent scheduling

• Submit jobs whose execution scheduling depends on other jobs

• Invoked via qsub -W:
qsub -W depend=type:jobid[:jobid]…

Where type can be

after Schedule after jobids have started

afterok Schedule after jobids have finished, only if no errors

afternotok Schedule after jobids have finished, only if errors

afterany Schedule after jobids have finished, regardless of status

Inverted semantics for before, beforeok, beforenotok, beforeany
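A sketch of chaining two jobs so the second runs only if the first completes without errors (the script names are hypothetical; qsub prints the id of the job it submits):

JOBID=$(qsub preprocess.pbs)
qsub -W depend=afterok:$JOBID analyze.pbs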


Some Flux Resources

http://arc.research.umich.edu/resources-services/flux/

U-M Advanced Research Computing Flux pages

http://cac.engin.umich.edu/
CAEN HPC Flux pages

http://www.youtube.com/user/UMCoECAC
CAEN HPC YouTube channel

For assistance: [email protected]

Read by a team of people including unit support staff

Cannot help with programming questions, but can help with operational Flux and basic usage questions


Any Questions?

Charles J. Antonelli
LSAIT Advocacy and Research Support
[email protected]
http://www.umich.edu/~cja
734 763 0607
