LLNL-PRES-729302
This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under contract DE-AC52-07NA27344. Lawrence Livermore National Security, LLC
Accounts, Access, User Environment Topics
Blaise Barney
Livermore Computing
Development Environment Group
April 19, 2017
Accounts
▪ LLNL and Collaborators:
— easy - just go to https://lc-idm.llnl.gov as usual
— OCF: add resource Ray (CZ), RZManta (RZ)
— SCF: add resource Shark
▪ LANL and Sandia:
— also easy - go to sarape.sandia.gov as usual
— LLNL resources: Ray, RZManta and Shark (depending on clearance/citizenship)
— Sponsor: Greg Tomaschke, [email protected], 925-423-0561
▪ PSAAP centers:
— go to sarape.sandia.gov as usual
— LLNL resource: Ray
— Sponsor: Blaise Barney, [email protected], 925-422-2578
Allocations
▪ Currently, everyone is simply placed into a "guests" account/group.
▪ LC staff are also placed into an "lcstaff" account/group.
▪ Eventually, LC will establish a real set of accounts and allocations.
▪ Expect things to behave differently though - the LSF batch system has replaced Moab/SLURM.
▪ TIP: setting this environment variable (for now) will help avoid jobs being rejected in case you forget to specify a group:

setenv LSB_DEFAULT_USERGROUP guests     (csh/tcsh)
export LSB_DEFAULT_USERGROUP=guests     (bash/sh)
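The tip above can be exercised directly. A minimal sketch (bash syntax; the `bsub -G` option is LSF's standard way of naming a user group explicitly on the submit line, and "./myjob" is a hypothetical job script):

```shell
# Set the default LSF user group so jobs are not rejected
# when you forget to name a group at submit time.
export LSB_DEFAULT_USERGROUP=guests

echo "Default LSF user group: $LSB_DEFAULT_USERGROUP"

# Equivalent explicit form on the submit line (illustrative):
#   bsub -G guests ./myjob
```

csh/tcsh users would use `setenv LSB_DEFAULT_USERGROUP guests` instead, typically in a login dot file so it persists across sessions.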
Access
▪ Ray (CZ):
— Accessible directly from within the LLNL domain
— Not currently accessible directly from outside LLNL
• Need to login to another CZ machine first, then ssh to Ray
• This will change later, after required security measures are in place
▪ RZManta (RZ):
— Accessible only through rzgw.llnl.gov - same as other RZ systems
— LANL/Sandia: need to start from an "ihpc" node. Instructions are at: https://hpc.llnl.gov/manuals/access-lc-systems/logging
▪ Shark (SCF):
— Accessible directly from anywhere within the SCF
— LANL/Sandia: Kerberos authentication - same as other SCF machines. No password/token required.
— Note: as of today's presentation, Shark is not yet available
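Until direct external access to Ray is enabled, the extra hop can be automated with an ssh ProxyCommand. A sketch of an ~/.ssh/config fragment - the host names are illustrative assumptions (substitute the CZ machine and login node you actually use), not an LC-published recipe:

```
# Hypothetical ~/.ssh/config fragment - host names are illustrative
Host ray
    HostName ray.llnl.gov
    # Hop through a CZ machine reachable from outside LLNL first
    ProxyCommand ssh -W %h:%p quartz.llnl.gov
```

With a fragment like this in place, a plain "ssh ray" performs both hops transparently.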
User Environment Topics
▪ Expectations
▪ Big differences
▪ Like TOSS 3 (kinda)
▪ Beta software environment
▪ File systems
▪ Modules, dotkits
▪ Compilers (covered later)
▪ MPI (covered later)
▪ Running jobs & LSF batch system (covered later)
▪ Software
▪ Math libraries
▪ HPSS Storage, FIS
▪ Miscellaneous
▪ Documentation and getting help
Setting Expectations
▪ Although Power8 and Pascal hardware are not brand new, putting them together is - and that is especially true for the software.
▪ There will be a "learning curve" for all involved: vendors, LC staff and users alike.
▪ Much of the software is "beta" level, and some is still being developed as we speak.
▪ Expect some growing pains: unplanned outages, planned outages, bugs, reboots, changes (some with little notice), instabilities, performance issues, etc.
▪ What you might typically expect from new systems...and more!
▪ LC is interested in your feedback - we're in this together!
Big Differences
                          Typical LC Linux Cluster       CORAL EA Cluster
Hardware                  Intel Xeon                     IBM Power8 + NVIDIA Pascal
Multi-threading (CPU)     2 hardware threads per core    8 hardware threads per core
Peak flops (% GPU)        0%                             97%
Job scheduler             SLURM / Moab                   IBM Spectrum LSF
Parallel file systems     Lustre                         IBM Spectrum Scale (GPFS)
Compilers                 Intel, GNU, PGI, Clang         IBM XL, Clang (GNU, PGI, xlflang)
MPI                       MVAPICH, Open MPI, Intel       IBM Spectrum MPI
Packages                  dotkit, Tcl modules            Lmod modules
NVRAM SSD                 No                             Yes (Ray only)
Job launcher              srun                           mpirun (jsrun beta coming soon)
TOSS 3 Like Environment (kinda)
▪ Same OS - Red Hat Enterprise Linux Server release 7.3
▪ /usr/tce/ will be used instead of /usr/local for compilers, MPI, tools, packages, etc.
— Currently /usr/tcetmp is being used but that will transition to /usr/tce later
▪ Lmod modules are used to load software environments
▪ However, CORAL EA systems do not run true TOSS 3 software:

ray23% echo $SYS_TYPE
blueos_3_ppc64le_ib
ray23% distro_version
blueos 3.0-0

quartz2306% echo $SYS_TYPE
toss_3_x86_64_ib
quartz2306% distro_version
toss 3.0-2.1
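Because $SYS_TYPE differs between the two environments, build scripts that run on both can branch on it. A minimal sketch - the SYS_TYPE values are the ones shown above, but the compiler choices are illustrative, not an LC recommendation:

```shell
#!/bin/sh
# Map an LC SYS_TYPE string to a default compiler (illustrative mapping).
cc_for_sys_type() {
    case "$1" in
        blueos_*)  echo xlc ;;   # CORAL EA (Ray, RZManta, Shark): IBM XL
        toss_3_*)  echo icc ;;   # TOSS 3 clusters (e.g. Quartz): Intel
        *)         echo cc  ;;   # anything else: the system compiler
    esac
}

# In a real build script you would call: cc_for_sys_type "$SYS_TYPE"
echo "blueos_3_ppc64le_ib -> $(cc_for_sys_type blueos_3_ppc64le_ib)"
echo "toss_3_x86_64_ib    -> $(cc_for_sys_type toss_3_x86_64_ib)"
```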
Beta Software Environment
▪ Much of the software on LC's CORAL EA systems is beta release.
▪ So, until GA-level software is installed, any performance results from applications/benchmarks running on these systems are not publishable without official review from IBM and/or NVIDIA.
— Questions? Contact Rob Neely ([email protected]) or Bronis de Supinski
▪ Changing rapidly, possibly with little notice.
▪ clang-coral and xl beta compiler releases every few weeks
▪ More beta software on the way:
— Job launcher beta (jsrun) - will replace mpirun
— Burst buffers
— Cluster Systems Manager (CSM)
File Systems
CORAL EA systems mount the usual LC file systems. The only significant difference is that these systems use IBM's Spectrum Scale product for parallel file systems instead of Lustre. Available file systems are summarized below.

▪ Home directories (/g/g0 - /g/g99): backed up, not purged. 16 GB quota; safest file system; includes .snapshot online backups.
▪ Workspace (/usr/workspace/ws*): not backed up, not purged. 1 TB quota; includes .snapshot online backups.
▪ Local tmp (/tmp, /usr/tmp, /var/tmp): not backed up, purged. Node local temporary file space; small; actually resides in node memory, not physical disk.
▪ NFS tmp (/nfs/tmp2): not backed up, purged. Large NFS mounted temporary file space; shared by all users and multiple clusters.
▪ Collaboration (/usr/gapps, /usr/gdata, /collab/usr/gapps, /collab/usr/gdata): backed up, not purged. User managed application directories; intended for collaborative development and usage.
▪ Parallel (/p/gscratchr on ray, /p/gscratchrzm on rzmanta, /p/gscratch# on shark - TBD): not backed up, purged. Intended for parallel I/O; large; shared by all users on a cluster.
File Systems - IBM Spectrum Scale /p/gscratch*
▪ Sizes:
— Ray (/p/gscratchr): 1.3 PB
— RZManta (/p/gscratchrzm): 431 TB
— Shark (/p/gscratch#): TBA
▪ We expect that, from a user perspective, application interactions with this new parallel file system will be similar to Lustre.
▪ We also expect to learn about differences as we acquire experience.
▪ The oslic, rzslic and cslic clusters will eventually mount the respective gscratch file systems for convenience.
▪ IBM Spectrum Scale product information is available at: http://www-03.ibm.com/systems/storage/spectrum/scale/index.html
Modules, Dotkits
▪ As with TOSS 3 systems, Lmod modules are used for most software packages, such as compilers, MPI and tools.
— Dotkits have pretty much disappeared
— Users only need to know a few commands to effectively use modules
— The "ml" shorthand can be used instead of "module" - for example: "ml avail"
Command Description
module avail List available modules
module load package Load a selected module
module list Show modules currently loaded
module unload package Unload a previously loaded module
module purge Unload all loaded modules
module reset Reset loaded modules to system defaults
module display package Display the contents of a selected module
module spider List all modules (not just available ones)
module keyword key Search for available modules by keyword
module, module help Get help
More on Modules
▪ Simple example - see what's loaded by default, see what's available, load a selected module, check again:
ray23% module list
Currently Loaded Modules:
1) xl/beta-2017.04.11 2) spectrum-mpi/2017.04.03 3) StdEnv
ray23% module avail
---------- /usr/tcetmp/modulefiles/Compiler/xl/beta-2017.04.11 ------
spectrum-mpi/2017.04.03 (L)
----------------------- /usr/tcetmp/modulefiles/Core -------------------------
StdEnv (L) makedepend/1.0.5
clang/coral-2017.03.15 pgi/16.10
clang/coral-2017.03.29 (D) pgi/17.1
clang/3.9.1 pgi/17.3 (D)
cmake/3.7.2 totalview/2016.07.22
gcc/4.8-redhat totalview/2017.0.12 (D)
gcc/4.9.3 (D) xl/beta-2017.03.28
git/2.9.3 xl/beta-2017.04.11 (L,D)
gmake/4.2.1 xl/2016.12.02
gsl/2.3
---------------- /usr/share/lmod/lmod/modulefiles/Core -------------------
lmod/6.5 settarg/6.5
Where:
L: Module is loaded
D: Default Module
Use "module spider" to find all possible modules.
Use "module keyword key1 key2 ..." to search for all possible
modules matching any of the "keys".
ray23% module load clang/coral-2017.03.15
Lmod is automatically replacing "xl/beta-2017.03.28" with
"clang/coral-2017.03.15"
Due to MODULEPATH changes the following have been reloaded:
1) spectrum-mpi/2017.04.03
ray23% module list
Currently Loaded Modules:
1) StdEnv 2) clang/coral-2017.03.15 3) spectrum-mpi/2017.04.03
More on Modules
▪ CORAL EA systems pre-load certain modules into your environment
— Important to know for selecting your choice of compiler. For example:

ray23% module list
Currently Loaded Modules:
  1) xl/beta-2017.04.11   2) spectrum-mpi/2017.04.03   3) StdEnv

▪ LC employs module hierarchies for some packages:
— loading module A will cause modules B and C to become available
▪ Module families are also used:
— only one package from each family may be loaded at once
— if compiler A is loaded, and then compiler B, compiler A will be unloaded
▪ A number of modules have default versions - designated by a (D) next to the module name. For example:

totalview/2016.07.22
totalview/2017.0.12 (D)

"module load totalview" will select the (D) version
More on Modules
▪ More on Lmod modules:
https://www.tacc.utexas.edu/research-development/tacc-projects/lmod
http://lmod.readthedocs.io/en/latest/index.html
▪ LC documentation:
https://lc.llnl.gov/confluence/display/TCE/Using+TOSS+3#UsingTOSS3-Modules
Software
▪ The most important software - compilers, MPI and tools - will be covered in detail later.
▪ CUDA 8.0 - installed under /usr/local - with links in /usr/tcetmp/packages for convenience. More info about NVIDIA software will be covered later.
▪ A small assortment of other software/utilities can be found in /usr/tcetmp/bin, /usr/tcetmp/packages or via "module spider".
▪ Visualization software (list is at https://hpc.llnl.gov/data-vis/vis-software): a subset of these packages will be ported to CORAL EA. Currently under evaluation.
▪ Software under /usr/gapps is owned and maintained by users - porting to CORAL EA systems will vary.
▪ Need something that's missing? Let us know (LC Hotline)...
Software - Math Libraries
▪ MASS - Mathematical Acceleration Subsystem Libraries
— From IBM: a set of C/C++ libraries of tuned mathematical intrinsic functions (scalar, vector, simd) that provide improved performance over the corresponding standard system math library functions.
▪ ESSL - Engineering and Scientific Subroutine Library
— IBM's ESSL is a collection of high-performance subroutines providing a wide range of mathematical functions for many different scientific and engineering applications. A subset of the functions contained in ESSL are tuned replacements for some of the functions provided in the BLAS and LAPACK libraries. C/C++ and Fortran.
▪ Installed under /usr/tcetmp/packages for convenience:
— NETLIB: BLAS, LAPACK, ScaLAPACK
— FFTW
— GSL - GNU Scientific Library - over 1000 functions; C/C++
▪ Documentation: https://lc.llnl.gov/confluence/display/CORALEA/Math+Libraries
▪ Other math software (matlab, mathematica) - no plans to port at this time
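A few illustrative link lines for the libraries above. The -lessl, -lmass and -lmassv flags are the library names IBM documents for ESSL and MASS; the install paths shown are hypothetical assumptions - check "module spider" or the CORAL EA wiki for the actual locations:

```
# Hypothetical compile/link lines - paths and versions are illustrative

# Link against ESSL for tuned BLAS/LAPACK replacements:
xlc myapp.c -o myapp -lessl

# Use the MASS scalar and vector math libraries with XL Fortran:
xlf myapp.f -o myapp -lmass -lmassv

# NETLIB LAPACK/BLAS installed under /usr/tcetmp/packages (path illustrative):
xlc myapp.c -o myapp -L/usr/tcetmp/packages/lapack/lib -llapack -lblas
```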
HPSS Storage, FIS
▪ CORAL EA systems do not currently have access to OCF/SCF HPSS storage
— Awaiting some infrastructure work - stay tuned
▪ FIS (File Interchange Service) - for moving files between OCF and SCF
— Available on CORAL EA systems
Miscellaneous
▪ Login nodes vs. compute nodes:
— As with other LC systems, login nodes are shared by multiple users and should not be used for compute intensive or parallel work.
— CORAL EA login nodes do NOT have GPUs - only the compute nodes do
▪ Spack - package management tool designed to support multiple versions and configurations of software on a wide variety of platforms and environments. For details see: http://spack.readthedocs.io/
▪ X-Win32 2012 users: if things aren't working right, you will probably need to get a more recent build (at least build 102) installed. Contact 4-HELP or your local desktop support person.
— Using LANDesk to download a more recent version may not work because it doesn't de-install the old version.
Documentation and Getting Help
▪ BEST place to get started for user information is the CORAL EA Systems confluence wiki page:
https://lc.llnl.gov/confluence/display/CORALEA/CORAL+EA+Systems
or just go to the LC confluence wiki https://lc.llnl.gov/confluence and search for "coral"
Documentation and Getting Help
▪ Best place to get started for user information:
https://lc.llnl.gov/confluence/display/CORALEA/CORAL+EA+Systems
— Includes information about IBM & NVIDIA hardware, compilers, MPI, running jobs & LSF batch system, tools, math libraries, user environment topics, quickstart guide + more
— Lots of links to in-depth information on related topics
— Also includes a general discussion blog and "symptoms & solutions"
• Your contributions are welcome!
• Questions or problems? Check here first to see if there's already an answer/solution
▪ Reporting problems, questions and getting help:
— The LC Hotline is available as the "front line" of support for CORAL EA systems - as with other LC systems: [email protected], (925) 422-4531
— Referrals to other LC staff, IBM and NVIDIA onsite reps
Documentation and Getting Help
▪ On-site IBM and NVIDIA support:
— To help address the challenges of adapting to new technologies delivered in the CORAL systems, IBM and NVIDIA provide dedicated full-time, on-site support for the duration of the CORAL contract. This support helps facilitate efficient interaction with IBM and NVIDIA technical engineering to resolve issues with the systems and software.
• System Administrator and Spectrum Scale (GPFS) Subject Matter Expert (James Lamb, IBM)
– Hardware and system software.
• NVIDIA Solutions Architect (Max Katz, NVIDIA)
– All things related to the use of the NVIDIA GPUs
• Application Analyst (Roy Musselman, IBM)
– Compilers, MPI, math libraries, IBM tools
— These experts are highly integrated into the Livermore Computing (LC) support teams: Development Environment Group (DEG) and System Administration Group (SAG), and are a supplemental part of the total support structure which includes the LC Hotline.
Documentation and Getting Help
▪ Sierra Center of Excellence (COE):
— Provides resources for ensuring application readiness for Sierra
— Includes development and vendor support from IBM and NVIDIA; Funded by ASC
— https://lc.llnl.gov/confluence/display/SCOE/Sierra+Center+of+Excellence+Home
• Excellent resource for information on Sierra related topics, workshops, presentations, etc.
• However, access is restricted due to NDA material
▪ Institutional Center of Excellence (iCOE) Project
— Complementary to the Sierra COE, but for institutional (M&IC) programs
— https://lc.llnl.gov/confluence/display/SCOE/Institutional+Center+of+Excellence+%28iCOE%29+Project
▪ Advanced Architecture and Portability Specialists (AAPS) team
— https://lc.llnl.gov/confluence/display/AAPS/Advanced+Architecture+and+Portability+Specialists
• Source of knowledge dissemination for work done with specific ASC / Tri-lab codes
▪ Sierra COE and iCOE projects: Talk to Rob Neely (COE) or Bert Still / Ian Karlin (iCOE) if you have any questions/issues.