IEEE HPDC 9 Conference
Software Configuration for Clusters in a Production HPC Environment
Doug Johnson, Jim Giuliani, and Troy Baer
Ohio Supercomputer Center
Introduction
• Linux clusters are becoming mature.
• No longer just for post-processing.
• Rich tool environment
• Third-party adoption
  – LS-Dyna
  – Fluent
  – Large choice of commercial compilers
  – Integrators
Tutorial Content
• Development Environment
  – User Environment
  – Compilers and Languages
  – Programming Models
• Application Performance Analysis
  – Non-intrusive
  – Intrusive
• System Management
  – Job Scheduling
  – System
User Environment
• Shell and Convenience Environment Variables
• Interface with Mass Storage
• Parallel Debuggers
• Languages and Compilers
Shell and Convenience Environment Variables
• Need to present users with a uniform, well-designed shell environment.
• Documentation of needed shell environment variables for different programs is critical for usability.
• Must support users' shell preferences; this is a personal thing.
  – Forcing a shell is akin to forcing vi or emacs.
• OSC has used a run-alike version of Cray modules, with mixed results.
  – Uniform environment.
  – Reliability problems.
• Convenience environment variables:
  – $TMPDIR, $USER
  – $MPI_FFLAGS, $MPI_C[XX]FLAGS, and $MPI_LIBS
  – Environment variables for compiling ScaLAPACK programs.
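As a sketch of how such convenience variables might look, a site-wide profile script could define them like this. The paths and library names below are hypothetical illustrations, not OSC's actual values:

```shell
# Hypothetical site profile fragment defining convenience variables for
# MPI compilation (paths and library names are illustrative only).
export MPI_FFLAGS="-I/usr/local/mpi/include"
export MPI_CFLAGS="-I/usr/local/mpi/include"
export MPI_LIBS="-L/usr/local/mpi/lib -lmpich"

# A user can then compile an MPI program without hard-coding paths:
compile_line="f77 $MPI_FFLAGS -o hello hello.f $MPI_LIBS"
echo "$compile_line"
```

The point is that when the MPI installation moves or changes, only the profile script is updated; user makefiles keep working unchanged.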
Interface with Mass Storage
• NFS support in Linux has followed a rocky development path.
• NFS Version 3 support is just now becoming stable.
  – Supports 64-bit filesystems
  – Network Lock Manager (NLM)
  – Asynchronous writes
• What is the hope of achieving good performance with NFS and Linux? The following plots can show us; first, a few tuning parameters.
  – Modified /proc/sys/net/core/rmem_max, wmem_max, rmem_default, and wmem_default:
      echo 2097152 > /proc/sys/net/core/[rw]mem_max
      echo 524288 > /proc/sys/net/core/[rw]mem_default
  – rmem_default and wmem_default set the defaults for SO_RCVBUF and SO_SNDBUF.
  – Warning: on limited-memory systems with many socket communications, these settings may cause memory pressure.
TCP Performance
[Plot: TCP stream performance, throughput (Megabits/second, 0-350) versus block size (0-9,000,000 bytes), for HIPPI, Gigabit Ethernet, and Fast Ethernet.]
TCP Performance
[Plot: TCP stream performance, throughput (Megabits/second, 0-300) versus block size (0-250,000 bytes), for HIPPI, Gigabit Ethernet, and Fast Ethernet.]
UDP Performance
./netperf -l 60 -H fe.ovl.osc.edu -i 10,2 -I 99,10 -t UDP_STREAM -- -m 1472 -s 32768 -S 32768

UDP UNIDIRECTIONAL SEND TEST to fe.ovl.osc.edu : +/-5.0% @ 99% conf.
Socket  Message  Elapsed  Messages
Size    Size     Time     Okay     Errors  Throughput
bytes   bytes    secs     #        #       10^6bits/sec
131070  1472     59.99    3229909  0       634.03
524288           59.99    2169706          425.91
Debugging and Parallel Programs
• Developing code always introduces bugs.
• Strategic print statements sometimes are not enough.
• Postmortem analysis.
  – debugger a.out core
• Parallel programs started in the same NFS directory may be a problem.
  – Multiple processes trying to dump to the same file.
• Kernel patch to make the name of the core file unique.
TotalView Parallel Debugger
• Available from http://www.etnus.com.
• Debugger for MPI, OpenMP, and threaded programs on many platforms.
  – Some features such as OpenMP and threads are not supported on some platforms.
[Diagram: MPI processes with send and receive buffers; tvdmain runs where MPI_Comm_rank = 0, a tvdsvr daemon otherwise.]
TotalView Debugger Installation
• Can be downloaded from the Etnus website.
• Will need to be installed on an NFS filesystem visible to all nodes, or installed in the same location on local disk on each node.
• Simple script-based install; follow the prompts.
• Environment variables are critical!
  – Must be present for rsh commands.
  – /etc/profile.d/[] will not be evaluated.
  – bash and bash2 will evaluate .bashrc
  – csh and tcsh will evaluate .cshrc
  – pdksh will not evaluate any “.” files.
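Since rsh-spawned shells only read the per-shell dot files listed above, the debugger's path has to be set there rather than in /etc/profile.d. A minimal ~/.bashrc sketch, with an assumed (not actual) install location:

```shell
# ~/.bashrc fragment: make TotalView visible to rsh-spawned bash shells.
# The install path below is an assumption for illustration.
TOTALVIEW_HOME=/usr/local/totalview
PATH="$PATH:$TOTALVIEW_HOME/bin"
export PATH TOTALVIEW_HOME
```

A csh/tcsh user would need the equivalent `setenv` lines in ~/.cshrc.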
TotalView Interface
Message Queues
• The message queue window provides:
  – MPI_COMM_WORLD info
  – Size
  – Rank
  – Pending sends and receives
TotalView Features
• Can have multiple object files open in the same TotalView session (local and remote).
• Automatic attach to child processes.
• X-Window and CLI user interfaces.
• Data visualization.
Languages and Compilers
• With the widespread adoption of Linux, the availability of commercial compilers has increased.
  – Lahey Fortran 95, http://www.lahey.com
  – Portland Group Compiler Suite, http://www.pgroup.com
  – Absoft Fortran 95, C, and C++, http://www.absoft.com
  – NAG Fortran 95, http://www.nag.com
  – SGI Compiler Suite for Linux IA-64, http://www.oss.sgi.com
GCC
• Ubiquitous C compiler with additional languages added over the years; easily re-targeted to new platforms.
• Languages supported include C, C++, and Fortran 77.
• Other languages are supported but won’t be covered in this tutorial; the three above cover the majority of scientific and engineering codes.

GCC advantages:
• Free, in both the monetary and liberty senses of the word.
• Common back-end.
• Flexible
  – Extendable
  – Inline assembly
  – Extra language features
GCC
Language extensions:
• Complex
• __alignof__
• inline
  – Only works with -O or greater optimization level.
• Lexical scoping of functions for nested functions.
• Inline assembly.
  – C expressions are allowed as operands.
GCC and EGCS
• The compiler back-end is lacking in optimizations for specific architectures.
• C++ has in the past been criticized for not tracking the C++ standard.
• Since the Cygnus EGCS/GCC integration, C++ performance and conformance have significantly improved.

STL performance: http://www.physics.ohio-state.edu/~wilkins/computing/benchmark/STC.html

Created to test the implementation of STL; from the web site:
“To verify how efficiently C++ (and in particular STL) is compiled by the present day compilers, benchmark outputs 13 numbers computed with increasing abstractions. In the ideal world these numbers should be the same. In the real world, however, …”
STC Description
– 0 uses a simple Fortran-like for loop.
– 1-12 use the STL-style accumulate template function with a plus function object.
– 1, 3, 5, 7, 9, 11 use doubles.
– 2, 4, 6, 8, 10, 12 use Double (a double wrapped in a class).
– 1, 2 use regular pointers.
– 3, 4 use pointers wrapped in a class.
– 5, 6 use pointers wrapped in a reverse-iterator adaptor.
– 7, 8 use wrapped pointers wrapped in a reverse-iterator adaptor.
– 9, 10 use pointers wrapped in a reverse-iterator adaptor wrapped in a reverse-iterator adaptor.
– 11, 12 use wrapped pointers wrapped in a reverse-iterator adaptor wrapped in a reverse-iterator adaptor.
G77
• Fortran “front-end” that uses the GCC “back-end” for code generation.
• Implements most of the Fortran 77 standard.
• Not completely integrated with GCC
  – No inline assembly
  – Warning of implicit type conversions
• No aggressive optimizations.
Portland Group Compiler Suite
• Vendor of compilers for traditional HPC systems.
• Contracted by DOE and Intel to provide compilers for the Intel ASCI Red.
• Optimizing compiler for the Intel P6 core.
• Linux, Solaris, and MS Windows (x86 only).
• Compiler suite includes C, C++, Fortran 77 and 90, and HPF.
• Link-compatible with GCC objects and libraries.
• Includes debugger and profiler (can use GDB).
Optimizations for Portland Compilers
• The vectorizer can optimize countable loops over large arrays.
• Use -Minfo=loop to have the compiler report which optimizations (unrolling, vectorization) were applied to loops.
• Cache size can be specified to maximize cache re-use: -Mvect:cachesize=…
• Use -Mneginfo=loop for information about why a loop was not a candidate for vectorization.
• Can specify the number of times to unroll a loop.
• Can use -Minline to inline functions. This can improve the performance of calls to functions inside of subroutines.
  – Not useful for functions whose execution time is >> the penalty for the jump.
  – This option sacrifices code compactness for efficiency.
Optimizations for Portland Compilers (cont.)
• All command-line optimizations are available through directives or pragmas.
• These can be used to enable or disable specific optimizations.
Caveats for Portland Compilers
• F77 and F90 are separate front-ends.
• The debugger cannot display floating-point registers.
• Code compiled with the Portland compilers is compatible with GDB.
  – Initial listing of code does not work.
  – Set a breakpoint or watchpoint where desired.
• The profiler can be difficult or impossible to use on parallel codes.
Other Sources of Compiler Information
• Linux Fortran web page, http://studbolt.physast.uga.edu/templon/fortran.html
• Cygnus/FSF GCC homepage, http://gcc.gnu.org
• Scientific Applications on Linux, http://SAL.KachinaTech.COM/index.shtml
Parallel Programming Models for Clusters
There are a number of programming models available on clusters of x86-based systems running Linux:
• Threading
• Compiler directives
• Message passing
• Multi-level (hybrid message passing with directives)
• Parallel numerical libraries and application frameworks
Threading
• Threading is a common concurrent programming approach in which several “threads” of execution share a memory address space.
• Because of the requirement for shared memory, threaded programs will only run on individual systems, although they typically see performance boosts when run on SMP systems.
• The most common interface to threads on UNIX-like systems such as Linux is the POSIX threads (pthreads) API, although on other systems there are numerous other interfaces available (DCE threads, Java threads, Win32 threads, etc.).
• Programming threaded applications can be extremely tedious and difficult to debug, and thus threading is not often used in HPC-oriented scientific applications.
Compiler Directives
• Compiler directives allow the relatively simple alteration of a serial code into a parallel code by inserting directives into the serial code which act as “hints” to the compiler, telling it where to look for parallelism. The directives will show up as comments to compilers which do not support them.
• This approach obviously requires the availability of a compiler which supports the directives in question.
• The most commonly supported directive sets for Linux systems are:
  – High Performance Fortran (HPF)
  – OpenMP
Compiler Directives: HPF
• HPF was developed in the early 1990s as a parallel extension to the Fortran programming language. It consists of directives which allow the programmer to distribute data arrays across multiple processors using an “owner-computes” model, as well as a library of parallel intrinsic routines.
• One HPF compiler for Linux/x86 is the Portland Group’s pghpf compiler, which supports parallelization on both a single SMP system (using shared memory) and clusters of systems (using a message passing layer over MPI). OSC has used the pghpf compiler on its 132 processor IA32 cluster and has seen scalability comparable to a hand-coded MPI program for some applications.
• Other HPF compilers for Linux/x86 include Pacific-Sierra Research’s VAST-HPF compiler and NA Software’s HPF-Plus compiler.
Compiler Directives: OpenMP
• OpenMP was developed in the late 1990s as a portable solution for directive-based parallel programming, specifically for shared memory architectures. Like HPF, it is a collection of directives with a support library; however, unlike HPF, OpenMP does not give explicit control over data placement. Also unlike HPF, OpenMP supports C and C++ as well as Fortran.
• There are several OpenMP-enabled compilers for Linux/x86, including those from the Portland Group (Fortran 77/90, C, and C++) and Kuck and Associates (C++). OSC has used both the Portland Group and Kuck compilers and found them to be acceptable, although OpenMP codes rarely scale well past two processors due to the limited memory bandwidth on four- and eight-way x86-based SMP systems.
Message Passing and MPI
• Message passing is the most widely used approach for developing applications on distributed memory parallel systems. In message passing, data movement between processors is achieved by explicitly calling communication routines to send data from one processor to another.
• The standard and most commonly used message passing library is the Message Passing Interface (MPI), originally developed in the mid 1990s. There are numerous implementations of the MPI-1.1 standard, including:
– MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/) -- freely available, supports MPI over shared memory and TCP/IP as well as a number of high speed interconnects including Myrinet, implements the parallel I/O portions of the MPI-2 standard.
– LAM (http://www.mpi.nd.edu/lam/) -- freely available, supports MPI over shared memory and TCP/IP, implements much (most?) of the MPI-2 standard.
Multi-Level/Hybrid
• In clusters of SMP systems, it is sometimes advantageous to use a hybrid of message passing and directive-based approaches, often referred to as multi-level parallel programming.
• In the multi-level approach, the domain is decomposed in a coarse manner using message passing. Within the message passing code, compiler directives are inserted to run the computationally intensive portions in parallel within a shared memory node.
• This approach works best for systems and applications where contention for interconnect interfaces becomes a hindrance to scalability; in general, it does not increase performance at low processor counts, but it extends the region of near-linear scalability beyond that possible with message passing or compiler directives alone.
Multi-Level: MPI + OpenMP
• The most common and portable way to do multi-level parallel programming is to use MPI for the coarse-grained domain decomposition and message passing, and OpenMP for the finer-grained loop-level parallelism.
• The main restriction to this programming approach is that MPI routines must not be called within OpenMP parallel regions.
• This approach also requires compilers which support both MPI and OpenMP.
Parallel Numerical Libraries
• Another approach to parallel programming is to use a parallel numerical library or application framework which abstracts away (to some extent) the distributed memory nature of a cluster system.
• Parallel numerical libraries are libraries which perform a particular class of mathematical operations, such as the Fourier transform or matrix/vector operations, in parallel.
• Examples of parallel numerical libraries include:
  – FFTW (http://www.fftw.org/), which implements parallel FFTs using either pthreads or MPI.
  – ScaLAPACK (http://www.netlib.org/scalapack/), which implements parallel matrix and vector operations using MPI.
Parallel Application Frameworks
• Parallel application frameworks are similar to parallel numerical libraries, but often include other features such as I/O, visualization, or steering capabilities.
• Parallel application frameworks tend to be aimed at a particular application domain rather than a class of mathematical operations.
• Examples of parallel application frameworks include:
  – Cactus (http://www.cactuscode.org/), a parallel toolkit for solving general relativity and astrophysical fluid mechanics problems.
  – PETSc (http://www-fp.mcs.anl.gov/petsc/), a general purpose parallel toolkit for solving problems modeled by partial differential equations.
Application Performance Analysis
The availability of tools which give users the ability to characterize the performance of their applications is critical to the acceptance of clusters as “real” production HPC systems. Performance analysis tools fall into three broad categories:
• Timing
• Profiling
• Hardware Performance Counters
Timing
The simplest way of determining the performance of an application is to measure how long it takes to run.
• Timing on a per-process basis can be accomplished using the time command.
• Timing on a more arbitrary basis within an application can be done using timing routines such as the gettimeofday() system call or the MPI_Wtime() function in the MPI message passing library.
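As a rough sketch of the per-process approach, wall-clock time for an arbitrary command can also be measured directly from the shell; here `sleep 1` stands in for a real application run:

```shell
# Measure wall-clock time for a command, a crude analogue of what
# time(1) reports ('sleep 1' stands in for the real application).
start=$(date +%s)
sleep 1
end=$(date +%s)
elapsed=$((end - start))
echo "wall-clock time: ${elapsed} seconds"
```

Note this measures elapsed (wall-clock) time only; the time command additionally separates user and system CPU time, which matters on a shared front end.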
Profiling
Profiling is an approach in which the time spent in each routine is logged and analyzed in some fashion. This allows the programmer to determine which routines are taking the most time to execute and hence are candidates for optimization. In clusters and other distributed memory parallel environments, this can be taken a step further by profiling a program’s computational routines as well as its communication patterns.
• Computation profiling
  – gprof
  – pgprof
• Communication profiling
  – jumpshot
Computation Profiling: gprof
gprof is the GNU profiler. To use it, you need to do the following:
• Compile and link your code with the GNU compilers (gcc, egcs, g++, g77) using the -pg option flag.
• Run your code as usual. A file called gmon.out will be created containing the profile data for that run.
• Run gprof progname gmon.out to analyze the profile data.
Computation Profiling: gprof Example
troy@oscbw:/home/troy/Beowulf/cdnz3d> make
g77 -O2 -pg -c cdnz3d.f -o cdnz3d.o
g77 -O2 -pg -c sdbdax.f -o sdbdax.o
g77 -O2 -pg -o cdnz3d cdnz3d.o sdbdax.o
troy@oscbw:/home/troy/Beowulf/cdnz3d> ./cdnz3d
(…gmon.out created…)
troy@oscbw:/home/troy/Beowulf/cdnz3d> gprof cdnz3d gmon.out | more
Computation Profiling: gprof Example (con’t.)
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
24.67 942.76 942.76 4100500 0.00 0.00 lxi_
23.51 1841.45 898.69 4100500 0.00 0.00 leta_
20.10 2609.66 768.21 4100500 0.00 0.00 damping_
12.64 3092.90 483.24 4100500 0.00 0.00 lzeta_
11.55 3534.28 441.38 4100500 0.00 0.00 sum_
4.12 3691.73 157.45 250 0.63 14.83 page_
2.91 3802.84 111.11 250 0.44 0.44 tmstep_
0.41 3818.62 15.78 500 0.03 0.03 bc_
0.03 3819.59 0.97 pow_dd
(…output continues…)
Computation Profiling: pgprof
pgprof is the profiler from the Portland Group compiler suite; it is somewhat more powerful than gprof. To use it, you need to do the following:
• Compile and link your code with the Portland Group compilers (pgcc, pgCC, pgf77, pgf90, pghpf) using the -Mprof=func or -Mprof=lines options depending whether you want function-level or line-level profiling.
• Run your code as usual. A file called pgprof.out will be created containing the profile data for that run.
• Run pgprof pgprof.out to analyze the profile data.
Computation Profiling: pgprof Example
troy@oscbw:/home/troy/Beowulf/cdnz3d> make
pgf77 -fast -tp p6 -Mvect=assoc,cachesize:524288 -Mprof=func -c cdnz3d.f -o cdnz3d.o
pgf77 -fast -tp p6 -Mvect=assoc,cachesize:524288 -Mprof=func -c sdbdax.f -o sdbdax.o
pgf77 -fast -tp p6 -Mvect=assoc,cachesize:524288 -Mprof=func -o cdnz3d cdnz3d.o sdbdax.o
Linking:
troy@oscbw:/home/troy/Beowulf/cdnz3d> ./cdnz3d
(…pgprof.out created…)
troy@oscbw:/home/troy/Beowulf/cdnz3d> pgprof pgprof.out
Computation Profiling: pgprof Example (con’t.)
• pgprof will present a graphical display if it finds a functional X display as part of the user’s environment:
Computation Profiling: pgprof Example (con’t.)
• Without a functional X display, pgprof will present a command-line interface like the following:

troy@oscbw:/home/troy/Beowulf/cdnz3d> pgprof pgprof.out
Loading....
Datafile : pgprof.out
Processes : 1
pgprof> print
Time/ Function
Calls Call(%) Time(%) Cost(%) Name:
------------------------------------------------------------------------
4100500 0.00 23.43 23 lxi (cdnz3d.f:1632)
4100500 0.00 21.90 22 damping (cdnz3d.f:2319)
4100500 0.00 21.87 22 leta (cdnz3d.f:1790)
4100500 0.00 11.68 12 lzeta (cdnz3d.f:1947)
pgprof> quit
Communication Profiling: jumpshot
jumpshot is a Java-based GUI profiling tool which is included in the MPICH implementation of MPI. It allows the programmer to profile all calls to MPI routines. To use jumpshot, you need to do the following:
• Compile your MPI code using one of the MPI compiler wrappers (mpicc, mpiCC, mpif77, mpif90) supplied with MPICH using the -mpilog option, and link using -lmpe.
• Run your MPI code as usual. A .clog file will be created (i.e. if your executable is named progname, a log file called progname.clog will be created).
• Run jumpshot on the .clog file (eg. jumpshot progname.clog)
Communication Profiling: jumpshot Example
troy@oscbw:/home/troy/Beowulf/mpi-c> more jumpshot.pbs
#PBS -l nodes=2:ppn=4
#PBS -N jumpshot
#PBS -j oe
cd $HOME/Beowulf/mpi-c
mpicc -mpilog nblock2.c -o nblock2 -lmpe
mpiexec ./nblock2
troy@oscbw:/home/troy/Beowulf/mpi-c> qsub jumpshot.pbs
(…nblock2.clog created…)
troy@oscbw:/home/troy/Beowulf/mpi-c> jumpshot nblock2.clog
Communication Profiling: jumpshot Example (con’t)
Hardware Performance Counters
Hardware performance counters are a way of measuring the performance of an application or system at a very low level. This can be extremely useful for diagnosing performance problems such as cache thrashing or memory bandwidth bottlenecks. There are two ways of accessing performance counters:
• Non-invasive (command-line driven)
  – lperfex
• Invasive (instrumentation library)
  – libperf
  – PAPI
Hardware Performance Counters: lperfex
• OSC has developed a utility called lperfex (http://www.osc.edu/~troy/lperfex/) to access the hardware performance counters built into newer Intel P6-based processors.
• lperfex functions much like the time command, in that it is run on other programs. However, it also gives the ability to count and report on hardware events. The default events if none are specified are floating point operations and L2 cache line loads.
• No special compilation is required, and lperfex can be used within batch jobs and with MPI programs (eg. mpiexec lperfex -y ./a.out). However, it currently does not work with multithreaded programs, such as those using OpenMP or pthreads. It also requires the use of a kernel patch which exposes the MSRs (Model Specific Registers), available at http://www.beowulf.org/software/perf-0.7.tar.gz.
Hardware Performance Counters: lperfex Example
troy@oscbw:/home/troy/Beowulf/cdnz3d> lperfex -e 41 -y ./cdnz3d
838.239990 seconds of CPU time elapsed and 0.000000 MB of memory on oscbw.cluster.osc.edu
Event # Event Events Counted
------- ----- --------------
41 Floating point operations retired 3042728032
Statistics:
-----------
MFLOPS 65.694389
Hardware Performance Counters: lperfex -- Events
• 0: Memory references
• 1: L1 data cache lines loaded
• 3: L1 data cache lines flushed
• 13: L2 cache lines loaded
• 14: L2 cache lines flushed
• 31: I/O transactions
• 35: Memory transactions
• 41: Floating point operations retired (counter 0 only)
• 43: Floating point exceptions handled by microcode (counter 1 only)
• 50: Instructions retired
• 51: Ops retired
• 53: Hardware interrupts received
• 67: Cycles during which the processor is not halted
Hardware Performance Counters: libperf and PerfAPI
• lperfex is built on top of libperf, which is a user-callable library which is included with the NASA Goddard performance counters patch (http://www.beowulf.org/software/perf-0.7.tar.gz).
• It is also possible to instrument a code directly with libperf rather than use lperfex; this would be of interest if you wanted to measure the performance of a single routine rather than the entire code.
• There is an effort under the auspices of the Parallel Tools Consortium (http://www.ptools.org/) to develop a standard library for doing portable low-level performance measurement, called PerfAPI (http://icl.cs.utk.edu/projects/papi/). The current Linux/x86 release of this project fortunately uses a kernel patch which should be compatible with libperf and lperfex.
Why Job Scheduling Software?
In an ideal world, users would coordinate with each other and no conflicts would be encountered when running jobs on a cluster.

Unfortunately, in real life we have limited resources (processors, memory, and network interfaces):
– Users, faced with time deadlines of their own, will want to execute jobs on the cluster as it fits their schedules.
– High-throughput users can swamp the whole system, if allowed.
– Users can check for CPU availability (system load), but how many will check memory or network interface availability?

A job scheduling system allows you to enforce a system policy:
– Policy can be established by management or peer review.
– Enforcement of policy controls what maximum resources are available, and in what order jobs will be allocated these resources.
OSC User Environment Configuration
Front-End System
• Designated for code development and pre/post-processing.
• Interactive resource limits (10 min. CPU time, 32 MB memory on the front-end node; use the limit command to check this).

Compute Nodes
• A private network ensures no direct access (i.e., no rlogin directly to the compute nodes).
• Users specify what their compute requirements are, and the scheduling policy allocates nodes as a resource.
Job Scheduling Software for Clusters
There are several batch queuing systems available for Linux-based clusters, depending on what your needs are. Here are just a few:
• Condor (http://www.cs.wisc.edu/condor/)
• DQS (http://www.scri.fsu.edu/~pasko/dqs.html)
• Generic NQS (http://www.gnqs.org/)
• Job Manager (http://bond.imm.dtu.dk/jobd/)
• GNU Queue (http://www.gnu.org/software/queue/queue.html)
• LSF (http://www.platform.com/ -- commercial)
• Portable Batch System (PBS) (http://pbs.pbspro.com/)
Batch System Evaluation
• Cluster management software requirements were identified and seven batch systems were evaluated.
• Two systems met the basic requirements, PBS and LSF, and a side-by-side comparison was made of both packages.

Some observations made at the time of the comparison (1999):
– Not all packages of the LSF suite have been ported to Linux.
– Microsoft NT is apparently Platform Computing, Inc.’s operating system of choice for the Intel architecture, while PBS fully supports Linux.
– LSF was designed for clusters of systems, although not necessarily dedicated clusters.
– PBS was designed for single-system-image systems and is evolving toward supporting clusters, specifically dedicated clusters.
– No significant difference in functionality between the two.
– PBS provides more opportunity for optimization and development.
Portable Batch System - Brief Overview
• The most widely used batch queuing system for clusters.
• PBS (“Portable Batch System”) comes from MRJ Technology Solutions (Veridian Corporation). The package was developed by MRJ for the NAS Facility at NASA Ames Research Center; it is the successor of the venerable NQS package, which was also developed at NASA Ames.
• PBS is a software system for managing system resources on workstations, SMP systems, MPPs, and vector supercomputers.
• Developed with the intent to conform to the POSIX Batch Environment Standard.
• For the purposes of this tutorial, we will concentrate on how PBS may be applied to a space-shared cluster of small SMP systems (i.e. cluster systems).
PBS Structure
[Diagram: the three PBS processes - server, scheduler, and mom.]

PBS Server
• There is one server process.
• It creates and receives batch jobs.
• Modifies batch jobs.
• Invokes the scheduler.
• Instructs moms to execute jobs.

PBS Scheduler
• There is one scheduler process.
• Contains the policy controlling which job is run, and where and when it is run.
• Communicates with the “moms” to learn about the state of the system.
• Communicates with the server to learn about the availability of jobs.

PBS Machine Oriented Miniserver (mom)
• One process is required for each compute node.
• Places jobs into execution.
• Takes instruction from the server.
• Requires that each instance have its own local file system.

PBS provides an Application Program Interface (API) to communicate with the server and another to interface with the moms.
How PBS Handles Jobs
[Diagram: a batch script is submitted via the qsub command to the server, which consults the scheduler and dispatches the job to the mom pool.]

Batch script:
1) PBS directives
2) Commands required for program execution

Server:
1) Based on resource requirements, place the job into an execution queue.
2) Instruct the scheduler to examine queued jobs.
3) Instruct the “mother superior” to execute the commands section of the batch script.

Scheduler:
1) Query the moms for available resources.
2) Examine queued jobs to see if any can be started, and allocate resources.
3) Return the job id and resource list to the server for execution.

Mother superior:
1) Execute the batch commands.
2) Monitor resource usage of child processes and report back to the server.
If it is a parallel job, create remote processes on the other nodes allocated to this job.
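A minimal batch script of the kind submitted with qsub might look like the following sketch; the resource values, job name, and executable are illustrative only (a real example, jumpshot.pbs, appears later in this tutorial):

```shell
#!/bin/sh
# Minimal PBS batch script sketch (values are illustrative only).
#PBS -l nodes=2:ppn=2        # request 2 nodes, 2 processors per node
#PBS -l walltime=1:00:00     # request 1 hour of wall-clock time
#PBS -N example              # job name
#PBS -j oe                   # merge stdout and stderr into one file

cd $HOME/example             # move to the job's working directory
mpiexec ./a.out              # launch the MPI executable on allocated nodes
```

The `#PBS` lines are the directives the server reads in step 1; everything after them is the commands section that the mother superior executes.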
Starting and Stopping PBS Services
• Recommended, but not required, starting order:
  – Mom
  – Server (generates an “are you there” ping to all moms at startup)
  – Scheduler
• Server
  – ‘-t create’ required for first startup
  – ‘pbs_server -t hot’ starts up the server and looks for jobs currently running
  – ‘qterm -t quick’ kills the server but leaves jobs running
• Mom(s)
  – ‘kill -9’ will leave jobs running
  – ‘pbs_mom -p’ lets running jobs continue to run
  – ‘pbs_mom -r’ kills any running jobs
• Scheduler
  – No impact on performance
PBS Server
• Configuring the server can be separated into two parts:
  – Configuring the server attributes
  – Configuring queues and their attributes
• The server is configured with the qmgr command while it is running.
• Commonly used commands:
  – set, unset, print, create, delete, quit
• Commands operate on three main entities:
  – server: set/change server parameters
  – node: set/change properties of individual nodes
  – queue: set/change properties of individual queues

Usage: qmgr [-c command] [-n]
  -c  Execute a single command and exit qmgr
  -n  No commands are executed; syntax checking only is performed
Server Attributes
Default queue
– Declares the default queue to which jobs are submitted if a queue is not specified.
– The OSC cluster is structured so that all jobs go first through a routing queue called ‘batch’ and then to the specific destination queue.
– For the OSC cluster, ‘batch’ is the default queue:
    set server default_queue = batch

Access Control List (ACL)
Hosts: a list of hosts from which jobs may be submitted:
    set server acl_hosts = *.osc.edu
    set server acl_host_enable = True
(True = turn this feature on; False = turn this feature off)

Users: a list of users who may submit jobs:
    set server acl_user = wrk001@*,wrk002@*,wrk003@*
    set server acl_user_enable = True
Server Attributes
Managers
Defines which users at a specified host are granted batch system administrator privileges:
    set server managers = admin01@*.osc.edu
    set server managers += pinky@*.osc.edu
    set server managers += brain@*.osc.edu

Node Packing
Defines the order in which multiple-CPU cluster nodes are allocated to jobs.
True: jobs are packed into the fewest possible nodes.
False: jobs are scattered across as many nodes as possible.
    set server node_pack = True

Query Other Jobs
True: qstat allows you to see all jobs on the system.
False: qstat only allows you to see your own jobs.
    set server query_other_jobs = True
Server Attributes
Logging
There are two types of logging
• account logging
• events
Within qmgr, you set the mask that determines the level of event logging
1 Error Events
2 Batch System/Server Events
4 Administration Events
8 Job Events
16 Job Resource Usage
32 Security Violations
64 Scheduler Calls
128 Debug Messages
256 Extra Debug Messages
The specified events are logically "OR-ed"
set server log_events = 511   (everything turned on)
set server log_events = 127   (good balance)
Queue Structure
PBS defines two types of queues
Routing
• Used to move jobs to other queues
• Jobs cannot execute in a routing queue
Execution
• A job must reside in an execution queue to be eligible to run
• Job remains in this queue during execution
OSC queue configuration
– One routing queue that is the entry point for all jobs
– Routing queue dispatches jobs to execution queues defined by cpu time and number of processors requested
Queue Structure

Batch routing queue dispatches to execution queues laid out by time range and processor count:

                          Time Range
             q1 (0-5 hrs)  q2 (5-10 hrs)  q3 (10-20 hrs)  q4 (20-40 hrs)  q5 (40-160 hrs)
Number p4    q1p4          q2p4           q3p4            q4p4            q5p4
of     p8    q1p8          q2p8           q3p8            q4p8            q5p8
Procs  p16   q1p16         q2p16          ..              ..              ..
       p32   q1p32         q2p32          ..              ..              ..
       p64   q1p64         q2p64          ..              ..              ..
       p128  q1p128        q2p128         ..              ..              q5p128

• Queue division by processor count allows for management of parallel jobs
• Queue division by time allows job control for system maintenance
• Currently no OS checkpoint support for Linux
• Jobs running at system shutdown must restart from the beginning
• Structure allows queues to be turned off incrementally as downtime approaches, preventing the need to kill and restart jobs
PBS Queue Attributes
Queues are configured with the qmgr command while the server is running
Usage:
[oscbw.osc.edu]$ qmgr
Qmgr: create|set queue queue_name attribute_name = value
See man pbs_queue_attributes for a complete list of queue attributes

Creating a queue
Before queue attributes can be set, the queue must be created
create queue batch
create queue short_16pe
create queue long_16pe
Required PBS Queue Attributes
Queue Type
Must be set to either execution or routing
set queue batch queue_type = Routing
set queue short_16pe queue_type = Execution

Enabled
Logical flag that specifies if jobs will be accepted into the queue
True - the specified queue will accept jobs
False - jobs will not be accepted into the queue
set queue short_16pe enabled = True

Started
– Logical flag that specifies if jobs in the queue will be processed
– Good method for draining queues when system maintenance is needed
True - jobs in the queue will be processed, either routed or scheduled
False - jobs will be held in the queue
set queue short_16pe started = True
Recommended PBS Queue Attributes
Max running
– Controls how many jobs in this queue can run simultaneously
– Customize this value based on hardware resources available
set queue short_16pe max_running = 4

Max user run
– Controls how many jobs an individual userid can execute simultaneously across the entire server
– Helps prevent a single user from monopolizing system resources
set queue short_16pe max_user_run = 2

Priority
– Sets the priority of a queue, relative to other queues
– Provides a method of giving smaller jobs quicker turnaround
set queue q5p128 Priority = 90
Recommended PBS Queue Attributes
Maximum and Minimum resources
– Limits can be placed on various resource limits
– This restricts which jobs may enter the queue based on the resources requested
Usage:
set queue short_16pe resources_max.resource = value
Look at man pbs_resources_linux to see all resource attributes for Linux; corresponding man pages exist for other systems (aix4, sp2, sunos4, unicos8)

cput      maximum amount of CPU time used by all processes
nodes     number of nodes to be reserved
ppn       number of processors to be reserved on each node
pmem      maximum amount of physical memory used by any single process
walltime  maximum amount of real time during which the job can be in the running state
Example PBS Execution Queue
create queue short_16pe
set queue short_16pe queue_type = Execution
set queue short_16pe Priority = 90
set queue short_16pe max_running = 8
set queue short_16pe resources_max.cput = 10:00:00
set queue short_16pe resources_max.nodect = 4
set queue short_16pe resources_max.nodes = 4:ppn=4
set queue short_16pe resources_min.nodect = 2
set queue short_16pe resources_default.cput = 05:00:00
set queue short_16pe resources_default.mem = 1900mb
set queue short_16pe resources_default.nodect = 4
set queue short_16pe resources_default.nodes = 4:ppn=4
set queue short_16pe resources_default.vmem = 1900mb
set queue short_16pe max_user_run = 4
set queue short_16pe enabled = True
set queue short_16pe started = True
Routing Queue Attributes
Route destinations
– Specifies potential destinations to which a job may be routed
– Will be processed in the order listed
– Job will be sent to the first queue which meets the resource requirements of the job

create queue batch
set queue batch queue_type = Route
set queue batch max_running = 4
set queue batch route_destinations = short_16pe
set queue batch route_destinations += long_16pe
set queue batch enabled = True
set queue batch started = True
PBS Scheduler
PBS implements the scheduler as a module, so that different sites can "plug in" the scheduler that meets their specific needs

The material in this tutorial will cover the default FIFO scheduler

FIFO Scheduler - Default Characteristics
– All jobs in a queue will be considered for execution before the next queue is examined
– All queues are sorted by priority
– Within each queue, jobs are sorted by requested CPU time (jobs can be sorted on multiple keys)
– Jobs which have been queued for more than 24 hours will be considered starving
PBS Scheduler
Configuring the scheduler
• Configuration file read when scheduler is started
• $PBS_HOME/sched_priv/sched_config
• FIFO scheduler will require some customization initially, but should remain fairly static

Format of config file
• One line for each attribute
name: value { prime | non_prime | all }
• Some attributes will require a prime option
• If nothing is placed after the value, the default of "all" will be assigned
• Lines starting with a "#" will be interpreted as comments
• When PBS is installed, an initial sched_config is created
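A short sched_config fragment illustrates the format; the attribute values here are illustrative choices, not recommendations, and the full list ships with the PBS Administrator Guide:

```shell
# sched_config -- illustrative fragment only
strict_fifo: False            all
help_starving_jobs: False     all
sort_by: shortest_job_first   all
log_filter: 256
# Attributes with no trailing option default to "all"
```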
Scheduler Attributes
strict_fifo
Controls whether jobs will be scheduled in strict FIFO order or not
Type: boolean
True - jobs will be run in strict first-in, first-out order
False - jobs will be scheduled based on resource usage

help_starving_jobs
Once a queued job has waited a certain period of time, PBS will cease running other jobs until the "starving job" can be run
The waiting period for starving job status is defined in max_starve
Type: boolean
True - starving job support is enabled
False - starving job support is disabled
Recommendation: turn starving jobs off or set max_starve high
Scheduler Attributes
sort_by
Controls how the jobs are sorted within the queues
Type: string
no_sort - do not sort the jobs
shortest_job_first - ascending by the cput attribute (default)
longest_job_first - descending by the cput attribute
smallest_memory_first - ascending by the mem attribute
largest_memory_first - descending by the mem attribute
high_priority_first - descending by the job priority attribute
low_priority_first - ascending by the job priority attribute
large_walltime_first - descending by the job walltime attribute
cmp_job_walltime_asc - ascending by the job walltime attribute
fair_share - not covered here; see PBS Administrator Guide
multi_sort - sort on more than one key
Scheduler Attributes
sort_by (cont)
Examples:
sort_by: smallest_memory_first
sort_by: shortest_job_first

If multi_sort is set, multiple key fields are used
Each key field will be a key for the multi sort, and the order of the key fields decides which sort type is applied first
sort_by: multi_sort
key: shortest_job_first
key: smallest_memory_first
key: high_priority_first

max_starve
The amount of time before a job is considered starving
Type: time
max_starve: 48:00:00
Scheduler Attributes
log_filter
Defines the level of scheduler logging
Type: number
1 internal errors
2 system (server) events
4 admin events
8 job related events
16 end-of-job accounting
32 security violation events
64 scheduler events
128 common debug messages
256 less needed debug messages
Example: to log internal errors, system events, admin events and scheduler events (1, 2, 4, 64):
log_filter: 71
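The filter value is simply the bitwise OR of the selected event classes; a quick shell check confirms the arithmetic:

```shell
#!/bin/sh
# Compute a log_filter mask from individual event classes:
# 1 = internal errors, 2 = system events, 4 = admin events, 64 = scheduler events
mask=$((1 | 2 | 4 | 64))
echo "log_filter: $mask"    # prints "log_filter: 71"
```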
PBS Mom
• Configuring the execution server (Mom) is achieved with a configuration file, which is read in at startup
• Configuration file location
Default: $PBS_HOME/mom_priv/config
You can specify a different file with the '-c' option when the pbs_mom daemon is started
• Configuration file contains two types of information
– Initialization values
– Static resources
Initialization Values
$clienthost hostname
– Adds hostname to the list of hosts which will be allowed to connect to Mom
– Both the host that runs the scheduler and the host that runs the server must be listed as a clienthost

$logevent value
1 Error Events
2 Batch System/Server Events
4 Administration Events
8 Job Events
16 Job Resource Usage
32 Security Violations
64 Scheduler Calls
128 Debug Messages
256 Extra Debug Messages
The specified events are logically "OR-ed"
$logevent 511   (everything turned on)
$logevent 127   (good balance)
Initialization Values
$max_load
– Declares the load value at which the node will be marked busy
– If the load value exceeds max_load, the node will be marked as busy
– If a node is marked busy, no new jobs will be scheduled
$max_load 4.0

$ideal_load
– Declares the load value at which the "busy" label will be removed from a node
– If the load value drops below ideal_load, the node will no longer be marked as busy
$ideal_load 3.0

$cputmult
Sets a factor used to adjust the cpu time used by a job. Allows adjustment of time where the job might run on systems with different cpu performance
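Putting the initialization values above together, a minimal mom config might look like the following sketch (hostnames and thresholds are illustrative placeholders):

```shell
# $PBS_HOME/mom_priv/config -- illustrative example only
$clienthost server.cluster.example.edu     # host running pbs_server
$clienthost sched.cluster.example.edu      # host running the scheduler
$logevent 127                              # errors + server/admin/job events + usage + security + scheduler
$max_load 4.0                              # mark node busy above this load
$ideal_load 3.0                            # clear busy flag below this load
```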
Static Resources
• Static resources are names and values that you assign to a given node that identify its special characteristics
• These resources can then be requested in the batch script if a job needs some special resource
ncpus 4
physmem 2009644
myrinet 2
fasteth 1
• Given the above definitions, jobs that want up to 4 cpus, 2 gig of memory, 2 myrinet interfaces (in this case myrinet is a network interface) or 1 fast ethernet interface "could" be scheduled on this node
• If a job asked for 2 fast ethernet interfaces, it could not be scheduled on this node
#PBS -l nodes=1:myrinet=2   could be scheduled on this node
#PBS -l nodes=1:myrinet=3   could not be scheduled on this node
Prologue Script
PBS provides the ability to run a site-supplied script before and/or after each job runs. This provides the capability to perform initialization or cleanup of resources.
• Prologue script runs prior to each job
• The script name and path is
$PBS_HOME/mom_priv/prologue
• The script must be owned by root
• The script must have permissions
root: read/write/execute
group & world: none
Prologue and Epilogue Arguments
The prologue script is passed the following three arguments that can be used in the script
1 the job id
2 the user id under which the job executes
3 the group id under which the job executes

The epilogue script is passed these arguments plus the following six
4 the job name
5 the session id
6 the requested resource limits
7 the list of resources used
8 the name of the queue in which the job resides
9 the account string (if one exists)
Sample Prologue Script
#!/bin/csh
# Copyright 2000, The Ohio Supercomputer Center, Troy Baer
# Create TMPDIR on all the nodes
# prologue gets 3 arguments:
# 1 -- jobid
# 2 -- userid
# 3 -- grpid
setenv TMPDIR /tmp/pbstmp.$1
setenv USER $2
setenv GROUP $3
if ( -e /var/spool/pbs/aux/$1 ) then
foreach node ( `cat /var/spool/pbs/aux/$1 | uniq` )
rsh $node "mkdir $TMPDIR ; chown $USER $TMPDIR ; chgrp $GROUP $TMPDIR ; chmod 700 $TMPDIR" >& /dev/null
end
else
mkdir $TMPDIR
chown $USER $TMPDIR
chgrp $GROUP $TMPDIR
chmod 700 $TMPDIR
endif
Sample Epilogue Script
#!/bin/csh
# Copyright 2000, The Ohio Supercomputer Center, Troy Baer
# Clear out TMPDIR
# epilogue gets 9 arguments:
# 1 -- jobid
# 2 -- userid
# 3 -- grpid
# 4 -- job name
# 5 -- sessionid
# 6 -- resource limits
# 7 -- resources used
# 8 -- queue
# 9 -- account
setenv TMPDIR /tmp/pbstmp.$1
foreach node ( `cat /var/spool/pbs/aux/$1 | uniq` )
rsh $node /bin/rm -rf $TMPDIR
end
User Environment Customization
Interactive limits (front end - /etc/profile.d/limits.sh)
#!/bin/sh
ulimit -t 600
#ulimit -d 65536
#ulimit -m 65536
#ulimit -l 65536
#ulimit -p 64
User environment modifications for $TMPDIR (compute nodes - /etc/profile.d/tmpdir.sh)
#!/bin/sh
# If PBS_ENVIRONMENT exists and is "PBS_BATCH" or "PBS_INTERACTIVE",
# set TMPDIR
if [ -n "$PBS_ENVIRONMENT" ]
then
if [ "$PBS_ENVIRONMENT" = PBS_BATCH -o "$PBS_ENVIRONMENT" = PBS_INTERACTIVE ]
then
export TMPDIR=/tmp/pbstmp.$PBS_JOBID
fi
fi
Interactive Batch Job Support
PBS supports the option of an interactive batch job for debugging purposes through PBS directives

General limitations
– qsub command reads standard input and passes the data to the job, which is connected via a pseudo-tty
– PBS only handles standard input, output and error

Additional OSC handicap
– Compute nodes are on a private network

OSC customization for interactive graphics support
A technique has been devised that gives graphics capability within an interactive PBS batch job. It takes advantage of the special "X11 forwarding" implemented by SSH.
Interactive Batch Job Graphics Support
• Only supported if the user connects to the front end via ssh
• A utility installed in /etc/profile.d grabs the DISPLAY environment variable from the front end session, builds and installs a new X authorization entry and points the display at the front end, which forwards it to the user's workstation via the X11 forwarding proxy server
• Must pass environment variables to the PBS batch job with "qsub -V …"

/etc/profile.d/pbsX.sh (compute nodes only)
if [ -n "$DISPLAY" -a "$PBS_ENVIRONMENT" = "PBS_INTERACTIVE" ]
then
    export AUTHKEY=`xauth list | grep $DISPLAY | sed "s/oscbw[0-9]*.osc.edu/node00.cluster.osc.edu/" | head -1`
    export DISPLAY=`echo $DISPLAY | sed 's/oscbw01/node00.cluster/'`
    xauth add $AUTHKEY
fi
Parallel Job Control
MPIRUN implementation
• Current default method based on rsh
– Not scalable; a maximum of 512 processes can be started from a single node
– Uses too many sockets
– Under PBS, Mom starts the 1st process, which then reports to Mom
– PBS knows nothing about processes started up by rsh since they were not executed by Mom
• Spawned processes are not children of the moms
• Moms do not have control over spawned processes
• Moms do not know about spawned processes
• Resource utilization is not reported to moms, so accounting is not accurate
• New with MPICH 1.2.0 is the MultiPurpose Daemon (MPD)
– Dependency on user id; still must use rsh or some mechanism to start up daemons
– Daemons start up jobs, not the moms, so there is still no control over processes and incorrect accounting
Parallel Job Control - mpiexec
PBS Task Manager
• In addition to the PBS API, which provides access to the PBS server, there is a task manager interface for the moms
• Based on the PSCHED API (http://parallel.nas.nasa.gov/PSCHED)

Mpiexec uses the task manager library of PBS to spawn copies of the executable on all the nodes in a PBS allocation. It is functionally equivalent to
rsh node "cd $cwd; $SHELL -c 'cd $cwd; exec executable arguments'"

The PBS server API is used to extract resource request information and construct the resource configuration file (nodes, etc.)
We use GM, which requires information on NICs; that information is constructed by mpiexec as well (PBS does not know about NICs)
mpiexec Format
mpiexec [OPTION]... executable [args]...
-n numproc Use only the specified number of processes
-tv, -totalview Debug using totalview
-perif Allocate only one process per myrinet interface
This flag can be used to ensure maximum communication
bandwidth available to each process
-pernode Allocate only one process per compute node. For SMP
nodes, only one processor per node will be allocated to the job.
This flag is used to implement multiple level parallelism
with MPI between nodes, and threads within a node
-config configfile Process executable and arguments are specified in
the given configuration file. This flag permits the use
of heterogeneous jobs using multiple executables,
architectures, and command line arguments.
-bg, -background Do not redirect stdin to task zero. Similar to the
"-n" flag in rsh(1).
MPIEXEC
C program written at OSC for PBS and available under the GPL

1) Establish connection to the task manager on the mother superior node
2) Query the server to get host names and cpu numbers (vpn's)
3) Instruct the task manager to spawn processes, based on the info from 2
4) Mother superior instructs the moms in her pool to start the indicated tasks

[Diagram: mpiexec communicating with pbs_server and the mother-superior pbs_mom, which directs the pbs_mom on each node (node 0 through node #)]
Hardware Level Access
• To connect to an individual node's serial console line you will need a program such as Kermit, which can be downloaded from http://www.columbia.edu/kermit/ckermit.html
• If you needed to connect to the console on your front end (more about how to configure this later), you would type:
[oscbw.osc.edu]% kermit
set line /dev/ttyC0
set speed 9600
set carrier-watch off
connect
Serial Communication Programs
• Minicom is a standard package included in most Linux distributions.
• More flexible and better terminal emulation than Kermit.
• Can create separate "configurations" for different nodes,
– /etc/minirc.node01 can contain
pr port /dev/ttyC0
pu baudrate 9600
pu bits 8
pu parity N
– Would then connect to the node by typing 'minicom node01'
Serial Line Console
• To configure your Linux computer to have a serial console, we first want to recreate the /dev/console entry:
rm -f /dev/console
mknod -m 666 /dev/console c 5 1
• Next, we need to spawn a getty (the process that allows logins) on the proper serial line.
• This is done by adding the following line to /etc/inittab:
# Spawn getty for the serial console.
S1:12345:respawn:/sbin/getty ttyS0 DT9600 vt100
• To have init re-examine the inittab file, type '/sbin/init q'
• It is now possible to log in over /dev/ttyS0.
Allowing root Logins and Console Redirection
• root is only allowed to log in at the tty's defined in /etc/securetty.
• Must add an entry for serial lines
– For our example, ttyS0
• Console redirection requires the following additions to /etc/lilo.conf
serial=0,9600n8 in the global section
append="console=ttyS0" in the per-image section
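Taken together, the relevant /etc/lilo.conf pieces would look like the fragment below (the kernel image path and label are illustrative):

```shell
# /etc/lilo.conf -- serial console redirection (fragment)
serial=0,9600n8            # global section: COM1 at 9600 baud, 8N1

image=/boot/vmlinuz        # per-image section
    label=linux
    append="console=ttyS0"
```

Remember to rerun lilo after editing the file so the change takes effect at the next boot.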
Serial Line Hardware Level Access
• In addition to serial consoles, newer Intel motherboards include support for the Intelligent Platform Management Interface (IPMI).
• Allows power cycling, fan monitoring and BIOS access.
• Implemented in hardware. Not dependent on, and does not affect, the OS.
• No robust implementations for Linux and limited hardware support.
• Instead of serial line connections one could install keyboard-video-mouse (KVM) switches.
– Pros: More easily understood.
– Cons: Lots of extra cables, limits on the number of nodes, remote access.
System Management and Monitoring
• Performance Co-Pilot (PCP) can be used for system-wide performance management and monitoring.
• Hierarchical storage and manipulation of the system's data.
• Renormalization is possible.
– Choose the level of detail
• Data returned to PCP are self-describing.
• Designed to return performance data while minimally affecting what it is measuring.
• Command line interface tools such as pminfo.
• Can build graphical representations using pmview; the theory is visual presentation of large amounts of data.
• Can monitor log files.
• Built-in logging routines for later analysis.
PMCD and PMDA's
• The Performance Metrics Collection Daemon (PMCD) is the core of PCP.
• Performance Metric Domain Agents (PMDA's) collect self-describing statistics from a variety of sources.
– Disk, cpu, network, logs, switches and routers.

[Diagram: clients (pmview, pminfo, logger) query the PMCD, which aggregates data from multiple PMDA's]
Pminfo
pminfo: option requires an argument -- h
Usage: pminfo [options] [metricname ...]
Options:
-a archive metrics source is a PCP log archive
-b batchsize fetch this many metrics at a time for -f or -v (default 20)
-d get and print metric description
-f fetch and print value(s) for all instances
-F fetch and print values for non-enumerable indoms too
-h host metrics source is PMCD on host
-m print PMID
-M print PMID in verbose format
-n pmnsfile use an alternative PMNS
-O time origin for a fetch from the archive
-t get and display (terse) oneline text
-T get and display (verbose) help text
-v verify mode, be quiet and only report errors
(forces other output control options off)
-Z timezone set timezone for -O
-z set timezone for -O to local time for host from -a
Pminfo
djohnson:~> pminfo -f disk.all.read_bytes
disk.all.read_bytes
value 177955
djohnson:~> pminfo -d disk.all.read_bytes
disk.all.read_bytes
Data Type: 32-bit unsigned int InDom: PM_INDOM_NULL 0xffffffff
Semantics: counter Units: Kbyte
Pmview
[Screenshots: pmview graphical displays of cluster performance data]