Monitoring and Troubleshooting on BioHPC
Updated 2017-03-15
[web] portal.biohpc.swmed.edu
[email] [email protected]
Why Monitoring & Troubleshooting
• Monitor jobs running on the cluster
• Understand how current HPC resources are used
• Optimize usage to maximize capacity
Why Monitoring & Troubleshooting
Try to understand whether the job is:
• CPU intensive
• Memory intensive
• I/O intensive
• A combination of the above

Try to figure out:
• Where the bottlenecks are
• How to boost computational efficiency
- Completing more tasks during the available time window
- Running an analysis with a larger data set in the same amount of time
What to Monitor
• CPU usage: lscpu, pstree, top
• Memory usage: free, vmstat
• I/O usage: iostat
• Network/bandwidth usage: ifstat
Start by profiling the application on an interactive node.
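Before reaching for the system tools below, a quick first-pass profile of the application itself can be taken on the interactive node. A minimal sketch with Python's built-in cProfile; the `workload` function is a hypothetical stand-in for a real analysis step:

```python
import cProfile
import io
import pstats

def workload():
    """Hypothetical CPU-bound task standing in for a real analysis step."""
    return sum(i * i for i in range(200_000))

def profile_workload():
    """Run workload() under cProfile and return the stats report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()
    buf = io.StringIO()
    # Sort by cumulative time so the most expensive call paths appear first
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    return buf.getvalue()

print(profile_workload())
```

The report shows per-function call counts and cumulative times, which is often enough to tell whether the job is CPU, memory, or I/O bound before submitting it to the queue.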
CPU Usage
How do we achieve speedup on HPC?
• Increased frequencies
• Increased scalability
lscpu: display information about CPU architecture
CPU Usage: command line tools
Job running on the compute node: astrocyte_cli test <workflow> align-bowtie-se.sh bowtie/1.0.04 samples
pstree: display a tree of processes
* You may also use the top and pstree commands to verify whether your job is running across multiple nodes
CPU Usage: command line tools
top: displays Linux tasks and provides a dynamic real-time view of the running system.
Memory Usage: The Memory Hierarchy
http://cse1.net/recaps/4-memory.html
Memory Usage: command line tools
free: displays the total amount of free and used physical and swap memory in the system, as well as the buffers used by the kernel
• Mem (RAM): memory that can be used by currently running processes
• Swap (virtual memory): used when physical memory (RAM) is full; constant swapping should be avoided
• buffers: file system metadata
• cached: pages holding the actual contents of files, kept for faster future access; not memory that is currently "used"
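free gets these numbers from /proc/meminfo. A minimal sketch of the same bookkeeping in Python; it parses sample text here (with assumed values), but on a node you could pass `open("/proc/meminfo").read()` instead:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of kB values."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])  # sizes are reported in kB
    return values

sample = """MemTotal:       16303428 kB
MemFree:         1024000 kB
Buffers:          204800 kB
Cached:          4096000 kB
"""
info = parse_meminfo(sample)
# free's effective "available" view: free + buffers + cached
print(info["MemFree"] + info["Buffers"] + info["Cached"])  # 5324800
```

The sum of free, buffers, and cached is the useful number for a job: buffers and cached memory are reclaimed by the kernel as soon as a process needs the RAM.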
Memory Usage: command line tools
vmstat: (Virtual Memory Statistics) outputs instantaneous reports about your system's processes, memory, paging, block I/O, interrupts and CPU activity.
Disk Usage & I/O
Parallel Filesystems on BioHPC
Advantages:
• Scalability
• The capability to distribute large files across multiple nodes

Issues:
• Inadequate I/O capability can severely degrade overall cluster performance
Disk Usage & I/O: command line tools
iostat: generates reports that can be used to change system configuration to better balance the input/output load between physical disks.
%iowait is the percentage of time your processors are waiting on the disk
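iostat derives %iowait from the kernel's CPU time counters in /proc/stat. A small sketch of that calculation (the sample line uses assumed values; on a node you could pass the first line of `open("/proc/stat")`):

```python
def iowait_fraction(cpu_line):
    """Fraction of CPU time spent in iowait, from a /proc/stat 'cpu' line.
    Field order: user nice system idle iowait irq softirq steal guest guest_nice."""
    fields = [int(x) for x in cpu_line.split()[1:]]
    return fields[4] / sum(fields)

# Sample counters (in USER_HZ ticks); iowait is the fifth field
sample = "cpu  10000 200 3000 80000 6800 0 0 0 0 0"
print(iowait_fraction(sample))  # 0.068
```

A persistently high iowait fraction while your job runs is the signature of an I/O-bound workload: the processors are idle, waiting on the disk.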
Network/Bandwidth Usage
Minimizing communication
Network/Bandwidth Usage: command line tools
ifstat: reports network bandwidth usage in a batch-style mode
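Under the hood, bandwidth reporting is just the difference between two readings of the byte counters in /proc/net/dev divided by the sampling interval. A sketch with assumed sample values (the interface name `eth0` and the counter numbers are illustrative):

```python
def rx_tx_bytes(netdev_text, iface):
    """Extract (rx_bytes, tx_bytes) for one interface from /proc/net/dev text."""
    for line in netdev_text.splitlines():
        line = line.strip()
        if line.startswith(iface + ":"):
            fields = line.split(":", 1)[1].split()
            return int(fields[0]), int(fields[8])  # field 0 = rx bytes, 8 = tx bytes
    raise ValueError("interface not found: " + iface)

def bandwidth_kb_s(sample_t0, sample_t1, iface, interval_s):
    """KB/s received and transmitted between two /proc/net/dev samples."""
    rx0, tx0 = rx_tx_bytes(sample_t0, iface)
    rx1, tx1 = rx_tx_bytes(sample_t1, iface)
    return (rx1 - rx0) / 1024.0 / interval_s, (tx1 - tx0) / 1024.0 / interval_s

t0 = "eth0: 1000000 0 0 0 0 0 0 0 500000 0 0 0 0 0 0 0"
t1 = "eth0: 2048576 0 0 0 0 0 0 0 1024288 0 0 0 0 0 0 0"
print(bandwidth_kb_s(t0, t1, "eth0", 1.0))  # (1024.0, 512.0)
```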
All-in-One tools
Too many tools?
All-in-one tools:
• Dstat
• Linux Collectl Profiler
• HPCTools
Dstat: Versatile resource statistics tool
dstat: a versatile replacement for vmstat, iostat, netstat and ifstat.
http://dag.wiee.rs/home-made/dstat/
Linux Collectl Profiler
• Information from monitoring an application can aid the user to run it optimally
• Collectl is a tool which monitors a broad set of subsystems of a server while user application is running on it
• Helpful to know your application’s usage of cpu, memory, disk, etc to determine if system resources are being stressed or over utilized
• Many subsystems are available to monitor, in summary or detail; those of initial interest to a user running an application:
• CPU
• Memory
• Disk - Lustre
• InfiniBand
• NFS usage
• TCP summary
collectl --showsubsys
Shows ALL subsystems that data can be collected for and plotted in Summary plots:
b - buddy info (memory fragmentation)
c - cpu
d - disk
f - nfs
i - inodes
j - interrupts by CPU
m - memory
n - network
s - sockets
t - tcp
x - interconnect (currently supported: OFED/Infiniband)
y - slabs
collectl --showsubsys
Shows all subsystems whose collected data can be shown in Detail plots:

C - individual CPUs, including interrupts if -sj or -sJ
D - individual Disks
E - environmental (fan, power, temp) [requires ipmitool]
F - nfs data
J - interrupts by CPU by interrupt number
M - memory numa/node
N - individual Networks
T - tcp details (lots of data!)
X - interconnect ports/rails (Infiniband/Quadrics)
Y - slabs/slubs
Z - processes
L - lustre
Why Monitoring - Linux Collectl Profiler – Getting LUSTRE metrics
• In your script that you sbatch to run a job, execute collectl running in the background:
#!/bin/bash
module add collectl/4.1.2
cd /project/biohpcadmin/s175049
mkdir test
collectl -scLmx -P -f /project/biohpcadmin/s175049/test &>/dev/null &
dd if=/dev/zero of=stripe4 bs=4M count=4096
kill %1
Data is collected for the subsystems listed in the -s option. Collectl data files are written to the user directory "test" above.
Why Monitoring - Linux Colplot Visualizer
• View data with Gnuplot either while job is running or after it is finished:
% colplot -dir /project/biohpcadmin/s175049/test -plot cpu,mem,inter,cltdet

% colplot -showplot

shows ALL the different args to -plot to display the plots you want
• May need to refine timeline by specifying specific timeframe to view:
% colplot -dir /project/biohpcadmin/s175049/test -plot \
cpu,mem,inter,cltdet -time 08:20-08:30
Why Monitoring - Linux Collectl & Colplot
• Documentation with examples and tutorials:
collectl.sourceforge.net/Documentation.html
colplot.sourceforge.net/Documentation.html
• Collectl and colplot man pages:
linux.die.net/man/1/collectl
collectl-utils.sourceforge.net/coplot.html
What’s next
Optimization: Use appropriate compiler options
Intel Math Kernel Library: a library of optimized math routines for science, engineering and financial applications.
• Basic Linear Algebra Subroutines
• LAPACK
• Fast Fourier Transform (FFT)
• Vector Math Library
• Built-in OpenMP multithreading (set OMP_NUM_THREADS>1)
Modules with MKL on BioHPC:
• R/2.15.3-Intel
• R/3.3.2-gccmkl
• julia/0.4.6
• JAGS/4.2.0
• …
Compile your own code with MKL using the -mkl compiler option (for detailed options, refer to: https://software.intel.com/en-us/node/528512)
Optimization : Load big data into memory to reduce I/O
Diagram: an 8GB RAM node vs. a 256GB RAM node; loading big data into the larger memory significantly reduces I/O.
Optimization : Single-Instruction, Multiple-Data
Vector Processing Unit
Scalar loop:
for (i = 0; i < n; i++)
    A[i] = A[i] + B[i];

SIMD loop:
for (i = 0; i < n; i += 8)
    A[i:i+8] = A[i:i+8] + B[i:i+8];

* Each SIMD addition operates on 8 numbers at a time
"Intel® AVX data types allow packing of up to 32 elements in a register if bytes are used. The number of elements depends upon the element type: 8 single-precision floating point types or 4 double-precision floating point types."
https://software.intel.com/en-us/node/524040
https://en.wikipedia.org/wiki/SIMD
Another example of SIMD-style parallelism is the GPU.
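The scalar and SIMD loops above compute the same result; only the indexing differs. This can be illustrated in Python (real SIMD happens in hardware, and the compiler or an intrinsics library emits it; this sketch only mirrors the chunked index pattern):

```python
def scalar_add(a, b):
    """One element per iteration, like the scalar loop."""
    return [a[i] + b[i] for i in range(len(a))]

def chunked_add(a, b, width=8):
    """`width` elements per iteration, mirroring the SIMD loop's indexing.
    Each step processes a slice a[i:i+width] + b[i:i+width] at once."""
    out = []
    for i in range(0, len(a), width):
        out.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
    return out

a = list(range(20))
b = list(range(20, 40))
print(scalar_add(a, b) == chunked_add(a, b))  # True
```

Note that 20 is not a multiple of 8; the final partial chunk is handled by the slice, just as a vectorizing compiler emits a remainder loop.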
Optimization: GNU Parallel
A shell tool for executing jobs in parallel using one or more computers.
• Makes the best use of CPU resources with a balanced job load, keeping the CPUs active and thus saving time
http://www.gnu.org/software/parallel/
If all jobs are independent of each other ...
• Predefines the job pool to match the total number of cores
• Spawns a new process when one finishes
• module load parallel
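GNU parallel itself is a shell tool, but its pattern — a fixed-size pool of workers, with a new task handed out as soon as a worker finishes — can be sketched in Python with the multiprocessing module (the `task` function is a hypothetical independent job):

```python
import multiprocessing

def task(n):
    """Hypothetical independent job: sum of squares below n."""
    return sum(i * i for i in range(n))

def run_pool(inputs, workers=4):
    """Run independent tasks over a fixed-size worker pool, GNU-parallel style."""
    ctx = multiprocessing.get_context("fork")  # fork: no __main__ guard needed on Linux
    with ctx.Pool(processes=workers) as pool:
        return pool.map(task, inputs)

print(run_pool([10, 100, 1000]))  # [285, 328350, 332833500]
```

As with GNU parallel, sizing the pool to the number of cores on the node keeps every CPU busy without oversubscribing them.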
Optimization: Multithreading
If communication between jobs is needed ...
• pthreads
• OpenMP

Shared memory
Advantages:
• User-friendly programming
• Fast data sharing between tasks
Disadvantage:
• It is the programmer's responsibility to construct synchronization that ensures "correct" access to shared memory
Examples (libs/tools): phenix, bowtie2
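The synchronization responsibility mentioned above can be shown with a small threading sketch: several threads increment one shared counter, and a lock guarantees each increment is applied correctly (Python threads here stand in for pthreads/OpenMP threads):

```python
import threading

def locked_count(n_threads=4, increments=10_000):
    """Each thread increments a shared counter under a lock."""
    counter = [0]                 # shared mutable state
    lock = threading.Lock()

    def add_many():
        for _ in range(increments):
            with lock:            # synchronization is the programmer's responsibility
                counter[0] += 1

    threads = [threading.Thread(target=add_many) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]

print(locked_count())  # 40000
```

Without the lock, a read-modify-write race could lose increments; with it, the result is always n_threads × increments — but the lock also serializes the updates, which is exactly the shared-memory bottleneck discussed next.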
Optimization: Shared Memory
Possible bottleneck:
• Concurrent read: maybe
• Concurrent write: no
http://www.delphicorner.f9.co.uk/articles/op4.htm
Modified from Figure 1 in https://developer.marklogic.com/blog/how-marklogic-supports-acid-transactions
Optimization: Message Passing Interface
If communication between jobs is needed ...
e.g.: an MPI job across multiple nodes
Diagram: a master node coordinating slave nodes 1-3.
Optimization: Message Passing Interface
Possible bottlenecks:
• Communication cost
• Unbalanced load

Decompose the dataset in a smart way to:
• Minimize the overlaps (proportional to communication cost)
• Balance the data between nodes
Example: METIS - a graph partitioning tool
http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
What is the maximum speed-up you could achieve?
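One classical bound on that question is Amdahl's law (my framing, not stated on the slide): if a fraction p of the work parallelizes perfectly over n workers, the speedup is S(n) = 1 / ((1 - p) + p/n), so the serial fraction caps the speedup no matter how many nodes you add:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup with parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# Even on 96 cores, a 95%-parallel code gets nowhere near 96x
print(round(amdahl_speedup(0.95, 96), 1))  # 16.7
```

With p = 0.95 the speedup can never exceed 1/0.05 = 20, however large n grows — which is why reducing communication cost and load imbalance (i.e., raising p) matters more than adding nodes.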
Optimization: Multithreading & Message Passing
MPI + pthread
If you run a RELION job across 2 nodes on the 256GB partition: 48 × 2 = 96 cores
No. of MPI jobs   No. of threads   No. MPI × No. threads
       2                48                  96
       4                24                  96
       8                12                  96
      16                 6                  96
Q: Which one has the shortest computation time?
Demo: Project Gutenberg “big data” reader
Data: 18,792 books
Size: ≈ 10 GB
Type: plain text
Count the number of occurrences of the words:
“dog”“cat”“boy”“girl”
Goal: Complete as fast as possible by reducing bottlenecks and inefficiencies
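The counting step itself is the same in every solution below; a minimal sketch of it (whole-word, case-insensitive matching is my assumption about what "occurrences" means here):

```python
import re

KEYWORDS = ("dog", "cat", "boy", "girl")

def count_keywords(text, keywords=KEYWORDS):
    """Count whole-word, case-insensitive occurrences of each keyword."""
    counts = dict.fromkeys(keywords, 0)
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in counts:
            counts[word] += 1
    return counts

sample = "The dog chased the cat. The boy and the girl watched the dog."
print(count_keywords(sample))  # {'dog': 2, 'cat': 1, 'boy': 1, 'girl': 1}
```

Since the counting is cheap, the four solutions differ mainly in how the 10 GB of text reaches the CPUs — which is exactly the I/O bottleneck the demo is about.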
Demo: Project Gutenberg “big data” reader: Solution I (single-processor, many files)
Diagram: each file (file_00.txt, file_01.txt, file_02.txt, …) is read from LUSTRE into node RAM one at a time, and CPU_00 counts the keywords in each file sequentially.
Demo: Project Gutenberg “big data” reader: Solution II (multi-processor, partition file set)
Diagram: the file set (file_00.txt … file_11.txt) is partitioned across four processors; each file is read line by line from LUSTRE into node RAM, and CPU_00-CPU_03 count keywords in their assigned files in parallel.
Demo: Project Gutenberg “big data” reader: Solution III (single-processor, one large file, chunked)
large_txt.bin (all text from all books in one large file)

Diagram: LUSTRE distributes the file to node RAM in limited chunks (chunk_00, chunk_01, chunk_02, …), and CPU_00 counts keywords in each chunk sequentially.
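The chunked read in Solution III can be sketched as follows; the tricky part is a keyword split across a chunk boundary, handled here by carrying a short tail between reads (file name and chunk size are illustrative, not the demo's actual code):

```python
import os
import tempfile

def count_word_in_file(path, word, chunk_bytes=64 * 1024):
    """Stream a file in fixed-size chunks and count occurrences of `word`,
    carrying a tail across chunk borders so matches split at a boundary are
    not missed, and never double-counted: a complete match needs len(word)
    characters, but the tail keeps only len(word) - 1."""
    count = 0
    tail = ""
    keep = len(word) - 1
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            text = tail + chunk
            count += text.count(word)
            tail = text[-keep:] if keep else ""
    return count

# Demo on a small temporary file; a tiny chunk size exercises the boundaries
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("the dog saw the cat " * 1000)
    path = f.name
print(count_word_in_file(path, "dog", chunk_bytes=7))  # 1000
os.remove(path)
```

Reading in fixed-size chunks keeps memory use bounded while turning many small LUSTRE reads into fewer, larger ones — the source of Solution III's speedup over Solution I.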
Demo: Project Gutenberg “big data” reader: Solution IV (multiple-processors, one large file, chunked)
large_txt.bin (all text from all books in one large file)

Diagram: all text is loaded from LUSTRE into node memory, the memory is partitioned into chunks across all processors, and CPU_00-CPU_03 count keywords in their chunks in parallel.
Demo: Project Gutenberg “big data” reader: Results
Solution I:   time python inefficient_reader.py                7.2 min
Solution II:  time python multithreaded_inefficient_reader.py  2.0 min
Solution III: time python efficient_reader.py                  3.5 min
Solution IV:  time python multithreaded_efficient_reader.py    0.7 min