Monitoring and Troubleshooting on BioHPC
Updated 2017-03-15
[web] portal.biohpc.swmed.edu
[email] [email protected]
Why Monitoring & Troubleshooting
• Monitor jobs running on the cluster
• Understand how current HPC resources are used
• Optimize usage to maximize capacity
Why Monitoring & Troubleshooting
Try to understand whether the job is:
• CPU intensive
• Memory intensive
• I/O intensive
• A combination of the above

Try to figure out:
• Where the bottlenecks are
• How to boost computational efficiency
- Completing more tasks during the available time window
- Running an analysis with a larger data set in the same amount of time
What to Monitor
• CPU usage: lscpu, pstree, top
• Memory usage: free, vmstat
• I/O usage: iostat
• Network/bandwidth usage: ifstat
Start by profiling the application on an interactive node.
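Before reaching for the system tools below, a quick first-pass profile of the application itself can be taken on the interactive node. A minimal sketch with Python's built-in cProfile; the `workload` function is a hypothetical stand-in for a real analysis step:

```python
import cProfile
import io
import pstats

def workload():
    """Hypothetical CPU-bound task standing in for a real analysis step."""
    return sum(i * i for i in range(200_000))

def profile_workload():
    """Run workload() under cProfile and return the stats report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    workload()
    profiler.disable()
    buf = io.StringIO()
    # Sort by cumulative time so the most expensive call paths appear first
    pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
    return buf.getvalue()

print(profile_workload())
```

The report shows per-function call counts and cumulative times, which is often enough to tell whether the job is CPU, memory, or I/O bound before submitting it to the queue.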
CPU Usage
How do we achieve speedup on HPC?
• Increased frequencies
• Increased scalability
lscpu: display information about CPU architecture
CPU Usage: command line tools
Job running on the compute node: astrocyte_cli test <workflow> align-bowtie-se.sh bowtie/1.0.04 samples
pstree: display a tree of processes
* You may also use the top and pstree commands to verify whether your job is running across multiple nodes
CPU Usage: command line tools
top: displays Linux tasks and provides a dynamic real-time view of the running system.
Memory Usage: The Memory Hierarchy
http://cse1.net/recaps/4-memory.html
Memory Usage: command line tools
free: displays the total amount of free and used physical and swap memory in the system, as well as the buffers used by the kernel
• Mem (RAM): memory that can be used by currently running processes
• Swap (virtual memory): used when physical memory (RAM) is full; constant swapping should be avoided
• buffers: file system metadata
• cached: pages holding the actual contents of files, kept for faster future access; not memory that is currently "used"
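free gets these numbers from /proc/meminfo. A minimal sketch of the same bookkeeping in Python; it parses sample text here (with assumed values), but on a node you could pass `open("/proc/meminfo").read()` instead:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of kB values."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])  # sizes are reported in kB
    return values

sample = """MemTotal:       16303428 kB
MemFree:         1024000 kB
Buffers:          204800 kB
Cached:          4096000 kB
"""
info = parse_meminfo(sample)
# free's effective "available" view: free + buffers + cached
print(info["MemFree"] + info["Buffers"] + info["Cached"])  # 5324800
```

The sum of free, buffers, and cached is the useful number for a job: buffers and cached memory are reclaimed by the kernel as soon as a process needs the RAM.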
Memory Usage: command line tools
vmstat: (Virtual Memory Statistics) outputs instantaneous reports about your system's processes, memory, paging, block I/O, interrupts and CPU activity.
Disk Usage & I/O
Parallel Filesystems on BioHPC
Advantages:
• Scalability
• The capability to distribute large files across multiple nodes

Issues:
• Inadequate I/O capability can severely degrade overall cluster performance
Disk Usage & I/O: command line tools
iostat: generates reports that can be used to change system configuration to better balance the input/output load between physical disks.
%iowait is the percentage of time your processors are waiting on the disk
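iostat derives %iowait from the kernel's CPU time counters in /proc/stat. A small sketch of that calculation (the sample line uses assumed values; on a node you could pass the first line of `open("/proc/stat")`):

```python
def iowait_fraction(cpu_line):
    """Fraction of CPU time spent in iowait, from a /proc/stat 'cpu' line.
    Field order: user nice system idle iowait irq softirq steal guest guest_nice."""
    fields = [int(x) for x in cpu_line.split()[1:]]
    return fields[4] / sum(fields)

# Sample counters (in USER_HZ ticks); iowait is the fifth field
sample = "cpu  10000 200 3000 80000 6800 0 0 0 0 0"
print(iowait_fraction(sample))  # 0.068
```

A persistently high iowait fraction while your job runs is the signature of an I/O-bound workload: the processors are idle, waiting on the disk.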
Network/Bandwidth Usage
Minimizing communication
Network/Bandwidth Usage: command line tools
ifstat: reports network bandwidth usage in a batch-style mode
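Under the hood, bandwidth reporting is just the difference between two readings of the byte counters in /proc/net/dev divided by the sampling interval. A sketch with assumed sample values (the interface name `eth0` and the counter numbers are illustrative):

```python
def rx_tx_bytes(netdev_text, iface):
    """Extract (rx_bytes, tx_bytes) for one interface from /proc/net/dev text."""
    for line in netdev_text.splitlines():
        line = line.strip()
        if line.startswith(iface + ":"):
            fields = line.split(":", 1)[1].split()
            return int(fields[0]), int(fields[8])  # field 0 = rx bytes, 8 = tx bytes
    raise ValueError("interface not found: " + iface)

def bandwidth_kb_s(sample_t0, sample_t1, iface, interval_s):
    """KB/s received and transmitted between two /proc/net/dev samples."""
    rx0, tx0 = rx_tx_bytes(sample_t0, iface)
    rx1, tx1 = rx_tx_bytes(sample_t1, iface)
    return (rx1 - rx0) / 1024.0 / interval_s, (tx1 - tx0) / 1024.0 / interval_s

t0 = "eth0: 1000000 0 0 0 0 0 0 0 500000 0 0 0 0 0 0 0"
t1 = "eth0: 2048576 0 0 0 0 0 0 0 1024288 0 0 0 0 0 0 0"
print(bandwidth_kb_s(t0, t1, "eth0", 1.0))  # (1024.0, 512.0)
```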
All-in-One tools
Too many tools?
All-in-one tools:
• Dstat
• Linux Collectl Profiler
• HPCTools
Dstat: Versatile resource statistics tool
dstat: a versatile replacement for vmstat, iostat, netstat and ifstat.
http://dag.wiee.rs/home-made/dstat/
Linux Collectl Profiler
• Information from monitoring an application can aid the user to run it optimally
• Collectl is a tool which monitors a broad set of subsystems of a server while user application is running on it
• Helpful to know your application’s usage of cpu, memory, disk, etc to determine if system resources are being stressed or over utilized
• Many subsystems are available to monitor, in summary or detail; those of initial interest to a user running an application:
• CPU
• Memory
• Disk - Lustre
• InfiniBand
• NFS usage
• TCP summary
collectl --showsubsys
Shows ALL subsystems that data can be collected for and plotted in Summary plots:
b - buddy info (memory fragmentation)
c - cpu
d - disk
f - nfs
i - inodes
j - interrupts by CPU
m - memory
n - network
s - sockets
t - tcp
x - interconnect (currently supported: OFED/Infiniband)
y - slabs
collectl --showsubsys
Shows all subsystems whose collected data can be shown in Detail plots:

C - individual CPUs, including interrupts if -sj or -sJ
D - individual Disks
E - environmental (fan, power, temp) [requires ipmitool]
F - nfs data
J - interrupts by CPU by interrupt number
M - memory numa/node
N - individual Networks
T - tcp details (lots of data!)
X - interconnect ports/rails (Infiniband/Quadrics)
Y - slabs/slubs
Z - processes
L - lustre
Why Monitoring - Linux Collectl Profiler – Getting LUSTRE metrics
• In your script that you sbatch to run a job, execute collectl running in the background:
#!/bin/bash
module add collectl/4.1.2
cd /project/biohpcadmin/s175049
mkdir test
collectl -scLmx -P -f /project/biohpcadmin/s175049/test &>/dev/null &
dd if=/dev/zero of=stripe4 bs=4M count=4096
kill %1
Data is collected for the subsystems listed in the -s option. Collectl data files are written to the user directory "test" above.
Why Monitoring - Linux Colplot Visualizer
• View data with Gnuplot either while job is running or after it is finished:
% colplot -dir /project/biohpcadmin/s175049/test -plot cpu,mem,inter,cltdet

% colplot -showplot

shows ALL the different args to -plot to display the plots you want
• May need to refine timeline by specifying specific timeframe to view:
% colplot -dir /project/biohpcadmin/s175049/test -plot \
cpu,mem,inter,cltdet -time 08:20-08:30
Why Monitoring - Linux Collectl & Colplot
• Documentation with examples and tutorials:
collectl.sourceforge.net/Documentation.html
colplot.sourceforge.net/Documentation.html
• Collectl and colplot man pages:
linux.die.net/man/1/collectl
collectl-utils.sourceforge.net/coplot.html
What’s next
Optimization: Use appropriate compiler options
Intel Math Kernel Library: a library of optimized math routines for science, engineering and financial applications.
• Basic Linear Algebra Subroutines
• LAPACK
• Fast Fourier Transform (FFT)
• Vector Math Library
• Built-in OpenMP multithreading (set OMP_NUM_THREADS>1)
Modules with MKL on BioHPC:
• R/2.15.3-Intel
• R/3.3.2-gccmkl
• julia/0.4.6
• JAGS/4.2.0
• …
Compile your own code with MKL using the -mkl compiler option (for detailed options, refer to: https://software.intel.com/en-us/node/528512)
Optimization : Load big data into memory to reduce I/O
Diagram: an 8GB RAM node vs. a 256GB RAM node; loading big data into the larger memory significantly reduces I/O.
Optimization : Single-Instruction, Multiple-Data
Vector Processing Unit
Scalar loop:
for (i = 0; i < n; i++)
    A[i] = A[i] + B[i];

SIMD loop:
for (i = 0; i < n; i += 8)
    A[i:i+8] = A[i:i+8] + B[i:i+8];

* Each SIMD addition operates on 8 numbers at a time
"Intel® AVX data types allow packing of up to 32 elements in a register if bytes are used. The number of elements depends upon the element type: 8 single-precision floating point types or 4 double-precision floating point types."
https://software.intel.com/en-us/node/524040
https://en.wikipedia.org/wiki/SIMD
Another example of SIMD-style parallelism is the GPU.
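The scalar and SIMD loops above compute the same result; only the indexing differs. This can be illustrated in Python (real SIMD happens in hardware, and the compiler or an intrinsics library emits it; this sketch only mirrors the chunked index pattern):

```python
def scalar_add(a, b):
    """One element per iteration, like the scalar loop."""
    return [a[i] + b[i] for i in range(len(a))]

def chunked_add(a, b, width=8):
    """`width` elements per iteration, mirroring the SIMD loop's indexing.
    Each step processes a slice a[i:i+width] + b[i:i+width] at once."""
    out = []
    for i in range(0, len(a), width):
        out.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
    return out

a = list(range(20))
b = list(range(20, 40))
print(scalar_add(a, b) == chunked_add(a, b))  # True
```

Note that 20 is not a multiple of 8; the final partial chunk is handled by the slice, just as a vectorizing compiler emits a remainder loop.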
Optimization: GNU Parallel
A shell tool for executing jobs in parallel using one or more computers.
• Makes the best use of CPU resources with a balanced job load, keeping the CPUs active and thus saving time
http://www.gnu.org/software/parallel/
If all jobs are independent of each other ...
• Predefines the job pool to match the total number of cores
• Spawns a new process when one finishes
• module load parallel
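GNU parallel itself is a shell tool, but its pattern — a fixed-size pool of workers, with a new task handed out as soon as a worker finishes — can be sketched in Python with the multiprocessing module (the `task` function is a hypothetical independent job):

```python
import multiprocessing

def task(n):
    """Hypothetical independent job: sum of squares below n."""
    return sum(i * i for i in range(n))

def run_pool(inputs, workers=4):
    """Run independent tasks over a fixed-size worker pool, GNU-parallel style."""
    ctx = multiprocessing.get_context("fork")  # fork: no __main__ guard needed on Linux
    with ctx.Pool(processes=workers) as pool:
        return pool.map(task, inputs)

print(run_pool([10, 100, 1000]))  # [285, 328350, 332833500]
```

As with GNU parallel, sizing the pool to the number of cores on the node keeps every CPU busy without oversubscribing them.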
Optimization: Multithreading
If communication between jobs is needed ...
• pthreads
• OpenMP

Shared memory
Advantages:
• User-friendly programming
• Fast data sharing between tasks
Disadvantage:
• It is the programmer's responsibility to construct synchronization that ensures "correct" access to shared memory
Examples (libs/tools): phenix, bowtie2
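The synchronization responsibility mentioned above can be shown with a small threading sketch: several threads increment one shared counter, and a lock guarantees each increment is applied correctly (Python threads here stand in for pthreads/OpenMP threads):

```python
import threading

def locked_count(n_threads=4, increments=10_000):
    """Each thread increments a shared counter under a lock."""
    counter = [0]                 # shared mutable state
    lock = threading.Lock()

    def add_many():
        for _ in range(increments):
            with lock:            # synchronization is the programmer's responsibility
                counter[0] += 1

    threads = [threading.Thread(target=add_many) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]

print(locked_count())  # 40000
```

Without the lock, a read-modify-write race could lose increments; with it, the result is always n_threads × increments — but the lock also serializes the updates, which is exactly the shared-memory bottleneck discussed next.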
Optimization: Shared Memory
Possible bottleneck:
• Concurrent read: maybe
• Concurrent write: no
http://www.delphicorner.f9.co.uk/articles/op4.htm
Modified from Figure 1 in https://developer.marklogic.com/blog/how-marklogic-supports-acid-transactions
Optimization: Message Passing Interface
If communication between jobs is needed ...
e.g.: an MPI job across multiple nodes
Diagram: a master node coordinating slave nodes 1-3.
Optimization: Message Passing Interface
Possible bottlenecks:
• Communication cost
• Unbalanced load

Decompose the dataset in a smart way to:
• Minimize the overlaps (proportional to communication cost)
• Balance the data between nodes
Example: METIS - a graph partitioning tool
http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
What is the maximum speed-up you could achieve?
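One classical bound on that question is Amdahl's law (my framing, not stated on the slide): if a fraction p of the work parallelizes perfectly over n workers, the speedup is S(n) = 1 / ((1 - p) + p/n), so the serial fraction caps the speedup no matter how many nodes you add:

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup with parallel fraction p on n workers."""
    return 1.0 / ((1.0 - p) + p / n)

# Even on 96 cores, a 95%-parallel code gets nowhere near 96x
print(round(amdahl_speedup(0.95, 96), 1))  # 16.7
```

With p = 0.95 the speedup can never exceed 1/0.05 = 20, however large n grows — which is why reducing communication cost and load imbalance (i.e., raising p) matters more than adding nodes.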
Optimization: Multithreading & Message Passing
MPI + pthread
If you run a RELION job across 2 nodes on the 256GB partition: 48 × 2 = 96 cores
No. of MPI jobs   No. of threads   No. MPI × No. threads
       2                48                  96
       4                24                  96
       8                12                  96
      16                 6                  96
Q: Which one has the shortest computation time?
Demo: Project Gutenberg “big data” reader
Data: 18,792 books
Size: ≈ 10 GB
Type: plain text
Count the number of occurrences of the words:
“dog”“cat”“boy”“girl”
Goal: Complete as fast as possible by reducing bottlenecks and inefficiencies
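The counting step itself is the same in every solution below; a minimal sketch of it (whole-word, case-insensitive matching is my assumption about what "occurrences" means here):

```python
import re

KEYWORDS = ("dog", "cat", "boy", "girl")

def count_keywords(text, keywords=KEYWORDS):
    """Count whole-word, case-insensitive occurrences of each keyword."""
    counts = dict.fromkeys(keywords, 0)
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in counts:
            counts[word] += 1
    return counts

sample = "The dog chased the cat. The boy and the girl watched the dog."
print(count_keywords(sample))  # {'dog': 2, 'cat': 1, 'boy': 1, 'girl': 1}
```

Since the counting is cheap, the four solutions differ mainly in how the 10 GB of text reaches the CPUs — which is exactly the I/O bottleneck the demo is about.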
Demo: Project Gutenberg “big data” reader: Solution I (single-processor, many files)
Diagram: each file (file_00.txt, file_01.txt, file_02.txt, …) is read from LUSTRE into node RAM one at a time, and CPU_00 counts the keywords in each file sequentially.
Demo: Project Gutenberg “big data” reader: Solution II (multi-processor, partition file set)
Diagram: the file set (file_00.txt … file_11.txt) is partitioned across four processors; each file is read line by line from LUSTRE into node RAM, and CPU_00-CPU_03 count keywords in their assigned files in parallel.
Demo: Project Gutenberg “big data” reader: Solution III (single-processor, one large file, chunked)
large_txt.bin (all text from all books in one large file)

Diagram: LUSTRE distributes the file to node RAM in limited chunks (chunk_00, chunk_01, chunk_02, …), and CPU_00 counts keywords in each chunk sequentially.
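The chunked read in Solution III can be sketched as follows; the tricky part is a keyword split across a chunk boundary, handled here by carrying a short tail between reads (file name and chunk size are illustrative, not the demo's actual code):

```python
import os
import tempfile

def count_word_in_file(path, word, chunk_bytes=64 * 1024):
    """Stream a file in fixed-size chunks and count occurrences of `word`,
    carrying a tail across chunk borders so matches split at a boundary are
    not missed, and never double-counted: a complete match needs len(word)
    characters, but the tail keeps only len(word) - 1."""
    count = 0
    tail = ""
    keep = len(word) - 1
    with open(path, "r") as f:
        while True:
            chunk = f.read(chunk_bytes)
            if not chunk:
                break
            text = tail + chunk
            count += text.count(word)
            tail = text[-keep:] if keep else ""
    return count

# Demo on a small temporary file; a tiny chunk size exercises the boundaries
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("the dog saw the cat " * 1000)
    path = f.name
print(count_word_in_file(path, "dog", chunk_bytes=7))  # 1000
os.remove(path)
```

Reading in fixed-size chunks keeps memory use bounded while turning many small LUSTRE reads into fewer, larger ones — the source of Solution III's speedup over Solution I.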
Demo: Project Gutenberg “big data” reader: Solution IV (multiple-processors, one large file, chunked)
large_txt.bin (all text from all books in one large file)

Diagram: all text is loaded from LUSTRE into node memory, the memory is partitioned into chunks across all processors, and CPU_00-CPU_03 count keywords in their chunks in parallel.
Demo: Project Gutenberg “big data” reader: Results
Solution I:   time python inefficient_reader.py                7.2 min
Solution II:  time python multithreaded_inefficient_reader.py  2.0 min
Solution III: time python efficient_reader.py                  3.5 min
Solution IV:  time python multithreaded_efficient_reader.py    0.7 min