Introduction to HPC at UCL
Institut de calcul intensif et de stockage de masse
Introduction to HPC at UCL
● Technical reminders and available equipment
• source code
• compiling
• optimized libraries: BLAS, LAPACK
• OpenMP
• MPI
● Job submission: SGE, Condor, SLURM
● CISM: working principles, management, access
● Machine room visit
● From algorithm to computer program: optimization and parallel code
October 17th 2017 Damien François and Bernard Van Renterghem
Introduction to High Performance Computing at UCL, Oct. 17th 2017
CISM Cache memory
Program execution = exchange of information (program instructions and data) between the CPU and RAM.
RAM is slow and instructions are read sequentially, hence the cache: instructions and/or data are transferred from RAM to cache memory in entire blocks.
Cache L1, L2, L3
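As an illustration (not from the slides): in C, a 2-D array is stored row by row, so traversing it in row order uses every element of each cache line loaded from RAM, while traversing it in column order jumps N*sizeof(double) bytes between accesses and wastes most of each loaded block. Both orders compute the same value; only the speed differs.

```c
#include <stddef.h>

#define N 512

static double m[N][N];

void fill(double value) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            m[i][j] = value;
}

double sum_row_major(void) {      /* cache-friendly: contiguous reads */
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

double sum_col_major(void) {      /* strided reads: poor cache reuse */
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

On large arrays the row-major version can be several times faster, purely because of cache behaviour.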
CISM Clusters
Big number of standard elements (low cost)
Network performance critical
[Diagram: many low-cost computers linked together by a network]
CISM Equipment: servers
CISM servers
● Manneback
● With CP3 working nodes
● « exotic » or « interactive » machines
UCL CECI servers
● Hmem
● Lemaitre2
CECI servers
● Vega ULB
● Hercules UNamur
● Dragon1 UMons
● Nic4 ULiège
CISM Equipment: clusters
● Charles Manneback (1894-1975) Georges Lemaitre's friend
● Made of several different generations of hardware
● Harpertown, Nehalem, Westmere, SandyBridge, IvyBridge, Haswell,…
● Opteron
● GPU Tesla
● MIC Xeon Phi
● CP3 partition: 125 nodes, 2424 cores
● Zoe partition: 26 nodes, 416 cores
● Def partition: 102 nodes, 1280 cores
● Installed compilers: GCC, Intel, PGI
● OS: GNU/Linux Centos 6
● Batch system: SLURM
Manneback
CISM Equipment: clusters
Manneback
Welcome to [ASCII-art « manneback » login banner]
Charles Manneback Lemaitre fellow cluster
(GNU/Linux CentOS 6.9) front-end: 2x8Core E5-2650@2GHz/64GB RAM
contact, support : [email protected]
Use the following commands to adjust your environment:
'module avail'            - show available modules
'module add <module>'     - adds a module to your environment for this session
'module initadd <module>' - configure module to be loaded at every login
CISM Equipment: clusters
Submitting jobs with SLURM: http://www.cism.ucl.ac.be/faq
http://www.uclouvain.be/cism
http://www.ceci-hpc.be
[root@manneback ~]# sload
MBack              : 3262/4128=79%
mb-har Harpertown  :  238/632=37%
mb-opt Opteron     : 1086/1248=87%
mb-neh Nehalem     :   76/104=73%
mb-wes Westmere    :     0/24=0%
mb-sab SandyBridge :  583/704=82%
mb-ivy IvyBridge   :  895/896=99%
mb-has Haswell (p zoe): 384/416=92%
CISM
# sinfo
NODELIST                NODES PARTI CPUS MEMORY FEATURES                 GRES
mb-har[121-123,125-130]   9   Def*    8 13869+ Harpertown,Xeon,X5460    localscratch:340
mb-har[124,131-140]      11   Def*    8 13930+ Harpertown,Xeon,X5460    localscratch:20
mb-har[001-009,011-014]  13   Def*    8 15947  Harpertown,Xeon,L5420    localscratch:166
mb-har102                 1   Def*    8 32108  Harpertown,Xeon,L5420    localscratch:146
mb-neh070                 1   Def*    8 24019  Nehalem,Xeon,X5550       localscratch:814,gpu:TeslaC1060/M1060:2
mb-sab040                 1   Def*   16 64398  SandyBridge,Xeon,E5-2660 localscratch:458,gpu:TeslaM2090:2,mic:5110P:1
[...]
CISM Equipment: exotic machines
Other peculiar machines:
● Lm9: interactive Matlab, ThermoCalc, R,…
  2× 6-core [email protected] GHz, 144 GB RAM
● Mb-neh070: dual quad [email protected] (85 Gflops) + 2× Tesla M1060 = 240 GPU cores, 624 SP Gflops, 77 DP Gflops (GPU)
● Mb-sab040: dual octa [email protected] + 2× Tesla M2090 = 512 GPU cores, 1332 SP Gflops, 666 DP Gflops (GPU) + Xeon Phi, 61 [email protected], 1011 DP Gflops
● SCMS-SAS 3&4: for SAS, STATA, R,… 2× 16-core, 8192 Gflops, 128GB RAM, Xeon [email protected]
● LmPp001-003: Lemaitre2 post-processing, Nvidia Quadro 4000 = 256 GPU cores, 486 SP Gflops, 243 DP Gflops (GPU)
CISM Equipment: CECI clusters
Hmem (www.ceci-hpc.be)
● 16 Dell PowerEdge R815 + 1 HP + 3 Ttec
● 17× 48-core AMD Opteron 6174 (Magny-Cours) @2.2GHz
  + 3× 8-core AMD Opteron 8222 @3GHz (24h partition)
● RAM: 2× 512 GB, 7× 256 GB, 8× 128 GB, 3× 128 GB
● /scratch 3.2TB or 1.7TB
● Infiniband 40Gb/s
● SLURM batch queuing system
● Total: 840 cores, 4128 GB RAM, 31 TB /scratch, 11TB /home, 7468 GFlops
CISM Equipment: CECI clusters

Lemaitre2 (www.ceci-hpc.be)
● 112 HP DL380 with 2× 6-core [email protected], 48GB RAM
● /scratch lustreFS 120TB, /tmp 325GB
● Infiniband 40Gb/s
● SLURM batch queuing system
● Total: 1344 cores, 5.25 TB RAM, 120 TB /scratch, 30TB /home, 13.6 TFlops
CISM Equipment: CECI clusters

ULB + UNamur + UMons + UCL + ULiège = CECI
See www.ceci-hpc.be/clusters.html
CISM

Manneback (2012): 1400 CPUs
HMEM (2011): 816 CPUs
Lemaitre2 (2012): 1380 CPUs
Vega (2013): 2752 CPUs
Dragon1 (2013): 416 CPUs
Hercules (2013): 896 CPUs
Nic4 (2014): 2048 CPUs
~ 8300 CPUs
Zenobe (2014): 13536 CPUs
CISM
280 M/month = 4.7 MhCPU ≈ 533 CPU-years
CISM CECI Distributed Filesystem
[Diagram: UCL and ULg HPC machine rooms linked via Belnet at 10 Gbps; storage solution core design (2014)]
CISM To reduce computing time…
… improve your code
● choice of algorithm
● source code
● optimized compiling
● optimized libraries
… use parallel computation
● OpenMP (mostly on SMP machines)
● MPI
CISM Source code
• Algorithm choice: does the volume of calculation increase with n, n×n,…? Is it stable?
• Programming language: FORTRAN, C, C++,…?
• Coding practice:
  ● indirect addressing is expensive (pointers)
  ● fetching order of array elements (for optimal use of cache memory)
  ● loop efficiency (get all unnecessary bits and pieces out of them)
CISM Compiling
• The compiler…
  ● translates an instruction list written in a high-level language into a machine-readable (binary) file [= the object file]
    e.g. ifc -c myprog.f90 generates the object file myprog.o
  ● links binary object files to produce an executable file
    e.g. ifc -o myprog module1.o libmath.a myprog.o generates the executable file (= program) myprog
• Optimization options: -O1, -O2, -O3
• Different qualities of compilers!
CISM Optimized libraries: BLAS
Basic Linear Algebra Subroutines
● a set of optimized subroutines to handle vector×vector, matrix×vector and matrix×matrix operations (for real and complex numbers, single or double precision)
● the subroutines are optimized for a specific machine CPU/OS
● see http://www.netlib.org/blas
● example…
CISM Optimized libraries: BLAS
● compiling from BLAS source:
  ifc -o mvm sgemv.f mvm.f
● compiling with the pre-compiled BLAS library (optimized for Intel CPUs):
  ifc -o mvm mvm.f sblasd13d.a
real matlxc(nl, nc)
real vectc(nc), result(nl)

call random_number(matlxc)
call random_number(vectc)

! hand-coded version:
do i = 1, nl
   result(i) = 0.0
   do j = 1, nc
      result(i) = result(i) + matlxc(i,j)*vectc(j)
   end do
end do

! equivalent single call to the BLAS routine (SGEMV is single precision,
! so the arrays and the alpha/beta constants must be single precision too):
call SGEMV('N', nl, nc, 1.0, matlxc, nl, vectc, 1, 0.0, result, 1)
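For comparison, the same y = A·x product written out in plain C (a sketch, not a replacement for the optimized BLAS routine the Fortran example calls). Note the storage difference: C arrays are row-major, Fortran arrays column-major.

```c
/* y = A*x for an nl x nc matrix stored row-major in a flat array */
void matvec(int nl, int nc, const double *a, const double *x, double *y) {
    for (int i = 0; i < nl; i++) {
        double s = 0.0;
        for (int j = 0; j < nc; j++)
            s += a[i * nc + j] * x[j];   /* a[i][j] in row-major layout */
        y[i] = s;
    }
}
```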
CISM Optimized libraries: BLAS
Performance comparison of the Intel and PGI FORTRAN compilers, for hand-written code, BLAS source code and the pre-compiled optimized library (matrix 10,000 × 5,000):

Compiler      Subroutine      Options   Mflops
Intel (ifc)   DO loop         -O0           11
                              -O3           11
              BLAS source     -O0           42
                              -O3          115
              BLAS compiled   -O0          120
                              -O3          120
PGI (pgf90)   DO loop         -O0           11
                              -O3           11
              BLAS source     -O0           48
                              -O3           57
              BLAS compiled   -O0          116
                              -O3          119
CISM Optimized libraries: LAPACK
• Linear Algebra subroutines:
  ● linear equation systems: Ax = b
  ● least squares: min ||Ax-b||²
  ● eigenvalue problems: Ax = λx, Ax = λBx
  ● for real or complex numbers, single or double precision
  ● includes all utility routines (LU factoring, Cholesky,…)
• Based on BLAS (LAPACK itself does not depend on the hardware, yet is always optimized through the underlying BLAS)
• See http://www.netlib.org/lapack
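A conceptual sketch of what a LAPACK linear-system driver does for Ax = b, hard-coded here for a 2×2 system: Gaussian elimination followed by back-substitution. Illustration only; real codes should call the LAPACK routine, which adds pivoting and blocked, cache-friendly algorithms.

```c
/* Solve a 2x2 system Ax = b by elimination + back-substitution.
   Assumes a[0][0] != 0 (no pivoting, unlike the real LAPACK routine). */
void solve_2x2(const double a[2][2], const double b[2], double x[2]) {
    double f   = a[1][0] / a[0][0];       /* elimination factor */
    double a11 = a[1][1] - f * a[0][1];   /* eliminate below the pivot */
    double b1  = b[1]    - f * b[0];
    x[1] = b1 / a11;                      /* back-substitution */
    x[0] = (b[0] - a[0][1] * x[1]) / a[0][0];
}
```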
CISM OpenMP
• OpenMP (Open Multi-Processing): a standard (compiler directives, functions, environment variables) for shared-memory architectures (OpenMP 2.0)

• Principle: compiler directives > parallelism details are left to the compiler > fast implementation (fork-and-join model)

…
!$OMP PARALLEL DO
DO I = 1, 1000
   a(i) = b(i)*c(i)
END DO
…
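The same loop in C with the corresponding OpenMP directive. Compile with your compiler's OpenMP flag (e.g. -fopenmp for GCC) and the iterations are divided among threads; without the flag the pragma is ignored and the loop runs serially with the same result.

```c
#define LEN 1000

/* a[i] = b[i] * c[i]: iterations are independent, so the compiler
   may safely share them out among threads. */
void multiply(double *a, const double *b, const double *c) {
    #pragma omp parallel for
    for (int i = 0; i < LEN; i++)
        a[i] = b[i] * c[i];
}
```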
CISM MPI environment
• MPI = Message Passing Interface (2.0)
• Principle: the programmer keeps full control over data exchange: the program itself distributes the work and manages the communication between nodes
• Widely used standard for clusters (but also exists for SMP boxes)
…
      REAL a(100)
…
C Process 0 sends, process 1 receives:
      if ( myrank.eq.0 ) then
         call MPI_SEND(a, 100, MPI_REAL, 1, 17, MPI_COMM_WORLD, ierr)
      else if ( myrank.eq.1 ) then
         call MPI_RECV(a, 100, MPI_REAL, 0, 17, MPI_COMM_WORLD, status, ierr)
      endif
…
CISM Job submission
• Goal: one single task per CPU
• Principle: the user hands their program over to an automatic job management system, specifying its requirements (memory, architecture, number of CPUs,…). When the requested resources become available, the job is dispatched and starts running.
• Slurm Workload Manager
● sbatch
● sinfo
● scancel
CISM Job submission
• Submission script examples…
• To submit your job: sbatch myscript
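As an illustration, a minimal sbatch script might look like the following (the job name, resource values and program name are placeholders, not taken from the slides):

```shell
#!/bin/bash
#SBATCH --job-name=myjob          # name shown by squeue
#SBATCH --ntasks=1                # number of tasks (processes)
#SBATCH --cpus-per-task=4         # e.g. for an OpenMP program
#SBATCH --mem-per-cpu=1024        # MB of RAM per CPU
#SBATCH --time=01:00:00           # walltime limit hh:mm:ss

module add gcc                    # load the environment the job needs
./myprog                          # the program to run
```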
CISM CISM: research environment
ELEN
TERM
SC
TOPO
ELIENAPS
RDGN
RECI
BSMA
IMAP
MOST
MEMA
BIB
COMU
LICR
ELEN
INGI
INMA
ELIC
INFM
CP3
FACM
NAPS
LOCI
GERU
PAMO
LSM
ECON
RSPO
CISM CISM
• Equipment and support available for any UCL (and CECI) member
• Equipment is acquired through projects
• Goal: joining forces to acquire and manage more powerful equipment
• Institut de Calcul Intensif et de Stockage de Masse:
● management committee composed of representatives of the users' entities: debates and decides on strategies; chairman elected for four years
● offices in Mercator; machine rooms in Pythagore and Marc de Hemptinne
● daily management by technical computer team, under leadership of CISM Director (elected for four years)
CISM CISM management team
Thomas Keutgen, CISM Director
Olivier Mattelaer, system administrator & user support
Bernard Van Renterghem, system administrator & user support
Damien François, system administrator & user support
CISM Environmental challenges
• two 60 kW water chillers
Aquarium
• water cooling (rack based)
CISM Environmental challenges
• total hosting capacity 120 kW
• electrical redundancy and 200 kVA UPS protection
• 5 m³ buffer tank
• redundant pumps, electrical feed through independent UPS

Aquarium