Introduction to HPC at UCL
Institut de calcul intensif et de stockage de masse
Introduction to HPC at UCL
● Technical reminders and available equipment
• source code
• compiling
• optimized libraries: BLAS, LAPACK
• OpenMP
• MPI
● Job submission: SGE, Condor, SLURM
● CISM: working principles, management, access
● Machine room visit
● From algorithm to computer program: optimization and parallel code
October 17th 2017 Damien François and Bernard Van Renterghem
Introduction to High Performance Computing at UCL, Oct. 17th 2017
CISM Cache memory
Program execution = exchange of information (program instructions and data) between the CPU and RAM.
RAM is slow and instructions are read sequentially, hence the cache: instructions and/or data are transferred from RAM to cache memory in entire blocks.
Cache L1, L2, L3
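As an illustration (not from the slides): in C, a 2-D array is stored row by row, so traversing it in row order uses every element of each cache line loaded from RAM, while traversing it in column order jumps N*sizeof(double) bytes between accesses and wastes most of each loaded block. Both orders compute the same value; only the speed differs.

```c
#include <stddef.h>

#define N 512

static double m[N][N];

void fill(double value) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            m[i][j] = value;
}

double sum_row_major(void) {      /* cache-friendly: contiguous reads */
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

double sum_col_major(void) {      /* strided reads: poor cache reuse */
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```

On large arrays the row-major version can be several times faster, purely because of cache behaviour.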
CISM Clusters
Big number of standard elements (low cost)
Network performance critical
[Diagram: many low-cost computers linked together by a network]
CISM Equipment: servers
CISM servers
● Manneback
● With CP3 working nodes
● « exotic » or « interactive » machines
UCL CECI servers
● Hmem
● Lemaitre2
CECI servers
● Vega ULB
● Hercules UNamur
● Dragon1 UMons
● Nic4 ULiège
CISM Equipment: clusters
● Charles Manneback (1894-1975) Georges Lemaitre's friend
● Made of several different generations of hardware
● Harpertown, Nehalem, Westmere, SandyBridge, IvyBridge, Haswell,…
● Opteron
● GPU Tesla
● MIC Xeon Phi
● CP3 partition: 125 nodes, 2424 cores
● Zoe partition: 26 nodes, 416 cores
● Def partition: 102 nodes, 1280 cores
● Installed compilers: GCC, Intel, PGI
● OS: GNU/Linux Centos 6
● Batch system: SLURM
Manneback
CISM Equipment: clusters
Manneback
Welcome to [ASCII-art « manneback » login banner]
Charles Manneback Lemaitre fellow cluster
(GNU/Linux CentOS 6.9) front-end: 2x8Core E5-2650@2GHz/64GB RAM
contact, support : [email protected]
Use the following commands to adjust your environment:
'module avail'            - show available modules
'module add <module>'     - adds a module to your environment for this session
'module initadd <module>' - configure module to be loaded at every login
CISM Equipment: clusters
Submitting jobs with SLURM: http://www.cism.ucl.ac.be/faq
http://www.uclouvain.be/cism
http://www.ceci-hpc.be
[root@manneback ~]# sload
MBack              : 3262/4128=79%
mb-har Harpertown  :  238/632=37%
mb-opt Opteron     : 1086/1248=87%
mb-neh Nehalem     :   76/104=73%
mb-wes Westmere    :     0/24=0%
mb-sab SandyBridge :  583/704=82%
mb-ivy IvyBridge   :  895/896=99%
mb-has Haswell (p zoe): 384/416=92%
CISM
# sinfo
NODELIST                NODES PARTI CPUS MEMORY FEATURES                 GRES
mb-har[121-123,125-130]   9   Def*    8 13869+ Harpertown,Xeon,X5460    localscratch:340
mb-har[124,131-140]      11   Def*    8 13930+ Harpertown,Xeon,X5460    localscratch:20
mb-har[001-009,011-014]  13   Def*    8 15947  Harpertown,Xeon,L5420    localscratch:166
mb-har102                 1   Def*    8 32108  Harpertown,Xeon,L5420    localscratch:146
mb-neh070                 1   Def*    8 24019  Nehalem,Xeon,X5550       localscratch:814,gpu:TeslaC1060/M1060:2
mb-sab040                 1   Def*   16 64398  SandyBridge,Xeon,E5-2660 localscratch:458,gpu:TeslaM2090:2,mic:5110P:1
[...]
CISM Equipment: exotic machines
Other peculiar machines:
● Lm9: interactive Matlab, ThermoCalc, R,…
  2× 6-core [email protected] GHz, 144 GB RAM
● Mb-neh070: dual quad [email protected] (85 Gflops) + 2× Tesla M1060 = 240 GPU cores, 624 SP Gflops, 77 DP Gflops (GPU)
● Mb-sab040: dual octa [email protected] + 2× Tesla M2090 = 512 GPU cores, 1332 SP Gflops, 666 DP Gflops (GPU) + Xeon Phi, 61 [email protected], 1011 DP Gflops
● SCMS-SAS 3&4: for SAS, STATA, R,… 2× 16-core, 8192 Gflops, 128GB RAM, Xeon [email protected]
● LmPp001-003: Lemaitre2 post-processing, Nvidia Quadro 4000 = 256 GPU cores, 486 SP Gflops, 243 DP Gflops (GPU)
CISM Equipment: CECI clusters
Hmem (www.ceci-hpc.be)
● 16 Dell PowerEdge R815 + 1 HP + 3 Ttec
● 17× 48-core AMD Opteron 6174 (Magny-Cours) @2.2GHz
  + 3× 8-core AMD Opteron 8222 @3GHz (24h partition)
● RAM: 2× 512 GB, 7× 256 GB, 8× 128 GB, 3× 128 GB
● /scratch 3.2TB or 1.7TB
● Infiniband 40Gb/s
● SLURM batch queuing system
● Total: 840 cores, 4128 GB RAM, 31 TB /scratch, 11TB /home, 7468 GFlops
CISM Equipment: CECI clusters

Lemaitre2 (www.ceci-hpc.be)
● 112 HP DL380 with 2× 6-core [email protected], 48GB RAM
● /scratch lustreFS 120TB, /tmp 325GB
● Infiniband 40Gb/s
● SLURM batch queuing system
● Total: 1344 cores, 5.25 TB RAM, 120 TB /scratch, 30TB /home, 13.6 TFlops
CISM Equipment: CECI clusters

ULB + UNamur + UMons + UCL + ULiège = CECI
See www.ceci-hpc.be/clusters.html
CISM

Manneback (2012): 1400 CPUs
HMEM (2011): 816 CPUs
Lemaitre2 (2012): 1380 CPUs
Vega (2013): 2752 CPUs
Dragon1 (2013): 416 CPUs
Hercules (2013): 896 CPUs
Nic4 (2014): 2048 CPUs
~ 8300 CPUs
Zenobe (2014): 13536 CPUs
CISM
280 M/month = 4.7 MhCPU ≈ 533 CPU-years
CISM CECI Distributed Filesystem
[Diagram: UCL and ULg HPC machine rooms linked via Belnet at 10 Gbps; storage solution core design (2014)]
CISM To reduce computing time…
… improve your code
● choice of algorithm
● source code
● optimized compiling
● optimized libraries
… use parallel computation
● OpenMP (mostly on SMP machines)
● MPI
CISM Source code
• Algorithm choice: does the volume of calculation increase with n, n×n,…? Is it stable?
• Programming language: FORTRAN, C, C++,…?
• Coding practice:
  ● indirect addressing is expensive (pointers)
  ● fetching order of array elements (for optimal use of cache memory)
  ● loop efficiency (get all unnecessary bits and pieces out of them)
CISM Compiling
• The compiler…
  ● translates an instruction list written in a high-level language into a machine-readable (binary) file [= the object file]
    e.g. ifc -c myprog.f90 generates the object file myprog.o
  ● links binary object files to produce an executable file
    e.g. ifc -o myprog module1.o libmath.a myprog.o generates the executable file (= program) myprog
• Optimization options: -O1, -O2, -O3
• Different qualities of compilers!
CISM Optimized libraries: BLAS
Basic Linear Algebra Subroutines
● a set of optimized subroutines to handle vector×vector, matrix×vector and matrix×matrix operations (for real and complex numbers, single or double precision)
● the subroutines are optimized for a specific machine CPU/OS
● see http://www.netlib.org/blas
● example…
CISM Optimized libraries: BLAS
● compiling from BLAS source:
  ifc -o mvm sgemv.f mvm.f
● compiling with the pre-compiled BLAS library (optimized for Intel CPUs):
  ifc -o mvm mvm.f sblasd13d.a
real matlxc(nl, nc)
real vectc(nc), result(nl)

call random_number(matlxc)
call random_number(vectc)

! hand-coded version:
do i = 1, nl
   result(i) = 0.0
   do j = 1, nc
      result(i) = result(i) + matlxc(i,j)*vectc(j)
   end do
end do

! equivalent single call to the BLAS routine (SGEMV is single precision,
! so the arrays and the alpha/beta constants must be single precision too):
call SGEMV('N', nl, nc, 1.0, matlxc, nl, vectc, 1, 0.0, result, 1)
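For comparison, the same y = A·x product written out in plain C (a sketch, not a replacement for the optimized BLAS routine the Fortran example calls). Note the storage difference: C arrays are row-major, Fortran arrays column-major.

```c
/* y = A*x for an nl x nc matrix stored row-major in a flat array */
void matvec(int nl, int nc, const double *a, const double *x, double *y) {
    for (int i = 0; i < nl; i++) {
        double s = 0.0;
        for (int j = 0; j < nc; j++)
            s += a[i * nc + j] * x[j];   /* a[i][j] in row-major layout */
        y[i] = s;
    }
}
```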
CISM Optimized libraries: BLAS
Performance comparison of the Intel and PGI FORTRAN compilers, for hand-written code, BLAS source code and the pre-compiled optimized library (matrix 10,000 × 5,000):

Compiler      Subroutine      Options   Mflops
Intel (ifc)   DO loop         -O0           11
                              -O3           11
              BLAS source     -O0           42
                              -O3          115
              BLAS compiled   -O0          120
                              -O3          120
PGI (pgf90)   DO loop         -O0           11
                              -O3           11
              BLAS source     -O0           48
                              -O3           57
              BLAS compiled   -O0          116
                              -O3          119
CISM Optimized libraries: LAPACK
• Linear Algebra subroutines:
  ● linear equation systems: Ax = b
  ● least squares: min ||Ax-b||²
  ● eigenvalue problems: Ax = λx, Ax = λBx
  ● for real or complex numbers, single or double precision
  ● includes all utility routines (LU factoring, Cholesky,…)
• Based on BLAS (LAPACK itself does not depend on the hardware, yet is always optimized through the underlying BLAS)
• See http://www.netlib.org/lapack
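A conceptual sketch of what a LAPACK linear-system driver does for Ax = b, hard-coded here for a 2×2 system: Gaussian elimination followed by back-substitution. Illustration only; real codes should call the LAPACK routine, which adds pivoting and blocked, cache-friendly algorithms.

```c
/* Solve a 2x2 system Ax = b by elimination + back-substitution.
   Assumes a[0][0] != 0 (no pivoting, unlike the real LAPACK routine). */
void solve_2x2(const double a[2][2], const double b[2], double x[2]) {
    double f   = a[1][0] / a[0][0];       /* elimination factor */
    double a11 = a[1][1] - f * a[0][1];   /* eliminate below the pivot */
    double b1  = b[1]    - f * b[0];
    x[1] = b1 / a11;                      /* back-substitution */
    x[0] = (b[0] - a[0][1] * x[1]) / a[0][0];
}
```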
CISM OpenMP
• OpenMP (Open Multi-Processing): a standard (compiler directives, functions, environment variables) for shared-memory architectures (OpenMP 2.0)

• Principle: compiler directives > parallelism details are left to the compiler > fast implementation (fork-and-join model)

…
!$OMP PARALLEL DO
DO I = 1, 1000
   a(i) = b(i)*c(i)
END DO
…
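The same loop in C with the corresponding OpenMP directive. Compile with your compiler's OpenMP flag (e.g. -fopenmp for GCC) and the iterations are divided among threads; without the flag the pragma is ignored and the loop runs serially with the same result.

```c
#define LEN 1000

/* a[i] = b[i] * c[i]: iterations are independent, so the compiler
   may safely share them out among threads. */
void multiply(double *a, const double *b, const double *c) {
    #pragma omp parallel for
    for (int i = 0; i < LEN; i++)
        a[i] = b[i] * c[i];
}
```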
CISM MPI environment
• MPI = Message Passing Interface (2.0)
• Principle: the programmer keeps full control over data exchange: the program itself distributes the work and manages the communication between nodes
• Widely used standard for clusters (but also exists for SMP boxes)
…
      REAL a(100)
…
C Process 0 sends, process 1 receives:
      if ( myrank.eq.0 ) then
         call MPI_SEND(a, 100, MPI_REAL, 1, 17, MPI_COMM_WORLD, ierr)
      else if ( myrank.eq.1 ) then
         call MPI_RECV(a, 100, MPI_REAL, 0, 17, MPI_COMM_WORLD, status, ierr)
      endif
…
CISM Job submission
• Goal: one single task per CPU
• Principle: the user hands their program over to an automatic job management system, specifying its requirements (memory, architecture, number of CPUs,…). When the requested resources become available, the job is dispatched and starts running.
• Slurm Workload Manager
● sbatch
● sinfo
● scancel
CISM Job submission
• Submission script examples…
• To submit your job: sbatch myscript
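As an illustration, a minimal sbatch script might look like the following (the job name, resource values and program name are placeholders, not taken from the slides):

```shell
#!/bin/bash
#SBATCH --job-name=myjob          # name shown by squeue
#SBATCH --ntasks=1                # number of tasks (processes)
#SBATCH --cpus-per-task=4         # e.g. for an OpenMP program
#SBATCH --mem-per-cpu=1024        # MB of RAM per CPU
#SBATCH --time=01:00:00           # walltime limit hh:mm:ss

module add gcc                    # load the environment the job needs
./myprog                          # the program to run
```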
CISM CISM: research environment
ELEN
TERM
SC
TOPO
ELIENAPS
RDGN
RECI
BSMA
IMAP
MOST
MEMA
BIB
COMU
LICR
ELEN
INGI
INMA
ELIC
INFM
CP3
FACM
NAPS
LOCI
GERU
PAMO
LSM
ECON
RSPO
CISM CISM
• Equipment and support available for any UCL (and CECI) member
• Equipment is acquired through projects
• Goal: joining forces to acquire and manage more powerful equipment
• Institut de Calcul Intensif et de Stockage de Masse:
● management committee composed of representatives of the users' entities: debates and decides on strategies; chairman elected for four years
● offices in Mercator; machine rooms in Pythagore and Marc de Hemptinne
● daily management by technical computer team, under leadership of CISM Director (elected for four years)
CISM CISM management team
Thomas Keutgen, CISM Director
Olivier Mattelaer, system administrator & user support
Bernard Van Renterghem, system administrator & user support
Damien François, system administrator & user support
CISM Environmental challenges
• two 60 kW water chillers
Aquarium
• water cooling (rack based)
CISM Environmental challenges
• total hosting capacity 120 kW
• electrical redundancy and 200 kVA UPS protection
• 5 m³ buffer tank
• redundant pumps, electrical feed through independent UPS

Aquarium