Parallel Computing with OpenMP


Transcript of Parallel Computing with OpenMP

Page 1: Parallel Computing with OpenMP

Parallel Computing with OpenMP

Yutaka Masuda

University of Georgia

Summer Course May 18, 2018

Page 2: Parallel Computing with OpenMP

Computing cores

• Why don’t you use multiple cores for your computations?

= This is what OpenMP will do.

Page 3: Parallel Computing with OpenMP

Multiple computers

• Why don’t you use multiple computers working together?

= This is a cluster: outside the scope of OpenMP; it is the domain of MPI.

https://en.wikipedia.org/wiki/File:MEGWARE.CLIC.jpg

Page 4: Parallel Computing with OpenMP

Methods in parallel computing

• OpenMP
  • Shared memory (single computer)
  • A set of directives
  • Focus on parallelizing loops = limited purpose
  • Automatic management by the program = easier to program

• MPI (Message Passing Interface)
  • Distributed memory (clusters)
  • A collection of subroutines
  • Any kind of parallel computing = flexible
  • Manual control of data flow & management = complicated

From www.comsol.com

Page 5: Parallel Computing with OpenMP

Sum of 1 to 100

• MPI:

program summpi
  use mpi
  implicit none
  integer :: i, myid, nprocs, io
  real :: s, local_s
  call mpi_init(io)
  call mpi_comm_rank(MPI_COMM_WORLD, myid, io)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, io)
  local_s = 0.0
  do i = myid+1, 100, nprocs
     local_s = local_s + i
  end do
  call mpi_reduce(local_s, s, 1, MPI_REAL, &
                  MPI_SUM, 0, MPI_COMM_WORLD, io)
  if (myid == 0) then
     print *, s
  end if
  call mpi_finalize(io)
end program summpi

• OpenMP:

program sumomp
  implicit none
  integer :: i
  real :: s = 0.0
  !$omp parallel private(i) &
  !$omp reduction(+:s)
  !$omp do
  do i = 1, 100
     s = s + i
  end do
  !$omp end do
  !$omp end parallel
  print *, s
end program sumomp

Page 6: Parallel Computing with OpenMP

Cores and threads

• Cores: physical hardware within CPU (central processing unit)

• Threads: tasks defined in your software

• The operating system assigns a task to a core by “time slicing”.
• The system switches the tasks running on a core over short periods of time.

• # of threads can be > # of cores.

https://www.intel.com/content/www/us/en/products/processors/core/i5-processors/i5-8400.html
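To see these numbers on your own machine, here is a minimal sketch (my own example, not from the slides) using two routines from the omp_lib module introduced later: omp_get_num_procs() reports the number of logical processors and omp_get_max_threads() the number of threads OpenMP would use by default.

program coresandthreads
  !$ use omp_lib
  implicit none
  ! number of logical processors visible to OpenMP
  !$ print *,'processors available:',omp_get_num_procs()
  ! number of threads the next parallel region would use
  !$ print *,'default max threads: ',omp_get_max_threads()
end program coresandthreads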

Page 7: Parallel Computing with OpenMP

OpenMP creates threads

From Wikipedia

Regular (sequential) program:

Parallel program:

Page 8: Parallel Computing with OpenMP

Fork-Join model

From Wikipedia

[Figure: the master thread forks a team of threads at each parallel region (Parallel Region 1, 2, 3) and the threads join back into the master thread afterwards: Fork → Join, repeated for each region.]
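A minimal sketch of the fork-join pattern (my own illustration, not from the slides), using the directives explained on the following pages: the master thread runs the sequential parts alone, forks a team of threads at each !$omp parallel, and joins back to a single thread at each !$omp end parallel.

program forkjoin
  implicit none
  print *,'sequential part (master thread only)'
  !$omp parallel
  print *,'inside parallel region 1 (printed by every thread)'
  !$omp end parallel          ! join: back to the master thread
  print *,'sequential part again'
  !$omp parallel
  print *,'inside parallel region 2 (printed by every thread)'
  !$omp end parallel          ! join
end program forkjoin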

Page 9: Parallel Computing with OpenMP

Program structure with OpenMP

do i=1,10
   x(i)=sqrt(x(i))
end do

!$omp parallel
!$omp do
do i=1,10
   x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel

[Figure: in the sequential version, iterations i=1 through i=10 run one after another on a single thread; in the parallel version with 3 threads, the 10 iterations are divided among the threads and run concurrently.]

Page 10: Parallel Computing with OpenMP

OpenMP directives

• A directive must begin with the keyword !$omp.
• The directives take effect only if you supply a compiler option.
• Otherwise, the directives are ignored (they look like comments).
• An OpenMP region must be enclosed by !$omp … and !$omp end … directives.

!$omp parallel
!$omp do
do i=1,10
   x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel

Page 11: Parallel Computing with OpenMP

OpenMP directives (cont’d)

• Each directive can have optional clauses: variable type, number of threads, conditional execution, etc.

• A statement starting with !$ will be compiled only when OpenMP is enabled (conditional compilation).

!$omp parallel private(i) shared(x)
!$omp do
do i=1,10
   x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel

!$ print *,'OpenMP is active!'

Page 12: Parallel Computing with OpenMP

Compiler options

• Depends on the compiler:
  • Intel Fortran Compiler (ifort): -qopenmp (or -fopenmp)
  • GFortran: -fopenmp
  • Absoft: -openmp
  • NAG: -openmp
  • PGI: -mp

• Examples:
  • ifort -qopenmp prog.f90
  • gfortran -fopenmp prog.f90

Page 13: Parallel Computing with OpenMP

Runtime options (details later)

• Specification of the number of threads (on Linux)

OMP_NUM_THREADS=2 ./a.out (with 2 threads)

• Default behavior:
  • If OMP_NUM_THREADS is set, it is followed.
  • If the program itself limits the number of threads, that limit is followed.
  • Otherwise, as many threads as available on your computer are used.

Page 14: Parallel Computing with OpenMP

Directive: parallel

!$omp parallel
print *,'Hi!'
!$omp end parallel

• Defines a parallel region and assigns the task to each thread.
• The region will be executed by multiple threads.
• The number of threads can be controlled by an optional clause, supplemental functions, or an environment variable.

Each thread executes print *,'Hi!', so the output is:

Hi!
Hi!
Hi!

Page 15: Parallel Computing with OpenMP

Directive: do

• Performs the do-loop with multiple threads.
• The !$omp do directive must be placed just before a do-loop.
• The directive must be enclosed in a parallel region.

• The counter is not necessarily incremented in order.

• The counter i is treated as a separate variable for each thread (private variable).

!$omp parallel
!$omp do
do i=1,10
   x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel

[Figure: e.g., one thread runs i=1,5,6; another runs i=4,7,9; another runs i=2,3,8,10 — not necessarily in order.]

Page 16: Parallel Computing with OpenMP

Shared variable by default

! compute parent average (PA)

!$omp parallel
!$omp do
do i=1,n
   s=sire(i)
   d=dam(i)
   pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel

[Figure: n, sire(:), dam(:), pa(:), and ebv(:) are shared variables; the loop counter i is private to each thread, but s and d are shared. Thread 1 computes s=sire(1), d=dam(1), pa(1)=(ebv(s)+ebv(d))/2.0 while thread 2 computes s=sire(2), d=dam(2), pa(2)=(ebv(s)+ebv(d))/2.0.]

• All threads share the variables s and d.
• One thread rewrites the variables while another thread reads them!

Page 17: Parallel Computing with OpenMP

Private and shared variables

! compute parent average (PA)

!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa)
!$omp do
do i=1,n
   s=sire(i)
   d=dam(i)
   pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel

• Each thread has its own variables s and d, so there is no race condition any more.

[Figure: n, sire(:), dam(:), pa(:), and ebv(:) remain shared variables, while i, s, and d are now private: each thread uses its own copies of s and d when computing pa(1), pa(2), ….]

Page 18: Parallel Computing with OpenMP

Clause: shared and private

• Define variable types.
• Use private() and shared() clauses in the parallel directive.

• Private variables will be created for each thread.

• Shared variables will be shared (rewritten) by all threads.

• Variables will be shared by default except loop counters.

• Always declare the type (private or shared) of every variable to avoid bugs.

! compute parent average (PA)

!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa)
!$omp do
do i=1,n
   s=sire(i)
   d=dam(i)
   pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel

Page 19: Parallel Computing with OpenMP

Clause: reduction

known=0

!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa) &
!$omp reduction(+:known)
!$omp do
do i=1,n
   s=sire(i)
   d=dam(i)
   pa(i)=(ebv(s)+ebv(d))/2.0
   if(s/=0 .and. d/=0) known=known+1
end do
!$omp end do
!$omp end parallel

• Specifies variables for “reduction” operations.
• The variable known is treated as private for each thread.
• At the end of the loop, all threads add their private known to the global known.

• Other operations (instead of +) are available: +, *, max, min, etc.
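For example, a hypothetical max reduction in the same style (maxpa is an illustrative variable, not from the slides): each thread keeps its own maxpa, and the thread-local maxima are combined into the global maxpa at the end of the loop.

maxpa = -huge(0.0)
!$omp parallel private(i) shared(n,pa) &
!$omp reduction(max:maxpa)
!$omp do
do i=1,n
   maxpa = max(maxpa, pa(i))
end do
!$omp end do
!$omp end parallel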

Page 20: Parallel Computing with OpenMP

Clause: if

known=0

!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa) &
!$omp reduction(+:known) &
!$omp if(n>100000)
!$omp do
do i=1,n
   s=sire(i)
   d=dam(i)
   pa(i)=(ebv(s)+ebv(d))/2.0
   if(s/=0 .and. d/=0) known=known+1
end do
!$omp end do
!$omp end parallel

• Conditional use of OpenMP.
• If the condition is true, OpenMP will be invoked in the parallel region.
• If not, the OpenMP directives in this region will be ignored (i.e., single-threaded execution).

Page 21: Parallel Computing with OpenMP

Built-in functions/subroutines

• Built-in functions/subroutines for OpenMP are defined in the module omp_lib.
• Recommendation: always reference this module as !$ use omp_lib, because the module is usable only when you supply the compiler option.

• See the textbook or openmp.org for details.

use omp_lib

or

!$ use omp_lib
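A minimal sketch (my own example, not from the textbook) combining !$ use omp_lib with conditional compilation: it prints each thread's id with omp_get_thread_num() and the team size with omp_get_num_threads(), and still compiles as a serial program when OpenMP is off.

program whoami
  !$ use omp_lib
  implicit none
  !$omp parallel
  !$ print *,'thread',omp_get_thread_num(),'of',omp_get_num_threads()
  !$omp end parallel
end program whoami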

Page 22: Parallel Computing with OpenMP

Built-in function: timing

• The OpenMP function omp_get_wtime() returns the wall-clock time.

!$ use omp_lib
integer,parameter :: r8=selected_real_kind(15,300)
real(r8) :: tic,toc
...
!$ tic=omp_get_wtime()
!$omp parallel
!$omp do
do
   ...
end do
!$omp end do
!$omp end parallel
!$ toc=omp_get_wtime()
!$ print *,'running time=',toc-tic

Page 23: Parallel Computing with OpenMP

The number of threads

• The default number of threads used is the maximum number available on your system.

• A parallel program will be slow if, for example, you separately run another parallel program and each program tries to use the maximum number of threads.

• Three different ways to change the number of threads:
  1. Region-specific configuration (a clause in the parallel directive)
  2. Program-specific configuration (a built-in subroutine)
  3. Run-time configuration (an environment variable)

Page 24: Parallel Computing with OpenMP

Approach 1

integer :: n

n = 2

!$omp parallel num_threads(n)
!$omp do
do
   ...
end do
!$omp end do
!$omp end parallel

• Use of the num_threads clause.
• This is a region-specific configuration.

Page 25: Parallel Computing with OpenMP

Approach 2

!$ use omp_lib

integer :: n

n=2

!$ call omp_set_num_threads(n)

!$omp parallel

!$omp do

do

...

end do

!$omp end do

!$omp end parallel

• Use of the built-in subroutine omp_set_num_threads.
• It changes the default number of threads in the program.
• It basically affects all subsequent parallel regions without a num_threads clause (see the OpenMP specification).

Page 26: Parallel Computing with OpenMP

Approach 3

Linux and Mac OS X:

export OMP_NUM_THREADS=4

or

OMP_NUM_THREADS=4 ./a.out

Windows:
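In a Command Prompt session, the variable can also be set for that session (an alternative to the GUI route described below):

set OMP_NUM_THREADS=4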

• Use of the environment variable OMP_NUM_THREADS.
• It means you don't have to change the program; you can just change the system variable.
• In Linux and Mac OS X, this variable is effective only in the current session. Write it in your Bash profile to make it permanent.
• In Windows, open the computer's properties to set the variable.

Page 27: Parallel Computing with OpenMP

OpenMP is not perfect.

• A task should be split into several independent jobs.
• Not directly applicable if there are data dependencies:

do i=3,n
   x(i)=x(i-1)+x(i-2)
end do

• Hard to parallelize: Gauss-Seidel iterations, MCMC, and so on.
• Modify the algorithm to create independent tasks (see the sketch below).
• Is it really worth doing?
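As an illustration of modifying the algorithm (my own sketch, not from the slides, with a hypothetical work array xnew): a Gauss-Seidel-style sweep updates x(i) from already-updated neighbors and therefore carries a dependency, whereas a Jacobi-style sweep reads only the previous iterate, so its loop iterations are independent and can be parallelized.

! Jacobi-style sweep: xnew depends only on the old x,
! so the iterations of the i-loop are independent
!$omp parallel private(i) shared(n,x,xnew)
!$omp do
do i=2,n-1
   xnew(i) = 0.5*(x(i-1)+x(i+1))
end do
!$omp end do
!$omp end parallel
x = xnew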

Page 28: Parallel Computing with OpenMP

Why is my OpenMP not so fast?

• Thread management is automatic but costly.
• Possibility: management overhead ≥ advantage of threading.
• 0.20 vs. 0.50 sec with/without 4 threads (first example below); 11.9 vs. 46.7 sec with/without 4 threads (second example below).

• OpenMP is useful only when the overhead can be ignored e.g. heavy computations repeated many times.

!$omp parallel private(i) &
!$omp reduction(+:s)
!$omp do
do i=1,1000000000
   s = s + 1/real(i)
end do
!$omp end do
!$omp end parallel

!$omp parallel private(i) &
!$omp reduction(+:s)
!$omp do
do i=1,1000000000
   s = s + log(abs(sin((sqrt(1/dble(i)))**0.6)))
end do
!$omp end do
!$omp end parallel

Page 29: Parallel Computing with OpenMP

Why is my OpenMP so slow?

• Many other jobs use the computing cores.
• Possibility: too many multi-threaded programs in the background, or too many threads consumed by a few other programs.

• Best practice: limit the number of threads for each program

[Figure: bar chart of wall-clock time (sec) in one round of AI REML with YAMS for 1, 2, 4, 8, and 16 threads.]

Page 30: Parallel Computing with OpenMP

BLUPF90 programs and parallelization

• Relies on parallel libraries and modules.
• The genomic module depends on Intel MKL, i.e., optimized BLAS & LAPACK subroutines (MKL is parallelized with OpenMP).

• The module also uses OpenMP directives.

• YAMS (a sparse matrix library) calls MKL as well.

• BLUPF90IOD2 (a licensed software) supports OpenMP.

• Please check how many threads you are actually using before running the parallel programs.

Page 31: Parallel Computing with OpenMP

BLAS and LAPACK

• BLAS: Basic Linear Algebra Subprograms
  • Very basic matrix/vector computations: mainly multiplications
  • Originally written in Fortran
  • Optimized by vendors
  • MKL (Math Kernel Library) by Intel: a lot of optimization + OpenMP

• LAPACK: Linear Algebra Package
  • High-level matrix computations:
  • Cholesky, eigenvalue, and singular value decompositions
  • Linear equations etc.
  • Performance depends on BLAS
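A minimal sketch (my own example, not from the slides) of calling the standard BLAS routine DGEMM to compute C = A*B in double precision; link it against any BLAS implementation, e.g., MKL or the reference BLAS (for instance gfortran prog.f90 -lblas).

program dgemmdemo
  implicit none
  integer,parameter :: dp=selected_real_kind(15,300)
  integer,parameter :: n=3
  real(dp) :: a(n,n), b(n,n), c(n,n)
  call random_number(a)
  call random_number(b)
  ! C := 1.0*A*B + 0.0*C with the standard DGEMM interface
  call dgemm('N','N',n,n,n,1.0_dp,a,n,b,n,0.0_dp,c,n)
  print *,c
end program dgemmdemo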

Page 32: Parallel Computing with OpenMP

Genomic module (next week)

• G-matrix: 𝐆 = 𝐙𝐙′/𝑘
  • DGEMM from BLAS (Aguilar et al., 2011)
  • Some quality control on markers

• A22-matrix and related computations
  • Subset of the numerator relationship matrix (Aguilar et al., 2011)
  • Diagonals of A22-inverse (Masuda et al., 2017)
  • Indirect computations in A22-inverse (Masuda et al., 2016)

Page 33: Parallel Computing with OpenMP

Aguilar et al., 2011

Page 34: Parallel Computing with OpenMP

Computing time for G = ZZ’/k with 50K SNP markers using DSYRK from MKL (Intel Xeon E5-2650 2.2 GHz × 24 cores):

Animals   Time (s)   Mem (GB)
1000      0.18       0.4
2000      0.47       0.8
5000      1.87       2.1
10000     8.40       4.5
20000     28.4       10.4
50000     162.7      37.3
100000    664.2      111.8
200000    2796.0     372.5
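A minimal sketch (hypothetical sizes and a placeholder Z, not the BLUPF90 code) of the DSYRK call behind this table: for an animals-by-markers matrix Z, it fills the lower triangle of G = ZZ’/k.

program gmat
  implicit none
  integer,parameter :: dp=selected_real_kind(15,300)
  integer,parameter :: nanim=100, nsnp=1000      ! hypothetical sizes
  real(dp),allocatable :: z(:,:), g(:,:)
  real(dp) :: k
  allocate(z(nanim,nsnp), g(nanim,nanim))
  call random_number(z)                          ! placeholder for centered genotypes
  k = real(nsnp,dp)                              ! placeholder scaling factor
  ! G := (1/k) * Z * Z' ; DSYRK fills only the lower triangle of G
  call dsyrk('L','N',nanim,nsnp,1.0_dp/k,z,nanim,0.0_dp,g,nanim)
  print *,'G(1,1)=',g(1,1)
end program gmat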

Page 35: Parallel Computing with OpenMP

YAMS (next week)

• Yet Another Mixed-model Solver (Masuda et al., 2014, 2015)

• Supernodal approaches
• Use of dense matrices within the sparse matrix, i.e., BLAS
• Both in factorization and inversion

• Replacement of FSPAK
• OPTION use_yams

• Variance component estimation with REML

• S. E. (or accuracy) of breeding value by inversion

• Indirect computation in A22-inverse

• Unavailable in Gibbs sampling