Parallel Computing with OpenMP
Yutaka Masuda
University of Georgia
Summer Course May 18, 2018
Computing cores
• Why don’t you use multiple cores for your computations?
= This is what OpenMP will do.
Multiple computers
• Why don’t you use multiple computers working together?
= This is a cluster: outside the scope of OpenMP; use MPI instead.
https://en.wikipedia.org/wiki/File:MEGWARE.CLIC.jpg
Methods in parallel computing
• OpenMP
  • Shared memory (single computer)
  • A set of directives
  • Focus on parallelizing loops = limited purpose
  • Automatic management by the program = easier to program
• MPI (Message Passing Interface)
  • Distributed memory (clusters)
  • A collection of subroutines
  • Any kind of parallel computing = flexible
  • Manual control of data flow & management = complicated
From www.comsol.com
Sum of 1 to 100
• MPI

program summpi
  use mpi
  implicit none
  integer :: i, myid, nprocs, io
  real :: s, local_s
  call mpi_init(io)
  call mpi_comm_rank(MPI_COMM_WORLD, myid, io)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, io)
  local_s = 0.0
  do i = myid+1, 100, nprocs
    local_s = local_s + i
  end do
  call mpi_reduce(local_s, s, 1, MPI_REAL, &
                  MPI_SUM, 0, MPI_COMM_WORLD, io)
  if (myid == 0) then
    print *, s
  end if
  call mpi_finalize(io)
end program summpi

• OpenMP

program sumomp
  implicit none
  integer :: i
  real :: s = 0.0
  !$omp parallel private(i) &
  !$omp reduction(+:s)
  !$omp do
  do i = 1, 100
    s = s + i
  end do
  !$omp end do
  !$omp end parallel
  print *, s
end program sumomp
Cores and threads
• Cores: physical hardware within CPU (central processing unit)
• Threads: tasks defined in your software
• The operating system assigns a task to a core by “time slicing”.
• The system switches tasks on a core over short periods of time.
• # of threads can be > # of cores.
https://www.intel.com/content/www/us/en/products/processors/core/i5-processors/i5-8400.html
OpenMP creates threads
From Wikipedia
Regular (sequential) program:
Parallel program:
Fork-Join model
From Wikipedia
Parallel Region 1
Parallel Region 2
Parallel Region 3
Fork Join Fork Join Fork Join
Program structure with OpenMP
do i=1,10
  x(i)=sqrt(x(i))
end do

!$omp parallel
!$omp do
do i=1,10
  x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
(Diagram: sequentially, the iterations i=1 to i=10 run in order; with 3 threads, the 10 iterations are divided among the threads.)
OpenMP directives
• A directive must begin with the keyword !$omp.
• The directives take effect only if you supply a compiler option.
• Otherwise, the directives are ignored (because they look like comments).
• An OpenMP region must be enclosed by !$omp … and !$omp end … directives.
!$omp parallel
!$omp do
do i=1,10
  x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
OpenMP directives (cont'd)
• Each directive can have an optional clause.
• Variable types, the number of threads, conditional execution, etc.
• A statement starting with !$ will be compiled only when OpenMP is enabled (conditional compilation).

!$omp parallel private(i) shared(x)
!$omp do
do i=1,10
  x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
!$print *,’OpenMP is active!’
Compiler options
• Depends on the compiler
  • Intel Fortran Compiler (ifort): -qopenmp (or -fopenmp)
• GFortran: -fopenmp
• Absoft: -openmp
• NAG: -openmp
• PGI: -mp
• Examples
  • ifort -qopenmp prog.f90
• gfortran -fopenmp prog.f90
Runtime options (details later)
• Specification of the number of threads (on Linux)
OMP_NUM_THREADS=2 ./a.out (with 2 threads)
• Default behavior:
  • If OMP_NUM_THREADS is set, it is followed.
  • If the program sets a limit, that is followed.
  • Otherwise, as many threads as possible on your computer are used.
Directive: parallel
!$omp parallel
print *,’Hi!’
!$omp end parallel

• Defines a parallel region and assigns the task to each thread.
• The region will be executed by multiple threads.
• The number of threads can be controlled by an optional clause, supplemental functions, or an environment variable.
(Each of the three threads executes print *,’Hi!’.)

(Output)
Hi!
Hi!
Hi!
Directive: do
• Performs the do-loop with multiple threads.
• The !$omp do directive must be placed just before a do-loop.
• The directive must be surrounded by parallel.
• The counter is not necessarily incremented in order.
• The counter i is treated as a separate variable for each thread (a private variable).

!$omp parallel
!$omp do
do i=1,10
  x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
(Diagram: 3 threads, e.g. thread 1 takes i=1,5,6; thread 2 takes i=4,7,9; thread 3 takes i=2,3,8,10.)
Shared variable by default
! compute parent average (PA)
!$omp parallel
!$omp do
do i=1,n
  s=sire(i)
  d=dam(i)
  pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel
(Diagram: the arrays n, sire(:), dam(:), pa(:), ebv(:) and the scalars s and d are shared; two threads with different private i both assign s=sire(i), d=dam(i) and compute pa(i)=(ebv(s)+ebv(d))/2.0 at the same time.)

• All threads share the variables s and d.
• One thread rewrites the variables while another thread reads them!
Private and shared variables

! compute parent average (PA)
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa)
!$omp do
do i=1,n
  s=sire(i)
  d=dam(i)
  pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel

• Each thread has its own variables s and d, so there is no competition any more.
(Diagram: the arrays n, sire(:), dam(:), pa(:), ebv(:) remain shared; each thread now has its own private copies of i, s, and d.)
Clause: shared and private
• Define variable types.
• Use the private() and shared() clauses in the parallel directive.
• Private variables will be created for each thread.
• Shared variables will be shared (rewritten) by all threads.
• Variables will be shared by default except loop counters.
• Always declare the variable type to avoid bugs.
! compute parent average (PA)
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa)
!$omp do
do i=1,n
  s=sire(i)
  d=dam(i)
  pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel
Clause: reduction
known=0
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa) &
!$omp reduction(+:known)
!$omp do
do i=1,n
  s=sire(i)
  d=dam(i)
  pa(i)=(ebv(s)+ebv(d))/2.0
  if(s/=0.and.d/=0) known=known+1
end do
!$omp end do
!$omp end parallel

• Specifies a variable for a “reduction” operation.
• The variable known is treated as private for each thread.
• At the end of the loop, every thread adds its private known to the global known.
• Other operations (instead of +) are available: *, max, min, etc.
Clause: if
known=0
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa) &
!$omp reduction(+:known) &
!$omp if(n>100000)
!$omp do
do i=1,n
  s=sire(i)
  d=dam(i)
  pa(i)=(ebv(s)+ebv(d))/2.0
  if(s/=0.and.d/=0) known=known+1
end do
!$omp end do
!$omp end parallel

• Conditional use of OpenMP
• If the condition is true, OpenMP will be invoked in the parallel region.
• If not, the OpenMP directives in this region are ignored (i.e. single-thread execution).
Built-in functions/subroutines
• Built-in functions/subroutines for OpenMP are defined in the module omp_lib.
• Recommendation: always cite this module as !$ use omp_lib, because the module is usable only when you put a compiler option.
• See the textbook or openmp.org for details.
use omp_lib
or
!$ use omp_lib
Built-in function: timing
• The OpenMP function omp_get_wtime() returns the wall-clock time.

!$ use omp_lib
integer,parameter :: r8=selected_real_kind(15,300)
real(r8) :: tic,toc
...
!$ tic=omp_get_wtime()
!$omp parallel
!$omp do
do
  ...
end do
!$omp end do
!$omp end parallel
!$ toc=omp_get_wtime()
!$ print *,’running time=’,toc-tic
The number of threads
• The default number of threads is the maximum number available on your system.
• A parallel program will be slow if …
  • You separately run another parallel program and each program tries to use the maximum number of threads.
• Three different ways to change the number of threads:
  1. Region-specific configuration (a clause in the parallel directive)
  2. Program-specific configuration (a built-in subroutine)
  3. Run-time configuration (an environment variable)
Approach 1
integer :: n
n = 2
!$omp parallel num_threads(n)
!$omp do
do
...
end do
!$omp end do
!$omp end parallel
• Use of the num_threads clause.
• This is a region-specific configuration.
Approach 2
!$ use omp_lib
integer :: n
n=2
!$ call omp_set_num_threads(n)
!$omp parallel
!$omp do
do
...
end do
!$omp end do
!$omp end parallel
• Use of the built-in subroutine omp_set_num_threads.
• It changes the default number of threads in the program.
• It basically affects all subsequent parallel regions without a num_threads clause (see the OpenMP specifications).
Approach 3
Linux and Mac OS X:
export OMP_NUM_THREADS=4
or
OMP_NUM_THREADS=4 ./a.out
Windows:
• Use of the environment variable OMP_NUM_THREADS.
• You don’t have to change the program; you can just change the system variable.
• On Linux and Mac OS X, the variable is effective only in the current session; to make it permanent, write it in your Bash profile.
• On Windows, open the computer's properties to set the variable.
OpenMP is not perfect.
• A task should be split into several independent jobs.
• Not directly applicable if there are data dependencies.

do i=3,n
  x(i)=x(i-1)+x(i-2)
end do

• Hard to parallelize: Gauss-Seidel iterations, MCMC, and so on.
• Modify the algorithm to create independent tasks.
• Is it really worth doing?
Why is my OpenMP not so fast?
• Thread management is automatic but costly.
• Possibility: management overhead ≥ the advantage of threading
• 0.20 vs. 0.50 sec with/without 4 threads; 11.9 vs. 46.7 sec with/without 4 threads
• OpenMP is useful only when the overhead can be ignored, e.g. heavy computations repeated many times.
!$omp parallel private(i) &
!$omp reduction(+:s)
!$omp do
do i=1,1000000000
  s = s + 1/real(i)
end do
!$omp end do
!$omp end parallel

!$omp parallel private(i) &
!$omp reduction(+:s)
!$omp do
do i=1,1000000000
  s = s + log(abs(sin((sqrt(1/dble(i)))**0.6)))
end do
!$omp end do
!$omp end parallel
Why is my OpenMP so slow?
• Many other jobs use the computing cores.
• Possibility: too many multi-threaded programs in the background, or too many threads consumed by a few other programs
• Best practice: limit the number of threads for each program
(Chart: wall-clock time (sec) in one round of AI REML with YAMS, for 1, 2, 4, 8, and 16 threads; y-axis 0 to 400 sec.)
BLUPF90 programs and parallelization
• Relies on parallel libraries and modules.
• The genomic module depends on Intel MKL, i.e. optimized BLAS & LAPACK subroutines (MKL is parallelized with OpenMP).
• The module also uses OpenMP directives.
• YAMS (a sparse matrix library) calls MKL as well.
• BLUPF90IOD2 (licensed software) supports OpenMP.
• Please check how many threads you are actually using before running the parallel programs.
BLAS and LAPACK
• BLAS: Basic Linear Algebra Subprograms
  • Very basic matrix/vector computations: mainly multiplications
  • Originally written in Fortran
  • Optimized by vendors
  • MKL (Math Kernel Library) by Intel: a lot of optimization + OpenMP
• LAPACK: Linear Algebra Package
  • High-level matrix computations:
    • Cholesky, eigenvalue, and singular value decompositions
    • Linear equations, etc.
  • Performance depends on BLAS
Genomic module (next week)
• G-matrix: G = ZZ′/k
• DGEMM from BLAS (Aguilar et al., 2011)
• Some quality control on markers
• A22-matrix and related computations
  • Subset of the numerator relationship matrix (Aguilar et al., 2011)
• Diagonals of A22-inverse (Masuda et al., 2017)
• Indirect computations in A22-inverse (Masuda et al., 2016)
Aguilar et al., 2011
Animals Time (s) Mem (GB)
1000 0.18 0.4
2000 0.47 0.8
5000 1.87 2.1
10000 8.40 4.5
20000 28.4 10.4
50000 162.7 37.3
100000 664.2 111.8
200000 2796.0 372.5
Computing time in G = ZZ’/k with 50K SNP markers using DSYRK from MKL(Intel Xeon E5-2650 2.2GHz x 24 cores)
YAMS (next week)
• Yet Another Mixed-model Solver (Masuda et al., 2014, 2015)
• Supernodal approaches
  • Use of dense matrices within a sparse matrix, i.e. BLAS
• Both in factorization and inversion
• Replacement of FSPAK
  • OPTION use_yams
• Variance component estimation with REML
• S. E. (or accuracy) of breeding value by inversion
• Indirect computation in A22-inverse
• Unavailable in Gibbs sampling