Parallel Computing with OpenMP
Yutaka Masuda
University of Georgia
Summer Course May 18, 2018
Computing cores
• Why don’t you use multiple cores for your computations?
= This is what OpenMP will do.
Multiple computers
• Why don’t you use multiple computers working together?
= This is a cluster: outside the scope of OpenMP; use MPI instead.
https://en.wikipedia.org/wiki/File:MEGWARE.CLIC.jpg
Methods in parallel computing
• OpenMP
  • Shared memory (single computer)
  • A set of directives
  • Focus on parallelizing loops = limited purpose
  • Automatic management by the program = easier to program
• MPI (Message Passing Interface)
  • Distributed memory (clusters)
  • A collection of subroutines
  • Any kind of parallel computing = flexible
  • Manual control of data flow & management = complicated
From www.comsol.com
Sum of 1 to 100
• MPI

program summpi
  use mpi
  implicit none
  integer :: i, myid, nprocs, io
  real :: s, local_s
  call mpi_init(io)
  call mpi_comm_rank(MPI_COMM_WORLD, myid, io)
  call mpi_comm_size(MPI_COMM_WORLD, nprocs, io)
  local_s = 0.0
  do i = myid+1, 100, nprocs
    local_s = local_s + i
  end do
  call mpi_reduce(local_s, s, 1, MPI_REAL, &
                  MPI_SUM, 0, MPI_COMM_WORLD, io)
  if (myid == 0) then
    print *, s
  end if
  call mpi_finalize(io)
end program summpi

• OpenMP

program sumomp
  implicit none
  integer :: i
  real :: s = 0.0
  !$omp parallel private(i) &
  !$omp reduction(+:s)
  !$omp do
  do i = 1, 100
    s = s + i
  end do
  !$omp end do
  !$omp end parallel
  print *, s
end program sumomp
Cores and threads
• Cores: physical hardware within CPU (central processing unit)
• Threads: tasks defined in your software
• The operating system assigns a task to a core by “time slicing”.
• The system switches tasks on a core over short periods of time.
• # of threads can be > # of cores.
https://www.intel.com/content/www/us/en/products/processors/core/i5-processors/i5-8400.html
OpenMP creates threads
From Wikipedia
Regular (sequential) program:
Parallel program:
Fork-Join model
From Wikipedia
Parallel Region 1
Parallel Region 2
Parallel Region 3
Fork Join Fork Join Fork Join
Program structure with OpenMP
do i=1,10
  x(i)=sqrt(x(i))
end do

!$omp parallel
!$omp do
do i=1,10
  x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
(Diagram: sequentially, the iterations i=1 to i=10 run in order; with 3 threads, the 10 iterations are divided among the threads.)
OpenMP directives
• A directive must begin with the keyword !$omp.
• The directives take effect only if you supply a compiler option.
• Otherwise, the directives are ignored (because they look like comments).
• An OpenMP region must be enclosed by !$omp … and !$omp end … directives.
!$omp parallel
!$omp do
do i=1,10
  x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
OpenMP directives (cont'd)
• Each directive can have an optional clause.
• Variable types, the number of threads, conditional execution, etc.
• A statement starting with !$ will be compiled only when OpenMP is enabled (conditional compilation).

!$omp parallel private(i) shared(x)
!$omp do
do i=1,10
  x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
!$print *,’OpenMP is active!’
Compiler options
• Depends on the compiler
  • Intel Fortran Compiler (ifort): -qopenmp (or -fopenmp)
• GFortran: -fopenmp
• Absoft: -openmp
• NAG: -openmp
• PGI: -mp
• Examples
  • ifort -qopenmp prog.f90
• gfortran -fopenmp prog.f90
Runtime options (details later)
• Specification of the number of threads (on Linux)
OMP_NUM_THREADS=2 ./a.out (with 2 threads)
• Default behavior:
  • If OMP_NUM_THREADS is set, it is followed.
  • If the program sets a limit, that is followed.
  • Otherwise, as many threads as possible on your computer are used.
Directive: parallel
!$omp parallel
print *,’Hi!’
!$omp end parallel

• Defines a parallel region and assigns the task to each thread.
• The region will be executed by multiple threads.
• The number of threads can be controlled by an optional clause, supplemental functions, or an environment variable.
(Each of the three threads executes print *,’Hi!’.)

(Output)
Hi!
Hi!
Hi!
Directive: do
• Performs the do-loop with multiple threads.
• The !$omp do directive must be placed just before a do-loop.
• The directive must be surrounded by parallel.
• The counter is not necessarily incremented in order.
• The counter i is treated as a separate variable for each thread (a private variable).

!$omp parallel
!$omp do
do i=1,10
  x(i)=sqrt(x(i))
end do
!$omp end do
!$omp end parallel
(Diagram: 3 threads, e.g. thread 1 takes i=1,5,6; thread 2 takes i=4,7,9; thread 3 takes i=2,3,8,10.)
Shared variable by default
! compute parent average (PA)
!$omp parallel
!$omp do
do i=1,n
  s=sire(i)
  d=dam(i)
  pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel
(Diagram: the arrays n, sire(:), dam(:), pa(:), ebv(:) and the scalars s and d are shared; two threads with different private i both assign s=sire(i), d=dam(i) and compute pa(i)=(ebv(s)+ebv(d))/2.0 at the same time.)

• All threads share the variables s and d.
• One thread rewrites the variables while another thread reads them!
Private and shared variables

! compute parent average (PA)
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa)
!$omp do
do i=1,n
  s=sire(i)
  d=dam(i)
  pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel

• Each thread has its own variables s and d, so there is no competition any more.
(Diagram: the arrays n, sire(:), dam(:), pa(:), ebv(:) remain shared; each thread now has its own private copies of i, s, and d.)
Clause: shared and private
• Define variable types.
• Use the private() and shared() clauses in the parallel directive.
• Private variables will be created for each thread.
• Shared variables will be shared (rewritten) by all threads.
• Variables will be shared by default except loop counters.
• Always declare the variable type to avoid bugs.
! compute parent average (PA)
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa)
!$omp do
do i=1,n
  s=sire(i)
  d=dam(i)
  pa(i)=(ebv(s)+ebv(d))/2.0
end do
!$omp end do
!$omp end parallel
Clause: reduction
known=0
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa) &
!$omp reduction(+:known)
!$omp do
do i=1,n
  s=sire(i)
  d=dam(i)
  pa(i)=(ebv(s)+ebv(d))/2.0
  if(s/=0.and.d/=0) known=known+1
end do
!$omp end do
!$omp end parallel

• Specifies a variable for a “reduction” operation.
• The variable known is treated as private for each thread.
• At the end of the loop, every thread adds its private known to the global known.
• Other operations (instead of +) are available: *, max, min, etc.
Clause: if
known=0
!$omp parallel private(i,s,d) &
!$omp shared(n,sire,dam,ebv,pa) &
!$omp reduction(+:known) &
!$omp if(n>100000)
!$omp do
do i=1,n
  s=sire(i)
  d=dam(i)
  pa(i)=(ebv(s)+ebv(d))/2.0
  if(s/=0.and.d/=0) known=known+1
end do
!$omp end do
!$omp end parallel

• Conditional use of OpenMP
• If the condition is true, OpenMP will be invoked in the parallel region.
• If not, the OpenMP directives in this region are ignored (i.e. single-thread execution).
Built-in functions/subroutines
• Built-in functions/subroutines for OpenMP are defined in the module omp_lib.
• Recommendation: always cite this module as !$ use omp_lib, because the module is usable only when you put a compiler option.
• See the textbook or openmp.org for details.
use omp_lib
or
!$ use omp_lib
Built-in function: timing
• The OpenMP function omp_get_wtime() returns the wall-clock time.

!$ use omp_lib
integer,parameter :: r8=selected_real_kind(15,300)
real(r8) :: tic,toc
...
!$ tic=omp_get_wtime()
!$omp parallel
!$omp do
do
  ...
end do
!$omp end do
!$omp end parallel
!$ toc=omp_get_wtime()
!$ print *,’running time=’,toc-tic
The number of threads
• The default number of threads is the maximum number available on your system.
• A parallel program will be slow if …
  • You separately run another parallel program and each program tries to use the maximum number of threads.
• Three different ways to change the number of threads:
  1. Region-specific configuration (a clause in the parallel directive)
  2. Program-specific configuration (a built-in subroutine)
  3. Run-time configuration (an environment variable)
Approach 1
integer :: n
n = 2
!$omp parallel num_threads(n)
!$omp do
do
...
end do
!$omp end do
!$omp end parallel
• Use of the num_threads clause.
• This is a region-specific configuration.
Approach 2
!$ use omp_lib
integer :: n
n=2
!$ call omp_set_num_threads(n)
!$omp parallel
!$omp do
do
...
end do
!$omp end do
!$omp end parallel
• Use of the built-in subroutine omp_set_num_threads.
• It changes the default number of threads in the program.
• It basically affects all subsequent parallel regions without a num_threads clause (see the OpenMP specifications).
Approach 3
Linux and Mac OS X:
export OMP_NUM_THREADS=4
or
OMP_NUM_THREADS=4 ./a.out
Windows:
• Use of the environment variable OMP_NUM_THREADS.
• You don’t have to change the program; you can just change the system variable.
• On Linux and Mac OS X, the variable is effective only in the current session; to make it permanent, write it in your Bash profile.
• On Windows, open the computer's properties to set the variable.
OpenMP is not perfect.
• A task should be split into several independent jobs.
• Not directly applicable if there are data dependencies.

do i=3,n
  x(i)=x(i-1)+x(i-2)
end do

• Hard to parallelize: Gauss-Seidel iterations, MCMC, and so on.
• Modify the algorithm to create independent tasks.
• Is it really worth doing?
Why is my OpenMP not so fast?
• Thread management is automatic but costly.
• Possibility: management overhead ≥ the advantage of threading
• 0.20 vs. 0.50 sec with/without 4 threads; 11.9 vs. 46.7 sec with/without 4 threads
• OpenMP is useful only when the overhead can be ignored, e.g. heavy computations repeated many times.
!$omp parallel private(i) &
!$omp reduction(+:s)
!$omp do
do i=1,1000000000
  s = s + 1/real(i)
end do
!$omp end do
!$omp end parallel

!$omp parallel private(i) &
!$omp reduction(+:s)
!$omp do
do i=1,1000000000
  s = s + log(abs(sin((sqrt(1/dble(i)))**0.6)))
end do
!$omp end do
!$omp end parallel
Why is my OpenMP so slow?
• Many other jobs use the computing cores.
• Possibility: too many multi-threaded programs in the background, or too many threads consumed by a few other programs
• Best practice: limit the number of threads for each program
(Chart: wall-clock time (sec) in one round of AI REML with YAMS, for 1, 2, 4, 8, and 16 threads; y-axis 0 to 400 sec.)
BLUPF90 programs and parallelization
• Relies on parallel libraries and modules.
• The genomic module depends on Intel MKL, i.e. optimized BLAS & LAPACK subroutines (MKL is parallelized with OpenMP).
• The module also uses OpenMP directives.
• YAMS (a sparse matrix library) calls MKL as well.
• BLUPF90IOD2 (licensed software) supports OpenMP.
• Please check how many threads you are actually using before running the parallel programs.
BLAS and LAPACK
• BLAS: Basic Linear Algebra Subprograms
  • Very basic matrix/vector computations: mainly multiplications
  • Originally written in Fortran
  • Optimized by vendors
  • MKL (Math Kernel Library) by Intel: a lot of optimization + OpenMP
• LAPACK: Linear Algebra Package
  • High-level matrix computations:
    • Cholesky, eigenvalue, and singular value decompositions
    • Linear equations, etc.
  • Performance depends on BLAS
Genomic module (next week)
• G-matrix: G = ZZ′/k
• DGEMM from BLAS (Aguilar et al., 2011)
• Some quality control on markers
• A22-matrix and related computations
  • Subset of the numerator relationship matrix (Aguilar et al., 2011)
• Diagonals of A22-inverse (Masuda et al., 2017)
• Indirect computations in A22-inverse (Masuda et al., 2016)
Aguilar et al., 2011
Animals Time (s) Mem (GB)
1000 0.18 0.4
2000 0.47 0.8
5000 1.87 2.1
10000 8.40 4.5
20000 28.4 10.4
50000 162.7 37.3
100000 664.2 111.8
200000 2796.0 372.5
Computing time in G = ZZ’/k with 50K SNP markers using DSYRK from MKL(Intel Xeon E5-2650 2.2GHz x 24 cores)
YAMS (next week)
• Yet Another Mixed-model Solver (Masuda et al., 2014, 2015)
• Supernodal approaches
  • Use of dense matrices within a sparse matrix, i.e. BLAS
• Both in factorization and inversion
• Replacement of FSPAK
  • OPTION use_yams
• Variance component estimation with REML
• S. E. (or accuracy) of breeding value by inversion
• Indirect computation in A22-inverse
• Unavailable in Gibbs sampling