Porting the physical parametrizations on GPU using directives
1 06/09/2011, COSMO GM Xavier Lapillonne
Porting the physical parametrizations on GPU using directives
X. Lapillonne, O. Fuhrer
Eidgenössisches Departement des Innern EDI, Bundesamt für Meteorologie und Klimatologie MeteoSchweiz
Outline
• Physics with 2d data structure
• Porting the physical parametrization to GPU using directives
• Running COSMO on a hybrid GPU-CPU system
New data structure
• 2D data fields inside the physics packages, with one horizontal and one vertical dimension: f(nproma,ke), with nproma = ie x je / nblock.
• Goals:
  • Physics packages could be shared with the ICON code
  • Blocking strategy: all physics parametrizations could be computed while data remains in the cache
• organize_physics should be structured as follows:
    call init_radiation
    call init_turbulence
    …
    do ib=1,nblock
      call copy_to_block
      call organize_radiation
      …
      call organize_turbulence
      call copy_back
    end do
  where data inside organize_scheme is in block form t_b(nproma,ke)
• Note: an OMP parallelization could be introduced around the block loop
• Routines below organize_scheme will be shared with ICON. Fields are passed via argument list:
    call fesft(t_b(:,:), …
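The blocking scheme on this slide can be sketched in a few lines. This is a minimal illustration in Python rather than the model's Fortran. The names copy_to_block and copy_back mirror the calls above, but their bodies, the flattening order of the horizontal indices, the sizes (ie=80, je=60, ke=60, nblock=10) and the "add 1" stand-in for the physics are all assumptions made for the example.

```python
def copy_to_block(field, ib, nproma):
    """Gather block ib of field[i][j][k] into a 2D block t_b[ip][k]."""
    ie = len(field)
    block = []
    for ip in range(nproma):
        flat = ib * nproma + ip          # flattened horizontal index (assumed i-fastest)
        i, j = flat % ie, flat // ie
        block.append(list(field[i][j]))  # copy the whole vertical column
    return block


def copy_back(field, block, ib):
    """Scatter a 2D block t_b[ip][k] back into field[i][j][k]."""
    ie = len(field)
    nproma = len(block)
    for ip, column in enumerate(block):
        flat = ib * nproma + ip
        i, j = flat % ie, flat // ie
        field[i][j] = list(column)


ie, je, ke, nblock = 80, 60, 60, 10      # hypothetical sizes
nproma = ie * je // nblock               # nproma = ie x je / nblock

t = [[[float(i + j + k) for k in range(ke)] for j in range(je)] for i in range(ie)]
reference = [[column[:] for column in row] for row in t]

for ib in range(nblock):
    t_b = copy_to_block(t, ib, nproma)
    # Stand-in for organize_radiation ... organize_turbulence:
    # every scheme works on the same small block before it is copied back.
    for column in t_b:
        for k in range(ke):
            column[k] += 1.0
    copy_back(t, t_b, ib)
```

Because each block holds only nproma columns, all schemes called inside the loop touch the same small working set, which is the point of the blocking strategy.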
Current status
• Base code: COSMO 4.18
• 2D versions of microphysics (hydci_pp), radiation (Ritter-Geleyn) and turbulence (turbtran+turbdiff).
• For the moment, microphysics and radiation are in separate block loops. The turbulence scheme is copying 3D fields (i.e. turbdiff(t(:,je,:) …)
Next steps
• All 3 parametrizations (microphysics + radiation + turbulence) in a common block loop
• Performance analysis
• OMP parallelization (?)
Longer term
• All parametrizations required for operational runs should be inside the block loop and in two-dimensional form
Outline
• Physics with 2d data structure
• Porting the physical parametrization to GPU using directives
• Running COSMO on a hybrid GPU-CPU system
Computing on Graphics Processing Units (GPUs)
• Benefit from the highly parallel architecture of GPUs
• Higher peak performance at lower cost / power consumption.
• High memory bandwidth
| | Cores | Freq. (GHz) | Peak perf. S.P. (GFLOPs) | Peak perf. D.P. (GFLOPs) | Memory bandwidth (GB/s) | Power cons. (W) |
|---|---|---|---|---|---|---|
| CPU: AMD Magny-Cours | 12 | 2.1 | 202 | 101 | 42.7 | 115 |
| GPU: Fermi M2050 | 448 | 1.15 | 1030 | 515 | 144 | 225 |
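To make the "lower power consumption" point concrete, the table can be reduced to peak figures per watt. The sketch below only divides numbers quoted in the table; treating the quoted power consumption as the denominator is an assumption of this example, and real efficiency depends on the achieved rather than the peak rates.

```python
# Figures copied from the table above.
hardware = {
    "CPU: AMD Magny-Cours": {"dp_gflops": 101.0, "bandwidth_gbs": 42.7, "power_w": 115.0},
    "GPU: Fermi M2050": {"dp_gflops": 515.0, "bandwidth_gbs": 144.0, "power_w": 225.0},
}


def per_watt(entry):
    """Return (peak DP GFLOPs per watt, memory bandwidth in GB/s per watt)."""
    return entry["dp_gflops"] / entry["power_w"], entry["bandwidth_gbs"] / entry["power_w"]


efficiency = {name: per_watt(entry) for name, entry in hardware.items()}

for name, (flops_w, bw_w) in efficiency.items():
    print(f"{name}: {flops_w:.2f} GFLOPs/W (DP), {bw_w:.2f} GB/s per W")
```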
Execution model
[Figure: execution model. Host (CPU): sequential code, kernel launch, sequential code. Device (GPU): parallel threads. Data transfer between host and device.]
• Copy data from CPU to GPU (CPU and GPU memory are separate)
• Load specific GPU program (kernel)
• Execution: the same kernel is executed by all threads, SIMD parallelism (single instruction, multiple data)
• Copy data back from GPU to CPU
The directive approach, an example
Note: PGI directives. Timings for N=1000, nlev=60.

Original loop order, t = 555 μs:

    !$acc data region local(a,b)
    !$acc update device(b)
    ! initialization
    !$acc region
    do k=1,nlev
      do i=1,N
        a(i,k)=0.0D0
      end do
    end do
    !$acc end region
    ! first layer
    !$acc region
    do i=1,N
      a(i,1)=0.1D0
    end do
    !$acc end region
    ! vertical computation
    !$acc region
    do k=2,nlev
      do i=1,N
        a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
    end do
    !$acc end region
    !$acc update host(a)
    !$acc end data region

With loop reordering, t = 225 μs:

    !$acc data region local(a,b)
    !$acc update device(b)
    ! initialization
    !$acc region do kernel
    do i=1,N
      do k=1,nlev
        a(i,k)=0.0D0
      end do
    end do
    !$acc end region
    ! first layer
    !$acc region
    do i=1,N
      a(i,1)=0.1D0
    end do
    !$acc end region
    ! vertical computation
    !$acc region do kernel
    do i=1,N
      do k=2,nlev
        a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*b(i,k)
      end do
    end do
    !$acc end region
    !$acc update host(a)
    !$acc end data region

• 3 different kernels
• Array “a” remains on the GPU between the different kernel calls
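The loop reordering is legal because only the vertical index carries a dependency: a(i,k) needs a(i,k-1), while different columns i are independent. The following plain-Python sketch (no GPU involved) runs the same recurrence with both loop orders and checks that the results agree. The function names and the values placed in b are hypothetical; only the arithmetic is taken from the slide.

```python
import math


def vertical_k_outer(b, n, nlev):
    """Original order: k outer, i inner."""
    a = [[0.0] * nlev for _ in range(n)]            # initialization
    for i in range(n):
        a[i][0] = 0.1                               # first layer
    for k in range(1, nlev):                        # vertical computation
        for i in range(n):
            a[i][k] = 0.95 * a[i][k - 1] + math.exp(-2 * a[i][k] ** 2) * b[i][k]
    return a


def vertical_i_outer(b, n, nlev):
    """Reordered: i outer (one column per thread on the GPU), k inner and sequential."""
    a = [[0.0] * nlev for _ in range(n)]
    for i in range(n):
        a[i][0] = 0.1
    for i in range(n):
        for k in range(1, nlev):
            a[i][k] = 0.95 * a[i][k - 1] + math.exp(-2 * a[i][k] ** 2) * b[i][k]
    return a


n, nlev = 1000, 60
b = [[0.01 * ((i + k) % 7) for k in range(nlev)] for i in range(n)]  # made-up input

a_k_outer = vertical_k_outer(b, n, nlev)
a_i_outer = vertical_i_outer(b, n, nlev)
same = a_k_outer == a_i_outer
```

Each element is produced by the same operations in both versions, so the comparison is exact. The sketch says nothing about speed; the timings quoted above come from the slide.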
Physical parametrizations on GPU using directives
• Physical parametrizations are tested using standalone code.
• Currently ported parametrizations:
  • PGI: microphysics (hydci_pp), radiation (fesft), turbulence (only turbdiff so far)
  • OMP-acc (Cray): microphysics, radiation
• GPU optimization: loop reordering, replacement of arrays with scalars
• Note: the hydci_pp, fesft and turbdiff subroutines represent respectively 6.7%, 8% and 7.3% of the total execution time of a typical COSMO-2 run.
• The current version of OMP-acc is a subset of the PGI directives, and it is possible to write PGI code so that there is almost a one-to-one translation to OMP-acc.
• First investigations show similar performance between the two compilers, but further analysis is needed.
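The runtime shares quoted above bound what porting these three subroutines alone can deliver for a full run. The sketch below applies Amdahl's law to those shares. The uniform kernel speedup of 3.0 is a hypothetical value chosen for illustration, not a measurement, and data-transfer cost is ignored.

```python
# Shares of total COSMO-2 execution time quoted on the slide.
fractions = {"hydci_pp": 0.067, "fesft": 0.08, "turbdiff": 0.073}


def overall_speedup(fractions, kernel_speedup):
    """Amdahl's law: only the ported share of the runtime gets faster."""
    ported = sum(fractions.values())
    return 1.0 / ((1.0 - ported) + ported / kernel_speedup)


ported_share = sum(fractions.values())          # 22% of the run
upper_limit = 1.0 / (1.0 - ported_share)        # infinitely fast kernels
with_3x = overall_speedup(fractions, 3.0)       # hypothetical 3x on each scheme

print(f"ported share: {ported_share:.1%}")
print(f"whole-run speedup, 3x kernels: {with_3x:.2f}")
print(f"whole-run speedup, upper limit: {upper_limit:.2f}")
```

Under these assumptions the whole run cannot gain more than about 1.28x from these three schemes, however fast they become on the GPU.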
Results, Fermi card using PGI directives
[Figure: Performance (GFlop/s) for microphysics, radiation and turbulence]
[Figure: Memory throughput, overall (GB/s), for microphysics, radiation and turbulence]
• Peak performance of a Fermi card for double precision is 515 GFlop/s, i.e. we are getting respectively 5%, 4.5% and 2.5% of peak performance for the microphysics, radiation and turbulence schemes
• Theoretical bandwidth is 140 GB/s, but the maximum achievable is around 110 GB/s
• Test domain: nx x ny x nz = 80 x 60 x 60
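The percentages of peak quoted above translate back into sustained rates as follows. This is plain arithmetic on the numbers in the bullets. The bar heights of the original figure are not recoverable from this transcript, so the values are derived from the rounded percentages rather than read from the chart.

```python
PEAK_DP_GFLOPS = 515.0          # Fermi double-precision peak, from the slide
THEORETICAL_BW_GBS = 140.0      # theoretical bandwidth, from the slide
ACHIEVABLE_BW_GBS = 110.0       # maximum achievable bandwidth, from the slide

# Fractions of peak quoted for each scheme.
fraction_of_peak = {"microphysics": 0.05, "radiation": 0.045, "turbulence": 0.025}

sustained_gflops = {
    scheme: fraction * PEAK_DP_GFLOPS for scheme, fraction in fraction_of_peak.items()
}
achievable_bw_share = ACHIEVABLE_BW_GBS / THEORETICAL_BW_GBS

for scheme, gflops in sustained_gflops.items():
    print(f"{scheme}: about {gflops:.1f} GFlop/s")
print(f"achievable share of theoretical bandwidth: {achievable_bw_share:.0%}")
```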
Results: Comparison with CPU
[Figure: Speed up with respect to a 12-core CPU (Palu) for microphysics, radiation and turbulence; series: execution time, execution + data transfer]
• The parallel CPU code runs on a 12-core AMD Magny-Cours CPU; however, there are no MPI communications in these standalone test codes.
• Note: the expected speed up would be between 3x and 5x, depending on whether the problem is compute bound or memory-bandwidth bound.
• The overhead of data transfer for microphysics and turbulence is very large.
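The expected 3x to 5x range is consistent with the ratios of the hardware figures given earlier in the talk (515 vs 101 GFLOPs double-precision peak, 144 vs 42.7 GB/s memory bandwidth). The sketch below computes those two ratios; reading them as the memory-bound and compute-bound limits is an interpretation, not something the slide states explicitly.

```python
# Peak figures from the hardware comparison table (GPU: Fermi M2050, CPU: AMD Magny-Cours).
gpu = {"dp_gflops": 515.0, "bandwidth_gbs": 144.0}
cpu = {"dp_gflops": 101.0, "bandwidth_gbs": 42.7}

# If a kernel is limited by arithmetic, the best case is the ratio of peak flop rates.
compute_bound_speedup = gpu["dp_gflops"] / cpu["dp_gflops"]
# If it is limited by memory traffic, the best case is the ratio of bandwidths.
memory_bound_speedup = gpu["bandwidth_gbs"] / cpu["bandwidth_gbs"]

print(f"compute bound: up to {compute_bound_speedup:.1f}x")
print(f"memory bound:  up to {memory_bound_speedup:.1f}x")
```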
Comments on the observed performance
• The microphysics has the largest compute intensity (with respect to memory access) and as such is more suited for the GPU.
• The lower speed up observed for the radiation should be put in perspective: it essentially comes from the fact that the radiation code is very well optimized and vectorized on the CPU (~9% of peak performance)
• The turbulence scheme requires more memory access.
Next steps
• Port the turbtran subroutine with PGI, plus additional tests and optimizations (October 2011)
• Further investigation of the radiation and turbulence schemes with Cray directives (November 2011)
• GPU version of microphysics + radiation + turbulence inside COSMO (November-December 2011)
Outline
• Physics with 2d data structure
• Porting the physical parametrization to GPU using directives
• Running COSMO on a hybrid GPU-CPU system
Possible future implementations in COSMO
[Figure: two diagrams of the COSMO components (Dynamic; Phys. parametrization: Microphysics, Turbulence, Radiation; I/O) showing which parts run on the GPU. Labels: "C++ - CUDA" and "Directives".]
• Data movement for each routine
• “Full GPU”: data remain on the device and are only sent to the CPU for I/O and communication
Running COSMO-2 on a hybrid system
[Figure: multicore processor and GPUs]
• One (or more) multicore CPUs
• Domain decomposition
• One GPU per subdomain.
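A minimal sketch of the "one GPU per subdomain" layout: the horizontal grid is split into rectangular subdomains and each one is given its own device index. The function decompose, the 2 x 2 process grid, the use of the 80 x 60 test-domain size and the rule gpu = rank are all assumptions for illustration; the slides do not describe how COSMO assigns devices.

```python
def split(n, parts):
    """Split n points into `parts` contiguous half-open ranges of near-equal size."""
    base, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def decompose(nx, ny, px, py):
    """Return one entry per subdomain with its rank, GPU index and index ranges."""
    subdomains = []
    for jp, j_range in enumerate(split(ny, py)):
        for ip, i_range in enumerate(split(nx, px)):
            rank = jp * px + ip
            # Hypothetical mapping: each subdomain (rank) drives exactly one GPU.
            subdomains.append({"rank": rank, "gpu": rank, "i": i_range, "j": j_range})
    return subdomains


subdomains = decompose(nx=80, ny=60, px=2, py=2)
for sub in subdomains:
    print(sub)
```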
Summary
• Porting of the microphysics, radiation and turbulence schemes to GPU was successfully carried out using a directive-based approach
• Compared with a 12-core CPU, a speed up between 2.4x and 6.5x was observed using one Fermi GPU card
• These results are within the expected values considering the hardware properties
• The large overhead of data transfer shows that the “full GPU” approach (i.e. data remains on the GPU, all computation on the device) is the preferred approach for COSMO
Additional slides
Comparison between PGI and OMP-acc
PGI:

    !$acc data region local(a)
    ! time loop
    do itime=1,nt
      ! initialization
      !$acc region
      do k=1,nlev
        do i=1,N
          a(i,k)=0.0D0
        end do
      end do
      !$acc end region
      ! first layer
      !$acc region do kernel
      do i=1,N
        a(i,1)=0.1D0
      end do
      !$acc end region
      ! vertical computation
      !$acc region do kernel
      do i=1,N
        do k=2,nlev
          a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
        end do
      end do
      !$acc end region
    end do ! end time loop
    !$acc update host(a)
    !$acc end data region

OMP-acc (Cray):

    !$omp acc_data acc_shared(a)
    ! time loop
    do itime=1,nt
      ! initialization
      !$omp acc_region_loop
      do k=1,nlev
        do i=1,N
          a(i,k)=0.0D0
        end do
      end do
      !$omp end acc_region_loop
      ! first layer
      !$omp acc_region_loop
      do i=1,N
        a(i,1)=0.1D0
      end do
      !$omp end acc_region_loop
      ! vertical computation
      !$omp acc_region_loop kernel
      do i=1,N
        do k=2,nlev
          a(i,k)=0.95D0*a(i,k-1)+exp(-2*a(i,k)**2)*a(i,k)
        end do
      end do
      !$omp end acc_region_loop
    end do ! end time loop
    !$omp acc_update host(a)
    !$omp end acc_data
Craypat infos

MAIN_ / mo_gscp_dwd_hydci_pp_ (x10)
------------------------------------------------------------------------
  User time (approx)           2.999 secs      7197500711 cycles
  System to D1 refill          2.434M/sec         7300271 lines
  System to D1 bandwidth     148.576MB/sec      467217344 bytes
  D2 to D1 bandwidth        1025.770MB/sec     3225672832 bytes
  L2 to System BW per core   140.940MB/sec      443203504 bytes
  HW FP Ops / User time      435.162M/sec      1308546592 ops   4.5%peak(DP)

MAIN_ / src_radiation_fesft_ (x1)
------------------------------------------------------------------------
  User time (approx)           7.226 secs     17342858074 cycles  100.0%Time
  System to D1 refill         11.380M/sec        82232710 lines
  System to D1 bandwidth     694.569MB/sec     5262893440 bytes
  D2 to D1 bandwidth        1162.252MB/sec     8806624128 bytes
  L2 to System BW per core   645.679MB/sec     4892446080 bytes
  HW FP Ops / User time      893.252M/sec      6511701846 ops   9.3%peak(DP)

MAIN_ / turbulence_diff_ref_turbdiff_ (x10)
------------------------------------------------------------------------
  User time (approx)           4.397 secs     10551890928 cycles  100.0%Time
  System to D1 refill         15.757M/sec        69278266 lines
  System to D1 bandwidth     961.741MB/sec     4433809024 bytes
  D2 to D1 bandwidth         485.462MB/sec     2238073856 bytes
  L2 to System BW per core   982.474MB/sec     4529394160 bytes
  HW FP Ops / User time      326.405M/sec      1452061875 ops   3.4%peak(DP)
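The "%peak(DP)" column can be reproduced from the other counters. The sketch below derives the core clock from cycles divided by user time (about 2.4 GHz for all three routines) and assumes a per-core peak of 4 double-precision floating-point operations per cycle. That factor is an assumption of this example; it is not stated on the slide, but with it the reported percentages are recovered.

```python
FLOPS_PER_CYCLE = 4  # assumed DP operations per cycle per core (not given on the slide)

# Counters copied from the CrayPat output above.
counters = {
    "hydci_pp": {"user_time_s": 2.999, "cycles": 7197500711, "mflops": 435.162},
    "fesft": {"user_time_s": 7.226, "cycles": 17342858074, "mflops": 893.252},
    "turbdiff": {"user_time_s": 4.397, "cycles": 10551890928, "mflops": 326.405},
}


def percent_of_peak(entry):
    """Sustained MFlop/s as a percentage of the assumed per-core peak."""
    clock_hz = entry["cycles"] / entry["user_time_s"]
    peak_mflops = FLOPS_PER_CYCLE * clock_hz / 1e6
    return 100.0 * entry["mflops"] / peak_mflops


percent = {name: percent_of_peak(entry) for name, entry in counters.items()}

for name, value in percent.items():
    print(f"{name}: {value:.1f}% of peak (DP)")
```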
Palu Results
Results, microphysics, double precision, Palu
[Figure: speed up (DP) for 1 CPU (12 cores), 2 CPU (24 cores), GPU-Fermi Cray and GPU-Fermi PGI; series: speedup without data transfer, speedup including data transfer]
Results, Radiation, double precision, Palu
[Figure: speed up (DP) for 1 CPU (12 cores), 2 CPU (24 cores), GPU-Fermi Cray and GPU-Fermi PGI; series: speedup without data transfer, speedup including data transfer]
Results, Turbulence, double precision, Palu
[Figure: speed up (DP) for 1 CPU (12 cores), 2 CPU (24 cores), GPU-Fermi Cray and GPU-Fermi PGI; series: speedup without data transfer, speedup including data transfer]