Cell processor implementation of a MILC lattice QCD application
Guochun Shi, Volodymyr Kindratenko, Steven Gottlieb
National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign
Presentation outline
• Introduction
  1. Our view of MILC applications
  2. Introduction to the Cell Broadband Engine
• Implementation on the Cell/B.E.
  1. PPE performance and the STREAM benchmark
  2. Profile on the CPU and kernels to be ported
  3. Different approaches
• Performance
• Conclusion
Introduction
• Our target: the MIMD Lattice Computation (MILC) Collaboration code – dynamical clover fermions (clover_dynamical) using the hybrid molecular dynamics R algorithm
• Our view of the MILC applications: a sequence of communication and computation blocks, sketched below
[Figure: original CPU-based implementation – the CPU alternates communication and computation blocks: MPI scatter/gather for loop 1, compute loop 1, MPI scatter/gather for loop 2, compute loop 2, …, compute loop n, MPI scatter/gather for loop n+1]
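A minimal sketch of this structure in C (the function names are hypothetical stand-ins, not the actual MILC routines):

/* Each block exchanges boundary data with neighbor nodes via MPI
   scatter/gather, then runs a compute loop over the local lattice sites. */
#include <stdio.h>

static void mpi_scatter_gather(int k) { printf("comm for loop %d\n", k); }
static void compute_loop(int k)       { printf("compute loop %d\n", k); }

int main(void) {
    int n = 3;                       /* number of compute blocks */
    for (int k = 1; k <= n; k++) {
        mpi_scatter_gather(k);       /* communication block */
        compute_loop(k);             /* computation block */
    }
    mpi_scatter_gather(n + 1);       /* communication for the next phase */
    return 0;
}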
Introduction
• Cell/B.E. processor
  • One Power Processor Element (PPE) and eight Synergistic Processing Elements (SPEs); each SPE has 256 KB of local store
  • 3.2 GHz clock
  • 25.6 GB/s processor-to-memory bandwidth
  • > 200 GB/s sustained aggregate EIB bandwidth
  • Theoretical peak performance: 204.8 GFLOPS (SP) and 14.63 GFLOPS (DP)
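As a sanity check on the single-precision peak (our arithmetic, assuming each SPE issues one 4-wide fused multiply-add per cycle): 8 SPEs × 4 lanes × 2 flops × 3.2 GHz = 204.8 GFLOPS.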
Presentation outline
• Introduction
  1. Our view of MILC applications
  2. Introduction to the Cell Broadband Engine
• Implementation on the Cell/B.E.
  1. PPE performance and the STREAM benchmark
  2. Profile on the CPU and kernels to be ported
  3. Different approaches
• Performance
• Conclusion
Performance in PPE
• Step 1: try to run the code on the PPE
• On the PPE it runs approximately 2-3x slower than on a modern CPU
• MILC is bandwidth-bound
• This agrees with what we see with the STREAM benchmark
[Figure: left – application runtime (seconds) on CPU vs. PPE for the 8x8x16x16 and 16x16x16x16 lattices; right – STREAM benchmark bandwidth (Copy, Scale, Add, Triad) on CPU vs. PPE]
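For reference, the STREAM kernels are simple bandwidth-bound loops; the Triad kernel, for example, has this standard form (a sketch, not the benchmark's full harness):

/* STREAM Triad: a[i] = b[i] + q * c[i].  Performance is limited by how
   fast memory can stream three arrays, much like the MILC kernels. */
#include <stddef.h>

void triad(double *a, const double *b, const double *c, double q, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] + q * c[i];
}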
Execution profile and kernels to be ported
[Figure: per-kernel run-time (%) and cumulative run-time (%) on the 8x8x16x16 and 16x16x16x16 lattices; the kernels include udadu_mu_nu, dslash_w_site_special, su3mat_copy, mult_su3_nn, mult_su3_na, mult_this_ldu_site, udadu_mat_mu_nu, su3_adjoint, general strided gather, f_mu_nu, scalar_mult_add_wvec, d_congrad2_cl, set_neighbor, mult_su3_an, update_u, compute_clov, update_h_cl, realtrace_su3_nn, add_su3_matrix, wp_shrink, magsq_wvec_task, gauge_action, set su3_matrix to zero, and reunitarize]
• 10 of these subroutines are responsible for >90% of the overall runtime
• All kernels together account for 98.8% of the runtime
Kernel memory access pattern
• Kernel code must be SIMDized
• Performance is determined by how fast the data can be DMAed in and out, not by the SIMDized code
• In each iteration, only small elements are accessed:
  • lattice site: 1832 bytes
  • su3_matrix: 72 bytes
  • wilson_vector: 96 bytes
• Challenge: how to get data into the SPUs as fast as possible?
  • Cell/B.E. DMA performs best when data is aligned to 128 bytes and the size is a multiple of 128 bytes
  • The data layout in MILC meets neither requirement (see the size-check sketch after the sample kernel below)
#define FORSOMEPARITY(i,s,choice) \
    for( i=((choice)==ODD ? even_sites_on_node : 0 ), s= &(lattice[i]); \
         i< ( (choice)==EVEN ? even_sites_on_node : sites_on_node); \
         i++,s++)

FORSOMEPARITY(i,s,parity) {
    mult_adj_mat_wilson_vec( &(s->link[nu]),
                             ((wilson_vector *)F_PT(s,rsrc)), &rtemp );
    mult_adj_mat_wilson_vec( (su3_matrix *)(gen_pt[1][i]), &rtemp, &(s->tmp) );
    mult_mat_wilson_vec( (su3_matrix *)(gen_pt[0][i]), &(s->tmp), &rtemp );
    mult_mat_wilson_vec( &(s->link[mu]), &rtemp, &(s->tmp) );
    mult_sigma_mu_nu( &(s->tmp), &rtemp, mu, nu );
    su3_projector_w( &rtemp, ((wilson_vector *)F_PT(s,lsrc)),
                     ((su3_matrix*)F_PT(s,mat)) );
}
One sample kernel from the udadu_mu_nu() routine (above)

[Figure: data accesses at lattice site 0 – accesses to the site's own fields and to data from neighbor sites]
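A minimal sketch of the element types and the resulting size mismatch (illustrative definitions based on the sizes above, not the MILC headers):

#include <stdio.h>

typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;     /* 3x3 complex = 72 bytes */
typedef struct { complex_f d[4][3]; } wilson_vector;  /* 4x3 complex = 96 bytes */

int main(void) {
    /* Neither size is a multiple of the 128-byte DMA-friendly granularity,
       and struct site packs the elements back to back, so they are
       generally not 128-byte aligned either. */
    printf("su3_matrix:    %zu bytes\n", sizeof(su3_matrix));
    printf("wilson_vector: %zu bytes\n", sizeof(wilson_vector));
    return 0;
}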
Approach I: packing and unpacking
• Good performance in DMA operations
• Packing and unpacking are expensive on the PPE (see the sketch below)

[Figure: the PPE packs fields from each struct site in main memory into contiguous buffers, the SPEs move them with DMA operations, and results are unpacked back into the lattice]
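A minimal sketch of Approach I (hypothetical helper, not the actual code): the PPE gathers the field a kernel needs from every site into one contiguous, 128-byte-aligned buffer that the SPEs can DMA efficiently.

#include <stdlib.h>
#include <string.h>

typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;
typedef struct { su3_matrix link[4]; /* ...other fields... */ } site;

su3_matrix *pack_links(const site *lattice, int nsites, int mu)
{
    su3_matrix *buf = NULL;
    /* 128-byte alignment for best Cell/B.E. DMA performance */
    if (posix_memalign((void **)&buf, 128, nsites * sizeof(su3_matrix)))
        return NULL;
    for (int i = 0; i < nsites; i++)        /* PPE-side copy: the cost */
        memcpy(&buf[i], &lattice[i].link[mu], sizeof(su3_matrix));
    return buf;  /* SPEs DMA from this region; unpacking reverses it */
}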
Approach II: indirect memory access
• Replace elements in struct site with pointers
• The pointers point into contiguous memory regions
• PPE overhead due to the indirect memory access (see the sketch below)

[Figure: original lattice of struct sites vs. modified lattice in which each site holds pointers into contiguous memory regions that the SPEs reach directly with DMA operations]
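A minimal sketch of Approach II (illustrative, not the actual MILC change): the field data moves out of struct site into a contiguous per-field array, and each site keeps only a pointer into it. The SPEs can DMA long contiguous runs, while the PPE pays an extra indirection on every access.

#include <stdlib.h>

typedef struct { float real, imag; } complex_f;
typedef struct { complex_f e[3][3]; } su3_matrix;

typedef struct {
    su3_matrix *link;   /* points into links_mem[4*i .. 4*i+3] */
    /* ...pointers for the other fields... */
} site;

int init_lattice(site *lattice, int nsites, su3_matrix **links_mem)
{
    /* one contiguous, 128-byte-aligned region for all link matrices */
    if (posix_memalign((void **)links_mem, 128,
                       (size_t)nsites * 4 * sizeof(su3_matrix)))
        return -1;
    for (int i = 0; i < nsites; i++)
        lattice[i].link = *links_mem + 4 * i;  /* indirection the PPE pays */
    return 0;
}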
Approach III: padding and small memory DMAs
• Pad the elements in struct site, and struct site itself, to an appropriate size
• Gained good bandwidth performance at the cost of padding overhead:
  • su3_matrix: from a 3x3 to a 4x4 complex matrix, 72 bytes → 128 bytes; bandwidth efficiency lost: 44%
  • wilson_vector: from 4x3 to 4x4 complex, 96 bytes → 128 bytes; bandwidth efficiency lost: 25%
(A sketch of the padded types follows the figure.)

[Figure: original lattice vs. lattice after padding – each padded element starts on a 128-byte boundary, so the SPE DMA operations are aligned and sized in 128-byte multiples]
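A minimal sketch of the assumed padding scheme: pad each element to exactly 128 bytes so every DMA is aligned and a multiple of 128 bytes, trading wasted bytes for transfer efficiency.

typedef struct { float real, imag; } complex_f;

/* 3x3 complex padded to 4x4: 72 -> 128 bytes (44% of each transfer wasted) */
typedef struct __attribute__((aligned(128))) {
    complex_f e[4][4];
} su3_matrix_padded;

/* 4x3 complex padded to 4x4: 96 -> 128 bytes (25% wasted) */
typedef struct __attribute__((aligned(128))) {
    complex_f d[4][4];
} wilson_vector_padded;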
Struct site padding
• 128-byte strided access performs differently for different stride sizes (see the table below)
• This is due to the 16 banks in main memory (see the sketch after the table)
• Odd stride counts always reach peak bandwidth
• We chose to pad struct site to 2688 bytes (21 x 128, an odd multiple of 128)
Measured bandwidth vs. stride (in units of 128 bytes):

stride:            1      2      3      4     5      6      7      8     9
bandwidth (GB/s):  25.38  12.69  25.36  8.60  25.49  12.89  25.48  4.26  25.50

stride:            10     11     12    13     14     15     16    17
bandwidth (GB/s):  12.93  25.34  8.58  25.51  12.89  25.47  2.13  25.34
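A minimal sketch of the bank-conflict reasoning (our model, not the benchmark): with 16 banks interleaved at 128-byte granularity, a stride of s x 128 bytes cycles through 16 / gcd(s, 16) distinct banks, so odd s touches all 16 banks and sustains peak bandwidth. That is why an odd multiple such as 21 x 128 = 2688 bytes was chosen for struct site.

#include <stdio.h>

static int gcd(int a, int b) { return b ? gcd(b, a % b) : a; }

int main(void) {
    /* odd strides use all 16 banks; strides of 2, 4, 8, 16 use fewer,
       matching the measured bandwidth dips above */
    for (int s = 1; s <= 17; s++)
        printf("stride %2d x 128 B -> uses %2d of 16 banks\n",
               s, 16 / gcd(s, 16));
    return 0;
}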
Presentation outline
• Introduction
  1. Our view of MILC applications
  2. Introduction to the Cell Broadband Engine
• Implementation on the Cell/B.E.
  1. PPE performance and the STREAM benchmark
  2. Profile on the CPU and kernels to be ported
  3. Different approaches
• Performance
• Conclusion
Kernel performance
• GFLOPS are low for all kernels
• Bandwidth is around 80% of peak for most kernels
• Kernel speedups over the CPU are between 10x and 20x for most kernels
• The set_memory_to_zero kernel reaches ~40x speedup; su3mat_copy() exceeds 15x
[Figure: three charts over the same kernels as in the execution profile (udadu_mu_nu through reunitarize), for the 8x8x16x16 and 16x16x16x16 lattices – percentage of peak GFLOPS, percentage of peak bandwidth, and speedup over the CPU]
Application performance
• Single-Cell application speedup: ~8-10x compared to a single Xeon core
• Cell blade application speedup: 1.5-4.1x compared to a 2-socket, 8-core Xeon
• Profile on the Xeon: 98.8% parallel code, 1.2% serial code
• On the Cell, kernel SPU time is 67-38% and PPU time 33-62% of the overall runtime, depending on configuration
• The PPE stands in the way of further improvement
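A quick Amdahl's-law estimate (our arithmetic from the numbers above): with 1.2% serial code, speedup over a single Xeon core is capped at 1/0.012 ≈ 83x even with infinitely fast kernels; and since the PPE runs serial code roughly 2-3x slower than a Xeon core, the effective serial fraction on the Cell grows to roughly 2.4-3.6%, capping the speedup at about 28-42x.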
[Figure: execution time (seconds) by execution mode – 1 core Xeon, 8 cores Xeon, 1 PPE + 8 SPEs, 1 PPE + 16 SPEs (NUMA), and 2 PPEs with 8 SPEs per PPE (MPI) – for the 8x8x16x16 and 16x16x16x16 lattices, with the SPE contribution broken out within each bar]
Application performance on two blades

                        Execution time of the 54 kernels        Execution time of the rest of the code    Total
                        considered for the SPE implementation   (PPE portion in the Cell/B.E. case)
Two Intel Xeon blades   110.3 seconds                           27.1 seconds (24.5 s due to MPI)          137.3 seconds
Two Cell/B.E. blades    15.9 seconds                            67.9 seconds (47.6 s due to MPI)          83.8 seconds

• For this comparison, the two Intel Xeon blades and the two Cell/B.E. blades were connected through Gigabit Ethernet
• More data is needed for Cell blades connected through InfiniBand
Application performance: a fair comparison

                                             8x8x16x16 lattice              16x16x16x16 lattice
                                             Xeon time  Cell time  speedup  Xeon time  Cell time  speedup
Single-core Xeon vs. Cell/B.E. PPE           38.7       73.2       0.5      168.6      412.8      0.4
Single-core Xeon vs. Cell/B.E. PPE + 1 SPE   38.7       21.9       1.8      168.6      86.9       1.9
Quad-core Xeon vs. Cell/B.E. PPE + 8 SPEs    15.4       4.5        3.4      100.2      17.5       5.7
Xeon blade vs. Cell/B.E. blade               5.5        3.6        1.5      55.8       13.7       4.1

(times in seconds)

• The PPE alone is slower than a Xeon core
• PPE + 1 SPE is ~2x faster than a Xeon core
• A Cell blade is 1.5-4.1x faster than an 8-core Xeon blade
Conclusion
• We achieved reasonably good performance: 4.5-5.0 GFLOPS on one Cell processor for the whole application
• We maintained the MPI framework
  • Without the assumption that the code runs on a single Cell processor, certain optimizations cannot be done, e.g. loop fusion
• The current site-centric data layout forces us to take the padding approach
  • 25-44% of bandwidth efficiency is lost
  • Fix: a field-centric data layout is desired
• The PPE slows down the serial part, which is a problem for further improvement
  • Fix: IBM putting a full-featured Power core in the Cell/B.E.
• The PPE may also pose problems when scaling to multiple Cell blades
  • Tests of PPE communication over InfiniBand are needed