of Parallel thigh Performance Computing
Transcript of of Parallel thigh Performance Computing
Design of Parallel & thigh Performance ComputingReasoning about Performance II
Roofline Rode - Balance Principles - Analysis Scheduling
Roofline Rodell Williams et al. 20087
Computer : Program :
men I= W/Q I fbpslsyde ]bandwidth A
LLC tsyteskyck ] T = runtime I cycles ]
CPU peak perf . T P= WIT ( performance ) [ fbpskyck ]tfbpskyclei
Roofline plot : ( example T= 2, A=4 )
Ptfbpskyde ] sound jasedenp fP±pI ) reason :¥=Ww¥a=¥2f^
Sind
Sounds ( PET ) PZ p. I2-
: ⇒ logp a- legit leg B1 -
r
' k - I× ← program Isl ⇒ P 2- A
-
i
run on Some
I. guts input^ 1 1 1 1 > I I f6ps/J > tif
YYYZ1 2
,
Operational Intensity : Upper Surely
Icu ) =Wen )/QCu ) x ,y : length n
, RB ,C= men,
as scalarHeflops I # Sytes
meme > cache
asymptotic best exact
daepy y-
- AttyOct ) OCI ) I' 112
dgemv y - Arty Oct ) OCI ) E' 14
-
fft oclogu ) oaogy ) -
dgemm C - AB -1C 0647 o Crf ) E YT should Se
sharp ifg= code tie
gut . 828Example . y = 12nA
⇒ n E 700
baek to Slides → measured roofline plots
G. Balance Principles I Ckuey 86 )
Compete : Pegram :
men Qy : data transfer men a LLC
sLLC sire j W : work = # ops
CPU IT
The computer is called Salaam ifcompute time = data transfer the
ire . , assuming perfect utilization
¥ = Ese⇒ Ip=¥Jr
r
assume it → at increases ;how A nesalauee ?
a.) A → a. A ,✓ set A grows slower than 5 !s
. ) y → x. 8 what is × ?
Example I : Mahn multiplicities Ca ABTC
algorithm worth opkmel I= hot - QCTF )
Ip =¥←=o( of ) t.at, r→e?g
Example 2 : FFT / dorky
algorithm ark optimal I= @( logo )
Is = [email protected] → ya
pots are unnealrttfe .
7. Balance principles I ( Czechowski et al.
Lou )
coal : none detailed principles for multi ones
algorithm / architecture adesrgn ,assess webof HW tends
Computer ; Algorithm :
slow men PRAM : W : work,
Di depthdi lately ,
B : throughput,
fast men sing dnansaehon fired Qp, , : men . transfersof she A
. . . each poll perf IT
P,
Ppassumptions : optimal W
,W/Q
processors
Processor is faleuafd ;ftwo * set Qp , rn fun Qnx ?
- there are general fends
Them E Tamp ("
compute - tuned and some Arnet results
" as :
of a- put ⇐ > F 3- ¥a
Derivation principles :
i. ) Estimate Tuen : Idea Hinde DAG in levels
level I : o a o a o o 9,mean transfers of the A
level 2 : \o/###qz
r - - - . - - - - - -- - -
- - - e
^
i
level D= o a a o o O O 9g
In each level 2- B model : a+9,⇐ Fit
⇒ Tues I C at 9)= a. Dt Q -- Qara
B I
2.) Estimably Top : Brent 's lemma
Taupe CD -1¥ ) . ¥
←heh - par . compute
Balance : Them stamp ← parallelism computer
⇐ At .fi?j)eoY-eltIa )^ T parallelism algorithmI
Ken#
new. par . algorithm
Example I ; matrix multiplication
use Q 7 r¥TrFp. Cluny et al.
2004 ) ⇒ Icu ) -- O City)
and D K W,
D G Q C a means much smaller )
⇒ balance principle t.IE ofFr )p
I → a. IT rebalance : A → af on p → al g
p → a. p u : ( A → af and p - say ) onCp → Pr )
Example 2 a sending FFFT⇒ PF e o Clay E)
Back to Sholes → Evolution of balance
Greedy scheduling , Analysts
Reminder : T,
=W, To =D Tp 7 T ,/p , To Tp E Dt w/p= Tothlp
Greedy scheduler :
# complete steps IT ,/pIf incomplete steps E I ( every incomplete step reduces
critical path by 1 )⇒ Tp e T ,/pt To
How far from optimal ?
Assume Tpt is optimal :
Tpt >, nae { Tlp , Ta }
Tp I T ,/p t To E 2 met { Tlp , To } E 2Tp← so at most
footer 2 away
enough parallelism : WID = T.to ⇒ p ( ⇒ Tok Tlp
⇒ Tp E T ,/p + To a T, /p
Work stealing scheduling : Analysts
Theorem : The scheduler achieves Tp E T ,/p+ Otto )
Proof sketch : Every processor is either working or stealingTotal true working : T
,
Total time stealing : Every steel has Yp chance to reducethe critical path ; hence OCP To )
⇒ total time : T, + Ocpty )
⇒ time per processor : total Line /p = T, /pt Otto )
Note : space requirement Sp I p S, .