COL730: An Introduction to Parallel Computation
Focus of the Course
• Understand the parallel eco-system
  ➡ Parallel computer organization
  ➡ Parallel algorithms and data structures; analysis
• Learn broad parallel programming styles
  ➡ Shared memory programming, distributed memory programming, co-processors, hybrids
• Learn basic parallelization tools and techniques
  ➡ Performance measurement
  ➡ Scheduling and load balancing, inter-process communication, synchronization
• Get broad coverage for teaching “parallel programming”
  ➡ Also prepare to delve into research
• C/C++ is assumed
Related Keywords
• Parallel
• Concurrent
• Distributed
• Supercomputing
• Grid computing
• Cloud computing
• Big data
Large Computational Problems
• Molecular simulation ➡ Many million bodies ⇒ days per iteration
• Atmospheric simulation ➡ 1 km 3D grid, each point interacts with its neighbors ➡ Days of simulation time
• Data analytics
• Computational biology ➡ Drug design ➡ Gene sequencing
• Oil exploration ➡ Months of processing of seismic data
• Financial processing ➡ Market prediction, investing
NAMD, 1M atoms: 316 PF/ns; GROMACS, 0.25M molecules: 25 PF/ns; LAMMPS, 0.5M atoms: 15 PF/ns; Weather: 1M PF for full weather modeling of 2 weeks; Theano, 1M images: 375 PF
[Chart: microprocessor trends over time - Transistors (thousands), Clock rate (MHz), Power (W), Instructions/Clock (ILP)]
Moore’s Law (1965)
• Newly trendy technology of computer chips would become doubly powerful every two years.
• They would eventually be so small that they could be embedded in homes, cars and personal portable communications equipment.
“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.”
• 1975: Revised the rate of circuit-complexity doubling to 18 months going forward. “There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors - bigger dies and finer dimensions.”
• 2003: “Another decade is probably straightforward... There is certainly no end to creativity.”
• 2010: International Technology Roadmap for Semiconductors: growth to slow at the end of 2013; transistor counts and densities will double every three years.
• 2016: Intel SEC filing - the gap between successive generations of chips with new, smaller transistors will widen. With transistors at 14 nm (10 nm in 2017), it is becoming more difficult to shrink them cost-effectively.
2017 State of the Art
• Intel Xeon E5 2699 v4 ➡ 22 cores @ 2.2 GHz, 0.63 TF
• NVIDIA P100 ➡ 3,584 cores @ 1.33 GHz, 5.3 TF
• Xeon Phi 7290 ➡ 72 cores @ 1.5 GHz, 3.5 TF
• Power iSeries 8286-42A ➡ 12 cores @ 3.52 GHz, 0.34 TF
[Diagram: multiple nodes, each a multi-core chip whose cores (C) have private L1 and L2 caches and share an L3 and local memory; nodes are paired with accelerators and connected by a network]
Parallel Computers
Early machines:
• 3C DDP-116 (photo © Mark Richards): 4K words of 16-bit memory, 294K adds/sec
• Burroughs D825 (photo by Burroughs Corp): up to 4 computer modules and up to 16 memory modules (4 KW each), 167K adds/sec, 25K muls/sec
2017: “Fastest supercomputer in the world” - Sunway TaihuLight (10,649,600 cores)
  Linpack performance (Rmax): 93 PFlop/s; Theoretical peak (Rpeak): 125 PFlop/s
  Power: 15.4 MW; Memory: 1.3 PB
  Processor: Sunway SW26010 260C @ 1.45 GHz; Interconnect: Sunway; Operating system: Sunway RaiseOS 2.0.5
  (photo courtesy Xinhua)
Why Parallel?
• Can’t clock faster
• Do more per clock (bigger ICs ...)
  – Execute complex “special-purpose” instructions
  – Execute more simple instructions
• Even with more instructions per second, RAM access remains a bottleneck (~ +10% per year)
  – Multiple processors can access memory in parallel
  – Increased caching requirement
• HPC is also about
  – Parallel memory (larger and faster)
  – Parallel IO
  – Many HPC applications rely on this more than on raw computation speed
Executing Stored Programs
[Diagram: a sequential program is one stream of (OP, operands) instructions executed one after another; a parallel program runs multiple such streams concurrently - here, two threads of execution]
Compiler/System Issues
• ILP: Even a single thread does not execute strictly in ‘program order’
  ➡ Pipelined: different instructions in different stages
  ➡ Superscalar: multiple execution units execute different instructions in “parallel”
  ➡ Out-of-order execution
  ➡ Speculative execution (branch prediction)
  ➡ Auto-parallelization and vectorization
• Variables can be shared across multiple threads
  ➡ With direct participation or obliviously
  ➡ Memory transfer is not in terms of variables but “cache lines”
  ➡ Need to “synchronize” access (see the sketch below)
• Execution of multiple threads need not be “sequentially consistent”
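To make the synchronization point concrete, here is a minimal sketch (not from the slides) using POSIX threads: two threads increment a shared counter, and the mutex is what makes the read-modify-write on the shared datum safe. Names and iteration counts are illustrative.

```c
/* Illustrative sketch: why shared variables need synchronization.
   Compile with: gcc -pthread race.c */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                            /* shared across threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                  /* without this, updates can be lost */
        counter++;                                  /* read-modify-write on shared data */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);             /* 2000000 only because of the mutex */
    return 0;
}
```

Without the lock the program still compiles and runs, but increments from the two threads can interleave and be lost; the loss is timing-dependent, which is what makes such bugs hard to reproduce.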
Programming in the ‘Parallel’
• Understand the target model (semantics)
  – Implications/restrictions of constructs/features
• Design for the target model
  – Choice of granularity, synchronization primitives
  – Usually more of a performance issue
• Think concurrent
  – For each thread, other threads are ‘adversaries’
    • At least with regard to timing
  – Process launch, communication, synchronization
• Clearly define pre- and post-conditions
• Employ high-level constructs when possible
  – Debugging is extra hard without them
Learn Parallel Programming?
• Let the compiler extract parallelism? (see the loop sketch below)
  – In general, not successful so far
  – Too context sensitive
  – Many efficient serial data structures and algorithms are parallel-inefficient
  – Even if the compiler extracted parallelism from serial code, it may not be what you want
• The programmer must conceptualize and code parallelism
• Understand parallel algorithms and data structures
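A small illustration (not from the slides) of why automatic extraction is hard: the first loop’s iterations are independent and a compiler can parallelize or vectorize it, while the second has a loop-carried dependency, so running its iterations concurrently as written gives wrong answers; a parallel version needs a different algorithm (e.g., a parallel prefix/scan).

```c
/* Illustrative sketch: independent vs. dependent loop iterations. */
#include <stddef.h>

void scale(double *a, const double *b, size_t n) {
    /* Each iteration touches only its own element: trivially parallel,
       and an auto-parallelizer/vectorizer can safely split it. */
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

void prefix_sum(double *a, size_t n) {
    /* a[i] depends on a[i-1]: a loop-carried dependency.
       Naively running iterations concurrently gives wrong answers;
       a parallel version needs a scan algorithm instead. */
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```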
Communication
[Diagram: Shared Memory - several processors (P) attached to one shared memory; Message Passing - processors each with their own memory, connected via an interconnect]
Shared Memory Architecture
• Processors access memory as a global address space (see the OpenMP sketch below)
• Memory updates by one processor are visible to others (eventually)
  – Memory consistency models
• UMA
  – Typically Symmetric Multiprocessors (SMP)
  – Equal access and access times to memory
  – Hardware support for cache coherency (CC-UMA)
• NUMA
  – Typically multiple SMPs, with access to each other’s memories
  – Not all processors have equal access time to all memories
  – CC-NUMA: cache coherency is harder
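For a flavor of the global-address-space style, here is a minimal sketch assuming OpenMP (illustrative, not part of the slides): every thread reads the same shared array, and the reduction clause takes care of combining the per-thread partial sums.

```c
/* Sketch: shared-memory (global address space) programming with OpenMP.
   Compile with: gcc -fopenmp sum.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    /* Every thread sees the same array 'a'; OpenMP combines the
       per-thread partial sums into the shared variable 'sum'. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```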
Shared Memory
[Diagram: UMA - processors (P) connected through a memory controller to a single shared memory; NUMA - processors each with local memory, connected to each other via an interconnect]
Pros/Cons of Shared Memory
+ Easier to program with a global address space
+ Typically fast memory access (when hardware supported)
- Hard to scale
- Adding CPUs (geometrically) increases traffic (see the false-sharing sketch below)
- Programmer-initiated synchronization of memory accesses
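One way the extra traffic shows up is false sharing. The sketch below (illustrative; assumes OpenMP and 64-byte cache lines, both assumptions) has each thread update only its own counter, yet the unpadded counters sit on the same cache line and ping-pong between caches; the padded version keeps one counter per line.

```c
/* Sketch: false sharing - per-thread counters on one cache line cause
   coherency traffic even though no data is logically shared. */
#include <omp.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    10000000L

struct padded { volatile long value; char pad[64 - sizeof(long)]; }; /* ~one line each */

volatile long hot[NTHREADS];           /* adjacent: likely share cache lines */
struct padded cold[NTHREADS];          /* padded: one line per counter       */

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    { int id = omp_get_thread_num();
      for (long i = 0; i < ITERS; i++) hot[id]++; }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(NTHREADS)
    { int id = omp_get_thread_num();
      for (long i = 0; i < ITERS; i++) cold[id].value++; }
    double t2 = omp_get_wtime();

    printf("unpadded: %.3fs  padded: %.3fs\n", t1 - t0, t2 - t1);
    return 0;
}
```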
Distributed Memory Architecture
• Communication network (typically between processors, but also memory)
  – Ethernet, InfiniBand, custom made
• Processor-local memory
• Access to another processor’s data through a well-defined communication protocol (sketched below with MPI)
  – Implicit synchronization semantics
• Inter-process synchronization by the programmer
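A minimal message-passing sketch, assuming MPI (illustrative): each rank owns its own memory, so the only way rank 1 can see rank 0’s value is via an explicit send/receive pair.

```c
/* Sketch: distributed-memory programming with MPI.
   Compile with: mpicc ping.c ; run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value;
    if (rank == 0) {
        value = 42;                                    /* lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);         /* data arrived over the network */
    }

    MPI_Finalize();
    return 0;
}
```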
Pros/Cons of Distributed Memory
+ Memory is scalable with the number of processors
+ Local access is fast (no cache-coherency overhead)
+ Cost effective, with off-the-shelf processors/network
- Programs are often more complex (no RAM model)
- Data communication is complex to manage
Parallel Task Decomposition
• Data parallel
  – Perform f(x) for many x
• Task parallel
  – Perform many functions f_i
• Pipeline
[Diagram: data-parallel vs. task-parallel decomposition] (see the sketch below)
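A short sketch (assuming OpenMP; illustrative) contrasting the two decompositions: a parallel for applies the same function to many data items, while parallel sections run different functions concurrently.

```c
/* Sketch: data-parallel vs. task-parallel decomposition with OpenMP. */
#include <math.h>

void data_parallel(double *x, int n) {
    /* Same operation f(x) applied independently to many data items. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] = sqrt(x[i]);
}

void task_parallel(double *a, double *b, int n) {
    /* Different functions run concurrently on (possibly) different data. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < n; i++) a[i] = a[i] * 2.0; }   /* task 1 */
        #pragma omp section
        { for (int i = 0; i < n; i++) b[i] = b[i] + 1.0; }   /* task 2 */
    }
}
```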
Fundamental Questions
• Is the problem amenable to parallelization?
  – Are there (serial) dependencies?
• What machine architectures are available?
  – Can they be re-configured?
  – Communication network
• Algorithm
  – How to decompose the problem into tasks
  – How to map tasks to processors
Measuring Performance
• What do you measure?
  ➡ Elapsed wall-clock time ✦ But other processes can interfere
  ➡ CPU time, CPU+System time ✦ But waits/stalls caused by you are your waits
  ➡ Multiple threads of control
  ➡ Job throughput
• Interfering processes
• Cache performance warps benchmarks
  ➡ Other processes interfere with caches
  ➡ The OS also caches file data in memory
  ➡ Flush?
• Take the mean of multiple measurements (keep an eye on the variance)
• Measure scalability
  ➡ With input size, input location, processor count, memory size, network
• Use a profiler
Timer interfaces (used in the sketch below):
• clock()
• gettimeofday()
• times()
• clock_gettime()
• /usr/bin/time
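A small sketch of how two of these interfaces differ (illustrative): clock_gettime(CLOCK_MONOTONIC, ...) measures elapsed wall-clock time, while clock() measures CPU time consumed by the process; in a multi-threaded run the CPU time can exceed the wall-clock time.

```c
/* Sketch: measuring wall-clock time vs. CPU time around a code region. */
#include <stdio.h>
#include <time.h>

static double work(void) {                 /* something to measure */
    double s = 0.0;
    for (long i = 1; i < 50000000L; i++) s += 1.0 / (double)i;
    return s;
}

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* elapsed (wall-clock) time */
    clock_t c0 = clock();                  /* CPU time used by this process */

    double s = work();

    clock_t c1 = clock();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    printf("result=%f wall=%.3fs cpu=%.3fs\n", s, wall, cpu);
    return 0;
}
```

As the slide suggests, repeat the measurement several times and report the mean while keeping an eye on the variance.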
Simple Performance Metrics
• Speedup: Sp = T1 / Tp, where T1 = execution time on a 1-processor system and Tp = execution time using p processors
• Efficiency: Ep = Sp / p
• Cost: Cp = p × Tp
• Optimal if Cp = T1
• Look out for inefficiency: T1 = n^3 and Tp = n^2.5 for p = n^2 gives Cp = n^4.5
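A trivial helper (illustrative, with made-up numbers) that turns measured times into these metrics. For the slide’s cautionary case, T1 = n^3 and Tp = n^2.5 with p = n^2 gives a respectable-looking speedup Sp = n^0.5 but a cost Cp = n^4.5 far above T1, so the parallel algorithm is not cost-optimal.

```c
/* Sketch: speedup, efficiency and cost from measured runtimes. */
#include <stdio.h>

int main(void) {
    double t1 = 120.0;            /* hypothetical serial time, seconds      */
    double tp = 8.0;              /* hypothetical time on p processors      */
    int    p  = 32;

    double speedup    = t1 / tp;          /* Sp = T1 / Tp */
    double efficiency = speedup / p;      /* Ep = Sp / p  */
    double cost       = p * tp;           /* Cp = p * Tp  */

    printf("Sp=%.2f Ep=%.2f Cp=%.1f (cost-optimal iff Cp ~= T1)\n",
           speedup, efficiency, cost);
    return 0;
}
```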
Amdahl’s Law
• f = fraction of the problem that is sequential ➡ (1 - f) = fraction that is parallel
• Best parallel time: Tp = T1 (f + (1 - f)/p)
• Speedup with p processors: Sp = T1 / Tp = 1 / (f + (1 - f)/p)
Amdahl’s Law
• Only the fraction (1 - f) is shared by p processors; increasing p cannot speed up the fraction f
• Upper bound on the speedup at p = ∞: the term (1 - f)/p converges to 0, so S∞ = 1/f
• Example: f = 2%, S∞ → 1/0.02 = 50 (tabulated in the sketch below)
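A short sketch (illustrative) that tabulates the Amdahl bound Sp = 1/(f + (1 - f)/p) for the slide’s f = 2%: the speedup climbs quickly at first and then flattens toward 1/f = 50 no matter how many processors are added.

```c
/* Sketch: Amdahl's law speedup bound for a serial fraction f. */
#include <stdio.h>

int main(void) {
    double f = 0.02;                                   /* 2% serial fraction */
    for (int p = 1; p <= 4096; p *= 4) {
        double sp = 1.0 / (f + (1.0 - f) / p);         /* Sp = 1/(f + (1-f)/p) */
        printf("p=%4d  Sp=%6.2f\n", p, sp);
    }
    printf("limit as p -> infinity: 1/f = %.1f\n", 1.0 / f);
    return 0;
}
```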
[Figure: Speedup versus the number of processing elements for adding a list of numbers - speedup can saturate and efficiency drops (Amdahl’s law)]
Scaling Characteristics
(source: Grama et al.)
Isoefficiency: Measure of Scalability
• Rate at which the problem size must increase (as a function of the number of processors) to maintain a constant efficiency
  ➡ Lower rate => more scalable
  ➡ E = T1(n) / (p · Tp(n, p)); setting E = K (a constant) gives an implicit relation f(n, p) = K, whose solution n = I(K, p) is the isoefficiency function
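As a worked instance of this definition, consider the standard example of adding n numbers on p processors (local partial sums followed by a log p reduction, as in Grama et al.); the derivation below is a sketch assuming unit cost per addition and per communication step.

```latex
T_1(n) = n, \qquad T_p(n,p) \approx \frac{n}{p} + 2\log p
\;\Rightarrow\;
E = \frac{T_1}{p\,T_p} = \frac{n}{n + 2p\log p}.
\text{Setting } E = K \text{ (constant)}:\quad
n(1-K) = 2Kp\log p
\;\Rightarrow\;
n = \frac{2K}{1-K}\,p\log p = \Theta(p\log p).
```

So for this algorithm the isoefficiency function is Θ(p log p): the problem size must grow at least that fast with the processor count to keep the efficiency constant, which is a fairly low (i.e., scalable) rate.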