COL730: An Introduction to Parallel Computation
Focus of the Course
• Understand the parallel eco-system
  ➡ Parallel computer organization
  ➡ Parallel algorithms and data structures; analysis
• Learn broad parallel programming styles
  ➡ Shared memory programming, distributed memory programming, co-processors, hybrids
• Learn basic parallelization tools and techniques
  ➡ Performance measurement
  ➡ Scheduling and load balancing, inter-process communication, synchronization
• Get broad coverage for teaching “parallel programming”
  ➡ Also prepare to delve into research
• C/C++ is assumed
Related Keywords
• Parallel
• Concurrent
• Distributed
• Supercomputing
• Grid computing
• Cloud computing
• Big data
Large Computational Problems
• Molecular simulation ➡ Many million bodies ⇒ days per iteration
• Atmospheric simulation ➡ 1 km 3D grid, each point interacts with its neighbors ➡ Days of simulation time
• Data analytics
• Computational biology ➡ Drug design ➡ Gene sequencing
• Oil exploration ➡ Months of processing of seismic data
• Financial processing ➡ Market prediction, investing
NAMD, 1M atoms: 316 PF/ns; GROMACS, 0.25M molecules: 25 PF/ns; LAMMPS, 0.5M atoms: 15 PF/ns; Weather: 1M PF for full weather modeling of 2 weeks; Theano, 1M images: 375 PF
[Chart: microprocessor trends over time - Transistors (thousands), Clock rate (MHz), Power (W), Instructions/Clock (ILP)]
Moore’s Law (1965)
• Newly trendy technology of computer chips would become doubly powerful every two years.
• They would eventually be so small that they could be embedded in homes, cars and personal portable communications equipment.
“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years.”
• 1975: Revised the rate of circuit-complexity doubling to 18 months going forward. “There is no room left to squeeze anything out by being clever. Going forward from here we have to depend on the two size factors - bigger dies and finer dimensions.”
• 2003: “Another decade is probably straightforward... There is certainly no end to creativity.”
• 2010: International Technology Roadmap for Semiconductors: growth to slow at the end of 2013; transistor counts and densities will double every three years.
• 2016: Intel SEC filing - the gap between successive generations of chips with new, smaller transistors will widen. With transistors at 14 nm (10 nm in 2017), it is becoming more difficult to shrink them cost-effectively.
2017 State of the Art
• Intel Xeon E5 2699 v4 ➡ 22 cores @ 2.2 GHz, 0.63 TF
• NVIDIA P100 ➡ 3,584 cores @ 1.33 GHz, 5.3 TF
• Xeon Phi 7290 ➡ 72 cores @ 1.5 GHz, 3.5 TF
• Power iSeries 8286-42A ➡ 12 cores @ 3.52 GHz, 0.34 TF
[Diagram: multiple nodes, each a multi-core chip whose cores (C) have private L1 and L2 caches and share an L3 and local memory; nodes are paired with accelerators and connected by a network]
Parallel Computers
Early machines:
• 3C DDP-116 (photo © Mark Richards): 4K words of 16-bit memory, 294K adds/sec
• Burroughs D825 (photo by Burroughs Corp): up to 4 computer modules and up to 16 memory modules (4 KW each), 167K adds/sec, 25K muls/sec
2017: “Fastest supercomputer in the world” - Sunway TaihuLight (10,649,600 cores)
  Linpack performance (Rmax): 93 PFlop/s; Theoretical peak (Rpeak): 125 PFlop/s
  Power: 15.4 MW; Memory: 1.3 PB
  Processor: Sunway SW26010 260C @ 1.45 GHz; Interconnect: Sunway; Operating system: Sunway RaiseOS 2.0.5
  (photo courtesy Xinhua)
Why Parallel?
• Can’t clock faster
• Do more per clock (bigger ICs ...)
  – Execute complex “special-purpose” instructions
  – Execute more simple instructions
• Even with more instructions per second, RAM access remains a bottleneck (~ +10% per year)
  – Multiple processors can access memory in parallel
  – Increased caching requirement
• HPC is also about
  – Parallel memory (larger and faster)
  – Parallel IO
  – Many HPC applications rely on this more than on raw computation speed
Executing Stored Programs
[Diagram: a sequential program is one stream of (OP, operands) instructions executed one after another; a parallel program runs multiple such streams concurrently - here, two threads of execution]
Compiler/System Issues
• ILP: Even a single thread does not execute strictly in ‘program order’
  ➡ Pipelined: different instructions in different stages
  ➡ Superscalar: multiple execution units execute different instructions in “parallel”
  ➡ Out-of-order execution
  ➡ Speculative execution (branch prediction)
  ➡ Auto-parallelization and vectorization
• Variables can be shared across multiple threads
  ➡ With direct participation or obliviously
  ➡ Memory transfer is not in terms of variables but “cache lines”
  ➡ Need to “synchronize” access (see the sketch below)
• Execution of multiple threads need not be “sequentially consistent”
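To make the synchronization point concrete, here is a minimal sketch (not from the slides) using POSIX threads: two threads increment a shared counter, and the mutex is what makes the read-modify-write on the shared datum safe. Names and iteration counts are illustrative.

```c
/* Illustrative sketch: why shared variables need synchronization.
   Compile with: gcc -pthread race.c */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                            /* shared across threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                  /* without this, updates can be lost */
        counter++;                                  /* read-modify-write on shared data */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);             /* 2000000 only because of the mutex */
    return 0;
}
```

Without the lock the program still compiles and runs, but increments from the two threads can interleave and be lost; the loss is timing-dependent, which is what makes such bugs hard to reproduce.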
Programming in the ‘Parallel’
• Understand the target model (semantics)
  – Implications/restrictions of constructs/features
• Design for the target model
  – Choice of granularity, synchronization primitives
  – Usually more of a performance issue
• Think concurrent
  – For each thread, other threads are ‘adversaries’
    • At least with regard to timing
  – Process launch, communication, synchronization
• Clearly define pre- and post-conditions
• Employ high-level constructs when possible
  – Debugging is extra hard without them
Learn Parallel Programming?
• Let the compiler extract parallelism? (see the loop sketch below)
  – In general, not successful so far
  – Too context sensitive
  – Many efficient serial data structures and algorithms are parallel-inefficient
  – Even if the compiler extracted parallelism from serial code, it may not be what you want
• The programmer must conceptualize and code parallelism
• Understand parallel algorithms and data structures
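A small illustration (not from the slides) of why automatic extraction is hard: the first loop’s iterations are independent and a compiler can parallelize or vectorize it, while the second has a loop-carried dependency, so running its iterations concurrently as written gives wrong answers; a parallel version needs a different algorithm (e.g., a parallel prefix/scan).

```c
/* Illustrative sketch: independent vs. dependent loop iterations. */
#include <stddef.h>

void scale(double *a, const double *b, size_t n) {
    /* Each iteration touches only its own element: trivially parallel,
       and an auto-parallelizer/vectorizer can safely split it. */
    for (size_t i = 0; i < n; i++)
        a[i] = 2.0 * b[i];
}

void prefix_sum(double *a, size_t n) {
    /* a[i] depends on a[i-1]: a loop-carried dependency.
       Naively running iterations concurrently gives wrong answers;
       a parallel version needs a scan algorithm instead. */
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```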
Communication
[Diagram: Shared Memory - several processors (P) attached to one shared memory; Message Passing - processors each with their own memory, connected via an interconnect]
Shared Memory Architecture
• Processors access memory as a global address space (see the OpenMP sketch below)
• Memory updates by one processor are visible to others (eventually)
  – Memory consistency models
• UMA
  – Typically Symmetric Multiprocessors (SMP)
  – Equal access and access times to memory
  – Hardware support for cache coherency (CC-UMA)
• NUMA
  – Typically multiple SMPs, with access to each other’s memories
  – Not all processors have equal access time to all memories
  – CC-NUMA: cache coherency is harder
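For a flavor of the global-address-space style, here is a minimal sketch assuming OpenMP (illustrative, not part of the slides): every thread reads the same shared array, and the reduction clause takes care of combining the per-thread partial sums.

```c
/* Sketch: shared-memory (global address space) programming with OpenMP.
   Compile with: gcc -fopenmp sum.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    /* Every thread sees the same array 'a'; OpenMP combines the
       per-thread partial sums into the shared variable 'sum'. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```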
Shared Memory
[Diagram: UMA - processors (P) connected through a memory controller to a single shared memory; NUMA - processors each with local memory, connected to each other via an interconnect]
Pros/Cons of Shared Memory
+ Easier to program with a global address space
+ Typically fast memory access (when hardware supported)
- Hard to scale
- Adding CPUs (geometrically) increases traffic (see the false-sharing sketch below)
- Programmer-initiated synchronization of memory accesses
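One way the extra traffic shows up is false sharing. The sketch below (illustrative; assumes OpenMP and 64-byte cache lines, both assumptions) has each thread update only its own counter, yet the unpadded counters sit on the same cache line and ping-pong between caches; the padded version keeps one counter per line.

```c
/* Sketch: false sharing - per-thread counters on one cache line cause
   coherency traffic even though no data is logically shared. */
#include <omp.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    10000000L

struct padded { volatile long value; char pad[64 - sizeof(long)]; }; /* ~one line each */

volatile long hot[NTHREADS];           /* adjacent: likely share cache lines */
struct padded cold[NTHREADS];          /* padded: one line per counter       */

int main(void) {
    double t0 = omp_get_wtime();
    #pragma omp parallel num_threads(NTHREADS)
    { int id = omp_get_thread_num();
      for (long i = 0; i < ITERS; i++) hot[id]++; }
    double t1 = omp_get_wtime();

    #pragma omp parallel num_threads(NTHREADS)
    { int id = omp_get_thread_num();
      for (long i = 0; i < ITERS; i++) cold[id].value++; }
    double t2 = omp_get_wtime();

    printf("unpadded: %.3fs  padded: %.3fs\n", t1 - t0, t2 - t1);
    return 0;
}
```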
Distributed Memory Architecture
• Communication network (typically between processors, but also memory)
  – Ethernet, InfiniBand, custom made
• Processor-local memory
• Access to another processor’s data through a well-defined communication protocol (sketched below with MPI)
  – Implicit synchronization semantics
• Inter-process synchronization by the programmer
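A minimal message-passing sketch, assuming MPI (illustrative): each rank owns its own memory, so the only way rank 1 can see rank 0’s value is via an explicit send/receive pair.

```c
/* Sketch: distributed-memory programming with MPI.
   Compile with: mpicc ping.c ; run with: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value;
    if (rank == 0) {
        value = 42;                                    /* lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);         /* data arrived over the network */
    }

    MPI_Finalize();
    return 0;
}
```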
Pros/Cons of Distributed Memory
+ Memory is scalable with the number of processors
+ Local access is fast (no cache-coherency overhead)
+ Cost effective, with off-the-shelf processors/network
- Programs are often more complex (no RAM model)
- Data communication is complex to manage
Parallel Task Decomposition
• Data parallel
  – Perform f(x) for many x
• Task parallel
  – Perform many functions f_i
• Pipeline
[Diagram: data-parallel vs. task-parallel decomposition] (see the sketch below)
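A short sketch (assuming OpenMP; illustrative) contrasting the two decompositions: a parallel for applies the same function to many data items, while parallel sections run different functions concurrently.

```c
/* Sketch: data-parallel vs. task-parallel decomposition with OpenMP. */
#include <math.h>

void data_parallel(double *x, int n) {
    /* Same operation f(x) applied independently to many data items. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        x[i] = sqrt(x[i]);
}

void task_parallel(double *a, double *b, int n) {
    /* Different functions run concurrently on (possibly) different data. */
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < n; i++) a[i] = a[i] * 2.0; }   /* task 1 */
        #pragma omp section
        { for (int i = 0; i < n; i++) b[i] = b[i] + 1.0; }   /* task 2 */
    }
}
```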
Fundamental Questions
• Is the problem amenable to parallelization?
  – Are there (serial) dependencies?
• What machine architectures are available?
  – Can they be re-configured?
  – Communication network
• Algorithm
  – How to decompose the problem into tasks
  – How to map tasks to processors
Measuring Performance
• What do you measure?
  ➡ Elapsed wall-clock time ✦ But other processes can interfere
  ➡ CPU time, CPU+System time ✦ But waits/stalls caused by you are your waits
  ➡ Multiple threads of control
  ➡ Job throughput
• Interfering processes
• Cache performance warps benchmarks
  ➡ Other processes interfere with caches
  ➡ The OS also caches file data in memory
  ➡ Flush?
• Take the mean of multiple measurements (keep an eye on the variance)
• Measure scalability
  ➡ With input size, input location, processor count, memory size, network
• Use a profiler
Timer interfaces (used in the sketch below):
• clock()
• gettimeofday()
• times()
• clock_gettime()
• /usr/bin/time
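A small sketch of how two of these interfaces differ (illustrative): clock_gettime(CLOCK_MONOTONIC, ...) measures elapsed wall-clock time, while clock() measures CPU time consumed by the process; in a multi-threaded run the CPU time can exceed the wall-clock time.

```c
/* Sketch: measuring wall-clock time vs. CPU time around a code region. */
#include <stdio.h>
#include <time.h>

static double work(void) {                 /* something to measure */
    double s = 0.0;
    for (long i = 1; i < 50000000L; i++) s += 1.0 / (double)i;
    return s;
}

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);   /* elapsed (wall-clock) time */
    clock_t c0 = clock();                  /* CPU time used by this process */

    double s = work();

    clock_t c1 = clock();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double wall = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double cpu  = (double)(c1 - c0) / CLOCKS_PER_SEC;
    printf("result=%f wall=%.3fs cpu=%.3fs\n", s, wall, cpu);
    return 0;
}
```

As the slide suggests, repeat the measurement several times and report the mean while keeping an eye on the variance.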
Simple Performance Metrics
• Speedup: Sp = T1 / Tp, where T1 = execution time on a 1-processor system and Tp = execution time using p processors
• Efficiency: Ep = Sp / p
• Cost: Cp = p × Tp
• Optimal if Cp = T1
• Look out for inefficiency: T1 = n^3 and Tp = n^2.5 for p = n^2 gives Cp = n^4.5
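A trivial helper (illustrative, with made-up numbers) that turns measured times into these metrics. For the slide’s cautionary case, T1 = n^3 and Tp = n^2.5 with p = n^2 gives a respectable-looking speedup Sp = n^0.5 but a cost Cp = n^4.5 far above T1, so the parallel algorithm is not cost-optimal.

```c
/* Sketch: speedup, efficiency and cost from measured runtimes. */
#include <stdio.h>

int main(void) {
    double t1 = 120.0;            /* hypothetical serial time, seconds      */
    double tp = 8.0;              /* hypothetical time on p processors      */
    int    p  = 32;

    double speedup    = t1 / tp;          /* Sp = T1 / Tp */
    double efficiency = speedup / p;      /* Ep = Sp / p  */
    double cost       = p * tp;           /* Cp = p * Tp  */

    printf("Sp=%.2f Ep=%.2f Cp=%.1f (cost-optimal iff Cp ~= T1)\n",
           speedup, efficiency, cost);
    return 0;
}
```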
Amdahl’s Law
• f = fraction of the problem that is sequential ➡ (1 - f) = fraction that is parallel
• Best parallel time: Tp = T1 (f + (1 - f)/p)
• Speedup with p processors: Sp = T1 / Tp = 1 / (f + (1 - f)/p)
Amdahl’s Law
• Only the fraction (1 - f) is shared by p processors; increasing p cannot speed up the fraction f
• Upper bound on the speedup at p = ∞: the term (1 - f)/p converges to 0, so S∞ = 1/f
• Example: f = 2%, S∞ → 1/0.02 = 50 (tabulated in the sketch below)
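A short sketch (illustrative) that tabulates the Amdahl bound Sp = 1/(f + (1 - f)/p) for the slide’s f = 2%: the speedup climbs quickly at first and then flattens toward 1/f = 50 no matter how many processors are added.

```c
/* Sketch: Amdahl's law speedup bound for a serial fraction f. */
#include <stdio.h>

int main(void) {
    double f = 0.02;                                   /* 2% serial fraction */
    for (int p = 1; p <= 4096; p *= 4) {
        double sp = 1.0 / (f + (1.0 - f) / p);         /* Sp = 1/(f + (1-f)/p) */
        printf("p=%4d  Sp=%6.2f\n", p, sp);
    }
    printf("limit as p -> infinity: 1/f = %.1f\n", 1.0 / f);
    return 0;
}
```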
[Figure: Speedup versus the number of processing elements for adding a list of numbers - speedup can saturate and efficiency drops (Amdahl’s law)]
Scaling Characteristics
(source: Grama et al.)
Isoefficiency: Measure of Scalability
• Rate at which the problem size must increase (as a function of the number of processors) to maintain a constant efficiency
  ➡ Lower rate => more scalable
  ➡ E = T1(n) / (p · Tp(n, p)); setting E = K (a constant) gives an implicit relation f(n, p) = K, whose solution n = I(K, p) is the isoefficiency function
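As a worked instance of this definition, consider the standard example of adding n numbers on p processors (local partial sums followed by a log p reduction, as in Grama et al.); the derivation below is a sketch assuming unit cost per addition and per communication step.

```latex
T_1(n) = n, \qquad T_p(n,p) \approx \frac{n}{p} + 2\log p
\;\Rightarrow\;
E = \frac{T_1}{p\,T_p} = \frac{n}{n + 2p\log p}.
\text{Setting } E = K \text{ (constant)}:\quad
n(1-K) = 2Kp\log p
\;\Rightarrow\;
n = \frac{2K}{1-K}\,p\log p = \Theta(p\log p).
```

So for this algorithm the isoefficiency function is Θ(p log p): the problem size must grow at least that fast with the processor count to keep the efficiency constant, which is a fairly low (i.e., scalable) rate.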