Lecture 10:
Parallelism & Clusters
Department of Electrical Engineering, Stanford University
EE282 – Fall 2008, Christos Kozyrakis
http://eeclass.stanford.edu/ee282
Announcements
• Graded quiz 1 available on Wed
– Solutions available online
• HW2 available online
– Due on 11/12
• PA-1 due on 10/29
Review: Parallel Systems
• Differentiating factors to keep in mind
– Degree of integration
– Which resources are parallelized?
– Uniform vs. non-uniform storage access
– Communication through memory or I/O accesses
• These choices have implications on
– Scaling, suitability to specific apps, cost, software infrastructure, …
• Parallelization approaches
– Data or domain parallelism
– Task or functional parallelism
– Task pipelining
– Combinations…
[Figure: four multiprocessor organizations, differing in where the network sits and in whether memory (M) and I/O are shared or private per processor/cache (P/$)]
Review: Limitations to Parallelism
• Major issues to keep in mind
– Serially dominated workload
– Parallel overhead (e.g., excessive or slow communication)
– I/O bottlenecks
– Load imbalance
– Locality issues
• Metrics
– Speedup
• Amdahl’s Law
• Don’t forget overheads of parallelism
– Efficiency
• May be misleading if parallel resources are cheap
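The speedup metric above can be sketched numerically. This is an illustrative model, not from the slides: the `overhead` term is a hypothetical per-run parallelization cost, expressed as a fraction of the serial runtime, bolted onto the classic Amdahl formula.

```python
def speedup(serial_frac, n, overhead=0.0):
    """Amdahl's Law with a simple additive overhead term.

    serial_frac: fraction of the work that cannot be parallelized
    n:           number of processors
    overhead:    parallelization cost as a fraction of serial runtime
                 (communication, synchronization, ...) -- an assumption
                 of this sketch, not part of the classic formula
    """
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n + overhead)

# With 5% serial work, 16 processors give well under 16x:
print(speedup(0.05, 16))         # ~9.1x
# Overheads shrink it further:
print(speedup(0.05, 16, 0.02))   # ~7.7x
```

Note how quickly the serial fraction dominates: even with zero overhead, no processor count can push speedup past 1/serial_frac.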
Review: Shared-memory vs Message-passing

Shared memory:
• Single address space for all CPUs
• Communication through regular load/store operations
– Implicit
• Synchronization using locks and barriers

Message passing:
• Private address spaces for CPUs
• Communication through message send/receive operations (through memory or I/O network)
– Explicit
• Synchronization using blocking messages
An Example - Iterative Solver
double a[2][MAXI+2][MAXJ+2];  // two copies of state;
                              // use one to compute the other
for (s=0; s<STEPS; s++) {
  k = s&1;  // 0 1 0 1 0 1 ...
  m = k^1;  // 1 0 1 0 1 0 ...
  forall(i=1; i<=MAXI; i++) {     // do iterations in parallel
    forall(j=1; j<=MAXJ; j++) {
      a[k][i][j] = c1*a[m][i][j]   + c2*a[m][i-1][j] +
                   c3*a[m][i+1][j] + c4*a[m][i][j-1] +
                   c5*a[m][i][j+1];
    }
  }
}
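A plain sequential Python rendering may help see what the pseudocode computes; `forall` simply marks loops whose iterations are independent (each step reads only copy `m` and writes only copy `k`). Problem size, coefficients, and initial condition here are made up for illustration.

```python
MAXI = MAXJ = 4
STEPS = 2
c1 = c2 = c3 = c4 = c5 = 0.2   # arbitrary stencil weights summing to 1

# two copies of state with a halo of zeros, as in a[2][MAXI+2][MAXJ+2]
a = [[[0.0] * (MAXJ + 2) for _ in range(MAXI + 2)] for _ in range(2)]
a[1][2][2] = 1.0   # arbitrary initial condition (step 0 reads copy m = 1)

for s in range(STEPS):
    k = s & 1   # 0 1 0 1 ...
    m = k ^ 1   # 1 0 1 0 ...
    for i in range(1, MAXI + 1):        # 'forall' on the slide
        for j in range(1, MAXJ + 1):
            a[k][i][j] = (c1 * a[m][i][j] + c2 * a[m][i - 1][j] +
                          c3 * a[m][i + 1][j] + c4 * a[m][i][j - 1] +
                          c5 * a[m][i][j + 1])

print(a[(STEPS - 1) & 1][2][2])   # value at the center after STEPS updates
```

Because each update reads only the previous copy, the inner iterations can be executed in any order or in parallel, which is exactly what `forall` asserts.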
Domain Decomposition
• Divide matrix A over 16 processors
– Each processor computes a 16x16 submatrix
• Processor 6
– Owns [i][j] = [32..47][16..31]
– Shares [i][j] = [31][16..31] and three other strips
• Each processor
– Communicates to get shared data it needs
– Computes its data
– Synchronizes
[Figure: 64x64 matrix (I and J from 0 to 63) split into a 4x4 grid of 16x16 submatrices, assigned to processors 0–15]
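The ownership calculation above can be written down directly. A sketch, with hypothetical helper names `owned` and `shared_strips`; the block numbering is chosen so that processor 6 owns [32..47][16..31], matching the slide's example.

```python
N, P = 64, 4        # 64x64 matrix, 4x4 grid of processors
B = N // P          # 16x16 submatrix per processor

def owned(p):
    """(i range, j range) owned by processor p, inclusive bounds."""
    bj, bi = divmod(p, P)   # numbering chosen to match the slide's example
    return (bi * B, (bi + 1) * B - 1), (bj * B, (bj + 1) * B - 1)

def shared_strips(p):
    """Neighbors' boundary strips that p must read (up to four)."""
    (i0, i1), (j0, j1) = owned(p)
    strips = []
    if i0 > 0:     strips.append(((i0 - 1, i0 - 1), (j0, j1)))  # row above
    if i1 < N - 1: strips.append(((i1 + 1, i1 + 1), (j0, j1)))  # row below
    if j0 > 0:     strips.append(((i0, i1), (j0 - 1, j0 - 1)))  # column left
    if j1 < N - 1: strips.append(((i0, i1), (j1 + 1, j1 + 1)))  # column right
    return strips

print(owned(6))             # ((32, 47), (16, 31)), as on the slide
print(shared_strips(6)[0])  # ((31, 31), (16, 31)): the strip a[31][16..31]
```

An interior processor like 6 has all four strips; corner processors have only two, which is one source of load and communication imbalance.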
Shared Memory Code
Fork N processes
each process p computes istart[p], iend[p], jstart[p], jend[p]

for (s=0; s<STEPS; s++) {
  k = s&1;
  m = k^1;
  forall(i=istart[p]; i<=iend[p]; i++) {    // e.g. 32..47
    forall(j=jstart[p]; j<=jend[p]; j++) {  // e.g. 16..31
      a[k][i][j] = c1*a[m][i][j]   + c2*a[m][i-1][j] +
                   c3*a[m][i+1][j] + c4*a[m][i][j-1] +
                   c5*a[m][i][j+1];         // implicit comm.
    }
  }
  barrier();
}
Message Passing Code
Fork N processes and distribute subarrays to processors
Each processor computes north[p], south[p], east[p], west[p];
  -1 if no neighbor in direction

for (s=0; s<STEPS; s++) {
  k = s&1; m = k^1;
  if (north[p] >= 0) send(north[p], NORTH, a[m][1][1..MAXSUBJ]);
  if (east[p]  >= 0) send(east[p],  EAST,  a[m][1..MAXSUBI][1]);
  same for south and west
  if (north[p] >= 0) receive(NORTH, a[m][0][1..MAXSUBJ]);
  same for other directions
  forall(i=1; i<=MAXSUBI; i++) {
    forall(j=1; j<=MAXSUBJ; j++) {
      a[k][i][j] = c1*a[m][i][j]   + c2*a[m][i-1][j] +
                   c3*a[m][i+1][j] + c4*a[m][i][j-1] +
                   c5*a[m][i][j+1];
    }
  }
}
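The north/south/east/west tables can be computed as below, returning -1 where there is no neighbor. This assumes a conventional row-major numbering of processors in a P x P grid; adjust to whatever numbering the decomposition actually uses.

```python
def neighbors(p, P):
    """(north, south, east, west) ranks for processor p in a P x P
    row-major grid; -1 if there is no neighbor in that direction."""
    row, col = divmod(p, P)
    north = p - P if row > 0 else -1
    south = p + P if row < P - 1 else -1
    west  = p - 1 if col > 0 else -1
    east  = p + 1 if col < P - 1 else -1
    return north, south, east, west

print(neighbors(5, 4))   # interior node: (1, 9, 6, 4)
print(neighbors(0, 4))   # corner node: (-1, 4, 1, -1)
```

The -1 sentinels are what make the `if (north[p] >= 0)` guards in the pseudocode above work at the edges of the grid.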
Shared-memory vs Message-passing Programming

• Shared memory
– Typically easier to write first correct version
• Communication through load/stores; just get synchronization right
– Typically more difficult to write fully optimized version
• Difficult to tell which loads/stores lead to communication
– Often more difficult to scale
• Can create fine-grain communication/synchronization
• Message passing
– Typically more difficult to write first correct version
– Typically easier to write fully optimized version
• Communication/synchronization on sends/receives
– Often easier to scale
• Typically leads to coarse-grain communication/synchronization
Convergence of Models
• Can do
– Message-passing programs on top of shared-memory hardware
• Load/stores to shared buffers to implement messages
• This is how custom message-passing machines work
– But no coherence…
– Shared-memory programs on top of message-passing hardware
• Use virtual memory system to implement sharing
• Can combine shared-memory & message-passing hardware
– Message-passing cluster with each node a shared-memory multiprocessor
• Within a chip (multi-core or CMP system), we can greatly improve both shared-memory and message-passing models
– Lower latency, more bandwidth, simpler networks, specialized HW support, …
Clusters
What is a Cluster?

• A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource
• Clusters are message-passing machines
– Disjoint address spaces
• A typical cluster uses:
– Commodity off-the-shelf parts: computers and networks
– Low-latency communication protocols
The History of Clusters
• In the 1980s, it was a vector SMP
– Custom components throughout
• In the 1990s, it was a massively parallel computer
– Commodity off-the-shelf CPUs, everything else custom
• … but today, it is a cluster
– COTS components everywhere
Systems View of a Cluster
[Figure: master node bridging the LAN/WAN and the cluster interconnect; also attached: file server / gateway, cluster management tools, and the compute nodes]
Cluster Pros/Cons
• Advantages
– Low cost: they use high-volume, commodity components
– Scale: easier to expand/scale than any other parallel system
• 10,000s of processors; can scale while service is uninterrupted
– Error isolation: separate address space limits error effects
– Repair: easier to replace a machine in a cluster than a component in a shared-memory system
• Disadvantages
– Administration cost: just like administering N independent machines
• Shared-memory systems behave like a single machine
– Communication overhead
• Typically goes through I/O bus and OS, long networking protocols
– Dealing with distributed storage
Dealing with Cluster Shortcomings
• Administration costClones of identical PCs– Clones of identical PCs
– 3 steps: reboot, reinstall OS, recycle– At $1000/PC, cheaper to discard than to figure out what is wrong
and repair it?• Network performance (more discussion later)
– Storage area networksStorage area networks– I/O accelerations
• Network interface at the memory bus, direct user access
• Storage• Storage– Separation of long term storage and computation– If separate storage servers or file servers, cluster is no worse (?)
EE282 – Fall 2008 Christos KozyrakisLecture 10 - 17
A Sample Cluster Design
[Figure: sample cluster design — a cluster switch plus per-rack switches; external network over Gigabit Ethernet (fibre) to the master node; data network over Gigabit Ethernet (copper) to compute nodes 1–32; control and out-of-band network over 100BaseT copper to a control node with a rack-mount LCD panel/keyboard; storage node with a connection to an EMC disk store]
Rack-mounted Systems
• Advantages
– Dense packing
– Simpler cabling
• Typical rack: 19” wide
– Height measured in RUs (1 RU = 1.75”)
• Collocation sites charge by rack
– Space + power supply
• A typical installation can support up to 1,000 Watts per rack
– Assuming 12MW/building, 10,000 square feet, 10 square feet per rack
Cluster Hardware: the Nodes
• A single element within the cluster
• Compute node
– Just computes – little else
– Private IP address – no user access
• Master/head/front-end node
– User login
– Job scheduler
– Public IP address – connects to external network
• Management/administrator node
– Systems/cluster management functions
– Secure administrator address
• I/O node
– Access to data
– Generally internal to cluster or to data centre
Technology Advancements in 5 Years
Codename  | Release date   | GHz | Number of cores per CPU | Peak FLOP per CPU cycle | Peak GFLOPS per CPU | Linpack on 256 processors
Foster    | September 2001 | 1.7 | 1                       | 2                       | 3.4                 | 288.9*
Woodcrest | June 2006      | 3.0 | 2                       | 4                       | 24                  | 4781**

* From November 2001 top500 supercomputer list (cluster of Dell Precision 530)
** Intel internal cluster built in 2006
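The "Peak GFLOPS per CPU" column is just the product of clock, core count, and per-core FLOPs per cycle, which a quick check confirms for both rows of the table:

```python
def peak_gflops(ghz, cores, flops_per_cycle):
    # peak GFLOPS = clock (GHz) x cores x peak FLOPs per cycle per core
    return ghz * cores * flops_per_cycle

print(peak_gflops(1.7, 1, 2))   # 3.4  (Foster)
print(peak_gflops(3.0, 2, 4))   # 24.0 (Woodcrest)
```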
Example
• Circa 2003, a custom Google PC consisted of:
– 1 to 2 CPUs
• From a 533MHz Celeron to a 1.4GHz Pentium III
– 256MB SDRAM (100MHz)
– ~2 IDE disks (typically 5400RPM)
– 100Mbps Ethernet link to a switch
• Why a low-end design?
– Cost
– Power
Cluster Interconnect
Interconnect      | Typical latency (usec) | Typical bandwidth (MB/s)
100 Mbps Ethernet | 75                     | 8
1 Gbit/s Ethernet | 60-90                  | 90
10 Gb/s Ethernet  | 12-20                  | 800
SCI*              | 1.5-4                  | 200-600
Myricom Myrinet*  | 2.2-3                  | 250-1200
InfiniBand*       | 2-4                    | 900-1400
Quadrics QsNet*   | 3-5                    | 600-900
Network Latency
• Diameter: the maximum, over all pairs of nodes, of the shortest path between a given pair of nodes
• Latency: delay between send and receive times
– Latency tends to vary widely across architectures
– Vendors often report hardware latencies (wire time)
– Application programmers care about software latencies (user program to user program)
• Observations:
– Hardware/software latencies often differ by 1-2 orders of magnitude
– Maximum hardware latency varies with diameter, but the variation in software latency is usually negligible
• Latency is important for programs with many small messages
Overall Network Latency
[Figure: timeline of a message from sender to receiver — sender overhead (processor busy), transmission time (size ÷ bandwidth), time of flight, transport latency, transmission time on the receiving link, receiver overhead (processor busy), total latency]

Total Latency = Sender Overhead + Time of Flight + Message Size ÷ BW + Receiver Overhead

Note: don’t forget that packets have header and trailer…
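The total-latency formula translates directly into code. With bandwidth in MB/s (10^6 bytes/s) and overheads in microseconds, size ÷ bandwidth conveniently comes out in microseconds; the example numbers are made up, roughly in the range of the interconnect table above.

```python
def total_latency_us(size_bytes, bw_mbps, sender_oh_us, receiver_oh_us,
                     time_of_flight_us):
    """Total latency (usec) = sender overhead + time of flight
       + message size / bandwidth + receiver overhead."""
    transmission_us = size_bytes / bw_mbps   # bytes / (10^6 bytes/s) -> usec
    return sender_oh_us + time_of_flight_us + transmission_us + receiver_oh_us

# e.g., a 1 KB message on a 90 MB/s link with 30 us of overhead per side:
print(total_latency_us(1000, 90, 30, 30, 1))   # overheads dominate
```

For small messages the two overhead terms dominate, which is why software overhead, not wire speed, usually sets the latency floor.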
Network Bandwidth
• The bandwidth of a link = w * 1/t
– w is the number of wires
– t is the time per bit
• Bandwidth typically in Gigabytes (GB), i.e., 8 * 2^30 bits
• Unidirectional: in one direction; bidirectional: in both directions
• Effective bandwidth is usually lower than physical link bandwidth due to packet overhead
• Bandwidth is important for applications with mostly large messages
[Figure: packet format — header (routing and control), data payload, trailer (error code)]
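Packet overhead reduces effective bandwidth by the fraction of each packet spent on header and trailer rather than payload. A sketch, with hypothetical sizes:

```python
def effective_bw(link_bw, payload, header, trailer):
    """Fraction of link bandwidth that actually carries payload bytes."""
    return link_bw * payload / (payload + header + trailer)

# e.g., a 1500-byte payload behind a 40-byte header and 4-byte trailer
# on a 125 MB/s link (all sizes illustrative):
print(effective_bw(125.0, 1500, 40, 4))   # delivered MB/s, below 125
```

Small payloads make the overhead fraction much worse, which is the bandwidth-side reason (in addition to per-message latency) to prefer fewer, larger messages.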
Bisection Bandwidth
• Bisection bandwidth: bandwidth across the smallest cut that divides the network into two equal halves
• Bandwidth across the “narrowest” part of the network
[Figure: a linear array, where the bisection cut crosses one link (bisection bw = link bw), and a 2D mesh, where it crosses sqrt(n) links (bisection bw = sqrt(n) * link bw); a cut that does not split the network in half is not a bisection cut]
• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others
Networking Background
• Topology (how things are connected)
– Crossbar, ring, 2-D and 3-D torus, hypercube, omega network
• Routing algorithm:
– Example: all east-west then all north-south (avoids deadlock)
• Switching strategy:
– Circuit switching: full path reserved for entire message, like the telephone
– Packet switching: message broken into separately-routed packets, like the post office
• Flow control (what if there is congestion):
– Stall, store data temporarily in buffers, re-route data to other nodes, tell source node to temporarily halt, discard, etc.
Network Topology
• In the past, there was considerable research in network topology and in mapping algorithms to topology
– Key cost to be minimized: number of “hops” between nodes
– Modern networks hide hop cost, so topology is no longer a major factor in application performance
• Why topology is still interesting
– Algorithms may have a communication topology
– Topology affects
• Bisection bandwidth
• Latency
• Observed congestion
• Trade-off: connectivity vs cost
– Number of switches, outgoing/incoming links per switch
Linear and Ring Topologies
• Linear array
– Diameter = n-1; average distance ~ n/3
– Bisection bandwidth = 1 (units are link bandwidth)
• Torus or ring
– Diameter = n/2; average distance ~ n/4
– Bisection bandwidth = 2
– Natural for algorithms that work with 1D arrays
Meshes and Tori
• Two-dimensional mesh
– Diameter = 2 * (sqrt(n) – 1)
– Bisection bandwidth = sqrt(n)
• Two-dimensional torus
– Diameter = sqrt(n)
– Bisection bandwidth = 2 * sqrt(n)
• Generalizes to higher dimensions (3D torus)
• Natural for algorithms that work with 2D and/or 3D arrays
Hypercubes
• Number of nodes n = 2^d for dimension d
– Diameter = d
– Bisection bandwidth = n/2
• Hypercubes of dimension 0 through 4 (0d, 1d, 2d, 3d, 4d)
• Gray-code addressing:
– Each node connected to d others with 1 bit different
[Figure: 3D hypercube with nodes labeled 000 through 111]
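The hypercube properties are easy to verify in code: each node's d neighbors are found by flipping one address bit, and a breadth-first search from node 0 confirms that the diameter equals the dimension.

```python
from collections import deque

def hc_neighbors(node, d):
    """Neighbors of a node in a d-dimensional hypercube: flip each bit."""
    return [node ^ (1 << bit) for bit in range(d)]

def hc_diameter(d):
    """Longest shortest path from node 0, found by BFS."""
    dist = {0: 0}
    q = deque([0])
    while q:
        u = q.popleft()
        for v in hc_neighbors(u, d):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

print(hc_neighbors(0b000, 3))   # [1, 2, 4] -> nodes 001, 010, 100
print(hc_diameter(4))           # 4: diameter equals the dimension
```

The farthest node from 0 is always the all-ones address, at Hamming distance d, which is what the BFS rediscovers.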
Trees
• Diameter = log n
• Bisection bandwidth = 1
• Easy layout as planar graph
• Many tree algorithms (e.g., summation)
• Fat trees avoid the bisection bandwidth problem:
– More (or wider) links near the top
Butterflies
• Diameter = log n
• Bisection bandwidth = n
• Cost: lots of wires
• Natural for FFT
[Figure: a 2x2 butterfly switch and a multistage butterfly network built from such switches, with 0/1 port labels]
Cluster Architecture View: HW + SW
• Application: parallel benchmarks (Perf, Ring, HINT, NAS, …) and real applications
• Middleware: MPI, PVM, shmem
• OS: Linux, other OSes
• Protocol: TCP/IP, VIA, proprietary
• Hardware: desktop, workstation (1P/2P), server (4U+)
• Interconnect: Ethernet, Myrinet, Infiniband, Quadrics
Cluster Design Exercise from Old Version of Textbook
• Goal: design a cluster with 32 processors, 32GB DRAM, 32-64 disks
• Choices
– Type of processor (board)
– Type of DRAM DIMMs
– Location of disks (local vs across network)
• Constraints:– Cost– Rack-mounted system
Components & Costs
Component             | 1-way        | 2-way        | 8-way
Processors/box        | 1 (1 RU)     | 2 (1 RU)     | 4 (8 RU)
Frequency/L2 size     | 1GHz / 256KB | 1GHz / 256KB | 700MHz / 1MB
Box + 1 processor     | $1,759       | $1,939       | $14,614
Extra processor       | -            | $799         | $1,799
0.5GB SDRAM DIMM      | $549         | $749         | $1,069
1GB SDRAM DIMM        | -            | $1,689       | $2,369
36GB SCSI disk        | $579         | $639         | $639
73GB SCSI disk        | -            | $1,299       | $1,299
LAN switch (8p, 1RU)  | $6,280       | $6,280       | $6,280
LAN switch (30p, 2RU) | $15,995      | $15,995      | $15,995
LAN adapter           | $795         | $795         | $795
44-RU rack            | $1,975       | $1,975       | $1,975
Cluster with 1-way processors
• 32 boxes, 32x2x0.5GB DRAM, 32x2 36GB disks (2.3TB)
• 32 LAN adapters, 2 30p LAN switches
• 1 rack (32+2+2 RUs)
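The 1-way bill of materials can be totaled from the Components & Costs table. The slide's own cost chart is not reproduced here, so this is just the sum of the listed component prices:

```python
# Prices from the 1-way column of the Components & Costs table
cost_1way = (32 * 1759      # boxes, each with 1 processor
             + 64 * 549     # 32 x 2 0.5GB SDRAM DIMMs = 32GB total
             + 64 * 579     # 32 x 2 36GB SCSI disks ~ 2.3TB total
             + 32 * 795     # one LAN adapter per box
             + 2 * 15995    # two 30-port LAN switches
             + 1 * 1975)    # one 44-RU rack
print(cost_1way)            # total hardware cost in dollars
```

The same arithmetic with the 2-way and 8-way columns reproduces the trade-off the next slides explore: fewer, bigger boxes save on adapters, switch ports, and rack space, but the boxes themselves cost far more.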
Cluster with 2-way processors
• 16 boxes, 16x2 1GB DRAM, 16x2 73GB disks (2.3TB)
• 16 LAN adapters, 1 30p LAN switch
• 1 rack (16+2 RUs)
Cluster with 8-way processors
• 4 boxes, 4x8 1GB DRAM, 4x2 73GB disks
• 4 storage expansion slots (6 disks, 3 RU each)
• 4 LAN adapters, 1 8p LAN switch
• 2 racks (4*8 + 4*3 + 1 RUs)
Cost Comparison
Disks across the Network
• Disks per processor box:
– Advantage: cheaper
– Disadvantage: no fault tolerance
• Disks across the network (e.g., Fibre Channel SAN)
– Advantage: can organize in fault-tolerant groups (see later lecture)
– Disadvantage: cost
• Controllers, enclosure, cables, extra rack space
Cost Comparison with SAN-based Disks
Total Cost of Ownership

• Software: OS, database server, …
– Most companies charge one license per processor (or box)
– E.g., Win2K for 1-4 CPUs $800, for 1-8 CPUs $3,295
– E.g., SQL Server for 1 CPU $16,000
• HW maintenance
• Rack space rental + power supply
– $800-$1,500 per month
• Internet access
• Operator
– $100,000 per year
• Backups
– Tapes, devices, etc.
3-year Ownership Cost Comparison
Some General Observations
• Cost of HW is typically less than cost of ownership
– Software, maintenance, backups, …
• Space and power consumption can be important for clusters
– Same as with embedded computing ☺
• Smaller nodes are typically cheaper and faster
– Simpler & faster processors
– Less sharing of resources within each node
• Some integration is still beneficial
– 2-way to 4-way SMPs lead to space and cost savings
• Heterogeneous clusters are often a necessity
– Build cluster incrementally from available components