Parallel Computing Platforms
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar
Sahalu Junaidu, ICS 573: High Performance Computing
Parallel Computing Platforms
• Motivation: High Performance Computing
• Dichotomy of Parallel Computing Platforms
• Communication Model of Parallel Platforms
• Physical Organization of Parallel Platforms
• Communication Costs in Parallel Machines
High Performance Computing
• The computing power required to effectively solve computationally intensive and/or data-intensive problems in science, engineering, and other emerging disciplines
• Provided using
  – Parallel computers, and
  – Parallel programming techniques
• Is this compute power not available otherwise?
Elements of a Parallel Computer
• Hardware
  – Multiple processors
  – Multiple memories
  – Interconnection network
• System Software
  – Parallel operating system
  – Programming constructs to express/orchestrate concurrency
• Application Software
  – Parallel algorithms
• Goal: utilize the hardware, system, and application software to either
  – Achieve speedup: ideally Tp = Ts/p (e.g., a job taking Ts = 100 s serially would take 100/4 = 25 s on p = 4 processors), or
  – Solve problems requiring a large amount of memory.
Dichotomy of Parallel Computing Platforms
• Logical Organization
  – The user’s view of the machine, as presented via its system software
• Physical Organization
  – The actual hardware architecture
• The physical architecture is, to a large extent, independent of the logical architecture
Logical Organization
• An explicitly parallel program must specify both concurrency and the interaction between concurrent tasks
• That is, logically, parallel computing has two critical components:
  1. Control structure: how to express parallel tasks
  2. Communication model: the mechanism for specifying interaction
• Parallelism can be expressed at various levels of granularity, from the instruction level to processes.
Control Structure of Parallel Platforms
• Processing units in parallel computers either
  – operate under the centralized control of a single control unit, or
  – work independently.
• If a single control unit dispatches the same instruction to various processors (that work on different data), the model is referred to as single instruction stream, multiple data stream (SIMD).
• If each processor has its own control unit, it can execute different instructions on different data items. This model is called multiple instruction stream, multiple data stream (MIMD).
SIMD and MIMD Processors
A typical SIMD architecture (a) and a typical MIMD architecture (b).
SIMD Processors
Diagram: a single instruction stream drives Processors A, B, and C; each processor reads its own data input stream (A, B, C) and writes its own data output stream (A, B, C).
MIMD Processors
Diagram: Processors A, B, and C each have their own instruction stream (A, B, C), and each reads its own data input stream and writes its own data output stream.
SIMD & MIMD Processors (cont’d)
• SIMD relies on the regular structure of computations (such as those in image processing). SIMD computers:
  – Require less hardware than MIMD computers (single control unit)
  – Require less memory
  – Are specialized: not suited to all applications.
• In contrast to SIMD processors, MIMD processors can execute different programs on different processors.
  – Single program, multiple data (SPMD) executes the same program on different processors.
  – SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support, as the sketch below illustrates.
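
A minimal SPMD sketch in C (a sketch only, assuming an installed MPI implementation; compile with mpicc and launch with mpirun): every process runs the same program and branches on its rank.

```c
/* SPMD: one program, many processes; behavior branches on the rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */

    if (rank == 0)
        printf("Coordinator: %d processes launched\n", size);
    else
        printf("Worker %d: running the same program\n", rank);

    MPI_Finalize();
    return 0;
}
```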
Logical Organization: Communication Model
• There are two primary forms of data exchange between parallel tasks:
  – Accessing a shared data space, and
  – Exchanging messages.
• Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.
• Platforms that support messaging are called message-passing platforms or multicomputers.
Shared-Address-Space Platforms
• Part (or all) of the memory is accessible to all processors.
• Processors interact by modifying data objects stored in this shared-address-space.
• If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.
NUMA and UMA Shared-Address-Space Platforms
Typical shared-address-space architectures: (a) uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.
NUMA and UMA Shared-Address-Space Platforms
• The distinction between NUMA and UMA platforms is important from the point of view of algorithm design.
  – NUMA machines require locality from the underlying algorithms for performance.
• Programming these platforms is easier, since reads and writes are implicitly visible to other processors.
• However, reads and writes to shared data must be coordinated.
• Caches in such machines require coordinated access to multiple copies.
  – This leads to the cache coherence problem.
Shared-Address-Space vs. Shared-Memory Machines
• Shared address space is a programming abstraction
• Shared memory is a physical machine attribute
• It is possible to provide a shared address space on top of physically distributed memory
  – Distributed shared memory machines
• Shared-address-space machines are commonly programmed using Pthreads and OpenMP; a minimal sketch follows.
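
A minimal OpenMP sketch of shared-address-space programming (assuming a compiler with OpenMP support, e.g. gcc -fopenmp): all threads read and write the same shared array, and the reduction clause coordinates the shared sum.

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];  /* shared by all threads */
    double sum = 0.0;

    /* The runtime divides the loop iterations among the threads. */
    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < N; i++) {
        a[i] = 2.0 * i;  /* write to the shared address space       */
        sum += a[i];     /* reduction coordinates the shared update */
    }
    printf("sum = %f\n", sum);
    return 0;
}
```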
Message-Passing Platforms
• These platforms comprise a set of processors, each with its own (exclusive) memory
• Instances of this view arise naturally from clustered workstations and non-shared-address-space multicomputers.
• These platforms are programmed using (variants of) send and receive primitives.
• Libraries such as MPI and PVM provide such primitives, as in the sketch below.
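
A minimal send/receive sketch using MPI’s primitives (assuming the job is launched with at least two processes): process 0 sends one integer to process 1.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* send 1 int to rank 1 with message tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive 1 int from rank 0 with tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```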
Physical Organization: Interconnection Networks (ICNs)
• Provide processor-to-processor and processor-to-memory connections
• Networks are classified as:
  – Static
  – Dynamic
• Static networks
  – Consist of a number of point-to-point links (direct networks)
  – Historically used to link processors to processors (distributed-memory systems)
• Dynamic networks
  – Consist of switching elements to which the various processors attach (indirect networks)
  – Historically used to link processors to memory (shared-memory systems)
Static and Dynamic Interconnection Networks
Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
Network Topologies
Interconnection Networks
• Static
  – 1-D
  – 2-D
  – Hypercube (HC)
• Dynamic
  – Bus-based: single or multiple buses
  – Switch-based: single-stage (SS), multistage (MS), or crossbar
Network Topologies: Static ICNs
• Static (fixed) interconnection networks are characterized by having fixed paths, unidirectional or bi-directional, between processors.
• Completely connected networks (CCNs): number of links O(N²), delay complexity O(1).
• Limited connection networks (LCNs):
  – Linear arrays
  – Ring (loop) networks
  – Two-dimensional arrays
  – Tree networks
  – Cube networks
Network Topologies: Dynamic ICNs
• A variety of network topologies have been proposed and implemented:
  – Bus-based
  – Crossbar
  – Multistage
  – etc.
• These topologies trade off performance for cost.
• Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.
Network Topologies: Buses
• Shared medium
• Ideal for broadcasting information
  – The distance between any two nodes is constant
• The bandwidth of the shared bus is a major bottleneck
• Local memories can improve performance
• Scalable in terms of cost, unscalable in terms of performance
Network Topologies: Crossbars
A completely non-blocking crossbar network connecting p processors to b memory banks.
• Uses a p×m grid of switches to connect p inputs to m outputs in a non-blocking manner.
• The cost of a crossbar of p processors grows as $\Theta(p^2)$.
• Scalable in terms of performance, unscalable in terms of cost.
Network Topologies: Multistage Networks
The schematic of a typical multistage interconnection network.
• Multistage networks strike a compromise between the cost and performance scalability of bus and crossbar networks.
Network Topologies: Multistage Omega Network
• One of the most commonly used multistage interconnects is the Omega network.
• This network consists of log p stages, where p is the number of inputs/outputs.
• At each stage, input i is connected to output j by the perfect-shuffle rule:
  – $j = 2i$, for $0 \le i \le p/2 - 1$
  – $j = 2i + 1 - p$, for $p/2 \le i \le p - 1$
Network Topologies: Multistage Omega Network
Each stage of the Omega network implements a perfect shuffle as follows:
A perfect shuffle interconnection for eight inputs and outputs.
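
A small C sketch of this mapping (a sketch of the rule above, not any library’s API): for p = 8 it reproduces the eight-input perfect shuffle of the figure, i.e. a one-bit left rotation of each input’s log p-bit label.

```c
#include <stdio.h>

/* Perfect-shuffle mapping between Omega network stages (p a power of 2):
 *   j = 2i          for 0   <= i <= p/2 - 1
 *   j = 2i + 1 - p  for p/2 <= i <= p - 1  */
unsigned shuffle(unsigned i, unsigned p) {
    return (i < p / 2) ? 2 * i : 2 * i + 1 - p;
}

int main(void) {
    unsigned p = 8;  /* eight inputs/outputs, as in the figure */
    for (unsigned i = 0; i < p; i++)
        printf("input %u -> output %u\n", i, shuffle(i, p));
    return 0;
}
```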
Network Topologies: Completely Connected and Star Networks
(a) A completely-connected network of eight nodes; (b) a star-connected network of nine nodes.
• A completely connected network is the static counterpart of a crossbar network
  – Performance scales very well, but the hardware complexity is not realizable for large values of p.
• A star network is the static counterpart of a bus network
  – The central processor is the bottleneck.
Network Topologies: Linear Arrays, Meshes, and k-d Meshes
• In a linear array, each node has two neighbors, one to its left and one to its right.
• If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.
• A generalization to 2 dimensions has nodes with 4 neighbors, to the north, south, east, and west.
• A further generalization to d dimensions has nodes with 2d neighbors.
• A special case of a d-dimensional mesh is a hypercube. Here, d = log p, where p is the total number of nodes.
Network Topologies: Linear Arrays and Meshes
Two- and three-dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound links (2-D torus); and (c) 3-D mesh with no wraparound.
Network Topologies: Hypercubes and their Construction
Construction of hypercubes from hypercubes of lower dimension.
Network Topologies: Properties of Hypercubes
• The distance between any two nodes is at most log p.
• Each node has log p neighbors.
• The distance between two nodes is given by the number of bit positions at which the two nodes differ.
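
A short sketch of the last property: label each node with a log p-bit id; the distance between two nodes is then the Hamming distance of their ids, i.e. the number of set bits in their XOR.

```c
#include <stdio.h>

/* Hypercube distance = number of bit positions in which the two
 * node labels differ (Hamming distance). */
int hypercube_distance(unsigned a, unsigned b) {
    unsigned x = a ^ b;  /* set bits mark the differing dimensions */
    int d = 0;
    while (x) {
        d += x & 1u;     /* count the set bits */
        x >>= 1;
    }
    return d;
}

int main(void) {
    /* In a 3-D hypercube (p = 8), nodes 010 and 111 differ in 2 bits. */
    printf("%d\n", hypercube_distance(2, 7));  /* prints 2 */
    return 0;
}
```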
Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
Evaluation Metrics for ICNs
• The following metrics are used to characterize the cost and performance of static ICNs:
• Diameter
  – The maximum distance between any two nodes
  – The smaller the better
• Connectivity
  – The minimum number of arcs that must be removed to break the network into two disconnected networks
  – The larger the better
• Bisection width
  – The minimum number of arcs that must be removed to partition the network into two equal halves
  – The larger the better
• Cost
  – The number of links in the network
  – The smaller the better
Evaluating Static Interconnection Networks
| Network | Diameter | Bisection Width | Arc Connectivity | Cost (No. of links) |
| --- | --- | --- | --- | --- |
| Completely-connected | 1 | p²/4 | p − 1 | p(p − 1)/2 |
| Star | 2 | 1 | 1 | p − 1 |
| Complete binary tree | 2 log((p + 1)/2) | 1 | 1 | p − 1 |
| Linear array | p − 1 | 1 | 1 | p − 1 |
| 2-D mesh, no wraparound | 2(√p − 1) | √p | 2 | 2(p − √p) |
| 2-D wraparound mesh | 2⌊√p/2⌋ | 2√p | 4 | 2p |
| Hypercube | log p | p/2 | log p | (p log p)/2 |
| Wraparound k-ary d-cube | d⌊k/2⌋ | 2k^(d−1) | 2d | dp |
Evaluating Dynamic Interconnection Networks
| Network | Diameter | Bisection Width | Arc Connectivity | Cost (No. of links) |
| --- | --- | --- | --- | --- |
| Crossbar | 1 | p | 1 | p² |
| Omega Network | log p | p/2 | 2 | p/2 |
| Dynamic Tree | 2 log p | 1 | 2 | p − 1 |
Communication Costs in Parallel Machines
• Along with idling and contention, communication is a major overhead in parallel programs.
• Communication cost depends on many features, including:
  – Network topology
  – Data handling
  – Routing, etc.
Message Passing Costs in Parallel Computers
• The communication cost of a data-transfer operation depends on:
  – Start-up time (ts)
    • Time to add headers/trailers, perform error correction, execute the routing algorithm, and establish the connection between source and destination
  – Per-hop time (th)
    • Time for a message to travel between two directly connected nodes (node latency)
  – Per-word transfer time (tw)
    • The reciprocal of the channel bandwidth: tw = 1/r for a bandwidth of r words per second
Store-and-Forward Routing
• A message traversing multiple hops is completely received at an intermediate hop before being forwarded to the next hop.
• The total communication cost for a message of size m words to traverse l communication links is
  $t_{comm} = t_s + (m t_w + t_h)\, l$
• In most platforms, th is small, and the above expression can be approximated by
  $t_{comm} = t_s + m\, t_w\, l$
Routing Techniques
Passing a message from node P0 to P3: (a) through a store-and-forward communication network; (b) and (c) extending the concept to cut-through routing. The shaded regions represent the time that the message is in transit. The start-up time associated with this message transfer is assumed to be zero.
Packet Routing
• Store-and-forward makes poor use of communication resources.
• Packet routing breaks messages into packets and pipelines them through the network.
• Since packets may take different paths, each packet must carry routing information, error checking, sequencing, and other related header information.
• The total communication time for packet routing is approximated by
  $t_{comm} = t_s + t_h\, l + t_w\, m$
• Here the factor tw also accounts for the overheads of packet headers.
Cut-Through Routing
• Takes the concept of packet routing to an extreme by further dividing messages into basic units called flits.
• Since flits are typically small, the header information must be minimized.
• This is done by forcing all flits to take the same path, in sequence.
• A tracer message first programs all intermediate routers. All flits then take the same route.
• Error checks are performed on the entire message, as opposed to flits.
• No sequence numbers are needed.
Cut-Through Routing
• The total communication time for cut-through routing is approximated by
  $t_{comm} = t_s + t_h\, l + t_w\, m$
• This is identical in form to the packet-routing expression; however, tw is typically much smaller.
Simplified Cost Model for Communicating Messages
• The cost of communicating a message of size m words between two nodes l hops apart using cut-through routing is given by
  $t_{comm} = t_s + t_h\, l + t_w\, m$
• In this expression, th is typically much smaller than ts and tw, so the second term contributes little, particularly when m is large.
• Furthermore, it is often not possible to control the routing and placement of tasks.
• For these reasons, we can approximate the cost of message transfer by
  $t_{comm} = t_s + t_w\, m$
  (see the sketch below).
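
A small sketch comparing the three cost expressions above; the parameter values are illustrative assumptions, not measurements of any real machine.

```c
#include <stdio.h>

int main(void) {
    double ts = 100.0;  /* start-up time (assumed value)          */
    double th = 1.0;    /* per-hop time (assumed value)           */
    double tw = 0.25;   /* per-word transfer time (assumed value) */
    double m  = 1000;   /* message size in words                  */
    double l  = 4;      /* number of hops                         */

    printf("store-and-forward: %.1f\n", ts + (m * tw + th) * l);
    printf("cut-through:       %.1f\n", ts + th * l + tw * m);
    printf("simplified:        %.1f\n", ts + tw * m);
    return 0;
}
```

With these numbers the cut-through and simplified estimates differ only by the small l·th term, which is exactly what the approximation discards.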
Notes on the Simplified Cost Model
• The given cost model allows the design of algorithms in an architecture-independent manner
• However, the following assumptions are made:
  – Communication between any pair of nodes takes equal time
  – The underlying network is uncongested
  – The underlying network is completely connected
  – Cut-through routing is used