Transcript of session4 rpc-mpi [Compatibility Mode], 2013-04-15
Talk based on material by Google
Block II: Cluster/Grid/Cloud Programming & The Message Passing Interface (MPI)

Clusters: History, Architectures, Programming Concepts, Scheduling, Components, Middleware, Single System Image, Resource Management, Programming Environments & Tools, Applications, Message Passing, Load-Balancing, Distributed Shared Memory, Parallel I/O

Grids: History, Technologies, Programming Concepts, Grid Projects, Open Standards, Resource, Protocol, Network Enabled Service, API, SDK, Syntax, Hourglass Model, Grid Layers, The Globus Toolkit, Data Grid, Portals, Resource Managers, Scheduling, Security, Economy Patterns, Projects, proteomics.net

This session: History, Remote Procedure Calls (RPC), Message Passing Interface (MPI)
Rajkumar Buyya
Taxonomy, based on how processors, memory & interconnect are laid out and how resources are managed:
◦ Massively Parallel Processors (MPP)
◦ Symmetric Multiprocessors (SMP)
◦ Cache-Coherent Non-Uniform Memory Access (CC-NUMA)
◦ Clusters
◦ Distributed Systems – Grids/P2P
MPP:
◦ A large parallel processing system with a shared-nothing architecture
◦ Consists of several hundred nodes with a high-speed interconnection network/switch
◦ Each node consists of a main memory & one or more processors, and runs a separate copy of the OS

SMP:
◦ 2-64 processors today
◦ Shared-everything architecture
◦ All processors share all the global resources available
◦ A single copy of the OS runs on these systems
CC-NUMA:
◦ A scalable multiprocessor system having a cache-coherent nonuniform memory access architecture
◦ Every processor has a global view of all of the memory

Clusters:
◦ A collection of workstations/PCs that are interconnected by a high-speed network
◦ Work as an integrated collection of resources
◦ Have a single system image spanning all the nodes

Distributed systems:
◦ Conventional networks of independent computers
◦ Have multiple system images, as each node runs its own OS
◦ The individual machines could be combinations of MPPs, SMPs, clusters, & individual computers
Vector Computers (VC), proprietary systems:
◦ Provided the breakthrough needed for the emergence of computational science, but they were only a partial answer.

Massively Parallel Processors (MPP), proprietary systems:
◦ High cost and a low performance/price ratio.

Symmetric Multiprocessors (SMP):
◦ Suffer from scalability limits.

Distributed Systems:
◦ Difficult to use and hard to extract parallel performance from.

Clusters, gaining popularity:
◦ High Performance Computing: commodity supercomputing
◦ High Availability Computing: mission-critical applications
ACRI, Alliant, American Supercomputer, Ametek, Applied Dynamics, Astronautics, BBN, CDC, Convex, Cray Computer, Cray Research (SGI, then Tera), Culler-Harris, Culler Scientific, Cydrome, Dana/Ardent/Stellar, Elxsi, ETA Systems, Evans & Sutherland Computer Division, Floating Point Systems, Galaxy YH-1, Goodyear Aerospace MPP, Gould NPL, Guiltech, Intel Scientific Computers, Intl. Parallel Machines, KSR, MasPar, Meiko, Myrias, Thinking Machines, Saxpy, Scientific Computer Systems (SCS), Soviet Supercomputers, Suprenum

(Pictured: Convex C4600)

• Network of Workstations
The promise of supercomputing to the average PC user?

Performance of PC/workstation components has almost reached the performance of those used in supercomputers:
◦ Microprocessors (50% to 100% per year)
◦ Networks (Gigabit SANs)
◦ Operating systems (Linux, ...)
◦ Programming environments (MPI, ...)
◦ Applications (.edu, .com, .org, .net, .shop, .bank)

The rate of performance improvement of commodity systems is much more rapid than that of specialized systems.
◦ Linking together two or more computers to jointly solve computational problems
◦ Since the early 1990s, an increasing trend to move away from expensive and specialized proprietary parallel supercomputers towards clusters of workstations; it is hard to find money to buy expensive systems
◦ The rapid improvement in the availability of commodity high-performance components for workstations and networks: low-cost commodity supercomputing
◦ From specialized traditional supercomputing platforms to cheaper, general-purpose systems consisting of loosely coupled components built up from single- or multiprocessor PCs or workstations
(Timeline figure: 1960, 1980s, 1990, 1995+, 2000+, ending with PDAs and clusters.)
A cluster is a type of parallel or distributed processing system which consists of a collection of interconnected stand-alone computers cooperatively working together as a single, integrated computing resource.

A node:
◦ a single or multiprocessor system with memory, I/O facilities, & OS

A cluster:
◦ generally 2 or more computers (nodes) connected together
◦ in a single cabinet, or physically separated & connected via a LAN
◦ appears as a single system to users and applications
◦ provides a cost-effective way to gain features and benefits
(Cluster architecture diagram: sequential and parallel applications run on a parallel programming environment on top of the cluster middleware (Single System Image and availability infrastructure); the middleware spans multiple PC/workstation nodes, each with communications software and network interface hardware, joined by a cluster interconnection network/switch.)
Commodity parts? Communications packaging? Incremental scalability? Independent failure? Intelligent network interfaces? A complete system on every node:
◦ virtual memory
◦ scheduler
◦ files
◦ ...

Nodes can be used individually or jointly...
Parallel Processing
◦ Use multiple processors to build MPP/DSM-like systems for parallel computing

Network RAM
◦ Use the memory associated with each workstation as an aggregate DRAM cache

Software RAID
◦ Redundant array of inexpensive disks
◦ Use the arrays of workstation disks to provide cheap, highly available and scalable file storage
◦ Makes it possible to provide parallel I/O support to applications

Multipath Communication
◦ Use multiple networks for parallel data transfer between nodes

MPP: Massively Parallel Processing; DSM: Distributed Shared Memory
Cluster Design Issues
• Enhanced Performance (performance at low cost)
• Enhanced Availability (failure management)
• Single System Image (look-and-feel of one system)
• Size Scalability (physical & application)
• Fast Communication (networks & protocols)
• Load Balancing (CPU, net, memory, disk)
• Security and Encryption (clusters of clusters)
• Distributed Environment (social issues)
• Manageability (administration and control)
• Programmability (simple API if required)
• Applicability (cluster-aware and non-aware applications)
High Performance (dedicated). High Throughput (idle-cycle harvesting). High Availability (fail-over).

A Unified System: HP and HA within the same cluster.

(Diagram: a shared pool of computing resources (processors, memory, disks) joined by an interconnect. High throughput: guarantee at least one workstation to many individuals (when active). High performance: deliver a large percentage of the collective resources to a few individuals at any one time.)
• Best of both Worlds: (world is heading towards this configuration)
Work queues allow threads from one task to send processing work to another task in a decoupled fashion
(Diagram: producer threads P place work items on a shared queue; consumer threads C take them off. Producers and consumers never interact directly.)
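A minimal sketch of such a shared work queue in C using POSIX threads; the fixed capacity, the int work items, and the function names are illustrative assumptions, not from the slides:

/* Minimal producer/consumer work queue sketch (POSIX threads).
 * Capacity, item type, and names are illustrative assumptions. */
#include <pthread.h>

#define QCAP 16

typedef struct {
    int items[QCAP];            /* work items (here: plain ints) */
    int head, tail, count;      /* ring-buffer state */
    pthread_mutex_t lock;
    pthread_cond_t not_empty, not_full;
} work_queue;

void queue_init(work_queue *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

/* Producer side: blocks while the queue is full. */
void queue_push(work_queue *q, int item) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QCAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->items[q->tail] = item;
    q->tail = (q->tail + 1) % QCAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Consumer side: blocks while the queue is empty. */
int queue_pop(work_queue *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    int item = q->items[q->head];
    q->head = (q->head + 1) % QCAP;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return item;
}

Producer threads call queue_push() and consumer threads block in queue_pop() until work arrives; the queue decouples the two sides exactly as the diagram suggests.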
To make this work in a distributed setting, we would like this to simply “happen over the network”
(Diagram: the same producers P and consumers C, now on separate machines, sharing a queue over the network.)

◦ Where does the queue live? How do you access it? (a custom protocol? a generic memory-sharing protocol?)
◦ How do you guarantee that it doesn't become a bottleneck / a source of deadlock?

... Some well-defined solutions exist to support inter-machine programming, which we'll see next.
Regular client-server protocols involve sending data back and forth according to a shared state:

Client: GET /index.html HTTP/1.0
Server: 200 OK
        Length: 2400
        (file data)
Client: GET /hello.gif HTTP/1.0
Server: 200 OK
        Length: 81494
...
RPC servers will call arbitrary functions in a dll or exe, with arguments passed over the network, and return values sent back over the network:

Client: foo.dll, bar(4, 10, "hello")
Server: "returned_string"
Client: foo.dll, baz(42)
Server: err: no such function
...
RPC can be used with two basic interfaces: synchronous and asynchronous.
◦ Synchronous RPC is a "remote function call": the client blocks and waits for the return value.
◦ Asynchronous RPC is a "remote thread spawn".
Writing rpc_call(foo.dll, bar, arg0, arg1, ...) everywhere is poor form:
◦ Confusing code
◦ Breaks abstraction

A wrapper "stub" function makes the code cleaner (see the sketch below):

bar(arg0, arg1); // programmer writes this;
                 // it makes the RPC "under the hood"
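To make the idea concrete, here is a self-contained toy sketch in C. The "server" is just a local function standing in for the network hop, and all names (server_dispatch, the text wire format) are illustrative assumptions, not a real RPC library:

/* Toy RPC stub sketch: bar() looks like an ordinary function to the
 * caller, while internally it marshals its arguments into a request
 * and hands them to a "server" dispatcher. A real system would send
 * req over a socket instead of calling server_dispatch() directly. */
#include <stdio.h>
#include <string.h>

/* Stand-in for the remote side: looks up the function by name. */
static void server_dispatch(const char *req, char *reply, size_t n) {
    if (strncmp(req, "foo.dll,bar(", 12) == 0)
        snprintf(reply, n, "returned_string");
    else
        snprintf(reply, n, "err: no such function");
}

/* The stub the programmer calls, e.g. bar(4, 10, "hello", ...). */
static void bar(int a, int b, const char *s, char *reply, size_t n) {
    char req[128];
    /* Marshal arguments into an agreed-upon wire format. */
    snprintf(req, sizeof req, "foo.dll,bar(%d, %d, \"%s\")", a, b, s);
    server_dispatch(req, reply, n);   /* network send/recv in reality */
}

int main(void) {
    char reply[64];
    bar(4, 10, "hello", reply, sizeof reply);
    puts(reply);   /* prints: returned_string */
    return 0;
}

The caller never sees the marshalling or the transport, which is exactly the abstraction the stub is meant to preserve.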
◦ Who can call RPC functions? Anybody?
◦ How do you handle multiple versions of a function?
◦ Need to marshal objects.
◦ How do you handle error conditions?
◦ Numerous protocols: DCOM, CORBA, JRMI, ...
“Imagine a Beowulf cluster of these…”-- common Slashdot meme
Traditional cluster computing involves explicitly forming a cluster from computer nodes and dispatching jobs
Beowulf is a style of system that links Linux machines together
MPI (Message Passing Interface) describes an API for allowing programs to communicate with their parallel components
Makes a cluster of computers present a single-computer interface.

One computer is the "master":
◦ Starts tasks
◦ The user terminal / external network is connected to this machine

Several "worker" nodes form the backend; they are not usually accessed individually.

◦ Runs on commodity PCs
◦ Uses standard Ethernet networking (though faster networks can be used too)
◦ Open-source software
Beowulf is an architectural style:
◦ It is not itself an explicit library.

Client nodes are set up in a very dumb fashion:
◦ They use NFS to share the file system with the master.

The user starts programs on the master machine; scripts use rsh to invoke subprograms on the worker nodes.

If you need several totally isolated jobs done in parallel, the above is all you need. Most systems, however, require more inter-thread communication than Beowulf offers. Special libraries make this easier.

MPI is an API that allows programs running on multiple computers to interoperate. MPI itself is a standard; implementations of it exist for C and Fortran. It provides synchronization and communication operations to processes.
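As a concrete starting point, the classic minimal MPI program in C; it compiles with an MPI wrapper compiler (e.g. mpicc) and runs with, e.g., mpirun -np 4 ./hello:

/* Minimal MPI program: each process learns its rank and the size
 * of the "universe", then prints a line. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?             */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many of us?       */
    printf("hello from process %d of %d\n", rank, size);
    MPI_Finalize();                         /* shut down cleanly     */
    return 0;
}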
Messages are sequences of bytes moving between processes. The sender and receiver must agree on the type structure of the values in the message. "Marshalling": laying out the data so that there is no ambiguity, such as "four chars" vs. "one integer".
Mateti, Linux Clusters 46
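A small self-contained C sketch of why this agreement matters: the same four bytes on the wire can be read as one integer or as four chars, so sender and receiver must share the layout. The hex value chosen here is arbitrary:

/* Marshalling sketch: the same four bytes are "one integer" or
 * "four chars" only by prior agreement between sender and receiver. */
#include <arpa/inet.h>   /* htonl/ntohl for network byte order */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    unsigned char buf[4];                 /* the "message" on the wire */

    /* Sender: marshal one 32-bit integer in network byte order.
     * 0x68656C6C is arbitrary; its bytes happen to spell "hell". */
    uint32_t wire = htonl(0x68656C6CU);
    memcpy(buf, &wire, sizeof wire);

    /* Receiver, reading 1: one integer. */
    uint32_t v;
    memcpy(&v, buf, sizeof v);
    printf("as one integer: %u\n", (unsigned)ntohl(v));

    /* Receiver, reading 2: four chars. */
    printf("as four chars:  %.4s\n", (char *)buf);
    return 0;
}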
Process A sends a data buffer as a message to process B.
Process B waits for a message from A, and when it arrives copies it into its own local memory.
No memory is shared between A and B.
Mateti, Linux Clusters 47
Obviously:
◦ Messages cannot be received before they are sent.
◦ A receiver waits until there is a message.

Asynchronous:
◦ The sender never blocks, even if infinitely many messages are waiting to be received.
◦ Semi-asynchronous is a practical version of the above, with a large but finite amount of buffering.
Mateti, Linux Clusters 48
Q: send(m, P)
◦ Send message m to process P

P: recv(x, Q)
◦ Receive a message from process Q, and place it (the message data) in variable x
◦ The type of x must match that of m
◦ As if x := m
Mateti, Linux Clusters 49
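In MPI terms, this send/recv pair looks roughly as follows; a sketch for exactly two processes, with the tag and payload chosen arbitrarily:

/* Point-to-point sketch: rank 0 plays Q (sender), rank 1 plays P. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, m = 42, x = 0, tag = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* Q: send(m, P) */
        MPI_Send(&m, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* P: recv(x, Q) */
        MPI_Recv(&x, 1, MPI_INT, 0, tag, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("received x = %d\n", x);    /* as if x := m  */
    }
    MPI_Finalize();
    return 0;
}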
One sender Q, multiple receivers P. Not all receivers may receive at the same time.

Q: broadcast(m)
◦ Send message m to all processes in the group

P: recv(x, Q)
◦ Receive a message from process Q, and place it in variable x
Mateti, Linux Clusters 50
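In MPI this is a collective call: every process in the group, including the root, makes the same MPI_Bcast call. A sketch:

/* Broadcast sketch: rank 0 is the sender Q; all ranks call Bcast. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, m = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) m = 42;                 /* root sets the value */
    MPI_Bcast(&m, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d now has m = %d\n", rank, m);

    MPI_Finalize();
    return 0;
}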
Synchronous:
◦ Sender blocks until the receiver is ready to receive.
◦ Cannot send messages to self.
◦ No buffering.
Mateti, Linux Clusters 51
15.04.2013
18
Asynchronous:
◦ Sender never blocks.
◦ Receiver receives when ready.
◦ Can send messages to self.
◦ Infinite buffering.
Mateti, Linux Clusters 52
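MPI offers both styles. A sketch contrasting a synchronous send (MPI_Ssend, which completes only once the receiver has started to receive) with a non-blocking send (MPI_Isend followed later by MPI_Wait); run with exactly two processes:

/* Synchronous vs. non-blocking send sketch (two processes). */
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, a = 1, b = 2, x, y;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Synchronous: blocks until rank 1 has started receiving. */
        MPI_Ssend(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Non-blocking: returns immediately; overlap work here... */
        MPI_Isend(&b, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* ...then complete it */
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&y, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}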
Speed: not so good.
◦ Sender copies the message into system buffers.
◦ The message travels the network.
◦ Receiver copies the message from system buffers into local memory.
◦ Special virtual memory techniques help.

Programming quality:
◦ Less error-prone compared with shared memory.
Mateti, Linux Clusters 53
The user explicitly spawns child processes to do work.

The MPI library is aware of the size of the "universe": the number of available machines.

The MPI system will spawn processes on different machines (sketched below):
◦ They do not need to be the same executable.
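A sketch of this with MPI-2's MPI_Comm_spawn; the worker executable name ./worker and the count of 4 are assumptions:

/* Dynamic spawning sketch: the parent launches 4 copies of a worker
 * executable and gets back an intercommunicator for talking to them. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Comm workers;
    MPI_Init(&argc, &argv);

    MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 4,
                   MPI_INFO_NULL, 0, MPI_COMM_WORLD,
                   &workers, MPI_ERRCODES_IGNORE);

    /* ... communicate with the children through 'workers' ... */

    MPI_Comm_free(&workers);
    MPI_Finalize();
    return 0;
}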
MPI programs can define a "window" of a certain size as a shared memory region. Multiple processes attach to the window:
◦ Get() and Put() primitives copy data into and out of the shared memory asynchronously.
◦ The Fence() command blocks until all users of the window reach the fence, at which point their shared memories are consistent.
◦ The user is responsible for ensuring that stale data is not read from the shared memory buffer.

This supports the intuitive notion of "barriers" with Fence() (see the sketch below).

Mutual exclusion locks are also supported:
◦ The library ensures that multiple machines cannot hold the lock at the same time.
◦ Ensuring that failed nodes cannot deadlock an entire distributed process increases system complexity.
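A sketch of the window/fence style using MPI's one-sided operations (MPI_Win_create, MPI_Put, MPI_Win_fence); run with at least two processes, and the value written is arbitrary:

/* One-sided communication sketch: each rank exposes one int as a
 * window; rank 0 puts a value into rank 1's window, with fences
 * delimiting the access epoch. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, local = 0, value = 42;
    MPI_Win win;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Expose 'local' (one int) to one-sided access by all ranks. */
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                 /* open the epoch         */
    if (rank == 0)                         /* write into rank 1      */
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                 /* all windows consistent */

    if (rank == 1) printf("rank 1 sees %d\n", local);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}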
Basic communication unit in MPI is a message – a piece of data sent from one machine to another
MPI provides message-sending and receiving functions that allow processes to exchange messages in a thread-safe fashion over the network
Also includes multi-party messages...
1:n broadcast: one process sends a message to all processes in a group.

n:1 reduce: all processes in a group send data to a designated process, which merges the data.

n:n messaging is also supported.

• One process in a group can send a message which all group members receive (e.g., a global "stop processing" signal).
• Processes in a group can all report data together (asynchronously), which is gathered into a single message delivered to one process (e.g., reporting the results of a distributed computation).
• A combination of the above paradigms: individual processes contribute components to a global message which reaches all group members (see the sketch below).
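A sketch of the n:1 and n:n collectives in C: MPI_Reduce merges each rank's contribution at a designated root, and MPI_Allgather assembles a global message that every rank receives. The contribution values are arbitrary:

/* Collective operations sketch: reduce (n:1) and allgather (n:n). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, size, mine, sum = 0, *all;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    mine = rank * rank;    /* each rank's contribution (arbitrary) */

    /* n:1 reduce: root 0 receives the merged (summed) data.       */
    MPI_Reduce(&mine, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of contributions: %d\n", sum);

    /* n:n: every rank contributes one int and receives all of them. */
    all = malloc(size * sizeof(int));
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
    printf("rank %d sees first element %d\n", rank, all[0]);

    free(all);
    MPI_Finalize();
    return 0;
}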
Programmers have very explicit control over data manipulation, which allows high-performance applications. The trade-off is a steep learning curve. Systems such as MapReduce have a considerably lower learning curve (but cannot handle as complex system interactions).
Generic RPC and shared-memory libraries allow flexible definition of software systems
Require programmers to think hard about how the network is involved in the process
Systems such as MapReduce (next lecture) automate much of the lower-level inter-machine communication, in exchange for some inflexibility of design