Chapter 7 (excl. 7.9): Scalable Multiprocessors

Transcript of lecture slides, EECS 570, Fall 2003 (rev3)
Outline

Scalability
– bus doesn't hack it: need scalable interconnect (network)

Realizing Programming Models on Scalable Systems
– network transactions
– protocols
  • shared address space
  • message passing (synchronous, asynchronous, etc.)
  • active messages
– safety: buffer overflow, fetch deadlock

Communication Architecture Design Space
– where does the network attach to the processor node?
– how much hardware interpretation of the network transaction?
– impact on cost & performance
Scalability
How do bandwidth, latency, cost, and packaging scale with P?

• Ideal:
– latency, per-processor bandwidth, per-processor cost are constants
– packaging does not create an upper bound, does not exacerbate the others

• Bus:
– per-processor BW scales as 1/P
– latency increases with P:
  • queuing delays for fixed bus length (linear at saturation?)
  • as bus length is increased to accommodate more CPUs, the clock must slow

• Reality:
– "scalable" may just mean sub-linear dependence (e.g., logarithmic)
– practical limits ($/customer), sweet spot + growth path
– switched interconnect (network)
Aside on Cost-Effective Computing
Traditional view: efficiency = speedup(P)/P
Efficiency < 1 → parallelism not worthwhile?
But much of a computer's cost is NOT in the processor (memory, packaging, interconnect, etc.)
• [Wood & Hill, IEEE Computer 2/95]

Let Costup(P) = Cost(P)/Cost(1)
Parallel computing is cost-effective when:
Speedup(P) > Costup(P)
E.g., for an SGI PowerChallenge w/ 500 MB:
Costup(32) = 8.6
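To make the cost-effectiveness test concrete, here is a hedged worked example; the Speedup(32) = 12 figure is invented for illustration, and only Costup(32) = 8.6 comes from the slide.

\[
\text{Efficiency}(32) = \frac{\text{Speedup}(32)}{32} = \frac{12}{32} \approx 0.38 < 1
\qquad\text{but}\qquad
\text{Speedup}(32) = 12 > 8.6 = \text{Costup}(32)
\]

By the traditional efficiency test the machine looks wasteful, yet it delivers more speedup than it costs: the parallel run does more work per dollar than the uniprocessor, so it is cost-effective.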
Network Transaction Primitive
One-way transfer of information from source to destination; causes some action at the destination:
• process info and/or deposit in buffer
• state change (e.g., set flag, interrupt program)
• maybe initiate reply (a separate network transaction)
[Figure: a serialized message leaves the source node's output buffer, crosses the communication network, and lands in the destination node's input buffer]
Network Transaction Correctness Issues
• protection
– user/user, user/system; what if VM doesn't apply?
– fault containment (large machine component MTBF may be low)
• format
– variable length? header info?
– affects efficiency, ability to handle in HW
• buffering/flow control
– finite buffering in the network itself
– messages show up unannounced, e.g., many-to-one pattern
• deadlock avoidance
– if you're not careful -- details later
• action
– what happens on delivery? how many options are provided?
• system guarantees
– delivery, ordering
Performance Issues
Key parameters: latency, overhead, bandwidth

LogP model: latency / overhead / gap (BW) as a function of P (see the note after the figure)
[Figure: sending CPU and NI, the network, and receiving NI and CPU]
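For reference, here is how the LogP parameters are usually combined (standard LogP accounting from Culler et al., not taken from this slide): a small one-way message pays the sender's overhead, the network latency, and the receiver's overhead, while the gap g bounds the per-processor message rate.

\[
T_{\text{one-way}} \approx o_{\text{send}} + L + o_{\text{recv}},
\qquad
\text{messages per unit time per processor} \le \frac{1}{\max(g, o)}
\]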
Programming Models
What is the user's view of a network transaction?
– depends on system architecture
– remember the layered approach: the OS/compiler/library may implement an alternative model on top of the HW-provided model

We'll look at three basic ones:
– Active Messages: "assembly language" for msg-passing systems
– Message Passing: MPI-style interface, as seen by application programmers
– Shared Address Space: ignoring cache coherence for now
Active Messages
User-level analog of the network transaction (see the sketch below)
– invoke a handler function at the receiver to extract the packet from the network
– grew out of attempts to do dataflow on msg-passing machines & remote procedure calls
– handler may send a reply, but no other messages
– event notification: interrupts, polling, events?
– may also perform memory-to-memory transfer

Flexible (can do almost any action on msg reception), but requires tight cooperation between CPU and network for high performance
– may be better to have HW do a few things faster
[Figure: a request carries the address of a handler that runs on the receiving processor; the handler may issue a reply, which in turn runs a handler back at the requester]
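A minimal C sketch of the active-message discipline. The packet layout, handler table, and the net_send/net_poll/my_node primitives are all hypothetical stand-ins for NI hardware, not a real API:

#include <stdint.h>
#include <string.h>

/* Hypothetical packet: first word selects the handler, rest is payload. */
typedef struct {
    uint32_t handler_id;
    uint32_t src_node;              /* sender, so the handler can reply */
    uint8_t  payload[48];
} am_packet_t;

typedef void (*am_handler_t)(const am_packet_t *pkt);
static am_handler_t handler_table[256];     /* registered at startup */

extern int  my_node;                               /* set by the runtime */
extern void net_send(int dest, const am_packet_t *pkt);
extern int  net_poll(am_packet_t *pkt);     /* nonzero if packet arrived */

/* Example: remote increment. The request handler performs the action
 * and may send a reply -- but no other messages. */
enum { H_INCR, H_INCR_REPLY };
static long counter;
static volatile int done;

static void incr_reply_handler(const am_packet_t *pkt) { (void)pkt; done = 1; }

static void incr_handler(const am_packet_t *pkt) {
    long amount;
    memcpy(&amount, pkt->payload, sizeof amount);
    counter += amount;                      /* the action at the destination */
    am_packet_t reply = { H_INCR_REPLY, (uint32_t)my_node, {0} };
    net_send((int)pkt->src_node, &reply);
}

/* Polling-mode dispatch: extract packets from the network and run their
 * handlers. At startup each node registers:
 *   handler_table[H_INCR] = incr_handler;
 *   handler_table[H_INCR_REPLY] = incr_reply_handler;               */
static void poll_network(void) {
    am_packet_t pkt;
    while (net_poll(&pkt))
        handler_table[pkt.handler_id](&pkt);
}

The key property shown: the handler runs at user level and drains the packet from the network immediately, rather than leaving it buffered.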
Message Passing
Basic idea:
– Send(dest, tag, buffer) -- tag is an arbitrary integer
– Recv(src, tag, buffer) -- src/tag may be a wildcard ("any")

Completion semantics:
– a receive completes after the data transfer from the matching send is complete
– a synchronous send completes after the matching receive is posted and the data sent
– an asynchronous send completes once the send buffer may be reused
  • msg may simply be copied into an alternate buffer, on the src or dest node

Blocking vs. non-blocking:
– does the function wait for "completion" before returning?
– non-blocking: extra function calls to check for completion
– assume blocking for now (see the example below)
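As a concrete instance of this interface, a small program using standard MPI-1 blocking calls and a wildcard receive:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buf = 42;
        /* Blocking send: returns once buf may be reused (MPI may have
         * buffered the data; the matching receive need not have run). */
        MPI_Send(&buf, 1, MPI_INT, /*dest=*/1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status st;
        /* Blocking receive with wildcard src and tag: completes only
         * after the data from the matching send has arrived. */
        MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        printf("got %d from rank %d, tag %d\n", buf, st.MPI_SOURCE, st.MPI_TAG);
    }
    MPI_Finalize();
    return 0;
}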
Synchronous Message Passing
Three-phase operation: ready-to-send, ready-to-receive, transfer
– can skip the 1st phase if the receiver initiates & specifies the source

Overhead and latency tend to be high
Transfer can achieve high bandwidth with sufficient msg length
Programmer must avoid deadlock, e.g., in a pairwise exchange (see the sketch below)
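A sketch of the pairwise-exchange hazard using MPI's synchronous send (MPI_Ssend, which completes only after the matching receive has started); the naive ordering deadlocks, and staggering by rank is one standard fix among several:

#include <mpi.h>

/* Pairwise exchange between ranks 0 and 1. With synchronous sends,
 * the naive version deadlocks: each rank blocks in MPI_Ssend waiting
 * for the other to post its receive. */
void exchange(int rank, int *mine, int *theirs) {
    int peer = 1 - rank;
#if 0   /* DEADLOCKS: both ranks send first */
    MPI_Ssend(mine,   1, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv (theirs, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#else   /* Fix: stagger by rank so one side receives first */
    if (rank == 0) {
        MPI_Ssend(mine,   1, MPI_INT, peer, 0, MPI_COMM_WORLD);
        MPI_Recv (theirs, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv (theirs, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Ssend(mine,   1, MPI_INT, peer, 0, MPI_COMM_WORLD);
    }
#endif
}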
Asynch. Msg Passing: Conservative
Same as synchronous, except the msg can be buffered on the sender
– allows computation to continue sooner
– deadlock still an issue (though less so) -- buffering is finite
Asynch. Message Passing: Optimistic
Sender just ships data and hopes the receiver can handle it
Benefit: lower latency
Problems (see the sketch below):
– receive was posted: need fast lookup while data streams in
– receive not posted: buffer? nack? discard?
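A toy C sketch of receiver-side matching for the optimistic protocol; the list structure and function names are invented for illustration (real NIs use hardware or firmware tables):

#include <stddef.h>

/* Toy receiver-side matching for optimistic delivery. A posted receive
 * is looked up as the header arrives; if none matches, the message must
 * be stashed in a system buffer (or nacked/dropped). */
#define ANY (-1)

typedef struct posted {
    int src, tag;             /* ANY acts as a wildcard */
    void *buf;
    struct posted *next;
} posted_t;

static posted_t *posted_list;     /* receives posted by the application */

static int match(const posted_t *p, int src, int tag) {
    return (p->src == ANY || p->src == src) &&
           (p->tag == ANY || p->tag == tag);
}

/* Called as a message header arrives. Returns the user buffer to stream
 * the payload into, or NULL if the receive was not posted. */
void *lookup_posted(int src, int tag) {
    for (posted_t **pp = &posted_list; *pp; pp = &(*pp)->next) {
        if (match(*pp, src, tag)) {
            void *buf = (*pp)->buf;
            *pp = (*pp)->next;    /* unlink the matched receive */
            return buf;
        }
    }
    return NULL;   /* not posted: must buffer, nack, or discard */
}

The lookup must keep pace with the incoming data stream, which is why "receive was posted" demands a fast path.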
Key Features of Msg Passing Abstraction
Source knows the send data address, destination knows the receive data address
– after the handshake, both know both

Arbitrary storage needed for the asynchronous protocol
– may post many sends before any receives

Fundamentally a 3-phase transaction
– can use optimistic 1-phase in limited cases

Latency and overhead tend to be higher than SAS; high BW is easier

Hardware support?
– DMA: physical or virtual (better/harder)
Shared Address Space Abstraction
Two-way request/response protocol
– reads require a data response
– writes have an acknowledgment (for consistency)

Issues:
– virtual or physical address on the net? (where does translation happen?)
– coherence, consistency, etc. (later)
Key Properties of SAS Abstraction
Data addresses are specified by the source of the request
– no dynamic buffer allocation
– protection achieved through virtual memory translation

Low-overhead initiation: one instruction, a load or store (see the sketch below)
High bandwidth is more challenging
– may require prefetching, or a separate "block transfer engine"

Synchronization less straightforward (no explicit event notification)

Simple request-response pairs
– few fixed message types
– practical to implement in hardware w/o remote CPU involvement

Input buffering / flow control issue
– what if the request rate exceeds local memory bandwidth?
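A sketch of what "one-instruction initiation" means. Assume the runtime has mapped array a across node memories (the mapping itself is hypothetical); remote access is then an ordinary load or store:

#include <stddef.h>

/* Under SAS, remote data is named by an ordinary address. If a[] spans
 * node memories, a remote element is read with a plain load; the
 * communication assist turns the miss into a network request/response.
 * No send/recv calls, no receiver-side code. */
extern double *a;           /* globally addressable array (runtime-mapped) */

double read_remote(size_t i) {
    return a[i];            /* one load initiates the request;        */
}                           /* the CPU stalls until the data reply    */

void write_remote(size_t i, double v) {
    a[i] = v;               /* one store; the write acknowledgment    */
}                           /* returns separately, for consistency    */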
Challenge 1: Input Buffer Overflow
Options (a credit-based sketch follows this list):
• refuse input when full
– creates "back pressure" (in a reliable network)
– to avoid deadlock:
  • low-level ack/nack
    – assumes a dedicated network path for ack/nack (common in rings)
    – retry on nack
  • drop packets
    – retry on timeout
• avoid overflow by reserving space per source ("credit-based")
– when is the space available for reuse?
– scalability?
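A minimal sketch of sender-side credit accounting; the counters and function names are hypothetical (real NIs keep these counters in hardware):

#include <stdbool.h>

#define NODES   64
#define CREDITS  8    /* receiver reserves CREDITS input buffers per source */

static int credits[NODES];           /* sender-side counter per destination */

extern void net_send(int dest, const void *msg);   /* assumed NI primitive */

void credit_init(void) {
    for (int d = 0; d < NODES; d++) credits[d] = CREDITS;
}

/* A message may enter the network only if the destination is known to
 * have a reserved buffer for us; otherwise the sender must wait. */
bool try_send(int dest, const void *msg) {
    if (credits[dest] == 0)
        return false;                /* no space at dest: stall, don't drop */
    credits[dest]--;
    net_send(dest, msg);
    return true;
}

/* The receiver returns a credit once it has drained the buffer; the
 * credit travels back as a tiny control message. */
void on_credit_return(int dest) {
    credits[dest]++;
}

The scalability question on the slide is visible here: each node must set aside CREDITS buffers for every possible sender, i.e., O(P) space per node.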
Challenge 2: Fetch Deadlock
Processing a message may require sending a reply; what if the reply can't be sent due to input buffer overflow? (a sketch follows this list)
– step 1: guarantee that replies can be sunk at the destination
  • requester reserves buffer space for the reply
– step 2: guarantee that replies can be sent into the network
  • backpressure: logically independent request/reply networks
    – physical networks or virtual channels
  • credit-based: bound outgoing requests to K per node
    – buffer space for K(P-1) requests + K responses at each node
  • low-level ack/nack, packet dropping
    – guarantee that replies will never be nacked or dropped

For cache coherence protocols, some requests may require more: forward the request to another node, send multiple invalidations
– must extend these techniques or nack such requests up front
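A sketch of the requester side of the credit-based variant; the structures and net_send_request are invented to show the two guarantees (K-bound on requests, pre-reserved reply buffers):

#include <stdbool.h>

#define K 4                        /* max outstanding requests per node */

/* A request may be issued only if (a) we are under the K-request bound
 * and (b) a local reply buffer is reserved up front, so the eventual
 * reply can always be sunk without blocking the network. */
typedef struct { char data[64]; bool in_use; } reply_buf_t;

static reply_buf_t reply_slots[K];
static int outstanding;

extern void net_send_request(int dest, const void *req, reply_buf_t *slot);

bool issue_request(int dest, const void *req) {
    if (outstanding == K)
        return false;                        /* bound reached: wait */
    for (int i = 0; i < K; i++) {
        if (!reply_slots[i].in_use) {
            reply_slots[i].in_use = true;    /* reserve reply landing pad */
            outstanding++;
            net_send_request(dest, req, &reply_slots[i]);
            return true;
        }
    }
    return false;
}

/* Called when the reply arrives into its pre-reserved slot. */
void on_reply(reply_buf_t *slot) {
    /* ... consume slot->data ... */
    slot->in_use = false;
    outstanding--;
}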
Outline
Scalability [7.1 - read]

Realizing Programming Models
– network transactions
– protocols: SAS, MP, Active Messages
– safety: buffer overflow, fetch deadlock

Communication Architecture Design Space
– where does hardware fit into the node architecture?
– how much hardware interpretation of the network transaction?
– how much gap between hardware and user semantics?
  • remainder must be done in software
  • increased flexibility, increased latency & overhead
– main CPU or dedicated/specialized processor?
Massively Parallel Processor (MPP) Architectures
[Figure: node architecture -- processor + cache and main memory on the memory bus; I/O bridge to the I/O bus with disk controller, disks, and a network interface to the network]
• Network interface typically close to the processor
– memory bus:
  • locked to a specific processor architecture/bus protocol
– registers/cache:
  • only in research machines
• Time-to-market is long
– use a processor already available, or work closely with processor designers
• Maximizes performance (and cost)
Network of Workstations
Network interface on the I/O bus
Standards (e.g., PCI) => longer life, faster to market
Slow (microseconds) to access the network interface
"System Area Network" (SAN): between LAN & MPP

[Figure: workstation node -- processor + cache and main memory connected by a core chip set; graphics controller, disk controller, and network interface on the I/O bus; the NI signals the processor via interrupts]
Transaction Interpretation
Simple: HW doesn't interpret much, if anything
– DMA from/to buffer, interrupt or set flag on completion
– nCUBE, conventional LAN
– requires OS for address translation, often a user/kernel copy

User-level messaging: get the OS out of the way
– HW does protection checks to allow direct user access to the network
– may have minimal interpretation otherwise
– may be on the I/O bus (Myrinet), memory bus (CM-5), or in registers (J-Machine, *T)
– may require CPU involvement in all data transfers (explicit memory-to-network copy)
Transaction Interpretation (cont'd)
Virtual DMA: get the CPU out of the way (maybe)
– basic protection plus address translation: user-level bulk DMA
– usually to a limited region of the address space (pinned)
– can be done in hardware (VIA, Meiko CS-2) or software (some Myrinet, Intel Paragon)

Reflective memory
– DEC Memory Channel, Princeton SHRIMP

Global physical address space (NUMA): everything in hardware
– complexity increases, but performance does too (if done right)

Cache coherence: even more so
– stay tuned
Net Transactions: Physical DMA
[Figure: physical DMA -- the sending CPU writes cmd/dest/data to memory; DMA channels on each node are programmed with addr/length/ready registers; the receiving side posts status and raises an interrupt]
• Physical addresses: OS must initiate transfers
– system call per message on both ends: ouch (schematic below)
• Sending OS copies data into a kernel buffer w/ header/trailer
– can avoid the copy if the interface does scatter/gather
• Receiver copies the packet into an OS buffer, then interprets it
– user message then copied (or mapped) into user space
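A schematic of the kernel-mediated send path; every name here is illustrative, not a real OS API (a real kernel would use copy_from_user and a physical-address translation for the DMA buffer):

#include <stddef.h>
#include <string.h>

#define HDR_LEN 16

extern void *kmalloc(size_t n);                     /* kernel allocator  */
extern void  build_header(char *buf, int dest, size_t len);
extern void  dma_start(void *buf, size_t len);      /* program NI DMA    */

/* Schematic kernel-mediated send: one system call per message. */
long sys_net_send(int dest, const void *ubuf, size_t len) {
    /* 1. Copy user data into a pinned kernel buffer, adding a header
     *    (the DMA engine needs physical, non-pageable memory).        */
    char *kbuf = kmalloc(HDR_LEN + len);
    build_header(kbuf, dest, len);
    memcpy(kbuf + HDR_LEN, ubuf, len);  /* the copy we'd like to avoid */

    /* 2. Program the DMA channel with the buffer's address.           */
    dma_start(kbuf, HDR_LEN + len);

    /* 3. Completion signaled later by interrupt; kbuf is freed then.  */
    return 0;
}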
nCUBE/2 Network Interface
Independent DMA channel per link direction

Segmented messages: can inspect the header to direct the remainder of the DMA directly to a user buffer
– avoids a copy at the expense of an extra interrupt + DMA setup cost
– can't let the buffer be paged out (did nCUBE have VM?)
Conventional LAN Network Interface
[Figure: conventional LAN NIC -- TX and RX descriptor queues in host memory, each entry holding addr/len/status/next; the NIC controller (DMA addr, len, transmit/receive) moves data across the I/O bus, which bridges to the memory bus and processor]
User Level Messaging
[Figure: user-level messaging -- the NI on each node is mapped into the user address space (dest/data on the sender, status/interrupt on the receiver), with a user/system protection boundary]
• Map network hardware into the user's address space
– talk directly to the network via loads & stores
• User-to-user communication without OS intervention: low latency
• Protection: user/user & user/system
• DMA hard... CPU involvement (copying) becomes the bottleneck
User Level Network Ports

[Figure: net output port, net input port, and a status register appear in the user's virtual address space, alongside the processor's program counter and registers]
Appears to user as logical message queues plus status
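A sketch of what talking to such a port looks like; the mapped addresses and status bits are invented for illustration (the OS would set up the mapping when it grants the process the NI pages):

#include <stdint.h>

/* Hypothetical user-mapped NI registers. */
#define NI_BASE   0x40000000UL
#define NI_STATUS (*(volatile uint32_t *)(NI_BASE + 0x0))
#define NI_TXPORT (*(volatile uint32_t *)(NI_BASE + 0x4))
#define NI_RXPORT (*(volatile uint32_t *)(NI_BASE + 0x8))
#define TX_READY  0x1        /* output FIFO has space */
#define RX_AVAIL  0x2        /* input FIFO has a word */

/* Send and receive are just loads and stores -- no system calls. */
static void port_send_word(uint32_t w) {
    while (!(NI_STATUS & TX_READY)) ;   /* spin until FIFO has space    */
    NI_TXPORT = w;                      /* store pushes into output port */
}

static uint32_t port_recv_word(void) {
    while (!(NI_STATUS & RX_AVAIL)) ;   /* poll for arrival             */
    return NI_RXPORT;                   /* load pops from input port    */
}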
Example: CM-5
[Figure: CM-5 -- data, control, and diagnostics networks spanning processing partitions, control processors, and an I/O partition; each node: SPARC processor on the MBUS with FPU, cache controller + SRAM, NI, and DRAM controllers with optional vector units]
• Input and output FIFO for each network
• Two data networks
• Save/restore network buffers on context switch
User Level Handlers

[Figure: like user-level ports, but the message carries a destination address (handler) to which the receiving processor vectors; user/system boundary as before]
• Hardware support to vector to the address specified in the message
– message ports in registers
– alternate register set for the handler?
• Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)
J-Machine
• Each node is a small message-driven processor
• HW support to queue msgs and dispatch to a msg handler task
Dedicated Message Processing Without Specialized Hardware
[Figure: each node pairs a compute processor P and a message processor MP on shared memory, with the NI behind the MP; user code runs on P, system code on the MP; nodes connect through the network]
• General-purpose processor performs arbitrary output processing (at system level)
• General-purpose processor interprets incoming network transactions (at system level)
• User processor <–> msg processor: share memory
• Msg processor <–> msg processor: via system network transaction
Levels of Network Transaction
• User processor stores cmd / msg / data into the shared output queue
– must still check for output queue full (or grow it dynamically)
• Communication assists make the transaction happen
– checking, translation, scheduling, transport, interpretation
• Avoids system call overhead
• Multiple bus crossings are the likely bottleneck
Example: Intel Paragon
[Figure: Paragon node -- i860XP compute and message processors (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI) sharing memory over a 64-bit, 400 MB/s memory bus; NI with send (sDMA) and receive (rDMA) engines; 16-bit, 175 MB/s full-duplex network links; I/O nodes and devices hang off the mesh; the MP handler processes route, variable-length data, and EOP; packets up to 2048 B]
Dedicated MP w/ Specialized NI: Meiko CS-2

• Integrate the message processor into the network interface
– active-messages-like capability
– dedicated threads for DMA, reply handling, simple remote memory access
– supports user-level virtual DMA
  • own page table
  • can take a page fault, signal the OS, restart
    – meanwhile, nack the other node
• Problem: processor is slow, time-slices threads
– fundamental issue with building your own CPU
Myricom Myrinet (Berkeley NOW)

• Programmable network interface on the I/O bus (Sun SBus or PCI)
– embedded custom CPU ("LANai", ~40 MHz RISC CPU)
– 256 KB SRAM
– 3 DMA engines: to network, from network, to/from host memory
• Downloadable firmware executes in kernel mode
– includes source-based routing protocol
• SRAM pages can be mapped into user space
– separate pages for separate processes
– firmware can define status words, queues, etc.
  • data for short messages or pointers for long ones
  • firmware can do address translation too... w/ OS help
– polls to check for sends from the user
• Bottom line: I/O bus still the bottleneck, CPU could be faster
Shared Physical Address Space
• Implement the SAS model in hardware w/o caching
– actual caching must be done by copying from remote memory to local
– programming paradigm looks more like message passing than Pthreads
  • yet low-latency & low-overhead transfers, thanks to HW interpretation; high bandwidth too if done right
  • result: a great platform for MPI & compiled data-parallel codes

• Implementation (sketched below):
– "pseudo-memory" acts as the memory controller for remote memory, converting accesses into network transactions (requests)
– "pseudo-CPU" on the remote node receives requests, performs them on local memory, sends replies
– split-transaction or retry-capable bus required (or dual-ported memory)
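A software model of the pseudo-memory / pseudo-CPU pair; all names and the address layout are invented for the sketch (real systems do this in bus-level hardware):

#include <stdint.h>

/* Assumed address layout: high bits name the home node. */
#define NODE_OF(a)   ((int)((a) >> 32))
#define OFFSET_OF(a) ((a) & 0xFFFFFFFFu)

typedef struct { int op; uint64_t addr; uint64_t data; } net_req_t;
typedef struct { uint64_t data; } net_rsp_t;
enum { OP_READ, OP_WRITE };

extern net_rsp_t net_rpc(int node, net_req_t req);  /* assumed transport */
extern uint64_t *local_mem;                         /* this node's DRAM  */

/* "Pseudo-memory": claims bus accesses whose home is remote and turns
 * them into request/response network transactions. */
uint64_t pseudo_mem_read(uint64_t addr) {
    net_req_t req = { OP_READ, addr, 0 };
    return net_rpc(NODE_OF(addr), req).data;   /* the CPU's load stalls */
}

/* "Pseudo-CPU": on the home node, receives requests and performs them
 * on local memory as if it were a local processor, then replies. */
net_rsp_t pseudo_cpu_serve(net_req_t req) {
    uint64_t off = OFFSET_OF(req.addr) / sizeof(uint64_t);
    net_rsp_t rsp = { 0 };
    if (req.op == OP_READ)  rsp.data = local_mem[off];
    else                    local_mem[off] = req.data;  /* then ack */
    return rsp;
}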
Example: Cray T3D

• Up to 2,048 Alpha 21064s
– no off-chip L2, to avoid its inherent latency
• In addition to remote memory ops, includes:
– prefetch buffer (hides remote latency)
– DMA engine (requires OS trap)
– synchronization operations (swap, fetch&inc, global AND/OR)
– message queue (requires OS trap on receiver)
• Big problem: physical address space
– 21064 supports only 32 bits
– a 2K-node machine is limited to 2 MB per node
– external "DTB annex" provides segment-like registers for extended addressing, but management is expensive & ugly
Cray T3E
• Similar to T3D, but uses the Alpha 21164 instead of the 21064 (on-chip L2)
– still has physical address space problems
• E-registers for remote communication and synchronization
– 512 user, 128 system; 64 bits each
– replace/unify the DTB annex, prefetch queue, block transfer engine, remote load/store, and message queue
– address specifies the source or destination E-register and a command
– data contains a pointer to a block of 4 E-regs and an index for the centrifuge
• Centrifuge
– supports data distributions used in data-parallel languages (HPF)
– 4 E-regs per global memory operation: mask, base, two arguments
• Get & put operations
T3E (continued)
• Atomic memory operations
– E-registers & centrifuge used
– F&I, F&Add, Compare&Swap, Masked_Swap
• Messaging
– arbitrary number of queues (user or system)
– 64-byte messages
– create a msg queue by storing a message control word to a memory location
• Msg send
– construct data in an aligned block of 8 E-regs
– send like a put, but the dest must be a message control word
– processor is responsible for queue space (buffer management)
• Barrier and eureka synchronization
DEC Memory Channel (Princeton SHRIMP)
• Reflective memory
• Writes on the sender appear in the receiver's memory
– send & receive regions
– page control table
• Receive region is pinned in memory
• Requires duplicate writes; the regions are really just message buffers
[Figure: a virtual page on the sender maps to a physical page whose writes are reflected across the network into a physical page on the receiver, mapped into the receiver's virtual address space]
Performance of Distributed Memory Machines
• Microbenchmarking

• One-way latency of a small (five-word) message
– echo test: round-trip time divided by 2
• Shared memory remote read
• Message passing operations
– see text
Network Transaction Performance
[Figure 7.31: network transaction performance, in microseconds (scale 0-14), for CM-5, Paragon, CS-2, NOW, and T3D; bars break the time into sending overhead (Os), receiving overhead (Or), latency (L), and gap]
Remote Read Performance
[Figure 7.32: remote read performance, in microseconds (scale 0-25), for CM-5, Paragon, CS-2, NOW, and T3D; bars break the time into issue cost, latency (L), and gap]
Summary of Distributed Memory Machines
• Convergence of architectures
– everything "looks basically the same"
– processor, cache, memory, communication assist

• Communication assist
– where is it? (I/O bus, memory bus, processor registers)
– what does it know?
  • does it just move bytes, or does it perform some functions?
– is it programmable?
– does it run user code?

• Network transaction
– input & output buffering
– action on remote node