Chapter 7 (excl. 7.9): Scalable Multiprocessors

Transcript of lecture slides, EECS 570, Fall 2003 (rev3)
Outline

Scalability
– bus doesn't hack it: need scalable interconnect (network)

Realizing Programming Models on Scalable Systems
– network transactions
– protocols
  • shared address space
  • message passing (synchronous, asynchronous, etc.)
  • active messages
– safety: buffer overflow, fetch deadlock

Communication Architecture Design Space
– where does the network attach to the processor node?
– how much hardware interpretation of the network transaction?
– impact on cost & performance
Scalability
How do bandwidth, latency, cost, and packaging scale with P?

• Ideal:
– latency, per-processor bandwidth, per-processor cost are constants
– packaging does not create an upper bound, does not exacerbate the others

• Bus:
– per-processor BW scales as 1/P
– latency increases with P:
  • queuing delays for fixed bus length (linear at saturation?)
  • as bus length is increased to accommodate more CPUs, the clock must slow

• Reality:
– "scalable" may just mean sub-linear dependence (e.g., logarithmic)
– practical limits ($/customer), sweet spot + growth path
– switched interconnect (network)
Aside on Cost-Effective Computing
Traditional view: efficiency = speedup(P)/P
Efficiency < 1 → parallelism not worthwhile?
But much of a computer's cost is NOT in the processor (memory, packaging, interconnect, etc.)
• [Wood & Hill, IEEE Computer 2/95]

Let Costup(P) = Cost(P)/Cost(1)
Parallel computing is cost-effective when:
Speedup(P) > Costup(P)
E.g., for an SGI PowerChallenge w/ 500 MB:
Costup(32) = 8.6
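To make the cost-effectiveness test concrete, here is a hedged worked example; the Speedup(32) = 12 figure is invented for illustration, and only Costup(32) = 8.6 comes from the slide.

\[
\text{Efficiency}(32) = \frac{\text{Speedup}(32)}{32} = \frac{12}{32} \approx 0.38 < 1
\qquad\text{but}\qquad
\text{Speedup}(32) = 12 > 8.6 = \text{Costup}(32)
\]

By the traditional efficiency test the machine looks wasteful, yet it delivers more speedup than it costs: the parallel run does more work per dollar than the uniprocessor, so it is cost-effective.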
Network Transaction Primitive
One-way transfer of information from source to destination; causes some action at the destination:
• process info and/or deposit in buffer
• state change (e.g., set flag, interrupt program)
• maybe initiate reply (a separate network transaction)
[Figure: a serialized message leaves the source node's output buffer, crosses the communication network, and lands in the destination node's input buffer]
Network Transaction Correctness Issues
• protection
– user/user, user/system; what if VM doesn't apply?
– fault containment (large machine component MTBF may be low)
• format
– variable length? header info?
– affects efficiency, ability to handle in HW
• buffering/flow control
– finite buffering in the network itself
– messages show up unannounced, e.g., many-to-one pattern
• deadlock avoidance
– if you're not careful -- details later
• action
– what happens on delivery? how many options are provided?
• system guarantees
– delivery, ordering
Performance Issues
Key parameters: latency, overhead, bandwidth

LogP model: latency / overhead / gap (BW) as a function of P (see the note after the figure)
[Figure: sending CPU and NI, the network, and receiving NI and CPU]
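For reference, here is how the LogP parameters are usually combined (standard LogP accounting from Culler et al., not taken from this slide): a small one-way message pays the sender's overhead, the network latency, and the receiver's overhead, while the gap g bounds the per-processor message rate.

\[
T_{\text{one-way}} \approx o_{\text{send}} + L + o_{\text{recv}},
\qquad
\text{messages per unit time per processor} \le \frac{1}{\max(g, o)}
\]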
Programming Models
What is the user's view of a network transaction?
– depends on system architecture
– remember the layered approach: the OS/compiler/library may implement an alternative model on top of the HW-provided model

We'll look at three basic ones:
– Active Messages: "assembly language" for msg-passing systems
– Message Passing: MPI-style interface, as seen by application programmers
– Shared Address Space: ignoring cache coherence for now
Active Messages
User-level analog of the network transaction (see the sketch below)
– invoke a handler function at the receiver to extract the packet from the network
– grew out of attempts to do dataflow on msg-passing machines & remote procedure calls
– handler may send a reply, but no other messages
– event notification: interrupts, polling, events?
– may also perform memory-to-memory transfer

Flexible (can do almost any action on msg reception), but requires tight cooperation between CPU and network for high performance
– may be better to have HW do a few things faster
[Figure: a request carries the address of a handler that runs on the receiving processor; the handler may issue a reply, which in turn runs a handler back at the requester]
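A minimal C sketch of the active-message discipline. The packet layout, handler table, and the net_send/net_poll/my_node primitives are all hypothetical stand-ins for NI hardware, not a real API:

#include <stdint.h>
#include <string.h>

/* Hypothetical packet: first word selects the handler, rest is payload. */
typedef struct {
    uint32_t handler_id;
    uint32_t src_node;              /* sender, so the handler can reply */
    uint8_t  payload[48];
} am_packet_t;

typedef void (*am_handler_t)(const am_packet_t *pkt);
static am_handler_t handler_table[256];     /* registered at startup */

extern int  my_node;                               /* set by the runtime */
extern void net_send(int dest, const am_packet_t *pkt);
extern int  net_poll(am_packet_t *pkt);     /* nonzero if packet arrived */

/* Example: remote increment. The request handler performs the action
 * and may send a reply -- but no other messages. */
enum { H_INCR, H_INCR_REPLY };
static long counter;
static volatile int done;

static void incr_reply_handler(const am_packet_t *pkt) { (void)pkt; done = 1; }

static void incr_handler(const am_packet_t *pkt) {
    long amount;
    memcpy(&amount, pkt->payload, sizeof amount);
    counter += amount;                      /* the action at the destination */
    am_packet_t reply = { H_INCR_REPLY, (uint32_t)my_node, {0} };
    net_send((int)pkt->src_node, &reply);
}

/* Polling-mode dispatch: extract packets from the network and run their
 * handlers. At startup each node registers:
 *   handler_table[H_INCR] = incr_handler;
 *   handler_table[H_INCR_REPLY] = incr_reply_handler;               */
static void poll_network(void) {
    am_packet_t pkt;
    while (net_poll(&pkt))
        handler_table[pkt.handler_id](&pkt);
}

The key property shown: the handler runs at user level and drains the packet from the network immediately, rather than leaving it buffered.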
Message Passing
Basic idea:
– Send(dest, tag, buffer) -- tag is an arbitrary integer
– Recv(src, tag, buffer) -- src/tag may be a wildcard ("any")

Completion semantics:
– a receive completes after the data transfer from the matching send is complete
– a synchronous send completes after the matching receive is posted and the data sent
– an asynchronous send completes once the send buffer may be reused
  • msg may simply be copied into an alternate buffer, on the src or dest node

Blocking vs. non-blocking:
– does the function wait for "completion" before returning?
– non-blocking: extra function calls to check for completion
– assume blocking for now (see the example below)
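As a concrete instance of this interface, a small program using standard MPI-1 blocking calls and a wildcard receive:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        buf = 42;
        /* Blocking send: returns once buf may be reused (MPI may have
         * buffered the data; the matching receive need not have run). */
        MPI_Send(&buf, 1, MPI_INT, /*dest=*/1, /*tag=*/7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Status st;
        /* Blocking receive with wildcard src and tag: completes only
         * after the data from the matching send has arrived. */
        MPI_Recv(&buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &st);
        printf("got %d from rank %d, tag %d\n", buf, st.MPI_SOURCE, st.MPI_TAG);
    }
    MPI_Finalize();
    return 0;
}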
Synchronous Message Passing
Three-phase operation: ready-to-send, ready-to-receive, transfer
– can skip the 1st phase if the receiver initiates & specifies the source

Overhead and latency tend to be high
Transfer can achieve high bandwidth with sufficient msg length
Programmer must avoid deadlock, e.g., in a pairwise exchange (see the sketch below)
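A sketch of the pairwise-exchange hazard using MPI's synchronous send (MPI_Ssend, which completes only after the matching receive has started); the naive ordering deadlocks, and staggering by rank is one standard fix among several:

#include <mpi.h>

/* Pairwise exchange between ranks 0 and 1. With synchronous sends,
 * the naive version deadlocks: each rank blocks in MPI_Ssend waiting
 * for the other to post its receive. */
void exchange(int rank, int *mine, int *theirs) {
    int peer = 1 - rank;
#if 0   /* DEADLOCKS: both ranks send first */
    MPI_Ssend(mine,   1, MPI_INT, peer, 0, MPI_COMM_WORLD);
    MPI_Recv (theirs, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
#else   /* Fix: stagger by rank so one side receives first */
    if (rank == 0) {
        MPI_Ssend(mine,   1, MPI_INT, peer, 0, MPI_COMM_WORLD);
        MPI_Recv (theirs, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else {
        MPI_Recv (theirs, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Ssend(mine,   1, MPI_INT, peer, 0, MPI_COMM_WORLD);
    }
#endif
}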
Asynch. Msg Passing: Conservative
Same as synchronous, except the msg can be buffered on the sender
– allows computation to continue sooner
– deadlock still an issue (though less so) -- buffering is finite
Asynch. Message Passing: Optimistic
Sender just ships data and hopes the receiver can handle it
Benefit: lower latency
Problems (see the sketch below):
– receive was posted: need fast lookup while data streams in
– receive not posted: buffer? nack? discard?
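A toy C sketch of receiver-side matching for the optimistic protocol; the list structure and function names are invented for illustration (real NIs use hardware or firmware tables):

#include <stddef.h>

/* Toy receiver-side matching for optimistic delivery. A posted receive
 * is looked up as the header arrives; if none matches, the message must
 * be stashed in a system buffer (or nacked/dropped). */
#define ANY (-1)

typedef struct posted {
    int src, tag;             /* ANY acts as a wildcard */
    void *buf;
    struct posted *next;
} posted_t;

static posted_t *posted_list;     /* receives posted by the application */

static int match(const posted_t *p, int src, int tag) {
    return (p->src == ANY || p->src == src) &&
           (p->tag == ANY || p->tag == tag);
}

/* Called as a message header arrives. Returns the user buffer to stream
 * the payload into, or NULL if the receive was not posted. */
void *lookup_posted(int src, int tag) {
    for (posted_t **pp = &posted_list; *pp; pp = &(*pp)->next) {
        if (match(*pp, src, tag)) {
            void *buf = (*pp)->buf;
            *pp = (*pp)->next;    /* unlink the matched receive */
            return buf;
        }
    }
    return NULL;   /* not posted: must buffer, nack, or discard */
}

The lookup must keep pace with the incoming data stream, which is why "receive was posted" demands a fast path.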
Key Features of Msg Passing Abstraction
Source knows the send data address, destination knows the receive data address
– after the handshake, both know both

Arbitrary storage needed for the asynchronous protocol
– may post many sends before any receives

Fundamentally a 3-phase transaction
– can use optimistic 1-phase in limited cases

Latency and overhead tend to be higher than SAS; high BW is easier

Hardware support?
– DMA: physical or virtual (better/harder)
Shared Address Space Abstraction
Two-way request/response protocol
– reads require a data response
– writes have an acknowledgment (for consistency)

Issues:
– virtual or physical address on the net? (where does translation happen?)
– coherence, consistency, etc. (later)
Key Properties of SAS Abstraction
Data addresses are specified by the source of the request
– no dynamic buffer allocation
– protection achieved through virtual memory translation

Low-overhead initiation: one instruction, a load or store (see the sketch below)
High bandwidth is more challenging
– may require prefetching, or a separate "block transfer engine"

Synchronization less straightforward (no explicit event notification)

Simple request-response pairs
– few fixed message types
– practical to implement in hardware w/o remote CPU involvement

Input buffering / flow control issue
– what if the request rate exceeds local memory bandwidth?
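A sketch of what "one-instruction initiation" means. Assume the runtime has mapped array a across node memories (the mapping itself is hypothetical); remote access is then an ordinary load or store:

#include <stddef.h>

/* Under SAS, remote data is named by an ordinary address. If a[] spans
 * node memories, a remote element is read with a plain load; the
 * communication assist turns the miss into a network request/response.
 * No send/recv calls, no receiver-side code. */
extern double *a;           /* globally addressable array (runtime-mapped) */

double read_remote(size_t i) {
    return a[i];            /* one load initiates the request;        */
}                           /* the CPU stalls until the data reply    */

void write_remote(size_t i, double v) {
    a[i] = v;               /* one store; the write acknowledgment    */
}                           /* returns separately, for consistency    */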
Challenge 1: Input Buffer Overflow
Options (a credit-based sketch follows this list):
• refuse input when full
– creates "back pressure" (in a reliable network)
– to avoid deadlock:
  • low-level ack/nack
    – assumes a dedicated network path for ack/nack (common in rings)
    – retry on nack
  • drop packets
    – retry on timeout
• avoid overflow by reserving space per source ("credit-based")
– when is the space available for reuse?
– scalability?
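A minimal sketch of sender-side credit accounting; the counters and function names are hypothetical (real NIs keep these counters in hardware):

#include <stdbool.h>

#define NODES   64
#define CREDITS  8    /* receiver reserves CREDITS input buffers per source */

static int credits[NODES];           /* sender-side counter per destination */

extern void net_send(int dest, const void *msg);   /* assumed NI primitive */

void credit_init(void) {
    for (int d = 0; d < NODES; d++) credits[d] = CREDITS;
}

/* A message may enter the network only if the destination is known to
 * have a reserved buffer for us; otherwise the sender must wait. */
bool try_send(int dest, const void *msg) {
    if (credits[dest] == 0)
        return false;                /* no space at dest: stall, don't drop */
    credits[dest]--;
    net_send(dest, msg);
    return true;
}

/* The receiver returns a credit once it has drained the buffer; the
 * credit travels back as a tiny control message. */
void on_credit_return(int dest) {
    credits[dest]++;
}

The scalability question on the slide is visible here: each node must set aside CREDITS buffers for every possible sender, i.e., O(P) space per node.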
Challenge 2: Fetch Deadlock
Processing a message may require sending a reply; what if the reply can't be sent due to input buffer overflow? (a sketch follows this list)
– step 1: guarantee that replies can be sunk at the destination
  • requester reserves buffer space for the reply
– step 2: guarantee that replies can be sent into the network
  • backpressure: logically independent request/reply networks
    – physical networks or virtual channels
  • credit-based: bound outgoing requests to K per node
    – buffer space for K(P-1) requests + K responses at each node
  • low-level ack/nack, packet dropping
    – guarantee that replies will never be nacked or dropped

For cache coherence protocols, some requests may require more: forward the request to another node, send multiple invalidations
– must extend these techniques or nack such requests up front
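A sketch of the requester side of the credit-based variant; the structures and net_send_request are invented to show the two guarantees (K-bound on requests, pre-reserved reply buffers):

#include <stdbool.h>

#define K 4                        /* max outstanding requests per node */

/* A request may be issued only if (a) we are under the K-request bound
 * and (b) a local reply buffer is reserved up front, so the eventual
 * reply can always be sunk without blocking the network. */
typedef struct { char data[64]; bool in_use; } reply_buf_t;

static reply_buf_t reply_slots[K];
static int outstanding;

extern void net_send_request(int dest, const void *req, reply_buf_t *slot);

bool issue_request(int dest, const void *req) {
    if (outstanding == K)
        return false;                        /* bound reached: wait */
    for (int i = 0; i < K; i++) {
        if (!reply_slots[i].in_use) {
            reply_slots[i].in_use = true;    /* reserve reply landing pad */
            outstanding++;
            net_send_request(dest, req, &reply_slots[i]);
            return true;
        }
    }
    return false;
}

/* Called when the reply arrives into its pre-reserved slot. */
void on_reply(reply_buf_t *slot) {
    /* ... consume slot->data ... */
    slot->in_use = false;
    outstanding--;
}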
Outline
Scalability [7.1 - read]

Realizing Programming Models
– network transactions
– protocols: SAS, MP, Active Messages
– safety: buffer overflow, fetch deadlock

Communication Architecture Design Space
– where does hardware fit into the node architecture?
– how much hardware interpretation of the network transaction?
– how much gap between hardware and user semantics?
  • remainder must be done in software
  • increased flexibility, increased latency & overhead
– main CPU or dedicated/specialized processor?
Massively Parallel Processor (MPP) Architectures
[Figure: node architecture -- processor + cache and main memory on the memory bus; I/O bridge to the I/O bus with disk controller, disks, and a network interface to the network]
• Network interface typically close to the processor
– memory bus:
  • locked to a specific processor architecture/bus protocol
– registers/cache:
  • only in research machines
• Time-to-market is long
– use a processor already available, or work closely with processor designers
• Maximizes performance (and cost)
Network of Workstations
Network interface on the I/O bus
Standards (e.g., PCI) => longer life, faster to market
Slow (microseconds) to access the network interface
"System Area Network" (SAN): between LAN & MPP

[Figure: workstation node -- processor + cache and main memory connected by a core chip set; graphics controller, disk controller, and network interface on the I/O bus; the NI signals the processor via interrupts]
Transaction Interpretation
Simple: HW doesn't interpret much, if anything
– DMA from/to buffer, interrupt or set flag on completion
– nCUBE, conventional LAN
– requires OS for address translation, often a user/kernel copy

User-level messaging: get the OS out of the way
– HW does protection checks to allow direct user access to the network
– may have minimal interpretation otherwise
– may be on the I/O bus (Myrinet), memory bus (CM-5), or in registers (J-Machine, *T)
– may require CPU involvement in all data transfers (explicit memory-to-network copy)
Transaction Interpretation (cont'd)
Virtual DMA: get the CPU out of the way (maybe)
– basic protection plus address translation: user-level bulk DMA
– usually to a limited region of the address space (pinned)
– can be done in hardware (VIA, Meiko CS-2) or software (some Myrinet, Intel Paragon)

Reflective memory
– DEC Memory Channel, Princeton SHRIMP

Global physical address space (NUMA): everything in hardware
– complexity increases, but performance does too (if done right)

Cache coherence: even more so
– stay tuned
Net Transactions: Physical DMA
[Figure: physical DMA -- the sending CPU writes cmd/dest/data to memory; DMA channels on each node are programmed with addr/length/ready registers; the receiving side posts status and raises an interrupt]
• Physical addresses: OS must initiate transfers
– system call per message on both ends: ouch (schematic below)
• Sending OS copies data into a kernel buffer w/ header/trailer
– can avoid the copy if the interface does scatter/gather
• Receiver copies the packet into an OS buffer, then interprets it
– user message then copied (or mapped) into user space
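A schematic of the kernel-mediated send path; every name here is illustrative, not a real OS API (a real kernel would use copy_from_user and a physical-address translation for the DMA buffer):

#include <stddef.h>
#include <string.h>

#define HDR_LEN 16

extern void *kmalloc(size_t n);                     /* kernel allocator  */
extern void  build_header(char *buf, int dest, size_t len);
extern void  dma_start(void *buf, size_t len);      /* program NI DMA    */

/* Schematic kernel-mediated send: one system call per message. */
long sys_net_send(int dest, const void *ubuf, size_t len) {
    /* 1. Copy user data into a pinned kernel buffer, adding a header
     *    (the DMA engine needs physical, non-pageable memory).        */
    char *kbuf = kmalloc(HDR_LEN + len);
    build_header(kbuf, dest, len);
    memcpy(kbuf + HDR_LEN, ubuf, len);  /* the copy we'd like to avoid */

    /* 2. Program the DMA channel with the buffer's address.           */
    dma_start(kbuf, HDR_LEN + len);

    /* 3. Completion signaled later by interrupt; kbuf is freed then.  */
    return 0;
}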
nCUBE/2 Network Interface
Independent DMA channel per link direction

Segmented messages: can inspect the header to direct the remainder of the DMA directly to a user buffer
– avoids a copy at the expense of an extra interrupt + DMA setup cost
– can't let the buffer be paged out (did nCUBE have VM?)
Conventional LAN Network Interface
[Figure: conventional LAN NIC -- TX and RX descriptor queues in host memory, each entry holding addr/len/status/next; the NIC controller (DMA addr, len, transmit/receive) moves data across the I/O bus, which bridges to the memory bus and processor]
User Level Messaging
[Figure: user-level messaging -- the NI on each node is mapped into the user address space (dest/data on the sender, status/interrupt on the receiver), with a user/system protection boundary]
• Map network hardware into the user's address space
– talk directly to the network via loads & stores
• User-to-user communication without OS intervention: low latency
• Protection: user/user & user/system
• DMA hard... CPU involvement (copying) becomes the bottleneck
User Level Network Ports

[Figure: net output port, net input port, and a status register appear in the user's virtual address space, alongside the processor's program counter and registers]
Appears to user as logical message queues plus status
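A sketch of what talking to such a port looks like; the mapped addresses and status bits are invented for illustration (the OS would set up the mapping when it grants the process the NI pages):

#include <stdint.h>

/* Hypothetical user-mapped NI registers. */
#define NI_BASE   0x40000000UL
#define NI_STATUS (*(volatile uint32_t *)(NI_BASE + 0x0))
#define NI_TXPORT (*(volatile uint32_t *)(NI_BASE + 0x4))
#define NI_RXPORT (*(volatile uint32_t *)(NI_BASE + 0x8))
#define TX_READY  0x1        /* output FIFO has space */
#define RX_AVAIL  0x2        /* input FIFO has a word */

/* Send and receive are just loads and stores -- no system calls. */
static void port_send_word(uint32_t w) {
    while (!(NI_STATUS & TX_READY)) ;   /* spin until FIFO has space    */
    NI_TXPORT = w;                      /* store pushes into output port */
}

static uint32_t port_recv_word(void) {
    while (!(NI_STATUS & RX_AVAIL)) ;   /* poll for arrival             */
    return NI_RXPORT;                   /* load pops from input port    */
}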
Example: CM-5
[Figure: CM-5 -- data, control, and diagnostics networks spanning processing partitions, control processors, and an I/O partition; each node: SPARC processor on the MBUS with FPU, cache controller + SRAM, NI, and DRAM controllers with optional vector units]
• Input and output FIFO for each network
• Two data networks
• Save/restore network buffers on context switch
User Level Handlers

[Figure: like user-level ports, but the message carries a destination address (handler) to which the receiving processor vectors; user/system boundary as before]
• Hardware support to vector to the address specified in the message
– message ports in registers
– alternate register set for the handler?
• Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)
J-Machine
• Each node is a small message-driven processor
• HW support to queue msgs and dispatch to a msg handler task
Dedicated Message Processing Without Specialized Hardware
[Figure: each node pairs a compute processor P and a message processor MP on shared memory, with the NI behind the MP; user code runs on P, system code on the MP; nodes connect through the network]
• General-purpose processor performs arbitrary output processing (at system level)
• General-purpose processor interprets incoming network transactions (at system level)
• User processor <–> msg processor: share memory
• Msg processor <–> msg processor: via system network transaction
Levels of Network Transaction
• User processor stores cmd / msg / data into the shared output queue
– must still check for output queue full (or grow it dynamically)
• Communication assists make the transaction happen
– checking, translation, scheduling, transport, interpretation
• Avoids system call overhead
• Multiple bus crossings are the likely bottleneck
Example: Intel Paragon
[Figure: Paragon node -- i860XP compute and message processors (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI) sharing memory over a 64-bit, 400 MB/s memory bus; NI with send (sDMA) and receive (rDMA) engines; 16-bit, 175 MB/s full-duplex network links; I/O nodes and devices hang off the mesh; the MP handler processes route, variable-length data, and EOP; packets up to 2048 B]
Dedicated MP w/ Specialized NI: Meiko CS-2

• Integrate the message processor into the network interface
– active-messages-like capability
– dedicated threads for DMA, reply handling, simple remote memory access
– supports user-level virtual DMA
  • own page table
  • can take a page fault, signal the OS, restart
    – meanwhile, nack the other node
• Problem: processor is slow, time-slices threads
– fundamental issue with building your own CPU
Myricom Myrinet (Berkeley NOW)

• Programmable network interface on the I/O bus (Sun SBus or PCI)
– embedded custom CPU ("LANai", ~40 MHz RISC CPU)
– 256 KB SRAM
– 3 DMA engines: to network, from network, to/from host memory
• Downloadable firmware executes in kernel mode
– includes source-based routing protocol
• SRAM pages can be mapped into user space
– separate pages for separate processes
– firmware can define status words, queues, etc.
  • data for short messages or pointers for long ones
  • firmware can do address translation too... w/ OS help
– polls to check for sends from the user
• Bottom line: I/O bus still the bottleneck, CPU could be faster
Shared Physical Address Space
• Implement the SAS model in hardware w/o caching
– actual caching must be done by copying from remote memory to local
– programming paradigm looks more like message passing than Pthreads
  • yet low-latency & low-overhead transfers, thanks to HW interpretation; high bandwidth too if done right
  • result: a great platform for MPI & compiled data-parallel codes

• Implementation (sketched below):
– "pseudo-memory" acts as the memory controller for remote memory, converting accesses into network transactions (requests)
– "pseudo-CPU" on the remote node receives requests, performs them on local memory, sends replies
– split-transaction or retry-capable bus required (or dual-ported memory)
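A software model of the pseudo-memory / pseudo-CPU pair; all names and the address layout are invented for the sketch (real systems do this in bus-level hardware):

#include <stdint.h>

/* Assumed address layout: high bits name the home node. */
#define NODE_OF(a)   ((int)((a) >> 32))
#define OFFSET_OF(a) ((a) & 0xFFFFFFFFu)

typedef struct { int op; uint64_t addr; uint64_t data; } net_req_t;
typedef struct { uint64_t data; } net_rsp_t;
enum { OP_READ, OP_WRITE };

extern net_rsp_t net_rpc(int node, net_req_t req);  /* assumed transport */
extern uint64_t *local_mem;                         /* this node's DRAM  */

/* "Pseudo-memory": claims bus accesses whose home is remote and turns
 * them into request/response network transactions. */
uint64_t pseudo_mem_read(uint64_t addr) {
    net_req_t req = { OP_READ, addr, 0 };
    return net_rpc(NODE_OF(addr), req).data;   /* the CPU's load stalls */
}

/* "Pseudo-CPU": on the home node, receives requests and performs them
 * on local memory as if it were a local processor, then replies. */
net_rsp_t pseudo_cpu_serve(net_req_t req) {
    uint64_t off = OFFSET_OF(req.addr) / sizeof(uint64_t);
    net_rsp_t rsp = { 0 };
    if (req.op == OP_READ)  rsp.data = local_mem[off];
    else                    local_mem[off] = req.data;  /* then ack */
    return rsp;
}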
Example: Cray T3D

• Up to 2,048 Alpha 21064s
– no off-chip L2, to avoid its inherent latency
• In addition to remote memory ops, includes:
– prefetch buffer (hides remote latency)
– DMA engine (requires OS trap)
– synchronization operations (swap, fetch&inc, global AND/OR)
– message queue (requires OS trap on receiver)
• Big problem: physical address space
– 21064 supports only 32 bits
– a 2K-node machine is limited to 2 MB per node
– external "DTB annex" provides segment-like registers for extended addressing, but management is expensive & ugly
Cray T3E
• Similar to T3D, but uses the Alpha 21164 instead of the 21064 (on-chip L2)
– still has physical address space problems
• E-registers for remote communication and synchronization
– 512 user, 128 system; 64 bits each
– replace/unify the DTB annex, prefetch queue, block transfer engine, remote load/store, and message queue
– address specifies the source or destination E-register and a command
– data contains a pointer to a block of 4 E-regs and an index for the centrifuge
• Centrifuge
– supports data distributions used in data-parallel languages (HPF)
– 4 E-regs per global memory operation: mask, base, two arguments
• Get & put operations
T3E (continued)
• Atomic memory operations
– E-registers & centrifuge used
– F&I, F&Add, Compare&Swap, Masked_Swap
• Messaging
– arbitrary number of queues (user or system)
– 64-byte messages
– create a msg queue by storing a message control word to a memory location
• Msg send
– construct data in an aligned block of 8 E-regs
– send like a put, but the dest must be a message control word
– processor is responsible for queue space (buffer management)
• Barrier and eureka synchronization
DEC Memory Channel (Princeton SHRIMP)
• Reflective memory
• Writes on the sender appear in the receiver's memory
– send & receive regions
– page control table
• Receive region is pinned in memory
• Requires duplicate writes; the regions are really just message buffers
[Figure: a virtual page on the sender maps to a physical page whose writes are reflected across the network into a physical page on the receiver, mapped into the receiver's virtual address space]
Performance of Distributed Memory Machines
• Microbenchmarking

• One-way latency of a small (five-word) message
– echo test: round-trip time divided by 2
• Shared memory remote read
• Message passing operations
– see text
Network Transaction Performance
[Figure 7.31: network transaction performance, in microseconds (scale 0-14), for CM-5, Paragon, CS-2, NOW, and T3D; bars break the time into sending overhead (Os), receiving overhead (Or), latency (L), and gap]
Remote Read Performance
[Figure 7.32: remote read performance, in microseconds (scale 0-25), for CM-5, Paragon, CS-2, NOW, and T3D; bars break the time into issue cost, latency (L), and gap]
Summary of Distributed Memory Machines
• Convergence of architectures
– everything "looks basically the same"
– processor, cache, memory, communication assist

• Communication assist
– where is it? (I/O bus, memory bus, processor registers)
– what does it know?
  • does it just move bytes, or does it perform some functions?
– is it programmable?
– does it run user code?

• Network transaction
– input & output buffering
– action on remote node