Multiprocessor Systems
Conventional Wisdom (CW) in Computer Architecture
• Old CW: Power is free, transistors expensive
• New CW: “Power wall”: power expensive, transistors free
  (can put more on chip than can afford to turn on)
• Old CW: Multiplies are slow, memory access is fast
• New CW: “Memory wall”: memory slow, multiplies fast
  (200 clocks to DRAM memory, 4 clocks for FP multiply)
• Old CW: Increasing instruction-level parallelism via compilers and innovation (out-of-order, speculation, VLIW, …)
• New CW: “ILP wall”: diminishing returns on more ILP
• New CW: Power wall + memory wall + ILP wall = brick wall
• Old CW: Uniprocessor performance 2X / 1.5 yrs
• New CW: Uniprocessor performance only 2X / 5 yrs?
Uniprocessor Performance (SPECint)
[Figure: SPECint performance relative to the VAX-11/780, log scale from 1 to 10,000, 1978 to 2006; growth at 25%/year until 1986, 52%/year from 1986 to 2002, and ??%/year since, leaving roughly a 3X gap below the 52%/year trend line]
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
• Sea change in chip design: multiple “cores” or processors per chip
Evolution from Single Core to Multi-Core
“… today’s processors … are nearing an impasse as technologies approach the speed of light …”
  David Mitchell, The Transputer: The Time Is Now (1989)
• Procrastination rewarded: 2X sequential performance / 1.5 years
“We are dedicating all of our future product development to multicore designs. … This is a sea change in computing.”
  Paul Otellini, President, Intel (2005)
• All microprocessor companies switch to MP (2X CPUs / 2 yrs)
• Procrastination penalized: 2X sequential performance / 5 yrs

Manufacturer/Year    AMD/’05  Intel/’06  IBM/’04  Sun/’05
Processors/chip         2        2          2        8
Threads/Processor       1        2          2        4
Threads/chip            2        4          4       32

Procrastination: to keep delaying something that must be done
Flynn’s Taxonomy
• Flynn classified by data and control streams in 1966
• SIMD: Data-Level Parallelism
• MIMD: Thread-Level Parallelism
• MIMD popular because
  • Flexible: N programs, or 1 multithreaded program
  • Cost-effective: same MPU in desktop and MIMD

Single Instruction Single Data (SISD): uniprocessor
Single Instruction Multiple Data (SIMD): single PC; Vector, CM-2
Multiple Instruction Single Data (MISD): ????
Multiple Instruction Multiple Data (MIMD): clusters, SMP servers

M.J. Flynn, “Very High-Speed Computing Systems”, Proc. of the IEEE, Vol. 54, pp. 1901-1909, Dec. 1966.
Flynn’s Taxonomy
• SISD (Single Instruction Single Data)
  • Uniprocessors
• MISD (Multiple Instruction Single Data)
  • Single data stream operated on by successive functional units
• SIMD (Single Instruction Multiple Data)
  • Instruction stream executed by multiple processors on different data
  • Simple programming model, low overhead
  • Examples: Connection Machine and vector processors
• MIMD (Multiple Instruction Multiple Data) is the most general
  • Flexibility for parallel applications and multi-programmed systems
  • Processors can run independent processes or applications
  • Processors can also run threads belonging to one parallel application
  • Uses off-the-shelf microprocessors and components
Major MIMD Styles
• Symmetric Multiprocessors (SMP)
  • Main memory is shared and equally accessible by all processors
  • Also called Uniform Memory Access (UMA)
  • Bus-based or interconnection-network-based
• Distributed-memory multiprocessors
  • Distributed Shared Memory (DSM) multiprocessors
    • Distributed memories are shared and can be accessed by all processors
    • Non-Uniform Memory Access (NUMA): latency varies between local and remote memory access
  • Message-passing multiprocessors, multicomputers, and clusters
    • Distributed memories are NOT shared
    • Each processor can access only its own local memory
    • Processors communicate by sending and receiving messages
Shared Memory Architecture
• Any processor can directly reference any physical memory
• Any I/O controller can access any physical memory
• Operating system can run on any processor
  • OS uses shared memory to coordinate
• Communication occurs implicitly as a result of loads and stores
• Wide range of scale
  • Few to hundreds of processors
  • Memory may be physically distributed among processors
• History dates to early 1960s
[Diagram: several processors and I/O controllers all connected to a shared physical memory]
Shared Memory Organizations
[Diagrams: four organizations of processors P1…Pn, caches ($), and memory]
• Dance Hall (UMA): processors with private caches on one side of an interconnection network, memory modules on the other
• Distributed Shared Memory (NUMA): a memory module attached to each processor/cache node, all connected by an interconnection network
• Bus-based shared memory: processors with private caches share a bus with memory and I/O devices
• Shared cache: processors connect through a switch to an interleaved shared cache in front of interleaved main memory
Bus-Based Symmetric Multiprocessors
• Symmetric access to main memory from any processor
• Dominate the server market
• Building blocks for larger systems
• Attractive as throughput servers and for parallel programs
[Diagram: processors P1…Pn, each with a multilevel cache, on a shared bus with main memory and the I/O system]
• Uniform access via loads/stores
• Automatic data movement and coherent replication in caches
• Cheap and powerful extension to uniprocessors
• Key is extension of the memory hierarchy to support multiple processors
Shared Address Space Programming Model
• A process is a virtual address space
  • With one or more threads of control
• Part of the virtual address space is shared by processes
  • Multiple threads share the address space of a single process
• All communication is through shared memory
  • Achieved by loads and stores
  • Writes by one process/thread are visible to others
  • Special atomic operations for synchronization
  • OS uses shared memory to coordinate processes
[Diagram: virtual address spaces for a collection of processes communicating via shared addresses; each process P0…Pn has a private portion of its address space, and the shared portion maps to common physical addresses in the machine physical address space, so a store by one process is seen by a load in another]
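The loads/stores-plus-atomics model above can be sketched with ordinary threads sharing one address space (a minimal Python sketch; the array and counter names are made up for illustration):

```python
import threading

shared = [0] * 4                 # visible to every thread (shared address space)
done = 0
lock = threading.Lock()          # stands in for a special atomic operation

def worker(tid):
    global done
    shared[tid] = tid * 10       # a plain store, visible to the other threads
    with lock:                   # atomic read-modify-write for synchronization
        done += 1

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

After the joins, every thread's stores are visible: `shared` is `[0, 10, 20, 30]` and `done` is 4. Communication happened implicitly through memory, with no explicit messages.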
Medium and Large Scale Multiprocessors
• Problem is the interconnect: high cost (crossbar) or limited bandwidth (bus)
• Centralized memory, or Uniform Memory Access (UMA)
  • Latencies to memory uniform, but uniformly large
  • Interconnection network: crossbar or multi-stage
• Distributed shared memory, or Non-Uniform Memory Access (NUMA)
  • Distributed memory forms one global shared physical address space
  • Access to local memory is faster than access to remote memory
[Diagrams: centralized memory places processor/cache nodes on one side of the interconnection network and the memories on the other; distributed shared memory attaches a memory module to each processor/cache node]
Message Passing Architectures
• Complete computer as a building block
  • Includes processor, memory, and I/O system
  • Easier to build and scale than shared memory architectures
• Communication via explicit I/O operations
  • Communication integrated at the I/O level, not into the memory system
• Much in common with networks of workstations or clusters
  • However, tight integration between processor and network
  • Network is of higher capability than a local area network
• Programming model
  • Direct access only to local memory (private address space)
  • Communication via explicit send/receive messages (library or system calls)
[Diagram: nodes, each containing a processor, cache, and memory, connected by an interconnection network]
Message-Passing Abstraction
• Send specifies receiving process and buffer to be transmitted
• Receive specifies sending process and buffer to receive into
• Optional tag on send and matching rule on receive
  • Matching rule: match a specific tag t, or any tag, or any process
• Combination of send and matching receive achieves:
  • Pairwise synchronization event
  • Memory-to-memory copy of message
• Overheads: copying, buffer management, protection
• Examples: MPI and PVM message-passing libraries
[Diagram: process P executes Send Q, X, t, sending the buffer at address X in its local address space; process Q executes Receive P, Y, t, which matches the message and receives it into address Y]
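The send/receive matching above can be sketched with one mailbox per process (a hypothetical Python sketch using a queue; `send`, `receive`, and the tag names are illustrative, not a real message-passing API):

```python
import queue
import threading

# One mailbox per process; a message carries (sender, tag, payload).
mailbox_Q = queue.Queue()        # mailbox of the receiving process Q

def send(mbox, sender, tag, payload):
    mbox.put((sender, tag, payload))          # memory-to-memory copy of the buffer

def receive(mbox, want_tag):
    while True:
        sender, tag, payload = mbox.get()     # blocks: pairwise synchronization
        if want_tag is None or tag == want_tag:   # match tag t, or any tag
            return sender, payload
        mbox.put((sender, tag, payload))      # no match: put the message back

# Process P sends its local buffer X with tag "t"; Q receives into Y.
X = "hello"
threading.Thread(target=send, args=(mailbox_Q, "P", "t", X)).start()
sender, Y = receive(mailbox_Q, "t")
```

The receive blocks until a message with a matching tag arrives, which is exactly the pairwise synchronization event the slide describes; the payload copy models the memory-to-memory transfer from X to Y.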
Variants of Send and Receive
• Parallel programs using send and receive are quite structured
  • Most often, all nodes execute identical copies of a program
  • Processes can name each other using a simple linear ordering
• Blocking send: sender sends a request and waits until the reply is returned
• Non-blocking send: sender sends a message and continues without waiting for a reply
• Blocking receive: receiver blocks if it tries to receive a message that has not arrived
• Non-blocking receive: receiver simply posts a receive without waiting for the sender
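The blocking and non-blocking variants can be illustrated with a queue, whose `get` supports both behaviors (a minimal Python sketch, not a real message-passing library):

```python
import queue

q = queue.Queue()                # link between sender and receiver

# Non-blocking receive: post a receive; if nothing has arrived, don't wait.
try:
    q.get(block=False)
    got_early = True
except queue.Empty:
    got_early = False            # the message had not arrived yet

q.put("msg")                     # non-blocking send: continue immediately
msg = q.get(block=True)          # blocking receive: wait until it arrives
```

The first `get` fails immediately because the message has not yet been sent; the second waits (here, trivially) until the message is present.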
Evolution of Message-Passing Machines
• Early machines: FIFO on each link
  • Hardware close to programming model
  • Synchronous send/receive operations
  • Topology central (hypercube algorithms)
• CalTech Cosmic Cube (Seitz)
[Figure: 3-dimensional hypercube network with nodes labeled 000 through 111]
Example: Intel Paragon
[Diagram: an Intel Paragon node with two i860 processors, each with an L1 cache, a memory controller driving 4-way interleaved DRAM, and a DMA-driven network interface (NI), all on a 64-bit, 50 MHz memory bus; nodes attach to a 2D grid network, with a processing node at every switch, over 8-bit, 175 MHz, bidirectional links]
Sandia’s Intel Paragon XP/S-based supercomputer
Each card is an SMP with two or more i860 processors and a network interface chip connected to the cache-coherent memory bus.
Each node has a DMA engine to transfer contiguous chunks of data to and from the network at a high rate.
Example 1: Matrix Addition (1/3)
Assume an MIMD shared-memory multiprocessor (SMP) with P processors. The objective is to write a parallel program for matrix addition C = A + B, where A, B, and C are all NxN matrices, using P processors.
(1) How should the array of results (C) be partitioned for parallel processing among P processes?
A few approaches: (1) block-row partitioning, where each process is assigned N/P rows and N columns, or (2) block-column partitioning, where each process is assigned N rows and N/P columns. The first is preferred due to row-major storage.
(2) Find the number of results assigned to each process for load balancing.
There are N^2 results to compute in array C. Each process should get N^2/P of them for load balancing. Using block-row partitioning, each process is assigned a block-row of (N/P)*N results.
(3) Determine the range of results that must be computed by each process as a function of the process ID (Pid).
Using block-row partitioning, each process is assigned N/P rows and N columns. This can be achieved by partitioning A into block-rows such that each process gets the range:
Range = (Imin, Imax) = [Pid*N/P, Pid*N/P + N/P - 1], i.e., Imin = Pid*N/P and Imax = Imin + N/P - 1. Each process plugs its own Pid into this formula to find its range, i.e., a parallel SPMD program.
Example 1: Matrix Addition (2/3)
(4) Using a shared-memory multiprocessor, write an SPMD program to compute the matrix addition with P processes (threads), assuming that all arrays are declared as shared data.
The SPMD program that will run on each processor:
  Imin = Pid*N/P;
  Imax = Imin + N/P - 1;
  for (i = Imin; i <= Imax; i++)
    for (j = 0; j < N; j++)
      c[i,j] = a[i,j] + b[i,j];
Note: the range of computed data for each process depends on the Pid.
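This SPMD program can be made runnable with one thread per process sharing the arrays (a Python sketch; N, P, and the thread setup are illustrative choices, with P dividing N as the slide assumes):

```python
import threading

N, P = 8, 4                           # assumed sizes; P divides N
A = [[i + j for j in range(N)] for i in range(N)]
B = [[i * j for j in range(N)] for i in range(N)]
C = [[0] * N for _ in range(N)]       # shared result array

def spmd_add(pid):
    # Each "process" derives its block-row range from its own Pid.
    imin = pid * (N // P)
    imax = imin + (N // P) - 1
    for i in range(imin, imax + 1):   # inclusive upper bound
        for j in range(N):
            C[i][j] = A[i][j] + B[i][j]

threads = [threading.Thread(target=spmd_add, args=(pid,)) for pid in range(P)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the block-rows written by different threads never overlap, no locking is needed on C.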
Example 1: Matrix Addition (3/3)
(5) Using a distributed-memory multiprocessor, write an SPMD program using the MPI library to parallelize C = A + B, where each is an NxN matrix, using P processors. Assume block-row partitioning, that each process runs only over its own range of data, and that the data return to the master process (process 0) after completion of all the iterations.
{
  MPI_Init;                                          # Initialize MPI
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);          # Get number of processes
  MPI_Comm_rank(MPI_COMM_WORLD, &mypid);             # Get process id as mypid
  # Process 0 scatters arrays A and B over all the processes,
  # N*N/P elements (N/P rows) to each:
  MPI_Scatter(&a, N*N/P, 0 (root), MPI_COMM_WORLD);  # Scatter A
  MPI_Scatter(&b, N*N/P, 0 (root), MPI_COMM_WORLD);  # Scatter B
  my_range = N/P;                                    # Rows processed by each process
  for (i = 0; i < my_range; i++)
    for (j = 0; j < N; j++)
      c[i,j] = a[i,j] + b[i,j];
  MPI_Gather(&c, N*N/P, 0 (root), MPI_COMM_WORLD);   # Gather C at process 0
}
(Simplified MPI pseudocode; the real MPI_Scatter/MPI_Gather also take send/receive buffers and datatype arguments.)
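Since the MPI calls above are simplified pseudocode, the scatter/compute/gather pattern can be checked with a single-process simulation (a Python sketch; `scatter` and `gather` are hypothetical helpers standing in for MPI_Scatter/MPI_Gather, not real MPI functions):

```python
# Simulated block-row scatter/compute/gather for C = A + B, N = 8, P = 4.
N, P = 8, 4

def scatter(matrix, p):
    """Root splits the matrix into p block-rows of N/p rows each."""
    rows = len(matrix) // p
    return [matrix[r * rows:(r + 1) * rows] for r in range(p)]

def gather(parts):
    """Root concatenates the block-rows back into one matrix."""
    return [row for part in parts for row in part]

A = [[1] * N for _ in range(N)]
B = [[2] * N for _ in range(N)]

local_A, local_B = scatter(A, P), scatter(B, P)   # process 0 scatters A and B
local_C = []
for pid in range(P):                              # each "process" adds its rows
    local_C.append([[a + b for a, b in zip(ra, rb)]
                    for ra, rb in zip(local_A[pid], local_B[pid])])
C = gather(local_C)                               # process 0 gathers the result
```

Each simulated process touches only its own N/P rows, mirroring the private address spaces of a real distributed-memory run.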
Example 2: Summing 100,000 Numbers on 100 Processors
sum[Pid] = 0;
for (i = 1000*Pid; i < 1000*(Pid+1); i = i + 1)
  sum[Pid] = sum[Pid] + A[i];

Processors start by running a loop that sums their subset of vector A (vectors A and sum are shared variables, Pid is the processor’s id, and i is a private variable).

The processors then coordinate in adding together the partial sums (P is a private variable initialized to 100, the number of processors):

repeat
  synch();                          /* synchronize first */
  if (P % 2 != 0 && Pid == 0)
    sum[0] = sum[0] + sum[P-1];     /* odd P: processor 0 absorbs the last sum */
  P = P/2;
  if (Pid < P)
    sum[Pid] = sum[Pid] + sum[Pid+P];
until (P == 1);                     /* final sum in sum[0] */
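The reduction loop above can be checked with a sequential simulation, where the inner loop over pid stands in for the P processors executing the SPMD code concurrently (a Python sketch):

```python
# Sequential simulation of the tree reduction over the partial sums.
def tree_reduce(sums):
    P = len(sums)
    while P != 1:
        # a synch() barrier would go here on a real machine
        if P % 2 != 0:              # odd P: processor 0 absorbs the last sum
            sums[0] += sums[P - 1]
        P //= 2
        for pid in range(P):        # each active processor adds its partner's sum
            sums[pid] += sums[pid + P]
    return sums[0]

# 100 partial sums, each over 1000 numbers, as in the slide.
partial = [sum(range(1000 * p, 1000 * (p + 1))) for p in range(100)]
total = tree_reduce(partial)
```

The odd-P branch matters whenever halving would strand an element: with 100 processors the active count goes 100, 50, 25, 12, 6, 3, 1, and the odd steps (25 and 3) fold the orphaned sum into sum[0].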
An Example with 10 Processors
[Diagram: tree reduction of the partial sums sum[P0] through sum[P9]; with P = 10, processors P0-P4 add the sums held by P5-P9; the halving repeats through P = 5 and P = 2 until P = 1, leaving the total in sum[0] on P0]
Message Passing Multiprocessors
• Each processor has its own private address space
• Q1: Processors share data by explicitly sending and receiving information (messages)
• Q2: Coordination is built into the message-passing primitives (send and receive)