Parallel Architecture Overview
© C. Xu, 1999-2005
Recap: What is a Parallel Computer?
A parallel computer is a collection of processing elements that cooperate to solve large problems fast
Some broad issues:
• Resource allocation:
– how large a collection?
– how powerful are the elements?
– how much memory?
• Data access, communication and synchronization:
– how do the elements cooperate and communicate?
– how are data transmitted between processors?
– what are the abstractions and primitives for cooperation?
• Performance and scalability:
– how does it all translate into performance?
– how does it scale?
Parallel Computer Architecture
Computer architecture:
• Critical abstractions, boundaries, and primitives (interfaces)
• Organizational structures that implement interfaces (hw or sw)
Parallel Computer Architecture: extension of CA to support communication and cooperation
• CA: Instruction Set + Organization
• PCA: CA + Communication Architecture
Communication architecture defines the basic communication and synchronization operations and addresses the organizational structures that realize these operations
Communication Architecture
= User/System Interface + Implementation
User/System Interface:
• Comm. primitives exposed to user-level by hw and system-level sw
Implementation:
• Organizational structures that implement the primitives: hw or OS
• How optimized are they? How integrated into processing node?
• Structure of network
Goals:
• Performance
• Broad applicability
• Programmability
• Scalability
• Low cost
Evolution of Traditional Architecture Models
[Figure: application software and system software layered over divergent architectures (SIMD, message passing, shared memory, dataflow, systolic arrays).]
Historically, machines are tailored to programming models:
arch = programming model + abstraction + organization
Shared Address Space Architecture
Any processor can directly reference any memory location
• Communication occurs implicitly as a result of loads and stores
Programming model
• Similar to time-sharing on uniprocessors
– Except processes run on different processors
– Good throughput on multiprogrammed workloads
Naturally provided on wide range of platforms
• History dates at least to precursors of mainframes in early 60s
• Wide range of scale: few to hundreds of processors
Popularly known as shared memory machines or model
• Ambiguous: memory may be physically distributed among processors
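A minimal sketch (not from the slides) of this model in C with POSIX threads: the threads share one address space, and communication happens implicitly through ordinary stores and loads to a shared array. Names, sizes, and values are made up for illustration.

    /* Shared-address-space sketch: communication via plain loads and stores. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static int shared_data[NTHREADS];     /* ordinary memory, visible to every thread */

    static void *worker(void *arg) {
        long id = (long)arg;
        shared_data[id] = (int)(id * 10); /* "communication" is just a store */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);     /* joining orders the stores before the loads below */
        for (int i = 0; i < NTHREADS; i++)
            printf("shared_data[%d] = %d\n", i, shared_data[i]);  /* plain loads see the stores */
        return 0;
    }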
Shared Address Space Abstraction
Process: virtual address space plus one or more threads of control
Portions of address spaces of processes are shared
[Figure: virtual address spaces for a collection of processes communicating via shared addresses. Each process P0 ... Pn has a private portion and a shared portion of its virtual address space; the shared portions map to common physical addresses in the machine physical address space, so a store by one process and a load by another reach the same location.]
• Natural extension of the uniprocessor model: conventional memory operations for communication, special atomic operations for synchronization
• OS uses shared memory to coordinate processes
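A hedged sketch of the bullet above ("conventional memory operations for communication, special atomic operations for synchronization") using C11 atomics and POSIX threads; the slides do not prescribe any particular API, and the names and values are illustrative.

    /* The payload is communicated with a plain store/load; an atomic flag synchronizes. */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int payload;                   /* ordinary shared memory */
    static atomic_int ready = 0;          /* special atomic operation for synchronization */

    static void *producer(void *arg) {
        (void)arg;
        payload = 42;                                             /* conventional store */
        atomic_store_explicit(&ready, 1, memory_order_release);   /* publish */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                                      /* spin until published */
        printf("consumer read %d\n", payload);                     /* conventional load */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }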
Communication Hardware
Also natural extension of uniprocessor
Already have processor, one or more memory modules and I/O controllers connected by hardware interconnect of some sort
Memory capacity increased by adding modules, I/O by controllers
• Add processors for processing!
• For higher-throughput multiprogramming, or parallel programs
[Figure: the uniprocessor organization (processor, memory, and I/O controller with I/O devices on an interconnect) extended by attaching additional processors, memory modules, and I/O controllers to the same interconnect.]
History
[Figure: “mainframe” organization (processors and I/O ports connected to memory modules through a crossbar) and “minicomputer” organization (processors with caches, memory modules, and I/O controllers sharing a single bus).]
“Mainframe” approach
• Motivated by multiprogramming
• Extends crossbar used for mem bw and I/O
• Originally processor cost limited to small scale
– later, cost of crossbar
• Bandwidth scales with p
• High incremental cost; use multistage instead
“Minicomputer” approach
• Almost all microprocessor systems have bus
• Motivated by multiprogramming, TP
• Used heavily for parallel computing
• Called symmetric multiprocessor (SMP)
• Latency larger than for uniprocessor
• Bus is bandwidth bottleneck
– caching is key: coherence problem
• Low incremental cost
Example: Intel Pentium Pro Quad
• All coherence and multiprocessing glue in processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth
[Figure: Intel Pentium Pro Quad block diagram. P-Pro processor modules (each with CPU, 256-KB L2 cache, interrupt controller, and bus interface) share the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller and MIU drive 1-, 2-, or 4-way interleaved DRAM, and PCI bridges connect PCI buses with I/O cards.]
Example: SUN Enterprise
• 16 cards of either type: processors + memory, or I/O
• All memory accessed over bus, so symmetric
• Higher bandwidth, higher latency bus
[Figure: Sun Enterprise block diagram. CPU/mem cards (each with two UltraSPARC processors, their caches, and a memory controller) and I/O cards (with SBus slots, 2 FiberChannel, 100bT, and SCSI interfaces behind a bus interface/switch) all attach to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]
Scaling Up
• Dance-hall: bandwidth still scalable, but lower cost than crossbar
– latencies to memory uniform, but uniformly large
• Distributed memory or non-uniform memory access (NUMA)
– Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response); see the sketch after the figure below
• Caching shared (particularly nonlocal) data?
[Figure: “dance hall” organization (all processors with caches on one side of the network, all memory modules on the other) vs. distributed-memory organization (a memory module attached to each processor node, with the nodes connected by the network).]
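To make the NUMA bullet above concrete, here is a schematic C sketch of a shared load built from read-request/read-response message transactions. node_of, local_read, send_msg, recv_msg, and ANY_NODE are hypothetical helpers standing in for the node's communication assist, not a real API.

    /* Schematic only: a load to a nonlocal address becomes a request/response exchange. */
    typedef unsigned long addr_t;
    enum msg_type { READ_REQUEST, READ_RESPONSE };
    struct msg { enum msg_type type; addr_t addr; long value; int src; };

    /* Hypothetical primitives assumed by this sketch (not a real library). */
    extern int  node_of(addr_t addr);
    extern long local_read(addr_t addr);
    extern void send_msg(int node, const struct msg *m);
    extern void recv_msg(int node, struct msg *m);
    #define ANY_NODE (-1)

    long shared_load(addr_t addr, int my_node) {
        int home = node_of(addr);                 /* which node owns this address? */
        if (home == my_node)
            return local_read(addr);              /* local reference: an ordinary load */
        struct msg req = { READ_REQUEST, addr, 0, my_node };
        send_msg(home, &req);                     /* read-request across the network */
        struct msg resp;
        recv_msg(home, &resp);                    /* wait for the read-response */
        return resp.value;                        /* caching this value raises the coherence question above */
    }

    /* The home node's side: service incoming read-requests. */
    void serve_requests(int my_node) {
        for (;;) {
            struct msg req;
            recv_msg(ANY_NODE, &req);
            if (req.type == READ_REQUEST) {
                struct msg resp = { READ_RESPONSE, req.addr, local_read(req.addr), my_node };
                send_msg(req.src, &resp);
            }
        }
    }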
Example: Cray T3E
• Scale up to 1024 processors, 480 MB/s links
• Memory controller generates comm. request for nonlocal references
• No hardware mechanism for coherence (SGI Origin etc. provide this)
[Figure: Cray T3E node. The processor and cache connect to a memory controller and network interface, which link local memory and external I/O to a switch with X, Y, and Z connections into the 3D interconnect.]
Message Passing Architectures
Complete computer as building block, including I/O
• Communication via explicit I/O operations
Programming model: directly access only private address space (local memory), comm. via explicit messages (send/receive)
High-level block diagram similar to distributed-memory SAS
• But comm. integrated at I/O level, needn’t be into memory system
• Like networks of workstations (clusters), but tighter integration
• Easier to build than scalable SAS
Programming model more removed from basic hardware operations
• Library or OS intervention
Message-Passing Abstraction
• Send specifies buffer to be transmitted and receiving process
• Recv specifies sending process and application storage to receive into
• Memory-to-memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves pairwise synch event
– Other variants too
• Many overheads: copying, buffer management, protection
[Figure: message-passing abstraction. Process P executes Send(X, Q, t) on local address X; process Q executes Receive(Y, P, t) into local address Y; the send and receive match on process and tag t, copying data between the two local process address spaces.]
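The send/receive abstraction above is what message-passing libraries such as MPI implement. The sketch below is illustrative (the slides describe the abstraction generically, not MPI): process P sends from local address X to process Q with tag t, and Q receives into local address Y from P with the matching tag.

    /* Send X, Q, t  /  Receive Y, P, t  expressed with MPI point-to-point calls. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, tag = 7;
        double X = 3.14, Y = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* process P */
            MPI_Send(&X, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {     /* process Q */
            MPI_Recv(&Y, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Q received %f\n", Y);   /* memory-to-memory copy has completed */
        }

        MPI_Finalize();
        return 0;
    }

In its simplest (blocking) form the matching send/receive pair also acts as the pairwise synchronization event mentioned above.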
Evolution of Message-Passing Machines
Early machines: FIFO on each link
• Hw close to programming model; synchronous ops
• Replaced by DMA, enabling non-blocking ops
– Buffered by system at destination until recv
Diminishing role of topology
• Store&forward routing: topology important
• Introduction of pipelined routing made it less so
• Cost is in node-network interface
• Simplifies programming
[Figure: 3-dimensional hypercube topology with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111.]
Example: IBM SP-2
• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bw limited by I/O bus)
[Figure: IBM SP-2 node. A Power 2 CPU with L2 cache sits on the memory bus with a memory controller and 4-way interleaved DRAM; the network interface card (an i860-based NI with DMA) attaches to the MicroChannel I/O bus; nodes connect through a general interconnection network formed from 8-port switches.]
Example: Intel Paragon
[Figure: Intel Paragon node. Two i860 processors with L1 caches (one acting as message driver), a DMA engine, and a network interface share the 64-bit, 50 MHz memory bus with a memory controller and 4-way interleaved DRAM; nodes attach to a 2D grid network (8-bit, 175 MHz, bidirectional links) with a processing node at every switch. Also pictured: Sandia’s Intel Paragon XP/S-based supercomputer.]
Toward Architectural Convergence
Evolution and the role of software have blurred the boundary
• Send/recv supported on SAS machines via buffers
• Can construct global address space on MP using hashing
• Page-based (or finer-grained) shared virtual memory
Hardware organization converging too
• Tighter NI integration even for MP (low-latency, high-bandwidth)
• At lower level, even hardware SAS passes hardware messages
Even clusters of workstations/SMPs are parallel systems
• Emergence of fast system area networks (SAN)
Programming models distinct, but organizations converging
• Nodes connected by general network and communication assists
• Implementations also converging, at least in high-end machines
Data Parallel Systems
Programming model
• Operations performed in parallel on each element of data structure
• Logically single thread of control, performs sequential or parallel steps
• Conceptually, a processor associated with each data element
Architectural model
• Array of many simple, cheap processors with little memory each
– Processors don’t sequence through instructions
• Attached to a control processor that issues instructions
• Specialized and general communication, cheap global synchronization
SIMD: Single-Instruction Multiple-Data Streams
Flynn Classification: SISD, SIMD, MIMD, MISD
[Figure: a grid of processing elements (PEs) driven by a single control processor.]
Application of Data Parallelism
Each PE contains an employee record with his/her salary
If salary > 100K then
    salary = salary * 1.05
else
    salary = salary * 1.10
• Logically, the whole operation is a single step
• Some processors enabled for arithmetic operation, others disabled
Other examples:
• Finite differences, linear algebra, ...
• Document searching, graphics, image processing, ...
Some recent machines:
• Thinking Machines CM-1, CM-2 (and CM-5)
• Maspar MP-1 and MP-2
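A minimal C sketch of the salary update as a data-parallel operation: logically one step over all elements, with the if/else realized by enabling some elements and disabling others rather than by per-element control flow. The array contents are made up, and a plain loop stands in for the PE array (a vectorizing compiler would turn it into masked SIMD operations).

    #include <stdio.h>

    #define N 8

    int main(void) {
        double salary[N] = { 95e3, 120e3, 60e3, 150e3, 101e3, 99e3, 80e3, 200e3 };

        /* Conceptually, every iteration is one PE executing the same step;
           the condition selects which PEs are enabled for each arm. */
        for (int i = 0; i < N; i++) {
            if (salary[i] > 100e3)
                salary[i] *= 1.05;   /* PEs where the condition holds are enabled */
            else
                salary[i] *= 1.10;   /* the remaining PEs are enabled for this step */
        }

        for (int i = 0; i < N; i++)
            printf("%.2f\n", salary[i]);
        return 0;
    }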
Evolution and Convergence
Rigid control structure (SIMD in Flynn taxonomy)
• SISD = uni-processor, MIMD = multiprocessor
Popular when cost savings of centralized sequencer high
• 60s when CPU was a cabinet
• Replaced by vectors in mid-70s
– More flexible w.r.t. memory layout and easier to manage
• Revived in mid-80s when 32-bit datapath slices just fit on chip
• No longer true with modern microprocessors
Other reasons for demise
• Simple, regular applications have good locality, can do well anyway
• Loss of applicability due to hardwiring data parallelism
– MIMD machines as effective for data parallelism and more general
Prog. model converges with SPMD (single program multiple data)
• Contributes need for fast global synchronization
• Structured global address space, implemented with either SAS or MP
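A minimal SPMD sketch, assuming MPI (the slides name the model, not a library): every process runs this same program, its rank selects its share of the data, and a barrier plus reduction supply the fast global synchronization mentioned above.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Single program: each process works on its own slice of the iteration space. */
        long lo = (long)rank * N / nprocs, hi = (long)(rank + 1) * N / nprocs;
        double local = 0.0, total = 0.0;
        for (long i = lo; i < hi; i++)
            local += 1.0 / (double)(i + 1);

        MPI_Barrier(MPI_COMM_WORLD);      /* global synchronization point */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %f\n", total);

        MPI_Finalize();
        return 0;
    }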
Vector Machine for DP Model
Vector architectures are based on a single processor
• Multiple functional units
• All performing the same operation
• Instructions may specify large amounts of parallelism (e.g., 64-way) but hardware executes only a subset in parallel
Historically important
• Overtaken by MPPs in the 90s
Re-emerging in recent years
• At a large scale in the Earth Simulator (NEC SX6) and Cray X1
• At a small scale in SIMD media extensions to microprocessors
– SSE, SSE2 (Intel: Pentium/IA64)
– Altivec (IBM/Motorola/Apple: PowerPC)
– VIS (Sun: Sparc)
Key idea: Compiler does some of the difficult work of finding parallelism, so the hardware doesn’t have to
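As a small-scale example of the media extensions listed above, here is a hedged sketch using SSE intrinsics in C (x86-specific; the array values are illustrative): a single instruction performs four single-precision adds.

    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = { 1, 2, 3, 4 }, b[4] = { 10, 20, 30, 40 }, c[4];

        __m128 va = _mm_loadu_ps(a);     /* load four floats into one 128-bit register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* one instruction, four adds */
        _mm_storeu_ps(c, vc);

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }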
Vector Processors
Vector instructions operate on a vector of elements
• These are specified as operations on vector registers
• A supercomputer vector register holds ~32-64 elts
• The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
The hardware performs a full vector operation in
• #elements-per-vector-register / #pipes steps (e.g., a 64-element vector register with 2 pipes takes 32 steps)
[Figure: a scalar add (r3 = r1 + r2) next to a vector add (vr3 = vr1 + vr2), which logically performs #elts adds in parallel but actually performs #pipes adds in parallel per step.]
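A toy C model (not real hardware) of the relationship above: a VLEN-element vector add completes in VLEN / PIPES steps when only PIPES adds happen per step. VLEN and PIPES are made-up parameters chosen to match the numbers on this slide.

    #include <stdio.h>

    #define VLEN  64   /* elements per vector register */
    #define PIPES  2   /* parallel vector pipes (lanes) */

    /* One "step" of the outer loop models one cycle in which PIPES adds are issued together. */
    static void vadd(const double *vr1, const double *vr2, double *vr3) {
        int steps = 0;
        for (int base = 0; base < VLEN; base += PIPES) {
            for (int lane = 0; lane < PIPES; lane++)
                vr3[base + lane] = vr1[base + lane] + vr2[base + lane];
            steps++;
        }
        printf("completed in %d steps (= %d / %d)\n", steps, VLEN, PIPES);
    }

    int main(void) {
        double a[VLEN], b[VLEN], c[VLEN];
        for (int i = 0; i < VLEN; i++) { a[i] = i; b[i] = 2.0 * i; }
        vadd(a, b, c);
        return 0;
    }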
Cray X1 Node
[Figure (source: J. Levesque, Cray): four custom SSP blocks, each with a scalar unit (S), two vector units (V), and a 0.5 MB cache, sharing a 2 MB Ecache at a frequency of 400/800 MHz; 12.8 Gflops (64-bit) and 25.6 Gflops (32-bit) peak; bandwidths of 51 GB/s, 25-41 GB/s, and 25.6 GB/s (12.8-20.5 GB/s) to local memory and network.]
Cray X1 builds a larger “virtual vector”, called an MSP
• 4 SSPs (each a 2-pipe vector processor) make up an MSP
• Compiler will (try to) vectorize/parallelize across the MSP
Cray X1: Parallel Vector Arch.
Cray combines several technologies in the X1:
• 12.8 Gflop/s multi-streaming processors (MSP, vector proc)
• Shared caches (unusual on earlier vector machines)
• 4 processor nodes sharing up to 64 GB of memory
• Single System Image to 4096 processors
• Remote put/get between nodes (faster than MPI)
Earth Simulator Architecture
Parallel Vector Architecture
• High speed (vector) processors
• High memory bandwidth (vector architecture)
• Fast network (new crossbar switch)
Rearranging commodity parts can’t match this performance
Clusters have Arrived
What’s a Cluster?
Collection of independent computer systems working together as if a single system.
Coupled through a scalable, high bandwidth, low latency interconnect.
PC Clusters: Contributions of Beowulf
An experiment in parallel computing systems
Established vision of low cost, high end computing
Demonstrated effectiveness of PC clusters for some (not all) classes of applications
Provided networking software
Conveyed findings to broad community (great PR)
Tutorials and book
Design standard to rally community!
Standards beget: books, trained people, open source SW
Adapted from Gordon Bell, presentation at Salishan 2000
Clusters of SMPs
SMPs are the fastest commodity machine, so use them as a building block for a larger machine with a network
Common names:
• CLUMP = Cluster of SMPs
• Hierarchical machines, constellations
Most modern machines look like this:
• Millennium, IBM SPs, (not the T3E)...
What is the right programming model???
• Treat machine as “flat”, always use message passing, even within SMP (simple, but ignores an important part of memory hierarchy).
• Shared memory within one SMP, but message passing outside of an SMP.
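The second option above is commonly realized as message passing between SMP nodes plus shared memory within each node. A hedged MPI + OpenMP sketch (the slides name the model, not these libraries): threads on one node combine their work through shared memory, and nodes combine results with an explicit message-passing collective.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double node_sum = 0.0, total = 0.0;

        /* Inside one SMP: threads share memory and combine via a reduction. */
        #pragma omp parallel for reduction(+:node_sum)
        for (int i = 0; i < 1000; i++)
            node_sum += 1.0;            /* stand-in for per-node work */

        /* Across SMPs: explicit messages (here a collective) combine the results. */
        MPI_Reduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }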
Cluster of SMP Approach
A supercomputer is a stretched high-end server
Parallel system is built by assembling nodes that are modest size, commercial, SMP servers – just put more of them together
Image from LLNL
WSU CLUMP: Cluster of SMPs
[Figure: the symmetric multiprocessor (SMP) building block: several CPUs with caches sharing memory, I/O, and a network connection. The WSU CLUMP connects SMPs of 12, 4, and 30 UltraSPARCs over 1 Gb Ethernet and 155 Mb ATM.]
Scalable Computing
[Figure (due to Rajkumar Buyya, University of Melbourne, Australia, www.gridbus.org): the scalable computing spectrum, from personal devices through SMPs or supercomputers, local clusters, and enterprise clusters/grids to the global grid; performance and QoS grow as systems span wider administrative barriers (individual, group, department, campus, state, national, globe, inter-planet, universe).]
Computational/Data Grid
Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations
• Direct access to computers, sw, data, and other resources, rather than file exchange
• The sharing rules define a set of individuals and/or institutions, which form a virtual organization
• Examples of VOs: application service providers, storage service providers, cycle providers, etc.
• Grid computing aims to develop protocols, services, and tools for coordinated resource sharing and problem solving in VOs
• Security solutions for management of credentials and policies
• RM protocols and services for secure remote access
• Information query protocols and services for configuration
• Data management, etc.
Top 500 Supercomputers
Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark (Ax=b, dense problem)
- Dense LU factorization (dominated by matrix multiply)
Updated twice a year:
• SC‘xy in the States in November
• Meeting in Mannheim, Germany in June
• All data (and slides) available from www.top500.org
• Also measures N1/2 (size required to get ½ speed)
[Figure: LINPACK performance curve, rate (performance) versus problem size.]
Fastest Computer Over Time
In 1980 a computation that took 1 full year to complete can now be done in 1 month!
[Chart: Gflop/s (0-50) vs. year (1990-2000) for the fastest computer over time: Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048).]
Fastest Computer Over Time
In 1980 a computation that took 1 full year to complete can now be done in 4 days!
[Chart: Gflop/s (0-500) vs. year (1990-2000): Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040).]
Fastest Computer Over Time
In 1980 a computation that took 1 full year to complete can today be done in 1 hour!
[Chart: Gflop/s (0-5000) vs. year (1990-2000): Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), Intel ASCI Red Xeon (9632), SGI ASCI Blue Mountain (5040), ASCI Blue Pacific SST (5808), ASCI White Pacific (7424).]
BlueGene/L
• DOE/LLNL (Livermore)
• Up to 2^16 = 65536 compute nodes
• 3D torus and a combining tree network for global operations
• Each node: a compute ASIC (application-specific integrated circuit) and memory
Issues with Traditional Architecture Models
[Figure: application software and system software layered over divergent architectures (SIMD, message passing, shared memory, dataflow, systolic arrays).]
• Divergent architectures, with no predictable pattern of growth
• Uncertainty of direction paralyzed parallel software development
Convergence: Layers of Abstraction
[Figure: layers of abstraction. Parallel applications (CAD, database, scientific modeling) sit on programming models (multiprogramming, shared address, message passing, data parallel); beneath them, the communication abstraction marks the user/system boundary and is realized by compilation or library and operating systems support; communication hardware lies below the hardware/software boundary, on top of the physical communication medium.]
Convergence: Generic Organization
A generic modern multiprocessor
Node: processor(s), memory system, plus communication assist
• Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, now within framework
• Integration of assist with node, what operations, how efficiently...
[Figure: generic scalable multiprocessor organization: nodes, each with processor(s), cache, memory, and a communication assist (CA), connected by a scalable network.]