Parallel Architecture Overview
© C. Xu, 1999-2005
Recap: What is a Parallel Computer?
A parallel computer is a collection of processing elements that cooperate to solve large problems fast
Some broad issues:
• Resource allocation:
– how large a collection?
– how powerful are the elements?
– how much memory?
• Data access, communication and synchronization:
– how do the elements cooperate and communicate?
– how are data transmitted between processors?
– what are the abstractions and primitives for cooperation?
• Performance and scalability:
– how does it all translate into performance?
– how does it scale?
Parallel Computer Architecture
Computer architecture:
• Critical abstractions, boundaries, and primitives (interfaces)
• Organizational structures that implement interfaces (hw or sw)
Parallel Computer Architecture: extension of CA to support communication and cooperation
• CA: Instruction Set + Organization
• PCA: CA + Communication Architecture
Communication architecture defines the basic communication and synchronization operations and addresses the organizational structures that realize these operations
Communication Architecture
= User/System Interface + Implementation
User/System Interface:
• Comm. primitives exposed to user-level by hw and system-level sw
Implementation:
• Organizational structures that implement the primitives: hw or OS
• How optimized are they? How integrated into processing node?
• Structure of network
Goals:
• Performance
• Broad applicability
• Programmability
• Scalability
• Low cost
Evolution of Traditional Architecture Models
[Figure: application software and system software layered over divergent architectures (SIMD, message passing, shared memory, dataflow, systolic arrays).]
Historically, machines are tailored to programming models:
arch = programming model + abstraction + organization
Shared Address Space Architecture
Any processor can directly reference any memory location
• Communication occurs implicitly as a result of loads and stores
Programming model
• Similar to time-sharing on uniprocessors
– Except processes run on different processors
– Good throughput on multiprogrammed workloads
Naturally provided on wide range of platforms
• History dates at least to precursors of mainframes in early 60s
• Wide range of scale: few to hundreds of processors
Popularly known as shared memory machines or model
• Ambiguous: memory may be physically distributed among processors
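A minimal sketch (not from the slides) of this model in C with POSIX threads: the threads share one address space, and communication happens implicitly through ordinary stores and loads to a shared array. Names, sizes, and values are made up for illustration.

    /* Shared-address-space sketch: communication via plain loads and stores. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    static int shared_data[NTHREADS];     /* ordinary memory, visible to every thread */

    static void *worker(void *arg) {
        long id = (long)arg;
        shared_data[id] = (int)(id * 10); /* "communication" is just a store */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);     /* joining orders the stores before the loads below */
        for (int i = 0; i < NTHREADS; i++)
            printf("shared_data[%d] = %d\n", i, shared_data[i]);  /* plain loads see the stores */
        return 0;
    }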
Shared Address Space Abstraction
Process: virtual address space plus one or more threads of control
Portions of address spaces of processes are shared
[Figure: virtual address spaces for a collection of processes communicating via shared addresses. Each process P0 ... Pn has a private portion and a shared portion of its virtual address space; the shared portions map to common physical addresses in the machine physical address space, so a store by one process and a load by another reach the same location.]
• Natural extension of the uniprocessor model: conventional memory operations for communication, special atomic operations for synchronization
• OS uses shared memory to coordinate processes
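A hedged sketch of the bullet above ("conventional memory operations for communication, special atomic operations for synchronization") using C11 atomics and POSIX threads; the slides do not prescribe any particular API, and the names and values are illustrative.

    /* The payload is communicated with a plain store/load; an atomic flag synchronizes. */
    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    static int payload;                   /* ordinary shared memory */
    static atomic_int ready = 0;          /* special atomic operation for synchronization */

    static void *producer(void *arg) {
        (void)arg;
        payload = 42;                                             /* conventional store */
        atomic_store_explicit(&ready, 1, memory_order_release);   /* publish */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                                                      /* spin until published */
        printf("consumer read %d\n", payload);                     /* conventional load */
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }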
Communication Hardware
Also natural extension of uniprocessor
Already have processor, one or more memory modules and I/O controllers connected by hardware interconnect of some sort
Memory capacity increased by adding modules, I/O by controllers
• Add processors for processing!
• For higher-throughput multiprogramming, or parallel programs
[Figure: the uniprocessor organization (processor, memory, and I/O controller with I/O devices on an interconnect) extended by attaching additional processors, memory modules, and I/O controllers to the same interconnect.]
History
[Figure: “mainframe” organization (processors and I/O ports connected to memory modules through a crossbar) and “minicomputer” organization (processors with caches, memory modules, and I/O controllers sharing a single bus).]
“Mainframe” approach
• Motivated by multiprogramming
• Extends crossbar used for mem bw and I/O
• Originally processor cost limited to small scale
– later, cost of crossbar
• Bandwidth scales with p
• High incremental cost; use multistage instead
“Minicomputer” approach
• Almost all microprocessor systems have bus
• Motivated by multiprogramming, TP
• Used heavily for parallel computing
• Called symmetric multiprocessor (SMP)
• Latency larger than for uniprocessor
• Bus is bandwidth bottleneck
– caching is key: coherence problem
• Low incremental cost
Example: Intel Pentium Pro Quad
• All coherence and multiprocessing glue in processor module
• Highly integrated, targeted at high volume
• Low latency and bandwidth
[Figure: Intel Pentium Pro Quad block diagram. P-Pro processor modules (each with CPU, 256-KB L2 cache, interrupt controller, and bus interface) share the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller and MIU drive 1-, 2-, or 4-way interleaved DRAM, and PCI bridges connect PCI buses with I/O cards.]
Example: SUN Enterprise
• 16 cards of either type: processors + memory, or I/O
• All memory accessed over bus, so symmetric
• Higher bandwidth, higher latency bus
[Figure: Sun Enterprise block diagram. CPU/mem cards (each with two UltraSPARC processors, their caches, and a memory controller) and I/O cards (with SBus slots, 2 FiberChannel, 100bT, and SCSI interfaces behind a bus interface/switch) all attach to the Gigaplane bus (256-bit data, 41-bit address, 83 MHz).]
Scaling Up
• Dance-hall: bandwidth still scalable, but lower cost than crossbar
– latencies to memory uniform, but uniformly large
• Distributed memory or non-uniform memory access (NUMA)
– Construct shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response); see the sketch after the figure below
• Caching shared (particularly nonlocal) data?
[Figure: “dance hall” organization (all processors with caches on one side of the network, all memory modules on the other) vs. distributed-memory organization (a memory module attached to each processor node, with the nodes connected by the network).]
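To make the NUMA bullet above concrete, here is a schematic C sketch of a shared load built from read-request/read-response message transactions. node_of, local_read, send_msg, recv_msg, and ANY_NODE are hypothetical helpers standing in for the node's communication assist, not a real API.

    /* Schematic only: a load to a nonlocal address becomes a request/response exchange. */
    typedef unsigned long addr_t;
    enum msg_type { READ_REQUEST, READ_RESPONSE };
    struct msg { enum msg_type type; addr_t addr; long value; int src; };

    /* Hypothetical primitives assumed by this sketch (not a real library). */
    extern int  node_of(addr_t addr);
    extern long local_read(addr_t addr);
    extern void send_msg(int node, const struct msg *m);
    extern void recv_msg(int node, struct msg *m);
    #define ANY_NODE (-1)

    long shared_load(addr_t addr, int my_node) {
        int home = node_of(addr);                 /* which node owns this address? */
        if (home == my_node)
            return local_read(addr);              /* local reference: an ordinary load */
        struct msg req = { READ_REQUEST, addr, 0, my_node };
        send_msg(home, &req);                     /* read-request across the network */
        struct msg resp;
        recv_msg(home, &resp);                    /* wait for the read-response */
        return resp.value;                        /* caching this value raises the coherence question above */
    }

    /* The home node's side: service incoming read-requests. */
    void serve_requests(int my_node) {
        for (;;) {
            struct msg req;
            recv_msg(ANY_NODE, &req);
            if (req.type == READ_REQUEST) {
                struct msg resp = { READ_RESPONSE, req.addr, local_read(req.addr), my_node };
                send_msg(req.src, &resp);
            }
        }
    }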
Example: Cray T3E
• Scale up to 1024 processors, 480 MB/s links
• Memory controller generates comm. request for nonlocal references
• No hardware mechanism for coherence (SGI Origin etc. provide this)
[Figure: Cray T3E node. The processor and cache connect to a memory controller and network interface, which link local memory and external I/O to a switch with X, Y, and Z connections into the 3D interconnect.]
Message Passing Architectures
Complete computer as building block, including I/O
• Communication via explicit I/O operations
Programming model: directly access only private address space (local memory), comm. via explicit messages (send/receive)
High-level block diagram similar to distributed-memory SAS
• But comm. integrated at I/O level, needn’t be into memory system
• Like networks of workstations (clusters), but tighter integration
• Easier to build than scalable SAS
Programming model more removed from basic hardware operations
• Library or OS intervention
Message-Passing Abstraction
• Send specifies buffer to be transmitted and receiving process
• Recv specifies sending process and application storage to receive into
• Memory-to-memory copy, but need to name processes
• Optional tag on send and matching rule on receive
• User process names local data and entities in process/tag space too
• In simplest form, the send/recv match achieves pairwise synch event
– Other variants too
• Many overheads: copying, buffer management, protection
[Figure: message-passing abstraction. Process P executes Send(X, Q, t) on local address X; process Q executes Receive(Y, P, t) into local address Y; the send and receive match on process and tag t, copying data between the two local process address spaces.]
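The send/receive abstraction above is what message-passing libraries such as MPI implement. The sketch below is illustrative (the slides describe the abstraction generically, not MPI): process P sends from local address X to process Q with tag t, and Q receives into local address Y from P with the matching tag.

    /* Send X, Q, t  /  Receive Y, P, t  expressed with MPI point-to-point calls. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, tag = 7;
        double X = 3.14, Y = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {            /* process P */
            MPI_Send(&X, 1, MPI_DOUBLE, 1, tag, MPI_COMM_WORLD);
        } else if (rank == 1) {     /* process Q */
            MPI_Recv(&Y, 1, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("Q received %f\n", Y);   /* memory-to-memory copy has completed */
        }

        MPI_Finalize();
        return 0;
    }

In its simplest (blocking) form the matching send/receive pair also acts as the pairwise synchronization event mentioned above.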
Evolution of Message-Passing Machines
Early machines: FIFO on each link
• Hw close to programming model; synchronous ops
• Replaced by DMA, enabling non-blocking ops
– Buffered by system at destination until recv
Diminishing role of topology
• Store&forward routing: topology important
• Introduction of pipelined routing made it less so
• Cost is in node-network interface
• Simplifies programming
[Figure: 3-dimensional hypercube topology with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111.]
Example: IBM SP-2
• Made out of essentially complete RS6000 workstations
• Network interface integrated in I/O bus (bw limited by I/O bus)
[Figure: IBM SP-2 node. A Power 2 CPU with L2 cache sits on the memory bus with a memory controller and 4-way interleaved DRAM; the network interface card (an i860-based NI with DMA) attaches to the MicroChannel I/O bus; nodes connect through a general interconnection network formed from 8-port switches.]
Example: Intel Paragon
[Figure: Intel Paragon node. Two i860 processors with L1 caches (one acting as message driver), a DMA engine, and a network interface share the 64-bit, 50 MHz memory bus with a memory controller and 4-way interleaved DRAM; nodes attach to a 2D grid network (8-bit, 175 MHz, bidirectional links) with a processing node at every switch. Also pictured: Sandia’s Intel Paragon XP/S-based supercomputer.]
Toward Architectural Convergence
Evolution and the role of software have blurred the boundary
• Send/recv supported on SAS machines via buffers
• Can construct global address space on MP using hashing
• Page-based (or finer-grained) shared virtual memory
Hardware organization converging too
• Tighter NI integration even for MP (low-latency, high-bandwidth)
• At lower level, even hardware SAS passes hardware messages
Even clusters of workstations/SMPs are parallel systems
• Emergence of fast system area networks (SAN)
Programming models distinct, but organizations converging
• Nodes connected by general network and communication assists
• Implementations also converging, at least in high-end machines
Data Parallel Systems
Programming model
• Operations performed in parallel on each element of data structure
• Logically single thread of control, performs sequential or parallel steps
• Conceptually, a processor associated with each data element
Architectural model
• Array of many simple, cheap processors with little memory each
– Processors don’t sequence through instructions
• Attached to a control processor that issues instructions
• Specialized and general communication, cheap global synchronization
SIMD: Single-Instruction Multiple-Data Streams
Flynn Classification: SISD, SIMD, MIMD, MISD
[Figure: a grid of processing elements (PEs) driven by a single control processor.]
Application of Data Parallelism
Each PE contains an employee record with his/her salary
If salary > 100K then
    salary = salary * 1.05
else
    salary = salary * 1.10
• Logically, the whole operation is a single step
• Some processors enabled for arithmetic operation, others disabled
Other examples:
• Finite differences, linear algebra, ...
• Document searching, graphics, image processing, ...
Some recent machines:
• Thinking Machines CM-1, CM-2 (and CM-5)
• Maspar MP-1 and MP-2
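A minimal C sketch of the salary update as a data-parallel operation: logically one step over all elements, with the if/else realized by enabling some elements and disabling others rather than by per-element control flow. The array contents are made up, and a plain loop stands in for the PE array (a vectorizing compiler would turn it into masked SIMD operations).

    #include <stdio.h>

    #define N 8

    int main(void) {
        double salary[N] = { 95e3, 120e3, 60e3, 150e3, 101e3, 99e3, 80e3, 200e3 };

        /* Conceptually, every iteration is one PE executing the same step;
           the condition selects which PEs are enabled for each arm. */
        for (int i = 0; i < N; i++) {
            if (salary[i] > 100e3)
                salary[i] *= 1.05;   /* PEs where the condition holds are enabled */
            else
                salary[i] *= 1.10;   /* the remaining PEs are enabled for this step */
        }

        for (int i = 0; i < N; i++)
            printf("%.2f\n", salary[i]);
        return 0;
    }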
Evolution and Convergence
Rigid control structure (SIMD in Flynn taxonomy)
• SISD = uni-processor, MIMD = multiprocessor
Popular when cost savings of centralized sequencer high
• 60s when CPU was a cabinet
• Replaced by vectors in mid-70s
– More flexible w.r.t. memory layout and easier to manage
• Revived in mid-80s when 32-bit datapath slices just fit on chip
• No longer true with modern microprocessors
Other reasons for demise
• Simple, regular applications have good locality, can do well anyway
• Loss of applicability due to hardwiring data parallelism
– MIMD machines as effective for data parallelism and more general
Prog. model converges with SPMD (single program multiple data)
• Contributes need for fast global synchronization
• Structured global address space, implemented with either SAS or MP
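A minimal SPMD sketch, assuming MPI (the slides name the model, not a library): every process runs this same program, its rank selects its share of the data, and a barrier plus reduction supply the fast global synchronization mentioned above.

    #include <mpi.h>
    #include <stdio.h>

    #define N 1000000

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Single program: each process works on its own slice of the iteration space. */
        long lo = (long)rank * N / nprocs, hi = (long)(rank + 1) * N / nprocs;
        double local = 0.0, total = 0.0;
        for (long i = lo; i < hi; i++)
            local += 1.0 / (double)(i + 1);

        MPI_Barrier(MPI_COMM_WORLD);      /* global synchronization point */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %f\n", total);

        MPI_Finalize();
        return 0;
    }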
Vector Machine for DP Model
Vector architectures are based on a single processor
• Multiple functional units
• All performing the same operation
• Instructions may specify large amounts of parallelism (e.g., 64-way) but hardware executes only a subset in parallel
Historically important
• Overtaken by MPPs in the 90s
Re-emerging in recent years
• At a large scale in the Earth Simulator (NEC SX6) and Cray X1
• At a small scale in SIMD media extensions to microprocessors
– SSE, SSE2 (Intel: Pentium/IA64)
– Altivec (IBM/Motorola/Apple: PowerPC)
– VIS (Sun: Sparc)
Key idea: Compiler does some of the difficult work of finding parallelism, so the hardware doesn’t have to
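As a small-scale example of the media extensions listed above, here is a hedged sketch using SSE intrinsics in C (x86-specific; the array values are illustrative): a single instruction performs four single-precision adds.

    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void) {
        float a[4] = { 1, 2, 3, 4 }, b[4] = { 10, 20, 30, 40 }, c[4];

        __m128 va = _mm_loadu_ps(a);     /* load four floats into one 128-bit register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* one instruction, four adds */
        _mm_storeu_ps(c, vc);

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }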
Vector Processors
Vector instructions operate on a vector of elements
• These are specified as operations on vector registers
• A supercomputer vector register holds ~32-64 elts
• The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4
The hardware performs a full vector operation in
• #elements-per-vector-register / #pipes steps (e.g., a 64-element vector register with 2 pipes takes 32 steps)
[Figure: a scalar add (r3 = r1 + r2) next to a vector add (vr3 = vr1 + vr2), which logically performs #elts adds in parallel but actually performs #pipes adds in parallel per step.]
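A toy C model (not real hardware) of the relationship above: a VLEN-element vector add completes in VLEN / PIPES steps when only PIPES adds happen per step. VLEN and PIPES are made-up parameters chosen to match the numbers on this slide.

    #include <stdio.h>

    #define VLEN  64   /* elements per vector register */
    #define PIPES  2   /* parallel vector pipes (lanes) */

    /* One "step" of the outer loop models one cycle in which PIPES adds are issued together. */
    static void vadd(const double *vr1, const double *vr2, double *vr3) {
        int steps = 0;
        for (int base = 0; base < VLEN; base += PIPES) {
            for (int lane = 0; lane < PIPES; lane++)
                vr3[base + lane] = vr1[base + lane] + vr2[base + lane];
            steps++;
        }
        printf("completed in %d steps (= %d / %d)\n", steps, VLEN, PIPES);
    }

    int main(void) {
        double a[VLEN], b[VLEN], c[VLEN];
        for (int i = 0; i < VLEN; i++) { a[i] = i; b[i] = 2.0 * i; }
        vadd(a, b, c);
        return 0;
    }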
Cray X1 Node
[Figure (source: J. Levesque, Cray): four custom SSP blocks, each with a scalar unit (S), two vector units (V), and a 0.5 MB cache, sharing a 2 MB Ecache at a frequency of 400/800 MHz; 12.8 Gflops (64-bit) and 25.6 Gflops (32-bit) peak; bandwidths of 51 GB/s, 25-41 GB/s, and 25.6 GB/s (12.8-20.5 GB/s) to local memory and network.]
Cray X1 builds a larger “virtual vector”, called an MSP
• 4 SSPs (each a 2-pipe vector processor) make up an MSP
• Compiler will (try to) vectorize/parallelize across the MSP
Cray X1: Parallel Vector Arch.
Cray combines several technologies in the X1:
• 12.8 Gflop/s multi-streaming processors (MSP, vector proc)
• Shared caches (unusual on earlier vector machines)
• 4 processor nodes sharing up to 64 GB of memory
• Single System Image to 4096 processors
• Remote put/get between nodes (faster than MPI)
Earth Simulator Architecture
Parallel Vector Architecture
• High speed (vector) processors
• High memory bandwidth (vector architecture)
• Fast network (new crossbar switch)
Rearranging commodity parts can’t match this performance
Clusters have Arrived
What’s a Cluster?
Collection of independent computer systems working together as if a single system.
Coupled through a scalable, high bandwidth, low latency interconnect.
PC Clusters: Contributions of Beowulf
An experiment in parallel computing systems
Established vision of low cost, high end computing
Demonstrated effectiveness of PC clusters for some (not all) classes of applications
Provided networking software
Conveyed findings to broad community (great PR)
Tutorials and book
Design standard to rally community!
Standards beget: books, trained people, open source SW
Adapted from Gordon Bell, presentation at Salishan 2000
Clusters of SMPs
SMPs are the fastest commodity machine, so use them as a building block for a larger machine with a network
Common names:
• CLUMP = Cluster of SMPs
• Hierarchical machines, constellations
Most modern machines look like this:
• Millennium, IBM SPs, (not the T3E)...
What is the right programming model???
• Treat machine as “flat”, always use message passing, even within SMP (simple, but ignores an important part of memory hierarchy).
• Shared memory within one SMP, but message passing outside of an SMP.
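The second option above is commonly realized as message passing between SMP nodes plus shared memory within each node. A hedged MPI + OpenMP sketch (the slides name the model, not these libraries): threads on one node combine their work through shared memory, and nodes combine results with an explicit message-passing collective.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double node_sum = 0.0, total = 0.0;

        /* Inside one SMP: threads share memory and combine via a reduction. */
        #pragma omp parallel for reduction(+:node_sum)
        for (int i = 0; i < 1000; i++)
            node_sum += 1.0;            /* stand-in for per-node work */

        /* Across SMPs: explicit messages (here a collective) combine the results. */
        MPI_Reduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %f\n", total);

        MPI_Finalize();
        return 0;
    }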
Cluster of SMP Approach
A supercomputer is a stretched high-end server
Parallel system is built by assembling nodes that are modest size, commercial, SMP servers – just put more of them together
Image from LLNL
WSU CLUMP: Cluster of SMPs
[Figure: the symmetric multiprocessor (SMP) building block: several CPUs with caches sharing memory, I/O, and a network connection. The WSU CLUMP connects SMPs of 12, 4, and 30 UltraSPARCs over 1 Gb Ethernet and 155 Mb ATM.]
Scalable Computing
[Figure (due to Rajkumar Buyya, University of Melbourne, Australia, www.gridbus.org): the scalable computing spectrum, from personal devices through SMPs or supercomputers, local clusters, and enterprise clusters/grids to the global grid; performance and QoS grow as systems span wider administrative barriers (individual, group, department, campus, state, national, globe, inter-planet, universe).]
Computational/Data Grid
Coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organizations
• Direct access to computers, sw, data, and other resources, rather than file exchange
• The sharing rules define a set of individuals and/or institutions, which form a virtual organization
• Examples of VOs: application service providers, storage service providers, cycle providers, etc.
• Grid computing aims to develop protocols, services, and tools for coordinated resource sharing and problem solving in VOs
• Security solutions for management of credentials and policies
• RM protocols and services for secure remote access
• Information query protocols and services for configuration
• Data management, etc.
Top 500 Supercomputers
Listing of the 500 most powerful computers in the world
- Yardstick: Rmax from the LINPACK MPP benchmark (Ax=b, dense problem)
- Dense LU factorization (dominated by matrix multiply)
Updated twice a year:
• SC‘xy in the States in November
• Meeting in Mannheim, Germany in June
• All data (and slides) available from www.top500.org
• Also measures N1/2 (size required to get ½ speed)
[Figure: LINPACK performance curve, rate (performance) versus problem size.]
Fastest Computer Over Time
In 1980 a computation that took 1 full year to complete can now be done in 1 month!
[Chart: Gflop/s (0-50) vs. year (1990-2000) for the fastest computer over time: Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048).]
Fastest Computer Over Time
In 1980 a computation that took 1 full year to complete can now be done in 4 days!
[Chart: Gflop/s (0-500) vs. year (1990-2000): Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040).]
Fastest Computer Over Time
In 1980 a computation that took 1 full year to complete can today be done in 1 hour!
[Chart: Gflop/s (0-5000) vs. year (1990-2000): Cray Y-MP (8), Fujitsu VP-2600, TMC CM-2 (2048), NEC SX-3 (4), TMC CM-5 (1024), Fujitsu VPP-500 (140), Intel Paragon (6788), Hitachi CP-PACS (2040), Intel ASCI Red (9152), Intel ASCI Red Xeon (9632), SGI ASCI Blue Mountain (5040), ASCI Blue Pacific SST (5808), ASCI White Pacific (7424).]
BlueGene/L
• DOE/LLNL (Livermore)
• Up to 2^16 = 65536 compute nodes
• 3D torus and a combining tree network for global operations
• Each node: a compute ASIC (application-specific integrated circuit) and memory
Issues with Traditional Architecture Models
[Figure: application software and system software layered over divergent architectures (SIMD, message passing, shared memory, dataflow, systolic arrays).]
• Divergent architectures, with no predictable pattern of growth
• Uncertainty of direction paralyzed parallel software development
Convergence: Layers of Abstraction
[Figure: layers of abstraction. Parallel applications (CAD, database, scientific modeling) sit on programming models (multiprogramming, shared address, message passing, data parallel); beneath them, the communication abstraction marks the user/system boundary and is realized by compilation or library and operating systems support; communication hardware lies below the hardware/software boundary, on top of the physical communication medium.]
Convergence: Generic Organization
A generic modern multiprocessor
Node: processor(s), memory system, plus communication assist
• Network interface and communication controller
• Scalable network
• Convergence allows lots of innovation, now within framework
• Integration of assist with node, what operations, how efficiently...
[Figure: generic scalable multiprocessor organization: nodes, each with processor(s), cache, memory, and a communication assist (CA), connected by a scalable network.]