PSU CS 106 Computing Fundamentals II Sample Architectures HM 4/14/2008.
PSU CS 106 Computing Fundamentals II
Sample Architectures
HM 4/14/2008
2
Agenda
• Single Accumulator Architecture
• General-Purpose Register Architecture
• Stack Machine Architecture
• Pipelined Architecture
• Vector Architecture
• Shared-Memory Multiprocessor Architecture
• Distributed-Memory Multiprocessor Architecture
• Systolic Architecture
• Superscalar Architecture
• VLIW Architecture
3
Single Accumulator Architecture
Single Accumulator (SAA) Architecture:
• Single register to hold operation results
• Conventionally called the accumulator
• Accumulator used as the destination of arithmetic operations, and as (one) source
• Has a central processing unit, a memory unit, and a connecting memory bus
• pc points to the next instruction (in memory) to be executed
• Sample: ENIAC
[Diagram: CPU with accumulator and pc, connected via bus to main memory]
4
General-Purpose Reg. Architecture
General-Purpose Register (GPR) Architecture
• Accumulates ALU results in more than one register, n typically 4, 8, 16, .. 64
• Allows register-to-register operations, fast!
• Essentially a multi-register extension of the SAA
• Two-address architecture specifies one source operand plus the destination
• Three-address architecture specifies two source operands plus the destination
• Variations allow additional index registers, base registers, etc.
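The two-address vs. three-address distinction can be made concrete with a small simulation. This sketch is illustrative only (the register names and helper functions are mine, not from the slides): the same computation r2 = r0 + r1 is expressed in both forms over a dictionary standing in for the register file.

```python
# Hypothetical sketch: two-address vs. three-address register operations.

def two_address_add(regs, dst, src):
    # "add dst, src": dst is both a source and the destination
    regs[dst] = regs[dst] + regs[src]

def three_address_add(regs, dst, src1, src2):
    # "add dst, src1, src2": both sources are named explicitly
    regs[dst] = regs[src1] + regs[src2]

regs = {"r0": 5, "r1": 7, "r2": 0}
three_address_add(regs, "r2", "r0", "r1")   # one instruction suffices
print(regs["r2"])                           # 12

regs2 = {"r0": 5, "r1": 7, "r2": 0}
regs2["r2"] = regs2["r0"]                   # mov r2, r0  (extra copy needed)
two_address_add(regs2, "r2", "r1")          # add r2, r1
print(regs2["r2"])                          # 12
```

Note the cost of the two-address form: when both sources must be preserved, an extra register-to-register copy precedes the operation.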
[Diagram: pc and registers r0, r1, r2, .. r(n-1), connected to main memory]
5
Stack Machine Architecture
Stack Machine Architecture (SA)
• AKA zero-address architecture, as operations need no explicit operands
• A pure stack machine has no registers; of course there are no pure SAs
• Hence performance would be slow/poor, as all operations involve memory
• However: implement the n top-of-stack elements as registers: a cache
• Sample architectures: Burroughs B5000, HP 3000
• Implement impure stack operations that bypass tos operand addressing
• Example code sequence to compute
res := a * ( 145 + b )   -- high-level source

push a        -- push value of a
pushlit 145   -- push literal 145
push b        -- push value of b
add           -- tos = 145 + b
mult          -- tos = a * (145 + b)
pop res       -- store tos into res
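The sequence above can be traced with a minimal interpreter. This is a sketch of my own (the opcode spelling follows the slide; the `run` function and memory layout are assumptions), showing that every operation works only on the top of a single stack:

```python
# Minimal zero-address machine: each op consumes/produces stack tops.

def run(program, memory):
    stack = []
    for op, *args in program:
        if op == "push":            # push value of a named memory cell
            stack.append(memory[args[0]])
        elif op == "pushlit":       # push a literal
            stack.append(args[0])
        elif op == "add":           # add top two elements
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mult":          # multiply top two elements
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "pop":           # store top into a memory cell
            memory[args[0]] = stack.pop()
    return memory

mem = {"a": 3, "b": 5, "res": 0}
run([("push", "a"), ("pushlit", 145), ("push", "b"),
     ("add",), ("mult",), ("pop", "res")], mem)
print(mem["res"])   # 3 * (145 + 5) = 450
```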
[Diagram: pc and stack with tos pointer; memory holds instructions, static data, and free space]
6
Pipelined Architecture
Pipelined Architecture (PA)
• Arithmetic Logic Unit (ALU) split into separate, sequential subunits
• Each subunit can be initiated once per cycle
• Yet each subunit is implemented in HW just once
• Multiple subunits operate in parallel on different sub-ops, at different stages
• Ideally, all subunits require unit time (1 cycle)
• Ideally, all operations (add, fetch, store) take the same number of steps
• Non-unit times and differing numbers of cycles per operation cause different termination moments
• Operation aborted in case of branch, exception, call, etc.
• Operation must stall in case of operand dependence; the stall is caused by the interlock
7
Pipelined Architecture, Cont’d
[Diagram: two instructions overlapping in the pipeline stages I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store; the second instruction enters each stage one cycle behind the first]
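The benefit of the overlap can be quantified with a small model. This sketch is an assumption of mine (the slide gives no formula): with k unit-time stages and no stalls, a pipeline finishes n independent instructions in k + (n - 1) cycles instead of the k * n a non-pipelined unit would need.

```python
# Idealized pipeline timing: unit-time stages, no stalls or branches.

STAGES = ["I-Fetch", "Decode", "O1-Fetch", "O2-Fetch", "ALU op", "R Store"]

def sequential_cycles(n, k=len(STAGES)):
    # non-pipelined: each instruction occupies the whole unit
    return k * n

def pipelined_cycles(n, k=len(STAGES)):
    # one instruction enters per cycle; the last finishes k - 1 cycles later
    return k + (n - 1)

print(sequential_cycles(2))   # 12
print(pipelined_cycles(2))    # 7
```

For large n the pipelined time approaches one instruction per cycle, which is why stalls and aborted operations (branches, interlocks) matter so much in practice.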
8
Vector Architecture (VA)
• Vector registers implemented as HW arrays of identical registers, named Vr0, Vr1, etc.
• A VA may also have scalar registers, named r0, r1, etc.
• Vector registers can load/store a block of contiguous data
• Still in sequence, but overlapped; the number of steps to complete a load/store of a vector also depends on the width of the bus
• Vector registers can perform multiple operations of the same kind on blocks of operands
• Still sequentially, but overlapped, and all operands are readily available
• Otherwise operation of a VA is similar to the GPR architecture
9
Vector Architecture, Cont’d
• Sample operations:

ldv    vr1, memi          -- loads e.g. 64 memory locations
stv    vr2, memj          -- stores vr2 in 64 contiguous locs
vadd   vr1, vr2, vr3      -- register-register vector addition
cvaddf r0, vr1, vr2, vr3  -- has special, conditional meaning:
                          -- sequential equivalent:
for i = 0 to 63 do
  if bit i in r0 is 1 then
    vr1[i] = vr2[i] + vr3[i]
  end if
end for
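The loop above can be restated as executable code. This is a hedged sketch of the cvaddf semantics as I read the slide, with plain Python lists standing in for vector registers and an integer bit mask for r0 (vector length is taken from the register rather than fixed at 64):

```python
# Conditional (masked) vector add: element i updates only if bit i of r0 is set.

def cvaddf(r0, vr1, vr2, vr3):
    for i in range(len(vr1)):
        if (r0 >> i) & 1:
            vr1[i] = vr2[i] + vr3[i]
    return vr1

vr1 = [0, 0, 0, 0]
cvaddf(0b0101, vr1, [1, 2, 3, 4], [10, 20, 30, 40])
print(vr1)   # [11, 0, 33, 0] -- only elements 0 and 2 were updated
```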
[Diagram: vector register Vr0 with elements Vr0[0], Vr0[1], Vr0[2], Vr0[3], .., Vr0[n-1]]
10
Shared-Memory Multiprocessor
Shared Memory Architecture (SMA)
• Equal access to memory for all n processors, p0 to pn-1; possible to have additional, local memories
• If multiple simultaneous accesses occur, only one will succeed in accessing shared memory
• Simultaneous access must be resolved deterministically
• The von Neumann bottleneck becomes even tighter than for a conventional uniprocessor system
• If locality is good, loads are roughly twice as frequent as stores, and arithmetic operations dominate memory accesses, then resource utilization is good
• Typically # loads = (2 to 3) times the # stores
• Else some processors idle due to memory conflicts
• Typical number of processors n = 4, but n = 8 and greater is possible, with a large 2nd-level cache, even a 3rd level
11
Shared-Memory Multiprocessor
[Diagram: processors P0, P1, .., Pn-1, each with a cache, connected through an arbiter to shared memory]
12
Distributed-Memory Multiprocessor
Distributed Memory Architecture (DMA)
• All memories are private
• Hence each processor pi always has access to its own memory Memi
• However, the collection of all memories is the program’s logical data space
• Thus, processors must access one another’s memories
• Done via Message Passing or Virtual Shared Memory
• Messages must be routed, and the route determined; the route may be long
• Blocking when: a message is expected but hasn’t arrived yet
• Blocking when: a message is to be sent, but the destination cannot receive
• Growing the message buffer size increases the illusion of asynchronicity of sending and receiving operations
• Key parameter: time and packet overhead to send an empty message
• Messages may also be delayed by network congestion
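The message-passing behavior described above can be sketched with threads standing in for processors. This is an illustration of mine, not from the slides: each "processor" keeps its data private, the channel is a bounded buffer (so a sender would block when it fills), and the receiver blocks until the expected message arrives.

```python
# Two "processors" (threads) with private data, communicating by message passing.
import queue
import threading

channel = queue.Queue(maxsize=4)    # bounded buffer between p0 and p1

def p0():
    local = 21                      # private memory of p0
    channel.put(local * 2)          # send; would block if the buffer were full

def p1(result):
    result.append(channel.get())    # blocks until p0's message arrives

result = []
t0 = threading.Thread(target=p0)
t1 = threading.Thread(target=p1, args=(result,))
t1.start()                          # receiver starts first and blocks
t0.start()
t0.join(); t1.join()
print(result[0])                    # 42
```

Starting the receiver first demonstrates the blocking receive: it simply waits until the sender's message is routed through the buffer.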
13
Distributed-Memory Multiprocessor
[Diagram: processors P0, P1, .., Pn-1, each with a cache and a private memory Mem 0, Mem 1, .., Mem n-1, connected via a network]
14
Systolic Array Multiprocessor
Systolic Array (SA) Architecture
• Each processor has private memory
• The network is defined by the Systolic Pathway (SP)
• Each node is connected via SP to some subset of the other processors
• Node connectivity: determined by the implemented/selected network topology
• The systolic pathway is a high-performance network; sending and receiving may be synchronized (blocking) or asynchronous (data received are buffered)
• Typical network topologies: ring, torus, hex grid, mesh, etc.
• The sample below is a ring; note that the wrap-arounds along the x and y dimensions are not shown
• A processor can write to the x or y gate; this sends a word off on the x or y SP
• A processor can read from the x or y gate; this consumes a word from the x or y SP
• A buffered SA can write to a gate even if the receiver cannot read
• Reading from a gate when no message is available blocks
• Automatic code generation for a non-buffered SA is hard; the compiler must keep track of interprocessor synchronization
• Can view the SP as an extension of memory with infinite capacity, but with sequential access
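A unidirectional systolic ring can be mimicked in a few lines. This sketch is entirely an assumption of mine (topology, buffering, and the running-sum computation are chosen for illustration): each node reads a word from its buffered x gate, combines it with its local value, and writes the result to the next node's gate, so a "wave" circulates once around the ring.

```python
# Buffered systolic ring: gates are modeled as FIFO buffers per node.

def ring_sum(local_values):
    n = len(local_values)
    x_gates = [[] for _ in range(n)]   # buffered x pathway into each node
    x_gates[0].append(0)               # node 0 launches the systolic wave
    for step in range(n):
        node = step % n
        word = x_gates[node].pop(0)    # read from x gate (would block if empty)
        # add local value, forward along the x pathway to the next node
        x_gates[(node + 1) % n].append(word + local_values[node])
    return x_gates[0].pop(0)           # wave arrives back at node 0

print(ring_sum([1, 2, 3, 4]))   # 10
```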
15
Systolic Array Multiprocessor
[Diagram: ring of nodes P0, P1, P2, .., Pn-1, each with a cache and a private memory Mem 0 .. Mem n-1, linked by x and y pathways]
16
Systolic Array Multiprocessor
• Note that each pathway, x or y, may be bi-directional
• May have any number of pathways; nothing magic about 2 (x and y)
• Possible to have I/O capability at each node
• The next example shows a torus (without displaying the wrap-around pathways)
[Diagram: 2-D torus of nodes P0,0 .. Pn-1,n-1, each with a cache, connected by x and y pathways; wrap-around links not shown]
17
Hybrid Multiprocessor - Superscalar
Superscalar (SSA) Architecture
• Is a scalar architecture w.r.t. object code
• Is a parallel architecture, with multiple copies of some hardware units
• Has multiple ALUs, possibly FP add (FPA), FP multiply (FPM), and more than one integer unit
• Arithmetic operations proceed simultaneously with load and store operations
• The code sequence looks like a sequence of instructions for a scalar processor
• Object code can be custom-tailored by the compiler
• Fetch enough instruction bytes to support the longest possible object sequence
• Decoding is the bottleneck for CISC, easy for RISC 32-bit units
• Sample superscalar: the i80860 has FPA, FPM, 2 integer ops, load, store with pre-/post-increment and decrement
18
Hybrid Multiprocessor – SSA & PA
[Diagram: instructions 1-9 flowing through the stages IF, DE, EX, WB, with several instructions in flight per cycle]
Pipelined + Superscalar Architecture
19
Hybrid Multiprocessor - VLIW
VLIW Architecture (VLIW)
• Very Long Instruction Word
– typically 128 bits or more
– below 128: LIW
• Object code no longer purely scalar
• Some special-select opcodes designed to support parallel execution
• Compiler/programmer explicitly packs VLIW ops
• Other opcodes are still scalar, and can coexist with VLIW instructions
• Scalar operation possible by placing no-ops into some VLIW fields
• Sample: Compute instruction of CMU Warp® and Intel iWarp®
• Data dependence example: the result of FPA cannot be used as an operand for FPM in the same VLIW instruction
• Thus, need to software-pipeline; not discussed in CS 106
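The packing of operations into one VLIW word, with no-ops filling unused fields, can be sketched as follows. The slot names and opcode strings here are hypothetical, chosen only to match the functional units mentioned above; real VLIW encodings are fixed-width bit fields, not dictionaries.

```python
# Hypothetical VLIW word with one field per functional unit.

SLOTS = ("fpa", "fpm", "int", "mem")   # assumed slot layout

def pack_vliw(ops):
    # ops: mapping from slot name to an operation string;
    # every slot not supplied is filled with an explicit no-op
    return {slot: ops.get(slot, "nop") for slot in SLOTS}

w = pack_vliw({"fpa": "fadd f1,f2,f3", "mem": "load r1,a"})
print(w["fpm"])   # nop -- no FP multiply issued in this word
```

A purely scalar program then degenerates to words with exactly one non-nop field, which is why VLIW code density suffers when the compiler cannot find independent operations to pack.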
20
Hybrid Multiprocessor
[Diagram: one single VLIW instruction with multiple operation fields]
21
Hybrid Multiprocessor
EPIC Architecture (EA)
• Groups instructions into bundles
• Straightens out branches by associating a predicate with instructions
• Executes instructions in parallel, say the else-clause and the then-clause of an if statement
• Decides at run time which of the predicates is true, and commits just that part
• Uses speculation to straighten the branch tree
• Uses a rotating register file, AKA register windows
• Provides many registers, not just 64
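The predication idea can be illustrated in miniature. This sketch is mine (EPIC hardware does this with predicate registers over bundled instructions, not Python conditionals): both arms of the if are evaluated unconditionally, and the predicate decides at the end which result is committed, so no branch is taken.

```python
# Predicated execution of if (x < 0) then -x else x, with no branch
# in the "computation" part: both arms run, one result is committed.

def predicated_abs(x):
    p = x < 0                 # predicate computed once
    then_val = -x             # then-clause, executed unconditionally
    else_val = x              # else-clause, executed unconditionally
    return then_val if p else else_val   # commit the arm whose predicate holds

print(predicated_abs(-7))   # 7
print(predicated_abs(3))    # 3
```

The win is that a hard-to-predict branch is replaced by straight-line code; the cost is executing both arms, which EPIC absorbs with its wide issue and large register file.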