PSU CS 106 Computing Fundamentals II Sample Architectures HM 4/14/2008.
PSU CS 106 Computing Fundamentals II
Sample Architectures
HM 4/14/2008
2
Agenda
• Single Accumulator Architecture
• General-Purpose Register Architecture
• Stack Machine Architecture
• Pipelined Architecture
• Vector Architecture
• Shared-Memory Multiprocessor Architecture
• Distributed-Memory Multiprocessor Architecture
• Systolic Architecture
• Superscalar Architecture
• VLIW Architecture
3
Single Accumulator Architecture
Single Accumulator (SAA) Architecture:
• Single register to hold operation results
• Conventionally called the accumulator
• Accumulator used as the destination of arithmetic operations, and as (one) source
• Has a central processing unit, a memory unit, and a connecting memory bus
• pc points to the next instruction (in memory) to be executed
• Sample: ENIAC
[Diagram: CPU with accumulator and pc, connected via bus to main memory]
4
General-Purpose Reg. Architecture
General-Purpose Register (GPR) Architecture
• Accumulates ALU results in more than one register, n typically 4, 8, 16, .. 64
• Allows register-to-register operations, fast!
• Essentially a multi-register extension of the SAA
• Two-address architecture specifies one source operand plus the destination
• Three-address architecture specifies two source operands plus the destination
• Variations allow additional index registers, base registers, etc.
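The two-address vs. three-address distinction can be made concrete with a small simulation. This sketch is illustrative only (the register names and helper functions are mine, not from the slides): the same computation r2 = r0 + r1 is expressed in both forms over a dictionary standing in for the register file.

```python
# Hypothetical sketch: two-address vs. three-address register operations.

def two_address_add(regs, dst, src):
    # "add dst, src": dst is both a source and the destination
    regs[dst] = regs[dst] + regs[src]

def three_address_add(regs, dst, src1, src2):
    # "add dst, src1, src2": both sources are named explicitly
    regs[dst] = regs[src1] + regs[src2]

regs = {"r0": 5, "r1": 7, "r2": 0}
three_address_add(regs, "r2", "r0", "r1")   # one instruction suffices
print(regs["r2"])                           # 12

regs2 = {"r0": 5, "r1": 7, "r2": 0}
regs2["r2"] = regs2["r0"]                   # mov r2, r0  (extra copy needed)
two_address_add(regs2, "r2", "r1")          # add r2, r1
print(regs2["r2"])                          # 12
```

Note the cost of the two-address form: when both sources must be preserved, an extra register-to-register copy precedes the operation.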
[Diagram: pc and registers r0, r1, r2, .. r(n-1), connected to main memory]
5
Stack Machine Architecture
Stack Machine Architecture (SA)
• AKA zero-address architecture, as operations need no explicit operands
• A pure stack machine has no registers; of course there are no pure SAs
• Hence performance would be slow/poor, as all operations involve memory
• However: implement the n top-of-stack elements as registers: a cache
• Sample architectures: Burroughs B5000, HP 3000
• Implement impure stack operations that bypass tos operand addressing
• Example code sequence to compute
res := a * ( 145 + b )   -- high-level source

push a        -- push value of a
pushlit 145   -- push literal 145
push b        -- push value of b
add           -- tos = 145 + b
mult          -- tos = a * (145 + b)
pop res       -- store tos into res
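The sequence above can be traced with a minimal interpreter. This is a sketch of my own (the opcode spelling follows the slide; the `run` function and memory layout are assumptions), showing that every operation works only on the top of a single stack:

```python
# Minimal zero-address machine: each op consumes/produces stack tops.

def run(program, memory):
    stack = []
    for op, *args in program:
        if op == "push":            # push value of a named memory cell
            stack.append(memory[args[0]])
        elif op == "pushlit":       # push a literal
            stack.append(args[0])
        elif op == "add":           # add top two elements
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mult":          # multiply top two elements
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "pop":           # store top into a memory cell
            memory[args[0]] = stack.pop()
    return memory

mem = {"a": 3, "b": 5, "res": 0}
run([("push", "a"), ("pushlit", 145), ("push", "b"),
     ("add",), ("mult",), ("pop", "res")], mem)
print(mem["res"])   # 3 * (145 + 5) = 450
```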
[Diagram: pc and stack with tos pointer; memory holds instructions, static data, and free space]
6
Pipelined Architecture
Pipelined Architecture (PA)
• Arithmetic Logic Unit (ALU) split into separate, sequential subunits
• Each subunit can be initiated once per cycle
• Yet each subunit is implemented in HW just once
• Multiple subunits operate in parallel on different sub-ops, at different stages
• Ideally, all subunits require unit time (1 cycle)
• Ideally, all operations (add, fetch, store) take the same number of steps
• Non-unit times and differing numbers of cycles per operation cause different termination moments
• Operation aborted in case of branch, exception, call, etc.
• Operation must stall in case of operand dependence; the stall is caused by the interlock
7
Pipelined Architecture, Cont’d
[Diagram: two instructions overlapping in the pipeline stages I-Fetch, Decode, O1-Fetch, O2-Fetch, ALU op, R Store; the second instruction enters each stage one cycle behind the first]
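The benefit of the overlap can be quantified with a small model. This sketch is an assumption of mine (the slide gives no formula): with k unit-time stages and no stalls, a pipeline finishes n independent instructions in k + (n - 1) cycles instead of the k * n a non-pipelined unit would need.

```python
# Idealized pipeline timing: unit-time stages, no stalls or branches.

STAGES = ["I-Fetch", "Decode", "O1-Fetch", "O2-Fetch", "ALU op", "R Store"]

def sequential_cycles(n, k=len(STAGES)):
    # non-pipelined: each instruction occupies the whole unit
    return k * n

def pipelined_cycles(n, k=len(STAGES)):
    # one instruction enters per cycle; the last finishes k - 1 cycles later
    return k + (n - 1)

print(sequential_cycles(2))   # 12
print(pipelined_cycles(2))    # 7
```

For large n the pipelined time approaches one instruction per cycle, which is why stalls and aborted operations (branches, interlocks) matter so much in practice.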
8
Vector Architecture (VA)
• Vector registers implemented as HW arrays of identical registers, named Vr0, Vr1, etc.
• A VA may also have scalar registers, named r0, r1, etc.
• Vector registers can load/store a block of contiguous data
• Still in sequence, but overlapped; the number of steps to complete a load/store of a vector also depends on the width of the bus
• Vector registers can perform multiple operations of the same kind on blocks of operands
• Still sequentially, but overlapped, and all operands are readily available
• Otherwise operation of a VA is similar to the GPR architecture
9
Vector Architecture, Cont’d
• Sample operations:

ldv    vr1, memi          -- loads e.g. 64 memory locations
stv    vr2, memj          -- stores vr2 in 64 contiguous locs
vadd   vr1, vr2, vr3      -- register-register vector addition
cvaddf r0, vr1, vr2, vr3  -- has special, conditional meaning:
                          -- sequential equivalent:
for i = 0 to 63 do
  if bit i in r0 is 1 then
    vr1[i] = vr2[i] + vr3[i]
  end if
end for
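The loop above can be restated as executable code. This is a hedged sketch of the cvaddf semantics as I read the slide, with plain Python lists standing in for vector registers and an integer bit mask for r0 (vector length is taken from the register rather than fixed at 64):

```python
# Conditional (masked) vector add: element i updates only if bit i of r0 is set.

def cvaddf(r0, vr1, vr2, vr3):
    for i in range(len(vr1)):
        if (r0 >> i) & 1:
            vr1[i] = vr2[i] + vr3[i]
    return vr1

vr1 = [0, 0, 0, 0]
cvaddf(0b0101, vr1, [1, 2, 3, 4], [10, 20, 30, 40])
print(vr1)   # [11, 0, 33, 0] -- only elements 0 and 2 were updated
```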
[Diagram: vector register Vr0 with elements Vr0[0], Vr0[1], Vr0[2], Vr0[3], .., Vr0[n-1]]
10
Shared-Memory Multiprocessor
Shared Memory Architecture (SMA)
• Equal access to memory for all n processors, p0 to pn-1; possible to have additional, local memories
• If multiple simultaneous accesses occur, only one will succeed in accessing shared memory
• Simultaneous access must be resolved deterministically
• The von Neumann bottleneck becomes even tighter than for a conventional uniprocessor system
• If locality is good, loads are roughly twice as frequent as stores, and arithmetic operations dominate memory accesses, then resource utilization is good
• Typically # loads = (2 to 3) times the # stores
• Else some processors idle due to memory conflicts
• Typical number of processors n = 4, but n = 8 and greater is possible, with a large 2nd-level cache, even a 3rd level
11
Shared-Memory Multiprocessor
[Diagram: processors P0, P1, .., Pn-1, each with a cache, connected through an arbiter to shared memory]
12
Distributed-Memory Multiprocessor
Distributed Memory Architecture (DMA)
• All memories are private
• Hence each processor pi always has access to its own memory Memi
• However, the collection of all memories is the program’s logical data space
• Thus, processors must access one another’s memories
• Done via Message Passing or Virtual Shared Memory
• Messages must be routed, and the route determined; the route may be long
• Blocking when: a message is expected but hasn’t arrived yet
• Blocking when: a message is to be sent, but the destination cannot receive
• Growing the message buffer size increases the illusion of asynchronicity of sending and receiving operations
• Key parameter: time and packet overhead to send an empty message
• Messages may also be delayed by network congestion
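The message-passing behavior described above can be sketched with threads standing in for processors. This is an illustration of mine, not from the slides: each "processor" keeps its data private, the channel is a bounded buffer (so a sender would block when it fills), and the receiver blocks until the expected message arrives.

```python
# Two "processors" (threads) with private data, communicating by message passing.
import queue
import threading

channel = queue.Queue(maxsize=4)    # bounded buffer between p0 and p1

def p0():
    local = 21                      # private memory of p0
    channel.put(local * 2)          # send; would block if the buffer were full

def p1(result):
    result.append(channel.get())    # blocks until p0's message arrives

result = []
t0 = threading.Thread(target=p0)
t1 = threading.Thread(target=p1, args=(result,))
t1.start()                          # receiver starts first and blocks
t0.start()
t0.join(); t1.join()
print(result[0])                    # 42
```

Starting the receiver first demonstrates the blocking receive: it simply waits until the sender's message is routed through the buffer.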
13
Distributed-Memory Multiprocessor
[Diagram: processors P0, P1, .., Pn-1, each with a cache and a private memory Mem 0, Mem 1, .., Mem n-1, connected via a network]
14
Systolic Array Multiprocessor
Systolic Array (SA) Architecture
• Each processor has private memory
• The network is defined by the Systolic Pathway (SP)
• Each node is connected via SP to some subset of the other processors
• Node connectivity: determined by the implemented/selected network topology
• The systolic pathway is a high-performance network; sending and receiving may be synchronized (blocking) or asynchronous (data received are buffered)
• Typical network topologies: ring, torus, hex grid, mesh, etc.
• The sample below is a ring; note that the wrap-arounds along the x and y dimensions are not shown
• A processor can write to the x or y gate; this sends a word off on the x or y SP
• A processor can read from the x or y gate; this consumes a word from the x or y SP
• A buffered SA can write to a gate even if the receiver cannot read
• Reading from a gate when no message is available blocks
• Automatic code generation for a non-buffered SA is hard; the compiler must keep track of interprocessor synchronization
• Can view the SP as an extension of memory with infinite capacity, but with sequential access
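A unidirectional systolic ring can be mimicked in a few lines. This sketch is entirely an assumption of mine (topology, buffering, and the running-sum computation are chosen for illustration): each node reads a word from its buffered x gate, combines it with its local value, and writes the result to the next node's gate, so a "wave" circulates once around the ring.

```python
# Buffered systolic ring: gates are modeled as FIFO buffers per node.

def ring_sum(local_values):
    n = len(local_values)
    x_gates = [[] for _ in range(n)]   # buffered x pathway into each node
    x_gates[0].append(0)               # node 0 launches the systolic wave
    for step in range(n):
        node = step % n
        word = x_gates[node].pop(0)    # read from x gate (would block if empty)
        # add local value, forward along the x pathway to the next node
        x_gates[(node + 1) % n].append(word + local_values[node])
    return x_gates[0].pop(0)           # wave arrives back at node 0

print(ring_sum([1, 2, 3, 4]))   # 10
```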
15
Systolic Array Multiprocessor
[Diagram: ring of nodes P0, P1, P2, .., Pn-1, each with a cache and a private memory Mem 0 .. Mem n-1, linked by x and y pathways]
16
Systolic Array Multiprocessor
• Note that each pathway, x or y, may be bi-directional
• May have any number of pathways; nothing magic about 2 (x and y)
• Possible to have I/O capability at each node
• The next example shows a torus (without displaying the wrap-around pathways)
[Diagram: 2-D torus of nodes P0,0 .. Pn-1,n-1, each with a cache, connected by x and y pathways; wrap-around links not shown]
17
Hybrid Multiprocessor - Superscalar
Superscalar (SSA) Architecture
• Is a scalar architecture w.r.t. object code
• Is a parallel architecture, with multiple copies of some hardware units
• Has multiple ALUs, possibly FP add (FPA), FP multiply (FPM), and more than one integer unit
• Arithmetic operations proceed simultaneously with load and store operations
• The code sequence looks like a sequence of instructions for a scalar processor
• Object code can be custom-tailored by the compiler
• Fetch enough instruction bytes to support the longest possible object sequence
• Decoding is the bottleneck for CISC, easy for RISC 32-bit units
• Sample superscalar: the i80860 has FPA, FPM, 2 integer ops, load, store with pre-/post-increment and decrement
18
Hybrid Multiprocessor – SSA & PA
[Diagram: instructions 1-9 flowing through the stages IF, DE, EX, WB, with several instructions in flight per cycle]
Pipelined + Superscalar Architecture
19
Hybrid Multiprocessor - VLIW
VLIW Architecture (VLIW)
• Very Long Instruction Word
– typically 128 bits or more
– below 128: LIW
• Object code no longer purely scalar
• Some special-select opcodes designed to support parallel execution
• Compiler/programmer explicitly packs VLIW ops
• Other opcodes are still scalar, and can coexist with VLIW instructions
• Scalar operation possible by placing no-ops into some VLIW fields
• Sample: Compute instruction of CMU Warp® and Intel iWarp®
• Data dependence example: the result of FPA cannot be used as an operand for FPM in the same VLIW instruction
• Thus, need to software-pipeline; not discussed in CS 106
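The packing of operations into one VLIW word, with no-ops filling unused fields, can be sketched as follows. The slot names and opcode strings here are hypothetical, chosen only to match the functional units mentioned above; real VLIW encodings are fixed-width bit fields, not dictionaries.

```python
# Hypothetical VLIW word with one field per functional unit.

SLOTS = ("fpa", "fpm", "int", "mem")   # assumed slot layout

def pack_vliw(ops):
    # ops: mapping from slot name to an operation string;
    # every slot not supplied is filled with an explicit no-op
    return {slot: ops.get(slot, "nop") for slot in SLOTS}

w = pack_vliw({"fpa": "fadd f1,f2,f3", "mem": "load r1,a"})
print(w["fpm"])   # nop -- no FP multiply issued in this word
```

A purely scalar program then degenerates to words with exactly one non-nop field, which is why VLIW code density suffers when the compiler cannot find independent operations to pack.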
20
Hybrid Multiprocessor
[Diagram: one single VLIW instruction with multiple operation fields]
21
Hybrid Multiprocessor
EPIC Architecture (EA)
• Groups instructions into bundles
• Straightens out branches by associating a predicate with instructions
• Executes instructions in parallel, say the else-clause and the then-clause of an if statement
• Decides at run time which of the predicates is true, and commits just that part
• Uses speculation to straighten the branch tree
• Uses a rotating register file, AKA register windows
• Provides many registers, not just 64
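The predication idea can be illustrated in miniature. This sketch is mine (EPIC hardware does this with predicate registers over bundled instructions, not Python conditionals): both arms of the if are evaluated unconditionally, and the predicate decides at the end which result is committed, so no branch is taken.

```python
# Predicated execution of if (x < 0) then -x else x, with no branch
# in the "computation" part: both arms run, one result is committed.

def predicated_abs(x):
    p = x < 0                 # predicate computed once
    then_val = -x             # then-clause, executed unconditionally
    else_val = x              # else-clause, executed unconditionally
    return then_val if p else else_val   # commit the arm whose predicate holds

print(predicated_abs(-7))   # 7
print(predicated_abs(3))    # 3
```

The win is that a hard-to-predict branch is replaced by straight-line code; the cost is executing both arms, which EPIC absorbs with its wide issue and large register file.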