CS61V
Transcript of CS61V
Department of Computer Science, University of the West Indies
Architecture Classification
The Flynn taxonomy (proposed in 1966!)
Functional taxonomy based on the notion of streams of information: data and instructions
Platforms are classified according to whether they have single (S) or multiple (M) streams of instructions and of data.
Flynn’s Classification
Architecture Categories:
SISD - Single Instruction, Single Data
SIMD - Single Instruction, Multiple Data
MISD - Multiple Instruction, Single Data
MIMD - Multiple Instruction, Multiple Data
SISD
Classic von Neumann machine
Basic components: CPU (control unit, ALU) and Main Memory (RAM)
Connected via Bus (aka von Neumann bottleneck)
Examples: standard desktop computer, laptop
SISD
[Diagram: SISD organisation - memory (M) supplies a single instruction stream (IS) to the control unit (C), which directs the processor (P); a single data stream (DS) flows between P and M]
SIMD
Pure SIMD machine: a single CPU devoted exclusively to control, plus a collection of subordinate ALUs, each with a small amount of memory
Instruction cycle: the CPU broadcasts an instruction, and the ALUs either execute it or idle; progress is in lock-step (effectively a global clock)
Key point: completely synchronous execution of statements
Vector and matrix computation lend themselves to an SIMD implementation
Examples of SIMD computers: Illiac IV, MPP, DAP, CM-2, and MasPar MP-2
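The lock-step cycle can be pictured with a short C sketch (an illustration, not from the lecture; NPE, broadcast_add, and the activity mask are invented names):

#include <stdio.h>

#define NPE 8  /* number of processing elements (illustrative) */

/* One SIMD instruction cycle: the control CPU broadcasts a single
   operation; every active ALU/PE applies it to its own local word in
   lock-step, while masked-off PEs idle for that cycle. */
static void broadcast_add(double local[NPE], double operand, const int active[NPE])
{
    for (int pe = 0; pe < NPE; pe++)   /* conceptually simultaneous */
        if (active[pe])
            local[pe] += operand;
}

int main(void)
{
    double mem[NPE] = {1, 2, 3, 4, 5, 6, 7, 8};  /* each PE's private word */
    int    act[NPE] = {1, 1, 0, 1, 1, 0, 1, 1};  /* PE activity mask */
    broadcast_add(mem, 10.0, act);
    for (int pe = 0; pe < NPE; pe++)
        printf("PE%d: %g\n", pe, mem[pe]);
    return 0;
}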
SIMD
[Diagram: SIMD organisation - a control processor broadcasting instructions to a grid of processing elements (PEs)]
Data Parallel Systems
Programming model:
- Operations performed in parallel on each element of a data structure
- Logically a single thread of control, performing sequential or parallel steps
- Conceptually, a processor is associated with each data element
Architectural model:
- Array of many simple, cheap processors, each with little memory
- Processors don’t sequence through instructions
- Attached to a control processor that issues instructions
- Specialized and general communication, cheap global synchronization
Original motivations:
- Matches simple differential equation solvers
- Centralize the high cost of instruction fetch/sequencing
Data Parallel Programming
In this approach, we must determine how large amounts of data can be split up. In other words, we need to identify small chunks of data which require similar processing.
These chunks of data are then assigned to different sites where they can be processed. The computations at each node may require some intermediate results from peer nodes.
The same executable could be running on each processing site, but each processing site would have different datasets.
For data parallelism to work best, the volume of communicated values should be small compared with the volume of locally computed results.
Data Parallel Programming
Data parallel decomposition can be implemented using an SPMD (single program, multiple data) programming model.
One processing element is regarded as "first among equals":
this processor starts up the program and initialises the other processors, and then works as an equal to these processors.
Each PE is doing approximately the same calculation on different data.
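A minimal SPMD sketch in C, assuming a shared-memory machine and POSIX threads (pe_main, NPROC, and the block partition are illustrative choices, not from the lecture):

#include <pthread.h>
#include <stdio.h>

#define NPROC 4
#define N     1000

static double data[N];

/* SPMD: every PE runs the same function; its id selects a different
   block of the data, so each PE does the same calculation on
   different data. */
static void *pe_main(void *arg)
{
    int id = *(int *)arg;
    int chunk = N / NPROC;
    for (int i = id * chunk; i < (id + 1) * chunk; i++)
        data[i] = data[i] * data[i];
    return NULL;
}

int main(void)
{
    /* PE 0 is "first among equals": it initialises the shared data and
       starts the other PEs, then joins the computation as an equal. */
    for (int i = 0; i < N; i++) data[i] = i;
    pthread_t t[NPROC];
    int ids[NPROC];
    for (int p = 1; p < NPROC; p++) {
        ids[p] = p;
        pthread_create(&t[p], NULL, pe_main, &ids[p]);
    }
    ids[0] = 0;
    pe_main(&ids[0]);                 /* PE 0 now works as an equal */
    for (int p = 1; p < NPROC; p++) pthread_join(t[p], NULL);
    printf("data[10] = %g\n", data[10]);  /* prints 100 */
    return 0;
}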
Data Parallel Programming
Data-parallel architectures introduced the new programming-language concept of a distributed or parallel array. Typically the set of semantic operations allowed on a distributed array was somewhat different from the operations allowed on a sequential array.
Unfortunately, each data parallel language had features tied to a particular manufacturer's parallel computer architecture, e.g.
*LISP, C* and CM Fortran for Thinking Machines Corporation’s Connection Machine series of computers.
In the 1980s and 1990s microprocessors grew in power and availability, and fell in price. Building SIMD computers out of simple but specialized compute nodes gradually became less economical than putting a general purpose commodity microprocessor at every node. Eventually SIMD computers were displaced almost completely by Multiple Instruction Multiple Data (MIMD) parallel computer architectures.
Example - ILLIAC IV
ILLIAC IV, built in 1974 at the University of Illinois, was the first large system to employ semiconductor primary memory.
The ILLIAC IV was a SIMD computer for array processing.
It consisted of a control unit (CU) and 64 processing elements (PEs).
Each processing element had 2K (2048) 64-bit words of memory associated with it. The CU could access all 128K words of memory through a bus, but each PE could only directly access its local memory.
Example - ILLIAC IV
An 8 by 8 grid interconnect joined each PE to 4 neighbours.
The CU interpreted program instructions scattered across the memory, and broadcast them to the PEs.
Neither the PEs nor the CU were general-purpose computers in the modern sense; the CU had quite limited arithmetic capabilities.
Between 1975 and 1981 it was the world's fastest computer.
Example - ILLIAC IV
The ILLIAC IV had thirteen rotating fixed-head disks which comprised part of the central system memory.
It was also one of the first computers to use an all-semiconductor main memory.
Data Parallel Languages
CFD was a data parallel language developed in the early 70s at the Computational Fluid Dynamics Branch of Ames Research Center.
CFD was a "FORTRAN-like" language, rather than a FORTRAN dialect.
The language design was extremely pragmatic. No attempt was made to hide the hardware peculiarities from the user; in fact, every attempt was made to give programmers access and control of all of the ILLIAC hardware so they could construct an efficient program.
CFD had five basic datatypes: CU INTEGER, CU REAL, CU LOGICAL, PE INTEGER, and PE REAL.
Data Parallel Languages
The type of a variable statically encoded its home:
either on the control unit or on the processing elements.
Apart from restrictions on their home, the INTEGER and REAL types behaved like the corresponding types in ordinary FORTRAN.
The CU LOGICAL type was more idiosyncratic:
it had 64 independent bits that acted as flags controlling activity of the PEs.
Data Parallel Languages
Scalars and arrays of the five types could be declared as in FORTRAN.
An ordinary variable or array of type CU REAL, for example, would be allocated in the (very small) control unit memory.
An ordinary variable or array of type PE REAL would be allocated somewhere in the collective memory of the processing elements (accessible by the control unit over the data bus) e.g.
CU REAL A, B(100)
PE INTEGER I
PE REAL D(25), E(1000)
The last data structure available in CFD was a new kind of array called a vector-aligned array.
Data Parallel Languages
Only the first dimension could be distributed, and the extent of that dimension had to be exactly 64.
A vector-aligned array would be of PE INTEGER or PE REAL type, and the syntax for the distributed dimension involved an asterisk:
PE INTEGER J(*)
PE REAL X(*,4), Y(*,2,8)
These are parallel arrays.
J(1) is stored on the first PE, J(2) is stored on the second PE, and so on.
Similarly X(1,1), X(1,2), X(1,3), X(1,4) are stored on PE 1; X(2,1), X(2,2), X(2,3), X(2,4) are stored on PE 2; etc.
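One way to picture the resulting per-PE layout is the following C sketch (struct pe_memory and its field names are purely illustrative):

#define NPE 64

/* Per-PE local memory implied by the CFD declarations
     PE INTEGER J(*)   and   PE REAL X(*,4)
   The distributed (*) dimension is spread one element per PE, so each
   PE holds one J value and one 4-element row of X in its local memory. */
struct pe_memory {
    int    J;      /* J(i) lives on PE i              */
    double X[4];   /* X(i,1..4) live together on PE i */
};

struct pe_memory machine[NPE];  /* the collective PE memory */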
Data Parallel Languages
A vector expression was a vector-aligned array with a (*) subscript in the first dimension.
Communication between neighbouring PEs was captured by allowing the (*) to have some shift added, as in:
DIFP(*) = P(* + 1) - P(* - 1)
All shifts were cyclic (end-around) shifts, so this parallel statement is equivalent to the sequential statements:
DIFP(1) = P(2) - P(64)
DIFP(2) = P(3) - P(1)
...
DIFP(64) = P(1) - P(63)
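A sequential C rendering of the same statement (difp_cyclic is an invented name; 0-based indices replace CFD's 1-based ones):

#define NPE 64

/* DIFP(*) = P(* + 1) - P(* - 1) with end-around (cyclic) shifts:
   the modular arithmetic reproduces the wrap-around at both ends. */
void difp_cyclic(double difp[NPE], const double p[NPE])
{
    for (int i = 0; i < NPE; i++)
        difp[i] = p[(i + 1) % NPE] - p[(i + NPE - 1) % NPE];
}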
Data Parallel Languages
Essential flexibility was added by allowing vector assignments to be executed conditionally with a vector test, e.g.
IF(A(*) .LT. 0) A(*) = -A(*)
Less structured methods of masking operations by explicitly assigning PE activity flags in CU LOGICAL variables were also available;
there were special primitives for restricting activity to simply-specified ranges of PEs.
PEs could concurrently access different addresses in their local memory by using vector subscripts:
DIAG(*) = RHO(*, X(*))
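Sequential C renderings of these two constructs (the function names and the second dimension of rho are invented for illustration):

#define NPE 64

/* IF(A(*) .LT. 0) A(*) = -A(*) : the vector test acts as an activity
   mask; only PEs whose element is negative perform the assignment. */
void masked_abs(double a[NPE])
{
    for (int i = 0; i < NPE; i++)
        if (a[i] < 0) a[i] = -a[i];
}

/* DIAG(*) = RHO(*, X(*)) : each PE indexes its own local memory with
   its own subscript X(i), a per-PE indirect (gather) access. */
void gather_diag(double diag[NPE], double rho[NPE][NPE], const int x[NPE])
{
    for (int i = 0; i < NPE; i++)
        diag[i] = rho[i][x[i]];   /* assumes 0 <= x[i] < NPE */
}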
Connection Machine
(Tucker, IEEE Computer, Aug. 1988)
CM-5
Repackaged SparcStation, 4 per board
Fat-tree data network
Control network for global synchronization
Whither SIMD machines?
Trade-off: individual processor performance for collective performance (CM-1 had 64K PEs, each 1-bit!)
Problems with SIMD:
- inflexible - not all problems can use this style of parallelism
- cannot leverage off microprocessor technology
=> cannot be general-purpose architectures
Special-purpose SIMD architectures are still viable (array processors, DSP chips)
Vector Processors
Definition: a processor that can do element-wise operations on entire vectors with a single instruction, called a vector instruction.
These are specified as operations on vector registers; a processor comes with some number of such registers.
A vector register holds ~32-64 elements, more than the amount of parallel hardware, called vector pipes or lanes (typically 2-4).
The hardware performs a full vector operation in (#elements per vector register) / (#pipes) steps; e.g. a 64-element register with 4 pipes takes 16 cycles per vector operation.
[Diagram: scalar add r3 = r1 + r2 versus vector add vr3 = vr1 + vr2; logically the vector instruction performs #elements adds in parallel, but the hardware actually performs #pipes adds in parallel]
Vector Processors
Advantages:
- quick fetch and decode of a single instruction for multiple operations
- the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion
- the compiler does the work for you, of course
Memory-to-memory variants:
- no registers
- can process very long vectors, but startup time is large
- appeared in the 70s and died in the 80s
Examples: Cray, Fujitsu, Hitachi, NEC
Vector Processors
What about:
for (j = 0; j < 100; j++)
A[j] = B[j] * C[j]
Scalar code: load, operate, store for each iteration
Both instructions and data consume memory bandwidth
The solution: A vector instruction
Vector Processors
A[0:99] = B[0:99] * C[0:99]
Single instruction requires memory bandwidth for data only.
No control overhead for loops
Pitfalls: extension to the instruction set, vector FUs, vector registers, memory subsystem changes for vectors
Vector Processors
Merits of vector processors:
1. Very deep pipeline without data hazards: the computation of each result is independent of the computation of previous results
2. Instruction bandwidth requirement is reduced: a vector instruction specifies a great deal of work
3. Control hazards are nonexistent: a vector instruction represents an entire loop, so there is no loop branch
Vector Processors (Cont’d)
The high latency of initiating a main memory access is amortized:
a single access is initiated for the entire vector rather than for a single word.
Known access pattern => interleaved memory banks can be used.
Vector operations are faster than a sequence of scalar operations on the same number of data items!
Vector Programming Example
      LD    F0, a         ; load scalar a
      ADDI  R4, Rx, #512  ; last address to load
Loop: LD    F2, 0(Rx)     ; load X(i)
      MULTD F2, F0, F2    ; a x X(i)
      LD    F4, 0(Ry)     ; load Y(i)
      ADDD  F4, F2, F4    ; a x X(i) + Y(i)
      SD    F4, 0(Ry)     ; store into Y(i)
      ADDI  Rx, Rx, #8    ; increment index to X
      ADDI  Ry, Ry, #8    ; increment index to Y
      SUB   R20, R4, Rx   ; compute bound
      BNZ   R20, Loop     ; check if done
RISC machine: the loop body executes 64 times to compute Y = a * X + Y
Vector Programming Example(Cont’d)
LD     F0, a       ; load scalar a
LV     V1, Rx      ; load vector X
MULTSV V2, F0, V1  ; vector-scalar multiply
LV     V3, Ry      ; load vector Y
ADDV   V4, V2, V3  ; add
SV     Ry, V4      ; store the result
Vector machine: 6 instructions (low instruction bandwidth) to compute Y = a * X + Y
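For reference, the computation both code sequences implement, written as plain C (daxpy64 is an invented name):

/* Y = a*X + Y (DAXPY) over 64 double-precision elements: the scalar
   version above spends ~9 instructions per element, while the vector
   version states the entire loop in 6 instructions. */
void daxpy64(double a, const double x[64], double y[64])
{
    for (int i = 0; i < 64; i++)
        y[i] = a * x[i] + y[i];
}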
A Vector-Register Architecture (DLXV)
[Diagram: DLXV vector-register architecture - main memory feeding a vector load-store unit; vector and scalar registers connected through crossbars to pipelined FP functional units (add/subtract, etc.)]
Vector Machines
Machine         Registers   Elements per register   Load/store units   Functional units
CRAY-1          8           64                      1                  6
CRAY-2          8           64                      1                  5
CRAY X-MP       8           64                      2 Ld / 1 St        8
CRAY Y-MP       8           64                      2 Ld / 1 St        8
CRAY C-90       8           128                     4                  8
NEC SX/2        8 + 8192    256                     8                  16
NEC SX/4        8 + 8192    256                     8                  16
Fujitsu VP200   8 - 256     32 - 1024               2                  3
Hitachi S820    32          256                     4                  4
Convex C-1      8           128                     1                  4
MISD
Multiple instruction, single data
Doesn’t really exist, unless you consider pipelining an MISD configuration
MISD
[Diagram: MISD organisation - multiple control units (C), each issuing its own instruction stream (IS) to its own processor (P), all operating on a single data stream (DS) shared with memory (M)]