CO Module 5
alan-leewllyn-bivera -
Transcript of CO Module 5
MODULE 5: INTRODUCTION TO PARALLEL PROCESSING
Module 5 Parallel Processing Slide 2
Contents:
1. Parallel Processing
2. Architectural Classification
3. Pipeline Computers
4. Arithmetic Pipeline
5. Instruction Pipeline
6. Array Processors
7. Vector Processing
8. Multiprocessors
9. Comparison of RISC & CISC
5.1. Parallel Processing
• Parallel processing is an efficient form of information processing which emphasizes the exploitation of concurrent events in the computing process.
• Concurrency implies:
1. Parallelism
2. Simultaneity
3. Pipelining
– Parallel events may occur in multiple resources during the same time interval;
– simultaneous events may occur at the same time instant; and
– pipelined events may occur in overlapped time spans.
• Parallel processing demands concurrent execution of many programs in the computer.
Advantages and Disadvantages
• The purpose of parallel processing is to speed up the computer processing capability and increase its throughput.
• Throughput means the amount of processing that can be accomplished during a given interval of time.
• The amount of hardware increases with parallel processing, and so the cost of the system increases.
[Figure: Processor with multiple functional units operating in parallel. The processor registers feed an adder-subtractor, integer multiply unit, logic unit, shift unit, incrementer, floating-point multiply unit, floating-point add-subtract unit and floating-point divide unit; the registers also connect to memory.]
• The operands in the registers are applied to one of the units depending on the operation specified by the instruction associated with the operands.
• The operation performed in each functional unit is indicated in each block of the diagram.
• The adder and integer multiplier perform the arithmetic operations with integer numbers.
• The floating point operations are separated into 3 circuits operating in parallel.
• The logic, shift, and increment operations can be performed concurrently on different data.
• All units are independent of each other, so one number can be shifted while another number is being incremented.
5.2 Architectural classification schemes
• Three computer architectural classification schemes:
1. Flynn’s classification is based on the multiplicity of instruction streams and data streams in a computer system.
2. Feng’s scheme is based on serial versus parallel processing.
3. Handler’s classification is determined by the degree of parallelism and pipelining in various levels.
Flynn’s Classification
The four categories:
1. Single instruction stream-single data stream (SISD)
2. Single instruction stream-multiple data stream (SIMD)
3. Multiple instruction stream-single data stream (MISD)
4. Multiple instruction stream-multiple data stream (MIMD)
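5.2.1. SISD computer organization
[Figure: SISD organization. The control unit (CU) sends an instruction stream (IS) to the processor unit (PU), which exchanges a data stream (DS) with the memory module (MM).]
• This represents the organization of a single computer containing a control unit (CU), a processor unit (PU), and a memory module (MM).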
• Instructions are executed sequentially and the system may or may not have internal parallel processing capabilities.
• Parallel processing may be achieved by means of multiple functional units or by pipeline processing.
5.2.2. SIMD computer organization
• This class corresponds to array processors.
• There are multiple processing elements supervised by the same control unit.
• All PEs receive the same instruction broadcast from the CU but operate on different data sets from distinct data streams.
• The shared memory subsystem may contain multiple modules.
[Figure: SIMD organization. One control unit (CU) broadcasts a single instruction stream (IS) to processor units PU1 to PUn; each PUi exchanges its own data stream DSi with memory module MMi of the shared memory.]
5.2.3. MISD Computer organization
• There are n processor units, each receiving distinct instructions operating over the same data stream and its derivatives.
• The results of one processor become the input of the next processor in the macro pipe.
• MISD structure is only of theoretical interest since no practical system has been constructed using this organization.
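[Figure: MISD organization. Control units CU1 to CUn send instruction streams IS1 to ISn to processor units PU1 to PUn, which all operate on a single data stream (DS) from the shared memory (SM) made up of modules MM1 to MMm.]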
5.2.4. MIMD computer organization
• MIMD organization refers to a computer system capable of processing several programs at the same time.
• Most multiprocessor systems and multiple computer systems can be classified into this category.
• An intrinsic MIMD computer implies interactions among the n processors because all memory streams are derived from the same data space shared by all processors.
[Figure: MIMD organization. Control units CU1 to CUn send instruction streams IS1 to ISn to processor units PU1 to PUn; each PUi exchanges its own data stream DSi with the shared memory (SM) made up of modules MM1 to MMm.]
Parallel Computer Structures
• Parallel computers are those systems that emphasize parallel processing. Parallel computers are divided into three architectural configurations:
1. Pipeline computers
2. Array processors
3. Multiprocessor systems
– A pipeline computer performs overlapped computations to exploit temporal parallelism.
– An array processor uses multiple synchronized arithmetic logic units to achieve spatial parallelism.
– A multiprocessor system achieves asynchronous parallelism through a set of interactive processors with shared resources.
• The difference between an array processor and a multiprocessor is that the processing elements in an array processor operate synchronously, whereas the processors in a multiprocessor system may operate asynchronously.
5.3 Pipeline Computers
• Pipelining is a technique of decomposing a sequential process into sub operations, with each sub operation being executed in a special dedicated segment that operates concurrently with all other segments.
• Each segment performs partial processing dictated by the way the task is partitioned.
• The result obtained from the computation in each segment is transferred to the next segment in the pipeline.
• The final result is obtained after the data have passed through all segments.
• The overlapping of computation is made possible by associating a register with each segment in the pipeline.
Basic Ideas
• Parallel processing: each processor P1 to P4 carries one data stream (a, b, c or d) through all of its operations (P1: a1 a2 a3 a4; P2: b1 b2 b3 b4; and so on). Less inter-processor communication; complicated processor hardware.
• Pipelined processing: each processor performs one type of operation on every data stream in turn (a1 b1 c1 d1; a2 b2 c2 d2; and so on). More inter-processor communication; simpler processor hardware.
[Figure: time charts for the two schemes. Colors: different types of operations performed; a, b, c, d: different data streams processed.]
Data Dependence
• Parallel processing requires NO data dependence between processors.
• Pipelined processing will involve inter-processor communication.
[Figure: time lines of processors P1 to P4 under parallel and under pipelined processing.]
• The process of executing an instruction involves 4 major steps: instruction fetch (IF), instruction decode (ID), operand fetch (OF), and execution (EX).
• In a nonpipelined computer, these 4 steps must be completed before the next instruction can be issued.
• In a pipelined computer, successive instructions are executed in an overlapped fashion.
[Figure: Pipelined processor with stages S1 (IF), S2 (ID), S3 (OF) and S4 (EX).]
• The general structure of a 4 segment pipeline:
[Figure: Input → S1 → R1 → S2 → R2 → S3 → R3 → S4 → R4, with a common clock driving the registers.]
• The operands pass through all 4 segments in a fixed sequence. Each segment consists of a combinational circuit Si that performs a sub operation over the data stream flowing through the pipe.
• The segments are separated by registers Ri that hold the intermediate results between the stages.
• The behavior of a pipeline can be illustrated with a space time diagram.
Example:
• Suppose we want to perform Ai * Bi + Ci for i = 1, 2, 3, …, 7.
• Each sub operation is to be implemented in a segment within a pipeline.
• Each segment has one or two registers and a combinational circuit.
• R1 through R5 are registers that receive new data with every clock pulse.
• The sub operations performed in each segment of the pipeline are:
Segment 1: R1 ← Ai ; R2 ← Bi
Segment 2: R3 ← R1 * R2 ; R4 ← Ci
Segment 3: R5 ← R3 + R4
[Figure: Ai and Bi enter R1 and R2, which feed the multiplier; the product goes to R3 while Ci enters R4; R3 and R4 feed the adder, whose sum goes to R5.]
• The first clock pulse transfers A1 & B1 into R1 & R2.
• The second clock pulse transfers the product of R1 & R2 into R3 and C1 into R4. The same clock pulse transfers A2 and B2 into R1 & R2.
• The third clock pulse operates on all 3 segments simultaneously. It places A3 & B3 into R1 & R2, transfers R1 * R2 into R3, transfers C2 into R4, and places the sum of R3 & R4 into R5.
• It takes 3 clock pulses to fill up the pipe and retrieve the first output from R5. From there on, each clock produces a new output and moves the data one step down the pipeline.
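The clock-by-clock register transfers described above can be sketched in a few lines of Python. This is only an illustrative simulation, not part of the original slides: the function name run_pipeline and the use of None for an empty register are assumptions made for the example.

```python
def run_pipeline(a, b, c):
    """Simulate the 3-segment pipeline that computes Ai * Bi + Ci.

    Returns a list of (clock, value) pairs, one for each result
    that appears in R5.
    """
    n = len(a)
    r1 = r2 = r3 = r4 = r5 = None          # None marks an empty register
    outputs = []
    for clock in range(1, n + 3):          # n inputs plus 2 clocks to drain
        # All registers load on the same clock pulse, so the new values
        # are computed from the OLD contents before anything is overwritten.
        new_r5 = r3 + r4 if r3 is not None else None              # segment 3
        new_r3 = r1 * r2 if r1 is not None else None              # segment 2
        new_r4 = c[clock - 2] if 2 <= clock <= n + 1 else None    # segment 2
        new_r1 = a[clock - 1] if clock <= n else None             # segment 1
        new_r2 = b[clock - 1] if clock <= n else None             # segment 1
        r1, r2, r3, r4, r5 = new_r1, new_r2, new_r3, new_r4, new_r5
        if r5 is not None:
            outputs.append((clock, r5))
    return outputs


if __name__ == "__main__":
    for clock, value in run_pipeline([1, 2, 3, 4, 5, 6, 7], [2] * 7, [10] * 7):
        print(f"clock {clock}: R5 = {value}")
```

With 7 input triples the first result appears at clock 3 and the last at clock 9, one result per clock once the pipe is full.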
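[Figure: Space-time diagrams for the four stages IF, ID, OF and EX. a) Pipelined: instructions I1 to I5 overlap in successive stages, and an output (O/P) is produced on every cycle once the pipe is full. b) Nonpipelined: each instruction passes through all four stages before the next one starts, so an output is produced only once every four cycles.]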
• An instruction cycle consists of multiple pipeline cycles.
• The pipeline cycle can be set equal to the delay of the slowest stage.
• The flow of data from stage to stage is triggered by a common clock of the pipeline.
• For the nonpipelined computer, it takes four pipeline cycles to complete one instruction.
• Once a pipeline is filled up, an output result is produced from the pipeline on each cycle.
• The instruction cycle is effectively reduced to one-fourth of the original cycle time.
• Due to overlapped instruction and arithmetic execution, pipeline machines are better tuned to perform the same operations repeatedly.
• To complete n tasks with clock time tp using a k-segment pipeline requires k+(n-1) clock cycles.
• For example, consider 4 segments and 5 tasks. The time required to complete all the operations is 4 + (5 - 1) = 8 clock cycles.
• Consider a nonpipeline unit that performs the same operation and takes a time tn to complete each task. The total time required for n tasks is n x tn.
• The speedup of pipeline processing over equivalent nonpipeline processing is defined as the ratio
S = (n x tn) / ((k + n - 1) x tp)
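A short Python sketch can make the formula concrete. The function names are illustrative, and the assumption tn = k x tp (the nonpipelined unit takes as long as all k segments together) is made only for this example.

```python
def pipeline_cycles(k, n):
    """Clock cycles needed to complete n tasks in a k-segment pipeline."""
    return k + (n - 1)


def speedup(k, n, tn, tp):
    """Speedup S = (n * tn) / ((k + n - 1) * tp)."""
    return (n * tn) / (pipeline_cycles(k, n) * tp)


if __name__ == "__main__":
    k, tp = 4, 1.0
    tn = k * tp  # assumption for this example only
    print(pipeline_cycles(k, 5))        # cycles for 5 tasks
    print(speedup(k, 5, tn, tp))        # speedup for 5 tasks
    print(speedup(k, 1000, tn, tp))     # approaches k as n grows
```

Under that assumption the speedup for 5 tasks is 20 / 8 = 2.5, and as the number of tasks grows the speedup approaches k, the number of segments.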
5.4 Arithmetic Pipeline
• An arithmetic pipeline divides an arithmetic operation into sub operations for execution in the pipeline segments.
• They are used to implement floating point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems.
Example
• The inputs to the floating-point adder pipeline are two normalized floating-point binary numbers:
X = A x 2^a
Y = B x 2^b
• A & B are two fractions that represent the mantissas, and a & b are the exponents. The floating point addition and subtraction can be performed in 4 segments.
• The registers labeled R are placed between the segments to store intermediate results.
• The sub operations that are performed in the 4 segments are:
1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.
• Exponents are compared by subtracting them to determine their difference.
• The larger exponent is chosen as the exponent of the result.
• The exponent difference determines how many times the mantissa associated with the smaller exponent must be shifted to the right. This produces an alignment of the 2 mantissas.
• The 2 mantissas are added or subtracted in segment 3. The result is normalized in segment 4.
• When an overflow occurs, the mantissa is shifted right and the exponent is incremented by one.
• If an underflow occurs, the number of leading zeros in the mantissa determines the number of left shifts in the mantissa and the number that must be subtracted from the exponent.
[Figure: Pipeline for floating-point addition and subtraction. Exponents a, b and mantissas A, B enter through registers R. Seg 1: compare exponents by subtraction. Seg 2: choose exponent and align mantissas using the exponent difference. Seg 3: add or subtract mantissas. Seg 4: adjust exponent and normalize result. Registers R separate the segments.]
• Consider 2 normalized floating-point numbers:
– X = 0.9504 x 10^3 ; Y = 0.8200 x 10^2
• The 2 exponents are subtracted in segment 1 to obtain 3 - 2 = 1. The larger exponent, 3, is chosen as the exponent of the result.
• The next segment shifts the mantissa of Y to the right to obtain
– X = 0.9504 x 10^3 ; Y = 0.0820 x 10^3
• This aligns the 2 mantissas under the same exponent.
• The addition of the 2 mantissas in segment 3 produces the sum
– Z = 1.0324 x 10^3
• The sum is adjusted by normalizing the result so that it has a fraction with a nonzero first digit. This is done by shifting the mantissa once to the right and incrementing the exponent by one to obtain the normalized sum.
– Z = 0.10324 x 10^4
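The four segments can be traced in code. The sketch below is an illustration only: it works in base 10 like the worked example, uses Python's decimal module to avoid binary rounding, and the function name fp_add is an assumption. It handles addition only and models the data flow, not the timing, of the pipeline.

```python
from decimal import Decimal


def fp_add(x_mant, x_exp, y_mant, y_exp):
    """Add two base-10 floating-point numbers given as (mantissa, exponent)."""
    # Segment 1: compare the exponents by subtraction and keep the larger one.
    diff = x_exp - y_exp
    exp = max(x_exp, y_exp)

    # Segment 2: align by shifting right the mantissa with the smaller exponent.
    if diff > 0:
        y_mant = y_mant.scaleb(-diff)
    elif diff < 0:
        x_mant = x_mant.scaleb(diff)

    # Segment 3: add the aligned mantissas.
    mant = x_mant + y_mant

    # Segment 4: normalize so the fraction has a nonzero first digit.
    while abs(mant) >= 1:                                # overflow: shift right
        mant = mant.scaleb(-1)
        exp += 1
    while mant != 0 and abs(mant) < Decimal("0.1"):      # underflow: shift left
        mant = mant.scaleb(1)
        exp -= 1
    return mant, exp


if __name__ == "__main__":
    print(fp_add(Decimal("0.9504"), 3, Decimal("0.8200"), 2))
```

For the numbers in the example, the function returns a mantissa equal to 0.10324 with exponent 4.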
5.5 Instruction Pipeline
• An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time).
• The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step.
• An instruction pipeline reads consecutive instructions from memory while previous instructions are being executed in other segments.
• This causes the instruction fetch and execute phases to overlap and perform simultaneous operations.
• In the most general case, the computer needs to process each instruction with the following sequence of steps:
1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the proper place.
– Two or more segments may require memory access at the same time, causing one segment to wait until another is finished with the memory.
– Memory access conflicts can be resolved by using 2 memory buses for accessing data & instructions.
Example
• While an instruction is being executed in segment 4, the next instruction in sequence is busy fetching an operand from memory in segment 3.
• The effective address (EA) may be calculated in a separate arithmetic circuit for the 3rd instruction, and whenever the memory is available, the 4th and all subsequent instructions can be fetched and placed in an instruction FIFO.
• Thus up to 4 sub operations in the instruction cycle can overlap and up to 4 different instructions can be in progress of being processed at the same time.
[Figure: Four segment CPU pipeline. Seg 1: fetch instruction from memory. Seg 2: decode instruction and calculate EA, followed by a "Branch?" test. Seg 3: fetch operand from memory. Seg 4: execute instruction, followed by an "Interrupt?" test that leads to interrupt handling. A branch or an interrupt leads to "Update PC" and "Empty pipe".]
Timing of instruction pipeline (instruction 3 is a branch)
Step:        1   2   3   4   5   6   7   8   9   10  11  12  13
Instr 1      FI  DA  FO  EX
Instr 2          FI  DA  FO  EX
Instr 3 (br)         FI  DA  FO  EX
Instr 4                  FI  -   -   FI  DA  FO  EX
Instr 5                      -   -   -   FI  DA  FO  EX
Instr 6                                      FI  DA  FO  EX
Instr 7                                          FI  DA  FO  EX
• It is assumed that processor has separate instruction and data memories so that the operation in FI & FO can proceed at the same time.
• In the absence of a branch instruction, each segment operates on different instructions.
• Thus in step 4, instruction 1 is being executed in Seg 4; the operand for instruction 2 is fetched in Seg 3; instruction 3 is being decoded in Seg 2; and instruction 4 is being fetched from memory in Seg 1.
• Assume that instruction 3 is a branch instruction.
– As soon as this is decoded in Seg 2 in step 4, the transfer from FI to DA of other instructions is halted until the branch instruction is executed in step 6.
– If the branch is taken, a new instruction is fetched in step 7.
– If the branch is not taken, the instruction fetched previously in step 4 can be used.
– The pipeline then continues until a new branch instruction is encountered.
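The effect of a branch on the timing can be reproduced with a small scheduling sketch. It is a simplification made for illustration: every branch is assumed to be taken, so the next fetch always waits until the branch has left the EX segment, and the function name fetch_steps is hypothetical.

```python
STAGES = ["FI", "DA", "FO", "EX"]


def fetch_steps(n_instr, branches):
    """Return {instruction number: step in which its useful FI happens}.

    branches is the set of instruction numbers that are branches.
    Assumption for this sketch: every branch is taken.
    """
    steps = {}
    step = 1
    for i in range(1, n_instr + 1):
        steps[i] = step
        if i in branches:
            # The next useful fetch happens after the branch finishes EX.
            step += len(STAGES)
        else:
            step += 1
    return steps


if __name__ == "__main__":
    fi = fetch_steps(7, branches={3})
    for i, start in fi.items():
        print(f"instruction {i}: FI in step {start}, EX in step {start + 3}")
```

With instruction 3 as the only branch, instruction 4 is fetched in step 7 and instruction 7 finishes in step 13, which matches the 13-step timing described for this pipeline.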
Major Difficulties
1. Resource conflicts
– Access to memory by two segments at the same time.
– Solution: separate memories for instructions and data.
2. Data dependency
– An instruction depends on the result of a previous instruction.
3. Branch difficulty
Data Dependency
• It occurs when an instruction needs data that are not yet available.
• For example, an instruction in the FO segment may need to fetch an operand that is being generated at the same time by the previous instruction in segment EX. Therefore, the second instruction must wait for the data to be made available by the first instruction.
Solutions
1. Hardware interlocks
• A circuit that detects instructions whose source operands are destinations of instructions farther up in the pipeline.
• Detection of this situation causes the instruction whose source is not available to be delayed by enough clock cycles to resolve the conflict.
2. Operand forwarding
• Special hardware detects a conflict and then avoids it by routing the data through special paths between pipeline segments.
• Eg: the ALU result is forwarded directly to an ALU input.
3. Delayed load
• The problem is solved in the compilation process itself.
• The compiler delays the loading of the conflicting data by inserting no-operation instructions.
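The delayed-load idea can be sketched as a tiny compiler pass. The instruction format used here, a tuple (operation, destination, sources), and the function name insert_nops are assumptions made for illustration; the sketch only checks the instruction immediately after each LOAD.

```python
NOP = ("NOP", None, ())


def insert_nops(program):
    """Insert a no-operation after a LOAD whose result is used right away.

    program is a list of (op, dest, sources) tuples (hypothetical format).
    """
    result = []
    for index, instr in enumerate(program):
        result.append(instr)
        op, dest, _ = instr
        if op == "LOAD" and index + 1 < len(program):
            next_sources = program[index + 1][2]
            if dest in next_sources:
                # The next instruction needs data that is not yet loaded.
                result.append(NOP)
    return result


if __name__ == "__main__":
    program = [
        ("LOAD", "R1", ("A",)),
        ("LOAD", "R2", ("B",)),
        ("ADD", "R3", ("R1", "R2")),
        ("STORE", "C", ("R3",)),
    ]
    for instr in insert_nops(program):
        print(instr)
```

Here the ADD uses R2 immediately after it is loaded, so one NOP is placed between the second LOAD and the ADD.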
Handling Branch Instructions
1. Pre-fetch target instruction
• Pre-fetch the target instruction in addition to the instruction following the branch.
• Both are saved until the branch is executed.
• If the branch condition is successful, the pipeline continues from the branch target instruction.
2. Branch target buffer (BTB)
– It is an associative memory included in the fetch segment of the pipeline.
– Each entry consists of the address of a previously executed branch instruction and the target instruction for the branch.
– It also stores the next few instructions after the branch target instruction.
– The advantage is that branch instructions that have occurred previously are readily available in the pipeline without interruption.
3. Loop buffer
• A small, very high speed register file maintained by the instruction fetch segment of the pipeline.
• When a loop is detected, it is stored in the loop buffer in its entirety, including all the branches.
• The loop can be executed directly without having to access memory until the loop mode is removed by the final branch out of the loop.
5.6 Vector Processing
• Applies to scientific and engineering data processing, which requires a vast number of computations.
• Vector processing can apply to
– Long-range weather forecasting
– Image processing
– Medical Diagnostics
– Flight simulations
– AI & Expert Systems
– Petroleum Explorations
• Vector Operations
– A vector is an ordered set of a one-dimensional array of data items.
– It is represented as a row vector by V = [V1, V2, …, Vn] of length n.
– Operations on vectors can be broken down into single computations with subscripted variables.
– The element Vi of vector V is written as V(I), and the index I refers to a memory address or register where the number is stored.
– Vector processing eliminates the overhead of fetch and execution of each step
• A vector processor allows operations to be specified with a single vector instruction.
• The vector instruction includes the initial address of the operands, the length of the vectors, and the operation to be performed, all in one composite instruction.
• Instruction format:
| Operation code | Base address source 1 | Base address source 2 | Base address destination | Vector length |
• This is a 3 address instruction with 3 fields specifying the base addresses of the operands and an additional field that gives the length of the data items in the vectors. This assumes that the vector operands reside in memory.
Matrix Multiplication
• Matrix multiplication is one of the computationally intensive operations performed in computers with vector processors.
• The multiplication of two n x n matrices consists of n^2 inner products or n^3 multiply-add operations.
• An n x m matrix may be considered as constituting a set of n row vectors or a set of m column vectors.
• Each multiplication and addition can be implemented using a floating-point pipeline.
• Consider the product matrix C (3 x 3) = A x B, whose elements are related to the elements of A & B by the inner product:
c_ij = Σ (k = 1 to 3) a_ik x b_kj
• For example, for i = 1 & j = 1: c_11 = a_11 b_11 + a_12 b_21 + a_13 b_31
• This requires 3 multiplications and, after initializing c_11 to 0, 3 additions.
• The total number of multiplications or additions required to compute the matrix product is 9 x 3 = 27.
• If we consider the linked multiply-add operation c + a x b as a cumulative operation, the product of two n x n matrices requires n^3 multiply-add operations. The computation consists of n^2 inner products, with each inner product requiring n multiply-add operations.
• In general, the inner product consists of the sum of k product terms of the form:
C = A1B1 + A2B2 + A3B3 + … + AkBk
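The operation count can be checked with a plain triple loop. This sketch is only an illustration of the counting argument (the function name matmul_count is an assumption); it uses ordinary Python lists, not vector hardware.

```python
def matmul_count(a, b):
    """Multiply two n x n matrices and count the multiply-add operations."""
    n = len(a)
    c = [[0] * n for _ in range(n)]      # every c[i][j] is initialized to 0
    multiply_adds = 0
    for i in range(n):                   # n^2 inner products ...
        for j in range(n):
            for k in range(n):           # ... each with n multiply-adds
                c[i][j] += a[i][k] * b[k][j]
                multiply_adds += 1
    return c, multiply_adds


if __name__ == "__main__":
    identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
    m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    product, count = matmul_count(identity, m)
    print(product, count)
```

For 3 x 3 matrices the counter ends at 27, that is 9 inner products of 3 multiply-adds each.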
Pipeline for calculating an inner product
• Values of A & B are either in memory or in processor registers.
• The floating point multiplier pipeline and the floating point adder pipeline are assumed to have 4 segments each.
• All segment registers in the multiplier and adder are initialized to 0.
• Therefore, the output of the adder is 0 for the first 8 cycles until both pipes are full.
• Ai & Bi pairs are brought in and multiplied at a rate of one pair per cycle.
[Figure: Source A and Source B feed the multiplier pipeline, whose output feeds the adder pipeline.]
• After the first 4 cycles, the products begin to be added to the output of the adder.
• During the next 4 cycles, 0 is added to the products entering the adder pipeline.
• At the end of the 8th cycle, the first 4 products A1B1 through A4B4 are in the 4 adder segments, and the next 4 products, A5B5 through A8B8, are in the multiplier segments.
• At the beginning of the 9th cycle, the output of the adder is A1B1 and the output of the multiplier is A5B5, so the 9th cycle starts the addition A1B1 + A5B5 in the adder pipeline.
• The 10th cycle starts the addition A2B2 + A6B6, and so on.
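• This pattern breaks the summation into four sections:
C = A1B1 + A5B5 + A9B9 + A13B13 + …
  + A2B2 + A6B6 + A10B10 + A14B14 + …
  + A3B3 + A7B7 + A11B11 + A15B15 + …
  + A4B4 + A8B8 + A12B12 + A16B16 + …
• When there are no more product terms to be added, the system inserts 4 zeros into the multiplier pipeline. The adder pipeline then holds one partial sum in each of its 4 segments, corresponding to the four rows above, and these 4 partial sums are added to form the final sum.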
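The grouping of product terms into four interleaved partial sums can be sketched as follows. The code models only which products end up in which partial sum, not the cycle-by-cycle behavior of the hardware; the function name and the adder_segments parameter are assumptions for the example.

```python
def pipelined_inner_product(a, b, adder_segments=4):
    """Accumulate products into one partial sum per adder segment.

    Product i (counting from 0) joins partial sum i % adder_segments,
    mirroring the rows A1B1 + A5B5 + ..., A2B2 + A6B6 + ..., and so on.
    """
    partial = [0] * adder_segments
    for i, (x, y) in enumerate(zip(a, b)):
        partial[i % adder_segments] += x * y
    # The partial sums are finally added together to give the inner product.
    return sum(partial), partial


if __name__ == "__main__":
    total, partial = pipelined_inner_product([1, 2, 3, 4, 5, 6, 7, 8], [1] * 8)
    print(total, partial)
```

With A = 1..8 and every B equal to 1, the four partial sums are 1 + 5, 2 + 6, 3 + 7 and 4 + 8, and their total equals the ordinary inner product.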
• Memory Interleaving
– Pipeline and vector processors often require simultaneous access to memory from 2 or more sources.
– Instead of using 2 memory buses for simultaneous access, the memory can be partitioned into a number of modules connected to a common address and data bus.
– A memory module is a memory array together with its own address register (AR) and data register (DR).
– Each module can carry out a memory operation independently of the others, so accesses to different modules can overlap.
[Figure: Four memory modules, each a memory array with its own AR and DR, connected to a common address bus and data bus.]
• Each memory has its own AR & DR. The AR receives information from a common address bus and DR communicates with a bidirectional data bus.
• The two least significant bits of the address can be used to distinguish between the 4 modules.
• In an interleaved memory, different sets of addresses are assigned to different memory modules. For example, in a 2 module memory system, the even addresses may be in one module and the odd addresses in the other.
• When the number of modules is a power of 2, the least significant bits of the address select a memory module and the remaining bits designate the specific location to be accessed within the selected module.
• A vector processor that uses an n-way interleaved memory can fetch n operands from n different modules.
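The address split used by an interleaved memory can be written out directly. This is a minimal sketch that assumes the number of modules is a power of 2; the function name interleave is hypothetical.

```python
def interleave(address, num_modules=4):
    """Split an address into (module number, location within the module)."""
    if num_modules < 1 or num_modules & (num_modules - 1) != 0:
        raise ValueError("num_modules must be a power of 2")
    select_bits = num_modules.bit_length() - 1   # e.g. 2 bits for 4 modules
    module = address & (num_modules - 1)         # least significant bits
    location = address >> select_bits            # remaining bits
    return module, location


if __name__ == "__main__":
    for address in range(8):
        print(address, interleave(address))
```

Consecutive addresses fall in consecutive modules, which is what lets a vector processor fetch several operands from different modules in an overlapped fashion.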
• Super computers
– A computer with vector instructions and pipelined floating-point arithmetic operations.
– Internal components are packed tightly together to minimize the distance that signals have to travel.
– Special techniques are used to remove heat from the circuits.
– Performance is measured in terms of the number of floating-point operations per second (FLOPS).
– A well-known early supercomputer is the CRAY-1, which uses vector processing with 12 distinct functional units in parallel.
5.7 Array Processors
• An array processor is a synchronous parallel computer with multiple ALUs, called processing elements (PEs), that can operate in a lock step fashion.
• The PEs are synchronized to perform the same function at the same time.
• Scalar and control type instructions are directly executed in the CU.
• Each PE consists of an ALU with registers and a local memory.
SIMD Array Processor
• An SIMD array processor is a computer with multiple processing units operating in parallel.
• The processing units are synchronized to perform the same operation under the control of a common CU, thus providing a single instruction stream, multiple data stream (SIMD) organization.
• It contains a set of PEs, each having a local memory M. Each PE includes an ALU, a floating point arithmetic unit, and working registers.
• The main memory is used for storage of the program.
• The CU controls the operations in the PEs. It decodes the instructions and determines how each instruction is to be executed.
• Vector instructions are broadcast to all PEs simultaneously. Each PE uses operands stored in its local memory. Vector operands are distributed to the local memories prior to the parallel execution of the instruction.
[Figure: SIMD array processor. A master control unit, connected to main memory, drives processing elements PE1 to PEn, each with its own local memory M1 to Mn.]
• Eg: C = A + B
– The control unit first stores the ith components ai & bi of A & B in local memory Mi for i = 1, 2, 3, ..., n. It then broadcasts the floating point add instruction ci = ai + bi to all PEs, causing the addition to take place simultaneously.
– The components ci of C are stored in fixed locations in each local memory. This produces the desired vector sum in one add cycle.
• The best known SIMD array processor is the ILLIAC IV computer.
• SIMD processors are highly specialized computers which are suited primarily for numerical problems that can be expressed in vector or matrix form.
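The vector addition C = A + B on an array processor can be mimicked in software. The sketch below is a sequential model written for illustration: the loop over local memories stands in for what the PEs would do simultaneously in lock step, and the names simd_add and local_memories are assumptions.

```python
def simd_add(a, b):
    """Model of an SIMD array processor computing C = A + B."""
    # The control unit distributes ai and bi to local memory Mi.
    local_memories = [{"a": ai, "b": bi} for ai, bi in zip(a, b)]

    # The control unit broadcasts one instruction, ci = ai + bi.
    # Each PE applies it to the operands in its own local memory;
    # in hardware all PEs would do this at the same time.
    for memory in local_memories:
        memory["c"] = memory["a"] + memory["b"]

    # Collect the components ci from the local memories.
    return [memory["c"] for memory in local_memories]


if __name__ == "__main__":
    print(simd_add([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

The point of the model is that a single broadcast instruction operates on n different data items, one per PE.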
5.8. Multiprocessor systems
• The system contains two or more processors of approximately comparable capabilities.
• All processors share access to common sets of memory modules, I/O channels, and peripheral devices.
• The entire system must be controlled by a single integrated OS providing interactions between processors and their programs.
• In addition to the shared resources, each processor has its own local memory and private I/O devices.
• A multiprocessor system is an interconnection of 2 or more CPUs with memory and input-output devices.
• Multiprocessors are MIMD systems.
• If a fault causes one processor to fail, a second processor can be assigned to perform the functions of the disabled processor.
• The system as a whole can continue to function correctly with perhaps some loss in efficiency.
• The advantage of a multiprocessor organization is improved system performance and reliability.
Architectural Aspects
• A multiprocessor system contains a number of CPUs and a number of input-output processors (IOPs), together with many I/O devices and a memory unit. A number of interconnection network policies must be considered to get the maximum performance with minimum complexity.
• Three different interconnections :
1. Time-shared common bus
2. Crossbar switch network
3. Multiport memories.
Time-shared common bus
• A common bus multiprocessor system consists of a number of processors connected through a common path to a memory unit.
• In a time shared common bus, only one processor can communicate with the memory or another processor at any given time.
[Figure: Time-shared common bus connecting a memory unit, several CPUs and several IOPs.]
• Transfer operations are conducted by the processor that is in control of the bus at the time.
• Any other processor wishing to initiate a transfer must first determine the availability status of the bus, and only after the bus becomes available can the processor address the destination unit to initiate the transfer.
• A command is issued to inform the destination unit what operation is to be performed.
• The receiving unit recognizes its address in the bus and responds to the control signals from the sender, after which the transfer is initiated.
• Single common bus system is restricted to one transfer at a time.
Multiport Memory
• A multiport memory system employs separate buses between each memory module and each CPU.
[Figure: Multiport memory organization. CPU 1 to CPU 4 each have their own bus connected to memory modules MM 1 to MM 4.]
• Each processor bus is connected to each MM.
• A processor bus consists of address, data and control lines required to communicate with memory.
• The MM is said to have 4 ports and each port accommodates one of the buses.
• The module must have internal control logic to determine which port will have access to memory at any given time.
• Memory access conflicts are resolved by assigning fixed priorities to each memory port.
• The priority for memory access associated with each processor may be established by the physical port position that its bus occupies in each module.
• Thus CPU 1 will have priority over CPU 2, CPU 2 will have priority over CPU 3, and CPU 4 will have the lowest priority.
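Fixed-priority arbitration of this kind can be sketched in a few lines. The request format (a dictionary from CPU number to the requested memory module) and the function name grant_ports are assumptions made for the example; a lower CPU number means a higher priority.

```python
def grant_ports(requests):
    """Resolve simultaneous requests with fixed priorities.

    requests maps a CPU number to the memory module it wants to access.
    Returns a dictionary mapping each requested module to the CPU that
    is granted access in this cycle.
    """
    granted = {}
    for cpu in sorted(requests):          # CPU 1 first, then CPU 2, ...
        module = requests[cpu]
        # Only the first (highest priority) requester of a module wins.
        granted.setdefault(module, cpu)
    return granted


if __name__ == "__main__":
    print(grant_ports({4: "MM1", 2: "MM1", 3: "MM2", 1: "MM1"}))
```

In the usage example CPU 1, CPU 2 and CPU 4 all request MM1, so CPU 1 is granted access, while CPU 3 is alone on MM2 and proceeds in the same cycle.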
• The advantage is the high transfer rate that can be achieved because of the multiple paths between processors and memory.
• The disadvantage is that it requires expensive memory control logic and a large number of cables and connectors.
Crossbar Switch Organization
• The organization consists of a number of crosspoints that are placed at intersections between processor buses and memory module paths.
• The small square in each crosspoint is a switch that determines the path from a processor to a memory module
• Each switch point has control logic to set up the transfer path between a processor and memory.
• It examines the address that is placed in the bus to determine whether its particular module is being addressed.
• It also resolves multiple requests for access to the same memory module on a predetermined priority basis.
[Figure: Crossbar switch connecting CPU 1 to CPU 3 with memory modules MM 1 to MM 3.]
• The circuit of each crosspoint switch consists of multiplexers that select the data, address and control lines from one CPU for communication with the memory module.
• Priority levels are established by the arbitration logic to select one CPU when two or more CPUs attempt to access the same memory.
• A crossbar switch organization supports simultaneous transfers from all memory modules because there is a separate path associated with each module.
Comparison of RISC and CISC
Multiplying Two Numbers in Memory
• The main memory is divided into locations numbered from (row) 1: (column) 1 to (row) 6: (column) 4.
• The execution unit is responsible for carrying out all computations.
• However, the execution unit can only operate on data that has been loaded into one of the six registers (A, B, C, D, E, or F).
• Suppose we want to find the product of two numbers, one stored in location 2:3 and another stored in location 5:2, and then store the product back in location 2:3.
The CISC (Complex Instruction Set Computer) Approach
• The primary goal of CISC architecture is to complete a task in as few lines of assembly as possible.
• This is achieved by building processor hardware that is capable of understanding and executing a series of operations.
• For a particular task, a CISC processor would come prepared with a specific instruction (say "MULT").
• When executed, this instruction loads the two values into separate registers, multiplies the operands in the execution unit, and then stores the product in the appropriate register.
• Thus, the entire task of multiplying two numbers can be completed with one instruction:
MULT 2:3, 5:2
• MULT is what is known as a "complex instruction."
• It operates directly on the computer's memory banks and does not require the programmer to explicitly call any loading or storing functions.
• It closely resembles a command in a higher level language.
• For instance, if we let "a" represent the value of 2:3 and "b" represent the value of 5:2, then this command is identical to the C statement "a = a * b."
• One of the primary advantages of this system is that the compiler has to do very little work to translate a high-level language statement into assembly.
• Because the length of the code is relatively short, very little RAM is required to store instructions.
• The emphasis is put on building complex instructions directly into the hardware.
The RISC (Reduced Instruction Set Computer) Approach
• RISC processors only use simple instructions that can be executed within one clock cycle.
• Thus, the "MULT" command described above could be divided into three separate commands: "LOAD," which moves data from the memory bank to a register, "PROD," which finds the product of two operands located within the registers, and "STORE," which moves data from a register to the memory banks.
• In order to perform the exact series of steps described in the CISC approach, a programmer would need to code four lines of assembly:
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
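The two approaches can be compared on a toy model of the machine described above. The sketch below is an assumption-laden illustration, not a real instruction set: the `Machine` class, its method names, and the sample values 6 and 7 are all hypothetical, and the CISC-style `mult` is modeled as writing the product straight back to location 2:3.

```python
class Machine:
    """Toy model: a 6x4 memory grid and six registers A..F."""

    def __init__(self):
        self.memory = {(r, c): 0 for r in range(1, 7) for c in range(1, 5)}
        self.registers = dict.fromkeys("ABCDEF", 0)
        self.executed = 0                     # number of instructions issued

    # RISC-style simple instructions
    def load(self, reg, loc):
        self.registers[reg] = self.memory[loc]
        self.executed += 1

    def prod(self, dst, src):
        self.registers[dst] *= self.registers[src]
        self.executed += 1

    def store(self, loc, reg):
        self.memory[loc] = self.registers[reg]
        self.executed += 1

    # CISC-style complex instruction: memory-to-memory multiply
    def mult(self, loc_a, loc_b):
        self.memory[loc_a] = self.memory[loc_a] * self.memory[loc_b]
        self.executed += 1


def fresh_machine():
    m = Machine()
    m.memory[(2, 3)] = 6                      # hypothetical operand values
    m.memory[(5, 2)] = 7
    return m


cisc = fresh_machine()
cisc.mult((2, 3), (5, 2))                     # MULT 2:3, 5:2

risc = fresh_machine()
risc.load("A", (2, 3))                        # LOAD A, 2:3
risc.load("B", (5, 2))                        # LOAD B, 5:2
risc.prod("A", "B")                           # PROD A, B
risc.store((2, 3), "A")                       # STORE 2:3, A

print(cisc.memory[(2, 3)], cisc.executed)
print(risc.memory[(2, 3)], risc.executed)
```

Both machines end with the same product in location 2:3; only the instruction count differs (one versus four). In the RISC run the operands also remain in registers A and B for later reuse.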
• At first, this may seem like a much less efficient way of completing the operation.
– Because there are more lines of code, more RAM is needed to store the assembly level instructions.
– The compiler must also perform more work to convert a high-level language statement into code of this form.
• However, the RISC strategy also brings some very important advantages.
– Because each instruction requires only one clock cycle to execute, the entire program will execute in approximately the same amount of time as the multi-cycle "MULT" command.
– These RISC "reduced instructions" require fewer transistors of hardware space than the complex instructions, leaving more room for general purpose registers.
– Because all of the instructions execute in a uniform amount of time (i.e. one clock), pipelining is possible.
• Separating the "LOAD" and "STORE" instructions actually reduces the amount of work that the computer must perform.
• After a CISC-style "MULT" command is executed, the processor automatically erases the registers. If one of the operands needs to be used for another computation, the processor must re-load the data from the memory bank into a register.
• In RISC, the operand will remain in the register until another value is loaded in its place.
• The Performance Equation: time/program = (time/cycle) × (cycles/instruction) × (instructions/program)
• The CISC approach attempts to minimize the number of instructions per program, sacrificing the number of cycles per instruction.
• RISC does the opposite, reducing the cycles per instruction at the cost of the number of instructions per program.
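The trade-off between instruction count and cycles per instruction can be checked with a small calculation. The figures used here are assumptions chosen for illustration only: a 1 ns clock, a single CISC "MULT" that takes 4 cycles, and four RISC instructions that take 1 cycle each.

```python
def time_per_program(instructions, cycles_per_instruction, seconds_per_cycle):
    """Execution time = instructions x cycles/instruction x time/cycle."""
    return instructions * cycles_per_instruction * seconds_per_cycle


CYCLE = 1e-9                                  # assumed 1 ns clock period

# Hypothetical figures for the multiply task.
cisc_time = time_per_program(1, 4, CYCLE)     # one complex, multi-cycle instruction
risc_time = time_per_program(4, 1, CYCLE)     # four simple, single-cycle instructions

print(cisc_time, risc_time)
```

Under these assumed numbers the two programs take the same time, which is why the comparison comes down to what else each approach enables, such as pipelining and register reuse.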
What is RISC?
• RISC, or Reduced Instruction Set Computer, is a type of microprocessor architecture that utilizes a small, highly-optimized set of instructions, rather than a more specialized set of instructions often found in other types of architectures.
CISC versus RISC
CISC | RISC
Emphasis on hardware | Emphasis on software
Includes multi-clock complex instructions | Single-clock, reduced instructions only
Memory-to-memory: "LOAD" and "STORE" incorporated in instructions | Register-to-register: "LOAD" and "STORE" are independent instructions
Small code sizes, high cycles per instruction | Low cycles per instruction, large code sizes
Transistors used for storing complex instructions | Spends more transistors on memory registers
Variable format instructions | Fixed format instructions
Single register set | Multiple register sets
RISC Pipelines
• A RISC processor pipeline operates in much the same way, although the stages in the pipeline are different.
• The stages are:
1. fetch instructions from memory
2. read registers and decode the instruction
3. execute the instruction or calculate an address
4. access an operand in data memory
5. write the result into a register
• Because RISC instructions are simpler than those used in CISC processors, they are more conducive to pipelining.
• While CISC instructions vary in length, RISC instructions are all the same length and can be fetched in a single operation.
• Ideally, each of the stages in a RISC processor pipeline should take 1 clock cycle so that the processor finishes an instruction each clock cycle and averages one cycle per instruction (CPI).
Pipeline Problems
• In practice, however, RISC processors operate at more than one cycle per instruction. The processor might occasionally stall as a result of data dependencies and branch instructions.
Eg: 3-segment instruction pipeline:
• The instruction cycle can be divided into 3 suboperations and implemented in 3 segments:
1. I: Instruction fetch
2. A: ALU operation
3. E: Execute instruction
• Segment I fetches the instruction from program memory.
• The instruction is decoded and an ALU operation is performed in the A segment.
• Segment E directs the output of the ALU to one of 3 destinations, depending on the decoded instruction: the register file (result of the ALU operation), memory (the effective address for loading or storing), or the PC (the branch address).
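How the three segments overlap can be sketched by computing, for each instruction, which clock cycle it spends in I, A and E. This is a minimal sketch that assumes no stalls and one clock cycle per segment; the function name `timing` and the "-" marker for an idle cycle are choices made for this example.

```python
SEGMENTS = ["I", "A", "E"]                    # fetch, ALU operation, execute


def timing(n_instructions):
    """Return one row per instruction; row[c] is the segment used in cycle c+1."""
    total_cycles = n_instructions + len(SEGMENTS) - 1
    rows = []
    for i in range(n_instructions):           # instruction i enters the pipeline in cycle i+1
        row = ["-"] * total_cycles
        for s, name in enumerate(SEGMENTS):
            row[i + s] = name                 # one segment per clock cycle
        rows.append(row)
    return rows


for row in timing(4):
    print(" ".join(row))
```

With 4 instructions and 3 segments the pipeline needs 4 + 3 - 1 = 6 clock cycles, and from cycle 3 onward one instruction completes in every cycle.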
Delayed Load:
• Consider
1. LOAD: R1 <- M[address1]
2. LOAD: R2 <- M[address2]
3. ADD: R3 <- R1 + R2
4. STORE: M[address3] <- R3
• If the 3-segment pipeline proceeds without interruptions, there will be a data conflict in instruction 3 because the operand in R2 is not yet available in the A segment.
Pipeline timing with data conflict
Clock cycles     1  2  3  4  5  6
1. Load R1       I  A  E  -  -  -
2. Load R2       -  I  A  E  -  -
3. Add R1+R2     -  -  I  A  E  -
4. Store R3      -  -  -  I  A  E

Pipeline timing with delayed load
Clock cycles     1  2  3  4  5  6  7
1. Load R1       I  A  E  -  -  -  -
2. Load R2       -  I  A  E  -  -  -
3. No-operation  -  -  I  A  E  -  -
4. Add R1+R2     -  -  -  I  A  E  -
5. Store R3      -  -  -  -  I  A  E
• In the timing with data conflict, the E segment in clock cycle 4 is in the process of placing the memory data into R2.
• The A segment in clock cycle 4 is using the data from R2, but the value in R2 will not be the correct value since it has not yet been transferred from memory.
• It is up to the compiler to make sure that the instruction immediately following a load does not use the data being fetched from memory.
• If the compiler cannot find a useful instruction to put after the load, it inserts a no-operation instruction, thus wasting a clock cycle.
• This concept of delaying the use of the data loaded from memory is referred to as delayed load.
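The compiler's role in a delayed load can be illustrated with a small sketch. It is a simplification under stated assumptions: instructions are modeled as hypothetical `(opcode, destination, sources)` tuples, and the function only inserts a no-operation. A real compiler would first try to move a useful, independent instruction into the slot after the load.

```python
def insert_delay_slots(program):
    """Insert a NOP after any LOAD whose result is read by the very next instruction."""
    scheduled = []
    for index, instruction in enumerate(program):
        scheduled.append(instruction)
        opcode, destination, _ = instruction
        if opcode == "LOAD" and index + 1 < len(program):
            next_sources = program[index + 1][2]
            if destination in next_sources:   # next instruction needs the loaded value
                scheduled.append(("NOP", None, ()))
    return scheduled


# The four-instruction example: (opcode, destination register, source registers)
program = [
    ("LOAD", "R1", ()),
    ("LOAD", "R2", ()),
    ("ADD", "R3", ("R1", "R2")),
    ("STORE", None, ("R3",)),
]

scheduled = insert_delay_slots(program)
for opcode, destination, sources in scheduled:
    print(opcode, destination, sources)
```

Only the second load is followed by an instruction that reads its destination, so a single no-operation is inserted before the ADD, giving the five-instruction sequence of the delayed-load timing.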