Chapter One: Introduction to Pipelined Processors
Pipelining:
• It is a technique of decomposing a sequential process into sub-operations, each of which is executed in a special dedicated segment that operates concurrently with all the other segments.
• It improves processor performance by overlapping the execution of multiple instructions.
Example of pipelining in a computer
• Consider that the process of execution of an instruction involves four major steps:
1. Instruction Fetch (IF): from main memory
2. Instruction Decoding (ID): which identifies the operation to be performed
3. Operand Fetch(OF): if needed in execution
4. Execution(EX): of the decoded arithmetic/logic operation
• In a non-pipelined computer, these four steps must be completed before the next instruction can be issued
• In a pipelined computer, successive stages are executed in an overlapped fashion
• Theoretically a k-stage linear pipeline could be k-times faster.
• But this ideal speedup cannot be achieved in practice due to factors such as:
– Data dependency
– Branches and interrupts
Principles of Linear Pipelining
• In pipelining, we divide a task into a set of subtasks.
• The precedence relation of a set of subtasks {T1, T2,…, Tk} for a given task T implies that a subtask Tj cannot start until some earlier subtask Ti finishes.
• The interdependencies of all subtasks form the precedence graph.
• With a linear precedence relation, task Tj cannot start until earlier subtasks { Ti} for all (i < j) finish.
• A linear pipeline can process subtasks with a linear precedence graph.
Basic Linear Pipeline
• L: latches, interface between different stages of pipeline
• S1, S2, etc. : pipeline stages
• It consists of a cascade of processing stages.
• Stages : Pure combinational circuits performing arithmetic or logic operations over the data flowing through the pipe.
• Stages are separated by high speed interface latches.
• Latches : Fast Registers holding intermediate results between stages
• Information flow is under the control of a common clock applied to all latches.
• The flow of data in a linear pipeline having four stages for the evaluation of a function on five inputs is as shown below:
• The vertical axis represents four stages
• The horizontal axis represents time in units of clock period of the pipeline.
Clock Period (τ) for the pipeline
• Let τi be the time delay of the circuitry of stage Si and tl be the time delay of a latch.
• Then the clock period of a linear pipeline is defined by
τ = max{τi} + tl = tm + tl
where tm is the maximum stage delay.
• The reciprocal of the clock period is called the clock frequency (f = 1/τ) of a pipeline processor.
Performance of a linear pipeline
• Consider a linear pipeline with k stages.
• Let T be the clock period and let the pipeline be initially empty.
• Starting at any time, feed n inputs and wait until the results come out of the pipeline.
• The first input takes k clock periods, and the remaining (n-1) inputs come one after another in successive clock periods.
• Thus the computation time of the pipeline, Tp, is
Tp = kT + (n-1)T = [k+(n-1)]T
• For example, if the linear pipeline has four stages (k = 4) and five inputs (n = 5):
• Tp = [k+(n-1)]T = [4+4]T = 8T
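As a quick sketch, the computation-time formula can be expressed in a few lines of Python (the function name is ours, not from the text):

```python
# Computation time of a k-stage linear pipeline over n inputs,
# in units of the clock period T: Tp = [k + (n-1)] * T.
def pipeline_cycles(k: int, n: int) -> int:
    # k periods for the first result, then one per remaining input
    return k + (n - 1)

print(pipeline_cycles(4, 5))  # the example above: 8 clock periods
```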
Performance Parameters
• The various performance parameters of a pipeline are:
1. Speed-up
2. Throughput
3. Efficiency
Speedup
Speedup is defined as
Speedup = (Time taken for a given computation by a non-pipelined functional unit) / (Time taken for the same computation by a pipelined version)
Assume a function of k stages of equal complexity, each taking the same amount of time T.
A non-pipelined unit takes kT time for one input, so n inputs take nkT time, while the pipelined version takes [k+(n-1)]T.
Then Speedup = nkT / [k+(n-1)]T = nk / (k+n-1)
The maximum value of speedup is
lim (n→∞) Speedup = k
Efficiency
It is an indicator of how efficiently the resources of the pipeline are used. If a stage is available during a clock period, then its availability becomes the unit of resource. Efficiency can be defined as
Efficiency = (No. of stage-time units actually used during the computation) / (Total no. of stage-time units available during that computation)
No. of stage-time units used = nk (there are n inputs and each input uses k stages)
Total no. of stage-time units available = k[k+(n-1)]
This is the product of the number of stages in the pipeline (k) and the number of clock periods taken for the computation, k+(n-1).
Thus efficiency is expressed as:
Efficiency = nk / k[k+(n-1)] = n / (k+n-1)
The maximum value of efficiency is 1.
Throughput
It is the average number of results computed per unit time.
For n inputs, a k-staged pipeline takes [k+(n-1)]T time units
Then,
Throughput = n / [k+n-1] T = nf / [k+n-1]
where f is the clock frequency
The maximum value of throughput is
lim (n→∞) Throughput = f
Note that Throughput = Efficiency x Frequency.
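The three measures can be checked numerically; this is a small sketch (the names are ours) which also confirms that throughput equals efficiency times frequency:

```python
# Speedup, efficiency and throughput of a k-stage pipeline over n inputs,
# following the formulas above; f is the clock frequency.
def speedup(k, n):
    return n * k / (k + n - 1)

def efficiency(k, n):
    return n / (k + n - 1)

def throughput(k, n, f):
    return n * f / (k + n - 1)

k, n, f = 4, 5, 1.0
print(speedup(k, n), efficiency(k, n))              # 2.5 0.625
# Throughput = Efficiency x Frequency:
print(throughput(k, n, f) == efficiency(k, n) * f)  # True
```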
Example : Floating Point Adder Unit
• This pipeline is linearly constructed with 4 functional stages.
• The inputs to this pipeline are two normalized floating point numbers of the form
A = a x 10^p
B = b x 10^q
where a and b are two fractions and p and q are their exponents.
• Our purpose is to compute the sum
C = A + B = c x 10^r = d x 10^s
where r = max(p,q) and 0.1 ≤ d < 1
• For example:
A = 0.9504 x 10^3
B = 0.8200 x 10^2
a = 0.9504, b = 0.8200
p = 3 & q = 2
• Operations performed in the four pipeline stages are :
1. Compare p and q, choose the larger exponent r = max(p,q), and compute t = |p - q|
Example:
r = max(p , q) = 3
t = |p-q| = |3-2|= 1
2. Shift right the fraction associated with the smaller exponent by t units to equalize the two exponents before fraction addition.
• Example:
Fraction with the smaller exponent: b = 0.8200
Shifting b right by 1 digit gives 0.0820
3. Perform fixed-point addition of two fractions to produce the intermediate sum fraction c
• Example :
a = 0.9504 b= 0.082
c = a + b = 0.9504 + 0.082 = 1.0324
4. Count the number of leading zeros (u) in fraction c and shift c left by u digits to produce the normalized fraction sum d = c x 10^u, with a nonzero leading digit. Update the exponent as s = r - u to produce the output exponent.
• Example:
c = 1.0324, u = -1 ⇒ right shift
d = 0.10324, s = r - u = 3 - (-1) = 4
C = 0.10324 x 10^4
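The four stages can be sketched in software; this toy model (ours) uses Python floats for the decimal fractions and assumes positive normalized inputs, rather than real hardware fraction registers:

```python
# Toy model of the four-stage decimal floating point addition above,
# for positive normalized inputs a x 10^p and b x 10^q.
def fp_add(a, p, b, q):
    # Stage 1: compare exponents: r = max(p, q), t = |p - q|
    r, t = max(p, q), abs(p - q)
    # Stage 2: shift right the fraction with the smaller exponent by t digits
    if p < q:
        a /= 10 ** t
    else:
        b /= 10 ** t
    # Stage 3: fixed-point addition of the aligned fractions
    c = a + b
    # Stage 4: normalize so that 0.1 <= d < 1 and adjust the exponent
    u = 0
    while c >= 1:          # overflow, as in the example: u becomes negative
        c, u = c / 10, u - 1
    while 0 < c < 0.1:     # remove leading zeros
        c, u = c * 10, u + 1
    return c, r - u        # d and s

d, s = fp_add(0.9504, 3, 0.8200, 2)
print(d, s)  # ~0.10324 and 4, i.e. C = 0.10324 x 10^4
```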
• The above 4 steps can all be implemented with combinational logic circuits and the 4 stages are:
1. Comparator / Subtractor
2. Shifter
3. Fixed Point Adder
4. Normalizer (leading zero counter and shifter)
Classification of Pipeline Processors
• There are various classification schemes for classifying pipeline processors.
• Two important schemes are
1. Handler’s Classification
2. Li and Ramamurthy's Classification
Handler’s Classification
• Based on the level of processing, the pipelined processors can be classified as:
1. Arithmetic Pipelining
2. Instruction Pipelining
3. Processor Pipelining
Arithmetic Pipelining
• The arithmetic logic units of a computer can be segmented for pipelined operations in various data formats.
• [Figure: four-stage pipelined floating point adder for inputs A = a x 2^p and B = b x 2^q, output C = A + B = d x 2^s. S1: exponent subtractor (t = |p - q|, r = max(p,q)) and fraction selector; S2: right shifter for the fraction with the smaller exponent min(p,q); S3: fraction adder; S4: leading zero counter, left shifter and exponent adder.]
• Example : Star 100
– It has two pipelines where arithmetic operations are performed
– First: Floating Point Adder and Multiplier
– Second : Multifunctional : For all scalar instructions with floating point adder, multiplier and divider.
– Both pipelines are 64-bit and can be split into four 32-bit at the cost of precision
Instruction Pipelining
• The execution of a stream of instructions can be pipelined by overlapping the execution of current instruction with the fetch, decode and operand fetch of the subsequent instructions
• It is also called instruction look-ahead
The organization of 8086 into a separate BIU and EU allows the fetch and execute cycle to overlap.
Processor Pipelining
• This refers to the processing of same data stream by a cascade of processors each of which processes a specific task
• The data stream passes the first processor with results stored in a memory block which is also accessible by the second processor
• The second processor then passes the refined results to the third and so on.
Li and Ramamurthy's Classification
• According to pipeline configurations and control strategies, Li and Ramamurthy classify pipelines under three schemes
– Unifunctional vs. Multifunctional Pipelines
– Static vs. Dynamic Pipelines
– Scalar vs. Vector Pipelines
Unifunctional vs. Multifunctional Pipelines
Unifunctional Pipelines
• A pipeline unit with a fixed and dedicated function is called unifunctional.
• Example: CRAY-1 (supercomputer, 1976)
• It has 12 unifunctional pipelines described in four groups:
– Address Functional Units: • Address Add Unit • Address Multiply Unit
– Scalar Functional Units • Scalar Add Unit • Scalar Shift Unit • Scalar Logical Unit • Population/Leading Zero Count Unit
– Vector Functional Units • Vector Add Unit • Vector Shift Unit • Vector Logical Unit
– Floating Point Functional Units • Floating Point Add Unit • Floating Point Multiply Unit • Reciprocal Approximation Unit
Multifunctional Pipelines
• A multifunction pipe may perform different functions, either at different times or at the same time, by interconnecting different subsets of stages in the pipeline.
• Example: 4X TI ASC (supercomputer, 1973)
• It has four multifunction pipeline processors, each of which is reconfigurable for a variety of arithmetic or logic operations at different times.
• Its central processor comprises nine units:
• one instruction processing unit (IPU),
• four memory buffer units and
• four arithmetic units.
• Thus it provides four parallel execution pipelines below the IPU.
• Any mixture of scalar and vector instructions can be executed simultaneously in the four pipes.
Static vs. Dynamic Pipelines
Static Pipeline
• It may assume only one functional configuration at a time.
• It can be either unifunctional or multifunctional.
• Static pipelines are preferred when instructions of the same type are to be executed continuously.
• A unifunctional pipe must be static.
Dynamic Pipeline
• It permits several functional configurations to exist simultaneously.
• A dynamic pipeline must be multifunctional.
• The dynamic configuration requires more elaborate control and sequencing mechanisms than static pipelining.
Scalar vs. Vector Pipelines
Scalar Pipeline
• It processes a sequence of scalar operands under the control of a DO loop.
• Instructions in a small DO loop are often prefetched into the instruction buffer.
• The required scalar operands are moved into a data cache to continuously supply the pipeline with operands.
• Example: IBM System/360 Model 91
• In this computer, buffering plays a major role.
• Instruction fetch buffering:
• provides the capacity to hold program loops of meaningful size.
• Upon encountering a loop which fits, the buffer locks onto the loop and subsequent branching requires less time.
• Operand fetch buffering:
• provides a queue into which storage can dump operands and from which execution units can fetch operands.
• This improves operand fetching for storage-to-register and storage-to-storage instruction types.
Vector Pipelines
• They are specially designed to handle vector instructions over vector operands.
• Computers having vector instructions are called vector processors.
• The design of a vector pipeline is expanded from that of a scalar pipeline.
• The handling of vector operands in vector pipelines is under firmware and hardware control.
• Example: Cray-1
Linear Pipeline (Static & Unifunctional)
• In a linear pipeline, data flows from one stage to another; all stages are used once per computation, and the pipeline performs one functional evaluation.
Nonlinear Pipeline
• In the floating point adder, stages (2) and (4) each need a shift register.
• We can use the same shift register, and then there will be only 3 stages.
• Then we need a feedback path from the third stage to the second stage.
• Further, the same pipeline can be used to perform fixed point addition.
• A pipeline with feed-forward and/or feedback connections is called non-linear.
Example: 3-stage nonlinear pipeline
• It has 3 stages Sa, Sb and Sc, with latches.
• Multiplexers (crossed circles) can take more than one input and pass one of the inputs to the output.
• The output of each stage has been tapped and used for feedback and feed-forward.
• The above pipeline can perform a variety of functions.
• Each functional evaluation can be represented by a particular sequence of usage of stages.
• Some examples are:
• Sa, Sb, Sc
• Sa, Sb, Sc, Sb, Sc, Sa
• Sa, Sc, Sb, Sa, Sb, Sc
• Each functional evaluation can be represented using a diagram called a Reservation Table (RT).
• It is the space-time diagram of a pipeline corresponding to one functional evaluation.
• X axis – time units
• Y axis – stages
• For the sequence Sa, Sb, Sc, Sb, Sc, Sa, called function A, we have:
      0  1  2  3  4  5
Sa    A              A
Sb       A     A
Sc          A     A
[Figure: Input → Sa → Sb → Sc → Output]
• For the sequence Sa, Sc, Sb, Sa, Sb, Sc, called function B, we have:
      0  1  2  3  4  5
Sa    A        A
Sb          A     A
Sc       A              A
• After starting a function, the stages need to be reserved in corresponding time units.
• Each function supported by multifunction pipeline is represented by different RTs
• The time taken for a functional evaluation, in units of the clock period, is the compute time. (For A & B, it is 6.)
• Markings in the same row ⇒ the stage is used more than once.
• Markings in the same column ⇒ more than one stage is in use at the same time.
• Hardware of multifunction pipeline should be reconfigurable.
• Multifunction pipeline can be static or dynamic
• Static:
• Initially configured for one functional evaluation.
• For another function, pipeline need to be drained and reconfigured.
• Two inputs belonging to different functions cannot be in the pipeline at the same time.
• Dynamic:
• It can carry out different functional evaluations at the same time.
• It is more difficult to control, as we need to be sure that there is no conflict in the usage of stages.
Principle of Designing Pipeline Processors
Instruction Prefetch and Branch Handling
• The instructions in computer programs can be classified into 4 types:
– Arithmetic/Load Operations (60%)
– Store Type Instructions (15%)
– Branch Type Instructions (5%)
– Conditional Branch Type (Yes – 12% and No – 8%)
• Arithmetic/Load Operations (60%):
– These operations require one or two operand fetches.
– The execution of different operations requires a different number of pipeline cycles.
• Store Type Instructions (15%):
– They require a memory access to store the data.
• Branch Type Instructions (5%):
– They correspond to an unconditional jump.
• Conditional Branch Type (Yes – 12% and No – 8%):
– The Yes path requires the calculation of the new address.
– The No path proceeds to the next sequential instruction.
• Arithmetic-load and store instructions do not alter the execution order of the program.
• Branch instructions and interrupts cause some damaging effects on the performance of pipelined computers.
Interrupts
• When instruction I is being executed, the occurrence of an interrupt postpones instruction I+1 until the ISR has been serviced.
• There are two types of interrupt:
– Precise: caused by illegal operation codes; can be detected at the decoding stage.
– Imprecise: caused by faults from the storage, address and execution functions.
• Precise: since decoding is the first stage, instruction I prohibits I+1 from entering the pipeline, and all preceding instructions are executed before the ISR.
• Imprecise: no new instructions are allowed into the pipeline, and all incomplete instructions, whether they precede or follow I, are executed before the ISR.
Interrupt Handling Example: Cray-1
• The interrupt system is built around an exchange package.
• When an interrupt occurs, the Cray-1 saves 8 scalar registers, 8 address registers, the program counter and the monitor flags.
• These are packed into 16 words and swapped with a block whose address is specified by a hardware exchange address register.
• Since the exchange package does not hold all state information, the software interrupt handler has to store the remaining state.
• In general, the higher the percentage of branch type instructions in a program, the slower the program will run on a pipeline processor.
Estimation of the effect of branching on an n-segment instruction pipeline
• Consider an instruction cycle of n pipeline clock periods.
• Let
– p – the probability of a conditional branch (20%)
– q – the probability that a branch is successful (12% out of the 20%, so q = 12/20 = 0.6)
• Suppose there are m instructions.
• Then the number of instructions causing successful branches = m x p x q (m x 0.2 x 0.6)
• A delay of (n-1)/n of an instruction cycle is required for each successful branch to flush the pipeline.
• Thus, the total number of instruction cycles required for m instructions is
(1/n)[(n + m - 1) + mpq(n - 1)]
• As m becomes large, the average number of instructions per instruction cycle is
lim (m→∞) m / {(1/n)[(n + m - 1) + mpq(n - 1)]} = n / (1 + pq(n - 1))
• When p = 0, the above measure reduces to n, which is the ideal case.
• In reality, it is always less than n.
Solution: Multiple Prefetch Buffers
• Buffers can be used to match the instruction fetch rate to the pipeline consumption rate.
1. Sequential Buffers: for in-sequence pipelining
2. Target Buffers: hold instructions fetched from a branch target (for out-of-sequence pipelining)
• A conditional branch causes both the sequential and the target buffers to fill; based on the condition, one is selected and the other is discarded.
Data Buffering and Busing Structures
Speeding up of pipeline segments
• The processing speeds of pipeline segments are usually unequal.
• Consider a three-stage pipeline with stages S1, S2 and S3 having delays T1, T2 and T3.
• If T1 = T3 = T and T2 = 3T, S2 becomes the bottleneck and we need to remove it.
• How? One method is to subdivide the bottleneck.
– Two divisions are possible:
• First Method: subdivide S2 into three sub-stages of delay T each.
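The branching estimate can be checked numerically; a small sketch (the function name is ours):

```python
# Average number of instructions completed per instruction cycle on an
# n-segment pipeline, with conditional-branch probability p and branch
# success probability q, using the total-cycle count derived above.
def avg_instructions_per_cycle(n, p, q, m=10**7):
    total_cycles = ((n + m - 1) + m * p * q * (n - 1)) / n
    return m / total_cycles

# n = 4, p = 0.20, q = 0.60: close to the limit n / (1 + pq(n-1))
print(round(avg_instructions_per_cycle(4, 0.20, 0.60), 3))  # ~2.941
print(round(4 / (1 + 0.20 * 0.60 * 3), 3))                  # 2.941
# With p = 0 the measure approaches the ideal value n:
print(round(avg_instructions_per_cycle(4, 0.0, 0.6), 3))    # ~4.0
```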
• Second Method: subdivide S2 into two sub-stages of delays T and 2T.
• If the bottleneck is not sub-divisible, we can duplicate S2 in parallel.
• Control and synchronization are more complex in parallel segments.
Data Buffering
• Instruction and data buffering provides a continuous flow to the pipeline units.
• Example: 4X TI ASC
• This system uses a memory buffer unit (MBU) which:
• supplies the arithmetic unit with a continuous stream of operands and
• stores results in memory.
• The MBU has three double buffers X, Y and Z (one octet per buffer): X and Y for input, Z for output.
• This supports pipeline processing at a high rate and alleviates the bandwidth mismatch between memory and the arithmetic pipeline.
• In the TI ASC, once an instruction dependency is recognized, an update capability is incorporated by transferring the contents of the Z buffer to the X or Y buffer.
Internal Forwarding and Register Tagging
• Internal Forwarding: It is replacing unnecessary memory accesses by register-to-register transfers.
• Register Tagging: It is the use of tagged registers for exploiting concurrent activities among multiple ALUs.
• Memory access is slower than register-to-register operations.
• Performance can be enhanced by eliminating unnecessary memory accesses.
• This concept can be explored in 3 directions:
1. Store-Load Forwarding
2. Load-Load Forwarding
3. Store-Store Forwarding
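As an illustration of the first direction, store-load forwarding can be sketched as a peephole pass over a toy instruction list (the three-field instruction format here is our own, not a real ISA):

```python
# Toy peephole pass: a STORE to address X immediately followed by a LOAD
# from X is rewritten as a register-to-register MOVE, eliminating the
# memory access (store-load forwarding).
def forward_store_load(program):
    out = []
    for op, reg, addr in program:
        prev = out[-1] if out else None
        if op == "LOAD" and prev and prev[0] == "STORE" and prev[2] == addr:
            out.append(("MOVE", reg, prev[1]))   # reuse the stored register
        else:
            out.append((op, reg, addr))
    return out

prog = [("STORE", "r1", "X"), ("LOAD", "r2", "X")]
print(forward_store_load(prog))
# [('STORE', 'r1', 'X'), ('MOVE', 'r2', 'r1')]
```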
Register Tagging
Example: IBM Model 91, Floating Point Execution Unit
• The floating point execution unit consists of:
– Data registers
– Transfer paths
– Floating Point Adder Unit
– Multiply-Divide Unit
– Reservation stations
– Common Data Bus
• There are 3 reservation stations for the adder, named A1, A2 and A3, and 2 for the multiplier, named M1 and M2.
• Each station has source & sink registers and their tag & control fields.
• The stations hold operands for the next execution.
• 3 store data buffers (SDBs) and 4 floating point registers (FLRs) are tagged.
• Busy bits in the FLRs indicate the dependence of instructions in subsequent execution.
• The Common Data Bus (CDB) is used to transfer operands.
• There are 11 units that supply information to the CDB: 6 FLBs, 3 adders & 2 multiply/divide units.
• Tags for these units are:
Unit Tag Unit Tag
FLB1 0001 ADD1 1010
FLB2 0010 ADD2 1011
FLB3 0011 ADD3 1100
FLB4 0100 M1 1000
FLB5 0101 M2 1001
FLB6 0110
• Internal forwarding can be achieved with the tagging scheme on the CDB.
• Example:
• Let F refer to an FLR, FLBi stand for the i-th FLB, and their contents be (F) and (FLBi).
• Consider the instruction sequence
ADD F, FLB1    F ← (F) + (FLB1)
MPY F, FLB2    F ← (F) x (FLB2)
• During addition :– Busy bit of F is set to 1– Contents of F and FLB1 is sent to adder A1 – Tag of F is set to 1010 (tag of adder)
• Meantime, the decode of MPY reveals F is busy, then– F should set tag of M1 as 1010 (Tag of adder)– F should change its tag to 1000 (Tag of Multiplier)– Send content of FLB2 to M1
• When addition is done, CDB finds that the result should be sent to M1• Multiplication is done when both operands are available
Hazard Detection and Resolution
• Hazards are caused by resource usage conflicts among various instructions• They are triggered by inter-instruction dependencies
Terminologies:
• Resource Objects: the set of working registers, memory locations and special flags.
• Data Objects: the contents of resource objects.
• Each instruction can be considered as a mapping from a set of data objects to a set of data objects.
• Domain D(I): the set of resource objects whose data objects may affect the execution of instruction I.
• Range R(I): the set of resource objects whose data objects may be modified by the execution of instruction I.
• An instruction reads from its domain and writes in its range.
• Consider the execution of instructions I and J, where J appears immediately after I.
• There are 3 types of data-dependent hazards:
1. RAW (Read After Write)
The necessary condition for this hazard is R(I) ∩ D(J) ≠ ∅
Example:
I1 : LOAD r1, a
I2 : ADD r2, r1
I2 cannot be correctly executed until r1 is loaded. Thus I2 is RAW-dependent on I1.
2. WAW (Write After Write)
The necessary condition is R(I) ∩ R(J) ≠ ∅
• Example:
I1 : MUL r1, r2
I2 : ADD r1, r4
• Here I1 and I2 write to the same destination and hence they are said to be WAW dependent.
3. WAR (Write After Read)
The necessary condition is D(I) ∩ R(J) ≠ ∅
• Example:
I1 : MUL r1, r2
I2 : ADD r2, r3
• Here I2 has r2 as its destination while I1 uses it as a source, and hence they are WAR dependent.
• Hazards can be detected in the fetch stage by comparing the domain and range of the incoming instruction with those of the instructions in the pipe.
• Once detected, there are two methods:
1. Generate a warning signal to prevent the hazard.
2. Allow the incoming instruction through the pipe and distribute the detection to all pipeline stages.
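Detection by comparing domains and ranges can be sketched with sets (a toy model, ours):

```python
# Each instruction is modelled as a pair (D, R) of sets: its domain and
# range. The three necessary conditions above become set intersections.
def hazards(I, J):
    DI, RI = I
    DJ, RJ = J
    found = []
    if RI & DJ: found.append("RAW")   # J reads what I writes
    if RI & RJ: found.append("WAW")   # J writes what I writes
    if DI & RJ: found.append("WAR")   # J writes what I reads
    return found

load_r1  = ({"a"}, {"r1"})          # I1 : LOAD r1, a
add_r2r1 = ({"r1", "r2"}, {"r2"})   # I2 : ADD r2, r1
print(hazards(load_r1, add_r2r1))   # ['RAW']
```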
Job Sequencing and Collision Prevention
• Consider the reservation table given below, with the first initiation made at t = 1:
      1  2  3  4  5  6
Sa    A              A
Sb       A     A
Sc          A     A
• Consider the next initiation made at t = 2:
      1   2   3   4   5   6   7   8
Sa    A1  A2                  A1  A2
Sb        A1  A2  A1  A2
Sc            A1  A2  A1  A2
• The second initiation easily fits into the reservation table.
• Now consider the case when the first initiation is made at t = 1 and the second at t = 3:
      1   2   3    4     5    6   7   8
Sa    A1      A2             A1      A2
Sb        A1      A1A2       A2
Sc            A1       A1A2      A2
• Here markings A1 and A2 fall in the same stage-time cells; this is called a collision, and it must be avoided.
Terminologies
• Latency: the time difference between two initiations, in units of the clock period.
• Forbidden Latency: a latency resulting in a collision.
• Forbidden Latency Set: the set of all forbidden latencies.
• Considering all initiations:
      1   2   3    4     5     6     7     8    9   10   11
Sa    A1  A2  A3   A4    A5   A1A6   A2    A3   A4  A5   A6
Sb        A1  A2  A1A3  A2A4  A3A5  A4A6   A5   A6
Sc            A1   A2   A1A3  A2A4  A3A5  A4A6  A5  A6
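The forbidden latencies can be read off a reservation table mechanically: for each stage row, every pairwise distance between marked time units is forbidden. A sketch (ours):

```python
# Forbidden latencies of a reservation table: for each stage row, every
# pairwise distance between its marked time units causes a collision.
def forbidden_latencies(rt):
    F = set()
    for marks in rt.values():
        F |= {b - a for a in marks for b in marks if b > a}
    return F

# Function A (Sa, Sb, Sc, Sb, Sc, Sa), marked time units per stage:
rt_A = {"Sa": [1, 6], "Sb": [2, 4], "Sc": [3, 5]}
print(sorted(forbidden_latencies(rt_A)))  # [2, 5]
```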
• The forbidden latencies are 2 and 5.
Shortcut Method of Finding Forbidden Latencies
• Take the distances between marks in each row of the RT: Forbidden Latency Set = {5} ∪ {2} ∪ {2} = {2, 5}.
• Latency Sequence: a sequence of latencies between successive initiations.
• For an RT, the number of valid initiations and latency sequences is infinite.
• Latency Cycle: among the infinitely many possible latency sequences, the periodic ones are significant, e.g. {1, 3, 3, 1, 3, 3, …}.
• The subsequence that repeats itself is called the latency cycle, e.g. {1, 3, 3}.
• Period of the cycle: the sum of latencies in a latency cycle (1+3+3 = 7).
• Average Latency (AL): the average taken over the latency cycle (AL = 7/3 ≈ 2.33).
• To design a pipeline, we need a control strategy that maximizes the throughput (number of results per unit time).
• Maximizing throughput means minimizing the AL.
• A latency sequence that is aperiodic in nature is impossible to design for.
• Thus the design problem is arriving at a latency cycle having minimal average latency (MAL).
State Diagram
• The initial collision vector (ICV) is a binary vector formed from F such that
C = (Cn … C2 C1), where Ci = 1 if i ∈ F and Ci = 0 otherwise.
• Thus in our example:
F = {2, 5}, C = (1 0 0 1 0)
• The procedure is as follows:
1. Start with the ICV.
2. For each unprocessed state,
for each bit i of its collision vector CV which is 0, do the following:
a. Shift CV right by i bits.
b. Drop the i rightmost bits.
c. Append zeros to the left.
d. Logically OR with the ICV.
e. If step (d) results in a new state, form a new node for this state and join it to the node of CV by an arc marked i.
This shifting process continues until no more new states can be generated.
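The shift-and-OR procedure can be sketched directly (a toy implementation, ours; latencies longer than n always lead back to the ICV and are omitted here):

```python
# A collision vector is kept as an integer whose bit i (1-indexed) is Ci.
def state_diagram(forbidden, n):
    icv = sum(1 << i for i in forbidden)
    states, edges, work = {icv}, {}, [icv]
    while work:
        cv = work.pop()
        for i in range(1, n + 1):
            if not (cv >> i) & 1:       # Ci = 0: latency i is permissible
                nxt = (cv >> i) | icv   # shift right by i, then OR with ICV
                edges[(cv, i)] = nxt
                if nxt not in states:
                    states.add(nxt)
                    work.append(nxt)
    return states, edges

states, edges = state_diagram({2, 5}, 5)
# Print each state as (C5 C4 C3 C2 C1), as in the text:
print(sorted(format(s >> 1, "05b") for s in states))
# ['10010', '10011', '11011']
```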
• The state with all zeros has a self-loop which corresponds to empty pipeline and it is possible to wait for indefinite number of latency cycles of the form (7),(8), (9),(10) etc.
• Simple Cycle: a latency cycle in which each state is encountered only once.
• Complex Cycle: consists of more than one simple cycle.
• It is enough to look for simple cycles.
• Greedy Cycle: a simple cycle is a greedy cycle if each latency in the cycle is the minimal latency (outgoing arc) from its state in the cycle.
• A good task initiation sequence should include the greedy cycle.
• The simple cycles are (3), (5), (1,3,3), (4,3) and (4).
• The greedy cycle is (1,3,3).
• In the above example, the cycle that offers the MAL is (1, 3, 3) (MAL = (1+3+3)/3 ≈ 2.33).
• The task initiation sequence with the MAL is given below (initiations A1, A2, A5, A8 at times 1, 2, 5, 8):
      1   2   3   4   5   6   7   8   9  10  11  12  13
Sa    A1  A2          A5  A1  A2  A8      A5          A8
Sb        A1  A2  A1  A2  A5      A5  A8      A8
Sc            A1  A2  A1  A2  A5      A5  A8      A8
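That the cycle (1, 3, 3) is collision-free can be verified by simulation (a sketch, ours):

```python
# Initiate function A repeatedly with the latency cycle (1, 3, 3) and
# check that no stage-time cell is ever marked twice.
from itertools import cycle, islice

rt_A = {"Sa": [1, 6], "Sb": [2, 4], "Sc": [3, 5]}

times, t = [1], 1
for lat in islice(cycle([1, 3, 3]), 19):   # 20 initiations in total
    t += lat
    times.append(t)

used = set()
for t0 in times:
    for stage, marks in rt_A.items():
        for m in marks:
            cell = (stage, t0 + m - 1)     # time unit this initiation uses
            assert cell not in used        # a repeat would be a collision
            used.add(cell)

print("no collisions; average latency =", (1 + 3 + 3) / 3)
```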
Superscalar Processors
• Scalar processors: one instruction issued per cycle.
• Superscalar: multiple instruction pipelines are used.
• Purpose: to exploit more instruction-level parallelism in user programs.
• Only independent instructions can be executed in parallel.
• The fundamental structure uses m parallel pipelines (e.g. m = 3).
[State diagram for the collision vector example: states 1 0 0 1 0 (the ICV), 1 1 0 1 1 and 1 0 0 1 1, with arcs labeled by the permissible latencies (1, 3, 4, and long latencies returning to the initial state).]
• Here, the instruction decoding and execution resources are increased.
• Example: a dual-pipeline superscalar processor
• It can issue two instructions per cycle.
• There are two pipelines with four processing stages: fetch, decode, execute and store.
• The two instruction streams come from a single I-cache.
• Assume each stage requires one cycle, except the execution stage.
• The four functional units of the execution stage are:
Functional Unit    Number of stages
Adder              2
Multiplier         3
Logic              1
Load               1
• Functional units are shared on a dynamic basis.
• Look-ahead window: used for out-of-order instruction issue.
Superscalar Performance
• The time required by the scalar base machine to execute N instructions is
T(1,1) = k + N - 1
• The ideal execution time required by an m-issue superscalar machine is
T(m,1) = k + (N - m)/m
where k is the time required to execute the first m instructions and (N - m)/m is the time required to execute the remaining (N - m) instructions.
• The ideal speedup of the superscalar machine is
S(m,1) = T(1,1) / T(m,1) = m(N + k - 1) / (N + m(k - 1))
• As N → ∞, S(m,1) → m.
Superpipeline Processors
• In a superpipelined processor of degree n, the pipeline cycle time is 1/n of the base cycle.
• The time to execute N instructions on a superpipelined machine of degree n with k stages is
T(1,n) = k + (N - 1)/n
• The speedup is
S(1,n) = T(1,1) / T(1,n) = n(k + N - 1) / (nk + N - 1)
• As N → ∞, S(1,n) → n.
Superpipelined Superscalar Processors
• This machine executes m instructions every cycle with a pipeline cycle time 1/n of the base cycle.
• The time taken to execute N independent instructions on a superpipelined superscalar machine of degree (m,n) is
T(m,n) = k + (N - m)/(mn)
• The speedup over the base machine is
S(m,n) = T(1,1) / T(m,n) = mn(k + N - 1) / (mnk + N - m)
• As N → ∞, S(m,n) → mn.
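These speedup expressions can be checked numerically against their limits (a sketch, ours):

```python
# Speedups of the three organizations over the base machine,
# T(1,1) = k + N - 1, using the execution-time formulas above.
def S_superscalar(m, k, N):
    return (k + N - 1) / (k + (N - m) / m)

def S_superpipelined(n, k, N):
    return (k + N - 1) / (k + (N - 1) / n)

def S_super_both(m, n, k, N):
    return (k + N - 1) / (k + (N - m) / (m * n))

k, N = 4, 10**7   # a large N approaches the limits m, n and m*n
print(round(S_superscalar(3, k, N), 3))     # ~3.0
print(round(S_superpipelined(2, k, N), 3))  # ~2.0
print(round(S_super_both(3, 2, k, N), 3))   # ~6.0
```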
Systolic Architecture
• Conventional architectures operate with load and store operations from memory.
• This requires many memory references, which slows down the system.
• In systolic processing, the data to be processed flows through various operation stages and is finally put back into memory.
• The basic architecture consists of processing elements (PEs) that are simple and identical in behavior at all instants.
• Each PE may have some registers and an ALU.
• PEs are interlinked in a manner dictated by the requirements of the specific algorithm, e.g. 2D meshes, hexagonal arrays etc.
• PEs at the boundary of the structure are connected to memory.
• Data picked up from memory is circulated among the PEs that require it in a rhythmic manner, and the result is fed back to memory; hence the name systolic.
• Example: multiplication of two n x n matrices.
• Every element of the input is picked up n times from memory, since it contributes to n elements of the output.
• To reduce this memory traffic, the systolic architecture ensures that each element is pulled from memory only once.
• Consider an example where n = 3.
Conventional method: O(n^3)
For I = 1 to N
  For J = 1 to N
    For K = 1 to N
      C[I,J] = C[I,J] + A[I,K] * B[K,J];
With systolic processing this runs in O(n) time, but we need n x n processing units; in our example n = 3, so 9 PEs.
• For systolic processing with n = 3, the nine processing elements P1 … P9 are arranged as a 3 x 3 array to compute
a11 a12 a13     b11 b12 b13     c11 c12 c13
a21 a22 a23  x  b21 b22 b23  =  c21 c22 c23
a31 a32 a33     b31 b32 b33     c31 c32 c33
and finally the data sets are staggered for input.
At every tick of the global system clock, data is passed to each processor from two different directions; the processor multiplies the pair and accumulates the result in its register.
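The rhythmic flow can be simulated; in this sketch (ours), PE (i, j) sees a[i][k] from the left and b[k][j] from the top at global tick t = i + j + k, which is exactly what the staggered input arrangement achieves:

```python
# Simulation of a 3x3 systolic matrix multiply: each PE (i, j) multiplies
# the pair of values passing through it at each tick and accumulates the
# product into its own register C[i][j].
n = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = [[0] * n for _ in range(n)]

for t in range(3 * n - 2):              # enough ticks to drain the array
    for i in range(n):
        for j in range(n):
            k = t - i - j               # which pair reaches PE (i, j) now
            if 0 <= k < n:
                C[i][j] += A[i][k] * B[k][j]

print(C)  # [[30, 24, 18], [84, 69, 54], [138, 114, 90]]
```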
Example: Samba, a Systolic Accelerator for Molecular Biological Applications. This systolic array contains 128 processors distributed over 32 full-custom VLSI chips. One chip houses 4 processors, and one processor performs 10 million matrix cells per second.
Very Long Instruction Word (VLIW) Architecture
VLIW Machine
• It consists of many functional units connected to a large central register file.
• Each functional unit has two read ports and one write port.
• The register file must have enough bandwidth to balance the operand usage rate of the functional units.
[Figure from the systolic matrix-multiply example: the staggered inputs are obtained by flipping columns 1 & 3 of A and rows 1 & 3 of B before feeding the PE array P1 … P9:
A → a13 a12 a11 / a23 a22 a21 / a33 a32 a31
B → b31 b32 b33 / b21 b22 b23 / b11 b12 b13]
VLIW Characteristics
• A VLIW contains multiple primitive instructions that can be executed in parallel.
• The compiler packs a number of primitive, independent instructions into a very long instruction word.
• The compiler must guarantee that the primitive instructions grouped together are independent, so that they can be executed in parallel.
• Example of a single VLIW instruction: F=a+b; c=e/g; d=x&y; w=z*h;
VLIW Principles
1. The compiler analyzes the dependences among all instructions of the sequential code and extracts as much parallelism as possible.
2. Based on this analysis, the compiler re-codes the sequential code into VLIW instruction words. (One VLIW instruction word contains a maximum of 8 primitive instructions.)
3. Finally, the VLIW hardware:
– fetches the VLIWs from cache,
– decodes them,
– dispatches the independent primitive instructions to the corresponding functional units, and
– executes them.
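A greedy version of step 2, packing independent primitives into a word, can be sketched as follows (the toy instruction format is ours):

```python
# Toy VLIW packer: instructions are (dest, sources) pairs; a new word is
# started whenever an instruction depends on a result produced (RAW/WAW)
# or read (WAR) inside the current word, or the word is full.
def pack_vliw(instrs, width=8):
    words, word, written = [], [], set()
    for dest, srcs in instrs:
        depends = (dest in written
                   or any(s in written for s in srcs)
                   or any(dest in s2 for _, s2 in word))
        if word and (depends or len(word) == width):
            words.append(word)
            word, written = [], set()
        word.append((dest, srcs))
        written.add(dest)
    if word:
        words.append(word)
    return words

# F=a+b; c=e/g; d=x&y; w=z*h are independent -> one word; a later
# (hypothetical) u=F+c uses their results -> it starts a new word.
prog = [("F", ("a", "b")), ("c", ("e", "g")), ("d", ("x", "y")),
        ("w", ("z", "h")), ("u", ("F", "c"))]
print([len(w) for w in pack_vliw(prog)])  # [4, 1]
```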
Advantages of VLIW Architecture
• Reduced complexity: the parallelism among the primitive instructions is resolved by the compiler rather than the hardware.
• A higher clock rate is possible because of the reduced complexity.
Drawbacks of VLIW Architecture
• The compiler has to be aware of technology-dependent parameters like latencies and repetition rates. This restricts the use of the same compiler for a family of VLIW processors.
• Memory space and bandwidth are wasted when some execution units are not used.
• Performance depends on how well the compiler produces the VLIW words.
Data Flow Computers
• They are based on the concept of data-driven computation.
• Conventional computers are under program flow control.
Features of the Control Flow Model
• Data is passed between instructions via shared memory.
• The flow of control is implicitly sequential.
• Program counters are used to sequence the execution of instructions.
Features of the Data Flow Model
• Intermediate or final results are passed directly as data tokens between instructions.
• There is no concept of shared data storage.
• Program sequencing is constrained only by the data dependences among instructions.
Data Flow Graph
• A data flow graph (DFG) is a directed graph whose nodes correspond to operators and whose arcs are pointers forwarding data tokens
• The graph expresses the sequencing constraints among instructions
• In a DFC, the machine-level program is represented by DFGs
• The firing rule of an instruction is based on data availability:
– An operator of the schema is enabled when tokens are present on all of its input arcs
– The enabled operator may fire at any time by:
• removing the tokens on its input arcs,
• computing a value from the operands associated with the input tokens, and
• associating that value with a result token placed on its output arc
• The result may be sent to more than one destination by means of a link.
• Consider the following simple computation:
input: a, b
y = (a+b)/x
x = (a*(a+b))+b
output: y, x
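As a check on what the graph computes, the two expressions can be evaluated in dependence order: the shared token (a+b) is produced first, then x, then y. The function below is a plain sequential sketch of that ordering, not dataflow hardware:

```python
def dataflow_example(a, b):
    # t corresponds to the (a+b) token, forwarded via a link to
    # both the '*' operator and the '/' operator
    t = a + b
    x = a * t + b   # the '*' and '+' operators fire once their tokens arrive
    y = t / x       # '/' fires last, when both t and x are available
    return y, x

print(dataflow_example(1, 1))
```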
• The representation of conditionals and iterations requires additional types of links and actors.
• In the graphs: rectangular boxes are operators, small dots are links, and large dots mark the initial configuration of the program.
[Figure: data flow graph of the example computation. Four operators (+, *, +, /) labelled A1–A4 are connected by links L1–L6; inputs a and b enter at the top, outputs y and x leave at the bottom, and x(0) marks an initial token.]
• Data link: data values pass through data links.
• Control link: control tokens, each conveying a value of either TRUE or FALSE, are transmitted through control links.

Actors
• Operator: removes the tokens on its input arcs, computes a value based on the input operands, and places the result token on its output arc.
• Decider: receives values from its input arcs, applies its associated predicate, and produces either a TRUE or a FALSE control token on its output arc.
• Boolean operator: combines control tokens produced by deciders, allowing a decision to be built up from simpler decisions.
• Control tokens direct the flow of data tokens by means of T-gates, F-gates and merge actors.
• T-gate: passes the data token on its input arc to its output arc only when it receives a control token with value TRUE at its control input.
• F-gate: passes the data token on its input arc to its output arc only when it receives a control token with value FALSE at its control input.
• Merge actor: has a true input, a false input and a control input. It passes to its output arc the data token from the input arc corresponding to the value of the control token received.
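A minimal sketch of the three gating actors, with tokens as plain Python values and an absent token modelled as None. This illustrates the firing behaviour only, not asynchronous hardware:

```python
# Toy model of the gating actors; None stands for "no token passed".

def t_gate(data_token, control_token):
    """Pass the data token through only on a TRUE control token."""
    return data_token if control_token is True else None

def f_gate(data_token, control_token):
    """Pass the data token through only on a FALSE control token."""
    return data_token if control_token is False else None

def merge(true_input, false_input, control_token):
    """Forward the token from the side selected by the control token."""
    return true_input if control_token is True else false_input

print(t_gate(42, True), f_gate(42, True), merge("a", "b", False))
```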
Example: Draw the data flow graph corresponding to the following computation (x^n):

input x, n
y = 1; i = n
while i > 0 do
begin
  y = y * x;
  i = i - 1
end
z = y
output z
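As a reference for what the graph must compute, a direct control-flow rendering of the loop (plain Python, before it is translated into merge actors, a decider and gates):

```python
def power(x, n):
    # sequential version of the loop the data flow graph implements
    y, i = 1, n
    while i > 0:
        y = y * x
        i = i - 1
    return y  # delivered on output link z

print(power(2, 10))
```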
• The tokens carrying FALSE values on the control inputs of the merge actors allow them to fire initially.
• The decider emits a token carrying the value TRUE each time an execution of the loop body is required.
• When the firing of the decider yields FALSE, the value of y is routed to the output link z.

Data Flow Machine Architectures
• Depending on the way data tokens are handled, DFCs can be divided into:
– the static model
– the dynamic model
Static Data Flow Computer
• In this machine, data tokens are assumed to move along the arcs of the data flow program graph to the operator nodes.
• A nodal operation is executed when all of its operands are present on its input arcs.
• Only one token is allowed to exist on any arc at any given time.
• This architecture is called static because:
– tokens are not labelled, and
– control tokens must be used for timing.
[Figure: static data flow graph for the x^n loop. Inputs x, 1 and n feed merge actors through T-/F-gates; a × operator updates y, a −1 operator decrements i, a decider tests i > 0, and the result y leaves on output link z.]
SDFC Example
• A data flow schema to be executed is stored in the memory of the processor.
• Memory is organized into instruction cells; each cell corresponds to an operator of the data flow program.
• Each instruction cell is composed of 3 registers:
– the first holds the instruction,
– the second and third hold the operands.
• The instruction specifies the operation to be performed and the address(es) of the register(s) to which the result of the operation is to be directed.
• When a cell contains an instruction and the necessary operands, it is enabled and transmits its contents as an operation packet.
• The arbitration network directs each operation packet to the appropriate processing unit by decoding the instruction portion of the packet.
• The processing unit performs the desired function. The result of an operation leaves the processing unit as one or more data packets, each consisting of the computed value and the address of the memory register to which the value is to be delivered.
• The distribution network accepts the data packets and uses the address to store the value.
• Many instruction cells may be enabled simultaneously; it is the task of the arbitration network to deliver operation packets efficiently to the processing units and to queue operation packets.
• Each arbitration unit passes one packet at a time to resolve conflicts.
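The cell/packet cycle above can be sketched as follows. The class, the register layout and the addresses are illustrative assumptions, not the actual static machine design:

```python
# Toy model of a static-machine instruction cell. A cell holds
# [instruction, operand1, operand2]; it is enabled once both operand
# registers are filled, and firing emits an operation packet for the
# arbitration network.

OPS = {"+": lambda a, b: a + b, "*": lambda a, b: a * b}

class Cell:
    def __init__(self, op, dest):
        self.op, self.dest = op, dest      # register 1: instruction + dest addr
        self.operands = [None, None]       # registers 2 and 3

    def deliver(self, slot, value):
        """Distribution network stores a value into an operand register."""
        self.operands[slot] = value

    def enabled(self):
        return all(v is not None for v in self.operands)

    def fire(self):
        """Emit a data packet: (computed value, destination address)."""
        a, b = self.operands
        self.operands = [None, None]       # operands are consumed on firing
        return OPS[self.op](a, b), self.dest

# one cell computing a+b, result addressed to register 'r5' (hypothetical)
cell = Cell("+", "r5")
cell.deliver(0, 2)
cell.deliver(1, 3)
packet = cell.fire() if cell.enabled() else None
print(packet)
```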
Dynamic Data Flow Architecture
• It uses tagged tokens, so that more than one token can exist on an arc.
• Tagging is achieved by attaching a label to each token which identifies the context of that token.
• Maximum parallelism can be exploited in this model.
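The essential mechanism of the dynamic model is tag matching: a node fires only when two tokens with the same tag (context, e.g. a loop iteration number) have arrived, so tokens from several iterations may share an arc. A minimal sketch, with illustrative names:

```python
from collections import defaultdict

class MatchingStore:
    """Toy matching store: pairs up same-tagged tokens per node."""
    def __init__(self):
        self.waiting = defaultdict(dict)   # node -> {tag: first operand}

    def arrive(self, node, tag, value):
        """Return (tag, left, right) when a matching pair completes,
        else None while the token waits for its partner."""
        slot = self.waiting[node]
        if tag in slot:                     # partner token already waiting
            return tag, slot.pop(tag), value
        slot[tag] = value
        return None

store = MatchingStore()
# tokens from two loop iterations interleave on the same arcs
assert store.arrive("mul", 1, 10) is None   # iteration 1, first operand
assert store.arrive("mul", 2, 20) is None   # iteration 2, first operand
print(store.arrive("mul", 1, 3))            # iteration 1 completes and fires
```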