
8/16/2019 Solution Assignment No 2


Question 1:

Give a high-level view of the pipelined processor datapath and explain its working; compare the performance of the pipelined datapath and the multi-cycle datapath.

Solution:

Instruction pipelining is a technique that implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows faster CPU throughput (the number of instructions that can be executed in a unit of time) than would otherwise be possible at a given clock rate. The basic instruction cycle is broken up into a series called a pipeline. Rather than processing each instruction sequentially (finishing one instruction before starting the next), each instruction is split up into a sequence of steps so different steps can be executed in parallel and instructions can be processed concurrently (starting one instruction before finishing the previous one).

Pipelining increases instruction throughput by performing multiple operations at the same time, but does not reduce instruction latency, which is the time to complete a single instruction from start to finish, as it still must go through all steps. Indeed, it may increase latency due to additional overhead from breaking the computation into separate steps and, worse, the pipeline may stall (or even need to be flushed), further increasing the latency. Thus, pipelining increases throughput at the cost of latency, and is frequently used in CPUs but avoided in real-time systems, in which latency is a hard constraint.

Each instruction is split into a sequence of dependent steps. The first step is always to fetch the instruction from memory; the final step is usually writing the results of the instruction to processor registers or to memory. Pipelining seeks to let the processor


work on as many instructions as there are dependent steps, just as an assembly line builds many vehicles at once, rather than waiting until one vehicle has passed through the line before admitting the next one. Just as the goal of the assembly line is to keep each assembler productive at all times, pipelining seeks to keep every portion of the processor busy with some instruction. Pipelining lets the computer's cycle time be the time of the slowest step, and ideally lets one instruction complete in every cycle.

The term pipeline is an analogy to the fact that there is fluid in each link of a pipeline, as each part of the processor is occupied with work.
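The throughput argument above can be made concrete with a small cycle count. The sketch below is only an illustration under simplifying assumptions that are not stated in the original answer: both datapaths use the same clock period, every instruction passes through all `stages` steps, and the function names are hypothetical. A multi-cycle datapath finishes one instruction before starting the next, while an ideal pipeline overlaps them and only pays the fill time once.

```python
def multicycle_cycles(n_instructions: int, steps_per_instruction: int) -> int:
    """Cycles on a multi-cycle datapath: instructions run one after another."""
    return n_instructions * steps_per_instruction


def pipelined_cycles(n_instructions: int, stages: int, stall_cycles: int = 0) -> int:
    """Cycles on a pipelined datapath: fill the pipe once, then one
    instruction completes per cycle, plus any stall cycles."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1) + stall_cycles


if __name__ == "__main__":
    n, k = 7, 5  # assumed: 7 instructions on a 5-stage datapath
    print("multi-cycle:", multicycle_cycles(n, k))
    print("pipelined  :", pipelined_cycles(n, k))
    print("speedup    :", multicycle_cycles(n, k) / pipelined_cycles(n, k))
```

Note that the latency of a single instruction is still `stages` cycles in both cases; only the number of instructions completed per unit of time improves.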

Question 2:

The following lines of code are written in a high-level language:

a = c + d;

b = c + e;

The corresponding instructions for MIPS are:

LW  R1, 0(R0)

LW  R2, 4(R0)

ADD R3, R1, R2

SW  R3, 12(R0)

LW  R4, 8(R0)

ADD R5, R1, R4

SW  R5, 16(R0)

These instructions are to be executed on a pipelined processor with forwarding.







a. Identify hazards by showing the execution of these instructions on a per-cycle basis.

b. Reorder these instructions to avoid any pipeline stalls.

c. How many cycles are saved after executing the reordered instructions?

Solution:

a. Identify hazards by showing the execution of these instructions on a per-cycle basis.

Sr. No.  Code              Assembly line code
1        LW  R1, 0(R0)     LOAD R1 <- Mem[0 + Reg[R0]]
2        LW  R2, 4(R0)     LOAD R2 <- Mem[4 + Reg[R0]]
3        ADD R3, R1, R2    Reg[R3] <- Reg[R1] + Reg[R2]
4        SW  R3, 12(R0)    Mem[12 + Reg[R0]] <- Reg[R3]
5        LW  R4, 8(R0)     LOAD R4 <- Mem[8 + Reg[R0]]
6        ADD R5, R1, R4    Reg[R5] <- Reg[R1] + Reg[R4]
7        SW  R5, 16(R0)    Mem[16 + Reg[R0]] <- Reg[R5]

Sample (execution without stalls, 11 cycles):

Instruction     1     2     3     4     5     6     7     8     9     10    11
LW  R1, 0(R0)   IF    ID    EXE   MEM   WB
LW  R2, 4(R0)         IF    ID    EXE   MEM   WB
ADD R3, R1, R2              IF    ID    EXE   MEM   WB
SW  R3, 12(R0)                    IF    ID    EXE   MEM   WB
LW  R4, 8(R0)                           IF    ID    EXE   MEM   WB
ADD R5, R1, R4                                IF    ID    EXE   MEM   WB
SW  R5, 16(R0)                                      IF    ID    EXE   MEM   WB



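Actual execution with forwarding (13 cycles):

Instruction     1     2     3     4     5     6     7     8     9     10    11    12    13
LW  R1, 0(R0)   IF    ID    EXE   MEM   WB
LW  R2, 4(R0)         IF    ID    EXE   MEM   WB
ADD R3, R1, R2              IF    stall ID    EXE   MEM   WB
SW  R3, 12(R0)                          IF    ID    EXE   MEM   WB
LW  R4, 8(R0)                                 IF    ID    EXE   MEM   WB
ADD R5, R1, R4                                      IF    stall ID    EXE   MEM   WB
SW  R5, 16(R0)                                                  IF    ID    EXE   MEM   WB

Both stalls are load-use data hazards: ADD R3 needs R2 from the LW immediately before it, and ADD R5 needs R4 from the LW immediately before it. Even with forwarding, the loaded value is only available after the MEM stage, so each dependent ADD waits one cycle.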

b. Reorder these instructions to avoid any pipeline stalls.

Sr. No.  Code              Assembly line code
1        LW  R1, 0(R0)     LOAD R1 <- Mem[0 + Reg[R0]]
2        LW  R2, 4(R0)     LOAD R2 <- Mem[4 + Reg[R0]]
3        LW  R4, 8(R0)     LOAD R4 <- Mem[8 + Reg[R0]]
4        ADD R3, R1, R2    Reg[R3] <- Reg[R1] + Reg[R2]
5        SW  R3, 12(R0)    Mem[12 + Reg[R0]] <- Reg[R3]
6        ADD R5, R1, R4    Reg[R5] <- Reg[R1] + Reg[R4]
7        SW  R5, 16(R0)    Mem[16 + Reg[R0]] <- Reg[R5]

Instruction     1     2     3     4     5     6     7     8     9     10    11
LW  R1, 0(R0)   IF    ID    EXE   MEM   WB
LW  R2, 4(R0)         IF    ID    EXE   MEM   WB
LW  R4, 8(R0)               IF    ID    EXE   MEM   WB
ADD R3, R1, R2                    IF    ID    EXE   MEM   WB
SW  R3, 12(R0)                          IF    ID    EXE   MEM   WB
ADD R5, R1, R4                                IF    ID    EXE   MEM   WB
SW  R5, 16(R0)                                      IF    ID    EXE   MEM   WB



c. How many cycles are saved after executing the reordered instructions?

The code before reordering takes 13 clock cycles:

Instruction     1     2     3     4     5     6     7     8     9     10    11    12    13
LW  R1, 0(R0)   IF    ID    EXE   MEM   WB
LW  R2, 4(R0)         IF    ID    EXE   MEM   WB
ADD R3, R1, R2              IF    stall ID    EXE   MEM   WB
SW  R3, 12(R0)                          IF    ID    EXE   MEM   WB
LW  R4, 8(R0)                                 IF    ID    EXE   MEM   WB
ADD R5, R1, R4                                      IF    stall ID    EXE   MEM   WB
SW  R5, 16(R0)                                                  IF    ID    EXE   MEM   WB

The code after reordering takes 11 clock cycles:

Instruction     1     2     3     4     5     6     7     8     9     10    11
LW  R1, 0(R0)   IF    ID    EXE   MEM   WB
LW  R2, 4(R0)         IF    ID    EXE   MEM   WB
LW  R4, 8(R0)               IF    ID    EXE   MEM   WB
ADD R3, R1, R2                    IF    ID    EXE   MEM   WB
SW  R3, 12(R0)                          IF    ID    EXE   MEM   WB
ADD R5, R1, R4                                IF    ID    EXE   MEM   WB
SW  R5, 16(R0)                                      IF    ID    EXE   MEM   WB

Due to reordering, two cycles are saved.
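The cycle counts can be cross-checked with a short script. This is a minimal sketch, assuming a classic five-stage pipeline with full forwarding in which the only stall is a one-cycle load-use stall (an instruction that reads the destination of the LW immediately before it). The `Instr` type and function names are illustrative and not part of the assignment.

```python
from typing import List, NamedTuple, Optional, Tuple

STAGES = 5  # IF, ID, EXE, MEM, WB


class Instr(NamedTuple):
    op: str                # "LW", "SW" or "ADD"
    dest: Optional[str]    # register written (None for SW)
    srcs: Tuple[str, ...]  # registers read


def load_use_stalls(program: List[Instr]) -> int:
    """Count one stall for each instruction that reads the register
    loaded by the LW directly in front of it (assumed hazard model)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev.op == "LW" and prev.dest in cur.srcs:
            stalls += 1
    return stalls


def total_cycles(program: List[Instr]) -> int:
    """Pipeline fill time + one cycle per further instruction + stalls."""
    if not program:
        return 0
    return STAGES + (len(program) - 1) + load_use_stalls(program)


LW_R1 = Instr("LW", "R1", ("R0",))
LW_R2 = Instr("LW", "R2", ("R0",))
LW_R4 = Instr("LW", "R4", ("R0",))
ADD_R3 = Instr("ADD", "R3", ("R1", "R2"))
ADD_R5 = Instr("ADD", "R5", ("R1", "R4"))
SW_R3 = Instr("SW", None, ("R3", "R0"))
SW_R5 = Instr("SW", None, ("R5", "R0"))

ORIGINAL = [LW_R1, LW_R2, ADD_R3, SW_R3, LW_R4, ADD_R5, SW_R5]
REORDERED = [LW_R1, LW_R2, LW_R4, ADD_R3, SW_R3, ADD_R5, SW_R5]

if __name__ == "__main__":
    before = total_cycles(ORIGINAL)
    after = total_cycles(REORDERED)
    print("before:", before, "after:", after, "saved:", before - after)
```

Under these assumptions the script reports 13 cycles before reordering and 11 after, which agrees with the two-cycle saving found above.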

Question 3:

Read the research paper titled "An optimizing pipeline stall reduction algorithm for power and performance on multi-core CPUs" and answer the following questions:

a. How does the proposed Left-Right (LR) algorithm work?

b. Why does the LR algorithm give better results compared to the traditional in-order and Tomasulo's algorithms?



a. How does the proposed Left-Right (LR) algorithm work?

Proposed algorithm (LR (Left-Right)): We have proposed an algorithm which performs the stall reduction in a Left-Right (LR) manner, in sequential instruction execution, as shown in Figure 1. Our algorithm introduces a hybrid order of instruction execution in order to reduce the power dissipation. More precisely, it executes the instructions serially as in-order execution until a stall condition is encountered, and thereafter, it uses the concept of out-of-order execution to replace the stall with an independent instruction. Thus, LR increases the throughput by executing independent instructions while the lengthy instructions are still executed in other functional units or the registers are involved in an ongoing process. LR also prevents the hazards that might occur during the instruction execution. The instructions are scheduled statically at compile time, as shown in Figure 2. In our proposed approach, if a buffer in presence can hold a certain number of sequential instructions, our algorithm will generate a sequence in which the instructions should be executed to reduce the number of stalls while maximizing the throughput of a processor. It is assumed that all the instructions are in the form of op-code source destination format.

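The idea described above (run in order until a stall would occur, then fill the slot with an independent instruction from a buffer) can be sketched as a small static scheduler. This is an illustrative interpretation only, not the paper's actual LR algorithm: the load-use stall model, the `window` buffer size, the aliasing rule (same base register with different offsets is treated as non-overlapping) and all names are assumptions made for the example.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]
    mem: Optional[Tuple[str, int]] = None  # (base register, offset) for LW/SW


def lw(rd: str, off: int, base: str) -> Instr:
    return Instr(f"LW {rd}, {off}({base})", "LW", rd, (base,), (base, off))


def sw(rs: str, off: int, base: str) -> Instr:
    return Instr(f"SW {rs}, {off}({base})", "SW", None, (rs, base), (base, off))


def add(rd: str, a: str, b: str) -> Instr:
    return Instr(f"ADD {rd}, {a}, {b}", "ADD", rd, (a, b))


def would_stall(prev: Instr, cur: Instr) -> bool:
    """Assumed stall condition: cur reads the register loaded by prev."""
    return prev.op == "LW" and prev.dest in cur.srcs


def may_alias(a: Instr, b: Instr) -> bool:
    """Conservative memory check: two loads never conflict; a store
    conflicts unless both use the same base with different offsets."""
    if a.mem is None or b.mem is None or (a.op != "SW" and b.op != "SW"):
        return False
    return not (a.mem[0] == b.mem[0] and a.mem[1] != b.mem[1])


def can_hoist(earlier: Instr, later: Instr) -> bool:
    """True if 'later' may be moved in front of 'earlier' safely."""
    raw = earlier.dest is not None and earlier.dest in later.srcs
    war = later.dest is not None and later.dest in earlier.srcs
    waw = later.dest is not None and later.dest == earlier.dest
    return not (raw or war or waw or may_alias(earlier, later))


def lr_schedule(program: List[Instr], window: int = 4) -> List[Instr]:
    """In order until a stall would occur, then pull forward the first
    independent instruction found within the lookahead window."""
    pending, scheduled = list(program), []
    while pending:
        pick = 0
        if scheduled and would_stall(scheduled[-1], pending[0]):
            for i in range(1, min(window, len(pending))):
                cand = pending[i]
                if not would_stall(scheduled[-1], cand) and all(
                    can_hoist(earlier, cand) for earlier in pending[:i]
                ):
                    pick = i
                    break
        scheduled.append(pending.pop(pick))
    return scheduled


PROGRAM = [lw("R1", 0, "R0"), lw("R2", 4, "R0"), add("R3", "R1", "R2"),
           sw("R3", 12, "R0"), lw("R4", 8, "R0"), add("R5", "R1", "R4"),
           sw("R5", 16, "R0")]

if __name__ == "__main__":
    for instr in lr_schedule(PROGRAM):
        print(instr.text)
```

Applied to the seven instructions from Question 2, this sketch moves LW R4 ahead of the first ADD, which is the same stall-free order obtained by hand in part b of that question.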

b. Why does the LR algorithm give better results compared to the traditional in-order and Tomasulo's algorithms?

Solution:

Comparison of LR vs. Tomasulo algorithm

In this section, the performance and power gain of the LR and the Tomasulo algorithms are compared.

Simulation and power-performance evaluation

As our baseline configuration, we use an Intel Core i5 dual-core processor with 2.40 GHz clock frequency, and a 64-bit operating system. We also use the Sim-Panalyzer simulator [25]. The LR, in-order, and Tomasulo algorithms are developed as C programs. These C programs were compiled using arm-linux-gcc in order to obtain the object files for each of them, on an ARM microprocessor model.

At the early stage of the processor design, various levels of simulators can be used to estimate the power and performance, such as transistor-level, system-level, instruction-level, and micro-architecture-level simulators. In transistor-level simulators, one can estimate the voltage and current behaviour over time. This type of simulator is used for integrated circuit design, and is not suitable for large programs. On the other hand, micro-architecture-level simulators provide the power estimation across cycles, and these are used in modern processors. Our work is similar to this kind of simulator because our objective is to evaluate the power-performance behaviour of a micro-architecture-level design abstraction. Though a literature survey suggests several power estimation tools such as CACTI and WATTCH [26], we have chosen the Sim-Panalyzer [25] since it provides accurate power modelling by taking into account both the leakage and dynamic power dissipation.

The actual instruction execution of our proposed algorithm against existing ones is shown in Algorithms 1 and 2. In the LR algorithm, an instruction is executed serially in-order until a stall occurs, and thereafter the out-of-order execution technique comes into play to replace the stall with an independent instruction stage. Therefore, in most cases, our proposed algorithm takes fewer cycles of operation and less cycle time compared to existing algorithms, as shown in algorithm [2]. The comparison of our proposed algorithm against the Tomasulo algorithm and the in-order algorithm is shown in Table 1. The next section focuses on the power-performance efficiency of our proposed algorithm.