Post on 14-Apr-2018
7/29/2019 9.Ilp Dynamic Sched1
CSE502 Computer Architecture
Dr. Ilchul Yoon (icyoon@sunykorea.ac.kr)
Slides adapted from:
Larry Wittie, SBU & John Kubiatowicz, EECS, UC, Berkeley
Lecture 09
Instruction Level Parallelism
Outline
- ILP: Instruction-Level Parallelism
- Compiler techniques to increase ILP
- Loop Unrolling
- Static Branch Prediction
- Dynamic Branch Prediction
- Overcoming Data Hazards with Dynamic Scheduling
- (Start) Tomasulo Algorithm
- Conclusion
Recall from Pipelining Review
- Pipeline CPI = Ideal pipeline CPI + Structural Stalls + Data Hazard Stalls + Control Stalls
- Ideal pipeline CPI: measure of the maximum performance attainable by the implementation
- Structural hazards: hardware cannot support this combination of instructions
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between fetching instructions and control-flow decisions (branches and jumps)
  - e.g., in MIPS: j (jump), jal (call), jr (return)
Instruction Level Parallelism
- Instruction-Level Parallelism (ILP)
  - Overlap the execution of instructions to run programs faster (improve performance)
- Two approaches to exploiting ILP:
  1. Rely on hardware to discover and exploit the parallelism dynamically (e.g., Pentium 4, AMD Opteron, IBM Power)
  2. Rely on software technology to find parallelism statically at compile time (e.g., Itanium 2 (IA-64))
Instruction-Level Parallelism (ILP)
- Basic Block (BB) ILP is quite small
  - BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
  - Average dynamic branch frequency is 15% to 25% => only 4 to 7 instructions execute between a pair of branches
  - Other problem: instructions in a BB likely depend on each other
- To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks (trace scheduling)
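The 4-to-7 figure is just the reciprocal of the branch frequency; a quick arithmetic check (plain Python, not part of the slides):

```python
# Average basic-block length is the reciprocal of the dynamic branch frequency
low = 1 / 0.25    # 25% branches -> ~4 instructions between branches
high = 1 / 0.15   # 15% branches -> ~6.7 instructions between branches
print(f"between {low:.1f} and {high:.1f} instructions per basic block")
```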
Instruction-Level Parallelism (ILP)
- Simplest: loop-level parallelism exploits parallelism among iterations of a loop.
- For example, the iterations of the following loop (the running example of this lecture) are independent of one another:

  for (j=0; j<1000; j=j+1)
      x[j] = x[j] + s;
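To see why such a loop is parallel: no iteration reads a value another iteration writes, so the updates can run in any order. A small Python sketch (array contents are illustrative) checks this:

```python
s = 3.0
x = [float(j) for j in range(8)]

# Updates applied in program order.
seq = [v + s for v in x]

# The same updates applied in reverse order: no iteration reads a value
# another iteration writes, so the result is identical, which is what
# makes the loop's iterations parallel.
rev = list(x)
for j in reversed(range(8)):
    rev[j] = rev[j] + s

assert rev == seq
```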
Loop-Level Parallelism
- Exploit loop-level parallelism by unrolling the loop, either
  1. dynamically, via branch prediction, or
  2. statically, via loop unrolling by the compiler
  (Another way is vectors, to be covered later)
- Determining dependences is critical!
- If 2 instructions are
  - Parallel: they can execute simultaneously in a pipeline of arbitrary depth without causing any stalls (assuming no structural hazards)
  - Dependent: they are not parallel and must be executed in order, although they may often be partially overlapped
Data Dependence and Hazards
- InstrJ is data dependent (aka truly dependent) on InstrI if:
  1. InstrJ tries to read an operand before InstrI writes it, or
  2. InstrJ is data dependent on InstrK, which is dependent on InstrI
- If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped
- A data dependence in the instruction sequence corresponds to a data dependence in the source code; the effect of the original data dependence must be preserved
- If a data dependence causes a hazard in a pipeline, it is a Read After Write (RAW) hazard ("RAW is real")
I: add r1,r2,r3
J: sub r4,r1,r3
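A toy dependence check over the I/J pair above can be sketched in Python (the dest/srcs encoding of instructions is an assumption for illustration, not the lecture's notation):

```python
def raw_hazard(instr_i, instr_j):
    """True if instr_j reads a register that instr_i writes (RAW)."""
    dest_i = instr_i["dest"]
    return dest_i is not None and dest_i in instr_j["srcs"]

i = {"op": "add", "dest": "r1", "srcs": ["r2", "r3"]}   # I: add r1,r2,r3
j = {"op": "sub", "dest": "r4", "srcs": ["r1", "r3"]}   # J: sub r4,r1,r3

assert raw_hazard(i, j)        # J reads r1 after I writes it
assert not raw_hazard(j, i)    # no dependence the other way
```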
ILP and Data Dependencies, Hazards
- HW/SW must preserve the illusion of program order
  - Code must give the same results as if instructions were executed sequentially in the original order of the source program
- Dependences are a property of programs
  - The presence of a dependence indicates the potential for a hazard, but the existence of an actual hazard and the length of any stall are pipeline properties
- Importance of data dependencies:
  1. Indicate the possibility of a hazard
  2. Determine the order in which results must be calculated
  3. Set upper bounds on how much parallelism can possibly be exploited to speed up a program
- HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
Name Dependence #1: Anti-dependence
- Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name; the 2 kinds of name dependence may cause WAR and WAW hazards.
- Anti-dependence: InstrJ writes an operand before InstrI reads it
  - Called an anti-dependence by compiler writers; it results from reuse of the name r1
- If an anti-dependence causes a hazard in the pipeline, that's a Write After Read (WAR) hazard
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Name Dependence #2: Output dependence
- Output dependence: InstrJ writes an operand before InstrI writes it.
- Called an output dependence by compiler writers; this also results from reuse of the name r1
- If an output dependence causes a hazard in the pipeline, that's a Write After Write (WAW) hazard
- Instructions involved in a name dependence can execute simultaneously if we change the names they use so that they do not conflict:
  - Register renaming by hardware
  - Use of different register names by the compiler
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
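Register renaming can be sketched as a toy Python pass over the I/J/K sequence above (the physical-register names p0, p1, ... are hypothetical, not part of the lecture):

```python
def rename(instrs):
    """Toy register renaming: give every write a fresh physical register
    and patch later reads, so only true (RAW) dependences remain."""
    mapping = {}                                   # architectural -> current physical
    fresh = iter(f"p{n}" for n in range(100))
    out = []
    for op, dest, srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]   # read the current names
        phys = next(fresh)
        mapping[dest] = phys                       # each write gets a new name
        out.append((op, phys, srcs))
    return out

# I: sub r1,r4,r3   J: add r1,r2,r3   K: mul r6,r1,r7
code = [("sub", "r1", ["r4", "r3"]),
        ("add", "r1", ["r2", "r3"]),
        ("mul", "r6", ["r1", "r7"])]
renamed = rename(code)

# I and J now write different registers, so the WAW hazard is gone,
# while K still reads J's result: the data flow is preserved.
assert renamed[0][1] != renamed[1][1]
assert renamed[1][1] in renamed[2][2]
```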
Control Dependencies
- Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order

  if p1 {
      S1;
  }
  if p2 {
      S2;
  }

- S1 is control dependent on proposition p1, and S2 is control dependent on p2, but not on p1.
Carefully Violate Control Dependencies
- Control dependence need NOT always be preserved
  - It can be violated by executing instructions that should not have been executed, if doing so does NOT affect program results
  - e.g., speculative-execution HW throws away the results of bad branch guesses
- Instead, the 2 properties critical to program correctness are:
  - Exception behavior, and
  - Data flow
Exception Behavior Is Important
- Preserving exception behavior => any change in instruction execution order must NOT change how exceptions are raised in the program (=> no new exceptions)
- Example (this code assumes branches are not delayed):

  DADDU R2,R3,R4
  BEQZ  R2,L1
  LW    R1,-1(R2)
  L1:

- What is the problem with moving LW before BEQZ?
  - e.g., array overflow: if R2 = 0, the address -1+[R2] is outside the program's memory bounds, so the hoisted load could raise a memory exception the original program never does
Data Flow Of Values Must Be Preserved
- Data flow: the actual flow of data values from instructions that produce results to those that consume them
  - Branches make the flow dynamic (we know the details only at run time); we must determine which instruction is the data supplier
- Example:

  DADDU R1,R2,R3
  BEQZ  R4,L
  DSUBU R1,R5,R6
  L:
  OR    R7,R1,R8

- Does OR depend on DADDU or on DSUBU? It depends on the branch outcome; compilers and HW must preserve data flow during execution
Outline
- ILP: Instruction-Level Parallelism
- Compiler techniques to increase ILP
- Loop Unrolling
- Static Branch Prediction
- Dynamic Branch Prediction
- Overcoming Data Hazards with Dynamic Scheduling
- (Start) Tomasulo Algorithm
- Conclusion
Software Techniques - Example
- This code adds a scalar to a vector:

  for (i=1000; i>0; i=i-1)
      x[i] = x[i] + s;

- Assume the following latencies for all examples
- Ignore delayed branches in these examples
Instruction producing result | Instruction using result | Latency in cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1
Load double                  | Store double             | 0
Integer op                   | Integer op               | 0
FP Loop: Where are the Hazards?
- First, translate into MIPS code. To simplify, assume:
  - 8 is the lowest address,
  - F2 has s, and
  - R1 starts with the address of x[1000]
for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
Loop: L.D    F0,0(R1)    ;F0=vector element
      ADD.D  F4,F0,F2    ;add scalar from F2
      S.D    0(R1),F4    ;store result
      DADDUI R1,R1,-8    ;decrement pointer 8B (DW)
      BNEZ   R1,Loop     ;branch if R1!=zero
FP Loop Showing Stalls
1 Loop: L.D    F0,0(R1)   ;F0=vector element
2       stall
3       ADD.D  F4,F0,F2   ;add scalar in F2
4       stall
5       stall
6       S.D    0(R1),F4   ;store result
7       DADDUI R1,R1,-8   ;decrement pointer 8 bytes (DW)
8       stall             ;assume cannot forward to branch
9       BNEZ   R1,Loop    ;branch if R1!=zero
(plus the branch delay!)

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;

Instruction producing result | Instruction using result | Latency in clock cycles
FP ALU op                    | Another FP ALU op        | 3
FP ALU op                    | Store double             | 2
Load double                  | FP ALU op                | 1

The loop runs in 9 clock cycles per iteration. How can we reorder the code to minimize stalls?
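The 9-cycle count can be rechecked with a back-of-the-envelope model that just adds the table's stall counts for each producer/consumer pair (a simplification that assumes each dependent pair is adjacent, as it is in this loop body):

```python
# Stall counts between dependent instruction pairs (from the latency table).
stalls = {
    ("LD", "FPALU"): 1,     # load double -> FP ALU op
    ("FPALU", "SD"): 2,     # FP ALU op -> store double
    ("INT", "BR"): 1,       # integer op -> dependent branch
}

# The loop body in program order, and its producer -> consumer pairs.
loop = ["LD", "FPALU", "SD", "INT", "BR"]
deps = [("LD", "FPALU"), ("FPALU", "SD"), ("INT", "BR")]

cycles = len(loop) + sum(stalls.get(d, 0) for d in deps)
assert cycles == 9   # matches the slide: one iteration every 9 clock cycles
```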
Revised FP Loop Minimizing Stalls
1 Loop: L.D F0,0(R1)
2 DADDUI R1,R1,-8
3 ADD.D F4,F0,F2
4 stall
5 stall
6 S.D 8(R1),F4 ;altered offset 0=>8 when moved DADDUI
7 BNEZ R1,Loop
1 Loop: L.D F0,0(R1)
2 stall
3 ADD.D F4,F0,F2
4 stall
5 stall
6 S.D 0(R1),F4
7 DADDUI R1,R1,-8
8 stall
9 BNEZ R1,Loop
Swap DADDUI and S.D; change address offset of S.D
Producing result | Using result    | Stalls between
FP ALU op        | Other FP ALU op | 3
FP ALU op        | Store double    | 2
Load double      | FP ALU op       | 1
Load double      | Store double    | 0
Integer op => R  | Branch on R     | 1
Integer op       | Integer op      | 0
The loop now takes 7 clock cycles:
- 3 for execution (L.D, ADD.D, S.D)
- 4 for loop overhead (DADDUI, BNEZ, and the 2 stalls)
How can we make it faster?

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
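The offset change on S.D is what keeps the reordered loop correct: once DADDUI has already subtracted 8 from R1, storing to 8(R1) hits the same address that 0(R1) did before the swap. A quick Python check with a simulated memory (the address 1000 and value 42.0 are illustrative):

```python
def original(mem, r1, f4):
    mem[r1 + 0] = f4        # S.D 0(R1),F4
    r1 -= 8                 # DADDUI R1,R1,-8
    return mem, r1

def reordered(mem, r1, f4):
    r1 -= 8                 # DADDUI moved above the store
    mem[r1 + 8] = f4        # S.D 8(R1),F4 -- offset compensates
    return mem, r1

m1, a1 = original({}, 1000, 42.0)
m2, a2 = reordered({}, 1000, 42.0)
assert m1 == m2 and a1 == a2   # same store address, same final R1
```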
Unroll Original Loop Four Times
- A straightforward way to save time!
Four iterations take 27 clock cycles, or 6.75 per iteration!
(Assumes R1 is a multiple of 4)

for (i=1000; i>0; i=i-1)
    x[i] = x[i] + s;
1  Loop: L.D    F0,0(R1)
3        ADD.D  F4,F0,F2
6        S.D    0(R1),F4     ;drop DADDUI & BNEZ
7        L.D    F6,-8(R1)
9        ADD.D  F8,F6,F2
12       S.D    -8(R1),F8    ;drop DADDUI & BNEZ
13       L.D    F10,-16(R1)
15       ADD.D  F12,F10,F2
18       S.D    -16(R1),F12  ;drop DADDUI & BNEZ
19       L.D    F14,-24(R1)
21       ADD.D  F16,F14,F2
24       S.D    -24(R1),F16
25       DADDUI R1,R1,#-32   ;alter to 4*8
27       BNEZ   R1,LOOP
(1-cycle stall after each L.D, 2-cycle stall after each ADD.D,
and a 1-cycle stall before BNEZ)
How can we rewrite the loop to minimize stalls?
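Unrolling changes only how often the loop overhead runs, not what is computed. A Python sketch comparing the original and 4-way unrolled loops (assuming, as the slide does, a trip count divisible by 4):

```python
s = 2.5
n = 1000                      # trip count, assumed a multiple of 4
x_ref = [float(i) for i in range(n)]
x_unr = list(x_ref)

# Original loop: one back-edge per element.
for i in range(n):
    x_ref[i] += s

# Unrolled by 4: one back-edge (and one index update) per 4 elements.
for i in range(0, n, 4):
    x_unr[i]     += s
    x_unr[i + 1] += s
    x_unr[i + 2] += s
    x_unr[i + 3] += s

assert x_unr == x_ref         # same result, a quarter of the loop overhead
```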
Loop Unrolling Detail - Strip Mining
- We do not usually know the upper bound of a loop
- Suppose it is n, and we would like to unroll the loop to make k copies of the body
- Instead of a single unrolled loop, generate a pair of consecutive loops:
  - the 1st executes (n mod k) times and has a body that is the original loop (called strip mining of a loop)
  - the 2nd is the unrolled body surrounded by an outer loop that iterates floor(n/k) times
- For large values of n, most of the execution time will be spent in the floor(n/k) iterations of the unrolled loop
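The two-loop structure can be sketched in Python (k = 4 and the array contents are illustrative):

```python
def strip_mined_add(x, s, k=4):
    """Add s to every element: an (n mod k)-iteration cleanup loop
    followed by floor(n/k) iterations of a k-times-unrolled body."""
    n = len(x)
    # 1st loop: executes n mod k times (the strip-mined remainder)
    for i in range(n % k):
        x[i] += s
    # 2nd loop: floor(n/k) iterations, each covering k elements
    for i in range(n % k, n, k):
        for u in range(k):
            x[i + u] += s
    return x

# n = 10, k = 4: 2 cleanup iterations, then 2 unrolled iterations
ref = [float(i) + 1.0 for i in range(10)]
assert strip_mined_add([float(i) for i in range(10)], 1.0) == ref
```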
Unrolled Loop with Minimal Stalls
1 Loop: L.D F0,0(R1)
3 ADD.D F4,F0,F2
6 S.D 0(R1),F4
7 L.D F6,-8(R1)
9 ADD.D F8,F6,F2
12 S.D -8(R1),F8
13 L.D F10,-16(R1)
15 ADD.D F12,F10,F2
18 S.D -16(R1),F12
19 L.D F14,-24(R1)
21 ADD.D F16,F14,F2
24 S.D -24(R1),F16
25 DADDUI R1,R1,#-32
27 BNEZ R1,LOOP
1 Loop: L.D F0,0(R1)
2 L.D F6,-8(R1)
3 L.D F10,-16(R1)
4 L.D F14,-24(R1)
5 ADD.D F4,F0,F2
6 ADD.D F8,F6,F2
7 ADD.D F12,F10,F2
8 ADD.D F16,F14,F2
9 S.D 0(R1),F4
10 S.D -8(R1),F8
11 S.D -16(R1),F12
12 DADDUI R1,R1,#-32
13 S.D 8(R1), F16 ; 8-32 = -24
14 BNEZ R1,LOOP
Four unrolled iterations take 14 clock cycles, or 3.5 per original iteration
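A quick sanity check on the arithmetic: the scheduled loop issues 14 instructions with no stalls, and each dependent pair is spaced at least as far apart as its required stall count:

```python
# Issue cycles of the first L.D, its ADD.D, and its S.D in the schedule above
issue = {"L.D": 1, "ADD.D": 5, "S.D": 9}

# Load double -> FP ALU op needs 1 intervening cycle; FP ALU op -> store needs 2
assert issue["ADD.D"] - issue["L.D"] - 1 >= 1
assert issue["S.D"] - issue["ADD.D"] - 1 >= 2

cycles, iterations = 14, 4   # 4 L.D + 4 ADD.D + 4 S.D + DADDUI + BNEZ
assert cycles / iterations == 3.5
```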