1 Advanced Computer Architecture Limits to ILP Lecture 3.
-
Upload
sincere-roby -
Category
Documents
-
view
219 -
download
0
Transcript of 1 Advanced Computer Architecture Limits to ILP Lecture 3.
![Page 1: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/1.jpg)
1
Advanced Computer Architecture
Limits to ILPLecture 3
![Page 2: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/2.jpg)
2
Factors limiting ILP Rename registers
Reservation stations, reorder buffer, rename file
Branch prediction Jump prediction
Harder than branches Memory aliasing Window size Issue width
![Page 3: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/3.jpg)
3
ILP in a “perfect” processor
![Page 4: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/4.jpg)
4
Window size limits How far ahead can you look? 50 instruction window
2,450 register comparisons looking for RAW, WAR, WAW
Assuming only register-register operations
2,000 instruction window ~4M comparisons
Commercial machines: window size up to 128
![Page 5: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/5.jpg)
5
Window size effects
Assuming all instructions take 1 cycle,
Assuming perfect branch prediction
![Page 6: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/6.jpg)
6
Window size effects, limited issue
Limited to 64 issues per clock
(note: current limits closer to 6
![Page 7: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/7.jpg)
7
Branch prediction effects
Assuming 2K instruction window
![Page 8: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/8.jpg)
8
How much misprediction affects ILP?
Not much
![Page 9: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/9.jpg)
9
Register limitations
Current processors near 256 – not much left!
![Page 10: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/10.jpg)
10
Memory disambiguation Memory aliasing
Like RAW, WAW, WAR, but for loads and stores
Even compiler can’t always know Indirection, base address changes, pointers
How to avoid? Don’t: do everything in order Speculate: fix up later Value prediction
![Page 11: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/11.jpg)
11
Alias analysis
Current compilers somewhere between Inspection and Global/stack perfect
![Page 12: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/12.jpg)
12
Instruction issue complexity “Classic” RISC instructions
Only 32-bit instructions Current RISC
Most have 16-bit as well as 32-bit forms
x86 1 to 17 bytes per instruction
![Page 13: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/13.jpg)
13
Tradeoff More instruction issue
More gates in decode – lower clock rate More pipe stages in decode – more
branch penalty More instruction execution
More register ports – lower clock rate More comparisons – lower clock rate or
more pipe stages Chip area grows faster than
performance?
![Page 14: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/14.jpg)
14
Help from the compiler Standard FP loop:
Loop: L.D F0,0(R1)
ADD.D F4,F0,F2
S.D F4,0(R1)
DADDUI R1,R1,#-8
BNE R1,R2,Loop
1 cycle load stall
2 cycle execute stall
1 cycle branch stall
![Page 15: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/15.jpg)
15
Same code, reordered
Loop: L.D F0,0(R1)
DADDUI R1,R1,#-8
ADD.D F4,F0,F2
BNE R1,R2,Loop
S.D F4,0(R1)
in load delay slot
in branch delay slot
1 cycle execute stall
Saves 3 cycles per loop
![Page 16: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/16.jpg)
16
Loop Unrolling In previous loop
3 instructions do “work” 2 instructions “overhead” (loop
control) Can we minimize overhead?
Do more “work” per loop Repeat loop body multiple times for
each counter/branch operation
![Page 17: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/17.jpg)
17
Unrolled Loop
Loop: L.D F0,0(R1)ADD.D F4,F0,F2S.D F4,0(R1)
L.D F0,-8(R1)ADD.D F4,F0,F2S.D F4,-8(R1)
L.D F0,-16(R1)ADD.D F4,F0,F2S.D F4,-16(R1)
L.D F0,-24(R1)ADD.D F4,F0,F2S.D F4,-24(R1)DADDUI R1,R1,#-32BNE R1,R2,Loop
![Page 18: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/18.jpg)
18
Tradeoffs Fewer ADD/BR instructions
1/n for n unrolls Code expansion
Nearly n times for n unrolls Prologue
Most loops aren’t 0 mod n iterations Need a loop start (or end) of single
iterations from 0 to n-1
![Page 19: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/19.jpg)
19
Unrolled loop, dependencies minimized
Loop: L.D F0,0(R1)ADD.D F4,F0,F2S.D F4,0(R1)
L.D F6,-8(R1)ADD.D F8,F6,F2S.D F8,-8(R1)
L.D F10,-16(R1)ADD.D F12,F10,F2S.D F12,-16(R1)
L.D F14,-24(R1)ADD.D F16,F14,F2S.D F16,-24(R1)DADDUI R1,R1,#-32BNE R1,R2,Loop
This is what register renaming does in hardware
![Page 20: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/20.jpg)
20
Unrolled loop, reordered
Loop: L.D F0,0(R1)L.D F6,-8(R1)L.D F10,-16(R1)L.D F14,-24(R1)ADD.D F4,F0,F2ADD.D F8,F6,F2ADD.D F12,F10,F2ADD.D F16,F14,F2S.D F4,0(R1)S.D F8,-8(R1)DADDUI R1,R1,#-32S.D F12,16(R1)BNE R1,R2,LoopS.D F16,8(R1)
This is what dynamic scheduling does in hardware
![Page 21: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/21.jpg)
21
Limitations on software rescheduling Register pressure
Run out of architectural registers Why Itanium has 128 registers
Data path pressure How many parallel loads?
Interaction with issue hardware Can hardware see far enough ahead to
notice? One reason there are different compilers for
each processor implementation Compiler sees dependencies
Hardware sees hazards
![Page 22: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/22.jpg)
22
Superscalar Issue
Integer pipeline FP pipelineLoop: L.D F0,0(R1) L.D F6,-8(R1) L.D F10,-16(R1) ADD.D F4,F0,F2 L.D F14,-24(R1) ADD.D F8,F6,F2 L.D F18,-32(R1) ADD.D F12,F10,F2 S.D F4,0(R1) ADD.D F16,F14,F2 S.D F8,-8(R1) ADD.D F20,F18,F2 S.D F12,-16(R1) DADDUI R1,R1,# -40 S.D F16,16(R1) BNE R1,R2,Loop S.D F20,8(R1)
Can the compiler tell the hardware to issue in parallel?
![Page 23: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/23.jpg)
23
Limit on loop unrolling Keeping the pipeline filled
Each iteration is load-execute-store Pipeline has to wait – at least at start
of loop Load delays Dependencies Code explosion
Cache effects
![Page 24: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/24.jpg)
24
Loop dependencies
for(i=0;i<1000;i++) {A[i] = B[i];
}
for(i=0;i<1000;i++) {A[i] = B[i] + x;B[i+1] = A[i] + B[i];
}
Embarrassingly parallel
Iteration i depends on i-1
![Page 25: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/25.jpg)
25
Software pipelining Separate program loop from data
loop Each iteration has different phases of
execution Each iteration of original loop occurs in
2 or more iterations of pipelined loop Analogous to Tomasulo
Reservation stations hold different iterations of loop
![Page 26: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/26.jpg)
26
Software pipeliningIteration
0Iteration
1Iteration
2Iteration
3
Iteration -2
Iteration -1
Iteration 0
Iteration 1
Iteration 2
![Page 27: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/27.jpg)
27
Compiling for Software Pipeline
for(i=0; i<1000; i++ ) {x = load(a[i]);y = calculate(x);b[i] = y;
}
x = load(a[0]);y = calculate(x);x = load(a[0]);for(i=1; i<1000; i++ ) {
b[i-1] = y;y = calculate(x);x = load(a[i+1]);
}
Original
Pipelined
![Page 28: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/28.jpg)
28
Unrolling vs. Software pipelining
Loop unrolling minimizes branch overhead Software pipelining minimizes
dependencies Especially useful for long latency loads/stores No idle cycles within iterations
Can combine unrolling with software pipelining
Both techniques are standard practice in high performance compilers
![Page 29: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/29.jpg)
29
Static Branch Prediction Some of the benefits of dynamic
techniques Delayed branch Compiler guesses
for(i=0;i<10000;i++) is pretty easy Compiler hints
Some instruction sets include hints Especially useful for trace-based
optimization
![Page 30: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/30.jpg)
30
More branch troubles Even unrolled, loops still have some
branches Taken branches can break pipeline –
nonsequential fetch Unpredicted branches can break pipeline
Especially annoying for short branchesa = f();b = g();if( a > 0 )
x = a;else
x = b;
![Page 31: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/31.jpg)
31
Predication Add a predicate to an instruction
Instruction only executes if predicate is true Several approaches
flags, registers On every instruction, or just a few (e.g., load
predicated) Used in several architectures
HP Precision, ARM, IA-64LD R2,b[]LD R1,a[]ifGT MOV R4,R1ifLE MOV R4,R2…
![Page 32: 1 Advanced Computer Architecture Limits to ILP Lecture 3.](https://reader034.fdocuments.in/reader034/viewer/2022051401/56649c7c5503460f94930edb/html5/thumbnails/32.jpg)
32
Cost/Benefit of Predication Enables longer basic blocks
Much easier for compilers to find parallelism Can eliminate instructions
Branch folded into instruction Instruction space pressure
How to indicate predication? Register pressure
May need extra registers for predicates Stalls can still occur
Still have to evaluate predicate and forward to predicated instruction