Presentation stolen from the web (with changes) from the Univ of Aberta and Espen Skoglund and...
Presentation stolen from the web (with changes)
from the Univ of Aberta and
Espen Skoglund and
Thomas Richards (470 alum) and
Our textbook’s authors
IA-64: Advanced Loads, Speculative Loads, Software Pipelining
IA-64
• 128 64-bit registers
  – Use a register window scheme somewhat similar to SPARC
• 128 82-bit fp registers
• 64 1-bit predicate registers
• 8 64-bit branch target registers
Explicit Parallelism
• Groups
  – Instructions which could be executed in parallel if hardware resources are available.
• Bundle
  – Code format. 3 instructions fit into a 128-bit bundle.
  – 5 bits of template, 41*3 bits of instruction.
    » Template specifies what execution units each instruction requires.
Instruction groups
• IA-64 instructions are bound in instruction groups
  – No read-after-write dependencies
  – No write-after-write dependencies
  – Any instruction in the group may be executed in parallel
  – New processors can easily take advantage of the existing ILP in the instruction group
• Instruction groups are indicated by stop bits in the template
• Instruction groups may end dynamically on branches
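The grouping rule above (no read-after-write and no write-after-write dependencies inside a group) can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: instructions are modelled as hypothetical `(dest, sources)` tuples, memory dependencies and the architecture's special cases are ignored, and a new group is started greedily, which is where a compiler would place a stop bit.

```python
def split_into_groups(instrs):
    """Greedily split (dest, sources) tuples into dependence-free groups.

    A new group starts whenever an instruction reads (RAW) or writes (WAW)
    a register that was already written inside the current group.
    """
    groups, current, written = [], [], set()
    for dest, sources in instrs:
        conflict = dest in written or any(s in written for s in sources)
        if conflict and current:
            groups.append(current)          # a stop would go here
            current, written = [], set()
        current.append((dest, sources))
        if dest is not None:
            written.add(dest)
    if current:
        groups.append(current)
    return groups


# Hypothetical four-instruction sequence used only for illustration.
program = [
    ("r15", ["r2", "r3"]),   # add  r15 = r2, r3
    ("r4",  ["r15", "r2"]),  # mult r4  = r15, r2   (RAW on r15)
    ("r5",  ["r7"]),         # independent of the previous instruction
    ("r4",  ["r4", "r4"]),   # mult r4  = r4, r4    (RAW and WAW on r4)
]
groups = split_into_groups(program)
print([len(g) for g in groups])
```

Wider hardware can issue a whole group at once without re-checking dependencies, which is the point of the "new processors can take advantage of the existing ILP" bullet.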
Instruction bundles
• Instruction bundles contain
  – 3 instructions
  – A template field which maps instructions to execution units
• Processor dispatches all three instructions in parallel
• Instruction group may end in middle of bundle
• Bundles are aligned on 16-byte boundaries
[Figure: instruction bundle layout (128 bits). Slot 3: bits 127-87; Slot 2: bits 86-46; Slot 1: bits 45-5; Template: bits 4-0.]
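As a rough illustration of the bundle format (5 template bits plus three 41-bit slots in 128 bits), the following Python sketch packs and unpacks a bundle held in a plain integer. The function names are made up for this example, and the slot contents are treated as opaque 41-bit numbers rather than real instruction encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3 or any(not 0 <= s <= SLOT_MASK for s in slots):
        raise ValueError("need exactly three 41-bit slots")
    bundle = template                      # bits 4..0
    for i, slot in enumerate(slots):       # slot i starts at bit 5 + 41*i
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit integer back into (template, [slot, slot, slot])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


bundle = pack_bundle(0x10, [1, 2, 3])
print(unpack_bundle(bundle))
```

The arithmetic 5 + 3 * 41 = 128 is why a bundle fills exactly 16 bytes and can be aligned on 16-byte boundaries.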
Predication
• Use predicates to eliminate branches
• Predicates are one-bit registers (total of 64)
• Most instructions can be predicated
    (qp) mnemonic dest = source
• Predicates are set by compare instructions
    (qp) cmp.crel px,py = source
• x86 assembly:
    cmp a, b
    je .eq
    add $4, y
    jmp .done
  .eq:
    add $3, y
  .done:
• IA-64 assembly:
    cmp.eq p1,p2 = a,b
    (p1) add y = y, 3
    (p2) add y = y, 4
• C code:
    if (a == b)
        y += 3;
    else
        y += 4;
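The if-conversion shown above can be mimicked in Python to check that the predicated form computes the same result as the branching form. This is a behavioural model only: the helper `commit` is a made-up stand-in for "the instruction executes but its result is kept only when the qualifying predicate is 1".

```python
def branchy(a, b, y):
    """The original C code, with a real branch."""
    if a == b:
        y += 3
    else:
        y += 4
    return y


def commit(pred, old, new):
    """Model of a predicated write: keep `new` only if the predicate is set."""
    return new if pred else old


def predicated(a, b, y):
    """Model of the IA-64 version: one compare, two predicated adds, no branch."""
    p1 = a == b          # cmp.eq p1,p2 = a,b sets p1 and its complement p2
    p2 = not p1
    y = commit(p1, y, y + 3)   # (p1) add y = y, 3
    y = commit(p2, y, y + 4)   # (p2) add y = y, 4
    return y


print(predicated(1, 1, 10), predicated(1, 2, 10))
```

Exactly one of the two adds takes effect because the compare writes complementary values into p1 and p2.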
Advanced loads and speculative loads
Advanced loads
• Used to address data dependencies
    Original order:  st; ld
    Transformed:     advanced load; st; check load
Speculative loads
• Used to address control dependencies
    Original order:  (p) br; ld
    Transformed:     speculative load; (p) br; check speculation
Advanced loads
• Addr1 and addr2 in example might point to same address
• If different:
  – Datum in addr2 can be prefetched
• If same:
  – Datum in addr2 cannot be prefetched
• C code example:
    int foo (int *addr1, int *addr2)
    {
        int h;
        *addr1 = 4;
        h = *addr2;
        return h+1;
    }
Advanced loads
• Insert advanced loads (ld.a) to prefetch data (the load is recorded in the ALAT)
• Use check data instruction (ld.c) in place of original load
• If memory contents have changed, perform real load
• Advanced loads do not defer exceptions (e.g., page-faults)
• Regular load:
    add r3 = 4,r0 ;;
    st4 [r32] = r3
    ld4 r2 = [r33]         // regular load
    add r5 = r2,r3         // use data
• Advanced load:
    ld4.a r2 = [r33]       // advanced load
    add r3 = 4,r0 ;;
    st4 [r32] = r3
    ld4.c r2 = [r33] ;;    // verify data
    add r5 = r2,r3         // use data
Speculative loads
• If addr in example is legal, we can prefetch its value
• If addr is illegal, prefetching the value would cause an exception
• Any exception should be delayed until code path has been resolved
• C code example:
    int add5 (int *addr)
    {
        if (addr == NULL)
            return (-1);
        else
            return (*addr+5);
    }
Speculative loads
• Insert speculative loads (ld.s) to prefetch data
• Verify load using check instruction (chk.s)
• NaT-bit/NaTVal is used to track success of load
• Might also be combined with advanced loads (ld.sa and chk.a)
• Assembly code:
  add5:
    ld8.s r1 = [r32]
    cmp.eq p6,p5 = r32,r0 ;;
    (p6) add r8 = -1,r0
    (p6) br.ret
    (p5) chk.s r1,return_error
    add r8 = 5,r1
    br.ret ;;
  return_error:
    recovery code
Code example: “Why hoist loads?”
    add  r15 = r2,r3    //A
    mult r4 = r15,r2    //B
    mult r4 = r4,r4     //C
    st8  [r12] = r4     //D
    ld8  r5 = [r15]     //E
    div  r6 = r5,r7     //F
    add  r5 = r6,r2     //G
Assume latencies are:
    add, store: +0
    mult, div: +3
    ld: +4
[Figure: dependence graph with per-instruction cycle counts A:1, B:4, C:4, D:1, E:5, F:4, G:1, annotated with the totals 20, 11 and 10.]
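One way to read the totals 20, 11 and 10 in this example is as schedule lengths, which the Python sketch below recomputes. It assumes each instruction costs 1 cycle plus the extra latency listed on the slide, and that an instruction can start as soon as everything it depends on has finished. The "no hoist" variant adds an assumed dependence of the load E on the store D (the two might alias), which is exactly the constraint an advanced load lets the compiler break.

```python
# Cycle cost per instruction: 1 + the extra latency given on the slide.
COST = {"A": 1, "B": 4, "C": 4, "D": 1, "E": 5, "F": 4, "G": 1}

# Register dependences read off the code (A feeds B and E, and so on).
HOISTED = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"],
           "E": ["A"], "F": ["E"], "G": ["F"]}

# Same graph, but the load E must also wait for the store D (possible alias).
NO_HOIST = dict(HOISTED, E=["A", "D"])


def finish_times(deps):
    """Earliest finish time of every instruction, ignoring resource limits."""
    finish = {}
    for name in sorted(deps):              # A..G is already a topological order
        start = max((finish[d] for d in deps[name]), default=0)
        finish[name] = start + COST[name]
    return finish


hoisted = finish_times(HOISTED)
no_hoist = finish_times(NO_HOIST)
print(hoisted["D"], hoisted["G"], no_hoist["G"])
```

Under these assumptions the store chain A, B, C, D finishes at cycle 10, the load chain A, E, F, G finishes at cycle 11 when the load is hoisted, and the whole sequence takes 20 cycles when the load has to wait for the store.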
Advanced Loads Recovery
• Case A – Hoist just the load.
  » In this case, if there is a memory dependency we just re-execute the load.
    // Case A: Advanced Load
    ld.a r2 = [r10]
    st8 [r1] = r9
    ld.c r2 = [r10]
    add r15 = r2, r3
    st8 [r18] = r19
• Case B – Hoist the load and dependent instructions.
  » In this case, we need to re-execute all of the dependent instructions.
    // Case B: Advanced Load
    // With Speculative Add
    ld.a r2 = [r10]
    add r5 = r2, r3
    st8 [r1] = r9
    ld.c r2 = [r10]    // Wrong
    st8 [r18] = r19
  » A ld.c will only re-execute the load; r5 is still wrong after the ld.c!
Advanced Load-Use Recovery: Compiler Generated Recovery Code
    // Solution: Using the chk.a instruction
    ld8.a r2 = [r10]
    add r5 = r2, r3
    st8 [r1] = r9
    chk.a r2, fixup
  return:              // Return Point
    st8 [r18] = r19
    ............
  fixup:               // Re-execute load and all speculative uses
    ld8 r2 = [r10]
    add r5 = r2, r3
    br return
• Use ld.c if JUST a load is speculative. Use chk.a if a load and an instruction that is dependent on the load are both speculative.
The Advanced Load Address Table (ALAT)
• The ALAT tells us if we need to recover from an Advanced Load.
• When an advanced load is executed
  – Save the type of load, size of load, and load address to the ALAT (indexed by physical register number, PR).
• When we execute a ld.c or chk.a, look for the entry in the ALAT. If it is missing, run the recovery code.
• Remove an entry from the ALAT if
  – A store address overlaps an ALAT entry.
  – Capacity/associativity evictions.
  – Another advanced load indexes the same PR.
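The ALAT rules above can be sketched as a small Python model. It is a toy under stated assumptions: entries are keyed by a register name, memory is a dictionary keyed by byte address, capacity evictions are left out, and the class and function names are invented for this example. The function `foo` replays the earlier `foo(addr1, addr2)` example with the load hoisted above the store.

```python
class ALAT:
    """Toy Advanced Load Address Table: register -> (address, size)."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, reg, addr, size):
        # A new advanced load to the same register replaces the old entry.
        self.entries[reg] = (addr, size)

    def store(self, addr, size):
        # Any store that overlaps a recorded load invalidates that entry.
        for reg, (a, s) in list(self.entries.items()):
            if addr < a + s and a < addr + size:
                del self.entries[reg]

    def check(self, reg):
        # ld.c / chk.a succeed only if the entry is still present.
        return reg in self.entries


def foo(mem, addr1, addr2):
    """Model of: ld4.a r2=[addr2]; st4 [addr1]=4; ld4.c r2=[addr2]; h+1."""
    alat = ALAT()
    r2 = mem[addr2]
    alat.advanced_load("r2", addr2, 4)     # ld4.a
    mem[addr1] = 4
    alat.store(addr1, 4)                   # st4 snoops the ALAT
    reloaded = not alat.check("r2")
    if reloaded:                           # ld4.c: redo the load on a miss
        r2 = mem[addr2]
    return r2 + 1, reloaded


print(foo({100: 7, 200: 9}, 100, 200))   # distinct addresses
print(foo({100: 7}, 100, 100))           # aliasing addresses
```

With distinct addresses the hoisted value is kept, and with aliasing addresses the check misses and the load is repeated, so both cases return the same result as the original C code.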
Control Speculation and Recovery
• What if we want to move a load above a branch?
  – Problem is that the load maybe shouldn’t have executed and might have thrown a spurious exception.
• Similar to Advanced Load, but no ALAT.
  – Instead, check NaT bit for deferred exceptions.
    » See next slide.
  – Use chk.s for recovery (instead of chk.a or ld.c).
    // Control Speculation and Recovery
    ld8.s r1 = [r10]             // load moved outside of branch
    st8 [r11] = r9
    (p1) br.cond branch_label    // (p1) is a predication bit
    chk.s r1, recovery
  return:
    add r2 = r1, r2
• chk.s checks r1 to see if the NaT bit is set. If so, branch to recovery code (re-execute instructions if necessary).
Not a Thing Bit (NaT)
• If a control speculative load causes an exception, the processor can set this bit, which defers the exception.
• NaT bits propagate.
  – Propagation allows a single check for multiple ld.s.
[Figure: IA64 register, 64 bits + 1 NaT bit]
    ld8.s r1 = [r10]
    ld8.s r2 = [r11]
    add r3 = r1, r2
    ld8.s r4 = [r3]
    st8 [r11] = r9
    (p1) br.cond branch_label
    chk.s r4, recovery
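NaT propagation in the chain above can be modelled in Python with a sentinel value. This is a sketch under simplifying assumptions: a load from an address missing in the `mem` dictionary stands in for a faulting load, `NAT` is a made-up sentinel playing the role of the NaT bit, and the store and branch from the slide are omitted because they do not affect the check.

```python
NAT = object()   # stands in for "register has its NaT bit set"


def ld_s(mem, addr):
    """Speculative load: defer the fault by returning NAT instead of raising."""
    if addr is NAT or addr not in mem:
        return NAT
    return mem[addr]


def add(a, b):
    """NaT propagates through ordinary computation."""
    return NAT if a is NAT or b is NAT else a + b


def speculative_chain(mem):
    """ld8.s r1; ld8.s r2; add r3; ld8.s r4; chk.s r4 (one check for all)."""
    r1 = ld_s(mem, 10)
    r2 = ld_s(mem, 11)
    r3 = add(r1, r2)
    r4 = ld_s(mem, r3)
    needs_recovery = r4 is NAT             # chk.s r4, recovery
    return r4, needs_recovery


print(speculative_chain({10: 1, 11: 2, 3: 42}))
```

If any of the three loads fails, the final register carries NaT, which is why a single chk.s on r4 covers all of them.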
Software pipelining on IA-64
• Lots of tricks
  – Rotating registers
  – Special counters
• Often don’t need prolog and epilog.
  – Special counters and predication let us execute only those instructions we need to.
Prolog and epilog
From before!!!!!
    r3=r3-8              // Needed to check legal!
    r4=MEM[r2+0]         //A(1)
    r1=r4*2              //B(1)
    r4=MEM[r2+4]         //A(2)
  Loop:
    MEM[r2+0]=r1         //C(n)
    r1=r4*2              //B(n+1)
    r4=MEM[r2+8]         //A(n+2)
    r2=r2+4              //D(n)
    bne r2 r3 Loop       //E(n)
    MEM[r2+0]=r1         // C(x-1)
    r1=r4*2              // B(x)
    MEM[r2+0]=r1         // C(x)
    r3=r3+8              // Could have used tmp var.
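The prolog / kernel / epilog structure can be written out in Python to check that it matches the plain loop. This is a sketch, not a translation of the slide's code: it uses list indices instead of byte addresses, a while loop so that two-element inputs also work, and it assumes the final store targets the last element.

```python
def doubled_simple(xs):
    """Reference loop: load, multiply by 2, store."""
    return [2 * x for x in xs]


def doubled_pipelined(xs):
    """Same computation with an explicit prolog, kernel and epilog."""
    n = len(xs)
    if n < 2:
        raise ValueError("this sketch needs at least two elements")
    out = [None] * n
    # Prolog: fill the pipeline.
    r4 = xs[0]           # A(1)
    r1 = r4 * 2          # B(1)
    r4 = xs[1]           # A(2)
    i = 0
    # Kernel: one store, one multiply and one load from three iterations.
    while i < n - 2:
        out[i] = r1      # C(n)
        r1 = r4 * 2      # B(n+1)
        r4 = xs[i + 2]   # A(n+2)
        i += 1           # D(n)
    # Epilog: drain the pipeline.
    out[i] = r1          # C(x-1)
    r1 = r4 * 2          # B(x)
    out[i + 1] = r1      # C(x)
    return out


print(doubled_pipelined([1, 2, 3, 4, 5]))
```

The prolog and epilog are exactly the code that rotating registers and stage predicates make unnecessary on IA-64.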
There are three special purpose registers used in IA-64 for software pipelining
• Loop counter (LC) indicates how many times to run through loop (prolog/kernel)
  – Initialized to N-1 before starting loop code
  – Decremented until LC == 0
• Epilog counter (EC) indicates how many times to run loop after loop counter exhausted (epilog)
  – Needed to flush the software pipeline
  – Initialized to num-stages before entering loop code
  – Decremented if LC == 0 and EC > 0
And RRB (Register Rename Base)
• Add internal counter RRB to register number to get actual used register
  – Counter decreased by special loop branch instructions
  – May be reset by clrrrb instruction
  – Use modular lookup (so we wrap around!)
• Rotated predicate registers
  – Initially reset using: mov pr.rot = value
  – pr63 is written before every rotation (1 while LC != 0, otherwise 0), so that value shows up as p16 after the rotation
How does register rotation work? (Basics)
• Rotated registers:
  – General: gr32 - grN (as specified by alloc instruction)
  – Predicate: pr16 - pr63
  – Floating point: fr32 - fr127
• Registers are rotated to higher numbers
  – Register rn is renamed to rn+1, rmax is renamed to rmin
• Registers are rotated by specific loop branch instructions
  – br.ctop, br.cexit (for counted loops)
  – br.wtop, br.wexit (for while loops)
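How they relate (ctop, cexit)
• LC != 0 (prolog/kernel)
  – LC--, EC=EC, PR[63]=1, RRB--
  – ctop: branch; cexit: fall-thru
• LC == 0 (epilog) and EC > 1
  – LC=LC, EC--, PR[63]=0, RRB--
  – ctop: branch; cexit: fall-thru
• LC == 0 (epilog) and EC == 1
  – LC=LC, EC--, PR[63]=0, RRB--
  – ctop: fall-thru; cexit: branch
• LC == 0 and EC == 0 (special unrolled loops)
  – LC=LC, EC=EC, PR[63]=0, RRB=RRB
  – ctop: fall-thru; cexit: branch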
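The four cases for br.ctop can be captured as a small state-transition function. The Python below is a sketch: the function name and the tuple layout `(lc, ec, rrb, pr63, taken)` are choices made for this example, and only br.ctop (not cexit or the while-loop forms) is modelled.

```python
def br_ctop(lc, ec, rrb):
    """Return (lc, ec, rrb, pr63, taken) after one br.ctop."""
    if lc != 0:                       # prolog/kernel
        return lc - 1, ec, rrb - 1, 1, True
    if ec > 1:                        # epilog, more stages to drain
        return lc, ec - 1, rrb - 1, 0, True
    if ec == 1:                       # last epilog pass: fall through
        return lc, ec - 1, rrb - 1, 0, False
    return lc, ec, rrb, 0, False      # ec == 0: nothing rotates


def trace_loop(lc, ec):
    """Apply br_ctop repeatedly until the branch is not taken."""
    rrb, states = 0, []
    while True:
        lc, ec, rrb, pr63, taken = br_ctop(lc, ec, rrb)
        states.append((lc, ec, rrb, pr63, taken))
        if not taken:
            return states


for state in trace_loop(4, 3):
    print(state)
```

Starting from LC = 4 and EC = 3 (five elements, three stages) the branch executes seven times, that is N + stages - 1 passes through the loop body.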
Software Pipelining Example in the IA-64
  loop:
    (p16) ld1 r32 = [r12], 1
    (p17) add r34 = 1, r33
    (p18) st1 [r13] = r35, 1
    br.ctop loop
• Memory holds x1 x2 x3 x4 x5. The figure shows the rotating general registers r32 - r39 (physical and logical names), the predicates p16 - p18, and LC, EC and RRB.
• Pass 1 (initial state): LC = 4, EC = 3, RRB = 0; p16 = 1, p17 = 0, p18 = 0.
  – Only the load executes: r32 = x1.
• Pass 2 (after the 1st br.ctop): LC = 3, EC = 3, RRB = -1; p16 = 1, p17 = 1, p18 = 0.
  – Rotation renames the register holding x1 to r33.
  – Load: r32 = x2. Add: r34 = y1 (computed from x1).
• Pass 3 (after the 2nd br.ctop): LC = 2, EC = 3, RRB = -2; p16 = 1, p17 = 1, p18 = 1.
  – Load: r32 = x3. Add: r34 = y2. Store: y1 (now named r35) is written to memory.
• Pass 4 (after the 3rd br.ctop): LC = 1, EC = 3, RRB = -3; p16 = 1, p17 = 1, p18 = 1.
  – Load: r32 = x4. Add: r34 = y3. Store: y2 is written to memory.
• Pass 5 (after the 4th br.ctop): LC = 0, EC = 3, RRB = -4; p16 = 1, p17 = 1, p18 = 1.
  – Load: r32 = x5. Add: r34 = y4. Store: y3 is written to memory.
• Pass 6 (after the 5th br.ctop): LC = 0, EC = 2, RRB = -5; p16 = 0, p17 = 1, p18 = 1.
  – No load. Add: r34 = y5. Store: y4 is written to memory.
• Pass 7 (after the 6th br.ctop): LC = 0, EC = 1, RRB = -6; p16 = 0, p17 = 0, p18 = 1.
  – No load, no add. Store: y5 is written to memory.
• After the 7th br.ctop: LC = 0, EC = 0, RRB = -7; p16 = 0, p17 = 0, p18 = 0.
  – EC was 1, so br.ctop falls through and the loop exits. Memory now holds y1 y2 y3 y4 y5 next to x1 - x5.
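The whole walkthrough can be replayed with a short Python simulation of the three-instruction loop. It is a sketch under several assumptions: eight rotating registers (r32 to r39) as drawn in the figure, only the three stage predicates p16 to p18 are tracked, logical register n maps to physical slot (n - 32 + RRB) mod 8, and the function name and return shape are invented for this example.

```python
def run_pipelined_loop(xs, nrot=8):
    """Simulate: (p16) ld1 r32; (p17) add r34 = 1, r33; (p18) st1 r35; br.ctop."""
    if not xs:
        raise ValueError("need at least one element")
    phys = [None] * nrot                       # physical r32..r39

    def slot(logical, rrb):
        return (logical - 32 + rrb) % nrot     # modular rename via RRB

    lc, ec, rrb = len(xs) - 1, 3, 0            # LC = N-1, EC = number of stages
    p16, p17, p18 = 1, 0, 0                    # as set up by mov pr.rot
    loads, stored, passes = iter(xs), [], 0
    while True:
        passes += 1
        if p16:
            phys[slot(32, rrb)] = next(loads)              # load stage
        if p17:
            phys[slot(34, rrb)] = phys[slot(33, rrb)] + 1  # add stage
        if p18:
            stored.append(phys[slot(35, rrb)])             # store stage
        # br.ctop: update counters, pick the new p16, rotate, maybe exit.
        if lc != 0:
            lc, pr63, taken = lc - 1, 1, True
        elif ec > 1:
            ec, pr63, taken = ec - 1, 0, True
        else:
            ec, pr63, taken = ec - 1, 0, False
        rrb -= 1
        p16, p17, p18 = pr63, p16, p17
        if not taken:
            return stored, passes, (lc, ec, rrb)


print(run_pipelined_loop([10, 20, 30, 40, 50]))
```

For five inputs the simulation takes seven passes and ends with LC = 0, EC = 0, RRB = -7, matching the final state of the walkthrough, and every stored value is the corresponding input plus one.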
IA-64 Software pipelining Review
• No prolog or epilog in code
  – But we execute a lot of no-ops.
• Rotated registers help
  – In this case, we just didn’t have to reverse the code ordering
    » But in general, better still. Could place a load more than one loop iteration ahead of its use.
• Looks good at least in this case…
IA-64 review
• Some problems
  – ALAT difficult for compilers to use.
    » Recall Colwell talking about “once we figure out how to do this…”
  – 128/3 (about 43) bits per instruction makes I-cache worse.
  – Big register file has disadvantages
    » Context switch mainly.
  – So many dependencies with special purpose instructions, dynamic OoO is unlikely.
• But…
  – If the compiler could do a good job, there really does look like the potential for a big win.