Advanced Computer Architecture 5MD00 / 5Z033 EPIC / Itanium architecture best of both worlds

Henk Corporaal, www.ics.ele.tue.nl/~heco, TU Eindhoven, 2009




Avoiding superscalar complexity

• An alternative:
  – EPIC (Explicitly Parallel Instruction Computing)
• EPIC: best of both worlds?
  – Superscalar: expensive but binary compatible
  – VLIW: simple, but not compatible
• Or: use VLIW with binary translation at run time
  – Transmeta: Crusoe VLIW processor
  – Runs x86 code on a VLIW!


EPIC Architecture: IA-64 / Itanium

Explicitly Parallel Instruction Computing
• IA-64
• Implementations: Merced (2001), McKinley (2002), Montecito (2-core, 2006), Tukwila (4-core, 2009), Poulson (Q4 2009, 8-core)
• Architecture is now called Itanium

Register model:
• 128 64-bit integer registers (each with an extra NaT bit), stacked, rotating
• 128 82-bit floating-point registers, rotating
• 64 1-bit booleans (predicates)
• 8 64-bit branch target address registers
• System control registers


Itanium Instruction format

• Instructions grouped in 128-bit bundles
  – 3 * 41-bit instruction
  – 5 template bits, indicate type and stop location
• Each 41-bit instruction
  – starts with 4-bit opcode, and
  – ends with 6-bit guard (boolean) register-id

Bundle layout (field widths in bits): 5 (template) | 41 | 41 | 41
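The field widths above can be made concrete with a little bit arithmetic. The following is a minimal Python sketch, not a real decoder: it assumes the template occupies the five low-order bits of the bundle with the three slots following in ascending order, and the function names (`encode_bundle`, `decode_bundle`, `slot_fields`) are hypothetical.

```python
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer."""
    assert 0 <= template < 32 and len(slots) == 3
    bundle = template                      # assumed: template in bits 0..4
    for i, slot in enumerate(slots):
        assert 0 <= slot <= SLOT_MASK
        bundle |= slot << (5 + i * SLOT_BITS)
    return bundle


def decode_bundle(bundle):
    """Split a 128-bit bundle into its template and three instruction slots."""
    template = bundle & 0x1F
    slots = [(bundle >> (5 + i * SLOT_BITS)) & SLOT_MASK for i in range(3)]
    return template, slots


def slot_fields(slot):
    """Return (opcode, guard): the top 4 bits and the low 6 bits of a slot."""
    return slot >> 37, slot & 0x3F
```

Note that 5 + 3 * 41 = 128, so the three slots and the template fill the bundle exactly.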


Predication

• Predicated execution of virtually all instructions
  – (p) add r1 = r2, r3
    • If p is true, normal add operation. Otherwise, NOP
  – 64 1-bit predicate registers
  – Advantages of predicated execution:
    • Remove branches
      – Convert control dependence to data dependence
      – Reduce misprediction penalties
    • Increase the size of basic block
      – Both codes from taken & not-taken path can be scheduled in the same cycle
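As an illustration of how predication removes a branch, here is a small Python sketch of if-conversion. It models the statement "if r2 < r3 then r1 = r2 + r3 else r1 = r2 - r3" as a compare that sets two complementary predicates, followed by two guarded operations. The function name `if_converted` and the choice of predicates p1/p2 are assumptions made for this example only.

```python
def if_converted(r2, r3):
    """Branch-free model of: r1 = r2 + r3 if r2 < r3 else r2 - r3."""
    r = {"r1": 0, "r2": r2, "r3": r3}
    p = {}

    # cmp.lt p1, p2 = r2, r3   (sets a predicate and its complement)
    p[1] = r["r2"] < r["r3"]
    p[2] = not p[1]

    # Both paths are issued; the guard decides which result is kept.
    guarded_ops = [
        (1, "r1", lambda: r["r2"] + r["r3"]),   # (p1) add r1 = r2, r3
        (2, "r1", lambda: r["r2"] - r["r3"]),   # (p2) sub r1 = r2, r3
    ]
    for guard, dest, op in guarded_ops:
        if p[guard]:            # models the hardware turning the op into a NOP
            r[dest] = op()
    return r["r1"]
```

The control dependence on the comparison has become a data dependence on p1 and p2, so the add and the sub can be scheduled in the same cycle.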


Control Speculation

• Loads incur high latency
  – Need to schedule loads as early as possible
  – Two barriers: branches and stores
• Control speculation: move loads above branches


Control speculation – move loads above branches

Problem: loads can cause exceptions
• Separate load behavior from exception behavior
  – Speculative load (ld.s) initiates a load op. & detects exceptions
  – On an exception, hardware propagates exception token (stored with destination register) from ld.s to chk.s
  – Speculative check (chk.s) delivers the exception detected by ld.s
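The ld.s / chk.s split can be sketched in Python as follows. This is a simplified model under stated assumptions: memory is a dict, a missing key stands for a faulting address, the sentinel `NAT` stands for the exception token stored with the destination register, and the check is modeled as recovery code that repeats the load non-speculatively. All names here are hypothetical.

```python
class LoadFault(Exception):
    """Raised by a non-speculative load from an invalid address."""


NAT = object()  # stand-in for the exception token stored with a register


def ld(memory, addr):
    """Normal load: faults immediately on a bad address."""
    if addr not in memory:
        raise LoadFault(addr)
    return memory[addr]


def ld_s(memory, addr):
    """Speculative load: records the token instead of faulting."""
    return memory.get(addr, NAT)


def add(a, b):
    """Dependent operations pass the token along rather than faulting."""
    return NAT if (a is NAT or b is NAT) else a + b


def load_and_increment(memory, addr, needed):
    t = ld_s(memory, addr)     # hoisted above the branch
    u = add(t, 1)              # a use of the load, also hoisted
    if not needed:             # the branch the load was moved above
        return None            # a deferred fault is never delivered
    if u is NAT:               # chk.s: token present, run recovery code
        u = add(ld(memory, addr), 1)   # non-speculative reload delivers the fault
    return u
```

With a valid address the early result is used as is; with an invalid address the exception appears only if the original program path would have executed the load.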


Control Speculation

• Speculating the uses of a control-speculative load further increases ILP
  – Dependent instructions following the load can also be speculated above branches


Data Speculation

• Loads and previous stores can conflict
  – When the loads/stores overlap (access the same memory location), the loads must wait for previous stores due to RAW dependence
• IA-64 enables data speculation by ld.a and ld.c/chk.a with ALAT (Advanced Load Address Table)
  – ld.a performs a normal load and inserts the address into the ALAT
  – Any intervening stores eliminate the overlapping entries from the ALAT
  – The advanced load check (ld.c) checks the ALAT for a violation and reissues the load if necessary
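The ALAT protocol described above can be modeled in a few lines of Python. This is a rough sketch, not the hardware mechanism: it assumes entries are keyed by destination register and that "overlap" means an exact address match (real hardware also considers access size). The class and method names are hypothetical.

```python
class ALAT:
    """Toy Advanced Load Address Table: destination register -> load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, memory, reg, addr):
        """Advanced load: perform the load now and remember its address."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Store: write memory and drop every entry with the same address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def ld_c(self, memory, reg, addr, early_value):
        """Check load: keep the early value if the entry survived, else reload."""
        if self.entries.get(reg) == addr:
            return early_value
        return memory[addr]
```

A typical use hoists `ld_a` above a store whose address is unknown at compile time and leaves `ld_c` at the load's original position.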


Data Speculation

• Move loads above potentially overlapping stores


Data Speculation

• Uses of speculative data can be further speculated
• Also, control and data speculation can be combined
  – Schedule loads across branches and across stores at the same time
  – Speculative advanced loads: ld.sa combines the semantics of ld.a and ld.s


Register Stack

• Procedure call overhead
  – Spill registers to memory on call
  – Restore them on procedure return
• Register Stack
  – Register stack is used to save/restore procedure contexts across calls
  – Stack area in memory to save/restore procedure context
  – Explicit allocation of stack frames
    • Effective use of 96 registers
    • Allocate only what is needed
  – Overlapping stack frames avoids parameter copying
  – Mechanism implemented by renaming register addresses
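The renaming idea can be sketched as a base offset added to every stacked register number. The Python model below is an illustration under simplifying assumptions: 96 stacked registers (r32..r127), a frame split into "input + local" registers followed by output registers, and a call that moves the base so the caller's outputs become the callee's first registers. Class and attribute names are hypothetical.

```python
class RegisterStack:
    N_STACKED = 96  # r32..r127

    def __init__(self):
        self.phys = [0] * self.N_STACKED
        self.base = 0        # physical index that r32 currently maps to
        self.n_local = 0     # inputs + locals of the current frame
        self.callers = []    # saved (base, n_local) of calling frames

    def alloc(self, n_local, n_out):
        """Explicitly size the current frame (only what is needed)."""
        if n_local + n_out > self.N_STACKED:
            raise ValueError("frame larger than the stacked register file")
        self.n_local = n_local

    def _phys_index(self, reg):
        return (self.base + reg - 32) % self.N_STACKED

    def read(self, reg):
        return self.phys[self._phys_index(reg)]

    def write(self, reg, value):
        self.phys[self._phys_index(reg)] = value

    def call(self):
        """Rename: the callee's r32 is the caller's first output register."""
        self.callers.append((self.base, self.n_local))
        self.base = (self.base + self.n_local) % self.N_STACKED
        self.n_local = 0

    def ret(self):
        """Restore the caller's renaming; no register contents are copied."""
        self.base, self.n_local = self.callers.pop()
```

Because the frames overlap, an argument written to the caller's output register is read by the callee as r32 without any copy.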


Register Stack


Register Stack Engine (RSE)

• Automatically saves/restores stack registers without software intervention
  – Avoids explicit spill/fill (eliminates stack management overhead)
  – Provides the illusion of infinite physical registers
• RSE uses unused memory bandwidth (cycle stealing) to perform register spill and fill operations in the background
  – Overflow: alloc needs more registers than available
  – Underflow: return needs to restore frame saved in memory
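The overflow and underflow cases can be illustrated with a small bookkeeping model in Python. This sketch only counts registers; it ignores frame overlap and the background (cycle-stealing) timing, and it assumes the oldest frames are spilled first. The class name and counters are hypothetical.

```python
class RegisterStackEngine:
    def __init__(self, capacity=96):
        self.capacity = capacity
        self.sizes = []      # frame sizes, oldest first
        self.resident = []   # registers of each frame still in the register file
        self.spilled = 0     # registers written to the memory backing store
        self.filled = 0      # registers read back from the backing store

    def call(self, size):
        """Allocate a new frame; on overflow spill from the oldest frames."""
        self.sizes.append(size)
        self.resident.append(size)
        excess = sum(self.resident) - self.capacity
        i = 0
        while excess > 0:
            n = min(excess, self.resident[i])
            self.resident[i] -= n
            self.spilled += n
            excess -= n
            i += 1

    def ret(self):
        """Drop the current frame; on underflow refill the caller's frame."""
        self.sizes.pop()
        self.resident.pop()
        if self.sizes:
            missing = self.sizes[-1] - self.resident[-1]
            self.resident[-1] = self.sizes[-1]
            self.filled += missing
```

Software only issues calls and returns; the spill and fill counters change as a side effect, which is the "illusion of infinite physical registers" from the program's point of view.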


Software Pipelining Support

• High performance loops without code size overhead
  – No prologue and epilogue
• Rotating registers
  – Provide automatic renaming
• Rotating predicates (stage predicates)
  – Unify prologue, kernel, and epilogue
• Loop control registers (LC, EC)
• Loop branches
  – Counted loop (br.ctop)
  – While loop (br.wtop)
• Especially valuable for integer loops with small trip counts


Software Pipelining Example

Original loop:

  L1: ld4 r4 = [r5], 4       // Cycle 0
      add r7 = r4, r9        // Cycle 2
      st4 [r6] = r7, 4       // Cycle 3
      br.cloop L1 ;;

Software-pipelined loop:

  L1: (p16) ld4 r32 = [r5], 4    // Cycle 0
      (p18) add r35 = r34, r9    // Cycle 0
      (p19) st4 [r6] = r36, 4    // Cycle 0
            br.ctop L1           // Cycle 0

Operations executed per pass through the pipelined loop:

  Prolog:  ld
           ld
           ld  add
  Kernel:  ld  add  st
           ld  add  st
  Epilog:      add  st
               add  st
                    st

Rotating register and predicate names per iteration:

               register names       predicate names       predicate values
  Iteration 1: r32 r33 r34 r35 …    p16 p17 p18 p19 ..    1 0 0 0 ..
  Iteration 2: r33 r34 r35 r36 …    p17 p18 p19 .. p16    1 0 0 .. 1
  Iteration 3: r34 r35 r36 r37 …    p18 p19 .. p16 p17    1 0 .. 1 1

What happens during runtime?
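One way to answer this question is to step through the pipelined loop in a small Python model. The sketch below is a simplification under stated assumptions: only a window of the rotating registers (r32..r39) and of the stage predicates (p16..p19) is modeled, LC starts at the trip count minus one, EC starts at the number of stages (4), and the post-incremented pointers r5 and r6 are replaced by list indexing. The function name `pipelined_add` is hypothetical.

```python
def pipelined_add(src, r9):
    """Model of the br.ctop loop: returns [x + r9 for x in src]."""
    if not src:
        return []
    rot = [0] * 8                        # rot[i] is the register named r(32+i)
    pred = [True, False, False, False]   # p16..p19
    lc, ec = len(src) - 1, 4             # loop count and epilog count
    dst, next_load = [], 0

    while True:
        if pred[0]:                      # (p16) ld4 r32 = [r5], 4
            rot[0] = src[next_load]
            next_load += 1
        if pred[2]:                      # (p18) add r35 = r34, r9
            rot[3] = rot[2] + r9
        if pred[3]:                      # (p19) st4 [r6] = r36, 4
            dst.append(rot[4])

        # br.ctop: update LC/EC, choose the new p16, rotate, maybe branch
        if lc > 0:
            lc -= 1
            new_p16, taken = True, True      # kernel: start another iteration
        else:
            ec -= 1
            new_p16, taken = False, ec > 0   # epilog: drain the pipeline

        rot = [rot[-1]] + rot[:-1]       # value in r32 is now named r33, etc.
        pred = [new_p16] + pred[:-1]     # p16 becomes p17, etc.
        if not taken:
            return dst
```

For a trip count of n the loop body runs n + 3 times: the stage predicates switch the load on first and the store on last, so the same four instructions serve as prolog, kernel, and epilog.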


IA-64 / Itanium architecture: a VLIW?

• Yes, but:
  – Instructions contain only one operation; the compiler can indicate that successive instructions can be executed in parallel
  – HW does the operation-to-FU binding
  – Pipeline latencies are not visible in the ISA
  – These measures make the ISA independent of the number of FUs and of pipeline latencies
    => ISA supports multiple implementations


Montecito 2006: dual 11-issue cores


Tukwila: 4-core Itanium, 2009


How further? Burton Smith, Microsoft, 2005