Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)
-
date post
22-Dec-2015 -
Category
Documents
-
view
222 -
download
5
Transcript of Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)
![Page 1: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/1.jpg)
Chapter 15
IA-64 Architectureor
(EPIC – Extremely Parallel Instruction Computing)
![Page 2: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/2.jpg)
Superscaler & Superpipeline
Superscaler Machine:
A Superscalar machine employs multiple independent pipelines to executes multiple independent instructions in parallel.— Particularly common instructions (arithmetic, load/store,
conditional branch) can be executed independently.
Superpipelined machine:• Superpiplined machines overlap pipe stages
— Relies on stages being able to begin operations before the last is complete.
![Page 3: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/3.jpg)
Reflection on Superscalar Machines
Challenges:• Data Dependencies
—Requires reordering of instructions
• Procedural Dependencies—Requires reordering of fetch, execute, updating of
registers—Requires register renaming—Requires “committing” or “retiring” instructions
• Resource Conflicts—Requires reordering of instructions
• “scaling” challenges—Control complexity increases exponentially—Time delay increases exponentially
![Page 4: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/4.jpg)
Why A New Architecture Direction?
Processor designers obvious choices for use of increasing number of transistors on chip and extra speed:
• Bigger Caches diminishing returns
• Increase degree of Superscaling by adding more execution units complexity wall: more logic, need improved branch prediction, more renaming registers, more complicated dependencies.
• Multiple Processors challenge to use them effectively in general computing
• Longer pipelines greater penalty for misprediction
![Page 5: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/5.jpg)
IA-64 : Background• Explicitly Parallel Instruction Computing (EPIC) - Jointly developed by Intel & Hewlett-Packard (HP)
• New 64 bit architecture—Not extension of x86—Not adaptation of HP 64bit RISC architecture
• To exploit increasing chip transistors and increasing speeds
• Utilizes systematic parallelism
• Departure from superscalar
Note: Has become the architecture of the Intel Itanium
![Page 6: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/6.jpg)
Design Approach – Explicit Parallelism
• Compiler has vision of whole program and what is coming
• Increase the execution units and use them effectively
• Reduce dynamic reconfigurations
• Avoid exponentially increasing complex circuitry
Compiler statically schedules instructions at compile time, rather than processor dynamically scheduling them at run time.
![Page 7: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/7.jpg)
Basic Concepts for IA-64
• Instruction level parallelism — Explicit in machine instruction rather than determined at run
time by processor
• Long or very long instruction words (LIW/VLIW)— Fetch bigger chunks already “preprocessed”
• Branch predication (not the same as branch prediction)— Go ahead and fetch & decode instructions, but keep track of
them so the decision to “issue” them, or not, can be practically made later
• Speculative loading— Go ahead and load data so it is ready when need, and have a practical
way to recover if speculation proved wrong
• Software Pipelining - Multiple iterations of a loop can be executed in parallel
![Page 8: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/8.jpg)
Comparing IA-64 with Superscalar
![Page 9: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/9.jpg)
Predicate Registers
• Used as a flag for instructions that may or may not be executed.
• A set of instructions is assigned a predicate register when it
is uncertain whether the instruction sequence will actually be executed (think branch).
• Only instructions with a predicate value of true are executed.
• When it is known that the instruction is going to be executed, its predicate is set. All instructions with that predicate true can now be completed.
• Those instructions with predicate false are now candidates for cleanup.
![Page 10: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/10.jpg)
Predication
![Page 11: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/11.jpg)
Speculative Loading
![Page 12: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/12.jpg)
General Organization
![Page 13: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/13.jpg)
IA-64 Key Features
• Large number of registers—IA-64 instruction format assumes 256 Registers
– 128 * 64 bit integer, logical & general purpose– 128 * 82 bit floating point and graphic
—64 predicated execution registers (To support high degree of parallelism)
• Multiple execution units—8 or more
![Page 14: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/14.jpg)
IA-64 Register Set
![Page 15: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/15.jpg)
IA-64 Execution Units • I-Unit
—Integer arithmetic—Shift and add—Logical—Compare—Integer multimedia ops
• M-Unit— Load and store
– Between register and memory— Some integer ALU operations
• B-Unit— Branch instructions
• F-Unit— Floating point instructions
![Page 16: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/16.jpg)
Relationship between Instruction Type & Execution Unit
![Page 17: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/17.jpg)
Instruction Format
128 bit bundles• Can fetch one or more bundles at a time
• Bundle holds three instructions plus template
• Instructions are usually 41 bit long— Have associated predicated execution registers
• Template contains info on which instructions can be executed in parallel— Not confined to single bundle— e.g. a stream of 8 instructions may be executed in parallel— Compiler will have re-ordered instructions to form contiguous bundles— Can mix dependent and independent instructions in same bundle
![Page 18: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/18.jpg)
Instruction Format Diagram
![Page 19: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/19.jpg)
Field Encoding & Instr Set Mapping
Note: BAR indicates stops: Possible dependencies with Instructions after the stop
![Page 20: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/20.jpg)
Assembly Language Format
[qp] mnemonic [.comp] dest = srcs ;; //
• qp - predicate register– 1 at execution execute and commit result to hardware– 0 result is discarded
• mnemonic - name of instruction
• comp – one or more instruction completers used to qualify mnemonic
• dest – one or more destination operands
• srcs – one or more source operands• ;; - instruction groups stops (when appropriate)
– Sequence without read after write or write after write– Do not need hardware register dependency checks
• // - comment follows
![Page 21: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/21.jpg)
Assembly Example
ld8 r1 = [r5] ;; //first group
add r3 = r1, r4 //second group
• Second instruction depends on value in r1—Changed by first instruction—Can not be in same group for parallel
execution
• Note ;; ends the group of instructions that can be executed in parallel
Register Dependency:
![Page 22: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/22.jpg)
Assembly Example
ld8 r1 = [r5] //first group
sub r6 = r8, r9 ;; //first group
add r3 = r1, r4 //second group
st8 [r6] = r12 //second group
• Last instruction stores in the memory location whose address is in r6, which is established in the second instruction
Multiple Register Dependencies:
![Page 23: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/23.jpg)
Predication
![Page 24: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/24.jpg)
Speculative Loading
![Page 25: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/25.jpg)
Assembly Example – Predicated Code
if (a&&b)
j = j + 1;
else
if(c)
k = k + 1;
else
k = k – 1;
i = i + 1;
Consider the Following program with branches:
![Page 26: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/26.jpg)
Assembly Example – Predicated Code
Source CodeSource Codeif (a&&b)
j = j + 1;
else
if(c)
k = k + 1;
else
k = k – 1;
i = i + 1;
Pentium Assembly CodePentium Assembly Code
cmp a, 0 ; compare with 0
je L1 ; branch to L1 if a = 0
cmp b, 0
je L1
add j, 1 ; j = j + 1
jmp L3
L1: cmp c, 0
je L2
add k, 1 ; k = k + 1
jmp L3
L2: sub k, 1 ; k = k – 1
L3: add i, 1 ; i = i + 1
![Page 27: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/27.jpg)
Assembly Example – Predicated Code
Source CodeSource Codeif (a&&b)
j = j + 1;
else
if(c)
k = k + 1;
else
k = k – 1;
i = i + 1;
Pentium CodePentium Code
cmp a, 0
je L1
cmp b, 0
je L1
add j, 1
jmp L3
L1: cmp c, 0
je L2
add k, 1
jmp L3
L2: sub k, 1
L3: add i, 1
IA-64 CodeIA-64 Code
cmp. eq p1, p2 = 0, a ;;
(p2) cmp. eq p1, p3 = 0, b
(p3) add j = 1, j
(p1) cmp. ne p4, p5 = 0, c
(p4) add k = 1, k
(p5) add k = -1, k
add i = 1, i
![Page 28: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/28.jpg)
Example of Prediction
![Page 29: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/29.jpg)
Data Speculation
• Load data from memory before needed
• What might go wrong?—Load moved before store that might alter
memory location—Need subsequent check in value
![Page 30: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/30.jpg)
Assembly Example – Data Speculation
(p1) br some_label // cycle 0
ld8 r1 = [r5] ;; // cycle 1
add r1 = r1, r3 // cycle 3
Consider the Following program:
![Page 31: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/31.jpg)
Assembly Example – Data Speculation
(p1) br some_label //cycle 0
ld8 r1 = [r5] ;; //cycle 1
add r1 = r1, r3 //cycle 3
Consider the Following program:
Original code Speculated CodeOriginal code Speculated Code
ld8.s r1 = [r5] ;; //cycle -2
// other instructions
(p1) br some_label //cycle 0
chk.s r1, recovery //cycle 0
add r2 = r1, r3 //cycle 0
![Page 32: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/32.jpg)
Assembly Example – Data Speculation
st8 [r4] = r12 //cycle 0
ld8 r6 = [r8] ;; //cycle 0
add r5 = r6, r7 ;; //cycle 2
st8 [r18] = r5 //cycle 3
Consider the Following program:
What if r4 and r18 point to the same address?
![Page 33: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/33.jpg)
Assembly Example – Data Speculation
st8 [r4] = r12 //cycle 0
ld8 r6 = [r8] ;; //cycle 0
add r5 = r6, r7 ;; //cycle 2
st8 [r18] = r5 //cycle 3
Consider the Following program:
Without Data Speculation With Data SpeculationWithout Data Speculation With Data Speculation
What if r4 and r18 point to the same address?
ld8.a r6 = [r8] ;; //cycle -2, adv
// other instructions
st8 [r4] = r12 //cycle 0
ld8.c r6 = [r8] //cycle 0, check
add r5 = r6, r7 ;; //cycle 0
st8 [r18] = r5 //cycle 1
![Page 34: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/34.jpg)
Assembly Example – Data Speculation
ld8.a r6 = [r8];; //cycle -3,adv ld // other instructions add r5 = r6, r7 //cycle -1,uses r6 // other instructions st8 [r4] = r12 //cycle 0 chk.a r6, recover //cycle 0, checkback: //return pt st8 [r18] = r5 //cycle 0
recover: ld8 r6 = [r8] ;; //get r6 from [r8] add r5 = r6, r7;; //re-execute be back //jump back
ld8.a r6 = [r8] ;; //cycle-2// other instructions
st8 [r4] = r12 //cycle 0ld8.c r6 = [r8] //cycle 0add r5 = r6, r7 ;; //cycle 0st8 [r18] = r5 //cycle 1
Data Dependencies:
Speculation Speculation with data dependencySpeculation Speculation with data dependency
![Page 35: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/35.jpg)
Software Pipelining
L1: ld4 r4=[r5],4 ;;//cycle 0 load postinc 4add r7=r4,r9 ;;//cycle 2st4 [r6]=r7,4 //cycle 3 store postinc
4br.cloop L1 ;;//cycle 3
• Adds constant to one vector and stores result in another
• No opportunity for instruction level parallelism in one iteration
• Instruction in iteration x all executed before iteration x+1 begins
• If no address conflicts between loads and stores can move independent instructions from loop x+1 to loop x
![Page 36: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/36.jpg)
Pipeline - Unrolled Loop, Pipeline Display
Unrolled loopUnrolled loop
ld4 r32=[r5],4;; //cycle 0
ld4 r33=[r5],4;; //cycle 1
ld4 r34=[r5],4 //cycle 2
add r36=r32,r9;; //cycle 2
ld4 r35=[r5],4 //cycle 3
add r37=r33,r9 //cycle 3
st4 [r6]=r36,4;; //cycle 3
ld4 r36=[r5],4 //cycle 3
add r38=r34,r9 //cycle 4
st4 [r6]=r37,4;; //cycle 4
add r39=r35,r9 //cycle 5
st4 [r6]=r38,4;; //cycle 5
add r40=r36,r9 //cycle 6
st4 [r6]=r39,4;; //cycle 6
st4 [r6]=r40,4;; //cycle 7
Original LoopOriginal LoopL1: ld4 r4=[r5],4 ;;//cycle 0 load postinc 4 add r7=r4,r9 ;;//cycle 2 st4 [r6]=r7, 4 //cycle 3 store postinc 4 br.cloop L1 ;;//cycle 3
Pipeline DisplayPipeline Display
![Page 37: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/37.jpg)
Unrolled Loop Observations
• Completes 5 iterations in 7 cycles—Compared with 20 cycles in original code
• Assumes two memory ports—Load and store can be done in parallel
![Page 38: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/38.jpg)
Support For Software Pipelining
• Automatic register renaming— Fixed size are of predicate and fp register file (p16-P32, fr32-fr127) and
programmable size area of gp register file (max r32-r127) capable of rotation
— Loop using r32 on first iteration automatically uses r33 on second
• Predication— Each instruction in loop predicated on rotating predicate register
– Determines whether pipeline is in prolog, kernel, or epilog
• Special loop termination instructions— Branch instructions that cause registers to rotate and loop counter to
decrement
![Page 39: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/39.jpg)
IA-64 Register Set
![Page 40: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/40.jpg)
IA-64 Registers (1)• General Registers
— 128 gp 64 bit registers— r0-r31 static
– references interpreted literally
— r32-r127 can be used as rotating registers for software pipeline or register stack
– References are virtual– Hardware may rename dynamically
• Floating Point Registers— 128 fp 82 bit registers— Will hold IEEE 745 double extended format— fr0-fr31 static, fr32-fr127 can be rotated for pipeline
• Predicate registers— 64 1 bit registers used as predicates— pr0 always 1 to allow unpredicated instructions— pr1-pr15 static, pr16-pr63 can be rotated
![Page 41: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/41.jpg)
IA-64 Registers (2)• Branch registers
— 8 64 bit registers
• Instruction pointer— Bundle address of currently executing instruction
• Current frame marker—State info relating to current general register stack
frame—Rotation info for fr and pr—User mask
– Set of single bit values– Allignment traps, performance monitors, fp register usage
monitoring
• Performance monitoring data registers— Support performance monitoring hardware
• Application registers— Special purpose registers
![Page 42: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/42.jpg)
Register Stack• Avoids unnecessary movement of data at procedure call & return
• Provides procedure with new frame up to 96 registers on entry— r32-r127
• Compiler specifies required number— Local— Output
• Registers renamed so local registers from previous frame hidden
• Output registers from calling procedure now have numbers starting r32
• Physical registers r32-r127 allocated in circular buffer to virtual registers
• Hardware moves register contents between registers and memory if more registers needed
![Page 43: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/43.jpg)
Register Stack Behavior
![Page 44: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/44.jpg)
Register Formats
![Page 45: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/45.jpg)
HW 9: Due 12/4/07
Problems 15.7, 15.8, & 15.10
![Page 46: Chapter 15 IA-64 Architecture or (EPIC – Extremely Parallel Instruction Computing)](https://reader035.fdocuments.in/reader035/viewer/2022062407/56649d815503460f94a65b74/html5/thumbnails/46.jpg)
Intel’s Itanium Implements the IA-64