Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

ECE 4100/6100 Advanced Computer Architecture Lecture 15 Static Scheduling Machines Prof. Hsien-Hsin Sean Lee School of Electrical and Computer Engineering Georgia Institute of Technology


Page 1: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

ECE 4100/6100 Advanced Computer Architecture

Lecture 15 Static Scheduling Machines

Prof. Hsien-Hsin Sean Lee

School of Electrical and Computer Engineering

Georgia Institute of Technology

Page 2: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

2

Static Scheduling
• Compiler performs instruction scheduling
• VLIW: Very Long Instruction Word
• An alternative to dynamically scheduled processors
• Pack multiple operations into one instruction
• Move scheduling to the compiler (software approach)
• Can reduce the complexity of a hardware-based instruction scheduler
• Examples: Cydrome, Multiflow, EPIC

Page 3: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

3

Very Long Instruction Word (VLIW)

• Rely on compilers
• Simple hardware
• Dependencies are explicitly represented in the instructions
• Instruction window is, supposedly, much larger than a hardware scheduling window
  – How about loop boundaries?
  – How about function boundaries?
  – Interprocedural optimization is generally difficult
• Might lead to compatibility or performance issues if instruction latencies change
• EPIC/Itanium closely follows the VLIW philosophy; many embedded and DSP processors embrace VLIW

Page 4: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

4

Intel Itanium ISA
• Itanium instruction “bundle” (VLIW)
  – 128 bits each
  – Contains three Itanium instructions (aka syllables)
  – Template bits in each bundle specify dependencies both within a bundle and between sequential bundles
  – A collection of independent bundles forms a “group” (delimited by stops)
• Each Itanium instruction
  – Fixed length: 41 bits
  – Left-most 4 bits (40-37) are the major opcode (e.g. FP ld/st, INT ld/st, ALU)
  – Contains at most three 7-bit register specifiers
  – Contains a 6-bit field specifying one of the 64 one-bit qualifying predicate registers

Figure: 128-bit bundle (bits 0-127) with three 41-bit fields labeled Instruction Slot 1, Instruction Slot 2, Instruction Slot 3, plus the template field (Templt) in bits 0-4
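The bundle layout above can be sketched in a few lines of Python. This is a minimal illustration, not a decoder for real Itanium binaries: the function names are hypothetical, the slots are numbered 0-2 from the low-order end (the slide labels them 1-3), and the placement of the qualifying predicate in bits 5..0 of a slot is an assumption that is not stated on the slide.

```python
SLOT_BITS = 41       # each instruction slot is 41 bits
TEMPLATE_BITS = 5    # template occupies bits 4..0 of the bundle
SLOT_MASK = (1 << SLOT_BITS) - 1


def encode_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    bundle = template & 0x1F
    for i, slot in enumerate(slots):
        bundle |= (slot & SLOT_MASK) << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def decode_bundle(bundle):
    """Return (template, [slot0, slot1, slot2]) for a 128-bit bundle."""
    template = bundle & 0x1F
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
             for i in range(3)]
    return template, slots


def decode_slot(slot):
    """Extract the major opcode and qualifying predicate from one slot."""
    return {
        "major_opcode": (slot >> 37) & 0xF,  # bits 40..37, as on the slide
        "qp": slot & 0x3F,                   # assumed to sit in bits 5..0
    }
```

Note that 5 + 3 x 41 = 128, so the three slots and the template fill the bundle exactly.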

Page 5: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

5

Encoding Instruction Bundle

• Use “;;” as the “stop bit” in assembly code to separate dependent instructions
• Instructions between “;;” belong to the same “instruction group”
  – RAW and WAW are not allowed in the same instruction group
  – WAR is allowed except for a special case: writing p63 by a modulo-scheduled branch (e.g. br.ctop) after reading p63 (e.g. as qualifying predicate) by a B-type instruction
• Each instruction slot can represent one (out of 5) functional unit types based on the encoding (e.g. slot 0 can be M-unit or B-unit)
• 12 basic templates provided, each with 2 versions depending on the stop bit (“_” marks a stop)
  – MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB
  – MII_, MI_I_, MLX_, MMI_, M_MI_, MFI_, MMF_, MIB_, MBB_, BBB_, MMB_, MFB_

{ .mii
    ld4 r28 = [r8]
    add r9 = 2, r1 ;;
    add r30 = 1, r9
}
MI_I format ⇒ template encoded “02”
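The instruction-group rule (no RAW or WAW inside a group) can be checked mechanically. The sketch below is a simplified model: instructions are plain dictionaries with hypothetical keys (`dests`, `srcs`, `stop`), and it ignores the architectural exceptions to the rule, including the p63 special case mentioned above.

```python
def split_groups(instrs):
    """Split a straight-line instruction list into groups at each stop (;;)."""
    groups, current = [], []
    for ins in instrs:
        current.append(ins)
        if ins["stop"]:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def find_hazards(group):
    """Return (kind, register, writer_index, later_index) for RAW/WAW pairs."""
    hazards = []
    written = {}  # register -> index of the most recent writer in this group
    for idx, ins in enumerate(group):
        for reg in ins["srcs"]:
            if reg in written:
                hazards.append(("RAW", reg, written[reg], idx))
        for reg in ins["dests"]:
            if reg in written:
                hazards.append(("WAW", reg, written[reg], idx))
        for reg in ins["dests"]:
            written[reg] = idx
    return hazards


# The MI_I bundle from the slide: the stop after the first add separates
# the producer of r9 from its consumer.
EXAMPLE = [
    {"op": "ld4", "dests": ["r28"], "srcs": ["r8"], "stop": False},
    {"op": "add", "dests": ["r9"], "srcs": ["r1"], "stop": True},
    {"op": "add", "dests": ["r30"], "srcs": ["r9"], "stop": False},
]
```

With the stop in place, both groups are hazard-free; clearing the stop on the first add puts all three instructions in one group and the checker reports a RAW on r9.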

Page 6: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

6

Itanium Instruction Example

{ .mii
    add r1 = r2, r3
    sub r4 = r4, r5 ;;
    shr r1, r4, r1 ;;
}
{ .mmi
    ld8 r2, [r1] ;;
    st8 [r1] = r23
    tbit p1, p2 = r4, 5
}
{ .mbb
    ld8 r45 = [r55]
    (p3) br.call b1 = func1
    (p4) br.cond Label1
}
{ .mfi
    st4 [r45] = r6
    fmac f1 = f2, f3
    add r3 = r3, 8 ;;
}

Page 7: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

7

Itanium Register Files

Figure: Itanium register files
• General Purpose Registers (bits 63-0): r0-r31 static, r32-r127 stacked (rotating)
• FP Registers (bits 81-0): f0-f31 static, f32-f127 rotating
• Predicate Registers (1 bit each): p0-p15 static, p16-p63 rotating

Page 8: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

8

Register Stack Engine

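• Avoids spills/fills during function call/return
• Callee uses the instruction alloc r1=ar.pfs, i, l, o, r upon entering a function

Figure: general register frame. r0-r31 are static; the current frame starts at r32 and holds the inputs, then the locals, then the outputs; registers beyond the frame (up to r127) are illegal.
• size of frame (sof): inputs, locals, and outputs
• size of locals (sol = i + l)
• size of rotating (sor)
• Current Frame Marker (CFM), 38 bits: sof, sol, sor, rrb.gr, rrb.fr, rrb.pr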

Page 9: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

9

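Function Call Example

main()
{
    a = foo(i*i, b[i]);
}

int foo(int ii, int bb)
{
}

main: alloc r32=ar.pfs,0,12,2,0
foo:  alloc r26=ar.pfs,2,5,0,0

Figure: GPR frames of caller and callee
• Caller (main): frame starts at r32; locals end at r43; outputs r44 and r45 hold i*i and b[i]; registers up to r127 lie beyond the frame
• Callee (foo): the caller's outputs appear as r32 and r33 (i*i and b[i]); the frame ends at r38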
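The register numbers in this example follow directly from the alloc arguments. A minimal sketch, assuming the frame always starts at r32 and that the rotating size r is a multiple of 8 (the function name and the returned dictionary layout are illustrative, not part of the ISA):

```python
FIRST_STACKED = 32   # r32 is the first stacked register
NUM_STACKED = 96     # r32..r127


def alloc_frame(i, l, o, r=0):
    """Model the frame created by `alloc rX=ar.pfs, i, l, o, r`."""
    sol = i + l            # size of locals covers inputs and locals
    sof = sol + o          # size of frame adds the outputs
    if sof > NUM_STACKED:
        raise ValueError("frame exceeds the 96 stacked registers")
    if r % 8 != 0 or r > sof:
        raise ValueError("rotating size must be a multiple of 8 within the frame")

    def regs(start, count):
        return [f"r{n}" for n in range(start, start + count)]

    return {
        "sol": sol,
        "sof": sof,
        "sor": r // 8,     # the CFM field counts groups of 8 registers
        "inputs": regs(FIRST_STACKED, i),
        "locals": regs(FIRST_STACKED + i, l),
        "outputs": regs(FIRST_STACKED + sol, o),
    }
```

For main (0, 12, 2, 0) this gives locals r32-r43 and outputs r44, r45; for foo (2, 5, 0, 0) it gives inputs r32, r33 and locals up to r38, matching the figure.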

Page 10: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

10

RSE: A Function Call

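Figure: frame state before and after a call (pfm: previous frame marker)
• Before the call (caller): locals r32-r45, outputs r46-r52; CFM: sol=14, sof=21; PFS.pfm: sol=x, sof=x
• After the call (callee): the outputs are visible as r32-r38; CFM: sol=0, sof=7; PFS.pfm: sol=14, sof=21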

Page 11: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

11

RSE: Alloc

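Figure: frame state across call and alloc
• Before the call (caller): locals r32-r45, outputs r46-r52; CFM: sol=14, sof=21; PFS.pfm: sol=x, sof=x
• After the call: the outputs are visible as r32-r38; CFM: sol=0, sof=7; PFS.pfm: sol=14, sof=21
• After alloc r32=ar.pfs,7,9,3,0: inputs and locals r32-r47, outputs r48-r50; CFM: sol=16, sof=19; PFS.pfm: sol=14, sof=21
• alloc copies PFM to a GR (r32)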

Page 12: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

12

RSE: Return

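Figure: frame state across call, alloc, and return
• Before the call (caller): locals r32-r45, outputs r46-r52; CFM: sol=14, sof=21; PFS.pfm: sol=x, sof=x
• After the call: the outputs are visible as r32-r38; CFM: sol=0, sof=7; PFS.pfm: sol=14, sof=21
• After alloc: inputs and locals r32-r47, outputs r48-r50; CFM: sol=16, sof=19; PFS.pfm: sol=14, sof=21
• After return: locals r32-r45, outputs r46-r52 again; CFM: sol=14, sof=21; PFS.pfm: sol=14, sof=21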
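The call / alloc / return sequence in these three slides can be replayed with a small state machine. This is a sketch that only tracks sol and sof; the class and method names are hypothetical, and it leaves out register spilling to the backing store as well as the software save/restore of ar.pfs.

```python
class RegisterStack:
    """Tracks CFM and PFS.pfm as (sol, sof) pairs across call/alloc/return."""

    def __init__(self, sol, sof):
        self.cfm = {"sol": sol, "sof": sof}
        self.pfm = None  # PFS.pfm, undefined ("x") before the first call

    def call(self):
        # The caller's outputs become the callee's whole frame, renamed to r32+.
        outputs = self.cfm["sof"] - self.cfm["sol"]
        self.pfm = dict(self.cfm)
        self.cfm = {"sol": 0, "sof": outputs}

    def alloc(self, i, l, o):
        # Resize the current frame; the returned copy stands for the GR
        # that receives the previous frame marker.
        self.cfm = {"sol": i + l, "sof": i + l + o}
        return dict(self.pfm)

    def ret(self):
        # Restore the caller's frame from the previous frame marker.
        self.cfm = dict(self.pfm)
```

Starting from sol=14, sof=21, a call followed by `alloc(7, 9, 3)` and a return reproduces the CFM values shown in the figures: (0, 7), then (16, 19), then (14, 21) again.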

Page 13: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

13

Itanium Pipelines

• Performance improvement due to pipeline shortening: 4% to 6%
• The large integer register file causes an extra stage, WLD (Word Line Decode), in Itanium; the circuit was improved for Itanium 2
• Inter-group latency is enforced by a scoreboard
  – Latency due to scheduling that failed to space instructions out
  – Latency due to cache misses

Figure: Itanium and Itanium 2 pipelines, annotated with “Front-end”, “Ckt improved”, and “Dependency scoreboard stall checked here prior to EXE”

Page 14: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

14

Itanium 2 Eight-stage Pipeline

Core: IPG, ROT, EXP, REN, REG, EXE, DET, WB
FP:   FP1, FP2, FP3, FP4, WB
L2:   L2N, L2I, L2A, L2M, L2D, L2C, L2W

Stage      Function
IPG        IP Generate, L1I cache (6 inst) and TLB access
ROT        Instruction Rotate and Buffer (6 inst)
EXP        Expand, port assignment and routing
REN        INT and FP register rename
REG        INT and FP register file read
EXE        ALU Execute, L1D cache and TLB access + L2 cache tag access
DET        Exception Detect, branch correction
WB         Writeback, INT register update
FP1-WB     FP FMAC pipeline (2) + register write
L2N-L2I    L2 Queue Nominate/Issue (4) (speculatively issued with L1 request)
L2A-L2W    L2 Access, Rotate, Correct, Write (4)

Page 15: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

15

Itanium 2 Microarchitecture

Figure: block diagram with the following labeled components
• L1 I-Cache & fetch/prefetch engine, I-TLB, branch prediction
• Instruction queue (8 bundles)
• 11 issue ports: B B B, M M M M, I I, F F
• Register stack engine / remapping
• Branch & predicate registers, 128 INT registers, 128 FP registers
• Branch units, INT & MM units, floating-point units
• Scoreboard, predicate, NaT, exceptions
• Quad-port (INT) L1 PIPT data cache (WT), D-TLB, ALAT
• IA-32 decode & control
• Quad-port PIPT unified L2 cache (ECC)
• On-chip single-ported PIPT unified L3 cache (ECC)
• Bus controller (ECC)

Page 16: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

16

Page 17: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

17

Control Speculation (Speculative Load)

Elevate loads above a branch
• To hide memory latency by control speculation at compile time
• Defer exceptions by setting NaT (the GR's 65th bit), which indicates:
  – Whether or not an exception has occurred
  – Whether a branch to fixup code is required
• NaT is set during ld.s and checked by chk.s

Conventional architectures (the branch is a barrier):
    instr 1
    instr 2
    . . .
    br
    load
    use

Itanium:
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use

Page 18: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

18

Control Speculation (Hoist Uses)

• The uses of speculative data can be executed speculatively
  – Distinguishes speculation from simple prefetch
• NaT bit propagates down the dependent instruction chain

IA-64:
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use

Page 19: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

19

Control Speculation (Recovery)

• All computation instructions propagate NaTs to their consumers to reduce the number of checks
• cmp propagates “false” when writing predicates if NaT is set (“0” for both target predicates)

    ld8.s r3 = (r9)
    ld8.s r4 = (r10)
    add   r6 = r3, r4
    ld8.s r5 = (r6)
    p1,p2 = cmp(...)
    chk.s r5, recv        (allows a single chk on the result)
    sub   r7 = r5, r2

Recovery code:
    ld8
    ld8
    add
    ld8
    br home
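NaT propagation and the single check at the end of the chain can be modeled as follows. This is an illustrative sketch only: registers are (value, nat) pairs, memory is a dictionary, and the `deferred` set is a hypothetical stand-in for addresses whose speculative access would fault but whose non-speculative access succeeds (for example after the OS services the fault).

```python
from typing import NamedTuple


class Reg(NamedTuple):
    value: int
    nat: bool = False  # Not-a-Thing bit: a deferred exception is pending


def ld_s(mem, addr, deferred=frozenset()):
    """Speculative load: defer any fault by returning a NaT instead."""
    if addr.nat or addr.value in deferred or addr.value not in mem:
        return Reg(0, True)
    return Reg(mem[addr.value])


def ld(mem, addr):
    """Non-speculative load: a bad address raises (KeyError) right away."""
    return Reg(mem[addr.value])


def add(a, b):
    """Computation propagates NaT from either source to the result."""
    return Reg(a.value + b.value, a.nat or b.nat)


def load_chain(mem, r9, r10, deferred=frozenset()):
    """The slide's chain: two loads, an add, a dependent load, one chk.s."""
    r3 = ld_s(mem, r9, deferred)
    r4 = ld_s(mem, r10, deferred)
    r6 = add(r3, r4)
    r5 = ld_s(mem, r6, deferred)
    recovered = False
    if r5.nat:                      # chk.s r5, recv
        r3 = ld(mem, r9)            # recovery code re-executes the chain
        r4 = ld(mem, r10)           # without speculation
        r6 = add(r3, r4)
        r5 = ld(mem, r6)
        recovered = True
    return r5.value, recovered
```

Because the NaT flows through the add and the dependent load, checking r5 alone is enough to catch a deferred fault anywhere in the chain.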

Page 20: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

20

Data Speculation (Advanced Loads)

• Compiler can hoist a load above a preceding, possibly conflicting store
• ALAT (Advanced Load Address Table) is used to check every store address in between
• Can be done by a superscalar machine using store coloring

Conventional architectures (the store is a barrier):
    instr 1
    instr 2
    . . .
    st8
    ld8
    use

Itanium:
    ld8.a
    instr 1
    instr 2
    st8
    ld.c
    use

Page 21: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

21

Data Speculation (load.a + chk.a)
• Compiler hoists a load and its subsequent consumers above a preceding, possibly conflicting store
• Needs recovery code to patch up a mis-speculation

Load hoisted only:
    ld8.a r3=
    instr 1
    instr 2
    st8
    ld.c
    add =r3,

Load and its consumer hoisted:
    ld8.a r3=
    instr 1
    add =r3,
    instr 2
    st8
    chk.a
L1:

Recovery code:
    ld8 r3=
    add =r3,
    br L1
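The role of the ALAT can be illustrated with a toy model. The sketch assumes one entry per destination register holding the full load address, and it invalidates an entry on any store to exactly that address; real hardware tracks partial addresses and access sizes, so this is a simplification. The class and method names are hypothetical.

```python
class ALAT:
    """Toy Advanced Load Address Table: register name -> load address."""

    def __init__(self):
        self.entries = {}

    def ld_a(self, mem, reg, addr):
        """Advanced load: read memory and record the address for `reg`."""
        self.entries[reg] = addr
        return mem[addr]

    def st(self, mem, addr, value):
        """Store: write memory and drop every entry with a matching address."""
        mem[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def chk_a(self, reg):
        """True if the advanced load is still valid (no recovery needed)."""
        return reg in self.entries

    def ld_c(self, mem, reg, addr, current):
        """Check load: keep the speculated value on a hit, reload on a miss."""
        if self.entries.get(reg) == addr:
            return current
        return mem[addr]
```

A store to an unrelated address leaves the entry in place, so ld.c keeps the early value; a store to the same address removes it, so ld.c reloads, and chk.a would branch to recovery code to redo the consumers as well.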

Page 22: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

22

Parallel Compare Types

• Three new types of compares:
  – and: both target predicates set FALSE if compare is false
  – or: both target predicates set TRUE if compare is true
  – DeMorgan: if true, sets one TRUE, sets other FALSE
• Not to be confused with the “parallel compare” instructions pcmp1/pcmp2/pcmp4
• Reduces critical path

Figure: control-flow graph with blocks A, B, C, D, before and after using parallel compares

Page 23: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

23

Eight Queen Example

Source: Crawford & Huck

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Unconditional compares (cycle numbers on the left):

1        R1=&b[j]
         R3=&a[i+j]
         R5=&c[i-j+7]
2        ld   R2=[R1]
         ld.s R4=[R3]
         ld.s R6=[R5]
4        p1,p2=cmp.unc(R2==true)
5   (p1) chk.s R4
    (p1) p3,p4=cmp.unc(R4==true)
6   (p3) chk.s R6
    (p3) p5,p6=cmp.unc(R6==true)
7   (p5) br then
    else

Figure: 8 queens control flow. Predicate pairs P1/P2, P3/P4, and P5/P6 guard the path to Then; the complementary predicates lead to Else.

Page 24: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

24

Eight Queen Example

Source: Crawford & Huck

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Parallel compares (cycle numbers on the left):

1        R1=&b[j]
         R3=&a[i+j]
         R5=&c[i-j+7]
         p1 <- true
2        ld R2=[R1]
         ld R4=[R3]
         ld R6=[R5]
4        p1,p2 <- cmp.and(R2==true)
         p1,p2 <- cmp.and(R4==true)
         p1,p2 <- cmp.and(R6==true)
5   (p1) br then
    else

Reduced from 7 cycles to 5

Figure: control flow. The chain of predicates P1/P2, P3/P4, P5/P6 collapses to a single test: P1 = true goes to Then, P1 = false goes to Else.

Page 25: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

25

More Example of Parallel Compare

if (c1 && c2 && c3 && c4)
    r1 = r2 + r3;
else
    r4 = r5 - r6;

Itanium code (cycle numbers on the left):

0        cmp.eq p1,p2 = r0,r0 ;;
1        cmp.eq.and.orcm p1,p2 = c1,r0
         cmp.eq.and.orcm p1,p2 = c2,r0
         cmp.eq.and.orcm p1,p2 = c3,r0
         cmp.eq.and.orcm p1,p2 = c4,r0
2   (p1) add r1=r2,r3
    (p2) sub r4=r5,r6

Figure: control flow testing c1, c2, c3, c4 in sequence, with exits to “then” and “else”

• Parallel cmp.crel.and or cmp.crel.or write the same values to both predicates
• Use cmp.crel.and.orcm or cmp.crel.or.andcm for writing complementary predicates
• Also called DeMorgan type (for complementary output)
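The update rules of the parallel compare types can be written as small pure functions. In this sketch each function takes the current predicate pair and the boolean outcome of the comparison and returns the new pair; the qualifying predicate is assumed true, and the function names are illustrative rather than actual mnemonics.

```python
def cmp_and(p1, p2, result):
    """and type: clear both targets if the compare is false, else leave them."""
    return (p1, p2) if result else (False, False)


def cmp_or(p1, p2, result):
    """or type: set both targets if the compare is true, else leave them."""
    return (True, True) if result else (p1, p2)


def cmp_or_andcm(p1, p2, result):
    """DeMorgan type: on true, set the first target and clear the second."""
    return (True, False) if result else (p1, p2)


def cmp_and_orcm(p1, p2, result):
    """DeMorgan type: on false, clear the first target and set the second."""
    return (p1, p2) if result else (False, True)


def if_all(conditions):
    """Predicates for `if (c1 && c2 && ...)`: returns (then_pred, else_pred)."""
    p1, p2 = True, False                      # initialise p1 = 1, p2 = 0
    for c in conditions:                      # order does not matter
        p1, p2 = cmp_and_orcm(p1, p2, bool(c))
    return p1, p2
```

Every compare that fires writes the same values, so the result does not depend on the order in which they are applied. That property is what lets all of them sit in one instruction group without violating the WAW rule.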

Page 26: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

26

Multiway Branches

• Multiway branches: more than 1 branch in a single cycle
  – Itanium allows multiple “consecutive” B instructions in the same instruction group
  – Allows n-way branching per cycle (Itanium and Itanium 2 have 3 branch units)
  – Ordering matters if branch predicates are not mutually exclusive
• E.g. the BBB template enables 3 branches in one bundle

The sequential versions take 3 branch cycles; the multi-way version takes 1 branch cycle.

w/o speculation:
    ld8 r6 = (ra)
    (p1) br exit1
    ld8 r7 = (rb)
    (p3) br exit2
    ld8 r8 = (rc)
    (p5) br exit3

Hoisting loads:
    ld8   r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    (p1) br exit1
    chk r7, rec1
    (p3) br exit2
    chk r8, rec2
    (p5) br exit3

Multi-way branches:
    ld8   r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
    (p2) chk r7, rec1
    (p4) chk r8, rec2
    (p1) br exit1
    (p3) br exit2
    (p5) br exit3

Figure: predicate tree with the pairs P1/P2, P3/P4, P5/P6

Page 27: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

27

Branch and Prefetch Hints

• Compiler provides hints for the branch predictor by
  – Completers in branch instructions, e.g. br.call.sptk
    • 4 completer types for static and dynamic predictions: sptk, spnt, dptk, dpnt
  – Explicit brp instructions
• Compiler provides hints for instruction sequential prefetching
  – Use completers in branch instructions, e.g. br.call.sptk.many
    • 2 completer types: many, few
    • many and few are implementation-specific
• Compiler directs predictor allocation
  – For managing branch predictor resources
  – Use completers in branch instructions, e.g. br.call.sptk.many.none
    • 2 completer types: none, clr
    • none: don't deallocate; clr: deallocate branch info

Page 28: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

28

Modulo Scheduling Support

• Will be discussed next
• Itanium features that support modulo scheduling (or software pipelining)
  – Full predication
  – Special branch handling features
    • br.ctop (for for-loops with a known loop count)
    • br.wtop (for while-loops)
  – Register rotation: removes loop copy overhead
    • No modulo variable expansion, tighter code
  – Predicate rotation/generation
    • Removes prologue & epilogue

Page 29: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

29

List Scheduling

P = Mem[A++] + C1;
Q = P * C2;
Y = P * C3 + (P + Q) * (P * C3);
Mem[B++] = Y;

Latency:
  Mem: 1 cycle
  Adder: 2 cycles
  Multiplier: 2 cycles

Figure: dependency graph with nodes X1 (ld), A1, A2, A3 (+), M1, M2, M3 (x), X2 (st) and constants C1, C2, C3; node priorities shown: 0, 1, 3, 5, 5, 7, 9, 11

• Build the dependency graph
• Assign a priority of “0” to all operations having no successors
• Assign each remaining operation the sum of the priority and latency of its successor. If there is more than one successor, assign the maximum.
• Schedule instructions based on priority

Schedule = {X1, A1, M1, A2, M2, M3, A3, X2}

Page 30: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

30

List Scheduling

Figure: dependency graph with nodes X1 (ld), A1, A2, A3 (+), M1, M2, M3 (x), X2 (st) and constants C1, C2, C3; node priorities shown: 0, 1, 3, 5, 5, 7, 9, 11

• LS (a heuristic) provides a near-optimal schedule
• But there is no guarantee of optimality, especially in terms of throughput

Reservation Table

Time  MEM  ADDER  MULT
0     X1
1          A1
2
3                 M1
4                 M2
5          A2
6
7                 M3
8
9          A3
10
11    X2
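The list-scheduling steps and the reservation table can be reproduced with a short script. It is a sketch under stated assumptions: one fully pipelined unit of each type (so a unit accepts one new operation per cycle), the dependency edges read off the four source statements, and ties broken by declaration order. The names `OPS`, `SUCCS`, `priorities`, and `list_schedule` are illustrative.

```python
# op -> (unit, latency); latencies from the slide
OPS = {
    "X1": ("mem", 1), "A1": ("adder", 2), "M1": ("mult", 2), "M2": ("mult", 2),
    "A2": ("adder", 2), "M3": ("mult", 2), "A3": ("adder", 2), "X2": ("mem", 1),
}
# op -> operations that consume its result
SUCCS = {
    "X1": ["A1"], "A1": ["M1", "M2", "A2"], "M1": ["A2"], "M2": ["M3", "A3"],
    "A2": ["M3"], "M3": ["A3"], "A3": ["X2"], "X2": [],
}


def priorities(ops, succs):
    """Priority = max over successors of (successor priority + its latency)."""
    memo = {}

    def prio(op):
        if op not in memo:
            memo[op] = max((prio(s) + ops[s][1] for s in succs[op]), default=0)
        return memo[op]

    return {op: prio(op) for op in ops}


def list_schedule(ops, succs):
    """Cycle-by-cycle list scheduling; returns op -> issue cycle."""
    prio = priorities(ops, succs)
    preds = {op: [p for p in ops if op in succs[p]] for op in ops}
    start, cycle = {}, 0
    while len(start) < len(ops):
        ready = [op for op in ops if op not in start and
                 all(p in start and start[p] + ops[p][1] <= cycle
                     for p in preds[op])]
        busy = set()  # units that already issued an operation this cycle
        for op in sorted(ready, key=lambda o: -prio[o]):
            unit = ops[op][0]
            if unit not in busy:
                busy.add(unit)
                start[op] = cycle
        cycle += 1
    return start
```

Under these assumptions `priorities` yields 11, 9, 7, 5, 5, 3, 1, 0 for X1, A1, M1, M2, A2, M3, A3, X2, and `list_schedule` issues them at cycles 0, 1, 3, 4, 5, 7, 9, 11, matching the reservation table.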

Page 31: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

31

Scheduling
• If I want to use the same schedule, what is the minimum initiation interval?
• In the example, do I need to wait for 12 cycles?
• If not, how do I avoid collisions?

Time  MEM  ADDER  MULT
0     X1
1          A1
2
3                 M1
4                 M2
5          A2
6
7                 M3
8
9          A3
10
11    X2

Page 32: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

32

Modulo Scheduling [RauGlaeser’81]

• A.k.a. “polycyclic scheduling” or “software pipelining”
• Exploit ILP among loop iterations to maximize
  – Machine utilization
  – Throughput
• Use a common schedule for the majority of iterations
• Overlap execution of consecutive iterations
• Constant initiation rate: Initiation Interval (II)
• Minimum II (MII) generates an optimal schedule with maximum throughput
• Originally developed for the polycyclic architecture (or horizontal architecture, later known as VLIW) at TRW/ESL

Page 33: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

33

Modulo Scheduling: Resource Constraint

• The optimal schedule is constrained by the number of available resources
• Determine ResII (resource minimal initiation interval)
  – Successive iterations will be scheduled ResII cycles apart
• N(i) is the number of uses of resource i in a loop
• C(i) is the number of resources of type i

$$\mathrm{ResII} = \max\left(\frac{N(1)}{C(1)},\ \frac{N(2)}{C(2)},\ \frac{N(3)}{C(3)},\ \ldots\right)$$

Page 34: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

34

Resource II

Figure: dependency graph with nodes X1 (ld), A1, A2, A3 (+), M1, M2, M3 (x), X2 (st) and constants C1, C2, C3

• Assume 3 FUs
  – 1 adder with 2-cycle latency
  – 1 mult with 2-cycle latency
  – 1 mem unit with 1-cycle latency
• Determine MII = Resource II

$$\mathrm{MII} = \mathrm{ResII} = \max\left(\frac{2}{1},\ \frac{3}{1},\ \frac{3}{1}\right) = 3$$
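The resource bound is a one-line computation. The sketch below rounds each ratio up to a whole cycle, which is an assumption on my part: the slide's formula shows plain ratios, but the later example max(3/2, 2/1) = 2 only works out with rounding. The function name and the dictionary-based interface are illustrative.

```python
import math


def res_ii(uses, units):
    """Resource-constrained minimum initiation interval.

    uses[r]  = N(r), number of times resource r is used per iteration
    units[r] = C(r), number of copies of resource r in the machine
    """
    return max(math.ceil(uses[r] / units[r]) for r in uses)
```

For this loop the call is `res_ii({"mem": 2, "adder": 3, "mult": 3}, {"mem": 1, "adder": 1, "mult": 1})`, which evaluates to 3.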

Page 35: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

35

Modulo Reservation Table (MRT)

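Original schedule (from list scheduling)

Time  MEM  ADDER  MULT
0     X1
1          A1
2
3                 M1
4                 M2
5          A2
6
7                 M3
8
9          A3
10
11    X2

New schedule for 1 iteration: rows for Time 0-14, each tagged with Modulo = Time mod 3 (0, 1, 2, 0, 1, 2, ...); all entries still empty

MRT: rows for Modulo 0, 1, 2 with columns MEM, ADDER, MULT; all entries still empty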

Page 36: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

36

Modulo Reservation Table (MRT)

Original schedule: X1 at time 0, A1 at 1, M1 at 3, M2 at 4, A2 at 5, M3 at 7, A3 at 9, X2 at 11

New schedule for 1 iteration

Modulo  Time  MEM  ADDER  MULT
0       0     X1
1       1          A1
2       2
0       3                 M1
1       4                 M2
2       5          A2
0       6
1       7
2       8                 M3
0       9
1       10
2       11
0       12         A3
1       13
2       14    X2

MRT

Modulo  MEM  ADDER  MULT
0       X1   A3     M1
1            A1     M2
2       X2   A2     M3
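The new schedule can be derived by placing operations one at a time and pushing each one later until its slot in the modulo reservation table is free. The following sketch assumes one unit per resource type, the same dependency graph as before, and placement in the original list-schedule order; it is one simple way to fill the MRT, not necessarily the procedure used to build the slide. All names are illustrative.

```python
# op -> (unit, latency)
OPS = {
    "X1": ("mem", 1), "A1": ("adder", 2), "M1": ("mult", 2), "M2": ("mult", 2),
    "A2": ("adder", 2), "M3": ("mult", 2), "A3": ("adder", 2), "X2": ("mem", 1),
}
# op -> operations whose results it consumes
PREDS = {
    "X1": [], "A1": ["X1"], "M1": ["A1"], "M2": ["A1"], "A2": ["A1", "M1"],
    "M3": ["M2", "A2"], "A3": ["M2", "M3"], "X2": ["A3"],
}
ORDER = ["X1", "A1", "M1", "M2", "A2", "M3", "A3", "X2"]


def modulo_schedule(order, ops, preds, ii):
    """Return (op -> issue time, MRT) for one iteration with interval ii."""
    mrt = {}    # (time mod ii, unit) -> op occupying that slot
    start = {}
    for op in order:
        unit = ops[op][0]
        # Earliest time at which all operands are available.
        t = max((start[p] + ops[p][1] for p in preds[op]), default=0)
        # ii consecutive cycles cover every modulo slot exactly once.
        for _ in range(ii):
            if (t % ii, unit) not in mrt:
                break
            t += 1
        else:
            raise ValueError(f"no free {unit} slot for {op} at II={ii}")
        mrt[(t % ii, unit)] = op
        start[op] = t
    return start, mrt
```

With II = 3 this yields X1 at 0, A1 at 1, M1 at 3, M2 at 4, A2 at 5, M3 at 8, A3 at 12, and X2 at 14, and the resulting MRT rows are (X1, A3, M1), (-, A1, M2), (X2, A2, M3).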

Page 37: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

37

Modulo Scheduled Loop

Prolog: times 0-11. Kernel, steady state (MRT schedule): from time 12 on. Numbers in parentheses are iteration numbers.

Modulo  Time  MEM       ADDER     MULT
0       0     X1 (1)
1       1               A1 (1)
2       2
0       3     X1 (2)              M1 (1)
1       4               A1 (2)    M2 (1)
2       5               A2 (1)
0       6     X1 (3)              M1 (2)
1       7               A1 (3)    M2 (2)
2       8               A2 (2)    M3 (1)
0       9     X1 (4)              M1 (3)
1       10              A1 (4)    M2 (3)
2       11              A2 (3)    M3 (2)
0       12    X1 (5)    A3 (1)    M1 (4)
1       13              A1 (5)    M2 (4)
2       14    X2 (1)    A2 (4)    M3 (3)
0       15    X1 (6)    A3 (2)    M1 (5)
1       16              A1 (6)    M2 (5)
2       17    X2 (2)    A2 (5)    M3 (4)
0       18    X1 (7)    A3 (3)    M1 (6)
1       19              A1 (7)    M2 (6)
2       20    X2 (3)    A2 (6)    M3 (5)
0       21    X1 (8)    A3 (4)    M1 (7)
1       22              A1 (8)    M2 (7)
2       23    X2 (4)    A2 (7)    M3 (6)
0       24    X1 (9)    A3 (5)    M1 (8)
1       25              A1 (9)    M2 (8)
2       26    X2 (5)    A2 (8)    M3 (7)
0       27    X1 (10)   A3 (6)    M1 (9)
1       28              A1 (10)   M2 (9)
2       29    X2 (6)    A2 (9)    M3 (8)
0       30    X1 (11)   A3 (7)    M1 (10)
1       31              A1 (11)   M2 (10)
2       32    X2 (7)    A2 (10)   M3 (9)

Page 38: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

38

Modulo Scheduled Loop

The left half of the slide repeats the prolog/kernel table for times 0-32 from the previous slide. The right half shows the end of the loop for N iterations: the last kernel (T+6 to T+8) followed by the epilog (T+9 to T+20).

Modulo  Time   MEM        ADDER      MULT
0       T+0    X1 (N-2)   A3 (N-6)   M1 (N-3)
1       T+1               A1 (N-2)   M2 (N-3)
2       T+2    X2 (N-6)   A2 (N-3)   M3 (N-4)
0       T+3    X1 (N-1)   A3 (N-5)   M1 (N-2)
1       T+4               A1 (N-1)   M2 (N-2)
2       T+5    X2 (N-5)   A2 (N-2)   M3 (N-3)
0       T+6    X1 (N)     A3 (N-4)   M1 (N-1)
1       T+7               A1 (N)     M2 (N-1)
2       T+8    X2 (N-4)   A2 (N-1)   M3 (N-2)
0       T+9               A3 (N-3)   M1 (N)
1       T+10                         M2 (N)
2       T+11   X2 (N-3)   A2 (N)     M3 (N-1)
0       T+12              A3 (N-2)
1       T+13
2       T+14   X2 (N-2)              M3 (N)
0       T+15              A3 (N-1)
1       T+16
2       T+17   X2 (N-1)
0       T+18              A3 (N)
1       T+19
2       T+20   X2 (N)

Page 39: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

39

Another Modulo Schedule Example

Given 2 adders (1-cycle) & 1 multiplier (2-cycle)

Figure: dependency graph with inputs A, B, C, D, E, output Z, adder nodes A1, A2, A3 (+) and multiplier nodes M1, M2 (x); node priorities shown: 0, 1, 1, 3, 3

MII = max(3/2, 2/1) = 2

Modulo Reservation Table

Modulo  ADDER1  ADDER2  MULT
0       A1 (3)  A2 (3)  M2 (2)
1       A3 (1)          M1 (3)

Figure: overlapped iterations showing prolog, 5x kernel, and epilog; the multiplier is fully utilized

Page 40: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

40

How to Perform Register Allocation?
• We are overlapping multiple iterations into one schedule.
  – Example: iterations 1 to 5 are alive at the same time
• Registers from multiple iterations are alive during a period of time

MRT

Modulo  MEM     ADDER   MULT
0       X1 (5)  A3 (1)  M1 (4)
1               A1 (5)  M2 (4)
2       X2 (1)  A2 (4)  M3 (3)

Page 41: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

41

Modulo Variable Expansion

• Analyze the “lifetime” of each architectural register
• Unroll the loop to enable the modulo schedule
• R5 needs to stay alive for 8 cycles = ⌈8/3⌉ = 3 MII (i.e. unroll 3 times)

Register lifetimes in cycles: r1 (1), r2 (4), r3 (2), r4 (3), r5 (8), r6 (4), r7 (2)

The cycle numbers assume WAR is allowed in the same cycle

Modulo  Time  MEM             ADDER               MULT
0       0     ld r1, (A)++
1       1                     add r2, r1, $c1
2       2
0       3     ld r11, (A)++                       mul r3, r2, $c2
1       4                     add r12, r11, $c1   mul r5, r2, $c3
2       5                     add r4, r2, r3
0       6     X1 (3)                              mul r13, r12, $c2
1       7                     A1 (3)              mul r15, r12, $c3
2       8                     add r14, r12, r13   mul r6, r4, r5
0       9     X1 (4)                              M1 (3)
1       10                    A1 (4)              M2 (3)
2       11                    A2 (3)              mul r16, r14, r15
0       12    X1 (5)          add r7, r5, r6      M1 (4)
1       13                    A1 (5)              M2 (4)
2       14    st r7, (B)++    A2 (4)              M3 (3)

Page 42: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

42

Post MVE code

Kernel (unrolled 3 times): times 12-20, repeated at 21-29

Modulo  Time  MEM             ADDER               MULT
0       0     ld r1, (A)++
1       1                     add r2, r1, $c1
2       2
0       3     ld r11, (A)++                       mul r3, r2, $c2
1       4                     add r12, r11, $c1   mul r5, r2, $c3
2       5                     add r4, r2, r3
0       6     ld r21, (A)++                       mul r13, r12, $c2
1       7                     add r22, r21, $c1   mul r15, r12, $c3
2       8                     add r14, r12, r13   mul r6, r4, r5
0       9     ld r1, (A)++                        mul r23, r22, $c2
1       10                    add r2, r1, $c1     mul r25, r22, $c3
2       11                    add r24, r22, r23   mul r16, r14, r15
0       12    ld r11, (A)++   add r7, r5, r6      mul r3, r2, $c2
1       13                    add r12, r11, $c1   mul r5, r2, $c3
2       14    st r7, (B)++    add r4, r2, r3      mul r26, r24, r25
0       15    ld r21, (A)++   add r17, r15, r16   mul r13, r12, $c2
1       16                    add r22, r21, $c1   mul r15, r12, $c3
2       17    st r17, (B)++   add r14, r12, r13   mul r6, r4, r5
0       18    ld r1, (A)++    add r27, r25, r26   mul r23, r22, $c2
1       19                    add r2, r1, $c1     mul r25, r22, $c3
2       20    st r27, (B)++   add r24, r22, r23   mul r16, r14, r15
0       21    ld r11, (A)++   add r7, r5, r6      mul r3, r2, $c2
1       22                    add r12, r11, $c1   mul r5, r2, $c3
2       23    st r7, (B)++    add r4, r2, r3      mul r26, r24, r25
0       24    ld r21, (A)++   add r17, r15, r16   mul r13, r12, $c2
1       25                    add r22, r21, $c1   mul r15, r12, $c3
2       26    st r17, (B)++   add r14, r12, r13   mul r6, r4, r5
0       27    ld r1, (A)++    add r27, r25, r26   mul r23, r22, $c2
1       28                    add r2, r1, $c1     mul r25, r22, $c3
2       29    st r27, (B)++   add r24, r22, r23   mul r16, r14, r15

Page 43: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

43

Register Allocation for MVE

• To save registers, not all registers need to be expanded
• Calculate the lifetime of each register to determine whether a new register is needed across iterations (the formula assumes WAR in the same instruction bundle is allowed)
• # of copies = (MII % ⌈lifetime/MII⌉ == 0) ? ⌈lifetime/MII⌉ : MII
  – R1 is alive for 1 cycle: ⌈1/3⌉ = 1 MII (need 1 copy)
  – R2 is alive for 4 cycles: ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1)
  – R3 is alive for 2 cycles: ⌈2/3⌉ = 1 MII (need 1 copy)
  – R4 is alive for 3 cycles: ⌈3/3⌉ = 1 MII (need 1 copy)
  – R5 is alive for 8 cycles: ⌈8/3⌉ = 3 MII (need 3 copies)
  – R6 is alive for 4 cycles: ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1)
  – R7 is alive for 2 cycles: ⌈2/3⌉ = 1 MII (need 1 copy)
• 13 registers used, instead of 21 with the same unrolling degree
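The copy count per register can be computed directly from the lifetimes. In this sketch I read the formula's leading "MII" as the unroll degree, which happens to equal MII (3) in this example; that reading is an assumption. A register needs q = ⌈lifetime/MII⌉ names, but q names can only cycle cleanly through the unrolled kernel if q divides the unroll degree; otherwise one name per unrolled copy is used. The function name is illustrative.

```python
import math


def copies_needed(lifetime, ii, unroll):
    """Number of register names for one value under modulo variable expansion."""
    q = math.ceil(lifetime / ii)   # iterations whose values overlap in time
    return q if unroll % q == 0 else unroll


# Lifetimes in cycles, taken from the slide
LIFETIMES = {"r1": 1, "r2": 4, "r3": 2, "r4": 3, "r5": 8, "r6": 4, "r7": 2}
COPIES = {reg: copies_needed(n, ii=3, unroll=3) for reg, n in LIFETIMES.items()}
```

`COPIES` comes out as 1, 3, 1, 1, 3, 3, 1 for r1 through r7, which sums to the 13 registers quoted above, against 7 x 3 = 21 when every register is expanded.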

Page 44: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

44

MVE (reallocate registers)

Kernel (unrolled 3 times): times 12-20, repeated at 21-29

The cycle numbers assume WAR is allowed in the same cycle

Modulo  Time  MEM             ADDER               MULT
0       0     ld r1, (A)++
1       1                     add r2, r1, $c1
2       2
0       3     ld r1, (A)++                        mul r3, r2, $c2
1       4                     add r12, r1, $c1    mul r5, r2, $c3
2       5                     add r4, r2, r3
0       6     ld r1, (A)++                        mul r3, r12, $c2
1       7                     add r22, r1, $c1    mul r15, r12, $c3
2       8                     add r4, r12, r3     mul r6, r4, r5
0       9     ld r1, (A)++                        mul r3, r22, $c2
1       10                    add r2, r1, $c1     mul r25, r22, $c3
2       11                    add r4, r22, r3     mul r16, r4, r15
0       12    ld r1, (A)++    add r7, r5, r6      mul r3, r2, $c2
1       13                    add r12, r1, $c1    mul r5, r2, $c3
2       14    st r7, (B)++    add r4, r2, r3      mul r26, r4, r25
0       15    ld r1, (A)++    add r7, r15, r16    mul r3, r12, $c2
1       16                    add r22, r1, $c1    mul r15, r12, $c3
2       17    st r7, (B)++    add r4, r12, r3     mul r6, r4, r5
0       18    ld r1, (A)++    add r7, r25, r26    mul r3, r22, $c2
1       19                    add r2, r1, $c1     mul r25, r22, $c3
2       20    st r7, (B)++    add r4, r22, r3     mul r16, r4, r15
0       21    ld r1, (A)++    add r7, r5, r6      mul r3, r2, $c2
1       22                    add r12, r1, $c1    mul r5, r2, $c3
2       23    st r7, (B)++    add r4, r2, r3      mul r26, r4, r25
0       24    ld r1, (A)++    add r7, r15, r16    mul r3, r12, $c2
1       25                    add r22, r1, $c1    mul r15, r12, $c3
2       26    st r7, (B)++    add r4, r12, r3     mul r6, r4, r5
0       27    ld r1, (A)++    add r7, r25, r26    mul r3, r22, $c2
1       28                    add r2, r1, $c1     mul r25, r22, $c3
2       29    st r7, (B)++    add r4, r22, r3     mul r16, r4, r15

Page 45: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

45

Final Modulo Schedule

Prolog code (12 instruction bundles)

Kernel: 9 instruction bundles (branch instruction not shown)

MEM             ADDER               MULT
ld r11, (A)++   add r7, r5, r6      mul r3, r2, $c2
                add r12, r11, $c1   mul r5, r2, $c3
st r7, (B)++    add r4, r2, r3      mul r26, r24, r25
ld r21, (A)++   add r17, r15, r16   mul r13, r12, $c2
                add r22, r21, $c1   mul r15, r12, $c3
st r17, (B)++   add r14, r12, r13   mul r6, r4, r5
ld r1, (A)++    add r27, r25, r26   mul r23, r22, $c2
                add r2, r1, $c1     mul r25, r22, $c3
st r27, (B)++   add r24, r22, r23   mul r16, r14, r15

Epilog code (12 instruction bundles)

Page 46: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

46

Final Modulo Schedule (Reallocate Registers)

Prolog code (12 instruction bundles)

Kernel: 9 instruction bundles (branch instruction not shown)

MEM             ADDER               MULT
ld r1, (A)++    add r7, r5, r6      mul r3, r2, $c2
                add r12, r1, $c1    mul r5, r2, $c3
st r7, (B)++    add r4, r2, r3      mul r26, r4, r25
ld r1, (A)++    add r7, r15, r16    mul r3, r12, $c2
                add r22, r1, $c1    mul r15, r12, $c3
st r7, (B)++    add r4, r12, r3     mul r6, r4, r5
ld r1, (A)++    add r7, r25, r26    mul r3, r22, $c2
                add r2, r1, $c1     mul r25, r22, $c3
st r7, (B)++    add r4, r22, r3     mul r16, r4, r15

Epilog code (12 instruction bundles)

Page 47: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

47

Issues with Modulo Variable Expansion

• Many architectural registers are needed
• Code size gets bigger when more unrolling is needed
• Alternative solution: rotating register file
  – A hardware technique
  – Solves the problem without code duplication
  – Similar to register windows plus renaming: keep old iteration values on the stack (Itanium calls the hardware the Register Stack Engine, or RSE)

Page 48: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

48

Intention of Using Rotation Registers
• Use exactly the same schedule (below) for all code, including
  – Kernel code
  – Prolog code
  – Epilog code
• The “registers” need to be re-allocated
• Registers “rotate” per iteration!

Schedule (branch instruction not shown):

MEM             ADDER             MULT
ld r1, (A)++    add r7, r5, r6    mul r3, r2, $c2
                add r2, r1, $c1   mul r5, r2, $c3
st r7, (B)++    add r4, r2, r3    mul r6, r4, r5

Page 49: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

49

Idea of Rotation Register (Original Schedule)

Iter Time Mem Adder Multiplier

0 0 ld r41, (A)++

1 add r42, r41, $c1

2

1 3 mul r43, r42, $c2

4 mul r45, r42, $c3

5 add r44, r42, r43

2 6

7

8 mul r46, r44, r45

3 9

10

11

4 12 add r47, r45, r46

13

14 st r47, (B)++

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 50: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

50

Original Code Schedule

Iter Time Mem Adder Multiplier

0 0 ld r41, (A)++

1 add r42, r41, $c1

2

1 3 mul r43, r42, $c2

4 mul r45, r42, $c3

5 add r44, r42, r43

2 6

7

8 mul r46, r44, r45

3 9

10

11

4 12 add r47, r45, r46

13

14 st r47, (B)++

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 51: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

51

Assume HW Rotation Registers

Iter Time Mem Adder Multiplier

0 0 ld r41, (A)++

1 add r42, r41, $c1

2

1 3 mul r44, r43, $c2

4 mul r45, r43, $c3

5 add r52, r43, r44

2 6

7

8 mul r48, r53, r46

3 9

10

11

4 12 add r51, r48, r50

13

14 st r51, (B)++

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers
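Why the register numbers in this table shift from one iteration to the next can be seen with a small model of a rotating register file. The sketch assumes the whole range r32-r127 rotates (as the slide states) and that each rotation decrements a register rename base, so a value written as rN is visible as rN+1 after one rotation. The class and method names are hypothetical.

```python
class RotatingRegs:
    """Rotating register file for r32..r127 with a rename base (rrb)."""

    BASE = 32
    SIZE = 96

    def __init__(self):
        self.rrb = 0
        self.phys = [0] * self.SIZE

    def _index(self, reg):
        # Map the architectural number to a physical slot through rrb.
        return (reg - self.BASE + self.rrb) % self.SIZE

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def read(self, reg):
        return self.phys[self._index(reg)]

    def rotate(self):
        # Performed once per loop iteration (e.g. by br.ctop): decrementing
        # rrb makes every value appear one register number higher.
        self.rrb = (self.rrb - 1) % self.SIZE
```

For example, the add in iteration 0 writes r42; one rotation later the multiplies read the same value as r43, and the result of `mul r45` in iteration 1 is read as r48 by the final add three rotations later, as in the table.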

Page 52: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

52

Rotation Registers in Itanium Processors

Figure: Itanium register files
• General Purpose Registers (bits 63-0): r0-r31 static, r32-r127 stacked (rotating)
• FP Registers (bits 81-0): f0-f31 static, f32-f127 rotating
• Predicate Registers (1 bit each): p0-p15 static, p16-p63 rotating

Page 53: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

53

Register Rotation (Prolog i0)

ite  Time  Mem             Adder               Multiplier
0    0     ld r41, (A)++
     1                     add r42, r41, $c1
     2

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 54: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

54

Register Rotation (Prolog i1)

ite  Time  Mem             Adder               Multiplier
0    0     ld r41, (A)++
     1                     add r42, r41, $c1
     2
1    3     ld r41, (A)++                       mul r44, r43, $c2
     4                     add r42, r41, $c1   mul r45, r43, $c3
     5                     add r52, r43, r44

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 55: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

55

Register Rotation (Prolog i2)

ite  Time  Mem             Adder               Multiplier
0    0     ld r41, (A)++
     1                     add r42, r41, $c1
     2
1    3     ld r41, (A)++                       mul r44, r43, $c2
     4                     add r42, r41, $c1   mul r45, r43, $c3
     5                     add r52, r43, r44
2    6     ld r41, (A)++                       mul r44, r43, $c2
     7                     add r42, r41, $c1   mul r45, r43, $c3
     8                     add r52, r43, r44   mul r48, r53, r46
3    9
     10
     11
4    12
     13
     14

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 56: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

56

Register Rotation (Prolog i3)

ite  Time  Mem             Adder               Multiplier
0    0     ld r41, (A)++
     1                     add r42, r41, $c1
     2
1    3     ld r41, (A)++                       mul r44, r43, $c2
     4                     add r42, r41, $c1   mul r45, r43, $c3
     5                     add r52, r43, r44
2    6     ld r41, (A)++                       mul r44, r43, $c2
     7                     add r42, r41, $c1   mul r45, r43, $c3
     8                     add r52, r43, r44   mul r48, r53, r46
3    9     ld r41, (A)++                       mul r44, r43, $c2
     10                    add r42, r41, $c1   mul r45, r43, $c3
     11                    add r52, r43, r44   mul r48, r53, r46

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 57: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

57

Register Rotation (Kernel Steady State i4)

ite  Time  Mem             Adder               Multiplier
0    0     ld r41, (A)++
     1                     add r42, r41, $c1
     2
1    3     ld r41, (A)++                       mul r44, r43, $c2
     4                     add r42, r41, $c1   mul r45, r43, $c3
     5                     add r52, r43, r44
2    6     ld r41, (A)++                       mul r44, r43, $c2
     7                     add r42, r41, $c1   mul r45, r43, $c3
     8                     add r52, r43, r44   mul r48, r53, r46
3    9     ld r41, (A)++                       mul r44, r43, $c2
     10                    add r42, r41, $c1   mul r45, r43, $c3
     11                    add r52, r43, r44   mul r48, r53, r46
4    12    ld r41, (A)++   add r51, r48, r50   mul r44, r43, $c2
     13                    add r42, r41, $c1   mul r45, r43, $c3
     14    st r51, (B)++   add r52, r43, r44   mul r48, r53, r46

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Registers wrapped around if exceeding specified bound

Page 58: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

58

Register Rotation (Kernel)

• Execute many iterations in the kernel …

Page 59: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

59

Register Rotation (Kernel to Epilog, i<-4>)

ite  Time  Mem             Adder               Multiplier
-4   N-14  ld r41, (A)++   add r51, r48, r50   mul r44, r43, $c2
     N-13                  add r42, r41, $c1   mul r45, r43, $c3
     N-12  st r51, (B)++   add r52, r43, r44   mul r48, r53, r46
-3   N-11
     N-10
     N-9
-2   N-8
     N-7
     N-6
-1   N-5
     N-4
     N-3
0    N-2
     N-1
     N

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 60: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

60

Register Rotation (Kernel to Epilog, i<-3>)

ite  Time  Mem             Adder               Multiplier
-4   N-14  ld r41, (A)++   add r51, r48, r50   mul r44, r43, $c2
     N-13                  add r42, r41, $c1   mul r45, r43, $c3
     N-12  st r51, (B)++   add r52, r43, r44   mul r48, r53, r46
-3   N-11                  add r51, r48, r50   mul r44, r43, $c2
     N-10                                      mul r45, r43, $c3
     N-9   st r51, (B)++   add r52, r43, r44   mul r48, r53, r46
-2   N-8
     N-7
     N-6
-1   N-5
     N-4
     N-3
0    N-2
     N-1
     N

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 61: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

61

Register Rotation (Kernel to Epilog, i<-2>)

ite  Time  Mem             Adder               Multiplier
-4   N-14  ld r41, (A)++   add r51, r48, r50   mul r44, r43, $c2
     N-13                  add r42, r41, $c1   mul r45, r43, $c3
     N-12  st r51, (B)++   add r52, r43, r44   mul r48, r53, r46
-3   N-11                  add r51, r48, r50   mul r44, r43, $c2
     N-10                                      mul r45, r43, $c3
     N-9   st r51, (B)++   add r52, r43, r44   mul r48, r53, r46
-2   N-8                   add r51, r48, r50
     N-7
     N-6   st r51, (B)++                       mul r48, r53, r46
-1   N-5
     N-4
     N-3
0    N-2
     N-1
     N

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 62: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

62

Register Rotation (Kernel to Epilog, i<-1>)

ite  Time  Mem             Adder               Multiplier
-4   N-14  ld r41, (A)++   add r51, r48, r50   mul r44, r43, $c2
     N-13                  add r42, r41, $c1   mul r45, r43, $c3
     N-12  st r51, (B)++   add r52, r43, r44   mul r48, r53, r46
-3   N-11                  add r51, r48, r50   mul r44, r43, $c2
     N-10                                      mul r45, r43, $c3
     N-9   st r51, (B)++   add r52, r43, r44   mul r48, r53, r46
-2   N-8                   add r51, r48, r50
     N-7
     N-6   st r51, (B)++                       mul r48, r53, r46
-1   N-5                   add r51, r48, r50
     N-4
     N-3   st r51, (B)++
0    N-2
     N-1
     N

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 63: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

63

Register Rotation (Kernel to Epilog, final ite)

ite  Time  Mem             Adder               Multiplier
-4   N-14  ld r41, (A)++   add r51, r48, r50   mul r44, r43, $c2
     N-13                  add r42, r41, $c1   mul r45, r43, $c3
     N-12  st r51, (B)++   add r52, r43, r44   mul r48, r53, r46
-3   N-11                  add r51, r48, r50   mul r44, r43, $c2
     N-10                                      mul r45, r43, $c3
     N-9   st r51, (B)++   add r52, r43, r44   mul r48, r53, r46
-2   N-8                   add r51, r48, r50
     N-7
     N-6   st r51, (B)++                       mul r48, r53, r46
-1   N-5                   add r51, r48, r50
     N-4
     N-3   st r51, (B)++
0    N-2                   add r51, r48, r50
     N-1
     N     st r51, (B)++

Assuming that registers are rotated per iteration automatically

In Intel Itanium, integer registers 32 – 127 are rotating registers

Page 64: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

64

Modulo Schedule with Rotating Register Support

• No loop unrolling required (required careful register allocation)

• Tighter code, saving space
• However, there are still prolog and epilog codes
• Can we use the same schedule for prolog/epilog?
– Use stage predicates to execute instructions conditionally
– Require new ISA support (Itanium)

Mem             Adder               Multiplier
ld r41, (A)++   add r51, r48, r50   mul r44, r43, $c2
                add r42, r41, $c1   mul r45, r43, $c3
st r51, (B)++   add r52, r43, r44   mul r48, r53, r46
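The three-row schedule can be derived mechanically from the one-iteration schedule. The sketch below assumes an initiation interval of 3 cycles (inferred from the three-row kernel and from iterations starting at times 0, 3, 6, ...) and uses the issue times of the single-iteration schedule shown earlier. The names `II`, `ORIGINAL` and `place` are illustrative, and numbering the stage predicates from p16 is an assumption based on p16 being the first rotating predicate.

```python
# Fold a one-iteration schedule into a modulo schedule.
II = 3  # assumed initiation interval: a new iteration starts every 3 cycles

# (issue time within one iteration, operation); register names omitted
ORIGINAL = [
    (0, "ld"), (1, "add#1"), (3, "mul#1"), (4, "mul#2"),
    (5, "add#2"), (8, "mul#3"), (12, "add#3"), (14, "st"),
]

def place(time, ii=II):
    """Return (kernel row, stage, stage predicate) for an issue time."""
    stage = time // ii
    return time % ii, stage, f"p{16 + stage}"

kernel = {row: [] for row in range(II)}
for time, op in ORIGINAL:
    row, stage, pred = place(time)
    kernel[row].append((pred, op))

for row in range(II):
    print(row, kernel[row])
```

Row 0 collects the load, the first multiply and the last add, matching the first line of the schedule above. Stage 3 (times 9 to 11) holds no operation, so the final add and the store land in stage 4.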

Page 65: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

65

Predicated Instruction Execution (Prolog i0)

ite  Time  Mem                   Adder                     Multiplier
0    0     (p16) ld r41, (A)++   add r51, r48, r50         mul r44, r43, $c2
     1                           (p16) add r42, r41, $c1   mul r45, r43, $c3
     2     st r51, (B)++         add r52, r43, r44         mul r48, r53, r46
1    3
     4
     5
2    6
     7
     8
3    9
     10
     11
4    12
     13
     14

Don’t execute shaded instructions

cc0: only issue ld

cc1: only issue add

cc2: no issue

Page 66: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

66

Predicated Prolog (Prolog i1)

ite  Time  Mem                   Adder                     Multiplier
0    0     (p16) ld r41, (A)++   add r51, r48, r50         mul r44, r43, $c2
     1                           (p16) add r42, r41, $c1   mul r45, r43, $c3
     2     st r51, (B)++         add r52, r43, r44         mul r48, r53, r46
1    3     (p16) ld r41, (A)++   add r51, r48, r50         (p17) mul r44, r43, $c2
     4                           (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     5     st r51, (B)++         (p17) add r52, r43, r44   mul r48, r53, r46
2    6
     7
     8
3    9
     10
     11
4    12
     13
     14

cc3: ld(i1) & mul(i0)

cc4: add(i1) & mul(i0)

cc5: add(i0)

Note that stage predicates also “rotate” per iteration

Page 67: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

67

Predicated Prolog (Prolog i2)

ite  Time  Mem                   Adder                     Multiplier
0    0     (p16) ld r41, (A)++   add r51, r48, r50         mul r44, r43, $c2
     1                           (p16) add r42, r41, $c1   mul r45, r43, $c3
     2     st r51, (B)++         add r52, r43, r44         mul r48, r53, r46
1    3     (p16) ld r41, (A)++   add r51, r48, r50         (p17) mul r44, r43, $c2
     4                           (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     5     st r51, (B)++         (p17) add r52, r43, r44   mul r48, r53, r46
2    6     (p16) ld r41, (A)++   add r51, r48, r50         (p17) mul r44, r43, $c2
     7                           (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     8     st r51, (B)++         (p17) add r52, r43, r44   (p18) mul r48, r53, r46
3    9
     10
     11
4    12
     13
     14

cc6: ld(i2) & mul(i1)

cc7: add(i2) & mul(i1)

cc8: add(i1) & mul(i0)

Page 68: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

68

Predicated Prolog (Prolog i3)

ite  Time  Mem                   Adder                     Multiplier
0    0     (p16) ld r41, (A)++   add r51, r48, r50         mul r44, r43, $c2
     1                           (p16) add r42, r41, $c1   mul r45, r43, $c3
     2     st r51, (B)++         add r52, r43, r44         mul r48, r53, r46
1    3     (p16) ld r41, (A)++   add r51, r48, r50         (p17) mul r44, r43, $c2
     4                           (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     5     st r51, (B)++         (p17) add r52, r43, r44   mul r48, r53, r46
2    6     (p16) ld r41, (A)++   add r51, r48, r50         (p17) mul r44, r43, $c2
     7                           (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     8     st r51, (B)++         (p17) add r52, r43, r44   (p18) mul r48, r53, r46
3    9     (p16) ld r41, (A)++   add r51, r48, r50         (p17) mul r44, r43, $c2
     10                          (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     11    st r51, (B)++         (p17) add r52, r43, r44   (p18) mul r48, r53, r46
4    12
     13
     14

cc9: ld(i3) & mul(i2)

cc10: add(i3) & mul(i2)

cc11: add(i2) & mul(i1)

Page 69: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

69

Predicated Kernel (i4)

ite  Time  Mem                   Adder                     Multiplier
0    0     (p16) ld r41, (A)++   add r51, r48, r50         mul r44, r43, $c2
     1                           (p16) add r42, r41, $c1   mul r45, r43, $c3
     2     st r51, (B)++         add r52, r43, r44         mul r48, r53, r46
1    3     (p16) ld r41, (A)++   add r51, r48, r50         (p17) mul r44, r43, $c2
     4                           (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     5     st r51, (B)++         (p17) add r52, r43, r44   mul r48, r53, r46
2    6     (p16) ld r41, (A)++   add r51, r48, r50         (p17) mul r44, r43, $c2
     7                           (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     8     st r51, (B)++         (p17) add r52, r43, r44   (p18) mul r48, r53, r46
3    9     (p16) ld r41, (A)++   add r51, r48, r50         (p17) mul r44, r43, $c2
     10                          (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     11    st r51, (B)++         (p17) add r52, r43, r44   (p18) mul r48, r53, r46
4    12    (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     13                          (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     14    (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

cc12: ld(i4) & add(i0) & mul(i3)

cc13: add(i4) & mul(i3)

cc14: st(i0) & add(i3) & mul(i2)

(p20) is used in iteration 4, not (p19) because of predicate rotation
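The prolog fill can be replayed in a few lines. This is a sketch only: it assumes that p16 is set for every new iteration and that each older predicate value moves up by one register per kernel pass (p16 to p17, and so on). `KERNEL` and `issued` are illustrative names, and the operations are abbreviated to their destination (or stored) register.

```python
# Kernel rows as (stage predicate, operation), taken from the predicated kernel.
KERNEL = [
    [("p16", "ld r41"), ("p20", "add r51"), ("p17", "mul r44")],
    [("p16", "add r42"), ("p17", "mul r45")],
    [("p20", "st r51"), ("p17", "add r52"), ("p18", "mul r48")],
]

def issued(pass_no):
    """Operations that issue in kernel pass `pass_no` while the pipeline fills.

    Assumption: after `pass_no` rotations the predicates p16..p(16 + pass_no)
    are 1 and all higher stage predicates are still 0.
    """
    live = {f"p{16 + s}" for s in range(pass_no + 1)}
    return [[op for pred, op in row if pred in live] for row in KERNEL]

for n in range(5):
    print(n, issued(n))
```

Pass 0 issues only the load and the first add, as in cc0 to cc2. Passes 2 and 3 issue the same operations because nothing is guarded by p19, and the final add and the store first issue in pass 4 under p20.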

Page 70: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

70

Register Rotation (Kernel)

• Execute many iterations in the kernel …

Page 71: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

71

Predicated Epilog (i<-4>)

ite  Time  Mem                   Adder                     Multiplier
-4   N-14  (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-13                        (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-12  (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-3   N-11
     N-10
     N-9
-2   N-8
     N-7
     N-6
-1   N-5
     N-4
     N-3
0    N-2
     N-1
     N

Page 72: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

72

Predicated Epilog (i<-3>)

ite  Time  Mem                   Adder                     Multiplier
-4   N-14  (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-13                        (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-12  (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-3   N-11  (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-10                        (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-9   (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-2   N-8
     N-7
     N-6
-1   N-5
     N-4
     N-3
0    N-2
     N-1
     N

Page 73: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

73

Predicated Epilog (i<-2>)

ite  Time  Mem                   Adder                     Multiplier
-4   N-14  (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-13                        (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-12  (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-3   N-11  (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-10                        (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-9   (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-2   N-8   (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-7                         (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-6   (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-1   N-5
     N-4
     N-3
0    N-2
     N-1
     N

Page 74: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

74

Predicated Epilog (i<-1>)

ite  Time  Mem                   Adder                     Multiplier
-4   N-14  (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-13                        (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-12  (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-3   N-11  (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-10                        (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-9   (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-2   N-8   (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-7                         (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-6   (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-1   N-5   (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-4                         (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-3   (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
0    N-2
     N-1
     N

Page 75: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

75

Predicated Epilog (final iteration)

ite  Time  Mem                   Adder                     Multiplier
-4   N-14  (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-13                        (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-12  (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-3   N-11  (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-10                        (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-9   (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-2   N-8   (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-7                         (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-6   (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-1   N-5   (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-4                         (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N-3   (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
0    N-2   (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
     N-1                         (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
     N     (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 76: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

76

Final Modulo Schedule (Itanium-like)

• Before entering the loop, set p16 = 1 (p16 is the first rotating predicate register)
• When the modulo-scheduled loop branch (e.g. br.ctop) is encountered
– p63 is set to 1 by hardware in the prolog code (see next slide)
– All registers (rotating registers and predicate rotating registers) rotate as each stage (iteration) advances
• Only 3 Itanium Instruction Bundles (= 3 VLIWs) needed
– No prolog, epilog codes
– No modulo variable expansions that stress registers and blow up code size

        mov ar.lc = 196       // loop count
        mov ar.ec = 5         // epilog stages+1
        mov pr.rot = 0x10000  // special inst set pr[16]=1 and p[63:17]=0
L1top:
        (p16) r41 = (A)++       (p20) r51 = r48 + r50   (p17) r44 = r43 * $c2
        (p16) r42 = r41 + $c1   (p17) r45 = r43 * $c3
        (p20) (B)++ = r51       (p17) r52 = r43 + r44   (p18) r48 = r53 * r46
        br.ctop L1top
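How the loop counter (LC), the epilog counter (EC) and the rotating predicates interact can be modelled in software. The sketch below is a simplified model, not a statement of the architected definition: the three cases in the docstring are the assumed behaviour of the loop-closing branch, EC is assumed to be at least 1 on entry, and the counts in the example (LC = 3, EC = 5) are made up to keep the trace short. `run_ctop_loop` is an illustrative name.

```python
N_ROT = 48  # rotating predicates p16..p63

def run_ctop_loop(lc, ec):
    """Model the loop-closing branch of a counted modulo-scheduled loop.

    Assumed behaviour of the branch at the end of every kernel pass:
      LC > 0           : LC -= 1, write 1 into p63, rotate, branch taken
      LC == 0, EC > 1  : EC -= 1, write 0 into p63, rotate, branch taken
      LC == 0, EC == 1 : write 0 into p63, rotate, fall out of the loop
    Rotation moves p63 into p16 and every other p[i] into p[i+1].
    Returns the values of (p16, ..., p20) seen by each kernel pass.
    """
    preds = [0] * N_ROT   # preds[0] is p16, preds[-1] is p63
    preds[0] = 1          # mov pr.rot = 0x10000: p16 = 1, p17..p63 = 0
    history = []
    while True:
        history.append(tuple(preds[:5]))
        if lc > 0:
            lc, new_bit, taken = lc - 1, 1, True
        elif ec > 1:
            ec, new_bit, taken = ec - 1, 0, True
        else:
            new_bit, taken = 0, False
        preds[-1] = new_bit
        preds = [preds[-1]] + preds[:-1]   # rotate by one position
        if not taken:
            return history

for n, state in enumerate(run_ctop_loop(lc=3, ec=5)):
    print(n, state)
```

In this model the kernel runs LC + EC times in total: p16 is 1 during the first LC + 1 passes (one per source iteration), and the epilog passes drain the remaining stages as zeros shift through p16 to p20.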

Page 77: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 0 (Prolog)
After the first iteration: LC = 195, EC = 5

Rotating Predicate Registers
p20=0 p19=0 p18=0 p17=0 p16=1 p63=1 p62=0

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 78: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 1 (Prolog)
Before the 2nd iteration

Rotating Predicate Registers
p20=0 p19=0 p18=0 p17=0 p16=1 p63=1 p62=0

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 79: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 2 (Prolog)
Before the 3rd iteration

Rotating Predicate Registers
p20=0 p19=0 p18=0 p17=0 p16=1 p63=1 p62=1 p61=0

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 80: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 3 (Prolog)
Before the 4th iteration

Rotating Predicate Registers
p20=0 p19=0 p18=0 p17=0 p16=1 p63=1 p62=1 p61=1 p60=0

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 81: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 4 (Kernel)
Before the 5th iteration

Rotating Predicate Registers
p20=0 p19=0 p18=0 p17=0 p16=1 p63=1 p62=1 p61=1 p60=1 p59=0

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 82: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

In the Kernel

• After another 191 iterations …

Page 83: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 195 (Kernel)
Before the 196th iteration: LC = 0, EC = 5

Rotating Predicate Registers
p20=1 p19=1 p18=1 p17=1 p16=1 p63=1 p62=1 p61=1 p60=1 p59=1 p58=1 p57=1 p56=1 p55=1

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 84: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 195 (Kernel)
After the 196th iteration: EC = 4

Rotating Predicate Registers
p20=1 p19=1 p18=1 p17=1 p16=1 p63=1 p62=1 p61=1 p60=0 p59=1 p58=1 p57=1 p56=1 p55=1

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 85: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 196 (Epilog)
Before the 197th iteration: EC = 4

Rotating Predicate Registers
p20=1 p19=1 p18=1 p17=1 p16=1 p63=1 p62=1 p61=1 p60=0 p59=1 p58=1 p57=1 p56=1 p55=1

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 86: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 197 (Epilog)
Before the 198th iteration: EC = 3

Rotating Predicate Registers
p20=1 p19=1 p18=1 p17=1 p16=1 p63=1 p62=1 p61=1 p60=0 p59=0 p58=1 p57=1 p56=1 p55=1

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 87: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 198 (Epilog)
Before the 199th iteration: EC = 2

Rotating Predicate Registers
p20=1 p19=1 p18=1 p17=1 p16=1 p63=1 p62=1 p61=1 p60=0 p59=0 p58=0 p57=1 p56=1 p55=1

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 88: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 199 (Epilog)
Before the 200th iteration (last iteration): EC = 1

Rotating Predicate Registers
p20=1 p19=1 p18=1 p17=1 p16=1 p63=1 p62=1 p61=1 p60=0 p59=0 p58=0 p57=0 p56=1 p55=1

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Page 89: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

Counted Modulo-scheduled Loop

Stage 199 (Epilog)
After the 200th iteration (last iteration): EC = 0

Rotating Predicate Registers
p20=1 p19=1 p18=1 p17=1 p16=1 p63=1 p62=1 p61=1 p60=0 p59=0 p58=0 p57=0 p56=0 p55=1

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

• “br.ctop” instruction exits the loop

Page 90: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

90

Modulo Scheduling Example

Loop {
  P = A + B;
  Q = C + D;
  X = P x E;
  Y = P x Q;
  Z = X + Y;
}

Step 1: Data flow graph

[Figure: data flow graph with inputs A, B, C, D, E and output Z. Adder nodes: A1 (P = A + B), A2 (Q = C + D), A3 (Z = X + Y). Multiplier nodes: M1 (X = P x E), M2 (Y = P x Q).]

Page 91: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

91

Modulo Scheduling
Step 2: Generate a list schedule

[Figure: the data flow graph (nodes A1, A2, M1, M2, A3), annotated 0, 1, 1, 3, 3]

Execution units:
2 Adders – 1 cycle latency
1 Multiplier – 2 cycle latency

Page 92: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

92

Modulo Scheduling
Step 2: Generate a list schedule

[Figure: the data flow graph (nodes A1, A2, M1, M2, A3), annotated 0, 1, 1, 3, 3]

Reservation Table
Time  Adder1  Adder2  Mult
0     A1      A2
1                     M1
2                     M2
3
4     A3

Page 93: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

93

Modulo Scheduling
Generating Modulo Schedule:

1. Determine the MII:

$$\mathrm{MII} = \max\left(\frac{N:\ \text{Resource\_demand}}{C:\ \text{Resource\_availability}}\right)$$

MII = max[(3/2), (2/1)] = 2
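The bound can be checked with a short calculation. This sketch assumes that each operation occupies its unit for one issue slot, which is what the slide's 2/1 term for the 2-cycle multiplier implies, and it rounds each ratio up because an initiation interval is a whole number of cycles. `resource_mii` is an illustrative name, and bounds from loop-carried dependences are not modelled.

```python
from math import ceil

def resource_mii(demand, available):
    """Resource-constrained lower bound on the initiation interval."""
    return max(ceil(demand[r] / available[r]) for r in demand)

demand = {"adder": 3, "multiplier": 2}     # A1, A2, A3 and M1, M2
available = {"adder": 2, "multiplier": 1}  # 2 adders, 1 multiplier

print(resource_mii(demand, available))     # 2
```

Here the single multiplier is the binding resource: two multiplies per iteration cannot start more often than once every two cycles.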

Page 94: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

94

Modulo Scheduling
Mapping from list schedule to modulo schedule

List schedule
Time  Adder1  Adder2  Mult
0     A1      A2
1                     M1
2                     M2
3
4     A3

Modulo schedule for 1 iteration
Time  Modulo 2  Adder1  Adder2  Mult
0     0         A1      A2
1     1                         M1
2     0                         M2
3     1
4     0         A3
5     1
6     0

A3 at time 4 falls in modulo slot 0, where A1 and A2 already occupy both adders; the following slides issue it at time 5 (slot 1) instead.
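The mapping step can be sketched as a greedy placement into a modulo reservation table (MRT): each operation starts at its list-schedule time and is pushed back one cycle at a time until a unit of the right kind is free in slot (time mod II). This is an illustrative simplification, not the scheduler used in the lecture: it assumes II is at least the MII (otherwise the search would not terminate) and it does not re-check dependences of operations that follow a delayed one, which is harmless here because A3 has no successors. `UNITS`, `LIST_SCHEDULE` and `modulo_schedule` are made-up names.

```python
II = 2
UNITS = {"add": ["Adder1", "Adder2"], "mul": ["Mult"]}

# (operation, kind, start time taken from the list schedule)
LIST_SCHEDULE = [("A1", "add", 0), ("A2", "add", 0),
                 ("M1", "mul", 1), ("M2", "mul", 2), ("A3", "add", 4)]

def modulo_schedule(ops, ii=II):
    mrt = {}    # (slot, unit) -> operation: the modulo reservation table
    start = {}  # operation -> issue time within one iteration
    for op, kind, t in ops:
        while True:
            free = [u for u in UNITS[kind] if (t % ii, u) not in mrt]
            if free:
                mrt[(t % ii, free[0])] = op
                start[op] = t
                break
            t += 1  # resource conflict in this slot: try the next cycle
    return start, mrt

start, mrt = modulo_schedule(LIST_SCHEDULE)
print(start)  # {'A1': 0, 'A2': 0, 'M1': 1, 'M2': 2, 'A3': 5}
```

A3 cannot stay at time 4 because both adders are taken in slot 0, so it moves to time 5, which is where the overlapped schedules on the following slides place it.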

Page 95: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

95

Modulo Scheduling

Time  Modulo 2  Adder1  Adder2  Mult
0     0         1:A1    1:A2
1     1                         1:M1
2     0         2:A1    2:A2    1:M2
3     1                         2:M1
4     0                         2:M2
5     1         1:A3
6     0
7     1         2:A3
8     0

inserting iteration 2

Page 96: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

96

Modulo Scheduling

Time  Modulo 2  Adder1  Adder2  Mult
0     0         1:A1    1:A2
1     1                         1:M1
2     0         2:A1    2:A2    1:M2
3     1                         2:M1
4     0         3:A1    3:A2    2:M2
5     1         1:A3            3:M1
6     0                         3:M2
7     1         2:A3
8     0
9     1         3:A3

inserting iteration 3

Page 97: Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

97

Modulo Scheduled Loop

[Figure: the overlapped schedule divided into prolog, kernel and epilog; label on the slide: "5x kernel"]

MRT
Modulo 2  Adder 1  Adder 2  Mult
0         3:A1     3:A2     2:M2
1         1:A3              3:M1
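The kernel entries can be reproduced by overlapping copies of the one-iteration modulo schedule, each started II = 2 cycles after the previous one. The start times in `START` are read from the overlapped schedules in the preceding slides (A3 at time 5); `issue_time` and `kernel_row` are illustrative helpers.

```python
II = 2
START = {"A1": 0, "A2": 0, "M1": 1, "M2": 2, "A3": 5}  # one iteration

def issue_time(op, iteration, ii=II):
    """Absolute issue time of `op` in an iteration (iterations count from 1)."""
    return START[op] + (iteration - 1) * ii

def kernel_row(time, iterations):
    """All 'iteration:op' entries issued at an absolute time."""
    return sorted(f"{i}:{op}" for i in iterations for op in START
                  if issue_time(op, i) == time)

print(kernel_row(4, range(1, 4)))  # ['2:M2', '3:A1', '3:A2']
print(kernel_row(5, range(1, 4)))  # ['1:A3', '3:M1']
```

Times 4 and 5 are the first two cycles in which operations from three different iterations are in flight, and they give exactly the two MRT rows; every later pair of cycles repeats the same pattern with the iteration numbers shifted by one.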