Lec15 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- EPIC VLIW

ECE 4100/6100 Advanced Computer Architecture

Lecture 15 Static Scheduling Machines

Prof. Hsien-Hsin Sean Lee

School of Electrical and Computer Engineering

Georgia Institute of Technology

2

Static Scheduling

• Compiler performs instruction scheduling
• VLIW: Very Long Instruction Word
• An alternative to dynamically scheduled processors
• Packs multiple operations into one instruction
• Moves scheduling to the compiler (software approach)
• Can reduce the complexity of a hardware-based instruction scheduler
• Examples: Cydrome, Multiflow, EPIC

3

Very Long Instruction Word (VLIW)

• Relies on the compiler
• Simple hardware
• Dependencies are explicitly represented in the instructions
• The instruction window is, supposedly, much larger than a hardware scheduling window
  – How about loop boundaries?
  – How about function boundaries?
  – Interprocedural optimization is generally difficult
• May lead to compatibility or performance issues if instruction latencies change
• EPIC/Itanium closely follows the VLIW philosophy; many embedded and DSP processors embrace VLIW

4

Intel Itanium ISA

• Itanium instruction “bundle” (VLIW)
  – 128 bits each
  – Contains three Itanium instructions (aka syllables)
  – Template bits in each bundle specify dependencies both within a bundle and between sequential bundles
  – A collection of independent bundles forms a “group” (delimited by stops)
• Each Itanium instruction
  – Fixed length: 41 bits
  – Left-most 4 bits (40-37) are the major opcode (e.g. FP ld/st, INT ld/st, ALU)
  – Contains at most three 7-bit register specifiers
  – Contains a 6-bit field specifying one of the 64 one-bit qualifying predicate registers

Bundle layout (figure): three 41-bit instruction slots plus the template (Templt) field; bit positions marked in the figure: 0, 4, 5, 45, 86, 127.
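As a rough illustration of the bundle layout above, the sketch below splits a 128-bit value into a template and three 41-bit slots. It assumes the template occupies bits 4..0 and the slots occupy bits 45..5, 86..46 and 127..87, and that the qualifying predicate is the low 6 bits of a slot. The function names `decode_bundle` and `decode_slot` are hypothetical helpers, not part of any toolchain.

```python
SLOT_BITS = 41       # each Itanium instruction (syllable) is 41 bits
TEMPLATE_BITS = 5    # assumed: template field in bits 4..0


def decode_bundle(bundle: int) -> dict:
    """Split a 128-bit bundle value into its template and three slots."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = []
    for i in range(3):
        shift = TEMPLATE_BITS + i * SLOT_BITS
        slots.append((bundle >> shift) & ((1 << SLOT_BITS) - 1))
    return {"template": template, "slots": slots}


def decode_slot(slot: int) -> dict:
    """Extract the fields named on the slide from one 41-bit instruction."""
    return {
        "major_opcode": (slot >> 37) & 0xF,   # bits 40..37
        "qualifying_predicate": slot & 0x3F,  # assumed: bits 5..0
    }


# Hypothetical bundle: template 0x02 (MI_I) with three made-up slot values.
example = 0x02 | (1 << 5) | (2 << 46) | (3 << 87)
decoded = decode_bundle(example)
```

The field widths add up to 5 + 3 × 41 = 128 bits, which is a quick sanity check on the layout.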

5

Encoding Instruction Bundle

• Use “;;” as a “stop bit” in assembly code to separate dependent instructions
• Instructions between two “;;” belong to the same “instruction group”
  – RAW and WAW dependences are not allowed within an instruction group
  – WAR is allowed, except for one special case: writing p63 with a modulo-scheduled branch (e.g. br.ctop) after a B-type instruction has read p63 (e.g. as its qualifying predicate)
• Each instruction slot can represent one (out of 5) functional unit types, depending on the template encoding (e.g. slot 0 can be an M-unit or a B-unit)
• 12 basic templates are provided, each in 2 versions depending on the stop bit
  – MII, MI_I, MLX, MMI, M_MI, MFI, MMF, MIB, MBB, BBB, MMB, MFB
  – MII_, MI_I_, MLX_, MMI_, M_MI_, MFI_, MMF_, MIB_, MBB_, BBB_, MMB_, MFB_

Example (MI_I format ⇒ template encoded “02”):

{ .mii
    ld4 r28 = [r8]
    add r9 = 2, r1 ;;
    add r30 = 1, r9
}

6

Itanium Instruction Example

{ .mii
    add r1 = r2, r3
    sub r4 = r4, r5 ;;
    shr r1 = r4, r1 ;;
}
{ .mmi
    ld8 r2 = [r1] ;;
    st8 [r1] = r23
    tbit p1, p2 = r4, 5
}
{ .mbb
    ld8 r45 = [r55]
(p3) br.call b1 = func1
(p4) br.cond Label1
}
{ .mfi
    st4 [r45] = r6
    fmac f1 = f2, f3
    add r3 = r3, 8 ;;
}

7

Itanium Register Files

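Figure: three register files, each split into a static part and a stacked (rotating) part.

• General Purpose Registers: 0 to 31 static, 32 to 127 stacked (rotating); bits 63..0
• FP Registers: 0 to 31 static, 32 to 127 stacked (rotating); bits 81..0
• Predicate Registers: 0 to 15 static, 16 to 63 stacked (rotating); one bit each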

8

Register Stack Engine

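• Avoids register spills/fills during function call/return
• The callee executes alloc r1=ar.pfs, i, l, o, r upon entering a function (i = inputs, l = locals, o = outputs, r = rotating)

Figure: the general register file has static registers 0 to 31 and stacked registers 32 to 127. The current frame starts at r32 and holds (inputs), locals and outputs in that order; registers beyond the frame are illegal.

• size of frame (sof)
• size of locals (sol = i+l)
• size of rotating (sor)
• Current Frame Marker (CFM), 38 bits, with fields sof, sol, sor, rrb.gr, rrb.fr, rrb.pr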

9

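Function Call Example

main()
{

    a = foo(i*i, b[i]);

}

int foo(int ii, int bb)
{

}

main: alloc r32=ar.pfs,0,12,2,0
foo:  alloc r26=ar.pfs,2,5,0,0

Figure (GPR stacked frames):
• Caller (main): the frame runs from r32 to r45; r32 to r43 are locals, r44 and r45 are outputs holding i*i and b[i]; the stacked registers end at r127
• Callee (foo): the frame runs from r32 to r38; the caller's outputs appear as the callee's inputs r32 and r33 (i*i and b[i])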
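The register ranges in the function call example follow directly from the alloc operands. The sketch below is a minimal model, assuming a frame always starts at r32 and that sof = i + l + o; `frame_layout` is a hypothetical helper name, and the rotating operand is ignored.

```python
FIRST_STACKED = 32   # stacked general registers are r32..r127
NUM_STACKED = 96


def frame_layout(inputs: int, locals_: int, outputs: int) -> dict:
    """Name the registers of a frame created by alloc rX=ar.pfs,i,l,o,r."""
    sol = inputs + locals_   # size of locals, as defined on the RSE slide
    sof = sol + outputs      # size of frame (assumed: inputs + locals + outputs)
    if sof > NUM_STACKED:
        raise ValueError("frame does not fit in the stacked registers")

    def regs(start: int, count: int) -> list:
        return [f"r{n}" for n in range(start, start + count)]

    return {
        "sol": sol,
        "sof": sof,
        "inputs": regs(FIRST_STACKED, inputs),
        "locals": regs(FIRST_STACKED + inputs, locals_),
        "outputs": regs(FIRST_STACKED + sol, outputs),
    }


main_frame = frame_layout(0, 12, 2)   # main: alloc r32=ar.pfs,0,12,2,0
foo_frame = frame_layout(2, 5, 0)     # foo:  alloc r26=ar.pfs,2,5,0,0
```

With these operands, main's outputs land in r44 and r45, and foo's frame ends at r38, matching the register labels in the figure.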

10

RSE: A Function Call

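Figure: frame state before and after the call (pfm: previous frame marker).

State | Frame | CFM.sof | CFM.sol | PFS.pfm (sof, sol)
Before call | r32 to r52 (loc r32 to r45, out r46 to r52) | 21 | 14 | xx, xx
After call | r32 to r38 (out) | 7 | 0 | 21, 14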

11

RSE: Alloc

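Figure: frame state before the call, after the call, and after alloc r32=ar.pfs,7,9,3,0.

State | Frame | CFM.sof | CFM.sol | PFS.pfm (sof, sol)
Before call | r32 to r52 (loc r32 to r45, out r46 to r52) | 21 | 14 | xx, xx
After call | r32 to r38 (out) | 7 | 0 | 21, 14
After alloc | r32 to r50 (inputs and loc r32 to r47, out r48 to r50) | 19 | 16 | 21, 14

alloc copies PFM to a GR (r32)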

12

RSE: Return

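Figure: frame state across call, alloc and return.

State | Frame | CFM.sof | CFM.sol | PFS.pfm (sof, sol)
Before call | r32 to r52 (loc r32 to r45, out r46 to r52) | 21 | 14 | xx, xx
After call | r32 to r38 (out) | 7 | 0 | 21, 14
After alloc | r32 to r50 (loc r32 to r47, out r48 to r50) | 19 | 16 | 21, 14
After return | r32 to r52 (loc r32 to r45, out r46 to r52) | 21 | 14 | 21, 14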
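The call, alloc and return steps on the three RSE slides can be traced with a very small state machine. This is only a sketch under simplifying assumptions: it tracks just sof and sol, models a single level of call nesting, and ignores physical register renaming and spills to the backing store. `Frame` and `RegisterStack` are illustrative names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    sof: int   # size of frame
    sol: int   # size of locals


class RegisterStack:
    """Tracks the current frame marker (CFM) and previous frame marker (PFM)."""

    def __init__(self, sof: int, sol: int) -> None:
        self.cfm = Frame(sof, sol)
        self.pfm: Optional[Frame] = None   # "xx" on the slide

    def call(self) -> None:
        # The caller's CFM is saved; the callee initially sees only the
        # caller's output registers, renamed to start at r32.
        self.pfm = Frame(self.cfm.sof, self.cfm.sol)
        self.cfm = Frame(sof=self.cfm.sof - self.cfm.sol, sol=0)

    def alloc(self, inputs: int, locals_: int, outputs: int) -> Frame:
        # Resizes the current frame and returns the PFM, which the real
        # instruction copies into a general register.
        self.cfm = Frame(sof=inputs + locals_ + outputs, sol=inputs + locals_)
        return self.pfm

    def ret(self) -> None:
        # Restores the caller's frame from the saved marker.
        self.cfm = Frame(self.pfm.sof, self.pfm.sol)


rs = RegisterStack(sof=21, sol=14)
rs.call()
after_call = Frame(rs.cfm.sof, rs.cfm.sol)
rs.alloc(7, 9, 3)
after_alloc = Frame(rs.cfm.sof, rs.cfm.sol)
rs.ret()
after_return = Frame(rs.cfm.sof, rs.cfm.sol)
```

The three snapshots reproduce the (sof, sol) pairs 7/0, 19/16 and 21/14 shown in the figures.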

13

Itanium Pipelines

• Performance improvement due to pipeline shortening: 4% to 6%
• The large integer register file causes an extra WLD (Word Line Decode) stage in Itanium; the circuit was improved for Itanium 2
• Inter-group latency is enforced by a scoreboard
  – Latency due to scheduling that failed to space instructions out
  – Latency due to cache misses

Figure labels: Front-end; Ckt improved; dependency scoreboard stall is checked prior to EXE.

14

Itanium 2 Eight-stage Pipeline

Core: IPG, ROT, EXP, REN, REG, EXE, DET, WB
FP:   FP1, FP2, FP3, FP4, WB
L2:   L2N, L2I, L2A, L2M, L2D, L2C, L2W

Stage | Function
IPG | IP Generate, L1I cache (6 inst) and TLB access
ROT | Instruction Rotate and Buffer (6 inst)
EXP | Expand, Port assignment and routing
REN | INT and FP register rename
REG | INT and FP register file read
EXE | ALU Execute, L1D cache and TLB access + L2 cache tag access
DET | Exception Detect, Branch Correction
WB | Writeback, INT register update
FP1-WB | FP FMAC pipeline (2) + register write
L2N-L2I | L2 Queue Nominate/Issue (4) (speculatively issued with L1 request)
L2A-L2W | L2 Access, Rotate, Correct, Write (4)

15

Itanium 2 Microarchitecture

Figure (block diagram); recoverable block labels:
• L1 I-Cache & Fetch/Prefetch engine, I-TLB, Branch Prediction
• Instruction Queue (8 bundles); IA-32 Decode & Control
• 11 issue ports: B B B, M M M M, I I, F F
• Register stack engine / remapping
• Branch & Predicate registers, 128 INT Registers, 128 FP Registers
• Branch Units, INT & MM Units, Floating Point Units
• Scoreboard, Predicate, NaT, Exceptions
• Quad-port (INT) L1 PIPT Data Cache (WT), D-TLB, ALAT
• PIPT Unified L2 Cache, Quad-Port (ECC)
• On-chip PIPT Unified L3 Cache, Single-ported (ECC)
• Bus Controller (ECC)

17

Control Speculation (Speculative Load)

• Hides memory latency through control speculation at compile time: loads are elevated above a branch
• Exceptions are deferred by setting NaT (the GR's 65th bit), which indicates:
  – Whether or not an exception has occurred
  – That a branch to fixup code is required
• NaT is set during ld.s and checked by chk.s

Conventional architectures (the branch is a barrier):
    instr 1
    instr 2
    . . .
    br
    Load
    use

Itanium:
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use

18

Control Speculation (Hoist Uses)

• The uses of speculative data can be executed speculatively
  – Distinguishes speculation from simple prefetch
• The NaT bit propagates down the chain of dependent instructions

IA-64:
    ld.s
    instr 1
    instr 2
    br
    chk.s
    use

19

Control Speculation (Recovery)

• All computation instructions propagate NaTs to their consumers, which reduces the number of checks
• Cmp propagates “false” when writing predicates if NaT is set (“0” for both target predicates)

    ld8.s r3 = (r9)
    ld8.s r4 = (r10)
    add   r6 = r3, r4
    ld8.s r5 = (r6)
    p1,p2 = cmp(...)
    chk.s r5, recv        (allows a single chk on the result)
    sub   r7 = r5, r2

Recovery code:
    ld8
    ld8
    add
    ld8
    br home
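To make the NaT propagation in the recovery example concrete, here is a toy model of the chain ld8.s, ld8.s, add, ld8.s followed by a single check. It assumes that a faulting speculative load is simply an address missing from a dictionary; `Reg`, `ld_s`, `add` and `chk_s` are illustrative stand-ins for the hardware behaviour, not real APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reg:
    value: int = 0
    nat: bool = False   # models the 65th (NaT) bit of a general register


def ld_s(memory: dict, addr: Reg) -> Reg:
    """Speculative load: defer the fault by setting NaT instead of trapping."""
    if addr.nat or addr.value not in memory:
        return Reg(0, True)
    return Reg(memory[addr.value])


def add(a: Reg, b: Reg) -> Reg:
    """Computation propagates NaT from either source to the result."""
    return Reg(a.value + b.value, a.nat or b.nat)


def chk_s(reg: Reg) -> bool:
    """Return True when control must branch to the recovery code."""
    return reg.nat


def speculative_chain(memory: dict, r9: Reg, r10: Reg) -> Reg:
    r3 = ld_s(memory, r9)
    r4 = ld_s(memory, r10)
    r6 = add(r3, r4)
    return ld_s(memory, r6)   # r5; one chk.s on r5 covers the whole chain


memory = {100: 1, 200: 2, 3: 42}
good = speculative_chain(memory, Reg(100), Reg(200))
bad = speculative_chain(memory, Reg(100), Reg(999))   # second load faults
```

Because every step forwards the NaT bit, checking only the final register is enough to detect a fault in any of the three loads.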

20

Data Speculation (Advanced Loads)

• The compiler can hoist a load above a preceding, possibly conflicting store
• The ALAT (Advanced Load Address Table) is used to check the address of every store in between
• A superscalar machine can do the same using store coloring

Conventional architectures (the store is a barrier):
    instr 1
    instr 2
    . . .
    st8
    ld8
    use

Itanium:
    ld8.a
    instr 1
    instr 2
    st8
    ld.c
    use

21

Data Speculation (load.a + chk.a)

• The compiler hoists a load and its subsequent consumers above a preceding, possibly conflicting store
• Recovery code must be patched in to handle mis-speculation

Hoisting only the load (checked with ld.c):
    ld8.a r3=
    instr 1
    instr 2
    st8
    ld.c
    add =r3,

Hoisting the load and its consumer (checked with chk.a):
    ld8.a r3=
    instr 1
    add =r3,
    instr 2
    st8
    chk.a
L1:

Recovery code:
    ld8 r3=
    add =r3,
    br L1
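The interaction between ld.a, intervening stores and chk.a can be sketched with a dictionary standing in for the ALAT. This is a simplification: it assumes entries are keyed by target register and matched on the full address, whereas real hardware also tracks access size and may use partial address tags. The class `ALAT` and function `run` are illustrative names.

```python
class ALAT:
    """Toy Advanced Load Address Table: target register -> load address."""

    def __init__(self) -> None:
        self._entries = {}

    def advanced_load(self, reg: str, addr: int) -> None:
        self._entries[reg] = addr            # ld.a allocates an entry

    def store(self, addr: int) -> None:
        # Every store removes entries whose address it overlaps.
        self._entries = {r: a for r, a in self._entries.items() if a != addr}

    def check(self, reg: str) -> bool:
        return reg in self._entries          # entry survived: speculation was safe


def run(store_addr: int) -> int:
    memory = {0x100: 7, 0x200: 0}
    alat = ALAT()

    r3 = memory[0x100]                       # ld8.a r3 = [0x100]
    alat.advanced_load("r3", 0x100)
    result = r3 + 1                          # hoisted consumer (add = r3, ...)

    memory[store_addr] = 99                  # st8, possibly conflicting
    alat.store(store_addr)

    if not alat.check("r3"):                 # chk.a fails: run recovery code
        r3 = memory[0x100]                   # reload
        result = r3 + 1                      # redo the consumer
    return result


no_conflict = run(0x200)
conflict = run(0x100)
```

When the store hits a different address the speculative result is kept; when it hits the same address the recovery path recomputes the consumer from the freshly stored value.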

22

Parallel Compare Types

• Three new types of compares:
  – and: both target predicates are set FALSE if the compare is false
  – or: both target predicates are set TRUE if the compare is true
  – DeMorgan: if true, sets one predicate TRUE and the other FALSE
• Not to be confused with the “parallel compare” instructions pcmp1/pcmp2/pcmp4

Figure: control flow through blocks A, B, C and D, before and after applying parallel compares. Reduces the critical path.

23

Eight Queen Example

Source: Crawford & Huck

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Unconditional compares:
    R1=&b[j]
    R3=&a[i+j]
    R5=&c[i-j+7]
    ld R2=[R1]
    ld.s R4=[R3]
    ld.s R6=[R5]
    p1,p2=cmp.unc(R2==true)
(p1) chk.s R4
(p1) p3,p4=cmp.unc(R4==true)
(p3) chk.s R6
(p3) p5,p6=cmp.unc(R6==true)
(p5) br then
    else

Figure: 8 queens control flow with predicates P1/P2, P3/P4 and P5/P6 on the three tests, ending in Then or Else. Cycle markers in the figure: 1, 2, 4, 5, 6, 7.

24

Eight Queen Example

Source: Crawford & Huck

if ((b[j] == true) && (a[i+j] == true) && (c[i-j+7] == true))

Parallel compares:
    R1=&b[j]
    R3=&a[i+j]
    R5=&c[i-j+7]
    p1 <- true
    ld R2=[R1]
    ld R4=[R3]
    ld R6=[R5]
    p1,p2 <- cmp.and(R2==true)
    p1,p2 <- cmp.and(R4==true)
    p1,p2 <- cmp.and(R6==true)
(p1) br then
    else

Figure: the control flow with predicates P1 to P6 collapses to a single test: P1 = true leads to Then, P1 = false leads to Else. Cycle markers in the figure: 1, 2, 4, 5.

Reduced from 7 cycles to 5

25

More Example of Parallel Compare

if (c1 && c2 && c3 && c4)
    r1 = r2 + r3;
else
    r4 = r5 - r6

Itanium code (cycle markers in the figure: 0, 1, 2):

    cmp.eq p1,p2 = r0,r0 ;;
    cmp.ne.and.orcm p1,p2 = c1,r0
    cmp.ne.and.orcm p1,p2 = c2,r0
    cmp.ne.and.orcm p1,p2 = c3,r0
    cmp.ne.and.orcm p1,p2 = c4,r0
(p1) add r1=r2,r3
(p2) sub r4=r5,r6

Figure: control flow through the tests c1, c2, c3 and c4, ending in then or else.

• Parallel cmp.crel.and or cmp.crel.or write the same value to both predicates
• Use cmp.crel.and.orcm or cmp.crel.or.andcm to write complementary predicates
• Also called DeMorgan type (for complementary output)
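The update rules of the parallel compare types can be written as small pure functions. The sketch below assumes the commonly documented behaviour that each type writes its targets only in one outcome (and: on false; or: on true) and leaves them unchanged otherwise, which is what lets several compares target the same predicates in one cycle. The function names are illustrative.

```python
def cmp_and(p1: bool, p2: bool, result: bool) -> tuple:
    """and type: clear both targets when the compare is false."""
    return (p1, p2) if result else (False, False)


def cmp_or(p1: bool, p2: bool, result: bool) -> tuple:
    """or type: set both targets when the compare is true."""
    return (True, True) if result else (p1, p2)


def cmp_and_orcm(p1: bool, p2: bool, result: bool) -> tuple:
    """DeMorgan type and.orcm: on false, clear p1 and set p2."""
    return (p1, p2) if result else (False, True)


def cmp_or_andcm(p1: bool, p2: bool, result: bool) -> tuple:
    """DeMorgan type or.andcm: on true, set p1 and clear p2."""
    return (True, False) if result else (p1, p2)


def if_all(conditions: list) -> tuple:
    """Model of: if (c1 && c2 && ...) then-path under p1, else-path under p2."""
    p1, p2 = True, False                       # cmp.eq p1,p2 = r0,r0
    for c in conditions:                       # order does not matter
        p1, p2 = cmp_and_orcm(p1, p2, c != 0)
    return p1, p2


all_true = if_all([1, 1, 1, 1])
one_false = if_all([1, 0, 1, 1])
```

Since every compare that writes at all writes the same pair of values, the final predicates do not depend on the order in which the compares are evaluated.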

26

Multiway Branches

• Multiway branches: more than 1 branch in a single cycle
  – Itanium allows multiple “consecutive” B instructions in the same instruction group
  – Allows n-way branching per cycle (Itanium and Itanium 2 have 3 branch units)
  – Ordering matters if the branch predicates are not mutually exclusive
• E.g. the BBB template enables 3 branches in one bundle

w/o Speculation:
    ld8 r6 = (ra)
(p1) br exit1
    ld8 r7 = (rb)
(p3) br exit2
    ld8 r8 = (rc)
(p5) br exit3

Hoisting Loads (3 branch cycles):
    ld8 r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
(p1) br exit1
    chk r7, rec1
(p3) br exit2
    chk r8, rec2
(p5) br exit3

Multi-way Branches (1 branch cycle):
    ld8 r6 = (ra)
    ld8.s r7 = (rb)
    ld8.s r8 = (rc)
(p2) chk r7, rec1
(p4) chk r8, rec2
(p1) br exit1
(p3) br exit2
(p5) br exit3

Figure: predicate tree with the pairs P1/P2, P3/P4 and P5/P6.

27

Branch and Prefetch Hints

• The compiler provides hints to the branch predictor through
  – Completers in branch instructions, e.g. br.call.sptk
    • 4 completer types for static and dynamic prediction: sptk, spnt, dptk, dpnt
  – Explicit brp instructions
• The compiler provides hints for sequential instruction prefetching
  – Uses a completer in branch instructions, e.g. br.call.sptk.many
    • 2 completer types: many, few
    • many and few are implementation-specific
• The compiler directs predictor allocation
  – For managing branch predictor resources
  – Uses a completer in branch instructions, e.g. br.call.sptk.many.none
    • 2 completer types: none, clr
    • none: don’t deallocate; clr: deallocate branch info

28

Modulo Scheduling Support

• Will be discussed next
• Itanium features that support modulo scheduling (or software pipelining)
  – Full predication
  – Special branch handling features
    • br.ctop (for a for-loop with a known loop count)
    • br.wtop (for a while-loop)
  – Register rotation: removes loop copy overhead
    • No modulo variable expansion, tighter code
  – Predicate rotation/generation
    • Removes prologue & epilogue

29

List Scheduling

P = Mem[A++] + C1;
Q = P * C2;
Y = P * C3 + (P + Q) * (P * C3);
Mem[B++] = Y;

Latency:
• Mem: 1 cycle
• Adder: 2 cycles
• Multiplier: 2 cycles

Figure: dependency graph with operations X1 (ld), A1, A2, A3 (+), M1, M2, M3 (x) and X2 (st), and constants C1, C2, C3.

• Build the dependency graph
• Assign a priority of “0” to all operations that have no successors
• Assign each remaining operation the sum of the priority and latency of its successor; if it has more than one successor, take the maximum
• Schedule instructions based on priority

Priorities: X1 = 11, A1 = 9, M1 = 7, A2 = 5, M2 = 5, M3 = 3, A3 = 1, X2 = 0

Schedule = {X1, A1, M1, A2, M2, M3, A3, X2}
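The priority rule above can be checked mechanically. The sketch below assumes the dependency edges implied by the four statements (for example, A1 feeds M1, M2 and A2); the edge list and the names `SUCCESSORS`, `UNIT` and `priority` are a reading of the slide's graph, not something stated explicitly in the extracted text.

```python
from functools import lru_cache

LATENCY = {"mem": 1, "add": 2, "mul": 2}

UNIT = {
    "X1": "mem", "A1": "add", "M1": "mul", "A2": "add",
    "M2": "mul", "M3": "mul", "A3": "add", "X2": "mem",
}

# Assumed edges: X1 = load, A1 = P, M1 = Q, M2 = P*C3, A2 = P+Q,
# M3 = A2*M2, A3 = M2+M3 = Y, X2 = store.
SUCCESSORS = {
    "X1": ["A1"],
    "A1": ["M1", "M2", "A2"],
    "M1": ["A2"],
    "A2": ["M3"],
    "M2": ["M3", "A3"],
    "M3": ["A3"],
    "A3": ["X2"],
    "X2": [],
}


@lru_cache(maxsize=None)
def priority(op: str) -> int:
    """0 for sinks; otherwise max over successors of priority + latency."""
    succs = SUCCESSORS[op]
    if not succs:
        return 0
    return max(priority(s) + LATENCY[UNIT[s]] for s in succs)


def priority_order() -> list:
    # sorted() is stable, so ties keep the order in which ops are listed.
    return sorted(SUCCESSORS, key=lambda op: -priority(op))


priorities = {op: priority(op) for op in SUCCESSORS}
order = priority_order()
```

Under these assumptions the computed priorities are 11, 9, 7, 5, 5, 3, 1 and 0, and sorting by priority gives the schedule order listed on the slide.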

30

List Scheduling

Figure: the same dependency graph (X1, A1, A2, A3, M1, M2, M3, X2), annotated with the priorities 11, 9, 7, 5, 5, 3, 1, 0.

• List scheduling (a heuristic) provides a near-optimal schedule
• But there is no guarantee of optimality, especially in terms of throughput

Reservation Table

Time | MEM | ADDER | MULT
0 | X1 | |
1 | | A1 |
2 | | |
3 | | | M1
4 | | | M2
5 | | A2 |
6 | | |
7 | | | M3
8 | | |
9 | | A3 |
10 | | |
11 | X2 | |

31

Scheduling

• If I want to use the same schedule, what is the minimum initiation interval?
• In the example, do I need to wait for 12 cycles?
• If not, how do I avoid collisions?

Time | MEM | ADDER | MULT
0 | X1 | |
1 | | A1 |
2 | | |
3 | | | M1
4 | | | M2
5 | | A2 |
6 | | |
7 | | | M3
8 | | |
9 | | A3 |
10 | | |
11 | X2 | |

32

Modulo Scheduling [RauGlaeser’81]

• A.k.a. “Polycyclic scheduling” or “Software pipelining”
• Exploits ILP among loop iterations to maximize
  – Machine utilization
  – Throughput
• Uses a common schedule for the majority of iterations
• Overlaps execution of consecutive iterations
• Constant initiation rate: Initiation Interval (II)
• The minimum II (MII) generates an optimal schedule with maximum throughput
• Originally developed for the polycyclic architecture (or horizontal architecture, later known as VLIW) at TRW/ESL

33

Modulo Scheduling: Resource Constraint

• The optimal schedule is constrained by the number of available resources
• Determine ResII (resource minimal initiation interval)
  – Successive iterations will be scheduled ResII cycles apart
• N(i) is the number of uses of resource i in a loop
• C(i) is the number of resources i

$$ \mathrm{ResII} = \max\left( \frac{N(1)}{C(1)},\ \frac{N(2)}{C(2)},\ \frac{N(3)}{C(3)},\ \ldots \right) $$

34

Resource II

Figure: the dependency graph of the list scheduling example (X1, A1, A2, A3, M1, M2, M3, X2 with constants C1, C2, C3).

• Assume 3 FUs
  – 1 adder with 2-cycle latency
  – 1 mult with 2-cycle latency
  – 1 mem unit with 1-cycle latency
• Determine MII = Resource II

$$ \mathrm{MII} = \mathrm{ResII} = \max\left( \frac{2}{1},\ \frac{3}{1},\ \frac{3}{1} \right) = 3 $$
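The resource bound is a one-line computation. The sketch below takes the usage counts N(i) and unit counts C(i) as dictionaries; rounding each ratio up with `ceil` is an assumption added here so that the result is a whole number of cycles, and `res_ii` is an illustrative name.

```python
from math import ceil


def res_ii(uses: dict, units: dict) -> int:
    """Resource-constrained minimum initiation interval: max over N(i)/C(i)."""
    return max(ceil(uses[r] / units[r]) for r in uses)


# Example from this slide: 2 memory ops, 3 adds, 3 multiplies, one unit each.
example_one = res_ii({"mem": 2, "adder": 3, "mult": 3},
                     {"mem": 1, "adder": 1, "mult": 1})

# A machine with 2 adders and 1 multiplier running 3 adds and 2 multiplies.
example_two = res_ii({"adder": 3, "mult": 2}, {"adder": 2, "mult": 1})
```

The first call reproduces the value 3 derived above; the second corresponds to max(3/2, 2/1) = 2.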

35

Modulo Reservation Table (MRT)

MRT

New Schedule for 1 iteration

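Original schedule, with modulo time = time mod 3:

Time | Modulo | MEM | ADDER | MULT
0 | 0 | X1 | |
1 | 1 | | A1 |
2 | 2 | | |
3 | 0 | | | M1
4 | 1 | | | M2
5 | 2 | | A2 |
6 | 0 | | |
7 | 1 | | | M3
8 | 2 | | |
9 | 0 | | A3 |
10 | 1 | | |
11 | 2 | X2 | |

The new schedule (times 0 to 14) and the MRT (modulo rows 0 to 2, columns MEM, ADDER, MULT) are still empty.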

36

Modulo Reservation Table (MRT)

MRT

New Schedule for 1 iteration

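Time | Modulo | Original schedule | New schedule
0 | 0 | X1 (MEM) | X1 (MEM)
1 | 1 | A1 (ADDER) | A1 (ADDER)
2 | 2 | |
3 | 0 | M1 (MULT) | M1 (MULT)
4 | 1 | M2 (MULT) | M2 (MULT)
5 | 2 | A2 (ADDER) | A2 (ADDER)
6 | 0 | |
7 | 1 | M3 (MULT) |
8 | 2 | | M3 (MULT)
9 | 0 | A3 (ADDER) |
10 | 1 | |
11 | 2 | X2 (MEM) |
12 | 0 | | A3 (ADDER)
13 | 1 | |
14 | 2 | | X2 (MEM)

MRT:

Modulo | MEM | ADDER | MULT
0 | X1 | A3 | M1
1 | | A1 | M2
2 | X2 | A2 | M3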
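One simple way to arrive at a new schedule like the one above is to place each operation at its earliest dependence-ready time and then delay it until its unit's row in the MRT is free. The sketch below assumes one unit per resource type, fully pipelined units, and the same dependency edges as the list scheduling example; it is a greedy illustration, not the full iterative modulo scheduling algorithm. `modulo_schedule` and `OPS` are illustrative names.

```python
def modulo_schedule(ops: list, ii: int) -> tuple:
    """ops: (name, unit, latency, predecessors), listed in dependency order."""
    start, latency, mrt = {}, {}, {}   # mrt maps (unit, time % ii) -> op name
    for name, unit, lat, preds in ops:
        # Earliest time at which all operands are available.
        t = max((start[p] + latency[p] for p in preds), default=0)
        # Delay until the unit is free in this modulo row (at most ii tries).
        for _ in range(ii):
            if (unit, t % ii) not in mrt:
                break
            t += 1
        else:
            raise ValueError(f"no free {unit} slot for {name} at II={ii}")
        mrt[(unit, t % ii)] = name
        start[name], latency[name] = t, lat
    return start, mrt


OPS = [
    ("X1", "mem", 1, []),
    ("A1", "adder", 2, ["X1"]),
    ("M1", "mult", 2, ["A1"]),
    ("M2", "mult", 2, ["A1"]),
    ("A2", "adder", 2, ["A1", "M1"]),
    ("M3", "mult", 2, ["A2", "M2"]),
    ("A3", "adder", 2, ["M2", "M3"]),
    ("X2", "mem", 1, ["A3"]),
]

start_times, table = modulo_schedule(OPS, ii=3)
```

With II = 3, M3 slips from cycle 7 to 8, A3 from 9 to 12 and X2 from 11 to 14, which fills every adder and multiplier row of the table.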

37

Modulo Scheduled Loop

Iteration numbers are shown in parentheses. Times 0 to 11 are the prolog; from time 12 on the loop runs the kernel in steady state (the MRT schedule).

Modulo | Time | MEM | ADDER | MULT
0 | 0 | X1 (1) | |
1 | 1 | | A1 (1) |
2 | 2 | | |
0 | 3 | X1 (2) | | M1 (1)
1 | 4 | | A1 (2) | M2 (1)
2 | 5 | | A2 (1) |
0 | 6 | X1 (3) | | M1 (2)
1 | 7 | | A1 (3) | M2 (2)
2 | 8 | | A2 (2) | M3 (1)
0 | 9 | X1 (4) | | M1 (3)
1 | 10 | | A1 (4) | M2 (3)
2 | 11 | | A2 (3) | M3 (2)
0 | 12 | X1 (5) | A3 (1) | M1 (4)
1 | 13 | | A1 (5) | M2 (4)
2 | 14 | X2 (1) | A2 (4) | M3 (3)

Times 15 to 32 repeat the three kernel rows, with every iteration number increased by 1 per repetition, ending with:

0 | 30 | X1 (11) | A3 (7) | M1 (10)
1 | 31 | | A1 (11) | M2 (10)
2 | 32 | X2 (7) | A2 (10) | M3 (9)

38


The left half of the figure repeats the prolog and kernel table of the previous slide (times 0 to 32). The right half shows the last kernel iterations followed by the epilog, in which no new iteration starts after X1 (N):

Modulo | Time | MEM | ADDER | MULT
0 | T+0 | X1 (N-2) | A3 (N-6) | M1 (N-3)
1 | T+1 | | A1 (N-2) | M2 (N-3)
2 | T+2 | X2 (N-6) | A2 (N-3) | M3 (N-4)
0 | T+3 | X1 (N-1) | A3 (N-5) | M1 (N-2)
1 | T+4 | | A1 (N-1) | M2 (N-2)
2 | T+5 | X2 (N-5) | A2 (N-2) | M3 (N-3)
0 | T+6 | X1 (N) | A3 (N-4) | M1 (N-1)
1 | T+7 | | A1 (N) | M2 (N-1)
2 | T+8 | X2 (N-4) | A2 (N-1) | M3 (N-2)
0 | T+9 | | A3 (N-3) | M1 (N)
1 | T+10 | | | M2 (N)
2 | T+11 | X2 (N-3) | A2 (N) | M3 (N-1)
0 | T+12 | | A3 (N-2) |
1 | T+13 | | |
2 | T+14 | X2 (N-2) | | M3 (N)
0 | T+15 | | A3 (N-1) |
1 | T+16 | | |
2 | T+17 | X2 (N-1) | |
0 | T+18 | | A3 (N) |
1 | T+19 | | |
2 | T+20 | X2 (N) | |

39

Another Modulo Schedule Example

Figure: dependency graph with inputs A, B, C, D and E, operations A1, A2, A3 (+) and M1, M2 (x), and output Z; priority labels 0, 1, 1, 3, 3.

Given 2 adders (1-cycle) & 1 multiplier (2-cycle)

MII = max(3/2, 2/1) = 2

Modulo Reservation Table

Modulo | ADDER1 | ADDER2 | MULT
0 | A1 (3) | A2 (3) | M2 (2)
1 | A3 (1) | | M1 (3)

The schedule consists of a prolog, 5x the kernel, and an epilog. The multiplier is fully utilized.

40

How to Perform Register Allocation?

• We are overlapping multiple iterations in one schedule
  – Example: iterations 1 to 5 are alive at the same time
• Registers from multiple iterations are alive during the same period of time

MRT

Modulo | MEM | ADDER | MULT
0 | X1 (5) | A3 (1) | M1 (4)
1 | | A1 (5) | M2 (4)
2 | X2 (1) | A2 (4) | M3 (3)

41

Modulo Variable Expansion

• Analyze the lifetime of each architecture register
• Unroll the loop to enable the modulo schedule
• R5 needs to stay alive for 8 cycles: ⌈8/3⌉ = 3 MII (i.e. unroll 3 times)

Register lifetimes (cycles): r1 (1), r2 (4), r3 (2), r4 (3), r5 (8), r6 (4), r7 (2)

The cycle numbers assume that WAR is allowed in the same cycle

Modulo | Time | MEM | ADDER | MULT
0 | 0 | ld r1, (A)++ | |
1 | 1 | | add r2, r1, $c1 |
2 | 2 | | |
0 | 3 | ld r11, (A)++ | | mul r3, r2, $c2
1 | 4 | | add r12, r11, $c1 | mul r5, r2, $c3
2 | 5 | | add r4, r2, r3 |
0 | 6 | X1 (3) | | mul r13, r12, $c2
1 | 7 | | A1 (3) | mul r15, r12, $c3
2 | 8 | | add r14, r12, r13 | mul r6, r4, r5
0 | 9 | X1 (4) | | M1 (3)
1 | 10 | | A1 (4) | M2 (3)
2 | 11 | | A2 (3) | mul r16, r14, r15
0 | 12 | X1 (5) | add r7, r5, r6 | M1 (4)
1 | 13 | | A1 (5) | M2 (4)
2 | 14 | st r7, (B)++ | A2 (4) | M3 (3)

42

Post MVE code

Modulo | Time | MEM | ADDER | MULT
0 | 0 | ld r1, (A)++ | |
1 | 1 | | add r2, r1, $c1 |
2 | 2 | | |
0 | 3 | ld r11, (A)++ | | mul r3, r2, $c2
1 | 4 | | add r12, r11, $c1 | mul r5, r2, $c3
2 | 5 | | add r4, r2, r3 |
0 | 6 | ld r21, (A)++ | | mul r13, r12, $c2
1 | 7 | | add r22, r21, $c1 | mul r15, r12, $c3
2 | 8 | | add r14, r12, r13 | mul r6, r4, r5
0 | 9 | ld r1, (A)++ | | mul r23, r22, $c2
1 | 10 | | add r2, r1, $c1 | mul r25, r22, $c3
2 | 11 | | add r24, r22, r23 | mul r16, r14, r15
0 | 12 | ld r11, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
1 | 13 | | add r12, r11, $c1 | mul r5, r2, $c3
2 | 14 | st r7, (B)++ | add r4, r2, r3 | mul r26, r24, r25
0 | 15 | ld r21, (A)++ | add r17, r15, r16 | mul r13, r12, $c2
1 | 16 | | add r22, r21, $c1 | mul r15, r12, $c3
2 | 17 | st r17, (B)++ | add r14, r12, r13 | mul r6, r4, r5
0 | 18 | ld r1, (A)++ | add r27, r25, r26 | mul r23, r22, $c2
1 | 19 | | add r2, r1, $c1 | mul r25, r22, $c3
2 | 20 | st r27, (B)++ | add r24, r22, r23 | mul r16, r14, r15

Kernel (unrolled 3 times): times 12 to 20. Times 21 to 29 repeat these nine rows unchanged.

43

Register Allocation for MVE

• To save registers, not every register needs to be expanded
• Calculate the lifetime of each register to determine whether a new register is needed across iterations (the formula assumes that WAR within the same instruction bundle is allowed)
• # of copies = (MII % ⌈lifetime/MII⌉ == 0) ? ⌈lifetime/MII⌉ : MII
  – R1 is alive for 1 cycle: ⌈1/3⌉ = 1 MII (need 1 copy)
  – R2 is alive for 4 cycles: ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1)
  – R3 is alive for 2 cycles: ⌈2/3⌉ = 1 MII (need 1 copy)
  – R4 is alive for 3 cycles: ⌈3/3⌉ = 1 MII (need 1 copy)
  – R5 is alive for 8 cycles: ⌈8/3⌉ = 3 MII (need 3 copies)
  – R6 is alive for 4 cycles: ⌈4/3⌉ = 2 MII (need 3 copies since 3%2=1)
  – R7 is alive for 2 cycles: ⌈2/3⌉ = 1 MII (need 1 copy)
• 13 registers are used, instead of 21 with the same unrolling degree
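The per-register copy counts can be reproduced with a few lines of arithmetic. The sketch below follows the slide's formula literally, using MII = 3 both as the interval and as the fallback copy count (the unroll degree happens to equal MII in this example); `copies` and `total_registers` are illustrative names.

```python
from math import ceil


def copies(lifetime: int, mii: int) -> int:
    """Copies needed for one register, per the slide's formula."""
    span = ceil(lifetime / mii)              # lifetime measured in MII units
    return span if mii % span == 0 else mii


def total_registers(lifetimes: dict, mii: int) -> int:
    return sum(copies(life, mii) for life in lifetimes.values())


# Lifetimes in cycles, as listed on the Modulo Variable Expansion slide.
LIFETIMES = {"r1": 1, "r2": 4, "r3": 2, "r4": 3, "r5": 8, "r6": 4, "r7": 2}

per_register = {reg: copies(life, 3) for reg, life in LIFETIMES.items()}
total = total_registers(LIFETIMES, 3)
naive_total = 3 * len(LIFETIMES)             # every register expanded 3 times
```

Only r2, r5 and r6 need three copies each, which accounts for the drop from 21 to 13 registers.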

44

MVE (reallocate registers)

The cycle numbers assume that WAR is allowed in the same cycle

Modulo | Time | MEM | ADDER | MULT
0 | 0 | ld r1, (A)++ | |
1 | 1 | | add r2, r1, $c1 |
2 | 2 | | |
0 | 3 | ld r1, (A)++ | | mul r3, r2, $c2
1 | 4 | | add r12, r1, $c1 | mul r5, r2, $c3
2 | 5 | | add r4, r2, r3 |
0 | 6 | ld r1, (A)++ | | mul r3, r12, $c2
1 | 7 | | add r22, r1, $c1 | mul r15, r12, $c3
2 | 8 | | add r4, r12, r3 | mul r6, r4, r5
0 | 9 | ld r1, (A)++ | | mul r3, r22, $c2
1 | 10 | | add r2, r1, $c1 | mul r25, r22, $c3
2 | 11 | | add r4, r22, r3 | mul r16, r4, r15
0 | 12 | ld r1, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
1 | 13 | | add r12, r1, $c1 | mul r5, r2, $c3
2 | 14 | st r7, (B)++ | add r4, r2, r3 | mul r26, r4, r25
0 | 15 | ld r1, (A)++ | add r7, r15, r16 | mul r3, r12, $c2
1 | 16 | | add r22, r1, $c1 | mul r15, r12, $c3
2 | 17 | st r7, (B)++ | add r4, r12, r3 | mul r6, r4, r5
0 | 18 | ld r1, (A)++ | add r7, r25, r26 | mul r3, r22, $c2
1 | 19 | | add r2, r1, $c1 | mul r25, r22, $c3
2 | 20 | st r7, (B)++ | add r4, r22, r3 | mul r16, r4, r15

Kernel (unrolled 3 times): times 12 to 20. Times 21 to 29 repeat these nine rows unchanged.

45

Final Modulo Schedule

Prolog code (12 instruction bundles)

Kernel (9 instruction bundles; branch instruction not shown):

MEM | ADDER | MULT
ld r11, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
 | add r12, r11, $c1 | mul r5, r2, $c3
st r7, (B)++ | add r4, r2, r3 | mul r26, r24, r25
ld r21, (A)++ | add r17, r15, r16 | mul r13, r12, $c2
 | add r22, r21, $c1 | mul r15, r12, $c3
st r17, (B)++ | add r14, r12, r13 | mul r6, r4, r5
ld r1, (A)++ | add r27, r25, r26 | mul r23, r22, $c2
 | add r2, r1, $c1 | mul r25, r22, $c3
st r27, (B)++ | add r24, r22, r23 | mul r16, r14, r15

Epilog code (12 instruction bundles)

46

Final Modulo Schedule (Reallocate Registers)

Prolog code (12 instruction bundles)

Kernel (9 instruction bundles):

MEM | ADDER | MULT
ld r1, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
 | add r12, r1, $c1 | mul r5, r2, $c3
st r7, (B)++ | add r4, r2, r3 | mul r26, r4, r25
ld r1, (A)++ | add r7, r15, r16 | mul r3, r12, $c2
 | add r22, r1, $c1 | mul r15, r12, $c3
st r7, (B)++ | add r4, r12, r3 | mul r6, r4, r5
ld r1, (A)++ | add r7, r25, r26 | mul r3, r22, $c2
 | add r2, r1, $c1 | mul r25, r22, $c3
st r7, (B)++ | add r4, r22, r3 | mul r16, r4, r15

Epilog code (12 instruction bundles)

47

Issues with Modulo Variable Expansion

• Many architecture registers are needed
• Code size grows as more unrolling is needed
• Alternative solution: rotating register file
  – A hardware technique
  – Solves the problem without code duplication
  – Similar to register window plus renaming: keeps old iteration values on the stack (Itanium calls the hardware the Register Stack Engine, or RSE)

48

Intention of Using Rotation Registers

• Use exactly the same schedule (below) for all of
  – Kernel code
  – Prolog code
  – Epilog code
• The “registers” need to be re-allocated
• Registers “rotate” per iteration!

MEM | ADDER | MULT
ld r1, (A)++ | add r7, r5, r6 | mul r3, r2, $c2
 | add r2, r1, $c1 | mul r5, r2, $c3
st r7, (B)++ | add r4, r2, r3 | mul r6, r4, r5

49

Idea of Rotation Register (Original Schedule)

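Ite | Time | Mem | Adder | Multiplier
0 | 0 | ld r41, (A)++ | |
 | 1 | | add r42, r41, $c1 |
 | 2 | | |
1 | 3 | | | mul r43, r42, $c2
 | 4 | | | mul r45, r42, $c3
 | 5 | | add r44, r42, r43 |
2 | 6 | | |
 | 7 | | |
 | 8 | | | mul r46, r44, r45
3 | 9 | | |
 | 10 | | |
 | 11 | | |
4 | 12 | | add r47, r45, r46 |
 | 13 | | |
 | 14 | st r47, (B)++ | |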

In Intel Itanium, integer registers 32 – 127 are rotating registers

50

Original Code Schedule


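Ite | Time | Mem | Adder | Multiplier
0 | 0 | ld r41, (A)++ | |
 | 1 | | add r42, r41, $c1 |
 | 2 | | |
1 | 3 | | | mul r43, r42, $c2
 | 4 | | | mul r45, r42, $c3
 | 5 | | add r44, r42, r43 |
2 | 6 | | |
 | 7 | | |
 | 8 | | | mul r46, r44, r45
3 | 9 | | |
 | 10 | | |
 | 11 | | |
4 | 12 | | add r47, r45, r46 |
 | 13 | | |
 | 14 | st r47, (B)++ | |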


51

Assume HW Rotation Registers


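Ite | Time | Mem | Adder | Multiplier
0 | 0 | ld r41, (A)++ | |
 | 1 | | add r42, r41, $c1 |
 | 2 | | |
1 | 3 | | | mul r44, r43, $c2
 | 4 | | | mul r45, r43, $c3
 | 5 | | add r52, r43, r44 |
2 | 6 | | |
 | 7 | | |
 | 8 | | | mul r48, r53, r46
3 | 9 | | |
 | 10 | | |
 | 11 | | |
4 | 12 | | add r51, r48, r50 |
 | 13 | | |
 | 14 | st r51, (B)++ | |

Assuming that registers are rotated automatically once per iteration: a value written to register rN in one iteration is read as r(N+1) in the next iteration.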


52

Rotation Registers in Itanium Processors

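Figure: three register files, each split into a static part and a stacked (rotating) part.

• General Purpose Registers: 0 to 31 static, 32 to 127 stacked (rotating); bits 63..0
• FP Registers: 0 to 31 static, 32 to 127 stacked (rotating); bits 81..0
• Predicate Registers: 0 to 15 static, 16 to 63 stacked (rotating); one bit each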

53

Register Rotation (Prolog i0)


Ite | Time | Mem | Adder | Multiplier
0 | 0 | ld r41, (A)++ | |
 | 1 | | add r42, r41, $c1 |
 | 2 | | |




57

Register Rotation (Kernel Steady State i4)


Ite | Time | Mem | Adder | Multiplier
0 | 0 | ld r41, (A)++ | |
 | 1 | | add r42, r41, $c1 |
 | 2 | | |
1 | 3 | ld r41, (A)++ | | mul r44, r43, $c2
 | 4 | | add r42, r41, $c1 | mul r45, r43, $c3
 | 5 | | add r52, r43, r44 |
2 | 6 | ld r41, (A)++ | | mul r44, r43, $c2
 | 7 | | add r42, r41, $c1 | mul r45, r43, $c3
 | 8 | | add r52, r43, r44 | mul r48, r53, r46
3 | 9 | ld r41, (A)++ | | mul r44, r43, $c2
 | 10 | | add r42, r41, $c1 | mul r45, r43, $c3
 | 11 | | add r52, r43, r44 | mul r48, r53, r46
4 | 12 | ld r41, (A)++ | add r51, r48, r50 | mul r44, r43, $c2
 | 13 | | add r42, r41, $c1 | mul r45, r43, $c3
 | 14 | st r51, (B)++ | add r52, r43, r44 | mul r48, r53, r46

Register numbers wrap around when they exceed the specified bound

58

Register Rotation (Kernel)

• Execute many iterations in the kernel …

59

Register Rotation (Kernel to Epilog, i<-4>)


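Ite | Time | Mem | Adder | Multiplier
-4 | N-14 | ld r41, (A)++ | add r51, r48, r50 | mul r44, r43, $c2
 | N-13 | | add r42, r41, $c1 | mul r45, r43, $c3
 | N-12 | st r51, (B)++ | add r52, r43, r44 | mul r48, r53, r46
-3 | N-11 | | |
 | N-10 | | |
 | N-9 | | |
-2 | N-8 | | |
 | N-7 | | |
 | N-6 | | |
-1 | N-5 | | |
 | N-4 | | |
 | N-3 | | |
0 | N-2 | | |
 | N-1 | | |
 | N | | |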




63

Register Rotation (Kernel to Epilog, final ite)



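Rows that survive in the figure:

Ite | Time | Mem | Adder | Multiplier
 | N-13 | | add r42, r41, $c1 | mul r45, r43, $c3
-3 | N-11 | | add r51, r48, r50 | mul r44, r43, $c2
 | N-10 | | | mul r45, r43, $c3
-2 | N-8 | | add r51, r48, r50 |
 | N-7 | | |
 | N-6 | st r51, (B)++ | | mul r48, r53, r46
-1 | N-5 | | add r51, r48, r50 |
 | N-4 | | |
 | N-3 | st r51, (B)++ | |
0 | N-2 | | add r51, r48, r50 |
 | N-1 | | |
 | N | st r51, (B)++ | |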



64

Modulo Schedule with Rotating Register Support

• No loop unrolling required (but requires careful register allocation)
• Tighter code, saving space
• However, there is still prolog and epilog code
• Can we use the same schedule for the prolog/epilog?
  – Use stage predicates to execute instructions conditionally
  – Requires new ISA support (Itanium)

Mem | Adder | Multiplier
ld r41, (A)++ | add r51, r48, r50 | mul r44, r43, $c2
 | add r42, r41, $c1 | mul r45, r43, $c3
st r51, (B)++ | add r52, r43, r44 | mul r48, r53, r46

65

Predicated Instruction Execution (Prolog i0)

Ite | Time | Mem | Adder | Multiplier
0 | 0 | (p16) ld r41, (A)++ | add r51, r48, r50 | mul r44, r43, $c2
 | 1 | | (p16) add r42, r41, $c1 | mul r45, r43, $c3

The rows for iterations 1 to 4 (times 3 to 14) are still empty.

Don’t execute shaded instructions (shown here without a stage predicate)

• cc0: only issue ld
• cc1: only issue add
• cc2: no issue

66

Predicated Prolog (Prolog i1)

Ite | Time | Mem | Adder | Multiplier
1 | 3 | (p16) ld r41, (A)++ | add r51, r48, r50 | (p17) mul r44, r43, $c2
 | 4 | | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
 | 5 | st r51, (B)++ | (p17) add r52, r43, r44 | mul r48, r53, r46

The rows for iterations 2 to 4 (times 6 to 14) are still empty.

• cc3: ld(i1) & mul(i0)
• cc4: add(i1) & mul(i0)
• cc5: add(i0)

Note that stage predicates also “rotate” per iteration

69

Predicated Kernel (i4)

Ite | Time | Mem | Adder | Multiplier
4 | 12 | (p16) ld r41, (A)++ | (p20) add r51, r48, r50 | (p17) mul r44, r43, $c2
 | 13 | | (p16) add r42, r41, $c1 | (p17) mul r45, r43, $c3
 | 14 | (p20) st r51, (B)++ | (p17) add r52, r43, r44 | (p18) mul r48, r53, r46

• cc12: ld(i4) & add(i0) & mul(i3)
• cc13: add(i4) & mul(i3)
• cc14: st(i0) & add(i3) & mul(i2)

(p20) is used in iteration 4, not (p19), because of predicate rotation

70

Register Rotation (Kernel)

• Execute many iterations in the kernel …

71

Predicated Epilog (i<-4>)

Iter  Time   Mem                   Adder                     Multiplier
-4    N-14   (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
      N-13                         (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
      N-12   (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46
-3    N-11 to N-9
-2    N-8 to N-6
-1    N-5 to N-3
0     N-2 to N

72









73












74















75

Predicated Epilog (final iteration)

Iter  Time   Mem                   Adder                     Multiplier
0     N-2    (p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
      N      (p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

76

Final Modulo Schedule (Itanium-like)

• Before entering the loop, set p16 = 1 (p16 is the first rotating predicate register)
• When the modulo-scheduled loop branch (e.g. br.ctop) is encountered
  – p63 is set to 1 by hardware in the prolog code (see next slide)
  – All registers (rotating registers and rotating predicate registers) rotate as each stage (iteration) advances
• Only 3 Itanium instruction bundles (= 3 VLIWs) needed
  – No prolog or epilog code
  – No modulo variable expansion, which stresses registers and blows up code size

        mov ar.lc = 196         // loop count
        mov ar.ec = 5           // epilog stages + 1
        mov pr.rot = 0x10000    // special instruction: sets pr[16]=1 and pr[63:17]=0
L1top:
        (p16) r41 = (A)++       (p20) r51 = r48 + r50   (p17) r44 = r43 * $c2
        (p16) r42 = r41 + $c1   (p17) r45 = r43 * $c3
        (p20) (B)++ = r51       (p17) r52 = r43 + r44   (p18) r48 = r53 * r46
        br.ctop L1top
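The interplay of the loop count, the epilog count, and the stage predicates can be traced with a small model. This is a simplified sketch of a counted loop branch in the style of br.ctop, not a cycle-accurate description of the hardware: the function name `run_counted_loop` is made up for the example, and the predicates are tracked by logical name (p16, p17, ...) only. A 3-iteration, 3-stage loop is used so that the whole trace fits on a few lines.

```python
def run_counted_loop(lc, ec, stages):
    """Return, for each kernel pass, the stage predicates [p16, p17, ...]."""
    preds = [1] + [0] * (stages - 1)      # p16 = 1 before entering the loop
    trace = []
    while True:
        trace.append(list(preds))         # the kernel executes under these predicates
        # Simplified model of the loop-closing branch:
        if lc > 0:
            lc -= 1
            incoming = 1                  # a new source iteration starts next pass
        elif ec > 1:
            ec -= 1
            incoming = 0                  # draining: no new iteration is started
        else:
            break                         # branch falls through, loop exits
        preds = [incoming] + preds[:-1]   # rotation: p16 -> p17 -> p18 ...
    return trace


for pass_number, preds in enumerate(run_counted_loop(lc=2, ec=3, stages=3), start=1):
    print(pass_number, preds)
```

In this model the kernel runs `lc + ec` times: the first passes fill the pipeline (prolog), the middle pass has every stage enabled (kernel), and the last passes drain it (epilog), all with the same three bundles.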

Counted Modulo-scheduled Loop

Mem                   Adder                     Multiplier
(p16) ld r41, (A)++   (p20) add r51, r48, r50   (p17) mul r44, r43, $c2
                      (p16) add r42, r41, $c1   (p17) mul r45, r43, $c3
(p20) st r51, (B)++   (p17) add r52, r43, r44   (p18) mul r48, r53, r46

Rotating Predicate Registers

Point in execution                                         p20 p19 p18 p17 p16 p63 p62 p61 p60 p59
Stage 0 (Prolog): after the first iteration, LC=195, EC=5  0   0   0   0   1   1   0
Before the 2nd iteration                                   0   0   0   0   1   1   0
Before the 3rd iteration                                   0   0   0   0   1   1   1   0
Before the 4th iteration                                   0   0   0   0   1   1   1   1   0
Stage 4 (Kernel): before the 5th iteration                 0   0   0   0   1   1   1   1   1   0

In the Kernel

• After another 191 iterations …

Before the 196th iteration (LC=0, EC=5): p20 through p16 and p63 through p55 are all 1.


Point in execution                               EC  p20-p16  p63 p62 p61 p60 p59 p58 p57 p56 p55
After the 196th iteration                        4   1        1   1   1   0   1   1   1   1   1
Stage 196 (Epilog): before the 197th iteration   4   1        1   1   1   0   1   1   1   1   1
Before the 198th iteration                       3   1        1   1   1   0   0   1   1   1   1
Before the 199th iteration                       2   1        1   1   1   0   0   0   1   1   1
Before the 200th iteration (last iteration)      1   1        1   1   1   0   0   0   0   1   1
After the 200th iteration (last iteration)       0   1        1   1   1   0   0   0   0   0   1

• The “br.ctop” instruction exits the loop

90

Modulo Scheduling Example

Loop {
    P = A + B;
    Q = C + D;
    X = P * E;
    Y = P * Q;
    Z = X + Y;
}

Step 1: Data flow graph

[Figure: data flow graph of the loop body. Nodes: A1: P = A + B; A2: Q = C + D; M1: X = P * E; M2: Y = P * Q; A3: Z = X + Y]

91

Modulo Scheduling

Step 2: Generate a list schedule

[Figure: data flow graph (nodes A1, A2, M1, M2, A3) annotated with start times 0, 1, and 3]

Execution units:
2 adders, 1-cycle latency
1 multiplier, 2-cycle latency

92

Modulo Scheduling

Step 2: Generate a list schedule

[Figure: data flow graph (nodes A1, A2, M1, M2, A3) annotated with start times 0, 1, and 3]

Reservation Table

Time  Adder1  Adder2  Mult
0     A1      A2
1                     M1
2                     M2
3
4     A3
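The reservation table can be reproduced with a small cycle-by-cycle list scheduler. This is a minimal sketch under stated assumptions: the function `list_schedule` and its data layout are invented for the example, operations are considered in the order they are listed (a simple priority), and the multiplier is assumed to be fully pipelined so that M2 can start one cycle after M1.

```python
def list_schedule(ops, units):
    """ops: name -> (unit kind, latency, predecessor names), in priority order.
    units: unit kind -> number of units (assumed fully pipelined)."""
    start = {}
    cycle = 0
    while len(start) < len(ops):
        free = dict(units)                     # units still available this cycle
        for name, (kind, _latency, preds) in ops.items():
            if name in start:
                continue
            # Ready when every predecessor has started and its result is available.
            ready = all(p in start and start[p] + ops[p][1] <= cycle for p in preds)
            if ready and free[kind] > 0:
                start[name] = cycle
                free[kind] -= 1
        cycle += 1
    return start


loop_body = {
    "A1": ("add", 1, []),            # P = A + B
    "A2": ("add", 1, []),            # Q = C + D
    "M1": ("mul", 2, ["A1"]),        # X = P * E
    "M2": ("mul", 2, ["A1", "A2"]),  # Y = P * Q
    "A3": ("add", 1, ["M1", "M2"]),  # Z = X + Y
}
schedule = list_schedule(loop_body, {"add": 2, "mul": 1})
print(schedule)
```

M2 is ready at cycle 1 but waits for the single multiplier, which in turn pushes A3 from cycle 3 to cycle 4.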

93

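Modulo Scheduling

Generating Modulo Schedule:

1. Determine the MII:

$$\mathrm{MII} = \max\left(\frac{N}{C}\right), \qquad N = \text{Resource\_demand}, \quad C = \text{Resource\_availability}$$

The maximum is taken over the resource types. Here the loop has 3 adds for 2 adders and 2 multiplies for 1 multiplier:

MII = max[(3/2), (2/1)] = 2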
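The same calculation as a short sketch. The helper name `resource_mii` is made up for the example, and rounding each ratio up is an added assumption, made because an initiation interval is a whole number of cycles. Only the resource bound from this slide is modeled.

```python
from math import ceil


def resource_mii(demand, available):
    """Resource-constrained lower bound on the initiation interval."""
    # For each resource type: operations per iteration / units of that type.
    return max(ceil(demand[r] / available[r]) for r in demand)


mii = resource_mii(
    demand={"adder": 3, "multiplier": 2},      # A1, A2, A3 and M1, M2
    available={"adder": 2, "multiplier": 1},   # 2 adders, 1 multiplier
)
print(mii)
```

The multiplier is the bottleneck here: two multiplies per iteration on one unit mean a new iteration can start at most every 2 cycles.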

94

Modulo Scheduling

Mapping from list schedule to modulo schedule

List schedule

Time  Adder1  Adder2  Mult
0     A1      A2
1                     M1
2                     M2
3
4     A3

Modulo schedule for 1 iteration

Time  Modulo 2  Adder1  Adder2  Mult
0     0         A1      A2
1     1                         M1
2     0                         M2
3     1
4     0
5     1         A3
6     0

A3 cannot stay at time 4: that is modulo slot 0, where both adders are already reserved by A1 and A2, so A3 moves to time 5 (slot 1).
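The mapping step can be sketched as a walk over the list schedule with a modulo reservation table. The function `modulo_place` is hypothetical and deliberately simple: it only slides an operation later until its slot (time mod II) has a free unit, and it does not re-check dependences of later operations. That is enough for this loop, because the only operation that moves, A3, has no successors.

```python
def modulo_place(list_times, kinds, units, ii):
    """Place one iteration's operations using a modulo reservation table.

    Assumes ii is large enough for the resource demand; otherwise the
    search for a free slot would never terminate.
    """
    used = {}      # (unit kind, modulo slot) -> units already reserved
    placed = {}
    for op, t in sorted(list_times.items(), key=lambda item: item[1]):
        kind = kinds[op]
        # Slide the operation later until its modulo slot has a free unit.
        while used.get((kind, t % ii), 0) >= units[kind]:
            t += 1
        used[(kind, t % ii)] = used.get((kind, t % ii), 0) + 1
        placed[op] = t
    return placed


list_times = {"A1": 0, "A2": 0, "M1": 1, "M2": 2, "A3": 4}
kinds = {"A1": "add", "A2": "add", "A3": "add", "M1": "mul", "M2": "mul"}
placed = modulo_place(list_times, kinds, {"add": 2, "mul": 1}, ii=2)
print(placed)
```

Every later iteration reuses the same placement shifted by a multiple of II, which is why a conflict within one slot would repeat forever if it were not resolved here.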

95

Modulo Scheduling

Time  Modulo 2  Adder1  Adder2  Mult
0     0         1:A1    1:A2
1     1                         1:M1
2     0         2:A1    2:A2    1:M2
3     1                         2:M1
4     0                         2:M2
5     1         1:A3
6     0
7     1         2:A3
8     0

Inserting iteration 2

96

Modulo Scheduling

Time  Modulo 2  Adder1  Adder2  Mult
0     0         1:A1    1:A2
1     1                         1:M1
2     0         2:A1    2:A2    1:M2
3     1                         2:M1
4     0         3:A1    3:A2    2:M2
5     1         1:A3            3:M1
6     0                         3:M2
7     1         2:A3
8     0
9     1         3:A3

Inserting iteration 3

97


[Figure: the overlapped schedule divided into prolog, kernel (labeled "5x kernel"), and epilog]

MRT

Modulo 2  Adder 1  Adder 2  Mult
0         3:A1     3:A2     2:M2
1         1:A3              3:M1
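The MRT rows follow mechanically from the one-iteration modulo schedule: an operation placed at time t lands in row t mod II and belongs to the iteration that started t div II kernel passes earlier. The sketch below illustrates this; the function name `kernel_rows` is invented for the example, and the placement is the one-iteration schedule with A3 at time 5.

```python
def kernel_rows(placed, ii, newest_iteration):
    """Group operations into kernel rows, labeled 'iteration:operation'."""
    rows = {slot: [] for slot in range(ii)}
    for op, t in placed.items():
        stage = t // ii     # how many iterations behind the newest one
        rows[t % ii].append(f"{newest_iteration - stage}:{op}")
    return rows


placed = {"A1": 0, "A2": 0, "M1": 1, "M2": 2, "A3": 5}
mrt = kernel_rows(placed, ii=2, newest_iteration=3)
for slot, ops in mrt.items():
    print(slot, ops)
```

With iteration 3 as the newest one in flight, row 0 holds 3:A1, 3:A2, and 2:M2, and row 1 holds 3:M1 and 1:A3, so the steady-state kernel works on three iterations at once.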
