Topic 2b Basic Back-End Optimization - capsl.udel.edu 2b Basic Back-End Optimization ... to solve...

56
2012/3/12 \course\cpeg421-10F\Topic2a.ppt 1 Topic 2b Basic Back-End Optimization Instruction Selection Instruction scheduling Register allocation

Transcript of Topic 2b Basic Back-End Optimization - capsl.udel.edu 2b Basic Back-End Optimization ... to solve...

2012312 coursecpeg421-10FTopic2appt 1

Topic 2b

Basic Back-End Optimization

Instruction Selection

Instruction scheduling

Register allocation

2012312 coursecpeg421-10FTopic2appt 2

ABET Outcome

Ability to apply knowledge of basic code generation techniques

eg Instruction scheduling register allocation to solve code generation problems

An ability to identify formulate and solve loops scheduling problems using software pipelining techniques

Ability to analyze the basic algorithms on the above techniques and conduct experiments to show their effectiveness

Ability to use a modern compiler development platform and tools for the practice of above

A Knowledge on contemporary issues on this topic

2012312 coursecpeg421-10FTopic2appt 3

Reading List Reading List

(1) K D Cooper amp L Torczon Engineering a Compiler Chapter 12

(2) Dragon Book Chapter 101 ~ 104

2012312 coursecpeg421-10FTopic2appt 4

A Short Tour on

Data Dependence

A Short Tour on

Data Dependence

2012312 coursecpeg421-10FTopic2appt 5

Basic Concept and Motivation Basic Concept and Motivation

Data dependence between 2 accesses

The same memory location

Exist an execution path between them

At least one of them is a write

Three types of data dependencies

Dependence graphs

Things are not simple when dealing with loops

2012312 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if

Both statements access the same memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

2012312 coursecpeg421-10FTopic2appt 7

Types of Data Dependencies

Flow (true) Dependencies - writeread (d) x = 4

hellip

y = x + 1

Output Dependencies - writewrite (do) x = 4

hellip

x = y + 1

Anti-dependencies - readwrite (d-1) y = x + 1

hellip

x = 4

d

-1

d--

0 d

2012312 coursecpeg421-10FTopic2appt 8

(1) x = 4

(2) y = 6

(3) p = x + 2

(4) z = y + p

(5) x = z

(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = z

Flow

An Example of Data Dependencies

Output

Anti

2012312 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements

nodes = statements

edges = dependence relation (type label)

2012312 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 2

ABET Outcome

Ability to apply knowledge of basic code generation techniques

eg Instruction scheduling register allocation to solve code generation problems

An ability to identify formulate and solve loops scheduling problems using software pipelining techniques

Ability to analyze the basic algorithms on the above techniques and conduct experiments to show their effectiveness

Ability to use a modern compiler development platform and tools for the practice of above

A Knowledge on contemporary issues on this topic

2012312 coursecpeg421-10FTopic2appt 3

Reading List Reading List

(1) K D Cooper amp L Torczon Engineering a Compiler Chapter 12

(2) Dragon Book Chapter 101 ~ 104

2012312 coursecpeg421-10FTopic2appt 4

A Short Tour on

Data Dependence

A Short Tour on

Data Dependence

2012312 coursecpeg421-10FTopic2appt 5

Basic Concept and Motivation Basic Concept and Motivation

Data dependence between 2 accesses

The same memory location

Exist an execution path between them

At least one of them is a write

Three types of data dependencies

Dependence graphs

Things are not simple when dealing with loops

2012312 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if

Both statements access the same memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

2012312 coursecpeg421-10FTopic2appt 7

Types of Data Dependencies

Flow (true) Dependencies - writeread (d) x = 4

hellip

y = x + 1

Output Dependencies - writewrite (do) x = 4

hellip

x = y + 1

Anti-dependencies - readwrite (d-1) y = x + 1

hellip

x = 4

d

-1

d--

0 d

2012312 coursecpeg421-10FTopic2appt 8

(1) x = 4

(2) y = 6

(3) p = x + 2

(4) z = y + p

(5) x = z

(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = z

Flow

An Example of Data Dependencies

Output

Anti

2012312 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements

nodes = statements

edges = dependence relation (type label)

2012312 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 3

Reading List Reading List

(1) K D Cooper amp L Torczon Engineering a Compiler Chapter 12

(2) Dragon Book Chapter 101 ~ 104

2012312 coursecpeg421-10FTopic2appt 4

A Short Tour on

Data Dependence

A Short Tour on

Data Dependence

2012312 coursecpeg421-10FTopic2appt 5

Basic Concept and Motivation Basic Concept and Motivation

Data dependence between 2 accesses

The same memory location

Exist an execution path between them

At least one of them is a write

Three types of data dependencies

Dependence graphs

Things are not simple when dealing with loops

2012312 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if

Both statements access the same memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

2012312 coursecpeg421-10FTopic2appt 7

Types of Data Dependencies

Flow (true) Dependencies - writeread (d) x = 4

hellip

y = x + 1

Output Dependencies - writewrite (do) x = 4

hellip

x = y + 1

Anti-dependencies - readwrite (d-1) y = x + 1

hellip

x = 4

d

-1

d--

0 d

2012312 coursecpeg421-10FTopic2appt 8

(1) x = 4

(2) y = 6

(3) p = x + 2

(4) z = y + p

(5) x = z

(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = z

Flow

An Example of Data Dependencies

Output

Anti

2012312 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements

nodes = statements

edges = dependence relation (type label)

2012312 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 4

A Short Tour on

Data Dependence

A Short Tour on

Data Dependence

2012312 coursecpeg421-10FTopic2appt 5

Basic Concept and Motivation Basic Concept and Motivation

Data dependence between 2 accesses

The same memory location

Exist an execution path between them

At least one of them is a write

Three types of data dependencies

Dependence graphs

Things are not simple when dealing with loops

2012312 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if

Both statements access the same memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

2012312 coursecpeg421-10FTopic2appt 7

Types of Data Dependencies

Flow (true) Dependencies - writeread (d) x = 4

hellip

y = x + 1

Output Dependencies - writewrite (do) x = 4

hellip

x = y + 1

Anti-dependencies - readwrite (d-1) y = x + 1

hellip

x = 4

d

-1

d--

0 d

2012312 coursecpeg421-10FTopic2appt 8

(1) x = 4

(2) y = 6

(3) p = x + 2

(4) z = y + p

(5) x = z

(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = z

Flow

An Example of Data Dependencies

Output

Anti

2012312 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements

nodes = statements

edges = dependence relation (type label)

2012312 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 5

Basic Concept and Motivation Basic Concept and Motivation

Data dependence between 2 accesses

The same memory location

Exist an execution path between them

At least one of them is a write

Three types of data dependencies

Dependence graphs

Things are not simple when dealing with loops

2012312 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if

Both statements access the same memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

2012312 coursecpeg421-10FTopic2appt 7

Types of Data Dependencies

Flow (true) Dependencies - writeread (d) x = 4

hellip

y = x + 1

Output Dependencies - writewrite (do) x = 4

hellip

x = y + 1

Anti-dependencies - readwrite (d-1) y = x + 1

hellip

x = 4

d

-1

d--

0 d

2012312 coursecpeg421-10FTopic2appt 8

(1) x = 4

(2) y = 6

(3) p = x + 2

(4) z = y + p

(5) x = z

(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = z

Flow

An Example of Data Dependencies

Output

Anti

2012312 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements

nodes = statements

edges = dependence relation (type label)

2012312 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 6

Data Dependencies

There is a data dependence between statements Si and Sj if and only if

Both statements access the same memory location and at least one of the statements writes into it and

There is a feasible run-time execution path from Si to Sj

2012312 coursecpeg421-10FTopic2appt 7

Types of Data Dependencies

Flow (true) Dependencies - writeread (d) x = 4

hellip

y = x + 1

Output Dependencies - writewrite (do) x = 4

hellip

x = y + 1

Anti-dependencies - readwrite (d-1) y = x + 1

hellip

x = 4

d

-1

d--

0 d

2012312 coursecpeg421-10FTopic2appt 8

(1) x = 4

(2) y = 6

(3) p = x + 2

(4) z = y + p

(5) x = z

(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = z

Flow

An Example of Data Dependencies

Output

Anti

2012312 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements

nodes = statements

edges = dependence relation (type label)

2012312 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 7

Types of Data Dependencies

Flow (true) Dependencies - writeread (d) x = 4

hellip

y = x + 1

Output Dependencies - writewrite (do) x = 4

hellip

x = y + 1

Anti-dependencies - readwrite (d-1) y = x + 1

hellip

x = 4

d

-1

d--

0 d

2012312 coursecpeg421-10FTopic2appt 8

(1) x = 4

(2) y = 6

(3) p = x + 2

(4) z = y + p

(5) x = z

(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = z

Flow

An Example of Data Dependencies

Output

Anti

2012312 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements

nodes = statements

edges = dependence relation (type label)

2012312 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 8

(1) x = 4

(2) y = 6

(3) p = x + 2

(4) z = y + p

(5) x = z

(6) y = p

x = 4 y = 6

p = x + 2

z = y + p

y = p x = z

Flow

An Example of Data Dependencies

Output

Anti

2012312 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements

nodes = statements

edges = dependence relation (type label)

2012312 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 9

Data Dependence Graph (DDG)

Forms a data dependence graph between statements

nodes = statements

edges = dependence relation (type label)

2012312 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 10

Reordering Transformations using DDG

Given a correct data dependence graph any order-based optimization that does not change the dependences of a program is guaranteed not to change the results of the program

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 11

Reordering Transformations

A reordering transformation is any program transformation that merely changes the order of execution of the code without adding or deleting any executions of any statements

A reordering transformation preserves a dependence if it preserves the relative execution order of the source and sink of that dependence

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 12

Instruction Scheduling

Loop restructuring

Exploiting Parallelism Analyze array references to determine

whether two iterations access the same memory location Iterations I1 and I2 can be safely executed in parallel if there is no data dependency between them

hellip

Reordering Transformations (Conrsquot)

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 13

Data Dependence Graph Data Dependence Graph

Example 1

S1 A = 0

S2 B = A

S3 C = A + D

S4 D = 2

Sx d Sy flow dependence

S1

S2

S3

S4

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 14

Example 2

S1 A = 0

S2 B = A

S3 A = B + 1

S4 C = A

S1

S2

S3

S4

Data Dependence Graph Data Dependence Graph

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 15

Should we consider input dependence

Should we consider input dependence

= X

= X

Is the reading of the

same X important

Well it may be

(if we intend to group the 2 reads

together for cache optimization)

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 16

- register allocation - instruction scheduling - loop scheduling - vectorization - parallelization - memory hierarchy optimization

Applications of Data

Dependence Graph

Applications of Data

Dependence Graph

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 17

Data Dependence in Loops

Problem How to extend the concept to loops

(s1) do i = 15

(s2) x = a + 1 s2 d-1 s3 s2 d s3

(s3) a = x - 2

(s4) end do s3 d s2 (next iteration)

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 18

Instruction Scheduling

Modern processors can overlap the execution of

multiple independent instructions through

pipelining and multiple functional units

Instruction scheduling can improve the

performance of a program by placing

independent target instructions in parallel or

adjacent positions

Motivation

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 19

Instruction scheduling (conrsquot)

bull Assume all instructions are essential ie we

have finished optimizing the IR

bull Instruction scheduling attempts to reorder the

codes for maximum instruction-level

parallelism (ILP)

bull It is one of the instruction-level optimizations

bull Instruction scheduling (IS) in general is NP-

complete so heuristics must be used

Instruction

Schedular

Original Code Reordered Code

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 20

Instruction scheduling

A Simple Example

a = 1 + x

b = 2 + y

c = 3 + z

Since all three instructions are independent we can

execute them in parallel assuming adequate hardware

processing resources

time

a = 1 + x b = 2 + y c = 3 + z

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 21

Hardware Parallelism

Three forms of parallelism are found in

modern hardware

bullpipelining

bullsuperscalar processing

bullVLIW

bullmultiprocessing

Of these the first three forms are

commonly exploited by todayrsquos compilersrsquo

instruction scheduling phase

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 22

Pipelining amp Superscalar

Processing

Pipelining Decompose an instructionrsquos execution into a sequence of

stages so that multiple instruction executions can be

overlapped It has the same principle as the assembly

line

Superscalar Processing Multiple instructions proceed simultaneously assisted by

hardware dynamic scheduling mechanism This is

accomplished by adding more hardware for parallel

execution of stages and for dispatching instructions to

them

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 23

A Classic Five-Stage Pipeline

IF RF EX ME WB

time

- instruction fetch

- decode and register fetch

- execute on ALU

- memory access

- write back to register file

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 24

Pipeline Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

time

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

The standard non-pipelined model

time

In a given cycle each

instruction is in a different

stage but every stage is active

The pipeline is ldquofullrdquo here

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 25

Parallelism in a pipeline

Example i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 0(r1)

i4 add r5 r3 r4

Consider two possible instruction schedules (permutations) `

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

1 Idle Cycle

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 26

Superscalar Illustration

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

Multiple instructions in

the same pipeline stage at

the same time IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

IF RF EX ME WB

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 27

Parallelism Constraints

Data-dependence constraints

If instruction A computes a value that is

read by instruction B then B canrsquot

execute before A is completed

Resource hazards

Finiteness of hardware function units

means limited parallelism

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 28

Scheduling Complications

1048707 Hardware Resources

bull finite set of FUs with instruction type and width

and latency constraints

bull cache hierarchy also has many constraints

1048707 Data Dependences

bull canrsquot consume a result before it is produced

bull ambiguous dependences create many challenges

1048707 Control Dependences

bull impractical to schedule for all possible paths

bull choosing an ldquoexpectedrdquo path may be difficult

bull recovery costs can be non-trivial if you are wrong

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 29

Legality Constraint for

Instruction Scheduling

Question when must we preserve the

order of two instructions i

and j

Answer when there is a dependence from i

to j

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 30

Construct DDG with Weights

Construct a DDG by assigning weights to

nodes and edges in the DDG to model the

pipeline as follows

bull Each DDG node is labeled a resource-reservation table whose

value is the resource-reservation table associated with the

operation type of this node

bull Each edge e from node j to node k is labeled with a weight

(latency or delay) de indicting that the destination node j must be

issued no earlier than de cycles after the source node k is

issued

Dragon book 722

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 31

Example of a Weighted

Data Dependence Graph

i1 add r1 r1 r2

i2 add r3 r3 r1

i3 lw r4 (r1)

i4 add r5 r3 r4

i2

i3

i1 i4

1 1

3 1

ALU

ALU

ALU

Mem

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 32

Legal Schedules for Pipeline

Consider a basic block with m instructions

i1 hellip im

A legal sequence S for the basic block on a

pipeline consists of

A permutation f on 1hellipm such that f(j)

identifies the new position of instruction j in the

basic block For each DDG edge form j to k the

schedule must satisfy f(j) lt= f(k)

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 33

Legal Schedules for Pipeline (Conrsquot)

Instruction start-time

An instruction start-time satisfies the following conditions

bull Start-time (j) gt= 0 for each instruction j

bull No two instructions have the same start-time value

bull For each DDG edge from j to k

start-time(k) gt= completion-time (j)

where

completion-time (j) = start-time (j)

+ (weight between j and k)

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 34

We also define

make_span(S) = completion time of schedule S

= MAX ( completion-time (j))

Legal Schedules for Pipeline (Conrsquot)

1 le j le m

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 35

Example of Legal

Schedules

Start-time 0 1 2 4

Start-time 0 1 2 5

i1 i2 i3 i4

2 Idle Cycle

Schedule S1 (completion time = 6 cycles)

i1 i3 i2 i4 Schedule S2 (completion time = 5 cycles)

1 Idle Cycle

Assume

Register instruction 1 cycle

Memory instruction 3 cycle

i1 add r1 r1 r2 i2 add r3 r3 r1 i3 lw r4 (r1) i4 add r5 r3 r4

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 36

Instruction Scheduling

(Simplified)

Given an acyclic weighted data dependence graph G with bull Directed edges precedence

bull Undirected edges resource constraints

Determine a schedule S

such that the length of the schedule is minimized

1

2 3

4 5

6

d12

d23

d13

d35 d34

d45

d56 d46

d26

d24

Problem Statement

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 37

Simplify Resource

Constraints

Assume a machine M with n functional units or a

ldquocleanrdquo pipeline with k stages

What is the complexity of a optimal scheduling

algorithm under such constraints

Scheduling of M is still hard

bull n = 2 exists a polynomial time algorithm [CoffmanGraham]

bull n = 3 remain open conjecture NP- hard

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 38

General Approaches of

Instruction Scheduling

List Scheduling

Trace scheduling

Software pipelining

helliphellip

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 39

Trace Scheduling

A technique for scheduling instructions across basic blocks

The Basic Idea of trace scheduling

1048707 Uses information about actual program behaviors to select regions for scheduling

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 40

Software Pipelining

A technique for scheduling instructions across loop iterations

The Basic Idea of software pipelining

1048707 Rewrite the loop as a repeating pattern

that overlaps instructions from different

iterations

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 41

List Scheduling

A most common technique for scheduling instructions within a basic block The Basic Idea of list scheduling

1048707 All instructions are sorted according to some priority function

Also Maintain a list of instructions that are ready to execute

(ie data dependence constraints are satisfied)

1048707 Moving cycle-by-cycle through the schedule template

bull choose instructions from the list amp schedule them (provided that machine resources are available)

bull update the list for the next cycle

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 42

List Scheduling (Conrsquot)

1048707 Uses a greedy heuristic approach

1048707 Has forward and backward forms

1048707 Is the basis for most algorithms that perform scheduling over regions larger than a single block

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 43

Heuristic Solution

Greedy List Scheduling Algorithm

1 Build a priority list L of the instructions in non-decreasing order of some rank function

2 For each instruction j initialize

pred-count[j] = predecessors of j in DDG

3 Ready-instructions = j | pred-count[j] = 0

4 While (ready-instructions is non-empty) do j = first ready instruction according to the order in priority list L Output j as the next instruction in the schedule

Consider resource constraints beyond a single clean pipeline

Consider resource constraints beyond a single clean pipeline

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 44

Ready-instructions = ready-instructions- j

for each successor k of j in the DDG do

pred-count[k] = pred-count[k] - 1

if (pred-count[k] = 0 ) then

ready-instructions = ready-instruction

+ k

end if

end for

end while

Conrsquod

Heuristic Solution

Greedy List Scheduling Algorithm

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 45

Special Performance

Bounds

For a single two stage pipeline

m = 1 and k = 2 ==gt

(here m is the number of pipelines and k is the number of pipeline

stages per pipeline)

makespan (greedy)makespan(OPT) lt= 15

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 46

Properties of List

Scheduling

bull The result is within a factor of two from the

optimal for pipelined machines (Lawler87)

Note we are considering basic block scheduling here Note we are considering basic block scheduling here

bull Complexity O(n^2) --- where n is the

number of nodes in the DDG

bull In practice it is dominated by DDG

building which itself is also O(n^2)

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 47

A Heuristic Rank Function

Based on Critical paths

1 Compute EST (Earliest Starting Times) for each node in the augmented DDG as follows

EST[START] = 0 ESTy] =

MAX (EST[x] + node-weight (x) + edge-weight (xy) |

there exists an edge from x to y )

2 Set CPL = EST[END] the critical path length of the augmented DDG

3 Similarly compute LST (Latest Starting Time of All nodes)

4 Set rank (i) = LST [i] - EST [i] for each instruction i

(all instructions on a critical path will have zero rank)

NOTE there are other heurestics NOTE there are other heurestics

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 48

Example of Rank Computation

Node x EST[X] LST[x] rank (x)

Start 0 0 0

i1 0 0 0

i2 1 2 1

i3 1 1 0

i4 3 3 0

END 4 4 0 ==gt Priority list = (i1 i3 i4 i2)

i2

i3

i1 i4

0 0

1

1

1

END 0

START

0

0

0

1 0 1

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 49

Summary

Instruction Scheduling for a Basic Block

1 Build the data dependence graph (DDG) for the basic block

bull Node = target instruction

bull Edge = data dependence (flowantioutput)

2 Assign weights to nodes and edges in the DDG so as to model target processor eg for a two-stage pipeline

bull Node weight = 1 for all nodes

bull Edge weight = 1 for edges with loadstore instruction as source node edge weight = 0 for all other edges

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 50

Summary

3 A legal schedule for a weighted DDG must

obey all ordering and timing constraints of the

weighted DDG

4 Goal find a legal schedule with minimum

completion time

Conrsquod

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 51

Other Heurestics for Ranking

Number of successors

Number of total decendents

Latency

Last use of a variable

Others

Note these heuristics help break ties but none dominates the others

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 52

Hazards Preventing

Pipelining

bull Structural hazards

bull Data dependent hazard

bull Control hazard

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 53

Local vs Global Scheduling

1 Straight-line code (basic block) ndash Local scheduling

2 Acyclic control flow ndash Global scheduling

bull Trace scheduling

bull Hyperblocksuperblock scheduling

bull IGLS (integrated Global and Local Scheduling)

3 Loops - a solution for this case is loop

unrolling+scheduling another solution is

software pipelining or modulo scheduling ie to rewrite

the loop as a repeating pattern that overlaps instructions

from different iterations

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 54

Smooth Info

flow into Backend

in Open64

Smooth Info

flow into Backend

in Open64

WHIRL

IGLS GRALRA

WHIRL-to-TOP

CGIR

Hyperblock Formation Critical Path Reduction

Inner Loop Opt Software Pipelining

Extended basic block optimization

Control Flow optimization

Code Emission

Executable

Info

rma

tion

fro

m F

ron

t en

d

(ali

as

str

uct

ure

et

c)

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 55

Flowchart of Code

Generator

WHIRL

Process Inner Loops unrolling EBO

Loop prep software pipelining

IGLS pre-pass

GRA LRA EBO

IGLS post-pass

Control Flow Opt

Code Emission

WHIRL-to-TOP Lowering

CGIR Quad Op List

Control Flow Opt I EBO

Hyperblock Formation Critical-Path Reduction

Control Flow Opt II

EBO

EBO Extended basic block optimization peephole etc

PQS Predicate Query System

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success

2012312 coursecpeg421-10FTopic2appt 56

Software Pipelining

vs

Normal Scheduling

a SWP-amenable loop candidate

Inner loop processing

software pipelining

Code Emission

IGLS

GRALRA

IGLS

No Yes

Failurenot profitable

Success