Transcript of Lec1 final

Page 1: Lec1 final

Superscalar and VLIW Architectures

Page 2: Lec1 final

Parallel processing [2]

Processing instructions in parallel requires three major tasks:

1. checking dependencies between instructions to determine which instructions can be grouped together for parallel execution (a small sketch follows this list);

2. assigning instructions to the functional units on the hardware;

3. determining when instructions are initiated or placed together into a single word.
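
Task 1 is the easiest to make concrete. Below is a minimal sketch in Python, assuming a hypothetical instruction record carrying reads/writes register sets; the names and format are illustrative, not from the slides.

```python
# Two instructions may be grouped for parallel execution only if
# no data dependence links them.
def independent(i1, i2):
    """True if the pair has no RAW, WAR, or WAW dependence."""
    raw = i1["writes"] & i2["reads"]    # i2 reads what i1 writes
    war = i1["reads"] & i2["writes"]    # i2 overwrites what i1 reads
    waw = i1["writes"] & i2["writes"]   # both write the same register
    return not (raw or war or waw)

add = {"reads": {"r1", "r2"}, "writes": {"r3"}}
sub = {"reads": {"r3", "r4"}, "writes": {"r5"}}   # reads r3: RAW on add
print(independent(add, sub))   # False -> must not issue in parallel
```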

Page 3: Lec1 final

Major categories [2]Major categories [2]

VLIW – Very Long Instruction Word

EPIC – Explicitly Parallel Instruction Computing

Page 4: Lec1 final

Major categories [2]

Page 5: Lec1 final

Superscalar Processors [1]

Superscalar processors are designed to exploit more instruction-level parallelism in user programs.

Only independent instructions can be executed in parallel without causing a wait state.

The amount of instruction-level parallelism varies widely depending on the type of code being executed.

Page 6: Lec1 final

Pipelining in Superscalar Processors [1]

In order to fully utilise a superscalar processor of degree m, m instructions must be executable in parallel. This situation may not be true in all clock cycles. In that case, some of the pipelines may be stalling in a wait state.

In a superscalar processor, the simple operation latency should require only one cycle, as in the base scalar processor.
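
As a rough illustration of what degree m buys under ideal conditions, the sketch below uses the common textbook approximation that a k-stage pipeline issuing m instructions per cycle completes N instructions in about k + (N - m)/m cycles; the function and the numbers are assumptions for illustration, not from the slides.

```python
# Illustrative only: ideal cycle count for N instructions on a k-stage
# pipeline that issues m instructions per cycle, with no stalls.
def cycles(N, k, m):
    return k + (N - m) / m   # first results after k cycles, then m per cycle

base = cycles(64, 4, 1)           # base scalar processor (degree 1)
supr = cycles(64, 4, 3)           # superscalar of degree m = 3
print(base, supr, base / supr)    # speedup approaches m only when m
                                  # independent instructions exist each cycle
```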

Page 7: Lec1 final
Page 8: Lec1 final

Superscalar Execution

Page 9: Lec1 final

Superscalar Implementation

Simultaneously fetch multiple instructions

Logic to determine true dependencies involving register values

Mechanisms to communicate these values

Mechanisms to initiate multiple instructions in parallel

Resources for parallel execution of multiple instructions

Mechanisms for committing process state in correct order

Page 10: Lec1 final

Some Architectures

PowerPC 604
– six independent execution units:
  Branch execution unit
  Load/Store unit
  3 Integer units
  Floating-point unit
– in-order issue
– register renaming

PowerPC 620
– provides, in addition to the 604, out-of-order issue

Pentium
– three independent execution units:
  2 Integer units
  Floating-point unit
– in-order issue

Page 11: Lec1 final

VLIW

Very Long Instruction Word (VLIW) architectures are used for executing more than one basic instruction at a time.

These processors contain multiple functional units, which fetch from the instruction cache a Very Long Instruction Word containing several basic instructions, and dispatch the entire VLIW for parallel execution. These capabilities are exploited by compilers which generate code that has grouped together independent primitive instructions executable in parallel.

VLIW has been described as a natural successor to RISC (Reduced Instruction Set Computing), because it moves complexity from the hardware to the compiler, allowing simpler, faster processors.

VLIW eliminates the complicated instruction scheduling and parallel dispatch that occurs in most modern microprocessors.
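
As a sketch of the compiler's side of this bargain, the hypothetical packer below greedily groups mutually independent operations into one long word per cycle, reusing a dependence test like the independent() function sketched earlier. Slot counts and names are assumptions, not from the slides.

```python
# Hypothetical VLIW packing: fill each long word with up to `slots`
# mutually independent operations; any dependence closes the word.
def pack_vliw(instrs, slots, independent):
    words, word = [], []
    for ins in instrs:
        if len(word) == slots or not all(independent(w, ins) for w in word):
            words.append(word)    # emit the current long word
            word = []
        word.append(ins)
    if word:
        words.append(word)
    return words                  # each inner list issues in one cycle
```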

Page 12: Lec1 final

WHY VLIW?

The key to higher performance in microprocessors for a broad range of applications is the ability to exploit fine-grain, instruction-level parallelism.

Some methods for exploiting fine-grain parallelism include:

Pipelining

Multiple processors

Superscalar implementation

Specifying multiple independent operations per instruction

Page 13: Lec1 final

Architecture Comparison: CISC, RISC & VLIW

INSTRUCTION SIZE
– CISC: Varies
– RISC: One size, usually 32 bits
– VLIW: One size

INSTRUCTION FORMAT
– CISC: Field placement varies
– RISC: Regular, consistent placement of fields
– VLIW: Regular, consistent placement of fields

INSTRUCTION SEMANTICS
– CISC: Varies from simple to complex; possibly many dependent operations per instruction
– RISC: Almost always one simple operation
– VLIW: Many simple, independent operations

REGISTERS
– CISC: Few, sometimes special
– RISC: Many, general-purpose
– VLIW: Many, general-purpose

Page 14: Lec1 final

Architecture Comparison: CISC, RISC & VLIW

MEMORY REFERENCES
– CISC: Bundled with operations in many different types of instructions
– RISC: Not bundled with operations, i.e., load/store architecture
– VLIW: Not bundled with operations, i.e., load/store architecture

HARDWARE DESIGN FOCUS
– CISC: Exploit microcoded implementations
– RISC: Exploit implementations with one pipeline and no microcode
– VLIW: Exploit implementations with multiple pipelines, no microcode and no complex dispatch logic

PICTURES OF FIVE TYPICAL INSTRUCTIONS

Page 15: Lec1 final

Advantages of VLIW

VLIW processors rely on the compiler that generates the VLIW code to explicitly specify parallelism. Relying on the compiler has advantages.

VLIW architecture reduces hardware complexity. VLIW simply moves complexity from hardware into software.

Page 16: Lec1 final

What is ILP?

Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously.

A system is said to embody ILP if multiple instructions can run on it at the same time.

ILP can have a significant effect on performance, which is critical to embedded systems.

ILP provides a form of power saving by allowing the clock to be slowed.
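
A one-line illustration of the idea (ordinary Python, executed sequentially here, but showing which operations could overlap on parallel hardware):

```python
e = 1 + 2     # a + b
f = 3 + 4     # c + d: independent of the line above, so a machine with
              # two adders could execute both in the same cycle (ILP = 2)
g = e * f     # depends on e and f, so it must wait for both
```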

Page 17: Lec1 final

What we intend to do with ILP?

We use microarchitectural techniques to exploit ILP. The various techniques include:

Instruction pipelining, which depends on CPU caches.

Register renaming, a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations (a sketch follows this list).

Speculative execution, which reduces pipeline stalls due to control dependencies.

Branch prediction, which is used to keep the pipeline full.

Superscalar execution, in which multiple execution units are used to execute multiple instructions in parallel.

Out-of-order execution, which reduces pipeline stalls due to operand dependencies.
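
As an illustration of register renaming, here is a minimal sketch with a hypothetical instruction format (a destination plus a list of sources); it is not from the slides.

```python
# Minimal renaming sketch: every write gets a fresh physical register,
# so reuse of an architectural register no longer forces serialization.
def rename(instrs):
    mapping, out = {}, []
    for n, (dst, srcs) in enumerate(instrs):
        srcs = [mapping.get(s, s) for s in srcs]  # read current mappings
        mapping[dst] = f"p{n}"                    # allocate a fresh name
        out.append((mapping[dst], srcs))
    return out

# r1 is reused by the third instruction (a WAW/WAR hazard); after renaming
# the two definitions live in different physical registers.
prog = [("r1", ["r2", "r3"]), ("r4", ["r1"]), ("r1", ["r5", "r6"])]
print(rename(prog))   # [('p0', ['r2','r3']), ('p1', ['p0']), ('p2', ['r5','r6'])]
```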

Page 18: Lec1 final

Algorithms for schedulingAlgorithms for scheduling

A few of the instruction scheduling algorithms used are:

List scheduling

Trace scheduling

Software pipelining (modulo scheduling)

Page 19: Lec1 final

List Scheduling

List scheduling by steps (a sketch follows the steps):

1. Construct a dependence graph of the basic block. (The edges are weighted with the latency of the instruction.)

2. Use the dependence graph to determine instructions that can execute; insert them on a list, called the Ready list.

3. Use the dependence graph and the Ready list to schedule an instruction that causes the smallest possible stall; update the Ready list. Repeat.
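
The three steps translate into a compact greedy scheduler. The sketch below is a single-issue, latency-only model with hypothetical data structures (dicts of latencies and dependence edges), intended only to make the steps concrete, not as a production implementation.

```python
from collections import defaultdict

def list_schedule(latency, deps):
    """latency: {instr: cycles}; deps: (producer, consumer) edges
    of the basic block's dependence graph (step 1)."""
    preds, succs = defaultdict(set), defaultdict(set)
    for a, b in deps:
        preds[b].add(a)
        succs[a].add(b)

    remaining = {i: set(preds[i]) for i in latency}   # unscheduled producers
    ready = [i for i in latency if not remaining[i]]  # the Ready list (step 2)
    finish, schedule, cycle = {}, [], 0

    while ready:
        # Step 3: pick the ready instruction causing the smallest stall,
        # i.e. the one whose operands become available earliest.
        instr = min(ready, key=lambda i: max((finish[p] for p in preds[i]), default=0))
        ready.remove(instr)
        start = max(cycle, max((finish[p] for p in preds[instr]), default=0))
        finish[instr] = start + latency[instr]
        schedule.append((start, instr))
        cycle = start + 1
        for s in succs[instr]:            # update the Ready list
            remaining[s].discard(instr)
            if not remaining[s]:
                ready.append(s)
    return schedule
```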

Page 20: Lec1 final

Code Representation for List Scheduling

a = b + c
d = e - f

1. load R1, b
2. load R2, c
3. add R2, R1
4. store a, R2
5. load R3, e
6. load R4, f
7. sub R3, R4
8. store d, R3

(Dependence graph: loads 1 and 2 feed add 3, which feeds store 4; loads 5 and 6 feed sub 7, which feeds store 8.)

Page 21: Lec1 final

Code Representation for List Scheduling

a = b + c
d = e - f

Scheduled code, with the loads interleaved so their latency is hidden:

1. load R1, b
5. load R3, e
2. load R2, c
6. load R4, f
3. add R2, R1
7. sub R3, R4
4. store a, R2
8. store d, R3

Now we have a schedule that requires no stalls and no NOPs.
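
Feeding this example to the list_schedule sketch from the previous page reproduces that interleaving (latencies assumed for illustration: 2 cycles for loads, 1 for add/sub/store):

```python
lat = {1: 2, 2: 2, 3: 1, 4: 1, 5: 2, 6: 2, 7: 1, 8: 1}
edges = [(1, 3), (2, 3), (3, 4), (5, 7), (6, 7), (7, 8)]
print(list_schedule(lat, edges))
# Loads 1, 2, 5, 6 issue back to back, hiding their latency; the add,
# sub, and stores then find their operands ready: no stalls, no NOPs.
```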

Page 22: Lec1 final

Problem and Solution

Register allocation conflict: use of the same register creates anti-dependencies that restrict scheduling.

Register allocation before scheduling
– prevents good scheduling

Scheduling before register allocation
– spills destroy scheduling

Solution: schedule abstract assembly, allocate registers, then schedule again.

Page 23: Lec1 final

Trace scheduling

Steps involved in trace scheduling:

Trace Selection
– Find the most common trace of basic blocks (a sketch follows below).

Trace Compaction
– Combine the basic blocks in the trace and schedule them as one block.
– Create clean-up code if the execution goes off-trace.

Parallelism across IF branches vs. LOOP branches.

Can provide a speedup if static prediction is accurate.
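
Trace selection itself is easy to sketch: given profiled edge counts (assumed data; block names are hypothetical), greedily follow the most frequently taken successor from the entry block.

```python
def select_trace(entry, succ_counts):
    """succ_counts: {block: {successor: times_taken}} from profiling."""
    trace, block, seen = [], entry, set()
    while block is not None and block not in seen:      # stop at exits/loops
        trace.append(block)
        seen.add(block)
        nxt = succ_counts.get(block)
        block = max(nxt, key=nxt.get) if nxt else None  # hottest successor
    return trace

counts = {"B0": {"B1": 90, "B2": 10}, "B1": {"B3": 90}, "B2": {"B3": 10}}
print(select_trace("B0", counts))   # ['B0', 'B1', 'B3'] -- the dominant trace
```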

Page 24: Lec1 final

How Trace Scheduling worksHow Trace Scheduling works

Look for the highest-priority path and trace the blocks as shown below.

Page 25: Lec1 final

How Trace Scheduling worksHow Trace Scheduling works

After tracing the priority blocks, schedule them first and the rest in parallel with them.

Page 26: Lec1 final

How Trace Scheduling worksHow Trace Scheduling works

We can see the blocks being traced depending on their priority.

Page 27: Lec1 final

How Trace Scheduling worksHow Trace Scheduling works

• Create large extended basic blocks by duplication

• Schedule the larger blocks

The figure above shows how the extended basic blocks can be created.

Page 28: Lec1 final

How Trace Scheduling worksHow Trace Scheduling works

This block diagram, in its final stage, shows the parallelism across the branches.

Page 29: Lec1 final

Limitations of Trace Scheduling

Optimization depends on the traces being the dominant paths in the program's control flow.

Therefore, the following two things should be true:

– Programs should demonstrate the behavior of being skewed in the branches taken at run-time, for typical mixes of input data.

– We should have access to this information at compile time.

Not so easy.

Page 30: Lec1 final

Software Pipelining

In software pipelining, iterations of a loop in the source program are continuously initiated at constant intervals, before the preceding iterations complete, thus taking advantage of the parallelism in the data path.

It is also explained as scheduling the operations within an iteration such that the iterations can be pipelined to yield optimal throughput.

The sequence of instructions before the steady state is called the PROLOG, and the sequence after the steady state is called the EPILOG.
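
A toy illustration of the shape this produces, assuming a three-stage loop body (load, add, store) where stage s of iteration i runs at cycle i + s; all names are hypothetical.

```python
# Prints the schedule of a software-pipelined loop: the pipeline fills
# (PROLOG), reaches a steady state completing one full stage set per
# cycle (KERNEL), then drains (EPILOG).
def pipeline(n):
    print("PROLOG")
    print("cycle 0: load[0]")
    print("cycle 1: load[1] add[0]")
    print("KERNEL (steady state)")
    for i in range(2, n):
        print(f"cycle {i}: load[{i}] add[{i-1}] store[{i-2}]")
    print("EPILOG")
    print(f"cycle {n}: add[{n-1}] store[{n-2}]")
    print(f"cycle {n+1}: store[{n-1}]")

pipeline(5)
```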

Page 31: Lec1 final

Software Pipelining Example

Source code: for (i = 0; i < n; i++) sum += a[i];

Loop body in assembly:

r1 = L r0
--- ; stall
r2 = Add r2, r1
r0 = add r0, 4

Unroll the loop & allocate registers:

r1 = L r0
--- ; stall
r2 = Add r2, r1
r0 = Add r0, 12

r4 = L r3
--- ; stall
r2 = Add r2, r4
r3 = add r3, 12

r7 = L r6
--- ; stall
r2 = Add r2, r7
r6 = add r6, 12

r10 = L r9
--- ; stall
r2 = Add r2, r10
r9 = add r9, 12

Page 32: Lec1 final

Software Pipelining Example

Page 33: Lec1 final

Software Pipelining Example

Schedule the unrolled instructions, exploiting VLIW (or not).

Identify the repeating pattern (kernel); the instructions before it form the PROLOG, those after it the EPILOG.

Page 34: Lec1 final

Constraints in Software Pipelining

Recurrence constraints: determined by loop-carried data dependencies.

Resource constraints: determined by total resource requirements.
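
These two constraints bound the initiation interval from below, and modulo schedulers typically start from the larger of the two bounds. A minimal sketch of that standard ResMII/RecMII reasoning (not taken from the slides):

```python
import math

def min_initiation_interval(op_usage, num_units, dep_cycles):
    """op_usage: {unit_class: ops per iteration};
    num_units: {unit_class: units available};
    dep_cycles: [(total_latency, iteration_distance)] per dependence cycle."""
    res_mii = max(math.ceil(n / num_units[u]) for u, n in op_usage.items())
    rec_mii = max((math.ceil(lat / dist) for lat, dist in dep_cycles), default=0)
    return max(res_mii, rec_mii)   # both constraints must be satisfied

# Two loads and two adds per iteration on one load unit and one adder,
# plus a loop-carried dependence of latency 3 spanning one iteration:
print(min_initiation_interval({"load": 2, "add": 2},
                              {"load": 1, "add": 1}, [(3, 1)]))   # -> 3
```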

Page 35: Lec1 final

Remarks on Software Pipelining

Innermost loops, loops with larger trip counts, and loops without conditionals can be software pipelined.

Code size increases due to the prolog and epilog.

Code size increases due to unrolling for MVE (Modulo Variable Expansion).

Register allocation strategies for software-pipelined loops.

Loops with conditionals can be software pipelined if predicated execution is supported.

– Higher resource requirement, but efficient schedule