Instruction Level Parallelism

● ILP, Loop level Parallelism

● Dependences, Hazards

● Speculation, Branch prediction

Basic Block

● A straight-line code sequence with no branches in except to the entry and no branches out except at the exit

Loop: L.D F0, 0(R1)

ADD.D F4, F0, F2

S.D F4, 0(R1)

DADDUI R1, R1, #-8

BNE R1, R2, Loop

DADDUI R1, 0(R2)

BEQZ R2, L1

LW R1, 0(R2)

● Name dependence

– antidependence, output dependence

– Register renaming

● Hazard

– Overlap during execution would change the order of access to the operand involved in the dependence.

for (i=0; i<=999; i=i+1)
    x[i] = x[i] + a;

Loop: L.D F0, 0(R1)

ADD.D F4, F0, F2

S.D F4, 0(R1)

DADDUI R1, R1, #-8

BNE R1, R2, Loop

Data Dependence

Name Dependence

ADD.D F4, F0, F2

ADD.D F4, F6, F8
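The two ADD.D instructions above both write F4, so they share a name (output) dependence even though no data flows between them. A minimal Python sketch of register renaming follows; the tuple encoding of instructions and the physical register names P0, P1, ... are assumptions made for illustration, not part of the slides.

```python
def rename(program):
    """Give every destination a fresh physical register.

    program: list of (opcode, dest, sources) in program order.
    Sources are redirected to the newest name of the register they read,
    so true (RAW) dependences are preserved while WAW and WAR disappear.
    """
    newest = {}   # architectural register -> current physical register
    renamed = []
    for count, (opcode, dest, sources) in enumerate(program):
        new_sources = [newest.get(reg, reg) for reg in sources]
        new_dest = f"P{count}"          # hypothetical physical register name
        newest[dest] = new_dest
        renamed.append((opcode, new_dest, new_sources))
    return renamed


program = [
    ("ADD.D", "F4", ["F0", "F2"]),
    ("ADD.D", "F4", ["F6", "F8"]),
]
renamed = rename(program)
for instruction in renamed:
    print(instruction)
```

After renaming, the two instructions write different registers, so they could execute in either order.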

Hazards

● Program Order

– ILP preserves program order only where it affects the outcome of the program

● Structural Hazards

– Resource conflicts

● Data Hazards

– RAW, WAW, WAR
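The three data hazard classes can be told apart by comparing which registers a pair of instructions reads and writes. The sketch below is a simplified illustration: the (dest, sources) encoding and the function name `classify_hazards` are assumptions, and it reports potential hazards only, since whether one actually occurs depends on the pipeline.

```python
def classify_hazards(first, second):
    """Return the potential hazards between two instructions in program order.

    Each instruction is (dest, sources); dest is None for stores and branches.
    """
    first_dest, first_sources = first
    second_dest, second_sources = second
    found = set()
    if first_dest is not None and first_dest in second_sources:
        found.add("RAW")   # second reads what first writes
    if first_dest is not None and first_dest == second_dest:
        found.add("WAW")   # both write the same register
    if second_dest is not None and second_dest in first_sources:
        found.add("WAR")   # second overwrites what first reads
    return found


# DADD R1,R2,R3 followed by DSUB R4,R1,R5
print(classify_hazards(("R1", ["R2", "R3"]), ("R4", ["R1", "R5"])))
# ADD.D F4,F0,F2 followed by ADD.D F4,F6,F8
print(classify_hazards(("F4", ["F0", "F2"]), ("F4", ["F6", "F8"])))
# S.D F4,0(R1) followed by DADDUI R1,R1,#-8
print(classify_hazards((None, ["F4", "R1"]), ("R1", ["R1"])))
```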

Structural Hazard

[Figure: pipeline timing diagram over clock cycles 1 to 9, with stages labelled MEM ID EX MEM WB; instruction i5 is marked HAZARD!!!]

Data Hazard

DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

[Figure: pipeline diagram of the five instructions against time (clock cycles), each passing through IM, REG, ALU, DM, REG]

Avoiding Data Hazards – Forwarding

DADD R1, R2, R3
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9
XOR R10, R1, R11

[Figure: pipeline diagram of the five instructions against time (clock cycles), each passing through IM, REG, ALU, DM, REG, with forwarding paths]

Load Delay Slot

LD R1, 0(R2)
DSUB R4, R1, R5
AND R6, R1, R7
OR R8, R1, R9

[Figure: pipeline diagram of the four instructions against time (clock cycles), each passing through IM, REG, ALU, DM, REG; the slot after the load is marked LOAD DELAY SLOT]

The loaded value might not be available in the destination register for use by the instruction immediately following the load.

Cost of Stalls

Data references = 40% of instructions. Ideal CPI = 1. The processor with the structural hazard has a clock rate 1.1 times that of the processor without the hazard. Which processor is faster?

$$\text{Pipeline CPI} = \text{Ideal pipeline CPI} + \text{Structural stalls} + \text{Data hazard stalls} + \text{Control stalls}$$
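One way to work the question above is to compare average time per instruction. This sketch rests on two assumptions that the slide does not state: "1.1 times faster" refers to clock rate, and every data reference costs the hazard-prone processor exactly one stall cycle.

```python
data_ref_fraction = 0.40   # from the slide
ideal_cpi = 1.0            # from the slide
clock_rate_ratio = 1.1     # assumed to mean clock rate (with hazard / without)
stall_per_data_ref = 1     # assumption: one stall cycle per data reference

# CPI of the processor with the structural hazard.
cpi_with_hazard = ideal_cpi + data_ref_fraction * stall_per_data_ref

# Time per instruction, in units of the hazard-free processor's clock cycle.
time_without_hazard = ideal_cpi
time_with_hazard = cpi_with_hazard / clock_rate_ratio

ratio = time_with_hazard / time_without_hazard
print(f"processor without the hazard is {ratio:.2f}x faster")
```

Under these assumptions the faster clock does not make up for the extra stall cycles, so the processor without the hazard comes out ahead.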

Pipeline Scheduling

Reorder the instructions of the program so that dependent instructions are far enough apart.

Done by the compiler, before the program runs: Static Instruction Scheduling

Done by the hardware, when the program is running: Dynamic Instruction Scheduling

Pipeline Scheduling

Original Program (Total Execution Cycles: 7)

LW R3, 0(R1)
ADDI R5, R3, 1
ADD R2, R2, R3
LW R13, 0(R11)
ADD R12, R13, R3

Scheduled Code (Total Execution Cycles: 5)

LW R3, 0(R1)
LW R13, 0(R11)
ADDI R5, R3, 1
ADD R2, R2, R3
ADD R12, R13, R3
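The cycle counts of 7 and 5 can be reproduced with a very small model: one issue cycle per instruction, plus one stall whenever an instruction uses the result of the load immediately before it. The tuple encoding and the one-cycle load-use penalty are assumptions of this sketch.

```python
def execution_cycles(program):
    """Issue cycles for a single-issue pipeline with a one-cycle load-use stall.

    program: list of (opcode, dest, sources) in program order.
    """
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        if opcode == "LW" and dest in current[2]:
            stalls += 1   # consumer sits directly behind the load
    return len(program) + stalls


original = [
    ("LW", "R3", ["R1"]),
    ("ADDI", "R5", ["R3"]),
    ("ADD", "R2", ["R2", "R3"]),
    ("LW", "R13", ["R11"]),
    ("ADD", "R12", ["R13", "R3"]),
]
scheduled = [
    ("LW", "R3", ["R1"]),
    ("LW", "R13", ["R11"]),
    ("ADDI", "R5", ["R3"]),
    ("ADD", "R2", ["R2", "R3"]),
    ("ADD", "R12", ["R13", "R3"]),
]
print(execution_cycles(original), execution_cycles(scheduled))  # 7 5
```

Hoisting the second load puts at least one instruction between each load and its first consumer, which removes both stalls.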

Loop-level Parallelism

Original Loop:

Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
DADDUI R1, R1, #-8
BNE R1, R2, Loop

Unrolled:

Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
L.D F6, -8(R1)
ADD.D F8, F2, F6
S.D F8, -8(R1)
L.D F10, -16(R1)
ADD.D F12, F2, F10
S.D F12, -16(R1)
L.D F14, -24(R1)
ADD.D F16, F2, F14
S.D F16, -24(R1)
DADDUI R1, R1, #-32
BNE R1, R2, Loop
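At source level, unrolling by four replaces four trips through the loop with one trip that handles four elements, so the loop overhead (index update and branch) is paid once instead of four times. A minimal Python sketch of the same transformation follows; the function names are illustrative, and the unrolled version assumes the element count is a multiple of 4, as it is for the 1000-element loop in the slides.

```python
def add_scalar(x, a):
    """Reference loop: x[i] = x[i] + a for every element."""
    for i in range(len(x)):
        x[i] = x[i] + a


def add_scalar_unrolled4(x, a):
    """Same computation with the loop body replicated four times."""
    assert len(x) % 4 == 0, "sketch assumes a multiple of 4 elements"
    for i in range(0, len(x), 4):
        x[i] = x[i] + a
        x[i + 1] = x[i + 1] + a
        x[i + 2] = x[i + 2] + a
        x[i + 3] = x[i + 3] + a


reference = list(range(1000))
unrolled = list(range(1000))
add_scalar(reference, 5)
add_scalar_unrolled4(unrolled, 5)
print(reference == unrolled)
```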

Loop Unrolling

Instr producing result | Instr using result | Latency to avoid a stall
FP ALU op | Another FP ALU op |
FP ALU op | Store double | 2
Load double | FP ALU op | 1
Load double | Store double | 0

Loop: L.D F0, 0(R1)
ADD.D F4, F0, F2
S.D F4, 0(R1)
L.D F6, -8(R1)
ADD.D F8, F2, F6
S.D F8, -8(R1)
L.D F10, -16(R1)
ADD.D F12, F2, F10
S.D F12, -16(R1)
L.D F14, -24(R1)
ADD.D F16, F2, F14
S.D F16, -24(R1)
DADDUI R1, R1, #-32
BNE R1, R2, Loop

Total Cycles: 27 cycles

Loop Unrolling

Loop: L.D F0, 0(R1)
L.D F6, -8(R1)
L.D F10, -16(R1)
L.D F14, -24(R1)
ADD.D F4, F0, F2
ADD.D F8, F2, F6
ADD.D F12, F2, F10
ADD.D F16, F2, F14
S.D F4, 0(R1)
S.D F8, -8(R1)
DADDUI R1, R1, #-32
S.D F12, 16(R1)
S.D F16, 8(R1)
BNE R1, R2, Loop

Total Cycles: 14 cycles

➢ Code Size

➢ Register pressure
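The totals of 27 and 14 cycles for the unrolled loop can be checked with a simple single-issue, in-order model that delays an instruction until its producers' latencies have elapsed. This is a sketch under stated assumptions: the latencies come from the table in the slides, the one-cycle latency from the integer update to the branch is an assumption not listed there, and any producer/consumer pair missing from the table is treated as latency 0.

```python
# Stall cycles required between a producer and a dependent consumer.
LATENCY = {
    ("load", "fp"): 1,
    ("fp", "store"): 2,
    ("load", "store"): 0,
    ("int", "branch"): 1,   # assumption: not listed in the slide's table
}


def count_cycles(program):
    """Return the issue cycle of the last instruction.

    program: list of (kind, dest, sources) in issue order.
    """
    last_write = {}   # register -> (issue cycle of producer, producer kind)
    cycle = 0
    for kind, dest, sources in program:
        cycle += 1
        for reg in sources:
            if reg in last_write:
                produced_at, producer_kind = last_write[reg]
                latency = LATENCY.get((producer_kind, kind), 0)
                cycle = max(cycle, produced_at + 1 + latency)
        if dest is not None:
            last_write[dest] = (cycle, kind)
    return cycle


# (loaded register, result register) for the four copies of the loop body.
PAIRS = [("F0", "F4"), ("F6", "F8"), ("F10", "F12"), ("F14", "F16")]
loads = [("load", src, ["R1"]) for src, _ in PAIRS]
adds = [("fp", dst, [src, "F2"]) for src, dst in PAIRS]
stores = [("store", None, [dst, "R1"]) for _, dst in PAIRS]
update = ("int", "R1", ["R1"])            # DADDUI R1, R1, #-32
branch = ("branch", None, ["R1", "R2"])   # BNE R1, R2, Loop

unrolled = [ins for trio in zip(loads, adds, stores) for ins in trio] + [update, branch]
scheduled = loads + adds + stores[:2] + [update] + stores[2:] + [branch]

print(count_cycles(unrolled), count_cycles(scheduled))  # 27 14
```

The scheduled order has no stalls because every consumer is already far enough from its producer, so 14 instructions take 14 cycles.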

Exceptions

● Certain exceptional events that occur during program execution, handled by the processor hardware

● Control transfer to specific OS code based on the family of exception

● I/O device requests, System call, Breakpoint, Integer arithmetic overflow, FP arithmetic anomaly, Page fault, Undefined or unimplemented instruction, Hardware malfunctions, Power failure.

Exceptions

● Synchronous vs. Asynchronous

● User requested vs. Coerced

● User maskable vs. User non-maskable

● Within vs. Between instructions

– Save and restore processor state

– Restartable pipeline

● Resume vs. Terminate

Stopping and Restarting Execution

● Trap instruction, Turn off writes, Save PC, Save processor state, Exception handler, RFE

● Precise exceptions

Pipeline stage Problem exceptions occurring

IF Page fault on IF, misaligned memory access; memory protection violation

ID Undefined or illegal opcode

EX Arithmetic exception

MEM Page fault on data fetch; misaligned memory access; memory protection violation

WB None

Precise Exceptions

LD    IF  ID  EX  MEM WB
DADD      IF  ID  EX  MEM WB

● Exceptions at the same cycle

● Early exception by a later instruction

● Instruction Status Vector

– Check before commit
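The LD/DADD pair shows why exceptions cannot simply be handled when they are detected: a fault in DADD's IF stage is seen before a fault in LD's MEM stage, although LD comes first in program order. The sketch below is a much simplified model of recording exceptions per instruction and checking them in program order at commit. The stage timing (one instruction entering IF per cycle, no stalls) and the function names are assumptions for illustration.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def detection_cycle(index, stage):
    """Cycle in which instruction number `index` (0-based) is in `stage`."""
    return index + 1 + STAGES.index(stage)


def first_detected_and_first_handled(program):
    """program: list of (name, faulting stage or None) in program order.

    Returns (instruction whose exception is detected first,
             instruction whose exception must be handled first).
    """
    pending = [
        (detection_cycle(i, stage), name)
        for i, (name, stage) in enumerate(program)
        if stage is not None
    ]
    if not pending:
        return None, None
    detected_first = min(pending)[1]   # earliest clock cycle
    handled_first = pending[0][1]      # earliest in program order
    return detected_first, handled_first


# LD faults in MEM (cycle 4); DADD faults in IF (cycle 2).
print(first_detected_and_first_handled([("LD", "MEM"), ("DADD", "IF")]))
```

Deferring the check until commit is what lets the hardware report LD's exception first and keep the exception precise.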

Control Dependences

● Program correctness

– Data flow and Exception behaviour

● Software Speculation

– Liveness

DADDU R2, R3, R4

BEQZ R2, L1

LW R1, 0(R2)

DADDU R1, R2, R3

BEQZ R4, L1

DSUBU R1, R5, R6

L1: …........

OR R7, R1, R8

DADDU R1, R2, R3

BEQZ R12, L1

DSUBU R4, R5, R6

DADDU R5, R4, R9

L1: OR R7, R8, R9

Branch Hazards

● 1 stall cycle for every branch yields a performance loss of 10% to 30%!

[Figure: pipeline diagram over clock cycles 1 to 9 showing the Branch, Branch Successor, and Branch Successor + 1 instructions passing through IF ID EX MEM WB]

Reducing Pipeline Branch Penalties

● Freeze the pipeline

● Static Prediction

– Predict Taken, Predict Untaken

● Fill Branch Delay Slot

[Figure: pipeline diagram over clock cycles 1 to 9 showing the Branch, Branch Delay Slot, and Branch Successor instructions passing through IF ID EX MEM WB]

Branch Delay Slot

From the MIPS ISA Manual: The transfer of control takes place only following the instruction immediately after the control transfer instruction.

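Performance of Branch Schemes

$$\text{Stall cycles}_{\text{Branches}} = \text{Branch frequency} \times \text{Branch penalty}$$

$$\text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$

$$\text{Speedup}_{\text{pipelining}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$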
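The speedup formula is easy to evaluate for a given branch scheme. In the sketch below the function name and the sample numbers (a 5-stage pipeline, 20% branches, a 1-cycle penalty) are hypothetical values chosen for illustration, not figures from the slides.

```python
def pipeline_speedup(depth, branch_frequency, branch_penalty):
    """Speedup from pipelining when branches are the only source of stalls."""
    stall_cycles_per_instruction = branch_frequency * branch_penalty
    return depth / (1 + stall_cycles_per_instruction)


# Hypothetical example values.
print(pipeline_speedup(depth=5, branch_frequency=0.20, branch_penalty=1))
# With no branch stalls the speedup equals the pipeline depth.
print(pipeline_speedup(depth=5, branch_frequency=0.0, branch_penalty=1))
```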

Classes of Exceptions

Exception type | Sync vs. Async | User request vs. Coerced | User maskable vs. nonmaskable | Within vs. between instructions | Resume vs. Terminate
I/O device request | Async | Coerced | Nonmaskable | Between | Resume
Invoke OS | Sync | User request | Nonmaskable | Between | Resume
Tracing instruction execution | Sync | User request | User maskable | Between | Resume
Breakpoint | Sync | User request | User maskable | Between | Resume
Arithmetic overflow | Sync | Coerced | User maskable | Within | Resume
FP underflow or overflow | Sync | Coerced | User maskable | Within | Resume
Page fault | Sync | Coerced | Nonmaskable | Within | Resume
Undefined instructions | Sync | Coerced | Nonmaskable | Within | Terminate
Hardware malfunctions | Async | Coerced | Nonmaskable | Within | Terminate
Power failure | Async | Coerced | Nonmaskable | Within | Terminate

Smith and Pleszkun, Implementing precise interrupts in pipelined processors, IEEE Transactions on Computers, 37(5), 1988.
