Processor Design: Pipelined Processor

Hung-Wei Tseng

Pipelining
• Break up the logic with “pipeline registers” into pipeline stages
• Each stage can act on a different instruction/data
• States/control signals of instructions are held in pipeline registers

Pipelining
[Figure: instructions entering the pipeline, cycle #1 through cycle #10]
• After the 5th cycle, the processor can do 5 instructions in parallel
• The processor can complete 1 instruction each cycle
• CPI == 1 if everything works perfectly!
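The fill-then-stream behavior above can be checked with a few lines of Python. This is a minimal sketch that assumes an ideal pipeline with no hazards; the function names are illustrative and do not come from the slides.

```python
def pipeline_cycles(num_instructions: int, num_stages: int = 5) -> int:
    """Cycles for an ideal pipeline: fill it once, then finish one instruction per cycle."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def cpi(num_instructions: int, num_stages: int = 5) -> float:
    """Average cycles per instruction, including the time to fill the pipeline."""
    return pipeline_cycles(num_instructions, num_stages) / num_instructions


print(pipeline_cycles(5))   # five instructions on a 5-stage pipeline
print(cpi(1_000_000))       # approaches 1 as the instruction count grows
```

For long instruction streams the fill time is amortized, which is why the ideal CPI is quoted as 1.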

Single-cycle vs. pipeline

Cycle time of a pipeline processor
• Critical path is the longest possible delay between two registers in a design.
• The critical path sets the cycle time, since the cycle time must be long enough for a signal to traverse the critical path.
• Lengthening or shortening non-critical paths does not change performance
• Ideally, all paths are about the same length
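A small worked example may help make the critical-path argument concrete. The stage delays below are hypothetical numbers chosen for illustration, not measurements from the slides, and pipeline register overhead is assumed to be zero unless passed in.

```python
# Hypothetical per-stage logic delays in picoseconds (assumed for illustration).
STAGE_DELAY_PS = {"IF": 200, "ID": 100, "EXE": 200, "MEM": 200, "WB": 100}


def single_cycle_time(delays):
    """Without pipeline registers, a signal must traverse every stage in one cycle."""
    return sum(delays.values())


def pipelined_cycle_time(delays, register_overhead_ps=0):
    """The slowest stage is the critical path and sets the cycle time."""
    return max(delays.values()) + register_overhead_ps


def total_time_ps(num_instructions, delays, pipelined):
    """Execution time for a hazard-free instruction stream."""
    if pipelined:
        cycles = len(delays) + num_instructions - 1
        return cycles * pipelined_cycle_time(delays)
    return num_instructions * single_cycle_time(delays)


print(single_cycle_time(STAGE_DELAY_PS), pipelined_cycle_time(STAGE_DELAY_PS))
# Shortening a non-critical stage (ID) leaves the pipelined cycle time unchanged.
print(pipelined_cycle_time({**STAGE_DELAY_PS, "ID": 50}))
```

With these assumed delays, the unbalanced ID and WB stages waste half of each cycle, which is the reason the slide asks for paths of about the same length.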

Designing a 5-stage pipeline processor for

Basic steps of execution
• Instruction fetch: where?
• Decode:
  • What’s the instruction?
  • Where are the operands?
• Execute
• Memory access
  • Where is my data?
• Write back
  • Where to put the result
• Determine the next PC

Processor
[Figure: a processor with registers and ALUs, connected to an instruction memory holding machine code (for example, 120007a30: 0f00bb27 ldah gp,15(t12)) and to a data memory holding data words]

Pipeline a MIPS processor
• Instruction Fetch (IF)
  • Read from instruction memory
• Instruction Decode (ID)
  • Figure out the incoming instruction
  • Fetch the operands from the registers
• Execution (EXE)
  • Perform ALU functions
• Memory Access (MEM)
  • Read/write data memory
• Write Back (WB)
  • Write the results to the register file

From single-cycle to pipeline
[Figure: single-cycle MIPS datapath (PC, instruction memory, register file, sign-extend, shift left 2, adder, ALU, data memory, control) with the signals RegDst, RegWrite, ALUSrc, MemRead, MemWrite, MemtoReg, and PCSrc = Branch & Zero, divided by the pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB into Instruction Fetch, Instruction Decode, Execution, Memory Access, and Write Back]
Will this work?

Pipelined processor
[Figure: pipelined datapath with the control unit, executing the following instructions]
add $1, $2, $3
lw $4, 0($5)
sub $6, $7, $8
sub $9, $10, $11
sw $1, 0($12)

Pipelined processor
[Figure: the same pipelined datapath and instruction sequence (add, lw, sub, sub, sw), with the control signals needed in the WB stage marked]
Where can I find these?

Pipelined processor
[Figure: the same pipelined datapath and instruction sequence, now with the destination-register mux and its RegDst signal labeled]

Pipelined processor
[Figure: the same pipelined datapath and instruction sequence, with the RegWrite signal highlighted]
Is this right?

5-stage pipelined processor
[Figure: the complete 5-stage pipelined datapath, with the control signals (RegDst, ALUSrc, MemRead, MemWrite, MemtoReg, RegWrite) carried along in the pipeline registers]

Simplified pipeline diagram
• Use symbols to represent the physical resources, with the abbreviations for pipeline stages.
  • IF, ID, EXE, MEM, WB
• The horizontal axis represents the timeline, the vertical axis the instruction stream
• Example:

cycle             1   2   3   4   5   6   7   8   9
add $1, $2, $3    IF  ID  EXE MEM WB
lw $4, 0($5)          IF  ID  EXE MEM WB
sub $6, $7, $8            IF  ID  EXE MEM WB
sub $9, $10, $11              IF  ID  EXE MEM WB
sw $1, 0($12)                     IF  ID  EXE MEM WB

Pipeline hazards

Pipeline hazards
• Even though we perfectly divide pipeline stages, it’s still hard to achieve CPI == 1.
• Pipeline hazards:
  • Structural hazard
    • The hardware does not allow two pipeline stages to work concurrently
  • Data hazard
    • A later instruction in a pipeline stage depends on the outcome of an earlier instruction in the pipeline
  • Control hazard
    • The processor is not clear about what the next instruction to fetch is

Can we get the right result?
• Given the current 5-stage pipeline, how many of the following MIPS code sequences can work correctly?

I
a: add $1, $2, $3
b: lw $4, 0($1)
c: sub $6, $7, $8
d: sub $9, $10, $11
e: sw $1, 0($12)
Data hazard: b cannot get $1 produced by a before a's WB

II
a: add $1, $2, $3
b: lw $4, 0($5)
c: sub $6, $7, $8
d: sub $9, $1, $10
e: sw $11, 0($12)
Structural hazard: both a and d are accessing $1 in the 5th cycle

III
a: add $1, $2, $3
b: lw $4, 0($5)
c: bne $0, $7, L
d: sub $9, $10, $11
e: sw $1, 0($12)
Control hazard: we don’t know if d and e will be executed or not

IV
a: add $1, $2, $3
b: lw $4, 0($5)
c: sub $6, $7, $8
d: sub $9, $10, $11
e: sw $1, 0($12)

Structural hazard

Structural hazard
• The hardware cannot support the combination of instructions that we want to execute in the same cycle
• The original pipeline incurs a structural hazard when two instructions compete for the same register.
• Solution: write early, read late
  • Writes occur at the clock edge and complete long enough before the end of the clock cycle.
  • This leaves enough time for outputs to settle for reads
  • The revised register file is the default one from now on!

cycle             1   2   3   4   5   6   7   8   9
add $1, $2, $3    IF  ID  EXE MEM WB
lw $4, 0($5)          IF  ID  EXE MEM WB
sub $6, $7, $8            IF  ID  EXE MEM WB
sub $9, $10, $1               IF  ID  EXE MEM WB
sw $1, 0($12)                     IF  ID  EXE MEM WB

Structural hazard
• The design of the hardware causes structural hazards
• We need to modify the hardware design to avoid structural hazards

Data hazard

Data hazard
• When an instruction in the pipeline needs a value that is not available
• Data dependences
  • The output of an instruction is the input of a later instruction
  • May result in a data hazard if the later instruction that consumes the result is still in the pipeline
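To illustrate what "the output of an instruction is the input of a later instruction" means in practice, the sketch below lists the data dependences in an instruction sequence. The `Instr` record and the hand-written register numbers are assumptions made for this example; a real tool would decode them from the machine code.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[int]       # register written, or None (e.g. sw)
    srcs: Tuple[int, ...]     # registers read


def data_dependences(program):
    """Return (producer index, consumer index, register) for each value consumed."""
    deps = []
    last_writer = {}          # register -> index of the latest instruction writing it
    for i, ins in enumerate(program):
        for reg in ins.srcs:
            # $0 always reads as zero in MIPS, so it never carries a dependence.
            if reg != 0 and reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if ins.dest is not None:
            last_writer[ins.dest] = i
    return deps


PROGRAM = [
    Instr("add $1, $2, $3", 1, (2, 3)),
    Instr("lw $4, 0($1)", 4, (1,)),
    Instr("sub $5, $2, $4", 5, (2, 4)),
    Instr("sub $1, $3, $1", 1, (3, 1)),
    Instr("sw $1, 0($5)", None, (1, 5)),
]

for producer, consumer, reg in data_dependences(PROGRAM):
    print(f"{PROGRAM[consumer].text!r} needs ${reg} from {PROGRAM[producer].text!r}")
```

Whether a dependence becomes a hazard depends on how close the two instructions are in the pipeline, which is what the stall and forwarding slides address.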

Sol. of data hazard I: Stall
• When the source operand of an instruction is not ready, stall the pipeline
  • Suspend the instruction and the following instructions
  • Allow the previous instructions to proceed
  • This introduces a pipeline bubble: a bubble does nothing and propagates through the pipeline like a nop instruction
• How to stall the pipeline?
  • Disable the PC update
  • Disable the pipeline registers on the earlier pipeline stages
  • When the stall is over, re-enable the pipeline registers and PC updates

Hazard detection & stall
[Figure: pipelined datapath with a hazard detection unit that takes ID/EX.MemRead as an input and drives PCWrite and IF/IDWrite]
• Check if the destination register of the instruction in EX == a source register of the instruction in ID
• Check if the destination register of the instruction in MEM == a source register of the instruction in ID
• Insert a “noop” if we need to stall

Performance of stall

cycle           1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
add $1, $2, $3  IF  ID  EXE MEM WB
lw $4, 0($1)        IF  ID  ID  ID  EXE MEM WB
sub $5, $2, $4          IF  IF  IF  ID  ID  ID  EXE MEM WB
sub $1, $3, $1                      IF  IF  IF  ID  EXE MEM WB
sw $1, 0($5)                                    IF  ID  ID  ID  EXE MEM WB

• While lw waits in ID: insert a “noop” in the EXE stage, then insert another “noop” in the EXE stage while the previous noop goes to the MEM stage
• 15 cycles! CPI == 3 (If there is no stall, CPI should be just 1!)

Sol. of data hazard II: Forwarding
• The result is available after the EXE and MEM stages, but is only written to the register file in WB!
• The data is already there, so we should use it right away!
• Also called bypassing

add $1, $2, $3  IF  ID  EXE   (we can obtain the result here!)
lw $4, 0($1)
sub $5, $2, $4
sub $1, $3, $1
sw $1, 0($5)

Sol. of data hazard II: Forwarding
• Take the values, wherever they are!

cycle           1   2   3   4   5   6   7   8   9   10
add $1, $2, $3  IF  ID  EXE MEM WB
lw $4, 0($1)        IF  ID  EXE MEM WB
sub $5, $2, $4          IF  ID  ID  EXE MEM WB
sub $1, $3, $1              IF  IF  ID  EXE MEM WB
sw $1, 0($5)                        IF  ID  EXE MEM WB

• 10 cycles! CPI == 2 (Not optimal, but much better!)
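The 15-cycle (stall only) and 10-cycle (forwarding) counts can be reproduced with a small timing model. This is a sketch under stated assumptions, not the hardware's control logic: every source operand is assumed to be needed at the start of EXE, the register file writes early and reads late, and the `Instr` record and `total_cycles` function are hypothetical names introduced here.

```python
from typing import NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[int]       # register written, or None (e.g. sw)
    srcs: Tuple[int, ...]     # registers read
    is_load: bool = False


def total_cycles(program, forwarding: bool) -> int:
    """Cycle in which the last instruction finishes WB (5-stage, in-order)."""
    if not program:
        return 0
    producer = {}             # register -> (EXE cycle, is_load) of its latest writer
    prev_exe = 2              # so the first instruction reaches EXE in cycle 3
    for ins in program:
        exe = prev_exe + 1    # at best one instruction enters EXE per cycle
        for reg in ins.srcs:
            if reg == 0 or reg not in producer:
                continue      # $0 is constant; untracked registers are ready
            p_exe, p_is_load = producer[reg]
            if forwarding:
                # ALU results are forwarded after EXE, load results after MEM.
                ready = p_exe + (2 if p_is_load else 1)
            else:
                # ID may share the producer's WB cycle (EXE + 2); EXE follows it.
                ready = p_exe + 3
            exe = max(exe, ready)
        if ins.dest is not None:
            producer[ins.dest] = (exe, ins.is_load)
        prev_exe = exe
    return prev_exe + 2       # MEM and WB follow EXE


PROGRAM = [
    Instr("add $1, $2, $3", 1, (2, 3)),
    Instr("lw $4, 0($1)", 4, (1,), is_load=True),
    Instr("sub $5, $2, $4", 5, (2, 4)),
    Instr("sub $1, $3, $1", 1, (3, 1)),
    Instr("sw $1, 0($5)", None, (1, 5)),
]

print(total_cycles(PROGRAM, forwarding=False))  # stall-only pipeline
print(total_cycles(PROGRAM, forwarding=True))   # with forwarding
```

The one stall that remains with forwarding comes from `sub $5, $2, $4` consuming the value loaded by `lw` in the very next instruction.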

When can/should we forward data?
• If the instruction entering the EXE stage consumes a result from a previous instruction that is entering the MEM stage or WB stage
  • A source of the instruction entering the EXE stage is the destination of an instruction entering the MEM/WB stage
  • The previous instruction must be an instruction that updates the register file
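The conditions above translate into a small selection function, evaluated once per ALU source operand. The sketch below is one common way to write it; the mux encodings and parameter names are assumptions for illustration, and the check against register 0 is added because $0 is hardwired to zero in MIPS.

```python
# Hypothetical encodings for the ForwardA / ForwardB mux selects.
NO_FORWARD, FROM_MEM_WB, FROM_EX_MEM = 0b00, 0b01, 0b10


def forward_select(src_reg, ex_mem_reg_write, ex_mem_dest,
                   mem_wb_reg_write, mem_wb_dest):
    """Choose where one ALU operand of the instruction entering EXE comes from."""
    # Prefer the instruction entering MEM: it holds the most recent value.
    if ex_mem_reg_write and ex_mem_dest != 0 and ex_mem_dest == src_reg:
        return FROM_EX_MEM
    # Otherwise take the value from the instruction entering WB.
    if mem_wb_reg_write and mem_wb_dest != 0 and mem_wb_dest == src_reg:
        return FROM_MEM_WB
    # No match, or the earlier instruction does not update the register file.
    return NO_FORWARD


# add $1, $2, $3 followed immediately by lw $4, 0($1): rs = $1 comes from EX/MEM.
print(forward_select(src_reg=1, ex_mem_reg_write=True, ex_mem_dest=1,
                     mem_wb_reg_write=False, mem_wb_dest=0))
```

Checking EX/MEM before MEM/WB matters when both earlier instructions write the same register: the newer value must win.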

Forwarding in hardware
[Figure: pipelined datapath with a forwarding unit that drives the ForwardA and ForwardB selects. Inputs to the forwarding unit: destination of Ins#1, Rs of Ins#2, Rt of Ins#2, ALU result of Ins#1, control of Ins#1, and control of Ins#2, where Ins#1 is the previous instruction and Ins#2 is the current instruction]
How about load?

Forwarding in hardware
[Figure: the same datapath with the forwarding unit also receiving Rd of Ins#1, the ALU/MEM result of Ins#1, and the control of Ins#1]

There is still a case that we have to stall...
• Revisit the following code:

cycle           1   2   3   4   5   6   7   8   9   10
add $1, $2, $3  IF  ID  EXE MEM WB
lw $4, 0($1)        IF  ID  EXE MEM WB
sub $5, $2, $4          IF  ID  ID  EXE MEM WB
sub $1, $3, $1              IF  IF  ID  EXE MEM WB
sw $1, 0($5)                        IF  ID  EXE MEM WB

• lw generates its result at the MEM stage, so we have to stall
• If the instruction entering the EXE stage depends on a load instruction that has not finished its MEM stage yet, we have to stall!
• We call this hazard detection

We need to know the following:
1. If an instruction in EX/MEM updates a register (RegWrite)
2. If an instruction in EX/MEM reads memory (MemRead)
3. If the destination register of EX/MEM is a source of ID/EX (rs, rt of ID/EX == rt of EX/MEM)
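The three checks listed above combine into a single stall condition. The sketch below follows the slide's naming (the load sits in EX/MEM, the dependent instruction in ID/EX); the function name and the extra test against register 0 are assumptions added for illustration.

```python
def must_stall(ex_mem_reg_write: bool, ex_mem_mem_read: bool, ex_mem_rt: int,
               id_ex_rs: int, id_ex_rt: int) -> bool:
    """True when the younger instruction needs a load result that is not ready yet."""
    return (
        ex_mem_reg_write                        # 1. the older instruction updates a register
        and ex_mem_mem_read                     # 2. ... and it is a load
        and ex_mem_rt != 0                      # writes to $0 never matter (assumed check)
        and ex_mem_rt in (id_ex_rs, id_ex_rt)   # 3. its destination is one of our sources
    )


# lw $4, 0($1) followed by sub $5, $2, $4: rt of the load is $4, a source of sub.
print(must_stall(True, True, 4, id_ex_rs=2, id_ex_rt=4))
# add $1, $2, $3 followed by lw $4, 0($1): the producer is not a load, so forward instead.
print(must_stall(True, False, 1, id_ex_rs=1, id_ex_rt=4))
```

When the function returns true, the hardware holds the PC and IF/ID and inserts a bubble, as described on the stall slide.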

Hazard detection with forwarding
[Figure: pipelined datapath with both the forwarding unit (ForwardA, ForwardB) and the hazard detection unit (IF/IDWrite)]

Control hazard

Control hazard
• The processor cannot determine the next PC to fetch

LOOP: lw $t3, 0($s0)
      addi $t0, $t0, 1
      add $v0, $v0, $t3
      addi $s0, $s0, 4
      bne $t1, $t0, LOOP
      lw $t3, 0($s0)       (next iteration: its fetch stalls behind the branch)

• 7 cycles per loop

Reducing the overhead of control hazards

Solution I: Delayed branches
• An agreement between ISA and hardware
  • “Branch delay” slots: the next N instructions after a branch are always executed
  • Compiler decides the instructions in branch delay slots
    • Reordering the instructions cannot affect the correctness of the program
  • MIPS has one branch delay slot
• Good
  • Simple hardware
• Bad
  • N cannot change
  • Sometimes cannot find good candidates for the slot

Solution I: Delayed branches

Original code:
LOOP: lw $t3, 0($s0)
      addi $t0, $t0, 1
      add $v0, $v0, $t3
      addi $s0, $s0, 4
      bne $t1, $t0, LOOP

Reordered code:
LOOP: lw $t3, 0($s0)
      addi $t0, $t0, 1
      add $v0, $v0, $t3
      bne $t1, $t0, LOOP
      addi $s0, $s0, 4     (branch delay slot)
      lw $t3, 0($s0)       (next iteration)

• 6 cycles per loop

Solution II: always predict not-taken
• Always predict the next PC is PC+4

LOOP: lw $t3, 0($s0)
      addi $t0, $t0, 1
      add $v0, $v0, $t3
      addi $s0, $s0, 4
      bne $t1, $t0, LOOP
      sw $v0, 0($s1)       (fetched on the predicted path)
      add $t4, $t3, $t5    (fetched on the predicted path)
      lw $t3, 0($s0)       (fetched once the branch turns out to be taken)

• If branch is not taken: no stalls!
• If branch is taken: doesn’t hurt! Flush the instructions fetched incorrectly
• 7 cycles per loop

Solution III: always predict taken
[Figure: pipelined datapath with the forwarding unit and the hazard detection unit]
[Figure: the same datapath with the shift-left-2 unit for the branch target relocated]
• Still have to stall 1 cycle
[Figure: the same datapath with a Branch Target Buffer (BTB); each entry maps a branch PC to its target address or target instruction]
• Consult BTB in fetch stage

Solution III: always predict taken
• Always predict taken with the help of BTB

LOOP: lw $t3, 0($s0)
      addi $t0, $t0, 1
      add $v0, $v0, $t3
      addi $s0, $s0, 4
      bne $t1, $t0, LOOP
      lw $t3, 0($s0)       (next iteration, fetched right after the branch)
      addi $t0, $t0, 1
      add $v0, $v0, $t3

• 5 cycles per loop (CPI == 1 !!!)
• But what if the branch is not always taken?

Dynamic branch prediction

1-bit counter
• Predict this branch will go the same way as the result of the last time this branch executed
• 1 for taken, 0 for not taken

branch PC   target address  counter
0x400420    0x8048324       1
0x400464    0x8048392       1
0x400578    0x804850a       0
0x41000C    0x8049624       1

PC = 0x400420: Taken!

2-bit counter
• A 2-bit counter for each branch
• Predict taken if the counter value >= 2
• If the counter is in a taken state, fetch from the target PC; otherwise, use PC+4

[Figure: state diagram with the states Taken 3 (11), Taken 2 (10), Not Taken 1 (01), and Not Taken 0 (00), with transitions labeled taken / not taken]

Branch Target Buffer
branch PC   target address  counter
0x400420    0x8048324       11
0x400464    0x8048392       10
0x400578    0x804850a       00
0x41000C    0x8049624       01

PC = 0x400420: Taken!
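A Branch Target Buffer with 2-bit counters can be modeled as a small table keyed by the branch PC. This is a simplified sketch: a real BTB is a fixed-size, tagged hardware structure, and the class name, the dictionary storage, and the initial counter value of 2 for new entries are assumptions made here.

```python
class BranchTargetBuffer:
    """Maps a branch PC to [target address, 2-bit saturating counter]."""

    def __init__(self):
        self.entries = {}

    def predict_next_pc(self, pc: int) -> int:
        """Consulted in the fetch stage: target if predicted taken, else PC + 4."""
        entry = self.entries.get(pc)
        if entry is not None and entry[1] >= 2:
            return entry[0]
        return pc + 4

    def update(self, pc: int, taken: bool, target: int) -> None:
        """Called when the branch resolves; moves the counter one step."""
        entry = self.entries.setdefault(pc, [target, 2])  # assumed initial state: 10
        entry[0] = target
        entry[1] = min(entry[1] + 1, 3) if taken else max(entry[1] - 1, 0)


btb = BranchTargetBuffer()
btb.entries[0x400420] = [0x8048324, 0b11]   # rows taken from the table above
btb.entries[0x400578] = [0x804850A, 0b00]

print(hex(btb.predict_next_pc(0x400420)))   # predicted taken: fetch from the target
print(hex(btb.predict_next_pc(0x400578)))   # predicted not taken: fetch PC + 4
```

One not-taken outcome moves an entry from 11 to 10, so the prediction flips only after two consecutive surprises, which is the advantage over the 1-bit scheme.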

Performance of 2-bit counter
• 2-bit state machine for each branch

for (i = 0; i < 10; i++) {
    sum += a[i];
}

i     state  predict  actual
1     10     T        T
2     11     T        T
3     11     T        T
4-9   11     T        T
10    11     T        NT

• 90% accuracy!
• Application: 80% ALU, 20% branch, and branches resolved in the EX stage. Average CPI?
  • 1 + 20% * (1 - 90%) * 2 = 1.04
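Both numbers on this slide can be recomputed directly. The sketch below assumes the counter starts in state 10, as in the table, and that a misprediction costs 2 cycles because the branch resolves in the EX stage; the function names are illustrative.

```python
def loop_branch_accuracy(iterations: int = 10, counter: int = 2) -> float:
    """Accuracy of one 2-bit counter on a loop branch that is taken until the last check."""
    correct = 0
    for i in range(1, iterations + 1):
        taken = i < iterations              # the final check leaves the loop
        predicted_taken = counter >= 2
        correct += predicted_taken == taken
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / iterations


def average_cpi(branch_fraction: float, accuracy: float,
                penalty_cycles: int, base_cpi: float = 1.0) -> float:
    """Base CPI plus the expected misprediction penalty per instruction."""
    return base_cpi + branch_fraction * (1 - accuracy) * penalty_cycles


accuracy = loop_branch_accuracy()
print(accuracy)
print(round(average_cpi(0.20, accuracy, penalty_cycles=2), 2))
```

Only the loop exit is mispredicted, so a longer loop gives a higher accuracy and a CPI closer to 1.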

Make the prediction better
• Consider the following code:

i = 0;
do {
    if (i % 3 != 0)      // Branch Y, taken if i % 3 == 0
        a[i] *= 2;
    a[i] += i;
} while (++i < 100);     // Branch X

i   branch  result
0   Y       T
0   X       T
1   Y       NT
1   X       T
2   Y       NT
2   X       T
3   Y       T
3   X       T
4   Y       NT
4   X       T
5   Y       NT
5   X       T
6   Y       T
6   X       T
7   Y       NT

Can we capture the pattern?

Predict using history
• Instead of using the PC to choose the predictor, use a bit vector (global history register, GHR) made up of the previous branch outcomes.
• Each entry in the history table has its own counter.

[Figure: an n-bit GHR indexes a history table with 2^n entries; the example table holds the counters 01, 11, 10, 11, 00, 11, 11, 10, and GHR = 101 (T, NT, T) selects a counter that predicts Taken!]

Performance of global history predictor
• Consider the following code:

i = 0;
do {
    if (i % 3 != 0)      // Branch Y, taken if i % 3 == 0
        a[i] *= 2;
    a[i] += i;
} while (++i < 100);     // Branch X

• Assume that we start with a 4-bit GHR = 0, and all counters are 10.

i   branch  GHR   BHT  prediction  actual  new BHT
0   Y       0000  10   T           T       11
0   X       0001  10   T           T       11
1   Y       0011  10   T           NT      01
1   X       0110  10   T           T       11
2   Y       1101  10   T           NT      01
2   X       1010  10   T           T       11
3   Y       0101  10   T           T       11
3   X       1011  10   T           T       11
4   Y       0111  10   T           NT      01
4   X       1110  10   T           T       11
5   Y       1101  01   NT          NT      00
5   X       1010  11   T           T       11
6   Y       0101  11   T           T       11
6   X       1011  11   T           T       11
7   Y       0111  01   NT          NT      00
7   X       1110  11   T           T       11
8   Y       1101  00   NT          NT      00
8   X       1010  11   T           T       11
9   Y       0101  11   T           T       11
9   X       1011  11   T           T       11
10  Y       0111  00   NT          NT      00

• Nearly perfect after this
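The table can be regenerated with a short simulation. This sketch assumes what the table shows: a 4-bit GHR that shifts in 1 for taken and 0 for not taken, one 2-bit counter per GHR value, and every counter starting at 10. The function name and the tuple layout of the log are choices made for this example.

```python
def simulate_global_history(iterations: int = 100, ghr_bits: int = 4):
    """Return one (i, branch, GHR, counter, predicted, actual, new counter) row per branch."""
    mask = (1 << ghr_bits) - 1
    counters = [2] * (1 << ghr_bits)    # every entry starts at 10
    ghr = 0
    log = []
    for i in range(iterations):
        # Branch Y is taken if i % 3 == 0; branch X is taken while ++i < iterations.
        for name, taken in (("Y", i % 3 == 0), ("X", i + 1 < iterations)):
            counter = counters[ghr]
            predicted = counter >= 2
            new_counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
            log.append((i, name, ghr, counter, predicted, taken, new_counter))
            counters[ghr] = new_counter
            ghr = ((ghr << 1) | int(taken)) & mask
    return log


log = simulate_global_history()
for i, name, ghr, counter, predicted, taken, new_counter in log[:6]:
    print(i, name, f"{ghr:04b}", f"{counter:02b}",
          "T" if predicted else "NT", "T" if taken else "NT", f"{new_counter:02b}")

mispredictions = sum(predicted != taken for _, _, _, _, predicted, taken, _ in log)
print(mispredictions, "mispredictions out of", len(log))
```

In the steady state the six branches of each three-iteration period see six different GHR values, so each one trains its own counter and the pattern is captured.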

Branch prediction and modern processors

Deeper pipeline
• Higher frequencies by shortening the pipeline stages
• Higher marketing value, since consumers usually link performance with frequency
• Potentially higher power consumption, as dynamic/active power = αCV²f
• If the execution time is better, the processor can still consume less energy

Case Study

Intel Pentium 4 Microarch.

Intel Pentium 4
• Very deep pipeline in order to achieve high frequency (starting from 1.5GHz)
  • 20 stages in Netburst
  • 31 stages in Prescott
• 103W (3.6GHz, 65nm)
• Reference
  • The Microarchitecture of the Pentium 4 Processor

[Figure: 20-stage pipeline; recoverable stage labels: TC Nxt IP, TC Fetch, 5 Drive, 6 Alloc, Rename, 9 Que, 13 Disp, 14 Disp, 18 Flgs, 19 Br Ck, 20 Drive]

AMD Athlon 64

AMD Athlon 64
• 12 stage pipeline
• 89W TDP (Opteron 2.2GHz 90nm)

[Figure: 12-stage pipeline; recoverable stage labels: 1 Inst. Addr Decode, 2 Inst Mem, 3 Inst. Byte, 6 Inst. Dbl. & Pack, 7 ID and Pack, 8 Dispatch, 9 Scheduling, 10 Execution, 11 D-Cache Address, 12 D-cache Access]

Demo revisited
• Why does sorting the array speed up the code despite the increased instruction count?

if (option)
    std::sort(data, data + arraySize);

for (unsigned iter = 0; iter < 100000; ++iter) {
    int threshold = std::rand();
    // The inner loop has its own index; the outer counter only repeats the experiment.
    for (unsigned i = 0; i < arraySize; ++i) {
        if (data[i] >= threshold)
            sum++;
    }
}
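One way to see the effect is to feed the outcomes of the `data[i] >= threshold` comparison to a 2-bit counter, once for shuffled data and once for sorted data. This is a simplified model, not a measurement: it assumes a single 2-bit counter starting at 10, uses a fixed threshold instead of `std::rand()`, and ignores the other branches in the loop.

```python
import random


def two_bit_accuracy(outcomes, counter: int = 2) -> float:
    """Fraction of outcomes that a single 2-bit saturating counter predicts correctly."""
    correct = 0
    for taken in outcomes:
        correct += (counter >= 2) == taken
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return correct / len(outcomes)


def comparison_outcomes(data, threshold):
    """Outcome of `data[i] >= threshold` for every element, in array order."""
    return [value >= threshold for value in data]


rng = random.Random(0)               # fixed seed so the run is repeatable
data = list(range(10_000))
rng.shuffle(data)
threshold = 5_000

shuffled_acc = two_bit_accuracy(comparison_outcomes(data, threshold))
sorted_acc = two_bit_accuracy(comparison_outcomes(sorted(data), threshold))
print(f"shuffled: {shuffled_acc:.4f}  sorted: {sorted_acc:.4f}")
```

After sorting, the comparison is false for a long run and then true for the rest, so the predictor is wrong only around the switch; on shuffled data the outcome is close to a coin flip. Fewer mispredictions mean fewer flushed instructions, which can outweigh the extra instructions spent on sorting.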

Deep pipelining and data hazards

Data hazard revisited
• How many cycles does it take to execute the following code?
• Draw the pipeline execution diagram
  • Assume that we have full data forwarding.

cycle              1   2   3   4   5   6   7   8   9
lw $t1, 0($a0)     IF  ID  EXE MEM WB
lw $a0, 0($t1)         IF  ID  ID  EXE MEM WB
bne $a0, $zero, 0          IF  IF  ID  ID  EXE MEM WB

• 9 cycles

INTEL® 64 AND IA-32 PROCESSOR ARCHITECTURES

2.1 THE SKYLAKE MICROARCHITECTURE
The Skylake microarchitecture builds on the successes of the Haswell and Broadwell microarchitectures. The basic pipeline functionality of the Skylake microarchitecture is depicted in Figure 2-1.

The Skylake microarchitecture offers the following enhancements:
• Larger internal buffers to enable deeper OOO execution and higher cache bandwidth.
• Improved front end throughput.
• Improved branch predictor.
• Improved divider throughput and latency.
• Lower power consumption.
• Improved SMT performance with Hyper-Threading Technology.
• Balanced floating-point ADD, MUL, FMA throughput and latency.

The microarchitecture supports flexible integration of multiple processor cores with a shared uncore sub-system consisting of a number of components including a ring interconnect to multiple slices of L3 (an off-die L4 is optional), processor graphics, integrated memory controller, interconnect fabrics, etc. A four-core configuration can be supported similar to the arrangement shown in Figure 2-3.

Figure 2-1. CPU Core Pipeline Functionality of the Skylake Microarchitecture
[Figure: 32K L1 Instruction Cache, MSROM, Decoded Icache (DSB), and Legacy Decode Pipeline feeding the Instruction Decode Queue (IDQ, or micro-op queue) at 5, 4, and 6 uops/cycle; Allocate/Rename/Retire/Move Elimination/Zero Idiom; Scheduler with execution ports 0, 1, 5, and 6 (integer ALU, vector, LEA, divide, and branch units) and memory ports 2 (LD/STA), 3 (LD/STA), 4 (STD), and 7 (STA); 32K L1 Data Cache; 256K L2 Cache (Unified)]

Intel’s latest SkyLake

Good reference for Intel microarchitectures: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
