Chap2.pdf

Chapter 2

Advanced Computer Architecture

EI 2

DLX Architecture

Instructor: MAHMOUD B

An ALU instruction has two or three operands:

ADD R3, R1, R2    R3 <- R1 + R2

or

ADD R1, R2        R1 <- R1 + R2

DLX Architecture

- A simple load/store instruction set
- Design for pipelining efficiency
- An easily decoded instruction set

Registers

DLX has:
- thirty-two 32-bit general-purpose registers (GPRs), named R0, R1, ..., R31. The value of R0 is always 0.
- thirty-two floating-point registers (FPRs), which can be used as 32 single-precision (32-bit) registers F0, F1, F2, ..., F31, or as even-odd pairs holding double-precision values. Thus, the 64-bit FPRs are named D0, D1, D2, ..., D31.

[Figure: the register files. GPR: R0 = 0, R1, ..., R31. FPR (single precision): F0, F1, ..., F31. FPR (double precision): D0, D1, ..., D31.]
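The rule that R0 always holds 0 can be illustrated with a small model of the integer register file. This is only a sketch: the class name `RegisterFile` and its methods are hypothetical, chosen for this illustration, and do not come from the DLX definition.

```python
class RegisterFile:
    """Thirty-two 32-bit general-purpose registers; R0 always reads as 0."""

    def __init__(self):
        self._regs = [0] * 32

    def read(self, index):
        return self._regs[index]

    def write(self, index, value):
        # A write to R0 is dropped, so R0 keeps the value 0.
        if index != 0:
            self._regs[index] = value & 0xFFFFFFFF  # keep only 32 bits


regs = RegisterFile()
regs.write(1, 7)
regs.write(0, 99)  # has no effect
print(regs.read(1), regs.read(0))
```

Dropping the write in one place keeps every instruction that names R0 as a destination harmless, which is why R0 can serve as a constant zero operand.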

Data types for DLX

Data types        Bit number                          Declaration
integer           8-bit bytes (B or b)                .global a    a: .byte 44
integer           16-bit half words (HW or hw)        .global b    b: .halfword 44
integer           32-bit words (W or w)               .global x    x: .word 555
floating point    32-bit single precision (F or f)    .global y    y: .float 4474
floating point    64-bit double precision (D or d)    .global t    t: .double 908

Memory

- byte addressable
- Big Endian mode
- 32-bit address
- two addressing modes (immediate and displacement)
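To make "byte addressable" and "Big Endian" concrete, the sketch below stores a 32-bit word into a small byte array. The memory size and the helper names `store_word` and `load_word` are assumptions made for this example only.

```python
memory = bytearray(16)  # a tiny byte-addressable memory (size chosen for the demo)


def store_word(address, value):
    # Big Endian: the most significant byte goes to the lowest address.
    memory[address:address + 4] = (value & 0xFFFFFFFF).to_bytes(4, "big")


def load_word(address):
    return int.from_bytes(memory[address:address + 4], "big")


store_word(4, 0x12345678)
print([hex(b) for b in memory[4:8]])
print(hex(load_word(4)))
```

After the store, address 4 holds the byte 0x12 and address 7 holds 0x78, which is the opposite order from a little-endian machine.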

Instruction layout for DLX

All instructions are encoded in one of three types.

Instruction type/opcode              Instruction meaning

Data transfers                       Move data between registers and memory, or between the integer and FP or special registers; the only memory address mode is 16-bit displacement + contents of a GPR
LB, LBU, SB                          Load byte, load byte unsigned, store byte
LH, LHU, SH                          Load halfword, load halfword unsigned, store halfword
LW, SW                               Load word, store word (to/from integer registers)
LF, LD, SF, SD                       Load SP float, load DP float, store SP float, store DP float (SP - single precision, DP - double precision)
MOVI2S, MOVS2I                       Move from/to GPR to/from a special register
MOVF, MOVD                           Copy one floating-point register or a DP pair to another register or pair
MOVFP2I, MOVI2FP                     Move 32 bits from/to FP register to/from integer registers

Arithmetic / Logical                 Operations on integer or logical data in GPRs; signed arithmetic traps on overflow
ADD, ADDI, ADDU, ADDUI               Add, add immediate (all immediates are 16 bits); signed and unsigned
SUB, SUBI, SUBU, SUBUI               Subtract, subtract immediate; signed and unsigned
MULT, MULTU, DIV, DIVU               Multiply and divide, signed and unsigned; operands must be floating-point registers; all operations take and yield 32-bit values
AND, ANDI                            And, and immediate
OR, ORI, XOR, XORI                   Or, or immediate, exclusive or, exclusive or immediate
LHI                                  Load high immediate: loads the upper half of the register with the immediate
SLL, SRL, SRA, SLLI, SRLI, SRAI      Shifts: both immediate (S__I) and variable form (S__); shifts are shift left logical, right logical, right arithmetic
S__, S__I                            Set conditional: "__" may be LT, GT, LE, GE, EQ, NE

Control                              Conditional branches and jumps; PC-relative or through a register
BEQZ, BNEZ                           Branch GPR equal/not equal to zero; 16-bit offset from PC
BFPT, BFPF                           Test comparison bit in the FP status register and branch; 16-bit offset from PC
J, JR                                Jumps: 26-bit offset from PC (J) or target in register (JR)
JAL, JALR                            Jump and link: save PC+4 to R31; target is PC-relative (JAL) or a register (JALR)
TRAP                                 Transfer to operating system at a vectored address
RFE                                  Return to user code from an exception; restore user mode

Floating point                       Floating-point operations on DP and SP formats
ADDD, ADDF                           Add DP, SP numbers
SUBD, SUBF                           Subtract DP, SP numbers
MULTD, MULTF                         Multiply DP, SP floating point
DIVD, DIVF                           Divide DP, SP floating point
CVTF2D, CVTF2I, CVTD2F,              Convert instructions: CVTx2y converts from type x to type y, where x and y are one of
CVTD2I, CVTI2F, CVTI2D               I (Integer), D (Double precision), or F (Single precision). Both operands are in the FP registers.
__D, __F                             DP and SP compares: "__" may be LT, GT, LE, GE, EQ, NE; set comparison bit in FP status register

Operations

There are four classes of instructions:

Load/Store
Any of the GPRs or FPRs may be loaded and stored, except that loading R0 has no effect.

ALU Operations
All ALU instructions are register-register instructions. The operations are:
- add, subtract, AND, OR, XOR, shifts
- Compare instructions compare two registers (=, !=, <, >, <=, >=). If the condition is true, these instructions place a 1 in the destination register; otherwise they place a 0.

Branches/Jumps
All branches are conditional. The branch condition is specified by the instruction, which may test the register source for zero or nonzero.

Floating-Point Operations
- add
- subtract
- multiply
- divide

The load and store instructions in DLX

Notation: "<-n" transfers n bits; "x_i" selects bit i of x and "x_i..j" selects bits i through j (bit 0 is the most significant); "x^n" replicates x n times; "##" concatenates.

Example instruction    Instruction name      Meaning
LW R1,30(R2)           Load word             Regs[R1] <-32 Mem[30+Regs[R2]]
LW R1,1000(R0)         Load word             Regs[R1] <-32 Mem[1000+0] ; register R0 always contains 0
LB R1,40(R3)           Load byte             Regs[R1] <-32 (Mem[40+Regs[R3]]_0)^24 ## Mem[40+Regs[R3]]
LBU R1,40(R3)          Load byte unsigned    Regs[R1] <-32 0^24 ## Mem[40+Regs[R3]]
LH R1,40(R3)           Load half word        Regs[R1] <-32 (Mem[40+Regs[R3]]_0)^16 ## Mem[40+Regs[R3]] ## Mem[41+Regs[R3]]
LF F0,50(R3)           Load float            Regs[F0] <-32 Mem[50+Regs[R3]]
LD F0,50(R2)           Load double           Regs[F0] ## Regs[F1] <-64 Mem[50+Regs[R2]]
SW 500(R4),R3          Store word            Mem[500+Regs[R4]] <-32 Regs[R3]
SF 40(R3),F0           Store float           Mem[40+Regs[R3]] <-32 Regs[F0]
SD 40(R3),F0           Store double          Mem[40+Regs[R3]] <-32 Regs[F0]; Mem[44+Regs[R3]] <-32 Regs[F1]
SH 502(R2),R3          Store half            Mem[502+Regs[R2]] <-16 Regs[R3]_16..31
SB 41(R3),R2           Store byte            Mem[41+Regs[R3]] <-8 Regs[R2]_24..31
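The difference between LB, LBU and LH in the table above comes down to how the upper bits of the destination register are filled. The sketch below models this on a Big Endian byte array; the function names and the memory size are assumptions made for the illustration, and register values are represented as unsigned 32-bit integers.

```python
memory = bytearray(64)  # small byte-addressable memory for the demo


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def lb(address):
    # Load byte: the sign bit of the byte is replicated into the upper 24 bits.
    return sign_extend(memory[address], 8) & 0xFFFFFFFF


def lbu(address):
    # Load byte unsigned: the upper 24 bits are zero.
    return memory[address]


def lh(address):
    # Load halfword: two consecutive bytes (Big Endian), sign-extended to 32 bits.
    half = (memory[address] << 8) | memory[address + 1]
    return sign_extend(half, 16) & 0xFFFFFFFF


def sb(address, reg_value):
    # Store byte: only bits 24..31 (the low 8 bits) of the register are written.
    memory[address] = reg_value & 0xFF


sb(40, 0xFFFFFF80)
print(hex(lb(40)), hex(lbu(40)), hex(lh(40)))
```

With the byte 0x80 stored at address 40, LB yields 0xFFFFFF80 while LBU yields 0x80, which is the whole point of having both forms.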

Examples of arithmetic/logical instructions on DLX

Example instruction    Instruction name                Meaning
ADD R1, R2, R3         Add                             Regs[R1] <- Regs[R2] + Regs[R3]
ADDI R1, R2, #3        Add immediate                   Regs[R1] <- Regs[R2] + 3
LHI R1, #42            Load high immediate             Regs[R1] <- 42 ## 0^16
SLLI R1, R2, #5        Shift left logical immediate    Regs[R1] <- Regs[R2] << 5
SLT R1, R2, R3         Set less than                   if (Regs[R2] < Regs[R3]) Regs[R1] <- 1 else Regs[R1] <- 0
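The meanings in the table above can be checked with a few small functions operating on 32-bit values. This is a sketch under stated assumptions: the function names are hypothetical, the overflow trap of signed ADD is not modelled, and SLT is assumed to compare its operands as signed numbers.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value):
    """Reinterpret a 32-bit pattern as a signed two's-complement integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def add(a, b):
    return (a + b) & MASK32  # the overflow trap is not modelled here


def lhi(imm16):
    # Load high immediate: imm ## 0^16, i.e. the immediate in the upper half.
    return (imm16 & 0xFFFF) << 16


def slli(a, shift_amount):
    return (a << shift_amount) & MASK32


def slt(a, b):
    # Assumed to be a signed comparison.
    return 1 if to_signed(a) < to_signed(b) else 0


print(hex(lhi(42)), slli(1, 5), slt(0xFFFFFFFF, 1))
```

LHI followed by an ADDI or ORI on the same register is the usual way to build a full 32-bit constant from two 16-bit immediates.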

Typical control-flow instructions in DLX

Example instruction    Instruction name          Meaning
J name                 Jump                      PC <- name; ((PC+4) - 2^25) <= name < ((PC+4) + 2^25)
JAL name               Jump and link             R31 <- PC+4; PC <- name; ((PC+4) - 2^25) <= name < ((PC+4) + 2^25)
JALR R2                Jump and link register    Regs[R31] <- PC+4; PC <- Regs[R2]
JR R3                  Jump register             PC <- Regs[R3]
BEQZ R4, name          Branch equal zero         if (Regs[R4] == 0) PC <- name; ((PC+4) - 2^15) <= name < ((PC+4) + 2^15)
BNEZ R4, name          Branch not equal zero     if (Regs[R4] != 0) PC <- name; ((PC+4) - 2^15) <= name < ((PC+4) + 2^15)
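The range conditions in the table say that the target must be reachable with a signed 16-bit offset (branches) or a signed 26-bit offset (jumps) measured from PC+4. A minimal sketch of that check follows; the function name `branch_offset` is an assumption for this example.

```python
def branch_offset(pc, target, bits=16):
    """Return target - (pc + 4), or raise ValueError if it does not fit in `bits` bits."""
    offset = target - (pc + 4)
    limit = 1 << (bits - 1)
    # Same condition as ((PC+4) - 2^(bits-1)) <= name < ((PC+4) + 2^(bits-1)).
    if not -limit <= offset < limit:
        raise ValueError(f"target {target:#x} is out of range for a {bits}-bit offset")
    return offset


print(branch_offset(100, 120))            # BEQZ/BNEZ: 16-bit offset
print(branch_offset(0, 1 << 20, bits=26))  # J/JAL: 26-bit offset
```

A target that fails the 16-bit check for a branch may still be reachable with J or, for any address, with JR through a register.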

An Implementation of DLX

Every DLX instruction can be implemented in at most five clock cycles. The five clock cycles are:

1. Instruction fetch cycle (IF)

2. Instruction decode/register fetch (ID)

3. Execution/Effective address cycle (EX)

4. Memory access/branch completion cycle (MEM)

5. Write-back cycle (WB)

Detailed description of each follows:

Instruction fetch cycle (IF):

IR <- Mem[PC]
NPC <- PC + 4

Operation:
- Send out the PC and fetch the instruction from memory into the instruction register (IR).
- Increment the PC by 4 to address the next sequential instruction.
- The IR is used to hold the instruction that will be needed on subsequent clock cycles.
- The NPC is used to hold the next sequential PC (program counter).

Instruction decode/register fetch (ID):

A <- Regs[IR_6..10]
B <- Regs[IR_11..15]
Imm <- ((IR_16)^16 ## IR_16..31)

Here IR_i..j denotes bits i through j of the IR (bit 0 is the most significant), x^n replicates x n times, and ## concatenates.

Operation:
- Decode the instruction and access the register file to read the registers.
- The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles.
- The lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm, for use in the next cycle.
- Decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the DLX instruction format. This technique is known as fixed-field decoding.
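Fixed-field decoding can be sketched as plain bit extraction from a 32-bit instruction word. The field positions 6..10, 11..15, 16..20 and 16..31 are the ones used in this chapter; treating bits 0..5 as the opcode is an assumption of this sketch, as are the function names and the example opcode value.

```python
def field(ir, first, last):
    """Extract bits first..last of a 32-bit word, numbering bit 0 as the most significant."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def decode(ir):
    imm = field(ir, 16, 31)
    if imm & 0x8000:          # sign-extend the 16-bit immediate
        imm -= 1 << 16
    return {
        "opcode": field(ir, 0, 5),      # assumed position of the opcode
        "rs1": field(ir, 6, 10),        # register read into A
        "rs2_or_rd": field(ir, 11, 15), # register read into B, or destination
        "rd": field(ir, 16, 20),        # destination of register-register ALU ops
        "imm": imm,
    }


# Hypothetical encoding: opcode 0x23, rs1 = R2, bits 11..15 = R1, immediate = -4.
word = (0x23 << 26) | (2 << 21) | (1 << 16) | 0xFFFC
print(decode(word))
```

Every field is extracted without looking at the opcode first, which is what allows the register reads to start in parallel with decoding.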

Execution/Effective address cycle (EX):

The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the DLX instruction type.

Memory reference:
ALUOutput <- A + Imm
Operation: The ALU adds the operands to form the effective address and places the result into the register ALUOutput.

Register-Register ALU instruction:
ALUOutput <- A op B
Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register B. The result is placed in the register ALUOutput.

Register-Immediate ALU instruction:
ALUOutput <- A op Imm
Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register Imm. The result is placed in the register ALUOutput.

Branch:
ALUOutput <- NPC + Imm
Cond <- (A op 0)
Operation:
- The ALU adds the NPC to the sign-extended immediate value in Imm to compute the address of the branch target.
- Register A, which has been read in the prior cycle, is checked to determine whether the branch is taken.
- The comparison operation op is the relational operator determined by the branch opcode (e.g. op is "==" for the instruction BEQZ).

Memory access/branch completion cycle (MEM):

The only DLX instructions active in this cycle are loads, stores, and branches.

Memory reference:
LMD <- Mem[ALUOutput] or Mem[ALUOutput] <- B
Operation:
- Access memory if needed.
- If the instruction is a load, data returns from memory and is placed in the LMD (load memory data) register.
- If the instruction is a store, data from the B register is written into memory.
- In either case, the address used is the one computed during the prior cycle and stored in the register ALUOutput.

Branch:
if (cond) PC <- ALUOutput else PC <- NPC
Operation:
- If the instruction branches, the PC is replaced with the branch destination address in the register ALUOutput.
- Otherwise, the PC is replaced with the incremented PC in the register NPC.

Write-back cycle (WB):

Register-Register ALU instruction: Regs[IR_16..20] <- ALUOutput
Register-Immediate ALU instruction: Regs[IR_11..15] <- ALUOutput
Load instruction: Regs[IR_11..15] <- LMD

Operation:
- Write the result into the register file, whether it comes from memory (LMD) or from the ALU (ALUOutput).
- The register destination field is in one of two positions, depending on the opcode.
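The five cycles described above can be strung together in a small interpreter. This is a sketch, not the DLX implementation itself: instructions are written as Python dictionaries rather than encoded words, the `kind`/`op` names are hypothetical, memory is modelled as a dictionary of words, and only a handful of operations are covered.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "and": operator.and_}


def step(state, program):
    regs, mem = state["regs"], state["mem"]
    # IF: fetch the instruction and compute the next sequential PC.
    ir = program[state["pc"]]
    npc = state["pc"] + 4
    # ID: read the source registers and the immediate.
    a = regs[ir.get("rs1", 0)]
    b = regs[ir.get("rs2", 0)]
    imm = ir.get("imm", 0)
    # EX: one of four functions, depending on the instruction type.
    kind, cond = ir["kind"], False
    if kind in ("load", "store"):
        alu_output = a + imm
    elif kind == "alu":
        alu_output = OPS[ir["op"]](a, b)
    elif kind == "alui":
        alu_output = OPS[ir["op"]](a, imm)
    elif kind == "beqz":
        alu_output, cond = npc + imm, (a == 0)
    else:
        raise ValueError(f"unknown instruction kind: {kind}")
    # MEM: memory access, and PC update (branch target if taken, else NPC).
    lmd = None
    if kind == "load":
        lmd = mem.get(alu_output, 0)
    elif kind == "store":
        mem[alu_output] = b
    state["pc"] = alu_output if cond else npc
    # WB: write LMD or ALUOutput to the destination register (R0 stays 0).
    result = lmd if kind == "load" else alu_output if kind in ("alu", "alui") else None
    if result is not None and ir["rd"] != 0:
        regs[ir["rd"]] = result & 0xFFFFFFFF


program = {
    0: {"kind": "alui", "op": "add", "rd": 1, "rs1": 0, "imm": 5},  # ADDI R1, R0, #5
    4: {"kind": "store", "rs1": 0, "rs2": 1, "imm": 100},           # SW 100(R0), R1
    8: {"kind": "load", "rd": 2, "rs1": 0, "imm": 100},             # LW R2, 100(R0)
    12: {"kind": "beqz", "rs1": 0, "imm": 8},                       # BEQZ R0, taken
}
state = {"regs": [0] * 32, "mem": {}, "pc": 0}
for _ in range(4):
    step(state, program)
print(state["regs"][1], state["regs"][2], state["mem"][100], state["pc"])
```

The temporaries `a`, `b`, `imm`, `alu_output` and `lmd` play the role of the A, B, Imm, ALUOutput and LMD registers named in the cycle descriptions.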

Instruction: the phases of execution

Phase 1: IF
Instruction fetch: the memory address of the instruction to execute is stored in a processor register called the PC (Program Counter). The instruction pointed to by the PC is fetched from memory and stored in another processor register, the IR (Instruction Register).

Phase 2: ID
Instruction decode: each instruction is identified by a code (opcode). Based on this code, the processor selects the task to perform, that is, the sequence of micro-instructions to execute.

Phase 3: EX
Instruction execution: the ALU carries out the operation specified by the instruction.

Phase 4: M - WB
Storing the result in the registers or in memory (write-back): at the end of this phase, the processor returns to the first phase.

02/02/2009 Bouraoui MAHMOUD

Processor: control path

[Figure: block diagram of the processor, made of a control unit (sequencer, micro-memory) and a processing unit (registers, operator), connected to the memory and the inputs/outputs by the address bus, the data bus, and the control bus.]


Processor: data path (datapath)

[Figure: the same block diagram (control unit with sequencer and micro-memory; processing unit with registers and operator; memory; inputs/outputs; address, data, and control buses), showing an ADD operation on registers and the control signal that writes into the register file.]


Execution phases of an ALU-type instruction

IF: instruction fetch and PC update
ID: instruction decode and operand fetch
EX: execution
MEM/WB: storing the result in the registers or in memory

PC (Program Counter) = address of the instruction to execute

Buffer registers:
• MAR (Memory Address Register)
• MDR (Memory Data Register)
• IR (Instruction Register)


Execution phases of an instruction by type: ALU, memory, and branch

[Figure: three panels: ALU instructions, memory instructions, branch instructions.]


Pipeline for DLX

While each instruction takes five clock cycles to complete, during each clock cycle the hardware will initiate a new instruction and will be executing some part of five different instructions.

Instr \ Clock    1     2     3     4     5     6     7     8     9
instr i          IF    ID    EX    MEM   WB
instr i+1              IF    ID    EX    MEM   WB
instr i+2                    IF    ID    EX    MEM   WB
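The staggered pattern in the table can be generated mechanically: instruction i+k occupies stage s during clock cycle k+s+1. The sketch below builds the table for five instructions, which fills the nine clock cycles shown in the header; the function name `pipeline_table` is an assumption for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_table(num_instructions):
    """Return one row per instruction; each row lists the stage active in every clock cycle."""
    total_cycles = num_instructions + len(STAGES) - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, stage in enumerate(STAGES):
            row[i + s] = stage  # instruction i enters stage s in cycle i + s (0-based)
        rows.append(row)
    return rows


for i, row in enumerate(pipeline_table(5)):
    print(f"instr i+{i}: " + " ".join(f"{cell:>3}" for cell in row))
```

Reading the output by column shows that from clock cycle 5 onward all five stages are busy with five different instructions.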



Pipelined execution model

• Non-pipelined (sequential)

[Figure: execution timing diagram; labels: 1/5 Inst, 2/5 Inst, 3/5 Inst, 4/5 Inst, 1 Inst.]

There are three observations on which this fact rests:

- The basic datapath uses separate instruction and data memories. This eliminates a conflict for a single memory that would arise between instruction fetch and data memory access.

- The register file is used in two stages: for reading in ID and for writing in WB. This does mean that we need to perform two reads and one write on every clock cycle. Question for you: what if a read and a write are to the same register?

- To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. The problem arises when we consider the effect of branches, which also change the PC, but not until the MEM stage.

Pipelined Datapath

The following table shows what happens in any pipeline stage depending on the instruction type.

Stage IF (any instruction):
  IF/ID.IR <- Mem[PC];
  IF/ID.NPC, PC <- (if EX/MEM.cond {EX/MEM.ALUoutput} else {PC+4});

Stage ID (any instruction):
  ID/EX.A <- Regs[IF/ID.IR_6..10];
  ID/EX.B <- Regs[IF/ID.IR_11..15];
  ID/EX.NPC <- IF/ID.NPC;
  ID/EX.IR <- IF/ID.IR;
  ID/EX.Imm <- (IR_16)^16 ## IR_16..31;

Stage EX:
  ALU instruction:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.ALUoutput <- ID/EX.A op ID/EX.B;
    or EX/MEM.ALUoutput <- ID/EX.A op ID/EX.Imm;
    EX/MEM.cond <- 0;
  Load or Store instruction:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.ALUoutput <- ID/EX.A + ID/EX.Imm;
    EX/MEM.cond <- 0;
    EX/MEM.B <- ID/EX.B;
  Branch instruction:
    EX/MEM.ALUoutput <- ID/EX.NPC + ID/EX.Imm;
    EX/MEM.cond <- (ID/EX.A op 0);

Stage MEM:
  ALU instruction:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.ALUoutput <- EX/MEM.ALUoutput;
  Load or Store instruction:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.LMD <- Mem[EX/MEM.ALUoutput];
    or Mem[EX/MEM.ALUoutput] <- EX/MEM.B;

Stage WB:
  ALU instruction:
    Regs[MEM/WB.IR_16..20] <- MEM/WB.ALUoutput;
    or Regs[MEM/WB.IR_11..15] <- MEM/WB.ALUoutput;
  Load instruction:
    Regs[MEM/WB.IR_11..15] <- MEM/WB.LMD;

Performance Issues in Pipelining

Simple example

Consider a nonpipelined machine with 6 execution stages of lengths 50 ns, 50 ns, 60 ns, 60 ns, 50 ns, and 50 ns.

- Find the instruction latency on this machine.
- How much time does it take to execute 100 instructions?

Solution:

Instruction latency = 50 + 50 + 60 + 60 + 50 + 50 = 320 ns
Time to execute 100 instructions = 100 * 320 = 32000 ns

Suppose we introduce pipelining on this machine. Assume that when introducing pipelining, the clock skew adds 5 ns of overhead to each execution stage.

- What is the instruction latency on the pipelined machine?
- How much time does it take to execute 100 instructions?

Solution:

Remember that in the pipelined implementation, the lengths of the pipe stages must all be the same, i.e., the length of the slowest stage plus overhead. With 5 ns overhead it comes to:

Length of a pipelined stage = MAX(lengths of unpipelined stages) + overhead = 60 + 5 = 65 ns
Instruction latency = 6 * 65 = 390 ns (once the pipeline is full, one instruction completes every 65 ns)
Time to execute 100 instructions = 65*6*1 + 65*1*99 = 390 + 6435 = 6825 ns
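The arithmetic of this example can be reproduced in a few lines. The sketch assumes, as the solution does, that the first instruction needs one cycle per stage to fill the pipeline and that each following instruction completes one cycle later; the function and variable names are chosen for this illustration.

```python
stage_ns = [50, 50, 60, 60, 50, 50]  # unpipelined stage lengths
overhead_ns = 5                      # clock skew added to each pipelined stage
count = 100                          # number of instructions


def unpipelined_time(stages, n):
    return sum(stages) * n


def pipelined_cycle(stages, overhead):
    # Every stage is stretched to the slowest stage plus the overhead.
    return max(stages) + overhead


def pipelined_time(stages, overhead, n):
    cycle = pipelined_cycle(stages, overhead)
    # len(stages) cycles for the first instruction, then one cycle per extra instruction.
    return cycle * (len(stages) + n - 1)


print(sum(stage_ns), unpipelined_time(stage_ns, count))
print(pipelined_cycle(stage_ns, overhead_ns),
      pipelined_cycle(stage_ns, overhead_ns) * len(stage_ns),
      pipelined_time(stage_ns, overhead_ns, count))
```

Changing `count` shows how the cost of filling the pipeline becomes negligible as the number of instructions grows.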
