Post on 25-Jan-2016
Chapter 2
Advanced Computer Architecture
EI 2
DLX Architecture
Instructor: MAHMOUD B
ALU instructions have two or three operands
ADD R3, R1, R2    R3 <- R1 + R2
or
ADD R1, R2        R1 <- R1 + R2
DLX Architecture
- A simple load/store instruction set
- Designed for pipelining efficiency
- An easily decoded instruction set
DLX registers:
- thirty-two 32-bit general-purpose registers (GPRs), named R0, R1, ..., R31. The value of R0 is always 0.
- thirty-two floating-point registers (FPRs), which can be used as 32 single-precision (32-bit) registers F0, F1, F2, ..., F31, or as even-odd pairs holding double-precision values. The 64-bit FPRs are named D0, D1, D2, ..., D31.
[Figure: register files in three columns labeled GPR (R0 = 0, R1, ..., R31), FPR1 (F0, F1, ..., F31), and FPR2 (D0, D1, ..., D31).]
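A minimal sketch of the integer register file, in Python. The class name `RegisterFile` and its methods are hypothetical, chosen only for illustration; the one behavior taken from the slides is that R0 always holds 0, so a write to it is discarded.

```python
class RegisterFile:
    """Hypothetical model of the 32 DLX general-purpose registers."""

    def __init__(self):
        self.regs = [0] * 32  # R0..R31, each holding a 32-bit value

    def read(self, n):
        return self.regs[n]

    def write(self, n, value):
        if n == 0:
            return  # R0 always holds 0: writes have no effect
        self.regs[n] = value & 0xFFFFFFFF  # keep only the low 32 bits


gpr = RegisterFile()
gpr.write(0, 123)  # ignored
gpr.write(1, 7)
print(gpr.read(0), gpr.read(1))
```

Masking on write is a modelling choice that keeps every stored value within 32 bits.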
Data types for DLX
Data types       Bit number                           Declaration
integer          8-bit bytes (B or b)                 .global a    a: .byte 44
integer          16-bit half words (HW or hw)         .global b    b: .halfword 44
integer          32-bit words (W or w)                .global x    x: .word 555
floating point   32-bit single precision (F or f)     .global y    y: .float 4474
floating point   64-bit double precision (D or d)     .global t    t: .double 908
Memory
- byte addressable
- Big Endian mode
- 32-bit address
- two addressing modes (immediate and displacement)
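As an illustration of byte-addressable, big-endian memory, the Python sketch below reads a 32-bit word from four consecutive bytes. The `memory` array, its size, and the helper `load_word` are assumptions made for this example, not part of DLX itself.

```python
memory = bytearray(16)  # hypothetical 16-byte memory, one entry per byte address
memory[4:8] = bytes([0x12, 0x34, 0x56, 0x78])


def load_word(mem, addr):
    """Read 4 consecutive bytes; the byte at the lowest address is the most significant."""
    return int.from_bytes(mem[addr:addr + 4], byteorder="big")


print(hex(load_word(memory, 4)))
```

In big-endian mode the byte stored at address 4 (0x12) ends up in the top 8 bits of the word.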
Instruction layout for DLX
All instructions are encoded in one of three types.
Instruction type/opcode      Instruction meaning
Data transfers               Move data between registers and memory, or between the integer and FP or special registers; the only memory addressing mode is 16-bit displacement + contents of a GPR
LB, LBU, SB                  Load byte, load byte unsigned, store byte
LH, LHU, SH                  Load halfword, load halfword unsigned, store halfword
LW, SW                       Load word, store word (to/from integer registers)
LF, LD, SF, SD               Load SP float, load DP float, store SP float, store DP float (SP = single precision, DP = double precision)
MOVI2S, MOVS2I               Move from/to GPR to/from a special register
MOVF, MOVD                   Copy one floating-point register or a DP pair to another register or pair
MOVFP2I, MOVI2FP             Move 32 bits from/to FP register to/from integer registers
Arithmetic/Logical           Operations on integer or logical data in GPRs; signed arithmetic traps on overflow
ADD, ADDI, ADDU, ADDUI       Add, add immediate (all immediates are 16 bits); signed and unsigned
SUB, SUBI, SUBU, SUBUI       Subtract, subtract immediate; signed and unsigned
MULT, MULTU, DIV, DIVU       Multiply and divide, signed and unsigned; operands must be floating-point registers; all operations take and yield 32-bit values
AND, ANDI                    And, and immediate
OR, ORI, XOR, XORI           Or, or immediate, exclusive or, exclusive or immediate
LHI                          Load high immediate: loads upper half of register with immediate
SLL, SRL, SRA, SLLI, SRLI, SRAI   Shifts: both immediate (S__I) and variable form (S__); shifts are shift left logical, right logical, right arithmetic
S__, S__I                    Set conditional: "__" may be LT, GT, LE, GE, EQ, NE
Control                      Conditional branches and jumps; PC-relative or through register
BEQZ, BNEZ                   Branch GPR equal/not equal to zero; 16-bit offset from PC
BFPT, BFPF                   Test comparison bit in the FP status register and branch; 16-bit offset from PC
J, JR                        Jumps: 26-bit offset from PC (J) or target in register (JR)
JAL, JALR                    Jump and link: save PC+4 to R31; target is PC-relative (JAL) or a register (JALR)
TRAP                         Transfer to operating system at a vectored address
RFE                          Return to user code from an exception; restore user mode
Floating point               Floating-point operations on DP and SP formats
ADDD, ADDF                   Add DP, SP numbers
SUBD, SUBF                   Subtract DP, SP numbers
MULTD, MULTF                 Multiply DP, SP floating point
DIVD, DIVF                   Divide DP, SP floating point
CVTF2D, CVTF2I, CVTD2F, CVTD2I, CVTI2F, CVTI2D   Convert instructions: CVTx2y converts from type x to type y, where x and y are one of I (integer), D (double precision), or F (single precision). Both operands are in the FP registers.
__D, __F                     DP and SP compares: "__" may be LT, GT, LE, GE, EQ, NE; set comparison bit in FP status register
Operations
There are four classes of instructions:
- Load/Store: any of the GPRs or FPRs may be loaded and stored, except that loading R0 has no effect.
- ALU operations: all ALU instructions are register-register instructions. The operations are:
  - add, subtract, AND, OR, XOR, shifts
  - compare instructions, which compare two registers (=, !=, <, >, <=, >=). If the condition is true, these instructions place a 1 in the destination register; otherwise they place a 0.
- Branches/Jumps: all branches are conditional. The branch condition is specified by the instruction, which may test the source register for zero or nonzero.
- Floating-point operations: add, subtract, multiply, divide.
The load and store instructions in DLX
Notation: "<-n" transfers n bits; "_i" selects bit i of a field, with bit 0 the most significant; "^n" replicates a field n times; "##" concatenates two fields.

Example instruction   Instruction name     Meaning
LW R1,30(R2)          Load word            Regs[R1] <-32 Mem[30+Regs[R2]]
LW R1,1000(R0)        Load word            Regs[R1] <-32 Mem[1000+0]; register R0 always contains 0
LB R1,40(R3)          Load byte            Regs[R1] <-32 (Mem[40+Regs[R3]]_0)^24 ## Mem[40+Regs[R3]]
LBU R1,40(R3)         Load byte unsigned   Regs[R1] <-32 0^24 ## Mem[40+Regs[R3]]
LH R1,40(R3)          Load half word       Regs[R1] <-32 (Mem[40+Regs[R3]]_0)^16 ## Mem[40+Regs[R3]] ## Mem[41+Regs[R3]]
LF F0,50(R3)          Load float           Regs[F0] <-32 Mem[50+Regs[R3]]
LD F0,50(R2)          Load double          Regs[F0] ## Regs[F1] <-64 Mem[50+Regs[R2]]
SW 500(R4),R3         Store word           Mem[500+Regs[R4]] <-32 Regs[R3]
SF 40(R3),F0          Store float          Mem[40+Regs[R3]] <-32 Regs[F0]
SD 40(R3),F0          Store double         Mem[40+Regs[R3]] <-32 Regs[F0]; Mem[44+Regs[R3]] <-32 Regs[F1]
SH 502(R2),R3         Store half           Mem[502+Regs[R2]] <-16 Regs[R3]_16..31
SB 41(R3),R2          Store byte           Mem[41+Regs[R3]] <-8 Regs[R2]_24..31
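The difference between LB, LBU, and LH is the extension of the loaded value to 32 bits. The Python sketch below mirrors the table entries; the function names, the helper `sign_extend`, and the three-byte `mem` array are hypothetical and serve only as an illustration.

```python
def sign_extend(value, bits):
    """Replicate the top bit of a `bits`-wide value across a 32-bit result."""
    sign = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return ((value ^ sign) - sign) & 0xFFFFFFFF


def lb(mem, addr):
    """Load byte: sign bit replicated 24 times, then the byte."""
    return sign_extend(mem[addr], 8)


def lbu(mem, addr):
    """Load byte unsigned: 24 zero bits, then the byte."""
    return mem[addr]


def lh(mem, addr):
    """Load halfword (big-endian): sign bit replicated 16 times, then two bytes."""
    return sign_extend((mem[addr] << 8) | mem[addr + 1], 16)


def sb(mem, addr, reg_value):
    """Store byte: only the low 8 bits of the register (bits 24..31) are written."""
    mem[addr] = reg_value & 0xFF


mem = bytearray([0x80, 0x01, 0x7F])
print(hex(lb(mem, 0)), hex(lbu(mem, 0)), hex(lh(mem, 0)))
```

The same byte 0x80 loads as 0xffffff80 with LB and as 0x80 with LBU, which is the only difference between the two instructions.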
Examples of arithmetic/logical instructions on DLX
Example instruction   Instruction name               Meaning
ADD R1, R2, R3        Add                            Regs[R1] <- Regs[R2] + Regs[R3]
ADDI R1, R2, #3       Add immediate                  Regs[R1] <- Regs[R2] + 3
LHI R1, #42           Load high immediate            Regs[R1] <- 42 ## 0^16
SLLI R1, R2, #5       Shift left logical immediate   Regs[R1] <- Regs[R2] << 5
SLT R1, R2, R3        Set less than                  if (Regs[R2] < Regs[R3]) Regs[R1] <- 1 else Regs[R1] <- 0
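The three less obvious entries (LHI, SLLI, SLT) can be sketched in Python as follows. The function names are hypothetical, and the sketch assumes that SLT compares its operands as signed two's-complement values.

```python
MASK32 = 0xFFFFFFFF


def to_signed(x):
    """Interpret a 32-bit pattern as a signed two's-complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def lhi(imm16):
    """LHI: immediate in the upper 16 bits, 16 zero bits below (imm ## 0^16)."""
    return (imm16 & 0xFFFF) << 16


def slli(a, shamt):
    """SLLI: shift left logical, discarding bits shifted past bit 31."""
    return (a << shamt) & MASK32


def slt(a, b):
    """SLT: 1 if a < b (signed comparison assumed), else 0."""
    return 1 if to_signed(a) < to_signed(b) else 0


print(hex(lhi(42)), slli(3, 5), slt(0xFFFFFFFF, 1))
```

A 32-bit constant can be built by following LHI with an OR-immediate of the lower half.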
Typical control-flow instructions in DLX
Example instruction   Instruction name         Meaning
J name                Jump                     PC <- name; ((PC+4) - 2^25) <= name < ((PC+4) + 2^25)
JAL name              Jump and link            Regs[R31] <- PC+4; PC <- name; ((PC+4) - 2^25) <= name < ((PC+4) + 2^25)
JALR R2               Jump and link register   Regs[R31] <- PC+4; PC <- Regs[R2]
JR R3                 Jump register            PC <- Regs[R3]
BEQZ R4, name         Branch equal zero        if (Regs[R4] == 0) PC <- name; ((PC+4) - 2^15) <= name < ((PC+4) + 2^15)
BNEZ R4, name         Branch not equal zero    if (Regs[R4] != 0) PC <- name; ((PC+4) - 2^15) <= name < ((PC+4) + 2^15)
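The range conditions in the table follow from the width of the signed offset: 16 bits for branches, 26 bits for J and JAL. A small Python check, with the hypothetical helper name `target_in_range`, makes the limits concrete.

```python
def target_in_range(pc, target, offset_bits=16):
    """True if `target` is reachable from the instruction at `pc`.

    The offset is signed and relative to PC+4, so the reachable window is
    (PC+4) - 2^(bits-1) <= target < (PC+4) + 2^(bits-1).
    """
    base = pc + 4
    half = 1 << (offset_bits - 1)
    return base - half <= target < base + half


pc = 0x1000
print(target_in_range(pc, pc + 4 + 32767))      # last reachable address for BEQZ/BNEZ
print(target_in_range(pc, pc + 4 + 32768))      # one past the window
print(target_in_range(pc, pc + 4 + 32768, 26))  # reachable with the 26-bit offset of J/JAL
```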
An Implementation of DLX
Every DLX instruction can be implemented in at most five clock cycles. The five clock cycles are:
1. Instruction fetch cycle (IF)
2. Instruction decode/register fetch cycle (ID)
3. Execution/effective address cycle (EX)
4. Memory access/branch completion cycle (MEM)
5. Write-back cycle (WB)
Detailed description of each follows:
Instruction fetch cycle (IF):
IR <- Mem[PC]
NPC <- PC + 4
Operation:
- Send out the PC and fetch the instruction from memory into the instruction register (IR).
- Increment the PC by 4 to address the next sequential instruction.
- The IR holds the instruction that will be needed on subsequent clock cycles.
- The NPC holds the next sequential PC (program counter).
Instruction decode/register fetch cycle (ID):
A <- Regs[IR_6..10]
B <- Regs[IR_11..15]
Imm <- ((IR_16)^16 ## IR_16..31)
Operation:
- Decode the instruction and access the register file to read the registers.
- The outputs of the general-purpose registers are read into two temporary registers (A and B) for use in later clock cycles.
- The lower 16 bits of the IR are also sign-extended and stored into the temporary register Imm, for use in the next cycle.
- Decoding is done in parallel with reading registers, which is possible because these fields are at a fixed location in the DLX instruction format. This technique is known as fixed-field decoding.
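Fixed-field decoding can be sketched in Python as below. The sketch assumes the bit-numbering convention of the register-transfer notation, in which bit 0 is the most significant bit of the 32-bit instruction word. The helper names `field` and `decode` and the dictionary keys are hypothetical; the top six bits of the sample word are left at 0 because opcode values are not given in these slides.

```python
def field(ir, first, last):
    """Extract bits first..last of a 32-bit word, where bit 0 is the most significant."""
    width = last - first + 1
    return (ir >> (31 - last)) & ((1 << width) - 1)


def decode(ir):
    """Read the fixed fields used in ID; every field is extracted regardless of opcode."""
    imm16 = field(ir, 16, 31)
    imm = imm16 - 0x10000 if imm16 & 0x8000 else imm16  # sign-extend 16 bits
    return {
        "A_reg": field(ir, 6, 10),   # register read into A
        "B_reg": field(ir, 11, 15),  # register read into B
        "Imm": imm,
    }


# Sample word: bits 6..10 = 2, bits 11..15 = 1, immediate = 0xFFFC (-4)
ir = (2 << 21) | (1 << 16) | 0xFFFC
print(decode(ir))
```

Because the fields sit at fixed positions, all three extractions can proceed before the opcode is known, which is why the hardware can read registers while it decodes.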
Execution/effective address cycle (EX):
The ALU operates on the operands prepared in the prior cycle, performing one of four functions depending on the DLX instruction type.

Memory reference:
ALUOutput <- A + Imm
Operation: The ALU adds the operands to form the effective address and places the result into the register ALUOutput.

Register-Register ALU instruction:
ALUOutput <- A op B
Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register B. The result is placed in the register ALUOutput.

Register-Immediate ALU instruction:
ALUOutput <- A op Imm
Operation: The ALU performs the operation specified by the opcode on the value in register A and on the value in register Imm. The result is placed in the register ALUOutput.

Branch:
ALUOutput <- NPC + Imm
Cond <- (A op 0)
Operation:
- The ALU adds the NPC to the sign-extended immediate value in Imm to compute the address of the branch target.
- Register A, which has been read in the prior cycle, is checked to determine whether the branch is taken.
- The comparison operation op is the relational operator determined by the branch opcode (e.g. op is "==" for the instruction BEQZ).
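The four EX-stage cases can be summarized in one Python function. This is only a sketch: the class labels ("mem", "reg-reg", "reg-imm", "branch") and the function name `ex_stage` are hypothetical, and the ALU operation and branch comparison are passed in as ordinary Python operators rather than decoded from an opcode.

```python
import operator

MASK32 = 0xFFFFFFFF


def ex_stage(kind, a, b, imm, npc, op=operator.add, rel=operator.eq):
    """Return the temporaries written during EX for one instruction class."""
    if kind == "mem":      # effective address of a load or store
        return {"ALUOutput": (a + imm) & MASK32}
    if kind == "reg-reg":  # A op B
        return {"ALUOutput": op(a, b) & MASK32}
    if kind == "reg-imm":  # A op Imm
        return {"ALUOutput": op(a, imm) & MASK32}
    if kind == "branch":   # branch target and condition (A op 0)
        return {"ALUOutput": (npc + imm) & MASK32, "Cond": rel(a, 0)}
    raise ValueError(f"unknown instruction class: {kind}")


print(ex_stage("mem", a=100, b=0, imm=30, npc=0))
print(ex_stage("branch", a=0, b=0, imm=-8, npc=0x104, rel=operator.eq))
```

For BNEZ the caller would pass `rel=operator.ne`; the target computation is identical.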
Memory access/branch completion cycle (MEM):
The only DLX instructions active in this cycle are loads, stores, and branches.

Memory reference:
LMD <- Mem[ALUOutput] or Mem[ALUOutput] <- B
Operation:
- Access memory if needed.
- If the instruction is a load, data returns from memory and is placed in the LMD (load memory data) register.
- If the instruction is a store, data from the B register is written into memory.
- In either case, the address used is the one computed during the prior cycle and stored in the register ALUOutput.

Branch:
if (Cond) PC <- ALUOutput else PC <- NPC
Operation:
- If the instruction branches, the PC is replaced with the branch destination address in the register ALUOutput.
- Otherwise, the PC is replaced with the incremented PC in the register NPC.
Write-back cycle (WB):
Register-Register ALU instruction: Regs[IR_16..20] <- ALUOutput
Register-Immediate ALU instruction: Regs[IR_11..15] <- ALUOutput
Load instruction: Regs[IR_11..15] <- LMD
Operation:
- Write the result into the register file, whether it comes from memory (LMD) or from the ALU (ALUOutput).
- The destination register field is in one of two positions, depending on the opcode.
Instruction: the different execution phases
Phase 1: IF
Instruction fetch: the memory address of the instruction to execute is stored in a processor register called the PC (Program Counter). The instruction pointed to by the PC is fetched from memory and stored in another processor register, the IR (Instruction Register).
Phase 2: ID
Instruction decode: each instruction is identified by a code (opcode). Based on this code, the processor selects the task to perform, that is, the sequence of micro-instructions to execute.
Phase 3: EX
Instruction execution: the ALU executes the operation of the instruction.
Phase 4: M - WB
Result storage in the registers or in memory (write-back): at the end of this phase, execution returns to the first phase.
02/02/2009 48Bouraoui MAHMOUD
Processor: control path
[Figure: block diagram of a processor made of a control part (sequencer, micro-memory) and a processing part (registers, operator), connected to memory and to the inputs/outputs by the address bus, data bus, and control bus.]
Processor: data path
[Figure: the same processor block diagram (control part with sequencer and micro-memory, processing part with registers and operator, memory, inputs/outputs, address, data, and control buses), annotated with an ADD operation on registers $r1, $er1, $r4 and with the control part writing to the register file.]
Execution phases of an ALU-type instruction
- Instruction fetch, PC update
- Instruction decode
- Operand fetch
- Execution
- Storing the result in the registers or in memory
PC (Program Counter) = address of the instruction to execute
Buffer registers:
• MAR (Memory Address Register)
• MDR (Memory Data Register)
• IR (Instruction Register)
[Figure: the steps above shown alongside the stage labels IF/PC, ID/IF, EX, MEM, WB.]
Execution phases of an instruction by type: ALU, memory, and branch
[Figure: three panels titled "ALU instructions", "Memory instructions", and "Branch instructions".]
Pipeline for DLX
While each instruction takes five clock cycles to complete, during each clock cycle the hardware will initiate a new instruction and will be executing some part of five different instructions.

              Clock cycle
Instruction   1    2    3    4    5    6    7    8    9
instr i       IF   ID   EX   MEM  WB
instr i+1          IF   ID   EX   MEM  WB
instr i+2               IF   ID   EX   MEM  WB
instr i+3                    IF   ID   EX   MEM  WB
instr i+4                         IF   ID   EX   MEM  WB
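The staggered pattern in the table can be generated programmatically. The Python sketch below assumes an ideal pipeline with no stalls; the names `pipeline_diagram` and `total_cycles` are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_diagram(n_instructions):
    """Map each instruction index to {clock cycle: stage} for a stall-free pipeline."""
    diagram = {}
    for i in range(n_instructions):
        # Instruction i enters IF in cycle i+1 and advances one stage per cycle.
        diagram[i] = {i + 1 + s: name for s, name in enumerate(STAGES)}
    return diagram


def total_cycles(n_instructions):
    """Cycles until the last instruction leaves WB."""
    return len(STAGES) + n_instructions - 1


n = 5
for i, row in pipeline_diagram(n).items():
    cells = [row.get(c, "").ljust(4) for c in range(1, total_cycles(n) + 1)]
    print(f"instr i+{i}", " ".join(cells))
```

For five instructions this gives nine clock cycles, matching the table: five to drain the first instruction, then one more for each instruction that follows.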
Pipelined execution model
• Non-pipelined (sequential)
[Figure: instruction timelines with the labels "1 Inst", "1/5 Inst", "2/5 Inst", "3/5 Inst", "4/5 Inst".]
There are three observations on which this fact rests:
- The basic datapath uses separate instruction and data memories. This eliminates the conflict for a single memory that would arise between instruction fetch and data memory access.
- The register file is used in two stages: for reading in ID and for writing in WB. This means that we need to perform two reads and one write on every clock cycle. Question for you: what if a read and a write are to the same register?
- To start a new instruction every clock, we must increment and store the PC every clock, and this must be done during the IF stage in preparation for the next instruction. The problem arises when we consider the effect of branches, which also change the PC, but not until the MEM stage.
Pipelined Datapath
The following table shows what happens in each pipeline stage, depending on the instruction type.

Stage IF (any instruction):
  IF/ID.IR <- Mem[PC];
  IF/ID.NPC, PC <- (if EX/MEM.cond {EX/MEM.NPC} else {PC+4});

Stage ID (any instruction):
  ID/EX.A <- Regs[IF/ID.IR_6..10];
  ID/EX.B <- Regs[IF/ID.IR_11..15];
  ID/EX.NPC <- IF/ID.NPC;
  ID/EX.IR <- IF/ID.IR;
  ID/EX.Imm <- (IR_16)^16 ## IR_16..31

Stage EX:
  ALU instruction:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.ALUoutput <- ID/EX.A op ID/EX.B;
    or EX/MEM.ALUoutput <- ID/EX.A op ID/EX.Imm;
    EX/MEM.cond <- 0;
  Load or store instruction:
    EX/MEM.IR <- ID/EX.IR;
    EX/MEM.ALUoutput <- ID/EX.A + ID/EX.Imm;
    EX/MEM.cond <- 0;
    EX/MEM.B <- ID/EX.B
  Branch instruction:
    EX/MEM.ALUoutput <- ID/EX.NPC + ID/EX.Imm;
    EX/MEM.cond <- (ID/EX.A op 0);

Stage MEM:
  ALU instruction:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.ALUoutput <- EX/MEM.ALUoutput;
  Load or store instruction:
    MEM/WB.IR <- EX/MEM.IR;
    MEM/WB.LMD <- Mem[EX/MEM.ALUoutput];
    or Mem[EX/MEM.ALUoutput] <- EX/MEM.B;

Stage WB:
  ALU instruction:
    Regs[MEM/WB.IR_16..20] <- MEM/WB.ALUoutput;
    or Regs[MEM/WB.IR_11..15] <- MEM/WB.ALUoutput;
  Load instruction:
    Regs[MEM/WB.IR_11..15] <- MEM/WB.LMD;
Performance Issues in Pipelining
Simple example
Consider a nonpipelined machine with 6 execution stages of lengths 50 ns, 50 ns, 60 ns, 60 ns, 50 ns, and 50 ns.
- Find the instruction latency on this machine.
- How much time does it take to execute 100 instructions?

Solution:
Instruction latency = 50+50+60+60+50+50 = 320 ns
Time to execute 100 instructions = 100*320 = 32000 ns

Suppose we introduce pipelining on this machine. Assume that when introducing pipelining, the clock skew adds 5 ns of overhead to each execution stage.
- What is the instruction latency on the pipelined machine?
- How much time does it take to execute 100 instructions?
Solution:
Remember that in the pipelined implementation, the pipe stages must all have the same length, i.e., the time of the slowest stage plus overhead. With 5 ns of overhead this comes to:
Length of a pipelined stage = MAX(lengths of unpipelined stages) + overhead = 60 + 5 = 65 ns
Instruction latency = 6 * 65 = 390 ns (once the pipeline is full, one instruction completes every 65 ns)
Time to execute 100 instructions = 65*6*1 + 65*1*99 = 390 + 6435 = 6825 ns
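The arithmetic of both solutions can be checked with a few lines of Python. The variable names are hypothetical; the stage lengths, overhead, and instruction count are the ones given in the example.

```python
stages_ns = [50, 50, 60, 60, 50, 50]  # unpipelined stage lengths
overhead_ns = 5                       # clock skew added to each pipelined stage
n = 100                               # number of instructions

# Nonpipelined: every instruction takes the sum of all stage times.
unpipelined_latency = sum(stages_ns)
unpipelined_total = n * unpipelined_latency

# Pipelined: every stage is as long as the slowest stage plus overhead.
cycle_ns = max(stages_ns) + overhead_ns
first_instruction_ns = cycle_ns * len(stages_ns)  # the first instruction passes through all six stages
pipelined_total = first_instruction_ns + cycle_ns * (n - 1)

print(unpipelined_latency, unpipelined_total)
print(cycle_ns, first_instruction_ns, pipelined_total)
```

The first instruction pays for all six stages; each of the remaining 99 then completes one cycle after its predecessor.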