
    1

    Santanu Chaudhury

    Computer Architecture EEL308

    2

    Books:

    Computer Organization and Design, The Hardware/Software Interface

    Author(s) : Patterson & Hennessy

    Imprint: Morgan Kaufmann

    Additional Reference :

    (i) Computer Architecture and Organisation : J.P. Hayes

    (ii) Hamacher & Zaky


    3

    Evaluation & Attendance Policy

    Minor-1 : 20, Minor-2 : 20

    Major: 35

    Tutorial, Assignments, Quiz : 25

    Attendance Policy:

    One grade less if attendance less than 75%

    No E-grade if attendance less than 75%

    4

    Introduction


    5

    What is a computer?

    An electronic device that can be programmed for solving a problem

    Components of a Computer

    processor

    input (mouse, keyboard, scanner, camera)

    output (display, printer)

    memory (disk drives, DRAM, SRAM, CD)

    network

    Rapidly changing technology

    vacuum tube -> transistor -> IC -> VLSI

    doubling every 1.5 years: memory capacity

    processor speed

    6

    Computer Architecture and Organization

    Computer Architecture refers to those attributes of a system visible to a programmer

    instruction set

    number of bits used to represent various data types

    i/o mechanisms

    techniques for addressing memory

    Computer Organization refers to the operational units and their interconnections that realize the architectural specifications

    control signals

    interfaces between the computer and peripherals; memory technology

    Distinction is fundamental

    manufacturers can offer computer models with the same architecture but with differences in organization


    7

    Structure and Function

    A Computer is a complex system

    hierarchical nature of complex systems is essential for their design and description

    behavior at each level depends only on a simplified, abstracted characterization of the system at the lower level

    At each level the designer is concerned with

    structure

    CPU - Central Processing Unit

    Main memory: Stores data

    I/O: moves data between computer and its external environment

    System Interconnection

    function

    data processing, data storage

    data movement

    control

    8

    Structure - Top Level

    [Figure: top-level structure - peripherals and communication lines connect to the computer, which comprises the central processing unit, main memory, I/O, and the systems interconnection.]


    9

    Structure - The CPU

    [Figure: CPU structure - the system bus links the CPU, memory, and I/O; inside the CPU, the arithmetic and logic unit, the control unit, and the registers communicate over the internal CPU interconnection.]

    10

    Structure - The Control Unit

    [Figure: control unit structure - within the CPU, the control unit (sequencing logic, control memory, registers and decoders) drives the ALU and registers over the internal bus.]


    11

    History

    First generation computers were made with vacuum valves and used punched cards as the main (non-volatile) storage medium. A general purpose computer of this era was 'ENIAC' (Electronic Numerical Integrator and Computer), which was completed in 1946.

    The next major step in the history of computing was the invention of the transistor in 1947. Transistorized computers are normally referred to as 'Second Generation' and dominated the late 1950s and early 1960s.

    12

    History

    'Third Generation' computers used Jack St. Clair Kilby's invention - the integrated circuit or microchip;

    the first integrated circuit was produced in September 1958 but computers using them didn't begin to appear until 1963.

    In 1964 IBM announced System/360 with increased storage and

    processing capabilities.

    formed the foundation of modern computer architecture

    In 1971 Intel announced the 4004 - the first chip to contain all of the

    components of a CPU

    the microprocessor was born

    Fourth generation computers used very large scale integration (VLSI) as the underlying technology

    8086 and all of Intel's processors for the IBM PC and compatibles

    Supercomputers of the era were immensely powerful, like the

    Cray-1


    13

    CRAY X-MP

    14

    Performance of computers has shown drastic improvement

    over time

    Parameter for evaluating Performance: Response Time (latency)

    How long does it take for my job to run?

    How long does it take to execute a job?

    How long must I wait for the database query?

    Parameter for evaluating Performance: Throughput

    How many jobs can the machine run at once?

    What is the average execution rate?

    How much work is getting done?

    Performance of Computers


    15

    Elapsed Time covers everything (disk and memory accesses, I/O, etc.)

    a useful indicator, but often not good for comparison purposes

    CPU time

    doesn't count I/O or time spent running other programs

    can be broken up into system time, and user time

    User CPU time

    time spent executing the lines of code that are "in" our program

    Execution Time

    16

    For some program running on machine X,

    Performance_X = 1 / Execution time_X

    "X is n times faster than Y"

    Performance_X / Performance_Y = n

    A Definition of Performance


    17

    Clock Cycles

    Instead of reporting execution time in seconds, we often use cycles

    Internal clock in a computer co-ordinates execution of instructions

    Clock ticks indicate when to start activities (one abstraction):

    cycle time = time between ticks = seconds per cycle

    clock rate (frequency) = cycles per second

    time: seconds / program = (cycles / program) x (seconds / cycle)

    18

    Could assume that # of cycles = # of instructions

    However, different instructions take different amounts of time on different machines.

    [Figure: a time line divided into clock cycles, with the 1st, 2nd, 3rd, ... instructions occupying different numbers of cycles.]

    How many cycles are required for a program?


    19

    Multiplication takes more time than addition

    Floating point operations take longer than integer ones

    Accessing memory takes more time than accessing registers

    Different numbers of cycles for different instructions

    20

    A given program will require

    some number of instructions (machine instructions)

    some number of cycles

    some number of seconds

    We have a vocabulary that relates these quantities:

    cycle time (seconds per cycle)

    clock rate (cycles per second)

    CPI (cycles per instruction) - a floating point intensive application might have a higher CPI

    MIPS (millions of instructions per second)

    this would be higher for a program using simple instructions

    Terminology
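    To make the vocabulary concrete, here is a minimal C sketch (not from the slides; the instruction count, CPI, and clock rate are assumed values) showing how the quantities combine into execution time and a MIPS rating:

    #include <stdio.h>

    /* Execution time = instruction count x CPI x cycle time
                      = instruction count x CPI / clock rate          */
    int main(void) {
        double instructions = 1e9;    /* assumed: 10^9 instructions   */
        double cpi          = 1.5;    /* assumed: average CPI         */
        double clock_rate   = 2e9;    /* assumed: 2 GHz clock         */

        double seconds = instructions * cpi / clock_rate;   /* 0.75 s */
        double mips    = instructions / (seconds * 1e6);    /* ~1333  */

        printf("execution time = %.2f s, MIPS = %.1f\n", seconds, mips);
        return 0;
    }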


    21

    Suppose we have two machines

    For some program,

    Machine A has a clock cycle time of 10 ns and a CPI of 2.0

    Machine B has a clock cycle time of 20 ns and a CPI of 1.2

    What machine is faster for this program, and by how much?

    CPI Example

    If the program on both machines requires the

    same number of instructions, N, then

    machine A requires 2*N*10ns and

    machine B requires 1.2*N*20 ns
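    A minimal C sketch of the comparison (it assumes, as stated above, that both machines execute the same N instructions; any N gives the same ratio):

    #include <stdio.h>

    int main(void) {
        double N = 1e6;                   /* any instruction count; the ratio is unchanged */
        double timeA = N * 2.0 * 10e-9;   /* CPI 2.0, 10 ns cycle -> 20N ns */
        double timeB = N * 1.2 * 20e-9;   /* CPI 1.2, 20 ns cycle -> 24N ns */

        printf("A: %g s   B: %g s\n", timeA, timeB);
        printf("A is %.2f times faster than B\n", timeB / timeA);   /* 1.20 */
        return 0;
    }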

    22

    A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions: Class A, Class B, and Class C, and they require one, two, and three cycles (respectively).

    The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C.

    The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C.

    Which sequence will be faster? By how much? What is the CPI for each sequence?

    Number of Instructions Example

    First sequence: 10 cycles, CPI = 2.0

    Second sequence: 9 cycles, CPI = 1.5


    23

    Two different compilers are being tested for a 100 MHz machine with three different classes of instructions: Class A, Class B, and Class C, which require one, two, and three cycles (respectively). Both compilers are used to produce code for a large piece of software.

    The first compiler's code uses 5 million Class A instructions, 1 million Class B instructions, and 1 million Class C instructions.

    The second compiler's code uses 10 million Class A instructions, 1 million Class B instructions, and 1 million Class C

    instructions.

    Which sequence will be faster according to MIPS? Which sequence will be faster according to execution time?

    MIPS example
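    The slide leaves the answer as an exercise; a minimal C sketch using only the numbers given above shows why the two metrics disagree:

    #include <stdio.h>

    int main(void) {
        double clock = 100e6;                         /* 100 MHz machine          */
        double cyc1  = 5e6*1 + 1e6*2 + 1e6*3;         /* compiler 1: 10M cycles   */
        double cyc2  = 10e6*1 + 1e6*2 + 1e6*3;        /* compiler 2: 15M cycles   */
        double ins1  = 5e6 + 1e6 + 1e6;               /* 7M instructions          */
        double ins2  = 10e6 + 1e6 + 1e6;              /* 12M instructions         */

        double t1 = cyc1 / clock, t2 = cyc2 / clock;  /* 0.10 s vs 0.15 s         */
        printf("time: %.2f s vs %.2f s\n", t1, t2);
        printf("MIPS: %.0f vs %.0f\n", ins1/t1/1e6, ins2/t2/1e6);   /* 70 vs 80  */
        /* compiler 2 looks better by MIPS, but compiler 1 is faster in execution time */
        return 0;
    }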

    24

    Performance best determined by running a real application

    Use programs typical of expected workload

    Or, typical of expected class of applications, e.g., compilers/editors, scientific applications, graphics,

    etc.

    SPEC (System Performance Evaluation Cooperative)

    companies have agreed on a set of real programs and inputs

    valuable indicator of performance (and compiler

    technology)

    Benchmarks


    25

    SPEC 95 benchmarks and descriptions:

    go - Artificial intelligence; plays the game of Go
    m88ksim - Motorola 88k chip simulator; runs test program
    gcc - The Gnu C compiler generating SPARC code
    compress - Compresses and decompresses file in memory
    li - Lisp interpreter
    ijpeg - Graphic compression and decompression
    perl - Manipulates strings and prime numbers
    vortex - A database program
    tomcatv - A mesh generation program
    swim - Shallow water model with 513 x 513 grid
    su2cor - Quantum physics; Monte Carlo simulation
    hydro2d - Astrophysics; hydrodynamic Navier-Stokes equations
    mgrid - Multigrid solver in 3-D potential field
    applu - Parabolic/elliptic partial differential equations
    turb3d - Simulates isotropic, homogeneous turbulence in a cube
    apsi - Solves problems regarding temperature, wind velocity, and distribution
    fpppp - Quantum chemistry
    wave5 - Plasma physics; electromagnetic particle simulation

    26

    SPEC95

    Can a machine with a slower clock rate have better performance?

    [Plot: SPECfp versus clock rate (50-250 MHz) for the Pentium and the Pentium Pro.]


    27

    Uniprocessor to Multiprocessor

    Multiple processors on a single chip - multi-core microprocessor

    Impact more on throughput than on response time

    To improve response time may need to rewrite the code to take advantage of multiple cores

    28

    Instruction Set Architecture


    29

    Instruction Set Architecture

    A very important abstraction

    interface between hardware and low-level software

    standardizes instructions, machine language bit patterns, etc.

    There can be different implementations of the

    same architecture

    30

    Instructions

    Language of the Machine

    More primitive than higher level languages

    Very restrictive

    Variety of functions a CPU may perform are reflected in its instruction set


    31

    Instruction Set of Computers: Complex Instruction Set

    Each instruction in a CISC instruction set might perform a series of operations inside the processor.

    Reduces the number of instructions required to implement a given program, and allows the programmer to learn a small but flexible set of instructions.

    You can even have a single instruction computer

    Since earlier memory was slow and expensive, the CISC philosophy made sense

    Most common microprocessor designs --- including the Intel 80x86 and Motorola 68K series --- also follow the CISC philosophy

    Later, it was discovered that, by reducing the full set to only the most frequently used instructions, the computer would get more work done in a shorter amount of time for most applications - RISC

    32

    Instruction Set of Computers: Reduced Instruction Set

    Background

    With advances in semiconductor technology the difference in speed between main memory and processor reduced.

    a sequence of simple instructions produces the same results as a sequence of complex instructions, but can be implemented with simpler (and faster) hardware - assuming that memory can keep up

    RISC forms the basis for modern design


    33

    Instruction Set of Computers: Reduced Instruction Set

    RISC characteristics

    Simple instruction set.

    In a RISC machine, the instruction set contains simple, basic instructions, from which more complex instructions can be composed.

    Same length instructions.

    Each instruction is the same length, so that it may be fetched in a single operation.

    1 machine-cycle instructions.

    Most instructions complete in one machine cycle

    34

    RISC Architecture

    We'll be working with the MIPS instruction set

    architecture

    similar to other architectures developed since

    the 1980's

    used by NEC, Nintendo, Silicon Graphics,

    Sony


    35

    Instructions are bits

    Programs are stored in memory

    to be read or written just like data

    Fetch & Execute Cycle

    Instructions are fetched and put into a special register in the processor

    Bits in the register "control" the subsequent actions

    Fetch the next instruction and continue

    [Figure: processor connected to memory, which holds data, programs, compilers, editors, etc.]

    Stored Program Concept

    36

    Elements of Instruction

    Elements of an instruction

    Operation Code

    Source Operand Reference

    one or two

    Result Operand Reference

    (may be) Next Instruction Reference


    37

    MIPS arithmetic Instructions

    All instructions have 3 operands

    Operand order is fixed (destination first)

    Example:

    C code: A = B + C

    MIPS code: add $s0, $s1, $s2

    $si - indicates REGISTERS: storage inside the CPU, associated with variables by the compiler

    Operands can be only registers: 32 registers provided

    Principle of Regularity

    38

    Registers vs. Memory

    [Figure: the five classic components - processor (control and datapath), memory, input, and output devices, with I/O connecting to the outside.]

    Arithmetic instructions' operands must be registers; only 32 registers provided

    Compiler associates variables with registers

    What about programs with lots of variables ?


    39

    Memory Organization

    Viewed as a large, single-dimension array, with an address. A memory address is an index into the array

    "Byte addressing" means that the index points to a byte ofmemory.

    [Figure: byte-addressed memory - addresses 0, 1, 2, 3, ... each select 8 bits of data.]

    40

    Memory Organization

    Bytes are nice, but most data items use larger "words"

    For MIPS, a word is 32 bits or 4 bytes.

    2^32 bytes with byte addresses from 0 to 2^32 - 1

    2^30 words with byte addresses 0, 4, 8, ..., 2^32 - 4

    [Figure: word-addressed view - addresses 0, 4, 8, 12, ... each select 32 bits of data.]

    Registers hold 32 bits of data


    41

    More Instructions

    MIPS: loading words but addressing bytes; arithmetic on registers only

    Instruction Meaning

    add $s1, $s2, $s3 $s1 = $s2 + $s3

    sub $s1, $s2, $s3   $s1 = $s2 - $s3

    lw $s1, 100($s2) $s1 = Memory[$s2+100]

    sw $s1, 100($s2) Memory[$s2+100] = $s1

    42

    Instructions

    Load and store instructions

    Example:

    C code: A[8] = h + A[8];

    MIPS code: lw  $t0, 32($s3)
               add $t0, $s2, $t0
               sw  $t0, 32($s3)

    Store word has destination last

    Remember arithmetic operands are registers, not memory!


    43

    Our First Example

    Can we figure out the code?

    swap(int v[], int k)
    { int temp;
      temp = v[k];
      v[k] = v[k+1];
      v[k+1] = temp;
    }

    swap: muli $2, $5, 4
          add  $2, $4, $2
          lw   $15, 0($2)
          lw   $16, 4($2)
          sw   $16, 0($2)
          sw   $15, 4($2)
          jr   $31

    44

    Instructions, like registers and words of data, are also 32 bits long

    Example: add $t0, $s1, $s2

    registers have numbers, $t0=9, $s1=17, $s2=18

    Instruction Format:

    op rs rt rd shamt funct

    op: 6 bits opcode field

    rs: 5 bits first register source operand

    rt: 5 bits second register source operand

    rd: 5 bits register destination operand

    shamt: 5 bits shift amount to be used in shift instructions

    funct: 6 bits selects specific variant of the operation in the opcode field

    This is R-type instruction format

    Machine Language
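    As a sketch of how the R-type fields pack into one 32-bit word, the C fragment below encodes add $t0, $s1, $s2 using the register numbers given above (8, 17, 18) and the standard MIPS funct value for add (100000 = 0x20):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t op = 0, rs = 17, rt = 18, rd = 8, shamt = 0, funct = 0x20;

        /* R-type layout: op(6) rs(5) rt(5) rd(5) shamt(5) funct(6) */
        uint32_t word = (op << 26) | (rs << 21) | (rt << 16)
                      | (rd << 11) | (shamt << 6) | funct;

        printf("add $t0, $s1, $s2 = 0x%08X\n", word);   /* 0x02324020 */
        return 0;
    }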


    45

    Consider the load-word and store-word instructions,

    What would the regularity principle have us do?

    New principle: Good design demands a compromise

    Introduce a new type of instruction format

    I-type for data transfer instructions

    other format was R-type for register

    Example: lw $t0, 32($s2)

    35 18 9 32

    op rs rt 16 bit number

    16 bit offset : +/- 2^15 bytes of the address in the base register rs

    Where's the compromise?

    Instructions of fixed length but of different format to take care of different functional requirements

    Multiple formats complicate the hardware

    Machine Language

    46

    Decision making instructions

    alter the control flow,

    i.e., change the "next" instruction to be executed

    MIPS conditional branch instructions:

    bne $t0, $t1, Label

    Go to statement at address Label if value in register t0 does not equal value in register t1

    beq $t0, $t1, Label

    Example: if (i==j) h = i + j;

    bne $s0, $s1, Label
    add $s3, $s0, $s1

    Label: ....

    Control


    47

    MIPS unconditional branch instruction: j Label

    Example:

    C code:                     MIPS code:
    if (i!=j)                   beq $s4, $s5, Lab1
        h=i+j;                  add $s3, $s4, $s5
    else                        j   Lab2
        h=i-j;            Lab1: sub $s3, $s4, $s5
                          Lab2: ...

    Format of Branch instructions

    Branch - I Format

    Address corresponding to Label is given by 16 bit offset

    Unconditional Branch Jump instruction

    J format - 26 bits of offset

    Control

    48

    Instructions:

    bne $t4,$t5,Label   Next instruction is at Label if $t4 != $t5

    beq $t4,$t5,Label   Next instruction is at Label if $t4 = $t5

    Formats:

    Use a register (like lw and sw) and add its content to address

    use Instruction Address Register (PC = program counter)

    most branches are local (principle of locality)

    Jump instructions just use high order bits of PC; address boundaries of 256 MB

    I format:  op  rs  rt  16 bit address

    Addresses in Branches


    49

    So far:

    Instruction Meaning

    add $s1,$s2,$s3 $s1 = $s2 + $s3

    sub $s1,$s2,$s3   $s1 = $s2 - $s3

    lw $s1,100($s2) $s1 = Memory[$s2+100]

    sw $s1,100($s2) Memory[$s2+100] = $s1

    bne $s4,$s5,L   Next instr. is at Label if $s4 != $s5

    beq $s4,$s5,L Next instr. is at Label if $s4 = $s5

    j Label Next instr. is at Label

    Formats:

    R:  op  rs  rt  rd  shamt  funct

    I:  op  rs  rt  16 bit address

    J:  op  26 bit address

    50

    We have: beq, bne, what about Branch-if-less-than?

    New instruction:  slt $t0, $s1, $s2

    if $s1 < $s2 then $t0 = 1 else $t0 = 0

    Can use this instruction to build Branch if less than

    slt $t0,$s0,$s1

    bne $t0,$Zero, Less

    Register $zero always contains 0

    Control Flow


    51

    While loop

    Example C code:  while (save[i] == k)

    i=i+j;

    Corresponding MIPS code

    Assume i, j, k correspond to registers $s3, $s4, $s5 respectively and the base of the array save is in $s6.

    Loop: add $t1,$s3,$s3 # reg $t1= 2*i

    add $t1,$t1,$t1 # reg $t1 = 4*i

    add $t1,$t1,$s6 #$t1= address of save[i]

    lw $t0, 0($t1)

    bne $t0,$s5,Exit #go to Exit if save[i]!=k

    add $s3,$s3,$s4 #i=i+j

    j Loop # go to Loop

    Exit:

    52

    Register Use Conventions

    Name       Register number   Usage

    $zero      0                 the constant value 0
    $v0-$v1    2-3               values for results and expression evaluation
    $a0-$a3    4-7               arguments
    $t0-$t7    8-15              temporaries
    $s0-$s7    16-23             saved
    $t8-$t9    24-25             more temporaries
    $gp        28                global pointer
    $sp        29                stack pointer
    $fp        30                frame pointer
    $ra        31                return address


    53

    Small constants are used quite frequently (50% of operands)

    e.g., A = A + 5;

    B = B + 1;

    C = C - 18;

    Mechanism

    put 'typical constants' in memory along with instructions and load them.

    create hard-wired registers (like $zero) for constants like one.

    MIPS Instructions :

    addi $29, $29, 4
    slti $8, $18, 10
    andi $29, $29, 6
    ori  $29, $29, 4

    How do we make this work?

    Instructions are I-type

    16 bit field for the constant

    Constants

    54

    We'd like to be able to load a 32 bit constant into a register

    Must use two instructions; new "load upper immediate" instruction

    lui $t0, 1010101010101010

    Then must get the lower order bits right, i.e.,

    ori $t0, $t0, 1010101010101010

    after lui:   1010101010101010 0000000000000000   (lower half filled with zeros)

    ori operand: 0000000000000000 1010101010101010

    result:      1010101010101010 1010101010101010

    How about larger constants?
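    A minimal C sketch of what the lui/ori pair accomplishes (the two 16-bit halves below are the same bit pattern, 1010101010101010 = 0xAAAA, used above):

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t upper = 0xAAAA;       /* 1010101010101010                          */
        uint32_t lower = 0xAAAA;       /* 1010101010101010                          */

        uint32_t reg = upper << 16;    /* lui: upper half loaded, lower half zeroed */
        reg |= lower;                  /* ori: lower 16 bits filled in              */

        printf("0x%08X\n", reg);       /* 0xAAAAAAAA */
        return 0;
    }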


    55

    Supporting Procedures in Hardware

    Execution of a Procedure Requires:

    Place parameters so that the procedure can access them

    Transfer control to the procedure

    Acquire storage resources for the procedure

    Place the result so that the calling procedure can access it

    Return control to the point of origin

    In MIPS architecture

    $a0 - $a3 : four argument registers in which to pass parameters

    $v0 - $v1 : two value registers in which to return values

    $ra : one return address register to return to the point of origin

    Special Instruction: jal ProcedureAddress

    Control jumps to the address and simultaneously saves the address of the following instruction in $ra

    56

    Assembly Language vs. Machine Language


    57

    Other Issues

    58

    simple instructions, all 32 bits wide

    very structured, no unnecessary baggage

    only three instruction formats

    rely on compiler to achieve performance

    what are the compiler's goals? help the compiler where we can

    R:  op  rs  rt  rd  shamt  funct

    I:  op  rs  rt  16 bit address

    J:  op  26 bit address

    Summarising


    59

    [Figure: the five MIPS addressing modes, showing how each instruction format locates its operand in a register or in memory.]

    1. Immediate addressing - the operand is a constant within the instruction itself

    2. Register addressing - the operand is a register

    3. Base addressing - the operand is in memory at the address formed by adding a register to a constant in the instruction

    4. PC-relative addressing - the branch address is the sum of the PC and a constant in the instruction

    5. Pseudodirect addressing - the jump address is the 26 bits of the instruction concatenated with the upper bits of the PC

    60

    Design alternative:

    provide more powerful operations

    goal is to reduce the number of instructions executed

    danger is a slower cycle time and/or a higher CPI

    Sometimes referred to as RISC vs. CISC

    virtually all new instruction sets since 1982 have been

    RISC

    VAX: minimize code size, make assembly language easy

    instructions from 1 to 54 bytes long!

    We'll look at PowerPC and 80x86

    Alternative Architectures


    61

    PowerPC

    Indexed addressing example: lw $t1,$a0+$s3 #$t1=Memory[$a0+$s3]

    What do we have to do in MIPS?

    Update addressing

    update a register as part of load (for marching through arrays)

    example: lwu $t0,4($s3)  # $t0=Memory[$s3+4]; $s3=$s3+4

    What do we have to do in MIPS?

    Others:

    load multiple/store multiple

    a special counter register: bc Loop

    decrement counter, if not 0 go to Loop

    62

    80x86

    1978: The Intel 8086 is announced (16 bit architecture)

    1980: The 8087 floating point coprocessor is added

    1982: The 80286 increases address space to 24 bits, + instructions

    1985: The 80386 extends to 32 bits, new addressing modes

    1989-1995: The 80486, Pentium, Pentium Pro add a few instructions

    (mostly designed for higher performance)

    1997: MMX is added

    This history illustrates the impact of the "golden handcuffs" of compatibility

    adding new features as someone might add clothing to a packed bag

    an architecture that is difficult to explain and impossible to love


    63

    A dominant architecture: 80x86

    See your textbook for a more detailed description

    Complexity:

    Instructions from 1 to 17 bytes long

    one operand must act as both a source and destination

    one operand can come from memory

    complex addressing modes

    e.g., base or scaled index with 8 or 32 bit displacement

    Saving grace:

    the most frequently used instructions are not too difficult to build

    compilers avoid the portions of the architecture that are

    slow

    what the 80x86 lacks in style is made up in quantity,

    making it beautiful from the right perspective

    64

    Instruction complexity is only one variable

    lower instruction count vs. higher CPI / lower clock rate

    Design Principles:

    simplicity favors regularity

    smaller is faster

    good design demands compromise

    make the common case fast

    Instruction set architecture

    a very important abstraction indeed!

    Summary


    65

    Computer Arithmetic

    66

    Arithmetic

    Basic computation involves arithmetic operations

    Instructions for arithmetic operations

    Arithmetic operations implemented in hardware

    Arithmetic-Logical Unit (ALU)

    [Figure: ALU with 32-bit inputs a and b, an operation select input, and a 32-bit result output.]


    67

    Bits are just bits (no inherent meaning)

    conventions define the relationship between bits and numbers

    Binary numbers (base 2): 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 ...

    decimal: 0 ... 2^n - 1

    Problems:

    With a fixed set of bits you can represent only a finite set of numbers, but the set of possible numbers (even integers) is infinite

    how to represent fractions and real numbers?

    how to represent negative numbers?

    Which bit patterns will represent which numbers?

    Numbers

    68

    Sign Magnitude    One's Complement    Two's Complement

    000 = +0          000 = +0            000 = +0
    001 = +1          001 = +1            001 = +1
    010 = +2          010 = +2            010 = +2
    011 = +3          011 = +3            011 = +3
    100 = -0          100 = -3            100 = -4
    101 = -1          101 = -2            101 = -3
    110 = -2          110 = -1            110 = -2
    111 = -3          111 = -0            111 = -1

    Issues: balance, number of zeros, ease of operations

    Which one is best? Why?

    Possible Representations of Negative Integers

    Two's complement: consistent zero


    69

    32 bit signed numbers:

    0000 0000 0000 0000 0000 0000 0000 0000 (two) = 0 (ten)
    0000 0000 0000 0000 0000 0000 0000 0001 (two) = +1 (ten)
    0000 0000 0000 0000 0000 0000 0000 0010 (two) = +2 (ten)
    ...
    0111 1111 1111 1111 1111 1111 1111 1110 (two) = +2,147,483,646 (ten)
    0111 1111 1111 1111 1111 1111 1111 1111 (two) = +2,147,483,647 (ten)   <- maxint
    1000 0000 0000 0000 0000 0000 0000 0000 (two) = -2,147,483,648 (ten)   <- minint
    1000 0000 0000 0000 0000 0000 0000 0001 (two) = -2,147,483,647 (ten)
    1000 0000 0000 0000 0000 0000 0000 0010 (two) = -2,147,483,646 (ten)
    ...
    1111 1111 1111 1111 1111 1111 1111 1101 (two) = -3 (ten)
    1111 1111 1111 1111 1111 1111 1111 1110 (two) = -2 (ten)
    1111 1111 1111 1111 1111 1111 1111 1111 (two) = -1 (ten)

    MIPS

    70

    Negating a two's complement number:

    invert all bits and add 1

    remember: negate and invert are quite different!

    Converting n bit numbers into numbers with more than n bits:

    MIPS 16 bit immediate gets converted to 32 bits for arithmetic

    copy the most significant bit (the sign bit) into the other bits

    0010 -> 0000 0010

    1010 -> 1111 1010

    "si gn extension" (lbu vs. lb)

    Two's Complement Operations
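    A small C sketch of sign extension (mirroring the 4-bit examples above, but widening a 16-bit immediate to 32 bits): copy the sign bit into all the new upper bits.

    #include <stdio.h>
    #include <stdint.h>

    /* Sign-extend a 16-bit value to 32 bits by replicating bit 15. */
    int32_t sign_extend16(uint16_t imm) {
        if (imm & 0x8000)                         /* sign bit set?           */
            return (int32_t)(0xFFFF0000u | imm);  /* fill upper bits with 1s */
        return (int32_t)imm;                      /* upper bits stay zero    */
    }

    int main(void) {
        printf("%d\n", sign_extend16(0x0002));    /*  2 */
        printf("%d\n", sign_extend16(0xFFFA));    /* -6 */
        return 0;
    }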


    71

    Just like in school:

      0111      0111      0110
    + 0110    - 0110    - 0101

    Two's complement operations are easy

    subtraction using addition of negative numbers

      0111
    + 1010

    Overflow (result too large for finite computer word):

    e.g., adding two n-bit numbers does not yield an n-bit number

      0111
    + 0001
      1000    (becomes negative!)

    note that the overflow term is somewhat misleading; it does not mean a carry overflowed

    Addition & Subtraction

    72

    No overflow when adding a pos itive and a negative number

    No overflow when signs are the same for subtraction

    Overflow occurs when the value affects the sign:

    overflow when adding two positives yields a negative

    or, adding two negatives gives a positive

    or, subtract a negative from a positive and get a negative

    or, subtract a positive from a negative and get a positive

    Consider the operations A + B, and A - B

    Can overflow occur if B is 0 ?

    Can overflow occur if A is 0 ?

    Detecting Overflow
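    A minimal C sketch of the rule above for addition: overflow occurs exactly when both operands have the same sign and the result's sign differs (the sum is formed on unsigned values so the wrap-around is well defined; interpreting it as signed assumes a two's complement machine):

    #include <stdio.h>
    #include <stdint.h>

    /* Detect signed overflow of a + b using the sign-based rule. */
    int add_overflows(int32_t a, int32_t b) {
        uint32_t sum = (uint32_t)a + (uint32_t)b;   /* wraps mod 2^32                    */
        int32_t  r   = (int32_t)sum;                /* two's complement reinterpretation */
        return (a >= 0 && b >= 0 && r < 0) ||       /* positive + positive -> negative   */
               (a <  0 && b <  0 && r >= 0);        /* negative + negative -> positive   */
    }

    int main(void) {
        printf("%d\n", add_overflows(0x7FFFFFFF, 1));   /* 1: overflow             */
        printf("%d\n", add_overflows(5, -7));           /* 0: mixed signs are safe */
        return 0;
    }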


    73

    An exception (interrupt) occurs in MIPS

    Control jumps to predefined address for the exception

    Interrupted address is saved for possible resumption

    Handling based on requirements of the software

    Don't always want to detect overflow

    Unsigned arithmetic: new MIPS instructions addu, addiu, subu

    note: addiu still sign-extends!

    note: sltu, sltiu for unsigned comparisons

    Effects of Overflow

    74

    Bit-wise AND, OR, Invert

    Shift left

    Shift right

    Additional Operations

    Bit-wise X-OR, X-NOR

    Logical Operations


    75

    76

    Let's build an ALU to support the andi and ori instructions

    we'll just build a 1-bit ALU, and use 32 of them

    Possible Implementation (sum-of-products):

    [Figure: 1-bit ALU cell with inputs a and b, an operation select, and a result output, plus its truth table (op, a, b, res).]

    An ALU (arithmetic logic unit)


    77

    Selects one of the inputs to be the output, based on a control input

    Let's build our ALU using a MUX:

    [Figure: 2-input multiplexor - select line S chooses between inputs A and B to drive output C.]

    The Multiplexor

    note: we call this a 2-input mux

    even though it has 3 inputs!

    78

    Desirable Features

    Do not want too many inputs to a single gate

    Do not want to have to go through too many gates

    Let's look at a 1-bit ALU for addition:

    How could we build a 1-bit ALU for add, and, and or?

    How could we build a 32-bit ALU?

    Different Implementations

    cout = a·b + a·cin + b·cin

    sum = a xor b xor cin

    [Figure: 1-bit full adder with inputs a, b, CarryIn and outputs Sum, CarryOut.]
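    A minimal C sketch of the two equations above: one full-adder bit, and 32 copies chained so that each CarryOut feeds the next CarryIn - a software model of the ripple-carry ALU built on the next slide:

    #include <stdio.h>
    #include <stdint.h>

    /* One ALU bit for addition: sum = a xor b xor cin, cout = ab + a.cin + b.cin */
    static void full_adder(int a, int b, int cin, int *sum, int *cout) {
        *sum  = a ^ b ^ cin;
        *cout = (a & b) | (a & cin) | (b & cin);
    }

    /* Chain 32 one-bit adders (ripple carry). */
    uint32_t ripple_add(uint32_t a, uint32_t b) {
        uint32_t result = 0;
        int carry = 0;
        for (int i = 0; i < 32; i++) {
            int s;
            full_adder((a >> i) & 1, (b >> i) & 1, carry, &s, &carry);
            result |= (uint32_t)s << i;
        }
        return result;
    }

    int main(void) {
        printf("0x%08X\n", ripple_add(7, 6));   /* 0x0000000D */
        return 0;
    }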


    79

    Building a 32 bit ALU

    [Figure: a 32-bit ALU built from 32 1-bit ALUs (ALU0 ... ALU31); each cell takes a_i, b_i, the Operation select, and a CarryIn, and its CarryOut feeds the CarryIn of the next cell.]

    80

    Two's complement approach: just negate b and add.

    How do we negate?

    A very clever solution:

    What about subtraction (a - b) ?

    [Figure: 1-bit ALU with a Binvert control and a mux that selects b or its complement, so a - b is computed as a + (not b) + 1 by setting CarryIn to 1.]


    81

    Need to support the set-on-less-than instruction (slt)

    remember: slt is an arithmetic instruction

    produces a 1 if rs < rt and 0 otherwise

    use subtraction: (a - b) < 0 implies a < b

    Need to support test for equality (beq $t5, $t6, $t7)

    use subtraction: (a - b) = 0 implies a = b

    Adding more Operations to ALU

    82

    Supporting slt

    Can we figure out the idea?

    [Figure: (a) a 1-bit ALU with Binvert, CarryIn, a Less input, and a 3-bit Operation select; (b) the most significant 1-bit ALU, which additionally produces a Set output and overflow detection.]


    83

    [Figure: a 32-bit ALU for slt - the Set output of the most significant cell (ALU31) is routed back to the Less input of ALU0; all other Less inputs are 0; ALU31 also produces the Overflow signal.]

    84

    Test for equality

    Notice control lines:

    000 = and

    001 = or

    010 = add

    110 = subtract

    111 = slt

    Note: zero is a 1 when the result is zero!

    [Figure: the final 32-bit ALU - a single Bnegate line drives both Binvert and the CarryIn of ALU0, and a Zero output is produced by NORing all the result bits.]


    85

    Recap

    We can build an ALU to support the MIPS instruction set

    key idea: use a multiplexor to select the output we want

    we can efficiently perform subtraction using two's complement

    we can replicate a 1-bit ALU to produce a 32-bit ALU

    Important points about hardware

    all of the gates are always working

    the speed of a gate is affected by the number of inputs to the gate

    the speed of a circuit is affected by the number of gates in series

    (on the critical path or the deepest level of logic)

    Note

    Clever changes to organization can improve performance (similar to using better algorithms in software)

    we'll look at two examples, for addition and multiplication

    86

    Is a 32-bit ALU as fast as a 1-bit ALU?

    Sequential dependence in the 32-bit ALU

    Fast carry: carry computation in parallel

    c1 = b0·c0 + a0·c0 + a0·b0

    c2 = b1·c1 + a1·c1 + a1·b1   (substituting c1 gives an expression in a0, b0, a1, b1, c0)

    c3 = b2·c2 + a2·c2 + a2·b2   (the expanded expression keeps growing)

    c4 = b3·c3 + a3·c3 + a3·b3

    Not feasible! Why?

    large hardware requirement

    Problem: ripple carry adder is slow


    87

    An approach in-between our two extremes

    Motivation:

    If we didn't know the value of carry-in, what could we do?

    When would we always generate a carry?   g_i = a_i · b_i

    When would we propagate the carry?       p_i = a_i + b_i

    c_(i+1) = g_i + p_i · c_i

    When g_i is 1, c_(i+1) = g_i + p_i·c_i = 1 + p_i·c_i = 1

    the adder generates c_(i+1) independent of c_i

    When g_i = 0 and p_i = 1, c_(i+1) = 0 + 1·c_i = c_i

    the adder propagates

    Did we get rid of the ripple?

    c1 = g0 + p0·c0

    c2 = g1 + p1·c1 = g1 + p1·g0 + p1·p0·c0

    c3 = g2 + p2·c2 = g2 + p2·g1 + p2·p1·g0 + p2·p1·p0·c0

    Feasible! Why?

    Can use generate and propagate for larger building blocks - 4-bit adder

    Carry-Lookahead adder
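    A minimal C sketch of the lookahead equations for one 4-bit block: every carry is computed directly from the g's, p's, and c0, with no rippling (the operand bits are arbitrary example values):

    #include <stdio.h>

    int main(void) {
        int a[4] = {1, 1, 0, 1}, b[4] = {1, 0, 1, 0}, c0 = 0;   /* example operands (LSB first) */
        int g[4], p[4], c[5];

        for (int i = 0; i < 4; i++) {
            g[i] = a[i] & b[i];     /* generate  */
            p[i] = a[i] | b[i];     /* propagate */
        }
        c[0] = c0;
        c[1] = g[0] | (p[0] & c[0]);
        c[2] = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c[0]);
        c[3] = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c[0]);
        c[4] = g[3] | (p[3] & c[3]);   /* block carry-out; expandable the same way */

        for (int i = 0; i <= 4; i++)
            printf("c%d = %d\n", i, c[i]);
        return 0;
    }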

    88

    Four 4-bit adders combined to make a 16-bit adder

    Carries come from the carry-lookahead unit

    The carry-lookahead adder is faster because the carry generation and propagation logic starts working the moment the clock cycle begins;

    the carry goes through a smaller number of gates

    Typically this 16-bit adder is 6 times faster than the ripple carry adder

    Build bigger adders

    [Figure: 16-bit carry-lookahead adder - four 4-bit ALUs produce P0-P3 and G0-G3; the carry-lookahead unit combines them (with the incoming carry) to produce C1-C4 for the four blocks.]


    89

    More complicated than addition

    accomplished via shifting and addition

    More time and more area

    Simplest Scheme

        0010  (multiplicand)
    x   1011  (multiplier)

    Negative numbers: convert and multiply

    Multiplication
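    A minimal C sketch of the simplest scheme above (unsigned, with small 8-bit operands so the 0010 x 1011 example can be followed): test each multiplier bit and add the shifted multiplicand when the bit is 1.

    #include <stdio.h>
    #include <stdint.h>

    /* Shift-and-add multiplication of two unsigned 8-bit values. */
    uint16_t shift_add_multiply(uint8_t multiplicand, uint8_t multiplier) {
        uint16_t product = 0;
        uint16_t mcand   = multiplicand;      /* will be shifted left        */
        for (int i = 0; i < 8; i++) {
            if (multiplier & 1)               /* test low bit of multiplier  */
                product += mcand;
            mcand <<= 1;                      /* shift multiplicand left     */
            multiplier >>= 1;                 /* shift multiplier right      */
        }
        return product;
    }

    int main(void) {
        printf("%u\n", shift_add_multiply(0x2, 0xB));   /* 0010 x 1011 = 22 */
        return 0;
    }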

    90

    Multiplication implementation

    [Flowchart: first multiplication algorithm - Start; test Multiplier0; if 1, add the multiplicand to the Product register; shift the Multiplicand register left 1 bit; shift the Multiplier register right 1 bit; repeat until 32 repetitions, then Done.]


    91

    Multiplication: Implementation

    [Figure: first implementation - 64-bit Multiplicand register (shift left), 64-bit ALU, 64-bit Product register (write), 32-bit Multiplier register (shift right), and a control/test unit.]

    92

    2nd Version

    [Flowchart: second version - Start; test Multiplier0; if 1, add the multiplicand to the left half of the Product and place the result in the left half of the Product register; shift the Product register right 1 bit; shift the Multiplier register right 1 bit; repeat 32 times, then Done.]


    93

    Second Version

    [Figure: second version - 32-bit Multiplicand register, 32-bit ALU, 64-bit Product register (shift right), 32-bit Multiplier register (shift right), and a control/test unit.]

    94

    Final Version

    [Figure/flowchart: final version - 32-bit Multiplicand register, 32-bit ALU, and a 64-bit Product register (shift right) whose right half initially holds the multiplier; Start; test Product0; if 1, add the multiplicand to the left half of the product and place the result in the left half of the Product register; shift the Product register right 1 bit; repeat 32 times, then Done.]


    95

    Efficient Multiplication: Booth's Multiplication

    Motivation

    Use of addition and subtraction permits product computation in a variety of ways

    E.g. 2 x 6:  0010 x 0110

    6 = -2 + 8, i.e., 0110 = -0010 + 1000

    2 x 6 = 2 x (-2) + 2 x 8

    We can replace a string of 1s in the multiplier with an initial subtract when we first see a 1 and then an add when we see the bit after the last 1

    To reduce the number of additions (subtractions)

    96

    Booth's Algorithm

    Works with signed integers in two's complement form

    Looks at two bits at a time, scanning from right to left

    Steps

    1. Depending on the current and previous bits, do

    00 : Middle of a string of 0s, so no arithmetic operation

    01 : End of a string of 1s, so add the multiplicand to the left half of the product

    10 : Beginning of a string of 1s, so subtract the multiplicand from the left half of the product

    11 : Middle of a string of 1s, so no arithmetic operation

    Start with a 0 for the imaginary bit to the right of the rightmost bit for the first stage

    2. Shift the product register right 1 bit

    Simulated Example


    97

    Booth's Algorithm: Example

      10011100   (multiplicand = -100)
    x 01100011   (multiplier  =   99)
    ------------------------------
      00000000 00000000
    - 11111111 10011100    (pair 10: subtract)
    ------------------------------
      00000000 01100100
    + 11111110 01110000    (pair 01: add, shifted)
    ------------------------------
      11111110 11010100
    - 11110011 10000000    (pair 10: subtract, shifted)
    ------------------------------
      00001011 01010100
    + 11001110 00000000    (pair 01: add, shifted)
    ------------------------------
      11011001 01010100  = -9900

    Note that the multiplicand and multiplier are 8-bit two's complement numbers, but the result is understood as a 16-bit two's complement number. Be careful about the proper alignment of the columns. A 10 pair causes a subtraction, aligned with the 1; a 01 pair causes an addition, aligned with the 0. In both cases, it aligns with the one on the left. The algorithm starts with the 0-th bit. We should assume that there is a (-1)-th bit, having value 0.

    98

    Booth's Algorithm: Hardware

    The hardware consists of a 32-bit register M for the multiplicand, a 64-bit

    product register P, a 1-bit register C, a 32-bit ALU and control.

    Initially, M contains the multiplicand, P contains the multiplier (the upper half

    Ph = 0), and C contains 0. The algorithm is the following steps.

    Repeat 32 times:

    1. If the (P0, C) pair is:

    10: Ph = Ph - M,

    01: Ph = Ph + M,

    00: do nothing,

    11: do nothing.

    2. Arithmetic shift P right 1 bit. The shift-out bit gets into C.

    Arithmetic shift preserves the sign of a two's complement number, thus

    shift right arithmetic (sra): 0100...111 -> 00100...11,  1100...111 -> 11100...11

    Shift right arithmetic performed on P is equivalent to shifting the multiplicand left with

    sign extension.
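    A minimal C sketch of Booth's rule (not the shifting P/C registers themselves, but the same 00/01/10/11 recoding applied in plain arithmetic, scaled down to 8-bit operands so the -100 x 99 example above can be checked):

    #include <stdio.h>
    #include <stdint.h>

    /* Booth recoding: scan multiplier bits right to left with an imaginary 0
       on the right; 10 = start of a run of 1s -> subtract the (shifted)
       multiplicand, 01 = end of a run -> add it, 00/11 -> do nothing.        */
    int16_t booth_multiply(int8_t multiplicand, int8_t multiplier) {
        int32_t product = 0;
        int prev = 0;                                   /* the imaginary (-1)-th bit */
        for (int i = 0; i < 8; i++) {
            int cur = ((uint8_t)multiplier >> i) & 1;
            if (cur == 1 && prev == 0) product -= multiplicand * (1 << i);   /* 10 */
            if (cur == 0 && prev == 1) product += multiplicand * (1 << i);   /* 01 */
            prev = cur;
        }
        return (int16_t)product;
    }

    int main(void) {
        printf("%d\n", booth_multiply(-100, 99));   /* -9900 */
        printf("%d\n", booth_multiply(2, 6));       /*  12   */
        return 0;
    }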


    99

    Floating Point Numbers

    We need a way to represent

    numbers with fractions, e.g., 3.1416

    very small numbers, e.g., .000000001

    very large numbers, e.g., 3.15576 x 10^9

    Representation:

    sign, exponent, significand: (-1)^sign x significand x 2^exponent

    more bits for significand gives more accuracy

    more bits for exponent increases range

    IEEE 754 floating point standard:

    single precision: 8 bit exponent, 23 bit significand

    double precision: 11 bit exponent, 52 bit significand

    100

    IEEE 754 floating-point standard

    Leading 1 bit of significand is implicit

    Exponent is biased to make sorting easier

    all 0s is smallest exponent, all 1s is largest

    bias of 127 for single precision and 1023 for double precision

    summary: (-1)^sign x (1 + significand) x 2^(exponent - bias)

    Example:

    decimal: -.75 = -3/4 = -3/2^2

    binary: -11/2^2 = -.11 = -1.1 x 2^-1

    floating point: exponent = 126 = 01111110

    IEEE single precision: 1 01111110 10000000000000000000000

    Representation of Zero: all zero bits in the exponent is reserved and used for indicating zero.

    Pattern of all 1 bits in the exponent indicates values and situations outside the scope of representation
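    A small C sketch that confirms the encoding above by reading back the stored bits of -0.75 (this assumes the platform's float is IEEE 754 single precision, which is the usual case):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float f = -0.75f;                 /* = -1.1 (binary) x 2^-1        */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);   /* view the raw bit pattern      */

        printf("0x%08X\n", bits);                       /* 0xBF400000          */
        printf("sign=%u exponent=%u fraction=0x%06X\n",
               bits >> 31,                              /* 1                   */
               (bits >> 23) & 0xFF,                     /* 126 = 01111110      */
               bits & 0x7FFFFF);                        /* 0x400000 = 100...0  */
        return 0;
    }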


    101

    Floating Point Addition

    Align the binary point of the number with the smaller exponent by shifting the significand of the smaller number to the right (such that the exponent of the smaller number matches the larger exponent)

    Addition of the significands

    Normalize the result and accordingly adjust the exponent (shifting right and incrementing the exponent, or shifting left and decrementing the exponent)

    Generate exception in case of underflow or overflow

    If necessary, round (or truncate) the significand

    102

    Floating Point Multiplication

    Add the biased exponents of the two numbers, subtracting the bias from the sum to get the new biased exponent

    Multiply the significands

    Normalise the product if necessary by shifting right and incrementing the exponent

    Round the significand

    Set the sign of the product correctly


    103

    Floating Point instructions in MIPS

    Addition, subtraction, multiplication, division, comparison

    Single and double precision

    Separate floating point registers: $f0, $f1, $f2, ... and separate load and store for floating point registers

    Registers used either as single or double precision; a double precision register is really an even-odd pair of single precision registers, using the even register as its name

    104

    Accurate Arithmetic?

    Floating point numbers, unlike integers, are approximations

    Between 0 and 1 there are an infinite number of real numbers, out of which only 2^53 can be exactly represented in double precision form

    Rounding provides the mechanism for the desired approximation

    Extra bits are required because if every intermediate result had to be truncated to the exact number of digits, then there would be no opportunity to round

    IEEE 754 keeps 2 extra bits on the right during intermediate calculations, called guard and round

    A DECIMAL EXAMPLE:

    with 2-digit significand and 2 extra digits - guard and round

    2.56 x 10^0 + 2.34 x 10^2

    Normalisation: 2.3400 + 0.0256 (5 in guard, 6 in round)

    Result: 2.3656 x 10^2, after rounding 2.37 x 10^2

    Without guard and round digits: 2.34 + 0.02 = 2.36 x 10^2


    105

    Floating Point Complexities: Summary

    Operations are somewhat more complicated

    In addition to overflow we can have underflow

    Accuracy can be a big problem

    IEEE 754 keeps two extra bits, guard and round

    four rounding modes

    positive divided by zero yields infinity

    zero divide by zero yields not a number

    other complexities

    Implementing the standard can be tricky

    Not using the standard can be even worse

    106


    107

    108

    Through implementation of a simplified version of MIPS

    Simplified to contain only:

    memory-reference instructions: lw, sw

    arithmetic-logical instructions: add, sub, and, or, slt

    control flow instructions: beq, j

    Generic Implementation:

    use the program counter (PC) to supply the instruction address

    get the instruction from memory

    read registers

    use the instruction to decide exactly what to do

    All instructions use the ALU after reading the registers

    The Processor: Data-path & Control


    109

    Conceptual View of the Processor

    [Figure: conceptual view - the PC addresses the instruction memory; the fetched instruction supplies register numbers to the register file; the ALU operates on the register data; the data memory is addressed by the ALU result.]

    Two types of functional units:

    elements that operate on data values (combinational)

    elements that contain state (sequential)

    110

    Unclocked vs. Clocked

    Clocks used in synchronous logic

    when should an element that contains state be updated?

    cycle time

    rising edge

    falling edge

    Recap: State Elements


    111

    The set-reset latch output depends on present inputs and also on past inputs

    An un-clocked state element

    112

    Output is equal to the stored value inside the element

    Change of state (value) is based on the clock

    Latches: whenever the inputs change, and the clock is asserted

    Flip-flop: state changes only on a clock edge (edge-triggered methodology)

    "logically true" could mean electrically low

    A clocking methodology defines when signals can be read and written

    wouldn't want to read a signal at the same time it was being written

    Latches and Flip-flops


    113

    Two inputs: the data value to be stored (D)

    the clock signal (C) indicating when to read & store D

    Two outputs:

    the value of the internal state (Q) and its complement

    D-latch

    [Figure: D latch built from gates - inputs D and C, outputs Q and its complement; timing diagram showing Q following D while C is asserted.]

    114

    D flip-flop

    Output changes only on the clock edge

    [Figure: D flip-flop built from two D latches connected back to back (master-slave); the output changes only on the clock edge.]


    115

    Our Implementation

    An edge triggered methodology

    Typical execution:

    read contents of some state elements,

    send values through some combinational logic

    write results to one or more state elements

    [Figure: state element 1 -> combinational logic -> state element 2, all within one clock cycle.]

    116

    Built using D flip-flops

    Register File

    [Figure: register file - registers 0 to n-1 feed two multiplexors selected by Read register number 1 and Read register number 2 to produce Read data 1 and Read data 2; Write register, Write data, and the Write control select the register to be written.]


    117

    Register File

    Use the real clock to determine when to write

    [Figure: write port - an n-to-1 decoder driven by the register number, gated with the Write signal, selects which register receives the register data on its clock input.]

    118

    Building the Datapath

    Use multiplexors to stitch functional components together

    [Figure: single-cycle datapath - PC and instruction memory, register file, sign-extend unit, ALU with 3-bit operation control, data memory, branch adder with shift-left-2, and multiplexors controlled by RegWrite, ALUSrc, MemRead, MemWrite, MemtoReg, and PCSrc.]


    119

    Control

    Selecting the operations to perform (ALU, read/write, etc.)

    Controlling the flow of data (multiplexor inputs)

    Decode Information that comes from the 32 bits of the instruction

    Example:

    add $8, $17, $18

    Instruction Format:

    000000 10001 10010 01000 00000 100000

    op rs rt rd shamt funct

    ALU's operation based on instruction type and function code

    120

    What should the ALU do with this instruction?

    Example: lw $1, 100($2)

    35 2 1 100

    op rs rt 16 bit offset

    ALU control input:

    000  AND
    001  OR
    010  add
    110  subtract
    111  set-on-less-than

    Why is the code for subtract 110 and not 011?

    Control


    121

    Must describe hardware to compute the 3-bit ALU control input given the instruction type

    00 = lw, sw
    01 = beq
    11 = arithmetic

    function code for arithmetic

    Describe it using a truth table (can turn into gates):

    ALUOp computed from instruction type

    Control

    ALUOp1  ALUOp0 | F5 F4 F3 F2 F1 F0 | Operation
       0       0   |  X  X  X  X  X  X |   010
       X       1   |  X  X  X  X  X  X |   110
       1       X   |  X  X  0  0  0  0 |   010
       1       X   |  X  X  0  0  1  0 |   110
       1       X   |  X  X  0  1  0  0 |   000
       1       X   |  X  X  0  1  0  1 |   001
       1       X   |  X  X  1  0  1  0 |   111
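    A minimal C sketch of the truth table above - a software stand-in for the gates, with the 2-bit ALUOp and the funct field as inputs and the 3-bit operation as output:

    #include <stdio.h>

    /* Returns the 3-bit ALU control value for a 2-bit ALUOp and a funct field. */
    unsigned alu_control(unsigned aluop, unsigned funct) {
        if (aluop == 0) return 2;          /* 00: lw/sw -> add (010)         */
        if (aluop & 1)  return 6;          /* x1: beq   -> subtract (110)    */
        switch (funct & 0x0F) {            /* 1x: decode F3-F0 of funct      */
            case 0x0: return 2;            /* add 100000 -> 010              */
            case 0x2: return 6;            /* sub 100010 -> 110              */
            case 0x4: return 0;            /* and 100100 -> 000              */
            case 0x5: return 1;            /* or  100101 -> 001              */
            case 0xA: return 7;            /* slt 101010 -> 111              */
        }
        return 2;                          /* anything else is a don't-care  */
    }

    int main(void) {
        printf("%u %u %u\n",
               alu_control(0, 0),          /* lw/sw      -> 2 */
               alu_control(1, 0),          /* beq        -> 6 */
               alu_control(2, 0x22));      /* R-type sub -> 6 */
        return 0;
    }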

    122

    Control

    Instruction | RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0

    R-format    |   1       0        0         1         0        0        0       1       0
    lw          |   0       1        1         1         1        0        0       0       0
    sw          |   X       1        X         0         0        1        0       0       0
    beq         |   X       0        X         0         0        0        1       0       1


    123

    [Figure: single-cycle datapath with control - the opcode bits Instruction[31-26] feed the main control unit, which produces RegDst, Branch, MemRead, MemtoReg, ALUOp, MemWrite, ALUSrc, and RegWrite; Instruction[5-0] feeds the ALU control block.]

    124

    Control

    Simple combinational logic (truth tables)

    [Figure: the ALU control block (inputs ALUOp and F[5-0], outputs Operation[2-0]) and the main control logic (inputs Op[5-0], outputs RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp1, ALUOp0), decoded for R-format, lw, sw, and beq.]


    125

    All of the logic is combinational

    We wait for everything to settle down, and the right thing to be

    done

    ALU might not produce the right answer right away

    we use write signals along with clock to determine when to

    write

    Cycle time determined by length of the longest path

    Our Simple Control Structure

    We are ignoring some details like setup and hold times

    126

    Single Cycle Implementation

    Calculate cycle time assuming negligible delays except:

    memory (2 ns), ALU and adders (2 ns), register file access (1 ns)

    [Figure: the complete single-cycle datapath with control signals annotated, used for the cycle time calculation.]
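    A minimal sketch of the calculation the slide asks for, using only the delays listed above and treating everything else as negligible; the per-class paths assume the usual single-cycle flow (instruction fetch, register read, ALU, data memory, register write), so lw exercises every unit and sets the cycle time:

    #include <stdio.h>

    int main(void) {
        double mem = 2.0, alu = 2.0, reg = 1.0;       /* ns, from the slide          */

        double r_type = mem + reg + alu       + reg;  /* fetch, read, ALU, write: 6  */
        double lw     = mem + reg + alu + mem + reg;  /* + data memory          : 8  */
        double sw     = mem + reg + alu + mem;        /*                          7  */
        double beq    = mem + reg + alu;              /*                          5  */

        printf("R-type %g ns, lw %g ns, sw %g ns, beq %g ns\n", r_type, lw, sw, beq);
        printf("single-cycle clock must fit the slowest instruction: %g ns\n", lw);
        return 0;
    }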


    127

    Analysis

    Single Cycle Problems: what if we had a more complicated instruction like floating

    point?

    wasteful of area: repetition of functional units if they are needed more than once in an instruction

    One Solution:

    use a smaller cycle time

    have different instructions take different numbers of cycles

    a multicycle datapath:

    128

    Multi-Cycle Data Path

    [Figure: multicycle datapath - a single memory for instructions and data, the PC, an instruction register, a memory data register, the register file, intermediate registers A and B, a shared ALU, and ALUOut.]


    129

    We will be reusing functional units

    ALU used to compute address and to increment PC

    Memory used for instruction and data

    Our control signals will not be determined solely by instruction

    e.g., what should the ALU do for a subtract instruction?

    We'll use a finite state machine for control

    Multicycle Approach

    130

Review: finite state machines

Finite state machines:
  a set of states, and
  a next-state function (determined by the current state and the input)
  an output function (determined by the current state and possibly the input)
We'll use a Moore machine (output based only on the current state)

[Figure: generic FSM block diagram: the inputs and the current state feed the next-state function and the output function; the next state is latched into the state register on the clock.]


    131

Multicycle Approach

Break up the instructions into steps; each step takes a cycle:
  balance the amount of work to be done
  restrict each cycle to use only one major functional unit
At the end of a cycle:
  store values for use in later cycles (easiest thing to do)
  introduce additional internal registers

    132

Multi-Cycle Data Path

[Figure: the multicycle datapath in detail: the PC, the shared memory (address, MemData, write data), the instruction register, the memory data register, the register file, the sign-extend and shift-left-2 units, the intermediate registers A, B, and ALUOut, and the multiplexers that route instruction fields and data to the single ALU.]


    133

Five Execution Steps

1. Instruction Fetch
2. Instruction Decode and Register Fetch
3. Execution, Memory Address Computation, or Branch Completion
4. Memory Access or R-type instruction completion
5. Write-back step

INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!

    134

Step 1: Instruction Fetch

Use the PC to get the instruction and put it in the Instruction Register.
Increment the PC by 4 and put the result back in the PC.
Can be described succinctly using RTL ("Register-Transfer Language"):

  IR = Memory[PC];
  PC = PC + 4;

Can we figure out the values of the control signals?
What is the advantage of updating the PC now?


    135

Step 2: Instruction Decode and Register Fetch

Read registers rs and rt in case we need them.
Compute the branch target address in case the instruction is a branch.

RTL:
  A = Reg[IR[25-21]];
  B = Reg[IR[20-16]];
  ALUOut = PC + (sign-extend(IR[15-0]) << 2);


    137

Step 4 (R-type or memory-access)

Loads and stores access memory:
  MDR = Memory[ALUOut];
    or
  Memory[ALUOut] = B;
R-type instructions finish:
  Reg[IR[15-11]] = ALUOut;
  The write actually takes place at the end of the cycle, on the clock edge.

    138

Write-back step

  Reg[IR[20-16]] = MDR;

What about all the other instructions?


    139

Illustrations

Implementation of instructions in the multicycle datapath:
  Add
  Beq
  J

    140

Summary:

  Step name                            Action for R-type, memory-reference, branch, and jump instructions
  Instruction fetch                    IR = Memory[PC]
                                       PC = PC + 4
  Instruction decode/register fetch    A = Reg[IR[25-21]]
                                       B = Reg[IR[20-16]]
                                       ALUOut = PC + (sign-extend(IR[15-0]) << 2)
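To make the step-by-step RTL concrete, here is a minimal Python sketch that walks one lw through the five steps using the internal registers named above (IR, A, B, ALUOut, MDR); the dictionary-based memory and register file are purely illustrative assumptions.

  # Hedged sketch: one lw instruction stepping through the five multicycle stages.
  mem = {0x1000: ("lw", 8, 9, 4), 0x2004: 99}   # instruction at the PC; a data word at 0x2004
  reg = {9: 0x2000}                              # register 9 holds the base address

  PC = 0x1000
  IR = mem[PC]; PC += 4                          # Step 1: instruction fetch
  op, rt, rs, imm = IR
  A = reg.get(rs, 0); B = reg.get(rt, 0)         # Step 2: decode / register fetch
  ALUOut = A + imm                               # Step 3: memory address computation
  MDR = mem[ALUOut]                              # Step 4: memory access
  reg[rt] = MDR                                  # Step 5: write-back
  print(hex(PC), reg[8])                         # 0x1004 99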


    141

Simple Questions

How many cycles will it take to execute this code?

  lw   $t2, 0($t3)
  lw   $t3, 4($t3)
  beq  $t2, $t3, Label   # assume not
  add  $t5, $t2, $t3
  sw   $t5, 8($t3)
  Label: ...

What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 take place?
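A hedged way to check the first question in Python, assuming the usual multicycle cycle counts implied by the five steps (lw 5, sw 4, R-type 4, beq 3); the cycle-by-cycle remarks in the comments follow from that assumption.

  # Cycles per instruction class in the multicycle implementation (assumed counts).
  CYCLES = {"lw": 5, "sw": 4, "rtype": 4, "beq": 3}

  code = ["lw", "lw", "beq", "rtype", "sw"]      # the sequence on this slide
  total = sum(CYCLES[i] for i in code)
  print(total)                                   # 5 + 5 + 3 + 4 + 4 = 21 cycles

  # The 8th cycle falls inside the second lw (cycles 6-10): it is that lw's
  # third step, the memory address computation.
  # The add's actual addition (its execution step) happens in cycle 16.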

    142

Implementing the Control

The value of the control signals is dependent upon:
  what instruction is being executed
  which step is being performed
Use the information we've accumulated to specify a finite state machine:
  specify the finite state machine graphically, or
  use microprogramming
Implementation can be derived from the specification


143

FSM

How many state bits will we need?

[Figure: the complete finite state machine for the multicycle control, states 0-9: instruction fetch (0), instruction decode/register fetch (1), memory address computation (2), memory access for lw (3) and sw (5), write-back (4), R-type execution (6) and completion (7), branch completion (8), and jump completion (9). Each state lists the control signals it asserts (IorD, IRWrite, MemRead, MemWrite, ALUSrcA, ALUSrcB, ALUOp, PCWrite, PCWriteCond, PCSource, RegDst, RegWrite, MemtoReg); transitions out of state 1 and state 2 are selected by the opcode (LW, SW, R-type, BEQ, J).]
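A hedged Python sketch of the next-state function of that diagram, using the state numbering that the dispatch-ROM table on slide 150 confirms; the per-state control outputs are omitted to keep it short.

  # Next-state function of the multicycle control FSM (sketch, outputs omitted).
  # State 0: fetch, 1: decode, 2: memory address, 3: lw memory access,
  # 4: lw write-back, 5: sw memory access, 6: R-type execute,
  # 7: R-type completion, 8: branch completion, 9: jump completion.
  def next_state(state, opcode):
      if state == 0:
          return 1
      if state == 1:
          return {"lw": 2, "sw": 2, "rtype": 6, "beq": 8, "j": 9}[opcode]
      if state == 2:
          return 3 if opcode == "lw" else 5
      if state == 3:
          return 4
      if state == 6:
          return 7
      return 0                                   # states 4, 5, 7, 8, 9 go back to fetch

  # Example: an lw visits states 0 -> 1 -> 2 -> 3 -> 4 (five cycles), then fetch again.
  s, trace = 0, []
  for _ in range(5):
      trace.append(s); s = next_state(s, "lw")
  print(trace)                                   # [0, 1, 2, 3, 4]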

    144

    Implementation:

    Finite State Machine for Control

[Figure: implementation of the FSM controller: a block of control logic whose inputs are the opcode field of the instruction register (Op5-Op0) and the current state bits (S3-S0), and whose outputs are the datapath control signals (PCWrite, PCWriteCond, IorD, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, IRWrite, MemRead, MemWrite) plus the next-state bits (NS3-NS0) fed back through the state register.]


    145

PLA Implementation

[Figure: PLA implementation of the control function: the AND plane takes Op5-Op0 and the state bits S3-S0; the OR plane produces the control outputs (IorD, IRWrite, MemRead, MemWrite, PCWrite, PCWriteCond, MemtoReg, PCSource1-0, ALUOp1-0, ALUSrcB1-0, ALUSrcA, RegWrite, RegDst) and the next-state bits NS3-NS0.]

    146

    ROM = "Read Only Memory"

    values of memory locations are fixed ahead of t ime

    A ROM can be used to imp lement a truth table

    if the address is m-bits, we can address 2m entries in the ROM.

    our outpu ts are the bits of data that the address points to.

    ROM Implementation

    m n

    0 0 0 0 0 1 10 0 1 1 1 0 00 1 0 1 1 0 00 1 1 1 0 0 01 0 0 0 0 0 0

    1 0 1 0 0 0 11 1 0 0 1 1 01 1 1 0 1 1 1
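A minimal Python sketch of that idea: the truth table above stored as a ROM and indexed by the 3-bit address (the list literal simply transcribes the table).

  # The 8-entry x 4-bit ROM from the table above, addressed by a 3-bit input.
  ROM = [0b0011, 0b1100, 0b1100, 0b1000, 0b0000, 0b0001, 0b0110, 0b0111]

  def rom_lookup(address):                       # address is an integer 0..7
      return ROM[address]

  print(format(rom_lookup(0b110), "04b"))        # -> 0110, the row for address 110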


    147

ROM Implementation

How many inputs are there?
  6 bits for opcode, 4 bits for state = 10 address lines (i.e., 2^10 = 1024 different addresses)
How many outputs are there?
  16 datapath-control outputs, 4 state bits = 20 outputs
ROM is 2^10 x 20 = 20K bits (and a rather unusual size)
Rather wasteful, since for lots of the entries the outputs are the same
  i.e., the opcode is often ignored

    148

ROM vs PLA

Break up the table into two parts:
  4 state bits tell you the 16 outputs: 2^4 x 16 bits of ROM
  10 bits tell you the 4 next-state bits: 2^10 x 4 bits of ROM
  Total: 4.3K bits of ROM
PLA is much smaller:
  can share product terms
  only needs entries that produce an active output
  can take into account don't cares
Size is (#inputs x #product-terms) + (#outputs x #product-terms)
  For this example: (10 x 17) + (20 x 17) = 510 PLA cells
PLA cells are usually about the size of a ROM cell (slightly bigger)
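The sizing arithmetic as a small Python sketch; the 17 product terms are the count quoted above, and everything else follows from the input and output widths.

  # ROM vs PLA sizing for the multicycle control unit.
  inputs, outputs, product_terms = 10, 20, 17

  single_rom_bits = 2**inputs * outputs                    # 1024 x 20 = 20480 bits
  split_rom_bits  = 2**4 * 16 + 2**inputs * 4              # 256 + 4096 = 4352 bits (~4.3K)
  pla_cells       = inputs * product_terms + outputs * product_terms

  print(single_rom_bits, split_rom_bits, pla_cells)        # 20480 4352 510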


    149

Another Implementation Style

Complex instructions: the "next state" is often current state + 1

[Figure: explicit-sequencer control unit: a PLA or ROM produces the datapath control signals (PCWrite, PCWriteCond, IorD, MemtoReg, PCSource, ALUOp, ALUSrcB, ALUSrcA, RegWrite, RegDst, IRWrite, MemRead, MemWrite) from the current state; address select logic, driven by AddrCtl, the opcode field Op[5-0], and an incrementer, chooses the next state.]

    150

Details

  Dispatch ROM 1                          Dispatch ROM 2
  Op       Opcode name   Value           Op       Opcode name   Value
  000000   R-format      0110            100011   lw            0011
  000010   jmp           1001            101011   sw            0101
  000100   beq           1000
  100011   lw            0010
  101011   sw            0010

  State number   Address-control action      Value of AddrCtl
  0              Use incremented state       3
  1              Use dispatch ROM 1          1
  2              Use dispatch ROM 2          2
  3              Use incremented state       3
  4              Replace state number by 0   0
  5              Replace state number by 0   0
  6              Use incremented state       3
  7              Replace state number by 0   0
  8              Replace state number by 0   0
  9              Replace state number by 0   0

[Figure: the address select logic in detail: a 4-way mux chooses the next state from dispatch ROM 1, dispatch ROM 2, the incremented state, or 0, under control of AddrCtl, with the instruction register opcode field driving both dispatch ROMs.]
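A hedged Python sketch of the address select logic, transcribing the two dispatch ROMs and the AddrCtl table above.

  # Address select logic for the explicit-sequencer control (sketch).
  DISPATCH_ROM_1 = {"rformat": 6, "jmp": 9, "beq": 8, "lw": 2, "sw": 2}
  DISPATCH_ROM_2 = {"lw": 3, "sw": 5}
  ADDR_CTL = {0: 3, 1: 1, 2: 2, 3: 3, 4: 0, 5: 0, 6: 3, 7: 0, 8: 0, 9: 0}

  def next_state(state, opcode):
      action = ADDR_CTL[state]
      if action == 3:                       # use incremented state
          return state + 1
      if action == 1:                       # dispatch on opcode via ROM 1
          return DISPATCH_ROM_1[opcode]
      if action == 2:                       # dispatch on opcode via ROM 2
          return DISPATCH_ROM_2[opcode]
      return 0                              # replace state number by 0

  s, trace = 0, []
  for _ in range(5):                        # an lw again walks 0 -> 1 -> 2 -> 3 -> 4
      trace.append(s); s = next_state(s, "lw")
  print(trace)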


    151

Microprogramming

What are the microinstructions?

[Figure: microprogrammed control unit: a microcode memory produces the datapath control signals plus AddrCtl; the address select logic, driven by AddrCtl, the opcode field Op[5-0], an adder, and the microprogram counter, picks the next microinstruction.]

    152

Microprogramming

A specification methodology
  appropriate if there are hundreds of opcodes, modes, cycles, etc.
  signals specified symbolically using microinstructions
Will two implementations of the same architecture have the same microcode?
What would a microassembler do?

  Label      ALU control   SRC1   SRC2      Register control   Memory      PCWrite control   Sequencing
  Fetch      Add           PC     4                            Read PC     ALU               Seq
             Add           PC     Extshft   Read                                             Dispatch 1
  Mem1       Add           A      Extend                                                     Dispatch 2
  LW2                                                          Read ALU                      Seq
                                            Write MDR                                        Fetch
  SW2                                                          Write ALU                     Fetch
  Rformat1   Func code     A      B                                                          Seq
                                            Write ALU                                        Fetch
  BEQ1       Subt          A      B                                        ALUOut-cond       Fetch
  JUMP1                                                                    Jump address      Fetch


    153

Microinstruction format (field name, value, signals active, comment):

  ALU control
    Add         (ALUOp = 00)   Cause the ALU to add.
    Subt        (ALUOp = 01)   Cause the ALU to subtract; this implements the compare for branches.
    Func code   (ALUOp = 10)   Use the instruction's function code to determine ALU control.
  SRC1
    PC   (ALUSrcA = 0)   Use the PC as the first ALU input.
    A    (ALUSrcA = 1)   Register A is the first ALU input.
  SRC2
    B         (ALUSrcB = 00)   Register B is the second ALU input.
    4         (ALUSrcB = 01)   Use 4 as the second ALU input.
    Extend    (ALUSrcB = 10)   Use the output of the sign extension unit as the second ALU input.
    Extshft   (ALUSrcB = 11)   Use the output of the shift-by-two unit as the second ALU input.
  Register control
    Read        Read two registers using the rs and rt fields of the IR as the register numbers, putting the data into registers A and B.
    Write ALU   (RegWrite, RegDst = 1, MemtoReg = 0)   Write a register using the rd field of the IR as the register number and the contents of ALUOut as the data.
    Write MDR   (RegWrite, RegDst = 0, MemtoReg = 1)   Write a register using the rt field of the IR as the register number and the contents of the MDR as the data.
  Memory
    Read PC     (MemRead, IorD = 0)    Read memory using the PC as the address; write the result into the IR (and the MDR).
    Read ALU    (MemRead, IorD = 1)    Read memory using ALUOut as the address; write the result into the MDR.
    Write ALU   (MemWrite, IorD = 1)   Write memory using ALUOut as the address and the contents of B as the data.
  PC write control
    ALU            (PCSource = 00, PCWrite)       Write the output of the ALU into the PC.
    ALUOut-cond    (PCSource = 01, PCWriteCond)   If the Zero output of the ALU is active, write the PC with the contents of the register ALUOut.
    Jump address   (PCSource = 10, PCWrite)       Write the PC with the jump address from the instruction.
  Sequencing
    Seq          (AddrCtl = 11)   Choose the next microinstruction sequentially.
    Fetch        (AddrCtl = 00)   Go to the first microinstruction to begin a new instruction.
    Dispatch 1   (AddrCtl = 01)   Dispatch using ROM 1.
    Dispatch 2   (AddrCtl = 10)   Dispatch using ROM 2.

    154

Maximally vs. Minimally Encoded

No encoding:
  1 bit for each datapath operation
  faster, requires more memory (logic)
  used for the VAX 780, an astonishing 400K of memory!
Lots of encoding:
  send the microinstructions through logic to get the control signals
  uses less memory, slower
Historical context of CISC:
  too much logic to put on a single chip with everything else
  use a ROM (or even RAM) to hold the microcode
  it's easy to add new instructions


    155

Microcode: Trade-offs

The distinction between specification and implementation is sometimes blurred
Specification advantages:
  easy to design and write
  design the architecture and microcode in parallel
Implementation (off-chip ROM) advantages:
  easy to change, since values are in memory
  can emulate other architectures
  can make use of internal registers
Implementation disadvantages: SLOWER now that
  control is implemented on the same chip as the processor
  ROM is no longer faster than RAM
  there is no need to go back and make changes

    156

The Big Picture

  Initial representation       Finite state diagram          Microprogram
  Sequencing control           Explicit next-state function  Microprogram counter + dispatch ROMs
  Logic representation         Logic equations               Truth tables
  Implementation technique     Programmable logic array      Read only memory


    157

    158

Memories: Review

SRAM:
  value is stored on a pair of inverting gates
  very fast, but takes up more space than DRAM (4 to 6 transistors)
DRAM:
  value is stored as a charge on a capacitor (must be refreshed)
  very small, but slower than SRAM (by a factor of 5 to 10)

[Figure: an SRAM cell (cross-coupled inverters with bit lines B and B-bar) and a DRAM cell (word line, pass transistor, capacitor, bit line).]


    159

Exploiting Memory Hierarchy

Users want large and fast memories!
  SRAM access times are 2-25 ns, at a cost of $100 to $250 per Mbyte.
  DRAM access times are 60-120 ns, at a cost of $5 to $10 per Mbyte.
  Disk access times are 10 to 20 million ns, at a cost of $0.10 to $0.20 per Mbyte.
  (1997 figures)
Try and give it to them anyway:
  build a memory hierarchy

[Figure: the memory hierarchy as a pyramid, level 1 closest to the CPU through level n; access time increases and memory size grows with distance from the CPU.]

    160

Locality

A principle that makes having a memory hierarchy a good idea
If an item is referenced:
  temporal locality: it will tend to be referenced again soon
  spatial locality: nearby items will tend to be referenced soon
Why does code have locality?
Our initial focus: two levels (upper, lower)
  block: minimum unit of data
  hit: data requested is in the upper level
  miss: data requested is not in the upper level


    161

Caches, Memory and Processor

[Figure: the CPU, the cache controller with its cache, and main memory, connected by address and data lines.]

    162

Cache

Two issues:
  How do we know if a data item is in the cache?
  If it is, how do we find it?
Our first example:
  block size is one word of data
  "direct mapped"
  For each item of data at the lower level, there is exactly one location in the cache where it might be.
  e.g., lots of items at the lower level share locations in the upper level


    163

Cache operation

Many main memory locations are mapped onto one cache entry.
May have caches for:
  instructions;
  data;
  data + instructions (unified).
Memory access time is no longer deterministic.

    164

Terms

Cache hit: the required location is in the cache.
Cache miss: the required location is not in the cache.
Working set: the set of locations used by a program in a time interval.


    165

Types of misses

Compulsory (cold): the location has never been accessed.
Capacity: the working set is too large.
Conflict: multiple locations in the working set map to the same cache entry.

    166

Memory system performance

h = cache hit rate.
t_cache = cache access time, t_main = main memory access time.
Average memory access time:
  t_av = h * t_cache + (1 - h) * t_main
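A quick numeric check as a small Python sketch; the hit rate and access times below are illustrative assumptions, not values from these notes.

  # Average memory access time for a single-level cache.
  def t_av(h, t_cache, t_main):
      return h * t_cache + (1 - h) * t_main

  # Assumed example: 95% hit rate, 2 ns cache, 100 ns main memory.
  print(t_av(0.95, 2, 100))   # about 6.9 ns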


    167

    Multiple levels of cache

[Figure: the CPU accesses the L1 cache, which is backed by the L2 cache.]

    168

Multi-level cache access time

h1 = L1 cache hit rate.
h2 = rate for a miss on L1 that hits in L2.
Average memory access time:
  t_av = h1 * t_L1 + h2 * t_L2 + (1 - h1 - h2) * t_main


    169

Cache performance benefits

Keep frequently accessed locations in the fast cache.
The cache retrieves more than one word at a time:
  sequential accesses are faster after the first access.

    170

Replacement policies

Replacement policy: the strategy for choosing which cache entry to throw out to make room for a new memory location.
Two popular strategies:
  Random.
  Least-recently used (LRU).


    171

Write operations

Write-through: immediately copy the write to main memory.
Write-back: write to main memory only when the location is removed from the cache.

    172

Cache organizations

Fully-associative: any memory location can be stored anywhere in the cache (almost never implemented).
Direct-mapped: each memory location maps onto exactly one cache entry.
N-way set-associative: each memory location maps onto one set and can be stored in any of that set's n entries.


    173

Direct Mapped Cache

Mapping: cache index = memory address modulo the number of blocks in the cache

[Figure: an 8-block direct-mapped cache (indices 000-111); memory addresses 00001, 00101, 01001, 01101, 10001, 10101, 11001, and 11101 all map to cache index 001.]
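A hedged Python sketch of that mapping, including the tag/index/offset split the hardware uses; the 1024-block, one-word-block geometry is taken from the figure on the next slide.

  # Direct-mapped lookup: split a byte address into tag, index, and byte offset.
  NUM_BLOCKS = 1024          # 10 index bits, as in the following figure
  BLOCK_BYTES = 4            # one 32-bit word per block -> 2 offset bits

  def split_address(addr):
      offset = addr % BLOCK_BYTES
      index = (addr // BLOCK_BYTES) % NUM_BLOCKS
      tag = addr // (BLOCK_BYTES * NUM_BLOCKS)
      return tag, index, offset

  print(split_address(0x1234))   # (1, 141, 0)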

    174

Direct Mapped Cache

[Figure: a 1024-entry direct-mapped cache with one-word blocks; address bits 31-12 form the 20-bit tag, bits 11-2 the 10-bit index, and bits 1-0 the byte offset; a hit is signalled when the valid bit is set and the stored tag matches the address tag.]


    175

Direct-mapped cache

[Figure: a generic direct-mapped cache line: the address is split into tag, index, and offset; the index selects a line holding a valid bit, a stored tag (e.g., 0xabcd), and a multi-byte cache block; the stored tag is compared with the address tag to produce the hit signal, and the offset selects the byte within the block.]

    176

Direct Mapped Cache

Taking advantage of spatial locality:

[Figure: a 4K-entry direct-mapped cache with four-word (128-bit) blocks; address bits 31-16 form the 16-bit tag, bits 15-4 the 12-bit index, bits 3-2 the block offset driving a 4-to-1 mux that selects one of the four 32-bit words, and bits 1-0 the byte offset.]


    177

Hits vs. Misses

Read hits:
  this is what we want!
Read misses:
  stall the CPU, fetch the block from memory, deliver it to the cache, restart
Write hits:
  can replace the data in cache and memory (write-through)
  write the data only into the cache (write back to memory later)
Write misses:
  read the entire block into the cache, then write the word

    178

Hardware Issues

Make reading multiple words easier by using banks of memory
It can get a lot more complicated...

[Figure: three organizations between the cache and memory: (a) one-word-wide memory, (b) wide memory with a multiplexor in front of the cache, and (c) interleaved memory with four banks on a shared bus.]


    179

Performance

Increasing the block size tends to decrease the miss rate:

[Figure: miss rate (0-40%) versus block size (bytes) for cache sizes of 1 KB, 8 KB, 16 KB, 64 KB, and 256 KB.]

Use split caches because there is more spatial locality in code:

  Program   Block size in words   Instruction miss rate   Data miss rate   Effective combined miss rate
  gcc       1                     6.1%                    2.1%             5.4%
  gcc       4                     2.0%                    1.7%             1.9%
  spice     1                     1.2%                    1.3%             1.2%
  spice     4                     0.3%                    0.6%             0.4%

    180

Direct-mapped cache locations

Many locations map onto the same cache block.
Conflict misses are easy to generate:
  Array a[] uses locations 0, 1, 2, ...
  Array b[] uses locations 1024, 1025, 1026, ...
  The operation a[i] + b[i] generates conflict misses.
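A hedged Python sketch that makes the conflict concrete: it models a 1024-block direct-mapped cache with one-word blocks (an assumed geometry that matches the a[]/b[] spacing above) and counts misses for the interleaved accesses of a[i] + b[i].

  # Conflict misses in a direct-mapped cache: a[] at word addresses 0, 1, 2, ...
  # and b[] at 1024, 1025, ... map to the same cache blocks.
  NUM_BLOCKS = 1024                      # one word per block (assumed geometry)
  cache = [None] * NUM_BLOCKS            # stored tag per block, None = invalid
  misses = 0

  def access(word_addr):
      global misses
      index = word_addr % NUM_BLOCKS
      tag = word_addr // NUM_BLOCKS
      if cache[index] != tag:            # miss: fetch the block, evicting the old one
          misses += 1
          cache[index] = tag

  for i in range(100):                   # a[i] + b[i] for i = 0..99
      access(0 + i)                      # read a[i]
      access(1024 + i)                   # read b[i], which evicts a[i]'s block

  print(misses)                          # 200: every access misses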


    181

Performance

Simplified model:
  execution time = (execution cycles + stall cycles) x cycle time
  stall cycles = # of instructions x miss ratio x miss penalty
Two ways of improving performance:
  decreasing the miss ratio
  decreasing the miss penalty
What happens if we increase the block size?
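A quick numeric sketch of that model in Python; the instruction count, base CPI, miss ratio, miss penalty, and cycle time are illustrative assumptions.

  # Simplified cache-performance model from this slide.
  def execution_time(instr, base_cpi, miss_ratio, miss_penalty, cycle_time):
      execution_cycles = instr * base_cpi
      stall_cycles = instr * miss_ratio * miss_penalty
      return (execution_cycles + stall_cycles) * cycle_time

  # Assumed example: 1M instructions, CPI 1, 5% miss ratio, 40-cycle penalty, 2 ns clock.
  print(execution_time(1_000_000, 1, 0.05, 40, 2e-9))   # ~0.006 seconds (3M cycles at 2 ns)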

    182

Decreasing miss ratio with associativity

Compared to direct mapped, give a series of references that:
  results in a lower miss ratio using a 2-way set-associative cache
  results in a higher miss ratio using a 2-way set-associative cache
  assuming we use the least-recently-used replacement strategy

[Figure: an 8-block cache organized four ways: direct mapped (8 blocks), two-way set associative (4 sets), four-way set associative (2 sets), and eight-way set associative (fully associative, one set of 8 tag/data pairs).]


    183

Set-associative cache

A set of direct-mapped caches:

[Figure: Set 1, Set 2, ..., Set n searched in parallel; their hit signals and data outputs are combined to produce the final hit and data.]

    184

An implementation

[Figure: a four-way set-associative cache with 256 sets; the 8-bit index (address bits 9-2) selects a set, the 22-bit tag (address bits 31-10) is compared against the four stored tags in parallel, and a 4-to-1 multiplexor selects the data from the way that hits.]


    185

Performance

[Figure: miss rate (0-15%) versus associativity (one-way, two-way, four-way, eight-way) for cache sizes from 1 KB to 128 KB.]

    186

    Example: direct-mapped vs. set-associative

    address data

    000 0101

    001 1111

    010 0000

    011 0110

    100 1000

    101 0001

    110 1010

    111 0100


    187

Direct-mapped cache behavior

After 001 access:
block tag data
00 - -
01 0 1111
10 - -
11 - -

After 010 access:
block tag data
00 - -
01 0 1111
10 0 0000
11 - -

    188

    Direct-mapped cache behavior, contd.

    After 011 access:

    block tag data

    00 - -

    01 0 1111

    10 0 0000

    11 0 0110

    After 100 access:

    block tag data

    00 1 1000

    01 0 1111

    10 0 0000

    11 0 0110


    189

Direct-mapped cache behavior, contd.

After 101 access:
block tag data
00 1 1000
01 1 0001
10 0 0000
11 0 0110

After 111 access:
block tag data
00 1 1000
01 1 0001
10 0 0000
11 1 0100

    190

2-way set-associative cache behavior

Final state of cache (twice as big as direct-mapped):

set   blk 0 tag   blk 0 data   blk 1 tag   blk 1 data
00    1           1000         -           -
01    0           1111         1           0001
10    0           0000         -           -
11    0           0110         1           0100


    191

2-way set-associative cache behavior

Final state of cache (same size as direct-mapped):

set   blk 0 tag   blk 0 data   blk 1 tag   blk 1 data
0     01          0000         10          1000
1     10          0001         11          0100

    192

Decreasing miss penalty with multilevel caches

Add a second-level cache:
  often the primary cache is on the same chip as the processor
  use SRAMs to add another cache above primary memory (DRAM)
  the miss penalty goes down if the data is in the 2nd-level cache
Example:
  CPI of 1.0 on a 500 MHz machine with a 5% miss rate and 200 ns DRAM access
  adding a 2nd-level cache with 20 ns access time decreases the miss rate to 2%
Using multilevel caches:
  try to optimize the hit time on the 1st-level cache
  try to optimize the miss rate on the 2nd-level cache
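Working the example through in a small Python sketch; treating the 5% and 2% miss rates as misses per instruction is the usual textbook reading and is an assumption here.

  # Effective CPI with and without a second-level cache, for the example above.
  cycle_ns = 1000 / 500                        # 500 MHz clock -> 2 ns per cycle
  main_penalty = 200 / cycle_ns                # 200 ns DRAM access -> 100 cycles
  l2_penalty = 20 / cycle_ns                   # 20 ns L2 access   -> 10 cycles

  cpi_no_l2 = 1.0 + 0.05 * main_penalty                        # 1 + 5.0 = 6.0
  cpi_with_l2 = 1.0 + 0.05 * l2_penalty + 0.02 * main_penalty  # 1 + 0.5 + 2.0 = 3.5

  print(cpi_no_l2, cpi_with_l2, cpi_no_l2 / cpi_with_l2)       # 6.0 3.5 ~1.7x faster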


    193

    Example caches

    194

Memory management units

The memory management unit (MMU) translates addresses:

[Figure: the CPU issues a logical address to the memory management unit, which presents the corresponding physical address to main memory.]


    195

Memory management tasks

Allows programs to move in physical memory during execution.
Allows virtual memory:
  memory images kept in secondary storage;
  images returned to main memory on demand during execution.
Page fault: a request for a location not resident in memory.

    196

Address translation

Requires some sort of register/table to allow arbitrary mappings of logical to physical addresses.
Two basic schemes:
  segmented;
  paged.
Segmentation and paging can be combined (x86).


    197

Segments and pages

[Figure: physical memory holding two variable-sized segments and two fixed-sized pages.]

    198

Segment address translation

[Figure: the segment base address is added to the logical address to form the physical address; a range check against the segment's lower and upper bounds raises a range error if the address falls outside the segment.]


    199

Page address translation

[Figure: the logical address is split into a page number and an offset; the page number selects page i's base address from the page table, and that base is concatenated with the offset to form the physical address.]
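A hedged Python sketch of that translation; the 4 KB page size and the tiny page table are assumptions for illustration.

  # Paged address translation: physical = page_table[page] concatenated with offset.
  PAGE_BITS = 12                                   # assume 4 KB pages
  PAGE_TABLE = {0: 0x80, 1: 0x14}                  # logical page -> physical page frame

  def translate(logical_addr):
      page = logical_addr >> PAGE_BITS
      offset = logical_addr & ((1 << PAGE_BITS) - 1)
      return (PAGE_TABLE[page] << PAGE_BITS) | offset   # concatenate frame and offset

  print(hex(translate(0x1ABC)))                    # page 1 -> frame 0x14 -> 0x14abc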

    200

Page table organizations

[Figure: two page-table organizations: a flat table indexed directly by page number, and a tree in which upper-level entries point to lower-level tables that hold the page descriptors.]


    201

Caching address translations

Large translation tables require main memory accesses.
TLB (translation lookaside buffer): a cache for address translations.
  Typically small.

    202

ARM memory management

Memory region types:
  section: 1 Mbyte block;
  large page: 64 kbytes;
  small page: 4 kbytes.
An address is marked as section-mapped or page-mapped.
Two-level translation scheme.


    203

ARM address translation

[Figure: two-level ARM translation: the virtual address is split into a 1st-level index, a 2nd-level index, and an offset; the translation table base register and the 1st-level index select a 1st-level descriptor, which together with the 2nd-level index selects a 2nd-level descriptor; that descriptor is concatenated with the offset to form the physical address.]