Capp 01


Transcript of Capp 01


CS 606: Computer Architecture and Parallel Processing

    N. Sarma

    Lecture 1


    What is *Computer Architecture*

Computer Architecture =

Instruction Set Architecture +

Organization +

Hardware


    What is *Computer Architecture*

The term *Computer Architecture* was coined in 1964 by the chief architects of the IBM System/360 as

the structure of a computer that a machine language programmer must understand to write a correct (timing-independent) program for a machine.

By this definition, it comprises

the definition of registers and memory

the instruction set, instruction formats, addressing modes

the actual coding of the instructions

but not the implementation (hardware structure) or the realization (logic technology, packaging, etc.)


    What is *Computer Architecture*

A more recent interpretation uses a number of levels of increasing abstraction; at each level, the architecture is described by

the underlying computational model,

the level of consideration/interest (micromachine level, processor level, and computer-system level), and

the scope of interest (functional specification and implementation); abstract architecture and concrete architecture.


    What is *Computer Architecture*

An abstract architecture is a black-box specification, either from the programmer's point of view (the programming model) or from the hardware designer's point of view (the hardware model, with additional specifications such as interface protocols).

A concrete architecture can be considered from two different points of view:

the logic design (the logical components used, e.g., registers, execution units, their interconnections, and the sequence of information transfers), which is an abstraction of the physical design and precedes it; and

the physical design (based on concrete circuit elements: specifications of circuit elements with their signals, interconnections, and declarations of initiated signal sequences).


    What is *Computer Architecture*

A logic design is usually described informally by means of a block diagram, or formally by means of an ADL (Architecture Description Language).

A logic design represented in an ADL is passed to a CAD package as input, and a physical design is typically the output of the CAD package.

Example architectures

Concrete architectures of computer systems: based on processor-level building blocks, their specifications, their interconnections, and the operation of the whole system.


    What is *Computer Architecture*

    Example architectures

Abstract architecture of a processor: the programming model (the ISA, or simply "architecture") and the hardware model (programming interface, interrupt interface, I/O interface).

Concrete architecture of a processor (also called the microarchitecture): usually given as a logic design, described by a block diagram, through the specification of a set of functional units (register blocks, buses, execution units, etc.) and their interconnections, and by declaring the operation of the whole processor.

The microarchitecture as a physical design is usually detailed in (proprietary) technical documentation.

Formal description of a computer architecture: for a precise, unambiguous representation, and for verification, simulation, and analysis.


    Computational Model

It can be observed that computer architecture classes and programming language classes correspond to one another, e.g., the Von Neumann architecture to imperative languages, and reduction architectures to functional languages.

Architectures and languages have a common paradigm, or foundation, called a computational model, e.g., the Von Neumann computational model and the applicative computational model.

A computational model represents a higher level of abstraction than either architecture or language.

Programming language: a specification tool for formulating a computational task using a computational model.

Architecture: a tool to execute a given computational task.


    Computational Model

A computational model comprises the following three abstractions:

The basic items of computation: the specification of the items computation refers to, and the kind of computations performed on them, e.g., variables in programming languages, memory/register addresses in architectures.

The problem description model: the style (e.g., procedural or declarative) and method of problem description.

The execution model: the interpretation of the computation, the execution semantics, and the control of the execution sequence.


    Computational Model

The control of the execution sequence can be:

Control-driven: the execution sequence is implicitly given by the order of the instructions or by explicit control instructions.

Data-driven: an operation is activated as soon as the needed input data is available (also called eager evaluation; used in the dataflow computational model).

Demand-driven: operations are activated only when their execution is needed to achieve the final result (also called lazy evaluation; used in the applicative computational model).


    The Instruction Set: a Critical Interface

[Diagram: the instruction set as the interface between software and hardware]

The actual programmer-visible instruction set


    Hardware

Machine specifics:

Feature size (10 microns in 1971 to 0.18 microns in 2001): the minimum size of a transistor or a wire in either the x or y dimension

Logic designs

Packaging technology

Clock rate

Supply voltage


    Applications and Requirements

Scientific/numerical: weather prediction, molecular modeling. Need: large memory, floating-point arithmetic.

Commercial: inventory, payroll, web serving, e-commerce. Need: integer arithmetic, high I/O.

Embedded: automobile engines, microwaves, PDAs. Need: low power, low cost, interrupt-driven operation.

Home computing: multimedia, games, entertainment. Need: high data bandwidth, graphics.


    Classes of Computers

High performance: supercomputers (Cray T-90), massively parallel computers (Cray T3E)

Balanced cost/performance: workstations (SPARCstations), servers (SGI Origin, UltraSPARC), high-end PCs (Pentium quads)

Low cost/power: low-end PCs, laptops, PDAs (mobile Pentiums)



    Classification of Computer Systems

Purpose: to provide a basis for ordering information, for predicting the properties of an architecture, and for explanation.

Flynn's classification [1966]: systems are classified by the number of concurrent instruction streams and data streams present in the computer architecture.

It relies on several architectural properties: instruction memory (IM), data memory (DM), control unit (CU), processing unit (PU), instruction stream (IS), data stream (DS).


    Classification of Computer Systems

SISD (Single Instruction, Single Data): a single processor (a uniprocessor) executes a single instruction stream, operating on data stored in a single memory.

Instruction prefetching and pipelined execution of instructions are common examples found in most modern SISD computers.

[Diagram: one control unit (C) feeding one processing unit (P) from memory (M); a single instruction stream (IS) and a single data stream (DS)]


    Classification of Computer Systems

SIMD (Single Instruction, Multiple Data)

[Diagram: one control unit (C) issues a single instruction stream (IS) to multiple processing units (P), each operating on its own data stream (DS) from memory (M)]


    Classification of Computer Systems

MISD (Multiple Instruction, Single Data)

[Diagram: multiple control units (C), each issuing its own instruction stream (IS) to its own processing unit (P), all operating on a single data stream (DS) from memory (M)]


    Classification of Computer Systems

MIMD (Multiple Instruction, Multiple Data)

[Diagram: multiple control units (C), each issuing its own instruction stream (IS) to its own processing unit (P), each operating on its own data stream (DS) from memory (M)]


Feng's Classification

[Plot: word length (1 to 64) versus bit-slice length (1 to 16K); machines plotted include PDP-11, IBM 370, CRAY-1, Illiac IV, C.mmP, PEPE, STARAN, MPP]

Feng's classification gives the degree of parallelism, without saying how it is achieved: the bit width of the processor (n) and the bit-slice width (m). The maximum degree of parallelism is P = m × n.
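As a quick illustration of the formula, here is a minimal Python sketch; the machine coordinates in the example are the ones usually quoted for this plot and should be treated as illustrative, not as values taken from these slides:

    # Feng's measure of parallelism: maximum degree P = m * n,
    # where n = word length (bits) and m = bit-slice length.
    def feng_parallelism(n_word_length, m_bit_slice):
        return n_word_length * m_bit_slice

    # Illustrative coordinates (commonly cited, assumed here):
    print(feng_parallelism(64, 64))     # Illiac IV: 64-bit words, 64 PEs -> 4096
    print(feng_parallelism(1, 16384))   # MPP: 1-bit slices, 16K PEs -> 16384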


Händler's Classification

It explicitly gives the type of parallelism at three levels of processor hardware: the program control unit (K), the ALU (D), and the elementary logic circuit level (W):

< K × K', D × D', W × W' >

where the unprimed values give the degree of parallelism at the control, data, and word levels, and the primed ("dash") values give the degree of pipelining at each level.

Examples: TI-ASC; CDC 6600 (the "×" operator composes the central processor with its I/O processors); C.mmP (the "+" operator combines its alternative operating modes); PEPE; Cray-1.


    Modern Classification

Based on the kind of parallelism exploited and the granularity of the functional parallelism utilized.

Parallel architectures divide into data-parallel architectures and function-parallel architectures.

Data-parallel architectures may be vector architectures, associative and neural architectures, SIMDs, or systolic architectures.

Function-parallel architectures may exploit parallelism at the instruction level (pipelined, VLIW, superscalar), the thread level, or the process level (MIMDs: distributed-memory MIMD and shared-memory MIMD).


    Why Study Computer Architecture

Aren't they fast enough already? Are they?

Fast enough to do everything we will EVER want? AI, protein sequencing, graphics...

Is speed the only goal?

Power: heat dissipation + battery life

Cost

Reliability

Etc.

Answer #1: requirements are always changing.


    Why Study Computer Architecture

Annual technology improvements (approx.):

Logic: density +25%, speed +20%

DRAM (memory): density +60%, speed +4%

Disk: density +25%, speed +4%

Designs change even if requirements are fixed. But the requirements are not fixed.

Answer #2: the technology playing field is always changing.


    Example of Changing Designs

Having, or not having, caches:

1970: 10K transistors on a single chip; DRAM faster than logic -> having a cache is bad.

1990: 1M transistors; logic is faster than DRAM -> having a cache is good.

2000: 600M transistors -> multiple levels of cache and multiple CPUs.

Will caches ever be a bad idea again?


    Types of Parallelism

Available and utilized parallelism:

Available parallelism means the parallelism available in programs (or in problem solutions).

Utilized parallelism means the parallelism occurring during execution.

Problem solutions may contain two different kinds of available parallelism:

Functional parallelism, which arises from the logic of a problem solution.

Data parallelism, which comes from using data structures that allow parallel operations on their elements, such as vectors or matrices.


    Levels of Parallelism

Programs written in imperative languages may embody functional parallelism at different levels:

Parallelism at the instruction level

Parallelism at the loop level

Parallelism at the procedure level

Parallelism at the program level

Utilization of functional parallelism:

Available parallelism can be utilized by architectures, compilers, and operating systems conjointly to speed up computation.

Available functional parallelism can be utilized at four levels: instruction, thread, process, and user level.


    Levels of Parallelism

Utilization of functional parallelism:

It is quite natural to utilize the available functional parallelism that is inherent (hidden) in a conventional sequential program at the instruction level, by executing instructions in parallel.

This can be achieved by means of architectures capable of parallel instruction execution (called ILP architectures).

The parallelism must be detected either by a dedicated compiler or by the ILP architecture itself.

Available loop-level and procedure-level parallelism will often be utilized in the form of threads and processes.


    Levels of Parallelism

Utilization of functional parallelism (continued):

There are two different ways to utilize threads and processes: specialized architectures (multi-threaded and MIMD architectures), or architectures that run threads and processes in sequence under the supervision of a multi-threaded or multitasking OS.


    Levels of Parallelism

Utilization of data parallelism:

Available data parallelism can be utilized in two different ways.

One is to exploit data parallelism directly, by dedicated architectures that permit parallel operations on data elements (data-parallel architectures).

The other possibility is to convert data parallelism into functional parallelism, by expressing parallel-executable operations on data elements in a sequential manner (using a loop construct).


    Performance Terminology

X is n times faster than Y means:

Execution time_Y / Execution time_X = n

X is m% faster than Y means:

(Execution time_Y - Execution time_X) / Execution time_X × 100% = m
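A minimal Python sketch of these two definitions (the function and variable names are my own, not from the slides):

    # "X is n times faster than Y" and "X is m% faster than Y",
    # exactly as defined above.
    def times_faster(time_x, time_y):
        return time_y / time_x

    def percent_faster(time_x, time_y):
        return (time_y - time_x) / time_x * 100.0

    # Example: X takes 2 s, Y takes 3 s.
    print(times_faster(2.0, 3.0))    # 1.5   -> X is 1.5 times faster than Y
    print(percent_faster(2.0, 3.0))  # 50.0  -> X is 50% faster than Y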


Amdahl's Law

Amdahl's Law is used to find the maximum expected improvement to an overall system when only part of the system is improved.

It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.


Amdahl's Law: Computing Speedup

Speedup due to an enhancement E:

Speedup(E) = Execution time without E (before) / Execution time with E (after) = Time_before / Time_after

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. What are Execution time_after and Speedup(E)?


Amdahl's Law

ExTime_after = ExTime_before × [(1 - F) + F/S]

Speedup(E) = ExTime_before / ExTime_after = 1 / [(1 - F) + F/S]


Amdahl's Law

In the parallel-processing form:

ExTime_after = ExTime_before × [(1 - P) + P/N]

Speedup(E) = ExTime_before / ExTime_after = 1 / [(1 - P) + P/N]

P: portion of a program that can be made parallel; N: number of processors.


Amdahl's Law: An Example

Q: Floating-point instructions are improved to run 2× faster, but only 10% of the execution time is FP ops. What are the execution time and speedup after the improvement?

Ans: F = 0.1, S = 2

ExTime_after = ExTime_before × [(1 - 0.1) + 0.1/2] = 0.95 × ExTime_before

Speedup = ExTime_before / ExTime_after = 1 / 0.95 = 1.053

    Read examples in the book!
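The law is easy to check numerically; here is a minimal Python sketch (names are my own) that reproduces the example above and the parallel form:

    # Amdahl's Law: fraction F of the task is sped up by factor S.
    def amdahl_speedup(f, s):
        return 1.0 / ((1.0 - f) + f / s)

    # The FP example above: F = 0.1, S = 2.
    print(amdahl_speedup(0.1, 2))      # 1.0526... (rounded to 1.053 above)

    # Parallel form: fraction P run on N processors (F = P, S = N).
    # Even with N = 1000, a half-sequential program gains less than 2x.
    print(amdahl_speedup(0.5, 1000))   # 1.996...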


    Gustafson's Law

Gustafson's Law (also known as the Gustafson-Barsis law) is a law in computer science which says that problems with large, repetitive data sets can be efficiently parallelized.

S(P) = P - α × (P - 1)

P: number of processors; α: non-parallelizable fraction of the work.
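A minimal Python sketch of the formula (the contrast with Amdahl's fixed-size bound is my own illustration, not from the slides):

    # Gustafson's Law: scaled speedup with P processors and
    # non-parallelizable fraction alpha.
    def gustafson_speedup(p, alpha):
        return p - alpha * (p - 1)

    # With 5% serial work, 100 processors give ~95x scaled speedup,
    # whereas Amdahl's fixed-size bound for the same fraction is ~16.8x.
    print(gustafson_speedup(100, 0.05))  # 95.05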


    CPU Performance

The Fundamental Law:

CPU time (seconds/program) = (instructions/program) × (cycles/instruction) × (seconds/cycle)

Three components of CPU performance:

Instruction count

CPI

Clock cycle time

Which design levels affect which components:

                          Inst. Count   CPI   Clock
    Program                    X
    Compiler                   X         X
    Instruction Set Arch.      X         X      X
    Organization (Arch)                  X      X
    Physical Design                             X
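The law translates directly into code; a minimal sketch with hypothetical numbers:

    # CPU time = instruction count x CPI x cycle time.
    def cpu_time(inst_count, cpi, cycle_time_s):
        return inst_count * cpi * cycle_time_s

    # Hypothetical machine: 10^9 instructions, CPI = 1.5, 1 GHz clock.
    print(cpu_time(1e9, 1.5, 1e-9))  # 1.5 seconds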


CPI: Cycles per Instruction

Let F_i be the frequency of type-i instructions in a program. Then the average CPI is:

CPI = Total cycles / Total instruction count = Σ (i = 1 to n) CPI_i × F_i, where F_i = IC_i / Total instruction count

CPU time = Cycle time × Σ (i = 1 to n) CPI_i × IC_i

Example:

    Instruction type   ALU   Load   Store   Branch
    Frequency          43%   21%    12%     24%
    Clock cycles       1     2      2       2

average CPI = 0.43×1 + 0.21×2 + 0.12×2 + 0.24×2 = 0.43 + 0.42 + 0.24 + 0.48 = 1.57 cycles/instruction
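The weighted sum can be checked with a few lines of Python (the data are taken from the table above):

    # Average CPI = sum over instruction types of CPI_i * F_i.
    def average_cpi(freqs, cycles):
        return sum(f * c for f, c in zip(freqs, cycles))

    #          ALU   Load  Store Branch
    freqs  = [0.43, 0.21, 0.12, 0.24]
    cycles = [1,    2,    2,    2]
    print(average_cpi(freqs, cycles))  # 1.57 cycles/instruction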


    Example

Instruction mix of a RISC architecture:

    Inst.    ALU   Load   Store   Branch
    Freq.    50%   20%    10%     20%
    C.C.     1     2      2       2

Should we add a register-memory ALU instruction format (one operand in a register, one operand in memory)?

The new instruction will take 2 clock cycles, but will also increase branches to 3 clock cycles.

Q: What fraction of loads must be eliminated for this to pay off?


    Solution

Exec time = Instr. count × CPI × Cycle time

    Instr.     F_i    CPI_i   CPI_i×F_i   |   I_i     CPI_i   CPI_i×I_i
    ALU        .5     1       .5          |   .5-X    1       .5-X
    Load       .2     2       .4          |   .2-X    2       .4-2X
    Store      .1     2       .2          |   .1      2       .2
    Branch     .2     2       .4          |   .2      3       .6
    Reg/Mem                               |   X       2       2X
    Total      1.0    CPI = 1.5           |   1-X     CPI = (1.7-X)/(1-X)

Instr. count_old × CPI_old × Cycle time_old >= Instr. count_new × CPI_new × Cycle time_new

1.0 × 1.5 >= (1-X) × (1.7-X)/(1-X)

X >= 0.2

ALL loads must be eliminated for this to be a win!
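A minimal Python check of the break-even calculation above (X is the fraction of the original mix converted into the new reg/mem instruction):

    # New relative execution time as a function of X, per the table above:
    # cycles = (.5-X)*1 + (.2-X)*2 + .1*2 + .2*3 + X*2 = 1.7 - X,
    # with instruction count 1 - X and cycle time unchanged.
    def new_relative_time(x):
        return (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2

    old_relative_time = 1.0 * 1.5          # instruction count x CPI
    for x in (0.1, 0.2):
        print(x, new_relative_time(x))     # 1.6 at X=0.1; 1.5 at X=0.2 (break-even)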


    Improve Memory System

All instructions require an instruction fetch; only a fraction require a data fetch/store. So optimize instruction access over data access.

Programs exhibit locality:

Spatial locality

Temporal locality

Access to small memories is faster, so provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories:

Registers - Cache - Memory - Disk/Tape


    Benchmarks

The program is the unit of work. There are millions of programs; not all are the same, and most are very different. Which ones should we use?

Benchmarks: standard programs for measuring or comparing performance.

They should be representative of programs people care about, and repeatable!


    Choosing Programs to Evaluate Perf.

Toy benchmarks: e.g., quicksort, puzzle. No one really runs them. Scary fact: they were used to prove the value of RISC in the early 80s.

Synthetic benchmarks: attempt to match the average frequencies of operations and operands in real workloads, e.g., Whetstone, Dhrystone. Often slightly more complex than kernels, but they do not represent real programs.

Kernels: the most frequently executed pieces of real programs, e.g., the Livermore Loops. Good for focusing on individual features, not the big picture; they tend to over-emphasize the target feature.

Real programs: e.g., gcc, spice; SPEC89, 92, 95, SPEC2000 (Standard Performance Evaluation Corporation); TPC-C, TPC-D.


Networking benchmarks: Netbench, Commbench. Applications: IP forwarding, TCP/IP, SSL, Apache, SpecWeb.

Commbench: www.ecs.umass.edu/ece/wolf/nsl/software/cb/index.html

Execution-driven simulators: Simplescalar - http://www.simplescalar.com/

NepSim - http://www.cs.ucr.edu/~yluo/nepsim/

    MIPS and MFLOPS

MIPS: millions of instructions per second.

MIPS = Instruction count / (CPU time × 10^6) = Clock rate / (CPI × 10^6)

Easy to understand and to market.

Instruction-set dependent, so it cannot be used to compare across machines.

Program dependent; it can vary inversely to performance! (Why? Read the book.)

MFLOPS: millions of FP operations per second.

Less compiler dependent than MIPS.

But not all FP ops are implemented in hardware on all machines, and not all FP ops have the same latency.

Normalized MFLOPS: uses an equivalence table to even out the various latencies of FP ops.
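A minimal Python sketch of the MIPS formula, reusing the CPI from the earlier example (the 500 MHz clock is a hypothetical value, not from the slides):

    # MIPS = clock rate / (CPI * 10^6).
    def mips(clock_hz, cpi):
        return clock_hz / (cpi * 1e6)

    # Hypothetical 500 MHz machine with the CPI of 1.57 computed earlier.
    print(mips(500e6, 1.57))  # ~318 MIPS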