Capp 01


Transcript of Capp 01


CS 606: Computer Architecture and Parallel Processing

    N. Sarma

    Lecture 1


    What is *Computer Architecture*

Computer Architecture =

Instruction Set Architecture +

Organization +

Hardware


    What is *Computer Architecture*

The term *Computer Architecture* was coined in 1964 by the chief architects of the IBM System/360 as

the structure of a computer that a machine language programmer must understand to write a correct (timing-independent) program for a machine.

By this definition, it comprises

the definition of registers and memory

the instruction set, instruction formats, addressing modes

the actual coding of the instructions

but not the implementation (hardware structure) or the realization (logic technology, packaging, etc.)


    What is *Computer Architecture*

A more recent interpretation uses a number of levels of increasing abstraction; at each level, the architecture is described by

the underlying computational model,

the level of consideration/interest (micromachine level, processor level, and computer-system level), and

the scope of interest (functional specification and implementation); abstract architecture and concrete architecture.


    What is *Computer Architecture*

An abstract architecture is a black-box specification, either from the programmer's point of view (the programming model) or from the hardware designer's point of view (the hardware model, with additional specifications such as interface protocols).

A concrete architecture can be considered from two different points of view:

the logic design (the logical components used, e.g., registers, execution units, their interconnections, and the sequence of information transfers), which is an abstraction of the physical design and precedes it; and

the physical design (based on concrete circuit elements: specifications of circuit elements with their signals, interconnections, and declarations of initiated signal sequences).


    What is *Computer Architecture*

A logic design is usually described informally by means of a block diagram, or formally by means of an ADL (Architecture Description Language).

A logic design represented in an ADL is passed to a CAD package as input, and a physical design is typically the output of the CAD package.

Example architectures

Concrete architectures of computer systems: based on processor-level building blocks, their specifications, their interconnections, and the operation of the whole system.


    What is *Computer Architecture*

    Example architectures

Abstract architecture of a processor: the programming model (the ISA, or simply "architecture") and the hardware model (programming interface, interrupt interface, I/O interface).

Concrete architecture of a processor (also called the microarchitecture): usually given as a logic design, described by a block diagram, through the specification of a set of functional units (register blocks, buses, execution units, etc.) and their interconnections, and by declaring the operation of the whole processor.

The microarchitecture as a physical design is usually detailed in (proprietary) technical documentation.

Formal description of a computer architecture: for a precise, unambiguous representation, and for verification, simulation, and analysis.


    Computational Model

It can be observed that computer architecture classes and programming language classes correspond to one another, e.g., the Von Neumann architecture to imperative languages, and reduction architectures to functional languages.

Architectures and languages have a common paradigm, or foundation, called a computational model, e.g., the Von Neumann computational model and the applicative computational model.

A computational model represents a higher level of abstraction than either architecture or language.

Programming language: a specification tool for formulating a computational task using a computational model.

Architecture: a tool to execute a given computational task.


    Computational Model

A computational model comprises the following three abstractions:

The basic items of computation: the specification of the items computation refers to, and the kind of computations performed on them, e.g., variables in programming languages, memory/register addresses in architectures.

The problem description model: the style (e.g., procedural or declarative) and method of problem description.

The execution model: the interpretation of the computation, the execution semantics, and the control of the execution sequence.


    Computational Model

The control of the execution sequence can be:

Control-driven: the execution sequence is implicitly given by the order of the instructions or by explicit control instructions.

Data-driven: an operation is activated as soon as the needed input data is available (also called eager evaluation; used in the dataflow computational model).

Demand-driven: operations are activated only when their execution is needed to achieve the final result (also called lazy evaluation; used in the applicative computational model).


    The Instruction Set: a Critical Interface

[Diagram: the instruction set as the interface between software and hardware]

The actual programmer-visible instruction set


    Hardware

Machine specifics:

Feature size (10 microns in 1971 to 0.18 microns in 2001): the minimum size of a transistor or a wire in either the x or y dimension

Logic designs

Packaging technology

Clock rate

Supply voltage


    Applications and Requirements

Scientific/numerical: weather prediction, molecular modeling. Need: large memory, floating-point arithmetic.

Commercial: inventory, payroll, web serving, e-commerce. Need: integer arithmetic, high I/O.

Embedded: automobile engines, microwaves, PDAs. Need: low power, low cost, interrupt-driven operation.

Home computing: multimedia, games, entertainment. Need: high data bandwidth, graphics.


    Classes of Computers

High performance: supercomputers (Cray T-90), massively parallel computers (Cray T3E)

Balanced cost/performance: workstations (SPARCstations), servers (SGI Origin, UltraSPARC), high-end PCs (Pentium quads)

Low cost/power: low-end PCs, laptops, PDAs (mobile Pentiums)



    Classification of Computer Systems

Purpose: to provide a basis for ordering information, for predicting the properties of an architecture, and for explanation.

Flynn's classification [1966]: systems are classified by the number of concurrent instruction streams and data streams present in the computer architecture.

It relies on several architectural properties: instruction memory (IM), data memory (DM), control unit (CU), processing unit (PU), instruction stream (IS), data stream (DS).


    Classification of Computer Systems

SISD (Single Instruction, Single Data): a single processor (a uniprocessor) executes a single instruction stream, operating on data stored in a single memory.

Instruction prefetching and pipelined execution of instructions are common examples found in most modern SISD computers.

[Diagram: one control unit (C) feeding one processing unit (P) from memory (M); a single instruction stream (IS) and a single data stream (DS)]


    Classification of Computer Systems

SIMD (Single Instruction, Multiple Data)

[Diagram: one control unit (C) issues a single instruction stream (IS) to multiple processing units (P), each operating on its own data stream (DS) from memory (M)]


    Classification of Computer Systems

MISD (Multiple Instruction, Single Data)

[Diagram: multiple control units (C), each issuing its own instruction stream (IS) to its own processing unit (P), all operating on a single data stream (DS) from memory (M)]


    Classification of Computer Systems

MIMD (Multiple Instruction, Multiple Data)

[Diagram: multiple control units (C), each issuing its own instruction stream (IS) to its own processing unit (P), each operating on its own data stream (DS) from memory (M)]


Feng's Classification

[Plot: word length (1 to 64) versus bit-slice length (1 to 16K); machines plotted include PDP-11, IBM 370, CRAY-1, Illiac IV, C.mmP, PEPE, STARAN, MPP]

Feng's classification gives the degree of parallelism, without saying how it is achieved: the bit width of the processor (n) and the bit-slice width (m). The maximum degree of parallelism is P = m × n.
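As a quick illustration of the formula, here is a minimal Python sketch; the machine coordinates in the example are the ones usually quoted for this plot and should be treated as illustrative, not as values taken from these slides:

    # Feng's measure of parallelism: maximum degree P = m * n,
    # where n = word length (bits) and m = bit-slice length.
    def feng_parallelism(n_word_length, m_bit_slice):
        return n_word_length * m_bit_slice

    # Illustrative coordinates (commonly cited, assumed here):
    print(feng_parallelism(64, 64))     # Illiac IV: 64-bit words, 64 PEs -> 4096
    print(feng_parallelism(1, 16384))   # MPP: 1-bit slices, 16K PEs -> 16384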


Händler's Classification

It explicitly gives the type of parallelism at three levels of processor hardware: the program control unit (K), the ALU (D), and the elementary logic circuit level (W):

< K × K', D × D', W × W' >

where the unprimed values give the degree of parallelism at the control, data, and word levels, and the primed ("dash") values give the degree of pipelining at each level.

Examples: TI-ASC; CDC 6600 (the "×" operator composes the central processor with its I/O processors); C.mmP (the "+" operator combines its alternative operating modes); PEPE; Cray-1.


    Modern Classification

Based on the kind of parallelism exploited and the granularity of the functional parallelism utilized.

Parallel architectures divide into data-parallel architectures and function-parallel architectures.

Data-parallel architectures may be vector architectures, associative and neural architectures, SIMDs, or systolic architectures.

Function-parallel architectures may exploit parallelism at the instruction level (pipelined, VLIW, superscalar), the thread level, or the process level (MIMDs: distributed-memory MIMD and shared-memory MIMD).


    Why Study Computer Architecture

Aren't they fast enough already? Are they?

Fast enough to do everything we will EVER want? AI, protein sequencing, graphics...

Is speed the only goal?

Power: heat dissipation + battery life

Cost

Reliability

Etc.

Answer #1: requirements are always changing.


    Why Study Computer Architecture

Annual technology improvements (approx.):

Logic: density +25%, speed +20%

DRAM (memory): density +60%, speed +4%

Disk: density +25%, speed +4%

Designs change even if requirements are fixed. But the requirements are not fixed.

Answer #2: the technology playing field is always changing.


    Example of Changing Designs

Having, or not having, caches:

1970: 10K transistors on a single chip; DRAM faster than logic -> having a cache is bad.

1990: 1M transistors; logic is faster than DRAM -> having a cache is good.

2000: 600M transistors -> multiple levels of cache and multiple CPUs.

Will caches ever be a bad idea again?


    Types of Parallelism

Available and utilized parallelism:

Available parallelism means the parallelism available in programs (or in problem solutions).

Utilized parallelism means the parallelism occurring during execution.

Problem solutions may contain two different kinds of available parallelism:

Functional parallelism, which arises from the logic of a problem solution.

Data parallelism, which comes from using data structures that allow parallel operations on their elements, such as vectors or matrices.


    Levels of Parallelism

Programs written in imperative languages may embody functional parallelism at different levels:

Parallelism at the instruction level

Parallelism at the loop level

Parallelism at the procedure level

Parallelism at the program level

Utilization of functional parallelism:

Available parallelism can be utilized by architectures, compilers, and operating systems conjointly to speed up computation.

Available functional parallelism can be utilized at four levels: instruction, thread, process, and user level.


    Levels of Parallelism

Utilization of functional parallelism:

It is quite natural to utilize the available functional parallelism that is inherent (hidden) in a conventional sequential program at the instruction level, by executing instructions in parallel.

This can be achieved by means of architectures capable of parallel instruction execution (called ILP architectures).

The parallelism must be detected either by a dedicated compiler or by the ILP architecture itself.

Available loop-level and procedure-level parallelism will often be utilized in the form of threads and processes.


    Levels of Parallelism

Utilization of functional parallelism (continued):

There are two different ways to utilize threads and processes: specialized architectures (multi-threaded and MIMD architectures), or architectures that run threads and processes in sequence under the supervision of a multi-threaded or multitasking OS.


    Levels of Parallelism

Utilization of data parallelism:

Available data parallelism can be utilized in two different ways.

One is to exploit data parallelism directly, by dedicated architectures that permit parallel operations on data elements (data-parallel architectures).

The other possibility is to convert data parallelism into functional parallelism, by expressing parallel-executable operations on data elements in a sequential manner (using a loop construct).


    Performance Terminology

X is n times faster than Y means:

Execution time_Y / Execution time_X = n

X is m% faster than Y means:

(Execution time_Y - Execution time_X) / Execution time_X × 100% = m
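A minimal Python sketch of these two definitions (the function and variable names are my own, not from the slides):

    # "X is n times faster than Y" and "X is m% faster than Y",
    # exactly as defined above.
    def times_faster(time_x, time_y):
        return time_y / time_x

    def percent_faster(time_x, time_y):
        return (time_y - time_x) / time_x * 100.0

    # Example: X takes 2 s, Y takes 3 s.
    print(times_faster(2.0, 3.0))    # 1.5   -> X is 1.5 times faster than Y
    print(percent_faster(2.0, 3.0))  # 50.0  -> X is 50% faster than Y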


Amdahl's Law

Amdahl's Law is used to find the maximum expected improvement to an overall system when only part of the system is improved.

It is often used in parallel computing to predict the theoretical maximum speedup using multiple processors.


Amdahl's Law: Computing Speedup

Speedup due to an enhancement E:

Speedup(E) = Execution time without E (before) / Execution time with E (after) = Time_before / Time_after

Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected. What are Execution time_after and Speedup(E)?


Amdahl's Law

ExTime_after = ExTime_before × [(1 - F) + F/S]

Speedup(E) = ExTime_before / ExTime_after = 1 / [(1 - F) + F/S]


Amdahl's Law

In the parallel-processing form:

ExTime_after = ExTime_before × [(1 - P) + P/N]

Speedup(E) = ExTime_before / ExTime_after = 1 / [(1 - P) + P/N]

P: portion of a program that can be made parallel; N: number of processors.


Amdahl's Law: An Example

Q: Floating-point instructions are improved to run 2× faster, but only 10% of the execution time is FP ops. What are the execution time and speedup after the improvement?

Ans: F = 0.1, S = 2

ExTime_after = ExTime_before × [(1 - 0.1) + 0.1/2] = 0.95 × ExTime_before

Speedup = ExTime_before / ExTime_after = 1 / 0.95 = 1.053

    Read examples in the book!
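The law is easy to check numerically; here is a minimal Python sketch (names are my own) that reproduces the example above and the parallel form:

    # Amdahl's Law: fraction F of the task is sped up by factor S.
    def amdahl_speedup(f, s):
        return 1.0 / ((1.0 - f) + f / s)

    # The FP example above: F = 0.1, S = 2.
    print(amdahl_speedup(0.1, 2))      # 1.0526... (rounded to 1.053 above)

    # Parallel form: fraction P run on N processors (F = P, S = N).
    # Even with N = 1000, a half-sequential program gains less than 2x.
    print(amdahl_speedup(0.5, 1000))   # 1.996...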


    Gustafson's Law

Gustafson's Law (also known as the Gustafson-Barsis law) is a law in computer science which says that problems with large, repetitive data sets can be efficiently parallelized.

S(P) = P - α × (P - 1)

P: number of processors; α: non-parallelizable fraction of the work.
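A minimal Python sketch of the formula (the contrast with Amdahl's fixed-size bound is my own illustration, not from the slides):

    # Gustafson's Law: scaled speedup with P processors and
    # non-parallelizable fraction alpha.
    def gustafson_speedup(p, alpha):
        return p - alpha * (p - 1)

    # With 5% serial work, 100 processors give ~95x scaled speedup,
    # whereas Amdahl's fixed-size bound for the same fraction is ~16.8x.
    print(gustafson_speedup(100, 0.05))  # 95.05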


    CPU Performance

The Fundamental Law:

CPU time (seconds/program) = (instructions/program) × (cycles/instruction) × (seconds/cycle)

Three components of CPU performance:

Instruction count

CPI

Clock cycle time

Which design levels affect which components:

                          Inst. Count   CPI   Clock
    Program                    X
    Compiler                   X         X
    Instruction Set Arch.      X         X      X
    Organization (Arch)                  X      X
    Physical Design                             X
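The law translates directly into code; a minimal sketch with hypothetical numbers:

    # CPU time = instruction count x CPI x cycle time.
    def cpu_time(inst_count, cpi, cycle_time_s):
        return inst_count * cpi * cycle_time_s

    # Hypothetical machine: 10^9 instructions, CPI = 1.5, 1 GHz clock.
    print(cpu_time(1e9, 1.5, 1e-9))  # 1.5 seconds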


CPI: Cycles per Instruction

Let F_i be the frequency of type-i instructions in a program. Then the average CPI is:

CPI = Total cycles / Total instruction count = Σ (i = 1 to n) CPI_i × F_i, where F_i = IC_i / Total instruction count

CPU time = Cycle time × Σ (i = 1 to n) CPI_i × IC_i

Example:

    Instruction type   ALU   Load   Store   Branch
    Frequency          43%   21%    12%     24%
    Clock cycles       1     2      2       2

average CPI = 0.43×1 + 0.21×2 + 0.12×2 + 0.24×2 = 0.43 + 0.42 + 0.24 + 0.48 = 1.57 cycles/instruction
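The weighted sum can be checked with a few lines of Python (the data are taken from the table above):

    # Average CPI = sum over instruction types of CPI_i * F_i.
    def average_cpi(freqs, cycles):
        return sum(f * c for f, c in zip(freqs, cycles))

    #          ALU   Load  Store Branch
    freqs  = [0.43, 0.21, 0.12, 0.24]
    cycles = [1,    2,    2,    2]
    print(average_cpi(freqs, cycles))  # 1.57 cycles/instruction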


    Example

Instruction mix of a RISC architecture:

    Inst.    ALU   Load   Store   Branch
    Freq.    50%   20%    10%     20%
    C.C.     1     2      2       2

Should we add a register-memory ALU instruction format (one operand in a register, one operand in memory)?

The new instruction will take 2 clock cycles, but will also increase branches to 3 clock cycles.

Q: What fraction of loads must be eliminated for this to pay off?


    Solution

Exec time = Instr. count × CPI × Cycle time

    Instr.     F_i    CPI_i   CPI_i×F_i   |   I_i     CPI_i   CPI_i×I_i
    ALU        .5     1       .5          |   .5-X    1       .5-X
    Load       .2     2       .4          |   .2-X    2       .4-2X
    Store      .1     2       .2          |   .1      2       .2
    Branch     .2     2       .4          |   .2      3       .6
    Reg/Mem                               |   X       2       2X
    Total      1.0    CPI = 1.5           |   1-X     CPI = (1.7-X)/(1-X)

Instr. count_old × CPI_old × Cycle time_old >= Instr. count_new × CPI_new × Cycle time_new

1.0 × 1.5 >= (1-X) × (1.7-X)/(1-X)

X >= 0.2

ALL loads must be eliminated for this to be a win!
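A minimal Python check of the break-even calculation above (X is the fraction of the original mix converted into the new reg/mem instruction):

    # New relative execution time as a function of X, per the table above:
    # cycles = (.5-X)*1 + (.2-X)*2 + .1*2 + .2*3 + X*2 = 1.7 - X,
    # with instruction count 1 - X and cycle time unchanged.
    def new_relative_time(x):
        return (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2

    old_relative_time = 1.0 * 1.5          # instruction count x CPI
    for x in (0.1, 0.2):
        print(x, new_relative_time(x))     # 1.6 at X=0.1; 1.5 at X=0.2 (break-even)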


    Improve Memory System

All instructions require an instruction fetch; only a fraction require a data fetch/store. So optimize instruction access over data access.

Programs exhibit locality:

Spatial locality

Temporal locality

Access to small memories is faster, so provide a storage hierarchy such that the most frequent accesses are to the smallest (closest) memories:

Registers - Cache - Memory - Disk/Tape


    Benchmarks

The program is the unit of work. There are millions of programs; not all are the same, and most are very different. Which ones should we use?

Benchmarks: standard programs for measuring or comparing performance.

They should be representative of programs people care about, and repeatable!


    Choosing Programs to Evaluate Perf.

Toy benchmarks: e.g., quicksort, puzzle. No one really runs them. Scary fact: they were used to prove the value of RISC in the early 80s.

Synthetic benchmarks: attempt to match the average frequencies of operations and operands in real workloads, e.g., Whetstone, Dhrystone. Often slightly more complex than kernels, but they do not represent real programs.

Kernels: the most frequently executed pieces of real programs, e.g., the Livermore Loops. Good for focusing on individual features, not the big picture; they tend to over-emphasize the target feature.

Real programs: e.g., gcc, spice; SPEC89, 92, 95, SPEC2000 (Standard Performance Evaluation Corporation); TPC-C, TPC-D.


Networking benchmarks: Netbench, Commbench. Applications: IP forwarding, TCP/IP, SSL, Apache, SpecWeb.

Commbench: www.ecs.umass.edu/ece/wolf/nsl/software/cb/index.html

Execution-driven simulators: Simplescalar - http://www.simplescalar.com/

NepSim - http://www.cs.ucr.edu/~yluo/nepsim/

    MIPS and MFLOPS

MIPS: millions of instructions per second.

MIPS = Instruction count / (CPU time × 10^6) = Clock rate / (CPI × 10^6)

Easy to understand and to market.

Instruction-set dependent, so it cannot be used to compare across machines.

Program dependent; it can vary inversely to performance! (Why? Read the book.)

MFLOPS: millions of FP operations per second.

Less compiler dependent than MIPS.

But not all FP ops are implemented in hardware on all machines, and not all FP ops have the same latency.

Normalized MFLOPS: uses an equivalence table to even out the various latencies of FP ops.
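A minimal Python sketch of the MIPS formula, reusing the CPI from the earlier example (the 500 MHz clock is a hypothetical value, not from the slides):

    # MIPS = clock rate / (CPI * 10^6).
    def mips(clock_hz, cpi):
        return clock_hz / (cpi * 1e6)

    # Hypothetical 500 MHz machine with the CPI of 1.57 computed earlier.
    print(mips(500e6, 1.57))  # ~318 MIPS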