Transcript of Capp 01
7/27/2019 Capp 01
1/54
1
CS 606: Computer Architecture and Parallel Processing
N. Sarma
Lecture 1
What is *Computer Architecture*
Computer Architecture =
Instruction Set Architecture +
Organization +
Hardware
What is *Computer Architecture*
The term Computer Architecture was coined in 1964 by the chief architects of the IBM System/360 as
"the structure of a computer that a machine language
programmer must understand to write a correct
(timing independent) program for a machine."
By this definition, it comprises
the definition of registers and memory,
the instruction set, instruction formats, and addressing modes, and
the actual coding of the instructions,
without implementation (h/w structure) and realization (logic
technology, packaging etc.)
What is *Computer Architecture*
A more recent interpretation uses a number of levels of
increasing abstraction; at each level, the architecture
is described by
the underlying computational model,
the level of consideration/interest (micromachine level,
processor level and computer system level), and
the scope of interest (functional specification or
implementation), i.e. abstract architecture versus concrete
architecture.
What is *Computer Architecture*
An Abstract Architecture is a black-box specification, either
from the programmer's point of view (programming model) or
from the hardware designer's point of view (hardware
model, with additional specs such as interface protocols).
A Concrete Architecture can be considered from two
different points of view:
Logic design (the logical components used, e.g. registers, EUs,
their interconnections, and the sequence of information transfers); it is an
abstraction of the physical design and precedes the physical design.
Physical design (based on concrete circuit elements,
specifications of circuit elements with signals, interconnections, and
declaration of initiated signal sequences).
What is *Computer Architecture*
A logic design is usually described informally by
means of a block diagram, or formally by means of an
ADL (Architecture Description Language).
A logic design represented in an ADL is passed to a CAD
package as input, and a physical design is typically the
output of a CAD package.
Example architectures
Concrete architectures of computer systems: based on
processor-level building blocks, their specifications,
interconnections and the operation of the whole system.
What is *Computer Architecture*
Example architectures
Abstract architecture of a processor: programming model
(ISA, or "architecture"), hardware model (programming interface,
interrupt interface, I/O interface).
Concrete architecture of a processor (also called
microarchitecture): the microarchitecture of a processor is usually
given as a logic design, described by a block diagram, through
the specification of a set of functional units (register blocks,
buses, EUs, etc.) and their interconnections, and by declaring the
operation of the whole processor.
The microarchitecture as a physical design is usually detailed in
technical documentation (proprietary).
Formal description of a concrete architecture: for a precise, unambiguous
representation and for verification, simulation and analysis.
Computational Model
It can be observed that computer architecture classes and
programming language classes correspond to one another:
e.g. the von Neumann architecture corresponds to imperative languages,
and the reduction architecture to functional languages.
Architectures and languages have a common paradigm or
foundation, called a Computational Model,
e.g. the von Neumann computational model and the applicative
computational model.
A computational model represents a higher level of abstraction than
architecture and language:
Programming language: a specification tool for formulating a
computational task using a computational model.
Architecture: a tool to execute a given computational task.
Computational Model
A Computational Model comprises the following three
abstractions:
The basic items of computation:
a specification of the items computation refers to, and the kinds of
computations performed on them, e.g. variables in programming
languages, memory/register addresses in architectures.
The problem description model:
the style (e.g. procedural or declarative) and method of problem
description.
The execution model:
the interpretation of the computation, the execution semantics, and the
control of the execution sequence.
Computational Model
The control of the execution sequence may be:
Control-driven: the execution sequence is implicitly given by the order of the
instructions or by explicit control instructions.
Data-driven: an operation is activated as soon as the needed input data are
available (also called eager evaluation; used in the dataflow computational
model).
Demand-driven: operations are activated only when their execution is
needed to achieve the final result (also called lazy evaluation; used in the
applicative computational model).
The Instruction Set: a Critical Interface
[Figure: software above, hardware below, with the instruction set as the interface between them]
The actual programmer visible instruction set
Hardware
Machine specifics:
Feature size (10 microns in 1971 to 0.18 microns in 2001):
the minimum size of a transistor or a wire in either the x or y
dimension.
Logic designs
Packaging technology
Clock rate
Supply voltage
Applications and Requirements
Scientific/numerical (weather prediction, molecular modeling): need large memory, floating-point arithmetic.
Commercial (inventory, payroll, web serving, e-commerce): need integer arithmetic, high I/O.
Embedded (automobile engines, microwaves, PDAs): need low power, low cost, interrupt-driven operation.
Home computing (multimedia, games, entertainment): need high data bandwidth, graphics.
Classes of Computers
High performance: supercomputers (Cray T-90), massively parallel computers (Cray T3E).
Balanced cost/performance: workstations (SPARCstations), servers (SGI Origin, UltraSPARC), high-end PCs (Pentium quads).
Low cost/power: low-end PCs, laptops, PDAs (mobile Pentiums).
Classification of Computer Systems
Purpose: to provide a basis for information ordering, for predicting properties of an architecture, and for explanation.
Flynn's classification [1966]: systems are classified by the
number of concurrent instruction streams and data
streams present in the computer architecture.
It relies on several architectural components:
instruction memory (IM), data memory (DM),
control unit (CU), processing unit (PU),
instruction stream (IS), data stream (DS).
Classification of Computer Systems
SISD (Single Instruction, Single Data): a single
processor (a uniprocessor) executes a single
instruction stream, operating on data stored in
a single memory.
Instruction prefetching and pipelined execution of
instructions are common examples found in most
modern SISD computers.
[Figure: one control unit (C) driving one processing unit (P) connected to memory (M); one IS, one DS]
Classification of Computer Systems
SIMD (Single Instruction, Multiple Data)
[Figure: one control unit (C) broadcasting a single IS to multiple processing units (P), each with its own DS to memory (M)]
Classification of Computer Systems
MISD (Multiple Instruction, Single Data)
[Figure: multiple control units (C), each feeding its own IS to its own processing unit (P); all P operate on a single DS from memory (M)]
Classification of Computer Systems
MIMD (Multiple Instruction, Multiple Data)
[Figure: multiple control units (C), each feeding its own IS to its own processing unit (P), each P with its own DS to memory (M)]
Feng's Classification
[Figure: machines plotted by word length (1 to 64, horizontal axis) versus bit-slice length (1 to 16K, vertical axis): PDP11, IBM370, CRAY-1, IlliacIV, STARAN, MPP, C.mmP, PEPE]
Classifies by the degree of parallelism, without saying how it is
offered: the bit width of the processor (n) and the bit-slice width (m).
Maximum degree of parallelism D = m * n.
Händler's Classification
It explicitly gives the type of parallelism at
3 levels of processor hardware: program control unit (K),
ALU (D), elementary logic circuit level (W).
< K x K' , D x D' , W x W' >
K, D, W give the degree of parallelism at the control, data and word
levels; the dashed values give the degree of pipelining at each level.
Example machines classified this way include the TI-ASC, CDC 6600 (with I/O),
C.mmP, PEPE and Cray-1.
Modern Classification
Based on the kind of parallelism exploited and the
granularity of functional parallelism utilized,
parallel architectures divide into data-parallel and function-parallel
architectures.
Data-parallel architectures may be
vector architectures, associative and neural
architectures, SIMDs, or systolic architectures.
Function-parallel architectures may be
instruction-level (pipelined, VLIWs, superscalar),
thread-level, or process-level (MIMDs: distributed-
memory MIMD and shared-memory MIMD).
Why Study Computer Architecture
Aren't they fast enough already? Are they?
Fast enough to do everything we will EVER want?
(AI, protein sequencing, graphics)
Is speed the only goal? No:
power (heat dissipation + battery life),
cost,
reliability,
etc.
Answer #1: requirements are always changing.
Why Study Computer Architecture
Annual technology improvements (approx.):
Logic: density +25%, speed +20%
DRAM (memory): density +60%, speed +4%
Disk: density +25%, speed +4%
Designs change even if requirements are
fixed. But the requirements are not fixed.
Answer #2: the technology playing field is always changing.
Example of Changing Designs
Having, or not having, caches:
1970: 10K transistors on a single chip, DRAM
faster than logic, so having a cache is bad.
1990: 1M transistors, logic is faster than DRAM, so having a cache is good.
2000: 600M transistors -> multiple levels of cache
and multiple CPUs.
Will caches ever be a bad idea again?
Types of Parallelism
Available and utilized parallelism:
Available parallelism means parallelism present in
programs (or in problem solutions).
Utilized parallelism means parallelism occurring during
execution.
Problem solutions may contain two different kinds of
available parallelism:
functional parallelism, which arises from the logic of a problem
solution, and
data parallelism, which comes from using data structures that allow
parallel operations on their elements, such as vectors or
matrices.
Levels of Parallelism
Programs written in imperative languages may
embody functional parallelism at different levels:
parallelism at the instruction level,
parallelism at the loop level,
parallelism at the procedure level, and
parallelism at the program level.
Utilization of functional parallelism:
Available parallelism can be utilized by architectures,
compilers and the OS conjointly to speed up computation.
Available functional parallelism can be utilized at four
levels: instruction, thread, process and user level.
Levels of Parallelism
Utilization of functional parallelism:
It is quite natural to utilize available functional parallelism,
which is inherent (hidden) in a conventional sequential
program, at the instruction level by executing instructions in
parallel. This can be achieved by means of architectures (called ILP
architectures) capable of parallel instruction execution.
It must be detected either by a dedicated compiler or by the ILP
architecture itself.
Available loop- and procedure-level parallelism will often be
utilized in the form of threads and processes.
There are two different ways to utilize threads and processes: specialized
architectures (multi-threaded and MIMD architectures), or...
Levels of Parallelism
Utilization of functional parallelism (continued):
The two different ways to utilize threads and processes are specialized
architectures (multi-threaded and MIMD architectures), or
architectures that run threads and processes in sequence under
the supervision of a multi-threaded or multitasking OS.
Levels of Parallelism
Utilization of data parallelism:
Available data parallelism can be utilized in two different
ways.
One is to exploit data parallelism directly by dedicated
architectures that permit parallel operations on data
elements (data-parallel architectures).
The other possibility is to convert data parallelism into
functional parallelism by expressing parallel executable
operations on data elements in a sequential manner (using
a loop construct).
Performance Terminology
X is n times faster than Y means:
Execution time_Y / Execution time_X = n

X is m% faster than Y means:
(Execution time_Y - Execution time_X) / Execution time_X x 100% = m
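The two definitions above can be sketched as a pair of helper functions (a minimal illustration; the function names and sample times are ours, not from the slides):

```python
def times_faster(time_y, time_x):
    """X is n times faster than Y:  n = time_Y / time_X."""
    return time_y / time_x

def percent_faster(time_y, time_x):
    """X is m% faster than Y:  m = (time_Y - time_X) / time_X * 100."""
    return (time_y - time_x) / time_x * 100

# Example: Y takes 15 s, X takes 10 s.
print(times_faster(15, 10))    # 1.5  -> X is 1.5 times faster than Y
print(percent_faster(15, 10))  # 50.0 -> X is 50% faster than Y
```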
Amdahl's Law
To find the maximum expected
improvement to an overall system when
only part of the system is improved.
It is often used in parallel computing to
predict the theoretical maximum speedup
using multiple processors.
Amdahl's Law: Compute Speedup
Speedup is due to an enhancement (E):
Speedup(E) = Time_before / Time_after
(execution time without E / execution time with E)
Suppose that enhancement E accelerates a fraction F
of the task by a factor S, and the remainder of the task
is unaffected. What are the Execution time_after and
Speedup(E)?
Amdahl's Law
Execution time_after = ExTime_before x [(1 - F) + F/S]
Speedup(E) = ExTime_before / ExTime_after
           = 1 / [(1 - F) + F/S]
Amdahl's Law
Execution time_after = ExTime_before x [(1 - P) + P/N]
Speedup(E) = ExTime_before / ExTime_after
           = 1 / [(1 - P) + P/N]
P: portion of a program that can be made parallel; N: number of
processors.
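The parallel form above can be checked numerically. A small sketch (the values chosen are ours, for illustration) shows how the serial fraction caps the speedup:

```python
def amdahl_speedup(p, n):
    """Speedup when fraction p of a program runs in parallel on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a 90%-parallel program is capped well below n:
print(round(amdahl_speedup(0.9, 10), 2))     # 5.26
print(round(amdahl_speedup(0.9, 1000), 2))   # 9.91
print(round(amdahl_speedup(0.9, 10**9), 2))  # 10.0 (the serial limit 1/(1-p))
```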
Amdahl's Law: An Example
Q: Floating-point instructions are improved to run 2X faster,
but only 10% of execution time is FP ops. What are
the execution time and speedup after the improvement?
Ans: F = 0.1, S = 2
ExTime_after = ExTime_before x [(1 - 0.1) + 0.1/2] = 0.95 ExTime_before
Speedup = ExTime_before / ExTime_after = 1 / 0.95 = 1.053
Read examples in the book!
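The arithmetic of this example can be verified directly (a quick check, not part of the original slide):

```python
F, S = 0.1, 2                # 10% of time in FP ops, FP sped up 2x
ratio = (1 - F) + F / S      # ExTime_after / ExTime_before
speedup = 1 / ratio

print(round(ratio, 2))    # 0.95
print(round(speedup, 3))  # 1.053
```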
Gustafson's Law
Gustafson's Law (also known as the Gustafson-Barsis
law) is a law in computer science which says that
problems with large, repetitive data sets can be
efficiently parallelized:
S(P) = P - a x (P - 1)
P: number of processors; a: the non-parallelizable (serial) fraction.
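The scaled-speedup formula can be sketched as follows (the processor count and serial fraction are illustrative values of ours):

```python
def gustafson_speedup(p, alpha):
    """Scaled speedup on p processors; alpha is the serial (non-parallelizable) fraction."""
    return p - alpha * (p - 1)

# With a 5% serial fraction, 64 processors still give a scaled speedup near 61x:
print(round(gustafson_speedup(64, 0.05), 2))  # 60.85
```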
CPU Performance
The Fundamental Law
Three components of CPU performance:
instruction count,
CPI, and
clock cycle time.

CPU time (seconds/program) =
(instructions/program) x (cycles/instruction) x (seconds/cycle)

Which design level affects which component:
                          Inst. Count   CPI   Clock
Program                       X
Compiler                      X          X
Inst. Set Architecture        X          X      X
Arch (organization)                      X      X
Physical Design                                 X
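The fundamental equation reads naturally as a one-liner (the numbers below are made up for illustration):

```python
def cpu_time(instructions, cpi, cycle_time_s):
    """CPU time = instructions/program * cycles/instruction * seconds/cycle."""
    return instructions * cpi * cycle_time_s

# e.g. 1e9 instructions, CPI = 1.5, a 500 MHz clock (2 ns cycle):
t = cpu_time(1e9, 1.5, 2e-9)
print(round(t, 6))  # 3.0 seconds
```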
CPI - Cycles per Instruction
Let F_i be the frequency of type i instructions in a program.
Then, average CPI:
CPI = Total cycles / Total instruction count
CPI = sum over i of (CPI_i x F_i), where F_i = IC_i / Instruction count
CPU time = Cycle time x sum over i of (CPI_i x IC_i)

Example:
Instruction type   ALU   Load   Store   Branch
Frequency          43%   21%    12%     24%
Clock cycles       1     2      2       2

average CPI = 0.43x1 + 0.21x2 + 0.12x2 + 0.24x2
            = 0.43 + 0.42 + 0.24 + 0.48 = 1.57 cycles/instruction
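The average-CPI computation in the example above is just a weighted sum over the instruction mix:

```python
# Instruction mix from the table above: type -> (frequency, clock cycles)
mix = {
    "ALU":    (0.43, 1),
    "Load":   (0.21, 2),
    "Store":  (0.12, 2),
    "Branch": (0.24, 2),
}

avg_cpi = sum(freq * cycles for freq, cycles in mix.values())
print(round(avg_cpi, 2))  # 1.57 cycles/instruction
```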
Example
Given the instruction mix of a RISC architecture (below),
should we add a register-memory ALU instruction format
(one operand in a register, one operand in memory)?
The new instruction will take 2 clock cycles, but will also
increase the branches to 3 clock cycles.
Q: What fraction of loads must be eliminated for this
to pay off?

Inst.   ALU   Load   Store   Branch
Freq.   50%   20%    10%     20%
C.C.    1     2      2       2
Solution
Exec Time = Instr. Cnt. x CPI x Cycle time

Instr.    F_i   CPI_i   CPI_i x F_i  |  I_i     CPI_i   CPI_i x I_i
ALU       .5    1       .5           |  .5-X    1       .5-X
Load      .2    2       .4           |  .2-X    2       .4-2X
Store     .1    2       .2           |  .1      2       .2
Branch    .2    2       .4           |  .2      3       .6
Reg/Mem                              |  X       2       2X
Total     1.0   CPI = 1.5            |  1-X     CPI = (1.7-X)/(1-X)

Instr. Cnt_old x CPI_old x Cycle time_old >= Instr. Cnt_new x CPI_new x Cycle time_new
1.0 x 1.5 >= (1-X) x (1.7-X)/(1-X) = 1.7 - X
X >= 0.2
ALL loads must be eliminated for this to be a win!
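The break-even condition can be re-derived numerically. A sketch of the table above, where x is the fraction of instructions converted to the new register-memory format:

```python
def relative_exec_time(x):
    """Total cycles per original 1.0 instructions after converting fraction x.

    Each converted instruction (2 cc) replaces one ALU op (1 cc) and
    eliminates one load (2 cc); branches rise from 2 to 3 cc."""
    return (0.5 - x) * 1 + (0.2 - x) * 2 + 0.1 * 2 + 0.2 * 3 + x * 2

old_time = 0.5 * 1 + 0.2 * 2 + 0.1 * 2 + 0.2 * 2   # CPI_old = 1.5

print(round(old_time, 2))                 # 1.5
print(round(relative_exec_time(0.2), 2))  # 1.5 -> break-even exactly when all loads go
print(round(relative_exec_time(0.1), 2))  # 1.6 -> eliminating only half the loads loses
```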
Improve Memory System
All instructions require an instruction fetch; only a
fraction require a data fetch/store.
Optimize instruction access over data access.
Programs exhibit locality:
spatial locality and temporal locality.
Access to small memories is faster, so provide a storage hierarchy such that the most frequent
accesses are to the smallest (closest) memories:
Registers -> Cache -> Memory -> Disk/Tape
Benchmarks
A program is the unit of work. There are millions of programs,
not all the same; most are very different. Which ones to use?
Benchmarks: standard programs for measuring or comparing
performance. They should be representative of programs people care about,
and repeatable!
Choosing Programs to Evaluate Perf.
Toy benchmarks (e.g., quicksort, puzzle): no one really runs them. Scary fact: used to prove the value of RISC
in the early 80s.
Synthetic benchmarks: attempt to match average frequencies of operations and
operands in real workloads (e.g., Whetstone, Dhrystone). Often slightly more complex than kernels, but do not represent
real programs.
Kernels: the most frequently executed pieces of real programs
(e.g., Livermore loops). Good for focusing on individual features, not the big picture; tend to over-emphasize the target feature.
Real programs: e.g., gcc, spice, SPEC89, 92, 95, SPEC2000 (Standard
Performance Evaluation Corporation), TPC-C, TPC-D.
Networking benchmarks: NetBench, CommBench.
Applications: IP forwarding, TCP/IP, SSL, Apache, SPECweb.
CommBench: www.ecs.umass.edu/ece/wolf/nsl/software/cb/index.html
Execution-driven simulators:
SimpleScalar: http://www.simplescalar.com/
NepSim: http://www.cs.ucr.edu/~yluo/nepsim/
MIPS and MFLOPS
MIPS: millions of instructions per second:
MIPS = Inst. count / (CPU time x 10^6) = Clock rate / (CPI x 10^6)
Easy to understand and to market, but
instruction-set dependent, so it cannot be used across machines, and
program dependent; it can vary inversely to performance! (Why? Read the book.)
MFLOPS: millions of FP ops per second. Less compiler dependent than MIPS, but
not all FP ops are implemented in h/w on all machines, and not all FP ops have the same latencies.
Normalized MFLOPS: uses an equivalence table to even
out the various latencies of FP ops.
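The MIPS formula can be sketched directly (the clock rate and CPI below are illustrative, reusing the 1.57 average CPI from the earlier mix):

```python
def mips(clock_rate_hz, cpi):
    """MIPS = clock rate / (CPI * 10**6).

    Note this says nothing about how much work each instruction does,
    which is why MIPS cannot be compared across instruction sets."""
    return clock_rate_hz / (cpi * 1e6)

# A 500 MHz machine with an average CPI of 1.57:
print(round(mips(500e6, 1.57), 1))  # 318.5
```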