
Transcript of Final Review - George Mason University

1 Copyright © 2012, Elsevier Inc. All rights reserved.

Final Review

Computer Architecture A Quantitative Approach, Fifth Edition

2

How Does a Computer Work?

  What is the Von Neumann computer architecture?
  What are the elements of a Von Neumann computer?
  How do these elements interact to execute programs?

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

3

Von Neumann Architecture

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

Stored-program computer

(figure: Data and Program held in memory, an ALU, and Registers, connected by a system bus)

4

Von Neumann Architecture

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

Stored-program computer

(figure: Data and Program in memory, ALU, and Registers, connected by a system bus, annotated as follows)

Sequence of instructions: (1) Load/Store from memory to/from registers, (2) Arithmetic/Logical Ops., (3) Conditional and unconditional branches.

General registers, FP registers, Program counter, etc.

5 Copyright © 2012, Elsevier Inc. All rights reserved.

Single Processor Performance

What were the architectural evolutions away from ILP, and why?

Introduction

6 Copyright © 2012, Elsevier Inc. All rights reserved.

Single Processor Performance

RISC
Move to multiprocessors: from ILP to DLP and TLP

Introduction

7

Moore’s Law?

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

Source: Wikimedia Commons

8

Moore’s Law

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

Source: Wikimedia Commons

The number of transistors on integrated circuits doubles approximately every two years. (Gordon Moore; observed in 1965, with the doubling period revised to two years in 1975)

9

Instruction Level Parallelism?

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

10

Instruction Level Parallelism

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

In assembly language:

* e = a + b
001 LOAD A,R1
002 LOAD B,R2
003 ADD R1,R2,R3
004 STO R3,E
* f = c + d
005 LOAD C,R4
006 LOAD D,R5
007 ADD R4,R5,R6
008 STO R6,F
* g = e * f
009 MULT R3,R6,R7
010 STO R7,G

Instructions that can potentially be executed in parallel:

  Group 1: 001, 002, 005, 006
  Group 2: 003, 007
  Group 3: 004, 008, 009
  Group 4: 010

11

Data Level Parallelism (DLP)?

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

12

Data Level Parallelism (DLP)

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

(figure: the same task applied in parallel to separate pieces of data)

13

Task Level Parallelism (TLP)?

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

14

Task Level Parallelism (TLP)

Copyright © 2013, Daniel A. Menasce. All rights reserved.

Introduction

(figure: multiple independent tasks executing in parallel)

15 Copyright © 2012, Elsevier Inc. All rights reserved.

Power and Energy

  What units are used to measure power and energy, and how are they related?

  What is the effect of clock rate on power consumption and performance?

Trends in Power and Energy

16 Copyright © 2012, Elsevier Inc. All rights reserved.

Power and Energy

  What units are used to measure power and energy, and how are they related?
    Energy is measured in joules
    Power is measured in watts = joules/sec

  What is the effect of clock rate on power consumption and performance?
    Clock rate can be reduced dynamically to limit power consumption
    A lower clock rate reduces performance

Trends in Power and Energy

17 Copyright © 2012, Elsevier Inc. All rights reserved.

Principles of Computer Design?

Principles

18 Copyright © 2012, Elsevier Inc. All rights reserved.

Principles of Computer Design
  Take Advantage of Parallelism
    e.g., multiple processors, disks, memory banks, pipelining, multiple functional units
  Principle of Locality
    Reuse of data and instructions
  Focus on the Common Case
  Amdahl’s Law

Principles

19 Copyright © 2012, Elsevier Inc. All rights reserved.

The Processor Performance Equation

CPU time = Instruction count × CPI × Clock cycle time

where CPI is the (average) number of clock cycles per instruction.

Principles
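As a quick worked example (the numbers are made up): a program that executes 10^9 instructions with an average CPI of 1.5 on a 2 GHz clock (0.5 ns cycle time) takes 10^9 × 1.5 × 0.5 ns = 0.75 s of CPU time.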

20 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Design

Computer Architecture A Quantitative Approach, Fifth Edition

21 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Introduction

Why is it advantageous to organize memory as a hierarchy?

Give examples of levels in the memory hierarchy.

What is the memory performance gap?

22 Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Introduction

23 Copyright © 2012, Elsevier Inc. All rights reserved.

Placement of Blocks in a Cache
  n-way associative cache?
  direct-mapped cache?
  fully associative cache?

Introduction

24 Copyright © 2012, Elsevier Inc. All rights reserved.

Placement of Blocks in a Cache
  Set associative: a block is mapped to a set and can be placed anywhere within that set
    Finding a block: map the block address to a set and search the set (usually in parallel) to find the block
  n blocks per set: n-way set-associative cache
  One block per set (n = 1): direct-mapped cache
  One set per cache: fully associative cache

Introduction

25 Copyright © 2013, Daniel A. Menasce. All rights reserved.

Processor/Cache Addressing

2-way associative cache.

(figure: a processor addressing a 16-word memory (addresses 0000-1111) through a 2-way set-associative cache; memory blocks B0-B7 map alternately to Set 0 and Set 1; the address is split into TAG <2>, Index <1>, and Block Offset <1>; the INDEX selects the set, and the TAG is used to check for a cache hit if the valid bit (VB) is 1)

Introduction
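A minimal C sketch of the address split in the figure (the 2-bit tag / 1-bit index / 1-bit offset widths come from the slide; everything else, including the sample address, is illustrative):

#include <stdio.h>

int main(void) {
    /* 4-bit address laid out as [tag:2][index:1][offset:1] */
    unsigned addr   = 0xB;                 /* 1011 */
    unsigned offset = addr & 0x1;          /* low bit: block offset */
    unsigned set    = (addr >> 1) & 0x1;   /* next bit: selects the set */
    unsigned tag    = (addr >> 2) & 0x3;   /* top two bits: compared on lookup */
    printf("addr=%u -> tag=%u set=%u offset=%u\n", addr, tag, set, offset);
    return 0;
}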

26 Copyright © 2012, Elsevier Inc. All rights reserved.

Writing to a Cache
  Writing to cache: two strategies

  Write-through?

  Write-back?

Introduction

27 Copyright © 2012, Elsevier Inc. All rights reserved.

Writing to a Cache
  Writing to cache: two strategies
    Write-through: immediately update lower levels of hierarchy
    Write-back: only update lower levels of hierarchy when an updated block is replaced

Introduction

28

  Note that speculative and multithreaded processors may execute other instructions during a miss
    This reduces the performance impact of misses

Copyright © 2012, Elsevier Inc. All rights reserved.

Memory Hierarchy Basics Introduction

MemoryStallCycles = NumberOfMisses × MissPenalty
                  = IC × (Misses / Instruction) × MissPenalty
                  = IC × (MemoryAccesses / Instruction) × MissRate × MissPenalty
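For instance, with hypothetical values IC = 10^9, 1.5 memory accesses per instruction, a 2% miss rate, and a 100-cycle miss penalty: MemoryStallCycles = 10^9 × 1.5 × 0.02 × 100 = 3 × 10^9 cycles, i.e., three stall cycles per instruction on average.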

29 Copyright © 2012, Elsevier Inc. All rights reserved.

Virtual Memory
  What is virtual memory?

Virtual Memory and Virtual Machines

30 Copyright © 2012, Elsevier Inc. All rights reserved.

Virtual Memory
  An abstraction of the physical address space in which each process’s address space (the virtual address space) may be bigger than the physical address space.
  The virtual address space is divided into pages, some of which are in main memory and some on disk (the paging disk).
  Virtual addresses are mapped to physical addresses by the hardware using a Translation Lookaside Buffer (TLB).
  If a referenced page is not in memory, an interrupt is generated to the operating system, which starts the process of bringing the page into memory.

Virtual Memory and Virtual Machines
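A minimal C sketch of the lookup path just described (a direct-mapped 16-entry TLB with 4 KiB pages; the structure and sizes are my own illustration, not from the slides):

#include <stdint.h>

#define PAGE_BITS 12             /* 4 KiB pages */
#define TLB_ENTRIES 16

struct tlb_entry { uint64_t vpn; uint64_t pfn; int valid; };
static struct tlb_entry tlb[TLB_ENTRIES];

/* Translate a virtual address; returns 0 on a TLB miss, in which
   case a real system would walk the page table (and possibly fault
   to the OS to bring the page in from the paging disk). */
int translate(uint64_t va, uint64_t *pa) {
    uint64_t vpn = va >> PAGE_BITS;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *pa = (e->pfn << PAGE_BITS) | (va & ((1u << PAGE_BITS) - 1));
        return 1;                /* TLB hit */
    }
    return 0;                    /* TLB miss */
}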

31 Copyright © 2012, Elsevier Inc. All rights reserved.

What are Virtual Machines?

Virtual Memory and Virtual Machines

32 Copyright © 2012, Elsevier Inc. All rights reserved.

Virtual Machines
  Allow different ISAs and operating systems to be presented to user programs
    “System Virtual Machines”
    SVM software is called “virtual machine monitor” or “hypervisor”
    Individual virtual machines run under the monitor are called “guest VMs”
  Sharing a computer among many unrelated users
  Supports isolation and security
  Enabled by raw speed of processors, making the overhead more acceptable

Virtual Memory and Virtual Machines

33

Instruction Level Parallelism Basics

34 Multi-cycle CPU

Single Cycle Datapath

(figure: a single-cycle datapath whose stages have latencies of 1 ns, 2 ns, 2 ns, 2 ns, 2 ns, and 2 ns)

35 Pipeline CS465

Pipelining

  What is it?
  What are pipeline hazards?

36 Pipeline CS465

Pipelining

  What is it?
    Pipelining is an implementation technique in which multiple instructions are overlapped in execution
  What are pipeline hazards?
    Resource, data, or control dependencies that delay the execution of instructions in the pipeline by one or more cycles

37 Chapter 4 — The Processor — 37

Hazards
  Situations that prevent starting the next instruction in the next cycle
  Structure hazards
    A required resource is busy
  Data hazard
    Need to wait for previous instruction to complete its data read/write
  Control hazard
    Deciding on control action depends on previous instruction

38 Chapter 4 — The Processor — 38

Pipeline Performance

(figure: instruction timing for a single-cycle design with Tc = 800 ps vs. a pipelined design with Tc = 200 ps)
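To make the figure's comparison concrete (assuming the classic five-stage pipeline, which the slide does not state): the single-cycle machine finishes one instruction every 800 ps, while the pipelined machine finishes one every 200 ps once the pipeline is full, a throughput gain of up to 800/200 = 4×, even though each individual instruction now takes 5 × 200 = 1000 ps of latency.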

39 Chapter 4 — The Processor — 39

Speculation
  “Guess” what to do with an instruction
    Start operation as soon as possible
    Check whether guess was right
      If so, complete the operation
      If not, roll back and do the right thing
  Examples
    Speculate on branch outcome
      Roll back if path taken is different
    Speculate on load
      Roll back if location is updated

40 Chapter 4 — The Processor — 40

Compiler/Hardware Speculation
  Compiler can reorder instructions
    e.g., move load before branch
    Can include “fix-up” instructions to recover from incorrect guess
  Hardware can look ahead for instructions to execute
    Buffer results until it determines they are actually needed
    Flush buffers on incorrect speculation

41

I/O & Storage

42

Features of a magnetic disk?

43

Disk Storage

  Nonvolatile, rotating magnetic storage

44

Disk Sectors and Access
  Each sector records
    Sector ID
    Data (512 bytes, 4096 bytes proposed)
    Error correcting code (ECC)
      Used to hide defects and recording errors
    Synchronization fields and gaps
  Access to a sector involves
    Queuing delay if other accesses are pending
    Seek: move the heads
    Rotational latency
    Data transfer
    Controller overhead
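As a worked example with made-up but typical numbers: reading a 512-byte sector on a 7200 RPM disk with a 4 ms average seek, a 100 MB/s transfer rate, 0.2 ms of controller overhead, and no queuing takes about 4 ms (seek) + 4.17 ms (average rotational latency, half of the 8.33 ms revolution time) + 0.005 ms (transfer) + 0.2 ms (controller) ≈ 8.4 ms.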

45

Features of Flash Storage?

46

Flash Storage

  Nonvolatile semiconductor storage
  100× – 1000× faster than disk
  Smaller, lower power, more robust
  But more $/GB (between disk and DRAM)

47

Bus Types?

48

Bus Types
  Processor-Memory buses
    Short, high speed
    Design is matched to memory organization
  I/O buses
    Longer, allowing multiple connections
    Specified by standards for interoperability
    Connect to the processor-memory bus through a bridge

49

Measuring I/O Performance
  I/O performance depends on
    Hardware: CPU, memory, controllers, buses
    Software: operating system, database management system, application
    Workload: request rates and patterns
  I/O system design can trade off between response time and throughput
    Measurements of throughput are often done with a constrained response time

50

I/O vs. CPU Performance
  Amdahl’s Law
    Don’t neglect I/O performance as parallelism increases compute performance
  Example
    Benchmark takes 90s CPU time, 10s I/O time
    Double the CPU speed every 2 years; I/O unchanged

Year  CPU time  I/O time  Elapsed time  % I/O time
now   90s       10s       100s          10%
+2    45s       10s       55s           18%
+4    23s       10s       33s           31%
+6    11s       10s       21s           47%
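The table is Amdahl's Law at work: the fixed 10 s of I/O means elapsed time can never drop below 10 s, so overall speedup is capped at 100/10 = 10× no matter how fast the CPU becomes.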

51

RAID?


52

RAID
  Redundant Array of Inexpensive (Independent) Disks
  Use multiple smaller disks (cf. one large disk)
    Parallelism improves performance
  Plus extra disk(s) for redundant data storage
  Provides a fault-tolerant storage system
    Especially if failed disks can be “hot swapped”
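A tiny C sketch of the redundancy idea behind parity-based RAID levels (a RAID 4/5-style layout of my own, not from the slides): the parity block is the XOR of the data blocks, so any single lost block can be rebuilt from the survivors.

#include <stdio.h>

int main(void) {
    unsigned char d0 = 0x5A, d1 = 0xC3, d2 = 0x0F;  /* data blocks */
    unsigned char p  = d0 ^ d1 ^ d2;                /* parity block */

    /* Suppose the disk holding d1 fails: XOR of everything else
       reconstructs it, because x ^ x = 0. */
    unsigned char rebuilt = p ^ d0 ^ d2;
    printf("d1=0x%02X rebuilt=0x%02X\n", d1, rebuilt);
    return 0;
}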

53 Copyright © 2012, Elsevier Inc. All rights reserved.

Data-Level Parallelism in Vector, SIMD, and GPU Architectures

Computer Architecture A Quantitative Approach, Fifth Edition

54 Copyright © 2012, Elsevier Inc. All rights reserved.

SIMD
  What is SIMD?
  What types of applications benefit from SIMD?
  Compare SIMD and MIMD with respect to energy efficiency.
  Do programmers need to worry about parallelism when using a SIMD model?

Introduction

55 Copyright © 2012, Elsevier Inc. All rights reserved.

SIMD
  SIMD architectures can exploit significant data-level parallelism for:
    matrix-oriented scientific computing
    media-oriented image and sound processors
  SIMD is more energy efficient than MIMD
    Only needs to fetch one instruction per data operation
    Makes SIMD attractive for personal mobile devices
  SIMD allows programmer to continue to think sequentially and achieve parallel speedups.

Introduction
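To illustrate that last point, a small C sketch (mine, not the slide's): DAXPY, the classic vector kernel, is written as an ordinary sequential loop, and a vectorizing compiler can map its independent iterations onto SIMD registers with no change to the source.

#include <stddef.h>

/* y = a*x + y: every iteration is independent, so the compiler
   may execute several iterations per SIMD instruction. */
void daxpy(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}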

56 Copyright © 2012, Elsevier Inc. All rights reserved.

Types of SIMD Parallelism?

Introduction

57 Copyright © 2012, Elsevier Inc. All rights reserved.

Types of SIMD Parallelism
  Vector architectures
  Multimedia SIMD instruction set extensions
  Graphics Processing Units (GPUs)

Introduction

58 Copyright © 2012, Elsevier Inc. All rights reserved.

Vector Architectures Basics
  Data elements:
  Operations:
  Application vector size ≠ vector register size:
  Conditional operations:
  Multidimensional arrays:
  Sparse arrays:

Vector Architectures

59 Copyright © 2012, Elsevier Inc. All rights reserved.

Vector Architectures Basics
  Data elements: “vector registers”
  Operations: vector-to-vector and vector-to-scalar
  Application vector size ≠ vector register size: use the Vector Length Register
  Conditional operations: use the vector mask register (allows parallelization of conditional operations)
  Multidimensional arrays: load/store with non-unit strides
  Sparse arrays: scatter-gather load/stores

Vector Architectures
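A short C sketch of the last two bullets (the arrays and the stride are illustrative, not from the slide): accessing a column of a row-major matrix is a non-unit-stride access, and indexed access to a sparse array is a gather.

#include <stddef.h>

/* Non-unit stride: sum column j of a row-major n x n matrix;
   consecutive elements are n doubles apart. */
double sum_column(const double *a, size_t n, size_t j)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i * n + j];
    return s;
}

/* Gather: y[i] = x[idx[i]]; a vector machine performs this with
   an indexed (gather) load. */
void gather(double *y, const double *x, const size_t *idx, size_t m)
{
    for (size_t i = 0; i < m; i++)
        y[i] = x[idx[i]];
}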

60 Copyright © 2012, Elsevier Inc. All rights reserved.

Roofline Performance Model
  Basic idea:
    Plot peak floating-point throughput as a function of arithmetic intensity
    Ties together floating-point performance and memory performance for a target machine
  Arithmetic intensity
    Floating-point operations per byte read

SIMD Instruction Set Extensions for Multimedia

61 Copyright © 2012, Elsevier Inc. All rights reserved.

Examples

Attainable GFLOPs/sec = Min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Perf.)

(figure: roofline plots for a vector processor with a 40 GFLOP/sec peak and a multicore computer with a 4 GFLOP/sec peak)

SIMD Instruction Set Extensions for Multimedia
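A quick worked example (the bandwidth figure is made up for illustration): if the 40 GFLOP/sec machine has a peak memory bandwidth of 16 GB/sec, a kernel with arithmetic intensity 0.5 FLOP/byte attains Min(16 × 0.5, 40) = 8 GFLOP/sec (memory-bound), while a kernel with intensity 4 FLOP/byte attains Min(16 × 4, 40) = 40 GFLOP/sec (compute-bound).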

62 Copyright © 2012, Elsevier Inc. All rights reserved.

SIMD Extensions for Multimedia
  Basic data type length vs. native word size:
  Arithmetic implementation approach:
  Limitations, compared to vector instructions:

SIMD Instruction Set Extensions for Multimedia

63 Copyright © 2012, Elsevier Inc. All rights reserved.

SIMD Extensions for Multimedia
  Basic data type length vs. native word size: narrower than the native word size
    E.g., 8 bits for primary colors and 8 bits for transparency
    E.g., 8- or 16-bit audio samples
  Arithmetic implementation approach: partition the carry chain of a 256-bit adder and perform simultaneous operations on short vectors of eight 32-bit, sixteen 16-bit, or thirty-two 8-bit operands
  Limitations, compared to vector instructions:
    Number of data operands encoded into op code
    No sophisticated addressing modes (strided, scatter-gather)
    No mask registers

SIMD Instruction Set Extensions for Multimedia

64 Copyright © 2012, Elsevier Inc. All rights reserved.

Graphics Processing Units
  Basic execution model:
  Programming model:

Graphical Processing Units

65 Copyright © 2012, Elsevier Inc. All rights reserved.

Graphics Processing Units
  Basic execution model: heterogeneous execution model
    CPU is the host, GPU is the device
  Programming model: C-like programming language for GPU
    Unifies all forms of GPU parallelism as the CUDA (Compute Unified Device Architecture) thread
    Programming model is “Single Instruction Multiple Thread” (SIMT)

Graphical Processing Units

66 Copyright © 2012, Elsevier Inc. All rights reserved.

Fermi Multithreaded SIMD Processor

Graphical Processing Units

67 Copyright © 2012, Elsevier Inc. All rights reserved.

Loop-Level Parallelism
  Compiler optimization:
  Loops and parallelism:

Detecting and Enhancing Loop-Level Parallelism

68 Copyright © 2012, Elsevier Inc. All rights reserved.

Loop-Level Parallelism
  Compiler optimization:
    The compiler analyzes source code to determine how the code can be parallelized
    The compiler determines when a loop can be parallelized or vectorized, how dependencies between loop iterations prevent a loop from being parallelized, and how these dependencies can be eliminated
  Loops and parallelism: loops are the primary source of parallelism

Detecting and Enhancing Loop-Level Parallelism
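A small C illustration (my example, in the spirit of the discussion above) of a loop-carried dependence versus a loop the compiler can parallelize:

/* Loop-carried dependence: iteration i reads a[i-1], written by
   iteration i-1, so the iterations cannot run in parallel as-is. */
void prefix_sum(double *a, const double *b, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + b[i];
}

/* No loop-carried dependence: iterations are independent, so the
   compiler may vectorize or parallelize the loop. */
void vector_add(double *a, const double *b, const double *c, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = b[i] + c[i];
}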

69 Copyright © 2012, Elsevier Inc. All rights reserved.

Finding Dependencies
  Assume indices are affine:
    i.e., a × i + b (i is the loop index)
  Assume:
    Store to a × i + b, then
    Load from c × i + d
    i runs from m to n
  Dependence exists if:
    There exist j, k such that m ≤ j ≤ n and m ≤ k ≤ n,
    with a store to a × j + b, a load from c × k + d, and a × j + b = c × k + d

Detecting and Enhancing Loop-Level Parallelism
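A worked instance (my numbers): in

for (i = 0; i < 100; i++) x[2*i + 3] = x[2*i] * 5.0;

the store index 2j + 3 is always odd and the load index 2k is always even, so 2j + 3 = 2k has no integer solution; no dependence exists between iterations, and the loop can be parallelized.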

70 Copyright © 2012, Elsevier Inc. All rights reserved.

Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism

Computer Architecture A Quantitative Approach, Fifth Edition

71 Copyright © 2012, Elsevier Inc. All rights reserved.

Warehouse-Scale Computing
  Warehouse-scale computer (WSC)
    Provides Internet services
      Search, social networking, online maps, video sharing, online shopping, email, cloud computing, etc.
  Differences from HPC “clusters”:
    Clusters have higher-performance processors and networks
    Clusters emphasize thread-level parallelism; WSCs emphasize request-level parallelism
  Differences from datacenters:
    Datacenters consolidate different machines and software into one location
    Datacenters emphasize virtual machines and hardware heterogeneity in order to serve varied customers

Introduction

72 Copyright © 2012, Elsevier Inc. All rights reserved.

WSC Design Factors (I)
  Cost-performance (i.e., work done/$)
  Energy efficiency
  Dependability:
  Network I/O:
  Types of workloads:
  Importance of ample computational parallelism:
  Operational costs
  Scale:

Introduction

73 Copyright © 2012, Elsevier Inc. All rights reserved.

WSC Design Factors (I)
  Cost-performance (i.e., work done/$)
    A WSC infrastructure can cost $150M
    A 10% reduction in capital cost saves $15M
    Small savings add up
  Energy efficiency
    Affects power distribution and cooling
    Work done per joule
  Dependability via redundancy:
    At least 99.99% availability: down less than one hour/year
    Multiple WSCs
  Network I/O: inter- and intra-WSC
  Interactive and batch-processing workloads (e.g., MapReduce)

Introduction

74 Copyright © 2012, Elsevier Inc. All rights reserved.

WSC Design Factors (II)
  Ample computational parallelism is not important
    Most jobs are totally independent (e.g., billions of Web pages from a Web crawl)
    Software as a Service: millions of independent users
    “Request-level parallelism”: little need to coordinate or synchronize
  Operational costs count
    Power consumption is a primary, not secondary, constraint when designing the system
    Energy, power distribution, and cooling: more than 30% of cost over 10 years
  Scale and its opportunities and problems
    Can afford to build customized systems, since WSCs require volume purchases
    Flip side: failures

Introduction

75 Copyright © 2012, Elsevier Inc. All rights reserved.

MapReduce?

Programming Models and Workloads for WSCs

76 Copyright © 2012, Elsevier Inc. All rights reserved.

MapReduce
  Batch processing framework
  Map: applies a programmer-supplied function to each logical input record
    Runs on thousands of computers
    Provides a new set of key-value pairs as intermediate values
  Reduce: collapses the values using another programmer-supplied function

Programming Models and Workloads for WSCs
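A toy C sketch of the map/reduce structure (single machine, fixed-size data, names of my own invention; in a real WSC the map and reduce invocations run across thousands of nodes and the framework does the grouping):

#include <stdio.h>
#include <string.h>

struct pair { const char *key; int value; };

/* "Map": emit (word, 1) for each input record. */
static int map(const char **records, int n, struct pair *out) {
    for (int i = 0; i < n; i++) {
        out[i].key = records[i];
        out[i].value = 1;
    }
    return n;
}

/* "Reduce": sum the values of all pairs sharing a key. */
static int reduce_key(const struct pair *p, int n, const char *key) {
    int sum = 0;
    for (int i = 0; i < n; i++)
        if (strcmp(p[i].key, key) == 0)
            sum += p[i].value;
    return sum;
}

int main(void) {
    const char *records[] = {"web", "mail", "web", "web"};
    struct pair inter[4];
    int n = map(records, 4, inter);
    printf("web: %d\n", reduce_key(inter, n, "web"));   /* web: 3 */
    return 0;
}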

77

78

79 Copyright © 2012, Elsevier Inc. All rights reserved.

Computer Architecture of WSC

80 Copyright © 2012, Elsevier Inc. All rights reserved.

Computer Architecture of WSC
  Hierarchy of networks for interconnection
  Servers organized in racks
  Racks organized in clusters (arrays)
  Rack switches are fast and cheap
  Switches connecting racks are more expensive
  Goal: maximize locality of reference

Computer Architecture of WSC

81 Copyright © 2012, Elsevier Inc. All rights reserved.

Storage
  Storage options:
    Use disks inside the servers, or
    Network attached storage through Infiniband
  Google File System (GFS) uses local disks and maintains at least three replicas

Computer Architecture of WSC

82 Copyright © 2012, Elsevier Inc. All rights reserved.

Infrastructure and Costs of WSC
  Location of WSC
    Proximity to Internet backbones, electricity cost, property tax rates, low risk from earthquakes, floods, and hurricanes
  Power distribution
    (figure: the chain of power-distribution equipment from the utility to the server, with an overall efficiency of 89%)

Physical Infrastructure and Costs of WSC

83 Copyright © 2012, Elsevier Inc. All rights reserved.

Measuring Efficiency of a WSC
  Power Utilization Effectiveness (PUE)
    = Total facility power / IT equipment power
    Median PUE in a 2006 study was 1.69
  Performance
    Latency is an important metric because it is seen by users
    Bing study: users use search less as response time increases
    Service Level Objectives (SLOs) / Service Level Agreements (SLAs)
      E.g., 99% of requests must be below 100 ms

Physical Infrastructure and Costs of WSC
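To see the arithmetic behind PUE (hypothetical numbers): a facility drawing 10 MW in total whose IT equipment consumes 5.9 MW has PUE = 10 / 5.9 ≈ 1.69; every watt delivered to computing carries about 0.69 W of power-distribution and cooling overhead.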

84

What is Cloud Computing?

85

What is Cloud Computing?
  On-demand availability of resources in a dynamic and scalable fashion
    resource = infrastructure, platforms, software, services, or storage
  The cloud provider must manage its resources efficiently so that user needs can be met when needed at the desired QoS level

86

Cloud Computing Services

•  Infrastructure as a Service (IaaS)
•  Platform as a Service (PaaS)
•  Software as a Service (SaaS)

87

Advantages of Cloud Computing

88

Advantages of Cloud Computing

  Pay as you go.

  No need to provision for peak loads.

  Time to market.

  Consistent performance and availability.

89

Potential Drawbacks of Cloud Computing

90

Potential Drawbacks of Cloud Computing

  Privacy and security.

  External dependency for mission critical applications.

  Disaster recovery.

  Monitoring and Enforcement of SLAs.

91

Parallel Computation of π Using PlanetLab

(figure: results for m = 1 billion)

92

Parallel Computation of π Using PlanetLab

(figure: speedup S = E1/En, i.e., execution time on one node divided by execution time on n nodes, for m = 1 billion)
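The slides do not show the algorithm, but the usual embarrassingly parallel way to compute π from m random points is Monte Carlo sampling; a single-node C sketch, offered as an assumption about what the figure measured (PlanetLab would split the m samples across nodes and sum the hit counts):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long m = 1000000;   /* samples; the slides use 1 billion */
    long hits = 0;
    srand(42);
    for (long i = 0; i < m; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)   /* inside the quarter circle */
            hits++;
    }
    /* area of quarter circle / area of unit square = pi/4 */
    printf("pi ~= %f\n", 4.0 * (double)hits / m);
    return 0;
}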

93

Example: Amazon’s Elastic Compute Cloud (EC2)
  http://aws.amazon.com/ec2/
  Virtual site farm
  Users request the number and type of compute instances they need
  Payment: by instance-hour
  One EC2 compute unit provides the equivalent of the CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor

94

Examples: Amazon’s Elastic Compute Cloud (EC2)
  EC2’s Auto Scaling allows users to determine when to scale their EC2 usage up or down according to user-defined conditions
    Saves money
  EC2’s CloudWatch aggregates and reports metrics for CPU utilization, data transfer, and disk usage and activity for each EC2 instance

95

Examples: Google’s App Engine
  Web applications can be deployed on Google’s infrastructure
  Applications can run in Java or Python run-time environments
  Free startup: all applications can use up to 500 MB of storage and enough CPU and bandwidth to support an efficient app serving around 5 million page views a month, for free
    After that, pay according to resource usage

96

Capacity Planning for the Cloud: from the consumer’s point of view
  Problems for consumers:
    How to select SLAs for various QoS metrics in a way that maximizes a utility function for the consumer, subject to cost constraints?

97

Capacity Planning for the Cloud: from the consumer’s point of view

Solvers:
•  NEOS: http://neos.mcs.anl.gov
•  MS Excel’s Solver (see Tools menu)

98 © 2009 D.A. Menascé and Paul Ngo. All Rights Reserved.

Capacity Planning for the Cloud: from the consumer’s point of view

(figure: worked example with weights wr = 0.4, wx = 0.3, wa = 0.3)
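A hedged reading of those weights (the worked example itself is not reproduced in this transcript, so this is an assumption about its form): in Menascé-style capacity planning the consumer's utility is often a weighted sum such as U = wr·UR + wx·UX + wa·UA, where UR, UX, and UA are normalized utilities for response time, throughput, and availability; weights of 0.4/0.3/0.3 would value response time slightly above the other two metrics.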