Final Review - George Mason University
Transcript of Final Review - George Mason University
1 Copyright © 2012, Elsevier Inc. All rights reserved.
Final Review
Computer Architecture A Quantitative Approach, Fifth Edition
2
How Does a Computer Work?
What is the Von Neumann computer architecture?
What are the elements of a Von Neumann computer?
How do these elements interact to execute programs?
Copyright © 2013, Daniel A. Menasce. All rights reserved.
Introduction
3
Von Neumann Architecture
Stored-program computer
Data
Program
ALU
Registers
system bus
4
Von Neumann Architecture
Stored-program computer
Data
Program
ALU
Registers
Sequence of instructions: (1) Load/Store from memory to/from registers, (2) Arithmetic/Logical Ops., (3) Conditional and unconditional branches.
General registers, FP registers, Program counter, etc.
5
Single Processor Performance
What are the architectural evolutions from ILP, and why?
6
Single Processor Performance
RISC
Move to multi-processor. From ILP to DLP and TLP
7
Moore’s Law?
Source: Wikimedia Commons
8
Moore’s Law
Source: Wikimedia Commons
The number of transistors on integrated circuits doubles approximately every two years. (Gordon Moore, 1965)
9
Instruction Level Parallelism?
10
Instruction Level Parallelism
In assembly language:
* e = a + b
001 LOAD A,R1
002 LOAD B,R2
003 ADD R1,R2,R3
004 STO R3,E
* f = c + d
005 LOAD C,R4
006 LOAD D,R5
007 ADD R4,R5,R6
008 STO R6,F
* g = e * f
009 MULT R3,R6,R7
010 STO R7,G
Instructions that can potentially be executed in parallel:
Group 1: 001, 002, 005, 006
Group 2: 003, 007
Group 3: 004, 008, 009
Group 4: 010
11
Data Level Parallelism (DLP)?
12
Data Level Parallelism (DLP)
Diagram: the same task replicated across different pieces of data (data-level parallelism).
13
Task Level Parallelism (TLP)?
14
Task Level Parallelism (TLP)
Diagram: multiple distinct tasks executing in parallel (task-level parallelism).
15
Power and Energy
What units are used to measure power and energy and how are they related?
What is the effect of clock rate on power consumption and performance?
Trends in Power and Energy
16
Power and Energy
What units are used to measure power and energy, and how are they related?
Energy is measured in joules; power is measured in watts = joules/second.
What is the effect of clock rate on power consumption and performance?
The clock rate can be reduced dynamically to limit power consumption.
A lower clock rate reduces performance.
18
Principles of Computer Design
Take Advantage of Parallelism: e.g., multiple processors, disks, memory banks, pipelining, multiple functional units.
Principle of Locality: reuse of data and instructions.
Focus on the Common Case: Amdahl’s Law.
Principles
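Amdahl’s Law is why the common case deserves the optimization effort; a minimal sketch in plain Python (the 40% fraction and 10× enhancement below are made-up illustration, not from the slides):

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when only a fraction of execution time is enhanced."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Speeding up 40% of the workload by 10x yields only about 1.56x overall.
print(round(amdahl_speedup(0.4, 10.0), 2))
```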
19
The Processor Performance Equation
Principles
CPU time = Instruction Count × CPI × Clock cycle time, where CPI is the (average) number of clock cycles per instruction.
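The equation can be exercised directly; a short sketch (the instruction count, CPI, and clock rate below are hypothetical):

```python
def cpu_time_seconds(instruction_count, cpi, clock_rate_hz):
    """CPU time = IC x CPI x clock cycle time (= IC x CPI / clock rate)."""
    return instruction_count * cpi / clock_rate_hz

# 10^9 instructions at an average CPI of 2 on a 2 GHz clock take 1 second.
print(cpu_time_seconds(1e9, 2.0, 2e9))
```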
20
Memory Hierarchy Design
Computer Architecture A Quantitative Approach, Fifth Edition
21
Memory Hierarchy
Why is it advantageous to organize memory as a hierarchy?
Give examples of levels in the memory hierarchy.
What is the memory performance gap?
23
Placement of Blocks in a Cache
n-way associative cache? direct-mapped cache? fully associative cache?
24
Placement of Blocks in a Cache
Set associative: a block is mapped to a set and may be placed anywhere within that set.
Finding a block: map the block address to a set, then search the set (usually in parallel) to find the block.
n blocks per set: n-way set-associative cache
One block per set (n = 1): direct-mapped cache
One set for the whole cache: fully associative cache
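The three placement policies differ only in the number of sets; a sketch of the set-mapping rule (the block address and cache sizes are made up):

```python
def cache_set_index(block_address, num_sets):
    """A block can only be placed in set (block address mod number of sets)."""
    return block_address % num_sets

# 8-block cache: 2-way set associative -> 4 sets; direct-mapped -> 8 sets;
# fully associative -> 1 set (every block maps to the single set).
print(cache_set_index(13, 4), cache_set_index(13, 8), cache_set_index(13, 1))
```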
25
Processor/Cache Addressing
Figure: a 2-way set-associative cache. Memory addresses are split into TAG <2 bits>, Index <1 bit>, and Block Offset <1 bit>; the INDEX selects the set, and the TAG is compared to check for a cache hit when the valid bit (VB) is 1. Memory blocks B0–B7 map alternately to Set 0 (B0, B2, B4, B6) and Set 1 (B1, B3, B5, B7); the cache entries shown hold 8-bit data lines.
26
Writing to a Cache: two strategies
Write-through?
Write-back?
27
Writing to a Cache: two strategies
Write-through: immediately update lower levels of the hierarchy.
Write-back: update lower levels of the hierarchy only when an updated block is replaced.
28
Memory Hierarchy Basics
Note that speculative and multithreaded processors may execute other instructions during a miss, which reduces the performance impact of misses.
MemoryStallCycles = NumberOfMisses × MissPenalty
= IC × (Misses / Instruction) × MissPenalty
= IC × (MemoryAccesses / Instruction) × MissRate × MissPenalty
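Plugging numbers into the stall-cycle formula (the 1.5 accesses per instruction, 2% miss rate, and 100-cycle penalty are hypothetical):

```python
def memory_stall_cycles(ic, accesses_per_instruction, miss_rate, miss_penalty):
    """IC x (memory accesses / instruction) x miss rate x miss penalty."""
    return ic * accesses_per_instruction * miss_rate * miss_penalty

# One million instructions, 1.5 accesses each, 2% misses, 100 cycles per miss.
print(memory_stall_cycles(1e6, 1.5, 0.02, 100))
```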
29
Virtual Memory
What is virtual memory?
Virtual Memory and Virtual Machines
30
Virtual Memory
Abstraction of the physical address space in which each process’s address space (the virtual address space) may be bigger than the physical address space.
The virtual address space is divided into pages, some of which are in main memory and some are on disk (the paging disk).
Virtual addresses are mapped to physical addresses by the hardware using a Translation Lookaside Buffer.
If a referenced page is not in memory, an interrupt is generated to the operating system, which starts the process of bringing the page to memory.
31
What are Virtual Machines?
32
Virtual Machines
Allow different ISAs and operating systems to be presented to user programs (“System Virtual Machines”).
SVM software is called a “virtual machine monitor” or “hypervisor”.
Individual virtual machines running under the monitor are called “guest VMs”.
Support sharing a computer among many unrelated users, and provide isolation and security.
Enabled by the raw speed of processors, which makes the overhead more acceptable.
36 Pipeline CS465
Pipelining
What is it? Pipelining is an implementation technique in which multiple instructions are overlapped in execution.
What are pipeline hazards? Resource, data, or control dependencies that delay the execution of instructions in the pipeline by one or more cycles.
37 Chapter 4 — The Processor — 37
Hazards
Situations that prevent starting the next instruction in the next cycle:
Structural hazards: a required resource is busy.
Data hazards: must wait for a previous instruction to complete its data read/write.
Control hazards: deciding on a control action depends on a previous instruction.
38
Pipeline Performance Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
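Using the slide’s numbers, the ideal throughput gain is just the ratio of the clock periods; a quick check (this ignores pipeline fill/drain time and hazards):

```python
def ideal_pipeline_speedup(single_cycle_time_ps, pipelined_cycle_time_ps):
    """Once full, the pipeline completes one instruction per (shorter) cycle."""
    return single_cycle_time_ps / pipelined_cycle_time_ps

# 800 ps single-cycle vs. 200 ps pipelined clock: up to 4x instruction throughput.
print(ideal_pipeline_speedup(800, 200))
```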
39
Speculation
“Guess” what to do with an instruction and start the operation as soon as possible, then check whether the guess was right.
If so, complete the operation; if not, roll back and do the right thing.
Examples:
Speculate on a branch outcome; roll back if the path taken is different.
Speculate on a load; roll back if the location is updated.
40
Compiler/Hardware Speculation
The compiler can reorder instructions (e.g., move a load before a branch) and can include “fix-up” instructions to recover from an incorrect guess.
Hardware can look ahead for instructions to execute, buffering results until it determines they are actually needed and flushing the buffers on incorrect speculation.
44
Disk Sectors and Access
Each sector records:
Sector ID
Data (512 bytes; 4096 bytes proposed)
Error-correcting code (ECC), used to hide defects and recording errors
Synchronization fields and gaps
Access to a sector involves:
Queuing delay if other accesses are pending
Seek: move the heads
Rotational latency
Data transfer
Controller overhead
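The access-time components add directly; a sketch with hypothetical drive parameters (queuing delay omitted):

```python
def disk_access_time_ms(seek_ms, rpm, transfer_rate_mb_s, sector_kb, controller_ms):
    """Seek + average rotational latency (half a rotation) + transfer + controller."""
    rotational_ms = 0.5 * 60_000.0 / rpm                      # half a rotation, in ms
    transfer_ms = (sector_kb / 1024.0) / transfer_rate_mb_s * 1000.0
    return seek_ms + rotational_ms + transfer_ms + controller_ms

# 4 ms seek, 7200 RPM, 100 MB/s media rate, 4 KB sector, 0.2 ms controller overhead.
print(round(disk_access_time_ms(4, 7200, 100, 4, 0.2), 2))
```

Note that rotational latency (about 4.2 ms at 7200 RPM) dominates the transfer time of a single small sector.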
46
Flash Storage
Nonvolatile semiconductor storage
100×–1000× faster than disk
Smaller, lower power, more robust
But higher $/GB (between disk and DRAM)
48
Bus Types
Processor-memory buses: short, high speed; design is matched to the memory organization.
I/O buses: longer, allowing multiple connections; specified by standards for interoperability; connect to the processor-memory bus through a bridge.
49
Measuring I/O Performance
I/O performance depends on:
Hardware: CPU, memory, controllers, buses
Software: operating system, database management system, application
Workload: request rates and patterns
I/O system design can trade off response time against throughput; throughput measurements are often made under a response-time constraint.
50
I/O vs. CPU Performance: Amdahl’s Law
Don’t neglect I/O performance as parallelism increases compute performance.
Example: a benchmark takes 90 s CPU time and 10 s I/O time; CPU speed doubles every 2 years, I/O is unchanged.
Year  CPU time  I/O time  Elapsed time  % I/O time
now   90 s      10 s      100 s         10%
+2    45 s      10 s      55 s          18%
+4    23 s      10 s      33 s          31%
+6    11 s      10 s      21 s          47%
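The trend in the table can be regenerated in a few lines (CPU time halves every two years; the slide’s figures are rounded to whole seconds):

```python
cpu_s, io_s = 90.0, 10.0                     # benchmark: 90 s CPU, 10 s I/O
for years in (0, 2, 4, 6):
    elapsed = cpu_s + io_s
    print(f"+{years}y: elapsed {elapsed:.1f} s, I/O share {100 * io_s / elapsed:.0f}%")
    cpu_s /= 2                               # CPU speed doubles every 2 years
```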
52
RAID: Redundant Array of Inexpensive (Independent) Disks
Use multiple smaller disks (cf. one large disk); parallelism improves performance.
Add extra disk(s) for redundant data storage.
Provides a fault-tolerant storage system, especially if failed disks can be “hot swapped”.
53
Data-Level Parallelism in Vector, SIMD, and GPU Architectures
Computer Architecture A Quantitative Approach, Fifth Edition
54
SIMD
What is SIMD? What types of applications benefit from SIMD?
Compare SIMD and MIMD with respect to energy efficiency.
Do programmers need to worry about parallelism when using a SIMD model?
55
SIMD
SIMD architectures can exploit significant data-level parallelism for:
matrix-oriented scientific computing
media-oriented image and sound processing
SIMD is more energy efficient than MIMD: it needs to fetch only one instruction per data operation, which makes SIMD attractive for personal mobile devices.
SIMD allows the programmer to continue to think sequentially and still achieve parallel speedups.
57
Types of SIMD Parallelism
Vector architectures
Multimedia SIMD instruction set extensions
Graphics Processing Units (GPUs)
58
Vector Architectures Basics
Data elements:
Operations:
Application vector size ≠ vector register size:
Conditional operations:
Multidimensional arrays:
Sparse arrays:
Vector Architectures
59
Vector Architectures Basics
Data elements: “vector registers”
Operations: vector-to-vector and vector-to-scalar
Application vector size ≠ vector register size: use the Vector Length Register
Conditional operations: use the vector mask register (allows parallelization of conditional operations)
Multidimensional arrays: loads/stores with non-unit strides
Sparse arrays: gather-scatter loads and stores
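The vector mask register idea can be mimicked in a few lines of plain Python (illustrative only; real hardware applies the mask lane-by-lane within a single vector instruction):

```python
def masked_vector_add(a, b, mask):
    """Element-wise add performed only where the mask bit is set."""
    return [x + y if m else x for x, y, m in zip(a, b, mask)]

a, b = [1, 2, 3, 4], [10, 10, 10, 10]
mask = [x > 2 for x in a]              # vectorizing: if (a[i] > 2) a[i] += b[i]
print(masked_vector_add(a, b, mask))   # [1, 2, 13, 14]
```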
60
Roofline Performance Model
Basic idea: plot peak floating-point throughput as a function of arithmetic intensity, tying together floating-point performance and memory performance for a target machine.
Arithmetic intensity: floating-point operations per byte read.
SIMD Instruction Set Extensions for Multimedia
61
Examples
Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Perf.)
Roofline plots for two example machines: a vector processor with a 40 GFLOP/sec peak and a multicore computer with a 4 GFLOP/sec peak.
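The attainable-GFLOPs formula is just a min(); a sketch using the 40 GFLOP/sec peak from the plot and an assumed 16 GB/s memory bandwidth (the bandwidth and intensities are made up):

```python
def attainable_gflops(peak_mem_bw_gb_s, arithmetic_intensity, peak_fp_gflops):
    """Roofline: performance is capped by either memory bandwidth or peak compute."""
    return min(peak_mem_bw_gb_s * arithmetic_intensity, peak_fp_gflops)

print(attainable_gflops(16, 0.5, 40))   # low intensity: bandwidth-bound (8.0)
print(attainable_gflops(16, 4.0, 40))   # high intensity: compute-bound (40)
```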
62
SIMD Extensions for Multimedia
Basic data type length vs. native word size:
Arithmetic implementation approach:
Limitations compared to vector instructions:
63
SIMD Extensions for Multimedia
Basic data type length vs. native word size: narrower than the native word size.
E.g., 8 bits for each primary color and 8 bits for transparency; 8- or 16-bit audio samples.
Arithmetic implementation approach: partition the carry chain of a 256-bit adder and perform simultaneous operations on short vectors of eight 32-bit, sixteen 16-bit, or thirty-two 8-bit operands.
Limitations compared to vector instructions:
The number of data operands is encoded in the opcode.
No sophisticated addressing modes (strided, gather-scatter).
No mask registers.
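The partitioned-carry-chain idea can be simulated in software; a sketch of four 8-bit lanes packed into 32-bit words (plain Python, illustrative only):

```python
def packed_add8(x, y):
    """Add four 8-bit lanes packed into 32-bit words; carries stop at lane borders."""
    result = 0
    for lane in range(4):
        shift = 8 * lane
        a = (x >> shift) & 0xFF
        b = (y >> shift) & 0xFF
        result |= ((a + b) & 0xFF) << shift    # wrap within the lane, no carry-out
    return result

# The low lane wraps (0xFF + 0x01 = 0x00) without disturbing its neighbor,
# unlike an ordinary 32-bit add, which would carry into the next byte.
print(hex(packed_add8(0x010203FF, 0x01010101)))
```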
64
Graphics Processing Units
Basic execution model:
Programming model:
Graphical Processing Units
65
Graphics Processing Units
Basic execution model: heterogeneous execution; the CPU is the host and the GPU is the device.
Programming model: a C-like programming language for the GPU that unifies all forms of GPU parallelism as the CUDA (Compute Unified Device Architecture) thread; the programming model is “Single Instruction, Multiple Thread”.
66
Fermi Multithreaded SIMD Processor
67
Loop-Level Parallelism
Compiler optimization:
Loops and parallelism:
Detecting and Enhancing Loop-Level Parallelism
68
Loop-Level Parallelism
Compiler optimization: the compiler analyzes source code to determine how the code can be parallelized; it determines when a loop can be parallelized or vectorized, how dependences between loop iterations prevent parallelization, and how those dependences can be eliminated.
Loops and parallelism: loops are the primary source of parallelism.
69
Finding dependences
Assume indices are affine, i.e., of the form a × i + b (where i is the loop index).
Assume a store to a × i + b and then a load from c × i + d, with i running from m to n.
A dependence exists if there are j and k with m ≤ j ≤ n and m ≤ k ≤ n such that the store to a × j + b and the load from c × k + d touch the same location, i.e., a × j + b = c × k + d.
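One standard screen that follows from the affine form is the GCD test: if gcd(a, c) does not divide (d − b), then a × j + b = c × k + d has no integer solution, so the iterations are independent. A sketch (the array accesses in the comments are hypothetical):

```python
from math import gcd

def dependence_possible(a, b, c, d):
    """GCD test for a store to a*i + b and a load from c*i + d."""
    return (d - b) % gcd(a, c) == 0      # False => provably no dependence

print(dependence_possible(2, 0, 2, 1))   # x[2i] vs. x[2i+1]: False (independent)
print(dependence_possible(4, 0, 2, 2))   # x[4i] vs. x[2i+2]: True (may depend)
```

The test is conservative: True only means a dependence cannot be ruled out without also checking the loop bounds.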
70
Warehouse-Scale Computers to Exploit Request-Level and Data-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition
71
Warehouse-Scale Computing
Warehouse-scale computer (WSC): provides Internet services such as search, social networking, online maps, video sharing, online shopping, email, and cloud computing.
Differences from HPC “clusters”: clusters have higher-performance processors and networks, and emphasize thread-level parallelism; WSCs emphasize request-level parallelism.
Differences from datacenters: datacenters consolidate different machines and software into one location, and emphasize virtual machines and hardware heterogeneity in order to serve varied customers.
72
WSC Design Factors (I)
Cost-performance (i.e., work done per $)
Energy efficiency
Dependability:
Network I/O:
Types of workloads:
Importance of ample computational parallelism:
Operational costs
Scale:
73
WSC Design Factors (I)
Cost-performance (work done per $): a WSC infrastructure can cost $150M, so a 10% reduction in capital cost saves $15M; small savings add up.
Energy efficiency: affects power distribution and cooling; measured as work done per joule.
Dependability via redundancy: at least 99.99% availability (down less than one hour per year); multiple WSCs.
Network I/O: inter- and intra-WSC.
Interactive and batch-processing workloads (e.g., MapReduce).
74
WSC Design Factors (II)
Ample computational parallelism is not important:
Most jobs are totally independent (e.g., billions of Web pages from a Web crawl).
Software as a Service: millions of independent users.
“Request-level parallelism”: little need to coordinate or synchronize.
Operational costs count:
Power consumption is a primary, not secondary, constraint when designing the system.
Energy, power distribution, and cooling account for more than 30% of cost over 10 years.
Scale and its opportunities and problems:
Can afford to build customized systems, since WSCs require volume purchases.
Flip side: failures.
75
MapReduce?
Programming Models and Workloads for WSCs
76
MapReduce: a batch-processing framework
Map: applies a programmer-supplied function to each logical input record; runs on thousands of computers; produces a new set of key-value pairs as intermediate values.
Reduce: collapses the values using another programmer-supplied function.
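A single-process word-count sketch of the Map/Reduce model (illustrative only; a real run distributes the map tasks and the shuffle over thousands of machines):

```python
from collections import defaultdict

def map_fn(record):
    """Map: emit (word, 1) key-value pairs for one logical input record."""
    return [(word, 1) for word in record.split()]

def reduce_fn(key, values):
    """Reduce: collapse all intermediate values for one key."""
    return key, sum(values)

records = ["to be or not to be", "be here now"]
intermediate = defaultdict(list)               # the "shuffle" groups pairs by key
for record in records:
    for key, value in map_fn(record):
        intermediate[key].append(value)
results = dict(reduce_fn(k, v) for k, v in intermediate.items())
print(results["be"], results["to"])   # 3 2
```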
79
Computer Architecture of WSC
80
Computer Architecture of WSC
Hierarchy of networks for interconnection:
Servers are organized in racks; racks are organized in clusters (arrays).
Rack switches are fast and cheap; switches connecting racks are more expensive.
Goal: maximize locality of reference.
81
Storage
Storage options:
Use the disks inside the servers, or
Network-attached storage through InfiniBand
The Google File System (GFS) uses local disks and maintains at least three replicas.
82
Infrastructure and Costs of WSC
Location of WSC: proximity to Internet backbones, electricity cost, property tax rates, and low risk from earthquakes, floods, and hurricanes.
Power distribution (diagram: about 89% overall efficiency).
Physical Infrastructure and Costs of WSC
83
Measuring Efficiency of a WSC
Power Utilization Effectiveness (PUE) = Total facility power / IT equipment power.
The median PUE in a 2006 study was 1.69.
Performance: latency is an important metric because it is seen by users; a Bing study found that users search less as response time increases.
Service Level Objectives (SLOs) / Service Level Agreements (SLAs): e.g., 99% of requests must complete in under 100 ms.
85
What is Cloud Computing?
On-demand availability of resources in a dynamic and scalable fashion.
Resource = infrastructure, platforms, software, services, or storage.
The cloud provider must manage its resources efficiently so that user needs can be met when needed, at the desired QoS level.
86
Cloud Computing Services
• Infrastructure as a Service (IaaS) • Platform as a Service (PaaS) • Software as a Service (SaaS)
88
Advantages of Cloud Computing
Pay as you go.
No need to provision for peak loads.
Time to market.
Consistent performance and availability.
90
Potential Drawbacks of Cloud Computing
Privacy and security.
External dependency for mission critical applications.
Disaster recovery.
Monitoring and Enforcement of SLAs.
93
Example: Amazon’s Elastic Compute Cloud (EC2)
http://aws.amazon.com/ec2/
Virtual site farm: users request the number and type of compute instances they need.
Payment: by instance-hour.
One EC2 compute unit provides the equivalent of the CPU capacity of a 1.0–1.2 GHz 2007 Opteron or 2007 Xeon processor.
94
Examples: Amazon’s Elastic Compute Cloud (EC2)
EC2’s Auto Scaling lets users scale their EC2 usage up or down according to user-defined conditions, saving money.
EC2’s CloudWatch aggregates and reports metrics for CPU utilization, data transfer, and disk usage and activity for each EC2 instance.
95
Examples: Google’s App Engine
Web applications can be deployed on Google’s infrastructure.
Applications can run in Java or Python run-time environments.
Free startup: all applications can use up to 500 MB of storage and enough CPU and bandwidth to support an efficient app serving around 5 million page views a month for free. After that, pay according to resource usage.
96
Capacity Planning for the Cloud: from the consumer’s point of view
Problem for consumers: how to select SLAs for various QoS metrics in a way that maximizes a utility function for the consumer, subject to cost constraints?
97
Capacity Planning for the Cloud: from the consumer’s point of view
Solvers:
• NEOS: http://neos.mcs.anl.gov
• MS Excel’s Solver (see Tools menu)