ECE-3056-B Exam Topic Areas John Copeland Friday May 2, 2014 11:30-2:20.


09b Virtual Memory System

Every page of Physical Memory is stored on the disk(s).

Part of the Main Memory (RAM) is dedicated to acting as a cache for active pages (a fraction of all physical pages).

Programs access instructions and data based on "Virtual Addresses".

If the page size is 4096 bytes, the rightmost 12 bits of the address are the "Byte-Offset."

The Physical Address is the Physical Page address || Byte Offset.

The Virtual Address is the Virtual Page address || Byte Offset.
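The concatenation above can be illustrated with a minimal Python sketch, assuming a 4096-byte page. The names `split_address` and `join_address` are hypothetical helpers chosen for this example, not part of the course material.

```python
PAGE_SIZE = 4096                            # bytes per page
OFFSET_BITS = PAGE_SIZE.bit_length() - 1    # 12 bits for a 4096-byte page
OFFSET_MASK = PAGE_SIZE - 1                 # low 12 bits set

def split_address(addr):
    """Return (page_number, byte_offset) for a virtual or physical address."""
    return addr >> OFFSET_BITS, addr & OFFSET_MASK

def join_address(page_number, byte_offset):
    """Concatenate page_number || byte_offset into one address."""
    return (page_number << OFFSET_BITS) | byte_offset

page, offset = split_address(0x12345)
print(hex(page), hex(offset))               # 0x12 0x345
print(hex(join_address(page, offset)))      # 0x12345
```

The byte offset is the same in the virtual and the physical address; only the page number changes during translation.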

Virtual Memory

• Use main memory as a “cache” for secondary (disk) storage
– Managed jointly by CPU hardware and the operating system (OS)
• Programs share main memory
– Each gets a private virtual address space (in physical memory) holding its frequently used code and data
– Protected from other programs (Physical address (Page No.) includes process ID bits)
• CPU and OS translate virtual addresses to physical addresses
– VM “block” is called a page
– VM page “miss” (not in DRAM) is called a "page fault"
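As a rough illustration of translation and page faults, the sketch below models the page table as a Python dictionary. It assumes 4 KiB pages; the `PageFault` class, the `translate` function, and the table contents are hypothetical (the two resident mappings reuse the example page numbers 1000 and 1110 that appear in these notes).

```python
OFFSET_BITS = 12  # assumes 4 KiB pages

class PageFault(Exception):
    """Raised when the requested virtual page is not resident in DRAM."""

# Hypothetical page table: virtual page number -> physical page number,
# or None when the page currently lives only on disk.
page_table = {0b1000: 0b11010010, 0b1110: 0b00011001, 0b0011: None}

def translate(vaddr, table):
    """Translate a virtual address, or raise PageFault if the page is not in DRAM."""
    vpn = vaddr >> OFFSET_BITS
    offset = vaddr & ((1 << OFFSET_BITS) - 1)
    ppn = table.get(vpn)          # unmapped pages are treated like non-resident ones
    if ppn is None:
        raise PageFault(f"virtual page {vpn:#x} is not in DRAM")
    return (ppn << OFFSET_BITS) | offset

print(hex(translate(0x8ABC, page_table)))   # 0xd2abc

try:
    translate(0x3000, page_table)
except PageFault as fault:
    print("page fault:", fault)             # the OS would now fetch the page from disk
```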


Cache example (slides 09a-20 and 09a-21)

Previous State:

Index  V  Tag (Physical MSBs)  Data (32 bytes)
000    N
001    N
010    Y  11010010 011 010     Mem[11010]
011    N
100    N
101    N
110    Y  00011001 110 110     Mem[10110]
111    N

Then This Happens:

Binary Virtual addr  Hit/miss  Cache block
10000 11 010 xxxx    ?         000
11101 10 110 xxxx    ?         011
10000 11 010 xxxx    ?         000

What is the new State of the Cache? Answer on 09a-21.

Virtual Page Addr.  Physical Page Addr.  Page Offset bits 9:4
1000                11010010             011010
1110                00011001             110110

[Figure: CPU, TLB (Translation Look-aside Buffer), and Cache, with yes/no decision branches; one branch leads to "Need to access Page Table".]

Address Translation

• Fixed-size pages (e.g., 4K)

[Figure: the "Page Table" on DRAM maps virtual pages to pages on DRAM (some) or pages on disk (all).]


TLB Operation

• TLB size typically a function of the target domain
– High end machines will have fully associative large TLBs
• PTE entries are replaced on a demand driven basis
• The TLB is in the critical path

[Figure: registers, ALU, cache, and memory; the virtual address goes to the TLB, which returns the physical address; on a miss, translate and update the TLB.]
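The hit and miss paths described above can be sketched in Python as a small fully associative TLB in front of a page table. This is only an illustration: the `TLB` class, the LRU replacement policy, the 4 KiB page size, and the page table contents are all assumptions made for the example.

```python
from collections import OrderedDict

OFFSET_BITS = 12  # assumes 4 KiB pages

class TLB:
    """Tiny fully associative TLB with LRU replacement (an assumed policy)."""

    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # hypothetical dict: VPN -> PPN
        self.entries = OrderedDict()      # cached translations, least recent first
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn = vaddr >> OFFSET_BITS
        offset = vaddr & ((1 << OFFSET_BITS) - 1)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)           # mark as most recently used
        else:
            self.misses += 1
            ppn = self.page_table[vpn]              # page table access (KeyError stands in for a page fault)
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)    # evict the least recently used entry
            self.entries[vpn] = ppn                 # update the TLB on demand
        return (self.entries[vpn] << OFFSET_BITS) | offset

tlb = TLB(capacity=2, page_table={0x8: 0xD2, 0xE: 0x19, 0x1: 0x07})
for addr in (0x8000, 0x8004, 0xE010, 0x1000, 0x8000):
    tlb.translate(addr)
print(tlb.hits, tlb.misses)   # 1 4
```

With only two entries, the final access to page 0x8 misses again because that entry was evicted, which shows why TLB size matters for the critical path.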


Memory Protection

• Different tasks can share parts of their virtual address spaces
– But need to protect against errant access
– Requires OS assistance
• Hardware support for OS protection
– Privileged supervisor mode (aka kernel mode)
– Privileged instructions
– Page tables and other state information only accessible in supervisor mode
– System call exception (e.g., syscall in MIPS)

Distinguish between a TLB miss*, a data cache miss, and a page fault.

* The TLB may also contain entries for recently used pages that are not present in the cache.

09b Glossary

• Page Table
• Page Table Entry (PTE)
• Page fault
• Physical address
• Physical page
• Translation lookaside buffer (TLB)
• Virtual address
• Virtual page


Input/Output "I/O"

• I/O devices can be characterized by
– Behavior: input, output, storage
– Partner: human or machine
– Data rate: bytes/sec, transfers/sec

• I/O bus connections

An interrupt (signal) is sent to the OS when requested input data is ready for retrieval by a process (or thread) that is "blocked" (halted). The OS then puts the process on the list of "Ready to Run" processes.
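The blocked-to-ready transition can be mimicked with a toy Python model. It is a sketch only: `request_input`, `on_interrupt`, and the process and device names are hypothetical, and a real OS scheduler keeps far more state.

```python
from collections import deque

ready_queue = deque()   # processes that are "Ready to Run"
blocked = {}            # device name -> process waiting on that device

def request_input(process, device):
    """The process asks for input and is blocked until the device responds."""
    blocked[device] = process

def on_interrupt(device):
    """Interrupt handler: the device's data is ready, so unblock the waiter."""
    process = blocked.pop(device, None)
    if process is not None:
        ready_queue.append(process)

request_input("editor", "keyboard")
request_input("backup", "disk0")
on_interrupt("disk0")
print(list(ready_queue), blocked)   # ['backup'] {'keyboard': 'editor'}
```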


Typical x86 PC I/O System

[Figure: typical x86 PC I/O system, with labels for the network interface, GPU, software interaction/control, and the interconnect (replaced with QuickPath Interconnect, QPI).]

Note the flow of data (and control) in this system!

Modern Disk Drives contain internal SRAM buffers to reduce latency


Disk Performance

• Actuator moves the correct read/write head over the correct track (seek time: maximum when it has to move from the innermost cylinder to the outermost)
– Under the control of the disk controller
• Disk latency = controller overhead + seek time + rotational delay + transfer delay
– Seek time and rotational delay are limited by mechanical parts

[Figure: disk drive showing the actuator, arm, head, and platters.]

RAID

• Redundant Array of Inexpensive (Independent) Disks
– Use multiple smaller disks (c.f. one large disk)
– Parallelism improves performance
– Plus extra disk(s) for redundant data storage
• Provides fault tolerant storage system
– Especially if failed disks can be “hot swapped"

Transfer Rate = (Bytes per Cylinder) * RPM / (60 sec per min)

Transfer Delay = Bytes per sector / Transfer Rate
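The disk latency and transfer formulas above can be tried out with a short Python sketch. The drive parameters are made up for illustration, and the rotational delay is taken as half a revolution on average, which is a common assumption that these notes do not state.

```python
def transfer_rate(bytes_per_cylinder, rpm):
    """Bytes per second, following the Transfer Rate formula above."""
    return bytes_per_cylinder * rpm / 60

def disk_latency(overhead_s, seek_s, rpm, bytes_per_sector, bytes_per_cylinder):
    """Controller overhead + seek time + rotational delay + transfer delay, in seconds."""
    rotational_delay = 0.5 * 60 / rpm   # assumption: average wait is half a revolution
    transfer_delay = bytes_per_sector / transfer_rate(bytes_per_cylinder, rpm)
    return overhead_s + seek_s + rotational_delay + transfer_delay

# Hypothetical drive: 7200 RPM, 512-byte sectors, 1,000,000 bytes per cylinder,
# 0.2 ms controller overhead, 4 ms seek time.
rate = transfer_rate(1_000_000, 7200)
latency = disk_latency(0.0002, 0.004, 7200, 512, 1_000_000)
print(f"{rate:.0f} B/s, {latency * 1e3:.3f} ms")   # 120000000 B/s, 8.371 ms
```

In this example the mechanical terms (seek and rotation) account for almost all of the latency; the transfer delay for one sector is only a few microseconds.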


Disk Dependability Measures

• Reliability: mean time to failure (MTTF)
• Service interruption: mean time to repair (MTTR)
• Mean time between failures
– MTBF = MTTF + MTTR
• Availability = MTTF / (MTTF + MTTR)
• Improving Availability
– Increase MTTF: fault avoidance, fault tolerance, fault forecasting
– Reduce MTTR: improved tools and processes for diagnosis and repair
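A quick numeric check of the MTBF and availability formulas, written as a Python sketch. The MTTF and MTTR values are hypothetical and chosen only to make the arithmetic concrete.

```python
def mtbf(mttf, mttr):
    """Mean time between failures: MTBF = MTTF + MTTR."""
    return mttf + mttr

def availability(mttf, mttr):
    """Fraction of time the system is in service: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# Hypothetical disk: MTTF of 1,000,000 hours, MTTR of 24 hours.
a = availability(1_000_000, 24)
print(mtbf(1_000_000, 24))          # 1000024
print(f"{a:.6f}")                   # 0.999976

# Halving the repair time raises availability, as the last bullet suggests.
print(availability(1_000_000, 12) > a)   # True
```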


Bus Types, Signals, and Synchronization

• Data lines
– Carry address and data
– Multiplexed or separate
• Control lines
– Indicate data type, synchronize transactions
• Synchronous
– Uses a bus clock
• Asynchronous
– Uses request/acknowledge control lines for handshaking
• Processor-Memory buses
– Short, high speed
– Design is matched to memory organization
• I/O buses
– Longer, allowing multiple connections
– Specified by standards for interoperability
– Connect to processor-memory bus through a bridge


10 Study Guide

• Provide a step-by-step example of how each of the following works
– Polling, DMA, interrupts, read/write accesses in a RAID configuration, memory mapped I/O

• Compute the bandwidth for data transfers to/from a disk

• How is the I/O system of a desktop or laptop different from that of a server?


Energy Delay Product (EDP)

• Delay decreases as the supply voltage increases, but energy and power increase

[Figure: energy, delay, and EDP plotted against VDD (vertical axis: energy or delay), with the lowest energy per operation marked.]

Historically, performance scaling was accompanied by scaling down feature sizes. This is no longer true. We have reached a point where power densities are increasing.


Processor Power States

• Performance States – P-states
– Operate at different voltage/frequencies
  • Recall delay-voltage relationship
– Lower voltage → lower leakage, but slower operation
– Lower frequency → lower power (same or more energy per operation)
– Lower frequency → longer execution time
• Idle States - C-states
– Sleep states
– Which is better? The difference is how much state is saved
• SW or HW managed transitions between states!

Example

• 4X #cores
• 0.75x voltage
• 0.5x Frequency
• 1X power
• 2X in performance

[Figure: Core/Cache pairs, one core versus four.]

Concurrency + lower frequency → greater energy efficiency
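The numbers in this example can be checked with the usual dynamic power relation, P proportional to (number of cores) × V² × f. The Python sketch below assumes that relation, ignores leakage, and assumes the workload scales perfectly across cores; the function names are hypothetical.

```python
def relative_dynamic_power(cores, voltage, frequency):
    """Dynamic power relative to the baseline, using P ~ cores * V^2 * f."""
    return cores * voltage**2 * frequency

def relative_throughput(cores, frequency):
    """Ideal throughput relative to the baseline, assuming perfect parallel scaling."""
    return cores * frequency

print(relative_dynamic_power(4, 0.75, 0.5))   # 1.125
print(relative_throughput(4, 0.5))            # 2.0
```

Under these assumptions the four-core design delivers about twice the performance for roughly the same power (1.125X, which the slide rounds to 1X).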


Thermal Design Power (TDP)

• This is the maximum power at which the part is designed to operate
– Dictates the design of the cooling system
• Max temperature Tjmax
– Typically fixed by worst case workload

• Parts are typically operating below the TDP

• Opportunities for turbo mode (higher clock for short time)?

AMD Trinity APU

http://ecs.vancouver.wsu.edu/thermofluids-research


Power and Architecture Activity

• For example, at the nth clock cycle, collected counters are:
– Data cache:
  • read = 20, write = 12;
  • per-read energy = 0.5nJ; per-write energy = 0.6nJ;
  • Read energy = read*per-read energy = 10nJ
  • Write energy = write*per-write energy = 7.2nJ
  • Total activity energy = read+write energies = 17.2nJ
  • If n = 50th clock cycle and clock frequency = 2GHz, Total activity power = energy*clock_freq/n = 688mW

*Note: n/clock_freq = n clock periods in sec; power = time average of energy
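The worked example above can be reproduced in a few lines of Python. The function name `activity_power` and the dictionary layout are hypothetical; the counter values and per-event energies are the ones given in the example.

```python
def activity_power(counts, energy_per_event_nj, n_cycles, clock_hz):
    """Average power in watts: total event energy divided by elapsed time."""
    energy_nj = sum(counts[event] * energy_per_event_nj[event] for event in counts)
    elapsed_s = n_cycles / clock_hz          # n clock periods, in seconds
    return energy_nj * 1e-9 / elapsed_s

power_w = activity_power(
    counts={"read": 20, "write": 12},
    energy_per_event_nj={"read": 0.5, "write": 0.6},
    n_cycles=50,
    clock_hz=2e9,
)
print(f"{power_w * 1e3:.0f} mW")             # 688 mW
```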


Instruction Level Parallelism (ILP)

[Figure: pipeline stages IF, ID, MEM, WB, with multiple instructions in EX at the same time.]

• Single (program) thread of execution
• Issue multiple instructions from the same instruction stream
• Average CPI<1
• Often called out of order (OOO) cores


Thread Level Parallelism (TLP)

• Multiple threads of execution
• Exploit ILP in each thread
• Exploit concurrent execution across threads


Programming Model: Message Passing

• Each processor has private physical address space

• Hardware sends/receives messages between processors


Graphics Processing Unit - GPU

• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally high-end computers (e.g., SGI)
– Moore’s Law → lower cost, higher density
– 3D graphics cards now for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, rasterization
• Processing is highly data-parallel
– GPUs are highly multithreaded
– Use thread switching to hide memory latency
  • Less reliance on multi-level caches
– Graphics memory is wide and high-bandwidth
• Trend toward general purpose GPUs
– Heterogeneous CPU/GPU systems
– CPU for sequential code, GPU for parallel code
