ECE-3056-B Exam Topic Areas John Copeland Friday May 2, 2014 11:30-2:20.
09b Virtual Memory System
Every page of Physical Memory is stored on the disk(s).
Part of the Main Memory (RAM) is dedicated to acting as a cache for active pages (a fraction of all physical pages).
Programs access instructions and data based on "Virtual Addresses".
If the page size is 4096 bytes (4 KB), the rightmost 12 bits are the "Byte Offset."
The Physical Address is the Physical Page address || Byte Offset.
The Virtual Address is the Virtual Page address || Byte Offset.
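The concatenation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a 4096-byte page (12 offset bits), and a hypothetical `page_table` dictionary whose single entry is made up for the example.

```python
PAGE_SIZE = 4096                            # bytes per page (assumed 4 KB)
OFFSET_BITS = PAGE_SIZE.bit_length() - 1    # log2(4096) = 12

def split_address(addr):
    """Return (page number, byte offset) for an address."""
    return addr >> OFFSET_BITS, addr & (PAGE_SIZE - 1)

def join_address(page, offset):
    """Concatenate page number || byte offset."""
    return (page << OFFSET_BITS) | offset

# Hypothetical mapping: virtual page number -> physical page number.
page_table = {0x12345: 0x00ABC}

def translate(virtual_addr):
    """Replace the virtual page number; the byte offset passes through unchanged."""
    vpn, offset = split_address(virtual_addr)
    return join_address(page_table[vpn], offset)
```

With these made-up values, `translate(0x12345678)` keeps the offset `0x678` and swaps page `0x12345` for page `0x00ABC`.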
Virtual Memory
• Use main memory as a “cache” for secondary (disk) storage
– Managed jointly by CPU hardware and the operating system (OS)
• Programs share main memory
– Each gets a private virtual address space (in physical memory) holding its frequently used code and data
– Protected from other programs (Physical address (Page No.) includes process ID bits)
• CPU and OS translate virtual addresses to physical addresses
– VM “block” is called a page
– VM page “miss” (not in DRAM) is called a "page fault"
Previous State (slides 09a-20, 09a-21)
Index  V  Tag (Physical MSBs)   Data (32 bytes)
000    N
001    N
010    Y  11010010 011 010      Mem[11010]
011    N
100    N
101    N
110    Y  00011001 110 110      Mem[10110]
111    N
Then This Happens
Binary Virtual addr   Hit/miss   Cache block
1000 011 010 xxxx     ?          000
1110 110 110 xxxx     ?          011
1000 011 010 xxxx     ?          000
What is new State of Cache? (Answer on 09a-21)
Virtual Page Addr.   Physical Page Addr.   Page Offset bits 9:4
1000                 11010010              011010
1110                 00011001              110110
[Figure: flow from CPU to TLB (Translation Look-aside Buffer) to Cache, with yes/no branches and a "Need to access Page Table" step.]
Address Translation
• Fixed-size pages (e.g., 4K)
[Figure: address translation through the "Page Table" (on DRAM) to pages on DRAM (some) and pages on disk (all).]
TLB Operation
• TLB size typically a function of the target domain
– High end machines will have fully associative large TLBs
• PTE entries are replaced on a demand driven basis
• The TLB is in the critical path
[Figure: datapath with registers, ALU, Cache, and Memory. The TLB turns a virtual address into a physical address; on a miss: Translate & Update TLB.]
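A minimal sketch of demand-driven TLB replacement follows. It assumes a fully associative TLB with LRU replacement (the slides do not name a policy), and the `TLB` class, its `lookup` method, and the outcome strings are hypothetical names chosen for illustration. The two page-table entries reuse the virtual/physical page pairs from the earlier table.

```python
from collections import OrderedDict

class TLB:
    """Small fully associative TLB; LRU replacement is an assumption."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()            # virtual page -> physical page

    def lookup(self, vpn, page_table):
        """Return (physical page or None, outcome string)."""
        if vpn in self.entries:
            self.entries.move_to_end(vpn)       # mark as most recently used
            return self.entries[vpn], "tlb_hit"
        if vpn not in page_table:               # page is not resident in DRAM
            return None, "page_fault"
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)    # evict the least recently used entry
        self.entries[vpn] = page_table[vpn]     # translate and update the TLB
        return self.entries[vpn], "tlb_miss"

# Hypothetical page table listing only the pages resident in DRAM.
page_table = {0b1000: 0b11010010, 0b1110: 0b00011001}
tlb = TLB(capacity=1)
outcomes = [tlb.lookup(v, page_table)[1]
            for v in (0b1000, 0b1000, 0b1110, 0b1000, 0b0001)]
```

A one-entry TLB is used so that the eviction path is exercised. A real design would also handle the page fault by having the OS bring the page in from disk, which this sketch leaves out.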
Memory Protection
• Different tasks can share parts of their virtual address spaces
– But need to protect against errant access
– Requires OS assistance
• Hardware support for OS protection
– Privileged supervisor mode (aka kernel mode)
– Privileged instructions
– Page tables and other state information only accessible in supervisor mode
– System call exception (e.g., syscall in MIPS)
Distinguish between a TLB miss*, a data cache miss, and a page fault.
* TLB may also contain recently used pages that are not present in cache.
09b Glossary
• Page Table
• Page Table Entry (PTE)
• Page fault
• Physical address
• Physical page
• Translation lookaside buffer (TLB)
• Virtual address
• Virtual page
Input/Output "I/O"
• I/O devices can be characterized by
– Behavior: input, output, storage
– Partner: human or machine
– Data rate: bytes/sec, transfers/sec
• I/O bus connections
An interrupt (signal) is sent to the OS when requested input data is ready for retrieval by a process (or thread) that is "blocked" (halted). The OS then puts the process on the list of "Ready to Run" processes.
Typical x86 PC I/O System
[Figure: typical x86 PC I/O system. Labels: Network Interface, GPU, Software interaction/control, Interconnect (replaced with QuickPath Interconnect, QPI).]
Note the flow of data (and control) in this system!
Modern Disk Drives contain internal SRAM buffers to reduce latency
Disk Performance
• Actuator moves the read/write heads to the correct cylinder (seek time: maximum when moving from the innermost cylinder to the outermost); the platter then rotates the correct sector under the head
– Under the control of the disk controller
• Disk latency = controller overhead + seek time + rotational delay + transfer delay
– Seek time and rotational delay are limited by mechanical parts
[Figure: disk drive. Labels: Actuator, Arm, Head, Platters.]
RAID
• Redundant Array of Inexpensive (Independent) Disks
– Use multiple smaller disks (c.f. one large disk)
– Parallelism improves performance
– Plus extra disk(s) for redundant data storage
• Provides fault tolerant storage system
– Especially if failed disks can be “hot swapped"
Transfer Rate = (Bytes per Cylinder) * RPM / (60 sec per min)
Transfer Delay = (Bytes per Sector) / (Transfer Rate)
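The latency and transfer formulas above can be checked with a short calculation. The sketch below uses hypothetical function names and made-up drive parameters, and it assumes the average rotational delay is half a revolution, which is the usual textbook convention rather than something stated on the slide.

```python
def transfer_rate(bytes_per_cylinder, rpm):
    """Bytes per second, following the slide: bytes per cylinder * RPM / 60."""
    return bytes_per_cylinder * rpm / 60

def transfer_delay(bytes_per_sector, rate):
    """Seconds needed to move one sector at the given transfer rate."""
    return bytes_per_sector / rate

def average_rotational_delay(rpm):
    """Seconds for half a revolution (assumed average case)."""
    return 0.5 * 60 / rpm

def disk_latency(controller_overhead, seek_time, rpm,
                 bytes_per_sector, bytes_per_cylinder):
    """controller overhead + seek time + rotational delay + transfer delay."""
    rate = transfer_rate(bytes_per_cylinder, rpm)
    return (controller_overhead + seek_time
            + average_rotational_delay(rpm)
            + transfer_delay(bytes_per_sector, rate))

# Hypothetical drive: 1,000,000 bytes per cylinder, 6000 RPM, 512-byte sectors,
# 0.2 ms controller overhead, 4 ms seek time.
rate = transfer_rate(bytes_per_cylinder=1_000_000, rpm=6000)
latency = disk_latency(0.0002, 0.004, 6000, 512, 1_000_000)
```

For these numbers the mechanical terms (seek and rotation) dominate, and the transfer delay for a single sector is only a few microseconds.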
Disk Dependability Measures
• Reliability: mean time to failure (MTTF)
• Service interruption: mean time to repair (MTTR)
• Mean time between failures
– MTBF = MTTF + MTTR
• Availability = MTTF / (MTTF + MTTR)
• Improving Availability
– Increase MTTF: fault avoidance, fault tolerance, fault forecasting
– Reduce MTTR: improved tools and processes for diagnosis and repair
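As a quick illustration of the dependability measures above, the sketch below evaluates MTBF and availability. The function names and the hour values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def mtbf(mttf, mttr):
    """Mean time between failures = MTTF + MTTR (same time unit for both)."""
    return mttf + mttr

def availability(mttf, mttr):
    """Fraction of time the system is in service: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# Hypothetical disk: fails on average after 999 hours, repaired in 1 hour.
base = availability(mttf=999, mttr=1)
# Halving the repair time improves availability without touching MTTF.
faster_repair = availability(mttf=999, mttr=0.5)
```

Comparing `base` with `faster_repair` shows the second route to higher availability listed above: reducing MTTR.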
Bus Types, Signals, and Synchronization
• Data lines
– Carry address and data
– Multiplexed or separate
• Control lines
– Indicate data type, synchronize transactions
• Synchronous
– Uses a bus clock
• Asynchronous
– Uses request/acknowledge control lines for handshaking
• Processor-Memory buses
– Short, high speed
– Design is matched to memory organization
• I/O buses
– Longer, allowing multiple connections
– Specified by standards for interoperability
– Connect to processor-memory bus through a bridge
10 Study Guide
• Provide a step-by-step example of how each of the following works
– Polling, DMA, interrupts, read/write accesses in a RAID configuration, memory mapped I/O
• Compute the bandwidth for data transfers to/from a disk
• How is the I/O system of a desktop or laptop different from that of a server?
Energy Delay Product (EDP)
• Delay decreases as supply voltage increases, but energy and power increase
[Figure: Energy, Delay, and EDP curves versus VDD; y-axis "Energy or delay"; annotation "Lowest Energy per Operation".]
Historically, performance scaling was accompanied by scaling down feature sizes. This is no longer true. We have reached a point where power densities are increasing.
Processor Power States
• Performance States – P-states
– Operate at different voltage/frequencies
• Recall delay-voltage relationship
– Lower voltage → lower leakage, but slower operation
– Lower frequency → lower power (same or more energy per operation)
– Lower frequency → longer execution time
• Idle States - C-states
– Sleep states
– Which is better: Difference is how much state is saved
• SW or HW managed transitions between states!
[Figure: Core + Cache blocks illustrating the single-core versus multi-core example.]
Example
• 4X #cores
• 0.75x voltage
• 0.5x Frequency
• 1X power
• 2X in performance
Concurrency + lower frequency → greater energy efficiency
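The numbers in the example above can be reproduced with a rough model. The sketch assumes dynamic power scales as cores × V² × f with switched capacitance per core held constant, and that the workload parallelizes perfectly; both are simplifying assumptions, and the function names are hypothetical.

```python
def relative_dynamic_power(cores, voltage, frequency):
    """Power relative to one core at 1x voltage and 1x frequency (P ~ N * V^2 * f)."""
    return cores * voltage ** 2 * frequency

def relative_throughput(cores, frequency):
    """Throughput relative to one core, assuming perfectly parallel work."""
    return cores * frequency

power = relative_dynamic_power(cores=4, voltage=0.75, frequency=0.5)
performance = relative_throughput(cores=4, frequency=0.5)
```

Under these assumptions `power` comes out at 1.125, which the example rounds to 1X, while `performance` is 2.0.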
Thermal Design Power (TDP)
• This is the maximum power at which the part is designed to operate
– Dictates the design of the cooling system
• Max temperature Tjmax
– Typically fixed by worst case workload
• Parts are typically operating below the TDP
• Opportunities for turbo mode (higher clock for short time)?
AMD Trinity APU
http://ecs.vancouver.wsu.edu/thermofluids-research
Power and Architecture Activity
• For example, at the nth clock cycle, the collected counters are:
– Data cache:
• read = 20, write = 12;
• per-read energy = 0.5nJ; per-write energy = 0.6nJ;
• Read energy = read*per-read energy = 10nJ
• Write energy = write*per-write energy = 7.2nJ
• Total activity energy = read+write energies = 17.2nJ
• If n = 50th clock cycle and clock frequency = 2GHz,
Total activity power = energy*clock_freq/n = 688mW
*Note: n/clock_freq = n clock periods in sec; power = time average of energy
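The counter arithmetic above can be replayed in a few lines. The function name `activity_power` and the dictionary layout are hypothetical; the counts, per-access energies, cycle count, and clock frequency are the ones given in the example.

```python
def activity_power(counts, energy_per_access, n_cycles, clock_hz):
    """Average power in watts: total access energy divided by elapsed time."""
    energy = sum(counts[kind] * energy_per_access[kind] for kind in counts)
    elapsed = n_cycles / clock_hz          # n clock periods, in seconds
    return energy / elapsed

counts = {"read": 20, "write": 12}                  # data cache accesses
energy_per_access = {"read": 0.5e-9, "write": 0.6e-9}   # joules per access

power_watts = activity_power(counts, energy_per_access,
                             n_cycles=50, clock_hz=2e9)
```

Dividing by the elapsed time `n_cycles / clock_hz` is the same as multiplying the energy by `clock_freq / n`, as the note above states.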
Instruction Level Parallelism (ILP)
• Single (program) thread of execution
• Issue multiple instructions from the same instruction stream
• Average CPI<1
• Often called out of order (OOO) cores
[Figure: pipeline stages IF, ID, EX, MEM, WB, with multiple instructions in EX at the same time.]
Thread Level Parallelism (TLP)
• Multiple threads of execution
• Exploit ILP in each thread
• Exploit concurrent execution across threads
Programming Model: Message Passing
• Each processor has private physical address space
• Hardware sends/receives messages between processors
Graphics Processing Unit - GPU
• Early video cards
– Frame buffer memory with address generation for video output
• 3D graphics processing
– Originally high-end computers (e.g., SGI)
– Moore’s Law → lower cost, higher density
– 3D graphics cards now for PCs and game consoles
• Graphics Processing Units
– Processors oriented to 3D graphics tasks
– Vertex/pixel processing, shading, texture mapping, rasterization
• Processing is highly data-parallel
– GPUs are highly multithreaded
– Use thread switching to hide memory latency
• Less reliance on multi-level caches
– Graphics memory is wide and high-bandwidth
• Trend toward general purpose GPUs
– Heterogeneous CPU/GPU systems
– CPU for sequential code, GPU for parallel code