Memory Hierarchies and Thread Performance
Transcript of lecture slides (CSE 160, Chien, Spring 2005): cseweb.ucsd.edu/groups/csag/html/teaching/cse160s05/...
Lecture #6, Slide 1
Memory Hierarchies and Thread Performance
• Last Time
  » Parallelism = Performance
  » Threads and Nodes as Basis for Large-scale Parallelism
• Today
  » Caches and Performance
  » Threads and Caches
  » Parallel Machine Structure
• Reminders/Announcements
  » Homework #2 went out Tuesday (due April 21)
  » Homework #1 already returned (see Sagnik)
Lecture #6, Slide 2
The Processor-Memory Gap
[Figure: performance on a log scale (10^0 to 10^5) versus year, 1980-2005. Processor performance climbs steeply while memory (DRAM) performance grows slowly, opening the processor-memory gap.]
Lecture #6, Slide 3
Caches
• Caches are critical to “span the gap”
  » Exploit locality; keep the needed data in a small, fast memory
  » Avoid accessing the slow memory
• If successful
  » The processor can execute faster; memory requests are OFTEN satisfied quickly
  » The memory system can be much lower bandwidth; many fewer requests actually go to memory
Lecture #6, Slide 4
A Typical Memory Hierarchy
[Diagram: the hierarchy runs from the processor P through on-chip caches (L1 and L2), an off-chip L3 cache, and main memory. Levels near the processor are small with expensive $/bit; main memory is large with cheap $/bit.]
Lecture #6, Slide 5
Other Attributes of a Memory Hierarchy
[Diagram: the same hierarchy annotated by access characteristics: on-chip caches near P have low latency and high bandwidth; main memory has high latency and low bandwidth.]
Lecture #6, Slide 6
Finding Cache Data
• For fast access, we must be able to find data quickly
• Lookup: discard a few address bits, access the fast memory
• Tags indicate the data that’s really there
• Why does this work?
  » Memory location + tag -> complete address
[Diagram: access + lookup. The memory address is split into a cache index, which selects an entry in the cache's data and tag arrays, and a cache tag, which is compared against the stored tag.]
Lecture #6, Slide 7
Cache Access
• Part of the memory address is applied to the cache
• Tags are checked against the remaining address bits
• Match -> hit! Use the data
• No match -> miss; retrieve the data from memory
• This works pretty well... but there are some complications...
[Diagram: memory address lines index the data and tag arrays; the stored tag is compared (=) against the remaining address bits to produce hit?]
Lecture #6, Slide 8
Multiword Cache Entries
• Cache entries are several contiguous words of memory
• A shared tag involves only contiguous addresses
  » reduces tag overhead
  » exploits spatial locality
• A few address lines select the word from the line on a “hit”... Which ones?
[Diagram: cache organized as multiword blocks, one tag per block.]
Lecture #6, Slide 9
Accessing a Sample Cache
• 64 KB cache, direct-mapped, 32-byte cache block size
[Diagram: the 32-bit address is split into a 5-bit word/byte offset (bits 4-0), an 11-bit index (bits 15-5) selecting one of 64 KB / 32 bytes = 2 K cache blocks/sets (rows 0 through 2047), and a 16-bit tag (bits 31-16). Each row holds a valid bit, the 16-bit tag, and 256 bits (32 bytes) of data; comparing the stored tag against the address tag produces hit/miss.]
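To make the split concrete, here is a minimal C sketch (my illustration, not from the slides) that decomposes a 32-bit address for this 64 KB direct-mapped cache with 32-byte blocks; the example address is arbitrary.

```c
#include <stdint.h>
#include <stdio.h>

/* 64 KB direct-mapped cache, 32-byte blocks:
 *   offset = 5 bits, index = 11 bits, tag = 16 bits */
#define OFFSET_BITS 5
#define INDEX_BITS  11

int main(void) {
    uint32_t addr = 0x1234ABCDu;  /* arbitrary example address */
    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);
    printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
    return 0;
}
```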
Lecture #6, Slide 10
Accessing a Sample Cache
• 32 KB cache, 2-way set-associative, 16-byte block size
[Diagram: the 32-bit address is split into a 4-bit word offset (bits 3-0), a 10-bit index (bits 13-4) selecting one of 32 KB / 16 bytes / 2 = 1 K cache sets (rows 0 through 1023), and an 18-bit tag (bits 31-14). Each set holds two ways of (valid, tag, data); both stored tags are compared (=) in parallel against the address tag to produce hit/miss.]
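A matching sketch (again my illustration, with assumed structure and names) of how a 2-way set-associative lookup probes both ways of the selected set:

```c
#include <stdbool.h>
#include <stdint.h>

#define SETS  1024           /* 32 KB / 16 B blocks / 2 ways */
#define WAYS  2
#define BLOCK 16

struct line { bool valid; uint32_t tag; uint8_t data[BLOCK]; };
struct line cache[SETS][WAYS];

/* Returns the matching way on a hit, or -1 on a miss. */
int lookup(uint32_t addr) {
    uint32_t index = (addr >> 4) & (SETS - 1);   /* bits 13-4  */
    uint32_t tag   = addr >> 14;                 /* bits 31-14 */
    for (int w = 0; w < WAYS; w++)
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return w;                            /* hit  */
    return -1;                                   /* miss */
}
```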
Lecture #6, Slide 11
Example – HP/Compaq/DEC Alpha 21164 Caches
[Diagram: the 21164 CPU core feeds split on-chip instruction and data caches, backed by a unified on-chip L2 cache and an off-chip L3 cache.]
• ICache and DCache -- 8 KB, DM, 32-byte lines
• L2 cache -- 96 KB, 3-way SA, 32-byte lines
• L3 cache -- 1 MB, DM, 32-byte lines
Lecture #6, Slide 12
Example
• 2.4 GHz Dual Xeon Server
  » Dual processors
  » L1 cache – 8 KB, 2-3 clocks, ~10 GB/s
  » L2 cache – 512 KB, 6 GB/s
  » Memory system – 100s of clocks, 2 GB/s shared
• Caches must be effective for this system to work well.
• What is the ratio, just on bandwidth alone? Block sizes... (a rough calculation follows below)
• What type of working set is possible for an application? (how much locality)

This is comparable to the FWGrid Opteron nodes!
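Working the bandwidth question with the numbers on this slide (my back-of-the-envelope arithmetic, not an official answer):

```latex
\frac{BW_{L1}}{BW_{mem}} \approx \frac{10\ \mathrm{GB/s}}{2\ \mathrm{GB/s}} = 5,
\qquad
\frac{BW_{L1}}{BW_{mem}/2\ \mathrm{procs}} \approx \frac{10\ \mathrm{GB/s}}{1\ \mathrm{GB/s}} = 10
```

So on bandwidth alone, each processor's L1 can consume roughly ten times what its share of the memory system can deliver; the cache must satisfy the large majority of accesses.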
Lecture #6, Slide 13
Caches and Single Thread Performance
• So, why does this matter?
  » Single-thread performance is tied to cache performance
  » High locality, high cache hit rate is essential
  » Low hit rate => low thread performance
  » Single-thread performance is the building block for high parallel performance
• For good parallel performance, single threads must make good use of cache hierarchies (see the loop sketch below)
[Figure: 16x threads multiply a single-thread performance base of OK, Good, or Excellent; parallel performance scales whatever the single thread delivers.]
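As one concrete illustration (mine, not from the slides) of locality and hit rate: the two loops below read the same array, but the row-major walk uses every word of each cache block it fetches, while the column-major walk touches a new block on nearly every access for large N.

```c
#include <stdio.h>

#define N 1024
static double a[N][N];

/* Row-major walk: consecutive accesses fall in the same cache
 * block, so most accesses hit (high spatial locality). */
double sum_rows(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

/* Column-major walk: each access jumps N*8 bytes, touching a
 * different block every time; for large N nearly every access
 * misses. */
double sum_cols(void) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void) {
    printf("%f %f\n", sum_rows(), sum_cols());
    return 0;
}
```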
Lecture #6, Slide 14
Multithreaded, Hyperthreading, or Simultaneous MT Processors
• Typically share the same memory hierarchy
  » Same caches, same “ports” to access the cache
  » No increased capacity, no increased bandwidth
  » Try to get higher throughput out of the processor
• Collective use of the memory hierarchy must have high locality
  » Capacity: threads should have small “working sets”; overlapping working sets are even better
  » Bandwidth: not too many misses, or miss bursts at different times
[Diagram: a single processor P and its L1/L2/L3 cache hierarchy, shared by all hardware threads.]
Lecture #6, Slide 15
Multi-Processors: Cache Coherence Problem
• Multiple processors sharing a single memory system
  » Separate cache hierarchy for each processor
• Copies of a datum may exist in multiple caches
  » Writes take a finite amount of time to propagate
  » Caches might contain a value and respond while another processor is writing (see the sketch below)
[Diagram: two processors, each with its own L1/L2/L3 cache hierarchy, attached to a single memory; copies of Obj A sit in both hierarchies.]

This is the structure of the FWGrid Opteron nodes!
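A small pthreads sketch (my illustration; the names and the 64-byte block size are assumptions) of how the coherence machinery shows up as cost even without a logical race: the two counters share one cache block, so every write by one processor invalidates the block in the other's cache (false sharing).

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 10000000L

/* Both counters live in the same cache block: each write forces a
 * coherence transaction on the other processor. Padding each
 * counter out to its own 64-byte block removes the traffic. */
struct { long a, b; } blk;

void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) blk.a++;
    return NULL;
}

void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < ITERS; i++) blk.b++;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a=%ld b=%ld\n", blk.a, blk.b);
    return 0;
}
```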
Lecture #6, Slide 16
Coherence Thru Write-back Caches
• The first write captures an exclusive copy of the cache block
• Successive writes can be buffered in the cache, reducing the required bus bandwidth; key for scalability
• The cache must store exclusive and modified state, which complicates the cache
• Eventually, the dirty, modified block must be written back into the main memory (or supplied to another cache)
• Basic scheme used for many machines...
[Timeline: Processor: Write A, Load A, Write A, Load A, Write A. Bus: Write A ............ Flush A. Only the first write and the eventual flush appear on the bus.]
Lecture #6, Slide 17
Basic Write-back Cache Protocol
• Invalid, Shared, Dirty (exclusive); A/B -- get A, do B
• The Dirty state is obtained through an ownership bus transaction; exiting this state requires a flush to write the data back
• BusRd gets a read copy, BusRdX gets a write copy
• Work out all of the transactions. Just about the simplest. (A code sketch follows the diagram.)
[FSM diagram, reconstructed as a transition list (event/action):
  I: PrRd/BusRd -> S; PrWr/BusRdX -> D
  S: PrRd/--; BusRd/--; PrWr/BusRdX -> D; BusRdX/-- -> I
  D: PrRd/--; PrWr/--; BusRd/Flush -> S; BusRdX/Flush -> I]
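The same protocol written out as C (my transcription of the diagram above into code, useful for checking the transitions):

```c
typedef enum { INVALID, SHARED, DIRTY } state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } event_t;

/* Next state for one cache line; *bus_op reports any transaction
 * this cache issues ("BusRd", "BusRdX", "Flush", or NULL). */
state_t step(state_t s, event_t e, const char **bus_op) {
    *bus_op = NULL;
    switch (s) {
    case INVALID:
        if (e == PR_RD) { *bus_op = "BusRd";  return SHARED; }
        if (e == PR_WR) { *bus_op = "BusRdX"; return DIRTY;  }
        return INVALID;
    case SHARED:
        if (e == PR_WR)   { *bus_op = "BusRdX"; return DIRTY; }
        if (e == BUS_RDX) { return INVALID; }
        return SHARED;                 /* PrRd, BusRd: no change */
    case DIRTY:
        if (e == BUS_RD)  { *bus_op = "Flush"; return SHARED;  }
        if (e == BUS_RDX) { *bus_op = "Flush"; return INVALID; }
        return DIRTY;                  /* PrRd, PrWr hit locally */
    }
    return s;
}
```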
Lecture #6, Slide 18
Illinois Protocol (MESI)
• Extend the 3-state protocol with VE (valid, exclusive)
• Idea: optimize BusRd and BusRdX into one transaction; keep track of private read copies
• The protocol extends naturally in other ways as well.
• Used in SGI Challenge arrays and MOST multiprocessors...
Lecture #6, Slide 19
Illinois Protocol (FSM)
• Other transitions are identical to the 3-state protocol
• VE “remembers” whether this cache has the only copy
[FSM diagram (partially recoverable): states I, S, VE, D. VE transitions: PrRd/BusRd(S) enters VE from I when the shared line indicates no other copies; PrRd/-- stays in VE; PrWr/-- moves VE to D with no bus transaction; BusRd/Flush moves VE to S; BusRdX/Flush moves VE to I.]
Lecture #6, Slide 20
Complications -- Life ain’t so simple...
• Write buffers
  » Delay write visibility; how does this affect the memory consistency model?
  » Aggregation; order can be changed as well. Flush instructions if necessary
• Split-transaction busses
  » Operations are not atomic; there may be many outstanding at the same time (again out of order)
  » No problem if the operations are not related (as above)
• Cache designs require simultaneous access
  » tag replication or dual porting
  » multi-level caches can alleviate this problem
• Reality: cache controllers compete with processors as the “really hard things to get right” in systems design
Lecture #6, Slide 21
Caches and Multiple Processors
• So, why does this matter?
  » Coordination and data sharing amongst threads can be expensive
  » Cross-processor – hundreds of cycles; all the way down and all the way back up
  » Ideally: threads operate on independent data
    – Allows execution without interference on a single processor
    – Allows execution with high efficiency on separate processors
[Diagram: two processors with separate L1/L2/L3 cache hierarchies sharing one memory, as on Slide 15.]
Parallel Machine Taxonomy
Lecture #6, Slide 23
Flynn’s Taxonomy
• Flynn (1966) classified machines by data and control streams:
  » Single Instruction, Single Data (SISD)
  » Single Instruction, Multiple Data (SIMD)
  » Multiple Instruction, Single Data (MISD)
  » Multiple Instruction, Multiple Data (MIMD)
Lecture #6, Slide 24
SISD
• SISD
  » Model of the serial von Neumann machine
  » Logically, a single control processor
[Diagram: a single processor P connected to a memory M.]
Lecture #6, Slide 25
SIMD
• SIMD
  » All processors execute the same program in lockstep
  » The data each processor sees may be different
  » Single control processor
  » Individual processors can be turned on/off at each cycle (“masking”)
  » CM-2 and MasPar are some examples
Lecture #6, Slide 26
CM2 Architecture
• The CM2 had 8K-64K 1-bit custom processors
• The Data Vault provides peripheral mass storage
• Programs had normal sequential control flow, but all operations happened in parallel, so the CM hardware supported a data-parallel programming model
[Diagram: four sequencers (Seq 0-3), each driving 16K processors, connected through a Nexus to four front ends (Frontend 0-3); the CM I/O system attaches Data Vaults and a graphic display.]
Lecture #6, Slide 27
MIMD
• All processors execute their own set of instructions
• Processors operate on separate data streams
• No centralized clock is implied
• MIMD machines may be shared-memory or message-passing
• SP-2, T3E, Origin 2K, Tera MTA, clusters, Crays, etc.
• => Most machines built from traditional processors have this structure
Lecture #6, Slide 28
Models for Communication
• Parallel program = a program composed of tasks (processes) which communicate to accomplish an overall computational goal
• Two prevalent models for communication:
  » Message passing (MP)
  » Shared memory (SM)
Lecture #6, Slide 29
Message Passing Communication
• Processes in a message passing program communicate by passing messages
• Basic message passing primitives:
  » Send(parameter list)
  » Receive(parameter list)
  » Parameters depend on the software and can be complex
[Diagram: process A sends a message to process B.]
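As a concrete instance of these primitives, here is a minimal MPI version in C (my illustration; the slide itself doesn't name a specific library):

```c
#include <mpi.h>
#include <stdio.h>

/* Rank 0 sends an integer to rank 1, which receives it. */
int main(int argc, char **argv) {
    int rank, value = 42;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```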
Lecture #6, Slide 30
Flavors of Message Passing
• Synchronous: used for routines that return when the message transfer has been completed
  » A synchronous send waits until the complete message can be accepted by the receiving process before sending the message (send suspends until receive)
  » A synchronous receive will wait until the message it is expecting arrives (receive suspends until the message is sent)
  » Also called blocking
[Diagram: synchronous handshake between A and B: request to send, acknowledgement, then the message.]
Lecture #6, Slide 31
Asynchronous (Non-blocking) Message Passing
• Nonblocking sends return whether or not the message has been received
  » If the receiving processor is not ready, the message may be stored in a message buffer
  » The message buffer holds messages being sent by A prior to being accepted by the receive in B
  » MPI:
    – Routines that use a message buffer and return after their local actions complete are blocking (even though the message transfer may not be complete)
    – Routines that return immediately are non-blocking
[Diagram: A deposits the message into a message buffer, from which B later receives it.]
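A sketch of MPI's immediately-returning (non-blocking) routines, completed later with MPI_Wait (my illustration of the distinction drawn above):

```c
#include <mpi.h>
#include <stdio.h>

/* Rank 0 posts a non-blocking send; rank 1 posts a non-blocking
 * receive. Both return immediately, so each process can do other
 * work before MPI_Wait completes the transfer. */
int main(int argc, char **argv) {
    int rank, value = 7;
    MPI_Request req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... overlap computation with communication here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        /* ... other work ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        printf("rank 1 got %d\n", value);
    }
    MPI_Finalize();
    return 0;
}
```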
Lecture #6, Slide 32
Architectural support for Message Passing
• The interconnection network should provide connectivity, low latency, and high bandwidth
• Many interconnection networks have been developed over the last 2 decades
  » Small-size: non-blocking (Xbar)
  » Medium-size: non-blocking (Xbar)
  » Large-scale: mesh, torus, multi-stage networks
[Diagram: processors, each with local memory, joined by an interconnection network. Caption: Basic Message Passing Multicomputer.]
Lecture #6, Slide 33
Shared Memory Communication
• Processes in a shared memory program communicate by accessing shared variables and data structures
• Basic shared memory primitives:
  » Read of a shared variable (or object)
  » Write to a shared variable (or object)
[Diagram: processors connected through an interconnection network to a set of memories. Caption: Basic Shared Memory Multiprocessor Architecture.]
Lecture #6, Slide 34
Accessing Shared Objects
• Conflicts may arise if multiple processes want to write to a shared variable at the same time.
• The programmer, language, and/or architecture must provide a means of resolving conflicts (one option is sketched below)

[Diagram: shared variable x incremented (+1) concurrently by process A and process B.]

Process A, B:
  read x
  compute x+1
  write x
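One common resolution mechanism is a lock; here is a pthreads sketch (my illustration; the slide doesn't prescribe a particular mechanism) of the read/compute/write sequence made safe with a mutex:

```c
#include <pthread.h>
#include <stdio.h>

long x = 0;
pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

/* Without the lock, both threads can read the same x, both compute
 * x+1, and both write it back, losing one increment. */
void *increment(void *arg) {
    (void)arg;
    pthread_mutex_lock(&x_lock);
    long tmp = x;          /* read x  */
    tmp = tmp + 1;         /* compute */
    x = tmp;               /* write x */
    pthread_mutex_unlock(&x_lock);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, increment, NULL);
    pthread_create(&b, NULL, increment, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("x = %ld\n", x);   /* always 2 with the lock */
    return 0;
}
```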
Lecture #6, Slide 35
Architectural Support for Shared Memory
• 4 basic types of interconnection media:
  » Bus (not really used any more)
  » Crossbar switch
  » Multistage network
  » Interconnection network with distributed shared memory
Lecture #6, Slide 36
Limited Scalable Media
• Crossbar
  » A crossbar switch connects m processors and n memories with distinct paths between each processor/memory pair
  » The crossbar provides uniform access to shared memory (UMA)
  » O(mn) switches are required for m processors and n memories
  » The crossbar is scalable in terms of performance but not in terms of cost; used for the basic switching mechanism in the SP2
[Diagram: crossbar connecting processors P1-P5 to memories M1-M5, with a switch at every crossing.]
Lecture #6, Slide 37
Multistage Networks
• Multistage networks provide more scalable performance than a bus but are less costly to scale than a crossbar
• Typically max{log n, log m} stages connect n processors and m shared memories
• “Omega” networks (butterfly, shuffle-exchange) are commonly used as multistage networks
• Multistage networks were used in the CM-5 (fat-tree connecting processor/memory pairs), the BBN Butterfly (butterfly), and the IBM RP3 (omega)
[Diagram: processors P1-P5 connected to memories M1-M5 through switch stages 1 through k.]
Lecture #6, Slide 38
Omega Networks
• Butterfly multistage
  » Used for the BBN Butterfly, TC2000
• Shuffle multistage
  » Used for the RP3 and the SP2 high-performance switch
(A wiring sketch follows the diagram below.)
[Diagram: example butterfly and shuffle wiring patterns connecting inputs 1-4 to outputs A-D.]
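Both wiring patterns have compact bit-level definitions; this C sketch (my illustration, using the standard perfect-shuffle and butterfly permutations) prints them for 8 nodes:

```c
#include <stdio.h>

#define LOGN 3
#define N (1 << LOGN)   /* 8 nodes */

/* Perfect shuffle: rotate the node number's bits left by one
 * (the wiring between stages of a shuffle-exchange/omega network). */
unsigned shuffle(unsigned i) {
    return ((i << 1) | (i >> (LOGN - 1))) & (N - 1);
}

/* Butterfly at stage k: flip bit k to find the paired node. */
unsigned butterfly(unsigned i, int k) {
    return i ^ (1u << k);
}

int main(void) {
    for (unsigned i = 0; i < N; i++)
        printf("node %u -> shuffle %u, stage-0 butterfly partner %u\n",
               i, shuffle(i), butterfly(i, 0));
    return 0;
}
```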
Lecture #6, Slide 39
Fat-tree Interconnect
• Bandwidth is increased towards the root
• Used for the data network of the CM-5 (MIMD MPP)
  » 4 leaf nodes; internal nodes have 2 or 4 children
• To route from leaf A to leaf B, pick a random switch C in the least common ancestor fat node of A and B, then take the unique tree route from A to C and from C to B

[Figure: binary fat-tree in which all internal nodes have two children.]
Lecture #6, Slide 40
Distributed Shared Memory
• Memory is physically distributed but programmed as shared memory
  » Programmers find the shared memory paradigm desirable
  » Shared memory is distributed among processors; accesses may be sent as messages
  » The difference between access to local memory and to global shared memory creates NUMA (non-uniform memory access) architectures
  » The BBN Butterfly is a NUMA shared memory multiprocessor
[Diagram: processor-plus-local-memory (PM) nodes connected through an interconnection network (the BBN Butterfly interconnect) to shared memory.]
Lecture #6, Slide 41
Using both Shared Memory and Message Passing together
• Clusters of SMPs may be effectively programmed using both SM and MP
  » Shared memory used within a multiple-processor machine/node
  » Message passing used between nodes
• FWGrid has this structure (a skeletal example follows)
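A skeletal C sketch of this hybrid style (my illustration; the specific pairing of MPI with pthreads is an assumption, not something the slides specify): pthreads share memory within a node while MPI passes messages between nodes.

```c
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

/* Shared-memory parallel work within one node. */
void *node_work(void *arg) {
    long id = (long)arg;
    printf("thread %ld working\n", id);
    return NULL;
}

int main(int argc, char **argv) {
    int provided, rank;
    /* FUNNELED: only the main thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, node_work, (void *)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);

    MPI_Barrier(MPI_COMM_WORLD);   /* message-passing step across nodes */
    if (rank == 0) printf("all nodes done\n");
    MPI_Finalize();
    return 0;
}
```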
Lecture #6, Slide 42
Summary
• Single-thread performance is the building block for parallel performance
  » Threads should have high data locality and share caches efficiently for MT, HT, and SMT to work well
• Threads running on multiprocessors must be decoupled to achieve good parallelism
  » High data locality, but moderate sharing of data
• Parallel Machine Taxonomy
  » SISD (traditional), SIMD, MIMD
  » Shared and distributed memory