Cheat Sheet Prepared for Advanced Computer Architecture Midterm Exam - UofM


MULTIPROCESSOR

Different software models to exploit thread-level parallelism

Parallel processing: Execution of a tightly coupled set of threads collaborating on a single task.

Request Level Parallelism: Execution of multiple, relatively independent processes that may originate from one or more users

Multithreading: A technique that supports multiple threads executing in an interleaved fashion on a single multiple-issue processor.

Clusters: Ultrascale computers built from a very large number of processors connected with networking technology. When these clusters grow to tens of thousands of servers and beyond, we call them warehouse-scale computers.

Multicomputers: Large-scale multiprocessor systems that are less tightly coupled than typical multiprocessors but more tightly coupled than warehouse-scale systems.

Grain Size: Amount of computation assigned to a thread.

Multiprocessors are classified based on their memory organization:

Symmetric (shared-memory) multiprocessors, or SMPs, or centralized shared-memory multiprocessors: share a single centralized memory that all processors have access to, hence symmetric. The SMP architecture is also called Uniform Memory Access (UMA), since all processors see a uniform latency to memory.

Distributed Shared Memory (DSM) Multiprocessor, or Non-Uniform Memory Access (NUMA) multiprocessor: memory is physically distributed among the processors, so access latency depends on the location of the data relative to the requesting processor.

Challenges of Parallel Processing

Limited parallelism available in programs.

Large latency of remote access in a parallel processor: 35 to 50 clock cycles among cores on the same chip; 100 to 500 clock cycles among cores on separate chips.

Coherence vs. Consistency: Coherence defines memory access behaviour for the same memory location; consistency defines memory access behaviour across different locations. A coherent memory system preserves the order among accesses to the same location by different processors, whereas a consistent memory system preserves the order between accesses to different locations issued by a given processor.

Write Serialization: Two writes to the same location by any two processors are seen in the same order by all processors.

Sequential Consistency: An execution of a program is sequentially consistent if it is possible to construct a hypothetical serial order of all operations to memory (i.e., to all locations) that is consistent with the results of the execution.
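The definition above can be checked mechanically on the classic two-thread litmus test. The sketch below (names and structure are illustrative, not from the original notes) enumerates every interleaving that preserves each thread's program order, i.e., every execution a sequentially consistent machine may produce, and shows that both reads returning the old value is impossible:

```python
from itertools import permutations

def sc_outcomes():
    """Enumerate every interleaving of the classic litmus test that
    preserves each thread's program order, as sequential consistency
    requires.
      Thread A: x = 1; r1 = y
      Thread B: y = 1; r2 = x
    Returns the set of possible (r1, r2) results."""
    ops = [("A", 0), ("A", 1), ("B", 0), ("B", 1)]
    results = set()
    for order in permutations(ops):
        # keep only interleavings that respect program order in each thread
        if order.index(("A", 0)) > order.index(("A", 1)):
            continue
        if order.index(("B", 0)) > order.index(("B", 1)):
            continue
        x = y = 0
        r1 = r2 = None
        for op in order:
            if op == ("A", 0):
                x = 1          # A's write
            elif op == ("A", 1):
                r1 = y         # A's read
            elif op == ("B", 0):
                y = 1          # B's write
            elif op == ("B", 1):
                r2 = x         # B's read
        results.add((r1, r2))
    return results

print(sc_outcomes())  # (0, 0) is absent: under SC at least one read sees a new value
```

Only 6 of the 24 orderings respect program order, and none of them yields (0, 0); a machine that can produce (0, 0) is therefore not sequentially consistent.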

Cache Coherence Schemes:

Directory based: Uses centralized information to avoid broadcast. The sharing status of a particular block of physical memory is kept in one centralized location, called the directory. It scales well to a large number of processors.


Snooping: Relies on broadcast to observe all coherence traffic. Every cache that has a copy of the data from a block of physical memory can track the sharing status of the block. Well suited to buses and small-scale systems.

MSI Protocol: The basic three-state invalidation protocol; each cache block is in the Modified, Shared, or Invalid state.

MESI: Adds the Exclusive state to the basic MSI protocol to indicate when a cache block is resident only in a single cache but is clean. A block in the E state can be written without generating any invalidates, which optimizes the case where a block is read by a single cache before being written by that same cache. Subsequent writes to a block in the E state by the same core need not acquire bus access or generate an invalidate, since the block is known to be exclusive to this cache; the processor merely changes the state to Modified. A read miss by another core to a block in the E state causes a state change to Shared to maintain coherence.
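The transitions described above can be written out as a small state function. This is a partial sketch covering only the events discussed (the function name and event labels are illustrative), not a full MESI transition table:

```python
def mesi_next(state, event):
    """Next state of one cache's block under the MESI transitions
    described above. Returns (new_state, needs_bus_invalidate).
    States are "Modified", "Exclusive", "Shared", "Invalid"."""
    if event == "local_write":
        if state == "Exclusive":
            return "Modified", False   # silent E -> M: no invalidate needed
        if state == "Modified":
            return "Modified", False   # already dirty and exclusive
        if state in ("Shared", "Invalid"):
            return "Modified", True    # other copies must be invalidated
    if event == "remote_read_miss":
        if state in ("Modified", "Exclusive"):
            return "Shared", False     # drop to Shared to maintain coherence
        return state, False            # Shared/Invalid unaffected
    raise ValueError((state, event))
```

The key optimization is visible in the first two branches: a write to an Exclusive block changes state without touching the bus, while a write to a Shared block must broadcast an invalidate.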

MOESI: Adds the Owned state to the MESI protocol to indicate that the associated block is owned by that cache and is out of date in memory. In the MSI and MESI protocols, when there is an attempt to share a block in the Modified state, the state is changed to Shared (in both the original and the newly sharing cache) and the block must be written back to memory. In the MOESI protocol, however, the block can be changed from Modified to Owned in the original cache without writing it to memory. The newly sharing cache keeps the block in the Shared state. On a miss, the owner of the block must supply the block, since it is not up to date in memory, and must write it back to memory if it is replaced.
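The MESI/MOESI contrast above boils down to one transition. The sketch below (function name is illustrative) captures what happens to the cache holding a Modified block when another core read-misses on it:

```python
def serve_remote_read_of_modified(protocol):
    """Transition taken by the cache holding a block in Modified state
    when another core issues a read miss for it, per the contrast above.
    Returns (owner_new_state, writeback_to_memory_now)."""
    if protocol == "MESI":
        return "Shared", True    # dirty block must be written back to memory
    if protocol == "MOESI":
        return "Owned", False    # memory stays stale; this cache supplies data
    raise ValueError(protocol)
```

MOESI defers the memory writeback until the Owned block is eventually replaced, saving memory bandwidth on every Modified-to-shared transition in between.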

PITFALLS OF SNOOPING SCHEMES:

As the number of processors grows, or as the memory demands of each core grow, the centralized resource in the system (main memory or a shared L3 cache) can become a bottleneck. Snooping bandwidth at the caches is also a problem, since every cache must examine every miss placed on the bus.

FALSE SHARING MISSES: Occur when a block is invalidated (and a subsequent reference causes a miss) because some word in the block, other than the one being read, is written into. If the word written into is actually used by the processor that received the invalidate, then the reference was a true sharing reference and would have caused a miss independent of the block size of the cache.
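A toy trace makes the block-size dependence concrete. The sketch below (names and the trace itself are illustrative) has core 0 repeatedly writing word 0 while core 1 repeatedly reads word 1 under an invalidation protocol; the misses vanish once the two words land in different blocks:

```python
def count_false_sharing_misses(block_words, writes):
    """Toy two-core trace: core 0 repeatedly writes word address 0 while
    core 1 repeatedly reads word address 1. The two words fall in the
    same block iff block_words > 1, so each write then invalidates
    core 1's copy even though core 1 never touches word 0.
    Returns core 1's misses beyond the compulsory (cold) miss."""
    same_block = (0 // block_words) == (1 // block_words)
    core1_has_copy = False
    misses = 0
    for _ in range(writes):
        if same_block:          # core 0's write invalidates the shared block
            core1_has_copy = False
        if not core1_has_copy:  # core 1's read misses and refetches
            misses += 1
            core1_has_copy = True
    return misses - 1           # subtract the cold miss
```

With 8-word blocks every one of core 0's writes costs core 1 a miss; with 1-word blocks core 1 misses only once, which is exactly why these are *false* sharing misses: they would not occur at a smaller block size.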

Directory Based Cache Coherence Protocol


In addition to tracking the state of each potentially shared memory block, we must also track which nodes have copies of that block, since those copies need to be invalidated on a write.

The local node is the node where a request originates. The home node is the node where the memory location and the directory entry reside. The physical address space is statically distributed, so the node that contains the memory and directory for a given physical address is known. A remote node is a node that has a copy of the cache block, whether exclusive or shared.
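The two mechanisms above, a known home node per address and a per-block record of sharers, can be sketched as follows. The block-interleaved home mapping and all names here are illustrative assumptions, not fixed by the protocol:

```python
BLOCK_BYTES = 64   # assumed block size
NUM_NODES = 4      # assumed machine size

def home_node(addr):
    """Static distribution of the physical address space: interleave
    blocks across nodes, so the home of any address is computable."""
    return (addr // BLOCK_BYTES) % NUM_NODES

class DirectoryEntry:
    """Per-block directory state kept at the home node: a protocol state
    plus a bit vector of the nodes holding a copy (the copies that must
    be invalidated on a write)."""
    def __init__(self):
        self.state = "Uncached"
        self.sharers = 0               # bit i set => node i has a copy

    def add_sharer(self, node):
        self.sharers |= 1 << node
        self.state = "Shared"

    def invalidate_for_write(self, writer):
        """Return the nodes that must receive invalidates (every sharer
        except the writer), then record the writer as sole holder."""
        targets = [n for n in range(NUM_NODES)
                   if (self.sharers >> n) & 1 and n != writer]
        self.sharers = 1 << writer
        self.state = "Exclusive"
        return targets
```

Because the directory sends invalidates only to the nodes in the bit vector, traffic scales with the number of actual sharers rather than with the machine size, which is why directories avoid broadcast.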


SYNCHRONIZATION
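The transcript of the synchronization pages did not survive, so no definitions are reproduced here. As a hedged sketch of the standard primitive covered in this unit, the following models a spin lock built on an atomic exchange; the function and variable names are illustrative, and the atomicity is only modeled, not real:

```python
def atomic_exchange(mem, addr, value):
    """Model of the atomic exchange primitive: swap `value` with the
    memory word at `addr` in one indivisible step (atomicity is assumed
    here; real hardware provides it)."""
    old = mem[addr]
    mem[addr] = value
    return old

def spin_lock(mem, lock_addr):
    """Spin until the exchange returns 0, meaning the lock was free and
    this caller has now atomically set it to 1 (held)."""
    while atomic_exchange(mem, lock_addr, 1) != 0:
        pass  # lock held by someone else; keep spinning

def spin_unlock(mem, lock_addr):
    mem[lock_addr] = 0  # a plain store releases the lock
```

The atomic exchange is what makes this correct: if two processors race on a free lock, the hardware serializes the two exchanges, so exactly one of them sees the old value 0 and wins.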

Page 5: Cheat Sheet Prepared for Advanced Computer Architecture Midterm Exam - UofM
Page 6: Cheat Sheet Prepared for Advanced Computer Architecture Midterm Exam - UofM
Page 7: Cheat Sheet Prepared for Advanced Computer Architecture Midterm Exam - UofM
Page 8: Cheat Sheet Prepared for Advanced Computer Architecture Midterm Exam - UofM
Page 9: Cheat Sheet Prepared for Advanced Computer Architecture Midterm Exam - UofM
Page 10: Cheat Sheet Prepared for Advanced Computer Architecture Midterm Exam - UofM
Page 11: Cheat Sheet Prepared for Advanced Computer Architecture Midterm Exam - UofM