ccNUMA Cache Coherent Non-Uniform Memory Access


Chris Coughlin MSCS521

Prof. Ten Eyck

Spring 2004

Let’s First Talk About Computer Architectures

In 1966, Michael Flynn proposed a classification for computer architectures based on the number of instruction streams and data streams (Flynn’s Taxonomy).

SISD (Single Instruction Stream, Single Data Stream)

• A single-processor computer (uniprocessor) in which a single stream of instructions is generated from the program.

SIMD (Single Instruction Stream, Multiple Data Stream)

• Each instruction is executed on a different set of data by different processors. (Used for vector and array processing.)

MISD (Multiple Instruction Stream, Single Data Stream)

• Each processor executes a different sequence of instructions.

• Never been commercially implemented.

MIMD (Multiple Instruction Stream, Multiple Data Stream)

• Each processor has a separate program.

• An instruction stream is generated from each program.

• Each instruction operates on different data.

Multiprocessors

• The idea behind multiprocessors is to create powerful computers by connecting many smaller ones.

• Computational speed is increased by using multiple processors operating together on a single problem.

• A parallel processing program is a single program that runs on multiple processors simultaneously.

• The overall problem is split into parts, each of which is performed by a separate processor in parallel.

• In addition to a faster solution, it may also generate a more precise solution.
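A minimal sketch of the splitting idea above, assuming Python's standard `concurrent.futures` module. Threads stand in for processors here, and the helper names `split` and `parallel_sum` are hypothetical, not taken from the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum(data, workers=4):
    """Sum each chunk on its own worker, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, split(data, workers)))
    return sum(partials)
```

Each worker handles one part of the overall problem, and the final `sum(partials)` is the step that combines the separate results into a single answer.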

MIMD Systems

Shared Memory Multiprocessor System

• Multiple processors are connected to multiple memory modules such that each processor can access any other processor’s memory module. This multiprocessor employs a shared address space (also known as a single address space).

• Communication is implicit with loads and stores – there is no explicit recipient of a shared memory access.

• Processors may communicate without necessarily being aware of one another.

• A single image of the operating system runs across all the processors.

MIMD Systems (cont.)

Multicomputer

• A term for parallel processors with separate, private address spaces (not accessible by the other processors in the system).

• Communicate by message-passing – the messages carry data from one processor to another as dictated by the program.

• Complete computers, consisting of a processor and local memory, connected through an interconnection network (e.g. a LAN).
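The two communication styles can be contrasted with a rough sketch, assuming Python threads stand in for processors. A plain dictionary plays the role of the shared address space and a `queue.Queue` plays the role of the interconnection network; both function names are hypothetical.

```python
import queue
import threading


def shared_memory_demo():
    """Communicate implicitly: one thread stores, the other loads."""
    shared = {"value": None}          # stands in for a shared address space
    ready = threading.Event()

    def producer():
        shared["value"] = 42          # an ordinary store, no named recipient
        ready.set()

    t = threading.Thread(target=producer)
    t.start()
    ready.wait()                      # wait until the store has happened
    result = shared["value"]          # an ordinary load
    t.join()
    return result


def message_passing_demo():
    """Communicate explicitly: the sender puts a message on a channel."""
    channel = queue.Queue()           # stands in for the network

    def sender():
        channel.put(42)               # explicit send

    t = threading.Thread(target=sender)
    t.start()
    result = channel.get()            # explicit receive, blocks until data arrives
    t.join()
    return result
```

In the first function the data moves through ordinary loads and stores; in the second, nothing is visible to the receiver until a message is explicitly sent and received.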

Processor Organizations

Computer Architecture Classifications

• Single Instruction, Single Data Stream (SISD): Uniprocessor

• Single Instruction, Multiple Data Stream (SIMD): Vector Processor, Array Processor

• Multiple Instruction, Single Data Stream (MISD)

• Multiple Instruction, Multiple Data Stream (MIMD): Shared Memory (tightly coupled), Multicomputer (loosely coupled)

Note: We will expand on this later

Back to Shared Memory Multiprocessors

Two styles: UMA and NUMA.

UMA (Uniform Memory Access)

• The time to access main memory is the same for all processors since they are equally close to all memory locations.

• Machines that use UMA are called Symmetric Multiprocessors (SMPs).

• In a typical SMP architecture, all memory accesses are posted to the same shared memory bus.

• Contention - as more CPUs are added, competition for access to the bus leads to a decline in performance.

• Thus, scalability is limited to about 32 processors.

Shared Memory Multiprocessors (cont.)

NUMA (Non-Uniform Memory Access)

• Since memory is physically distributed, it is faster for a processor to access its own local memory than non-local memory (memory local to another processor or shared between processors).

• Unlike in SMPs, not all processors are equally close to all memory locations.

• A processor’s own internal computations can be done in its local memory leading to reduced memory contention.

• Designed to surpass the scalability limits of SMPs.
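The effect of local versus non-local access can be sketched as a weighted average. This is a simplified model, not a measurement: the function name and the latency figures used below are illustrative assumptions only.

```python
def average_access_time(local_fraction, local_ns, remote_ns):
    """Expected memory latency when a fraction of accesses go to local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    # Local accesses pay local_ns; all remaining accesses pay remote_ns.
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns
```

With assumed latencies of 100 ns local and 300 ns remote, `average_access_time(0.75, 100, 300)` gives 150.0, while `average_access_time(1.0, 100, 300)` gives 100.0. The more of a processor's work that stays in its local memory, the closer the average gets to the local figure.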

Communication and Connection Options for Multiprocessors

Category              Choice                  Number of Processors
Communication model   Message passing         8-256
                      Shared address, UMA     2-64
                      Shared address, NUMA    8-256
Physical connection   Network                 8-256
                      Bus                     2-36

Multiprocessors come in two main configurations: a single bus connection, and a network connection. The choice of the communication model and the physical connection depends largely on the number of processors in the organization. Notice that the scalability of NUMA makes it ideal for a network configuration. UMA, however, is best suited to a bus connection.

A Multiprocessor Bus Configuration

[Figure: three processors, each with its own cache, connected by a single bus to shared memory and I/O.]

The single bus design is limited in terms of scalability. The largest number of processors in a commercial product using this configuration is 36 (SGI Power Challenge).

A Multiprocessor Network Configuration

[Figure: three processors, each with its own cache and its own memory, connected to one another through a network.]

The network-connected processor design is very scalable. Since each processor has its own memory, the network connection is only used for communication between processors.

A Quick Look at Cache

• Modern processors use a faster, smaller cache memory to act as a buffer for slower, larger memory.

• Caches exploit the principle of locality in memory accesses.

Temporal locality: the concept that if data is referenced, it will tend to be referenced again soon after.

Spatial locality: the concept that data is more likely to be referenced soon if data near it was just referenced.

• Caches hold recently referenced data, as well as data near the recently referenced data.

• This can lead to performance increases by reducing the need to access main memory on every reference.
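To make the locality argument concrete, here is a toy simulation of a direct-mapped cache. The cache geometry (8 lines of 4 words) and the function name `hit_rate` are assumptions chosen for illustration, not details from the slides.

```python
def hit_rate(addresses, num_lines=8, block_size=4):
    """Replay word addresses through a tiny direct-mapped cache."""
    if not addresses:
        return 0.0
    lines = [None] * num_lines        # each entry holds the cached block number
    hits = 0
    for addr in addresses:
        block = addr // block_size    # neighbouring words share a block
        index = block % num_lines     # direct-mapped: one possible line per block
        if lines[index] == block:
            hits += 1
        else:
            lines[index] = block      # miss: fetch the whole block
    return hits / len(addresses)
```

A sequential walk such as `hit_rate(list(range(64)))` returns 0.75, because each miss brings in three neighbouring words that are used next (spatial locality). Repeating one address, `hit_rate([0] * 10)`, returns 0.9 (temporal locality). A stride of 32 words, `hit_rate([i * 32 for i in range(64)])`, returns 0.0, since every access maps to the same line with a different block.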

What is ccNUMA?

• The cc in ccNUMA stands for cache coherent.

• The use of cache memory in modern computer architectures leads to the cache coherence problem.

• It is a situation that can occur when two or more processors reference the same shared data. If one processor modifies its copy of the data, the other processors will have stale copies of the data in their caches.

• Machines that are cache coherent ensure that a processor accessing a memory location receives the most up-to-date version of the data.

• Cache coherence is maintained by software, special-purpose hardware, or both.

• NUMA systems that maintain cache coherence are referred to as ccNUMA machines.

• Since few applications still exist for non-cache coherent NUMA machines, the terms NUMA and ccNUMA are used interchangeably.
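The stale-copy situation described above can be reproduced with a deliberately incoherent toy model. The class `PrivateCache` and the function `stale_read_demo` are hypothetical names; the write-through behaviour is a simplifying assumption.

```python
class PrivateCache:
    """A per-processor cache with no coherence mechanism at all."""

    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:            # miss: copy from main memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value              # update the local copy
        self.memory[addr] = value             # write through to main memory


def stale_read_demo():
    memory = {"x": 1}
    cpu0, cpu1 = PrivateCache(memory), PrivateCache(memory)
    cpu0.read("x")
    cpu1.read("x")                            # both caches now hold x == 1
    cpu0.write("x", 2)                        # cpu1 is never told about this
    return cpu0.read("x"), cpu1.read("x"), memory["x"]
```

`stale_read_demo()` returns `(2, 1, 2)`: main memory and the writer agree on the new value, but the second processor keeps reading its stale copy. A cache coherent machine is one that rules this outcome out.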

Processor Organizations

Computer Architecture Classifications (revisited)

• Single Instruction, Single Data Stream (SISD): Uniprocessor

• Single Instruction, Multiple Data Stream (SIMD): Vector Processor, Array Processor

• Multiple Instruction, Single Data Stream (MISD)

• Multiple Instruction, Multiple Data Stream (MIMD): Shared Memory (tightly coupled), Multicomputer (loosely coupled)

• Shared Memory divides into UMA (SMP) and NUMA; ccNUMA falls under NUMA.

Cache Coherency Protocols

Snooping protocol

• A bus-based method in which cache controllers monitor the bus for activity and update or invalidate cache entries as necessary.

• Two types:

Write-invalidate: the writing processor sends an invalidation signal to the bus. All other caches check to see if they have a copy of the cache block. If they do, the block containing the data gets invalidated. The writing processor then changes its local copy.

Write-update: the writing processor broadcasts the new data over the bus and all copies are updated with the new value.

• Commercial machines use write-invalidate to preserve bandwidth.

• Write-update has the advantage of making the new values appear in the caches sooner.
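The two snooping variants can be compared in a small simulation. This is a sketch under simplifying assumptions (one word per cache block, write-through memory, no bus timing); `SnoopingBus`, `SnoopingCache`, and `snoop_demo` are hypothetical names.

```python
class SnoopingBus:
    """A shared bus that every attached cache listens to."""

    def __init__(self, memory, policy="invalidate"):
        if policy not in ("invalidate", "update"):
            raise ValueError("policy must be 'invalidate' or 'update'")
        self.memory = memory
        self.policy = policy
        self.caches = []

    def attach(self):
        cache = SnoopingCache(self)
        self.caches.append(cache)
        return cache

    def broadcast_write(self, writer, addr, value):
        # Every other cache checks whether it holds a copy of the address.
        for cache in self.caches:
            if cache is not writer and addr in cache.lines:
                if self.policy == "invalidate":
                    del cache.lines[addr]         # drop the stale copy
                else:
                    cache.lines[addr] = value     # refresh the copy in place


class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:                # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_write(self, addr, value)
        self.lines[addr] = value
        self.bus.memory[addr] = value             # assumed write-through


def snoop_demo(policy):
    bus = SnoopingBus({"x": 1}, policy)
    a, b = bus.attach(), bus.attach()
    a.read("x")
    b.read("x")
    a.write("x", 2)
    still_cached = "x" in b.lines                 # did b keep a copy?
    return still_cached, b.read("x")
```

`snoop_demo("invalidate")` returns `(False, 2)`: the second cache lost its copy and had to re-fetch. `snoop_demo("update")` returns `(True, 2)`: the new value was already sitting in the second cache, which is the "appears sooner" advantage noted above.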

Cache Coherency Protocols (cont.)

Directory-based protocol

• A central directory maintains the information about which memory locations are being shared in multiple caches and which are contained in just one processor’s cache.

• On any memory access, it knows the caches that need to be updated or invalidated.

• It is used by all software-based implementations of shared memory.

• It is a scalable scheme that is suitable for a network configuration.
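A rough sketch of the directory idea follows, assuming write-invalidate, one word per block, and write-through memory. The class name `Directory` and its fields are hypothetical; real directories track more state than a set of sharers.

```python
class Directory:
    """Central record of which caches hold a copy of each address."""

    def __init__(self, memory):
        self.memory = memory
        self.sharers = {}      # addr -> set of cache ids holding a copy
        self.caches = {}       # cache id -> {addr: value}
        self.messages = 0      # point-to-point invalidations sent

    def add_cache(self, cache_id):
        self.caches[cache_id] = {}

    def read(self, cache_id, addr):
        lines = self.caches[cache_id]
        if addr not in lines:                     # miss: fetch and register
            lines[addr] = self.memory[addr]
            self.sharers.setdefault(addr, set()).add(cache_id)
        return lines[addr]

    def write(self, cache_id, addr, value):
        # Only the caches listed as sharers are contacted; no broadcast.
        for other in self.sharers.get(addr, set()) - {cache_id}:
            del self.caches[other][addr]
            self.messages += 1
        self.sharers[addr] = {cache_id}
        self.caches[cache_id][addr] = value
        self.memory[addr] = value                 # assumed write-through
```

With four caches attached, if only caches 0 and 1 have read an address and cache 0 then writes it, `messages` goes up by one: only cache 1 is contacted, and caches 2 and 3 see no traffic. That targeted messaging is what lets the scheme work over a network rather than a shared bus.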

A Side-Effect of Cache Coherency

False sharing

• Caches are organized into blocks of contiguous memory locations, mainly because programs tend to use spatial locality of reference.

• It is therefore possible for two processors to share the same cache block, but to not share the same memory location within the block.

• If one processor writes to its own part of the block, it then causes the other processor’s entire block, including the memory location it was accessing, to get updated or invalidated.

• Unnecessary invalidations can affect performance.

• It is up to the programmer to detect it and avoid it.

• Compiler-based solutions are being researched.
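False sharing can be illustrated by counting invalidations in a toy write-invalidate model that works at block granularity. The function name and the 4-word block size are assumptions for illustration.

```python
def count_invalidations(writes, block_size):
    """Count copies invalidated under block-granular write-invalidate.

    `writes` is a sequence of (cpu, address) pairs.
    """
    holders = {}                      # block number -> set of cpus caching it
    invalidations = 0
    for cpu, addr in writes:
        block = addr // block_size
        others = holders.get(block, set()) - {cpu}
        invalidations += len(others)  # every other copy of the block is dropped
        holders[block] = {cpu}        # the writer now holds the only copy
    return invalidations
```

If two processors alternate writes to addresses 0 and 1, ten writes in total, `count_invalidations([(i % 2, i % 2) for i in range(10)], 4)` returns 9 even though the two never touch the same location. Moving the second processor's data to address 4, so that it falls in a different block, gives `count_invalidations([(i % 2, (i % 2) * 4) for i in range(10)], 4)`, which returns 0. Spacing per-processor data at least one block apart is one way a programmer can avoid the problem.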

ccNUMA Implementations

Stanford Dash

• Dash stands for Directory Architecture for Shared Memory.

• First to use directory-based cache coherence.

SGI Origin 2000 (Silicon Graphics Inc.)

• Can support up to 1024 processors.

• SGI claims it accounts for over 95% of worldwide shipments of ccNUMA-based systems.

IBM’s LA (Local Access) ccNUMA

