Non-uniform memory access (NUMA)

• Memory access time from a processor core to main memory is not uniform.

• Memory resides in separate regions called “NUMA domains.”

• For highest performance, cores should only access memory in their nearest NUMA domain.

Memory buses and controllers

• Pre-2008: Front side bus

• Post-2008: QuickPath interconnect

Front Side Bus (FSB) and Multicore

• FSB was used up to the Core microarch. (2008)

• Memory access by many cores or sockets across the FSB is a bottleneck

Hypertransport & QuickPath Interconnect

• AMD HyperTransport (2003); Intel QuickPath Interconnect (2008)

• Memory controller now on the CPU

• Memory access is now non-uniform

FSB vs QPI

UMA = Uniform Memory Access

SMP = Symmetric MultiProcessor

NUMA = Non-Uniform Memory Access

Non-Uniform Memory Access (NUMA)

• NUMA architectures support higher aggregate bandwidth to memory than UMA architectures

• “Trade-off” is non-uniform memory access

• Can NUMA effects be observed?

• Locality domain: a set of processor cores and the memory locally connected to them (a.k.a. locality “node”)

• View the NUMA structure (on Linux): numactl --hardware (example output below)

numactl can also be used to control NUMA behavior
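
For illustration, on a hypothetical two-socket machine with four cores per socket, numactl --hardware might print something like the following (CPU numbers, sizes, and distances are illustrative, not taken from a real system):

  available: 2 nodes (0-1)
  node 0 cpus: 0 1 2 3
  node 0 size: 32768 MB
  node 0 free: 30504 MB
  node 1 cpus: 4 5 6 7
  node 1 size: 32768 MB
  node 1 free: 31021 MB
  node distances:
  node   0   1
    0:  10  21
    1:  21  10

The distance table indicates that access to the remote domain is more expensive than access to the local one (21 vs. 10 in this example).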

NUMA

• How does the OS decide where to put data?

• How does the OS decide where to place the threads?

• Ref: Hager and Wellein, Sec 4.2, Chap 8, App A

• Ref: Xeon Phi book, Sec 4.4.5

Page placement by first touch

• A page is placed in the locality domain of the processor that first touches it, not when the memory is allocated (see the sketch after this list)

• If there is no free memory in that locality domain, another domain is used

• Other policies are possible (e.g., set preferred locality domain, or round-robin, using numactl)
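
A minimal sketch (in C with OpenMP; the array name and size are illustrative) of how first touch is typically exploited: the array is initialized in parallel with the same loop schedule that the compute phase uses, so each thread faults in, and thereby places, the pages it will later access.

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      const long n = 1L << 26;            /* illustrative array length */
      double *a = malloc(n * sizeof *a);  /* allocation only reserves virtual pages */

      /* First touch: each thread initializes the part of the array it will
         use later, so those pages land in that thread's locality domain. */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++)
          a[i] = 0.0;

      /* Compute phase with the same static schedule reuses local pages. */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++)
          a[i] = 2.0 * a[i] + 1.0;

      printf("a[0] = %f\n", a[0]);
      free(a);
      return 0;
  }

If the array were instead initialized serially (e.g., by the master thread), all of its pages would land in one locality domain and every thread would contend for that domain's memory controller.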

numactl – set memory affinity

numactl [--interleave=nodes] [--preferred=node]

[--membind=nodes] [--localalloc] <command> [args] ...

numactl --hardware

numactl --show

numactl --interleave=0,1 <command>
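
For example (program name illustrative), numactl --interleave=0,1 ./a.out spreads the program’s pages round-robin across locality domains 0 and 1; this can help codes whose access pattern has no clear per-thread locality, at the cost of making every access partly remote.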

Thread affinity

• Threads can be pinned to cores (otherwise they may move to any core)

– A thread that migrates to a different core usually will no longer have its data in cache (note, some subset of cores may share some levels of cache)

– A thread that migrates to a different locality domain leaves its data behind in the original domain, so its memory accesses become remote

• For programs compiled with Intel compilers, specify affinity at runtime using KMP_AFFINITY environment variable
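
A minimal sketch (assuming Linux, glibc’s sched_getcpu, and an OpenMP compiler) for checking where threads actually run; launching it under different KMP_AFFINITY settings shows how the placement changes.

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <sched.h>   /* sched_getcpu() */
  #include <omp.h>

  int main(void)
  {
      #pragma omp parallel
      {
          /* Each OpenMP thread reports the logical CPU it is currently running on. */
          printf("thread %2d of %2d on cpu %2d\n",
                 omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
      }
      return 0;
  }

For example (file name illustrative): icc -qopenmp check_affinity.c && KMP_AFFINITY=granularity=fine,compact OMP_NUM_THREADS=8 ./a.out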

KMP_AFFINITY environment variable

2 sockets, 4 cores/socket

compact: threads 0 1 2 3 go to the cores of socket 0 and threads 4 5 6 7 to the cores of socket 1 (compact might not use all sockets)

scatter: threads 0 2 4 6 go to the cores of socket 0 and threads 1 3 5 7 to the cores of socket 1

KMP_AFFINITY environment variable

2 sockets, 4 cores/socket, 2-way hyperthreading

granularity=core,compact: thread pairs (0,1) (2,3) (4,5) (6,7) go to the cores of socket 0 and (8,9) (10,11) (12,13) (14,15) to the cores of socket 1

granularity=core,scatter: thread pairs (0,8) (2,10) (4,12) (6,14) go to the cores of socket 0 and (1,9) (3,11) (5,13) (7,15) to the cores of socket 1

With granularity=core, a thread can only float between the hardware thread contexts of its core

KMP_AFFINITY environment variable

2 sockets, 4 cores/socket, 2-way hyperthreading

granularity=fine,compact: each thread is bound to a single hardware thread; the cores of socket 0 hold threads (0,1) (2,3) (4,5) (6,7) and the cores of socket 1 hold (8,9) (10,11) (12,13) (14,15)

granularity=fine,scatter: the cores of socket 0 hold threads (0,8) (2,10) (4,12) (6,14) and the cores of socket 1 hold (1,9) (3,11) (5,13) (7,15)

Usually there is no need to bind threads to individual hardware thread contexts

KMP_AFFINITY on Intel Xeon Phi

4 cores and 4 HW threads per core. Note: “balanced” is only for Intel Xeon Phi.

Examples:

KMP_AFFINITY=“verbose,none” (print affinity information)

KMP_AFFINITY=“balanced”

KMP_AFFINITY=“granularity=fine,balanced” (bind to a single hardware thread)

KMP_AFFINITY=“granularity=core,balanced” (bind to any hardware thread on the core; core is the default granularity)

How to choose different options

• For bandwidth-bound codes, the optimum number of threads is often less than the number of cores. Distribute the threads and choose their number to reduce contention and fully utilize the memory controllers. (Use scatter; see the triad sketch after this list.)

• For compute-bound codes, use hyperthreading and all available hyperthreads. Threads operating on adjacent data should be placed on the same core (or nearby) in order to share cache. (Use compact.)

• On Intel Xeon Phi, “balanced” often works best. (Probably should have this option on CPUs!)

• Example: bench-dgemm on MIC
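
As a sketch of the bandwidth-bound case (array length is illustrative and write-allocate traffic is ignored in the bandwidth estimate), the STREAM-triad-style loop below can be timed with increasing OMP_NUM_THREADS under KMP_AFFINITY=scatter; the measured bandwidth typically saturates before all cores are in use.

  #include <stdio.h>
  #include <stdlib.h>
  #include <omp.h>

  int main(void)
  {
      const long n = 1L << 27;   /* large enough to exceed the caches */
      double *a = malloc(n * sizeof *a);
      double *b = malloc(n * sizeof *b);
      double *c = malloc(n * sizeof *c);

      /* First-touch initialization with the same schedule as the triad loop. */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

      double t = omp_get_wtime();
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++)
          a[i] = b[i] + 3.0 * c[i];
      t = omp_get_wtime() - t;

      /* Three arrays of 8-byte doubles are streamed per iteration. */
      printf("triad with %d threads: %.2f GB/s\n",
             omp_get_max_threads(), 3.0 * n * sizeof(double) / t / 1e9);

      free(a); free(b); free(c);
      return 0;
  }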

Intel Xeon Phi

Technically UMA, but access to L2 cache is nonuniform

Intel compiler optimization reports

• icc … -qopt-report=[0-5]

• icc … -qopt-report-phase=vec

• The compiler writes the optimization report to a file named after the source file, e.g. xx.optrpt
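
A small loop to try the report on (the file name saxpy.c and the flag values are illustrative):

  /* saxpy.c -- compile with, e.g.:
       icc -O2 -qopt-report=5 -qopt-report-phase=vec -c saxpy.c
     The vectorization report is then written to saxpy.optrpt. */
  void saxpy(int n, float alpha, const float *x, float *y)
  {
      int i;
      /* Simple unit-stride loop; the report states whether it was vectorized and why. */
      for (i = 0; i < n; i++)
          y[i] = alpha * x[i] + y[i];
  }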

Backup Figures

Ring Bus to Connect Cores and Caches/DRAM

Intel Xeon Phi

All current Intel CPUs use a ring bus (the alternative is an all-to-all switch)

Ring Bus on Haswell

Front Side Bus (FSB) vs. QuickPath Interconnect (QPI)

QuickPath Interconnect (NUMA: Non-Uniform Memory Access)