NUMA-aware Matrix-Matrix-Multiplication
Max Reimann, Philipp Otto
About this talk
• Objective: Show how to improve the performance of algorithms on a NUMA system, with MMM as an example
• Code was written in C with numa.h and pthread.h
• Tested on FSOC:
– ubuntu-0101: 2 nodes, 24 cores
– dl980: 8 nodes, 128 cores
• Compiled with gcc -O3
Naïve Matrix-Matrix-Multiplication
• We will examine MMM for large n × n matrices
• Runtime complexity: O(n^3)
Image: http://www.mathematrix.de/wp-content/uploads/matrixmul2.png
Naïve MMM implementation
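For reference, a minimal sketch of such a naive triple loop (the slide's code is not reproduced here; the function name and the row-major double layout are our assumptions):

#include <stddef.h>

/* Naive triple-loop MMM: C = A * B for row-major n x n matrices.
   A minimal sketch, not the slide's original code. */
void naive_mmm(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}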
Performance of Naive vs. MKL
n      Naive (s)   MKL (s)
512    0.38        0.02
1024   11.79       0.13
2048   98.14       1.02
(dl980 on one core)
Intel Math Kernel Library (MKL)
• BLAS: Basic Linear Algebra Subprograms
– The de-facto standard interface for linear algebra routines
• MKL:
– Implements BLAS for Intel hardware
– Vectorized and threaded for highest performance
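For illustration, a typical call into MKL through its CBLAS interface (not shown on the slide); it computes C = 1.0·A·B + 0.0·C for row-major n×n double matrices:

#include <mkl.h>

/* C = alpha * A * B + beta * C via MKL's CBLAS interface. */
void mkl_mmm(const double *A, const double *B, double *C, int n)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,      /* M, N, K       */
                1.0, A, n,    /* alpha, A, lda */
                B, n,         /* B, ldb        */
                0.0, C, n);   /* beta, C, ldc  */
}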
Analysis of Naïve MMM
• Test setup:
– Use the ubuntu-numa machine
– No thread or memory pinning
– Use numatop/pcm
• Performance tools show:
– Unused cores (obvious)
– QPI cannot be fully loaded with one thread
Parallelization I
• How can the work be divided?
– 1. Partition the computation of matrixC by rows or columns
• Problem: All threads need matrixA and matrixB
• Solution:
– Accept the overhead of remote memory access, or
– Copy the input/output matrices to the other nodes (preprocessing)
Parallelization – Partition by rows
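A minimal sketch of this row partitioning with pthread.h (the worker layout and chunking are our assumptions; every thread simply reads the shared A and B, i.e. remote accesses are accepted):

#include <pthread.h>
#include <stddef.h>

typedef struct {
    const double *A, *B;
    double *C;
    size_t n, row_begin, row_end;   /* half-open row range of C */
} task_t;

/* Each thread computes a contiguous block of rows of C. */
static void *worker(void *arg)
{
    task_t *t = arg;
    for (size_t i = t->row_begin; i < t->row_end; i++)
        for (size_t j = 0; j < t->n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < t->n; k++)
                sum += t->A[i * t->n + k] * t->B[k * t->n + j];
            t->C[i * t->n + j] = sum;
        }
    return NULL;
}

void parallel_mmm(const double *A, const double *B, double *C,
                  size_t n, size_t nthreads)
{
    pthread_t tid[nthreads];
    task_t task[nthreads];
    size_t chunk = (n + nthreads - 1) / nthreads;
    for (size_t t = 0; t < nthreads; t++) {
        size_t begin = t * chunk;
        size_t end = begin + chunk < n ? begin + chunk : n;
        task[t] = (task_t){ A, B, C, n, begin, end };
        pthread_create(&tid[t], NULL, worker, &task[t]);
    }
    for (size_t t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}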
Parallelization – Partition by rows
(dl980 on 128 cores, seconds)
n      Naive Sequential   Naive Parallel   MKL Parallel
512    0.38               0.05             0.19
1024   11.79              0.26             0.27
2048   98.14              2.54             0.28
Parallelization II
• How can the work be divided?
– 2. Partition the computation of matrixC by summands
• Benefit:
– For computing the i-th summand, only the i-th row of matrixA / column of matrixB is needed
– This allows copying only the needed parts to the other nodes
• Disadvantages:
– matrixB has to be transposed to be able to partition the memory (preprocessing)
– Locking or merging of matrixC is needed
Parallelization II
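A minimal sketch of the per-summand partial products (our naming; Bt is the transposed matrixB, so the inner loop reads both operands contiguously; merging the per-thread partial results into matrixC is left out):

#include <stddef.h>

/* Thread-local partial sum over the inner-index range [k0, k1):
   Cpart[i][j] = sum over k in [k0,k1) of A[i][k] * B[k][j],
   with Bt[j*n+k] == B[k*n+j]. The partials are merged into C later. */
static void partial_sum(const double *A, const double *Bt, double *Cpart,
                        size_t n, size_t k0, size_t k1)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = k0; k < k1; k++)
                sum += A[i * n + k] * Bt[j * n + k];
            Cpart[i * n + j] = sum;
        }
}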
Performance of "Parallel Sum" Method

(dl980 on 128 cores, seconds)
n      Parallel Sum   Naive Parallel   MKL Parallel
512    1.59           0.27             0.19
1024   2.81           1.41             0.27
2048   3.34           2.94             0.28
4096   14.91          17.24            0.43
8192   218.84         186.39           2.41
Strassen
• Runtime complexity of the naive algorithm: O(n^3)
• Can we do better?
– Strassen's algorithm, published in 1969, was the first to improve the asymptotic complexity
– Runtime O(n^(log2 7)) ≈ O(n^2.8)
• Algorithms today reach roughly O(n^2.37), but are not practical
– Strassen uses only 7 multiplications instead of 8 per recursion step
Matrix definition
For matrices A, B, C with dimension n = 4k, k ∈ ℕ, A, B, C can be viewed as 2×2 block matrices:

A = [A11 A12; A21 A22],  B = [B11 B12; B21 B22],  C = [C11 C12; C21 C22]

The conventional algorithm uses 8 (expensive) multiplications:

C11 = A11·B11 + A12·B21    C12 = A11·B12 + A12·B22
C21 = A21·B11 + A22·B21    C22 = A21·B12 + A22·B22
Strassen’s algorithm
Define temporary matrices:

M1 := (A11 + A22) · (B11 + B22)
M2 := (A21 + A22) · B11
M3 := A11 · (B12 − B22)
M4 := A22 · (B21 − B11)
M5 := (A11 + A12) · B22
M6 := (A21 − A11) · (B11 + B12)
M7 := (A12 − A22) · (B21 + B22)

Compose the final matrix:

C11 = M1 + M4 − M5 + M7    C12 = M3 + M5
C21 = M2 + M4              C22 = M1 − M2 + M3 + M6

Only 7 multiplications!
Strassen - Example
Substituting the Mi by their terms gives back the original formula, e.g.:

C12 = M3 + M5
    = A11 · (B12 − B22) + (A11 + A12) · B22
    = A11·B12 − A11·B22 + A11·B22 + A12·B22
    = A11·B12 + A12·B22
Strassen - Analysis
• Cost: 7 multiplications and 18 additions per step
– versus 8 multiplications and 4 additions for the naïve scheme
• Commonly considered practical only for large matrices (n > 1000)
– although our results indicate otherwise (see later)
• Define a cutoff point for the recursion:
– if n is sufficiently small, do the naïve multiplication
Strassen - Implementation
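A compact sketch of the recursion with a cutoff, assuming n is a power of two and the naive_mmm() shown earlier; quadrants are copied into contiguous buffers for clarity rather than speed:

#include <stdlib.h>
#include <string.h>

#define BREAK 64   /* cutoff: below this, fall back to the naive kernel */

void naive_mmm(const double *A, const double *B, double *C, size_t n);

static void madd(const double *X, const double *Y, double *Z, size_t m)
{ for (size_t i = 0; i < m * m; i++) Z[i] = X[i] + Y[i]; }

static void msub(const double *X, const double *Y, double *Z, size_t m)
{ for (size_t i = 0; i < m * m; i++) Z[i] = X[i] - Y[i]; }

/* copy quadrant (bi,bj) of the n x n matrix X into a contiguous h x h block */
static void get_q(const double *X, double *Q, size_t n, int bi, int bj)
{
    size_t h = n / 2;
    for (size_t i = 0; i < h; i++)
        memcpy(Q + i * h, X + (bi * h + i) * n + bj * h, h * sizeof *Q);
}

static void put_q(double *X, const double *Q, size_t n, int bi, int bj)
{
    size_t h = n / 2;
    for (size_t i = 0; i < h; i++)
        memcpy(X + (bi * h + i) * n + bj * h, Q + i * h, h * sizeof *Q);
}

void strassen(const double *A, const double *B, double *C, size_t n)
{
    if (n <= BREAK) { naive_mmm(A, B, C, n); return; }
    size_t h = n / 2, sz = h * h;
    double *w = malloc(17 * sz * sizeof *w);   /* 8 quadrants, 2 temps, 7 Ms */
    double *a11 = w,        *a12 = w + sz,    *a21 = w + 2*sz, *a22 = w + 3*sz;
    double *b11 = w + 4*sz, *b12 = w + 5*sz,  *b21 = w + 6*sz, *b22 = w + 7*sz;
    double *t1  = w + 8*sz, *t2  = w + 9*sz,  *M   = w + 10*sz;
    double *M1 = M, *M2 = M+sz, *M3 = M+2*sz, *M4 = M+3*sz,
           *M5 = M+4*sz, *M6 = M+5*sz, *M7 = M+6*sz;
    get_q(A, a11, n, 0, 0); get_q(A, a12, n, 0, 1);
    get_q(A, a21, n, 1, 0); get_q(A, a22, n, 1, 1);
    get_q(B, b11, n, 0, 0); get_q(B, b12, n, 0, 1);
    get_q(B, b21, n, 1, 0); get_q(B, b22, n, 1, 1);
    madd(a11, a22, t1, h); madd(b11, b22, t2, h); strassen(t1, t2, M1, h);
    madd(a21, a22, t1, h);                        strassen(t1, b11, M2, h);
    msub(b12, b22, t2, h);                        strassen(a11, t2, M3, h);
    msub(b21, b11, t2, h);                        strassen(a22, t2, M4, h);
    madd(a11, a12, t1, h);                        strassen(t1, b22, M5, h);
    msub(a21, a11, t1, h); madd(b11, b12, t2, h); strassen(t1, t2, M6, h);
    msub(a12, a22, t1, h); madd(b21, b22, t2, h); strassen(t1, t2, M7, h);
    madd(M1, M4, t1, h); msub(t1, M5, t1, h); madd(t1, M7, t1, h); put_q(C, t1, n, 0, 0);
    madd(M3, M5, t1, h);                                           put_q(C, t1, n, 0, 1);
    madd(M2, M4, t1, h);                                           put_q(C, t1, n, 1, 0);
    msub(M1, M2, t1, h); madd(t1, M3, t1, h); madd(t1, M6, t1, h); put_q(C, t1, n, 1, 1);
    free(w);
}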
Execution Time: Single-threaded
(dl980 on 1 core; Strassen recursion cutoff BREAK = 64; seconds)
n      Naive   Strassen   MKL
32     0.00    0.00       0.00
64     0.00    0.00       0.00
128    0.01    0.00       0.00
256    0.05    0.02       0.00
512    0.38    0.12       0.02
1024   11.79   0.87       0.13
2048   98.14   6.12       1.02
Parallelization of Strassen I
• Data dependencies:
– The additions inside the Mi have to be done before the multiplication
• e.g. M1 = (A11 + A22) · (B11 + B22)
– All Mi have to be calculated before calculating C
• e.g. C12 = M3 + M5
• Easiest solution (sketched below):
– Calculate the Mi in parallel
– Then calculate the Ci,j in parallel
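A minimal sketch of this two-phase scheme with pthread.h (the mi_func workers are hypothetical wrappers around the additions and the recursive multiplication of one Mi each; the real code uses 49 distinct functions over two levels):

#include <pthread.h>

void strassen_level1_parallel(void *(*mi_func[7])(void *), void *args[7])
{
    pthread_t tid[7];
    for (int i = 0; i < 7; i++)        /* phase 1: compute M1..M7 in parallel */
        pthread_create(&tid[i], NULL, mi_func[i], args[i]);
    for (int i = 0; i < 7; i++)
        pthread_join(tid[i], NULL);    /* barrier: every Ci,j needs the Mi */
    /* phase 2: the four Ci,j combinations can now also run in parallel */
}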
Parallelization of Strassen II
• Recursion level 1 can be scheduled to 7 threads; level n can be scheduled to 7^n threads
– Most systems have a power-of-two number of processors
• We used manual parallelization
– 49 distinct functions for the Ms and 16 for the Cs
– Code bloat and not scalable, BUT:
• Automatic parallelization is hard
– Thread load becomes very unbalanced
– Every level needs 7 temporary matrices
• Exponentially rising memory requirements
Execution Time – 49 Threads
(dl980 on 49 cores, seconds)
n      Naive   Strassen   MKL
512    0.05    0.05       0.19
1024   0.26    0.14       0.27
2048   2.54    0.49       0.28
4096   27.61   2.06       0.44
8192   228.57  13.53      1.84
NUMA-Optimizations
• Try to keep as much memory local as possible, to avoid remote memory access
– because remote access is slower by a factor of ~1.4
• Partition data and work depending on #nodes and #cores
• Pin threads to the nodes holding the memory they need
• (Consider the topology for other algorithms)
Distributing memory and threads
(Parallel naive on ubuntu-numa0101 on 24 cores, seconds)
n      Memory and threads distributed   Neither distributed   Threads distributed   Memory distributed
1024   0.34                             0.35                  0.35                  0.37
2048   11.39                            18.34                 21.96                 14.33
4096   101.12                           182.45                204.85                143.44
DEMO
Application of NUMA-Optimizations
• Copy all data to every node
– Duration of preprocessing: 11.11 s for an 8192×8192 matrix to 8 nodes
• Partition the data and move it to the corresponding nodes
– Duration of preprocessing: 1.03 s for an 8192×8192 matrix to 8 nodes
• Pin threads to nodes (sketched below)
– int numa_run_on_node(int node);
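A minimal sketch of the corresponding libnuma calls (the wrapper names are ours; numa_run_on_node and numa_alloc_onnode are the actual numa.h APIs):

#include <numa.h>
#include <stddef.h>

/* Allocate a block of doubles directly on a given NUMA node;
   free it later with numa_free(ptr, count * sizeof(double)). */
double *alloc_block_on_node(size_t count, int node)
{
    return numa_alloc_onnode(count * sizeof(double), node);
}

/* Pin the calling thread to the CPUs of its node before computing. */
void pin_to_node(int node)
{
    numa_run_on_node(node);
}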
Parallelization – Partition by rows: Copying memory to different nodes
Strassen Memory Distribution Effects
(Dimension: 16384; dl980 on 128 cores; total seconds, composed of memory copy, multiplication, and result combination)
6            22.147083
7            19.611477
8            21.316545
distributed  14.671332
Other optimization techniques
• Tiling
• Vectorization
• Scalar replacement
• Precomputation of constants
• (unrolling)
Tiling
• Divide the computational work into tiles to leverage the cache (sketched below)
• Tile size depends on the cache size:
gcc -DCLS=$(getconf LEVEL1_DCACHE_LINESIZE)
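A sketch of the resulting tiled loop nest in the spirit of the classic CLS-based version (our assumptions: n is a multiple of SM and C is zero-initialized; compile with the -DCLS line above):

#include <stddef.h>

#define SM (CLS / sizeof(double))   /* doubles per L1 cache line */

void tiled_mmm(const double *A, const double *B, double *C, size_t n)
{
    for (size_t i = 0; i < n; i += SM)
        for (size_t j = 0; j < n; j += SM)
            for (size_t k = 0; k < n; k += SM)
                /* multiply one SM x SM tile of A with one tile of B */
                for (size_t i2 = i; i2 < i + SM; i2++)
                    for (size_t k2 = k; k2 < k + SM; k2++) {
                        double a = A[i2 * n + k2];
                        for (size_t j2 = j; j2 < j + SM; j2++)
                            C[i2 * n + j2] += a * B[k2 * n + j2];
                    }
}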
Performance of Tiling
perf stat -e L1-dcache-misses,LLC-misses,DTLB-misses bin/matrixmult -n 2048

(dl980 on 128 cores, execution time in seconds)
Not tiled, not transposed: 97
Not tiled, transposed:     39
Tiled, not transposed:     13
Tiled, transposed:         12

(The accompanying chart also showed the corresponding cache-miss counts on a log scale up to ~8.6·10^9.)
Vectorization
• SIMD: Single Instruction, Multiple Data
• All recent Intel and AMD processors have Streaming SIMD Extensions (SSE)
• An instruction is simultaneously applied to multiple floats
• Can only operate efficiently on aligned data (16-byte aligned)
• SSE operates on 128-bit registers
– Newer Intel processors have Advanced Vector Extensions (AVX) with 256 bits
– The dl980 machine only supports 128-bit operations
Auto-Vectorization
• Can this be done automatically?
– gcc -O3 tries to auto-vectorize
– but this is only possible for simple statements (see the example below)
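For illustration (our example, not the slide's): a statement this simple is vectorized by gcc -O3, while the strided column accesses in the naive MMM inner loop defeat the auto-vectorizer.

#include <stddef.h>

/* gcc -O3 turns this loop into SSE instructions; 'restrict' tells the
   compiler the arrays do not overlap, which auto-vectorization needs. */
void vec_add(float *restrict z, const float *restrict x,
             const float *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        z[i] = x[i] + y[i];
}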
Assembler
Aligned Malloc
Example:
• numa_alloc returns addr 0x1232, which is not 16-byte aligned
• We add 15, so addr = 0x1241 = 0b1001001000001
• Now we clear the last 4 bits by ANDing with ~0x0f (= 0xfff0)
• The result 0x1240 is now 16-byte aligned
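The same trick as a code sketch (the helper is ours; note that the unaligned original pointer would have to be kept for freeing, which is omitted here):

#include <numa.h>
#include <stdint.h>

/* Over-allocate by 15 bytes, round up, and mask off the low 4 bits
   to obtain a 16-byte aligned pointer, as in the example above. */
void *alloc_aligned16(size_t bytes)
{
    void *p = numa_alloc(bytes + 15);
    return (void *)(((uintptr_t)p + 15) & ~(uintptr_t)0x0f);
}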
Intrinsics
Example (source: https://software.intel.com/sites/landingpage/IntrinsicsGuide/)
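A minimal intrinsics example in the style of the guide (ours, not the slide's): load four aligned floats, multiply element-wise, store the result.

#include <xmmintrin.h>   /* SSE intrinsics */

void mul4(float *dst, const float *a, const float *b)
{
    __m128 va = _mm_load_ps(a);    /* requires 16-byte aligned input */
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(dst, _mm_mul_ps(va, vb));
}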
Use Parallelism for MMM
• We try to construct a 4×4 matrix multiplication
• How to process the rows?
– The rows lie in contiguous memory
– A column can't be loaded in one instruction
Use parallelism for MMM
• We try to construct a 4×4 matrix multiplication
• How to process the rows?
• Idea: process all elements of a row of B in parallel
– Row 1 of C = A11 · (B11 B12 B13 B14) + A12 · (row 2 of B) + A13 · (row 3 of B) + A14 · (row 4 of B)
– Add up the results
4x4 Kernel
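A sketch of such a 4×4 kernel with SSE intrinsics, following the broadcast-row idea above (the function name is ours; A, B, C are row-major, single precision, and 16-byte aligned):

#include <xmmintrin.h>   /* SSE intrinsics */

/* C = A * B for 4x4 float matrices: for every row i of A, broadcast
   A[i][k], multiply it with row k of B, and accumulate the four sums. */
void mmm_kernel_4x4(const float *A, const float *B, float *C)
{
    for (int i = 0; i < 4; i++) {
        __m128 acc = _mm_mul_ps(_mm_set1_ps(A[i * 4]), _mm_load_ps(B));
        for (int k = 1; k < 4; k++) {
            __m128 a = _mm_set1_ps(A[i * 4 + k]);  /* broadcast A[i][k] */
            __m128 b = _mm_load_ps(B + k * 4);     /* load row k of B   */
            acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
        }
        _mm_store_ps(C + i * 4, acc);              /* store row i of C  */
    }
}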
SSE – Single Threaded
(dl980 on 1 core, seconds)
n          1024   2048   4096
naiveSSE   0.27   2      20
tiledSSE   0.48   5      41
tiled      2      24     213
naive      11     97     879
Cache Misses of SSE Variants
(Chart: L1 cache misses and dTLB misses for naiveSSE vs. tiledSSE, on a scale up to 6·10^9.)
Performance for Small Matrices
(Chart: execution time in seconds for naiveSSE, tiled, strassen, and MKL at n = 64, 128, 256, 512; all below 0.25 s; dl980 on 128 cores.)
Performance for Large Matrices
(Chart: execution time in seconds for naiveSSE, tiled, strassenSSE, and MKL at n = 1024, 2048, 4096, 8192; dl980 on 128 cores. strassenSSE: 0.17, 0.34, 1.20, 5.09; MKL: 0.20, 0.39, 0.53, 1.94; the naiveSSE and tiled series range from 0.79 up to 28.3.)
Summary
• Analyze the algorithm for bottlenecks
– IO optimization
– Hardware-specific optimization
• Cache size
• NUMA architecture
• Specific instructions (SSE)
• Try to minimize remote memory access
• Visualisations can facilitate understanding