[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bauer, Stanford)
Transcript of [Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bauer, Stanford)
Programming the Memory Hierarchy with Sequoia
Michael Bauer, Alex Aiken
1
Stanford University
Outline
- The Sequoia Programming Model
- Compiling Sequoia
- Automated Tuning with Sequoia
- Performance Results
- Extensions for Irregular Parallelism (Time Permitting)
- Areas of Future Research
2
The Sequoia Programming Model
3
Sequoia
4
Language: stream programming for machines with deep memory hierarchies
Idea: expose abstract memory hierarchy to the programmer
Implementation: benchmarks run well on many multi-level machines
SMP, CMP, Cluster of CMPs, GPU, Cluster of GPUs, Disk
The key challenge in high performance programming is:
communication (not parallelism)
Latency and bandwidth both come down to LOCALITY.
5
Streaming
6
Streaming involves structuring algorithms as collections of independent [locality cognizant] computations with well defined working sets.
This structuring can be done at many scales:
- Keep temporaries in registers
- Cache/scratchpad blocking
- Message passing on a cluster
- Out-of-core algorithms
Streaming
7
Streaming involves structuring algorithms as collections of independent [locality cognizant] computations with well defined working sets.
Efficient programs exhibit this structure at many scales.
8
- Facilitate development of hierarchy-aware stream programs
- Provide constructs that can be implemented efficiently:
  - Place computation and data in the machine
  - Explicit parallelism and communication
  - Large bulk transfers
Locality in Programming Languages
9
- Local (private) vs. global (remote) addresses: UPC, Titanium
- Domain distributions (map array elements to locations): HPF, UPC, ZPL, X10, Fortress, Chapel
- Focus on communication between nodes; ignore the hierarchy within a node
Locality in Programming Languages
10
- Streams and kernels: stream data off chip, kernel data on chip. StreamC/KernelC, BrookGPU, shading languages (Cg, HLSL)
- Architecture specific; only represent two levels (except CUDA and PMH)
Abstract Machine Model
- A tree of independent address spaces
- Each level is progressively smaller, but computationally more powerful
- Arbitrary branching factor
- Arbitrary number of levels
11
Hierarchical Memory
Real machines as trees of memories.
[Figure: a dual-core PC as a tree (main memory → shared L2 cache → per-core L1 caches → ALUs) beside a 4-node cluster of PCs (aggregate cluster memory as a virtual level → per-node memory → L2 cache → L1 cache → ALUs).]
12
Hierarchical Memory
13
[Figure: a single GPU as a tree: CPU main memory → GPU main memory → per-multiprocessor shared memory → warps with registers.]
Single GPU
Hierarchical Memory
14
[Figure: a tree: aggregate cluster memory (virtual level) → CMP main memories → GPU main memories → per-multiprocessor shared memories → warps with registers.]
MPI Cluster of CMPs w/ GPU Accelerators
Example: Blocked Matrix Multiply
void matmul_L1( int M, int N, int T, float* A, float* B, float* C )
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];

C += A x B
[Diagram: matmul_L1 performs a 32x32 matrix multiply on blocks of A, B, and C.]
15
Example: Blocked Matrix Multiply
C += A x B
void matmul_L2( int M, int N, int T, float* A, float* B, float* C )
  Perform a series of L1 matrix multiplications.
[Diagram: matmul_L2 (256x256 matrix mult) decomposes into many matmul_L1 (32x32 matrix mult) calls on blocks of A, B, and C.]
16
Example: Blocked Matrix Multiply
void matmul( int M, int N, int T, float* A, float* B, float* C )
  Perform a series of L2 matrix multiplications.
[Diagram: matmul (large matrix mult) decomposes into matmul_L2 (256x256) calls, each of which decomposes into matmul_L1 (32x32) calls.]
17
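The two-level blocking above can be sketched in plain C. This is a minimal illustration, not Sequoia code: `matmul_leaf`, `matmul_blocked`, the tile size `B`, and the leading-dimension parameter `ld` are all names invented here, and `B` stands in for the slide's 32 and 256 block sizes.

```c
#include <stddef.h>

/* Innermost "L1" kernel: plain triple loop on a small tile.
   Matrices are row-major with leading dimension ld. */
static void matmul_leaf(int M, int N, int T,
                        const float *A, const float *B, float *C,
                        size_t ld) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < T; k++)
                C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
}

/* Outer level: walk the matrices in BxB tiles and run the leaf
   kernel on each tile triple. B plays the role of a tunable;
   this sketch assumes M, N, T are multiples of B. */
static void matmul_blocked(int M, int N, int T, int B,
                           const float *A, const float *Bm, float *C,
                           size_t ld) {
    for (int i = 0; i < M; i += B)
        for (int j = 0; j < N; j += B)
            for (int k = 0; k < T; k += B)
                matmul_leaf(B, B, B,
                            A  + (size_t)i * ld + k,
                            Bm + (size_t)k * ld + j,
                            C  + (size_t)i * ld + j,
                            ld);
}
```

Nesting another tile loop around `matmul_blocked` gives the third level shown on the slide; each level just re-blocks the tiles of the level above.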
Sequoia Tasks
Special functions called tasks are the building blocks of Sequoia programs.

task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] )
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
18
Sequoia Tasks
- Task arguments and temporaries define a working set
- A task's working set is resident at a specific location in the abstract machine model
- Tasks are assigned locations in the memory hierarchy
- Maintain call-by-value-result (CBVR) semantics

task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] )
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
19
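The CBVR semantics above can be illustrated with a small C sketch: the child works on a private copy of its arguments, and only the copy-out step makes its `inout` results visible to the parent. All names here (`scale_leaf`, `run_task_cbvr`) and the fixed 64-element buffer are hypothetical, chosen only for the sketch.

```c
#include <string.h>

/* A trivial "leaf task": scale a block in place. */
static void scale_leaf(int n, float s, float *block) {
    for (int i = 0; i < n; i++)
        block[i] *= s;
}

/* Call-by-value-result: copy the argument into a private
   child-level buffer (copy-in), compute there, and copy the
   inout result back to the parent on completion (copy-out). */
static void run_task_cbvr(int n, float s, float *parent_data) {
    float child_copy[64];                                /* child address space, n <= 64 */
    memcpy(child_copy, parent_data, n * sizeof(float));  /* copy-in  */
    scale_leaf(n, s, child_copy);                        /* compute  */
    memcpy(parent_data, child_copy, n * sizeof(float));  /* copy-out */
}
```

An `in`-only argument would skip the copy-out; an `out` argument would skip the copy-in.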
Task Hierarchies
task matmul::inner( in float A[M][T], in float B[T][N], inout float C[M][N] )
  tunable int P, Q, R;
  Recursively call the matmul task on submatrices of A, B, and C of size PxQ, QxR, and PxR.

task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] )
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];
20
Task Hierarchies
task matmul::inner( in float A[M][T], in float B[T][N], inout float C[M][N] )
  tunable int P, Q, R;
  mappar( int i = 0 to M/P, int j = 0 to N/R )
    mapseq( int k = 0 to T/Q )
      matmul( A[P*i:P*(i+1);P][Q*k:Q*(k+1);Q],
              B[Q*k:Q*(k+1);Q][R*j:R*(j+1);R],
              C[P*i:P*(i+1);P][R*j:R*(j+1);R] );

task matmul::leaf( in float A[M][T], in float B[T][N], inout float C[M][N] )
  for (int i = 0; i < M; i++)
    for (int j = 0; j < N; j++)
      for (int k = 0; k < T; k++)
        C[i][j] += A[i][k] * B[k][j];

Variant call graph: matmul::inner → matmul::leaf
21
Summary: Sequoia Tasks
A single abstraction for:
- Isolation/parallelism
- Explicit communication/working sets
- Expressing locality
Sequoia programs describe hierarchies of tasks:
- Mapped onto the memory hierarchy
- Parameterized for portability
22
Compiling Sequoia
23
Sequoia Compiler
- Source-to-source compilation
- Three inputs: a source file, a machine file, and a mapping file
- Compilation works on hierarchical programs
- Many standard optimizations, done at all levels of the hierarchy, which greatly increases the leverage of each optimization

source.sq + source.mp + machine.m → Sequoia Compiler (sq++) → machine-specific source code
24
Inter-Level Copy Elimination (1)
25
Copy elimination near the root removes not one instruction, but thousands/millions
[Diagram: a two-hop transfer A → B → C between memory levels Mi+1 and Mi is collapsed into a single copy A → C.]
Inter-Level Copy Elimination (2)
26
Copy elimination near the root removes not one instruction, but thousands/millions
[Diagram: the symmetric case; the intermediate buffer and its extra copy are eliminated, leaving a single copy between levels Mi and Mi+1.]
Software Pipelining
27
Scheduling:
- Prefetch a batch of data
- Compute on the data
- Initiate write of results
Overlap communication and computation.
[Timeline: the read of input i+1 and the write of output i-1 proceed concurrently with the compute on block i.]
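The double-buffered schedule above can be sketched in C. This is only a sketch of the issue order, assuming a synchronous `memcpy` stands in for the runtime's asynchronous transfers (a real implementation would overlap the copies with the compute, e.g. via DMA); `pipeline_blocks`, `BLK`, and the prefix-sum "compute" are all invented for illustration.

```c
#include <string.h>

enum { BLK = 4 };

/* Process nblk blocks of `in`, writing each block's prefix sums to
   `out`, bouncing between two buffers: the read of block b+1 is
   issued before the compute on block b, matching the slide's
   read / compute / write pipeline. */
static void pipeline_blocks(const float *in, float *out, int nblk) {
    float buf[2][BLK];
    int cur = 0;
    memcpy(buf[cur], in, sizeof buf[0]);                  /* read input 0 */
    for (int b = 0; b < nblk; b++) {
        if (b + 1 < nblk)                                 /* read input b+1 */
            memcpy(buf[1 - cur], in + (b + 1) * BLK, sizeof buf[0]);
        float acc = 0.0f;                                 /* compute on block b */
        for (int i = 0; i < BLK; i++) {
            acc += buf[cur][i];
            buf[cur][i] = acc;
        }
        memcpy(out + b * BLK, buf[cur], sizeof buf[0]);   /* write output b */
        cur = 1 - cur;                                    /* swap buffers */
    }
}
```

With asynchronous copies, the `b`-th compute runs while the `(b+1)`-th read and `(b-1)`-th write are in flight, which is exactly the overlap the slide's timeline shows.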
Sequoia Runtime
- Uniform scheme for explicitly describing memory hierarchies
- Captures common traits important for performance; allows composition of memory hierarchies
- Simple, portable API for many parallel machines: mechanism independence for communication and management of parallel resources
- SMP, CMP, MPI cluster, CUDA, disk, Cell (deprecated), scalar (debugging), OpenCL (future)
28
Graphical Runtime Representation
[Diagram: the runtime sits between Memory/CPU Level i+1 and its N child Memory/CPU pairs at Level i.]
29
Runtime Design
- Uniform API to support many devices
- Manages basic program tasks: data allocation and naming, setup of parallel resources, synchronization
- Greatly simplifies and modularizes the implementation:
  - The compiler generates code for one API, not many machines
  - Each runtime is isolated from all others
  - Runtimes can be implemented separately and composed freely
- Makes some basic assumptions:
  - Software has control over memory resources
  - Persistence of data for software-controlled memory
30
Automatic Tuning
31
Autotuner
32
Many parameters to tune:
- Sequoia codes are parameterized by tunables
- Choice of task variants at different call sites
The tuning framework sets these parameters:
- Search-based
- The programmer defines the search space
Software-Managed Memory
33
[Plot: the performance landscape over tunable values is smooth, with high-frequency components due to alignment.]
Hierarchical Search
34
[Diagram: machine levels M0, M1, M2; the search proceeds bottom-up, with a set of tunables per level (S0, S1, ...).]
Search Algorithm
35
A pyramid search: a greedy search is performed at each level. It achieves good performance quickly because the space is smooth. Start with a coarse grid, then refine the grid spacing when no further progress can be made.
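The coarse-to-fine greedy search can be sketched for a single integer tunable. Everything here is illustrative: `pyramid_search` is not Sequoia's tuner, and `quad_cost` is a stand-in for a real timing measurement being minimized.

```c
/* Greedy coarse-to-fine search: probe neighbors at the current grid
   spacing, move while the cost strictly improves, and halve the
   spacing when no further progress can be made. */
static int pyramid_search(double (*cost)(int), int x, int spacing,
                          int lo, int hi) {
    while (spacing >= 1) {
        int moved = 1;
        while (moved) {
            moved = 0;
            int cand[2] = { x - spacing, x + spacing };
            for (int i = 0; i < 2; i++) {
                int c = cand[i];
                if (c >= lo && c <= hi && cost(c) < cost(x)) {
                    x = c;        /* strict improvement: cannot cycle */
                    moved = 1;
                }
            }
        }
        spacing /= 2;             /* refine the grid */
    }
    return x;
}

/* Example objective with a single smooth minimum at x = 37. */
static double quad_cost(int x) { double d = x - 37; return d * d; }
```

On the smooth-but-noisy landscapes from the previous slide, the coarse grid steps over the high-frequency alignment ripples and the fine grids then localize the optimum.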
Performance Results
36
Sequoia Benchmarks
37
- Linear Algebra: BLAS Level 1 SAXPY, Level 2 SGEMV, and Level 3 SGEMM
- Conv2D: 2D single-precision convolution with 9x9 support (non-periodic boundary constraints)
- FFT3D: complex single-precision FFT
- Gravity: 100 time steps of an N-body (N^2) stellar dynamics simulation, single precision
- HMMER: fuzzy protein string matching using HMM evaluation (Horn et al., SC2005)
Single Runtime Configurations
38
- Scalar: 2.4 GHz Intel Pentium 4 Xeon, 1 GB
- 8-way SMP: 4 dual-core 2.66 GHz Intel P4 Xeons, 8 GB
- Disk: 2.4 GHz Intel P4, 160 GB disk, ~50 MB/s from disk
- Cluster: 16 Intel 2.4 GHz P4 Xeons, 1 GB/node, Infiniband interconnect (780 MB/s)
- Cell: 3.2 GHz IBM Cell blade (1 Cell, 8 SPE), 1 GB
- PS3: 3.2 GHz Cell in Sony Playstation 3 (6 SPE), 256 MB (160 MB usable)
Single Runtime Configurations - GFLOPS
39
| Benchmark | Scalar | SMP | Disk | Cluster | Cell | PS3 |
|---|---|---|---|---|---|---|
| SAXPY | 0.3 | 0.7 | 0.007 | 4.9 | 3.5 | 3.1 |
| SGEMV | 1.1 | 1.7 | 0.04 | 12 | 12 | 10 |
| SGEMM | 6.9 | 45 | 5.5 | 91 | 119 | 94 |
| CONV2D | 1.9 | 7.8 | 0.6 | 24 | 85 | 62 |
| FFT3D | 0.7 | 3.9 | 0.05 | 5.5 | 54 | 31 |
| GRAVITY | 4.8 | 40 | 3.7 | 68 | 97 | 71 |
| HMMER | 0.9 | 11 | 0.9 | 12 | 12 | 7.1 |
SGEMM Performance
40
Cluster:
- Intel Cluster MKL: 101 GFlop/s
- Sequoia: 91 GFlop/s
SMP:
- Intel MKL: 44 GFlop/s
- Sequoia: 45 GFlop/s
FFT3D Performance
41
Cell:
- Mercury Computer: 58 GFlop/s
- FFTW 3.2 alpha 2: 35 GFlop/s
- Sequoia: 54 GFlop/s
Cluster:
- FFTW 3.2 alpha 2: 5.3 GFlop/s
- Sequoia: 5.5 GFlop/s
SMP:
- FFTW 3.2 alpha 2: 4.2 GFlop/s
- Sequoia: 3.9 GFlop/s
Best Known Implementations
42
HMMer:
- ATI X1900XT: 9.4 GFlop/s (Horn et al. 2005)
- Sequoia Cell: 12 GFlop/s
- Sequoia SMP: 11 GFlop/s
Gravity:
- Grape-6A: 2 billion interactions/s (Fukushige et al. 2005)
- Sequoia Cell: 4 billion interactions/s
- Sequoia PS3: 3 billion interactions/s
Multi-Runtime System Configurations
43
- Cluster of SMPs: four 2-way 3.16 GHz Intel Pentium 4 Xeons connected via GigE (80 MB/s peak)
- Disk + PS3: Sony Playstation 3 bringing data from disk (~30 MB/s)
- Cluster of PS3s: two Sony Playstation 3s connected via GigE (60 MB/s peak)
SMP vs. Cluster of SMP (GFLOPS)
44
| Benchmark | Cluster of SMPs | SMP |
|---|---|---|
| SAXPY | 1.9 | 0.7 |
| SGEMV | 4.4 | 1.7 |
| SGEMM | 48 | 45 |
| CONV2D | 4.8 | 7.8 |
| FFT3D | 1.1 | 3.9 |
| GRAVITY | 50 | 40 |
| HMMER | 14 | 11 |

Same number of total processors. Compute-limited applications are agnostic to the interconnect.
Disk+PS3 Comparison (GFLOPS)
45
| Benchmark | Disk+PS3 | PS3 |
|---|---|---|
| SAXPY | 0.004 | 3.1 |
| SGEMV | 0.014 | 10 |
| SGEMM | 3.7 | 94 |
| CONV2D | 0.48 | 62 |
| FFT3D | 0.05 | 31 |
| GRAVITY | 66 | 71 |
| HMMER | 8.3 | 7.1 |

Some applications have the computational intensity to run from disk with little slowdown; large blocks are needed to hide memory latency.
Extensions for Irregular Parallelism
46
Regular vs. Irregular Parallelism
Regular computations:
- Statically known working sets
- Statically known communication patterns
- Predictable running times
Irregular computations:
- Dynamically determined working sets
- Dynamically determined communication patterns
- Unpredictable running times
Regular applications provide scalability, but large applications still have irregular components that need to be parallelized.
47
Spawn Statement
- Spawn launches an unbounded number of tasks
- Continues launching until the termination condition is met

task<inner> void performWork()
  // ...
  // (task to be run, termination condition)
  spawn(performWork(this), workQueue.isEmpty());
  // ...
48
Parent Pointers and Call-Up
- Parent pointers give a child a way to name its parent's address space
- A call-up is a task that runs atomically in the parent's address space
- Maintains the abstract machine model

task<leaf> void handleWork(parent WorkList *wl)
  // Perform call-up to retrieve work
  vector<Work> localWorkQueue = wl->getWork();

  /* Perform Work... */

  // Call-up to add back extra work
  wl->addWork(localWorkQueue);
49
Case Study: Boolean Satisfiability (SAT)
SAT is useful in many industrial applications:
- CAD tools
- Model checkers / static analyses
SAT is a search problem.
SAT is an irregular application:
- Dynamic working set (partial assignments)
- Dynamic communication (learned clauses)
- Unknown running times (solving partial assignments)

(x1 + ¬x3 + x4) ∧ (¬x2 + x3 + x5) ∧ (x4 + ¬x5 + ¬x6)
50
Parallel SAT Solving
- Spawn many sequential SAT solvers [1]
- Periodically call up to update assumptions
[1] We use the Mini-Sat sequential solver
[Diagram: Level 0 holds the formula, e.g. (x1 + x2 + ¬x3 ...); Level 1 runs several sequential solvers.]
51
Performance Results for SAT
53
Speedups over the Mini-SAT sequential solver (2008 SAT champion).
[Plot: speedup over sequential vs. number of processors.]
Case Study: Parallel Sorting
Generalized quicksort algorithm.
[Diagram: an unsorted integer array being partitioned.]
Sorting is an irregular application:
- Dynamic working set (partitions)
- Dynamic communication (swizzle)
- Unknown execution time (partition sort)
54
Sorting Performance
Sort 2^27 integers (vs. STL sort); compare to the original Sequoia.
55
Dynamic Load Balancing with Spawn
- Employ a re-spawn heuristic
- With enough tasks, load balancing happens dynamically
56
Case Study: Sparse Matrix Multiply
57
- Sparse matrix multiplied with a dense vector
- No optimizations for specific sparsity patterns
- Dynamically allocate chunks of rows to children:
  - Handles the case where some rows have more non-zero elements than others
  - Children have an initial assignment of rows
  - Use work stealing when finished
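The row-chunked sparse matrix-vector product described above can be sketched in C using the CSR format. This is a serialized illustration, not the Sequoia code: `spmv_csr` and the `chunk` parameter are invented names, and in the real version each chunk would go to a child task, with work stealing absorbing the imbalance between dense and sparse rows.

```c
/* y = A*x for a sparse matrix A in CSR form (rowptr/colidx/vals),
   processing rows in chunks as they would be handed to children. */
static void spmv_csr(int nrows, int chunk,
                     const int *rowptr, const int *colidx,
                     const float *vals, const float *x, float *y) {
    for (int r0 = 0; r0 < nrows; r0 += chunk) {            /* one chunk per child */
        int r1 = r0 + chunk < nrows ? r0 + chunk : nrows;
        for (int r = r0; r < r1; r++) {
            float acc = 0.0f;
            for (int k = rowptr[r]; k < rowptr[r + 1]; k++)  /* non-zeros of row r */
                acc += vals[k] * x[colidx[k]];
            y[r] = acc;
        }
    }
}
```

Because row lengths vary, chunks take different amounts of time, which is exactly the irregularity that motivates the initial assignment plus work stealing.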
Performance Results for SpMV
58
Speedups over sequential OSKI code (excluding OSKI's tuning time).
Similar to other results for SpMV: inherently memory bound.
[Plot: speedup over sequential vs. number of processors.]
Areas of Future Research
59
Memory Management
- What abstractions are presented for each address space? Stack? Heap? GC? Persistence?
- How to represent distributed data structures and arrays in virtual levels? (A major problem for clusters with distributed memory.)
- How to handle pointer data structures that need to be partitioned?
- Better mechanisms for communicating locality?
60
Sequoia: DSL Compiler Target?
61
Can DSL compilers perform domain-specific optimizations in a machine-agnostic context?
[Diagram: DSL A and DSL B compilers, each with domain-specific optimizations, emit Sequoia code; the Sequoia compiler applies machine optimizations and targets MPI, P-Threads, CUDA, OpenCL, and disk.]
Conclusions
- Programming to an abstract memory hierarchy provides both locality information and portability
- Sequoia provides a general framework for autotuning in deep memory hierarchies
- Constructs for irregular parallelism are important for obtaining good performance
62
Questions? http://sequoia.stanford.edu
63
Back-up Slides
64
Locality-Aware Programming
- Specify functionally independent tasks
- Call-by-value-result semantics
- Couple locality information with explicit parallelism
- Provide tunable variables for machine independence
65
Tunables
- Provide a mechanism for specifying machine-dependent variables
- Two flavors: integer tunables for control, and task tunables for task variants
- Specified in the mapping file
66
Sequoia Rough Edges
Memory management:
- What abstractions are presented for each address space?
- How to handle pointer-based data structures?
Abstract machine model:
- Is it too abstract?
- What operations would be allowed if it were slightly less abstract?
Pipeline parallelism:
- Could Sequoia support something like the GRAMPS graph of queues?
67
Case Study: Fluid Simulation
- Fluidanimate application from PARSEC [1]
- 3-D fluid flow simulated by particles; space is partitioned into cells
- Fluid is an irregular application:
  - Dynamic working set (cells per grid)
  - Dynamic communication (ghost cells)
  - Unknown running time (particles per cell)
[1] PARSEC benchmark suite: parsec.cs.princeton.edu
68
The Ghost Cell Problem
- All copies of a ghost cell must be reduced sequentially; different ghost cells can be reduced in parallel
- Call-up serializes these parallel reductions
- Future research: prove independence of call-ups at compile time to allow concurrent call-up execution
69