Transcript of Scalable Multi-Cache Simulation Using GPUs by Michael Moeng, Sangyeun Cho, and Rami Melhem, University of Pittsburgh

Scalable Multi-Cache Simulation Using GPUs
Michael Moeng, Sangyeun Cho, Rami Melhem
University of Pittsburgh

Background
•Architects are simulating more cores
▫Increasing simulation times
•Cannot keep doing single-threaded simulations if we want to see results in a reasonable time frame
[Diagram: Host Machine simulates Target Machine]

Parallel Simulation Overview
•A number of people have begun researching multithreaded simulation
•Multithreaded simulations have some key limitations
▫Many fewer host cores than cores in the target machine
▫Slow communication between threads
•Graphics processors have high fine-grained parallelism
▫Many more cores
▫Cheap communication within 'blocks' (a software unit)
•We propose using GPUs to accelerate timing simulation
•The CPU acts as the functional feeder
Contributions
•Introduce GPUs as a tool for architectural timing simulation
•Implement a proof-of-concept multi-cache simulator
•Study strengths and weaknesses of GPU-based timing simulation

Outline
•GPU programming with CUDA
•Multi-cache simulation using CUDA
•Performance results vs CPU
▫Impact of thread interaction
▫Optimizations
•Conclusions

CUDA Dataflow
[Diagram: Host (CPU) with Main Memory connected over the PCIe Bus to the Device (GPU) with SIMD Processors and Graphics Memory]

Dataflow – Concurrent CPU/GPU
[Diagram: same Host (CPU)/Device (GPU) layout, highlighting concurrent CPU and GPU execution]

Dataflow – Concurrent Kernels
[Diagram: same Host (CPU)/Device (GPU) layout, highlighting concurrent kernels on the GPU]

GPU-driven Cache Simulation
[Diagram: trace-driven simulation; the Host (CPU) feeds Trace Data to the Device (GPU), where an L1 kernel passes L1 Misses to an L2 kernel, and Statistics are returned to the host]

GPU-driven Cache Simulation
•Each cache way is simulated by a thread
▫Parallel address lookup
▫Communicate via fast shared memory
•Ways from 4 caches form a block
▫With 16-way caches, 64 threads per block
•Cache state (tags + metadata) is stored in global memory; rely on caching for fast access
•Experimented with tree-based reduction
▫No performance improvement (small tree)

Block-to-Block Interactions
•Within a block, we can call a cheap barrier, and there is no inaccuracy
•Shared L2
▫Upon a miss, L1 threads determine the L2 home tile
▫Atomically add the miss to a global memory buffer for the L2 to process
•Write invalidations
▫Upon a write, the L1 thread checks global memory for the tag state of the other L1 threads
▫Atomically invalidate matching lines in global memory

Evaluation
•Feed traces of memory accesses to the simulated cache hierarchy
▫Mix of PARSEC benchmarks
•L1/L2 cache hierarchy with private or shared L2
•GeForce GTS 450: 192 cores (low-to-mid range)
▫Fermi GPU: caching, simultaneous kernels
▫Newer NVIDIA GPUs range from 100-500 cores

Private L2
[Chart: simulation time for workloads A-F, comparing multithreaded CPU simulation and GPU simulation with 32, 64, and 96 simulated caches on the host machine. CPU simulation time scales linearly with cache count; the GPU sees only a 13-60% slowdown from 32 to 96 caches]

Shared L2
[Chart: simulation time for workloads A-F, multithreaded CPU vs GPU with 32, 64, and 96 simulated caches. Unbalanced traffic load to a few tiles leaves execution largely serialized]

Inaccuracy from Thread Interaction
•CUDA currently has little support for synchronization between blocks
•Without synchronization support, inter-thread communication is subject to error:
▫Shared L2 caches: miss rate
▫Write invalidations: invalidation count

Controlling Error
•The only way to synchronize blocks is between kernel invocations
▫The number of trace items processed by each kernel invocation controls the error
▫Similar techniques are used in parallel CPU simulation
•There is a performance and error tradeoff in varying the trace chunk size

Invalidation – Performance vs Error
[Chart: invalidation count error (0-12%) and performance relative to CPU-only simulation (0-100%) as the trace chunk size grows from 8 to 2048]

Shared L2 – Miss Rate Error
[Chart: miss rate error (0.00-0.70%) for workloads A-F with 32, 64, and 96 simulated caches. Largely serialized execution minimizes error]

Concurrent Execution
•Transfer memory while executing kernels
•Run the L1 kernel concurrently with the L2 kernel
[Diagram: pipelined execution in which the Trace IO, L1 kernel, and L2 kernel stages of successive chunks overlap]

Concurrent Execution Speedup
[Chart: performance improvement (0-80%) for 32, 64, and 96 simulated cache tiles, broken down by cache model (L1 only, private L2, shared L2). The benefit is greater when more data is transferred, coming from parallel memory transfer and computation and from better utilization of the GPU; the benefits end when the GPU is fully utilized, and load imbalance among L2 slices limits the shared configuration]

CUDA Block Mapping
•For maximum throughput, CUDA requires a balance between the number of blocks and the threads per block
▫Each block can support many more threads than the number of ways in each cache
•We map 4 caches to each block for maximum throughput
•We also study the tradeoff from fewer caches per block

Block Mapping – Scaling
[Chart: simulation time vs number of simulated cache tiles (16-96) for 1, 2, and 4 caches per block; the smaller mappings saturate at 32 and 64 tiles. More caches per block gives higher throughput; fewer caches per block gives lower latency]

Conclusions
•Even with a low-end GPU, we can simulate increasingly many caches with very small slowdown
•With a GPU co-processor, we can leverage both CPU and GPU processor time
•It is crucial that we balance the load between blocks

Future Work
•Better load balancing (adaptive mapping)
•More detailed timing model
•Comparisons against multithreaded simulation
•Studies with higher-capacity GPUs