Prefetching Challenges in
Distributed Memories for CMPs
Martí Torrents, Raúl Martínez, and Carlos Molina
Computer Architecture Department, UPC – BarcelonaTech
Outline
Introduction
Naming the challenges
Challenge evaluation methodology
Experimental framework
Challenge Quantification
Facing the Challenges
Conclusions
Prefetching
• Reduces memory latency
• Brings the data the CPU will need next into a nearer cache
• Increases the hit ratio
• Implemented in most commercial processors
• Erroneous prefetching may produce
– Cache pollution
– Resource consumption (queues, bandwidth, etc.)
– Power consumption
Motivation
• The number of cores on a single chip grows every year
– Nehalem: 4~6 cores
– Tilera: 64~100 cores
– Intel Polaris: 80 cores
– Nvidia GeForce: up to 256 cores
Prefetch in CMPs
• Useful prefetches imply more performance
– Avoid network latency
– Reduce memory access latency
• Useless prefetches imply less performance
– More power consumption
– More NoC congestion
– Interference with other cores' requests
Prefetch adverse behaviors
M. Torrents, R. Martínez, C. Molina. “Network Aware Performance Evaluation of Prefetching Techniques in CMPs”. Simulation Modeling Practice and Theory (SIMPAT), 2014.
Distributed memories
• Distribution of the memory access pattern:
@ @+2 @+4 @+6 @+8 @+10
Distributed memories
• Distribution of the memory access pattern:
@ @+2 @+4 @+6 @+8 @+10 @+12 @+14
TILE 00: @      TILE 01: @+2    TILE 02: @+4    TILE 03: @+6
TILE 04: @+8    TILE 05: @+10   TILE 06: @+12   TILE 07: @+14
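The mapping on this slide can be reproduced with a minimal sketch of address interleaving. The function name `home_tile`, the interleaving granularity of 2 and the modulo mapping are assumptions chosen to match the eight-tile picture. The slides do not state the actual mapping function.

```python
def home_tile(addr, base=0, granularity=2, num_tiles=8):
    """Tile whose L2 slice holds addr, under plain interleaving (assumed mapping)."""
    return ((addr - base) // granularity) % num_tiles

base = 0                                   # stands in for "@" on the slide
stream = [base + 2 * i for i in range(8)]  # @, @+2, ..., @+14

# Each consecutive element of the stream lands on a different tile.
for addr in stream:
    print(f"@+{addr - base:<2} -> TILE {home_tile(addr, base):02d}")
```

Under this assumed mapping, a stream that is perfectly regular from the core's point of view touches every tile once before it returns to the first one.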
Prefetch Distributed Memory Systems
• Analysis phase
[Figure: four tiles, each with a CPU, an L1 cache and a prefetch engine, connected to a distributed L2 memory; an L1 miss for @ reaches the L2. Label: Distributed patterns]
Pattern Detection Challenge
• The memory stream is distributed across tiles
• Each prefetcher is aware of only part of the stream
• Harder to detect access patterns or correlation
• Not all prefetchers are affected
– Correlation prefetchers are affected: GHB
– One Block Lookahead prefetchers are not affected: Tagged
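A small sketch can illustrate why a split stream is harder to learn from. It assumes the same hypothetical interleaving as the eight-tile picture (granularity 2, modulo mapping) and uses simple address deltas as a stand-in for the history that a correlation prefetcher would record. It is not a model of the GHB itself.

```python
from collections import defaultdict

def split_by_tile(stream, granularity=2, num_tiles=8):
    """Group a global access stream by the tile that observes each access (assumed mapping)."""
    views = defaultdict(list)
    for addr in stream:
        views[(addr // granularity) % num_tiles].append(addr)
    return dict(views)

def deltas(addrs):
    """Differences between consecutive addresses: the raw material for pattern detection."""
    return [b - a for a, b in zip(addrs, addrs[1:])]

stream = [2 * i for i in range(16)]   # a regular stride-2 stream of 16 accesses
views = split_by_tile(stream)

print("global deltas:", deltas(stream))      # fifteen samples of the same stride
print("tile 0 sees  :", views[0])            # only two of the sixteen accesses
print("tile 0 deltas:", deltas(views[0]))    # a single sample, with a different stride
```

With the whole stream, the stride is confirmed fifteen times. Each tile on its own gets one sample, which leaves a history-based engine with very little to correlate.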
Prefetch Distributed Memory Systems
• Request generation phase
[Figure: four tiles, each with a CPU, an L1 cache and a prefetch engine, connected to a distributed L2 memory; the access to @ triggers prefetch requests for @+2 and @+4. Label: Queue filtering]
Prefetch Queue Filtering Challenge
• Prefetch requests queued in distributed queues
• Independent engines generating requests
• Repeated requests can be queued
• In a centralized queue those would be merged
• Adverse effects:
– Power consumption
– Network contention
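The difference between distributed and centralized queues can be sketched as follows. The function names and the request lists are hypothetical. The only behaviour modeled is duplicate filtering: each distributed queue filters repeats among its own requests, while a centralized queue filters repeats across all engines.

```python
def issue_distributed(requests_per_engine):
    """Each engine filters duplicates only within its own queue."""
    issued = []
    for requests in requests_per_engine:
        seen = set()                 # private to this engine's queue
        for addr in requests:
            if addr not in seen:
                seen.add(addr)
                issued.append(addr)
    return issued

def issue_centralized(requests_per_engine):
    """A single shared queue merges repeated requests from all engines."""
    seen = set()                     # shared by every engine
    issued = []
    for requests in requests_per_engine:
        for addr in requests:
            if addr not in seen:
                seen.add(addr)
                issued.append(addr)
    return issued

# Hypothetical case: three engines whose predictions overlap.
requests = [[2, 4], [4, 6], [2, 6]]
print(len(issue_distributed(requests)), "requests issued with distributed queues")
print(len(issue_centralized(requests)), "requests issued with a centralized queue")
```

In this example the distributed queues issue six requests for three distinct blocks, so half of the traffic is redundant.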
Prefetch Distributed Memory Systems
• Evaluation phase
[Figure: four tiles, each with a CPU, an L1 cache and a prefetch engine, connected to a distributed L2 memory; prefetches for @+2 and @+4 have been issued after the access to @, followed by an L1 miss for @+2. Label: Dynamic profiling]
Dynamic Profiling Challenge
• Prefetch requests are generated in one tile
• The dynamic profiling information is collected in another tile
• Each tile's own profiling is therefore erroneous
• Techniques using this information may work erroneously
– Filtering
– Throttling
– Specific prefetching engines
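The misattribution can be made concrete with a minimal sketch. The event records and the `profile` helper are hypothetical. Each event names the tile that generated a prefetch (`origin`), the tile where its outcome is observed (`observer`), and whether the prefetch turned out to be useful.

```python
from collections import Counter

def profile(events, key):
    """Count useful and useless prefetches per tile, credited to ev[key]."""
    stats = Counter()
    for ev in events:
        outcome = "useful" if ev["useful"] else "useless"
        stats[(ev[key], outcome)] += 1
    return stats

# Hypothetical events: tile 0 generates three prefetches, two of which
# are consumed (and therefore observed) in other tiles.
events = [
    {"origin": 0, "observer": 1, "useful": True},
    {"origin": 0, "observer": 2, "useful": True},
    {"origin": 0, "observer": 0, "useful": False},
    {"origin": 1, "observer": 0, "useful": False},
]

local_view = profile(events, "observer")   # what each tile measures by itself
true_view = profile(events, "origin")      # what each engine actually achieved

print("tile 0, local view:", local_view[(0, "useful")], "useful,", local_view[(0, "useless")], "useless")
print("tile 0, true view :", true_view[(0, "useful")], "useful,", true_view[(0, "useless")], "useless")
```

Here tile 0 would conclude that none of its prefetches were useful, although two of the three it generated were. A throttling or filtering mechanism driven by the local counters would react to the wrong numbers.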
Challenge evaluation methodology
• Three environments to test the challenges
• Pattern Detection Challenge: Ideal Prefetcher
– A prefetcher that is aware of the whole memory stream
– No extra network contention added in the system
– No extra power consumed
– Requests are classified by their core identifier, to preserve the original stream of each core
• Prefetcher used to test: Global History Buffer
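Classifying requests by core identifier can be sketched as a simple demultiplexing step. The arrival trace below is hypothetical, and `per_core_streams` is an illustrative helper, not part of the authors' simulator.

```python
from collections import defaultdict

def per_core_streams(tagged_requests):
    """Rebuild each core's original stream from (core_id, addr) pairs in arrival order."""
    streams = defaultdict(list)
    for core_id, addr in tagged_requests:
        streams[core_id].append(addr)
    return dict(streams)

def deltas(addrs):
    return [b - a for a, b in zip(addrs, addrs[1:])]

# Hypothetical arrival order: two cores, each with its own regular stride.
arrivals = [(0, 100), (1, 500), (0, 102), (1, 504), (0, 104), (1, 508)]

mixed = deltas([addr for _, addr in arrivals])
by_core = {core: deltas(s) for core, s in per_core_streams(arrivals).items()}

print("deltas of the mixed stream:", mixed)
print("deltas per core           :", by_core)
```

Without the core identifier, the interleaved stream shows no stable delta. Once the requests are grouped per core, each stream shows its original stride again.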
Pattern Detection Challenge
Challenge evaluation methodology
• Three environments to test the challenges
• Prefetch Queue Filtering: Centralized queue
– All the requests are sent to a centralized queue
– Repeated requests are merged
– No extra network contention added in the system
– No extra power consumed
– Repeated requests are not issued
• Prefetcher used to test: Tagged prefetcher
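For reference, the usual textbook form of a tagged (one block lookahead) prefetcher can be sketched as below. This is a generic illustration, not the authors' implementation. The class name is hypothetical, the cache is modeled as an unbounded dictionary with no evictions, and addresses are block numbers.

```python
class TaggedPrefetcher:
    """One Block Lookahead with a tag bit per block (unbounded cache, assumed)."""

    def __init__(self):
        # block -> tag bit; True means "brought in by a prefetch, not yet referenced"
        self.tag = {}

    def access(self, block):
        """Process a demand access and return the list of prefetches it triggers."""
        if block not in self.tag:        # demand miss
            self.tag[block] = False
            trigger = True
        elif self.tag[block]:            # first reference to a prefetched block
            self.tag[block] = False
            trigger = True
        else:                            # ordinary hit: no prefetch
            trigger = False

        nxt = block + 1
        if trigger and nxt not in self.tag:
            self.tag[nxt] = True         # install the next block with its tag set
            return [nxt]
        return []

prefetcher = TaggedPrefetcher()
trace = [0, 1, 2, 3, 3]
issued = [prefetcher.access(b) for b in trace]
print(issued)
```

Because the decision depends only on the block being accessed and its tag bit, this engine needs no history of the wider stream.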
Prefetch Queue Filtering Challenge
Challenge evaluation methodology
• Three environments to test the challenges
• Dynamic Profiling Challenge: Hardware counters
– For each statistic and core, add a hardware counter
– Statistics: useful prefetches and useless prefetches
– Use the id of the origin core to classify the statistic
– Quantify the error for each core by:
[Equation not recoverable]
* where statistic is the number of useful or useless prefetches
• Prefetcher used to test: Tagged Prefetcher
Dynamic Profiling Challenge
Experimental framework
• gem5
– 64 x86 CPUs
– Ruby memory system
– L2 prefetchers
– MOESI coherence protocol
– Garnet network simulator
• PARSEC 2.1
Simulation environment
Pattern Detection Challenge
Prefetch Queue Filtering Challenge
Dynamic Profiling Challenge
Facing the challenges
• There are two main options
– Redesign the entire prefetch philosophy
– Adapt the current techniques to work with DSMs
• Moreover, there are two main directions
– Centralize the information
  – Handicap: increased communication
– Distribute the prefetcher
  – Handicap: distributing the prefetcher smartly
Conclusions
• Three challenges when prefetching in DSMs
– Pattern Detection Challenge
– Prefetch Queue Filtering Challenge
– Dynamic Profiling Challenge
• Challenge evaluation methodology
• Directions for future investigators
• There are no evident solutions for them
• Not solving them -> limited prefetch performance
Q & A