Performance Analysis of NUCA Policies for CMPs Using Parsec v2.0 Benchmark Suite
Javier Lira ψ
Carlos Molina ф
Antonio González λ
λ Intel Barcelona Research Center
Intel Labs - UPC
Barcelona, Spain
ф Dept. Enginyeria Informàtica
Universitat Rovira i Virgili
Tarragona, Spain
ψ Dept. Arquitectura de Computadors
Universitat Politècnica de Catalunya
Barcelona, Spain
XX Jornadas de Paralelismo, A Coruña (Spain) – September 17, 2009
Outline
Introduction
Methodology
Analysis of NUCA policies
    Bank Placement Policy
    Bank Access Policy
    Bank Migration Policy
    Bank Replacement Policy
Conclusions
Introduction
CMPs have emerged as a dominant paradigm in system design.
1. Sustain performance improvements while reducing power consumption.
2. Take advantage of thread-level parallelism.
Commercial CMPs are currently available.
CMPs incorporate large, shared last-level caches.
Wire delay is a key constraint.
NUCA
Non-Uniform Cache Architecture (NUCA) was first proposed at ASPLOS 2002 by Kim et al. [1].
NUCA divides a large cache into smaller, faster banks.
Banks close to the cache controller have lower latencies than banks farther away.
[1] C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02
NUCA Policies
Bank Placement Policy
Bank Access Policy
Bank Migration Policy
Bank Replacement Policy
Methodology
Simulation tools: Simics + GEMS, CACTI v6.0
PARSEC v2.0 Benchmark Suite
Number of cores: 8, 4-way SMT
Branch predictor: YAGS
Instr. window / ROB: 64 / 128 entries
Block size: 64 Bytes
L1 cache (Instr/Data): 32 KBytes, 2-way
L2 cache (NUCA): 8 MBytes, 256 banks
NUCA bank: 32 KBytes, 8-way
L1 latency: 3 cycles
NUCA bank latency: 4 cycles
Router latency: 1 cycle
Wire delay: 1 cycle
Off-chip memory latency: 350 cycles (from core)
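The parameters above can be combined into a rough estimate of how NUCA access latency grows with distance. The sketch below is only an illustration: it assumes, which the slides do not state, that a request crosses one router and one wire segment per hop in each direction and then accesses the bank once. The function name `nuca_access_latency` and the hop counts are hypothetical.

```python
# Values copied from the methodology parameters.
BANK_LATENCY = 4     # cycles per bank access
ROUTER_LATENCY = 1   # cycles per hop
WIRE_DELAY = 1       # cycles per hop
NUM_BANKS = 256
BANK_SIZE_KB = 32


def nuca_access_latency(hops: int) -> int:
    """Estimated round-trip cycles to a bank `hops` hops away (assumed model)."""
    if hops < 0:
        raise ValueError("hops must be non-negative")
    one_way = hops * (ROUTER_LATENCY + WIRE_DELAY)
    return 2 * one_way + BANK_LATENCY


# Capacity check: 256 banks of 32 KBytes each give the 8 MBytes listed above.
total_mb = NUM_BANKS * BANK_SIZE_KB // 1024

if __name__ == "__main__":
    print("L2 capacity (MBytes):", total_mb)
    for hops in (1, 4, 8):
        print(hops, "hops:", nuca_access_latency(hops), "cycles")
```

Under this assumed model even the farthest bank stays well below the 350-cycle off-chip latency, which is why keeping frequently used blocks in nearby banks matters.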
Baseline NUCA cache architecture
[Figure: baseline layout with eight cores (Core 0 to Core 7), each with its own L1I and L1D cache, sharing the NUCA cache]
8 cores
256 banks
[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04
Bank Placement Policy
1B + Static
16B + Static
16B + Local
[Figure: the three placement configurations shown on the 8-core baseline layout]
Bank Placement Policy
1B + Static placement provides fair distribution.
16B configurations concentrate data in a few banks.
Placement and migration policies are strictly correlated.
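A static placement can be sketched as a fixed mapping from block address to candidate banks. The slides do not define the "1B" and "16B" labels, so this sketch assumes they mean that a block may reside in exactly 1 bank or in any of 16 banks. The mapping from address bits to banks and the name `candidate_banks` are also hypothetical.

```python
from __future__ import annotations

BLOCK_SIZE = 64   # bytes, from the methodology slide
NUM_BANKS = 256   # from the methodology slide


def candidate_banks(address: int, banks_per_set: int) -> list[int]:
    """Return the banks that may hold the block containing `address`.

    Assumed mapping: the low-order bits of the block number select a bankset,
    and bankset s owns banks s, s + num_sets, s + 2 * num_sets, and so on.
    """
    if banks_per_set <= 0 or NUM_BANKS % banks_per_set != 0:
        raise ValueError("banks_per_set must be a positive divisor of NUM_BANKS")
    num_sets = NUM_BANKS // banks_per_set
    block_number = address // BLOCK_SIZE
    bankset = block_number % num_sets
    return [bankset + i * num_sets for i in range(banks_per_set)]


if __name__ == "__main__":
    print(candidate_banks(0x1000, 1))    # a single possible bank
    print(candidate_banks(0x1000, 16))   # sixteen possible banks
```

With one candidate bank, consecutive blocks spread over all 256 banks. With 16 candidates, where a block ends up within its bankset depends on the placement and migration policies, which is consistent with those two policies being correlated.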
Bank Access Policy
Serial
9P + 7P
Parallel
[Figure: the three access schemes shown on the 8-core baseline layout]
Bank Access Policy
Power efficiency vs. performance.
9P + 7P is a trade-off, but it is still far from the performance potential.
These results suggest considerable room for improvement in this policy.
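The power versus performance tension can be illustrated by counting how many banks each scheme probes and how many search steps it needs. This is a sketch under stated assumptions: a block has 16 candidate banks ordered nearest first, "Serial" probes one bank per step, "Parallel" probes all 16 at once, and "9P + 7P" is read as probing the 9 nearest banks in parallel and then the remaining 7. That reading of the label and the name `probe_cost` are not taken from the slides.

```python
from typing import Optional, Tuple

BANKSET_SIZE = 16  # assumed number of candidate banks per block


def probe_cost(policy: str, position: Optional[int]) -> Tuple[int, int]:
    """Return (search steps, banks probed) to resolve one lookup.

    `position` is the 0-based index, in nearest-first order, of the bank that
    holds the block, or None on a miss.
    """
    phases = {
        "serial": [1] * BANKSET_SIZE,   # one bank per step
        "parallel": [BANKSET_SIZE],     # every bank in a single step
        "9P+7P": [9, 7],                # two parallel phases
    }[policy]
    steps = probed = 0
    for width in phases:
        steps += 1
        probed += width
        if position is not None and position < probed:
            break  # block found in this phase, stop searching
    return steps, probed


if __name__ == "__main__":
    for policy in ("serial", "9P+7P", "parallel"):
        print(policy, probe_cost(policy, 3), probe_cost(policy, None))
```

More banks probed stands in for energy spent, and more steps stands in for latency: the serial scheme minimizes the former, the parallel scheme the latter, and the two-phase scheme sits between them.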
Bank Migration Policy
Static
Gradual + Swapping
Gradual + Replication
[Figure: the migration schemes shown on the 8-core baseline layout]
Bank Migration Policy
Replication reduces the effective size of the cache.
Migration approaches concentrate data blocks in a few banks.
The static approach distributes data blocks fairly across the whole cache.
Placement and migration policies are strictly correlated.
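Gradual migration with swapping can be sketched as follows. This is a deliberately simplified model and not the simulated implementation: it assumes a single requesting core, one block slot per bank, and banks ordered nearest first. The function name is hypothetical.

```python
from typing import List, Optional


def access_with_gradual_migration(banks: List[Optional[str]], block: str) -> bool:
    """On a hit, move `block` one bank closer to the core by swapping.

    `banks[0]` is the bank nearest the requesting core. Returns True on a hit
    and False on a miss (a miss leaves `banks` unchanged).
    """
    if block not in banks:
        return False
    i = banks.index(block)
    if i > 0:
        # Swap with the neighbour one step closer to the core.
        banks[i - 1], banks[i] = banks[i], banks[i - 1]
    return True


if __name__ == "__main__":
    banks = ["A", "B", "C", "D"]
    access_with_gradual_migration(banks, "D")
    access_with_gradual_migration(banks, "D")
    print(banks)
```

Repeated hits pull a block toward the nearest bank one step at a time, so frequently used blocks pile up in the few banks close to the cores, in line with the concentration noted above.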
Bank Replacement Policy
Zero-copy
One-copy
Last Bank
[Figure: the three replacement schemes shown on the 8-core baseline layout, with an additional bank labeled Last Bank]
Bank Replacement Policy
Giving a second chance to evicted data blocks provides a significant performance gain.
Last Bank is a promising mechanism, but its benefit is limited by its small size.
Further exploration of this policy is required.
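The second-chance idea can be sketched as a small buffer that catches blocks evicted from the NUCA cache before they leave the chip. The slides do not describe how the Last Bank is organized, so the LRU ordering, the capacity parameter, and the class and method names below are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Optional


class LastBank:
    """Small LRU buffer holding blocks evicted from the NUCA cache (assumed design)."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._blocks: "OrderedDict[int, bytes]" = OrderedDict()

    def insert_evicted(self, tag: int, data: bytes) -> Optional[int]:
        """Store an evicted block; return the tag pushed off-chip, if any."""
        self._blocks[tag] = data
        self._blocks.move_to_end(tag)  # newest entry is most recently used
        if len(self._blocks) > self.capacity:
            victim, _ = self._blocks.popitem(last=False)  # drop the oldest
            return victim
        return None

    def recover(self, tag: int) -> Optional[bytes]:
        """On a NUCA miss, return and remove the block if it is still buffered."""
        return self._blocks.pop(tag, None)


if __name__ == "__main__":
    last_bank = LastBank(capacity=2)
    for tag in (1, 2, 3):
        print("evicted off-chip:", last_bank.insert_evicted(tag, b"data"))
    print("recovered block 2:", last_bank.recover(2) is not None)
```

A small capacity means a block is pushed off-chip soon after it arrives, which illustrates how a small Last Bank could limit the benefit.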
Conclusions
NUCA is characterized by four policies.
NUCA policies are related.
Static placement with no migration: good trade-off.
Bank placement and bank migration are strictly correlated.
Bank access: power efficiency vs. performance.
Bank replacement: higher performance (with an unbounded Last Bank).
Still room for improvement in all policies.
Questions?