Performance Analysis of NUCA Policies for CMPs Using Parsec v2.0 Benchmark Suite
Javier Lira ψ
Carlos Molina ф
Antonio González λ
λ Intel Barcelona Research Center
Intel Labs - UPC
Barcelona, Spain
antonio.gonzalez@intel.com
ф Dept. Enginyeria Informàtica
Universitat Rovira i Virgili
Tarragona, Spain
carlos.molina@urv.net
ψ Dept. Arquitectura de Computadors
Universitat Politècnica de Catalunya
Barcelona, Spain
javier.lira@ac.upc.edu
XX Jornadas de Paralelismo, A Coruña (Spain) – September 17, 2009
Outline
Introduction
Methodology
Analysis of NUCA policies
  Bank Placement Policy
  Bank Access Policy
  Bank Migration Policy
  Bank Replacement Policy
Conclusions
Introduction
CMPs have emerged as a dominant paradigm in system design.
1. Sustain performance improvement while reducing power consumption.
2. Take advantage of thread-level parallelism.
Commercial CMPs are currently available.
CMPs incorporate larger and shared last-level caches.
Wire delay is a key constraint.
NUCA
Non-Uniform Cache Architecture (NUCA) was first proposed in ASPLOS 2002 by Kim et al.[1].
NUCA divides a large cache into smaller and faster banks.
Banks close to the cache controller have lower latencies than banks farther away.
[1] C. Kim, D. Burger and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02.
NUCA Policies
Bank Placement Policy
Bank Access Policy
Bank Migration Policy
Bank Replacement Policy
Methodology
Simulation tools: Simics + GEMS, CACTI v6.0
Benchmarks: PARSEC v2.0 Benchmark Suite

Parameter                 Value
Number of cores           8, 4-way SMT
Branch Predictor          YAGS
Instr. Window / ROB       64 / 128 entries
Block size                64 Bytes
L1 Cache (Instr/Data)     32 KBytes, 2-way
L2 Cache (NUCA)           8 MBytes, 256 banks
NUCA Bank                 32 KBytes, 8-way
L1 Latency                3 cycles
NUCA Bank Latency         4 cycles
Router Latency            1 cycle
Wire delay                1 cycle
Off-chip Mem. Latency     350 cycles (from core)
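The configuration above can be sanity-checked with a few lines of arithmetic. The sketch below derives the bank size and sets per bank from the listed values, and adds a simple one-way latency model. The model is an assumption for illustration (each hop costs one router traversal plus one wire segment, and the bank latency is paid once); the slides do not state how these latencies combine.

```python
# Values taken from the configuration listed above.
L2_SIZE_KB = 8 * 1024      # 8 MBytes
NUM_BANKS = 256
BANK_WAYS = 8
BLOCK_BYTES = 64
BANK_LATENCY = 4           # cycles
ROUTER_LATENCY = 1         # cycles
WIRE_DELAY = 1             # cycles


def bank_size_kb():
    """Capacity of one NUCA bank in KBytes."""
    return L2_SIZE_KB // NUM_BANKS


def sets_per_bank():
    """Number of sets in one bank given its associativity and block size."""
    return bank_size_kb() * 1024 // (BLOCK_BYTES * BANK_WAYS)


def bank_access_latency(hops):
    """Assumed one-way model: bank latency plus (router + wire) per hop."""
    return BANK_LATENCY + hops * (ROUTER_LATENCY + WIRE_DELAY)


if __name__ == "__main__":
    print(bank_size_kb(), sets_per_bank())
    for hops in (0, 1, 4, 8):
        print(hops, bank_access_latency(hops))
```

Under this assumed model the latency grows linearly with the distance between the requester and the bank, which is the non-uniformity that the NUCA policies try to exploit.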
Baseline NUCA cache architecture
[Figure: baseline NUCA cache layout. 8 cores (Core 0 to Core 7), each with L1D and L1I caches; 256 banks.]
[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04.
Bank Placement Policy
Configurations: 1B + Static, 16B + Static, 16B + Local
[Figure: NUCA cache layout with 8 cores (Core 0 to Core 7) and their L1D/L1I caches.]
Bank Placement Policy
1B + Static placement distributes data blocks fairly across the banks.
16B configurations concentrate data in a few banks.
Placement and migration policies are strictly correlated.
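The difference between the configurations can be sketched in code. The slides do not define the labels, so this is one plausible reading, stated as an assumption: "1B" means a block maps to exactly one bank, "16B" means a block may live in any of 16 candidate banks (one per cluster of 16 banks), "Static" picks the initial bank from address bits, and "Local" picks the candidate in the requesting core's cluster. The cluster numbering and the core-to-cluster mapping are hypothetical.

```python
NUM_BANKS = 256
BLOCK_BYTES = 64
CLUSTERS = 16                             # assumed: 16 clusters of 16 banks
BANKS_PER_CLUSTER = NUM_BANKS // CLUSTERS


def block_number(addr):
    return addr // BLOCK_BYTES


def candidates_1b(addr):
    """1B (assumed): a single candidate bank chosen by address bits."""
    return [block_number(addr) % NUM_BANKS]


def candidates_16b(addr):
    """16B (assumed): one candidate bank in each of the 16 clusters."""
    slot = block_number(addr) % BANKS_PER_CLUSTER
    return [c * BANKS_PER_CLUSTER + slot for c in range(CLUSTERS)]


def place_static(addr):
    """Static: the initial cluster comes from further address bits."""
    cluster = (block_number(addr) // BANKS_PER_CLUSTER) % CLUSTERS
    return candidates_16b(addr)[cluster]


def place_local(addr, core_id):
    """Local: the initial bank is in the requester's cluster (hypothetical
    mapping: core i is local to cluster i)."""
    return candidates_16b(addr)[core_id % CLUSTERS]


if __name__ == "__main__":
    blocks = [b * BLOCK_BYTES for b in range(1024)]
    print(len({place_static(a) for a in blocks}))    # banks used by Static
    print(len({place_local(a, 0) for a in blocks}))  # banks used by Local, core 0
```

In this toy model, static placement spreads 1024 consecutive blocks over all 256 banks, while local placement for a single core uses only the 16 banks of its cluster, which matches the concentration effect noted above.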
Bank Access Policy
Schemes: Serial, 9P + 7P, Parallel
[Figure: NUCA cache layout with 8 cores (Core 0 to Core 7) and their L1D/L1I caches.]
Bank Access Policy
Power efficiency vs. performance.
9P + 7P is a trade-off, but it is still far from the potential performance.
These results suggest there is broad room for improvement in this policy.
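The trade-off between the three schemes can be illustrated with a small lookup model. It assumes 16 candidate banks per block and reads "9P + 7P" as probing 9 banks in parallel and then the remaining 7 in parallel; both are assumptions, since the slides do not spell the scheme out. The number of steps stands in for latency and the number of probed banks for energy.

```python
def lookup(candidates, holder, group_sizes):
    """Probe candidate banks group by group.

    Returns (steps, probes): steps is the number of sequential probe rounds,
    probes is the total number of banks accessed. A holder of None models
    a miss, which probes every group.
    """
    start = 0
    probes = 0
    for step, size in enumerate(group_sizes, start=1):
        group = candidates[start:start + size]
        probes += len(group)
        if holder in group:
            return step, probes
        start += size
    return len(group_sizes), probes


SERIAL = [1] * 16      # one bank per step
TWO_PHASE = [9, 7]     # assumed meaning of "9P + 7P"
PARALLEL = [16]        # all banks at once

if __name__ == "__main__":
    banks = list(range(16))
    for name, scheme in (("serial", SERIAL), ("9P+7P", TWO_PHASE),
                         ("parallel", PARALLEL)):
        print(name, lookup(banks, 10, scheme))
```

Serial access probes the fewest banks when the block is found early but takes many steps otherwise; parallel access always finishes in one step at the cost of probing every bank; the two-phase scheme sits in between.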
Bank Migration Policy
Static
Gradual + Swapping
Gradual + Replication
[Figure: NUCA cache layout with 8 cores (Core 0 to Core 7) and their L1D/L1I caches.]
Bank Migration Policy
Replication reduces the effective size of the cache.
Migration approaches concentrate data blocks in a few banks.
The static approach distributes data blocks fairly across the whole cache.
Placement and migration policies are strictly correlated.
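A minimal sketch of gradual migration with swapping, under simplifying assumptions that are not from the slides: a single requesting core, a chain of banks ordered from closest to farthest, and one block per bank. On each hit the block moves one bank closer and swaps places with the block it displaces. Replication is not modeled here.

```python
def access_with_gradual_swap(chain, block):
    """Access `block` in `chain`, where chain[0] is the closest bank.

    Returns the position at which the hit occurred, then promotes the block
    by one position, swapping it with its closer neighbour.
    Raises ValueError if the block is not in the chain (a miss).
    """
    pos = chain.index(block)
    if pos > 0:
        chain[pos - 1], chain[pos] = chain[pos], chain[pos - 1]
    return pos


if __name__ == "__main__":
    chain = ["A", "B", "C", "D"]
    for _ in range(4):
        print(access_with_gradual_swap(chain, "D"), chain)
```

Repeated hits pull a frequently used block toward the requester one step at a time, which is how migration ends up concentrating hot blocks in a few nearby banks.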
Bank Replacement Policy
Schemes: Zero-copy, One-copy, Last Bank
[Figure: NUCA cache layout with 8 cores (Core 0 to Core 7), their L1D/L1I caches, and the Last Bank.]
Bank Replacement Policy
Giving evicted data blocks a second chance provides a significant performance gain.
Last Bank is a promising mechanism, but it is limited by its small size.
Further exploration of this policy is required.
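The "second chance" idea can be sketched as a small buffer that catches blocks evicted from the regular banks. The sketch assumes Last Bank behaves like an LRU victim buffer; the slides do not describe its internals, so the class and method names below are hypothetical.

```python
from collections import OrderedDict


class LastBank:
    """Assumed model: a small LRU buffer holding blocks evicted from NUCA banks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()   # oldest entry first

    def insert_evicted(self, block):
        """Store an evicted block; return the block dropped to make room, or None."""
        self.blocks[block] = True
        self.blocks.move_to_end(block)
        if len(self.blocks) > self.capacity:
            dropped, _ = self.blocks.popitem(last=False)
            return dropped
        return None

    def lookup(self, block):
        """Return True on a hit and remove the block (it goes back to the banks)."""
        return self.blocks.pop(block, None) is not None


if __name__ == "__main__":
    lb = LastBank(capacity=2)
    for b in ("A", "B", "C"):
        print(b, "dropped:", lb.insert_evicted(b))
    print(lb.lookup("A"), lb.lookup("B"))
```

With a small capacity, early victims are dropped before they can be reused, which illustrates why the size of the buffer limits the benefit.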
Conclusions
NUCA is characterized by four policies.
NUCA policies are related.
Static placement with no migration: good trade-off.
Bank placement and bank migration are strictly correlated.
Bank access: power efficiency vs. performance.
Bank replacement: ↑ performance (with an unbounded Last Bank).
Still room for improvement in all policies.
Questions?