Performance Analysis of NUCA Policies for CMPs Using Parsec v2.0 Benchmark Suite

Javier Lira ψ

Carlos Molina ф

Antonio González λ

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain

antonio.gonzalez@intel.com

ф Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain

carlos.molina@urv.net

ψ Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain

javier.lira@ac.upc.edu

XX Jornadas de Paralelismo, A Coruña (Spain) – September 17, 2009

Outline

Introduction

Methodology

Analysis of NUCA policies

Bank Placement Policy

Bank Access Policy

Bank Migration Policy

Bank Replacement Policy

Conclusions

Introduction

CMPs have emerged as a dominant paradigm in system design.

1. Sustain performance improvements while reducing power consumption.

2. Take advantage of thread-level parallelism.

Commercial CMPs are currently available.

CMPs incorporate larger and shared last-level caches.

Wire delay is a key constraint.

Non-Uniform Cache Architecture (NUCA) was first proposed at ASPLOS 2002 by Kim et al. [1].

NUCA divides a large cache into smaller and faster banks.

Banks close to the cache controller have lower latencies than banks farther away.


[1] C. Kim, D. Burger and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. ASPLOS '02

NUCA Policies

Bank Placement Policy

Bank Access Policy

Bank Migration Policy

Bank Replacement Policy


Methodology

Simulation tools: Simics + GEMS, CACTI v6.0

PARSEC v2.0 Benchmark Suite

Number of cores: 8, 4-way SMT

Branch Predictor: YAGS

Instr. Window / ROB: 64 / 128 entries

Block size: 64 Bytes

L1 Cache (Instr/Data): 32 KBytes, 2-way

L2 Cache (NUCA): 8 MBytes, 256 banks

NUCA Bank: 32 KBytes, 8-way

L1 Latency: 3 cycles

NUCA Bank Latency: 4 cycles

Router Latency: 1 cycle

Wire delay: 1 cycle

Off-chip Mem. Latency: 350 cycles (from core)
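
The configuration above can be sanity-checked with a little arithmetic. The sketch below uses only the numbers listed on this slide; the latency model in `round_trip_latency` (request and reply each crossing the same number of hops, one router plus one wire segment per hop) is an illustrative assumption, not the model used in the simulations.

```python
# Values copied from the configuration slide.
BLOCK_BYTES = 64
BANK_BYTES = 32 * 1024
BANK_WAYS = 8
NUM_BANKS = 256
BANK_LATENCY = 4     # cycles
ROUTER_LATENCY = 1   # cycles per hop
WIRE_DELAY = 1       # cycles per hop


def total_capacity_bytes():
    """Total NUCA capacity implied by the bank count and bank size."""
    return NUM_BANKS * BANK_BYTES


def sets_per_bank():
    """Number of sets in one 8-way bank with 64-byte blocks."""
    return BANK_BYTES // (BLOCK_BYTES * BANK_WAYS)


def round_trip_latency(hops):
    """Assumed model: the request and the reply each traverse `hops` links."""
    if hops < 0:
        raise ValueError("hops must be non-negative")
    per_hop = ROUTER_LATENCY + WIRE_DELAY
    return 2 * hops * per_hop + BANK_LATENCY


if __name__ == "__main__":
    print(total_capacity_bytes() // (1024 * 1024), "MBytes")
    print(sets_per_bank(), "sets per bank")
    for hops in (0, 3, 8):
        print(hops, "hops:", round_trip_latency(hops), "cycles")
```

Under this assumed model, a bank that is a few hops away already costs several times the 4-cycle bank latency, which is the non-uniformity the policies in the following slides try to manage.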

Baseline NUCA cache architecture

[Figure: baseline NUCA cache architecture with 8 cores (each with L1D and L1I caches) and 256 banks]

[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO '04


Bank Placement Policy

Configurations: 1B + Static, 16B + Static, 16B + Local

Bank Placement Policy

1B + Static placement provides fair distribution.

16B configurations concentrate data in a few banks.

Placement and migration policies are strictly correlated.
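
A minimal sketch of how the three placement configurations might choose a bank. The slides do not give the mapping, so everything here is an assumption made for illustration: "1B" is read as one candidate bank per block, "16B" as a bankset of 16 candidate banks, and the address slicing, bank numbering, and the `core_position` parameter are hypothetical.

```python
BLOCK_BYTES = 64
NUM_BANKS = 256
BANKS_PER_BANKSET = 16                          # assumed meaning of "16B"
NUM_BANKSETS = NUM_BANKS // BANKS_PER_BANKSET   # 16 banksets


def block_number(addr):
    """Drop the block-offset bits of a byte address."""
    return addr // BLOCK_BYTES


def bankset_banks(addr):
    """Hypothetical numbering: the 16 banks that may hold this block."""
    bankset = block_number(addr) % NUM_BANKSETS
    return [pos * NUM_BANKSETS + bankset for pos in range(BANKS_PER_BANKSET)]


def initial_bank(addr, policy, core_position=0):
    """Bank that receives the block when it is first brought on chip.

    core_position is a hypothetical index (0..15) of the bank in the
    bankset that is closest to the requesting core.
    """
    blk = block_number(addr)
    if policy == "1B+Static":
        # One candidate bank, fixed by address bits.
        return blk % NUM_BANKS
    banks = bankset_banks(addr)
    if policy == "16B+Static":
        # Position inside the bankset is also fixed by address bits.
        return banks[(blk // NUM_BANKSETS) % BANKS_PER_BANKSET]
    if policy == "16B+Local":
        # Position inside the bankset follows the requesting core.
        return banks[core_position]
    raise ValueError("unknown policy: " + policy)


if __name__ == "__main__":
    addr = 64 * 37
    for policy in ("1B+Static", "16B+Static", "16B+Local"):
        print(policy, initial_bank(addr, policy, core_position=3))
```

With a mapping of this kind, the static variants spread blocks according to address bits alone, while the local variant depends on which core touches the block first.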


Bank Access Policy

Schemes: Serial, 9P + 7P, Parallel

Bank Access Policy

Power efficiency vs. performance.

9P + 7P is a trade-off, but it is still far from the performance potential.

These results suggest there is broad room for improvement in this policy.
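
The trade-off between the three access schemes can be sketched by counting lookup rounds (a proxy for latency) and banks probed (a proxy for energy). This is an illustrative model only: it assumes a bankset of 16 candidate banks and reads "9P + 7P" as probing 9 banks in parallel and then the remaining 7 in parallel; the function and constant names are hypothetical.

```python
def lookup_cost(phases, hit_position):
    """Return (rounds, banks_probed) for one lookup.

    phases: group sizes probed in parallel, in search order.
    hit_position: 0-based index in search order of the bank holding the
    block, or None for a miss (an out-of-range index also acts as a miss).
    """
    rounds = 0
    probed = 0
    for size in phases:
        rounds += 1
        first = probed
        probed += size
        if hit_position is not None and first <= hit_position < probed:
            break  # block found in this phase, stop searching
    return rounds, probed


SERIAL = [1] * 16      # one bank at a time
TWO_PHASE = [9, 7]     # assumed reading of "9P + 7P"
PARALLEL = [16]        # all banks at once


if __name__ == "__main__":
    for name, phases in (("Serial", SERIAL),
                         ("9P + 7P", TWO_PHASE),
                         ("Parallel", PARALLEL)):
        print(name, lookup_cost(phases, hit_position=12))
```

In this model the serial scheme probes the fewest banks when the block is near the front of the search order but needs the most rounds otherwise, the parallel scheme always finishes in one round at the cost of probing every bank, and the two-phase scheme sits in between.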


Bank Migration Policy

Static

Gradual + Swapping

Gradual + Replication


Bank Migration Policy

Replication reduces the effective size of the cache.

Migration approaches concentrate data blocks in a few banks.

The static approach distributes data blocks fairly across the whole cache.

Placement and migration policies are strictly correlated.
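
A minimal sketch of gradual migration with swapping, seen from a single requesting core. It assumes a heavily simplified bankset in which each bank holds one block and index 0 is the bank closest to the requester; the function name and this data layout are hypothetical and are not taken from the simulated design.

```python
def access_with_gradual_swap(bankset, block):
    """Look up `block` and, on a hit, move it one bank closer.

    bankset: list of blocks, index 0 being the bank nearest the requester.
    Returns the index where the block was found (before migration),
    or None on a miss.
    """
    try:
        i = bankset.index(block)
    except ValueError:
        return None  # miss: nothing to migrate
    if i > 0:
        # Swap with the block one step closer to the requester.
        bankset[i - 1], bankset[i] = bankset[i], bankset[i - 1]
    return i


if __name__ == "__main__":
    banks = ["A", "B", "C", "D"]
    for _ in range(3):
        found_at = access_with_gradual_swap(banks, "C")
        print(found_at, banks)
```

Repeated hits pull a frequently used block toward the requester one step at a time, while the displaced block moves one step away; this is the mechanism by which hot data ends up concentrated in the banks nearest the cores.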


Bank Replacement Policy

Schemes: Zero-copy, One-copy, Last Bank

Bank Replacement Policy

Giving a second chance to evicted data blocks provides significant performance gain.

Last Bank is a promising mechanism, but it is limited by its small size.

Further exploration of this policy is required.
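
The idea of giving evicted blocks a second chance can be sketched as a small victim store. The slides do not describe how Last Bank is organized, so this sketch assumes a fixed-capacity structure that drops its oldest entry when full; the class name, method names, and the eviction order are all hypothetical.

```python
from collections import OrderedDict


class LastBank:
    """Assumed model: a small store for blocks evicted from the NUCA banks."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._blocks = OrderedDict()  # oldest entry first

    def insert_evicted(self, addr):
        """Keep an evicted block; return the address dropped to make room."""
        self._blocks.pop(addr, None)      # refresh position if already held
        self._blocks[addr] = True
        if len(self._blocks) > self.capacity:
            dropped, _ = self._blocks.popitem(last=False)
            return dropped
        return None

    def lookup(self, addr):
        """On a NUCA miss: True if the block is recovered from Last Bank."""
        return self._blocks.pop(addr, None) is not None


if __name__ == "__main__":
    lb = LastBank(capacity=2)
    for addr in (1, 2, 3):
        print("evicted", addr, "dropped", lb.insert_evicted(addr))
    print(lb.lookup(1), lb.lookup(2))
```

In this toy model a tiny capacity causes blocks to be dropped before they are reused, which matches the observation above that the mechanism is limited by its size.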


Conclusions

NUCA is characterized by four policies.

NUCA policies are related.

Static placement with no migration: good trade-off.

Bank placement and bank migration are strictly correlated.

Bank access: Power efficiency vs. Performance.

Bank replacement: ↑ Performance (unbounded last bank).

Still room for improvement in all policies.


Questions?
