Dimitris Kaseridis¹, Jeff Stuecheli¹,², Jian Chen¹ & Lizy K. John¹


HPCA-16 2010

Laboratory for Computer Architecture 1/11/2010

Dimitris Kaseridis¹, Jeff Stuecheli¹,², Jian Chen¹ & Lizy K. John¹

¹ University of Texas – Austin   ² IBM – Austin


Motivation

Datacenters

– Widely deployed

– Multiple cores/sockets available

– Hierarchical cost of communication

• Core-to-core, socket-to-socket, and board-to-board

Datacenter-like multi-chip CMP systems


Motivation

Virtualization is the norm

– Multiple single-threaded workloads per system

– Placement decisions are based on high-level scheduling algorithms

– CMPs rely heavily on shared resources

– Destructive interference

– Unfairness

– Lack of QoS

– Optimizing within a single chip alone yields suboptimal solutions; explore opportunities both within and beyond a single chip

Most important shared resources in CMPs

– Last-level cache → capacity limits

– Memory bandwidth → bandwidth limits

Capacity and bandwidth partitioning are promising means of resource management


Motivation

Previous work focuses on a single chip

– Trial-and-error

+ lower complexity  − less efficient  − slow to react

– Artificial intelligence

+ better performance  − black box, difficult to tune  − high cost for accurate schemes

– Predictive, evaluating multiple solutions

+ more accurate  − higher complexity  − high cost of a wrong decision (drastic changes to configurations)

Need for low-overhead, non-invasive monitoring that efficiently drives resource management algorithms



Outline

Applications’ Profiling Mechanisms

– Cache capacity

– Memory bandwidth

Bandwidth-aware Resource Management Scheme

– Intra-chip allocation algorithm

– Inter-chip resource management

Evaluation


Applications’ Profiling Mechanisms


Overview: Resource-Requirements Profiling

Based on Mattson’s Stack-distance Algorithm (MSA)

Non-invasive, predictive

– Parallel monitoring on each core, assuming each core is assigned the whole LLC

– Monitors/predicts cache misses for every possible partition assignment

– Helps estimate ideal cache-partition sizes

Memory Bandwidth

– Two components

• Memory read traffic ← cache fills

• Memory write traffic ← dirty write-back traffic from the cache to main memory
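Putting the two together (a sketch in our own notation, not a formula from the slides): for a projected allocation of W ways with cache-line size L bytes,

\mathrm{BW}(W) \;=\; \underbrace{\mathrm{Miss}(W)\cdot L}_{\text{reads: cache fills}} \;+\; \underbrace{\mathrm{WB}(W)\cdot L}_{\text{writes: dirty write-backs}}

where Miss(W) and WB(W) are the per-epoch miss and write-back counts projected by the profilers described on the next slides.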


LLC Miss Profiling

Mattson’s stack-distance algorithm (MSA)

– Originally proposed to concurrently simulate many cache sizes

– Based on LRU inclusion property

– The monitoring structure is a true LRU cache

– The stack distance from the MRU position is recorded for each reference

– Misses can be calculated for any fraction of the ways (see the sketch below)
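To make this concrete, a minimal software sketch of such a profiler (our simplification: one fully associative set with true LRU; the hardware version adds the set sampling and hashed tags described later):

# Minimal MSA stack-distance profiler (sketch): a true-LRU stack in
# which each hit records its stack distance from the MRU position.
class MSAProfiler:
    def __init__(self, max_ways):
        self.max_ways = max_ways
        self.stack = []                       # cache tags, MRU first
        self.hit_counter = [0] * max_ways     # hits per stack distance
        self.cold_misses = 0                  # lines never seen before

    def access(self, tag):
        if tag in self.stack:
            self.hit_counter[self.stack.index(tag)] += 1
            self.stack.remove(tag)
        else:
            self.cold_misses += 1
            if len(self.stack) == self.max_ways:
                self.stack.pop()              # evict the true-LRU line
        self.stack.insert(0, tag)             # promote to MRU

    def misses(self, ways):
        # LRU inclusion property: a W-way cache misses every reference
        # whose stack distance is >= W, plus all cold misses.
        return self.cold_misses + sum(self.hit_counter[ways:])

Because of the inclusion property, one pass over the reference stream yields misses(W) for every possible allocation W at once, which is what lets each core be profiled as if it owned the whole LLC.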


MSA-based Bandwidth Profiling

Read traffic

– proportional to misses

– derived from the LLC miss profile

Write traffic

– Dirty lines evicted from the cache are sent back to memory

– On write-back caches, the traffic depends on the assigned cache partition

– Hit to a dirty line

• If the stack distance of the hit is larger than the assigned capacity, the line was already written back to main memory → traffic

• Otherwise it is a hit within the partition → no traffic

• Only one write-back per store should be counted

Monitoring Mechanism


MSA-based Bandwidth Profiling

Additions to the profiler

– Dirty bit: marks a dirty line

– Dirty stack distance (register): the largest stack distance at which the dirty line has been accessed

– Dirty_Counter: dirty accesses recorded for every LRU distance

Rules

– Track write traffic for all cache allocations

– The dirty bit is reset when the line is evicted from the whole monitored cache

– Track the greatest stack distance at which each stored line is referenced before eviction

– Keep a counter (Dirty_Counter) of these maxima per distance

Traffic estimation

– For a cache-size projection that uses W ways: write traffic = Σ Dirty_Counter_i for i = W … max_ways + 1, i.e. all stack distances beyond the assigned W ways, with the last bucket counting dirty evictions from the full monitored cache
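A single-set software sketch of one consistent reading of these rules, extending the MSAProfiler above (the hardware's exact charging policy may differ):

# Write-back traffic profiler (sketch). dirty_reg[tag] is the largest
# stack distance reached while the line has been dirty since its last
# store; dirty_counter[i] accumulates write-backs that occur only in
# projections of at most i ways, with index max_ways reserved for
# dirty evictions from the whole monitored cache (a write-back in
# every projection).
class WriteBackProfiler:
    def __init__(self, max_ways):
        self.max_ways = max_ways
        self.stack = []                           # tags, MRU first
        self.dirty_reg = {}                       # tag -> max dirty distance
        self.dirty_counter = [0] * (max_ways + 1)

    def access(self, tag, is_store):
        if tag in self.stack:
            d = self.stack.index(tag)
            if tag in self.dirty_reg:
                if is_store:
                    # Re-store: the previous store caused a separate
                    # write-back only in projections narrower than the
                    # deepest distance the dirty line reached.
                    self.dirty_counter[max(self.dirty_reg[tag], d)] += 1
                    self.dirty_reg[tag] = 0       # new store epoch
                else:
                    self.dirty_reg[tag] = max(self.dirty_reg[tag], d)
            self.stack.remove(tag)
        elif len(self.stack) == self.max_ways:
            victim = self.stack.pop()             # true-LRU eviction
            if victim in self.dirty_reg:          # dirty bit reset here
                del self.dirty_reg[victim]
                self.dirty_counter[self.max_ways] += 1
        self.stack.insert(0, tag)
        if is_store and tag not in self.dirty_reg:
            self.dirty_reg[tag] = 0               # set the dirty bit

    def write_traffic(self, ways):
        # Projected write-backs for a W-way allocation: the slide's
        # sum of Dirty_Counter_i over all distances beyond W.
        return sum(self.dirty_counter[ways:])

In this reading, a charge at bucket i contributes a write-back to every projection of at most i ways, and a dirty eviction from the full monitored cache contributes to all projections, which preserves the one-write-back-per-store rule in both the merged and non-merged cases.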


MSA-based Bandwidth Example


Profiling examples: milc, calculix, gcc

Different behavior in write traffic

– milc: no cache size fits; it continually updates complex matrix structures

– calculix: cache blocking of matrix and dot-product operations; once the working data is contained in the cache, only read traffic remains beyond the blocking size

– gcc: code generation; small caches are read-dominated due to data tables, larger caches are write-dominated due to the generated code output

Accurate monitoring of memory-bandwidth use is important


Hardware MSA Implementation

A naïve implementation is prohibitive

– Fully associative structure

– A complete cache directory at the maximum (total) cache size for every core on the CMP

H/W overhead reduction

– Set sampling

– Partial hashed tags: an XOR tree of tag bits (a sketch follows below)

– Maximum capacity assignable per core

Sensitivity analysis (details in the paper)

– 1-in-32 set sampling

– 11-bit partial hashed tags

– 9/16 maximal capacity

• LRU and dirty-stack registers: 6 bits each  • Hit and dirty counters: 32 bits each

– Overall: 117 Kbits, 1.4% of an 8MB LLC
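For illustration, one way such an 11-bit partial hashed tag could be computed by XOR-folding the full tag (the exact grouping of bits in the paper's XOR tree may differ):

def partial_hashed_tag(tag: int, bits: int = 11) -> int:
    # XOR-fold the full tag down to `bits` bits (one XOR tree per
    # output bit). Distinct tags may alias, which is why this is
    # paired with set sampling and tolerated as small profiling error.
    h = 0
    while tag:
        h ^= tag & ((1 << bits) - 1)   # fold in the next chunk of bits
        tag >>= bits
    return h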



Resource Management Scheme


Overall Scheme

Two-level approach

– Intra-chip partitioning algorithm: assign LLC capacity within a single chip to minimize misses

– Inter-chip partitioning algorithm: use the LLC assignments and memory-bandwidth profiles to find a better workload placement across the whole system

Epochs of 100M instructions are used to re-evaluate decisions and initiate migrations


Intra-chip Partitioning Algorithm

Based on Marginal Utility

Miss rate relative to capacity is non-linear and heavily workload-dependent

Miss rate drops dramatically once a workload's data structures become cache-contained

In practice

– Iteratively assign cache ways to the cores that produce the most additional hits per unit of capacity (see the sketch below)

O(n²) complexity
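A sketch of that loop (naming ours; each core's miss curve comes from its MSA profiler, indexed by the number of assigned ways):

def partition_llc(miss_curves, total_ways, min_ways=1):
    # Greedy marginal-utility partitioning: repeatedly grant one more
    # way to the core whose projected misses drop the most.
    # miss_curves[c][w] = projected misses of core c with w ways.
    n = len(miss_curves)
    alloc = [min_ways] * n                    # every core gets a minimum

    def gain(c):
        # Misses removed by granting core c one more way.
        if alloc[c] + 1 >= len(miss_curves[c]):
            return 0                          # core already saturated
        return miss_curves[c][alloc[c]] - miss_curves[c][alloc[c] + 1]

    for _ in range(total_ways - n * min_ways):
        alloc[max(range(n), key=gain)] += 1
    return alloc

For example, partition_llc([[100, 60, 20, 10, 8], [100, 90, 85, 82, 80]], total_ways=4) returns [3, 1], favoring the core whose misses fall fastest. With the number of ways proportional to the number of cores, this matches the O(n²) complexity noted above.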



Inter-chip Partitioning Algorithm

Workloads can be suboptimally assigned to chips, depending on the current execution phase of each workload

Two greedy algorithms that look across multiple chips

– Cache Capacity

– Memory Bandwidth

Cache Capacity

1. Estimate each core's ideal capacity assignment, assuming the whole cache belongs to that core

2. Find the worst-served core on each chip

3. Find the chips with the largest surplus of ways (ways not significantly contributing to miss reduction)

4. Greedily propose workload swaps between chips

5. Bound the swaps with a threshold to keep migrations down

6. Perform the finally selected migrations (a condensed sketch follows)
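A condensed, hypothetical sketch of steps 1 to 6 (data layout, swap bound, and names are our assumptions): chips[i] maps each workload, assumed to have a unique name, to its 'ideal' ways from step 1 and its currently 'assigned' ways.

def plan_capacity_swaps(chips, max_swaps=4):
    # Greedy inter-chip capacity rebalancing (sketch).
    def deficit(info):                        # ways the workload still wants
        return info['ideal'] - info['assigned']

    swaps = []
    for _ in range(max_swaps):                # step 5: bound migrations
        # Step 2: the chip holding the most capacity-starved workload.
        src = max(range(len(chips)),
                  key=lambda i: max(deficit(v) for v in chips[i].values()))
        # Step 3: the chip with the largest surplus of ways.
        dst = max(range(len(chips)),
                  key=lambda i: sum(max(-deficit(v), 0)
                                    for v in chips[i].values()))
        hungry = max(chips[src], key=lambda k: deficit(chips[src][k]))
        if src == dst or deficit(chips[src][hungry]) <= 0:
            break                             # no profitable swap remains
        # Step 4: swap the starved job with dst's least-demanding job.
        modest = min(chips[dst], key=lambda k: deficit(chips[dst][k]))
        chips[dst][hungry] = chips[src].pop(hungry)
        chips[src][modest] = chips[dst].pop(modest)
        swaps.append((hungry, src, dst, modest))
    return swaps                              # step 6: migrations to perform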


Bandwidth Algorithm Example

[Figure: four workloads, A = lbm, B = calculix, C = bwaves, D = zeusmp, redistributed across two chips]

Memory Bandwidth

The algorithm finds combinations of low- and high-bandwidth-demand cores

Migrate high-bandwidth jobs to low-bandwidth chips

Migrated jobs should have similar partition sizes (within a 10% bound)

Repeat until no chip is over-committed or no additional reduction is possible (see the sketch below)
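In sketch form (the 10% bound is from the slide; the data layout, bandwidth budget parameter, and swap cap are our assumptions): chips[i] maps each workload to its profiled bandwidth demand 'bw' and assigned 'ways'.

def plan_bandwidth_migrations(chips, limit, bound=0.10, max_swaps=8):
    # Greedy bandwidth balancing (sketch): pair the most over-committed
    # chip with the least-loaded one and swap jobs of similar size.
    def load(i):
        return sum(v['bw'] for v in chips[i].values())

    swaps = []
    for _ in range(max_swaps):
        hot = max(range(len(chips)), key=load)
        cold = min(range(len(chips)), key=load)
        if load(hot) <= limit or hot == cold:
            break                              # no over-committed chip left
        hot_job = max(chips[hot], key=lambda k: chips[hot][k]['bw'])
        ways = chips[hot][hot_job]['ways']
        # Candidates must have similar partitions (the 10% bound) and
        # lower demand, so the swap actually reduces the overload.
        partners = [k for k, v in chips[cold].items()
                    if abs(v['ways'] - ways) <= bound * ways
                    and v['bw'] < chips[hot][hot_job]['bw']]
        if not partners:
            break                              # no additional reduction
        cold_job = min(partners, key=lambda k: chips[cold][k]['bw'])
        chips[cold][hot_job] = chips[hot].pop(hot_job)
        chips[hot][cold_job] = chips[cold].pop(cold_job)
        swaps.append((hot_job, hot, cold, cold_job))
    return swaps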


Evaluation


Methodology: Workloads

– 64 cores: 8 chips of 8-core CMPs, running mixes drawn from the 29 SPEC CPU2006 workloads

– Which benchmark mixes? Choosing 8 of the 29 benchmarks (with repetition) gives ≈30 million possible mixes
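A quick check of that figure, assuming mixes are multisets of 8 drawn from the 29 benchmarks:

from math import comb
# Multisets of size 8 drawn from 29 benchmarks: C(29 + 8 - 1, 8).
print(comb(36, 8))   # 30260340, i.e. about 30 million mixes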

High level: Monte Carlo

– Compare the intra- and inter-chip algorithms against an equal-partitions assignment

– Show the algorithm works across many cases and configurations

– 1000 experiments

Detailed simulation

– Cycle-accurate, full-system: Simics + GEMS + CMP-DNUCA + profiling mechanisms + cache partitioning

Comparison: Utility-based Cache Partitioning (UCP+), modified for our DNUCA CMP

• Considers only last-level cache misses

• Uses marginal utility on a single chip to assign capacity


High-level results: LLC misses

25.7% average miss reduction over simple equal partitions

Average 7.9% reduction over UCP+

Significant reductions with only 1.4% overhead, for monitoring mechanisms that UCP+ already requires anyway

As the LLC grows, the surplus of ways increases, creating more migration opportunities for the inter-chip algorithm

[Figures: relative miss rate, and relative reduction of the bandwidth-aware scheme over UCP+]


High-level results: memory bandwidth

UCP+'s bandwidth reductions come solely from its miss-rate reduction: 19% over equal partitions

Average 18% reduction over UCP+ and 36% over equal partitions

Gains are larger with smaller caches, where contention is higher

As the number of chips increases, the inter-chip algorithm finds more opportunities

[Figures: relative bandwidth reduction, and relative reduction of the bandwidth-aware scheme over UCP+]

Full-system case studies


Case 1

8.6% IPC improvement and 15.3% MPKI reduction

Swap between Chip 4 {bwaves, mcf} and Chip 7 {povray, calculix}

Case 2

8.5% IPC improvement and 11% MPKI reduction

Chip 7 was over-committed in memory bandwidth

bwaves (Chip 7) ↔ zeusmp (Chip 2)

gcc (Chip 7) ↔ gamess (Chip 6)


Conclusions

As the number of cores in a system increases, resource contention becomes a dominating factor

Memory bandwidth is a significant factor in system performance and should always be considered in memory-resource management

The bandwidth-aware scheme achieved an 18% reduction in memory bandwidth and an 8% reduction in miss rate over existing partitioning techniques, and more than 25% over schemes with no partitioning

The overall improvement can justify the proposed monitoring mechanisms' cost of only 1.4% overhead, much of which may already exist in predictive single-chip schemes


Thank You

Questions?

Laboratory for Computer Architecture

The University of Texas at Austin

Backup Slides


Misses: absolute and effective error


Bandwidth: absolute and effective error


Overhead analysis
