Threads vs. Caches: Modeling the Behavior of Parallel Workloads

Zvika Guz¹, Oved Itzhak¹, Idit Keidar¹, Avinoam Kolodny¹, Avi Mendelson², and Uri C. Weiser¹

¹Technion – Israel Institute of Technology, ²Microsoft Corporation

Chip-Multiprocessor Era

Challenges:

Single-core performance trend is gloomy: exploit chip-multiprocessors with multithreaded applications

The memory gap is paramount: latency, bandwidth, power

[Figure: Hennessy and Patterson, Computer Architecture: A Quantitative Approach]

Two basic remedies:

Cache: reduce the number of off-die memory accesses

Multi-threading: hide memory accesses behind the execution of other threads

How do they play together?

How do we make the most out of them?

Outline

The many-core span: Cache-Machines ↔ MT-Machines

A high-level analytical model

Performance curves study

A few examples

Summary

Cache-Machines vs. MT-Machines

Many-Core: a CMP with many simple cores, tens to hundreds of Processing Elements (PEs)

[Figure: the many-core design space, with axes # of Threads and Cache/Thread (thread context). Regions: Uni-Processor, Multi-Core, Cache Architecture, MT Architecture. Machines shown: Intel’s Larrabee, Nvidia’s GT200, Nvidia’s Fermi.]

What are the basic tradeoffs?

How will workloads behave across the range?

A Unified Machine Model

Use both cache and many threads to shield memory access

The uniform framework renders the comparison meaningful

We derive simple, parameterized equations for performance, power, bandwidth, etc.

[Figure: the unified machine, showing the threads' architectural states and the path to memory]

Cache Machines

Many cores (each may have its private L1) behind a shared cache

[Figure: cache machine with a path to memory; plot of performance vs. # threads with the Cache Non Effective point (CNE) marked]

Multi-Thread Machines

Memory latency shielded by multiple thread execution

[Figure: MT machine with threads' architectural states and a path to memory; plot of performance vs. # threads with max performance marked; timeline legend: execution, memory access]

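Analysis (1/3)

Given a ratio of memory access instructions \(r_m\) (\(0 \le r_m \le 1\)):

One in every \(1/r_m\) instructions accesses memory

A thread executes \(1/r_m\) instructions, then stalls for \(t_{avg}\) cycles

\(t_{avg}\) = Average Memory Access Time (AMAT) [cycles]

[Figure: thread context timeline, t [cycles]; each execution burst lasts \(\frac{1}{r_m} \cdot CPI_{exe}\) cycles]

Analysis (2/3)

The PE stays idle unless filled with instructions from other threads

Each thread occupies the PE for an additional \(\frac{1}{r_m} \cdot CPI_{exe}\) cycles

\[ 1 + \frac{t_{avg} \cdot r_m}{CPI_{exe}} \]

threads are needed to fully utilize each PE

[Figure: thread context timeline, t [cycles]; loads (ld) from successive threads, each followed by a burst of \(\frac{1}{r_m} \cdot CPI_{exe}\) cycles]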
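
The thread-count requirement can be checked numerically. This is a minimal sketch, assuming the reading that a thread runs for CPI_exe / r_m cycles between memory accesses and then stalls for t_avg cycles, so that 1 + t_avg · r_m / CPI_exe threads keep one PE busy. The function name and the sample values are illustrative and do not come from the slides.

```python
def threads_to_saturate_pe(t_avg: float, r_m: float, cpi_exe: float) -> float:
    """Threads needed to keep one PE busy: 1 + t_avg * r_m / CPI_exe."""
    if not 0.0 < r_m <= 1.0:
        raise ValueError("r_m must be in (0, 1]")
    busy_cycles = cpi_exe / r_m          # cycles a thread runs before it stalls
    # One running thread plus enough others to cover its stall time.
    return 1.0 + t_avg / busy_cycles


# Illustrative values (assumptions): 20% memory instructions, CPI_exe = 1,
# average memory access time of 100 cycles.
print(threads_to_saturate_pe(t_avg=100.0, r_m=0.2, cpi_exe=1.0))  # 21.0
```

With these assumed numbers a thread runs for 5 cycles and stalls for 100, so 20 other threads are needed to fill the gap.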

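Analysis (3/3)

Machine utilization:

\[ \eta = \min\left(1,\ \frac{n_{threads}}{N_{PE} \cdot \left(1 + \frac{t_{avg} \cdot r_m}{CPI_{exe}}\right)}\right) \]

where \(n_{threads}\) is the number of available threads and \(1 + t_{avg} \cdot r_m / CPI_{exe}\) is the number of threads needed to utilize a single PE

Performance in Operations Per Second [OPS]:

\[ Performance = N_{PE} \cdot \frac{f}{CPI_{exe}} \cdot \eta \quad [OPS] \]

where \(N_{PE} \cdot f / CPI_{exe}\) is the peak performance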
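
The utilization and performance expressions translate directly into code. This is a sketch under the assumption that utilization is min(1, n_threads / (N_PE · (1 + t_avg · r_m / CPI_exe))) and that performance is peak performance (N_PE · f / CPI_exe) scaled by that utilization. The function names and the machine values are hypothetical.

```python
def machine_utilization(n_threads: int, n_pe: int, t_avg: float,
                        r_m: float, cpi_exe: float) -> float:
    """Fraction of PE cycles doing useful work, capped at 1."""
    threads_per_pe = 1.0 + t_avg * r_m / cpi_exe   # threads to saturate one PE
    return min(1.0, n_threads / (n_pe * threads_per_pe))


def performance_ops(n_threads: int, n_pe: int, f_hz: float, t_avg: float,
                    r_m: float, cpi_exe: float) -> float:
    """Performance in operations per second: peak performance times utilization."""
    peak = n_pe * f_hz / cpi_exe
    return peak * machine_utilization(n_threads, n_pe, t_avg, r_m, cpi_exe)


# Assumed machine: 1024 PEs at 1 GHz, CPI_exe = 1, r_m = 0.2, t_avg = 100 cycles.
# Full utilization needs 1024 * 21 = 21504 threads; half that many gives eta = 0.5.
print(machine_utilization(10752, 1024, 100.0, 0.2, 1.0))
print(performance_ops(10752, 1024, 1e9, 100.0, 0.2, 1.0))
```

Adding threads beyond 21504 in this example changes nothing, because utilization is capped at 1.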

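Performance Model

\[ Performance = \min\left( N_{PE} \cdot \frac{f}{CPI_{exe}} \cdot \eta,\ \frac{BW_{max}}{r_m \cdot b_{reg} \cdot \left(1 - P_{hit}(S_{\$}, n_{threads})\right)} \right) \quad [OPS] \]

The first term is set by PE utilization, the second by off-chip BW

Machine utilization:

\[ \eta = \min\left(1,\ \frac{n_{threads}}{N_{PE} \cdot \left(1 + \frac{t_{avg} \cdot r_m}{CPI_{exe}}\right)}\right) \]

AMAT:

\[ t_{avg} = P_{hit}(S_{\$}, n_{threads}) \cdot t_{\$} + \left(1 - P_{hit}(S_{\$}, n_{threads})\right) \cdot t_m \quad [cycles] \]

Energy terms:

\[ e_{ex} + r_m \cdot \left[ P_{hit}(S_{\$}, n) \cdot e_{\$} + \left(1 - P_{hit}(S_{\$}, n)\right) \cdot e_{mem} \right] \]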
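
Putting the pieces together, the model can be evaluated at a single operating point. The sketch below assumes that performance is the smaller of the PE-limited term and the bandwidth-limited term BW_max / (r_m · b_reg · (1 − P_hit)), with AMAT computed from the hit rate. The hit rate is passed in as a plain number because the slides do not give a closed form for P_hit. Bandwidth is taken in bytes per second, and all names and values are illustrative assumptions.

```python
def amat_cycles(p_hit: float, t_cache: float, t_mem: float) -> float:
    """Average memory access time: hits pay t_cache, misses pay t_mem."""
    return p_hit * t_cache + (1.0 - p_hit) * t_mem


def model_performance_ops(n_threads: int, p_hit: float, *, n_pe: int, f_hz: float,
                          cpi_exe: float, r_m: float, t_cache: float, t_mem: float,
                          bw_max: float, b_reg: float) -> float:
    """Performance [OPS] as the minimum of the PE-limited and BW-limited terms."""
    t_avg = amat_cycles(p_hit, t_cache, t_mem)
    eta = min(1.0, n_threads / (n_pe * (1.0 + t_avg * r_m / cpi_exe)))
    pe_limited = n_pe * f_hz / cpi_exe * eta
    miss_bytes_per_op = r_m * b_reg * (1.0 - p_hit)   # off-chip bytes per operation
    if miss_bytes_per_op == 0.0:                      # no misses: no BW limit
        return pe_limited
    return min(pe_limited, bw_max / miss_bytes_per_op)


# Assumed machine parameters (hypothetical, for illustration only).
machine = dict(n_pe=1024, f_hz=1e9, cpi_exe=1.0, r_m=0.2,
               t_cache=1.0, t_mem=200.0, bw_max=200e9, b_reg=4.0)

# Many threads, no cache hits: the machine is bandwidth bound.
print(model_performance_ops(100_000, 0.0, **machine))
# Few threads, perfect hit rate: the machine is PE bound.
print(model_performance_ops(1024, 1.0, **machine))
```

In the first call the PE-limited term would allow about 1.0e12 OPS, but the assumed 200 GB/s of off-chip bandwidth caps performance at about 2.5e11 OPS.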

Unified Machine Performance

3 regions: Cache efficiency region, The Valley, MT efficiency region

[Figure: performance vs. number of threads, showing the cache efficiency region, The Valley, and the MT region]
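
The three regions can be reproduced with a small sweep over the thread count. The sketch below is only an illustration: the hit-rate function is hypothetical (threads share the cache evenly and hit in proportion to the share of their working set that fits), off-chip bandwidth is assumed unlimited, and every parameter value is an assumption rather than a figure from the slides.

```python
def hit_rate(cache_bytes: float, n_threads: int, ws_bytes: float) -> float:
    """Hypothetical P_hit(S$, n): fraction of each thread's working set that fits."""
    return min(1.0, cache_bytes / (n_threads * ws_bytes))


def utilization(n_threads: int, *, n_pe: int = 1024, cpi_exe: float = 1.0,
                r_m: float = 0.2, t_cache: float = 1.0, t_mem: float = 200.0,
                cache_bytes: float = 16 * 2**20, ws_bytes: float = 16 * 2**10) -> float:
    """Machine utilization (performance relative to peak), unlimited off-chip BW."""
    p_hit = hit_rate(cache_bytes, n_threads, ws_bytes)
    t_avg = p_hit * t_cache + (1.0 - p_hit) * t_mem
    return min(1.0, n_threads / (n_pe * (1.0 + t_avg * r_m / cpi_exe)))


# Sweep from the cache region, through the valley, into the MT region.
sweep = {n: utilization(n) for n in (256, 1024, 2048, 4096, 16384, 65536)}
for n, eta in sweep.items():
    print(f"{n:>6} threads: utilization {eta:.3f}")
```

With these assumed parameters the cache holds the working sets of 1024 threads, so utilization climbs to roughly 0.83 there, collapses to roughly 0.09 at 2048 threads once the hit rate halves, and recovers to 1.0 only when tens of thousands of threads hide the memory latency.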

Cache Size Impact

An increase in cache size means the cache suffices for more in-flight threads, which extends the $ region

...and it is also valuable in the MT region: caches reduce off-chip bandwidth, which delays the BW saturation point

[Figure: Performance for Different Cache Sizes (Limited BW); performance vs. number of threads for increasing cache sizes, up to a perfect $]
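
The effect of the cache on the bandwidth saturation point can be illustrated with the bandwidth-limited term alone. This is a sketch assuming the ceiling BW_max / (r_m · b_reg · (1 − P_hit)), with bandwidth in bytes per second. The function name and the numbers are hypothetical.

```python
def bw_limited_ops(p_hit: float, *, bw_max: float = 200e9,
                   r_m: float = 0.2, b_reg: float = 4.0) -> float:
    """Performance ceiling [OPS] imposed by off-chip bandwidth."""
    miss_bytes_per_op = r_m * b_reg * (1.0 - p_hit)   # off-chip bytes per operation
    if miss_bytes_per_op == 0.0:
        return float("inf")                           # a perfect cache never saturates BW
    return bw_max / miss_bytes_per_op


for p_hit in (0.0, 0.5, 0.9):
    print(f"P_hit={p_hit}: ceiling {bw_limited_ops(p_hit):.3g} OPS")
```

Under these assumptions a hit rate of 0.5 doubles the ceiling relative to no cache hits, and a hit rate of 0.9 raises it tenfold, so a larger cache lets more threads run before bandwidth saturates.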

Hit Rate Function Impact

Simulation results from the PARSEC workloads kit

Swaptions: Perfect Valley

[Figure: Swaptions; curves: Analytical Model, Simulation, Cache Hit Rate; x-axis: Number Of Threads, 0 to 2000]

Hit Rate Function Impact

Simulation results from the PARSEC workloads kit

Raytrace: Monotonically-increasing performance

[Figure: Raytrace; curves: Analytical Model, Simulation, Cache Hit Rate; x-axis: Number Of Threads, 0 to 2000]

Hit Rate Dependency – 3 Classes

Three application families based on cache miss rate dependency:

A “strong” function of number of threads: \(f(N^q)\) when \(q > 1\)

A “weak” function of number of threads: \(f(N^q)\) when \(q \le 1\)

Not a function of number of threads

[Figure: performance vs. # threads for the three classes]
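
The three classes can be told apart by how the miss rate reacts when the thread count doubles. The sketch below assumes one particular form, miss rate = c · N^q capped at 1, purely for illustration. The slides only state the dependency as f(N^q), so the constant c, the cap, and the use of q = 0 for the thread-independent class are assumptions.

```python
def miss_rate(n_threads: int, q: float, c: float = 1e-3) -> float:
    """Hypothetical miss rate c * N^q, capped at 1; q = 0 gives a constant."""
    return min(1.0, c * n_threads ** q)


classes = (("strong (q=2)", 2.0), ("weak (q=0.5)", 0.5), ("independent (q=0)", 0.0))
for label, q in classes:
    growth = miss_rate(20, q) / miss_rate(10, q)
    print(f"{label}: doubling threads multiplies the miss rate by {growth:.2f}")
```

Doubling the thread count quadruples the miss rate in the strong case, grows it by a factor of about 1.41 in the weak case, and leaves it unchanged in the independent case.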

Workload Parallelism Impact

Simulation results from the PARSEC workloads kit

Canneal: Not enough parallelism available

[Figure: Canneal; curves: Simulation, Analytical Model, Cache Hit Rate; x-axis: Number Of Threads, 0 to 2000]

Summary

A high-level model for many-core engines

A unified framework for machines and workloads from across the range

A vehicle to derive intuition

Qualitative study of the tradeoffs

A tool to understand the impact of parameters

Identifies new behaviors and the applications that exhibit them

Enables reasoning about complex phenomena

First step towards escaping the valley

Thank You!

zguz@tx.technion.ac.il

Backup

Model Parameters

Machine parameters:

Parameter   Description
N_PE        Number of PEs (in-order processing elements)
S_$         Cache size [Bytes]
N_max       Maximal number of thread contexts in the register file
CPI_exe     Average number of cycles required to execute an instruction assuming a perfect (zero-latency) memory system [cycles]
f           Processor frequency [Hz]
t_$         Cache latency [cycles]
t_m         Memory latency [cycles]
BW_max      Maximal off-chip bandwidth [GB/sec]
b_reg       Operand size [Bytes]

Model Parameters

Workload parameters:

Parameter    Description
n            Number of threads that execute or are in ready state (not blocked) concurrently
r_m          Fraction of instructions accessing memory out of the total number of instructions [0 ≤ r_m ≤ 1]
P_hit(s, n)  Cache hit rate for each thread, when n threads are using a cache of size s

Model Parameters

Power parameters:

Parameter      Description
e_ex           Energy per operation [J]
e_$            Energy per cache access [J]
e_mem          Energy per memory access [J]
Power_leakage  Leakage power [W]

Model Validation, PARSEC Workloads

[Figures: one panel per workload (Raytrace, Canneal, Bodytrack, Swaptions, Blackscholes); curves: Analytical Model, Simulation, Cache Hit Rate; x-axis: Number Of Threads, 0 to 2000 (0 to 200 for Bodytrack)]

Related Work

Similar approach of using high-level models:

Morad et al., CA-Letters 2005

Hill and Marty, IEEE Computer 2008

Eyerman and Eeckhout, ISCA-2010

Related Work

Agrawal, TPDS-1992

Saavedra-Barrera and Culler, Berkeley 1991

Sorin et al., ISCA-1998

Hong and Kim, ISCA-2009

Baghsorkhi et al., PPoPP-2010
