Exploring Memory Consistency for Massively Threaded Throughput-Oriented Processors

Blake Hechtman, Daniel J. Sorin


Transcript of Exploring Memory Consistency for Massively Threaded Throughput-Oriented Processors

Page 1: Exploring Memory Consistency for Massively Threaded Throughput-Oriented Processors

Exploring Memory Consistency for Massively Threaded Throughput-Oriented Processors

Blake Hechtman, Daniel J. Sorin

Page 2

Executive Summary

• Massively Threaded Throughput-Oriented Processors (MTTOPs) like GPUs are being integrated on chips with CPUs and being used for general purpose programming

• Conventional wisdom favors weak consistency on MTTOPs

• We implement a range of memory consistency models (SC, TSO and RMO) on MTTOPs

• We show that strong consistency is viable for MTTOPs

Page 3

What is an MTTOP?

• Massively Threaded:
  – 4-16 core clusters
  – 8-64 threads wide SIMD
  – 64-128 deep SMT
  → Thousands of concurrent threads

• Throughput-Oriented:
  – Sacrifice latency for throughput
    • Heavily banked caches and memories
    • Many cores, each of which is simple

Page 4

Example MTTOP

[Diagram: an example MTTOP. Many core clusters, each with fetch/decode logic, SIMD execution lanes, and an L1 cache, connected to banked L2 caches and memory controllers over cache-coherent shared memory.]

Page 5

What is Memory Consistency?

Initially A = B = 0

Thread 0     Thread 1
ST B = 1     ST A = 1
LD r1, A     LD r2, B

Sequential Consistency: {r1,r2} = {0,1}, {1,0}, or {1,1}
Weak Consistency: {r1,r2} = {0,1}, {1,0}, {1,1}, or {0,0} → enables store buffering

MTTOP hardware concurrency seems likely to be constrained by Sequential Consistency (SC)

In this work, we explore hardware consistency models

Page 6

(CPU) Memory Consistency Debate

• Conclusion for CPUs: trading off ~10-40% performance for programmability
  – “Is SC + ILP = RC?” (Gniady et al., ISCA 1999)

                   Strong Consistency   Weak Consistency
  Performance      Slower               Faster
  Programmability  Easier               Harder

But does this conclusion apply to MTTOPs?

Page 7

Memory Consistency on MTTOPs

• GPUs have undocumented hardware consistency models

• Intel MIC uses x86-TSO for the full chip with directory cache coherence protocol

• MTTOP programming languages provide weak ordering guarantees
  – OpenCL does not guarantee store visibility without a barrier or kernel completion
  – CUDA includes a memory fence that can enable global store visibility

Page 8

MTTOP Conventional Wisdom

• Highly parallel systems benefit from less ordering
  – Graphics doesn’t need ordering

• Strong consistency seems likely to limit MLP
• Strong consistency is likely to suffer extra latencies

Weak ordering helps CPUs, but does it help MTTOPs? It depends on how MTTOPs differ from CPUs…

Page 9

Diff 1: Ratio of Loads to Stores

MTTOPs perform more loads per store → store latency optimizations will not be as critical to MTTOP performance

Prior work shows CPUs perform 2-4 loads per store

[Chart: loads per store (log scale, 1 to 10000) for 2dconv, barnes, bfs, dijkstra, fft, hotspot, kmeans, matrix_mul, and nn.]

Weak Consistency reduces impact of store latency on performance

Page 10

Diff 2: Outstanding L1 Cache Misses

Weak consistency enables more outstanding L1 misses per thread.

MTTOPs have more outstanding L1 cache misses → thread reordering enabled by weak consistency is less important to handle memory latency

                      CPU Core               MTTOP Core (CU/SM)
threads per core      4                      64
SIMD width            4                      64
L1 miss rate          0.1                    0.5
SC misses per core    1.6 (too few misses)   2048 (enough misses)
…                     …                      …
RMO misses per core   6.4                    8192

Page 11

Diff 3: Memory System Latencies

[Diagram: CPU core (fetch, decode, issue/select, ROB, LSQ, execution units): L1 hit 1-2 cycles, L2 hit 5-20 cycles, memory 100-500 cycles. MTTOP core cluster (fetch, decode, SIMD lanes, L1): L1 hit 10-70 cycles, L2 hit 100-300 cycles, memory 300-1000 cycles.]

MTTOPs have longer memory latencies → small latency savings will not significantly improve performance

Weak consistency enables reductions of store latencies

Page 12

Diff 4: Frequency of Synchronization

MTTOPs have more threads to compute a problem → each thread will have fewer independent memory ops between syncs.

Typical pattern on both CPUs and MTTOPs:
  split problem into regions
  assign regions to threads
  do: work on local region; synchronize

Weak consistency only re-orders memory ops between syncs

CPU local region: ~private cache size
MTTOP local region: ~private cache size / threads per cache

Page 13

Diff 5: RAW Dependences Through Memory

MTTOP algorithms have fewer RAW memory dependences → there is little benefit to being able to read from a write buffer

CPUs (many RAW dependences through memory):
• Blocking for cache performance
• Frequent function calls
• Few architected registers

MTTOPs (few RAW dependences through memory):
• Coalescing for cache performance
• Inlined function calls
• Many architected registers

Weak consistency enables store-to-load forwarding

Page 14

MTTOP Differences & Their Impact

• Other differences are mentioned in the paper
• How much do these differences affect performance of memory consistency implementations on MTTOPs?

Page 15

Memory Consistency Implementations

Four implementations, from strongest to weakest:

• SC simple: no write buffer
• SC wb: per-lane FIFO write buffer, drained on loads
• TSO: per-lane FIFO write buffer, drained on fences
• RMO: per-lane CAM for outstanding write addresses

Page 16

Methodology

• Modified gem5 to support SIMT cores running a modified version of the Alpha ISA

• Looked at typical MTTOP workloads
  – Had to port workloads to run in system model

• Ported Rodinia benchmarks– bfs, hotspot, kmeans, and nn

• Handwritten benchmarks– dijkstra, 2dconv, and matrix_mul

Page 17

Target MTTOP System

Parameter                          Value
core clusters                      16 core clusters; 8-wide SIMD
core                               in-order, Alpha-like ISA, 64-deep SMT
interconnection network            2D torus
coherence protocol                 writeback MOESI protocol
L1I cache (shared by cluster)      perfect, 1-cycle hit
L1D cache (shared by cluster)      16KB, 4-way, 20-cycle hit, no local memory
L2 cache (shared by all clusters)  256KB, 8 banks, 8-way, 50-cycle hit

Consistency model-specific features (give benefit to weaker models):
write buffer (SCwb and TSO)        perfect, instant access
CAM for store address matching     perfect, instant access

Page 18

Results

MTTOP Consistency Model Performance Comparison

[Chart: speedup (0 to 1.6) of SC, SC_WB, TSO, and RMO on 2dconv, barnes, bfs, dijkstra, fft, hotspot, kmeans, matrix_mul, and nn.]

Page 19

Results

MTTOP Consistency Model Performance Comparison

[Chart: the same speedup comparison, with the benchmarks that exhibit significant load re-ordering highlighted.]

Page 20

Conclusions

• Strong Consistency should not be ruled out for MTTOPs on the basis of performance

• Improving store performance with write buffers appears unnecessary

• Graphics-like workloads may get significant MLP from load reordering (dijkstra, 2dconv)

Conventional wisdom may be wrong about MTTOPs

Page 21

Caveats and Limitations

• Results do not necessarily apply to all possible MTTOPs or MTTOP software

• Evaluation used writeback caches, whereas current MTTOPs use write-through caches

• Potential of future workloads to be more CPU-like

Page 22

Thank you!

Questions?
