DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism

Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou
University of Illinois at Urbana-Champaign and Intel
[email protected]



Transcript of DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism

Page 1: DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism

DeNovo: Rethinking the Multicore Memory Hierarchy for Disciplined Parallelism

Byn Choi, Rakesh Komuravelli, Hyojin Sung, Robert Smolinski, Nima Honarmand, Sarita V. Adve, Vikram S. Adve, Nicholas P. Carter, Ching-Tsun Chou

University of Illinois at Urbana-Champaign and Intel [email protected]

Page 2

Motivation

• Goal: Safe and efficient parallel computing
  – Easy, safe programming model
  – Complexity-, power-, performance-scalable hardware
• Today: shared memory
  – Complex, power- and performance-inefficient hardware
    * Complex directory coherence, unnecessary traffic, ...
  – Difficult programming model
    * Data races, non-determinism, composability/modularity?, testing?
  – Mismatched interface between HW and SW, a.k.a. the memory model
    * Can't specify "what value a read can return"
    * Data races defy acceptable semantics

Fundamentally broken for hardware & software

Page 3

Motivation

Banish shared memory?

Page 4

Motivation

Banish wild shared memory! Need disciplined shared memory!

Page 5

What is Shared-Memory?

Shared-Memory = Global address space + Implicit, anywhere communication, synchronization


Page 7

What is Shared-Memory?

Wild Shared-Memory = Global address space + Implicit, anywhere communication, synchronization


Page 9

What is Shared-Memory?

Disciplined Shared-Memory = Global address space + Implicit, anywhere communication, synchronization → Explicit, structured side-effects

Page 10

Benefits of Explicit Effects

Disciplined Shared Memory = explicit effects + structured parallel control

• Strong safety properties: determinism-by-default, explicit & safe non-determinism; simple semantics, composability, testability (Deterministic Parallel Java, DPJ)
• Efficiency in complexity, performance, and power: simplify coherence and consistency; optimize communication and storage layout (DeNovo)

Simple programming model AND complexity-, power-, performance-scalable hardware

Page 11

DeNovo Hardware Project
• Exploit discipline for efficient hardware
  – Current driver is DPJ (Deterministic Parallel Java)
  – End goal is a language-oblivious interface
• Software research strategy
  – Deterministic codes
  – Safe non-determinism
  – OS, legacy, …
• Hardware research strategy
  – On-chip: coherence, consistency, communication, data layout
  – Off-chip: similar ideas apply, next step

Page 12

Current Hardware Limitations
• Complexity
  – Subtle races and numerous transient states in the protocol
  – Hard to extend for optimizations
• Storage overhead
  – Directory overhead for sharer lists
• Performance and power inefficiencies
  – Invalidation and ack messages
  – False sharing
  – Indirection through the directory
  – Fixed cache-line communication granularity

Page 13

Current Hardware Limitations → Contributions

• Complexity (subtle races and numerous transient states; hard to extend) → no transient states; simple to extend for optimizations
• Storage overhead (directory overhead for sharer lists) → no storage overhead for directory information
• Performance and power inefficiencies → no invalidation and ack messages; no false sharing; no indirection through the directory; flexible, not cache-line, communication

Up to 81% reduction in memory stall time

Page 14

Outline
• Motivation
• Background: DPJ
• Base DeNovo Protocol
• DeNovo Optimizations
• Evaluation
  – Complexity
  – Performance
• Conclusion and Future Work

Page 15

Background: DPJ [OOPSLA 09]
• Extension to Java; fully Java-compatible
• Structured parallel control: nested fork-join style
  – foreach, cobegin
• A novel region-based type and effect system
• Speedups close to hand-written Java programs
• Expressive enough for irregular, dynamic parallelism

Page 16

Regions and Effects
• Region: a name for a set of memory locations
  – Programmer assigns a region to each field and array cell
  – Regions partition the heap
• Effect: a read or write on a region
  – Programmer summarizes the effects of method bodies
• Compiler checks that
  – Region types are consistent and effect summaries are correct
  – Parallel tasks are non-interfering (no conflicts)
  – Simple, modular type checking (no inter-procedural …)
• Programs that type-check are guaranteed to be deterministic

Page 17

Example: A Pair Class

Declaring and using region names

class Pair {
  region Blue, Red;
  int X in Blue;
  int Y in Red;
  void setX(int x) writes Blue { this.X = x; }
  void setY(int y) writes Red { this.Y = y; }
  void setXY(int x, int y) writes Blue; writes Red {
    cobegin {
      setX(x); // writes Blue
      setY(y); // writes Red
    }
  }
}

(Diagram: a Pair object with field X = 3 in region Pair.Blue and field Y = 42 in region Pair.Red)

Region names have static scope (one per class)

Page 18

Example: A Pair Class

Writing method effect summaries

(Same Pair code as above, highlighting the "writes Blue" and "writes Red" effect summaries on each method)

Page 19

Example: A Pair Class

Expressing parallelism

(Same Pair code as above, highlighting the cobegin block; the "// writes Blue" and "// writes Red" effects on its two branches are inferred)

Page 20

Outline
• Motivation
• Background: DPJ
• Base DeNovo Protocol
• DeNovo Optimizations
• Evaluation
  – Complexity
  – Performance
• Conclusion and Future Work

Page 21

Memory Consistency Model
• Guaranteed determinism

A read returns the value of the last write in sequential order:
1. By the same task in this parallel phase
2. Or from before this parallel phase

(Diagram: LD 0xa in a parallel phase returns the value of the ST 0xa by the same task within the phase, or of the ST 0xa before the phase)
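This read rule can be captured by a tiny executable model (a sketch in Python; the write-log representation and names are illustrative, not part of DeNovo):

```python
# Sketch of DeNovo's consistency rule: a read returns the value of the
# last write, in sequential order, that was either (1) by the same task
# in this parallel phase or (2) from before this parallel phase.
# Writes by other tasks in the current phase are never visible
# (data-race freedom guarantees no task needs them mid-phase).

def read_value(addr, task, writes):
    """writes: (addr, task, phase, value) tuples in sequential order,
    where phase is "before" or "current". Illustrative model only."""
    visible = [v for (a, t, phase, v) in writes
               if a == addr and (phase == "before" or t == task)]
    return visible[-1] if visible else None

log = [("0xa", "T0", "before", 1),    # pre-phase write: visible to all
       ("0xa", "T1", "current", 2),   # other task's write: not visible
       ("0xa", "T2", "current", 3)]   # T2's own write
read_value("0xa", "T2", log)  # -> 3 (its own last write)
read_value("0xa", "T0", log)  # -> 1 (the pre-phase write)
```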

Page 22

(Build: when the last write to 0xa happened before this parallel phase, a coherence mechanism must deliver that value to LD 0xa)

Page 23

Cache Coherence
• Coherence enforcement
  1. Invalidate stale copies in caches
  2. Track the up-to-date copy
• Explicit effects
  – Compiler knows all regions written in this parallel phase
  – Cache can self-invalidate before the next parallel phase
    * Invalidates data in writeable regions not accessed by itself
• Registration
  – Directory keeps track of one up-to-date copy
  – Writer updates the directory before the next parallel phase
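The self-invalidation step can be sketched as a small Python model (illustrative only; the cache and region structures are hypothetical, not the hardware design):

```python
# Sketch of DeNovo self-invalidation at a parallel-phase boundary:
# Valid words in regions written during the phase are invalidated,
# unless this core itself accessed them (so they are known up to date).

def self_invalidate(cache, written_regions, touched):
    for addr, (state, region) in cache.items():
        if state == "Valid" and region in written_regions and addr not in touched:
            cache[addr] = ("Invalid", region)
    return cache

cache = {
    "X0": ("Registered", "Blue"),  # this core wrote X0 this phase
    "Y0": ("Valid", "Red"),        # Red not written anywhere: stays Valid
    "X3": ("Valid", "Blue"),       # possibly written by another core: stale
}
self_invalidate(cache, written_regions={"Blue"}, touched={"X0"})
# "X3" becomes Invalid; "Y0" and "X0" are unchanged
```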

Page 24

Basic DeNovo Coherence
• Assume (for now): private L1, shared L2; single-word line
  – Data-race freedom at word granularity
• No space overhead
  – Keep valid data or the registered core id
  – L2 data arrays double as the directory ("registry")
• No transient states

State diagram: Invalid → (Read) → Valid; Valid or Invalid → (Write) → Registered
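The three coherence states can be written as a tiny transition function (an illustrative Python sketch of the state diagram above, not the hardware):

```python
# Per-word states in basic DeNovo coherence: Invalid, Valid, Registered.
# A read brings an Invalid word to Valid (fetching the up-to-date copy);
# a write makes the word Registered at the writer, which then becomes
# the up-to-date copy tracked by the L2 directory. Illustrative only.

def on_read(state):
    # Invalid -> Valid after the fetch; Valid and Registered unchanged
    return "Valid" if state == "Invalid" else state

def on_write(state):
    # Any state -> Registered; note there are no transient states
    return "Registered"
```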

Page 25

Example Run

class Pair {
  X in DeNovo-region Blue;
  Y in DeNovo-region Red;
  void setX(int x) writes Blue { this.X = x; }
}
Pair PArray[size];
...
Phase1 writes Blue { // DeNovo effect
  foreach i in 0, size {
    PArray[i].setX(…);
  }
  self_invalidate(Blue);
}

(Animation: Core 1 runs setX on PArray[0..2] and Core 2 on PArray[3..5]. Each core's L1 holds its own X words Registered and the other core's X words Invalid; all Y words stay Valid. Registration messages update the shared L2, which stores the registered core id (C1 or C2) in place of the X data and acks the writers; at the phase boundary, self_invalidate(Blue) invalidates the stale X words.)

Page 26

Practical DeNovo Coherence
• Basic protocol impractical
  – High tag storage overhead (a tag per word)
• Address/transfer granularity > coherence granularity
• DeNovo line-based protocol
  – Traditional software-oblivious spatial locality
  – Coherence granularity still at word level
    * No word-level false sharing

"Line merging" cache: one tag per line, with per-word state (e.g. V V R)
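The line-merging organization, one tag per line but coherence state per word, can be sketched as follows (illustrative Python; the class and field names are hypothetical):

```python
# Sketch of a DeNovo "line merging" cache line: a single tag gives
# line-granularity transfer and spatial locality, while keeping
# coherence state per word, so words written by different cores can
# be merged into one line with no word-level false sharing.

class MergedLine:
    def __init__(self, tag, words_per_line=4):
        self.tag = tag
        self.state = ["Invalid"] * words_per_line  # per-word state
        self.data = [None] * words_per_line

    def merge_word(self, offset, value, state="Valid"):
        # Fill one word without disturbing the others in the line
        self.data[offset] = value
        self.state[offset] = state

line = MergedLine(tag=0x40)
line.merge_word(0, 7)                      # fetched from elsewhere: Valid
line.merge_word(2, 9, state="Registered")  # written locally: Registered
# line.state == ["Valid", "Invalid", "Registered", "Invalid"]
```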

Page 27

Current Hardware Limitations

✔ Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations
✔ Storage overhead: directory overhead for sharer lists
• Performance and power inefficiencies: invalidation and ack messages; false sharing; indirection through the directory; fixed cache-line communication granularity

Page 28

Protocol Optimizations

Insights:
1. A traditional directory must be updated at every transfer ⇒ DeNovo can copy valid data around freely
2. Traditional systems send a cache line at a time ⇒ DeNovo uses regions to transfer only relevant data
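Insight 2, region-driven transfer of only the relevant data, can be sketched as follows (illustrative Python; the address-to-region map is hypothetical):

```python
# Sketch of flexible, region-based communication: rather than shipping
# a fixed cache line, the responder selects exactly the words belonging
# to the region the requester is reading (e.g. all X, no Y or Z).

def region_transfer(memory, region_of, region):
    return {addr: val for addr, val in memory.items()
            if region_of[addr] == region}

memory = {"X3": 3, "Y3": 30, "Z3": 300, "X4": 4, "Y4": 40}
region_of = {"X3": "X", "Y3": "Y", "Z3": "Z", "X4": "X", "Y4": "Y"}
region_transfer(memory, region_of, "X")  # -> {"X3": 3, "X4": 4}
```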

Page 29

Protocol Optimizations
• Direct cache-to-cache transfer
• Flexible communication

(Animation: Core 1 holds X3 Invalid and issues LD X3; the L2 directory entry for X3 records Core 2 as registered. The data comes directly from Core 2's L1 rather than via an indirection through the directory, and only the needed words are sent, not the full line with Y3 and Z3.)

Page 30

(Animation, continued: with flexible communication the response to LD X3 also carries X4 and X5 from the same region, so Core 1's L1 now holds X3, X4, and X5 Valid and later loads hit locally. Communication granularity is flexible rather than a fixed cache line.)

Page 31

Current Hardware Limitations

✔ Complexity: subtle races and numerous transient states in the protocol; hard to extend for optimizations
✔ Storage overhead: directory overhead for sharer lists
✔ Performance and power inefficiencies: invalidation and ack messages; false sharing; indirection through the directory; fixed cache-line communication granularity

Page 32

Outline
• Motivation
• Background: DPJ
• Base DeNovo Protocol
• DeNovo Optimizations
• Evaluation
  – Complexity
  – Performance
• Conclusion and Future Work

Page 33

Protocol Verification
• DeNovo vs. word-granularity MESI, using Murphi model checking
• Correctness
  – Six bugs found in the MESI protocol
    * Difficult to find and fix
  – Three bugs found in the DeNovo protocol
    * Simple to fix
• Complexity
  – 15x fewer reachable states for DeNovo
  – 20x difference in verification runtime

Page 34

Performance Evaluation Methodology
• Simulator: Simics + GEMS + Garnet
• System parameters
  – 64 cores
  – Simple in-order core model
• Workloads
  – FFT, LU, Barnes-Hut, and radix from SPLASH-2
  – bodytrack and fluidanimate from PARSEC 2.1
  – kd-Tree (two versions) [HPG 09]

Page 35

MESI Word (MW) vs. DeNovo Word (DW)

(Chart: memory stall time for MW and DW across FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, and radix, with bars broken into L1 stall, L2 hit, remote L1 hit, and memory hit components.)

• DW's performance is competitive with MW

Page 36

MESI Line (ML) vs. DeNovo Line (DL)

(Chart: memory stall time for ML and DL across FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, and radix, with bars broken into L1 stall, L2 hit, remote L1 hit, and memory hit components.)

• DL achieves about the same or better memory stall time than ML
• DL outperforms ML significantly on apps with false sharing

Page 37

Optimizations on DeNovo Line

(Chart: memory stall time for ML, DL, and DDF across the eight applications, with bars normalized to ML = 100 and broken into L1 stall, L2 hit, remote L1 hit, and memory hit components.)

• Combined optimizations perform best
  – Except for LU and bodytrack
  – Apps with low spatial locality suffer from line-granularity allocation

Page 38

Network Traffic

(Chart: network traffic for ML, DL, and DDF across FFT, LU, kdFalse, kdPadded, Barnes, bodytrack, fluidanimate, and radix, with bars broken into Read, Write, WB, and Invalidation components.)

• DeNovo has less traffic than MESI in most cases
• DeNovo incurs more write traffic
  – Due to word-granularity registration
  – Can be mitigated with a "write-combining" optimization
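The write-combining idea can be sketched as follows (illustrative Python; the message representation is hypothetical, not the actual hardware mechanism):

```python
# Sketch of write-combining for registration traffic: word-granularity
# registration requests headed to the same line are merged into a
# single message carrying all the word offsets, reducing the extra
# write traffic that word-level registration would otherwise cause.

def combine_registrations(pending):
    combined = {}
    for line_addr, word_offset in pending:
        combined.setdefault(line_addr, set()).add(word_offset)
    return combined  # one message per destination line

msgs = combine_registrations([(0x40, 0), (0x40, 1), (0x80, 2), (0x40, 3)])
# -> {0x40: {0, 1, 3}, 0x80: {2}}: two messages instead of four
```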


Page 39

Conclusion and Future Work
• DeNovo rethinks hardware for disciplined parallel models
• Complexity
  – No transient states: 20x faster to verify than MESI
  – Extensible: optimizations without new states
• Storage overhead
  – No directory storage overhead
• Performance and power inefficiencies
  – No invalidations, acks, false sharing, or indirection
  – Flexible, not cache-line, communication
  – Up to 81% reduction in memory stall time
• Future work: region-driven layout, off-chip memory, safe non-determinism, synchronization, OS, legacy

Page 40

Thank You!