
Concurrent Data Structures in Architectures with

Limited Shared Memory Support

Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou,

Philippas Tsigas

Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden

Yiannis Nikolakopoulos [email protected]


Concurrent Data Structures

• Parallel/concurrent programming:
  – Share data among threads/processes sharing a uniform address space (shared memory)
• Inter-process/thread communication and synchronization
  – Both a tool and a goal


Concurrent Data Structures: Implementations

• Coarse-grained locking
  – Easy but slow...
• Fine-grained locking
  – Fast/scalable, but error-prone; deadlocks
• Non-blocking
  – Atomic hardware primitives (e.g. TAS, CAS)
  – Good progress guarantees (lock-/wait-freedom)
  – Scalable


What’s happening in hardware?

• Multi-cores → many-cores
  – “Cache coherency wall” [Kumar et al 2011]
  – Shared address space will not scale
  – Universal atomic primitives (CAS, LL/SC) harder to implement
• Shared memory → message passing

[Figure: a tile with IA cores, per-core caches, and shared local memory]


• Networks on chip (NoC)
• Short distance between cores
• Message passing model support
• Shared memory support
  – Eliminated cache coherency
  – Limited support for synchronization primitives

Can we have data structures that are fast and scalable, with good progress guarantees?

[Figure: a tile with IA cores, per-core caches, and shared local memory]


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Single-chip Cloud Computer (SCC)
• Experimental processor by Intel
• 48 independent x86 cores arranged on 24 tiles
• NoC connects all tiles
• TestAndSet register per core


SCC: Architecture Overview

• Memory controllers: to private & shared main memory
• Message Passing Buffer (MPB): 16 KB per tile


Programming Challenges in SCC
• Message passing, but...
  – MPB is small for large data transfers
  – Data replication is difficult
• No universal atomic primitives (CAS); no wait-free implementations [Herlihy91]


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Concurrent FIFO Queues
• Main idea:
  – Data are stored in shared off-chip memory
  – Message passing for communication/coordination
• 2 design methodologies:
  – Lock-based synchronization (2-lock Queue)
  – Message passing-based synchronization (MP-Queue, MP-Acks)


2-lock Queue
• Array based, in shared off-chip memory (SHM)
• Head/Tail pointers in MPBs
• 1 lock for each pointer [Michael&Scott96]
• TAS-based locks on 2 cores


2-lock Queue: “Traditional” Enqueue Algorithm

• Acquire lock
• Read & update Tail pointer (MPB)
• Add data (SHM)
• Release lock


2-lock Queue: Optimized Enqueue Algorithm

• Acquire lock
• Read & update Tail pointer (MPB)
• Release lock
• Add data to node (SHM)
• Set memory flag to dirty. Why? No cache coherency!


2-lock Queue: Dequeue Algorithm

• Acquire lock
• Read & update Head pointer
• Release lock
• Check flag
• Read node data

What about progress?


2-lock Queue: Implementation

• Head/Tail pointers (MPB)
• Data nodes
• Locks? On which tile(s)?


Message Passing-based Queue
• Data nodes in SHM
• Access coordinated by a server node that keeps the Head/Tail pointers
• Enqueuers/dequeuers request access through dedicated slots in the MPB
• Successfully enqueued data are flagged with a dirty bit


MP-Queue

[Figure: an enqueuer sends an ENQ request and the server advances TAIL; the enqueuer then ADDs DATA to the returned node. A dequeuer sends a DEQ request, the server returns the HEAD node, and the dequeuer SPINs on that node’s flag until the data is written.]

What if an enqueue fails and its node is never flagged?
“Pairwise blocking”: only 1 dequeue blocks


Adding Acknowledgements
• No more flags! Enqueue sends ACK when done
• Server maintains in SHM a private queue of pointers
• On ACK: server adds the data location to its private queue
• On Dequeue: server returns only ACKed locations


MP-Acks

[Figure: an enqueuer sends an ENQ request and the server advances TAIL; when the data is written, the enqueuer sends an ACK. A dequeuer sends a DEQ request and the server returns only an ACKed HEAD node.]

No blocking between enqueues/dequeues


Outline
• Concurrent Data Structures
• Many-core architectures
• Intel’s SCC
• Concurrent FIFO Queues
• Evaluation
• Conclusion


Evaluation

Benchmark:
• Each core performs Enq/Deq at random
• High/Low contention

• Performance? Scalability?
• Is it the same for all cores?


Measures

• Throughput: data structure operations completed per time unit [Cederman et al 2013]
• Fairness: relates the operations completed by core i to the average operations per core


Throughput – High Contention


Fairness – High Contention


Throughput VS Lock Location


28

Conclusion
• Lock-based queue
  – High throughput
  – Less fair
  – Sensitive to lock locations, NoC performance
• MP-based queues
  – Lower throughput
  – Fairer
  – Better liveness properties
  – Promising scalability


Thank you!

[email protected]
[email protected]


BACKUP SLIDES


Experimental Setup
• 533 MHz cores, 800 MHz mesh, 800 MHz DDR3
• Randomized Enq/Deq operations
• High/Low contention
• One thread per core
• 600 ms per execution
• Averaged over 12 runs


Concurrent FIFO Queues
• Typical 2-lock queue [Michael&Scott96]