NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL MEMORY Tim Harris, 14 November 2014



Lecture 6
• Introduction
• Amdahl’s law
• Basic spin-locks
• Queue-based locks
• Hierarchical locks
• Reader-writer locks
• Reading without locking
• Flat combining

Overview
• Building shared memory data structures
  • Lists, queues, hashtables, …
• Why?
  • Used directly by applications (e.g., in C/C++, Java, C#, …)
  • Used in the language runtime system (e.g., management of work, implementations of message passing, …)
  • Used in traditional operating systems (e.g., synchronization between top/bottom-half code)
• Why not?
  • Don’t think of “threads + shared data structures” as a default/good/complete/desirable programming model
  • It’s better to have shared memory and not need it…

What do we care about?
• Correctness: What does it mean to be correct? e.g., if multiple concurrent threads are using iterators on a shared data structure at the same time?
• Ease to write: Does it matter? Who is the target audience? How much effort can they put into it? Is implementing a data structure an undergrad programming exercise? …or a research paper?
• When can it be used? Between threads in the same process? Between processes sharing memory? Within an interrupt handler? With/without some kind of runtime system support?
• How fast is it? Suppose I have a sequential implementation (no concurrency control at all): is the new implementation 5% slower? 5x slower? 100x slower?
• How well does it scale? How does performance change as we increase the number of threads? When does the implementation add or avoid synchronization?


What do we care about?
1. Be explicit about goals and trade-offs
  • A benefit in one dimension often has costs in another
  • Does a perf increase prevent a data structure being used in some particular setting?
  • Does a technique to make something easier to write make the implementation slower?
  • Do we care? It depends on the setting
2. Remember, parallel programming is rarely a recreational activity
  • The ultimate goal is to increase perf (time, or resources used)
  • Does an implementation scale well enough to out-perform a good sequential implementation?

Suggested reading
• “The art of multiprocessor programming”, Herlihy & Shavit – excellent coverage of shared memory data structures, from both practical and theoretical perspectives
• “Transactional memory, 2nd edition”, Harris, Larus, Rajwar – recently revamped survey of TM work, with 350+ references
• “NOrec: streamlining STM by abolishing ownership records”, Dalessandro, Spear, Scott, PPoPP 2010
• “Simplifying concurrent algorithms by exploiting transactional memory”, Dice, Lev, Marathe, Moir, Nussbaum, Olszewski, SPAA 2010
• Intel “Haswell” spec for SLE (speculative lock elision) and RTM (restricted transactional memory)

Amdahl’s law
• “Sorting takes 70% of the execution time of a sequential program. You replace the sorting algorithm with one that scales perfectly on multi-core hardware. On a machine with n cores, how many cores do you need to use to get a 4x speed-up on the overall algorithm?”

Amdahl’s law, f=70%
[Figure: speedup (0.0 to 4.5) against #cores (1 to 16). Series: desired 4x speedup; speedup achieved (perfect scaling on 70%).]

Amdahl’s law, f=70%

$$\mathrm{Speedup}(f, c) = \frac{1}{(1 - f) + \frac{f}{c}}$$

f = fraction of code speedup applies to
c = number of cores used
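The formula can be checked numerically. The following is a minimal C sketch; the function names are illustrative and not from the slides. With f = 0.7 it gives about 2.91x on 16 cores, and the limit 1/(1-f) is about 3.33x, so no number of cores reaches the 4x asked for in the sorting question.

```c
#include <stdio.h>

/* Amdahl's law: f is the fraction of the run time the speedup applies to,
   c is the number of cores used for that fraction. */
double amdahl_speedup(double f, double c) {
    return 1.0 / ((1.0 - f) + f / c);
}

/* Limit of the speedup as the number of cores grows without bound. */
double amdahl_limit(double f) {
    return 1.0 / (1.0 - f);
}

/* Illustrative helper: print the curve for 1..max_cores, then the limit. */
void print_amdahl_table(double f, int max_cores) {
    for (int c = 1; c <= max_cores; c++)
        printf("%2d cores: %.2fx\n", c, amdahl_speedup(f, c));
    printf("limit:    %.2fx\n", amdahl_limit(f));
}
```

Calling `print_amdahl_table(0.7, 16)` reproduces the shape of the f=70% curve.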

Amdahl’s law, f=70%
[Figure: speedup (0.0 to 4.5) against #cores (1 to 16). Series: desired 4x speedup; speedup achieved (perfect scaling on 70%); the limit.]
Limit as c→∞ = 1/(1-f) = 3.33

Amdahl’s law, f=10%
[Figure: speedup (0.94 to 1.12) against #cores (1 to 16). Series: speedup achieved with perfect scaling.]
Amdahl’s law limit, just 1.11x

Amdahl’s law, f=98%
[Figure: speedup (0 to 60) against #cores (1 to 127).]

Amdahl’s law & multi-core
Suppose that the same h/w budget (space or power) can make us:
[Figure: three chip layouts: 16 small cores, 1 big core, or 4 medium cores.]

Perf of big & small cores
[Figure: core perf (relative to 1 big core, 0.0 to 1.2) against resources dedicated to core (1/16, 1/8, 1/4, 1/2, 1).]
Assumption: perf = α √resource
16 small cores: total perf 16 * 1/4 = 4
1 big core: total perf 1 * 1 = 1
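The square-root assumption can be combined with Amdahl’s law to compare chip layouts. This is a minimal C sketch under assumptions that are not stated on the slide: the serial fraction runs on a single core, the parallel fraction scales perfectly over all n cores, and α = 1. Per-core perf values are passed in directly (1 for the big core, 1/2 for each of 4 medium cores, 1/4 for each of 16 small cores). The function name is illustrative.

```c
/* Perf relative to 1 big core for a chip with n identical cores.
   f = fraction of work that is perfectly parallel,
   p = perf of one core relative to the big core (e.g. 0.25 for a core
       built from 1/16 of the resources, under perf = sqrt(resource)). */
double symmetric_perf(double f, int n, double p) {
    double serial_time   = (1.0 - f) / p;        /* one core does the serial part   */
    double parallel_time = f / ((double)n * p);  /* all n cores share the rest      */
    return 1.0 / (serial_time + parallel_time);
}
```

Under these assumptions, 16 small cores give roughly 3.08x at f = 98% but only about 0.84x at f = 75%, which is worse than the single big core.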

Amdahl’s law, f=98%
[Figure: perf (relative to 1 big core, 0.0 to 3.5) against #cores (1 to 16). Series: 1 big; 4 medium; 16 small.]

Amdahl’s law, f=75%
[Figure: perf (relative to 1 big core, 0.0 to 1.2) against #cores (1 to 16). Series: 1 big; 4 medium; 16 small.]

Amdahl’s law, f=5%
[Figure: perf (relative to 1 big core, 0.0 to 1.2) against #cores (1 to 16). Series: 1 big; 4 medium; 16 small.]

Asymmetric chips
[Figure: a chip with 1 larger core in the space of four small cores, plus 12 small cores.]
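The asymmetric “1+12” layout can be modelled in the same spirit. This C sketch rests on assumptions of mine rather than on the slides: the larger core runs the serial fraction alone, and in the parallel fraction the larger core and all small cores work together with perfect scaling. The name `asymmetric_perf` and its parameters are illustrative.

```c
/* Perf relative to 1 big (whole-chip) core for an asymmetric chip.
   f          = fraction of work that is perfectly parallel
   large_perf = perf of the larger core (0.5 if it uses 1/4 of the chip
                and perf = sqrt(resource))
   n_small    = number of small cores
   small_perf = perf of each small core (0.25 for 1/16 of the chip) */
double asymmetric_perf(double f, double large_perf, int n_small, double small_perf) {
    double serial_time   = (1.0 - f) / large_perf;
    double parallel_time = f / (large_perf + n_small * small_perf);
    return 1.0 / (serial_time + parallel_time);
}
```

With these inputs, `asymmetric_perf(0.75, 0.5, 12, 0.25)` evaluates to about 1.4, ahead of the roughly 1.14 that four medium cores give under the same model.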

Amdahl’s law, f=75%
[Figure: perf (relative to 1 big core, 0 to 1.6) against #cores (1 to 16). Series: 1 big; 4 medium; 16 small; 1+12.]

Amdahl’s law, f=5%
[Figure: perf (relative to 1 big core, 0 to 1.2) against #cores (1 to 16). Series: 1 big; 4 medium; 16 small; 1+12.]

Amdahl’s law, f=98%
[Figure: perf (relative to 1 big core, 0 to 3.5) against #cores (1 to 16). Series: 1 big; 4 medium; 16 small; 1+12.]

Amdahl’s law, f=98%
[Figure: speedup (relative to 1 big core, 0 to 9) against #cores. Series: 256 small; 1+192.]

Amdahl’s law, f=98%
[Figure: speedup (relative to 1 big core, 0 to 9) against #cores. Series: 256 small; 1+192. Annotation: leave larger core idle in parallel section.]

Basic spin-locks

Test and set (pseudo-code)

    bool testAndSet(bool *b) {   /* b: pointer to a location holding a boolean value (TRUE/FALSE) */
      bool result;
      atomic {
        result = *b;             /* read the current contents of the location b points to… */
        *b = TRUE;               /* …set the contents of *b to TRUE */
      }
      return result;
    }
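The `atomic { … }` block in the pseudo-code is not C. One way to get the same effect in real code, sketched here with C11 `<stdatomic.h>`, is an atomic exchange, which stores the new value and returns the old one in a single indivisible step. The name `test_and_set` is illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Atomically store true into *b and return the value it held before. */
bool test_and_set(atomic_bool *b) {
    return atomic_exchange(b, true);
}
```

The first caller on a location holding false gets false back; every later caller gets true until something stores false again.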

Test and set
• Suppose two threads use it at once
[Timeline: Thread 1 and Thread 2 each call testAndSet(b) at about the same time; one call returns false and the other returns true.]

Test and set lock

    lock: FALSE      (FALSE => lock available, TRUE => lock held)

    void acquireLock(bool *lock) {
      while (testAndSet(lock)) {   /* each call tries to acquire the lock, returning TRUE if it is already held */
        /* Nothing */
      }
    }

    void releaseLock(bool *lock) {
      *lock = FALSE;
    }

NB: all this is pseudo-code, assuming SC memory
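Because the pseudo-code assumes sequentially consistent memory, a direct C translation with plain `bool` would not be safe. A hedged sketch with C11 atomics follows; the type and function names are mine, and acquire/release ordering is one reasonable choice rather than the only one.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool held; } tas_lock;   /* false => available, true => held */

void tas_acquire(tas_lock *l) {
    /* The exchange returns the old value: true means another thread holds the lock. */
    while (atomic_exchange_explicit(&l->held, true, memory_order_acquire)) {
        /* spin */
    }
}

void tas_release(tas_lock *l) {
    /* Release ordering makes the critical section's writes visible to the next holder. */
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```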

Test and set lock
[Animation: the same acquireLock/releaseLock code with two threads, Thread 1 and Thread 2, contending for the lock; lock: TRUE.]

What are the problems here?
• testAndSet implementation causes contention

Contention from testAndSet
[Figure: two single-threaded cores, each with its own L1 cache and L2 cache, above main memory.]

Multi-core h/w – separate L2
[Figure sequence over three slides: a core executes testAndSet(k) and the cache line holding k is shown in the caches.]
Does this still happen in practice? Do modern CPUs avoid fetching the line in exclusive mode on failing TAS?

What are the problems here?
• testAndSet implementation causes contention
• Spinning may waste resources while waiting
• No control over locking policy
• Only supports mutual exclusion: not reader-writer locking

General problem
• No logical conflict between two failed lock acquires
• Cache protocol introduces a physical conflict
• For a good algorithm: only introduce physical conflicts if a logical conflict occurs
  • In a lock: successful lock-acquire & failed lock-acquire
  • In a set: successful insert(10) & failed insert(10)
• But not:
  • In a lock: two failed lock acquires
  • In a set: successful insert(10) & successful insert(20)
  • In a non-empty queue: enqueue on the left and remove on the right

Test and test and set lock

    lock: FALSE      (FALSE => lock available, TRUE => lock held)

    void acquireLock(bool *lock) {
      do {
        while (*lock) { }          /* spin while the lock is held… */
      } while (testAndSet(lock));  /* …only do testAndSet when it is clear */
    }

    void releaseLock(bool *lock) {
      *lock = FALSE;
    }
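A possible C11 rendering of the test-and-test-and-set idea is sketched below, with illustrative names. The inner loop is a plain atomic load, so a waiting thread can spin on its own cached copy of the line; the exchange is attempted only after the lock has been seen free.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool held; } tatas_lock;   /* false => available */

void tatas_acquire(tatas_lock *l) {
    do {
        /* Read-only spin: no write traffic while the lock is held. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed)) { }
        /* The lock looked free: now try to take it. */
    } while (atomic_exchange_explicit(&l->held, true, memory_order_acquire));
}

void tatas_release(tatas_lock *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```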

Performance
[Figure: time against # threads. Series: TAS; TATAS; Ideal.]
Based on Fig 7.4, Herlihy & Shavit, “The Art of Multiprocessor Programming”

Stampedes
[Figure: the test and test and set lock code, with lock: TRUE.]
When the holder sets the lock to FALSE, every thread spinning in the inner loop sees it become free and calls testAndSet at about the same time.

Back-off algorithms
1. Start by spinning, watching the lock for “s” iterations
2. If the lock does not become free, wait locally for “w” (without watching the lock)
What should “s” be? What should “w” be?

Time spent spinning on the lock “s”
• Lower values:
  • Less time to build up a set of threads that will stampede
  • Less contention in the memory system, if remote reads incur a cost
  • Risk of a delay in noticing when the lock becomes free if we are not watching
• Higher values:
  • Less likelihood of a delay between a lock being released and a waiting thread noticing

Local waiting time “w”
• Lower values:
  • More responsive to the lock becoming available
• Higher values:
  • If the lock doesn’t become available then the thread makes fewer accesses to the shared variable

Methodical approach
• For a given workload and performance model:
  • What is the best that could be done (i.e. given an “oracle” with perfect knowledge of when the lock becomes free)?
  • How does a practical algorithm compare with this?
• Look for an algorithm with a bound between its performance and that of the oracle
• “Competitive spinning”

Rule of thumb
• Spin on the lock for a duration that’s comparable with the shortest back-off interval
• Exponentially increase the per-thread back-off interval (resetting it when the lock is acquired)
• Use a maximum back-off interval that is large enough that waiting threads don’t interfere with the other threads’ performance
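The rule of thumb can be sketched in C11 as follows. The constants `MIN_DELAY` and `MAX_DELAY`, the busy-wait helper `wait_locally`, and the function names are all hypothetical; real values would need tuning for the workload and machine.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { MIN_DELAY = 16, MAX_DELAY = 4096 };   /* hypothetical tuning values */

/* Wait without touching the shared lock variable. */
static void wait_locally(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

void backoff_acquire(atomic_bool *lock) {
    unsigned delay = MIN_DELAY;              /* per-thread; starts afresh on each acquire */
    for (;;) {
        /* Spin watching the lock for about as long as the shortest back-off. */
        for (unsigned s = 0; s < MIN_DELAY; s++) {
            if (!atomic_load_explicit(lock, memory_order_relaxed) &&
                !atomic_exchange_explicit(lock, true, memory_order_acquire))
                return;                      /* acquired */
        }
        wait_locally(delay);                 /* back off without watching the lock */
        if (delay < MAX_DELAY) delay *= 2;   /* exponential increase, capped */
    }
}

void backoff_release(atomic_bool *lock) {
    atomic_store_explicit(lock, false, memory_order_release);
}
```

Keeping `delay` as a local variable gives the "reset when the lock is acquired" behaviour without extra bookkeeping.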

Systems problems
[Figure: lots of h/w threads multiplexed over a core, above cache(s) and shared physical memory.]
• The threads need to “wait efficiently”
• Not consuming processing resources (contending with lock holder) & not consuming power
• “monitor” / “mwait” operations – e.g., SPARC M7

Systems problems
[Figure: s/w threads multiplexed on cores (Core, Core, …), above cache(s) and shared physical memory.]
• Spinning gets in the way of other s/w threads, even if done efficiently
• For long delays, may need to actually block and unblock
• ...as with back-off, how long to spin for before blocking?

Queue-based locks
• Lock holders queue up: immediately provides FCFS behavior
• Each spins locally on a flag in their queue entry: no remote memory accesses while waiting
• A lock release wakes the next thread directly: no stampede

MCS locks
[Figure: a queue of QNode 1 (head), QNode 2 and QNode 3 (tail), each holding a local flag (FALSE); the lock identifies the tail.]

MCS lock acquire

    void acquireMCS(mcs *lock, QNode *qn) {
      QNode *prev;
      qn->flag = false;
      qn->next = NULL;
      while (true) {
        prev = lock->tail;                           /* find previous tail node */
        /* Label 1 */
        if (CAS(&lock->tail, prev, qn)) break;       /* atomically replace “prev” with “qn” in the lock itself */
      }
      if (prev != NULL) {
        prev->next = qn; /* Label 2 */               /* add link within the queue */
        while (!qn->flag) { } // Spin
      }
    }

MCS lock release

    void releaseMCS(mcs *lock, QNode *qn) {
      if (lock->tail == qn) {                        /* if we were at the tail then remove us */
        if (CAS(&lock->tail, qn, NULL)) return;
      }
      while (qn->next == NULL) { }                   /* wait for next lock holder to announce themselves… */
      qn->next->flag = TRUE;                         /* …signal them */
    }
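The acquire and release operations can be combined into a compilable C11 sketch. This is not the slides’ code verbatim: the names are mine, default sequentially consistent atomics are used for simplicity, and `atomic_exchange` replaces the read-then-CAS loop in acquire, since swapping in the new tail and learning the old one is a single atomic step.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct qnode {
    _Atomic(struct qnode *) next;
    atomic_bool flag;                    /* set to true when the lock is handed to us */
} qnode;

typedef struct { _Atomic(qnode *) tail; } mcs_lock;   /* NULL tail => lock free */

void mcs_acquire(mcs_lock *lock, qnode *qn) {
    atomic_store(&qn->flag, false);
    atomic_store(&qn->next, (qnode *)NULL);
    qnode *prev = atomic_exchange(&lock->tail, qn);   /* become the new tail */
    if (prev != NULL) {
        atomic_store(&prev->next, qn);                /* link in behind the old tail */
        while (!atomic_load(&qn->flag)) { }           /* spin on our own node only */
    }
}

void mcs_release(mcs_lock *lock, qnode *qn) {
    qnode *expected = qn;
    /* If we are still the tail, nobody is waiting: empty the queue. */
    if (atomic_compare_exchange_strong(&lock->tail, &expected, (qnode *)NULL))
        return;
    qnode *next;
    /* A successor has swapped the tail but may not have linked itself in yet. */
    while ((next = atomic_load(&qn->next)) == NULL) { }
    atomic_store(&next->flag, true);                  /* hand the lock over */
}
```

Each thread supplies its own `qnode`, typically on its stack or in thread-local storage, and passes the same node to release that it passed to acquire.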

Hierarchical locks
[Figure: Core 1 to Core 4 share one L2 cache and Core 5 to Core 8 share another; both L2 caches connect through the memory bus to memory.]
• Call this a “cluster” of cores
• Pass lock “nearby” if possible

Hierarchical TATAS with backoff

    lock: -1      (-1 => lock available, n => lock held by cluster n)

    void acquireLock(int *lock) {
      int holder;
      do {
        holder = *lock;
        if (holder != -1) {
          if (holder == MY_CLUSTER) {
            BackOff(SHORT);
          } else {
            BackOff(LONG);
          }
        }
      } while (!CAS(lock, -1, MY_CLUSTER));
    }
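A hedged C11 version of the same idea is sketched below. The back-off lengths, the busy-wait helper `back_off`, and passing the cluster id as a parameter (instead of the slide’s `MY_CLUSTER`) are assumptions made for illustration. Unlike the pseudo-code, this version attempts the CAS only after observing the lock free.

```c
#include <stdatomic.h>

enum { NO_HOLDER = -1, SHORT_BACKOFF = 32, LONG_BACKOFF = 1024 };  /* hypothetical values */

static void back_off(unsigned iterations) {
    for (volatile unsigned i = 0; i < iterations; i++) { }
}

/* *lock holds NO_HOLDER when free, else the id of the holder's cluster. */
void hier_acquire(atomic_int *lock, int my_cluster) {
    for (;;) {
        int holder = atomic_load_explicit(lock, memory_order_relaxed);
        if (holder == NO_HOLDER) {
            int expected = NO_HOLDER;
            if (atomic_compare_exchange_strong(lock, &expected, my_cluster))
                return;
        } else {
            /* Nearby waiters retry sooner, so the lock tends to stay in the cluster. */
            back_off(holder == my_cluster ? SHORT_BACKOFF : LONG_BACKOFF);
        }
    }
}

void hier_release(atomic_int *lock) {
    atomic_store(lock, NO_HOLDER);
}
```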

Hierarchical locks: unfairness v throughput
[Figure: the two-cluster machine (Core 1 to Core 4 and Core 5 to Core 8, each cluster with a shared L2 cache, connected by the memory bus to memory), with the lock passing repeatedly between nearby cores.]
Avoid this cycle repeating, starving 5 & 7…

Lock cohorting
• “Lock Cohorting: A General Technique for Designing NUMA Locks”, Dice et al PPoPP 2012
[Figure: two NUMA domains (Core 1 to Core 4 and Core 5 to Core 8, each with a shared L2 cache); each domain has a per-NUMA-domain lock (SA, SB), and there is a system-wide arbitration lock G.]

Lock cohorting
• Lock acquire, uncontended
  (1) Acquire local lock
  (2) Acquire global lock

Lock cohorting
• Lock acquire, contended
  (1) Wait for local lock (e.g., MCS)

Lock cohorting
• Lock release, with successor
  (1) Pass global lock to successor

Lock cohorting, requirements
• Global: “thread oblivious” (acq one thread, release another)
• Local lock: “cohort detection” (can test for successors)
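The acquire and release steps can be illustrated with a small C11 sketch. The lock choices here are mine, made to keep the example short: a test-and-set lock for G (thread oblivious, since any thread may clear the flag) and a ticket lock for each local lock (cohort detection by comparing ticket counters). The `MAX_HANDOFFS` bound is a hypothetical fairness limit, and all names are illustrative.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { MAX_HANDOFFS = 64 };   /* hypothetical bound on consecutive local hand-offs */

typedef struct { atomic_bool held; } global_lock;    /* G: thread-oblivious TAS lock */

typedef struct {                                      /* per-NUMA-domain ticket lock */
    atomic_uint next_ticket, now_serving;
    bool owns_global;         /* protected by the local lock */
    unsigned handoffs;        /* protected by the local lock */
} local_lock;

void cohort_acquire(global_lock *g, local_lock *s) {
    unsigned t = atomic_fetch_add(&s->next_ticket, 1);
    while (atomic_load(&s->now_serving) != t) { }     /* (1) acquire local lock */
    if (!s->owns_global) {                            /* (2) acquire G unless inherited */
        while (atomic_exchange(&g->held, true)) { }
        s->owns_global = true;
    }
}

void cohort_release(global_lock *g, local_lock *s) {
    unsigned me = atomic_load(&s->now_serving);
    /* Cohort detection: has anyone in this domain taken a ticket after ours? */
    bool has_successor = atomic_load(&s->next_ticket) != me + 1;
    if (has_successor && s->handoffs < MAX_HANDOFFS) {
        s->handoffs++;                                /* keep G; it passes with the local lock */
    } else {
        s->handoffs = 0;
        s->owns_global = false;
        atomic_store(&g->held, false);                /* let another domain in */
    }
    atomic_store(&s->now_serving, me + 1);            /* release local lock */
}
```

A zero-initialized `local_lock` and a `global_lock` with `held` false represent the unlocked state.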

Reader-writer locks

Reader-writer locks (TATAS-like)

    lock: 0      (-1 => Locked for write, 0 => Lock available, +n => Locked by n readers)

    void acquireWrite(int *lock) {
      do {
        if ((*lock == 0) && (CAS(lock, 0, -1))) {
          break;
        }
      } while (1);
    }

    void releaseWrite(int *lock) {
      *lock = 0;
    }

    void acquireRead(int *lock) {
      do {
        int oldVal = *lock;
        if ((oldVal >= 0) && (CAS(lock, oldVal, oldVal+1))) {
          break;
        }
      } while (1);
    }

    void releaseRead(int *lock) {
      FADD(lock, -1); // Atomic fetch-and-add
    }
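For reference, the same reader-writer lock can be written against C11 `<stdatomic.h>`. This is a sketch with illustrative names; it uses the default sequentially consistent operations for the CAS and fetch-and-add, and keeps the slide’s encoding of the lock word.

```c
#include <stdatomic.h>

/* Lock word: -1 => locked for write, 0 => available, +n => held by n readers. */

void rw_acquire_write(atomic_int *lock) {
    for (;;) {
        int expected = 0;
        /* Test first, then CAS 0 -> -1 only when the lock looks free. */
        if (atomic_load_explicit(lock, memory_order_relaxed) == 0 &&
            atomic_compare_exchange_strong(lock, &expected, -1))
            return;
    }
}

void rw_release_write(atomic_int *lock) {
    atomic_store(lock, 0);
}

void rw_acquire_read(atomic_int *lock) {
    for (;;) {
        int old = atomic_load_explicit(lock, memory_order_relaxed);
        /* Readers may join as long as no writer holds the lock. */
        if (old >= 0 && atomic_compare_exchange_weak(lock, &old, old + 1))
            return;
    }
}

void rw_release_read(atomic_int *lock) {
    atomic_fetch_sub(lock, 1);   /* atomic fetch-and-add of -1 */
}
```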

The problem with readers

    int readCount() {
      acquireRead(lock);
      int result = count;
      releaseRead(lock);
      return result;
    }

    void incrementCount() {
      acquireWrite(lock);
      count++;
      releaseWrite(lock);
    }

• Each acquireRead fetches the cache line holding the lock in exclusive mode
• Again: two acquireRead calls are not logically conflicting, but this introduces a physical conflict
• The time spent managing the lock is likely to vastly dominate the actual time looking at the counter
• Many workloads are read-mostly…

Keeping readers separate
[Figure: an Owner field followed by per-reader flags Flag-1, Flag-2, Flag-3, …, Flag-N.]
Acquire write on core i: CAS the owner from 0 to i…
…then spin until all of the flags are clear
Acquire read on core i: set own flag to true…
…then check that the owner is 0 (if not then clear own flag and wait)

Keeping readers separate
• With care, readers do not need to synchronize with other readers
  • Extend the flags to be whole cache lines
  • Pack multiple locks’ flags for the same thread onto the same line
  • Exploit the cache structure in the machine: Dice & Shavit’s TLRW byte-lock on SPARC Niagara
• If “N” threads is very large…
  • Dedicate the flags to specific important threads
  • Replace the flags with ordinary multi-reader locks
  • Replace the flags with per-NUMA-domain multi-reader locks
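The owner-plus-flags scheme can be sketched in C11 as below. Several details are assumptions for illustration: a fixed `N_THREADS`, a 64-byte cache line, reader slots indexed from 0, writer ids that are non-zero, and sequentially consistent atomics so that a reader’s flag store is ordered before its owner check (and a writer’s owner CAS before its flag scan).

```c
#include <stdatomic.h>
#include <stdbool.h>

enum { N_THREADS = 4, CACHE_LINE = 64 };   /* hypothetical sizes */

typedef struct {
    atomic_int owner;                      /* 0 => no writer, otherwise the writer's id */
    /* One flag per reader, each padded out to its own cache line. */
    struct { _Alignas(CACHE_LINE) atomic_bool reading; } flag[N_THREADS];
} flag_rwlock;

void flag_acquire_read(flag_rwlock *l, int slot) {
    for (;;) {
        atomic_store(&l->flag[slot].reading, true);    /* set own flag… */
        if (atomic_load(&l->owner) == 0) return;       /* …then check that the owner is 0 */
        atomic_store(&l->flag[slot].reading, false);   /* a writer is present: back out */
        while (atomic_load(&l->owner) != 0) { }        /* and wait */
    }
}

void flag_release_read(flag_rwlock *l, int slot) {
    atomic_store(&l->flag[slot].reading, false);
}

void flag_acquire_write(flag_rwlock *l, int writer_id) {   /* writer_id must be non-zero */
    int expected = 0;
    while (!atomic_compare_exchange_weak(&l->owner, &expected, writer_id))
        expected = 0;                                  /* CAS the owner from 0 to our id */
    for (int j = 0; j < N_THREADS; j++)
        while (atomic_load(&l->flag[j].reading)) { }   /* spin until all flags are clear */
}

void flag_release_write(flag_rwlock *l) {
    atomic_store(&l->owner, 0);
}
```

In the common read-only case a reader writes only its own cache line, so two readers do not contend with each other.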


Other locking techniques
• Affinity
  • Allow one thread fast access to the lock
  • “One thread” – e.g., previous lock holder
  • “Fast access” – e.g., with fewer / no atomic CAS operations
  • Mike Burrows “Implementing unnecessary mutexes” (Do the assumptions hold? How slow is an uncontended CAS on a modern machine? Are these techniques still useful?)
• Inflation
  • Start out with a simple lock for likely-to-be-uncontended use
  • Replace with a “proper” lock if contended
  • David Bacon (thin locks), Agesen et al (meta-locks)
  • Motivating example: standard libraries in Java

Page 71: NON-BLOCKING DATA STRUCTURES AND TRANSACTIONAL …€¦ · Queue-based locks Hierarchical locks Reader-writer locks Reading without locking Flat combining. Overview ... Non-blocking

Where are we� Amdahl’s law: to scale to large numbers of cores, we need

critical sections to be rare and/or short

� A lock implementation may involve updating a few memory locations

� Accessing a data structure may involve only a few memory locations too

� If we try to shrink critical sections then the time in the lock implementation becomes proportionately greater

� So:

� try to make the cost of the operations in the critical section lower, or

� try to write critical sections correctly without locking
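As a reminder of why rare and short critical sections matter, Amdahl's law can be written out with a small worked case. The symbols $s$ (fraction of the work that is serialized, for example inside critical sections) and $n$ (number of cores) are notation chosen here and may differ from the notation used earlier in the lecture.

```latex
\[
\text{speedup}(n) = \frac{1}{s + \dfrac{1 - s}{n}},
\qquad
\lim_{n \to \infty} \text{speedup}(n) = \frac{1}{s}
\]
\[
s = 0.05,\ n = 64:
\quad
\frac{1}{0.05 + 0.95/64} \approx 15.4
\]
```

So with only 5% of the work serialized, 64 cores give a speedup of about 15, and no number of cores can give more than 20.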

Page 72

Reading without locking

Page 73

What if updates are very rare?

• No updates at all: no need for locking
• Modest number of updates: could use reader-writer locks

Page 74

Version numbers

[Figure: a sequential data structure with a write lock, plus a per-data-structure version number (100 in the figure)]

Pages 75-81

Version numbers: writers and readers

Writers:
1. Take write lock
2. Increment version number
3. Make update
4. Increment version number
5. Release write lock

Readers:
1. Wait for version number to be even
2. Do operation
3. Has the version number changed?
4. Yes? Go to 1

[Figure: the version number is 100 before the update, 101 while the update is in progress, and 102 once the writer has finished]

Page 82

Why do we need the two steps?

[Figure: the writer steps 1-5 and reader steps 1-4 from the previous slide, with the version number at 102]
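The writer and reader steps above can be sketched in Java roughly as follows. The class name `VersionedPair`, the two `long` fields standing in for the "sequential data structure", and the starting version of 100 (taken from the figure) are illustrative assumptions. The data fields are declared `volatile` to keep the memory ordering simple; a tuned implementation would likely make different choices.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a version-number (sequence-lock style) scheme.
final class VersionedPair {
    private final ReentrantLock writeLock = new ReentrantLock();
    // Even: no update in progress. Odd: a writer is mid-update.
    private final AtomicLong version = new AtomicLong(100);
    // Stand-in for the sequential data structure (assumption: two fields
    // that must always be observed together).
    private volatile long x;
    private volatile long y;

    void write(long newX, long newY) {
        writeLock.lock();                  // 1. take write lock
        try {
            version.incrementAndGet();     // 2. increment: version becomes odd
            x = newX;                      // 3. make update
            y = newY;
            version.incrementAndGet();     // 4. increment: version is even again
        } finally {
            writeLock.unlock();            // 5. release write lock
        }
    }

    long[] read() {
        while (true) {
            long before = version.get();
            if ((before & 1) != 0) {       // 1. wait for the version to be even
                Thread.yield();
                continue;
            }
            long rx = x;                   // 2. do operation
            long ry = y;
            if (version.get() == before) { // 3. has the version changed?
                return new long[] { rx, ry };
            }
            // 4. yes: go back to step 1 and retry
        }
    }

    long version() {
        return version.get();
    }
}
```

A reader that overlaps a writer either sees an odd version at the start or a different version at the end, so it never returns a mix of old and new values for `x` and `y`.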

Pages 83-86

Read-Copy-Update (RCU)

1. Copy existing structure
2. Update copy
3. Install copy with CAS on root pointer
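The three steps above can be sketched as a copy-on-write table behind a single root pointer. The class name `CopyOnWriteTable` and the choice of a `String` to `Integer` map are assumptions for illustration. Because Java is garbage collected, old copies are reclaimed once no reader still holds them, so this sketch does not address memory reclamation.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: readers follow the root pointer without locking;
// updaters copy, modify the copy, and install it with a CAS on the root.
final class CopyOnWriteTable {
    private final AtomicReference<Map<String, Integer>> root =
            new AtomicReference<>(Collections.<String, Integer>emptyMap());

    // Readers take no lock and see one consistent version of the map.
    Integer get(String key) {
        return root.get().get(key);
    }

    // Returns the current immutable version, which later updates never change.
    Map<String, Integer> snapshot() {
        return root.get();
    }

    void put(String key, int value) {
        while (true) {
            Map<String, Integer> current = root.get();
            Map<String, Integer> copy = new HashMap<>(current);  // 1. copy existing structure
            copy.put(key, value);                                // 2. update copy
            // 3. install copy with CAS on the root pointer; retry if another
            //    updater installed a different version in the meantime.
            if (root.compareAndSet(current, Collections.unmodifiableMap(copy))) {
                return;
            }
        }
    }
}
```

Each update copies the whole map, so this only makes sense when the structure is small or updates are rare.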

Page 87

Read-Copy-Update (RCU)

• Use locking to serialize updates (typically)
  • …but allow readers to operate concurrently with updates
• Ensure that readers don’t go wrong if they access data mid-update
  • Have data structures reachable via a single root pointer: update the root pointer rather than updating the data structure in-place
  • Ensure that updates don’t affect readers – e.g., initializing nodes before splicing them into a list, and retaining “next” pointers in deleted nodes
• Exact semantics offered can be subtle (ongoing research direction)
• Has the memory management problems common to lock-free data structures
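The list example mentioned above (initialize nodes before splicing them in, and keep "next" pointers in deleted nodes) might look like the following sketch. The class name `ReadMostlyList` and the use of `synchronized` to serialize updaters are assumptions for illustration. The garbage collector keeps a removed node alive while any reader still refers to it, so explicit reclamation is not shown.

```java
// Hypothetical sketch: updaters are serialized by a lock, readers traverse
// the list with no lock at all.
final class ReadMostlyList {
    private static final class Node {
        final int value;
        volatile Node next;   // volatile so readers observe splices safely

        Node(int value, Node next) {
            this.value = value;
            this.next = next;
        }
    }

    private volatile Node head;

    // Reader: no locking, may run concurrently with add/remove.
    boolean contains(int value) {
        for (Node n = head; n != null; n = n.next) {
            if (n.value == value) return true;
        }
        return false;
    }

    // Updater: the node is fully initialized before it becomes reachable.
    synchronized void addFirst(int value) {
        head = new Node(value, head);
    }

    // Updater: unlink the node but leave its own "next" pointer intact, so a
    // reader currently standing on it can still reach the rest of the list.
    synchronized boolean remove(int value) {
        Node prev = null;
        for (Node n = head; n != null; prev = n, n = n.next) {
            if (n.value == value) {
                if (prev == null) {
                    head = n.next;
                } else {
                    prev.next = n.next;
                }
                return true;
            }
        }
        return false;
    }
}
```

A reader that overlaps a removal may or may not see the removed value, which is one example of the subtle semantics the slide refers to.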

Page 88

When will these techniques be effective?

• Update rate low
  • So the need to serialize updates is OK
• Readers’ behaviour is OK mid-update
  • E.g., structure small enough to clone, rather than update in place
  • Readers will be OK until a version number check (not enter endless loops / crash / etc.)
• Deallocation or re-use of memory can be controlled

Page 89

Flat combining

Page 90

Flat combining

• “Flat Combining and the Synchronization-Parallelism Tradeoff”, Hendler et al.
• Intuition:
  • Acquiring and releasing a lock involves numerous cache line transfers on the interconnect
  • These may take hundreds of cycles (e.g., between cores in different NUMA nodes)
  • The work protected by the lock may involve only a few memory accesses…
  • …and these accesses may be likely to hit in the cache of the previous lock holder (but miss in your own)
• So: if a lock is not available, request that the current lock holder does the work on your behalf

Page 91

Flat combining

[Figure: a lock protecting a sequential data structure, alongside a request / response table with entries for Thread 1, Thread 2, Thread 3, …]

Page 92

Flat combining: uncontended acquire

1. Write proposed op to req/resp table
2. Acquire lock if it is free
3. Process requests
4. Release lock
5. Pick up response

[Figure: lock, sequential data structure, and request / response table (Thread 1, Thread 2, Thread 3, …); Thread 2’s entry shows Thread 2’s request, then Thread 2’s response]

Page 93

Flat combining: contended acquire

1. Write proposed op to req/resp table
2. See lock is not free
3. Wait for response
4. Pick up response

[Figure: lock, sequential data structure, and request / response table (Thread 1, Thread 2, Thread 3, …); Thread 2’s entry shows Thread 2’s request and response, and Thread 3’s entry shows Thread 3’s request and response]
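The uncontended and contended paths above can be combined into one loop, sketched here for a shared counter. The class name `FlatCombiningCounter`, the caller-supplied `slot` index (one per thread) and the choice of a counter as the sequential data structure are assumptions for illustration; the published flat combining algorithm manages its request records differently, so this is a simplified sketch rather than that algorithm.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Hypothetical sketch: whichever thread holds the lock applies every pending
// request it finds in the table, on behalf of the threads that posted them.
final class FlatCombiningCounter {
    private static final class Request {
        final long delta;
        long result;             // written before 'done', read after 'done'
        volatile boolean done;   // response is ready once this is true

        Request(long delta) {
            this.delta = delta;
        }
    }

    private final AtomicBoolean lock = new AtomicBoolean(false);
    private final AtomicReferenceArray<Request> table;  // request / response table
    private long value;          // sequential data, touched only by the lock holder

    FlatCombiningCounter(int maxThreads) {
        table = new AtomicReferenceArray<>(maxThreads);
    }

    // Each thread must use its own slot index.
    long addAndGet(int slot, long delta) {
        Request req = new Request(delta);
        table.set(slot, req);                               // write proposed op to the table
        while (!req.done) {
            if (!lock.get() && lock.compareAndSet(false, true)) {  // acquire lock if it is free
                try {
                    processRequests();                      // process all pending requests
                } finally {
                    lock.set(false);                        // release lock
                }
            } else {
                Thread.yield();                             // lock not free: wait for response
            }
        }
        return req.result;                                  // pick up response
    }

    private void processRequests() {
        for (int i = 0; i < table.length(); i++) {
            Request r = table.get(i);
            if (r != null && !r.done) {
                value += r.delta;
                r.result = value;
                r.done = true;
            }
        }
    }
}
```

A thread that finds the lock taken never touches `value` itself; it only spins on its own request record until the response appears, or takes over as combiner once the lock is free again.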