Welcome and Lightning Intros

Page 1: Welcome and Lightning Intros

Welcome to RACES’12

Saturday 4 May 13

Page 2: Welcome and Lightning Intros

Thank You

✦ Stefan Marr, Mattias De Wael
✦ Presenters
✦ Authors
✦ Program Committee
✦ Co-chair & Organizer: Theo D’Hondt
✦ Organizers: Andrew Black, Doug Kimelman, Martin Rinard
✦ Voters

Page 3: Welcome and Lightning Intros


✦ Program at:
✦ Strict timekeepers
✦ Dinner?
✦ Recording

Page 4: Welcome and Lightning Intros

9:00 Lightning and Welcome
9:10 Unsynchronized Techniques for Approximate Parallel Computing
9:35 Programming with Relaxed Synchronization
9:50 (Relative) Safety Properties for Relaxed Approximate Programs
10:05 Break
10:35 Nondeterminism is unavoidable, but data races are pure evil
11:00 Discussion
11:45 Lunch
1:15 How FIFO is Your Concurrent FIFO Queue?
1:35 The case for relativistic programming
1:55 Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models
2:15 Does Better Throughput Require Worse Latency?
2:30 Parallel Sorting on a Spatial Computer
2:50 Break
3:25 Dancing with Uncertainty
3:45 Beyond Expert-Only Parallel Programming
4:00 Discussion
4:30 Wrap up

Page 6: Welcome and Lightning Intros





Expandable Array

append(o)
    c = a;
    i =;
    if (c.length <= i)
        n = expand c;
        a = n;
        c = n;
    c.values[i] = o;
    = i + 1;

append(o)
    c = a;
    i =;
    if (c.length <= i)
        n = expand c;
        a = n;
        c = n;
    c.values[i] = o;
    = i + 1;

Page 12: Welcome and Lightning Intros





Expandable Array

append(o)
    c = a;
    i =;
    if (c.length <= i)
        n = expand c;
        a = n;
        c = n;
    c.values[i] = o;
    = i + 1;

append(o)
    c = a;
    i =;
    if (c.length <= i)
        n = expand c;
        a = n;
        c = n;
    c.values[i] = o;
    = i + 1;

Data Race!
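
One way to see the race is to replay two concurrent calls to append under a fixed interleaving. The sketch below is an illustration only, not the presenters' code: the field name `size` is a hypothetical stand-in for the count whose name did not survive on the slide, `expand` is assumed to double the capacity, and the generator-based scheduler is a deliberately coarse model in which each `yield` marks a point where the other thread may run.

```python
class Chunk:
    """Fixed-capacity backing store, playing the role of `c` on the slide."""
    def __init__(self, length):
        self.length = length
        self.values = [None] * length


class ExpandableArray:
    def __init__(self, length=2):
        self.a = Chunk(length)  # current backing chunk (`a` on the slide)
        self.size = 0           # hypothetical name for the element count


def expand(c):
    """Assumed behavior: return a chunk twice as long with c's values copied."""
    n = Chunk(max(1, c.length * 2))
    n.values[:c.length] = c.values
    return n


def append_steps(arr, o):
    """The slide's append, pausing at each yield so another append can run."""
    c = arr.a
    i = arr.size            # read the shared count
    yield
    if c.length <= i:
        n = expand(c)
        arr.a = n
        c = n
    c.values[i] = o         # write the slot chosen from the (possibly stale) count
    yield
    arr.size = i + 1        # publish the new count


def demo(schedule):
    """Run two appends; `schedule` lists which task advances at each step."""
    arr = ExpandableArray()
    tasks = [append_steps(arr, "x"), append_steps(arr, "y")]
    for t in schedule:
        next(tasks[t], None)  # the default swallows StopIteration on the last step
    return arr.size, arr.a.values[:arr.size]


print(demo([0, 0, 0, 1, 1, 1]))  # one append after the other: both elements kept
print(demo([0, 1, 0, 1, 0, 1]))  # interleaved: both read the same count, one element is lost
```

In the interleaved schedule both calls read the same count before either publishes a new one, so they write the same slot and the array ends up one element short.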

Page 16: Welcome and Lightning Intros


Towards Approximate Computing: Programming with Relaxed Synchronization

[Figure: diagram with the labels "Precise", "Less Precise", "Accurate", "Less Accurate, less up-to-date, possibly", "Computing model today", "Human Brain", and "Relaxed Synchronization"]

Renganarayanan et al., IBM Research, RACES’12, Oct. 21, 2012

Page 17: Welcome and Lightning Intros

(Relative) Safety Properties for Relaxed Approximate Programs

Michael Carbin and Martin Rinard

Page 18: Welcome and Lightning Intros

Nondeterminism is Unavoidable, but Data Races are Pure Evil

Hans-J. Boehm, HP Labs

• Much low-level code is inherently nondeterministic, but

• Data races
  – Are forbidden by C/C++/OpenMP/POSIX language standards.
  – May break code now or when you recompile.
  – Don’t improve scalability significantly, even if the code still works.
  – Are easily avoidable in C11 & C++11.

Page 19: Welcome and Lightning Intros

How FIFO is Your Concurrent FIFO Queue?

Andreas Haas, Christoph M. Kirsch, Michael Lippautz, Hannes Payer

University of Salzburg

semantically correct and therefore “slow” FIFO queues

semantically relaxed and thereby “fast” FIFO queues

Semantically relaxed FIFO queues can appear more FIFO than semantically correct FIFO queues.


Page 20: Welcome and Lightning Intros

A Case for Relativistic Programming

• Alter ordering requirements (Causal, not Total)
• Don’t alter correctness requirements
• High performance, highly scalable
• Easy to program

Philip W. Howard and Jonathan Walpole

Page 21: Welcome and Lightning Intros

IBM Research

Cain and Lipasti, RACES’12, Oct 21, 2012 (© 2012 IBM Corporation)

Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering

§ From the RACES website:
  – “an approach towards scalability that reduces synchronization requirements drastically, possibly to the point of discarding them altogether.”

§ A hardware developer’s perspective:
  – Constraints of Legacy Code
    • What if we want to apply this principle, but have no control over the applications that are running on a system?
  – Can one build a coherence protocol that avoids synchronizing cores as much as possible?
    • For example by allowing each core to use stale versions of cache lines as long as possible
    • While maintaining architectural correctness; i.e. we will not break existing code
    • If we do that, what will happen?

Trey Cain and Mikko Lipasti

Page 22: Welcome and Lightning Intros

Does Better Throughput Require Worse Latency?

David Ungar, Doug Kimelman, Sam Adams, Mark Wegman

IBM T. J. Watson Research Center

Introduction

As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.

Throughput and Latency

For this proposal, we define throughput and latency as follows:

• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.

• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application.

Table 1 presents some fictional numbers to illustrate the concept. It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of three in throughput.

Table 1: Hypothetical figures if tradeoff were linear

                                                         Version A   Version B
Core count                                               10          10
Best-possible inter-core latency                         200 µs      200 µs
Mean observed latency in application                     1,000 µs    3,000 µs
Normalized latency (observed / best possible)            5           15
App.-operations/sec. (1 core)                            1,000       1,000
App.-operations/sec. (10 cores)                          2,500       7,500
Normalized throughput (normalized to perfect scaling)    0.25        0.75
Latency / Throughput                                     20          20
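
The derived figures for Versions A and B follow directly from the two definitions above, and a short script makes the arithmetic explicit. This is only a sketch: the function and field names are illustrative choices rather than anything from the paper, and the inputs are the fictional numbers of Table 1.

```python
def normalized_latency(observed_us, best_us):
    """Observed inter-core latency divided by the platform's best possible."""
    return observed_us / best_us


def normalized_throughput(ops_n_cores, ops_one_core, cores):
    """Measured throughput divided by perfect linear scaling (cores x one-core rate)."""
    return ops_n_cores / (cores * ops_one_core)


# Fictional inputs from Table 1 (latencies in microseconds, rates in operations/sec).
versions = {
    "A": dict(cores=10, best_us=200, observed_us=1000, ops_1=1000, ops_n=2500),
    "B": dict(cores=10, best_us=200, observed_us=3000, ops_1=1000, ops_n=7500),
}


def summarize(v):
    """Return (normalized latency, normalized throughput, latency / throughput)."""
    lat = normalized_latency(v["observed_us"], v["best_us"])
    thr = normalized_throughput(v["ops_n"], v["ops_1"], v["cores"])
    return lat, thr, lat / thr


for name, v in versions.items():
    lat, thr, ratio = summarize(v)
    print(f"Version {name}: latency {lat:g}, throughput {thr:g}, ratio {ratio:g}")
```

Both versions come out with the same latency-to-throughput ratio of 20, which is what "linear" means in this illustration.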

A Progression of Techniques Trading Throughput for Latency

As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:

• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.

• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must
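
The two styles described so far can be contrasted in a small sketch. It rests on several assumptions: Python exposes no compare-and-swap instruction, so the `Word` class below emulates one with a private lock, and only the shape of the retry loop is meant to carry over to real hardware. The source text is cut off before it says what the updater does on failure; retrying with a freshly read value is the usual convention and is assumed here. All class and function names are hypothetical.

```python
import threading


class LockedCounter:
    """Mutex style: take the lock, then read and modify the shared data."""
    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0

    def increment(self):
        with self._lock:
            self.value += 1


class Word:
    """A single shared word with an emulated compare-and-swap.

    The private lock stands in for the atomicity that a hardware CAS
    instruction would provide; it is not part of the lock-free technique.
    """
    def __init__(self, value=0):
        self._guard = threading.Lock()
        self._value = value

    def load(self):
        return self._value

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value != expected:
                return False  # another thread changed the word first
            self._value = new
            return True


def lock_free_increment(word):
    """Prepare an updated value, try to publish it, and retry on conflict."""
    retries = 0
    while True:
        old = word.load()
        if word.compare_and_swap(old, old + 1):
            return retries
        retries += 1  # assumed behavior: start over from a fresh read


def hammer(fn, threads=4, per_thread=1000):
    """Call fn repeatedly from several threads and wait for all of them."""
    def work():
        for _ in range(per_thread):
            fn()
    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()


locked = LockedCounter()
hammer(locked.increment)
word = Word()
hammer(lambda: lock_free_increment(word))
print(locked.value, word.load())
```

Both counters reach the same total. The difference is in where waiting happens: the locked version blocks before touching the data, while the compare-and-swap version does its work optimistically and pays only when a conflict is detected.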

Does Better Throughput Require Worse Latency?

David Ungar, Doug Kimelman, Sam Adams, Mark Wegman

IBM T. J. Watson Research Center

IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.

Throughput and LatencyFor this proposal, we define throughput and latency as follows:

• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.

• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it

constitutes a real lower bound for the overall latency that is apparent to an application.

Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput.

Table 1: Hypothetical figuresif tradeoff were linear

Table 1: Hypothetical figuresif tradeoff were linear

Table 1: Hypothetical figuresif tradeoff were linear


Core count

B e s t - p o s s i b l e i n t e r - c o r e latency

Mean observed latency in application

Normalized latency(observed / best possible)

App-operations/sec. ( 1 core)

App.-operations/sec. ( 10 cores)

Normalized throughput (normalized to perfect scaling)

Latency / Throughput


10 10

200 µs 200 µs

1,000 µs 3,000 µs

5 15

1,000 1,000

2,500 7,500

0.25 0.75

20 20

A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:

• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.

• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must

Does Better Throughput Require Worse Latency?

David Ungar, Doug Kimelman, Sam Adams, Mark Wegman

IBM T. J. Watson Research Center

IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.

Throughput and LatencyFor this proposal, we define throughput and latency as follows:

• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.

• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it

constitutes a real lower bound for the overall latency that is apparent to an application.

Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput.

Table 1: Hypothetical figuresif tradeoff were linear

Table 1: Hypothetical figuresif tradeoff were linear

Table 1: Hypothetical figuresif tradeoff were linear


Core count

B e s t - p o s s i b l e i n t e r - c o r e latency

Mean observed latency in application

Normalized latency(observed / best possible)

App-operations/sec. ( 1 core)

App.-operations/sec. ( 10 cores)

Normalized throughput (normalized to perfect scaling)

Latency / Throughput


10 10

200 µs 200 µs

1,000 µs 3,000 µs

5 15

1,000 1,000

2,500 7,500

0.25 0.75

20 20

A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:

• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.

• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must

Does Better Throughput Require Worse Latency?

David Ungar, Doug Kimelman, Sam Adams, Mark Wegman

IBM T. J. Watson Research Center

IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.

Throughput and LatencyFor this proposal, we define throughput and latency as follows:

• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.

• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it

constitutes a real lower bound for the overall latency that is apparent to an application.

Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput.

Table 1: Hypothetical figuresif tradeoff were linear

Table 1: Hypothetical figuresif tradeoff were linear

Table 1: Hypothetical figuresif tradeoff were linear


Core count

B e s t - p o s s i b l e i n t e r - c o r e latency

Mean observed latency in application

Normalized latency(observed / best possible)

App-operations/sec. ( 1 core)

App.-operations/sec. ( 10 cores)

Normalized throughput (normalized to perfect scaling)

Latency / Throughput


10 10

200 µs 200 µs

1,000 µs 3,000 µs

5 15

1,000 1,000

2,500 7,500

0.25 0.75

20 20

A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:

• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.

• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must

Does Better Throughput Require Worse Latency?

David Ungar, Doug Kimelman, Sam Adams, Mark Wegman

IBM T. J. Watson Research Center

IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.

Throughput and LatencyFor this proposal, we define throughput and latency as follows:

• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.

• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it

constitutes a real lower bound for the overall latency that is apparent to an application.

Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput.

Table 1: Hypothetical figuresif tradeoff were linear

Table 1: Hypothetical figuresif tradeoff were linear

Table 1: Hypothetical figuresif tradeoff were linear


Core count

B e s t - p o s s i b l e i n t e r - c o r e latency

Mean observed latency in application

Normalized latency(observed / best possible)

App-operations/sec. ( 1 core)

App.-operations/sec. ( 10 cores)

Normalized throughput (normalized to perfect scaling)

Latency / Throughput


10 10

200 µs 200 µs

1,000 µs 3,000 µs

5 15

1,000 1,000

2,500 7,500

0.25 0.75

20 20

A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:

• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.

• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must

Does Better Throughput Require Worse Latency?

David Ungar, Doug Kimelman, Sam Adams, Mark Wegman

IBM T. J. Watson Research Center

IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.

Throughput and LatencyFor this proposal, we define throughput and latency as follows:

• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.

• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application.
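The two definitions above can be written compactly. The symbols below are our own shorthand and do not appear in the original text: $W(N)$ is the application-level work completed per unit time on $N$ cores, $\bar{t}_{\mathrm{obs}}$ is the mean observed inter-core latency in the application, and $t_{\mathrm{best}}$ is the best latency the platform permits. The last line is only one possible (linear) reading of the conjectured law, with $k$ an unspecified constant.

```latex
\begin{align*}
  T &= \frac{W(N)}{N \cdot W(1)}
      && \text{(normalized throughput; } T = 1 \text{ means perfect linear scaling)} \\
  L &= \frac{\bar{t}_{\mathrm{obs}}}{t_{\mathrm{best}}}
      && \text{(normalized latency; } L \ge 1 \text{ since } t_{\mathrm{best}} \text{ is a lower bound)} \\
  \frac{L}{T} &= k
      && \text{(hypothetical linear form of the law)}
\end{align*}
```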

Table 1 presents some fictional numbers to illustrate the concept. It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of three in throughput.

Table 1: Hypothetical figures if tradeoff were linear

                                                        Version A    Version B
Core count                                              10           10
Best-possible inter-core latency                        200 µs       200 µs
Mean observed latency in application                    1,000 µs     3,000 µs
Normalized latency (observed / best possible)           5            15
App.-operations/sec. (1 core)                           1,000        1,000
App.-operations/sec. (10 cores)                         2,500        7,500
Normalized throughput (normalized to perfect scaling)   0.25         0.75
Latency / Throughput                                    20           20
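The arithmetic behind Table 1 can be checked with a short sketch. This is only an illustration of the definitions given earlier, not code from the authors; the class name `Measurement` and its field names are hypothetical, and the input figures are the fictional ones from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Raw figures for one application version (names are illustrative)."""
    cores: int
    best_latency_us: float        # best-possible inter-core latency
    observed_latency_us: float    # mean observed latency in the application
    ops_per_sec_one_core: float   # application operations/sec on 1 core
    ops_per_sec_all_cores: float  # application operations/sec on `cores` cores

    def normalized_latency(self) -> float:
        # Observed latency relative to the platform's lower bound.
        return self.observed_latency_us / self.best_latency_us

    def normalized_throughput(self) -> float:
        # Achieved work relative to perfect linear scaling.
        return self.ops_per_sec_all_cores / (self.cores * self.ops_per_sec_one_core)

    def latency_per_throughput(self) -> float:
        # Constant across versions if the tradeoff is linear.
        return self.normalized_latency() / self.normalized_throughput()


# Fictional figures from Table 1.
version_a = Measurement(10, 200.0, 1_000.0, 1_000.0, 2_500.0)
version_b = Measurement(10, 200.0, 3_000.0, 1_000.0, 7_500.0)

for name, m in (("A", version_a), ("B", version_b)):
    print(name, m.normalized_latency(), m.normalized_throughput(),
          m.latency_per_throughput())
```

Both versions yield the same latency-to-throughput ratio, which is what the table means by a linear tradeoff.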

A Progression of Techniques Trading Throughput for Latency

As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:

• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock and the processing time lost while waiting for a lock can severely limit throughput.

• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must

Does Better Throughput Require Worse Latency?

David Ungar, Doug Kimelman, Sam Adams, Mark Wegman

IBM T. J. Watson Research Center







Does Better Throughput Require Worse Latency?

David Ungar, Doug Kimelman, Sam Adams, Mark Wegman

IBM T. J. Watson Research Center

IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:

Algorithms that improve application-level throughput worsen inter-core application-level latency.

We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.

Throughput and LatencyFor this proposal, we define throughput and latency as follows:

• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.

• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application.

Table 1 presents some fictional numbers to illustrate the concept: it describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of three in throughput.

Table 1: Hypothetical figures if tradeoff were linear

                                                        Version A    Version B
Core count                                              10           10
Best-possible inter-core latency                        200 µs       200 µs
Mean observed latency in application                    1,000 µs     3,000 µs
Normalized latency (observed / best possible)           5            15
App-operations/sec. (1 core)                            1,000        1,000
App-operations/sec. (10 cores)                          2,500        7,500
Normalized throughput (normalized to perfect scaling)   0.25         0.75
Latency / Throughput                                    20           20
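The two normalizations defined above, and the Table 1 figures, can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original proposal: the `Run` class and its field names are hypothetical, chosen only to mirror the rows of Table 1.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Measurements for one version of an application (hypothetical fields)."""
    cores: int
    best_latency_us: float       # best-possible inter-core latency of the platform
    observed_latency_us: float   # mean latency observed by the application
    ops_per_sec_1_core: float
    ops_per_sec_n_cores: float

    def normalized_latency(self) -> float:
        # Observed latency relative to the platform's best possible latency.
        return self.observed_latency_us / self.best_latency_us

    def normalized_throughput(self) -> float:
        # Achieved work relative to perfect linear scaling over all cores.
        perfect = self.cores * self.ops_per_sec_1_core
        return self.ops_per_sec_n_cores / perfect

    def latency_over_throughput(self) -> float:
        return self.normalized_latency() / self.normalized_throughput()


# Figures taken from Table 1 (fictional numbers in the source).
version_a = Run(10, 200, 1_000, 1_000, 2_500)
version_b = Run(10, 200, 3_000, 1_000, 7_500)

for name, run in (("A", version_a), ("B", version_b)):
    print(name, run.normalized_latency(), run.normalized_throughput(),
          run.latency_over_throughput())
```

Under a linear version of the proposed law, the last ratio would come out the same for both versions, which is what the fictional figures are built to show.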

A Progression of Techniques Trading Throughput for Latency

As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:

• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock and the processing time lost while waiting for a lock can severely limit throughput.

• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must
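The two styles above can be contrasted on a shared counter. This is only an illustrative sketch under stated assumptions: Python exposes no compare-and-swap instruction, so the hypothetical `EmulatedAtomicWord` class fakes the atomicity with an internal lock, and the retry loop shown is the common textbook shape for such updates rather than anything taken from the source. All class and function names are invented for the example.

```python
import threading


class LockedCounter:
    """Lock style: obtain the mutex, then read and modify the shared data."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.value = 0

    def increment(self) -> None:
        with self._lock:
            self.value += 1


class EmulatedAtomicWord:
    """Stand-in for a single machine word that supports compare-and-swap.

    ASSUMPTION: atomicity is emulated with an internal lock, because Python
    has no CAS primitive. Only the calling pattern is the point here.
    """

    def __init__(self, value: int = 0) -> None:
        self._guard = threading.Lock()
        self._value = value

    def load(self) -> int:
        with self._guard:
            return self._value

    def compare_and_swap(self, expected: int, new: int) -> bool:
        # Store `new` only if the word still holds `expected`.
        with self._guard:
            if self._value != expected:
                return False
            self._value = new
            return True


def lock_free_increment(word: EmulatedAtomicWord) -> int:
    """Prepare a new value, attempt the swap, try again if it fails.

    Returns the number of failed attempts.
    """
    retries = 0
    while True:
        seen = word.load()
        if word.compare_and_swap(seen, seen + 1):
            return retries
        retries += 1


def run(worker, threads: int = 4, per_thread: int = 1000) -> None:
    """Call `worker` per_thread times on each of `threads` threads."""
    def body() -> None:
        for _ in range(per_thread):
            worker()

    pool = [threading.Thread(target=body) for _ in range(threads)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()


locked = LockedCounter()
run(locked.increment)

word = EmulatedAtomicWord()
run(lambda: lock_free_increment(word))

print(locked.value, word.load())
```

Both counters end at the same total; the difference lies in whether a thread waits for the lock up front or does its work optimistically and repeats it when the swap fails.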


Taking turns, broadcasting changes: Low latency

Dividing into sections, round-robin: High throughput

throughput -> parallel -> distributed/replicated -> latency

David Ungar, Doug Kimelman, Sam Adams and Mark Wegman: IBM


spatial computing offers insights into:

• the costs and constraints of communication in large parallel computer arrays

• how to design algorithms that respect these costs and constraints

Parallel sorting on a spatial computer

Max Orhai, Andrew P. Black


Dancing with Uncertainty

Sasa Misailovic, Stelios Sidiroglou and Martin Rinard


© 2009 IBM Corporation


Sea Change In Linux-Kernel Parallel Programming

In 2006, Linus Torvalds noted that since 2003, the Linux kernel community's grasp of concurrency had improved to the point that patches were often correct at first submission

Why the improvement?
– Not programming language: C before, during, and after
– Not synchronization primitives: Locking before, during, and after
– Not a change in personnel: Relatively low turnover
– Not born parallel programmers: Remember Big Kernel Lock!

So what was it?
– Stick around for the discussion this afternoon and find out!!!

Paul E. McKenney: Beyond Expert-Only Parallel Programming?
