Welcome and Lightning Intros
Transcript of Welcome and Lightning Intros
Welcome to RACES’12
Saturday 4 May 13
Thank You
✦ Stefan Marr, Mattias De Wael
✦ Presenters
✦ Authors
✦ Program Committee
✦ Co-chair & Organizer: Theo D’Hondt
✦ Organizers: Andrew Black, Doug Kimelman, Martin Rinard
✦ Voters
Announcements
✦ Program at: http://soft.vub.ac.be/races/program/
✦ Strict timekeepers
✦ Dinner?
✦ Recording
9:00 Lightning and Welcome
9:10 Unsynchronized Techniques for Approximate Parallel Computing
9:35 Programming with Relaxed Synchronization
9:50 (Relative) Safety Properties for Relaxed Approximate Programs
10:05 Break
10:35 Nondeterminism is unavoidable, but data races are pure evil
11:00 Discussion
11:45 Lunch
1:15 How FIFO is Your Concurrent FIFO Queue?
1:35 The case for relativistic programming
1:55 Edge Chasing Delayed Consistency: Pushing the Limits of Weak Memory Models
2:15 Does Better Throughput Require Worse Latency?
2:30 Parallel Sorting on a Spatial Computer
2:50 Break
3:25 Dancing with Uncertainty
3:45 Beyond Expert-Only Parallel Programming
4:00 Discussion
4:30 Wrap up
Lightning
Expandable Array
[Diagram: a shared array object a with fields length = 24, next, and values; two threads run append concurrently.]

append(o)
  c = a;
  i = c.next;
  if (c.length <= i)
    n = expand c;
    a = n;
    c = n;
  c.values[i] = o;
  c.next = i + 1;

Data Race!
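The lost update that the animated slides build up to can be replayed deterministically. The following Python sketch (the class name Chunk and the step-by-step generator are my own construction, not code from the talk) pauses each append at the racy window between reading next and writing the slot, so both "threads" read the same index:

```python
# Deterministic replay of the slide's data race: both appends read next == 0,
# both write slot 0, and one element is silently lost.

class Chunk:
    def __init__(self, length):
        self.length = length
        self.next = 0                    # index of the first free slot
        self.values = [None] * length

a = Chunk(24)

def append_steps(o):
    """Run the slide's append(o) one step at a time so two calls can interleave."""
    global a
    c = a
    i = c.next
    yield                                # racy window: another thread may run here
    if c.length <= i:
        n = Chunk(c.length * 2)          # "expand c"
        n.values[:c.length] = c.values
        n.next = c.next
        a = n
        c = n
    c.values[i] = o
    c.next = i + 1

t1, t2 = append_steps("x"), append_steps("y")
next(t1)                                 # thread 1 reads next == 0, then is preempted
next(t2)                                 # thread 2 also reads next == 0
for t in (t2, t1):                       # both finish: both write slot 0
    for _ in t:
        pass

print(a.next)                            # 1  -- two appends, only one slot consumed
print(a.values[0])                       # "x" overwrote "y"
```

After both appends complete, next is 1 instead of 2 and "y" is gone, which is exactly the race the slide flags.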
Hardware
Towards Approximate Computing: Programming with Relaxed Synchronization
Renganarayanan et al., IBM Research, RACES’12, Oct. 21, 2012
[Diagram: a spectrum from the computing model of today (precise, reliable computation; accurate data) toward the human brain (less precise, variable computation; less accurate, less up-to-date, possibly corrupted data), motivating relaxed synchronization.]
(Relative) Safety Properties for Relaxed Approximate Programs
Michael Carbin and Martin Rinard
Nondeterminism is Unavoidable, but Data Races are Pure Evil
Hans-J. Boehm, HP Labs
• Much low-level code is inherently nondeterministic, but
• Data races
– Are forbidden by the C/C++/OpenMP/POSIX language standards.
– May break code now or when you recompile.
– Don’t improve scalability significantly, even if the code still works.
– Are easily avoidable in C11 & C++11.
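The "easily avoidable" point can be illustrated on the expandable array from the lightning demo. A minimal sketch (my own Python rendering, not Boehm's code, which would use C11/C++11 atomics or mutexes): guarding append with a lock closes the racy window between reading next and writing the slot.

```python
# Race-free variant of the expandable-array append: the lock makes the
# read-index / write-slot / bump-index sequence atomic.

import threading

class ExpandableArray:
    def __init__(self, length=24):
        self.length = length
        self.next = 0
        self.values = [None] * length
        self._lock = threading.Lock()

    def append(self, o):
        with self._lock:                       # one thread at a time: no racy window
            i = self.next
            if self.length <= i:               # expand while holding the lock
                self.values.extend([None] * self.length)
                self.length *= 2
            self.values[i] = o
            self.next = i + 1

arr = ExpandableArray(length=2)
threads = [threading.Thread(target=arr.append, args=(k,)) for k in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(arr.next)                                # 100 -- every append got its own slot
```

Unlike the racy version, no element is lost regardless of how the threads interleave.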
How FIFO is Your Concurrent FIFO Queue?
Andreas Haas, Christoph M. Kirsch, Michael Lippautz, Hannes Payer
University of Salzburg
Semantically correct (and therefore “slow”) FIFO queues vs. semantically relaxed (and thereby “fast”) FIFO queues:
semantically relaxed FIFO queues can appear more FIFO than semantically correct FIFO queues.
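To make "semantically relaxed" concrete, here is a toy sketch of one common relaxation (a k-relaxed dequeue; this is my illustrative construction, not the Salzburg queues, which are concurrent data structures): dequeue may return any of the first k elements rather than strictly the head.

```python
# Toy k-relaxed queue: dequeue picks uniformly among the first k elements,
# so FIFO order is violated by at most k-1 positions per element.

import random

class KRelaxedQueue:
    def __init__(self, k, seed=None):
        self.k = k
        self.items = []
        self._rng = random.Random(seed)

    def enqueue(self, x):
        self.items.append(x)

    def dequeue(self):
        window = min(self.k, len(self.items))
        return self.items.pop(self._rng.randrange(window))

q = KRelaxedQueue(k=3, seed=0)
for x in range(10):
    q.enqueue(x)
out = [q.dequeue() for _ in range(10)]
print(out)   # a permutation of 0..9; out[i] <= i + k - 1 at every step
```

In a concurrent setting this widened window is what lets threads dequeue from different slots without contending on a single head pointer.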
A Case for Relativistic Programming
Philip W. Howard and Jonathan Walpole
• Alter ordering requirements (causal, not total)
• Don’t alter correctness requirements
• High performance, highly scalable
• Easy to program
IBM Research
Edge Chasing Delayed Consistency: Pushing the Limits of Weak Ordering
Trey Cain and Mikko Lipasti, RACES’12, Oct 21, 2012
§ From the RACES website:
– “an approach towards scalability that reduces synchronization requirements drastically, possibly to the point of discarding them altogether.”
§ A hardware developer’s perspective:
– Constraints of legacy code
• What if we want to apply this principle, but have no control over the applications that are running on a system?
– Can one build a coherence protocol that avoids synchronizing cores as much as possible?
• For example, by allowing each core to use stale versions of cache lines for as long as possible
• While maintaining architectural correctness; i.e., we will not break existing code
• If we do that, what will happen?
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Introduction
As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:
Algorithms that improve application-level throughput worsen inter-core application-level latency.
We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and Latency
For this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers to illustrate the concept: it describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of three in throughput.
Table 1: Hypothetical figures if tradeoff were linear

Version | Core count | Best-possible inter-core latency | Mean observed latency in application | Normalized latency (observed / best possible) | App-operations/sec (1 core) | App-operations/sec (10 cores) | Normalized throughput (vs. perfect scaling) | Latency / Throughput
A | 10 | 200 µs | 1,000 µs | 5 | 1,000 | 2,500 | 0.25 | 20
B | 10 | 200 µs | 3,000 µs | 15 | 1,000 | 7,500 | 0.75 | 20
A Progression of Techniques Trading Throughput for Latency
As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must retry.
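The prepare-attempt-retry cycle described above can be sketched as follows. Python has no hardware compare-and-swap, so this sketch (my own illustration) emulates the atomic instruction with a small lock; real lock-free code would use the processor's CAS directly:

```python
# Lock-free retry pattern: prepare a new value, CAS it into a single word,
# and retry from the top if another thread changed the word first.

import threading

class Word:
    """A single word supporting an (emulated) atomic compare-and-swap."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()   # stands in for the hardware atomic

    def load(self):
        return self._value

    def cas(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False                # another thread got there first: caller retries

counter = Word(0)

def increment(n):
    for _ in range(n):
        while True:                     # retry loop
            old = counter.load()        # prepare an updated value from the current one
            if counter.cas(old, old + 1):
                break                   # CAS succeeded: the update is published

threads = [threading.Thread(target=increment, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.load())                   # 4000 -- no increment is ever lost
```

Each increment succeeds exactly once no matter how the threads interleave, which is the correctness property the atomic instruction buys; the cost is wasted work on every failed CAS under contention.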
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Introduction

As we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:
Algorithms that improve application-level throughput worsen inter-core application-level latency.
We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and Latency

For this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers to illustrate the concept: it describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of three in throughput.
Table 1: Hypothetical figures if tradeoff were linear

  Metric                                          Version A    Version B
  Core count                                      10           10
  Best-possible inter-core latency                200 µs       200 µs
  Mean observed latency in application            1,000 µs     3,000 µs
  Normalized latency (observed / best possible)   5            15
  App. operations/sec (1 core)                    1,000        1,000
  App. operations/sec (10 cores)                  2,500        7,500
  Normalized throughput (vs. perfect scaling)     0.25         0.75
  Latency / Throughput                            20           20
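The normalized columns of Table 1 can be reproduced directly from the raw figures. The sketch below (illustrative Java, with the table's hypothetical numbers hard-coded) computes normalized latency, normalized throughput, and their ratio for both versions; the constant ratio of 20 is exactly what a linear tradeoff would predict.

```java
// Sketch: reproduce the normalized columns of Table 1 from the raw figures.
public class TradeoffTable {
    // Latency/throughput ratio for one application version.
    static double ratio(double bestLatencyUs, double observedLatencyUs,
                        double opsOneCore, double opsNCores, int cores) {
        double normalizedLatency = observedLatencyUs / bestLatencyUs;
        // Perfect scaling would perform cores * opsOneCore operations per second.
        double normalizedThroughput = opsNCores / (cores * opsOneCore);
        return normalizedLatency / normalizedThroughput;
    }

    public static void main(String[] args) {
        // Version A: 1,000 µs observed latency, 2,500 ops/s on 10 cores.
        double a = ratio(200, 1_000, 1_000, 2_500, 10);
        // Version B: 3,000 µs observed latency, 7,500 ops/s on 10 cores.
        double b = ratio(200, 3_000, 1_000, 7_500, 10);
        System.out.println(a + " " + b);
    }
}
```

Both calls yield 20, consistent with a linear version of the proposed law.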
A Progression of Techniques Trading Throughput for Latency

As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter observes any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for one, can severely limit throughput.
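As a concrete sketch of the locking style (illustrative Java, not the paper's own code), here is the expandable-array append from the workshop slides guarded by a ReentrantLock; the class and field names are assumptions for the example:

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch of lock-based protection: every access to the expandable
// array happens while holding a single mutex.
public class LockedArray {
    private final ReentrantLock lock = new ReentrantLock();
    private Object[] values = new Object[4];
    private int next = 0;

    public void append(Object o) {
        lock.lock();              // waiters block until the holder releases
        try {
            if (next >= values.length) {              // expand when full
                Object[] bigger = new Object[values.length * 2];
                System.arraycopy(values, 0, bigger, 0, values.length);
                values = bigger;
            }
            values[next++] = o;
        } finally {
            lock.unlock();        // release promptly: waiters now see the change
        }
    }

    public int size() {
        lock.lock();
        try { return next; } finally { lock.unlock(); }
    }
}
```

The low latency comes from the release point: any waiter observes the new element immediately after unlock(); the throughput cost is the serialization of every append behind the single lock.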
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must retry the operation.
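A minimal sketch of that prepare-then-CAS retry pattern (illustrative Java, not from the paper), confining the race to a single counter word via AtomicInteger.compareAndSet:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the lock-free style: all racing is confined to one word,
// updated with Compare-And-Swap; a losing updater simply retries.
public class CasCounter {
    private final AtomicInteger word = new AtomicInteger(0);

    public int increment() {
        while (true) {
            int seen = word.get();       // read the current value
            int updated = seen + 1;      // prepare the updated value
            // CAS succeeds only if no other thread changed the word meanwhile.
            if (word.compareAndSet(seen, updated)) {
                return updated;
            }
            // Another thread won the race between get() and CAS: retry.
        }
    }

    public int value() { return word.get(); }
}
```

No thread ever blocks holding a lock, but a heavily contended word can force many retries, which is one way the latency/throughput tension shows up in practice.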
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:
Algorithms that improve application-level throughput worsen inter-core application-level latency.
We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and LatencyFor this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figuresif tradeoff were linear
Table 1: Hypothetical figuresif tradeoff were linear
Table 1: Hypothetical figuresif tradeoff were linear
Version
Core count
B e s t - p o s s i b l e i n t e r - c o r e latency
Mean observed latency in application
Normalized latency(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput (normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:
Algorithms that improve application-level throughput worsen inter-core application-level latency.
We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and LatencyFor this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figuresif tradeoff were linear
Table 1: Hypothetical figuresif tradeoff were linear
Table 1: Hypothetical figuresif tradeoff were linear
Version
Core count
B e s t - p o s s i b l e i n t e r - c o r e latency
Mean observed latency in application
Normalized latency(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput (normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:
Algorithms that improve application-level throughput worsen inter-core application-level latency.
We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and LatencyFor this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figuresif tradeoff were linear
Table 1: Hypothetical figuresif tradeoff were linear
Table 1: Hypothetical figuresif tradeoff were linear
Version
Core count
B e s t - p o s s i b l e i n t e r - c o r e latency
Mean observed latency in application
Normalized latency(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput (normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:
Algorithms that improve application-level throughput worsen inter-core application-level latency.
We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and LatencyFor this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figuresif tradeoff were linear
Table 1: Hypothetical figuresif tradeoff were linear
Table 1: Hypothetical figuresif tradeoff were linear
Version
Core count
B e s t - p o s s i b l e i n t e r - c o r e latency
Mean observed latency in application
Normalized latency(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput (normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:
Algorithms that improve application-level throughput worsen inter-core application-level latency.
We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and LatencyFor this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but nonetheless it
constitutes a real lower bound for the overall latency that is apparent to an application.
Table 1 presents some fictional numbers in order to illustrate the concept: It describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of three in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figuresif tradeoff were linear
Table 1: Hypothetical figuresif tradeoff were linear
Table 1: Hypothetical figuresif tradeoff were linear
Version
Core count
B e s t - p o s s i b l e i n t e r - c o r e latency
Mean observed latency in application
Normalized latency(observed / best possible)
App-operations/sec. ( 1 core)
App.-operations/sec. ( 10 cores)
Normalized throughput (normalized to perfect scaling)
Latency / Throughput
A B
10 10
200 µs 200 µs
1,000 µs 3,000 µs
5 15
1,000 1,000
2,500 7,500
0.25 0.75
20 20
A Progression of Techniques Trading Throughput for Latency As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread obtains a shared lock (or mutex) on a data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead required to obtain a lock, and the processing time lost while waiting for a lock can severely limit throughput.
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction ensures that the word was not changed by some other thread while the updater was working. If it was changed, the updater must
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
IntroductionAs we continue to make the transition from uniprocessor to multicore programming, pushed along by the changing trajectory of hardware technology and system architecture, we are seeing an explosion of techniques for crossing the chasm between sequential and parallel data structures and algorithms. In considering a spectrum of techniques for moderating application access to shared data on multicore and manycore systems, we have observed that as application synchronization latency gets closer to hardware inter-core latency, throughput decreases. The spectrum of techniques we looked at includes: locks and mutexes, lock-free approaches based on atomic instructions, RCU, and (non-deterministic) race-and-repair. Below we present definitions of our notion of synchronization latency and throughput, and describe our observation in greater detail. We conclude by wondering whether there is a fundamental law relating latency to throughput:
Algorithms that improve application-level throughput worsen inter-core application-level latency.
We believe that such a law would be of great utility as a unification that would provide a common perspective from which to view and compare synchronization approaches.
Throughput and LatencyFor this proposal, we define throughput and latency as follows:
• Throughput is the amount of application-level work performed in unit time, normalized to the amount of work that would be accomplished with perfect linear scaling. In other words, a throughput of 1.0 would be achieved by a system that performed N times as much work per unit time with N cores as it did with one core. This formulation reflects how well an application exploits the parallelism of multiple cores.
• Latency denotes the mean time required for a thread on one core to observe a change effected by a thread on another core, normalized to the best latency possible for the given platform. This formulation isolates the latency inherent in the algorithms and data structures from the latency arising out of the platform (operating system, processor, storage hierarchy, communication network, etc.). As an example of algorithm-and-data-structure-imposed latency, if one chooses to replicate a data structure, it will take additional time to update the replicas. The best possible latency for a given platform can be difficult to determine, but it nonetheless constitutes a real lower bound for the overall latency apparent to an application.
Table 1 presents some fictional numbers to illustrate the concept: it describes two versions of the same application, A and B, running on a hypothetical system. The numbers are consistent with a linear version of the proposed law, because Version B sacrifices a factor of 3 in latency to gain a factor of 3 in throughput.
Table 1: Hypothetical figures if the tradeoff were linear

                                                Version A    Version B
  Core count                                    10           10
  Best-possible inter-core latency              200 µs       200 µs
  Mean observed latency in application          1,000 µs     3,000 µs
  Normalized latency (observed / best possible) 5            15
  App. operations/sec. (1 core)                 1,000        1,000
  App. operations/sec. (10 cores)               2,500        7,500
  Normalized throughput (vs. perfect scaling)   0.25         0.75
  Latency / Throughput                          20           20
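As a sanity check, the normalized figures in Table 1 follow directly from the definitions above. A minimal sketch (the inputs are the table's hypothetical numbers, not measurements):

```python
def normalized(obs_latency_us, ops_n_cores,
               best_latency_us=200.0, ops_one_core=1000.0, cores=10):
    """Apply the paper's two normalizations; defaults are Table 1's
    hypothetical platform (200 µs best latency, 1,000 ops/sec on 1 core)."""
    # Normalized latency: observed latency / best possible for the platform.
    norm_latency = obs_latency_us / best_latency_us
    # Normalized throughput: fraction of perfect linear scaling on `cores` cores.
    norm_throughput = ops_n_cores / (cores * ops_one_core)
    return norm_latency, norm_throughput

# Version A: 1,000 µs observed, 2,500 ops/sec on 10 cores -> (5.0, 0.25)
# Version B: 3,000 µs observed, 7,500 ops/sec on 10 cores -> (15.0, 0.75)
# Latency / Throughput is 20 for both, consistent with a linear tradeoff.
```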
A Progression of Techniques Trading Throughput for Latency

As techniques have evolved for improving performance, each seems to have offered more throughput at the expense of increased latency:
• Mutexes and Locks: Mutexes and locks are perhaps the simplest method for protecting shared data [1]. In this style, each thread acquires a lock (or mutex) on a shared data structure before accessing or modifying it. Latency is minimized because a waiter will observe any changes as soon as the updating thread releases the lock. However, the overhead of acquiring a lock, and the processing time lost while waiting for one, can severely limit throughput.
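A minimal sketch of the locking style, using Python's threading module (illustrative, not from the paper):

```python
import threading

class LockedCounter:
    """Shared counter protected by a single mutex."""
    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        # Every access goes through the lock, so a waiter observes the
        # update as soon as the lock is released (low latency), but all
        # threads serialize on this one lock (limited throughput).
        with self._lock:
            self._value += 1

    def get(self):
        with self._lock:
            return self._value

counter = LockedCounter()
workers = [threading.Thread(target=lambda: [counter.increment() for _ in range(1000)])
           for _ in range(4)]
for t in workers: t.start()
for t in workers: t.join()
# Mutual exclusion makes the result deterministic: 4 threads x 1000 increments.
```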
• Lock-Free: In the lock-free style, each shared data structure is organized so that any potential races are confined to a single word. An updating thread need not lock the structure in advance. Instead, it prepares an updated value, then uses an atomic instruction (such as Compare-And-Swap) to attempt to store the value into the word [1]. The atomic instruction succeeds only if the word was not changed by some other thread while the updater was working. If it was changed, the updater must retry.
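To make the retry loop concrete, here is an illustrative sketch. Python exposes no hardware Compare-And-Swap, so cas() below simulates one with an internal lock; in C or Java this would be a single atomic instruction (e.g. compareAndSet on an AtomicLong):

```python
import threading

class AtomicCell:
    """A single word holding the data structure's race-prone state.
    The lock inside cas() only stands in for a hardware CAS instruction."""
    def __init__(self, value):
        self._value = value
        self._cas_lock = threading.Lock()

    def load(self):
        return self._value

    def cas(self, expected, new):
        # Atomically: store `new` only if the cell still holds `expected`.
        with self._cas_lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def lock_free_increment(cell):
    # Lock-free style: read, prepare an updated value, attempt the CAS;
    # if another thread changed the word in the meantime, retry.
    while True:
        current = cell.load()
        if cell.cas(current, current + 1):
            return
```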
Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Taking turns, broadcasting changes: Low latency
Dividing into sections, round-robin: High throughput
throughput -> parallel -> distributed/replicated -> latency
David Ungar, Doug Kimelman, Sam Adams and Mark Wegman: IBM
Spatial computing offers insights into:
• the costs and constraints of communication in large parallel computer arrays
• how to design algorithms that respect these costs and constraints

Parallel Sorting on a Spatial Computer: Max Orhai, Andrew P. Black
Dancing with Uncertainty
Sasa Misailovic, Stelios Sidiroglou and Martin Rinard
Sea Change In Linux-Kernel Parallel Programming
In 2006, Linus Torvalds noted that since 2003, the Linux kernel community's grasp of concurrency had improved to the point that patches were often correct at first submission.

Why the improvement?
• Not programming language: C before, during, and after
• Not synchronization primitives: Locking before, during, and after
• Not a change in personnel: Relatively low turnover
• Not born parallel programmers: Remember the Big Kernel Lock!

So what was it?
• Stick around for the discussion this afternoon and find out!!!
Paul E. McKenney: Beyond Expert-Only Parallel Programming?