Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Performance of Java 8...
-
Upload
maurice-naftalin -
Category
Software
-
view
345 -
download
0
Transcript of Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Performance of Java 8...
Good and Wicked FairiesThe Tragedy of the Commons
Understanding the Performanceof Java 8 Streams
Kirk Pepperdine @kcpeppeMauriceNaftalin @mauricenaftalin
#java8fairies
About Kirk
• Specialises in performance tuning• speaks frequently about performance• author of performance tuning workshop
• Co-founder jClarity• performance diagnositic tooling
• Java Champion (since 2006)
About Kirk
• Specialises in performance tuning• speaks frequently about performance• author of performance tuning workshop
• Co-founder jClarity• performance diagnositic tooling
• Java Champion (since 2006)
Kirk’s quiz: what is this?
About MauriceDeveloper, designer, architect, teacher, learner, writer
About Maurice
Co-author
Developer, designer, architect, teacher, learner, writer
About Maurice
Co-author Author
Developer, designer, architect, teacher, learner, writer
About Maurice
Co-author Author
Developer, designer, architect, teacher, learner, writer
Varian 620/i
6
First Computer I Used
Agenda
#java8fairies
Agenda
• Define the problem
#java8fairies
Agenda
• Define the problem• Implement a solution
#java8fairies
Agenda
• Define the problem• Implement a solution• Analyse performance
#java8fairies
Agenda
• Define the problem• Implement a solution• Analyse performance
– find the bottleneck
#java8fairies
Agenda
• Define the problem• Implement a solution• Analyse performance
– find the bottleneck
#java8fairies
Agenda
• Define the problem• Implement a solution• Analyse performance
– find the bottleneck
• Fork/Join parallelism in the real world
#java8fairies
Agenda
• Define the problem• Implement a solution• Analyse performance
– find the bottleneck
• Fork/Join parallelism in the real world
#java8fairies
Agenda
• Define the problem• Implement a solution• Analyse performance
– find the bottleneck
• Fork/Join parallelism in the real world
#java8fairies
Case Study: grep -b
grep -b: “The offset in bytes of a matched pattern is displayed in front of the matched line.”
Case Study: grep -b
The Moving Finger writes; and, having writ,Moves on: nor all thy Piety nor Wit
Shall bring it back to cancel half a LineNor all thy Tears wash out a Word of it.
rubai51.txt
grep -b: “The offset in bytes of a matched pattern is displayed in front of the matched line.”
Case Study: grep -b
The Moving Finger writes; and, having writ,Moves on: nor all thy Piety nor Wit
Shall bring it back to cancel half a LineNor all thy Tears wash out a Word of it.
rubai51.txt
grep -b: “The offset in bytes of a matched pattern is displayed in front of the matched line.”
$ grep -b 'W.*t' rubai51.txt 44:Moves on: nor all thy Piety nor Wit 122:Nor all thy Tears wash out a Word of it.
Case Study: grep -b
Obvious iterative implementation- process file line by line
- maintain byte displacement in an accumulator variable
Case Study: grep -b
Obvious iterative implementation- process file line by line
- maintain byte displacement in an accumulator variable
Objective here is to implement it in stream code- no accumulators allowed!
Case Study: grep -b
Streams – Why?
• Intention: replace loops for aggregate operations
10
Streams – Why?
• Intention: replace loops for aggregate operations
List<Person> people = … Set<City> shortCities = new HashSet<>();
for (Person p : people) { City c = p.getCity(); if (c.getName().length() < 4 ) { shortCities.add(c); } }
instead of writing this:
10
Streams – Why?
• Intention: replace loops for aggregate operations
• more concise, more readable, composable operations, parallelizable
Set<City> shortCities = new HashSet<>();
for (Person p : people) { City c = p.getCity(); if (c.getName().length() < 4 ) { shortCities.add(c); } }
instead of writing this:
List<Person> people = … Set<City> shortCities = people.stream() .map(Person::getCity) .filter(c -> c.getName().length() < 4) .collect(toSet());
11
we’re going to write this:
Streams – Why?
• Intention: replace loops for aggregate operations
• more concise, more readable, composable operations, parallelizable
Set<City> shortCities = new HashSet<>();
for (Person p : people) { City c = p.getCity(); if (c.getName().length() < 4 ) { shortCities.add(c); } }
instead of writing this:
List<Person> people = … Set<City> shortCities = people.stream() .map(Person::getCity) .filter(c -> c.getName().length() < 4) .collect(toSet());
11
we’re going to write this:
Streams – Why?
• Intention: replace loops for aggregate operations
• more concise, more readable, composable operations, parallelizable
Set<City> shortCities = new HashSet<>();
for (Person p : people) { City c = p.getCity(); if (c.getName().length() < 4 ) { shortCities.add(c); } }
instead of writing this:
List<Person> people = … Set<City> shortCities = people.parallelStream() .map(Person::getCity) .filter(c -> c.getName().length() < 4) .collect(toSet());
12
we’re going to write this:
Visualising Sequential Streams
x2x0 x1 x3x0 x1 x2 x3
Source Map Filter Reduction
Intermediate Operations
Terminal Operation
“Values in Motion”
Visualising Sequential Streams
x2x0 x1 x3x1 x2 x3 ✔
Source Map Filter Reduction
Intermediate Operations
Terminal Operation
“Values in Motion”
Visualising Sequential Streams
x2x0 x1 x3 x1x2 x3 ❌✔
Source Map Filter Reduction
Intermediate Operations
Terminal Operation
“Values in Motion”
Visualising Sequential Streams
x2x0 x1 x3 x1x2x3 ❌✔
Source Map Filter Reduction
Intermediate Operations
Terminal Operation
“Values in Motion”
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Collectors.toSet()
people.stream().collect(Collectors.toSet())
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Stream<Person>
Collectors.toSet()
Set<Person>
people.stream().collect(Collectors.toSet())
Collector<Person,?,Set<Person>>
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Stream<Person>
Collectors.toSet()
Set<Person>
people.stream().collect(Collectors.toSet())
bill
Collector<Person,?,Set<Person>>
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Stream<Person>
Collectors.toSet()
Set<Person>
people.stream().collect(Collectors.toSet())
bill
Collector<Person,?,Set<Person>>
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Stream<Person>
Collectors.toSet()
Set<Person>
people.stream().collect(Collectors.toSet())
bill
Collector<Person,?,Set<Person>>
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Stream<Person>
Collectors.toSet()
Set<Person>
people.stream().collect(Collectors.toSet())
billjon
Collector<Person,?,Set<Person>>
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Stream<Person>
Collectors.toSet()
Set<Person>
people.stream().collect(Collectors.toSet())
billjon
Collector<Person,?,Set<Person>>
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Stream<Person>
Collectors.toSet()
Set<Person>
people.stream().collect(Collectors.toSet())
amybill
jon
Collector<Person,?,Set<Person>>
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Stream<Person>
Collectors.toSet()
Set<Person>
people.stream().collect(Collectors.toSet())
amy
billjon
Collector<Person,?,Set<Person>>
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Collectors.toSet()
Set<Person>
people.stream().collect(Collectors.toSet())
amy
billjon
Collector<Person,?,Set<Person>>
Journey’s End, JavaOne, October 2015
Simple Collector – toSet()
Collectors.toSet()
Set<Person>{ , , }
people.stream().collect(Collectors.toSet())
amybilljon
Collector<Person,?,Set<Person>>
Classical Reduction
+
+
+
0 1 2 3
+
+
+
0
4 5 6 7
0
++
+
Classical Reduction
+
+
+
0 1 2 3
+
+
+
0
4 5 6 7
0
++
+intStream.reduce(0,(a,b) -> a+b)
Mutable Reduction
a
a
a
e0 e1 e2 e3
a
a
a
e4 e5 e6 e7
aa
c
a: accumulatorc: combiner
Supplier
()->[] ()->[]
Mutable Reduction
a
a
a
e0 e1 e2 e3
a
a
a
e4 e5 e6 e7
aa
c
a: accumulatorc: combiner
Supplier
()->[] ()->[]
Agenda
• Define the problem• Implement a solution• Analyse performance
– find the bottleneck
• Fork/Join parallelism in the real world
#java8fairies
122Nor …Moves … 44
grep -b: Collector combiner
0The …[ , 80Shall … ], ,
41 122Nor …Moves … 36 44
grep -b: Collector combiner
0The … 44[ , 42 80Shall … ], ,
41 122Nor …Moves … 36 44
grep -b: Collector combiner
0The … 44[ , 42 80Shall … ]
41 42Nor …Moves … 36 440The … 44[ , 42 0Shall … ]
, ,
,] [
41 122Nor …Moves … 36 44
grep -b: Collector solution
0The … 44[ , 42 80Shall … ]
41 42Nor …Moves … 36 440The … 44[ , 42 0Shall … ]
, ,
,] [
41 122Nor …Moves … 36 44
grep -b: Collector solution
0The … 44[ , 42 80Shall … ]
41 42Nor …Moves … 36 440The … 44[ , 42 0Shall … ]
, ,
,] [
80
41 122Nor …Moves … 36 44
grep -b: Collector solution
0The … 44[ , 42 80Shall … ]
41 42Nor …Moves … 36 440The … 44[ , 42 0Shall … ]
, ,
,] [
80
41 122Nor …Moves … 36 44
grep -b: Collector solution
0The … 44[ , 42 80Shall … ]
41 42Nor …Moves … 36 440The … 44[ , 42 0Shall … ]
, ,
,] [
80
41 122Nor …Moves … 36 44
grep -b: Collector combiner
0The … 44[ , 42 80Shall … ]
41 42Nor …Moves … 36 440The … 44[ , 42 0Shall … ]
, ,
,] [
Moves … 36 0 41 0Nor …42 0Shall …0The … 44[ ]] [[ [] ]
grep -b: Collector accumulator
44 0The moving … writ,
“Moves on: … Wit”
44 0The moving … writ, 36 44Moves on: … Wit
][
][ ,
[ ]
Supplier “The moving … writ,”
accumulator
accumulator
Agenda
• Define the problem• Implement a solution• Analyse performance
– find the bottleneck
• Fork/Join parallelism in the real world
#java8fairies
Because we don’t have a problem?
Why Shouldn’t We Optimize Code?
Why Shouldn’t We Optimize Code?
Because we don’t have a problem- No performance target!
Why Shouldn’t We Optimize Code?
Because we don’t have a problem- No performance target!
Else there is a problem, but not in our process
Why Shouldn’t We Optimize Code?
Because we don’t have a problem- No performance target!
Else there is a problem, but not in our process
Why Shouldn’t We Optimize Code?
Demo…
Because we don’t have a problem- No performance target!
Else there is a problem, but not in our process- The OS is struggling!
Why Shouldn’t We Optimize Code?
Because we don’t have a problem- No performance target!
Else there is a problem, but not in our process- The OS is struggling!
Else there’s a problem in our process, but not in the code
Why Shouldn’t We Optimize Code?
Because we don’t have a problem- No performance target!
Else there is a problem, but not in our process- The OS is struggling!
Else there’s a problem in our process, but not in the code
Why Shouldn’t We Optimize Code?
Demo…
Because we don’t have a problem- No performance target!
Else there is a problem, but not in our process- The OS is struggling!
Else there’s a problem in our process, but not in the code- GC is using all the cycles!
Why Shouldn’t We Optimize Code?
Because we don’t have a problem- No performance target!
Else there is a problem, but not in our process- The OS is struggling!
Else there’s a problem in our process, but not in the code- GC is using all the cycles!
Now we can consider the code
Else there’s a problem in the code… somewhere- now we can go and profile it
Because we don’t have a problem- No performance target!
Else there is a problem, but not in our process- The OS is struggling!
Else there’s a problem in our process, but not in the code- GC is using all the cycles!
Now we can consider the code
Else there’s a problem in the code… somewhere- now we can go and profile it
Demo…
So, streaming IO is slow
So, streaming IO is slow
If only there was a way of getting all that data into memory…
So, streaming IO is slow
If only there was a way of getting all that data into memory…
Demo…
Agenda
• Define the problem• Implement a solution• Analyse performance
– find the bottleneck
• Fork/Join parallelism in the real world
#java8fairies
Agenda
• Define the problem• Implement a solution• Analyse performance
– find the bottleneck OR – have a bright idea!• Fork/Join parallelism in the real world
#java8fairies
Intel Xeon E5 2600 10-core
Parallelism – Why?
The Free Lunch Is Over
http://www.gotw.ca/publications/concurrency-ddj.htm
The Transistor, Blessed at Birth
What’s Happened?
What’s Happened?
Physical limitations of the technology:
What’s Happened?
Physical limitations of the technology:• signal leakage
What’s Happened?
Physical limitations of the technology:• signal leakage• heat dissipation
What’s Happened?
Physical limitations of the technology:• signal leakage• heat dissipation• speed of light!
What’s Happened?
Physical limitations of the technology:• signal leakage• heat dissipation• speed of light!
– 30cm = 1 light-nanosecond
What’s Happened?
Physical limitations of the technology:• signal leakage• heat dissipation• speed of light!
– 30cm = 1 light-nanosecond
We’re not going to get faster cores,
What’s Happened?
Physical limitations of the technology:• signal leakage• heat dissipation• speed of light!
– 30cm = 1 light-nanosecond
We’re not going to get faster cores, we’re going to get more cores!
x2
Visualizing Parallel Streams
x0
x1
x3
x0
x1x2x3
x2
Visualizing Parallel Streams
x0
x1
x3
x0
x1
x2x3
x2
Visualizing Parallel Streams
x1
x3
x0
x1
x3
✔
❌
x2
Visualizing Parallel Streams
x1 y3
x0
x1
x3
✔
❌
A Parallel Solution for grep -b
• Parallel streams need splittable sources• Streaming I/O makes you subject to Amdahl’s Law:
A Parallel Solution for grep -b
• Parallel streams need splittable sources• Streaming I/O makes you subject to Amdahl’s Law:
Blessing – and Curse – on the Transistor
Stream Sources for Parallel Processing
Implemented by a Spliterator
Stream Sources for Parallel Processing
Implemented by a Spliterator
Stream Sources for Parallel Processing
Implemented by a Spliterator
Stream Sources for Parallel Processing
Implemented by a Spliterator
Stream Sources for Parallel Processing
Implemented by a Spliterator
Stream Sources for Parallel Processing
Implemented by a Spliterator
Stream Sources for Parallel Processing
Implemented by a Spliterator
Stream Sources for Parallel Processing
Implemented by a Spliterator
Stream Sources for Parallel Processing
Implemented by a Spliterator
Moves …Wit
LineSpliterator
The moving Finger … writ \n Shall … Line Nor all thy … it\n \n \n
spliterator coverage
Moves …Wit
LineSpliterator
The moving Finger … writ \n Shall … Line Nor all thy … it\n \n \n
spliterator coverage
MappedByteBuffer
Moves …Wit
LineSpliterator
The moving Finger … writ \n Shall … Line Nor all thy … it\n \n \n
spliterator coverage
MappedByteBuffer mid
Moves …Wit
LineSpliterator
The moving Finger … writ \n Shall … Line Nor all thy … it\n \n \n
spliterator coverage
MappedByteBuffer mid
Moves …Wit
LineSpliterator
The moving Finger … writ \n Shall … Line Nor all thy … it\n \n \n
spliterator coveragenew spliterator coverage
MappedByteBuffer mid
Moves …Wit
LineSpliterator
The moving Finger … writ \n Shall … Line Nor all thy … it\n \n \n
spliterator coveragenew spliterator coverage
MappedByteBuffer mid
Demo…
Parallelizing grep -b
• Splitting action of LineSpliterator is O(log n)• Collector no longer needs to compute index• Result (relatively independent of data size):
- sequential stream ~2x as fast as iterative solution- parallel stream >2.5x as fast as sequential stream
- on 4 hardware threads
Parallelizing Streams
Parallel-unfriendly intermediate operations:
stateful ones– need to store some or all of the stream data in memory
– sorted()
those requiring ordering– limit()
Collectors Cost Extra!
Depends on the performance of accumulator and combiner functions• toList(), toSet(), toCollection() – performance
normally dominated by accumulator• but allow for the overhead of managing multithread access to non-
threadsafe containers for the combine operation
• toMap(), toConcurrentMap() – map merging is slow. Resizing maps, especially concurrent maps, is very expensive. Whenever possible, presize all data structures, maps in particular.
Agenda
• Define the problem• Implement a solution• Analyse performance
– find the bottleneck
• Fork/Join parallelism in the real world
#java8fairies
Simulated Server Environment
threadPool.execute(() -> { try { double value = logEntries.parallelStream() .map(applicationStoppedTimePattern::matcher) .filter(Matcher::find) .map( matcher -> matcher.group(2)) .mapToDouble(Double::parseDouble) .summaryStatistics().getSum(); } catch (Exception ex) {}});
How Does It Perform?
• Total run time: 261.7 seconds• Max: 39.2 secs, Min: 9.2 secs, Median: 22.0
secs
Tragedy of the Commons
Garrett Hardin, ecologist (1968):Imagine the grazing of animals on a common ground. Each flock owner gains if they add to their own flock. But every animal added to the total degrades the commons a small amount.
Tragedy of the Commons
Tragedy of the Commons
You have a finite amount of hardware– it might be in your best interest to grab it all– but if everyone behaves the same way…
Tragedy of the Commons
You have a finite amount of hardware– it might be in your best interest to grab it all– but if everyone behaves the same way…
Be a good neighbor
Tragedy of the Commons
You have a finite amount of hardware– it might be in your best interest to grab it all– but if everyone behaves the same way…
With many parallelStream() operations running concurrently, performance is limited by the size of the common thread pool and the number of cores you have
Be a good neighbor
Configuring Common Pool
Size of common ForkJoinPool is • Runtime.getRuntime().availableProcessors() - 1
-Djava.util.concurrent.ForkJoinPool.common.parallelism=N-Djava.util.concurrent.ForkJoinPool.common.threadFactory-Djava.util.concurrent.ForkJoinPool.common.exceptionHandler
Fork-Join
Support for Fork-Join added in Java 7• difficult coding idiom to master
Used internally by parallel streams• uses a spliterator to segment the stream
• each stream is processed by a ForkJoinWorkerThread
How fork-join works and performs is important to latency
ForkJoinPool invoke
ForkJoinPool.invoke(ForkJoinTask) uses the submitting thread as a worker• If 100 threads all call invoke(), we would have
100+ForkJoinThreads exhausting the limiting resource, e.g. CPUs, IO, etc.
ForkJoinPool submit/get
ForkJoinPool.submit(Callable).get() suspends the submitting thread• If 100 threads all call submit(), the work queue can become very
long, thus adding latency
Fork-Join Performance
Fork Join comes with significant overhead• each chunk of work must be large enough to amortize the
overhead
C/P/N/Q Performance Model
C - number of submittersP - number of CPUsN - number of elementsQ - cost of the operation
When to go Parallel
The workload of the intermediate operations must be great enough to outweigh the overheads (~100µs):
– initializing the fork/join framework – splitting– concurrent collection
Often quoted as N x Q
size of data set(typically > 10,000) processing cost per element
Kernel Times
CPU will not be the limiting factor when• CPU is not saturated
• kernel times exceed 10% of user time
More threads will decrease performance• predicted by Little’s Law
Common Thread Pool
Fork-Join by default uses a common thread pool• default number of worker threads == number of logical cores - 1
• Always contains at least one thread
Performance is tied to whichever you run out of first• availability of the constraining resource
• number of ForkJoinWorkerThreads/hardware threads
Everyone is working together to get it done!
All Hands on Deck!!!
Little’s Law
Fork-Join is a work queue• work queue behavior is typically modeled using Little’s Law
Number of tasks in a system equals the arrival rate times the amount of time it takes to clear an item
Task is submitted every 500ms, or 2 per second Number of tasks = 2/sec * 2.8 seconds = 5.6 tasks
Components of Latency
Latency is time from stimulus to result• internally, latency consists of active and dead time
Reducing dead time assumes you• can find it and are able to fill in with useful work
From Previous Example
if there is available hardware capacity then make the pool biggerelse add capacity or tune to reduce strength of the dependency
ForkJoinPool Observability
• In an application where you have many parallel stream operations all running concurrently performance will be affected by the size of the common thread pool• too small can starve threads from needed resources• too big can cause threads to thrash on contended resources
ForkJoinPool comes with no visibility• no metrics to help us tune
• instrument ForkJoinTask.invoke()• gather measures that can be feed into Little’s Law
• Collect• service times
• time submitted to time returned
• inter-arrival times
Instrumenting ForkJoinPool
public final V invoke() { ForkJoinPool.common.getMonitor().submitTask(this); int s; if ((s = doInvoke() & DONE_MASK) != NORMAL) reportException(s); ForkJoinPool.common.getMonitor().retireTask(this); return getRawResult(); }
Performance
Submit log parsing to our own ForkJoinPool
new ForkJoinPool(16).submit(() -> ……… ).get()new ForkJoinPool(8).submit(() -> ……… ).get()new ForkJoinPool(4).submit(() -> ……… ).get()
16 workerthreads
8 workerthreads
4 workerthreads
0
75000
150000
225000
300000
Stream Parallel Flood Stream Flood Parallel
Performance mostly doesn’t matterBut if you must…• sequential streams normally beat iterative solutions• parallel streams can utilize all cores, providing
- the data is efficiently splittable- the intermediate operations are sufficiently expensive and are
CPU-bound- there isn’t contention for the processors
Conclusions
#java8fairies
Resources
#java8fairies
Resources
http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html
#java8fairies
Resources
http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.htmlhttp://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf
#java8fairies
Resources
http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.htmlhttp://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdfhttp://openjdk.java.net/projects/code-tools/jmh/
#java8fairies
Resources
http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.htmlhttp://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdfhttp://openjdk.java.net/projects/code-tools/jmh/
#java8fairies