Good and Wicked Fairies, and the Tragedy of the Commons: Understanding the Performance of Java 8...

Good and Wicked Fairies
The Tragedy of the Commons

Understanding the Performance of Java 8 Streams

Kirk Pepperdine @kcpeppe
Maurice Naftalin @mauricenaftalin

#java8fairies

About Kirk

• Specialises in performance tuning
  • speaks frequently about performance
  • author of performance tuning workshop

• Co-founder jClarity
  • performance diagnostic tooling

• Java Champion (since 2006)

Kirk’s quiz: what is this?

About Maurice

Co-author Author

Developer, designer, architect, teacher, learner, writer

@mauricenaftalin

The Lambda FAQ

http://www.lambdafaq.org

Varian 620/i

First Computer I Used

Agenda

• Define the problem
• Implement a solution
• Analyse performance
  – find the bottleneck
• Fork/Join parallelism in the real world

Case Study: grep -b

grep -b: “The offset in bytes of a matched pattern is displayed in front of the matched line.”

Case Study: grep -b

rubai51.txt:

The Moving Finger writes; and, having writ,
Moves on: nor all thy Piety nor Wit
Shall bring it back to cancel half a Line
Nor all thy Tears wash out a Word of it.

$ grep -b 'W.*t' rubai51.txt
44:Moves on: nor all thy Piety nor Wit
122:Nor all thy Tears wash out a Word of it.

Case Study: grep -b

Obvious iterative implementation
- process file line by line
- maintain byte displacement in an accumulator variable

Objective here is to implement it in stream code
- no accumulators allowed!
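For comparison with the stream versions, the obvious iterative implementation might look roughly like the sketch below. The class name IterativeGrepB and its grep method are illustrative and not taken from the talk; the sketch assumes UTF-8 text in which every line ends with a single '\n' byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the "obvious" iterative grep -b; not the speakers' code.
class IterativeGrepB {

    // Returns "byteOffset:line" for each line containing a match of the regex.
    static List<String> grep(String regex, List<String> lines) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matches = new ArrayList<>();
        long offset = 0; // the accumulator: byte displacement of the current line
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                matches.add(offset + ":" + line);
            }
            // Assumption: UTF-8 text, each line ended by a single '\n' byte.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> rubai51 = Arrays.asList(
            "The Moving Finger writes; and, having writ,",
            "Moves on: nor all thy Piety nor Wit",
            "Shall bring it back to cancel half a Line",
            "Nor all thy Tears wash out a Word of it.");
        grep("W.*t", rubai51).forEach(System.out::println);
    }
}
```

The running offset is exactly the state that makes this loop awkward to parallelise: each line's displacement depends on every line before it.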

Streams – Why?

• Intention: replace loops for aggregate operations

instead of writing this:

List<Person> people = …
Set<City> shortCities = new HashSet<>();

for (Person p : people) {
    City c = p.getCity();
    if (c.getName().length() < 4) {
        shortCities.add(c);
    }
}

Streams – Why?

• more concise, more readable, composable operations, parallelizable

we’re going to write this:

List<Person> people = …
Set<City> shortCities = people.stream()
    .map(Person::getCity)
    .filter(c -> c.getName().length() < 4)
    .collect(toSet());

Streams – Why?

• more concise, more readable, composable operations, parallelizable

we’re going to write this:

List<Person> people = …
Set<City> shortCities = people.parallelStream()
    .map(Person::getCity)
    .filter(c -> c.getName().length() < 4)
    .collect(toSet());
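To make the comparison concrete, here is a minimal runnable sketch of the loop, sequential-stream and parallel-stream versions side by side. The Person and City classes are stand-ins invented for this example (the slides do not show them), as are the class name ShortCities and the city names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class ShortCities {

    // Hypothetical minimal domain classes; the slides only use their getters.
    static final class City {
        private final String name;
        City(String name) { this.name = name; }
        String getName() { return name; }
        @Override public boolean equals(Object o) {
            return o instanceof City && ((City) o).name.equals(name);
        }
        @Override public int hashCode() { return name.hashCode(); }
    }

    static final class Person {
        private final City city;
        Person(City city) { this.city = city; }
        City getCity() { return city; }
    }

    // The loop from the slide.
    static Set<City> withLoop(List<Person> people) {
        Set<City> shortCities = new HashSet<>();
        for (Person p : people) {
            City c = p.getCity();
            if (c.getName().length() < 4) {
                shortCities.add(c);
            }
        }
        return shortCities;
    }

    // The stream pipeline from the slide, sequential or parallel.
    static Set<City> withStream(List<Person> people, boolean parallel) {
        Stream<Person> source = parallel ? people.parallelStream() : people.stream();
        return source
            .map(Person::getCity)
            .filter(c -> c.getName().length() < 4)
            .collect(Collectors.toSet());
    }

    public static void main(String[] args) {
        List<Person> people = Arrays.asList(
            new Person(new City("Ely")),
            new Person(new City("London")),
            new Person(new City("Ely")));
        System.out.println(withLoop(people).equals(withStream(people, false)));
        System.out.println(withLoop(people).equals(withStream(people, true)));
    }
}
```

Only the choice of source differs between the sequential and parallel versions; the pipeline itself is unchanged.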

Visualising Sequential Streams

“Values in Motion”

[Diagram: values x0, x1, x2, x3 flow from the Source through Map and Filter (Intermediate Operations) to a Reduction (Terminal Operation). Each value is either passed on by the filter (✔) or dropped (❌) before reaching the terminal operation.]

Journey’s End, JavaOne, October 2015

Simple Collector – toSet()

people.stream().collect(Collectors.toSet())

Stream<Person> → Collectors.toSet() → Set<Person>

Collectors.toSet(): Collector<Person,?,Set<Person>>

[Animation: bill, jon and amy arrive one at a time from the Stream<Person> and are accumulated by Collectors.toSet(); the result is the Set<Person> { amy, bill, jon }.]


Classical Reduction

[Diagram: the values 0 1 2 3 and 4 5 6 7 are each folded with +, starting from the identity 0; the two partial results are then combined with a final +.]

intStream.reduce(0,(a,b) -> a+b)
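As a small runnable illustration of the reduction above (the class name ClassicalReduction is made up for the example), the same identity and operator give the same result whether the stream is sequential or split in two and combined, because + is associative and 0 is its identity:

```java
import java.util.stream.IntStream;

class ClassicalReduction {

    // Same call as on the slide: identity 0, associative operator +.
    static int sum(IntStream values) {
        return values.reduce(0, (a, b) -> a + b);
    }

    public static void main(String[] args) {
        System.out.println(sum(IntStream.rangeClosed(0, 7)));            // sequential
        System.out.println(sum(IntStream.rangeClosed(0, 7).parallel())); // parallel
    }
}
```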

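Mutable Reduction

[Diagram: elements e0 e1 e2 e3 and e4 e5 e6 e7 are each folded into a container by the accumulator; each container is created by the Supplier, ()->[]; the two containers are then merged by the combiner.]

a: accumulator
c: combiner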
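The three roles in the diagram map directly onto the three-argument form of Stream.collect. The sketch below (class and method names are illustrative) uses an ArrayList as the mutable container:

```java
import java.util.ArrayList;
import java.util.stream.IntStream;

class MutableReduction {

    static ArrayList<Integer> collect(int n, boolean parallel) {
        IntStream source = IntStream.range(0, n);
        if (parallel) {
            source = source.parallel();
        }
        return source.boxed().collect(
            () -> new ArrayList<Integer>(),       // supplier: a fresh container per partition
            (list, e) -> list.add(e),             // accumulator: fold one element in
            (left, right) -> left.addAll(right)); // combiner: merge two containers
    }

    public static void main(String[] args) {
        System.out.println(collect(8, false));
        System.out.println(collect(8, true));
    }
}
```

In a sequential run the combiner is never called; in a parallel run each partition gets its own container from the supplier, so the ArrayList itself does not need to be thread-safe.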

grep -b: Collector combiner

[Diagram: two partial results are combined. The left list holds (offset 0, "The …", length 44) and (offset 44, "Moves …", length 36). The right list holds (offset 0, "Shall …", length 42) and (offset 42, "Nor …", length 41). The combiner adds 80, the total length of the left list, to each offset in the right list, giving offsets 0, 44, 80 and 122.]

grep -b: Collector solution

[Diagram: each line starts in its own list at offset 0: (0, "The …", 44), (0, "Moves …", 36), (0, "Shall …", 42), (0, "Nor …", 41). The lists are combined pairwise, as above, until a single list with offsets 0, 44, 80 and 122 remains.]

grep -b: Collector accumulator

[Diagram: the Supplier creates an empty list [ ]. The accumulator adds “The moving … writ,” as (offset 0, length 44), then adds “Moves on: … Wit” as (offset 44, length 36).]
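One way to express the supplier, accumulator and combiner for grep -b as a Collector is sketched below. This is a reconstruction under assumptions, not the speakers' code: the DispLine class, the endOf helper and the toDispLines method are names invented here, and the text is assumed to be UTF-8 with a single '\n' ending each line.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collector;
import java.util.stream.Collectors;

class GrepBCollector {

    // Hypothetical record: a line plus its byte displacement within its list.
    static final class DispLine {
        final long disp;
        final String line;
        DispLine(long disp, String line) { this.disp = disp; this.line = line; }
        // Assumption: UTF-8 text, one '\n' byte terminating each line.
        long end() { return disp + line.getBytes(StandardCharsets.UTF_8).length + 1; }
    }

    // Total byte length covered by a partial result.
    static long endOf(List<DispLine> list) {
        return list.isEmpty() ? 0 : list.get(list.size() - 1).end();
    }

    static Collector<String, ?, List<DispLine>> toDispLines() {
        return Collector.<String, List<DispLine>>of(
            ArrayList::new,                                             // supplier
            (list, line) -> list.add(new DispLine(endOf(list), line)),  // accumulator
            (left, right) -> {                                          // combiner
                long shift = endOf(left); // shift right-hand offsets by the left's length
                for (DispLine d : right) {
                    left.add(new DispLine(d.disp + shift, d.line));
                }
                return left;
            });
    }

    static List<String> grep(String regex, List<String> lines) {
        Pattern pattern = Pattern.compile(regex);
        return lines.parallelStream()
            .collect(toDispLines())   // every line is needed to compute the offsets
            .stream()
            .filter(d -> pattern.matcher(d.line).find())
            .map(d -> d.disp + ":" + d.line)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        grep("W.*t", Arrays.asList(
            "The Moving Finger writes; and, having writ,",
            "Moves on: nor all thy Piety nor Wit",
            "Shall bring it back to cancel half a Line",
            "Nor all thy Tears wash out a Word of it.")).forEach(System.out::println);
    }
}
```

Note that the filter has to come after the collection step: lines that do not match still contribute their length to the offsets of the lines that follow.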

Why Shouldn’t We Optimize Code?

Because we don’t have a problem
- No performance target!

Else there is a problem, but not in our process
- The OS is struggling!

Demo…

Else there’s a problem in our process, but not in the code
- GC is using all the cycles!

Demo…

Now we can consider the code

Else there’s a problem in the code… somewhere
- now we can go and profile it

Demo…

So, streaming IO is slow

If only there was a way of getting all that data into memory…

Demo…

Agenda

– find the bottleneck OR
– have a bright idea!
• Fork/Join parallelism in the real world

Parallelism – Why?

Intel Xeon E5 2600 10-core

The Free Lunch Is Over

http://www.gotw.ca/publications/concurrency-ddj.htm

The Transistor, Blessed at Birth

What’s Happened?

Physical limitations of the technology:
• signal leakage
• heat dissipation
• speed of light!
  – 30cm = 1 light-nanosecond

We’re not going to get faster cores, we’re going to get more cores!

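Visualizing Parallel Streams

[Diagram: the source values x0, x1, x2, x3 are split into separate branches that are mapped and filtered in parallel. Values that pass the filter (✔) continue to the terminal operation; the others (❌) are dropped.]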

A Parallel Solution for grep -b

• Parallel streams need splittable sources
• Streaming I/O makes you subject to Amdahl’s Law:
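For reference, the usual statement of Amdahl's Law is given below, writing F for the fraction of the run time that can be parallelised and P for the number of processors. The symbols and the numbers in the worked line are chosen here purely for illustration; they are not measurements from the talk.

```latex
% Amdahl's Law: F = parallelisable fraction of the run time, P = number of processors
S(P) = \frac{1}{(1 - F) + \dfrac{F}{P}},
\qquad
\lim_{P \to \infty} S(P) = \frac{1}{1 - F}

% Illustrative numbers (assumed, not measured): F = 0.5, P = 4
S(4) = \frac{1}{0.5 + \dfrac{0.5}{4}} = \frac{1}{0.625} = 1.6
```

If half of the run time is spent in sequential streaming I/O, no number of cores can make the whole job more than twice as fast.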

Blessing – and Curse – on the Transistor

Stream Sources for Parallel Processing

Implemented by a Spliterator

LineSpliterator

[Diagram: a LineSpliterator covers the lines held in a MappedByteBuffer: “The moving Finger … writ\n”, “Moves … Wit\n”, “Shall … Line\n”, “Nor all thy … it\n”. To split, it locates the midpoint (mid) of its coverage, moves to a line break, and divides the coverage there between itself and a new spliterator.]

Demo…
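The splitting idea in the diagram can be sketched as follows. This is a hypothetical reconstruction, not the LineSpliterator demonstrated in the talk: the class names, fields and grep helper are invented, a heap ByteBuffer stands in for the MappedByteBuffer, and the text is assumed to be UTF-8 and to end with '\n'.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Spliterator;
import java.util.function.Consumer;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.StreamSupport;

// Hypothetical element type: a line and the byte offset at which it starts.
final class OffsetLine {
    final int offset;
    final String text;
    OffsetLine(int offset, String text) { this.offset = offset; this.text = text; }
}

class LineSpliteratorSketch implements Spliterator<OffsetLine> {
    private final ByteBuffer buffer; // read only via absolute get(), so safe to share
    private int lo;                  // first byte still to be covered
    private final int hi;            // last byte covered; must always be a '\n'

    LineSpliteratorSketch(ByteBuffer buffer, int lo, int hi) {
        this.buffer = buffer;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    public boolean tryAdvance(Consumer<? super OffsetLine> action) {
        if (lo > hi) return false;
        int end = lo;
        while (buffer.get(end) != '\n') end++;  // terminates: byte hi is a '\n'
        byte[] bytes = new byte[end - lo];
        for (int i = 0; i < bytes.length; i++) bytes[i] = buffer.get(lo + i);
        action.accept(new OffsetLine(lo, new String(bytes, StandardCharsets.UTF_8)));
        lo = end + 1;
        return true;
    }

    @Override
    public Spliterator<OffsetLine> trySplit() {
        if (lo >= hi) return null;
        int mid = lo + (hi - lo) / 2;
        while (buffer.get(mid) != '\n') mid++;  // move to the end of that line
        if (mid == hi) return null;             // nothing left for this half: do not split
        Spliterator<OffsetLine> prefix = new LineSpliteratorSketch(buffer, lo, mid);
        lo = mid + 1;                           // this spliterator keeps the remainder
        return prefix;
    }

    // Size is estimated in bytes rather than lines, so SIZED is not reported.
    @Override public long estimateSize() { return hi - lo + 1; }
    @Override public int characteristics() { return ORDERED | NONNULL | IMMUTABLE; }

    // Usage: each line already knows its offset, so no offset-computing collector is needed.
    static List<String> grep(String regex, String text) {
        ByteBuffer buf = ByteBuffer.wrap(text.getBytes(StandardCharsets.UTF_8));
        Pattern pattern = Pattern.compile(regex);
        return StreamSupport.stream(new LineSpliteratorSketch(buf, 0, buf.limit() - 1), true)
            .filter(l -> pattern.matcher(l.text).find())
            .map(l -> l.offset + ":" + l.text)
            .collect(Collectors.toList());
    }
}
```

For a real file, the buffer would presumably come from FileChannel.map, which returns a MappedByteBuffer; the spliterator logic would be unchanged.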

Parallelizing grep -b

• Splitting action of LineSpliterator is O(log n)
• Collector no longer needs to compute index
• Result (relatively independent of data size):
  - sequential stream ~2x as fast as iterative solution
  - parallel stream >2.5x as fast as sequential stream
    - on 4 hardware threads

Parallelizing Streams

Parallel-unfriendly intermediate operations:

stateful ones
– need to store some or all of the stream data in memory
– sorted()

those requiring ordering
– limit()

Collectors Cost Extra!

Depends on the performance of accumulator and combiner functions

• toList(), toSet(), toCollection() – performance normally dominated by accumulator
  • but allow for the overhead of managing multithread access to non-threadsafe containers for the combine operation

• toMap(), toConcurrentMap() – map merging is slow. Resizing maps, especially concurrent maps, is very expensive. Whenever possible, presize all data structures, maps in particular.
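As a sketch of the presizing advice, the four-argument Collectors.toMap overload accepts a map supplier, which can create a HashMap with enough initial capacity to avoid resizing. The class name, the grouping of words by length and the capacity formula are choices made for this illustration only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PresizedToMap {

    // Groups words by length, joining words of equal length with a comma.
    static Map<Integer, String> byLength(List<String> words) {
        // Assumption: at most one entry per word, default load factor 0.75.
        int capacity = (int) (words.size() / 0.75f) + 1;
        return words.stream().collect(Collectors.toMap(
            String::length,                  // key mapper
            w -> w,                          // value mapper
            (a, b) -> a + "," + b,           // merge function for duplicate keys
            () -> new HashMap<>(capacity))); // presized map supplier
    }

    public static void main(String[] args) {
        System.out.println(byLength(Arrays.asList("a", "bb", "cc", "ddd")));
    }
}
```

In a parallel stream the supplier is called once per partition, so presizing for the full input over-allocates; a measured compromise is likely to be needed there.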

Simulated Server Environment

threadPool.execute(() -> {
    try {
        double value = logEntries.parallelStream()
            .map(applicationStoppedTimePattern::matcher)
            .filter(Matcher::find)
            .map(matcher -> matcher.group(2))
            .mapToDouble(Double::parseDouble)
            .summaryStatistics().getSum();
    } catch (Exception ex) {}
});

How Does It Perform?

• Total run time: 261.7 seconds
• Max: 39.2 secs, Min: 9.2 secs, Median: 22.0 secs

Tragedy of the Commons

Garrett Hardin, ecologist (1968):
Imagine the grazing of animals on a common ground. Each flock owner gains if they add to their own flock. But every animal added to the total degrades the commons a small amount.

You have a finite amount of hardware
– it might be in your best interest to grab it all
– but if everyone behaves the same way…

Be a good neighbor

With many parallelStream() operations running concurrently, performance is limited by the size of the common thread pool and the number of cores you have

Configuring Common Pool

Size of common ForkJoinPool is
• Runtime.getRuntime().availableProcessors() - 1

-Djava.util.concurrent.ForkJoinPool.common.parallelism=N
-Djava.util.concurrent.ForkJoinPool.common.threadFactory
-Djava.util.concurrent.ForkJoinPool.common.exceptionHandler
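A quick way to see what the common pool has actually been given on a particular machine is to print it next to the processor count. The class name and the defaultParallelism helper below are illustrative; the helper simply encodes the default size stated above, floored at one.

```java
import java.util.concurrent.ForkJoinPool;

class CommonPoolInfo {

    // The default stated on the slide, never less than one.
    static int defaultParallelism() {
        return Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
    }

    public static void main(String[] args) {
        System.out.println("available processors:    "
            + Runtime.getRuntime().availableProcessors());
        System.out.println("default from the slide:  " + defaultParallelism());
        System.out.println("common pool parallelism: "
            + ForkJoinPool.getCommonPoolParallelism());
    }
}
```

Starting the JVM with -Djava.util.concurrent.ForkJoinPool.common.parallelism=4 should change the last value; the property has to be set before the common pool is first used.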

Fork-Join

Support for Fork-Join added in Java 7
• difficult coding idiom to master

Used internally by parallel streams
• uses a spliterator to segment the stream
• each segment is processed by a ForkJoinWorkerThread

How fork-join works and performs is important to latency

ForkJoinPool invoke

ForkJoinPool.invoke(ForkJoinTask) uses the submitting thread as a worker
• If 100 threads all call invoke(), we would have 100+ ForkJoinThreads exhausting the limiting resource, e.g. CPUs, IO, etc.

ForkJoinPool submit/get

ForkJoinPool.submit(Callable).get() suspends the submitting thread
• If 100 threads all call submit(), the work queue can become very long, thus adding latency

Fork-Join Performance

Fork Join comes with significant overhead
• each chunk of work must be large enough to amortize the overhead

C/P/N/Q Performance Model

C - number of submitters
P - number of CPUs
N - number of elements
Q - cost of the operation

When to go Parallel

The workload of the intermediate operations must be great enough to outweigh the overheads (~100µs):
– initializing the fork/join framework
– splitting
– concurrent collection

Often quoted as N x Q (typically > 10,000)
N: size of data set
Q: processing cost per element

Kernel Times

CPU will not be the limiting factor when
• CPU is not saturated
• kernel times exceed 10% of user time

More threads will decrease performance
• predicted by Little’s Law

Common Thread Pool

Fork-Join by default uses a common thread pool
• default number of worker threads == number of logical cores - 1
• Always contains at least one thread

Performance is tied to whichever you run out of first
• availability of the constraining resource
• number of ForkJoinWorkerThreads/hardware threads

Everyone is working together to get it done!

All Hands on Deck!!!

Little’s Law

Fork-Join is a work queue
• work queue behavior is typically modeled using Little’s Law

Number of tasks in a system equals the arrival rate times the amount of time it takes to clear an item

Task is submitted every 500ms, or 2 per second
Number of tasks = 2/sec * 2.8 seconds = 5.6 tasks
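The arithmetic above is simple enough to put into code, which becomes useful once inter-arrival and service times are being measured. The class and method names below are invented for this sketch, and the sample arrays just reproduce the slide's numbers.

```java
import java.util.Arrays;

class LittlesLaw {

    // Little's Law: tasks in system = arrival rate x time each task spends in the system.
    static double tasksInSystem(double arrivalsPerSecond, double secondsInSystem) {
        return arrivalsPerSecond * secondsInSystem;
    }

    // Same calculation, starting from measured samples in milliseconds.
    static double fromMeasurements(long[] interArrivalMillis, long[] timeInSystemMillis) {
        double meanInterArrival = Arrays.stream(interArrivalMillis).average().orElse(Double.NaN);
        double meanTimeInSystem = Arrays.stream(timeInSystemMillis).average().orElse(Double.NaN);
        return tasksInSystem(1000.0 / meanInterArrival, meanTimeInSystem / 1000.0);
    }

    public static void main(String[] args) {
        // One task every 500 ms, each taking 2.8 s to clear, as in the example above.
        System.out.println(tasksInSystem(2.0, 2.8));
        System.out.println(fromMeasurements(new long[] {500, 500}, new long[] {2800, 2800}));
    }
}
```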

Components of Latency

Latency is time from stimulus to result
• internally, latency consists of active and dead time

Reducing dead time assumes you
• can find it and are able to fill in with useful work

From Previous Example

if there is available hardware capacity then make the pool bigger
else add capacity or tune to reduce strength of the dependency

ForkJoinPool Observability

• In an application where you have many parallel stream operations all running concurrently, performance will be affected by the size of the common thread pool
  • too small can starve threads of needed resources
  • too big can cause threads to thrash on contended resources

ForkJoinPool comes with no visibility
• no metrics to help us tune
• instrument ForkJoinTask.invoke()
• gather measures that can be fed into Little’s Law

• Collect
  • service times
    • time submitted to time returned
  • inter-arrival times

Instrumenting ForkJoinPool

public final V invoke() {
    ForkJoinPool.common.getMonitor().submitTask(this);
    int s;
    if ((s = doInvoke() & DONE_MASK) != NORMAL)
        reportException(s);
    ForkJoinPool.common.getMonitor().retireTask(this);
    return getRawResult();
}

Performance

Submit log parsing to our own ForkJoinPool

new ForkJoinPool(16).submit(() -> ……… ).get()
new ForkJoinPool(8).submit(() -> ……… ).get()
new ForkJoinPool(4).submit(() -> ……… ).get()

[Chart: results for 16, 8 and 4 worker threads; axis from 0 to 300000; series: Stream, Parallel, Flood Stream, Flood Parallel.]
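A minimal sketch of the submit-and-get idiom, applied to a simplified stand-in for the log-parsing workload, is shown below. The class name, the sumInPool method and the sample values are invented for illustration. Running a parallel stream inside a custom pool this way relies on widely used behaviour that the Stream API does not formally specify, so treat it as an assumption to verify on your JDK.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ForkJoinPool;

class OwnPoolSketch {

    // Parses and sums the values using a dedicated pool instead of the common pool.
    static double sumInPool(List<String> values, int workers) throws Exception {
        ForkJoinPool pool = new ForkJoinPool(workers);
        try {
            Callable<Double> task = () -> values.parallelStream()
                .mapToDouble(Double::parseDouble)
                .sum();
            // submit() queues the task; get() blocks the calling thread until it completes.
            return pool.submit(task).get();
        } finally {
            pool.shutdown(); // a pool per call is wasteful; done here only to keep the sketch short
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sumInPool(Arrays.asList("1.5", "2.5", "3.0"), 4));
    }
}
```

In a server, one long-lived pool sized against the constraining resource would normally be shared, rather than creating a pool per request.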

Conclusions

Performance mostly doesn’t matter
But if you must…
• sequential streams normally beat iterative solutions
• parallel streams can utilize all cores, providing
  - the data is efficiently splittable
  - the intermediate operations are sufficiently expensive and are CPU-bound
  - there isn’t contention for the processors

Resources

http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html
http://shipilev.net/talks/devoxx-Nov2013-benchmarking.pdf
http://openjdk.java.net/projects/code-tools/jmh/