Stream Collections - Scala Days

Streams as Scala Collections

S3 Scala Client with Play Iteratees and Composable Operations

Greg Silin, Platform Engineer

@gregsbriefs

www.github.com/nitro/streamcollections

ScalaDays 2015

Agenda

• Reactive at Nitro

• Smart Documents at Scale

• Motivation for Streaming Collections

• Building Streams with Iteratees

• Streams as Scala Collections

• Applications

• Questions

The Old Way

Create & Prepare on the Desktop

Print Document

Sign Printed Document

Scan into Computer

Knowledge workers spend approximately 11+ hours a week creating and managing documents

The New Way

Create, Prepare, Sign

(Anywhere)

Nitro accelerates the way businesses create, prepare, and sign documents. Anytime and anywhere.

Smarter Documents for Everyone™

Reactive Systems at Nitro

react to user expectations <- responsive

react to state changes <- message driven

react to variable load <- elastic

react to failure <- resilient

Smart Documents at Scale

Multiple pages and formats per document

Smart Documents at Scale

Each action results in a new document version

render, sign, approve

...

Smart Documents at Scale

documents / second * versions / document * pages / version = billions of objects in S3

Smart Documents at Scale

millions of new document uploads a day

100MM+ document state changes a day, resulting in 10x as many messages

billions of objects in S3

Motivation for Streaming Collections

Counting, copying, extracting, and cleanup become non-trivial at scale

Motivation for Streaming Collections

At a billion objects, a 1 percent error margin = 10M objects

That’s money for the business

How?

How do we traverse the data?

Command line tools don’t provide flexibility / scale

Can’t load everything in memory

Need some batched solution

Amazon S3 SDK has a Java key iterator

But we are Scala engineers!

Streaming is a natural fit

We are reactive

Thus asynchronous streams

Can’t over-parallelize
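
The "batched solution" idea above can be sketched with the Scala standard library alone. This is only an illustration: `Page`, `listPage`, and the in-memory key list are hypothetical stand-ins for a remote listing call, not the AWS SDK or the streamcollections API. It assumes Scala 2.13+ for `Iterator.unfold`.

```scala
// Hypothetical page of results: a batch of keys plus the offset of the next page, if any.
final case class Page(keys: Vector[String], next: Option[Int])

object BatchedListing {
  // Stand-in for the remote store (assumption: a real client would call S3 here).
  private val allKeys: Vector[String] = (1 to 10).map(i => f"doc-$i%02d").toVector

  // Hypothetical listing call: returns one page starting at `offset`.
  def listPage(offset: Int, pageSize: Int): Page = {
    val keys = allKeys.slice(offset, offset + pageSize)
    val nextOffset = offset + pageSize
    Page(keys, if (nextOffset < allKeys.length) Some(nextOffset) else None)
  }

  // Lazily walks the pages; only the current page is materialized at any time.
  def pages(pageSize: Int): Iterator[Vector[String]] =
    Iterator.unfold(Option(0)) {
      case Some(offset) =>
        val page = listPage(offset, pageSize)
        Some((page.keys, page.next))
      case None => None
    }

  // Counting never holds more than one batch in memory.
  def countKeys(pageSize: Int): Long =
    pages(pageSize).foldLeft(0L)((n, batch) => n + batch.size)
}
```

This version is synchronous. The asynchronous streams discussed next address the same traversal without blocking a thread per page.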

What Streams?

Enter Play Iteratees

Enumerator - Source

Enumeratee - Transformer

Iteratee - Consumer / Sink

Building Streams with Iteratees

Why Play Iteratees?

Most mature technology at the time

Production Experience

Building Streams with Iteratees

Play Iteratees via a counting example

Enumerator = Source

Enumeratee = Transformer

Iteratee = Sink / Reduce

Tying things together...
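
The code from these slides did not survive in the transcript. As a rough sketch of how the three pieces fit together, the following uses the Play Iteratees library (`play.api.libs.iteratee`, which must be on the classpath). The keys and the `.pdf` filter are made up for illustration; this is not the original counting example.

```scala
import play.api.libs.iteratee.{Enumeratee, Enumerator, Iteratee}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object CountingExample {
  // Enumerator = source: emits a fixed set of hypothetical keys.
  val keys: Enumerator[String] = Enumerator("a/1.pdf", "a/2.png", "b/3.pdf")

  // Enumeratee = transformer: passes through only the keys ending in ".pdf".
  val pdfOnly: Enumeratee[String, String] =
    Enumeratee.filter[String](_.endsWith(".pdf"))

  // Iteratee = sink: folds the stream into a running count.
  val counter: Iteratee[String, Int] =
    Iteratee.fold[String, Int](0)((n, _) => n + 1)

  // Tying things together: source through transformer, run into the sink.
  // The result is a Future because the stream is processed asynchronously.
  def countPdfs(): Future[Int] = keys &> pdfOnly |>>> counter

  def main(args: Array[String]): Unit =
    println(Await.result(countPdfs(), 5.seconds))
}
```

Even for a simple count, the operators (`&>`, `|>>>`) and three separate abstractions add up, which motivates the question on the next slide.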

Can this be simplified?

Streams as Scala Collections

We are all familiar with Scala collections

map

filter

foreach

grouped

count

Can reason about iteratee streams as a collection

Can now redo our grouped & count example

With the internals hidden, my counting code becomes simple
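
To make the idea concrete, here is a minimal sketch of a collection-style facade over a lazy source. `BatchStream` is a hypothetical name, and it wraps a plain `Iterator` rather than an iteratee stream. It is not the streamcollections API, only an illustration of hiding the plumbing behind `map`, `filter`, `grouped`, `foreach`, and `count`.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical wrapper: transformations stay lazy, terminal operations return Futures.
final class BatchStream[A](source: () => Iterator[A]) {
  def map[B](f: A => B): BatchStream[B] =
    new BatchStream[B](() => source().map(f))

  def filter(p: A => Boolean): BatchStream[A] =
    new BatchStream[A](() => source().filter(p))

  def grouped(n: Int): BatchStream[Seq[A]] =
    new BatchStream[Seq[A]](() => source().grouped(n))

  // Terminal operations run the traversal off the calling thread.
  def count(implicit ec: ExecutionContext): Future[Long] =
    Future(source().foldLeft(0L)((acc, _) => acc + 1))

  def foreach(f: A => Unit)(implicit ec: ExecutionContext): Future[Unit] =
    Future(source().foreach(f))
}

object BatchStreamDemo {
  import scala.concurrent.Await
  import scala.concurrent.duration._
  import ExecutionContext.Implicits.global

  // Made-up keys standing in for a bucket listing.
  val keys = new BatchStream(() => Iterator("a.pdf", "b.png", "c.pdf", "d.pdf"))

  // Grouped & count: how many batches of two PDF keys are there?
  def pdfBatches(): Long =
    Await.result(keys.filter(_.endsWith(".pdf")).grouped(2).count, 5.seconds)
}
```

The calling code reads like ordinary collection code, while the choice of underlying stream implementation stays hidden behind the wrapper.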

Streams as Scala Collections - Examples

Cleaning up files

Extract data by date
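
The example code for these two slides is not in the transcript. The sketch below shows what such operations might look like over a lazy stream of keys, using only the standard library. The key layout (an ISO date before the first `/`) and all names are assumptions for illustration, not Nitro's actual layout or code.

```scala
import java.time.LocalDate
import scala.util.Try

object KeysByDate {
  // Assumption: keys start with an ISO date segment, e.g. "2015-03-16/report.pdf".
  def dateOf(key: String): Option[LocalDate] =
    Try(LocalDate.parse(key.takeWhile(_ != '/'))).toOption

  // Extract data by date: lazily keep keys dated within [from, to].
  def extract(keys: Iterator[String], from: LocalDate, to: LocalDate): Iterator[String] =
    keys.filter(k => dateOf(k).exists(d => !d.isBefore(from) && !d.isAfter(to)))

  // Cleaning up files: group keys older than the cutoff into fixed-size delete batches.
  def cleanupBatches(keys: Iterator[String], cutoff: LocalDate, batchSize: Int): Iterator[Seq[String]] =
    keys.filter(k => dateOf(k).exists(_.isBefore(cutoff))).grouped(batchSize)
}
```

Keys without a parseable date are skipped by both operations rather than treated as errors. A real cleanup job would need to decide that case explicitly.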

Streams as Scala Collections - Applications

Can extend this model onto other data sources

We don’t have to stop at S3

➔ Relational DB

➔ ElasticSearch

➔ HBase / Cassandra

➔ Spark

"Much of my work has come from being lazy." - John Backus

Quoted in the IBM employee magazine Think in 1979 (http://en.wikiquote.org/wiki/John_Backus)

What We Learned

Iteratees are good for traversing large volumes of data

Programming iteratees can get a bit tricky

Scaling ain’t easy

Stream Collections abstraction makes streams simple

Future of Streams as Scala Collections

Continue developing a reactive S3 Client

In use in Nitro Production

Introduce other stream implementations (Akka Streams, etc.)

www.github.com/nitro/streamcollections

Contributors:

www.github.com/gregsilin / @gregsbriefs

www.github.com/mkolod / @marekinfo

Open Sourcing

Are you interested? We welcome collaborators!

San Francisco Scala Days 2015

• Nitro is a Gold sponsor

• Meet us at our community booth

sfscala.org:

• Wed: Scala D’Ehs meetup @ Stock in Trade

• Thu: unconference @ Galvanize

• Thu evening: Spark Notebook & Rapture @ Nitro

• Fri: free Shapeless training @ Nitro

We Are Hiring!

gonitro.com/about/jobs