Portable stateful big data processing in Apache Beam · Portable stateful big data processing in...

35
Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC Software Engineer @ Google [email protected] / @KennKnowles https://s.apache.org/ffsf-2017-beam-state Flink Forward San Francisco 2017

Transcript of Portable stateful big data processing in Apache Beam · Portable stateful big data processing in...

Page 1: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Portable stateful big data processing in Apache Beam

Kenneth Knowles

Apache Beam PMCSoftware Engineer @ Google

[email protected] / @KennKnowles https://s.apache.org/ffsf-2017-beam-stateFlink Forward San Francisco 2017

Page 2: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Agenda1. What is Apache Beam?

2. State

3. Timers

4. Example & Little Demo

Page 3: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

What isApache Beam?

Page 4: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

TL;DR

4(Flink draws it more like this)

Page 5: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

20142004 2006 2008 2010 2012 20162005 2007 2009 2013 20152011

MapReduce(paper)

Apache Hadoop

Dataflow Model(paper)

MillWheel(paper)

Heron

ApacheSpark

ApacheStorm

Apache Gearpump

(incubating)Apache

Apex

Apache Flink

Cloud Dataflow

FlumeJava(paper)

Apache Beam

DAGs, DAGs, DAGs

Apache Samza

Page 6: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

The Beam Vision

Sum Per Key

6

input.apply(

Sum.integersPerKey())

Java

input | Sum.PerKey()

Python

Apache Flinklocal, on-prem,

cloud

Apache Sparklocal, on-prem,

cloud

Cloud Dataflow: fully managed

Apache Apexlocal, on-prem,

cloud

Apache Gearpump

(incubating)

Page 7: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

The Beam Vision

KafkaIO

7

input | KakaIO.read()

Python

class KafkaIO extends

UnboundedSource { … }

Java

Apache Flinklocal, on-prem,

cloud

Apache Sparklocal, on-prem,

cloud

Cloud Dataflow: fully managed

Apache Apexlocal, on-prem,

cloud

Apache Gearpump

(incubating)

Page 8: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

The Beam Model

Pipeline

8

PTransform

PCollection(bounded or unbounded)

Page 9: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

The Beam Model

What are you computing? (read, map, reduce)

Where in event time? (event time windowing)

9

When in processing time are results produced? (triggers)

How do refinements relate? (accumulation mode)

Page 10: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

What are you computing?

10

ReadParallel connectors to

external systems

ParDoPer element

"Map"

GroupingGroup by key,

Combine per key,"Reduce"

CompositeEncapsulated

subgraph

Page 11: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

State

Page 12: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

What are you computing?

12

ReadParallel connectors to

external systems

ParDoPer element

"Map"

GroupingGroup by key,

Combine per key,"Reduce"

CompositeEncapsulated

subgraph

Page 13: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

State for ParDo

13

ParDo(DoFn)

Page 14: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

"Example"

14

Per Key Quantiles

ParDo.of(new DoFn<...>() {

// declare some state

@ProcessElement

public void process(...) {

// update quantiles

// output if needed

}

})

Page 15: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Partitioned

15

Quantiles Quantiles Quantiles

Page 16: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Parallelism!

16

QuantilesDoFn

QuantilesDoFn

QuantilesDoFn

new DoFn<Foo, Quantiles<Foo>>() {

@StateId("quantiles")

private final StateSpec<...>

quantilesState = StateSpecs.combining();

…}

Page 17: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Windowed

17

QuantilesDoFn

QuantilesDoFn

QuantilesDoFn

Window into Fixed windows of one hour

Expected result: Quantiles for each hour

Page 18: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Windowed

18

QuantilesDoFn

QuantilesDoFn

QuantilesDoFn

Window into windows of 30 min sliding by 10 min

Expected result: Quantiles for 30 minutes sliding by 10 min

Page 19: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

<k, w>1

<k, w>2

<k, w>3

...

"x" 3 7 15

"y" "fizz" "7" "fizzbuzz"

...

State is per key and window

Bonus: automatically garbage collected when a window expires(vs manual clearing of keyed state)

Page 20: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Windowed

20

QuantilesDoFn

QuantilesDoFn

QuantilesDoFn

Window into Fixed windows of one hour

Expected result: Quantiles for each hour

Page 21: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

What about Combine?

21

QuantilesCombiner

Window into Fixed windows of one hour

Expected result: Quantiles for each hour

QuantilesCombiner

QuantilesCombiner

Page 22: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Combine vs State (naive conceptualization)

22

Window hourly

Expected result:Quantiles for each hour

QuantilesCombiner

Window hourly

QuantilesDoFn

Expected result:Quantiles for each hour

Page 23: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Combine vs State (likely execution plan)

Quantiles Combiner

Quantiles Combiner

Quantiles Combiner

Shuffle Accumulators

QuantilesDoFn

Shuffle Elements

associativecommutativesingle-outputenables optimizations(engine is in control)

non-associative*non-commutative*

multi-outputside outputs(user is in control)

Page 24: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Combine vs State

Quantiles Combiner Quantiles

DoFn

output governed by trigger

(data/computation unaware) "output only when there's an interesting change"

(data/computation aware)

Page 25: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Example: Arbitrary indices

D,0

A

Assign indices

B

C

D

E

C,1

A,2

B,3

E,4

non-associative

non-commutative

non-deterministic

and totally fine!

Page 26: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Kinds of state

● Value - just a mutable cell for a value

● Bag - supports "blind writes"

● Combining - has a CombineFn built in; can support blind writes and lazy accumulation

● Set - membership checking

● Map - lookups and partial writes

Page 27: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Timers

Page 28: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Timers for ParDo

28

Stateful ParDo

new DoFn<...>() {

@TimerId("timeout")

private final TimerSpec timeoutTimer =

TimerSpecs.timer(TimeDomain.PROCESSING_TIME);

@OnTimer("timout")

public void timeout(...) {

// access state, set new timers, output

}

…}

Page 29: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Timers in processing time

"call me back in 5 seconds"

output based only on incoming element

output return value of batched RPC

buffer request

batched RPC

On Timer

On Element

Page 30: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Timers in event time

"call me back when the watermark hits the end of the window"

output speculative result immediately

output replacement value only if changed

store speculative result

On Timer

On Element

Page 31: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

● Per-key arbitrary numbering

● Output only when result changes

● Tighter "side input" management for slowly changing dimension

● Streaming join-matrix / join-biclique

● Fine-grained combine aggregation and output control

● Per-key "workflows" like user sign up flow w/ expiration

● Low-latency deduplication (let the first through, squash the rest)

More example uses for state & timers

Page 32: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Performance considerations (cross-runner)

32

● Shuffle to colocate keys

● Linear processing of elements for key+window

● Window merging

● Storage of state and timers

● GC of state

Page 33: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Demo

33

Page 34: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

SummaryState and Timers in Beam...

● … unlock new uses cases

● … they "just work" with event time windowing

● … are portable across runners (implementation ongoing)

Page 35: Portable stateful big data processing in Apache Beam · Portable stateful big data processing in Apache Beam Kenneth Knowles Apache Beam PMC ... Spark Apache Storm Apache Gearpump

Thank you for listening!This talk:

● Me - @KennKnowles● These Slides - https://s.apache.org/ffsf-2017-beam-state

Go Deeper● Design doc - https://s.apache.org/beam-state● Blog post - https://beam.apache.org/blog/2017/02/13/stateful-processing.html

Join the Beam community:● User discussions - [email protected]● Development discussions - [email protected]● Follow @ApacheBeam on Twitter

You can contribute to Beam + Flink● New types of state● Easy launch of Beam job on Flink-on-YARN● Integration tests at scale● Fit and finish: polish, polish, polish!● … and lots more!

https://beam.apache.org