Have your cake and eat it too, further dispelling the myths of the lambda architecture

Have Your Cake & Eat It Too: Further Dispelling the Myths of the Lambda Architecture. Tyler Akidau, Staff Software Engineer

Transcript of Have your cake and eat it too, further dispelling the myths of the lambda architecture

Page 1: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Have Your Cake & Eat It Too
Further Dispelling the Myths of the Lambda Architecture
Tyler Akidau
Staff Software Engineer

Page 2: Have your cake and eat it too, further dispelling the myths of the lambda architecture

- MillWheel: Stream Processing System
- Streaming Flume: High-level API
- Cloud Dataflow: Data Processing Service

Page 3: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Google Cloud Dataflow

[Figure: a pipeline reading from GCS and writing to GCS, with Optimize and Schedule stages]

Page 4: Have your cake and eat it too, further dispelling the myths of the lambda architecture

MillWheel
- Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more...

Streaming Flume
- Robert Bradshaw, Daniel Mills, and more...

Cloud Dataflow
- Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...

Page 5: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Cloud Dataflow is unreleased.

Things may change.

Page 6: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Agenda

1. Lambda vs Streaming
2. Strong Consistency
3. Reasoning About Time

Page 7: Have your cake and eat it too, further dispelling the myths of the lambda architecture

1. Lambda vs Streaming

Page 8: Have your cake and eat it too, further dispelling the myths of the lambda architecture

http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html

Page 9: Have your cake and eat it too, further dispelling the myths of the lambda architecture

The Lambda Architecture

Page 10: Have your cake and eat it too, further dispelling the myths of the lambda architecture

http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html

Page 11: Have your cake and eat it too, further dispelling the myths of the lambda architecture

The Evolution of Streaming

Page 12: Have your cake and eat it too, further dispelling the myths of the lambda architecture

What does it take?

Strong Consistency

Tools for Reasoning About Time

Page 13: Have your cake and eat it too, further dispelling the myths of the lambda architecture

2. Strong Consistency

Page 14: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Consistent Storage

Storage

Page 15: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Why consistency is important

• Mostly correct is not good enough
• Required for exactly-once processing
• Required for repeatable results
• Cannot replace batch without it

Page 16: Have your cake and eat it too, further dispelling the myths of the lambda architecture

How?

• Sequencers (e.g. BigTable)
• Leases (e.g. Spanner)
• Federation of storage silos (e.g. Samza, Dataflow)
• RDDs (e.g. Spark)

Page 17: Have your cake and eat it too, further dispelling the myths of the lambda architecture

http://research.google.com/pubs/pub41378.html

Page 18: Have your cake and eat it too, further dispelling the myths of the lambda architecture

3. Reasoning About Time

Page 19: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Event Time vs Stream Time

Batch vs Streaming

Approaches

Dataflow API

Page 20: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Event Time - When Events Happened

Stream Time - When Events Are Processed

Page 21: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Batch vs Streaming

Page 22: Have your cake and eat it too, further dispelling the myths of the lambda architecture

MapReduce

Batch

Page 23: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Batch: Fixed Windows

[Figure: MapReduce processing input in fixed one-hour windows, from [10:00 - 11:00) through [23:00 - 0:00)]

Page 24: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Batch: User Sessions

[Figure: MapReduce computing per-user sessions (Joan, Larry, Ingo, Amanda, Cheryl, Arthur) over the windows [10:00 - 11:00) and [11:00 - 12:00)]

Page 25: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Streaming

[Figure: a stream laid out along a time axis from 10:00 to 16:00]

Page 26: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Confounding characteristics of data streams

- Unordered
- Unbounded
- Of Varying Event Time Skew

Page 27: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Event Time Skew

[Figure: plot of Stream Time (vertical axis) against Event Time (horizontal axis), with the skew between them labeled]

Page 28: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Approaches

Page 29: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Approaches to reasoning about time

1. Time-Agnostic Processing
2. Approximation
3. Stream Time Windowing
4. Event Time Windowing

Page 30: Have your cake and eat it too, further dispelling the myths of the lambda architecture

1. Time-Agnostic Processing - Filters

[Figure: stream along a Stream Time axis from 10:00 to 16:00]

Example Input: Web server traffic logs
Example Output: All traffic from specific domains

Pros: Straightforward, Efficient
Cons: Limited utility
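A filter of this kind can be sketched in plain Java as below. This is a minimal illustration, not Dataflow code: the `LogEntry` class, its fields, and the `fromDomains` helper are hypothetical names invented for the example. The point it shows is that the predicate never looks at a timestamp, which is why the approach is time-agnostic.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class DomainFilter {
    // Hypothetical log record; the fields are assumptions for illustration.
    static final class LogEntry {
        final String domain;
        final String path;

        LogEntry(String domain, String path) {
            this.domain = domain;
            this.path = path;
        }
    }

    // Keeps only entries from the given domains. No timestamp is consulted,
    // so the result does not depend on arrival order or lateness.
    static List<LogEntry> fromDomains(List<LogEntry> entries, Set<String> domains) {
        return entries.stream()
                .filter(e -> domains.contains(e.domain))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<LogEntry> logs = List.of(
                new LogEntry("example.com", "/index.html"),
                new LogEntry("other.org", "/about"),
                new LogEntry("example.com", "/search"));
        for (LogEntry e : fromDomains(logs, Set.of("example.com"))) {
            System.out.println(e.domain + e.path);
        }
    }
}
```

Because each record is judged on its own, the same logic applies unchanged whether the input is a bounded batch or an unbounded stream.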

Page 31: Have your cake and eat it too, further dispelling the myths of the lambda architecture

1. Time-Agnostic Processing - Hash Join

[Figure: stream along a Stream Time axis from 10:00 to 16:00]

Example Input: Query & Click traffic
Example Output: Joined stream of Query + Click pairs

Pros: Straightforward, Efficient
Cons: Limited utility

Page 32: Have your cake and eat it too, further dispelling the myths of the lambda architecture

2. Approximation via Online Algorithms

[Figure: stream along a Stream Time axis from 10:00 to 16:00]

Example Input: Twitter hashtags
Example Output: Approximate top N hashtags per prefix

Pros: Efficient
Cons: Inexact, Complicated Algorithms

Page 33: Have your cake and eat it too, further dispelling the myths of the lambda architecture

3. Windowing by Stream Time

[Figure: stream along a Stream Time axis from 10:00 to 16:00]

Example Input: Web server request traffic
Example Output: Per-minute rate of received requests

Pros: Straightforward, Results reflect contents of stream
Cons: Results don’t reflect events as they happened; If approximating event time, usefulness varies

Page 34: Have your cake and eat it too, further dispelling the myths of the lambda architecture

4. Windowing by Event Time - Fixed Windows

[Figure: the same data shown on an Event Time axis and a Stream Time axis, each from 10:00 to 16:00]

Example Input: Twitter hashtags
Example Output: Top N hashtags by prefix per hour.

Pros: Reflects events as they occurred
Cons: More complicated buffering, Completeness issues

Page 35: Have your cake and eat it too, further dispelling the myths of the lambda architecture

4. Windowing by Event Time - Sessions

[Figure: the same data shown on an Event Time axis and a Stream Time axis, each from 10:00 to 16:00]

Example Input: User activity stream
Example Output: Per-session group of activities

Pros: Reflects events as they occurred
Cons: More complicated buffering, Completeness issues
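Session windowing by event time can be sketched for a single user as follows. This is an illustrative sketch, not the Dataflow implementation: the `SessionWindows` class and the gap semantics are assumptions. Here each event opens a proto-window `[t, t + gap)`, and overlapping proto-windows are merged into one session. Events are sorted by event time first, which is what makes the result independent of arrival order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SessionWindows {
    // Groups one user's event timestamps into sessions.
    // Returns {start, end} pairs in event time, with end exclusive.
    static List<long[]> sessions(long[] eventTimes, long gap) {
        long[] sorted = eventTimes.clone();
        Arrays.sort(sorted); // order by event time, not by arrival
        List<long[]> result = new ArrayList<>();
        for (long t : sorted) {
            long[] last = result.isEmpty() ? null : result.get(result.size() - 1);
            if (last != null && t < last[1]) {
                // The event falls inside the current session: extend it.
                last[1] = Math.max(last[1], t + gap);
            } else {
                // Gap exceeded: start a new session.
                result.add(new long[] {t, t + gap});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Timestamps in seconds, arriving out of order; gap of 60 seconds.
        for (long[] s : sessions(new long[] {0, 30, 200, 50}, 60)) {
            System.out.println("[" + s[0] + ", " + s[1] + ")");
        }
    }
}
```

A streaming system cannot sort the whole input up front, which is where the buffering and completeness costs listed on the slide come from.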

Page 36: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Dataflow API

Page 37: Have your cake and eat it too, further dispelling the myths of the lambda architecture

What are you computing?

Where in event time?

When in stream time?

Page 38: Have your cake and eat it too, further dispelling the myths of the lambda architecture

What = Aggregation API

Where = Windowing API

When = Watermarks + Triggers API

Page 39: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Aggregation API

    PCollection<KV<String, Double>> sums = Pipeline
        .begin()
        .read("userRequests")
        .apply(new Sum());

Page 40: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Aggregation API

[Figure: Sum applied to a collection of input values]

Page 41: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Streaming Mode

[Figure: input values shown on a Stream Time axis and an Event Time axis, each from 10:00 to 10:06]

Page 42: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Windowing API

    PCollection<KV<String, Long>> sums = Pipeline
        .begin()
        .read("userRequests")
        .apply(Window.into(new FixedWindows(2, MINUTE)))
        .apply(new Sum());
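What fixed windowing followed by a sum computes can be sketched in plain Java. This is not the Dataflow API; the `FixedWindowSum` class and its method names are hypothetical, and events are represented as `{eventTimeMillis, value}` pairs purely for brevity. Each value is assigned to the window containing its event timestamp and then summed per window, regardless of arrival order.

```java
import java.util.Map;
import java.util.TreeMap;

class FixedWindowSum {
    // Start of the fixed window that contains the given event timestamp.
    // floorMod keeps the result correct for negative timestamps as well.
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - Math.floorMod(eventTimeMillis, windowSizeMillis);
    }

    // Sums values per window. Each event is {eventTimeMillis, value}.
    // The result maps window start (millis) to the sum for that window.
    static Map<Long, Long> sumPerWindow(long[][] events, long windowSizeMillis) {
        Map<Long, Long> sums = new TreeMap<>();
        for (long[] e : events) {
            sums.merge(windowStart(e[0], windowSizeMillis), e[1], Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        long twoMinutes = 2 * 60 * 1000L;
        // Events arrive out of event-time order.
        long[][] events = {{90_000L, 4}, {30_000L, 3}, {130_000L, 5}};
        System.out.println(sumPerWindow(events, twoMinutes)); // prints {0=7, 120000=5}
    }
}
```

The sketch sees all of its input before answering. On an unbounded stream there is no such moment, so deciding when a window's sum may be emitted is a separate question.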

Page 43: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Windowing API

[Figure: input values on a Stream Time axis and an Event Time axis (10:00 to 10:06), assigned to FixedWindows on the Event Time axis and reduced with Sum per window]

Page 44: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Watermarks

● f(S) -> E
● S = a point in stream time (i.e. now)
● E = the point in event time up to which input data is complete as of S
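One way to make f(S) -> E concrete is a heuristic watermark that trails the largest event time seen so far by a fixed bound. The slides do not say how Dataflow or MillWheel compute watermarks, so this is a swapped-in technique: the `HeuristicWatermark` class and its `maxSkew` parameter are assumptions for illustration. The sketch also flags records that arrive behind the watermark as late.

```java
class HeuristicWatermark {
    private final long maxSkew;                 // assumed bound on event time skew
    private long watermark = Long.MIN_VALUE;    // E: event time believed complete

    HeuristicWatermark(long maxSkew) {
        this.maxSkew = maxSkew;
    }

    // Observes one record's event time. Returns true if the record is late,
    // i.e. it arrived with an event time already behind the watermark.
    boolean observe(long eventTime) {
        boolean late = eventTime < watermark;
        // The watermark trails the newest event time by maxSkew and
        // never moves backwards.
        watermark = Math.max(watermark, eventTime - maxSkew);
        return late;
    }

    long current() {
        return watermark;
    }

    public static void main(String[] args) {
        HeuristicWatermark wm = new HeuristicWatermark(10);
        long[] arrivals = {100, 95, 120, 100};
        for (long t : arrivals) {
            boolean late = wm.observe(t);
            System.out.println("event=" + t + " late=" + late + " watermark=" + wm.current());
        }
    }
}
```

The choice of `maxSkew` is the whole trade-off: a large bound holds the watermark back and delays results, while a small bound advances it quickly and lets more records arrive late.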

Page 45: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Event Time Skew

[Figure: plot of Stream Time (vertical axis) against Event Time (horizontal axis)]

Page 46: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Watermarks

[Figure: input values on a Stream Time axis and an Event Time axis, assigned to FixedWindows and reduced with Sum, with the watermark determining when each window's sum is produced]

Page 47: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Watermark Caveats

Too slow = more latency

Too fast = late data

Page 48: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Triggers

When in stream time to emit?

Page 49: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Triggers API

    PCollection<KV<String, Long>> sums = Pipeline
        .begin()
        .read("userRequests")
        .apply(Window.into(new FixedWindows(2, MINUTE))
            .trigger(new AtWatermark()))
        .apply(new Sum());

Page 50: Have your cake and eat it too, further dispelling the myths of the lambda architecture

[Figure: per-window sums over Event Time (10:00 to 10:06) emitted at the watermark, with one input value (5) marked as a late datum]

Page 51: Have your cake and eat it too, further dispelling the myths of the lambda architecture

A Better Strategy

1. Once per stream time minute
2. At watermark
3. Once per record for two weeks
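The three-phase strategy can be sketched as a small per-window state machine. This is an illustration under stated assumptions, not the Dataflow trigger implementation: the `WindowTriggerSketch` class, its callbacks, and the choice to emit the accumulated sum on every firing are all hypothetical. Early results fire on a periodic timer, one on-time result fires when the watermark passes the end of the window, and each late record fires a refinement until the allowed lateness expires.

```java
import java.util.ArrayList;
import java.util.List;

class WindowTriggerSketch {
    private final long windowEnd;        // end of the window in event time (exclusive)
    private final long allowedLateness;  // how long, in event time, late records are accepted
    private long sum = 0;
    private boolean onTimeFired = false;
    private boolean closed = false;
    final List<String> emitted = new ArrayList<>();

    WindowTriggerSketch(long windowEnd, long allowedLateness) {
        this.windowEnd = windowEnd;
        this.allowedLateness = allowedLateness;
    }

    // A record for this window arrives.
    void onRecord(long value) {
        if (closed) return;              // past allowed lateness: dropped
        sum += value;
        if (onTimeFired) emit("late");   // phase 3: once per late record
    }

    // Phase 1: periodic early results, in stream time.
    void onPeriodicTimer() {
        if (!onTimeFired && !closed) emit("early");
    }

    // Phase 2: a single on-time result when the watermark passes the window.
    void onWatermark(long watermark) {
        if (!onTimeFired && watermark >= windowEnd) {
            onTimeFired = true;
            emit("on-time");
        }
        if (watermark >= windowEnd + allowedLateness) closed = true;
    }

    private void emit(String kind) {
        emitted.add(kind + ":" + sum);
    }

    public static void main(String[] args) {
        WindowTriggerSketch w = new WindowTriggerSketch(120, 1_000);
        w.onRecord(5);
        w.onRecord(7);
        w.onPeriodicTimer();   // early result
        w.onRecord(8);
        w.onWatermark(120);    // on-time result
        w.onRecord(5);         // late record, emitted immediately
        w.onWatermark(1_120);  // allowed lateness expired
        w.onRecord(1);         // dropped
        System.out.println(w.emitted); // prints [early:12, on-time:20, late:25]
    }
}
```

Each emission here carries the running total for the window, so a consumer would overwrite the previous value rather than add to it.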

Page 52: Have your cake and eat it too, further dispelling the myths of the lambda architecture

[Figure: per-window sums over Event Time (10:00 to 10:06), showing early results that are refined as more values arrive, and a late datum (5) that updates a window's sum from 20 to 25]

Page 53: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Triggers API

    PCollection<KV<String, Long>> sums = Pipeline
        .begin()
        .read("userRequests")
        .apply(Window.into(new FixedWindows(2, MINUTE))
            .trigger(new SequenceOf(
                new RepeatUntil(
                    new AtPeriod(1, MINUTE),
                    new AtWatermark()),
                new AtWatermark(),
                new RepeatUntil(
                    new AfterCount(1),
                    new AfterDelay(
                        14, DAYS, TimeDomain.EVENT_TIME)))))
        .apply(new Sum());

Page 54: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Lambda vs Streaming

Low-latency, approximate results

Complete, correct results as soon as possible

Ability to deal with changes upstream

Page 55: Have your cake and eat it too, further dispelling the myths of the lambda architecture

One Last Thing...

What if I want sessions?

Page 56: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Triggers API

    PCollection<KV<String, Long>> sums = Pipeline
        .begin()
        .read("userRequests")
        .apply(Window.into(new Sessions(1, MINUTE))
            .trigger(new SequenceOf(
                new RepeatUntil(
                    new AtPeriod(1, MINUTE),
                    new AtWatermark()),
                new AtWatermark(),
                new RepeatUntil(
                    new AfterCount(1),
                    new AfterDelay(
                        14, DAYS, TimeDomain.EVENT_TIME)))))
        .apply(new Sum());

Page 57: Have your cake and eat it too, further dispelling the myths of the lambda architecture

[Figure: session windows over Event Time (10:00 to 10:06) built with a 1 minute gap, with sums emitted as sessions grow and merge, and a late datum (5) included]

Page 58: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Summary

Lambda is great
Streaming by itself is better :-)

Strong Consistency = Correctness
Streaming = Aggregation + Windowing + Triggers
Tools For Reasoning About Time = Power + Flexibility

Page 59: Have your cake and eat it too, further dispelling the myths of the lambda architecture

Thank you!

Questions?

Questions about this talk: [email protected] (Tyler Akidau)

Questions about Cloud Dataflow: [email protected] (Eric Schmidt)