Have your cake and eat it too, further dispelling the myths of the lambda architecture
-
Upload
dimos-raptis -
Category
Software
-
view
106 -
download
0
Transcript of Have your cake and eat it too, further dispelling the myths of the lambda architecture
Have Your Cake & Eat It TooFurther Dispelling the Myths of the Lambda ArchitectureTyler AkidauStaff Software Engineer
MillWheel Streaming
FlumeCloud Dataflow
- Stream Processing System- High-level API- Data Processing Service
Google Cloud Dataflow
Optimize
Schedule
GCS
GCS
- Slava Chernyak, Josh Haberman, Reuven Lax, Daniel Mills, Paul Nordstrom, Sam McVeety, Sam Whittle, and more...
- Robert Bradshaw, Daniel Mills,and more...
- Robert Bradshaw, Craig Chambers, Reuven Lax, Daniel Mills, Frances Perry, and more...
MillWheel
Streaming Flume
Cloud Dataflow
Cloud Dataflow is unreleased.
Things may change.
Lambda vs Streaming
Strong Consistency
Reasoning About Time
1
2
3
Agenda
Lambda vs Streaming1
http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
The Lambda Architecture
http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html
The Evolution of Streaming
Strong Consistency
Tools for Reasoning About Time
What does it take?
Strong Consistency2
Consistent Storage
Storage
• Mostly correct is not good enough• Required for exactly-once processing• Required for repeatable results• Cannot replace batch without it
Why consistency is important
• Sequencers (e.g. BigTable)• Leases (e.g. Spanner)• Federation of storage silos (e.g. Samza,
Dataflow)• RDDs (e.g. Spark)
How?
http://research.google.com/pubs/pub41378.html
Reasoning About Time 3
Event Time vs Stream TimeBatch vs Streaming
ApproachesDataflow API
Event Time - When Events Happened
Stream Time - When Events Are Processed
Batch vs Streaming
MapReduce
Batch
MapReduce
[10:00 - 11:00)
[10:00 - 11:00)
[11:00 - 12:00)[12:00 -
13:00)[13:00 - 14:00)[14:00 -
15:00)[15:00 - 16:00)[16:00 -
17:00)[18:00 - 19:00)[19:00 -
20:00)[21:00 - 22:00)[22:00 -
23:00)[23:00 - 0:00)
Batch: Fixed Windows
MapReduce
[10:00 - 11:00)
[11:00 - 12:00)
Batch: User Sessions
Joan
Larry
Ingo
Amanda
Cheryl
Arthur
[11:00 - 12:00)
[10:00 - 11:00)
Streaming
11:00
10:00
16:00
15:00
14:00
13:00
12:00
UnorderedUnbounded
Of Varying Event Time Skew
Confounding characteristics of data streams
Event Time Skew
Stre
am
Tim
e
Event Time
Skew
Approaches
1.Time-Agnostic Processing2.Approximation3.Stream Time Windowing4.Event Time Windowing
Approaches to reasoning about time
1.Time-Agnostic Processing - Filters
11:00
10:00
16:00
15:00
14:00
13:00
12:00
Stream Time
Example Input:Example Output:
Pros:
Cons:
Web server traffic logsAll traffic from specific domains
StraightforwardEfficientLimited utility
1.Time-Agnostic Processing - Hash Join
11:00
10:00
16:00
15:00
14:00
13:00
12:00
Stream Time
Example Input:Example Output:
Pros:
Cons:
Query & Click traffic Joined stream of Query + Click pairs
StraightforwardEfficientLimited utility
2.Approximation via Online Algorithms
11:00
10:00
16:00
15:00
14:00
13:00
12:00
Stream Time
Example Input:Example Output:
Pros:Cons:
Twitter hashtagsApproximate top N hashtags per prefix
EfficientInexactComplicated Algorithms
11:00
10:00
16:00
15:00
14:00
13:00
12:00
Stream Time
Web server request trafficPer-minute rate of received requestsStraightforwardResults reflect contents of streamResults don’t reflect events as they happenedIf approximating event time, usefulness varies
Example Input:Example Output:
Pros:
Cons:
3.Windowing by Stream Time
11:00
10:00
16:00
15:00
14:00
13:00
12:00
Event Time
Example Input:Example Output:
Pros:Cons:
Twitter hashtagsTop N hashtags by prefix per hour.Reflects events as they occurredMore complicated bufferingCompleteness issues
11:00
10:00
16:00
15:00
14:00
13:00
12:00
Stream Time
4.Windowing by Event Time - Fixed Windows
11:00
10:00
16:00
15:00
14:00
13:00
12:00
Event Time
Example Input:Example Output:
Pros:Cons:
User activity streamPer-session group of activitiesReflects events as they occurredMore complicated bufferingCompleteness issues
11:00
10:00
16:00
15:00
14:00
13:00
12:00
Stream Time
4.Windowing by Event Time - Sessions
Dataflow API
What are you computing?
Where in event time?
When in stream time?
What = Aggregation API
Where = Windowing API
When = Watermarks + Triggers API
Aggregation API
PCollection<KV<String, Double>> sums = Pipeline .begin() .read(“userRequests”) .apply(new Sum());
Aggregation API
2
4
7
0
1
6
33
8
918
9
16Sum
Streaming Mode
10:02 10:0010:06 10:04 Stream Time
24 3 1
63
3 87
0 24
16
3
38
9
0
4
7
0 33
2
10:02 10:0010:06 10:04 Event Time
1
38
9
046
18
7
0 23
432
74 3
6
3
0 33
2
Windowing API
PCollection<KV<String, Long>> sums = Pipeline .begin() .read(“userRequests”) .apply(Window.into(new FixedWindows(2, MINUTE))); .apply(new Sum());
Windowing API
10:02 10:0010:06 10:04 Stream Time
24 3 1
63
3 87
0
10:02 10:0010:06 10:04 Event Time
10:02 10:0010:06 10:04 Event Time
24
16
3
38
9
0
4
7
0 33
2
FixedWindows
Sum
1
38
9
046
18
7
0 23
432
74 3
6
3
0 33
2
13 12616 1835 415
Watermarks
●f(S) -> E●S = a point in stream time (i.e. now)
●E = the point in event time up to which input data is complete as of S
Event Time Skew
Stre
am
Tim
e
Event Time
FixedWindows
Sum
10:02 10:00
13616 1835 415 12
10:06 10:04 Stream Time
24 3 1
63
3 87
0
10:02 10:0010:06 10:04 Event Time
10:01 10:0010:03 10:02 Event Time
24
16
3
38
9
0
4
7
0 33
2
1
38
9
046
18
7
0 23
432
74 3
6
3
0 33
2
Watermarks
Watermark Caveats
Too slow = more latency
Too fast = late data
Triggers
When in stream time to emit?
Triggers API
PCollection<KV<String, Long>> sums = Pipeline .begin() .read(“userRequests”) .apply(Window.into(new FixedWindows(2, MINUTES)) .trigger(new AtWatermark()); .apply(new Sum());
2
3
1
8
4
8
7
6 3
1313
Event Time10:05 10:0610:0110:00
10:0310:00
10:06
2
12 5
52020
99
10:0110:02
10:0510:04
10:02 10:03 10:04
5Late datum
A Better Strategy
1.Once per stream time minute
2.At watermark3.Once per record for two weeks
1325 5
2
3
1
8
4
8
7
6 3
5
Event Time10:05 10:0610:0110:00
10:0310:00
10:06
2
12 5
52020
99
10:0110:02
10:0510:04
10:02 10:03 10:04
12
20
1320
5
1320
13
1325
2
12
13
9
20 520
25Late datum
91325
Triggers APIPCollection<KV<String, Long>> sums = Pipeline .begin() .read(“userRequests”) .apply(Window.into(new FixedWindows(2, MINUTE)) .trigger(new SequenceOf( new RepeatUntil( new AtPeriod(1, MINUTE), new AtWatermark()), new AtWatermark(), new RepeatUntil( new AfterCount(1), new AfterDelay( 14, DAYS, TimeDomain.EVENT_TIME)))); .apply(new Sum());
Lambda vs Streaming
Low-latency, approximate resultsComplete, correct results as soon as possible
Ability to deal with changes upstream
One Last Thing...
What if I want sessions?
Triggers APIPCollection<KV<String, Long>> sums = Pipeline .begin() .read(“userRequests”) .apply(Window.into(new Sessions(1, MINUTE)) .trigger(new SequenceOf( new RepeatUntil( new AtPeriod(1, MINUTE), new AtWatermark()), new AtWatermark(), new RepeatUntil( new AfterCount(1), new AfterDelay( 14, DAYS, TimeDomain.EVENT_TIME)))); .apply(new Sum());
2
8
4
8
7
6 3
Event Time10:05 10:0610:0110:00
10:0310:00
10:0610:01
10:0210:05
10:04
10:02 10:03 10:04
22
3
1
9
1 minute2 7
1 minute2
2 7
1 minute
9 39 3
39 1 48
89 7 2525
25
5Late datum
25 83333
33
335 3838
6 3938 9
20 5
13
2
12
25
9
Summary
Lambda is greatStreaming by itself is better :-)
Strong Consistency = CorrectnessStreaming = Aggregation + Windowing +
TriggersTools For Reasoning About Time = Power +
Flexibility
Thank you!
Questions?Questions about this talk:
Questions about Cloud Dataflow:
[email protected] (Tyler Akidau)[email protected] (Eric Schmidt)