Counting Elements in Streams
Uploaded by: jamie-grier
Transcript of Counting Elements in Streams
![Page 2: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/2.jpg)
Introduction
![Page 3: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/3.jpg)
Data streaming is becoming increasingly popular*
*Biggest understatement of 2016
![Page 4: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/4.jpg)
Streaming technology is enabling the obvious: continuous processing on data
that is continuously produced
![Page 5: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/5.jpg)
Streaming is the next programming paradigm for data applications, and you
need to start thinking in terms of streams
![Page 6: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/6.jpg)
Counting
![Page 7: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/7.jpg)
Continuous counting
A seemingly simple application, but generally an unsolved problem
E.g., count visitors, impressions, interactions, clicks, etc.
Aggregations and OLAP cube operations are generalizations of counting
![Page 8: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/8.jpg)
Counting in batch architecture
Continuous ingestion
Periodic (e.g., hourly) files
Periodic batch jobs
![Page 9: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/9.jpg)
Problems with batch architecture
High latency
Too many moving parts
Implicit treatment of time
Out of order event handling
Implicit batch boundaries
![Page 10: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/10.jpg)
Counting in λ architecture
"Batch layer": what we had before
"Stream layer": approximate early results
![Page 11: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/11.jpg)
Problems with batch and λ
Way too many moving parts (and code duplication)
Implicit treatment of time
Out of order event handling
Implicit batch boundaries
![Page 12: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/12.jpg)
Counting in streaming architecture
Message queue ensures stream durability and replay
Stream processor ensures consistent counting
![Page 13: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/13.jpg)
Counting in Flink DataStream API
Number of visitors in last hour by country

    DataStream<LogEvent> stream = env
        .addSource(new FlinkKafkaConsumer(...)) // create stream from Kafka
        .keyBy("country")                       // group by country
        .timeWindow(Time.minutes(60))           // window of size 1 hour
        .apply(new CountPerWindowFunction());   // do operations per window
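To make the semantics of the pipeline on this slide concrete, here is a minimal plain-Java sketch of "group by country, then count per one-hour window" over a finite list of events. It does not use the Flink API. `VisitEvent`, `TumblingWindowCounter`, and the `country@windowStart` key format are hypothetical names chosen for illustration, and the sketch assumes each event carries its own non-negative timestamp in milliseconds.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical event type: the slide does not show LogEvent, so these fields are assumptions.
class VisitEvent {
    final String country;
    final long timestampMillis;

    VisitEvent(String country, long timestampMillis) {
        this.country = country;
        this.timestampMillis = timestampMillis;
    }
}

class TumblingWindowCounter {
    static final long HOUR_MILLIS = 60L * 60L * 1000L;

    // Groups events by country and by the one-hour window their timestamp falls into.
    // Keys have the form "country@windowStartMillis".
    static Map<String, Long> countPerCountryAndHour(List<VisitEvent> events) {
        Map<String, Long> counts = new TreeMap<>();
        for (VisitEvent e : events) {
            // Round the timestamp down to the start of its window.
            long windowStart = e.timestampMillis - (e.timestampMillis % HOUR_MILLIS);
            counts.merge(e.country + "@" + windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<VisitEvent> events = Arrays.asList(
            new VisitEvent("DE", 10L),
            new VisitEvent("US", 20L),
            new VisitEvent("DE", 30L),
            new VisitEvent("DE", HOUR_MILLIS + 5L));
        System.out.println(countPerCountryAndHour(events));
    }
}
```

A stream processor does the same bookkeeping incrementally on an unbounded stream and decides when a window is complete and can be emitted; the sketch sidesteps that by working on a finite list.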
![Page 14: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/14.jpg)
Counting hierarchy of needs
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
Based on Maslow's hierarchy of needs
![Page 20: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/20.jpg)
Counting hierarchy of needs
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable (1.1+)
![Page 21: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/21.jpg)
Rest of this talk
Continuous counting
... with low latency,
... efficiently on high volume streams,
... fault tolerant (exactly once),
... accurate and repeatable,
... queryable
![Page 22: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/22.jpg)
Latency
![Page 23: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/23.jpg)
Yahoo! Streaming Benchmark
Storm, Spark Streaming, and Flink benchmark by the Storm team at Yahoo!
Focus on measuring end-to-end latency at low throughputs
First benchmark that was modeled after a real application
Read more:
• https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
![Page 24: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/24.jpg)
Benchmark task: counting!
Count ad impressions grouped by campaign
Compute aggregates over last 10 seconds
Make aggregates available for queries (Redis)
![Page 25: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/25.jpg)
Results (lower is better)
Flink and Storm at sub-second latencies
Spark has a latency-throughput tradeoff
170k events/sec
![Page 26: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/26.jpg)
Efficiency and scalability
![Page 27: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/27.jpg)
Handling high-volume streams
Scalability: how many events/sec can a system scale to, with infinite resources?
Scalability comes at a cost (systems add overhead to be scalable)
Efficiency: how many events/sec can a system scale to, with limited resources?
![Page 28: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/28.jpg)
Extending the Yahoo! benchmark
The Yahoo! benchmark is a great starting point to understand engine behavior.
However, the benchmark stops at low write throughput and its programs are not fault tolerant.
We extended the benchmark to (1) high volumes, and (2) use Flink's built-in state.
• http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
![Page 29: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/29.jpg)
Results (higher is better)
Also: Flink jobs are correct under failures (exactly once), Storm jobs are not
500k events/sec
3 million events/sec
15 million events/sec
![Page 30: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/30.jpg)
Fault tolerance and repeatability
![Page 31: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/31.jpg)
Stateful streaming applications
Most interesting stream applications are stateful
How to ensure that state is correct after failures?
![Page 32: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/32.jpg)
Fault tolerance definitions
At least once
• May over-count
• In some cases even under-count with different definition
Exactly once
• Counts are the same after failure
End-to-end exactly once
• Counts appear the same in an external sink (database, file system) after failure
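The difference between the first two definitions can be sketched with a small simulation. This is an illustrative model, not Flink's checkpointing mechanism: `ReplayCountingDemo` and its parameters are hypothetical, and it assumes a replayable source that always rewinds to the last checkpointed offset after a failure. The only thing that varies is whether the counter state is rolled back together with the offset.

```java
class ReplayCountingDemo {
    // Simulates counting totalEvents events with one checkpoint and one failure.
    // checkpointOffset: offset at which the count was snapshotted.
    // failureAt: number of events processed before the failure.
    // restoreState: true  = roll the count back to the snapshot (exactly once),
    //               false = keep the count as it was at the failure (at least once).
    static long run(int totalEvents, int checkpointOffset, int failureAt, boolean restoreState) {
        if (checkpointOffset < 0 || checkpointOffset >= failureAt || failureAt > totalEvents) {
            throw new IllegalArgumentException("need 0 <= checkpointOffset < failureAt <= totalEvents");
        }
        long count = 0;
        long snapshotCount = 0;
        // Normal processing until the failure.
        for (int offset = 0; offset < failureAt; offset++) {
            if (offset == checkpointOffset) {
                snapshotCount = count; // checkpoint: count as of this offset
            }
            count++;
        }
        // Recovery: the source rewinds to the checkpointed offset.
        if (restoreState) {
            count = snapshotCount;
        }
        for (int offset = checkpointOffset; offset < totalEvents; offset++) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("exactly once:  " + run(10, 4, 7, true));
        System.out.println("at least once: " + run(10, 4, 7, false));
    }
}
```

With 10 events, a checkpoint at offset 4, and a failure after 7 events, restoring the state yields 10, while replaying without restoring it counts offsets 4 to 6 twice and yields 13.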
![Page 33: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/33.jpg)
Fault tolerance in Flink
Flink guarantees exactly once
End-to-end exactly once supported with specific sources and sinks
• E.g., Kafka ➔ Flink ➔ HDFS
Internally, Flink periodically takes consistent snapshots of the state without ever stopping the computation
![Page 34: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/34.jpg)
Savepoints
Maintaining stateful applications in production is challenging
Flink savepoints: externally triggered, durable checkpoints
Easy code upgrades (Flink or app), maintenance, migration, debugging, what-if simulations, and A/B tests
![Page 35: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/35.jpg)
Explicit handling of time
![Page 36: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/36.jpg)
Notions of time
1977: Episode IV: A New Hope
1980: Episode V: The Empire Strikes Back
1983: Episode VI: Return of the Jedi
1999: Episode I: The Phantom Menace
2002: Episode II: Attack of the Clones
2005: Episode III: Revenge of the Sith
2015: Episode VII: The Force Awakens
Episode number (order within the story): this is called event time
Release year (order in which the films arrived): this is called processing time
![Page 37: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/37.jpg)
Out of order streams
Event time windows
Arrival time windows
Instant event-at-a-time
First burst of events
Second burst of events
![Page 38: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/38.jpg)
Why event time
Most stream processors are limited to processing/arrival time; Flink can operate on event time as well
Benefits of event time
• Accurate results for out of order data
• Sessions and unaligned windows
• Time travel (backstreaming)
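The first benefit can be illustrated with a short plain-Java sketch that counts the same events once per event-time window and once per arrival-time window. `TimedEvent` and `EventTimeDemo` are hypothetical names, the timestamps are made up, and the sketch assumes non-negative times and a finite list; it does not model watermarks or how a real processor decides that a window is complete.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.ToLongFunction;

// Hypothetical event carrying both notions of time.
class TimedEvent {
    final long eventTime;   // when the event happened
    final long arrivalTime; // when the processor saw it

    TimedEvent(long eventTime, long arrivalTime) {
        this.eventTime = eventTime;
        this.arrivalTime = arrivalTime;
    }
}

class EventTimeDemo {
    // Counts events per tumbling window, using whichever timestamp timeOf selects.
    // The returned map goes from window start to number of events in that window.
    static Map<Long, Long> countPerWindow(List<TimedEvent> events, long windowSize,
                                          ToLongFunction<TimedEvent> timeOf) {
        Map<Long, Long> counts = new TreeMap<>();
        for (TimedEvent e : events) {
            long t = timeOf.applyAsLong(e);
            long windowStart = t - (t % windowSize);
            counts.merge(windowStart, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Two events arrive late: (8, 12) and (3, 21).
        List<TimedEvent> events = Arrays.asList(
            new TimedEvent(1, 2),
            new TimedEvent(5, 6),
            new TimedEvent(8, 12),
            new TimedEvent(11, 13),
            new TimedEvent(3, 21));
        System.out.println("event time:   " + countPerWindow(events, 10, e -> e.eventTime));
        System.out.println("arrival time: " + countPerWindow(events, 10, e -> e.arrivalTime));
    }
}
```

Four of the five events happened in the window starting at 0, but only two of them arrived in it, so arrival-time windows spread the count over three windows and the result depends on delivery delays rather than on the data.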
![Page 39: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/39.jpg)
What's coming up in Flink
![Page 40: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/40.jpg)
Evolution of streaming in Flink
Flink 0.9 (Jun 2015): DataStream API in beta, exactly-once guarantees via checkpointing
Flink 0.10 (Nov 2015): Event time support, windowing mechanism based on Dataflow/Beam model, graduated DataStream API, high availability, state interface, new/updated connectors (Kafka, Nifi, Elastic, ...), improved monitoring
Flink 1.0 (Mar 2016): DataStream API stability, out of core state, savepoints, CEP library, improved monitoring, Kafka 0.9 support
![Page 41: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/41.jpg)
Upcoming features
SQL: ongoing work in collaboration with Apache Calcite
Dynamic scaling: adapt resources to stream volume, historical stream processing
Queryable state: ability to query the state inside the stream processor
Mesos support
More sources and sinks (e.g., Kinesis, Cassandra)
![Page 42: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/42.jpg)
Queryable state
Using the stream processor as a database
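The idea can be sketched in plain Java: the counts live inside the processing component, and readers query them directly instead of waiting for them to be written to an external store such as Redis. This is only an illustration of the concept, not Flink's queryable state interface; `QueryableCounter`, `onEvent`, and `query` are hypothetical names.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

class QueryableCounter {
    // Per-key counts held inside the processor; safe for concurrent readers and writers.
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    // Called by the processing thread for every incoming event.
    void onEvent(String key) {
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    // Called by query clients; returns the current count for the key, or 0 if unseen.
    long query(String key) {
        LongAdder adder = counts.get(key);
        return adder == null ? 0L : adder.sum();
    }

    public static void main(String[] args) {
        QueryableCounter counter = new QueryableCounter();
        counter.onEvent("campaign-1");
        counter.onEvent("campaign-1");
        counter.onEvent("campaign-2");
        System.out.println(counter.query("campaign-1"));
    }
}
```

The sketch leaves out what makes this hard in a distributed processor: locating the node that holds a given key and keeping query results consistent with checkpoints.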
![Page 43: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/43.jpg)
Closing
![Page 44: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/44.jpg)
Summary
Stream processing gaining momentum, the right paradigm for continuous data applications
Even seemingly simple applications can be complex at scale and in production – choice of framework crucial
Flink: unique combination of capabilities, performance, and robustness
![Page 45: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/45.jpg)
Flink in the wild
30 billion events daily
2 billion events in 10 1Gb machines
Picked Flink for "Saiki" data integration & distribution platform
See talks by at
![Page 46: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/46.jpg)
Join the community!
Follow: @ApacheFlink, @dataArtisans
Read: flink.apache.org/blog, data-artisans.com/blog
Subscribe: (news | user | dev) @ flink.apache.org
![Page 47: Counting Elements in Streams](https://reader035.fdocuments.in/reader035/viewer/2022062523/5872853b1a28abc7068b701b/html5/thumbnails/47.jpg)
2 meetups next week in Bay Area!
April 6, San Jose
April 5, San Francisco
What's new in Flink 1.0 & recent performance benchmarks with Flink
http://www.meetup.com/Bay-Area-Apache-Flink-Meetup/