How to extract valuable information from real-time data feeds
![Page 1: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/1.jpg)
How to extract valuable information from real-time data feeds
Gene Leybzon, February 2016
![Page 2: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/2.jpg)
“The critical challenge is using this data when it is still in motion – and extracting valuable information from it.”
- Frédéric Combaneyre, SAS
IoT Challenge
![Page 3: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/3.jpg)
• Detect events of interest and trigger appropriate actions
• Aggregate information for monitoring
• Sensor data cleansing and validation
• Real-time predictive and optimized operations (support for real-time decision making)
Role of Data Streams
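A minimal sketch of two of the roles above, cleansing sensor data and detecting an event of interest, written as plain Python generators. The plausible range, window size, threshold, and sample readings are all assumptions made up for illustration; they are not from the slides.

```python
from collections import deque
from typing import Iterable, Iterator, Optional, Tuple


def clean(readings: Iterable[Optional[float]],
          low: float = -40.0, high: float = 125.0) -> Iterator[float]:
    """Drop missing values and readings outside an assumed plausible range."""
    for value in readings:
        if value is not None and low <= value <= high:
            yield value


def detect_events(readings: Iterable[float], window: int = 3,
                  threshold: float = 30.0) -> Iterator[Tuple[int, float]]:
    """Yield (index, moving_average) whenever the moving average of the
    last `window` readings exceeds `threshold`."""
    recent = deque(maxlen=window)
    for i, value in enumerate(readings):
        recent.append(value)
        if len(recent) == window:
            avg = sum(recent) / window
            if avg > threshold:
                yield i, avg


# Hypothetical raw feed: None is a missing sample, 999.0 is a sensor glitch.
raw = [21.0, None, 22.5, 999.0, 29.0, 33.0, 35.0, 20.0]
alerts = list(detect_events(clean(raw)))
print(alerts)  # one alert, at index 4 of the cleaned stream
```

Because both stages are generators, each reading is validated and checked as it arrives, without holding the whole feed in memory.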
![Page 4: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/4.jpg)
Platforms
![Page 5: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/5.jpg)
Google Cloud Platform
![Page 6: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/6.jpg)
AWS IoT Initiative
![Page 7: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/7.jpg)
SAS
![Page 8: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/8.jpg)
• Transform data: convert the data into another format, for example converting a captured device signal voltage to a calibrated unit of temperature.
• Aggregate and compute data: combining data lets you add checks, such as averaging data across multiple devices to avoid acting on a single spurious device, or ensuring you still have actionable data if a single device goes offline. Adding computation to your pipeline lets you apply streaming analytics to data while it is still in the processing pipeline.
• Enrich data: combine the device-generated data with other metadata about the device, or with other datasets such as weather or traffic data, for use in subsequent analysis.
• Move data: store the processed data in one or more final storage locations.
Role of “Pipelines”
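The four pipeline stages can be sketched as small composable functions. This is only an illustration: the calibration constants, field names, site metadata, and the list used as a storage sink are all hypothetical.

```python
from statistics import mean


def transform(reading):
    """Convert a raw voltage (mV) into a temperature (deg C).
    Assumed calibration: 10 mV per degree with a 500 mV offset."""
    return {"device": reading["device"],
            "temp_c": (reading["mv"] - 500.0) / 10.0}


def aggregate(readings):
    """Average across devices so a single spurious or offline device
    has less influence on the result."""
    return {"temp_c": mean(r["temp_c"] for r in readings),
            "devices": len(readings)}


def enrich(record, metadata):
    """Attach extra context, here a hypothetical site name and outside temperature."""
    return {**record, **metadata}


def move(record, sink):
    """Store the processed record; a list stands in for a real storage location."""
    sink.append(record)


raw = [{"device": "a", "mv": 720.0},
       {"device": "b", "mv": 730.0},
       {"device": "c", "mv": 710.0}]
storage = []
combined = aggregate([transform(r) for r in raw])
move(enrich(combined, {"site": "lab-1", "outside_c": 12.0}), storage)
print(storage[0])
```

Keeping each stage a separate function makes it straightforward to reorder stages or swap one out, for example replacing the list sink with a database writer.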
![Page 9: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/9.jpg)
Architecture
![Page 10: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/10.jpg)
• Fault-tolerance against hardware failures and human errors
• Support for a variety of use cases that include low latency querying as well as updates
• Linear scale-out capabilities, meaning that throwing more machines at the problem should help with getting the job done
• Extensibility so that the system is manageable and can accommodate newer features easily
• Consistency - data is the same across the cluster
• Availability - ability to access the cluster even if a node in the cluster goes down
• Partition-tolerance - cluster continues to function even if there is a "partition" (communications break) between two nodes
What do we want from a stream architecture?
![Page 11: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/11.jpg)
“It is impossible for a distributed computer system to simultaneously provide all three of the following guarantees:
• Consistency (all nodes see the same data at the same time)
• Availability (a guarantee that every request receives a response about whether it succeeded or failed)
• Partition tolerance (the system continues to operate despite arbitrary partitioning due to network failures)”
CAP Theorem
![Page 12: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/12.jpg)
Facing the CAP Theorem
[Diagram: Consistency, Availability, and Partition Tolerance, annotated with: ∅; Cassandra, Riak, Couchbase, MongoDB; λ; Paxos, Zab, Raft]
![Page 13: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/13.jpg)
λ-Architecture
![Page 14: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/14.jpg)
• One-way data flow (it does not transact or make per-event decisions on the streaming data, nor does it respond immediately to incoming events)
• Eventual consistency
• NoSQL
• Complexity
Limitations of the λ-Architecture
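To make the complexity and eventual-consistency points concrete, here is a toy sketch of the λ-architecture idea: an append-only master log, a batch view recomputed from that log, and a speed view that covers events since the last batch run. The class and method names are invented for this illustration and do not correspond to any particular framework.

```python
from collections import Counter


class LambdaCounts:
    """Toy event counter with a batch layer, a speed layer, and a merged query."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # recomputed from the whole log
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, key):
        """Data flows one way: append to the log and update the speed view."""
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        """Recompute the batch view from scratch, then reset the speed view."""
        self.batch_view = Counter(self.master_log)
        self.speed_view = Counter()

    def query(self, key):
        """Answer by merging both views."""
        return self.batch_view[key] + self.speed_view[key]


counts = LambdaCounts()
for event in ["a", "b", "a"]:
    counts.ingest(event)
before_batch = counts.query("a")  # served entirely by the speed view
counts.run_batch()
counts.ingest("a")
after_batch = counts.query("a")   # batch view plus one speed-view event
print(before_batch, after_batch)
```

Even in this toy version the same counting logic exists twice (in `ingest` and in `run_batch`), which hints at why maintaining two code paths is often cited as a cost of the approach.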
![Page 15: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/15.jpg)
Out-of-the-Box Solutions
![Page 16: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/16.jpg)
• Designed for low latency
• Open-sourced in 2012
• Long history of data
• Scale: > 500K events/sec on average
Druid Project
![Page 17: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/17.jpg)
Druid data store
![Page 18: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/18.jpg)
• Distributed stream processing framework
• Simple API
• Fault tolerance
• Manages stream state
• Guarantees that messages are processed in the order they were written to a partition, and that no messages are ever lost
Apache Samza
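The ordering guarantee above can be illustrated with a small in-memory model: a log split into partitions, and one task per partition that tracks its own offset and local state. This is not the Samza API (which is Java-based); `PartitionedLog` and `Task` are hypothetical names, and a real system would persist both the checkpoint and the state.

```python
class PartitionedLog:
    """Append-only log split into partitions; order is kept within a partition."""

    def __init__(self, partitions):
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key, value):
        # Deterministic partitioning by key, so one key always maps to one partition.
        p = sum(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p


class Task:
    """Processes one partition in order and remembers how far it has read."""

    def __init__(self, log, partition):
        self.log = log
        self.partition = partition
        self.checkpoint = 0  # offset of the next message to process
        self.state = {}      # local state: running total per key

    def poll(self):
        messages = self.log.partitions[self.partition]
        while self.checkpoint < len(messages):
            key, value = messages[self.checkpoint]
            self.state[key] = self.state.get(key, 0) + value
            self.checkpoint += 1  # advance only after the message is processed


log = PartitionedLog(2)
p1 = log.append("sensor-1", 5)
log.append("sensor-2", 1)
log.append("sensor-1", 7)

tasks = [Task(log, p) for p in range(2)]
for task in tasks:
    task.poll()

log.append("sensor-1", 3)  # arrives later; only this message is processed next
for task in tasks:
    task.poll()
print(tasks[p1].state["sensor-1"])
```

Because the checkpoint only moves forward after a message has been applied, a task that resumes from its checkpoint neither skips messages nor reorders them within its partition.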
![Page 19: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/19.jpg)
Apache Samza
![Page 20: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/20.jpg)
Samza Architecture
![Page 21: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/21.jpg)
VoltDB
![Page 22: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/22.jpg)
Stream Databases and Pipelines
Building Blocks
![Page 23: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/23.jpg)
PipelineDB (example of usage)
![Page 24: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/24.jpg)
AWS Kinesis
![Page 25: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/25.jpg)
Apache Cassandra
• Decentralized (every node in the cluster has the same role); no single point of failure
• Scalable: read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications
• Fault-tolerant
• Tunable level of consistency, all the way from "writes never fail" to "block for all replicas to be readable"
• Hadoop integration, including MapReduce
• Query language (CQL)
![Page 26: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/26.jpg)
Apache Flink
• High performance
• Low latency
• Support for out-of-order events
• Flexible streaming windows
• Fault tolerance
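Out-of-order handling and windowing can be sketched without any framework. The function below is plain Python, not the Flink API: it groups events into fixed-size (tumbling) event-time windows and uses a simple watermark, the largest event time seen minus an allowed delay, to decide when a window is closed. The function name, parameters, and sample events are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_size, max_delay):
    """Sum values per event-time window.

    events: iterable of (event_time, value) in arrival order.
    Returns (results, dropped): results lists (window_start, total) in the
    order windows close; dropped lists events that arrived after their
    window had already been closed.
    """
    open_windows = defaultdict(float)
    results, dropped = [], []
    watermark = float("-inf")
    for event_time, value in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            dropped.append((event_time, value))  # too late for its window
        else:
            open_windows[start] += value
        watermark = max(watermark, event_time - max_delay)
        # Close every window that ends at or before the watermark.
        for s in sorted(w for w in open_windows if w + window_size <= watermark):
            results.append((s, open_windows.pop(s)))
    # End of stream: flush whatever is still open.
    for s in sorted(open_windows):
        results.append((s, open_windows.pop(s)))
    return results, dropped


# Event time 8 arrives after 12 but is still accepted; event time 3 is too late.
events = [(1, 1.0), (4, 2.0), (12, 3.0), (8, 4.0), (17, 5.0), (3, 6.0), (25, 7.0)]
results, dropped = tumbling_window_sums(events, window_size=10, max_delay=5)
print(results)  # [(0, 7.0), (10, 8.0), (20, 7.0)]
print(dropped)  # [(3, 6.0)]
```

A larger `max_delay` tolerates more disorder at the cost of emitting each window's result later, which is the latency trade-off behind watermark-based designs.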
![Page 27: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/27.jpg)
Stream Processing Algorithms
![Page 28: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/28.jpg)
• Finding frequent items
• Estimating the number of distinct items
• Statistics
• Finding “signal”
• Error correction
• Filtering
• Anomaly detection
• Incremental learning
• Data clustering
Popular Stream Algorithms
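Two items from the list above, finding frequent items and computing statistics, have well-known one-pass algorithms. The sketch below uses the Misra-Gries summary for frequent items and Welford's method for running mean and variance; the slides do not name specific algorithms, so these choices and the sample data are assumptions.

```python
def misra_gries(stream, k):
    """Frequent-items summary using at most k - 1 counters.

    Any item occurring more than n / k times in a stream of length n is
    guaranteed to be among the returned keys; the counts are underestimates.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


class RunningStats:
    """Welford's one-pass mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0


summary = misra_gries(["a", "b", "a", "c", "a", "b", "d", "a"], k=3)
print(summary)  # {'a': 2}

stats = RunningStats()
for x in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(x)
print(stats.mean, stats.variance)
```

Both structures use memory that is independent of the stream length, which is the property that makes them suitable for unbounded feeds.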
![Page 29: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/29.jpg)
Machine Learning from Stream Data
![Page 30: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/30.jpg)
• Takes recent history into account
• The ML model is updatable (it “evolves” as new data comes in)
How is ML from stream data different from traditional ML techniques?
![Page 31: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/31.jpg)
• Incremental algorithms (both support vector machines and neural networks can work incrementally)
• Periodic retraining with a new data batch
Two Approaches to Adapt ML to Stream Data
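The two approaches can be contrasted on a deliberately simple model. The sketch below fits a one-variable linear model in two ways: `OnlineLinearModel` updates its weights after every example (incremental), while `fit_batch` recomputes a least-squares fit from a buffer of recent points (periodic retraining). The slides mention support vector machines and neural networks; a linear model is swapped in here only to keep the example short, and the learning rate and synthetic data are assumptions.

```python
from collections import deque


class OnlineLinearModel:
    """y ~ w * x + b, updated one example at a time with gradient steps."""

    def __init__(self, lr=0.05):
        self.w = 0.0
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        return self.w * x + self.b

    def update(self, x, y):
        """Take one gradient step on the squared error of a single example."""
        error = self.predict(x) - y
        self.w -= self.lr * error * x
        self.b -= self.lr * error
        return error


def fit_batch(points):
    """Closed-form least-squares fit of y ~ w * x + b on a batch of (x, y)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    w = sxy / sxx
    return w, mean_y - w * mean_x


# Synthetic stream following y = 2x + 1, cycling through a few x values.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
stream = [(xs[i % len(xs)], 2.0 * xs[i % len(xs)] + 1.0) for i in range(2000)]

online = OnlineLinearModel()
recent = deque(maxlen=50)  # buffer used for periodic retraining
batch_w, batch_b = 0.0, 0.0
for i, (x, y) in enumerate(stream, start=1):
    online.update(x, y)    # approach 1: update on every event
    recent.append((x, y))
    if i % 500 == 0:       # approach 2: retrain on the recent batch
        batch_w, batch_b = fit_batch(list(recent))

print(round(online.w, 3), round(online.b, 3), batch_w, batch_b)
```

The incremental model adapts continuously but depends on a well-chosen learning rate; periodic retraining is simpler to reason about but is only as fresh as its last run.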
![Page 32: How to extract valueable information from real time data feeds](https://reader035.fdocuments.in/reader035/viewer/2022062503/58868f4d1a28abf6158b5f85/html5/thumbnails/32.jpg)
Questions?