Real-Time Big Data Processing with DataTorrent...
Apache Apex Unified Batch and Stream Processing for Big Data
Milind Barve
Nov. 03, 2015
Project History
• Project development started in 2012 at DataTorrent
• Open-sourced in July 2015
• Apache Apex started incubation in August 2015
Project Status
Apex in Apache Incubation Stage
Mentor List
• Ted Dunning: Apache Member, MapR
• Alan Gates: Apache Member, Hortonworks
• Taylor Goetz: Apache Member, Hortonworks
• Justin Mclean: Apache Member, Class Software
• Chris Nauroth: Apache Member, Hortonworks
• Hitesh Shah: Apache Member, Hortonworks
Apache Apex (Incubating) Committer List
Over 50 committers already, and growing.
What we will serve you today …
– Batch & Streaming: two worlds collide?
– Apex Engine: all the nerdy features
– Questions, if you still have some
– Develop your first app on Apex …
Lambda Architecture
[Diagram: Batch Layer, Speed Layer, Serving Layer; master dataset; batch view; real-time views; queries]
Apex Real-time Unified Architecture
[Diagram: Aggregate Layer, Incremental Layer; master dataset; incremental dataset; aggregate query; Aggregate View]
Apex Real-time Unified Architecture
[Diagram: Aggregate Layer, Incremental Layer; master dataset; incremental dataset; aggregate query; rolling query; Aggregate View; Incremental View]
Apex Platform Overview Enterprise Edition
Apache Apex-Malhar
Application Programming Model
• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations, and emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• A Directed Acyclic Graph (DAG) is made up of operators and streams
[Diagram: Operators connected by streams of Tuples, with an Output Stream]
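The operator and stream model above can be sketched in plain Java. This is a minimal illustration only and does not use the Apex API: `DagSketch`, `Operator`, `run`, and `longWords` are hypothetical names, and the stream between two operators is assumed to be a simple in-process callback.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Plain-Java illustration of operators and streams; not the Apex API. */
class DagSketch {
    /** An operator receives one tuple at a time and emits zero or more tuples downstream. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> out);
    }

    /** Wires input -> first -> second and collects the tuples of the final output stream. */
    static <A, B, C> List<C> run(List<A> input, Operator<A, B> first, Operator<B, C> second) {
        List<C> sink = new ArrayList<>();
        for (A tuple : input) {
            // The stream between the two operators is modelled as a direct callback.
            first.process(tuple, mid -> second.process(mid, sink::add));
        }
        return sink;
    }

    /** Example DAG: split lines into words, then keep words longer than three characters. */
    static List<String> longWords(List<String> lines) {
        Operator<String, String> splitter = (line, out) -> {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    out.accept(word);
                }
            }
        };
        Operator<String, String> filter = (word, out) -> {
            if (word.length() > 3) {
                out.accept(word);
            }
        };
        return run(lines, splitter, filter);
    }

    public static void main(String[] args) {
        System.out.println(longWords(Arrays.asList("a stream is a sequence", "of data tuples")));
    }
}
```

Each operator here holds only its own logic and knows nothing about its neighbours; the wiring lives in `run`, which plays the role of the DAG definition.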
Apex Component Overview
[Diagram: Hadoop Edge Node with DT RTS Management Server, CLI, and REST API; Hadoop Node with a YARN Container running the Apex App Master; Hadoop Nodes with YARN Containers running Streaming Containers, each with Op1 and Op2 on Thread1 and Op3 on Thread-N. Legend: Part of Community Edition]
Apex Engine
Core Features
Native Hadoop Integration
• YARN is the resource manager
• HDFS is used for storing any persistent state
Partitioning and Scaling Out
Partitioning & scaling built-in
• Operators can be statically or dynamically scaled
• Flexible stream splits
• Parallel partitioning
• MxN partitioning
• Unifiers
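The idea behind partitions and unifiers can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the Apex API: `PartitionSketch` and its methods are hypothetical names, tuples are assumed to be string keys, and routing by hash code is just one possible way to split a stream.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of partitioned counting with a unifier; not the Apex API. */
class PartitionSketch {
    /** Routes a key to one of n partitions by hash, so equal keys always land together. */
    static int partitionFor(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    /** Each partition counts only the keys routed to it, independently of the others. */
    static List<Map<String, Integer>> countPerPartition(List<String> keys, int n) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            partials.add(new HashMap<>());
        }
        for (String key : keys) {
            partials.get(partitionFor(key, n)).merge(key, 1, Integer::sum);
        }
        return partials;
    }

    /** The unifier merges the partial results of all partitions into one logical output. */
    static Map<String, Integer> unify(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, count) -> total.merge(key, count, Integer::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "a", "c", "a");
        System.out.println(unify(countPerPartition(keys, 3)));
    }
}
```

Because the unifier only sums partial counts, the merged result does not depend on how many partitions were used, which is what allows the partition count to change without changing the application logic.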
Advanced Windowing Support
• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
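The difference between tumbling and sliding windows can be sketched in plain Java. This is a minimal illustration and not the Apex API: `WindowSketch`, `tumbling`, and `sliding` are hypothetical names, and the input is assumed to be one already-computed value per underlying window.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java illustration of tumbling and sliding aggregation; not the Apex API. */
class WindowSketch {
    /** Tumbling: sums consecutive, non-overlapping groups of `size` values. */
    static List<Integer> tumbling(List<Integer> perWindow, int size) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + size <= perWindow.size(); start += size) {
            int sum = 0;
            for (int i = start; i < start + size; i++) {
                sum += perWindow.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    /** Sliding: sums the most recent `size` values, advancing one value at a time. */
    static List<Integer> sliding(List<Integer> perWindow, int size) {
        List<Integer> sums = new ArrayList<>();
        for (int end = size; end <= perWindow.size(); end++) {
            int sum = 0;
            for (int i = end - size; i < end; i++) {
                sum += perWindow.get(i);
            }
            sums.add(sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Integer> values = Arrays.asList(1, 2, 3, 4, 5, 6);
        System.out.println(tumbling(values, 3)); // [6, 15]
        System.out.println(sliding(values, 3));  // [6, 9, 12, 15]
    }
}
```

With six input values and a size of three, the tumbling version emits two results and the sliding version emits four, since each sliding result reuses two values from the previous one.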
Stateful Fault Tolerance
• Supported out of the box
– Application state
– Application master state
– No data loss
• Automatic recovery
• Lunch test
• Buffer server
Processing Semantics
• AT_LEAST_ONCE (default): Windows are processed at least once
• AT_MOST_ONCE: Windows are processed at most once
– During recovery, all downstream operators are fast-forwarded to the window of the latest checkpoint
• EXACTLY_ONCE: Windows are processed exactly once
– Checkpoint every window
– Checkpointing becomes blocking
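The effect of these modes on which windows get processed can be sketched with a small simulation. This is a simplified model and not Apex's recovery code: `RecoverySketch`, `Mode`, and `processed` are hypothetical names, and the resume points are assumptions (at-least-once replays from the window after the last checkpoint, at-most-once skips ahead past the window that was in flight at failure).

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of window processing around a single failure; not Apex's recovery code. */
class RecoverySketch {
    enum Mode { AT_LEAST_ONCE, AT_MOST_ONCE }

    /**
     * Returns the window ids processed, in order, when windows 1..lastWindow arrive,
     * the last checkpoint was taken after window lastCheckpoint, and the operator
     * fails while window failedAt is in flight.
     */
    static List<Integer> processed(Mode mode, int lastWindow, int lastCheckpoint, int failedAt) {
        List<Integer> seen = new ArrayList<>();
        // Windows fully processed before the failure.
        for (int w = 1; w < failedAt; w++) {
            seen.add(w);
        }
        // Assumption: at-least-once replays everything after the checkpoint,
        // at-most-once resumes after the window that was lost.
        int resumeFrom = (mode == Mode.AT_LEAST_ONCE) ? lastCheckpoint + 1 : failedAt + 1;
        for (int w = resumeFrom; w <= lastWindow; w++) {
            seen.add(w);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(processed(Mode.AT_LEAST_ONCE, 6, 2, 5)); // [1, 2, 3, 4, 3, 4, 5, 6]
        System.out.println(processed(Mode.AT_MOST_ONCE, 6, 2, 5));  // [1, 2, 3, 4, 6]
        System.out.println(processed(Mode.AT_LEAST_ONCE, 6, 4, 5)); // [1, 2, 3, 4, 5, 6]
    }
}
```

In this model, at-least-once repeats windows 3 and 4 but loses nothing, while at-most-once repeats nothing but loses window 5. The third call shows why checkpointing every window matters for exactly-once: when the checkpoint is always the window just before the failure, replay covers each window once.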
Compute Locality
• Data locality
• Stream locality for placement of operators
– Rack local: distributed deployment
– Node local: data does not traverse the NIC
– Container local: data doesn’t need to be serialized
– Thread local: operators run in the same thread
Dynamic Updates
• Dynamic topology updates
– Properties of operators can be changed
– New operators
• Upcoming
– Update attributes
For more Info …
• Mailing List: [email protected]
• Apache Apex: http://apex.apache.org/
• GitHub
ᵒ Apex Core: http://github.com/apache/incubator-apex-core
ᵒ Apex Malhar: http://github.com/apache/incubator-apex-malhar
• DataTorrent: http://www.datatorrent.com