Real-Time Big Data Processing with DataTorrent RTS · Apache Apex Unified Batch and Stream...
Apache Apex Unified Batch and Stream Processing for Big Data
Milind Barve
Nov. 03, 2015
Project History
• Project development started in 2012 at DataTorrent
• Open-sourced in July 2015
• Apache Apex started incubation in August 2015
Project Status
Mentor List
• Ted Dunning: Apache Member, MapR
• Alan Gates: Apache Member, Hortonworks
• Taylor Goetz: Apache Member, Hortonworks
• Justin Mclean: Apache Member, Class Software
• Chris Nauroth: Apache Member, Hortonworks
• Hitesh Shah: Apache Member, Hortonworks
Apex In Apache Incubation Stage
Apache Apex (Incubating) Committer List
Over 50 committers already, and growing
What we will serve you today …
– Batch & Streaming: two worlds collide?
– Apex Engine: all the nerdy features
– Questions, you still have some?
– Develop your first app on Apex …
Lambda Architecture
[Figure labels: master dataset, Batch Layer, Speed Layer, Serving Layer, batch view, real time view, query]
Apex Real-time Unified Architecture
[Figure labels: master dataset, incremental dataset, Aggregate Layer, Incremental Layer, aggregate query, rolling query, Aggregate View, Incremental View]
Apex Platform Overview Enterprise Edition
Apache Apex-Malhar
Directed Acyclic Graph (DAG)
Application Programming Model
• A Stream is a sequence of data tuples
• An Operator takes one or more input streams, performs computations & emits one or more output streams
• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded
• Directed Acyclic Graph (DAG) is made up of operators and streams
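To make the model above concrete, here is a minimal sketch in plain Python. It does not use the Apex API (which is Java); the `Operator` and `Dag` classes, their method names, and the three example operators are all hypothetical, chosen only to show tuples flowing along streams between operators.

```python
from collections import defaultdict, deque


class Operator:
    """Hypothetical stand-in for an operator with one input and one output."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn  # maps one input tuple to a list of output tuples

    def process(self, tup):
        return self.fn(tup)


class Dag:
    """Hypothetical graph of operators (nodes) and streams (edges)."""

    def __init__(self):
        self.operators = {}
        self.streams = defaultdict(list)  # upstream name -> downstream names

    def add_operator(self, op):
        self.operators[op.name] = op
        return op

    def add_stream(self, src, dst):
        self.streams[src.name].append(dst.name)

    def run(self, source_name, tuples):
        """Push tuples through the graph; collect what leaf operators emit."""
        results = defaultdict(list)
        queue = deque((source_name, t) for t in tuples)
        while queue:
            name, tup = queue.popleft()
            for out in self.operators[name].process(tup):
                downstream = self.streams.get(name, [])
                if not downstream:
                    results[name].append(out)  # no outgoing stream: a leaf
                for d in downstream:
                    queue.append((d, out))
        return dict(results)


dag = Dag()
split = dag.add_operator(Operator("split", lambda line: line.split()))
lower = dag.add_operator(Operator("lower", lambda w: [w.lower()]))
keep = dag.add_operator(Operator("keep", lambda w: [w] if len(w) > 2 else []))
dag.add_stream(split, lower)
dag.add_stream(lower, keep)

output = dag.run("split", ["Apex is a DAG", "of Operators"])
print(output)
```

In this toy version everything runs in one thread; the slide's point that each operator instance is single-threaded while many instances run in parallel is not modelled here.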
[Figure: Operators connected by streams of Tuples, producing an Output Stream]
Apex Component Overview
[Figure labels: Hadoop Edge Node, DT RTS Management Server, CLI, REST API, Hadoop Node, YARN Container, Apex App Master, Streaming Container, Thread1 … Thread-N, Op1, Op2, Op3; legend: Part of Community Edition]
Apex Engine
Core Features
• YARN is the resource manager
• HDFS used for storing any persistent state
Native Hadoop Integration
Partitioning & Scaling built-in
• Operators can be statically/dynamically scaled
• Flexible Streams split
• Parallel partitioning
• MxN partitioning
• Unifiers
Partitioning and Scaling Out
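As a rough illustration of parallel partitioning with a unifier, the sketch below splits a stream of words across partitions by key, counts within each partition, and merges the partial counts. It is plain Python rather than the Apex API; the function names, the CRC32-based key hashing, and the partition count are assumptions made for the example.

```python
import zlib
from collections import Counter


def partition_index(key, num_partitions):
    # Deterministic hash so the same key always lands in the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def split_stream(words, num_partitions):
    """Route each tuple to one partition based on its key."""
    partitions = [[] for _ in range(num_partitions)]
    for w in words:
        partitions[partition_index(w, num_partitions)].append(w)
    return partitions


def count_partition(words):
    """Work done by one partition of the counting operator."""
    return Counter(words)


def unify(partials):
    """Unifier: merge the partial results of all partitions."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total


words = ["a", "b", "a", "c", "b", "a"]
partitions = split_stream(words, 3)
partials = [count_partition(p) for p in partitions]
totals = unify(partials)
print(totals)
```

Because routing is by key, each word is counted in exactly one partition, and the unifier only has to add up disjoint partial results.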
Advanced Windowing support
• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
Advanced Windowing Support
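The difference between tumbling and sliding windows can be sketched as follows. This is plain Python over a list of per-window values, not Apex code; the function names and the idea of summing a fixed number of consecutive smaller windows are assumptions for illustration.

```python
def tumbling(values, size):
    """Aggregate non-overlapping groups of `size` consecutive windows."""
    return [sum(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


def sliding(values, size):
    """Aggregate the latest `size` windows, advancing one window at a time."""
    return [sum(values[i:i + size])
            for i in range(len(values) - size + 1)]


per_window = [1, 2, 3, 4, 5, 6]  # one value per small window
tumbling_result = tumbling(per_window, 3)
sliding_result = sliding(per_window, 3)
print(tumbling_result, sliding_result)
```

A tumbling window emits one result per group, while a sliding window emits a result each time a new small window arrives.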
• Supported out of the box
– Application state
– Application master state
– No data loss
• Automatic recovery
• Lunch test
• Buffer server
Stateful Fault Tolerance
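One way to picture checkpointing combined with buffered replay is the toy model below. It is not how Apex implements recovery; the `CountingOperator` and `Harness` classes, the checkpoint interval, and the in-memory buffer are hypothetical simplifications that only show the idea of restoring saved state and replaying the windows that arrived after it.

```python
import copy


class CountingOperator:
    """Hypothetical stateful operator: keeps a running total."""

    def __init__(self):
        self.state = {"total": 0}

    def process_window(self, tuples):
        self.state["total"] += sum(tuples)


class Harness:
    """Toy model: checkpoints operator state and buffers windows for replay."""

    def __init__(self, operator, checkpoint_every):
        self.operator = operator
        self.checkpoint_every = checkpoint_every
        self.checkpoint = (0, copy.deepcopy(operator.state))  # (window id, state)
        self.buffer = {}  # window id -> tuples not yet covered by a checkpoint

    def deliver(self, window_id, tuples):
        self.buffer[window_id] = list(tuples)
        self.operator.process_window(tuples)
        if window_id % self.checkpoint_every == 0:
            self.checkpoint = (window_id, copy.deepcopy(self.operator.state))
            # Windows up to the checkpoint no longer need replaying.
            self.buffer = {w: t for w, t in self.buffer.items() if w > window_id}

    def recover(self):
        """Restore the last checkpoint, then replay buffered windows in order."""
        _, state = self.checkpoint
        self.operator.state = copy.deepcopy(state)
        for w in sorted(self.buffer):
            self.operator.process_window(self.buffer[w])


op = CountingOperator()
harness = Harness(op, checkpoint_every=2)
for window_id in range(1, 6):
    harness.deliver(window_id, [window_id])

op.state = {"total": -999}  # simulate losing in-memory state in a crash
harness.recover()
print(op.state["total"])
```

After recovery the total matches what it was before the simulated crash, because the state saved at window 4 plus the replay of window 5 reconstructs it.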
• AT_LEAST_ONCE (default):
– Windows are processed at least once
• AT_MOST_ONCE:
– Windows are processed at most once
• During recovery, all downstream operators are fast-forwarded to the window of latest checkpoint
• EXACTLY_ONCE:
– Windows are processed exactly once
• Checkpoint every window
• Checkpointing becomes blocking
Processing Semantics
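The three modes can be compared with a small counting model. This is a generic illustration of the guarantees, not the Apex recovery protocol: the `emissions` function, its parameters, and the assumption that an operator fails while processing one window after emitting all earlier ones are hypothetical.

```python
from collections import Counter


def emissions(mode, last_window, failed_window, checkpoint_every):
    """Count how many times each window's result reaches downstream.

    Toy model: the operator fails while processing `failed_window`,
    after having emitted every earlier window exactly once.
    """
    if mode == "EXACTLY_ONCE":
        checkpoint_every = 1  # checkpoint after every window
    emitted = Counter(range(1, failed_window))  # emitted before the failure
    last_checkpoint = ((failed_window - 1) // checkpoint_every) * checkpoint_every
    if mode == "AT_MOST_ONCE":
        resume_from = failed_window + 1    # skip whatever was in flight
    else:
        resume_from = last_checkpoint + 1  # replay everything after the checkpoint
    emitted.update(range(resume_from, last_window + 1))
    return emitted


at_least = emissions("AT_LEAST_ONCE", last_window=7, failed_window=6, checkpoint_every=4)
at_most = emissions("AT_MOST_ONCE", last_window=7, failed_window=6, checkpoint_every=4)
exactly = emissions("EXACTLY_ONCE", last_window=7, failed_window=6, checkpoint_every=4)
for name, counts in [("at least", at_least), ("at most", at_most), ("exactly", exactly)]:
    print(name, [counts[w] for w in range(1, 8)])
```

In this model at-least-once repeats a window that was emitted after the last checkpoint, at-most-once drops the window that was in flight, and exactly-once avoids both at the cost of checkpointing every window.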
Data locality
• Stream locality for placement of operators
– Rack local – Distributed deployment
– Node local – Data does not traverse NIC
– Container local – Data doesn’t need to be serialized
– Thread local – Operators run in same thread
Compute Locality
• Dynamic topology updates
– Properties of operators can be changed
– New operators
• Upcoming
– Update attributes
Dynamic Updates
© 2014 DataTorrent Confidential – Do Not Distribute
For more Info …
• Mailing List: [email protected]
• Apache Apex: http://apex.apache.org/
• Github
– Apex Core: http://github.com/apache/incubator-apex-core
– Apex Malhar: http://github.com/apache/incubator-apex-malhar
• DataTorrent: http://www.datatorrent.com