Post on 15-Apr-2017
Apache Apex as YARN Application
Chinmay Kolhatkar (chinmay@apache.org)
Mar 22, 2016
Apache Apex Meetup
Agenda
• Directed Acyclic Graph
• Apex as a YARN Application
• Application Components of Apex
• Lifecycle of Apex as a YARN Application
Directed Acyclic Graph (DAG)
• Defines compute stages of streaming application
• Defines tuple flow across Operators via Stream
[Figure: example DAG with four compute stages, Compute1 to Compute4]
DAG Components
• Tuple
  ● Atomic data that flows over a stream
• Operator
  ● Basic compute unit per tuple
• Stream
  ● Connector abstraction between operators
  ● Tuples flow over this
[Figure: Operator1 and Operator2 connected by a Stream carrying tuple 1, tuple 2 and tuple 3]
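The three DAG components can be illustrated with a minimal sketch. This is not the Apex API: the `Dag` class, its method names, and the two example operator functions are hypothetical, chosen only to show tuples flowing from one operator to the next over a stream.

```python
from collections import defaultdict


class Dag:
    """Toy DAG: operators are per-tuple functions, streams connect them."""

    def __init__(self):
        self.operators = {}               # operator name -> function(tuple)
        self.streams = defaultdict(list)  # upstream name -> downstream names

    def add_operator(self, name, fn):
        self.operators[name] = fn

    def add_stream(self, upstream, downstream):
        self.streams[upstream].append(downstream)

    def emit(self, name, tup):
        """Run one tuple through `name`, then pass the result downstream.

        Returns the results produced by operators with no outgoing stream.
        """
        result = self.operators[name](tup)
        downstream = self.streams.get(name, [])
        if not downstream:
            return [result]
        outputs = []
        for nxt in downstream:
            outputs.extend(self.emit(nxt, result))
        return outputs


dag = Dag()
dag.add_operator("Operator1", lambda t: t * 2)  # hypothetical compute step
dag.add_operator("Operator2", lambda t: t + 1)  # hypothetical compute step
dag.add_stream("Operator1", "Operator2")

# Three tuples enter Operator1 one at a time and flow over the stream.
results = [out for tup in (1, 2, 3) for out in dag.emit("Operator1", tup)]
print(results)  # [3, 5, 7]
```

Each tuple is processed independently, which is what "basic compute unit per tuple" means in practice.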
DAG Types
[Figure: Logical DAG with operators O1 to O5]
• Logical Plan
  ● Logical representation of computation
  ● Defines operators, streams and dataflow
• Physical Plan
  ● Deployable plan on cluster
  ● Contains partition information of operators
  ● Has ready-to-deploy serialized operator instances
[Figure: Physical DAG with partitions O1P1 to O1P3 and O2P1 to O2P3, a unifier (U), and operators O3, O4 and O5]
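The step from logical plan to physical plan can be sketched as a small expansion function. Everything here is an assumption made for illustration: the linear O1 to O5 topology, the one-to-one wiring between equally partitioned operators, and the `U(...)` naming of the unifier. The real platform decides partition wiring and unifier placement itself.

```python
def to_physical(logical_streams, partitions):
    """Expand a logical plan into physical operator instances and streams.

    logical_streams: list of (upstream, downstream) operator names.
    partitions: operator name -> partition count (missing means 1).
    """
    def instances(op):
        n = partitions.get(op, 1)
        return [op] if n == 1 else [f"{op}P{i}" for i in range(1, n + 1)]

    physical = []
    for up, down in logical_streams:
        ups, downs = instances(up), instances(down)
        if len(ups) == len(downs):
            # Same partition count: connect matching partitions one to one.
            physical.extend(zip(ups, downs))
        elif len(downs) == 1:
            # Several partitions feed one operator: merge through a unifier.
            unifier = f"U({up})"
            physical.extend((u, unifier) for u in ups)
            physical.append((unifier, down))
        else:
            # Fallback for this sketch: connect every pair of instances.
            physical.extend((u, d) for u in ups for d in downs)
    return physical


# Assumed logical topology; the slide does not spell out the edges.
logical = [("O1", "O2"), ("O2", "O3"), ("O3", "O4"), ("O4", "O5")]
plan = to_physical(logical, {"O1": 3, "O2": 3})
for edge in plan:
    print(edge)
```

The logical plan has four streams; with O1 and O2 each split three ways, the physical plan in this sketch has nine.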
Apex as YARN application
[Figure: a generic YARN application (YarnClient, AppMaster and YarnContainers running on NodeManager (NM) nodes, coordinated by the ResourceManager with AsM + Scheduler) shown alongside its Apex counterpart: DTCLI/StrAMClient as the YarnClient, StrAM as the AppMaster, and StrAMChild containers hosting operators O1, O2 and O3. Links are labeled ClientRMProtocol (client to ResourceManager), AMRMProtocol (AppMaster to ResourceManager) and ContainerManagerProtocol (AppMaster to NodeManager).]
Application Components of Apex - StrAMClient
• Part of dtcli client interface
• Invoked by “launch” command of dtcli
• Tasks:
  ● Copy the required application package files into HDFS
  ● Validate Logical Plan
  ● Serialize Logical Plan to HDFS
  ● Launch Application Master, i.e. StrAM
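One check that any logical plan validation needs is that the plan really is acyclic. The sketch below uses Kahn's topological sort for that; it is an illustration only, the function name and inputs are hypothetical, and the actual validation performed by StrAMClient is not described in these slides and may check other things.

```python
from collections import deque


def validate_acyclic(operators, streams):
    """Return operators in a valid processing order, or raise ValueError.

    operators: iterable of operator names.
    streams: list of (upstream, downstream) pairs.
    """
    operators = list(operators)
    indegree = {op: 0 for op in operators}
    downstream = {op: [] for op in operators}
    for up, down in streams:
        if up not in indegree or down not in indegree:
            raise ValueError(f"stream ({up}, {down}) references an unknown operator")
        downstream[up].append(down)
        indegree[down] += 1

    # Start from operators with no inputs and peel the graph layer by layer.
    ready = deque(op for op, deg in indegree.items() if deg == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in downstream[op]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # Operators left unvisited are part of (or downstream of) a cycle.
    if len(order) != len(indegree):
        raise ValueError("logical plan contains a cycle")
    return order


order = validate_acyclic(["O1", "O2", "O3"], [("O1", "O2"), ("O2", "O3")])
print(order)  # ['O1', 'O2', 'O3']
```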
Application Components of Apex - StrAM
• Streaming Application Master
• Started by StrAMClient on a YarnContainer
• Tasks:
  ● Convert logical plan to physical plan
  ● Serialize operators to HDFS
  ● Request resources from ResourceManager
  ● Start StrAMChild in YarnContainer(s)
  ● Monitor StrAMChild using ContainerManager protocol
  ● Generate application statistics
  ● Host results on WebService (dtManage)
  ● Fault tolerance
  ● Checkpointing/committing application states
  ● Support security
  ● Shut down application
Application Components of Apex - StrAMChild
• Deployed on YarnContainer
• Started by NodeManager as instructed by StrAM
• Instance of StreamingContainer
• Contains Operators (compute-related)
• Contains BufferServer (stream-related)
• Tasks:
  ● Regularly send heartbeat to StrAM
  ● Execute commands from StrAM
  ● Shut down or kill itself if instructed
  ● Manage lifecycle of an Operator
  ● Network communication using BufferServer
Lifecycle of Apex/YARN Application - Start
[Figure: DTCLI/StrAMClient (YarnClient), ResourceManager (AsM + Scheduler), HDFS, and three NodeManager (NM) nodes hosting StrAM (AppMaster) and StrAMChild YarnContainers with operators O1 to O4; links labeled ClientRMProtocol, AMRMProtocol and ContainerManagerProtocol]
1) Access cluster information
2) Copy files to HDFS
3) Submit application to RM
4) StrAM registers with RM
5) StrAM sends heartbeats regularly
6) StrAM requests containers with specifications
7) StrAMChild reads serialized operator from HDFS
8) StrAMChild starts operator lifecycle
Lifecycle of Apex/YARN Application - Running
[Figure: DTCLI/StrAMClient (YarnClient), ResourceManager (AsM + Scheduler), HDFS, and three NodeManager (NM) nodes hosting StrAM (AppMaster) and StrAMChild YarnContainers with operators O1 to O4; links labeled ClientRMProtocol, AMRMProtocol and ContainerManagerProtocol]
1) StrAMChild sends heartbeats
2) StrAMChild sends operator data
3) StrAM sends regular heartbeats to RM
4) Query status of application
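The heartbeat exchange in the running state can be sketched as a small bookkeeping class on the StrAM side. The class name, the timeout value, the container ids and the shape of the operator data are all hypothetical; the sketch only shows the idea of recording heartbeats and spotting containers that have gone silent.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat per container and flags the silent ones."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # container id -> time of last heartbeat
        self.stats = {}      # container id -> latest operator data

    def on_heartbeat(self, container_id, now, operator_stats=None):
        # A heartbeat may carry operator data along with the liveness signal.
        self.last_seen[container_id] = now
        if operator_stats is not None:
            self.stats[container_id] = operator_stats

    def silent_containers(self, now):
        # Containers whose last heartbeat is older than the timeout.
        return sorted(
            cid for cid, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=30)
monitor.on_heartbeat("container_1", now=0, operator_stats={"O1": {"tuples": 100}})
monitor.on_heartbeat("container_2", now=0)
monitor.on_heartbeat("container_1", now=25)

print(monitor.silent_containers(now=40))  # ['container_2']
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the sketch deterministic and easy to test.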
Lifecycle of Apex/YARN Application - Shutdown
[Figure: DTCLI/StrAMClient (YarnClient), ResourceManager (AsM + Scheduler), HDFS, and three NodeManager (NM) nodes hosting StrAM (AppMaster) and StrAMChild YarnContainers with operators O1 to O4; links labeled ClientRMProtocol, AMRMProtocol, ContainerManagerProtocol and REST API]
1) Connect to WebService (REST API)
2) Send command to StrAM
3) Send shutdown signal to StrAMChild
4) StrAMChild finishes operator lifecycle
5) Check if all containers are freed
6) StrAM unregisters itself
7) StrAM exits
8) Check if application has shut down
Lifecycle of Apex/YARN Application - Kill
[Figure: DTCLI/StrAMClient (YarnClient), ResourceManager (AsM + Scheduler), HDFS, and three NodeManager (NM) nodes hosting StrAM (AppMaster) and StrAMChild YarnContainers with operators O1 to O4; links labeled ClientRMProtocol, AMRMProtocol and ContainerManagerProtocol]
1) Send kill-app command to YARN
2) RM kills all containers
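The difference between the shutdown path and the kill path can be shown with a toy model. All class and function names below are hypothetical and nothing here talks to YARN; the point is only that a graceful shutdown lets each container finish its operator lifecycle and lets the application master unregister, while a kill stops everything from outside.

```python
class Container:
    def __init__(self, name):
        self.name = name
        self.running = True
        self.operators_finished = False

    def shutdown(self):
        # Graceful path: the operator lifecycle completes before exit.
        self.operators_finished = True
        self.running = False

    def kill(self):
        # Kill path: the process stops without finishing its operators.
        self.running = False


class AppMaster:
    def __init__(self, containers):
        self.containers = containers
        self.registered = True
        self.running = True

    def shutdown(self):
        for container in self.containers:
            container.shutdown()
        # Unregister and exit only once every container has been freed.
        if all(not c.running for c in self.containers):
            self.registered = False
            self.running = False


def kill_application(app_master):
    """Stand-in for the resource manager killing every container."""
    for container in app_master.containers:
        container.kill()
    app_master.running = False  # stopped without a chance to unregister


graceful = AppMaster([Container("c1"), Container("c2")])
graceful.shutdown()

killed = AppMaster([Container("c1"), Container("c2")])
kill_application(killed)

print(all(c.operators_finished for c in graceful.containers))  # True
print(any(c.operators_finished for c in killed.containers))    # False
```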
Summary – Apex platform
• Enables YARN to be used for Streaming Applications
• Takes care of YARN-specific work
• User can focus on business logic defined in Operators
Resources
• Apache Apex - http://apex.apache.org/
• Subscribe - http://apex.apache.org/community.html
• Download - https://www.datatorrent.com/download/
• Twitter
  o @ApacheApex; Follow - https://twitter.com/apacheapex
  o @DataTorrent; Follow - https://twitter.com/datatorrent
• Meetups - http://www.meetup.com/topics/apache-apex
• Webinars - https://www.datatorrent.com/webinars/
• Videos - https://www.youtube.com/user/DataTorrent
• Slides - http://www.slideshare.net/DataTorrent/presentations
• Startup Accelerator Program - Full featured enterprise product
  o https://www.datatorrent.com/product/startup-accelerator/
We Are Hiring
• jobs@datatorrent.com
• Developers/Architects
• QA Automation Developers
• Information Developers
• Build and Release
• Community Leaders