Apache Airflow
Sumit Maheshwari, Qubole
Bangalore Big Data Meetup @ LinkedIn, 27 Aug 2016
Agenda
● Workflows
● Problem statement
● Options
● Airflow
○ Anatomy
○ Sample DAG
○ Architecture
○ Demo
● Experiences
Workflows?
[Diagrams: a simple sequential workflow (A, B, C), followed by a larger branching workflow over tasks A to H]
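A workflow like the ones pictured can be modelled as a directed acyclic graph (DAG) of tasks. The sketch below is a minimal illustration using a topological sort to derive one valid run order. The task names and edges in `UPSTREAM` are hypothetical and are not taken from the slide's diagram.

```python
from collections import deque

# Hypothetical dependency map: task -> tasks that must finish first.
# The edges are illustrative; they are not read off the slide's diagram.
UPSTREAM = {
    "A": [],
    "B": ["A"],
    "C": ["A"],
    "D": ["B", "C"],
}


def run_order(upstream):
    """Return one valid execution order using Kahn's topological sort."""
    # Number of unfinished upstream tasks per task.
    remaining = {task: len(deps) for task, deps in upstream.items()}
    # Reverse map: task -> tasks that depend on it.
    downstream = {task: [] for task in upstream}
    for task, deps in upstream.items():
        for dep in deps:
            downstream[dep].append(task)

    ready = deque(sorted(t for t, n in remaining.items() if n == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in downstream[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any task left unscheduled means the graph has a cycle.
    if len(order) != len(upstream):
        raise ValueError("cycle detected: not a DAG")
    return order


print(run_order(UPSTREAM))
```

A scheduler does essentially this on every run: a task becomes eligible only once everything upstream of it has finished.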
Background
Qubole was looking for a complete workflow solution. We already have a simple
(sequential) workflow engine and a very stable scheduler in-house.
Options were:
1. Extend the in-house workflow into a full-fledged workflow engine
2. Oozie
3. Pinball
4. Luigi
5. Briefly
6. Airflow
In House
Pros:
● Full control
● Faster bug fixing
● Prioritised Qubole-related features
Cons:
● Ever-growing list of features
● Much longer dev & QA cycles
● Difficult to keep pace with the latest trends
Oozie
Pros:
● Used by thousands of companies
● Web APIs, Java APIs, CLI and HTML support
● Oldest among all
Cons:
● XML
● Significant effort to manage, with frequent OOM (out-of-memory) errors
● Difficult to customise
Pinball
Pros:
● Pythonic way of defining DAGs
● Extensible and horizontally scalable
● Pinterest is already using Pinball to submit commands to Qubole
Cons:
● Complex to understand
● “pip install” was broken
● Lack of community interest
Luigi
Pros:
● Pythonic way to write DAGs
● Pretty stable
● Huge community
● Built-in support for Hadoop
Cons:
● Workflows have to be scheduled externally
● Minimal UI
● State persistence via files
● No built-in monitoring or alerting
Briefly
Pros: Very small codebase to understand and modify. Built-in support for Qubole.
Cons: Too naive for production use.
Airflow
● Python code base
● Callable events
● Trigger rules
● XComs
● Cool UI & rich CLI
● Queues & pools
● Zombie cleanup
● Growing community
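Trigger rules decide when a task may start based on the states of its upstream tasks. The rule names below mirror the ones Airflow uses (`all_success`, `all_failed`, `all_done`, `one_success`, `one_failed`), but `should_run` itself is a simplified, hypothetical model that ignores skipped tasks. It is not Airflow's implementation.

```python
# Simplified model of trigger-rule evaluation.
# Rule names follow Airflow's; the function and state strings are assumptions.
def should_run(trigger_rule, upstream_states):
    """Return True if a task with this trigger rule may start now."""
    states = list(upstream_states)
    if trigger_rule == "all_success":
        return all(s == "success" for s in states)
    if trigger_rule == "all_failed":
        return all(s == "failed" for s in states)
    if trigger_rule == "all_done":
        # Every upstream task has finished, regardless of outcome.
        return all(s in ("success", "failed") for s in states)
    if trigger_rule == "one_success":
        # Fires as soon as any upstream task succeeds.
        return any(s == "success" for s in states)
    if trigger_rule == "one_failed":
        # Fires as soon as any upstream task fails.
        return any(s == "failed" for s in states)
    raise ValueError("unknown trigger rule: %s" % trigger_rule)


# A cleanup task can run once everything upstream has finished.
print(should_run("all_done", ["success", "failed"]))
# An alerting task can fire while other upstream tasks are still running.
print(should_run("one_failed", ["running", "failed"]))
```

In practice this is what lets a DAG carry cleanup or alerting tasks that run even when part of the pipeline fails.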
Anatomy
● The job definitions, in Python code.
● A rich CLI (command line interface) to test, run, backfill, describe and clear parts of your
DAGs.
● A web application, to explore your DAG definitions, their dependencies, progress, metadata
and logs.
● A metadata repository that Airflow uses to keep track of task and job statuses and other persistent
information.
● An array of workers, running the jobs' task instances in a distributed fashion.
● Scheduler processes, that fire up the task instances that are ready to run.
Sample DAG
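The DAG shown on the original slide is not part of this transcript. As a stand-in, here is a minimal sketch of a DAG definition file, assuming Airflow 1.x import paths (later releases moved `BashOperator` to a different module). The DAG id, task ids, dates and shell commands are all hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # Airflow 1.x path

# Defaults applied to every task in the DAG (values are illustrative).
default_args = {
    "owner": "airflow",
    "start_date": datetime(2016, 8, 1),
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

# One run per day; "sample_dag" is a made-up DAG id.
dag = DAG(
    "sample_dag",
    default_args=default_args,
    schedule_interval=timedelta(days=1),
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

# Dependencies: extract -> transform -> load
transform.set_upstream(extract)
load.set_upstream(transform)
```

A file like this is placed in Airflow's DAGs folder, where the scheduler picks it up. It is not meant to be run as a standalone script.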
Demo
Airflow: Some facts
Small code base of ~20k lines of Python code.
Born at Airbnb, open sourced in June 2015 and recently moved to the Apache Incubator.
Under active development, some numbers:
a. ~1.5yr old project, 3400 commits, 177 contributors, 20+ commits per week
b. Companies using Airflow: Airbnb, Agari, Lyft, WePay, Easytaxi, Qubole and many others
c. 1000+ closed PRs
Airflow: Architecture
Airflow comes with 4 built-in execution modes (executors):
● Sequential
● Local
● Celery
● Mesos
And it’s very easy to add your own execution mode as well
Sequential
● Default mode
● Minimum setup - works with SQLite as well
● Processes 1 task at a time
● Good for demo purposes only
Local Executor
● Spawned by scheduler processes
● Vertically scalable
● Production grade
● Doesn’t need a broker, etc.
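The difference between the sequential and local modes can be sketched in plain Python: one runs a task at a time, the other runs several at once on the same machine. This is only an illustration of the idea, not Airflow's executor code. The function names and the `parallelism` parameter are hypothetical, and threads stand in for the worker processes a real local executor would spawn.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_sequential(tasks):
    """Run one task at a time, like the sequential mode."""
    return [task() for task in tasks]


def run_local(tasks, parallelism=4):
    """Run tasks concurrently on one machine, bounded by `parallelism`."""
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = [pool.submit(task) for task in tasks]
        # Collect results in submission order.
        return [future.result() for future in futures]


def make_task(name, seconds=0.05):
    """Build a dummy task that sleeps, then returns its name."""
    def task():
        time.sleep(seconds)  # stand-in for real work
        return name
    return task


tasks = [make_task(name) for name in ("a", "b", "c", "d")]
print(run_sequential(tasks))
print(run_local(tasks))
```

Both calls return the same results. The local variant finishes sooner because the tasks overlap, which is why it scales with the size of the machine it runs on.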
Celery Executor
● Vertically and horizontally scalable
● Can be monitored (via Flower)
● Supports pools and queues
Experiences
Key aspects considered while productionizing Airflow at Qubole:
● Availability
● Reliability
● Security
● Usability
Thank You!
gitter - @msumit
PS: Qubole is hiring, ping me :)