Building (Better) Data Pipelines using Apache Airflow
Sid Anand (@r39132) QCon.AI 2018
About Me
Work [ed | s] @
Maintainer of
Spare time
Co-Chair for
Apache Airflow
What is it?
Apache Airflow : What is it?
In a nutshell:
Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs or Directed Acyclic Graphs)
Apache Airflow
UI Walk-Through
Apache Airflow : UI Walk-through
Airflow - Authoring DAGs
Airflow: Visualizing a DAG
Airflow: Author DAGs in Python! No need to bundle many XML files!
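This slide shows a screenshot of a DAG file. As a hedged sketch of what "authoring DAGs in Python" looks like (the DAG id, task names, and schedule below are invented, and operator import paths vary by Airflow version, so treat this as illustrative rather than the talk's actual code):

```python
# Hypothetical DAG file (Airflow 1.x-era imports; adjust for your version).
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="example_etl",                 # invented name
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
)

extract = BashOperator(task_id="extract", bash_command="echo extract", dag=dag)
transform = BashOperator(task_id="transform", bash_command="echo transform", dag=dag)
load = BashOperator(task_id="load", bash_command="echo load", dag=dag)

extract >> transform >> load              # dependencies declared in plain Python
```

The whole workflow, its schedule, and its dependency graph live in one ordinary Python module that the scheduler picks up, which is what makes this configuration-as-code rather than a pile of XML.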
Airflow: The Tree View offers a view of DAG Runs over time!
Airflow - Performance Insights
Airflow: Gantt charts reveal the slowest tasks for a run!
Airflow: …And we can easily see performance trends over time
Apache Airflow
Why use it?
Apache Airflow : Why use it?

When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines
  • Fraud Detection, Scoring/Ranking, Classification, Recommender Systems, etc.
• General Job Scheduling (e.g. Cron)
  • DB Back-ups, Scheduled code/config deployment
What should a Workflow Scheduler do well?
• Schedule a graph of dependencies
  • where Workflow = a DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs
  • E.g. Alerting if time or correctness SLAs are not met
• Easily scale for growing load
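To make the list above concrete, here is a toy, pure-Python sketch (not Airflow code) of the core contract a workflow scheduler must honor: run a DAG of tasks in dependency order, retry failed tasks, and report when retries are exhausted. The task names and retry policy are invented for illustration.

```python
# Toy workflow runner: dependency-ordered execution with retries (Python 3.9+).
from graphlib import TopologicalSorter

def run_dag(tasks, deps, retries=1):
    """tasks: {name: callable}; deps: {name: set of upstream names}.
    Returns task names in completion order; raises after exhausting retries."""
    order = []
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(retries + 1):
            try:
                tasks[name]()
                order.append(name)
                break
            except Exception as exc:
                if attempt == retries:      # out of retries: report / alert
                    raise RuntimeError(f"task {name} failed: {exc}") from exc
    return order

# Example: extract -> transform -> load, where transform fails once, then recovers.
calls = {"n": 0}
def flaky_transform():
    calls["n"] += 1
    if calls["n"] == 1:
        raise IOError("transient failure")

done = run_dag(
    tasks={"extract": lambda: None, "transform": flaky_transform, "load": lambda: None},
    deps={"transform": {"extract"}, "load": {"transform"}},
    retries=1,
)
print(done)  # ['extract', 'transform', 'load']
```

Airflow provides exactly this contract, plus the alerting, SLA enforcement, monitoring, and horizontal scaling that a toy like this omits.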
What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
Use-Case : Message Scoring (Batch Pipeline Architecture)

[Diagram: enterprise A / enterprise B / enterprise C → S3]
Enterprises A, B, and C upload to S3 every 15 minutes.
Airflow kicks off a Spark message-scoring job every hour.
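A hedged sketch of how such an hourly kickoff might be declared (invented names and paths; the `spark-submit` invocation and operator imports depend on your setup and Airflow version, so this is illustrative, not the talk's code):

```python
# Hypothetical hourly DAG that launches the Spark scoring job (Airflow 1.x-era API).
from datetime import datetime
from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG(
    dag_id="message_scoring",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@hourly",          # one run per hour
)

score = BashOperator(
    task_id="spark_score_messages",
    # Class name and jar path are placeholders for the real scoring job.
    bash_command="spark-submit --class com.example.ScoreMessages scoring.jar",
    dag=dag,
)
```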
The Spark job writes scored messages and stats to another S3 bucket.
This triggers SNS/SQS message events.
An Auto Scaling Group (ASG) of Importers spins up when it detects SQS messages.
The Importers rapidly ingest scored messages and aggregate statistics into the DB.
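As a toy illustration of the importer loop (pure Python; an in-memory `queue.Queue` stands in for SQS and a dict stands in for the DB; real importers would long-poll SQS, e.g. via boto3):

```python
# Toy importer: drain scored-message events and aggregate per-enterprise stats.
import queue

def ingest(q, db):
    """Drain events from q, summing scored-message counts per enterprise into db."""
    while True:
        try:
            msg = q.get_nowait()          # real importers would long-poll SQS
        except queue.Empty:
            return db
        db[msg["enterprise"]] = db.get(msg["enterprise"], 0) + msg["scored"]

q = queue.Queue()
for ent, n in [("A", 10), ("B", 5), ("A", 7)]:
    q.put({"enterprise": ent, "scored": n})

result = ingest(q, {})
print(result)  # {'A': 17, 'B': 5}
```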
Users receive alerts of untrusted emails and can review them in the web app.
[Diagram: enterprise A/B/C → S3 → Spark → S3 → SNS → SQS → Importers (ASG) → DB]
Airflow manages the entire process.
Airflow DAG
Apache Airflow
Incubating
Apache Airflow : Incubating
Timeline
• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin
• Max launched it @ Hadoop Summit in Summer 2015
• On 3/31/2016, Airflow entered the Apache Incubator

Today
• 2400+ Forks
• 7600+ GitHub Stars
• 430+ Contributors
• 150+ companies officially using it!
• 14 Committers/Maintainers <- We're growing here
Thank You!
Apache Airflow
Behind the Scenes
Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)
It ships with a
• DAG Scheduler
• Web application (UI)
• Powerful CLI
• Celery Workers!
Apache Airflow : Behind the Scenes
[Diagram: Webserver, Scheduler, Meta DB, Celery / RabbitMQ, and a pool of Workers]

1. A user schedules / manages DAGs using the Airflow UI
2. Airflow's webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks over Celery
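A hedged sketch of the configuration that wires these pieces together (key names are from the Airflow 1.x era and may differ in your version; hostnames and credentials are placeholders, so verify against your installed version's docs):

```ini
; airflow.cfg fragment for a Celery-backed deployment (illustrative values)
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@metadb:5432/airflow

[celery]
broker_url = amqp://guest:guest@rabbitmq:5672//
result_backend = db+postgresql://airflow:airflow@metadb:5432/airflow
```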
Thank You!