Building (Better) Data Pipelines using Apache Airflow

Sid Anand (@r39132) QCon.AI 2018

About Me

Work [ed | s] @

Maintainer of

Spare time

Co-Chair for

Apache Airflow

What is it?

Apache Airflow : What is it?

In a nutshell:

Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs or Directed Acyclic Graphs)

Apache Airflow

UI Walk-Through

Apache Airflow : UI Walk-through

Airflow - Authoring DAGs

Airflow: Visualizing a DAG

Airflow - Authoring DAGs

Airflow: Author DAGs in Python! No need to bundle many XML files!
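
To illustrate the idea of declaring a workflow in ordinary Python rather than in XML, here is a minimal sketch that uses only the standard library (`graphlib`, Python 3.9+). It is not Airflow's own API, and the task names (`extract`, `transform`, `load`, `report`) are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a task, each value is the set of
# tasks that must finish before that task can start.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"transform"},
}


def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Because the workflow is plain data and code, it can be versioned, reviewed, and generated programmatically, which is the point the slide makes about avoiding bundles of XML files.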

Airflow - Authoring DAGs

Airflow: The Tree View offers a view of DAG Runs over time!

Airflow - Performance Insights

Airflow: Gantt charts reveal the slowest tasks for a run!

Airflow - Performance Insights

Airflow: …And we can easily see performance trends over time

Apache Airflow

Why use it?

Apache Airflow : Why use it?

When would you use a Workflow Scheduler like Airflow?

• ETL Pipelines

• Machine Learning Pipelines

• Predictive Data Pipelines
  • Fraud Detection, Scoring/Ranking, Classification, Recommender System, etc…

• General Job Scheduling (e.g. Cron)
  • DB Back-ups, Scheduled code/config deployment

Apache Airflow : Why use it?

What should a Workflow Scheduler do well?

• Schedule a graph of dependencies
  • where Workflow = A DAG of Tasks

• Handle task failures

• Report / Alert on failures

• Monitor performance of tasks over time

• Enforce SLAs
  • E.g. Alerting if time or correctness SLAs are not met

• Easily scale for growing load
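
Two of the duties listed above, handling task failures and alerting on missed time SLAs, can be sketched in plain Python. This is an illustration only and not how Airflow implements them; `run_with_retries`, `max_retries`, `sla_seconds`, and the `flaky` task are hypothetical names.

```python
import time


def run_with_retries(task, max_retries=2, sla_seconds=1.0, alert=print):
    """Run task(); retry on failure and alert if the time SLA is missed.

    Returns (result, attempts). Re-raises the last error once the
    retries are exhausted.
    """
    start = time.monotonic()
    attempts = 0
    while True:
        attempts += 1
        try:
            result = task()
            break
        except Exception as exc:
            alert(f"attempt {attempts} failed: {exc}")
            if attempts > max_retries:
                raise
    elapsed = time.monotonic() - start
    if elapsed > sla_seconds:
        alert(f"SLA missed: took {elapsed:.2f}s, limit {sla_seconds}s")
    return result, attempts


# Hypothetical flaky task: fails on its first call, succeeds on the second.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient error")
    return "ok"


alerts = []
outcome = run_with_retries(flaky, alert=alerts.append)
print(outcome, alerts)
```

Passing the `alert` callable in as a parameter keeps the reporting channel (log, email, pager) separate from the retry logic.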

Apache Airflow : Why use it?

What Does Apache Airflow Add?

• Configuration-as-code

• Usability - Stunning UI / UX

• Centralized configuration

• Resource Pooling

• Extensibility

Use-Case : Message Scoring

Batch Pipeline Architecture

Diagram components: enterprise A, enterprise B, enterprise C, S3 (two buckets), SNS, SQS, Importers (ASG), DB

1. S3 uploads every 15 minutes

2. Airflow kicks off a Spark message scoring job every hour

3. Spark job writes scored messages and stats to another S3 bucket

4. This triggers SNS/SQS message events

5. An Autoscale Group (ASG) of Importers spins up when it detects SQS messages

6. The importers rapidly ingest scored messages and aggregate statistics into the DB

7. Users receive alerts of untrusted emails & can review them in the web app

Airflow manages the entire process
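
Since uploads arrive every 15 minutes while the scoring job runs hourly, each run plausibly covers four upload batches. Below is a minimal sketch of that windowing. The key layout `uploads/YYYY/MM/DD/HHMM/` and the function name `batch_prefixes` are assumptions for illustration; the slides do not show the real bucket structure.

```python
from datetime import datetime, timedelta

UPLOAD_INTERVAL = timedelta(minutes=15)  # upload cadence stated on the slides
JOB_INTERVAL = timedelta(hours=1)        # scoring cadence stated on the slides


def batch_prefixes(run_time, root="uploads"):
    """List the upload prefixes for the last full hour before run_time.

    The layout root/YYYY/MM/DD/HHMM/ is a hypothetical convention.
    """
    window_end = run_time.replace(minute=0, second=0, microsecond=0)
    window_start = window_end - JOB_INTERVAL
    prefixes = []
    t = window_start
    while t < window_end:
        prefixes.append(f"{root}/{t:%Y/%m/%d/%H%M}/")
        t += UPLOAD_INTERVAL
    return prefixes


prefixes = batch_prefixes(datetime(2018, 4, 10, 9, 0))
print(prefixes)
```

Deriving the window from the run time, rather than from "whatever is in the bucket right now", means a rerun of a past hour selects the same inputs.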

Airflow DAG

Apache Airflow

Incubating

Apache Airflow : Incubating

Timeline

• Airflow was created @ Airbnb in 2015 by Maxime Beauchemin

• Max launched it @ Hadoop Summit in Summer 2015

• On 3/31/2016, Airflow -> Apache Incubator

Today

• 2400+ Forks

• 7600+ GitHub Stars

• 430+ Contributors

• 150+ companies officially using it!

• 14 Committers/Maintainers <- We’re growing here

Thank You!

Apache Airflow

Behind the Scenes

Apache Airflow : Behind the Scenes

Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)

It ships with a

• DAG Scheduler

• Web application (UI)

• Powerful CLI

• Celery Workers!

Apache Airflow : Behind the Scenes

Diagram components: Webserver, Scheduler, Worker (×3), Meta DB, Celery / RabbitMQ

1. A user schedules / manages DAGs using the Airflow UI!

2. Airflow’s webserver stores scheduling metadata in the metadata DB

3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ

4. Airflow workers pick up Airflow tasks over Celery
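
The four steps above can be sketched with the standard library, using `queue.Queue` as a stand-in for Celery / RabbitMQ, a dict for the metadata DB, and threads for the workers. None of this is Airflow's or Celery's API; every name here is hypothetical, and task dependencies are ignored to keep the flow visible.

```python
import queue
import threading

meta_db = {}              # stand-in for the metadata DB
broker = queue.Queue()    # stand-in for Celery / RabbitMQ
results = {}
results_lock = threading.Lock()


def webserver_schedule(dag_id, tasks):
    """Steps 1-2: the UI records what should run in the metadata DB."""
    meta_db[dag_id] = {"tasks": tasks, "queued": False}


def scheduler_tick():
    """Step 3: pick up new schedules and push their tasks onto the broker."""
    for dag_id, entry in meta_db.items():
        if not entry["queued"]:
            for name, fn in entry["tasks"].items():
                broker.put((dag_id, name, fn))
            entry["queued"] = True


def worker():
    """Step 4: pull tasks off the broker until a None sentinel arrives."""
    while True:
        item = broker.get()
        if item is None:
            break
        dag_id, name, fn = item
        value = fn()
        with results_lock:
            results[(dag_id, name)] = value


webserver_schedule("demo", {"a": lambda: 1, "b": lambda: 2, "c": lambda: 3})
scheduler_tick()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()
for _ in workers:
    broker.put(None)      # one shutdown sentinel per worker
for w in workers:
    w.join()

print(results)
```

The scheduler and the workers never call each other directly; they only share the broker, which is what lets the worker pool grow or shrink independently of the scheduler.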

Thank You!
