Building Robust Pipelines with Airflow | Wrangle Conference 2017
![Page 1: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/1.jpg)
@erinshellman Wrangle Conf July 20th, 2017
Building Robust Pipelines with Airflow
![Page 2: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/2.jpg)
Zymology is the science of fermentation, and it is applied to make materials and molecules:
Beer
Insulin
Food additives
Plastics
![Page 3: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/3.jpg)
![Page 4: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/4.jpg)
Zymergen provides a platform for rapid improvement of microbial strains through genetic engineering.
![Page 5: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/5.jpg)
Robotic automation
Our experimentation is increasingly orchestrated with robotics and machine learning.
![Page 6: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/6.jpg)
Learning how to efficiently navigate the genome is the mission
of data science at Zymergen
![Page 7: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/7.jpg)
Blocker: process failure
Orchestrating complex experiments with robots is hard, and there are process failures. These failures often cause sporadic, extreme measurement values.
![Page 8: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/8.jpg)
Blocker: batch effects
We see temporal effects based on when experiments were performed
![Page 9: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/9.jpg)
Blocker: different interpretations of results
We’re building a platform that can support any microbe and any molecule.
Sometimes that results in a proliferation of solutions with disagreement on which is best.
![Page 10: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/10.jpg)
Processing pipeline
1. Identify process failures
2. Quantify and remove process-related bias
3. Identify strains that show improvement using consistent criteria
Clean model inputs
Outlier detection
Normalization
Hit detection
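The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Zymergen's implementation: the modified z-score rule, the per-batch median centering, the fixed improvement threshold, and all function, field, and strain names are hypothetical choices made for the example.

```python
from statistics import median

def flag_outliers(values, threshold=3.5):
    """Mark extreme values using the modified z-score (median and MAD)."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

def normalize_by_batch(records):
    """Subtract each batch's median so batches share a common baseline."""
    by_batch = {}
    for r in records:
        by_batch.setdefault(r["batch"], []).append(r["value"])
    centers = {b: median(vs) for b, vs in by_batch.items()}
    return [dict(r, value=r["value"] - centers[r["batch"]]) for r in records]

def detect_hits(records, min_improvement):
    """Return strains whose normalized value meets one shared criterion."""
    return [r["strain"] for r in records if r["value"] >= min_improvement]

def run_pipeline(records, min_improvement=1.0):
    """Outlier detection, then normalization, then hit detection."""
    flags = flag_outliers([r["value"] for r in records])
    clean = [r for r, bad in zip(records, flags) if not bad]
    return detect_hits(normalize_by_batch(clean), min_improvement)

# Made-up measurements; strain F mimics a sporadic process failure.
records = [
    {"strain": "A", "batch": 1, "value": 10.0},
    {"strain": "B", "batch": 1, "value": 10.5},
    {"strain": "C", "batch": 1, "value": 12.0},
    {"strain": "D", "batch": 2, "value": 11.0},
    {"strain": "E", "batch": 2, "value": 11.2},
    {"strain": "F", "batch": 2, "value": 95.0},
]
hits = run_pipeline(records)
```

Keeping each stage as a separate function mirrors the slide: every stage produces clean inputs for the next one, and the hit criterion is defined in exactly one place.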
![Page 11: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/11.jpg)
Rolling our own ETL pipeline
There are many ways to measure the concentration of a molecule.
Any microbe, any molecule… any experiment, many data formats.
![Page 12: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/12.jpg)
Describing complex processing dependencies is hard.
Rolling our own ETL pipeline
![Page 13: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/13.jpg)
Airflow
https://airflow.incubator.apache.org/
“Airflow is a platform to programmatically author, schedule and monitor workflows.”
Airflow gives us the flexibility to apply a common set of processing steps to variable data inputs and to schedule complex processing workflows. It has also become a delivery mechanism for our products.
![Page 14: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/14.jpg)
Structure and Flexibility
![Page 15: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/15.jpg)
e.g. Normalization
Airflow workflows are described as directed acyclic graphs (DAGs).
Each task node in the DAG is an operator.
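To make the DAG idea concrete, here is a minimal sketch that uses only the Python standard library (`graphlib`, Python 3.9+) rather than Airflow itself. The task names are hypothetical; the point is only that declaring what each task depends on is enough to derive a valid execution order, and that a cycle is rejected.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
dependencies = {
    "load_measurements": set(),
    "detect_outliers": {"load_measurements"},
    "normalize": {"detect_outliers"},
    "detect_hits": {"normalize"},
    "publish_report": {"normalize", "detect_hits"},
}

def execution_order(deps):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())

def is_acyclic(deps):
    """A workflow is a valid DAG only if no task depends on itself, even indirectly."""
    try:
        execution_order(deps)
        return True
    except CycleError:
        return False

order = execution_order(dependencies)
```

A scheduler such as Airflow does considerably more (retries, scheduling, distributed execution), but the dependency graph is the part the workflow author writes down.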
![Page 16: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/16.jpg)
The anatomy of a DAG
Custom operators
Ordering
Instantiate DAG
![Page 17: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/17.jpg)
Modularity and flexibility
![Page 18: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/18.jpg)
Airflow + PyStan
With Bayesian hierarchical models we estimate (and monitor) the distribution of batch effects.
Experimental bias
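As a rough illustration of what a hierarchical model does with batch effects, the sketch below computes partially pooled batch effects in plain Python. It is a deliberately simplified stand-in, not the PyStan model from the talk: it assumes a normal-normal model in which the within-batch standard deviation `sigma` and the between-batch standard deviation `tau` are known, and it plugs in the grand mean for the overall level. A full Bayesian model would estimate all of these jointly. The function name and the data are hypothetical.

```python
from statistics import mean

def shrunken_batch_effects(batches, sigma, tau):
    """Posterior-mean batch effects under a normal-normal model.

    batches: dict mapping a batch id to repeated measurements of a control.
    sigma:   within-batch measurement standard deviation (assumed known).
    tau:     standard deviation of batch effects (assumed known).
    """
    grand_mean = mean(v for values in batches.values() for v in values)
    effects = {}
    for batch, values in batches.items():
        n = len(values)
        raw_effect = mean(values) - grand_mean
        # Weight on the data; the remainder pulls the estimate toward zero.
        weight = (n / sigma**2) / (n / sigma**2 + 1 / tau**2)
        effects[batch] = weight * raw_effect
    return effects

# Made-up control measurements from two days of experiments.
batches = {
    "monday": [10.0, 10.2, 9.8],
    "tuesday": [12.0, 12.2, 11.8],
}
effects = shrunken_batch_effects(batches, sigma=1.0, tau=1.0)
```

With three measurements per batch and `sigma == tau`, each raw offset of 1.0 is shrunk to about 0.75. Batches with fewer measurements are shrunk more strongly toward zero.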
![Page 19: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/19.jpg)
DropBox
• Scientists at Zymergen work with data using many different tools including JMP, SQL, and Excel.
• We use a custom DropBox hook to make quick data ingestion pipelines.
![Page 20: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/20.jpg)
Alerting / Communication
![Page 21: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/21.jpg)
3rd-party hooks & operators
![Page 22: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/22.jpg)
Operator
![Page 23: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/23.jpg)
Pairs well with Superset!
“Apache Superset is a modern, enterprise-ready business intelligence web application”
https://github.com/apache/incubator-superset
![Page 24: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/24.jpg)
Constructing machine learning workflows
![Page 25: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/25.jpg)
Fairflow: Functional Airflow
• The core of Fairflow is an abstract base class foperator that takes care of instantiating your Airflow operators and setting their dependencies.
• In Fairflow, DAGs are constructed from foperators that create the upstream operators when the final foperator is called.
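The pattern described above can be sketched without Airflow installed. Everything in this example is a hypothetical stand-in and not Fairflow's actual API: `Task` and `Dag` mimic an Airflow operator and DAG, and `FOperator`, `create_task`, and `NamedStep` are names invented for illustration. The sketch shows only the core idea: calling the final foperator on a DAG recursively creates its upstream tasks and wires the dependencies.

```python
from abc import ABC, abstractmethod

class Task:
    """Stand-in for an Airflow operator: a name plus its upstream tasks."""
    def __init__(self, name):
        self.name = name
        self.upstream = []

class Dag:
    """Stand-in for an Airflow DAG that records tasks as they are created."""
    def __init__(self, dag_id):
        self.dag_id = dag_id
        self.tasks = []
        self.built = {}  # foperator -> task, so shared steps are built once

class FOperator(ABC):
    """Hypothetical functional wrapper that builds its task only when called."""
    def __init__(self, *upstream):
        self.upstream = upstream

    @abstractmethod
    def create_task(self, dag):
        """Create and return the concrete task for this step."""

    def __call__(self, dag):
        if self in dag.built:
            return dag.built[self]
        # Build everything upstream first, then this step.
        upstream_tasks = [foperator(dag) for foperator in self.upstream]
        task = self.create_task(dag)
        task.upstream.extend(upstream_tasks)
        dag.tasks.append(task)
        dag.built[self] = task
        return task

class NamedStep(FOperator):
    """Smallest possible concrete foperator for the example."""
    def __init__(self, name, *upstream):
        super().__init__(*upstream)
        self.name = name

    def create_task(self, dag):
        return Task(self.name)

load = NamedStep("load_data")
fit_a = NamedStep("fit_model_a", load)
fit_b = NamedStep("fit_model_b", load)
compare = NamedStep("compare_models", fit_a, fit_b)

dag = Dag("model_comparison")
final_task = compare(dag)  # one call creates all four tasks
```

Because workflows are composed as ordinary objects before any DAG exists, the same `load` step can feed several models, and the cache in `Dag.built` ensures it becomes a single task.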
![Page 26: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/26.jpg)
Configuring complex ML workflows… functionally
![Page 27: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/27.jpg)
Defining ML workflows
In the DAG definition, create an instance of the task.
Then instantiate a DAG as usual and call the compare task on the DAG.
![Page 28: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/28.jpg)
Defining ML workflows
The design allows for simple creation of complicated experimental workflows with arbitrary sets of models, parameters, and evaluation metrics.
![Page 29: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/29.jpg)
Is Airflow for you?
Do you have heterogeneous data sources?
Do you have complex dependencies between processing tasks?
Do you have data with different velocities?
Do you have constraints on your time?
Probably!
![Page 30: Building Robust Pipelines with Airflow | Wrangle Conference 2017](https://reader031.fdocuments.in/reader031/viewer/2022021923/5aaaf09b7f8b9a586f8b4c13/html5/thumbnails/30.jpg)
Thanks team!