Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to...

18
Managing Scalable Data Workflows on HPC Clusters Erfan Nourbakhsh (UC Davis) Astroinformatics 2019 June 26 th , Caltech

Transcript of Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to...

Page 1: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

ManagingScalableDataWorkflowsonHPCClusters

Erfan Nourbakhsh (UCDavis)

Astroinformatics 2019

June26th,Caltech

Page 2: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

Outline

• Time-basedjobschedulers• Smartschedulingandscalability• Workflowmanagementsystemsandbigdatapipelines• IntroducingApacheAirflow• Airflowcomponents• UIwalk-through• AirflowandHPC

Page 3: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

Time-basedjobschedulers

Page 4: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

Smartschedulingandscalability

• SmartScheduling• SchedulingtaskexecutionminimalcriterionforaWMS• Wewanttaskexecutiontobemore“data-aware”

• Scalability• Notonlyaddressourcurrentdataprocessingneeds,butalsoscalewithourfuturegrowth• Thisshouldbewithoutrequiringsubstantialengineeringtimeandinfrastructurechanges

Page 5: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

Workflowmanagementsystemsandbigdatapipelines

Page 6: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

Workflowmanagementsystemsandbigdatapipelines

Pinball

Page 7: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

ApacheAirflow

• Airflowisaplatformtoprogrammaticallyauthor,scheduleandmonitorworkflows.

• DevelopedatAirbnb in2014• JoinedtheApacheSoftwareFoundation’sincubationprogramin2016

• Ithasbeenbuilttoscale• Pythonscript(configurationascode)• ActivedevelopmentonGitHub (835contributors,6,534commits)• RichwebUI• Officiallyusedbyabout300companies(Tesla,Twitter,Yahoo!,PayPal,UnitedAirlinesetc.)

Page 8: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

AirflowCom

pone

nts

Page 9: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

AirflowCom

pone

nts

Page 10: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin
Page 11: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

UIWalk-through

Page 12: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

UIWalk-through

Page 13: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

UIWalk-through

Page 14: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

UIWalk-through

Page 15: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

AirflowandHPC

• ApacheSparkIntegration• MPIwithSlurm Integration(notnativelysupported)

Page 16: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

Slurm Output

Yourjoboutput...

Page 17: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

TryAirflowandapplyittoyourwork

Thankyou!

Conclusion

Page 18: Managing Scalable Data Workflows on HPC Clusters · Apache Airflow • Airflow is a platform to programmatically author, schedule and monitor workflows. • Developed at Airbnbin

Sourceshttps://github.com/apache/airflow

https://www.alooma.com/answers/what-is-apache-airflow

https://sonra.io/2018/01/01/using-apache-airflow-to-build-a-data-pipeline-on-aws/

https://www.slideshare.net/mesodiar/intro-to-airflow-good-bye-cron-welcome-scheduled-workflow-management

https://qcon.ai/system/files/presentation-slides/sid_anand_qcon_ai_2018_v2.pdf

https://dabble-of-devops.com/learn-airflow-by-example-part-1-introduction/

https://towardsdatascience.com/why-quizlet-chose-apache-airflow-for-executing-data-workflows-3f97d40e9571