Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce. Aaron Knight, Full Stack Engineer at Voxy. October 20, 2016.

Transcript of Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Page 1: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Aaron Knight
Full Stack Engineer at Voxy

October 20, 2016

Page 2: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

What do you mean by “Engineering”?

Page 3: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

stability
monitoring
debuggability
maintainability

Page 4: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Background

Page 5: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce
Page 6: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce
Page 7: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

PageView

UserProfile

TutoringReservation

S3

Page 8: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce
Page 9: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Amazon Redshift

Page 10: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce
Page 11: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Part 1: Luigi

Page 12: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce
Page 13: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

“Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more.”

Page 14: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

How to make pasta

1. Heat a pan on medium heat.
2. Add oil to the pan.
3. Add onion and bell peppers and sauté until softened.
4. Add ground beef and cook until meat is well done.
5. Add the tomato sauce, salt, pepper and garlic powder.
6. Meanwhile, bring a pot of water to a rolling boil.
7. Cook noodles as directed.
8. Mix the sauce and noodles.

Page 15: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

How to make pasta with Luigi

DrainWater()
SimmerSauce()
MixSauceAndNoodles()
AddPastaToWater()
BrownMeat()
SauteePeppersAndOnions()
HeatOil()
ChopOnions()
ChopPeppers()
BringWaterToBoil()
BoilPasta()

Page 16: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

DrainWater()
MakeSauce()
MixSauceAndNoodles()
AddPastaToWater()
BrownMeat()
SauteePeppersAndOnions()
HeatOil()
ChopOnions()
ChopPeppers()
BringWaterToBoil()
BoilPasta()
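The task names on this slide form a dependency graph, but the arrows between them did not survive in the transcript. The sketch below assumes one plausible set of edges (labeled as guesses in the code) and shows how an execution order falls out of such a graph. It is plain Python for illustration, not Luigi's scheduler.

```python
# Hypothetical edges: the slide's arrows were lost, so these are assumptions.
REQUIRES = {
    "MixSauceAndNoodles": ["DrainWater", "MakeSauce"],
    "DrainWater": ["BoilPasta"],
    "BoilPasta": ["AddPastaToWater"],
    "AddPastaToWater": ["BringWaterToBoil"],
    "BringWaterToBoil": [],
    "MakeSauce": ["BrownMeat"],
    "BrownMeat": ["SauteePeppersAndOnions"],
    "SauteePeppersAndOnions": ["HeatOil", "ChopOnions", "ChopPeppers"],
    "HeatOil": [],
    "ChopOnions": [],
    "ChopPeppers": [],
}


def execution_order(task, requires, done=None):
    """Return tasks ordered so that each one comes after its requirements."""
    if done is None:
        done = []
    if task in done:
        return done
    for dep in requires[task]:
        execution_order(dep, requires, done)
    done.append(task)
    return done


order = execution_order("MixSauceAndNoodles", REQUIRES)
print(order)
```

Asking only for the final task is enough: everything it depends on is discovered by walking the graph, which is the idea the pasta example is illustrating.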

Page 17: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

The Luigi Magic

● output
● requires
● run

Page 18: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

some_task.py

import luigi


class SomeTask(luigi.Task):
    def output(self):
        # An indication that the task is done
        pass

    def requires(self):
        # The task(s) that must be complete before this one runs
        return SomeOtherTask()

    def run(self):
        # Your code
        pass


if __name__ == '__main__':
    luigi.run()
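To make the contract between output, requires, and run concrete, here is a toy stand-in written with only the standard library. MiniTask and build are hypothetical names invented for this sketch, and the logic is a simplification, not Luigi's actual implementation. It assumes the output is a file whose existence marks the task as done.

```python
import os
import tempfile


class MiniTask:
    """Toy stand-in for a Luigi task (hypothetical, for illustration only)."""

    def __init__(self, workdir):
        self.workdir = workdir

    def output(self):
        # A file path whose existence indicates that the task is done.
        return os.path.join(self.workdir, type(self).__name__ + ".txt")

    def requires(self):
        return []

    def run(self):
        raise NotImplementedError

    def complete(self):
        return os.path.exists(self.output())


class SomeOtherTask(MiniTask):
    def run(self):
        with open(self.output(), "w") as f:
            f.write("upstream data\n")


class SomeTask(MiniTask):
    def requires(self):
        return [SomeOtherTask(self.workdir)]

    def run(self):
        # Read the upstream output and write our own.
        with open(self.requires()[0].output()) as f:
            upstream = f.read().strip()
        with open(self.output(), "w") as f:
            f.write("processed " + upstream + "\n")


def build(task, log):
    """Run requirements first; skip any task whose output already exists."""
    if task.complete():
        log.append("skip " + type(task).__name__)
        return
    for dep in task.requires():
        build(dep, log)
    task.run()
    log.append("run " + type(task).__name__)


workdir = tempfile.mkdtemp()
first, second = [], []
build(SomeTask(workdir), first)
build(SomeTask(workdir), second)
print(first)
print(second)
```

The second build does nothing because the output already exists. That is why a failed pipeline can be rerun and will pick up from the tasks that have not yet produced their output.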

Page 19: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

run_my_luigi_task.sh

#!/bin/sh

luigid --background --port=8082

python some_task.py SomeTask

Page 20: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Example

https://github.com/phrasemix/luigi-hello-world

Page 21: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Configuration

./client.cfg

[core]
email-sender: [email protected]: [email protected], [email protected]
smtp_host: email-smtp.us-east-1.amazonaws.com
...

[database]
user: <database user>
password: <database password>
...

[hadoop]
version: cdh4
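The configuration file uses INI-style sections with "key: value" pairs, a layout that Python's standard configparser module can read. The sketch below parses a small sample to show how sections and keys map to lookups. The database values are placeholders invented for this example, and the sketch does not go through Luigi's own configuration loader.

```python
import configparser

# Sample config modeled on the slide; user and password are placeholders.
SAMPLE_CFG = """
[core]
smtp_host: email-smtp.us-east-1.amazonaws.com

[database]
user: example_user
password: example_password

[hadoop]
version: cdh4
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE_CFG)

# Values are looked up by section, then key.
hadoop_version = parser.get("hadoop", "version")
db_user = parser.get("database", "user")
smtp_host = parser.get("core", "smtp_host")
print(hadoop_version, db_user, smtp_host)
```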

Page 22: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

run_staging_pipeline.sh

#!/bin/sh

export LUIGI_CONFIG_PATH=~/config/staging.cfg
luigid --background --port=8082
python luigi_tasks.py AllTasks

Page 23: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

page_views.json

[
  {
    "user_id": "1234",
    "ts": "1476652039.12",
    "url": "/example.html"
  },
  …
]

user_profiles.json

[
  {
    "user_id": "1234",
    "email": "[email protected]",
    "membership_start": "1476652039.12",
    "membership_end": ""
  },
  …
]

Page 24: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Table "public.user_page_views"
      Column      |            Type
------------------+-----------------------------
 url              | character varying(16384)
 view_ts          | timestamp without time zone
 user_id          | character varying(16384)
 email            | character varying(16384)
 membership_start | timestamp without time zone
 is_current       | integer

Table "public.recommended_pages"
 Column |           Type
--------+--------------------------
 url    | character varying(16384)
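The user_page_views table looks like a join of the two JSON inputs on user_id. The sketch below shows one way that join could work in plain Python. Two things are assumptions: that is_current is 1 when membership_end is empty, and that the ts values are Unix timestamps in seconds. The email address is a placeholder.

```python
import json
from datetime import datetime, timezone

page_views = json.loads(
    '[{"user_id": "1234", "ts": "1476652039.12", "url": "/example.html"}]'
)
user_profiles = json.loads(
    '[{"user_id": "1234", "email": "user@example.com",'
    ' "membership_start": "1476652039.12", "membership_end": ""}]'
)


def to_utc(ts):
    """Convert a Unix timestamp string (assumed seconds) to a UTC datetime."""
    return datetime.fromtimestamp(float(ts), tz=timezone.utc)


def join_page_views(views, profiles):
    """Build user_page_views rows by joining views to profiles on user_id."""
    by_user = {p["user_id"]: p for p in profiles}
    rows = []
    for view in views:
        profile = by_user.get(view["user_id"])
        if profile is None:
            continue  # no matching profile: skip the view
        rows.append({
            "url": view["url"],
            "view_ts": to_utc(view["ts"]),
            "user_id": view["user_id"],
            "email": profile["email"],
            "membership_start": to_utc(profile["membership_start"]),
            # Assumption: an empty membership_end means a current member.
            "is_current": 1 if profile["membership_end"] == "" else 0,
        })
    return rows


rows = join_page_views(page_views, user_profiles)
print(rows)
```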

Page 25: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

AllRedshiftTables(luigi.WrapperTask)

UserPageViewsToRedshift

RecommendedPagesToRedshift

Page 26: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

AllRedshiftTables(luigi.WrapperTask)

UserPageViewsToRedshift RecommendedPagesToRedshift

UserPageViewsTsvToS3

UserPageViews

IngestUserProfiles IngestPageViews

Page 27: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

AllRedshiftTables(luigi.WrapperTask)

UserPageViewsToRedshift RecommendedPagesToRedshift

UserPageViewsTsvToS3

UserPageViews

IngestUserProfiles IngestPageViews

RecommendedPagesTsvToS3

RecommendedPages

Page 28: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Luigi Considerations

● Include a unique identifier for your pipeline
● Be careful with Luigi task inheritance
● Make use of contrib modules:
  ○ Hadoop
  ○ Spark
  ○ Redshift
  ○ Elasticsearch
● Consider errors
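One reading of the "unique identifier" advice is that every output location should include an identifier for the pipeline run, so that a new run does not mistake an old run's output for its own. The sketch below illustrates that reading. The function name, bucket name, and path layout are all hypothetical.

```python
from datetime import date


def output_path(task_name, pipeline_id, bucket="example-bucket"):
    """Build an output location that is unique per pipeline run and task."""
    return "s3://{}/{}/{}.tsv".format(bucket, pipeline_id, task_name)


# Assumption: the run date serves as the pipeline identifier.
monday = date(2016, 10, 17).isoformat()
tuesday = date(2016, 10, 18).isoformat()

print(output_path("UserPageViews", monday))
print(output_path("UserPageViews", tuesday))
```

In a Luigi task the identifier would typically be passed as a task parameter and used inside output(), so that each run has its own completion markers.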

Page 29: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Part 2: Running on AWS

Page 30: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce
Page 31: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Why use EMR?

- Pre-installed services
  - Hadoop
  - Pig
  - Spark
  - Zookeeper
- Easy scaling
- “Cleanliness”

Page 32: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce
Page 33: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce
Page 34: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Data Pipeline

Master

Core Core Core

EC2

EMR Cluster

EC2 Instance

Page 35: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Master

luigid

- UserPageViewsToRedshift
- UserPageViewsToS3
- UserPageViews**
- IngestUserProfiles**
- IngestPageViews**
- UserProfilesToS3

Core Core Core

EMR Cluster

Page 36: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

luigid

- RecommendedPagesToRedshift
- RecommendedPagesTsvToS3
- RecommendedPages
- DownloadUserProfiles

EC2 Instance

Page 37: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce
Page 38: Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

EMR Considerations

- Monitor your pipeline and cluster as well
- Make use of Spot instances to save $$$
- Data pipeline creation can be automated too
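As a rough illustration of the Spot and automation points, the sketch below builds a cluster request with one on-demand master and three Spot core nodes, matching the Master and Core layout in the earlier diagram. The talk's diagrams show AWS Data Pipeline creating the cluster. This sketch instead uses the request shape accepted by boto3's EMR run_job_flow call, which is a substitution made for this example. The instance type, release label, bid price, and cluster name are placeholder assumptions, and nothing here contacts AWS.

```python
def emr_cluster_request(name, core_count=3, core_bid_price="0.10"):
    """Build keyword arguments for an EMR cluster with Spot core nodes.

    All concrete values below are placeholders chosen for illustration.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.0.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": "m3.xlarge",
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "Market": "SPOT",  # bid for spare capacity to cut cost
                    "BidPrice": core_bid_price,
                    "InstanceType": "m3.xlarge",
                    "InstanceCount": core_count,
                },
            ],
            # Shut the cluster down once its steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = emr_cluster_request("example-pipeline-cluster")
groups = {g["InstanceRole"]: g for g in request["Instances"]["InstanceGroups"]}
print(groups["CORE"]["Market"], groups["CORE"]["InstanceCount"])
```

With boto3 installed and credentials configured, the dictionary would be passed as boto3.client("emr").run_job_flow(**request). Keeping the request builder separate from the call makes it easy to inspect or test without launching anything.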