Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce


Engineering a robust(ish) data pipeline with Luigi and AWS Elastic Map Reduce

Aaron Knight, Full Stack Engineer at Voxy

October 20, 2016

What do you mean by “Engineering”?

- stability
- monitoring
- debuggability
- maintainability

Background

[Diagram: PageView, UserProfile, and TutoringReservation event data land in S3 and are loaded into Amazon Redshift]

Part 1: Luigi

"Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more."

How to make pasta

1. Heat a pan on medium heat.

2. Add oil to the pan.

3. Add onion and bell peppers and sauté until softened.

4. Add ground beef and cook until meat is well done.

5. Add the tomato sauce, salt, pepper, and garlic powder.

6. Meanwhile, bring a pot of water to a rolling boil.

7. Cook noodles as directed.

8. Mix the sauce and noodles.

How to make pasta with Luigi

[Diagram, built up over two slides: the recipe redrawn as a dependency graph of tasks. ChopOnions(), ChopPeppers(), and HeatOil() feed SauteePeppersAndOnions(), BrownMeat(), and the sauce task (SimmerSauce() / MakeSauce()); BringWaterToBoil() feeds AddPastaToWater(), BoilPasta(), and DrainWater(); the two branches meet at MixSauceAndNoodles().]

The Luigi Magic
● output
● requires
● run

import luigi

class SomeTask(luigi.Task):
    def output(self):
        # An indication that the task is done
        pass

    def requires(self):
        return SomeOtherTask()

    def run(self):
        # Your code
        pass

if __name__ == '__main__':
    luigi.run()

some_task.py
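To make the skeleton concrete, here is a minimal sketch (not from the talk) of two of the pasta tasks. The file paths and marker-file contents are illustrative assumptions; the pattern of output/requires/run is the same as above.

import luigi

class BringWaterToBoil(luigi.Task):
    def output(self):
        # Marker file whose existence tells Luigi this step is done
        return luigi.LocalTarget('/tmp/water_boiling.marker')

    def run(self):
        with self.output().open('w') as out:
            out.write('water is boiling\n')

class AddPastaToWater(luigi.Task):
    def requires(self):
        # Luigi runs BringWaterToBoil first, or skips it if its output already exists
        return BringWaterToBoil()

    def output(self):
        return luigi.LocalTarget('/tmp/pasta_in_water.marker')

    def run(self):
        with self.output().open('w') as out:
            out.write('pasta added\n')

if __name__ == '__main__':
    luigi.run()

Running "python pasta_tasks.py AddPastaToWater --local-scheduler" (module name assumed) would execute both tasks in dependency order.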

run_my_luigi_task.sh

#!/bin/sh

luigid --background --port=8082

python some_task.py SomeTask

Example: https://github.com/phrasemix/luigi-hello-world

Configuration

[core]
email-sender: luigi-errors@example.com
error-email: person1@example.com, person2@example.com
smtp_host: email-smtp.us-east-1.amazonaws.com
...

[database]
user: <database user>
password: <database password>
...

[hadoop]
version: cdh4

./client.cfg
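As a rough illustration (assuming the [database] section above), a task can read values from whichever config file is active roughly like this; the task name and what you do with the credentials are assumptions for the sketch.

import luigi
from luigi import configuration

class LoadSomethingIntoDatabase(luigi.Task):
    def run(self):
        # Reads from the file LUIGI_CONFIG_PATH points at (or ./client.cfg)
        config = configuration.get_config()
        db_user = config.get('database', 'user')
        db_password = config.get('database', 'password')
        # ... open a connection with db_user / db_password and load data ...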

run_staging_pipeline.sh

#!/bin/sh

export LUIGI_CONFIG_PATH=~/config/staging.cfg
luigid --background --port=8082
python luigi_tasks.py AllTasks

page_views.json

[
  {
    "user_id": "1234",
    "ts": "1476652039.12",
    "url": "/example.html"
  },
  ...
]

user_profiles.json

[
  {
    "user_id": "1234",
    "email": "user@example.com",
    "membership_start": "1476652039.12",
    "membership_end": ""
  },
  ...
]

Table "public.user_page_views" Column | Type -----------------+----------------------------- url | character varying(16384) | view_ts | timestamp without time zone | user_id | character varying(16384) | email | character varying(16384) | membership_start| timestamp without time zone | is_current | integer |

Table "public.recommended_pages" Column | Type -----------------+----------------------------- url | character varying(16384) |

[Diagram, built up across three slides: AllRedshiftTables (a luigi.WrapperTask) requires UserPageViewsToRedshift and RecommendedPagesToRedshift. UserPageViewsToRedshift depends on UserPageViewsTsvToS3, which depends on UserPageViews, which depends on IngestUserProfiles and IngestPageViews; RecommendedPagesToRedshift depends on RecommendedPagesTsvToS3, which depends on RecommendedPages.]
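A sketch of how the top of that graph might look in code; the date parameter and the empty task bodies are illustrative assumptions, not the production implementation.

import luigi

class UserPageViewsToRedshift(luigi.Task):
    date = luigi.DateParameter()
    # ... requires() / output() / run() as in the graph above ...

class RecommendedPagesToRedshift(luigi.Task):
    date = luigi.DateParameter()
    # ...

class AllRedshiftTables(luigi.WrapperTask):
    # A WrapperTask does no work of its own; it is complete
    # once everything it requires is complete.
    date = luigi.DateParameter()

    def requires(self):
        yield UserPageViewsToRedshift(date=self.date)
        yield RecommendedPagesToRedshift(date=self.date)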

Luigi Considerations
● Include a unique identifier for your pipeline (e.g. the date parameter in the sketch above)
● Be careful with Luigi task inheritance
● Make use of contrib modules:
  ○ Hadoop
  ○ Spark
  ○ Redshift
  ○ Elasticsearch
● Consider errors (see the sketch after this list)
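For the last point, Luigi's event hooks are one way to notice failures beyond the built-in error emails; a minimal sketch, where what you do inside the handler (paging, metrics, logging) is up to you.

import luigi

@luigi.Task.event_handler(luigi.Event.FAILURE)
def on_task_failure(task, exception):
    # Called whenever any task's run() raises; a good place to page,
    # push a metric, or tag the pipeline run as failed.
    print('Task {} failed: {}'.format(task, exception))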

Part 2: Running on AWS

Why use EMR?
- Pre-installed services
  - Hadoop
  - Pig
  - Spark
  - Zookeeper
- Easy scaling
- "Cleanliness"

Data Pipeline

[Diagram, built up over several slides: an EC2 instance running luigid alongside an EMR cluster made up of one Master node and three Core nodes. One slide shows luigid scheduling UserPageViewsToRedshift, UserPageViewsToS3, UserPageViews**, IngestUserProfiles**, IngestPageViews**, and UserProfilesToS3; another shows it scheduling RecommendedPagesToRedshift, RecommendedPagesTsvToS3, RecommendedPages, and DownloadUserProfiles.]

EMR Considerations
- Monitor your cluster as well as your pipeline
- Make use of Spot instances to save $$$
- Data pipeline creation can be automated too (see the sketch after this list)
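On the last two points, a rough sketch of automating cluster creation with boto3, including Spot pricing for the core nodes; the cluster name, instance types, counts, bid price, region, and release label are placeholder assumptions.

import boto3

emr = boto3.client('emr', region_name='us-east-1')

response = emr.run_job_flow(
    Name='nightly-data-pipeline',
    ReleaseLabel='emr-5.0.0',
    Applications=[{'Name': 'Hadoop'}, {'Name': 'Spark'}],
    Instances={
        'InstanceGroups': [
            {
                'Name': 'Master',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm4.large',
                'InstanceCount': 1,
            },
            {
                'Name': 'Core',
                'InstanceRole': 'CORE',
                'InstanceType': 'm4.large',
                'InstanceCount': 3,
                'Market': 'SPOT',   # use Spot instances to save $$$
                'BidPrice': '0.10',
            },
        ],
        'KeepJobFlowAliveWhenNoSteps': True,
    },
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)

print(response['JobFlowId'])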