Post on 16-Apr-2017
Engineering a robust data pipeline with Luigi and AWS Elastic MapReduce
Aaron Knight, Full Stack Engineer at Voxy
October 20, 2016 (ish)
What do you mean by “Engineering”?
● stability
● monitoring
● debuggability
● maintainability
Background
(Diagram: PageView, UserProfile, and TutoringReservation events flow into S3, and from there into Amazon Redshift)
Part 1: Luigi
"Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more."
How to make pasta
1. Heat a pan on medium heat.
2. Add oil to the pan.
3. Add onion and bell peppers and sauté until softened.
4. Add ground beef and cook until meat is well done.
5. Add the tomato sauce, salt, pepper and garlic powder.
6. Meanwhile, bring a pot of water to a rolling boil.
7. Cook noodles as directed.
8. Mix the sauce and noodles.
(Diagram: the recipe as a dependency graph — ChopOnions(), ChopPeppers(), and HeatOil() feed SauteePeppersAndOnions(), then BrownMeat() and SimmerSauce(); AddPastaToWater() leads to DrainWater(); SimmerSauce() and DrainWater() feed MixSauceAndNoodles())
How to make pasta with Luigi
(Diagram: the same recipe as a Luigi task graph — MixSauceAndNoodles() depends on MakeSauce() and DrainWater(); DrainWater() on BoilPasta(), which follows AddPastaToWater() and BringWaterToBoil(); MakeSauce() on BrownMeat(), SauteePeppersAndOnions(), HeatOil(), ChopOnions(), and ChopPeppers())
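The dependency graph above can be sketched in plain Python to show what Luigi's scheduler effectively does: run each task only after everything it requires has run, and skip tasks that are already done. This is a toy stand-in, not the real luigi API, and the edge structure is inferred from the recipe.

```python
# Toy sketch of Luigi-style dependency resolution (not the real luigi API):
# each task runs only after everything it requires has run.

PASTA_DEPS = {
    "MixSauceAndNoodles": ["SimmerSauce", "DrainWater"],
    "SimmerSauce": ["BrownMeat"],
    "BrownMeat": ["SauteePeppersAndOnions"],
    "SauteePeppersAndOnions": ["HeatOil", "ChopOnions", "ChopPeppers"],
    "DrainWater": ["BoilPasta"],
    "BoilPasta": ["AddPastaToWater"],
    "AddPastaToWater": ["BringWaterToBoil"],
}

def run(task, deps, done):
    """Depth-first: satisfy requirements before running the task itself."""
    if task in done:
        return  # already complete -- finished tasks are skipped
    for req in deps.get(task, []):
        run(req, deps, done)
    done.append(task)

order = []
run("MixSauceAndNoodles", PASTA_DEPS, order)
print(order)
```

Running the final task pulls in the whole graph in a valid order, which is exactly why you only ever ask Luigi for the last task in a pipeline.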
The Luigi Magic
● output
● requires
● run
import luigi

class SomeTask(luigi.Task):
    def output(self):
        # An indication that the task is done
        ...

    def requires(self):
        return SomeOtherTask()

    def run(self):
        # Your code
        ...

if __name__ == '__main__':
    luigi.run()
some_task.py
run_my_luigi_task.sh
#!/bin/sh
luigid --background --port=8082
python some_task.py SomeTask
Example: https://github.com/phrasemix/luigi-hello-world
Configuration

[core]
email-sender: luigi-errors@example.com
error-email: person1@example.com, person2@example.com
smtp_host: email-smtp.us-east-1.amazonaws.com
...
[database]
user: <database user>
password: <database password>
...
[hadoop]
version: cdh4
./client.cfg
run_staging_pipeline.sh
#!/bin/sh
export LUIGI_CONFIG_PATH=~/config/staging.cfg
luigid --background --port=8082
python luigi_tasks.py AllTasks
page_views.json
[
  { "user_id": "1234", "ts": "1476652039.12", "url": "/example.html" },
  …
]
user_profiles.json
[
  { "user_id": "1234", "email": "user@example.com", "membership_start": "1476652039.12", "membership_end": "" },
  …
]
Table "public.user_page_views"
      Column      |            Type
------------------+-----------------------------
 url              | character varying(16384)
 view_ts          | timestamp without time zone
 user_id          | character varying(16384)
 email            | character varying(16384)
 membership_start | timestamp without time zone
 is_current       | integer

Table "public.recommended_pages"
 Column |           Type
--------+--------------------------
 url    | character varying(16384)
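As a rough sketch of the transformation a task like UserPageViews might perform: join each page view to its user's profile and emit tab-separated rows shaped like the user_page_views table, ready for a Redshift COPY. The function and field-handling choices here are assumptions for illustration, not the talk's actual code.

```python
from datetime import datetime, timezone

# Hypothetical sketch: enrich each page view with the viewer's profile and
# emit TSV rows matching the user_page_views columns.

page_views = [
    {"user_id": "1234", "ts": "1476652039.12", "url": "/example.html"},
]
user_profiles = [
    {"user_id": "1234", "email": "user@example.com",
     "membership_start": "1476652039.12", "membership_end": ""},
]

def to_ts(epoch):
    """Epoch-seconds string -> 'YYYY-MM-DD HH:MM:SS' (UTC) for Redshift."""
    dt = datetime.fromtimestamp(float(epoch), tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")

def user_page_view_rows(views, profiles):
    by_id = {p["user_id"]: p for p in profiles}
    for v in views:
        p = by_id.get(v["user_id"])
        if p is None:
            continue  # no matching profile; a real pipeline might dead-letter these
        is_current = 1 if p["membership_end"] == "" else 0
        yield "\t".join([
            v["url"], to_ts(v["ts"]), v["user_id"], p["email"],
            to_ts(p["membership_start"]), str(is_current),
        ])

for row in user_page_view_rows(page_views, user_profiles):
    print(row)
```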
AllRedshiftTables(luigi.WrapperTask)
├── UserPageViewsToRedshift
│   └── UserPageViewsTsvToS3
│       └── UserPageViews
│           ├── IngestUserProfiles
│           └── IngestPageViews
└── RecommendedPagesToRedshift
    └── RecommendedPagesTsvToS3
        └── RecommendedPages
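One useful property of a graph like this: Luigi treats a task as complete when its output already exists, so rerunning the pipeline only redoes missing work. A toy illustration (not the real API — task names follow the graph above, the "S3 state" is pretend):

```python
# Toy illustration of Luigi's completeness check: a task whose output
# already exists is skipped, so reruns only redo missing work.

existing_outputs = {"s3://bucket/user_page_views.tsv"}  # pretend S3 state

TASKS = {
    "UserPageViewsTsvToS3": {
        "output": "s3://bucket/user_page_views.tsv",
        "requires": [],
    },
    "UserPageViewsToRedshift": {
        "output": "redshift://public.user_page_views",
        "requires": ["UserPageViewsTsvToS3"],
    },
}

def build(name, tasks, ran):
    task = tasks[name]
    if task["output"] in existing_outputs:
        return  # complete() is True -> skipped, along with its requirements
    for req in task["requires"]:
        build(req, tasks, ran)
    ran.append(name)
    existing_outputs.add(task["output"])

ran = []
build("UserPageViewsToRedshift", TASKS, ran)
print(ran)  # only the Redshift load runs; the TSV is already on S3
```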
Luigi Considerations
● Include a unique identifier for your pipeline
● Be careful with Luigi task inheritance
● Make use of contrib modules:
  ○ Hadoop
  ○ Spark
  ○ Redshift
  ○ Elasticsearch
● Consider errors
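On the first point, one common way to get a unique identifier is to key every task's output by the run date (in real Luigi, via a luigi.DateParameter), so reruns are idempotent and concurrent pipelines never clobber each other's files. A minimal sketch — the bucket and path layout are made up:

```python
from datetime import date

# Sketch of the "unique identifier" idea: each run date gets its own
# output prefix, so rerunning a day rewrites only that day's files.

def output_key(task_name, run_date):
    return "s3://example-bucket/{}/{}.tsv".format(run_date.isoformat(), task_name)

print(output_key("UserPageViewsTsvToS3", date(2016, 10, 20)))
# -> s3://example-bucket/2016-10-20/UserPageViewsTsvToS3.tsv
```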
Part 2: Running on AWS
Why use EMR?
- Pre-installed services
  - Hadoop
  - Pig
  - Spark
  - Zookeeper
- Easy scaling
- "Cleanliness"
Data Pipeline
(Diagram: an EMR cluster of EC2 instances — one Master node and three Core nodes — plus a separate standalone EC2 instance)
(Diagram: the EMR cluster — Master plus three Core nodes — runs luigid with the tasks UserPageViewsToRedshift, UserPageViewsToS3, UserPageViews**, IngestUserProfiles**, IngestPageViews**, and UserProfilesToS3; the separate EC2 instance runs its own luigid with RecommendedPagesToRedshift, RecommendedPagesTsvToS3, RecommendedPages, and DownloadUserProfiles)
EMR Considerations
- Monitor your pipeline and cluster as well
- Make use of Spot instances to save $$$
- Data pipeline creation can be automated too
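The last two points combine naturally: a script can spin up a cluster on demand with Spot-priced core nodes. As a sketch, the dict below is shaped for boto3's EMR `run_job_flow` call — the cluster name, instance types, counts, and bid price are all made-up values, and the actual API call is left commented out:

```python
# Sketch of automating EMR cluster creation; the request dict is shaped
# for boto3's emr.run_job_flow. All concrete values are illustrative.

def emr_request(name, core_count, spot_bid="0.10"):
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.0.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge",
                 "InstanceCount": 1, "Market": "ON_DEMAND"},
                # Spot instances for the core nodes to save $$$
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge",
                 "InstanceCount": core_count, "Market": "SPOT",
                 "BidPrice": spot_bid},
            ],
        },
    }

request = emr_request("data-pipeline", core_count=3)
# import boto3
# boto3.client("emr").run_job_flow(**request)
print(request["Instances"]["InstanceGroups"][1]["Market"])  # SPOT
```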