Post on 16-Apr-2017
Engineering a robust data pipeline with Luigi and AWS Elastic MapReduce
Aaron Knight, Full Stack Engineer at Voxy
October 20, 2016 (ish)
What do you mean by “Engineering”?
● stability
● monitoring
● debuggability
● maintainability
Background
(Diagram: PageView, UserProfile, and TutoringReservation events flow into S3, and from there into Amazon Redshift)
Part 1: Luigi
"Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, handling failures, command line integration, and much more."
How to make pasta
1. Heat a pan on medium heat.
2. Add oil to the pan.
3. Add onion and bell peppers and sauté until softened.
4. Add ground beef and cook until meat is well done.
5. Add the tomato sauce, salt, pepper and garlic powder.
6. Meanwhile, bring a pot of water to a rolling boil.
7. Cook noodles as directed.
8. Mix the sauce and noodles.
(Diagram: the recipe as a dependency graph — ChopOnions(), ChopPeppers(), and HeatOil() feed SauteePeppersAndOnions(), then BrownMeat() and SimmerSauce(); AddPastaToWater() leads to DrainWater(); SimmerSauce() and DrainWater() feed MixSauceAndNoodles())
How to make pasta with Luigi
(Diagram: the same recipe as a Luigi task graph — MixSauceAndNoodles() depends on MakeSauce() and DrainWater(); DrainWater() on BoilPasta(), which follows AddPastaToWater() and BringWaterToBoil(); MakeSauce() on BrownMeat(), SauteePeppersAndOnions(), HeatOil(), ChopOnions(), and ChopPeppers())
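The dependency graph above can be sketched in plain Python to show what Luigi's scheduler effectively does: run each task only after everything it requires has run, and skip tasks that are already done. This is a toy stand-in, not the real luigi API, and the edge structure is inferred from the recipe.

```python
# Toy sketch of Luigi-style dependency resolution (not the real luigi API):
# each task runs only after everything it requires has run.

PASTA_DEPS = {
    "MixSauceAndNoodles": ["SimmerSauce", "DrainWater"],
    "SimmerSauce": ["BrownMeat"],
    "BrownMeat": ["SauteePeppersAndOnions"],
    "SauteePeppersAndOnions": ["HeatOil", "ChopOnions", "ChopPeppers"],
    "DrainWater": ["BoilPasta"],
    "BoilPasta": ["AddPastaToWater"],
    "AddPastaToWater": ["BringWaterToBoil"],
}

def run(task, deps, done):
    """Depth-first: satisfy requirements before running the task itself."""
    if task in done:
        return  # already complete -- finished tasks are skipped
    for req in deps.get(task, []):
        run(req, deps, done)
    done.append(task)

order = []
run("MixSauceAndNoodles", PASTA_DEPS, order)
print(order)
```

Running the final task pulls in the whole graph in a valid order, which is exactly why you only ever ask Luigi for the last task in a pipeline.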
The Luigi Magic
● output
● requires
● run
import luigi

class SomeTask(luigi.Task):
    def output(self):
        # An indication that the task is done
        ...

    def requires(self):
        return SomeOtherTask()

    def run(self):
        # Your code
        ...

if __name__ == '__main__':
    luigi.run()
some_task.py
run_my_luigi_task.sh
#!/bin/sh
luigid --background --port=8082
python some_task.py SomeTask
Example: https://github.com/phrasemix/luigi-hello-world
Configuration

[core]
email-sender: luigi-errors@example.com
error-email: person1@example.com, person2@example.com
smtp_host: email-smtp.us-east-1.amazonaws.com
...
[database]
user: <database user>
password: <database password>
...
[hadoop]
version: cdh4
./client.cfg
run_staging_pipeline.sh
#!/bin/sh
export LUIGI_CONFIG_PATH=~/config/staging.cfg
luigid --background --port=8082
python luigi_tasks.py AllTasks
page_views.json
[
  { "user_id": "1234", "ts": "1476652039.12", "url": "/example.html" },
  …
]
user_profiles.json
[
  { "user_id": "1234", "email": "user@example.com", "membership_start": "1476652039.12", "membership_end": "" },
  …
]
Table "public.user_page_views"
      Column      |            Type
------------------+-----------------------------
 url              | character varying(16384)
 view_ts          | timestamp without time zone
 user_id          | character varying(16384)
 email            | character varying(16384)
 membership_start | timestamp without time zone
 is_current       | integer

Table "public.recommended_pages"
 Column |           Type
--------+--------------------------
 url    | character varying(16384)
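As a rough sketch of the transformation a task like UserPageViews might perform: join each page view to its user's profile and emit tab-separated rows shaped like the user_page_views table, ready for a Redshift COPY. The function and field-handling choices here are assumptions for illustration, not the talk's actual code.

```python
from datetime import datetime, timezone

# Hypothetical sketch: enrich each page view with the viewer's profile and
# emit TSV rows matching the user_page_views columns.

page_views = [
    {"user_id": "1234", "ts": "1476652039.12", "url": "/example.html"},
]
user_profiles = [
    {"user_id": "1234", "email": "user@example.com",
     "membership_start": "1476652039.12", "membership_end": ""},
]

def to_ts(epoch):
    """Epoch-seconds string -> 'YYYY-MM-DD HH:MM:SS' (UTC) for Redshift."""
    dt = datetime.fromtimestamp(float(epoch), tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S")

def user_page_view_rows(views, profiles):
    by_id = {p["user_id"]: p for p in profiles}
    for v in views:
        p = by_id.get(v["user_id"])
        if p is None:
            continue  # no matching profile; a real pipeline might dead-letter these
        is_current = 1 if p["membership_end"] == "" else 0
        yield "\t".join([
            v["url"], to_ts(v["ts"]), v["user_id"], p["email"],
            to_ts(p["membership_start"]), str(is_current),
        ])

for row in user_page_view_rows(page_views, user_profiles):
    print(row)
```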
AllRedshiftTables(luigi.WrapperTask)
├── UserPageViewsToRedshift
│   └── UserPageViewsTsvToS3
│       └── UserPageViews
│           ├── IngestUserProfiles
│           └── IngestPageViews
└── RecommendedPagesToRedshift
    └── RecommendedPagesTsvToS3
        └── RecommendedPages
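One useful property of a graph like this: Luigi treats a task as complete when its output already exists, so rerunning the pipeline only redoes missing work. A toy illustration (not the real API — task names follow the graph above, the "S3 state" is pretend):

```python
# Toy illustration of Luigi's completeness check: a task whose output
# already exists is skipped, so reruns only redo missing work.

existing_outputs = {"s3://bucket/user_page_views.tsv"}  # pretend S3 state

TASKS = {
    "UserPageViewsTsvToS3": {
        "output": "s3://bucket/user_page_views.tsv",
        "requires": [],
    },
    "UserPageViewsToRedshift": {
        "output": "redshift://public.user_page_views",
        "requires": ["UserPageViewsTsvToS3"],
    },
}

def build(name, tasks, ran):
    task = tasks[name]
    if task["output"] in existing_outputs:
        return  # complete() is True -> skipped, along with its requirements
    for req in task["requires"]:
        build(req, tasks, ran)
    ran.append(name)
    existing_outputs.add(task["output"])

ran = []
build("UserPageViewsToRedshift", TASKS, ran)
print(ran)  # only the Redshift load runs; the TSV is already on S3
```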
Luigi Considerations
● Include a unique identifier for your pipeline
● Be careful with Luigi task inheritance
● Make use of contrib modules:
  ○ Hadoop
  ○ Spark
  ○ Redshift
  ○ Elasticsearch
● Consider errors
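On the first point, one common way to get a unique identifier is to key every task's output by the run date (in real Luigi, via a luigi.DateParameter), so reruns are idempotent and concurrent pipelines never clobber each other's files. A minimal sketch — the bucket and path layout are made up:

```python
from datetime import date

# Sketch of the "unique identifier" idea: each run date gets its own
# output prefix, so rerunning a day rewrites only that day's files.

def output_key(task_name, run_date):
    return "s3://example-bucket/{}/{}.tsv".format(run_date.isoformat(), task_name)

print(output_key("UserPageViewsTsvToS3", date(2016, 10, 20)))
# -> s3://example-bucket/2016-10-20/UserPageViewsTsvToS3.tsv
```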
Part 2: Running on AWS
Why use EMR?
- Pre-installed services
  - Hadoop
  - Pig
  - Spark
  - Zookeeper
- Easy scaling
- "Cleanliness"
Data Pipeline
(Diagram: an EMR cluster of EC2 instances — one Master node and three Core nodes — plus a separate standalone EC2 instance)
(Diagram: the EMR cluster — Master plus three Core nodes — runs luigid with the tasks UserPageViewsToRedshift, UserPageViewsToS3, UserPageViews**, IngestUserProfiles**, IngestPageViews**, and UserProfilesToS3; the separate EC2 instance runs its own luigid with RecommendedPagesToRedshift, RecommendedPagesTsvToS3, RecommendedPages, and DownloadUserProfiles)
EMR Considerations
- Monitor your pipeline and cluster as well
- Make use of Spot instances to save $$$
- Data pipeline creation can be automated too
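The last two points combine naturally: a script can spin up a cluster on demand with Spot-priced core nodes. As a sketch, the dict below is shaped for boto3's EMR `run_job_flow` call — the cluster name, instance types, counts, and bid price are all made-up values, and the actual API call is left commented out:

```python
# Sketch of automating EMR cluster creation; the request dict is shaped
# for boto3's emr.run_job_flow. All concrete values are illustrative.

def emr_request(name, core_count, spot_bid="0.10"):
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.0.0",
        "Applications": [{"Name": "Hadoop"}, {"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": "m3.xlarge",
                 "InstanceCount": 1, "Market": "ON_DEMAND"},
                # Spot instances for the core nodes to save $$$
                {"InstanceRole": "CORE", "InstanceType": "m3.xlarge",
                 "InstanceCount": core_count, "Market": "SPOT",
                 "BidPrice": spot_bid},
            ],
        },
    }

request = emr_request("data-pipeline", core_count=3)
# import boto3
# boto3.client("emr").run_job_flow(**request)
print(request["Instances"]["InstanceGroups"][1]["Market"])  # SPOT
```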