Metail and Elastic MapReduce

April 2016 – AWS Loft, London
Gareth Rogers, Data Engineer

Metail lets you try on clothes online

Discover clothes on your body shape

Create, save outfits and share

Shop with confidence of size and fit

Proven impact as validated by American business schools and A/B tests

“…customers who had access to the fitting tool are more likely to come back to the site, and this effect is statistically significant…”

“…shows approximately a 5.1 percent reduction in returns compared to the control group… In other words, providing fit information reduces average fulfilment costs”

“…sales for users with access to the tool were substantially higher overall - 22.32 percent larger”

Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment” by Prof Santiago Gallino (Dartmouth College - Tuck School of Business) & Prof Antonio Moreno (Northwestern University), Oct 21, 2015

1000+ GARMENTS

3M DATA POINTS

http://poseidon01.ssrn.com/delivery.php?ID=675084119021001015102024112117066122024006056079005030120082090022113111009099072123124060121106033007109027002117105018067118107006090023002099027120107100099016118063082019111098095017126090122081116007112025022028099001088004088090098075002106021094&EXT=pdf






Architecture Theory
• Our architecture is modelled on Nathan Marz’s Lambda Architecture: http://lambda-architecture.net
• Should include a speed layer to give a real time view on sampled data
– We’ve not implemented this

[Diagram: New Data → Batch Layer (Master dataset) → Serving Layer (Batch views) → Queries. Source: http://lambda-architecture.net/]
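The batch and serving layers described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Metail's code: the event fields, the page paths and the function names `build_batch_view` and `query` are all hypothetical. The point is that the batch view is recomputed from the whole master dataset, and queries only ever touch the view.

```python
from collections import Counter

# Hypothetical master dataset: an append-only list of raw events.
master_dataset = [
    {"event": "pv", "page": "/dresses"},
    {"event": "pv", "page": "/tops"},
    {"event": "pv", "page": "/dresses"},
]


def build_batch_view(events):
    """Batch layer: recompute the view from scratch over all events."""
    return Counter(e["page"] for e in events if e["event"] == "pv")


def query(batch_view, page):
    """Serving layer: answer a query from the precomputed view only."""
    return batch_view.get(page, 0)


batch_view = build_batch_view(master_dataset)
dresses_views = query(batch_view, "/dresses")
```

New events are appended to the master dataset and picked up the next time the batch view is rebuilt. Without a speed layer, queries lag behind by up to one batch run.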

Architecture Practice – Data Collection

New Data and Collection
• We’re using Snowplow for the initial stages of our pipeline
• Using their JavaScript tracker and CloudFront collector configuration
• Tracker performs a GET request on a CloudFront distributed image (pixel)
• Query parameters of the request contain the event data
  e.g. GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...
• CloudFront configured to log the requests to S3
• We now have our master record
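To make the pixel request concrete, here is a minimal sketch of how the event data could be recovered from such a request's query string. The function name `parse_pixel_request` and the parameter values in the example URL are invented for illustration; only the `e`, `url` and `page` parameter names come from the example above.

```python
from urllib.parse import urlsplit, parse_qs


def parse_pixel_request(url):
    """Return the event fields carried in a tracking pixel's query string."""
    query = urlsplit(url).query
    # parse_qs returns a list per key; keep the first value of each.
    return {key: values[0] for key, values in parse_qs(query).items()}


# Hypothetical request; the parameter values are invented for illustration.
example = (
    "http://d2sgzneryst63x.cloudfront.net/i"
    "?e=pv&url=http%3A%2F%2Fexample.com%2Fdresses&page=Dresses"
)
event = parse_pixel_request(example)
```

In the real pipeline the request URL would come out of a CloudFront access log line rather than a string literal.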





Architecture Practice – Serving Layer

Serving Layer
• Initially queries ran over Hadoop, then Redshift came along
• Redshift SQL is good for a small data science team!
• Not so good for everyone else in the company
• Introduced Looker
– Data model in SQL
– Dashboards
– Point and click data exploration
– Permissions
– Version control

Architecture Practice – Batch Layer

Batch Layer
• Process the raw events daily to create the batch views
• Run using Elastic MapReduce (EMR), the hosted Hadoop service in AWS
• Create views of the master record through enrichment and aggregation
• Populate the schema for speedy Redshift queries
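The aggregation side of the batch layer follows the MapReduce pattern that Hadoop runs at scale. The sketch below imitates the map, shuffle and reduce phases in a single Python process to show the shape of such a job. The event fields (`timestamp`, `type`) and the function names are hypothetical, and a real job would be distributed across the EMR cluster rather than run in one loop.

```python
from collections import defaultdict


def map_event(event):
    """Map phase: emit a ((date, event type), 1) pair for one event."""
    date = event["timestamp"][:10]  # assumes ISO 8601 timestamps
    yield (date, event["type"]), 1


def reduce_counts(key, values):
    """Reduce phase: sum the counts emitted for one key."""
    return key, sum(values)


def run_job(events):
    """Run map, shuffle (group by key) and reduce in a single process."""
    groups = defaultdict(list)
    for event in events:
        for key, value in map_event(event):
            groups[key].append(value)
    return dict(reduce_counts(k, v) for k, v in groups.items())


# Hypothetical events, invented for illustration.
events = [
    {"timestamp": "2016-04-01T09:15:00Z", "type": "pv"},
    {"timestamp": "2016-04-01T17:40:00Z", "type": "pv"},
    {"timestamp": "2016-04-02T08:05:00Z", "type": "pv"},
]
daily_view = run_job(events)
```

The resulting dictionary plays the role of a batch view: one row per day and event type, ready to be loaded into a table for fast queries.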

Extract Transform and Load (ETL)
• Snowplow’s ETL is driven by config files and executed in Ruby
– Initial step executed outside of EMR
– Copy data from CloudFront incoming log bucket to another S3 bucket for processing
– Next create EMR cluster

Extract Transform and Load (ETL)
• To that cluster we add steps
• Initial step uses s3distcp to aggregate the log files
• Snowplow’s ETL is written in Scalding
– Scalding = Cascading (higher-level Java MapReduce libraries) in Scala
– They provide a compiled JAR hosted in S3
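One way s3distcp aggregates many small log files is its `--groupBy` option, which takes a regular expression and concatenates the files whose names share the same captured group. The sketch below simulates only that grouping logic locally. It does not call s3distcp, and the file names, the pattern and the helper `group_keys` are assumptions for illustration. Whether the pipeline described here groups by day is not stated in the talk.

```python
import re
from collections import defaultdict


def group_keys(keys, pattern):
    """Group keys by the first capture group of pattern, skipping non-matches."""
    regex = re.compile(pattern)
    groups = defaultdict(list)
    for key in keys:
        match = regex.fullmatch(key)
        if match:
            groups[match.group(1)].append(key)
    return dict(groups)


# Hypothetical log file names in the CloudFront style
# <distribution-id>.<YYYY-MM-DD-HH>.<unique-id>.gz
keys = [
    "EXAMPLEDIST.2016-04-01-09.aaaa1111.gz",
    "EXAMPLEDIST.2016-04-01-17.bbbb2222.gz",
    "EXAMPLEDIST.2016-04-02-08.cccc3333.gz",
]
# Capture the date so that each day's files end up in one group.
by_day = group_keys(keys, r".*\.(\d{4}-\d{2}-\d{2})-\d{2}\..*")
```

Fewer, larger input files generally suit Hadoop better than thousands of tiny ones, which is why an aggregation step like this comes first.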

Extract Transform and Load (ETL)
• Metail’s ETL is very similar to Snowplow’s
• Use AWS Data Pipeline to drive the workflow
– Really great to get going
– But quickly hit complexity limitations

Extract Transform and Load (ETL)
• Metail ETL written in Cascalog, logic programming over Hadoop
– Cascalog = Cascading + Datalog in Clojure
– Ridiculously compact and expressive
– But steep learning curve and impenetrable errors

Extract Transform and Load (ETL)
• Soon: Parkour, a Clojure wrapper over the Hadoop Java API
– Access to the full Hadoop API with no abstractions, just more idiomatic Clojure
– Learning curve is mainly Hadoop
– Errors still impenetrable

Summary
• This pipeline has been built and managed by 3-5 people
• It’s about a year and a half old and continues to evolve
• Composed of a few different technologies, with EMR used to do the batch processing
• Using EMR has made cluster management and scaling straightforward
• The synergy between EMR and S3 is a powerful feature
– Encourages immutable infrastructure
– You don’t need your compute cluster running to hold your data!
