Metail and Elastic MapReduce
-
Upload
gareth-rogers -
Category
Technology
-
view
128 -
download
0
Transcript of Metail and Elastic MapReduce
1
April 2016 – AWS Loft, LondonGareth Rogers, Data Engineer
2
Metail lets you try on clothes online
Discover clothes on your body shape
Create, save outfits and share
Shop with confidence of size and fit
3
Proven impact as validated by American business schools and A/B tests
‘‘
…customers who had access to the fitting tool are more likely to come back to the site, and this effect is statistically significant…
‘‘
…shows approximately a 5.1 percent reduction in returns compared to the control group…In other words, providing fit information reduces average fulfilment costs”
…sales for users with access to the tool were substantially higher overall - 22.32 percent larger
‘‘Source: “The Value of Fit Information in Online Retail: Evidence from a Randomized Field Experiment” by Prof Santiago Gallino (Dartmouth College - Tuck School of Business) & Prof Antonio Moreno (Northwestern University) –Oct 21, 2015
DATA1000+ GARMENTS
POINTS3M
4
Architecture Theory• Our architecture is modelled on Nathan Marz’s Lambda Architecture:
http://lambda-architecture.net• Should include a speed layer to give a real time view on sampled data
– We’ve not implemented this
New Data
Batch Layer
Master dataset
Serving Layer
Batch viewsQuery
QueryQuery
5
Architecture Practice – Data Collection
6
Architecture Practice – Data Collection
New Data and Collection• We’re using Snowplow for the initial stages of our
pipeline• Using their JavaScript tracker and Cloudfront
collector configuration• Tracker performs a GET request on a Cloudfront
distributed image (pixel)• Query parameters of the contain the event data
e.g. GET http://d2sgzneryst63x.cloudfront.net/i?e=pv&url=...&page=...&...
• Cloudfront configured to log the requests to S3• We now have our master record
7
Architecture Practice – Serving Layer
Serving Layer• Initially queries over Hadoop Redshift came along
• RedshiftSQL good for small data science team!• Not so good for everyone else in the company• Introduced Looker
• Data model in SQL• Dashboards• Point and click data exploration• Permissions• Version control
8
Architecture Practice – Batch Layer
• Daily process the raw events to create batch view• Run using Elastic MapReduce (EMR) hosted Hadoop service in AWS• Create views of the master record through enrichment and aggregation• Populates the schema for speedy Redshift queries
Batch Layer
9
Extract Transform and Load (ETL)• Snowplow’s ETL driven by config files executed in Ruby
– Initial step executed outside of EMR– Copy data from Cloudfront incoming log bucket to another S3 bucket
for processing– Next create EMR cluster
10
Extract Transform and Load (ETL)• Snowplow’s ETL driven by config files executed in Ruby
– Initial step executed outside of EMR– Copy data from Cloudfront incoming log bucket to another S3 bucket
for processing– Next create EMR cluster
11
Extract Transform and Load (ETL)• To that cluster we add steps• Initial step use s3distcp to aggregate the log files• Snowplow’s ETL written in Scalding
– Scalding = Cascading (Java higher level MapReduce libraries) in Scala– They provide a compiled JAR hosted in S3
12
Extract Transform and Load (ETL)• Metail’s ETL is very similar to
Snowplow’s• Use AWS’ Data Pipeline to drive
the workflow– Really great to get going– But quickly hit complexity
limitations
13
Extract Transform and Load (ETL)• Metail ETL written in
– Cascalog, logic programming over Hadoop– Cascalog = Cascading + Datalog in Clojure– Ridiculously compact and expressive– But steep learning curve and impenetrable errors
14
Extract Transform and Load (ETL)• Soon Parkour a Clojure wrapper over Hadoop Java API
– Access to full Hadoop API with no abstractions just more idiomatic Clojure– Learning curve is mainly Hadoop– Errors still impenetrable
15
Summary• This pipeline has been built and managed by 3-5 people• It’s about a year and a half old and continues to evolve• Composed of a few different technologies and EMR used to do the batch
processing• Using EMR has made cluster managing and scaling straightforward• The synergy between EMR and S3 is a powerful feature
– Encourages immutable infrastructure– You don’t need your compute cluster running to hold your data!