Building and Improving Products with Hadoop
![Page 1: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/1.jpg)
2013
Building and Improving Products with Hadoop
Matthew Rathbone
![Page 2: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/2.jpg)
What is Foursquare
Foursquare helps you explore the world around you.
Meet up with friends, discover new places, and save money using your phone.
4 billion check-ins
35 million users
50 million points of interest (POI)
150 employees
1 TB+ of data per day
![Page 3: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/3.jpg)
FIRST, A STORY
http://www.flickr.com/photos/shannonpatrick17
![Page 4: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/4.jpg)
The Right Tool for the Job
• Nginx – Serving static files
• Perl – Regular expressions
• XML – Frustrating people
• Hadoop (MapReduce) – Counting
![Page 5: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/5.jpg)
COUNTING – WHAT IS IT GOOD FOR?
http://www.flickr.com/photos/blaahhi/
![Page 6: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/6.jpg)
![Page 7: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/7.jpg)
![Page 8: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/8.jpg)
![Page 9: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/9.jpg)
![Page 10: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/10.jpg)
![Page 11: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/11.jpg)
Statistically Improbable Phrases
![Page 12: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/12.jpg)
SIPS use cases
• menu extraction
• sentiment analysis
• venue ratings
• specific recommendations
• search indexing
• pricing data
• facility information
![Page 13: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/13.jpg)
How is SIPS built?
Basically lots of counting.
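To make "lots of counting" concrete, here is a minimal sketch of the map and reduce steps in plain Python, with no Hadoop involved. The function names and the sample tips are made up for illustration; the talk does not show the actual job.

```python
from collections import defaultdict
from itertools import chain


def map_phrases(tip):
    """Map step: emit a (word, 1) pair for each lowercase word in a tip."""
    for word in tip.lower().split():
        yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Made-up sample data, standing in for real tips.
tips = ["great coffee", "coffee and bagels", "great bagels"]
counts = reduce_counts(chain.from_iterable(map_phrases(t) for t in tips))
print(counts)
```

On a cluster the framework would handle grouping pairs by key between the two steps; here a single dictionary plays that role.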
![Page 14: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/14.jpg)
SIPS
• Tokenize data with a language model (into n-grams)
  • Built using tips, shouts, menu items, likes, etc.
• Apply a TF-IDF algorithm (term frequency, inverse document frequency)
  • Global phrase count
  • Local phrase count (in a venue)
• Some filtering and ranking
• Re-compute and deploy nightly
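The steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production pipeline: the whitespace tokenizer stands in for the language model, the scoring formula (local share times the log of inverse venue frequency) is one common TF-IDF variant that the slides do not specify, and the names `ngrams`, `sips`, and the sample venues are hypothetical.

```python
import math
from collections import Counter


def ngrams(text, n=2):
    """Split text on whitespace and return all phrases of 1..n words."""
    words = text.lower().split()
    return [" ".join(words[i:i + size])
            for size in range(1, n + 1)
            for i in range(len(words) - size + 1)]


def sips(venue_tips, top_k=3):
    """Rank phrases per venue by a TF-IDF style score (assumed formula)."""
    # Local phrase counts: one Counter per venue.
    local = {venue: Counter(p for tip in tips for p in ngrams(tip))
             for venue, tips in venue_tips.items()}
    # Global count: number of venues in which each phrase appears.
    doc_freq = Counter(p for counts in local.values() for p in counts)
    n_venues = len(local)
    ranked = {}
    for venue, counts in local.items():
        total = sum(counts.values())
        scores = {p: (c / total) * math.log(n_venues / doc_freq[p])
                  for p, c in counts.items()}
        # Filtering: phrases present in every venue score 0 and are dropped.
        kept = [(p, s) for p, s in scores.items() if s > 0]
        # Ranking: highest score first, ties broken alphabetically.
        ranked[venue] = sorted(kept, key=lambda ps: (-ps[1], ps[0]))[:top_k]
    return ranked


# Made-up sample data.
venue_tips = {
    "pizzeria": ["great pizza", "pizza is great", "try the pizza"],
    "cafe": ["great coffee", "coffee is great"],
}
result = sips(venue_tips)
print(result["pizzeria"][0][0], result["cafe"][0][0])
```

A word such as "great" appears at both venues, so it scores zero and is filtered out, while "pizza" and "coffee" rise to the top of their own venues.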
![Page 15: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/15.jpg)
WHY USE HADOOP?
http://www.flickr.com/photos/dbrekke/
![Page 16: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/16.jpg)
SIPS – Without Hadoop
Potential Problems
• Database Query Throttling
• Venues are out of sync
• Altering the algorithm could take forever to populate for all venues
• Where would you store the results?
• What about debug data?
• Does it scale to 10x, 100x?
• What about other, similar workflows?
![Page 17: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/17.jpg)
SIPS – Hadoop Benefits
• Quick Deployment
• Modular & Reusable
• Arbitrarily complex combination of many datasets
• Every step of the workflow creates value
![Page 18: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/18.jpg)
Apple Store - Downtown San Francisco
1 tip mentions "haircuts"
A search for "haircuts" in "san francisco" returned the Apple Store???
Fixed by looking at the percentage of a venue's tips that mention the phrase, and at the phrase's overall frequency
“Hey Apple, how bout less shiny pizzazz and fancy haircuts and more fix-my-f!@#$-imac”
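The fix described on this slide can be sketched roughly as follows. Only the per-venue share of tips is shown; the overall-frequency check is left out. The 5% threshold, the function names, and the tip data (including the total of 100 tips for the Apple Store) are assumptions made up for illustration.

```python
def phrase_share(tips, phrase):
    """Fraction of a venue's tips that contain the phrase (substring match)."""
    if not tips:
        return 0.0
    hits = sum(1 for tip in tips if phrase in tip.lower())
    return hits / len(tips)


def keep_phrase(tips, phrase, min_share=0.05):
    """Keep a phrase for a venue only if enough of its tips mention it.

    min_share is a hypothetical threshold, not a value from the talk.
    """
    return phrase_share(tips, phrase) >= min_share


# Made-up data: one tip in a hundred mentions haircuts at the Apple Store.
apple_store_tips = ["less shiny pizzazz and fancy haircuts"] + ["long wait"] * 99
barber_tips = ["best haircuts in town", "quick haircuts", "friendly staff"]

print(keep_phrase(apple_store_tips, "haircuts"))
print(keep_phrase(barber_tips, "haircuts"))
```

A single stray mention no longer ties the phrase to the venue, while a venue whose tips regularly mention it still qualifies.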
![Page 19: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/19.jpg)
Data & Modularity
![Page 20: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/20.jpg)
![Page 21: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/21.jpg)
![Page 22: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/22.jpg)
![Page 23: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/23.jpg)
ACTUALLY, IT’S A BIT MORE COMPLICATED
http://www.flickr.com/photos/bfishadow
![Page 24: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/24.jpg)
These benefits require infrastructure
![Page 25: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/25.jpg)
Dependency Management
Many options
• Oozie (Apache)
• Azkaban (LinkedIn)
• Luigi (Spotify, we <3 this)
• Hamake (Codeminders)
• Chronos (Airbnb)
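What all of these tools share is the idea of running jobs in dependency order. As a rough illustration of that idea only, and not of any listed tool's API, here is a sketch using Python's standard-library `graphlib` (Python 3.9+). The job names are hypothetical, loosely modeled on the SIPS steps.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "tokenize_tips": {"ingest_tips"},
    "global_phrase_counts": {"tokenize_tips"},
    "venue_phrase_counts": {"tokenize_tips"},
    "score_sips": {"global_phrase_counts", "venue_phrase_counts"},
    "deploy_to_store": {"score_sips"},
}

# static_order() yields every job after all of its dependencies.
run_order = list(TopologicalSorter(workflow).static_order())
print(run_order)
```

A real scheduler adds much more on top of this ordering, such as retries, scheduling, and skipping work whose output already exists.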
![Page 26: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/26.jpg)
![Page 27: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/27.jpg)
Database / Log Ingestion
• Sqoop
• Mongo-Hadoop
• Kafka
• Flume
• Scribe
• etc
![Page 28: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/28.jpg)
![Page 29: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/29.jpg)
MapReduce Friendly Datastore
A few obvious ones:
• HBase
• Cassandra
• Voldemort
We built our own; it’s very similar to Voldemort and uses the HFile API
![Page 30: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/30.jpg)
![Page 31: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/31.jpg)
Getting started without all that stuff
![Page 32: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/32.jpg)
Components you likely don’t have
![Page 33: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/33.jpg)
The best way to start
Don’t use Hadoop.
*but pretend you do
![Page 34: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/34.jpg)
Other reasons to not use Hadoop
• Your idea might not be very good
• Hadoop will slow you down to start with
• You don’t have enough infrastructure yet
  • Build it when you need it
• V1 might not be that complex
• V1 could be a spreadsheet
![Page 35: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/35.jpg)
![Page 36: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/36.jpg)
![Page 37: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/37.jpg)
SIPS
Version 1
• Off the shelf language model
• A subset of Venues & Tips
• Did not use MapReduce
• Did not push to production at all
![Page 38: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/38.jpg)
SIPS
Version 2
• Started building our own language model
• Rewritten as a MapReduce job
• Manually loaded data to production
• Filters for English data only
Tweak, improve, etc.
![Page 39: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/39.jpg)
SIPS
Version 3
• Incorporated more data sources into our language model
• Automatic deployment to the key-value (KV) store
• Incorporated lots of debug output
• Language pipeline also feeds sentiment analysis
Now we’re in the perfect place to iterate & improve
![Page 40: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/40.jpg)
…to explore data
![Page 41: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/41.jpg)
In Summary
• Hadoop is good for counting, so use it for counting
• Move quickly whenever possible and don’t worry about automation
• Bring in new production services as you need them
• Freedom!
![Page 42: Building and Improving Products with Hadoop](https://reader033.fdocuments.in/reader033/viewer/2022042614/55945db31a28ab270c8b47af/html5/thumbnails/42.jpg)
Thanks!
@rathboma
Bonus:
http://hadoopweekly.com
from my colleague, Joe Crobak (presenting later!)