Data Platform and Services
Vipul Sharma and Eyal Reuveni
Agenda
• Eventbrite
• Data Products
• Data Platform
• Recommendations
• Questions
• A social event ticketing and discovery platform
• 50 Millionth Ticket Sold
• Revenue doubled YOY
• 180 Employees in SOMA SF
• Solving significant engineering problems
• Data
• Data, Infrastructure, Mobile, Web, Scale, Ops, QA
• Firing on all cylinders and hiring blazing fast: www.eventbrite.com/jobs
Data Products
Analytics
• Ad hoc queries by Analysts
Fraud and Spam
Data Platform
Hadoop Cluster
• 30 persistent EC2 High-Memory Instances
• 30TB disk with replication factor of 2, ext3 formatted
• CDH3
• Fair Scheduler
• HBase
Infrastructure
• Search
  • Solr
  • Incremental updates towards event driven
• Recommendation/Graph
  • Hadoop
  • Native Java MapReduce
  • Bash for workflow
• Persistence
  • MySQL
  • HDFS
  • HBase
  • MongoDB (Investigating Cassandra and Riak)
Infrastructure
• Stream
  • RabbitMQ
  • Internal Fire hose (Investigating Kafka)
• Offline
  • MapReduce
  • Streaming
  • Hive
  • Hue
Infrastructure - Sqoozie
• Workflow for MySQL imports to HDFS
• Generate Sqoop commands
• Run these imports in parallel
• Transparent to schema changes
• Include or exclude at the column, data type, or table level
• Data type casting: tinyint(1) to Integer
• Distributed Table Imports
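The command-generation idea above can be sketched roughly as follows. This is not the Sqoozie code: the `TABLES` description, the JDBC URL, the target directory, and both function names are hypothetical. Only the Sqoop flags (`--connect`, `--table`, `--columns`, `--target-dir`, `--map-column-java`) are real Sqoop import options.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical table descriptions; a real workflow would read these from MySQL.
TABLES = {
    "orders": {
        "columns": {"id": "int", "is_paid": "tinyint(1)", "notes": "text"},
        "exclude": {"notes"},  # column-level exclusion
    },
    "events": {
        "columns": {"id": "int", "title": "varchar(255)"},
        "exclude": set(),
    },
}


def build_sqoop_command(table, spec,
                        jdbc_url="jdbc:mysql://db.example/prod",  # assumed URL
                        target_root="/imports"):                  # assumed HDFS path
    """Return one `sqoop import` command as an argument list."""
    columns = [c for c in spec["columns"] if c not in spec["exclude"]]
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--columns", ",".join(columns),
        "--target-dir", f"{target_root}/{table}",
    ]
    # Cast tinyint(1) columns to Integer, as the slide describes.
    casts = [f"{c}=Integer" for c in columns
             if spec["columns"][c] == "tinyint(1)"]
    if casts:
        cmd += ["--map-column-java", ",".join(casts)]
    return cmd


def run_in_parallel(commands, runner=subprocess.run, workers=4):
    """Run every command through `runner` using a small thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))


commands = [build_sqoop_command(t, s) for t, s in sorted(TABLES.items())]
```

Passing `runner` as a parameter keeps the sketch testable without a Sqoop installation; calling `run_in_parallel(commands)` with the default runner would actually launch the imports.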
Infrastructure - Blammo
• Raw logs are imported to HDFS via Flume
• Almost real-time – 5 min latency
• Logs are key-value pairs in JSON
• Each log producer publishes schema in YAML
• Hive schema and schema YAML kept in sync using Thrift
• Control exclusion and inclusion
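As a rough illustration of keeping a Hive table in step with a producer's published schema, the sketch below turns an already-parsed schema into Hive DDL text. It is an assumption-laden simplification: the `SCHEMA` layout, the `hive_ddl` name, and the `/logs` location are hypothetical, the JSON SerDe clause is omitted, and it only generates a statement rather than talking to Hive over Thrift as the slide describes.

```python
# Hypothetical schema, shaped as it might look after parsing a producer's YAML.
SCHEMA = {
    "table": "page_views",
    "fields": [
        {"name": "uid", "type": "bigint"},
        {"name": "eid", "type": "bigint"},
        {"name": "debug", "type": "string"},
    ],
    "exclude": ["debug"],  # fields the producer does not want imported
}


def hive_ddl(schema, location_root="/logs"):
    """Build a CREATE EXTERNAL TABLE statement for the included fields."""
    excluded = set(schema.get("exclude", []))
    columns = [
        f"  `{field['name']}` {field['type'].upper()}"
        for field in schema["fields"]
        if field["name"] not in excluded
    ]
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{schema['table']}` (\n"
        + ",\n".join(columns)
        + f"\n)\nLOCATION '{location_root}/{schema['table']}'"
    )


ddl = hive_ddl(SCHEMA)
```

Regenerating the statement whenever the schema file changes is one simple way to keep the two definitions from drifting apart.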
Recommendations
You will like to attend this event
Item Hierarchy (You bought camera so you need batteries - Amazon)
Collaborative Filtering – User-User Similarity (People who bought camera also bought batteries - Amazon)
Collaborative Filtering – Item-Item Similarity (You like Godfather so you will like Scarface - Netflix)
Social Graph Based (Your friends like Lady Gaga so you will like Lady Gaga, PYMK - Facebook, LinkedIn)
Interest Graph Based (Your friends who, like you, like rock music are attending an Eric Clapton event - Eventbrite)
Recommendation Engines
Why Interest?
Events are Social
Events are Interest
Dense Graph is Irrelevant
Interests are Changing
How do we know your Interest?
• We ask you
• Based on your activity
  • Events Attended
  • Events Browsed
• Facebook Interests
  • User Interest has to match Event category
  • Static
• Machine Learning
  • Logistic Regression using MLE
  • Sparse Matrix is generated using MapReduce
  • A model for each interest
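To make the machine-learning bullet concrete, here is a minimal sketch of logistic regression fitted by maximum likelihood, using stochastic gradient ascent on the log-likelihood over sparse rows. The feature names, learning rate, epoch count, and toy data are all assumptions for illustration; the slides do not say which optimizer or features were used. One such model would be trained per interest.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, row):
    """Probability that the user has the interest; `row` maps feature -> value."""
    return sigmoid(sum(weights.get(name, 0.0) * value
                       for name, value in row.items()))


def fit(rows, labels, learning_rate=0.5, epochs=200):
    """Maximize the log-likelihood with per-example gradient steps."""
    weights = {}
    for _ in range(epochs):
        for row, label in zip(rows, labels):
            # Gradient of the log-likelihood for one example is (y - p) * x.
            error = label - predict(weights, row)
            for name, value in row.items():
                weights[name] = weights.get(name, 0.0) + learning_rate * error * value
    return weights


# Toy sparse rows (hypothetical features): only non-zero entries are stored.
rows = [
    {"bias": 1.0, "attended_rock": 1.0},
    {"bias": 1.0, "attended_rock": 1.0},
    {"bias": 1.0, "attended_jazz": 1.0},
    {"bias": 1.0, "attended_jazz": 1.0},
]
labels = [1, 1, 0, 0]  # 1 = user is interested in rock music

rock_model = fit(rows, labels)
```

Storing each row as a dictionary of non-zero features mirrors the sparse matrix mentioned above: the cost of one update depends on the number of non-zero features, not on the total feature count.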
Model Based vs Clustering
Building Social Graph is Clustering Step
Social Graph Recommendation is a Ranking Problem
Item-Item vs User-User
Implicit Social Graph
[Figure: graph connecting users U1–U5 and events E1–E4]
Mixed Social Graph
[Figure: graph connecting users U1–U5, events E1–E3, and FB and LI nodes]
15M * 260 * 260 ≈ 1.01 Trillion Edges
4 Billion edges ranked
Each node is a feature vector representing a User
Each edge is a feature vector representing a Relationship
Feature Generation
• Mixed Features
• A series of map-reduce jobs
• Output on HDFS in flat files; Input to subsequent jobs
• Orders = Event Attendees
  • MAP: eid: uid
  • REDUCE: eid:[uid]
• Attendees Social Graph
  • Input: eid:[uid]
  • MAP: uid_i:[uid]
  • REDUCE: uid:[neighbors]
• Interest based features, user specific, graph mining etc.
• Upload feature values to HBase
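The two chained jobs above (orders to attendee lists, then attendee lists to per-user neighbors) can be sketched in plain Python. This is an in-memory illustration only, not the native Java MapReduce code the talk refers to; `run_job` is a hypothetical stand-in for the framework's shuffle-and-group step, and the sample orders are made up.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Tiny in-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: Orders = Event Attendees.  MAP: eid: uid   REDUCE: eid:[uid]
def map_order(order):
    eid, uid = order
    yield eid, uid


def reduce_attendees(eid, uids):
    return eid, sorted(set(uids))


# Job 2: Attendees Social Graph.  MAP: uid_i:[uid]   REDUCE: uid:[neighbors]
def map_coattendees(event):
    _eid, uids = event
    for uid in uids:
        yield uid, [other for other in uids if other != uid]


def reduce_neighbors(uid, neighbor_lists):
    neighbors = set()
    for neighbor_list in neighbor_lists:
        neighbors.update(neighbor_list)
    return uid, sorted(neighbors)


orders = [("E1", "U1"), ("E1", "U2"), ("E2", "U2"), ("E2", "U3")]  # made-up data
attendees = run_job(orders, map_order, reduce_attendees)
graph = run_job(attendees, map_coattendees, reduce_neighbors)
```

The output of the first job is the input of the second, which matches the "flat files on HDFS as input to subsequent jobs" pattern; here the hand-off is just a Python list.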
[Figure: user nodes U1, U2, U3 and HBase]
HBase
• Collect data from multiple Map Reduce jobs
• Stores entire social graph
• Over one million writes per second
HBase
rowid     neighbors   events   featureX
2718282   101         3        0.3678795
HBase
rowid     314159:n   314159:e   314159:fx   161803:n   161803:e   161803:fx
2718282   31         1          0.3183      83         2          0.618
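The second table suggests a wide-row layout in which each column name is prefixed with another user's id, so that all of one user's edges live in a single row. The sketch below shows one way to flatten per-neighbor features into such column names and read them back. Plain dictionaries stand in for an HBase row, and reading `n`, `e`, and `fx` as per-edge features is an assumption drawn from the column names, not something the slides state.

```python
def edge_columns(edges):
    """Flatten {neighbor_id: {feature: value}} into {"neighbor_id:feature": value}."""
    row = {}
    for neighbor_id, features in edges.items():
        for feature, value in features.items():
            row[f"{neighbor_id}:{feature}"] = value
    return row


def edges_from_columns(row):
    """Inverse of edge_columns: regroup the columns by their neighbor-id prefix."""
    edges = {}
    for column, value in row.items():
        neighbor_id, feature = column.split(":", 1)
        edges.setdefault(neighbor_id, {})[feature] = value
    return edges


# Values copied from the table above for row 2718282.
edges = {
    "314159": {"n": 31, "e": 1, "fx": 0.3183},
    "161803": {"n": 83, "e": 2, "fx": 0.618},
}
row_2718282 = edge_columns(edges)
```

With this layout, fetching one row returns every edge for that user, and adding a new neighbor only adds columns rather than rewriting existing ones.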
Tips & Tricks
• Distributed cache database
• Sped up some Map Reduce jobs by hours
• Be sure to use counters!
Tips & Tricks
• Hive (ab)uses
• Almost as many Hive jobs as custom ones
• “flip join”
• Statistical functions using Hive
• UDF
Tips & Tricks
• Memory Memory Memory
• LZO, WAL
• Combiners are great until
• Shuffle and Sorting stage
• Hadoop ecosystem is still new
Questions?