Big Data at uberVU
Mihnea Giurgea
Lead Developer @ uberVU
Contents
Introduction
Infrastructure
Signals
Lessons learned
Questions?
● Gather mentions that refer to a specific subject from the web:
  ○ a brand (Coca-Cola)
  ○ an event (Comic-con), etc.
● Search, aggregate and analyze data to provide statistics and insights to clients
● Everything is within the context of a "stream"
What do we do?
● social media monitoring, reporting and engagement
● the reactive market led us toward actionable insights
● Signals - collection of intelligent algorithms
  ○ some machine learning
  ○ mostly just statistics
Signals
● Twitter top influencers
  ○ reach out to promote your brand
● spikes & bursts
  ○ be aware of global events for specific annotations
● "asking for help" mentions
  ○ generate leads
● trending stories
  ○ promote and raise engagement
BigData?
What is "big data"?
Wikipedia says: "Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications"
The LHC is one of the biggest data sources:
● produces 15 PB per year (~41 TB per day)
BigData?
What's our data?
● ~70M mentions per day
  ○ tweets
  ○ Facebook (public) posts
  ○ Google+ posts, etc.
● 100s of Amazon instances
● we record ~3TB per month
Contents
Introduction
Infrastructure
Signals
Lessons learned
Questions?
Technologies
● Amazon Web Services
● MongoDB - lots of use-cases
● Kestrel - fast & low administration
● Redis - fast, in-memory dataset
● DynamoDB - fast, easy to scale, but $$$
Data acquisition
● collect data from multiple platforms (20+)○ Twitter, Facebook, Google+, blogs, boards, etc.
● specialized workers per platform, so that adding a new platform is easy
● in-house refreshing system
  ○ periodically poll each feed
  ○ adjust refresh rate according to activity (see the sketch below)
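The deck doesn't show the refresh policy itself; as an illustration only, here is a toy Python version of "adjust refresh rate according to activity". The function name and all constants are hypothetical:

```python
def next_refresh_interval(interval, new_items,
                          min_interval=60.0, max_interval=3600.0):
    """Toy adaptive-polling policy (hypothetical constants):
    active feeds get polled sooner, quiet feeds back off."""
    if new_items > 0:
        interval /= 2.0     # activity: refresh more often
    else:
        interval *= 1.5     # no activity: refresh less often
    return max(min_interval, min(max_interval, interval))
```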
Data processing
Each mention needs to be processed by multiple modules:
● language detection
● sentiment detection
● location detection
● persistence (database storage)
● ...
(preferably in real-time)
Data processing
Workers
● each processing step is done by a specialized worker
  ○ input: a (single) tweet
  ○ output: a (processed) tweet
  ○ output is sent to the next worker or written to the database
● a worker is just a Python process (see the sketch below)
  ○ each worker runs in multiple instances
  ○ across multiple machines
  ○ capable of processing multiple tweets in parallel
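A worker can be pictured as a tiny read-process-write loop. This is a minimal sketch, not uberVU's code: `detect_language` is a stand-in module, and the queue objects are assumed to expose get()/put():

```python
import json

def detect_language(text):
    """Stand-in for a real processing module (language detection)."""
    return "en"

def run_worker(input_queue, output_queue):
    """Pull one tweet from this worker's input queue, process it,
    and hand the result to the next worker's queue."""
    while True:
        raw = input_queue.get()
        if raw is None:
            continue                      # queue empty; poll again
        tweet = json.loads(raw)
        tweet["lang"] = detect_language(tweet.get("text", ""))
        output_queue.put(json.dumps(tweet))
```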
Queues
● workers communicate using a system of distributed queues
● each worker has its own input queue
● queues need to be persistent
● multiple independent Kestrel servers
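Kestrel speaks the memcache text protocol, so a plain memcache client can serve as the queue transport: set on a queue name enqueues, get dequeues. A minimal sketch (host and queue name are made up):

```python
import memcache  # Kestrel speaks the memcache text protocol

kestrel = memcache.Client(["kestrel-1.internal:22133"])  # hypothetical host

# Enqueue: `set` on a queue name appends an item.
kestrel.set("sentiment_detection", '{"text": "I love this brand!"}')

# Dequeue: `get` pops the next item, or returns None when the queue is empty.
item = kestrel.get("sentiment_detection")
```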
Kestrel sharding
● multiple independent servers (Round-robin)
  ○ start with a random server
  ○ switch to the next server every 100 operations
  ○ or when a server is down
● dequeue
  ○ switch to the next server when the queue is empty
● sharding causes loose ordering :(
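A sketch of the round-robin scheme above, reusing the memcache-protocol client; the 100-operation rotation and the failure/empty-queue skips come from the slide, everything else is improvised:

```python
import random
import memcache

class ShardedKestrel(object):
    """Round-robin over independent Kestrel servers: start on a random
    one, rotate every 100 operations, and skip ahead when a server is
    down or (on dequeue) its queue is empty."""

    OPS_PER_SERVER = 100

    def __init__(self, servers):
        self.clients = [memcache.Client([s]) for s in servers]
        self.index = random.randrange(len(self.clients))
        self.ops = 0

    def _rotate(self):
        self.index = (self.index + 1) % len(self.clients)
        self.ops = 0

    def _client(self):
        self.ops += 1
        if self.ops > self.OPS_PER_SERVER:
            self._rotate()
        return self.clients[self.index]

    def enqueue(self, queue_name, payload):
        if not self._client().set(queue_name, payload):
            self._rotate()                  # server down: try the next one
            self.clients[self.index].set(queue_name, payload)

    def dequeue(self, queue_name):
        item = self._client().get(queue_name)
        if item is None:
            self._rotate()                  # empty here: try another shard next time
        return item
```

The loose ordering lamented on the slide falls straight out of this: items interleave across shards in whatever order readers happen to rotate through them.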
Kestrel sharding
Fault-tolerance
Guaranteed by queueing system
● Kestrel servers are durable
  ○ persisted to EBS
  ○ resistant to instance failure
● ack messages using Kestrel's primitives (see the sketch below)
  ○ /open to read
  ○ /close to confirm a processed message
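Kestrel's /open and /close suffixes give at-least-once delivery. A hedged sketch of the ack cycle (host and queue name are invented):

```python
import memcache

kestrel = memcache.Client(["kestrel-1.internal:22133"])  # hypothetical host

def process_one(queue_name, handler):
    """Tentatively read with /open, ack with /close. If the worker
    crashes between the two, Kestrel re-delivers the open item to
    the next reader instead of losing it."""
    item = kestrel.get(queue_name + "/open")
    if item is not None:
        handler(item)
        kestrel.get(queue_name + "/close")  # confirm: item can be deleted
```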
Failure scenarios
● worker failure?
  ○ just a small decrease in processing speed
● Kestrel server failure?
  ○ using multiple servers (Round-robin)
  ○ workload will be handled by the remaining servers
  ○ possible performance impact
  ○ some messages will be temporarily unavailable
Scalability
● easily scalable by adding more nodes
  ○ both Kestrel servers & workers
● worker requirements
  ○ stateless
  ○ fail-fast
  ○ boot-fast
● small granularity
  ○ allows you to grow infrastructure costs steadily
Contents
Introduction
Infrastructure
Signals
Lessons learned
Questions?
Signals
● machine learning turned out to be very slow
  ○ language detection
  ○ sentiment detection
  ○ mention clustering
● most ML boils down to matrix multiplication
● detecting sentiment for one tweet at a time is very inefficient
Signals
Solution?
● batch processing
  ○ e.g.: run the algorithm for 100 tweets at a time
● within the same pipeline
● batch modules "wait" to gather more data
Signals
Still real-time?
● yes, via max_wait (see the sketch below)
  ○ don't wait more than 30 seconds
● overall performance increased
  ○ despite the artificial "wait"
  ○ then we found a lot of other use-cases for this
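The batch-plus-max_wait trade-off fits in a few lines: collect up to a full batch, but never block past the deadline. The figures (100 tweets, 30 seconds) come from the slides; the queue interface is assumed:

```python
import time

def gather_batch(queue, batch_size=100, max_wait=30.0):
    """Return up to `batch_size` items, but never wait more than
    `max_wait` seconds, so processing stays (almost) real-time."""
    batch = []
    deadline = time.time() + max_wait
    while len(batch) < batch_size and time.time() < deadline:
        item = queue.get()
        if item is None:
            time.sleep(0.1)      # queue momentarily empty; retry
            continue
        batch.append(item)
    return batch
```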
Twitter Influencers
● find the Twitter users that are most influential
● in a given context (topic)
  ○ e.g.: "big data OR #bigdata location:Romania"
● measured "right now"
  ○ top influencers change quickly, every few days
Twitter Influencers
Build a Twitter graph using:
● each user is a node
  ○ Weight(node) = f(# tweets of user)
  ○ only measure activity in the context of the given topic
● each retweet is a directed edge
  ○ Weight(edge) = g(# retweets)
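The deck leaves f and g unspecified, so plain counts stand in for them here. A minimal in-memory version of the graph (the Redis-backed rolling version comes a few slides later):

```python
from collections import defaultdict

# Weight(node) = f(# tweets); Weight(edge) = g(# retweets).
# f and g aren't specified in the deck, so plain counts stand in.
node_weight = defaultdict(int)    # user -> weight
edge_weight = defaultdict(int)    # (retweeter, author) -> weight

def add_tweet(user):
    node_weight[user] += 1

def add_retweet(retweeter, author):
    edge_weight[(retweeter, author)] += 1
```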
Influencers Graph
Twitter Influencers
● we're only interested in recent data
  ○ e.g.: the last few days
● because we want to determine current influencers
  ○ not all-time influencers
● how: use Redis to create a "rolling graph"
  ○ Redis > memcache for its eviction strategies
Twitter Influencers
● each tweet updates a node
● each retweet updates an edge
● use Redis to expire old data
● => only recent data will be stored
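One way to realize the "rolling graph" with Redis is a key per node/edge whose TTL is refreshed on every update, so anything idle for the whole window simply expires. A sketch with redis-py; the window length and key scheme are assumptions, and note this drops a node a few days after its *last* activity, an approximation of a true sliding window:

```python
import redis

r = redis.StrictRedis()
WINDOW = 3 * 24 * 3600           # keep roughly the last few days

def add_tweet(user):
    key = "node:%s" % user
    r.incr(key)                  # bump Weight(node)
    r.expire(key, WINDOW)        # idle nodes age out automatically

def add_retweet(retweeter, author):
    key = "edge:%s:%s" % (retweeter, author)
    r.incr(key)                  # bump Weight(edge)
    r.expire(key, WINDOW)
```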
Twitter Influencers
Algorithm:
● similar to Google's PageRank
● influencers are computed almost in real-time
  ○ batch processing (hence "almost" real-time)
  ○ continuous computation
  ○ each subgraph is updated every ~30 minutes
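The deck only says "similar to PageRank"; as an illustration, here is a textbook weighted PageRank over the retweet edges. Influence flows from retweeter to retweeted author; dangling nodes and the subgraph batching are ignored in this sketch:

```python
def influence_scores(edge_weight, damping=0.85, iterations=20):
    """Weighted-PageRank-style scores over (retweeter, author) edges."""
    nodes = set()
    for src, dst in edge_weight:
        nodes.update((src, dst))
    out_sum = dict.fromkeys(nodes, 0.0)
    for (src, _), w in edge_weight.items():
        out_sum[src] += w
    rank = dict.fromkeys(nodes, 1.0 / len(nodes))
    for _ in range(iterations):
        nxt = dict.fromkeys(nodes, (1.0 - damping) / len(nodes))
        for (src, dst), w in edge_weight.items():
            nxt[dst] += damping * rank[src] * w / out_sum[src]
        rank = nxt
    return rank
```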
Contents
Introduction
Infrastructure
Signals
Lessons learned
Questions?
Lessons learned
Monitoring is vital, monitor everything!
● we use Graphite for everything
  ○ number of messages
  ○ average processing speed
  ○ usage reports (histograms)
  ○ etc.
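Graphite accepts datapoints over a plaintext socket protocol ("path value timestamp"), which makes instrumenting a worker cheap. A tiny pusher, with a hypothetical host and metric name:

```python
import socket
import time

def send_metric(path, value, host="graphite.internal", port=2003):
    """Push one datapoint to Graphite's plaintext listener."""
    line = "%s %s %d\n" % (path, value, int(time.time()))
    sock = socket.create_connection((host, port))
    sock.sendall(line.encode("ascii"))
    sock.close()

send_metric("workers.sentiment.processed_per_sec", 42)
```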
Lessons learned
● Assume nothing
  ○ Eventually, everything that can fail, will!
● Plan for everything
  ○ "Failing to plan is planning to fail"
● Failures are usually correlated
  ○ Expect multiple components to fail at the same time.
Thank you!
Mihnea Giurgea @ uberVU
Credits to the following colleagues:
● Andrei Vasilescu
● Sonia Stan
● Bogdan Sandulescu
Questions?