Post on 27-Jan-2015
COMPLEMENTING HADOOP WITH REAL-TIME DATA ANALYSIS
SPEAKER: Vipul Sharma, Director of Data Engineering, Eventbrite
Monday, April 1, 13
Real Time Data Processing at Scale
Vipul Sharma – Director of Data Engineering
Eventbrite by the Numbers
1.5 million events
80 million tickets sold
$1 billion in gross ticket sales
Events in 179 countries
Who am I?
Director of Data Engineering at Eventbrite
Infrastructure, Data Science, Analytics, Spam and Fraud
linkedin.com/in/vipulsharma3
@vipulsharma
vipul@eventbrite.com
Real Time
• Definition of real time varies with use case
• Real time at scale is a challenge
• Active learning requires real time data processing
  • Spam/Fraud
  • Discovery
  • Search
• Analytics
  • Real time analytics
• Data Changes
  • Changes in inventory, user settings etc.
Scaling for Growth
• Decouple Services
  • Decouple services based on CAP, Size and Growth
  • NoSQL is attractive for out-of-the-box sharding, replication and multi-data-center support, along with high write speeds
  • Multiple data stores pose the challenge of moving data between services in real time
• Batch Processing
  • Batch processing for big data, e.g. data science, analytics etc.
  • MapReduce is not built for real time
  • Data locality requires data to be stored on HDFS
  • Syncing data to Hadoop in real time is a challenge
Challenges with Real Time
• Data Flow
  • How to transfer data captured in logs to services in real time
  • How to transfer data captured in database to services in real time
• Data Processing
  • How to process significant data in real time
  • Distributed data processing for real time
Data Flow
• Database polling
  • Rather than each application polling, build a single polling service
  • Downstream applications poll from this service
  • Built for consistency and read scalability
  • Example: Event Cache
  • Excited about LinkedIn’s Databus - http://data.linkedin.com/projects/databus
• Persisted Queues
  • Transfer logs via a distributed persisted message queue
  • Downstream applications subscribe to these queues, getting a stream of data
  • Example: Firehose
  • Excited about LinkedIn’s Kafka - http://kafka.apache.org/index.html
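The two data-flow patterns above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not Eventbrite's Event Cache or Firehose: the `ChangeFeed` class, the `events` table and its monotonically increasing `version` column are all hypothetical, and an in-memory list stands in for a persisted queue. One poller reads the database, appends changes to a log, and each downstream consumer reads from its own offset.

```python
import sqlite3
from collections import defaultdict


class ChangeFeed:
    """Hypothetical single polling service: polls one table and serves
    the changes to many downstream consumers, each tracked by offset."""

    def __init__(self, conn):
        self.conn = conn
        self.last_version = 0            # highest `version` already polled
        self.log = []                    # append-only log of changed rows
        self.offsets = defaultdict(int)  # consumer name -> next log index

    def poll_database(self):
        """The only component that queries the source database."""
        rows = self.conn.execute(
            "SELECT id, name, version FROM events "
            "WHERE version > ? ORDER BY version",
            (self.last_version,),
        ).fetchall()
        for row in rows:
            self.log.append(row)
            self.last_version = row[2]
        return len(rows)

    def read(self, consumer, max_items=100):
        """Return the next batch for `consumer` and advance its offset."""
        start = self.offsets[consumer]
        batch = self.log[start:start + max_items]
        self.offsets[consumer] = start + len(batch)
        return batch


# Usage: an in-memory SQLite database stands in for the production store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, name TEXT, version INTEGER)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, "Concert", 1), (2, "Meetup", 2)],
)

feed = ChangeFeed(conn)
feed.poll_database()
search_batch = feed.read("search")  # the "search" consumer catches up

conn.execute(
    "UPDATE events SET name = 'Concert (sold out)', version = 3 WHERE id = 1"
)
feed.poll_database()                # only the changed row is appended
```

Because each consumer keeps its own offset, a slow or newly added consumer can replay the log from the start without putting extra load on the database, which is the read-scalability point made in the bullets.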
Data Processing
• Denormalization
  • Write data ready to serve
  • NoSQL built for denormalization
  • Example: See who’s visiting
• Distributed Data Processing
  • Complex business logic needs more than denormalization
  • Example: API stats using Storm
  • http://storm-project.net/
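Both processing styles above can be illustrated with a small single-process Python sketch. It does not use Storm and is not the talk's implementation: `record_visit`, `who_is_visiting`, `ApiStats` and the `RECENT_LIMIT` cap are hypothetical names chosen for illustration. The first part writes a ready-to-serve visitor list at write time (denormalization); the second counts API calls per endpoint in fixed one-minute windows, the kind of aggregation a stream processor would distribute across workers.

```python
from collections import Counter, defaultdict, deque

RECENT_LIMIT = 3  # hypothetical cap on visitors kept per event

# Denormalized view: event id -> most recent visitors, newest first.
recent_visitors = defaultdict(lambda: deque(maxlen=RECENT_LIMIT))


def record_visit(event_id, user_name):
    """Update the ready-to-serve list at write time."""
    recent_visitors[event_id].appendleft(user_name)


def who_is_visiting(event_id):
    """Serving is a plain lookup; no joins or aggregation at read time."""
    return list(recent_visitors[event_id])


class ApiStats:
    """Counts API calls per endpoint in fixed (tumbling) time windows."""

    def __init__(self, window_seconds=60):
        self.window_seconds = window_seconds
        # window start (seconds) -> endpoint -> call count
        self.counts = defaultdict(Counter)

    def observe(self, timestamp, endpoint):
        """Assign the call to the window containing `timestamp`."""
        ts = int(timestamp)
        window_start = ts - ts % self.window_seconds
        self.counts[window_start][endpoint] += 1

    def top(self, window_start, n=3):
        """Return the n busiest endpoints in the given window."""
        return self.counts[window_start].most_common(n)


# Usage with made-up visitors and API calls.
for name in ["ana", "bo", "cy", "di"]:
    record_visit(42, name)

stats = ApiStats()
for ts, endpoint in [(0, "/events"), (10, "/events"), (59, "/orders"), (60, "/events")]:
    stats.observe(ts, endpoint)
```

In a distributed setting the same counting logic would be partitioned by endpoint across workers, with each worker holding only its share of the counters.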
Questions?
See it in action. Download our app:
eventbrite.com/eventbriteapp
Thank You!
@vipulsharma / vipul@eventbrite.com