Building LinkedIn’s Real-time Data Pipeline


Page 1: Building LinkedIn’s Real-time Data Pipeline

Building LinkedIn’s Real-time Data Pipeline

Jay Kreps

Page 2: Building LinkedIn’s Real-time Data Pipeline
Page 3: Building LinkedIn’s Real-time Data Pipeline

What is a data pipeline?

Page 4: Building LinkedIn’s Real-time Data Pipeline

What data is there?
• Database data
• Activity data – page views, ad impressions, etc.
• Messaging – JMS, AMQP, etc.
• Application and system metrics – rrdtool, Graphite, etc.
• Logs – syslog, log4j, etc.

Page 5: Building LinkedIn’s Real-time Data Pipeline
Page 6: Building LinkedIn’s Real-time Data Pipeline

Data Systems at LinkedIn
• Search
• Social Graph
• Recommendations
• Live Storage
• Hadoop
• Data Warehouse
• Monitoring Systems
• …

Page 7: Building LinkedIn’s Real-time Data Pipeline

Problem: Data Integration

Page 8: Building LinkedIn’s Real-time Data Pipeline

Point-to-Point Pipelines

Page 9: Building LinkedIn’s Real-time Data Pipeline

Centralized Pipeline

Page 10: Building LinkedIn’s Real-time Data Pipeline

How have companies solved this problem?

Page 11: Building LinkedIn’s Real-time Data Pipeline

The Enterprise Data Warehouse

Page 12: Building LinkedIn’s Real-time Data Pipeline

Problems
• The data warehouse is a batch system
• A central team that cleans all data?
• One person’s cleaning…
• Relational mapping is non-trivial…

Page 13: Building LinkedIn’s Real-time Data Pipeline

My Experience

Page 14: Building LinkedIn’s Real-time Data Pipeline

LinkedIn’s Pipeline

Page 15: Building LinkedIn’s Real-time Data Pipeline

LinkedIn Circa 2010
• Messaging: ActiveMQ
• User activity: in-house log aggregation
• Logging: Splunk
• Metrics: JMX => Zenoss
• Database data: Databus, custom ETL

Page 16: Building LinkedIn’s Real-time Data Pipeline

2010 User Activity Data Flow

Page 17: Building LinkedIn’s Real-time Data Pipeline

Problems
• Fragility
• Multi-hour delay
• Coverage
• Labor intensive
• Slow
• Does it work?

Page 18: Building LinkedIn’s Real-time Data Pipeline

Four Ideas
1. Central commit log for all data
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness

Page 19: Building LinkedIn’s Real-time Data Pipeline

Four Ideas
1. Central commit log for all data
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness

Page 20: Building LinkedIn’s Real-time Data Pipeline

What kind of infrastructure is needed?

Page 21: Building LinkedIn’s Real-time Data Pipeline

Very confused
• Messaging (JMS, AMQP, …)
• Log aggregation
• CEP, streaming

Page 22: Building LinkedIn’s Real-time Data Pipeline

First Attempt: Don’t reinvent the wheel!

Page 23: Building LinkedIn’s Real-time Data Pipeline
Page 24: Building LinkedIn’s Real-time Data Pipeline

Problems With Messaging Systems
• Persistence is an afterthought
• Ad hoc distribution
• Odd semantics
• Featuritis

Page 25: Building LinkedIn’s Real-time Data Pipeline

Second Attempt: Reinvent the wheel!

Page 26: Building LinkedIn’s Real-time Data Pipeline

Idea: Central, Distributed Commit Log

Page 27: Building LinkedIn’s Real-time Data Pipeline

What is a commit log?
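The idea the talk builds on can be sketched in a few lines: a commit log is just an append-only, ordered sequence of records, each addressed by a monotonically increasing offset. This is a minimal in-memory illustration, not Kafka's implementation.

```python
class CommitLog:
    """Toy commit log: an append-only, ordered list of records,
    each identified by its offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Append a record and return the offset it was assigned."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset):
        """Return all records from the given offset onward."""
        return self._records[offset:]


log = CommitLog()
log.append("user_1 viewed page A")   # offset 0
log.append("user_2 clicked ad B")    # offset 1
# A consumer that remembers offset 1 resumes exactly where it left off:
assert log.read(1) == ["user_2 clicked ad B"]
```

Because readers track their own offsets, many independent consumers can replay the same log at their own pace, which is what makes it useful as a central pipeline.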

Page 28: Building LinkedIn’s Real-time Data Pipeline

Data Flow

Page 29: Building LinkedIn’s Real-time Data Pipeline

Apache Kafka

Page 30: Building LinkedIn’s Real-time Data Pipeline

Some Terminology
• Producers send messages to brokers
• Consumers read messages from brokers
• Messages are sent to a topic
• Each topic is broken into one or more ordered partitions of messages

Page 31: Building LinkedIn’s Real-time Data Pipeline

APIs
• send(String topic, String key, Message message)
• Iterator&lt;Message&gt;
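The two calls above can be sketched against an in-memory broker. This is illustrative only, not Kafka's actual client API: the class names are invented, and a real broker persists partitions to disk rather than holding lists in memory.

```python
import zlib
from collections import defaultdict


class ToyBroker:
    """Sketch of the topic/partition model: each topic holds a fixed
    number of ordered partitions, and a message's key decides which
    partition it lands in (hypothetical names, not Kafka's API)."""

    def __init__(self, partitions_per_topic=2):
        self.n = partitions_per_topic
        self.topics = defaultdict(lambda: [[] for _ in range(self.n)])

    def send(self, topic, key, message):
        # Same key -> same partition, so per-key ordering is preserved.
        # crc32 is used here for a deterministic hash.
        partition = zlib.crc32(key.encode()) % self.n
        self.topics[topic][partition].append(message)

    def consume(self, topic, partition):
        # Consumers get an iterator over one partition's ordered messages.
        return iter(self.topics[topic][partition])


broker = ToyBroker()
broker.send("page_views", "user_1", "viewed /home")
broker.send("page_views", "user_1", "viewed /jobs")
```

Partitioning by key is what lets a topic scale across machines while still giving each consumer a totally ordered stream within a partition.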

Page 32: Building LinkedIn’s Real-time Data Pipeline

Distribution

Page 33: Building LinkedIn’s Real-time Data Pipeline

Performance
• 50 MB/sec writes
• 110 MB/sec reads

Page 34: Building LinkedIn’s Real-time Data Pipeline

Performance

Page 35: Building LinkedIn’s Real-time Data Pipeline

Performance Tricks
• Batching – producer, broker, consumer
• Avoid large in-memory structures – pagecache friendly
• Avoid data copying – sendfile
• Batch compression
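The batch-compression point is easy to demonstrate: compressing a batch of similar messages exploits redundancy across messages, whereas compressing each message alone pays the codec's fixed overhead every time. A small sketch using gzip (the exact codec is incidental):

```python
import gzip

# 100 similar JSON-ish event records, as a stand-in for activity data.
messages = [
    b'{"event":"page_view","page":"/home","user":%d}' % i
    for i in range(100)
]

# Compressing each message individually: per-message header overhead
# dominates and cross-message redundancy is never seen by the codec.
individual = sum(len(gzip.compress(m)) for m in messages)

# Compressing the whole batch at once shares one header and lets the
# codec exploit repetition across messages.
batched = len(gzip.compress(b"\n".join(messages)))

assert batched < individual
```

The same reasoning motivates batching without compression too: fewer, larger network and disk operations amortize fixed per-operation costs.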

Page 36: Building LinkedIn’s Real-time Data Pipeline

Kafka Replication
• In the 0.8 release
• Messages are highly available
• No centralized master

Page 37: Building LinkedIn’s Real-time Data Pipeline

Kafka Info

http://incubator.apache.org/kafka

Page 38: Building LinkedIn’s Real-time Data Pipeline

Usage at LinkedIn
• 10 billion messages/day
• Sustained peak: 172,000 messages/second written, 950,000 messages/second read
• 367 topics
• 40 real-time consumers
• Many ad hoc consumers
• 10k connections/colo
• 9.5 TB log retained
• End-to-end delivery time: 10 seconds (avg)

Page 39: Building LinkedIn’s Real-time Data Pipeline

Datacenters

Page 40: Building LinkedIn’s Real-time Data Pipeline

Four Ideas
1. Central commit log for all data
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness

Page 41: Building LinkedIn’s Real-time Data Pipeline

Problem
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?

Page 42: Building LinkedIn’s Real-time Data Pipeline

Make activity data part of the domain model

Page 43: Building LinkedIn’s Real-time Data Pipeline

Schema free?

Page 44: Building LinkedIn’s Real-time Data Pipeline

LOAD 'student' USING PigStorage() AS (name:chararray, age:int, gpa:float)

Page 45: Building LinkedIn’s Real-time Data Pipeline

Schemas
• Structure can be exploited – performance, size
• Compatibility
• Need a formal contract

Page 46: Building LinkedIn’s Real-time Data Pipeline

Avro Schema
• Avro – data definition and schema
• Central repository of all schemas
• Reader always uses the same schema as the writer
• Programmatic compatibility model
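The "programmatic compatibility model" can be illustrated with a deliberately simplified check. Avro's real schema-resolution rules are richer than this; the sketch below only captures two common constraints: a field's type may not change, and a newly added field must carry a default so old data can still be read.

```python
def can_evolve(old, new):
    """Toy compatibility check, loosely inspired by Avro's rules (the
    real resolution algorithm is richer). Schemas are modeled as dicts
    of field name -> (type, default), where default None means the
    field has no default value."""
    for name, (ftype, _default) in old.items():
        if name in new and new[name][0] != ftype:
            return False  # changing a field's type breaks readers
    for name, (_ftype, default) in new.items():
        if name not in old and default is None:
            return False  # new fields need defaults to read old data
    return True


old_schema = {"member_id": ("long", None), "page": ("string", None)}

# Adding a field with a default is a compatible evolution:
good = {"member_id": ("long", None), "page": ("string", None),
        "referrer": ("string", "")}

# Changing a field's type is not:
bad = {"member_id": ("string", None), "page": ("string", None)}
```

Running a check like this at schema check-in time is what turns compatibility from a runtime surprise into a code-review-time decision.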

Page 47: Building LinkedIn’s Real-time Data Pipeline

Workflow
1. Check in schema
2. Code review
3. Ship

Page 48: Building LinkedIn’s Real-time Data Pipeline

Four Ideas
1. Central commit log for all data
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness

Page 49: Building LinkedIn’s Real-time Data Pipeline

Hadoop Data Load
• A Map/Reduce job does the data load
• One job loads all events
• Hive registration done automatically
• Schema changes handled transparently
• ~5 minute average lag to HDFS

Page 50: Building LinkedIn’s Real-time Data Pipeline

Four Ideas
1. Central commit log for all data
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness

Page 51: Building LinkedIn’s Real-time Data Pipeline

Does it work?

Page 52: Building LinkedIn’s Real-time Data Pipeline

All messages sent must be delivered to all consumers (quickly)

Page 53: Building LinkedIn’s Real-time Data Pipeline

Audit Trail
• Each producer, broker, and consumer periodically reports how many messages it saw
• Reconcile these counts every few minutes
• Graph and alert
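The reconciliation step above can be sketched as a comparison of per-tier counts for one time window; the function and tier names here are illustrative, not LinkedIn's actual audit system.

```python
def reconcile(counts_by_tier, tolerance=0):
    """Compare how many messages each tier saw in a window against
    the producer's count; return the tiers that disagree by more
    than the tolerance (hypothetical sketch of a count audit)."""
    expected = counts_by_tier["producer"]
    return [tier for tier, seen in counts_by_tier.items()
            if abs(seen - expected) > tolerance]


# All tiers agree: nothing to alert on.
assert reconcile({"producer": 1000, "broker": 1000, "consumer": 1000}) == []

# A consumer that silently dropped 5 messages shows up immediately.
assert reconcile({"producer": 1000, "broker": 1000,
                  "consumer": 995}) == ["consumer"]
```

Graphing these deltas over time, and alerting when they exceed the tolerance, is what turns "does it work?" into a measurable property rather than an assumption.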

Page 54: Building LinkedIn’s Real-time Data Pipeline

Audit Trail

Page 55: Building LinkedIn’s Real-time Data Pipeline

Questions?