The bigdata ecosystem at linkedIn - WPIweb.cs.wpi.edu/~cs525/f13b-EAR/cs525-homepage/...LinkedIn’s...

The “Big Data” Ecosystem at LinkedIn Presented by Zhongfang Zhuang


Page 1:

The “Big Data” Ecosystem at LinkedIn

Presented by Zhongfang Zhuang

Page 2:

• Based on the paper “The Big Data Ecosystem at LinkedIn”, written by Roshan Sumbaly, Jay Kreps, and Sam Shah.

Page 3:

The Ecosystems

Page 4:

Hadoop Ecosystem

Page 5:

HDFS

MapReduce

Pig

Hive

Oozie

HBase

ZooKeeper

Flume

Sqoop

Page 6:

Data Integration Problem

Page 7:

“Some analysts performed this integration themselves. In other cases, analysts—especially application users and scripters—relied on the IT team to assemble this data for them.”

–Sean Kandel et al., Enterprise Data Analysis and Visualization: An Interview Study

Page 8:

Ecosystem at LinkedIn

Page 9:

Online datacenters

Offline production datacenters

Offline development datacenters

Hadoop for ETL

Page 10:

Online datacenters

Offline production datacenters

Page 11:

ETL: Extract-Transform-Load

More information about ETL is available from Cloudera

Page 12:

Online Datacenters

• Voldemort

• Oracle

Page 13:

Voldemort

• Combines in-memory caching with the storage system

• Possible to emulate the storage layer

• Reads and writes scale horizontally

• Simple API

• Transparent data partitioning

More information about Voldemort is available at: http://www.project-voldemort.com/voldemort/
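
The "simple API" and "transparent data partitioning" points can be illustrated with a toy sketch. This is not Voldemort's client API: the class `TinyStore` and its hash-modulo placement are assumptions made for illustration only, and a real system would use a more robust placement scheme and networked nodes.

```python
import hashlib


class TinyStore:
    """Toy key-value store that spreads keys over in-memory 'nodes' (hypothetical)."""

    def __init__(self, num_nodes=3):
        # Each node is just a dict here; a real node would be a server.
        self.nodes = [dict() for _ in range(num_nodes)]

    def _node_for(self, key):
        # Stable hash, so the same key always maps to the same node.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return int(digest, 16) % len(self.nodes)

    def put(self, key, value):
        self.nodes[self._node_for(key)][key] = value

    def get(self, key, default=None):
        return self.nodes[self._node_for(key)].get(key, default)


store = TinyStore(num_nodes=3)
store.put("member:42", {"name": "Ada"})
print(store.get("member:42"))
```

The caller only sees `put` and `get`; which node holds the key is decided inside the store, which is the sense in which partitioning is transparent.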

Page 14:

Next

• Ingress into and egress out of the Hadoop system

• Data flows

Page 15:

Ingress

From online to offline

Page 16:

Online datacenters

Hadoop for ETL

Page 17:

• Database includes information about users, companies, connections, and other primary site data.

• Event data includes logs of page views being served, search queries, and clicks.

Online datacenters

Hadoop for ETL

Page 18:

Challenges

• Provide infrastructure that can make all data available without manual intervention or processing.

Page 19:

Challenges

• Datasets are very large

• Data is diverse

• Data schemas are evolving

• Data needs monitoring and validation

Page 20:

Kafka

A high-throughput distributed messaging system for all activity data.

Page 21:

• Persists messages in a write-ahead log, partitioned and distributed over multiple brokers

• Allows data publishers to add records to a log

• Allows new data streams to be added and scaled independently of one another
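
A minimal sketch of a partitioned, append-only log may make these points concrete. It is not Kafka's implementation or API: the class `PartitionedLog`, the CRC32-based partition choice, and the in-memory lists standing in for brokers are all assumptions for illustration.

```python
import zlib


class PartitionedLog:
    """Toy append-only log split into partitions (hypothetical, in-memory)."""

    def __init__(self, num_partitions=4):
        # Each partition is an ordered list; records are only ever appended.
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, record):
        """Append a record and return (partition, offset)."""
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(record)
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset):
        """Return all records in a partition starting at the given offset."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=2)
p1, off1 = log.append("member:1", "page_view")
p2, off2 = log.append("member:1", "click")
print(log.read(p1, 0))
```

Records with the same key land in the same partition and keep their order there, while different partitions can be placed on different brokers and grow independently.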

Page 22:

Example

• Collecting data about searches being performed. The search service would produce these records and publish them to a topic named “SearchQueries”, where any number of subscribers might read these messages.
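
The “SearchQueries” example can be sketched as a toy publish/subscribe broker. Only the topic name comes from the slide; the `Broker` class, the subscriber names, and keeping read positions inside the broker are simplifying assumptions, not how Kafka itself is built.

```python
from collections import defaultdict


class Broker:
    """Toy broker: one list of messages per topic, one read position per subscriber."""

    def __init__(self):
        self.topics = defaultdict(list)   # topic -> messages in publish order
        self.offsets = defaultdict(int)   # (topic, subscriber) -> next offset to read

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic, subscriber):
        """Return the messages this subscriber has not seen yet."""
        start = self.offsets[(topic, subscriber)]
        messages = self.topics[topic][start:]
        self.offsets[(topic, subscriber)] = start + len(messages)
        return messages


broker = Broker()
broker.publish("SearchQueries", {"query": "data engineer"})
broker.publish("SearchQueries", {"query": "kafka"})

# Two independent subscribers (hypothetical names) each see every message.
hadoop_batch = broker.poll("SearchQueries", "hadoop-loader")
monitor_batch = broker.poll("SearchQueries", "monitoring")
```

Because each subscriber has its own read position, adding a new subscriber does not affect the search service or the existing readers.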

Page 23:

Data Evolution

• Requires schema changes

• Two common solutions:

• Load data streams in whatever form they appear.

• Manually map the source data into a stable, well-thought-out schema.

Page 24:

LinkedIn’s Solution

• Retains the same structure throughout our data pipeline

• Enforces compatibility and other correctness conventions

• Schema evolves automatically

• Schema check is done at compile and run time

• Review process
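
One way to picture an automated compatibility check is the sketch below. The rule it applies (a newly added field must carry a default, and an existing field must keep its type) is an assumed convention chosen for illustration; the slides do not spell out the actual rules, and the schema layout and the function name `compatibility_problems` are hypothetical.

```python
def compatibility_problems(old, new):
    """Return a list of human-readable problems; an empty list means compatible.

    Schemas are plain dicts: field name -> {"type": ..., "default": ...(optional)}.
    """
    problems = []
    for name, spec in new.items():
        if name not in old:
            # Assumed rule: readers of old data need a value for the new field.
            if "default" not in spec:
                problems.append(f"new field '{name}' has no default")
        elif spec["type"] != old[name]["type"]:
            problems.append(
                f"field '{name}' changed type from {old[name]['type']} to {spec['type']}"
            )
    return problems


v1 = {"member_id": {"type": "long"}, "query": {"type": "string"}}
v2_ok = {**v1, "locale": {"type": "string", "default": "en_US"}}
v2_bad = {**v1, "locale": {"type": "string"}}

print(compatibility_problems(v1, v2_ok))
print(compatibility_problems(v1, v2_bad))
```

A check of this kind could run both when a schema change is submitted and when data is loaded, which matches the "compile and run time" point in spirit.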

Page 25:

Load into Hadoop

• Pull data from Kafka to Hadoop with a map-only job

• Runs every 10 minutes

• Scheduled by the Azkaban scheduling system

Page 26:

• Two Kafka clusters kept synchronized automatically.

• The primary Kafka supports production.

Online Datacenters

Services

Services

Kafka

Offline Development Datacenters

Kafka

Page 27:

Audit

• Every step in the flow publishes an audit trail

• The audit trail consists of the topic, machine name, etc.

• Confirms that all published events reached all consumers by aggregating audit data
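
The aggregation step can be sketched as a count comparison between two stages of the pipeline. The record fields `tier`, `window`, and `count`, the tier names, and the function `find_loss` are assumptions for illustration; the slide only names the topic and machine name as audit fields.

```python
from collections import Counter


def find_loss(audit_records, source_tier, sink_tier):
    """Compare per-topic, per-window counts between two tiers.

    Returns {(topic, window): missing_count} for every mismatch.
    """
    sent = Counter()
    received = Counter()
    for record in audit_records:
        key = (record["topic"], record["window"])
        if record["tier"] == source_tier:
            sent[key] += record["count"]
        elif record["tier"] == sink_tier:
            received[key] += record["count"]
    return {key: sent[key] - received[key] for key in sent if sent[key] != received[key]}


audit = [
    {"topic": "SearchQueries", "machine": "web01", "tier": "producer", "window": "10:00", "count": 100},
    {"topic": "SearchQueries", "machine": "web02", "tier": "producer", "window": "10:00", "count": 50},
    {"topic": "SearchQueries", "machine": "etl01", "tier": "hadoop", "window": "10:00", "count": 150},
    {"topic": "SearchQueries", "machine": "web01", "tier": "producer", "window": "10:10", "count": 80},
    {"topic": "SearchQueries", "machine": "etl01", "tier": "hadoop", "window": "10:10", "count": 75},
]
loss = find_loss(audit, "producer", "hadoop")
print(loss)
```

An empty result means every event counted at the source was also counted at the sink for each time window.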

Page 28:

Workflow

Processed offline

Page 29:

Workflow in Hadoop

• A directed acyclic graph of dependencies.

• Wrappers help restrict data being read to a certain time range based on parameters specified.

• Hadoop-Kafka data pull job places the data into time-partitioned folders.
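
Time-partitioned folders and a time-range wrapper can be sketched together. The folder layout (`root/topic/YYYY/MM/DD/HH`), the hourly granularity, and both function names are assumptions for illustration; the slides do not give the actual layout.

```python
from datetime import datetime, timedelta


def partition_path(root, topic, ts):
    """Folder that holds one hour of data for a topic (hypothetical layout)."""
    return f"{root}/{topic}/{ts:%Y/%m/%d/%H}"


def paths_in_range(root, topic, start, end):
    """Hourly folders overlapping the half-open time range [start, end)."""
    current = start.replace(minute=0, second=0, microsecond=0)
    paths = []
    while current < end:
        paths.append(partition_path(root, topic, current))
        current += timedelta(hours=1)
    return paths


paths = paths_in_range(
    "/data", "SearchQueries",
    datetime(2013, 6, 1, 22, 30),
    datetime(2013, 6, 2, 1, 0),
)
for path in paths:
    print(path)
```

A job that only needs recent data then reads a handful of folders instead of scanning the whole dataset.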

Page 30:

Workflow in Hadoop

• A directed acyclic graph of dependencies.

• A single workflow can contain 50-100 jobs.
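
A workflow as a directed acyclic graph can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The job names below are hypothetical, and this shows only how a valid execution order follows from the dependencies, not how any particular scheduler works internally.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (hypothetical job names).
workflow = {
    "extract_events": set(),
    "build_features": {"extract_events"},
    "train_model": {"build_features"},
    "score_members": {"train_model", "build_features"},
    "push_to_store": {"score_members"},
}

# static_order() yields every job after all of its dependencies;
# it raises graphlib.CycleError if the graph is not acyclic.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

With 50-100 jobs in one workflow, deriving the order from declared dependencies is far less error-prone than sequencing the jobs by hand.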

Page 31:

Azkaban

An open-source workflow scheduler.

Page 32:

Azkaban

• Supports multiple job types

• Jobs run individually or chained

• Configurations and dependencies are maintained

• Visualize and manipulate dependencies via graphs in the UI

Page 33:

Construction of Workflows

• Experiment with the data and massage it into a useful form

• Transform features into feature vectors

• Train models from the feature vectors

• Iterate on workflows

Page 34:

Offline Development Datacenters

Kafka

Hadoop for ETL

Hadoop for development

Azkaban

Azkaban

Page 35:

Offline Production Datacenters

Staging Voldemort Read-

Hadoop for production

Azkaban

Page 36:

Egress

Back online

Page 37:

Results

• Usually pushed to other systems (back online or for further consumption)

Page 38:

Three main mechanisms

• Key-Value

• Streams

• OLAP

Page 39:

Voldemort

• Based on Amazon’s Dynamo

• Distributed and elastic

• Horizontally scalable

• Bulk load pipeline from Hadoop

• Simple to use

Reference: Slides of The big data ecosystem at LinkedIn, presented at SIGMOD 2013

Page 40:

Stream output performed by Kafka

• Implemented using the Hadoop API

• Each MR slot acts as a Kafka producer

• More details available at http://kafka.apache.org

Page 41:

OLAP by Avatara

• Automatically generates corresponding Azkaban job pipelines

• Small cubes are multidimensional arrays of tuples

• Each tuple is a combination of dimension and measure pairs

• Outputs cubes to Voldemort as a read-only store
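
The idea of a small cube as a set of dimension and measure tuples can be sketched as follows. This is not Avatara's code: the function `build_cubes`, the event fields, and the choice of one cube per member are assumptions made for illustration.

```python
from collections import defaultdict


def build_cubes(events, cube_key, dimensions, measure):
    """Build one small cube per cube_key value.

    Each cube maps a tuple of dimension values to the summed measure.
    """
    cubes = defaultdict(lambda: defaultdict(int))
    for event in events:
        dims = tuple(event[d] for d in dimensions)
        cubes[event[cube_key]][dims] += event[measure]
    # Convert to plain dicts so the result is easy to serialize or inspect.
    return {key: dict(cells) for key, cells in cubes.items()}


# Hypothetical profile-view events.
events = [
    {"member": 1, "industry": "software", "week": "2013-W22", "views": 1},
    {"member": 1, "industry": "software", "week": "2013-W22", "views": 1},
    {"member": 1, "industry": "finance", "week": "2013-W23", "views": 1},
    {"member": 2, "industry": "software", "week": "2013-W22", "views": 1},
]
cubes = build_cubes(events, "member", ["industry", "week"], "views")
print(cubes[1])
```

Each per-member cube is small enough to store as a single value under the member's key, which fits a read-only key-value store as the serving layer.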

Page 42:

Applications

Page 43:

Applications

Key-Value

• “People you may know”

• Collaborative Filtering

• Skill Endorsements

• Related Searches

Stream

• News Feed Updates

• Email

• Relationship Strength

OLAP

• “Who’s viewed my profile?”

• “Who’s viewed this job?”

Page 44:

Future Work

Page 45:

• MapReduce for processing large graphs

• Streaming system for nearline, low-latency data processing