
The “Big Data” Ecosystem at LinkedIn

Presented by Zhongfang Zhuang

• Based on the paper “The Big Data Ecosystem at LinkedIn”, written by Roshan Sumbaly, Jay Kreps, and Sam Shah.

The Ecosystems

Hadoop Ecosystem

• HDFS
• MapReduce
• Pig
• Hive
• Oozie
• HBase
• ZooKeeper
• Flume
• Sqoop

Data Integration Problem

“Some analysts performed this integration themselves. In other cases, analysts—especially application users and scripters—relied on the IT team to assemble this data for them.”

–Sean Kandel et al., Enterprise Data Analysis and Visualization: An Interview Study

Ecosystem at LinkedIn

• Online datacenters
• Offline production datacenters
• Offline development datacenters

Hadoop for ETL

Data flows from the online datacenters into Hadoop for ETL, and on to the offline production datacenters.

ETL: Extract, Transform, Load. More information about ETL is available from Cloudera.

Online Datacenters

• Voldemort
• Oracle

Voldemort

• Combines an in-memory caching and storage system
• Possible to emulate the storage layer
• Reads and writes scale horizontally
• Simple API
• Transparent data partitioning

More information about Voldemort is available at: http://www.project-voldemort.com/voldemort/
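The "simple API" and "transparent data partitioning" points can be illustrated with a minimal sketch. This is not Voldemort's real client API; the class and method names below are made up for illustration, and the partitioning is a plain hash-mod scheme rather than Voldemort's consistent hashing.

```python
# Illustrative sketch (not Voldemort's actual client): a tiny key-value
# store with a get/put API where the caller never sees which partition
# holds a key -- partitioning is transparent.
import hashlib


class TinyKVStore:
    def __init__(self, num_partitions=4):
        # Each partition stands in for a storage node.
        self.partitions = [{} for _ in range(num_partitions)]

    def _partition(self, key):
        # Deterministically map a key to a partition via a hash.
        digest = hashlib.md5(key.encode()).hexdigest()
        return int(digest, 16) % len(self.partitions)

    def put(self, key, value):
        self.partitions[self._partition(key)][key] = value

    def get(self, key, default=None):
        return self.partitions[self._partition(key)].get(key, default)


store = TinyKVStore()
store.put("member:42", {"name": "Ada"})
print(store.get("member:42"))  # {'name': 'Ada'}
```

Because every key deterministically maps to one partition, reads and writes can be spread across nodes without any routing state in the client.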

Next

• Ingress and egress out of the Hadoop system
• Data flows

Ingress: from online to offline.


• Database data includes information about users, companies, connections, and other primary site data.

• Event data includes logs of page views being served, search queries, and clicks.


Challenges

• Provide infrastructure that can make all data available without manual intervention or processing.
• Datasets are very large
• Data is diverse
• Data schemas are evolving
• Data monitoring and validation

Kafka

A high-throughput distributed messaging system, used for all activity data.

• Persists messages in a write-ahead log, partitioned and distributed over multiple brokers
• Allows data publishers to add records to a log
• New data streams can be added and scaled independently of one another.

Example

Collecting data about searches being performed: the search service would produce these records and publish them to a topic named “SearchQueries”, where any number of subscribers might read these messages.
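The pattern in this example can be sketched with an in-memory stand-in for the log. This is not Kafka's real API; it only shows the shape of the idea: producers append to a per-topic log, and each subscriber reads from its own offset independently.

```python
# Hedged sketch of the publish/subscribe pattern described above, using an
# in-memory append-only log per topic instead of real Kafka brokers.
from collections import defaultdict


class TinyLog:
    def __init__(self):
        # topic -> append-only list of records
        self.topics = defaultdict(list)

    def publish(self, topic, record):
        self.topics[topic].append(record)

    def read(self, topic, offset=0):
        # Each subscriber tracks its own offset and reads independently,
        # so adding a subscriber does not affect the others.
        return self.topics[topic][offset:]


log = TinyLog()
log.publish("SearchQueries", {"query": "data engineer", "member": 42})
log.publish("SearchQueries", {"query": "hadoop", "member": 7})
print(log.read("SearchQueries", offset=1))  # only the second record
```

Decoupling producers from consumers this way is what lets new streams be added and scaled independently, as the slide above notes.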

Data Evolution

• Requires schema changes
• Two common solutions:

• Load data streams in whatever form they appear.

• Manually map the source data into a stable, well-thought-out schema.

LinkedIn’s Solution

• Retains the same structure throughout the data pipeline
• Enforces compatibility and other correctness conventions
• Schema evolves automatically
• Schema checks are done at compile time and run time.
• Review process.
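One common way to enforce the compatibility conventions mentioned above is a backward-compatibility check on record schemas. The rules below are a simplified assumption in the style of Avro's resolution rules, not LinkedIn's actual checker: a new schema version must give every added field a default, and must not change the type of an existing field.

```python
# Illustrative schema-compatibility check (simplified, Avro-style rules).
# Schemas are lists of field dicts: {"name", "type", optional "default"}.
def is_backward_compatible(old_fields, new_fields):
    old = {f["name"]: f for f in old_fields}
    for f in new_fields:
        prev = old.get(f["name"])
        if prev is None:
            if "default" not in f:       # a new field must carry a default
                return False
        elif prev["type"] != f["type"]:  # type changes break old readers
            return False
    return True


v1 = [{"name": "query", "type": "string"}]
v2 = v1 + [{"name": "locale", "type": "string", "default": "en_US"}]
v3 = v1 + [{"name": "locale", "type": "string"}]  # missing default

print(is_backward_compatible(v1, v2))  # True
print(is_backward_compatible(v1, v3))  # False
```

A check like this can run at compile time (when a producer is built) and again at run time, matching the slide's "compile and run time" point.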

Load into Hadoop

• Pull data from Kafka to Hadoop with a map-only job.
• Recurs every 10 minutes.
• Scheduled by the Azkaban scheduling system.

• Two Kafka clusters are kept synchronized automatically.

• The primary Kafka cluster supports production.

[Diagram: services in the online datacenters publish to a Kafka cluster, which is mirrored to a second Kafka cluster in the offline development datacenters.]

Audit

• Each step in the flow publishes an audit trail

• Consists of the topic, machine name, etc.

• Confirm all published events reached all consumers by aggregating audit data.
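The aggregation step can be sketched as follows. The record shape `(topic, tier, count)` and the tier names are assumptions for illustration, not LinkedIn's actual audit schema; the idea is just that delivery is complete when every consuming tier has counted as many events as the producer published.

```python
# Sketch of audit aggregation: each tier reports how many events it saw
# per topic; a topic is "complete" when every tier's count matches the
# producer's count. Record shape and tier names are illustrative.
from collections import defaultdict


def audit_complete(audit_records, producer_tier="producer"):
    counts = defaultdict(lambda: defaultdict(int))  # topic -> tier -> count
    for topic, tier, count in audit_records:
        counts[topic][tier] += count
    ok = {}
    for topic, tiers in counts.items():
        expected = tiers[producer_tier]
        ok[topic] = all(c == expected for c in tiers.values())
    return ok


records = [
    ("SearchQueries", "producer", 100),
    ("SearchQueries", "hadoop", 100),
    ("PageViews", "producer", 50),
    ("PageViews", "hadoop", 49),  # one event lost in transit
]
print(audit_complete(records))  # {'SearchQueries': True, 'PageViews': False}
```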

Workflow

Processed offline.

Workflow in Hadoop

• A directed acyclic graph of dependencies.

• Wrappers help restrict the data being read to a certain time range, based on specified parameters.

• The Hadoop-Kafka data pull job places the data into time-partitioned folders.
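The time-partitioned layout can be sketched like this. The path scheme `/data/<topic>/YYYY/MM/DD/HH` is an assumption for illustration, not the actual directory convention; the point is that a wrapper can expand a requested time range into exactly the folders a downstream job should read.

```python
# Sketch of time-partitioned folders: the pull job writes each batch into
# an hourly dated folder, and a wrapper expands a time range into the set
# of folders to read. The path scheme is illustrative.
from datetime import datetime, timedelta


def partition_path(topic, ts):
    return f"/data/{topic}/{ts:%Y/%m/%d/%H}"


def folders_for_range(topic, start, end):
    folders, cur = [], start
    while cur <= end:
        folders.append(partition_path(topic, cur))
        cur += timedelta(hours=1)
    return folders


start = datetime(2013, 6, 1, 10)
print(folders_for_range("SearchQueries", start, start + timedelta(hours=2)))
# ['/data/SearchQueries/2013/06/01/10', '.../11', '.../12']
```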

Workflow in Hadoop

• A directed acyclic graph of dependencies.

• One workflow can contain 50-100 jobs.

Azkaban

An open-source workflow scheduler.

Azkaban

• Supports multiple job types
• Jobs run individually or chained
• Configurations and dependencies are maintained
• Visualize and manipulate dependencies via graphs in the UI.
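The core scheduling idea behind a DAG of chained jobs can be shown in a few lines. The job names below are made up, and this uses Python's standard-library topological sorter rather than anything from Azkaban itself; it only demonstrates that a valid run order always places dependencies before their dependents.

```python
# Sketch of DAG scheduling as a workflow engine like Azkaban might do it:
# jobs map to the sets of jobs they depend on, and a topological sort
# yields a valid execution order. Job names are illustrative.
from graphlib import TopologicalSorter

deps = {
    "extract": set(),           # no prerequisites
    "transform": {"extract"},   # runs after extract
    "train": {"transform"},
    "push": {"train"},
}

order = list(TopologicalSorter(deps).static_order())
print(order)  # ['extract', 'transform', 'train', 'push']
```

A real scheduler would also run independent branches of the DAG in parallel and retry failed jobs, but the ordering constraint is the same.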

Construction of Workflows

• Experiment with the data and massage it into a useful form

• Transform features into feature vectors

• Train models
• Iterate on workflows

[Diagram: the offline development datacenters run Kafka plus Hadoop clusters for ETL and for development, each with its own Azkaban instance; the offline production datacenters run a Hadoop cluster for production, also scheduled by Azkaban, which pushes to staging Voldemort read-only stores.]

Egress

Back online!

Results are usually pushed to other systems (back online, or for further consumption).

Three main mechanisms:

• Key-Value
• Streams
• OLAP

Voldemort

• Based on Amazon’s Dynamo
• Distributed and elastic
• Horizontally scalable
• Bulk load pipeline from Hadoop
• Simple to use

Reference: Slides of The big data ecosystem at LinkedIn, presented at SIGMOD 2013

Stream output performed by Kafka

• Implemented using the Hadoop API
• Each MR slot acts as a Kafka producer
• More details available at http://kafka.apache.org

OLAP by Avatara

• Automatically generates corresponding Azkaban job pipelines.

• Small cubes are multidimensional arrays of tuples

• Each tuple is a combination of dimension and measure pairs

• Output cubes are pushed to Voldemort as a read-only store.
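The "small cube" idea can be sketched as follows. The field names (`viewee`, `industry`, `country`, `count`) are invented for illustration; the technique is that events are pre-aggregated into one small cube per primary key, so an online query only has to filter or group within a single key's cube.

```python
# Sketch of Avatara-style small cubes: per primary key, a set of
# (dimension-tuple, measure) pairs aggregated offline. Field names
# are illustrative, not the actual Avatara schema.
from collections import defaultdict


def build_cubes(events, key_dim, group_dims, measure):
    cubes = defaultdict(lambda: defaultdict(int))
    for e in events:
        dims = tuple(e[d] for d in group_dims)
        cubes[e[key_dim]][dims] += e[measure]
    # Each small cube: sorted list of (dimension-tuple, measure) pairs.
    return {k: sorted(v.items()) for k, v in cubes.items()}


views = [
    {"viewee": 1, "industry": "tech", "country": "US", "count": 3},
    {"viewee": 1, "industry": "tech", "country": "US", "count": 2},
    {"viewee": 1, "industry": "finance", "country": "UK", "count": 1},
]
print(build_cubes(views, "viewee", ["industry", "country"], "count"))
# {1: [(('finance', 'UK'), 1), (('tech', 'US'), 5)]}
```

Serving each member's cube as a single value in a read-only key-value store keeps online queries fast, since all the aggregation happened offline.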

Applications

• Key-Value: “People you may know”, Collaborative Filtering, Skill Endorsements, Related Searches
• Stream: News Feed Updates, Email, Relationship Strength
• OLAP: “Who’s viewed my profile?”, “Who’s viewed this job?”

Future Work

• MapReduce for processing large graphs

• A streaming system for nearline, low-latency data processing