Page 1:

Building a real-time, self-service data analytics ecosystem

Greg Arnold, Sr. Director Engineering

Page 2:

“Self Service” at scale

[Figure: six numbered callouts, 1 to 6; the image itself is not recoverable]

Page 3:

? Relational?

MPP? Hadoop?

Page 4:

LinkedIn data

• 350M Members

• 4.8B Endorsements

• 2M Jobs

• 3.5M Active company profiles

• 25B Quarterly page views

Page 5:

Translate data into insights

Business Insights, Member Insights

Analytics Infrastructure

Page 6:

The Good Old Days

Page 7:

Data Flow @ 10,000 ft

Page 8:

Scale Challenges

1. Human intervention

2. Long latencies to obtain insights from data.

3. Complexity of integration with increasing data sources.

Page 9:

What does it take to build a self-service, real-time, democratic analytics platform?

Page 10:

Analytics Infra

• Storage and Compute (Hadoop, Pinot, Cubert)

• Data Management Systems [ingest, export, access, workflows] (Gobblin, …)

• Self Serve Applications [reporting, lineage, perf tuning etc.] (WhereHows, Dr. Elephant, …)

• Core Data Warehouse [Views, Metrics, Dimensions, Datasets, Core flows]

Page 11:

Storage and Compute Platforms

Hadoop

• HDFS

• YARN

• Map-Reduce, Spark, Tez

• Pig, Hive, Cubert, Scalding

Pinot

Page 12:

Hadoop @ LinkedIn

• Deployment

  • x Clusters (~x000 nodes)

  • xx+ PB of data

  • xxx k jobs / week

  • xM compute hrs / month

[Diagram labels: Ingest, ETL, Export, R & D, PROD, Online Data Serving]

Page 13:

Supporting > 1000 Hadoop users

• Development process

• do { code, [review], deploy } while (!good);

• Hadoop is complex: lots of knobs, tuning helps

• Performance symptoms are not easily identifiable: the evidence is scattered

• Performance implications of changes

Page 14:

Dr. Elephant: diagnosis
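To make the idea of automated diagnosis concrete, here is a minimal sketch of one kind of rule a tuning tool could apply to a finished job: flag map tasks whose input sizes are badly unbalanced. Dr. Elephant's actual rules and thresholds are not shown on the slide; the function name, severity labels and ratio thresholds below are assumptions made purely for illustration.

```python
from statistics import mean

# Hypothetical severity scale and thresholds, chosen only for illustration.
SEVERITY = ["NONE", "LOW", "MODERATE", "SEVERE", "CRITICAL"]
SKEW_THRESHOLDS = [2.0, 4.0, 8.0, 16.0]  # ratio of large-half average to small-half average


def mapper_skew_severity(input_bytes_per_task):
    """Rate how unevenly input data is spread across a job's map tasks."""
    if len(input_bytes_per_task) < 2:
        return "NONE"
    ordered = sorted(input_bytes_per_task)
    half = len(ordered) // 2
    small, large = ordered[:half], ordered[half:]
    low_avg = mean(small)
    if low_avg == 0:
        # Half of the tasks read nothing: treat any non-empty half as extreme skew.
        return "CRITICAL" if mean(large) > 0 else "NONE"
    ratio = mean(large) / low_avg
    # Count how many thresholds the ratio reaches; that count indexes the scale.
    level = sum(ratio >= t for t in SKEW_THRESHOLDS)
    return SEVERITY[level]
```

For example, `mapper_skew_severity([100, 100, 1000, 1000])` rates a 10x imbalance as `"SEVERE"` under these assumed thresholds, which gathers otherwise scattered per-task evidence into a single signal a user can act on.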

Page 15:

What about real-time analytics?

Page 16:

Slow Queries

Page 17:

Solution

• Avoid joins at query time when possible.

• Denormalize data in Hadoop and load into a fast engine for slice-n-dice.
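The two points above can be sketched as a batch step that resolves the join once, followed by queries that only filter and count. This is an illustration under stated assumptions, not LinkedIn's pipeline: the tables, field names and helper functions are all hypothetical.

```python
from collections import Counter

# Hypothetical normalized inputs: a fact table of view events and a member dimension.
members = {
    1: {"region": "EMEA", "industry": "Software"},
    2: {"region": "APAC", "industry": "Finance"},
}
view_events = [
    {"viewer_id": 1, "viewee_id": 2},
    {"viewer_id": 2, "viewee_id": 1},
    {"viewer_id": 1, "viewee_id": 2},
]


def denormalize(events, member_dim):
    """Batch step: resolve the join once and emit flat, self-contained rows."""
    flat = []
    for e in events:
        viewer = member_dim[e["viewer_id"]]
        flat.append({
            "viewee_id": e["viewee_id"],
            "viewer_region": viewer["region"],
            "viewer_industry": viewer["industry"],
        })
    return flat


def views_by(flat_rows, viewee_id, dimension):
    """Query step: a filter plus a count over flat rows, with no join needed."""
    return Counter(r[dimension] for r in flat_rows if r["viewee_id"] == viewee_id)


flat_rows = denormalize(view_events, members)
```

Here `views_by(flat_rows, 2, "viewer_region")` counts two views from EMEA. The cost of the join is paid once in the batch layer, so the serving engine never has to look up the member dimension at query time.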

Page 18:

Real-time analytics

• A challenge for Hadoop

• Slice and dice billions of records, hundreds of dimensions

• End-to-end freshness of minutes, not hours

• Sub-second query response times

• e.g. Which are the top regions that contribute to my profile views? Which industries in those regions?

Page 19:

Pinot for real-time analytics

• Distributed, fault-tolerant

• Compressed columnar indexes

• Data ingestion from Kafka and Hadoop

• No joins, yet.
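As a rough illustration of why compressed columnar indexes help with slice-and-dice queries, the sketch below stores a column as a sorted dictionary plus small integer ids, and keeps an inverted index from each id to its rows. It shows the general technique only; Pinot's real on-disk formats are not described on the slide, and the class and function names are hypothetical.

```python
class DictEncodedColumn:
    """One column stored as a sorted dictionary plus a small integer id per row."""

    def __init__(self, values):
        self.dictionary = sorted(set(values))            # distinct values, stored once
        self._ids = {v: i for i, v in enumerate(self.dictionary)}
        self.forward = [self._ids[v] for v in values]    # row number -> dictionary id
        self.inverted = {}                               # dictionary id -> row numbers
        for row, dict_id in enumerate(self.forward):
            self.inverted.setdefault(dict_id, []).append(row)

    def rows_where(self, value):
        """Return the row numbers matching value without scanning the column."""
        dict_id = self._ids.get(value)
        return [] if dict_id is None else self.inverted[dict_id]

    def value_at(self, row):
        """Decode the value stored at a row number."""
        return self.dictionary[self.forward[row]]


def count_by(filter_col, filter_value, group_col):
    """Filter on one column via its inverted index, then group by another column."""
    counts = {}
    for row in filter_col.rows_where(filter_value):
        key = group_col.value_at(row)
        counts[key] = counts.get(key, 0) + 1
    return counts


region = DictEncodedColumn(["EMEA", "APAC", "EMEA", "NA"])
industry = DictEncodedColumn(["Software", "Finance", "Finance", "Software"])
```

With these toy columns, `count_by(region, "EMEA", industry)` touches only the two matching rows. Repeated strings are stored once in the dictionary, which is where the compression comes from when a column has few distinct values.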

Page 20:

Who viewed my profile

Page 21:

Pinot: Data Flow

[Diagram: ProfileViewEvent flows from Profile through Kafka (minutes) and through Hadoop (hours / days, segment building) into Pinot, which serves Who Viewed My Profile (member-facing) and the Profile Analytics Dashboard (internal).]
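The diagram labels one path "minutes" and the other "hours / days", which raises the question of how a query can use both without counting an event twice. One common approach, sketched below, is to split the query at a time boundary: batch-built data answers everything up to the boundary and stream-ingested data answers everything after it. The slide does not say how Pinot reconciles the two paths, so the function and field names here are assumptions for illustration.

```python
def hybrid_count(offline_rows, realtime_rows, time_boundary):
    """Count events once each by splitting the query at time_boundary.

    offline_rows:  batch-built rows (complete, but hours or days behind)
    realtime_rows: stream-ingested rows (fresh within minutes, may overlap offline)
    """
    from_offline = sum(1 for r in offline_rows if r["ts"] <= time_boundary)
    from_realtime = sum(1 for r in realtime_rows if r["ts"] > time_boundary)
    return from_offline + from_realtime


# Toy data: the event with ts=3 exists in both stores.
offline = [{"ts": 1}, {"ts": 2}, {"ts": 3}]
realtime = [{"ts": 3}, {"ts": 4}, {"ts": 5}]

# Use the newest timestamp covered by batch data as the boundary.
boundary = max(r["ts"] for r in offline)
total = hybrid_count(offline, realtime, boundary)
```

In this toy example `total` is 5, not 6, because the overlapping event is served from the batch side only.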

Page 22:

Pig and Hive are great, but...

• Operate on individual records

• Re-compute scheduled batch ETL jobs with full scans.

• Can do better by reorganizing and processing data in blocks

Page 23:

Cubert: Accelerating Batch computation

[Bar chart: runtime in hours (axis from 0 to 40) for three workloads, XLNT (Statistical), SPI (Graph) and Plato (OLAP Cube), comparing Pig/Hive with Cubert; bar values are not recoverable]

Page 24:

Cubert Internals

• Organizes data in blocks

• Blocks created and transformed with operators

• Cubert provides a scripting language and a runtime to execute the operators in Map-Reduce operations.
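The block idea can be sketched in a few lines: partition rows by key into blocks, keep each block sorted, and let operators work on a whole block in one pass. This is only an illustration of the concept; it is not Cubert's scripting language or runtime, and the function names and the integer-key assumption are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def create_blocks(rows, key, num_blocks):
    """Partition rows by key into num_blocks blocks, each sorted by key."""
    blocks = [[] for _ in range(num_blocks)]
    for row in rows:
        # Integer keys are assumed, so a simple modulo picks the block.
        blocks[row[key] % num_blocks].append(row)
    return [sorted(b, key=itemgetter(key)) for b in blocks]


def count_per_key(block, key):
    """Block operator: aggregate a sorted block in a single sequential pass."""
    return {k: sum(1 for _ in grp) for k, grp in groupby(block, key=itemgetter(key))}


rows = [{"member": 3}, {"member": 1}, {"member": 3}, {"member": 2}]
blocks = create_blocks(rows, "member", 2)
per_block = [count_per_key(b, "member") for b in blocks]
```

Because all rows for a key land in the same sorted block, the aggregation needs no further shuffle or sort, and later operators can reuse the same layout instead of rescanning and regrouping individual records.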

Page 25:

Technology Stack

• Storage and Compute (Hadoop, Pinot, Cubert)

• Data Management Systems [ingest, export, access, workflows] (Gobblin, …)

• Self Serve Applications [reporting, lineage, perf tuning etc.] (WhereHows, Dr. Elephant, …)

• Core Data Warehouse [Views, Metrics, Dimensions, Datasets, Core flows]

Page 26:

Perception

Page 27:

Reality

Page 28:

Unifying Ingress into Hadoop

Page 29:

Ingest operator chain
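An operator chain for ingest can be pictured as a source feeding a sequence of small, composable stages before records are written out. The sketch below is a generic illustration built on Python generators; it does not use Gobblin's API, and the stage names, record fields and the `run_chain` helper are assumptions.

```python
def extract(source):
    """Pull raw records from a source (here, an in-memory list)."""
    yield from source


def convert(records):
    """Normalize each raw record to a common schema."""
    for r in records:
        yield {"member_id": int(r["id"]), "event": r["type"].lower()}


def quality_check(records):
    """Drop records that fail a row-level rule."""
    for r in records:
        if r["member_id"] > 0:
            yield r


def run_chain(source, operators, sink):
    """Feed the source through each operator in order, then write to the sink."""
    stream = extract(source)
    for op in operators:
        stream = op(stream)
    sink.extend(stream)
    return len(sink)  # number of records now held by the sink


raw = [{"id": "7", "type": "VIEW"}, {"id": "-1", "type": "CLICK"}]
landed = []
written = run_chain(raw, [convert, quality_check], landed)
```

Each stage sees only a stream of records, so a new source or an extra check can be added without touching the others, which is the appeal of unifying ingest behind one chain.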

Page 30:

Gobblin: roadmap

• Open source in 2014

• Current work

• Continuous and batch ingest

• Data profiling, summarization

• Flexible deployment

• Resource utilization and sharing

Page 31:

Workflow Management

Oozie, Azkaban, EasyData

Scheduling Backend

Workflow Mgmt Apps

Page 32:

Technology Stack

• Storage and Compute (Hadoop, Pinot, Cubert)

• Data Management Systems [ingest, export, access, workflows] (Gobblin, …)

• Self Serve Applications [reporting, lineage, perf tuning etc.] (WhereHows, Dr. Elephant, …)

• Core Data Warehouse [Views, Metrics, Dimensions, Datasets, Core flows]

Page 33:

“WhereHows” Data Exploration

• Discover datasets

  • Spread across storage systems (HDFS, TD, Kafka…)

  • Murky semantics for data and columns

  • Lineage to traverse relationships

• Discover processes

  • Spread across process execution engines (Azkaban, Ad-hoc, Appworx, EasyData)

  • See code and logic

• Correlate data and processes
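Correlating data and processes amounts to a graph in which processes read some datasets and write others, and lineage is a traversal of that graph. The sketch below walks upstream from one dataset. It is not the WhereHows data model; the process names, dataset names and the `upstream` function are hypothetical.

```python
from collections import deque

# Hypothetical metadata: each process reads some datasets and writes others.
PROCESSES = {
    "ingest": {"reads": ["kafka:views"], "writes": ["hdfs:views_raw"]},
    "aggregate": {
        "reads": ["hdfs:views_raw", "hdfs:members"],
        "writes": ["hdfs:views_by_region"],
    },
}


def upstream(dataset, processes):
    """Return every dataset that directly or indirectly feeds the given dataset."""
    # Index processes by the datasets they produce.
    producers = {}
    for name, io in processes.items():
        for out in io["writes"]:
            producers.setdefault(out, []).append(name)

    # Breadth-first walk from the dataset back through its producers.
    seen, queue = set(), deque([dataset])
    while queue:
        current = queue.popleft()
        for proc in producers.get(current, []):
            for src in processes[proc]["reads"]:
                if src not in seen:
                    seen.add(src)
                    queue.append(src)
    return seen


ancestors = upstream("hdfs:views_by_region", PROCESSES)
```

Here `ancestors` contains `hdfs:views_raw`, `hdfs:members` and `kafka:views`. The same index answers the reverse question (what breaks downstream if a dataset changes) by swapping the roles of `reads` and `writes`.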

Page 34:

WhereHows

Page 35:

Lineage in action

Page 36:

Reporting and Visualization

1. Dashboards

Page 37:

Reporting and Visualization

1. Dashboards

2. Curated Exploration

Page 38:

Reporting and Visualization

1. Dashboards

2. Curated Exploration

3. Ad-hoc

Page 39:

Summary

• Hadoop storage & compute

• Pinot* for real-time querying

• Dr. Elephant* for tuning Hadoop

• Cubert* for batch M/R

• Gobblin*: data ingest

• WhereHows: explore data, lineage

• Reporting: dashboards, curated exploration, ad-hoc

[Diagram labels: Hadoop (HDFS, YARN, Map-Reduce, Spark, Tez, Pig, Hive, Cubert, Scalding); Pinot; Workflow Mgmt (Oozie, Azkaban, EasyData)]

Page 40:

Thanks!

Greg Arnold, Sr. Director Engineering