Building a Data Pipeline from Scratch - Joe Crobak

Joe Crobak (@joecrobak), Tuesday, June 24, 2014, Axium Lyceum - New York, NY

http://www.hakkalabs.co/articles/building-data-pipeline-scratch

Transcript of Building a Data Pipeline from Scratch - Joe Crobak

Page 1: Building a Data Pipeline from Scratch - Joe Crobak

BUILDING A DATA PIPELINE FROM SCRATCH

Joe Crobak @joecrobak

Tuesday, June 24, 2014

Axium Lyceum - New York, NY

Page 2: Building a Data Pipeline from Scratch - Joe Crobak

INTRODUCTION

Software Engineer @ Project Florida

Previously:

• Foursquare

• Adconion Media Group

• Joost

Page 3: Building a Data Pipeline from Scratch - Joe Crobak

OVERVIEW

Why do we care?

Defining Data Pipeline

Events

System Architecture

Page 4: Building a Data Pipeline from Scratch - Joe Crobak

DATA PIPELINES ARE EVERYWHERE

Page 5: Building a Data Pipeline from Scratch - Joe Crobak

RECOMMENDATIONS

http://blog.linkedin.com/2010/05/12/linkedin-pymk/

Page 6: Building a Data Pipeline from Scratch - Joe Crobak

RECOMMENDATIONS

Clicks

Views

Recommendations

http://blog.linkedin.com/2010/05/12/linkedin-pymk/

Page 7: Building a Data Pipeline from Scratch - Joe Crobak

AD NETWORKS

Page 8: Building a Data Pipeline from Scratch - Joe Crobak

AD NETWORKS

Clicks

Impressions

User Ad Profile

Page 9: Building a Data Pipeline from Scratch - Joe Crobak

SEARCH

http://lucene.apache.org/solr/

Page 10: Building a Data Pipeline from Scratch - Joe Crobak

SEARCH

Search Rankings

Page Rank

http://www.jevans.com/pubnetmap.html

Page 11: Building a Data Pipeline from Scratch - Joe Crobak

A / B TESTING

https://flic.kr/p/4ieVGa

Page 12: Building a Data Pipeline from Scratch - Joe Crobak

A / B TESTING

https://flic.kr/p/4ieVGa

A conversions

B conversions

Experiment Analysis

Page 13: Building a Data Pipeline from Scratch - Joe Crobak

DATA WAREHOUSING

http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/

Page 14: Building a Data Pipeline from Scratch - Joe Crobak

DATA WAREHOUSING

http://gethue.com/hadoop-ui-hue-3-6-and-the-search-dashboards-are-out/

key metrics

user events

Data Warehouse

Page 15: Building a Data Pipeline from Scratch - Joe Crobak

WHAT IS A DATA PIPELINE?

Page 16: Building a Data Pipeline from Scratch - Joe Crobak

DATA PIPELINE

A Data Pipeline is a unified system for capturing events for analysis and building products.

Page 17: Building a Data Pipeline from Scratch - Joe Crobak

DATA PIPELINE

click data

user events

Data Warehouse

web visits

email sends

Product Features

Ad hoc analysis

• Counting

• Machine Learning

• Extract Transform Load (ETL)

Page 18: Building a Data Pipeline from Scratch - Joe Crobak

DATA PIPELINE

A Data Pipeline is a unified system for capturing events for analysis and building products.

Page 19: Building a Data Pipeline from Scratch - Joe Crobak

EVENTS

Page 20: Building a Data Pipeline from Scratch - Joe Crobak

EVENTS

Each of these actions can be thought of as an event.

Page 21: Building a Data Pipeline from Scratch - Joe Crobak

COARSE-GRAINED EVENTS

• Events are captured as a by-product.

• Stored in text logs used primarily for debugging and secondarily for analysis.

Page 22: Building a Data Pipeline from Scratch - Joe Crobak

COARSE-GRAINED EVENTS

127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969

Fields: IP Address, Timestamp, Action, Status

• Events are captured as a by-product.

• Stored in text logs used primarily for debugging and secondarily for analysis.
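To make the field breakdown concrete, here is a minimal sketch of pulling those fields out of the sample line. The regular expression is written for this one line only, and the name `size` for the trailing number is an assumption (the slide labels only the first four fields).

```python
import re
from datetime import datetime

# Pattern for the sample line: ip, two unused fields, [timestamp], "request", status, size.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<action>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_log_line(line):
    """Turn one access-log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # The sample timestamp ends in a literal "UTC", so it is matched as plain text.
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%d/%b/%Y:%H:%M:%S UTC")
    fields["status"] = int(fields["status"])
    return fields

event = parse_log_line('127.0.0.1 - - [17/Jun/2014:01:53:16 UTC] "GET / HTTP/1.1" 200 3969')
print(event["ip"], event["action"], event["status"])
```

Note how little the parsed record says about what the user actually did: everything beyond "a GET on /" has to be inferred at analysis time.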

Page 23: Building a Data Pipeline from Scratch - Joe Crobak

COARSE-GRAINED EVENTS

Implicit tracking, i.e. a “page load” event is a proxy for one or more other events.

e.g. the event GET /newsfeed corresponds to:

• App Load (but only if this is the first time loaded this session)

• Timeline load, user is in “group A” of an A/B Test

These implementation details have to be known at analysis time.

Page 24: Building a Data Pipeline from Scratch - Joe Crobak

FINE-GRAINED EVENTS

Record events like:

• app opened

• auto refresh

• user pull down refresh

Rather than:

• GET /newsfeed

Page 25: Building a Data Pipeline from Scratch - Joe Crobak

FINE-GRAINED EVENTS

Annotate events with contextual information like:

• view the user was on

• which button was clicked
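A fine-grained event with context might look like the following sketch. The `Event` class, its field names, and the `emit` helper are all hypothetical, and JSON lines are used only to keep the example short.

```python
import json
import time
from dataclasses import dataclass, field, asdict

@dataclass
class Event:
    name: str                       # what happened, e.g. "user_pull_down_refresh"
    user_id: str
    view: str                       # the view the user was on
    element: str = ""               # which button was clicked, if any
    timestamp: float = field(default_factory=time.time)

def emit(event, sink):
    """Serialize the event as one JSON line and append it to the sink (a list here)."""
    sink.append(json.dumps(asdict(event), sort_keys=True))

log = []
emit(Event(name="app_opened", user_id="u1", view="home"), log)
emit(Event(name="button_clicked", user_id="u1", view="newsfeed", element="refresh"), log)
print(log[1])
```

Because the event names the action and carries its context, analysis code no longer needs to know which URL a given interaction happened to trigger.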

Page 26: Building a Data Pipeline from Scratch - Joe Crobak

FINE-GRAINED EVENTS

Decouple logging and analysis. Create events for everything!

Page 27: Building a Data Pipeline from Scratch - Joe Crobak

FINE-GRAINED EVENTS

A couple of schema-less formats are popular (e.g. JSON and CSV), but they have drawbacks.

• harder to change schemas

• inefficient

• require writing parsers

Page 28: Building a Data Pipeline from Scratch - Joe Crobak

SCHEMA

Used to describe data, providing a contract about fields and their types.

Two schemas are compatible if you can read data written in schema 1 with schema 2.
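The compatibility rule can be sketched with a toy schema representation. Everything here (the dict layout, `REQUIRED`, `can_read`, `read`) is an assumption made for illustration; real serialization frameworks apply richer rules than this.

```python
# A schema here is a dict: field name -> (type name, default or REQUIRED).
REQUIRED = object()

SCHEMA_1 = {"user_id": ("string", REQUIRED), "action": ("string", REQUIRED)}
SCHEMA_2 = {"user_id": ("string", REQUIRED), "action": ("string", REQUIRED),
            "view": ("string", "unknown")}
SCHEMA_3 = {"user_id": ("long", REQUIRED), "action": ("string", REQUIRED)}

def can_read(writer, reader):
    """True if data written with `writer` can be read with `reader`."""
    for name, (type_name, default) in reader.items():
        if name in writer:
            if writer[name][0] != type_name:
                return False          # same field, conflicting types
        elif default is REQUIRED:
            return False              # reader needs a field the writer never wrote
    return True

def read(record, reader):
    """Project a record onto the reader schema, filling defaults for missing fields."""
    return {name: record.get(name, default) for name, (_, default) in reader.items()}

old_record = {"user_id": "u1", "action": "click"}
print(can_read(SCHEMA_1, SCHEMA_2), read(old_record, SCHEMA_2))
print(can_read(SCHEMA_1, SCHEMA_3))
```

In this sketch, adding a field with a default keeps old data readable, while changing a field's type does not.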

Page 29: Building a Data Pipeline from Scratch - Joe Crobak

SCHEMA

Facilitates automated analytics: summary statistics, session/funnel analysis, A/B testing.

Page 30: Building a Data Pipeline from Scratch - Joe Crobak

SCHEMA

https://engineering.twitter.com/research/publication/the-unified-logging-infrastructure-for-data-analytics-at-twitter

Facilitates automated analytics: summary statistics, session/funnel analysis, A/B testing.

Page 31: Building a Data Pipeline from Scratch - Joe Crobak

SCHEMA

client:page:section:component:element:action

e.g.: iphone:home:mentions:tweet:button:click

Count iPhone users clicking from home page: iphone:home:*:*:*:click

Count home clicks on buttons or avatars: *:home:*:*:{button,avatar}:click
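The two queries above can be sketched as pattern matching over event names. The helper names are hypothetical, and every sample event except the first is made up for illustration; this is not how the referenced system implements it.

```python
import re

def compile_pattern(pattern):
    """Translate a pattern such as '*:home:*:*:{button,avatar}:click' into a regex."""
    parts = []
    for component in pattern.split(":"):
        if component == "*":
            parts.append(r"[^:]+")                       # any single component
        elif component.startswith("{") and component.endswith("}"):
            options = component[1:-1].split(",")
            parts.append("(?:" + "|".join(re.escape(o) for o in options) + ")")
        else:
            parts.append(re.escape(component))
    return re.compile(":".join(parts))

def count_matching(pattern, event_names):
    """Count the event names that match the whole pattern."""
    regex = compile_pattern(pattern)
    return sum(1 for name in event_names if regex.fullmatch(name))

events = [
    "iphone:home:mentions:tweet:button:click",
    "iphone:home:timeline:tweet:avatar:click",
    "web:home:mentions:tweet:button:click",
    "iphone:profile:header:bio:link:click",
]
print(count_matching("iphone:home:*:*:*:click", events))
print(count_matching("*:home:*:*:{button,avatar}:click", events))
```

A consistent naming hierarchy is what makes this kind of generic counting possible without per-event parsing code.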

Page 32: Building a Data Pipeline from Scratch - Joe Crobak

KEY COMPONENTS

Page 33: Building a Data Pipeline from Scratch - Joe Crobak

EVENT FRAMEWORK

For easily generating events from your applications

Page 34: Building a Data Pipeline from Scratch - Joe Crobak

EVENT FRAMEWORK

For easily generating events from your applications

Page 35: Building a Data Pipeline from Scratch - Joe Crobak

BIG MESSAGE BUS

• Horizontally scalable

• Redundant

• APIs / easy to integrate

Page 36: Building a Data Pipeline from Scratch - Joe Crobak

BIG MESSAGE BUS

• Scribe (Facebook) • Apache Chukwa • Apache Flume • Apache Kafka*

• Horizontally scalable

• Redundant

• APIs / easy to integrate

* My recommendation

Page 37: Building a Data Pipeline from Scratch - Joe Crobak

DATA PERSISTENCE

For storing your events in files for batch processing

Page 38: Building a Data Pipeline from Scratch - Joe Crobak

DATA PERSISTENCE

For storing your events in files for batch processing

Kite Software Development Kit http://kitesdk.org/

Spring Hadoop http://projects.spring.io/spring-hadoop/

Page 39: Building a Data Pipeline from Scratch - Joe Crobak

WORKFLOW MANAGEMENT

For coordinating the tasks in your data pipeline

Page 40: Building a Data Pipeline from Scratch - Joe Crobak

WORKFLOW MANAGEMENT

For coordinating the tasks in your data pipeline

… or your own system written in your own language of choice.

Page 41: Building a Data Pipeline from Scratch - Joe Crobak

SERIALIZATION FRAMEWORK

Used for converting an Event to bytes on disk. Provides an efficient, cross-language framework for serializing/deserializing data.
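As a rough illustration of "an Event to bytes", the sketch below packs a record with a fixed binary layout and compares its size to the same record as JSON. The layout and field names are hypothetical, and Python's `struct` module stands in for a real serialization framework, which would also handle schemas and cross-language code generation.

```python
import json
import struct

# Hypothetical fixed schema: user_id (unsigned 64-bit), timestamp (unsigned 32-bit),
# action code (unsigned 8-bit), big-endian with no padding.
EVENT_FORMAT = ">QIB"

def serialize(user_id, timestamp, action_code):
    """Pack the three fields into a fixed-width byte string."""
    return struct.pack(EVENT_FORMAT, user_id, timestamp, action_code)

def deserialize(payload):
    """Unpack the bytes back into a (user_id, timestamp, action_code) tuple."""
    return struct.unpack(EVENT_FORMAT, payload)

event = (42, 1403574796, 3)
binary = serialize(*event)
as_json = json.dumps({"user_id": 42, "timestamp": 1403574796, "action_code": 3}).encode("utf-8")
print(len(binary), len(as_json))
```

The binary form is smaller because field names and types live in the schema rather than in every record.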

Page 42: Building a Data Pipeline from Scratch - Joe Crobak

SERIALIZATION FRAMEWORK

• Apache Avro*

• Apache Thrift

• Protocol Buffers (Google)

Used for converting an Event to bytes on disk. Provides an efficient, cross-language framework for serializing/deserializing data.

Page 43: Building a Data Pipeline from Scratch - Joe Crobak

BATCH PROCESSING AND AD HOC ANALYSIS

• Apache Hadoop (MapReduce)

• Apache Hive (or other SQL-on-Hadoop)

• Apache Spark

Page 44: Building a Data Pipeline from Scratch - Joe Crobak

SYSTEM OVERVIEW

Application

logging framework

data serialization

Message Bus

Persistent Storage

Data Warehouse

Ad hoc Analysis

Product data flow

workflow engine

Production DB dumps

Page 45: Building a Data Pipeline from Scratch - Joe Crobak

SYSTEM OVERVIEW (OPINIONATED)

Application

logging framework

data serialization

Message Bus

Persistent Storage

Data Warehouse

Ad hoc Analysis

Product data flow

workflow engine

Production DB dumps

Apache Avro

Apache Kafka

Luigi

Page 46: Building a Data Pipeline from Scratch - Joe Crobak

NEXT STEPS

This architecture opens up a lot of possibilities:

• Near-real-time computation: Apache Storm, Apache Samza (incubating), Apache Spark Streaming.

• Sharing information between services asynchronously, e.g. to augment user profile information.

• Cross-datacenter replication

• Columnar storage

Page 47: Building a Data Pipeline from Scratch - Joe Crobak

LAMBDA ARCHITECTURE

Term coined by Nathan Marz (creator of Apache Storm) for hybrid batch and real-time processing.

Batch processing is treated as the source of truth, and real-time processing updates models/insights between batches.
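A toy sketch of that split, using event counts as the "insight". The names `batch_view`, `SpeedLayer`, and `query` are hypothetical, and a real deployment would run these as separate batch and streaming systems rather than in one process.

```python
from collections import Counter

def batch_view(all_events):
    """Recompute counts from the full event history (the source of truth)."""
    return Counter(e["action"] for e in all_events)

class SpeedLayer:
    """Keeps incremental counts only for events that arrived since the last batch run."""
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["action"]] += 1

    def reset(self):
        self.counts.clear()

def query(batch_counts, speed_layer, action):
    """Answer a query by combining the batch view with the real-time delta."""
    return batch_counts[action] + speed_layer.counts[action]

history = [{"action": "click"}, {"action": "click"}, {"action": "view"}]
batch_counts = batch_view(history)
speed = SpeedLayer()

new_event = {"action": "click"}      # arrives after the batch job ran
speed.on_event(new_event)
before = query(batch_counts, speed, "click")

# The next batch run covers the new event, so the speed layer starts over.
history.append(new_event)
batch_counts = batch_view(history)
speed.reset()
after = query(batch_counts, speed, "click")
print(before, after)
```

The answer is the same before and after the batch run; the real-time layer only bridges the gap between batches.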

Page 48: Building a Data Pipeline from Scratch - Joe Crobak

LAMBDA ARCHITECTURE

http://lambda-architecture.net/

Page 49: Building a Data Pipeline from Scratch - Joe Crobak

SUMMARY

• Data pipelines are everywhere.

• It is useful to think of data as events.

• A unified data pipeline is very powerful.

• A plethora of open-source tools exists for building a data pipeline.

Page 50: Building a Data Pipeline from Scratch - Joe Crobak

FURTHER READING

The Unified Logging Infrastructure for Data Analytics at Twitter

The Log: What every software engineer should know about real-time data's unifying abstraction (Jay Kreps, LinkedIn)

Big Data by Nathan Marz and James Warren

Implementing Microservice Architectures

Page 51: Building a Data Pipeline from Scratch - Joe Crobak

THANK YOU

Questions?

Shameless plug: www.hadoopweekly.com

Page 52: Building a Data Pipeline from Scratch - Joe Crobak

EXTRA SLIDES

Page 53: Building a Data Pipeline from Scratch - Joe Crobak

WHY KAFKA?

• https://kafka.apache.org/documentation.html#design

• Pull model works well

• Easy to configure and deploy

• Good JVM support

• Well-integrated with the LinkedIn stack
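The pull model from the list above can be sketched as an append-only log in which each consumer tracks its own offset. The `Log` and `Consumer` classes are hypothetical and in-memory; this is not the Kafka client API.

```python
class Log:
    """Append-only list of messages; position in the list is the message offset."""
    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset, max_messages):
        return self._messages[offset:offset + max_messages]

class Consumer:
    """Pulls messages at its own pace and remembers how far it has read."""
    def __init__(self, log):
        self.log = log
        self.offset = 0

    def poll(self, max_messages=10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)
        return batch

log = Log()
for name in ["app_opened", "auto_refresh", "button_clicked"]:
    log.append(name)

fast, slow = Consumer(log), Consumer(log)
fast_batch = fast.poll(max_messages=10)
slow_batch = slow.poll(max_messages=1)
print(fast_batch, slow_batch, slow.offset)
```

Because the log does not push, a slow consumer simply falls behind instead of being overwhelmed, and it can catch up later from its saved offset.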

Page 54: Building a Data Pipeline from Scratch - Joe Crobak

WHY LUIGI?

• Scripting language (you’ll end up writing scripts anyway)

• Simplicity (low learning curve)

• Idempotency

• Easy to deploy
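Idempotency here means a task can be rerun safely. One common way to get that, sketched below, is to treat a task as complete once its output exists. The `run_task` helper and the file name are hypothetical, and this is plain Python rather than Luigi's API.

```python
import os
import tempfile

def run_task(output_path, compute):
    """Run `compute` only if its output does not exist yet; return True if work was done."""
    if os.path.exists(output_path):
        return False                      # already complete, rerun is a no-op
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as f:
        f.write(compute())
    os.replace(tmp_path, output_path)     # atomic move so partial output never looks complete
    return True

calls = []

def count_events():
    calls.append(1)                       # record that the expensive work actually ran
    return "clicks\t3\n"

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "counts-2014-06-24.tsv")
    first = run_task(target, count_events)
    second = run_task(target, count_events)
    print(first, second, len(calls))
```

Rerunning a failed pipeline then only redoes the tasks whose outputs are missing.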

Page 55: Building a Data Pipeline from Scratch - Joe Crobak

WHY AVRO?

• Self-describing files

• Integrated with nearly everything in the ecosystem

• CLI tools for dumping to JSON, CSV
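"Self-describing" means the schema travels with the data. The sketch below imitates the idea with a text file whose first line is the schema. The functions and layout are hypothetical; Avro's own container format is binary and is not reproduced here.

```python
import io
import json

def write_file(stream, schema, records):
    """First line holds the schema; every following line holds one record's values."""
    stream.write(json.dumps(schema) + "\n")
    names = [f["name"] for f in schema["fields"]]
    for record in records:
        stream.write(json.dumps([record[n] for n in names]) + "\n")

def read_file(stream):
    """Recover the records using only what is stored in the file itself."""
    schema = json.loads(stream.readline())
    names = [f["name"] for f in schema["fields"]]
    return schema, [dict(zip(names, json.loads(line))) for line in stream]

schema = {"name": "Event", "fields": [{"name": "user_id", "type": "string"},
                                      {"name": "action", "type": "string"}]}
records = [{"user_id": "u1", "action": "app_opened"},
           {"user_id": "u2", "action": "auto_refresh"}]

buffer = io.StringIO()
write_file(buffer, schema, records)
buffer.seek(0)
recovered_schema, recovered = read_file(buffer)
print(recovered)
```

A reader needs no out-of-band knowledge of the fields, which is what lets generic tools dump such files to JSON or CSV.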