Resilient Predictive Data Pipelines (GOTO Chicago 2016)


Transcript of Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Page 1: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Resilient Predictive Data Pipelines

Sid Anand (@r39132) GOTO Chicago 2016


Page 2: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

About Me


Work [ed | s] @

Committer & PPMC on

Report to

Co-Chair for

Page 3: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Motivation

Why is a Data Pipeline talk in this Always Available Track?

Page 4: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Motivation

Always On work has traditionally focused on the availability of Serving Systems:

• Synchronous or Semi-Synchronous

• Often Transactional

• Latency-sensitive

Page 5: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

HA Goals of Serving Systems


Outages are Big News Items!

Page 6: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

And sometimes your failures become your brand!



Page 8: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Motivation

Arguably, the more valuable parts of online services are driven by Data Flow Systems (a.k.a. Data Pipelines):

• Asynchronous

• Throughput-sensitive

Page 9: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Data Products


Page 10: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Serving + Data Pipelines

A business’s viability is based on its ability to

• keep the site up (Always-On Serving Architectures) &

• maintain engagement (views & clicks) with customers (Always-On Data Pipelines)

This talk is about Always-On Data Pipelines!

Page 11: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Serving + Data Pipelines

Serving: FE Load Balancers; Web Servers; Microservice Layer; Data Layer (DB, Search, Caching, Graph DB, Object Store)

Data Pipeline: Data Integration Layer; DAGs + Scheduler + Distributed Computation Engine

Page 12: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Data Pipeline Challenges


Page 13: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Data Pipeline Challenges

Problem 1: The Blast Radius Problem


Page 17: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

The Blast Radius Problem


• A developer introduces a bug in Data Pipeline Job 1

• Data Pipeline Job 1 reads Data A & writes Data B

• Data Pipeline Job 2 reads Data B & writes Data C

• Data Pipeline Job 3 reads Data C & writes Data D to a Serving System DB

• Serving System 4 reads Data D, where the bug is discovered!

Page 18: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

The Blast Radius Problem

• The previous diagram shows only one path of a tree

• The reality is much worse

• For each data set produced, there are multiple consuming jobs and hence multiple bad downstream outputs
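To make the tree of bad outputs concrete, here is a minimal sketch that walks a job lineage and lists every job downstream of a buggy one. The lineage table, the job names (including the extra consumer `job2b`), and the `blast_radius` helper are hypothetical and only mirror the Job 1 / Data A-D example from the slides.

```python
from collections import deque

# Hypothetical lineage: each job reads some data sets and writes others.
JOBS = {
    "job1": {"reads": ["A"], "writes": ["B"]},
    "job2": {"reads": ["B"], "writes": ["C"]},
    "job3": {"reads": ["C"], "writes": ["D"]},
    "job2b": {"reads": ["B"], "writes": ["E"]},  # a second consumer of Data B
}


def blast_radius(jobs, buggy_job):
    """Return every job downstream of buggy_job, in breadth-first order."""
    # Index the consuming jobs by the data set they read.
    readers = {}
    for name, spec in jobs.items():
        for data_set in spec["reads"]:
            readers.setdefault(data_set, []).append(name)

    affected, seen = [], {buggy_job}
    queue = deque([buggy_job])
    while queue:
        job = queue.popleft()
        for data_set in jobs[job]["writes"]:
            for consumer in readers.get(data_set, []):
                if consumer not in seen:
                    seen.add(consumer)
                    affected.append(consumer)
                    queue.append(consumer)
    return affected


print(blast_radius(JOBS, "job1"))
```

Every job in the returned list has consumed bad data and would need a rerun over the affected time window.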

Page 19: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

The Blast Radius Problem

An acute pain point

Page 20: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

The Blast Radius Problem

Diagram: Job 1 → Job 2 → Job 3 → Serving System

Detect Bug

Page 21: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

The Blast Radius Problem


Detect Bug

Identify Cause

Page 22: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Identify Cause



Page 24: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

The Blast Radius Problem


Detect Bug

Identify Cause

Deploy a Fix

Page 25: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Roll Out a Fix & Rerun All Downstream Jobs in the Affected Time Window

Page 26: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

The Blast Radius Problem

Detect Bug

Identify Cause

Re-Run All Jobs over a Time Window

Page 27: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Take Aways?

• The cost of a Data Pipeline bug in people, time, and morale is high, and such bugs can occur frequently.

• In most areas of software, testing is invaluable; in data pipelines it helps less.

• Data Pipeline bugs can be due to a logic problem or bad input data!

• Best Option: Detect & Roll Back / Fix Forward

Page 28: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

The Blast Radius Solution

Detect Bug

Identify Cause

Re-Run 1 Job

Page 29: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Data Pipeline Challenges

Timeliness

Page 30: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Timeliness

Job 1 → Job 2 → Job 3

Definition: job = workflow = DAG of tasks

Job 3’s output is pushed to a serving system
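As a small illustration of "job = workflow = DAG of tasks", the sketch below derives a valid execution order from task dependencies using Python's standard-library `graphlib`. The task names are hypothetical; a real scheduler does far more (retries, scheduling, parallelism), so this only shows the ordering idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks for one job; each key maps to the tasks it depends on.
TASKS = {
    "extract": set(),
    "score": {"extract"},
    "summarize": {"score"},
    "publish": {"summarize"},
}


def run_order(tasks):
    """Return one valid execution order for a DAG of tasks."""
    # static_order() raises graphlib.CycleError if the graph is not a DAG.
    return list(TopologicalSorter(tasks).static_order())


print(run_order(TASKS))
```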

Page 31: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Timeliness

Consider the daily run schedule below:

Run 1: Monday

Run 2: Tuesday

Run 3: Wednesday

Page 32: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Timeliness

Chart: Runs 4, 5, and 6 plotted against the time SLA (regions: Within Time SLA, OUT)

Page 33: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Timeliness

Why do jobs get slower?

Page 34: Resilient Predictive Data Pipelines (GOTO Chicago 2016)


Timeliness

Page 35: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Timeliness

new features

Page 36: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Timeliness

Algo taking longer

Page 37: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Timeliness

Algo bug fix

Page 38: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Timeliness

Algo bug fixes

Page 39: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Timeliness

new features

Page 40: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Take Aways?


• Data Science & Engineering work is a virtuous cycle of adding features (and the like) + tuning performance

• Latency does matter (a bit)

Page 41: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Design Goals

Desirable Qualities of a Resilient Data Pipeline

Page 42: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Desirable Qualities of a Resilient Data Pipeline

• Correctness

• Timeliness

• Operability

• Cost

Page 43: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Desirable Qualities of a Resilient Data Pipeline

• Correctness: Data Integrity (no loss, etc…); Expected data distributions

• Timeliness: All output within time-bound SLAs

• Operability: Fine-grained Monitoring & Alerting of Correctness & Timeliness SLAs; Quick Recoverability

• Cost: Pay-as-you-go

Page 44: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Quickly Recoverable

• Bugs happen!

• Bugs in Predictive Data Pipelines have a large blast radius

• Optimize for MTTR (mean time to recovery)

Page 45: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Implementation

Using AWS to meet Design Goals

Page 46: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS

Simple Queue Service

Page 47: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS - Overview


AWS’s low-latency, highly scalable, highly available message queue

Infinitely Scalable Queue (though not FIFO)

Low End-to-end latency (generally sub-second)

Pull-based

Page 48: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS - Typical Operation Flow

Diagram: three producers write messages m1 to m5 to SQS; three consumers read from SQS and write to a DB. Message m1 is in flight under a visibility timer.

Step 1: A consumer reads a message from SQS. This starts a visibility timer!

Page 49: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS - Typical Operation Flow

Step 2: Consumer persists message contents to DB

Page 50: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS - Typical Operation Flow

Step 3: Consumer ACKs message in SQS
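The three steps can be sketched as a single consumer pass. This is a minimal sketch, assuming `sqs` behaves like a boto3 SQS client (`receive_message` and `delete_message` are the boto3 calls); the `consume_once` helper and the `persist` callback are hypothetical names, not part of the talk.

```python
def consume_once(sqs, queue_url, persist, visibility_timeout=30):
    """Run steps 1-3 for one batch; return the number of messages ACK'd.

    `sqs` is expected to behave like a boto3 SQS client.
    `persist` is a hypothetical callback that writes one message body to the DB.
    """
    # Step 1: read. SQS hides each returned message for visibility_timeout seconds.
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
        VisibilityTimeout=visibility_timeout,
    )
    acked = 0
    for message in response.get("Messages", []):
        # Step 2: persist. If this raises, the message is never deleted
        # and becomes visible again once the timer expires.
        persist(message["Body"])
        # Step 3: ACK by deleting the message.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        acked += 1
    return acked
```

With real AWS credentials this could be called as `consume_once(boto3.client("sqs"), queue_url, save_to_db)`, where `queue_url` and `save_to_db` are placeholders. Deleting only after a successful write is what gives at-least-once processing.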

Page 51: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS - Time Out Example

Diagram: same layout as the typical flow; message m1 is in flight under a visibility timer.

Step 1: A consumer reads a message from SQS

Page 52: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS - Time Out Example

Step 2: Consumer attempts to persist message contents to DB

Page 53: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS - Time Out Example

Step 3: A Visibility Timeout occurs & the message becomes visible again.

Page 54: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS - Time Out Example

Step 4: Another consumer reads and persists the same message

Page 55: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS - Time Out Example

Step 5: Consumer ACKs message in SQS
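The time-out sequence can be replayed with a toy in-memory model. `SimulatedQueue` below is a hypothetical stand-in written for this illustration, not the SQS API; the clock is passed in explicitly so the steps are deterministic.

```python
class SimulatedQueue:
    """Toy in-memory model of SQS visibility timeouts (not the real service)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message id -> [body, time at which it is visible again]

    def send(self, message_id, body):
        self.messages[message_id] = [body, 0]

    def receive(self, now):
        """Return (id, body) of a visible message and hide it, or None."""
        for message_id, entry in self.messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return message_id, entry[0]
        return None

    def ack(self, message_id):
        self.messages.pop(message_id, None)


queue = SimulatedQueue(visibility_timeout=30)
queue.send("m1", "payload")
db = {}

first = queue.receive(now=0)    # Step 1: consumer A reads m1
hidden = queue.receive(now=10)  # m1 is invisible, so nothing is returned
# Step 2: consumer A stalls while persisting. Step 3: the timer expires.
second = queue.receive(now=31)  # Step 4: consumer B reads the same message
db[second[0]] = second[1]       # keyed by message id, so a repeated write is harmless
queue.ack(second[0])            # Step 5: consumer B ACKs
print(first, hidden, second, queue.receive(now=100))
```

Because the same message can be delivered more than once, the write in Step 4 should be idempotent; keying the row by message id is one simple way to get that.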

Page 56: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SQS - Dead Letter Queue

Diagram: message m1 has left the main queue (which still holds m2 to m5) and sits in the dead-letter queue (SQS - DLQ).

Redrive rule: 2x
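A redrive rule like the one in the diagram is configured through the queue's `RedrivePolicy` attribute. The sketch below only builds that attribute; the `redrive_attributes` helper and the ARN are hypothetical, and the value 2 is taken from the "2x" on the slide.

```python
import json


def redrive_attributes(dlq_arn, max_receive_count=2):
    """Build the queue attributes that attach a dead-letter queue."""
    return {
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": str(max_receive_count)}
        )
    }


# Hypothetical ARN, for illustration only.
attrs = redrive_attributes("arn:aws:sqs:us-east-1:123456789012:importer-dlq")
print(attrs["RedrivePolicy"])
# With a boto3 SQS client: sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=attrs)
```

Once a message has been received `maxReceiveCount` times without being deleted, SQS moves it to the dead-letter queue instead of making it visible again.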

Page 57: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SNS

Simple Notification Service

Page 58: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SNS - Overview

Highly Scalable, Highly Available, Push-based Topic Service

• Whereas SQS ensures each message is seen by at least 1 consumer, SNS ensures that each message is seen by every consumer (Reliable Multi-Push)

• Whereas SQS is pull-based, SNS is push-based

• There is no message retention & there is a finite retry count (No Reliable Message Delivery)

Can we work around this limitation while getting Reliable Multi-Push?

Page 59: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SNS + SQS Design Pattern

Diagram: SNS topic T1 fans messages m1 and m2 out to two subscribed queues, SQS Q1 and SQS Q2.

• SNS T1: Reliable Multi-Push

• SQS Q1, SQS Q2: Reliable Message Delivery

Page 60: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

SNS + SQS

Diagram: producers publish m1 and m2 to SNS T1, which fans out to SQS Q1 and SQS Q2. One group of consumers reads from one queue and writes to the DB; a second group reads from the other queue and writes to ES.
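Wiring the fan-out comes down to subscribing each queue to the topic. This is a minimal sketch, assuming `sns` behaves like a boto3 SNS client (`subscribe` with `Protocol="sqs"` is the boto3 call); the `fan_out` helper is a hypothetical name. Each queue also needs an access policy that lets the topic send to it, which is not shown here.

```python
def fan_out(sns, topic_arn, queue_arns):
    """Subscribe each SQS queue to the SNS topic; return the subscription ARNs.

    `sns` is expected to behave like a boto3 SNS client.
    """
    subscription_arns = []
    for queue_arn in queue_arns:
        response = sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
        subscription_arns.append(response["SubscriptionArn"])
    return subscription_arns
```

After this, every message published to the topic is copied into each queue, and each queue's consumers drain it independently with the usual read, persist, ACK loop.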

Page 61: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Batch Pipeline Architecture

Putting the Pieces Together

Page 62: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

But First…

Page 63: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

What Does Agari Do?


Page 64: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

What Does Agari Do?

Agari’s Current Product

Customers → email metadata → apply trust models → email + trust score

Page 65: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

What Does Agari Do?

Agari’s Future Product

Enterprise Customers → email metadata → apply trust models → email md + trust score

Page 66: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Batch Pipeline Architecture

Putting the Pieces Together

Page 67: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Batch Architecture

• S3 to hold all source & computed data (Avro)

• EMR Spark for scoring + summarization

• Apache Airflow for hourly job scheduling

• SNS+SQS for messaging

• ASG Importer to import

• WebApp in Ruby-on-Rails

Page 68: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Tackling Cost & Timeliness

Leveraging the AWS Cloud

Page 69: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Tackling Cost

Figure captions: Between Hourly Runs; During Hourly Runs

Page 70: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Tackling Timeliness

Auto Scaling Group (ASG)

Page 71: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

ASG - Overview


What is it?

A means to automatically scale out/in clusters to handle variable load/traffic

A means to keep a cluster/service of a fixed size always up

Page 72: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

ASG - Data Pipeline

Diagram: messages flow from SQS to an Importer ASG (a group of importer instances that scales out / in), which writes to the DB.

Page 73: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

ASG: CPU-based

Chart series: Sent, ACK'd/Recv'd, CPU

CPU-based auto-scaling is good at scaling in/out to keep the average CPU constant

Page 74: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

ASG : CPU-based

Chart series: Sent, Recv, CPU (annotated: Premature Scale-in)

Premature Scale-in:

• The CPU drops to noise levels before all messages are consumed

• This causes scale-in to occur while the last few messages are still being committed

Page 75: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

ASG: Queue-based

Scale-out: When Visible Messages > 0 (a.k.a. when queue depth > 0). This causes the ASG to grow.

Scale-in: When Invisible Messages = 0 (a.k.a. when the last in-flight message is ACK’d). This causes the ASG to shrink.
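The two queue-based rules can be sketched as a small decision function. This only shows the decision logic, not how the ASG is actually driven; `scaling_action` and `read_queue_depths` are hypothetical names, the precedence (scale out whenever anything is waiting) is an assumption, and `sqs` is assumed to behave like a boto3 SQS client.

```python
def scaling_action(visible, in_flight):
    """Apply the two queue-based rules to one reading of the queue."""
    if visible > 0:
        return "scale_out"  # queue depth > 0: work is waiting
    if in_flight == 0:
        return "scale_in"   # nothing waiting and the last in-flight message was ACK'd
    return "hold"           # queue is drained but some messages are still being processed


def read_queue_depths(sqs, queue_url):
    """Return (visible, in_flight) using SQS's approximate counters.

    `sqs` is expected to behave like a boto3 SQS client.
    """
    names = ["ApproximateNumberOfMessages", "ApproximateNumberOfMessagesNotVisible"]
    attributes = sqs.get_queue_attributes(QueueUrl=queue_url, AttributeNames=names)["Attributes"]
    return int(attributes[names[0]]), int(attributes[names[1]])


for reading in [(5, 0), (0, 3), (0, 0)]:
    print(reading, scaling_action(*reading))
```

The "hold" case is what avoids the premature scale-in seen with CPU-based scaling: instances stay up until the last in-flight message is ACK'd.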

Page 76: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Desirable Qualities of a Resilient Data Pipeline

• Correctness

• Timeliness: ASG, EMR Spark

• Operability

• Cost: ASG, EMR Spark

Page 77: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Tackling Operability & Correctness

Leveraging Tooling

Page 78: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Tackling Operability: Requirements

• A simple way to author and manage workflows

• Provides visual insight into the state & performance of workflow runs

• Integrates with our alerting and monitoring tools

Page 79: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Apache Airflow

Workflow Automation & Scheduling

Page 80: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Apache Airflow - Authoring DAGs

Airflow: Author DAGs in Python! No need to bundle many config files!

Page 81: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Apache Airflow - Authoring DAGs

Airflow: Visualizing a DAG

Page 82: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Apache Airflow - Managing DAGs

Airflow: It’s easy to manage multiple DAGs

Page 83: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Apache Airflow - Perf. Insights


Airflow: Gantt chart view reveals the slowest tasks for a run!

Page 84: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Apache Airflow - Perf. Insights

Airflow: Task Duration chart view shows task completion time trends!

Page 85: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Apache Airflow - Alerting

Airflow: …And easy to integrate with Ops tools!

Page 86: Resilient Predictive Data Pipelines (GOTO Chicago 2016)


Apache Airflow - Correctness

Page 87: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Desirable Qualities of a Resilient Data Pipeline

• Correctness

• Timeliness

• Operability

• Cost

Page 88: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Near-Real Time Data Pipelines

Stream Processing @ Agari

Page 89: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

NRT Architecture


Page 90: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

NRT Architecture


Page 91: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

NRT Architecture

The Architecture is composed of repeated patterns of:

• ASG-based compute

• Kinesis streams (i.e. AWS’ managed “Kafka”)

• Lambda-based Avro Schema Registry

Page 92: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Avro Schema Registry

Avro Schema Storage

Page 93: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

What is Avro?

A self-describing (schema’d) serialization format

{
  "namespace": "com.agari.ep.collector.model",
  "type": "record",
  "doc": "This Schema describes the server-side configuration of Agari's Enterprise-Protect Collector",
  "name": "collector_config",
  "fields": [
    {"name": "email_log_enabled", "type": "boolean"},
    {"name": "email_log_interval_seconds", "type": ["int", "null"]},
    {"name": "email_log_bucket_name", "type": "string"},
    {"name": "phone_home_interval_seconds", "type": "int"},
    {"name": "phone_home_sns_topic_ARN", "type": "string"},
    {"name": "config_pull_interval_seconds", "type": "int"},
    {"name": "receiver_netblocks", "type": ["null", {"type": "array", "items": "string"}]},
    {
      "name": "connecting_ip",
      "type": [
        "null",
        {
          "type": "record",
          "name": "connecting_ip_record",
          "fields": [
            {"name": "received_header_index", "type": "int"},
          ]

……

Page 94: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

What is Avro?


Typically, the schema is stored in the same file as the data it represents

In HDFS, where files are typically large, the schema overhead is negligible

Page 95: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Schema Registry

Diagram: producers (P) write to a Kinesis Stream; consumers (C) read from it.

In streaming, where each record may be sent individually, the schema will be the majority of the data transmitted! This is a fat message.

Can we be smarter?

Page 96: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Schema Registry

Diagram: producers (P) and consumers (C) on a Kinesis Stream talk to a Schema Registry (Lambda) backed by DynamoDB, through two calls: register_schema and get_schema_by_id.
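The saving from a registry can be sketched with a toy in-memory version. Everything here is an assumption made for illustration: a dict stands in for DynamoDB, the id is a truncated SHA-256 of the schema, the two-field schema is a cut-down version of the example above, and JSON is swapped in for Avro's binary encoding to keep the sketch dependency-free. Only the two call names, `register_schema` and `get_schema_by_id`, come from the slide.

```python
import hashlib
import json


class InMemorySchemaRegistry:
    """Toy stand-in for the Lambda + DynamoDB registry; a dict plays the table."""

    def __init__(self):
        self._schemas = {}

    def register_schema(self, schema):
        """Store the schema and return a short, content-derived id (assumed id scheme)."""
        canonical = json.dumps(schema, sort_keys=True)
        schema_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
        self._schemas[schema_id] = schema
        return schema_id

    def get_schema_by_id(self, schema_id):
        return self._schemas[schema_id]


# Hypothetical two-field schema and record.
schema = {
    "type": "record",
    "name": "collector_config",
    "fields": [
        {"name": "email_log_enabled", "type": "boolean"},
        {"name": "phone_home_interval_seconds", "type": "int"},
    ],
}
record = {"email_log_enabled": True, "phone_home_interval_seconds": 60}

registry = InMemorySchemaRegistry()
schema_id = registry.register_schema(schema)  # producer side, once per schema

fat = json.dumps({"schema": schema, "record": record})         # schema travels with every record
slim = json.dumps({"schema_id": schema_id, "record": record})  # only the id travels
print(len(fat), len(slim))

fetched = registry.get_schema_by_id(schema_id)  # consumer side, cacheable per id
```

A consumer would presumably cache `fetched` per id, so the registry is called once per schema rather than once per record.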

Page 97: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

What is AWS Lambda?

AWS-hosted code execution environment (Python, Node, Java, Ruby)

You upload some code & specify a memory size (e.g. 256 MB); CPU is allocated in proportion to memory

Each published upload gets a new version (e.g. v2)

Code rollback is as easy as pointing an alias back at a previous version (e.g. alias → v1)
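A rollback of this kind can be sketched with one call. This is a minimal sketch, assuming `lambda_client` behaves like a boto3 Lambda client (`update_alias` is the boto3 call); the `roll_back` helper and the alias name `live` in the usage note are hypothetical. In Lambda, published versions are immutable and `$LATEST` always tracks the most recently uploaded code, so the sketch moves a named alias rather than `$LATEST` itself.

```python
def roll_back(lambda_client, function_name, alias, version):
    """Point an existing alias at an earlier published version.

    `lambda_client` is expected to behave like a boto3 Lambda client.
    """
    response = lambda_client.update_alias(
        FunctionName=function_name,
        Name=alias,
        FunctionVersion=version,
    )
    return response["FunctionVersion"]
```

For example, `roll_back(boto3.client("lambda"), "schema-registry", "live", "1")` would send callers of the `live` alias back to version 1, assuming a function and alias with those names exist.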

Page 98: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Elastic Stream Processing

How Do We Handle Increasing Traffic?

Page 99: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Elastic Stream Processing

Diagram: producers (P) write to a Kinesis Stream; consumers (C) read from it.


Page 101: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Elastic Stream Processing

Diagram: producers (P), a Kinesis Stream, and a larger set of consumers (C), together with Agari Scaling Utils.

Page 102: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Open Source Plans

In late Q2/early Q3, we plan to open-source our cloud tools for:

• Avro Schema Registry

• Agari (Kinesis + ASG) scaling tools

To be notified, follow @AgariEng & @r39132

Page 103: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Acknowledgments

None of this work would be possible without the contributions of the strong team below

• Vidur Apparao • Stephen Cattaneo • Jon Chase • Andrew Flury • William Forrester • Chris Haag • Mike Jones

• Scot Kennedy • Thede Loder • Paul Lorence • Kevin Mandich • Gabriel Ortiz • Jacob Rideout • Josh Yang

Page 104: Resilient Predictive Data Pipelines (GOTO Chicago 2016)

Questions? (@r39132)
