AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Greg Brandt, Liyin Tang (Airbnb) December 2, 2016 Streaming ETL For Amazon RDS and Amazon DynamoDB DAT315

Transcript of AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Page 1: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Greg Brandt, Liyin Tang (Airbnb)

December 2, 2016

Streaming ETL for Amazon RDS and Amazon DynamoDB

DAT315

Page 2: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

What to Expect from the Session

• Database Change Data Capture (CDC)

• Improving ETL to Data Warehouse

Page 3: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

SpinalTap (CDC)

Page 4: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Architectural Evolution

From monolithic Rails app

To many specialized services/data stores

Page 5: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

New Challenges

• Co-processing logic breaks down out of process/transaction context

• Primary tables/indices on many machines, not single RDBMS

• Specialized systems needed for certain use cases (analytics, search, etc.)

Page 6: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Architectural Tenets

• Build for production

• Plan for the future, build for today

• Prefer existing solutions and patterns that we have experience with in production

• Services should own their data and not share their storage

• Mutations to data should be propagated via standardized events

Page 7: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Change Data Capture (CDC)

Goal: Provide streams of data mutations

• In near real time

• With timeline consistency

To keep all these systems in sync

Page 8: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Option 1: Application-Driven Dual Writes

• Consistency hard
  • (2PC/consensus needed)

• Data model easy
  • (Schema controlled by application)

• Development easy
  • Use queue, e.g. Kafka or RabbitMQ, in addition to RDBMS

Page 9: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Option 2: Database Log Mining

• Consistency easy
  • (Leverage commit log semantics)

• Parsing/Data model hard
  • (Database’s internal commit log)

Page 10: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

We Chose Database Log Mining

• Parsing is easier than consensus

• Many libraries/APIs exist to make parsing easy

• Consuming stream of commits gives timeline consistency by default

Page 11: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Data Ecosystem

Page 12: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Requirements

• Timeline consistency with at-least-once message delivery

• Easily add new sources to consume (new machines if necessary)

• Support low latency and high throughput use cases

• High availability with automatic failover

• Heterogeneous data sources (MySQL, Amazon DynamoDB)

Page 13: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

MySQL Commit Log

• Java library for binary log parsing
  • https://github.com/shyiko/mysql-binlog-connector-java/

• Emit mutation events
  • (Write_rows, Update_rows, Delete_rows)

• Logical clock determined from binlog file/offset
  • (Single-master, Multi-AZ setup)

• Leverage XidEvent for transaction boundary metadata/checkpointing
  • (InnoDB implementation detail)
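The logical clock and checkpointing bullets above can be sketched as follows. This is a minimal illustration, not the SpinalTap implementation: the names BinlogPosition, parse_position and Checkpointer are hypothetical, event types are plain strings rather than the Java library's classes, and the sketch assumes binlog file names end in a numeric sequence (as in "mysql-bin.000010").

```python
import re
from typing import NamedTuple, Optional


class BinlogPosition(NamedTuple):
    """Logical clock for one single-master source: (file sequence, byte offset)."""
    file_seq: int
    offset: int


# Assumption: binlog file names look like "<base>.<digits>".
_BINLOG_NAME = re.compile(r"^.+\.(\d+)$")


def parse_position(filename: str, offset: int) -> BinlogPosition:
    match = _BINLOG_NAME.match(filename)
    if match is None:
        raise ValueError(f"unexpected binlog file name: {filename!r}")
    return BinlogPosition(int(match.group(1)), offset)


class Checkpointer:
    """Advance the checkpoint only at transaction boundaries (XID events)."""

    def __init__(self) -> None:
        self.committed: Optional[BinlogPosition] = None

    def on_event(self, event_type: str, position: BinlogPosition) -> None:
        # Row events inside an open transaction never move the checkpoint.
        if event_type == "XID":
            self.committed = position
```

Because the position is a tuple, ordinary tuple comparison orders events across file rotations: any offset in file 10 sorts after every offset in file 9.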

Page 14: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

DynamoDB Streams

• Using DynamoDB Streams Kinesis Adapter

• Guarantees
  • Each stream record appears exactly once in the stream.
  • Stream records appear in the same sequence as the actual modifications to the item

• Monotonically increasing logical clock is hard
  • Need to incorporate shard id, parent/child splitting semantics
  • SequenceNumber is not global
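One ingredient of the parent/child splitting semantics above is reading a parent shard before its children. The sketch below orders shard ids that way. It assumes shard lineage is available as a mapping from shard id to parent shard id (DynamoDB Streams shard descriptions carry ShardId and ParentShardId fields) and that the lineage is acyclic; the function name shard_processing_order is hypothetical, and this is not how the Kinesis Adapter itself is implemented.

```python
from typing import Dict, List, Optional, Set


def shard_processing_order(parents: Dict[str, Optional[str]]) -> List[str]:
    """Return shard ids so that every known parent precedes its children.

    `parents` maps shard id -> parent shard id (None for a root shard).
    Assumes the lineage has no cycles.
    """
    order: List[str] = []
    seen: Set[str] = set()

    def visit(shard_id: str) -> None:
        if shard_id in seen:
            return
        parent = parents.get(shard_id)
        # A parent may already have expired from the stream description;
        # only recurse into parents we actually know about.
        if parent is not None and parent in parents:
            visit(parent)
        seen.add(shard_id)
        order.append(shard_id)

    for shard_id in sorted(parents):
        visit(shard_id)
    return order
```

Within a single shard, records can then be ordered by sequence number; across shards, only the lineage order is meaningful, which is why a single global counter is hard to derive.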

Page 15: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Abstract Mutation

• Provide monotonically increasing* id from logical clock

• Source-specific metadata (e.g. MySQL binlog filename/offset)

• The beforeImage of the row in DB (possibly null)

• The afterImage of the row in DB (possibly null)

• Encode this using source-agnostic format (e.g. Thrift)

• Write this object to message bus (e.g. Kafka)

{
  id: Long,
  opCode: [INSERT, UPDATE, DELETE],
  metadata: Map<String, String>,
  beforeImage: Record,
  afterImage: Record
}
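The mutation envelope above could be modelled like this. The sketch mirrors the slide's five fields; the snake_case names, the Record alias, and the validation rule (an INSERT carries no before image and a DELETE carries no after image) are assumptions added for illustration, since the slide only says the images are "possibly null".

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, Optional

Record = Dict[str, Any]  # hypothetical stand-in for the encoded row type


class OpCode(Enum):
    INSERT = "INSERT"
    UPDATE = "UPDATE"
    DELETE = "DELETE"


@dataclass(frozen=True)
class Mutation:
    id: int                                   # derived from the logical clock
    op_code: OpCode
    metadata: Dict[str, str] = field(default_factory=dict)
    before_image: Optional[Record] = None     # row before the change
    after_image: Optional[Record] = None      # row after the change

    def __post_init__(self) -> None:
        # Assumed invariants, not stated on the slide.
        if self.op_code is OpCode.INSERT and self.before_image is not None:
            raise ValueError("INSERT should not carry a before image")
        if self.op_code is OpCode.DELETE and self.after_image is not None:
            raise ValueError("DELETE should not carry an after image")
```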

Page 16: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Clustering/Configuration

• LEADER/STANDBY state model

• Each machine is LEADER for a subset of sources

• Workload distributed evenly

• Use ZooKeeper-based Apache Helix framework for cluster management
  • http://helix.apache.org/

• Dynamic source configuration changes

• Helix Instance group tags to separate MySQL/DynamoDB nodes

Page 17: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Fault Tolerance

• Controller handles node failure/elects new LEADER for sources

• Maintain leader_epoch counter in Helix ZooKeeper property store

• Prefix generated ids with leader_epoch for monotonicity
  • E.g. (leader_epoch, binlog_file, binlog_pos)
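A small sketch of why the leader_epoch prefix keeps ids monotonic. It assumes that a newly elected LEADER may resume from an earlier binlog position and re-emit some events (consistent with at-least-once delivery); the MutationId and ids_for names are hypothetical, and the binlog file is represented by its numeric sequence.

```python
from typing import Iterable, List, NamedTuple, Tuple


class MutationId(NamedTuple):
    """Compared field by field, so leader_epoch dominates the ordering."""
    leader_epoch: int
    binlog_file: int
    binlog_pos: int


def ids_for(leader_epoch: int,
            positions: Iterable[Tuple[int, int]]) -> List[MutationId]:
    """Tag each (binlog_file, binlog_pos) with the current leader epoch."""
    return [MutationId(leader_epoch, f, p) for f, p in positions]


# Epoch 1 emitted up to position (7, 300) before failing.
old_leader = ids_for(1, [(7, 100), (7, 200), (7, 300)])
# Epoch 2 resumes from an earlier checkpoint and re-emits (7, 200) onward.
new_leader = ids_for(2, [(7, 200), (7, 300), (7, 400)])
```

Every id produced in epoch 2 sorts after every id from epoch 1, even where the binlog positions overlap, so consumers still see a non-decreasing id sequence across failover.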

Page 18: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Pub/Sub

• Produce mutations to Kafka with durable configuration*

• Async coprocessors consume messages, produce new streams

• Model streaming library allows encapsulation of DB table schema
  • Service controls both API endpoint and streaming view of data

• Keep 24 hours of MySQL binlog
  • Alert / rewind on failures in this tier

Page 19: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Online Validation

• Download binlog after it is flushed/immutable

• Check for holes/ordering violations by consuming stream from Kafka

• Allows us to maintain low latency with confidence in consistency of stream

• Auto-healing
  • Reset binlog position to earlier if too many failures
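The hole and ordering checks described above might look like the following sketch. It assumes both the flushed binlog and the Kafka stream can be reduced to lists of comparable positions such as (file sequence, offset) tuples, and it treats any position that goes backwards as a violation; the function name validate_stream is hypothetical.

```python
from typing import List, Sequence, Tuple

Position = Tuple[int, int]  # (binlog file sequence, offset)


def validate_stream(
    expected: Sequence[Position], observed: Sequence[Position]
) -> Tuple[List[Position], List[Position]]:
    """Compare positions read from an immutable binlog with the consumed stream.

    Returns (holes, ordering_violations):
      holes               positions in the binlog that never reached the stream
      ordering_violations positions that arrived after a larger position
    Exact duplicates are tolerated, matching at-least-once delivery.
    """
    holes = sorted(set(expected) - set(observed))

    violations: List[Position] = []
    high_water = None
    for pos in observed:
        if high_water is not None and pos < high_water:
            violations.append(pos)
        else:
            high_water = pos
    return holes, violations
```

A validator like this can run off the hot path, which fits the slide's point that latency stays low while consistency is still checked.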

Page 20: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Production Lessons

• Need schema history store for regions of commit log to support rewind
  • E.g. write DDL to commit log, apply to local MySQL while processing stream to obtain range/schema mapping

• Be careful about table encodings! (latin1, utf8...)

• request.required.acks = all can potentially hit every broker…
  • (Group produce requests by broker to avoid hitting too many)

• Per-source produce buffer size
  • (Tune for throughput/latency)
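To illustrate the table-encoding warning above: binlog row events carry raw bytes, so string columns must be decoded with the charset declared for the column. The sketch below is an assumption-laden illustration, not the speakers' code. The decode_column helper and its charset map are hypothetical; MySQL's latin1 is mapped to Python's cp1252 codec because MySQL documents latin1 as "cp1252 West European".

```python
# Hypothetical mapping from MySQL charset names to Python codec names.
_CODECS = {
    "latin1": "cp1252",
    "utf8": "utf-8",
    "utf8mb4": "utf-8",
}


def decode_column(raw: bytes, mysql_charset: str) -> str:
    """Decode raw column bytes using the column's declared MySQL charset."""
    try:
        codec = _CODECS[mysql_charset]
    except KeyError:
        raise ValueError(f"unsupported charset: {mysql_charset!r}") from None
    return raw.decode(codec)


utf8_bytes = "São Paulo".encode("utf-8")
correct = decode_column(utf8_bytes, "utf8")     # decoded with the right charset
garbled = decode_column(utf8_bytes, "latin1")   # same bytes, wrong charset
```

Decoding with the wrong charset does not raise an error here; it silently yields mojibake, which is why this class of bug is easy to miss downstream.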

Page 21: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Data Ecosystem

Page 22: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Streaming DB Exports

Page 23: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Batch Infrastructure

[Diagram. Labels: Airflow Scheduling; Events Log; DB Mutation; Gold; Silver; Batch Ingestion; Query Engines: Hive/Presto/Spark; RDS; EC2]

Page 24: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Growing Pain

[Diagram. Labels: Airflow Scheduling; Events Log; DB Mutation; Gold; Silver; Batch Ingestion; Query Engines: Hive/Presto/Spark; RDS; EC2]

Page 25: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Point-in-Time Restore based DB Export

• Pros:
  • Simple, especially for schema change
  • Consistent

• Cons:
  • No SLA for RDS PITR restoration time
  • No near real time ad hoc query
  • No hourly snapshot
  • High storage cost

Page 26: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Overviews

Page 27: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Real-Time Ingestion on HBase

[Diagram. Labels: RDS; SpinalTap; Spark Streaming; HBase; HDFS; snapshot; Real time query; Batch query; Query Engines: Hive/Presto/Spark]

Page 28: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Access Data in HBase

[Diagram. Labels: HBase; HDFS; snapshot; Unified view on real time data; Streaming: Spark; Interactive Query: Presto; Batch Job: Hive/Spark]

Page 29: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Snapshot & Reseed

[Diagram. Labels: HBase; HDFS; Snapshot (HFile Links); Bulk upload (Reseed)]

Page 30: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Onboard New Tables

[Diagram. Labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]

Page 31: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Disaster Recovery - Checkpoint

[Diagram. Labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]

Page 32: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Disaster Recovery - Rewind

[Diagram. Labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]

Page 33: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Disaster Recovery - Reseed

[Diagram. Labels: RDS; HBase; HDFS; Streaming of Mutations from SpinalTap; Reseed; Ingest]

Page 34: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

HBase Schema

Page 35: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Key Space Design

• Multiplex all DB tables on a single HBase table

• Fast point lookup based on primary keys

• Efficient sequential scans for one table

• Load balance

Page 36: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

HBase Row Keys – Primary Keys

• Hash Key = md5(DB_TABLE, PK1=v1, PK2=v2)

• Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2

• Fast point lookup based on primary keys

• Efficient sequential scan for all the keys in same DB/Table

• Balanced based on hash key

Row key layout: Hash | DB_TABLE | PK1=v1 | PK2=v2
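The row key recipe above can be sketched as follows. The slide does not say how the parts are delimited or how many hash bytes are kept, so the "|" separator, the full 32-character hex digest, and the function name primary_row_key are all assumptions made for illustration.

```python
import hashlib
from typing import Sequence, Tuple

SEP = "|"  # assumed delimiter; not specified on the slide


def primary_row_key(db_table: str,
                    primary_key: Sequence[Tuple[str, object]]) -> str:
    """Row Key = Hash Key + DB_TABLE + PK1=v1 + PK2=v2 (columns in PK order)."""
    parts = [db_table] + [f"{name}={value}" for name, value in primary_key]
    logical_key = SEP.join(parts)
    # Hash Key = md5 over the table name and primary key values.
    hash_key = hashlib.md5(logical_key.encode("utf-8")).hexdigest()
    return hash_key + SEP + logical_key


key = primary_row_key("airbed.users", [("id", 42), ("region", "us")])
```

The key is fully determined by the table name and primary key values, so a point lookup can recompute it without consulting any index, and the leading hash spreads rows across HBase regions. The table name "airbed.users" is a made-up example.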

Page 37: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

HBase Row Keys – Secondary Keys

• Hash Key = md5(DB_TABLE, Index_1=v1)

• Row Key = Hash Key + DB_TABLE + Index_1=v1 + PK1=vpk1

• Prefix scan for a given secondary index

Row key layout: Hash | DB_TABLE | Index=v1 | PK1=vpk1
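A sketch of the secondary-index row key and the prefix scan it enables. HBase keeps row keys in sorted order, which is simulated here with a sorted Python list and bisect; the "|" separator and the names index_prefix, secondary_row_key and prefix_scan are assumptions, not part of any HBase API.

```python
import bisect
import hashlib
from typing import List

SEP = "|"  # assumed delimiter; not specified on the slide


def index_prefix(db_table: str, index_name: str, index_value: object) -> str:
    """Hash Key + DB_TABLE + Index_1=v1, shared by all rows with that value."""
    logical = f"{db_table}{SEP}{index_name}={index_value}"
    hash_key = hashlib.md5(logical.encode("utf-8")).hexdigest()
    # The trailing separator stops "city=SF" from also matching "city=SFO".
    return hash_key + SEP + logical + SEP


def secondary_row_key(db_table: str, index_name: str, index_value: object,
                      pk_name: str, pk_value: object) -> str:
    return index_prefix(db_table, index_name, index_value) + f"{pk_name}={pk_value}"


def prefix_scan(sorted_keys: List[str], prefix: str) -> List[str]:
    """Return every key starting with `prefix` from a sorted key list."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for row_key in sorted_keys[start:]:
        if not row_key.startswith(prefix):
            break
        matches.append(row_key)
    return matches
```

Because the hash covers only the table and index value, every row with the same index value shares one prefix and therefore sits contiguously in sorted order.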

Page 38: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

HBase Versioning

Rows | CF: Columns | Version | Value
<ShardKey><DB_TABLE_#1><PK_a=A> | id | Fri May 19 00:33:19 2016 | 101
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 19 00:33:19 2016 | San Francisco
<ShardKey><DB_TABLE_#1><PK_a=A> | city | Fri May 10 00:34:19 2016 | New York
<ShardKey><DB_TABLE_#2><PK_a=A’> | id | Fri May 19 00:33:19 2016 | 1

Page 39: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Version by Timestamp

Binlog order: TXN 1 (COMMIT_TS: 101) → TXN 2 (COMMIT_TS: 102) → TXN 3 (COMMIT_TS: 103) → … → TXN N (COMMIT_TS: N’)

Page 40: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Version by Timestamp

Binlog order: TXN 1 (COMMIT_TS: T1) → TXN 2 (COMMIT_TS: T3) → TXN 3 (COMMIT_TS: T2) → … → TXN N (COMMIT_TS: N’)

Binlog positions: mysql-bin.00000:100, mysql-bin.00000:101, mysql-bin.00000:102, …, mysql-bin.00000:N

[Diagram label: NTP]

Page 41: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

HBase Versioning

Rows | CF: Columns | Version | Commit TS
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:100 | T0
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:101 | T1
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:102 | T3
<ShardKey><DB_TABLE_#1><PK_a=A> | id | mysql-bin.00000:103 | T2
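The table above versions each cell by binlog position rather than by commit timestamp, since timestamps can be out of binlog order (T3 before T2). A minimal sketch of that choice follows. It assumes the position is packed into a single 64-bit integer, as HBase cell versions are longs, with the file sequence in the high bits and an offset below 2^32; pack_version, latest_cell and the cell values are hypothetical.

```python
from typing import List, Tuple

Cell = Tuple[int, int, str]  # (version, commit_ts, value)


def pack_version(file_seq: int, offset: int) -> int:
    """Pack a binlog position into one integer that preserves binlog order."""
    if not 0 <= offset < 2 ** 32:
        raise ValueError("offset does not fit in 32 bits")
    return (file_seq << 32) | offset


def latest_cell(cells: List[Cell]) -> Cell:
    """The newest cell is the one with the highest binlog-derived version."""
    return max(cells, key=lambda cell: cell[0])


# Mirrors the table: positions 100..103 with commit timestamps T0, T1, T3, T2
# (represented here as the integers 0, 1, 3, 2; values are made up).
cells = [
    (pack_version(0, 100), 0, "a"),
    (pack_version(0, 101), 1, "b"),
    (pack_version(0, 102), 3, "c"),
    (pack_version(0, 103), 2, "d"),
]
```

Picking the newest cell by commit timestamp would return "c", the write at position 102, even though the write at position 103 was applied later in the binlog.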

Page 42: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

PITR Semantics

Binlog order: TXN 1 (COMMIT_TS: 101) → TXN 2 (COMMIT_TS: 103) → TXN 3 (COMMIT_TS: 102) → … → TXN N (COMMIT_TS: N’)

[Diagram label: NTP]

Page 43: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

PITR Semantics: Binlog Commit Time Index

Rows | Version (Logical Offset) | Value
<ShardKey><DB_TABLE_#1><2016-05-23 23><100> | 100 | mysql-bin.00000:100
<ShardKey><DB_TABLE_#1><2016-05-23 23><101> | 101 | mysql-bin.00000:101
<ShardKey><DB_TABLE_#1><2016-05-23 23><103> | 103 | mysql-bin.00000:103
<ShardKey><DB_TABLE_#1><2016-05-24 00><102> | 102 | mysql-bin.00000:102

Slide annotations: "First mutation across PITR"; "The last mutation before PITR"
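One way to read the commit time index above: index rows are keyed by table, commit-hour bucket, and logical offset, so a scan starting at the PITR hour lands on the first mutation that crosses the cut. The sketch below is an interpretation under assumptions, not the speakers' implementation: the "|" separator, zero-padded offsets, the function names, and the minute values in the sample data are all hypothetical, and the PITR time is assumed to fall on an hour boundary.

```python
import bisect
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

IndexRow = Tuple[str, str]  # (index row key, binlog position)


def index_key(db_table: str, commit_ts: datetime, offset: int) -> str:
    """Key = DB_TABLE + commit-hour bucket + zero-padded logical offset."""
    bucket = commit_ts.strftime("%Y-%m-%d %H")
    return f"{db_table}|{bucket}|{offset:012d}"


def build_index(db_table: str,
                mutations: Iterable[Tuple[int, datetime, str]]) -> List[IndexRow]:
    """mutations: (logical offset, commit timestamp, binlog position)."""
    return sorted((index_key(db_table, ts, off), pos)
                  for off, ts, pos in mutations)


def first_position_at_or_after(index: List[IndexRow], db_table: str,
                               pitr_hour: datetime) -> Optional[str]:
    """Binlog position of the first indexed mutation in or after the PITR hour."""
    prefix = f"{db_table}|{pitr_hour.strftime('%Y-%m-%d %H')}|"
    keys = [row_key for row_key, _ in index]
    i = bisect.bisect_left(keys, prefix)
    if i < len(index) and keys[i].startswith(f"{db_table}|"):
        return index[i][1]
    return None


mutations = [
    (100, datetime(2016, 5, 23, 23, 10), "mysql-bin.00000:100"),
    (101, datetime(2016, 5, 23, 23, 20), "mysql-bin.00000:101"),
    (102, datetime(2016, 5, 24, 0, 1), "mysql-bin.00000:102"),
    (103, datetime(2016, 5, 23, 23, 59), "mysql-bin.00000:103"),
]
index = build_index("DB_TABLE_1", mutations)
```

As in the table, offset 103 sorts before offset 102 because its commit hour is earlier, and a scan from the 2016-05-24 00 bucket starts at mysql-bin.00000:102.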

Page 44: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Streaming DB Export

• Pros:
  • Consistent
  • High SLA for the daily snapshot
  • Consistent as PITR semantics
  • Near real time ad hoc query
  • Hive/Spark compatible
  • Hourly snapshot view
  • Low storage cost

• Cons:
  • Schema change

Page 45: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Thank you!

Page 46: AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315)

Remember to complete your evaluations!