NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

93
@ksshams @swami_79 SPOT 401 - Leading the NoSQL Revolution: under the covers of Distributed Systems @ scale

description

The Dynamo paper started a revolution in distributed systems. The contributions from this paper are still impacting the design and practices of some of the world's largest distributed systems, including those at Amazon.com and beyond. Building distributed systems is hard, but our goal in this session is to simplify the complexity of this topic to empower the hacker in you! Have you been bitten by the eventual consistency bug lately? We show you how to tame eventual consistency and make it a great scaling asset. As you scale up, you must be ready to deal with node, rack, and data center failure. We share insights on how to limit the blast radius of the individual components of your system, battle tested techniques for simulating failures (network partitions, data center failure), and how we used core distributed systems fundamentals to build highly scalable, performance, durable, and resilient systems. Come watch us uncover the secret sauce behind Amazon DynamoDB, Amazon SQS, Amazon SNS, and the fundamental tenents that define them as Internet scale services. To turn this session into a hacker's dream, we go over design and implementation practices you can follow to build an application with virtually limitless scalability on AWS within an hour. We even share insights and secret tips on how to make the most out of one of the services released during the morning keynote.

Transcript of NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Page 1: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

SPOT 401 - Leading the NoSQL Revolution:

under the covers of Distributed Systems @ scale

Page 2: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

what are we covering?

The evolution of large scale

distributed systems @ Amazon from

the 90’s to today

The lessons we learned and insights you can employ in your own distributed systems

Page 3: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

let’s start with a story about a little

company called amazon.com

Page 4: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

once upon a time...

(in 2000)

episode 1

Page 5: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

a thousand miles

away...

(seattle)

Page 6: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

amazon.com - a rapidly growing Internet based retail

business relied on relational databases

Page 7: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

we had 1000s of independent services

Page 8: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

each service managed its own state in RDBMs

Page 9: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

RDBMs are actually kind of cool

Page 10: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

first of all... SQL!!

Page 11: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

so it is easier to query..

Page 12: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

easier to learn

Page 13: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

they are as versatile as a swiss army knife

complex queries

key-value access

transactions analytics

Page 14: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

RDBMs are *very* similar to

Swiss Army Knives

Page 15: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

but sometimes.. swiss army knifes..

can be more than what you bargained for

Page 16: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

partitioning

easy re-

partitioning HARD..

Page 17: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

so we bought

bigger boxes...

Page 18: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Q4 was hard-work at Amazon

benchmark new

hardware

migrate to new

hardware

repartition databases

pray ...

Page 19: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

RDBMs availability challenges..

Page 20: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

then.. (in 2005)

episode 2

Page 21: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

amazon dynamo predecessor to

dynamoDB

specialist tool :

•limited querying capabilities

•simpler consistency

replicated DHT with consistent hashing

optimistic replication

“sloppy quorum”

anti-entropy mechanism

object versioning

Page 22: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

dynamo had many benefits • higher availability

• we traded it off for eventual consistency

• incremental scalability

• no more repartitioning

• no need to architect apps for peak

• just add boxes

• simpler querying model ==>> predictable performance

Page 23: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

but dynamo was not perfect...

lacked strong consistency

Page 24: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

but dynamo was not perfect...

scaling was easier, but...

Page 25: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

but dynamo was not perfect...

steep learning curve

Page 26: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

but dynamo was not perfect...

dynamo was a product ... ==>> not a service...

Page 27: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

then.. (in 2012)

episode 3

Page 28: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

ADMIN

“Even though we have years of experience with large, complex

NoSQL architectures, we are happy to be finally out of the

business of managing it ourselves.” - Don MacAskill, CEO

• NoSQL database

• fast & predictable performance

• seamless scalability

• easy administration

DynamoDB

Page 29: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

services, services, services

Page 30: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

amazon.com’s experience with services

Page 31: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

how do you create a successful service?

Page 32: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

with great services, comes great responsibility

Page 33: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Architect

Customer

Page 34: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

DynamoDB Goals and Philosophies

never compromise on durability scale is our

problem easy to use

scale in rps consistent and low

latencies

Page 35: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

how to build these large scale services?

Page 36: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

don’t compromise on durability…

Page 37: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

don’t compromise on… availability

Page 38: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

plan for success, plan for scalability

Page 39: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Page 40: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Fault tolerant design is key.. • Everything fails all the time

• Planning for failures is not easy

• How do you ensure your recovery strategies work correctly?

Page 41: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Byzantine General Problem

Page 42: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

A simple 2-way replication system of a traditional database…

Primary Standby

Writes

Page 43: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79 @ksshams @swami_79

P S

S is dead, need to trigger new

replica

P is dead, need to promote myself

P’

Page 44: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79 @ksshams @swami_79

Improved Replication: Quorum

Replica

Replica

Writes Replica

Quorum: Successful write on a majority

Page 45: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Not so easy..

Replica B

Replica C

Writes from

client A Replica A

Replica D

New member in the

group

Should I continue to serve reads?

Should I start a new quorum?

Replica E Replica F

Reads and

Writes from

client B

Classic Split Brain Issue in Replicated systems leading to lost writes!

Page 46: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Building correct distributed systems is not straight forward..

• How do you handle replica failures?

• How do you ensure there is not a parallel

quorum?

• How do you handle partial failures of replicas?

• How do you handle concurrent failures?

Page 47: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

correctness is hard, but necessary

Page 48: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Formal Methods

Page 49: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Formal Methods

to minimize bugs, we must have a precise description of the design

Page 50: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Formal Methods

code is too detailed

how would you express partial failures or concurrency?

design documents and diagrams are vague & imprecise

Page 51: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Formal Methods

law of large numbers is your friend,

so design for scale

until you hit large numbers

Page 52: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

TLA+ to the rescue?

Page 53: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

PlusCal

Page 54: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

formal methods are necessary

but not sufficient..

Page 55: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

customer

Page 56: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

forget to test - no, serious don’t ly

Page 57: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

embrace failure and don’t be surprised

simulate failures at unit

test level

fault injection testing

datacenter testing

network brown out testing

scale testing

Page 58: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

testing is a lifelong journey

Page 59: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

testing is necessary but not sufficient..

Page 60: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Architect

Customer

Page 61: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

release cycle

gamma simulate real

world

one box does it work?

phased deployment

treading lightly

monitor does it still

work?

Page 62: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Monitor customer behavior

Page 63: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

measuring customer experience is key

don’t be satisfied by average - look at

99 percentile

Page 64: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

understand the scaling dimensions

Page 65: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

understand how your service will be abused

Page 66: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

let’s see these rules in action through a true story

Page 67: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

we were building distributed systems all over

amazon.com

Page 68: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

we needed a uniform and correct way to do

consensus..

Page 69: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

so we built a paxos lock library service

Page 70: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

such a service is so much more useful than just leader election.. it became a distributed

state store

Page 71: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

such a service is so much more useful than just leader election.. or a distributed state

store

wait wait.. you’re telling me if I poll,

I can detect node failure?

Page 72: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

we acted quickly - and scaled up our entire fleet

with more nodes

doh!!!!

we slowed consensus...

Page 73: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

understand the scaling dimensions

& scale them independently...

Page 74: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

State Store

a lock service has 3 components..

Page 75: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

State Store

they must be scaled independently..

Page 76: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

State Store

they must be scaled independently..

Page 77: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

State Store

they must be scaled independently..

Page 78: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Let’s Go Over The demo from this morning

Page 79: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

stream ingestion

Page 80: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

stream ingestion

Page 81: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

stream ingestion

Page 82: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Real-time tweet analytics using DynamoDB

• Stream from Kinesis to DynamoDB

• What data do want in real-time? • (per-second, top words)

• How does DynamoDB help? • Atomic counters (per-word counts in that second)

• Indexed queries (top N word-counts in that second

Page 83: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Time Word Count

2013-10-13T12:00 Earth 9

2013-10-13T12:00 Mars 10

2013-10-13T12:00 Pluto 5

2013-10-13T12:03 Earth 8

Time Count Word

2013-10-13T12:00 5 Pluto

2013-10-13T12:00 9 Earth

2013-10-13T12:00 10 Mars

2013-10-13T12:03 8 Earth

WordCount Table Local Secondary Index

Page 84: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

DynamoDB cost: $0.25 / hr

Page 85: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

stream ingestion

Page 86: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Aggregate queries using Redshift

• Simple Redshift connector (buffer files, store in s3, call copy command)

• Manifest copy connector • 2 streams

• transaction table for deduplication

• manifest copy

Page 87: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Right tool for right job…

• Canal -> DynamoDB -> Redshift -> Glacier…

Page 88: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

You are not done yet..

• Listen to customer feedback

• Iterate..

Page 89: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

Example: DynamoDB

• Start with immediate needs of reliable, super scalable, low latency datastore

• Iterate • Developers wanted flexible query: Local Secondary Indexes

• Developers wanted parallel loads: Parallel Scans

• Mobile developers wanted direct access to their datastore: Fine-grained Access Control

• Mobile developers wanted geo-awareness: Geospatial library

• Developers wanted DynamoDB on their laptop: DynamoDB Local

• Developers wanted richer query: Global Secondary Indexes

• We will continue to innovate..

Page 90: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79 @ksshams @swami_79

Sacred Tenets in Distributed Systems

plan for success – plan for scalability

don’t compromise durability for performance

plan for failures - fault -tolerance is key consistent performance

is important release - think of blast radius

insist on correctness

Page 91: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

understand scaling

dimensions

observe how service is

used

scalability

over features

strive for

correctness

relentlessly test

monitor like a hawk

Page 92: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79

Please give us your feedback on this

presentation

Don’t miss SPOT 201!!!

SPOT 401

Page 93: NoSQL Revolution: Under the Covers of Distributed Systems at Scale (SPOT401) | AWS re:Invent 2013

@ksshams @swami_79