C* Summit 2013: When Bad Things Happen to Good Data: A Deep Dive Into How Cassandra Resolves...

When Bad Things Happen to Good Data: Understanding Anti-Entropy in Cassandra Jason Brown @jasobrown [email protected]

description

This talk focuses on Cassandra's anti-entropy mechanisms. Jason will discuss the details of read repair, hinted handoff, node repair, and more as they aid in resolving data that has become inconsistent across nodes. In addition, he'll provide insight into how those techniques are used to ensure data consistency at Netflix.

Transcript of C* Summit 2013: When Bad Things Happen to Good Data: A Deep Dive Into How Cassandra Resolves...

Page 1: C* Summit 2013: When Bad Things Happen to Good Data: A Deep Dive Into How Cassandra Resolves Inconsistent Data by Jason Brown

When Bad Things Happen to Good Data:

Understanding Anti-Entropy in Cassandra

Jason Brown

@jasobrown [email protected]

Page 2:

About me

•  Senior Software Engineer @ Netflix

•  Apache Cassandra committer

•  E-Commerce Architect, Major League Baseball Advanced Media

•  Wireless developer (J2ME and BREW)

Page 3:

Maintaining consistent state is hard in a distributed system

CAP theorem works against you

Page 4:

Inconsistencies creep in

•  Node is down
•  Network partition
•  Dropped mutations
•  Process crash before commit log flush
•  File corruption

Cassandra trades C for AP

Page 5:

Anti-Entropy Overview

•  write time
   o  tunable consistency
   o  atomic batches
   o  hinted handoff

•  read time
   o  consistent reads
   o  read repair

•  maintenance time
   o  node repair

Page 6:

Write Time

Page 7:

Cassandra Writes Basics

•  determine all replica nodes in all DCs
•  send to replicas in local DC
•  send to one replica node in each remote DC
   o  it will forward to peers
•  all respond back to original coordinator
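The fan-out described on this slide can be sketched roughly as follows. This is an illustrative model only, not Cassandra's code: `plan_write` and its input shape are hypothetical, and the choice of forwarder in each remote DC is arbitrary here because the slides do not say how it is picked.

```python
from collections import defaultdict


def plan_write(replica_dcs, local_dc):
    """Split replicas into direct targets and per-DC forwarding duties.

    replica_dcs: dict mapping replica node name -> datacenter name
                 (hypothetical input shape for this sketch).
    Returns (direct, forwards) where `direct` lists the nodes the
    coordinator contacts itself and `forwards` maps each remote
    forwarder to the peers it relays the write to.
    """
    by_dc = defaultdict(list)
    for node, dc in replica_dcs.items():
        by_dc[dc].append(node)

    # Every replica in the coordinator's own DC is contacted directly.
    direct = list(by_dc.get(local_dc, []))
    forwards = {}
    for dc, nodes in by_dc.items():
        if dc == local_dc:
            continue
        # One node per remote DC receives the write and forwards it.
        # Picking the first name in sorted order is an assumption.
        forwarder, *peers = sorted(nodes)
        direct.append(forwarder)
        forwards[forwarder] = peers
    return direct, forwards
```

All replicas, forwarded or not, still acknowledge to the original coordinator, which is what the consistency level counts.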

Page 8:

Writes - request path

Page 9:

Writes - response path

Page 10:

Writes - Tunable consistency

Coordinator blocks for specified count of replicas to respond

•  consistency level
   o  ALL
   o  EACH_QUORUM
   o  LOCAL_QUORUM
   o  ONE / TWO / THREE
   o  ANY
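As a rough illustration of "specified count of replicas", the sketch below maps each consistency level on the slide to the number of acknowledgements the coordinator waits for. The function names and the `rf_by_dc` input are hypothetical; the quorum formula (replication factor divided by two, rounded down, plus one) is the standard majority rule.

```python
def quorum(replication_factor):
    """Smallest majority of a replica set."""
    return replication_factor // 2 + 1


def required_acks(level, rf_by_dc, local_dc):
    """Acknowledgements the coordinator blocks for (illustrative only).

    rf_by_dc: dict mapping datacenter name -> replication factor.
    """
    if level == "ALL":
        return sum(rf_by_dc.values())
    if level == "EACH_QUORUM":
        # A quorum is needed in every DC; the total is shown here,
        # but the requirement is really checked per datacenter.
        return sum(quorum(rf) for rf in rf_by_dc.values())
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])
    # ANY needs a single acknowledgement, which may be a stored hint
    # rather than a write on an actual replica.
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3, "ANY": 1}
    if level in fixed:
        return fixed[level]
    raise ValueError(f"unknown consistency level: {level}")
```

For example, with three replicas in each of two datacenters, LOCAL_QUORUM waits for 2 responses while ALL waits for 6.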

Page 11:

Hinted handoff

Save a copy of the write for down nodes, and replay later

hint = target replica + mutation data

Page 12:

Hinted handoff - storing

•  on coordinator, store a hint for any nodes not currently 'up'

•  if a replica doesn't respond within write_request_timeout_in_ms, store a hint

•  max_hint_window_in_ms - maximum amount of time a dead host will have hints generated.

Page 13:

Hinted handoff - replay

•  try to send hints to nodes
•  runs every ten minutes
•  multithreaded (as of 1.2)
•  throttleable (kb per second)
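The storing and replay rules from the last two slides can be modelled in a few lines. This is a toy sketch, not Cassandra's implementation: `Hint`, `Coordinator`, and the `send` callback are hypothetical names, the hint window is expressed in seconds for readability, and scheduling, threading, and throttling are left out.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Hint:
    target: str      # replica that missed the write
    mutation: dict   # the write to replay later


@dataclass
class Coordinator:
    # Stand-in for max_hint_window_in_ms; the value is an assumption.
    max_hint_window_s: float = 3 * 3600
    down_since: dict = field(default_factory=dict)  # node -> first failure time
    hints: list = field(default_factory=list)

    def write(self, mutation, replicas, send, now=None):
        """send(node, mutation) returns True if the node acked in time."""
        now = time.time() if now is None else now
        acks = 0
        for node in replicas:
            if send(node, mutation):
                self.down_since.pop(node, None)
                acks += 1
                continue
            # Stop generating hints once the node has been dead too long.
            first_down = self.down_since.setdefault(node, now)
            if now - first_down <= self.max_hint_window_s:
                self.hints.append(Hint(node, mutation))
        return acks

    def replay(self, send):
        """Try to deliver every stored hint; keep the ones that fail."""
        self.hints = [h for h in self.hints if not send(h.target, h.mutation)]
```

In this model a periodic task would call `replay`; a node that stays down past the window stops accumulating hints and has to be fixed by repair instead.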

Page 14:

Hinted Handoff - R2 down

R2 down, coordinator (R1) stores hint

Page 15:

Hinted handoff - replay

R2 comes back up, R1 plays hints for it

Page 16:

What if coordinator dies?

Page 17:

Atomic Batches

•  coordinator stores incoming mutation to two peers in same DC
   o  deletes from peers on successful completion

•  peers will replay the batch if not deleted
   o  runs every 60 seconds

•  with 1.2, all mutates use atomic batch
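A minimal sketch of the idea, assuming a simplified in-memory model: the batch is parked on two peers before it is applied, removed on success, and replayed by a peer if the coordinator never cleaned it up. `Peer`, `atomic_batch`, `replay_batchlog`, and the `crash_before_apply` flag are all hypothetical and exist only to simulate a coordinator failure.

```python
class Peer:
    """A node holding parked batches for other coordinators."""

    def __init__(self):
        self.batchlog = {}  # batch id -> list of mutations


def atomic_batch(batch_id, mutations, peers, apply, crash_before_apply=False):
    # Step 1: park a copy of the batch on the peers.
    for peer in peers:
        peer.batchlog[batch_id] = list(mutations)
    if crash_before_apply:
        return False  # simulated coordinator death
    # Step 2: apply the mutations.
    for mutation in mutations:
        apply(mutation)
    # Step 3: success, so the parked copies are deleted.
    for peer in peers:
        peer.batchlog.pop(batch_id, None)
    return True


def replay_batchlog(peer, apply):
    """Periodic task on a peer: replay any batch that was never deleted."""
    for batch_id in list(peer.batchlog):
        for mutation in peer.batchlog[batch_id]:
            apply(mutation)
        del peer.batchlog[batch_id]
```

Both peers may replay the same batch, so the sketch relies on `apply` being idempotent, which holds for writes that carry their own timestamps.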

Page 18:

Read Time

Page 19:

Cassandra Reads - setup

•  determine endpoints to invoke
   o  consistency level vs. read repair

•  first data node sends back the full data set, other nodes only return a digest

•  wait for the CL number of nodes to respond

Page 20:

LOCAL_QUORUM read

Pink nodes contain requested row key

Page 21:

Consistent reads

•  compare the digests of returned data sets
•  if any digests mismatch, send the request again to the same CL data nodes
   o  this time no digests, full data set
•  compare the full data sets, send updates to out-of-date replicas
•  block until those fixes are acknowledged
•  return data to caller
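The digest check and the follow-up repair can be sketched as below. Assumptions that are not on the slide: a row is modelled as a dict of column to (value, timestamp), conflicts are resolved by keeping the newest timestamp per column, and MD5 is used only as a stand-in digest. All function names are hypothetical.

```python
import hashlib


def digest(row):
    """Hash of a row's contents; row maps column -> (value, timestamp)."""
    h = hashlib.md5()
    for column in sorted(row):
        value, ts = row[column]
        h.update(f"{column}={value}@{ts};".encode())
    return h.hexdigest()


def reconcile(rows):
    """Merge full rows, keeping the newest timestamp for each column."""
    merged = {}
    for row in rows:
        for column, (value, ts) in row.items():
            if column not in merged or ts > merged[column][1]:
                merged[column] = (value, ts)
    return merged


def consistent_read(replicas):
    """replicas: dict node -> row, for the nodes that satisfy the CL.

    Returns (row for the caller, repairs to send to each stale node).
    """
    nodes = list(replicas)
    data = replicas[nodes[0]]  # first node returns the full data set
    expected = digest(data)
    # The remaining nodes only return digests on the first round.
    if all(digest(replicas[n]) == expected for n in nodes[1:]):
        return data, {}
    # Mismatch: second round with full data from the same nodes.
    merged = reconcile(replicas.values())
    repairs = {}
    for node, row in replicas.items():
        delta = {c: v for c, v in merged.items() if row.get(c) != v}
        if delta:
            repairs[node] = delta
    return merged, repairs
```

The caller would send each entry in `repairs` to its node and wait for the acknowledgements before returning the merged row.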

Page 22:

Read Repair

•  synchronizes the client-requested data amongst all replicas

•  piggy-backs on normal reads, but waits for all replicas to respond asynchronously

•  then, just like consistent reads, compares the digests, and fixes if needed

Page 23:

Read Repair

green lines = LOCAL_QUORUM nodes

blue lines = nodes for read repair

Page 24:

Read Repair - configuration

•  setting per column family
•  percentage of all calls to CF
•  Local DC vs. Global chance
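One way to picture the two chances is a single random roll per read that selects global repair, local-DC repair, or none. This is a sketch under that assumption; `read_repair_targets` and its parameters are hypothetical names, and the exact decision logic inside Cassandra may differ.

```python
import random


def read_repair_targets(replicas_by_dc, local_dc, global_chance,
                        local_chance, rng=random.random):
    """Pick which replicas take part in read repair for one read.

    replicas_by_dc: dict datacenter -> list of replica node names.
    global_chance / local_chance: probabilities in [0, 1], set per
    column family in this model.
    """
    roll = rng()
    if roll < global_chance:
        # Repair across every replica in every datacenter.
        return [n for nodes in replicas_by_dc.values() for n in nodes]
    if roll < local_chance:
        # Repair only the replicas in the coordinator's datacenter.
        return list(replicas_by_dc[local_dc])
    return []  # this read does not trigger read repair
```

With a global chance of 0.1 and a local chance of 0.5, roughly one read in ten would touch all replicas and four in ten would touch only the local ones.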

Page 25:

Read repair fixes data that is actually requested,

... but what about data that isn't requested?

Page 26:

Node Repair - introduction

•  repairs inconsistencies across all replicas for a given range

•  nodetool repair
   o  repairs the ranges the node contains
   o  one or more column families (within the same keyspace)
   o  can choose local datacenter only (c* 1.2)

Page 27:

Node Repair - cautions

•  should be part of standard operational maintenance for C*, especially if you delete data
   o  ensures tombstones are propagated, and avoids resurrected data

•  repair is IO and CPU intensive

Page 28:

Node Repair - details 1

•  determine peer nodes with matching ranges
•  triggers a major (validation) compaction on peer nodes
   o  read and generate hash for every row in CF
   o  add result to a Merkle Tree
   o  return tree to initiator

Page 29:

Node Repair - details 2

•  initiator awaits trees from all nodes
•  compares each tree to every other tree
•  if any differences exist, the two nodes exchange the conflicting ranges
   o  these ranges get written out as new, local sstables
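The tree building and comparison from these two slides can be illustrated with a small Merkle tree over a fixed number of token buckets. Everything here is a simplification: `bucket_of`, `build_tree`, and `diff_ranges` are hypothetical names, SHA-256 and the bucket count are arbitrary choices, and the number of leaves must be a power of two.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def bucket_of(key, leaves):
    """Map a row key to a contiguous token bucket (toy partitioner)."""
    token = int.from_bytes(_h(key.encode())[:4], "big")
    return token * leaves // 2**32


def build_tree(rows, leaves=8):
    """Hash every row into its bucket, then hash pairs up to the root.

    rows: dict row key -> value. Returns a list of levels, leaves first.
    """
    buckets = [[] for _ in range(leaves)]
    for key in sorted(rows):
        buckets[bucket_of(key, leaves)].append(f"{key}={rows[key]}".encode())
    level = [_h(b"".join(_h(r) for r in bucket)) for bucket in buckets]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff_ranges(a, b):
    """Leaf indices whose contents differ, found by walking from the root."""
    def walk(depth, idx):
        if a[depth][idx] == b[depth][idx]:
            return []  # identical subtree, nothing to exchange
        if depth == 0:
            return [idx]
        return walk(depth - 1, 2 * idx) + walk(depth - 1, 2 * idx + 1)

    return walk(len(a) - 1, 0)
```

Only the buckets returned by `diff_ranges` would need to be exchanged between the two nodes; matching subtrees are skipped without comparing their rows.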

Page 30:

'ABC' node is repair initiator

Page 31:

Nodes sharing range A

Page 32:

Nodes sharing range B

Page 33:

Nodes sharing range C

Page 34:

Five nodes participating in repair

Page 35:

Anti-Entropy wrap-up

•  CAP Theorem lives, tradeoffs must be made

•  C* contains processes to make diverging data sets consistent

•  Tunable controls exist at write and read times, as well as on demand

Page 36:

Thank you!

Q & A time

@jasobrown

Page 37:

Notes from Netflix

•  carefully tune RR_chance
•  schedule repair operations
•  tickler
•  store more hints vs. running repair