C* Summit 2013
When Bad Things Happen to Good Data:
Understanding Anti-Entropy in Cassandra
Jason Brown
@jasobrown [email protected]
About me
• Senior Software Engineer @ Netflix
• Apache Cassandra committer
• E-Commerce Architect, Major League Baseball Advanced Media
• Wireless developer (J2ME and BREW)
Maintaining consistent state is hard in a distributed system
CAP theorem works against you
Inconsistencies creep in
• Node is down
• Network partition
• Dropped mutations
• Process crash before commit log flush
• File corruption
Cassandra trades C for AP
Anti-Entropy Overview
• write time
  o tunable consistency
  o atomic batches
  o hinted handoff
• read time
  o consistent reads
  o read repair
• maintenance time
  o node repair
Write Time
Cassandra Writes Basics
• determine all replica nodes in all DCs
• send to replicas in the local DC
• send to one replica node in each remote DC
  o it will forward to its peers
• all replicas respond back to the original coordinator
Writes - request path
Writes - response path
Writes - Tunable consistency
The coordinator blocks until the specified count of replicas respond
• consistency level
  o ALL
  o EACH_QUORUM
  o LOCAL_QUORUM
  o ONE / TWO / THREE
  o ANY
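As a concrete illustration (not from the talk), here is what picking a consistency level looks like from a client; a minimal sketch assuming the DataStax Java driver 3.x API, with a made-up contact point, keyspace, and table:

    import com.datastax.driver.core.*;

    public class QuorumWrite {
        public static void main(String[] args) {
            // Hypothetical cluster and schema, for illustration only.
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("demo_ks");
            SimpleStatement write = new SimpleStatement(
                "INSERT INTO users (id, name) VALUES (?, ?)", 42, "jason");
            // The coordinator blocks until a quorum of replicas in the
            // local DC acknowledge the write.
            write.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
            session.execute(write);
            cluster.close();
        }
    }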
Hinted handoff
Save a copy of the write for down nodes, and replay later
hint = target replica + mutation data
Hinted handoff - storing
• on coordinator, store a hint for any nodes not currently 'up'
• if a replica doesn't respond within write_request_timeout_in_ms, store a hint
• max_hint_window_in_ms - the maximum amount of time hints will be generated for a dead host
Hinted handoff - replay
• try to send hints to nodes
• runs every ten minutes
• multithreaded (as of 1.2)
• throttleable (KB per second)
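These knobs live in cassandra.yaml. An illustrative excerpt; the setting names are real, but the values shown are the 1.2-era defaults as best I recall, so check your version:

    # cassandra.yaml (illustrative values)
    hinted_handoff_enabled: true
    write_request_timeout_in_ms: 10000     # no replica ack within this window => store a hint
    max_hint_window_in_ms: 10800000        # stop generating hints for a host dead > 3 hours
    hinted_handoff_throttle_in_kb: 1024    # replay throttle, KB/sec per delivery thread
    max_hints_delivery_threads: 2          # replay is multithreaded as of 1.2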
Hinted Handoff - R2 down
R2 down, coordinator (R1) stores hint
Hinted handoff - replay
R2 comes back up, R1 plays hints for it
What if coordinator dies?
Atomic Batches
• coordinator stores the incoming mutation on two peers in the same DC
  o deletes it from the peers on successful completion
• peers will replay the batch if it is not deleted
  o replay runs every 60 seconds
• with 1.2, all mutations use atomic batches
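From the client side, a logged (atomic) batch can also be requested explicitly. A minimal sketch with the same assumed driver and hypothetical users table as above:

    // Logged batch: the coordinator writes the batch to the batchlog
    // on two peers before applying it, so it can be replayed if the
    // coordinator dies mid-write.
    BatchStatement batch = new BatchStatement(BatchStatement.Type.LOGGED);
    batch.add(new SimpleStatement(
        "INSERT INTO users (id, name) VALUES (?, ?)", 1, "alice"));
    batch.add(new SimpleStatement(
        "INSERT INTO users (id, name) VALUES (?, ?)", 2, "bob"));
    session.execute(batch);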
Read Time
Cassandra Reads - setup
• determine endpoints to invoke
  o consistency level vs. read repair
• the first data node sends back the full data set; the other nodes return only a digest
• wait for the CL number of nodes to respond
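The read side looks symmetric from the client; a minimal sketch reusing the assumed session and hypothetical users table from the write example (the full-data-vs-digest exchange happens server-side and is invisible to the client):

    SimpleStatement read = new SimpleStatement(
        "SELECT name FROM users WHERE id = ?", 42);
    // Coordinator waits for a quorum of local-DC replicas: one full
    // data set plus digests from the rest.
    read.setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
    Row row = session.execute(read).one();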
LOCAL_QUORUM read
Pink nodes contain the requested row key
Consistent reads
• compare the digests of the returned data sets
• if any mismatch, send the request again to the same CL data nodes
  o this time no digests, full data sets
• compare the full data sets, send updates to out-of-date replicas
• block until those fixes are acknowledged
• return data to the caller
Read Repair
• synchronizes the client-requested data amongst all replicas
• piggy-backs on normal reads, but waits for all replicas to respond asynchronously
• then, just like consistent reads, compares the digests and fixes if needed
Read Repair
green lines = LOCAL_QUORUM nodes
blue lines = nodes for read repair
Read Repair - configuration
• configured per column family
• percentage of all calls to the CF
• local-DC vs. global chance
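Concretely, these are table properties in 1.2-era CQL3. A hedged sketch, again reusing the assumed session and hypothetical table; the percentages are arbitrary:

    // 10% of all reads trigger a global read repair; the dclocal
    // chance repairs only replicas in the coordinator's DC.
    session.execute("ALTER TABLE users"
        + " WITH read_repair_chance = 0.1"
        + " AND dclocal_read_repair_chance = 0.05");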
Read repair fixes data that is actually requested,
... but what about data that isn't requested?
Node Repair - introduction
• repairs inconsistencies across all replicas for a given range
• nodetool repair
  o repairs the ranges the node contains
  o one or more column families (within the same keyspace)
  o can choose local datacenter only (c* 1.2)
• should be part of standard operations maintenance for c*, especially if you delete data
  o ensures tombstones are propagated, and avoids resurrected data
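The talk doesn't show commands, but typical invocations might look like the following; flag spellings vary by version, and -local is my recollection of the 1.2-era spelling for DC-local repair:

    # repair all ranges this node replicates, for one keyspace/CF
    nodetool repair demo_ks users

    # repair only this node's primary ranges (run on every node in turn)
    nodetool repair -pr demo_ks

    # restrict repair to replicas in the local datacenter (c* 1.2)
    nodetool repair -local demo_ks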
Node Repair - cautions
• repair is IO and CPU intensive
Node Repair - details 1
• determine peer nodes with matching ranges
• trigger a major (validation) compaction on the peer nodes
  o read and generate a hash for every row in the CF
  o add the result to a Merkle Tree
  o return the tree to the initiator
Node Repair - details 2
• initiator awaits trees from all nodes
• compares each tree to every other tree
• if any differences exist, the two nodes exchange the conflicting ranges
  o these ranges get written out as new, local sstables
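To make the tree comparison concrete, here is a toy sketch of the idea, not Cassandra's actual MerkleTree implementation: each leaf hashes the rows of one sub-range, and two replicas only need to exchange the sub-ranges whose hashes differ. (A real Merkle tree also hashes interior nodes, so matching subtrees can be pruned without comparing every leaf.)

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.*;

    public class ToyMerkle {
        // Hash the rows of one sub-range into a single leaf digest.
        static byte[] leaf(String rowsInRange) throws Exception {
            return MessageDigest.getInstance("SHA-256")
                    .digest(rowsInRange.getBytes(StandardCharsets.UTF_8));
        }

        // Two replicas agree on a sub-range iff its leaf digests match;
        // only the mismatching sub-ranges need to be streamed.
        static List<Integer> mismatchedRanges(List<byte[]> a, List<byte[]> b) {
            List<Integer> diffs = new ArrayList<>();
            for (int i = 0; i < a.size(); i++)
                if (!MessageDigest.isEqual(a.get(i), b.get(i))) diffs.add(i);
            return diffs;
        }

        public static void main(String[] args) throws Exception {
            List<byte[]> replica1 = Arrays.asList(leaf("r1,r2"), leaf("r3,r4"));
            List<byte[]> replica2 = Arrays.asList(leaf("r1,r2"), leaf("r3,r4-stale"));
            // Prints [1]: only the second sub-range must be exchanged.
            System.out.println("exchange ranges: " + mismatchedRanges(replica1, replica2));
        }
    }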
'ABC' node is repair initiator
Nodes sharing range A
Nodes sharing range B
Nodes sharing range C
Five nodes participating in repair
Anti-Entropy wrap-up
• CAP Theorem lives, tradeoffs must be made
• C* contains processes to make diverging data sets consistent
• Tunable controls exist at write and read times, as well as on demand
Thank you!
Q & A time
@jasobrown
Notes from Netflix
• carefully tune RR_chance
• schedule repair operations
• tickler - a Netflix-internal process that reads every row so the consistent-read path repairs it
• store more hints vs. running repair