Post on 19-Jul-2015
Phoenix Cassandra Users Meetup
January 26th, 2015
Narasimhan Sampath
Choice Hotels International
Cassandra Internals
What is Cassandra
SEDA
Data Placement, Replication and Partition Aware Drivers
Read and Write Path
Merkle Trees, SSTables, Read Repair and Compaction
Single and Multi-threaded Operations
Demo
Agenda
Cassandra is a decentralized, distributed database
No master or slave nodes
No single point of failure
Peer-to-peer architecture
Read / write to any available node
Replication and data redundancy built into the architecture
Data is eventually consistent across all cluster nodes
Linearly (and massively) scalable
Multiple data center support built in – a single cluster can span geo locations
Adding or removing nodes / data centers is easy and does not require downtime
Data redistribution / rebalancing is seamless and non-blocking
Runs on commodity hardware
Hardware failure is expected and factored into the architecture
Internal architecture is more complex than that of non-distributed databases
Cassandra
Automatic sharding (partitioning)
The total data to be managed by the cluster is (ideally) divided equally among the cluster nodes
Each node is responsible for a subset of the data
Copies of that subset are stored on other nodes for high availability and redundancy
Data placement design determines node balancing (token assignment, adding and removing nodes)
Data Synchronization within the decentralized cluster is complex, but implementation mostly hidden from the users
Availability and partition tolerance are given precedence over consistency (CAP – data is eventually consistent)
Consistency – all nodes see the same data at the same time
Availability – a guarantee that every request receives a response about whether it succeeded or failed
Partition tolerance – the system continues to operate despite a part of the system failing
Brewer’s CAP theorem (For further reading)
Staged Event Driven Architecture – a framework for achieving high concurrency under heavy load
Uses events, messages and queues to process tasks
Decouples the request and response from the worker threads
Cassandra
Ring – Visual representation of data managed by Cassandra
Node – Individual machine in the ring
Data Center – A collection of related nodes
Cluster – Collection of (geographically separated) data centers
Commitlog – The equivalent of a transaction log file for Durability
Memtable – In Memory structures to store data (per column family)
Keyspace – Container for application data (Analogous to schema)
Table – Structure that holds data in rows and columns
SSTable – An immutable file (for each table) on disk to which data structures in memory are dumped periodically
Cassandra Terminology
Gossip – Peer to Peer protocol to discover and share location and state information on nodes
Tokens – A number used to assign a range of data to a node within a datacenter
Partitioner – A Hashing function for deriving the token
Snitch – informs Cassandra about the network topology
Replica – a copy of data stored on a different node for redundancy and fault tolerance
Replication Factor – the total number of copies of each piece of data across the cluster
Terminology
Cassandra is linearly (horizontally) and massively scalable
Just add or remove nodes as load increases or decreases
No downtime is required for this
SEDA – Staged Event Driven Architecture helps maintain consistent throughput
Core Strength - Scalability
Quantifying Massive
Avoids the pitfalls of Client Server based design
Eliminates storage bottlenecks – no single data repository
Redundancy built in
All nodes participate (whether they have the requested data or not)
Shared nothing
Transparently add / remove nodes as necessary without downtime
Comes with a trade-off – eventual consistency (CAP)
Newer Staged Event Driven Architecture
How does it Scale?
Legacy systems typically use thread based concurrency models
Programming traditional multi-threaded applications is hard
Distributed multi-threaded applications are even harder
Leads to severe scalability bottlenecks
A new thread or process is usually created for each request
There is a maximum number of threads a system can support
Challenges with the thread execution model:
Deadlocks
Livelocks (waste CPU cycles)
Starvation (waiting for resources)
Overheads – Context switching, synchronization and data movement
Request and response typically handled by the same thread
Sequential execution
Legacy Systems
Threads
Event Driven Architecture
Evolution of Event Driven Architecture (EDA)
This consists of a set of loosely coupled software components and services
An event is something that an application can act upon: a hotel booking event, a check-in event
A listener can pick up a check-in event and act on it: the in-room entertainment system displays a personalized greeting; partners may get notified and can send personalized offers (spa / massage / restaurant discounts)
This is much more scalable than thread based concurrency models
SEDA is an Architectural approach
An application is broken down into a set of logical stages
These stages are loosely coupled and connected via queues
Decouples event and thread scheduling from DB Engine logic
Prevents resources from being overcommitted under high load
Enables modularity and code reuse
SEDA Explained
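As a rough illustration (a toy sketch, not Cassandra's actual implementation), the staged model above can be expressed in Python: each stage owns a queue and a worker thread, and a request hops from stage to stage instead of one thread carrying it end to end. The stage names here are hypothetical.

```python
import queue
import threading

class Stage:
    """A toy SEDA stage: a queue plus one worker thread that applies
    a handler to each event and forwards the result to the next stage."""
    def __init__(self, name, handler, next_stage=None):
        self.name = name
        self.handler = handler
        self.next_stage = next_stage
        self.queue = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def submit(self, event):
        self.queue.put(event)      # submitting thread returns immediately

    def _run(self):
        while True:
            event = self.queue.get()
            if event is None:      # shutdown sentinel
                break
            result = self.handler(event)
            if self.next_stage:
                self.next_stage.submit(result)

    def stop(self):
        self.queue.put(None)
        self.thread.join()

results = []
respond = Stage("respond", results.append)
mutate  = Stage("mutate",  lambda e: e.upper(), next_stage=respond)
parse   = Stage("parse",   lambda e: e.strip(), next_stage=mutate)

for req in ["  insert a  ", "  insert b  "]:
    parse.submit(req)

for s in (parse, mutate, respond):   # drain each stage in pipeline order
    s.stop()
print(results)   # ['INSERT A', 'INSERT B']
```

Because each stage has its own queue, a slow stage backs up its queue instead of blocking the threads of earlier stages, which is the degradation behavior described above.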
Understanding Stage (SEDA)
SEDA enables massive concurrency
No thread deadlocks, livelocks or starvation to worry about (for the most part)
Thread Scheduling and Resource Management abstracted
Supports self tuning / resource allocation / management
Easier to debug and monitor application performance at scale
Distributed debugging / tracing easier
Graceful degradation under excessive load
Maintains throughput at the expense of latency
Why SEDA matters
Examples of Stages
Data Placement
[Figure: Facebook’s data center]
Each Cassandra node has a listen and a broadcast IP address
Snitch maps IP address to Racks and Data Centers
Gossip uses this information to help Cassandra build node location map
Snitch helps Cassandra with replica placement
Helps Cassandra minimize cross data center latency
Role of Snitch
Once built and configured, a cluster is ready to store data
Each node owns a token range
Token ranges can be manually assigned in the YAML file
Or Cassandra can manage token assignment – a concept called vnodes
A Keyspace needs to be created with replication options
CREATE KEYSPACE "Choice"
WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
Cassandra Schema objects are replicated globally to all nodes
This enables each node in the cluster to act as a coordinator node
Data Placement
Data gets replicated as defined in the Keyspace
Within a data center, the Murmur3 hash of the partition key decides which node owns the data
Replication Strategy determines which nodes contain replicas
Simple Strategy – Replicas are placed in succeeding nodes
Network Topology – Walks the ring clockwise and places each copy on the first node on successive racks
Asymmetric replica groupings are possible (DR / Analytics etc.)
Data Placement
empID | empName | deptID | deptName        | hiredate
22    | Sam     | 12     | Finance         | 1/22/1996
33    | Scott   | 18     | Human Resources | 12/8/2006
44    | Walter  | 24     | Shipping        | 11/20/2009
55    | Bianca  | 30     | Marketing       | 1/1/2015
Data Placement
Partition       | Sample Hash
Finance         | -2245462676723220000
Human Resources | 7723358927203680000
Shipping        | -6723372854036780000
Marketing       | 1168604627387940000
Data Placement
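A toy sketch of how those sample hashes map onto a token ring (the four token assignments below are hypothetical; a real cluster uses vnodes with many tokens per node). Each partition belongs to the first node whose token is greater than or equal to the partition's Murmur3 hash, wrapping around the ring:

```python
import bisect

# Hypothetical single-token assignments for a 4-node ring; the Murmur3
# token space runs from -2**63 to 2**63 - 1.
ring = {
    -4611686018427387904: "node1",
    0:                    "node2",
     4611686018427387904: "node3",
     9223372036854775807: "node4",
}
tokens = sorted(ring)

def owner(token):
    """First node whose token is >= the partition's token,
    wrapping around the ring if necessary."""
    i = bisect.bisect_left(tokens, token)
    return ring[tokens[i % len(tokens)]]

# Sample hashes from the table above
print(owner(-2245462676723220000))  # node2 (Finance)
print(owner( 7723358927203680000))  # node4 (Human Resources)
print(owner(-6723372854036780000))  # node1 (Shipping)
print(owner( 1168604627387940000))  # node3 (Marketing)
```

Because the hash output is uniformly distributed, partitions spread roughly evenly across the token ranges, which is what gives the automatic sharding described earlier.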
Cassandra’s location-independent architecture means a user can connect to any node of the cluster, which then acts as the coordinator node
Schemas get replicated globally – even to nodes that do not contain a copy of the data
Cassandra offers tunable consistency – an extension of eventual consistency
Clients determine how consistent the data should be
They can choose between high availability (CL ONE) and high safety (CL ALL) among other options
Further reading
A request goes through stages – the thread that received the initial request inserts the request into a queue and then waits for the next user request
Partition aware drivers help route traffic to the nearest node
Hinted Hand-offs – store and forward write requests
Data Access
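The arithmetic behind tunable consistency can be sketched in a few lines (a simplified model, not Cassandra's code): QUORUM is a majority of replicas, and a read is guaranteed to see the latest write when the write and read replica counts overlap, i.e. W + R > RF.

```python
def quorum(rf):
    """Quorum for a given replication factor: a majority of replicas."""
    return rf // 2 + 1

def strongly_consistent(write_replicas, read_replicas, rf):
    """Reads overlap the latest write when W + R > RF."""
    return write_replicas + read_replicas > rf

rf = 3
print(quorum(rf))                                       # 2
print(strongly_consistent(quorum(rf), quorum(rf), rf))  # True  (QUORUM writes + QUORUM reads)
print(strongly_consistent(1, 1, rf))                    # False (CL ONE both ways: eventual only)
```

This is why QUORUM/QUORUM is a common middle ground between the high availability of CL ONE and the high safety of CL ALL.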
Memtables
Commitlog
SSTables
Tombstones
Compaction
Repair
Reads and Writes
Write Process
Write requests are written to a MemTable
When the MemTable is full, its contents get queued to be flushed to disk
Writes are also simultaneously persisted on disk to a CommitLog file
This helps achieve durable writes
CommitLog entries are purged after MemTable is flushed to disk
MemTables and SSTables are created on a per table basis
Tunable consistency determines how many nodes’ MemTables and CommitLogs the row has to be written to
SSTables are immutable and cannot be modified once written
Compaction consolidates SSTables and removes tombstones
SizeTiered Compaction
Leveled Compaction
Repair is a process that synchronizes copies located in different nodes Uses Merkle Trees to make this more efficient
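The write path above can be sketched as a toy model (the memtable size limit and key names are hypothetical; real flushes are triggered by configurable memory thresholds): every write is appended to the commit log for durability and applied to the memtable, and a full memtable is flushed to an immutable SSTable, after which its commit-log entries are purged.

```python
MEMTABLE_LIMIT = 2   # hypothetical flush threshold for the sketch

commitlog = []       # durable append-only log
memtable = {}        # in-memory structure, one per table
sstables = []        # immutable sorted snapshots "on disk"

def write(key, value):
    commitlog.append((key, value))   # durable append first
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    # Dump the memtable as an immutable, sorted SSTable
    sstables.append(tuple(sorted(memtable.items())))
    memtable.clear()
    commitlog.clear()   # entries are now safe in an SSTable

write("emp:22", "Sam")
write("emp:33", "Scott")    # second write fills the memtable -> flush
write("emp:44", "Walter")
print(sstables)   # [(('emp:22', 'Sam'), ('emp:33', 'Scott'))]
print(memtable)   # {'emp:44': 'Walter'}
print(commitlog)  # [('emp:44', 'Walter')]
```

Note that the SSTable is a tuple (immutable), matching the point above that SSTables are never modified once written; only compaction produces new files.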
Hinted hand-off is a feature that enables high write availability
It is enabled / disabled in the YAML file
When a replica node is down
A hint is stored in the coordinator node
Hints are stored for three hours (default)
Hinted writes do not count towards CL
Replaying hints does not affect system performance
Hinted Hand-off
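A minimal sketch of the store-and-forward behavior above (node names and mutation strings are hypothetical; the three-hour window is the default mentioned in the slide). A write to a downed replica is stored as a hint on the coordinator and replayed when the node returns, and hints older than the window are dropped, leaving repair to fix the replica:

```python
HINT_WINDOW = 3 * 60 * 60   # default: hints kept for three hours

hints = []   # (timestamp, target_node, mutation) stored on the coordinator

def write_to(node, mutation, node_is_up, now):
    if node_is_up:
        return "applied"
    hints.append((now, node, mutation))   # store-and-forward
    return "hinted"

def replay_hints(node, now):
    """Deliver stored hints to a node that came back online.
    Expired hints are dropped; repair must fix those later."""
    delivered, remaining = [], []
    for ts, target, mutation in hints:
        if target != node:
            remaining.append((ts, target, mutation))
        elif now - ts <= HINT_WINDOW:
            delivered.append(mutation)
        # else: hint expired, silently dropped
    hints[:] = remaining
    return delivered

t0 = 0
write_to("node3", "DELETE emp:22", node_is_up=False, now=t0)
write_to("node3", "INSERT emp:55", node_is_up=False, now=t0 + 4 * 3600)
replayed = replay_hints("node3", now=t0 + 5 * 3600)
print(replayed)   # ['INSERT emp:55'] - the 5-hour-old delete expired
```

The expired delete in this sketch is exactly the failure mode discussed later under "Why are Distributed Deletes hard?".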
A row of data will likely exist in multiple locations:
The unflushed Memtable
Un-compacted and compacted SSTables
Tunable consistency determines how many nodes have to respond
Cassandra does not rewrite the entire row to a new file on update
No read before writes
Updated / new columns exist in a new file
Unmodified columns exist in the old file
The timestamped version of the row can be different in each location
All these must be retrieved, reconstructed and processed based on timestamp
Uses Bloom filters to make key lookups more efficient
Row fragments may exist in multiple SSTables
May exist in Memtable as well
Bloom filters speed lookups
Read Path
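The reconstruction step above can be sketched as a timestamp merge (a toy model; column names and timestamps are hypothetical): fragments of the same row from the memtable and several SSTables are combined column by column, and the newest timestamp wins.

```python
def merge_fragments(*fragments):
    """Reconstruct a row from fragments. Each fragment maps
    column -> (timestamp, value); the newest timestamp wins per column."""
    row = {}
    for fragment in fragments:
        for col, (ts, val) in fragment.items():
            if col not in row or ts > row[col][0]:
                row[col] = (ts, val)
    return {col: val for col, (ts, val) in row.items()}

old_sstable = {"deptName": (100, "Finance"), "empName": (100, "Sam")}
new_sstable = {"deptName": (200, "Accounting")}   # update wrote only this column
memtable    = {"hiredate": (300, "1/22/1996")}    # not yet flushed

row = merge_fragments(old_sstable, new_sstable, memtable)
print(row)   # {'deptName': 'Accounting', 'empName': 'Sam', 'hiredate': '1/22/1996'}
```

This is why a read may touch several files: the unmodified `empName` still lives only in the old SSTable, while the updated `deptName` lives in the new one.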
A Bloom filter is a probabilistic bit-vector data structure
It supports two operations – test and add
Cassandra uses Bloom filters to reduce Disk I/O during key lookup
Each SSTable has a bloom filter associated with it
A Bloom filter is used to test if an element is a part of a set
False positives are possible, but false negatives are not
This means a key is “possibly in set” or “definitely not in set”
Check out JasonDavies.com for a cool interactive demo
http://www.jasondavies.com/bloomfilter/
http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
Bloom Filters
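A minimal Bloom filter sketch (toy parameters, not Cassandra's implementation, which sizes its filters per SSTable): `add` sets k bit positions derived from the key, and `test` reports "possibly in set" only if all k bits are set, so false negatives are impossible.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit vector."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        # Derive k independent positions from one cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def test(self, key):
        """True means 'possibly in set'; False means 'definitely not in set'."""
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("emp:22")
bf.add("emp:33")
print(bf.test("emp:22"))   # True - added keys always test positive
print(bf.test("emp:99"))   # almost certainly False, but a false positive is possible
```

Per SSTable, a negative test lets Cassandra skip the file entirely without touching disk, which is the I/O saving described above.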
Deletes are handled differently when compared to traditional RDBMS
Data to be deleted is marked using Tombstones (using a write operation)
Actual removal takes place later during compaction
Run Repair on each node within 10 days (default)
Repair removes inconsistencies between replicas
Inconsistencies happen because nodes can be down for longer than hinted handoff window, thereby missing deletes/updates
Distributed deletes are hard in a peer to peer system that has no SPOF
Deletes
Distributed Systems are eventually consistent
Only a small number of nodes have to respond for successful (delete) operation
As the delete command propagates through the system, some nodes may be unavailable
The commands are stored (as hinted hand-offs) and will be delivered when the downed node comes online
The delete command may be “lost” if the downed node does not come back within the hinted hand-off window (default 3 hours)
Why are Distributed Deletes hard?
Cassandra does not support in-row updates
Updates are implemented as a delete and an insert
Updated values are written to a new file
Unmodified columns of the original row exist in old file
Compaction consolidates all values and writes row to new file
Updates
Cassandra does not perform in-place updates or deletes
Instead the new data is written to a new SSTable file
Cassandra marks data to be deleted using markers called Tombstones
Tombstones exist for the time period defined by GC_GRACE_SECONDS
Compaction merges data in each SSTable by partition key
Evicts tombstones, deletes data and consolidates SSTables into a single SSTable
Old SSTables are deleted as soon as existing reads complete
Compaction
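The merge-and-evict behavior above can be sketched as follows (a toy model; the key names and timestamps are hypothetical, and the 10-day grace period is the default cited later for GC_GRACE_SECONDS): the newest version of each value wins regardless of which SSTable it came from, and tombstones older than the grace period are evicted.

```python
TOMBSTONE = object()          # deletion marker written like any other value
GC_GRACE = 10 * 24 * 3600     # default gc_grace_seconds: 10 days

def compact(sstables, now):
    """Merge SSTables by key (newest timestamp wins), then evict
    tombstones that have outlived the grace period."""
    merged = {}
    for table in sstables:
        for key, (ts, val) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, val)
    return {
        key: (ts, val)
        for key, (ts, val) in merged.items()
        if not (val is TOMBSTONE and now - ts > GC_GRACE)
    }

day = 24 * 3600
sstable1 = {"emp:22": (0, "Sam"), "emp:33": (0, "Scott")}
sstable2 = {"emp:22": (1 * day, TOMBSTONE)}   # delete written as a marker
result = compact([sstable1, sstable2], now=20 * day)
print(sorted(result))   # ['emp:33'] - the expired tombstone and its data are gone
```

Note that a tombstone still inside the grace period would be kept, which is why repair must run within the GC_GRACE_SECONDS window: replicas that missed the delete must see the tombstone before it is evicted.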
Read Repair and Node Repair
Read Repair synchronizes data requested in a read operation
Node repair synchronizes all data (for a range) in a node with all replicas
Node repair needs to be scheduled to run at least once within the GC_GRACE_SECONDS Window (default 10 days)
Repair
There are two stages to the repair process:
Build a Merkle Tree on each replica
Compare the trees from the replicas to find differences
Once the comparison completes, the changes stream over
Streams are written to new SSTables
Repair is a resource intensive operation
Read up on Advanced Repair techniques
Repair Process
The distributed, decentralized nature of Cassandra requires repair operations
Repair involves comparing all data elements in each replica and updating the data
This happens asynchronously and in the background
Cassandra uses Merkle Trees to detect data inconsistencies more quickly and to minimize the data transferred between nodes
A Merkle Tree is an inverted hash tree structure
Used to compare data stored in different nodes
Partial branches of tree can be compared
Minimizes repair time and traffic between nodes
Merkle Trees
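A toy sketch of the partial-branch comparison above (the range contents are hypothetical; Cassandra builds its trees over hashes of token-range data): two replicas build a hash tree over their ranges, and the comparison descends only into branches whose hashes differ, so identical subtrees are skipped entirely.

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_tree(leaves):
    """Build a list of levels: leaves first, root last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(a, b, level=None, index=0):
    """Walk two trees top-down, descending only into differing branches;
    return the indices of leaf ranges that need to be streamed."""
    if level is None:
        level = len(a) - 1            # start at the root
    if a[level][index] == b[level][index]:
        return []                     # identical subtree: skip entirely
    if level == 0:
        return [index]                # a differing leaf range
    return (diff_leaves(a, b, level - 1, 2 * index) +
            diff_leaves(a, b, level - 1, 2 * index + 1))

# Hash of each token range's data on two replicas (4 ranges)
replica1 = [h("range0:v1"), h("range1:v1"), h("range2:v1"), h("range3:v1")]
replica2 = [h("range0:v1"), h("range1:v2"), h("range2:v1"), h("range3:v1")]

t1, t2 = merkle_tree(replica1), merkle_tree(replica2)
print(t1[-1] == t2[-1])        # False - root hashes differ, repair needed
print(diff_leaves(t1, t2))     # [1]   - only range 1 must be streamed
```

Comparing roots first is what makes repair cheap when replicas are already in sync: one hash comparison confirms agreement across all ranges.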
Single threaded Operations
Some Examples of Single threaded operations:
Merkle Tree Comparison
Triggering Repair
Deleting files (obsolete SSTables, CommitLog segments)
Gossip
Hinted Handoff (default value = 1)
Message Streaming
This demo is to help get a better understanding of:
Gossip
Replication
Data Manipulation (Inserts, Updates, Deletes)
Role of Memtable, CommitLog and Tombstones
Compaction
Demo
Demo - Steps
Modify core cluster and table settings
Insert Data in one node
Verify Replication
Shut down one node
Continue DML operations
Start the downed node
Understand Outcome
Let’s see it!
Demo Time
Commands issued to Cassandra when one node was down
Demo commands
Expected results
Actual Results
Results
Demo Recap
What just happened?
Inserts disappeared
Updates rolled back
Deletes reappeared
What happened to Durability?
And this thing called eventual consistency?
All nodes were up and running
Initial writes came in, got persisted and replicated
All nodes have received the data and are in sync.
Memtable flush, compaction and SSTable consolidation occur
This clears the memory and the commit log
None of the 3 nodes have any entries in the commit log for these rows
Data exists in SSTables, so queries return data back to the user
What really happened?
One node is brought down
The state is preserved in that node
Inserts / Updates and Deletes continue in other nodes
Replication and Synchronization happens
Consolidation and Compaction happens on the other 2 nodes
Every time this happens, commit log is cleared and tombstones evicted
gc_grace_seconds & hinted_handoff play a critical role for this demo to work
3rd node that was down is brought up and it starts synchronizing
It still has the original state preserved and sends that copy to the other 2 nodes
Other 2 nodes receive the data and look for commit log entries and Tombstones locally
When the nodes do not find the entries, they proceed to apply that change (as new data) and the system reverts back
What really happened?
http://www.Datastax.com
http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf
http://berb.github.io/diploma-thesis/original/052_threads.html
Choice Hotels is hiring!
Please contact Jeremiah Anderson for details.
Jeremiah_Anderson@choicehotels.com
References