Post on 19-Jul-2015
Phoenix Cassandra Users Meetup
January 26th, 2015
Narasimhan Sampath
Choice Hotels International
Cassandra Internals
What is Cassandra
SEDA
Data Placement, Replication and Partition Aware Drivers
Read and Write Path
Merkle Trees, SSTables, Read Repair and Compaction
Single and Multi-threaded Operations
Demo
Agenda
Cassandra is a decentralized, distributed database
No master or slave nodes
No single point of failure
Peer-to-peer architecture
Read / write to any available node
Replication and data redundancy built into the architecture
Data is eventually consistent across all cluster nodes
Linearly (and massively) scalable
Multiple data center support built in – a single cluster can span geo locations
Adding or removing nodes / data centers is easy and does not require downtime
Data redistribution / rebalancing is seamless and non-blocking
Runs on commodity hardware
Hardware failure is expected and factored into the architecture
Internal architecture is more complex than that of non-distributed databases
Cassandra
Automatic sharding (partitioning)
The total data to be managed by the cluster is (ideally) divided equally among the cluster nodes
Each node is responsible for a subset of the data
Copies of that subset are stored on other nodes for high availability and redundancy
Data placement design determines node balancing (token assignment, adding and removing nodes)
Data Synchronization within the decentralized cluster is complex, but implementation mostly hidden from the users
Availability and partition tolerance are given precedence over consistency (CAP – data is eventually consistent)
Consistency – all nodes see the same data at the same time
Availability – a guarantee that every request receives a response about whether it succeeded or failed
Partition tolerance – the system continues to operate despite a part of the system failing
Brewer’s CAP theorem (For further reading)
Staged Event Driven Architecture – a framework for achieving high concurrency under heavy load
Uses events, messages and queues to process tasks
Decouples the request and response from the worker threads
Cassandra
Ring – Visual representation of data managed by Cassandra
Node – Individual machine in the ring
Data Center – A collection of related nodes
Cluster – Collection of (geographically separated) data centers
Commitlog – The equivalent of a transaction log file for Durability
Memtable – In Memory structures to store data (per column family)
Keyspace – Container for application data (Analogous to schema)
Table – Structure that holds data in rows and columns
SSTable – An immutable file (for each table) on disk to which data structures in memory are dumped periodically
Cassandra Terminology
Gossip – Peer to Peer protocol to discover and share location and state information on nodes
Tokens – A number used to assign a range of data to a node within a datacenter
Partitioner – A Hashing function for deriving the token
Snitch – informs Cassandra about the network topology
Replica – a copy of data stored on a different node for redundancy and fault tolerance
Replication Factor – the total number of copies of each piece of data across the cluster
Terminology
Cassandra is linearly (horizontally) and massively scalable
Just add or remove nodes as load increases or decreases
No downtime is required for this
SEDA – Staged Event Driven Architecture helps maintain consistent throughput
Core Strength - Scalability
Quantifying Massive
Avoids the pitfalls of Client Server based design
Eliminates storage bottlenecks – no single data repository
Redundancy built in
All nodes participate (whether they have the requested data or not)
Shared nothing
Transparently add / remove nodes as necessary without downtime
Comes with a trade-off – eventual consistency (CAP)
Newer Staged Event Driven Architecture
How does it Scale?
Legacy systems typically use thread based concurrency models
Programming traditional multi-threaded applications is hard
Distributed multi-threaded applications are even harder
Leads to severe scalability bottlenecks
A new thread or process is usually created for each request
There is a maximum number of threads a system can support
Challenges with the thread execution model:
Deadlocks
Livelocks (waste CPU cycles)
Starvation (waiting for resources)
Overheads – Context switching, synchronization and data movement
Request and response typically handled by the same thread
Sequential execution
Legacy Systems
Threads
Event Driven Architecture
Evolution of Event Driven Architecture (EDA)
This consists of a set of loosely coupled software components and services
An event is something that an application can act upon: a hotel booking event, a check-in event
A listener can pick up a check-in event and act on it: the in-room entertainment system displays a personalized greeting; partners may get notified and can send personalized offers (spa / massage / restaurant discounts)
This is much more scalable than thread based concurrency models
SEDA is an Architectural approach
An application is broken down into a set of logical stages
These stages are loosely coupled and connected via queues
Decouples event and thread scheduling from DB Engine logic
Prevents resources from being overcommitted under high load
Enables modularity and code reuse
SEDA Explained
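As a rough illustration (a toy sketch, not Cassandra's actual implementation), the staged model above can be expressed in Python: each stage owns a queue and a worker thread, and a request hops from stage to stage instead of one thread carrying it end to end. The stage names here are hypothetical.

```python
import queue
import threading

class Stage:
    """A toy SEDA stage: a queue plus one worker thread that applies
    a handler to each event and forwards the result to the next stage."""
    def __init__(self, name, handler, next_stage=None):
        self.name = name
        self.handler = handler
        self.next_stage = next_stage
        self.queue = queue.Queue()
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def submit(self, event):
        self.queue.put(event)      # submitting thread returns immediately

    def _run(self):
        while True:
            event = self.queue.get()
            if event is None:      # shutdown sentinel
                break
            result = self.handler(event)
            if self.next_stage:
                self.next_stage.submit(result)

    def stop(self):
        self.queue.put(None)
        self.thread.join()

results = []
respond = Stage("respond", results.append)
mutate  = Stage("mutate",  lambda e: e.upper(), next_stage=respond)
parse   = Stage("parse",   lambda e: e.strip(), next_stage=mutate)

for req in ["  insert a  ", "  insert b  "]:
    parse.submit(req)

for s in (parse, mutate, respond):   # drain each stage in pipeline order
    s.stop()
print(results)   # ['INSERT A', 'INSERT B']
```

Because each stage has its own queue, a slow stage backs up its queue instead of blocking the threads of earlier stages, which is the degradation behavior described above.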
Understanding Stage (SEDA)
SEDA enables massive concurrency
No thread deadlocks, livelocks or starvation to worry about (for the most part)
Thread Scheduling and Resource Management abstracted
Supports self tuning / resource allocation / management
Easier to debug and monitor application performance at scale
Distributed debugging / tracing easier
Graceful degradation under excessive load
Maintains throughput at the expense of latency
Why SEDA matters
Examples of Stages
Data Placement
[Figure: Facebook’s data center]
Each Cassandra node has a listen and a broadcast IP address
Snitch maps IP address to Racks and Data Centers
Gossip uses this information to help Cassandra build node location map
Snitch helps Cassandra with replica placement
Helps Cassandra minimize cross data center latency
Role of Snitch
Once built and configured, a cluster is ready to store data
Each node owns a token range
Token ranges can be manually assigned in the YAML file
Or Cassandra can manage token assignment – a concept called vnodes
A Keyspace needs to be created with replication options
CREATE KEYSPACE "Choice"
WITH REPLICATION =
{'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 2};
Cassandra Schema objects are replicated globally to all nodes
This enables each node in the cluster to act as a coordinator node
Data Placement
Data gets replicated as defined in the Keyspace
Within a data center, the Murmur3 hash of the partition key decides which node owns the data
Replication Strategy determines which nodes contain replicas
Simple Strategy – Replicas are placed in succeeding nodes
Network Topology – Walks the ring clockwise and places each copy on the first node on successive racks
Asymmetric replica groupings are possible (DR / Analytics etc.)
Data Placement
empID | empName | deptID | deptName        | hiredate
22    | Sam     | 12     | Finance         | 1/22/1996
33    | Scott   | 18     | Human Resources | 12/8/2006
44    | Walter  | 24     | Shipping        | 11/20/2009
55    | Bianca  | 30     | Marketing       | 1/1/2015
Data Placement
Partition       | Sample Hash
Finance         | -2245462676723220000
Human Resources | 7723358927203680000
Shipping        | -6723372854036780000
Marketing       | 1168604627387940000
Data Placement
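A toy sketch of how those sample hashes map onto a token ring (the four token assignments below are hypothetical; a real cluster uses vnodes with many tokens per node). Each partition belongs to the first node whose token is greater than or equal to the partition's Murmur3 hash, wrapping around the ring:

```python
import bisect

# Hypothetical single-token assignments for a 4-node ring; the Murmur3
# token space runs from -2**63 to 2**63 - 1.
ring = {
    -4611686018427387904: "node1",
    0:                    "node2",
     4611686018427387904: "node3",
     9223372036854775807: "node4",
}
tokens = sorted(ring)

def owner(token):
    """First node whose token is >= the partition's token,
    wrapping around the ring if necessary."""
    i = bisect.bisect_left(tokens, token)
    return ring[tokens[i % len(tokens)]]

# Sample hashes from the table above
print(owner(-2245462676723220000))  # node2 (Finance)
print(owner( 7723358927203680000))  # node4 (Human Resources)
print(owner(-6723372854036780000))  # node1 (Shipping)
print(owner( 1168604627387940000))  # node3 (Marketing)
```

Because the hash output is uniformly distributed, partitions spread roughly evenly across the token ranges, which is what gives the automatic sharding described earlier.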
Cassandra’s location-independent architecture means a user can connect to any node of the cluster, which then acts as the coordinator node
Schemas get replicated globally – even to nodes that do not contain a copy of the data
Cassandra offers tunable consistency – an extension of eventual consistency
Clients determine how consistent the data should be
They can choose between high availability (CL ONE) and high safety (CL ALL) among other options
Further reading
A request goes through stages – the thread that received the initial request inserts the request into a queue and then waits for the next user request
Partition aware drivers help route traffic to the nearest node
Hinted Hand-offs – store and forward write requests
Data Access
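The arithmetic behind tunable consistency can be sketched in a few lines (a simplified model, not Cassandra's code): QUORUM is a majority of replicas, and a read is guaranteed to see the latest write when the write and read replica counts overlap, i.e. W + R > RF.

```python
def quorum(rf):
    """Quorum for a given replication factor: a majority of replicas."""
    return rf // 2 + 1

def strongly_consistent(write_replicas, read_replicas, rf):
    """Reads overlap the latest write when W + R > RF."""
    return write_replicas + read_replicas > rf

rf = 3
print(quorum(rf))                                       # 2
print(strongly_consistent(quorum(rf), quorum(rf), rf))  # True  (QUORUM writes + QUORUM reads)
print(strongly_consistent(1, 1, rf))                    # False (CL ONE both ways: eventual only)
```

This is why QUORUM/QUORUM is a common middle ground between the high availability of CL ONE and the high safety of CL ALL.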
Memtables
Commitlog
SSTables
Tombstones
Compaction
Repair
Reads and Writes
Write Process
Write requests are written to a MemTable
When the MemTable is full, its contents get queued to be flushed to disk
Writes are also simultaneously persisted on disk to a CommitLog file
This helps achieve durable writes
CommitLog entries are purged after MemTable is flushed to disk
MemTables and SSTables are created on a per table basis
Tunable consistency determines how many nodes’ MemTables and CommitLogs the row has to be written to
SSTables are immutable and cannot be modified once written
Compaction consolidates SSTables and removes tombstones
SizeTiered Compaction
Leveled Compaction
Repair is a process that synchronizes copies located in different nodes Uses Merkle Trees to make this more efficient
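The write path above can be sketched as a toy model (the memtable size limit and key names are hypothetical; real flushes are triggered by configurable memory thresholds): every write is appended to the commit log for durability and applied to the memtable, and a full memtable is flushed to an immutable SSTable, after which its commit-log entries are purged.

```python
MEMTABLE_LIMIT = 2   # hypothetical flush threshold for the sketch

commitlog = []       # durable append-only log
memtable = {}        # in-memory structure, one per table
sstables = []        # immutable sorted snapshots "on disk"

def write(key, value):
    commitlog.append((key, value))   # durable append first
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush()

def flush():
    # Dump the memtable as an immutable, sorted SSTable
    sstables.append(tuple(sorted(memtable.items())))
    memtable.clear()
    commitlog.clear()   # entries are now safe in an SSTable

write("emp:22", "Sam")
write("emp:33", "Scott")    # second write fills the memtable -> flush
write("emp:44", "Walter")
print(sstables)   # [(('emp:22', 'Sam'), ('emp:33', 'Scott'))]
print(memtable)   # {'emp:44': 'Walter'}
print(commitlog)  # [('emp:44', 'Walter')]
```

Note that the SSTable is a tuple (immutable), matching the point above that SSTables are never modified once written; only compaction produces new files.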
Hinted hand-off is a feature that enables high write availability
It is enabled / disabled in the YAML file
When a replica node is down
A hint is stored in the coordinator node
Hints are stored for three hours (default)
Hinted writes do not count towards CL
Replaying hints does not affect system performance
Hinted Hand-off
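A minimal sketch of the store-and-forward behavior above (node names and mutation strings are hypothetical; the three-hour window is the default mentioned in the slide). A write to a downed replica is stored as a hint on the coordinator and replayed when the node returns, and hints older than the window are dropped, leaving repair to fix the replica:

```python
HINT_WINDOW = 3 * 60 * 60   # default: hints kept for three hours

hints = []   # (timestamp, target_node, mutation) stored on the coordinator

def write_to(node, mutation, node_is_up, now):
    if node_is_up:
        return "applied"
    hints.append((now, node, mutation))   # store-and-forward
    return "hinted"

def replay_hints(node, now):
    """Deliver stored hints to a node that came back online.
    Expired hints are dropped; repair must fix those later."""
    delivered, remaining = [], []
    for ts, target, mutation in hints:
        if target != node:
            remaining.append((ts, target, mutation))
        elif now - ts <= HINT_WINDOW:
            delivered.append(mutation)
        # else: hint expired, silently dropped
    hints[:] = remaining
    return delivered

t0 = 0
write_to("node3", "DELETE emp:22", node_is_up=False, now=t0)
write_to("node3", "INSERT emp:55", node_is_up=False, now=t0 + 4 * 3600)
replayed = replay_hints("node3", now=t0 + 5 * 3600)
print(replayed)   # ['INSERT emp:55'] - the 5-hour-old delete expired
```

The expired delete in this sketch is exactly the failure mode discussed later under "Why are Distributed Deletes hard?".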
A row of data will likely exist in multiple locations:
The unflushed Memtable
Un-compacted and compacted SSTables
Tunable consistency determines how many nodes have to respond
Cassandra does not rewrite the entire row to a new file on update
No read before writes
Updated / new columns exist in a new file
Unmodified columns exist in the old file
The timestamped version of the row can be different in each location
All these must be retrieved, reconstructed and processed based on timestamp
Uses Bloom filters to make key lookups more efficient
Row fragments may exist in multiple SSTables
May exist in Memtable as well
Bloom filters speed lookups
Read Path
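The reconstruction step above can be sketched as a timestamp merge (a toy model; column names and timestamps are hypothetical): fragments of the same row from the memtable and several SSTables are combined column by column, and the newest timestamp wins.

```python
def merge_fragments(*fragments):
    """Reconstruct a row from fragments. Each fragment maps
    column -> (timestamp, value); the newest timestamp wins per column."""
    row = {}
    for fragment in fragments:
        for col, (ts, val) in fragment.items():
            if col not in row or ts > row[col][0]:
                row[col] = (ts, val)
    return {col: val for col, (ts, val) in row.items()}

old_sstable = {"deptName": (100, "Finance"), "empName": (100, "Sam")}
new_sstable = {"deptName": (200, "Accounting")}   # update wrote only this column
memtable    = {"hiredate": (300, "1/22/1996")}    # not yet flushed

row = merge_fragments(old_sstable, new_sstable, memtable)
print(row)   # {'deptName': 'Accounting', 'empName': 'Sam', 'hiredate': '1/22/1996'}
```

This is why a read may touch several files: the unmodified `empName` still lives only in the old SSTable, while the updated `deptName` lives in the new one.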
A Bloom filter is a probabilistic bit-vector data structure
It supports two operations – test and add
Cassandra uses Bloom filters to reduce Disk I/O during key lookup
Each SSTable has a bloom filter associated with it
A Bloom filter is used to test if an element is a part of a set
False positives are possible, but false negatives are not
This means a key is “possibly in set” or “definitely not in set”
Check out JasonDavies.com for a cool interactive demo
http://www.jasondavies.com/bloomfilter/
http://pages.cs.wisc.edu/~cao/papers/summary-cache/node8.html
Bloom Filters
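A minimal Bloom filter sketch (toy parameters, not Cassandra's implementation, which sizes its filters per SSTable): `add` sets k bit positions derived from the key, and `test` reports "possibly in set" only if all k bits are set, so false negatives are impossible.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions over an m-bit vector."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        # Derive k independent positions from one cryptographic hash
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def test(self, key):
        """True means 'possibly in set'; False means 'definitely not in set'."""
        return all(self.bits & (1 << pos) for pos in self._positions(key))

bf = BloomFilter()
bf.add("emp:22")
bf.add("emp:33")
print(bf.test("emp:22"))   # True - added keys always test positive
print(bf.test("emp:99"))   # almost certainly False, but a false positive is possible
```

Per SSTable, a negative test lets Cassandra skip the file entirely without touching disk, which is the I/O saving described above.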
Deletes are handled differently when compared to traditional RDBMS
Data to be deleted is marked using Tombstones (using a write operation)
Actual removal takes place later during compaction
Run Repair on each node within 10 days (default)
Repair removes inconsistencies between replicas
Inconsistencies happen because nodes can be down for longer than hinted handoff window, thereby missing deletes/updates
Distributed deletes are hard in a peer to peer system that has no SPOF
Deletes
Distributed Systems are eventually consistent
Only a small number of nodes have to respond for successful (delete) operation
As the delete command propagates through the system, some nodes may be unavailable
The commands are stored (as hinted hand-offs) and will be delivered when the downed node comes online
The delete command may be “lost” if the downed node does not come back within the hinted hand-off window (default 3 hours)
Why are Distributed Deletes hard?
Cassandra does not support in-row updates
Updates are implemented as a delete and an insert
Updated values are written to a new file
Unmodified columns of the original row exist in old file
Compaction consolidates all values and writes row to new file
Updates
Cassandra does not perform in-place updates or deletes
Instead the new data is written to a new SSTable file
Cassandra marks data to be deleted using markers called Tombstones
Tombstones exist for the time period defined by GC_GRACE_SECONDS
Compaction merges data in each SSTable by partition key
Evicts tombstones, deletes data and consolidates SSTables into a single SSTable
Old SSTables are deleted as soon as existing reads complete
Compaction
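The merge-and-evict behavior above can be sketched as follows (a toy model; the key names and timestamps are hypothetical, and the 10-day grace period is the default cited later for GC_GRACE_SECONDS): the newest version of each value wins regardless of which SSTable it came from, and tombstones older than the grace period are evicted.

```python
TOMBSTONE = object()          # deletion marker written like any other value
GC_GRACE = 10 * 24 * 3600     # default gc_grace_seconds: 10 days

def compact(sstables, now):
    """Merge SSTables by key (newest timestamp wins), then evict
    tombstones that have outlived the grace period."""
    merged = {}
    for table in sstables:
        for key, (ts, val) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, val)
    return {
        key: (ts, val)
        for key, (ts, val) in merged.items()
        if not (val is TOMBSTONE and now - ts > GC_GRACE)
    }

day = 24 * 3600
sstable1 = {"emp:22": (0, "Sam"), "emp:33": (0, "Scott")}
sstable2 = {"emp:22": (1 * day, TOMBSTONE)}   # delete written as a marker
result = compact([sstable1, sstable2], now=20 * day)
print(sorted(result))   # ['emp:33'] - the expired tombstone and its data are gone
```

Note that a tombstone still inside the grace period would be kept, which is why repair must run within the GC_GRACE_SECONDS window: replicas that missed the delete must see the tombstone before it is evicted.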
Read Repair and Node Repair
Read Repair synchronizes data requested in a read operation
Node repair synchronizes all data (for a range) in a node with all replicas
Node repair needs to be scheduled to run at least once within the GC_GRACE_SECONDS Window (default 10 days)
Repair
There are two stages to the repair process:
Build a Merkle Tree on each replica
Compare the trees from the replicas to find differences
Once the comparison completes, the changes stream over
Streams are written to new SSTables
Repair is a resource intensive operation
Read up on Advanced Repair techniques
Repair Process
The distributed, decentralized nature of Cassandra requires repair operations
Repair involves comparing all data elements in each replica and updating the data
This happens asynchronously and in the background
Cassandra uses Merkle Trees to detect data inconsistencies more quickly and to minimize the data transferred between nodes
A Merkle Tree is an inverted hash tree structure
Used to compare data stored in different nodes
Partial branches of tree can be compared
Minimizes repair time and traffic between nodes
Merkle Trees
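A toy sketch of the partial-branch comparison above (the range contents are hypothetical; Cassandra builds its trees over hashes of token-range data): two replicas build a hash tree over their ranges, and the comparison descends only into branches whose hashes differ, so identical subtrees are skipped entirely.

```python
import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_tree(leaves):
    """Build a list of levels: leaves first, root last."""
    levels = [leaves]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels

def diff_leaves(a, b, level=None, index=0):
    """Walk two trees top-down, descending only into differing branches;
    return the indices of leaf ranges that need to be streamed."""
    if level is None:
        level = len(a) - 1            # start at the root
    if a[level][index] == b[level][index]:
        return []                     # identical subtree: skip entirely
    if level == 0:
        return [index]                # a differing leaf range
    return (diff_leaves(a, b, level - 1, 2 * index) +
            diff_leaves(a, b, level - 1, 2 * index + 1))

# Hash of each token range's data on two replicas (4 ranges)
replica1 = [h("range0:v1"), h("range1:v1"), h("range2:v1"), h("range3:v1")]
replica2 = [h("range0:v1"), h("range1:v2"), h("range2:v1"), h("range3:v1")]

t1, t2 = merkle_tree(replica1), merkle_tree(replica2)
print(t1[-1] == t2[-1])        # False - root hashes differ, repair needed
print(diff_leaves(t1, t2))     # [1]   - only range 1 must be streamed
```

Comparing roots first is what makes repair cheap when replicas are already in sync: one hash comparison confirms agreement across all ranges.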
Single threaded Operations
Some Examples of Single threaded operations:
Merkle Tree Comparison
Triggering Repair
Deleting files (obsolete SSTables, CommitLog segments)
Gossip
Hinted Handoff (default value = 1)
Message Streaming
This demo is to help get a better understanding of:
Gossip
Replication
Data Manipulation (Inserts, Updates, Deletes)
Role of Memtable, CommitLog and Tombstones
Compaction
Demo
Demo - Steps
Modify core cluster and table settings
Insert Data in one node
Verify Replication
Shut down one node
Continue DML operations
Start the downed node
Understand Outcome
Let’s see it!
Demo Time
Commands issued to Cassandra when one node was down
Demo commands
Expected results
Actual Results
Results
Demo Recap
What just happened?
Inserts disappeared
Updates rolled back
Deletes reappeared
What happened to Durability?
And this thing called eventual consistency?
All nodes were up and running
Initial writes came in, got persisted and replicated
All nodes have received the data and are in sync.
Memtable flush, compaction and SSTable consolidation occur
This clears the memory and the commit log
None of the 3 nodes have any entries in the commit log for these rows
Data exists in SSTables, so queries return data back to the user
What really happened?
One node is brought down
The state is preserved in that node
Inserts / Updates and Deletes continue in other nodes
Replication and Synchronization happens
Consolidation and Compaction happens on the other 2 nodes
Every time this happens, commit log is cleared and tombstones evicted
gc_grace_seconds & hinted_handoff play a critical role for this demo to work
3rd node that was down is brought up and it starts synchronizing
It still has the original state preserved and sends that copy to the other 2 nodes
Other 2 nodes receive the data and look for commit log entries and Tombstones locally
When the nodes do not find the entries, they proceed to apply that change (as new data) and the system reverts back
What really happened?
http://www.Datastax.com
http://www.eecs.harvard.edu/~mdw/papers/seda-sosp01.pdf
http://berb.github.io/diploma-thesis/original/052_threads.html
Choice Hotels is hiring!
Please contact Jeremiah Anderson for details.
Jeremiah_Anderson@choicehotels.com
References