Cassandra Fundamentals - C* 2.0
Apache Cassandra Fundamentals
or: How I stopped worrying and learned to love the CAP theorem
Russell Spitzer @RussSpitzer
Software Engineer in Test at DataStax
Who am I?
• Former Bioinformatics Student at UCSF
• Work on the integration of Cassandra (C*) with Hadoop, Solr, and Redacted!
• I spend a lot of time spinning up clusters on EC2, GCE, Azure, … http://www.datastax.com/dev/blog/testing-cassandra-1000-nodes-at-a-time
• Developing new ways to make sure that C* scales
Apache Cassandra is a Linearly Scaling and Fault Tolerant noSQL Database
Linearly Scaling: The power of the database increases linearly with the number of machines 2x machines = 2x throughput
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Fault Tolerant: Nodes down != Database Down Datacenter down != Database Down
CAP Theorem Limits What Distributed Systems can do
Consistency: When I ask the same question to any part of the system, I should get the same answer.
How many planes do we have?
Every node answers "1": Consistent
Nodes answer "1 4 1 2 1 8 1": Not Consistent
Availability: When I ask a question, I will get an answer.
How many planes do we have?
One node is asleep (zzzzz *snort* zzz), but another still answers "1": Available
If I have to wait for Major Snooze to wake up: Not Available
Partition Tolerance: I can ask questions even when the system is having intra-system communication problems.
How many planes do we have?
Team Edward and Team Jacob aren't speaking, but each side can still answer "1": Tolerant
"I'm not sure without asking those vampire lovers, and we aren't speaking": Not Tolerant
Cassandra is an AP System which is Eventually Consistent
How many planes do we have? At first some nodes answer "1"; then the news spreads ("I just heard! We actually have 2!") and every node answers "2".
Eventually consistent: New information will make it to everyone eventually.
Two knobs control fault tolerance in C*: Replication and Consistency Level
Server Side - Replication: How many copies of the data should exist in the cluster?
[Ring diagram: four nodes, each storing three of the replicas A/B/C/D with RF=3; the Client talks to the Coordinator for this operation]
SimpleStrategy: a number of replicas
NetworkTopologyStrategy: replicas per datacenter
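SimpleStrategy's placement rule can be sketched in a few lines: hash the partition key to a token, then walk the ring clockwise and take the next RF distinct nodes. This is a simplified sketch, not the actual server code; the ring tokens and node names below are invented for illustration.

```java
import java.util.*;

public class SimpleStrategySketch {
    // Ring: token -> node name, sorted by token (a toy stand-in for the real ring).
    static List<String> replicasFor(long token, TreeMap<Long, String> ring, int rf) {
        List<String> replicas = new ArrayList<>();
        // Start at the first node whose token is >= the data's token,
        // wrapping back to the start of the ring when we run off the end.
        Iterator<String> it = ring.tailMap(token).values().iterator();
        while (replicas.size() < rf) {            // assumes rf <= number of nodes
            if (!it.hasNext()) it = ring.values().iterator(); // wrap around
            String node = it.next();
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> ring = new TreeMap<>();
        ring.put(-100L, "A"); ring.put(0L, "B");
        ring.put(100L, "C");  ring.put(200L, "D");
        // A token of 50 lands between B and C, so C owns it; D and A hold the copies.
        System.out.println(replicasFor(50L, ring, 3));
    }
}
```

With RF=3 every piece of data has two extra homes, which is what lets a node go down without the database going down.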
Client Side - Consistency Level: How many replicas should we check before acknowledgment?
[Ring diagram: with CL = One, the Coordinator for this operation acknowledges the Client after hearing from a single replica]
[Ring diagram: with CL = Quorum, the Coordinator for this operation acknowledges the Client only after hearing from a majority of replicas]
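The quorum size the coordinator waits for is the usual majority formula, quorum = floor(RF/2) + 1. A tiny sketch:

```java
public class QuorumMath {
    // Majority of RF replicas: floor(RF/2) + 1 (integer division does the floor).
    static int quorum(int rf) {
        return rf / 2 + 1;
    }

    public static void main(String[] args) {
        for (int rf = 1; rf <= 5; rf++)
            System.out.println("RF=" + rf + " -> QUORUM=" + quorum(rf));
    }
}
```

So with RF=3, QUORUM waits for 2 replicas: one replica can be down and reads/writes still succeed.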
Nodes own data whose primary key hashes into their token ranges
[Ring diagram: four nodes, each owning a range of the token space]
Every piece of data belongs on the node that owns the Murmur3 hash of its partition key, plus (RF-1) other nodes.
Example row: ID: ICBM_432 is the Partition Key; Time: 30 is the Clustering Key; Loc: SF and Status: Idle are the rest of the data.
Murmur3Hash(ID: ICBM_432) falls in the token range owned by node A.
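Only the partition key participates in placement: rows that share a partition key land on the same replicas no matter what their clustering key is. A minimal sketch, using String.hashCode as a stand-in for the real Murmur3 partitioner:

```java
public class PartitionKeyHash {
    // Stand-in for Cassandra's Murmur3 partitioner; only the partition key goes in.
    static int placementHash(String partitionKey) {
        return partitionKey.hashCode();
    }

    public static void main(String[] args) {
        // Two rows of the icbmlog table: same ID, different Time values.
        int h1 = placementHash("ICBM_432"); // row with Time: 30
        int h2 = placementHash("ICBM_432"); // row with Time: 45
        System.out.println(h1 == h2);       // same partition -> same token -> same replicas
    }
}
```

This is why a whole partition (all Time entries for ICBM_432) lives together on one set of nodes, which makes the clustering-column queries later in the deck fast.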
Cassandra writes are FAST due to log-append storage
[Diagram: each write is appended to the Commit Log on disk and applied to a Memtable in memory; full Memtables are flushed to immutable SSTables on disk]
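The write path above can be sketched as a toy in-memory model: append to a commit log (sequential IO only), update a memtable, and flush full memtables to immutable sorted tables. The structures and flush threshold here are simplifications, not Cassandra's real formats.

```java
import java.util.*;

public class WritePathSketch {
    final List<String> commitLog = new ArrayList<>();                 // durable, append-only
    final TreeMap<String, String> memtable = new TreeMap<>();         // in-memory, sorted by key
    final List<SortedMap<String, String>> sstables = new ArrayList<>(); // immutable flushed tables
    final int flushThreshold;

    WritePathSketch(int flushThreshold) { this.flushThreshold = flushThreshold; }

    void write(String key, String value) {
        commitLog.add(key + "=" + value);  // 1. append to the commit log (no seeks)
        memtable.put(key, value);          // 2. fast in-memory update
        if (memtable.size() >= flushThreshold) flush();
    }

    void flush() {
        sstables.add(new TreeMap<>(memtable)); // 3. write an immutable, sorted SSTable
        memtable.clear();
    }
}
```

No step requires a random-access read-before-write, which is the reason the slide calls writes FAST.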
Deletes in a distributed System are Challenging
We need to keep records of deletions in case of network partitions
[Diagram: Node2 suffers a power outage and misses a delete; the Tombstone recorded on Node1 later tells it the data was deleted]
Compactions merge and unify data in our SSTables
SSTable1 + SSTable2 → SSTable3
Since SSTables are immutable, this is our chance to consolidate rows and remove tombstones (after GC grace).
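A compaction pass can be sketched as a merge that keeps the newest version of each key and purges tombstones older than the gc_grace cutoff. The Cell shape and raw long timestamps below are simplified stand-ins for the real SSTable format.

```java
import java.util.*;

public class CompactionSketch {
    static class Cell {
        final String value; final long timestamp; final boolean tombstone;
        Cell(String value, long timestamp, boolean tombstone) {
            this.value = value; this.timestamp = timestamp; this.tombstone = tombstone;
        }
    }

    static Map<String, Cell> compact(Map<String, Cell> sstable1,
                                     Map<String, Cell> sstable2,
                                     long gcGraceCutoff) {
        Map<String, Cell> merged = new TreeMap<>(sstable1);
        // For keys in both tables, the newest write (highest timestamp) wins.
        sstable2.forEach((k, cell) -> merged.merge(k, cell,
                (oldC, newC) -> oldC.timestamp >= newC.timestamp ? oldC : newC));
        // Tombstones may only be purged once gc_grace has elapsed.
        merged.values().removeIf(c -> c.tombstone && c.timestamp < gcGraceCutoff);
        return merged;
    }
}
```

Dropping a tombstone too early would let a partitioned replica "resurrect" the deleted row, which is why the GC grace check matters.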
Layout of Data Allows for Rapid Queries Along Clustering Columns
Disclaimer: Not exactly like this (use sstable2json to see the real layout)
ID: ICBM_432 → (Time: 30, Loc: SF, Status: Idle) (Time: 45, Loc: SF, Status: Idle) (Time: 60, Loc: SF, Status: Idle)
ID: ICBM_9210 → (Time: 30, Loc: Boston, Status: Idle) (Time: 45, Loc: Boston, Status: Idle) (Time: 60, Loc: Boston, Status: Idle)
ID: ICBM_900 → (Time: 30, Loc: Tulsa, Status: Idle) (Time: 45, Loc: Tulsa, Status: Idle) (Time: 60, Loc: Tulsa, Status: Idle)
CQL allows easy definition of Table Structures
ID: ICBM_432 → (Time: 30, Loc: SF, Status: Idle) (Time: 45, Loc: SF, Status: Idle) (Time: 60, Loc: SF, Status: Idle)
CREATE TABLE icbmlog (
  name text,
  time timestamp,
  location text,
  status text,
  PRIMARY KEY (name, time)
);
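Because rows inside a partition are stored in clustering-column order, slicing one partition by `time` is efficient. A hypothetical query against this table (the limit and the shape of the slice are invented for illustration) might be:

```sql
-- Fast: hits one partition (name) and scans along the clustering column (time).
SELECT time, location, status
  FROM icbmlog
 WHERE name = 'ICBM_432'
 ORDER BY time DESC
 LIMIT 10;
```

Queries that restrict the partition key and then range over the clustering key read contiguous data; queries without the partition key cannot be routed to one set of replicas.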
Reading data is FAST but limited by disk IO
[Diagram: a read consults the Memtable in memory plus SSTables on disk; versions of a row from different replicas are reconciled last-write-wins (LWW)]
Read Repair: replicas found to be out of date during a read are sent the winning version.
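Reconciling replica responses is last-write-wins: the cell with the highest write timestamp is returned, and (via read repair) pushed back to stale replicas. A minimal sketch of just the resolution step:

```java
public class LastWriteWins {
    static class Versioned {
        final String value; final long timestamp;
        Versioned(String value, long timestamp) { this.value = value; this.timestamp = timestamp; }
    }

    // Pick the version with the highest write timestamp (last write wins).
    static Versioned resolve(Versioned... versions) {
        Versioned best = versions[0];
        for (Versioned v : versions)
            if (v.timestamp > best.timestamp) best = v;
        return best;
    }

    public static void main(String[] args) {
        Versioned stale = new Versioned("Idle", 30);     // an out-of-date replica
        Versioned fresh = new Versioned("Launched", 60); // the newest write
        System.out.println(resolve(stale, fresh).value);
    }
}
```

The stale replica would then receive the winning ("Launched", 60) version as a read repair.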
New Clients provide a holistic view of the C* cluster
[Ring diagram: the Client makes Initial Contact with one node and discovers the rest of the cluster]
Cluster.builder().addContactPoint("127.0.0.1").build()
Session Objects Are used for Executing Requests
session = cluster.connect()
session.execute("DROP KEYSPACE IF EXISTS icbmkey")
session.execute("CREATE KEYSPACE icbmkey WITH replication = {'class':'SimpleStrategy', 'replication_factor':'1'}")
For highest throughput, use asynchronous methods:
ResultSetFuture executeAsync(Query query)
Then add a callback or queue the ResultSetFutures.
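The queue-the-futures pattern looks roughly like this. CompletableFuture and fakeQuery below are stand-ins for the driver's ResultSetFuture and session.executeAsync, since a real call needs a live cluster; the point is the shape: fire everything off first, then drain.

```java
import java.util.*;
import java.util.concurrent.*;

public class AsyncQueueSketch {
    // Stand-in for session.executeAsync(query): returns immediately with a future.
    static CompletableFuture<String> fakeQuery(int i) {
        return CompletableFuture.supplyAsync(() -> "row-" + i);
    }

    public static void main(String[] args) {
        Queue<CompletableFuture<String>> pending = new ArrayDeque<>();
        for (int i = 0; i < 3; i++)
            pending.add(fakeQuery(i));                // don't block between requests
        while (!pending.isEmpty())
            System.out.println(pending.poll().join()); // drain results as they complete
    }
}
```

Blocking once per request (the serial pattern) caps throughput at one round-trip per request; queueing futures keeps many requests in flight at once.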
Token Aware Policies allow a reduction in the number of intra-network requests made
[Ring diagram: the Client sends a request for partition A directly to a node that owns a replica of A]
Prepared statements allow for sending less data over the wire
Prepared batch statements can further improve throughput
PreparedStatement ps = session.prepare(
    "INSERT INTO messages (user_id, msg_id, title, body) VALUES (?, ?, ?, ?)");
BatchStatement batch = new BatchStatement();
batch.add(ps.bind(uid, mid1, title1, body1));
batch.add(ps.bind(uid, mid2, title2, body2));
batch.add(ps.bind(uid, mid3, title3, body3));
session.execute(batch);
The query is prepared on all nodes by the driver.
Avoid:
• Preparing statements more than once
• Creating batches which are too large
• Running statements in serial
• Using consistency levels above your need
• Secondary indexes in your main queries (or really at all, unless you are doing analytics)
Have fun with C*
Questions?