Intro to Cassandra


Tyler Hobbs

History

Cassandra draws on Dynamo (clustering) and BigTable (data model).

Clustering

Every node plays the same role
– No masters, slaves, or special nodes
– No single point of failure

Consistent Hashing

[Figure: ring of six nodes with tokens 0, 10, 20, 30, 40, and 50]

Key: “www.google.com”

md5(“www.google.com”) places the key at 14 on the ring (illustrative value)

Replication Factor = 3
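The ring lookup from these slides can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the token space is shrunk to 0..59 (`RING_SIZE` is a made-up constant), a node is assumed to own the range that ends at its token, and replicas are assumed to be the next nodes clockwise. The function names are hypothetical and not part of any Cassandra API.

```python
import hashlib
from bisect import bisect_left

RING_SIZE = 60                          # assumption: toy token space of 0..59
NODE_TOKENS = [0, 10, 20, 30, 40, 50]   # node positions from the slides


def token_for(key: str) -> int:
    """Hash a key onto the toy ring (md5, reduced modulo RING_SIZE)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def replicas_for(token: int, tokens=NODE_TOKENS, rf: int = 3) -> list:
    """Walk clockwise from the token and collect rf distinct nodes."""
    ordered = sorted(tokens)
    # First node whose token is >= the key's token, wrapping past the end.
    start = bisect_left(ordered, token) % len(ordered)
    count = min(rf, len(ordered))
    return [ordered[(start + i) % len(ordered)] for i in range(count)]
```

With the slide's position of 14 and a replication factor of 3, `replicas_for(14)` selects the nodes at 20, 30, and 40. A real md5 digest reduced onto this toy ring will generally not land on 14; that value is just the one drawn in the figure.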

Clustering

Client can talk to any node

Scaling

[Figure: ring with nodes at 0, 10, 20, 30, and 50; RF = 2]

The node at 50 owns the red portion

[Figure: the same ring after a node joins at 40; RF = 2]

Add a new node at 40
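To make the effect of adding a node concrete, here is a small Python sketch that computes each node's primary range before and after the node at 40 joins. It assumes a node owns the range from the previous node's token (exclusive) up to its own token (inclusive), so the "red portion" is taken to be (30, 50]. Replica ranges for RF = 2 are not modeled, and `primary_ranges` is a hypothetical helper.

```python
def primary_ranges(tokens):
    """Map each node token to the (start, end] range it owns as primary."""
    ordered = sorted(tokens)
    # ordered[i - 1] wraps around to the last node when i == 0.
    return {tok: (ordered[i - 1], tok) for i, tok in enumerate(ordered)}


before = primary_ranges([0, 10, 20, 30, 50])
after = primary_ranges([0, 10, 20, 30, 40, 50])

# Nodes whose primary range differs after the change, including the new node.
changed = {tok: rng for tok, rng in after.items() if before.get(tok) != rng}
```

In this model only the new node at 40 and its clockwise neighbor at 50 appear in `changed`; every other node keeps its range, which is the property that makes adding nodes cheap.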

Node Failures

[Figure: ring with nodes at 0, 10, 20, 30, 40, and 50; RF = 2; replicas highlighted]

Consistency, Availability

Consistency
– Can I read stale data?

Availability
– Can I write/read at all?

Tunable Consistency

Consistency

N = Total number of replicas
R = Number of replicas read from
– (before the response is returned)
W = Number of replicas written to
– (before the write is considered a success)

W + R > N gives strong consistency

Consistency

N = 3, W = 2, R = 2

2 + 2 > 3 ==> strongly consistent

Only 2 of the 3 replicas must be available.
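The reason W + R > N gives strong consistency is that every read set must then share at least one replica with every write set. The brute-force check below is a sketch that illustrates this for small N; `always_overlaps` is a hypothetical helper, not a Cassandra function.

```python
from itertools import combinations


def always_overlaps(n: int, w: int, r: int) -> bool:
    """True if every set of w written replicas shares a member
    with every set of r read replicas, out of n replicas total."""
    replicas = range(n)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, w)
        for read in combinations(replicas, r)
    )
```

For the slide's example, `always_overlaps(3, 2, 2)` is true, while `always_overlaps(3, 1, 1)` is false because a read can land on a replica the write never reached.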

Consistency

Tunable Consistency
– Specify N (Replication Factor) per data set
– Specify R, W per operation
– Quorum: N/2 + 1 (N/2 rounded down)
  • R = W = Quorum
  • Strong consistency
  • Tolerate the loss of N – Quorum replicas
– R, W can also be 1 or N

Availability

Can tolerate the loss of:
– N – R replicas for reads
– N – W replicas for writes
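The quorum and availability arithmetic can be worked through with a short Python sketch. Both helper names are hypothetical; the only assumption is that N/2 uses integer division.

```python
def quorum(n: int) -> int:
    """Quorum size for n replicas: N/2 (rounded down) + 1."""
    return n // 2 + 1


def tolerated_losses(n: int, r: int, w: int) -> dict:
    """Replicas that can be down while reads and writes still succeed."""
    return {"reads": n - r, "writes": n - w}


n = 3
q = quorum(n)                      # quorum size for a replication factor of 3
losses = tolerated_losses(n, q, q)
```

With N = 3 and R = W = Quorum, the quorum is 2 and one replica can be lost for both reads and writes. Dropping to R = W = 1 raises the tolerance to two replicas, at the cost of the W + R > N guarantee.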

CAP Theorem

During node or network failure:

[Figure: Availability vs. Consistency, each axis up to 100%, with regions labeled “Possible” and “Not Possible” and Cassandra marked on the chart]

Clustering

No single point of failure
Replication that works
Scales linearly
– 2x nodes = 2x performance
  • For both writes and reads
– Up to hundreds of nodes
Operationally simple
Multi-Datacenter Replication

Data Model

Comes from Google BigTable

Goals
– Minimize disk seeks
– High throughput
– Low latency
– Durable

Keyspace
– A collection of Column Families
– Controls replication settings

Column Family
– Kinda resembles a table

Column Families

Static
– Object data
– Similar to a table in a relational database

Dynamic
– Pre-calculated query results
– Materialized views

Static Column Families

Users

zznate     password: *    name: Nate
driftx     password: *    name: Brandon
thobbs     password: *    name: Tyler
jbellis    password: *    name: Jonathan    site: riptano.com

Rows
– Each row has a unique primary key
– Sorted list of (name, value) tuples
  • Like a sorted map or dictionary
– The (name, value) tuple is called a “column”
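As a rough mental model, a column family can be pictured as a map from row key to a sorted map of columns. The Python sketch below is a hypothetical in-memory stand-in for the Users column family, not how Cassandra stores data; it only shows that rows may carry different columns and that columns come back ordered by name.

```python
# Hypothetical in-memory stand-in for the Users column family.
users = {
    "thobbs": {"password": "*", "name": "Tyler"},
    "jbellis": {"site": "riptano.com", "password": "*", "name": "Jonathan"},
}


def columns(column_family: dict, row_key: str) -> list:
    """Return a row's columns as (name, value) tuples sorted by column name."""
    return sorted(column_family[row_key].items())


jbellis_columns = columns(users, "jbellis")
```

Here `jbellis_columns` lists `name`, `password`, and `site` in that order, while the `thobbs` row simply has no `site` column.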

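Dynamic Column Families

[Figure: the Following column family. Row keys are users (zznate, driftx, thobbs, jbellis); the column names in each row are other users (driftx, thobbs, mdennis, zznate, pcmanus, xedin), shown with empty values.]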

Column Timestamps
– Each column (tuple) has a timestamp
– In the case of a collision, the latest timestamp wins
– Client specifies timestamp with write
– Writes are idempotent
  • Infinite retries allowed
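The timestamp rule above can be sketched as a last-write-wins merge. This is a simplified illustration with a hypothetical `apply_write` helper: each column is stored as a (value, timestamp) pair, and on an exact timestamp tie the existing value is kept, which is a simplification chosen for this sketch.

```python
def apply_write(row: dict, name: str, value, timestamp: int) -> dict:
    """Apply a write to a row, keeping the value with the latest timestamp."""
    current = row.get(name)
    if current is None or timestamp > current[1]:
        row[name] = (value, timestamp)
    return row


row = {}
apply_write(row, "age", 24, timestamp=100)
apply_write(row, "age", 34, timestamp=200)
# A retry of the older write arrives late; it cannot clobber the newer value.
apply_write(row, "age", 24, timestamp=100)
```

Because the outcome depends only on the timestamps and not on arrival order, replaying any of these writes leaves `row` unchanged, which is why retries are safe.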


Dynamic Column Families

Other Examples:
– Timeline of tweets by a user
– Timeline of tweets by all of the people a user is following
– List of comments sorted by score
– List of friends grouped by state

The Data API

Two choices
– RPC-based API
– CQL
  • Cassandra Query Language

Inserting Data

INSERT INTO users (KEY, 'name', 'age') VALUES ('thobbs', 'Tyler', 24);

Updating Data

Updates are the same as inserts:

INSERT INTO users (KEY, 'age') VALUES ('thobbs', 34);

Or

UPDATE users SET 'age' = 34 WHERE KEY = 'thobbs';

Fetching Data

Whole row select:

SELECT * FROM users WHERE KEY = 'thobbs';

Explicit column select:

SELECT 'name', 'age' FROM users WHERE KEY = 'thobbs';

Fetching Data

Get a slice of columns

UPDATE letters SET 1='a', 2='b', 3='c', 4='d', 5='e' WHERE KEY = 'key';

SELECT 1..3 FROM letters WHERE KEY = 'key';

Returns [(1, a), (2, b), (3, c)]

SELECT FIRST 2 FROM letters WHERE KEY = 'key';

Returns [(1, a), (2, b)]

SELECT FIRST 2 REVERSED FROM letters WHERE KEY = 'key';

Returns [(5, e), (4, d)]

SELECT 3..'' FROM letters WHERE KEY = 'key';

Returns [(3, c), (4, d), (5, e)]

SELECT FIRST 2 REVERSED 4..'' FROM letters WHERE KEY = 'key';

Returns [(4, d), (3, c)]
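The slice queries above can be mimicked in Python to show how the range, FIRST, and REVERSED options combine. This is a sketch of the semantics as read from the returned results, not Cassandra code; `slice_columns` and its parameters are hypothetical, and `None` stands in for the empty bound `''`.

```python
def slice_columns(row: dict, start=None, finish=None, first=None, reverse=False) -> list:
    """Return (name, value) columns between start and finish, in scan order.

    With reverse=True the scan runs from high names to low, so start is the
    highest name returned and finish the lowest.
    """
    result = []
    for name in sorted(row, reverse=reverse):
        if start is not None and (name > start if reverse else name < start):
            continue  # before the start of the slice in scan order
        if finish is not None and (name < finish if reverse else name > finish):
            continue  # past the end of the slice in scan order
        result.append((name, row[name]))
    # first=None keeps every matching column.
    return result[:first]


letters = {1: "a", 2: "b", 3: "c", 4: "d", 5: "e"}
last_from_four = slice_columns(letters, start=4, first=2, reverse=True)
```

Here `last_from_four` holds the columns 4 and 3, in that order, mirroring the final query above.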

Deleting Data

Delete a whole row:

DELETE FROM users WHERE KEY = 'thobbs';

Delete specific columns:

DELETE 'age' FROM users WHERE KEY = 'thobbs';

Secondary Indexes

Built-in basic indexes

CREATE INDEX ageIndex ON users (age);

SELECT name FROM users WHERE age = 24 AND state = 'TX';

Performance

Writes
– 10k – 30k per second per node
– Sub-millisecond latency

Reads
– 1k – 10k per second per node
– Depends on data set, caching
– Usually 0.1 to 10ms latency

Other Features

Distributed Counters
– Can support millions of high-volume counters

Excellent Multi-datacenter Support
– Disaster recovery
– Locality

Hadoop Integration
– Isolation of resources
– Hive and Pig drivers

Compression

What Cassandra Can't Do

Transactions
– Unless you use a distributed lock
– Atomicity, Isolation
– These aren't needed as often as you'd think

Limited support for ad-hoc queries
– Know what you want to do with the data

Not One-size-fits-all

Use alongside an RDBMS
– Use the RDBMS for highly-transactional or highly-relational data
  • Usually a small set of data
– Let Cassandra scale to handle the rest

Language Support

Good:
– Java
– Python
– Ruby
– PHP
– C#

Coming Soon:
– Everything else, now that we have CQL

Tyler Hobbs (@tylhobbs)

[email protected]

Questions?
