Cassandra Training Modeling

Cassandra: 0-60

Jonathan Ellis / @spyced

Keyspaces & ColumnFamilies

● Conceptually, like “schemas” and “tables”

Inside CFs, columns are dynamic

● Twitter: “Fifteen months ago, it took two weeks to perform ALTER TABLE on the statuses [tweets] table.”

ColumnFamilies

Columns

“static” CFs vs “dynamic”

Inserting

● Really “insert or update”
● As much of the row as you want

(remember sstable merge-on-read)
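A toy model of why an insert and an update are the same operation: each write lands in some sstable with a timestamp, and a read merges the versions, keeping the newest value per column. This is only a sketch, not Cassandra's storage engine; the dict-of-tuples layout and the `merge_on_read` name are assumptions made for illustration.

```python
def merge_on_read(sstables):
    """Merge column versions from several sstables; the newest timestamp wins."""
    merged = {}
    for sstable in sstables:
        for name, (value, timestamp) in sstable.items():
            if name not in merged or timestamp > merged[name][1]:
                merged[name] = (value, timestamp)
    return {name: value for name, (value, _) in merged.items()}


# Two partial writes to the same row, each landing in a different sstable.
older = {"username": ("ericflo", 1), "password": ("secret", 1)}
newer = {"password": ("changed", 2)}  # an "update" is just another insert

row = merge_on_read([older, newer])
print(row)
```

The second write only carries the column that changed, which is the "as much of the row as you want" point.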

Column indexes

● Name vs range filters
● “reversed=true”
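The two filter styles can be sketched over an in-memory row whose columns are kept sorted by name. This is an illustrative model only; `get_by_names` and `get_slice` are hypothetical helpers, not client API calls, and the inclusive bounds are an assumption of the sketch.

```python
from bisect import bisect_left, bisect_right


def get_by_names(row, names):
    """Name filter: fetch only the listed columns that exist in the row."""
    return [(n, row[n]) for n in names if n in row]


def get_slice(row, start=None, finish=None, count=100, reversed_=False):
    """Range filter: walk sorted column names from start to finish."""
    names = sorted(row)
    if reversed_:
        # With reversed order, start is the upper bound and finish the lower.
        hi = bisect_right(names, start) if start is not None else len(names)
        lo = bisect_left(names, finish) if finish is not None else 0
        selected = names[lo:hi][::-1]
    else:
        lo = bisect_left(names, start) if start is not None else 0
        hi = bisect_right(names, finish) if finish is not None else len(names)
        selected = names[lo:hi]
    return [(n, row[n]) for n in selected[:count]]


row = {10: "a", 20: "b", 30: "c", 40: "d"}
print(get_by_names(row, [20, 40]))
print(get_slice(row, count=2, reversed_=True))
```

A reversed slice with a small count is what makes "the latest N columns" cheap to ask for.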

Denormalization

● Whiteboard: turn long, skinny tables into long rows

● Reduces the I/O and CPU needed to perform a read

Example: twissandra

● http://twissandra.com

CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    username VARCHAR(64),
    password VARCHAR(64)
);

CREATE TABLE following (
    user INTEGER REFERENCES users(id),
    followed INTEGER REFERENCES users(id)
);

CREATE TABLE tweets (
    id INTEGER,
    user INTEGER REFERENCES users(id),
    body VARCHAR(140),
    timestamp TIMESTAMP
);

Cassandrified

Connecting

CLIENT = pycassa.connect_thread_local()

USER = pycassa.ColumnFamily(CLIENT, 'Twissandra', 'User', dict_class=OrderedDict)

'a4a70900-24e1-11df-8924-001ff3591711': {
    'id': 'a4a70900-24e1-11df-8924-001ff3591711',
    'username': 'ericflo',
    'password': '****',
}

username = 'jericevans'
password = '**********'
useruuid = str(uuid())
columns = {'id': useruuid, 'username': username, 'password': password}
USER.insert(useruuid, columns)

Natural keys vs surrogate

Friends and Followers

'a4a70900-24e1-11df-8924-001ff3591711': {
    # friend id: timestamp when the friendship was added
    '10cf667c-24e2-11df-8924-...': '1267413962580791',
    '343d5db2-24e2-11df-8924-...': '1267413990076949',
    '3f22b5f6-24e2-11df-8924-...': '1267414008133277',
}

frienduuid = 'a4a70900-24e1-11df-8924-001ff3591711'
FRIENDS.insert(useruuid, {frienduuid: time.time()})
FOLLOWERS.insert(frienduuid, {useruuid: time.time()})

Your row is your index

● Long skinny table vs short, fat columnfamily
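To make the "long skinny table vs short, fat columnfamily" contrast concrete, here is a small sketch that folds rows of a relational following table into one wide row per user, in both directions. The user names and timestamps are made-up sample data, and plain dicts stand in for the column families.

```python
from collections import defaultdict

# Rows of a relational "following" table: (user, followed, timestamp).
following_table = [
    ("alice", "bob", 100),
    ("alice", "carol", 101),
    ("bob", "alice", 102),
]

friends = defaultdict(dict)    # user -> {followed: timestamp}
followers = defaultdict(dict)  # followed -> {user: timestamp}

for user, followed, ts in following_table:
    friends[user][followed] = ts    # answers "whom does this user follow?"
    followers[followed][user] = ts  # answers "who follows this user?"

print(dict(friends))
print(dict(followers))
```

Each question becomes a single row lookup instead of an index scan over the skinny table, at the cost of writing every relationship twice.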

Tweets

'7561a442-24e2-11df-8924-001ff3591711': {
    'id': '89da3178-24e2-11df-8924-001ff3591711',
    'user_id': 'a4a70900-24e1-11df-8924-001ff3591711',
    'body': 'Trying out Twissandra. This is awesome!',
    '_ts': '1267414173047880',
}

Userline

'a4a70900-24e1-11df-8924-001ff3591711': {
    # timestamp of tweet: tweet id
    1267414247561777: '7561a442-24e2-11df-8924-...',
    1267414277402340: 'f0c8d718-24e2-11df-8924-...',
    1267414305866969: 'f9e6d804-24e2-11df-8924-...',
    1267414319522925: '02ccb5ec-24e3-11df-8924-...',
}

Timeline

'a4a70900-24e1-11df-8924-001ff3591711': {
    # timestamp of tweet: tweet id
    1267414247561777: '7561a442-24e2-11df-8924-...',
    1267414277402340: 'f0c8d718-24e2-11df-8924-...',
    1267414305866969: 'f9e6d804-24e2-11df-8924-...',
    1267414319522925: '02ccb5ec-24e3-11df-8924-...',
}

Adding a tweet

tweetuuid = str(uuid())
body = '@ericflo thanks for Twissandra, it helps!'
timestamp = long(time.time() * 1e6)
columns = {'id': tweetuuid, 'user_id': useruuid, 'body': body, '_ts': timestamp}
TWEET.insert(tweetuuid, columns)

columns = {struct.pack('>d', timestamp): tweetuuid}
USERLINE.insert(useruuid, columns)
TIMELINE.insert(useruuid, columns)
for otheruuid in FOLLOWERS.get(useruuid, 5000):
    TIMELINE.insert(otheruuid, columns)
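The write path above is a fan-out on write: the tweet body is stored once, and a small pointer column is copied into the author's userline and into the timeline of every follower. The sketch below mimics that with in-memory dicts so it runs without a cluster; `add_tweet`, the short ids, and the sample followers are assumptions for illustration.

```python
from collections import defaultdict

# In-memory stand-ins for the column families (row key -> {column: value}).
TWEET = {}
USERLINE = defaultdict(dict)
TIMELINE = defaultdict(dict)
FOLLOWERS = defaultdict(dict)


def add_tweet(tweet_id, user_id, body, timestamp):
    """Store the tweet once, then copy a pointer into every relevant row."""
    TWEET[tweet_id] = {"id": tweet_id, "user_id": user_id,
                       "body": body, "_ts": timestamp}
    pointer = {timestamp: tweet_id}
    USERLINE[user_id].update(pointer)       # the author's own tweets
    TIMELINE[user_id].update(pointer)       # the author sees their own tweet
    for follower_id in FOLLOWERS[user_id]:  # one write per follower
        TIMELINE[follower_id].update(pointer)


FOLLOWERS["ericflo"] = {"jericevans": 1, "spyced": 2}
add_tweet("t1", "ericflo", "Trying out Twissandra.", 1267414173047880)
print(dict(TIMELINE))
```

Writes grow with the number of followers, but reading any one timeline stays a single row slice.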

timeline = USERLINE.get(useruuid, column_reversed=True)
tweets = TWEET.multiget(timeline.values())

start = request.GET.get('start')
limit = NUM_PER_PAGE
timeline = TIMELINE.get(useruuid, column_start=start, column_count=limit, column_reversed=True)
tweets = TWEET.multiget(timeline.values())
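One way to produce the `start` value for the next page is to fetch one column more than the page size and hand that extra column's name back to the client. The slides do not say how Twissandra derives it, so treat this as an assumption; `get_page` is a hypothetical helper over an in-memory row.

```python
def get_page(row, start=None, count=3):
    """Return up to `count` (timestamp, tweet_id) pairs, newest first,
    plus the column name at which the next page should start."""
    names = sorted(row, reverse=True)
    if start is not None:
        names = [n for n in names if n <= start]
    # Fetch one extra column: it becomes the start of the next page.
    window = names[:count + 1]
    page = [(n, row[n]) for n in window[:count]]
    next_start = window[count] if len(window) > count else None
    return page, next_start


timeline = {1: "t1", 2: "t2", 3: "t3", 4: "t4", 5: "t5"}
first, cursor = get_page(timeline)
second, done = get_page(timeline, start=cursor)
print(first, cursor)
print(second, done)
```

Paging by column name rather than by offset keeps each page a bounded slice, however long the row grows.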

I can has smarter clients?

● Clients shouldn't need to pack('>d', int) by hand; Cassandra provides describe_keyspace, so the column types can be introspected
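Why pack at all: column names are stored as bytes, so a numeric name has to be encoded in a way whose byte order matches its numeric order. The sketch below uses a big-endian 64-bit integer (`'>q'`), which is an assumption of this example (the slide code uses `'>d'`), and contrasts it with decimal strings.

```python
import struct

timestamps = [1267414319522925, 1267414247561777, 1267414305866969]

# Big-endian signed 64-bit integers ('>q'): for non-negative values the
# byte-wise order of the packed strings matches the numeric order.
packed = [struct.pack(">q", ts) for ts in timestamps]
by_bytes = [struct.unpack(">q", p)[0] for p in sorted(packed)]

# Decimal strings only sort correctly while they all have the same length.
as_text = sorted(str(n) for n in [9, 10, 100])

print(by_bytes == sorted(timestamps))
print(as_text)
```

A client that reads the comparator type from the schema could apply this encoding automatically, which is the point of the bullet above.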

Raw thrift API: Connecting

def get_client(host='127.0.0.1', port=9170):
    socket = TSocket.TSocket(host, port)
    transport = TTransport.TBufferedTransport(socket)
    transport.open()
    protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
    client = Cassandra.Client(protocol)
    return client

Raw thrift API: Inserting

data = {'id': useruuid, ...}
columns = [Column(k, v, time.time()) for (k, v) in data.items()]
mutations = [Mutation(ColumnOrSuperColumn(column=c)) for c in columns]
rows = {useruuid: {'User': mutations}}

client.batch_mutate('Twissandra', rows, ConsistencyLevel.ONE)

Raw thrift API: Fetching

● get, get_slice, get_count, multiget_slice, get_range_slices

● ColumnOrSuperColumn
● http://wiki.apache.org/cassandra/API

Running twissandra

● cd twissandra
● python manage.py runserver
● Navigate to http://127.0.0.1:8000

Pycassa cheat sheet

● get(key, …)
● multiget(key_list)
● get_range(...)
● insert(key, columns_dict)
● remove(key, ...)

Exercise

● python manage.py shell
● import cass
● help(cass.TWEET.remove)
● Delete the most recent tweet by user

Exercise

● Open cass.py
● Finish save_retweet

Language support

● Python
● Scala
● Ruby

● Speed is a negative

● Java

PHP [thrift] tickets

● https://issues.apache.org/jira/browse/THRIFT-347

Done yet?

● Still doing 1+N queries per page

SuperColumns

Applying SuperColumns to Twissandra

ColumnParent

Supercolumns: limitations

● Column names should be uuids, not longs, to avoid collisions

● Version 1 UUIDs can be sorted by time (“TimeUUID”)

● Any UUID can be sorted by its raw bytes (“LexicalUUID”)
● Usually Version 4

● Slightly less overhead
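Both points can be checked with Python's standard `uuid` module. The sketch builds version 1 UUIDs from explicit fields so the result is repeatable; `make_time_uuid` is a hypothetical helper, and the timestamp, clock sequence, and node values are made up.

```python
import uuid


def make_time_uuid(timestamp, clock_seq, node):
    """Build a version 1 UUID from explicit fields (for a repeatable demo)."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80
    clock_seq_low = clock_seq & 0xFF
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


# Two tweets written in the same tick by different nodes.
same_tick = 0x1DF24E27561A442
a = make_time_uuid(same_tick, clock_seq=1, node=0x001FF3591711)
b = make_time_uuid(same_tick, clock_seq=1, node=0x001FF3591712)

by_long = {same_tick: "tweet A"}
by_long[same_tick] = "tweet B"          # silently replaces tweet A
by_uuid = {a: "tweet A", b: "tweet B"}  # both survive

# TimeUUID order (by embedded timestamp) differs from raw byte order,
# because the low bits of the timestamp come first in the byte layout.
early = make_time_uuid(0x0FFFFFFFF, clock_seq=1, node=1)
late = make_time_uuid(0x100000000, clock_seq=1, node=1)
time_order = sorted([late, early], key=lambda u: u.time)
byte_order = sorted([late, early], key=lambda u: u.bytes)

print(len(by_long), len(by_uuid))
print(time_order == [early, late], byte_order == [late, early])
```

Sorting by raw bytes is still a stable, well-defined order; it just is not chronological, which is why the two comparators are named separately.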

0.7: secondary indexes

● Obviate the need for Userline (but not Timeline)

Lucandra

● What documents contain term X?
● … and term Y?

● … or start with Z?

Lucandra ColumnFamilies

Lucandra data

Term Key          col name     value
"field/term" => { documentId , position vector }

Document Key
"documentId" => { fieldName , value }

Lucandra queries

● get_slice
● get_range_slices
● No silver bullet
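The term-row layout can be modeled as a small inverted index: "contains X and Y" is two row reads plus an intersection, and "starts with Z" is a scan over a contiguous range of term keys. This is a sketch under the assumption that term keys are stored in sorted order; the sample terms, document ids, and helper names are invented for illustration and are not Lucandra's code.

```python
from bisect import bisect_left

# Term rows: "field/term" -> {documentId: position vector}
terms = {
    "body/cassandra": {"doc1": [0], "doc2": [3]},
    "body/column": {"doc2": [1]},
    "body/commit": {"doc3": [2]},
    "body/thrift": {"doc1": [4], "doc3": [0]},
}


def docs_with(term):
    """One row read (get_slice): which documents contain the term?"""
    return set(terms.get("body/" + term, {}))


def docs_with_prefix(prefix):
    """Key range scan (get_range_slices) over ordered term keys."""
    keys = sorted(terms)
    start = "body/" + prefix
    found = set()
    for key in keys[bisect_left(keys, start):]:
        if not key.startswith(start):
            break
        found |= set(terms[key])
    return found


both = docs_with("cassandra") & docs_with("thrift")
either_prefix = docs_with_prefix("co")
print(both, either_prefix)
```

The intersection happens on the client, which is one reason the slide says there is no silver bullet.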

FAQ: counting

● UUIDs + batch process
● Mutex (contrib/mutex or “cages”)
● Use redis or mysql or memcached
● 0.7: vector clocks

● Insert instead of check-then-insert
● Bulk delete with 'forged' timestamps

● In 0.7: use ttl instead

as notroot/notroot:
git clone http://github.com/ericflo/twissandra.git

as root/riptano:
apt-get update
apt-get install python-setuptools
apt-get install python-django
easy_install -U thrift
rm -r /var/lib/cassandra/*
cp twissandra/storage-conf.xml /etc/cassandra
edit /etc/cassandra/log4j.properties to DEBUG
/etc/init.d/cassandra start
tail -f /var/log/cassandra/system.log

as notroot:
find templates |xargs grep empty
# r/m the {empty} blocks
python manage.py runserver
