Page 1: Real data models of silicon valley

Real Data Models of Silicon Valley

Patrick McFadin

Chief Evangelist for Apache Cassandra

@PatrickMcFadin

Page 2: Real data models of silicon valley

It's been an epic year

Page 3: Real data models of silicon valley

I've had a ton of fun!

• Traveling the world talking to people like you!

Warsaw

Stockholm

Melbourne

New York

Vancouver

Dublin

Page 4: Real data models of silicon valley

What's new?

• 2.1 is out!

• Amazing changes for performance and stability

Page 5: Real data models of silicon valley

Where are we going?

• 3.0 is next. Just hold on…

Page 6: Real data models of silicon valley

KillrVideo.com

• 2012 Summit

• Complete example for data modeling

www.killrvideos.com

[Page mockup: video title, description, rating, tags (Foo, Bar), username, comments, recommended videos, ads, and an upload button. Cat drawing by goodrob13 on Flickr.]

Page 7: Real data models of silicon valley

It’s alive!!!

• Hosted on Azure

• Code on GitHub

Page 8: Real data models of silicon valley

Data Model - Revisited

• Add in some 2.1 data models

• Replace (or remove) some app code

• Become a part of Cassandra OSS download

Page 9: Real data models of silicon valley

User Defined Types

• Complex data in one place

• No multi-gets (multi-partitions)

• Nesting!

CREATE TYPE address (
    street text,
    city text,
    zip_code int,
    country text,
    cross_streets set<text>
);

Page 10: Real data models of silicon valley

Before

CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name varchar,
    description varchar,
    location text,
    location_type int,
    preview_thumbnails map<text,text>,
    tags set<varchar>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);

CREATE TABLE video_metadata (
    video_id uuid PRIMARY KEY,
    height int,
    width int,
    video_bit_rate set<text>,
    encoding text
);

SELECT * FROM videos WHERE videoid = 2;

SELECT * FROM video_metadata WHERE video_id = 2;

Title: Introduction to Apache Cassandra

Description: A one hour talk on everything you need to know about a totally amazing database.

480 720

Playback rate:

In-application join

Page 11: Real data models of silicon valley

After

• Now video_metadata is embedded in videos

CREATE TYPE video_metadata (
    height int,
    width int,
    video_bit_rate set<text>,
    encoding text
);

CREATE TABLE videos (
    videoid uuid,
    userid uuid,
    name varchar,
    description varchar,
    location text,
    location_type int,
    preview_thumbnails map<text,text>,
    tags set<varchar>,
    metadata set<frozen<video_metadata>>,
    added_date timestamp,
    PRIMARY KEY (videoid)
);

Page 12: Real data models of silicon valley

Wait! Frozen??

• Staying out of technical debt

• 3.0 UDTs will not have to be frozen

• Applicable to User Defined Types and Tuples (wait for it…)

Do you want to build a schema? Do you want to store some JSON?

Page 13: Real data models of silicon valley

Let’s store some JSON

{
    "productId": 2,
    "name": "Kitchen Table",
    "price": 249.99,
    "description": "Rectangular table with oak finish",
    "dimensions": {
        "units": "inches",
        "length": 50.0,
        "width": 66.0,
        "height": 32
    },
    "categories": {
        "Home Furnishings": {
            "catalogPage": 45,
            "url": "/home/furnishings"
        },
        "Kitchen Furnishings": {
            "catalogPage": 108,
            "url": "/kitchen/furnishings"
        }
    }
}

Page 14: Real data models of silicon valley

Let’s store some JSON

CREATE TYPE dimensions (
    units text,
    length float,
    width float,
    height float
);

Page 15: Real data models of silicon valley

Let’s store some JSON

CREATE TYPE category (
    catalogPage int,
    url text
);

Page 16: Real data models of silicon valley

Let’s store some JSON

CREATE TABLE product (
    productId int,
    name text,
    price float,
    description text,
    dimensions frozen<dimensions>,
    categories map<text, frozen<category>>,
    PRIMARY KEY (productId)
);

Page 17: Real data models of silicon valley

Let’s store some JSON

INSERT INTO product (productId, name, price, description, dimensions, categories)
VALUES (
    2,
    'Kitchen Table',
    249.99,
    'Rectangular table with oak finish',
    { units: 'inches', length: 50.0, width: 66.0, height: 32 },
    {
        'Home Furnishings': { catalogPage: 45, url: '/home/furnishings' },
        'Kitchen Furnishings': { catalogPage: 108, url: '/kitchen/furnishings' }
    }
);

dimensions frozen <dimensions>

categories map <text, frozen <category>>
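The mapping from the JSON document to the INSERT literal can be sketched in a few lines of Python. This is only an illustration, not how KillrVideo does it: the helper names (cql_text, cql_scalar, cql_udt, cql_map_of_udt, product_insert) are hypothetical, and the sketch assumes "categories" is a JSON object keyed by category name. A real application would normally bind values through a driver's prepared statements instead of building strings.

```python
import json

def cql_text(s):
    """Quote a string as a CQL text literal, doubling embedded single quotes."""
    return "'" + s.replace("'", "''") + "'"

def cql_scalar(v):
    """Render a string or number as a CQL literal (hypothetical helper)."""
    return cql_text(v) if isinstance(v, str) else repr(v)

def cql_udt(fields):
    """Render a flat dict as a UDT literal: field names are unquoted."""
    return "{ " + ", ".join(f"{k}: {cql_scalar(v)}" for k, v in fields.items()) + " }"

def cql_map_of_udt(entries):
    """Render a dict of dicts as a map<text, frozen<udt>> literal: keys are quoted."""
    return "{ " + ", ".join(f"{cql_text(k)}: {cql_udt(v)}" for k, v in entries.items()) + " }"

def product_insert(doc):
    """Build the INSERT for the product table from a parsed JSON document."""
    values = [
        cql_scalar(doc["productId"]),
        cql_scalar(doc["name"]),
        cql_scalar(doc["price"]),
        cql_scalar(doc["description"]),
        cql_udt(doc["dimensions"]),            # dimensions frozen<dimensions>
        cql_map_of_udt(doc["categories"]),     # categories map<text, frozen<category>>
    ]
    return ("INSERT INTO product "
            "(productId, name, price, description, dimensions, categories) "
            "VALUES (" + ", ".join(values) + ");")

doc = json.loads("""
{
  "productId": 2,
  "name": "Kitchen Table",
  "price": 249.99,
  "description": "Rectangular table with oak finish",
  "dimensions": {"units": "inches", "length": 50.0, "width": 66.0, "height": 32},
  "categories": {
    "Home Furnishings": {"catalogPage": 45, "url": "/home/furnishings"},
    "Kitchen Furnishings": {"catalogPage": 108, "url": "/kitchen/furnishings"}
  }
}
""")

stmt = product_insert(doc)
print(stmt)
```

The nested "dimensions" object becomes a UDT literal with bare field names, while "categories" becomes a map literal whose keys are quoted text and whose values are UDT literals.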

Page 18: Real data models of silicon valley

Retrieving fields

Page 19: Real data models of silicon valley

Counters pt Deux

• Since 0.8

• Commit log replay would change counters

• Repair could change counters

• Performance was inconsistent. Lots of GC

Page 20: Real data models of silicon valley

The good

• Stable under load

• No commit log replay issues

• No repair weirdness

Page 21: Real data models of silicon valley

The bad

• Still can’t delete/reset counters

• Still needs to do a read before write.

Page 22: Real data models of silicon valley

Usage

Wait for it…

It’s the same! Carry on…

Page 23: Real data models of silicon valley

Static Fields

• New as of 2.0.6

• VERY specific, but useful

• Thrift people will like this

CREATE TABLE t ( k text, s text STATIC, i int, PRIMARY KEY (k, i) );

Page 24: Real data models of silicon valley

Why?

CREATE TABLE weather (
    id int,
    time timestamp,
    weatherstation_name text,
    temperature float,
    PRIMARY KEY (id, time)
);

Without a static column, the station name is stored again in every row of the partition:

ID = 1 (Partition Key / Storage Row Key)

Partition Row 1: 2014-09-08 12:00:00 : name = SFO, 2014-09-08 12:00:00 : temp = 63.4

Partition Row 2: 2014-09-08 12:01:00 : name = SFO, 2014-09-08 12:01:00 : temp = 63.9

Partition Row 3: 2014-09-08 12:02:00 : name = SFO, 2014-09-08 12:02:00 : temp = 64.0

With a static column, the station name is stored once for the whole partition:

ID = 1 (Partition Key / Storage Row Key)

name = SFO

Partition Row 1: 2014-09-08 12:00:00 : temp = 63.4

Partition Row 2: 2014-09-08 12:01:00 : temp = 63.9

Partition Row 3: 2014-09-08 12:02:00 : temp = 64.0

CREATE TABLE weather (
    id int,
    time timestamp,
    weatherstation_name text static,
    temperature float,
    PRIMARY KEY (id, time)
);
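A small Python model makes the saving concrete. It is a deliberately simplified sketch, not Cassandra's actual storage format: it assumes one cell per non-key column per row and ignores row markers and other overhead. The function names layout_regular and layout_static are hypothetical.

```python
# Three readings for weather station id = 1, one per minute.
readings = [
    ("2014-09-08 12:00:00", 63.4),
    ("2014-09-08 12:01:00", 63.9),
    ("2014-09-08 12:02:00", 64.0),
]

def layout_regular(station, readings):
    """Cells when weatherstation_name is a regular column: repeated per row."""
    cells = {}
    for ts, temp in readings:
        cells[(ts, "name")] = station
        cells[(ts, "temp")] = temp
    return cells

def layout_static(station, readings):
    """Cells when weatherstation_name is static: stored once per partition."""
    cells = {("static", "name"): station}
    for ts, temp in readings:
        cells[(ts, "temp")] = temp
    return cells

regular = layout_regular("SFO", readings)
static = layout_static("SFO", readings)
print(len(regular), "cells without static;", len(static), "cells with static")
```

In this model the regular layout grows by two cells per reading, while the static layout grows by one cell per reading plus a single shared name cell.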

Page 25: Real data models of silicon valley

Usage

• Put static at the end of the column declaration

• Can’t be a part of primary key

CREATE TABLE video_event (
    videoid uuid,
    userid uuid,
    preview_image_location text static,
    event varchar,
    event_timestamp timeuuid,
    video_timestamp bigint,
    PRIMARY KEY ((videoid, userid), event_timestamp, event)
) WITH CLUSTERING ORDER BY (event_timestamp DESC, event ASC);

Page 26: Real data models of silicon valley

Tuples

• A type that represents a group

• Up to 256 different elements

CREATE TABLE tuple_table (
    id int PRIMARY KEY,
    three_tuple frozen<tuple<int, text, float>>,
    four_tuple frozen<tuple<int, text, float, inet>>,
    five_tuple frozen<tuple<int, text, float, inet, ascii>>
);

Page 27: Real data models of silicon valley

Example Usage

• Track a drone’s position

• x, y, z in a 3D Cartesian coordinate system

CREATE TABLE drone_position (
    droneId int,
    time timestamp,
    position frozen<tuple<float, float, float>>,
    PRIMARY KEY (droneId, time)
);

Page 28: Real data models of silicon valley

What about partition size?

• A CQL partition is a logical projection of a storage row

• Storage rows can have up to 2 billion cells

• Each cell can hold up to 2 GB of data

Page 29: Real data models of silicon valley

How much is too much?

• How many cells before performance degrades?

• How many bytes per partition before it’s unmanageable?

• What is “practical”?

Page 30: Real data models of silicon valley

Old answer

• 2011: Pre-Cassandra 1.2 (actually tested on 0.8)

• Aaron Morton, Cassandra MVP and Founder of The Last Pickle

Page 31: Real data models of silicon valley

Conclusion

• Keep partition (storage row) length < 10k cells

• Keep total size in bytes below 64 MB (multi-pass compaction)

• Multiple hits to the 64 KB page size will start to hurt

TL;DR - It’s a performance tunable

Page 32: Real data models of silicon valley

The tests revisited

• Attempted to reproduce the same tests using CQL

• Cassandra 2.1, 2.0 and 1.2

• Tested partition sizes

1. 100
2. 2114
3. 5,000
4. 10,000
5. 100,000
6. 1,000,000
7. 10,000,000
8. 100,000,000
9. 1,000,000,000

Page 33: Real data models of silicon valley

Results

[Chart: mSec versus cells per partition]

Page 34: Real data models of silicon valley

The new answer

• Hundreds of thousands of cells per partition is not a problem

• Hundreds of megabytes per partition is best operationally

• The issue to manage is operations
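To relate these numbers to a concrete table, here is a rough back-of-the-envelope sketch in Python for the weather table with a static station name. The model is an assumption, not Cassandra's exact accounting: it counts one cell per regular column per row plus one cell per static column, and it assumes one reading per minute. The function names cells_in_partition and days_until are hypothetical.

```python
MINUTES_PER_DAY = 24 * 60  # assumed write rate: one reading per minute

def cells_in_partition(rows, regular_columns=1, static_columns=1):
    """Approximate cell count: regular cells per row plus static cells once."""
    return rows * regular_columns + static_columns

def days_until(cell_limit, rows_per_day=MINUTES_PER_DAY,
               regular_columns=1, static_columns=1):
    """Days of data a single partition can take before reaching cell_limit."""
    rows = (cell_limit - static_columns) // regular_columns
    return rows / rows_per_day

one_year = cells_in_partition(365 * MINUTES_PER_DAY)
print("cells after one year:", one_year)
print("days to reach 10,000 cells:", round(days_until(10_000), 1))
print("days to reach 100,000 cells:", round(days_until(100_000), 1))
```

Under these assumptions a station partition passes the old 10k-cell guideline in about a week, while a full year of readings stays in the hundreds of thousands of cells.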

Page 35: Real data models of silicon valley

Thank You!

Follow me on Twitter for more @PatrickMcFadin

Page 36: Real data models of silicon valley

Cassandra Summit 2014

September 10 - 11 | #CassandraSummit