DZone Cassandra Data Modeling Webinar
![Page 1: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/1.jpg)
Modeling Data In Cassandra Conceptual Differences Versus RDBMS
Matthew F. Dennis, DataStax // @mdennis
June 27, 2012
![Page 2: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/2.jpg)
Cassandra Is Not Relational
Get out of the relational mindset when working with Cassandra (or really any NoSQL DB)
![Page 3: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/3.jpg)
Work Backwards From Queries
Think in terms of queries, not in terms of normalizing the data; in fact, you often want to denormalize (already common in the data warehousing world, even in RDBMS)
![Page 4: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/4.jpg)
OK great, but how do I do that?
Well, you need to know how Cassandra models data (it is modeled on Google Bigtable)
research.google.com/archive/bigtable-osdi06.pdf
Go Read It!
![Page 5: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/5.jpg)
In Cassandra:
➔data is organized into Keyspaces (usually one per app)
➔each Keyspace can have multiple Column Families
➔each Column Family can have many Rows
➔each Row has a Row Key and a variable number of Columns
➔each Column consists of a Name, Value and Timestamp
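The hierarchy above can be sketched with plain Python containers. This is only an illustration of the nesting, not how Cassandra stores anything: the class names, the `insert` helper and the "app"/"users" names are hypothetical, and the "newer timestamp wins" rule is a simplified version of how the timestamp is used to reconcile writes.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Column:
    name: str
    value: str
    timestamp: int  # write time of this column


@dataclass
class Row:
    key: str
    # a row holds a variable number of columns, keyed by column name
    columns: Dict[str, Column] = field(default_factory=dict)

    def put(self, name: str, value: str, timestamp: int) -> None:
        current = self.columns.get(name)
        # simplified reconciliation: a strictly newer timestamp replaces the old column
        if current is None or timestamp > current.timestamp:
            self.columns[name] = Column(name, value, timestamp)


# keyspace name -> column family name -> row key -> Row
keyspaces: Dict[str, Dict[str, Dict[str, Row]]] = {}


def insert(keyspace, column_family, row_key, name, value, timestamp):
    cf = keyspaces.setdefault(keyspace, {}).setdefault(column_family, {})
    row = cf.setdefault(row_key, Row(row_key))
    row.put(name, value, timestamp)


insert("app", "users", "thepaul", "office", "Austin", 1)
insert("app", "users", "thepaul", "OS", "OSX", 1)
insert("app", "users", "thobbs", "office", "Austin", 1)  # no OS column for this row
```

Note that the two rows do not share a fixed schema: each row simply carries whichever columns were written to it.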
![Page 6: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/6.jpg)
In Cassandra, Keyspaces:
➔are similar in concept to a “database” in some RDBMSs
➔are stored in separate directories on disk
➔are usually one-to-one with applications
➔are usually the administrative unit for things related to ops
➔contain multiple column families
![Page 7: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/7.jpg)
In Cassandra, In Keyspaces, Column Families:
➔are similar in concept to a “table” in most RDBMSs
➔are stored in separate files on disk (many per CF)
➔are usually approximately one-to-one with query type
➔are usually the administrative unit for things related to your data
➔can contain many (~billion* per node) rows
* for a good sized node (you can always add nodes)
![Page 8: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/8.jpg)
In Cassandra, In Keyspaces, In Column Families ...
![Page 9: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/9.jpg)
Rows
thepaul office: Austin OS: OSX twitter: thepaul0
mdennis office: UA OS: Linux twitter: mdennis
thobbs office: Austin twitter: tylhobbs
Row Keys
![Page 10: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/10.jpg)
thepaul office: Austin OS: OSX twitter: thepaul0
mdennis office: UA OS: Linux twitter: mdennis
thobbs office: Austin twitter: tylhobbs
Columns
![Page 11: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/11.jpg)
Column Names
thepaul office: Austin OS: OSX twitter: thepaul0
mdennis office: UA OS: Linux twitter: mdennis
thobbs office: Austin twitter: tylhobbs
![Page 12: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/12.jpg)
Column Values
thepaul office: Austin OS: OSX twitter: thepaul0
mdennis office: UA OS: Linux twitter: mdennis
thobbs office: Austin twitter: tylhobbs
![Page 13: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/13.jpg)
thepaul office: Austin OS: OSX twitter: thepaul0
mdennis office: UA OS: Linux twitter: mdennis
thobbs office: Austin twitter: tylhobbs
Rows Are Randomly Ordered
(if using the RandomPartitioner)
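A rough sketch of why row order looks random under the RandomPartitioner: rows are placed by a token derived from an MD5 hash of the row key rather than by the key itself. The `token` helper below is a simplification for illustration (it treats the digest as an unsigned integer), not the partitioner's exact arithmetic.

```python
import hashlib


def token(row_key: str) -> int:
    # simplified: interpret the 128-bit MD5 digest of the key as an unsigned integer
    return int(hashlib.md5(row_key.encode("utf-8")).hexdigest(), 16)


row_keys = ["thepaul", "mdennis", "thobbs"]

# rows are ordered by token, so the order has no relation to the keys' alphabetical order
rows_in_token_order = sorted(row_keys, key=token)
```

The same key always hashes to the same token, so lookups by row key stay cheap; it is only range scans over row keys that stop being meaningful.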
![Page 14: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/14.jpg)
thepaul office: Austin OS: OSX twitter: thepaul0
mdennis office: UA OS: Linux twitter: mdennis
thobbs office: Austin twitter: tylhobbs
Columns Are Ordered by Name
(by a configurable comparator)
![Page 15: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/15.jpg)
Columns are ordered because doing so allows very efficient
implementations of useful and common operations
(e.g. merge join)
![Page 16: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/16.jpg)
In particular, within a row, a column with a given name can
be located very quickly (ordered names => O(log n) binary search)
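As a minimal sketch of that lookup, assume a row is held as two parallel lists, with the column names kept in sorted order. The `get_column` helper is hypothetical and uses Python's `bisect` module for the binary search.

```python
import bisect

# column names sorted by plain byte order (uppercase sorts before lowercase)
names = ["OS", "office", "twitter"]
values = ["OSX", "Austin", "thepaul0"]


def get_column(names, values, name):
    # binary search: O(log n) comparisons instead of scanning every column
    i = bisect.bisect_left(names, name)
    if i < len(names) and names[i] == name:
        return values[i]
    return None  # this row has no column with that name
```

For example, `get_column(names, values, "office")` returns `"Austin"`, while asking for a column the row does not have returns `None`.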
![Page 17: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/17.jpg)
More importantly, I can query for a slice between a start and end
RK ts0 ts1 ... ... tsM ... ... ... ... tsN ... ... ... ... ...
start end
Row Key
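The slice in the picture above can be sketched the same way: with column names kept in sorted order, two binary searches find the start and end positions, and everything in between is one contiguous run. Integer timestamps and the `slice_row` helper are assumptions made for illustration.

```python
import bisect

# one row: column names are timestamps (sorted), values are the events recorded at that time
names = list(range(0, 100, 10))            # 0, 10, 20, ..., 90
values = [f"event-{ts}" for ts in names]


def slice_row(names, values, start, end):
    lo = bisect.bisect_left(names, start)   # first column with name >= start
    hi = bisect.bisect_right(names, end)    # one past the last column with name <= end
    return list(zip(names[lo:hi], values[lo:hi]))


window = slice_row(names, values, 20, 50)
# window == [(20, 'event-20'), (30, 'event-30'), (40, 'event-40'), (50, 'event-50')]
```

The start and end do not need to match existing column names; `slice_row(names, values, 15, 45)` would return the columns for 20, 30 and 40.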
![Page 18: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/18.jpg)
Why does that matter?
Because columns within a row don’t have to be static!
(and random disk seeks are teh evil)
![Page 19: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/19.jpg)
INTC ts0: $25.20 ts1: $25.25 ...
AMR ts0: $6.20 ts9: $0.26 ...
CRDS ts0: $1.05 ts5: $6.82 ...
Columns Are Ordered by Name
(in this case by a TimeUUID Comparator)
The Column Name Can Be Part of Your Data
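A small sketch of that idea, assuming one row per ticker and plain integer timestamps as column names (the slide uses TimeUUIDs; integers keep the example short). The `record_tick` helper and the `ticks` structure are hypothetical.

```python
import bisect
from collections import defaultdict

# row key (ticker) -> columns kept sorted by name, where the name is the tick's timestamp
ticks = defaultdict(list)


def record_tick(ticker, timestamp, price):
    # the timestamp is the column name, the price is the column value;
    # insort keeps the row's columns ordered even if ticks arrive out of order
    bisect.insort(ticks[ticker], (timestamp, price))


record_tick("INTC", 1, 25.25)
record_tick("INTC", 0, 25.20)   # arrives late, still lands in order
record_tick("AMR", 0, 6.20)
record_tick("AMR", 9, 0.26)

latest_intc = ticks["INTC"][-1]  # the newest tick is simply the last column
```

Nothing about the row has to be declared in advance: every new tick just adds one more column to the ticker's row.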
![Page 20: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/20.jpg)
Turns Out That Pattern Comes Up A Lot
➔stock ticks
➔event logs
➔ad clicks/views
➔sensor records
➔access/error logs
➔plane/truck/person/”entity” locations
➔…
![Page 21: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/21.jpg)
OK, but I can do that in SQL
Not efficiently at scale, at least not easily ...
![Page 22: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/22.jpg)
ticker timestamp bid ask ...
AMR ts0 ... ... ...
... ... ... ... ...
CRDS ts0 ... ... ...
... ... ... ... ...
... ts0 ... ... ...
AMR ts1 ... ... ...
... ... ... ... ...
... ... ... ... ...
… ts1 ... ... ...
AMR ts2 ... ... ...
... ts2 ... ... ...
Data I Care About
How it Looks In a RDBMS
![Page 23: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/23.jpg)
ticker timestamp bid ask ...
AMR ts0 ... ... ...
AMR ts1 ... ... ...
AMR ts2 ... ... ...
... ts2 ... ... ...
Disk Seeks
How it Looks In a RDBMS
Larger Than Your Page Size
Larger Than Your Page Size
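A toy model of the picture above, under stated assumptions: rows are written to fixed-size pages in arrival order, four rows fit on a page, and each distinct page touched counts as one seek. The page size, the fourth ticker "XYZ" and the `pages_touched` helper are made up for illustration.

```python
ROWS_PER_PAGE = 4  # assumed page capacity, chosen only to keep the numbers small


def pages_touched(rows, ticker):
    # a row at position i lives on page i // ROWS_PER_PAGE;
    # count the distinct pages that hold at least one row for this ticker
    return len({i // ROWS_PER_PAGE for i, (t, _) in enumerate(rows) if t == ticker})


tickers = ["AMR", "CRDS", "INTC", "XYZ"]

# arrival order: at every timestamp, one tick per ticker is appended to the table
arrival_order = [(t, ts) for ts in range(8) for t in tickers]

# the same rows stored together by ticker, then timestamp
stored_together = sorted(arrival_order)

scattered = pages_touched(arrival_order, "AMR")    # 8: one AMR row on every page
contiguous = pages_touched(stored_together, "AMR")  # 2: all AMR rows are adjacent
```

The data and the query are identical in both cases; only the physical layout changes how many pages have to be read.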
![Page 24: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/24.jpg)
OK, but what about ...
➔PostgreSQL Cluster Command?
➔MySQL Cluster Indexes?
➔Oracle Index Organized Tables?
➔SQLServer Clustered Index?
![Page 25: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/25.jpg)
OK, but what about ...
➔PostgreSQL Cluster Using?
➔MySQL [InnoDB] Cluster Indexes?
➔Oracle Index Organized Table?
➔SQLServer Clustered Index?
Meh ...
![Page 26: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/26.jpg)
The on-disk management of that clustering results in tons of IO …
In the case of PostgreSQL:
➔clustering is a one time operation (implies you must periodically rewrite the entire table)
➔new data is *not* written in clustered order (which is often the data you care most about)
![Page 27: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/27.jpg)
OK, so just partition the tables ...
![Page 28: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/28.jpg)
Not a bad idea, except in MySQL there is a limit of 1024 partitions and generally less if using NDB
(you should probably still do it if using MySQL though)
http://dev.mysql.com/doc/refman/5.5/en/partitioning-limitations.html
![Page 29: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/29.jpg)
OK fine, I agree storing data that is queried together on disk together is a good thing but
what's that have to do with modeling in Cassandra?
RK ts0 ts1 ... ... tsM ... ... ... ... tsN ... ... ... ... ...
Read Precisely My Data *
Seek To Here
* more on some caveats later
![Page 30: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/30.jpg)
Well, that's what is meant by “work backwards from your queries” or “think in terms of queries”
(NB: this concept, in general, applies to RDBMS at scale as well; it is not specific to Cassandra)
![Page 31: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/31.jpg)
An Example From Fraud Detection
To calculate risk it is common to need to know all the emails, destinations, origins, devices, locations, phone
numbers, et cetera ever used for the account in question
![Page 32: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/32.jpg)
id name ...
1 guy ...
2 gal ...
... ... ...
id email ...
100 guy@ ...
200 gal@ ...
... ... ...
id dest ...
15 USA ...
25 Finland ...
... ... ...
id device ...
1000 0xdead ...
2000 0xb33f ...
... ... ...
id origin ...
150 USA ...
250 Nigeria ...
... ... ...
In a normalized model that usually translates to a table for each type of entity being tracked
![Page 33: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/33.jpg)
The problem is that at scale that also means a disk seek for each one …
(even for perfect IOT et al if across multiple tables)
➔Previous emails? That's a seek …
➔Previous devices? That's a seek …
➔Previous destinations? That's a seek ...
![Page 34: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/34.jpg)
But In Cassandra I Store The Data I Query Together On Disk Together
(remember, column names need not be static)
acctY ... ... ... ... ... ... ...
acctX dest21 dev2 dev7 email9 orig4 ...
acctZ ... ... ... ... ... ... ...
Data I Care About
email:[email protected] = dateEmailWasLastUsed
Column Name Column Value
email3
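One way to picture the single-row layout: every fact about the account is a column whose name starts with its type, so all columns of one type sit next to each other in sorted order and can be read as a prefix slice. The column names, the example address and the `columns_with_prefix` helper below are hypothetical.

```python
import bisect

# all columns of one account row, sorted by name ("type:value")
acct_x = sorted([
    "dest:Finland",
    "dest:USA",
    "device:0xb33f",
    "device:0xdead",
    "email:someone@example.com",   # placeholder address
    "origin:USA",
])


def columns_with_prefix(names, prefix):
    lo = bisect.bisect_left(names, prefix)
    # "\uffff" sorts after any character used in these names, so this bounds the prefix range
    hi = bisect.bisect_left(names, prefix + "\uffff")
    return names[lo:hi]


devices = columns_with_prefix(acct_x, "device:")
# devices == ['device:0xb33f', 'device:0xdead']
```

Reading everything needed for the risk calculation is then a read of the whole row, rather than one lookup per entity table.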
![Page 35: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/35.jpg)
Don't treat Cassandra (or any DB) as a black box
➔Understand how your DBs (and data structures) work
➔Understand the building blocks they provide
➔Understand the work complexity (“big O”) of queries
➔For data sets > memory, goal is to minimize seeks *
* on a related note, SSDs are awesome
![Page 36: DZone Cassandra Data Modeling Webinar](https://reader034.fdocuments.in/reader034/viewer/2022042613/546d1addaf795971298b513d/html5/thumbnails/36.jpg)
Q?
Modeling Data In Cassandra
Conceptual Differences Versus RDBMS
Matthew F. Dennis, DataStax // @mdennis