Storage cassandra
Transcript of Storage cassandra
![Page 1: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/1.jpg)
Cassandra
Roc.Yang
2011.04
![Page 2: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/2.jpg)
Contents
1. Overview
2. Data Model
3. Storage Model
4. System Architecture
5. Read & Write
6. Other
![Page 3: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/3.jpg)
Cassandra
Overview
![Page 4: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/4.jpg)
Cassandra From Facebook
![Page 5: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/5.jpg)
Cassandra To
![Page 6: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/6.jpg)
Cassandra – From Dynamo and Bigtable
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It brings together the distributed-systems techniques of Amazon's Dynamo and the data model of Google's Bigtable. Like Dynamo, Cassandra is eventually consistent; like Bigtable, it provides a ColumnFamily-based data model richer than typical key/value systems. Cassandra was open-sourced by Facebook in 2008, where it was designed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik (a Facebook engineer). In many ways you can think of Cassandra as Dynamo 2.0, or a marriage of Dynamo and Bigtable.
![Page 7: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/7.jpg)
Cassandra - Overview
Cassandra is a distributed storage system for managing very large amounts of structured data spread across many commodity servers, while providing a highly available service with no single point of failure.
Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format.
![Page 8: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/8.jpg)
Cassandra - Highlights
● High availability
● Incremental scalability
● Eventually consistent
● Tunable tradeoffs between consistency and latency
● Minimal administration
● No SPOF (single point of failure)
![Page 9: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/9.jpg)
Cassandra – Trade Offs
● No Transactions
● No Ad-hoc Queries
● No Joins
● No Flexible Indexes
• Data Modeling with Cassandra Column Families: http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families
![Page 10: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/10.jpg)
Cassandra From Dynamo and BigTable
• Introduction to Cassandra: Replication and Consistency: http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency
![Page 11: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/11.jpg)
Dynamo-like Features
● Symmetric, P2P architecture: no special nodes, no SPOFs
● Gossip-based cluster management
● Distributed hash table for data placement: pluggable partitioning, pluggable topology discovery, pluggable placement strategies
● Tunable, eventual consistency
![Page 12: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/12.jpg)
BigTable-like Features
● Sparse, "columnar" data model: optional 2-level maps called super column families
● SSTable disk storage: append-only commit log, memtable (buffer and sort), immutable SSTable files
● Hadoop Integration
![Page 13: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/13.jpg)
Brewer's CAP Theorem
CAP: Consistency, Availability, and Partition tolerance. Theorem: for any shared-data system you can have at most two of these three properties — so pick two.
http://www.julianbrowne.com/article/viewer/brewers-cap-theorem
![Page 14: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/14.jpg)
ACID & BASE
ACID (Atomicity, Consistency, Isolation, Durability) vs. BASE (Basically Available, Soft-state, Eventually consistent).

ACID: http://en.wikipedia.org/wiki/ACID
ACID and BASE: MySQL and NoSQL: http://www.schoonerinfotech.com/solutions/general/what_is_nosql

ACID: strong consistency; isolation; focus on "commit"; nested transactions; availability?; conservative (pessimistic); difficult evolution (e.g. schema changes).

BASE: weak consistency (stale data OK); availability first; best effort; approximate answers OK; aggressive (optimistic); simpler! faster; easier evolution.
![Page 15: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/15.jpg)
NoSQL
The term "NoSQL" was used in 1998 as the name for a lightweight, open source relational database that did not expose a SQL interface. Its author, Carlo Strozzi, claims that since the NoSQL movement "departs from the relational model altogether, it should therefore have been called more appropriately 'NoREL', or something to that effect." Related terms: CAP, BASE, eventual consistency.
NoSQL: http://en.wikipedia.org/wiki/NoSQL
http://nosql-database.org/
![Page 16: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/16.jpg)
Dynamo & Bigtable
● Dynamo partitioning and replication
● Log-structured ColumnFamily data model similar to Bigtable's
● Bigtable: A Distributed Storage System for Structured Data, 2006
● Dynamo: Amazon's Highly Available Key-value Store, 2007
![Page 17: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/17.jpg)
Dynamo & Bigtable
● Bigtable: strong consistency; sparse map data model; built on GFS, Chubby, etc.
● Dynamo: O(1) distributed hash table (DHT); BASE (eventual consistency); client-tunable consistency/availability
![Page 18: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/18.jpg)
Dynamo & Bigtable
● CP: Bigtable, Hypertable, HBase
● AP: Dynamo, Voldemort, Cassandra
![Page 19: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/19.jpg)
Cassandra
Dynamo Overview
![Page 20: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/20.jpg)
Dynamo Architecture & Lookup
● O(1) node lookup
● Explicit replication
● Eventually consistent
![Page 21: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/21.jpg)
Dynamo
The name "Dynamo" refers to two unrelated systems:
a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience (the one discussed here);
a software dynamic optimization system capable of transparently improving the performance of a native instruction stream as it executes on the processor.
![Page 22: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/22.jpg)
Service-Oriented Architecture
![Page 23: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/23.jpg)
Dynamo Techniques
Key techniques in the Dynamo architecture:

Problem | Technique adopted
Balanced data distribution | Improved consistent hashing; data replication
Data conflict handling | Vector clocks
Temporary failure handling | Hinted handoff; weak quorum with tunable parameters (W, R, N)
Recovery from permanent failures | Merkle hash trees
Membership and failure detection | Gossip-based membership protocol and failure detection
![Page 24: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/24.jpg)
Dynamo Techniques Advantages
Summary of techniques used in Dynamo and their advantages
![Page 25: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/25.jpg)
Dynamo: Balanced Data Distribution

Advantages of consistent hashing: load balancing, and masking differences in node capacity via virtual nodes.

[Figure: a hash ring with physical nodes A–G and virtual nodes A–D; both node positions and data keys (e.g. key k) are placed on the ring by computing their hash values.]
![Page 26: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/26.jpg)
Dynamo: Data Conflict Handling

Eventual consistency model; conflicts are tracked and resolved with vector clocks.
![Page 27: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/27.jpg)
Dynamo: Temporary Failure Handling

Read/write parameters W, R, N:
N: the number of replicas of each record in the system
W: the number of replicas that must acknowledge a write for it to succeed
R: the minimum number of replicas that must respond to a read

As long as R + W > N holds, users may configure R and W freely. Advantage: a tunable balance between availability and fault tolerance.
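The overlap condition above can be sketched in a few lines (an illustrative helper, not part of any Dynamo or Cassandra API):

```python
# Sketch of Dynamo-style quorum tuning.
# N = replicas per record, W = acks required for a write, R = replicas read.
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    """R + W > N guarantees the read and write quorums overlap,
    so every read sees at least one up-to-date replica."""
    return r + w > n

# Common configurations:
assert is_strongly_consistent(n=3, w=2, r=2)      # quorum reads and writes
assert not is_strongly_consistent(n=3, w=1, r=1)  # fast, but may read stale data
```

Lowering W or R below the quorum threshold trades consistency for latency, which is exactly the tunable tradeoff listed among Cassandra's highlights.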
![Page 28: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/28.jpg)
Dynamo: Recovery from Permanent Failures

Merkle hash trees: in Dynamo, each leaf of the Merkle tree holds the hash of a stored data item, and each parent node holds the hash of all of its children.

[Figure: two Merkle trees, A and B, built over mostly identical data; comparing hashes from the root downward localizes the leaves that differ.]
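The tree construction described above can be sketched as follows (a minimal illustration using SHA-256; Dynamo's actual hashing and tree layout differ):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(blocks):
    """Leaves hash the stored data; each parent hashes the concatenation
    of its two children (an odd trailing node is duplicated)."""
    level = [h(b) for b in blocks]
    while len(level) > 1:
        if len(level) % 2:                # odd count: carry the last node up
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Two replicas compare roots: equal roots mean the ranges are in sync;
# unequal roots mean they descend the tree to find the divergent keys.
a = merkle_root([b"k1=v1", b"k2=v2", b"k3=v3", b"k4=v4"])
b = merkle_root([b"k1=v1", b"k2=v2", b"k3=XX", b"k4=v4"])
assert a != b
```

Exchanging only the root (and, on mismatch, the divergent subtrees) lets two replicas synchronize a key range while transferring far less data than a full scan.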
![Page 29: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/29.jpg)
Dynamo: Membership and Failure Detection

Gossip-based membership detection.

[Figure: new nodes 1 and 2 join the cluster by contacting a seed node, then exchange membership state via gossip with existing nodes A, B, and C.]
![Page 30: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/30.jpg)
Consistent Hashing - Dynamo
Dynamo 把每台 server 分成 v 个虚拟节点,再把所有虚拟节点 (n*v) 随机分配到一致性哈希的圆环上,这样所有的用户从自己圆环上的位置顺时针往下取到第一个 vnode 就是自己所属节点。当此节点存在故障时,再顺时针取下一个作为替代节点。
发生单点故障时负载会均衡分散到其他所有节点,程序实现也比较优雅。
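The vnode scheme above can be sketched like this (illustrative names and MD5-based placement; real systems tune the hash and vnode count):

```python
import hashlib
from bisect import bisect_right

def token(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    """Each server gets `vnodes` virtual positions on the ring; a key is
    owned by the first virtual node clockwise from the key's hash."""
    def __init__(self, servers, vnodes=8):
        self.ring = sorted(
            (token(f"{s}#{i}"), s) for s in servers for i in range(vnodes)
        )
        self.tokens = [t for t, _ in self.ring]

    def owner(self, key: str) -> str:
        i = bisect_right(self.tokens, token(key)) % len(self.ring)
        return self.ring[i][1]

servers = ["node-a", "node-b", "node-c"]
ring = Ring(servers)
dead = ring.owner("user:42")
# Removing the owner reassigns the key to the next vnode clockwise:
survivor = Ring([s for s in servers if s != dead])
assert survivor.owner("user:42") != dead
```

Because one physical server's vnodes are scattered around the ring, its failure sheds load onto many survivors instead of a single neighbor.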
![Page 31: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/31.jpg)
Consistent Hashing - Dynamo
![Page 32: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/32.jpg)
Cassandra
Bigtable Overview
![Page 33: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/33.jpg)
Bigtable
Replica
Replica
Replica
Replica
Master
GFS(Google File System)
Bigtable
TabletServer
TabletServer
TabletServer
Chubby
Client
Cluster Management System
![Page 34: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/34.jpg)
Bigtable
Tablet
In Bigtable, a table is split into slices called tablets; each tablet is kept at roughly 100–200 MB.

Column Families
① The basic unit of access control;
② All data stored in a column family is usually of the same type (data in the same column family is compressed together).

Timestamp
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp.

Bigtable treats data as uninterpreted strings.
![Page 35: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/35.jpg)
Bigtable: Data Model
<Row, Column, Timestamp> triple as the key — lookup, insert, and delete API.

Arbitrary "columns" on a row-by-row basis: column family:qualifier. The family is heavyweight, the qualifier lightweight. Column-oriented physical store — rows are sparse!

Does not support a relational model: no table-wide integrity constraints, no multi-row transactions.
![Page 36: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/36.jpg)
A three-level hierarchy analogous to that of a B+ tree stores tablet location information.
Bigtable: Tablet location hierarchy
![Page 37: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/37.jpg)
Bigtable: METADATA
The first level is a file stored in Chubby that contains the location of the root tablet
The root tablet contains the location of all tablets in a special METADATA table
The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row
Each METADATA row stores approximately 1KB of data in memory
METADATA table also stores secondary information, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis
![Page 38: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/38.jpg)
Bigtable: Tablet Representation
![Page 39: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/39.jpg)
Bigtable: SSTable
![Page 40: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/40.jpg)
Cassandra
Data Model
![Page 41: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/41.jpg)
Cassandra – Data Model
A table in Cassandra is a distributed multi-dimensional map indexed by a key; the value is a highly structured object.
Every operation under a single row key is atomic per replica, no matter how many columns are being read or written.
Columns are grouped into sets called column families (much as in the Bigtable system). Cassandra exposes two kinds of column families: Simple and Super.
A super column family can be visualized as a column family within a column family.
![Page 42: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/42.jpg)
Cassandra – Data Model
Column families are declared up front; columns — and SuperColumns — are added and modified dynamically.

[Figure: a row, located by KEY, spanning three column families:
- ColumnFamily1 "MailList" (Type: Simple, Sort: Name) — columns tid1..tid4, each a (Name, Value: <Binary>, TimeStamp) triple;
- ColumnFamily2 "WordList" (Type: Super, Sort: Time) — super columns "aloha" and "dude", each holding columns of the form (C1, V1, T1) … (C6, V6, T6);
- ColumnFamily3 "System" (Type: Super, Sort: Name) — super columns hint1..hint4, each with a column list.]
![Page 43: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/43.jpg)
Cassandra – Data Model
Keyspace: the uppermost namespace; typically one per application; roughly equivalent to a database.
ColumnFamily: associates records of a similar kind (not the same kind, because CFs are sparse tables); record-level atomicity; indexed.
Row: each row is uniquely identifiable by key; rows group columns and super columns.
Column: the basic unit of storage.
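The hierarchy above can be pictured as nested sorted maps. The sketch below is purely illustrative (plain Python dicts; the keyspace, CF, and column names are made up and this is not a Cassandra client API):

```python
# Cluster > Keyspace > ColumnFamily > Row > Column, as nested maps.
cluster = {
    "MyApp": {                       # Keyspace  ~= database
        "Users": {                   # ColumnFamily ~= sparse table
            "row-key-1": {           # Row, located by its key
                "email": ("a@b.c", 1303171200),   # Column: name -> (value, timestamp)
                "name":  ("Alice",  1303171201),
            },
            "row-key-2": {           # rows are sparse: different columns per row
                "phone": ("555-0199", 1303171300),
            },
        },
    },
}

row = cluster["MyApp"]["Users"]["row-key-1"]
assert row["email"][0] == "a@b.c"
```

Note how "row-key-2" carries a column that "row-key-1" lacks: a column family imposes no fixed schema across rows.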
![Page 44: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/44.jpg)
Cassandra – Data Model
![Page 45: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/45.jpg)
Cassandra – Data Model (an example)
![Page 46: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/46.jpg)
Cassandra – Data Model
http://www.divconq.com/2010/cassandra-columns-and-supercolumns-and-rows/
![Page 47: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/47.jpg)
Cassandra – Data Model
http://www.divconq.com/2010/cassandra-columns-and-supercolumns-and-rows/
![Page 48: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/48.jpg)
Cassandra – Data Model - Cluster
Cluster
![Page 49: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/49.jpg)
Cassandra – Data Model - Cluster
Cluster > Keyspace
Partitioners: OrderPreservingPartitioner, RandomPartitioner

Like an RDBMS schema: one keyspace per application
![Page 50: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/50.jpg)
Cassandra – Data Model
Cluster > Keyspace > Column Family
Like an RDBMS table: separates types in an app
![Page 51: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/51.jpg)
Cassandra – Data Model
SortedMap<Name,Value>...
Cluster > Keyspace > Column Family > Row
![Page 52: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/52.jpg)
Cassandra – Data Model
Cluster > Keyspace > Column Family > Row > “Column”
…Name → Value, i.e. byte[] → byte[], plus a version timestamp

Not like an RDBMS column — it is an attribute of the row: each row can contain millions of different columns
![Page 53: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/53.jpg)
Cassandra – Data Model
Any column within a column family is accessed using the convention: column family : column
Any column within a column family of type super is accessed using the convention: column family : super column : column
![Page 54: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/54.jpg)
Cassandra
Storage Model
![Page 55: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/55.jpg)
Storage Model
[Figure: the storage model. A write for key (CF1, CF2, CF3) is first binary-serialized to a commit log on a dedicated disk, then applied to per-ColumnFamily memtables. When a memtable exceeds a threshold (data size, number of objects, or lifetime), it is flushed to a data file on disk containing, per key: <key name><size of key data><index of columns/supercolumns><serialized column family>. Each data file carries a block index (<key name>: offset entries such as K128, K256, K384) and a Bloom filter kept in memory.]
![Page 56: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/56.jpg)
Storage Model-Compactions
[Figure: compaction. Three sorted SSTable data files (keys such as K1–K30, with overlaps and deleted entries) are merge-sorted into one new sorted data file, along with a new index file (key → offset entries such as K1, K5, K30) and a Bloom filter loaded in memory.]
![Page 57: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/57.jpg)
Storage Model - Write
The client sends a write request to any random node in the Cassandra cluster.
The partitioner decides which node is responsible for the data: RandomPartitioner (distribution purely by hash) or OrderPreservingPartitioner (sorted by the data's original order).
The owner node first writes a log entry locally, then applies the write to its in-memory copy (the memtable).
The commit log is kept on a dedicated local disk.
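The log-then-memtable ordering can be sketched as follows (a toy illustration; class and file names are invented, and real commit-log formats and fsync policies are more involved):

```python
import json, os, tempfile

class ToyNode:
    """Toy write path: append to an on-disk commit log first,
    then apply the write to the in-memory per-ColumnFamily memtable."""
    def __init__(self, log_path):
        self.log_path = log_path
        self.memtables = {}                       # column family -> {key: columns}

    def write(self, cf, key, columns):
        with open(self.log_path, "a") as log:     # 1. durable commit-log entry
            log.write(json.dumps([cf, key, columns]) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.memtables.setdefault(cf, {})[key] = columns   # 2. in-memory copy

fd, path = tempfile.mkstemp()
os.close(fd)
node = ToyNode(path)
node.write("Users", "k1", {"email": "a@b.c"})
assert node.memtables["Users"]["k1"]["email"] == "a@b.c"
os.remove(path)
```

Because the log is append-only and the memtable is pure memory, the critical path touches the disk only sequentially — the property the next slide emphasizes.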
![Page 58: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/58.jpg)
Storage Model - Write
No locks on the critical path; sequential disk access only; behaves like a write-through cache; append-only operations with no extra read overhead; atomicity is guaranteed only per ColumnFamily; always writable (via hinted handoff) — writes succeed even when nodes have failed.
![Page 59: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/59.jpg)
Storage Model - Read
A read can be initiated at any node; the partitioner routes it to the responsible nodes. The coordinator waits for R responses, then waits for the remaining N − R responses in the background and performs read repair.

• A read may touch multiple SSTables.
• Reads are slower than writes (but still fast).
• Bloom filters reduce the number of SSTables that must be checked.
• Key/column indexes speed up locating keys and columns within an SSTable.
• Providing more memory reduces lookup time and frequency.
• Scales to millions of records.
![Page 60: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/60.jpg)
Cassandra – Storage
Cassandra's storage machinery borrows from Bigtable's design, using memtables and SSTables. Like a relational database, Cassandra logs each write before applying it: the write first goes to the commit log, then into the memtable of the corresponding column family, whose contents are kept sorted by key. The memtable is an in-memory structure; once certain conditions are met, it is flushed in batch to disk and stored as an SSTable. This mechanism acts like a write-back cache: it turns random I/O writes into sequential I/O writes, greatly reducing the pressure that heavy write loads put on the storage system. Once written, an SSTable is immutable and can only be read; the next memtable flush goes to a new SSTable file. For Cassandra, then, there are effectively only sequential writes and no random writes. SSTable: http://wiki.apache.org/cassandra/ArchitectureSSTable
![Page 61: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/61.jpg)
Cassandra – Storage
Because SSTables cannot be updated in place, the data of a single column family may end up spread across many SSTables. A query must then merge reads over all of that column family's SSTables and its memtable, so once the number of SSTables grows large, query performance can degrade badly. Cassandra therefore needs a way to quickly determine which SSTables might contain a given key, without reading and merging all of them. It uses a Bloom filter: several hash functions map each key into a bitmap, giving a fast test of which SSTables a key may belong to.
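The Bloom filter described above can be sketched in a few lines (illustrative parameters; Cassandra sizes its filters from the expected key count):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions set bits in an m-bit bitmap.
    Lookups may yield false positives, but never false negatives."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, key):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all(self.bits >> p & 1 for p in self._positions(key))

bf = BloomFilter()
bf.add("row-key-1")
assert bf.might_contain("row-key-1")   # no false negatives for added keys
```

An SSTable whose filter answers "no" can be skipped without any disk read; only SSTables that answer "maybe" are actually consulted.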
![Page 62: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/62.jpg)
Cassandra – Storage
To avoid the performance impact of accumulating many SSTables, Cassandra also periodically merges multiple SSTables into a single new one. Since the keys within each SSTable are already sorted, a single merge sort completes the job, at an acceptable cost. Cassandra's data directory therefore contains three types of files, named like:
ColumnFamilyName-<number>-Data.db
ColumnFamilyName-<number>-Filter.db
ColumnFamilyName-<number>-Index.db
The Data.db file is the SSTable data file (SSTable is short for Sorted Strings Table: key/value strings stored sorted by key). Index.db is the index file, holding each key's offset within the data file, and Filter.db is the bitmap file produced by the Bloom filter.
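The merge-sort compaction described above can be sketched as follows (a toy over in-memory sorted runs of (key, value, timestamp) tuples; real compaction also handles tombstones and streams from disk):

```python
import heapq

def compact(*sstables):
    """Merge several sorted (key, value, timestamp) runs into one sorted run,
    keeping only the newest version of each key."""
    merged = {}
    for key, value, ts in heapq.merge(*sstables):   # inputs are already sorted
        if key not in merged or ts > merged[key][1]:
            merged[key] = (value, ts)
    return [(k, *merged[k]) for k in sorted(merged)]

old = [("k1", "a", 1), ("k2", "b", 1)]
new = [("k2", "B", 2), ("k3", "c", 2)]
assert compact(old, new) == [("k1", "a", 1), ("k2", "B", 2), ("k3", "c", 2)]
```

Because every input run is sorted by key, the merge is a single sequential pass over each file — which is why the slide calls the cost acceptable.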
![Page 63: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/63.jpg)
Cassandra
System Architecture
![Page 64: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/64.jpg)
System Architecture Content
Overview; Partitioning; Replication; Membership & Failure Detection; Bootstrapping; Scaling the Cluster; Local Persistence; Communication
![Page 65: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/65.jpg)
System Architecture
Core layer: messaging service, gossip, failure detection, cluster state, partitioner, replication
Middle layer: commit log, memtable, SSTable, indexes, compaction
Top layer: tombstones, hinted handoff, read repair, bootstrap, monitoring, admin tools
![Page 66: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/66.jpg)
System Architecture
Core Layer Middle Layer Top Layer Above the top layer
![Page 67: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/67.jpg)
System Architecture
Core Layer:
§ Messaging Service (async, non-blocking)
§ Gossip Failure detector
§ Cluster membership/state
§ Partitioner(Partitioning scheme)
§ Replication strategy
![Page 68: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/68.jpg)
System Architecture
Middle Layer
§ Commit log
§ Memory-table
§ Compactions
§ Hinted handoff
§ Read repair
§ Bootstrap
![Page 69: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/69.jpg)
System Architecture
Top Layer
§ Key, block, & column indexes
§ Read consistency
§ Touch cache
§ Cassandra API
§ Admin API
![Page 70: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/70.jpg)
System Architecture
Above the top layer:
§ Tools
§ Hadoop integration
§ Search API and Routing
![Page 71: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/71.jpg)
System Architecture
Messaging Layer
Cluster MembershipFailure Detector
Storage Layer
Partitioner Replicator
Cassandra API Tools
![Page 72: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/72.jpg)
Cassandra - Architecture
![Page 73: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/73.jpg)
System Architecture
The architecture of a storage system needs to have the following characteristics:
scalable and robust solutions for: load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management
![Page 74: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/74.jpg)
System Architecture
We will focus on the core distributed-systems techniques used in Cassandra: partitioning, replication, membership, failure handling, and scaling. All these modules work in synchrony to handle read/write requests.
![Page 75: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/75.jpg)
System Architecture - Partitioning
One of the key design features of Cassandra is the ability to scale incrementally. This requires the ability to dynamically partition the data over the set of nodes in the cluster. Cassandra partitions data across the cluster using consistent hashing, but with an order-preserving hash function.
Under consistent hashing, all nodes are placed on a ring according to their hashes. A node's position on the ring is randomly determined, and each node is responsible for replicating a range of the hash function's output space.
![Page 76: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/76.jpg)
System Architecture – Partitioning (Ring Topology)
[Figure: a conceptual ring with nodes a, d, g, j and RF=3 — one token per node, multiple ranges per node.]
![Page 77: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/77.jpg)
[Figure: the same conceptual ring (nodes a, d, g, j) with RF=2 — one token per node, multiple ranges per node.]
System Architecture – Partitioning (Ring Topology)
![Page 78: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/78.jpg)
[Figure: a new node m joins the ring (nodes a, d, g, j; RF=3): it is assigned a token, ranges are adjusted, and it bootstraps its data — the arrival only affects its immediate neighbors.]
System Architecture – Partitioning (New Node)
![Page 79: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/79.jpg)
[Figure: a node dies (ring of a, d, g, j; RF=3). Is it available? Writes intended for it are hinted and handed off to another node. Achtung! Plan for this failure mode.]
System Architecture – Partitioning (Ring Partition)
![Page 80: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/80.jpg)
System Architecture – Partitioning
In a real Cassandra deployment, a key issue to consider is the choice of tokens. The token determines the range of data each node stores: each node holds the keys in the half-open interval (previous node's token, own token]. All nodes form a closed ring, so the first node holds the data greater than the largest token and less than or equal to the smallest token.
The token type and assignment rules differ by partitioning strategy. Cassandra (version 0.6) itself supports three partitioning strategies:
RandomPartitioner
OrderPreservingPartitioner
CollatingOrderPreservingPartitioner
![Page 81: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/81.jpg)
System Architecture – Partitioning
RandomPartitioner: a hash-partitioning strategy whose tokens are big integers (BigInteger) in the range [0, 2^127]; in the extreme case, a cluster using random partitioning can reach (2^127 + 1) nodes.
Cassandra uses MD5 as the hash function, yielding a 128-bit integer (one bit is the sign bit; the token is the absolute value). A cluster using random partitioning cannot support range queries on keys. If the cluster has N nodes and each node's share of the hash space is to be equal, the token of node i can be set to:
i * (2^127 / N)
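The even token assignment above is a one-liner (illustrative; operators normally compute these values by hand or with a small script when configuring InitialToken):

```python
# Even initial-token assignment for RandomPartitioner:
# node i of N gets token i * (2**127 / N).
def initial_tokens(n: int):
    return [i * (2**127 // n) for i in range(n)]

tokens = initial_tokens(4)
assert tokens[0] == 0
assert tokens[2] == 2**127 // 2   # node 2 sits halfway around the ring
```

Spacing tokens evenly keeps each node's key range, and hence its load, roughly equal under the uniform MD5 key distribution.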
![Page 82: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/82.jpg)
System Architecture – Partitioning
OrderPreservingPartitioner: choose this ordered partitioning strategy to support range queries on keys. It uses string-typed tokens; the specific value for each node should be chosen based on the actual keys. If no InitialToken is specified, the system uses a random 16-character string (upper- and lowercase letters and digits) as the token.
CollatingOrderPreservingPartitioner: also an ordered partitioning strategy, but with a different sort order — it uses byte-typed tokens and supports collation for different locales (the code defaults to en_US).
![Page 83: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/83.jpg)
System Architecture – Partitioning
Random: the system uses MD5(key) to distribute data across nodes, giving an even distribution of one CF's keys across ranges/nodes.

Order Preserving: key distribution is determined by the lexicographical ordering of tokens; you can specify the token for each node to use. Expect a "Scrabble" distribution. Required for range queries — scanning over rows like a cursor in an index.
![Page 84: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/84.jpg)
System Architecture – Partitioning - Token
A token is a partitioner-dependent element on the ring. Each node has a single, unique token, and each node claims the range of the ring from its token to the token of the previous node on the ring.
![Page 85: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/85.jpg)
System Architecture – Partitioning
Map from key space to token:
RandomPartitioner: tokens are integers in the range [0 .. 2^127]; MD5(Key) → Token. Good: even key distribution. Bad: inefficient range queries.
OrderPreservingPartitioner: tokens are UTF-8 strings in the range ["" .. ); Key → Token. Good: efficient range queries. Bad: uneven key distribution.
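The RandomPartitioner mapping can be sketched directly (a simplification: Cassandra actually takes the absolute value of the signed 128-bit MD5 result):

```python
import hashlib

def md5_token(key: str) -> int:
    """Map a row key to a ring token via MD5, as RandomPartitioner does."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

t = md5_token("user:42")
assert 0 <= t < 2**128
# Adjacent keys scatter to unrelated tokens, which evens out load
# but makes range scans over keys inefficient.
```

This is exactly the trade-off in the slide: hashing destroys key order, so even distribution and efficient range queries cannot both be had from the same partitioner.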
![Page 86: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/86.jpg)
System Architecture – Snitching
Map from nodes to physical location:
EndpointSnitch: guesses rack and data center from the IP address octets.
DataCenterEndpointSnitch: specify IP subnets for racks, grouped per data center.
PropertySnitch: specify arbitrary mappings from individual IP addresses to racks and data centers.
![Page 87: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/87.jpg)
System Architecture - Replication
Cassandra uses replication to achieve high availability and durability.
Each data item is replicated at N hosts, where N is the replication factor configured per instance.
Each key, k, is assigned to a coordinator node. The coordinator is in charge of replicating the data items that fall within its range: in addition to storing each key in its range locally, the coordinator replicates these keys at N−1 other nodes on the ring.
![Page 88: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/88.jpg)
System Architecture – Placement
Map from token space to nodes: the first replica is always placed on the node that claims the range in which the token falls; strategies determine where the rest of the replicas are placed. Cassandra provides the client with various options for how data is replicated, offering policies such as:
Rack Unaware, Rack Aware (within a datacenter), Datacenter Aware
![Page 89: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/89.jpg)
System Architecture - Replication
Rack Unaware
Place replicas on the N-1 subsequent nodes around the ring, ignoring topology.
If an application chooses the “Rack Unaware” replication strategy, the non-coordinator replicas are chosen by picking the N-1 successors of the coordinator on the ring.
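The successor-picking rule above can be sketched as follows (a minimal illustration; the function name and ring layout are hypothetical, and the sketch assumes each node owns the range up to and including its own token):

```python
from bisect import bisect_left

def rack_unaware_replicas(ring_tokens, owners, key_token, n):
    # A node owns the range (previous token .. its token], so the
    # coordinator is the first node whose token >= the key's token.
    i = bisect_left(ring_tokens, key_token) % len(ring_tokens)
    # Rack Unaware: the N-1 non-coordinator replicas are simply the
    # coordinator's successors around the ring, topology ignored.
    return [owners[ring_tokens[(i + k) % len(ring_tokens)]] for k in range(n)]

tokens = [0, 25, 50, 75]
owners = {0: "A", 25: "B", 50: "C", 75: "D"}
```

For example, a key with token 30 falls in node C's range, so with N=3 the replica set is C plus its two successors D and A.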
![Page 90: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/90.jpg)
System Architecture - Replication
Rack Aware (within a datacenter)
Place the second replica in another datacenter, and the remaining N-2 replicas on nodes in other racks in the same datacenter.
![Page 91: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/91.jpg)
System Architecture - Replication
Datacenter Aware
Place M of the N replicas in another datacenter, and the remaining N-M-1 replicas on nodes in other racks in the same datacenter.
![Page 92: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/92.jpg)
System Architecture – Partitioning
![Page 93: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/93.jpg)
System Architecture - Replication
1) Every node is aware of every other node in the system, and hence the range each is responsible for. This is learned through gossiping (not from the leader).
2) A key is assigned to a node; that node is the key’s coordinator, which is responsible for replicating the item associated with the key on N-1 replicas in addition to itself.
3) Cassandra offers several replication policies and leaves it up to the application to choose one. These policies differ in the location of the selected replicas. Rack Aware, Rack Unaware, and Datacenter Aware are some of these policies.
4) Whenever a new node joins the system, it contacts the leader of the Cassandra cluster, which tells the node the ranges for which it is responsible for replicating the associated keys.
5) Cassandra uses Zookeeper for maintaining the Leader.
6) The nodes that are responsible for the same range are called “Preference List” for that range. This terminology is borrowed from Dynamo.
![Page 94: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/94.jpg)
System Architecture – Replication
![Page 95: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/95.jpg)
System Architecture - Replication
Replication factor: how many nodes data is replicated on
Consistency level: Zero, One, Quorum, All
Sync or async for writes
Reliability of reads
Read repair
![Page 96: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/96.jpg)
System Architecture – Replication(Leader)
Cassandra system elects a leader amongst its nodes using a system called Zookeeper.
All nodes, on joining the cluster, contact the leader, which tells them the ranges for which they are replicas; the leader makes a concerted effort to maintain the invariant that no node is responsible for more than N-1 ranges in the ring.
The metadata about the ranges a node is responsible for is cached locally at each node and in a fault-tolerant manner inside Zookeeper; this way a node that crashes and comes back up knows what ranges it was responsible for. We borrow from Dynamo parlance and deem the nodes that are responsible for a given range the “preference list” for the range.
![Page 97: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/97.jpg)
System Architecture - Membership
Cluster membership in Cassandra is based on Scuttlebutt, a very efficient anti-entropy gossip-based mechanism.
![Page 98: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/98.jpg)
System Architecture - Failure handling
Failure detection is a mechanism by which a node can locally determine if any other node in the system is up or down. In Cassandra failure detection is also used to avoid attempts to communicate with unreachable nodes during various operations.
Cassandra uses a modified version of the Accrual Failure Detector.
![Page 99: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/99.jpg)
System Architecture - Bootstrapping
When a node starts for the first time, it chooses a random token for its position in the ring:
In Cassandra, joins and leaves of nodes are initiated by an explicit mechanism rather than an automatic one. A node may be ordered to leave the network because of some malfunction observed in it, with the expectation that it will be back soon; if a node leaves the network forever, data re-partitioning is required. When a new node joins, data re-partitioning is also required: it frequently happens that the reason for adding a new node is that some current nodes can no longer handle their load, so the new node is assigned part of the range for which some heavily loaded node is currently responsible. In this case data must be transferred between the two replicas, the old one and the new one. This is usually done after the administrator issues a join, and it does not shut the system down for the fraction of the range being transferred, since other replicas hold the same data. Once the data is transferred to the new node, the older node no longer holds it.
![Page 100: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/100.jpg)
System Architecture - Scaling
When a new node is added into the system, it gets assigned a token such that it can alleviate a heavily loaded node
![Page 101: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/101.jpg)
System Architecture - Scaling
![Page 102: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/102.jpg)
System Architecture - Local Persistence
The Cassandra system relies on the local file system for data persistence.
![Page 103: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/103.jpg)
System Architecture - Communication
Control messages use UDP; application-related messages such as read/write requests and replication requests are based on TCP.
![Page 104: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/104.jpg)
Cassandra
Read & Write
![Page 105: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/105.jpg)
Cassandra – Read/Write
Tunable consistency, per read/write:
• One: return once one replica responds success
• Quorum: return once RF/2 + 1 replicas respond
• All: return when all replicas respond
Want async replication? Write = ONE, Read = ONE (performance++)
Want strong consistency? Read = QUORUM, Write = QUORUM
Want strong consistency per datacenter? Read = LOCAL_QUORUM, Write = LOCAL_QUORUM
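The arithmetic behind these choices can be sketched in a few lines (function names are illustrative, not Cassandra API):

```python
def quorum(rf: int) -> int:
    # QUORUM consistency level: RF/2 + 1 replicas (integer division)
    return rf // 2 + 1

def sees_latest_write(read_replicas: int, write_replicas: int, rf: int) -> bool:
    # A read overlaps the latest write whenever R + W > RF, which is
    # why QUORUM reads plus QUORUM writes give strong consistency.
    return read_replicas + write_replicas > rf
```

With RF = 3, quorum is 2, so QUORUM/QUORUM gives 2 + 2 > 3 (strong), while ONE/ONE gives 1 + 1 > 3 false (fast but eventually consistent).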
![Page 106: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/106.jpg)
Cassandra – Read/Write
When a read or write request arrives at any node in the cluster, the state machine moves through the following states:
① The nodes that replicate the data for the key are identified.
② The request is forwarded to all those nodes, and the coordinator waits for the responses to arrive.
③ If the replies do not arrive within a configured timeout value, the request fails and an error is returned to the client.
④ If replies are received, the latest response is determined based on timestamps.
⑤ Replicas with old data are updated (a repair of the data is scheduled at any replica that does not have the latest piece of data).
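Steps ④ and ⑤ amount to picking the newest timestamped response and flagging stale replicas; a minimal sketch (hypothetical function and data shapes):

```python
def resolve_read(responses):
    # responses: replica name -> (timestamp, value)
    if not responses:
        raise TimeoutError("no replica replied within the configured timeout")
    latest_ts, latest_value = max(responses.values())
    # Replicas holding an older timestamp are flagged for read repair.
    stale = [node for node, (ts, _) in responses.items() if ts < latest_ts]
    return latest_value, stale

value, stale = resolve_read({"A": (10, "old"), "B": (12, "new"), "C": (12, "new")})
```

Here replica A returned an older version, so the client gets the newest value and A is scheduled for repair.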
![Page 107: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/107.jpg)
Cassandra - Read Repair
On every read, all replicas are read, but only one replica's data is returned; a checksum or timestamp comparison is applied across all replicas.
If an inconsistency is found, all the data is fetched and merged, and the latest data is written back to the out-of-sync nodes.
![Page 108: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/108.jpg)
Cassandra - Reads
Practically lock free
SSTable proliferation
New in 0.6:
Row cache (avoids SSTable lookup; not write-through)
Key cache (avoids index scan)
![Page 109: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/109.jpg)
Cassandra - Read
Any node Read repair Usual caching conventions apply
![Page 110: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/110.jpg)
Read (diagram): the client sends a query to a node in the Cassandra cluster, which forwards a full read to the closest replica (Replica A) and digest queries to the other replicas (Replica B, Replica C); the result is returned to the client, and a read repair is triggered if the digests differ.
![Page 111: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/111.jpg)
Cassandra - Write
No reads, no seeks
Sequential disk access
Atomic within a column family
Fast
Any node
Always writeable
![Page 112: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/112.jpg)
Cassandra – Write(Properties)
No locks in the critical path
Sequential disk access
Behaves like a write-back cache
Append support without read-ahead
Atomicity guarantee for a key
“Always writable” (accepts writes during failure scenarios)
![Page 113: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/113.jpg)
Cassandra - Writes
Commit log for durability: configurable fsync; sequential writes only
Memtable: no disk access (no reads or seeks)
SSTables are final (become read-only): indexes, Bloom filter, raw data
Bottom line: FAST
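The commit-log-then-memtable sequence above can be sketched as a toy model (class and method names are hypothetical; real commit logs and SSTables are on-disk structures):

```python
class WritePathSketch:
    """Toy write path: append to a commit log, then update a memtable."""

    def __init__(self):
        self.commit_log = []   # sequential, append-only log for durability
        self.memtable = {}     # in-memory buffer: no disk reads or seeks

    def write(self, key, column, value, timestamp):
        # Durability first: the mutation hits the commit log before
        # it becomes visible in the memtable.
        self.commit_log.append((key, column, value, timestamp))
        self.memtable.setdefault(key, {})[column] = (value, timestamp)

    def flush(self):
        # The memtable becomes an immutable SSTable; once flushed,
        # the corresponding commit-log entries can be discarded.
        sstable = dict(self.memtable)
        self.memtable = {}
        self.commit_log.clear()
        return sstable
```

Because both steps are appends or in-memory updates, a write never reads or seeks, which is the source of the "bottom line: fast" claim.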
![Page 114: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/114.jpg)
Cassandra - Write
![Page 115: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/115.jpg)
Cassandra - Write
The system can be configured to perform either synchronous or asynchronous writes.
For certain systems that require high throughput we rely on asynchronous replication.
Here the writes far exceed the reads that come into the system.
In the synchronous case, we wait for a quorum of responses before we return a result to the client.
![Page 116: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/116.jpg)
Cassandra - Write
![Page 117: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/117.jpg)
Cassandra – Write(Fast)
Fast writes: SEDA (staged event-driven architecture)
A general-purpose framework for high concurrency and load conditioning
Decomposes applications into stages separated by queues
Adopts a structured approach to event-driven concurrency
![Page 118: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/118.jpg)
Cassandra – Write cont’d
![Page 119: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/119.jpg)
Cassandra – Write(Compactions)
![Page 120: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/120.jpg)
Cassandra – Gossip
A Cassandra cluster is made up of peer nodes: there is no “master” node and no single point of failure, so every node must actively confirm the state of the other nodes in the cluster. They do this with a mechanism called gossip. Once per second, each node “gossips” the state of every node in the cluster to 1-3 other nodes. The gossip data is versioned, so any change to a node quickly propagates through the entire cluster. In this way every node knows the current state of every other node: whether it is bootstrapping, running normally, and so on.
![Page 121: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/121.jpg)
Cassandra – Hinted Handoff
Cassandra stores copies of the data on N nodes. A client can choose a consistency level appropriate to the importance of the data; for example, QUORUM means a write succeeds only once a majority of those N nodes have returned success. What happens if one of those nodes is down? How does the write reach that node later?
Cassandra solves this with a technique called hinted handoff: the data is written and stored on another, random node X, along with a hint that it needs to be stored on node Y and replayed there once Y comes back online (remember, when node Y comes back, the gossip mechanism quickly notifies X). Hinted handoff ensures that node Y quickly catches up with the rest of the cluster. Note that if hinted handoff does not work for some reason, read repair will still eventually “repair” the stale data, though only when a client reads it. A hinted write is not readable (node X is not one of the N official replicas for the data), so it does not count toward write consistency: if Cassandra is configured with 3 replicas and two of them are unavailable, a QUORUM write is impossible.
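A toy model of the hinted-handoff decision (function name and data shapes are hypothetical, and real hints carry the target node and mutation, replayed on gossip notification):

```python
def write_with_hints(replicas, alive, store, hints, value):
    # Write to every live replica; for each down replica, stash a
    # hint on behalf of that node to be replayed when it returns.
    acked = 0
    for node in replicas:
        if node in alive:
            store.setdefault(node, []).append(value)
            acked += 1
        else:
            hints.setdefault(node, []).append(value)
    return acked  # hinted writes do NOT count toward consistency

store, hints = {}, {}
acked = write_with_hints(["A", "B", "C"], {"A", "C"}, store, hints, "v1")
```

With replica B down, only two acknowledgements count, so a QUORUM (2-of-3) write still succeeds here, but if two replicas were down it would fail even though a hint exists.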
![Page 122: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/122.jpg)
Cassandra – Anti-entropy
One of Cassandra's well-known secret weapons is anti-entropy. Anti-entropy explicitly guarantees that the nodes in the cluster agree on the current data. If, for whatever reason, neither read repair nor hinted handoff has taken effect, anti-entropy ensures that the nodes reach eventual consistency. The anti-entropy service runs during “major compaction” (the equivalent of rebuilding a table in a relational database), so it is a relatively heavyweight but infrequently run process. Anti-entropy uses Merkle trees (also known as hash trees) to determine where within a column family's data tree the nodes disagree, and then repairs each diverging branch.
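The Merkle-tree comparison can be sketched as follows (a simplified illustration, not Cassandra's implementation; it assumes a power-of-two number of leaves):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.md5(data).digest()

def merkle_root(leaves):
    # Hash each data range, then pairwise-hash levels up to one root.
    level = [_h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Two replicas agree on a column family iff their roots match; on a
# mismatch, subtree hashes are compared to narrow down exactly which
# data ranges diverge, so only those ranges need to be repaired.
root_a = merkle_root([b"r1", b"r2", b"r3", b"r4"])
root_b = merkle_root([b"r1", b"r2", b"r3", b"r4-stale"])
```

A single stale range changes the root, which is what lets replicas detect divergence by exchanging one hash instead of all their data.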
![Page 123: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/123.jpg)
Cassandra
Other
![Page 124: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/124.jpg)
Other - Gossip
![Page 125: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/125.jpg)
Other
![Page 126: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/126.jpg)
Other
![Page 127: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/127.jpg)
Other - DHT
DHTs(Distributed hash tables) : A DHT is a class of a decentralized distributed system that provides a lookup service similar to a hash table; (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key;
DHTs form an infrastructure that can be used to build more complex services, such as anycast, cooperative Web caching, distributed file systems, domain name services, instant messaging, multicast, and also peer-to-peer file sharing and content distribution systems.
http://en.wikipedia.org/wiki/Distributed_hash_table
![Page 128: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/128.jpg)
Other - DHT
![Page 129: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/129.jpg)
Other - Cassandra - Domain Models
![Page 130: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/130.jpg)
Other -
![Page 131: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/131.jpg)
Other -
![Page 132: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/132.jpg)
Other - Bloom filter
An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m=18 and k=3. http://en.wikipedia.org/wiki/Bloom_filter
![Page 133: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/133.jpg)
Other - Bloom filter
![Page 134: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/134.jpg)
Other - Bloom filter
A Bloom filter can be used to speed up answers in a key-value storage system. Values are stored on a disk which has slow access times; Bloom filter decisions are much faster. However, some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without it. Use of a Bloom filter for this purpose does, however, increase memory usage.
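A minimal Bloom filter sketch, using the same parameters as the {x, y, z} figure (m=18 bits, k=3 hash functions); the salted-MD5 hashing scheme is an illustrative choice, not what Cassandra uses:

```python
import hashlib

class BloomFilter:
    def __init__(self, m: int = 18, k: int = 3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: bytes):
        # Derive k bit positions by salting the hash with its index.
        for i in range(self.k):
            digest = hashlib.md5(bytes([i]) + item).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: bytes) -> bool:
        # False means definitely absent: the SSTable need not be read.
        # True may be a false positive, costing one wasted disk access.
        return all(self.bits >> p & 1 for p in self._positions(item))
```

On a read, each SSTable's filter is consulted first, so most SSTables that cannot contain the key are skipped without touching disk.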
![Page 135: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/135.jpg)
Other - Timestamps and Vector Clocks
Eventual consistency relies on deciding what value a row will eventually converge to;
In the case of two writers writing at “the same" time, this is difficult;
Timestamps are one solution, but rely on synchronized clocks and don't capture causality;
Vector clocks are an alternative method of capturing order in a distributed system.
![Page 136: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/136.jpg)
Other - Vector Clocks
Definition: a vector clock is a tuple {T1, T2, …, TN} of clock values, one from each node.
V1 < V2 if:
• For all i, V1[i] <= V2[i]
• For at least one i, V1[i] < V2[i]
V1 < V2 implies a global time ordering of events.
When data is written from node i, it sets Ti to its clock value.
This allows eventual consistency to resolve conflicts between writes on multiple replicas.
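The comparison rule above translates directly into code (a sketch; the function name and the `'||'` notation for concurrent clocks are illustrative conventions):

```python
def vc_compare(v1, v2):
    # Returns '<' if v1 happened before v2, '>' for the reverse,
    # '=' if the clocks are equal, and '||' if they are concurrent.
    le = all(a <= b for a, b in zip(v1, v2))
    ge = all(a >= b for a, b in zip(v1, v2))
    if le and ge:
        return "="
    if le:
        return "<"
    if ge:
        return ">"
    return "||"  # concurrent: needs application-level reconciliation
```

The `'||'` case is what timestamps cannot express: neither write causally preceded the other, so the system must reconcile rather than silently pick one.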
![Page 137: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/137.jpg)
Other - CommitLog
Like relational database systems, Cassandra writes its log before writing data; this log is called the commitlog. Unlike the Memtable/SSTable, the commitlog is server-level, not per column family. Each commitlog file has a fixed size and is called a commitlog segment; in the current version (0.5.1) this size is 128MB, hard-coded in the source (src\java\org\apache\cassandra\db\Commitlog.java). When a commitlog file fills up, a new file is created; when an old commitlog file is no longer needed, it is removed automatically.
![Page 138: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/138.jpg)
Other - CommitLog
Each commitlog file (segment) has a CommitlogHeader structure of fixed size (the size depends on the number of column families) containing two important arrays, with one element per column family in each. The first is a bitmap (BitSet dirty): a column family's bit is set to 1 if its Memtable contains dirty data and 0 otherwise, which tells recovery which column families need to be recovered from the commitlog. The second is an integer array (int[] lastFlushedAt) holding, for each column family, the log offset at its last flush; recovery can read commitlog records starting from that position. With these two structures, Cassandra can rebuild the in-memory Memtable contents from the persisted SSTables and commitlog after an abnormal restart, which is similar to instance recovery in relational databases such as Oracle.
![Page 139: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/139.jpg)
Other - CommitLog
When a Memtable is flushed to an SSTable on disk, the corresponding bits in the dirty arrays of all commitlog files are cleared; when a commitlog reaches its size limit and a new file is created, the dirty array is inherited from the previous file. If a commitlog file's dirty array is entirely zero, that commitlog is no longer needed for recovery and can be removed. Consequently, at recovery time, every commitlog file still present on disk is needed.
http://wiki.apache.org/cassandra/ArchitectureCommitLog
http://www.ningoo.net/html/2010/cassandra_commitlog.html
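The dirty-bit bookkeeping described above can be sketched in a few lines (function names are hypothetical; the real structure is a per-segment BitSet in the CommitlogHeader):

```python
def flush_cf(dirty_bits, cf_index):
    # Flushing a column family's Memtable clears its dirty bit in the
    # header of each segment that recorded writes for it.
    dirty_bits[cf_index] = 0

def segment_deletable(dirty_bits):
    # A segment whose dirty array is all zeros holds no data that is
    # still only in a Memtable, so it is not needed for recovery.
    return not any(dirty_bits)

dirty = [1, 0, 1]  # column families 0 and 2 have unflushed data here
```

Only once every column family with writes in the segment has been flushed does the segment become deletable, which is why every segment still on disk at recovery time is needed.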
![Page 140: Storage cassandra](https://reader035.fdocuments.in/reader035/viewer/2022081505/554f443bb4c905cd048b5667/html5/thumbnails/140.jpg)
Cassandra
The End