Cassandra data structures and algorithms
![Page 1: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/1.jpg)
@doanduyhai
Cassandra data structures & algorithms DuyHai DOAN, Technical Advocate
![Page 2: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/2.jpg)
@doanduyhai
Shameless self-promotion!
2
Duy Hai DOAN, Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• Cassandra technical point of contact
• Cassandra troubleshooting
![Page 3: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/3.jpg)
@doanduyhai
Agenda!
3
Data structures
• CRDT
• Bloom filter
• Merkle tree
Algorithms
• HyperLogLog
![Page 4: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/4.jpg)
@doanduyhai
Why Cassandra ?!
4
Linear scalability ≈ unbounded extensibility
• clusters of 1k+ nodes
![Page 5: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/5.jpg)
@doanduyhai
Why Cassandra ?!
5
Continuous availability (≈100% up-time)
• resilient architecture (Dynamo)
• rolling upgrades
• data backward compatible between versions n and n+1
![Page 6: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/6.jpg)
@doanduyhai
Why Cassandra ?!
6
Multi-data centers
• out-of-the-box (config only)
• AWS conf for multi-region DCs
Operational simplicity
• 1 node = 1 process + 1 config file
![Page 7: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/7.jpg)
@doanduyhai
Cassandra architecture!
7
Data-store layer
• Google Bigtable paper
• Columns / Column Families
Cluster layer
• Amazon Dynamo paper
• masterless architecture
![Page 8: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/8.jpg)
@doanduyhai
Cassandra architecture!
8
[Diagram: a client request reaches Node1 and Node2; each node is drawn as four layers: API (CQL & RPC), CLUSTER (DYNAMO), DATA STORE (BIG TABLE), DISKS]
![Page 9: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/9.jpg)
@doanduyhai
Data access!
9
By CQL query via native protocol
• INSERT, UPDATE, DELETE, SELECT
• CREATE/ALTER/DROP TABLE
Always by partition key (#partition)
• partition == physical row
![Page 10: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/10.jpg)
@doanduyhai
Data distribution!
10
Random: hash of #partition → token = hash(#p)
Hash range: ]0, $2^{127}-1$]
Each node: 1/8 of ]0, $2^{127}-1$]
[Diagram: token ring with 8 nodes, n1 to n8]
![Page 11: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/11.jpg)
@doanduyhai
Data replication!
11
Replication Factor = 3
[Diagram: token ring with 8 nodes, n1 to n8; the three replicas of a partition are marked 1, 2, 3]
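The two slides above (token = hash of the partition key, then RF = 3 replicas on the ring) can be illustrated with a small sketch. It is not Cassandra's partitioner: the MD5 reduction, the equal 1/8 slices, and the "next nodes clockwise" placement are simplifying assumptions, and `token`, `build_ring` and `replicas` are hypothetical names.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token range from the slide (assumed to be MD5-based)


def token(partition_key: str) -> int:
    """Hash a partition key onto the ring (MD5 reduced to 127 bits)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def build_ring(node_names):
    """Give each node an equal slice; node i owns tokens up to its upper bound."""
    slice_size = RING_SIZE // len(node_names)
    return [((i + 1) * slice_size, name) for i, name in enumerate(node_names)]


def replicas(ring, partition_key, replication_factor=3):
    """Owner of the token, followed by the next nodes clockwise (assumption)."""
    upper_bounds = [upper for upper, _ in ring]
    first = bisect.bisect_left(upper_bounds, token(partition_key)) % len(ring)
    return [ring[(first + i) % len(ring)][1] for i in range(replication_factor)]


ring = build_ring([f"n{i}" for i in range(1, 9)])
print(replicas(ring, "ddoan"))  # three consecutive nodes on the ring
```

Because placement depends only on the hash, any node can compute the replica list for a key, which is what lets every node act as coordinator.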
![Page 12: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/12.jpg)
@doanduyhai
Coordinator node!
Incoming requests (read/write)
Coordinator node handles the request
Every node can be coordinator → masterless
[Diagram: token ring with 8 nodes, n1 to n8; the request arrives at the coordinator node, which forwards it to replicas 1, 2, 3]
![Page 13: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/13.jpg)
CRDT!
by Marc Shapiro, 2011
![Page 14: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/14.jpg)
@doanduyhai
INSERT!
14
Table « users »
ddoan | age: 33 | name: DuyHai DOAN
INSERT INTO users(login, name, age) VALUES('ddoan', 'DuyHai DOAN', 33);
![Page 15: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/15.jpg)
@doanduyhai
INSERT!
15
Table « users »
ddoan | age: 33 | name: DuyHai DOAN
(ddoan = #partition; age, name = column names)
INSERT INTO users(login, name, age) VALUES('ddoan', 'DuyHai DOAN', 33);
![Page 16: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/16.jpg)
@doanduyhai
INSERT!
16
Table « users »
ddoan | age (t1): 33 | name (t1): DuyHai DOAN
INSERT INTO users(login, name, age) VALUES('ddoan', 'DuyHai DOAN', 33);
t1 = auto-generated timestamp (μs)
![Page 17: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/17.jpg)
@doanduyhai
UPDATE!
17
Table « users »
UPDATE users SET age = 34 WHERE login = 'ddoan';
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN
File2: ddoan | age (t2): 34
![Page 18: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/18.jpg)
@doanduyhai
DELETE!
18
Table « users »
DELETE age FROM users WHERE login = 'ddoan';
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN
File2: ddoan | age (t2): 34
File3: ddoan | age (t3): ✕ (tombstone)
![Page 19: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/19.jpg)
@doanduyhai
SELECT!
19
Table « users »
SELECT age FROM users WHERE login = 'ddoan';
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN → ?
File2: ddoan | age (t2): 34 → ?
File3: ddoan | age (t3): ✕ (tombstone) → ?
![Page 20: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/20.jpg)
@doanduyhai
SELECT!
20
Table « users »
SELECT age FROM users WHERE login = 'ddoan';
File1: ddoan | age (t1): 33 | name (t1): DuyHai DOAN → ✕
File2: ddoan | age (t2): 34 → ✕
File3: ddoan | age (t3): ✕ (tombstone) → ✓
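The INSERT / UPDATE / DELETE / SELECT walkthrough above boils down to "the version with the highest timestamp wins, and a tombstone counts as a version". A minimal sketch of that read, assuming each file is modelled as a plain dict (the `Cell`, `TOMBSTONE` and `read_column` names are hypothetical, not Cassandra APIs):

```python
from typing import NamedTuple

TOMBSTONE = object()  # hypothetical marker for a deleted column


class Cell(NamedTuple):
    timestamp: int  # write timestamp (t1, t2, t3 on the slides)
    value: object   # column value, or TOMBSTONE


# One dict per file, keyed by (partition key, column name).
file1 = {("ddoan", "age"): Cell(1, 33), ("ddoan", "name"): Cell(1, "DuyHai DOAN")}
file2 = {("ddoan", "age"): Cell(2, 34)}
file3 = {("ddoan", "age"): Cell(3, TOMBSTONE)}


def read_column(files, partition_key, column):
    """Return the newest value of a column, or None if missing or deleted."""
    key = (partition_key, column)
    versions = [f[key] for f in files if key in f]
    if not versions:
        return None
    newest = max(versions, key=lambda cell: cell.timestamp)
    return None if newest.value is TOMBSTONE else newest.value


print(read_column([file1, file2, file3], "ddoan", "age"))  # None: tombstone at t3 wins
print(read_column([file1, file2], "ddoan", "age"))         # 34: t2 beats t1
```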
![Page 21: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/21.jpg)
@doanduyhai
Cassandra columns!
21
look very similar to …
CRDT
![Page 22: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/22.jpg)
@doanduyhai
CRDT Recap!
22
CRDT = Conflict-free Replicated Data Type (the state-based variant used in this talk is the Convergent Replicated Data Type, CvRDT)
Useful in distributed systems
Formal proof of strong « eventual convergence » of replicated data
![Page 23: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/23.jpg)
@doanduyhai
CRDT Recap!
23
A join semilattice (or just semilattice hereafter) is a partial order $\le_v$ equipped with a least upper bound (LUB) $\sqcup_v$, defined as follows:
Definition 2.4. $m = x \sqcup_v y$ is a Least Upper Bound of $\{x, y\}$ under $\le_v$ iff
• $x \le_v m$ and
• $y \le_v m$ and
• there is no $m' \le_v m$ such that $x \le_v m'$ and $y \le_v m'$
It follows from the definition that $\sqcup_v$ is:
• commutative: $x \sqcup_v y =_v y \sqcup_v x$
• idempotent: $x \sqcup_v x =_v x$
• associative: $(x \sqcup_v y) \sqcup_v z =_v x \sqcup_v (y \sqcup_v z)$
![Page 24: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/24.jpg)
@doanduyhai
CRDT Recap!
24
Definition 2.5 (Join Semilattice). An ordered set $(S, \le_v)$ is a Join Semilattice iff $\forall x, y \in S$, $x \sqcup_v y$ exists.
Let's define $S^{t}_{k,n}$ = set of Cassandra columns
• identified by partition key k
• identified by column name n
• assigned a timestamp t
The ordered set $(S^{t}_{k,n}, \max_t)$ is a Join Semilattice
[Diagram: two versions of the same column in one #partition, column name (t1) and column name (t2)]
![Page 25: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/25.jpg)
@doanduyhai
Cassandra column as CRDT!
25
Proof:
• $S^{1}_{k,n} \le \max_t(S^{1}_{k,n}, S^{2}_{k,n})$
• $S^{2}_{k,n} \le \max_t(S^{1}_{k,n}, S^{2}_{k,n})$
• there is no $S^{x}_{k,n} \le \max_t(S^{1}_{k,n}, S^{2}_{k,n})$ such that $S^{1}_{k,n} \le S^{x}_{k,n}$ and $S^{2}_{k,n} \le S^{x}_{k,n}$
![Page 26: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/26.jpg)
@doanduyhai
Cassandra column as CRDT!
26
Idempotent !
[Diagram: node1, node2 and node3 each return ddoan | age (t2): 33; merged at the coordinator, the result is still ddoan | age (t2): 33]
![Page 27: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/27.jpg)
@doanduyhai
Cassandra column as CRDT!
27
Commutative !
[Diagram: node1 returns ddoan | age (t1): 33 and node2 returns ddoan | age (t2): 34; the coordinator obtains ddoan | age (t2): 34 whether it merges node1 then node2, or node2 then node1]
![Page 28: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/28.jpg)
@doanduyhai
Cassandra column as CRDT!
28
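Associative !
[Diagram: node1 returns ddoan | age (t1): 33, node2 returns ddoan | age (t2): 34, node3 returns ddoan | age (t3): 35; merging node1 and node2 first gives ddoan | age (t2): 34, which the coordinator then merges with node3]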
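The three properties illustrated above can be checked mechanically for a max-by-timestamp merge. The sketch below models a column version as a `(timestamp, value)` tuple; this representation, and the tie-break on value that Python's tuple ordering implies, are assumptions made for the example.

```python
from itertools import product


def merge(a, b):
    """Least upper bound of two (timestamp, value) versions: keep the newest."""
    return max(a, b)  # tuples compare on timestamp first


versions = [(1, 33), (2, 34), (3, 35)]

for x, y, z in product(versions, repeat=3):
    assert merge(x, x) == x                                 # idempotent
    assert merge(x, y) == merge(y, x)                       # commutative
    assert merge(merge(x, y), z) == merge(x, merge(y, z))   # associative

print(merge((1, 33), (2, 34)))  # (2, 34)
```

Since the order and grouping of merges do not matter, replicas can exchange versions in any sequence and still end up with the same column.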
![Page 29: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/29.jpg)
@doanduyhai
Cassandra column as CRDT!
29
File1: ddoan | address (t1): 12 rue de.. | age (t1): 33
File2: ddoan | age (t2): 34
File3: ddoan | age (t3): 35
File4: ddoan | address (t7): 17 avenue..
![Page 30: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/30.jpg)
@doanduyhai
Cassandra column as CRDT!
30
$S^{t}_{ddoan,age}$ = { age (t1): 33, age (t2): 34, age (t3): 35 }, ordered along t
$S^{t}_{ddoan,address}$ = { address (t1): 12 rue de.., address (t7): 17 avenue.. }, ordered along t
![Page 31: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/31.jpg)
@doanduyhai
Eventual convergence!
31
Proposition 2.1. Any two object replicas of a CvRDT eventually converge, assuming the system transmits payload infinitely often between pairs of replicas over eventually-reliable point-to-point channels.
![Page 32: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/32.jpg)
@doanduyhai
Eventual convergence!
32
Proposition 2.1. Any two object replicas of a CvRDT eventually converge, assuming the system transmits payload infinitely often between pairs of replicas over eventually-reliable point-to-point channels.
eventually-reliable point-to-point channels → there is a network cable connecting 2 nodes …
![Page 33: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/33.jpg)
@doanduyhai
Eventual convergence!
33
The system transmits payload infinitely often between pairs of replicas
![Page 34: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/34.jpg)
@doanduyhai
Eventual convergence!
34
Repair
The system transmits payload infinitely often between pairs of replicas
![Page 35: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/35.jpg)
@doanduyhai
Eventual convergence!
35
Strong hypothesis in the case of Cassandra CRDT !
![Page 36: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/36.jpg)
@doanduyhai
Eventual convergence!
36
$\max_{timestamp}$ as merge function !
Strong hypothesis in the case of Cassandra CRDT !
![Page 37: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/37.jpg)
@doanduyhai
Eventual convergence!
37
Time is reliable … isn’t it ? !
![Page 38: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/38.jpg)
@doanduyhai
Eventual convergence!
38
NTP server-side mandatory
![Page 39: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/39.jpg)
Q & A
![Page 40: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/40.jpg)
Bloom filters!
by Burton Howard Bloom, 1970
![Page 41: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/41.jpg)
@doanduyhai
Cassandra Write Path!
41
[Diagram: step 1, the write is appended to a commit log (Commit log1, Commit log2, … Commit logn); a Memory area is shown alongside]
![Page 42: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/42.jpg)
@doanduyhai
Cassandra Write Path!
42
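[Diagram: step 1, the write is appended to a commit log (Commit log1, Commit log2, … Commit logn); step 2, it is written in Memory to the MemTable of its table (MemTable Table1, MemTable Table2, … MemTable TableN)]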
![Page 43: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/43.jpg)
@doanduyhai
Cassandra Write Path!
43
[Diagram: commit logs (Commit log1, Commit log2, … Commit logn) and Memory; step 3, the MemTables are flushed to SSTables: SSTable1 for Table1, SSTable2 for Table2, SSTable3 for Table3]
![Page 44: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/44.jpg)
@doanduyhai
Cassandra Write Path!
44
[Diagram: commit logs (Commit log1, Commit log2, … Commit logn); SSTable1 for Table1, SSTable2 for Table2, SSTable3 for Table3; in Memory, new MemTables (MemTable Table1, MemTable Table2, … MemTable TableN)]
![Page 45: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/45.jpg)
@doanduyhai
Cassandra Write Path!
45
[Diagram: commit logs (Commit log1, Commit log2, … Commit logn) and Memory; after further flushes Table1, Table2 and Table3 hold several SSTables (SSTable1, SSTable2, SSTable3, …)]
![Page 46: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/46.jpg)
@doanduyhai
Cassandra Read Path!
46
Either in memory
![Page 47: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/47.jpg)
@doanduyhai
Cassandra Read Path!
47
Either in memory
or hit disk (many SSTables)
![Page 48: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/48.jpg)
@doanduyhai
Cassandra Read Path!
48
How to optimize disk seeks ?
![Page 49: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/49.jpg)
@doanduyhai
Cassandra Read Path!
49
Only read necessary SSTables !
![Page 50: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/50.jpg)
@doanduyhai
Cassandra Read Path!
50
Bloom filters !
![Page 51: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/51.jpg)
@doanduyhai
Bloom filters recap!
51
Space-efficient probabilistic data structure
Used for membership tests
No false negatives, possible false positives
![Page 52: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/52.jpg)
@doanduyhai
Bloom filters in Cassandra!
52
For each SSTable, create a Bloom filter
Upon data insertion, populate it
Upon data retrieval, ask the Bloom filter whether the SSTable can be skipped
![Page 53: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/53.jpg)
@doanduyhai
Bloom filters in action!
53
Bit array: 1 0 0 1 0 0 1 0 0 0
Write #partition = foo: h1, h2, h3 each set one bit to 1
![Page 54: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/54.jpg)
@doanduyhai
Bloom filters in action!
54
Bit array: 1 0 0 1* 0 0 1 0 1 1
Write #partition = foo: h1, h2, h3
Write #partition = bar: h1, h2, h3 (* = bit already set to 1 by foo)
![Page 55: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/55.jpg)
@doanduyhai
Bloom filters in action!
55
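Bit array: 1 0 0 1* 0 0 1 0 1 1
Write #partition = foo: h1, h2, h3
Write #partition = bar: h1, h2, h3 (* = bit already set to 1 by foo)
Read #partition = qux: check the bits given by h1, h2, h3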
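The foo / bar / qux walkthrough above can be reproduced with a toy Bloom filter. This is a minimal sketch, not Cassandra's implementation: the salted SHA-256 used to derive the k positions is an assumption chosen for determinism, and the `BloomFilter` class is hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, m_bits: int, k_hashes: int):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [0] * m_bits

    def _positions(self, key: str):
        # Derive k bit positions by salting one hash with the function index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter(m_bits=1024, k_hashes=3)
bloom.add("foo")
bloom.add("bar")
print(bloom.might_contain("foo"), bloom.might_contain("bar"))  # True True
print(bloom.might_contain("qux"))  # False unless all 3 bits collide
```

A `False` answer lets the read path skip the SSTable entirely; a `True` answer still requires checking the SSTable itself.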
![Page 56: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/56.jpg)
@doanduyhai
Bloom filters maths!
56
![Page 57: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/57.jpg)
@doanduyhai
Bloom filters maths!
57
Bit array of m bits: 1 0 0 1 0 0 1 0 0 0
probability of a bit to be set to 1: $\frac{1}{m}$
probability of a bit to be set to 0: $1 - \frac{1}{m}$
![Page 58: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/58.jpg)
@doanduyhai
Bloom filters maths!
58
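probability with k hash functions of the bit to be set to 0: $\left(1 - \frac{1}{m}\right)^{k}$
probability with k hash functions and n elements inserted of the bit to be set to 0: $\left(1 - \frac{1}{m}\right)^{kn}$
probability with k hash functions and n elements inserted of the bit to be set to 1: $1 - \left(1 - \frac{1}{m}\right)^{kn}$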
![Page 59: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/59.jpg)
@doanduyhai
Bloom filters maths!
59
But why do we need to calculate the probability of a bit:
• to be set to 1
• then to be set to 0
• then back to 1 again ?
![Page 60: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/60.jpg)
@doanduyhai
Bloom filters maths!
60
Because of bits colliding on 1 when applying many k & n !
![Page 61: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/61.jpg)
@doanduyhai
Bloom filters maths!
61
For an element not in the SSTable, probability that all k hash functions point to a bit set to 1 (false positive chance, fpc):
$$\left(1 - \left(1 - \frac{1}{m}\right)^{kn}\right)^{k} \approx \left(1 - e^{-\frac{kn}{m}}\right)^{k}$$
To minimize fpc: $k_{optimal} \approx \frac{m}{n}\ln(2)$
![Page 62: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/62.jpg)
@doanduyhai
Bloom filters maths!
62
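$$fpc = \left(1 - e^{-\frac{m}{n}\ln(2)\cdot\frac{n}{m}}\right)^{\frac{m}{n}\ln(2)} = \left(1 - e^{\ln(\frac{1}{2})}\right)^{\frac{m}{n}\ln(2)} = \left(\frac{1}{2}\right)^{\frac{m}{n}\ln(2)}$$
$$\ln(fpc) = \frac{m}{n}\ln\left(\frac{1}{2}\right)\ln(2) = -\frac{m}{n}\ln(2)^{2}$$
$$m = \frac{n\,\ln\left(\frac{1}{fpc}\right)}{\ln(2)^{2}}$$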
![Page 63: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/63.jpg)
@doanduyhai
Bloom filters maths!
63
$$m = \frac{n\,\ln\left(\frac{1}{fpc}\right)}{\ln(2)^{2}}$$
For n = $10^9$ #partitions
• fpc = 10%, m ≈ 4.8 × $10^9$ bits ≈ 600 MB
• fpc = 1%, m ≈ 9.6 × $10^9$ bits ≈ 1.2 GB
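As a quick check of the sizing formula, the sketch below evaluates $m = n\ln(1/fpc)/\ln(2)^2$ and $k_{optimal} = (m/n)\ln(2)$ for the slide's $n = 10^9$. The function names are hypothetical; only the two formulas come from the slides.

```python
import math


def bloom_bits(n: int, fpc: float) -> float:
    """Number of bits m for n elements and a target false positive chance."""
    return n * math.log(1 / fpc) / math.log(2) ** 2


def optimal_k(n: int, m: float) -> float:
    """Number of hash functions that minimizes fpc for given n and m."""
    return (m / n) * math.log(2)


n = 10 ** 9
for fpc in (0.10, 0.01):
    m = bloom_bits(n, fpc)
    print(f"fpc={fpc:.0%}: m = {m:.3g} bits = {m / 8 / 1e9:.2f} GB, "
          f"k = {optimal_k(n, m):.1f}")
```

Dividing the target fpc by 10 doubles the filter here, because m grows with $\ln(1/fpc)$ rather than with $1/fpc$.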
![Page 64: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/64.jpg)
@doanduyhai
Bloom filters (notes)!
64
• cannot remove elements once inserted (1-bit colliding)
• cannot resize
• collision increases with load
![Page 65: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/65.jpg)
Q & A
![Page 66: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/66.jpg)
Merkle tree!
by Ralph Merkle, 1987
![Page 67: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/67.jpg)
@doanduyhai
Repairing data!
67
Repair
The system transmits payload infinitely often between pairs of replicas
![Page 68: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/68.jpg)
@doanduyhai
Why repair ?!
68
Data diverge between replicas because of:
• writing with low consistency for perf
• nodes down
• network down
• dropped writes
![Page 69: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/69.jpg)
@doanduyhai
Repairing data!
69
Compare full data ?
• read all data
• I/O intensive
• network intensive (streaming is expensive)
![Page 70: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/70.jpg)
@doanduyhai
Repairing data!
70
Compare full data ?
• read all data
• I/O intensive
• network intensive (streaming is expensive)
Compare digests ?
• read all data
• I/O intensive
• network intensive (streaming is expensive)
![Page 71: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/71.jpg)
@doanduyhai
Merkle tree!
71
Tree of digests
• leaf nodes: digest of data
• non-leaf nodes: digest of their children's digests
• tree resolution = number of leaf nodes = $2^{depth}$
![Page 72: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/72.jpg)
@doanduyhai
Merkle tree in action!
72
Depth = 15, resolution = 32 768 leaf nodes
[Diagram: root → intermediate nodes → leaf1, leaf2, leaf3, …; each leaf covers a bucket of n partitions]
![Page 73: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/73.jpg)
@doanduyhai
Merkle tree in action!
73
Repair process
• send the tree to replicas
• compare digests, starting from root node
• if mismatch, stream partition bucket(s) that differ
![Page 74: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/74.jpg)
@doanduyhai
Merkle tree in action!
74
If mismatch, stream partition bucket(s) that differ
Example
• 327 680 partitions
• resolution = 32 768 → 10 partitions/bucket
• 1 column differs in 1 partition → 10 partitions streamed
[Diagram: one leaf covering a 10-partition bucket]
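The repair comparison described above (digest tree, compare from the root, stream only the buckets whose leaves differ) can be sketched as follows. This is an illustration under assumptions, not Cassandra's repair code: buckets are plain byte strings, SHA-256 stands in for the digest, and `build_tree` / `differing_buckets` are hypothetical helpers.

```python
import hashlib


def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(buckets):
    """Return the tree as a list of levels, leaves first and root last.

    `buckets` must have a power-of-two length; each entry is the serialized
    content of the partitions that fall into that leaf.
    """
    levels = [[digest(b) for b in buckets]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([digest(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels


def differing_buckets(tree_a, tree_b):
    """Walk both trees from the root; return indexes of mismatching leaves."""
    suspects = [0]  # node indexes at the current level, starting at the root
    for level in range(len(tree_a) - 1, 0, -1):
        suspects = [i for i in suspects if tree_a[level][i] != tree_b[level][i]]
        suspects = [child for i in suspects for child in (2 * i, 2 * i + 1)]
    return [i for i in suspects if tree_a[0][i] != tree_b[0][i]]


replica_a = [f"bucket-{i}".encode() for i in range(8)]
replica_b = list(replica_a)
replica_b[5] = b"bucket-5 with one changed column"

print(differing_buckets(build_tree(replica_a), build_tree(replica_b)))  # [5]
```

Only bucket 5 is flagged, but the whole bucket would be streamed even though a single column changed, which is the over-streaming issue discussed on the next slides.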
![Page 75: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/75.jpg)
@doanduyhai
Over-streaming nightmare!
75
![Page 76: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/76.jpg)
@doanduyhai
Merkle tree in action!
76
Improve tree resolution by increasing depth (dynamically)
[Diagram: two Merkle trees (root → nodes → leaf1, leaf2, … leafN); the second tree has additional levels of nodes and therefore more leaves]
![Page 77: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/77.jpg)
@doanduyhai
Merkle tree in action!
77
Improve tree resolution by repairing by partition ranges
[Diagram: three separate Merkle trees (root → nodes → leaf1, leaf2, … leafN), one per partition range]
![Page 78: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/78.jpg)
Q & A
![Page 79: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/79.jpg)
HyperLogLog!
by the late Philippe Flajolet, 2007
![Page 80: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/80.jpg)
@doanduyhai
Cassandra Read Path!
80
Remember that ?
[Diagram: Table1, Table2 and Table3, each holding several SSTables on disk (SSTable1, SSTable2, SSTable3)]
![Page 81: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/81.jpg)
@doanduyhai
Cassandra Read Path!
81
Even Bloom filters can't save you if data is spread over many SSTables
![Page 82: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/82.jpg)
@doanduyhai
Cassandra Read Path!
82
Compaction !
![Page 83: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/83.jpg)
@doanduyhai
Compaction!
83
Algorithm:
• take n SSTables
• load data in memory
• for each $S^{t}_{k,n}$ apply the merge function ($\max_{timestamp}$)
• remove (when applicable) tombstones
• build a new SSTable
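The compaction steps listed above can be sketched in a few lines. The model is deliberately naive and rests on assumptions: each SSTable is a dict mapping `(partition key, column name)` to `(timestamp, value)`, tombstones are always purgeable, and `compact` / `TOMBSTONE` are hypothetical names.

```python
TOMBSTONE = object()  # hypothetical marker for a deleted column


def compact(sstables, purge_tombstones=True):
    """Merge SSTables, keeping for each column the highest-timestamp version."""
    merged = {}
    for sstable in sstables:
        for key, cell in sstable.items():
            if key not in merged or cell[0] > merged[key][0]:
                merged[key] = cell
    if purge_tombstones:
        merged = {k: c for k, c in merged.items() if c[1] is not TOMBSTONE}
    return merged


file1 = {("ddoan", "address"): (1, "12 rue de.."), ("ddoan", "age"): (1, 33)}
file2 = {("ddoan", "age"): (2, 34)}
file3 = {("ddoan", "age"): (3, 35)}
file4 = {("ddoan", "address"): (7, "17 avenue..")}

print(compact([file1, file2, file3, file4]))
# {('ddoan', 'address'): (7, '17 avenue..'), ('ddoan', 'age'): (3, 35)}
```

Four files collapse into one entry per column, so a later read touches a single SSTable instead of four.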
![Page 84: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/84.jpg)
@doanduyhai
Compaction!
84
Build a new SSTable → allocate memory for new Bloom filter
![Page 85: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/85.jpg)
@doanduyhai
Compaction!
85
But how large is the new Bloom filter ?
![Page 86: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/86.jpg)
@doanduyhai
Compaction!
86
[Diagram: SSTable1 and SSTable2, each with its Bloom filter, are compacted. Size of the new Bloom filter: same size ? in between ? double size ?]
![Page 87: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/87.jpg)
@doanduyhai
Compaction!
87
Bloom filter size depends on … elements cardinality (fpc constant)
![Page 88: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/88.jpg)
@doanduyhai
Compaction!
88
Bloom filter size depends on … elements cardinality (fpc constant)
If we can count distinct elements in SSTable1 & SSTable2, we can allocate the new Bloom filter
![Page 89: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/89.jpg)
@doanduyhai
Compaction!
89
Bloom filters
[Diagram: SSTable1 (cardinality C1) and SSTable2 (cardinality C2), each with its Bloom filter]
Given constant fpc, if cardinality = C1+C2, then m = …
![Page 90: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/90.jpg)
@doanduyhai
Compaction!
90
But counting exact cardinality is memory-expensive ...
![Page 91: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/91.jpg)
@doanduyhai
Compaction!
91
Can’t we have a cardinality estimate ?
![Page 92: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/92.jpg)
@doanduyhai
Cardinality estimators!
92
| Counter | Bytes used | Error |
|---|---|---|
| Java HashSet | 10 447 016 | 0% |
| Linear Probabilistic Counter | 3 384 | 1% |
| HyperLogLog | 512 | 3% |
credits: http://highscalability.com/
![Page 93: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/93.jpg)
@doanduyhai
LogLog intuition!
93
1) given a well-distributed hash function h
2) given a sufficiently high number of elements n
For a set of n elements, look at the bit pattern of the hashes:
∀ i ∈ [1, n], h(element_i)
• hashes starting with 0 (0xxxxx…): ≈ n/2
• hashes starting with 1 (1xxxxx…): ≈ n/2
![Page 94: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/94.jpg)
@doanduyhai
LogLog intuition!
94
∀ i ∈ [1, n], h(element_i)
• 00xxxx…, 01xxxx…, 10xxxx…, 11xxxx…: ≈ n/4 each
• 000xxx…, 001xxx…, 010xxx…, 011xxx…, 100xxx…, 101xxx…, 110xxx…, 111xxx…: ≈ n/8 each
![Page 95: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/95.jpg)
@doanduyhai
LogLog intuition!
95
Reverse the reasoning. If we see a hash like this: 000 000 000 1…
Since the hash distribution is uniform, we should also have seen:
000 000 001 0… and 000 000 001 1… and 000 000 010 0… and … 111 111 111 1…
Thus an estimated cardinality of n ≈ $2^{10}$ elements
![Page 96: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/96.jpg)
@doanduyhai
LogLog intuition!
96
Toy example: n = 8, hashes of 8 elements, 3 bits long:
000, 001, 010, 011, 100, 101, 110, 111
Uniform hash → equi-probability of each combination
If I observed 001, I should have seen 000 too, and 010 too …
If I observed 001, I should have seen 7 other combinations
If I observed 001, n ≈ 8 ($2^3$)
![Page 97: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/97.jpg)
@doanduyhai
LogLog intuition!
97
1) given a well-distributed hash function h
2) given a sufficiently high number of elements n
If I find a hash starting with
• 01…, it's likely that there are $2^2$ distinct elements (n = $2^2$)
• 001…, it's likely that there are $2^3$ distinct elements (n = $2^3$)
• 0001…, it's likely that there are $2^4$ distinct elements (n = $2^4$)
• …
• 00000000001… (first 1 at position r), it's likely that there are $2^r$ distinct elements (n = $2^r$)
![Page 98: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/98.jpg)
@doanduyhai
LogLog intuition!
98
max(r) = position of the 1 in the longest 0000…1 prefix observed among all hash values
n ≈ $2^{\max(r)}$
![Page 99: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/99.jpg)
@doanduyhai
LogLog intuition!
99
Still, it's a terrible estimate …
What if we have these hash values for n = 16:
• 10 × 010…
• 5 × 100…
• 1 × 000 000 001…
![Page 100: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/100.jpg)
@doanduyhai
LogLog intuition!
100
Still, it's a terrible estimate …
What if we have these hash values for n = 16:
• 10 × 010…
• 5 × 100…
• 1 × 000 000 001…
n ≈ $2^{\max(r)}$ ≈ $2^9$ ≈ 512 ?
![Page 101: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/101.jpg)
@doanduyhai
LogLog intuition!
101
Still, it's a terrible estimate …
What if we have these hash values for n = 16:
• 10 × 010…
• 5 × 100…
• 1 × 000 000 001…
n ≈ $2^{\max(r)}$ ≈ $2^9$ ≈ 512 ?
→ sensitivity to outliers & skewed distributions
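The counter-example above is easy to reproduce. The sketch below applies the naive $n \approx 2^{\max(r)}$ rule to the sixteen hash prefixes from the slide; hashes are written as bit strings for readability, and each string is assumed to contain at least one 1.

```python
def rank(hash_bits: str) -> int:
    """1-based position of the first 1 bit (r on the slides)."""
    return hash_bits.index("1") + 1


def naive_estimate(hashes) -> int:
    """Estimate the cardinality as 2 to the power of the largest rank seen."""
    return 2 ** max(rank(h) for h in hashes)


hashes = ["010"] * 10 + ["100"] * 5 + ["000000001"]

print(naive_estimate(hashes))       # 512, although only 16 hashes were seen
print(naive_estimate(hashes[:-1]))  # 4 once the single outlier is removed
```

One unlucky hash moves the estimate from 4 to 512, which is why a single max(r) over the whole set is too fragile.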
![Page 102: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/102.jpg)
@doanduyhai
HyperLogLog intuition!
102
To eliminate outliers … use harmonic mean !
credits: http://economistatlarge.com
![Page 103: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/103.jpg)
HyperLogLog intuition!
Harmonic mean definition (thank you Wikipedia)
$$H = \frac{m}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_m}}$$
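To see why the harmonic mean helps, here is a small comparison with the arithmetic mean. The input values are an illustration only: they are the per-value guesses $2^r$ for the 16 hash values of the LogLog example (ranks 2, 1 and 9), not per-bucket estimates.

```python
import statistics


def harmonic_mean(xs):
    """H = m / (1/x_1 + ... + 1/x_m), as in the definition above."""
    return len(xs) / sum(1.0 / x for x in xs)


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


# 10 values with rank 2 (2^2 = 4), 5 with rank 1 (2^1 = 2), 1 outlier with rank 9 (2^9 = 512)
values = [4] * 10 + [2] * 5 + [512]

print(arithmetic_mean(values))            # pulled up strongly by the outlier
print(harmonic_mean(values))              # stays close to the typical values
print(statistics.harmonic_mean(values))   # standard-library cross-check
```

The arithmetic mean is dragged far above the typical values by the single 512, while the harmonic mean stays between 2 and 4.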
![Page 104: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/104.jpg)
HyperLogLog intuition!
First, split the set into $m = 2^b$ buckets
The bucket number is determined by the first b bits
b = 6, m = 64 buckets
Buckets list: B1, B2, … B64 (index is 1-based)
h(element) = 001001 0100…
![Page 105: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/105.jpg)
HyperLogLog intuition!
Example: m = 8 ($2^3$) buckets
• 000xxx… → B1
• 001xxx… → B2
• 010xxx… → B3
• 011xxx… → B4
• 100xxx… → B5
• 101xxx… → B6
• 110xxx… → B7
• 111xxx… → B8
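The prefix-to-bucket mapping for m = 8 can be sketched as follows, with hashes written as bit strings for readability. `bucket_index` is an illustrative name, not an existing API.

```python
def bucket_index(hash_bits: str, b: int) -> int:
    """1-based bucket number taken from the first b bits of a bit string."""
    if len(hash_bits) < b:
        raise ValueError("hash has fewer than b bits")
    return 1 + int(hash_bits[:b], 2)


# Rebuild the m = 8 (b = 3) mapping: 000 -> B1, 001 -> B2, ..., 111 -> B8
prefixes = [format(i, "03b") for i in range(8)]
table = {p: f"B{bucket_index(p, 3)}" for p in prefixes}

for prefix, bucket in table.items():
    print(f"{prefix}xxx... -> {bucket}")
```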
![Page 106: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/106.jpg)
HyperLogLog intuition!
New intuition:
• in each bucket j, there are ≈ $M_j$ elements
• harmonic mean of the $M_j$: $H(M_j) \approx n/m$
$n \approx m \cdot H(M_j)$
![Page 107: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/107.jpg)
HyperLogLog intuition!
But how do we calculate each $M_j$?
![Page 108: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/108.jpg)
HyperLogLog intuition!
Use LogLog !
![Page 109: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/109.jpg)
HyperLogLog intuition!
How to solve a big hard problem ?
![Page 110: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/110.jpg)
HyperLogLog intuition!
So, for each hash value 001100 0000001…
• 001100 → bits for choosing bucket $B_j$
• 0000001… → bits for LogLog
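A sketch of this split on an integer hash, assuming a 64-bit hash and b = 6 as in the example value. The function name `split_hash` and the zero padding after the shown bits are assumptions made for illustration.

```python
def split_hash(x: int, b: int, width: int = 64):
    """Return (bucket j, rank of the remaining bits) for a width-bit hash x."""
    rest_width = width - b
    j = 1 + (x >> rest_width)               # first b bits -> 1-based bucket number
    w = x & ((1 << rest_width) - 1)         # remaining bits, used for LogLog
    rho = rest_width - w.bit_length() + 1   # position of the leftmost 1-bit in w
    return j, rho


# The slide's value 001100 0000001..., padded with zeros to 64 bits (assumption)
x = int("001100" + "0000001" + "0" * 51, 2)
j, rho = split_hash(x, b=6)
print(j, rho)
```

Here the prefix 001100 is 12 in binary, so the element lands in bucket B13, and the first 1-bit of the remaining bits sits at position 7.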
![Page 111: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/111.jpg)
HyperLogLog improvement!
• Greater precision compared to LogLog
• Computation can be distributed (each bucket processed separately)
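One common way to exploit this property, shown here as a hedged sketch rather than as Cassandra's implementation, is to build registers independently on separate partitions of the data and merge them with an element-wise maximum. The names `registers_for` and `merge`, the SHA-1 based hash and the tiny bucket count are assumptions made for the demo.

```python
import hashlib

B = 4             # assumption: 2^4 = 16 buckets, small on purpose
M = 1 << B
WIDTH = 64        # assumption: 64-bit hash taken from a SHA-1 digest


def registers_for(values):
    """Build the per-bucket maximum ranks for one partition of the data."""
    regs = [0] * M
    for v in values:
        x = int.from_bytes(hashlib.sha1(v.encode("utf-8")).digest()[:8], "big")
        j = x >> (WIDTH - B)                      # 0-based bucket index
        w = x & ((1 << (WIDTH - B)) - 1)          # bits left for LogLog
        rho = (WIDTH - B) - w.bit_length() + 1    # leftmost 1-bit position
        regs[j] = max(regs[j], rho)
    return regs


def merge(left, right):
    """Merge two register arrays bucket by bucket."""
    return [max(a, b) for a, b in zip(left, right)]


part1 = [f"user-{i}" for i in range(0, 600)]
part2 = [f"user-{i}" for i in range(400, 1000)]   # overlaps part1 on purpose
merged = merge(registers_for(part1), registers_for(part2))
```

Merging gives the same registers as processing the union in one pass, including for elements that appear in both partitions.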
![Page 112: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/112.jpg)
HyperLogLog the maths!
![Page 113: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/113.jpg)
HyperLogLog formal definition!
Let $h : D \to [0, 1] \equiv \{0, 1\}^\infty$ hash data from domain D to the binary domain.
Let $\rho(s)$, for $s \in \{0, 1\}^\infty$, be the position of the leftmost 1-bit ($\rho(0001\cdots) = 4$)
• it is the rank of the observed 0000..1 sequence
Let $m = 2^b$ with $b \in \mathbb{Z}_{>0}$
• m = number of buckets
Let M : multiset of items from domain D
• M is the set of elements whose cardinality we want to estimate
![Page 114: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/114.jpg)
HyperLogLog formal definition!
Algorithm HYPERLOGLOG
Initialize a collection of m registers, $M_1, \dots, M_m$, to $-\infty$
for each element v ∈ M do
• set x := h(v) //hash of v in binary form
• set $j := 1 + \langle x_1 x_2 \cdots x_b \rangle_2$ //bucket number (1-based)
• set $w := x_{b+1} x_{b+2} \cdots$ //bits for LogLog
• set $M_j := \max(M_j, \rho(w))$ //take the longest 0000..1 position observed in bucket $B_j$
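The register-update loop can be sketched in Python as below. Two choices are assumptions of this sketch and not part of the formal definition: registers start at 0 instead of $-\infty$ (a common practical simplification), and the list index is 0-based, so it equals the slide's j minus 1. The hash function is passed in as a parameter, and `hyperloglog_registers` is an illustrative name.

```python
def hyperloglog_registers(elements, hash_fn, b, width=64):
    """Run the update loop and return the m = 2^b registers."""
    m = 1 << b
    rest = width - b
    registers = [0] * m                     # assumption: 0 instead of -infinity
    for v in elements:
        x = hash_fn(v)                      # hash of v as a width-bit integer
        j = x >> rest                       # first b bits (0-based bucket index)
        w = x & ((1 << rest) - 1)           # remaining bits, for LogLog
        rho = rest - w.bit_length() + 1     # position of the leftmost 1-bit in w
        registers[j] = max(registers[j], rho)
    return registers


# Hypothetical fixed hash values, chosen so the result can be checked by hand
fake_hashes = {
    "a": int("001100" + "0000001" + "0" * 51, 2),  # bucket index 12, rank 7
    "b": int("001100" + "1" + "0" * 57, 2),        # same bucket, rank 1
}
registers = hyperloglog_registers(["a", "b", "a"], fake_hashes.__getitem__, b=6)
print(registers[12])
```

Duplicates do not change the registers, which is what makes the structure suitable for counting distinct elements.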
![Page 115: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/115.jpg)
HyperLogLog formal definition!
Compute //what is that Z ?
$$Z = \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}$$
Return $n \approx \alpha_m m^2 Z$
• $\alpha_m$ as given by Equation (3) //what is that $\alpha_m$ ?
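A sketch of this final step, given a list of registers. The $\alpha_m$ values below are the closed-form approximations commonly quoted from the HyperLogLog paper, used here in place of the integral of Equation (3). The small-range and large-range corrections of practical implementations are deliberately left out, and the function names are illustrative.

```python
def alpha(m: int) -> float:
    """Bias-correction constant (commonly used approximations)."""
    if m == 16:
        return 0.673
    if m == 32:
        return 0.697
    if m == 64:
        return 0.709
    if m >= 128:
        return 0.7213 / (1 + 1.079 / m)
    raise ValueError("this sketch only supports m = 16, 32, 64 or m >= 128")


def raw_estimate(registers) -> float:
    """n ~ alpha_m * m^2 * Z, with Z = 1 / sum(2^-M_j)."""
    m = len(registers)
    z = 1.0 / sum(2.0 ** -r for r in registers)
    return alpha(m) * m * m * z


# 16 registers all equal to 3: Z = 1 / (16 / 8) = 0.5, so n ~ 0.673 * 256 * 0.5
print(raw_estimate([3] * 16))
```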
![Page 116: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/116.jpg)
HyperLogLog maths workout!
Remember our intuition $n \approx m \cdot H(M_j)$ ?
$M_j$ = longest 0000...1 observed for bucket j. $H(M_j) \approx n/m$
Harmonic mean definition:
$$H = \frac{m}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_m}}$$
![Page 117: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/117.jpg)
HyperLogLog maths workout!
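$$H = \frac{m}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_m}} = \frac{m}{\sum_{j=1}^{m} \frac{1}{x_j}}$$
$$H = m \left( \sum_{j=1}^{m} \frac{1}{x_j} \right)^{-1} = m \left( \sum_{j=1}^{m} x_j^{-1} \right)^{-1}$$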
![Page 118: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/118.jpg)
HyperLogLog maths workout!
$$H = m \left( \sum_{j=1}^{m} x_j^{-1} \right)^{-1}$$
compare it with
$$Z = \left( \sum_{j=1}^{m} 2^{-M_j} \right)^{-1}$$
let $x_j = 2^{M_j}$, the cardinality estimate for bucket $B_j$
$$H = mZ$$
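The identity can be checked on a small hand-picked example. The register values below are hypothetical, and exact fractions are used so that the comparison does not depend on floating-point rounding.

```python
from fractions import Fraction

M = [3, 5, 2, 4]            # hypothetical register values M_j
m = len(M)
x = [2 ** r for r in M]     # x_j = 2^M_j, the per-bucket cardinality estimates

H = m / sum(Fraction(1, xj) for xj in x)      # harmonic mean of the x_j
Z = 1 / sum(Fraction(1, 2 ** r) for r in M)   # Z as defined in the algorithm

print(H, Z, m * Z)
```

With these values the sum of the $2^{-M_j}$ is 15/32, so Z = 32/15 and H = 128/15, which is exactly m times Z.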
![Page 119: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/119.jpg)
HyperLogLog maths workout!
Remember our intuition $n \approx m \cdot H(M_j)$ ?
$n \approx mH \approx m^2 Z$ ☞ $\alpha_m m^2 Z$
![Page 120: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/120.jpg)
HyperLogLog harder maths!
What about the $\alpha_m$ constant?
![Page 121: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/121.jpg)
HyperLogLog harder maths!
You don’t want to dig into that, trust me …
![Page 122: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/122.jpg)
HyperLogLog harder maths!
8 pages full of this:
![Page 123: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/123.jpg)
HyperLogLog harder maths!
and this
![Page 124: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/124.jpg)
HyperLogLog harder maths!
and this…
![Page 125: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/125.jpg)
Compaction!
Bloom filters
Given constant fpc, if cardinality = C1+C2, then m = …
• SSTable1, cardinality: C1
• SSTable2, cardinality: C2
![Page 126: Cassandra data structures and algorithms](https://reader030.fdocuments.in/reader030/viewer/2022020123/5593b4f21a28ab296b8b45d5/html5/thumbnails/126.jpg)
Q & A