Distributed Database Architecture - UNIMIB
Outline
• Data distribution
• Data replication
Distributed data: summary (@source IBM)
[Diagram: six integration patterns, grouped by whether data moves]
• Data move: NO
  – Basic (single DB): the application connects to one DBMS and its DB
  – Federation: the application connects through a federation server to several DBMSs/DBs
  – Distributed access: the application connects directly to several DBMSs/DBs
• Data move: YES
  – Replication: a replication server copies data between DBMSs/DBs
  – Event publishing: an event-publishing server pushes changes from a DBMS/DB to applications
  – Extract, Transform & Load: an ETL server loads data from source DBs into a data warehouse (DW)
Data distribution
Types of architecture
• Shared everything
• Shared disk
• Shared nothing
Shared everything
[Diagram: mainframe era — presentation logic, business logic, and the database all run on the mainframe, accessed by dumb terminals; a client-server variant keeps the database on a dedicated database server.]
Shared everything (web)
[Diagram: three-tier web setup — presentation logic (JavaScript) in the web browser, business and presentation logic on application servers, data on a database server.]
Shared disk
Shared nothing
• The solution adopted by NoSQL database architectures to support scale-out
Evaluation
• http://www.mullinsconsulting.com/db2arch-sd-sn.html
Scalability or availability?
• What is high availability? It is a mix of:
  – architecture design
  – people!
  – process
  – technology
• What high availability is NOT:
  – a pure technology solution
  – a synonym for scalability or manageability
How many 9s?

| Availability | Downtime (per year)                         |
|--------------|---------------------------------------------|
| 100%         | never                                       |
| 99.999%      | < 5.26 minutes                              |
| 99.99%       | 5.26 – 52.6 minutes                         |
| 99.9%        | 52.6 minutes – 8 hours 45 minutes           |
| 99%          | 8 hours 45 minutes – 87 hours 36 minutes    |
| 90%          | 788 hours 24 minutes – 875 hours 54 minutes |
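As a quick check of the table, the downtime column is just (1 − availability) of a year; a minimal Python sketch (not part of the original deck):

```python
# Downtime per year implied by an availability figure; reproduces the
# table above (e.g. 99.999% -> ~5.26 minutes per year).
MINUTES_PER_YEAR = 365 * 24 * 60        # 525,600

def downtime_minutes(availability_pct: float) -> float:
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR

for a in (99.999, 99.99, 99.9, 99.0, 90.0):
    print(f"{a}% -> {downtime_minutes(a):,.1f} minutes/year")
```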
Replication

System log
• A log is a sequential file stored in stable memory (a "conceptual" storage that will never fail).
• It records all activities performed by all transactions, in chronological order.
• Two types of record are stored:
  – transaction records: the operations on tables
  – system events: checkpoints and dumps
Transaction log
• The record format depends on the specific relational operation.
• Possible operations in a transaction:
  – begin, B(T)
  – insert, I(T, O, AS)
  – delete, D(T, O, BS)
  – update, U(T, O, BS, AS)
  – commit, C(T), or abort, A(T)
• Legend: O = object, AS = after state, BS = before state
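To make the notation concrete, here is a minimal sketch of how such records could be appended to a log file; the record layout and file name are illustrative, not any specific DBMS's format.

```python
import json

def append(log, record):
    # Chronological, append-only: each record is one line of the log.
    log.write(json.dumps(record) + "\n")

with open("txn.log", "a") as log:
    append(log, {"type": "B", "tx": "T1"})                  # begin
    append(log, {"type": "U", "tx": "T1", "obj": "O1",
                 "before": {"x": 1}, "after": {"x": 2}})    # update
    append(log, {"type": "C", "tx": "T1"})                  # commit
```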
A log example
[Timeline: B(T1), B(T2), interleaved updates U(T1, …), U(T2, …), C(T2), further updates U(T1, …), B(T3), U(T3, …), in chronological order]
Checkpoint
• A checkpoint record stores the set of transactions T1, …, Tn running at a given point in time.
Example of transaction log and checkpoint
[Timeline: dump, then checkpoint CK, then B(T1), B(T2), updates, C(T2), B(T3), more updates; at the crash point T2 is committed, T1 is uncommitted, and T3 had not started yet at the checkpoint]
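The point of the checkpoint is exactly this classification: at restart, committed transactions are redone and uncommitted ones undone. A sketch of that undo/redo split, assuming records shaped like the ones above (this is the textbook warm-restart idea, not a specific DBMS's algorithm):

```python
def warm_restart(records, checkpoint_active):
    """Classify transactions after a crash: `records` is the log suffix
    from the last checkpoint; `checkpoint_active` are the transactions
    that were running when the checkpoint was taken."""
    committed, seen = set(), set(checkpoint_active)
    for r in records:                  # forward scan from the checkpoint
        seen.add(r["tx"])
        if r["type"] == "C":
            committed.add(r["tx"])
    undo = seen - committed            # uncommitted at crash time: undo
    redo = committed                   # committed: redo their updates
    return undo, redo

# For the example above: T2 ends up in redo; T1 (and T3, if it had
# begun before the crash) end up in undo.
```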
Dump
• A dump is a full copy of the entire state of a DB in stable memory.
• It is executed offline and generates a backup.
• After the backup is completed, the dump record is written to the log.
Log example
[Timeline: dump, then CK, then B(T1), B(T2), updates, C(T2), B(T3), more updates — the same log as before, now preceded by the dump record]
Replica architectures (@source IBM)
• Data distribution (1:many): one source, many targets
• Data consolidation (many:1): many sources, one target
• Multi-tier staging: source → staging tier(s) → targets
• Peer-to-peer: every node acts as both source and target
• Bi-directional: primary and secondary replicate to each other, with conflict detection/resolution
How to create a replica (detach)
1. Detach the database
2. Copy the files
3.–4. Attach the original and the copy
How to create a replica (backup)
1. Backup
2. Copy
3. Restore
How to create a replica (log shipping)
• Full backup + transaction log backups
How to create a replica: log capture
[Diagram (@source IBM): a CAPTURE process reads the DB log of SOURCE1/SOURCE2 and rebuilds transactions in memory before publishing them:
• TX1 (INSERT S1, UPDATE S1, COMMIT): the MQ put to the send queue happens when the commit record is found
• TX2 (INSERT S2): the transaction is still "in-flight", nothing is sent yet
• TX3 (DELETE S1, ROLLBACK): "zapped" at abort, it never makes it to the send queue
Q-SUBS/Q-PUBS define subscriptions and publications; a restart queue tracks the capture position.]
Event Publishing
• Conceptually, it is replication without the apply step.
[Diagram: Capture reads the DB log of the sources and publishes change events through an event broker (WBI Event Broker) to SOA/user applications and targets.]
Replica execution
• Initialization: full backup on the primary, copy, full restore on the secondary
• Synchronization: log backup on the primary, copy, log restore on the secondary
• Monitoring of the whole process
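A minimal sketch of the two phases, with every path and callback a placeholder rather than a real DBMS API:

```python
import shutil
import time

def initialize(full_backup_path, secondary_restore_path):
    # Initialization: full backup on the primary, copy, full restore.
    shutil.copy(full_backup_path, secondary_restore_path)

def synchronize(take_log_backup, restore_log_backup, interval_s=60):
    # Synchronization: ship and restore log backups forever; a real
    # setup would also monitor lag and failures here.
    while True:
        log_backup = take_log_backup()      # runs on the primary
        restore_log_backup(log_backup)      # runs on the secondary
        time.sleep(interval_s)
```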
Another architecture
• Publisher → Distributor → Subscribers
Distribution in NoSQL

MongoDB's Approach to Sharding
Partitioning
• The user defines a shard key
• The shard key defines ranges of data
• The key space is like points on a line
• A range is a segment of that line
Data Distribution
• Initially there is 1 chunk
• Default max chunk size: 64 MB
• MongoDB automatically splits & migrates chunks when the max is reached
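A toy illustration of the split rule (the numbers and the Chunk type are invented for the example; a real split cuts at an actual shard-key value, not the arithmetic midpoint):

```python
from dataclasses import dataclass

MAX_CHUNK_BYTES = 64 * 1024 * 1024          # the deck's 64 MB default

@dataclass
class Chunk:
    min_key: int      # inclusive lower bound of the shard-key range
    max_key: int      # exclusive upper bound
    size_bytes: int

def maybe_split(c: Chunk):
    # A split is metadata-only: the range is cut in two and both halves
    # stay on the same shard until the balancer migrates one of them.
    if c.size_bytes <= MAX_CHUNK_BYTES or c.min_key + 1 >= c.max_key:
        return [c]                           # no split point available
    mid = (c.min_key + c.max_key) // 2
    half = c.size_bytes // 2
    return [Chunk(c.min_key, mid, half),
            Chunk(mid, c.max_key, c.size_bytes - half)]
```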
Routing and Balancing
• Queries are routed to specific shards
• MongoDB balances the cluster
• MongoDB migrates data to new nodes
MongoDB Auto-Sharding
• Minimal effort required
  – Same interface as a single mongod
• Two steps
  – Enable sharding for a database
  – Shard a collection within the database
Architecture
What is a Shard?
• A shard is a node of the cluster
• A shard can be a single mongod or a replica set
Meta Data Storage
• Config Server
  – Stores cluster chunk ranges and locations
  – Can have only 1 or 3 (production must have 3)
  – Not a replica set
Routing and Managing Data
• Mongos
  – Acts as a router / balancer
  – No local data (persists to the config database)
  – Can have 1 or many
Sharding infrastructure

Configuration

Example Cluster
Starting the Configuration Server
mongod --configsvr
Starts a configuration server on the default port (27019).
Start the mongos Router
mongos --configdb <hostname>:27019
For 3 configuration servers:
mongos --configdb <host1>:<port1>,<host2>:<port2>,<host3>:<port3>
This is always how to start a new mongos, even if the cluster is already running.
Start the shard database
mongod --shardsvr
Starts a mongod with the default shard port (27018).
The shard is not yet connected to the rest of the cluster; it may have already been running in production.
Add the Shard
On mongos:
– sh.addShard("<host>:27018")
Adding a replica set:
– sh.addShard("<rsname>/<seedlist>")
Verify that the shard was added
db.runCommand({ listshards: 1 })
{ "shards":
  [ { "_id": "shard0000", "host": "<hostname>:27018" } ],
  "ok": 1
}
Enabling Sharding
• Enable sharding on a database
  sh.enableSharding("<dbname>")
• Shard a collection with the given key
  sh.shardCollection("<dbname>.people", { "country": 1 })
• Use a compound shard key to prevent duplicates
  sh.shardCollection("<dbname>.cars", { "year": 1, "uniqueid": 1 })
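The same two steps can also be issued from a driver as the underlying admin commands; a sketch with pymongo (the host, database, and collection names are placeholders):

```python
from pymongo import MongoClient

client = MongoClient("mongos-host", 27017)   # connect to a mongos router

# enableSharding / shardCollection are the admin commands behind the
# sh.* shell helpers shown above.
client.admin.command("enableSharding", "mydb")
client.admin.command("shardCollection", "mydb.people",
                     key={"country": 1})
```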
Tag Aware Sharding
• Tag aware sharding allows you to control the distribution of your data
• Tag a range of shard keys
  – sh.addTagRange(<collection>, <min>, <max>, <tag>)
• Tag a shard
  – sh.addShardTag(<shard>, <tag>)
Mechanics

Partitioning
• Remember: it's based on ranges
• A chunk is a section of the entire range
Chunk splitting
• A chunk is split once it exceeds the maximum size
• There is no split point if all documents have the same shard key
• A chunk split is a logical operation (no data is moved)
Balancing
• The balancer runs on mongos
• Once the difference in chunks between the most dense shard and the least dense shard is above the migration threshold, a balancing round starts
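In outline, the trigger is a simple chunk-count comparison; a sketch (the real threshold depends on the total number of chunks, so the value here is illustrative):

```python
from collections import Counter

def needs_balancing(chunk_to_shard, threshold=8):
    # chunk_to_shard maps each chunk id to the shard that owns it.
    counts = Counter(chunk_to_shard.values())
    return max(counts.values()) - min(counts.values()) >= threshold
```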
Acquiring the Balancer Lock
• The balancer on mongos takes out a "balancer lock"
• To see the status of these locks:
  use config
  db.locks.find({ _id: "balancer" })
Moving the chunk
• The mongos sends a moveChunk command to the source shard
• The source shard then notifies the destination shard
• The destination shard starts pulling documents from the source shard
Committing Migration
• When complete, the destination shard updates the config server
  – Provides the new locations of the chunks
Cleanup
• The source shard deletes the moved data
  – It must wait for open cursors to either close or time out
  – NoTimeout cursors may prevent the release of the lock
• The mongos releases the balancer lock after the old chunks are deleted
Routing Requests

Cluster Request Routing
• Targeted queries
• Scatter-gather queries
• Scatter-gather queries with sort
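Before tracing the three patterns step by step, here is a sketch of the routing decision itself, with a hypothetical shard key and integer chunk bounds (real chunk bounds are BSON shard-key values):

```python
def route(query: dict, chunks: dict):
    """chunks maps (min_key, max_key) ranges to shard names."""
    if "user_id" in query:                    # query constrains the key:
        k = query["user_id"]                  # targeted to the owner(s)
        return [s for (lo, hi), s in chunks.items() if lo <= k < hi]
    return sorted(set(chunks.values()))       # otherwise scatter-gather

chunks = {(0, 100): "shard0", (100, 200): "shard1"}
print(route({"user_id": 42}, chunks))         # ['shard0']  (targeted)
print(route({"name": "ada"}, chunks))         # both shards (scatter)
```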
Cluster Request Routing: Targeted Query
1. Routable request received
2. Request routed to the appropriate shard
3. Shard returns results
4. Mongos returns results to the client
Cluster Request Routing: Non-Targeted Query
1. Non-targeted request received
2. Request sent to all shards
3. Shards return results to mongos
4. Mongos returns results to the client
Cluster Request Routing: Non-Targeted Query with Sort
1. Non-targeted request with sort received
2. Request sent to all shards
3. Query and sort performed locally
4. Shards return results to mongos
5. Mongos merges the sorted results
6. Mongos returns results to the client
Shard Key
• The shard key is immutable
• Shard key values are immutable
• The shard key must be indexed
• The shard key is limited to 512 bytes in size
• The shard key is used to route queries
  – Choose a field commonly used in queries
• Only the shard key can be unique across shards
  – The `_id` field is only unique within an individual shard
Shard Key Considerations
• Cardinality
• Write distribution
• Query isolation
• Reliability
• Index locality
HBase Architecture

Three Major Components
• The HBaseMaster
  – One master
• The HRegionServer
  – Many region servers
• The HBase client
HBase Components
• Region
  – A subset of a table's rows, like horizontal range partitioning
  – Automatically done
• RegionServer (many slaves)
  – Manages data regions
  – Serves data for reads and writes (using a log)
• Master
  – Responsible for coordinating the slaves
  – Assigns regions, detects failures
  – Admin functions
Big Picture

HBase architecture
ZooKeeper
• HBase depends on ZooKeeper
• By default HBase manages the ZooKeeper instance
  – e.g., starts and stops ZooKeeper
• HMaster and HRegionServers register themselves with ZooKeeper
Cassandra Architecture

Cassandra Architecture Overview
○ Cassandra was designed with the understanding that system/hardware failures can and do occur
○ Peer-to-peer, distributed system
  ○ All nodes are the same
  ○ Data is partitioned among all nodes in the cluster
  ○ Custom data replication ensures fault tolerance
  ○ Read/write-anywhere design
○ Google BigTable – data model
  ○ Column families
  ○ Memtables
  ○ SSTables
○ Amazon Dynamo – distributed systems technologies
  ○ Consistent hashing
  ○ Partitioning
  ○ Replication
  ○ One-hop routing
Transparent Elasticity
Nodes can be added to and removed from Cassandra online, with no downtime experienced.
[Diagram: a 6-node ring (1–6) grows online into a 12-node ring (1–12)]
Transparent Scalability
Adding Cassandra nodes increases performance linearly, as well as the ability to manage TBs–PBs of data.
[Diagram: doubling a 6-node ring to 12 nodes doubles throughput: performance goes from N to N × 2]
High Availability
Cassandra, with its peer-to-peer architecture, has no single point of failure.
Multi-Geography/Zone Aware
Cassandra allows a single logical database to span 1–N geographically dispersed datacenters. It also supports hybrid on-premise/cloud implementations.
Data Redundancy
Cassandra allows for customizable data redundancy so that data is completely protected. It also supports rack awareness (data can be replicated between different racks to guard against machine/rack failures). Cassandra uses ZooKeeper to choose a leader, which tells nodes the range they are replicas for.
Partitioning
• Nodes are logically structured in a ring topology.
• The hashed value of the key associated with a data item is used to assign it to a node in the ring.
• Hashing wraps around after a certain value to support the ring structure.
• Lightly loaded nodes move position on the ring to relieve heavily loaded nodes.
Partitioning & Replication
[Diagram: nodes A–F on a hash ring over the interval 0–1; h(key1) and h(key2) are placed on the ring and each key is stored on N = 3 successive nodes]
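A toy version of that picture: hash each node and key onto the ring and store every key on the N nodes that follow it clockwise (MD5 here stands in for whatever partitioner is configured):

```python
import bisect
import hashlib

def token(name: str) -> int:
    return int.from_bytes(hashlib.md5(name.encode()).digest(), "big")

class Ring:
    def __init__(self, nodes, n_replicas=3):
        self.n = n_replicas
        self.ring = sorted((token(n), n) for n in nodes)

    def replicas(self, key: str):
        tokens = [t for t, _ in self.ring]
        i = bisect.bisect_right(tokens, token(key)) % len(self.ring)
        return [self.ring[(i + k) % len(self.ring)][1]
                for k in range(self.n)]

ring = Ring(["A", "B", "C", "D", "E", "F"])
print(ring.replicas("key1"))    # the N = 3 nodes that store key1
```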
Gossip Protocols
• Used to discover location and state information about the other nodes participating in a Cassandra cluster
• Network communication protocol inspired by real-life rumor spreading
• Periodic, pairwise, inter-node communication
• Low-frequency communication ensures low cost
• Random selection of peers
• Example – node A wishes to search for a pattern in the data
  – Round 1: node A searches locally, then gossips with node B
  – Round 2: nodes A and B gossip with C and D
  – Round 3: nodes A, B, C and D gossip with 4 other nodes …
• Round-by-round doubling makes the protocol very robust
Failure Detection
• The gossip process tracks heartbeats from other nodes, both directly and indirectly
• A node's failure state is given by the variable Φ
  – It tells how likely a node is to have failed (a suspicion level) instead of a simple binary value (up/down)
• This kind of system is known as an accrual failure detector
• It takes into account network conditions, workload, or other conditions that might affect the perceived heartbeat rate
• A threshold on Φ is used to decide whether a node is dead
• If the node is alive, Φ stays near a constant set by the application, generally Φ(t) = 0
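A sketch of the idea behind Φ, assuming exponentially distributed heartbeat intervals (the published accrual detector fits the observed inter-arrival distribution; this simplification only keeps the shape of the formula):

```python
import math

def phi(now, last_heartbeat, mean_interval):
    # P(a heartbeat arrives even later than what we have waited so far)
    elapsed = now - last_heartbeat
    p_later = math.exp(-elapsed / mean_interval)
    return -math.log10(p_later)     # small while heartbeats keep coming

# A fresh heartbeat gives phi near 0; the longer the silence, the higher
# phi climbs, and the node is declared dead once phi crosses a threshold.
```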
Write Operation Stages
• Logging data in the commit log
• Writing data to the memtable
• Flushing data from the memtable
• Storing data on disk in SSTables
Write Operations
• Commit log
  – First place a write is recorded
  – Crash recovery mechanism
  – A write is not successful until it is recorded in the commit log
  – Once recorded in the commit log, data is written to the memtable
• Memtable
  – Data structure in memory
  – Once the memtable size reaches a threshold, it is flushed (appended) to an SSTable
  – Several may exist at once (1 current, any others waiting to be flushed)
  – First place read operations look for data
• SSTable
  – Kept on disk
  – Immutable once written
  – Periodically compacted for performance
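A compressed sketch of those three structures working together (thresholds and types are toy values, not Cassandra internals):

```python
class ToyNode:
    FLUSH_THRESHOLD = 4                      # tiny, for illustration

    def __init__(self):
        self.commit_log = []                 # first place a write lands
        self.memtable = {}                   # in-memory, per-key
        self.sstables = []                   # immutable, sorted on flush

    def write(self, key, value):
        self.commit_log.append((key, value))          # crash recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.FLUSH_THRESHOLD:
            self.sstables.append(dict(sorted(self.memtable.items())))
            self.memtable = {}               # flushed; start a new one
```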
[Diagram: Write Operations — the write path from commit log to memtable to SSTables]
Consistency
• Read consistency
  – The number of nodes that must agree before a read request returns
  – ONE to ALL
• Write consistency
  – The number of nodes that must be updated before a write is considered successful
  – ANY to ALL
  – At ANY, a hinted handoff is all that is needed to return
• QUORUM
  – Commonly used middle-ground consistency level
  – Defined as (replication_factor / 2) + 1
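The USING CONSISTENCY clause on the following slides is the CQL syntax of the deck's era; today the level is set per statement in the driver. A sketch with the DataStax Python driver (contact point, keyspace, and table are placeholders):

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("mykeyspace")

# QUORUM = (replication_factor / 2) + 1, e.g. 2 of 3 replicas.
stmt = SimpleStatement(
    "INSERT INTO users (id, name) VALUES (%s, %s)",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(stmt, (1, "ada"))
```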
Write Consistency (ONE)
[Diagram: 6-node ring, replication_factor = 3; the client's write to replicas R1–R3 returns after one acknowledgement]
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ONE
Write Consistency (QUORUM)
[Diagram: 6-node ring, replication_factor = 3; the write returns after a quorum (2 of 3 replicas) acknowledges]
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY QUORUM
Write Operations: Hinted Handoff
• A write is intended for a node that is offline
• An online node processing the request makes a note to carry out the write once the node comes back online
Hinted Handoff
[Diagram: 6-node ring, replication_factor = 3 and hinted_handoff_enabled = true; one replica is down, so the coordinator writes the hint locally to system.hints]
INSERT INTO table (column1, …) VALUES (value1, …) USING CONSISTENCY ANY
Note: a hint does not count toward the consistency level (except at ANY).
Delete Operations
• Tombstones
  – On a delete request, records are marked for deletion
  – Similar to a "Recycle Bin"
  – Data is actually deleted on a major compaction or by a configurable timer
Compaction
• Compaction runs periodically to merge multiple SSTables
  – Reclaims space
  – Creates a new index
  – Merges keys
  – Combines columns
  – Discards tombstones
  – Improves performance by minimizing disk seeks
• Two types
  – Major
  – Read-only
Anti-Entropy
• Ensures synchronization of data across nodes
• Compares data checksums against neighboring nodes
• Uses Merkle trees (hash trees)
• A snapshot of the data is sent to neighboring nodes
• Created and broadcast on every major compaction
• If two nodes take snapshots within TREE_STORE_TIMEOUT of each other, the snapshots are compared and the data is synced
Merkle Tree
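A minimal Merkle-root construction to show the mechanism (a sketch of the data structure, not Cassandra's implementation, which builds the tree over token ranges):

```python
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def merkle_root(leaves):
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:                  # odd level: carry last node
            level.append(level[-1])
        level = [h(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

# Two replicas compare roots: equal roots mean the range matches;
# different roots let them descend the tree to find the divergent rows.
```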
Read Operations
• Read repair
  – On a read, nodes are queried until the number of nodes responding with the most recent value meets a specified consistency level, from ONE to ALL
  – If the consistency level is not met, nodes are updated with the most recent value, which is then returned
  – If the consistency level is met, the value is returned, and any nodes that reported old values are then updated
Read Repair
[Diagram: 6-node ring, replication_factor = 3; the client reads at CONSISTENCY ONE and the stale replicas are repaired in the background]
SELECT * FROM table USING CONSISTENCY ONE
Read Operations: Bloom Filters
• Bloom filters provide a fast way of checking whether a value is NOT in a set
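A minimal Bloom filter sketch: false positives are possible, false negatives are not, which is exactly why a negative answer lets a read skip an SSTable entirely (sizes and hash scheme are illustrative):

```python
import hashlib

class BloomFilter:
    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits, self.n_hashes, self.bits = n_bits, n_hashes, 0

    def _positions(self, key: str):
        for i in range(self.n_hashes):
            d = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.n_bits

    def add(self, key: str):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key: str) -> bool:
        # False means "definitely not present"; True means "maybe".
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
bf.add("row-42")
print(bf.might_contain("row-42"))   # True
print(bf.might_contain("row-99"))   # almost certainly False
```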
Read path
[Diagram: a read checks the Bloom filter first, then the key cache; a cache hit goes straight to the data, while a cache miss walks the partition summary, compression offsets, and partition index on disk before reading the data. Several of these structures are kept off-heap; key_cache_size_in_mb > 0, index_interval = 128 (default).]