Understanding Cassandra internals to solve real-world problems
Cassandra Internals Overview
Uploaded by beobal
STARTUP
org.apache.cassandra.service.CassandraDaemon
protected void setup()
Load config
Run preflight checks
Load schema
Clean up local temporary state
Recover CommitLog
Schedule background compactions
Initialize storage service
PREFLIGHT CHECKS
Sane clock
JNI
JVM & Instrumentation
Filesystem permissions
System keyspace status
Upgrades (#8049)
Incompatible SSTables (#8049)
STARTUP
org.apache.cassandra.db.commitlog.CommitLog
public int recover() throws IOException
INITIALIZE STORAGE SERVICE
org.apache.cassandra.service.StorageService
public synchronized void initServer() throws ConfigurationException
Load ring state (unless disabled with -Dcassandra.load_ring_state=false)
Start gossip & get initial ring info
Set tokens
Set up auth resources
Ensure gossip has stabilized
MESSAGINGSERVICE
org.apache.cassandra.net.MessagingService
Low level one-way messaging
public void sendOneWay(MessageOut message, InetAddress to)
Async Request/Response
public int sendRR(MessageOut message, InetAddress to, IAsyncCallback cb)
MESSAGINGSERVICE
org.apache.cassandra.net.MessagingService
Reads
public int sendRRWithFailure(MessageOut message,
InetAddress to,
IAsyncCallbackWithFailure cb)
Writes
public int sendRR(MessageOut<? extends IMutation> message,
InetAddress to,
AbstractWriteResponseHandler handler,
boolean allowHints)
MESSAGINGSERVICE
Pre-emptively drops messages when overwhelmed
Dropped if time at execution > send time + timeout
Timeout value dependent on message type
Most client-initiated requests can be dropped
(see MessagingService.DROPPABLE_VERBS)
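The drop rule above can be sketched in a few lines. This is only an illustration under stated assumptions, not Cassandra's code: the Verb enum, its timeout values, the DROPPABLE set, and shouldDrop are hypothetical names. The only parts taken from the slide are the rule itself (execution time > send time + timeout) and the fact that only some verbs are droppable.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the drop rule; not the real MessagingService code.
class DroppableSketch {
    // Illustrative verbs with made-up, per-type timeouts in milliseconds.
    enum Verb {
        READ(5000), MUTATION(2000), GOSSIP_DIGEST_SYN(10000);

        final long timeoutMillis;

        Verb(long timeoutMillis) { this.timeoutMillis = timeoutMillis; }
    }

    // Assumed subset of droppable verbs; the real set lives in MessagingService.DROPPABLE_VERBS.
    static final Set<Verb> DROPPABLE = EnumSet.of(Verb.READ, Verb.MUTATION);

    // Drop when the verb is droppable and execution would start after send time + timeout.
    static boolean shouldDrop(Verb verb, long sentAtMillis, long executingAtMillis) {
        return DROPPABLE.contains(verb)
            && executingAtMillis > sentAtMillis + verb.timeoutMillis;
    }

    public static void main(String[] args) {
        System.out.println(shouldDrop(Verb.READ, 0, 6000));               // true
        System.out.println(shouldDrop(Verb.READ, 0, 4000));               // false
        System.out.println(shouldDrop(Verb.GOSSIP_DIGEST_SYN, 0, 60000)); // false
    }
}
```

The point of dropping on the replica side is that the coordinator has already given up on the request by then, so executing it would only add load.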
GOSSIP
What it does do:
Disseminates members' state around the cluster
Versioned: generation (per JVM) & version (per value)
Heartbeats: incremented every gossip round
Application state: status, tokens, release & schema version, DC & rack, addresses, data size, health
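The generation/version scheme can be illustrated with a small comparison routine. This is a sketch under assumptions: VersionedState and isNewer are hypothetical names, and the ordering rule (generation first, then version) is the usual reading of "generation per JVM, version per value" rather than a copy of the Gossiper code.

```java
// Hypothetical sketch of gossip state versioning; names are illustrative only.
class GossipVersionSketch {
    static final class VersionedState {
        final int generation; // per JVM: changes when the node's process restarts
        final int version;    // per value: grows as the node updates its state

        VersionedState(int generation, int version) {
            this.generation = generation;
            this.version = version;
        }
    }

    // Remote state wins if it has a newer generation,
    // or the same generation with a higher version.
    static boolean isNewer(VersionedState remote, VersionedState local) {
        if (remote.generation != local.generation) {
            return remote.generation > local.generation;
        }
        return remote.version > local.version;
    }

    public static void main(String[] args) {
        VersionedState local = new VersionedState(100, 42);
        System.out.println(isNewer(new VersionedState(100, 43), local)); // true
        System.out.println(isNewer(new VersionedState(101, 1), local));  // true (restart)
        System.out.println(isNewer(new VersionedState(99, 500), local)); // false (stale)
    }
}
```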
GOSSIP
What it doesn't do:
Notify about up or down nodes
Propagate schema
Transmit data files
Distribute mutations
GOSSIP
https://wiki.apache.org/cassandra/ArchitectureGossip
GOSSIP
org.apache.cassandra.gms.Gossiper
private class GossipTask implements Runnable
{
public void run()
{...
Each round (1 second) gossip to:
1 live endpoint
maybe 1 unreachable endpoint
maybe 1 seed - if neither of the above
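The per-round target selection can be sketched as below. This follows the slide's wording literally and is not the Gossiper implementation: pickTargets is a hypothetical helper, endpoints are plain strings, and the "maybe" for unreachable endpoints is modelled as a coin flip purely as a stand-in for whatever probability the real code uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of one gossip round's target selection.
class GossipRoundSketch {
    static String pickOne(List<String> endpoints, Random random) {
        return endpoints.get(random.nextInt(endpoints.size()));
    }

    static List<String> pickTargets(List<String> live, List<String> unreachable,
                                    List<String> seeds, Random random) {
        List<String> targets = new ArrayList<>();
        // 1 live endpoint, if there is one.
        if (!live.isEmpty()) {
            targets.add(pickOne(live, random));
        }
        // Maybe 1 unreachable endpoint (coin flip is an assumption).
        if (!unreachable.isEmpty() && random.nextBoolean()) {
            targets.add(pickOne(unreachable, random));
        }
        // Maybe 1 seed, if neither of the above produced a target.
        if (targets.isEmpty() && !seeds.isEmpty()) {
            targets.add(pickOne(seeds, random));
        }
        return targets;
    }

    public static void main(String[] args) {
        Random random = new Random();
        List<String> none = Collections.emptyList();
        System.out.println(pickTargets(Arrays.asList("10.0.0.1"), none, Arrays.asList("10.0.0.9"), random));
        System.out.println(pickTargets(none, none, Arrays.asList("10.0.0.9"), random));
    }
}
```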
SCHEMA MIGRATION
Another custom protocol
Also uses MessagingService
Target schema objects serialized as Mutations
diff/merge schema representations
SCHEMA PUSH
org.apache.cassandra.service.MigrationManager
private static Future<?> announce(final Collection<Mutation> schema)
SCHEMA PULL
org.apache.cassandra.service.MigrationManager
public void scheduleSchemaPull(InetAddress endpoint, EndpointState state)
COORDINATION
Client request arrives at coordinator
Transformed into actionable command(s):
IReadCommand
IMutation
Coordinator distributes execution around the cluster
Replicas perform commands and respond to coordinator
Gather responses and determine client response
COORDINATION
org.apache.cassandra.service
StorageProxy
AbstractWriteResponseHandler
AbstractReadExecutor
org.apache.cassandra.locator
AbstractReplicationStrategy
IEndpointSnitch
https://wiki.apache.org/cassandra/ArchitectureInternals
COORDINATING WRITES
org.apache.cassandra.service.StorageProxy
public static void mutate(Collection<? extends IMutation> mutations,
ConsistencyLevel consistency_level)
Get endpoints using replication strategy
Get pending endpoints from ring metadata
Deliver mutations to both sets of endpoints
Collate responses & determine client response
Maybe store local hints for unreachable replicas
https://wiki.apache.org/cassandra/ArchitectureInternals
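The write path steps above can be sketched roughly as follows. This is a simplified illustration, not StorageProxy itself: allTargets, quorum, and the small ResponseHandler class are hypothetical stand-ins, endpoints are plain strings, and only a QUORUM-style count is modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of coordinating a write; names are illustrative only.
class WriteCoordinationSketch {
    // QUORUM-style requirement: a majority of the replication factor.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Mutations go to the natural replicas and to any pending (e.g. joining) endpoints.
    static List<String> allTargets(List<String> natural, List<String> pending) {
        List<String> targets = new ArrayList<>(natural);
        targets.addAll(pending);
        return targets;
    }

    // Collates acknowledgements until enough replicas have responded.
    static final class ResponseHandler {
        private final int blockFor;
        private int acks;

        ResponseHandler(int blockFor) { this.blockFor = blockFor; }

        void onResponse() { acks++; }

        boolean satisfied() { return acks >= blockFor; }
    }

    public static void main(String[] args) {
        List<String> natural = Arrays.asList("10.0.0.1", "10.0.0.2", "10.0.0.3");
        List<String> pending = Arrays.asList("10.0.0.4");
        System.out.println(allTargets(natural, pending).size()); // 4

        ResponseHandler handler = new ResponseHandler(quorum(natural.size()));
        handler.onResponse();
        System.out.println(handler.satisfied()); // false
        handler.onResponse();
        System.out.println(handler.satisfied()); // true
    }
}
```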
DELIVERING MUTATIONS
org.apache.cassandra.service.StorageProxy
public static void sendToHintedEndpoints(final Mutation mutation,
Iterable<InetAddress> targets,
AbstractWriteResponseHandler responseHandler,
String localDataCenter)
Mutations sent to replicas using MessagingService
ResponseHandler registered as callback
Callback registry triggers an event on expiry
Sent directly within local datacenter
Forwarded via single node in each remote DC
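The "direct within the local DC, forwarded via a single node per remote DC" rule can be sketched as a small planning step. The plan method and its data shapes are hypothetical, and picking the first replica seen in each remote DC as the forwarder is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of planning cross-DC mutation delivery.
class CrossDcDeliverySketch {
    // Returns: node contacted directly -> nodes it is asked to forward the mutation to.
    static Map<String, List<String>> plan(List<String> targets,
                                          Map<String, String> dcOf,
                                          String localDc) {
        Map<String, List<String>> sends = new LinkedHashMap<>();
        Map<String, String> forwarderByDc = new HashMap<>();
        for (String target : targets) {
            String dc = dcOf.get(target);
            if (dc.equals(localDc)) {
                // Local replicas are always contacted directly.
                sends.put(target, new ArrayList<>());
            } else if (!forwarderByDc.containsKey(dc)) {
                // First replica seen in a remote DC becomes that DC's forwarder.
                forwarderByDc.put(dc, target);
                sends.put(target, new ArrayList<>());
            } else {
                // Remaining remote replicas are reached through the forwarder.
                sends.get(forwarderByDc.get(dc)).add(target);
            }
        }
        return sends;
    }

    public static void main(String[] args) {
        Map<String, String> dcOf = new HashMap<>();
        dcOf.put("a1", "dc1");
        dcOf.put("a2", "dc1");
        dcOf.put("b1", "dc2");
        dcOf.put("b2", "dc2");
        dcOf.put("b3", "dc2");
        // prints {a1=[], a2=[], b1=[b2, b3]}
        System.out.println(plan(Arrays.asList("a1", "a2", "b1", "b2", "b3"), dcOf, "dc1"));
    }
}
```

Sending one message per remote DC keeps cross-DC traffic down at the cost of one extra hop inside the remote DC.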
HINTS
Nodes can be down
Writes may time out
In either case the coordinator may store a hint
Enabled/disabled globally or enabled per-DC
Writing a hint counts towards ConsistencyLevel.ANY
Hints are delivered when a node comes back up, and periodically
Too many hints in progress for a replica means we bail early
Determine point of failure by WriteType
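The hinting rules above can be sketched as a small decision function. Everything here is hypothetical: the HintSketch class, its fields, the Decision enum, and the per-replica threshold are invented for illustration and do not mirror Cassandra's configuration names or its exact overload check.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of deciding whether to write a hint for an unreachable replica.
class HintSketch {
    enum Decision { HINT, NO_HINT, BAIL_EARLY }

    final boolean hintsEnabled;        // global on/off switch
    final Set<String> hintedDcs;       // empty set means "all DCs" in this sketch
    final int maxInProgressPerReplica; // made-up threshold for illustration

    HintSketch(boolean hintsEnabled, Set<String> hintedDcs, int maxInProgressPerReplica) {
        this.hintsEnabled = hintsEnabled;
        this.hintedDcs = hintedDcs;
        this.maxInProgressPerReplica = maxInProgressPerReplica;
    }

    Decision decide(String replicaDc, int hintsInProgressForReplica) {
        if (!hintsEnabled) {
            return Decision.NO_HINT;
        }
        if (!hintedDcs.isEmpty() && !hintedDcs.contains(replicaDc)) {
            return Decision.NO_HINT;
        }
        // Too many hints already in flight for this replica: fail fast instead of queueing more.
        if (hintsInProgressForReplica >= maxInProgressPerReplica) {
            return Decision.BAIL_EARLY;
        }
        return Decision.HINT;
    }

    public static void main(String[] args) {
        HintSketch sketch = new HintSketch(true, new HashSet<>(Arrays.asList("dc1")), 2);
        System.out.println(sketch.decide("dc1", 0)); // HINT
        System.out.println(sketch.decide("dc2", 0)); // NO_HINT
        System.out.println(sketch.decide("dc1", 2)); // BAIL_EARLY
        System.out.println(new HintSketch(false, Collections.<String>emptySet(), 2).decide("dc1", 0)); // NO_HINT
    }
}
```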
LOGGED BATCHES
org.apache.cassandra.service.StorageProxy
public static void mutateAtomically(Collection<Mutation> mutations,
ConsistencyLevel consistency_level)
CommitLog for batches
Guarantee eventual success of batched statements
Strives to distribute across racks in the local DC
On success, cleanup log entries asynchronously
Failed batches replayed by the nodes holding the logs
WriteType.BATCH_LOG
WriteType.BATCH
COORDINATING READS
org.apache.cassandra.service.StorageProxy
public static List<Row> read(List<ReadCommand> commands,
ConsistencyLevel consistencyLevel,
ClientState state)
Partition based reads
Read Repair & Data vs Digest Requests
Rapid Read Protection & (non)speculating executors
Distribution is slightly more complex than for writes
IDENTIFY TARGET ENDPOINTS
org.apache.cassandra.service.AbstractReadExecutor
public static AbstractReadExecutor getReadExecutor(ReadCommand command,
ConsistencyLevel consistencyLevel)
Use replication strategy to get live endpoints
Snitch sorts by proximity & health of replicas
Consult table metadata for Read Repair Decision
READ REPAIR DECISION
Apply filter to sorted list of all live replicas
NONE: closest n replicas required by CL
GLOBAL: all live replicas
DC_LOCAL: all local replicas, plus the closest n remotes needed to satisfy CL
Default Global Chance: 0.0
Default Local Chance: 0.1
Gives us a list of replicas to send read requests to
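The decision and the filter it drives can be sketched together. This is an illustration under assumptions: roll and filter are hypothetical helpers, replicas are plain strings already sorted by proximity, and comparing a single random draw against the two chances is one plausible way to realise the slide's "chance" settings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the read repair decision and replica filter.
class ReadRepairDecisionSketch {
    enum Decision { NONE, GLOBAL, DC_LOCAL }

    // draw is a random number in [0, 1); the chances come from table metadata.
    static Decision roll(double globalChance, double localChance, double draw) {
        if (globalChance > draw) return Decision.GLOBAL;
        if (localChance > draw) return Decision.DC_LOCAL;
        return Decision.NONE;
    }

    // sortedLive is ordered closest-first; blockFor is the replica count required by the CL.
    static List<String> filter(Decision decision, List<String> sortedLive,
                               Set<String> localReplicas, int blockFor) {
        switch (decision) {
            case GLOBAL:
                return new ArrayList<>(sortedLive);
            case DC_LOCAL: {
                List<String> chosen = new ArrayList<>();
                List<String> remotes = new ArrayList<>();
                for (String replica : sortedLive) {
                    if (localReplicas.contains(replica)) chosen.add(replica);
                    else remotes.add(replica);
                }
                // Top up with the closest remotes only if the locals cannot satisfy the CL.
                for (String remote : remotes) {
                    if (chosen.size() >= blockFor) break;
                    chosen.add(remote);
                }
                return chosen;
            }
            default:
                return new ArrayList<>(sortedLive.subList(0, Math.min(blockFor, sortedLive.size())));
        }
    }

    public static void main(String[] args) {
        List<String> live = Arrays.asList("l1", "l2", "r1", "r2");
        Set<String> local = new HashSet<>(Arrays.asList("l1", "l2"));
        System.out.println(roll(0.0, 0.1, 0.05));                   // DC_LOCAL
        System.out.println(filter(Decision.NONE, live, local, 1));     // [l1]
        System.out.println(filter(Decision.DC_LOCAL, live, local, 1)); // [l1, l2]
        System.out.println(filter(Decision.DC_LOCAL, live, local, 3)); // [l1, l2, r1]
        System.out.println(filter(Decision.GLOBAL, live, local, 1));   // [l1, l2, r1, r2]
    }
}
```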
LIGHTS, CAMERA, EXECUTION
Fire off each command using read executor
Requests are sent via MessagingService
Closest replica(s) sent full data requests
Others get digest requests
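The data-versus-digest split can be illustrated as follows. This is a sketch, not the read executor: rows are represented as plain strings, the digest and digestsMatch helpers are hypothetical, and MD5 is used here only as an assumed hash for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical sketch of comparing one full data response against digest responses.
class DigestReadSketch {
    // What a replica answering a digest request would send back instead of the row.
    static byte[] digest(String rowData) {
        try {
            return MessageDigest.getInstance("MD5")
                                .digest(rowData.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // The coordinator hashes the full data response and checks every digest against it.
    static boolean digestsMatch(String dataResponse, byte[]... digestResponses) {
        byte[] expected = digest(dataResponse);
        for (byte[] received : digestResponses) {
            if (!Arrays.equals(expected, received)) {
                return false; // a mismatch means the replicas disagree
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(digestsMatch("row-v1", digest("row-v1"), digest("row-v1"))); // true
        System.out.println(digestsMatch("row-v1", digest("row-v1"), digest("row-v0"))); // false
    }
}
```

Digest requests keep the common case cheap: only one replica ships the full row, and the others ship a fixed-size hash.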
FOREGROUND READ REPAIR
All data requests, no digests
Includes replicas contacted initially
Effectively ConsistencyLevel.ALL
Specialized resolver: RowDataResolver
Retry any short reads
May also perform background Read Repair
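Once every replica has returned full data, resolving them might look like the sketch below. It is not RowDataResolver: the Version class, resolve, and replicasToRepair are hypothetical, each response is reduced to a single value with a write timestamp, and last-write-wins by timestamp is the reconciliation rule assumed here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of resolving full data responses during foreground read repair.
class ForegroundRepairSketch {
    static final class Version {
        final String value;
        final long timestamp; // write timestamp; the highest one wins in this sketch

        Version(String value, long timestamp) {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    // Pick the most recently written version across all replica responses.
    static Version resolve(Map<String, Version> responsesByReplica) {
        Version newest = null;
        for (Version version : responsesByReplica.values()) {
            if (newest == null || version.timestamp > newest.timestamp) {
                newest = version;
            }
        }
        return newest;
    }

    // Replicas holding an older version need a repair mutation with the newest one.
    static List<String> replicasToRepair(Map<String, Version> responsesByReplica) {
        Version newest = resolve(responsesByReplica);
        List<String> stale = new ArrayList<>();
        for (Map.Entry<String, Version> entry : responsesByReplica.entrySet()) {
            if (entry.getValue().timestamp < newest.timestamp) {
                stale.add(entry.getKey());
            }
        }
        return stale;
    }

    public static void main(String[] args) {
        Map<String, Version> responses = new LinkedHashMap<>();
        responses.put("replica1", new Version("new", 200));
        responses.put("replica2", new Version("old", 100));
        responses.put("replica3", new Version("new", 200));
        System.out.println(resolve(responses).value);    // new
        System.out.println(replicasToRepair(responses)); // [replica2]
    }
}
```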