Cassandra from tarball to production
Uploaded by ron-kuris

Transcript of Cassandra from tarball to production
![Page 1: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/1.jpg)
Cassandra: From tarball to production
![Page 2: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/2.jpg)
Why talk about this?

You are about to deploy Cassandra and are looking for "best practices". You don't want:
● ... to scour through the documentation
● ... to do something known not to work well
● ... to forget to cover some important step
![Page 3: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/3.jpg)
What we won't cover
● Cassandra: how does it work?
● How do I design my schema?
● What's new in Cassandra X.Y?
![Page 4: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/4.jpg)
So many things to do

Monitoring · Snitch · DC/Rack Settings · Time Sync · Seeds/Autoscaling · Full/Incremental Backups · AWS Instance Selection · Disk - SSD? · Disk Space - 2x? · AWS AMI (Image) Selection · Periodic Repairs · Replication Strategy · Compaction Strategy · SSL/VPC/VPN · Authorization + Authentication · OS Conf - Users · OS Conf - Limits · OS Conf - Perms · OS Conf - FSType · OS Conf - Logs · C* Start/Stop · OS Conf - Path · Use case evaluation
![Page 5: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/5.jpg)
Chef to the rescue?

Chef community cookbook available:
https://github.com/michaelklishin/cassandra-chef-cookbook

● Installs Java
● Creates a "cassandra" user/group
● Downloads/extracts the tarball
● Fixes up ownership
● Builds the C* configuration files
● Sets the ulimits for file handles, processes, and memory locking
● Sets up an init script
● Sets up data directories
![Page 6: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/6.jpg)
Chef Cookbook Coverage

Monitoring · Snitch · DC/Rack Settings · Time Sync · Seeds/Autoscaling · Full/Incremental Backups · Disk - SSD? · Disk - How much? · AWS Instance Type · AWS AMI (Image) Selection · Periodic Repairs · Replication Strategy · Compaction Strategy · SSL/VPC/VPN · Authorization + Authentication · OS Conf - Users · OS Conf - Limits · OS Conf - Perms · OS Conf - FSType · OS Conf - Logs · C* Start/Stop · OS Conf - Path · Use case evaluation
![Page 7: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/7.jpg)
Monitoring

Is every node answering queries? Are nodes talking to each other? Are any nodes running slowly?

Push UDP! (statsd)
http://hackers.lookout.com/2015/01/cassandra-monitoring/
https://github.com/lookout/cassandra-statsd-agent
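The push-UDP idea can be sketched in a few lines. The metric name, host, and port below are illustrative; the cassandra-statsd-agent linked above collects real metrics from the JVM via JMX, but the transport principle is the same:

```python
import socket

def statsd_packet(name, value, mtype="c"):
    """Format one statsd metric line, e.g. 'cassandra.reads:1|c'."""
    return f"{name}:{value}|{mtype}"

def push_metric(name, value, mtype="c", host="127.0.0.1", port=8125):
    """Fire-and-forget UDP push: no connection, no ack, never blocks the app."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(statsd_packet(name, value, mtype).encode(), (host, port))
    finally:
        sock.close()
```

Because the send is UDP, a down or slow statsd server can never stall the node being monitored, which is exactly why push beats pull here.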
![Page 8: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/8.jpg)
Monitoring - Synthetic

Health checks, bad and good:
● 'nodetool status' exit code
○ Might return 0 if the node is not accepting requests
○ Slow, cross-node reads
● cqlsh -u sysmon -p password < /dev/null
● Verifies this node can read the auth table
● https://github.com/lookout/cassandra-health-check
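A minimal wrapper around the cqlsh-based check might look like this (the `sysmon` credentials come from the slide; the injectable `run` hook is an assumption added so the logic can be exercised without a live cluster):

```python
import subprocess

def cassandra_healthy(run=subprocess.run):
    """Health check in the spirit of `cqlsh -u sysmon -p password < /dev/null`:
    a successful login forces a read of the auth table on this node, so exit
    code 0 means the node can actually serve reads, not just that the JVM
    process exists."""
    try:
        result = run(
            ["cqlsh", "-u", "sysmon", "-p", "password"],
            stdin=subprocess.DEVNULL,
            capture_output=True,
            timeout=10,
        )
    except Exception:
        # timeout, missing binary, etc. all count as unhealthy
        return False
    return result.returncode == 0
```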
![Page 9: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/9.jpg)
What about OpsCenter?

We chose not to use it:
● We want a consistent interface for all monitoring
● GUI vs. command-line argument
● Didn't see good auditing capabilities
● Didn't interface well with our Chef solution
![Page 10: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/10.jpg)
Snitch

Use the right snitch!
● AWS? EC2MultiRegionSnitch
● Google? GoogleCloudSnitch
● GossipingPropertyFileSnitch
NOT
● SimpleSnitch (the default)

Community cookbook: set it!
![Page 11: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/11.jpg)
What is RF?

Replication Factor is how many copies of the data exist. The value is hashed to determine the primary host; additional copies always go to the next nodes.
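The "additional copies always next node" rule can be sketched as a walk around the ring (a toy model of SimpleStrategy placement; node indices stand in for token positions):

```python
def replica_nodes(primary, rf, n_nodes):
    """The key's hash picks the primary node; the remaining RF-1 copies
    go to the next nodes on the ring, wrapping around at the end."""
    return [(primary + i) % n_nodes for i in range(rf)]
```

For example, with 6 nodes and RF=3, a key whose hash lands on node 4 is also stored on nodes 5 and 0.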
![Page 12: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/12.jpg)
What is CL?

Consistency Level -- it's not RF! It describes how many nodes must respond before an operation is considered COMPLETE.
● CL_ONE - only one node responds
● CL_QUORUM - (RF/2)+1 nodes (round down)
● CL_ALL - all RF nodes respond
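The quorum arithmetic is worth making concrete, because it is also why odd replication factors are preferred later in the deck:

```python
def quorum(rf):
    """CL_QUORUM responders: floor(RF/2) + 1, a strict majority of replicas."""
    return rf // 2 + 1

def tolerated_failures(rf):
    """Replicas you can lose while still satisfying CL_QUORUM."""
    return rf - quorum(rf)
```

Note that RF=4 needs 3 responders but still only tolerates one dead replica, the same as RF=3; the fourth copy costs disk and write traffic without buying extra quorum availability.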
![Page 13: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/13.jpg)
DC/Rack Settings

You might need to set these -- maybe you're not in Amazon. Rack == Availability Zone? Hard: renaming a DC or adding racks.
![Page 14: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/14.jpg)
Renaming DCs

Clients "remember" which DC they talk to, so renaming a single DC causes all clients to fail. Better to spin up a new DC than rename the old one.
![Page 15: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/15.jpg)
Adding a rack

Start with a 6-node cluster, all in rack R1, with replication factor 3. Add 1 node in R2 and rebalance: ALL the data ends up with a copy on the lone R2 node! It's a good idea to keep racks balanced.
![Page 16: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/16.jpg)
I don't have time for this

Clusters must have synchronized time. You will get lots of drift with [0-3].amazon.pool.ntp.org. The community cookbook doesn't cover anything here.
![Page 17: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/17.jpg)
Better make time for this

C* serializes write operations by timestamps, and clocks on virtual machines drift! It's the relative difference among clocks that matters, so C* nodes should synchronize with each other. Solution: use a pair of peered NTP servers (stratum 2 or 3) pointed at a small set of known upstream providers.
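A sketch of what the peered setup might look like in ntpd terms (hostnames are illustrative; the point is that the two internal servers peer with each other, and every Cassandra node syncs only to them):

```
# /etc/ntp.conf on each of the two internal NTP servers
server 0.pool.ntp.org iburst   # small, fixed set of upstreams
server 1.pool.ntp.org iburst
peer   ntp2.internal           # the other internal server; peering keeps
                               # the pair (and thus the fleet) drifting together

# /etc/ntp.conf on every Cassandra node
server ntp1.internal iburst
server ntp2.internal iburst
```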
![Page 18: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/18.jpg)
From a small seed...

● Seeds are used by new nodes to find the cluster
● Every new node should use the same seeds
● Seed nodes get topology changes faster
● Each seed node must be in the config file
● Multiple seeds per datacenter recommended
● Tricky to configure on AWS
![Page 19: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/19.jpg)
Backups - Full + Incremental

Nothing in the cookbooks for this. C* makes it "easy": snapshot, then copy. Snapshots might require a lot more space, so remove the snapshot after copying it.
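The snapshot-copy-clear cycle might be wrapped like this. The `copy` callable is a placeholder for whatever ships the snapshot off the node (rsync, an S3 upload, ...), and the `-t` flag on clearsnapshot assumes a reasonably recent nodetool:

```python
import subprocess

def backup_keyspace(keyspace, tag, copy, run=subprocess.run):
    """Snapshot, copy off-node, then clear. `run` is injectable so the
    ordering can be tested without a live node."""
    run(["nodetool", "snapshot", "-t", tag, keyspace], check=True)
    try:
        copy(keyspace, tag)  # ships .../snapshots/<tag> somewhere safe
    finally:
        # snapshots are hard links that pin old SSTables on disk,
        # so always remove them once the copy is safely elsewhere
        run(["nodetool", "clearsnapshot", "-t", tag, keyspace], check=True)
```

The `finally` matters: a failed copy must not leave a forgotten snapshot silently eating the disk headroom the next slides tell you to preserve.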
![Page 20: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/20.jpg)
Disk selection

| SSD (ephemeral) | Rotational (ephemeral) | EBS |
|---|---|---|
| Low latency | Any size instance | Any size instance |
| Recommended, not cheap | Less expensive | |
| Great random r/w perf | Good write performance | No node rebuilds |
| No network use for disk | No network use for disk | |
![Page 21: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/21.jpg)
AWS Instance Selection

We moved to EC2. c3.2xlarge (15 GiB mem, 160 GB disk)? i2.xlarge (30 GiB mem, 800 GB disk)?
Max recommended storage per node is 1 TB.
Use instance types that support HVM. Some previous-generation instance types, such as T1, C1, M1, and M2, do not support Linux HVM AMIs. Some current-generation instance types, such as T2, I2, R3, G2, and C4, do not support PV AMIs.
![Page 22: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/22.jpg)
How much can I use?

Snapshots take space (kind of). Best practice: keep disks half full! An 800 GB disk becomes 400 GB usable. Snapshots during repairs? Lots of uses for snapshots!
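The half-full rule turns into simple capacity math. A hypothetical planning helper (the 0.5 target and the node-count formula are this sketch's assumptions, built on the rule above):

```python
import math

def usable_gb(disk_gb, fill_target=0.5):
    """Keep disks at most half full so compactions, repairs, and
    snapshots have headroom."""
    return disk_gb * fill_target

def nodes_needed(data_gb, rf, disk_gb, fill_target=0.5):
    """Minimum node count: total replicated data over per-node usable space."""
    return math.ceil(data_gb * rf / usable_gb(disk_gb, fill_target))
```

So the i2.xlarge's 800 GB disk yields 400 GB usable, and 2 TB of data at RF=3 already needs 15 such nodes.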
![Page 23: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/23.jpg)
Periodic Repairs

Buried in the docs: "As a best practice, you should schedule repairs weekly"
http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_repair_nodes_c.html
● "-pr" (yes)
● "-par" (maybe)
● "--in-local-dc" (no)
![Page 24: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/24.jpg)
Repair Tips

● Raise gc_grace_seconds (tombstones?)
● Run on one node at a time
● Schedule for low-usage hours
● Use "-par" if you have dead time (it's faster)
● Tune with: nodetool setcompactionthroughput
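Putting the weekly cadence and the low-usage-hours tip together, the schedule might be nothing more than a cron entry per node (paths and the 3am-Sunday slot are illustrative; stagger the day or hour per node so only one repairs at a time):

```
# /etc/cron.d/cassandra-repair -- weekly primary-range repair, off-peak
0 3 * * 0  cassandra  nodetool repair -pr >> /var/log/cassandra/repair.log 2>&1
```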
![Page 25: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/25.jpg)
I thought I deleted that

Compaction removes "old" tombstones after a grace period, 10 days by default (gc_grace_seconds). After that, deletes will not be propagated! Run 'nodetool repair' at least every 10 days; once a week is perfect (3 days of slack). Node down more than 7 days? 'nodetool removenode' it!
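The relationship between repair cadence and tombstone expiry is simple arithmetic, and worth checking whenever either number changes:

```python
GC_GRACE_SECONDS = 10 * 24 * 3600  # Cassandra's 10-day default

def repair_slack_days(repair_every_days, gc_grace=GC_GRACE_SECONDS):
    """Days of safety margin between the repair cadence and tombstone expiry.
    If this goes negative, tombstones can expire before every replica has
    seen the delete, and deleted data can resurrect."""
    return gc_grace / 86400 - repair_every_days
```

Weekly repairs against the default grace leave 3 days of slack, matching the slide; repairing every two weeks would leave -4, which is exactly the resurrection scenario.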
![Page 26: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/26.jpg)
Changing RF within a DC?

Easy to decrease RF. Increasing RF is harder: until a repair populates the new replicas, reads at CL_ONE might fail!
![Page 27: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/27.jpg)
Replication Strategy

● How many replicas should we have?
● What happens if some data is lost?
● Are you write-heavy or read-heavy?
● Quorum considerations: odd is better!
● RF=1? RF=3? RF=5?
![Page 28: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/28.jpg)
Magic JMX setting: reduce traffic to a node

Great when a node is "behind" the 4-hour window. Used by the gossiper to divert traffic during repairs. Writes: OK; read repair: OK; nodetool repair: OK.

    $ java -jar jmxterm.jar -l localhost:7199
    $> set -b org.apache.cassandra.db:type=DynamicEndpointSnitch Severity 10000

Don't be too severe!
![Page 29: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/29.jpg)
Compaction Strategy

Solved by using a good C* design. SizeTiered or Leveled?
● Leveled has better guarantees for read times
● SizeTiered may require 10 (or more) reads!
● Leveled uses less disk space
● Leveled tombstone collection is slower
![Page 30: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/30.jpg)
Auth*

Cookbooks default to OFF. Turn the authenticator and authorizer on. The 'cassandra' user is super special: its signon requires QUORUM (cross-DC); all other users sign on at LOCAL_ONE!
![Page 31: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/31.jpg)
Users

● OS users vs. Cassandra users: 1 to 1?
● Shared credentials for apps? Nothing logs the user taking the action!
● The 'cassandra' user is created by the cookbook; all processes run as 'cassandra'
![Page 32: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/32.jpg)
Limits

Chef helps here! At startup:

    ulimit -l unlimited   # mem lock
    ulimit -n 48000       # fds

/etc/security/limits.d:

    cassandra - nofile 48000
    cassandra - nproc unlimited
    cassandra - memlock unlimited
![Page 33: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/33.jpg)
Filesystem Type

Officially supported: ext4 or XFS (XFS is slightly faster). Interesting options:
● ext4 without journal
● ext2
● ZFS
![Page 34: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/34.jpg)
Logs

To consolidate or not to consolidate? Push or pull? Usually push!
FOSS: syslogd, syslog-ng, logstash/kibana, heka, banana
Commercial: Splunk, SumoLogic, Loggly, Stackify
![Page 35: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/35.jpg)
Shutdown

Nice init script with the cookbook; the steps are:
● nodetool disablethrift (no more clients)
● nodetool disablegossip (stop talking to the cluster)
● nodetool drain (flush all memtables)
● kill the JVM
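The ordering of those steps is the whole point, so a wrapper should hard-code it. A sketch (the final `service cassandra stop` stands in for however your init system kills the JVM):

```python
import subprocess

def drain_and_stop(run=subprocess.run):
    """The init script's ordering: stop client traffic, stop gossiping,
    flush memtables, then stop the process."""
    for step in (["nodetool", "disablethrift"],   # no more clients
                 ["nodetool", "disablegossip"],   # stop talking to the cluster
                 ["nodetool", "drain"]):          # flush all memtables
        run(step, check=True)
    run(["service", "cassandra", "stop"], check=True)
```

Draining before the kill means the commit log is empty at startup, which makes restarts noticeably faster.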
![Page 36: Cassandra from tarball to production](https://reader031.fdocuments.in/reader031/viewer/2022030315/5884151b1a28ab95518b676b/html5/thumbnails/36.jpg)
Quick performance wins
● Disable assertions -- cookbook property
● No swap space (or vm.swappiness=1)
● max_concurrent_reads
● max_concurrent_writes