Cassandra at Twitter
Cassandra SF July 11th, 2011
Cassandra @ Twitter
Team
Chris Goffinet (@lennox)
Stu Hood (@stuhood)
Ryan King (@rk)
Oscar Moll
Alan Liang
Melvin Wang
(@padauk9, @alan)
Measuring ourselves
#prostyle
Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing
Hardware Platform
‣ CPU Core Utilization
‣ Memory bandwidth and consumption
‣ Machine cost
‣ RAID
‣ Filesystems and I/O Schedulers
‣ IOPS
‣ Network bandwidth
‣ Kernel
Hardware Platform
Filesystem configurations
‣ Ext4
  ‣ Data mode = Ordered
  ‣ Data mode = Writeback
‣ XFS
‣ RAID
  ‣ 0 and 10
  ‣ far side vs near side copies
  ‣ 128 vs 256 vs 512 stripe sizes
Hardware Platform
I/O Schedulers
‣ CFQ vs Noop vs Deadline vs Anticipatory
‣ Workloads
  ‣ Timeseries
  ‣ 50/50
‣ Measure
  ‣ p90
  ‣ p99
  ‣ Average
  ‣ Max
Hardware Platform
I/O Schedulers
Scheduler      p90    p99    Average   Max
cfq            73ms   210ms  11.72ms   4940ms
noop           47ms   167ms  9.12ms    4132ms
deadline       75ms   233ms  12.72ms   3718ms
anticipatory   76ms   214ms  12.37ms   5120ms
50/50 - Reads
Hardware Platform
I/O Schedulers
Scheduler      p90    p99    Average   Max
cfq            2ms    2ms    2.02ms    5927ms
noop           2ms    2ms    2.06ms    3475ms
deadline       2ms    2ms    2.13ms    3718ms
anticipatory   2ms    2ms    2.03ms    5119ms
50/50 - Writes
Hardware Platform
Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing
‣ How efficient is our on-disk storage?
‣ Could we do compression?
‣ Do we have CPU to trade?
‣ How do we push for better?
‣ Is it worth it?
Data Storage
                           Old   New
Easy to Implement           X
Checksumming                      X
Varint Encoding                   X
Delta Encoding                    X
Type Specific Compression         X
Fixed Size Blocks                 X
Data Storage
How did we do?
Data Storage
‣ 1.5x?
‣ 2.5x?
‣ 3.5x?
Data Storage
7.03x
Data Storage
                 Rows     Columns   Size on disk      Bytes per column
Current Format   10,000   250M      16,716,432,189    66.8
New Format       10,000   250M      2,375,027,696     9.5
10,000 rows; 250M columns
Data Storage
Timeseries
LongType column names
CounterColumnType values
Data Storage
‣ compression
  ‣ type specific (see the encoding sketch after this list)
‣ fine-grained corruption detection
‣ index promotion
‣ normalizing narrow and wide rows
‣ predictable performance
‣ no double-pass on compaction
‣ range and slice deletes
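Most of the 7.03x above comes from type-specific encoding of the timeseries data: LongType column names are sorted timestamps, so their deltas are tiny. A minimal sketch of that idea in Java, illustrative rather than Twitter's actual format code:

import java.io.ByteArrayOutputStream;

// Delta-encode monotonically increasing LongType column names
// (timestamps), then varint-encode the deltas. Small deltas take
// 1-2 bytes instead of 8, which is where most of the on-disk
// savings for timeseries comes from.
final class VarintDeltaEncoder {
    // Standard unsigned varint: 7 bits per byte, high bit = "more".
    static void writeVarint(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
    }

    // Encode sorted timestamps as varint deltas from the previous value.
    static byte[] encode(long[] timestamps) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long prev = 0;
        for (long t : timestamps) {
            writeVarint(out, t - prev);
            prev = t;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        long[] ts = {1310000000000L, 1310000000060L, 1310000000120L};
        // 24 bytes raw (3 x 8-byte longs) vs ~8 bytes encoded here.
        System.out.println(encode(ts).length + " bytes");
    }
}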
Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing
‣ What are our issues?
‣ Compaction Performance?
‣ Caching?
‣ Too many disk seeks?
‣ Garbage Collection?
Latency and Throughput
‣ Compaction
Latency and Throughput
‣ Multithread Compaction + Throttling (see the sketch after this list)
‣ Compact each bucket in parallel
‣ Throttle across all buckets
‣ Compaction running all the time
‣ CASSANDRA-2191
‣ CASSANDRA-2156
Latency and Throughput
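A minimal sketch of the throttling half, assuming a reservation-style rate limiter shared by all compaction threads; this is illustrative, not the actual CASSANDRA-2156 patch:

// One throttle instance shared by every bucket-compaction thread, so
// buckets compact in parallel while aggregate disk bandwidth stays capped.
final class CompactionThrottle {
    private final long bytesPerSecond;
    private long nextFreeTimeNanos = System.nanoTime();

    CompactionThrottle(long bytesPerSecond) {
        this.bytesPerSecond = bytesPerSecond;
    }

    // Each compaction thread calls this per chunk it writes; reservations
    // are serialized so the combined rate across threads stays capped.
    synchronized void acquire(long bytes) throws InterruptedException {
        long now = System.nanoTime();
        long start = Math.max(now, nextFreeTimeNanos);
        nextFreeTimeNanos = start + bytes * 1_000_000_000L / bytesPerSecond;
        long waitNanos = start - now;
        if (waitNanos > 0)
            Thread.sleep(waitNanos / 1_000_000L, (int) (waitNanos % 1_000_000L));
    }
}

Because the budget is shared rather than per-thread, compaction can run all the time without starving foreground reads and writes.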
‣ Measure latency (see the sketch below)
‣ p99
‣ p999
‣ No averages!
‣ Every customer has p99 and p999 targets we must hit
‣ 24x7 on-call rotation
Latency and Throughput
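Why no averages? A minimal sketch of nearest-rank percentiles shows how the tail hides behind a mean (the numbers here are illustrative):

import java.util.Arrays;

// Sort sampled request latencies and read off order statistics;
// p99/p999 expose the tail that an average hides entirely.
final class Percentiles {
    // Nearest-rank percentile over sorted latency samples in ms.
    static double percentile(double[] sorted, double p) {
        int rank = (int) Math.ceil(p * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(sorted.length - 1, rank))];
    }

    public static void main(String[] args) {
        double[] latencies = {2, 2, 2, 2, 2, 2, 2, 2, 3, 950}; // one slow outlier
        Arrays.sort(latencies);
        double avg = Arrays.stream(latencies).average().orElse(0);
        System.out.printf("avg=%.1fms p99=%.1fms%n",
                avg, percentile(latencies, 0.99));
        // avg ~97ms looks tolerable; p99 = 950ms tells the real story.
    }
}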
Latency and Throughput
‣ Caching?
‣ In-heap
‣ Off-heap
‣ Pluggable cache
‣ Memcache
Case Study: Tweet Button
‣ Growth was requiring the entire dataset to fit in memory. Why?
‣ How big is the active dataset within 24 hours?
‣ What happens when dataset outgrows memory?
‣ Could other storage solutions do better?
‣ What are we missing here?
Case Study: Tweet Button
‣ Key size: variable length (each one a URL)
‣ Implement hashing on keys (see the sketch below)
‣ Can we do better?
‣ But... the cache in Java isn’t very efficient...
‣ or is it?
Case Study: Tweet Button
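A minimal sketch of hashing variable-length URL keys down to fixed-size digests; the digest choice (MD5, 16 bytes) is our assumption, not stated on the slides:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Replace arbitrarily long URL keys with a fixed-size digest so every
// cache entry's key costs the same 16 bytes regardless of URL length.
final class KeyHasher {
    static byte[] hashKey(String url) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        return md.digest(url.getBytes(StandardCharsets.UTF_8));
    }
}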
Case Study: Tweet Button
‣ On-heap
  ‣ Requires us to scale the JVM heap with the cache
‣ Off-heap
  ‣ Store pointers to data allocated outside the JVM
‣ Memcache
  ‣ Out of process
Case Study: Tweet Button
‣ On-heap
  ‣ Data + CLHM overhead (87GB)
‣ Off-heap (see the sketch below)
  ‣ CLHM overhead (67GB just for the pointers!)
‣ Memcache
  ‣ Internal overhead + data (48GB!)
* CLHM = ConcurrentLinkedHashMap
Case Study: Tweet Button
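A minimal sketch of the off-heap variant: keep only small key + pointer entries on the heap and push values into direct ByteBuffers. Production used CLHM; an access-ordered LinkedHashMap stands in here to keep the sketch dependency-free:

import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Values live in direct ByteBuffers outside the JVM heap, so the GC
// never has to trace or copy the cached data itself.
final class OffHeapLruCache {
    private final Map<ByteBuffer, ByteBuffer> lru;

    OffHeapLruCache(int maxEntries) {
        lru = new LinkedHashMap<ByteBuffer, ByteBuffer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<ByteBuffer, ByteBuffer> e) {
                return size() > maxEntries; // evict least-recently-used entry
            }
        };
    }

    synchronized void put(byte[] key, byte[] value) {
        ByteBuffer off = ByteBuffer.allocateDirect(value.length); // off-heap
        off.put(value);
        off.flip();
        lru.put(ByteBuffer.wrap(key), off); // only key + pointer stay on-heap
    }

    synchronized byte[] get(byte[] key) {
        ByteBuffer off = lru.get(ByteBuffer.wrap(key));
        if (off == null) return null;
        byte[] copy = new byte[off.remaining()];
        off.duplicate().get(copy); // copy back on-heap for the caller
        return copy;
    }
}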
[Diagram: a memcache instance co-located with each Cassandra node]
‣ Co-locate memcache on each node
‣ Routing + Cache replication
‣ Write-through LRU (see the sketch below)
‣ Rolling restarts do not cause degraded performance states
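A minimal sketch of the write-through pattern; the Store and Cache interfaces are stand-ins, not our actual client APIs:

// Every write lands in both the durable store and the co-located cache,
// so the cache is never stale and reads rarely fall through to disk.
interface Store { void write(String key, byte[] value); byte[] read(String key); }
interface Cache { void set(String key, byte[] value); byte[] get(String key); }

final class WriteThroughStore {
    private final Store cassandra; // durable storage
    private final Cache memcache;  // cache co-located on the same node

    WriteThroughStore(Store cassandra, Cache memcache) {
        this.cassandra = cassandra;
        this.memcache = memcache;
    }

    void write(String key, byte[] value) {
        cassandra.write(key, value);
        memcache.set(key, value); // keep cache current on every write
    }

    byte[] read(String key) {
        byte[] hit = memcache.get(key);
        if (hit != null) return hit;
        byte[] value = cassandra.read(key); // miss: fall through to disk
        if (value != null) memcache.set(key, value);
        return value;
    }
}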
Case Study: Tweet Button
‣ In production today
‣ Stats
  ‣ 99th percentile before: 200ms-800ms when data > memory
  ‣ 99th percentile now: 2.5ms
‣ New observability stack
‣ Replaces Ganglia
‣ Collect metrics for graphing in real-time
‣ Scale based on machine count + defined metrics
‣ Heavy write throughput requirements
‣ SLA Target
‣ All metrics written under 60 seconds
Case Study: Cuckoo
‣ 1.3 million writes/second
‣ 112 billion writes a day
‣ 3.2 gigabit/s over the network
‣ 492GB of new data per hour
‣ 140MB/s writes across cluster
‣ 70MB/s reads across cluster
Case Study: Cuckoo
‣ 36,000 writes/second persisted to disk on each node
‣ 36 nodes without RF (Replication Factor)
‣ Replication Factor = 3
‣ 30-35% cpu utilization
‣ FSync Commit Log every 10s
Case Study: Cuckoo
‣ Garbage Collection Challenge
‣ 30-60 second pauses multiple times per hour on each node
‣ Why?
‣ Heap fragmentation
Case Study: Cuckoo
[Plot: free_space vs max_chunk over time, illustrating heap fragmentation]
Case Study: Cuckoo
‣ Slab Allocation (see the sketch below)
‣ Fixed sized chunks (2MB)
‣ Copy byte[] into slabs using CAS (Compare & Swap)
‣ Largely reduced fragmentation
‣ CASSANDRA-2252
Case Study: Cuckoo
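A minimal sketch in the spirit of CASSANDRA-2252; class names here are illustrative, not Cassandra's actual code. Values are copied into fixed 2MB slabs via a CAS loop, so the old generation holds a few large, long-lived chunks instead of millions of tiny fragments:

import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicReference;

final class SlabAllocator {
    static final int SLAB_SIZE = 2 * 1024 * 1024; // fixed-size 2MB chunks

    static final class Slab {
        final byte[] data = new byte[SLAB_SIZE];
        final AtomicInteger nextOffset = new AtomicInteger(0);

        // Reserve len bytes with a CAS loop; -1 means the slab is full.
        int allocate(int len) {
            while (true) {
                int cur = nextOffset.get();
                if (cur + len > SLAB_SIZE) return -1;
                if (nextOffset.compareAndSet(cur, cur + len)) return cur;
            }
        }
    }

    private final AtomicReference<Slab> current = new AtomicReference<>(new Slab());

    // Copy value into the current slab, returning a view into it.
    ByteBuffer copy(byte[] value) {
        if (value.length > SLAB_SIZE) return ByteBuffer.wrap(value.clone());
        while (true) {
            Slab slab = current.get();
            int off = slab.allocate(value.length);
            if (off >= 0) {
                System.arraycopy(value, 0, slab.data, off, value.length);
                return ByteBuffer.wrap(slab.data, off, value.length);
            }
            current.compareAndSet(slab, new Slab()); // slab full: roll a new one
        }
    }
}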
                     No Slab         Slab
GC Pause Avg Time    30-60 seconds   5 seconds
Frequency of pause   Every hour      Every 3 days 10 hours
Case Study: Cuckoo
‣ Pluggable Compaction
‣ Custom strategy for retention support
‣ Used for our timeseries
‣ Drop SSTables after N days (see the sketch below)
‣ Make it easy to implement more interesting and intelligent compaction strategies
‣ SSTable Min/Max Timestamp
‣ Read time optimization
Case Study: Cuckoo
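A minimal sketch of the retention decision, assuming per-SSTable min/max timestamps are available; the SSTableMeta type is illustrative, not Cassandra's API. An SSTable whose newest cell is past the window can be dropped whole, with no merge and no double-pass:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class RetentionStrategy {
    static final class SSTableMeta {
        final long minTimestampMillis, maxTimestampMillis;
        SSTableMeta(long min, long max) {
            minTimestampMillis = min;
            maxTimestampMillis = max;
        }
    }

    // Return the SSTables whose newest cell is older than N days.
    static List<SSTableMeta> expired(List<SSTableMeta> tables,
                                     int retentionDays, long nowMillis) {
        long cutoff = nowMillis - TimeUnit.DAYS.toMillis(retentionDays);
        List<SSTableMeta> drop = new ArrayList<>();
        for (SSTableMeta t : tables)
            if (t.maxTimestampMillis < cutoff) drop.add(t); // entirely expired
        return drop;
    }
}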
Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing
Operational Efficiency
‣ Automated infrastructure burn-in process
‣ Rack awareness to handle switch failures
‣ Grow clusters per rack, not per node
‣ Lower server RPC timeout (to 200ms-1s)
‣ Fail fast
‣ Split out RPC timeouts by read & writes
‣ CASSANDRA-2819
‣ Fault tolerance at the disk level
‣ Eject from cluster if raid array fails
‣ CASSANDRA-2118
‣ No swap and dedicated commit log
‣ Multiple hard drive vendors
‣ 300+ nodes in production
‣ Run on cheap commodity hardware
‣ Design for failure
Operational Efficiency
‣ Bad memory that causes corruption
‣ Multiple disks dying on same hosts within hours
‣ Rack switch failures
‣ Memory allocation delays causing JVM to encounter higher latency GC collections (mlockall recommended)
‣ Stop the world pauses if traffic patterns change
Operational Efficiency
What failures do we see in production?
‣ Network cards sometimes negotiating down to 100Mbit
‣ Machines randomly die and never come back
‣ Disks auto-ejecting themselves from the raid array
Operational Efficiency
What failures do we see in production?
Operational Efficiency
Deploy Process
[Diagram: a deploy driver pulls builds from Hudson and Git, then pushes to every Cass node]
Operational Efficiency
Deploy Process
‣ Deploy to hundreds of nodes in under 20s
‣ Roll the cluster (see the sketch after this list)
‣ Disable Gossip on a node
‣ Check ring on all nodes to ensure ‘Down’ state
‣ Drain
‣ Restart
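A minimal sketch of that rolling order, shelling out to nodetool over ssh; the host list, restart command, and error handling are assumptions, not the actual deploy driver:

import java.io.IOException;
import java.util.List;

final class RollingRestart {
    static void run(String... cmd) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (p.waitFor() != 0)
            throw new IOException("command failed: " + String.join(" ", cmd));
    }

    public static void main(String[] args) throws Exception {
        List<String> hosts = List.of("cass1", "cass2"); // hypothetical hosts
        for (String host : hosts) {
            run("ssh", host, "nodetool", "disablegossip"); // stop gossiping first
            // ...poll `nodetool ring` from the other nodes until this host shows Down...
            run("ssh", host, "nodetool", "drain"); // flush memtables, stop accepting writes
            run("ssh", host, "sudo", "service", "cassandra", "restart"); // assumed init script
        }
    }
}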
Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing
Capacity Planning
‣ In-house capacity planning tool (estimation sketch after the examples below)
‣ Collect input from sources:
‣ hardware platform (kernel, hw data)
‣ on-disk serialization overhead
‣ cost of read/write (seeks, index overhead)
‣ query cost (cpu, memory usage)
‣ requirements from customers
Capacity Planning
spec = {
    'read_qps': 500,
    'write_qps': 1000,
    'replication_factor': 3,
    'dataset_hot_percent': 0.05,
    'latency_95': 350.0,
    'latency_99': 250.0,
    'read_growth_percentage': 0.1,
    'write_growth_percentage': 0.1,
    ...
}
Input Example
Capacity Planning
90 days
datasize: 14.49T
page cache size: 962.89G
number of disks: 68
disk capacity: 15.22T
iops: 6800.00/s
replication_factor: 3

servers: 51
servers (w/o replication): 17
read_ops: 2323
write_ops: 991066

servers: 57
servers (w/o replication): 19
read_ops: 2877
write_ops: 1143171
Output Example
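A minimal sketch of the kind of arithmetic behind those numbers; the formulas and the disks-per-server constant are our assumptions rather than the internal model, though they do reproduce the 17/51 server figures above:

final class CapacityPlanner {
    public static void main(String[] args) {
        double datasetBytes = 14.49e12;        // projected 90-day dataset
        double perDiskBytes = 15.22e12 / 68;   // usable bytes per disk
        double perDiskIops = 6800.0 / 68;      // sustained IOPS per disk
        int replicationFactor = 3;
        double readQps = 500;
        double hotFraction = 0.05;             // dataset_hot_percent
        int disksPerServer = 4;                // assumed chassis layout

        // Reads that miss the page cache cost roughly one seek each.
        double coldReadQps = readQps * (1 - hotFraction);
        int disksForSpace = (int) Math.ceil(datasetBytes / perDiskBytes);
        int disksForIops = (int) Math.ceil(coldReadQps / perDiskIops);

        // Size for whichever constraint dominates, then replicate.
        int servers = (int) Math.ceil(
                Math.max(disksForSpace, disksForIops) / (double) disksPerServer);
        System.out.printf("servers (w/o replication): %d, servers: %d%n",
                servers, servers * replicationFactor);
        // Prints: servers (w/o replication): 17, servers: 51
    }
}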
Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing
Developer Integration
‣ Cassie
‣ Light-weight Cassandra Client
‣ Cluster member auto discovery
‣ Uses Finagle (http://github.com/twitter/finagle)
‣ Scala + Java support
‣ Open sourcing
Measuring ourselves
‣ Hardware Platform
‣ Data Storage
‣ Latency and Throughput
‣ Operational Efficiency
‣ Capacity Planning
‣ Developer Integration
‣ Testing
Testing
‣ Distributed Testing Harness
‣ Open sourced to community
‣ Custom Internal Build of YCSB
‣ Performance Benchmarking
‣ Custom workloads such as timeseries
‣ Performance Framework
Performance Framework
‣ Custom framework that uses YCSB
‣ What we do:
‣ Collect as much data as possible
‣ Measure
‣ Do it again
‣ Generate reports per build
Performance Framework
‣ Read/Insert/Update Combinations: 30
‣ Request rate targets (per second): 8
‣ 500, 1000, 2000, 4000, 8000, 16000, 32000, 64000
‣ Payload Sizes: 5
‣ 100, 500, 1000, 2000, 4000 bytes
‣ Single node vs cluster
Performance Framework
Total test combinations: 1,200 (30 combinations × 8 rates × 5 payload sizes)
Summary
‣ Understand your hardware and operating system
‣ Rigorously exercise your entire stack
‣ Capacity plan with math not guesswork
‣ Measure everything, then do it again
‣ Invest in your storage technology
‣ Automate
‣ Expect everything to fail
We’re hiring
@jointheflock