MIDDLEWARE SYSTEMS RESEARCH GROUP
MSRG.ORG
CaSSanDra: An SSD Boosted Key-Value Store
Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, Hans-Arno Jacobsen
Outline
• Application Performance Management
• Cassandra and SSDs
• Extending Cassandra's Row Cache
• Implementing a Dynamic Schema Catalogue
• Conclusions
Modern Enterprise Architecture
• Many different software systems
• Complex interactions
• Stateful systems often distributed/partitioned/replicated
• Stateless systems certainly duplicated
Application Performance Management
• Lightweight agent attached to each software system instance
• Monitors system health
• Traces transactions
• Determines root causes
• Raw APM metric: e.g., HostA/AgentX/AVGResponse, timestamp 1332988833, value 4, max 6, min 1 (see the on-disk format later in the deck)
[Diagram: lightweight APM agents attached to each system instance across the enterprise architecture]
Application Performance Management
• Problem: agents have short memory and only have a local view
  • What was the average response time for requests served by servlet X between December 18-31, 2011?
  • What was the average time spent in each service/database to respond to client requests?
APM Metrics Datastore
• All agents store metric data in a high write-throughput datastore
• Metric data is at a fine granularity (per-action, per-millisecond, etc.)
• The user now has a global view of metrics
• What is the best database to store APM metrics?
Cassandra Wins APM
• APM experiments performed by Rabl et al. [1] show Cassandra performs best for the APM use case
  • In-memory workloads with 95%, 50%, and 5% reads
  • Workloads requiring disk access with 95%, 50%, and 5% reads
[Embedded figure excerpts from Rabl et al. [1] (slide annotations "Read: 95%" and "Read: 50%"), Figures 3-10: throughput and read/write latency for Workloads R (95% read), RW (50% read), and W (1% read), plus a note on the scan workload RS, measured on 1-12 nodes for Cassandra, HBase, Project Voldemort, VoltDB, Redis, and MySQL. In the excerpt, Cassandra, HBase, and Voldemort scale near-linearly with cluster size, Cassandra sustains the highest throughput at 12 nodes, HBase trades read latency for very low write latency, and the client-sharded Redis setup does not scale as expected.]
[1] http://msrg.org/publications/pdf_files/2012/vldb12-bigdata-Solving_Big_Data_Challenges_fo.pdf
Cassandra
• Built at Facebook by former Dynamo engineers
  • Open sourced to Apache in 2009
• DHT with consistent hashing
  • MD5 hash of the key (see the sketch below)
  • Multiple nodes handle segments of the ring for load balancing
• Dynamo distribution and replication model + BigTable storage model
[Diagram: Cassandra's storage model - Commit Log, Memtable, SSTables]
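To make the ring placement concrete, here is a minimal sketch (Java, with invented node names and tokens, no replication or virtual nodes) of how an MD5-based partitioner maps a key onto the ring and finds the owning node. It is in the spirit of Cassandra's RandomPartitioner, not its actual code:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

/** Minimal sketch: MD5(key) -> 128-bit token, then the first node whose token is
 *  >= the key's token owns the key, wrapping around the ring. */
public class ConsistentHashSketch {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(String node, BigInteger token) {
        ring.put(token, node);
    }

    static BigInteger tokenFor(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);   // interpret the 16-byte digest as an unsigned token
    }

    String nodeFor(String key) throws Exception {
        BigInteger token = tokenFor(key);
        Map.Entry<BigInteger, String> owner = ring.ceilingEntry(token);
        return owner != null ? owner.getValue() : ring.firstEntry().getValue(); // wrap around
    }
}
```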
Cassandra and SSDs
• Improve performance by either adding nodes or improving per-node performance
• Node performance is directly dependent on the disk I/O performance of the system
• Cassandra stores two entities on disk:
  • Commit Log
  • SSTables
• Should SSDs be used to store both?
• We evaluated each possible configuration
Experiment Setup
• Server specification:
  • 2x Intel 8-core X5450, 16GB RAM, 2x 2TB RAID0 HDD, 2x 250GB Intel 520 SSD
  • Apache Cassandra 1.10
• Used the YCSB benchmark
  • 100M rows, 50GB total raw data, 'latest' distribution
  • 95% read, 5% write
• Minimum three runs per workload, fresh data on each run
• Broken into phases:
  • Data load
  • Fragmentation
  • Cache warm-up
  • Workload (> 12h process)
SSD vs. HDD
• Location of the log is irrelevant
• Location of the data is important
  • Dramatic performance improvement of SSD over HDD
• SSD benefits from high parallelism
Configuration | # of clients | # of threads/client | Location of Data | Location of Commit Log
C1            | 1            | 2                   | RAID (HDD)       | RAID (HDD)
C2            | 1            | 2                   | RAID (HDD)       | SSD
C3            | 1            | 2                   | SSD              | RAID (HDD)
C4            | 1            | 2                   | SSD              | SSD
C5            | 4            | 16                  | RAID (HDD)       | RAID (HDD)
C6            | 4            | 16                  | SSD              | SSD
[Excerpt from the CaSSanDra paper shown on the slide - Fig. 4: Throughput/Latency Results for HDD vs SSD and Disk Full vs Disk Empty; (a) HDD vs SSD throughput (ops/sec) for configurations C1-C6, (b) HDD vs SSD latency (ms) for C1-C6, (c) 99%-fill HDD vs SSD throughput (empty disk vs full disk), (d) 99%-fill HDD vs SSD latency (empty disk vs full disk).]
…on HDD for the bulk of data that is infrequently accessed. Another reason to do this is the fact that SSD performance degrades with higher fill ratios. As seen in Figure 4(c), the performance of a highly filled SSD degrades much worse than the performance of a highly filled disk. It has to be noted that the workload in this case is still read heavy; for write-heavy workloads even worse degradations will be experienced.

When evaluating our extended SSD row cache, the size of the data set was 100 million records, where each record had five columns having a size of 75 bytes. The total size of the data on disk after load averaged 50GB. Our evaluation process was broken down into four phases: data loading, data fragmentation, Memtable flush, bufferpool warmup, and transactional workload phases. The fragmentation phase attempts to spread the columns of a row across multiple SSTables to illustrate the effect of read amplification on LSM-based storage systems. In the fragmentation phase, we used a latest request distribution with 10% of operations being reads and the remaining 90% of operations updating anywhere between one and all five columns. The warming phase also used a latest request distribution with read operations accounting for 99% of all operations. The warmup phase was run until either the cache was full or stored at most 10% of the total dataset. The transactional phase was run with a latest distribution (a zipfian distribution where the most recently entered keys are favoured). These experiments all used configuration C5 (refer to Table I), the optimal configuration for HDDs, to provide a balanced evaluation.

When evaluating our dynamic schema model, we used a dataset consisting of 40 million records where each record consisted of between 5 and 10 columns of 10 bytes. By default, YCSB does not vary the number of columns in a record during the loading phase. We modified YCSB to create a new varying-size record generator, which we plugged into the default data generator. Each run of the experiment created a different amount of data on disk, but we observed that the average total data size was between 6.5GB and 7GB. In all runs, we varied the read percentage for the experiments between 95%, 50% and 5% using configuration C6.

A. SSD Row Cache

In Figure 5(a), the throughput of the two Cassandra instances can be seen for the three different workloads that were tested. For the 95% read-heavy workload, we see that the SSD-enabled row cache provides an 85% improvement in throughput, growing from 384 reads/sec to 710 reads/sec. This is because a larger portion of the hot data is cached on the SSD; in fact, our configuration enabled storing more than twice the amount of data than when using an in-memory cache alone, achieving a cache-hit ratio of more than 85%. When a read operation reaches the server for a row that does not reside in the off-heap memory cache, only a single SSD seek is required to fulfill the request. In addition, cached data is pre-compacted, meaning that at most one seek is required to fetch the row. We see the same effect in the remaining two workloads despite a lower proportion of reads. Cassandra is a write-optimized system, meaning that in write-heavy scenarios the efficacy of a cache is reduced. This is evidenced by the reduction in the cache-hit ratio from 72% in the workload with 85% reads to 60% in the 75%-read workload.

As seen in Figure 5(b), in the 95% read workload, the SSD-enabled row cache averaged a latency of 3ms while the in-memory cache managed a read latency of 5.6ms, a 46% improvement. As the proportion of reads is reduced from 85% to 75%, the latency when using an SSD for the row cache remains roughly the same. This is because the latest request distribution gives us a high probability that the reads for the rows can be served directly from Cassandra's Memtable, which effectively acts as a write-back cache.

B. Dynamic Schema

Next, we illustrate that by extracting the metadata (i.e., schema) from the data on disk we suffer no perceivable performance penalty. The column names in our test were fixed at 5 bytes and the number of columns varied between 5 and 10. This accounts to a minimum saving of 25 bytes from being written on a per-row basis. Cassandra, not uncommon from many commercial databases, performs buffered I/O; all reads and writes are executed in 16 KB pages. In our experiment configuration, one row fits well within a single Cassandra page. This means that reading a row will incur no additional overhead since the total size of a row with a co-located schema is larger than a modified row with the schema extracted out. When we extract out the metadata, we expected no degradation in performance or latency, and the results in Figure 5(c) and Figure 5(d) confirm our assertion. Specifically, we conclude that in the 95% and 50%-read workloads, the latency and throughput were comparable, with any difference being attributed to the environment.

Throughput and latency are not major motivations for implementing the dynamic schema. Fairly significant space savings can be obtained by extracting redundant schema information…
SSD vs. HDD (II)
• SSD offers more than 7x improvement to throughput on empty disk
• SSD performance degrades by half as the storage device fills up
• Filling the SSD or running it near capacity is not advisable
SSD vs. HDD: Summary
• Cassandra benefits most when storing data on SSD (not the log)
• Location of the commit log is not important
• SSD performance is inversely proportional to the fill ratio
• Storing all data on SSD is uneconomical
  • Replacing a 3TB HDD with 3x 1TB SSDs is 10x more costly
  • SSDs have a limited lifetime (10-50K write-erase cycles), so they need more frequent replacement
• Rabl et al. [1] show that adding a node is 100% costlier, with a 100% throughput improvement
• Build a hybrid system to get comparable performance for marginal cost
Cassandra: Read + Write Path
• Write path is fast:
  1. Write update into the commit log
  2. Write update into the Memtable
• Memtables flush to SSTables asynchronously when full
  • Never blocks writes
• Read path can be slow:
  1. Read key-value from the Memtable
  2. Read key-value from each SSTable on disk
  3. Construct a merged view of the row from each input source
[Diagram: reads and updates flow through the Memtable in memory; the Commit Log and SSTables live on disk]
• Each read needs to do O(# of SSTables) I/O (sketched below)
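The bullet above is the crux of the read-path cost. The following sketch uses illustrative Java types (not Cassandra's real classes) to show why: a read consults the Memtable and every candidate SSTable, one potential disk read each, and merges the column fragments with newest-timestamp-wins semantics:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative types only. A row may be spread over the Memtable and several
 *  SSTables, so a read consults every source and merges column fragments. */
class ColumnValue {
    final byte[] value;
    final long timestamp;
    ColumnValue(byte[] value, long timestamp) { this.value = value; this.timestamp = timestamp; }
}

interface RowSource {
    /** Column fragments this source holds for the key (possibly partial or empty). */
    Map<String, ColumnValue> read(String key);   // for an SSTable this implies a disk read
}

class ReadPath {
    static Map<String, ColumnValue> readRow(String key, RowSource memtable, List<RowSource> sstables) {
        Map<String, ColumnValue> merged = new HashMap<>(memtable.read(key));
        for (RowSource sstable : sstables) {                       // O(# of SSTables) lookups
            for (Map.Entry<String, ColumnValue> e : sstable.read(key).entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                        (oldV, newV) -> oldV.timestamp >= newV.timestamp ? oldV : newV); // newest wins
            }
        }
        return merged;
    }
}
```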
Cassandra: SSTables
• Cassandra allows blind writes
• Row data can be fragmented over multiple SSTables over time
• Bloom filters and indexes can potentially help
• Ultimately, multiple fragments need to be read from disk (illustrated below)
Employee ID | First Name | Last Name | Age | Department ID
99231234    | Prashanth  | Menon     | 25  | MSRG
[Diagram: the columns of this row spread across several SSTables]
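As a hypothetical continuation of the read-path sketch above, the example employee row might end up fragmented over three SSTables after a few blind writes; the fragment layout and the later Age update are invented purely for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

/** Reuses the RowSource/ColumnValue/ReadPath sketch from the read-path slide. */
class FragmentedRowDemo {
    static ColumnValue col(String v, long ts) {
        return new ColumnValue(v.getBytes(StandardCharsets.UTF_8), ts);
    }

    public static void main(String[] args) {
        RowSource memtable = key -> Map.of();                         // nothing buffered in memory
        List<RowSource> sstables = List.of(
            key -> Map.of("First Name", col("Prashanth", 100),        // original insert
                          "Last Name",  col("Menon", 100),
                          "Age",        col("25", 100)),
            key -> Map.of("Department ID", col("MSRG", 200)),         // later blind write
            key -> Map.of("Age", col("26", 300)));                    // newest Age fragment wins
        Map<String, ColumnValue> row = ReadPath.readRow("99231234", memtable, sstables);
        System.out.println(row.keySet());  // the full row is rebuilt from three on-disk fragments
    }
}
```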
Cassandra: Row Cache
• Row cache buffers the full merged row in memory
• Cache miss follows the regular read path, constructs the merged row, and brings it into the cache
• Makes the read path faster for frequently accessed data
• Problem: the row cache occupies memory
  • Takes away precious memory from the rest of the system
• Extend the row cache efficiently onto SSD
[Diagram: row cache and Memtable in memory; Commit Log and SSTables on disk]
Extended Row Cache
• Extend the row cache onto SSD
• Chained with the in-memory row cache
  • LRU in memory, overflow onto an LRU SSD row cache
• Implemented as append-only cache files
  • Efficient sequential writes
  • Fast random reads
• Zero I/O for a hit in the first-level row cache
• One random I/O on SSD for the second-level row cache (see the sketch below)
[Diagram: 1st-level row cache, 2nd-level cache index, and Memtable in memory; 2nd-level row cache on SSD; Commit Log and SSTables on disk]
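A rough sketch of this two-level design, under assumed types and without the serialization, invalidation, and SSD-level eviction a real implementation needs: an in-memory LRU level spills evicted rows to an append-only file on SSD (sequential writes), and an in-memory index of (offset, length) lets a second-level hit be served with exactly one random read:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a chained row cache: first level is an LRU map in memory, second
 *  level is an append-only cache file on SSD indexed by an in-memory map. */
class TwoLevelRowCache {
    private static class Extent { final long offset; final int length;
        Extent(long offset, int length) { this.offset = offset; this.length = length; } }

    private final int memCapacity;
    private final Map<String, byte[]> memLevel;
    private final Map<String, Extent> ssdIndex = new HashMap<>();
    private final RandomAccessFile ssdFile;        // append-only cache file on the SSD
    private long appendOffset = 0;

    TwoLevelRowCache(String cacheFilePath, int memCapacity) throws IOException {
        this.memCapacity = memCapacity;
        this.ssdFile = new RandomAccessFile(cacheFilePath, "rw");
        this.memLevel = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                if (size() > TwoLevelRowCache.this.memCapacity) {
                    try { spillToSsd(eldest.getKey(), eldest.getValue()); }
                    catch (IOException e) { throw new RuntimeException(e); }
                    return true;                   // evict from memory after spilling to SSD
                }
                return false;
            }
        };
    }

    private void spillToSsd(String key, byte[] row) throws IOException {
        ssdFile.seek(appendOffset);                // sequential append: SSD-friendly writes
        ssdFile.write(row);
        ssdIndex.put(key, new Extent(appendOffset, row.length));
        appendOffset += row.length;
    }

    void put(String key, byte[] mergedRow) { memLevel.put(key, mergedRow); }

    byte[] get(String key) throws IOException {
        byte[] row = memLevel.get(key);
        if (row != null) return row;               // first-level hit: zero I/O
        Extent e = ssdIndex.get(key);
        if (e == null) return null;                // miss: fall back to the regular read path
        byte[] buf = new byte[e.length];
        ssdFile.seek(e.offset);
        ssdFile.readFully(buf);                    // second-level hit: one random SSD read
        return buf;
    }
}
```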
Evaluation: SSD Row Cache
• Setup:
  • 100M rows, 50GB total data, 6GB row cache
• Results:
  • 75% improvement in throughput
  • 75% improvement in latency
  • A RAM-only cache has too low a hit ratio
[Excerpt from the CaSSanDra paper shown on the slide - Fig. 5: Throughput/Latency Results for Row Cache Extension and Dynamic Schema; (a) row cache throughput (ops/sec) at 95%/85%/75% reads for cache disabled, RAM, and RAM+SSD, (b) row cache latency (ms) for the same settings, (c) dynamic schema throughput (ops/sec) at 95%/50%/5% reads for regular vs dynamic schema, (d) dynamic schema latency (ms) for the same settings.]
…and we find this to be much more compelling. In normal operation, data sizes averaged 6.8GB compressed after the initial load of 40 million keys. With a modified Cassandra, data sizes averaged 6.01GB of data, a savings of roughly 10%. This value will grow as the number of columns in the table grows and as column names grow in length.

Another potential benefit of the dynamic schema model (omitted in the interest of space) is in executing column-slice queries. When performing a read from Cassandra, it is possible to read a slice of the row by specifying which columns to read. Though Cassandra has an index per row, it is only a sample; not every column has an appropriate index entry. If we have a schema on hand, we know precisely the layout of the row on disk, which we can use to optimize the read process and avoid cache pollution.

Finally, it is important to note that we are not using high-end enterprise PCIe-bus SSDs (e.g., FusionIO), yet we are getting a substantial performance improvement. Therefore, we conclude that even with inexpensive commodity SSDs, a considerable throughput and latency improvement is achieved.

VII. RELATED WORK

There exists a recent move in the database community to exploit key SSD characteristics, such as fast random reads that are orders of magnitude faster than magnetic physical drives, and to use SSDs to make updates disk-I/O friendly, e.g., [3], [4], [5]. One way to exploit SSDs is to introduce a storage hierarchy in which SSDs are placed as a cache between main memory and disks. This extends the database bufferpool to span over both main memory and SSDs. A novel temperature-based bufferpool replacement policy was introduced in [4], which substantially improved both transactional and analytical query processing in IBM DB2. In our work, we go beyond a simple extension of the bufferpool with SSDs; instead we develop specialized bufferpool enhancements that target the slow read path problem (incurring many random I/Os in order to consolidate across many SSTables) of key-value stores in the context of Cassandra. Furthermore, we introduce the concept of dynamic schema (i.e., dynamic catalogue) that decouples the commonly joint meta-data and data in key-value stores (such as Cassandra [6] and BigTable [7]) by maintaining the schema information on SSDs. Lastly, in [14], similar to our framework, the use of SSDs as cache was also explored in a proof-of-concept key-value store prototype. In contrast, we introduce the storage hierarchy and our SSD caching techniques within a commercialized key-value store. Furthermore, we identify new avenues for exploiting the use of SSDs within key-value stores, namely, our dynamic cataloguing technique.

VIII. CONCLUSION

In this paper, we investigated the performance benefits of SSDs in key-value stores. We benchmarked different configurations of SSD and HDD combinations. We proposed and implemented two specific optimizations for SSD-HDD hybrid systems and showed their effectiveness in detailed benchmarks. Our extended row cache strategy transparently stores hot data on SSD and thus extends the row cache in Cassandra. Our benchmarking results show that this extension can achieve improvements of 85% for realistic workloads. Our second technique for SSD-HDD hybrid systems is a dynamic schema catalogue. It reduces the disk impact of row-level schema models and thus increases the performance of common workloads and data sets.

For future work, we will adapt our methodology so it can be directly run on SSD instead of going through the FTL. This will increase the performance of the SSD operations and allow for SSD-optimized data structures and algorithms.
REFERENCES

[1] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, "Big data: The Next Frontier for Innovation, Competition, and Productivity," McKinsey Global Institute, Tech. Rep., 2011.
[2] T. Rabl, M. Sadoghi, H.-A. Jacobsen, S. Gomez-Villamor, V. Muntes-Mulero, and S. Mankowskii, "Solving Big Data Challenges for Enterprise Application Performance Management," PVLDB, 2012.
[3] M. Canim, G. A. Mihaila, B. Bhattacharjee, K. A. Ross, and C. A. Lang, "An object placement advisor for DB2 using solid state storage," PVLDB, 2009.
[4] ——, "SSD bufferpool extensions for database systems," PVLDB, 2010.
[5] M. Sadoghi, K. A. Ross, M. Canim, and B. Bhattacharjee, "Making updates disk-I/O friendly using SSDs," PVLDB, 2013.
[6] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," SIGOPS Review, 2010.
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," in OSDI, 2006.
[8] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's Highly Available Key-Value Store," in SOSP, 2007.
[9] R. Cattell, "Scalable SQL and NoSQL data stores," SIGMOD Record, 2010.
[10] M. Cornwell, "Anatomy of a solid-state drive," Communications of the ACM, 2012.
[11] L. Bouganim, B. Þór Jónsson, and P. Bonnet, "uFLIP: Understanding Flash IO Patterns," in CIDR, 2009.
[12] G. Graefe, "The Five-Minute Rule 20 Years Later: and How Flash Memory Changes the Rules," Communications of the ACM, 2009.
[13] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking cloud serving systems with YCSB," in SoCC, 2010.
[14] B. Debnath, S. Sengupta, and J. Li, "FlashStore: high throughput persistent key-value store," PVLDB, 2010.
Dynamic Schema
• Key-value stores covet a schema-less data model
  • Very flexible, good for highly varying data
  • Schemas often change; defining them up front can be detrimental
• Observation: many big data applications have relatively stable schemas
  • e.g., click streams, APM, sensor data, etc.
• Redundant schemas have significant overhead in I/O and space usage
On-Disk Format:
Metric Name: HostA/AgentX/AVGResponse | Timestamp: 1332988833 | Value: 4 | Max: 6 | Min: 1
Metric Name: HostA/AgentX/AVGResponse | Timestamp: 1332988848 | Value: 5 | Max: 7 | Min: 1
Metric Name: HostA/AgentX/Failures    | Timestamp: 1332988849 | All: 4   | Warn: 3 | Error: 1

Application Format:
Metric Name              | Timestamp  | Value | Max | Min
HostA/AgentX/AVGResponse | 1332988833 | 4     | 6   | 1
Dynamic Schema (III)
• Don't serialize the redundant schema with rows
• Extract the schema from the data, store it on SSD, and serialize the schema ID with the data (sketched below)
• Allows for a large number of schemas
Old Disk Format:
Metric Name: HostA/AgentX/AVGResponse | Timestamp: 1332988833 | Value: 4 | Max: 6 | Min: 1
Metric Name: HostA/AgentX/AVGResponse | Timestamp: 1332988848 | Value: 5 | Max: 7 | Min: 1
Metric Name: HostA/AgentX/Failures    | Timestamp: 1332988849 | All: 4   | Warn: 3 | Error: 1

Schema Catalogue (on SSD):
S1: Metric Name, Timestamp, Value, Max, Min
S2: Metric Name, Timestamp, All, Warn, Error

New Disk Format:
HostA/AgentX/AVGResponse | 1332988833 | S1 | 4 | 6 | 1
HostA/AgentX/AVGResponse | 1332988848 | S1 | 5 | 7 | 1
HostA/AgentX/Failures    | 1332988849 | S2 | 4 | 3 | 1
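A minimal sketch of the catalogue idea, with invented types: the ordered set of column names in a row is looked up in (or added to) a catalogue and replaced by a small schema ID, so only the ID and the values are serialized with the row. Mapping the example above, (Metric Name, Timestamp, Value, Max, Min) would become S1 and (Metric Name, Timestamp, All, Warn, Error) would become S2; a real implementation would persist the catalogue on SSD and handle the byte-level encoding:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative dynamic schema catalogue: column-name sets are interned as IDs. */
class DynamicSchemaCatalogue {
    private final Map<List<String>, Integer> schemaIds = new HashMap<>();
    private final List<List<String>> schemas = new ArrayList<>();

    /** Returns the schema ID for this ordered set of column names, registering it if new. */
    synchronized int schemaIdFor(List<String> columnNames) {
        return schemaIds.computeIfAbsent(List.copyOf(columnNames), cols -> {
            schemas.add(cols);
            return schemas.size() - 1;            // e.g. S1 = 0, S2 = 1, ...
        });
    }

    /** Encodes a row as (schemaId, values...) instead of repeating every column name. */
    Object[] encode(LinkedHashMap<String, Object> row) {
        int id = schemaIdFor(new ArrayList<>(row.keySet()));
        Object[] out = new Object[row.size() + 1];
        out[0] = id;
        int i = 1;
        for (Object v : row.values()) out[i++] = v;
        return out;
    }

    /** Decodes by re-attaching the column names from the catalogue. */
    LinkedHashMap<String, Object> decode(Object[] encoded) {
        List<String> cols = schemas.get((Integer) encoded[0]);
        LinkedHashMap<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < cols.size(); i++) row.put(cols.get(i), encoded[i + 1]);
        return row;
    }
}
```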
Evaluation: Dynamic Schema
• Setup:
  • 40M rows, variable columns 5-10 (638 schemas), 6GB row cache
• Results:
  • 10% reduction in disk usage (6.8GB vs 6GB)
  • Slightly improved throughput, stable latency
• Effective SSD usage (only random reads) and reduced I/O and space usage
Conclusions
• Storing Cassandra commit logs on SSD doesn't help
• Running SSDs at or near capacity degrades their performance
• Using SSDs as a secondary row cache dramatically improves performance
• Extracting redundant schemas onto an SSD reduces disk space usage and required I/O
Thanks!!
• Questions?
• Contact:
  • Prashanth Menon (prashanth.menon@utoronto.ca)
Future Work
• What types of tables benefit most from a dynamic schema?
• Impact of compaction on read-heavy workloads
  • How can SSDs be used to improve the performance of compaction?
• What is the performance when storing only SSTable indexes on SSD?