MIDDLEWARE SYSTEMS RESEARCH GROUP
MSRG.ORG
CaSSanDra: An SSD Boosted Key-Value Store
Prashanth Menon, Tilmann Rabl, Mohammad Sadoghi, Hans-Arno Jacobsen
Outline
• Application Performance Management
• Cassandra and SSDs
• Extending Cassandra's Row Cache
• Implementing a Dynamic Schema Catalogue
• Conclusions
Modern Enterprise Architecture
• Many different software systems
• Complex interactions
• Stateful systems often distributed/partitioned/replicated
• Stateless systems certainly duplicated
Application Performance Management
• Lightweight agent attached to each software system instance
• Monitors system health
• Traces transactions
• Determines root causes
• Raw APM metric: e.g., HostA/AgentX/AVGResponse, timestamp 1332988833, value 4, max 6, min 1 (see the on-disk format later in the deck)
[Diagram: lightweight APM agents attached to each system instance across the enterprise architecture]
Application Performance Management
• Problem: agents have short memory and only have a local view
  • What was the average response time for requests served by servlet X between December 18-31, 2011?
  • What was the average time spent in each service/database to respond to client requests?
APM Metrics Datastore
• All agents store metric data in a high write-throughput datastore
• Metric data is at a fine granularity (per-action, per-millisecond, etc.)
• The user now has a global view of metrics
• What is the best database to store APM metrics?
Cassandra Wins APM
• APM experiments performed by Rabl et al. [1] show Cassandra performs best for the APM use case
  • In-memory workloads with 95%, 50%, and 5% reads
  • Workloads requiring disk access with 95%, 50%, and 5% reads
[Embedded figure excerpts from Rabl et al. [1] (slide annotations "Read: 95%" and "Read: 50%"), Figures 3-10: throughput and read/write latency for Workloads R (95% read), RW (50% read), and W (1% read), plus a note on the scan workload RS, measured on 1-12 nodes for Cassandra, HBase, Project Voldemort, VoltDB, Redis, and MySQL. In the excerpt, Cassandra, HBase, and Voldemort scale near-linearly with cluster size, Cassandra sustains the highest throughput at 12 nodes, HBase trades read latency for very low write latency, and the client-sharded Redis setup does not scale as expected.]
[1] http://msrg.org/publications/pdf_files/2012/vldb12-bigdata-Solving_Big_Data_Challenges_fo.pdf
Cassandra
• Built at Facebook by former Dynamo engineers
  • Open sourced to Apache in 2009
• DHT with consistent hashing
  • MD5 hash of the key (see the sketch below)
  • Multiple nodes handle segments of the ring for load balancing
• Dynamo distribution and replication model + BigTable storage model
[Diagram: Cassandra's storage model - Commit Log, Memtable, SSTables]
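To make the ring placement concrete, here is a minimal sketch (Java, with invented node names and tokens, no replication or virtual nodes) of how an MD5-based partitioner maps a key onto the ring and finds the owning node. It is in the spirit of Cassandra's RandomPartitioner, not its actual code:

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

/** Minimal sketch: MD5(key) -> 128-bit token, then the first node whose token is
 *  >= the key's token owns the key, wrapping around the ring. */
public class ConsistentHashSketch {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(String node, BigInteger token) {
        ring.put(token, node);
    }

    static BigInteger tokenFor(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);   // interpret the 16-byte digest as an unsigned token
    }

    String nodeFor(String key) throws Exception {
        BigInteger token = tokenFor(key);
        Map.Entry<BigInteger, String> owner = ring.ceilingEntry(token);
        return owner != null ? owner.getValue() : ring.firstEntry().getValue(); // wrap around
    }
}
```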
Cassandra and SSDs
• Improve performance by either adding nodes or improving per-node performance
• Node performance is directly dependent on the disk I/O performance of the system
• Cassandra stores two entities on disk:
  • Commit Log
  • SSTables
• Should SSDs be used to store both?
• We evaluated each possible configuration
Experiment Setup
• Server specification:
  • 2x Intel 8-core X5450, 16GB RAM, 2x 2TB RAID0 HDD, 2x 250GB Intel 520 SSD
  • Apache Cassandra 1.10
• Used the YCSB benchmark
  • 100M rows, 50GB total raw data, 'latest' distribution
  • 95% read, 5% write
• Minimum three runs per workload, fresh data on each run
• Broken into phases:
  • Data load
  • Fragmentation
  • Cache warm-up
  • Workload (> 12h process)
SSD vs. HDD
• Location of the log is irrelevant
• Location of the data is important
  • Dramatic performance improvement of SSD over HDD
• SSD benefits from high parallelism
Configuration | # of clients | # of threads/client | Location of Data | Location of Commit Log
C1            | 1            | 2                   | RAID (HDD)       | RAID (HDD)
C2            | 1            | 2                   | RAID (HDD)       | SSD
C3            | 1            | 2                   | SSD              | RAID (HDD)
C4            | 1            | 2                   | SSD              | SSD
C5            | 4            | 16                  | RAID (HDD)       | RAID (HDD)
C6            | 4            | 16                  | SSD              | SSD
[Excerpt from the CaSSanDra paper shown on the slide - Fig. 4: Throughput/Latency Results for HDD vs SSD and Disk Full vs Disk Empty; (a) HDD vs SSD throughput (ops/sec) for configurations C1-C6, (b) HDD vs SSD latency (ms) for C1-C6, (c) 99%-fill HDD vs SSD throughput (empty disk vs full disk), (d) 99%-fill HDD vs SSD latency (empty disk vs full disk).]
…on HDD for the bulk of data that is infrequently accessed. Another reason to do this is the fact that SSD performance degrades with higher fill ratios. As seen in Figure 4(c), the performance of a highly filled SSD degrades much worse than the performance of a highly filled disk. It has to be noted that the workload in this case is still read heavy; for write-heavy workloads even worse degradations will be experienced.

When evaluating our extended SSD row cache, the size of the data set was 100 million records, where each record had five columns having a size of 75 bytes. The total size of the data on disk after load averaged 50GB. Our evaluation process was broken down into four phases: data loading, data fragmentation, Memtable flush, bufferpool warmup, and transactional workload phases. The fragmentation phase attempts to spread the columns of a row across multiple SSTables to illustrate the effect of read amplification on LSM-based storage systems. In the fragmentation phase, we used a latest request distribution with 10% of operations being reads and the remaining 90% of operations updating anywhere between one and all five columns. The warming phase also used a latest request distribution with read operations accounting for 99% of all operations. The warmup phase was run until either the cache was full or stored at most 10% of the total dataset. The transactional phase was run with a latest distribution (a zipfian distribution where the most recently entered keys are favoured). These experiments all used configuration C5 (refer to Table I), the optimal configuration for HDDs, to provide a balanced evaluation.

When evaluating our dynamic schema model, we used a dataset consisting of 40 million records where each record consisted of between 5 and 10 columns of 10 bytes. By default, YCSB does not vary the number of columns in a record during the loading phase. We modified YCSB to create a new varying-size record generator, which we plugged into the default data generator. Each run of the experiment created a different amount of data on disk, but we observed that the average total data size was between 6.5GB and 7GB. In all runs, we varied the read percentage for the experiments between 95%, 50% and 5% using configuration C6.

A. SSD Row Cache

In Figure 5(a), the throughput of the two Cassandra instances can be seen for the three different workloads that were tested. For the 95% read-heavy workload, we see that the SSD-enabled row cache provides an 85% improvement in throughput, growing from 384 reads/sec to 710 reads/sec. This is because a larger portion of the hot data is cached on the SSD; in fact, our configuration enabled storing more than twice the amount of data than when using an in-memory cache alone, achieving a cache-hit ratio of more than 85%. When a read operation reaches the server for a row that does not reside in the off-heap memory cache, only a single SSD seek is required to fulfill the request. In addition, cached data is pre-compacted, meaning that at most one seek is required to fetch the row. We see the same effect in the remaining two workloads despite a lower proportion of reads. Cassandra is a write-optimized system, meaning that in write-heavy scenarios the efficacy of a cache is reduced. This is evidenced by the reduction in the cache-hit ratio from 72% in the workload with 85% reads to 60% in the 75%-read workload.

As seen in Figure 5(b), in the 95% read workload, the SSD-enabled row cache averaged a latency of 3ms while the in-memory cache managed a read latency of 5.6ms, a 46% improvement. As the proportion of reads is reduced from 85% to 75%, the latency when using an SSD for the row cache remains roughly the same. This is because the latest request distribution gives us a high probability that the reads for the rows can be served directly from Cassandra's Memtable, which effectively acts as a write-back cache.

B. Dynamic Schema

Next, we illustrate that by extracting the metadata (i.e., schema) from the data on disk we suffer no perceivable performance penalty. The column names in our test were fixed at 5 bytes and the number of columns varied between 5 and 10. This accounts to a minimum saving of 25 bytes from being written on a per-row basis. Cassandra, not uncommon from many commercial databases, performs buffered I/O; all reads and writes are executed in 16 KB pages. In our experiment configuration, one row fits well within a single Cassandra page. This means that reading a row will incur no additional overhead since the total size of a row with a co-located schema is larger than a modified row with the schema extracted out. When we extract out the metadata, we expected no degradation in performance or latency, and the results in Figure 5(c) and Figure 5(d) confirm our assertion. Specifically, we conclude that in the 95% and 50%-read workloads, the latency and throughput were comparable, with any difference being attributed to the environment.

Throughput and latency are not major motivations for implementing the dynamic schema. Fairly significant space savings can be obtained by extracting redundant schema information…
SSD vs. HDD (II)
• SSD offers more than 7x improvement to throughput on empty disk
• SSD performance degrades by half as the storage device fills up
• Filling the SSD or running it near capacity is not advisable
SSD vs. HDD: Summary
• Cassandra benefits most when storing data on SSD (not the log)
• Location of the commit log is not important
• SSD performance is inversely proportional to the fill ratio
• Storing all data on SSD is uneconomical
  • Replacing a 3TB HDD with 3x 1TB SSDs is 10x more costly
  • SSDs have a limited lifetime (10-50K write-erase cycles), so they need more frequent replacement
• Rabl et al. [1] show that adding a node is 100% costlier, with a 100% throughput improvement
• Build a hybrid system to get comparable performance for marginal cost
Cassandra: Read + Write Path
• Write path is fast:
  1. Write update into the commit log
  2. Write update into the Memtable
• Memtables flush to SSTables asynchronously when full
  • Never blocks writes
• Read path can be slow:
  1. Read key-value from the Memtable
  2. Read key-value from each SSTable on disk
  3. Construct a merged view of the row from each input source
[Diagram: reads and updates flow through the Memtable in memory; the Commit Log and SSTables live on disk]
• Each read needs to do O(# of SSTables) I/O (sketched below)
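The bullet above is the crux of the read-path cost. The following sketch uses illustrative Java types (not Cassandra's real classes) to show why: a read consults the Memtable and every candidate SSTable, one potential disk read each, and merges the column fragments with newest-timestamp-wins semantics:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative types only. A row may be spread over the Memtable and several
 *  SSTables, so a read consults every source and merges column fragments. */
class ColumnValue {
    final byte[] value;
    final long timestamp;
    ColumnValue(byte[] value, long timestamp) { this.value = value; this.timestamp = timestamp; }
}

interface RowSource {
    /** Column fragments this source holds for the key (possibly partial or empty). */
    Map<String, ColumnValue> read(String key);   // for an SSTable this implies a disk read
}

class ReadPath {
    static Map<String, ColumnValue> readRow(String key, RowSource memtable, List<RowSource> sstables) {
        Map<String, ColumnValue> merged = new HashMap<>(memtable.read(key));
        for (RowSource sstable : sstables) {                       // O(# of SSTables) lookups
            for (Map.Entry<String, ColumnValue> e : sstable.read(key).entrySet()) {
                merged.merge(e.getKey(), e.getValue(),
                        (oldV, newV) -> oldV.timestamp >= newV.timestamp ? oldV : newV); // newest wins
            }
        }
        return merged;
    }
}
```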
Cassandra: SSTables
• Cassandra allows blind writes
• Row data can be fragmented over multiple SSTables over time
• Bloom filters and indexes can potentially help
• Ultimately, multiple fragments need to be read from disk (illustrated below)
Employee ID | First Name | Last Name | Age | Department ID
99231234    | Prashanth  | Menon     | 25  | MSRG
[Diagram: the columns of this row spread across several SSTables]
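As a hypothetical continuation of the read-path sketch above, the example employee row might end up fragmented over three SSTables after a few blind writes; the fragment layout and the later Age update are invented purely for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;

/** Reuses the RowSource/ColumnValue/ReadPath sketch from the read-path slide. */
class FragmentedRowDemo {
    static ColumnValue col(String v, long ts) {
        return new ColumnValue(v.getBytes(StandardCharsets.UTF_8), ts);
    }

    public static void main(String[] args) {
        RowSource memtable = key -> Map.of();                         // nothing buffered in memory
        List<RowSource> sstables = List.of(
            key -> Map.of("First Name", col("Prashanth", 100),        // original insert
                          "Last Name",  col("Menon", 100),
                          "Age",        col("25", 100)),
            key -> Map.of("Department ID", col("MSRG", 200)),         // later blind write
            key -> Map.of("Age", col("26", 300)));                    // newest Age fragment wins
        Map<String, ColumnValue> row = ReadPath.readRow("99231234", memtable, sstables);
        System.out.println(row.keySet());  // the full row is rebuilt from three on-disk fragments
    }
}
```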
Cassandra: Row Cache
• Row cache buffers the full merged row in memory
• Cache miss follows the regular read path, constructs the merged row, and brings it into the cache
• Makes the read path faster for frequently accessed data
• Problem: the row cache occupies memory
  • Takes away precious memory from the rest of the system
• Extend the row cache efficiently onto SSD
[Diagram: row cache and Memtable in memory; Commit Log and SSTables on disk]
Extended Row Cache
• Extend the row cache onto SSD
• Chained with the in-memory row cache
  • LRU in memory, overflow onto an LRU SSD row cache
• Implemented as append-only cache files
  • Efficient sequential writes
  • Fast random reads
• Zero I/O for a hit in the first-level row cache
• One random I/O on SSD for the second-level row cache (see the sketch below)
[Diagram: 1st-level row cache, 2nd-level cache index, and Memtable in memory; 2nd-level row cache on SSD; Commit Log and SSTables on disk]
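A rough sketch of this two-level design, under assumed types and without the serialization, invalidation, and SSD-level eviction a real implementation needs: an in-memory LRU level spills evicted rows to an append-only file on SSD (sequential writes), and an in-memory index of (offset, length) lets a second-level hit be served with exactly one random read:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of a chained row cache: first level is an LRU map in memory, second
 *  level is an append-only cache file on SSD indexed by an in-memory map. */
class TwoLevelRowCache {
    private static class Extent { final long offset; final int length;
        Extent(long offset, int length) { this.offset = offset; this.length = length; } }

    private final int memCapacity;
    private final Map<String, byte[]> memLevel;
    private final Map<String, Extent> ssdIndex = new HashMap<>();
    private final RandomAccessFile ssdFile;        // append-only cache file on the SSD
    private long appendOffset = 0;

    TwoLevelRowCache(String cacheFilePath, int memCapacity) throws IOException {
        this.memCapacity = memCapacity;
        this.ssdFile = new RandomAccessFile(cacheFilePath, "rw");
        this.memLevel = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                if (size() > TwoLevelRowCache.this.memCapacity) {
                    try { spillToSsd(eldest.getKey(), eldest.getValue()); }
                    catch (IOException e) { throw new RuntimeException(e); }
                    return true;                   // evict from memory after spilling to SSD
                }
                return false;
            }
        };
    }

    private void spillToSsd(String key, byte[] row) throws IOException {
        ssdFile.seek(appendOffset);                // sequential append: SSD-friendly writes
        ssdFile.write(row);
        ssdIndex.put(key, new Extent(appendOffset, row.length));
        appendOffset += row.length;
    }

    void put(String key, byte[] mergedRow) { memLevel.put(key, mergedRow); }

    byte[] get(String key) throws IOException {
        byte[] row = memLevel.get(key);
        if (row != null) return row;               // first-level hit: zero I/O
        Extent e = ssdIndex.get(key);
        if (e == null) return null;                // miss: fall back to the regular read path
        byte[] buf = new byte[e.length];
        ssdFile.seek(e.offset);
        ssdFile.readFully(buf);                    // second-level hit: one random SSD read
        return buf;
    }
}
```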
Evaluation: SSD Row Cache
• Setup:
  • 100M rows, 50GB total data, 6GB row cache
• Results:
  • 75% improvement in throughput
  • 75% improvement in latency
  • A RAM-only cache has too low a hit ratio
[Excerpt from the CaSSanDra paper shown on the slide - Fig. 5: Throughput/Latency Results for Row Cache Extension and Dynamic Schema; (a) row cache throughput (ops/sec) at 95%/85%/75% reads for cache disabled, RAM, and RAM+SSD, (b) row cache latency (ms) for the same settings, (c) dynamic schema throughput (ops/sec) at 95%/50%/5% reads for regular vs dynamic schema, (d) dynamic schema latency (ms) for the same settings.]
…and we find this to be much more compelling. In normal operation, data sizes averaged 6.8GB compressed after the initial load of 40 million keys. With a modified Cassandra, data sizes averaged 6.01GB of data, a savings of roughly 10%. This value will grow as the number of columns in the table grows and as column names grow in length.

Another potential benefit of the dynamic schema model (omitted in the interest of space) is in executing column-slice queries. When performing a read from Cassandra, it is possible to read a slice of the row by specifying which columns to read. Though Cassandra has an index per row, it is only a sample; not every column has an appropriate index entry. If we have a schema on hand, we know precisely the layout of the row on disk, which we can use to optimize the read process and avoid cache pollution.

Finally, it is important to note that we are not using high-end enterprise PCIe-bus SSDs (e.g., FusionIO), yet we are getting a substantial performance improvement. Therefore, we conclude that even with inexpensive commodity SSDs, a considerable throughput and latency improvement is achieved.

VII. RELATED WORK

There exists a recent move in the database community to exploit key SSD characteristics, such as fast random reads that are orders of magnitude faster than magnetic physical drives, and to use SSDs to make updates disk-I/O friendly, e.g., [3], [4], [5]. One way to exploit SSDs is to introduce a storage hierarchy in which SSDs are placed as a cache between main memory and disks. This extends the database bufferpool to span over both main memory and SSDs. A novel temperature-based bufferpool replacement policy was introduced in [4], which substantially improved both transactional and analytical query processing in IBM DB2. In our work, we go beyond a simple extension of the bufferpool with SSDs; instead we develop specialized bufferpool enhancements that target the slow read path problem (incurring many random I/Os in order to consolidate across many SSTables) of key-value stores in the context of Cassandra. Furthermore, we introduce the concept of dynamic schema (i.e., dynamic catalogue) that decouples the commonly joint meta-data and data in key-value stores (such as Cassandra [6] and BigTable [7]) by maintaining the schema information on SSDs. Lastly, in [14], similar to our framework, the use of SSDs as cache was also explored in a proof-of-concept key-value store prototype. In contrast, we introduce the storage hierarchy and our SSD caching techniques within a commercialized key-value store. Furthermore, we identify new avenues for exploiting the use of SSDs within key-value stores, namely, our dynamic cataloguing technique.

VIII. CONCLUSION

In this paper, we investigated the performance benefits of SSDs in key-value stores. We benchmarked different configurations of SSD and HDD combinations. We proposed and implemented two specific optimizations for SSD-HDD hybrid systems and showed their effectiveness in detailed benchmarks. Our extended row cache strategy transparently stores hot data on SSD and thus extends the row cache in Cassandra. Our benchmarking results show that this extension can achieve improvements of 85% for realistic workloads. Our second technique for SSD-HDD hybrid systems is a dynamic schema catalogue. It reduces the disk impact of row-level schema models and thus increases the performance of common workloads and data sets.

For future work, we will adapt our methodology so it can be directly run on SSD instead of going through the FTL. This will increase the performance of the SSD operations and allow for SSD-optimized data structures and algorithms.
REFERENCES

[1] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh, and A. H. Byers, "Big data: The Next Frontier for Innovation, Competition, and Productivity," McKinsey Global Institute, Tech. Rep., 2011.
[2] T. Rabl, M. Sadoghi, H.-A. Jacobsen, S. Gomez-Villamor, V. Muntes-Mulero, and S. Mankowskii, "Solving Big Data Challenges for Enterprise Application Performance Management," PVLDB, 2012.
[3] M. Canim, G. A. Mihaila, B. Bhattacharjee, K. A. Ross, and C. A. Lang, "An object placement advisor for DB2 using solid state storage," PVLDB, 2009.
[4] ——, "SSD bufferpool extensions for database systems," PVLDB, 2010.
[5] M. Sadoghi, K. A. Ross, M. Canim, and B. Bhattacharjee, "Making updates disk-I/O friendly using SSDs," PVLDB, 2013.
[6] A. Lakshman and P. Malik, "Cassandra: a decentralized structured storage system," SIGOPS Review, 2010.
[7] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A distributed storage system for structured data," in OSDI, 2006.
[8] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels, "Dynamo: Amazon's Highly Available Key-Value Store," in SOSP, 2007.
[9] R. Cattell, "Scalable SQL and NoSQL data stores," SIGMOD Record, 2010.
[10] M. Cornwell, "Anatomy of a solid-state drive," Communications of the ACM, 2012.
[11] L. Bouganim, B. Þór Jónsson, and P. Bonnet, "uFLIP: Understanding Flash IO Patterns," in CIDR, 2009.
[12] G. Graefe, "The Five-Minute Rule 20 Years Later: and How Flash Memory Changes the Rules," Communications of the ACM, 2009.
[13] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears, "Benchmarking cloud serving systems with YCSB," in SoCC, 2010.
[14] B. Debnath, S. Sengupta, and J. Li, "FlashStore: high throughput persistent key-value store," PVLDB, 2010.
Dynamic Schema
• Key-value stores covet a schema-less data model
  • Very flexible, good for highly varying data
  • Schemas often change; defining them up front can be detrimental
• Observation: many big data applications have relatively stable schemas
  • e.g., click streams, APM, sensor data, etc.
• Redundant schemas have significant overhead in I/O and space usage
On-Disk Format:
Metric Name: HostA/AgentX/AVGResponse | Timestamp: 1332988833 | Value: 4 | Max: 6 | Min: 1
Metric Name: HostA/AgentX/AVGResponse | Timestamp: 1332988848 | Value: 5 | Max: 7 | Min: 1
Metric Name: HostA/AgentX/Failures    | Timestamp: 1332988849 | All: 4   | Warn: 3 | Error: 1

Application Format:
Metric Name              | Timestamp  | Value | Max | Min
HostA/AgentX/AVGResponse | 1332988833 | 4     | 6   | 1
Dynamic Schema (III)
• Don't serialize the redundant schema with rows
• Extract the schema from the data, store it on SSD, and serialize the schema ID with the data (sketched below)
• Allows for a large number of schemas
Old Disk Format:
Metric Name: HostA/AgentX/AVGResponse | Timestamp: 1332988833 | Value: 4 | Max: 6 | Min: 1
Metric Name: HostA/AgentX/AVGResponse | Timestamp: 1332988848 | Value: 5 | Max: 7 | Min: 1
Metric Name: HostA/AgentX/Failures    | Timestamp: 1332988849 | All: 4   | Warn: 3 | Error: 1

Schema Catalogue (on SSD):
S1: Metric Name, Timestamp, Value, Max, Min
S2: Metric Name, Timestamp, All, Warn, Error

New Disk Format:
HostA/AgentX/AVGResponse | 1332988833 | S1 | 4 | 6 | 1
HostA/AgentX/AVGResponse | 1332988848 | S1 | 5 | 7 | 1
HostA/AgentX/Failures    | 1332988849 | S2 | 4 | 3 | 1
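A minimal sketch of the catalogue idea, with invented types: the ordered set of column names in a row is looked up in (or added to) a catalogue and replaced by a small schema ID, so only the ID and the values are serialized with the row. Mapping the example above, (Metric Name, Timestamp, Value, Max, Min) would become S1 and (Metric Name, Timestamp, All, Warn, Error) would become S2; a real implementation would persist the catalogue on SSD and handle the byte-level encoding:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative dynamic schema catalogue: column-name sets are interned as IDs. */
class DynamicSchemaCatalogue {
    private final Map<List<String>, Integer> schemaIds = new HashMap<>();
    private final List<List<String>> schemas = new ArrayList<>();

    /** Returns the schema ID for this ordered set of column names, registering it if new. */
    synchronized int schemaIdFor(List<String> columnNames) {
        return schemaIds.computeIfAbsent(List.copyOf(columnNames), cols -> {
            schemas.add(cols);
            return schemas.size() - 1;            // e.g. S1 = 0, S2 = 1, ...
        });
    }

    /** Encodes a row as (schemaId, values...) instead of repeating every column name. */
    Object[] encode(LinkedHashMap<String, Object> row) {
        int id = schemaIdFor(new ArrayList<>(row.keySet()));
        Object[] out = new Object[row.size() + 1];
        out[0] = id;
        int i = 1;
        for (Object v : row.values()) out[i++] = v;
        return out;
    }

    /** Decodes by re-attaching the column names from the catalogue. */
    LinkedHashMap<String, Object> decode(Object[] encoded) {
        List<String> cols = schemas.get((Integer) encoded[0]);
        LinkedHashMap<String, Object> row = new LinkedHashMap<>();
        for (int i = 0; i < cols.size(); i++) row.put(cols.get(i), encoded[i + 1]);
        return row;
    }
}
```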
Evaluation: Dynamic Schema
• Setup:
  • 40M rows, variable columns 5-10 (638 schemas), 6GB row cache
• Results:
  • 10% reduction in disk usage (6.8GB vs 6GB)
  • Slightly improved throughput, stable latency
• Effective SSD usage (only random reads) and reduced I/O and space usage
Conclusions
• Storing Cassandra commit logs on SSD doesn't help
• Running SSDs at or near capacity degrades their performance
• Using SSDs as a secondary row cache dramatically improves performance
• Extracting redundant schemas onto an SSD reduces disk space usage and required I/O
Thanks!!
• Questions?
• Contact:
  • Prashanth Menon (prashanth.menon@utoronto.ca)
Future Work
• What types of tables benefit most from a dynamic schema?
• Impact of compaction on read-heavy workloads
  • How can SSDs be used to improve the performance of compaction?
• What is the performance when storing only SSTable indexes on SSD?