Building Data Pipelines with SMACK: Designing Storage Strategies for Scale and Performance

Building Data Pipelines with SMACK: Storage Strategies for Scale & Performance June 8, 2016 Jonathan Shook, Solution Architect, DataStax

Transcript of Building Data Pipelines with SMACK: Designing Storage Strategies for Scale and Performance

Page 1: Building Data Pipelines with SMACK: Designing Storage Strategies for Scale and Performance

Building Data Pipelines with SMACK: Storage Strategies for Scale & Performance
June 8, 2016
Jonathan Shook, Solution Architect, DataStax

Page 2

Spark

Mesos

Akka

Cassandra

Kafka

Page 3

1 Essential Storage Concepts

2 Design Strategies

3 Storage Selection

4 Q & A

© DataStax, All Rights Reserved.

Page 4

Essential Storage Concepts: The Basics

Page 5

Important Terms

• Topology

• Bandwidth, Throughput, Headroom

• Latency, Minimum Latency

• Concurrency, Parallelism, Contention

Page 6

Basic System Topology


Every modern system is essentially a network of components.

The language of message delivery applies at every level of design.

System Topology Example (high level)

[diagram: a network of components, terminating at HDD and SSD devices]

Page 7

Term: Bandwidth, Throughput, Headroom

• Bandwidth - Maximum rated transfer speed of a device

• Throughput - Measurement of achievable transfer speed

• Headroom - Safety margin above normal usage - “reserve capacity”

Page 8

Throughput Example: SATA3

Using a popular SSD and an online benchmark...


Bandwidth: 6Gb/s (750MB/s)
Throughput: 40MB/s-500MB/s as tested, depending on operation type
Headroom: 30%, for example. This is a design parameter.

In this case, if you can achieve 200MB/s of throughput on the drive for your operational patterns, a headroom of 30% means you should be scaling out before your metrics show 140MB/s.
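That threshold arithmetic can be put in a few lines (a minimal sketch using this example's own numbers - 200MB/s of achievable throughput and a 30% headroom target):

```python
# Minimal sketch: the scale-out threshold implied by a headroom target.
# Inputs are this example's numbers: 200 MB/s achievable throughput
# and a 30% headroom design parameter.

def scale_out_threshold(throughput_mb_s, headroom_fraction):
    """Throughput level at which scale-out should already be underway."""
    return throughput_mb_s - throughput_mb_s * headroom_fraction

print(scale_out_threshold(200, 0.30))  # 140.0 (MB/s)
```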

Page 9

Term: Latency and Minimum Latency

• Latency - How long it takes to receive a response, once a request is submitted

• Minimum Latency - Latency which is possible on a single node when there is no resource contention


Single Node:
• However fast that node can service the request, uncontended.

Replica Set of 3 Nodes and LOCAL_QUORUM:
• Writes: The fastest 2 of 3 nodes in the replica set to respond.
• Reads: Usually the fastest 2 of 3, based on latency trends.
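The quorum behavior above can be sketched in a couple of lines (the per-replica latencies here are hypothetical, purely for illustration):

```python
# Sketch: with replication factor 3 and LOCAL_QUORUM, the coordinator
# waits for 2 of 3 replica acknowledgements, so the observed latency
# is that of the 2nd-fastest replica. Latencies are hypothetical, in ms.

def quorum_latency(replica_latencies_ms, acks_needed=2):
    return sorted(replica_latencies_ms)[acks_needed - 1]

# One badly contended replica (45 ms) is masked by the quorum:
print(quorum_latency([0.8, 1.1, 45.0]))  # 1.1
```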

Page 10

Latency and Throughput Example: Random reads at different block sizes


SATA HDD has an unavoidable seek-time penalty at all op sizes. Throughput tops out at 180MB/s at 16MB read sizes, with over 1.5 seconds of latency.

SATA SSD performs well. 550MB/s is possible, but desirable latencies are found below 1MB read size.

The NVMe drive can push two CDs' worth of data per second at 128KB read sizes. At 16MB, latency is only 0.25 seconds.

Page 11

Latency and Throughput Example: Compared by Drive Type

This shows the same measurements compared between drive types.

Page 12

Latency & Throughput Example: Comparative Numbers


1 block read (512 bytes)

           KB/s       µs latency   iops
NVMe       62006      177          124013
SATA SSD   38700      306          77400
SATA HDD   215        119000       430

256 block read (128 KB)

           KB/s       µs latency   iops
NVMe       1707520    1160         13339
SATA SSD   549133     2320         4290
SATA HDD   41198      157000       321

32K block read (16 MB)

           KB/s       µs latency   iops
NVMe       1339596.8  235000       81
SATA SSD   554920     594000       33
SATA HDD   179063     1647000      10
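The columns in these tables are linked by throughput = iops × block size; as a sanity check (a sketch reading numbers straight from the rows above):

```python
# Sketch: relate the table columns via throughput = iops * block size.

def throughput_kb_s(iops, block_bytes):
    return iops * block_bytes / 1024  # KB/s

# NVMe, 512-byte reads: 124013 iops -> ~62006 KB/s, as tabulated.
print(throughput_kb_s(124013, 512))       # 62006.5
# SATA SSD, 128 KB reads: 4290 iops -> ~549120 KB/s, close to 549133.
print(throughput_kb_s(4290, 128 * 1024))  # 549120.0
```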

Page 13

Term: Concurrency, Parallelism, Contention

• Concurrency - Multiple requests in flight

• Parallelism - Simultaneous processing of requests

• Resource Contention - When work is blocked awaiting access to a shared resource

Concurrency without parallelism causes resource contention, queueing, latency increases, and unhappy users.

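That cause-and-effect can be shown with a toy model (all numbers hypothetical; it assumes simple FIFO batching, a deliberate simplification of real device queueing):

```python
# Toy model: `concurrent` requests arrive at once, each needing
# `service_ms` milliseconds; the resource can serve `workers` of them
# in parallel. FIFO queueing means request i waits i // workers
# service intervals before being served.

def avg_latency_ms(concurrent, workers, service_ms):
    latencies = [(i // workers + 1) * service_ms for i in range(concurrent)]
    return sum(latencies) / concurrent

print(avg_latency_ms(1, 1, 5))  # 5.0  - uncontended minimum latency
print(avg_latency_ms(8, 1, 5))  # 22.5 - concurrency without parallelism
print(avg_latency_ms(8, 8, 5))  # 5.0  - parallelism restores latency
```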

Page 14

(Storage) Design Strategies: Core Strategies for Going Fast and Staying Fast

Page 15

Key Design Strategies

1. Design to the Workload

2. Simplify the Storage Path

3. Maintain Headroom

4. Balance Compute and I/O

5. Balance I/O Caching

Page 16

Strategy #1: Design to the Workload

• Estimate your workloads. Focus on the read patterns.

• Can your users endure effects of resource contention?

• Can they endure disruptive outliers?

• How do you know?

Page 17

Strategy #2: Simplify the Storage Path


• Avoid unnecessary hardware layers. Go directly from your system chipset to the drive when possible.

• Favor JBOD over storage aggregation.

• Only use RAID for:
– Datacenter or Operator Standards with HDDs. (Try to avoid RAID with SSDs if possible.)
– Aggregating smaller disks. (Why not just get larger drives for JBOD?)

Page 18

Strategy #3: Maintain Headroom

• Build in headroom according to your loading patterns.

• Measure your system with bench tools.

• Saturate during non-prod testing, and use that as a reference point in production.

Page 19

Strategy #4: Balance Compute and I/O


• Databases are not just storage APIs.

• You need to keep your CPU and IO throughput in relative balance.

• Perfection is not required, but extreme imbalances are no fun.

• There will always be a bottleneck.

Page 20

Strategy #5: Balance I/O Caching


• Understand the potential benefits of caching: best and worst cases.

• “Unused” memory in Linux is available for caching.

• Don’t depend on cache to solve cold read latencies.

• Design around cold-read performance first.

Page 21

Storage Selection: Build for Effect

Page 22

SANs for distributed databases... It’s a bad idea.

Be strongly skeptical when anybody tells you otherwise. Perhaps they haven’t tried it yet, or they are ignoring the obvious.

You don’t have to suffer the pains of others in order to learn from their experiences. Still, some insist on trying.

Page 23

HDD vs. SSD


HDD

Pro:
● Cheap?

Con:
● All concurrent operations are contended
● Random access is slow - drive seek
● Power usage
● Lower latencies come with much higher costs
● Little room for further improvement

SSD

Pro:
● Cheap? (1TB ~ $300)
● Fast
● Low internal contention
● Runs cooler / lower wattage
● Faster transport technology available

Con:
● Initial capacities available encouraged RAID shenanigans → No longer an issue for reasonable data densities with Cassandra/DSE.
● MTBF of earlier designs → No longer an issue, as SSDs have made huge strides in reliability and DWPD limits.
● Initial cost → No longer an issue.

Page 24

Workload Concurrency & Storage Parallelism


Page 25

Selecting SSD vs. HDD

Favor modern SSDs by default.

Use HDDs only if you must, for:
● High-write applications with low read concurrency
● Archival or logging systems with low read concurrency
● Commit log storage, if you have the option
● Persistent messaging systems
● Non-latency-sensitive batch/analytics workloads

Page 26

Storage Path


A) Direct SSD
B) Direct HDD
C) NVMe
D) SSDs via HBA
E) HDDs via HBA
F) Combo via HBA

We’ll come back to this slide if we have time.


Page 27

Data Density

• Keep data density in reasonable bounds.

• Every database must deal with the realities of storage traversal.

• Avoid trying to store too much data on a node.

Page 28

In Conclusion...

• Provision with headroom to avoid unnecessary contention.

• Select hardware to support user and workload requirements.

• Keep the storage path as simple as possible.

• Consider SSDs by default for your data directories.

Page 29

Coming Soon!

● June 23: Top 5 Reasons Why DSE is Game Changing

● July 7: Proofpoint & DataStax Webinar

● For the latest schedule of webinars, check out our Webinars page: http://www.datastax.com/resources/webinars

Page 30

Get your SMACK on!

Thank You!

Follow me on Twitter: @Shookinator

Page 31

THANK YOU!

Page 32

Q & A

Page 33

Additional Resources

Page 34

Latency Spectrum for small ops

Page 35

Math relating to Scale & Performance

Little’s Law: Relates latency, concurrency, and throughput as averages.

Amdahl’s Law: Relates latency to improvements in working resources.

Pigeonhole principle: The statistics of the pigeonhole principle come up again and again in distributed computing.

Latency numbers every programmer should know.
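As a worked example of the first item (hypothetical numbers), Little's Law says mean concurrency equals mean throughput times mean latency:

```python
# Sketch of Little's Law: L = lambda * W
# (mean requests in flight = mean throughput * mean latency).

def required_concurrency(throughput_ops_s, mean_latency_s):
    return throughput_ops_s * mean_latency_s

# Sustaining 10,000 ops/s at 2 ms mean latency needs ~20 in flight:
print(required_concurrency(10_000, 0.002))  # 20.0
```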

Page 36

Online Resources

C* Microbench scripts: Fio scripts to measure a disk subsystem across many C*-style workloads.
https://github.com/jshook/perfscripts

Al’s Tuning Guide: https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html

Page 37

Terms: Concurrency, Parallelism, visually


[diagrams: “concurrency only” vs. “concurrency with parallelism”]

Page 38

Addendum: What about RAID?

See IBM Patent 4092732 about a 1978 solution to a 1978 problem: drives were very unreliable, and systems were not resilient to failure. In 1978, parallelism was pronounced “mainframe”. Times have changed.

System topologies of today expose storage parallelism all the way to the drive. Cassandra allows drive failure without cluster failure. Cassandra can make direct use of the parallelism exposed at the storage layer.
