Part 1: Storage and Retrieval - Anton Fagerberg

Transcript of Part 1: Storage and Retrieval - Anton Fagerberg

Page 1: Part 1: Storage and Retrieval - Anton Fagerberg

Designing Data‑Intensive Applications

Part 1: Storage and Retrieval

(Which really is chapter 3)

Page 2: Part 1: Storage and Retrieval - Anton Fagerberg

Technology is a powerful force in our society. Data, software, and communication can be used for bad: to entrench unfair power structures, to undermine human rights, and to protect vested interests. But they can also be used for good: to make underrepresented people's voices heard, to create opportunities for everyone, and to avert disasters. This book is dedicated to everyone working toward the good.

Page 3: Part 1: Storage and Retrieval - Anton Fagerberg

This book has over 800 references to articles, blog posts, talks, documentation, and more...

Page 4: Part 1: Storage and Retrieval - Anton Fagerberg

Turning the database inside‑out
https://martin.kleppmann.com/2015/11/05/database-inside-out-at-oredev.html

Page 5: Part 1: Storage and Retrieval - Anton Fagerberg

CAP is broken, and it's time to replace it

Page 6: Part 1: Storage and Retrieval - Anton Fagerberg

Data‑intensive = data is its primary challenge

Quantity of data

Complexity of data

Speed of change

As opposed to compute‑intensive, where CPU cycles are the bottleneck

Page 7: Part 1: Storage and Retrieval - Anton Fagerberg

[...] the term "Big Data" is so overused and underdefined that it is not useful in a serious engineering discussion.

Page 8: Part 1: Storage and Retrieval - Anton Fagerberg

Direction

Companies need to handle huge volumes of data traffic

CPU clock speeds are barely increasing

Multi‑core processors are standard

Networks are getting faster

Services are expected to be highly available

Page 9: Part 1: Storage and Retrieval - Anton Fagerberg

Chapter 3: Storage and Retrieval

Page 10: Part 1: Storage and Retrieval - Anton Fagerberg

A database needs to do two things:

when you give it some data, it should store the data

when you ask it again later, it should give the data back to you

Page 11: Part 1: Storage and Retrieval - Anton Fagerberg

World’s simplest database, implemented as two Bash functions

#!/bin/bash

db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}

Page 12: Part 1: Storage and Retrieval - Anton Fagerberg

> source db.sh; db_set hello world
> cat database
hello,world
> source db.sh; db_set hello foo
> cat database
hello,world
hello,foo
> source db.sh; db_get hello
foo

Page 13: Part 1: Storage and Retrieval - Anton Fagerberg

Good

Performance ‑ appending to a file is very efficient

Using a log (append‑only) internally is common

Bad

Update doesn't remove old data

Read scans the entire database
Double the number of records, twice as slow

Page 14: Part 1: Storage and Retrieval - Anton Fagerberg

How do we avoid running out of disk space?

Break the log into segments of a certain size

Make subsequent writes to a new segment file

Perform compaction on the segments

Page 15: Part 1: Storage and Retrieval - Anton Fagerberg

Compaction

Often makes segments much smaller (keys get overwritten)

Can be done on multiple segments at once

Segments are never modified

Merging and compaction can run on a background thread

After merging, point reads to the new segment and delete the old segments
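The merge-and-compact step can be sketched in a few lines of Python. This is illustrative only ‑ a real engine works on segment files and byte offsets, not in-memory lists:

```python
def compact(segments):
    # `segments` is ordered oldest -> newest; each segment is a list of
    # (key, value) pairs in write order. Later writes win.
    latest = {}
    for segment in segments:
        for key, value in segment:
            latest[key] = value   # newer entries overwrite older ones
    # The survivors become a single new segment; old segments can be deleted.
    return list(latest.items())
```

With the `hello,world` / `hello,foo` example from the Bash database, `compact` keeps only `hello -> foo`.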

Page 16: Part 1: Storage and Retrieval - Anton Fagerberg

Speed up reads: Index

Page 17: Part 1: Storage and Retrieval - Anton Fagerberg

Index

Additional structure derived from the primary data

Adding / removing indexes doesn't affect the content

Only affects the performance of queries

Well‑chosen indexes speed up read queries

Usually slows down writes ‑ which is why they're not enabled by default

Requires knowledge about application's typical query patterns

Cost / benefit

Page 18: Part 1: Storage and Retrieval - Anton Fagerberg

Hash Indexes

Keep an in‑memory hash map where every key is mapped to a byte offset in the data file
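A minimal sketch of that idea in Python, in the spirit of Bitcask (the class name and file format are illustrative, not Bitcask's actual API; it keeps the same naive `key,value` format as the Bash example, so commas and newlines in keys/values are not handled):

```python
class HashIndexedLog:
    """Append-only log file plus an in-memory hash map of key -> byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}            # key -> byte offset of latest record
        open(path, "a").close()    # ensure the log file exists

    def set(self, key, value):
        with open(self.path, "ab") as f:
            offset = f.tell()      # byte offset where this record starts
            f.write(f"{key},{value}\n".encode())
        self.index[key] = offset   # point the index at the newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)         # jump straight to the record, no scan
            line = f.readline().decode().rstrip("\n")
        return line.split(",", 1)[1]
```

Unlike `db_get`, a read here is one seek plus one line read, regardless of database size ‑ but every key must fit in the in-memory map.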

Page 19: Part 1: Storage and Retrieval - Anton Fagerberg

Hash index

Sounds simplistic but is a viable approach

Essentially what Bitcask (the default Riak storage engine) is doing

Offers high‑performance reads and writes

Suited when the value for each key is updated frequently

Requires that keys fit into the available RAM

Page 20: Part 1: Storage and Retrieval - Anton Fagerberg

Index & compaction

Each segment has its own in‑memory hash map

Mapping key to offset

On lookup, check the most recent segment's hash map
If not present, pick the second most recent (and so on)

Merging process keeps the number of segments small

Lookup doesn't need to check many hash maps

Page 21: Part 1: Storage and Retrieval - Anton Fagerberg

Improvements

File format
Binary format is faster and simpler

Deletions
Special "tombstone" record

Crash recovery (in‑memory hash map is lost)
Re‑building is possible but slow
Bitcask stores a snapshot on disk

Partially written records
Bitcask includes checksums

Concurrency control
Common to have one writer thread

Page 22: Part 1: Storage and Retrieval - Anton Fagerberg

Good things

Appending and segment merging are sequential write operations

Much faster than random writes, especially on spinning disks

To some extent preferable on SSDs too (see book)

Merging old segments avoids data files getting fragmented over time

Immutability is good ‑ no worries about crashes during writes

Page 23: Part 1: Storage and Retrieval - Anton Fagerberg

Problems

Hash map must fit into memory

Hash map on disk is difficult to make performant

Range queries are not efficient
Can't find all people with age > 20 and age < 50

Every key must be looked up in the hash map

Page 24: Part 1: Storage and Retrieval - Anton Fagerberg

Sorted String Tables & LSM‑Trees

(SSTables)

Page 25: Part 1: Storage and Retrieval - Anton Fagerberg

Simple change

Require that the sequence of key‑value pairs is sorted by key

Page 26: Part 1: Storage and Retrieval - Anton Fagerberg

Mergesort

Copy the "lowest" key

If identical ‑ keep the value from the most recent segment
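The merge step above can be sketched with the standard-library streaming merge. A sketch under simplifying assumptions (segments as in-memory sorted lists, not files):

```python
import heapq

def merge_segments(segments):
    # `segments` is ordered oldest -> newest; each is a list of (key, value)
    # pairs already sorted by key, as in an SSTable.
    merged = {}
    # Tag each entry with the segment's age so heapq.merge keeps duplicates
    # in oldest-first order; the dict write below then lets the newest win.
    entries = heapq.merge(*[
        [(key, age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    ])
    for key, _age, value in entries:
        merged[key] = value   # a later (newer) duplicate overwrites
    return sorted(merged.items())
```

Because every input is already sorted, the merge reads each segment sequentially, exactly like the merge phase of mergesort.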

Page 27: Part 1: Storage and Retrieval - Anton Fagerberg

Find by key

handiwork has an unknown exact offset, but it must be between handbag and handsome

Jump to handbag and scan until found (or not)

Still need an in‑memory index, but it can be small & sparse
One key for every few kB (scanned quickly)
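That lookup can be sketched with a binary search over the sparse index. Here `records` stands in for the segment file (a sorted list of pairs) and the "offset" is a list position rather than a byte offset ‑ a simplification:

```python
import bisect

def sstable_get(sparse_index, records, key):
    # `sparse_index` is a sorted list of (key, position) samples, one entry
    # every few kB in a real engine.
    keys = [k for k, _ in sparse_index]
    # Find the last indexed key <= the key we want...
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None               # key sorts before everything indexed
    _, pos = sparse_index[i]
    # ...then scan forward from there until we find it or pass it.
    for k, v in records[pos:]:
        if k == key:
            return v
        if k > key:
            return None           # records are sorted, so we've gone past it
    return None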

Page 28: Part 1: Storage and Retrieval - Anton Fagerberg

Improve scans on read requests

Group records into a block and compress it before writing to disk

Each entry of the sparse in‑memory index points to the start of a compressed block

Saves disk space and reduces I/O bandwidth use

Page 29: Part 1: Storage and Retrieval - Anton Fagerberg

Constructing and maintaining SSTables

(Sorted segments)

Maintaining on disk is possible (B‑Trees later)

Maintaining it in memory is much easier
Well‑known data structures: red‑black trees / AVL trees

Insert keys in any order, get them back ordered

Sometimes called a "memtable"

When the memtable gets bigger than some threshold (a few MB)
Write it to disk as an SSTable file

Efficient ‑ already sorted!

While writing to disk, start maintaining a new memtable in memory
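The memtable-plus-SSTables flow can be sketched as below. Python's stdlib has no red-black tree, so this sketch sorts at flush time instead of keeping the memtable ordered as it goes; the tiny threshold and in-memory "SSTables" are illustrative only:

```python
MEMTABLE_LIMIT = 4   # tiny threshold for illustration; real ones are MBs

class LSMWriter:
    """Minimal memtable + SSTable sketch (not a real storage engine)."""

    def __init__(self):
        self.memtable = {}   # stand-in for a red-black / AVL tree
        self.sstables = []   # newest last; each is a sorted list of pairs

    def set(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= MEMTABLE_LIMIT:
            self.flush()

    def flush(self):
        # Writing out is cheap because the data leaves already sorted.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}   # start a fresh memtable for new writes

    def get(self, key):
        if key in self.memtable:                 # 1. check the memtable
            return self.memtable[key]
        for table in reversed(self.sstables):    # 2. newest segment first
            for k, v in table:                   # (a real engine would binary-search)
                if k == key:
                    return v
        return None
```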

Page 30: Part 1: Storage and Retrieval - Anton Fagerberg

Running

To serve a read request:

check memtable

most recent on‑disk segment

next segment, and so on

From time to time, run compaction in the background

Data is sorted ‑ efficient range queries

Disk writes are sequential ‑ high write throughput

Strategies for compaction & merging (size‑tiered / leveled)

Page 31: Part 1: Storage and Retrieval - Anton Fagerberg

TL;DR

In size‑tiered compaction, newer and smaller SSTables are successively merged into older and larger SSTables. In leveled compaction, the key range is split up into smaller SSTables and older data is moved into separate "levels," which allows the compaction to proceed more incrementally and use less disk space.

Page 32: Part 1: Storage and Retrieval - Anton Fagerberg

Problem 1: on crash the memtable is wiped

Maintain a separate append‑only log, written to immediately

Not sorted, only used on restore

When memtable is written to an SSTable, discard log

Page 33: Part 1: Storage and Retrieval - Anton Fagerberg

Problem 2: looking up non‑existing keys

Check the memtable and ALL segment files

Use Bloom filters to approximate the contents of the set
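A Bloom filter answers "definitely not present" or "maybe present", so most lookups for missing keys can skip a segment entirely. A tiny sketch (sizes and hash scheme are illustrative, not what any particular engine uses):

```python
import hashlib

class BloomFilter:
    """May report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.k = num_hashes
        self.bits = 0                 # a big int serves as the bit array

    def _positions(self, key):
        # Derive k bit positions from k independent-ish hashes of the key.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

If `might_contain` returns False, the segment need not be read at all; if True, the segment is checked as usual.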

Page 34: Part 1: Storage and Retrieval - Anton Fagerberg

B‑Trees

Log‑structured indexes are gaining acceptance ‑ but the most widely used is the B‑tree

Page 35: Part 1: Storage and Retrieval - Anton Fagerberg

Similarities with SSTables

Keep key‑value pairs sorted by key

Efficient lookup and range queries

Otherwise very different

Page 36: Part 1: Storage and Retrieval - Anton Fagerberg

B‑trees

Break the database down into fixed‑size "blocks" or "pages"

Read or write one page at a time

The design corresponds to the underlying hardware
Disks are also arranged in fixed‑size blocks

Each page can be identified using an address or location
One page can refer to another (similar to pointers)

Page 37: Part 1: Storage and Retrieval - Anton Fagerberg

B‑trees

Page 38: Part 1: Storage and Retrieval - Anton Fagerberg

B‑trees

One page is designated the root (lookup starts here)

A page contains several keys and references to child pages

Each child is responsible for a continuous range of keys

Keys between references indicate the boundaries

Page 39: Part 1: Storage and Retrieval - Anton Fagerberg

Updates

Find the leaf page

Change the value in that page and write the page back to disk

References remain valid

Add a new key
Find the page whose range encompasses the new key and add it

If there isn't enough free space
Split it into two half‑full pages

Update parent page

Ensures the tree is balanced

Page 40: Part 1: Storage and Retrieval - Anton Fagerberg

Adding key 334

Page 41: Part 1: Storage and Retrieval - Anton Fagerberg

A B‑tree with n keys always has a depth of O(log n)

Most databases can fit into a B‑tree that is three or four levels deep, so you don't need to follow many page references

A four‑level tree of 4 KB pages with a branching factor of 500 can store up to 256 TB
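One way to arrive at that 256 TB figure (assuming decimal terabytes and counting 500^4 addressable bottom-level pages):

```python
branching_factor = 500
page_size = 4 * 1024            # 4 KB pages
levels = 4

# Each level multiplies the fan-out by 500, so four levels can address
# 500^4 pages of 4 KB each.
capacity_bytes = branching_factor ** levels * page_size

print(capacity_bytes // 10**12)   # -> 256 (terabytes)
```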

Page 42: Part 1: Storage and Retrieval - Anton Fagerberg

Making B‑trees reliable

Basic underlying write operation is to overwrite a page with new data

Assumed that the write does not change the location of the page

Some operations require several pages to be overwritten ‑ dangerous on crash

Use a write‑ahead log (WAL, a.k.a. redo log)

Append‑only structure, written to before the tree itself

Restore the B‑tree with it after a crash

Concurrency control (update in place)
Use latches (lightweight locks)

More complicated than logs

Page 43: Part 1: Storage and Retrieval - Anton Fagerberg

B‑tree optimizations

Copy‑on‑write (instead of WAL for crash recovery)

Save space by not storing the entire key
Only need to provide enough information to act as boundaries between key ranges

Packing more keys into a page ‑ higher branching factor, fewer levels

Lay out the tree so leaf pages appear in sequential order on disk
Difficult to maintain when the tree grows

Additional pointers (sibling references) for faster scans

B‑tree variants such as fractal trees borrow some log‑structured ideas to reduce disk seeks

Page 44: Part 1: Storage and Retrieval - Anton Fagerberg

Comparing B‑Trees & LSM‑trees

Page 45: Part 1: Storage and Retrieval - Anton Fagerberg

Advantages of LSM‑trees

A B‑tree index must write all data at least twice

Write‑ahead log (WAL)

Actual tree page (and perhaps again if pages are split)

B‑trees have overhead from writing an entire page at a time
Some even overwrite a page twice to avoid partially updated pages on power failure

(Although logs are also rewritten several times ‑ write amplification)

Typically higher write throughput (lower write amplification)
Mostly on magnetic drives, where sequential writes are fast

Better compression (smaller files on disk)

Less fragmentation (B‑tree splits leave page space unused)

Page 46: Part 1: Storage and Retrieval - Anton Fagerberg

Downsides of LSM‑trees

Compaction can interfere with performance (reads & writes)

Response time of queries can be high (B‑trees are more predictable)

The disk's finite bandwidth is shared between compaction and writes

On high throughput, compaction won't keep up

LSM‑trees store multiple copies of the same key

B‑trees have "built in" support for transaction isolation because locks can be attached to the tree

Page 47: Part 1: Storage and Retrieval - Anton Fagerberg

Secondary indexes

Same thing, but keys are not unique

Store a list of matching row identifiers

Make key unique by appending row identifier
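The append-the-row-identifier trick can be sketched as follows (the "city" column and row layout are hypothetical, just to make the idea concrete):

```python
def build_secondary_index(rows):
    # Index on a non-unique column ("city" here); each index key is made
    # unique by appending the row identifier to it.
    return sorted((((row["city"], row_id), row_id) for row_id, row in rows))

def lookup(index, city):
    # All matches share the same key prefix, so they sit next to each other;
    # a real engine would find them with a range scan, not a full pass.
    return [row_id for (indexed_city, _), row_id in index if indexed_city == city]
```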

Page 48: Part 1: Storage and Retrieval - Anton Fagerberg

Storing values with the index

The value can be either:

the actual row (document, vertex)

a reference to the row stored elsewhere (heap file)

Page 49: Part 1: Storage and Retrieval - Anton Fagerberg

Heap file

The heap file approach is common ‑ no duplicate data across secondary indexes

On update (larger value)
Move to a new location in the heap file

all indexes need to be updated

or, a forward pointer is added to the heap file

Page 50: Part 1: Storage and Retrieval - Anton Fagerberg

If the extra hop to the heap file is too expensive
Use a clustered index: store the row with the index (MySQL's InnoDB)

Primary key is always a clustered index

Secondary indexes refer to the primary key (not heap file)

Why Uber switched from Postgres to MySQL
https://eng.uber.com/mysql-migration/

Page 51: Part 1: Storage and Retrieval - Anton Fagerberg

Covering index

Stores some of a table's columns with the index

Allows some queries to be answered using the index alone

Page 52: Part 1: Storage and Retrieval - Anton Fagerberg

Limitations

Page 53: Part 1: Storage and Retrieval - Anton Fagerberg

Problem 1: Multi‑column indexes

SELECT * FROM restaurants
WHERE latitude > 51.4946 AND latitude < 51.5079
  AND longitude > -0.1162 AND longitude < -0.1004;

A B‑tree or LSM‑tree can't answer this query efficiently

Only all restaurants in a longitude range (but anywhere between the North and South pole)

Page 54: Part 1: Storage and Retrieval - Anton Fagerberg

Multi‑column indexes

Translate into a single number with a space‑filling curve (and then use a regular B‑tree)

Specialized spatial indexes (R‑trees)

2D index
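One classic space-filling curve for this is the Z-order (Morton) curve; a sketch of the bit-interleaving (coordinates would first have to be quantized to non-negative integers, which this sketch assumes):

```python
def z_order(x, y, bits=16):
    # Interleave the bits of x and y into one Morton (Z-order) number.
    # Nearby (x, y) points tend to get nearby z values, so a regular
    # B-tree over z can serve rough 2D range queries.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x occupies the even bits
        z |= ((y >> i) & 1) << (2 * i + 1)    # y occupies the odd bits
    return z
```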

Page 55: Part 1: Storage and Retrieval - Anton Fagerberg

Problem 2: full‑text search, fuzzy indexes
See book (references)

Page 56: Part 1: Storage and Retrieval - Anton Fagerberg

Or ‑ just keep everything in memory

RAM becomes cheaper (the cost‑per‑gigabyte argument has eroded)

Restart?
Special hardware: battery‑powered RAM

Write a log of changes / periodic snapshots to disk

Performance is not due to avoiding reads from disk!
Even disk‑based databases may never read from disk (cached blocks)

Real reason: no overhead of encoding in‑memory data structures into a disk format

Other possibility ‑ anti‑caching: evict least recently used data to disk

Non‑volatile memory (NVM) ‑ keep an eye on it in the future

Page 57: Part 1: Storage and Retrieval - Anton Fagerberg

Transaction Processing vs Analytics

Page 58: Part 1: Storage and Retrieval - Anton Fagerberg

Typical application

Looks up a small number of records by key (indexed)

Records are inserted or updated based on the user's input

Applications are interactive

For end user / customer (via web application)

Latest state of data (current point in time)

Gigabytes to terabytes

Highly available / low latency

OnLine Transaction Processing (OLTP)

Page 59: Part 1: Storage and Retrieval - Anton Fagerberg

Data analytics

Query needs to scan over many records

Only reading a few columns per record

Calculates aggregate statistics (count, sum, average, ...)

Bulk import (Extract Transform Load ‑ "ETL") or event stream

For analyst / decision support / report to management

History of events that happened over time

Terabytes to petabytes

Read‑only copy

OnLine Analytic Processing (OLAP)

Page 60: Part 1: Storage and Retrieval - Anton Fagerberg

ETL ‑ Extract Transform Load

Page 61: Part 1: Storage and Retrieval - Anton Fagerberg

Data warehousing

Historically, analytics ran on the same database

Separate database: the "data warehouse"

Commonly relational databases

SQL is quite flexible for both OLTP and OLAP
Graphical query generating tools

"Drill‑down" & "slicing and dicing"

The index algorithms discussed previously are not very good here

Need storage engines optimized for analytics instead

Page 62: Part 1: Storage and Retrieval - Anton Fagerberg

✨ Stars & Snowflakes ❄

Page 63: Part 1: Storage and Retrieval - Anton Fagerberg

Star schema (dimensional modeling)

Fact table at the center

Page 64: Part 1: Storage and Retrieval - Anton Fagerberg

Fact tables

Each row represents an event

The dimensions represent the who, what, where, when, how, and why of the event

Each row represents an event at a particular time (e.g. a purchase)

Maximum flexibility for analysis

The table can become extremely large

Some columns are attributes (e.g. price)

Other columns are foreign key references (dimension tables)

Typically very wide

Page 65: Part 1: Storage and Retrieval - Anton Fagerberg

Snowflake schema

Dimensions are further broken down into subdimensions

More normalized (than star schemas)

Harder to work with

Page 66: Part 1: Storage and Retrieval - Anton Fagerberg

Column‑Oriented Storage

Fact tables are often 100 columns wide

A typical data warehouse query only accesses 4 or 5

SELECT
  dim_date.weekday,
  dim_product.category,
  SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
  JOIN dim_date ON fact_sales.date_key = dim_date.date_key
  JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
  dim_date.year = 2013
  AND dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
  dim_date.weekday, dim_product.category;

Page 67: Part 1: Storage and Retrieval - Anton Fagerberg

Column‑Oriented Storage

OLTP databases usually store data in a row‑oriented fashion

All values from one row are stored next to each other

Document databases are similar

Problematic with the previous query ‑ all the data needs to be scanned

Solution: store each column together instead!

Also works in nonrelational data models

Reassembling a row is problematic (but rarely needed)
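The row-versus-column layout can be sketched in a few lines (the fact-table columns are borrowed from the query above; real column stores work on files, not dicts):

```python
rows = [
    {"date_key": 140102, "product_sk": 69, "quantity": 3},
    {"date_key": 140102, "product_sk": 74, "quantity": 1},
    {"date_key": 140103, "product_sk": 69, "quantity": 5},
]

# Column-oriented: one array per column, all kept in the same row order.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query touching only `quantity` now reads a single compact array:
total_quantity = sum(columns["quantity"])

# Row N can still be reassembled by taking the Nth entry of every column:
row_1 = {name: values[1] for name, values in columns.items()}
```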

Page 68: Part 1: Storage and Retrieval - Anton Fagerberg
Page 69: Part 1: Storage and Retrieval - Anton Fagerberg

Compression

Column‑oriented storage is often good for compression

Many repeating column values (previous image)

Saves disk space

Efficient use of CPU cycles

Retail typically has billions of sales but only 100,000 distinct products

Bitmap encoding
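Bitmap encoding plus run-length encoding can be sketched as below (the zeros-first RLE convention is one common choice, matching the book's figure as I read it ‑ an assumption worth checking against the original):

```python
def bitmap_encode(column, value):
    # One bitmap per distinct value: bit N is set iff row N holds `value`.
    return [1 if v == value else 0 for v in column]

def run_length_encode(bits):
    # Alternating run lengths, starting with a run of zeros (possibly length 0).
    runs, current, count = [], 0, 0
    for bit in bits:
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current, count = bit, 1
    runs.append(count)
    return runs
```

With billions of sales but only ~100,000 distinct products, each product's bitmap is mostly zeros, which RLE shrinks dramatically.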

Page 70: Part 1: Storage and Retrieval - Anton Fagerberg

Sort order in column storage

Doesn't necessarily matter ‑ easiest to store in insert order

Can't sort each column independently (rows couldn't be reconstructed)
Must sort entire rows

Queries often target date ranges ‑ make date the sort key

Several sort keys can be used (like a telephone book)

Also helps compression (mostly on the first sort key)

Store the same data in several different ways
Data needs replication anyway

Page 71: Part 1: Storage and Retrieval - Anton Fagerberg

Writing to column‑oriented storage

Writes are more difficult

Update‑in‑place (B‑trees) is not possible

LSM‑trees work
Writes first go to an in‑memory store

Added to a sorted structure, prepared for writing to disk

Doesn't matter if it's row‑oriented or column‑oriented

Page 72: Part 1: Storage and Retrieval - Anton Fagerberg

Data cubes & materialized views

Page 73: Part 1: Storage and Retrieval - Anton Fagerberg

Materialized aggregates

Queries often use count, sum, avg, min or max

Wasteful to re‑crunch the numbers if they're used in many queries

Materialized view ‑ a copy of query results written to disk
Needs to be updated (not a "virtual view")

Makes sense in OLAP (not in OLTP)

Page 74: Part 1: Storage and Retrieval - Anton Fagerberg

Data cube / OLAP cube

A grid of aggregates grouped by different dimensions
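A two-dimensional cube is just a precomputed aggregate per combination of dimension values; a sketch with made-up sales data (dates, products, and amounts are hypothetical):

```python
from collections import defaultdict

sales = [
    # (date, product, amount)
    ("2024-01-01", "fruit", 10.0),
    ("2024-01-01", "candy", 5.0),
    ("2024-01-02", "fruit", 7.0),
]

# The cube: SUM(amount) grouped by (date, product).
cube = defaultdict(float)
for date, product, amount in sales:
    cube[(date, product)] += amount

# Summarizing along one dimension collapses the other ("rolling up"):
by_date = defaultdict(float)
for (date, _product), amount in cube.items():
    by_date[date] += amount
```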

Page 75: Part 1: Storage and Retrieval - Anton Fagerberg

Data cube / OLAP cube

Can have many dimensions (e.g. a five‑dimensional hypercube)

Hard to imagine but same principle

Page 76: Part 1: Storage and Retrieval - Anton Fagerberg

Advantages
Certain queries become very fast (precomputed)

Disadvantages
Not the same flexibility as raw data

E.g. sales from items which cost more than $100 (price isn't a dimension)

So...
Data warehouses typically keep as much raw data as possible

Aggregates (data cubes) only as a performance boost

Page 77: Part 1: Storage and Retrieval - Anton Fagerberg

Summary

Page 78: Part 1: Storage and Retrieval - Anton Fagerberg

OLTP

Typically user‑facing

Huge volume of requests

Touch a small number of records in each query

Requests records using some kind of key

Storage engine uses an index to find the data

Disk seek time is often the bottleneck

Page 79: Part 1: Storage and Retrieval - Anton Fagerberg

Data warehouse

Analytics systems are less well known

Primarily used by business analysts ‑ not end users

Lower volume of queries

Queries are typically very demanding

Column‑oriented storage is an increasingly popular solution

Disk bandwidth (not seek time) is the bottleneck

Indexes are less relevant

Important to encode data very compactly
Minimize the data that needs to be read from disk

Column‑oriented storage helps with this

Page 80: Part 1: Storage and Retrieval - Anton Fagerberg

Seek time: the time it takes the head assembly on the actuator arm to travel to the track of the disk where the data will be read or written

Bandwidth: the bit‑rate of available or consumed information capacity, typically expressed in metric multiples of bits per second

Page 81: Part 1: Storage and Retrieval - Anton Fagerberg

Log‑structured

Only permits:

Appending to files

Deleting obsolete files

Never updates a file (files are immutable)

Bitcask, SSTables, LSM‑trees, LevelDB, Cassandra, HBase, Lucene, ...

A comparatively recent development

Turns random‑access writes into sequential writes on disk
Higher throughput

Page 82: Part 1: Storage and Retrieval - Anton Fagerberg

Update‑in‑place

Treats the disk as a set of fixed‑size pages

Pages can be overwritten

B‑trees are the most common example

Used in "all" major relational databases and many non‑relational ones

Page 83: Part 1: Storage and Retrieval - Anton Fagerberg

Also...

More complicated index structures

Databases optimized for keeping all data in memory

Page 84: Part 1: Storage and Retrieval - Anton Fagerberg

With this knowledge

You know the internals of storage engines

Which tool is best suited for your application

How to adjust the tuning of a database

A vocabulary to make sense of the documentation

Page 85: Part 1: Storage and Retrieval - Anton Fagerberg

THE END