CMPT 454, Simon Fraser University, Fall 2009, Martin Ester

Database Systems II

Secondary Storage

The Memory Hierarchy

[Figure: the memory hierarchy]
- CPU <-> L1/L2 cache (256 KB–4 MB): ~5 cycles initial latency, then "burst" mode; ~16 GB/s (64 bit @ 2 GHz)
- CPU <-> main memory: ~200 cycles latency; 3,200 MB/s (DDR-SDRAM @ 200 MHz) up to 6,400–12,800 MB/s (DDR2, dual channel, 800 MHz)
- Main memory <-> disk cache (2–16 MB) and disk: 300 MB/s (SATA-300); managed via virtual memory and the file system (swapping, main-memory DBMSs)
- Disk <-> tertiary storage: tape, network backup

The Memory Hierarchy

Cache
- Data and instructions are in the cache when needed by the CPU. The on-board (L1) cache is on the same chip as the CPU, the L2 cache on a separate chip.
- Capacity ~1 MB, access time a few nanoseconds.

Main memory
- All active programs and data need to be in main memory.
- Capacity ~1 GB, access time 10-100 nanoseconds.

The Memory Hierarchy

Secondary storage
- Used for permanent storage of large amounts of data, typically a magnetic disk.
- Capacity up to 1 TB, access time ~10 milliseconds.

Tertiary storage
- Stores data collections that do not fit onto secondary storage, e.g. magnetic tapes or optical disks.
- Capacity ~1 PB, access time seconds to minutes.

The Memory Hierarchy

Trade-off
- The larger the capacity of a storage device, the slower the access (and vice versa).
- A volatile storage device forgets its contents when power is switched off; a non-volatile device remembers them.
- Secondary and tertiary storage are non-volatile; all other levels are volatile.
- A DBS needs non-volatile (secondary) storage devices to store data permanently.

The Memory Hierarchy

- RAM (main memory) for the subset of the database used by current transactions.
- Disk to store the current version of the entire database (secondary storage).
- Tapes for archiving older versions of the database (tertiary storage).

The Memory Hierarchy

Typically, programs are executed in a virtual memory whose size equals the address space of the processor. Virtual memory is managed by the operating system, which keeps the most relevant part in main memory and the rest on disk.
A DBS manages its data itself and does not rely on virtual memory. Main-memory DBSs, however, do manage their data through virtual memory.

Moore’s Law

In 1965, Gordon Moore observed that the density of integrated circuits (i.e., the number of transistors per unit area) increases at an exponential rate, roughly doubling every 18 months.
Parameters that follow Moore's law:
- number of instructions per second that can be executed for unit cost,
- number of main memory bits that can be bought for unit cost,
- number of bytes on a disk that can be bought for unit cost.

Moore’s Law

[Figure: number of transistors on an integrated circuit over time]

Moore’s Law

[Figure: disk capacity over time]

Moore’s Law

But some other important hardware parameters do not follow Moore's law and grow much more slowly. These are, in particular,
- the speed of main memory access, and
- the speed of disk access.
For example, disk latencies (seek times) have almost stagnated over the past five years. Thus, the relative cost of moving data from one level of the memory hierarchy to the next becomes progressively larger.

Disks

Secondary storage device of choice. Data is stored and retrieved in units called disk blocks or pages.
Main advantage over tapes: random access vs. sequential access.
Unlike RAM, the time to retrieve a disk page varies depending upon its location on disk. Therefore, the relative placement of pages on disk has a major impact on DBMS performance!

Disks

A disk consists of two main moving parts: the disk assembly and the head assembly. The disk assembly stores information; the head assembly reads and writes information.

[Figure: disk anatomy, showing platters, spindle, disk heads, arm assembly with arm movement, and tracks]

Disks

The platters rotate around a central spindle. The upper and lower platter surfaces are covered with magnetic material, which is used to store bits.
The arm assembly is moved in or out to position a head on a desired track. All tracks that are under the heads at the same time form a cylinder (imaginary!).
Only one head reads/writes at any one time.

Disks

[Figure: top view of a platter surface, showing tracks, sectors, and the gaps between sectors]

Disks

Block size is a multiple of the sector size (which is fixed).
The time to access (read/write) a disk block (the disk latency) consists of three components:
- seek time: moving the arm to position the disk head on the track,
- rotational delay: waiting for the block to rotate under the head, and
- transfer time: actually moving the data to/from the disk surface.
Seek time and rotational delay dominate.

Disks

[Figure: seek time as a function of the number of cylinders traveled; traveling a single cylinder takes some time x, traveling N cylinders roughly 3x to 5x]

Disks

Average seek time over N cylinders:

S = \frac{1}{N(N-1)} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ j \neq i}}^{N} \mathrm{SEEKTIME}(i \rightarrow j)

Typical average seek time = 5 ms
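
As an illustration, the average can be computed directly as the double sum over all ordered pairs of distinct cylinders. The linear seek-time model below (a fixed startup cost plus a per-cylinder cost) is a made-up assumption for this sketch, not a property of any real disk.

    # Average seek time S = (1 / (N(N-1))) * sum over all i != j of SEEKTIME(i -> j).
    # The cost model in seektime() is purely illustrative.

    def seektime(i, j, startup_ms=1.0, per_cylinder_ms=0.002):
        # hypothetical linear model: startup cost plus cost proportional to distance
        return startup_ms + per_cylinder_ms * abs(i - j)

    def average_seek_time(n_cylinders):
        total = 0.0
        for i in range(1, n_cylinders + 1):
            for j in range(1, n_cylinders + 1):
                if i != j:
                    total += seektime(i, j)
        return total / (n_cylinders * (n_cylinders - 1))

    print(average_seek_time(1000))   # roughly 1.7 ms under this toy model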

Disks

Average rotational delay

Average rotational delay R = 1/2 revolution

Typical R = 5 ms

[Figure: platter showing the current head position and the block to be read]

Disks

Transfer time

Typical transfer rate: 100 MB/sec
Typical block size: 16 KB

Transfer time = block size / transfer rate

Typical transfer time = 0.16 ms
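
As a quick numeric check of the three components, here is a sketch that plugs in the typical figures from the last few slides (5 ms seek, 5 ms rotational delay, 16 KB block at 100 MB/s); the exact byte/MB conventions are an assumption.

    # Disk latency = seek time + rotational delay + transfer time.

    def transfer_time_ms(block_bytes, rate_mb_per_s):
        return block_bytes / (rate_mb_per_s * 1024 * 1024) * 1000.0

    def disk_latency_ms(seek_ms, rotation_ms, block_bytes, rate_mb_per_s):
        return seek_ms + rotation_ms + transfer_time_ms(block_bytes, rate_mb_per_s)

    print(transfer_time_ms(16 * 1024, 100))            # ~0.16 ms
    print(disk_latency_ms(5.0, 5.0, 16 * 1024, 100))   # ~10.2 ms: seek + rotation dominate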

Disks

Typical average disk latency is 10 ms, maximum latency 20 ms.
In 10 ms, a modern microprocessor can execute millions of instructions. Thus, the time for a block access by far dominates the time typically needed for processing the data in memory.
The number of disk I/Os (block accesses) is therefore a good approximation for the cost of a database operation.

Accelerating Disk Access

Organize data by cylinders to minimize seek time and rotational delay.
'Next' block concept:
- blocks on the same track, followed by
- blocks on the same cylinder, followed by
- blocks on an adjacent cylinder.
If the blocks of a file are placed sequentially on disk (in 'next' order), the effective access time can approach the pure transfer time.

Accelerating Disk Access
Example

Assumptions: 10 ms average seek time, no rotational delay, 40 MB/s transfer rate.

Read a single 4 KB block:
- Random I/O: 10 ms
- Sequential I/O: 10 ms

Read 4 MB in 4 KB blocks (amortized):
- Random I/O: 10 s
- Sequential I/O: 0.1 s

Speedup factor of 100.
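
The numbers of this example can be checked with a short sketch (assumptions as stated: 10 ms seek, no rotational delay, 40 MB/s, 4 KB blocks, 4 MB in total).

    SEEK_MS = 10.0       # average seek time
    RATE_MB_S = 40.0     # transfer rate
    BLOCK_KB = 4
    TOTAL_MB = 4

    def transfer_ms(kb):
        return kb / 1024.0 / RATE_MB_S * 1000.0

    n_blocks = TOTAL_MB * 1024 // BLOCK_KB                     # 1024 blocks

    random_ms = n_blocks * (SEEK_MS + transfer_ms(BLOCK_KB))   # every block pays a seek
    sequential_ms = SEEK_MS + transfer_ms(TOTAL_MB * 1024)     # one seek, then streaming

    print(random_ms / 1000.0, "s")       # ~10.3 s
    print(sequential_ms / 1000.0, "s")   # ~0.11 s  -> about a factor of 100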

Accelerating Disk Access
Block size selection

- Bigger blocks amortize the I/O cost.
- But bigger blocks also read in more useless data and take longer to read.
- A good trade-off is a block size between 4 KB and 16 KB.
- With decreasing memory costs, blocks are becoming bigger!

Accelerating Disk Access
Using multiple disks

Replace one disk (with one independent head) by many disks (with many independent heads).
Striping a relation R: divide its blocks over n disks in a round-robin fashion.
Assuming that the disk controller, bus and main memory can handle n times the transfer rate, striping a relation across n disks can lead to a speedup factor of up to n.
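
A minimal sketch of round-robin striping; the "disks" here are just Python lists standing in for real devices, and the block naming is invented for the example.

    # Round-robin striping: block i of the relation goes to disk i mod n.

    def stripe(blocks, n_disks):
        disks = [[] for _ in range(n_disks)]
        for i, block in enumerate(blocks):
            disks[i % n_disks].append((i, block))
        return disks

    blocks = [f"R-block-{i}" for i in range(10)]
    for d, contents in enumerate(stripe(blocks, 4)):
        print(f"disk {d}: {contents}")
    # Consecutive blocks land on different disks, so up to n transfers can
    # proceed in parallel (if controller, bus and memory can keep up).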

Accelerating Disk Access
Disk scheduling

For I/O requests from different processes, let the disk controller choose the processing order.
According to the elevator algorithm, the disk controller keeps sweeping from the innermost to the outermost cylinder (and back), stopping at every cylinder for which there is an I/O request.
The sweep direction is reversed as soon as there is no I/O request ahead in the current direction.
This optimizes the throughput and the average response time.
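
A simplified sketch of the elevator policy: serve the nearest pending request ahead of the head, and reverse the sweep when nothing is left in the current direction. Real controllers handle requests that keep arriving during the sweep; this version only orders a fixed set, and the request numbers are invented.

    # Elevator order for a fixed set of cylinder requests.
    # head: current cylinder; direction: +1 = towards outer, -1 = towards inner.

    def elevator_order(requests, head, direction=1):
        pending = sorted(set(requests))
        order = []
        while pending:
            ahead = [c for c in pending if (c - head) * direction >= 0]
            if not ahead:                 # nothing left ahead: reverse the sweep
                direction = -direction
                continue
            nxt = min(ahead, key=lambda c: abs(c - head))
            order.append(nxt)
            pending.remove(nxt)
            head = nxt
        return order

    print(elevator_order([98, 183, 37, 122, 14, 124, 65, 67], head=53, direction=1))
    # -> [65, 67, 98, 122, 124, 183, 37, 14]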

Accelerating Disk Access
Double buffering

In some scenarios, we can predict the order in which blocks will be requested from disk by some process.
Prefetching (double buffering) is the method of fetching the necessary blocks into the buffer in advance.
It requires enough buffer space.
Speedup factor up to n, where n is the number of blocks requested by a process.

Accelerating Disk Access
Single buffering

(1) Read B1 into buffer
(2) Process data in buffer
(3) Read B2 into buffer
(4) Process data in buffer
...

Execution time = n(P + R), where
P = time to process one block,
R = time to read in one block,
n = # blocks read.

Accelerating Disk Access
Double buffering

(1) Read B1, ..., Bn into buffers
(2) Process B1 in buffer
(3) Process B2 in buffer
...

Execution time = R + nP, as opposed to n(P + R).
(Remember that R >> P.)
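
A tiny calculation contrasting the two formulas; the values of R, P and n are assumptions chosen so that R >> P.

    R = 10.0    # ms to read one block (assumed)
    P = 0.1     # ms to process one block (assumed)
    n = 1000    # number of blocks (assumed)

    single_buffering = n * (P + R)   # read, process, read, process, ...
    double_buffering = R + n * P     # idealized overlap of reading and processing

    print(single_buffering, "ms")                 # 10100 ms
    print(double_buffering, "ms")                 # 110 ms
    print(single_buffering / double_buffering)    # ~92x speedup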

Disk Failures

In an intermittent failure, a read or write operation is unsuccessful, but succeeds with repeated tries. Parity checks are used to detect intermittent failures.
Media decay is a permanent corruption of one or more bits, which makes the corresponding sector impossible to read/write. Stable storage is used to recover from media decay.
A disk crash makes the entire disk permanently unreadable. RAID is used to recover from disk crashes.

Disk Failures
Checksums

Add n parity bits for every m data bits.
The number of 1's among a collection of data bits and their parity bit is always even. The parity bit is the modulo-2 sum of its data bits.

m = 8, n = 1
Block A: 01101000:1 (odd # of 1's in the data bits, parity 1)
Block B: 11101110:0 (even # of 1's in the data bits, parity 0)

If Block A instead contains
Block A': 01100000:1
then the data bits and the parity bit together have an odd number of 1's
-> error detected
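
A small sketch of the single parity bit: the parity is the modulo-2 sum of the data bits, and a block passes the check only if data bits plus parity contain an even number of 1's.

    # Single parity bit for m = 8 data bits (n = 1).

    def parity_bit(bits):
        return sum(bits) % 2                       # modulo-2 sum of the data bits

    def check(bits, parity):
        return (sum(bits) + parity) % 2 == 0       # total number of 1's must be even

    block_a = [0, 1, 1, 0, 1, 0, 0, 0]             # Block A from the slide
    p = parity_bit(block_a)                        # -> 1
    print(check(block_a, p))                       # True: accepted

    block_a1 = [0, 1, 1, 0, 0, 0, 0, 0]            # Block A' (one bit flipped)
    print(check(block_a1, p))                      # False: error detected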

Disk Failures

Checksums
But what if multiple bits are corrupted? E.g., if Block A instead contains
Block A'': 01000000:1
then the data bits and the parity bit together have an even number of 1's
-> the error cannot be detected.

The probability that a single parity bit cannot detect a corrupt block is 1/2, assuming that disk failures involving an odd and an even number of bits are equally likely.

Disk Failures

Checksums
More parity bits decrease the probability of an undetected failure: with n (n <= m) independent parity bits, this probability is only 1/2^n.
E.g., we can have eight parity bits, one for the first bit of every byte, a second one for the second bit of every byte, and so on. The chance of not detecting a disk failure is then only 1/256.
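
A sketch of this interleaved scheme: one parity bit per bit position, each computed over the corresponding bit of every data byte, which a byte-wise XOR does in one step. The sample data bytes are taken from the earlier example.

    # Eight parity bits: parity bit k is the modulo-2 sum of bit k of every byte.

    def interleaved_parity(data_bytes):
        parity = 0
        for b in data_bytes:
            parity ^= b            # XOR accumulates all eight per-position mod-2 sums
        return parity              # packed into one byte

    data = bytes([0b01101000, 0b11101110, 0b00111000])
    print(format(interleaved_parity(data), "08b"))
    # An undetected failure now has to cancel out in every one of the eight
    # positions, which happens with probability about 1/256.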

Disk Failures
Stable storage

Sectors are paired, and information X is written to both sectors XL and XR.
Assume that both copies are written with a sufficient number of parity bits so that bad sectors can be detected.
Writing: if a sector turns out bad (according to its checksum), write to the alternative sector.
Reading: alternately read XL and XR until a good value is returned.
The probability of both XL and XR failing is very low.
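
A minimal sketch of this read/write policy under the stated assumption that checksums reliably flag a bad sector. The Sector class and its behaviour are invented stand-ins for the example.

    # Stable storage over a pair of sectors (XL, XR).
    # The Sector class is a stand-in: 'ok' simulates the checksum verdict.

    class Sector:
        def __init__(self):
            self.value, self.ok = None, False
        def write(self, value):
            self.value, self.ok = value, True    # assume the checksum now verifies
        def read(self):
            return self.value if self.ok else None

    def stable_write(left, right, value):
        left.write(value)           # if a write left a bad sector, its checksum
        right.write(value)          # would reveal it and the write would be retried

    def stable_read(left, right):
        # alternate between the two copies until a good value is returned
        for sector in (left, right):
            v = sector.read()
            if v is not None:
                return v
        raise IOError("both copies failed (very unlikely)")

    xl, xr = Sector(), Sector()
    stable_write(xl, xr, b"record 42")
    xl.ok = False                   # simulate media decay of the left copy
    print(stable_read(xl, xr))      # still recovers the value from XR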

Disk Failures
Disk arrays

So far, we cannot recover from disk crashes. To address this problem, we use Redundant Arrays of Independent Disks (RAID): arrangements of several disks that give the abstraction of a single, large disk.
Goals: increase reliability (and performance).
Redundant information allows reconstruction of data if a disk fails. Data striping improves the disk performance.

Disk Failures
Failure Models for Disks

What is the expected time until a disk crash? We assume a uniform distribution of failures over time.
Mean time to failure: the time period by which 50% of a population of disks have failed (crashed). A typical mean time to failure is 10 years.
In this case, 5% of the disks crash in the first year, 5% in the second year, ..., 5% in the tenth year, ..., and 5% in the twentieth year.

Disk Failures
Failure Models for Disks

Given the mean time to failure (mtf) in years, we can derive the probability p of a particular disk failing in a given year: p = 1 / (2 * mtf).
Example: mtf = 10, p = 1/20 = 5%.
Mean time to data loss: the time period by which 50% of a population of disks have had a crash that resulted in data loss. The mean time to disk failure is not necessarily the same as the mean time to data loss.

Disk Failures
Failure Models for Disks

Failure rate: percentage of disks of a population that have failed up to a certain point in time.
Survival rate: percentage of disks of a population that have not failed up to a certain point in time.
While it simplifies the analysis, the assumption of a uniform distribution of failures is unrealistic. Disks tend to fail early (manufacturing defects that have not been detected) or late (wear and tear).

Disk Failures
Failure Models for Disks

[Figure: realistic survival rate as a function of time]

Disk Failures
Mirroring

The data disk is copied onto a second disk, the mirror disk.
When one of the disks crashes, we replace it by a new disk and copy the surviving disk to the new one.
Data loss can only occur if the second disk crashes while the first one is being replaced. This probability is negligible.
Mirroring is referred to as RAID level 1.

Disk Failures
Parity blocks

Mirroring doubles the number of disks needed.
The parity block approach needs only one redundant disk for n (arbitrarily many) data disks.
On the redundant disk, the i-th block stores the parity checks for the i-th blocks of all the n data disks.
The parity block approach is called RAID level 4.

[Figure: three data disks A, B, C and one parity disk P]

Disk Failures
Parity blocks

Reading blocks is the same as without parity blocks.
When writing a block on a data disk, we also need to update the corresponding block of the redundant disk. This can be done with four (three additional) disk I/Os: read the old value of the data block, read the corresponding block of the redundant (parity) disk, write the new data block, and recompute and write the new redundant block.
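
A sketch of that update: since parity is a bit-wise modulo-2 sum, the new parity block is old parity XOR old data XOR new data, so only the changed data bits flip in the parity block. Blocks are modelled here as small integers (bit strings) to keep the XOR arithmetic visible.

    # RAID 4 write of one block: 4 disk I/Os
    #  (1) read old data block, (2) read old parity block,
    #  (3) write new data block, (4) write new parity block.

    def raid4_write(data_disk, parity_disk, i, new_block):
        old_block = data_disk[i]                        # I/O 1
        old_parity = parity_disk[i]                     # I/O 2
        new_parity = old_parity ^ old_block ^ new_block
        data_disk[i] = new_block                        # I/O 3
        parity_disk[i] = new_parity                     # I/O 4

    # The blocks from the example on the following slides:
    disk1, disk2, disk3 = [0b11110000], [0b10101010], [0b00111000]
    parity = [disk1[0] ^ disk2[0] ^ disk3[0]]           # 0b01100010

    raid4_write(disk2, parity, 0, 0b11001100)           # overwrite disk 2's block
    assert parity[0] == disk1[0] ^ disk2[0] ^ disk3[0]  # parity is still consistent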

Disk Failures
Parity blocks

If one of the disks crashes, we bring in a new disk. The content of this disk can be computed, bit by bit, using the remaining n disks.
There is no difference between data disks and the parity disk in this respect. The computation is based on the definition of parity, i.e. the total number of ones is even.

Disk Failures
Example

n = 3 data disks
Disk 1, block 1: 11110000
Disk 2, block 1: 10101010
Disk 3, block 1: 00111000

... and one parity disk
Disk 4, block 1: 01100010

The sum over each bit position (column) is always an even number of 1's.
The mod-2 sum can recover any single missing row.

Disk Failures
Example

Suppose we have:
Disk 1, block 1: 11110000
Disk 2, block 1: ????????
Disk 3, block 1: 00111000
Disk 4, block 1: 01100010 (parity)

Use the mod-2 sums of block 1 over disks 1, 3 and 4 to recover block 1 of the failed disk 2:
Disk 2, block 1: 10101010
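
A one-line sketch of that computation: XOR-ing the same-numbered blocks of all surviving disks reproduces the missing block.

    from functools import reduce

    # Surviving copies of block 1: disk 1, disk 3 and disk 4 (parity).
    surviving = [0b11110000, 0b00111000, 0b01100010]
    recovered = reduce(lambda a, b: a ^ b, surviving)
    print(format(recovered, "08b"))    # 10101010 -> block 1 of the failed disk 2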

Disk Failures
RAID level 5

In the RAID 4 scheme, the parity disk is the bottleneck: on average, there are n times as many writes to the parity disk as to each data disk.
However, the failure recovery method does not distinguish between the n + 1 disks.
RAID level 5 therefore does not use a fixed parity disk, but treats block i of disk j as the redundant block if i MOD (n+1) = j.

[Figure: RAID 5 layout with the parity blocks rotated across disks A, B, C, D]
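
A minimal sketch of the rotated-parity rule; disks and block rows are numbered from 0 here, which is an illustrative convention.

    # RAID 5: for block row i, the parity block sits on disk i mod (n + 1);
    # the data blocks of that row occupy the remaining n disks.

    def parity_disk(i, n):
        return i % (n + 1)

    n = 3                                       # 3 data blocks + 1 parity block per row
    for i in range(8):
        p = parity_disk(i, n)
        layout = ["P" if d == p else "D" for d in range(n + 1)]
        print(f"row {i}: {' '.join(layout)}")
    # Parity writes are now spread evenly over all n + 1 disks, removing the
    # RAID 4 bottleneck.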