External Memory Data Structures


Srinivasa Rao Satti

Workshop on Recent Advances in Data Structures

December 20, 2011

Fundamental Algorithmic Problems

• Searching: Given a list (sequence) L of elements x1, x2, ..., xn and a query element x, check whether x is present in L.

– When L is not sorted, we use linear search – scan the list to check if x is present in it.

– When L is sorted, we use binary search – divide the remaining list to be searched in half with every comparison.

Also insert and delete elements to/from L.

• Sorting: Given a sequence of elements, sort them in increasing (or decreasing) order.

– Insertion sort, bubble sort, quick sort, merge sort
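The two search strategies on this slide can be sketched as follows. This is a minimal illustration; the function names are assumptions and do not come from the slides.

```python
def linear_search(L, x):
    """Scan the list front to back; uses O(n) comparisons."""
    for item in L:
        if item == x:
            return True
    return False


def binary_search(L, x):
    """Search a sorted list by halving the remaining range with every comparison."""
    lo, hi = 0, len(L) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if L[mid] == x:
            return True
        if L[mid] < x:
            lo = mid + 1   # x can only be in the upper half
        else:
            hi = mid - 1   # x can only be in the lower half
    return False
```

Binary search requires L to be sorted; on an unsorted list only the linear scan is correct.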


Random Access Machine (RAM) Model

• Standard theoretical model of computation:

– Infinite memory

– Uniform access cost

• Unit-cost RAM model: All the basic operations (reading/writing a location from/to the memory, standard arithmetic and Boolean operations) take one unit of time.

• Simple model crucial for success of computer industry.


Hierarchical Memory

• Modern machines have complicated memory hierarchy

– Levels get larger and slower further away from CPU

– Data moved between levels using large blocks

[Figure: memory hierarchy with levels L1, L2, and RAM]

Hard disk drive


Slow I/O

• Disk access is 10^6 times slower than main memory access

“The difference in speed between modern CPU and disk technologies is analogous to the difference in speed in sharpening a pencil using a sharpener on one’s desk or by taking an airplane to the other side of the world and using a sharpener on someone else’s desk.” (D. Comer)

– Disk systems try to amortize the large access time by transferring large contiguous blocks of data (8-16 Kbytes)

• Important to store/access data to take advantage of blocks (locality)

[Figure: hard disk with labels: track, magnetic surface, read/write arm, read/write head]

External Memory Model [Aggarwal-Vitter 1988]

N = # of items in the problem instance

B = # of items per disk block

M = # of items that fit in main memory

T = # of items in output

I/O: Move block between memory and disk

Performance measures:

Space: # of disk blocks used by the structure

Time: # of I/Os performed by the algorithm

(CPU time is “free”)

[Figure: processor P, main memory M, and disk D, with block I/O between M and D]
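To make the cost measure concrete, here is a small sketch of a toy disk that counts block transfers. The `Disk` class and `scan_sum` function are illustrative assumptions, not part of the model's definition; only block reads are charged, in line with CPU time being "free".

```python
class Disk:
    """Toy disk: items are stored B per block; each block read costs one I/O."""

    def __init__(self, items, B):
        self.B = B
        self.blocks = [items[i:i + B] for i in range(0, len(items), B)]
        self.ios = 0

    def read_block(self, i):
        self.ios += 1
        return self.blocks[i]


def scan_sum(disk):
    """Sum all items by reading every block once."""
    total = 0
    for i in range(len(disk.blocks)):
        total += sum(disk.read_block(i))  # in-memory work is not counted
    return total
```

Scanning N = 10 items with B = 2 this way performs N/B = 5 I/Os.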

Scalability Problems: Block Access Matters

• Example: Traversing linked list

– Array size N = 10 elements

– Disk block size B = 2 elements

– Main memory size M = 4 elements (2 blocks)

• Large difference between N and N/B since block size is large

– Example: N = 256 × 10^6, B = 8000, 1 ms disk access time

N I/Os take 256 × 10^3 sec = 4266 min = 71 hr

N/B I/Os take 256/8 sec = 32 sec

[Figure: two layouts of the 10-element list in blocks of 2 elements. Algorithm 1: N = 10 I/Os. Algorithm 2: N/B = 5 I/Os.]
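The arithmetic in the large example can be checked with a few lines. This is only a sketch; the helper name `io_time_seconds` is an assumption, and the 1 ms access time is the figure quoted on the slide.

```python
def io_time_seconds(num_ios, access_ms=1.0):
    """Total time for num_ios disk accesses at access_ms milliseconds each."""
    return num_ios * access_ms / 1000.0


N = 256 * 10**6   # number of elements
B = 8000          # elements per block

slow = io_time_seconds(N)        # one I/O per element
fast = io_time_seconds(N // B)   # one I/O per block

print(f"N I/Os:   {slow:.0f} s = {slow / 3600:.1f} hr")
print(f"N/B I/Os: {fast:.0f} s")
```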

Queues and Stacks

• Queue:

– Maintain push and pop blocks in main memory

– O(1/B) amortized I/Os per push/pop operation

• Stack:

– Maintain push/pop blocks in main memory

– O(1/B) amortized I/Os per push/pop operation
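One way to obtain the O(1/B) amortized bound for a stack is sketched below. The 2B-item in-memory buffer is an assumption of this sketch (the slide only says that push/pop blocks are kept in main memory), and the class name `ExternalStack` is hypothetical.

```python
class ExternalStack:
    """Stack with an in-memory buffer of at most 2B items; full blocks go to disk."""

    def __init__(self, B):
        self.B = B
        self.buffer = []   # top of the stack, held in main memory
        self.disk = []     # list of blocks, each holding exactly B items
        self.ios = 0       # number of block transfers so far

    def push(self, x):
        if len(self.buffer) == 2 * self.B:
            # Buffer full: write the oldest B items to disk as one block.
            self.disk.append(self.buffer[:self.B])
            self.buffer = self.buffer[self.B:]
            self.ios += 1
        self.buffer.append(x)

    def pop(self):
        if not self.buffer:
            if not self.disk:
                raise IndexError("pop from empty stack")
            # Buffer empty: read back the most recently written block.
            self.buffer = self.disk.pop()
            self.ios += 1
        return self.buffer.pop()
```

After any I/O the buffer holds about B items, so at least B further operations must happen before the next I/O, which gives the 1/B amortized cost per operation.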

Fundamental Bounds

Operation    Internal     External
Scanning     N            N/B
Sorting      N log N      (N/B) log_{M/B} (N/B)
Searching    log_2 N      log_B N

• Note:

– Linear I/O: O(N/B)

– B factor VERY important: N/B < (N/B) log_{M/B} (N/B) << N

– Cannot sort optimally with search tree
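To get a feel for the size of these bounds, the following sketch evaluates them for concrete parameters. The function names are illustrative assumptions, constant factors are ignored, and the value of M used in the example is an assumed figure, not one from the slides.

```python
import math


def scan_bound(N, B):
    """External scanning bound: N/B."""
    return N / B


def sort_bound(N, M, B):
    """External sorting bound: (N/B) * log_{M/B}(N/B)."""
    return (N / B) * math.log(N / B, M / B)


def search_bound(N, B):
    """External searching bound: log_B N."""
    return math.log(N, B)


N, M, B = 256 * 10**6, 10**6, 8000   # M is an assumed memory size
print(scan_bound(N, B), sort_bound(N, M, B), search_bound(N, B))
```

With these parameters the sorting bound is only a small multiple of the scanning bound, while both are far below N.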

Search trees: API

• Given a set S of keys, support the operations:

– search(x) : return TRUE if x is in S, and FALSE otherwise

– insert(x) : insert x into S (error if x is already in S)

– delete(x) : delete x from S (error if x is not in S)

– rangesearch(x,y) : return all the keys z such that x ≤ z ≤ y
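The API above can be illustrated with an in-memory stand-in backed by a sorted Python list. This is a sketch of the interface only, not of an external-memory structure; the class name `SortedSet` is an assumption.

```python
import bisect


class SortedSet:
    """In-memory stand-in for the search-tree API, backed by a sorted list."""

    def __init__(self):
        self.keys = []

    def search(self, x):
        i = bisect.bisect_left(self.keys, x)
        return i < len(self.keys) and self.keys[i] == x

    def insert(self, x):
        if self.search(x):
            raise KeyError(f"{x} is already in S")
        bisect.insort(self.keys, x)

    def delete(self, x):
        if not self.search(x):
            raise KeyError(f"{x} is not in S")
        self.keys.pop(bisect.bisect_left(self.keys, x))

    def rangesearch(self, x, y):
        """Return all keys z with x <= z <= y, in sorted order."""
        lo = bisect.bisect_left(self.keys, x)
        hi = bisect.bisect_right(self.keys, y)
        return self.keys[lo:hi]
```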


Binary Search Trees

• Binary search tree:

– Standard method for search among N elements

– We assume elements in leaves

– Search traces a root-to-leaf path

– If nodes are stored arbitrarily on disk: search in O(log_2 N) I/Os, rangesearch in O(log_2 N + T) I/Os

[Figure: binary search tree of height Θ(log_2 N)]

External Search Trees

• BFS blocking:

– Block height O(log_2 N) / O(log_2 B) = O(log_B N)

– Output elements blocked

⇒ Rangesearch in Θ(log_B N + T/B) I/Os

• Optimal: O(N/B) space and Θ(log_B N + T/B) query

[Figure: tree partitioned into blocks of height Θ(log_2 B), each holding Θ(B) nodes]

External Search Trees

• Maintaining BFS blocking during updates?

– Balance is normally maintained in search trees using rotations

• Seems very difficult to maintain BFS blocking during rotation

– Also need to make sure output (leaves) is blocked!

[Figure: rotation exchanging nodes x and y]

B-trees

• BFS-blocking naturally corresponds to tree with fan-out Θ(B)

• B-trees balanced by allowing node degree to vary

– Rebalancing performed by splitting and merging nodes

(a,b)-tree

• T is an (a,b)-tree (a ≥ 2 and b ≥ 2a-1)

– All leaves on the same level and contain between a and b elements

– Except for the root, all nodes have degree between a and b

– Root has degree between 2 and b

• (a,b)-tree uses linear space and has height O(log_a N)

⇒ Choosing a, b = Θ(B), each node/leaf is stored in one disk block

⇒ O(N/B) space and Θ(log_B N + T/B) query

[Figure: example tree]

(a,b)-Tree Insert

• Insert:

Search and insert element in leaf v

DO { if v has b+1 elements/children

Split v:

make nodes v’ and v’’ with ⌊(b+1)/2⌋ ≤ b and ⌈(b+1)/2⌉ ≥ a elements

insert element (ref) in parent(v)

(make new root if necessary)

v=parent(v) }

• Insert touches O(log_a N) nodes

[Figure: node v with b+1 elements split into v’ with ⌊(b+1)/2⌋ and v’’ with ⌈(b+1)/2⌉ elements]
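The insert procedure can be sketched in Python as below. This is an in-memory illustration under several assumptions that are not in the slides: elements live in the leaves, each internal node keeps router keys equal to the smallest key in the subtree to their right, and the names `Node` and `ABTree` are hypothetical. Block I/O is not modelled.

```python
import bisect


class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys or []          # elements (leaf) or router keys (internal)
        self.children = children or []  # empty for a leaf

    def is_leaf(self):
        return not self.children

    def size(self):
        # Elements in a leaf, children in an internal node.
        return len(self.keys) if self.is_leaf() else len(self.children)


class ABTree:
    def __init__(self, a=2, b=4):
        assert a >= 2 and b >= 2 * a - 1
        self.a, self.b = a, b
        self.root = Node()

    def _path_to_leaf(self, x):
        """Return ([(ancestor, child index taken), ...], leaf) for key x."""
        path, node = [], self.root
        while not node.is_leaf():
            i = bisect.bisect_right(node.keys, x)
            path.append((node, i))
            node = node.children[i]
        return path, node

    def search(self, x):
        _, leaf = self._path_to_leaf(x)
        i = bisect.bisect_left(leaf.keys, x)
        return i < len(leaf.keys) and leaf.keys[i] == x

    def insert(self, x):
        path, v = self._path_to_leaf(x)
        i = bisect.bisect_left(v.keys, x)
        if i < len(v.keys) and v.keys[i] == x:
            raise KeyError(f"{x} is already in S")
        v.keys.insert(i, x)
        while v.size() > self.b:                 # v has b+1 elements/children
            mid = (self.b + 1) // 2              # left part gets floor((b+1)/2)
            if v.is_leaf():
                right = Node(keys=v.keys[mid:])
                sep = right.keys[0]              # smallest key of the new right leaf
                v.keys = v.keys[:mid]
            else:
                right = Node(keys=v.keys[mid:], children=v.children[mid:])
                sep = v.keys[mid - 1]            # router key that moves up
                v.keys, v.children = v.keys[:mid - 1], v.children[:mid]
            if not path:                         # v was the root: make a new root
                self.root = Node(keys=[sep], children=[v, right])
                break
            parent, j = path.pop()
            parent.keys.insert(j, sep)           # insert reference in parent(v)
            parent.children.insert(j + 1, right)
            v = parent
```

Each iteration of the loop moves one level up, so an insert touches at most one node per level of the tree.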

(2,4)-Tree Insert


(a,b)-Tree Delete

• Delete:

Search and delete element from leaf v

DO { if v has a-1 elements/children

Fuse v with sibling v’:

move children of v’ to v

delete element (ref) from parent(v)

(delete root if necessary)

If v has >b (and ≤ a+b-1<2b) children split v

v=parent(v) }

• Delete touches O(log_a N) nodes

[Figure: node v with a-1 elements fused with sibling v’, giving a node with at least 2a-1 elements]
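The fuse-then-split step at a single level can be sketched on plain lists. This is a simplified illustration: nodes are represented as Python lists of elements, parent pointers and router keys are omitted, and the function name `fuse_and_maybe_split` is an assumption.

```python
def fuse_and_maybe_split(v, sibling, a, b):
    """Fuse underfull node v (a-1 elements) with its right sibling.

    Returns a list with one node if the fused node fits, or two nodes
    if it had to be split again because it exceeded b elements.
    """
    fused = v + sibling                 # move the sibling's elements into v
    if len(fused) <= b:
        return [fused]                  # parent loses one reference
    mid = len(fused) // 2               # fused has at most a+b-1 < 2b elements
    return [fused[:mid], fused[mid:]]   # both halves hold between a and b
```

When a split follows the fuse, the parent keeps the same number of children, so the rebalancing stops at that level.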

(2,4)-Tree Delete


Summary/Conclusion: B-tree

• B-trees: (a,b)-trees with a, b = Θ(B)

– O(N/B) space

– O(log_B N + T/B) I/Os for search and rangesearch

– O(log_B N) I/Os for insert and delete

• B-trees with elements in the leaves sometimes called B+-tree

• Construction in O((N/B) log_{M/B} (N/B)) I/Os

– Sort elements and construct leaves

– Build tree level-by-level bottom-up
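The level-by-level construction can be sketched as follows, assuming the keys have already been sorted (in external memory that sorting step would be an external merge sort, which is not shown). The function name and the list-of-levels representation are assumptions of this sketch.

```python
def build_bottom_up(sorted_keys, B):
    """Build a B-tree-like index from sorted keys, leaves first.

    Returns a list of levels; each level is a list of nodes and each node
    is a list of at most B entries. Entries above the leaves are the
    smallest key in the corresponding child's subtree.
    """
    level = [sorted_keys[i:i + B] for i in range(0, len(sorted_keys), B)]
    levels = [level]
    while len(level) > 1:
        mins = [node[0] for node in level]   # one router entry per child
        level = [mins[i:i + B] for i in range(0, len(mins), B)]
        levels.append(level)
    return levels
```

Each level is produced by one sequential pass over the level below it. The last node on a level may end up with fewer than a entries; a full implementation would rebalance it with its neighbour, which this sketch skips.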

B-tree Construction

• In internal memory we can sort N elements in O(N log N) time using a balanced search tree:

– Insert all elements one-by-one (construct tree)

– Output in sorted order using in-order traversal

• Same algorithm using B-tree uses O(N log_B N) I/Os

– A factor of O(B · log(M/B) / log B) non-optimal

• As discussed we could build B-tree bottom-up in O((N/B) log_{M/B} (N/B)) I/Os

– In general we would like to have dynamic data structure to use in O((N/B) log_{M/B} (N/B)) algorithms ⇒ O((1/B) log_{M/B} (N/B)) I/O operations

Flash memory

• Non-volatile memory which can be erased and programmed

• Characteristics, compared to magnetic disks:

– Lighter

– Provides better shock resistance

– Provides more throughput

– Consumes less power

– Denser (uses less space)

• Commonly used in digital cameras, handheld computers, mobile phones, portable music players etc.

• Also used in embedded systems and sensor networks, and is even replacing magnetic disks in PCs.

HDD vs SSD

[Figure: the disassembled components of a hard disk drive (left) and the PCB and components of a solid-state drive (right)]

Limitations of flash memory

• Memory cells in a flash memory device can be written only a limited number of times

– between 10,000 and 1,000,000, after which they wear out and become unreliable.

• The only way to set bits (change their value from 0 to 1) is to erase an entire region of memory. These regions have a fixed size in a given device, typically ranging from several kilobytes to hundreds of kilobytes, and are called erase units.

• Two different types of Flash memories: NOR and NAND

– they have slightly different characteristics


Flash memory

• The memory space of the chip is partitioned into blocks called erase blocks. The only way to change a bit from 0 to 1 is to erase the entire unit containing the bit.

• Each block is further partitioned into pages, which usually store 2048 bytes of data and 64 bytes of meta-data. Erase blocks typically contain 32 or 64 pages.

• Bits are changed from 1 to 0 by programming (writing) data onto a page. An erased page can be programmed only a small number of times (1 to 3) before it must be erased again.
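The program/erase asymmetry described above can be modelled with a few lines. This is a toy sketch, not a device driver: the class name `EraseBlock` is an assumption, meta-data bytes are ignored, and the limit on how often a page may be programmed between erasures is not enforced.

```python
class EraseBlock:
    """Toy erase block: programming clears bits, erasing sets them all to 1."""

    def __init__(self, pages=64, page_bytes=2048):
        self.pages = [bytearray(b"\xff" * page_bytes) for _ in range(pages)]
        self.erase_count = 0

    def program(self, page_no, data):
        """Write data onto a page; bits can only change from 1 to 0."""
        page = self.pages[page_no]
        for i, byte in enumerate(data):
            page[i] &= byte          # a 0 bit stays 0 whatever is written

    def erase(self):
        """Reset every bit of every page in the block to 1."""
        for page in self.pages:
            page[:] = b"\xff" * len(page)
        self.erase_count += 1
```

Programming the same page twice yields the bitwise AND of the two writes, which is why an in-place update generally needs an erase of the whole block first.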


Flash memory

• Reading data takes tens of microseconds for the first access to a page, plus tens of nanoseconds per byte.

• Writing a page takes hundreds of microseconds, plus tens of nanoseconds per byte.

• Erasing a block takes several milliseconds.

• Each block can sustain only a limited number of erasures.

Algorithms/data structures designed for I/O model do not always work well when implemented on flash memory.


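Flash memory models (I)

• General flash model:

• The complexity of an algorithm is x + c · y, where x and y are the number of read and write I/Os respectively, and c is a penalty factor for writing.

• Typically, we assume that BR < BW << M and c ≥ 1, where BR and BW are the read and write block sizes and M is the size of main memory.

[Figure: main memory M and flash, with read blocks of size BR, write blocks of size BW, and write penalty c]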

Flash memory models (II)

• Unit-cost flash model:

• General flash model augmented with the assumption of an equal access time per element for reading and writing.

• The cost of an algorithm performing x read I/Os and y write I/Os is given by x · BR + y · BW.

• This simplifies the model considerably, as it becomes easier to adapt external-memory results.

[Figure: main memory M and flash, with read blocks of size BR and write blocks of size BW]
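The two cost measures, x + c · y for the general flash model and x · BR + y · BW for the unit-cost flash model, can be written down directly. The function names are assumptions of this sketch; the formulas are the ones stated on the slides.

```python
def general_flash_cost(x_reads, y_writes, c):
    """General flash model: x + c * y, with write penalty c >= 1."""
    return x_reads + c * y_writes


def unit_cost_flash_cost(x_reads, y_writes, BR, BW):
    """Unit-cost flash model: x * BR + y * BW (cost per element moved)."""
    return x_reads * BR + y_writes * BW
```

For example, 10 read I/Os and 3 write I/Os cost 10 + 4 · 3 = 22 in the general model with c = 4, and 10 · 2 + 3 · 8 = 44 in the unit-cost model with BR = 2 and BW = 8.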

B-trees on flash memory

• An insertion in a B-tree updates a single leaf (unless the leaf splits)

• Since we cannot perform an in-place update in flash memory, we need to create a new copy of the leaf, with the new element inserted.

• Since the parent of this leaf has to update its pointer to the leaf, we need to create a new copy of the parent, and so on up to the root.

• Thus the write performance is quite bad for the naïve implementation.

Flash Translation Layer (FTL)

• Software layer on the flash disk which performs logical to physical block mapping.

• Distributes writes uniformly across blocks.

• B-tree with FTL:

– All nodes contain just the logical address of other nodes

– Allows any update to write just the target node

• Achieves one erase per update (amortized)
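The effect of logical addressing can be illustrated with a toy mapping layer. This is a sketch under strong simplifications: the `ToyFTL` class is hypothetical, physical pages are an append-only list, and garbage collection and wear levelling are not modelled.

```python
class ToyFTL:
    """Logical-to-physical mapping with out-of-place writes."""

    def __init__(self):
        self.mapping = {}    # logical address -> index into self.physical
        self.physical = []   # append-only list standing in for flash pages
        self.writes = 0

    def write(self, logical, data):
        self.physical.append(data)                  # always use a fresh page
        self.mapping[logical] = len(self.physical) - 1
        self.writes += 1

    def read(self, logical):
        return self.physical[self.mapping[logical]]


ftl = ToyFTL()
ftl.write("root", {"children": ["leaf0"]})   # parent stores a logical address
ftl.write("leaf0", [1, 2])
ftl.write("leaf0", [1, 2, 3])                # update touches only the leaf
```

Because the root refers to the leaf by its logical address, rewriting the leaf to a new physical page does not force a new copy of the root; only the mapping entry changes.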


μ-tree [Kang, Jung, Kang, Kim, 2007]

• Minimally Updated tree

• Achieves performance similar to ‘B-tree with FTL’ on raw flash

• Node sizes decrease exponentially from the leaves to the root

• Each block corresponds to a leaf-to-root path, and stores the nodes on a prefix of this path

• Works only when log_2 B ≥ log_B N

FD-tree [Li, He, Yang, Luo, Yi, 2010]

• Flash Disk aware tree index

• Transforms random writes into sequential writes

• Limits random writes to within a small region

• Contains a head tree and a few levels of sorted runs of increasing sizes

• O(log_k N) levels, where k is the size ratio between levels

Other B-tree indexes for flash memory

• BFTL [Wu, Luo, Chang, 2007]

• Lazy Adaptive tree [Agrawal, Ganesan, Sitaraman, Diao, Singh, 2009]

• Lazy Update tree [On, Hu, Li, Xu, 2009]

• In-page Logging approach [Lee, Moon, 2007]

• …

All these are designed to get better practical performance, and take different aspects of flash characteristics into consideration, so they are not easy to compare with each other.

Comparison of tree indexes on flash

[Table: comparison of tree indexes on flash; contents not recovered]

N – number of elements

BR – read block size

BW – write block size

BU – size of buffer

h – height of the tree

k – parameter

Directions for further research

• The area is still in its infancy.

• Not much work has been done apart from the development of some file systems and tree indexing structures

• Open problems:

– Efficient tree indexes for flash memory

– Tons of other (practically significant) algorithmic problems

– Better memory model.


Thank You
