Tree Indexing on Flash Disks

Transcript of slides

Page 1: slides

Tree Indexing on Flash Disks

Yinan Li
In cooperation with:

Bingsheng He, Qiong Luo, and Ke Yi

Hong Kong University of Science and Technology


Page 2: slides

Introduction

• Flash-based devices: the mainstream storage in mobile devices and embedded systems.

• Recently, the flash disk, or flash Solid State Disk (SSD), has emerged as a viable alternative to the magnetic hard disk for non-volatile storage.

“Tape is Dead, Disk is Tape, Flash is Disk” – Jim Gray


Page 3: slides

Flash SSD

• Intel X-25M 80GB SATA SSD
• Mtron 64GB SATA SSD
• Other manufacturers: Samsung, SanDisk, Seagate, Fusion-IO, …


Page 4: slides

Internal Structure of Flash Disk

[Figure: internal structure of a flash disk]

Page 5: slides

Flash Memory

Three basic operations of flash memory:
• Read: page (512B-2KB), 80us
• Write: page (512B-2KB), 200us
  – Writes can only change bits from 1 to 0.
• Erase: block (128-512KB), 1.5ms
  – Clears all bits back to 1.
  – Each block can be erased only a finite number of times before it wears out.
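
To make these rules concrete, here is a minimal sketch (ours, not from the slides; sizes are shrunk for illustration) of how a flash block behaves: a write can only clear bits, so updating a page in place is impossible without erasing the whole block.

    # Minimal model of one flash block (illustrative sizes, not real hardware).
    PAGES_PER_BLOCK = 4            # real blocks hold 128-512KB

    class FlashBlock:
        def __init__(self):
            self.pages = [0xFF] * PAGES_PER_BLOCK   # erased state: all bits 1
            self.erase_count = 0   # wear: a block survives a finite number of erases

        def read(self, i):
            return self.pages[i]                    # ~80us on real devices

        def write(self, i, value):
            # A write (program) can only change bits from 1 to 0, so the result
            # is the AND of old and new contents; in-place updates are impossible.
            self.pages[i] &= value                  # ~200us on real devices

        def erase(self):
            self.pages = [0xFF] * PAGES_PER_BLOCK   # reset all bits to 1, ~1.5ms
            self.erase_count += 1

    blk = FlashBlock()
    blk.write(0, 0b10110101)
    blk.write(0, 0b11111100)          # tries to flip bit 0 back to 1: ignored
    assert blk.read(0) == 0b10110100  # only the 1 -> 0 transitions took effect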


Page 6: slides

Flash Translation Layer (FTL)

• Flash SSDs employ a firmware layer, called the FTL, to implement an out-of-place update scheme.

• It maintains a mapping table between logical and physical pages to support:
  – Address translation
  – Garbage collection
  – Wear leveling

• Page-Level Mapping, Block-Level Mapping, Fragmentation
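
As a rough illustration of the out-of-place scheme (a sketch under our own simplifications; real FTLs add garbage collection and wear leveling on top), a page-level FTL can be modeled as a remapping table:

    class PageLevelFTL:
        """Toy page-level FTL: logical pages are remapped on every write."""
        def __init__(self, num_physical_pages):
            self.mapping = {}                            # logical id -> physical id
            self.flash = [None] * num_physical_pages     # physical pages
            self.free = list(range(num_physical_pages))  # clean (erased) pages
            self.invalid = []                            # superseded pages, to be
                                                         # reclaimed by garbage
                                                         # collection (not modeled)
        def write(self, logical, data):
            phys = self.free.pop(0)            # out-of-place: always a clean page
            self.flash[phys] = data
            if logical in self.mapping:
                self.invalid.append(self.mapping[logical])  # old copy is garbage
            self.mapping[logical] = phys       # address-translation entry

        def read(self, logical):
            return self.flash[self.mapping[logical]]

    ftl = PageLevelFTL(num_physical_pages=8)
    ftl.write(3, "v1")
    ftl.write(3, "v2")                 # no erase on the write path, just remapping
    assert ftl.read(3) == "v2" and len(ftl.invalid) == 1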


Page 7: slides

Superiority of Flash Disk

• Purely electrical device (no mechanical moving parts)
  – Extremely fast random read speed
  – Low power consumption


[Figure: magnetic hard disk vs. flash disk]

Page 8: slides

Challenge of Flash Disk

• Due to the physical characteristics of flash memory, flash disks exhibit relatively poor random write performance.


Page 9: slides

Bandwidth of Basic Access Patterns

• Random writes are 5.6-55X slower than random reads on flash SSDs [Intel, Mtron, Samsung SSDs].
• Random accesses are significantly slower than sequential ones with multi-page optimization.

[Figures: bandwidth of the basic access patterns at access unit sizes of 2KB and 512KB]

Page 10: slides

Tree Indexing on Flash Disk

• Tree indexes are a primary access method in databases

• Tree indexes on flash disks
  – exploit the fast random read speed.
  – suffer from the poor random write performance.

• We study how to adapt them to the flash disk, exploiting its hardware features for efficiency.


Page 11: slides

B+-Tree

• Search I/O cost: O(log_B N) random reads
• Update I/O cost: O(log_B N) random reads + O(1) random writes


[Figure: a B+-tree with O(log_B N) levels — searching for key 48 descends one node per level to the leaf containing 48; inserting key 40 adds it to a leaf, incurring a random page write]
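
As a quick worked instance (our numbers, not the slides'): with a fanout of $B = 100$ entries per node and $N = 10^8$ indexed entries,

$$\log_B N = \log_{100} 10^8 = 4,$$

so a search reads about 4 pages at random, and an update adds at least one random page write on top of that.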

Page 12: slides

LSM-Tree (Log-Structured Merge Tree)

• Search I/O cost: O(log_k N · log_B N) random reads
• Update I/O cost: O(log_k N) sequential writes


[Figure: an LSM-tree is a collection of O(log_k N) B+-trees whose sizes grow by a ratio of k, each with O(log_B N) levels; a search (key X) probes every tree, an insert (key Y) goes to the smallest tree, and trees are merged into their larger neighbors]

[1] P. E. O'Neil, E. Cheng, D. Gawlick, and E. J. O'Neil. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 1996.

Page 13: slides

BFTL

• Search I/O cost: O(c · log_B N) random reads
• Update I/O cost: O(1/c) random writes


[Figure: BFTL's node translation table maps each logical B-tree node to a linked list of flash pages (Pid 0, 1, 2, 3, 100, 200, …); the maximum length of the link lists is bounded by c]

[2] Chin-Hsien Wu, Tei-Wei Kuo, and Li-Pin Chang. An efficient B-tree layer implementation for flash-memory storage systems. In RTCSA, 2003.

Page 14: slides

Designing Index for Flash Disk

• Our goals:
  – Reduce the update cost
  – Preserve search efficiency

• Two ways to reduce the random write cost:
  – Transform random writes into sequential ones.
  – Limit them to a small area (512KB-8MB).


Page 15: slides

Outline

• Introduction
• Structure of FD-Tree
• Cost Analysis
• Experimental Results
• Conclusion


Page 16: slides

FD-Tree

• Transform random writes into sequential ones by the logarithmic method.
  – Inserts are performed on a small tree first.
  – Entries are gradually merged into larger trees.

• Improve search efficiency by fractional cascading.
  – In each level, a special entry locates the page in the next level that the search visits next.


Page 17: slides

Data Structure of FD-Tree

• L levels:
  – one head tree (a B+-tree) on the top
  – L-1 sorted runs below

• Logarithmically increasing sizes (capacities) of levels


Page 18: slides

Data Structure of FD-Tree

• Entry: a pair of key and pointer.
• Fence: a special entry used to improve search efficiency.
  – Its key equals the FIRST key in the page it points to.
  – Its pointer is the ID of the page in the immediate next level that the search visits next.
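
A minimal sketch (our simplification; the head tree and page layout are reduced to plain lists) of how fences steer a search so that it reads only one page per level:

    # Each page is a list of (key, kind, value): a 'fence' value is the page id
    # to visit in the immediate next level, a 'data' value is a record pointer.
    def search_page(page, key):
        match, next_page = None, None
        for k, kind, value in page:            # pages hold at most f entries
            if k > key:
                break                          # entries are sorted within a page
            if kind == 'fence':
                next_page = value              # remember the last fence with k <= key
            elif k == key:
                match = value
        return match, next_page

    def fd_search(levels, key):
        """Visit one page per level, following fences (fractional cascading)."""
        results, page_id = [], 0               # level 0 (head tree leaf) is one page
        for level in levels:
            match, page_id = search_page(level[page_id], key)
            if match is not None:
                results.append(match)
        return results

    L0 = [[(1, 'fence', 0), (8, 'fence', 1)]]              # head tree leaf
    L1 = [[(1, 'fence', 0), (5, 'data', 'a')],
          [(8, 'fence', 1), (9, 'data', 'b')]]
    L2 = [[(1, 'fence', None), (3, 'data', 'c')],          # bottom level needs no
          [(8, 'fence', None), (9, 'data', 'd')]]          # fences; None keeps pages uniform
    assert fd_search([L0, L1, L2], 9) == ['b', 'd']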


Page 19: slides

Data Structure of FD-Tree

• Each page is pointed to by one or more fences in the level immediately above.

• The first entry of each page is a fence (if not, we insert one).


Page 20: slides

Insertion on FD-Tree

• Insert the new entry into the head tree.

• If the head tree is full, merge it into the next level and then empty it.

• The merge may cascade recursively to lower levels (see the sketch below).
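
A compact sketch of this flow (ours; levels are plain sorted lists, fences are omitted, and the head tree is just level 0):

    import bisect

    HEAD_CAP, K = 4, 4                 # head-tree capacity and size ratio k

    def fd_insert(levels, entry):
        bisect.insort(levels[0], entry)          # cheap insert into the head tree
        i = 0
        while len(levels[i]) > HEAD_CAP * K**i:  # level i overflowed
            if i + 1 == len(levels):
                levels.append([])                # open a new, larger bottom level
            # Merge level i into level i+1, then empty level i. A real
            # implementation does one sequential scan of both sorted runs.
            levels[i + 1] = sorted(levels[i] + levels[i + 1])
            levels[i] = []
            i += 1                               # the merge may cascade downward

    levels = [[]]
    for x in [9, 3, 7, 1, 5, 2, 8]:
        fd_insert(levels, x)
    # levels is now [[2, 8], [1, 3, 5, 7, 9]]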


Page 21: slides


Merge on FD-Tree

• Scan two sorted runs and generate new sorted runs.

[Figure: merge example — the entries of L_i and L_{i+1} are merge-scanned into a new L_{i+1}, and a new L_i is built with one fence per page of the new L_{i+1}. Legend: fence, entry in L_i, entry in L_{i+1}]
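
A sketch of that scan (ours; F stands in for the page fanout f, and the keys are illustrative), showing how the new L_i is rebuilt as one fence per page of the new L_{i+1}:

    import heapq

    F = 4                                          # entries per page (fanout f)

    def merge_runs(li, li1):
        merged = list(heapq.merge(li, li1))        # one sequential scan of both runs
        pages = [merged[p:p + F] for p in range(0, len(merged), F)]
        # New L_i: one fence per new page, keyed by the page's FIRST entry.
        fences = [(page[0], pid) for pid, page in enumerate(pages)]
        return fences, pages

    new_li, new_li1 = merge_runs([2, 3, 11, 19, 29],
                                 [1, 5, 6, 7, 9, 10, 12, 15, 22, 24, 26])
    # new_li1 pages: [1,2,3,5] [6,7,9,10] [11,12,15,19] [22,24,26,29]
    # new_li fences: (1,0), (6,1), (11,2), (22,3)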

Page 22: slides

Insertion & Merge on FD-Tree

• When the top L levels are full, merge all of them and replace them with new ones.


[Figure: inserts go into the head tree; merges push full levels downward]

Page 23: slides

Search on FD-Tree

[Figure: searching for key 81 — the head tree L0 directs the search to one page of L1, whose fence directs it to the page of L2 containing 81; one page is visited per level]

Page 24: slides

Deletion on FD-Tree

• A deletion is handled in a way similar to an insertion.

• Insert a special entry, called a filter entry, to mark that the original entry, called the phantom entry, has been deleted.

• As merges occur, the filter entry eventually encounters its corresponding phantom entry in some level; at that point we discard both of them (see the sketch below).
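
A tiny sketch (ours) of that cancellation during a merge, with entries as (key, is_filter) pairs — the phantom lives in the lower run and its filter arrives from the upper one:

    import heapq

    def merge_with_annihilation(upper, lower):
        out = []
        for key, is_filter in heapq.merge(upper, lower):
            if is_filter and out and out[-1] == (key, False):
                out.pop()              # the filter meets its phantom: drop both
            else:
                out.append((key, is_filter))
        return out

    # The filter (42, True) cancels the phantom (42, False) when the levels merge:
    assert merge_with_annihilation([(7, False), (42, True)],
                                   [(5, False), (42, False)]) == [(5, False), (7, False)]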


Page 25: slides

Deletion on FD-Tree

16

45

37

16

45

16 45

16

16

Delete three entries

Merge L0, L1, L2

Merge L0,L1

25

L1

L1

L2

L2

L0

L0

L1

L1

L2

L2

L0

L0

Page 26: slides

Outline

• Introduction
• Structure of FD-Tree
• Cost Analysis
• Experimental Results
• Conclusion


Page 27: slides

Cost Analysis of FD-Tree

• I/O cost of FD-Tree

– Search: O(log_k(N / |L0|)) random reads
– Insertion: O((k / (f - k)) · log_k(N / |L0|)) sequential I/Os (amortized)
– Deletion: Search + Insertion
– Update: Deletion + Insertion

where
  k: size ratio between adjacent levels
  f: # entries in a page
  N: # entries in the index
  |L0|: # entries in the head tree
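
Plugging in illustrative numbers (ours, chosen to satisfy the $f = k^2$ simplification of the next slide): with $f = 256$, $k = 16$, $N = 2^{30}$, and $|L_0| = 2^{16}$,

$$\log_k \frac{N}{|L_0|} = \log_{16} 2^{14} = 3.5,$$

so a search costs roughly 4 random reads, while an insertion costs about $\frac{k}{f-k} \cdot 3.5 = \frac{16}{240} \cdot 3.5 \approx 0.23$ amortized sequential page I/Os.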


Page 28: slides

I/O Cost Comparison

            Search               Insertion
            Rand. Read           Rand. Read    Seq. Read            Rand. Write   Seq. Write
FD-Tree     log_k N              -             (k/(f-k))·log_k N    -             (k/(f-k))·log_k N
B+-Tree     log_f N              log_f N       -                    1             -
LSM-Tree    log_k N · log_f N    -             (k/(f-k))·log_k N    -             (k/(f-k))·log_k N
BFTL        c · log_f N          c · log_f N   -                    1/c           -

You may assume f = k^2 for simplicity of comparison; then k/(f-k) = 1/(k-1).

Page 29: slides

Cost Model

• Tradeoff in the value of k:
  – Large k: high insertion cost
  – Small k: high search cost

• We develop a cost model to calculate the optimal value of k, given the characteristics of both the flash SSD and the workload (a sketch of the idea follows).
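
A sketch of how such a model can be used (our stand-in per-page costs and formulas, not the paper's calibrated model):

    from math import log

    R_RAND, W_SEQ = 0.08, 0.02     # hypothetical ms per page: random read, seq. write
    N, L0, F = 2**30, 2**16, 256   # index entries, head-tree entries, page fanout f

    def est_cost(k, p_search=0.5):            # p_search: fraction of searches
        levels = log(N / L0, k)
        search = R_RAND * levels              # one random page read per level
        insert = W_SEQ * (k / (F - k)) * levels   # amortized sequential merge traffic
        return p_search * search + (1 - p_search) * insert

    best_k = min(range(2, F), key=est_cost)   # pick the k minimizing estimated cost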


Page 30: slides

Cost Model

• Estimated cost while varying the value of k


Page 31: slides

Outline

• Introduction
• Structure of FD-Tree
• Cost Analysis
• Experimental Results
• Conclusion


Page 32: slides

Implementation Details

• Storage Layout
  – Fixed-length record page format
  – OS disk buffering disabled

• Buffer Manager
  – LRU replacement policy


[Figure: architecture — B+-tree, FD-tree, LSM-tree, and BFTL implemented on top of a common buffer manager and storage layout over flash SSDs]

Page 33: slides

Experimental Setup

• Platform
  – Intel quad-core CPU
  – 2GB memory
  – Windows XP

• Three flash SSDs:
  – Intel X-25M 80GB, Mtron 64GB, Samsung 32GB
  – SATA interface


Page 34: slides

Experimental Settings

• Index size: 128MB-8GB (8GB by default)
• Entry size: 8 bytes (4-byte key + 4-byte pointer)
• Buffer size: 16MB
• Warm-up period: 10,000 queries
• Workload: 50% search + 50% insertion (by default)


Page 35: slides

Validation of the Cost Model

• The estimated costs are very close to the measured ones.
• With the cost model, we can estimate a reasonably accurate value of k that minimizes the overall cost.

[Figures: estimated vs. measured costs on the Mtron SSD and the Intel SSD]

Page 36: slides

Overall Performance Comparison

• On the Mtron SSD, FD-tree is 24.2X, 5.8X, and 1.8X faster than B+-tree, BFTL, and LSM-tree, respectively.

• On the Intel SSD, FD-tree is 3X, 3X, and 1.5X faster than B+-tree, BFTL, and LSM-tree, respectively.


[Figures: overall performance on the Mtron SSD and the Intel SSD]

Page 37: slides

Search Performance Comparison

• FD-tree has search performance similar to B+-tree.
• FD-tree and B+-tree outperform the others on both SSDs.


[Figures: search performance on the Mtron SSD and the Intel SSD]

Page 38: slides

Insertion Performance Comparison

• FD-tree has insertion performance similar to LSM-tree.
• FD-tree and LSM-tree outperform the others on both SSDs.


[Figures: insertion performance on the Mtron SSD and the Intel SSD]

Page 39: slides

Performance Comparison

• W_Search: 80% search + 10% insertion + 5% deletion + 5% update

• W_Update: 20% search + 40% insertion + 20% deletion + 20% update


Page 40: slides

Outline

• Introduction
• Structure of FD-Tree
• Cost Analysis
• Experimental Results
• Conclusion


Page 41: slides

Conclusion

• We design a new index structure that transforms almost all random writes into sequential ones while preserving search efficiency.

• We show, both empirically and analytically, that FD-tree outperforms the other indexes on various flash SSDs.


Page 42: slides

Related Publications

• Yinan Li, Bingsheng He, Qiong Luo, Ke Yi. Tree Indexing on Flash Disks. ICDE 2009. Short Paper.

• Yinan Li, Bingsheng He, Qiong Luo, Ke Yi. Tree Indexing on Flash-Based Solid State Drives. Journal version in preparation.


Page 43: slides

Q&A

• Thank you!
• Q&A



Page 45: slides

Additional Slides


Page 46: slides

Block-Level FTL

• Mapping granularity: block
• Update cost: 1 erase + N writes + N reads (N = pages per block)

[Figure: block-level mapping table from logical block IDs to physical block IDs]


Page 47: slides

Page-Level FTL

• Mapping granularity: page
• Larger mapping table
• Update cost: 1/N erase (amortized) + 1 write + 1 read

[Figure: page-level mapping table from logical page IDs to physical page IDs]


Page 48: slides

Fragmentation

• Cost of Recycling ONE block: N^2 reads, N*(N-1) writes, N erases.

[Figure: when the flash disk is full, blocks must be recycled to reclaim space]


Page 49: slides

Deamortized FD-Tree

• Normal FD-tree
  – High average insertion performance
  – Poor worst-case insertion performance

• Deamortized FD-tree
  – Reduces the worst-case insertion cost
  – Preserves the average insertion cost


Page 50: slides

Deamortized FD-Tree

• Maintain two head trees, T0 and T0'
  – Insert into T0'
  – Search both T0 and T0'
  – Merge T0 into the lower levels concurrently

[Figure: searches consult both T0 and T0'; inserts go into T0' while T0 is being merged]
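
A sketch of the scheme (ours; the background merge is reduced to a counter so only the control flow is shown):

    class DeamortizedHead:
        def __init__(self):
            self.t0, self.t0_new = [], []   # T0: being merged out; T0': takes inserts
            self.l1 = []                    # destination of the ongoing merge
            self.merged = 0                 # merge progress, in entries

        def insert(self, key, steps=2):
            self.t0_new.append(key)         # inserts never wait for a whole merge
            # Advance the background merge a bounded number of steps so the
            # worst-case cost of one insert is O(steps), not one giant merge.
            self.merged = min(self.merged + steps, len(self.t0))
            if self.merged == len(self.t0):           # merge of T0 has finished:
                self.l1.extend(self.t0)               # (stand-in for the real merge)
                self.t0, self.t0_new = self.t0_new, []  # T0' becomes the new T0
                self.merged = 0

        def search(self, key):
            return key in self.t0 or key in self.t0_new or key in self.l1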


Page 51: slides

Deamortized FD-Tree

• The high merge cost is amortized across all entries inserted into the head tree.

• The overall cost (almost) does not increase.


Page 52: slides

FD-Tree vs. Deamortized FD-Tree

• Much improved worst-case performance
• Low overhead
