Tree Indexing on Flash Disks

Transcript of slides

Page 1: slides

Tree Indexing on Flash Disks

Yinan Li
In cooperation with:

Bingsheng He, Qiong Luo, and Ke Yi

Hong Kong University of Science and Technology


Page 2: slides

Introduction

• Flash-based devices: the mainstream storage in mobile devices and embedded systems.

• Recently, the flash disk, or flash Solid State Disk (SSD), has emerged as a viable alternative to the magnetic hard disk for non-volatile storage.

“Tape is Dead, Disk is Tape, Flash is Disk” – Jim Gray


Page 3: slides

Flash SSD

• Intel X-25M 80GB SATA SSD
• Mtron 64GB SATA SSD
• Other manufacturers: Samsung, SanDisk, Seagate, Fusion-IO, …


Page 4: slides

Internal Structure of Flash Disk

[Figure: internal structure of a flash disk]

Page 5: slides

Flash Memory

Three basic operations of flash memory:
• Read: page (512B-2KB), 80us
• Write: page (512B-2KB), 200us
  – Writes can only change bits from 1 to 0.
• Erase: block (128-512KB), 1.5ms
  – Clears all bits back to 1.
  – Each block can be erased only a finite number of times before it wears out.
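
To make these rules concrete, here is a minimal sketch (ours, not from the slides; sizes are shrunk for illustration) of how a flash block behaves: a write can only clear bits, so updating a page in place is impossible without erasing the whole block.

    # Minimal model of one flash block (illustrative sizes, not real hardware).
    PAGES_PER_BLOCK = 4            # real blocks hold 128-512KB

    class FlashBlock:
        def __init__(self):
            self.pages = [0xFF] * PAGES_PER_BLOCK   # erased state: all bits 1
            self.erase_count = 0   # wear: a block survives a finite number of erases

        def read(self, i):
            return self.pages[i]                    # ~80us on real devices

        def write(self, i, value):
            # A write (program) can only change bits from 1 to 0, so the result
            # is the AND of old and new contents; in-place updates are impossible.
            self.pages[i] &= value                  # ~200us on real devices

        def erase(self):
            self.pages = [0xFF] * PAGES_PER_BLOCK   # reset all bits to 1, ~1.5ms
            self.erase_count += 1

    blk = FlashBlock()
    blk.write(0, 0b10110101)
    blk.write(0, 0b11111100)          # tries to flip bit 0 back to 1: ignored
    assert blk.read(0) == 0b10110100  # only the 1 -> 0 transitions took effect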


Page 6: slides

Flash Translation Layer (FTL)

• Flash SSDs employ a firmware layer, called the FTL, to implement an out-of-place update scheme.

• It maintains a mapping table between logical and physical pages to support:
  – Address translation
  – Garbage collection
  – Wear leveling

• Page-Level Mapping, Block-Level Mapping, Fragmentation
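
As a rough illustration of the out-of-place scheme (a sketch under our own simplifications; real FTLs add garbage collection and wear leveling on top), a page-level FTL can be modeled as a remapping table:

    class PageLevelFTL:
        """Toy page-level FTL: logical pages are remapped on every write."""
        def __init__(self, num_physical_pages):
            self.mapping = {}                            # logical id -> physical id
            self.flash = [None] * num_physical_pages     # physical pages
            self.free = list(range(num_physical_pages))  # clean (erased) pages
            self.invalid = []                            # superseded pages, to be
                                                         # reclaimed by garbage
                                                         # collection (not modeled)
        def write(self, logical, data):
            phys = self.free.pop(0)            # out-of-place: always a clean page
            self.flash[phys] = data
            if logical in self.mapping:
                self.invalid.append(self.mapping[logical])  # old copy is garbage
            self.mapping[logical] = phys       # address-translation entry

        def read(self, logical):
            return self.flash[self.mapping[logical]]

    ftl = PageLevelFTL(num_physical_pages=8)
    ftl.write(3, "v1")
    ftl.write(3, "v2")                 # no erase on the write path, just remapping
    assert ftl.read(3) == "v2" and len(ftl.invalid) == 1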


Page 7: slides

Superiority of Flash Disk

• Purely electrical device (no mechanical moving parts)
  – Extremely fast random read speed
  – Low power consumption


[Figure: magnetic hard disk vs. flash disk]

Page 8: slides

Challenge of Flash Disk

• Due to the physical characteristics of flash memory, flash disks exhibit relatively poor random write performance.


Page 9: slides

Bandwidth of Basic Access Patterns

• Random writes are 5.6-55X slower than random reads on flash SSDs [Intel, Mtron, Samsung SSDs].
• Random accesses are significantly slower than sequential ones with multi-page optimization.

[Figures: bandwidth of the basic access patterns at access unit sizes of 2KB and 512KB]

Page 10: slides

Tree Indexing on Flash Disk

• Tree indexes are a primary access method in databases

• Tree indexes on flash disks
  – exploit the fast random read speed.
  – suffer from the poor random write performance.

• We study how to adapt them to the flash disk, exploiting its hardware features for efficiency.


Page 11: slides

B+-Tree

• Search I/O cost: O(log_B N) random reads
• Update I/O cost: O(log_B N) random reads + O(1) random writes


[Figure: a B+-tree with O(log_B N) levels — searching for key 48 descends one node per level to the leaf containing 48; inserting key 40 adds it to a leaf, incurring a random page write]
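
As a quick worked instance (our numbers, not the slides'): with a fanout of $B = 100$ entries per node and $N = 10^8$ indexed entries,

$$\log_B N = \log_{100} 10^8 = 4,$$

so a search reads about 4 pages at random, and an update adds at least one random page write on top of that.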

Page 12: slides

LSM-Tree (Log-Structured Merge Tree)

• Search I/O cost: O(log_k N · log_B N) random reads
• Update I/O cost: O(log_k N) sequential writes


[Figure: an LSM-tree is a collection of O(log_k N) B+-trees whose sizes grow by a ratio of k, each with O(log_B N) levels; a search (key X) probes every tree, an insert (key Y) goes to the smallest tree, and trees are merged into their larger neighbors]

[1] P. E. O'Neil, E. Cheng, D. Gawlick, and E. J. O'Neil. The Log-Structured Merge-Tree (LSM-Tree). Acta Informatica, 1996.

Page 13: slides

BFTL

• Search I/O cost: O(c · log_B N) random reads
• Update I/O cost: O(1/c) random writes


[Figure: BFTL's node translation table maps each logical B-tree node to a linked list of flash pages (Pid 0, 1, 2, 3, 100, 200, …); the maximum length of the link lists is bounded by c]

[2] Chin-Hsien Wu, Tei-Wei Kuo, and Li-Pin Chang. An efficient B-tree layer implementation for flash-memory storage systems. In RTCSA, 2003.

Page 14: slides

Designing Index for Flash Disk

• Our goals:
  – Reduce the update cost
  – Preserve search efficiency

• Two ways to reduce the random write cost:
  – Transform random writes into sequential ones.
  – Limit them to a small area (512KB-8MB).


Page 15: slides

Outline

• Introduction
• Structure of FD-Tree
• Cost Analysis
• Experimental Results
• Conclusion


Page 16: slides

FD-Tree

• Transform random writes into sequential ones by the logarithmic method.
  – Inserts are performed on a small tree first.
  – Entries are gradually merged into larger trees.

• Improve search efficiency by fractional cascading.
  – In each level, a special entry locates the page in the next level that the search visits next.


Page 17: slides

Data Structure of FD-Tree

• L levels:
  – one head tree (a B+-tree) on the top
  – L-1 sorted runs below

• Logarithmically increasing sizes (capacities) of levels


Page 18: slides

Data Structure of FD-Tree

• Entry: a pair of key and pointer.
• Fence: a special entry used to improve search efficiency.
  – Its key equals the FIRST key in the page it points to.
  – Its pointer is the ID of the page in the immediate next level that the search visits next.
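
A minimal sketch (our simplification; the head tree and page layout are reduced to plain lists) of how fences steer a search so that it reads only one page per level:

    # Each page is a list of (key, kind, value): a 'fence' value is the page id
    # to visit in the immediate next level, a 'data' value is a record pointer.
    def search_page(page, key):
        match, next_page = None, None
        for k, kind, value in page:            # pages hold at most f entries
            if k > key:
                break                          # entries are sorted within a page
            if kind == 'fence':
                next_page = value              # remember the last fence with k <= key
            elif k == key:
                match = value
        return match, next_page

    def fd_search(levels, key):
        """Visit one page per level, following fences (fractional cascading)."""
        results, page_id = [], 0               # level 0 (head tree leaf) is one page
        for level in levels:
            match, page_id = search_page(level[page_id], key)
            if match is not None:
                results.append(match)
        return results

    L0 = [[(1, 'fence', 0), (8, 'fence', 1)]]              # head tree leaf
    L1 = [[(1, 'fence', 0), (5, 'data', 'a')],
          [(8, 'fence', 1), (9, 'data', 'b')]]
    L2 = [[(1, 'fence', None), (3, 'data', 'c')],          # bottom level needs no
          [(8, 'fence', None), (9, 'data', 'd')]]          # fences; None keeps pages uniform
    assert fd_search([L0, L1, L2], 9) == ['b', 'd']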


Page 19: slides

Data Structure of FD-Tree

• Each page is pointed to by one or more fences in the level immediately above.

• The first entry of each page is a fence (if not, we insert one).


Page 20: slides

Insertion on FD-Tree

• Insert the new entry into the head tree.

• If the head tree is full, merge it into the next level and then empty it.

• The merge may cascade recursively to lower levels (see the sketch below).
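
A compact sketch of this flow (ours; levels are plain sorted lists, fences are omitted, and the head tree is just level 0):

    import bisect

    HEAD_CAP, K = 4, 4                 # head-tree capacity and size ratio k

    def fd_insert(levels, entry):
        bisect.insort(levels[0], entry)          # cheap insert into the head tree
        i = 0
        while len(levels[i]) > HEAD_CAP * K**i:  # level i overflowed
            if i + 1 == len(levels):
                levels.append([])                # open a new, larger bottom level
            # Merge level i into level i+1, then empty level i. A real
            # implementation does one sequential scan of both sorted runs.
            levels[i + 1] = sorted(levels[i] + levels[i + 1])
            levels[i] = []
            i += 1                               # the merge may cascade downward

    levels = [[]]
    for x in [9, 3, 7, 1, 5, 2, 8]:
        fd_insert(levels, x)
    # levels is now [[2, 8], [1, 3, 5, 7, 9]]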


Page 21: slides


Merge on FD-Tree

• Scan two sorted runs and generate new sorted runs.

[Figure: merge example — the entries of L_i and L_{i+1} are merge-scanned into a new L_{i+1}, and a new L_i is built with one fence per page of the new L_{i+1}. Legend: fence, entry in L_i, entry in L_{i+1}]
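
A sketch of that scan (ours; F stands in for the page fanout f, and the keys are illustrative), showing how the new L_i is rebuilt as one fence per page of the new L_{i+1}:

    import heapq

    F = 4                                          # entries per page (fanout f)

    def merge_runs(li, li1):
        merged = list(heapq.merge(li, li1))        # one sequential scan of both runs
        pages = [merged[p:p + F] for p in range(0, len(merged), F)]
        # New L_i: one fence per new page, keyed by the page's FIRST entry.
        fences = [(page[0], pid) for pid, page in enumerate(pages)]
        return fences, pages

    new_li, new_li1 = merge_runs([2, 3, 11, 19, 29],
                                 [1, 5, 6, 7, 9, 10, 12, 15, 22, 24, 26])
    # new_li1 pages: [1,2,3,5] [6,7,9,10] [11,12,15,19] [22,24,26,29]
    # new_li fences: (1,0), (6,1), (11,2), (22,3)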

Page 22: slides

Insertion & Merge on FD-Tree

• When the top L levels are full, merge all of them and replace them with new ones.


[Figure: inserts go into the head tree; merges push full levels downward]

Page 23: slides

Search on FD-Tree

[Figure: searching for key 81 — the head tree L0 directs the search to one page of L1, whose fence directs it to the page of L2 containing 81; one page is visited per level]

Page 24: slides

Deletion on FD-Tree

• A deletion is handled in a way similar to an insertion.

• Insert a special entry, called a filter entry, to mark that the original entry, called the phantom entry, has been deleted.

• As merges occur, the filter entry eventually encounters its corresponding phantom entry in some level; at that point we discard both of them (see the sketch below).
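
A tiny sketch (ours) of that cancellation during a merge, with entries as (key, is_filter) pairs — the phantom lives in the lower run and its filter arrives from the upper one:

    import heapq

    def merge_with_annihilation(upper, lower):
        out = []
        for key, is_filter in heapq.merge(upper, lower):
            if is_filter and out and out[-1] == (key, False):
                out.pop()              # the filter meets its phantom: drop both
            else:
                out.append((key, is_filter))
        return out

    # The filter (42, True) cancels the phantom (42, False) when the levels merge:
    assert merge_with_annihilation([(7, False), (42, True)],
                                   [(5, False), (42, False)]) == [(5, False), (7, False)]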


Page 25: slides

Deletion on FD-Tree

16

45

37

16

45

16 45

16

16

Delete three entries

Merge L0, L1, L2

Merge L0,L1

25

L1

L1

L2

L2

L0

L0

L1

L1

L2

L2

L0

L0

Page 26: slides

Outline

• Introduction
• Structure of FD-Tree
• Cost Analysis
• Experimental Results
• Conclusion


Page 27: slides

Cost Analysis of FD-Tree

• I/O cost of FD-Tree

– Search: O(log_k(N / |L0|)) random reads
– Insertion: O((k / (f - k)) · log_k(N / |L0|)) sequential I/Os (amortized)
– Deletion: Search + Insertion
– Update: Deletion + Insertion

where
  k: size ratio between adjacent levels
  f: # entries in a page
  N: # entries in the index
  |L0|: # entries in the head tree
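
Plugging in illustrative numbers (ours, chosen to satisfy the $f = k^2$ simplification of the next slide): with $f = 256$, $k = 16$, $N = 2^{30}$, and $|L_0| = 2^{16}$,

$$\log_k \frac{N}{|L_0|} = \log_{16} 2^{14} = 3.5,$$

so a search costs roughly 4 random reads, while an insertion costs about $\frac{k}{f-k} \cdot 3.5 = \frac{16}{240} \cdot 3.5 \approx 0.23$ amortized sequential page I/Os.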


Page 28: slides

I/O Cost Comparison

            Search               Insertion
            Rand. Read           Rand. Read    Seq. Read            Rand. Write   Seq. Write
FD-Tree     log_k N              -             (k/(f-k))·log_k N    -             (k/(f-k))·log_k N
B+-Tree     log_f N              log_f N       -                    1             -
LSM-Tree    log_k N · log_f N    -             (k/(f-k))·log_k N    -             (k/(f-k))·log_k N
BFTL        c · log_f N          c · log_f N   -                    1/c           -

You may assume f = k^2 for simplicity of comparison; then k/(f-k) = 1/(k-1).

Page 29: slides

Cost Model

• Tradeoff in the value of k:
  – Large k: high insertion cost
  – Small k: high search cost

• We develop a cost model to calculate the optimal value of k, given the characteristics of both the flash SSD and the workload (a sketch of the idea follows).
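
A sketch of how such a model can be used (our stand-in per-page costs and formulas, not the paper's calibrated model):

    from math import log

    R_RAND, W_SEQ = 0.08, 0.02     # hypothetical ms per page: random read, seq. write
    N, L0, F = 2**30, 2**16, 256   # index entries, head-tree entries, page fanout f

    def est_cost(k, p_search=0.5):            # p_search: fraction of searches
        levels = log(N / L0, k)
        search = R_RAND * levels              # one random page read per level
        insert = W_SEQ * (k / (F - k)) * levels   # amortized sequential merge traffic
        return p_search * search + (1 - p_search) * insert

    best_k = min(range(2, F), key=est_cost)   # pick the k minimizing estimated cost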


Page 30: slides

Cost Model

• Estimated cost while varying the value of k


Page 31: slides

Outline

• Introduction
• Structure of FD-Tree
• Cost Analysis
• Experimental Results
• Conclusion


Page 32: slides

Implementation Details

• Storage Layout
  – Fixed-length record page format
  – OS disk buffering disabled

• Buffer Manager
  – LRU replacement policy


[Figure: architecture — B+-tree, FD-tree, LSM-tree, and BFTL implemented on top of a common buffer manager and storage layout over flash SSDs]

Page 33: slides

Experimental Setup

• Platform
  – Intel quad-core CPU
  – 2GB memory
  – Windows XP

• Three flash SSDs:
  – Intel X-25M 80GB, Mtron 64GB, Samsung 32GB
  – SATA interface


Page 34: slides

Experimental Settings

• Index size: 128MB-8GB (8GB by default)
• Entry size: 8 bytes (4-byte key + 4-byte pointer)
• Buffer size: 16MB
• Warm-up period: 10,000 queries
• Workload: 50% search + 50% insertion (by default)


Page 35: slides

Validation of the Cost Model

• The estimated costs are very close to the measured ones.
• With the cost model, we can estimate a reasonably accurate value of k that minimizes the overall cost.

[Figures: estimated vs. measured costs on the Mtron SSD and the Intel SSD]

Page 36: slides

Overall Performance Comparison

• On the Mtron SSD, FD-tree is 24.2X, 5.8X, and 1.8X faster than B+-tree, BFTL, and LSM-tree, respectively.

• On the Intel SSD, FD-tree is 3X, 3X, and 1.5X faster than B+-tree, BFTL, and LSM-tree, respectively.


[Figures: overall performance on the Mtron SSD and the Intel SSD]

Page 37: slides

Search Performance Comparison

• FD-tree has search performance similar to B+-tree.
• FD-tree and B+-tree outperform the others on both SSDs.


[Figures: search performance on the Mtron SSD and the Intel SSD]

Page 38: slides

Insertion Performance Comparison

• FD-tree has insertion performance similar to LSM-tree.
• FD-tree and LSM-tree outperform the others on both SSDs.


[Figures: insertion performance on the Mtron SSD and the Intel SSD]

Page 39: slides

Performance Comparison

• W_Search: 80% search + 10% insertion + 5% deletion + 5% update

• W_Update: 20% search + 40% insertion + 20% deletion + 20% update


Page 40: slides

Outline

• Introduction
• Structure of FD-Tree
• Cost Analysis
• Experimental Results
• Conclusion


Page 41: slides

Conclusion

• We design a new index structure that transforms almost all random writes into sequential ones while preserving search efficiency.

• We show, both empirically and analytically, that FD-tree outperforms the other indexes on various flash SSDs.


Page 42: slides

Related Publications

• Yinan Li, Bingsheng He, Qiong Luo, Ke Yi. Tree Indexing on Flash Disks. ICDE 2009. Short Paper.

• Yinan Li, Bingsheng He, Qiong Luo, Ke Yi. Tree Indexing on Flash-Based Solid State Drives. Journal version in preparation.


Page 43: slides

Q&A

• Thank you!
• Q&A



Page 45: slides

Additional Slides


Page 46: slides

Block-Level FTL

• Mapping granularity: block
• Update cost: 1 erase + N writes + N reads (N = pages per block)

[Figure: block-level mapping table from logical block IDs to physical block IDs]


Page 47: slides

Page-Level FTL

• Mapping granularity: page
• Larger mapping table
• Update cost: 1/N erase (amortized) + 1 write + 1 read

[Figure: page-level mapping table from logical page IDs to physical page IDs]


Page 48: slides

Fragmentation

• Cost of Recycling ONE block: N^2 reads, N*(N-1) writes, N erases.

[Figure: when the flash disk is full, blocks must be recycled to reclaim space]


Page 49: slides

Deamortized FD-Tree

• Normal FD-tree
  – High average insertion performance
  – Poor worst-case insertion performance

• Deamortized FD-tree
  – Reduces the worst-case insertion cost
  – Preserves the average insertion cost


Page 50: slides

Deamortized FD-Tree

• Maintain two head trees, T0 and T0'
  – Insert into T0'
  – Search both T0 and T0'
  – Merge T0 into the lower levels concurrently

[Figure: searches consult both T0 and T0'; inserts go into T0' while T0 is being merged]
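
A sketch of the scheme (ours; the background merge is reduced to a counter so only the control flow is shown):

    class DeamortizedHead:
        def __init__(self):
            self.t0, self.t0_new = [], []   # T0: being merged out; T0': takes inserts
            self.l1 = []                    # destination of the ongoing merge
            self.merged = 0                 # merge progress, in entries

        def insert(self, key, steps=2):
            self.t0_new.append(key)         # inserts never wait for a whole merge
            # Advance the background merge a bounded number of steps so the
            # worst-case cost of one insert is O(steps), not one giant merge.
            self.merged = min(self.merged + steps, len(self.t0))
            if self.merged == len(self.t0):           # merge of T0 has finished:
                self.l1.extend(self.t0)               # (stand-in for the real merge)
                self.t0, self.t0_new = self.t0_new, []  # T0' becomes the new T0
                self.merged = 0

        def search(self, key):
            return key in self.t0 or key in self.t0_new or key in self.l1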


Page 51: slides

Deamortized FD-Tree

• The high merge cost is amortized across all entries inserted into the head tree.

• The overall cost (almost) does not increase.


Page 52: slides

FD-Tree vs. Deamortized FD-Tree

• Much improved worst-case performance
• Low overhead
